A Data Science Central Community
Big data does not change the relationship between data quality and decision outcomes. It underscores it.
The elegance of analytics systems and processes lies not in the gathering, storing and processing of data; today’s technology has made this relatively easy and common. Rather, it lies in integrating and managing data so that the highest-quality data reaches the point of decision in the timeliest fashion – regardless of whether the decision maker is an employee or a customer.
Most organizations realize that their success is increasingly tied to the quality of their information. Companies rely on data to make significant decisions that can affect many aspects of their business – from customer service to regulatory compliance and supply chain efficiency.
Improving decision outcomes is the key to strengthening employee engagement and customer satisfaction. It is also integral to generating a better return on data assets (RDA), a metric that will become increasingly important and as much a measure of a company’s success as its other financial ratios.
As more companies seek to gain deeper insights to improve operational performance and competitiveness, greater attention needs to be paid to data quality and the validation of models built on the data. As more evaluations drive prescriptive and predictive algorithms and decision-making – particularly in the IoT future – the input variables that models are based on must be of the highest quality possible.
For despite their deterministic nature, algorithms are only as good as the data their modelers work with. These models drive algorithms that can cause catastrophic outcomes if the underlying assumptions are flawed or biased.
Data Quality Trumps Volume, Velocity and Variety
Simply defined, algorithms follow a series of instructions to solve a problem based on the input variables in the underlying model. From high frequency trading, credit scores and insurance rates to web search, seed planting and self-driving cars, flawed algorithms and models can cause major displacements in markets and lives.
The excessive focus on volume, velocity and variety of data and the technologies emerging to store, process and analyze it are rendered ineffectual if the algorithms result in bad decision outcomes or abuses. For this reason we have suggested that the real 3Vs of big data should be validity, veracity and value.
Modeling in fields with controlled environments and reliable data inputs, such as drug discovery, ballplayer evaluation or predicting the next hit song, gives data analysts the luxury of time to validate their models. In an operating room, however, the time horizon may be seconds; in web search, less than a second; and on a trading floor, milliseconds.
Examples of bad algorithmic decision outcomes abound. One is the flash crash of May 6, 2010. Within a few minutes, the Dow Jones Industrial Average plunged nearly 1,000 points, only to recover less than 20 minutes later. While the cause was never fully explained, many market participants agree that quantitative algorithms were to blame. With algorithms responsible for approximately 50% of trading volume, future calamitous events are more than likely. After all, bad data at the speed of light is still bad data.
On a more individual level, algorithms based on personal data, such as zip codes, payment histories and health records, have the potential to discriminate in determining insurance rates, hospital treatments and credit scores. Add social and machine-generated streams to the mix, and the resulting assumptions in models can skew outcomes even further.
Focus on Model Validation
As big data becomes more pervasive, it becomes even more important to validate models and the integrity of data. A correlation between two variables does not necessarily mean that one causes the other. Coefficients of determination can easily be manipulated to fit the hypothesis behind the model, for example by adding input variables until the fit looks convincing, and doing so also distorts the analysis of the residuals. Models for spatial and temporal data appear only to complicate validation further.
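To make the point about coefficients of determination concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a recommended validation procedure: the data is synthetic, only the first input variable actually influences the outcome, and the helper names (solve, fit_ols, r_squared, make_data) are hypothetical. It fits an ordinary least-squares model with and without 15 irrelevant variables and compares R² on the data used for fitting against R² on fresh data.

```python
import random

def solve(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]

def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) beta = X'y."""
    X = [[1.0] + list(r) for r in rows]  # prepend an intercept column
    k = len(X[0])
    xtx = [[sum(x[i] * x[j] for x in X) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(k)]
    return solve(xtx, xty)

def r_squared(beta, rows, y):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    preds = [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - p) ** 2 for yi, p in zip(y, preds))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot

def make_data(n, n_noise, rng):
    """Synthetic data: y depends on the first column only (assumption)."""
    rows, y = [], []
    for _ in range(n):
        signal = rng.gauss(0, 1)
        noise_cols = [rng.gauss(0, 1) for _ in range(n_noise)]
        rows.append([signal] + noise_cols)
        y.append(2.0 * signal + rng.gauss(0, 1))
    return rows, y

rng = random.Random(0)
train_rows, train_y = make_data(30, 15, rng)
test_rows, test_y = make_data(30, 15, rng)

# Base model: the one relevant variable. Full model: plus 15 irrelevant ones.
base_train = [r[:1] for r in train_rows]
base_test = [r[:1] for r in test_rows]
beta_base = fit_ols(base_train, train_y)
beta_full = fit_ols(train_rows, train_y)

r2_base_train = r_squared(beta_base, base_train, train_y)
r2_full_train = r_squared(beta_full, train_rows, train_y)
r2_base_test = r_squared(beta_base, base_test, test_y)
r2_full_test = r_squared(beta_full, test_rows, test_y)

print(f"fit data   R^2: base={r2_base_train:.3f}  full={r2_full_train:.3f}")
print(f"fresh data R^2: base={r2_base_test:.3f}  full={r2_full_test:.3f}")
```

On the data used for fitting, the larger model can never score a lower R² than the smaller one nested inside it, so piling on variables always appears to help. On fresh data the extra variables typically hurt, which is why a high R² reported only on the fitting data says little about whether a model is valid.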
Data management tools have improved enough to significantly increase the reliability of data inputs. Until machines devise the models, a focus on the veracity of the data would improve model validation and reduce – not eliminate – inherent bias.
It would also yield more valuable data. The more valuable a company’s data, the higher its RDA. And higher data efficiency strengthens competitiveness, financial performance and valuation.