A Data Science Central Community
With all the hype around big data analytics, not enough attention is being given to data quality or the validation of models built on the data. Despite their deterministic nature, algorithms are only as good as the data their modelers work with.
Simply defined, algorithms follow a series of instructions to solve a problem based on the input variables in the underlying model. From high-frequency trading, credit scores and insurance rates to web search, recruiting and online dating, flawed algorithms and models can cause major displacements in markets and lives. The excessive focus on the volume, velocity and variety of data, and on the technologies emerging to store, process and analyze it, is rendered ineffectual if the algorithms result in bad decision outcomes or abuses.
One example is the flash crash of May 6, 2010. Within a few minutes, the Dow Jones Industrial Average plunged nearly 1,000 points, only to recover less than 20 minutes later. While the cause was never fully explained, many market participants agree that quantitative algorithms were to blame. With algorithms responsible for up to 75% of trading volume, future calamitous events are likely. Despite the efficiencies of automated trading, the absence of human intervention resulted in a cascade of events that triggered more trades and drove the market down further. Have we learned nothing from the portfolio insurance of the 1980s that ultimately caused the 1987 crash?
On a more individual level, algorithms based on personal data, such as zip codes, payment histories and health records, have the potential to be discriminatory in determining insurance rates and credit scores. Add social data to the mix and the resulting assumptions in models can skew outcomes even further.
Another example came with the revelations about the NSA’s collection and analysis of personal information. Governments have enacted legislation to allow data mining for indirect or non-obvious correlations in the name of national security. Similar algorithms are being used for profiling by municipal police departments. A modeling error may have devastating effects on everyday citizens. And the potential breach of personal privacy leaves a gaping hole in governance.
Modeling in fields with controlled environments and reliable data inputs, such as drug discovery or traffic prediction, gives scientists the luxury of time to validate their models. In web search, however, the time horizon may be two seconds and, on a trading floor, milliseconds.
Focus on model validation
As big data becomes more pervasive, it becomes even more important to validate models and the integrity of data. A correlation between two variables does not necessarily mean that one causes the other. Coefficients of determination can easily be manipulated to fit the hypothesis behind the model, and doing so also distorts the analysis of the residuals. Models for spatial and temporal data appear to complicate validation even further.
Data management tools have improved enough to significantly increase the reliability of data inputs. Until machines devise the models, a focus on the veracity of the data would improve model validation and reduce, though not eliminate, inherent bias. It would also yield more valuable data.
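To make the point about coefficients of determination concrete, here is a minimal sketch in Python, not a method taken from this article. It assumes a purely random target and 200 equally random candidate predictors (the sample sizes, seed and function names are all illustrative choices). It then reports the R² of the first candidate alongside the R² of whichever candidate happens to fit best. Because every series is noise, any apparent fit comes from selection alone.

```python
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def r_squared(xs, ys):
    """For a one-variable least-squares fit, R^2 equals r squared."""
    return pearson_r(xs, ys) ** 2


def best_noise_predictor(n_obs=20, n_candidates=200, seed=0):
    """Return (R^2 of the first candidate, R^2 of the best candidate).

    Target and candidates are independent Gaussian noise, so there is
    no real relationship to find. All parameters are illustrative.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_obs)]
    scores = []
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_obs)]
        scores.append(r_squared(candidate, target))
    return scores[0], max(scores)


first, best = best_noise_predictor()
print(f"R^2 of the first noise predictor: {first:.3f}")
print(f"R^2 of the best of 200 noise predictors: {best:.3f}")
```

Searching through candidates until one fits is a simple way an R² can be made to support a hypothesis. Validating on data that played no part in the selection is the usual guard against it.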
Ways to improve data quality
Bad data is not just an IT problem. Missing data, misfielded attributes and duplicate records are among the causes of flawed data models. These, in turn, undermine the organization’s ability to execute on strategy, maximize revenue and cost opportunities and adhere to governance, regulatory and compliance (GRC) mandates. Organizations need to enact rules, policies and processes to identify root causes and assure better data integrity.
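The three problems named above can each be detected with a simple rule. The following Python sketch is an illustration only: the sample records, field names, validation patterns and the `audit` helper are all hypothetical, and a real organization would substitute its own rules.

```python
import re

# Hypothetical sample records; field names are assumptions for illustration.
records = [
    {"id": 1, "name": "Ann Lee", "zip": "10001", "email": "ann@example.com"},
    {"id": 2, "name": "Bob Ray", "zip": "", "email": "bob@example.com"},        # missing zip
    {"id": 3, "name": "Cy Poe", "zip": "cy@example.com", "email": "94105"},     # fields swapped
    {"id": 4, "name": "ann lee ", "zip": "10001", "email": "ANN@example.com"},  # duplicate of id 1
]

# Deliberately simple patterns: US-style zip codes and a loose email shape.
ZIP_PATTERN = re.compile(r"\d{5}(-\d{4})?")
EMAIL_PATTERN = re.compile(r"[^@\s]+@[^@\s]+\.[^@\s]+")


def audit(rows):
    """Flag missing values, misfielded attributes and duplicate records."""
    issues = {"missing": [], "misfielded": [], "duplicate": []}
    seen = {}  # normalized (name, email) -> id of the first record seen
    for row in rows:
        # Missing data: required fields that are absent or blank.
        for field in ("name", "zip", "email"):
            if not str(row.get(field) or "").strip():
                issues["missing"].append((row["id"], field))

        # Misfielded attributes: a value that does not look like its field.
        zip_value = str(row.get("zip") or "").strip()
        email_value = str(row.get("email") or "").strip()
        if zip_value and not ZIP_PATTERN.fullmatch(zip_value):
            issues["misfielded"].append((row["id"], "zip"))
        if email_value and not EMAIL_PATTERN.fullmatch(email_value):
            issues["misfielded"].append((row["id"], "email"))

        # Duplicate records: same name and email after normalization.
        key = (str(row.get("name") or "").strip().lower(), email_value.lower())
        if key in seen:
            issues["duplicate"].append((row["id"], seen[key]))
        else:
            seen[key] = row["id"]
    return issues


report = audit(records)
for kind, found in report.items():
    print(kind, found)
```

Each flagged item carries the record id, so it can be traced back to its source system, which is where root-cause analysis would start.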
Below are some antidotes for common data quality problems: