A Data Science Central Community
Distributed architectures and increasingly fragmented data sets create the need for a dynamic data integration platform, one flexible and scalable enough to bridge existing enterprise infrastructure and newer apps developed for cloud and mobile. Data quality and master data management (MDM) are vital components of data integration because they ensure consistency and reliability for downstream analytics.
But in order to succeed, firms first need to establish and commit to an ongoing data governance program with clear objectives endorsed by senior management. Line-of-business users must be involved throughout the process, from training on required data entry procedures and formatting to model testing, validation and analysis. A data governance committee composed of leaders from different functional areas, including IT, helps ingrain data quality best practices into company operations, processes and culture, so that the root causes of bad data are identified and data integrity improves. In this post, we’ll focus on data quality. Part II will look at MDM and the link between the two.
Most companies are still struggling to integrate their existing data across silos so that they can gain more meaningful insights and better leverage their BI and analytics applications. An institutionalized data governance program should drive data integration strategy. The first step to successful data integration is to audit existing systems to discover what data they contain. Then, after mapping the data to its location, applications and owners, the company can begin to assess the quality of its data.
Structured data from traditional databases and document handling systems is commonly stuck in file systems and archives, where it is rarely, if ever, accessed more than 90 days after its creation. Add to this the tsunami of unstructured and semi-structured data, such as PDFs, tweets, videos, Web logs and sensor readings, which potentially holds a new trove of valuable information. This creates the need for a data integration platform that can accommodate all of a company’s data, whether it comes from existing OLTP and analytics databases and storage infrastructures or from NoSQL databases, Hadoop clusters or cloud-based services.
Moreover, as big data becomes more pervasive, it becomes even more important to validate models and the integrity of data. A correlation between two variables does not necessarily mean that one causes the other. The coefficient of determination (R squared) can easily be manipulated to fit the hypothesis behind the model, which in turn distorts the analysis of the residuals. Models for spatial and temporal data complicate validation even further.
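To make the point about manipulated coefficients of determination concrete, here is a minimal sketch of one way it can happen: quietly dropping the observations that fit worst. The data, the noise values and the helper names (fit_line, r_squared, drop_worst) are all made up for illustration and are not taken from any real model.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b

def r_squared(xs, ys):
    """Coefficient of determination of the fitted line: 1 - SS_res / SS_tot."""
    a, b = fit_line(xs, ys)
    y_bar = mean(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

def drop_worst(xs, ys, k):
    """Drop the k points with the largest absolute residuals (the 'manipulation')."""
    a, b = fit_line(xs, ys)
    ranked = sorted(range(len(xs)), key=lambda i: abs(ys[i] - (a + b * xs[i])))
    keep = sorted(ranked[: len(xs) - k])
    return [xs[i] for i in keep], [ys[i] for i in keep]

# Hypothetical data: a linear trend plus hand-picked noise with two large misses.
xs = list(range(10))
noise = [0.5, -0.4, 0.3, -6.0, 0.2, -0.1, 5.0, 0.4, -0.3, 0.1]
ys = [2 * x + e for x, e in zip(xs, noise)]

r2_full = r_squared(xs, ys)
xs_trim, ys_trim = drop_worst(xs, ys, k=2)
r2_trimmed = r_squared(xs_trim, ys_trim)

print(f"R squared on all points:         {r2_full:.3f}")
print(f"R squared after dropping 2 rows: {r2_trimmed:.3f}")
```

The trimmed fit looks far better, but only because the residuals that would have flagged a problem were removed before anyone could analyze them.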
Data Quality is Job #1
Bad data is the bane of any organization, and it is not just an IT problem. Missing data, misfielded attributes and duplicate records are among the causes of flawed data models. These, in turn, undermine the organization’s ability to execute on strategy, maximize revenue and cost opportunities and adhere to governance, risk and compliance (GRC) mandates. Bad data also undermines MDM efforts when enterprises discover during profiling how much of it is in their files.
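As a rough illustration of what profiling for these three problems can look like, the sketch below flags missing values, misfielded attributes and duplicate records in a handful of invented customer rows. The field names, the simplistic email pattern and the duplicate key (normalized name plus phone) are assumptions chosen for the example; real profiling rules would depend on the data at hand.

```python
import re
from collections import Counter

# Hypothetical customer records. Record 2 is missing an email, record 3 has
# email and phone swapped (misfielded), and record 4 duplicates record 1.
records = [
    {"id": 1, "name": "Ada Lopez", "email": "ada@example.com", "phone": "555-0100"},
    {"id": 2, "name": "Ben Chen", "email": "", "phone": "555-0101"},
    {"id": 3, "name": "Cara Diaz", "email": "555-0102", "phone": "cara@example.com"},
    {"id": 4, "name": "ada lopez ", "email": "ADA@example.com", "phone": "555-0100"},
]

# Deliberately simple pattern: something@something.something, no whitespace.
EMAIL = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

def dedup_key(row):
    """Assumed matching rule: same normalized name and same phone number."""
    return (row["name"].strip().lower(), row["phone"])

def profile(rows):
    """Return the ids of rows with missing, misfielded or duplicated data."""
    missing = [r["id"] for r in rows
               if any(not str(v).strip() for v in r.values())]
    misfielded = [r["id"] for r in rows
                  if r["email"].strip() and not EMAIL.match(r["email"].strip())]
    counts = Counter(dedup_key(r) for r in rows)
    duplicates = [r["id"] for r in rows if counts[dedup_key(r)] > 1]
    return {"missing": missing, "misfielded": misfielded, "duplicates": duplicates}

report = profile(records)
print(report)
```

Even on four rows, three of the four records are touched by at least one issue, which is the kind of surprise profiling tends to surface.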
Data quality is the key to building better models and algorithms. Despite their deterministic nature, algorithms are only as good as the data their modelers work with. From high frequency trading, credit scores and insurance rates to web search, recruiting and online dating, flawed algorithms and models can cause major displacements in markets and lives.
Integrating different types of data from a growing variety of sources sharply raises the probability of skunk data polluting modeling assumptions. In turn, this will skew the R squared during model validation, resulting in hypotheses that may or may not be correct and wonky predictions of possible future outcomes. In financial markets, flash crashes tied to algorithmic trading, incorrect post-trade reconciliations and funds transfers to wrong parties are all manifestations of bad data.
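A back-of-the-envelope calculation shows why adding sources raises the odds of contamination so quickly. The sketch assumes, purely for illustration, that each source independently has the same chance of containing bad data; the 5% figure and the function name p_any_bad are hypothetical.

```python
def p_any_bad(p_bad_per_source: float, n_sources: int) -> float:
    """Probability that at least one of n independent sources contains bad data."""
    return 1 - (1 - p_bad_per_source) ** n_sources

# Assumed 5% chance of bad data per source, for a growing number of sources.
for n in (1, 5, 10, 20):
    print(f"{n:>2} sources: {p_any_bad(0.05, n):.1%} chance of at least one bad source")
```

Under these assumptions the chance that every source is clean shrinks geometrically with each source added, so the chance that something bad reaches the model climbs toward certainty.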
Preparing data for analysis is an arduous task. This is true for traditional data warehouses, where structure is so important that if your model is not valid, the analysis cannot be done. It is also true, however, if you’re working with Hadoop, where data is structured after ingestion. Either scenario involves data movement. The principal difference is whether the data is extracted, transformed and then loaded (ETL) into a data warehouse for analysis or whether it is extracted, loaded and then transformed (ELT) in a Hadoop cluster where it is analyzed. And nowadays, the data can be analyzed in SQL, NoSQL or a converged environment.
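The difference in ordering between ETL and ELT can be sketched in a few lines. This is a toy model, not a real pipeline: plain Python lists stand in for the warehouse and the Hadoop cluster, and the rows, the function names and the cleaning rule are invented for the example.

```python
def extract():
    """Hypothetical raw rows as they might arrive from a source system."""
    return [{"amount": " 12.50 "}, {"amount": "n/a"}, {"amount": "7"}]

def transform(rows):
    """Keep only rows whose amount parses as a number and convert it to float."""
    clean = []
    for row in rows:
        try:
            clean.append({"amount": float(row["amount"])})
        except ValueError:
            continue  # unparseable amount: excluded from analysis
    return clean

def etl(warehouse):
    """Extract, transform, then load: only structured rows reach the store."""
    warehouse.extend(transform(extract()))

def elt(lake):
    """Extract, load raw, then transform: structure is applied after ingestion."""
    lake.extend(extract())
    return transform(lake)

warehouse = []
etl(warehouse)

lake = []
analysis_view = elt(lake)

print("warehouse rows:", warehouse)
print("raw rows kept in lake:", len(lake))
print("rows after late transform:", analysis_view)
```

Both paths end with the same clean rows for analysis. The difference is that the ELT path still holds the raw, unvalidated record, so data quality checks have to run at analysis time rather than at load time.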
In order to become more agile and nimble in decision-making so that outcomes improve consistently enough to sustain competitive advantage, firms must take a more holistic approach to data integration. This requires a shift in emphasis from merely integrating data manually to adopting a highly automated data integration platform that incorporates data quality and master data management.
The excessive focus on the volume, velocity and variety of data, and on the technologies emerging to store, process and analyze it, is rendered ineffectual if the algorithms result in bad decision outcomes or abuses. For these sources and types of data to be useful in data modeling and analytics, firms must go beyond mere integration and ensure that the data is accurate, secure and auditable. A holistic approach that treats all data as big data will facilitate integration efforts. Moreover, while most of the big data hype revolves around the volume, velocity and variety of newer data formats, we have proposed that the real 3Vs of all data should be validity, veracity and value.
Whether the aim is to generate alpha in high-frequency trading operations, improve customer service and retention rates, or develop products and services to meet faster-changing consumer preferences and behavior, such an approach is suitable throughout the organization, from the most latency-sensitive applications to the most mundane back-office processes. After all, bad data at the speed of light is still bad data.