A Data Science Central Community
Big data is most often defined by volume, velocity and variety. But it is veracity, validity and value that are big data's real 3Vs, for it is the information we glean from data that yields better decisions.
The big risk in the big data hype is that we lose focus on the value – and quality – of the data we store and analyze. We will expend enormous cost and energy for a low return on all of that data, because the models and algorithms the data drives will deliver no more value than a spreadsheet built on more reliable data. The focus on speed and capacity is misplaced without an equal focus on data quality. Bad data at the speed of light is still bad data.
Current tools have a long way to go. Storage costs, though still declining, will be a key factor in deciding whether the data we store is worth keeping (excluding data mandated by compliance). Development should focus on analytics tools that can be applied to data in real time and in raw form. And the analytics should be taken to the data, rather than repeatedly moving the data to be analyzed.
Refining a commodity into value
My take on the distinction between data and information comes down to the differences between us. Give any 100 people the same data set and they will draw different inferences and conclusions, which will drive different decision trees and uses. That is the beauty of the human brain.
Data itself is raw and, I would argue, a commodity no different from the devices (hardware) we use to process it. Crude oil, metal ores, cane sugar and sea water are other raw materials that require processing and refining to become useful products.
The value is created downstream. In the case of data, it is the information individuals glean after processing. This is why I am a proponent of associative search and visualization technology. Give the line-of-business (LOB) user the entire data set rather than a narrow one defined by a data analyst or IT person who lacks insight into the business problems or opportunities. Let these 100 LOB users manipulate and analyze the data set and visualize the what-if scenarios. We will get a number of overlapping decisions and a number of unique ones.
This is the digital version of the old suggestion box. Every employee can make their own observations, conclusions and suggestions/recommendations to improve a process or procedure that meets the three criteria of return on investment (ROI): 1) reduced costs, 2) enhanced productivity or 3) incremental revenue generation. Some, if not most, of the suggestions/recommendations will not be useful. But the one or two that are can yield great returns.
Data quality: do it first; do it right
Some data scientists advocate predictive analytics as a panacea. While I believe in the potential of different predictive approaches, data is still always subject to interpretation.
This is acutely true in health care, where the quality of the data may be factually correct and still lead to differing diagnoses. Isn't this why we are advised to seek second opinions?
A close family member was recently diagnosed with late-stage lymphoma based on what turned out to be faulty test results from a lab. After 8 hours at Sloan-Kettering, another blood test revealed she was fine, and a bone marrow biopsy scheduled for that evening was cancelled.
The diagnosis was given by an endocrinologist. If traditional predictive analytics were applied to the patient's first set of test results, the diagnosis would have been the same. A biopsy would have been performed - and billed - unnecessarily. Had the second blood test not been performed, what should the patient have done given the urgency of the diagnosis?
In another example, a hedge fund trader lost tens of millions of dollars on a trade because the data feed was inaccurate. The data was never tested for quality. Neither was the model, which tried to predict the relationship between variable data points. The algorithm that executed the trade was therefore flawed. Could this near-catastrophic loss have been averted?
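The details of that feed are not public, but the kind of safeguard that was missing can be sketched. Below is a minimal, hypothetical set of sanity checks a feed handler might run on each price tick before it reaches a model; the field names and the 10% jump threshold are assumptions for illustration, not a real trading system's rules.

```python
from dataclasses import dataclass


@dataclass
class Tick:
    """One hypothetical market-data update."""
    symbol: str
    price: float
    volume: int


def validate_tick(tick, last_price=None, max_jump=0.10):
    """Return a list of quality problems found in a tick.

    Checks are illustrative: reject impossible values outright, and
    flag price moves larger than max_jump (10% by default) as suspect
    rather than feeding them straight into a model.
    """
    errors = []
    if tick.price <= 0:
        errors.append("non-positive price")
    if tick.volume < 0:
        errors.append("negative volume")
    if last_price is not None and last_price > 0:
        if abs(tick.price - last_price) / last_price > max_jump:
            errors.append("implausible price jump")
    return errors


# A clean tick passes; a 31% jump from the last price is flagged.
print(validate_tick(Tick("XYZ", 101.2, 500), last_price=100.0))  # []
print(validate_tick(Tick("XYZ", 131.2, 500), last_price=100.0))
```

The point is not the specific thresholds but that the checks run between the feed and the model, so a bad tick is quarantined before an algorithm can act on it.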
This is why I am a stickler for data quality and rigorous model validation, all the more so in life-or-death situations. Data may contain errors from a variety of causes, and the sources of bad data cannot always be traced. Perfect data sets are rare. But we have seen countless bad decisions made on 100% accurate data as well. Otherwise, every decision made to date would be perfect and we would be in Eden.
That data is increasingly high-dimensional – and growing rapidly – is precisely why quality is all the more important. This is true in environments where decisions are made in milliseconds. But in most other situations, where time is less acute, we should take advantage of it to make the best decisions we can – understanding that we will not achieve 100% optimal outcomes 100% of the time.
To extract value from raw data, the data must be refined. By refined, I mean purified of contaminants, anomalies, inaccuracies, etc. Without that, the data we use to make our decisions will be flawed. If the data quality is poor, then the context in which it is used is just as poor, and in turn the outcomes will be poor.
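A refining pass of this kind can be sketched in a few lines. The example below is a deliberately simple, hypothetical cleaner for numeric readings: it drops missing and physically impossible values first, then drops statistical outliers. The valid range and the 3-sigma cutoff are assumptions chosen for the example, not universal rules.

```python
import statistics


def refine(readings, valid_range=(0.0, 200.0), z_cutoff=3.0):
    """Illustrative 'refining' pass over raw numeric readings.

    Step 1: remove contaminants - missing values and values outside
            the plausible range (both bounds are example assumptions).
    Step 2: remove anomalies - values more than z_cutoff sample
            standard deviations from the mean of what remains.
    """
    lo, hi = valid_range
    plausible = [x for x in readings if x is not None and lo <= x <= hi]
    if len(plausible) < 2:
        return plausible
    mean = statistics.mean(plausible)
    stdev = statistics.stdev(plausible)
    if stdev == 0:
        return plausible
    return [x for x in plausible if abs(x - mean) / stdev <= z_cutoff]


# Raw body-temperature-like readings with a gap, an impossible value,
# and a sensor glitch; refining keeps only the credible measurements.
raw = [98.6, 99.1, None, -40.0, 98.9, 500.0, 99.3]
print(refine(raw))  # [98.6, 99.1, 98.9, 99.3]
```

Real pipelines use richer rules (schema checks, deduplication, cross-source reconciliation), but the principle is the same: refine first, analyze second.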
I believe that time-to-market pressures have marginalized quality. Thus, no matter what predictive technique is used, if the inputs are not cleansed as thoroughly as possible within reasonable time constraints, the assumptions behind the models will be flawed. And if the models themselves are not validated adequately, the decisions they drive will be faulty. The result is outcomes that are less than desirable at best and irrelevant at worst.
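The simplest form of the model validation argued for above is holdout testing: fit on one part of the data, measure error on a part the model never saw. This is a minimal sketch, not any particular firm's practice; the toy mean-predictor and the 30% test split are assumptions for illustration.

```python
import random


def holdout_validate(points, fit, predict, test_fraction=0.3, seed=0):
    """Minimal holdout validation.

    Shuffle (x, y) pairs, fit on the first (1 - test_fraction) share,
    and return mean squared error on the held-out remainder - the
    portion the model never saw during fitting.
    """
    data = points[:]
    random.Random(seed).shuffle(data)
    cut = int(len(data) * (1 - test_fraction))
    train, test = data[:cut], data[cut:]
    model = fit(train)
    errs = [(predict(model, x) - y) ** 2 for x, y in test]
    return sum(errs) / len(errs)


# Toy model used only to exercise the harness: predict y as the
# mean of the training targets, regardless of x.
def fit_mean(pts):
    return sum(y for _, y in pts) / len(pts)


def predict_mean(model, x):
    return model


data = [(x, 2.0 * x) for x in range(20)]
print(holdout_validate(data, fit_mean, predict_mean))
```

A model that cannot survive even this check has no business driving decisions; a model that can still deserves the richer validation (cross-validation, out-of-time testing) that production use demands.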
Better data, better information. Better information, better decisions. Better decisions, better outcomes. Better outcomes, better ROI and value.