A Data Science Central Community
Big Data is an accumulation of data that is too large and complex for processing by traditional database management tools.
Yeah But, What Really Makes Big Data Big Data? This question is as fundamental to data science as the chicken/egg question should be to researchers at KFC. But we’re not dealing with an A/B chicken model here. It’s more like the elephant in the dark room or, scaling it up, the nearest star to our galactic corner. The number of approaches to this question and the vague boundaries between ‘big data’ and run-of-the-mill complex data make it all the more interesting to keep asking. The notion of ‘Big Data’ is in itself marketing brilliance. For instance, is it just size that makes big data big, or is it specific features that make it harder to wrangle, as is the case with unstructured data? Does the cost of a particular dataset make it big data, or is it the specialized software needed for interpretation? Can big data be measured in physical terms, i.e. “I’ll take eleven pounds of that big data please,” or is it all situated on some cloud or somewhere over the rainbow? Does it come down to three ‘V’s’ or is there a two-page set of proofs?
These are all fair questions that roll up to ‘the big one,’ and they result in very ‘application centric’ answers. Forrester defines big data as “techniques and technologies that make handling data at extreme scale affordable.” Wikipedia’s description says the data are “so large or complex that traditional data processing applications are inadequate. Challenges include analysis, capture, data curation, search, sharing, storage, transfer, visualization, querying, updating and information privacy.” So who’s right?
Let’s say, for instance, that you have a relatively small and simple data set made up of 19 rows and 19 columns. Surely that couldn’t be classed in the realm of big data, could it? And what level of software is needed to cross into the ‘BD Zone’? Do Excel and MySQL qualify, or must it be something with even greater analytical capacity? Is there a specific cutoff, like a million rows or a hundred million, that makes data big, average, or small? Does big data need to be quantified in storage terms like terabytes, petabytes, exabytes, zettabytes, or the notorious yottabytes? And what of the yottabyte? With estimated internet traffic reaching 1.6 zettabytes just this year, shouldn’t three extra zeros qualify the yottabyte as ‘Major Data’ or ‘Ginormous Data’?
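To put those ‘three extra zeros’ in concrete terms, here is a minimal Python sketch that converts between the storage units named above. It assumes the decimal (SI) definitions, where each step up is a factor of 1,000; the table name `UNIT_EXPONENTS` and the helper `convert` are made up for this illustration and come from no particular library.

```python
from fractions import Fraction

# Decimal (SI) storage units as powers of ten in bytes.
# Assumption: SI prefixes, not the binary (1024-based) variants.
UNIT_EXPONENTS = {
    "terabyte": 12,
    "petabyte": 15,
    "exabyte": 18,
    "zettabyte": 21,
    "yottabyte": 24,
}


def convert(value, from_unit, to_unit):
    """Convert a quantity between two units exactly, returning a Fraction."""
    shift = UNIT_EXPONENTS[from_unit] - UNIT_EXPONENTS[to_unit]
    # Going through str() keeps decimal inputs such as 1.6 exact.
    return Fraction(str(value)) * Fraction(10) ** shift


# The article's traffic estimate, restated in yottabytes.
traffic_in_yottabytes = convert(1.6, "zettabyte", "yottabyte")
print(float(traffic_in_yottabytes))  # 0.0016
```

By this yardstick, a year of internet traffic at the quoted estimate fills well under one percent of a single yottabyte.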
Humanity is going to be okay! The big bad robots are not going to come and get you...
-John Thuma on Datasciencecentral.com
Should We Fear Big Data? All this chatter about big data can conjure up apocalyptic visions of roving terminators bent on total biological destruction, evil dictators vying for a machine to take over the world in a weekend, or alien lifeforms signing intergalactic leases on networked human brains. In fact, a physician friend and I were recently discussing this in terms of the movie The Matrix. We were comparing current tech trends with the movie’s themes when he got all serious and really quiet before saying such possibilities were too real to think about. Our conversation ended, and his reaction gave me pause, coming as it did from a department chair at a major US hospital. Could there be a subtext where smart, educated people believe they are witnessing the intersection of truth and science fiction before their very eyes? Is it even possible we’re in a data war?
There is real fear around the growth of big data because it represents new and unknown possibilities happening at light speed. Massive computing power and far-reaching information can amplify ill will, as in the case of hackers, thieves, and other ill-intentioned souls. Fortunately, these cases most often play out on a small scale. Unfortunately, the same motivations can drive well-planned operations to wreak international havoc, like the theft of over twenty million records from the US Office of Personnel Management, the loss of top secret warhead data at the Los Alamos nuclear research facility, and an alleged hack of the International Space Station, to name just a few. So, could the ultimate data rush cause the US to drop the nuclear football--literally?
Put plainly, is big data something to fear or is it our new best friend? And could it be both, depending on how it’s ‘raised,’ like a baby animal brought in from the wild? After all, Burmese pythons are often thought of as easygoing, low-maintenance pets. They don’t bark or eat very often, and they generally keep to themselves, with the rare exception of trying to constrict the life out of their human companions and then swallow them whole.
Thought leaders in technology like Bill Gates, Elon Musk, Stephen Hawking and Tom Davenport put the issue forth in a business context, saying that artificial intelligence is morphing into ‘super intelligence’ and may replace a wide range of knowledge workers in short order. In fact, Tom Davenport has stated that he believes up to half of the US workforce could be mentally outgunned and replaced by computers very soon. And there is evidence that machine learning algorithms increasingly allow computers to make decisions once made by humans. For instance, IBM’s Watson is currently learning the decision-making paradigms of at least 50 companies, including a large-scale insurance decisioning experiment at Geico championed by Warren Buffett. How soon before 50 companies becomes 5,000, with computing power moving at the speed of light?
While all this talk about yottabytes, light speed, and alien brain harvesting may sound like a bad episode of Mystery Science Theater 3000, there is a clear trend arising worldwide where an increase in connectivity has created opportunities for bad behavior that were not present in the absence of the technology. The classic trial lawyer’s defense, “my client may have had motive, but no opportunity,” may be a thing of the past because today opportunity is all too present.
When the unaffordable becomes affordable, the impossible becomes possible.
So How Do We Work The Big Data? There are endless possibilities for positive big data applications, as there are with all things under the influence of human nature. Big data has been around since the big bang, including the entire period of human evolution. The data revolution of today centers on automatically identifying and capturing data, transforming and storing it, and then analyzing it for predictive and prescriptive validity. The gist is that it’s an aid that collectively makes us smarter. If we can reliably predict that seatbelts save lives, we make it a law (prescription) to wear the seatbelt. Knowing that 8.5 million out of 10 million people thought a high-budget action movie stunk informs us about whether or not we might like to go see it.
A 2013 research paper by Forrester concludes that companies active in big data analytics are mainly focused on structured data housed internally for mining and analysis. And for efficiency in storage and retrieval, a ‘Hub and Spoke’ model has emerged where the hottest information with the highest need for stability stays in the core data warehouse, or hub, while data with less relevance or requiring more extensive analysis is decentralized to the spokes. Forrester describes this construct as a merger of the data warehouse and Hadoop. So, as one small person in the lonely big data world, is it OK to simply be a data consumer and a target of information gathering services? And when does it make sense to grab big data by the horns and develop expertise in the field? These questions can and probably will become big data projects of their own, given enough interest, skill, money, and data. Suffice it to say that learning data analytics is a heavy commitment requiring a special combination of programming skills, math competency, and domain knowledge. With big data only getting bigger, today’s STEM programs are more important than ever.
We started this article by asking, “What makes big data big data?” Yet the answer is in a zip code far, far away from simplicity. My reply is this: Big data can be reasonably described in every way we’ve discussed here. In its broadest sense, big data defies concrete definition because it encompasses everything. The paradox is much like the one described by Bertrand Russell in search of a universal set, as in the universe itself. For instance, the 19x19 matrix discussed earlier was a reference to the Chinese game of ‘Go,’ which is typically played with a small wooden board and a bowl of stones. Go has been around for thousands of years and is every bit as relevant to big data as a 2016 Cray supercomputer. This is particularly true in light of Google’s recent victory over the world’s top Go players using its artificial intelligence deployment known as AlphaGo. The rows and columns on a Go board are fewer than those seen on a single screenshot of a typical spreadsheet. But Go has roughly 2 x 10^170 legal positions.
According to universetoday.com, Go has far more combinations than the total of all atoms (10^80) in the known universe. Not bad, coming from a culture also thought to have given us an early version of the computer called the abacus. As for measuring big data physically, it turns out that’s not so hard to do. Most Go kits have wooden boards with slate pieces and weigh in at about 11 pounds. This means they’re easy to tote around while engaging the good old grey matter in deep cosmic reflection over strategies to defeat AlphaGo and win one small battle for mankind in the Data Wars.
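For readers who want to check the Go arithmetic themselves, here is a small Python sketch. It uses only the two rounded figures quoted in this article, plus one assumption that does not come from the article: every one of the 361 points on the board is empty, black, or white, so 3^361 is a ceiling on all board configurations, legal or not. The constant and function names are invented for illustration.

```python
# Rounded figures quoted in the article; both are estimates.
LEGAL_GO_POSITIONS = 2 * 10**170  # legal positions on a 19x19 board
ATOMS_IN_UNIVERSE = 10**80        # atoms in the known universe

# Assumption (not from the article): each point is empty, black, or white,
# so 3**361 is an upper bound on configurations, legal or not.
BOARD_POINTS = 19 * 19
MAX_CONFIGURATIONS = 3 ** BOARD_POINTS


def order_of_magnitude(n):
    """Return the power of ten of a positive integer (digit count minus one)."""
    return len(str(n)) - 1


# How many legal Go positions there are for every atom in the universe.
positions_per_atom = LEGAL_GO_POSITIONS // ATOMS_IN_UNIVERSE

print(order_of_magnitude(positions_per_atom))  # 90
print(order_of_magnitude(MAX_CONFIGURATIONS))  # 172
```

Python integers have arbitrary precision, so these values are computed exactly, with no floating point overflow. On these figures, there are about 2 x 10^90 legal positions for every atom.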
Thuma, John. Machine Learning is Not the Boogie Man! Gates and Musk Are Wrong.
Davenport, Tom. United States. Analytics Frontiers Conference Charlotte, NC. 3/30/2016
Hopkins, Brian. United States. Forrester Research: The Patterns of Big Data. 6/11/2013