A Data Science Central Community
With big data come a few challenges: we've already mentioned the curse of big data. But what can we do when data flows faster than it can be processed?
Typically, this falls into two categories of problems:
No matter what processing or algorithm is used, astronomical amounts of data keep piling up so fast that you need to delete some of it (a bigger proportion every day) before you can even look at it, let alone analyze it with even the most rudimentary tools. An example is astronomical data used to detect new planets, new asteroids, etc. It arrives faster, and in larger amounts, than it can be processed in the cloud using massive parallelization. Maybe good sampling is a solution: carefully select which data to analyze and which data to ignore, before even looking at the data. Or develop better compression algorithms so that one day, when we have more computing power, we can analyze all the data previously collected but never analyzed. Maybe in 2030 we could then look at the hourly evolution, over a period of 20 years, of a faraway supernova that took place in 2010 but went undetected because the data was parked on a sleeping server.
Data is coming in very fast and in very large amounts, but all of it can still be processed with modern, fast, distributed, MapReduce-powered algorithms or other techniques. The problem is that the data is so vast, the velocity so high, and the data sometimes so unstructured, that it can only be processed by crude algorithms, resulting in bad side effects. Or at least, that's what your CEO thinks.
I will focus on this second category, particularly the last item: it is the most relevant to businesses. The types of situations that come to my mind include
Using crude algorithms results in:
So what is the solution?
I believe that in many cases, there is too much reliance on crowd-sourcing and a reluctance to use sampling techniques, because very few experts know how to get great samples out of big data. In some cases, you still need to come up with very granular predictions anyway (not summaries), for instance house prices for every single house (Zillow) or weather forecasts for each ZIP code. Yet even in these cases, good sampling would help.
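To make the sampling idea concrete, here is a minimal sketch of reservoir sampling (Algorithm R), a standard way to keep a fixed-size uniform random sample from a stream too large to store. It is only an illustration, not a method described in this article: the function name `reservoir_sample` and its parameters are hypothetical, and getting a truly "great" sample out of big data may well require stratified or otherwise smarter selection than uniform sampling.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Keep a uniform random sample of k items from a stream of unknown length.

    Hypothetical helper for illustration: memory use is O(k), and each item
    is inspected exactly once, so the stream never has to be stored.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i (0-based) replaces a stored item with probability k / (i + 1).
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Usage: sample 1,000 records out of a stream of one million.
sample = reservoir_sample(range(1_000_000), k=1000, seed=42)
short = reservoir_sample(range(5), k=10, seed=1)  # stream shorter than k
```

When the stream holds fewer than `k` items, the function simply returns all of them. The analysis then runs on the sample rather than on the full stream.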
Many reviews, "likes", tweets or spam flags are made by users (sometimes botnet operators, sometimes business competitors) with bad intent, gaming the system on a large scale. Greed is also part of the problem: if fake "likes" generate revenue for Facebook and advertisers don't notice, let's feed these advertisers (at least the small guys) with more fake "likes", because (we think) we don't have enough relevant traffic to serve all advertisers, and we want good traffic to go to the big guys. When the small guys notice, either discontinue the practice (come up with a new idea) or wait till you get hit by a class action lawsuit: $90 million is peanuts for Facebook, and that's what Google and others settled for when they were hit by a class action lawsuit for delivering fake traffic.
Yet there is a solution that benefits everyone (users, companies such as Google, Amazon, Netflix, Facebook or Twitter, and clients): better use of data science. I'm not talking about developing sophisticated, expensive statistical technology, but simply about switching to better metrics, different weights (e.g. putting less emphasis on data resulting from crowd-sourcing), better linkage analysis, association rules to detect collusion, botnets and low-frequency yet large-scale fraudsters, and better, frequently updated look-up tables (whitelists of IP addresses). All this without slowing down existing algorithms.
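As a rough illustration of "different weights" combined with a look-up table, the sketch below scores a set of "likes" by weight instead of by raw count. Everything in it is an assumption made for the example: the `Like` fields, the weight values (1.0, 0.5, 0.0), the 30-day account-age threshold, and the names `trusted_ips` and `flagged_users` are hypothetical, not taken from any real platform. The look-ups are set membership tests, so the extra cost per event is constant.

```python
from dataclasses import dataclass
from typing import Iterable, Set


@dataclass(frozen=True)
class Like:
    user_id: str
    ip: str
    account_age_days: int


def like_weight(like: Like, trusted_ips: Set[str], flagged_users: Set[str]) -> float:
    """Return a hypothetical trust weight for one "like" (all values are assumptions)."""
    if like.user_id in flagged_users:
        return 0.0  # known bad actor: the like does not count at all
    # Likes from whitelisted IP addresses count fully, others count half.
    weight = 1.0 if like.ip in trusted_ips else 0.5
    if like.account_age_days < 30:
        weight *= 0.5  # very new accounts are discounted further
    return weight


def weighted_like_score(likes: Iterable[Like], trusted_ips: Set[str],
                        flagged_users: Set[str]) -> float:
    """Sum the weights instead of counting every like as 1."""
    return sum(like_weight(like, trusted_ips, flagged_users) for like in likes)


# Usage with made-up data (IP addresses come from documentation-only ranges).
likes = [
    Like("alice", "203.0.113.5", 400),  # whitelisted IP, old account -> 1.0
    Like("bob", "198.51.100.7", 10),    # unknown IP, new account     -> 0.25
    Like("bot42", "198.51.100.9", 2),   # flagged user                -> 0.0
]
score = weighted_like_score(likes, trusted_ips={"203.0.113.5"}, flagged_users={"bot42"})
print(score)  # 1.25, versus a raw count of 3
```

In practice the weights would be calibrated on data and the look-up tables refreshed frequently; the point is only that a weighted metric can replace the raw count without changing the structure of the pipeline.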
Here's one example for social network data: instead of counting the number of "likes" (not all "likes" are created equal), do:
The opposite problem also exists
It occurs when you can analyze data (usually in real time with automated algorithms) and extract insights faster than they can be delivered to and digested by the end users (executives and decision makers). It's bad when decision makers get flooded with tons of unprioritized reports.
Sometimes, this type of situation arises with machine-to-machine communication, e.g. eBay automatically pricing millions of bid keywords every day and feeding these prices automatically to Google AdWords via the Google API.
Similarly, I produced cluster simulations faster than a real-time streaming device can deliver them to a viewer: I called it FRT, for "faster than real time".