A Data Science Central Community
Denying that big data is a new paradigm (post-2000) is like arguing that nothing has changed because the human population has been huge for a long time: if we could handle 10 million human beings a few thousand years ago, we can handle 10 billion today the same way, or even one trillion. It amounts to saying that data flowing at 10 million rows per day can be processed and analyzed the same way as data flowing at 10 billion or one trillion rows per day. Billions of rows per day is common in transaction data (credit cards), mobile, web traffic, sensor data, retail data, health data, NSA, NASA, stock trading and many more.
Each time a credit card is swiped or processed online, an analytic algorithm is used to detect whether the transaction is fraudulent (and the answer must come in less than 3 seconds most of the time, with a low false negative rate). Each time you do a Google search, an analytic engine determines which search results to show you, and which ads to display. Each time someone posts something on Facebook, an analytic algorithm is run to determine whether the post must be rejected (promotion, spam, porn etc.). Each Tweet posted is analyzed by analytic algorithms (designed by a number of companies) to detect new viral trends (for journalists), disease spread, intelligence leaks or many other things. Each time you browse Amazon, the customized content delivered to you is analytically "calculated" to optimize Amazon's revenue. Each time an email is sent, an analytic algorithm decides whether or not to put it in your spam box (that's intensive computation for Gmail). This is analytics at billions of rows per day. Evidently, a gigantic amount of pre-computation and many look-up tables are used to make this happen, but it is still "big data analytics". The analytic engineer knows that the ad-matching algorithm must use the right metrics and the right look-up tables (which the engineer should help design, if not automatically populate) to produce the best possible result given the finite memory resources and the speed at which results must be delivered, typically measured in milliseconds. You just can't separate the two processes: data flow, and analytics or data science. Indeed the term "data science" conveys the idea that data and analytics are bedfellows.
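The split between offline pre-computation and a millisecond-scale look-up can be sketched roughly as follows. This is only an illustration under stated assumptions, not how any of the companies above actually do it: the function names, the `min_count` threshold and the toy merchant keys are all hypothetical.

```python
from collections import defaultdict


def build_rate_table(history, min_count=2):
    """Offline step: aggregate labeled (key, is_fraud) events into a
    key -> observed fraud rate look-up table.

    Keys with fewer than min_count events are left out, because their
    rate is too noisy to trust (min_count is an illustrative choice).
    """
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for key, is_fraud in history:
        totals[key] += 1
        frauds[key] += int(is_fraud)
    return {k: frauds[k] / totals[k] for k in totals if totals[k] >= min_count}


def score(key, table, default_rate):
    """Online step: a constant-time dictionary look-up, falling back to a
    default rate for keys that are missing from the table."""
    return table.get(key, default_rate)


# Toy labeled history (hypothetical data): (merchant, was_fraud)
history = [
    ("merchant_a", False),
    ("merchant_a", True),
    ("merchant_b", False),
    ("merchant_b", False),
    ("merchant_c", True),  # only one event: excluded from the table
]
table = build_rate_table(history)
print(score("merchant_a", table, default_rate=0.1))
print(score("merchant_c", table, default_rate=0.1))
```

The heavy aggregation happens once, ahead of time; the per-request work is reduced to a single dictionary access, which is what makes a millisecond budget plausible.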
Also, big data practitioners working for start-ups usually wear multiple hats: data engineer, business analyst and machine learning / statistics / analytics engineer. The term "data scientist" suits them really well.
Finally, even with transactional data, if you want to split the data scientist role (in large companies) into silos, with data engineers on one side and analytics or business engineers on the other, there is still an important issue: sampling. Analytics engineers can work on samples, but how small, how big and how good should those samples be? Who determines what makes a good sample? Again, you need to be a data scientist to answer these questions, and the answer is: samples must be far bigger than you think (100 million rows in the contexts described above) and also much better selected. I have worked with an ad network company managing truly big data. They sent me a sample with about 3 million clicks. But the sample did not contain a rich enough set of affiliate data (that is, many affiliates with enough data for each of them), so I could not clearly identify instances of affiliate collusion (a scheme leveraging botnets to share hijacked IP addresses among affiliates, for click fraud). I needed 50 million rows (clicks) to clearly identify this type of massive (but low frequency) fraud. This raises three questions:
My point here is that samples, traditionally involving fewer than 10 million observations, are really far too small in a number of applications, or else the wrong data is being used. Samples with 200 million rows might sometimes prove a good compromise. This is true of data that can be segmented into millions of small buckets, where you need statistical significance in as many buckets as possible. But you cannot apply the same statistical techniques to a 200-million-row data set as to a 10-million-row data set, because of the curse of big data. Google my article "The Curse of Big Data", as it explains the problem and provides a solution. Interestingly, the solution is as much a data solution as a statistical solution, hence the term "data scientist" (rather than "statistician") to describe people working on such projects.
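One way to make the bucket argument concrete is to measure, for a given sample, what fraction of buckets (affiliates, for instance) contain enough observations to support any conclusion at all. The sketch below is a toy illustration, not the method used in the project described above: the function name, the threshold of 30 observations per bucket, the heavy-tailed bucket sizes and the scaled-down row counts are all assumptions chosen for the example.

```python
import random
from collections import Counter


def bucket_coverage(bucket_ids, min_obs, total_buckets):
    """Fraction of all buckets that have at least min_obs observations
    in the sample. bucket_ids holds one bucket label per sampled row."""
    if total_buckets <= 0:
        raise ValueError("total_buckets must be positive")
    counts = Counter(bucket_ids)
    covered = sum(1 for n in counts.values() if n >= min_obs)
    return covered / total_buckets


# Synthetic clicks: 1,000 hypothetical affiliates with heavy-tailed
# traffic (a few large affiliates, many small ones).
rng = random.Random(0)
n_buckets = 1000
weights = [1.0 / (rank + 1) for rank in range(n_buckets)]
clicks = rng.choices(range(n_buckets), weights=weights, k=50_000)

# Compare a small sample with a larger one that contains it.
for size in (3_000, 50_000):
    cov = bucket_coverage(clicks[:size], min_obs=30, total_buckets=n_buckets)
    print(f"{size:>6} rows: {cov:.1%} of buckets have >= 30 observations")
```

Because the small sample here is a prefix of the large one, coverage can only stay the same or grow as rows are added. With heavy-tailed buckets, most of the growth comes from small buckets that are nearly invisible in the smaller sample.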