A Data Science Central Community
Defining big data is now a hot topic. UC Berkeley posted 40 very short definitions by thought leaders (including me). Here, our goal is to offer a detailed, comprehensive definition that (hopefully) suits everyone.
First, there are three layers of data:
This is about collecting data via sensors, log files, or any other data or signal capture mechanism. It involves raw internal data (for example, NASA videos of the night sky used to identify exoplanets) and external data (vendor or third-party data, or Internet data such as tweets about your company).
Typically, level 1 data is big data, but it can be very sparse, easily compressed (with low content value, read this reference), or rather static (not flowing very fast), making this data look large - but possibly shallow - rather than big.
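One rough way to check whether a data set is large but shallow is to see how well it compresses. The sketch below is only an illustration, not a formal measure of content value: the function name compression_ratio and both sample payloads are hypothetical, and zlib is used simply because it ships with Python.

```python
import os
import zlib


def compression_ratio(payload: bytes) -> float:
    """Compressed size divided by original size (lower means more redundant)."""
    if not payload:
        raise ValueError("payload must be non-empty")
    return len(zlib.compress(payload, 9)) / len(payload)


# Hypothetical sparse sensor log: one million bytes, almost all zeros,
# with a single non-zero reading every 10,000 bytes.
sparse = bytearray(1_000_000)
sparse[::10_000] = b"\x01" * 100

# Hypothetical dense data: random bytes, which are essentially incompressible.
dense = os.urandom(1_000_000)

sparse_ratio = compression_ratio(bytes(sparse))
dense_ratio = compression_ratio(dense)

print(f"sparse payload ratio: {sparse_ratio:.4f}")
print(f"dense payload ratio:  {dense_ratio:.4f}")
```

Both payloads take the same space on disk before compression, but the sparse one shrinks to a tiny fraction of its size, which suggests it carries far less information per byte.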
Also note that there are two ways for data to qualify as big data:
Collecting level 1 data is mostly a question of data plumbing.
Here we are dealing with data summarization: deciding which metrics to track (raw and compound metrics), and how to store and blend the somewhat cleaned, curated data (using a database architecture such as SQL, NoSQL, NewSQL, Hadoop, or graph databases). This is the data organizing step.
At this stage, we are dealing with highly refined data, possibly no longer big data - available as reports, visuals, email alerts, automated (machine-to-machine) bidding, or detection of taxpayer accounts worth an IRS audit. In short, actionable data and insights. In some cases, it is still big data, e.g. in the context of scoring all credit card transactions, predicting the value of any home in the US (including trends), or producing a LinkedIn profile connection graph. Most of the time, however, it is not big data. Whatever data it is, this is the data summarization step.
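To make the distinction between raw and compound metrics concrete, here is a minimal sketch that rolls hypothetical level 1 events up into daily summaries. The event format, the dates, and the choice of click-through rate as the compound metric are all assumptions made for illustration; a real pipeline would read from whatever store holds the raw data.

```python
from collections import Counter, defaultdict

# Hypothetical raw events (level 1): (day, event_type) pairs.
events = [
    ("2014-09-01", "impression"),
    ("2014-09-01", "impression"),
    ("2014-09-01", "click"),
    ("2014-09-02", "impression"),
    ("2014-09-02", "impression"),
    ("2014-09-02", "impression"),
    ("2014-09-02", "impression"),
    ("2014-09-02", "click"),
]


def summarize(raw_events):
    """Aggregate raw events into per-day raw and compound metrics."""
    counts = defaultdict(Counter)
    for day, kind in raw_events:
        counts[day][kind] += 1

    summary = {}
    for day in sorted(counts):
        impressions = counts[day]["impression"]  # raw metric
        clicks = counts[day]["click"]            # raw metric
        # Compound metric: click-through rate, undefined without impressions.
        ctr = clicks / impressions if impressions else None
        summary[day] = {"impressions": impressions, "clicks": clicks, "ctr": ctr}
    return summary


daily = summarize(events)
for day, metrics in daily.items():
    print(day, metrics)
```

The summary holds one row per day regardless of how many raw events came in, which is why data at this level is usually much smaller than the level 1 data it is built from.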
Hierarchical data level interactions
Data from level 1 is rarely accessed by level 3 data scientists. But sometimes it is, for instance when accessing raw log files (level 1) to analyze a fraud case (level 3). So there are bridges - and feedback loops - between the three levels.
We will also publish an article about the uplift provided by big data - broken down by industry - over the baseline consisting of leveraging small data only. It is our belief that big data is cheap and easy, compared with small data.