Defining big data is now a hot topic. UC Berkeley posted 40 very short definitions by thought leaders (including me). Here, our goal is to offer a detailed, comprehensive definition that (hopefully) suits everyone.

First, there are three levels of data:

Level 1

This is about collecting data via sensors, log files, or any other data/signal capture mechanism. It involves raw internal data (for example, NASA videos of the night sky used to identify exoplanets) and external data (vendor or third-party data, or Internet data such as tweets about your company).

Typically, Level 1 is big data, but it can be very sparse or easily compressed (with low content value), or rather static (not flowing very fast), making this data look large - but possibly shallow - rather than big.
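One rough way to tell "large but shallow" data from data with real content is to check how well it compresses. The sketch below is only an illustration, not a method from this article: the helper name `compression_ratio` and the two synthetic samples are assumptions, and a compression ratio is only a crude proxy for content value.

```python
import os
import zlib


def compression_ratio(data: bytes) -> float:
    """Return compressed size divided by raw size (lower means more redundant)."""
    if not data:
        raise ValueError("data must be non-empty")
    return len(zlib.compress(data, level=9)) / len(data)


# Hypothetical samples: one million zero bytes (large but shallow)
# versus one million random bytes (essentially incompressible).
sparse = bytes(1_000_000)
dense = os.urandom(1_000_000)

sparse_ratio = compression_ratio(sparse)
dense_ratio = compression_ratio(dense)

print(f"sparse sample ratio: {sparse_ratio:.4f}")
print(f"dense sample ratio:  {dense_ratio:.4f}")
```

A ratio close to 0 suggests that the raw volume overstates the amount of information; a ratio close to 1 suggests that most of the bytes carry content.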

Also note that there are two ways for data to qualify as big data:

  • Absolute big data: more than 10 terabytes per year, or more than 1 terabyte for specific analyses (such as root cause analyses). Few companies deal with this volume of data; after all, the Census data - about 300 million Americans - takes much less space.
  • Relative big data: 10 times bigger (per time unit) than anything you've been dealing with in the past, requiring new tools, employees, and methodology for your company.
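The two criteria above can be written as a small check. This is a minimal sketch under stated assumptions: the function name `classify_volume` and its parameters are hypothetical, a terabyte is taken as 10^12 bytes, and "10 times bigger" is read as at least a tenfold increase in yearly volume. The organizational side of the relative definition (new tools, employees, methodology) is not captured.

```python
TB = 10**12  # assumption: decimal terabyte, in bytes


def classify_volume(bytes_per_year, largest_analysis_bytes=0,
                    previous_bytes_per_year=None):
    """Return the set of labels ("absolute", "relative") that apply."""
    labels = set()
    # Absolute: more than 10 TB per year, or more than 1 TB for one analysis.
    if bytes_per_year > 10 * TB or largest_analysis_bytes > 1 * TB:
        labels.add("absolute")
    # Relative: at least 10 times the volume handled in the past.
    if previous_bytes_per_year and bytes_per_year >= 10 * previous_bytes_per_year:
        labels.add("relative")
    return labels


print(classify_volume(50 * TB))                                       # {'absolute'}
print(classify_volume(2 * TB, previous_bytes_per_year=100 * 10**9))   # {'relative'}
print(classify_volume(500 * 10**9))                                   # set()
```

Returning a set rather than a single label reflects that the two criteria are independent: a data set can satisfy both, either one, or neither.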

Collecting level 1 data is mostly a question of data plumbing.

Level 2

Here we are dealing with data summarization: deciding which metrics to track (raw and compound metrics), and how to store and blend the somewhat cleaned, curated data (using database architecture - SQL, NoSQL, NewSQL, Hadoop, graph databases). This is the data organizing step.
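As an illustration of the difference between raw and compound metrics, the sketch below rolls a few Level 1 events up into per-page figures. The event fields (`page`, `ms`), the sample records, and the function name `summarize` are all hypothetical; a real pipeline would run this kind of aggregation inside one of the storage systems listed above rather than in memory.

```python
from collections import defaultdict

# Hypothetical raw (Level 1) events, e.g. parsed from web server log lines.
raw_events = [
    {"page": "/home", "ms": 120},
    {"page": "/home", "ms": 80},
    {"page": "/pricing", "ms": 300},
]


def summarize(events):
    """Aggregate raw events into per-page metrics."""
    totals = defaultdict(lambda: {"hits": 0, "total_ms": 0})
    for event in events:
        page_totals = totals[event["page"]]
        page_totals["hits"] += 1                # raw metric: a simple count
        page_totals["total_ms"] += event["ms"]
    # Compound metric: average latency, derived from two raw quantities.
    return {
        page: {"hits": t["hits"], "avg_ms": t["total_ms"] / t["hits"]}
        for page, t in totals.items()
    }


summary = summarize(raw_events)
print(summary)
```

Here `hits` is a raw metric and `avg_ms` is a compound one, since it combines a count with a total.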

Level 3

At this stage, we are dealing with highly refined data, possibly no longer big data, available as reports, visuals, email alerts, automated (machine-to-machine) bidding, or detection of taxpayer accounts worth an IRS audit. In short, actionable data and insights. In some cases, it is still big data, e.g. when scoring all credit card transactions, predicting the value of any home in the US (including trends), or producing a LinkedIn profile connection graph. Most of the time, however, it is not big data. Whatever data it is, this is the data summarization step.

Hierarchical data level interactions

Data from level 1 is rarely accessed by level 3 data scientists. But sometimes it is, for instance when accessing raw log files (level 1) to analyze a fraud case (level 3). So there are bridges - and feedback loops - between the three levels.

We will also publish an article about the uplift provided by big data - broken down by industry - over the baseline consisting of leveraging small data only. It is our belief that big data is cheap and easy, compared with small data. 

© 2019 BigDataNews.com is a subsidiary of DataScienceCentral LLC and not affiliated with Systap
