A Data Science Central Community

Big data is not expensive. You can process 10 terabytes of data per year on colocated servers using open-source tools (Python, though I do it in Perl), with your own home-made Hadoop-style system if needed, to score 100 billion transactions, all for less than $1,000 per year. It requires some optimization in how you manage your files (you don't even need a database, just a robust file architecture), but it is entirely doable.
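
To make the "no database, just files" idea concrete, here is a minimal sketch, not the author's actual system: transactions live in plain text, get routed into bucket files by a deterministic hash of their key, and are aggregated bucket by bucket, Map-Reduce style. The field names (`tx_id`, `user`, `amount`) and bucket count are hypothetical.

```python
import csv, io, zlib
from collections import defaultdict

# Hypothetical raw transactions: transaction id, user id, amount.
raw = io.StringIO(
    "t1,u42,20.00\n"
    "t2,u7,5.00\n"
    "t3,u42,3.50\n"
)

N_BUCKETS = 4  # in a real system, thousands of bucket files on disk
buckets = defaultdict(list)  # bucket id -> rows (stand-in for files)

# "Map" step: route each transaction to a bucket. crc32 is deterministic,
# so the same user always lands in the same bucket file across runs.
for tx_id, user, amount in csv.reader(raw):
    b = zlib.crc32(user.encode()) % N_BUCKETS
    buckets[b].append((tx_id, user, float(amount)))

# "Reduce" step: aggregate per user, one bucket at a time, so memory
# usage is bounded by the largest bucket, not the whole dataset.
totals = {}
for rows in buckets.values():
    for _, user, amount in rows:
        totals[user] = totals.get(user, 0.0) + amount

print(totals)  # {'u42': 23.5, 'u7': 5.0}
```

Because each bucket can be processed independently, the same pattern scales by splitting buckets across processes or machines.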

You can do it without expensive vendor tools, indeed without any tools other than programming languages and code libraries. The required expertise can be gained for free through various data science programs or books. And if you still need a scientist to analyze your big data, it won't cost more than the salary of a statistician working on small data.

I would go so far as to say that big data is easier to process than small data, once you get familiar with the right techniques, including:

- Practical illustration of Map-Reduce (Hadoop-style), on real data
- A synthetic variance designed for Hadoop and big data
- Fast Combinatorial Feature Selection with New Definition of Predict...
- A little known component that should be part of most data science a...
- 11 Features any database, SQL or NoSQL, should have
- Clustering idea for very large datasets
- Hidden decision trees revisited
- Correlation and R-Squared for Big Data
- Marrying computer science, statistics and domain expertise
- New pattern to predict stock prices, multiplies return by factor 5
- What Map Reduce can't do
- Excel for Big Data
- Fast clustering algorithms for massive datasets
- Source code for our Big Data keyword correlation API
- The curse of big data
- How to detect a pattern? Problem and solution
- Interesting Data Science Application: Steganography

One example of where big data is simple: when you have 50 observations (from multiple users / clients) in each of millions of data buckets, the inference process becomes much easier, with lower reliance on imputation and sophisticated experimental design (the complexity lies in identifying the right buckets, making sure they are robust, and doing cross-validation). Another example: compare our model-free confidence intervals with the *p*-values of traditional statistical science. Nobody but statisticians understands *p*-values, while my confidence intervals are easy to understand even by people with no college education. Much of statistical science has been rewritten for big data, to be easier to digest by computer scientists and data engineers, and to be more robust in the big data context. This was done at very little cost, which means it costs next to nothing to learn, and there is no excuse not to gain this knowledge and these skills.
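
One common way to build a model-free confidence interval is the percentile bootstrap, sketched below; this is a standard resampling technique and not necessarily the article's exact construction. The idea is easy to state without any distribution theory: resample the data many times, recompute the statistic each time, and read the interval straight off the sorted estimates.

```python
import random

random.seed(0)
# Synthetic observations standing in for one data bucket.
data = [random.gauss(100, 15) for _ in range(500)]

def bootstrap_ci(values, stat, n_resamples=2000, alpha=0.05):
    """Percentile bootstrap CI for an arbitrary statistic: no model,
    no distributional assumptions, just resampling with replacement."""
    estimates = []
    for _ in range(n_resamples):
        sample = random.choices(values, k=len(values))
        estimates.append(stat(sample))
    estimates.sort()
    lo = estimates[int(alpha / 2 * n_resamples)]
    hi = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi

mean = lambda xs: sum(xs) / len(xs)
lo, hi = bootstrap_ci(data, mean)
print(f"95% interval for the mean: [{lo:.1f}, {hi:.1f}]")
```

The interpretation requires no statistics background: "the mean is very likely between these two numbers," which is the kind of plain-language statement the paragraph above contrasts with *p*-values.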

Feel free to share your costs of storing, processing, and analyzing big data in the comments section. Mine are definitely below $1,000 per year, and I can scale to 10 terabytes per year with no extra costs, even more if I'm careful to select the right metrics to track and to design the right look-up tables and hierarchical summary tables (organized as smart text files, a bit like Hadoop). In many projects, however, I actually use intuition, good judgment, and sheer brain power, including selecting and reading data reports and analyses from external sources, to come up with a solution without collecting or analyzing any data at all. So it is still possible to do business with no data, but blending big data with intuition makes for an explosive cocktail: use big data where it helps most, collect the right data (including external data sources), have skilled people identify, extract, and analyze it, and/or use outsourced data services (such as Google Analytics if your data is digital web log data, or FICO/Experian scores if your data is financial transactions).
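
The hierarchical summary tables mentioned above can be illustrated with a toy example (the schema here is hypothetical): raw events are rolled up into a daily summary, and the monthly summary is built from the daily one rather than from the raw data, so most queries never touch the raw logs at all.

```python
from collections import defaultdict

# Hypothetical raw events: (date, metric value).
events = [("2013-06-01", 3), ("2013-06-01", 5),
          ("2013-06-02", 2), ("2013-07-01", 7)]

# First level: one row per day, aggregated from raw events.
daily = defaultdict(int)
for day, value in events:
    daily[day] += value

# Second level: one row per month, built from the daily table only.
monthly = defaultdict(int)
for day, total in daily.items():
    monthly[day[:7]] += total

# Each level persists as a plain tab-separated text file, e.g.:
daily_file = "\n".join(f"{d}\t{v}" for d, v in sorted(daily.items()))
print(daily_file)
print(dict(monthly))  # {'2013-06': 10, '2013-07': 7}
```

Each level is orders of magnitude smaller than the one below it, which is what keeps the storage and query costs of this flat-file approach so low.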

The idea that big data is expensive and complicated is a myth, propagated by people who resist progress or who worry that competitors will successfully leverage big data to outsmart them.
