A Data Science Central Community
What's the latest on the data growth crisis? Well, how about 40 zettabytes by 2020? What?
Yes, data is growing even faster than anyone imagined. And the good news is that Apache Hadoop has arrived as a next-generation enterprise data management platform, purpose-built to handle all that data.
Not only is the volume of data growth enormous, the velocity and variety of data are accelerating as well. This data growth effect is known as the three V's:
Volume: The raw volume of data is growing from terabytes to petabytes.
Velocity: Batch interfaces are less and less preferred as real-time data feeds become available, improving the opportunity for enterprise analytics.
Variety: From social to mobile, text to video, structured to unstructured, data growth flows in a variety of record and file formats.
Prior to the emergence of Hadoop, structured and unstructured data were stored and managed in separate silos of enterprise data. As a result, unstructured data such as documents could not be combined with structured data to better describe business objects as Enterprise Business Records (EBRs).
Apache Hadoop stores any data in any format, and also provides rich and robust access capability such as text search in addition to structured query. Thus, Hadoop enables a single repository to store all enterprise data with improved access as well.
Unfortunately, as the amount of data to be processed grows, the performance of enterprise applications degrades, and the entire business is impacted. According to Gartner, as much as 80% of data in a typical production portfolio may be inactive, yet it still hinders application performance, causes outages and increases IT costs.
Over time, transactions age and are less frequently accessed. Inactive data grows, putting more pressure on application performance and availability.
Then, operations and compliance concerns arise as batch jobs, data replication and disaster recovery all run slower, making online users wait longer. Outages are extended as more time is required to process more data, such as when converting legacy data to a new release during upgrade cycles.
The Solution: Enterprise Archiving
Enterprise Archiving moves less frequently accessed data out of production databases and into nearline archive storage based on an Information Lifecycle Management (ILM) framework.
Data retention policies ensure not only compliance but also that users always retain access to archived data. Data may be classified as sensitive, by business value, or by legal hold and regulatory requirement for compliance.
Once the data is moved and purged from the tier one database infrastructure, the result is improved application performance and availability because production data sets are reduced to manageable levels. Structured relational databases and unstructured file stores may all be archived to improve application performance on current data.
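As a rough illustration of the archiving step described above, here is a minimal Python sketch that splits records into those that stay in production and those that move to the nearline archive, based only on when they were last accessed. The `Record` type, the `split_for_archive` function and the 365-day threshold are all hypothetical names and values chosen for the example; a real ILM policy would also take classification, legal hold and regulatory rules into account.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Record:
    # Hypothetical record shape, used only for this illustration.
    record_id: int
    last_accessed: date


def split_for_archive(records, today, inactive_after_days=365):
    """Return (keep_in_production, move_to_archive).

    A record counts as inactive when it was last accessed before the
    cutoff date. The 365-day default is an assumed value, not a rule
    taken from any particular retention policy.
    """
    cutoff = today - timedelta(days=inactive_after_days)
    keep, archive = [], []
    for rec in records:
        if rec.last_accessed < cutoff:
            archive.append(rec)   # inactive: candidate for nearline storage
        else:
            keep.append(rec)      # still active: stays in tier one
    return keep, archive
```

Calling `split_for_archive(records, date.today())` returns the two lists; only the second would be copied to the archive and then purged from the production database.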
Apache Hadoop is open source, and designed to leverage powerful, new low-cost infrastructure to deliver massive scalability. Using the MapReduce programming model to process large data sets in parallel across distributed nodes, Hadoop delivers highly scalable workload performance and very low-cost, bulk data storage.
All this means that Hadoop offers dramatic cost savings over traditional tier one infrastructure.
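For readers unfamiliar with the MapReduce programming model mentioned above, the following is a single-process Python sketch of its three stages (map, shuffle, reduce) applied to a word count. It does not use Hadoop's API; the function names are made up for the example, and a real Hadoop job would run the map and reduce stages in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    # Map: emit a (word, 1) pair for every word in one input document.
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    # Shuffle: group all emitted values by key.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    # Reduce: combine all values for one key into a single result.
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
```

Because each document is mapped independently and each key is reduced independently, both stages can be spread over as many machines as are available, which is where the scalability comes from.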
Consider the following comparison. According to Monash Research, the cost of tier one database infrastructure may be over $60,000 per TB. By comparison, 1 TB of Amazon S3 storage at Amazon Web Services costs $30 per month, according to its recent price list.
The conclusion: over what appears to be a three-year comparison period (36 x $30 = $1,080 per TB), Hadoop is about 55.5X cheaper than tier one infrastructure.
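The arithmetic behind that ratio can be checked with a few lines of Python. The two prices are the ones quoted above; the comparison period is an assumption, since a one-time cost per TB is being set against a monthly fee. The function name `cost_ratio` is made up for the example.

```python
TIER_ONE_COST_PER_TB = 60_000   # USD per TB, figure quoted from Monash Research
S3_COST_PER_TB_MONTH = 30       # USD per TB per month, quoted AWS price


def cost_ratio(months):
    """Tier one cost per TB divided by the bulk storage cost over `months`."""
    return TIER_ONE_COST_PER_TB / (S3_COST_PER_TB_MONTH * months)


if __name__ == "__main__":
    # The periods below are assumed values for comparison only.
    for months in (12, 36, 60):
        print(f"{months} months: {cost_ratio(months):.1f}x")
```

With a 36-month period the ratio comes to roughly 55.6, which matches the 55.5X figure if that figure was truncated rather than rounded. Shorter or longer periods change the ratio considerably, so the multiple depends on the time horizon chosen.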