A Data Science Central Community
Forward-looking organizations should consider Hadoop as a data management hub. New features in Hadoop 2.0 provide greater flexibility and manageability, allowing firms to efficiently move more workloads to Hadoop and build analytics models using a much larger data pool.
In financial services, for example, many organizations have been experimenting with Hadoop projects. These range from securities pricing, risk analysis and setting insurance premiums to monitoring transaction flows to detect and prevent fraud and money laundering. With rack-mountable blade servers densely deployed and the power of massively parallel processing, the economics of setting up Hadoop clusters are quite compelling.
Hadoop is an open-source, general-purpose data storage and processing framework. The framework is based on clusters that distribute big data computing jobs by moving the processing to the commodity server nodes where the data resides. While many financial firms have deployed Hadoop within their data centers, clusters can also be run from the cloud, either via cloud vendors or through hosted services.
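The "move the computation to the data" idea is easiest to see in Hadoop's classic MapReduce model. Here is a minimal, illustrative sketch of the map and reduce logic for word counting, written in the style of a Hadoop Streaming job (which pipes text lines through mapper and reducer scripts); a real job would run these functions in parallel across HDFS blocks on many nodes.

```python
# Illustrative sketch of MapReduce-style word counting.
# In a real Hadoop Streaming job, mapper and reducer run as separate
# processes on the nodes holding each data block.
from itertools import groupby
from operator import itemgetter

def mapper(lines):
    """Map phase: emit (word, 1) pairs from raw text lines."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1

def reducer(pairs):
    """Reduce phase: sum counts per word after the shuffle/sort step."""
    for word, group in groupby(sorted(pairs), key=itemgetter(0)):
        yield word, sum(count for _, count in group)

counts = dict(reducer(mapper(["the quick fox", "the lazy dog"])))
```

The shuffle/sort step that Hadoop performs between the two phases is simulated here by the `sorted()` call.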
Hadoop has proven to be valuable in staging and refining big data sets such as web logs and sensor data, which can be ingested quickly into the Hadoop Distributed File System and batch processed with MapReduce. Alternatively, the data can be cleansed and transformed into load-ready format for a relational database management system (RDBMS). Due to its complexity, Hadoop has been mostly deployed next to an RDBMS, such as an enterprise data warehouse (EDW), a data warehouse appliance, or in concert with a document management system.
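As a sketch of what "cleansed and transformed into load-ready format" can mean in practice, the following Python snippet parses raw web-log lines, drops malformed records, normalizes timestamps, and emits CSV that an RDBMS bulk loader could ingest. The log format and column names here are assumptions for illustration, not a standard.

```python
# Hedged sketch: cleansing raw web-log lines into load-ready CSV.
# The 'ip timestamp url status' log format is an illustrative assumption.
import csv
import io
from datetime import datetime

def to_load_ready(raw_lines):
    """Parse log lines, skip malformed rows, normalize timestamps to ISO 8601."""
    out = io.StringIO()
    writer = csv.writer(out)
    writer.writerow(["ip", "ts", "url", "status"])
    for line in raw_lines:
        parts = line.split()
        if len(parts) != 4:
            continue  # cleanse: drop malformed records
        ip, ts, url, status = parts
        try:
            iso = datetime.strptime(ts, "%d/%b/%Y:%H:%M:%S").isoformat()
            writer.writerow([ip, iso, url, int(status)])
        except ValueError:
            continue  # cleanse: drop rows with unparseable fields
    return out.getvalue()

csv_text = to_load_ready([
    "10.0.0.1 12/Mar/2014:10:15:32 /home 200",
    "garbled-line",
])
```

At scale, the same parse/validate/normalize logic would run inside a MapReduce or Pig job rather than a single Python process.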
In our last post we suggested that the useful life of an EDW can be extended by offloading non-analytical functions onto Hadoop clusters. By doing this, Hadoop saves companies money even if it is used solely as an archive. Data that is infrequently accessed but must be retained for GRC (governance, regulatory, compliance) purposes can be moved from more expensive tiers of branded storage to clusters of commodity servers running Hadoop. This frees up the EDW to focus on high-performance processing and analytics on tier-1 data. In this scenario, Hadoop becomes another tier in a distributed data management architecture.
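The tiering decision described above can be reduced to a simple retention policy: recent, frequently accessed data stays on the EDW tier, while colder GRC data moves to the Hadoop archive tier. The sketch below illustrates the idea; the one-year hot window and the record shape are assumptions for illustration, not a recommendation.

```python
# Illustrative tiering policy: records older than a retention cutoff
# move from the warehouse (tier 1) to a Hadoop archive tier.
# The 365-day hot window is an assumed policy, not a standard.
from datetime import date, timedelta

HOT_WINDOW = timedelta(days=365)  # assumed tier-1 retention window

def assign_tier(record_date, today):
    """Tier-1 data stays in the EDW; colder GRC data goes to Hadoop."""
    return "edw" if today - record_date <= HOT_WINDOW else "hadoop_archive"

today = date(2014, 1, 1)
tiers = {d: assign_tier(d, today)
         for d in [date(2013, 6, 1), date(2010, 6, 1)]}
```

In a real deployment, the same policy would drive scheduled extract jobs that land the cold partitions in HDFS.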
More data gives you more ways to search for and see patterns
The bigger opportunity, however, is to perform analytics on much larger data sets. Most business users think of their data in the context of time rather than quantity. The future value of Hadoop lies in allowing users to discover patterns and run data models against years of data to gain a fuller understanding of customers and markets. With this extended perspective, firms can improve the data quality that drives their algorithms and predictive models.
A robust ecosystem of tools and technologies has developed around Hadoop to expand functionality and derive value out of big data. The ecosystem surrounding Hadoop has grown to more than 15 open source projects and continues to evolve rapidly. These include the HBase non-relational database, Hive for data warehousing, the Pig scripting language, Mahout for machine learning, Sqoop for moving data between Hadoop and non-Hadoop data stores, Flume for populating Hadoop with data, Oozie for workflow processing, Ambari for provisioning and managing Hadoop clusters, HCatalog for centralized metadata management and sharing, and the Knox security gateway.
Hadoop 2.0 opens the door to mainstream adoption
While Hadoop may be more commonly used for optimizing internal IT or data management operations, it is rapidly gaining ground as a strategic big data analytics platform as well. Many advanced analytics tools on the market now support Hadoop, enabling visualization, data mining, predictive analytics and text analytics against big data sets.
With the general availability of Hadoop 2.0, the framework can evolve into a massive enterprise data repository that can handle workloads beyond the batch capabilities of MapReduce. The enabler is YARN (Yet Another Resource Negotiator), the most significant new enhancement.
Despite its tongue-in-cheek name, YARN can be thought of as a large-scale distributed operating system for big data applications in Hadoop. Its most important feature is allowing multiple big data jobs to run simultaneously, allocating resources across all the applications while delivering consistent levels of service to end users. With YARN, the batch processing methods of MapReduce are no longer a necessity but become just another option, as determined by workload requirements.
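To make the resource-arbitration idea concrete, here is a deliberately simplified toy in Python: several applications request containers, and a round-robin allocator grants them from a fixed cluster capacity so no single job monopolizes the cluster. Real YARN schedulers (such as the Capacity and Fair Schedulers) are far more sophisticated; the application names and capacities below are illustrative assumptions.

```python
# Toy sketch of YARN-style resource arbitration: applications request
# containers, and a round-robin allocator shares a fixed capacity.
# This is an illustration of the concept, not how YARN is implemented.
from collections import deque

def allocate(requests, total_containers):
    """requests: {app_name: containers_wanted}. Returns grants per app."""
    grants = {app: 0 for app in requests}
    remaining = dict(requests)
    queue = deque(requests)
    while total_containers > 0 and queue:
        app = queue.popleft()
        if remaining[app] > 0:
            grants[app] += 1
            remaining[app] -= 1
            total_containers -= 1
        if remaining[app] > 0:
            queue.append(app)  # app still wants more; rejoin the queue
    return grants

grants = allocate(
    {"mapreduce_batch": 6, "hive_query": 3, "storm_topology": 3}, 8)
```

Even in this toy, the batch job cannot starve the interactive query and streaming jobs, which is the service-level point YARN addresses.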
Information can be used in multiple ways, including interactive SQL-based querying (Hive), online (HBase), stream-processing (Storm) and graph analysis (Giraph). Other functions include Spark for high-speed in-memory analytics and MPI for modeling risks, optimizing pricing and other advanced analytics capabilities.
Firms can use Hadoop to break down data silos across the enterprise and commingle data from transactional systems such as CRM and ERP with system logs, clickstreams, mobile GPS and other unstructured data to yield previously undiscoverable insights. As we pointed out in our last post, Hadoop 2.0 also provides better support for SQL-based queries, a key driver of adoption. With YARN, the distributed processing capabilities become more extensible and customizable for specific use cases and service levels.
It should be noted that Hadoop distributors each have their own approach to running SQL queries in Hadoop. Major data warehouse vendors have developed their own management tools as well. YARN could integrate with the management systems offered by these different vendors, all of which have thrown their support behind Hadoop 2.0 and YARN. Ultimately, customers will force the issue. If most organizations want to control Hadoop resources using YARN, each of the commercial vendors can easily re-architect to defer to YARN orchestration. If proprietary tools provide incremental value add, customers might have to accept some overlap in administrative controls.
Informed questions asked by informed users
While more purpose-built tools are emerging to perform analytics directly in Hadoop, big data provides the opportunity to attack a question with a variety of tools – both new ones and conventional RDBMSs. Analytics is about matching the right technology to the job. For example, Hadoop clusters might be the place to find the lowest level of detail in a data set and to do broad initial exploratory analysis, but a relational database is better suited to storing and aggregating data for deeper-dive operational analysis.
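This division of labor can be sketched in a few lines: detail-level records of the kind Hadoop would hold are rolled up into a summary table inside a relational database for operational analysis. SQLite stands in for the RDBMS here, and the table and column names are illustrative assumptions.

```python
# Sketch of the "right tool for the job" split: detail rows (as explored
# in Hadoop) aggregated into an RDBMS summary for operational analysis.
# SQLite stands in for the warehouse; the schema is illustrative.
import sqlite3

detail_rows = [  # lowest level of detail, e.g. staged in Hadoop
    ("2014-01-02", "equities", 120.0),
    ("2014-01-02", "fx",       75.5),
    ("2014-01-03", "equities", 210.0),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trades (trade_date TEXT, desk TEXT, notional REAL)")
conn.executemany("INSERT INTO trades VALUES (?, ?, ?)", detail_rows)

# The deeper-dive operational view: pre-aggregated by desk
summary = conn.execute(
    "SELECT desk, SUM(notional) FROM trades GROUP BY desk ORDER BY desk"
).fetchall()
```

In practice, the load-ready output of a Hadoop job would feed exactly this kind of aggregate table.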
Big data is not about the technology, however. The key is to understand the question you want to answer. What is the problem you are trying to solve? Then, make sure you’ve got the right data. To avoid swamping business units with masses of data that they cannot manipulate in a timely fashion, CIOs need to consider how to best present relevant data to users in the most effective way.
Start with smaller test cases rather than a broad deployment. Reference architectures are now available across the industry for various Hadoop implementation scenarios. With multiple use cases across financial services businesses, firms should establish in-house expertise on the ecosystem and institute analytics training for a wide swath of employees. This would provide a common language of data in the enterprise and foster closer collaboration between IT and data analysts. As more users are added, think about how to scale your compute and storage, and ensure security with end-user access and authentication policies and processes in front of a Hadoop cluster.
In the near term, Hadoop will not replace relational databases or traditional data warehouse platforms. But with new and continued enhancements such as YARN, companies can leverage Hadoop’s superior price/performance characteristics. Hadoop can realize significant ROI through improved insights and decision-making that drive revenues and profitability while reducing operational costs and mitigating risks. It also positions Hadoop as a data management hub in a multi-platform enterprise analytics environment.