A Data Science Central Community
A recent LinkedIn post linking to an Innovation Enterprise article entitled 'Hadoop Is Failing' certainly got our attention, as you might expect.
We disagree with the assertion that 'Hadoop...is very much the foundation on which data today is built', but the main thrust of the article is well founded.
Consider the following snippets:
And the grand finale:
'Most companies do not currently have enough data to warrant a Hadoop rollout, but did so anyway because they felt they needed to keep up with the Joneses..they soon realize that their data works better in other technologies.'
This can be summarised as: drank the Kool-Aid, convinced each other Hadoop was a silver bullet, rocked up a POC stack with an all-too-willing vendor, spent a lot of time and effort getting nowhere, quietly went back to existing data management technologies (almost certainly the RDBMS-based data warehouse).
Mention is also made of the Hortonworks share price performance, which is down by about 66% since the December 2014 IPO. The NASDAQ is up 19% over the same period.
We've been exposed to the recent 'Big Data' thang from both a client and vendor perspective.
From the client side the main motivators appear to be twofold: reducing data warehouse costs, and 'fear of missing out' (FOMO) or keeping up with the Joneses. In both cases the amount of Kool-Aid supplied by VCs, vendors, analysts, tech journos and geek fanbois, and willingly gulped down by those who should know better, is a tad disappointing.
The architect who glibly told us his company would 'obviously' have to re-train thousands of analysts away from SQL and into the realms of Pig and Hive springs to mind. This despite having 25 years of skills invested in the existing EDW.
From the vendor perspective there has been a lot of 'over promising' that has failed to materialise. 'Nothing new there', the cynics might suggest.
The challenge during one POC was to run an existing batch suite of SQL against a relatively simple schema lifted from the EDW as part of an EDW-offloading test.
The POC was to be attempted using a Hadoop stack running Hive, and was characterised by the vendor as 'just a SQL race'. Given the stature of the vendor we reasonably assumed their confidence was well placed. How wrong we were - the SQL failed a few steps in. Very many tickets later, still no joy. EDW offloading? Err, no. EDW replacement? LMAO!
The point is, our real-world experiences bear out the main thrust of the article: the rush to get involved with Hadoop has *largely* happened without any real understanding of it, and the failure to get it working, for whatever reason, has led to a lot of disillusionment and/or stagnation.
First of all, be honest with yourself: you're probably not Google, Facebook or Yahoo.
These companies conceived, designed & built the Hadoop ecosystem out of necessity. They have <understatement>slightly</understatement> different challenges to yours.
They also have a vastly different set of in-house skills. You're probably a bank, a retailer, a manufacturer, a telecoms company or the like. You simply don't possess the engineering capability that the 'Hadoop crowd' take for granted. Your salary scales also don't allow the required talent to be hired. That's before we take your location, stock options & 'coolness' into account!
When considering *any* new technology, it should always be apparent what extra capability is on offer, and what specific business benefit that capability delivers. Implicitly, this should be something we can't already achieve with the tools at our disposal.
On a personal note, the adage that "there's nowt new under the sun" is my default starting position. Greater minds than mine think there is a lack of novelty with MapReduce. We've had parallel data processing for decades. Bah humbug, etc.
Before any technology POC (Hadoop included) gets off the ground, the business case in the form of potential ROI should be established, as we all know. FOMO and 'keeping up with the Joneses' does not a business case make, no matter how much Kool-Aid is on offer.
During the ROI assessment, it's all too easy to tot up the potential project benefits without being honest about the costs. Hadoop stacks aren't free, no matter how much IT bang on about open source software, commodity servers/storage & vendor consulting support. Look at the amount of in-house resource required and the costs add up very quickly.
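To make the 'be honest about the costs' point concrete, here is a minimal back-of-an-envelope sketch in Python. Every figure in it is hypothetical and chosen purely for illustration, as are the names `CostItem`, `total_annual_cost` and `simple_roi`; the model is deliberately crude (one upfront cost, flat annual costs and benefits, no discounting), so treat it as a starting point rather than a proper business case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CostItem:
    """One recurring cost line (hypothetical structure for illustration)."""
    name: str
    annual_cost: float  # same currency unit as the benefit figure


def total_annual_cost(items):
    """Sum the recurring yearly costs across all cost lines."""
    return sum(item.annual_cost for item in items)


def simple_roi(annual_benefit, items, upfront_cost, years):
    """Undiscounted ROI: (total benefit - total cost) / total cost."""
    total_cost = upfront_cost + total_annual_cost(items) * years
    if total_cost <= 0:
        raise ValueError("total cost must be positive")
    total_benefit = annual_benefit * years
    return (total_benefit - total_cost) / total_cost


# Hypothetical figures only: none of these numbers come from a real project.
costs = [
    CostItem("commodity servers and storage", 150_000),
    CostItem("vendor consulting and support", 200_000),
    CostItem("in-house engineers and admins", 450_000),
]

annual = total_annual_cost(costs)
roi = simple_roi(annual_benefit=600_000, items=costs,
                 upfront_cost=300_000, years=3)

print(f"Annual running cost: {annual:,.0f}")  # Annual running cost: 800,000
print(f"3-year ROI: {roi:.1%}")               # 3-year ROI: -33.3%
```

With these made-up inputs the 'free' stack loses a third of its outlay over three years, and the in-house resource line is the largest single cost. Swap in your own numbers before drawing any conclusions.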
As several of our clients have rightly concluded, in most instances, Hadoop is a classic case of a solution looking for a problem - a problem that simply doesn't exist for most organisations.
US tech companies assume the rest of the world operates at the same scale as the US. We simply don't. Most organisations in the UK and Europe that we've encountered over the years operate at the 1-20TB scale. Over 100TB is rare. The case for a data processing stack running hundreds or even thousands of nodes simply doesn't exist.
To underpin wider adoption, the recent focus has been to polish 'SQL on HDFS'. This leads neatly back to where we started: the SQL-compliant parallel data warehouse, i.e. a database!
Prove us wrong by all means, but it is our belief that 99% of organisations can deliver 99% of their data analytics requirements without resort to a 'Hadoop stack'. Go back and give the data warehouse you take for granted some love :-)
We've yet to hear of a *single* case where an existing RDBMS data warehouse has been ditched for a Hadoop stack. Not one.
So, is Hadoop failing? No, not when used by the few organisations that have the right problem & skills.
Are Hadoop projects failing for lots of other organisations? Yes, mostly.