Monday, 3 June 2013
The Real Reason Hadoop Is Such A Big Deal In Big Data
Hadoop is the poster child for Big Data, so much so that the open source data platform has become practically synonymous with the wildly popular term for storing and analyzing huge sets of information.
While Hadoop is not the only Big Data game in town, the software has had a remarkable impact. But exactly why has Hadoop been such a major force in Big Data? What makes this software so damn special - and so important?
Sometimes the reasons behind something's success can be staring you right in the face. For Hadoop, the biggest motivator in the market is simple: Before Hadoop, data storage was expensive.
Hadoop, however, lets you store as much data as you want in whatever form you need, simply by adding more servers to a Hadoop cluster. Each new server (which can be commodity x86 machines with relatively small price tags) adds more storage and more processing power to the overall cluster. This makes data storage with Hadoop far less costly than prior methods of data storage.
(See also Hadoop: What It Is And How It Works.)
We're not talking about data storage in terms of archiving… that's just putting data onto tape. Companies need to store increasingly large amounts of data and be able to easily get to it for a wide variety of purposes. That kind of data storage was, in the days before Hadoop, pricey.
And, oh what data there is to store. Enterprises and smaller businesses are trying to track a slew of data sets: emails, search results, sales data, inventory data, customer data, click-throughs on websites… all of this and more is coming in faster than ever before, and trying to manage it all in a relational database management system (RDBMS) is a very expensive proposition.
Historically, organizations trying to manage costs would sample that data down to a smaller subset. This down-sampled data would automatically carry certain assumptions, number one being that some data is more important than other data. For example, a company depending on e-commerce data might prioritize its data on the (reasonable) assumption that credit card data is more important than product data, which in turn would be more important than click-through data.
That's fine if your business is based on a single set of assumptions. But what happens if the assumptions change? Any new business scenarios would have to use the down-sampled data still in storage, the data retained based on the original assumptions. The raw data would be long gone, because it was too expensive to keep around. That's why it was down-sampled in the first place.
Expensive RDBMS-based storage also led to data being siloed within an organization. Sales had its data, marketing had its data, accounting had its own data and so on. Worse, each department may have down-sampled its data based on its own assumptions. That can make it very difficult (and misleading) to use the data for company-wide decisions.
Hadoop stores data in a distributed filesystem that keeps track of where each piece of data sits across the servers in a cluster. The tools to process that data are also distributed, often running on the same servers where the data is housed, which makes for faster data processing because less data has to move across the network.
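To make the idea of processing data where it lives more concrete, here is a minimal sketch in Python. It is a toy model, not Hadoop's actual code: the function names, the node names and the round-robin placement rule are all assumptions made for illustration (real Hadoop placement is more sophisticated). Three copies per block matches the default replication setting in Hadoop's filesystem.

```python
from collections import defaultdict


def place_blocks(num_blocks, nodes, replication=3):
    """Toy placement: copy r of block b goes to node (b + r) mod len(nodes)."""
    if replication > len(nodes):
        raise ValueError("need at least as many nodes as replicas")
    return {
        b: [nodes[(b + r) % len(nodes)] for r in range(replication)]
        for b in range(num_blocks)
    }


def schedule_local(placement):
    """Run each block's task on the least-loaded node that already holds a copy."""
    load = defaultdict(int)
    assignment = {}
    for block, holders in placement.items():
        node = min(holders, key=lambda n: load[n])
        assignment[block] = node
        load[node] += 1
    return assignment


# Hypothetical four-server cluster holding one file split into eight blocks.
nodes = ["node1", "node2", "node3", "node4"]
placement = place_blocks(8, nodes)
assignment = schedule_local(placement)

for block, node in assignment.items():
    print(f"block {block}: stored on {placement[block]}, processed on {node}")
```

Every task in this sketch runs on a server that already stores the block it needs, so no block has to be copied over the network before work can start.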
Hadoop, then, allows companies to store data much more cheaply. How much more cheaply? In 2012, Rainstor estimated that running a 75-node, 300TB Hadoop cluster would cost $1.05 million over three years. In 2008, Oracle sold a database with a little over half the storage (168TB) for $2.33 million - and that's not including operating costs. Throw in the salary of an Oracle admin at around $95,000 per year, and you're talking a three-year cost of about $2.62 million - roughly 2.5 times the cost, for just over half of the storage capacity.
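The arithmetic behind that comparison can be checked with a few lines of Python. The input figures are the ones quoted above; the per-terabyte numbers are derived here and do not appear in either estimate. Treating the admin salary as the only Oracle operating cost is the article's simplification, so read this as a rough sketch rather than a full cost model.

```python
# Figures quoted in the paragraph above.
HADOOP_TB = 300
HADOOP_3YR_COST = 1_050_000   # Rainstor's 2012 three-year estimate
ORACLE_TB = 168
ORACLE_PRICE = 2_330_000      # 2008 purchase price, before operating costs
ADMIN_SALARY = 95_000         # per year
YEARS = 3

# Derived values (assumption: the admin salary is the only added Oracle cost).
oracle_3yr_cost = ORACLE_PRICE + ADMIN_SALARY * YEARS
cost_ratio = oracle_3yr_cost / HADOOP_3YR_COST
capacity_ratio = ORACLE_TB / HADOOP_TB
hadoop_per_tb = HADOOP_3YR_COST / HADOOP_TB
oracle_per_tb = oracle_3yr_cost / ORACLE_TB

print(f"Oracle three-year cost: ${oracle_3yr_cost:,}")
print(f"Cost ratio (Oracle / Hadoop): {cost_ratio:.2f}")
print(f"Capacity ratio (Oracle / Hadoop): {capacity_ratio:.2f}")
print(f"Hadoop cost per TB: ${hadoop_per_tb:,.0f}")
print(f"Oracle cost per TB: ${oracle_per_tb:,.0f}")
```

Measured per terabyte rather than per system, the gap is wider than the headline 2.5 times, because the more expensive system also holds less data.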
This kind of price savings means Hadoop lets companies afford to hold all of their data, not just the down-sampled portions. Fixed assumptions don't need to be made in advance. All data becomes equal and equally available, so business scenarios can be run with raw data at any time as needed, without limitation or assumption. This is a very big deal, because if no data needs to be thrown away, any data model a company might want to try becomes fair game.
That scenario is the next step in Hadoop use, explained Doug Cutting, Chief Architect of Cloudera and an early Hadoop pioneer. "Now businesses can add more data sets to their collection," Cutting said. "They can break down the silos in their organization."
Hadoop also lets companies store data as it comes in - structured or unstructured - so you don't have to spend money and time configuring data for relational databases and their rigid tables. Since Hadoop can scale so easily, it can also be the perfect platform to catch all the data coming from multiple sources at once.
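The idea of storing data as it arrives and deciding on structure only when a question is asked can be sketched in plain Python. This is an illustration of the principle, not Hadoop code: the sample records, field names and helper functions are all hypothetical.

```python
import json

# Hypothetical raw records, kept exactly as they arrived.
raw_lines = [
    '{"type": "order", "customer": "c1", "total": 42.5}',
    '{"type": "click", "customer": "c1", "page": "/shoes"}',
    "free-text support email from c2 about a late delivery",
    '{"type": "order", "customer": "c2", "total": 10.0}',
]


def read_records(lines):
    """Apply structure at read time; anything that is not a JSON object stays raw text."""
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            record = None
        if not isinstance(record, dict):
            record = {"type": "text", "body": line}
        yield record


def total_order_value(lines):
    """One question asked of the raw data: how much was ordered?"""
    return sum(r["total"] for r in read_records(lines) if r["type"] == "order")


def count_by_type(lines):
    """A different question, asked later, of the same untouched data."""
    counts = {}
    for r in read_records(lines):
        counts[r["type"]] = counts.get(r["type"], 0) + 1
    return counts


print(total_order_value(raw_lines))
print(count_by_type(raw_lines))
```

Nothing was thrown away or reshaped when the records were stored, so a new question only needs a new reader function, not a new table design.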
Hadoop's most touted benefit is its ability to store data much more cheaply than can be done with RDBMS software. But that's only the first part of the story. The capability to catch and hold so much data so cheaply means businesses can use all of their data to make more informed decisions.