It's difficult for the average person to grasp how much data is now available, in every form. Rather than working with gigabytes of data, we are now asked to analyze terabytes and petabytes.
To put this in perspective, let's take a look at some stats for Twitter:
- 58 million new Tweets per day
- 9,100 Tweets per second
- 2.1 billion queries by Search Engines per day
Source: http://www.statisticbrain.com/twitter-statistics/
That's a lot of data...
Traditionally, data for analysis is stored in an RDBMS, or Relational Database Management System: nice, neat tables with columns and rows. Even the average Joe outside the IT industry has seen this format in an Excel spreadsheet.
That's great if you're only querying gigabytes of data and, more importantly, if that data is organized.
I think of it like a stack of euro bills. Each denomination is a different size, which lets a person tell the values apart quickly.
But when you put all the bills into one pile, it's difficult to count what you have without sorting each denomination into its own pile. If you're like me, you'll spend that extra time organizing before you count. With 10 bills it's quick and easy, but what if you had 1,000 bills? Then 10,000? You can see how the prep work needed to get to your result keeps growing.
The pile of bills represents the future of data. No longer can we categorize data in familiar groupings stored in one container, like a database. We need to think bigger. We need to throw out the concept of a 'database' and see all forms of data as a potential container.
Let's revisit our Twitter stats from earlier. How long would it take to write 58 million rows of data? Minutes, hours, days? Say you commit most of your resources to writing all that data to a table. What about the users trying to query that same information? What if your little sister is looking for tweets on the latest news about her favorite singer, Justin Bieber?
With a traditional RDBMS, performance is constrained by the resources available. Since most of those resources are committed to adding the data, you're going to have to tell your sister she will need to wait. I'm sure she'll be patient and willing to wait... When she's at school the next day and her friends are chatting about the latest Bieber news, she won't feel out of the loop and lost. She knows that when she gets home she can finally read what her friends read yesterday.
Yeah, I don't think so... Timing of information is everything these days. A delay of a day, or even an hour, can be the difference between winning and losing.
What to do about this predicament?
How can I store huge amounts of data AND access it quickly?
This is where Apache Hadoop comes into play.
Hadoop provides a 'software framework that supports Data-Intensive Distributed Applications'. The first time I read the formal definition, my reaction was 'huh? What does that mean?' I understood the individual words, but couldn't see how they worked together.
Thinking differently about what counts as 'data' is the key to understanding the capabilities of Hadoop.
Hadoop enables a filtering, sorting, and summary method known as 'MapReduce', which is actually two separate functions: 'Map' and 'Reduce'.
All data, in the form of output, has some sort of structure. Processing, by its nature, has to follow some sort of pattern. This is where the 'Map' function comes into play, by filtering, sorting, and querying the data. This means you can now 'Map' any type of flat file.
Mapping can be done on multiple files at the same time. The 'Reduce' function then summarizes the mapped results into one consolidated output.
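To make the two steps concrete, here is a minimal sketch of the Map and Reduce idea as a word count in plain Python. It is not Hadoop's API: the function names (`map_words`, `shuffle`, `reduce_counts`, `word_count`) and the sample lines are made up for illustration, and the grouping step in the middle is something a real framework would handle for you.

```python
from collections import defaultdict


def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one line of text."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Group emitted values by key (a framework does this between Map and Reduce)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_counts(key, values):
    """Reduce step: summarize all values for one key into a single result."""
    return key, sum(values)


def word_count(lines):
    """Run Map on every line, group the pairs by word, then Reduce each group."""
    pairs = [pair for line in lines for pair in map_words(line)]
    return dict(reduce_counts(key, values) for key, values in shuffle(pairs).items())


# Hypothetical input: two lines from a flat text file.
sample_lines = ["big data is here", "Big Data is big"]
counts = word_count(sample_lines)
print(counts)
```

The Map step never needs to see the whole file, only one line at a time, which is what makes it possible to spread the work out.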
'MapReduce' addresses the 'Data-Intensive' and 'Applications' parts of the Hadoop description. So I'm guessing you're asking where the 'Distributed' part comes into play. Here's where Hadoop takes data analytics to a whole new level. Rather than constraining processing to one server, as with a traditional database, MapReduce can be run across multiple servers in parallel. This means you can harness the resources of as many machines as you can provision, without having to consolidate all your data into one container. You end up with multiple servers and multiple files, breaking the work down into smaller chunks and running them in parallel.
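The 'smaller chunks, running in parallel' idea can be sketched on a single machine. In the example below, each chunk of lines stands in for one file on one server, and a thread pool stands in for the cluster. That substitution is an assumption made to keep the example self-contained; a real Hadoop job distributes the work across separate machines. The names `map_chunk`, `reduce_partials`, and `parallel_hashtag_count`, and the sample tweets, are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_chunk(lines):
    """Map step for one chunk: count the hashtags found in its lines."""
    counts = Counter()
    for line in lines:
        for token in line.split():
            if token.startswith("#"):
                counts[token.lower()] += 1
    return counts


def reduce_partials(partials):
    """Reduce step: merge the per-chunk counts into one consolidated result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def parallel_hashtag_count(chunks, workers=4):
    """Map every chunk concurrently, then reduce the partial results."""
    # Threads stand in for separate servers here (an assumption for illustration).
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(map_chunk, chunks))
    return reduce_partials(partials)


# Hypothetical data: two "files" of tweets that never get merged into one container.
chunks = [
    ["new single out #music", "#Music news today"],
    ["reading about #bigdata", "#music and #BigData"],
]
totals = parallel_hashtag_count(chunks)
print(totals.most_common())
```

Each chunk is counted on its own, and only the small per-chunk summaries are brought together at the end, so adding more chunks means adding more workers rather than a bigger single server.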
Think about the possibilities this presents...
As an example, using Amazon Web Services, The New York Times used 100 Amazon EC2 instances and a Hadoop application to process 4 TB of raw image TIFF data (stored in S3) into 11 million finished PDFs in the space of 24 hours at a computation cost of about $240 (not including bandwidth). (Source: wikipedia.org)
I could go into greater detail about the different use cases and possibilities of Hadoop, but I think that is too much for one blog entry. My goal was to give a brief overview of why Big Data is critical to our world today and a starting point for addressing the data predicament.
Big Data is here, and we need to keep up or be left behind...
