Friday, September 17, 2010

Hadoop - An OpenSource System Modelled after Google File System


We can't start talking about Hadoop immediately unless we know a few buzzwords that have been around for a while... and they are Parallel Processing / Computing, Concurrent Computing, Distributed Processing / Computing, the Google File System, MapReduce, BigTable, etc...
If we have to discuss everything, one post isn't enough...

So I've decided to write a series of posts which will start from the basics and finally cover Hadoop / HBase / HyperTable with installation instructions and samples...

I always get stuck when I have a mammoth task... let's break it into small pieces, shall we?
And I'm too lazy to write much, so here is a quick run-through of the buzzwords... (Wikipedia is always there for reference)
Parallel Processing / Computing

Good old math!!! If one person takes 10 days to do a job, how long will it take if 2 persons of the same caliber are working?

Parallel processing is the simultaneous processing of parts of the same task on two or more processors in order to obtain faster results. The computing resources can include a single computer with multiple processors, a number of computers connected by a network, or a combination of both. On a single machine the processors typically access data through shared memory. Some supercomputer parallel processing systems have hundreds of thousands of microprocessors.

With the help of parallel processing, a number of computations can be performed at once, bringing down the time required to complete a project. Parallel processing is particularly useful in projects that require complex computations, such as weather modeling and digital special effects.

Eventually, parallelism is the future of computing (or rather, should I say it has already started?). A successful implementation of parallel computing involves two things, apart from having a strong distributed processing system:
  • Tasks should be structured in such a manner that they can be executed at the same time
  • The sequence of tasks which must be executed one after the other should be maintained
Check out the Parallel Computing article on Wikipedia for more information...
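As a rough sketch of the idea (using Python's standard multiprocessing module; the task and the numbers here are purely illustrative), splitting a job across worker processes looks like this:

```python
from multiprocessing import Pool

def square(n):
    # the independent "task" each worker performs
    return n * n

if __name__ == "__main__":
    numbers = list(range(10))
    with Pool(processes=4) as pool:
        # the ten tasks are spread across 4 worker processes
        # instead of running one after another
        results = pool.map(square, numbers)
    print(results)  # [0, 1, 4, 9, 16, 25, 36, 49, 64, 81]
```

Note that this only pays off because each `square` call is independent of the others - exactly the "tasks structured so they can execute at the same time" point above.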

Concurrent Computing

It's again not much different from parallel processing...

It's a collection of processes which can interact with each other and run in parallel!!!

The biggest problem in writing concurrent programs is making sure that the sequencing and the interaction between them happen properly. These concurrent programs can be executed on a single machine with a single core; they'll still perform the same work, just more slowly. Or they can be run on a machine with multiple cores, or on different machines across the network, to get real parallelism.

Many programs or methods or threads, each performing a specific task... think of concurrent computing like an OS kernel, which performs several things simultaneously & seamlessly (even though it is running on a single machine).

Several other things play a vital role, like shared memory / message passing (for communication between concurrent threads / processes / pieces of code).

Surely, application throughput & high responsiveness are the core reasons for opting for concurrent computing. I would strongly suggest you keep the parallel processing / concurrent computing options as the last ones. Try everything you can to make your piece of code meet your performance expectations. Only when you think / know that you can't do it the traditional way... start exploring parallelism.
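When you do reach for concurrency, message passing is often the safer starting point, since it avoids two threads touching the same state at once. Here is a minimal sketch (plain Python threads and a queue; the producer/consumer tasks are made up for illustration):

```python
import threading
import queue

def producer(q, items):
    # sends messages to the consumer instead of sharing mutable state
    for item in items:
        q.put(item)
    q.put(None)  # sentinel: tells the consumer to stop

def consumer(q, results):
    while True:
        item = q.get()
        if item is None:
            break
        results.append(item * 2)  # stand-in for real work

q = queue.Queue()
results = []
t1 = threading.Thread(target=producer, args=(q, [1, 2, 3]))
t2 = threading.Thread(target=consumer, args=(q, results))
t1.start(); t2.start()
t1.join(); t2.join()
print(results)  # [2, 4, 6]
```

The queue handles the "interaction between them happening properly" part: the consumer only sees items in the order the producer sent them.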

Distributed Processing / Computing

Distributed Processing / Computing is again much the same thing; the only difference is that more than one computer, connected through a network, works to achieve a common goal.

A program which runs on multiple computers is known as a parallel program!!! (It took us so long to come to this state.)

Please go through the Wikipedia page on Distributed Computing to get more information on this..

Google File System (GFS)

Google File System (GFS) is designed to meet the rapidly growing data processing demands at Google (without which I don't think it would be possible to crawl the internet, index it, do the keyword / content analysis, the analytics, the searching, knowing who is searching for what, the advertising, and even then get a response like this when you search for "Google File System" on Google):

About 57,000,000 results (0.13 seconds) !!!

In its goals it is not much different from its predecessors (distributed file systems): performance, scalability, reliability, and of course data redundancy to achieve availability.

To get an app processing around 800 to 1000 requests per second, we are banging our heads against the wall (at least I am)... think about database design, caching, indexing, memory leaks, architectural constraints, horizontal / vertical scaling, code optimizations, server tweaking, phew... How on earth do they get this kind of load processed, and it still works? (At least, if we know how to search on Google, it gives us what we want :) )

They had (yes, they had) thousands of storage systems, and connectivity between those systems, and what not, to connect to and retrieve data from them. The problem happens when one of the machines is down and can't self-recover!!!

Each system should be capable enough to store, retrieve, and process multi-GB to multi-TB of data! Data was written into files and stored; updating records and random writes were a few of the problems which led to the design of GFS. And of course, the systems that were going to be written on top should be scalable enough in terms of processing, I/O, etc.

Enough of the story, let's have a look at the architecture.

Here is how they do it... Google File System, BigTable, MapReduce.

I hope you already have an idea about master-slave architectures, in-memory data structures, metadata, data replication, GC (garbage collection), load balancing and fault tolerance.

GFS has a single master and several chunkservers, which are accessed by several clients. Generally these machines / servers are inexpensive commodity hardware... based on the capacity of the system, a chunkserver and a client can run on the same machine.

Extract from gfs-sosp2003.pdf: The master takes care of assigning an immutable and globally unique 64 bit chunk handle at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, GFS stores three replicas, though users can designate different replication levels for different regions of the file namespace.

The master will maintain all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunk server in HeartBeat messages to give it instructions and collect its state.

GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers.

Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux’s buffer cache already keeps frequently accessed data in memory.

Image Credit: gfs-sosp2003.pdf
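To make the master's bookkeeping concrete, here is a toy sketch in Python. Everything here (the class name, the fields, the way handles are generated) is illustrative, not real GFS code; it only mirrors the metadata described above: the file-to-chunk mapping, the unique chunk handles, and the replica locations.

```python
import uuid

class ToyMaster:
    """Toy model of the metadata a GFS-style master keeps in memory."""

    def __init__(self, replication=3):
        self.replication = replication    # GFS default: three replicas
        self.file_to_chunks = {}          # namespace: file path -> ordered chunk handles
        self.chunk_locations = {}         # chunk handle -> chunkservers holding a replica

    def create_chunk(self, path, chunkservers):
        # an immutable, globally unique 64-bit-style chunk handle
        handle = uuid.uuid4().int & (2**64 - 1)
        self.file_to_chunks.setdefault(path, []).append(handle)
        # pick servers to hold the default number of replicas
        self.chunk_locations[handle] = chunkservers[: self.replication]
        return handle

master = ToyMaster()
servers = ["cs1", "cs2", "cs3", "cs4"]
h = master.create_chunk("/logs/part-0", servers)
print(master.chunk_locations[h])  # ['cs1', 'cs2', 'cs3']
```

Note how the master never touches file data itself: it only knows which handles belong to which file, and which chunkservers hold each handle, which is exactly why clients can fetch the actual bytes directly from the chunkservers.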

Some of the challenges GFS addresses are concurrent appends to files, reading large sets of streaming data, random reads, and processing / storing huge data in a sequential manner!!

An in-memory data structure (the master keeps all metadata in memory) was required to achieve fast garbage collection and, of course, load balancing. The master takes care of periodically scanning the state of each chunkserver, and of garbage collection and load balancing.

And the large chunk size (64 MB by default) reduces the amount of metadata required to be stored.

The activities of the master include namespace management & locking, replica management & placement, rebalancing, garbage collection, and stale replica detection.

Interestingly enough, GFS promises a few things...

  • Fast recovery - but how? Using checkpoints.
  • Chunk replication - to achieve high availability
  • Master replication - for reliability

Enough about GFS for now... but wait, do you need more info? Please go through gfs-sosp2003.pdf

Big Table

BigTable is a distributed storage system for managing structured data, and it is meant to scale to a very large size (think petabytes of data across several thousands of commodity servers). It is a storage system which did not follow the principles of traditional DBMS concepts: a table can have millions of columns (and a different number of columns for different records).

BigTable uses the GFS distributed file system to store its data files. BigTable is designed keeping in mind that even hundreds of thousands of servers might not be sufficient, and that it should be easy to scale horizontally by just adding more servers when required.

Many Google products (as per the bigtable-osdi06.pdf paper, more than sixty products & projects), including Analytics, Indexing, and Earth, use BigTable. Having said that, these products require different data sizes, latencies, performance levels, etc. And BigTable offers them all (after all, it is meant for that).

BigTable is a sparse, distributed, persistent multi-dimensional sorted map. It has a row-key, column-key and a timestamp.

Trust me, it is very hard to think in terms of a column-oriented database after being used to traditional DBMS systems. You need to understand row keys, column families, and timestamps, flush out your DBMS habits, and start designing for the column-oriented database.

(row:string, column:string, time:int64) → string

Figure 3. Image Credit: bigtable-osdi06.pdf
Rows in a BigTable are uniquely identified by the row key. BigTable maintains data in lexicographic order by row key. The row range is dynamically partitioned, and each range is called a tablet (the unit used for load balancing). Tables are split into multiple tablets - segments of the table. Each tablet will be around 200MB approximately. When a tablet grows beyond the specified limit, it is split.

There is a three-level hierarchy, analogous to that of a B+ tree, to store the tablet location information itself!!!

The first level is a file stored in Chubby which contains the location of the root tablet. The root tablet contains the location of all tablets of a special METADATA table, and each METADATA tablet in turn contains the locations of a set of user tablets. (Though the root tablet is just the first tablet in the METADATA table, it is treated specially: it is never split.)

Column keys are grouped into sets known as column families, and a column family forms the basic unit of access control. A column family has to be created before data can be stored into the family.
A column key is named like ColumnFamily:Qualifier.

Each column key in the anchor family (as shown in Figure 3) represents a single anchor. The qualifier is the name of the referring site; the cell contents is the link text.

Timestamps are used to version the data in a BigTable table. Each cell in BigTable can contain multiple versions of the data, distinguished by timestamp.
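The (row, column, timestamp) → string map above can be mocked with a plain dictionary. This is only a toy sketch to make the data model concrete (the rows and values echo the webtable example from the paper; the function names are made up):

```python
# Toy model of BigTable's data model: a sparse map from
# (row, "family:qualifier", timestamp) to a string value.
table = {}

def put(row, column, timestamp, value):
    table[(row, column, timestamp)] = value

def get_latest(row, column):
    # a cell holds multiple versions; return the newest timestamp's value
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

# rows modeled on the webtable example from bigtable-osdi06.pdf
put("com.cnn.www", "anchor:cnnsi.com", 9, "CNN")
put("com.cnn.www", "contents:", 5, "<html>...v5")
put("com.cnn.www", "contents:", 6, "<html>...v6")
print(get_latest("com.cnn.www", "contents:"))  # <html>...v6
```

The dictionary is sparse by construction: a row only pays for the columns it actually stores, which is why different records can have different numbers of columns.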

BigTable also provides an API to create / delete tables and column families, and to write, delete, and look up values in the table.

For more information on how BigTable is implemented or how it works, refer to bigtable-osdi06.pdf

Map Reduce

MapReduce is a programming model for processing and generating large datasets in a distributed processing environment.

Underlying systems like GFS / Hadoop are capable of making these kinds of programs (MapReduce) run in parallel mode on a large number of computers (known as clusters). GFS / Hadoop takes care of splitting the input data, creating as many mapper and reducer tasks as required (configuring the number of mappers & reducers is also possible), scheduling the program executions, handling machine failures, and inter-machine communication. The terms Map & Reduce came from the primitives present in Lisp (and other functional languages).

The input to a MapReduce program is a set of key/value pairs, and it produces a set of output key/value pairs.

Map: a piece of program written by the user which takes an input pair and produces intermediate key/value pairs as output. The library takes care of splitting the huge data file into smaller chunks, sending them to different machines, receiving the intermediate outputs, and combining them with other intermediate outputs of the same key. The master node takes care of sending work to the worker nodes (until all the data is processed).

Reduce: another piece of program written by the user, which takes an intermediate key and the set of values for that key and merges them together to form a smaller set of values. Aggregating all these output values (the combined ones) forms the final output of the problem for the given data.

The advantage of MapReduce is that it allows distributed processing of the mappers and reducers (of course, as long as there are no dependencies between the mappers, they can run in parallel). The MapReduce library (provided by GFS / Hadoop) takes care of the situation when a processing node (map or reduce) fails: it automatically reschedules the map / reduce task.

Some sample pseudo-code for a map-reduce job.

Let's count the number of occurrences of each word in a large collection of documents.
The map function emits each word and a number (which is the count).
The reducer simply receives a word and adds the counts it received from the maps to the total count for that word.
In the case of the map output, the key is the word and the count is the value.

The reducer simply gets the key and its values and keeps adding the values associated with the key.
map (k1, v1) -> list (k2, v2)
reduce(k2, list(v2)) -> list (v2)
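To make the pseudo-code concrete, here is a runnable single-machine sketch in Python. The function names are mine, and the dict-based grouping stands in for the shuffle/sort that the MR library would do across machines:

```python
from collections import defaultdict

def map_fn(document):
    # map (k1, v1) -> list (k2, v2): emit (word, 1) for every word
    return [(word, 1) for word in document.split()]

def reduce_fn(word, counts):
    # reduce (k2, list(v2)) -> v2: merge all counts for one word
    return sum(counts)

documents = ["the quick brown fox", "the lazy dog", "the fox"]

# "shuffle": group intermediate pairs by key, as the MR library would
intermediate = defaultdict(list)
for doc in documents:
    for word, count in map_fn(doc):
        intermediate[word].append(count)

word_counts = {word: reduce_fn(word, counts)
               for word, counts in intermediate.items()}
print(word_counts["the"])  # 3
```

In a real cluster the documents would be split across many map workers and each reducer would only see a subset of the keys; the grouping step here collapses all of that onto one machine.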

The figure above shows the execution flow of a typical MapReduce program (in GFS terms).

Step 1: The MR library splits the huge data file into multiple chunks of 16 to 64 MB of data, and starts as many copies of the program as required on the cluster.
Step 2: One of the copies acts as the master; the rest are workers. The master has the information about which machines are idle and which tasks are being processed.
Step 3: Each map worker reads the input content which is allocated to it.
Step 4: The map output is stored on the worker's local disk.
Step 5: The master notifies the reducers that the output of a map is available.
Step 6: Each reducer iterates over the sorted intermediate data and produces an output which is appended to the final output file.
Step 7: The master wakes up the user program when all the map/reduce tasks are completed.

The Master node:

The master node keeps track of each mapper & reducer, the tasks assigned to them, the intermediate outputs, the state of the tasks assigned to each map/reduce instance, and the status of the worker machines.

The good thing about these MR libraries is that even if the master task dies, it has created several checkpoints for a given job... next time, it won't start afresh; instead, it starts from the last checkpoint.

Phew... we've seen enough to lay the groundwork for Hadoop. But we are gonna wait for some time for the 2nd article of the Hadoop series.

Until then, if you need more information... do check out the following links or post a comment.