Thursday, February 13, 2014

Hadoop Vs Cassandra and HBase

Vanilla Hadoop consists of a distributed file system (HDFS) at its core and libraries that support the MapReduce model for writing analysis programs. HDFS is what makes Hadoop scalable: it splits data into chunks and spreads them across the nodes of a multi-node cluster so that MapReduce tasks can work on the chunks in parallel, each task processing the data available on its node.
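To make the MapReduce model concrete, here is the classic word-count job written against the Hadoop MapReduce Java API. The input and output paths come from the command line, and a Hadoop installation with its client libraries on the classpath is assumed.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: runs on each chunk of input in parallel, emits (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Reduce phase: sums the counts for each word across all mappers.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory in HDFS
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory in HDFS
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```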

Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. It is not a conventional relational database; it behaves more like a Hashtable or HashMap that stores key/value pairs. Both Cassandra and HBase take their storage design from Google's BigTable, which is described in the original Google BigTable paper.
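As a rough illustration of that key/value style of access, here is a minimal sketch using the DataStax Java driver. The contact point, the keyspace `demo`, and the table `kv` are hypothetical, and a single-node cluster with SimpleStrategy replication is assumed.

```java
import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.ResultSet;
import com.datastax.driver.core.Row;
import com.datastax.driver.core.Session;

public class CassandraKeyValueDemo {
    public static void main(String[] args) {
        // Connect to a (hypothetical) local single-node cluster.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS demo WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
        session.execute("CREATE TABLE IF NOT EXISTS demo.kv (k text PRIMARY KEY, v text)");

        // "put": store a value under a key.
        session.execute("INSERT INTO demo.kv (k, v) VALUES ('user:42', 'alice')");

        // "get": read the value back by its key.
        ResultSet rs = session.execute("SELECT v FROM demo.kv WHERE k = 'user:42'");
        Row row = rs.one();
        System.out.println(row == null ? "miss" : row.getString("v"));

        cluster.close();
    }
}
```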

BigTable stores key/value pairs in Sorted String Tables (SSTables). An SSTable is simply a file on the distributed file system (HDFS, in HBase's case) that holds keys followed by their values, sorted by key. Alongside each SSTable there is an index that maps every key to its offset in the file, so reading the value for a key only requires an index lookup and a single seek to that offset. An SSTable is immutable: once the file is written, its existing key/value pairs are never modified. Instead, new writes, updates, and deletes are recorded in newer SSTables: an update is written as a newer value for the same key, and a delete is written as the key with a special tombstone value. The same key can therefore appear in several SSTable files; a read takes the newest entry for the key (or treats the key as deleted if that entry is a tombstone), and a background compaction process periodically merges SSTables and discards the obsolete versions.
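The following toy, in-memory model (not Cassandra's or HBase's actual code) sketches the idea: each segment is an immutable blob of key/value records plus an index of key-to-offset, a delete is stored as a tombstone entry, and a read consults the newest segment that contains the key.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// A toy model of SSTable segments. Each segment is an immutable "file" of
// key<TAB>value records plus an index of key -> offset into that file.
// Deletes are tombstone values; reads check the newest segment first.
public class ToySSTable {
    private static final String TOMBSTONE = "\u0000TOMBSTONE\u0000";

    private final String data;                      // immutable file contents
    private final TreeMap<String, Integer> index;   // key -> offset of its record

    private ToySSTable(String data, TreeMap<String, Integer> index) {
        this.data = data;
        this.index = index;
    }

    // Flush a batch of writes (value == null means "delete") into a new segment.
    public static ToySSTable flush(TreeMap<String, String> writes) {
        StringBuilder file = new StringBuilder();
        TreeMap<String, Integer> index = new TreeMap<>();
        for (Map.Entry<String, String> e : writes.entrySet()) {
            index.put(e.getKey(), file.length());   // remember where this record starts
            String value = e.getValue() == null ? TOMBSTONE : e.getValue();
            file.append(e.getKey()).append('\t').append(value).append('\n');
        }
        return new ToySSTable(file.toString(), index);
    }

    // Look up a key across segments, newest first; a tombstone hides older values.
    public static String read(String key, List<ToySSTable> newestFirst) {
        for (ToySSTable segment : newestFirst) {
            Integer offset = segment.index.get(key);
            if (offset == null) continue;            // not in this segment, try an older one
            int tab = segment.data.indexOf('\t', offset);
            int end = segment.data.indexOf('\n', tab);
            String value = segment.data.substring(tab + 1, end);
            return TOMBSTONE.equals(value) ? null : value;
        }
        return null;                                 // key was never written
    }

    public static void main(String[] args) {
        TreeMap<String, String> first = new TreeMap<>();
        first.put("alice", "1");
        first.put("bob", "2");
        TreeMap<String, String> second = new TreeMap<>();
        second.put("alice", "10");                   // update: newer value for the same key
        second.put("bob", null);                     // delete: becomes a tombstone
        List<ToySSTable> segments = List.of(flush(second), flush(first)); // newest first
        System.out.println(read("alice", segments)); // 10
        System.out.println(read("bob", segments));   // null (deleted)
    }
}
```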

Thus you can see that Cassandra's and HBase's internals allow fast reads and writes, which is crucial for handling real-time data, whereas vanilla Hadoop with MapReduce is suited to batch processing of passive data.

When to use HBase and when to use Hive

MapReduce is just a computing framework; HBase has nothing to do with it. That said, you can efficiently put data into or fetch data from HBase by writing MapReduce jobs. Alternatively, you can write sequential programs against the HBase client APIs, such as the Java API, to put or fetch the data. But we use Hadoop, HBase, and similar tools to deal with gigantic amounts of data, and a plain sequential program becomes highly inefficient once the data grows that large.
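For small-scale access, a sequential program is still handy. Below is a minimal sketch using the HBase 0.x-era Java client API; the table name `users` and column family `info` are hypothetical, and the table is assumed to exist already.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetDemo {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath to locate the cluster.
        Configuration conf = HBaseConfiguration.create();
        // Hypothetical table 'users' with column family 'info'.
        HTable table = new HTable(conf, "users");

        // Put: write one cell (row key -> column family:qualifier -> value).
        Put put = new Put(Bytes.toBytes("row1"));
        put.add(Bytes.toBytes("info"), Bytes.toBytes("name"), Bytes.toBytes("alice"));
        table.put(put);

        // Get: random read of a single row by its key.
        Get get = new Get(Bytes.toBytes("row1"));
        Result result = table.get(get);
        byte[] name = result.getValue(Bytes.toBytes("info"), Bytes.toBytes("name"));
        System.out.println(Bytes.toString(name));

        table.close();
    }
}
```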

Coming back to the first part of the question, Hadoop is basically two things: a distributed file system (HDFS) plus a computation or processing framework (MapReduce). Like any other file system, HDFS provides storage, but in a fault-tolerant manner with high throughput and a low risk of data loss (because of replication). Being a file system, however, HDFS lacks random read and write access. This is where HBase comes into the picture. It is a distributed, scalable, big data store modelled after Google's BigTable, and it stores data as key/value pairs.
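A minimal sketch of that file-system-style access, using the Hadoop FileSystem API: files are written sequentially and cannot be updated in place, which is exactly the gap HBase fills. The path `/demo/events.log` is hypothetical, and the configuration is assumed to point at a reachable cluster (or the local file system in tests).

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteRead {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();      // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        Path path = new Path("/demo/events.log");      // hypothetical path

        // HDFS files are written sequentially (write-once); there is no way to
        // seek back and update bytes in the middle of an existing file.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.writeUTF("event-1");
            out.writeUTF("event-2");
        }

        // Reads can stream (or seek) through the file, but not modify it in place.
        try (FSDataInputStream in = fs.open(path)) {
            System.out.println(in.readUTF());
            System.out.println(in.readUTF());
        }
        fs.close();
    }
}
```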

Coming to Hive: it provides data-warehousing facilities on top of an existing Hadoop cluster, along with an SQL-like interface (HiveQL) that makes your work easier if you are coming from an SQL background. You can create tables in Hive and store data there, and you can even map your existing HBase tables to Hive and operate on them.
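Here is a minimal sketch of both ideas through the Hive JDBC driver against HiveServer2. The host/port, database, table names, and column names are hypothetical, and the HBase mapping assumes the Hive HBase storage handler is installed and that the HBase table `users` with column family `info` exists.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        // Hypothetical HiveServer2 endpoint and the default database.
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "", "");
             Statement stmt = conn.createStatement()) {

            // A plain Hive table, backed by files in HDFS.
            stmt.execute("CREATE TABLE IF NOT EXISTS page_views (user_id STRING, url STRING) "
                       + "STORED AS TEXTFILE");

            // Map an existing HBase table ('users', column family 'info') into Hive,
            // so it can be queried with HiveQL alongside regular Hive tables.
            stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS hbase_users (rowkey STRING, name STRING) "
                       + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
                       + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,info:name') "
                       + "TBLPROPERTIES ('hbase.table.name' = 'users')");

            try (ResultSet rs = stmt.executeQuery("SELECT rowkey, name FROM hbase_users LIMIT 10")) {
                while (rs.next()) {
                    System.out.println(rs.getString(1) + " -> " + rs.getString(2));
                }
            }
        }
    }
}
```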

Consider that you work with an RDBMS and have to choose between full table scans and index access, but may pick only one of them.
If you would choose full table scans, use Hive; if index access, use HBase.
