Thursday, February 13, 2014

When to use HBase and when to use Hive

MapReduce is just a computing framework, and HBase does not depend on it. That said, you can efficiently put data into or fetch data from HBase by writing MapReduce jobs. Alternatively, you can write sequential programs using the other HBase APIs, such as the Java API, to put or fetch the data. But we use Hadoop, HBase, and similar tools to deal with gigantic amounts of data, so that rarely makes sense: normal sequential programs are highly inefficient when the data is very large.

Coming back to the first part of your question, Hadoop is basically two things: a distributed file system (HDFS) and a computation or processing framework (MapReduce). Like any other file system, HDFS provides storage, but in a fault-tolerant manner with high throughput and a lower risk of data loss (because of replication). Being a file system, however, HDFS lacks random read and write access. This is where HBase comes into the picture. It is a distributed, scalable big data store modelled after Google's BigTable, and it stores data as key/value pairs.
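To make "random read and write access" over key/value pairs concrete, here is a minimal in-memory sketch. It is only a toy model and not the HBase API: the class name `ToyKeyValueStore`, its methods, and the sample row keys and columns are all made up for illustration. It assumes row keys are kept in sorted order, so single-row lookups and key-range scans are both cheap.

```python
import bisect


class ToyKeyValueStore:
    """In-memory stand-in for a sorted key/value table (hypothetical, not the HBase API)."""

    def __init__(self):
        self._keys = []   # row keys kept in sorted order
        self._rows = {}   # row key -> {column: value}

    def put(self, row_key, column, value):
        # Random write: update one cell without rewriting any other row.
        if row_key not in self._rows:
            bisect.insort(self._keys, row_key)
            self._rows[row_key] = {}
        self._rows[row_key][column] = value

    def get(self, row_key):
        # Random read: fetch one row directly by its key.
        return dict(self._rows.get(row_key, {}))

    def scan(self, start_key, stop_key):
        # Range read over the sorted keys: start inclusive, stop exclusive.
        lo = bisect.bisect_left(self._keys, start_key)
        hi = bisect.bisect_left(self._keys, stop_key)
        return [(key, dict(self._rows[key])) for key in self._keys[lo:hi]]


# Made-up sample data for illustration only.
store = ToyKeyValueStore()
store.put("user#002", "info:name", "Bob")
store.put("user#001", "info:name", "Alice")
store.put("user#001", "info:city", "Pune")
store.put("video#001", "info:title", "Intro")

print(store.get("user#001"))
print([key for key, _ in store.scan("user#", "user#~")])
```

A plain file in a file system offers neither operation directly: to change or find one record you would have to read through the file, which is the gap the paragraph above describes.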

Coming to Hive: it provides data warehousing facilities on top of an existing Hadoop cluster. It also provides an SQL-like interface, which makes your work easier if you come from an SQL background. You can create tables in Hive and store data there, and you can even map your existing HBase tables to Hive and operate on them.

Suppose you work with an RDBMS and have to choose between full table scans and index access, and you can pick only one of them.
If you choose full table scans, use Hive. If you choose index access, use HBase.
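The two access patterns can be sketched in a few lines of plain Python. This is only an analogy, with made-up sample rows and hypothetical function names; it uses neither Hive nor HBase. The first function touches every row to compute an aggregate (the scan-style workload), and the second fetches a single row by key through a prebuilt lookup (the index-style workload).

```python
# Hypothetical sample rows: (user_id, country, purchase_amount)
rows = [
    (101, "IN", 250),
    (102, "US", 400),
    (103, "IN", 150),
    (104, "DE", 300),
]


def total_by_country(table):
    # Full table scan: every row is read once to build the aggregate.
    totals = {}
    for _, country, amount in table:
        totals[country] = totals.get(country, 0) + amount
    return totals


# Index access: build a key -> row lookup once, then fetch single rows by key.
index_by_user = {row[0]: row for row in rows}


def lookup_user(user_id):
    # Returns the matching row, or None if the key is absent.
    return index_by_user.get(user_id)


print(total_by_country(rows))
print(lookup_user(103))
```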

Wednesday, February 12, 2014

Content Delivery Network (CDN)

A content delivery network (CDN) is an interconnected system of computers on the Internet that provides Web content rapidly to numerous users by duplicating the content on multiple servers and directing the content to users based on proximity. CDNs are used by Internet service providers (ISPs) to deliver static or dynamic Web pages, but the technology is especially well suited to streaming audio, video, and Internet television (IPTV) programming.

In a CDN, content exists in multiple copies on strategically dispersed servers. This is known as content replication. A large CDN can have thousands of servers, making it possible to provide identical content to many users efficiently and reliably even at times of maximum Internet traffic or during sudden demand "spikes." When a specific page, file, or program is requested by a user, the server closest to that user (in terms of the minimum number of nodes between the server and the user) is dynamically determined. This optimizes the speed with which the content is delivered to that user.
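The "closest server" idea above, measured by the number of nodes between user and server, can be sketched as a breadth-first search over a network graph. This is a simplified illustration under stated assumptions: the topology, the node names, and the functions `hop_counts` and `closest_server` are all hypothetical, and a real CDN may well weigh other signals besides hop count.

```python
from collections import deque


def hop_counts(graph, source):
    """Breadth-first search: hops from source to every reachable node."""
    hops = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops


def closest_server(graph, user, servers):
    # Pick the reachable server with the fewest hops; ties break by name.
    hops = hop_counts(graph, user)
    reachable = [server for server in servers if server in hops]
    if not reachable:
        return None
    return min(reachable, key=lambda server: (hops[server], server))


# Made-up topology: "edge-a" and "edge-b" hold replicas of the same content.
network = {
    "user": ["r1"],
    "r1": ["user", "r2", "edge-a"],
    "r2": ["r1", "r3"],
    "r3": ["r2", "edge-b"],
    "edge-a": ["r1"],
    "edge-b": ["r3"],
}

print(closest_server(network, "user", ["edge-a", "edge-b"]))
```

In this sample graph `edge-a` is two hops from the user and `edge-b` is four, so the request would be directed to `edge-a`.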

The use of CDN technology has obvious economic advantages for enterprises that expect, or experience, large numbers of hits on their Web sites from locations all over the world. If dozens or hundreds of other users happen to select the same Web page or content simultaneously, the CDN sends the content to each of them without delay or time-out. Problems with excessive latency, as well as large variations in latency from moment to moment (which can cause annoying "jitter" in streaming audio and video), are minimized. The bandwidth each user "sees" is maximized. The difference is noticed most by users with high-speed Internet connections who often demand streaming content or large files.

Another advantage of CDN technology is content redundancy that provides a fail-safe feature and allows for graceful degradation in the event of damage to, or malfunction of, a part of the Internet. Even during a large-scale attack that disables many servers, content on a CDN will remain available to at least some users. Still another advantage of CDN technology is the fact that it inherently offers enhanced data backup, archiving, and storage capacity. This can benefit individuals and enterprises who rely on online data backup services.

Definition of Node:

In a network, a node is a connection point, either a redistribution point or an end point for data transmissions. In general, a node has programmed or engineered capability to recognize and process or forward transmissions to other nodes.
