Search This Blog

Monday, February 17, 2014

Concepts about usage of HBASE

NoSQL?
HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a "Data Store" than "Data Base" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc.
However, HBase has many features which supports both linear and modular scaling. HBase clusters expand by adding RegionServers that are hosted on commodity class servers. If a cluster expands from 10 to 20 RegionServers, for example, it doubles both in terms of storage and as well as processing capacity. RDBMS can scale well, but only up to a point - specifically, the size of a single database server - and for the best performance requires specialized hardware and storage devices. HBase features of note are:
    •    Strongly consistent reads/writes: HBase is not an "eventually consistent" DataStore. This makes it very suitable for tasks such as high-speed counter aggregation.
    •    Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
    •    Automatic RegionServer failover
    •    Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system.
    •    MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink.
    •    Java Client API: HBase supports an easy to use Java API for programmatic access.
    •    Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.
    •    Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high volume query optimization.
    •    Operational Management: HBase provides build-in web-pages for operational insight as well as JMX metrics.

When Should I Use HBase?
HBase isn't suitable for every problem.
First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand/million rows, then using a traditional RDBMS might be a better choice due to the fact that all of your data might wind up on a single node (or two) and the rest of the cluster may be sitting idle.
Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.) An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign as opposed to a port.
Third, make sure you have enough hardware. Even HDFS doesn't do well with anything less than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
HBase can run quite well stand-alone on a laptop - but this should be considered a development configuration only.

What Is The Difference Between HBase and Hadoop/HDFS?
HDFS is a distributed file system that is well suited for the storage of large files. It's documentation states that it is not, however, a general purpose file system, and does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups. See the Chapter 5, Data Model and the rest of this chapter for more information on how HBase achieves its goals.

Thursday, February 13, 2014

Some Important Definitions using in Hadoop

Distributed Processing :
Distributed processing is a phrase used to refer to a variety ofcomputer systems that use more than one computer (or processor) to run an application. This includes parallel processing in which a single computer uses more than one CPU to execute programs.
More often, however, distributed processing refers to local-area networks (LANs) designed so that a single program can run simultaneously at various sites. Most distributed processingsystems contain sophisticated software that detects idle CPUs on the network and parcels out programs to utilize them.
Another form of distributed processing involves distributed databases. This is databases in which the data is stored across two or more computer systems. The database system keeps track of where the data is so that the distributed nature of the database is not apparent to users.

Parallel Processing :
The simultaneous use of more than one CPU to execute aprogram. Ideally, parallel processing makes a program run faster because there are more engines (CPUs) running it. In practice, it is often difficult to divide a program in such a way that separate CPUs can execute different portions without interfering with each other.
Most computers have just one CPU, but some models have several. There are even computers with thousands of CPUs. With single-CPU computers, it is possible to perform parallel processing by connecting the computers in a network. However, this type of parallel processing requires very sophisticated software calleddistributed processing software.
Note that parallel processing differs from multitasking, in which a single CPU executes several programs at once.
Parallel processing is also called parallel computing.

MPP :
Short for massively parallel processing, a type of computing that uses many separate CPUs running in parallel to execute a singleprogram. MPP is similar to symmetric processing (SMP), with the main difference being that in SMP systems all the CPUs share the same memory, whereas in MPP systems, each CPU has its own memory. MPP systems are therefore more difficult to program because the application must be divided in such a way that all the executing segments can communicate with each other. On the other hand, MPP don't suffer from the bottleneck problems inherent in SMP systems when all the CPUs attempt to access the same memory at once.

Scalable Parallel Processor :
Abbreviated as SPP, a computer that utilizes parallel processingthat can be upgraded by adding more CPUs to it, effectively increasing its computing power.
Basically, scalability is determined by the ability to add to (or subtract from) an environment without having any adverse (mainly performance based) problems.

Scalability :
Basically, scalability is determined by the ability to add to (or subtract from) an environment without having any adverse (mainly performance based) problems.

One more definition of Distributed Processing :
The distribution of applications and business logic across multiple processing platforms.  Distributed processing implies that processing will occur on more than one processor in order for a transaction to be completed. In other words, processing is distributed across two or more machines and the processes are most likely not running at the same time, i.e. each process performs part of an application in a sequence. Often the data used in a distributed processing environment is also distributed across platforms.


My Profile

My photo
can be reached at 09916017317