Tuesday, February 18, 2014

Why is Apache ZooKeeper used along with Hadoop?

ZooKeeper will help you with coordination between Hadoop nodes.

For example, it makes it easier to:
    •    Manage configuration across nodes. If you have dozens or hundreds of nodes, it becomes hard to keep configuration in sync across nodes and quickly make changes. ZooKeeper helps you quickly push configuration changes.
    •    Implement reliable messaging. With ZooKeeper, you can implement a producer/consumer queue that keeps delivering messages even if some consumers, or one of the ZooKeeper servers, fail.
    •    Implement redundant services. With ZooKeeper, a group of identical nodes (e.g. database servers) can elect a leader/master, and clients can look up that master through ZooKeeper. If the master fails, the remaining nodes elect a new leader and the clients are notified.
    •    Synchronize process execution. With ZooKeeper, multiple nodes can coordinate the start and end of a process or calculation. This ensures that any follow-up processing is done only after all nodes have finished their calculations.
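
The last item above, synchronizing process execution, is usually described as a barrier. The following is a minimal in-memory sketch of that idea, not ZooKeeper client code: `BarrierSim` and its methods are hypothetical names, and a plain Python set stands in for the child nodes that workers would create under a barrier node in a real ensemble.

```python
class BarrierSim:
    """In-memory stand-in for a barrier node (hypothetical, not a ZooKeeper API)."""

    def __init__(self, size):
        self.size = size        # number of workers that must arrive
        self._arrived = set()   # stands in for child nodes under the barrier

    def enter(self, worker):
        """Record that a worker has finished, like creating a child node."""
        self._arrived.add(worker)

    def is_open(self):
        """The barrier opens once every expected worker has arrived."""
        return len(self._arrived) >= self.size


barrier = BarrierSim(size=3)
results = {}

# Two of the three workers finish their calculation.
for worker, value in [("node-1", 10), ("node-2", 20)]:
    results[worker] = value
    barrier.enter(worker)
open_after_two = barrier.is_open()

# The third worker finishes; only now may follow-up processing run.
results["node-3"] = 30
barrier.enter("node-3")
open_after_three = barrier.is_open()

total = sum(results.values()) if open_after_three else None
```

In a real deployment the workers would watch the barrier node instead of polling `is_open()`, so that each of them is told when the last one arrives.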

The interface provided by ZooKeeper is quite low-level. For example, in the configuration management example, the actual processing of the configuration changes must be developed as part of the application. However, ZooKeeper will ensure all clients are notified reliably and the order of configuration messages is maintained.
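
To make the division of labour concrete, here is a small in-memory sketch of the configuration example. It is not the ZooKeeper client API: `ConfigStoreSim`, `on_change` and the `/app/config` path are assumptions made for illustration. It models one behaviour of standard ZooKeeper watches, namely that a watch fires once and must be set again by the application when it re-reads the data.

```python
class ConfigStoreSim:
    """In-memory stand-in for a config node with one-shot watches (hypothetical)."""

    def __init__(self):
        self._data = {}
        self._watches = {}  # path -> callbacks waiting for the next change

    def get(self, path, watch=None):
        """Read a value and optionally register a one-shot watch on it."""
        if watch is not None:
            self._watches.setdefault(path, []).append(watch)
        return self._data.get(path)

    def set(self, path, value):
        """Store a value, then fire and clear the watches registered so far."""
        self._data[path] = value
        for callback in self._watches.pop(path, []):
            callback(path)


store = ConfigStoreSim()
applied = []  # the application's own record of configs it has processed


def on_change(path):
    # The store only notifies; re-reading, re-registering the watch and
    # acting on the new value are the application's job.
    value = store.get(path, watch=on_change)
    applied.append(value)


store.set("/app/config", "timeout=30")
store.get("/app/config", watch=on_change)
store.set("/app/config", "timeout=60")
store.set("/app/config", "timeout=90")
```

After the two later updates, `applied` holds the new values in the order they were written, which is the ordering property the paragraph above refers to.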

Without ZooKeeper, this kind of functionality is often developed from scratch as part of each Hadoop application. However, these are tricky matters to get right, and it is easy to introduce errors in the implementation. ZooKeeper provides a solid foundation that helps build higher-level services. It also performs well in high-load situations, and it was used in several Yahoo! products, including the main crawler.

The purpose of ZooKeeper is cluster management. This fits the general *nix philosophy of using small, specialized components: Hadoop components that need clustering capabilities rely on ZooKeeper for that rather than developing their own.

ZooKeeper is a distributed store that provides the following guarantees:
    •    Sequential Consistency - Updates from a client will be applied in the order that they were sent.
    •    Atomicity - Updates either succeed or fail. No partial results.
    •    Single System Image - A client will see the same view of the service regardless of the server that it connects to.
    •    Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
    •    Timeliness - The client's view of the system is guaranteed to be up-to-date within a certain time bound.
You can use these guarantees to implement the different "recipes" that cluster management requires, such as locks and leader election.
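
As an illustration of one such recipe, the sketch below simulates leader election in memory. It is not ZooKeeper client code, and `ElectionSim` and its methods are hypothetical names. The assumption is the commonly described approach in which each candidate creates an ephemeral sequential node and the candidate with the lowest sequence number leads; here a dictionary and a counter stand in for those nodes.

```python
class ElectionSim:
    """In-memory stand-in for an election node (hypothetical, not a ZooKeeper API)."""

    def __init__(self):
        self._next_seq = 0
        self._members = {}  # sequence number -> member name

    def join(self, name):
        """Register a candidate, like creating an ephemeral sequential node."""
        seq = self._next_seq
        self._next_seq += 1
        self._members[seq] = name
        return seq

    def leave(self, seq):
        """Drop a candidate, like its ephemeral node vanishing when the session ends."""
        self._members.pop(seq, None)

    def leader(self):
        """The candidate holding the lowest sequence number is the leader."""
        if not self._members:
            return None
        return self._members[min(self._members)]


election = ElectionSim()
seq_a = election.join("db-a")
seq_b = election.join("db-b")
seq_c = election.join("db-c")

first_leader = election.leader()

# The current leader fails; the next candidate in sequence order takes over.
election.leave(seq_a)
second_leader = election.leader()
```

The sequence numbers give every candidate the same ordering to agree on, which is why the recipe depends on the guarantees listed above rather than on any election logic inside the candidates themselves.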
