Tuesday, February 18, 2014

Why is Apache ZooKeeper used along with Hadoop?

ZooKeeper will help you with coordination between Hadoop nodes.

For example, it makes it easier to:
    •    Manage configuration across nodes. If you have dozens or hundreds of nodes, it becomes hard to keep configuration in sync across nodes and quickly make changes. ZooKeeper helps you quickly push configuration changes.
    •    Implement reliable messaging. With ZooKeeper, you can easily implement a producer/consumer queue that guarantees delivery, even if some consumers or even one of the ZooKeeper servers fails.
    •    Implement redundant services. With ZooKeeper, a group of identical nodes (e.g. database servers) can elect a leader/master and let ZooKeeper refer all clients to that master server. If the master fails, ZooKeeper will assign a new leader and notify all clients.
    •    Synchronize process execution. With ZooKeeper, multiple nodes can coordinate the start and end of a process or calculation. This ensures that any follow-up processing is done only after all nodes have finished their calculations.
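The last item, synchronizing process execution, is usually built as a barrier: each node registers when it finishes, and nobody proceeds until all have registered. The following is a minimal in-memory sketch of that idea, not ZooKeeper's API. `ToyBarrier` and `run_workers` are hypothetical names, and threads stand in for cluster nodes.

```python
import threading


class ToyBarrier:
    """In-memory stand-in for a barrier node that workers register under."""

    def __init__(self, size):
        self.size = size            # number of workers that must arrive
        self.arrived = set()        # plays the role of the barrier's children
        self.cond = threading.Condition()

    def enter(self, worker_id, timeout=5.0):
        """Register this worker, then block until all workers have arrived."""
        with self.cond:
            self.arrived.add(worker_id)
            self.cond.notify_all()
            return self.cond.wait_for(
                lambda: len(self.arrived) >= self.size, timeout
            )


def run_workers(n):
    """Run n workers; each computes a partial result, then waits at the barrier."""
    barrier = ToyBarrier(n)
    partial = {}
    saw_all_results = []

    def worker(i):
        partial[i] = i * i                  # the worker's own calculation
        passed = barrier.enter(i)           # wait for everyone else
        # Follow-up step: by now every partial result must be present.
        saw_all_results.append(passed and len(partial) == n)

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(partial.values()), all(saw_all_results)
```

In a real deployment the set of arrived workers would live in ZooKeeper rather than in process memory, so that it survives the failure of any single node.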

The interface provided by ZooKeeper is quite low-level. For example, in the configuration management example, the actual processing of the configuration changes must be developed as part of the application. However, ZooKeeper will ensure all clients are notified reliably and the order of configuration messages is maintained.
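To make the division of labour concrete, here is a small in-memory sketch of the notify-then-reread pattern. It assumes the behaviour of ZooKeeper's standard watches, which fire once and must be set again on the next read. `ToyConfigStore` and `Worker` are hypothetical names, not part of any ZooKeeper client library.

```python
class ToyConfigStore:
    """In-memory stand-in for one node that holds a configuration value."""

    def __init__(self, value):
        self.value = value
        self.version = 0
        self._watchers = []

    def get(self, watch=None):
        """Return (value, version); optionally register a one-shot watch."""
        if watch is not None:
            self._watchers.append(watch)
        return self.value, self.version

    def set(self, value):
        """Store a new value and fire every pending watch exactly once."""
        self.value = value
        self.version += 1
        # Swap the list out first so watches re-registered during
        # notification are kept for the next change.
        watchers, self._watchers = self._watchers, []
        for notify in watchers:
            notify()


class Worker:
    """Application side: it decides what 'applying' a config change means."""

    def __init__(self, name, store):
        self.name = name
        self.store = store
        self.applied = []       # history of (version, value) this worker saw
        self.reload()

    def reload(self):
        # Re-read the value and re-arm the watch in the same call.
        value, version = self.store.get(watch=self.reload)
        self.applied.append((version, value))


store = ToyConfigStore("threads=4")
workers = [Worker("node-a", store), Worker("node-b", store)]
store.set("threads=8")
store.set("threads=16")
```

The store only delivers the notification and preserves the order of changes; what a worker does inside `reload` is application code, which is the point the paragraph above makes.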

Without ZooKeeper, this kind of functionality is often developed ad hoc as part of each Hadoop application. However, these are tricky matters to get right, and it is easy to introduce errors in the implementation. ZooKeeper provides a solid foundation on which to build higher-level services. It also performs well in high-load situations, and it was used in several Yahoo! products, including the main crawler.

The purpose of ZooKeeper is cluster management. This fits the general *nix philosophy of using small, specialized components: Hadoop components that need clustering capabilities rely on ZooKeeper for them rather than developing their own.

ZooKeeper is a distributed store that provides the following guarantees:
    •    Sequential Consistency - Updates from a client will be applied in the order that they were sent.
    •    Atomicity - Updates either succeed or fail. No partial results.
    •    Single System Image - A client will see the same view of the service regardless of the server that it connects to.
    •    Reliability - Once an update has been applied, it will persist from that time forward until a client overwrites the update.
    •    Timeliness - The client's view of the system is guaranteed to be up-to-date within a certain time bound.
You can use these guarantees to implement the different "recipes" that cluster management requires, such as locks and leader election.
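As an illustration of one such recipe, leader election is commonly built on ephemeral sequential nodes: each candidate creates a node that receives an increasing sequence number, the lowest number leads, and a node disappears when its owner's session ends. The sketch below models only that rule in memory. `ToyElection` and the server names are hypothetical, and no real ZooKeeper call is made.

```python
class ToyElection:
    """In-memory model of election by lowest sequence number."""

    def __init__(self):
        self._next_seq = 0
        self._members = {}      # sequence number -> member name

    def join(self, name):
        """Register a candidate; returns the sequence number it was given."""
        seq = self._next_seq
        self._next_seq += 1     # numbers only grow and are never reused
        self._members[seq] = name
        return seq

    def leave(self, seq):
        """Model a member failing: its ephemeral entry disappears."""
        self._members.pop(seq, None)

    def leader(self):
        """The member holding the lowest live sequence number leads."""
        if not self._members:
            return None
        return self._members[min(self._members)]


election = ToyElection()
first = election.join("db1")
election.join("db2")
election.join("db3")
leader_before = election.leader()   # db1 joined first
election.leave(first)               # db1 fails
leader_after = election.leader()    # next-lowest number takes over
```

Because sequence numbers are never reused, a failed member that rejoins goes to the back of the line instead of reclaiming leadership.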

Monday, February 17, 2014

Oozie: Workflow Engine for Hadoop

Outline
1. What is Oozie
2. Do you need Oozie
3. How to use Oozie
4. Use case sharing

What Is Oozie?
- Originally designed at Yahoo!
- Entered the Apache Incubator in 2011
- A web service that launches your jobs based on:
 - Time dependency
 - Data dependency
- Ability to rerun from last point of failure
- Monitoring

Do You Need Oozie?
Q1: Do you have multiple jobs with dependencies?
Q2: Do you need to run jobs regularly?
Q3: Do you need to check data availability?
Q4: Do you need monitoring and operational support?

If any of your answers is YES,
then you should consider Oozie!

How To Use Oozie
1. Deploy your workflow to HDFS. This includes:
 - Oozie job definitions (workflow.xml)
 - your code: MR/Pig/streaming/Java etc.
 - libraries (.so & .jar)

2. Submit your job
 $ oozie job -run -config job.properties
 Workflow ID: 0123-123456-oozie-wrkf-W

3. Check job status
 $ oozie job -info 0123-123456-oozie-wrkf-W
 $ oozie job -log 0123-123456-oozie-wrkf-W

(Submit a coordinator the same way.)
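Step 2 passes a job.properties file that is not shown here. As a rough sketch, the helper below renders a minimal one. `oozie.wf.application.path` is the property Oozie uses to locate the deployed workflow; `nameNode` and `jobTracker` are variable names used by convention and referenced from workflow.xml. The helper name `render_job_properties`, the host names, and the HDFS path are all hypothetical.

```python
def render_job_properties(name_node, job_tracker, app_path):
    """Return the text of a minimal job.properties for a workflow job."""
    props = {
        "nameNode": name_node,
        "jobTracker": job_tracker,
        # Directory on HDFS that contains workflow.xml (step 1).
        "oozie.wf.application.path": "${nameNode}" + app_path,
    }
    return "\n".join(f"{key}={value}" for key, value in props.items()) + "\n"


# Hypothetical cluster addresses and workflow location.
text = render_job_properties(
    "hdfs://namenode.example.com:8020",
    "jobtracker.example.com:8021",
    "/user/alice/my-workflow",
)
print(text)
```

Writing the returned text to a file named job.properties gives the file that `-config` expects in step 2.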

Use Case Sharing
- Previously used crontab + Python scripts

- After porting to Oozie:
 - Reduced code size (4906 -> 1708 lines)
 - Smoother processing (1 week delay -> 3 days)
 - More stable
