Monday, February 17, 2014

Oozie: Workflow Engine for Hadoop

Outline
1. What is Oozie?
2. Do you need Oozie?
3. How to use Oozie
4. Use case sharing

What Is Oozie?
- Originally designed at Yahoo!
- Entered the Apache Incubator in 2011; an Apache top-level project since 2012
- A web service that launches your jobs based on:
 - Time dependency
 - Data dependency
- Can rerun a failed workflow from the last point of failure
- Monitoring
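
The two trigger types above map onto an Oozie coordinator definition. The following is a minimal sketch, not taken from the original slides: the application name, dates, dataset path, and the `workflowAppPath` property are placeholder assumptions. The `frequency` attribute expresses the time dependency, and the `input-events` element expresses the data dependency, so each daily run waits until that day's directory contains a `_SUCCESS` flag file.

```shell
# Write a sample coordinator.xml locally (all names, dates, and paths are placeholders).
mkdir -p oozie-demo

# The quoted 'EOF' keeps the shell from expanding Oozie's ${...} expressions.
cat > oozie-demo/coordinator.xml <<'EOF'
<coordinator-app name="daily-report" frequency="${coord:days(1)}"
                 start="2014-02-17T00:00Z" end="2014-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
  <datasets>
    <!-- One dataset instance per day; the path pattern is hypothetical. -->
    <dataset name="logs" frequency="${coord:days(1)}"
             initial-instance="2014-02-17T00:00Z" timezone="UTC">
      <uri-template>${nameNode}/data/logs/${YEAR}/${MONTH}/${DAY}</uri-template>
      <done-flag>_SUCCESS</done-flag>
    </dataset>
  </datasets>
  <input-events>
    <!-- Data dependency: wait for the current day's instance of "logs". -->
    <data-in name="input" dataset="logs">
      <instance>${coord:current(0)}</instance>
    </data-in>
  </input-events>
  <action>
    <workflow>
      <app-path>${workflowAppPath}</app-path>
    </workflow>
  </action>
</coordinator-app>
EOF
```

Dropping the `datasets` and `input-events` elements would leave a purely time-triggered job, which is the closest equivalent to a crontab entry.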

Do You Need Oozie?
Q1: Do you have multiple jobs that depend on one another?
Q2: Do you need to run jobs on a regular schedule?
Q3: Do you need to check that input data is available before a job runs?
Q4: Do you need monitoring and operational support?

If you answered YES to any of these,
then you should consider Oozie!

How To Use Oozie
1. Deploy your workflow to HDFS. This includes:
 - Oozie job definitions (workflow.xml)
 - your code: MapReduce, Pig, streaming, Java, etc.
 - libraries (.so and .jar)

2. Submit your job
 $ oozie job -run -config job.properties
 Workflow ID: 0123-123456-oozie-wrkf-W

3. Check job status
 $ oozie job -info 0123-123456-oozie-wrkf-W
 $ oozie job -log 0123-123456-oozie-wrkf-W

(Submit a coordinator job the same way.)
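
The three steps above can be sketched end to end as follows. This is an illustrative sketch under stated assumptions rather than the setup used in the original talk: the application name `demo-wf`, the NameNode host and port, and the HDFS paths are all placeholders, the `extract_job_id` helper is hypothetical, and the single `fs` action stands in for real MapReduce or Pig steps. The `hdfs` and `oozie` commands run only if they are installed, and the Oozie client is assumed to find the server through the `OOZIE_URL` environment variable.

```shell
# Step 1: build the application directory locally (all names are placeholders).
mkdir -p demo-wf/lib   # lib/ holds the .jar and .so files your actions need

# Quoted 'EOF' delimiters keep the shell from expanding Oozie's ${...} expressions.
cat > demo-wf/workflow.xml <<'EOF'
<workflow-app name="demo-wf" xmlns="uri:oozie:workflow:0.4">
  <start to="make-output-dir"/>
  <action name="make-output-dir">
    <fs>
      <mkdir path="${nameNode}/user/${wf:user()}/demo-output"/>
    </fs>
    <ok to="end"/>
    <error to="fail"/>
  </action>
  <kill name="fail">
    <message>Failed at [${wf:lastErrorNode()}]: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>
EOF

# job.properties stays on the local machine and tells Oozie where the app lives.
# For a coordinator, set oozie.coord.application.path instead.
cat > job.properties <<'EOF'
nameNode=hdfs://namenode.example.com:8020
oozie.wf.application.path=${nameNode}/user/${user.name}/apps/demo-wf
EOF

# Copy the application directory to HDFS (skipped when no Hadoop client is present).
if command -v hdfs >/dev/null 2>&1; then
  app_dir="/user/$(id -un)/apps"
  hdfs dfs -mkdir -p "$app_dir"
  hdfs dfs -put -f demo-wf "$app_dir/"
fi

# Hypothetical helper: treat the last field of the last output line as the job ID.
extract_job_id() { tail -n 1 | awk '{ print $NF }'; }

# Steps 2 and 3: submit, then query status (skipped when no Oozie client is present).
if command -v oozie >/dev/null 2>&1; then
  job_id=$(oozie job -run -config job.properties | extract_job_id)
  oozie job -info "$job_id"
fi
```

On a machine without Hadoop or Oozie clients the block only writes `demo-wf/workflow.xml` and `job.properties`, which makes it safe to run in an empty directory and inspect the result.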

Use Case Sharing
- Previously: crontab + Python scripts

- After porting to Oozie:
 - Smaller code base (4906 -> 1708 lines)
 - Smoother processing (delay reduced from 1 week to 3 days)
 - More stable
