Monday, February 17, 2014

What is Oozie?

About Oozie
Oozie is an open source project that simplifies workflow and coordination between jobs. It provides users with the ability to define actions and dependencies between actions. Oozie will then schedule actions to execute when the required dependencies have been met.

A workflow in Oozie is defined in what is called a Directed Acyclic Graph (DAG). Acyclic means there are no loops in the graph (in other words, there’s a starting point and an ending point to the graph), and all tasks and dependencies point from start to end without going back. A DAG is made up of action nodes and control-flow nodes. An action node can be a MapReduce job, a Pig application, a file system task, or a Java application. Flow control in the graph is represented by node elements that provide logic based on the input from the preceding task in the graph. Examples of flow control nodes are decision, fork, and join nodes, sketched below.
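To make the control nodes concrete, here is a minimal sketch of fork, join, and decision nodes in Oozie's workflow.xml syntax; the node names, the inputDir parameter, and the 1 GB threshold are made up for illustration:

<fork name="parallel-steps">
    <path start="pig-step"/>
    <path start="java-step"/>
</fork>

<!-- ... the pig-step and java-step actions both transition to the join ... -->

<join name="wait-for-both" to="decide-path"/>

<decision name="decide-path">
    <switch>
        <!-- take the "big-input" branch when the input exceeds 1 GB -->
        <case to="big-input">${fs:fileSize(inputDir) gt 1 * GB}</case>
        <default to="small-input"/>
    </switch>
</decision>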

What is Oozie?
•  Oozie allows a user to create Directed Acyclic Graphs (DAGs) of workflows, and these can be run in parallel and sequentially in Hadoop.
•  Oozie can also run plain Java classes and Pig workflows, and interact with HDFS
– Nice if you need to delete or move files before a job runs (see the fs action sketch after this list)
•  Oozie can run jobs sequentially (one after the other) and in parallel (multiple at a time)
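The HDFS housekeeping mentioned above is done with the fs action. A minimal sketch with made-up paths and node names (${nameNode} would be defined in job.properties, covered below):

<action name="clean-up">
    <fs>
        <delete path="${nameNode}/user/alice/output"/>
        <move source="${nameNode}/user/alice/staging" target="${nameNode}/user/alice/input"/>
    </fs>
    <ok to="wordcount"/>
    <error to="fail"/>
</action>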

Why use Oozie instead of just chaining jobs one after another?
•  Major flexibility
– Start, stop, suspend, and re-run jobs
•  Oozie allows you to restart from a failure
– You can tell Oozie to restart a job from a specific node in the graph or to skip specific failed nodes
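From the command line, a re-run looks roughly like this; the job ID is made up, and the OOZIE_URL environment variable is assumed to point at your Oozie server. The properties oozie.wf.rerun.failnodes and oozie.wf.rerun.skip.nodes control which nodes get re-executed:

# job.properties must contain either
#   oozie.wf.rerun.failnodes=true                 (re-run only the failed nodes)
# or
#   oozie.wf.rerun.skip.nodes=clean-up,wordcount  (skip nodes that already succeeded)
oozie job -rerun 0000123-140217234511284-oozie-W -config job.properties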

Other Features
•  Java Client API / Command Line Interface
– Launch, control, and monitor jobs from your Java apps
•  Web Service API
– You can control jobs from anywhere
•  Run periodic jobs (see the coordinator sketch after this list)
– Have jobs that you need to run every hour, day, or week? Have Oozie run the jobs for you
•  Receive an email when a job is complete
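Periodic execution is handled by an Oozie coordinator, a second XML file that triggers your workflow on a schedule. A minimal hourly sketch; the app name, dates, and application path are made up:

<coordinator-app name="hourly-wordcount" frequency="${coord:hours(1)}"
                 start="2014-02-17T00:00Z" end="2014-12-31T00:00Z"
                 timezone="UTC" xmlns="uri:oozie:coordinator:0.2">
    <action>
        <workflow>
            <app-path>${nameNode}/user/alice/wordcount-wf</app-path>
        </workflow>
    </action>
</coordinator-app>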

How do you make a workflow?
•  First make a Hadoop job and make sure that it works using the jar command in Hadoop
–  This ensures that the configuration is correct for your job
•  Make a jar out of your classes
•  Then make a workflow.xml file and copy all of the job configuration properties into the XML file (a sketch follows this list). These include:
–  Input files
–  Output files
–  Input readers and writers
–  Mappers and reducers
–  Job-specific arguments
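A minimal workflow.xml sketch with a single MapReduce action; the class names, node names, and the ${inputDir}/${outputDir} parameters are made up and get filled in from job.properties at submission time:

<workflow-app name="wordcount-wf" xmlns="uri:oozie:workflow:0.4">
    <start to="wordcount"/>
    <action name="wordcount">
        <map-reduce>
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <!-- delete any stale output so the job can be re-run safely -->
            <prepare>
                <delete path="${nameNode}${outputDir}"/>
            </prepare>
            <configuration>
                <property>
                    <name>mapred.mapper.class</name>
                    <value>org.example.WordCount$Map</value>
                </property>
                <property>
                    <name>mapred.reducer.class</name>
                    <value>org.example.WordCount$Reduce</value>
                </property>
                <property>
                    <name>mapred.input.dir</name>
                    <value>${inputDir}</value>
                </property>
                <property>
                    <name>mapred.output.dir</name>
                    <value>${outputDir}</value>
                </property>
            </configuration>
        </map-reduce>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Job failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>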

How do you make a workflow? (continued)
•  You also need a job.properties file. This file defines the NameNode, JobTracker, etc.
•  It also gives the location of the shared jars and other files
•  When you have these files ready, you need to copy them into HDFS, and then you can run the job from the command line
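A matching job.properties sketch, followed by the copy-and-run commands; host names, ports, and paths are made up for illustration:

nameNode=hdfs://localhost:8020
jobTracker=localhost:8021
queueName=default
inputDir=/user/alice/input
outputDir=/user/alice/output
oozie.wf.application.path=${nameNode}/user/alice/wordcount-wf
# location of shared jars, if you use a shared library directory:
# oozie.libpath=${nameNode}/user/alice/share/lib

# the application directory holds workflow.xml plus a lib/ dir with your job jar
hadoop fs -put wordcount-wf /user/alice/wordcount-wf
oozie job -oozie http://localhost:11000/oozie -config job.properties -run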

Concepts About Using HBase

NoSQL?
HBase is a type of "NoSQL" database. "NoSQL" is a general term meaning that the database isn't an RDBMS which supports SQL as its primary access language, but there are many types of NoSQL databases: BerkeleyDB is an example of a local NoSQL database, whereas HBase is very much a distributed database. Technically speaking, HBase is really more a "Data Store" than "Data Base" because it lacks many of the features you find in an RDBMS, such as typed columns, secondary indexes, triggers, and advanced query languages, etc.
However, HBase has many features that support both linear and modular scaling. HBase clusters expand by adding RegionServers that are hosted on commodity-class servers. If a cluster expands from 10 to 20 RegionServers, for example, it doubles in both storage and processing capacity. An RDBMS can scale well, but only up to a point - specifically, the size of a single database server - and for the best performance it requires specialized hardware and storage devices. HBase features of note are:
    •    Strongly consistent reads/writes: HBase is not an "eventually consistent" DataStore. This makes it very suitable for tasks such as high-speed counter aggregation.
    •    Automatic sharding: HBase tables are distributed on the cluster via regions, and regions are automatically split and re-distributed as your data grows.
    •    Automatic RegionServer failover
    •    Hadoop/HDFS Integration: HBase supports HDFS out of the box as its distributed file system.
    •    MapReduce: HBase supports massively parallelized processing via MapReduce for using HBase as both source and sink.
    •    Java Client API: HBase supports an easy-to-use Java API for programmatic access (see the sketch after this list).
    •    Thrift/REST API: HBase also supports Thrift and REST for non-Java front-ends.
    •    Block Cache and Bloom Filters: HBase supports a Block Cache and Bloom Filters for high volume query optimization.
    •    Operational Management: HBase provides built-in web pages for operational insight as well as JMX metrics.
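To give a flavor of the Java client API, here is a minimal put-then-get sketch against the classic HTable interface; the table name "test_table" and column family "cf" are made up and assumed to already exist (e.g., created beforehand in the HBase shell):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientSketch {
    public static void main(String[] args) throws Exception {
        // picks up hbase-site.xml from the classpath
        Configuration conf = HBaseConfiguration.create();

        HTable table = new HTable(conf, "test_table");
        try {
            // write one cell: row "row1", column cf:greeting
            Put put = new Put(Bytes.toBytes("row1"));
            put.add(Bytes.toBytes("cf"), Bytes.toBytes("greeting"), Bytes.toBytes("hello"));
            table.put(put);

            // read it back with a strongly consistent get
            Get get = new Get(Bytes.toBytes("row1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("cf"), Bytes.toBytes("greeting"));
            System.out.println(Bytes.toString(value));
        } finally {
            table.close();
        }
    }
}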

When Should I Use HBase?
HBase isn't suitable for every problem.
First, make sure you have enough data. If you have hundreds of millions or billions of rows, then HBase is a good candidate. If you only have a few thousand or a few million rows, then a traditional RDBMS might be a better choice, because all of your data might wind up on a single node (or two) while the rest of the cluster sits idle.
Second, make sure you can live without all the extra features that an RDBMS provides (e.g., typed columns, secondary indexes, transactions, advanced query languages, etc.). An application built against an RDBMS cannot be "ported" to HBase by simply changing a JDBC driver, for example. Consider moving from an RDBMS to HBase as a complete redesign rather than a port.
Third, make sure you have enough hardware. Even HDFS doesn't do well with anything less than 5 DataNodes (due to things such as HDFS block replication which has a default of 3), plus a NameNode.
HBase can run quite well stand-alone on a laptop - but this should be considered a development configuration only.

What Is The Difference Between HBase and Hadoop/HDFS?
HDFS is a distributed file system that is well suited to the storage of large files. Its documentation states, however, that it is not a general-purpose file system and does not provide fast individual record lookups in files. HBase, on the other hand, is built on top of HDFS and provides fast record lookups (and updates) for large tables. This can sometimes be a point of conceptual confusion. HBase internally puts your data in indexed "StoreFiles" that exist on HDFS for high-speed lookups; see the Data Model chapter of the HBase documentation for more information on how HBase achieves its goals.
