Sunday, February 23, 2014

Azkaban

A batch job scheduler can be seen as a combination of the cron and make Unix utilities, topped with a friendly UI. Batch jobs need to be scheduled to run periodically. They also typically have intricate dependency chains—for example, dependencies on various data extraction processes or previous steps. Larger processes might have 50 or 60 steps, of which some might run in parallel and others must wait for the output of earlier steps. Combining all these processes into a single program allows you to control the dependency management, but can lead to sprawling monolithic programs that are difficult to test or maintain. Simply scheduling the individual pieces to run at different times avoids the monolithic problem, but introduces many timing assumptions that are inevitably broken. Azkaban is a workflow scheduler that allows the independent pieces to be declaratively assembled into a single workflow, and for that workflow to be scheduled to run periodically.
A good batch workflow system allows a program to be built out of small reusable pieces that need not know about one another. By declaring dependencies, you can control sequencing. Other functionality available from Azkaban can then be declaratively layered on top of the job without having to add any code. This includes things like email notifications of success or failure, resource locking, retry on failure, log collection, historical job run time information, and so on.

Azkaban consists of 3 key components:
    •    Relational Database (MySQL)
    •    AzkabanWebServer
    •    AzkabanExecutorServer

Relational Database (MySQL)
Azkaban uses MySQL to store much of its state. Both the AzkabanWebServer and the AzkabanExecutorServer access the DB.
How does AzkabanWebServer use the DB?

The web server uses the DB for the following reasons:
    •    Project Management - Stores the projects, the permissions on the projects, and the uploaded files.
    •    Executing Flow State - Keeps track of executing flows and which Executor is running them.
    •    Previous Flows/Jobs - Searches through previous executions of jobs and flows, and accesses their log files.
    •    Scheduler - Keeps the state of the scheduled jobs.
    •    SLA - Keeps all the SLA rules.


How does the AzkabanExecutorServer use the DB?
The executor server uses the DB for the following reasons:
    •    Access the project - Retrieves project files from the DB.
    •    Executing Flows/Jobs - Retrieves and updates data for flows and jobs that are executing.
    •    Logs - Stores the output logs for jobs and flows in the DB.
    •    Interflow dependency - If a flow is running on a different executor, it will take state from the DB.
MySQL was chosen simply because it is a widely used DB. We are looking to add compatibility with other DBs, although the need to search historical job runs benefits from a relational data store.

AzkabanWebServer
The AzkabanWebServer is the main manager of all of Azkaban. It handles project management, authentication, scheduling, and monitoring of executions. It also serves the web user interface.
Using Azkaban is easy. Azkaban uses *.job key-value property files to define individual tasks in a workflow, and the dependencies property to define the dependency chain of the jobs. These job files and associated code can be archived into a *.zip and uploaded to the web server through the Azkaban UI or through curl.
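As a minimal sketch of what these property files look like (the file names and echo commands are illustrative, assuming the built-in command job type):

```
# foo.job - the first task in the flow
type=command
command=echo "running foo"

# bar.job - declared to run only after foo completes
type=command
command=echo "running bar"
dependencies=foo
```

Zipping both files together and uploading the archive gives Azkaban everything it needs to build the two-step flow.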

AzkabanExecutorServer
Previous versions of Azkaban had both the AzkabanWebServer and the AzkabanExecutorServer features in a single server. The Executor has since been separated into its own server. There were several reasons for splitting these services: it will soon let us scale the number of executions and fail over to other Executors if one goes down, and it lets us roll out upgrades of Azkaban with minimal impact on users. As Azkaban’s usage grew, we found that upgrading became increasingly difficult because every time of day had become ‘peak’.


Friday, February 21, 2014

About Recursion Limit in Python

Python lacks the tail-call optimizations common in functional languages like Lisp. In Python, recursion depth is limited by default to 1000 frames (see sys.getrecursionlimit()).
I dare say that in Python, purely recursive algorithm implementations are not correct/safe. A fib() implementation that fails beyond a depth of ~1000 is not really correct. It is always possible to convert a recursive algorithm into an iterative one, and doing so is usually straightforward.
If you expect to recurse more than ~1000 levels deep, my advice is to convert the algorithm from recursive to iterative. If not, check for a runaway bug (the implementation lacks a condition that stops recursion, or that condition is wrong).
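A minimal sketch of such a conversion, using Fibonacci as the example (the function names are illustrative):

```python
def fib_recursive(n):
    # Naive recursion: each call adds a stack frame, so depth grows
    # with n and large inputs will hit the recursion limit.
    if n < 2:
        return n
    return fib_recursive(n - 1) + fib_recursive(n - 2)

def fib_iterative(n):
    # Same result with constant stack depth: no recursion limit to worry about.
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a
```

The iterative version also happens to be far faster here, since the naive recursion recomputes the same subproblems exponentially many times.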

How to set Recursion Limit
You can increase the allowed stack depth, which makes deeper recursive calls possible, like this:
import sys
sys.setrecursionlimit(10000)  # 10000 is an example; try different values
But I'd advise you to first try to optimize your code, for instance by using iteration instead of recursion.

How to decide Recursion Limit
The limit applies to the TOTAL stack depth, not the depth of any particular single function. You are probably already several frames deep when you make the first call to rec().
Take, for example, 5 recursive functions, each making 98 recursive calls, with the last call chaining into the next recursive function. With a recursion limit of 100, do you really want to allow each function to make 99 calls, for a total depth of ~500? No; that might crash the interpreter at those depths.
Therefore the recursion limit is the maximum depth of the whole call stack, not of any single named function.
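A small sketch demonstrating this (measure_depth and outer are illustrative names): recursing inside one extra enclosing frame reduces the reachable depth by exactly one, which shows the limit counts the whole stack, not just the recursive function.

```python
import sys

def measure_depth():
    """Count how many recursive calls succeed before RecursionError."""
    depth = 0
    def go():
        nonlocal depth
        depth += 1
        go()
    try:
        go()
    except RecursionError:
        pass
    return depth

def outer():
    # One extra stack frame wraps the measurement, so go() gets
    # exactly one fewer frame before hitting the same global limit.
    return measure_depth()

d1 = measure_depth()
d2 = outer()
```

With CPython's default limit of 1000, d1 comes out slightly below 1000 (the module frame and measure_depth itself already occupy slots), and d2 is exactly d1 - 1.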
