A batch job scheduler can be seen as a combination of the Unix utilities cron and make with a friendly UI. Batch jobs need to be scheduled to run periodically. They also typically have intricate dependency chains, for example dependencies on various data extraction processes or on previous steps. Larger processes might have 50 or 60 steps, some of which can run in parallel while others must wait for the output of earlier steps. Combining all these processes into a single program lets you control the dependency management, but can lead to sprawling monolithic programs that are difficult to test or maintain. Simply scheduling the individual pieces to run at different times avoids the monolith problem, but introduces many timing assumptions that are inevitably broken. Azkaban is a workflow scheduler that allows the independent pieces to be declaratively assembled into a single workflow, and allows that workflow to be scheduled to run periodically.
A good batch workflow system allows a program to be built out of small reusable pieces that need not know about one another. By declaring dependencies, you can control sequencing. Other functionality available from Azkaban can then be declaratively layered on top of the job without having to add any code. This includes things like email notifications of success or failure, resource locking, retry on failure, log collection, historical job run time information, and so on.
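To illustrate how declared dependencies alone can determine sequencing, here is a minimal sketch using Python's standard `graphlib` module. It is not Azkaban's implementation; the job names and the `execution_waves` helper are hypothetical, chosen only to show that independent steps can be grouped to run in parallel while dependent steps wait.

```python
from graphlib import TopologicalSorter

# Hypothetical job names; each job maps to the set of jobs it depends on.
dependencies = {
    "extract_users": set(),
    "extract_orders": set(),
    "join": {"extract_users", "extract_orders"},
    "report": {"join"},
}


def execution_waves(deps):
    """Group jobs into waves; jobs in the same wave have no pending
    dependencies on each other and could run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(number, wave)
```

In this sketch the two extraction jobs land in the first wave, and `join` and `report` each wait for the wave before them. None of the jobs needs to know about the others; only the dependency declaration ties them together.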
Azkaban consists of 3 key components:
• Relational Database (MySQL)
• AzkabanWebServer
• AzkabanExecutorServer
Relational Database (MySQL)
Azkaban uses MySQL to store much of its state. Both the AzkabanWebServer and the AzkabanExecutorServer access the DB.
How does AzkabanWebServer use the DB?
The web server uses the DB for the following reasons:
• Project Management - Stores the projects, the permissions on the projects, and the uploaded files.
• Executing Flow State - Keeps track of executing flows and which Executor is running them.
• Previous Flow/Jobs - Searches through previous executions of jobs and flows and accesses their log files.
• Scheduler - Keeps the state of the scheduled jobs.
• SLA - Keeps all the SLA rules.
How does the AzkabanExecutorServer use the DB?
The executor server uses the DB for the following reasons:
• Access the project - Retrieves project files from the DB.
• Executing Flows/Jobs - Retrieves and updates data for flows and jobs that are executing.
• Logs - Stores the output logs for jobs and flows in the DB.
• Interflow dependency - If a flow is running on a different executor, it will take state from the DB.
MySQL was chosen for no particular reason other than that it is a widely used DB. We are looking to implement compatibility with other databases, although the requirement to search previously run jobs benefits from a relational data store.
AzkabanWebServer
The AzkabanWebServer is the main manager of all of Azkaban. It handles project management, authentication, scheduling, and monitoring of executions. It also serves as the web user interface.
Using Azkaban is easy. Azkaban uses *.job key-value property files to define individual tasks in a workflow, and the dependencies property to define the dependency chain of the jobs. These job files and associated code can be archived into a *.zip and uploaded to the web server through the Azkaban UI or with curl.
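As a rough sketch of what such a project archive could look like, the Python snippet below renders two job files as key=value properties and packs them into a zip in memory. The job names (`extract`, `load`) and the echo commands are hypothetical, and the `type` and `command` keys are assumptions based on Azkaban's command job type; only the `dependencies` property comes from the description above. Check the keys against the Azkaban version you run.

```python
import io
import zipfile


def render_job(properties):
    """Render a dict as key=value lines, one property per line."""
    return "".join(f"{key}={value}\n" for key, value in properties.items())


# Hypothetical two-step flow: "load" should run only after "extract".
# The type/command keys are assumed, not taken from the text above.
jobs = {
    "extract.job": {"type": "command", "command": "echo extracting"},
    "load.job": {
        "type": "command",
        "command": "echo loading",
        "dependencies": "extract",  # refers to extract.job by its job name
    },
}


def build_archive(job_files):
    """Pack the rendered job files into an in-memory zip and return its bytes."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as archive:
        for name, properties in job_files.items():
            archive.writestr(name, render_job(properties))
    return buffer.getvalue()


archive_bytes = build_archive(jobs)
```

Writing `archive_bytes` to a file such as `project.zip` would give an archive that could then be uploaded through the UI or with curl, as described above.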
AzkabanExecutorServer
Previous versions of Azkaban had both the AzkabanWebServer and the AzkabanExecutorServer features in a single server. The Executor has since been separated into its own server. There were several reasons for splitting these services: we will soon be able to scale the number of executions and fall back on the Executors that are still operating if one fails. Also, we are able to roll out upgrades of Azkaban with minimal impact on users. As Azkaban’s usage grew, we found that upgrading Azkaban became increasingly difficult as all times of the day became ‘peak’.