Oozie to Airflow

A tool to easily convert between Apache Oozie workflows and Apache Airflow workflows.

The program targets Apache Airflow >= 2.x and Apache Oozie 1.0 XML schema.

If you want to contribute to the project, please take a look at CONTRIBUTING.md

Background
Running the Program
Supported Oozie features
Airflow-specific optimisations
- Removing unnecessary control nodes
- Removing inaccessible nodes
Common Known Limitations
Cloud execution environment for Oozie to Airflow conversion
- Cloud environment setup
Examples

Background

Apache Airflow is a workflow management system developed by AirBnB in 2014. It is a platform to programmatically author, schedule, and monitor workflows. Airflow workflows are designed as Directed Acyclic Graphs (DAGs) of tasks in Python. The Airflow scheduler executes your tasks on an array of workers while following the specified dependencies.

Apache Oozie is a workflow scheduler system to manage Apache Hadoop jobs. Oozie workflows are also designed as Directed Acyclic Graphs (DAGs) in XML.

There are a few differences noted below:

	Spec.	Task	Dependencies	"Subworkflows"	Parameterization	Notification
Oozie	XML	Action Node	Control Node	Subworkflow	EL functions/Properties file	URL based callbacks
Airflow	Python	Operators	Trigger Rules, set_downstream()	SubDag	jinja2 and macros	Callbacks/Emails

Running the Program

Note that you need Python >= 3.8 to run the converter.

Installing from PyPi

You can install o2a from PyPi via pip install o2a. After installation, the o2a and o2a-validate-workflows should be available on your path.

Installing from sources

(Optional) Install virtualenv:

In case you use sources of o2a, the environment can be set up via the virtualenv setup (you can create one using virtualenvwrapper for example).
Install Oozie-to-Airflow - you have 2 options to do so:
1. automatically: install o2a from local folder using pip install -e .
  
  This will take care about, among others, adding the bin subdirectory to the PATH.
2. more manually:
  1. While in your virtualenv, you can install all the requirements via pip install -r requirements.txt.
  2. You can add the bin subdirectory to your PATH, then all the scripts below can be run without adding the ./bin prefix. This can be done for example by adding a line similar to the one below to your .bash_profile or bin/postactivate from your virtual environment:
```
export PATH=${PATH}:<INSERT_PATH_TO_YOUR_OOZIE_PROJECT>/bin
```
  Otherwise you need to run all the scripts from the bin subdirectory, for example:
```
./bin/o2a --help
```

In all the example commands below, it is assumed that the bin directory is in your PATH - either installed from PyPi or from the sources.

Running the conversion

You can run the program by calling: o2a -i <INPUT_APPLICATION_FOLDER> -o <OUTPUT_FOLDER_PATH>

Example: o2a -i examples/demo -o output/demo

This is the full usage guide, available by running o2a -h

usage: o2a [-h] -i INPUT_DIRECTORY_PATH -o OUTPUT_DIRECTORY_PATH [-n DAG_NAME]
           [-u USER] [-s START_DAYS_AGO] [-v SCHEDULE_INTERVAL] [-d]

Convert Apache Oozie workflows to Apache Airflow workflows.

optional arguments:
  -h, --help            show this help message and exit
  -i INPUT_DIRECTORY_PATH, --input-directory-path INPUT_DIRECTORY_PATH
                        Path to input directory
  -o OUTPUT_DIRECTORY_PATH, --output-directory-path OUTPUT_DIRECTORY_PATH
                        Desired output directory
  -n DAG_NAME, --dag-name DAG_NAME
                        Desired DAG name [defaults to input directory name]
  -u USER, --user USER  The user to be used in place of all ${user.name}
                        [defaults to user who ran the conversion]
  -s START_DAYS_AGO, --start-days-ago START_DAYS_AGO
                        Desired DAG start as number of days ago
  -v SCHEDULE_INTERVAL, --schedule-interval SCHEDULE_INTERVAL
                        Desired DAG schedule interval as number of days
  -d, --dot             Renders workflow files in DOT format

Structure of the application folder

The input application directory has to follow the structure defined as follows:

<APPLICATION>/
             |- job.properties        - job properties that are used to run the job
             |- hdfs                  - folder with application - should be copied to HDFS
             |     |- workflow.xml    - Oozie workflow xml (1.0 schema)
             |     |- ...             - additional folders required to be copied to HDFS
             |- configuration.template.properties - template of configuration values used during conversion
             |- configuration.properties          - generated properties for configuration values

The o2a libraries

Converted Airflow DAGs use common libraries. Those libraries should be available on PYTHONPATH for all Airflow components - scheduler, webserver and workers - so that they can be imported when DAGs are parsed.

Those libraries are in o2a/o2a_libs folder and the easiest way to make them available to all the DAGs is to install them from PyPi via pip install o2a-lib.

Supported Oozie features

Control nodes

Fork and Join

A fork node splits the path of execution into multiple concurrent paths of execution.

A join node waits until every concurrent execution of the previous fork node arrives to it. The fork and join nodes must be used in pairs. The join node assumes concurrent execution paths are children of the same fork node.

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <fork name="[FORK-NODE-NAME]">
        <path start="[NODE-NAME]" />
        ...
        <path start="[NODE-NAME]" />
    </fork>
    ...
    <join name="[JOIN-NODE-NAME]" to="[NODE-NAME]" />
    ...
</workflow-app>

Decision

A decision node enables a workflow to make a selection on the execution path to follow.

The behavior of a decision node can be seen as a switch-case statement.

A decision node consists of a list of predicates-transition pairs plus a default transition. Predicates are evaluated in order or appearance until one of them evaluates to true and the corresponding transition is taken. If none of the predicates evaluates to true the default transition is taken.

Predicates are JSP Expression Language (EL) expressions (refer to section 4.2 of this document) that resolve into a boolean value, true or false . For example: ${fs:fileSize('/usr/foo/myinputdir') gt 10 * GB}

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <decision name="[NODE-NAME]">
        <switch>
            <case to="[NODE_NAME]">[PREDICATE]</case>
            ...
            <case to="[NODE_NAME]">[PREDICATE]</case>
            <default to="[NODE_NAME]"/>
        </switch>
    </decision>
    ...
</workflow-app>

Start

The start node is the entry point for a workflow job, it indicates the first workflow node the workflow job must transition to.

When a workflow is started, it automatically transitions to the node specified in the start .

A workflow definition must have one start node.

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
  ...
  <start to="[NODE-NAME]"/>
  ...
</workflow-app>

End

The end node is the end for a workflow job, it indicates that the workflow job has completed successfully.

When a workflow job reaches the end it finishes successfully (SUCCEEDED).

If one or more actions started by the workflow job are executing when the end node is reached, the actions will be killed. In this scenario the workflow job is still considered as successfully run.

A workflow definition must have one end node.

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <end name="[NODE-NAME]"/>
    ...
</workflow-app>

Kill

The kill node allows a workflow job to exit with an error.

When a workflow job reaches the kill it finishes in error (KILLED).

If one or more actions started by the workflow job are executing when the kill node is reached, the actions will be killed.

A workflow definition may have zero or more kill nodes.

<workflow-app name="[WF-DEF-NAME]" xmlns="uri:oozie:workflow:0.1">
    ...
    <kill name="[NODE-NAME]">
        <message>[MESSAGE-TO-LOG]</message>
    </kill>
    ...
</workflow-app>

EL Functions

As of now, a very minimal set of Oozie EL functions are supported. The way they work is that an EL expression is being translated to a jinja template. The translation is performed using Lark. All required variables should be passed in job.properties. Equivalents of EL functions can be found in o2a_libs/functions.py.

For example the following EL expression ${wf:user() == firstNotNull(arg1, arg2)} is translated to the following jinja equivalent: {{functions.wf.user() == functions.first_not_null(arg1, arg2)}} and it requires that job.properties includes values for arg1 and arg2.

This design allows for custom EL function mapping if one so chooses. By default everything gets mapped to the module o2a_libs.functions. This means in order to use EL function mapping, the folder o2a_libs.functions should be copied over to the Airflow DAG folder. This should then be picked up and parsed by the Airflow workers and then available to all DAGs.

Workflow and node notifications

Workflow jobs can be configured to make an HTTP GET notification upon start and end of a workflow action node and upon the start and completion of a workflow job. More information in Oozie docs.

Oozie-to-Airflow supports this feature. The job.properties file has contain URLs for workflow and action node notifications - example below:

oozie.wf.workflow.notification.url=http://example.com/workflow?job-id=$jobId&status=$status
oozie.wf.action.notification.url=http://example.com/action?job-id=$jobId&node-name=$nodeName&status=$status

If they are present, Oozie-to-Airflow will insert additional BashOperator to the generated DAG for each notification to be sent, right before or after the appropriate node (for node notifications) or at the beginning or end of the workflow (for workflow notifications). Inside the BashOperator will use curl to send an HTTP GET request to the appropriate URL endpoint.

Example DAG without notifications:

The same DAG with notifications:

Airflow-specific optimisations

Due to the fact that Oozie and Airflow differ with regards to some aspects of running workflows, there may be some differences in the output Airflow DAG with regards to the Oozie XML.

Removing unnecessary control nodes

In Airflow you don't need as many explicit control nodes as in Oozie. For example you don't ever need a Start node and in most cases End is also not needed.

We introduced the concept of Transformers in O2A, which modify the workflow. Below are the ones that remove unnecessary control nodes:

RemoveEndTransformer - removes End nodes with all relations when it's not connected to a Decision node,
RemoveKillTransformer - removes Kill nodes with all relations when it's not connected to a Decision node,
RemoveStartTransformer - removes Start nodes with all relations,
RemoveForkTransformer - removes Fork nodes when there are no upstream nodes,
RemoveJoinTransformer - removes Join nodes when there are no downstream nodes.

Removing inaccessible nodes

In Oozie for a node to be executed it has to be able to be traced back to the Start node. If a node is "loose" and is not connected to Start in any way (directly or indirectly via its "parents") it will be skipped.

However in Airflow all tasks will be executed. Therefore in order to replicate the "skipping" of loose nodes behaviour of Oozie we need to remove nodes unconnected to Start during the conversion phase.

This is achieved thanks to the RemoveInaccessibleNodeTransformer.

Common Known Limitations

There are few limitations in the implementation of the Oozie-To-Airflow converter. It's not possible to write a converter that handles all cases of complex workflows from Ooozie because some of functionalities available are not possible to map easily to existing Airflow Operators or cannot be tested because of the current Dataproc + Composer limitations. Some of those limitations might be removed in the future. Below is a list of common known limitations that we are aware of for now.

Many of those limitations are not blockers - the workflows will still be converted to Python DAGs and it should be possible to manually (or automatically) post-process the DAGs to add custom functionality. So even with those limitations in place you can still save a ton of work when converting many Oozie workflows.

In the following, "Examples" section more specific per-action limitations are listed as well.

File/Archive functionality

At the time of this writing we were not able to determine if file/archive functionality works as intended. While we map appropriate file/archive methods it seems that Oozie treats file/archive somewhat erraticaly. This is not a blocker to run most of the operations, however some particular complex workflows might be problematic. Further testing with real, production Oozie workflows is needed to verify our implementation.

Example Oozie docs

File/Archive in Pig doesn't work

Not all global configuration methods are supported

Oozie implements a number of ways how configuration parameters are passed to actions. Out of the existing configuration options the following ones are not supported (but can be easily added as needed):

Support for uber.jar feature

The uber.jar feature is not supported. Oozie docs

Support uber.jar feature

Support for .so and .jar lib files

Oozie adds .so and .jar files from the lib folder to Local Cache for all the jobs run to LD_LIBRARY_PATH/CLASSPATH. Currently only Java Mapper supports it.

Custom messages missing for Kill Node

The Kill Node might have custom log message specified. This is not implemented. Oozie docs

Add handling of custom Kill Node message

Capturing output is not supported

In several actions you can capture output from tasks. This is not yet implemented. Example Oozie docs

Add support for capture-ouput

Subworkflow DAGs must be placed in examples

Currently all subworkflow DAGs must be in examples folder

Subworkflow conversion expects to be run in examples

EL functions support

Currently many EL-functions are implemented (basic functions, fs functions and subset od wf functions). Check this document for full information about current state. The following wf:functions are not implemented:

All implemented function could be found in o2a_libs module. Camel case names of Oozie functions were substituted with snake case equivalents (ex. lastErrorNode becomes last_error_node).

Additionally some already implemented functions may not preserve the full logic of the original EL-expression due to differences between Oozie and Airflow. It's difficult to implement it in generic-enough way to cover all possible cases, it's much easier to eave the implementation of those functions to the user. It's perfectly possible to provide your own implementation of each of those functions if you need to customise it and in many cases it will be easier if it's specific implementation rather than generic one.

Notification proxy is not supported

In Oozie, the oozie.wf.workflow.notification.proxy property can be used to configure proxy, through which notifications will be sent.

This is not supported. Currently notifications will be sent directly, without proxy.

Cloud execution environment for Oozie to Airflow conversion

Cloud environment setup

An easy way of running the workflows of Oozie as well as running the oozie-to-airflow converted DAGs in Airflow is by using Cloud Composer and Dataproc in GCP. This the environment supported currently by the converter and one that it was heavily tested with. These services allow testing without much need for an on-premise setup. Here are some details about the environment that is supported:

Cloud Composer

composer-2.2.0-airflow-2.5.1
python version 3 (3.8.10)
machine n1-standard-1
node count: 3
Additional PyPi packages:
- sshtunnel==0.1.4

Cloud Dataproc Cluster with Oozie

n1-standard-2, 4 vCPU, 20 GB memory (! Minimum 16 GB RAM needed)
primary disk size, 50 GB
Image 1.3.29-debian9
Hadoop version
Init action: oozie-5.2.sh

Those are the steps you should follow to set it up:

Create a Dataproc cluster see Creating Dataproc Cluster below
Create a Cloud Composer Environment with at least Airflow version 2.0 to test the Apache Airflow workflows.
Set up all required Airflow Connections in Composer. This is required for things like SSHOperator.

Creating Dataproc cluster

We prepared Dataproc initialization action that allows to run Oozie 5.2.0 on Dataproc.

Please upload oozie-5.2.sh to your GCS bucket and create cluster using following command:

Note that you need at least 20GB RAM to run Oozie jobs on the cluster. The custom machine type below has enough RAM to handle oozie.

gcloud dataproc clusters create <CLUSTER_NAME> --region europe-west1 --subnet default --zone "" \
     --single-node --master-machine-type custom-4-20480 --master-boot-disk-size 500 \
     --image-version 1.3-deb9 --project <PROJECT_NAME> --initialization-actions 'gs://<BUCKET>/<FOLDER>/oozie-5.1.sh' \
     --initialization-action-timeout=30m

Note 1: it might take ~20 minutes to create the cluster Note 2: the init-action works only with single-node cluster and Dataproc 1.3

Once cluster is created, steps from example map reduce job can be run on master node to execute Oozie's example Map-Reduce job.

Oozie is serving web UI on port 11000. To enable access to it please follow official instructions on how to connect to the cluster web interfaces.

List of jobs with their statuses can be also shown by issuing oozie jobs command on master node.

More about testing the Oozie to Airflow conversion process can be found in CONTRIBUTING.md

Examples

All examples can be found in the examples directory.

EL
SSH
Email
MapReduce
FS
Java
Pig
Shell
Spark
Sub-workflow
DistCp
Decision
Hive/Hive2
Demo
Child workflow

EL Example

Running

The Oozie Expression Language (EL) example can be run as: o2a -i examples/el -o output/el

This will showcase the ability to use the o2a/o2a_libs folder to map EL functions to Python methods. This example assumes that the user has a valid Apache Airflow SSH connection set up and the o2a/o2a_libs folder has been copied to the dags folder (preserving o2a parent directory).

Please keep in mind that as of the current version only a single EL variable or single EL function. Variable/function chaining is not currently supported.

Output

In this example the output will be created in the ./output/el/ folder.

Known limitations

Decision example is not yet fully functional as EL functions are not yet fully implemented so condition is hard-coded for now. Once EL functions are implemented, the condition in the example will be updated.

Github issue: Implement decision node

SSH Example

Prerequisites

In order to change the user or host in the example, please edit the examples/ssh/hdfs/workflow.xml.

Running

The ssh example can be run as:

o2a -i examples/ssh -o output/ssh

This will convert the specified Oozie XML and write the output into the specified output directory, in this case output/ssh/ssh.py.

There are some differences between Apache Oozie and Apache Airflow as far as the SSH specification goes. In Airflow you will have to add/edit an SSH-specific connection that contains the credentials required for the specified SSH action. For example, if the SSH node looks like:

<action name="ssh">
    <ssh xmlns="uri:oozie:ssh-action:0.1">
        <host>user@apache.org</host>
        <command>echo</command>
        <args>"Hello Oozie!"</args>
    </ssh>
    <ok to="end"/>
    <error to="fail"/>
</action>

Then the default Airflow SSH connection, ssh_default should have at the very least a password set. This can be found in the Airflow Web UI under Admin > Connections. From the command line it is impossible to edit connections so you must add one like:

airflow connections --add --conn_id <SSH_CONN_ID> --conn_type SSH --conn_password <PASSWORD>

More information can be found in Airflow's documentation.

Output

In this example the output will be created in the ./output/ssh/ folder.

The converted DAG uses the SSHOperator in Airflow.

Known limitations

No known limitations.

Email Example

Prerequisites

Make sure to first copy /examples/email/configuration.template.properties, rename it as configuration.properties and fill in with configuration data.

Running

The Email example can be run as:

o2a -i examples/email -o output/email

Output

In this example the output will be created in the ./output/email/ folder.

The converted DAG uses the EmailOperator in Airflow.

Prerequisites

In Oozie the SMTP server configuration is located in oozie-site.xml.

For Airflow it needs to be located in airflow.cfg. Example Airflow SMTP configuration:

[email]
email_backend = airflow.utils.email.send_email_smtp

[smtp]
smtp_host = example.com
smtp_starttls = True
smtp_ssl = False
smtp_user = airflow_user
smtp_password = password
smtp_port = 587
smtp_mail_from = airflow_user@example.com

For more information on setting Airflow configuration options see here.

Known limitations

1. Attachments are not supported

Due to the complexity of extracting files from HDFS inside Airflow and providing them for the EmailOperator, the functionality of sending attachments has not yet been implemented.

Solution: Implement in O2A a mechanism to extract a file from HDFS inside Airflow.

Github Issue: Add support for attachment in Email mapper

2. <content_type> tag is not supported

From Oozie docs:

From uri:oozie:email-action:0.2 one can also specify mail content type as <content_type>text/html</content_type>. “text/plain” is default.

Unfortunately, currently the EmailOperator only accepts the mime_subtype parameter. However it only works for multipart subtypes, as the operator appends the subtype to the multipart/ prefix. Therefore passing either html or plain from Oozie makes no sense.

As a result the email will always be sent with the EmailOperator's default Content-Type value, which is multipart/mixed.

Solution: Modify the Airflow's EmailOperator to support more content types.

Github Issue: Content type support in Email mapper

3. cc and bcc fields are not templated in EmailOperator

Only the 'to', 'subject' and 'html_content' fields in EmailOperator are templated. In practice this covers all fields of an Oozie email action node apart from cc and bcc.

Therefore if there is an EL function in the action node in either of these two fields which will require a Jinja expression in Airflow, it will not work - the expression will not be executed, but rather treated as a plain string.

Solution: Modify the Airflow's EmailOperator to mark more fields as template_fields.

Github Issue: The CC: and BCC: fields are not templated in EmailOperator

MapReduce Example

Prerequisites

Make sure to first copy examples/mapreduce/configuration.template.properties, rename it as configuration.properties and fill in with configuration data.

Running

The MapReduce example can be run as:

o2a -i examples/mapreduce -o output/mapreduce

Output

In this example the output will be created in the ./output/mapreduce/ folder.

The converted DAG uses the DataProcHadoopOperator in Airflow.

Known limitations

1. Exit status not available

From the Oozie documentation:

The counters of the Hadoop job and job exit status (FAILED, KILLED or SUCCEEDED) must be available to the workflow job after the Hadoop jobs ends. This information can be used from within decision nodes and other actions configurations.

Currently we use the DataProcHadoopOperator which does not store the job exit status in an XCOM for other tasks to use.

Issue in Github: Implement exit status and counters in MapReduce Action

2. Configuration options

From the Oozie documentation (the strikethrough is from us):

Hadoop JobConf properties can be specified as part of

~~the config-default.xml or~~

~~JobConf XML file bundled with the workflow application or~~

~~<global> tag in workflow definition or~~

Inline map-reduce action configuration or

~~An implementation of OozieActionConfigurator specified by the tag in workflow definition.~~

Currently the only supported way of configuring the map-reduce action is with the inline action configuration, i.e. using the <configuration> tag in the workflow's XML file definition.

Issues in Github:

3. Streaming and pipes

Streaming and pipes are currently not supported.

Issue in github Implement streaming support

FS Example

Prerequisites

Make sure to first copy examples/fs/configuration.template.properties, rename it as configuration.properties and fill in with configuration data.

Running

The FS example can be run as:

o2a -i examples/fs -o output/fs

Output

In this example the output will be created in the ./output/fs/ folder.

The converted DAG uses the BashOperator in Airflow.

Known limitations

Not all FS operations are currently idempotent. It's not a problem if prepare action is used in other tasks but might be a problem in certain situations. Fixing the operators to be idempotent requires more complex logic and support for Pig actions is missing currently.

Issue in Github: FS Mapper and idempotence

The dirFiles are not supported in FSMapper.

Issue in Github: Add support for dirFiles in FsMapper

Java Example

Prerequisites

Make sure to first copy examples/fs/configuration.template.properties, rename it as configuration.properties and fill in with configuration data.

Running

The Java example can be run as:

o2a -i examples/java -o output/java

Output

In this example the output will be created in the ./output/java/ folder.

The converted DAG uses the DataProcHadoopOperator in Airflow.

Known limitations

Overriding action's Main class via oozie.launcher.action.main.class is not implemented.

Issue in Github: Override Java main class with property

Pig Example

Prerequisites

Make sure to first copy examples/pig/configuration.template.properties, rename it as configuration.properties and fill in with configuration data.

Running

The Pig example can be run as:

o2a -i examples/pig -o output/pig

Output

In this example the output will be created in the ./output/pig/ folder.

The converted DAG uses the DataProcPigOperator in Airflow.

Known limitations

1. Configuration options

From the Oozie documentation (the strikethrough is from us):

Hadoop JobConf properties can be specified as part of

~~the config-default.xml or~~

~~JobConf XML file bundled with the workflow application or~~

~~<global> tag in workflow definition or~~

Inline pig action configuration.

Currently the only supported way of configuring the pig action is with the inline action configuration, i.e. using the <configuration> tag in the workflow's XML file definition.

Issues in Github:

Shell Example

Prerequisites

Make sure to first copy examples/shell/configuration.template.properties, rename it as configuration.properties and fill in with configuration data.

Running

The Shell example can be run as:

o2a -i examples/shell -o output/shell

Output

In this example the output will be created in the ./output/shell/ folder.

The converted DAG uses the BashOperator in Airflow, which executes the desired shell action with Pig by invoking gcloud dataproc jobs submit pig --cluster=<cluster> --region=<region> --execute 'sh <action> <args>'.

Known limitations

1. Exit status not available

From the Oozie documentation:

The output (STDOUT) of the Shell job can be made available to the workflow job after the Shell job ends. This information could be used from within decision nodes.

Currently we use the BashOperator which can store only the last line of the job output in an XCOM. In this case the line is not helpful as it relates to the Dataproc job submission status and not the Shell action's result.

Issue in Github: Finalize shell mapper

2. No Shell launcher configuration

From the Oozie documentation:

Shell launcher configuration can be specified with a file, using the job-xml element, and inline, using the configuration elements.

Currently there is no way specify the shell launcher configuration (it is ignored).

Issue in Github: Shell Launcher Configuration

Spark Example

Prerequisites

Make sure to first copy /examples/spark/configuration.template.properties, rename it as configuration.properties and fill in with configuration data.

Running

The Spark example can be run as:

o2a -i examples/spark -o output/spark

Output

In this example the output will be created in the ./output/spark/ folder.

The converted DAG uses the DataProcSparkOperator in Airflow.

Known limitations

1. Only tasks written in Java are supported

From the Oozie documentation:

The jar element indicates a comma separated list of jars or python files.

The solution was tested with only a single Jar file.

2. No Spark launcher configuration

From the Oozie documentation:

Shell launcher configuration can be specified with a file, using the job-xml element, and inline, using the configuration elements.

Currently there is no way to specify the Spark launcher configuration (it is ignored).

3. Not all elements are supported

The following elements are not supported: job-tracker, name-node, master, mode.

Sub-workflow Example

Prerequisites

Make sure to first copy examples/subwf/configuration.template.properties, rename it as configuration.properties and fill in with configuration data.

Running

The Sub-workflow example can be run as:

o2a -i examples/subwf -o output/subwf

Output

In this example the output (together with sub-worfklow dag) will be created in the ./output/subwf/ folder.

The converted DAG uses the SubDagOperator in Airflow.

Known limitations

No known limitations.

DistCp Example

Prerequisites

Make sure to first copy examples/distcp/configuration.template.properties, rename it as configuration.properties and fill in with configuration data.

Running

The DistCp example can be run as:

o2a -i examples/distcp -o output/distcp

Output

In this example the output will be created in the ./output/distcp/ folder.

The converted DAG uses the BashOperator in Airflow, which submits the Hadoop DistCp job using the gcloud dataproc jobs submit hadoop command.

Known limitations

The system test of the example run with Oozie fails due to unknown reasons. The converted DAG run by Airflow completes successfully.

Decision Example

Prerequisites

Make sure to first copy examples/decision/configuration.template.properties, rename it as configuration.properties and fill in with configuration data.

Running

The decision example can be run as:

o2a -i examples/decision -o output/decision

Output

In this example the output will be created in the ./output/decision/ folder.

The converted DAG uses the BranchPythonOperator in Airflow.

Known limitations

Github issue: Implement decision node

Hive/Hive2 Example

Prerequisites

Make sure to first copy /examples/hive/configuration.template.properties, rename it as configuration.properties and fill in with configuration data.

Running

The Hive example can be run as:

o2a -i examples/hive -o output/hive

Output

In this example the output will be created in the ./output/hive/ folder.

The converted DAG uses the DataProcHiveOperator in Airflow.

Known limitations

1. Only the connection to the local Hive instance is supported.

Connection configuration options are not supported.

2. Not all elements are supported

For Hive, the following elements are not supported: job-tracker, name-node. For Hive2, the following elements are not supported: job-tracker, name-node, jdbc-url, password.

The Github issue for both problems: Hive connection configuration and other elements

Demo Example

The demo example contains several action and control nodes. The control nodes are fork, join, decision, start, end, and kill. As far as action nodes go, there are fs, map-reduce, and pig.

Most of these are already supported, but when the program encounters a node it does not know how to parse, it will perform a sort of "skeleton transformation" - it will convert all the unknown nodes to dummy nodes. This will allow users to manually parse the nodes if they so wish as the control flow is there.

Prerequisites

Make sure to first copy examples/demo/configuration.template.properties, rename it as configuration.properties and fill in with configuration data.