Airtunnel is a means of supplementing Apache Airflow, a platform for workflow automation in Python which is geared towards analytics/data pipelining. It was born out of years of project experience in data science, and the hardships of running large data platforms in real-life businesses. Hence, Airtunnel is both a set of principles (read more on them in the Airtunnel introduction) and a lightweight Python library to tame your Airflow!
Why choose airtunnel?
Because you will…
1) We assume you have installed Apache Airflow in some kind of Python virtual environment. From there, simply run `pip install airtunnel` to get the package.
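For example, with the virtual environment that holds your Airflow installation activated, getting Airtunnel is a one-liner:

```bash
# inside the (activated) virtual environment that already contains Apache Airflow
pip install airtunnel
```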
2) Configure your codebase according to the Airtunnel principles: you need to add three folders for a declaration store, a scripts store and finally the data store (a sample layout is sketched after this list):
2.1) The declaration store folder has no subfolders. It is where your data asset declarations (YAML files) will reside.
2.2) The scripts store folder is where all your Python & SQL scripts to process data assets will reside. It should be broken down into the subfolders `py` for Python scripts and `sql` for SQL scripts; please further add the subfolders `sql/ddl` and `sql/dml`.
2.3) The data store folder follows a convention as well; refer to the docs on how to structure it.
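As an illustration, a project root following this layout might look like the sketch below; the top-level folder names are only examples, what matters is having the three stores and the `py`/`sql` split inside the scripts store:

```
myproject/
├── declarations/        # declaration store: data asset YAML files, no subfolders
├── scripts/             # scripts store
│   ├── py/              # Python scripts
│   └── sql/             # SQL scripts
│       ├── ddl/
│       └── dml/
└── data/                # data store, structured per the Airtunnel docs
```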
3) Configure Airtunnel by extending your existing `airflow.cfg` (as documented here):
3.1) Add the configuration section `[airtunnel]`, in which you need to add three configuration keys (a sample section is shown after this list):
- `declarations_folder`, which takes the absolute path to the folder you set up in 2.1
- `scripts_folder`, which takes the absolute path to the folder you set up in 2.2
- `data_store_folder`, which takes the absolute path to the folder you set up in 2.3 for your data store
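For illustration, the resulting section could look like the following; the paths are placeholders for the absolute paths of your own folders:

```ini
[airtunnel]
# absolute path to the declaration store folder (see 2.1)
declarations_folder = /opt/myproject/declarations
# absolute path to the scripts store folder (see 2.2)
scripts_folder = /opt/myproject/scripts
# absolute path to the data store folder (see 2.3)
data_store_folder = /opt/myproject/data
```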
Requires Python >= 3.6, Airflow >= 1.10 and Pandas >= 0.23.
We believe Airtunnel is best adopted early on in a project, which is why going with a recent Python and Airflow version makes the most sense. In the future we might run more tests and add coverage for older Airflow versions.
PySpark is supported from version 2.3 onwards.
Airtunnel's documentation is on GitHub Pages.