airflow-provider-xlsx

Airflow operators for reading and writing XLSX files


Keywords
airflow, apache-airflow, excel, parquet
License
Apache-2.0
Install
pip install airflow-provider-xlsx==1.0.1

Documentation

Airflow Provider XLSX

Apache Airflow operators for converting XLSX files from/to Parquet, CSV and JSON.

Build Status PyPI version PyPI Downloads Code style: black

System Requirements

  • Airflow Versions
    • 2.0 or newer

Installation

$ pip install airflow-provider-xlsx

Operators

FromXLSXOperator

Read an XLSX or XLS file and convert it into Parquet, CSV, JSON, JSON Lines(one line per record) file.

API Documentation

Example

XLSX Source

image

Airflow Task

from xlsx_provider.operators.from_xlsx_operator import FromXLSXOperator

xlsx_to_jsonl = FromXLSXOperator(
   task_id='xlsx_to_jsonl',
   source='{{ var.value.tmp_path }}/test.xlsx',
   target='{{ var.value.tmp_path }}/test.jsonl',
   file_format='jsonl',
   dag=dag
)

JSON Lines Output

{"month": "Jan", "high": -12.2, "mean": -16.2, "low": -20.1, "precipitation": 19}
{"month": "Feb", "high": -10.3, "mean": -14.7, "low": -19.1, "precipitation": 14}
{"month": "Mar", "high": -2.6, "mean": -7.2, "low": -11.8, "precipitation": 15}
{"month": "Apr", "high": 8.1, "mean": 3.2, "low": -1.7, "precipitation": 24}
{"month": "May", "high": 17.5, "mean": 11.6, "low": 5.6, "precipitation": 36}
{"month": "Jun", "high": 24, "mean": 18.2, "low": 12.3, "precipitation": 58}
{"month": "Jul", "high": 25.7, "mean": 20.2, "low": 14.7, "precipitation": 72}
{"month": "Aug", "high": 22.2, "mean": 17, "low": 11.7, "precipitation": 66}
{"month": "Sep", "high": 16.6, "mean": 11.5, "low": 6.4, "precipitation": 44}
{"month": "Oct", "high": 6.8, "mean": 3.4, "low": 0, "precipitation": 38}

FromXLSXQueryOperator

Execute an SQL query an XLSX/XLS file and export the result into a Parquet or CSV file

This operators loads an XLSX or XLS file into an in-memory SQLite database, executes a query on the db and stores the result into a Parquet, CSV, JSON, JSON Lines(one line per record) file. The output columns names and types are determinated by the SQL query output.

API Documentation

Example

XLSX Source

image

SQL Query

 select
     g as high_tech_sector,
     h as eur_bilion,
     i as share
 from
     high_tech
 where
     _index > 1
     and high_tech_sector <> ''
     and lower(high_tech_sector) <> 'total'

Airflow Task

from xlsx_provider.operators.from_xlsx_query_operator import FromXLSXQueryOperator

xlsx_to_csv = FromXLSXQueryOperator(
   task_id='xlsx_to_csv',
   source='{{ var.value.tmp_path }}/high_tech.xlsx',
   target='{{ var.value.tmp_path }}/high_tech.parquet',
   file_format='csv',
   csv_delimiter=',',
   table_name='high_tech',
   worksheet='Figure 3',
   query='''
       select
           g as high_tech_sector,
           h as eur_bilion,
           i as share
       from
           high_tech
       where
           _index > 1
           and high_tech_sector <> ''
           and lower(high_tech_sector) <> 'total'
   ''',
   dag = dag
)

Output

high_tech_sector,value,share
Pharmacy,78280,0.231952169555313
Electronics-telecommunications,75243,0.222954583130376
Scientific instruments,64010,0.189670433253542
Aerospace,44472,0.131776952366115
Computers office machines,21772,0.0645136852766778
Non-electrical machinery,20813,0.0616714981835167
Chemistry,19776,0.058598734453222
Electrical machinery,9730,0.028831912195612
Armament,3384,0.0100300315856265

ToXLSXOperator

Read a Parquest, CSV, JSON, JSON Lines(one line per record) file and convert it into XLSX.

API Documentation

Example
from xlsx_provider.operators.to_xlsx_operator import ToXLSXOperator

parquet_to_xlsx = ToXLSXOperator(
   task_id='parquet_to_xlsx',
   source='{{ var.value.tmp_path }}/test.parquet',
   target='{{ var.value.tmp_path }}/test.xlsx',
   dag=dag
)

Links