Workflow

The Lightweight workflow orchestration with less dependencies the was created for easy to make a simple metadata driven for data workflow orchestration. It can to use for data operator by a .yaml template.

Warning

This package provide only orchestration workload task. That mean you should not use the workflow stage to process any large volume data which use lot of compute resource. 😰

In my opinion, I think it should not create duplicate workflow codes if I can write with dynamic input parameters on the one template workflow that just change the input parameters per use-case instead. This way I can handle a lot of logical workflows in our orgs with only metadata configuration. It called Metadata Driven Data Workflow.

📌 Rules of This Workflow engine:

Minimum frequency unit of scheduling is 1 minute ⚠️
Can not re-run only failed stage and its pending downstream 🚨
All parallel tasks inside workflow engine use Multi-Threading (Because Python 3.13 unlock GIL 🔓)

Note

Disclaimer: I inspire the dynamic statement from the GitHub Action with .yml files and all of config file from several data orchestration framework tools from my experience on Data Engineer. 😬

Other workflow that I interest on them and pick some interested feature to this package:

📍 Installation

This project need ddeutil-io extension namespace packages. If you want to install this package with application add-ons, you should add app in installation;

Usecase	Install Optional	Support
Python & CLI	`pip install ddeutil-workflow`	✔️
FastAPI Server	`pip install ddeutil-workflow[api]`	✔️

I added this feature to the main milestone.

🥚 Docker Images supported:

Docker Image Python Version Support

ddeutil-workflow:latest 3.9 ❌

ddeutil-workflow:python3.10 3.10 ❌

ddeutil-workflow:python3.11 3.11 ❌

ddeutil-workflow:python3.12 3.12 ❌

ddeutil-workflow:python3.12 3.13 ❌

Docker Image	Python Version	Support
ddeutil-workflow:latest	`3.9`	❌
ddeutil-workflow:python3.10	`3.10`	❌
ddeutil-workflow:python3.11	`3.11`	❌
ddeutil-workflow:python3.12	`3.12`	❌
ddeutil-workflow:python3.12	`3.13`	❌

🍻 Usage

This is examples that use workflow file for running common Data Engineering use-case.

Important

I recommend you to use the hook stage for all actions that you want to do with workflow activity that you want to orchestrate. Because it able to dynamic an input argument with the same hook function that make you use less time to maintenance your data workflows.

run-py-local:

   # Validate model that use to parsing exists for template file
   type: ddeutil.workflow.Workflow
   on:
      # If workflow deploy to schedule, it will running every 5 minutes
      # with Asia/Bangkok timezone.
      - cronjob: '*/5 * * * *'
        timezone: "Asia/Bangkok"
   params:
      # Incoming execution parameters will validate with this type. It allow
      # to set default value or templating.
      source-extract: str
      run-date: datetime
   jobs:
      getting-api-data:
         stages:
            - name: "Retrieve API Data"
              id: retrieve-api
              uses: tasks/get-api-with-oauth-to-s3@requests
              with:
                 # Arguments of source data that want to retrieve.
                 method: post
                 url: https://finances/open-data/currency-pairs/
                 body:
                    resource: ${{ params.source-extract }}

                    # You can able to use filtering like Jinja template but this
                    # package does not use it.
                    filter: ${{ params.run-date | fmt(fmt='%Y%m%d') }}
                 auth:
                    type: bearer
                    keys: ${API_ACCESS_REFRESH_TOKEN}

                 # Arguments of target data that want to landing.
                 writing_mode: flatten
                 aws_s3_path: my-data/open-data/${{ params.source-extract }}

                 # This Authentication code should implement with your custom hook
                 # function. The template allow you to use environment variable.
                 aws_access_client_id: ${AWS_ACCESS_CLIENT_ID}
                 aws_access_client_secret: ${AWS_ACCESS_CLIENT_SECRET}

The above workflow template is main executor pipeline that you want to do. If you want to schedule this workflow, you want to dynamic its parameters change base on execution time such as run-date should change base on that workflow running date.

So, this package provide the Schedule template for this action.

schedule-run-local-wf:

   # Validate model that use to parsing exists for template file
   type: ddeutil.workflow.scheduler.Schedule
   workflows:

      # Map existing workflow that want to deploy with scheduler application.
      # It allow you to passing release parameter that dynamic change depend the
      # current context of this scheduler application releasing that time.
      - name: run-py-local
        params:
          source-extract: "USD-THB"
          asat-dt: "${{ release.logical_date }}"

🍪 Configuration

The main configuration that use to dynamic changing with your propose of this application. If any configuration values do not set yet, it will use default value and do not raise any error to you.

Environment	Component	Default	Description
`WORKFLOW_ROOT_PATH`	Core	.	The root path of the workflow application
`WORKFLOW_CORE_REGISTRY`	Core	src.ddeutil.workflow,tests.utils	List of importable string for the hook stage
`WORKFLOW_CORE_REGISTRY_FILTER`	Core	ddeutil.workflow.utils	List of importable string for the filter template
`WORKFLOW_CORE_PATH_CONF`	Core	conf	The config path that keep all template `.yaml` files
`WORKFLOW_CORE_TIMEZONE`	Core	Asia/Bangkok	A Timezone string value that will pass to `ZoneInfo` object
`WORKFLOW_CORE_STAGE_DEFAULT_ID`	Core	true	A flag that enable default stage ID that use for catch an execution output
`WORKFLOW_CORE_STAGE_RAISE_ERROR`	Core	true	A flag that all stage raise StageException from stage execution
`WORKFLOW_CORE_MAX_NUM_POKING`	Core	4
`WORKFLOW_CORE_MAX_JOB_PARALLEL`	Core	2	The maximum job number that able to run parallel in workflow executor
`WORKFLOW_LOG_DEBUG_MODE`	Log	true	A flag that enable logging with debug level mode
`WORKFLOW_LOG_ENABLE_WRITE`	Log	true	A flag that enable logging object saving log to its destination
`WORKFLOW_APP_PROCESS_WORKER`	Schedule	2	The maximum process worker number that run in scheduler app module
`WORKFLOW_APP_SCHEDULE_PER_PROCESS`	Schedule	100	A schedule per process that run parallel
`WORKFLOW_APP_STOP_BOUNDARY_DELTA`	Schedule	'{"minutes": 5, "seconds": 20}'	A time delta value that use to stop scheduler app in json string format

API Application:

Environment	Component	Default	Description
`WORKFLOW_API_ENABLE_ROUTE_WORKFLOW`	API	true	A flag that enable workflow route to manage execute manually and workflow logging
`WORKFLOW_API_ENABLE_ROUTE_SCHEDULE`	API	true	A flag that enable run scheduler

🚀 Deployment

This package able to run as a application service for receive manual trigger from the master node via RestAPI or use to be Scheduler background service like crontab job but via Python API.

Schedule App

(venv) $ ddeutil-workflow schedule

API Server

(venv) $ uvicorn src.ddeutil.workflow.api:app --host 127.0.0.1 --port 80

Note

If this package already deploy, it able to use uvicorn ddeutil.workflow.api:app --host 127.0.0.1 --port 80 --workers 4

ddeutil-workflow
Release 0.0.6

Release 0.0.6

0.0.15

0.0.14

0.0.13

0.0.12

0.0.11

0.0.10

0.0.9

0.0.8

0.0.7

0.0.6

Documentation

Workflow

📍 Installation

🍻 Usage

🍪 Configuration

🚀 Deployment

Schedule App

API Server

Stats

Development practices

Releases

Contributors

ddeutil-workflow Release 0.0.6

Release 0.0.6 Toggle Dropdown 0.0.15 0.0.14 0.0.13 0.0.12 0.0.11 0.0.10 0.0.9 0.0.8 0.0.7 0.0.6

Documentation

Workflow

📍 Installation

🍻 Usage

🍪 Configuration

🚀 Deployment

Schedule App

API Server

Stats

Development practices

Releases

Contributors

ddeutil-workflow
Release 0.0.6

Release 0.0.6

0.0.15

0.0.14

0.0.13

0.0.12

0.0.11

0.0.10

0.0.9

0.0.8

0.0.7

0.0.6