Overview
Goal
Many public clouds provide managed Apache Spark as a service, such as Databricks, AWS EMR, and Oracle OCI Data Flow; see the table below for a detailed list.
However, the way you deploy and launch a Spark application differs from one cloud Spark platform to another.
spark-etl is a Python package that provides a standard way to build, deploy, and run your Spark application across the various cloud Spark platforms.
Benefit
An application built with spark-etl can be deployed and launched on different cloud Spark platforms without changing its source code.
Application
An application is a Python program. It contains:
- A main.py file, which contains the application entry point.
- A manifest.json file, which specifies the application's metadata.
- A requirements.txt file, which specifies the application's dependencies.
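As an illustration, a minimal application directory could look like the layout below; the directory name myapp is a placeholder.

myapp/
    main.py            # application entry point, defines main()
    manifest.json      # application metadata
    requirements.txt   # Python package dependencies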
Application entry signature
In your application's main.py, you should have a main function with the following signature:

def main(spark, input_args, sysops={}):
    # your code here

- spark is the Spark session object.
- input_args is a dict holding the arguments the user specified when running the application.
- sysops holds the system options passed in; it is platform specific. The job submitter may inject a platform-specific object into sysops.
- Your main function's return value should be a JSON object; it will be returned from the job submitter to the caller.
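For a rough sense of what a complete main might look like, here is a minimal sketch; the data, column names, and return fields below are made up for illustration and are not part of spark-etl.

def main(spark, input_args, sysops={}):
    # Build a tiny DataFrame; a real application would typically read
    # its input location from input_args instead of hard-coding data.
    df = spark.createDataFrame(
        [("alice", 1), ("bob", 2)],
        ["name", "value"],
    )
    total = df.groupBy().sum("value").collect()[0][0]

    # Return a JSON-serializable dict; the job submitter passes this
    # value back to the caller.
    return {"status": "ok", "row_count": df.count(), "total_value": total}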
Here is an application example.
Build your application
etl -a build -c <config-filename> -p <application-name>
Deploy your application
etl -a deploy -c <config-filename> -p <application-name> -f <profile-name>
Run your application
etl -a run -c <config-filename> -p <application-name> -f <profile-name> --run-args <input-filename>
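For example, assuming a config file named config.json, an application named myapp, and a deployment profile named aws (all of these names are placeholders), the three steps could be invoked as follows:

etl -a build  -c config.json -p myapp
etl -a deploy -c config.json -p myapp -f aws
etl -a run    -c config.json -p myapp -f aws --run-args input.json

Here input.json would be a JSON file containing the dict that is passed to your main function as input_args.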
Supported platforms
| Platform | Description |
|----------|-------------|
| Apache Spark | You set up your own Apache Spark cluster. |
| PySpark | Uses the PySpark package; fully compatible with the other Spark platforms, and lets you test your pipeline on a single computer. |
| Databricks | You host your Spark cluster in Databricks. |
| Amazon AWS EMR | You host your Spark cluster in Amazon AWS EMR. |
| Google Cloud | You host your Spark cluster in Google Cloud. |
| Microsoft Azure HDInsight | You host your Spark cluster in Microsoft Azure HDInsight. |
| Oracle OCI Data Flow | You host your Spark cluster in Oracle Cloud Infrastructure, Data Flow Service. |
| IBM Cloud | You host your Spark cluster in IBM Cloud. |
Demos
- Using local PySpark, access data on local disk
- Using local PySpark, access data on AWS S3
- Using on-premise Spark, access data on HDFS
- Using on-premise Spark, access data on AWS S3
- Using AWS EMR's Spark, access data on AWS S3
- Using Oracle OCI Data Flow with an API key, access data on Object Storage
- Using Oracle OCI Data Flow with an instance principal, access data on Object Storage
APIs
Job Deployer
For job deployers, please check the wiki.
Job Submitter
For job submitters, please check the wiki.