Spark Py Submit
A Python library that can submit Spark jobs to a Spark cluster (YARN or standalone) using the REST API.
See also: github.com/bernhard-42/spark-yarn-rest-api
Getting Started:
Use the library
import logging

# Import the SparkJobHandler
from spark_job_handler import SparkJobHandler

logger = logging.getLogger('TestLocalJobSubmit')

# Create a Spark job
# job_name: name of the Spark job
# jar: location of the jar (local/HDFS)
# run_class: entry class of the application
# hadoop_rm: Hadoop ResourceManager host IP
# hadoop_web_hdfs: Hadoop WebHDFS IP
# hadoop_nn: Hadoop NameNode IP (normally the same as web_hdfs)
# env_type: environment type, either CDH or HDP
# local_jar: flag marking the jar as local (a local jar gets uploaded to HDFS)
# spark_properties: custom Spark properties that need to be set
sparkJob = SparkJobHandler(logger=logger, job_name="test_local_job_submit",
                           jar="./simple-project/target/scala-2.10/simple-project_2.10-1.0.jar",
                           run_class="IrisApp", hadoop_rm='rma', hadoop_web_hdfs='nn', hadoop_nn='nn',
                           env_type="CDH", local_jar=True, spark_properties=None)
trackingUrl = sparkJob.run()
print("Job Tracking URL: %s" % trackingUrl)
Build the simple-project
$ cd simple-project
$ sbt package; cd ..
The above steps will create the target jar as:
./simple-project/target/scala-2.10/simple-project_2.10-1.0.jar
Update the node IPs in the test script:
Download the data and make it available in HDFS:
$ wget https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data
Upload the data to HDFS:
$ python upload_to_hdfs.py <name_node_ip> iris.data /tmp/iris.data
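For reference, what upload_to_hdfs.py does can also be done by hand with the standard two-step WebHDFS CREATE flow. The sketch below is illustrative, not the utility's actual code; the webhdfs_upload helper is hypothetical, and it assumes WebHDFS listens on the default port 50070:

import requests

def webhdfs_upload(name_node_ip, local_path, hdfs_path, user='hdfs'):
    # Hypothetical helper illustrating the documented WebHDFS CREATE flow.
    # Step 1: ask the NameNode where to write; it answers with a 307
    # redirect pointing at a DataNode URL.
    url = ('http://%s:50070/webhdfs/v1%s?op=CREATE&overwrite=true&user.name=%s'
           % (name_node_ip, hdfs_path, user))
    r = requests.put(url, allow_redirects=False)
    datanode_url = r.headers['Location']
    # Step 2: stream the file contents to the DataNode.
    with open(local_path, 'rb') as f:
        requests.put(datanode_url, data=f).raise_for_status()

webhdfs_upload('<name_node_ip>', 'iris.data', '/tmp/iris.data')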
Run the test cases:
Make the simple-project jar available in HDFS to test remote jar:
$ python upload_to_hdfs.py <name_node_ip> simple-project/target/scala-2.10/simple-project_2.10-1.0.jar /tmp/test_data/simple-project_2.10-1.0.jar
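With the jar in HDFS, the job can be submitted without the local upload step by setting local_jar=False and pointing jar at the HDFS location. An illustrative sketch based on the example above (whether the path needs an explicit hdfs: prefix depends on the library's path handling):

# Submit using the jar already present in HDFS; no upload happens.
sparkJob = SparkJobHandler(logger=logger, job_name="test_remote_job_submit",
                           jar="/tmp/test_data/simple-project_2.10-1.0.jar",
                           run_class="IrisApp", hadoop_rm='rma', hadoop_web_hdfs='nn',
                           hadoop_nn='nn', env_type="CDH", local_jar=False,
                           spark_properties=None)
trackingUrl = sparkJob.run()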
Run the test:
$ python test_spark_job_handler.py
Utility:
- upload_to_hdfs.py: uploads a local file to the HDFS file system
Notes:
The Spark assembly jar is expected to be available in HDFS at hdfs:/user/spark/share/lib/spark-assembly.jar.
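A quick way to verify the assembly jar is in place is a WebHDFS GETFILESTATUS call; a minimal sketch, again assuming WebHDFS on port 50070 and a NameNode host of nn:

import requests

# GETFILESTATUS returns 200 with file metadata when the path exists,
# and 404 when it does not.
r = requests.get('http://nn:50070/webhdfs/v1/user/spark/share/lib/'
                 'spark-assembly.jar?op=GETFILESTATUS')
print('spark-assembly.jar found' if r.ok else 'missing (HTTP %d)' % r.status_code)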