Sparkly
Helpers & syntax sugar for PySpark. There are several features to make your life easier:
- Definition of Spark packages, external JARs, UDFs and Spark options within your code;
- Simplified reader/writer API for Cassandra, Elasticsearch, MySQL and Kafka;
- Testing framework for Spark applications.
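For example, packages, Spark options and UDFs are all declared on the session class. Here is a minimal sketch based on the feature list above (the UDF itself is illustrative; see the documentation for all supported declaration formats):

```python
from pyspark.sql.types import IntegerType

from sparkly import SparklySession


def multiply_by_two(x):
    return x * 2


class MySession(SparklySession):
    # Maven packages resolved when the session starts.
    packages = ['datastax:spark-cassandra-connector:2.0.0-M2-s_2.11']

    # Regular Spark options passed to the underlying SparkSession.
    options = {'spark.sql.shuffle.partitions': '10'}

    # Python UDFs declared as (callable, return type) pairs.
    udfs = {
        'multiply_by_two': (multiply_by_two, IntegerType()),
    }
```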
More details can be found in the official documentation.
Installation
Sparkly itself is easy to install:

```
pip install sparkly
```
The tricky part is pyspark. There is no official distribution on PyPI. As a workaround we can suggest:

- Use the env variable PYTHONPATH to point to your Spark installation, something like:

  ```
  export PYTHONPATH="/usr/local/spark/python/lib/pyspark.zip:/usr/local/spark/python/lib/py4j-0.10.4-src.zip"
  ```

- Use our setup.py file for pyspark. Just add this to your requirements.txt:

  ```
  -e git+https://github.com/Tubular/spark@branch-2.1.0#egg=pyspark&subdirectory=python
  ```

Here at Tubular, we published pyspark to our internal PyPI repository.
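Whichever workaround you pick, a quick sanity check confirms that pyspark is importable (the reported version depends on your Spark installation):

```python
# Sanity check: pyspark should import cleanly and report its version.
import pyspark

print(pyspark.__version__)
```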
Getting Started
Here is a small code snippet that shows how to read a Cassandra table and write its content to an Elasticsearch index:
```python
from sparkly import SparklySession


class MySession(SparklySession):
    packages = [
        'datastax:spark-cassandra-connector:2.0.0-M2-s_2.11',
        'org.elasticsearch:elasticsearch-spark-20_2.11:6.5.4',
    ]


if __name__ == '__main__':
    spark = MySession()
    df = spark.read_ext.cassandra('localhost', 'my_keyspace', 'my_table')
    df.write_ext.elastic('localhost', 'my_index', 'my_type')
```
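The same read can also be expressed through the URL-based reader; a short sketch, assuming the by_url helper and URL format described in the documentation:

```python
# Equivalent read via the universal URL-based API; check the docs
# for the exact URL formats supported per data source.
df = spark.read_ext.by_url('cassandra://localhost/my_keyspace/my_table')
```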
See the online documentation for more details.
Testing
To run the tests, you need docker and docker-compose installed on your system. If you are working on macOS, we highly recommend using docker-machine. Once the tools mentioned above are installed, all you need to run is:

```
make test
```
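To use the testing framework in your own project, a test case might look like this (a minimal sketch assuming the SparklyGlobalSessionTest base class and assertDataFrameEqual helper from the documentation; my_app.session is a hypothetical module):

```python
from sparkly.testing import SparklyGlobalSessionTest

from my_app.session import MySession  # hypothetical module holding your session class


class TestMyTransformation(SparklyGlobalSessionTest):
    # One session instance is shared across the whole test suite.
    session = MySession

    def test_roundtrip(self):
        df = self.spark.createDataFrame([('a', 1)], ['key', 'value'])
        # Compare a DataFrame against plain Python data.
        self.assertDataFrameEqual(df, [{'key': 'a', 'value': 1}])
```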
Supported Spark Versions
At the moment we support:
| Sparkly version | Spark version                            |
|-----------------|------------------------------------------|
| sparkly >= 2.7  | Spark 2.4.x                              |
| sparkly 2.x     | Spark 2.0.x, Spark 2.1.x and Spark 2.2.x |
| sparkly 1.x     | Spark 1.6.x                              |