sparkit

A package for PySpark utility functions.


Keywords
sparkit
License
BSD-3-Clause
Install
pip install sparkit==1.1.1


About

A package of utility functions for PySpark.

Installation

sparkit is available on PyPI for Python 3.8+ and Spark 3 (Java 11):

pip install sparkit

Examples

Join multiple data frames on a common key (pass single data frames and/or iterables of data frames):

>>> import sparkit
>>> from pyspark.sql import Row, SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> df1 = spark.createDataFrame([Row(id=1, x="a"), Row(id=2, x="b")])
>>> df2 = spark.createDataFrame([Row(id=1, y="c"), Row(id=2, y="d")])
>>> df3 = spark.createDataFrame([Row(id=1, z="e"), Row(id=2, z="f")])
>>> sparkit.join([df1, df2], df3, on="id").show()
+---+---+---+---+
| id|  x|  y|  z|
+---+---+---+---+
|  1|  a|  c|  e|
|  2|  b|  d|  f|
+---+---+---+---+
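
For readers wondering how a variadic join like this can work, here is a minimal sketch. It is not sparkit's actual implementation; `flatten_frames` and `join_all` are hypothetical names, and the `how="inner"` default is an assumption. The sketch flattens the mixed arguments and folds PySpark's `DataFrame.join` over them from left to right.

```python
from functools import reduce


def flatten_frames(*items):
    """Yield data frames in order, unpacking any list or tuple arguments."""
    for item in items:
        if isinstance(item, (list, tuple)):
            # Recurse so nested iterables are unpacked as well.
            yield from flatten_frames(*item)
        else:
            yield item


def join_all(*dataframes, on, how="inner"):
    """Left-fold DataFrame.join over every data frame passed in.

    Hypothetical helper: relies only on each object having a
    ``join(other, on=..., how=...)`` method, as PySpark DataFrames do.
    """
    return reduce(
        lambda left, right: left.join(right, on=on, how=how),
        flatten_frames(*dataframes),
    )
```

With the data frames defined above, `join_all([df1, df2], df3, on="id")` performs the same chained calls as `df1.join(df2, on="id", how="inner").join(df3, on="id", how="inner")`. Note that `reduce` raises a `TypeError` when no data frame is passed, so a production version would likely validate its input first.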

Union multiple data frames by name (pass single data frames and/or iterables of data frames):

>>> import sparkit
>>> from pyspark.sql import Row, SparkSession
>>> spark = SparkSession.builder.getOrCreate()
>>> df1 = spark.createDataFrame([Row(x=1, y=2), Row(x=3, y=4)])
>>> df2 = spark.createDataFrame([Row(x=5, y=6), Row(x=7, y=8)])
>>> df3 = spark.createDataFrame([Row(x=0, y=1), Row(x=2, y=3)])
>>> df4 = spark.createDataFrame([Row(x=5, y=3), Row(x=9, y=6)])
>>> sparkit.union(df1, [df2, df3], df4).show()
+---+---+
|  x|  y|
+---+---+
|  1|  2|
|  3|  4|
|  5|  6|
|  7|  8|
|  0|  1|
|  2|  3|
|  5|  3|
|  9|  6|
+---+---+
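
A union by name over several data frames can be sketched in the same spirit. This is an illustration under stated assumptions, not sparkit's own code: `union_all` is a hypothetical name, and the sketch unpacks only one level of lists or tuples before folding PySpark's `DataFrame.unionByName` over the result.

```python
from functools import reduce


def union_all(*dataframes):
    """Union data frames by column name, in the order given.

    Hypothetical helper: accepts single data frames and/or lists or
    tuples of data frames (one level deep only).
    """
    frames = []
    for item in dataframes:
        if isinstance(item, (list, tuple)):
            frames.extend(item)  # unpack an iterable of data frames
        else:
            frames.append(item)  # keep a single data frame as is
    # unionByName matches columns by name rather than by position.
    return reduce(lambda top, bottom: top.unionByName(bottom), frames)
```

With the data frames defined above, `union_all(df1, [df2, df3], df4)` is equivalent to `df1.unionByName(df2).unionByName(df3).unionByName(df4)`. Like SQL `UNION ALL`, `unionByName` keeps duplicate rows.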

Contributing to sparkit

Your contribution is greatly appreciated! See the following links to help you get started:

License

sparkit was created by sparkit Developers. It is licensed under the terms of the BSD 3-Clause license.