Pyspark DS Toolbox
The goal of this package is to provide a set of tools that help with day-to-day data science work in Spark. The documentation can be found here. Feel free to contribute :)
Installation
Directly from PyPI:
pip install pyspark-ds-toolbox
Or from GitHub (note that this installs the latest development version):
pip install git+https://github.com/viniciusmsousa/pyspark-ds-toolbox.git
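To confirm that the install worked, a minimal sketch along these lines may help. It assumes the distribution is registered under the name `pyspark-ds-toolbox` (as in the pip command above); the helper `installed_version` is hypothetical and not part of the package.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution name taken from the pip command above (an assumption for git installs).
version = installed_version("pyspark-ds-toolbox")
if version is None:
    print("pyspark-ds-toolbox is not installed in this environment")
else:
    print(f"pyspark-ds-toolbox {version} is installed")
```

This only inspects package metadata, so it does not start a Spark session or import any of the sub-packages.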
Organization
The package is currently organized in a structure based on the nature of the task, such as data wrangling, model/prediction evaluation, and so on.
pyspark_ds_toolbox              # Main package
├── causal_inference            # Sub-package dedicated to causal inference
│   ├── diff_in_diff.py
│   └── ps_matching.py
├── ml                          # Sub-package dedicated to ML
│   ├── data_prep               # Sub-package with ML data preparation tools
│   │   ├── class_weights.py
│   │   └── features_vector.py
│   ├── classification          # Sub-package dedicated to classification tasks
│   │   ├── eval.py
│   │   └── baseline_classifiers.py
│   └── feature_importance      # Sub-package with feature importance tools
│       ├── native_spark.py
│       └── shap_values.py
├── wrangling                   # Sub-package dedicated to data wrangling tasks
│   ├── reshape.py
│   └── data_quality.py
└── stats                       # Sub-package dedicated to basic statistical functionality
    └── association.py
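The layout above maps directly onto dotted import paths. The sketch below illustrates that mapping; it assumes each `.py` file is importable as a module under `pyspark_ds_toolbox`, and the names `LAYOUT` and `module_paths` are hypothetical helpers introduced only for this illustration. No function names inside the modules are assumed.

```python
import importlib
import importlib.util
from typing import Dict, List

# Sub-package -> modules, copied from the directory tree above.
LAYOUT: Dict[str, List[str]] = {
    "causal_inference": ["diff_in_diff", "ps_matching"],
    "ml.data_prep": ["class_weights", "features_vector"],
    "ml.classification": ["eval", "baseline_classifiers"],
    "ml.feature_importance": ["native_spark", "shap_values"],
    "wrangling": ["reshape", "data_quality"],
    "stats": ["association"],
}


def module_paths(root: str = "pyspark_ds_toolbox") -> List[str]:
    """Build the dotted import path for every module in LAYOUT."""
    return [f"{root}.{sub}.{mod}" for sub, mods in LAYOUT.items() for mod in mods]


if importlib.util.find_spec("pyspark_ds_toolbox") is None:
    # Package missing: just show the paths that would be imported.
    print("\n".join(module_paths()))
else:
    for path in module_paths():
        try:
            importlib.import_module(path)
            print(f"ok      {path}")
        except ImportError as exc:
            # A module may depend on extras that are not installed.
            print(f"failed  {path}: {exc}")
```

For example, the class-weight helpers would live under `pyspark_ds_toolbox.ml.data_prep.class_weights`, following the tree.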