pyspark-ds-toolbox

A PySpark companion for data science tasks.


Keywords
data-science, spark
License
GPL-3.0-only
Install
pip install pyspark-ds-toolbox==0.4.3

Documentation

Pyspark DS Toolbox

Lifecycle: experimental

The package aims to provide a set of tools that help with the daily work of data science in Spark. The documentation can be found here. Feel free to contribute :)

Installation

Directly from PyPI:

pip install pyspark-ds-toolbox

Or from GitHub (note that this installs the latest development version):

pip install git+https://github.com/viniciusmsousa/pyspark-ds-toolbox.git
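After either install command, it can be useful to confirm which version actually ended up in the environment, since the GitHub install may differ from the PyPI release. The following is a minimal sketch using only the Python standard library; the helper name `installed_version` is hypothetical and not part of the package.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_version(dist_name: str = "pyspark-ds-toolbox"):
    """Return the installed version string, or None if the distribution is absent.

    `installed_version` is an illustrative helper, not part of pyspark-ds-toolbox.
    """
    try:
        return version(dist_name)
    except PackageNotFoundError:
        # The distribution is not installed in the current environment.
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("pyspark-ds-toolbox is not installed")
    else:
        print(f"pyspark-ds-toolbox {found}")
```

The lookup reads the installed distribution metadata only, so it does not import the package or start a Spark session.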

Organization

The package is currently organized in a structure based on the nature of the task, such as data wrangling, model/prediction evaluation, and so on.

pyspark_ds_toolbox         # Main package
├─ causal_inference           # Sub-package dedicated to causal inference
│  ├─ diff_in_diff.py
│  └─ ps_matching.py
├─ ml                         # Sub-package dedicated to ML
│  ├─ data_prep                  # Sub-package with ML data preparation tools
│  │  ├─ class_weights.py
│  │  └─ features_vector.py
│  ├─ classification             # Sub-package dedicated to classification tasks
│  │  ├─ eval.py
│  │  └─ baseline_classifiers.py
│  └─ feature_importance         # Sub-package with feature importance tools
│     ├─ native_spark.py
│     └─ shap_values.py
├─ wrangling                  # Sub-package dedicated to data wrangling tasks
│  ├─ reshape.py
│  └─ data_quality.py
└─ stats                      # Sub-package dedicated to basic statistics functionality
   └─ association.py
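The tree above maps directly onto dotted import paths. As a rough sketch, the snippet below lists those paths and reports which of them can be located in the current environment. The module names are taken from the tree; the helpers `is_importable` and `availability` are hypothetical and not part of the package, and no function inside the modules is assumed.

```python
import importlib.util

# Dotted module paths derived from the package tree.
MODULES = [
    "pyspark_ds_toolbox.causal_inference.diff_in_diff",
    "pyspark_ds_toolbox.causal_inference.ps_matching",
    "pyspark_ds_toolbox.ml.data_prep.class_weights",
    "pyspark_ds_toolbox.ml.data_prep.features_vector",
    "pyspark_ds_toolbox.ml.classification.eval",
    "pyspark_ds_toolbox.ml.classification.baseline_classifiers",
    "pyspark_ds_toolbox.ml.feature_importance.native_spark",
    "pyspark_ds_toolbox.ml.feature_importance.shap_values",
    "pyspark_ds_toolbox.wrangling.reshape",
    "pyspark_ds_toolbox.wrangling.data_quality",
    "pyspark_ds_toolbox.stats.association",
]


def is_importable(name: str) -> bool:
    """Return True if the module `name` can be located.

    find_spec imports the parent packages of a dotted name, but not the
    target module itself. Illustrative helper, not part of the package.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # A parent package is missing or could not be imported.
        return False


def availability(modules=MODULES) -> dict:
    """Map each dotted module path to whether it can be located."""
    return {name: is_importable(name) for name in modules}


if __name__ == "__main__":
    for name, ok in availability().items():
        print(f"{'ok' if ok else 'missing':<8}{name}")
```

If the package is installed, every entry should report `ok`; a `missing` entry may indicate a version whose layout differs from the tree shown here.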