When we first started working with Spark at HackerRank, we realized that the outcome classes in our dataset varied considerably in size. This led to inconsistent model cross validation and training. With stratified sampling, we were able to eliminate these inconsistencies and improve overall model predictions. The goal of spark-stratifier is to provide a tool to stratify datasets for cross validation in PySpark. Its StratifiedCrossValidator class extends the current CrossValidator class in Spark.
Currently, the stratified cross validator works with binary classification problems using labels
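To illustrate what stratifying folds means, here is a minimal sketch in plain Python. It is not the library's implementation and does not use Spark. The function name `stratified_folds`, the round-robin assignment, and the 90/10 toy dataset are all assumptions made for this example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, num_folds, seed=0):
    """Assign row indices to folds so that each fold keeps roughly the
    same label proportions as the full dataset.

    Hypothetical helper for illustration only; not part of spark-stratifier.
    """
    rng = random.Random(seed)

    # Group row indices by their label.
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    folds = [[] for _ in range(num_folds)]
    for label in sorted(by_label):
        indices = by_label[label]
        rng.shuffle(indices)
        # Deal the shuffled indices out round-robin, one label at a time,
        # so every fold receives an (almost) equal share of each label.
        for position, idx in enumerate(indices):
            folds[position % num_folds].append(idx)
    return folds


# Assumed toy dataset: 90 negatives (label 0) and 10 positives (label 1).
labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, num_folds=5)

for fold_id, fold in enumerate(folds):
    positives = sum(labels[i] for i in fold)
    print(f"fold {fold_id}: {len(fold)} rows, {positives} positives")
```

With this toy dataset, every fold ends up with 20 rows, 2 of them positive, so each fold mirrors the 10% positive rate of the whole set. A plain random split gives no such guarantee: with so few positives, some folds could easily receive none at all, which is the kind of inconsistency described above.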
Read more at engineering.hackerrank.com
$ pip install spark-stratifier
Use StratifiedCrossValidator the same way you would use Spark's CrossValidator. The only difference is that your data will be stratified across the folds.
from spark_stratifier import StratifiedCrossValidator

scv = StratifiedCrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=8
)
model = scv.fit(matrix)
If you would like to contribute to this project, go ahead and open a pull request. We hope this tool is useful for the community, and we'd love to hear how it helps solve your problems!