ipyparallel wrapper for data scientists

pip install edgell==1.2



author: Rohan Kotwani


pip install edgell

Please see edgell-titanic-example-v2.ipynb for methods and uses.

The Motivation

Some companies are creating automated data science tools, but these tools typically obfuscate the feature processing or modeling components to keep their software "proprietary." This open source project aims to reveal some potential pitfalls in current implementations and to redefine the concept of automated data science.

Given the iterative nature of data science, an automated tool should allow flexible workflows that move between data understanding, feature development, and model comparison. The model building process should scale across multiple machines and produce customizable interfaces for evaluation. Edgell, an ipyparallel wrapper, provides a simple interface for parallelizing the model building process and customizing its output.

Different Types of Parallel Computing

  1. Across coordinates (GPU)
  2. Across for loops (CPU multithreading)
  3. Across objects (CPU explicit parallel)

These three types of parallel computing are commonly used in data science. Type (3) explicitly sends data and objects to the cores in parallel, and it is the most flexible of the three when it comes to parallelizing models. Interestingly, Intel's Math Kernel Library (MKL) automatically multithreads matrix operations across cores, which is an example of type (2).

Initializing IpyParallel

The computing cluster needs to be started from the command line, e.g., ipcluster start -n 6. The -n parameter specifies the number of engines to start, here one per core. All the packages used by the parallel engines should be imported on every engine. Because MKL multithreads automatically, we will limit the thread count to 2 per engine; my machine has 6 cores with 2 threads each. More on ipyparallel can be found here: https://ipyparallel.readthedocs.io/en/latest/intro.html
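As a rough illustration of the setup described above, the sketch below connects to a running cluster, caps the math-library threads on each engine, and imports packages on every engine. The helper names thread_limit_code and connect are hypothetical, and setting MKL_NUM_THREADS and OMP_NUM_THREADS through the environment is one assumed way to apply the limit; it only takes effect if it runs before NumPy is loaded on the engines.

```python
def thread_limit_code(n_threads: int = 2) -> str:
    """Build the statement each engine runs to cap its math-library threads."""
    env = {"MKL_NUM_THREADS": str(n_threads), "OMP_NUM_THREADS": str(n_threads)}
    return "import os; os.environ.update({!r})".format(env)


def connect(n_threads: int = 2):
    """Connect to a running cluster and prepare every engine (hypothetical helper)."""
    import ipyparallel as ipp  # imported lazily; requires ipyparallel to be installed

    rc = ipp.Client()  # assumes `ipcluster start -n 6` is already running
    dview = rc[:]      # direct view over all engines

    # Assumption: set the limits before NumPy (and MKL) is loaded on the engines.
    dview.execute(thread_limit_code(n_threads), block=True)

    # Import the packages the engines will need, both locally and remotely.
    with dview.sync_imports():
        import numpy
        import sklearn

    return rc, dview
```

Calling rc, dview = connect() after the cluster has started returns the client and the direct view over all engines.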

Edgell - IpyParallel Wrapper

Edgell is an ipyparallel wrapper that makes it easy to parallelize models across different hyper-parameters. More generally, the package can parallelize a variety of functions given an input dataset and a parameter grid. Run pip install edgell to use Edgell. We will use this package to parallelize and evaluate different XGBoost models on the Titanic dataset.

We are still using all of our favorite libraries, e.g., XGBoost and SKlearn. The benefit is that the trained model, average cross validation score, train score, and other potential metrics can now be returned and evaluated more explicitly.

Efficiency and Conciseness

The overall processing time for this function was around 7 seconds using Edgell. The trained model, train score, and average cross validation score are included in the output. The code to produce this output can be found in the link at the top of this post.

Similarly, SKlearn's standard GridSearchCV function with n_jobs=6 took around 7 seconds, but it did not return the trained models. The documentation was also unclear on what cv_results_ actually returned. Non-customizable, messy results can make it difficult for data scientists to understand and make use of SKlearn's GridSearchCV function.
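For reference, here is a small sketch of what GridSearchCV exposes after fitting. It uses a synthetic dataset and LogisticRegression in place of the Titanic data and XGBoost, purely so the example runs on its own; the dataset, estimator, and grid values are assumptions, not the setup timed above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Small synthetic dataset so the sketch is self-contained (not the Titanic data).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=5,
    n_jobs=1,  # the comparison above used n_jobs=6
)
search.fit(X, y)

# cv_results_ is a dict of parallel arrays, one entry per parameter combination.
for params, mean_score in zip(search.cv_results_["params"],
                              search.cv_results_["mean_test_score"]):
    print(params, round(float(mean_score), 3))

# Only the best combination is refit and kept as a trained model.
best_model = search.best_estimator_
```

Only best_estimator_ holds a fitted model; the models trained for the other parameter combinations are not kept.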



data: a dictionary with the data used by the model(s)

dview: the ipyparallel direct view

model: the model or SKlearn pipeline, e.g., xgboost.XGBClassifier()

grid: a grid of parameter dictionaries to be fed into the models

name: the name of the set of models

run: the function that defines the result dictionary
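To make the shape of these inputs concrete, the sketch below builds a grid of parameter dictionaries and calls a run function once per dictionary. It is a serial stand-in written for illustration, not Edgell's API: expand_grid, run_serial, dummy_run, and the sample values are all hypothetical.

```python
from itertools import product


def expand_grid(options: dict) -> list:
    """Expand {'a': [1, 2], 'b': [3]} into [{'a': 1, 'b': 3}, {'a': 2, 'b': 3}]."""
    keys = list(options)
    return [dict(zip(keys, values))
            for values in product(*(options[k] for k in keys))]


def run_serial(run, model, data: dict, grid: list, name: str) -> list:
    """Serial stand-in: call run(model, data, args) once per parameter dict."""
    results = []
    for i, args in enumerate(grid):
        rslt = run(model, data, args)
        rslt["name"] = f"{name}_{i}"  # label each result within the set of models
        rslt["args"] = args
        results.append(rslt)
    return results


# Hypothetical inputs mirroring the parameter descriptions above.
grid = expand_grid({"max_depth": [3, 5], "n_estimators": [50, 100]})
data = {"X": [[0.0], [1.0]], "y": [0, 1]}


def dummy_run(model, data, args):
    # Placeholder for a real evaluation function; returns a result dictionary.
    return {"n_rows": len(data["X"]), "depth": args["max_depth"]}


results = run_serial(dummy_run, model=None, data=data, grid=grid, name="xgb")
```

A parallel version would hand each parameter dictionary to a different engine instead of looping, which is the part the direct view is responsible for.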

Example Model Evaluation

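import numpy as np
from sklearn import model_selection

def model_evaluation(model, data: dict, args: dict) -> dict:
    # Evaluate one parameter setting and return a dictionary of results.
    rslt = {}
    model = model.set_params(**args)

    # 10-fold stratified cross-validation on the features X and labels y.
    kf = model_selection.StratifiedKFold(n_splits=10)
    cv_scores = model_selection.cross_val_score(model, data['X'],
                                                data['y'], cv=kf)
    rslt['avg_cv_score'] = np.mean(cv_scores)

    # cross_val_score fits clones, so fit the model itself before scoring it.
    model.fit(data['X'], data['y'])
    rslt['model'] = model
    rslt['train_score'] = model.score(data['X'], data['y'])

    return rslt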