xgbmagic

Data preprocessing and analysis using XGBoost

License: MIT

Pandas dataframe goes in, XGBoost model results come out.

Feature engineering (creating new features and selectively removing unwanted ones) is the most creative and fun part of training a model, whereas what follows is usually a standard data-processing flow.

Once you're done engineering your features, xgbmagic automatically runs a standard workflow for training an XGBoost model on a pandas dataframe:

  • performs one-hot encoding of categorical features (illustrated below)
  • drops uninformative features (no variability, too many missing values, ...)
  • trains the model
  • plots the most important features in descending order of importance
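
For reference, the one-hot encoding step is conceptually what pandas' get_dummies does (a sketch of the idea, not xgbmagic's internal code):

import pandas as pd

df = pd.DataFrame({'color': ['red', 'blue', 'red']})
# one-hot encoding: one binary indicator column per category value
encoded = pd.get_dummies(df, columns=['color'])
# 'color' is replaced by indicator columns color_blue and color_red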

New features!

Iterate faster with smaller samples! Improve accuracy with ensemble learning (bagging)!

  • allows training on random sample(s) - e.g. if you'd like to train on a subset of your data
  • allows training multiple models on different subsamples, then returning the average/consensus of all models as the prediction (bagging) - see the sketch below
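
A minimal sketch of subsampling with bagging, assuming a train.csv with hypothetical TARGET and ID columns (the parameters are described in the documentation below):

import xgbmagic
import pandas as pd

df = pd.read_csv('train.csv')

# train 5 models, each on a random 20% subsample of the data
xgb = xgbmagic.Xgb(df, target_column='TARGET', id_column='ID',
                   target_type='binary', sample_fraction=0.2, n_samples=5)
xgb.train()

# predict() returns the averaged (consensus) prediction across all 5 models
test_df = pd.read_csv('test.csv')
output = xgb.predict(test_df)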

To do

  • detect highly correlated columns and remove redundant columns
  • remove categorical features with too many possible category values (to remove unhelpful features like names and ids)
  • parameter tuning with GridSearchCV
  • allow custom values for more parameters
  • for classification problems, convert labels to 0-n integers (in case they're floats or strings)

Installation

Install xgboost first

git clone --recursive https://github.com/dmlc/xgboost
cd xgboost; make -j4
cd python-package; sudo python setup.py install
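
XGBoost is also distributed on PyPI, so if you don't need to build from source a plain pip install works:

pip install xgboost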

Then install xgbmagic

pip install xgbmagic

Documentation

Input parameters:

  • df (DataFrame): pandas dataframe
  • target_column (string): name of column containing the target parameter
  • id_column (string): name of column containing IDs
  • target_type (string): 'binary' for binary targets (classification), 'multiclass' for multiple classes (classification), 'linear' for continuous targets (linear regression)
  • categorical_columns (list of strings): a list of names of columns containing categorical data
  • numeric_columns (list of strings): a list of names of columns containing numerical data
  • drop_columns (list of strings): a list of names of columns to drop
  • verbose (boolean): verbosity of printouts; True = verbose
  • sample_fraction (float, 0-1): if this is less than 1, a random subsample of the data will be used to train the model
  • n_samples (int): if this is more than 1, this number of models will be trained, each on its own subsample of the data (sample size determined by sample_fraction). Note that if n_samples * sample_fraction is more than 1, the number of models is capped by how many samples of that size the data can supply (e.g. if sample_fraction = 0.2, at most 5 models will be trained; see the sketch below)
  • prefix (string): prefix for names of the output files
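
A short sketch of the documented cap on the number of models (hypothetical variable names; the library's internal logic may differ):

import math

sample_fraction = 0.2
n_samples = 10

# n_samples * sample_fraction = 2.0 > 1, so not all 10 requested models
# can get a sample; the count is capped at floor(1 / sample_fraction) = 5
models_trained = min(n_samples, math.floor(1 / sample_fraction))  # 5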

predict(test_df, return_multi_outputs, return_mean_std)

  • return_multi_outputs (boolean): if True, returns one set of outputs per model trained on a distinct subsample
  • return_mean_std (boolean): if True, returns a tuple with the mean and standard deviation of the model outputs. If both return_multi_outputs and return_mean_std are True, return_multi_outputs takes precedence (see the sketch below)
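
A minimal sketch of the two flags, assuming a model trained with n_samples > 1 (the xgb and test_df objects are as in the example below):

# one set of outputs per subsample model
per_model_outputs = xgb.predict(test_df, return_multi_outputs=True)

# mean and standard deviation across the per-model outputs
mean_output, std_output = xgb.predict(test_df, return_mean_std=True)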

Example

import xgbmagic
import pandas as pd

# read the training data
df = pd.read_csv('train.csv')

# for logistic regression, target_type is 'binary'
target_type = 'binary'

# set columns that are categorical, numeric, and to be dropped here.
xgb = xgbmagic.Xgb(df, target_column='TARGET', id_column='ID',
                   target_type=target_type, categorical_columns=[],
                   drop_columns=[], numeric_columns=[],
                   num_training_rounds=500, early_stopping_rounds=50,
                   verbose=1)
xgb.train()

# inspect feature importances
print(xgb.feature_importance())

# use the model to predict values for the test set
test_df = pd.read_csv('test.csv')
output = xgb.predict(test_df)

# write to csv
xgb.write_csv('output-xgbmagic.csv')

# save model
xgb.save('xgbmodel.pkl')

# load model
# (sklearn.externals.joblib is deprecated in newer scikit-learn; use `import joblib` instead)
from sklearn.externals import joblib
xgb = joblib.load('xgbmodel.pkl')




Issues

Please report issues and feedback on the project's issue tracker.