learning-curves

Python module to easily calculate and plot the learning curve of a machine learning model and find the maximum expected accuracy


Keywords
Learning, curve, machine, saturation, accuracy
License
MIT
Install
pip install learning-curves==0.2.2

Documentation

Learning-curves is a Python module that extends sklearn's learning curve feature. It helps you visualize the learning curve of your models:

[Figure: example learning curve plot]

Learning curves make it possible to diagnose bias and variance in supervised learning models, and to visualize how the training set size influences the performance of the models (more information here).

Such plots help you answer the following questions:

  • Would my model perform better with more data?
  • Can I train my model with less data without reducing accuracy?
  • Is my training/validation set biased?
  • What is the best model for my data?
  • What is the perfect training size for tuning parameters?

Learning-curves also helps you fit a function to the learning curve, so that you can extrapolate it and find its saturation value.
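As a rough illustration of what a learning curve measures, the sketch below computes one by hand, without this module or sklearn. Everything in it is an assumption made for the example: NearestMeanClassifier is a hypothetical toy estimator, manual_learning_curve and make_data are illustrative helpers, and the data is synthetic with a single numeric feature. It is not how learning-curves computes its curves internally.

```python
import random


class NearestMeanClassifier:
    """Hypothetical toy estimator exposing fit(X, Y) and predict(X) on 1-D features."""

    def fit(self, X, Y):
        # Store the mean feature value of each class.
        sums, counts = {}, {}
        for x, y in zip(X, Y):
            sums[y] = sums.get(y, 0.0) + x
            counts[y] = counts.get(y, 0) + 1
        self.means_ = {y: sums[y] / counts[y] for y in sums}
        return self

    def predict(self, X):
        # Assign each sample to the class whose mean is closest.
        return [min(self.means_, key=lambda y: abs(x - self.means_[y])) for x in X]


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def manual_learning_curve(estimator, X_train, Y_train, X_val, Y_val, train_sizes):
    """Train on growing prefixes of the training set and score each fit."""
    curve = []
    for n in train_sizes:
        estimator.fit(X_train[:n], Y_train[:n])
        train_acc = accuracy(Y_train[:n], estimator.predict(X_train[:n]))
        val_acc = accuracy(Y_val, estimator.predict(X_val))
        curve.append((n, train_acc, val_acc))
    return curve


def make_data(n, rng):
    # Synthetic two-class problem: class 0 centred on 0.0, class 1 centred on 2.0.
    Y = [rng.randint(0, 1) for _ in range(n)]
    X = [rng.gauss(2.0 * y, 1.0) for y in Y]
    return X, Y


rng = random.Random(0)  # fixed seed so the run is reproducible
X_train, Y_train = make_data(400, rng)
X_val, Y_val = make_data(400, rng)
sizes = [5, 10, 25, 50, 100, 200, 400]
curve = manual_learning_curve(NearestMeanClassifier(), X_train, Y_train, X_val, Y_val, sizes)
for n, train_acc, val_acc in curve:
    print(f"n={n:4d}  train={train_acc:.3f}  validation={val_acc:.3f}")
```

Plotting the two accuracy columns against n gives the training and validation curves. The point where the validation accuracy stops improving is the saturation that the questions above refer to.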

Installation

This module is still under development. It is therefore recommended to install it directly from the GitHub repository:

$ pip install git+https://github.com/H4dr1en/learning-curves#egg=learning-curves

Usage

To create learning curve plots, you can start with the following lines:

import learning_curves as LC
lc = LC.LearningCurve()
lc.get_lc(estimator, X, Y)

Here, estimator implements fit(X, Y) and predict(X) (the scikit-learn estimator interface).

Output:

[Figure: learning curve plot]

In this example, the green curve suggests that adding more data to the training set is unlikely to improve the model's accuracy. It also shows a saturation near 0.7. We can easily fit a function to this curve:

lc.plot(predictor="best")

Output:

[Figure: learning curve plot with the fitted function]

Here we used a predefined function, pow, to fit the green curve. The R2 score (0.999) is very close to 1, meaning that the function fits the data points very well. We can therefore use this curve to extrapolate how the accuracy evolves with the training set size.

This also tells us how much data we should use to train our model to maximize its performance and accuracy.

And much more!
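To make the fitting and extrapolation step more concrete, here is a dependency-free sketch of fitting a saturating power law to learning curve points. It assumes the form score = a - b * n**(-c), which is a common choice but may differ from the module's own pow function. It also swaps in a simple scan over the exponent c combined with ordinary least squares for a and b, not whatever solver the module uses. The names fit_power_law and predict_score are hypothetical, and the data points are synthetic.

```python
def fit_power_law(sizes, scores, exponents=None):
    """Fit score ~ a - b * n**(-c) by scanning c and solving a, b by least squares."""
    if exponents is None:
        exponents = [k / 100 for k in range(5, 201)]  # candidate c from 0.05 to 2.00
    mean_y = sum(scores) / len(scores)
    ss_tot = sum((y - mean_y) ** 2 for y in scores)
    best = None
    for c in exponents:
        # For a fixed c the model is linear in z = n**(-c): score = a + slope * z.
        z = [n ** (-c) for n in sizes]
        mean_z = sum(z) / len(z)
        var_z = sum((zi - mean_z) ** 2 for zi in z)
        cov = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, scores))
        slope = cov / var_z          # slope corresponds to -b
        a = mean_y - slope * mean_z  # intercept is the saturation value
        ss_res = sum((yi - (a + slope * zi)) ** 2 for zi, yi in zip(z, scores))
        if best is None or ss_res < best["ss_res"]:
            best = {"a": a, "b": -slope, "c": c, "ss_res": ss_res}
    best["r2"] = 1.0 - best["ss_res"] / ss_tot  # coefficient of determination
    return best


def predict_score(fit, n):
    """Extrapolate the fitted curve to a training set of size n."""
    return fit["a"] - fit["b"] * n ** (-fit["c"])


# Synthetic, noise-free points generated from a = 0.7, b = 0.5, c = 0.5.
sizes = [50, 100, 200, 400, 800, 1600]
scores = [0.7 - 0.5 * n ** -0.5 for n in sizes]

fit = fit_power_law(sizes, scores)
print("saturation value a:", fit["a"])
print("R2:", fit["r2"])
print("extrapolated score at n=10000:", predict_score(fit, 10000))
```

The fitted a is the value the curve approaches as the training set grows, which is the saturation value discussed above. On real, noisy scores the R2 will be lower, and an extrapolation far beyond the largest measured size should be treated with caution.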

Documentation

The documentation is available here.

Some functions have a function_name_cust equivalent. Calling a function without the _cust suffix internally calls its _cust counterpart with default parameters (such as the data points of the learning curves). Thanks to kwargs, you can pass exactly the same parameters to both functions.
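The delegation pattern described above can be sketched as follows. This is only an illustration of the idea: CurveStore, best_score and best_score_cust are hypothetical names invented for this example and are not part of the learning-curves API.

```python
class CurveStore:
    """Hypothetical class, used only to illustrate the _cust naming pattern."""

    def __init__(self, train_sizes, scores):
        # Data points kept on the object; they act as the default parameters.
        self.train_sizes = train_sizes
        self.scores = scores

    def best_score_cust(self, train_sizes, scores, min_size=0):
        """Work on explicitly supplied data points."""
        eligible = [s for n, s in zip(train_sizes, scores) if n >= min_size]
        return max(eligible)

    def best_score(self, **kwargs):
        """Same computation on the stored data points; kwargs are forwarded unchanged."""
        return self.best_score_cust(self.train_sizes, self.scores, **kwargs)


store = CurveStore([10, 100, 1000], [0.50, 0.72, 0.70])

# Uses the stored data points.
print(store.best_score())
# The same keyword argument is accepted by both variants.
print(store.best_score(min_size=500))
print(store.best_score_cust([10, 20], [0.1, 0.2], min_size=15))
```

Forwarding keyword arguments this way keeps the two variants in sync: any option added to the _cust version becomes available on the short version without changing its signature.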

Contributing

PRs, bug reports, and improvement suggestions are welcome :)