ml-pipeline

An intuitive, super easy machine learning pipeline framework for transforming DataFrames.


Keywords
ml_pipeline
License
MIT
Install
pip install ml-pipeline==0.02

Documentation

ml-pipeline

An intuitive, super easy machine learning pipeline framework for transforming DataFrames.

Easy to use

ml_pipeline is super easy to incorporate to your data manipulation and machine learning workflow. As a demostration, you first import your data into a dataframe:

>>> import pandas as pd
>>> df = pd.read_csv("tests/pets.csv")
>>> print(df)

Output:

   ID     pet  gender   age  height(cm)
0   1     dog       1   8.0        50.0
1   2     cat       0   6.0        30.0
2   3     cat       1   3.0        33.0
3   4    bird       1   7.0        16.0
4   5     dog       1   8.0        55.0
5   6    bird       1   3.0    999999.0
6   7  rabbit       1  11.0        -1.0
7   8    bird       0   NaN        20.0
8   9     cat       0   9.0        33.0
9  10     cat       0   NaN        21.5

Let's say we want to make two features out of this data: 1) pet as dummy variables, 2) age with imputed value.

>>> from ml_pipeline import Select, Impute, MakeDummy
>>> from ml_pipeline import notation
>>> ppl = notation(([Select('pet'), MakeDummy()],
...                 [Select('age'), Impute(0)]
...                  ))
>>> ppl.fit_transform(df)

Output:

   pet_dog  pet_cat  pet_bird  pet_rabbit   age
0        1        0         0           0   8.0
1        0        1         0           0   6.0
2        0        1         0           0   3.0
3        0        0         1           0   7.0
4        1        0         0           0   8.0
5        0        0         1           0   3.0
6        0        0         0           1  11.0
7        0        0         1           0   0.0
8        0        1         0           0   9.0
9        0        1         0           0   0.0

As you can see in the example above, using notation() function, pipelines within a tuple are unioned, wheres those within a list are appended. In fact, [e1, e2] makes a pipeline, and (e1,e2) makes a union.

[e1, e2] == e1 -> e2 -> output
            e1 --|
(e1, e2) ==      |--> output
            e2 --|

In the example above, we only have features that we enginerre. But in many cases, we need to keep other untouched features as-is. Following the code above, we just need to do:

>>> from ml_pipeline import KeepOthers
>>> ppl = notation(([Select('pet'), MakeDummy()],
...                 [Select('age'), Impute(0)],
...                 KeepOthers()
...                ))
>>> ppl.fit_transform(df)

Output:

   pet_dog  pet_cat  pet_bird  pet_rabbit   age  ID  gender  height(cm)
0        1        0         0           0   8.0   1       1        50.0
1        0        1         0           0   6.0   2       0        30.0
2        0        1         0           0   3.0   3       1        33.0
3        0        0         1           0   7.0   4       1        16.0
4        1        0         0           0   8.0   5       1        55.0
5        0        0         1           0   3.0   6       1    999999.0
6        0        0         0           1  11.0   7       1        -1.0
7        0        0         1           0   0.0   8       0        20.0
8        0        1         0           0   9.0   9       0        33.0
9        0        1         0           0   0.0  10       0        21.5