Automatic RandomForestImputer: Handling missing values with a random forest automatically

For supervised and semi-supervised learning

This library uses a random forest (regressor or classifier) to replace missing values in a dataset. It handles two cases:

Samples with missing values in one or more features
Samples with a missing target value and missing values in one or more features: both are predicted and replaced.
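
Which of the two forests is used depends on the type of the target variable. The sketch below illustrates that kind of dispatch with a deliberately naive rule. The helper name pick_forest_kind and the rule itself are assumptions made for illustration; the library lists DataTypeIdentifier among its dependencies and may well decide differently.

```python
from numbers import Number


def pick_forest_kind(target_values):
    """Return 'regressor' for a numerical target, 'classifier' otherwise.

    Hypothetical helper, not part of MissingValuesHandler.
    Missing entries (None) are ignored when inspecting the type.
    """
    observed = [value for value in target_values if value is not None]
    # bool is a subclass of int in Python, so it is excluded explicitly.
    is_numerical = bool(observed) and all(
        isinstance(value, Number) and not isinstance(value, bool)
        for value in observed
    )
    return "regressor" if is_numerical else "classifier"


print(pick_forest_kind([12.5, None, 7.0]))
print(pick_forest_kind(["Y", "N", None]))
```

A rule this simple would treat a categorical variable coded as 0/1 as numerical, which is one reason a dedicated type-detection step can be useful.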

Dependencies

Python (version >= 3.6)
NumPy
pandas
Matplotlib
scikit-learn
TensorFlow (version >= 2.2.0)
DataTypeIdentifier

Instructions

Install the library with pip install MissingValuesHandler
Import a dataset
The type of random forest is chosen automatically: if the target variable is numerical, a RandomForestRegressor is selected; if it is categorical, a RandomForestClassifier is selected.
Class instantiation: training_resilience tells the algorithm how many more times it should try to reach convergence when some values have not yet converged
Among others, the class has three important arguments:
- forbidden_features_list: features that don't require encoding go in this list
- ordinal_features_list: features to encode as ordinal categorical variables
- n_iterations_for_convergence: the number of rounds after which the algorithm checks whether the predicted values have converged. 4 or 5 rounds are usually enough
Set the parameters of the random forest, except for the criterion, which the software also handles: gini or entropy for a random forest classifier and mse (mean squared error) for a regressor. Set essential parameters like the number of iterations, the additional trees, the base estimator…
Among others, the train() method has two important arguments:
- sample_size, in [0, 1): draws a representative sample from the data (useful when the dataset is too big). 0 means no sampling
- n_quantiles: used to draw a representative sample from the data when the target variable is numerical (defaults to 0 if the variable is categorical)
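
One plausible way sample_size and n_quantiles could work together for a numerical target is quantile-stratified sampling: split the target into n_quantiles bins and draw the same fraction of rows from each bin. The sketch below uses only the standard library. The function stratified_sample and its seed argument are hypothetical, and the library's actual sampling procedure may differ.

```python
import random
from statistics import quantiles


def stratified_sample(targets, sample_size, n_quantiles, seed=0):
    """Return sorted row indices drawn evenly from each quantile bin.

    Hypothetical helper, not part of MissingValuesHandler.
    """
    if sample_size == 0:
        return list(range(len(targets)))  # 0 means "no sampling"
    # n_quantiles bins are separated by n_quantiles - 1 cut points.
    cuts = quantiles(targets, n=n_quantiles)
    bins = [[] for _ in range(n_quantiles)]
    for index, value in enumerate(targets):
        # The bin number is the count of cut points below the value.
        bin_id = sum(value > cut for cut in cuts)
        bins[bin_id].append(index)
    rng = random.Random(seed)
    chosen = []
    for members in bins:
        # Draw the same fraction from every bin, without replacement.
        n_drawn = round(len(members) * sample_size)
        chosen.extend(rng.sample(members, n_drawn))
    return sorted(chosen)


targets = [float(value) for value in range(100)]
sample_indices = stratified_sample(targets, sample_size=0.3, n_quantiles=4)
print(len(sample_indices), sample_indices[:5])
```

Drawing per bin keeps the low, middle, and high ranges of the target represented in the sample, which a plain random draw on a skewed target does not guarantee.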

Coding example:

from MissingValuesHandler.missing_data_handler import RandomForestImputer
from os.path import join
from pandas import read_csv
"""
############################################
############# IMPORT DATA  #################
############################################
"""
data = read_csv(join("data","Loan_approval.csv"), sep=",", index_col=False)

"""
############################################
############### RUN TIME ###################
############################################
"""
#Main object
random_forest_imputer = RandomForestImputer(data=data,
                                            target_variable_name="Status",
                                            training_resilience=3, 
                                            n_iterations_for_convergence=5,
                                            forbidden_features_list=["Credit_History"],
                                            ordinal_features_list=[])

#Setting the ensemble model parameters: it could be a random forest regressor or classifier
random_forest_imputer.set_ensemble_model_parameters(n_estimators=40, additional_estimators=10)

#Launching training and getting our new dataset
new_data = random_forest_imputer.train(sample_size=0.3, 
                                       path_to_save_dataset=join("data", "Loan_approval_no_nan.csv"))
"""
############################################
########## DATA RETRIEVAL ##################
############################################
"""
sample_used                         = random_forest_imputer.get_sample()
features_type_prediction            = random_forest_imputer.get_features_type_predictions()
target_variable_type_prediction     = random_forest_imputer.get_target_variable_type_prediction()
encoded_features                    = random_forest_imputer.get_encoded_features()
encoded_target_variable             = random_forest_imputer.get_target_variable_encoded()
final_proximity_matrix              = random_forest_imputer.get_proximity_matrix()
final_distance_matrix               = random_forest_imputer.get_distance_matrix()
weighted_averages                   = random_forest_imputer.get_nan_features_predictions(option="all")
convergent_values                   = random_forest_imputer.get_nan_features_predictions(option="conv")
divergent_values                    = random_forest_imputer.get_nan_features_predictions(option="div")
ensemble_model_parameters           = random_forest_imputer.get_ensemble_model_parameters()
all_target_value_predictions        = random_forest_imputer.get_nan_target_values_predictions(option="all")
target_value_predictions            = random_forest_imputer.get_nan_target_values_predictions(option="one")


"""
############################################
######## WEIGHTED AVERAGES PLOT ############
############################################
"""
random_forest_imputer.create_weighted_averages_plots(directory_path="graphs", both_graphs=1)

"""
############################################
######## TARGET VALUE(S) PLOT ##############
############################################
"""
random_forest_imputer.create_target_pred_plot(directory_path="graphs")

"""
############################################
##########      MDS PLOT    ################
############################################
"""
mds_coordinates = random_forest_imputer.get_mds_coordinates(n_dimensions=3, distance_matrix=final_distance_matrix)
random_forest_imputer.show_mds_plot(mds_coordinates, plot_type="3d")
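
The convergence parameters used in the example above (n_iterations_for_convergence and training_resilience) can be pictured with a small standalone sketch. It assumes one reading of those parameters: after each batch of n rounds, the most recent estimates of a missing numerical value are compared, and up to training_resilience extra batches are allowed before giving up. The function names, the tolerance, and the toy estimator are all hypothetical; the library's internal criterion is not documented here.

```python
def has_converged(history, n_rounds, tolerance=1e-3):
    """True when the last n_rounds estimates stay within tolerance."""
    if len(history) < n_rounds:
        return False
    recent = history[-n_rounds:]
    return max(recent) - min(recent) <= tolerance


def impute_until_stable(estimate_once, n_rounds=5, resilience=3):
    """Run batches of n_rounds estimates, with up to `resilience` retries.

    Returns the last estimate and a flag telling whether it converged.
    """
    history = []
    for _ in range(1 + resilience):
        for _ in range(n_rounds):
            history.append(estimate_once())
        if has_converged(history, n_rounds):
            return history[-1], True
    return history[-1], False


def make_estimator(start, limit, shrink):
    """Toy estimator: each call moves the value toward (or away from) limit."""
    state = {"value": start}

    def estimate_once():
        state["value"] = limit + (state["value"] - limit) * shrink
        return state["value"]

    return estimate_once


value, converged = impute_until_stable(make_estimator(10.0, 5.0, 0.1))
print(value, converged)
```

Under this reading, values that still fail the check after the last retry would be the ones reported as divergent.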

3D Multidimensional Scaling (MDS):

We can use the distance matrix to plot the samples and observe how they are related to one another

We can use the K-means algorithm to cluster the data and analyze the features of every cluster
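
For background, the distance matrix in random forest methods is commonly derived from a proximity matrix: the proximity of two samples is the fraction of trees in which they land in the same leaf, and the distance is 1 minus the proximity. The sketch below shows that computation on hand-written leaf assignments. The function names and the input layout are assumptions for illustration, not the library's API.

```python
def proximity_from_leaves(leaf_ids):
    """Build a proximity matrix from per-tree leaf assignments.

    leaf_ids[t][i] is the leaf reached by sample i in tree t.
    """
    n_trees = len(leaf_ids)
    n_samples = len(leaf_ids[0])
    counts = [[0] * n_samples for _ in range(n_samples)]
    for tree in leaf_ids:
        for i in range(n_samples):
            for j in range(n_samples):
                if tree[i] == tree[j]:
                    counts[i][j] += 1
    # Normalize co-occurrence counts by the number of trees.
    return [[count / n_trees for count in row] for row in counts]


def proximity_to_distance(proximity):
    """Convert proximities in [0, 1] into distances (1 - proximity)."""
    return [[1.0 - value for value in row] for row in proximity]


# Four toy trees, three samples.
leaf_ids = [[0, 0, 1], [2, 2, 2], [5, 6, 6], [1, 1, 3]]
distance = proximity_to_distance(proximity_from_leaves(leaf_ids))
for row in distance:
    print(row)
```

A matrix of this form is symmetric with zeros on the diagonal, which is what MDS and distance-based clustering expect as input.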

References for the supervised algorithm:

[1]: Leo Breiman and Adele Cutler. Random Forests. Leo Breiman’s website. stat.berkeley.edu/~breiman/RandomForests/cc_home.htm
[2]: Josh Starmer. Random Forests Part 2: Missing data and clustering. StatQuest YouTube channel. https://youtu.be/nyxTdL_4Q-Q

MissingValuesHandler
Release 1.0.4
