This package is an implementation of:
Paper: Comonotone-Independence Bayes classifier (CIBer)
Authors: Yongzhao CHEN, Ka Chun CHEUNG, Nok Sang FAN, Suresh SETHI, and Sheung Chi Phillip YAM
This is the user guide for the Comonotone-Independence Bayesian Classifier (CIBer). CIBer is a supervised learning model for multi-class classification tasks. Continuous feature variables are discretized, and categorical ones are encoded via the proposed Joint Encoding.
This document mainly explains the important and practical functions in CIBer.py and CIBer_Engineering.py. In addition, CIBer_Bankchurner.ipynb gives a simple but illustrative example of CIBer using the Bankchurner dataset by Thomas Konstantin. Please refer to original author Kaiser's repository for details.
The MDLP discretization method has been disabled. To use it, you need to install the package manually, since it requires additional tools.
Step 1: install C/C++ tools
On Windows: install Visual Studio Community, and then install Microsoft C++ Build Tools for C/C++ related packages
On macOS: type the following line in the terminal to install the Command Line Tools package
xcode-select --install
Step 2: type one of the following lines in the terminal to install mdlp-discretization
pip install mdlp-discretization
pip install git+https://github.com/hlin117/mdlp-discretization
Refer to author hlin117's repository for details.
CIBer deals with multi-class classification tasks whose input variables are numerical or discrete (discrete variables should be ordered). Before passing the data into the model, please preprocess it properly, e.g. remove outliers and missing observations, and encode all categorical feature variables with numerical values.
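A minimal preprocessing sketch using only the standard library. The rows, the column layout, and the ordering of the category levels are made up for illustration; adapt them to your own data.

```python
# Hypothetical raw rows: (age, income, education level); None marks a missing value.
raw_rows = [
    (25, 48000.0, "high_school"),
    (41, None, "bachelor"),
    (37, 61000.0, "master"),
    (52, 75000.0, "bachelor"),
]

# Assumed ordering of the categorical levels, from lowest to highest.
EDUCATION_ORDER = {"high_school": 0, "bachelor": 1, "master": 2}


def preprocess(rows):
    """Drop rows with missing values and encode the ordered category as a number."""
    cleaned = []
    for age, income, education in rows:
        if age is None or income is None or education is None:
            continue  # remove observations with missing entries
        cleaned.append([float(age), float(income), float(EDUCATION_ORDER[education])])
    return cleaned


x = preprocess(raw_rows)
print(x)
```

The resulting list of rows can be converted to a numpy array before it is passed to the model.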
To use CIBer:
from CIBer import CIBer
__init__(self, cont_col=[], asso_method='modified', min_asso=0.95, alpha=1, disc_method="norm", joint_encode=True, **kwargs)
cont_col: a list, containing the indices of the continuous variables
asso_method: a string selecting the measure of association, one of "pearson", "spearman", "kendall", or "modified". The default is "modified"
min_asso: a number between
alpha: a positive number used in Laplacian smoothing. The default value is 1
joint_encode: a boolean, whether to use joint encoding. The default value is True
disc_method: a string indicating the discretization method adopted for each continuous feature variable. The default is "norm", the normal distribution quantile method
**kwargs: additional keyword arguments passed to Discretization(). The two accepted keyword arguments are listed below
n_bins: a positive integer, the total number of bins for each discretization.
disc_backup: a string indicating the fallback discretization method used if disc_method="mdlp" fails.
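To illustrate the role of alpha, the sketch below applies standard additive (Laplacian) smoothing to the bin counts of one discretized feature within one class. The function name and the counts are hypothetical, and the computation inside CIBer may differ in detail.

```python
def smoothed_probabilities(counts, alpha=1.0):
    """Additive smoothing: (count + alpha) / (total + alpha * number_of_bins)."""
    total = sum(counts)
    n_bins = len(counts)
    return [(c + alpha) / (total + alpha * n_bins) for c in counts]


# Hypothetical bin counts for one feature within one class;
# the last bin was never observed in the training data.
counts = [6, 3, 1, 0]
probs = smoothed_probabilities(counts, alpha=1)
print(probs)
```

With alpha > 0, a bin that never appears in training still receives a small positive probability, so a single unseen value does not force the whole class probability to zero.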
fit(self, x_train, y_train)
x_train: a numpy
y_train: a numpy
predict(self, x_test)
x_test: a numpy
return: a numpy
predict_proba(self, x_test)
x_test: a numpy
return: a numpy
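The calls above can be combined as in the following sketch. It assumes that CIBer.py is importable and that the inputs are numpy arrays; the helper name fit_and_predict, the column indices, and n_bins=10 are illustrative choices, not recommendations.

```python
# Constructor arguments taken from the signature documented above.
params = {
    "cont_col": [0, 1],        # assumption: columns 0 and 1 are continuous
    "asso_method": "modified",
    "min_asso": 0.95,
    "alpha": 1,
    "disc_method": "norm",
    "joint_encode": True,
    "n_bins": 10,              # forwarded to Discretization() through **kwargs
}


def fit_and_predict(x_train, y_train, x_test):
    """Hypothetical helper: fit CIBer, then return labels and class probabilities."""
    from CIBer import CIBer  # imported lazily so the sketch loads without the package

    model = CIBer(**params)
    model.fit(x_train, y_train)
    return model.predict(x_test), model.predict_proba(x_test)
```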
self.cluster_book: a Python dictionary where
- keys: class labels
- vals: lists of clusters generated by the AGNES algorithm, each of which contains the indices of the feature variables within the same cluster. If a list contains only one integer, the corresponding feature variable is treated as independent of all other feature variables given the class label. Otherwise, the feature variables in the list are modelled by conditional comonotonicity given the class label.
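The structure described above can be inspected as in this sketch. The dictionary is a made-up example with two classes and five features, not output from a fitted model, and split_clusters is a hypothetical helper.

```python
# Hypothetical cluster_book for two classes and five features (indices 0 to 4).
cluster_book = {
    0: [[0, 2], [1], [3, 4]],
    1: [[0], [1, 2, 3], [4]],
}


def split_clusters(clusters):
    """Separate singleton clusters (independent features) from larger (comonotone) ones."""
    independent = [c[0] for c in clusters if len(c) == 1]
    comonotone = [c for c in clusters if len(c) > 1]
    return independent, comonotone


for label, clusters in cluster_book.items():
    independent, comonotone = split_clusters(clusters)
    print(f"class {label}: independent={independent}, comonotone={comonotone}")
```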
self.distance_matrix_: a numpy $p \times p$ array, where the $(i,j)$ entry is the association value between feature $i$ and feature $j$, computed according to the chosen asso_method.
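For intuition, the sketch below builds a matrix of this shape with the Pearson correlation, which corresponds to asso_method="pearson". It is a standard-library illustration on made-up data with hypothetical function names, not the package's own code.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def association_matrix(columns):
    """Return the p x p matrix whose (i, j) entry is the correlation of features i and j."""
    p = len(columns)
    return [[pearson(columns[i], columns[j]) for j in range(p)] for i in range(p)]


# Made-up data: three features stored column-wise.
columns = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.0, 6.0, 8.0],   # increases linearly with feature 0
    [4.0, 3.0, 2.0, 1.0],   # decreases linearly with feature 0
]
matrix = association_matrix(columns)
for row in matrix:
    print(row)
```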
Discretization(cont_col, disc_method, disc_backup="pkid", n_bins=10)
cont_col: a list of indices to be discretized
disc_method: any string in DISC_BASE + SCIPY_DIST (refer to CIBer.py)
list of distributions provided by scipy and used in the equal-quantile distribution method; the number of bins is determined by n_bins
SCIPY_DIST = ["uniform", "norm", "t", "chi2", "expon", "laplace", "skewnorm", "gamma"]
list of common discretization methods for the Naïve Bayes classifier
SIZE_BASE = ["equal_size", "pkid", "ndd", "wpkid"]
list of all discretization methods except SCIPY_DIST
DISC_BASE = ["equal_length", "auto"] + SIZE_BASE
list of alternative discretization methods if mdlp fails except SCIPY_DIST
MDLP_BACKUP = ["equal_length", "auto"] + SIZE_BASE
return a class for discretization method
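As a rough illustration of the equal-quantile idea behind disc_method="norm", the sketch below fits a normal distribution to one feature and cuts it into n_bins equally likely intervals. It uses statistics.NormalDist from the standard library in place of scipy, and the function name and data are hypothetical; the package's own implementation may differ.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, stdev


def normal_quantile_bins(values, n_bins=10):
    """Assign each value to one of n_bins bins that are equally likely under a fitted normal."""
    dist = NormalDist(mean(values), stdev(values))
    # Interior cut points at the 1/n_bins, 2/n_bins, ... quantiles of the fitted normal.
    cuts = [dist.inv_cdf(k / n_bins) for k in range(1, n_bins)]
    # bisect_right returns how many cut points are <= v, i.e. the bin index of v.
    return [bisect_right(cuts, v) for v in values], cuts


# Made-up values of one continuous feature.
values = [4.1, 5.0, 5.2, 5.9, 6.3, 6.8, 7.4, 8.0, 9.5, 12.0]
bins, cuts = normal_quantile_bins(values, n_bins=4)
print(bins)
print(cuts)
```

Because the cut points come from the fitted distribution rather than from the sorted sample, the bins need not contain equal numbers of observations.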
__init__(self, df, col_index)
df: a
col_index: a list, containing the indices of categorical feature variables
fit(self, x_train)
x_train: a
transform(self, x_test)
x_test: a numpy
return: a numpy