INGOT-DR
INGOT-DR (INterpretable GrOup Testing for Drug Resistance) is an interpretable rule-based predictive model based on Group Testing and Boolean Compressed Sensing. For more details and citation, please see the INGOT-DR paper. To access the scripts used to produce the results in the paper, please visit the INGOT-DR Project. To access the data used in the paper, please visit/cite the M. tuberculosis dataset for drug resistance.
Installation
INGOT-DR can be installed from PyPI.
pip install ingotdr
Usage
INGOT-DR is implemented as a scikit-learn classifier. As a result, it is compatible with most scikit-learn tools (e.g. cross-validation and hyper-parameter tuning). The following sections provide some usage examples.
Arguments
ingot.INGOTClassifier( w_weight=1, lambda_p=1, lambda_z=1, lambda_e=1, false_positive_rate_upper_bound=None,
false_negative_rate_upper_bound=None, max_rule_size=None, rounding_threshold=1e-5,
lp_relaxation=False, only_slack_lp_relaxation=False, lp_rounding_threshold=0,
is_it_noiseless=False, solver_name='PULP_CBC_CMD', solver_options=None)
Name | Type | Description | Default |
---|---|---|---|
w_weight | vector, float | A vector or float providing prior weights for w. | 1.0 |
lambda_p | float | Regularization coefficient for positive labels. | 1.0 |
lambda_z | float | Regularization coefficient for negative/zero labels. | 1.0 |
lambda_e | float | Regularization coefficient for all slack variables. | 1.0 |
false_positive_rate_upper_bound | float | False positive rate (FPR) upper bound. | None |
false_negative_rate_upper_bound | float | False negative rate (FNR) upper bound. | None |
max_rule_size | int | Maximum rule size. | None |
rounding_threshold | float | Threshold for rounding ILP solutions to 0 and 1. | 1e-5 |
lp_relaxation | bool | A flag to use the LP-relaxed version. | False |
only_slack_lp_relaxation | bool | A flag to LP-relax only the slack variables. | False |
lp_rounding_threshold | float | Threshold for rounding LP solutions to 0 and 1. Ranges from 0 to 1. | 0.0 |
is_it_noiseless | bool | A flag to specify whether the problem is noisy or noiseless. | False |
solver_name | str | Solver name, as provided by PuLP. | 'PULP_CBC_CMD' |
solver_options | dict | Solver options, as provided by PuLP. | None |
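To make the role of the rate bounds concrete, here is a minimal pure-Python sketch (not part of the package) of the false positive rate that false_positive_rate_upper_bound constrains during training:

```python
def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN), i.e. the fraction of truly negative
    samples that are predicted positive."""
    preds_on_negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not preds_on_negatives:
        return 0.0
    return sum(preds_on_negatives) / len(preds_on_negatives)

# Toy labels/predictions: four negatives, one of which is predicted positive
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [1, 0, 0, 0, 1, 0]

fpr = false_positive_rate(y_true, y_pred)
print(fpr)  # 0.25, which would satisfy e.g. false_positive_rate_upper_bound=0.3
```

false_negative_rate_upper_bound plays the symmetric role over the truly positive samples.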
Methods
Method | Description |
---|---|
fit(X, y) | Fit the model to the given data. |
get_params_dictionary(variable_type='w') | Provide a dictionary of individuals with their status as obtained by the decoder. variable_type specifies the variable type, e.g. 'w', 'ep' or 'en'. |
solution() | Provide a binary feature-importance vector, i.e. 1 if the feature is used in the model, 0 otherwise. |
predict(X) | Provide predicted labels for X. |
score(X, y) | Provide the accuracy of self.predict(X) with respect to y. |
learned_rule(return_type='feature_name') | Return a list of rules. return_type can be 'feature_name' or 'feature_id'. |
write(fileType='mps', **kwargs) | Create a file from the problem. fileType can be 'mps', 'lp', 'json' or 'display'; 'display' prints the ILP/LP problem to screen. |
Training and evaluation
Example: The following is an example of training a classifier to predict resistance to the second-line drug Ciprofloxacin in TB isolates. In this example, the feature matrix indicates the presence/absence of SNPs in TB isolates, and the label vector represents the drug resistance phenotype. Sample data is available here.
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
import pandas as pd
import ingot
feature_matrix = 'SNPsMatrix_ciprofloxacin.csv'
label_vector = 'ciprofloxacinLabel.csv'
X = pd.read_csv(feature_matrix, index_col=0)
y = pd.read_csv(label_vector, index_col=0).to_numpy().ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.2, stratify=y)
clf = ingot.INGOTClassifier(lambda_p=10, lambda_z=0.01, false_positive_rate_upper_bound=0.1,
max_rule_size=20, solver_name='CPLEX_PY')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Balanced accuracy: {}".format(balanced_accuracy_score(y_test, y_pred)))
print("Accuracy: {}".format(clf.score(X_test, y_test)))
print("Features in the learned rule: {}".format(clf.learned_rule()))
Output:
Balanced accuracy: 0.8449477351916377
Accuracy: 0.9550561797752809
Features in the learned rule: ['7570, C, T', '7572, T, C', '7581, G, T', '7582, A, C', '7582, A, G']
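The learned rule is a disjunction over the selected features: an isolate is predicted resistant if it carries at least one of the SNPs in the rule. A minimal pure-Python sketch of that prediction step, using hypothetical feature values:

```python
# Hypothetical learned rule: resistant if any of these SNP columns is 1
rule_features = ['7570, C, T', '7572, T, C']

# Toy isolates: SNP feature name -> presence (1) / absence (0)
isolates = [
    {'7570, C, T': 0, '7572, T, C': 1, '7581, G, T': 0},  # carries one rule SNP
    {'7570, C, T': 0, '7572, T, C': 0, '7581, G, T': 1},  # carries none
]

# OR over the selected features: 1 (resistant) if any rule SNP is present
preds = [int(any(iso[f] for f in rule_features)) for iso in isolates]
print(preds)  # [1, 0]
```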
Hyper-parameter tuning
Hyper-parameter tuning via scikit-learn's GridSearchCV:
Example:
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
import pandas as pd
import ingot
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
feature_matrix = 'SNPsMatrix_ciprofloxacin.csv'
label_vector = 'ciprofloxacinLabel.csv'
X = pd.read_csv(feature_matrix, index_col=0)
y = pd.read_csv(label_vector, index_col=0).to_numpy().ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.2, stratify=y)
clf = ingot.INGOTClassifier(false_positive_rate_upper_bound=0.1, max_rule_size=20, solver_name='CPLEX_PY',
solver_options={'timeLimit': 1800})
scoring = dict(Accuracy='accuracy', balanced_accuracy=make_scorer(balanced_accuracy_score))
param_grid = {'lambda_p': [1, 10, 100], 'lambda_z': [0.01, 0.1, 1]}
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=scoring, cv=5, refit='balanced_accuracy',
                    n_jobs=-1, verbose=3)
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)
print("Balanced accuracy: {}".format(balanced_accuracy_score(y_test, y_pred)))
print('Best params: {}'.format(grid.best_params_))
Output:
Balanced accuracy: 0.8449477351916377
Best params: {'lambda_p': 10, 'lambda_z': 0.01}
Optimizing for a different target metric
Note: w_weight and lambda_e are not part of the main ILP (Eq (11)) defined in the INGOT-DR paper. These two variables are provided for extra freedom when optimizing for a different target metric (Section 1.4 of the paper) is needed: w_weight assigns prior weights to w in the objective function, and lambda_e scales all slack variables.
Example: The classifier corresponding to Eq (16), with maximum rule size k=20 and specificity lower bound t=90%, can be defined as follows:
clf = ingot.INGOTClassifier(w_weight=0, lambda_z=0, false_positive_rate_upper_bound=0.1, max_rule_size=20,
solver_name='CPLEX_PY')
The following table shows the combinations of arguments needed to define some of the ILPs in the paper.
lp_relaxation | only_slack_lp_relaxation | is_it_noiseless | Equation number in the paper |
---|---|---|---|
False | False | False | Eq (11) |
False | True | True | Eq (3) |
False | True | False | Eq (4) with objective function of Eq (11) |
False | False | True | Eq (3) |
True | True | False | LP relaxation of Eq (4) with objective function of Eq (11) |
True | False | False | LP relaxation of Eq (4) with objective function of Eq (11) |
True | False | True | LP relaxation of Eq (3) |
True | True | True | LP relaxation of Eq (3) |
Note: A True value of lp_relaxation or is_it_noiseless overrides only_slack_lp_relaxation, i.e. if either of them is True, the value of only_slack_lp_relaxation does not matter.
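The table and the override rule above can be encoded as a small helper; this is purely illustrative and not part of the package:

```python
def formulation(lp_relaxation, only_slack_lp_relaxation, is_it_noiseless):
    """Return the formulation from the paper that a flag combination selects."""
    if is_it_noiseless:
        eq = "Eq (3)"  # only_slack_lp_relaxation is ignored here
    elif only_slack_lp_relaxation or lp_relaxation:
        eq = "Eq (4) with objective function of Eq (11)"
    else:
        eq = "Eq (11)"
    # A True lp_relaxation turns the ILP into its LP relaxation
    return ("LP relaxation of " + eq) if lp_relaxation else eq

print(formulation(False, False, False))  # Eq (11)
print(formulation(True, False, True))    # LP relaxation of Eq (3)
```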
Note: To recreate and work with Eq (4), you only need to use the combination in row 3 and use or tune lambda_e instead of lambda_p and lambda_z. For example:
param_grid = {'lambda_e': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=scoring, cv=5, refit='balanced_accuracy',
                    n_jobs=-1, verbose=3)
Choosing the solver
INGOT-DR supports a variety of solvers through the PuLP application programming interface (API), including GLPK, COIN-OR CLP/CBC, CPLEX, GUROBI, MOSEK, XPRESS, CHOCO, MIPCL, and SCIP.
List of available solvers on your machine:
import pulp as pl
solver_list = pl.listSolvers(onlyAvailable=True)
The name and options of the solver can be specified via solver_name and solver_options, e.g.:
clf = ingot.INGOTClassifier(solver_name='CPLEX_PY', solver_options={'timeLimit': 1800})
In the INGOT-DR paper, 'CPLEX_PY' is the main solver. IBM CPLEX is available for academic use here.
Citation:
For general use, please cite our paper: INGOT-DR: an interpretable classifier for predicting drug resistance in M. tuberculosis. (bibtex)