INGOT-DR
INGOT-DR (INterpretable GrOup Testing for Drug Resistance) is an interpretable rule-based predictive model based on Group Testing and Boolean Compressed Sensing. For more details and citation, please see the INGOT-DR paper. To access the scripts used to produce the results in the paper, please visit the INGOT-DR Project. To access the data used in the paper, please visit/cite the M. tuberculosis dataset for drug resistance.
Installation
INGOT-DR can be installed from PyPI.
pip install ingotdr
Usage
INGOT-DR is implemented as a scikit-learn classifier. As a result, it is compatible with most scikit-learn tools (e.g. cross-validation and hyper-parameter tuning). The following sections provide some usage examples.
Arguments
ingot.INGOTClassifier( w_weight=1, lambda_p=1, lambda_z=1, lambda_e=1, false_positive_rate_upper_bound=None,
false_negative_rate_upper_bound=None, max_rule_size=None, rounding_threshold=1e-5,
lp_relaxation=False, only_slack_lp_relaxation=False, lp_rounding_threshold=0,
is_it_noiseless=False, solver_name='PULP_CBC_CMD', solver_options=None)
Name | Type | Description | Default |
---|---|---|---|
w_weight | vector, float | A vector or float providing prior weights for w. | 1.0 |
lambda_p | float | Regularization coefficient for positive labels. | 1.0 |
lambda_z | float | Regularization coefficient for negative/zero labels. | 1.0 |
lambda_e | float | Regularization coefficient for all slack variables. | 1.0 |
false_positive_rate_upper_bound | float | False positive rate (FPR) upper bound. | None |
false_negative_rate_upper_bound | float | False negative rate (FNR) upper bound. | None |
max_rule_size | int | Maximum rule size. | None |
rounding_threshold | float | Threshold for rounding ILP solutions to 0 and 1. | 1e-5 |
lp_relaxation | bool | A flag to use the LP-relaxed version. | False |
only_slack_lp_relaxation | bool | A flag to LP-relax only the slack variables. | False |
lp_rounding_threshold | float | Threshold for rounding LP solutions to 0 and 1. Ranges from 0 to 1. | 0.0 |
is_it_noiseless | bool | A flag to specify whether the problem is noisy or noiseless. | False |
solver_name | str | Solver name, as provided by PuLP. | 'PULP_CBC_CMD' |
solver_options | dict | Solver options, as provided by PuLP. | None |
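To make the role of the rate bounds concrete, here is a minimal pure-Python sketch (not part of the package) of the false positive rate that false_positive_rate_upper_bound constrains during training:

```python
def false_positive_rate(y_true, y_pred):
    """FPR = FP / (FP + TN), i.e. the fraction of truly negative
    samples that are predicted positive."""
    preds_on_negatives = [p for t, p in zip(y_true, y_pred) if t == 0]
    if not preds_on_negatives:
        return 0.0
    return sum(preds_on_negatives) / len(preds_on_negatives)

# Toy labels/predictions: four negatives, one of which is predicted positive
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [1, 0, 0, 0, 1, 0]

fpr = false_positive_rate(y_true, y_pred)
print(fpr)  # 0.25, which would satisfy e.g. false_positive_rate_upper_bound=0.3
```

false_negative_rate_upper_bound plays the symmetric role over the truly positive samples.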
Methods
Method | Description |
---|---|
fit(X, y) | Fit the model to the given data. |
get_params_dictionary(variable_type='w') | Provide a dictionary of individuals with their status as obtained by the decoder. variable_type specifies the variable type, e.g. 'w', 'ep' or 'en'. |
solution() | Provide a binary feature-importance vector, i.e. 1 if the feature is used in the model, 0 otherwise. |
predict(X) | Provide predicted labels for X. |
score(X, y) | Provide the accuracy of self.predict(X) with respect to y. |
learned_rule(return_type='feature_name') | Return a list of rules. return_type can be 'feature_name' or 'feature_id'. |
write(fileType='mps', **kwargs) | Create a file from the problem. fileType can be 'mps', 'lp', 'json' or 'display'; 'display' prints the ILP/LP problem to screen. |
Training and evaluation
Example: The following is an example of training a classifier to predict resistance to the second-line drug Ciprofloxacin in TB isolates. In this example, the feature matrix indicates the presence/absence of SNPs in TB isolates, and the label vector represents the drug resistance phenotype. Sample data is available here.
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
import pandas as pd
import ingot
feature_matrix = 'SNPsMatrix_ciprofloxacin.csv'
label_vector = 'ciprofloxacinLabel.csv'
X = pd.read_csv(feature_matrix, index_col=0)
y = pd.read_csv(label_vector, index_col=0).to_numpy().ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.2, stratify=y)
clf = ingot.INGOTClassifier(lambda_p=10, lambda_z=0.01, false_positive_rate_upper_bound=0.1,
max_rule_size=20, solver_name='CPLEX_PY')
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
print("Balanced accuracy: {}".format(balanced_accuracy_score(y_test, y_pred)))
print("Accuracy: {}".format(clf.score(X_test, y_test)))
print("Features in the learned rule: {}".format(clf.learned_rule()))
Output:
Balanced accuracy: 0.8449477351916377
Accuracy: 0.9550561797752809
Features in the learned rule: ['7570, C, T', '7572, T, C', '7581, G, T', '7582, A, C', '7582, A, G']
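The learned rule is a disjunction over the selected features: an isolate is predicted resistant if it carries at least one of the SNPs in the rule. A minimal pure-Python sketch of that prediction step, using hypothetical feature values:

```python
# Hypothetical learned rule: resistant if any of these SNP columns is 1
rule_features = ['7570, C, T', '7572, T, C']

# Toy isolates: SNP feature name -> presence (1) / absence (0)
isolates = [
    {'7570, C, T': 0, '7572, T, C': 1, '7581, G, T': 0},  # carries one rule SNP
    {'7570, C, T': 0, '7572, T, C': 0, '7581, G, T': 1},  # carries none
]

# OR over the selected features: 1 (resistant) if any rule SNP is present
preds = [int(any(iso[f] for f in rule_features)) for iso in isolates]
print(preds)  # [1, 0]
```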
Hyper-parameter tuning
Hyper-parameter tuning via scikit-learn's GridSearchCV:
Example:
from sklearn.model_selection import train_test_split
from sklearn.metrics import balanced_accuracy_score
import pandas as pd
import ingot
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import make_scorer
feature_matrix = 'SNPsMatrix_ciprofloxacin.csv'
label_vector = 'ciprofloxacinLabel.csv'
X = pd.read_csv(feature_matrix, index_col=0)
y = pd.read_csv(label_vector, index_col=0).to_numpy().ravel()
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=33, test_size=0.2, stratify=y)
clf = ingot.INGOTClassifier(false_positive_rate_upper_bound=0.1, max_rule_size=20, solver_name='CPLEX_PY',
solver_options={'timeLimit': 1800})
scoring = dict(Accuracy='accuracy', balanced_accuracy=make_scorer(balanced_accuracy_score))
param_grid = {'lambda_p': [1, 10, 100], 'lambda_z': [0.01, 0.1, 1]}
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=scoring, cv=5, refit='balanced_accuracy',
                    n_jobs=-1, verbose=3)
grid.fit(X_train, y_train)
y_pred = grid.predict(X_test)
print("Balanced accuracy: {}".format(balanced_accuracy_score(y_test, y_pred)))
print('Best params: {}'.format(grid.best_params_))
Output:
Balanced accuracy: 0.8449477351916377
Best params: {'lambda_p': 10, 'lambda_z': 0.01}
Optimizing for a different target metric
Note: w_weight and lambda_e are not part of the main ILP (Eq (11)) defined in the INGOT-DR paper. These two variables are provided for extra freedom when optimizing for a different target metric (Section 1.4 of the paper) is needed: w_weight assigns prior weights to w in the objective function, and lambda_e scales all slack variables.
Example: The classifier corresponding to Eq (16), with maximum rule size k=20 and specificity lower bound t=90%, can be defined as follows:
clf = ingot.INGOTClassifier(w_weight=0, lambda_z=0, false_positive_rate_upper_bound=0.1, max_rule_size=20,
solver_name='CPLEX_PY')
The following table shows the combinations of arguments needed to define some of the ILPs in the paper.
lp_relaxation | only_slack_lp_relaxation | is_it_noiseless | Equation number in the paper |
---|---|---|---|
False | False | False | Eq (11) |
False | True | True | Eq (3) |
False | True | False | Eq (4) with objective function of Eq (11) |
False | False | True | Eq (3) |
True | True | False | LP relaxation of Eq (4) with objective function of Eq (11) |
True | False | False | LP relaxation of Eq (4) with objective function of Eq (11) |
True | False | True | LP relaxation of Eq (3) |
True | True | True | LP relaxation of Eq (3) |
Note: A True value of lp_relaxation or is_it_noiseless overrides only_slack_lp_relaxation, i.e. if either of them is True, the value of only_slack_lp_relaxation does not matter.
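The table and the override rule above can be encoded as a small helper; this is purely illustrative and not part of the package:

```python
def formulation(lp_relaxation, only_slack_lp_relaxation, is_it_noiseless):
    """Return the formulation from the paper that a flag combination selects."""
    if is_it_noiseless:
        eq = "Eq (3)"  # only_slack_lp_relaxation is ignored here
    elif only_slack_lp_relaxation or lp_relaxation:
        eq = "Eq (4) with objective function of Eq (11)"
    else:
        eq = "Eq (11)"
    # A True lp_relaxation turns the ILP into its LP relaxation
    return ("LP relaxation of " + eq) if lp_relaxation else eq

print(formulation(False, False, False))  # Eq (11)
print(formulation(True, False, True))    # LP relaxation of Eq (3)
```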
Note: To recreate and work with Eq (4), you only need to use the combination in row 3 and use or tune lambda_e instead of lambda_p and lambda_z. For example:
param_grid = {'lambda_e': [0.01, 0.1, 1, 10, 100]}
grid = GridSearchCV(estimator=clf, param_grid=param_grid, scoring=scoring, cv=5, refit='balanced_accuracy',
                    n_jobs=-1, verbose=3)
Choosing the solver
INGOT-DR supports a variety of solvers through the PuLP application programming interface (API), including GLPK, COIN-OR CLP/CBC, CPLEX, GUROBI, MOSEK, XPRESS, CHOCO, MIPCL, and SCIP.
List of available solvers on your machine:
import pulp as pl
solver_list = pl.listSolvers(onlyAvailable=True)
The name and options of the solver can be specified via solver_name and solver_options, e.g.:
clf = ingot.INGOTClassifier(solver_name='CPLEX_PY', solver_options={'timeLimit': 1800})
In the INGOT-DR paper, 'CPLEX_PY' is the main solver. IBM CPLEX is available for academic use here.
Citation:
For general use, please cite our paper: INGOT-DR: an interpretable classifier for predicting drug resistance in M. tuberculosis. (bibtex)