funmap
integrates multiple omics data sets (such as proteomics and RNASeq) to construct a functional network using supervised machine learning (xgboost).
funmap
requires the following:
- python (>= 3.7)
- numpy (>= 1.17)
- scipy (>= 1.4.0)
- scikit-learn (>= 0.22)
- joblib (>= 0.17.0)
The easiest way to install funmap
is using pip
pip install funmap
To upgrade to a newer release use the --upgrade
flag
pip install --upgrade funmap
usage: funmap [-h] [--version] {qc,run} ...
funmap command line interface
options:
-h, --help show this help message and exit
--version show program's version number and exit
Commands:
{qc,run}
qc check the data quality
run run funmap
Before running the experiment, user can check the quality of the input data using the following command
funmap qc -c test_config.yml
User needs to prepare configuration file and an input data file. The configuration file is a YAML file that specifies the parameters for the experiment. The input data file is a tar gzipped file that contains the data for the experiment. A sample configuration file and a sample input data file can be found in the test
directory.
To run the experiment, use the following command
funmap run -c test_config.yml
The run time of the experiment depends on the size of the input data file. The above command takes about 20-30 minutes to run on a standard computer using 4 threads.
Field | Description | Example Value |
---|---|---|
task | protein or kinase function, can be 'protein_func' or 'kinase_func' | protein_func |
name | experiment name | test |
seed | Seed for random number generation. | 42 |
cor_type | Type of correlation, can be 'pearson' or 'spearman'. | pearson |
feature_type | Type of features used for training a model, can be 'cc' (correlation coefficient) or 'mr' (mutual rank). | cc |
n_jobs | Number of parallel jobs to run. | 8 |
n_chunk | Number of chunks to split the data for parallel processing. | 4 |
start_edge_num | The starting number for evaluating the Log-Likelihood Ratio (LLR) for each individual data set. | 100 |
min_sample_count | Minimum number of valid data points required when computing correlation. | 15 |
max_num_edges | Maximum number of edges to consider. | 25000 |
step_size | Step size for the evaluation of LLR. | 1 |
lr_cutoff | Likelihood Ratio (LR) cutoff value. | 10 |
data_files | List of data files with their name, type, and path. | See the example |
funmap
package requires only a standard computer with enough RAM to support the in-memory operations.
The output directory contains the following files and directories:
.
├── config.yml
├── figures
│  └── results.pdf
├── llr_dataset.tsv
├── llr_results_ei_25000.tsv
├── llr_results_ex_25000.tsv
├── networks
│  ├── funmap.tsv
│  ├── network_ei_25000.tsv
│  └── network_ex_25000.tsv
├── saved_data
│  ├── all_features.fth
│  ├── all_pairs.tsv.gz
│  ├── all_valid_gene.txt
│  ├── gold_standard_test_neg.pkl.gz
│  ├── gold_standard_test_pos.pkl.gz
│  └── gold_standard_train.pkl.gz
├── saved_models
│  └── model.pkl.gz
└── saved_predictions
└── predicted_all_pairs.pkl.gz
-
config.yml
: the configuration file used for the experiment -
figures
: the directory that contains the figures generated by the experiment. If QC was performed, the figures will be saved in this directory also. -
llr_dataset.tsv
: a tsv file contains log-likelihood ratio (LLR) analysis for each individual input data set. -
llr_results_ei_25000.tsv
: a tsv file contains LLR analysis for predictions based on the model trained with mutual rank and PPI features. The number in the file name indicates the maximum number of edges selected for LLR analysis. -
llr_results_ex_25000.tsv
: a tsv file contains LLR analysis for predictions based on the model trained with mutual rank features. -
networks
: the directory that contains the predicted networks. The network files are tab-separated files with three columns: gene1, gene2, and score. The score is the predicted probability of the edge between gene1 and gene2.funmap.tsv
is the final predicted network. The edges meet the required LLR threshold. -
saved_data
: the directory that contains the saved data used for the experiment. -
saved_models
: the directory that contains the trained model. -
saved_predictions
: the directory that contains the predicted probabilities for all pairs of genes.