# Multi Omics Pipeline

- Overview
- Project Structure
- Setup
- Run with Conda
- Run with plain Nextflow (in progress)
- References
## Overview

The Multi Omics Pipeline is a Nextflow pipeline for evaluating multi-omics (genomics, proteomics, metabolomics) data integration methods on tasks such as classification/regression, factor analysis, and clustering.
## Project Structure

Some important locations:

- Shell scripts for setting up the project are located in `bin/`
- Configurations for the pipeline are in `nextflow.config` and `configs/`
- Python and R source code is located in `modules/`
- Logs, cluster output, and execution reports are in `results/`

For more details on the project organization, please see here.
```
├── bin
│   ├── 01-get_nxf_conda.sh
│   ├── 02-pull_all_containers.sh
│   ├── helper.sh
│   ├── install.sh
│   ├── pbs
│   └── test_job.sh
├── configs
│   ├── base.config
│   ├── local.config
│   └── pbs_remote.config
├── containers
│   ├── dockerfiles
│   │   ├── codia.Dockerfile
│   │   ├── cooperative_learning.Dockerfile
│   │   ├── mixdiablo.Dockerfile
│   │   ├── mogonet.Dockerfile
│   │   ├── R_template.Dockerfile
│   │   ├── rbase.Dockerfile
│   │   └── smgr.Dockerfile
│   ├── names.md
│   ├── README.md
│   └── scripts
│       ├── pull_all.sh
│       └── pull_container.sh
├── data
│   ├── moni_data_reference_data.xlsx
│   ├── multiomics_data.xlsx
│   ├── README.md
│   ├── test1
│   │   └── rnorm_data_*.csv
│   ├── test2
│   └── ...
├── docs
│   ├── personal
│   │   ├── links_for_nextflow.md
│   │   └── notes.md
│   ├── README.md
│   ├── sockeye_paths.md
│   └── todo_logging.md
├── LICENSE
├── main.nf
├── Makefile
├── modules
│   ├── Python
│   │   ├── Python_Mogonet.nf
│   │   └── Python_run_mogonet.py
│   ├── R
│   │   ├── cooperative_learning
│   │   │   ├── R_Cooperative_Learning.nf
│   │   │   └── R_run_cooperative_learning.R
│   │   ├── diablo
│   │   │   ├── diablo_helpers.R
│   │   │   ├── R_Diablo.nf
│   │   │   └── R_run_diablo.R
│   │   ├── helpers.R
│   │   ├── test.R
│   │   ├── unused.R
│   │   └── write_data.R
│   └── README.md
├── nextflow.config
├── pbs_job_nxf.sh
├── README.md
├── results
│   ├── nxf_logs
│   │   ├── 2023-06
│   │   │   └── nxf-run_2023-06-30_17-52-54.log
│   │   ├── 2023-07
│   │   └── ...
│   ├── pbs_output
│   │   ├── 2023-06
│   │   │   └── job-output_2023-06-30_17-52-34.txt
│   │   ├── 2023-07
│   │   └── ...
│   ├── README.md
│   └── reports
│       ├── 2023-06
│       │   └── execution_report_2023-06-30_17-53-06.html
│       ├── 2023-07
│       └── ...
├── rstudio.pbs
├── subworkflow
│   ├── helpers.nf
│   ├── Python.nf
│   ├── R.nf
│   └── README.md
└── test_job.sh
```
## Setup

To execute the pipeline, you need to satisfy the following requirements:

- Nextflow 22.10.7 or above
- Bash 4.2.46
- Java 11 (or later, up to 18); OpenJDK 11.0.18 is recommended
- Docker / Apptainer 1.1.4 (formerly Singularity 3.8.5)
- Conda (optional)
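A quick way to sanity-check the requirements above is a small pre-flight script. This is only a sketch (not part of the repository): it checks that each tool is visible on `PATH` and prints the Bash version.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check for the requirements listed above.
# Only verifies that each tool resolves on PATH; version pinning is up to you.
set -u
for tool in nextflow java conda; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "ok: $tool found"
  else
    echo "missing: $tool"
  fi
done
# Bash reports its own version via a builtin variable
echo "bash ${BASH_VERSION:-unknown}"
```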
Then execute the following commands and choose the alternative you prefer: run with Conda (RECOMMENDED) or with plain Nextflow.

Note: for now, this works on ARC Sockeye only.
- First, choose a place you want to clone this project, preferably in your scratch space:

  ```bash
  # Replace st-singha53-1 with your own allocation code if any
  # $USER is defined on your Sockeye; it is your CWL
  cd /scratch/st-singha53-1/$USER
  ```
- Then, load the required modules and clone this GitHub repository into the path you `cd`'d into before:

  ```bash
  # Assuming you're in your scratch space
  # Load required modules
  module load gcc/9.4.0 git/2.31.1
  # Clone repo
  # After a successful clone, 'multi-omics-pipeline' appears in your current pwd
  git clone https://github.com/tonyliang19/multi-omics-pipeline.git
  ```
- Proceed to one of the options below for instructions on running the project:
  - Conda (RECOMMENDED)
  - Plain Nextflow
## Run with Conda (Sockeye only)

View details here.

Recommended: run `make all` in your terminal on Sockeye.

NOTE: The setup process could take 10-15 minutes the first time; you can come back to it later.
Recommended:

```bash
# cd to the cloned repo
cd multi-omics-pipeline
# Set up the environment and submit a sample batch job
make all # See the Makefile if you wish to know more about it
```
Alternative (if `make` is not available):

- Run the script `install.sh` located in the `bin/` dir:

  ```bash
  cd multi-omics-pipeline
  # This is going to take some time
  bash bin/install.sh
  ```
- After installation finishes, you can submit the job:

  ```bash
  # Assuming you are in the cloned multi-omics-pipeline directory
  # Name the output file by formatted time
  OUTPUT_NAME="results/pbs_output/job-output_$(date +%Y-%m-%d_%H-%M-%S).txt"
  # Submit job
  qsub -o ${OUTPUT_NAME} pbs_job_nxf.sh
  ```
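After submission, you can keep an eye on the job. The commands below assume a standard PBS scheduler (as the `qsub` call above does) and are a sketch, not part of the repository:

```shell
# Sketch: monitor a job submitted to a PBS cluster.
# qsub prints the new job id on stdout, so capture it when submitting:
JOB_ID=$(qsub -o "${OUTPUT_NAME}" pbs_job_nxf.sh)
echo "submitted: ${JOB_ID}"
# Check its state (Q = queued, R = running):
qstat -u "$USER"
# When it finishes, the scheduler output lands in results/pbs_output/:
ls results/pbs_output/
```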
## Run with plain Nextflow

View details here.

Install Nextflow with the following command:

```bash
curl -s https://get.nextflow.io | bash
```
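The installer drops a `nextflow` launcher into the current directory; per the Nextflow documentation, you then make it executable and move it somewhere on your `PATH`. A sketch (the `~/bin` destination is an assumption, not a repo convention):

```shell
# Make the downloaded launcher executable and place it on PATH
chmod +x nextflow
mkdir -p "$HOME/bin"        # ~/bin is an assumed, conventional location
mv nextflow "$HOME/bin/"
export PATH="$HOME/bin:$PATH"
# Confirm it resolves
nextflow -version
```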
Launch the pipeline execution with one of the following commands:

- Local testing environment:

  ```bash
  make run_local
  ```

- Remote cluster environment:

  ```bash
  make run_remote
  ```
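If you prefer to bypass `make`, these targets presumably wrap `nextflow run` with the profiles defined in `configs/`. The exact target bodies live in the Makefile, so the invocations below are an assumption, not verbatim targets:

```shell
# Hypothetical equivalents of the make targets, assuming the -profile names
# match the config files local.config and pbs_remote.config:
nextflow run main.nf -profile local       # roughly what `make run_local` might do
nextflow run main.nf -profile pbs_remote  # roughly what `make run_remote` might do
```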
## License

This project is licensed under the MIT License.
## References

- Ding DY, Li S, Narasimhan B, Tibshirani R. Cooperative learning for multiview analysis. Proc Natl Acad Sci USA. 2022;119:e2202113119.
- Rohart F, Gautier B, Singh A, Le Cao KA. mixOmics: an R package for 'omics feature selection and multiple data integration. PLoS Comput Biol. 2017;13:e1005752.
- Singh A, Shannon CP, Gautier B, Rohart F, Vacher M, Tebbutt SJ, et al. DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics. 2019;35:3055-62.
- Wang T, Shao W, Huang Z, et al. MOGONET integrates multi-omics data using graph convolutional networks allowing patient classification and biomarker identification. Nat Commun. 2021;12:3445.