A deep-learning method for detecting DNA methylation state from Oxford Nanopore sequencing reads


Keywords
methylation, nanopore, neural, network, bioinformatics, epigenetics, nanopore-sequencing, tensorflow
License
GPL-3.0
Install
pip install deepsignal==0.2.0

Documentation

News

  • 2021.03.15: We developed deepsignal2. Compared to deepsignal, deepsignal2 has much smaller DNN model in size, and slightly better performance in 5mCpG detection of human.

DeepSignal

Python PyPI version GitHub License PyPI-Downloads PyPI-Downloads/m

A deep-learning method for detecting DNA methylation state from Oxford Nanopore sequencing reads.

DeepSignal constructs a BiLSTM+Inception structure to detect DNA methylation state from Nanopore reads. It is built with Tensorflow and Python 3.

Contents

Installation

deepsignal is built on Python3. tombo is required to re-squiggle the raw signals from nanopore reads before running deepsignal.

1. Create an environment

We highly recommend using a virtual environment for the installation of deepsignal and its dependencies. A virtual environment can be created and (de)activated as follows by using conda:

# create
conda create -n deepsignalenv python=3.7
# activate
conda activate deepsignalenv
# deactivate
conda deactivate

The virtual environment can also be created by using virtualenv.

2. Install deepsignal

  • After creating and activating the environment, download and install deepsignal (latest version) from github:
git clone https://github.com/bioinfomaticsCSU/deepsignal.git
cd deepsignal
python setup.py install

or install deepsignal using pip:

pip install deepsignal
  • tombo is required to be installed in the same environment:
# install using conda
conda install -c bioconda ont-tombo
# or install using pip
pip install ont-tombo
  • install tensorflow (version: 1.8.0<=tensorflow<=1.13.1) in the same environment:
# install using conda
conda install -c anaconda tensorflow==1.13.1
# or install using pip
pip install 'tensorflow==1.13.1'

If a GPU-machine is used, install the gpu version of tensorflow. The cpu version is not required:

# install using conda
conda install -c anaconda tensorflow-gpu==1.13.1
# or install using pip
pip install 'tensorflow-gpu==1.13.1'

Trained models

The models we trained can be downloaded from google drive.

Currently we have trained the following models:

  • model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+.tar.gz: A CpG model trained using HX1 R9.4 1D reads (for deepsignal>=0.1.7).
  • model.CpG.R9.4_1D.human_hx1.bn17.sn360.tar.gz: A CpG model trained using HX1 R9.4 1D reads (for deepsignal<=0.1.6).
  • model.GATC.R9_2D.tem.puc19.bn17.sn360.tar.gz: A GATC model trained using pUC19 R9 2D template reads (for deepsignal<=0.1.6).

Example data

The example data can be downloaded from google drive.

  • fast5s.sample.tar.gz: The data contain ~4000 yeast R9.4 1D reads each with called events (basecalled by Albacore), along with a genome reference.

Quick start

To call modifications, the raw fast5 files should be basecalled (Guppy or Albacore) and then be re-squiggled by tombo. At last, modifications of specified motifs can be called by deepsignal. The following are commands to call 5mC in CG contexts from the example data:

# 1. guppy basecall
guppy_basecaller -i fast5s.al -r -s fast5s.al.guppy --config dna_r9.4.1_450bps_hac_prom.cfg
cat fast5s.al.guppy/*.fastq > fast5s.al.guppy.fastq
# 2. tombo resquiggle
tombo preprocess annotate_raw_with_fastqs --fast5-basedir fast5s.al --fastq-filenames fast5s.al.guppy.fastq --sequencing-summary-filenames fast5s.al.guppy/sequencing_summary.txt --basecall-group Basecall_1D_000 --basecall-subgroup BaseCalled_template --overwrite --processes 10
tombo resquiggle fast5s.al GCF_000146045.2_R64_genomic.fna --processes 10 --corrected-group RawGenomeCorrected_001 --basecall-group Basecall_1D_000 --overwrite
# 3. deepsignal call_mods
deepsignal call_mods --input_path fast5s.al/ --model_path model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+/bn_17.sn_360.epoch_9.ckpt --result_file fast5s.al.CpG.call_mods.tsv --corrected_group RawGenomeCorrected_001 --nproc 10 --is_gpu no
python /path/to/deepsignal/scripts/call_modification_frequency.py --input_path fast5s.al.CpG.call_mods.tsv --result_file fast5s.al.CpG.call_mods.frequency.tsv

Usage

1. Basecall and re-squiggle

Before run deepsignal, the raw reads should be basecalled (Guppy or Albacore) and then be processed by the re-squiggle module of tombo.

Note:

  • If the fast5 files are in multi-read FAST5 format, please use multi_to_single_fast5 command from the ont_fast5_api package to convert the fast5 files first (Ref to issue #173 in tombo).
multi_to_single_fast5 -i $multi_read_fast5_dir -s $single_read_fast5_dir -t 30 --recursive

For the example data:

# 1. basecall
guppy_basecaller -i fast5s.al -r -s fast5s.al.guppy --config dna_r9.4.1_450bps_hac_prom.cfg
# 2. proprecess fast5 if basecall results are saved in fastq format
cat fast5s.al.guppy/*.fastq > fast5s.al.guppy.fastq
tombo preprocess annotate_raw_with_fastqs --fast5-basedir fast5s.al --fastq-filenames fast5s.al.guppy.fastq --sequencing-summary-filenames fast5s.al.guppy/sequencing_summary.txt --basecall-group Basecall_1D_000 --basecall-subgroup BaseCalled_template --overwrite --processes 10
# 3. resquiggle, cmd: tombo resquiggle $fast5_dir $reference_fa
tombo resquiggle fast5s.al GCF_000146045.2_R64_genomic.fna --processes 10 --corrected-group RawGenomeCorrected_001 --basecall-group Basecall_1D_000 --overwrite

2. extract features

Features of targeted sites can be extracted for training or testing.

For the example data (deepsignal extracts 17-mer-seq and 360-signal features of each CpG motif in reads by default. Note that the value of --corrected_group must be the same as that of --corrected-group in tombo.):

deepsignal extract --fast5_dir fast5s.al/ --write_path fast5s.al.CpG.signal_features.17bases.rawsignals_360.tsv --corrected_group RawGenomeCorrected_001 --nproc 10

The extracted_features file is a tab-delimited text file in the following format:

  • chrom: the chromosome name
  • pos: 0-based position of the targeted base in the chromosome
  • strand: +/-, the aligned strand of the read to the reference
  • pos_in_strand: 0-based position of the targeted base in the aligned strand of the chromosome (legacy column, not necessary for downstream analysis)
  • readname: the read name
  • read_strand: t/c, template or complement
  • k_mer: the sequence around the targeted base
  • signal_means: signal means of each base in the kmer
  • signal_stds: signal stds of each base in the kmer
  • signal_lens: lens of each base in the kmer
  • cent_signals: the central signals of the kmer
  • methy_label: 0/1, the label of the targeted base, for training

3. call modifications

The extracted features can be used to call modifications as follows (If a GPU-machine is used, set --is_gpu to "yes".):

# the CpGs are called by using the CpG model of HX1 R9.4 1D
deepsignal call_mods --input_path fast5s.al.CpG.signal_features.17bases.rawsignals_360.tsv --model_path model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+/bn_17.sn_360.epoch_9.ckpt --result_file fast5s.al.CpG.call_mods.tsv --nproc 10 --is_gpu no

The modifications can also be called from the fast5 files directly:

# use CPU
deepsignal call_mods --input_path fast5s.al/ --model_path model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+/bn_17.sn_360.epoch_9.ckpt --result_file fast5s.al.CpG.call_mods.tsv --corrected_group RawGenomeCorrected_001 --nproc 10 --is_gpu no
# or use GPU
CUDA_VISIBLE_DEVICES=0 deepsignal call_mods --input_path fast5s.al/ --model_path model.CpG.R9.4_1D.human_hx1.bn17.sn360.v0.1.7+/bn_17.sn_360.epoch_9.ckpt --result_file fast5s.al.CpG.call_mods.tsv --corrected_group RawGenomeCorrected_001 --nproc 10 --is_gpu yes

The modification_call file is a tab-delimited text file in the following format:

  • chrom: the chromosome name
  • pos: 0-based position of the targeted base in the chromosome
  • strand: +/-, the aligned strand of the read to the reference
  • pos_in_strand: 0-based position of the targeted base in the aligned strand of the chromosome (legacy column, not necessary for downstream analysis)
  • readname: the read name
  • read_strand: t/c, template or complement
  • prob_0: [0, 1], the probability of the targeted base predicted as 0 (unmethylated)
  • prob_1: [0, 1], the probability of the targeted base predicted as 1 (methylated)
  • called_label: 0/1, unmethylated/methylated
  • k_mer: the kmer around the targeted base

A modification-frequency file can be generated by the script scripts/call_modification_frequency.py with the modification_call file:

python /path/to/deepsignal/scripts/call_modification_frequency.py --input_path fast5s.al.CpG.call_mods.tsv --result_file fast5s.al.CpG.call_mods.frequency.tsv --prob_cf 0

The modification_frequency file is a tab-delimited text file in the following format:

  • chrom: the chromosome name
  • pos: 0-based position of the targeted base in the chromosome
  • strand: +/-, the aligned strand of the read to the reference
  • pos_in_strand: 0-based position of the targeted base in the aligned strand of the chromosome (legacy column, not necessary for downstream analysis)
  • prob_0_sum: sum of the probabilities of the targeted base predicted as 0 (unmethylated)
  • prob_1_sum: sum of the probabilities of the targeted base predicted as 1 (methylated)
  • count_modified: number of reads in which the targeted base counted as modified
  • count_unmodified: number of reads in which the targeted base counted as unmodified
  • coverage: number of reads aligned to the targeted base
  • modification_frequency: modification frequency
  • k_mer: the kmer around the targeted base

4. train new models

A new model can be trained as follows:

# need two independent datasets for training and validating
# use deepsignal train -h/--help for more details
deepsignal train --train_file /path/to/train_data/file --valid_file /path/to/valid_data/file --model_dir /dir/to/save/the/new/model

Publication

Peng Ni, Neng Huang, Zhi Zhang, De-Peng Wang, Fan Liang, Yu Miao, Chuan-Le Xiao, Feng Luo, and Jianxin Wang, "DeepSignal: detecting DNA methylation state from Nanopore sequencing reads using deep-learning.", Bioinformatics 35, no. 22 (2019): 4586-4595. doi:10.1093/bioinformatics/btz276

License

Copyright (C) 2018 Jianxin Wang, Feng Luo, Peng Ni, Neng Huang

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.

Jianxin Wang, Peng Ni, Neng Huang, School of Information Science and Engineering, Central South University, Changsha 410083, China

Feng Luo, School of Computing, Clemson University, Clemson, SC 29634, USA