GC-analysis

A program that compute the GC percentage of a given genomic sequence


Keywords
calculating-gc-percentages, gc-analysis
License
Apache-2.0
Install
pip install GC-analysis==0.4.5

Documentation

Build status docs

GC-analysis

A command-line utility for calculating GC percentages of genome sequences

Quick starter

Calculate the GC content of chromosome 17 of the human reference genome with window size (or span) = 5 and shift (or step) = 5. Input fasta file is GRCh38-Chrom17.fasta and output wiggle file is GRCh38-Chrom17.wig. Note that the output file's extension is added by the program.

~ $ GC_analysis -i GRCh38-Chrom17.fasta -w 5 -s 5 -o GRCh38-Chrom17

Installation guide

Note that pyBigWig can only be used under linux environment. To work with Windows system, the Docker image can be used as shown below. Alternatively, you can clone the repository, comment out import pyBigWig and the script would work but without BigWig support.

  1. Pip install GC_analysis (NB. Python3 is recommanded but GC_analysis should work with Python2 as well)
pip3 install GC_analysis

Then GC_analysis.py command will be available globally.

GC_analysis.py -i [INPUT] -o [OUTPUT] -w [window size] -s [shift]
  1. Run the python script directly. Please ensure you have python3 installed with pyBigwig and Biopython. Clone the github repository and install packages.
git clone https://github.com/tonyyzy/GC_analysis
cd GC_analysis
pip3 install -r requirements.txt

run the script from GC_analysis directory.

python3 ./GC_analysis/GC_analysis.py -i [INPUT] -o [OUTPUT] -w [window size] -s [shift]
  1. Use the packaged binary.
mkdir ~/GC_analysis
cd ~/GC_analysis
wget https://github.com/tonyyzy/GC_analysis/releases/download/v0.3/GC_analysis

Execute the binary command

GC_analysis -i [INPUT] -o [OUTPUT] -w [window size] -s [shift]
  1. Use the Docker image. Firstly, pull the docker image (around 384 MB)
docker pull tonyyzy/gc_analysis

To use input files outside the container and save output files on your computer, the -v volume mapping option will be used. You will need to know the absolute path of the directory you want to map (which can be found out with pwd).

docker run -v /your/local/path:/app tonyyzy/gc_analysis GC_analysis -i /app/yours.fasta -o /app/yours -w 5 -s 5

This option maps /your/local/path to /app under the container's root directory. Your result file will be saved to /your/local/path/yours.wig.

Command-line options

~ $ GC_analysis -h
usage: GC_analysis [-h] -i INPUT_FILE -w WINDOW_SIZE -s SHIFT [-o OUTPUT_FILE]
                   [-ot] [-f {wiggle,gzip,bigwig}]

required named arguments:

-i INPUT_FILE, --input_file INPUT_FILE
INPUTFILE: Name of the input file in FASTA format

-w WINDOW_SIZE, --window_size WINDOW_SIZE
WINDOW_SIZE: Number of base pairs that the GC percentage is calculated for

-s SHIFT, --shift SHIFT
SHIFT: The shift increment (step size)

optional arguments:

-h, --help
Show the help message and exit

-o OUTPUT_FILE, --output_file OUTPUT_FILE
OUTPUT_FILE: Name of the output file

-ot, --omit_tail
Use if the trailing sequence should be omitted. Default behaviour is to retain the leftover sequence.

-f {wiggle,bigwig,gzip}, --output_format {wiggle,bigwig,gzip}
Choose output formats from wiggle, bigwig or gzip compressed wiggle file.

-one, --one_file
Force one file output

Example usage

  1. Calculate the GC content of chromosome 17 of the human reference genome, the percentage is calculated over five base pairs (window_size), and the window is shifted by five base pairs every time (i.e. there is no overlapping base paires in each entry).
~ $ GC_analysis -i GRCh38-Chrom17.fasta -w 5 -s 5 -o GRCh38-Chrom17
  1. By default, the GC percentage of the trailing sequence is calculated and appended to the end of the output file. For example, with the following input
~ $ GC_analysis -i examaple1.fasta -w 5 -s 5 -o with_tail

and example1.fasta is

>chr1
AAAAACC

the generated with_tail.wig will look like

track type=wiggle_0 name="GC percentage" description="chr1"
variableStep chrom=chr1 span=5
1	0
6	100

If it is desirable to omit the trailing sequence in the result, the -ot or --omit_tail option can be used. For example

~ $ GC_analysis -i examaple1.fasta -w 5 -s 5 -o without_tail -ot

will generate output file without_tail with the following content

track type=wiggle_0 name="GC percentage" description="chr1"
variableStep chrom=chr1 span=5
1	0
  1. The program support three output file formats, wiggle, bigwig and gzip compressed wiggle file. Wiggle output file follows the UCSC variableStep format definition. Wiggle file is the default output format. The output format can be changed with -f or --format option.
~ $ GC_analysis -i GRCh38-Chrom17.fasta -w 5 -s 5 -o GRCh38-Chrom17

and

~ $ GC_analysis -i GRCh38-Chrom17.fasta -w 5 -s 5 -o GRCh38-Chrom17 -f wiggle

will generate GRCh38-Chrom17.wig as the output file.

~ $ GC_analysis -i GRCh38-Chrom17.fasta -w 5 -s 5 -o GRCh38-Chrom17 -f gzip

will generate GRCh38-Chrom17.wig.gz as the output file. Decompress GRCh38-Chrom17.wig.gz will give you the same wiggle file as choosing wiggle as the output format.

~ $ GC_analysis -i GRCh38-Chrom17.fasta -w 5 -s 5 -o GRCh38-Chrom17 -f bigwig

will generate GRCh38-Chrom17.bw as the output file. It should be noted that bigwig format does not allow overlapping bases, which means that -w 5 -s 3 is an invalid option with choosing bigwig as the output format. In this case, where shift is smaller than window size and bigwig format is specified, the program will generate a wiggle file instead and output a warning message.

~ $ GC_analysis -i GRCh38-Chrom17.fasta -w 5 -s 3 -o GRCh38-Chrom17 -f bigwig
WARNING! BigWig file does not allow overlapped items. A wiggle file was generated instead.
  1. If an output filename is not given, the result will be written to stdout. If the output filename is not given and a file format other than wiggle was chosen, the program will automatically output the result to stdout and give you a warning before and after the result. Eg.
GC_analysis -i example1.fasta -w 5 -s 3 -f bigwig
WARNING! BigWig file does not allow overlapped items. A wiggle file will be generated instead.
WARNING! An output filename is needed to save output as bigwig. The result is shown below:
track type=wiggle_0 name="GC percentage" description="chr1"
variableStep chrom=chr1 span=5
1       0
4       50
WARNING! BigWig file does not allow overlapped items. A wiggle file was generated instead.
WARNING! An output filename is needed to save output as bigwig. The result is shown above.
  1. If the input FASTA file contains multiple sequences and the results should be written to a single file instead of one sequence per file, you can use the -one optional arguments.
GC_analysis -i multiple.fasta -o multiple -w 5 -s 5 -one

If -one is not specified, each sequence's GC result will be written to one file. The filenames will be the given filename followed by "_seq" + the sequence's number.

Timing againts human chromosomes

Click for raw data table

Entry Human chromosome No. of base pairs Average real time - single thread (s) Average real time - multi threads (s)
CM000663.2.fasta 1 248956422 288.429 179.221
CM000664.2.fasta 2 242193529 276.355 169.611
CM000665.2.fasta 3 198295559 227.528 135.637
CM000666.2.fasta 4 190214555 217.846 153.091
CM000667.2.fasta 5 181538259 205.623 123.858
CM000668.2.fasta 6 170805979 193.209 117.180
CM000669.2.fasta 7 159345973 183.445 109.135
CM000670.2.fasta 8 145138636 166.607 98.632
CM000671.2.fasta 9 138394717 157.142 93.898
CM000672.2.fasta 10 133797422 150.872 92.371
CM000673.2.fasta 11 135086622 154.003 92.498
CM000674.2.fasta 12 133275309 150.533 90.807
CM000675.2.fasta 13 114364328 129.951 77.498
CM000676.2.fasta 14 107043718 121.008 71.970
CM000677.2.fasta 15 101991189 115.194 68.336
CM000678.2.fasta 16 90338345 103.169 60.799
CM000679.2.fasta 17 83257441 94.353 55.729
CM000680.2.fasta 18 80373285 92.020 53.395
CM000681.2.fasta 19 58617616 67.506 39.308
CM000682.2.fasta 20 64444167 74.048 43.280
CM000683.2.fasta 21 46709983 53.633 31.118
CM000684.2.fasta 22 50818468 57.466 33.701
CM000685.2.fasta X 156040895 176.895 105.408
CM000686.2.fasta Y 57227415 67.016 38.142
J01415.2.fasta MT 16569 0.231 0.397

Execution time vs. number of base pairs plot * 1) Real time data is the average of three runs; 2) GC_analysis parameters for each run is -w 5 -s 5; 3) Serial data is collected with the Master branch, Parallel data is collected with the Parallel branch.

As can be seen from the plot, GC_analysis scales well with number of base pairs, resulted a linear relationship between the execution time and the size of the chromosomes. Although multi-threaded version can provide ~1.7x speed improvement, it has a significantly higher memory consumption, hence it's not recommended.

(EXPERIMENTAL!!!) Multi-threaded GC_analysis

Git clone the parallel branch from GitHub repo:

git clone --single-branch -b parallel https://github.com/tonyyzy/GC_analysis

Execute as normal from GC_analysis directory

~ python3 ./scripts/GC_analysis.py -i GRCh38-Chrom17.fasta -w 5 -s 5 -o GRCh38-Chrom17

This multithreading implementation is a very crude one and only result in ~1.7x speed up. A large amount of RAM is needed to store out-of-order intermediate results for sorting.