# snv-spectrum

A Python 3.6 library for plotting mutational spectra

**Installation**
·
**Tutorial**
·
**Contributing**

A base substitution spectrum includes the frequency or probability of all transition and transversions in the DNA grammar. Including the local context of the base substitution aids in the interpretation of biochemical process and disease etiology.

This **in-development** library will aid in the construction of spectrums with different local context sizes and their visualization in Python.

Please see Contributing for feature requests and an intended roadmap.

### Installation

```
❯ pip install git+https://github.com/clintval/snv-spectrum.git
```

### Tutorial

The base unit of the library is the `Snv`

which represents a transition or transversion in a given local context.

The local context must be symmetrical in length about the base substitution.

```
from snv_spectrum import Snv, Spectrum, plot_spectrum
snv = Snv(reference='G', alternate='T', context='AGC')
```

Unless the chemical process for the base substitution is specifically known it is useful to represent all base substitutions in a canonical form with either a pyrimidine or purine as the reference base.

```
>>> snv.with_pyrimidine_reference
Snv(reference="C", alternate="A", context="GCT")
```

You can automatically generate a spectrum of `Snv`

by specifying both the size of the local context and the reference notation.

```
>>> spectrum = Spectrum(k=3, reference_notation='pyrimidine')
>>> list(Spectrum(k=3, reference_notation='pyrimidine'))
"""
[(Snv(reference="C", alternate="A", context="ACA"), 0),
(Snv(reference="C", alternate="A", context="ACC"), 0),
(Snv(reference="C", alternate="A", context="ACG"), 0),
(Snv(reference="C", alternate="A", context="ACT"), 0),
(Snv(reference="C", alternate="A", context="CCA"), 0),
(Snv(reference="C", alternate="A", context="CCC"), 0),
...
"""
```

Begin to record observations by accessing the `Spectrum`

like a Python dictionary.

`spectrum[snv] += 2`

If you already have a vector of counts or probabilities then you can build a `Spectrum`

quickly as long as the data is listed in the correct lexicographic order of the chosen reference notation.

```
vector = [0, 2, 3, 4, ..., 95]
spectrum = Spectrum(k=3, reference_notation='pyrimidine')
for snv, count in zip(spectrum.substitutions, vector):
spectrum[snv] = count
```

##### Working with Probability

Many spectra are produced from whole-genome or whole-exome sequencing experiments. Spectra must be normalized to the *kmer* frequencies in the target study. Without normalization, no valid spectrum comparison can be made between data generated from different target territories or species.

By default each `Snv`

is given a weight of 1 and calling `spectrum.density`

will simply give the proportion of `Snv`

counts in the `Spectrum`

. After weights are set to the observed *kmer* counts or frequency of the target territory, calling `spectrum.density`

will compute a true normalized probability density.

All weights can be set with assignment *e.g.*: `spectrum.weights['ACA'] = 23420`

.

```
>>> spectrum.density
"""
{Snv(reference="C", alternate="A", context="ACA"): 0.015677491601343786,
Snv(reference="C", alternate="A", context="ACC"): 0.007838745800671893,
Snv(reference="C", alternate="A", context="ACG"): 0.0011198208286674132,
Snv(reference="C", alternate="A", context="ACT"): 0.006718924972004479,
Snv(reference="C", alternate="A", context="CCA"): 0.010078387458006719,
Snv(reference="C", alternate="A", context="CCC"): 0.008958566629339306,
...
"""
```

*Kmer* counts can be found with `skbio.DNA.kmer_frequencies`

for small targets and with `jellyfish`

for large targets.

##### Plotting

Spectra with `k=3`

in either `pyrimidine`

or `purine`

reference notation can be plotted using a style that was first used in Alexandrov *et. al.* in 2013 (PMID: 23945592).

Both `Snv`

raw counts (`kind="count"`

) or their probabilities (`kind="density"`

) can be plotted.

The figure and axes are returned to allow for custom formatting.

```
import numpy
spectrum = Spectrum(k=3, reference_notation='pyrimidine')
for snv, count in zip(spectrum.substitutions, range(96)):
spectrum[snv] = numpy.random.randint(20)
fig, axes = plot_spectrum(spectrum, kind='density')
```

### Contributing

Pull requests, feature requests, and issues welcome!

Unit tests, continuous test integration, docstrings, and code coverage to come soon.