dtool-overlay

dtool CLI utilities for working with per item metadata


License
MIT
Install
pip install dtool-overlay==0.2.0

Documentation

dtool CLI commands for working with per item metadata

PyPi package Travis CI build status (Linux) Code Coverage

Installation

pip install dtool-overlay

Example usage

Get a dataset to play with:

LOCAL_DS_URI=$(dtool cp -q http://bit.ly/Ecoli-ref-genome .)

Show the existing overlays:

$ dtool overlays show $LOCAL_DS_URI
identifiers,relpaths
23ebd7cd21a905d5f255919ca1d0491901cb8718,reference.4.bt2
37e2d68bb38271036d96b6979d24666e0d4fd814,reference.rev.1.bt2
41fb9ae5d4f6c37226ff324c701b84bc3110709e,reference.1.bt2
828ebf503926b7c1b8b07c1995b4ca818814b404,reference.rev.2.bt2
b445ff5a1e468ab48628a00a944cac2e007fb9bc,U00096.3.fasta
d21454a7338c53eabc8d8ed7c2f9c3ff4585c4cf,reference.3.bt2
dda8452b346d51b9cf60f0662ef3d6e3b6da2e74,reference.2.bt2

The output above show that there are no overlays on this dataset. (The "identifiers" and "relpaths" columns are there for bookkeeping).

Create a "is_fasta" boolean overlay template by using a glob pattern:

$ dtool overlays template glob $LOCAL_DS_URI is_fasta '*.fasta' > is_fasta.csv
$ cat is_fasta.csv
identifiers,is_fasta,relpaths
23ebd7cd21a905d5f255919ca1d0491901cb8718,False,reference.4.bt2
37e2d68bb38271036d96b6979d24666e0d4fd814,False,reference.rev.1.bt2
41fb9ae5d4f6c37226ff324c701b84bc3110709e,False,reference.1.bt2
828ebf503926b7c1b8b07c1995b4ca818814b404,False,reference.rev.2.bt2
b445ff5a1e468ab48628a00a944cac2e007fb9bc,True,U00096.3.fasta
d21454a7338c53eabc8d8ed7c2f9c3ff4585c4cf,False,reference.3.bt2
dda8452b346d51b9cf60f0662ef3d6e3b6da2e74,False,reference.2.bt2

Write the overlay template to the dataset:

$ dtool overlays write $LOCAL_DS_URI is_fasta.csv

Show the newly created overlay:

$ dtool overlays show $LOCAL_DS_URI
identifiers,is_fasta,relpaths
23ebd7cd21a905d5f255919ca1d0491901cb8718,False,reference.4.bt2
37e2d68bb38271036d96b6979d24666e0d4fd814,False,reference.rev.1.bt2
41fb9ae5d4f6c37226ff324c701b84bc3110709e,False,reference.1.bt2
828ebf503926b7c1b8b07c1995b4ca818814b404,False,reference.rev.2.bt2
b445ff5a1e468ab48628a00a944cac2e007fb9bc,True,U00096.3.fasta
d21454a7338c53eabc8d8ed7c2f9c3ff4585c4cf,False,reference.3.bt2
dda8452b346d51b9cf60f0662ef3d6e3b6da2e74,False,reference.2.bt2

To extract multiple pieces of metadata from the items' relpath one can use the dtool overlays template parse command. This takes as input a dataset URI, a parse rule (see https://pypi.org/project/parse/ for more details) and a glob rule. The latter decides which relpaths to apply the parsing to.

Consider for example the dataset below:

$ dtool ls http://bit.ly/Ecoli-reads-minified
8bda245a8cd526673aab775f90206c8b67d196af  ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc  ERR022075_1.fastq.gz

The command below could be used to generate a template for the overlays "useful_name" and "read":

$ dtool overlays template parse  \
    http://bit.ly/Ecoli-reads-minified  \
    '{useful_name}_{read:d}.fastq.gz'

Results in the CSV output below:

identifiers,read,useful_name,relpaths
8bda245a8cd526673aab775f90206c8b67d196af,2,ERR022075,ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc,1,ERR022075,ERR022075_1.fastq.gz

To ignore a variable element when parsing one can use unnamed curly braces. The command below for example only generates the overlay "useful_name":

$ dtool overlays template parse  \
    http://bit.ly/Ecoli-reads-minified  \
    '{useful_name}_{:d}.fastq.gz'
identifiers,useful_name,relpaths
8bda245a8cd526673aab775f90206c8b67d196af,ERR022075,ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc,ERR022075,ERR022075_1.fastq.gz

Sometimes it is useful to be able to find pairs of items. For example when dealing with genomic sequencing data that has forward and reverse reads.

One can create a "pair_id" overlay CSV template for this dataset using the command below:

$  dtool overlays template pairs http://bit.ly/Ecoli-reads-minified .fastq.gz
identifiers,pair_id,relpaths
8bda245a8cd526673aab775f90206c8b67d196af,9760280dc6313d3bb598fa03c5931a7f037d7ffc,ERR022075_2.fastq.gz
9760280dc6313d3bb598fa03c5931a7f037d7ffc,8bda245a8cd526673aab775f90206c8b67d196af,ERR022075_1.fastq.gz

In the above the suffix ".fastq.gz" is used to extract the prefix ERR022075_ that is used to find matching pairs.

Related packages