hpo-downloader

Python package to download HPO annotations and mapping to Uniprot ID and AC and CAFA4 IDs.


License
MIT
Install
pip install hpo-downloader==1.1.0

Documentation


How do I install this package?

As usual, just install it using pip:

pip install hpo_downloader

Tests Coverage

Since coverage tools sometimes report slightly different results, here are three of them:

Coveralls Coverage, SonarCloud Coverage, Code Climate Coverage

Pipeline

The package pipeline is illustrated in the following image:

Pipeline

Preprocessing

For the pre-processing, you have to retrieve the Uniprot mapping files by asking the Uniprot team directly, since each mapping is around 17GB. Save each file within this repository as "mapping/{month}/idmapping.dat.gz".

The pre-processing results are cached within the Python package, so there is no need to retrieve the original files unless you need to fully reproduce the pipeline.

For each release, we have to extract the "GeneID" rows and the human Uniprot ID rows, which we can do using zgrep.

zgrep "GeneID" mapping/{month}/idmapping.dat.gz > mapping/{month}/gene_id.tsv
zgrep "HUMAN" mapping/{month}/idmapping.dat.gz > mapping/{month}/human_ids.tsv
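
Where zgrep is not available, the same filtering can be sketched in Python with the standard gzip module. This is only an illustrative equivalent, not part of the package: the function name filter_gzip_lines is hypothetical, and it does a plain substring match, which is enough for fixed keywords such as "GeneID" and "HUMAN".

```python
import gzip


def filter_gzip_lines(source_path, target_path, keyword):
    """Copy the lines of a gzipped text file that contain `keyword`.

    Hypothetical helper, not part of hpo_downloader.
    Returns the number of lines written to `target_path`.
    """
    written = 0
    # Stream line by line so the large mapping file never has to fit in memory.
    with gzip.open(source_path, "rt") as source, open(target_path, "w") as target:
        for line in source:
            if keyword in line:
                target.write(line)
                written += 1
    return written
```

For example, filter_gzip_lines(f"mapping/{month}/idmapping.dat.gz", f"mapping/{month}/gene_id.tsv", "GeneID") would produce the same uncompressed file as the first zgrep command.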

Now we have to map Uniprot IDs to GeneIDs by joining the two files on the Uniprot AC. The mapping is non-bijective, that is, not one-to-one. We can use the package function non_unique_mapping for this.

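from hpo_downloader.utils import non_unique_mapping
import pandas as pd

# Release to process (example value; use the month of your mapping file).
month = "november"

# Columns 0 and 2 of the filtered files hold the Uniprot AC and the mapped ID.
gene_id = pd.read_csv(
    f"mapping/{month}/gene_id.tsv",
    sep="\t",
    header=None,
    usecols=[0, 2]
)
gene_id.columns = ["uniprot_ac", "gene_id"]
human_id = pd.read_csv(
    f"mapping/{month}/human_ids.tsv",
    sep="\t",
    header=None,
    usecols=[0, 2]
)
human_id.columns = ["uniprot_ac", "uniprot_id"]
# Join the two tables on the Uniprot AC and store the result in the package cache.
non_unique_mapping(gene_id, human_id, "uniprot_ac").to_csv(
    f"hpo_downloader/uniprot/data/{month}.tsv.gz",
    sep="\t",
    index=False
)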
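
To give an idea of what a non-bijective mapping on the AC column looks like, here is a minimal sketch built on a plain pandas inner merge. It is not the implementation of non_unique_mapping, whose internals are not shown in this README: the helper toy_non_unique_mapping and all the sample values are made up for illustration.

```python
import pandas as pd


def toy_non_unique_mapping(left, right, on):
    """Inner-join two frames on `on`, keeping every matching pair of rows.

    Hypothetical stand-in used only to illustrate the idea.
    """
    # An inner merge yields one row per (left, right) match, so a key that
    # appears several times on either side produces several output rows.
    return left.merge(right, on=on, how="inner").drop_duplicates()


# Made-up values, for illustration only.
gene_id = pd.DataFrame({
    "uniprot_ac": ["AC_1", "AC_1", "AC_2"],
    "gene_id": [101, 102, 201],
})
human_id = pd.DataFrame({
    "uniprot_ac": ["AC_1", "AC_2", "AC_3"],
    "uniprot_id": ["ONE_HUMAN", "TWO_HUMAN", "THREE_HUMAN"],
})

toy = toy_non_unique_mapping(gene_id, human_id, "uniprot_ac")
print(toy)
```

In this toy case AC_1 is kept twice, once per GeneID, while AC_3 is dropped because it has no GeneID row.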

Package usage examples

To generate the complete mapping (optionally filtering only for Uniprot IDs within CAFA4) proceed as follows:

from hpo_downloader import mapping

my_mapping = mapping(
    month="november"
)

my_mapping_cafa_only = mapping(
    month="november",
    cafa_only=True
)

The obtained pandas DataFrames look as follows:

HPO mappings: October, November, December

gene_id  hpo_id      uniprot_ac  uniprot_id
8192     HP:0004322  Q16740      CLPP_HUMAN
8192     HP:0001250  Q16740      CLPP_HUMAN
8192     HP:0000786  Q16740      CLPP_HUMAN
8192     HP:0000007  Q16740      CLPP_HUMAN
8192     HP:0000252  Q16740      CLPP_HUMAN

HPO mappings (CAFA4 only): October (CAFA only), November (CAFA only), December (CAFA only)

cafa4_id      uniprot_id   gene_id  hpo_id      uniprot_ac
T96060000002  1433E_HUMAN  7531     HP:0000960  P62258
T96060000002  1433E_HUMAN  7531     HP:0001539  P62258
T96060000002  1433E_HUMAN  7531     HP:0002119  P62258
T96060000002  1433E_HUMAN  7531     HP:0002120  P62258
T96060000002  1433E_HUMAN  7531     HP:0000463  P62258
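
Since the result is a regular pandas DataFrame, it can be processed with the usual pandas tools. As a small sketch, the block below rebuilds the first sample rows shown above by hand (so that it runs without the package) and collects the HPO terms of each protein; with the real data, the frame would come from mapping() instead.

```python
import pandas as pd

# Rows copied from the sample output above; the real frame comes from mapping().
sample = pd.DataFrame({
    "gene_id": [8192] * 5,
    "hpo_id": [
        "HP:0004322", "HP:0001250", "HP:0000786", "HP:0000007", "HP:0000252"
    ],
    "uniprot_ac": ["Q16740"] * 5,
    "uniprot_id": ["CLPP_HUMAN"] * 5,
})

# Collect the HPO terms annotated to each protein, keyed by Uniprot ID.
hpo_by_protein = sample.groupby("uniprot_id")["hpo_id"].apply(list).to_dict()
print(hpo_by_protein["CLPP_HUMAN"])
```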

Author notes

HPO missing GeneID mappings

Around 54 to 55 GeneID to Uniprot ID mappings are currently missing in Uniprot. I have already reported this to the Uniprot team and will update the package accordingly if anything can be done about them.

Month     HPO unique missed samples  HPO unique missed percentage  HPO total missed samples  HPO total missed percentage
October   54                         1.26%                         3076                      1.86%
November  55                         1.28%                         3162                      1.91%
December  55                         1.28%                         3162                      1.91%
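
The difference between the "unique" and "total" columns can be sketched as follows. This assumes that "unique" counts distinct GeneIDs without a Uniprot match and that "total" counts the annotation rows carrying those GeneIDs; that reading is an assumption, and the data below is made up rather than taken from HPO.

```python
import pandas as pd

# Made-up annotations and mapped GeneIDs, for illustration only.
annotations = pd.DataFrame({
    "gene_id": [1, 1, 2, 3, 3, 3],
    "hpo_id": ["HP:A", "HP:B", "HP:A", "HP:A", "HP:B", "HP:C"],
})
mapped_gene_ids = {1, 2}

# Boolean mask of the annotation rows whose GeneID has no mapping.
missed = ~annotations["gene_id"].isin(mapped_gene_ids)

# "Unique" statistics count each GeneID once.
unique_missed = annotations.loc[missed, "gene_id"].nunique()
unique_pct = 100 * unique_missed / annotations["gene_id"].nunique()

# "Total" statistics count every annotation row.
total_missed = int(missed.sum())
total_pct = 100 * total_missed / len(annotations)

print(f"unique: {unique_missed} ({unique_pct:.2f}%)")
print(f"total: {total_missed} ({total_pct:.2f}%)")
```

A GeneID with many annotations weighs more in the total count than in the unique count, which is why the two percentages can differ.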

HPO phenotype ID to CAFA4 Uniprot_IDs missed mappings

A considerable percentage (around 80%) of the HUMAN Uniprot IDs used in CAFA4 cannot be mapped to HPO phenotype IDs.

Month     CAFA4 unique missed samples  CAFA4 unique missed percentage  CAFA4 total missed samples  CAFA4 total missed percentage
October   16182                        79.21%                          16182                       79.21%
November  16184                        79.22%                          16184                       79.22%
December  16187                        79.23%                          16187                       79.23%