Basic functions to start using semantic similarity measures directly from an RDF or OWL file.


Keywords
graphs, semantic, similarity, ontologies
License
MIT
Install
pip install ssmpy==0.2.4

Documentation

DiShIn: Semantic Similarity Measures using Disjunctive Shared Information

This software package provides the basic functions to start using semantic similarity measures directly from an RDF or OWL file.

A web tool using this package is available at: http://labs.fc.ul.pt/dishin/

Package documentation: https://dishin.readthedocs.io/en/latest/

Reference:

INSTALLATION

Either clone this repository or install from PyPI:

pip install ssmpy

If you use it from the shell, you need to install python3, sqlite3, rdflib and pandas:

sudo apt-get update
sudo apt-get install python3 python3-rdflib python3-pandas sqlite3

and then clone and enter the folder:

git clone https://github.com/lasigeBioTM/DiShIn.git
cd DiShIn

If you only have python2 or cannot install packages, create and use a lighter version of DiShIn:

curl https://raw.githubusercontent.com/lasigeBioTM/DiShIn/master/dishin.py | sed -e 's/import ssmpy/import ssm\nimport annotations/; s/ssmpy\.ssm\./ssm./g; s/ssmpy\./ssm./g; s/ssm.get_uniprot_annotations/annotations.get_uniprot_annotations/g' > dishin.py
curl https://raw.githubusercontent.com/lasigeBioTM/DiShIn/master/ssmpy/ssm.py | sed 's/from ssmpy./# from ssmpy./' > ssm.py
curl https://raw.githubusercontent.com/lasigeBioTM/DiShIn/master/ssmpy/annotations.py | sed 's/import ssmpy./import /; s/ssmpy./ssm./' > annotations.py

Note that this light version cannot create new databases.

USAGE:

You can use DiShIn as a command line tool with the dishin.py script of this repository:

python dishin.py <semanticbase>.db <term1> <term2>
python dishin.py <semanticbase>.[owl|rdf] <semanticbase>.db <name_prefix> <relation> <annotation_file>

or use the python functions directly:

>>> import ssmpy

You can find more usage examples at https://dishin.readthedocs.io/en/latest/other_examples.html.
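
The two command-line forms listed above can also be assembled from Python, which makes the role of each positional argument explicit. This is only a sketch: the helper names similarity_command and create_command are hypothetical and not part of ssmpy, and it assumes dishin.py sits in the current folder.

```python
import sys

def similarity_command(db, term1, term2, script="dishin.py"):
    """Arguments for: python dishin.py <semanticbase>.db <term1> <term2>"""
    return [sys.executable, script, db, term1, term2]

def create_command(ontology, db, name_prefix, relation,
                   annotation_file="", script="dishin.py"):
    """Arguments for: python dishin.py <semanticbase>.[owl|rdf] <semanticbase>.db
    <name_prefix> <relation> <annotation_file>"""
    return [sys.executable, script, ontology, db,
            name_prefix, relation, annotation_file]

cmd = similarity_command("metals.db", "copper", "gold")
print(" ".join(cmd))
# To actually run it from a DiShIn checkout:
# import subprocess; subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the long IRIs used for the name prefix and the relation.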

Metals Example

To create the semantic base file (metals.db) from the metals.owl file:

python dishin.py metals.owl metals.db https://raw.githubusercontent.com/lasigeBioTM/ssm/master/metals.owl# http://www.w3.org/2000/01/rdf-schema#subClassOf metals.txt

The metals.txt file contains a list of occurrences, one term per line. For example, the following contents have one occurrence of each term, except gold and silver, which have two occurrences each.

gold
silver
gold
silver
copper
platinum
palladium
metal
coinage
precious
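
As a quick check of the occurrence counts described above, the file contents can be tallied with the standard library. This sketch only counts lines; it does not reproduce how DiShIn turns these counts into extrinsic information content.

```python
from collections import Counter

# Contents of metals.txt as listed above, one occurrence per line.
metals_txt = """gold
silver
gold
silver
copper
platinum
palladium
metal
coinage
precious
"""

# Count how many times each term occurs, ignoring blank lines.
occurrences = Counter(
    line.strip() for line in metals_txt.splitlines() if line.strip()
)

for term, count in occurrences.most_common():
    print(term, count)
```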

Now to calculate the similarity between copper and gold execute:

python dishin.py metals.db copper gold

Output:

Resnik     DiShIn    intrinsic          0.2938933324510595
Resnik     MICA      intrinsic          0.587786664902119
Lin        DiShIn    intrinsic          0.19539774554219633
Lin        MICA      intrinsic          0.39079549108439265
JC         DiShIn    intrinsic          0.29236619053475066
JC         MICA      intrinsic          0.35303485982596094
Resnik     DiShIn    extrinsic          0.22599256187152864
Resnik     MICA      extrinsic          0.45198512374305727
Lin        DiShIn    extrinsic          0.1504595366201814
Lin        MICA      extrinsic          0.3009190732403628
JC         DiShIn    extrinsic          0.281527889373394
JC         MICA      extrinsic          0.322574315537045
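
The three measures in this output are related. The sketch below assumes the usual definitions: Resnik is the information content (IC) shared by the two terms, Lin divides twice that value by the sum of the terms' ICs, and JC is reported as a similarity, 1 / (distance + 1). These formulas are an assumption about what the tool computes, not taken from this page, but the numbers above are consistent with them.

```python
def lin(shared_ic, ic_sum):
    """Lin similarity; ic_sum is IC(term1) + IC(term2)."""
    return 2 * shared_ic / ic_sum

def jiang_conrath(shared_ic, ic_sum):
    """Jiang-Conrath distance converted to a similarity via 1 / (distance + 1)."""
    distance = ic_sum - 2 * shared_ic
    return 1 / (distance + 1)

# Values copied from the copper/gold output above (MICA, intrinsic).
resnik_mica = 0.587786664902119
lin_mica = 0.39079549108439265
jc_mica = 0.35303485982596094

# Inverting Lin's formula recovers IC(copper) + IC(gold)
# without knowing each IC separately.
ic_sum = 2 * resnik_mica / lin_mica

# The JC value derived from Resnik and Lin should agree with the reported one.
derived_jc = jiang_conrath(resnik_mica, ic_sum)
print(derived_jc, jc_mica)
```

The same check can be repeated for the DiShIn rows, where only the shared IC changes and the sum of the two terms' ICs stays the same.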

Using the python functions directly (first download metals.owl and metals.txt from this repository, or download metals.db and skip the create_semantic_base call):

>>> import ssmpy
>>> ssmpy.create_semantic_base("metals.owl", "metals.db", "https://raw.githubusercontent.com/lasigeBioTM/ssm/master/metals.owl#", "http://www.w3.org/2000/01/rdf-schema#subClassOf", "metals.txt")
>>> ssmpy.semantic_base("metals.db")
>>> e1 = ssmpy.get_id("copper")
>>> e2 = ssmpy.get_id("gold")
>>> ssmpy.ssm_resnik(e1, e2)

Gene Ontology (GO) and UniProt proteins Example

Download the latest version of the database we created:

wget http://labs.rd.ciencias.ulisboa.pt/dishin/go202104.db.gz
gunzip -N go202104.db.gz
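
Where gunzip is not available, the decompression step can be approximated with the Python standard library. This is a sketch under two assumptions: the archive has already been downloaded, and the target name go.db is given explicitly, because this code does not read the original file name stored in the archive as gunzip -N does. The helper name gunzip_to is hypothetical.

```python
import gzip
import shutil

def gunzip_to(src_gz, dst):
    """Decompress src_gz into dst, streaming so that a large database
    is never loaded into memory at once. The archive is left in place."""
    with gzip.open(src_gz, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout)
    return dst

# Example, assuming go202104.db.gz is in the current folder:
# gunzip_to("go202104.db.gz", "go.db")
```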

Now to calculate the similarity between maltose biosynthetic process and maltose catabolic process execute:

python dishin.py go.db GO_0000023 GO_0000025

Output:

Resnik     DiShIn    intrinsic          3.775439615001474
Resnik     MICA      intrinsic          8.880063901891981
Lin        DiShIn    intrinsic          0.4091891133909429
Lin        MICA      intrinsic          0.9624377146523844
JC         DiShIn    intrinsic          0.08401669887638269
JC         MICA      intrinsic          0.5906161091496418
Resnik     DiShIn    extrinsic          4.315813746201754
Resnik     MICA      extrinsic          10.575802576015931
Lin        DiShIn    extrinsic          0.38793452313030363
Lin        MICA      extrinsic          0.950624649327762
JC         DiShIn    extrinsic          0.06840605034663635
JC         MICA      extrinsic          0.4765053580405049

Now to calculate the similarity between proteins Q12345 and Q12346 execute:

python dishin.py go.db Q12345 Q12346

Output:

Resnik     DiShIn    intrinsic          1.4462923030269426
Resnik     MICA      intrinsic          1.4462923030269426
Lin        DiShIn    intrinsic          0.18745282441602068
Lin        MICA      intrinsic          0.18745282441602068
JC         DiShIn    intrinsic          0.08633506268285998
JC         MICA      intrinsic          0.08633506268285998
Resnik     DiShIn    extrinsic          0.6015115682274214
Resnik     MICA      extrinsic          0.6015115682274214
Lin        DiShIn    extrinsic          0.12201023476842265
Lin        MICA      extrinsic          0.12201023476842265
JC         DiShIn    extrinsic          0.09317326288224918
JC         MICA      extrinsic          0.09317326288224918

To create an updated version of the database, download the ontology and annotations:

wget http://purl.obolibrary.org/obo/go.owl
wget http://geneontology.org/gene-associations/goa_uniprot_all_noiea.gaf.gz
gunzip goa_uniprot_all_noiea.gaf.gz 

And then create the new database:

python dishin.py go.owl go.db http://purl.obolibrary.org/obo/ http://www.w3.org/2000/01/rdf-schema#subClassOf goa_uniprot_all_noiea.gaf

Chemical Entities of Biological Interest (ChEBI) Example

Download the latest version of the database we created:

wget http://labs.rd.ciencias.ulisboa.pt/dishin/chebi202104.db.gz
gunzip -N chebi202104.db.gz

Now to calculate the similarity between aripiprazole and bithionol execute:

python dishin.py chebi.db CHEBI_31236 CHEBI_3131

Output:

Resnik     DiShIn    intrinsic          1.4393842298350599
Resnik     MICA      intrinsic          5.5106315826160674
Lin        DiShIn    intrinsic          0.12935491517581163
Lin        MICA      intrinsic          0.4952307147453835
JC         DiShIn    intrinsic          0.049077257018319796
JC         MICA      intrinsic          0.0817424736051902

To create an updated version of the database, download the ontology:

wget http://purl.obolibrary.org/obo/chebi/chebi_lite.owl

And then create the new database:

python dishin.py chebi_lite.owl chebi.db http://purl.obolibrary.org/obo/ http://www.w3.org/2000/01/rdf-schema#subClassOf ''

Human Phenotype (HP) Example

Download the latest version of the database we created:

wget http://labs.rd.ciencias.ulisboa.pt/dishin/hp202104.db.gz
gunzip -N hp202104.db.gz

Now to calculate the similarity between Optic nerve coloboma and Optic nerve dysplasia execute:

python dishin.py hp.db HP_0000588 HP_0001093

Output:

Resnik     DiShIn    intrinsic          4.593979372426621
Resnik     MICA      intrinsic          6.005278943833842
Lin        DiShIn    intrinsic          0.5118244533189668
Lin        MICA      intrinsic          0.6690601683812312
JC         DiShIn    intrinsic          0.10242304162282165
JC         MICA      intrinsic          0.14407501033681872

To create an updated version of the database, download the ontology:

wget http://purl.obolibrary.org/obo/hp.owl

And then create the new database:

python dishin.py hp.owl hp.db http://purl.obolibrary.org/obo/ http://www.w3.org/2000/01/rdf-schema#subClassOf ''

Human Disease Ontology (HDO) Example

Download the latest version of the database we created:

wget http://labs.rd.ciencias.ulisboa.pt/dishin/doid202104.db.gz
gunzip -N doid202104.db.gz

Now to calculate the similarity between Asthma and Lung cancer execute:

python dishin.py doid.db DOID_2841 DOID_1324

Output:

Resnik     DiShIn    intrinsic          2.3627836143597176
Resnik     MICA      intrinsic          3.791674698804828
Lin        DiShIn    intrinsic          0.4328907089097581
Lin        MICA      intrinsic          0.6946809425735787
JC         DiShIn    intrinsic          0.13906777879867938
JC         MICA      intrinsic          0.2307893214756218

To create an updated version of the database, download the ontology:

wget http://purl.obolibrary.org/obo/doid.owl

And then create the new database:

python dishin.py doid.owl doid.db http://purl.obolibrary.org/obo/ http://www.w3.org/2000/01/rdf-schema#subClassOf ''

Medical Subject Headings (MeSH) Example

Download the latest version of the database we created:

wget http://labs.rd.ciencias.ulisboa.pt/dishin/mesh202104.db.gz
gunzip -N mesh202104.db.gz

Now to calculate the similarity between Malignant Hyperthermia and Fever execute:

python dishin.py mesh.db D008305 D005334

Output:

Resnik     DiShIn    intrinsic          1.2582571367910345
Resnik     MICA      intrinsic          1.2582571367910345
Lin        DiShIn    intrinsic          0.17390901691859173
Lin        MICA      intrinsic          0.17390901691859173
JC         DiShIn    intrinsic          0.07719755683816652
JC         MICA      intrinsic          0.07719755683816652

To create an updated version of the database, download the NT version from ftp://nlmpubs.nlm.nih.gov/online/mesh/rdf/mesh.nt.gz and unzip it:

wget ftp://nlmpubs.nlm.nih.gov/online/mesh/rdf/mesh.nt.gz
gunzip mesh.nt.gz

And then create the new database:

python dishin.py mesh.nt mesh.db http://id.nlm.nih.gov/mesh/ http://id.nlm.nih.gov/mesh/vocab#broaderDescriptor ''
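
To illustrate what the name prefix and relation arguments select, the sketch below pulls (child, parent) pairs out of N-Triples lines using only the standard library. It rests on several assumptions: it handles only statements whose subject, predicate and object are all IRIs, it assumes the prefix is stripped to obtain short names such as D008305, and it is not how DiShIn itself reads the file (the package relies on rdflib). The two sample statements are invented for illustration; CHILD and PARENT are not real MeSH identifiers.

```python
import re

# One simplified N-Triples statement: <subject> <predicate> <object> .
# Statements with literal objects (quoted strings) do not match and are skipped.
TRIPLE = re.compile(r"^<([^>]+)>\s+<([^>]+)>\s+<([^>]+)>\s*\.\s*$")

def hierarchy_edges(lines, name_prefix, relation):
    """Return (child, parent) short names for statements that use `relation`."""
    edges = []
    for line in lines:
        match = TRIPLE.match(line.strip())
        if match is None:
            continue
        subject, predicate, obj = match.groups()
        if (predicate == relation
                and subject.startswith(name_prefix)
                and obj.startswith(name_prefix)):
            # Drop the prefix to get the short names used on the command line.
            edges.append((subject[len(name_prefix):], obj[len(name_prefix):]))
    return edges

PREFIX = "http://id.nlm.nih.gov/mesh/"
RELATION = "http://id.nlm.nih.gov/mesh/vocab#broaderDescriptor"

# Invented statements, for illustration only.
sample = [
    f"<{PREFIX}CHILD> <{RELATION}> <{PREFIX}PARENT> .",
    f'<{PREFIX}CHILD> <http://www.w3.org/2000/01/rdf-schema#label> "child"@en .',
]
print(hierarchy_edges(sample, PREFIX, RELATION))
```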

Radiology Lexicon (RadLex) Example

Download the latest version of the database we created:

wget http://labs.rd.ciencias.ulisboa.pt/dishin/radlex202104.db.gz
gunzip -N radlex202104.db.gz

Now to calculate the similarity between nervous system of right upper limb and nervous system of left upper limb execute:

python dishin.py radlex.db RID16139 RID16140

Output:

Resnik     MICA      intrinsic          9.366531825151093
Lin        MICA      intrinsic          0.9310964912333252
JC         MICA      intrinsic          0.41905978419640516

To create an updated version of the database, download the RDF/XML version from http://bioportal.bioontology.org/ontologies/RADLEX and save it as radlex.rdf.

And then create the new database:

python dishin.py radlex.rdf radlex.db http://radlex.org/RID/ http://www.w3.org/2000/01/rdf-schema#subClassOf '' 

WordNet Example

Download the latest version of the database we created:

wget http://labs.rd.ciencias.ulisboa.pt/dishin/wordnet202104.db.gz
gunzip -N wordnet202104.db.gz

Now to calculate the similarity between the nouns ambulance and motorcycle execute:

python dishin.py wordnet.db ambulance-noun-1 motorcycle-noun-1

Output:

Resnik     MICA      intrinsic          6.331085809208157
Lin        MICA      intrinsic          0.6792379292396559
JC         MICA      intrinsic          0.14327549414725688

To create an updated version of the database, download the ontology:

wget http://www.w3.org/2006/03/wn/wn20/rdf/wordnet-hyponym.rdf

And then create the new database:

python dishin.py wordnet-hyponym.rdf wordnet.db http://www.w3.org/2006/03/wn/wn20/instances/synset- http://www.w3.org/2006/03/wn/wn20/schema/hyponymOf ''

Source Code

  • ssmpy/semanticbase.py : provides a function to produce the semantic-base as a SQLite database

  • ssmpy/ssm.py : provides the functions to calculate semantic similarity based on the SQLite database

  • ssmpy/annotations.py : provides the functions to get the annotations for the given proteins

  • dishin.py : executes the functions according to the input given