pycollatinus

Collatinus Port for Python


License
GPL-2.0-only
Install
pip install pycollatinus==0.1.6

PyCollatinus

PyCollatinus is a port of the well-known Collatinus, developed in France by Y. Ouvrard and P. Verkerk. I translated the code directly from C++, mostly by hand.

PyCollatinus aims to provide a lemmatizer for CLTK, but it can also be used for simpler tasks such as listing all possible lemmas for each token of a sentence.

How to

Install

You can install PyCollatinus using pip: pip install pycollatinus

Use

The analyzer is easy to use:

from pycollatinus import Lemmatiseur
analyzer = Lemmatiseur()
analyzer.lemmatise_multiple("Cogito ergo sum")

This returns one list of analyses per token:

[
    [{'lemma': 'cogo', 'morph': '2ème singulier impératif futur actif', 'form': 'cogito',
      'radical': 'cog', 'desinence': 'ito'},
     {'lemma': 'cogo', 'morph': '3ème singulier impératif futur actif', 'form': 'cogito',
      'radical': 'cog', 'desinence': 'ito'},
     {'lemma': 'cogito', 'morph': '1ère singulier indicatif présent actif', 'form': 'cogito',
      'radical': 'cogit', 'desinence': 'o'},
     {'lemma': 'cogito', 'morph': '1ère singulier indicatif présent actif', 'form': 'cogito',
      'radical': 'cogit', 'desinence': 'o'}],
    [{'lemma': 'ergo', 'morph': '1ère singulier indicatif présent actif', 'form': 'ergo',
      'radical': 'erg', 'desinence': 'o'},
     {'lemma': 'ergo', 'morph': 'positif', 'form': 'ergo',
      'radical': 'ergo', 'desinence': ''},
     {'lemma': 'ergo', 'morph': '-', 'form': 'ergo',
      'radical': 'ergo', 'desinence': ''}],
    [{'lemma': 'sum', 'morph': '1ère singulier indicatif présent actif', 'form': 'sum',
      'radical': 's', 'desinence': 'um'}]
]
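
Since each token yields a list of analyses, a common next step is to reduce that list to its distinct candidate lemmas. The sketch below does this on a trimmed copy of the result above, so it runs without loading any data. The helper name `distinct_lemmas` is hypothetical and not part of PyCollatinus.

```python
def distinct_lemmas(analyses):
    """Return the distinct lemmas of one token, in order of first appearance."""
    seen = []
    for analysis in analyses:
        lemma = analysis["lemma"]
        if lemma not in seen:
            seen.append(lemma)
    return seen


# Trimmed copy of the result above: only the keys used here are kept.
sample = [
    [{"lemma": "cogo", "form": "cogito"},
     {"lemma": "cogo", "form": "cogito"},
     {"lemma": "cogito", "form": "cogito"},
     {"lemma": "cogito", "form": "cogito"}],
    [{"lemma": "ergo", "form": "ergo"},
     {"lemma": "ergo", "form": "ergo"},
     {"lemma": "ergo", "form": "ergo"}],
    [{"lemma": "sum", "form": "sum"}],
]

lemmas_per_token = [distinct_lemmas(token) for token in sample]
print(lemmas_per_token)  # [['cogo', 'cogito'], ['ergo'], ['sum']]
```

The same helper should work on the full dictionaries, since it only reads the 'lemma' key.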

How to make it faster

PyCollatinus has a lot of data to process. We decided not to convert this data, in order to stay as close as possible to the original C++ project: this way we can load any new data that comes our way and help the Collatinus maintainers correct it further.

To avoid a long loading time, you can compile the lemmatizer once and load the compiled version afterwards:

from pycollatinus import Lemmatiseur
analyzer = Lemmatiseur()
analyzer.compile()  # Persists the data

Next time, just do:

from pycollatinus import Lemmatiseur
analyzer = Lemmatiseur.load()
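
The two steps can be folded into a single helper that tries the compiled data first and compiles only when needed. This is a minimal sketch under one loud assumption: that `load()` raises an exception when no compiled data exists yet, which this README does not confirm. The helper name `load_or_compile` is hypothetical; only `Lemmatiseur()`, `compile()` and `load()` come from the examples above.

```python
def load_or_compile(lemmatizer_class):
    """Return a ready lemmatizer, compiling the data on first use.

    `lemmatizer_class` is expected to offer the interface shown above:
    a no-argument constructor, a `compile()` method and a `load()` class method.
    """
    try:
        # Fast path: reuse previously compiled data.
        return lemmatizer_class.load()
    except Exception:
        # ASSUMPTION: load() fails with an exception when nothing was compiled.
        # The exact exception type is unknown, hence the broad catch.
        analyzer = lemmatizer_class()
        analyzer.compile()  # Persists the data for the next run
        return analyzer
```

With PyCollatinus installed, the call would be `analyzer = load_or_compile(Lemmatiseur)`. Once the real exception type is known, narrowing the `except` clause would avoid hiding unrelated errors.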

Performance

On an Intel(R) Core(TM) i3-3120M CPU @ 2.50GHz, LinuxMint 17 3.8.4 (Ubuntu 2015-12-02), Python 3.4.3:

Method                 Average time over 10 calls
From Collatinus data   11.62 s
From compiled data     5.92 s

Script run for these evaluations
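
The original script is not reproduced here. As a rough sketch of how an average over 10 calls could be measured with the standard library, the helper below uses `timeit`; the name `average_seconds` and the stand-in workload are assumptions for illustration, not the code behind the figures above.

```python
import timeit


def average_seconds(func, calls=10):
    """Average wall-clock time, in seconds, of `func` over `calls` calls."""
    total = timeit.timeit(func, number=calls)
    return total / calls


def stand_in_workload():
    # Placeholder for the operation being timed.
    return sum(range(100000))


result = average_seconds(stand_in_workload)
print("{:.6f} s per call".format(result))
```

To compare the two rows of the table, `func` would presumably be `Lemmatiseur` for the first row and `Lemmatiseur.load` for the second.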

Licence

Collatinus is developed and maintained by Yves Ouvrard and Philippe Verkerk. It is made available under the GNU GPL v3 licence.

As such, this software is also released under the GNU GPL v3.