The docuscospacy package contains a set of functions to facilitate the processing of tagged corpora using:
- en_docusco_spacy -- a spaCy model trained on the CLAWS7 tagset and DocuScope; and
- tmtoolkit -- a set of tools for text mining and topic modeling
Documentation for docuscospacy is available at docuscospacy.readthedocs.org, and the source code is hosted on GitHub at github.com/browndw/docuscospacy.
docuscospacy works with Python 3.8 or newer (tested up to Python 3.10). It also requires spacy >= 3.3.
The recommended way of installing docuscospacy is to:
- create and activate a Python Virtual Environment ("venv")
- install spacy and tmtoolkit with a recommended set of dependencies
- download the en_docusco_spacy model
- install docuscospacy
pip install docuscospacy
The docuscospacy package supports the post-tagging generation of:
- Tagged token frequency tables
- Tag frequency tables
- Ngram/ntag tables
- Collocation tables around a node word and tag
- Document term matrices for tags
- Keyword comparisons against a reference corpus
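As an illustration of what a keyword comparison against a reference corpus involves, here is a minimal sketch of the log-likelihood (G2) keyness statistic in plain Python. It is not the package's API: the function name and arguments are hypothetical, and whether docuscospacy computes keyness with exactly this formula is an assumption.

```python
import math

def log_likelihood(freq_target, freq_ref, total_target, total_ref):
    """Log-likelihood (G2) for one token (hypothetical helper, not the package API).

    freq_target / freq_ref: raw counts of the token in each corpus.
    total_target / total_ref: total token counts of each corpus.
    """
    combined = freq_target + freq_ref
    # Expected counts if the token were equally frequent in both corpora.
    expected_target = total_target * combined / (total_target + total_ref)
    expected_ref = total_ref * combined / (total_target + total_ref)
    ll = 0.0
    # A zero count contributes nothing to the sum (and log(0) is undefined).
    if freq_target > 0:
        ll += freq_target * math.log(freq_target / expected_target)
    if freq_ref > 0:
        ll += freq_ref * math.log(freq_ref / expected_ref)
    return 2 * ll

# Invented counts: 30 hits in a 1,000-token target, 10 in a 1,000-token reference.
score = log_likelihood(30, 10, 1000, 1000)
# Identical relative frequencies should give a score of zero.
baseline = log_likelihood(10, 10, 1000, 1000)
```

Larger scores indicate a bigger difference in relative frequency between the two corpora; the statistic itself does not say which corpus the token favors, so the relative frequencies need to be compared as well.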
Outputs can be controlled either by part-of-speech tag or by DocuScope tag. Thus "can" as a noun and "can" as a verb, for example, are counted as distinct items.
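The disambiguation can be pictured as counting (token, tag) pairs rather than bare tokens. The sketch below uses only the standard library; the data is invented, the tags follow CLAWS7 conventions (NN1 for a singular noun, VM for a modal verb), and the helper name is hypothetical rather than part of docuscospacy.

```python
from collections import Counter

# Invented tagged tokens as (token, part-of-speech tag) pairs.
tagged_tokens = [
    ("the", "AT"), ("can", "NN1"), ("can", "VM"), ("leak", "VVI"),
    ("but", "CCB"), ("we", "PPIS2"), ("can", "VM"), ("fix", "VVI"),
    ("it", "PPH1"),
]

def tagged_frequencies(pairs):
    """Count each (lowercased token, tag) pair separately (hypothetical helper)."""
    return Counter((token.lower(), tag) for token, tag in pairs)

freqs = tagged_frequencies(tagged_tokens)
noun_count = freqs[("can", "NN1")]  # occurrences of "can" as a noun
verb_count = freqs[("can", "VM")]   # occurrences of "can" as a modal verb
```

Because the tag is part of the key, the two uses of "can" never collapse into a single row.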
Additionally, tagged multi-token sequences are aggregated for analysis. For example, where "in spite of" is tagged as a single sequence, its three words are combined into one token.
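A rough sketch of this kind of aggregation follows. It assumes each token carries an IOB-style marker ("B" begins a tagged sequence, "I" continues it), similar to how spaCy marks entity spans. The data and the function name are hypothetical, and the package's actual aggregation logic may differ.

```python
def merge_sequences(tokens):
    """Merge tokens marked as continuing a sequence into the preceding token.

    tokens: list of (text, tag, iob) triples, where iob is "B" or "I".
    Returns a list of (text, tag) pairs with multi-token sequences joined.
    """
    merged = []
    for text, tag, iob in tokens:
        # Continue the previous item only when it shares the same tag.
        if iob == "I" and merged and merged[-1][1] == tag:
            prev_text, prev_tag = merged[-1]
            merged[-1] = (prev_text + " " + text, prev_tag)
        else:
            merged.append((text, tag))
    return merged

# Invented example: "In spite of" is tagged as one prepositional sequence.
tokens = [
    ("In", "II", "B"), ("spite", "II", "I"), ("of", "II", "I"),
    ("the", "AT", "B"), ("rain", "NN1", "B"),
]
result = merge_sequences(tokens)
```

After merging, "In spite of" is a single item, so it contributes one row to a frequency table instead of three.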
The package can also generate KWIC (keyword in context) tables, which locate a node word in a center column with context columns on either side.
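To make the KWIC layout concrete, here is a minimal plain-Python sketch that returns (left context, node, right context) rows. The function name and the `width` parameter are hypothetical and do not reflect the package's own interface.

```python
def kwic(tokens, node, width=3):
    """Return (left context, node, right context) rows for each match of node.

    tokens: list of token strings; node: word to locate (case-insensitive);
    width: number of context tokens to keep on each side.
    """
    rows = []
    target = node.lower()
    for i, tok in enumerate(tokens):
        if tok.lower() == target:
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            rows.append((left, tok, right))
    return rows

# Invented sentence, split on whitespace for simplicity.
text = "You can open the can if you can find it".split()
rows = kwic(text, "can", width=2)

# Print the rows with the node word aligned in a center column.
for left, center, right in rows:
    print(f"{left:>10} | {center} | {right}")
```

Each row keeps the node word in the middle, so patterns in the surrounding context are easy to scan down the columns.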
The package has the following limitations:
- the model that this package is designed for has been trained only on English
- all data must reside in memory; streaming large data from disk (which Gensim, for example, supports) is not available
The code is licensed under the Apache License 2.0. See the LICENSE file.