A library for processing text data

cophi is a Python library for handling, modeling and processing text corpora. You can process a whole collection of text files with a single call to the high-level API:

import cophi

# Read every .txt file below the directory, lowercase it and tokenize it.
corpus, metadata = cophi.corpus(directory="british-fiction-corpus",
                                filepath_pattern="**/*.txt",
                                encoding="utf-8",
                                lowercase=True,
                                token_pattern=r"\p{L}+\p{P}?\p{L}+")
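
To make the effect of the token_pattern above more concrete, here is a minimal sketch of the same idea using only the standard library. It is not cophi's tokenizer: the names LETTER, TOKEN and tokenize are hypothetical, and because the built-in re module does not support \p{L} and \p{P}, the classes [^\W\d_] (roughly "any letter") and [^\w\s] (roughly "any punctuation mark or symbol") are assumed as approximations.

```python
import re

# Assumption: rough stand-ins for \p{L} and \p{P}, which `re` lacks.
LETTER = r"[^\W\d_]"
TOKEN = re.compile(rf"{LETTER}+[^\w\s]?{LETTER}+")


def tokenize(text, lowercase=True):
    """Return all tokens: letters, an optional inner mark, then letters."""
    if lowercase:
        text = text.lower()
    return TOKEN.findall(text)


print(tokenize("Don't panic: it is a well-known story."))
# → ["don't", 'panic', 'it', 'is', 'well-known', 'story']
```

Because the pattern needs at least two letters, one-letter words such as "a" are dropped, while a single inner mark as in "don't" or "well-known" is kept inside the token.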

You can also plug the DARIAH-DKPro-Wrapper into this pipeline to lemmatize text or to keep only certain word types.

Check out the introductory Jupyter notebook.

Getting started

To install the latest stable version:

$ pip install cophi

To install the latest development version:

$ pip install --upgrade git+https://github.com/cophi-wue/cophi-toolbox.git@testing

Available complexity measures

cophi also provides a range of complexity metrics for measuring the lexical richness of (literary) texts.

Measures that use sample size and vocabulary size:

Type-Token Ratio TTR
Guiraud’s R
Herdan’s C
Dugast’s k
Maas’ a²
Dugast’s U
Tuldava’s LN
Brunet’s W
Carroll’s CTTR
Summer’s S
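
As an illustration of this group, the sketch below computes four of the measures from their textbook definitions, where N is the sample size (number of tokens) and V the vocabulary size (number of types). It is a standalone example, not cophi's implementation; the function name size_based_measures and the dictionary keys are hypothetical.

```python
import math


def size_based_measures(tokens):
    """Compute a few richness measures from N (tokens) and V (types)."""
    n = len(tokens)       # sample size N
    v = len(set(tokens))  # vocabulary size V
    if n < 2:
        # log(N) is zero or undefined for fewer than two tokens.
        raise ValueError("need at least two tokens")
    return {
        "ttr": v / n,                           # Type-Token Ratio: V / N
        "guiraud_r": v / math.sqrt(n),          # Guiraud's R: V / sqrt(N)
        "herdan_c": math.log(v) / math.log(n),  # Herdan's C: log V / log N
        "carroll_cttr": v / math.sqrt(2 * n),   # Carroll's CTTR: V / sqrt(2N)
    }


tokens = "a b c d e f g h a b c d e f g h".split()  # N = 16, V = 8
for name, value in size_based_measures(tokens).items():
    print(f"{name}: {value:.3f}")
```

All four depend only on N and V, so two texts with the same number of tokens and types get identical scores no matter how the frequencies are distributed.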

Measures that use part of the frequency spectrum:

Honoré’s H
Sichel’s S
Michéa’s M
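
These three measures look at the low end of the frequency spectrum, that is, at V1 (types occurring once, hapax legomena) and V2 (types occurring twice, dis legomena). The following sketch uses the commonly cited definitions and is not taken from cophi's source; frequency_spectrum and spectrum_part_measures are hypothetical names.

```python
import math
from collections import Counter


def frequency_spectrum(tokens):
    """Map each frequency i to the number of types occurring exactly i times."""
    type_frequencies = Counter(tokens)
    return Counter(type_frequencies.values())


def spectrum_part_measures(tokens):
    """Compute Honoré's H, Sichel's S and Michéa's M from V, V1 and V2."""
    n = len(tokens)
    spectrum = frequency_spectrum(tokens)
    v = sum(spectrum.values())  # vocabulary size V
    v1 = spectrum[1]            # hapax legomena
    v2 = spectrum[2]            # dis legomena
    if v1 == v or v2 == 0:
        # H divides by (1 - V1/V), M divides by V2.
        raise ValueError("measures are undefined for this sample")
    return {
        "honore_h": 100 * math.log(n) / (1 - v1 / v),
        "sichel_s": v2 / v,
        "michea_m": v / v2,
    }


tokens = "the cat sat on the mat and the dog sat too".split()
print(spectrum_part_measures(tokens))
```

In the example sentence, only "sat" occurs exactly twice, so V2 = 1 and V = 8, which gives S = 0.125 and M = 8.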

Measures that use the whole frequency spectrum:

Entropy S
Yule’s K
Simpson’s D
Herdan’s V_m
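
Measures in this group sum over every frequency class i with V(i, N) types. The sketch below shows three of them using their standard definitions; it is an illustration, not cophi's code. The function name whole_spectrum_measures is hypothetical, and the natural logarithm for the entropy is an assumption (another base only rescales the value).

```python
import math
from collections import Counter


def whole_spectrum_measures(tokens):
    """Compute entropy, Yule's K and Simpson's D from the frequency spectrum."""
    n = len(tokens)
    if n < 2:
        # Simpson's D divides by N - 1.
        raise ValueError("need at least two tokens")
    # spectrum[i] = number of types that occur exactly i times, V(i, N)
    spectrum = Counter(Counter(tokens).values())
    entropy = -sum(v_i * (i / n) * math.log(i / n)
                   for i, v_i in spectrum.items())
    yule_k = 10_000 * (sum(v_i * i ** 2 for i, v_i in spectrum.items()) - n) / n ** 2
    simpson_d = sum(v_i * (i / n) * ((i - 1) / (n - 1))
                    for i, v_i in spectrum.items())
    return {"entropy": entropy, "yule_k": yule_k, "simpson_d": simpson_d}


tokens = "the cat sat on the mat and the dog sat too".split()
for name, value in whole_spectrum_measures(tokens).items():
    print(f"{name}: {value:.4f}")
```

Yule's K and Simpson's D both grow when a few types dominate the text, so higher values indicate a less varied vocabulary, whereas higher entropy indicates a more even distribution.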

Parameters of probabilistic models:

Orlov’s Z

cophi
Release 1.3.2
