catcoocc

The catcoocc library is designed for the study of co-occurrence association between categorical variables, implementing a number of symmetric and asymmetric measures of association. Given a series of co-occurrence observations, drawn from data such as records, alignments, and presence-absence matrices, it computes dictionaries with the association score between categories, offering methods focused on strength of association, direction of association, or both. It is primarily developed for linguistic research, but can be applied to any kind of data exploration and description based on categorical data. Besides the main methods for numeric computation of measures of association, it includes auxiliary ones for dealing with relational data, n-grams from sequences, alignments, and binary presence/absence matrices.

Background

A measure of association is a factor or coefficient used to quantify the relationship between two or more variables. Various measures exist to determine the strength and nature of such associations, the most common being measures of correlation, a term which, in a sense stricter than association, refers to linear correlation. Among the most common measures are Pearson's rho coefficient of product-moment correlation for continuous values, Spearman's rho coefficient for measuring the strength of monotonic relationships between ordinal or ranked variables, and the Chi-square measure of association between categorical values. Each measure is usually suited to investigating either strength (such as Pearson's rho) or significance (such as Chi-square), and most are symmetric, meaning that, when measuring the relationship between series X and series Y, the association between any x and y value is equal to that between y and x.

While symmetric measures are the natural choice for numeric variables, analyses of categorical variables can in most cases benefit from asymmetric measures, such as the fraction of variability in x that is explainable by variations in y (Pearson, 2016). This property can be easily demonstrated by modifying the example that Zychlinski (2018) gives when introducing his dython library:

	X	Y
Observation 1	A	c
Observation 2	A	d
Observation 3	A	c
Observation 4	B	g
Observation 5	B	g
Observation 6	B	f

In this example, the categorical value of y cannot be determined with full certainty given x, but x can be determined with certainty from y. In a symmetric version of Maximum-Likelihood Estimation (MLE), which just divides the number of cases by the total number of observations (i.e., Cxy/Cx and Cxy/Cy, where C is the overall count), the XY and YX tables are transposed versions of each other:

X given Y	`A`	`B`
`c`	0.75	0.75
`d`	0.00	0.00
`f`	0.00	0.00
`g`	0.75	0.75

Y given X	`c`	`d`	`f`	`g`
`A`	0.75	0.00	0.00	0.75
`B`	0.75	0.00	0.00	0.75

With the same MLE scorer, asymmetric tables capture the difference in information: if we know y in this simple dataset, we can predict x with certainty, but knowing x does not let us predict y with certainty.

X given Y	`A`	`B`
`c`	1.00	0.00
`d`	1.00	0.00
`f`	0.00	1.00
`g`	0.00	1.00

Y given X	`c`	`d`	`f`	`g`
`A`	0.67	0.33	0.00	0.00
`B`	0.00	0.00	0.33	0.67
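The asymmetric tables above can be reproduced with a few lines of plain Python. The sketch below does not use catcoocc itself; it only counts pairs and divides by the count of the conditioning value (Cxy/Cy for X given Y, Cxy/Cx for Y given X). The helper name `conditional_table` is made up for this illustration and is not part of the library.

```python
from collections import Counter

# The six observations from the example table, as (x, y) pairs.
observations = [
    ("A", "c"), ("A", "d"), ("A", "c"),
    ("B", "g"), ("B", "g"), ("B", "f"),
]

def conditional_table(pairs):
    """Return {(x, y): (P(x|y), P(y|x))} for every combination of observed values.

    Hypothetical helper for illustration only, not a catcoocc function.
    """
    pair_counts = Counter(pairs)
    x_counts = Counter(x for x, _ in pairs)
    y_counts = Counter(y for _, y in pairs)
    table = {}
    for x in sorted(x_counts):
        for y in sorted(y_counts):
            c_xy = pair_counts[(x, y)]  # Counter returns 0 for unseen pairs
            table[(x, y)] = (c_xy / y_counts[y], c_xy / x_counts[x])
    return table

probs = conditional_table(observations)
for (x, y), (x_given_y, y_given_x) in probs.items():
    print(f"{x} {y}  P(x|y)={x_given_y:.2f}  P(y|x)={y_given_x:.2f}")
```

Each value is a plain conditional relative frequency, which is why the two directions differ: every y value occurs with a single x, while each x value occurs with two different y values.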

The most popular measures of categorical association are the aforementioned Chi-square and Cramér's V, the latter defined as the square root of a normalized chi-square value. Both are symmetric. Among the best-known asymmetric measures are Theil's U and Goodman and Kruskal's tau. The former is particularly useful for domains of the humanities such as linguistic research, as it is ultimately based on the conditional entropy between x and y, that is, how many possible states of y are observed given x and how often they occur.
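To make the link between conditional entropy and Theil's U concrete, here is a minimal sketch using the textbook definition U(X|Y) = (H(X) - H(X|Y)) / H(X), applied to the whole toy dataset from the earlier table. It is an illustration in plain Python under that assumption, not the library's implementation (catcoocc reports scores per pair of categories); the function names are chosen for this example only.

```python
import math
from collections import Counter

# The six (x, y) observations from the example table.
observations = [
    ("A", "c"), ("A", "d"), ("A", "c"),
    ("B", "g"), ("B", "g"), ("B", "f"),
]

def entropy(values):
    """Shannon entropy (in nats) of a list of categorical values."""
    total = len(values)
    return -sum((c / total) * math.log(c / total) for c in Counter(values).values())

def conditional_entropy(target, given):
    """H(target | given): entropy of target inside each group of `given`,
    weighted by the relative size of the group."""
    total = len(target)
    groups = {}
    for t, g in zip(target, given):
        groups.setdefault(g, []).append(t)
    return sum(len(ts) / total * entropy(ts) for ts in groups.values())

def theil_u(target, given):
    """Textbook uncertainty coefficient U(target | given)."""
    h_target = entropy(target)
    if h_target == 0:
        return 1.0  # a constant target is trivially predictable
    return (h_target - conditional_entropy(target, given)) / h_target

xs = [x for x, _ in observations]
ys = [y for _, y in observations]
u_x_given_y = theil_u(xs, ys)
u_y_given_x = theil_u(ys, xs)
print(f"U(X|Y) = {u_x_given_y:.4f}")
print(f"U(Y|X) = {u_y_given_x:.4f}")
```

On this dataset U(X|Y) is 1, because y fully determines x, while U(Y|X) is only about 0.52, because knowing x still leaves two possible values of y. A symmetric measure cannot express this difference.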

The following scorers are implemented:

Maximum-Likelihood Estimation
Pointwise Mutual Information
Normalized Pointwise Mutual Information
Chi-square (over both 2x2 and 3x2 contingency tables)
Cramér's V (over both 2x2 and 3x2 contingency tables)
Fisher Exact Odds Ratio (over unconditional MLE)
Theil's U ("uncertainty score")
Conditional Entropy
A new scorer, tresoldi, for the study of linguistic alignment (combining information from MLE and PMI)
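As a rough illustration of two of the scorers listed above, the sketch below computes pointwise mutual information and its normalized variant from their textbook definitions, PMI(x, y) = log(p(x, y) / (p(x) p(y))) and NPMI = PMI / -log p(x, y), on the toy dataset from the Background section. This is an assumption-laden sketch in plain Python: the function `pmi` here is written for this example and need not match how catcoocc computes or smooths its scores.

```python
import math
from collections import Counter

# The six (x, y) observations from the Background example.
observations = [
    ("A", "c"), ("A", "d"), ("A", "c"),
    ("B", "g"), ("B", "g"), ("B", "f"),
]

def pmi(pairs, x, y, normalized=False):
    """Textbook (N)PMI for one (x, y) pair, in nats.

    Returns None when the pair never co-occurs, since log(0) is undefined
    without some form of smoothing. Illustrative helper, not catcoocc's API.
    """
    total = len(pairs)
    c_xy = Counter(pairs)[(x, y)]
    if c_xy == 0:
        return None
    p_xy = c_xy / total
    p_x = sum(1 for a, _ in pairs if a == x) / total
    p_y = sum(1 for _, b in pairs if b == y) / total
    score = math.log(p_xy / (p_x * p_y))
    if normalized:
        if p_xy == 1.0:
            return 1.0  # limit value for complete co-occurrence
        score /= -math.log(p_xy)
    return score

pmi_ac = pmi(observations, "A", "c")
npmi_ac = pmi(observations, "A", "c", normalized=True)
print(f"PMI(A, c)  = {pmi_ac:.4f}")
print(f"NPMI(A, c) = {npmi_ac:.4f}")
```

Normalization bounds the score to the range from -1 to 1, which makes pairs with different frequencies easier to compare.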

The library also offers functions for scaling scores to user-determined ranges using different methods (minmax, mean, and stdev), as well as functions for plotting heatmaps of the scores. When the same dataset as above is plotted with the tresoldi scorer, positive numbers in the heatmaps indicate co-occurrence and negative numbers indicate no co-occurrence (the larger the number, the higher the degree of confidence).

Installation and usage

The library can be installed as any standard Python library with pip:

pip install catcoocc

Detailed instructions on how to use the library can be found in the official documentation.

A showcase example with a subset of the mushroom dataset is shown here:

import tabulate
import catcoocc
from catcoocc.scorer import CatScorer

# Read the sequences and collect the co-occurrences between their categories
mushroom_data = catcoocc.read_sequences("resources/mushroom-small.tsv")
mushroom_cooccs = catcoocc.collect_cooccs(mushroom_data)

# Build a scorer over the collected co-occurrences
scorer = CatScorer(mushroom_cooccs)

mle = scorer.mle()
pmi = scorer.pmi()
npmi = scorer.pmi(True)
chi2 = scorer.chi2()
chi2_ns = scorer.chi2(False)
cramersv = scorer.cramers_v()
cramersv_ns = scorer.cramers_v(False)
fisher = scorer.fisher()
theil_u = scorer.theil_u()
cond_entropy = scorer.cond_entropy()
tresoldi = scorer.tresoldi()

headers = [
    'pair',
    'mle_xy',          'mle_yx',
    'pmi_xy',          'pmi_yx',
    'npmi_xy',         'npmi_yx',
    'chi2_xy',         'chi2_yx',
    'chi2ns_xy',       'chi2ns_yx',
    'cremersv_xy',     'cremersv_yx',
    'cremersvns_xy',   'cremersvns_yx',
    'fisher_xy',       'fisher_yx',
    'theilu_xy',       'theilu_yx',
    'cond_entropy_xy', 'cond_entropy_yx',
    'tresoldi_xy',     'tresoldi_yx'
]

table = []
for pair in sorted(scorer.obs):
    buf = [
        pair,
        "%0.4f" % mle[pair][0],          "%0.4f" % mle[pair][1],
        "%0.4f" % pmi[pair][0],          "%0.4f" % pmi[pair][1],
        "%0.4f" % npmi[pair][0],         "%0.4f" % npmi[pair][1],
        "%0.4f" % chi2[pair][0],         "%0.4f" % chi2[pair][1],
        "%0.4f" % chi2_ns[pair][0],      "%0.4f" % chi2_ns[pair][1],
        "%0.4f" % cramersv[pair][0],     "%0.4f" % cramersv[pair][1],
        "%0.4f" % cramersv_ns[pair][0],  "%0.4f" % cramersv_ns[pair][1],
        "%0.4f" % fisher[pair][0],       "%0.4f" % fisher[pair][1],
        "%0.4f" % theil_u[pair][0],      "%0.4f" % theil_u[pair][1],
        "%0.4f" % cond_entropy[pair][0], "%0.4f" % cond_entropy[pair][1],
        "%0.4f" % tresoldi[pair][0],     "%0.4f" % tresoldi[pair][1],
    ]
    table.append(buf)


# 'github' is tabulate's Markdown-style table format
print(tabulate.tabulate(table, headers=headers, tablefmt='github'))

Which will output:

pair	mle_xy	mle_yx	pmi_xy	pmi_yx	npmi_xy	npmi_yx	chi2_xy	chi2_yx	chi2ns_xy	chi2ns_yx	cremersv_xy	cremersv_yx	cremersvns_xy	cremersvns_yx	fisher_xy	fisher_yx	theilu_xy	theilu_yx	cond_entropy_xy	cond_entropy_yx	tresoldi_xy	tresoldi_yx
('edible', 'bell')	0.3846	1	0.4308	0.4308	0.3107	0.3107	1.8315	1.8315	3.5897	3.5897	0.2027	0.2027	0.1987	0.1987	inf	inf	0	1	1.119	0	0.5956	1
('edible', 'convex')	0.4615	0.4615	-0.3424	-0.3424	-0.2844	-0.2844	3.6735	3.6735	5.7988	5.7988	0.3719	0.3719	0.3101	0.3101	0	0	0.2147	0.3071	0.7273	0.4486	-0.5615	-0.5615
('edible', 'flat')	0.0769	1	0.4308	0.4308	0.1438	0.1438	0.1041	0.1041	0.5668	0.5668	0	0	0	0	inf	inf	0	1	1.119	0	0.4596	1
('edible', 'sunken')	0.0769	1	0.4308	0.4308	0.1438	0.1438	0.1041	0.1041	0.5668	0.5668	0	0	0	0	inf	inf	0	1	1.119	0	0.4596	1
('poisonous', 'bell')	0	0	-3.5553	-3.5553	-0.5934	-0.5934	1.8315	1.8315	3.5897	3.5897	0.2027	0.2027	0.1987	0.1987	0	0	1	1	0	0	-3.5553	-3.5553
('poisonous', 'convex')	1	0.5385	0.4308	0.4308	0.4103	0.4103	3.6735	3.6735	5.7988	5.7988	0.3719	0.3719	0.3101	0.3101	inf	inf	1	0	0	0.6902	1	0.6779
('poisonous', 'flat')	0	0	-1.9459	-1.9459	-0.3248	-0.3248	0.1041	0.1041	0.5668	0.5668	0	0	0	0	0	0	1	1	0	0	-1.9459	-1.9459
('poisonous', 'sunken')	0	0	-1.9459	-1.9459	-0.3248	-0.3248	0.1041	0.1041	0.5668	0.5668	0	0	0	0	0	0	1	1	0	0	-1.9459	-1.9459
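Since each scorer result is indexed as `scores[pair][0]` (xy) and `scores[pair][1]` (yx) in the example above, ranking pairs by a score only needs the standard library. The sketch below hard-codes the tresoldi columns from the output table so that it runs without the dataset; `rank_pairs` is a hypothetical helper written for this illustration, not a catcoocc function.

```python
# tresoldi_xy and tresoldi_yx values copied from the output table above,
# stored in the same {pair: (xy, yx)} shape that the example indexes.
tresoldi = {
    ("edible", "bell"): (0.5956, 1.0),
    ("edible", "convex"): (-0.5615, -0.5615),
    ("edible", "flat"): (0.4596, 1.0),
    ("edible", "sunken"): (0.4596, 1.0),
    ("poisonous", "bell"): (-3.5553, -3.5553),
    ("poisonous", "convex"): (1.0, 0.6779),
    ("poisonous", "flat"): (-1.9459, -1.9459),
    ("poisonous", "sunken"): (-1.9459, -1.9459),
}

def rank_pairs(scores, direction=0):
    """Return the pairs sorted from highest to lowest score.

    direction=0 ranks by the xy score, direction=1 by the yx score.
    Ties keep their original order because Python's sort is stable.
    """
    return sorted(scores, key=lambda pair: scores[pair][direction], reverse=True)

ranked_xy = rank_pairs(tresoldi, direction=0)
for pair in ranked_xy:
    print(pair, f"{tresoldi[pair][0]:+.4f}")
```

Ranked by the xy score, ('poisonous', 'convex') comes first and ('poisonous', 'bell') last, matching the positive and negative extremes of the tresoldi_xy column.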

Changelog

Version 0.2.2:

Added function for inverting a scorer

Version 0.2.1:

Added basic functions for double series correlation

Similar Projects

https://github.com/pafoster/pyitlib

Griffith, Daniel M.; Veech, Joseph A.; and Marsh, Charles J. (2016) cooccur: Probabilistic Species Co-Occurrence Analysis in R. Journal of Statistical Software (69). doi: 10.18637/jss.v069.c02

https://cran.r-project.org/web/packages/GoodmanKruskal/vignettes/GoodmanKruskal.html

Community guidelines

While the author can be contacted directly for support, it is recommended that third parties use GitHub standard features, such as issues and pull requests, to contribute, report problems, or seek support.

Author and citation

The library is developed by Tiago Tresoldi (tresoldi@shh.mpg.de).

The author has received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (grant agreement No. ERC Grant #715618, Computer-Assisted Language Comparison).

If you use catcoocc, please cite it as:

Tresoldi, Tiago (2020). catcoocc, a library for symmetric and asymmetric analysis of categorical co-occurrences. Version 0.1. Jena. Available at: https://github.com/tresoldi/catcoocc

In BibTeX:

@misc{Tresoldi2020catcoocc,
  author = {Tresoldi, Tiago},
  title = {catcoocc, a library for symmetric and asymmetric analysis of categorical co-occurrences. Version 0.1.},
  howpublished = {\url{https://github.com/tresoldi/catcoocc}},
  address = {Jena},
  year = {2020},
}
