Module to resolve intra-document coreference.
pip install corefgraph==1.2.3
CorefGraph is an independent Python module that performs coreference resolution, a Natural Language Processing task which consists of determining the mentions that refer to the same entity in a text or discourse. CorefGraph is a multilingual rule-based system loosely based on the Stanford Multi Sieve Pass system (Lee et al., 2013). It currently supports English and Spanish, but it can be extended to other languages.
If you use corefgraph, please cite this paper:
Rodrigo Agerri, Josu Bermudez and German Rigau (2014): "IXA pipeline: Efficient and Ready to Use Multilingual NLP tools", in: Proceedings of the 9th Language Resources and Evaluation Conference (LREC2014), 26-31 May, 2014, Reykjavik, Iceland.
CorefGraph is being used by the ixa pipes tools which can be used to provide the necessary input.
This work has been partially funded by a PhD Grant of the University of Deusto.
pip can install the module and every dependency in one command:
sudo -H pip install corefgraph
This module may be used to process single files or directories (corpus). CorefGraph takes NAF documents as input. The input NAF documents must contain:
The NAF specification can be found here
We recommend using the tools in the IXA pipeline to obtain the necessary linguistic annotation in an easy and efficient manner.
The simplest way to use this module is:
corefgraph --file your_file.KAF -l en_conll
This command outputs a NAF file containing all the original file's information plus the coreference clusters.
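Once you have the output, the coreference clusters live in the NAF `coreferences` layer. The sketch below shows one way to read them back out with the standard library; the layer and attribute names (`coref`, `span`, `target`) follow the NAF specification, but the sample document is illustrative, not real corefgraph output.

```python
# Sketch: extracting coreference clusters from a NAF document such as the
# one corefgraph produces. Element names follow the NAF specification;
# SAMPLE_NAF below is a hand-made illustrative document, not tool output.
import xml.etree.ElementTree as ET

SAMPLE_NAF = """<NAF xml:lang="en" version="v3">
  <text>
    <wf id="w1" offset="0" length="4">John</wf>
    <wf id="w2" offset="5" length="4">said</wf>
    <wf id="w3" offset="10" length="2">he</wf>
    <wf id="w4" offset="13" length="4">left</wf>
  </text>
  <terms>
    <term id="t1" lemma="John"><span><target id="w1"/></span></term>
    <term id="t2" lemma="say"><span><target id="w2"/></span></term>
    <term id="t3" lemma="he"><span><target id="w3"/></span></term>
    <term id="t4" lemma="leave"><span><target id="w4"/></span></term>
  </terms>
  <coreferences>
    <coref id="co1">
      <span><target id="t1"/></span>
      <span><target id="t3"/></span>
    </coref>
  </coreferences>
</NAF>"""

def coref_clusters(naf_xml):
    """Return {coref_id: [mention_words, ...]} from a NAF string."""
    root = ET.fromstring(naf_xml)
    # Map word-form ids to surface forms via the text layer.
    wf_text = {wf.get("id"): wf.text for wf in root.iter("wf")}
    # Map term ids to the surface words they span.
    term_words = {
        term.get("id"): [wf_text[t.get("id")] for t in term.iter("target")]
        for term in root.iter("term")
    }
    clusters = {}
    for coref in root.iter("coref"):
        mentions = []
        for span in coref.findall("span"):
            words = []
            for target in span.findall("target"):
                words.extend(term_words[target.get("id")])
            mentions.append(words)
        clusters[coref.get("id")] = mentions
    return clusters

print(coref_clusters(SAMPLE_NAF))  # {'co1': [['John'], ['he']]}
```

Each `coref` element is one entity cluster; each of its `span` children is one mention, expressed as a list of term ids.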
The module is usable as a pipe:
cat your_file.naf | corefgraph -l en_conll > output.naf
The system comes with many options; use the --help parameter to see their defaults and possible values.
These options can be passed to the program via a YAML file with the -c parameter:

cat yourfile.naf | corefgraph -c semevalconfig.yaml > output.naf

See ConfigArgParse for the available YAML syntax.
english.yaml
~~~~
language: en_conll
encoding: utf-8
mention_catchers: [NamedEntities, EnumerableCatcher, ConstituentCatcher, PronounCatcher]
mention_filters: [ReplicatedSpanFilter, NamedEntityPartFilter, QuantityFilter, PleonasticFilter, DemonymFilter, InterjectionFilter, PartitiveFilter, BareNPFilter, QuantifierFilter, InvalidWordFilter, InvalidNerFilter, NonWordFilter, SameHeadFilter]
sieves: [SPM, RSM, ESM, PCM, SHMA, SHMB, SHMC, SHMD, RHM, PNM]
writer: NAF
reader: NAF
~~~~
spanish.yaml
~~~~
language: es_semeval
encoding: utf-8
mention_catchers: [NamedEntities, PermissiveEnumerableCatcher, PermissiveConstituentCatcher, PermissivePronounCatcher]
mention_filters: [ReplicatedSpanFilter, NamedEntityPartFilter, QuantityFilter, PleonasticFilter, DemonymFilter, InterjectionFilter, PartitiveFilter, BareNPFilter, QuantifierFilter, InvalidWordFilter, InvalidNerFilter, NonWordFilter, SameHeadFilter]
sieves: [SPM, ESM, RSM, PCM, SHMSNEA, SHMSNEB, SHMSNEC, SHMSNED, RHMSNE_A, PNM]
writer: NAF
reader: NAF
~~~~
Multiple files mode, or corpus mode, can process multiple files concurrently.
corefgraph_corpus --directory /home/KAF_dir --config configfile
The multi-file processor needs two basic parameters: a list of files and/or input directories, plus a list of configuration files. Each list must contain at least one element; otherwise the processing ends with empty results.
You can pass the parameters via a YAML file with the -p parameter:

corefgraph_corpus -p corpus_parameters.yaml
You can control the maximum number of concurrent jobs with the --jobs parameter:

corefgraph_corpus --jobs 4 -p corpus_parameters.yaml
Input files
--file FILES          File to process. May be used multiple times and with
                      the directory parameter.
--directory DIRECTORIES
                      All the files contained in the directory (recursively)
                      are processed. May be used multiple times and with the
                      file parameter.
--extension EXTENSIONS
                      The extensions (without the dot) of the files that must
                      be processed from directories. '*' accepts all
                      extensions. May be used multiple times. WARNING: does
                      not filter files passed with --file.
--result              (Optional) An extension added to the result files. The
                      files are stored next to the original files with the
                      same base name. It is also used in evaluation.
--log_base            The prefix added to the log files, usually a directory.
--speaker_extension   (Optional) If set, the module searches for a file with
                      the same base name plus this extension and uses it as
                      the speaker file. Switched off by default.
--treebank_extension  (Optional) If set, the module searches for a file with
                      the same base name plus this extension and uses it as
                      the treebank file; the parse in the NAF file is then
                      ignored. Switched off by default.
** Processing parameters ** The parameters used to process each file are passed via parameter files.
--config CONFIG       The config files that contain the parameters of each
                      experiment. Use ':' to combine multiple files in one
                      experiment. Repeat the parameter for multiple
                      experiments.
--common COMMON       A common config for all experiments. May be multiple
                      files separated by ':'.
** Evaluation ** The parameters used during evaluation.
--evaluate            Activates the evaluation.
--report              Activates the report system.
--evaluation_script   The full path to the evaluation script.
--metrics             (Optional) When evaluation is on, the evaluation
                      metrics to use.
--gold                The path to the gold corpus.
--gold_ext            The extension of the gold corpus files.
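The corpus options above can be collected into the YAML file passed with -p. A minimal sketch, assuming ConfigArgParse's convention of using the long option names as keys; all paths and values here are hypothetical placeholders:

```yaml
# Hypothetical corpus_parameters.yaml -- keys mirror the long CLI flags
# documented above; every path below is a placeholder.
directory: [/home/user/naf_corpus]
extension: [naf]
config: [english.yaml]
result: coref
jobs: 4
```

Repeatable options (directory, extension, config) are written as lists, matching their "may be used multiple times" behaviour on the command line.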
Make sure you have Python 2.7.1 or higher:
python --version
If you have problems using the --user option, consider upgrading pip:
sudo pip install --upgrade pip
The Python dist-packages directory might be in a different location than:
/usr/local/lib/python2.7/dist-packages/
Josu Bermúdez
DeustoTech
University of Deusto
Bilbao
josu.bermudez at deusto.es
Rodrigo Agerri
IXA NLP Group
University of the Basque Country (UPV/EHU)
E-20018 Donostia-San Sebastián
rodrigo.agerri at ehu.es