Work with the PARC dataset in Python


Keywords
NLP natural language processing computational linguistics Penn Attribution Relation Corpus PARC PARC3
License
MIT
Install
pip install parc-reader==0.1.5

Documentation

Parc-reader documentation

.. py:module:: corenlp_xml_reader

Purpose

parc_reader provides a Python API that simplifies working with the annotated files in PARC3, while also incorporating parallel annotations from CoreNLP (if available).

PARC3 consists of files that have been annotated for attribution relations, each typically consisting of three spans of tokens: a source span (the speaker), a content span (what is being quoted), and a cue span (usually a verb phrase, indicating the act of speech or expression).

When loaded into a ParcCorenlpReader or ParcAnnotatedText object, the PARC data consists of sentences containing tokens and attributions, whose properties can be accessed as if they were simple Python lists and dicts.
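
As a rough illustration of that shape, the sketch below models one attribution with plain dicts and lists. The helpers `make_tokens` and `span_text` and the id `example_attribution_0` are hypothetical and not part of parc_reader; real tokens carry many more fields than 'word'.

```python
# Hypothetical stand-in for one attribution: three spans, each a list of
# token dicts. Only the 'word' field is modelled here.
def make_tokens(text):
    return [{'word': word} for word in text.split()]


attribution = {
    'id': 'example_attribution_0',
    'source': make_tokens('Pierre Vinken'),
    'cue': make_tokens('said'),
    'content': make_tokens('he will join the board'),
}


def span_text(attribution, role):
    """Join the words of one span ('source', 'cue', or 'content')."""
    return ' '.join(token['word'] for token in attribution[role])


for role in ('source', 'cue', 'content'):
    print('%s: %s' % (role, span_text(attribution, role)))
```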

Install

Basic install: pip install parc-reader

Hackable install:

git clone https://github.com/enewe101/parc-reader.git
cd parc-reader
python setup.py develop

ParcCorenlpReader vs. ParcAnnotatedText

If you just want to work with PARC3 files (and you don't have parallel CoreNLP annotations), then you will want to use the ParcAnnotatedText class. It exposes the API for working with PARC-only data.

If you also have parallel CoreNLP annotations, and you want to be able to access both information from PARC and CoreNLP annotations, then you want to use ParcCorenlpReader.

ParcCorenlpReader Examples

Instances of ParcCorenlpReader monkey-patch contents of the corresponding CorenlpAnnotatedText object built from the parallel CoreNLP annotation of a PARC article. So the ParcCorenlpReader feels just like a CorenlpAnnotatedText that has been augmented with PARC annotations.

Before continuing, be familiar with the API for CorenlpAnnotatedText objects, all of which is satisfied by the ParcCorenlpReader.

To begin working with PARC / CoreNLP data, make an instance of the ParcCorenlpReader. Supply its constructor with three strings: the CoreNLP XML, the PARC XML, and the raw article text (which carries the paragraph break information):

>>> from parc_reader import ParcCorenlpReader as P
>>>
>>> parc_xml = open('data/example-parc.xml').read()
>>> corenlp_xml = open('data/example-corenlp.xml').read()
>>> raw_text = open('data/example-raw.txt').read()
>>>
>>> article = P(corenlp_xml, parc_xml, raw_text)

(Note that both parc_xml and raw_text are optional. Usually you will want to provide both, but if, for example, there is no PARC XML for the data you're loading, you can use this class to create it. More on that below.)

You can follow along using the same example data which ships with this git repo. If you installed using pip, you can just download the example data a-la-carte.

The first thing to notice is that, in addition to having a sentences list, the reader also creates a list of paragraphs. Each paragraph is itself, unsurprisingly, a list of sentences. Sentences know which paragraph they're in too.

>>> type(article.paragraphs)
<type 'list'>
>>> print 'this article has %d paragraphs' % len(article.paragraphs)
this article has 17 paragraphs
>>>
>>> second_paragraph = article.paragraphs[1]
>>> first_sent_second_para = second_paragraph[0]
>>> print ' '.join([
...     token['word'] for token in first_sent_second_para['tokens']
... ])
Not only is development of the new company 's initial machine tied directly to Mr. Cray , so is its balance sheet .
>>> first_sent_second_para['paragraph_idx']
1
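
The relationship between the flat sentence list and the paragraph list can be pictured with a small stand-alone sketch. It uses hypothetical sentence dicts that carry only 'id' and 'paragraph_idx'; the function `group_into_paragraphs` is an illustration, not how parc_reader builds its paragraphs.

```python
# Hypothetical sentences; only the fields used below are modelled.
sentences = [
    {'id': 0, 'paragraph_idx': 0},
    {'id': 1, 'paragraph_idx': 1},
    {'id': 2, 'paragraph_idx': 1},
    {'id': 3, 'paragraph_idx': 2},
]


def group_into_paragraphs(sentences):
    """Rebuild a list of paragraphs (lists of sentences) from paragraph_idx."""
    paragraphs = []
    for sentence in sentences:
        idx = sentence['paragraph_idx']
        # Grow the outer list until the sentence's paragraph exists.
        while len(paragraphs) <= idx:
            paragraphs.append([])
        paragraphs[idx].append(sentence)
    return paragraphs


paragraphs = group_into_paragraphs(sentences)
print([len(paragraph) for paragraph in paragraphs])  # [1, 2, 1]
```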

Sentences accessed this way (or by indexing into the article.sentences list) have all of the usual features that they do in the corenlp_xml_reader, plus attributions. A sentence's attributions property is a dictionary of attribution objects, with the keys being the PARC3 attribution relation ids. Let's have a look at the second sentence of the second paragraph, which has an attribution in it:

>>> sentence = second_paragraph[1]
>>> sentence.keys()
['tokens', 'entities', 'attributions', 'references', 'mentions', 'root', 'id', 'paragraph_idx']
>>> len(sentence['attributions'])
1

Attributions have as properties an 'id', as well as 'source', 'cue', and 'content' spans:

>>> attribution = sentence['attributions'].values()[0]
>>> attribution.keys()
['content', 'source', 'cue', 'id']
>>>
>>> print attribution['id']
wsj_0018_PDTB_annotation_level.xml_set_0

The text spans in attributions are just lists of tokens -- the same kind of token as is found in corenlp_xml_reader. Be warned that, while every attribution is guaranteed to have a non-empty 'cue', the 'source' is sometimes empty. Beyond the features they have in corenlp_xml_reader, tokens know whether they are in an attribution and which role (which span) they play in it, and they retain a reference back to the attribution itself. So it is possible both to get all the tokens in a given attribution span and to check whether a given token is part of an attribution.

>>> source_tokens = attribution['source']
>>> print ' '.join([token['word'] for token in source_tokens])
Documents filed with the Securities and Exchange Commission on the pending spinoff
>>>
>>> securities = source_tokens[4]
>>> securities.keys()
['attribution', 'word', 'character_offset_begin', 'character_offset_end', 'pos', 'children', 'lemma', 'sentence_id', 'entity_idx', 'speaker', 'mention', 'parents', 'role', 'ner', 'id']
>>> print securities['role']
source
>>> attribution == securities['attribution']
True
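
Going the other way, from tokens to roles, might look like the following sketch. It runs on hypothetical token dicts that mimic the 'role' and 'attribution' keys; it assumes that tokens outside any attribution hold None for both keys, which this README does not confirm. `make_token` and `words_by_role` are made-up helpers.

```python
# Hypothetical attribution and tokens mimicking the 'role' and 'attribution'
# keys. Assumption: unattributed tokens carry None for both keys.
attribution = {'id': 'example_attribution_0',
               'source': [], 'cue': [], 'content': []}


def make_token(word, role=None):
    token = {'word': word, 'role': role, 'attribution': None}
    if role is not None:
        # Link in both directions: token -> attribution and span -> token.
        token['attribution'] = attribution
        attribution[role].append(token)
    return token


sentence_tokens = [
    make_token('Analysts', 'source'),
    make_token('say', 'cue'),
    make_token('prices', 'content'),
    make_token('fell', 'content'),
    make_token('.'),
]


def words_by_role(tokens):
    """Map each role to the words that play it, skipping unattributed tokens."""
    grouped = {}
    for token in tokens:
        if token['role'] is not None:
            grouped.setdefault(token['role'], []).append(token['word'])
    return grouped


print(words_by_role(sentence_tokens))
```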

Be careful not to confuse the token property 'speaker' with the 'source' of attributions: 'speaker' is inherited from CoreNLP and is unrelated to PARC attributions. It's best to ignore 'speaker'!

There is also a global attributions dict if you just want to iterate over all attributions in the file.

>>> len(article.attributions)
18
>>> print '\n'.join(article.attributions.keys())
wsj_0018_PDTB_annotation_level.xml_set_5
wsj_0018_Attribution_relation_level.xml_set_3
wsj_0018_PDTB_annotation_level.xml_set_7
wsj_0018_PDTB_annotation_level.xml_set_6
wsj_0018_Attribution_relation_level.xml_set_6
wsj_0018_PDTB_annotation_level.xml_set_0
wsj_0018_PDTB_annotation_level.xml_set_3
wsj_0018_PDTB_annotation_level.xml_set_2
wsj_0018_Attribution_relation_level.xml_set_8
wsj_0018_Attribution_relation_level.xml_set_5
wsj_0018_PDTB_annotation_level.xml_set_8
wsj_0018_Attribution_relation_level.xml_set_2
wsj_0018_Attribution_relation_level.xml_set_1
wsj_0018_Attribution_relation_level.xml_set_4
wsj_0018_Nested_relation_level.xml_set_10
wsj_0018_PDTB_annotation_level.xml_set_1
wsj_0018_Attribution_relation_level.xml_set_9
wsj_0018_Attribution_relation_level.xml_set_7
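
The ids in this listing appear to follow the layout `<article>_<level>.xml_set_<n>`. That layout is inferred from the listing alone, not from a documented guarantee. Under that assumption, a sketch like the one below could tally attributions by the level named in their id; `annotation_level` is a hypothetical helper.

```python
import re
from collections import Counter

# A few ids copied from the listing above.
attribution_ids = [
    'wsj_0018_PDTB_annotation_level.xml_set_5',
    'wsj_0018_Attribution_relation_level.xml_set_3',
    'wsj_0018_PDTB_annotation_level.xml_set_0',
    'wsj_0018_Nested_relation_level.xml_set_10',
]

# Assumed id layout: <article>_<level>.xml_set_<n>.
ID_PATTERN = re.compile(r'^(wsj_\d+)_(.+)\.xml_set_(\d+)$')


def annotation_level(attribution_id):
    """Extract the level part of an id, or fail loudly on another layout."""
    match = ID_PATTERN.match(attribution_id)
    if match is None:
        raise ValueError('unexpected id format: %r' % attribution_id)
    return match.group(2)


level_counts = Counter(annotation_level(i) for i in attribution_ids)
for level, count in sorted(level_counts.items()):
    print(level, count)
```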

Pronoun interpolation in attributions

Oftentimes a source will contain a pronoun, like "he", "she", or "they". These can be automatically substituted with a more informative sequence of tokens found using CoreNLP's coreference resolution:

>>> article.attributions['some-attribution-id'].interpolate_source_pronouns()

Doing this will find the "representative" mention corresponding to any pronouns in the attribution's source, and will use it to replace the pronouns. It will have a few effects, aside from just replacing the pronouns in the attribution's 'source' token list. It also replaces the pronouns in the sentence's token list, and it grafts the replacement into the dependency tree as well. So this brings about a relatively full substitution.

One important side effect, though, is that the token 'id's in the interpolated sentence will no longer be consecutive, nor unique.
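
To see why this matters, consider the sketch below. It uses a hypothetical token list in which two replacement tokens share one id; the exact ids the reader assigns after interpolation are an assumption here. Any lookup table keyed on 'id' silently loses tokens, so addressing tokens by list position is the safer choice.

```python
# Hypothetical token list after a pronoun was replaced by two tokens.
# Assumption for illustration: both replacements share the same id.
tokens = [
    {'id': 0, 'word': 'Analysts'},
    {'id': 1, 'word': 'said'},
    {'id': 2, 'word': 'Mr.'},
    {'id': 2, 'word': 'Cray'},
    {'id': 3, 'word': 'resigned'},
]


def has_unique_consecutive_ids(tokens):
    """True if ids run first, first+1, ... with no gaps or repeats."""
    ids = [token['id'] for token in tokens]
    if not ids:
        return True
    return ids == list(range(ids[0], ids[0] + len(ids)))


# Keying on 'id' keeps only the last token seen for each id.
by_id = {token['id']: token for token in tokens}

# List positions stay unique no matter what the ids look like.
by_position = dict(enumerate(tokens))

print(has_unique_consecutive_ids(tokens))
print(len(tokens), len(by_id), len(by_position))
```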

Creating New Attributions

As mentioned above, it is possible to create a ParcCorenlpReader without loading any parc_xml (for example, if none exists for the given article). This can be useful if you want to programmatically add annotation information to existing CoreNLP annotations. To do that, simply create a ParcCorenlpReader instance without supplying anything for the parc_xml argument.

You can also add additional annotations even if you've loaded parc_xml. Just make a ParcCorenlpReader as usual, and use the commands shown below.

To make a new attribution, use the function add_attribution. Supply the source, cue, and content token lists as parameters. The tokens supplied should be actual tokens from the article itself. Suppose we have the following sentence, and we want to mark the attribution that occurs in it:

>>> article.sentences[0]
    Sentence 0:
         0: Pierre (0,6) NNP PERSON
         1: Vinken (7,13) NNP PERSON
         2: , (14,15) , -
         3: 61 (16,18) CD DURATION
         4: years (19,24) NNS DURATION
         5: old (25,28) JJ DURATION
         6: , (29,30) , -
         7: said (31,35) VB -
         8: he (36,38) PRP -
         9: will (31,35) MD -
        10: join (36,40) VB -
        11: the (41,44) DT -
        12: board (45,50) NN -
        14: as (51,53) IN -
        15: a (54,55) DT -
        16: nonexecutive (56,68) JJ -
        17: director (69,77) NN -
        18: Nov. (78,82) NNP DATE
        19: 29 (83,85) CD DATE
        20: . (86,87) . -

We collect the tokens involved in different parts of the attribution, and use them to create a new attribution:

>>> source = article.sentences[0]['tokens'][0:2]
>>> cue = article.sentences[0]['tokens'][7:8]
>>> content = article.sentences[0]['tokens'][8:20]
>>>
>>> attribution = article.add_attribution(
...     cue_tokens=cue,
...     content_tokens=content,
...     source_tokens=source,
...     id_formatter='my_attribution_'
... )

References to the new attribution will automatically be created in the global attributions dictionary, in the sentence(s) involved, and in the tokens involved in the attribution. It also adds role information to the tokens. In other words, the result is exactly as if the attribution were read from a parc_xml file:

>>> article.attributions.keys()
['my_attribution_0']
>>> article.sentences[0]['attributions'].keys()
['my_attribution_0']
>>>
>>> article.sentences[0]['tokens'][0]['role']
'source'
>>>
>>> article.sentences[0]['tokens'][0]['attribution']
{'my_attribution_0': {
    'id': 'my_attribution_0',
    'source': [
         0: Pierre (0,6) NNP PERSON,
         1: Vinken (7,13) NNP PERSON],
    'cue': [7: said (31,35) VB -],
    'content': [
         8: he (36,38) PRP -,
         9: will (31,35) MD -,
        10: join (36,40) VB -,
        11: the (41,44) DT -,
        12: board (45,50) NN -,
        14: as (51,53) IN -,
        15: a (54,55) DT -,
        16: nonexecutive (56,68) JJ -,
        17: director (69,77) NN -,
        18: Nov. (78,82) NNP DATE,
        19: 29 (83,85) CD DATE]
    }
}

The call signature for add_attribution is:

add_attribution(
        cue_tokens=[],
        content_tokens=[],
        source_tokens=[],
        attribution_id=None,
        id_formatter=''
)

All of the arguments to add_attribution are optional, meaning that you can create an empty attribution and fill it later (described below). Every attribution must be given a unique id. You can either supply the id via the attribution_id parameter, or you can simply supply an id_formatter which is a prefix that gets an incrementing integer added onto it to create a unique id. If the id_formatter contains a '%d' then this will be replaced by the integer so you can have arbitrarily formatted ids. If you supply neither an attribution_id nor an id_formatter, then the id will simply be an integer (as a string).
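
The id rules just described can be sketched in a few lines. This is an illustration of the described behaviour only, not the library's actual code: `format_attribution_id` and `next_unused_id` are hypothetical names, and the strategy of counting up from 0 until an unused id is found is an assumption.

```python
def format_attribution_id(counter, id_formatter=''):
    """Illustrative only: build an id from a formatter and a running integer."""
    if '%d' in id_formatter:
        # The integer replaces the '%d' placeholder.
        return id_formatter % counter
    # Otherwise the formatter acts as a prefix; empty gives a bare integer.
    return id_formatter + str(counter)


def next_unused_id(existing_ids, id_formatter=''):
    """Assumed strategy: count up from 0 until the id is not taken."""
    counter = 0
    while format_attribution_id(counter, id_formatter) in existing_ids:
        counter += 1
    return format_attribution_id(counter, id_formatter)


print(format_attribution_id(0, 'my_attribution_'))
print(format_attribution_id(7, 'attr-%d-manual'))
print(format_attribution_id(3))
print(next_unused_id({'my_attribution_0'}, 'my_attribution_'))
```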

You can also make an empty attribution, and then fill in tokens for given roles afterwards. The following has the exact same effect as the previous example:

>>> source = article.sentences[0]['tokens'][0:2]
>>> cue = article.sentences[0]['tokens'][7:8]
>>> content = article.sentences[0]['tokens'][8:20]
>>>
>>> attribution = article.add_attribution(id_formatter='my_attribution_')
>>>
>>> article.add_to_attribution(attribution, 'source', source)
>>> article.add_to_attribution(attribution, 'cue', cue)
>>> article.add_to_attribution(attribution, 'content', content)

Note that it isn't necessary to supply tokens for each role. For example you could just supply token(s) for the 'cue' role, or indeed leave the attribution completely empty.

Note that tokens can only be part of one attribution. The ParcCorenlpReader doesn't support nested or overlapping attributions!

Trying to create an attribution using an attribution_id that's already in use, or trying to create an attribution involving token(s) that are already part of another attribution will cause a ValueError to be raised.
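
If you would rather check for these conditions up front than catch the ValueError, a pre-check along the following lines is one option. It is a sketch on hypothetical token dicts. `find_conflicts` is not part of parc_reader, and it assumes that an unattributed token either lacks the 'attribution' key or holds None for it.

```python
def find_conflicts(existing_ids, candidate_id, candidate_tokens):
    """Return human-readable reasons the candidate would be rejected."""
    problems = []
    if candidate_id in existing_ids:
        problems.append('id already in use: %s' % candidate_id)
    for token in candidate_tokens:
        # Assumption: a free token has no 'attribution' key or a None value.
        if token.get('attribution') is not None:
            problems.append('token already attributed: %s' % token['word'])
    return problems


# Hypothetical tokens: one free, one already claimed by another attribution.
free_token = {'word': 'said', 'attribution': None}
taken_token = {'word': 'he', 'attribution': {'id': 'my_attribution_0'}}

print(find_conflicts({'my_attribution_0'}, 'my_attribution_1', [free_token]))
print(find_conflicts({'my_attribution_0'}, 'my_attribution_0', [taken_token]))
```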

Finally, you can delete attributions by supplying the attribution_id. All references to the attribution throughout the data structure will be cleaned up.

>>> 'my_attribution_0' in article.attributions
True
>>>
>>> article.remove_attribution('my_attribution_0')
>>>
>>> 'my_attribution_0' in article.attributions
False

Saving Parc Files to Disk

You can obtain an XML serialization of a ParcCorenlpReader, in the XML format used by the PARC3 dataset, and then save it to disk as follows:

>>> xml_string = article.get_parc_xml(indent='  ')
>>> open('my-parc-file.xml', 'w').write(xml_string)
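
In a script, you may prefer a variant that closes the file explicitly and fixes the encoding. The sketch below assumes the serialization is a text string and that UTF-8 is acceptable. `save_xml` is a hypothetical helper, and the XML shown is a placeholder rather than real PARC3 markup.

```python
import io
import os
import tempfile


def save_xml(xml_string, path):
    """Write the text to path as UTF-8, closing the file even on errors."""
    with io.open(path, 'w', encoding='utf-8') as xml_file:
        xml_file.write(xml_string)


# Placeholder for the string returned by get_parc_xml (not real PARC3 markup).
xml_string = u'<root>\n  <placeholder/>\n</root>\n'

path = os.path.join(tempfile.mkdtemp(), 'my-parc-file.xml')
save_xml(xml_string, path)

# Read the file back to confirm the round trip.
with io.open(path, encoding='utf-8') as xml_file:
    round_trip = xml_file.read()
print(round_trip == xml_string)
```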