citation-parser

A parser for canonical references.


License
GPL-3.0
Install
pip install citation-parser==0.4.1

Documentation

CitationParser

The CitationParser (CiPa) is composed of a lexer, a parser and a tree parser, written in ANTLR and compiled into Python code.

The idea behind CiPa is pretty simple. Canonical citations use punctuation symbols in a consistent way, so we can define a syntax to extract their meaning. Once extracted, the meaning is formalised into JSON as an intermediate representation format.

Given, for example, the citation "Hom. Il. 1, 124 - 125", a human reader knows the following facts:

  • the hyphen is used to specify a range of text passages, from X to Y
  • the character string preceding the numbers contains information about the work and author being cited
  • the semicolon separates one reference from another within the same citation (it is common to chain together references to multiple passages of the same work or of different works)
  • the comma separates the hierarchical levels of the work being cited. In the example above, 1,124-5 stands for "from Book 1, Line 124 to Book 1, Line 125"
  • when the citation scope is a range, identical hierarchical levels are collapsed: 1.124 - 1.125 can be written as 1.124-125 or 1.124 s. without any loss of information for the human reader
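
The collapsing rule in the last point can be illustrated with a minimal sketch. The function name expand_range_end is hypothetical and is not part of CiPa; it only shows how the omitted leading levels of a range end could be restored from the range start. Abbreviated final numbers such as 124-5 are not handled here.

```python
def expand_range_end(start, end):
    """Restore the hierarchical levels omitted from the end of a range.

    `start` and `end` are lists of level values as strings, e.g.
    start=['1', '124'] and end=['125'] for the citation "1.124-125".
    This helper is illustrative only, not part of the CiPa code base.
    """
    if len(end) > len(start):
        raise ValueError("range end has more levels than range start")
    # The end inherits the leading levels of the start that were left implicit.
    missing = len(start) - len(end)
    return start[:missing] + end


print(expand_range_end(['1', '124'], ['125']))      # ['1', '125']
print(expand_range_end(['1', '124'], ['2', '10']))  # ['2', '10']
```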

So, given the input Hom. Il. 1, 124 - 125, the output of the citation parser, expressed in JSON, is: [{'work': u'Hom. Il.', 'scp': {'start': ['1', '124'], 'end': ['1', '125']}}]
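
To make the mapping from input to output concrete, here is a rough sketch that reproduces the structure above for this one example. It uses a regular expression as a stand-in for the ANTLR grammar, so it is not how CiPa itself works. The names CITATION_RE, split_levels and parse_citation are hypothetical, and the treatment of a reference without a range is an assumption.

```python
import json
import re

# Hypothetical pattern covering only simple citations like the example above;
# the real syntax is defined in the ANTLR grammar files.
CITATION_RE = re.compile(
    r"""^\s*
        (?P<work>.*?[^\d\s])                         # author and work string
        \s+
        (?P<start>\d+(?:\s*[,.]\s*\d+)*)             # start of the scope
        (?:\s*-\s*(?P<end>\d+(?:\s*[,.]\s*\d+)*))?   # optional end of a range
        \s*$""",
    re.VERBOSE,
)


def split_levels(text):
    """Split '1, 124' or '1.124' into hierarchical levels: ['1', '124']."""
    return [part.strip() for part in re.split(r"[,.]", text)]


def parse_citation(citation):
    """Return a list with one dict shaped like the output shown above."""
    match = CITATION_RE.match(citation)
    if match is None:
        raise ValueError("unsupported citation: %r" % citation)
    start = split_levels(match.group("start"))
    if match.group("end") is None:
        # Assumption: a single passage is its own range end.
        end = list(start)
    else:
        end = split_levels(match.group("end"))
        # Restore the levels that were collapsed in the range end.
        missing = max(len(start) - len(end), 0)
        end = start[:missing] + end
    return [{"work": match.group("work"), "scp": {"start": start, "end": end}}]


print(json.dumps(parse_citation("Hom. Il. 1, 124 - 125")))
```

Serialising the result with json.dumps gives strict JSON with double quotes, whereas the output quoted above is written in Python literal notation.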

Compile the ANTLR grammar files

From the directory ./citation_parser/antlr/, run:

java -cp ../../lib/antlr-3.1.2.jar org.antlr.Tool -o ~/Downloads/ cp_lexer.g cp_parser.g cp_treeparser.g