standoffconverter

Converter from XML to standoff and back.


License: MIT
Install: pip install standoffconverter==0.8.11
Documentation: https://standoffconverter.readthedocs.io

Interactive Demo

An interactive demo of the basic functionality of the project can be found here:
so.davidlassner.com
The code for this demo can be found at examples/wysiwyg.py

Simple use case

I intended this package to be used in the following situation: Given a collection of TEI files, I would like to add new annotations (for example with an ML method). The workflow would include the following steps:

  1. create a standoff representation of the lxml tree
from standoffconverter import Standoff, View
so = Standoff(some_xml_tree)
  2. create a view of the standoff data that works well for NLP methods, for example by converting <lb> into \n or collapsing runs of whitespace into a single space
view = (
    View(so)
        .shrink_whitespace()
        .insert_tag_text("{http://www.tei-c.org/ns/1.0}lb", "\n")
)

The resulting text can be retrieved by

plain = view.get_plain()

Note that the view also keeps a lookup table that links each character position in plain to its original position in so.table.

  3. pass the resulting plain text into an NLP pipeline and retrieve results on character level (for example named entities):
for ent in nlp(plain).ents:
    break  # stop at the first entity; ent is used in the next step
  4. use the lookups to annotate the original lxml tree
start_ind = view.get_table_pos(ent.start_char)
end_ind = view.get_table_pos(ent.end_char)

so.add_inline(
    begin=start_ind,
    end=end_ind,
    tag="entity",
)
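
The role of the lookup table can be illustrated without the library. What follows is a minimal pure-Python sketch, not standoffconverter's implementation: the function name shrink_whitespace_with_lookup and the sample string are made up for illustration, and a regular expression stands in for the NLP pipeline. It shows how character offsets found in a normalised text can be mapped back to offsets in the original text.

```python
import re


def shrink_whitespace_with_lookup(text):
    """Collapse each whitespace run to one space; return (plain, lookup).

    lookup[i] is the index in `text` of the character that produced plain[i].
    (Hypothetical helper, for illustration only.)
    """
    plain_chars = []
    lookup = []
    previous_was_space = False
    for original_pos, char in enumerate(text):
        if char.isspace():
            if previous_was_space:
                continue  # drop the rest of the whitespace run
            char = " "
            previous_was_space = True
        else:
            previous_was_space = False
        plain_chars.append(char)
        lookup.append(original_pos)
    return "".join(plain_chars), lookup


original = "Dear   Mr.\n\n  Darcy,  hello"
plain, lookup = shrink_whitespace_with_lookup(original)

# Stand-in for an NLP result: a span found in the normalised text.
match = re.search(r"Mr\. Darcy", plain)
start_in_original = lookup[match.start()]
# match.end() is exclusive, so map the last character and add one.
end_in_original = lookup[match.end() - 1] + 1

print(repr(plain))
print(repr(original[start_in_original:end_in_original]))
```

The span recovered from the original text still contains the line breaks and extra spaces that the normalised text dropped, which is why annotations have to be placed via the lookup rather than with the offsets from the plain text directly.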

Examples

Find more examples here

Documentation

https://standoffconverter.readthedocs.io