verbaendeliste-bundestag

Parse the lobby list (Verbändeliste) of the German Bundestag after converting it from PDF to XML.


License
MIT

Install
pip install verbaendeliste-bundestag==0.1.0

Documentation

Verbaendeliste-Bundestag Extractor

Use pdftohtml to convert the PDF into an XML file:

pdftohtml -xml input.pdf output.xml

Then run the extractor with the first and last relevant page numbers as arguments. It reads the XML from standard input and writes the parsed JSON to standard output:

python extract_lobby.py 4 690 < output.xml > lobbylist.json
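
To illustrate what the two page-number arguments select, here is a minimal sketch that reads pdftohtml XML and keeps only the text on a given page range. It is not the code of extract_lobby.py. The function name texts_in_page_range is hypothetical, and the sketch assumes the usual pdftohtml -xml layout, where each <page> element carries a number attribute and contains <text> elements.

```python
import xml.etree.ElementTree as ET


def texts_in_page_range(xml_source, first_page, last_page):
    """Yield (page_number, text) for each non-empty <text> element
    on pages first_page..last_page (inclusive).

    Hypothetical helper: assumes <page number="N"> elements that
    contain <text> children, as written by `pdftohtml -xml`.
    """
    root = ET.parse(xml_source).getroot()
    for page in root.iter("page"):
        number = int(page.get("number"))
        if first_page <= number <= last_page:
            for text in page.iter("text"):
                # itertext() also collects text nested in <b> or <i> tags
                content = "".join(text.itertext()).strip()
                if content:
                    yield number, content
```

Calling texts_in_page_range(sys.stdin, 4, 690) after importing sys would walk the same page range as the command above. Turning those text lines into structured entries is the job of the actual extractor.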

An extracted JSON version (15 June 2012) is available.
