pdftobb
General PDF parsing code extracted from my newspaper mining codebase.
This uses pdfminer3k to dump a PDF file as an XML file. Then it collates bounding boxes and outputs a (pandas-style) CSV.
Installation
Via pip
- Initialize a virtual environment
pip install pdftobb
Via git
- Clone this repository, cd to it
- Initialize a virtual environment, activate it
pip install -r requirements.txt
Usage
If installed via pip: pdftobb path/to/pdf.pdf
If installed locally: python path/to/pdftobb.py path/to/pdf.pdf
Running pdftobb on a file file.pdf will generate two files: file.pdf.xml (the output of pdf2txt.py -t xml file.pdf > file.pdf.xml) and file.pdf.csv, which is the XML file turned into a slightly more condensed CSV file.
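The XML-to-CSV step can be pictured with a minimal sketch like the one below. It is not the actual pdftobb code: it assumes the usual pdfminer XML layout (page > textbox > textline > text, each with a bbox="x0,y0,x1,y1" attribute), and the function names and CSV column names (page, x0, y0, x1, y1, text) are hypothetical, chosen only for illustration. The real file.pdf.csv may use different columns.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Tiny hand-written sample shaped like pdfminer's XML output (an assumption,
# not a real dump from pdftobb).
SAMPLE_XML = """<pages>
<page id="1" bbox="0.000,0.000,612.000,792.000" rotate="0">
<textbox id="0" bbox="72.000,700.000,200.000,712.000">
<textline bbox="72.000,700.000,200.000,712.000">
<text font="Times" bbox="72.000,700.000,80.000,712.000" size="12.000">H</text>
<text font="Times" bbox="80.000,700.000,86.000,712.000" size="12.000">i</text>
<text>
</text>
</textline>
</textbox>
</page>
</pages>
"""

# Hypothetical column names; the real CSV may differ.
FIELDS = ["page", "x0", "y0", "x1", "y1", "text"]


def collate_textlines(xml_text):
    """Return one row per <textline>: its page, bounding box, and joined text."""
    root = ET.fromstring(xml_text)
    rows = []
    for page in root.iter("page"):
        for line in page.iter("textline"):
            # bbox is a comma-separated string "x0,y0,x1,y1"
            x0, y0, x1, y1 = (float(v) for v in line.get("bbox").split(","))
            # Each <text> child holds a single character; join them into a string
            text = "".join(t.text or "" for t in line.iter("text")).strip()
            rows.append({"page": int(page.get("id")),
                         "x0": x0, "y0": y0, "x1": x1, "y1": y1,
                         "text": text})
    return rows


def rows_to_csv(rows):
    """Serialize the rows to CSV text with a header line."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


if __name__ == "__main__":
    print(rows_to_csv(collate_textlines(SAMPLE_XML)))
```

A CSV written this way has a header row, so it can be loaded directly with pandas.read_csv.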