leaf-focus

Extract structured text from PDF files.


Keywords
data-science, machine-learning, parser, pdf, utility
License
Apache-2.0
Install
pip install leaf-focus==0.6.2

Documentation

Install

Install from PyPI using pip:

pip install leaf-focus

Download the Xpdf command line tools and extract the executable files.

Pass the directory containing the executable files to leaf-focus with the --exe-dir option.
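
Before running leaf-focus, it can help to confirm that the extracted directory really contains the Xpdf executables. The following is a minimal sketch, not part of leaf-focus: the helper name `missing_xpdf_tools` is hypothetical, and the tool list (`pdfinfo`, `pdftotext`, `pdftopng`) is an assumption based on programs that ship with Xpdf, since this document does not say which of them leaf-focus calls.

```python
import sys
from pathlib import Path

# Assumption: these are programs shipped with the Xpdf command line tools.
# The exact set that leaf-focus needs is not listed in this document.
ASSUMED_TOOLS = ("pdfinfo", "pdftotext", "pdftopng")


def missing_xpdf_tools(exe_dir, tools=ASSUMED_TOOLS):
    """Return the names of the tools that are not present as files in exe_dir."""
    directory = Path(exe_dir)
    # Windows builds of Xpdf use an .exe suffix; other platforms use none.
    suffix = ".exe" if sys.platform.startswith("win") else ""
    return [name for name in tools if not (directory / (name + suffix)).is_file()]
```

An empty list means every assumed tool was found; any names returned point to a wrong or incomplete directory.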

Usage

usage: leaf-focus [-h] [--version] --exe-dir EXE_DIR [--page-images] [--ocr]
                  [--first FIRST] [--last LAST]
                  [--log-level {debug,info,warning,error,critical}]
                  input_pdf output_dir

Extract structured text from a pdf file.

positional arguments:
  input_pdf             path to the pdf file to read
  output_dir            path to the directory to save the extracted text files

optional arguments:
  -h, --help            show this help message and exit
  --version             show program's version number and exit
  --exe-dir EXE_DIR     path to the directory containing xpdf executable files
  --page-images         save each page of the pdf as a separate image
  --ocr                 run optical character recognition on each page of the
                        pdf
  --first FIRST         the first pdf page to process
  --last LAST           the last pdf page to process
  --log-level {debug,info,warning,error,critical}
                        the log level: debug, info, warning, error, critical

Examples

# Extract the pdf information and embedded text.
leaf-focus --exe-dir [path-to-xpdf-exe-dir] file.pdf file-pages

# Extract the pdf information, embedded text, an image of each page, and optical character recognition (OCR) results for each page.
leaf-focus --exe-dir [path-to-xpdf-exe-dir] file.pdf file-pages --ocr
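
The same command line can be assembled from a script, for example to process only a range of pages. This is a sketch under stated assumptions: `build_command` and `run_leaf_focus` are hypothetical helpers, not part of leaf-focus, and they use only the options shown in the usage text above. The script assumes the `leaf-focus` executable is on the PATH.

```python
import subprocess


def build_command(exe_dir, input_pdf, output_dir, first=None, last=None,
                  page_images=False, ocr=False, log_level=None):
    """Build the leaf-focus argument list from the options in the usage text."""
    command = ["leaf-focus", "--exe-dir", str(exe_dir)]
    if page_images:
        command.append("--page-images")
    if ocr:
        command.append("--ocr")
    if first is not None:
        command += ["--first", str(first)]
    if last is not None:
        command += ["--last", str(last)]
    if log_level is not None:
        command += ["--log-level", log_level]
    # The two positional arguments come last, as in the usage text.
    command += [str(input_pdf), str(output_dir)]
    return command


def run_leaf_focus(**options):
    """Run leaf-focus and raise CalledProcessError on a non-zero exit code."""
    return subprocess.run(build_command(**options), check=True)
```

For instance, `build_command("xpdf/bin", "file.pdf", "file-pages", first=2, last=5)` produces the arguments for a run limited by `--first 2 --last 5`; the paths here are placeholders.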
