pdf-wrangler

PDFMiner Wrapper for extractions


Keywords
pdf, parser, text, mining, python3
License
MIT
Install
pip install pdf-wrangler==0.0.31

Documentation

pdf-wrangler

PDFMiner wrapper used to simplify PDF extraction and other PDF utilities.

Document class

The Document class is used to represent a PDF document. It contains functionality to access the raw text by page, PDF metadata and images in the form of PDFMiner's LTImage object.

Example Usage

from pdf_wrangler import Document

pdf_document = Document('path/to/pdf.pdf')

# to access pdf metadata
pdf_document.get_metadata()

# to access full pdf text
pdf_document.get_text()

# print text by pdf page
for page in pdf_document.pages:
    print(page.get_text())

# to access pdf images by page
page_1_images = pdf_document.pages[0].images

# get first image bytes representation
page_1_images[0].stream.get_data()

Installation

To install, run:

pip install pdf-wrangler