pyxpdf

Powerful and Pythonic PDF processing library based on xpdf-4.02


Keywords
pdf, parser, converter, text, mining, xpdf, bindings, cython, pdf-converter, pdf-parser, pdfparser, pdftohtml, pdftopng, pdftotext, python, xpdf-reader
License
GPL-2.0+
Install
pip install pyxpdf==0.2.3

pyxpdf is a fast, memory-efficient Python module for parsing PDF documents, based on the xpdf reader sources.

Features

  • Almost 20 times faster than pure-Python PDF parsers (see Speed Comparison)
  • Extract text while preserving the original document layout as closely as possible
  • Support almost all PDF encodings, CMaps and predefined CMaps.
  • Extract LZW, RLE, CCITTFax, DCT, JBIG2 and JPX compressed images and image masks along with their BBox.
  • Render PDF pages as images, with support for the '1', 'L', 'LA', 'RGB', 'RGBA' and 'CMYK' color modes.
  • No explicit dependencies (except optional ones, see Installation)
  • Thread Safe
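
As a rough illustration of the text-extraction and thread-safety points above, here is a minimal sketch. It assumes that `pyxpdf.Document` can be constructed from a file path, that iterating a document yields pages, and that each page has a `text()` method; check these against the library documentation before relying on them. The helper names `page_texts`, `extract_many` and `main` are hypothetical and not part of pyxpdf.

```python
from concurrent.futures import ThreadPoolExecutor


def page_texts(doc):
    """Return the text of every page of an opened document.

    Assumes `doc` is iterable over pages and each page has a `text()`
    method (an assumption about pyxpdf's Document and Page objects).
    """
    return [page.text() for page in doc]


def extract_many(paths, open_document, max_workers=4):
    """Extract page texts from several PDFs using a pool of threads.

    `open_document` is any callable that turns a path into a document,
    for example `pyxpdf.Document`. It is passed in as an argument so
    this helper does not import pyxpdf itself.
    """
    paths = list(paths)

    def work(path):
        return page_texts(open_document(path))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with paths.
        return dict(zip(paths, pool.map(work, paths)))


def main(argv):
    """Print the page count of each PDF path given in `argv`."""
    # Imported lazily so the helpers above work without pyxpdf installed.
    from pyxpdf import Document  # requires `pip install pyxpdf`

    for path, pages in extract_many(argv, Document).items():
        print(path, len(pages), "pages")
```

Calling `main(["a.pdf", "b.pdf"])` with real file paths would process both files concurrently. Passing the document opener in as a parameter keeps the threading logic independent of the library, so it can be exercised with a stand-in object.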

More Information

License

pyxpdf is licensed under the GNU General Public License (GPL), version 2 or later. See the LICENSE file.

Credits