pyxpdf
pyxpdf is a fast, memory-efficient Python module for parsing PDF documents, based on the xpdf reader sources.
Features
- Almost 20 times faster than pure-Python PDF parsers (see Speed Comparison)
- Extract text while preserving the original document layout as closely as possible
- Support almost all PDF encodings, CMaps and predefined CMaps.
- Extract LZW, RLE, CCITTFax, DCT, JBIG2 and JPX compressed images and image masks along with their BBox.
- Render PDF pages as images in the '1', 'L', 'LA', 'RGB', 'RGBA' and 'CMYK' color modes.
- No explicit dependencies (except optional ones, see Installation)
- Thread safe
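
As a rough illustration of how the text-extraction and thread-safety features might be combined, the sketch below extracts text from several PDFs in a thread pool. It assumes that pyxpdf exposes a `Document` class whose `text()` method returns the text of the whole document; check the pyxpdf documentation before relying on that. The helper names `pyxpdf_text` and `extract_many` are hypothetical and not part of pyxpdf. Each worker opens its own document, so the sketch does not depend on sharing one document object between threads.

```python
from concurrent.futures import ThreadPoolExecutor


def pyxpdf_text(path):
    """Return the text of one PDF.

    Assumption: pyxpdf provides Document(path).text(); verify against its docs.
    """
    # Imported lazily so extract_many() can be used with another extractor
    # even when pyxpdf is not installed.
    from pyxpdf import Document
    return Document(path).text()


def extract_many(paths, extract=pyxpdf_text, max_workers=4):
    """Run `extract` on every path in a thread pool and return {path: text}."""
    paths = list(paths)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs each path
        # with the result computed for that same path.
        return dict(zip(paths, pool.map(extract, paths)))
```

Calling `extract_many(["report.pdf", "invoice.pdf"])` (placeholder file names) would return a dictionary mapping each path to its extracted text. Passing a different `extract` callable makes the helper easy to exercise without any PDF files.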
More Information
License
pyxpdf is licensed under the GNU General Public License (GPL), version 3. See the LICENSE file.
Credits
- xpdf reader by Derek Noonburg
- lxml - project structure and build adapted from lxml
- poppler project