minecart: A Pythonic interface to PDF documents
minecart is a Python package that simplifies the extraction of text,
images, and shapes from a PDF document. It provides a very Pythonic
interface to extract positioning, color, and font metadata for all of
the objects in the PDF. It is a pure-Python package (it depends on
pdfminer for the low-level parsing).
inspiration from Tim McNamara’s
slate, but aims to provide more
>>> import minecart
>>> pdffile = open('example.pdf', 'rb')
>>> doc = minecart.Document(pdffile)
>>> page = doc.get_page(3)
>>> for shape in page.shapes.iter_in_bbox((0, 0, 100, 200)):
...     print(shape.path, shape.fill.color.as_rgb())
>>> im = page.images[0].as_pil()  # requires pillow
>>> im.show()
As of version
0.3.0, only Python 3 is supported, using
- The easy way:
pip install minecart
- The hard way: download the source code, change into the working
directory, and run
python setup.py install
For CJK languages: Supporting the CJK languages requires an
additional step, as detailed in
Shapes: You can extract path information, bounding box, stroke
parameters, and stroke/fill colors. Color support is fairly robust,
allowing the simple
.as_rgb() in most cases. (To be concrete,
Indexed colors are supported if they index into one of the above.)
minecart can easily extract images to
Lettering in the source) In addition to extracting plain text from the PDF, you can access the position/bounding box information and the font used.
If there’s a feature you’d like to extract from a PDF that’s not currently supported, open up an issue or submit a pull request! I’m especially interested in hearing whether there are many PDFs using color spaces outside of the ones currently supported.
The main entry point will always be
minecart.Document, which accepts
a single parameter, an open file-like object which will be read to
create the document. The
Document has two primary methods for
accessing its contents:
minecart.Page objects, which provide access to the
graphical elements found on the page.
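As a rough sketch of how these pieces fit together, the helper below opens a PDF, builds a minecart.Document, fetches one page with .get_page(), and collects the fill colors of the shapes inside a bounding box. The function name fill_colors_in_bbox is hypothetical, and the check for a missing fill is a defensive assumption rather than documented behaviour.

```python
def fill_colors_in_bbox(pdf_path, page_number, bbox):
    """Return the RGB fill colors of the shapes inside bbox on one page."""
    # Imported lazily so the helper can be defined without minecart installed.
    import minecart

    colors = []
    # The file stays open while the page is read, since the document
    # is created from the open file object.
    with open(pdf_path, 'rb') as pdffile:
        doc = minecart.Document(pdffile)
        page = doc.get_page(page_number)
        for shape in page.shapes.iter_in_bbox(bbox):
            # Assumption: a shape that was not filled has no fill object.
            fill = getattr(shape, 'fill', None)
            if fill is not None:
                colors.append(fill.color.as_rgb())
    return colors
```

A call might look like fill_colors_in_bbox('example.pdf', 3, (0, 0, 100, 200)), reusing the file name, page number, and bounding box from the example near the top of this README.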
Page objects have three main attributes of interest:
.images: A list of all the
minecart.Image objects found on the page.
.letterings: A list of all the text objects found on the page, as
unicode subclass which adds bounding box and font information (using
.shapes: A list of all the squares, circles, lines, etc. found on the page as
Shape objects have three main attributes of interest:
.stroke: An object containing the stroke parameters used to draw the shape.
.dash attributes. If the shape was not stroked,
.fill: An object containing the fill parameters used to draw the shape. Right now,
.fill only has a
.path: A list with the coordinates used to define the shape, as well as the type of line segment each set of coordinates defines. Refer to the
minecart.Shape documentation for more details.
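To illustrate reading a path, here is a minimal sketch that computes a box around every coordinate listed in a path. It assumes each segment is a tuple whose first element is a segment-type tag and whose remaining elements alternate x, y values; that layout, the tags in sample_path, and the name control_point_bbox are assumptions made for the example, so check the minecart.Shape documentation for the real format.

```python
def control_point_bbox(path):
    """Return (x0, y0, x1, y1) enclosing every coordinate listed in path."""
    xs, ys = [], []
    for segment in path:
        coords = segment[1:]       # drop the segment-type tag
        xs.extend(coords[0::2])    # even positions are assumed to be x values
        ys.extend(coords[1::2])    # odd positions are assumed to be y values
    if not xs:
        raise ValueError("path contains no coordinates")
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical path: a triangle built from a move, two lines, and a close.
sample_path = [('m', 10, 10), ('l', 60, 10), ('l', 35, 50), ('h',)]
print(control_point_bbox(sample_path))  # (10, 10, 60, 50)
```

For curved segments this box covers the control points, so it can be larger than the area the curve actually occupies.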
Note on color: The PDF spec spends a fair amount of time dealing
with color specifications, defining color spaces, and transforms and
minecart's approach is to simplify things down with sensible
defaults, so that every color has an
.as_rgb() method, which returns
a 3-tuple with component values between 0 (black) and 1 (white). If you
are interested in extracting colorspace families and parameters, you can
do that too, though!
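Since .as_rgb() returns components between 0 and 1, a small conversion step is usually needed before passing colors to tools that expect 8-bit values. The helper below is one possible sketch; rgb_to_hex is a hypothetical name and is not part of minecart.

```python
def rgb_to_hex(rgb):
    """Convert an (r, g, b) tuple of values in [0, 1] to a '#rrggbb' string."""
    channels = []
    for component in rgb:
        # Clamp first so slightly out-of-range values cannot overflow a byte.
        clamped = min(1.0, max(0.0, component))
        channels.append(int(round(clamped * 255)))
    return '#{:02x}{:02x}{:02x}'.format(*channels)


print(rgb_to_hex((0, 0, 0)))  # '#000000' (black)
print(rgb_to_hex((1, 1, 1)))  # '#ffffff' (white)
print(rgb_to_hex((1, 0, 0)))  # '#ff0000' (red)
```

With a real shape, the input would come from shape.fill.color.as_rgb().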
We try to keep docstrings complete and up to date, so you can read
through the source or use
help to see what methods are
If you are having trouble working with
minecart, feel free to create
a new issue.
Bug reports are always welcome (using the GitHub tracker), as are feature requests. The PDF spec has so many corners that it is hard to prioritize which features to implement. If there’s something you’d like to extract from a document that isn’t currently supported, please create a new issue.
If you’d like to contribute code, you can either create an issue and include a patch (if the changes are small) or fork the project and create a pull request.
This project is licensed under the MIT license.