Renard-Incunabula-MuPDF-mutool

Retrieve PDF image and text data via MuPDF's mutool


Keywords
api, cpan, library, mupdf, pdf, perl5
License
Artistic-1.0-Perl

Documentation

NAME

Renard::API::MuPDF::mutool - Retrieve PDF image and text data via MuPDF's mutool

VERSION

version 0.006

FUNCTIONS

_call_mutool

_call_mutool( @args )

Helper function which calls mutool with the contents of the @args array.

Returns the captured STDOUT of the call.

This function dies if mutool exits unsuccessfully.
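The contract described above (run a command, return its STDOUT, die on failure) can be sketched with core Perl only. This is a minimal illustration, not the module's actual implementation: the name run_and_capture is hypothetical, and the sketch assumes that "unsuccessfully exits" means a non-zero exit status.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a helper like _call_mutool (not the module's code):
# runs a command, returns its STDOUT, and dies if the command fails.
sub run_and_capture {
    my (@cmd) = @_;

    # List-form piped open avoids passing the arguments through a shell.
    open( my $fh, '-|', @cmd )
        or die "Could not start '$cmd[0]': $!";

    # Slurp everything the command wrote to STDOUT.
    my $stdout = do { local $/; <$fh> };
    $stdout = '' unless defined $stdout;

    # close() on a piped handle waits for the child and sets $?.
    close($fh)
        or die "Command '@cmd' failed (exit status " . ( $? >> 8 ) . ")";

    return $stdout;
}

# Demonstration using the running perl binary in place of mutool.
my $out = run_and_capture( $^X, '-e', 'print "ok"' );
print "captured: $out\n";
```

Using the list form of open keeps file names containing spaces or shell metacharacters intact, which matters when the arguments include arbitrary PDF paths.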

get_mutool_pdf_page_as_png

get_mutool_pdf_page_as_png($pdf_filename, $pdf_page_no)

This function returns a PNG stream that renders page number $pdf_page_no of the PDF file $pdf_filename.
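Since the return value is a raw PNG stream, a caller may want to sanity-check it before saving it. The sketch below is a hypothetical helper (looks_like_png is not part of this module) that only tests for the standard 8-byte PNG signature.

```perl
use strict;
use warnings;

# The fixed 8-byte signature that starts every PNG stream.
my $PNG_SIGNATURE = "\x89PNG\r\n\x1a\n";

# Hypothetical helper: true if $bytes begins with the PNG signature.
sub looks_like_png {
    my ($bytes) = @_;
    return defined $bytes && substr( $bytes, 0, 8 ) eq $PNG_SIGNATURE;
}

# Demonstration with hand-made data instead of real mutool output.
my $fake_png = $PNG_SIGNATURE . 'remaining chunk data';
print looks_like_png($fake_png)  ? "PNG\n" : "not PNG\n";
print looks_like_png('%PDF-1.7') ? "PNG\n" : "not PNG\n";
```

When writing such a stream to disk, open the file handle in binary mode (for example with the '>:raw' layer) so that the bytes are not altered.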

get_mutool_text_stext_raw

get_mutool_text_stext_raw($pdf_filename, $pdf_page_no)

This function returns an XML string that contains structured text from page number $pdf_page_no of the PDF file $pdf_filename.

The XML format is defined by the output of mutool and looks like this (for page 23 of the pdf_reference_1-7.pdf file):

<?xml version="1.0"?>
<document name="(null)">
  <page height="666" width="531">
    <block bbox="261.18 616.16397 269.77766 625.2532">
      <line bbox="261.18 616.16397 269.77766 625.2532" dir="1 0" wmode="0">
        <font name="MyriadPro-Semibold" size="7.98">
          <char bbox="261.18 616.16397 265.45729 625.2532" c="2" x="261.18" y="623.2582"/>
          <char bbox="265.50038 616.16397 269.77766 625.2532" c="3" x="265.50038" y="623.2582"/>
        </font>
      </line>
    </block>
    <block bbox="225.78 88.20229 305.18159 117.93829">
      <line bbox="225.78 88.20229 305.18159 117.93829" dir="1 0" wmode="0">
        <font name="MyriadPro-Bold" size="24">
          <char bbox="225.78 88.20229 239.724 117.93829" c="P" x="225.78" y="111.93829"/>
          <char bbox="239.5176 88.20229 248.63759 117.93829" c="r" x="239.5176" y="111.93829"/>
          <char bbox="248.4552 88.20229 261.1272 117.93829" c="e" x="248.4552" y="111.93829"/>
          <char bbox="261.1128 88.20229 269.29679 117.93829" c="f" x="261.1128" y="111.93829"/>
        </font>
      </line>
    </block>
  </page>
</document>

Simplified, the high-level structure looks like:

<page> -> [list of blocks]
  <block> -> [list of blocks]
    a block is either:
      - stext
          <line> -> [list of lines] (all have same baseline)
            <font> -> [list of fonts] (horizontal spaces over a line)
              <char> -> [list of chars]
      - image
          # TODO document the image data from mutool
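To make the line/font/char nesting concrete, the following sketch rebuilds the text of each line from a trimmed copy of the sample output above. It is only an illustration: stext_lines is a hypothetical helper, and it uses regular expressions rather than a real XML parser, so it assumes mutool-style output like the sample and does not decode XML entities.

```perl
use strict;
use warnings;

# Trimmed copy of the sample structured-text output shown above.
my $xml = <<'XML';
<page height="666" width="531">
  <block bbox="261.18 616.16397 269.77766 625.2532">
    <line bbox="261.18 616.16397 269.77766 625.2532" dir="1 0" wmode="0">
      <font name="MyriadPro-Semibold" size="7.98">
        <char bbox="261.18 616.16397 265.45729 625.2532" c="2" x="261.18" y="623.2582"/>
        <char bbox="265.50038 616.16397 269.77766 625.2532" c="3" x="265.50038" y="623.2582"/>
      </font>
    </line>
  </block>
  <block bbox="225.78 88.20229 305.18159 117.93829">
    <line bbox="225.78 88.20229 305.18159 117.93829" dir="1 0" wmode="0">
      <font name="MyriadPro-Bold" size="24">
        <char bbox="225.78 88.20229 239.724 117.93829" c="P" x="225.78" y="111.93829"/>
        <char bbox="239.5176 88.20229 248.63759 117.93829" c="r" x="239.5176" y="111.93829"/>
        <char bbox="248.4552 88.20229 261.1272 117.93829" c="e" x="248.4552" y="111.93829"/>
        <char bbox="261.1128 88.20229 269.29679 117.93829" c="f" x="261.1128" y="111.93829"/>
      </font>
    </line>
  </block>
</page>
XML

# Hypothetical helper: returns one string per <line>, built by joining
# the c="..." attribute of every <char> inside that line, in document order.
sub stext_lines {
    my ($text) = @_;
    my @lines;
    while ( $text =~ m{<line\b[^>]*>(.*?)</line>}gs ) {
        my $line_body = $1;
        my @chars = $line_body =~ m{<char\b[^>]*\sc="([^"]*)"}g;
        push @lines, join( '', @chars );
    }
    return @lines;
}

print "$_\n" for stext_lines($xml);
# prints "23" and then "Pref"
```

For anything beyond a quick inspection, a proper XML parser is the safer choice, since attribute order and escaping are not guaranteed to match this sample.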

get_mutool_text_stext_xml

get_mutool_text_stext_xml($pdf_filename, $pdf_page_no)

Returns a HashRef of the structured text from page number $pdf_page_no of the PDF file $pdf_filename.

See the function get_mutool_text_stext_raw for details on the structure of this data.

get_mutool_page_info_raw

get_mutool_page_info_raw($pdf_filename)

Returns an XML string of the page bounding boxes of PDF file $pdf_filename.

The data is in the form:

<document>
  <page pagenum="1">
    <MediaBox l="0" b="0" r="531" t="666" />
    <CropBox l="0" b="0" r="531" t="666" />
    <Rotate v="0" />
  </page>
  <page pagenum="2">
    ...
  </page>
</document>
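As an illustration of how this data might be consumed, the sketch below derives each page's width and height from its MediaBox in XML of the form shown above. The helper media_box_sizes is hypothetical, it ignores CropBox and Rotate, and it relies on regular expressions, so it assumes the attributes appear exactly as in the sample.

```perl
use strict;
use warnings;

# Single-page sample in the form shown above.
my $page_info = <<'XML';
<document>
  <page pagenum="1">
    <MediaBox l="0" b="0" r="531" t="666" />
    <CropBox l="0" b="0" r="531" t="666" />
    <Rotate v="0" />
  </page>
</document>
XML

# Hypothetical helper: maps page number => { width, height },
# computed from the MediaBox as (r - l) and (t - b).
sub media_box_sizes {
    my ($text) = @_;
    my %size_for;
    while ( $text =~ m{<page\s+pagenum="(\d+)">(.*?)</page>}gs ) {
        my ( $pagenum, $body ) = ( $1, $2 );
        next unless $body =~ m{<MediaBox\s+([^>]*?)/?>};
        my $attrs = $1;
        my %box   = $attrs =~ m{(\w+)="([^"]*)"}g;
        $size_for{$pagenum} = {
            width  => $box{r} - $box{l},
            height => $box{t} - $box{b},
        };
    }
    return \%size_for;
}

my $sizes = media_box_sizes($page_info);
printf "page %d: %s x %s\n", $_, $sizes->{$_}{width}, $sizes->{$_}{height}
    for sort { $a <=> $b } keys %$sizes;
# prints "page 1: 531 x 666"
```

The result agrees with the width and height attributes of the page element in the structured-text sample for the same file.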

get_mutool_page_info_xml

get_mutool_page_info_xml($pdf_filename)

Returns a HashRef containing the page bounding boxes of PDF file $pdf_filename.

See function get_mutool_page_info_raw for information on the structure of the data.

get_mutool_outline_simple

fun get_mutool_outline_simple($pdf_filename)

Returns the outline of the PDF file $pdf_filename as an ArrayRef[HashRef], which corresponds to the items attribute of Renard::Incunabula::Outline.

get_mutool_get_trailer_raw

fun get_mutool_get_trailer_raw($pdf_filename)

Returns the trailer of the PDF file $pdf_filename as a string.

get_mutool_get_object_raw

fun get_mutool_get_object_raw($pdf_filename, $object_id)

Returns the object given by the ID $object_id for PDF file $pdf_filename as a string.

get_mutool_get_info_object_parsed

fun get_mutool_get_info_object_parsed( $pdf_filename )

Returns the document information dictionary as a Renard::API::MuPDF::mutool::ObjectParser object.

See Table 10.2 on pg. 844 of the PDF Reference, version 1.7, for the entries that are usually used (e.g., Title, Author).

SEE ALSO

Repository information

AUTHOR

Project Renard

COPYRIGHT AND LICENSE

This software is copyright (c) 2017 by Project Renard.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.