mc-pdf2txt

Multi-column PDF to Text


Keywords
cli-tool, ocr, pdf, tesseract
License
BSD-1-Clause
Install
pip install mc-pdf2txt==0.3.0

Documentation

mc-pdf2txt

Convert multi-column pdf to text with poppler and tesseract.

Install

(1) Install dependencies:

Install poppler.

sudo apt install poppler-utils

Install tesseract-ocr

sudo apt install tesseract-ocr

with the language data files of your choice, e.g.,

sudo apt install tesseract-ocr-jpn

(2) Install mc-pdf2txt

pip3 install mc-pdf2txt

Usage

Usage:
  mc-pdf2txt [options] <input>...

Options:
  -l LANG           Language, such as `eng`, `jpn`, or `eng+jpn`.
  <input>           Input PDF file.
  -o OUTPUT         Output text file.
  -r DPI            Resolution of temporary image file [default: 600].
  --timeout SEC     Timeout in sec to exec `pdftoppm` [default: 60].
  --page-separator LINE     String to be output as page separator [default: ---].
  --psm VALUE       Page segmentation mode of `tessoract-ocr` [default: 3].
  --verbose         Verbose.