OCR utils
Python tools for interacting with Tesseract
Features
- Detects tables in PDFs and images and performs OCR on each cell
- Performs OCR on a PDF and generates an SVG image
Quick Start
from ocr_utils import pdf_to_svg
pdf_to_svg(
    input_filename='in.pdf',
    output_filename='out.svg',
    detect_tables=True,
    lang='eng',
)
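To run the same call over several files, a small wrapper is enough. The sketch below is illustrative and not part of the package: `convert_many` and its `convert` parameter are hypothetical names, and the only thing it assumes about `pdf_to_svg` is the keyword signature shown in the Quick Start.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Optional


def convert_many(
    pdf_paths: Iterable[str],
    output_dir: str,
    convert: Optional[Callable[..., object]] = None,
) -> List[Path]:
    """Convert each PDF to an SVG with the same stem inside output_dir.

    Hypothetical helper: `convert` defaults to ocr_utils.pdf_to_svg and is
    called with the keyword arguments shown in the Quick Start.
    """
    if convert is None:
        # Imported lazily so the helper can be exercised without the package.
        from ocr_utils import pdf_to_svg
        convert = pdf_to_svg

    out_dir = Path(output_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for pdf in map(Path, pdf_paths):
        target = out_dir / (pdf.stem + ".svg")
        convert(
            input_filename=str(pdf),
            output_filename=str(target),
            detect_tables=True,
            lang="eng",
        )
        written.append(target)
    return written
```

Calling `convert_many(["a.pdf", "b.pdf"], "svg_out")` would then write `svg_out/a.svg` and `svg_out/b.svg`, assuming both inputs exist and the system dependencies listed under Installation are present.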
Execution example
Input pdf
Output svg
Installation
Stable Release: pip install tesseract_ocr_utils
Development Head: pip install git+https://github.com/envinorma/ocr_utils.git
This library is built on pytesseract and pdf2image, which have non-pip requirements. Visit these libraries' installation pages to install their dependencies.
For example, on Ubuntu, the following packages need to be installed:
apt-get install libarchive13
apt-get install tesseract-ocr
apt-get install poppler-utils
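After installing the system packages, it can help to confirm that the executables are actually on `PATH` before running any OCR. The check below is a minimal sketch using only the standard library; the helper names are hypothetical, and the binary list assumes the usual setup in which pytesseract invokes `tesseract` and pdf2image invokes poppler's `pdftoppm`.

```python
import shutil
from typing import Dict, List, Optional, Sequence

# Assumed executables: pytesseract shells out to `tesseract`,
# pdf2image shells out to `pdftoppm` (shipped in poppler-utils).
REQUIRED_BINARIES = ("tesseract", "pdftoppm")


def find_binaries(names: Sequence[str] = REQUIRED_BINARIES) -> Dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if not found."""
    return {name: shutil.which(name) for name in names}


def missing_binaries(names: Sequence[str] = REQUIRED_BINARIES) -> List[str]:
    """Return the executables that could not be found on PATH."""
    return [name for name, path in find_binaries(names).items() if path is None]


if __name__ == "__main__":
    missing = missing_binaries()
    if missing:
        print("Missing system dependencies: " + ", ".join(missing))
    else:
        print("All required executables were found.")
```

An empty list from `missing_binaries()` only shows that the executables were found; it does not verify that language data such as `eng` is installed for Tesseract.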
Documentation
For full package documentation please visit envinorma.github.io/ocr_utils.
Development
See CONTRIBUTING.md for information related to developing the code.
MIT license