pyhocr is a Python package for parsing and navigating hOCR documents.
To install the package, run:

    pip install pyhocr

pyhocr parses the following elements from hOCR:
- OCR pages: represented by `ocr_page`
- OCR content areas: represented by `ocr_carea`
- OCR paragraphs: represented by `ocr_par`
- OCR lines: represented by `ocr_line`
- OCR words: represented by `ocrx_word`
and returns each of them as a corresponding Python object, with words returned as Words objects.
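
For orientation, here is a rough sketch of what these elements look like inside an hOCR file. It does not use pyhocr at all: it relies only on Python's standard `html.parser` to count elements by their hOCR class names (`ocr_page`, `ocr_carea`, `ocr_par`, `ocr_line`, `ocrx_word`, as defined by the hOCR format). The sample markup and the names `SAMPLE_HOCR`, `HocrClassCounter` and `count_hocr_elements` are made up for illustration.

```python
from collections import Counter
from html.parser import HTMLParser

# Made-up minimal hOCR fragment (illustrative, not real OCR engine output).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 800 600'>
  <div class='ocr_carea' title='bbox 10 10 790 100'>
    <p class='ocr_par'>
      <span class='ocr_line' title='bbox 10 10 790 40'>
        <span class='ocrx_word' title='bbox 10 10 80 40'>Hello</span>
        <span class='ocrx_word' title='bbox 90 10 170 40'>world</span>
      </span>
    </p>
  </div>
</div>
"""

HOCR_CLASSES = ("ocr_page", "ocr_carea", "ocr_par", "ocr_line", "ocrx_word")


class HocrClassCounter(HTMLParser):
    """Hypothetical helper: counts elements carrying one of the hOCR classes."""

    def __init__(self):
        super().__init__()
        self.counts = Counter()

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; a value may be None.
        classes = (dict(attrs).get("class") or "").split()
        for name in classes:
            if name in HOCR_CLASSES:
                self.counts[name] += 1


def count_hocr_elements(hocr_string):
    """Return a dict mapping each hOCR class found to its element count."""
    parser = HocrClassCounter()
    parser.feed(hocr_string)
    parser.close()
    return dict(parser.counts)


counts = count_hocr_elements(SAMPLE_HOCR)
print(counts)
```

The nesting in the sample (page, content area, paragraph, line, word) is the hierarchy that pyhocr lets you walk up and down.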
You can navigate the hOCR structure by asking any element for its child elements or its parent elements. For example, to navigate down the structure:

    import pyhocr

    with open('example.hocr') as f:
        hocr_string = f.read()

    hocr_document = pyhocr.parse(hocr_string)

    # get the first page
    page = hocr_document.pages[0]

    # pulling all lines out:
    lines = page.lines

    # getting text of last line
    last_line_text = lines[-1].text

    # getting all words of page
    words = page.words

Or navigate up the data structure:

    # `word` is any word object, e.g. one taken from page.words above

    # get parent page
    page = word.page

    # get parent line
    line = word.line

    # get parent block
    block = word.block

    # get parent page of the block
    page = block.page

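
To make the idea of navigating up and down concrete, the following is a minimal sketch of a tree in which every node keeps a link to its parent and a list of its children. It is not pyhocr's implementation, and the `Node` class with its `add`, `ancestor` and `descendants` methods is entirely hypothetical; it only illustrates how lookups such as "the page of this word" or "all words of this page" can be answered from parent and child links.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    """Hypothetical tree node; `kind` is e.g. 'page', 'block', 'line' or 'word'."""
    kind: str
    text: str = ""
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)

    def add(self, child: "Node") -> "Node":
        # Attach the child and record the back-link to this node.
        child.parent = self
        self.children.append(child)
        return child

    def ancestor(self, kind: str) -> Optional["Node"]:
        # Walk upward until a node of the requested kind is found.
        node = self.parent
        while node is not None and node.kind != kind:
            node = node.parent
        return node

    def descendants(self, kind: str) -> List["Node"]:
        # Depth-first walk downward, collecting matches in document order.
        found = []
        for child in self.children:
            if child.kind == kind:
                found.append(child)
            found.extend(child.descendants(kind))
        return found


# Build a tiny page -> block -> line -> word hierarchy.
page = Node("page")
block = page.add(Node("block"))
line = block.add(Node("line"))
first = line.add(Node("word", "Hello"))
second = line.add(Node("word", "world"))

print(second.ancestor("page") is page)
print([w.text for w in page.descendants("word")])
```

Keeping a parent link on every node makes upward lookups a simple loop, at the cost of having to maintain the link whenever a child is attached.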
Please feel free to post pull requests or report issues.