pdf-highlight-extractor

Extract and summarize highlights from PDF files.


Keywords
pdf, highlights, extraction, annotation, pymupdf, research, notes, text
License
MIT
Install
pip install pdf-highlight-extractor==0.1.2

Documentation

📘 pdf_highlight_extractor

Extract highlighted text from PDF files using PyMuPDF.

This lightweight utility reads highlights from PDFs, along with each highlight's page number and color. It is suited to summarizing annotated documents, research papers, or ebooks.


🔧 Installation

Install from PyPI:

pip install pdf-highlight-extractor

🚀 Usage

from pdf_highlight_extractor.reader import extract_highlights

highlights = extract_highlights("sample.pdf")

for h in highlights:
    print(f"Page {h['page']} | Color: {h['color']} | Text: {h['text']}")

๐Ÿ“ Output Example

Page 2 | Color: (1.0, 1.0, 0.0) | Text: This is a highlighted phrase
Page 5 | Color: (0.0, 1.0, 0.0) | Text: Another important note
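
The color values in this output look like RGB tuples with components between 0 and 1. As a minimal sketch of post-processing, the snippet below converts each color to a hex string and groups the highlights by page. The helper names `color_to_hex` and `group_by_page` are hypothetical and not part of the package; the sketch only assumes the `page`, `color`, and `text` keys shown above, and that every highlight carries a three-component color.

```python
from collections import defaultdict


def color_to_hex(color):
    """Convert an RGB tuple with components in [0, 1] to a '#rrggbb' string."""
    # Clamp each channel so slightly out-of-range floats still map to 0..255.
    r, g, b = (max(0, min(255, round(c * 255))) for c in color)
    return f"#{r:02x}{g:02x}{b:02x}"


def group_by_page(highlights):
    """Return {page: [(hex_color, text), ...]}, keeping the input order."""
    pages = defaultdict(list)
    for h in highlights:
        pages[h["page"]].append((color_to_hex(h["color"]), h["text"]))
    return dict(pages)


# Sample data copied from the output example; a real run would pass the
# list returned by extract_highlights("sample.pdf") instead.
sample = [
    {"page": 2, "color": (1.0, 1.0, 0.0), "text": "This is a highlighted phrase"},
    {"page": 5, "color": (0.0, 1.0, 0.0), "text": "Another important note"},
]

grouped = group_by_page(sample)
for page, items in grouped.items():
    for hex_color, text in items:
        print(f"Page {page} [{hex_color}]: {text}")
```

Grouping by page keeps notes in reading order, and hex strings are easier to match against a color legend than float tuples.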

🧠 Features

  • ✅ Extract text from highlights
  • ✅ Get page number and highlight color
  • ✅ Fallback extraction if highlight text is not directly stored
  • ✅ Simple API for automation or personal use
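
The fallback mentioned in the list above could work roughly as follows. This is a sketch under stated assumptions, not the package's actual implementation: it assumes the stored text lives in the annotation's `content` field and that the fallback reads the page text under the annotation rectangle. The function names `highlight_text` and `iter_highlights` are hypothetical, and the 1-based page numbering is also an assumption.

```python
def highlight_text(page, annot):
    """Return the text for one highlight annotation.

    Prefer text stored in the annotation itself; if none is stored,
    fall back to reading the page text under the annotation rectangle.
    """
    stored = (annot.info.get("content") or "").strip()
    if stored:
        return stored
    # Fallback: clip text extraction to the highlight's bounding box.
    return page.get_text("text", clip=annot.rect).strip()


def iter_highlights(path):
    """Yield one dict per highlight annotation found in the PDF at `path`."""
    import fitz  # PyMuPDF; imported lazily so highlight_text works without it

    with fitz.open(path) as doc:
        for page in doc:
            for annot in page.annots():
                # annot.type is a pair such as (8, 'Highlight').
                if annot.type[1] != "Highlight":
                    continue
                yield {
                    "page": page.number + 1,  # assumption: 1-based pages
                    "color": annot.colors.get("stroke"),
                    "text": highlight_text(page, annot),
                }
```

Clipping to the bounding rectangle is simple, but for a highlight that spans several lines it may pick up neighboring words. Working from the annotation's individual quads would be more precise, at the cost of more code.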

🧪 Example PDF

You can test the tool using any PDF with highlights created in:

  • Adobe Acrobat Reader
  • Preview (macOS)
  • Xodo or other PDF apps

📦 Requirements

  • Python 3.7+
  • PyMuPDF (automatically installed)

For development only, install the package in editable mode from a source checkout:

pip install -e .