Extract highlighted text from PDF files using PyMuPDF.
This lightweight utility returns the text of each highlight in a PDF, together with its page number and highlight color. It is useful for summarizing annotated documents, research papers, or ebooks.
Install from PyPI:
pip install pdf-highlight-extractor
from pdf_highlight_extractor.reader import extract_highlights
highlights = extract_highlights("sample.pdf")
for h in highlights:
    print(f"Page {h['page']} | Color: {h['color']} | Text: {h['text']}")
Page 2 | Color: (1.0, 1.0, 0.0) | Text: This is a highlighted phrase
Page 5 | Color: (0.0, 1.0, 0.0) | Text: Another important note
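Because each highlight is a dictionary with `page`, `color`, and `text` keys, the results are easy to group for a summary. The following is a minimal sketch, assuming colors arrive as RGB tuples of floats as in the sample output. The `COLOR_NAMES` table and the `nearest_color_name` and `group_by_color` helpers are illustrative and not part of the package:

```python
from collections import defaultdict

# Hypothetical lookup table: reference RGB values for a few highlighter colors.
COLOR_NAMES = {
    "yellow": (1.0, 1.0, 0.0),
    "green": (0.0, 1.0, 0.0),
    "blue": (0.0, 0.0, 1.0),
    "red": (1.0, 0.0, 0.0),
}


def nearest_color_name(rgb):
    """Return the name in COLOR_NAMES closest to rgb (squared Euclidean distance)."""
    return min(
        COLOR_NAMES,
        key=lambda name: sum((a - b) ** 2 for a, b in zip(rgb, COLOR_NAMES[name])),
    )


def group_by_color(highlights):
    """Group highlight dicts by color name, keeping (page, text) pairs in input order."""
    groups = defaultdict(list)
    for h in highlights:
        groups[nearest_color_name(h["color"])].append((h["page"], h["text"]))
    return dict(groups)


# Sample data mirroring the output shown above.
sample = [
    {"page": 2, "color": (1.0, 1.0, 0.0), "text": "This is a highlighted phrase"},
    {"page": 5, "color": (0.0, 1.0, 0.0), "text": "Another important note"},
]

for name, items in group_by_color(sample).items():
    for page, text in items:
        print(f"[{name}] p.{page}: {text}")
```

Matching to the nearest named color, instead of comparing tuples exactly, tolerates the slightly different shades that different PDF apps write for "yellow" or "green".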
- Extract text from highlights
- Get page number and highlight color
- Fallback extraction if highlight text is not directly stored
- Simple API for automation or personal use
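The fallback matters because a highlight annotation records where the highlight sits on the page, not necessarily the words under it. One common way to recover the text, sketched here without claiming it is what this package does internally, is to intersect the highlight rectangle with word bounding boxes. The word tuples follow the layout that PyMuPDF's `page.get_text("words")` returns, `(x0, y0, x1, y1, word, block_no, line_no, word_no)`. The `overlap_fraction` and `words_under_highlight` helpers and the `min_overlap` threshold are assumptions made for illustration:

```python
def overlap_fraction(word_box, rect):
    """Return the fraction of the word box's area that lies inside rect."""
    wx0, wy0, wx1, wy1 = word_box
    rx0, ry0, rx1, ry1 = rect
    inter_w = min(wx1, rx1) - max(wx0, rx0)
    inter_h = min(wy1, ry1) - max(wy0, ry0)
    if inter_w <= 0 or inter_h <= 0:
        return 0.0  # the boxes do not intersect
    area = (wx1 - wx0) * (wy1 - wy0)
    return (inter_w * inter_h) / area if area > 0 else 0.0


def words_under_highlight(words, rect, min_overlap=0.5):
    """Join, in reading order, the words mostly covered by the highlight rect."""
    picked = [w for w in words if overlap_fraction(w[:4], rect) >= min_overlap]
    # Sort by (block_no, line_no, word_no) to restore reading order.
    picked.sort(key=lambda w: (w[5], w[6], w[7]))
    return " ".join(w[4] for w in picked)


# Hand-made word boxes standing in for one page's output.
words = [
    (10, 10, 40, 20, "This", 0, 0, 0),
    (45, 10, 60, 20, "is", 0, 0, 1),
    (65, 10, 120, 20, "highlighted", 0, 0, 2),
    (10, 30, 50, 40, "elsewhere", 0, 1, 0),
]
highlight_rect = (44, 9, 121, 21)

print(words_under_highlight(words, highlight_rect))  # prints "is highlighted"
```

With PyMuPDF, the word list would come from `page.get_text("words")` and the rectangle from the annotation's `rect`. A highlight that spans several lines is better handled one quad at a time using the annotation's `vertices`, since a single bounding rectangle would also cover unhighlighted words at the line ends.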
You can test the tool using any PDF with highlights created in:
- Adobe Acrobat Reader
- Preview (macOS)
- Xodo or other PDF apps
Requirements:
- Python 3.7+
- PyMuPDF (installed automatically as a dependency)
For development only, install the package in editable mode from the project root:
pip install -e .