pdfsplitter

Turn PDFs into image files for use in machine learning projects


Keywords
some, keywords, computer-vision, data-science, machine-learning, pdf, python
License
Apache-2.0
Install
pip install pdfsplitter==0.0.2

Documentation

pdfsplitter

A simple way to extract images from PDF files for use in machine learning workflows.

Install

pip install --upgrade pdfsplitter

How to use

The highest-level function for exporting image files from a series of PDFs is extract_images_from_pdfs, which takes all the PDF files inside a source directory and extracts their images to a destination directory. You can optionally specify the file type of the exported images, as in this example:

from pathlib import Path

# get_pdf_links, download_pdf_files, extract_images_from_pdfs, get_stats and
# display_stats come from pdfsplitter and are assumed to be imported already

source = Path("./tryout/")
destination = Path("./tryout/processed")

# download all the PDFs linked from a particular URL into the source directory
download_pdf_files(
    get_pdf_links("https://open.defense.gov/Transparency/FOIA.aspx"), "./tryout"
)

# extract all the images from the downloaded PDFs and save them to the destination directory
extract_images_from_pdfs(source, destination, "jpg")

# get stats on the downloaded PDF files
display_stats(get_stats(source))
                                  Stats for your PDF Files                                   
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ PageCou… ┃ Filename                                      ┃ ocr_lay… ┃ pdf_fil… ┃ author   ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│       27 │ 2014_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 236655   │ Stephan… │
│          │                                               │          │          │ Carr     │
│        3 │ 7-26-2013_Determination.pdf                   │ False    │ 214683   │          │
│        2 │ DA Determination-DCRIT Hawaii Water Wells.pdf │ False    │ 115574   │          │
│        3 │ 12-18-14_Determination.pdf                    │ False    │ 50925    │          │
│        4 │ 6-1-2012_Determination.pdf                    │ False    │ 463902   │          │
│        2 │ 8-19-2021_Determination.pdf                   │ False    │ 350438   │          │
│       15 │ 2012_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 242305   │ CarrS    │
│        3 │ 2-12-2014_Determination.pdf                   │ False    │ 23823    │ timothy… │
│        2 │ DA%20Determination%20DoD%20Flights.pdf        │ False    │ 111521   │          │
│       22 │ 2013_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 258462   │ CarrS    │
│        2 │ 2-15-2018_Determination.pdf                   │ False    │ 342195   │          │
│       49 │ DoDFY2020AnnualFOIA_Report.pdf                │ False    │ 1247446  │          │
│        3 │ 7-5-2019_Determination.pdf                    │ False    │ 204453   │          │
│       30 │ 2017_DoD_Chief_FOIA_Officer_Report.pdf        │ False    │ 4810077  │          │
│       28 │ 2021_DoD_Chief_FOIA_Officer_Report.pdf        │ False    │ 1131474  │          │
│       10 │ 2011_DoD_Chief_FOIA_OfficerReport.pdf         │ False    │ 113387   │ CarrS    │
│       27 │ 2018_DoD_Chief_FOIA_Officer_Report.pdf        │ False    │ 788227   │ brandoct │
│        2 │ 8-3-15_Determination.pdf                      │ False    │ 105563   │          │
│        3 │ 1-21-2016_Determination.pdf                   │ False    │ 122706   │          │
│        2 │ 12-6-2017_Determination.pdf                   │ False    │ 189563   │ deleonv  │
│        2 │ 12-18-2018_Determination.pdf                  │ False    │ 153675   │          │
│       30 │ 2016_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 1108008  │          │
│        2 │ 11-29-2017_Determination.pdf                  │ False    │ 369290   │          │
│        2 │ DoD SAP IT DCRIT Determination.pdf            │ False    │ 127858   │          │
│        3 │ 10-19-2018_Determination.pdf                  │ False    │ 70088    │ JAMES    │
│          │                                               │          │          │ HOGAN    │
│       30 │ 2015_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 287445   │ Stephan… │
│          │                                               │          │          │ Carr     │
│        3 │ 7-31-2020_Determination.pdf                   │ False    │ 88447    │ Dziecic… │
│          │                                               │          │          │ Gerald J │
│          │                                               │          │          │ Jr CIV   │
│          │                                               │          │          │ OSD OGC  │
│          │                                               │          │          │ (USA)    │
└──────────┴───────────────────────────────────────────────┴──────────┴──────────┴──────────┘
TOTAL PAGECOUNT: 311
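
After an extraction run it can be useful to confirm that the destination directory actually contains images of the requested file type. The following is a minimal sketch using only the Python standard library. It is not part of pdfsplitter: count_files_by_suffix is a hypothetical helper, and the sketch assumes only that extract_images_from_pdfs writes its output files somewhere under the destination directory (the exact file naming and folder layout are not documented here).

```python
from collections import Counter
from pathlib import Path


def count_files_by_suffix(directory: Path) -> Counter:
    """Tally the files under `directory` (recursively) by lowercased suffix.

    Hypothetical helper for illustration; not a pdfsplitter function.
    """
    return Counter(
        path.suffix.lower()
        for path in directory.rglob("*")
        if path.is_file()
    )


if __name__ == "__main__":
    # Same destination path as in the example above.
    destination = Path("./tryout/processed")
    if destination.is_dir():
        for suffix, count in sorted(count_files_by_suffix(destination).items()):
            print(f"{suffix or '(no suffix)'}: {count}")
    else:
        print(f"{destination} does not exist yet")
```

If "jpg" was passed as the file type, the tally would be expected to be dominated by a single image suffix. Which suffix that is depends on how pdfsplitter names its output files.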

What is pdfsplitter?

Features

  • statistics generation
  • image extraction
