pdfsplitter
A simple way to extract and parse images for machine learning workflows.
Install
pip install --upgrade pdfsplitter
How to use
The highest-level function for exporting image files from a series of PDFs is extract_images_from_pdfs, which takes all the PDF files inside a source directory and extracts their images to a destination directory. You can also specify the image file type for the exported images, as in this example:
from pathlib import Path

# the pdfsplitter functions used below are assumed to be imported already
source = Path("./tryout/")
destination = Path("./tryout/processed")
# download all the PDFs linked from a particular web page
download_pdf_files(
    get_pdf_links("https://open.defense.gov/Transparency/FOIA.aspx"), "./tryout"
)
# extract all the images from the downloaded PDFs and save them to a directory
extract_images_from_pdfs(source, destination, "jpg")
# get stats on the downloaded PDF files
display_stats(get_stats(source))
Stats for your PDF Files
┏━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┳━━━━━━━━━━┓
┃ PageCou… ┃ Filename                                      ┃ ocr_lay… ┃ pdf_fil… ┃ author   ┃
┡━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━╇━━━━━━━━━━┩
│ 27       │ 2014_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 236655   │ Stephan… │
│          │                                               │          │          │ Carr     │
│ 3        │ 7-26-2013_Determination.pdf                   │ False    │ 214683   │          │
│ 2        │ DA Determination-DCRIT Hawaii Water Wells.pdf │ False    │ 115574   │          │
│ 3        │ 12-18-14_Determination.pdf                    │ False    │ 50925    │          │
│ 4        │ 6-1-2012_Determination.pdf                    │ False    │ 463902   │          │
│ 2        │ 8-19-2021_Determination.pdf                   │ False    │ 350438   │          │
│ 15       │ 2012_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 242305   │ CarrS    │
│ 3        │ 2-12-2014_Determination.pdf                   │ False    │ 23823    │ timothy… │
│ 2        │ DA%20Determination%20DoD%20Flights.pdf        │ False    │ 111521   │          │
│ 22       │ 2013_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 258462   │ CarrS    │
│ 2        │ 2-15-2018_Determination.pdf                   │ False    │ 342195   │          │
│ 49       │ DoDFY2020AnnualFOIA_Report.pdf                │ False    │ 1247446  │          │
│ 3        │ 7-5-2019_Determination.pdf                    │ False    │ 204453   │          │
│ 30       │ 2017_DoD_Chief_FOIA_Officer_Report.pdf        │ False    │ 4810077  │          │
│ 28       │ 2021_DoD_Chief_FOIA_Officer_Report.pdf        │ False    │ 1131474  │          │
│ 10       │ 2011_DoD_Chief_FOIA_OfficerReport.pdf         │ False    │ 113387   │ CarrS    │
│ 27       │ 2018_DoD_Chief_FOIA_Officer_Report.pdf        │ False    │ 788227   │ brandoct │
│ 2        │ 8-3-15_Determination.pdf                      │ False    │ 105563   │          │
│ 3        │ 1-21-2016_Determination.pdf                   │ False    │ 122706   │          │
│ 2        │ 12-6-2017_Determination.pdf                   │ False    │ 189563   │ deleonv  │
│ 2        │ 12-18-2018_Determination.pdf                  │ False    │ 153675   │          │
│ 30       │ 2016_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 1108008  │          │
│ 2        │ 11-29-2017_Determination.pdf                  │ False    │ 369290   │          │
│ 2        │ DoD SAP IT DCRIT Determination.pdf            │ False    │ 127858   │          │
│ 3        │ 10-19-2018_Determination.pdf                  │ False    │ 70088    │ JAMES    │
│          │                                               │          │          │ HOGAN    │
│ 30       │ 2015_ACFO_Report_FINAL_REPORT.pdf             │ False    │ 287445   │ Stephan… │
│          │                                               │          │          │ Carr     │
│ 3        │ 7-31-2020_Determination.pdf                   │ False    │ 88447    │ Dziecic… │
│          │                                               │          │          │ Gerald J │
│          │                                               │          │          │ Jr CIV   │
│          │                                               │          │          │ OSD OGC  │
│          │                                               │          │          │ (USA)    │
└──────────┴───────────────────────────────────────────────┴──────────┴──────────┴──────────┘
TOTAL PAGECOUNT: 311
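The total reported above is the sum of the per-file page counts. As a minimal sketch of that aggregation, assume each file's stats are available as a dictionary with a `pages` key. That structure and the helper name `total_page_count` are hypothetical and are not claimed to match what get_stats returns.

```python
# Hypothetical per-file records; the keys are illustrative only and are not
# claimed to match the structure that get_stats actually returns.
sample_stats = [
    {"filename": "7-26-2013_Determination.pdf", "pages": 3},
    {"filename": "6-1-2012_Determination.pdf", "pages": 4},
    {"filename": "8-19-2021_Determination.pdf", "pages": 2},
]


def total_page_count(stats):
    """Sum the page counts across all per-file records."""
    return sum(record["pages"] for record in stats)


print(f"TOTAL PAGECOUNT: {total_page_count(sample_stats)}")
```

The three sample records reuse page counts from the output above, so the same function applied to every row would give the reported total.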
What is pdfsplitter?
Features
- statistics generation
- image extraction
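The directory-level behaviour behind image extraction (every PDF in a source directory, output written to a destination directory in a chosen file type) can be pictured with a small standard-library sketch. The helper name `plan_exports` and the output naming scheme are assumptions made for illustration. This is not pdfsplitter's implementation, and it does not decode any PDF content.

```python
from pathlib import Path


def plan_exports(source: Path, destination: Path, filetype: str = "jpg"):
    """Map each PDF found in `source` to an illustrative output path in `destination`."""
    plan = {}
    # Only files ending in .pdf directly inside `source` are considered.
    for pdf_path in sorted(source.glob("*.pdf")):
        # Hypothetical naming: <pdf stem>.<filetype>. A real extractor would
        # typically add page or image indices to keep names unique.
        plan[pdf_path] = destination / f"{pdf_path.stem}.{filetype}"
    return plan
```

Calling `plan_exports(Path("./tryout/"), Path("./tryout/processed"), "jpg")` would return one entry per PDF in `./tryout/`, which is a convenient way to preview what a batch run would touch before extracting anything.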