pdfsplitter
A simple way to extract and parse images for machine learning workflows.
Install
pip install --upgrade pdfsplitter
How to use
The highest-level function for exporting images from a set of PDF files is extract_images_from_pdfs, which takes all the PDF files inside a source directory and extracts their images to a destination directory. You can optionally specify the image file type for the exported images, as in this example:
from pathlib import Path

# the pdfsplitter functions used below are assumed to be imported already
source = Path("./tryout/")
destination = Path("./tryout/processed")
# download all the PDFs linked from a given web page
download_pdf_files(
    get_pdf_links("https://open.defense.gov/Transparency/FOIA.aspx"), "./tryout"
)
# extract all the images from the downloaded PDFs and save them to the destination directory
extract_images_from_pdfs(source, destination, "jpg")
# get stats on the downloaded PDF files
display_stats(get_stats(source))
Stats for your PDF Files

PageCou…  Filename                                       ocr_lay…  pdf_fil…  author
-------------------------------------------------------------------------------------
27        2014_ACFO_Report_FINAL_REPORT.pdf              False     236655    Stephan… Carr
3         7-26-2013_Determination.pdf                    False     214683
2         DA Determination-DCRIT Hawaii Water Wells.pdf  False     115574
3         12-18-14_Determination.pdf                     False     50925
4         6-1-2012_Determination.pdf                     False     463902
2         8-19-2021_Determination.pdf                    False     350438
15        2012_ACFO_Report_FINAL_REPORT.pdf              False     242305    CarrS
3         2-12-2014_Determination.pdf                    False     23823     timothy…
2         DA%20Determination%20DoD%20Flights.pdf         False     111521
22        2013_ACFO_Report_FINAL_REPORT.pdf              False     258462    CarrS
2         2-15-2018_Determination.pdf                    False     342195
49        DoDFY2020AnnualFOIA_Report.pdf                 False     1247446
3         7-5-2019_Determination.pdf                     False     204453
30        2017_DoD_Chief_FOIA_Officer_Report.pdf         False     4810077
28        2021_DoD_Chief_FOIA_Officer_Report.pdf         False     1131474
10        2011_DoD_Chief_FOIA_OfficerReport.pdf          False     113387    CarrS
27        2018_DoD_Chief_FOIA_Officer_Report.pdf         False     788227    brandoct
2         8-3-15_Determination.pdf                       False     105563
3         1-21-2016_Determination.pdf                    False     122706
2         12-6-2017_Determination.pdf                    False     189563    deleonv
2         12-18-2018_Determination.pdf                   False     153675
30        2016_ACFO_Report_FINAL_REPORT.pdf              False     1108008
2         11-29-2017_Determination.pdf                   False     369290
2         DoD SAP IT DCRIT Determination.pdf             False     127858
3         10-19-2018_Determination.pdf                   False     70088     JAMES HOGAN
30        2015_ACFO_Report_FINAL_REPORT.pdf              False     287445    Stephan… Carr
3         7-31-2020_Determination.pdf                    False     88447     Dziecic… Gerald J Jr CIV OSD OGC (USA)
TOTAL PAGECOUNT: 311
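Before running an extraction, it can be useful to check which PDF files are actually sitting in the source directory. The following is a minimal sketch using only the Python standard library. The helper name list_pdfs is hypothetical and is not part of pdfsplitter, and the sketch assumes that only files directly inside the directory matter; whether pdfsplitter itself also looks into subdirectories is not stated here.

```python
from pathlib import Path


def list_pdfs(source: Path) -> list[Path]:
    """Return the PDF files directly inside `source`, sorted by path.

    Hypothetical helper for illustration only; not part of pdfsplitter.
    Subdirectories are ignored and the extension match is case-insensitive.
    """
    return sorted(
        p for p in source.iterdir()
        if p.is_file() and p.suffix.lower() == ".pdf"
    )


if __name__ == "__main__":
    source = Path("./tryout/")
    if source.is_dir():
        for pdf in list_pdfs(source):
            # st_size is the size of the file on disk, in bytes
            print(f"{pdf.name}: {pdf.stat().st_size} bytes")
    else:
        print(f"{source} does not exist yet")
```

Running this against the same source directory that is passed to extract_images_from_pdfs gives a quick preview of the input files without touching them.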
What is pdfsplitter?
Features
- statistics generation
- image extraction