Pdf to Dict
A small library for converting PDF files to Python dicts.
Task
Build a Python library that does the following:
Primary
- Take in a standardised PDF document
- Strip out the text from the document
- Convert the fields and data into a dictionary
- Provide an output dictionary
Secondary
- Be able to detect when a non-standardised PDF was submitted and send an alert email
- Validate that the information captured in the dictionary is accurate
Main goal
The goal of this project is to extract all the information from a Bizfile PDF document and return it as a Python dictionary.
It has the following requirements:
- Reference the Biz profile samples available at acra.gov.sg/how-to-guides/buying-information/business-profile
- Create a Python library that works with the latest version of Python, reads the PDF document, and extracts all the fields and their corresponding values.
- These values should be returned in the format of a dictionary
Description
This library can scrape PDF files that have the same structure as the examples.
It uses pdfquery (https://pypi.org/project/pdfquery/) to scrape data from PDF files.
The library cannot scrape the following tables:
- https://prnt.sc/qtbfcd, because its structure is invalid; you will need to fix this in your report generator. Scraping it may be possible, for example through configuration, but then the data cannot be scraped from the PDF automatically.
- https://prnt.sc/qtbg1d, because its structure is chaotic. The data is scraped according to rules, and no rule can be written for a structure like this.
The library can scrape these structures:
- https://prnt.sc/qtbgqt
- https://prnt.sc/qtbgwi
- https://prnt.sc/qtbgzi
- https://prnt.sc/qtbh3w (empty table)
- https://prnt.sc/qtbhfq (long table)
This is an example of scraped data: https://prnt.sc/qtbitr. As you can see, tables are scraped with their own structure (a list of dicts whose keys are the table headers). Other data is scraped into its own dict. All the data is grouped.
Invalid data must be handled on your side. You can check whether values in the data are invalid, e.g. contain "AMOUNT" or other unnecessary data, and remove them.
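The clean-up step mentioned above could be sketched as follows. This is only an illustration and not part of the library: the helper names `_is_unwanted` and `drop_unwanted_values`, the marker list, and the sample data are all assumptions. The sample is made up and only mimics the shape described above (grouped dicts, with tables as lists of dicts).

```python
# Hypothetical helpers, not part of pdftodict.
UNWANTED_MARKERS = ("AMOUNT",)  # assumed example marker


def _is_unwanted(value, markers):
    """True if `value` is a string containing any of the markers."""
    return isinstance(value, str) and any(m in value for m in markers)


def drop_unwanted_values(data, markers=UNWANTED_MARKERS):
    """Return a copy of `data` without string values that contain a marker.

    Walks nested dicts and lists; other values are returned unchanged.
    """
    if isinstance(data, dict):
        return {
            key: drop_unwanted_values(value, markers)
            for key, value in data.items()
            if not _is_unwanted(value, markers)
        }
    if isinstance(data, list):
        return [
            drop_unwanted_values(item, markers)
            for item in data
            if not _is_unwanted(item, markers)
        ]
    return data


# Made-up sample in the assumed output shape.
sample = {
    "Company": {"Name": "EXAMPLE PTE. LTD.", "Status": "AMOUNT"},
    "Officers": [
        {"Name": "JANE DOE", "Position": "Director"},
        {"Name": "AMOUNT", "Position": "Secretary"},
    ],
}
cleaned = drop_unwanted_values(sample)
print(cleaned)
```

The function returns a new structure instead of modifying the scraped data in place, so the original output stays available for comparison.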
How to use
from pdftodict import PdfScrapper

if __name__ == '__main__':
    pdf_path = 'path/to/pdf'
    app = PdfScrapper()
    # set the PDF path
    app.set_data(pdf_path)
    data = app.scrape_all_data()
    # you can also choose a single page to scrape
    data = app.scrape_all_data(page=1)
    print(data)
You can set a list of invalid field values. When one of the fields in the scraped data contains a value that you have set as invalid, a message is displayed. You can also validate the data yourself via app.validate_scrapping_data(data). The invalid fields list is empty by default.
from pdftodict import PdfScrapper

if __name__ == '__main__':
    app = PdfScrapper()
    app.set_invalid_fields_data(
        [
            'e.g. invalid field'
        ]
    )
You can enable an email alert that is sent when the PDF structure is invalid. The connection to the mail server uses SSL.
from pdftodict import PdfScrapper

if __name__ == '__main__':
    app = PdfScrapper()
    app.set_mail_data(
        email_to='',
        email_from='',
        port=465,
        password='',
        host='smtp.gmail.com',
    )
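For orientation, an alert sent over SSL with Python's standard library could look roughly like the sketch below. It does not show how pdftodict sends mail internally: the function names `build_alert_message` and `send_alert`, the subject line, and the message text are assumptions made for illustration. Only the parameters (`email_to`, `email_from`, `port`, `password`, `host`) mirror the settings above.

```python
import smtplib
import ssl
from email.message import EmailMessage


def build_alert_message(email_from, email_to, pdf_path):
    """Build a plain-text alert (hypothetical wording, not the library's)."""
    msg = EmailMessage()
    msg["Subject"] = "Invalid PDF structure"
    msg["From"] = email_from
    msg["To"] = email_to
    msg.set_content(
        f"The PDF at {pdf_path} does not match the expected structure."
    )
    return msg


def send_alert(msg, host, port, password):
    """Send the message over an SSL connection (port 465 for Gmail)."""
    context = ssl.create_default_context()
    with smtplib.SMTP_SSL(host, port, context=context) as server:
        # Assumes the sender address is also the login name.
        server.login(msg["From"], password)
        server.send_message(msg)


# Building the message needs no network access.
alert = build_alert_message("from@example.com", "to@example.com", "path/to/pdf")
print(alert["Subject"])
# send_alert(alert, host="smtp.gmail.com", port=465, password="...")
```

The send call is left commented out because it needs real credentials. Depending on the mail provider, an app-specific password may be required instead of the account password.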
You can set a custom logger:
import logging
from pdftodict import PdfScrapper

if __name__ == '__main__':
    logger = logging.getLogger(__name__)
    app = PdfScrapper()
    app.set_logger(logger)
By default, messages are displayed via print.
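A logger obtained from logging.getLogger has no handler of its own until you configure one, so it may help to set it up before passing it to the library. The sketch below is one possible configuration using only the standard library. The function name `build_logger`, the logger name, the level, and the format string are assumptions, and which log levels the library actually emits is not documented here.

```python
import logging


def build_logger(name="pdftodict_example", level=logging.INFO):
    """Return a logger that writes formatted records to stderr."""
    logger = logging.getLogger(name)
    logger.setLevel(level)
    if not logger.handlers:  # avoid duplicate handlers on repeated calls
        handler = logging.StreamHandler()
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = build_logger()
logger.info("logger configured")
# The configured logger can then be passed to the library as shown above:
# app.set_logger(logger)
```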
Additional links
PyPI package: https://pypi.org/project/pdftodict/0.1/