pdftodict

convert pdf to dict


Keywords
pdf
License
MIT
Install
pip install pdftodict==1.0

Documentation

Pdf to Dict

A small library for converting PDF files to Python dicts.

Task

To build a Python library that does the following:

Primary

  • Take in a standardised PDF document

  • Strip out the text from the document

  • Convert the fields and data into a dictionary

  • Provide an output dictionary

Secondary

  • Be able to detect when a non-standardised PDF was submitted and send an alert email

  • Validate that the information captured in the dictionary is accurate

Main goal

The goal of this project is to extract all the information from a Bizfile PDF document and return it as a Python dictionary.

It has the following requirements:

  • Reference the Biz profile samples available at acra.gov.sg/how-to-guides/buying-information/business-profile

  • Create a Python library that works with the latest version of Python, reads the PDF document, and extracts all the fields and their corresponding values.

  • These values should be returned in the format of a dictionary

Description

This library can scrape PDF files that follow the structure of the example documents.

It uses pdfquery (https://pypi.org/project/pdfquery/) to scrape data from PDF files.

This library cannot scrape the following tables:

  • https://prnt.sc/qtbfcd, because its structure is invalid. You will need to fix this in your report generator. Handling it may be possible, for example by making it configurable, but in that case the data cannot be scraped from the PDF automatically;

  • https://prnt.sc/qtbg1d, because its structure is chaotic. Data is scraped according to rules, and no rule can be written for a structure like this.

However, the library can scrape structures like this one:

An example of scraped data: https://prnt.sc/qtbitr. As you can see, each table is scraped into its own structure (a list of dicts whose keys are the table headers). Other data is scraped into its own dict. All the data is grouped.

Handling invalid data is left to you. You can check whether any values are invalid, e.g. contain "AMOUNT" or other unnecessary data, and remove them.
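As a minimal sketch of such a clean-up step, the helper below walks a nested result and drops string values that contain an unwanted marker. The names `clean`, `is_unwanted`, and `UNWANTED_MARKERS`, as well as the sample data, are hypothetical and are not part of pdftodict; the sample only imitates the shape described above (grouped dicts, with tables as lists of dicts).

```python
from typing import Any

# Hypothetical markers for unwanted values; "AMOUNT" is the example from the text.
UNWANTED_MARKERS = ("AMOUNT",)


def is_unwanted(value: Any, markers=UNWANTED_MARKERS) -> bool:
    """Return True when a string value contains one of the unwanted markers."""
    return isinstance(value, str) and any(marker in value for marker in markers)


def clean(data: Any, markers=UNWANTED_MARKERS) -> Any:
    """Return a copy of nested dicts/lists with unwanted string values dropped."""
    if isinstance(data, dict):
        return {
            key: clean(value, markers)
            for key, value in data.items()
            if not is_unwanted(value, markers)
        }
    if isinstance(data, list):
        return [clean(item, markers) for item in data if not is_unwanted(item, markers)]
    return data


# Made-up sample: one group of plain fields and one table (list of dicts).
sample = {
    "Company": {"Name": "EXAMPLE PTE. LTD.", "Stray": "AMOUNT"},
    "Shareholders": [
        {"Name": "A", "Shares": "100"},
        {"Name": "B", "Shares": "AMOUNT"},
    ],
}

cleaned = clean(sample)
print(cleaned)
```

The helper builds a new structure instead of mutating its input, so the raw scrape result stays available for comparison.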

How to use

from pdftodict import PdfScrapper


if __name__ == '__main__':
    pdf_path = 'path/to/pdf'

    app = PdfScrapper()
    # set the path of the PDF to scrape
    app.set_data(pdf_path)

    data = app.scrape_all_data()
    # you can also scrape a single page
    data = app.scrape_all_data(page=1)

    print(data)

You can define a list of invalid field values. When a field in the scraped data contains one of those values, the library displays a message. You can also validate the data yourself via app.validate_scrapping_data(data). The list of invalid fields is empty by default.

from pdftodict import PdfScrapper


if __name__ == '__main__':
    app = PdfScrapper()
    app.set_invalid_fields_data(
        [
            'e.g. invalid field'
        ]
    )

You can enable sending an email when the PDF structure is invalid. The connection is made over SSL.

from pdftodict import PdfScrapper


if __name__ == '__main__':
    app = PdfScrapper()
    app.set_mail_data(
        email_to='',
        email_from='',
        port=465,
        password='',
        host='smtp.gmail.com',
    )
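For orientation, an alert sent over SSL with the Python standard library can look roughly like the sketch below. This is not pdftodict's internal implementation; `build_alert` and `send_alert` are hypothetical helpers, the addresses are placeholders, and the parameter names merely mirror those of set_mail_data.

```python
import smtplib
import ssl
from email.message import EmailMessage


def build_alert(email_from: str, email_to: str, pdf_path: str) -> EmailMessage:
    """Build a plain-text alert message about a PDF with an unexpected structure."""
    msg = EmailMessage()
    msg["From"] = email_from
    msg["To"] = email_to
    msg["Subject"] = "Invalid PDF structure"
    msg.set_content(f"The PDF at {pdf_path} does not match the expected structure.")
    return msg


def send_alert(msg: EmailMessage, host: str, port: int, password: str) -> None:
    """Send the message over an SSL connection (port 465 is the usual SMTPS port)."""
    context = ssl.create_default_context()
    with smtplib.SMTP_SSL(host, port, context=context) as server:
        # Assumption: the sender address doubles as the login name.
        server.login(msg["From"], password)
        server.send_message(msg)


# Building the message needs no network access; send_alert is only defined here.
alert = build_alert("sender@example.com", "ops@example.com", "path/to/pdf")
print(alert["Subject"])
```

Keeping message construction separate from sending makes it possible to inspect or test the alert without contacting a mail server.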

You can set a custom logger:

import logging

from pdftodict import PdfScrapper


if __name__ == '__main__':
    logger = logging.getLogger(__name__)

    app = PdfScrapper()
    app.set_logger(logger)

By default, information is displayed via print.
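Note that a logger obtained from logging.getLogger without any further configuration only emits messages at WARNING level and above, so informational messages would be hidden. The sketch below configures a logger that prints INFO messages to standard output. `make_logger` and the logger name are hypothetical, and which levels the library actually logs at is an assumption.

```python
import logging
import sys


def make_logger(name: str = "pdftodict_example") -> logging.Logger:
    """Return a logger that writes INFO and above to stdout."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    # Attach the handler only once, even if make_logger is called repeatedly.
    if not logger.handlers:
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(
            logging.Formatter("%(asctime)s %(levelname)s %(name)s: %(message)s")
        )
        logger.addHandler(handler)
    return logger


logger = make_logger()
logger.info("logger configured")
```

The resulting logger can then be passed to app.set_logger(logger) as shown in the example above.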

Additional links

PyPI package: https://pypi.org/project/pdftodict/0.1/