pdfextractdata
Release 4.9

Python package for extracting data from PDF files

Keywords: pdf, extract, data, from
License: MIT
Install: pip install pdfextractdata==4.9

Documentation

pdfextractdata

This package extracts tables, table images, and text from PDF files.

Text that belongs to a table is removed from the extracted text and replaced by **table number**.

The main function is pdf_to_pickle; it returns a dictionary with the keys text, tables, and images.

Image data can be displayed with the matplotlib.pyplot package ( plt.imshow() ).

Installation

Using pip

pip3 install pdfextractdata

USAGE

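from pdfextractdata.extract import Extract

path = "example.pdf"  # placeholder: path to the PDF file to process
p = Extract()
dic = p.pdf_to_pickle(path)  # dictionary with keys: text, tables, images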

Function Description

filename_change(path)
If the file type is not PDF, convert the file to PDF.

pdf_to_text(path)
Converts a PDF file to text using the pdfminer package.

process_text(text)

parameter:

text is the output of pdf_to_text.

function:

Preprocesses the extracted text.

cos_sim(sent1, sent2)

parameter:

sent1 is the text extracted from the tables.

sent2 is the original text converted from the PDF file; its length is adjusted to match the length of sent1.

function:

Calculates the cosine similarity of the two sentences.
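
As an illustration of the idea, here is a minimal sketch of cosine similarity between two strings. It assumes a bag-of-words representation with lowercase word tokens; the package does not document how it vectorizes the sentences, so the function name cosine_similarity and the tokenization are assumptions, not the package's own code.

```python
import math
import re
from collections import Counter


def cosine_similarity(sent1: str, sent2: str) -> float:
    """Cosine similarity between two strings using word-count vectors."""
    # Assumption: lowercase word tokens; the package may tokenize differently.
    vec1 = Counter(re.findall(r"\w+", sent1.lower()))
    vec2 = Counter(re.findall(r"\w+", sent2.lower()))

    # Dot product over the words the two sentences share.
    dot = sum(vec1[w] * vec2[w] for w in vec1.keys() & vec2.keys())

    # Euclidean norm of each word-count vector.
    norm1 = math.sqrt(sum(c * c for c in vec1.values()))
    norm2 = math.sqrt(sum(c * c for c in vec2.values()))

    # An empty sentence has no direction, so treat its similarity as 0.
    if norm1 == 0 or norm2 == 0:
        return 0.0
    return dot / (norm1 * norm2)
```

A score close to 1 means the two strings use nearly the same words in the same proportions; a score of 0 means they share no words.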

process_table(tables)

parameter:

tables is the output of the camelot package (tables extracted from the PDF).

function:

Preprocesses the text in the tables.

delete_table_text(text, tables)

parameter:

text is the output of pdf_to_text.

tables is the output of the camelot package (tables extracted from the PDF).

function:

Text that belongs to a table is removed from text and replaced by **table number**; the matching uses the cos_sim(sent1, sent2) function.
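
One way this replacement could work is sketched below: slide a window the same length as the table text over the document text, score each window with cosine similarity, and swap the best match for a marker. The function name replace_table_text, the threshold of 0.8, and the exact marker format are all assumptions for illustration; the package's real matching logic may differ.

```python
import math
from collections import Counter


def _cosine(words_a, words_b):
    """Cosine similarity between two word lists using word counts."""
    va, vb = Counter(words_a), Counter(words_b)
    dot = sum(va[w] * vb[w] for w in va.keys() & vb.keys())
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


def replace_table_text(text, table_texts, threshold=0.8):
    """Replace the span of `text` most similar to each table with a marker.

    Hypothetical helper: `threshold` and the marker format are assumptions.
    """
    words = text.split()
    for number, table_text in enumerate(table_texts, start=1):
        target = table_text.split()
        n = len(target)
        if n == 0 or n > len(words):
            continue
        # Compare the table text against every window of the same length.
        best_score, best_start = 0.0, -1
        for start in range(len(words) - n + 1):
            score = _cosine(words[start:start + n], target)
            if score > best_score:
                best_score, best_start = score, start
        # Only replace when the best window is similar enough.
        if best_score >= threshold:
            words[best_start:best_start + n] = [f"**table {number}**"]
    return " ".join(words)
```

Note that this sketch normalizes whitespace, because it splits the text into words and joins them back with single spaces.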

clean_table(tables)

parameter:

tables is the output of the camelot package (tables extracted from the PDF).

function:

The camelot package alone does not represent merged columns and rows well, so this function preprocesses the tables to handle them.
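
To make the merged-cell problem concrete, here is a small sketch of one common repair: copying a value into the empty cells that a merged cell leaves behind. It assumes a table is a rectangular list of rows of strings and that spanned cells appear as empty strings; the function name fill_spanned_cells and the direction parameter are hypothetical, and the package's clean_table may work differently.

```python
def fill_spanned_cells(rows, direction):
    """Fill empty cells from a neighbor, returning a new table.

    direction "h" copies from the cell to the left (merged columns);
    direction "v" copies from the cell above (merged rows).
    Assumption: empty strings mark cells covered by a merged cell.
    """
    filled = [list(row) for row in rows]  # copy so the input is not mutated
    for r, row in enumerate(filled):
        for c, cell in enumerate(row):
            if cell != "":
                continue
            if direction == "h" and c > 0:
                row[c] = row[c - 1]
            elif direction == "v" and r > 0 and c < len(filled[r - 1]):
                row[c] = filled[r - 1][c]
    return filled


table = [
    ["Region", "Q1", ""],
    ["North", "1", "2"],
    ["", "3", "4"],
]
# Repair merged columns first, then merged rows.
repaired = fill_spanned_cells(fill_spanned_cells(table, "h"), "v")
```

Because the sketch cannot tell a merged cell from a cell that is genuinely empty, it is only suitable when the tables are known to contain no blank values.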

extracting_images(tables)

parameter:

tables is the output of the camelot package (tables extracted from the PDF).

function:

Extracts the table images from the PDF.

pdf_to_pickle(path)

function:

pdf_to_pickle is the main function.

It returns a dictionary (dictionary keys: text, tables, images).

The dictionary is also saved as a pickle file. The saving directory is working directory/pickle/name.pickle
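
The saved file can be read back with Python's standard pickle module. The sketch below shows a round trip for a dictionary with the three documented keys, written to a pickle/ subdirectory; the helper names save_result and load_result and the sample values are hypothetical and are not part of the package.

```python
import pickle
from pathlib import Path


def save_result(result, name, base_dir="."):
    """Write `result` to <base_dir>/pickle/<name>.pickle and return the path."""
    out_dir = Path(base_dir) / "pickle"
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}.pickle"
    with out_path.open("wb") as fh:
        pickle.dump(result, fh)
    return out_path


def load_result(path):
    """Read a previously saved result dictionary back from disk."""
    with Path(path).open("rb") as fh:
        return pickle.load(fh)


# Example with placeholder values for the three documented keys:
# saved = save_result({"text": "...", "tables": [], "images": []}, "report")
# result = load_result(saved)
# result["text"], result["tables"], result["images"]
```

Only load pickle files that come from a trusted source, since unpickling can execute arbitrary code.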

OUTPUT

text

table

image

Dependencies: 1
Dependent packages: 0
Dependent repositories: 0
Total releases: 33
Latest release: Jul 23, 2019
First release: May 21, 2019
Stars: 2
Forks: 0
Watchers: 1
Contributors: 1
Repository size: 50.8 KB
SourceRank: 6

Releases

4.9: Jul 23, 2019
4.8: Jul 22, 2019
4.7: Jul 22, 2019
4.6: Jul 22, 2019
4.5: Jul 22, 2019
4.4: Jul 22, 2019
4.3: Jul 22, 2019
4.2: Jul 22, 2019
4.1: Jul 22, 2019
4.0: Jul 22, 2019

Last synced: 2021-02-19 08:34:42 UTC