TabulaPro: Pro-version of Tabula-py

TabulaPro is a layer on top of the tabula-py library that extracts tables from scanned PDFs and images.

TabulaPro vs Tabula

Coding with TabulaPro is no different from coding with the original Tabula. To make your current tabula-py code TabulaPro-compatible, pass flavor="TabulaPro" or tabulapro=True to read_pdf() to process images or scanned PDFs.

Installation

💡 ProTip: ExtractTable-py is the official library; it is FASTER than this wrapper and has NO software dependencies.

Because the library depends on Tabula, which has its own software dependencies, the developer is expected to install them in order to use the regular Tabula flavors ("stream", "lattice") along with "TabulaPro".

Using pip

After installing software dependencies, you can simply use pip to install TabulaPro:

$ pip install -U TabulaPro

Prerequisites

The developer needs an api_key (free credits here) to use TabulaPro. Each image file or PDF page consumes one credit to trigger the process.
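As a rough illustration of the one-credit-per-page rule, the sketch below estimates how many credits a PDF request would consume from a page selection written in the "1", "1,3-4", "all" notation that read_pdf accepts. The function count_credits and its total_pages argument are hypothetical and not part of TabulaPro, and the sketch assumes a page listed twice is charged only once.

```python
def count_credits(pages, total_pages=None):
    """Estimate credits for a PDF, assuming one credit per selected page.

    Hypothetical helper, not part of TabulaPro. `total_pages` is only
    needed when pages == "all".
    """
    if pages.strip().lower() == "all":
        if total_pages is None:
            raise ValueError('total_pages is required when pages="all"')
        return total_pages

    selected = set()  # assumption: a repeated page is counted once
    for part in pages.split(","):
        part = part.strip()
        if "-" in part:
            # A range such as "3-4" selects both ends inclusively
            start, end = (int(n) for n in part.split("-", 1))
            selected.update(range(start, end + 1))
        else:
            selected.add(int(part))
    return len(selected)


print(count_credits("1,3-4"))  # pages 1, 3 and 4
```

An image file is a single request, so it would count as one credit under the same rule.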

api_key should be passed through pro_kwargs, a dict argument that accepts api_key, job_id, dup_check, wait_for_output as keys and can be used as below:

{
    "api_key": str,
        Mandatory, to trigger the "TabulaPro" flavor and process scanned PDFs and images, as well as text PDF files

    "job_id": str,
        Optional, if processing a new file
        Mandatory, to retrieve the result of an already submitted file

    "dup_check": bool, default: False - to bypass the duplicate check
        Useful to handle duplicate requests; the check is based on the file name

    "max_wait_time": int, default: 300
        Checks for the output every 15 seconds until successfully processed or for a maximum of 300 seconds.
}
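A minimal sketch of how such a dict might be assembled for the two cases described above: submitting a new file, and retrieving the result of an earlier submission. The helper build_pro_kwargs and the placeholder values are hypothetical; only the key names come from the description above.

```python
def build_pro_kwargs(api_key, job_id=None, dup_check=False, max_wait_time=300):
    """Assemble a pro_kwargs dict (hypothetical helper, not part of TabulaPro)."""
    if not api_key:
        raise ValueError("api_key is mandatory")
    pro_kwargs = {
        "api_key": api_key,
        "dup_check": dup_check,
        "max_wait_time": max_wait_time,
    }
    # job_id is only included when retrieving an already submitted file
    if job_id is not None:
        pro_kwargs["job_id"] = job_id
    return pro_kwargs


# New file: no job_id needed
new_file_kwargs = build_pro_kwargs("YOUR_API_KEY_HERE")

# Already submitted file: pass the job_id from the earlier submission
retrieve_kwargs = build_pro_kwargs("YOUR_API_KEY_HERE", job_id="PREVIOUS_JOB_ID")

print(new_file_kwargs)
print(retrieve_kwargs)
```

Either dict would then be passed as the pro_kwargs argument of read_pdf.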

Let's code

Quickly validate the API key and see the number of credits attached to it

api_key = "YOUR_API_KEY_HERE"

from tabula_pro import check_usage
print(check_usage(api_key))

If the snippet above runs without an error, the API key is valid.

Here's how you can extract tables from Image files.

The example image (tabula-data-page-1.PNG) used in the code below can be found here. Note that tabula-data-page-1.PNG is the image version of the first page of Tabula's PDF example, data.pdf.

from tabula_pro import read_pdf
pro_tables = read_pdf(
    'foo-image.jpg', 
    flavor="tabulapro", 
    pro_kwargs={"api_key": api_key}
)

# To process PDF, make use of pages ("1", "1,3-4", "all") params in the read_pdf function
# pro_tables = read_pdf('foo-image.PDF', flavor="tabulaPro", pages="1,3-4", pro_kwargs={'api_key': api_key})

pro_tables is a list of the dataframes found in the file.

pro_tables[0]

mpg	cyl	disp	hp	drat	wt	gsec	VS	am	gear	carb
Mazda RX4	21.0	6	160.0	110	3.90	2.620	16.46	0	1	4	4
Mazda RX4 Wag	21.0	6	160.0	110	3.90	2.875	17.02	0	1	4	4
Datsun 710	22.8	4	108.0	93	3.85	2.320	18.61	1	1	4	1
Hornet 4 Drive	21.4	6	258.0	110	3.08	3.215	19.44	L	0	3	L
Hornet Sportabout	18.7	8	360.0	175	3.15	3.440	17.02	0	0	3	2
Valiant	18.1	6	225.0	105	2.76	3.460	20.22	1	0	3	1
Duster 360	14.3	8	360.0	245	3.21	3.570	15.84	0	0	3	4
Mere 240D	24.4	4	146.7	62	3.69	3.190	20.00	1	0	4	2
Mere 230	22.8	4	140.8	95	3.92	3.150	22.90	1	0	4	2
Merc 280	19.2	6	167.6	123	3.92	3.440	18.30	1	0	4	4
Merc 280C	17.8	6	167.6	123	3.92	3.440	18.90	1	0	4	4
Mere 450SE	16.4	8	275.8	180	3.07	4.070	17.40	0	0	3	3
Merc 450SL	17.3	8	275.8	180	3.07	3.730	17.60	0	0	3	3
Merc 450SLC	15.2	8	275.8	180	3.07	3.780	18.00	0	0	3	3
Cadillac Fleetwood	10.4	8	472.0	205	2.93	5.250	17.98	0	0	3	4
Lincoln Continental	10.4	8	460.0	215	3.00	5.424	17.82	0	0	3	4
Chrysler Imperial	14.7	8	440.0	230	3.23	5.345	17.42	0	0	3	4
Fiat 128	32.4	4	78.7	66	4.08	2.200	19.47	1	1	4	1
Honda Civic	30.4	4	75.7	52	4.93	1.615	18.52	1	1	4	2
Toyota Corolla	33.9	4	71.1	65	4.22	1.835	19.90	L	1	4	L
Toyota Corona	21.5	4	120.1	97	3.70	2.465	20.01	1	0	3	1
Dodge Challenger	15.5	8	318.0	150	2.76	3.520	16.87	0	0	3	2
AMC Javelin	15.2	8	304.0	150	3.15	3.435	17.30	0	0	3	2
Camaro 728	13.3	8	350.0	245	3.73	3.840	15.41	0	0	3	4
Pontiac Firebird	19.2	8	400.0	175	3.08	3.845	17.05	0	0	3	2
Fiat X1-9	27.3	4	79.0	66	4.08	1.935	18.90	1	1	4	1
Porsche 914-2	26.0	4	120.3	91	4.43	2.140	16.70	0	1	5	2
Lotus Europa	30.4	4	95.1	113	3.77	1.513	16.90	L	1	5	2
Ford Pantera L	15.8	8	351.0	264	4.22	3.170	14.50	0	1	5	4
Ferrari Dino	19.7	6	145.0	175	3.62	2.770	15.50	0	1	5	6
Maserati Bora	15.0	8	301.0	335	3.54	3.570	14.60	0	1	5	8
Volyo 142F	21.4	4	121.0	109	4.11	2.780	18.60	1	1	4	2

Most image files are processed in under 5 seconds. A blurry, large, or poor-quality image may occasionally take up to 15 seconds, and the time for a PDF file depends on its page count. In these cases, the process checks the job status every 15 seconds, for a maximum of 300 seconds, until the job ends successfully and a final response is returned.
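The waiting behaviour described above can be pictured as a simple polling loop. The following is a sketch under stated assumptions, not TabulaPro's internal code: wait_for_result and check_status are hypothetical names, and check_status stands in for whatever request reports the job status, returning None while the job is still running.

```python
import time


def wait_for_result(check_status, max_wait_time=300, interval=15, sleep=time.sleep):
    """Poll check_status() until it returns a result or max_wait_time elapses.

    Hypothetical sketch; check_status returns None while the job is pending.
    """
    waited = 0
    while True:
        result = check_status()
        if result is not None:
            return result
        if waited >= max_wait_time:
            raise TimeoutError(f"no result after {waited} seconds")
        sleep(interval)
        waited += interval


# Demo with a fake job that finishes on the third check.
# sleep is stubbed out so the demo does not actually wait.
attempts = iter([None, None, "tables ready"])
print(wait_for_result(lambda: next(attempts), sleep=lambda seconds: None))
```

Passing the sleep function as a parameter keeps the loop easy to exercise without real delays.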

ProTip: To have more control over the process wait time, check out ExtractTable-py.

Pull Requests & Rewards

Pull requests are most welcome and are rewarded with API credits.

License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.

Credits

Last but not least, we want to thank the contributors of tabula-py.
