TabulaPro: Pro-version of Tabula-py
TabulaPro is a layer on top of the tabula-py library that extracts tables from scanned PDFs and images.
TabulaPro vs Tabula
TabulaPro is used in the same way as the original Tabula. To make your current tabula-py code TabulaPro-compatible, pass flavor="TabulaPro" or tabulapro=True to read_pdf() to process images or scanned PDFs.
Installation
💡 ProTip: ExtractTable-py is the official library; it is FASTER than this wrapper and has NO software dependencies.
Because this library depends on Tabula, which has its own software dependencies, the developer is expected to install them in order to use the regular Tabula flavors ("stream", "lattice") alongside "TabulaPro".
Using pip
After installing software dependencies, you can simply use pip to install TabulaPro:
$ pip install -U TabulaPro
Prerequisites
The developer needs an api_key (free credits here) to use TabulaPro. Processing consumes one credit per image file or per PDF page.
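The one-credit-per-image, one-credit-per-page rule can be turned into a rough cost estimate before submitting a file. The sketch below is not part of TabulaPro: `count_pages` and `estimate_credits` are hypothetical helper names, the list of image extensions is an assumption, and counting each distinct page once is an assumption about billing. The page specification format ("1", "1,3-4") is the one accepted by the pages parameter of read_pdf.

```python
from typing import Optional

# Assumption: these extensions are treated as single-image inputs.
IMAGE_EXTENSIONS = (".png", ".jpg", ".jpeg")


def count_pages(pages: str) -> int:
    """Count the distinct pages named by a spec such as "1" or "1,3-4"."""
    if pages.strip().lower() == "all":
        # The page count of the file is unknown without opening it.
        raise ValueError('"all" cannot be estimated without the page count')
    selected = set()
    for part in pages.split(","):
        part = part.strip()
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            if start > end:
                raise ValueError(f"invalid page range: {part!r}")
            selected.update(range(start, end + 1))
        else:
            selected.add(int(part))
    return len(selected)


def estimate_credits(path: str, pages: Optional[str] = None) -> int:
    """Estimate credits: one per image file, one per requested PDF page."""
    if path.lower().endswith(IMAGE_EXTENSIONS):
        return 1
    if pages is None:
        raise ValueError("a pages spec is required to estimate a PDF")
    return count_pages(pages)
```

For example, `estimate_credits("foo-image.PDF", "1,3-4")` counts pages 1, 3, and 4, so under these assumptions the request would use three credits.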
The api_key should be passed through pro_kwargs, a dict-type argument that accepts api_key, job_id, dup_check, and wait_for_output as keys. It can be used as shown below:
{
    "api_key": str,
        Mandatory; triggers the "TabulaPro" flavor to process scanned PDFs and images, as well as text PDF files
    "job_id": str,
        Optional when processing a new file
        Mandatory when retrieving the result of an already submitted file
    "dup_check": bool, default: False - to bypass the duplicate check
        Useful for handling duplicate requests; the check is based on the file name
    "max_wait_time": int, default: 300
        Checks for the output every 15 seconds until the file is successfully processed, or for a maximum of 300 seconds
}
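As a minimal sketch of how such a dictionary might be assembled, the helper below builds pro_kwargs from the keys and defaults in the listing above. `build_pro_kwargs` is a hypothetical name and is not part of TabulaPro; the validation it performs is an assumption based only on the "Mandatory" and "optional" notes in the listing.

```python
from typing import Any, Dict, Optional


def build_pro_kwargs(
    api_key: str,
    job_id: Optional[str] = None,
    dup_check: bool = False,
    max_wait_time: int = 300,
) -> Dict[str, Any]:
    """Assemble a pro_kwargs dict (hypothetical helper, not a library API)."""
    if not api_key:
        # api_key is listed as mandatory for the "TabulaPro" flavor.
        raise ValueError("api_key is mandatory")
    kwargs: Dict[str, Any] = {
        "api_key": api_key,
        "dup_check": dup_check,
        "max_wait_time": max_wait_time,
    }
    # job_id is only needed to retrieve an already submitted file.
    if job_id is not None:
        kwargs["job_id"] = job_id
    return kwargs
```

The result can then be passed as `pro_kwargs=build_pro_kwargs(api_key)` to read_pdf. Leaving job_id out for new files keeps the dictionary limited to the keys that apply to the request.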
Let's code
Quickly validate the API key and see the number of credits attached to it
from tabula_pro import check_usage
api_key = "YOUR_API_KEY_HERE"  # replace with your own API key
print(check_usage(api_key))
If the snippet above runs without an error, the API key is valid.
Here's how you can extract tables from Image files.
The example image (tabula-data-page-1.PNG) used in the code below can be found here. Note that tabula-data-page-1.PNG is the image version of the first page of Tabula's example PDF, data.pdf.
from tabula_pro import read_pdf
pro_tables = read_pdf(
    'foo-image.jpg',
    flavor="tabulapro",
    pro_kwargs={"api_key": api_key}
)
# To process a PDF, use the pages parameter ("1", "1,3-4", "all") of the read_pdf function
# pro_tables = read_pdf('foo-image.PDF', flavor="tabulaPro", pages="1,3-4", pro_kwargs={'api_key': api_key})
pro_tables is a list of the dataframes found in the file.
pro_tables[0]
| | mpg | cyl | disp | hp | drat | wt | gsec | VS | am | gear | carb |
---|---|---|---|---|---|---|---|---|---|---|---|
Mazda RX4 | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.620 | 16.46 | 0 | 1 | 4 | 4 |
Mazda RX4 Wag | 21.0 | 6 | 160.0 | 110 | 3.90 | 2.875 | 17.02 | 0 | 1 | 4 | 4 |
Datsun 710 | 22.8 | 4 | 108.0 | 93 | 3.85 | 2.320 | 18.61 | 1 | 1 | 4 | 1 |
Hornet 4 Drive | 21.4 | 6 | 258.0 | 110 | 3.08 | 3.215 | 19.44 | L | 0 | 3 | L |
Hornet Sportabout | 18.7 | 8 | 360.0 | 175 | 3.15 | 3.440 | 17.02 | 0 | 0 | 3 | 2 |
Valiant | 18.1 | 6 | 225.0 | 105 | 2.76 | 3.460 | 20.22 | 1 | 0 | 3 | 1 |
Duster 360 | 14.3 | 8 | 360.0 | 245 | 3.21 | 3.570 | 15.84 | 0 | 0 | 3 | 4 |
Mere 240D | 24.4 | 4 | 146.7 | 62 | 3.69 | 3.190 | 20.00 | 1 | 0 | 4 | 2 |
Mere 230 | 22.8 | 4 | 140.8 | 95 | 3.92 | 3.150 | 22.90 | 1 | 0 | 4 | 2 |
Merc 280 | 19.2 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.30 | 1 | 0 | 4 | 4 |
Merc 280C | 17.8 | 6 | 167.6 | 123 | 3.92 | 3.440 | 18.90 | 1 | 0 | 4 | 4 |
Mere 450SE | 16.4 | 8 | 275.8 | 180 | 3.07 | 4.070 | 17.40 | 0 | 0 | 3 | 3 |
Merc 450SL | 17.3 | 8 | 275.8 | 180 | 3.07 | 3.730 | 17.60 | 0 | 0 | 3 | 3 |
Merc 450SLC | 15.2 | 8 | 275.8 | 180 | 3.07 | 3.780 | 18.00 | 0 | 0 | 3 | 3 |
Cadillac Fleetwood | 10.4 | 8 | 472.0 | 205 | 2.93 | 5.250 | 17.98 | 0 | 0 | 3 | 4 |
Lincoln Continental | 10.4 | 8 | 460.0 | 215 | 3.00 | 5.424 | 17.82 | 0 | 0 | 3 | 4 |
Chrysler Imperial | 14.7 | 8 | 440.0 | 230 | 3.23 | 5.345 | 17.42 | 0 | 0 | 3 | 4 |
Fiat 128 | 32.4 | 4 | 78.7 | 66 | 4.08 | 2.200 | 19.47 | 1 | 1 | 4 | 1 |
Honda Civic | 30.4 | 4 | 75.7 | 52 | 4.93 | 1.615 | 18.52 | 1 | 1 | 4 | 2 |
Toyota Corolla | 33.9 | 4 | 71.1 | 65 | 4.22 | 1.835 | 19.90 | L | 1 | 4 | L |
Toyota Corona | 21.5 | 4 | 120.1 | 97 | 3.70 | 2.465 | 20.01 | 1 | 0 | 3 | 1 |
Dodge Challenger | 15.5 | 8 | 318.0 | 150 | 2.76 | 3.520 | 16.87 | 0 | 0 | 3 | 2 |
AMC Javelin | 15.2 | 8 | 304.0 | 150 | 3.15 | 3.435 | 17.30 | 0 | 0 | 3 | 2 |
Camaro 728 | 13.3 | 8 | 350.0 | 245 | 3.73 | 3.840 | 15.41 | 0 | 0 | 3 | 4 |
Pontiac Firebird | 19.2 | 8 | 400.0 | 175 | 3.08 | 3.845 | 17.05 | 0 | 0 | 3 | 2 |
Fiat X1-9 | 27.3 | 4 | 79.0 | 66 | 4.08 | 1.935 | 18.90 | 1 | 1 | 4 | 1 |
Porsche 914-2 | 26.0 | 4 | 120.3 | 91 | 4.43 | 2.140 | 16.70 | 0 | 1 | 5 | 2 |
Lotus Europa | 30.4 | 4 | 95.1 | 113 | 3.77 | 1.513 | 16.90 | L | 1 | 5 | 2 |
Ford Pantera L | 15.8 | 8 | 351.0 | 264 | 4.22 | 3.170 | 14.50 | 0 | 1 | 5 | 4 |
Ferrari Dino | 19.7 | 6 | 145.0 | 175 | 3.62 | 2.770 | 15.50 | 0 | 1 | 5 | 6 |
Maserati Bora | 15.0 | 8 | 301.0 | 335 | 3.54 | 3.570 | 14.60 | 0 | 1 | 5 | 8 |
Volyo 142F | 21.4 | 4 | 121.0 | 109 | 4.11 | 2.780 | 18.60 | 1 | 1 | 4 | 2 |
Most image files are processed in under 5 seconds. A blurry, large, or poor-quality image may take up to 15 seconds, and the processing time for a PDF file depends on its page count. In these cases, the process checks the job status every 15 seconds, for a maximum of 300 seconds, until the job finishes successfully and the final response is returned.
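The waiting behaviour described above can be pictured as a simple polling loop. The sketch below is a generic illustration, not TabulaPro's actual implementation: `wait_for_result` and the `fetch` callback are hypothetical names, and only the 15-second interval and 300-second ceiling come from the description above.

```python
import time
from typing import Any, Callable, Optional


def wait_for_result(
    fetch: Callable[[], Optional[Any]],
    max_wait_time: int = 300,
    interval: int = 15,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Poll fetch() every `interval` seconds until it returns a result.

    `fetch` is a hypothetical callback that returns None while the job
    is still being processed and the final response once it is done.
    """
    waited = 0
    while True:
        result = fetch()
        if result is not None:
            return result
        if waited >= max_wait_time:
            raise TimeoutError(f"no result after {waited} seconds")
        sleep(interval)
        waited += interval
```

Accepting `sleep` as a parameter is a design choice that lets the loop be exercised in tests without real delays.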
ProTip: For more control over the process wait time, check out ExtractTable-py.
Pull Requests & Rewards
Pull requests are most welcome and are rewarded with API credits.
License
This project is licensed under the Apache License 2.0, see the LICENSE file for details.
Credits
Last but not least, we would like to thank the contributors of tabula-py.
Social Media
Follow us on social media for library updates and free credits.