Extract is a "document understanding" cloud API.
It extracts text from PDF documents in a smart way: it detects pages and, inside them, headings, body-level text, tables, headers and footers. It returns the text blocks of each page in the order in which a human would read them, which helps to create a stream of text that gives better results when applying Natural Language Processing (NLP).
Knowing where text occurs, e.g. inside a table cell, also improves the quality of information extraction tasks because they can work on a more accurate scope.


To install the client library with pip:

pip install expertai-extract

To install using conda:

conda install -c conda-forge expertai-extract


Subscription and credentials

Extract is currently in its Beta testing phase, so you have to get in touch with the provider, describe your use case and ask to participate in the test program.
If your request is accepted, you will be told what to do (you'll have to subscribe to the Extract Beta plan from inside the developer portal) and you can then use Extract for free during the test phase.

This Python client needs to know your developer account credentials, so you'll have to set these two environment variables:


Create the client

To use this client in your code, import the ExtractClient class:

from expertai.extract.extract_client import ExtractClient

Then create an instance of the client object:

extractClient = ExtractClient()

You can then invoke the object's methods to use the Extract API.



Use the layout_document_async() method to analyze a document.
The method corresponds to the layout-document-async API resource and starts an asynchronous layout recognition task. It returns the ID of that task.

There are two possible syntaxes:

layout_document_async(file_path=filePath, file_name=fileName)


layout_document_async(file=base64, file_name=fileName)

filePath is the path of the PDF file (including the file name) and base64 is the Base64 encoding of the PDF file.

fileName is only the name (not the path) of the PDF file. If you use the first syntax, you are free to set fileName to a different value than the name of the file specified in filePath, since fileName is more of a "document name", but if you don't have any special reason for doing so, use the same value.
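
For the second syntax you need the Base64 encoding of the PDF file. A minimal sketch of how it could be produced with the standard library follows. The helper name encode_pdf_base64 is hypothetical, and returning a string (rather than bytes) is an assumption about what the file parameter accepts.

```python
import base64


def encode_pdf_base64(pdf_path):
    """Return the Base64 encoding of the file's bytes as an ASCII string."""
    with open(pdf_path, "rb") as pdf_file:
        return base64.b64encode(pdf_file.read()).decode("ascii")


# Possible use with the client (not executed here, requires credentials):
# extractClient.layout_document_async(
#     file=encode_pdf_base64("test/resources/test.pdf"), file_name="test.pdf")
```

Reading the whole file into memory is acceptable here because the size of the PDF files that Extract accepts is limited.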

Be aware of the Extract limits: the PDF file can be at most 10 MB in size and the document can have at most 500 pages.
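
The size limit can be checked locally before a file is submitted. The sketch below is only an illustration: the names MAX_PDF_BYTES and is_within_size_limit are hypothetical, and it assumes that 10 MB means 10 * 1024 * 1024 bytes, which this document does not state. Counting pages would require a PDF library and is left out.

```python
import os

# Assumption: "10 MB" is interpreted as 10 MiB; the exact threshold may differ.
MAX_PDF_BYTES = 10 * 1024 * 1024


def is_within_size_limit(pdf_path, max_bytes=MAX_PDF_BYTES):
    """Return True if the file at pdf_path is no larger than max_bytes."""
    return os.path.getsize(pdf_path) <= max_bytes
```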

The method returns a dictionary containing an item with key task_id, whose value (a string) is the ID of the layout recognition task. For example:

from expertai.extract.extract_client import ExtractClient

extractClient = ExtractClient()

layoutRecognitionTask = extractClient.layout_document_async(file_path="test/resources/test.pdf", file_name="test.pdf")
taskId = layoutRecognitionTask["task_id"]

You then have to call the status() method to know about the progress of that task and also to get results when the task is complete.


Use the status() method to know about the progress of a layout recognition task that was started with the layout_document_async() method and to get results when the task is complete. It corresponds to the status API resource.

The syntax is:

status(taskID)

where taskID is the ID of the layout recognition task returned by the layout_document_async() method.

The method returns an object with these properties:

  • current (int): percentage of completion of the task
  • message (str): phase of the task, for example "page conversion", "classification"
  • result (dict): results (when the task is complete)
  • state (str): task status, for example "PROGRESS", "SUCCESS"

If current is 100, the task is finished and result contains the results.
The structure of the result dictionary reflects that of the JSON object returned by the status resource of the API.
Refer to the API documentation or use the Swagger UI in the developer portal to learn about the result structure.
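
Since status() has to be called repeatedly until the task ends, the polling can be wrapped in a small helper. This is a sketch, not part of the client: wait_for_task, poll_interval and timeout are hypothetical names, and it assumes that "SUCCESS" and "FAILURE" are the only final values of state, as the examples in this document suggest.

```python
import time


def wait_for_task(client, task_id, poll_interval=1.0, timeout=300.0):
    """Poll client.status(task_id) until the task ends or the timeout expires.

    Returns the last status object. Raises TimeoutError if the task is
    still running when the timeout (in seconds) has elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        status = client.status(task_id)
        # Assumption: these two states are the only final ones.
        if status.state in ("SUCCESS", "FAILURE"):
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Task {task_id} did not finish within {timeout} seconds")
        time.sleep(poll_interval)
```

With an ExtractClient instance, wait_for_task(extractClient, taskId) would return the final status, whose state should still be checked before result is read.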


Basic usage

This example shows the basic usage of the client to start a recognition task, wait until it's finished and then print results.

import time
from expertai.extract.extract_client import ExtractClient

extractClient = ExtractClient()

layoutRecognitionTask = extractClient.layout_document_async(file_path="test/resources/test.pdf", file_name="test.pdf")
taskId = layoutRecognitionTask["task_id"]

status = extractClient.status(taskId)

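# Poll until the task ends, pausing between requests.
while status.state != "SUCCESS" and status.state != "FAILURE":
    print("Status: " + status.state + " ( " + str(status.current) + "% )")
    time.sleep(1)
    status = extractClient.status(taskId)

# Print the results (a dictionary) once the task is over.
print(status.result)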


Printing titles

This example extends the previous one to show how to print all the headings of the document.

import time
from expertai.extract.extract_client import ExtractClient

extractClient = ExtractClient()

layoutRecognitionTask = extractClient.layout_document_async(file_path="test/resources/test.pdf", file_name="test.pdf")
taskId = layoutRecognitionTask["task_id"]

status = extractClient.status(taskId)

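# Poll until the task ends, pausing between requests.
while status.state != "SUCCESS" and status.state != "FAILURE":
    print("Status: " + status.state + " ( " + str(status.current) + "% )")
    time.sleep(1)
    status = extractClient.status(taskId)

# Print every layout item of type "title".
for layoutItem in status.result["layout"]:
    if layoutItem["type"] == "title":
        # Print the whole item; see the API documentation for its structure.
        print(layoutItem)
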
Decoding and printing words

This example extends the first one to show how to decode and print the items of the words list.
Each item in that list contains all the words of a page, regardless of the type of block they belong to, together with their bounding boxes. Use words instead of layout items when you simply need all the text of a page (or document) in the correct reading order.
To make the API output as compact as possible, words are returned compressed and Base64-encoded. Refer to the documentation to learn more about this representation.

import time
import base64
import gzip
from expertai.extract.extract_client import ExtractClient

extractClient = ExtractClient()

layoutRecognitionTask = extractClient.layout_document_async(file_path="test/resources/test.pdf", file_name="test.pdf")
taskId = layoutRecognitionTask["task_id"]

status = extractClient.status(taskId)

# Poll until the task ends, pausing between requests.
while status.state != "SUCCESS" and status.state != "FAILURE":
    print("Status: " + status.state + " ( " + str(status.current) + "% )")
    time.sleep(1)
    status = extractClient.status(taskId)

words = status.result["words"]

for item in words:
    # Each item is Base64-encoded, gzip-compressed binary data.
    decoded = gzip.decompress(base64.standard_b64decode(item))
    index = 0
    while index < len(decoded):
        # The word text is UTF-8, terminated by a zero byte.
        index_old = index
        index = decoded.index(b'\x00', index)
        text = decoded[index_old:index].decode('utf-8')
        index += 1  # skip the zero byte
        # 4-byte little-endian integer that follows the text.
        index_elem = int.from_bytes(decoded[index:index + 4], 'little')
        index += 4
        # Bounding box: four 4-byte little-endian integers.
        bbox = [int.from_bytes(decoded[i:i + 4], 'little')
                for i in range(index, index + 16, 4)]
        index += 16
        print(text, end=" ")

        # skip the 4 zero bytes that close the record
        index += 4
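
The decoding loop above can also be packaged as a reusable function that returns the words instead of printing them. The sketch below uses the struct module and relies only on the record layout that the loop implies: zero-terminated UTF-8 text, one 32-bit little-endian integer, four 32-bit little-endian integers for the bounding box, and 4 trailing bytes that are skipped. The name decode_words_item is hypothetical and the meaning of the individual integers is not described here, so check the API documentation before relying on them.

```python
import base64
import gzip
import struct

# Fixed-size part of a record that follows the zero-terminated text:
# one unsigned 32-bit integer, four unsigned 32-bit integers (bounding box)
# and 4 padding bytes, all little-endian. Layout inferred from the example loop.
_RECORD_TAIL = struct.Struct("<I4I4x")


def decode_words_item(item):
    """Decode one entry of the words list into (text, index, bbox) tuples."""
    data = gzip.decompress(base64.standard_b64decode(item))
    words = []
    pos = 0
    while pos < len(data):
        end = data.index(b"\x00", pos)  # text ends at the next zero byte
        text = data[pos:end].decode("utf-8")
        elem_index, *bbox = _RECORD_TAIL.unpack_from(data, end + 1)
        words.append((text, elem_index, bbox))
        pos = end + 1 + _RECORD_TAIL.size
    return words
```

Joining the first element of each tuple with spaces gives the text of the page in reading order, as printed by the example above.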