s3-ocr
Tools for running OCR against files stored in S3
Installation
Install this tool using pip
:
pip install s3-ocr
Starting OCR against PDFs in a bucket
The start
command takes a list of keys and submits them to Textract for OCR processing.
You need to have AWS configured using environment variables or a credentials file in your home directory.
You can start the process running like this:
s3-ocr start name-of-your-bucket my-pdf-file.pdf
The paths you specify should be paths within the bucket. If you stored your PDF files in folders inside the bucket it should look like this:
s3-ocr start name-of-your-bucket path/to/one.pdf path/to/two.pdf
OCR can take some time. The results of the OCR will be stored in textract-output
in your bucket.
To process every file in the bucket with a .pdf
extension use --all
:
s3-ocr start name-of-bucket --all
s3-ocr start --help
Usage: s3-ocr start [OPTIONS] BUCKET [KEYS]...
Start OCR tasks for PDF files in an S3 bucket
s3-ocr start name-of-bucket path/to/one.pdf path/to/two.pdf
To process every file with a .pdf extension:
s3-ocr start name-of-bucket --all
Options:
--all Process all PDF files in the bucket
--access-key TEXT AWS access key ID
--secret-key TEXT AWS secret access key
--session-token TEXT AWS session token
--endpoint-url TEXT Custom endpoint URL
-a, --auth FILENAME Path to JSON/INI file containing credentials
--help Show this message and exit.
Checking status
The s3-ocr status <bucket-name>
command shows a rough indication of progress through the tasks:
% s3-ocr status sfms-history
153 complete out of 532 jobs
It compares the jobs that have been submitted, based on .s3-ocr.json
files, to the jobs that have their results written to the textract-output/
folder.
s3-ocr status --help
Usage: s3-ocr status [OPTIONS] BUCKET
Show status of OCR jobs for a bucket
Options:
--access-key ...
Fetching the results
Once an OCR job has completed you can download the resulting JSON using the fetch
command:
s3-ocr fetch name-of-bucket path/to/file.pdf
This will save files in the current directory with names like this:
4d9b5cf580e761fdb16fd24edce14737ebc562972526ef6617554adfa11d6038-1.json
4d9b5cf580e761fdb16fd24edce14737ebc562972526ef6617554adfa11d6038-2.json
The number of files will vary depending on the length of the document.
If you don't want separate files you can combine them together using the -c/--combine
option:
s3-ocr fetch name-of-bucket path/to/file.pdf --combine output.json
The output.json
file will then contain data that looks something like this:
{
"Blocks": [
{
"BlockType": "PAGE",
"Geometry": {...}
"Page": 1,
...
},
{
"BlockType": "LINE",
"Page": 1,
...
"Text": "Barry",
},
s3-ocr fetch --help
Usage: s3-ocr fetch [OPTIONS] BUCKET KEY
Fetch the OCR results for a specified file
s3-ocr fetch name-of-bucket path/to/key.pdf
This will save files in the current directory called things like
a806e67e504fc15f...48314e-1.json a806e67e504fc15f...48314e-2.json
To combine these together into a single JSON file with a specified name, use:
s3-ocr fetch name-of-bucket path/to/key.pdf --combine output.json
Use "--output -" to print the combined JSON to standard output instead.
Options:
-c, --combine FILENAME Write combined JSON to file
--access-key ...
Fetching just the text of a page
If you don't want to deal with the JSON directly, you can use the text
command to retrieve just the text extracted from a PDF:
s3-ocr text name-of-bucket path/to/file.pdf
This will output plain text to standard output.
To save that to a file, use this:
s3-ocr text name-of-bucket path/to/file.pdf > text.txt
Separate pages will be separated by three newlines. To separate them using a ----
horizontal divider instead add --divider
:
s3-ocr text name-of-bucket path/to/file.pdf --divider
s3-ocr text --help
Usage: s3-ocr text [OPTIONS] BUCKET KEY
Retrieve the text from an OCRd PDF file
s3-ocr text name-of-bucket path/to/key.pdf
Options:
--divider Add ---- between pages
--access-key ...
Changes made to your bucket
To keep track of which files have been submitted for processing, s3-ocr
will create a JSON file for every file that it adds to the OCR queue.
This file will be called:
path-to-file/name-of-file.pdf.s3-ocr.json
Each of these JSON files contains data that looks like this:
{
"job_id": "a34eb4e8dc7e70aa9668f7272aa403e85997364199a654422340bc5ada43affe",
"etag": "\"b0c77472e15500347ebf46032a454e8e\""
}
The recorded job_id
can be used later to associate the file with the results of the OCR task in textract-output/
.
The etag
is the ETag of the S3 object at the time it was submitted. This can be used later to determine if a file has changed since it last had OCR run against it.
This design for the tool, with the .s3-ocr.json
files tracking jobs that have been submitted, means that it is safe to run s3-ocr start
against the same bucket multiple times without the risk of starting duplicate OCR jobs.
Creating a SQLite index of your OCR results
The s3-ocr index <bucket> <database_file>
command creates a SQLite database contaning the results of the OCR, and configure SQLite full-text search for the text:
% s3-ocr index sfms-history index.db
Fetching job details [####################################] 100%
Populating pages table [####################----------------] 55% 00:03:18
The schema of the resulting database looks like this (excluding the FTS tables):
CREATE TABLE [pages] (
[path] TEXT,
[page] INTEGER,
[folder] TEXT,
[text] TEXT,
PRIMARY KEY ([path], [page])
);
CREATE TABLE [ocr_jobs] (
[key] TEXT PRIMARY KEY,
[job_id] TEXT,
[etag] TEXT,
[s3_ocr_etag] TEXT
);
CREATE TABLE [fetched_jobs] (
[job_id] TEXT PRIMARY KEY
);
The database is designed to be used with Datasette.
s3-ocr index --help
Usage: s3-ocr index [OPTIONS] BUCKET DATABASE
Create a SQLite database with OCR results for files in a bucket
Options:
--access-key ...
Development
To contribute to this tool, first checkout the code. Then create a new virtual environment:
cd s3-ocr
python -m venv venv
source venv/bin/activate
Now install the dependencies and test dependencies:
pip install -e '.[test]'
To run the tests:
pytest
To regenerate the README file with the latest --help
:
cog -r README.md