Welcome to Open eDiscovery
Open eDiscovery helps you search and explore large document collections
Overview
Open eDiscovery is a Python package that facilitates searching and exploring large collections of files. It can function both as a programmatic API and as an out-of-the-box, production-ready Web application. Powered by well-tested open-source tools such as Elasticsearch (for the back-end data repository), Streamlit (for the front-end Web-based user interface), and ktrain (for machine learning and data analytics), Open eDiscovery has applications in legal e-Discovery, digital forensics, scientific literature reviews, and any task that involves making sense of a collection of documents. In addition to standard full-text search and document tagging, Open eDiscovery includes out-of-the-box support for AI/ML capabilities such as end-to-end question-answering, theme discovery, and keyphrase extraction.
How to Install
1. Install eDiscovery from PyPI along with APT dependencies
pip install ediscovery
python -m ediscovery.install_system_dependencies
The install_system_dependencies command downloads and installs Ubuntu APT dependencies, which are needed for better preprocessing of documents (e.g., extracting RAR/PST archives, detecting file types accurately, and performing optical character recognition).
Open eDiscovery has been tested on ubuntu-latest (currently Ubuntu 20.04), which is the recommended environment. If you are on Microsoft Windows, we recommend a virtualized Ubuntu environment such as the Windows Subsystem for Linux (WSL).
2. Download and start an Elasticsearch instance
If you don't already have an Elasticsearch instance running, you can download and start one easily:
# Example using Elasticsearch 7.10.1
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
elasticsearch-7.10.1/bin/elasticsearch # starts Elasticsearch
Install a version of the elasticsearch Python client that is compatible with the Elasticsearch instance you just downloaded:
# example
pip install elasticsearch==7.10.1
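To make the compatibility requirement concrete, here is a minimal sketch of a version check. It assumes the common convention that the client and server should share the same major version; the helper names major_version and versions_compatible are illustrative and are not part of Open eDiscovery or the elasticsearch package. In practice, the server version can usually be read from the version.number field of the JSON returned by the server's root endpoint.

```python
def major_version(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_compatible(client_version: str, server_version: str) -> bool:
    """Assumed rule: client and server are compatible when their major versions match."""
    return major_version(client_version) == major_version(server_version)


# Hard-coded example versions; substitute the versions you actually installed.
print(versions_compatible("7.10.1", "7.10.1"))  # True
print(versions_compatible("8.0.0", "7.10.1"))   # False
```

This check is deliberately coarse: it only flags a major-version mismatch, which is the case most likely to cause trouble.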
How to Use the Web App
Once eDiscovery is installed, the Web app can be started from the command line by typing ediscovery:
ediscovery --port 8888
When eDiscovery is started for the first time, it displays instructions on how to set up the admin user for the app. The admin account can then be used to add more users. The Web app includes a point-and-click, no-code interface to full-text faceted search and various AI/ML/NLP capabilities such as end-to-end question-answering, theme discovery, and keyphrase extraction. Unlike most existing search engines, Open eDiscovery allows you to surgically apply machine-learning analyses to the specific documents of interest to you.
How to Use the Programmatic API
Open eDiscovery can also be used as a programmatic Python API.
1. Process and Analyze a Document
The Document.from_file and Document.from_url class methods can be used to process and analyze raw documents (e.g., PDFs, MS PowerPoint files, Word documents, and plain text).
from ediscovery import Document
project_id = 1  # assumed example value; use the ID of your own project
document = Document.from_url('https://arxiv.org/pdf/2004.10703.pdf', project_id=project_id)
print(document)
# OUTPUT:
# Document(filepath=/home/amaiya/ediscovery_data/projects/1/2004.10703-0dfd3e9.pdf, filename=2004.10703-0dfd3e9.pdf, extension=pdf, mimetype=application/pdf,
# md5=0dfd3e96bd2bd0a2ace640d42514775d,
# topictags=['machine learning', 'step', 'arxiv preprint', 'learning rate',
# 'text classification', 'open-domain question-answering',
# 'augmented machine', 'augmented machine learning', 'bert', 'low-code library'],
# warning=False, error=False, err_msg=None)
The document.content variable (not shown) holds the raw text extracted from the document. The document.topictags field stores auto-extracted keyphrases from the document.
Pro Tips:
- Use the from_file class method (instead of from_url) for files that are already downloaded to the local file system.
- The ngram_range and candidate_generator arguments to from_file and from_url can be used to configure the keyphrase extraction.
- OCR is supported, but turned off by default. To enable it, use enable_pdf_ocr, enable_jpg_ocr, etc.:
# ocr_timeout=720 means that the OCR will stop processing long documents after 720 seconds
d = Document.from_file('some_scanned_pdf_document.pdf', project_id=project_id,
enable_pdf_ocr=True, ocr_timeout=720)
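To give a feel for what an n-gram range controls during keyphrase extraction, the sketch below generates candidate word n-grams from a short piece of text. It assumes that ngram_range is a (min_n, max_n) tuple of word counts, as in many keyphrase extractors. It is an illustration only, not Open eDiscovery's actual extraction code, and candidate_ngrams is a hypothetical helper.

```python
import re
from typing import List, Tuple


def candidate_ngrams(text: str, ngram_range: Tuple[int, int] = (1, 3)) -> List[str]:
    """Return unique lowercase word n-grams whose length falls within ngram_range."""
    min_n, max_n = ngram_range
    # Tokenize into lowercase alphanumeric words, keeping internal hyphens.
    words = re.findall(r"[a-z0-9][a-z0-9\-]*", text.lower())
    seen = set()
    candidates = []
    for n in range(min_n, max_n + 1):
        for i in range(len(words) - n + 1):
            gram = " ".join(words[i:i + n])
            if gram not in seen:
                seen.add(gram)
                candidates.append(gram)
    return candidates


print(candidate_ngrams("Augmented machine learning", ngram_range=(2, 3)))
# ['augmented machine', 'machine learning', 'augmented machine learning']
```

A wider range yields longer candidate phrases (such as 'augmented machine learning' in the sample output earlier), at the cost of more candidates to score.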
2. Index Document to Elasticsearch
Indexing your document into Elasticsearch is simple using the ingest method:
from ediscovery import ESearch
es = ESearch('myindex', 'localhost')
result, success, fail = es.ingest([document.to_dict()])
print(result)
3. Search Documents Based on Queries and Filters
Once indexed, documents can be easily searched. Here, we search for PDFs containing the words "ktrain" and "machine learning" using standard Lucene query syntax and return only the first result:
resp = es.search('"ktrain" AND "machine learning"', filters=[{'extension':['pdf']}], size=1)
The resp variable holds the raw response from the Elasticsearch server, in the format described in the Elasticsearch documentation.
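Because the response is returned raw, results are typically read from its nested hits structure. The sketch below assumes the standard Elasticsearch search response layout (a hits object containing a hits list, where each entry has _id, _score, and _source). The filename field and the sample response are made up for illustration, and the fields actually stored by ingest may differ; summarize_hits is a hypothetical helper.

```python
from typing import Any, Dict, List


def summarize_hits(resp: Dict[str, Any], field: str = "filename") -> List[Dict[str, Any]]:
    """Pull the id, score, and one source field out of each hit in a search response."""
    hits = resp.get("hits", {}).get("hits", [])
    return [
        {
            "id": hit.get("_id"),
            "score": hit.get("_score"),
            field: hit.get("_source", {}).get(field),
        }
        for hit in hits
    ]


# Hand-written sample response for illustration; not real server output.
sample_resp = {
    "hits": {
        "total": {"value": 1, "relation": "eq"},
        "hits": [
            {
                "_id": "abc123",
                "_score": 1.5,
                "_source": {"filename": "example.pdf", "extension": "pdf"},
            }
        ],
    }
}

for row in summarize_hits(sample_resp):
    print(row)
```

Using .get with defaults keeps the helper from raising on an empty or partial response, which is convenient when a query matches nothing.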
For more information on the programmatic API, see the example notebook.
Questions? Contact Arun S. Maiya: arun [at] maiya [dot] net