ediscovery

An NLP tool for searching and exploring large document collections.

Keywords: nlp
License: Apache-2.0
Install: pip install ediscovery==0.0.1

Welcome to Open eDiscovery

Open eDiscovery helps you search and explore large document collections

Overview

Open eDiscovery is a Python package that facilitates searching and exploring large collections of files. It can function both as a programmatic API and as an out-of-the-box, production-ready Web application. Powered by well-tested open-source tools such as Elasticsearch (for the back-end data repository), Streamlit (for the front-end Web-based user interface), and ktrain (for machine learning and data analytics), Open eDiscovery has applications in legal e-Discovery, digital forensics, scientific literature reviews, and any task that involves making sense of a collection of documents. In addition to standard full-text search and document tagging, Open eDiscovery includes out-of-the-box support for AI/ML capabilities such as end-to-end question-answering, theme discovery, and keyphrase extraction.

How to Install

1. Install eDiscovery from PyPI along with APT dependencies

pip install ediscovery
python -m ediscovery.install_system_dependencies

The install_system_dependencies command downloads and installs Ubuntu APT dependencies, which enable better preprocessing of documents (e.g., extracting RAR/PST archives, detecting file types accurately, and performing optical character recognition). Open eDiscovery has been tested on ubuntu-latest (currently Ubuntu 20.04), which is the recommended environment. If you are on Microsoft Windows, we recommend a virtualized Ubuntu environment such as the Windows Subsystem for Linux (WSL).

2. Download and start an Elasticsearch instance

If you don't already have an Elasticsearch instance running, you can download and start one easily:

# Example using Elasticsearch 7.10.1
wget https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.10.1-linux-x86_64.tar.gz
tar -xzf elasticsearch-7.10.1-linux-x86_64.tar.gz
elasticsearch-7.10.1/bin/elasticsearch # starts Elasticsearch

Install a version of the elasticsearch Python client that is compatible with the Elasticsearch instance you just downloaded:

# example
pip install elasticsearch==7.10.1
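
Once both the server and the Python client are installed, a quick connectivity check from Python confirms that everything is wired up. This is a minimal sketch that assumes the default host and port (localhost:9200):

from elasticsearch import Elasticsearch

es = Elasticsearch()   # connects to localhost:9200 by default
print(es.ping())       # True if the server is reachable
print(es.info())       # basic cluster information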

How to Use the Web App

Once eDiscovery is installed, the Web app can be started from the command-line by typing ediscovery:

ediscovery --port 8888

When starting eDiscovery for the first time, there will be instructions on how to set up the admin user for the app. The admin account can then be used to add additional users. The Web app includes a point-and-click, no-code interface for full-text, faceted search and various AI/ML/NLP capabilities such as end-to-end question-answering, theme discovery, and keyphrase extraction. Unlike most existing search engines, Open eDiscovery allows you to surgically apply machine-learning analyses to the specific documents of interest to you.

[eDiscovery screenshot]

How to Use the Programmatic API

Open eDiscovery can also be used as a programmatic Python API.

1. Process and Analyze a Document

The Document.from_file and Document.from_url class methods can be used to process and analyze raw documents (e.g., PDFs, MS PowerPoint files, Word documents, plain text).

from ediscovery import Document
project_id = 1  # example project ID (an assumption for illustration; use the ID of your own project)
document = Document.from_url('https://arxiv.org/pdf/2004.10703.pdf', project_id=project_id)
print(document)
# OUTPUT:
# Document(filepath=/home/amaiya/ediscovery_data/projects/1/2004.10703-0dfd3e9.pdf, filename=2004.10703-0dfd3e9.pdf, extension=pdf, mimetype=application/pdf, 
#          md5=0dfd3e96bd2bd0a2ace640d42514775d, 
#          topictags=['machine learning', 'step', 'arxiv preprint', 'learning rate', 
#                     'text classification', 'open-domain question-answering', 
#                     'augmented machine', 'augmented machine learning', 'bert', 'low-code library'],
#          warning=False, error=False, err_msg=None)

The document.content variable (not shown above) holds the raw text extracted from the document, and the document.topictags field stores auto-extracted keyphrases from the document.
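
For example, both fields can be inspected directly (the slice below simply previews the extracted text):

print(document.topictags)        # auto-extracted keyphrases
print(document.content[:500])    # first 500 characters of raw extracted text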

Pro Tips:

  • Use the from_file class method (instead of from_url) for files that have already been downloaded to the local file system.
  • The ngram_range and candidate_generator arguments to from_file and from_url can be used to configure the keyphrase extraction (see the sketch after this list).
  • OCR is supported but turned off by default. To enable it, use enable_pdf_ocr, enable_jpg_ocr, etc.:
# ocr_timeout=720 means that the OCR will stop processing long documents after 720 seconds
d = Document.from_file('some_scanned_pdf_document.pdf', project_id=project_id, 
                      enable_pdf_ocr=True, ocr_timeout=720)
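
As a rough illustration of the keyphrase-extraction options mentioned above, the following minimal sketch restricts candidate keyphrases to one- to three-word phrases. The file name is a placeholder, and the (1, 3) value is an assumed example following the usual scikit-learn-style ngram_range convention, not a documented default:

# consider candidate keyphrases of one to three words when tagging the document
d = Document.from_file('annual_report.pdf', project_id=project_id,
                       ngram_range=(1, 3))
print(d.topictags)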

2. Index Document to Elasticsearch

Indexing your document into Elasticsearch is simple using the ingest method:

from ediscovery import ESearch
es = ESearch('myindex', 'localhost')  # index name and Elasticsearch host
result, success, fail = es.ingest([document.to_dict()])
print(result)
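
The same pattern extends to entire folders. The following minimal sketch walks a directory, processes each file with Document.from_file, and bulk-indexes the results; the folder path is a placeholder, and the error flag is used as shown in the Document output earlier:

import os
from ediscovery import Document, ESearch

es = ESearch('myindex', 'localhost')
docs = []
for root, _, files in os.walk('/path/to/your/folder'):   # placeholder folder
    for fname in files:
        d = Document.from_file(os.path.join(root, fname), project_id=project_id)
        if not d.error:                                   # skip files that failed to process
            docs.append(d.to_dict())
result, success, fail = es.ingest(docs)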

3. Search Documents Based on Queries and Filters

Once indexed, documents can be easily searched. Here, we search for PDFs containing the words "ktrain" and "machine learning" using standard Lucene query syntax and return only the first result:

resp = es.search('"ktrain" AND "machine learning"', filters=[{'extension':['pdf']}], size=1)

The resp variable is the raw response from the Elasticsearch server (see the Elasticsearch Search API documentation for the full response format).
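
Because resp follows the standard Elasticsearch response structure, matching documents can be pulled out of the hits list. The filename field below assumes the documents were indexed from Document.to_dict as shown earlier:

# walk the standard Elasticsearch response structure
for hit in resp['hits']['hits']:
    source = hit['_source']                      # the indexed document fields
    print(hit['_score'], source.get('filename'))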

For more information on the programmatic API, see the example notebook.


Questions? Contact Arun S. Maiya: arun [at] maiya [dot] net