Distiller

Automatic Keyword Extraction from Document Collections



Distiller automatically extracts keywords from documents using term frequency-inverse document frequency (TF-IDF) scores and word positioning.

Distiller handles all of the preprocessing details and writes its final statistical reports in JSON format.
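For readers new to TF-IDF, here is a minimal sketch of the textbook score (term frequency times the log of inverse document frequency). Distiller's exact weighting and its word-positioning component are not described here, so the function name `tfidf_scores` and the formula variant are assumptions for illustration only.

```python
import math
from collections import Counter


def tfidf_scores(documents):
    """Return one {term: score} dict per tokenized document.

    Textbook variant (an assumption, not necessarily Distiller's):
    tf = count / document length, idf = log(N / document frequency).
    """
    n_docs = len(documents)

    # Document frequency: the number of documents containing each term.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


docs = [
    ["keyword", "extraction", "keyword"],
    ["document", "extraction"],
]
scores = tfidf_scores(docs)
```

A term that occurs in every document ("extraction" above) scores zero, while a term concentrated in one document ("keyword") scores higher, which is why TF-IDF is a common basis for keyword selection.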

Requirements

Distiller uses the Natural Language Toolkit (NLTK).

You will need to download a couple of NLTK packages:

>>> import nltk
>>> nltk.download()
Downloader> d
Download which package (l=list; x=cancel)?
    Identifier> maxent_treebank_pos_tagger
Downloader> d
Download which package (l=list; x=cancel)?
    Identifier> stopwords
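The same two packages can also be fetched without the interactive prompt by passing each identifier to `nltk.download`. The sketch below is one way to script that; the helper `download_nltk_packages` and its `download` parameter are hypothetical names introduced here so the function can be exercised without network access.

```python
# The two NLTK package identifiers named in the session above.
REQUIRED_NLTK_PACKAGES = ("maxent_treebank_pos_tagger", "stopwords")


def download_nltk_packages(packages=REQUIRED_NLTK_PACKAGES, download=None):
    """Download each package and return {identifier: success flag}.

    `download` is a hypothetical hook for substituting a fake downloader;
    when omitted, nltk.download is used.
    """
    if download is None:
        import nltk  # imported lazily so the module loads without NLTK
        download = nltk.download
    return {package: download(package) for package in packages}
```

Calling `download_nltk_packages()` with no arguments fetches both packages into NLTK's default data directory.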

Installation

Installation using pip:

$ pip install Distiller

Usage

Typical usage from within the Python interpreter:

>>> from Distiller.distiller import Distiller
>>> distiller = Distiller(data, target, options)

Arguments

data

Path to a file containing the document collection in JSON format:

{
    "metadata": {
        "base_url": "The document's source URL (if any)"
    },
    "documents": [
        {
            "id": "The document's unique identifier (if any)",
            "body": "The entire body of the document in a single text blob.",
            "description": "An optional brief description of this doc."
        }, ...
    ]
}
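A collection file in this shape can be produced with the standard `json` module, as in the sketch below. The document contents, the `http://example.com/docs` URL, the `collection.json` file name, and the `write_collection` helper are all placeholders chosen for illustration.

```python
import json
import os
import tempfile

# Placeholder collection following the layout described above.
collection = {
    "metadata": {"base_url": "http://example.com/docs"},
    "documents": [
        {
            "id": "doc-1",
            "body": "Full text of the first document.",
            "description": "A short description of doc-1.",
        },
        {
            "id": "doc-2",
            "body": "Full text of the second document.",
            "description": "A short description of doc-2.",
        },
    ],
}


def write_collection(collection, path):
    """Serialize the collection to `path` as JSON."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(collection, handle, indent=4)


# Write to a temporary directory so the example leaves no stray files.
path = os.path.join(tempfile.mkdtemp(), "collection.json")
write_collection(collection, path)
```

The resulting `path` is what would be passed as the `data` argument.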

target

Path where Distiller will output the following reports:

keywords: A list of words and how often each was detected as a document keyword.

bigrams: A list of word pairs and how often each was detected as a key pair in documents.

trigrams: A list of word triples and how often each was detected as a key triple in documents.

docmap: A mapping of document IDs to their respective keywords, n-grams, and other statistics.

keymap: A mapping of keywords to the documents they appear in.
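To make the relationship between docmap and keymap concrete, the sketch below inverts a document-to-keywords mapping into a keyword-to-documents mapping. The dictionary shapes and the `build_keymap` helper are assumptions for illustration; the actual report layout written by Distiller may differ.

```python
from collections import defaultdict


def build_keymap(doc_keywords):
    """Invert {doc_id: [keywords]} into {keyword: [doc_ids]}."""
    keymap = defaultdict(list)
    for doc_id, keywords in doc_keywords.items():
        for keyword in keywords:
            # List each document at most once per keyword.
            if doc_id not in keymap[keyword]:
                keymap[keyword].append(doc_id)
    return dict(keymap)


# Hypothetical docmap-like input, reduced to keywords only.
doc_keywords = {
    "doc-1": ["parser", "token"],
    "doc-2": ["token"],
}
keymap = build_keymap(doc_keywords)
```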

options

An optional dictionary containing document processing arguments in this format:

{
    'normalize': True,          # normalize tokens during preprocessing
    'stem': True,               # stem tokens during preprocessing
    'lemmatize': False,         # lemmatize tokens during preprocessing
    'tfidf_cutoff': 0.001,      # cutoff value for the TF-IDF score
    'pos_list': ['NN','NNP'],   # part-of-speech (POS) whitelist for keyword candidates
    'black_list': []            # tokens to exclude from keyword candidates
}
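As a rough illustration of how the three filtering options could interact, the sketch below keeps a token only if its POS tag is whitelisted, it is not blacklisted, and its score meets the cutoff. The `filter_candidates` helper, the `(token, tag, score)` tuple layout, and the inclusive comparison against `tfidf_cutoff` are all assumptions, not Distiller's confirmed behavior.

```python
def filter_candidates(tagged_scores, options):
    """Filter (token, pos_tag, score) tuples using the options dictionary."""
    pos_list = set(options.get("pos_list", []))
    black_list = set(options.get("black_list", []))
    cutoff = options.get("tfidf_cutoff", 0.0)
    return [
        token
        for token, pos_tag, score in tagged_scores
        if pos_tag in pos_list          # POS whitelist
        and token not in black_list     # token blacklist
        and score >= cutoff             # assumed inclusive cutoff
    ]


options = {
    "tfidf_cutoff": 0.001,
    "pos_list": ["NN", "NNP"],
    "black_list": ["page"],
}
tagged_scores = [
    ("parser", "NN", 0.02),
    ("quickly", "RB", 0.05),   # dropped: tag not in pos_list
    ("page", "NN", 0.03),      # dropped: blacklisted
    ("thing", "NN", 0.0001),   # dropped: below the cutoff
    ("Python", "NNP", 0.01),
]
kept = filter_candidates(tagged_scores, options)
```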