txplib

Text preprocessing utilities.


License
MIT
Install
pip install txplib==0.1

Documentation

Text Processor

Intro

This Python package provides an easy-to-use interface for processing human language text with extensive NLP resources, such as corpora, stemmers, tokenizers and language embeddings. The main goal is to reduce the effort of integrating different Python NLP packages.

All the text-processing modules in this package are built on top of NLPLibrary, the resource-management module that catalogs all the known NLP resources and provides a consistent interface for loading a requested resource on the fly.

Install

pip install txplib

Run the configuration script.

git clone https://github.com/KeyiT/txplib.git
cd txplib
./scripts/config.sh

Initialize TextPreprocess and NLPLibrary

In:
  tp = TextPreprocess(NLPLibrary())

Load the resource contents required for the target task (optional).

In:
  tp.load_content_from_library("resource_class_name", "resource_library_name")
  
  tp.load_content_from_library("sentence_tokenizer", "nltk_eng_punkt")

Resource Class Name is a required argument that indicates what type of resource you need, for example stopword, punctuation or stemmer.

Each type of resource can support multiple libraries. Resource Library Name indicates which library you want to use. As an example, you can use either Porter ('resource_library_name': 'nltk_eng_porter') or Lancaster ('resource_library_name': 'nltk_eng_lancaster') to stem ('resource_class_name': 'stemmer') your text. Resource Library Name has a default value for every class of resource.

Calling this function for a resource class that is already loaded replaces its resource library.
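The relationship between resource class, resource library and default value can be pictured with a small stand-alone sketch. It does not use txplib: ToyLibrary, CATALOG and DEFAULTS are hypothetical names invented for illustration, and the two "stemmers" are trivial stand-ins rather than the real Porter or Lancaster stemmers.

```python
# Illustrative sketch only: ToyLibrary, CATALOG and DEFAULTS are hypothetical
# and are not part of txplib.

# resource class name -> {resource library name -> loader function}
CATALOG = {
    "stemmer": {
        "toy_strip_s": lambda: (lambda w: w[:-1] if w.endswith("s") else w),
        "toy_strip_ing": lambda: (lambda w: w[:-3] if w.endswith("ing") else w),
    },
}

# Every resource class has a default library name.
DEFAULTS = {"stemmer": "toy_strip_s"}


class ToyLibrary:
    def __init__(self):
        self.loaded = {}  # resource class name -> (library name, content)

    def load_content(self, class_name, library_name=None):
        # Fall back to the default library when none is given.
        library_name = library_name or DEFAULTS[class_name]
        loader = CATALOG[class_name][library_name]
        # Loading an already-loaded class replaces its library.
        self.loaded[class_name] = (library_name, loader())


lib = ToyLibrary()
lib.load_content("stemmer")                   # default library
first = lib.loaded["stemmer"][0]
lib.load_content("stemmer", "toy_strip_ing")  # replaces the default
second = lib.loaded["stemmer"][0]
print(first, second)
```

Keeping one entry per resource class is what makes a second call act as a replacement rather than an addition.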

Call NLP Functions

In:
  sentences_list = tp.tokenize_to_sentences(text)

TextPreprocess is intended to encapsulate and organize all the NLP methods needed to preprocess documents before the ML phase. Each method uses the resource loaded in its NLPLibrary to perform the task it is responsible for.

If the required resource has not been loaded before the call, the method loads the default resource automatically.
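The load-on-first-use behaviour can be sketched as follows. This is a minimal illustration, not txplib code: LazyPreprocess is a hypothetical class, and its default sentence tokenizer is a simple regular expression standing in for a real tokenizer.

```python
# Illustrative sketch only: LazyPreprocess is hypothetical, not txplib code.
import re


class LazyPreprocess:
    def __init__(self):
        self.resources = {}   # resource class name -> loaded callable
        self.load_count = 0   # counts how often a resource was loaded

    def _default_sentence_tokenizer(self):
        # Stand-in default: split after '.', '!' or '?' followed by whitespace.
        self.load_count += 1
        return lambda text: re.split(r"(?<=[.!?])\s+", text.strip())

    def tokenize_to_sentences(self, text):
        # Load the default resource only if nothing was loaded beforehand.
        if "sentence_tokenizer" not in self.resources:
            self.resources["sentence_tokenizer"] = self._default_sentence_tokenizer()
        return self.resources["sentence_tokenizer"](text)


tp_sketch = LazyPreprocess()
sents = tp_sketch.tokenize_to_sentences("She likes dogs. Food is awesome!")
tp_sketch.tokenize_to_sentences("We went to library.")
print(sents, tp_sketch.load_count)
```

The second call reuses the tokenizer loaded by the first, so the default is loaded only once.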

Show Resource Catalog

TextPreprocess and NLPLibrary provide an interface to print out the resource list:

In:
  tp.show_library_catalog()

This function can help find the available resources.

Show Loaded Items

In:
  tp.show_library_items()

This function shows information about the loaded resources.

Scikit-Learn Modules

NLPUnit is the core interface, which provides a scikit-learn wrapper for every NLP module, including Tokenizer, Normalizer, Filter, Encoder etc.
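As a rough picture of what a scikit-learn style wrapper looks like, the sketch below follows the transformer convention (fit returns self, transform maps the data) without importing scikit-learn. LowerCaseUnit is a hypothetical name; how NLPUnit is actually implemented is not shown here, and a real wrapper would more likely inherit from scikit-learn's BaseEstimator and TransformerMixin.

```python
# Illustrative sketch only: LowerCaseUnit is hypothetical. It follows the
# scikit-learn transformer convention (fit returns self, transform maps data)
# without importing scikit-learn.


class LowerCaseUnit:
    def fit(self, X, y=None):
        # Nothing to learn for a stateless unit.
        return self

    def transform(self, X):
        # X is a Word Page: a list of word lists.
        return [[word.lower() for word in words] for words in X]

    def fit_transform(self, X, y=None):
        return self.fit(X, y).transform(X)


page = [["She", "likes", "dogs", "."], ["Food", "is", "awesome", "!"]]
lowered = LowerCaseUnit().fit_transform(page)
print(lowered)
```

Units that share this interface can be chained, since the output of one transform is a valid input for the next.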

Three composite text-preprocessing modules are implemented: Document2WordPage, Documents2WordPages and Documents2BOW. They are introduced in the Data Flow section.

To see how to work with them, please refer to the unit test files.

Data Flow

Data Model Convention

To offer a consistent and intuitive interface, this package follows a naming convention for text data.

Document(s)

  • Type: String or List of Strings
  • Data: An untokenized raw text or list of untokenized raw texts.

Sentences

  • Type: List of Strings.
  • Data: List of sentences. Output of the sentence tokenizer. Sentences are ordered, i.e. the sequential order of sentences is kept.

Words

  • Type: List of Strings or List of String Tuples.
  • Data: List of word tokens or list of tagged word tokens. For tagged words, each element is a tuple: the first member is the word and the second is its corresponding tag.

Word Page

  • Type: List of String List or List of String Tuple List.
  • Data: List of Words. Output of word tokenizer if the input is Sentences.
    • Word Page is ordered, i.e. the sequential order of words is kept.
    • For tagged word page, each element is a tuple. The first member of the tuple is the word, and the second is its corresponding tag.

Bags Of Words (BOW)

  • Type: List of String List or List of String Tuple List.
  • Data: List of Words.
    • Words in BOW are not necessarily ordered, i.e. there is no sequential order between words.
    • Each string list is a collection of representative words of a document.
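The conventions above can be written down as type aliases with one example value each. This is only a sketch: the alias names are hypothetical and txplib may not define them, and the tags in the tagged example are Penn Treebank style tags chosen for illustration.

```python
# Illustrative sketch only: these aliases are hypothetical names that mirror
# the conventions above; txplib may not define them.
from typing import List, Tuple, Union

Document = str
Documents = List[str]
Sentences = List[str]
Word = Union[str, Tuple[str, str]]   # plain token or (word, tag)
Words = List[Word]
WordPage = List[Words]               # one word list per sentence, order kept
BagsOfWords = List[Words]            # one word list per document, order not guaranteed

document: Document = "She likes dogs. Food is awesome!"
sentences: Sentences = ["She likes dogs.", "Food is awesome!"]
word_page: WordPage = [["She", "likes", "dogs", "."], ["Food", "is", "awesome", "!"]]
tagged_words: Words = [("She", "PRP"), ("likes", "VBZ"), ("dogs", "NNS"), (".", ".")]
bow: BagsOfWords = [["dogs", "likes", "awesome", "Food"]]
```

Word Page and BOW share the same Python type; the difference is in what the inner lists mean (sentences in order versus representative words of a document).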

Composite Text Preprocessing Blocks

Document2WordPage

Document2WordPage transforms a raw Document to a Word Page.

All the blocks after WordTokenizer take a Word Page as both input and output, so any of them (CaseLower, POSTagger, Lemmatizer, TagCleaner) can be switched off to skip that operation on the text data.
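The reason blocks can be switched off is that each one maps a Word Page to a Word Page. The sketch below illustrates this with two optional stages. It is not the package's implementation: document_to_word_page and its flags are hypothetical, the regular expressions are simple stand-ins for real tokenizers, and lower-casing and punctuation stripping stand in for the actual CaseLower, POSTagger, Lemmatizer and TagCleaner blocks.

```python
# Illustrative sketch only: document_to_word_page and its flags are
# hypothetical; the regex tokenizers are simple stand-ins.
import re


def document_to_word_page(document, lower=True, strip_punct=False):
    # Sentence tokenizer: split after sentence-ending punctuation.
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    # Word tokenizer: words and single punctuation marks -> Word Page.
    page = [re.findall(r"\w+|[^\w\s]", s) for s in sentences]
    # Optional blocks: each takes a Word Page and returns a Word Page,
    # so any of them can be skipped without breaking the chain.
    if lower:
        page = [[w.lower() for w in words] for words in page]
    if strip_punct:
        page = [[w for w in words if re.match(r"\w", w)] for words in page]
    return page


text = "She likes dogs. Food is awesome!"
print(document_to_word_page(text))
print(document_to_word_page(text, lower=False, strip_punct=True))
```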

Documents2WordPages

Documents2WordPages transforms a list of raw Documents to a list of Word Pages. Documents2WordPages uses the Document2WordPage block to map each input document to its corresponding Word Page, and outputs them as a list in the same order as the input.

The output of Documents2WordPages is a three-level nested list of strings.
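The three nesting levels (document, sentence, word) can be seen in a short sketch. Both functions are hypothetical stand-ins built on regular expressions, not the package's blocks.

```python
# Illustrative sketch only: both functions are hypothetical stand-ins.
import re


def to_word_page(document):
    sentences = re.split(r"(?<=[.!?])\s+", document.strip())
    return [re.findall(r"\w+|[^\w\s]", s) for s in sentences]


def documents_to_word_pages(documents):
    # One Word Page per document, in input order.
    return [to_word_page(doc) for doc in documents]


pages = documents_to_word_pages(["She likes dogs. Food is awesome!", "We went to library."])
# pages[d][s][w]: document d, sentence s, word w
print(pages[0][1][2])
```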

Documents2BOW

Documents2BOW transforms a list of raw Documents to BOW.

TokenTensorReducer merges the lists in the lower level of a given nested list; here it transforms a list of Word Pages to BOW. Multiple filter blocks are used in Documents2BOW, including POSFilter, Stopwords Filter and Punctuation Filter. The filter blocks and the TagCleaner block can be switched off.
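One way to read the reduction step is as flattening the sentence level of each Word Page, so that three nested levels become two. The sketch below shows that reading together with two switchable filters. All names (reduce_lowest_level, to_bow, STOPWORDS) are hypothetical, the stopword list is a tiny stand-in, and the POS filter is left out.

```python
# Illustrative sketch only: reduce_lowest_level, to_bow and STOPWORDS are
# hypothetical and are not part of txplib.
import string

STOPWORDS = {"is", "to", "we", "she"}  # tiny stand-in list


def reduce_lowest_level(word_pages):
    # Merge the sentence-level lists of each Word Page into one list
    # per document: 3 nested levels become 2.
    return [[w for words in page for w in words] for page in word_pages]


def to_bow(word_pages, filter_stopwords=True, filter_punct=True):
    bags = reduce_lowest_level(word_pages)
    if filter_stopwords:
        bags = [[w for w in bag if w.lower() not in STOPWORDS] for bag in bags]
    if filter_punct:
        bags = [[w for w in bag if w not in string.punctuation] for bag in bags]
    return bags


pages = [
    [["She", "likes", "dogs", "."], ["Food", "is", "awesome", "!"]],
    [["We", "went", "to", "library", "."]],
]
print(to_bow(pages))
```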

Showcase Example of TextPreprocess Interface

  • Given a simple text.
In:
  text = "She likes dogs. Food is awesome! We went to library."
  • Initialize a TextPreprocess instance by passing an NLPLibrary instance to its initializer.
In:
  tp = TextPreprocess(NLPLibrary())
  • Tokenize the text into sentences and word sequences.
In:
  sentences_list = tp.tokenize_to_sentences(text)
  documents = tp.tokenize_sents_to_words(sentences_list)
  print(documents)
Out:
  [['She', 'likes', 'dogs', '.'], ['Food', 'is', 'awesome', '!'], ['We', 'went', 'to', 'library', '.']]
  • Tag the tokens with their part of speech (POS) and normalize the text.
In:
  tagged_documents = tp.pos_tag(documents)
  normalized_documents = tp.lemmatize_documents(tagged_documents)
  print(normalized_documents)
Out:
  [['She', 'like', 'dog', '.'], ['Food', 'be', 'awesome', '!'], ['We', 'go', 'to', 'library', '.']]
  • Keep only the verbs and remove all other words.
In:
  verbs_in_sents = tp.focus_on_pos_tag_type(documents, ['verb'])
  print(verbs_in_sents)
Out:
  [['like'], ['be'], ['go']]

TODO List

  • Model Evaluation Modules
  • Spelling Checking Modules
  • Sentence Structure Filtering Modules