Automatically applies common NLP techniques to a text. Currently supports German and English.


Keywords
NLP, POS, NER, NLTK, german-language, named-entity-recognition, nlp-keywords-extraction, part-of-speech, python3, tokenizer
License
Other
Install
pip install informationminer==1.6.8

Documentation

informationminer

The following techniques are applied to the input text:

  • Tokenization
  • POS tagging
    • The English tagger is the NLTK default tagger
    • The German tagger is built from the TIGER corpus
  • Chunking
  • Named entity recognition
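
To make the first and third steps concrete, here is a minimal standard-library sketch of tokenization and a naive noun-phrase chunker. It is an illustration only and not the package's implementation; `tokenize` and `chunk_noun_phrases` are hypothetical helper names, and the tagged input is copied from the example output in this README.

```python
import re


def tokenize(text):
    """Split text into word tokens and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def chunk_noun_phrases(tagged):
    """Group runs of determiner, adjective and noun tags into chunks."""
    chunks, current = [], []
    for word, tag in tagged:
        if tag in ("DT", "JJ") or tag.startswith("NN"):
            current.append(word)
        else:
            # Any other tag closes the chunk that is currently open.
            if current:
                chunks.append(current)
            current = []
    if current:
        chunks.append(current)
    return chunks


tokens = tokenize("This is a sample sentence. I love InformationMiner !")

# Tags copied from the README example; no tagger is run here.
tagged = [("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("sample", "JJ"),
          ("sentence", "NN"), (".", "."), ("I", "PRP"), ("love", "VBP"),
          ("InformationMiner", "NN"), ("!", ".")]
noun_phrases = chunk_noun_phrases(tagged)
```

A real chunker would use a proper grammar; this one simply treats every unbroken run of DT, JJ and NN* tags as one chunk, so a lone determiner such as "This" also becomes a chunk.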

Install

This package is available on PyPI. Install it with pip3 install informationminer.

Getting started

The following example shows basic usage. More complex tasks, such as creating your own tagger, will be added later.

>>> import informationminer
>>> im = informationminer.InformationMiner("This is a sample sentence. I love InformationMiner !")
>>> im.process()
INFO:root:Start processing text
INFO:root:Tokenizing text
INFO:root:Creating new tokens
INFO:root:Writing output/01_token_output.json
INFO:root:POS tagging tokens
INFO:root:Creating new POS tags. This can take some time ...
INFO:root:Writing output/02_pos_output.json
INFO:root:Chunking POS
INFO:root:Creating new chunks. This can take some time ...
INFO:root:Writing output/03_chunk_output.pickle
INFO:root:Extracting entity names
INFO:root:Searching for named entities
INFO:root:Writing output/04_ne_output.json
INFO:root:Processing finished in 0.24 s
>>> im.tokens
['This', 'is', 'a', 'sample', 'sentence', '.', 'I', 'love', 'InformationMiner', '!']
>>> im.pos
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('sample', 'JJ'), ('sentence', 'NN'), ('.', '.'), ('I', 'PRP'), ('love', 'VBP'), ('InformationMiner', 'NN'), ('!', '.')]
>>> im.ne
['InformationMiner']
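
Since the results are plain Python lists, they can be processed further with the standard library. The sketch below assumes that `im.pos` is a list of (word, tag) pairs as printed above; the list is copied into a local variable so the snippet runs without the package installed.

```python
from collections import Counter

# Copy of the im.pos value shown in the session above.
pos = [("This", "DT"), ("is", "VBZ"), ("a", "DT"), ("sample", "JJ"),
       ("sentence", "NN"), (".", "."), ("I", "PRP"), ("love", "VBP"),
       ("InformationMiner", "NN"), ("!", ".")]

# Count how often each POS tag occurs.
tag_counts = Counter(tag for _, tag in pos)

# Keep only the words whose tag starts with NN (nouns).
nouns = [word for word, tag in pos if tag.startswith("NN")]
```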

The InformationMiner class has a few optional parameters:

  • save_output: Writes the results to outdir/outfile. Enable this to avoid repeating work on later runs.
  • force_create: Always overwrites existing files when save_output is enabled.
  • language: Either en (default) or ger.
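
One way to set these parameters is to collect them in a dictionary first and validate the language code before the miner is constructed. In this sketch `build_options` is a hypothetical helper, and the `False` defaults for save_output and force_create are assumptions; only the parameter names and the two language codes come from this README.

```python
def build_options(language="en", save_output=False, force_create=False):
    """Validate and collect keyword arguments for InformationMiner."""
    if language not in ("en", "ger"):
        raise ValueError("language must be 'en' or 'ger', got %r" % language)
    return {
        "language": language,
        "save_output": save_output,
        "force_create": force_create,
    }


options = build_options(language="ger", save_output=True)

# With the package installed, the options could be passed on like this
# (assuming the parameters are accepted as keyword arguments):
# import informationminer
# im = informationminer.InformationMiner("Das ist ein Beispielsatz.", **options)
# im.process()
```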

Common Hiccups

Please propose design changes to fix these if you have a great idea! :)

  • Existing files in the output directory are always used, and the given text is ignored
  • save_output and force_create interact in unexpected ways
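
As one possible design for the first hiccup, cached output could be reused only when it was produced from the same text. The sketch below stores a SHA-256 fingerprint of the input next to the output files and checks it before reuse. The file name `fingerprint.json` and all function names are hypothetical; nothing here is part of the package today.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def text_fingerprint(text):
    """Return a stable hash that identifies the input text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def record_fingerprint(outdir, text):
    """Store the fingerprint of the processed text in the output directory."""
    Path(outdir).mkdir(parents=True, exist_ok=True)
    marker = Path(outdir) / "fingerprint.json"
    marker.write_text(json.dumps({"sha256": text_fingerprint(text)}),
                      encoding="utf-8")


def cache_is_valid(outdir, text):
    """Return True only if outdir holds a fingerprint matching this text."""
    marker = Path(outdir) / "fingerprint.json"
    if not marker.exists():
        return False
    stored = json.loads(marker.read_text(encoding="utf-8"))
    return stored.get("sha256") == text_fingerprint(text)


# Demonstration in a temporary directory.
with tempfile.TemporaryDirectory() as outdir:
    record_fingerprint(outdir, "first text")
    same_text_hit = cache_is_valid(outdir, "first text")
    new_text_hit = cache_is_valid(outdir, "second text")
```

With a check like this, existing files would be reused for unchanged text and recreated for new text, which might also leave force_create with a single clear meaning: recreate even when the fingerprint matches.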