esanpy

Elasticsearch based Text Analyzer.


Keywords
text, analyzer, python
License
Apache-2.0
Install
pip install esanpy==1.1.0

Documentation

Esanpy: Elasticsearch based Analyzer for Python Build Status

Esanpy is Python Text Analyzer based on Elasticsearch. Using Elasticsearch, Esanpy provides powerful and fully-customizable text analysis. Since Esanpy manages Elasticsearch instance internally, you DO NOT need to install/configure Elasticsearch.

Install Esanpy

$ pip install esanpy

If you want to install development version, run as below:

$ git clone https://github.com/codelibs/esanpy.git
$ cd esanpy
$ pip install .

Requirement

  • Python 2.7 or 3.4-3.6
  • Java 8 or above

Python

First of all, import esanpy module.

import esanpy

Start Server

To access to Elasticsearch, use start_server function. This function downloads/configures embedded elasticsearch and plugins, and then start Elasticsearch instance. The elasticsearch is saved in ~/.esanpy directory. If they are configured, this function just start elasticsearch instance.

esanpy.start_server()

Analyze Text

Esanpy provides analyzer and custom_analyzer function.

tokens = esanpy.analyzer("This is a pen.")
# tokens = ["this", "is", "a", "pen"]

To use other analyzer, set an analyzer name with analyzer.

tokens = esanpy.analyzer("今日の天気は晴れです。", analyzer="koromoji")

custom_analyzer has tokenizer, token_filter and char_filter as arguments.

tokens = esanpy.custom_analyzer('this is a <b>test</b>',
                                tokenizer="keyword",
                                token_filter=["lowercase"],
                                char_filter=["html_strip"])

For Elasticsearch Analyze API, see Analyze.

Stop Server

To stop Elasticsearch, use stop_server().

esanpy.stop_server()

Command

Esanpy provides esanpy command.

$ esanpy --text "This is a pen."
this
is
a
pen

esanpy starts Elasticsearch if it does not run. So, it takes time to start it, but it will be fast after that because Elasticsearch instance is reused.

To change analyzer, use --analyzer option.

$ esanpy --text 今日の天気は晴れです。 --analyzer kuromoji
今日
天気
晴れ

--stop opition stops Elasticsearch instance on the command exit.

$ esanpy --text "This is a pen." --stop

Advance Usecases

Register Analyzer

You can register own analyzers by create_analysis. To register analyzers with my_analyzers namespace:

esanpy.create_analysis('my_analyzers',
                       char_filter={
                           "mapping_ja_filter": {
                               "type": "mapping",
                               "mappings_path": mapping_file
                               }
                       },
                       tokenizer={
                           "kuromoji_user_dict": {
                               "type": "kuromoji_tokenizer",
                               "mode": "normal",
                               "user_dictionary": userdict_file,
                               "discard_punctuation": False
                               }
                       },
                       token_filter={
                           "ja_stopword": {
                               "type": "ja_stop",
                               "stopwords": [
                                   "行く"
                                   ]
                               }
                       },
                       analyzer={
                           "kuromoji_analyzer": {
                               "type": "custom",
                               "char_filter": ["mapping_ja_filter"],
                               "tokenizer": "kuromoji_user_dict",
                               "filter": ["ja_stopword"]
                               }
                       }
                       )

To use kuromoji_analyzer, invoke analyzer with a namespace and analyzer:

tokens = esanpy.analyzer('①東京スカイツリーに行く',
                         analyzer="kuromoji_analyzer",
                         namespace='my_analyzers')
# tokens = ['1', '東京スカイツリー', 'に']

To delete namespace, use delete_analysis:

esanpy.delete_analysis('my_analyzers')

For more information, see Analysis.

Use Kuromoji Neologd

Installing analysis-kuromoji-neologd plugin, you can use Nelogd analyzer. To install it, use --plugin option.

$ esanpy --stop
$ esanpy --plugin org.codelibs:elasticsearch-analysis-kuromoji-neologd:5.6.1

After installation, kuromoji_neologd analyzer is available.

$ esanpy --text 今日の天気は晴れです。 --analyzer kuromoji_neologd
今日の天気
晴れ

Uninstall Esanpy

To remove Esanpy, check/kill processes:

$ ps aux | grep esanpy
$ kill [above PIDs]

and then remove ~/.esanpy directory:

$ rm -rf ~/.esanpy