chinormfilter

Filters synonym files written in Lucene format to remove entries that duplicate Sudachi normalization.


License
Apache-2.0
Install
pip install chinormfilter==0.5.1

Documentation

Filters synonym files written in Lucene format to remove entries that duplicate Sudachi normalization. It is mainly useful when migrating to the Sudachi analyzer.

Usage

$ chinormfilter tests/test.txt -o out.txt

The filter turns the input (top) into the output (bottom):

レナリドミド,レナリドマイド
リンゴ => 林檎
飲む,呑む
tlc => tlc,全肺気量
リンたんぱく質,リン蛋白質,リンタンパク質

↓ filter

レナリドミド,レナリドマイド
tlc => tlc,全肺気量
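
The filtering rule can be illustrated with a minimal Python sketch. This is not chinormfilter's actual implementation: the rule used here (drop a line when every term on it shares one normalized form) is inferred from the example above, and the NORMALIZED table is a hard-coded, hypothetical stand-in for Sudachi normalization so that the sketch runs without a dictionary installed. The function names are likewise illustrative.

```python
# Hypothetical stand-in for Sudachi normalization (NOT real Sudachi output):
# maps a surface form to the form it is assumed to normalize to.
NORMALIZED = {
    "リンゴ": "林檎",
    "呑む": "飲む",
    "リンたんぱく質": "リン蛋白質",
    "リンタンパク質": "リン蛋白質",
}


def normalize(term):
    """Return the assumed normalized form; unknown terms map to themselves."""
    return NORMALIZED.get(term, term)


def split_terms(line):
    """Collect the terms of a synonym line, for both 'a,b' and 'a => b,c'."""
    terms = []
    for side in line.split("=>"):
        terms.extend(t.strip() for t in side.split(",") if t.strip())
    return terms


def is_redundant(line):
    """A line is redundant when all of its terms normalize to one form."""
    return len({normalize(t) for t in split_terms(line)}) <= 1


def filter_synonyms(lines):
    """Keep only the lines that normalization does not already cover."""
    return [line for line in lines if line.strip() and not is_redundant(line)]


SAMPLE = [
    "レナリドミド,レナリドマイド",
    "リンゴ => 林檎",
    "飲む,呑む",
    "tlc => tlc,全肺気量",
    "リンたんぱく質,リン蛋白質,リンタンパク質",
]

if __name__ == "__main__":
    for kept in filter_synonyms(SAMPLE):
        print(kept)
```

With a real Sudachi tokenizer, normalize would instead return the normalized form reported by the analyzer; the rest of the sketch would stay the same under the stated assumption about the rule.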

Specify system dict

$ chinormfilter tests/test.txt -s full -o out.txt

Use Custom Dict

Specify the dictionary through a sudachi.json settings file:

$ chinormfilter tests/test.txt -s sudachi.json -o out.txt

TODO

  • custom dict test