KEM
KEM (keyword embedding model) is a word2vec model. Every package in our NLP suite starts with the prefix "k",
hence the name KEM.
Install
- (Recommended) Install with docker-compose.
Manual Install
If you want to integrate kem
into your own Django project, install it manually:
pip install kem
Config
Because this is a Django app,
the following Django setup is required.
- settings.py:
INSTALLED_APPS = [
    'kem',
    ...
]
- urls.py:
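# django.conf.urls.url is available in Django versions before 4.0
from django.conf.urls import include, url
import kem.urls
urlpatterns += [
    url(r'^kem/', include(kem.urls))
]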
- build the model:
python3 manage.py buildkem --lang <lang, e.g., zh or en or th> --dimension <int: e.g., 400> --cpus <default=6> --ontology <default=False>
- ontology: experimental feature, see the Experimental Feature section
- start the server:
python manage.py runserver
and go to 127.0.0.1:8000/
to check that the configuration works.
API
- get similar word:
/kem
- keyword
- num (default=10)
- ontology (default=False)
["原生動物", 0.7895185351371765] ["藍菌", 0.7865398526191711] ["甲藻", 0.7792112827301025] ["藍綠藻", 0.7636655569076538] ["芽孢", 0.7631546258926392] ["兼性", 0.7622398138046265] ["纖毛蟲", 0.7605307102203369] ["專性", 0.7589520215988159] ["莢膜", 0.7575902938842773] ... etc
["中華民國總統府國策顧問"], ["中華民國內政部部長"], ["中華民國法官"], ["中華民國檢察官"], ["國立臺灣大學法律學院校友"] ... etc
- get vector:
/kem/vector
- keyword
- example: http://udiclab.cs.nchu.edu.tw/kem/vector?keyword=女生&lang=zh
[1.3885987997055054, 0.5394280552864075, -0.2656879723072052, 0.7741730809211731, 0.591987133026123 ...]
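As a rough illustration of calling the two endpoints above from Python, the sketch below builds the query URLs and decodes a response. The base address, the helper names `build_url` and `fetch_json`, the JSON response format, and the use of `lang` on `/kem` are assumptions for illustration (only the `/kem/vector` example URL shows `lang`).

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: the Django dev server started with runserver on the default port.
BASE_URL = "http://127.0.0.1:8000"


def build_url(path, **params):
    """Return BASE_URL + path with params percent-encoded as a query string."""
    return "{}{}?{}".format(BASE_URL, path, urlencode(params))


def fetch_json(url):
    """GET the URL and decode the body as JSON (assumes the API returns JSON)."""
    with urlopen(url) as resp:
        return json.loads(resp.read().decode("utf-8"))


# keyword and num are documented parameters; lang on /kem is an assumption.
similar_url = build_url("/kem", keyword="女生", lang="zh", num=5)
vector_url = build_url("/kem/vector", keyword="女生", lang="zh")

if __name__ == "__main__":
    print(similar_url)
    print(vector_url)
    # With the server running, the responses could be fetched like this:
    # print(fetch_json(similar_url))
    # print(fetch_json(vector_url))
```

Non-ASCII keywords are percent-encoded by `urlencode`, so the keyword can be passed as a plain string.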
Experimental Feature
This feature is based on kcem,
an ontology of isA relations.
Setting --ontology
to True replaces every noun in the training corpus with its hypernym
and concatenates the transformed corpus with the original one.
word2vec is then trained on the combined corpus.
In the example below, the nearest neighbors change from individual person names to related positions and categories.
result (with --ontology True):
>>> model.most_similar('中華民國法務部部長')
[
[
"中華民國總統府國策顧問",
0.7841469645500183
],
[
"中華民國內政部部長",
0.7837527990341187
],
[
"中華民國法官",
0.7816867828369141
],
[
"中華民國檢察官",
0.7780462503433228
],
[
"國立臺灣大學法律學院校友",
0.7581177949905396
]
]
original (without --ontology):
>>> model.most_similar('中華民國法務部部長')
[
[
"楊芳婉",
0.8307946920394897
],
[
"吳朱疆",
0.830314040184021
],
[
"郭宗德",
0.8272522687911987
],
[
"莊懷義",
0.8246101140975952
],
[
"蔡兆陽",
0.821085512638092
]
]
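The corpus transformation behind --ontology can be sketched roughly as follows. This is a minimal illustration, not the code used by buildkem: the function names and the toy `hypernym_of` dictionary are hypothetical, the dictionary stands in for the isA lookup that kcem provides, and it is assumed to contain nouns only, so no part-of-speech tagging is shown.

```python
def to_hypernyms(sentence, hypernym_of):
    """Replace each token that has a known hypernym; keep other tokens unchanged."""
    return [hypernym_of.get(token, token) for token in sentence]


def augment_corpus(corpus, hypernym_of):
    """Return the original sentences followed by their hypernym-substituted copies."""
    return corpus + [to_hypernyms(sentence, hypernym_of) for sentence in corpus]


# Hypothetical isA lookup (noun -> hypernym); in KEM this comes from kcem.
hypernym_of = {"cat": "animal", "dog": "animal"}

# Toy tokenized corpus: one list of tokens per sentence.
corpus = [["the", "cat", "sleeps"], ["a", "dog", "barks"]]

# The augmented corpus is what would then be passed to word2vec training.
augmented = augment_corpus(corpus, hypernym_of)

if __name__ == "__main__":
    for sentence in augmented:
        print(" ".join(sentence))
```

Because the substituted sentences place a hypernym in the same contexts as its hyponyms, words that share a hypernym end up sharing training contexts, which is one plausible reason the neighbors above shift toward categories.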
Built With
Python 3.5
Contributors
License
This package uses the GPL 3.0
License.