suparkanbun

Tokenizer, POS-tagger and Dependency-parser for Classical Chinese


Keywords: NLP, Chinese
License: MIT
Install: pip install suparkanbun==1.5.1


SuPar-Kanbun

Tokenizer, POS-Tagger and Dependency-Parser for Classical Chinese Texts (ๆผขๆ–‡/ๆ–‡่จ€ๆ–‡) with spaCy, Transformers and SuPar.

Basic usage

>>> import suparkanbun
>>> nlp=suparkanbun.load()
>>> doc=nlp("ไธๅ…ฅ่™Ž็ฉดไธๅพ—่™Žๅญ")
>>> print(type(doc))
<class 'spacy.tokens.doc.Doc'>
>>> print(suparkanbun.to_conllu(doc))
# text = ไธๅ…ฅ่™Ž็ฉดไธๅพ—่™Žๅญ
1	ไธ	ไธ	ADV	v,ๅ‰ฏ่ฉž,ๅฆๅฎš,็„ก็•Œ	Polarity=Neg	2	advmod	_	Gloss=not|SpaceAfter=No
2	ๅ…ฅ	ๅ…ฅ	VERB	v,ๅ‹•่ฉž,่กŒ็‚บ,็งปๅ‹•	_	0	root	_	Gloss=enter|SpaceAfter=No
3	่™Ž	่™Ž	NOUN	n,ๅ่ฉž,ไธปไฝ“,ๅ‹•็‰ฉ	_	4	nmod	_	Gloss=tiger|SpaceAfter=No
4	็ฉด	็ฉด	NOUN	n,ๅ่ฉž,ๅ›บๅฎš็‰ฉ,ๅœฐๅฝข	Case=Loc	2	obj	_	Gloss=cave|SpaceAfter=No
5	ไธ	ไธ	ADV	v,ๅ‰ฏ่ฉž,ๅฆๅฎš,็„ก็•Œ	Polarity=Neg	6	advmod	_	Gloss=not|SpaceAfter=No
6	ๅพ—	ๅพ—	VERB	v,ๅ‹•่ฉž,่กŒ็‚บ,ๅพ—ๅคฑ	_	2	parataxis	_	Gloss=get|SpaceAfter=No
7	่™Ž	่™Ž	NOUN	n,ๅ่ฉž,ไธปไฝ“,ๅ‹•็‰ฉ	_	8	nmod	_	Gloss=tiger|SpaceAfter=No
8	ๅญ	ๅญ	NOUN	n,ๅ่ฉž,ไบบ,้–ขไฟ‚	_	6	obj	_	Gloss=child|SpaceAfter=No

>>> import deplacy
>>> deplacy.render(doc)
ไธ ADV  <โ•โ•โ•โ•โ•—   advmod
ๅ…ฅ VERB โ•โ•โ•โ•—โ•โ•โ•โ•— ROOT
่™Ž NOUN <โ•— โ•‘   โ•‘ nmod
็ฉด NOUN โ•โ•<โ•   โ•‘ obj
ไธ ADV  <โ•โ•โ•โ•โ•— โ•‘ advmod
ๅพ— VERB โ•โ•โ•โ•—โ•โ•<โ• parataxis
่™Ž NOUN <โ•— โ•‘     nmod
ๅญ NOUN โ•โ•<โ•     obj

suparkanbun.load() takes two options, with the defaults suparkanbun.load(BERT="roberta-classical-chinese-base-char",Danku=False). With Danku=True the pipeline tries to segment sentences automatically. Available BERT options are "roberta-classical-chinese-base-char", "roberta-classical-chinese-large-char", "guwenbert-base", and "guwenbert-large".
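With Danku=True the result of suparkanbun.to_conllu(doc) can contain several sentences; in CoNLL-U each sentence forms a block introduced by a `# text =` comment and separated from the next by a blank line. A small helper to split such output back into per-sentence strings (hypothetical, pure standard-library Python):

```python
# Split multi-sentence CoNLL-U output into one string per sentence.
# CoNLL-U separates sentence blocks with a blank line.
def split_sentences(conllu):
    blocks = [b.strip() for b in conllu.split("\n\n")]
    return [b for b in blocks if b]

two_sentences = ("# text = 不入虎穴\n1\t不\t不\tADV\t_\t_\t2\tadvmod\t_\t_\n\n"
                 "# text = 不得虎子\n1\t不\t不\tADV\t_\t_\t2\tadvmod\t_\t_\n")
print(len(split_sentences(two_sentences)))
```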

Installation for Linux

pip3 install suparkanbun --user

Installation for Cygwin64

Make sure to install the python37-devel, python37-pip, python37-cython, python37-numpy, python37-wheel, gcc-g++, mingw64-x86_64-gcc-g++, git, curl, make and cmake packages first, and then:

curl -L https://raw.githubusercontent.com/KoichiYasuoka/CygTorch/master/installer/supar.sh | sh
pip3.7 install suparkanbun

Installation for Jupyter Notebook (Google Colaboratory)

!pip install suparkanbun 

Try the notebook on Google Colaboratory.

Author

Koichi Yasuoka (ๅฎ‰ๅฒกๅญไธ€)

Reference

Koichi Yasuoka, Christian Wittern, Tomohiko Morioka, Takumi Ikeda, Naoki Yamazaki, Yoshihiro Nikaido, Shingo Suzuki, Shigeki Moro, Kazunori Fujita: Designing Universal Dependencies for Classical Chinese and Its Application, Journal of Information Processing Society of Japan, Vol.63, No.2 (February 2022), pp.355-363.