PyCantonese: Cantonese Linguistics and NLP in Python

Full Documentation: https://pycantonese.org

PyCantonese is a Python library for Cantonese linguistics and natural language processing (NLP). Currently implemented features (more to come!):

Accessing and searching corpus data
Parsing and conversion tools for Jyutping romanization
Parsing Cantonese text
Stop words
Word segmentation
Part-of-speech tagging
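
To give a feel for what "parsing Jyutping" involves, here is a minimal standard-library sketch that splits a Jyutping string into syllables and breaks each one into onset, nucleus, coda, and tone. It is not PyCantonese's implementation or API: the name parse_jyutping_sketch, the Syllable type, and the simplified sound inventory in the pattern are all assumptions made for illustration, and the pattern does not enforce every phonotactic constraint of Cantonese.

```python
import re
from typing import List, NamedTuple

# Hypothetical, simplified Jyutping syllable pattern (an assumption for this
# sketch): optional onset, nucleus (including syllabic "m"/"ng"),
# optional coda, and a tone digit from 1 to 6.
_SYLLABLE = re.compile(
    r"^(gw|kw|ng|[bpmfdtnlgkhwzcsj])?"
    r"(aa|a|i|yu|u|oe|eo|e|o|m|ng)"
    r"(ng|[ptkmniu])?"
    r"([1-6])$"
)


class Syllable(NamedTuple):
    """One parsed syllable; absent parts are empty strings."""
    onset: str
    nucleus: str
    coda: str
    tone: str


def parse_jyutping_sketch(text: str) -> List[Syllable]:
    """Split a Jyutping string into syllables and parse each one."""
    # Each syllable is a run of lowercase letters ending in a tone digit.
    chunks = re.findall(r"[a-z]+[1-6]", text)
    if "".join(chunks) != text:
        raise ValueError(f"not a well-formed Jyutping string: {text!r}")

    syllables = []
    for chunk in chunks:
        match = _SYLLABLE.match(chunk)
        if match is None:
            raise ValueError(f"unrecognized syllable: {chunk!r}")
        # groups(default="") turns missing optional parts into "".
        syllables.append(Syllable(*match.groups(default="")))
    return syllables


if __name__ == "__main__":
    for syllable in parse_jyutping_sketch("gwong2dung1waa2"):
        print(syllable)
```

The library's own Jyutping tools are described in the full documentation and should be preferred for real work; this sketch only illustrates the syllable structure that such a parser recovers.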

Download and Install

To download and install the most recent stable version:

$ pip install --upgrade pycantonese

Ready for more? Check out the Quickstart page.

Consulting

If your team would like professional assistance in using PyCantonese, freelance consulting and training services are available for both academic and commercial groups. Please email Jackson L. Lee.

Support

If you have found PyCantonese useful and would like to offer support, buying me a coffee would go a long way!

How to Cite

PyCantonese is authored and maintained by Jackson L. Lee.

A talk introducing PyCantonese:

Lee, Jackson L. 2015. PyCantonese: Cantonese linguistic research in the age of big data. Talk at the Childhood Bilingualism Research Centre, Chinese University of Hong Kong. September 15, 2015. Notes+slides

License

MIT License. Please see LICENSE.txt in the GitHub source code for details.

The HKCanCor dataset included in PyCantonese is substantially modified from its source in terms of format. The original dataset has a CC BY license. Please see pycantonese/data/hkcancor/README.md in the GitHub source code for details.

The rime-cantonese data (release 2021.05.16) is incorporated into PyCantonese for word segmentation and characters-to-Jyutping conversion. This data has a CC BY 4.0 license. Please see pycantonese/data/rime_cantonese/README.md in the GitHub source code for details.

Logo

The PyCantonese logo is the Chinese character 粵 meaning Cantonese, with artistic design by albino.snowman (Instagram handle).

Acknowledgments

Wonderful resources with a permissive license that have been incorporated into PyCantonese:

HKCanCor
rime-cantonese

Individuals who have contributed feedback, bug reports, etc. (in alphabetical order of last names):

@cathug
Litong Chen
Jenny Chim
@g-traveller
Rachel Han
Ryan Lai
Charles Lam
Chaak Ming Lau
Hill Ma
@richielo
@rylanchiu
Stephan Stiller
Tsz-Him Tsui
Robin Yuen

Changelog

Please see CHANGELOG.md.

Setting up a Development Environment

The latest code under development is available on GitHub at jacksonllee/pycantonese. You need to have Git LFS installed on your system (run brew install git-lfs if you have Homebrew installed on macOS, or run sudo apt-get install git-lfs if you're on Ubuntu). To obtain this version for experimental features or for development:

$ git clone https://github.com/jacksonllee/pycantonese.git
$ cd pycantonese
$ git lfs pull
$ pip install -r dev-requirements.txt
$ pip install -e .

To run tests and styling checks:

$ pytest -vv --doctest-modules --cov=pycantonese pycantonese docs/source
$ flake8 pycantonese
$ black --check pycantonese

To build the documentation website files:

$ python docs/source/build_docs.py
