cypunct: Fast-ish unicode string splitting
Cypunct is designed to solve the problem of quickly splitting a Unicode string based on a set of characters.
Cypunct is designed to work on Python 2.6, 2.7, and 3.3+. Because Cypunct is a Cython extension, it will only work in the CPython runtime.
For Python versions 2.6 and 2.7, Cypunct will only run if these CPython runtimes are
Usage
Cypunct takes a Unicode string and a frozenset of delimiter characters, and splits the string based on that set.
A simple example, where we provide a small frozenset, is below.
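To make the behavior concrete, here is a minimal pure-Python sketch of the same idea: walk the string and cut it wherever a character belongs to the delimiter set. This is not Cypunct's implementation or API. The function name `split_on_delimiters` is hypothetical, and dropping empty pieces between adjacent delimiters is an assumption rather than documented Cypunct behavior.

```python
def split_on_delimiters(text, delimiters):
    """Split text wherever a character is in the delimiters set.

    Hypothetical helper for illustration only. Empty pieces between
    adjacent delimiters are dropped (an assumption, not documented
    Cypunct behavior).
    """
    pieces = []
    current = []
    for ch in text:
        if ch in delimiters:
            # Close the current piece, if any, at each delimiter.
            if current:
                pieces.append("".join(current))
                current = []
        else:
            current.append(ch)
    # Flush whatever remains after the last delimiter.
    if current:
        pieces.append("".join(current))
    return pieces


small_set = frozenset(",.! ")
print(split_on_delimiters("Hello, world! Goodbye.", small_set))
# ['Hello', 'world', 'Goodbye']
```

Membership tests against a frozenset take constant time on average, which is why the cost of this approach does not grow with the number of delimiter characters.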
However, if you only need to split on whitespace characters, str.split() will give much better performance. If you only need to split on one character, str.split(char) will also be much faster.
Cypunct really shines when you need to split on many possible characters, such as an entire Unicode character category.
The example below splits on all Unicode punctuation, and nothing else.
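As a rough illustration of splitting on an entire character category, the sketch below builds the punctuation set with the standard library's `unicodedata` module and reuses a simple pure-Python splitter. It assumes Python 3. The helper names `characters_in_category` and `split_on_delimiters` are hypothetical and not part of Cypunct, and the set comes from whatever Unicode version your interpreter ships, so it may differ slightly from Cypunct's own data.

```python
import sys
import unicodedata


def characters_in_category(prefix):
    """Return every character whose general category starts with prefix.

    Hypothetical helper; scans the full code point range once.
    """
    return frozenset(
        chr(cp)
        for cp in range(sys.maxunicode + 1)
        if unicodedata.category(chr(cp)).startswith(prefix)
    )


def split_on_delimiters(text, delimiters):
    """Split text on any character in delimiters, dropping empty pieces."""
    pieces, current = [], []
    for ch in text:
        if ch in delimiters:
            if current:
                pieces.append("".join(current))
                current = []
        else:
            current.append(ch)
    if current:
        pieces.append("".join(current))
    return pieces


PUNCTUATION = characters_in_category("P")

# Whitespace survives because only punctuation acts as a delimiter.
print(split_on_delimiters("Hello, world! ¿Qué tal?", PUNCTUATION))
# ['Hello', ' world', ' ', 'Qué tal']
```

Building the set takes a noticeable moment because it scans every code point, so it is worth doing once at import time rather than per call.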
The following Unicode classes are available as sets: 'C', 'Cc', 'Cf', 'Co', 'Cs', 'L', 'Ll', 'Lm', 'Lo', 'Lt', 'Lu', 'M', 'Mc', 'Me', 'Mn', 'N', 'Nd', 'Nl', 'No', 'P', 'Pc', 'Pd', 'Pe', 'Pf', 'Pi', 'Po', 'Ps', 'S', 'Sc', 'Sk', 'Sm', 'So', 'Z', 'Zl', 'Zp', 'Zs'
cypunct.unicode_classes.COMMON_SEPARATORS is the union of the C, P, S, and Z frozensets. I have found it personally useful when splitting text for natural language processing applications.
If you don't specify a frozenset for Cypunct to use, then Cypunct will default to COMMON_SEPARATORS.
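For readers who want a feel for what such a union contains, here is a sketch that approximates it with the standard library's `unicodedata` module (Python 3 assumed). The function name `build_common_separators` is hypothetical. Unassigned code points (category `Cn`) are excluded because `Cn` does not appear in the list of available classes above; whether Cypunct's real set matches this one exactly is an assumption, since it is generated from a specific UnicodeData.txt.

```python
import sys
import unicodedata


def build_common_separators():
    """Approximate the union of the C, P, S, and Z categories.

    Hypothetical helper. Skips 'Cn' (unassigned), which is an
    assumption based on the class list in this README.
    """
    separators = set()
    for cp in range(sys.maxunicode + 1):
        ch = chr(cp)
        category = unicodedata.category(ch)
        if category[0] in "CPSZ" and category != "Cn":
            separators.add(ch)
    return frozenset(separators)


separators = build_common_separators()

# Control characters, punctuation, symbols, and spaces are separators;
# letters and digits are not.
for ch in ["\n", ",", "$", " ", "a", "1"]:
    print(repr(ch), ch in separators)
```

Letters, marks, and numbers (L, M, N) are left out of the union, which is what makes it a reasonable default for pulling word-like tokens out of running text.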
Updating Unicode data
Currently, cypunct.unicode_classes is a Python module autogenerated from a UnicodeData.txt file. The autogeneration script exists in make_punctuation_file.py <https://github.com/jamesmishra/cypunct/blob/master/make_punctuation_file.py>.
Most Cypunct users will not need to concern themselves with this, but this is important to know if you are experiencing Unicode bugs or want to contribute to Cypunct.
The current UnicodeData.txt is from ftp://ftp.unicode.org/Public/10.0.0/ucd/UnicodeData.txt.
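If you are debugging a Unicode issue, it can help to see how category sets fall out of UnicodeData.txt. The sketch below is not the actual make_punctuation_file.py script; it is a hypothetical `parse_unicode_data` function (Python 3 assumed) that relies only on the file's documented layout: semicolon-separated fields, with the hexadecimal code point in field 0, the name in field 1, and the general category in field 2. Large blocks are listed as a pair of `First>` and `Last>` lines, which the sketch expands into every code point in between.

```python
from collections import defaultdict


def parse_unicode_data(lines):
    """Group characters by general category from UnicodeData.txt lines.

    Hypothetical sketch, not Cypunct's generator. Returns a dict that
    maps both two-letter categories ('Po') and their one-letter major
    class ('P') to frozensets of characters.
    """
    classes = defaultdict(set)
    range_start = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)
        name, category = fields[1], fields[2]
        if name.endswith(", First>"):
            # Remember where the range begins; the Last line closes it.
            range_start = code_point
            continue
        if name.endswith(", Last>") and range_start is not None:
            code_points = range(range_start, code_point + 1)
            range_start = None
        else:
            code_points = (code_point,)
        for cp in code_points:
            ch = chr(cp)
            classes[category].add(ch)
            classes[category[0]].add(ch)
    return {key: frozenset(chars) for key, chars in classes.items()}


sample = [
    "0020;SPACE;Zs;0;WS;;;;;N;;;;;",
    "0021;EXCLAMATION MARK;Po;0;ON;;;;;N;;;;;",
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "3400;<CJK Ideograph Extension A, First>;Lo;0;L;;;;;N;;;;;",
    "4DB5;<CJK Ideograph Extension A, Last>;Lo;0;L;;;;;N;;;;;",
]
classes = parse_unicode_data(sample)
print(sorted(classes))
print(len(classes["Lo"]))
```

To run it against the real file, pass an open file object instead of the sample list, for example `parse_unicode_data(open("UnicodeData.txt", encoding="utf-8"))`.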
Frequently Asked Questions (FAQ)
Q: Wouldn't this be way faster if it were written in pure C?
Yes, it would. I'm too lazy to hand-code a C CPython extension, but it's on my to-do list. Right now, Cypunct is "fast enough", and I can move on to other things in my daily life.
However, if you want to take on the challenge of rewriting Cypunct in C and having the exact same functionality as the current Cython version, I'll send you $100 USD.