charex

A Unicode and character set explorer.


License: MIT

Install: pip install charex==0.1.0

charex is a Unicode and character set explorer for understanding issues with character set translation and Unicode normalization.

Why Did I Make This?

I find the ambiguity of text data interesting. In memory it's all ones and zeros. There is nothing inherent to the data that makes 0x20 mean a space character, but we've mostly agreed that it does. That "mostly" part is what's interesting to me, and it's where a lot of fun problems lie.
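To make that concrete, here is a minimal sketch that uses only the codecs bundled with Python, not charex itself. It decodes the same byte, 0x20, under ASCII and under cp037, an EBCDIC code page picked here purely as an illustration, and then encodes a space in both:

```python
# The same byte can mean different things under different character sets.
raw = bytes([0x20])

as_ascii = raw.decode('ascii')    # ASCII: 0x20 is the space character.
as_ebcdic = raw.decode('cp037')   # cp037 (EBCDIC): 0x20 is not a space.

print(repr(as_ascii), repr(as_ebcdic))

# Going the other way, a space becomes a different byte in each set.
space_ascii = ' '.encode('ascii')
space_ebcdic = ' '.encode('cp037')

print(space_ascii.hex(), space_ebcdic.hex())
```

Nothing in the byte itself says which reading is correct. That information has to come from somewhere else, and when it is missing or wrong, the text gets mangled.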

How Do I Use This?

It's on PyPI, so you can install it with pip, provided you are using Python 3.11 or later:

pip install charex

charex has four modes of operation:

  • Direct command line invocation,
  • An interactive shell,
  • A graphical user interface (GUI),
  • An application programming interface (API).

Command Line

To get help for direct invocation from the command line:

$ charex -h

Interactive Shell

To launch the interactive shell:

$ charex

That will bring you to the charex shell:

Welcome to the charex shell.
Press ? for a list of comands.

charex>

From here you can type ? to see the list of available commands:

Welcome to the charex shell.
Press ? for a list of comands.

charex> ?
The following commands are available:

  * cd: Decode the given address in all codecs.
  * ce: Encode the given character in all codecs.
  * cl: List registered character sets.
  * clear: Clear the terminal.
  * ct: Count denormalization results.
  * dm: Build a denormalization map.
  * dn: Perform denormalizations.
  * dt: Display details for a code point.
  * el: List the registered escape schemes.
  * es: Escape a string using the given scheme.
  * fl: List registered normalization forms.
  * nl: Perform normalizations.
  * sh: Run in an interactive shell.
  * up: List the Unicode properties.
  * uv: List the valid values for a Unicode property.

For help on individual commands, use "help {command}".

charex>

Then type help followed by the name of a command to learn what it does:

charex> help dn
usage: charex dn [-h] [-m MAXDEPTH] [-n NUMBER] [-r] [-s SEED] form base

Denormalize a string.

positional arguments:
  form                  The normalization form for the denormalization. Valid
                        options are: casefold, nfc, nfd, nfkc, nfkd.
  base                  The base normalized string.

options:
  -h, --help            show this help message and exit
  -m MAXDEPTH, --maxdepth MAXDEPTH
                        Maximum number of reverse normalizations to use for
                        each character.
  -n NUMBER, --number NUMBER
                        Maximum number of results to return.
  -r, --random          Randomize the denormalization.
  -s SEED, --seed SEED  Seed the randomized denormalization.

charex>
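
If the idea of denormalization is unfamiliar: it is normalization run backwards, looking for characters that normalize to the one you started with. The following is a rough sketch of that idea using only the standard library's unicodedata module. It is a brute-force search written for illustration, not how charex implements dn, and the names denormalize_char and limit are hypothetical:

```python
import unicodedata


def denormalize_char(char, form='NFKC', limit=0x10000):
    """Find code points below `limit` that normalize to `char`.

    Brute-force illustration only; `denormalize_char` and `limit`
    are hypothetical names, not part of charex.
    """
    results = []
    for codepoint in range(limit):
        # Skip surrogates, which are not characters on their own.
        if 0xD800 <= codepoint <= 0xDFFF:
            continue
        candidate = chr(codepoint)
        # Keep candidates that differ from `char` but normalize to it.
        if candidate != char and unicodedata.normalize(form, candidate) == char:
            results.append(candidate)
    return results


# List the characters that NFKC normalization turns into a plain "a".
for found in denormalize_char('a'):
    print(f'U+{ord(found):04X} {unicodedata.name(found, "<unnamed>")}')
```

The results depend on the version of Unicode that ships with your Python, so the exact list may differ from what charex reports.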

GUI

To launch the charex GUI:

$ charex gui

API

To import charex into a Python script and get a summary of a Unicode character:

>>> import charex
>>>
>>> value = 'a'
>>> char = charex.Character(value)
>>> print(char.summarize())
a U+0061 (LATIN SMALL LETTER A)
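
For comparison, a similar one-line summary can be put together with the standard library's unicodedata module. This is only a sketch of what the summary line contains, not how charex builds it, and the summarize function below is a hypothetical helper:

```python
import unicodedata


def summarize(char):
    """Build a one-line summary from the standard library's Unicode data."""
    # Fall back to a placeholder for code points without a name.
    name = unicodedata.name(char, '<no name>')
    return f'{char} U+{ord(char):04X} ({name})'


print(summarize('a'))  # a U+0061 (LATIN SMALL LETTER A)
```

unicodedata only knows the Unicode version bundled with your Python interpreter, so names for recently added characters may be missing.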

What Is Left To Do?

The following features are planned for v0.2.1 or later releases:

  • (Done.) Add every Unihan property to charex.Character.
  • (Done.) Fix names for CJK Ideographs.
  • (Done.) Move all Y/N properties to simple lists.
  • (Done.) Move the resource file registry to sources.json.
  • Support for more versions of Unicode.
  • Emoji combiner.
  • Basic doctests for all public classes and functions.
  • Web API.
  • Registration for character set codecs.

The list of Unicode properties can be found here: Index

The list of Unihan properties is in Unicode Standard Annex #38 (tr38).

How Do I Run the Tests?

charex uses the pytest package for unit testing. It also comes with a makefile that automates testing. To run the tests:

  • Install pipenv: pip install pipenv
  • Install the development dependencies: pipenv install --dev
  • To run just the unit tests: make test
  • To run the full test suite: make pre