Unicode to 8-bit charset transliteration codec.
This package contains codecs for transliterating ISO 10646 texts into
best-effort representations using smaller coded character sets (ASCII,
ISO 8859, etc.). The translation tables used by the codecs are from
the ``transtab`` collection by Markus Kuhn.
Three types of transliterating codecs are provided:
"long", using as many characters as needed to make a natural
replacement. For example, \u00e4 LATIN SMALL LETTER A WITH
DIAERESIS ``ä`` will be replaced with ``ae``.
"short", using the minimum number of characters to make a
replacement. For example, \u00e4 LATIN SMALL LETTER A WITH
DIAERESIS ``ä`` will be replaced with ``a``.
"one", only performing single character replacements. Characters
that can not be transliterated with a single character are passed
through unchanged. For example, \u2639 WHITE FROWNING FACE ``☹``
will be passed through unchanged.
Using the codecs is simple::
>>> import translitcodec
>>> import codecs
>>> codecs.encode('fácil € ☺', 'translit/long')
'facil EUR :-)'
>>> codecs.encode('fácil € ☺', 'translit/short')
'facil E :-)'
The codecs return Unicode by default. To receive a bytestring back,
either chain the output of encode() to another codec, or append the
name of the desired byte encoding to the codec name::
>>> codecs.encode('fácil € ☺', 'translit/one').encode('ascii', 'replace')
'facil E ?'
>>> 'fácil € ☺'.encode('translit/one/ascii', 'replace')
'facil E ?'
The package also supplies a 'transliterate' codec, an alias for
'translit/long'.
Another way to use the library is to use an error handle.
Error handles are available:
* 'strict/translit/long', 'strict/translit/short', 'strict/translit/one' - similar to 'strict'
* 'ignore/translit/long', 'ignore/translit/short', 'ignore/translit/one' - similar to 'ignore'
* 'replace/translit/long', 'replace/translit/short', 'replace/translit/one' - similar to 'replace'
These error handles above, work similarly to Python's built-in ones.
The difference is that transliteration is attempted first.
>>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'replace/translit/long').decode('ISO-8859-2')
'Zażółć gęślą jaźń EUR :-)?!@#'
>>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'replace/translit/short').decode('ISO-8859-2')
'Zażółć gęślą jaźń E :-)?!@#'
>>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'replace/translit/one').decode('ISO-8859-2')
'Zażółć gęślą jaźń E ??!@#'
>>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'ignore/translit/long').decode('ISO-8859-2')
'Zażółć gęślą jaźń EUR :-)!@#'
>>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'ignore/translit/short').decode('ISO-8859-2')
'Zażółć gęślą jaźń E :-)!@#'
>>> codecs.encode('Zażółć gęślą jaźń € ☺另!@#', 'ISO-8859-2', 'ignore/translit/one').decode('ISO-8859-2')
'Zażółć gęślą jaźń E !@#'