The Real First Universal Charset Detector. No Cpp Bindings, Using Voodoo and Magical Artifacts.


Keywords
encoding, i18n, txt, text, charset, charset-detector, normalization, unicode, chardet, charset-conversion, charset-normalizer, encodings, humans, language-detection, mess, python, text-analysis
License
MIT
Install
pip install charset-normalizer==1.3.4

Documentation

Welcome to Charset Detection for Humans πŸ‘‹

The Real First Universal Charset Detector

A library that helps you read text from an unknown charset encoding.
Motivated by chardet, I'm trying to resolve the issue by taking a new approach. All IANA character set names for which the Python core library provides codecs are supported.

>>>>> ❀️ Try Me Online Now, Then Adopt Me ❀️ <<<<<

This project offers you an alternative to Universal Charset Encoding Detector, also known as Chardet.

| Feature | Chardet | Charset Normalizer | cChardet |
| --- | --- | --- | --- |
| Fast | ❌ | ❌ | βœ”οΈ |
| Universal** | ❌ | βœ”οΈ | ❌ |
| Reliable without distinguishable standards | ❌ | βœ”οΈ | βœ”οΈ |
| Reliable with distinguishable standards | βœ”οΈ | βœ”οΈ | βœ”οΈ |
| Free & Open | βœ”οΈ | βœ”οΈ | βœ”οΈ |
| License | LGPL-2.1 | MIT | MPL-1.1 |
| Native Python | βœ”οΈ | βœ”οΈ | ❌ |
| Detect spoken language | ❌ | βœ”οΈ | N/A |
| Supported Encoding | 30 | πŸŽ‰ 90 | 40 |
| Package | Accuracy | Mean per file (ns) | File per sec (est) |
| --- | --- | --- | --- |
| chardet | 93.5 % | 126 081 168 ns | 7.931 file/sec |
| cchardet | 97.0 % | 1 668 145 ns | 599.468 file/sec |
| charset-normalizer | 97.25 % | 209 503 253 ns | 4.773 file/sec |


** : Chardet and cChardet rely on encoding-specific code paths, so even though they cover most commonly used encodings, they are not truly universal.

Your support

Please ⭐ this repository if this project helped you!

✨ Installation

Using PyPI

pip install charset_normalizer

πŸš€ Basic Usage

CLI

This package comes with a CLI

usage: normalizer [-h] [--verbose] [--normalize] [--replace] [--force]
                  file [file ...]
normalizer ./data/sample.1.fr.srt
+----------------------+----------+----------+------------------------------------+-------+-----------+
|       Filename       | Encoding | Language |             Alphabets              | Chaos | Coherence |
+----------------------+----------+----------+------------------------------------+-------+-----------+
| data/sample.1.fr.srt |  cp1252  |  French  | Basic Latin and Latin-1 Supplement | 0.0 % |  84.924 % |
+----------------------+----------+----------+------------------------------------+-------+-----------+

Python

Just print out normalized text

from charset_normalizer import CharsetNormalizerMatches as CnM
print(CnM.from_path('./my_subtitle.srt').best().first())

Normalize any text file

from charset_normalizer import CharsetNormalizerMatches as CnM
try:
    CnM.normalize('./my_subtitle.srt') # should write to disk my_subtitle-***.srt
except IOError as e:
    print('Sadly, we are unable to perform charset normalization.', str(e))

Upgrade your code without effort

from charset_normalizer import detect

The above import behaves the same as `chardet.detect`, so you can swap the import without changing the rest of your code.
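For illustration, here is a stdlib-only toy showing the chardet-style result shape that `detect` mimics. The candidate list and the trivial "first codec that decodes wins" heuristic are assumptions for this sketch, not the library's actual implementation:

```python
# Toy, stdlib-only sketch of the chardet-compatible API shape.
# The codec list and heuristic below are illustrative assumptions,
# not what charset-normalizer really does.

def detect(data: bytes) -> dict:
    """Return a chardet-style result dict for raw bytes."""
    # Try a few candidate codecs, from strictest to most permissive.
    for encoding in ("ascii", "utf_8", "cp1252"):
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        # Callers written against chardet expect these three keys.
        return {"encoding": encoding, "language": "", "confidence": 1.0}
    return {"encoding": None, "language": "", "confidence": 0.0}

print(detect("CafΓ©".encode("utf-8")))
```

Because the returned dict carries the same `encoding` / `language` / `confidence` keys, code written against chardet keeps working unchanged.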

See the docs for advanced usage: readthedocs.io

πŸ˜‡ Why

When I started using Chardet, I noticed that it had become unreliable; it is also unmaintained, and most likely always will be.

I don't care about the originating charset encoding, because two different tables can produce two identical files. What I want is to get readable text, the best I can.

In a way, I'm brute forcing text decoding. How cool is that? 😎

Don't confuse the ftfy package with charset-normalizer or chardet. ftfy's goal is to repair broken Unicode strings, whereas charset-normalizer's goal is to convert a raw file in an unknown encoding to Unicode.

🍰 How

  • Discard all charset encoding tables that could not fit the binary content.
  • Measure chaos, i.e. the mess, once the content is opened with a corresponding charset encoding.
  • Extract the matches with the lowest mess detected.
  • Finally, if there are too many matches left, measure coherence.
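The four steps above can be sketched in plain Python. This is a minimal stand-in, not the library's real logic: the candidate list is arbitrary, and the chaos metric here (share of control characters) is a deliberately crude substitute for the project's actual mess measurement:

```python
# Minimal stdlib-only sketch of the detection pipeline described above.
# CANDIDATES and mess_ratio are illustrative assumptions.
import unicodedata

CANDIDATES = ("utf_8", "utf_16", "cp1252", "latin_1")

def mess_ratio(text):
    """Crude chaos score: share of control chars (newlines/tabs excluded)."""
    if not text:
        return 0.0
    bad = sum(1 for ch in text
              if unicodedata.category(ch) == "Cc" and ch not in "\r\n\t")
    return bad / len(text)

def guess(data):
    scored = []
    for i, enc in enumerate(CANDIDATES):
        try:
            text = data.decode(enc)   # step 1: discard non-fitting tables
        except (UnicodeDecodeError, UnicodeError):
            continue
        # step 2: measure chaos; index i breaks ties deterministically
        scored.append((mess_ratio(text), i, enc))
    scored.sort()                     # step 3: lowest mess first
    return scored[0][2]               # step 4 (coherence) is omitted here

print(guess("hello".encode("utf_16")))
```

Decoding UTF-16 bytes with a single-byte table scatters NUL characters through the text, so the control-character ratio alone is enough to reject the wrong candidates in this example.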

Wait a minute, what is chaos/mess and coherence according to YOU?

Chaos : I opened hundreds of text files, written by humans, with the wrong encoding table. I observed the results, then established some ground rules about what is obviously wrong when the output looks like a mess. I know my interpretation of what is chaotic is very subjective; feel free to contribute in order to improve or rewrite it.

Coherence : For each language on Earth, we have computed ranked letter-frequency records (as best we can). That intel is worth something here: I use those records against the decoded text to check whether I can detect intelligible language.
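The coherence idea can be illustrated with a tiny score: compare the most frequent letters of a decoded text against a reference ranking for a language. The `ENGLISH_TOP` list and the set-overlap scoring below are simplified assumptions, not the project's real per-language frequency records:

```python
# Illustrative coherence check against an assumed English letter ranking.
# ENGLISH_TOP and the overlap score are stand-ins for the project's
# real ranked letter-frequency records.
from collections import Counter

ENGLISH_TOP = ["e", "t", "a", "o", "i", "n", "s", "h", "r", "d"]

def coherence(text, reference=ENGLISH_TOP):
    """Share of the text's top letters that also rank high in the reference."""
    letters = [ch for ch in text.lower() if ch.isalpha()]
    if not letters:
        return 0.0
    top = [ch for ch, _ in Counter(letters).most_common(len(reference))]
    return len(set(top) & set(reference)) / len(reference)

print(coherence("the rat ate the nine red hens"))  # high overlap
print(coherence("zzzz xxxx qqqq"))                 # no overlap
```

Text decoded with the right table tends to reproduce the language's expected letter distribution, so a high overlap is evidence of an intelligible decoding, while gibberish scores near zero.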

⚑ Known limitations

  • Not intended to work on non-human-readable text content, e.g. encrypted data.
  • Language detection is unreliable when the text contains two or more languages sharing identical letters.
  • Not well tested with tiny content.

πŸ‘€ Contributing

Contributions, issues and feature requests are very much welcome.
Feel free to check issues page if you want to contribute.

πŸ“ License

Copyright Β© 2019 Ahmed TAHRI @Ousret.
This project is MIT licensed.

Letter-frequency data used in this project Β© 2012 Denny VrandečiΔ‡