org.glavo:chardet

Java Chardet is a character encoding detection library for Java.


Licenses
MPL-1.1/NGPL/LGPL-2.1+

Documentation

Java Chardet

This library is a fork of albfernandez/juniversalchardet, based on commit ff74981.

The purpose of this library is to detect the character encoding of text whose encoding is unknown.

Note: This library is in beta, and future releases may introduce breaking API changes.

Differences from upstream

The main difference between this library and upstream (and my motivation for creating this fork) is that all APIs are based on ByteBuffer instead of byte[], so this library can handle off-heap memory directly.

I've also provided byte[]-based convenience methods for these APIs, so working with byte[] isn't any more cumbersome.
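
To illustrate why a ByteBuffer-based API covers both heap and off-heap data, here is a minimal sketch that uses only the JDK. The class BufferSketch and the method isAllAscii are hypothetical stand-ins for a detector entry point and are not part of this library; the sketch only shows the general pattern of one ByteBuffer method plus a byte[] shorthand that wraps the array.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class BufferSketch {
    // Hypothetical stand-in for a detector entry point; NOT part of this library.
    // Scans the bytes between the buffer's position and limit without copying
    // them and without moving the caller's position.
    public static boolean isAllAscii(ByteBuffer buffer) {
        ByteBuffer view = buffer.duplicate(); // shared content, independent position
        while (view.hasRemaining()) {
            if (view.get() < 0) { // bytes >= 0x80 are negative as Java bytes
                return false;
            }
        }
        return true;
    }

    // byte[] shorthand: wrap the array instead of copying it.
    public static boolean isAllAscii(byte[] bytes) {
        return isAllAscii(ByteBuffer.wrap(bytes));
    }

    public static void main(String[] args) {
        // Heap data passed through the byte[] shorthand.
        byte[] heap = "hello".getBytes(StandardCharsets.US_ASCII);

        // Off-heap (direct) buffer holding UTF-8 text with one non-ASCII character.
        byte[] utf8 = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        ByteBuffer direct = ByteBuffer.allocateDirect(utf8.length);
        direct.put(utf8);
        direct.flip(); // switch from writing to reading

        System.out.println(isAllAscii(heap));
        System.out.println(isAllAscii(direct));
    }
}
```

Because ByteBuffer.wrap does not copy the array, the byte[] shorthand adds no extra allocation of the data itself, while the ByteBuffer variant accepts direct buffers and memory-mapped files as they are.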

In addition, I cleaned up parts of the library. The most important change is that this library no longer uses String to represent an encoding; it uses DetectedCharset instead. You can easily convert a DetectedCharset to a java.nio.charset.Charset:

DetectedCharset result = UniversalDetector.detectCharset(Paths.get("testfile.txt"));
Charset charset = result != null ? result.getCharset() : StandardCharsets.UTF_8;

The reason for not using Charset directly is that this library supports detection of some encodings that Java does not support (e.g. HZ-GB-2312).
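
The underlying JDK behavior can be sketched with the standard library alone. The class CharsetLookupSketch and the helper charsetOrDefault below are hypothetical names used only for illustration and are not part of this library. The sketch assumes nothing about which charsets a given JVM ships; it asks the running JVM whether a name is supported before looking it up.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetLookupSketch {
    // Returns the JVM's Charset for the given name, or the fallback when the
    // running JVM has no implementation for that (syntactically legal) name.
    // Charset.forName alone would throw UnsupportedCharsetException instead.
    public static Charset charsetOrDefault(String name, Charset fallback) {
        return Charset.isSupported(name) ? Charset.forName(name) : fallback;
    }

    public static void main(String[] args) {
        // UTF-8 is guaranteed to be available on every Java platform.
        System.out.println(charsetOrDefault("UTF-8", StandardCharsets.ISO_8859_1));

        // Whether this name resolves depends on the JVM; the fallback is used if not.
        System.out.println(charsetOrDefault("HZ-GB-2312", StandardCharsets.UTF_8));
    }
}
```

A detection result that carried only a Charset could not name an encoding the JVM lacks, which is why a separate result type is useful here.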

There are some other minor cleanups and fixes to this library. I plan to submit some patches to upstream in the future.

Adding the library to your build

Maven:

<dependency>
  <groupId>org.glavo</groupId>
  <artifactId>chardet</artifactId>
  <version>2.4.0-beta1</version>
</dependency>

Gradle:

implementation("org.glavo:chardet:2.4.0-beta1")

License

The library is subject to the Mozilla Public License Version 1.1.

Alternatively, the library may be used under the terms of either the GNU General Public License Version 2 or later, or the GNU Lesser General Public License 2.1 or later.