This library is a fork of albfernandez/juniversalchardet, based on commit ff74981.
The purpose of this library is to detect the character encoding of text whose encoding is unknown.
Note: This library is in beta, and the API may change in breaking ways in the future.
The main difference between this library and upstream (and my motivation for creating this fork) is that all APIs are based on ByteBuffer instead of byte[], so this library can directly handle off-heap memory.
Of course, I've also provided byte[]-based shorthands for these APIs, so working with byte[] isn't any more cumbersome.
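As a rough illustration of this design, here is a minimal sketch of how a byte[] shorthand can delegate to a ByteBuffer-based method, so that heap arrays and off-heap (direct) buffers share one code path. The class and method names (BufferShorthandSketch, startsWithUtf8Bom) are hypothetical and are not part of this library's API; only JDK classes are used.

```java
import java.nio.ByteBuffer;

class BufferShorthandSketch {
    // Hypothetical buffer-based routine: checks whether the remaining bytes
    // start with the UTF-8 byte order mark (EF BB BF). Absolute reads are
    // used, so the buffer's position is left unchanged.
    public static boolean startsWithUtf8Bom(ByteBuffer buf) {
        int p = buf.position();
        return buf.remaining() >= 3
                && (buf.get(p) & 0xFF) == 0xEF
                && (buf.get(p + 1) & 0xFF) == 0xBB
                && (buf.get(p + 2) & 0xFF) == 0xBF;
    }

    // byte[] shorthand: wraps the array (no copy) and delegates.
    public static boolean startsWithUtf8Bom(byte[] bytes) {
        return startsWithUtf8Bom(ByteBuffer.wrap(bytes));
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i'};

        // Off-heap buffer holding the same bytes.
        ByteBuffer offHeap = ByteBuffer.allocateDirect(data.length);
        offHeap.put(data);
        offHeap.flip();

        System.out.println(startsWithUtf8Bom(offHeap)); // direct buffer
        System.out.println(startsWithUtf8Bom(data));    // heap array
    }
}
```

Because ByteBuffer.wrap does not copy the array, a shorthand written this way adds essentially no overhead for callers that already hold a byte[].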
In addition, I cleaned up parts of the library.
The most important of these changes is that this library no longer uses String to represent an encoding; DetectedCharset is used instead.
You can easily convert a DetectedCharset to a java.nio.charset.Charset:
DetectedCharset result = UniversalDetector.detectCharset(Paths.get("testfile.txt"));
Charset charset = result != null ? result.getCharset() : StandardCharsets.UTF_8;
The reason for not using Charset directly is that this library supports detection of some encodings that Java does not support (e.g. HZ-GB-2312).
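To make the limitation concrete, the sketch below uses only JDK calls (Charset.isSupported and Charset.forName) to look up an encoding by name and fall back to a default when the running JVM has no implementation for it. The helper charsetOrDefault is a hypothetical name used for illustration and is not part of this library; whether a given name such as HZ-GB-2312 is available depends on the JVM and its installed charset providers.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

class CharsetLookupSketch {
    // Hypothetical helper: returns the JVM's Charset for the given name, or
    // the fallback if the name is syntactically illegal or not supported.
    public static Charset charsetOrDefault(String name, Charset fallback) {
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : fallback;
        } catch (IllegalCharsetNameException e) {
            return fallback;
        }
    }

    public static void main(String[] args) {
        // UTF-8 is guaranteed to be available on every Java platform.
        System.out.println(charsetOrDefault("UTF-8", StandardCharsets.ISO_8859_1));

        // The result here depends on whether this JVM ships an HZ-GB-2312 charset.
        System.out.println(charsetOrDefault("HZ-GB-2312", StandardCharsets.UTF_8));
    }
}
```

A dedicated result type avoids this lossy fallback: the detector can still report the encoding it found even when no Charset exists for it.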
There are some other minor cleanups and fixes to this library. I plan to submit some patches to upstream in the future.
Maven:
<dependency>
    <groupId>org.glavo</groupId>
    <artifactId>chardet</artifactId>
    <version>2.4.0-beta1</version>
</dependency>
Gradle:
implementation("org.glavo:chardet:2.4.0-beta1")
The library is subject to the Mozilla Public License Version 1.1.
Alternatively, the library may be used under the terms of either the GNU General Public License Version 2 or later, or the GNU Lesser General Public License 2.1 or later.