detect-character-encoding

Detect character encoding using ICU


Keywords
detect, character, encoding, charset, icu, c-plus-plus, character-encoding, javascript, nodejs
License
BSD-2-Clause
Install
npm install detect-character-encoding@0.9.0

Documentation

detect-character-encoding

Detect character encoding using ICU

Tip: If you don’t need ICU in particular, consider using ced, which is based on Google’s lighter compact_enc_det library.

Installation

$ npm install detect-character-encoding

detect-character-encoding is a C++ addon. Therefore, you may need to install various build tools. Check node-gyp’s readme for more information.

Usage

const fs = require('fs');
const detectCharacterEncoding = require('detect-character-encoding');

const fileBuffer = fs.readFileSync('file.txt');
const charsetMatch = detectCharacterEncoding(fileBuffer);

console.log(charsetMatch);
// {
//   encoding: 'UTF-8',
//   confidence: 60
// }

detect-character-encoding may return null if no charset matches.

Supported operating systems

  • macOS Sonoma
  • Ubuntu 22.04 and 20.04
  • Debian 12, 11, and 10

detect-character-encoding does not support 32-bit operating systems.

Supported character sets

As listed in ICU’s user guide:

  • UTF-8
  • UTF-16BE
  • UTF-16LE
  • UTF-32BE
  • UTF-32LE
  • Shift_JIS
  • ISO-2022-JP
  • ISO-2022-CN
  • ISO-2022-KR
  • GB18030
  • Big5
  • EUC-JP
  • EUC-KR
  • ISO-8859-1
  • ISO-8859-2
  • ISO-8859-5
  • ISO-8859-6
  • ISO-8859-7
  • ISO-8859-8
  • ISO-8859-9
  • windows-1250
  • windows-1251
  • windows-1252
  • windows-1253
  • windows-1254
  • windows-1255
  • windows-1256
  • KOI8-R
  • IBM420
  • IBM424

License

detect-character-encoding is licensed under the BSD 2-clause license but includes third-party software under different licenses. See LICENSE.md for the full license text.