Unicode-Normalization

A composer-package providing an enhanced facade to existing unicode-normalization implementations.

Installation

php composer.phar require sjorek/unicode-normalization

Usage

Unicode Normalization

<?php

/**
 * Class for normalizing unicode.
 *
 *    “Normalization: A process of removing alternate representations of equivalent
 *    sequences from textual data, to convert the data into a form that can be
 *    binary-compared for equivalence. In the Unicode Standard, normalization refers
 *    specifically to processing to ensure that canonical-equivalent (and/or
 *    compatibility-equivalent) strings have unique representations.”
 *
 *     -- quoted from unicode glossary linked below
 *
 * @see http://www.unicode.org/glossary/#normalization
 * @see http://www.php.net/manual/en/class.normalizer.php
 * @see http://www.w3.org/wiki/I18N/CanonicalNormalization
 * @see http://www.w3.org/TR/charmod-norm/
 * @see http://blog.whatwg.org/tag/unicode
 * @see http://en.wikipedia.org/wiki/Unicode_equivalence
 * @see http://stackoverflow.com/questions/7931204/what-is-normalized-utf-8-all-about
 * @see http://php.net/manual/en/class.normalizer.php
 */
class Sjorek\UnicodeNormalization\Normalizer
    implements Sjorek\UnicodeNormalization\Implementation\NormalizerInterface
{

    /**
     * Constructor.
     *
     * @param null|bool|int|string $form (optional) Set normalization form, default: NFC
     *
     * Besides the normalization form class constants defined below,
     * the following case-insensitive aliases are supported:
     * <pre>
     * - Disable unicode-normalization     : 0,  false, null, empty
     * - Ignore/skip unicode-normalization : 1,  NONE, true, binary, default, validate
     * - Normalization form D              : 2,  NFD, FORM_D, D, form-d, decompose, collation
     * - Normalization form D (mac)        : 18, NFD_MAC, FORM_D_MAC, D_MAC, form-d-mac, d-mac, mac
     * - Normalization form KD             : 3,  NFKD, FORM_KD, KD, form-kd
     * - Normalization form C              : 4,  NFC, FORM_C, C, form-c, compose, recompose, legacy, html5
     * - Normalization form KC             : 5,  NFKC, FORM_KC, KC, form-kc, matching
     * </pre>
     *
     * Hints:
     * <pre>
     * - The W3C recommends NFC for HTML5 Output.
     * - Mac OS X's HFS+ filesystem uses a NFD variant to store paths. We provide one implementation for this
     *   special variant, but plain NFD works in most cases too. Even if you use something else than NFD or its
     *   variant HFS+ will always use decomposed NFD path-strings if needed.
     * </pre>
     */
    public function __construct($form = null);

    /**
     * Ignore any decomposition/composition.
     *
     * Ignoring Implementation decomposition/composition, means nothing is automatically normalized.
     * Many Linux- and BSD-filesystems do not normalize paths and filenames, but treat them as binary data.
     * Apple™'s APFS filesystem treats paths and filenames as binary data.
     *
     * @var int
     */
    const NONE = 1;

    /**
     * Canonical decomposition.
     *
     *    “A normalization form that erases any canonical differences, and produces a
     *    decomposed result. For example, ä is converted to a + umlaut in this form.
     *    This form is most often used in internal processing, such as in collation.”
     *
     *    -- quoted from unicode glossary linked below
     *
     * @var int
     *
     * @see http://www.unicode.org/glossary/#normalization_form_d
     * @see https://developer.apple.com/library/content/qa/qa1173/_index.html
     * @see https://developer.apple.com/library/content/qa/qa1235/_index.html
     */
    const NFD = 2;

    /**
     * Compatibility decomposition.
     *
     *    “A normalization form that erases both canonical and compatibility differences,
     *    and produces a decomposed result: for example, the single ǆ character is
     *    converted to d + z + caron in this form.”
     *
     *    -- quoted from unicode glossary linked below
     *
     * @var int
     *
     * @see http://www.unicode.org/glossary/#normalization_form_kd
     */
    const NFKD = 3;

    /**
     * Canonical decomposition followed by canonical composition.
     *
     *    “A normalization form that erases any canonical differences, and generally produces
     *    a composed result. For example, a + umlaut is converted to ä in this form. This form
     *    most closely matches legacy usage.”
     *
     *    -- quoted from unicode glossary linked below
     *
     * W3C recommends NFC for HTML5 output and requires NFC for HTML5-compliant parser implementations.
     *
     * @var int
     * @var int $FORM_C
     *
     * @see http://www.unicode.org/glossary/#normalization_form_c
     */
    const NFC = 4;

    /**
     * Compatibility Decomposition followed by Canonical Composition.
     *
     *    “A normalization form that erases both canonical and compatibility differences,
     *    and generally produces a composed result: for example, the single ǆ character
     *    is converted to d + ž in this form. This form is commonly used in matching.”
     *
     *    -- quoted from unicode glossary linked below
     *
     * @var int
     * @var int $FORM_KC
     *
     * @see http://www.unicode.org/glossary/#normalization_form_kc
     */
    const NFKC = 5;

    /**
     * Apple™ Canonical decomposition for HFS Plus filesystems.
     *
     *    “For example, HFS Plus (OS X Extended) uses a variant of Normal Form D in
     *    which U+2000 through U+2FFF, U+F900 through U+FAFF, and U+2F800 through U+2FAFF
     *    are not decomposed …”
     *
     *    -- quoted from Apple™'s Technical Q&A 1173 linked below
     *
     *    “The characters with codes in the range u+2000 through u+2FFF are punctuation,
     *    symbols, dingbats, arrows, box drawing, etc. The u+24xx block, for example, has
     *    single characters for things like u+249c "⒜". The characters in this range are
     *    not fully decomposed; they are left unchanged in HFS Plus strings. This allows
     *    strings in Mac OS encodings to be converted to Implementation and back without loss of
     *    information. This is not unnatural since a user would not necessarily expect a
     *    dingbat "⒜" to be equivalent to the three character sequence "(a)" in a file name.
     *
     *    The characters in the range u+F900 through u+FAFF are CJK compatibility ideographs,
     *    and are not decomposed in HFS Plus strings.
     *
     *    So, for the example given earlier, u+00E9 ("é") must be stored as the two Implementation
     *    characters u+0065 and u+0301 (in that order). The Implementation character u+00E9 ("é")
     *    may not appear in a Implementation string used as part of an HFS Plus B-tree key.”
     *
     *    -- quoted from Apple™'s Technical Q&A 1150 linked below
     *
     * @var int
     *
     * @see NormalizerInterface::NFD
     * @see https://developer.apple.com/library/content/qa/qa1173/_index.html
     * @see https://developer.apple.com/library/content/qa/qa1235/_index.html
     * @see http://dubeiko.com/development/FileSystems/HFSPLUS/tn1150.html#CanonicalDecomposition
     * @see https://opensource.apple.com/source/libiconv/libiconv-50/libiconv/lib/utf8mac.h.auto.html
     */
    const NFD_MAC = 18; // 0x02 (NFD) | 0x10 = 0x12 (18)

    /**
     * Set the default normalization form to the given value.
     *
     * @param int|string $form
     *
     * @see \Sjorek\UnicodeNormalization\NormalizationUtility::parseForm()
     *
     * @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
     */
    public function setForm($form);

    /**
     * Retrieve the current normalization-form constant.
     *
     * @return int
     */
    public function getForm();

    /**
     * Normalizes the input provided and returns the normalized string.
     *
     * @param string $input the input string to normalize
     * @param int    $form  (optional) One of the normalization forms
     *
     * @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
     *
     * @return string the normalized string or FALSE if an error occurred
     *
     * @see http://php.net/manual/en/normalizer.normalize.php
     */
    public function normalize($input, $form = null);

    /**
     * Checks if the provided string is already in the specified normalization form.
     *
     * @param string $input The input string to normalize
     * @param int    $form  (optional) One of the normalization forms
     *
     * @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
     *
     * @return bool TRUE if normalized, FALSE otherwise or if an error occurred
     *
     * @see http://php.net/manual/en/normalizer.isnormalized.php
     */
    public function isNormalized($input, $form = null);

    /**
     * Normalizes the $string provided to the given or default $form and returns the normalized string.
     *
     * Calls underlying implementation even if given $form is NONE, but finally it normalizes only if needed.
     *
     * @param string $input the string to normalize
     * @param int    $form  (optional) normalization form to use, overriding the default
     *
     * @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
     *
     * @return null|string Normalized string or null if an error occurred
     */
    public function normalizeTo($input, $form = null);

    /**
     * Normalizes the $string provided to the given or default $form and returns the normalized string.
     *
     * Does not call underlying implementation if given normalization is NONE and normalizes only if needed.
     *
     * @param string $input the string to normalize
     * @param int    $form  (optional) normalization form to use, overriding the default
     *
     * @throws \Sjorek\UnicodeNormalization\Exception\InvalidNormalizationForm
     *
     * @return null|string Normalized string or null if an error occurred
     */
    public function normalizeStringTo($input, $form = null);

    /**
     * Get the supported unicode version level as version triple ("X.Y.Z").
     *
     * @return string
     */
    public static function getUnicodeVersion();

    /**
     * Get the supported unicode normalization forms as array.
     *
     * @return int[]
     */
    public static function getNormalizationForms();
}

Stream filtering

<?php

/**
 * @var $stream        resource    The stream to filter.
 * @var $form          string      The form to normalize unicode to.
 * @var $read_write    int         (optional) STREAM_FILTER_* constant to override the filter injection point
 * @var $params        string|int  (optional) A normalization-form alias or value
 *
 * @link http://php.net/manual/en/function.stream-filter-append.php
 * @link http://php.net/manual/en/function.stream-filter-prepend.php
 */
stream_filter_append($stream, "convert.unicode-normalization.$form"[, $read_write[, $params]]);

Note: Be careful when using on streams in r+ or w+ (or similar) modes; by default PHP will assign the filter to both the reading and writing chain. This means it will attempt to convert the data twice - first when reading from the stream, and once again when writing to it.

Examples

Unicode Normalization

<?php

use Sjorek\UnicodeNormalization\Normalizer;

$string = 'äöü';

$normalizer = new Normalizer(Normalizer::NONE);
$nfc = new Normalizer();
$nfd = new Normalizer(Normalizer::NFD);
$nfkc = new Normalizer('matching');

var_dump(
    // yields false as form NONE is never normalized
    $normalizer->isNormalized($string),

    // yields true, as NFC is the default for utf8 in the web.
    $nfc->isNormalized($string),

    // yields false
    $nfd->isNormalized($string),

    // yields false
    $nfkc->isNormalized($string),

    // yields false
    $normalizer->isNormalized($string, Normalizer::NFKD),

    // yields true
    $normalizer->normalize($string) === $string,

    // yields true
    $nfc->normalize($string) === $string,

    // yields false
    $nfd->normalize($string) === $string,

    // yields true, as only combined characters (means two or more letters in one
    // character, like the single ǆ character) are decomposed (for faster matching).
    $nfkc->normalize($string) === $string,

    Normalizer::getUnicodeVersion(),
    Normalizer::getNormalizationForms()
);

Stream filtering

<?php

$in_file = fopen('utf8-file.txt', 'r');
$out_file = fopen('utf8-normalized-to-nfc-file.txt', 'w');

// It works as a read filter:
stream_filter_append($in_file, 'convert.unicode-normalization.NFC');

// Normalization form may be given as fourth parameter:
// stream_filter_append($in_file, 'convert.unicode-normalization', null, 'NFC');

// And it also works as a write filter:
// stream_filter_append($out_file, 'convert.unicode-normalization.NFC');

stream_copy_to_stream($in_file, $out_file);