toxygene/confusables

This library is an implementation of the skeleton function described in the Confusion Detection section of the Unicode Security Mechanisms technical standard.


Keywords: confusable, php, unicode
License: MIT

Documentation

Unicode Confusables

Because Unicode contains such a large number of characters and incorporates the varied writing systems of the world, incorrect usage can expose programs or systems to possible security attacks.

- http://unicode.org/reports/tr39/

Description

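The skeleton function reduces a string to a canonical form, its skeleton, by decomposing complex Unicode graphemes and replacing visually similar characters with a common prototype. Two strings that produce the same skeleton are visually similar (confusable).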
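As a rough illustration of the idea, the sketch below follows the outline given in UTS #39: normalize to NFD, replace each character with its prototype, then normalize to NFD again. The names `sketchNfd` and `sketchSkeleton` and the small prototype table are assumptions made for this example. The real table comes from confusables.txt, and this is not the library's own code. If the intl extension is not loaded, the normalization step is skipped.

```php
<?php
// Illustrative prototype table: a few hand-picked pairs, NOT the full data set.
const SKETCH_PROTOTYPES = [
    "\u{0430}" => 'a',  // CYRILLIC SMALL LETTER A
    "\u{0435}" => 'e',  // CYRILLIC SMALL LETTER IE
    "\u{043E}" => 'o',  // CYRILLIC SMALL LETTER O
    "\u{0440}" => 'p',  // CYRILLIC SMALL LETTER ER
    "\u{0441}" => 'c',  // CYRILLIC SMALL LETTER ES
    'm'        => 'rn', // a prototype may be longer than one character
];

// Canonical decomposition (NFD) via ext-intl; returns the input unchanged
// when the extension is missing or normalization fails.
function sketchNfd(string $s): string
{
    if (!class_exists('Normalizer')) {
        return $s;
    }
    $normalized = Normalizer::normalize($s, Normalizer::FORM_D);

    return $normalized === false ? $s : $normalized;
}

// NFD, then prototype replacement, then NFD again.
// strtr() with an array makes a single pass and never re-replaces its own output.
function sketchSkeleton(string $s): string
{
    return sketchNfd(strtr(sketchNfd($s), SKETCH_PROTOTYPES));
}
```

With this table, `sketchSkeleton("p\u{0430}yp\u{0430}l")` and `sketchSkeleton('paypal')` both return `'paypal'`, so the two spellings would be treated as confusable.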

Usage

(Re)building the Class File

The Confusable class file is generated by bin/build-confusables. Composer runs that build script automatically on install and update.

The class file is built dynamically for two reasons:

  • The confusables.txt file is quite large (~120k). Caching it locally helps, but each use still requires a disk read and parsing.
  • Once the confusable rules are injected into the PHP class file, they can be stored in PHP bytecode caches.

Should the Unicode confusables.txt file be updated, developers can rerun the build script at any time, even from a cron job.
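To make the rationale concrete, here is a minimal sketch of what a build step of this kind might do: parse lines in the confusables.txt format (hex source ; hex target ; type # comment) and render the result as PHP source that a bytecode cache can hold. The function names and the two sample lines are assumptions for illustration. The actual bin/build-confusables script may work quite differently.

```php
<?php
// Encode one Unicode code point as UTF-8 without needing mbstring.
function sketchCodePointToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        return chr(0xE0 | ($cp >> 12)) . chr(0x80 | (($cp >> 6) & 0x3F))
            . chr(0x80 | ($cp & 0x3F));
    }

    return chr(0xF0 | ($cp >> 18)) . chr(0x80 | (($cp >> 12) & 0x3F))
        . chr(0x80 | (($cp >> 6) & 0x3F)) . chr(0x80 | ($cp & 0x3F));
}

// Convert a space-separated list of hex code points ("0072 006E") to UTF-8.
function sketchHexToUtf8(string $hex): string
{
    $out = '';
    foreach (preg_split('/\s+/', trim($hex), -1, PREG_SPLIT_NO_EMPTY) as $cp) {
        $out .= sketchCodePointToUtf8((int) hexdec($cp));
    }

    return $out;
}

// Build a source => prototype map from text in the confusables.txt format.
function sketchParseConfusables(string $data): array
{
    $map = [];
    foreach (preg_split('/\R/', $data) as $line) {
        $line = trim(explode('#', $line, 2)[0]); // drop trailing comments
        if ($line === '') {
            continue;
        }
        $fields = array_map('trim', explode(';', $line));
        if (count($fields) < 2) {
            continue; // not a mapping line
        }
        $map[sketchHexToUtf8($fields[0])] = sketchHexToUtf8($fields[1]);
    }

    return $map;
}

// Render the map as PHP source, so the rules live in a cacheable PHP file.
function sketchRenderPhpFile(array $map): string
{
    return "<?php\n\nreturn " . var_export($map, true) . ";\n";
}

$sample = <<<'TXT'
# Two sample lines in the confusables.txt format
0441 ; 0063 ; MA # CYRILLIC SMALL LETTER ES to LATIN SMALL LETTER C
006D ; 0072 006E ; MA # LATIN SMALL LETTER M to LATIN SMALL LETTER R, N
TXT;

$map = sketchParseConfusables($sample);
$source = sketchRenderPhpFile($map);
```

Here `$source` holds a small PHP file that returns the parsed map. Writing generated source like this to disk once means later requests load the rules as ordinary PHP instead of re-reading and re-parsing the text file.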

API

skeleton(string $a): string

Create the skeleton of a string.

Storing this value in the database will give developers a way of doing a visual uniqueness check against existing identifiers.
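A minimal sketch of such a uniqueness check follows, using an in-memory array in place of a database column with a unique index. The `SketchIdentifierRegistry` class and the toy skeleton closure are hypothetical. In a real application the callable passed in would wrap this library's skeleton function, whose exact calling convention is not shown here.

```php
<?php
// Tracks identifiers by skeleton; rejects any identifier whose skeleton is taken.
final class SketchIdentifierRegistry
{
    /** @var callable takes a string, returns its skeleton */
    private $skeleton;

    /** @var array<string, string> skeleton => first identifier registered */
    private $taken = [];

    public function __construct(callable $skeleton)
    {
        $this->skeleton = $skeleton;
    }

    // Returns true if the identifier was stored, false if it collides visually.
    public function register(string $identifier): bool
    {
        $key = ($this->skeleton)($identifier);
        if (isset($this->taken[$key])) {
            return false;
        }
        $this->taken[$key] = $identifier;

        return true;
    }
}

// Toy stand-in for a real skeleton function (two illustrative rules only).
$toySkeleton = function (string $s): string {
    return strtr($s, ["\u{0430}" => 'a', 'm' => 'rn']);
};

$registry = new SketchIdentifierRegistry($toySkeleton);
$first  = $registry->register('admin');          // stored
$second = $registry->register("\u{0430}dmin");   // Cyrillic a: rejected
$third  = $registry->register('adrnin');         // "rn" for "m": rejected
```

Keying on the skeleton rather than the raw identifier is what lets a lookalike be rejected even though its bytes differ from the stored value.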

isConfusable(string $a, string $b): bool

Check whether two strings are confusable with each other.

Under the hood, this is implemented as skeleton(A) == skeleton(B).
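The comparison can be sketched as below. The function `sketchIsConfusable` and the toy skeleton are assumptions for illustration, not the library's code. The sketch uses PHP's strict `===` operator because loose `==` compares numeric strings as numbers, which could report a match between skeletons that differ.

```php
<?php
// Two strings are treated as confusable when their skeletons are identical.
function sketchIsConfusable(callable $skeleton, string $a, string $b): bool
{
    return $skeleton($a) === $skeleton($b);
}

// Toy stand-in for a real skeleton function (one illustrative rule only).
$toy = function (string $s): string {
    return strtr($s, ["\u{0430}" => 'a']);
};

// Identity "skeleton", used only to show the loose-comparison pitfall.
$identity = function (string $s): string {
    return $s;
};

$lookalike = sketchIsConfusable($toy, "p\u{0430}ypal", 'paypal'); // true
$different = sketchIsConfusable($toy, 'paypal', 'paypa1');        // false
$loose     = ('1e3' == '1000');                                   // true in PHP
$strict    = sketchIsConfusable($identity, '1e3', '1000');        // false
```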

Warning

Case folding is not part of the skeleton algorithm. If your application needs case-insensitive identifiers, it is your responsibility to fold the strings to a single case before passing them to the skeleton function.
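One way to do that is to fold case first and take the skeleton of the result, as in the hedged sketch below. The helper names are hypothetical. The sketch uses `mb_convert_case` with `MB_CASE_FOLD` when the mbstring extension provides it, and otherwise falls back to `strtolower`, which only handles ASCII letters.

```php
<?php
// Fold case with mbstring when available; ASCII-only fallback otherwise.
function sketchFoldCase(string $s): string
{
    if (function_exists('mb_convert_case') && defined('MB_CASE_FOLD')) {
        return mb_convert_case($s, MB_CASE_FOLD, 'UTF-8');
    }

    return strtolower($s);
}

// Apply case folding before the skeleton function, as the warning requires.
function sketchFoldedSkeleton(callable $skeleton, string $s): string
{
    return $skeleton(sketchFoldCase($s));
}

// Toy stand-in for a real skeleton function (one illustrative rule only).
$toyFn = function (string $s): string {
    return strtr($s, ["\u{0430}" => 'a']);
};

$upper = sketchFoldedSkeleton($toyFn, 'PAYPAL');
$mixed = sketchFoldedSkeleton($toyFn, "P\u{0430}yPal");
```

Both calls produce `'paypal'`, so identifiers that differ only in case, or in case plus a lookalike letter, end up with the same stored value.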