THE PROJECT IS ARCHIVED
Forks: https://github.com/orsinium/forks
Homoglyphs
Homoglyphs is a Python library for getting homoglyphs and converting them to ASCII.
Features
It's a smarter version of confusable_homoglyphs:
- Automatic detection or manual selection of categories (aliases from ISO 15924).
- Automatic or manual loading of only the needed alphabets into memory.
- Conversion to ASCII.
- More configurable.
- More stable.
Installation
sudo pip install homoglyphs
Usage
The best way to explain something is to show how it works, so let's look at real usage.
Importing:
import homoglyphs as hg
Languages
# detect
hg.Languages.detect('w')
# {'pl', 'da', 'nl', 'fi', 'cz', 'sr', 'pt', 'it', 'en', 'es', 'sk', 'de', 'fr', 'ro'}
hg.Languages.detect('т')
# {'mk', 'ru', 'be', 'bg', 'sr'}
hg.Languages.detect('.')
# set()
# get alphabet for languages
hg.Languages.get_alphabet(['ru'])
# {'в', 'Ё', 'К', 'Т', ..., 'Р', 'З', 'Э'}
# get all languages
hg.Languages.get_all()
# {'nl', 'lt', ..., 'de', 'mk'}
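As the outputs above suggest, detect returns every language whose alphabet contains the character, and an empty set when none does. A toy sketch of that idea in plain Python, assuming a two-language table (TOY_LANGUAGES and toy_detect are hypothetical names made up for this illustration, not part of the library):

```python
import string

# Hypothetical two-language table; the real library ships many more alphabets.
TOY_LANGUAGES = {
    "en": set(string.ascii_letters),
    # Cyrillic A..ya (U+0410..U+044F) plus Io (U+0401, U+0451)
    "ru": {chr(cp) for cp in range(0x0410, 0x0450)} | {"\u0401", "\u0451"},
}


def toy_detect(char):
    """Return the codes of all toy languages whose alphabet contains char."""
    return {code for code, alphabet in TOY_LANGUAGES.items() if char in alphabet}


print(toy_detect("w"))       # {'en'}
print(toy_detect("\u0442"))  # {'ru'}  (Cyrillic small te)
print(toy_detect("."))       # set()
```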
Categories
Categories are script names (aliases from ISO 15924).
# detect
hg.Categories.detect('w')
# 'LATIN'
hg.Categories.detect('т')
# 'CYRILLIC'
hg.Categories.detect('.')
# 'COMMON'
# get alphabet for categories
hg.Categories.get_alphabet(['CYRILLIC'])
# {'ӗ', 'Ԍ', 'Ґ', 'Я', ..., 'Э', 'ԕ', 'ӻ'}
# get all categories
hg.Categories.get_all()
# {'RUNIC', 'DESERET', ..., 'SOGDIAN', 'TAI_LE'}
Homoglyphs
Get homoglyphs:
# get homoglyphs (latin alphabet initialized by default)
hg.Homoglyphs().get_combinations('q')
# ['q', '𝐪', '𝑞', '𝒒', '𝓆', '𝓺', '𝔮', '𝕢', '𝖖', '𝗊', '𝗾', '𝘲', '𝙦', '𝚚']
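The non-ASCII results for 'q' are styled letters from Unicode's Mathematical Alphanumeric Symbols block. As a quick cross-check that needs only the standard library (it is independent of this library and is not how it finds homoglyphs), NFKC normalization folds each of them back to a plain q. The code points are written out numerically so the example does not depend on the file's encoding:

```python
import unicodedata

# Styled variants of "q" (bold, italic, script, fraktur, double-struck,
# sans-serif, monospace, ...) given as code points.
variants = [chr(cp) for cp in (
    0x1D42A, 0x1D45E, 0x1D492, 0x1D4C6, 0x1D4FA, 0x1D52E, 0x1D562,
    0x1D596, 0x1D5CA, 0x1D5FE, 0x1D632, 0x1D666, 0x1D69A,
)]

for ch in variants:
    # NFKC applies compatibility decomposition, which maps each styled letter to "q".
    folded_char = unicodedata.normalize("NFKC", ch)
    print(f"U+{ord(ch):05X} {unicodedata.name(ch)} -> {folded_char}")

# Collect the distinct results; every variant should collapse to the same letter.
folded = {unicodedata.normalize("NFKC", ch) for ch in variants}
```

NFKC only covers compatibility characters like these. It will not map a Cyrillic look-alike to its Latin twin, which is the case this library is meant for.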
Alphabet loading:
# load alphabet on init by categories
homoglyphs = hg.Homoglyphs(categories=('LATIN', 'COMMON', 'CYRILLIC')) # alphabet loaded here
homoglyphs.get_combinations('гы')
# ['rы', 'гы', 'ꭇы', 'ꭈы', '𝐫ы', '𝑟ы', '𝒓ы', '𝓇ы', '𝓻ы', '𝔯ы', '𝕣ы', '𝖗ы', '𝗋ы', '𝗿ы', '𝘳ы', '𝙧ы', '𝚛ы']
# load alphabet on init by languages
homoglyphs = hg.Homoglyphs(languages={'ru', 'en'}) # alphabet will be loaded here
homoglyphs.get_combinations('гы')
# ['rы', 'гы']
# set the alphabet manually on init (Latin "abc" and Cyrillic "абс")
homoglyphs = hg.Homoglyphs(alphabet='abc абс')
homoglyphs.get_combinations('с')
# ['c', 'с']
# load alphabet on demand
homoglyphs = hg.Homoglyphs(languages={'en'}, strategy=hg.STRATEGY_LOAD)
# ^ alphabet will be loaded here for "en" language
homoglyphs.get_combinations('гы')
# ^ alphabet will be loaded here for "ru" language
# ['rы', 'гы']
You can combine categories, languages, alphabet and any strategies as you want. The strategies specify how to handle any characters not already loaded:
- STRATEGY_LOAD: load category for this character
- STRATEGY_IGNORE: add character to result
- STRATEGY_REMOVE: remove character from result
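To make the three strategies concrete, here is a toy model in plain Python. It is only a sketch of the behavior described above, not the library's implementation: TOY_ALPHABETS, filter_text and the string constants are hypothetical names invented for this illustration.

```python
# Hypothetical stand-ins for the library's strategy constants.
STRATEGY_LOAD, STRATEGY_IGNORE, STRATEGY_REMOVE = "load", "ignore", "remove"

# Tiny hand-made alphabets keyed by category (illustration only).
TOY_ALPHABETS = {
    "LATIN": set("abcxyz"),
    "CYRILLIC": set("\u0430\u0431\u0441"),  # Cyrillic a, be, es
}


def filter_text(text, loaded, strategy):
    """Return (kept text, loaded alphabet) after handling chars not yet loaded."""
    loaded = set(loaded)  # work on a copy so the caller's set is untouched
    kept = []
    for ch in text:
        if ch not in loaded:
            if strategy == STRATEGY_LOAD:
                # pull in every toy category that contains the character
                for alphabet in TOY_ALPHABETS.values():
                    if ch in alphabet:
                        loaded |= alphabet
            elif strategy == STRATEGY_REMOVE:
                continue  # drop the unknown character
            # STRATEGY_IGNORE: fall through and keep the character as-is
        kept.append(ch)
    return "".join(kept), loaded


text = "a\u0431c"  # Latin a, Cyrillic be, Latin c
latin = TOY_ALPHABETS["LATIN"]
print(filter_text(text, latin, STRATEGY_REMOVE)[0])  # ac
print(len(filter_text(text, latin, STRATEGY_IGNORE)[1]))  # 6 (alphabet unchanged)
print(len(filter_text(text, latin, STRATEGY_LOAD)[1]))  # 9 (Cyrillic added)
```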
Converting glyphs to ASCII chars
homoglyphs = hg.Homoglyphs(languages={'en'}, strategy=hg.STRATEGY_LOAD)
# convert
homoglyphs.to_ascii('ТЕСТ')
# ['TECT']
homoglyphs.to_ascii('ХР123.')  # this is cyrillic "х" and "р"
# ['XP123.', 'XPI23.', 'XPl23.']
# by default, strings containing chars that can't be converted are ignored
homoglyphs.to_ascii('лол')
# []
# you can set a strategy to remove unconverted non-ASCII chars from the result
homoglyphs = hg.Homoglyphs(
    languages={'en'},
    strategy=hg.STRATEGY_LOAD,
    ascii_strategy=hg.STRATEGY_REMOVE,
)
homoglyphs.to_ascii('лол')
# ['o']
# you can also set the range of allowed char codes for ASCII (0-128 by default):
homoglyphs = hg.Homoglyphs(
    languages={'en'},
    strategy=hg.STRATEGY_LOAD,
    ascii_strategy=hg.STRATEGY_REMOVE,
    ascii_range=range(ord('a'), ord('z')),
)
homoglyphs.to_ascii('ХР123.')
# ['l']
homoglyphs.to_ascii('хр123.')
# ['xpl']
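to_ascii returns a list because one glyph can have several ASCII look-alikes (the digit 1 above yields 1, I and l), so every such position multiplies the number of candidates. A minimal sketch of that idea, assuming a tiny hand-written table (TOY_TABLE and toy_to_ascii are hypothetical names; the real library uses far larger data and may order its results differently):

```python
from itertools import product

# Hypothetical look-alike table: each character maps to its ASCII candidates.
TOY_TABLE = {
    "\u0425": ["X"],       # Cyrillic capital Ha
    "\u0420": ["P"],       # Cyrillic capital Er
    "1": ["1", "I", "l"],  # digit one resembles capital I and small l
}


def toy_to_ascii(text):
    """Return every ASCII spelling of text, or [] if a char has no ASCII candidate."""
    options = []
    for ch in text:
        if ch in TOY_TABLE:
            options.append(TOY_TABLE[ch])
        elif ord(ch) < 128:
            options.append([ch])  # already ASCII, keep unchanged
        else:
            return []  # no known ASCII look-alike, so the whole string is skipped
    # Cartesian product over the per-position candidates
    return ["".join(combo) for combo in product(*options)]


print(toy_to_ascii("\u0425\u0420123."))    # ['XP123.', 'XPI23.', 'XPl23.']
print(toy_to_ascii("\u043b\u043e\u043b"))  # []
```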