anglicize

A simple package to help sort non-English names.


License
BSD-2-Clause
Install
pip install anglicize==0.0.3

Documentation

Overview

  • Free software: BSD 2-Clause License

Installation

pip install anglicize

You can also install the in-development version with:

pip install https://github.com/rciorba/python-anglicize/archive/master.zip

Documentation

This library provides one function, anglicize, which takes a string and replaces accented or otherwise modified Latin letters with visually similar A-Z letters.

To use:

# assumes the function is exposed at the package top level:
from anglicize import anglicize

# call the function directly:
anglicize("Łukasz") == "Lukasz"

# or use it as a sort key:
sorted(["Ana", "Łukasz", "Zack"], key=anglicize) == ["Ana", "Łukasz", "Zack"]

# without the key, Python compares code points and puts Ł after Z:
sorted(["Ana", "Łukasz", "Zack"]) == ["Ana", "Zack", "Łukasz"]

Rationale

The purpose of this library is to help you sort non-English names written in Latin-based alphabets.

Different languages have wildly different rules for sorting. For example, Ö comes after Z in Finnish but after O in Hungarian. The approach taken here is to treat visually similar letters the same, so ÖÔÓÒṌṎ (and others) all become O.
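As a rough illustration of the idea, here is a minimal sketch that folds a handful of O variants onto O with a hand-written table and uses the result as a sort key. The names SKETCH_TABLE and fold_similar are hypothetical and the table is deliberately tiny; this is not the library's actual mapping or implementation.

```python
# Illustrative only: a tiny hand-written table, not the library's real mapping.
SKETCH_TABLE = str.maketrans({
    "Ö": "O", "Ô": "O", "Ó": "O", "Ò": "O",
    "ö": "o", "ô": "o", "ó": "o", "ò": "o",
})


def fold_similar(text):
    """Replace each mapped letter with its visually similar A-Z letter."""
    return text.translate(SKETCH_TABLE)


names = ["Zoltan", "Örs", "Olga", "Ödön"]

# Plain sorting compares code points, so every Ö lands after Z.
by_code_point = sorted(names)   # ["Olga", "Zoltan", "Ödön", "Örs"]

# Sorting on the folded key interleaves Ö with O.
by_folded_key = sorted(names, key=fold_similar)   # ["Ödön", "Olga", "Örs", "Zoltan"]
```

Only the sort key is folded; the original spellings are returned unchanged.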

Handling letters that have little similarity to A-Z

The German ß is the main issue here. I chose to handle it like an S, mostly because it's different enough from B (the most similar visually) and because it's well known as a version of S to most Europeans.
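To see why such letters need an explicit decision, consider a generic approach based on Unicode decomposition: it strips accents from letters like ü, but leaves ß and Ł untouched because they have no decomposition. The sketch below is an assumption-laden illustration, not how this library works; OVERRIDES, strip_marks and fold are hypothetical names, and mapping ß to a single "s" is my reading of "handle it like an S".

```python
import unicodedata

# Hypothetical overrides for letters that have no Unicode decomposition.
# Mapping ß to a single "s" is an assumption; the library may differ.
OVERRIDES = str.maketrans({"ß": "s", "Ł": "L", "ł": "l"})


def strip_marks(text):
    """Drop combining marks after canonical decomposition (NFD)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def fold(text):
    """Apply the explicit overrides first, then strip remaining marks."""
    return strip_marks(text.translate(OVERRIDES))


# Decomposition alone handles ü but not ß or Ł.
accent_only = [strip_marks(s) for s in ["Müller", "Straße", "Łukasz"]]

# With the overrides, all three reduce to plain A-Z letters.
with_overrides = [fold(s) for s in ["Müller", "Straße", "Łukasz"]]
```

Whatever replacement is chosen for ß, it has to be written down explicitly, which is why the choice is called out here.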

Languages covered

  • Albanian
  • Azerbaijani
  • Bosnian
  • Bulgarian transliteration
  • Croatian
  • Dutch
  • Estonian
  • Finnish
  • French
  • Gagauz
  • German
  • Hungarian
  • Icelandic
  • Latvian
  • Lithuanian
  • Luxembourgish
  • Montenegrin
  • Norwegian
  • Polish
  • Portuguese
  • Romanian
  • Serbian
  • Spanish
  • Swedish
  • Tatar
  • Turkish
  • Turkmen

Contributing

Do you know a language written in a Latin alphabet and want to check that it is handled correctly? Have a look at tests/test_anglicize.py. If the language is there, please check that all of its "special" letters are handled. The list was compiled mostly from Wikipedia, so I would not be surprised to hear about errors :)

You can either make the changes and submit a PR, or just create an issue mentioning:

  • the language
  • the characters which need handling

Development

To run the tests for all Python environments, run:

tox