a multilingual phone tokenizer


phonepiece

phonepiece is a library for managing phone inventories. It also provides a few linguistic and phonetic tools.

It is mainly intended for use in the following projects, but it can also be used as a standalone library:

  • allosaurus: phone recognition toolkit
  • transphone: grapheme-to-phoneme toolkit
  • asr2k: speech recognition systems for 2000 languages

Install

phonepiece is available from pip:

pip install phonepiece

You can also clone this repository and install it:

python setup.py install

Usage

Inventory Lookup

The main feature of phonepiece is inventory lookup.

An inventory typically contains the following information:

  • phoneme: language-dependent units
  • phone: language-independent units
  • allophone: the mapping between phone and phoneme

A simple usage is as follows:

In [1]: from phonepiece import read_inventory                                                                                                   

In [2]: eng = read_inventory('eng')                                                                                                             

In [3]: eng                                                                                                                                     
Out[3]: <Inventory eng phoneme: 40, phone: 46>

In [4]: eng.phoneme                                                                                                                             
Out[4]: <Unit: 40 elems: {'<blk>': 0, 'a': 1, 'b': 2, 'd': 3, 'd͡ʒ': 4, 'e': 5, 'f': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 's': 16, 't': 17, 't͡ʃ': 18, 'u': 19, 'v': 20, 'w': 21, 'z': 22, 'æ': 23, 'ð': 24, 'ŋ': 25, 'ɑ': 26, 'ɔ': 27, 'ə': 28, 'ɛ': 29, 'ɡ': 30, 'ɪ': 31, 'ɹ': 32, 'ɹ̩': 33, 'ʃ': 34, 'ʊ': 35, 'ʌ': 36, 'ʒ': 37, 'θ': 38, '<eos>': 39}>

In [5]: eng.phone                                                                                                                               
Out[5]: <Unit: 46 elems: {'<blk>': 0, 'a': 1, 'b': 2, 'b̥': 3, 'd': 4, 'dʒ': 5, 'd̥': 6, 'e': 7, 'f': 8, 'g': 9, 'h': 10, 'i': 11, 'j': 12, 'k': 13, 'kʰ': 14, 'l': 15, 'm': 16, 'n': 17, 'o': 18, 'p': 19, 'pʰ': 20, 's': 21, 't': 22, 'tʃ': 23, 'tʰ': 24, 'u': 25, 'v': 26, 'w': 27, 'z': 28, 'æ': 29, 'ð': 30, 'ŋ': 31, 'ɑ': 32, 'ɔ': 33, 'ə': 34, 'ɛ': 35, 'ɡ̥': 36, 'ɪ': 37, 'ɹ': 38, 'ɹ̩': 39, 'ʃ': 40, 'ʊ': 41, 'ʌ': 42, 'ʒ': 43, 'θ': 44, '<eos>': 45}>

In [6]: eng.phoneme2phone                                                                                                                       
Out[6]: 
defaultdict(list,
            {'a': ['a'],
             'b': ['b', 'b̥'],
             'd': ['d', 'd̥'],
             'd͡ʒ': ['dʒ'],
             'e': ['e'],
             'f': ['f'],
             'h': ['h'],
             'i': ['i'],
             'j': ['j'],
             'k': ['kʰ', 'k'],
             'l': ['l'],
             'm': ['m'],
             'n': ['n'],
             'o': ['o'],
             'p': ['pʰ', 'p'],
             's': ['s'],
             't': ['tʰ', 't'],
             't͡ʃ': ['tʃ'],
             'u': ['u'],
             'v': ['v'],
             'w': ['w'],
             'z': ['z'],
             'æ': ['æ'],
             'ð': ['ð'],
             'ŋ': ['ŋ'],
             'ɑ': ['ɑ'],
             'ɔ': ['ɔ'],
             'ə': ['ə'],
             'ɛ': ['ɛ'],
             'ɡ': ['g', 'ɡ̥'],
             'ɪ': ['ɪ'],
             'ɹ': ['ɹ'],
             'ɹ̩': ['ɹ̩'],
             'ʃ': ['ʃ'],
             'ʊ': ['ʊ'],
             'ʌ': ['ʌ'],
             'ʒ': ['ʒ'],
             'θ': ['θ'],
             '<blk>': ['<blk>'],
             '<eos>': ['<eos>']})
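
The mapping above goes from phoneme to phones. If you need the reverse direction (which phoneme a given phone belongs to), it can be derived by inverting the dictionary. The sketch below is not part of phonepiece: `invert_allophones` is a hypothetical helper, and the dictionary is a hard-coded excerpt of the output above so that the example runs without the library installed.

```python
from collections import defaultdict

# Excerpt of the eng.phoneme2phone output shown above, hard-coded so that
# this sketch runs without phonepiece installed.
phoneme2phone = {
    'b': ['b', 'b̥'],
    'k': ['kʰ', 'k'],
    'p': ['pʰ', 'p'],
    't': ['tʰ', 't'],
    'æ': ['æ'],
}


def invert_allophones(mapping):
    """Build a phone -> [phonemes] map from a phoneme -> [phones] map.

    Hypothetical helper for illustration; not a phonepiece function.
    """
    phone2phoneme = defaultdict(list)
    for phoneme, phones in mapping.items():
        for phone in phones:
            phone2phoneme[phone].append(phoneme)
    return dict(phone2phoneme)


phone2phoneme = invert_allophones(phoneme2phone)
print(phone2phoneme['kʰ'])  # ['k']
```

The values are lists because, in general, the same phone may be an allophone of more than one phoneme.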

Phone Tokenization

This library also provides a tokenizer that splits a concatenated IPA string into individual phones:

In [1]: from phonepiece.ipa import read_ipa                                                         

In [2]: ipa = read_ipa()                                                                            

In [3]: ipa.tokenize('kʰæt')                                                                        
Out[3]: ['kʰ', 'æ', 't']
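
To give an intuition for this kind of splitting, here is a minimal greedy longest-match sketch. It is an assumption for illustration only: this README does not say how `ipa.tokenize` works internally, and `greedy_tokenize` and the small `units` set are hypothetical.

```python
def greedy_tokenize(text, units):
    """Split text into the longest known units, scanning left to right.

    Illustrative only; not necessarily the algorithm phonepiece uses.
    """
    max_len = max(len(u) for u in units)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink the window.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in units:
                tokens.append(piece)
                i += size
                break
        else:
            # No known unit starts at this position: skip one character.
            i += 1
    return tokens


# Toy unit set (hypothetical); a real inventory would be much larger.
units = {'k', 'kʰ', 'æ', 't', 'tʃ', 'ʃ'}
print(greedy_tokenize('kʰæt', units))  # ['kʰ', 'æ', 't']
```

Trying the longest match first is what keeps 'kʰ' together instead of splitting it into 'k' plus a stray diacritic.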

Phonological Distance

phonological_distance is an augmented edit distance that also takes the phonological distance between phones into account.

In [1]: from phonepiece.distance import phonological_distance

In [2]: phonological_distance('a', 'b')
Out[2]: 0.5862068965517241

In [3]: phonological_distance('a', 'e')
Out[3]: 0.03448275862068961

In [4]: phonological_distance('a', 'bc')
Out[4]: 1.5862068965517242
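
The outputs above are consistent with a fractional cost for substituting similar phones and a cost of about 1 for an extra phone. A minimal sketch of such a weighted edit distance follows. It is not the phonepiece implementation: `weighted_edit_distance`, `feature_cost`, and the feature values in `TOY_FEATURES` are all made up for illustration.

```python
# Toy binary feature vectors (hypothetical values, not real articulatory features).
TOY_FEATURES = {
    'a': (1, 0, 0, 1),
    'e': (1, 0, 0, 0),
    'b': (0, 1, 1, 1),
}


def feature_cost(x, y):
    """Fraction of features that differ; 1.0 if either phone is unknown."""
    if x not in TOY_FEATURES or y not in TOY_FEATURES:
        return 0.0 if x == y else 1.0
    fx, fy = TOY_FEATURES[x], TOY_FEATURES[y]
    return sum(a != b for a, b in zip(fx, fy)) / len(fx)


def weighted_edit_distance(src, tgt, sub_cost=feature_cost, indel_cost=1.0):
    """Levenshtein distance with a custom substitution cost."""
    n, m = len(src), len(tgt)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * indel_cost
    for j in range(1, m + 1):
        dp[0][j] = j * indel_cost
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                dp[i - 1][j] + indel_cost,                          # deletion
                dp[i][j - 1] + indel_cost,                          # insertion
                dp[i - 1][j - 1] + sub_cost(src[i - 1], tgt[j - 1]),  # substitution
            )
    return dp[n][m]


print(weighted_edit_distance(['a'], ['e']))       # 0.25
print(weighted_edit_distance(['a'], ['b']))       # 0.75
print(weighted_edit_distance(['a'], ['b', 'e']))  # 1.25
```

With these toy features, similar vowels end up closer than a vowel and a consonant, which mirrors the ordering of the real outputs above.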

Lexicon Lookup

It also includes lexicons for many languages, so you can look up the pronunciation of a particular word (if it exists). The output phonemes are consistent with the phoneme set of that language's inventory.

In [1]: from phonepiece.lexicon import read_lexicon

In [2]: eng = read_lexicon('eng')

In [3]: eng['hello']
Out[3]: ['h', 'ʌ', 'l', 'o', 'w']
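
The consistency claim can be checked with a simple set membership test. The sketch below uses plain Python containers as stand-ins: `toy_lexicon` and `eng_phonemes` are hard-coded from the outputs shown in this README, and `out_of_inventory` is a hypothetical helper. Whether the object returned by `read_lexicon` supports `.get` or `in` is not stated here, so a plain dict is used instead.

```python
# Subset of the phoneme symbols from the eng.phoneme output shown earlier.
eng_phonemes = {'h', 'ʌ', 'l', 'o', 'w', 'ə', 'ɛ'}

# Stand-in for a lexicon: word -> phoneme list ('hello' is the example above).
toy_lexicon = {'hello': ['h', 'ʌ', 'l', 'o', 'w']}


def out_of_inventory(pronunciation, phonemes):
    """Return the symbols of a pronunciation that are not in the phoneme set."""
    return [p for p in pronunciation if p not in phonemes]


# Missing words are handled here with dict.get on the stand-in lexicon.
pron = toy_lexicon.get('hello')
print(out_of_inventory(pron, eng_phonemes))  # []
```

An empty result means every symbol of the pronunciation belongs to the inventory's phoneme set.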

Models

model   | # supported languages | description
phoible | ~2k                   | phone/phoneme databases extracted from PHOIBLE [1]. Allophone information is from AlloVera [3]
latest  | ~8k                   | PHOIBLE database + estimated inventories based on our LREC work [2]

Acknowledgement

This repository uses code/data from the following repositories

Reference

  • [1] Moran, Steven, Daniel McCloy, and Richard Wright. "PHOIBLE online." (2014).
  • [2] Li, Xinjian, et al. "Phone Inventories and Recognition for Every Language." LREC. 2022.
  • [3] Mortensen, David R., et al. "AlloVera: A Multilingual Allophone Database." Proceedings of the 12th Language Resources and Evaluation Conference. 2020.