phonepiece
phonepiece is a library for managing phone inventories. It also includes a few linguistic and phonetic tools.
It is mainly intended for use in the following projects, but it can also be used as a standalone library:
- allosaurus: phone recognition toolkit
- transphone: grapheme-to-phoneme toolkit
- asr2k: speech recognition systems for 2000 languages
Install
phonepiece can be installed with pip:
pip install phonepiece
You can also clone this repository and install it from source:
python setup.py install
Usage
Inventory Lookup
The main feature of phonepiece is inventory lookup.
An inventory typically contains the following information:
- phoneme: language-dependent units
- phone: language-independent units
- allophone: the mapping between phone and phoneme
A simple usage is as follows:
In [1]: from phonepiece import read_inventory
In [2]: eng = read_inventory('eng')
In [3]: eng
Out[3]: <Inventory eng phoneme: 40, phone: 46>
In [4]: eng.phoneme
Out[4]: <Unit: 40 elems: {'<blk>': 0, 'a': 1, 'b': 2, 'd': 3, 'd͡ʒ': 4, 'e': 5, 'f': 6, 'h': 7, 'i': 8, 'j': 9, 'k': 10, 'l': 11, 'm': 12, 'n': 13, 'o': 14, 'p': 15, 's': 16, 't': 17, 't͡ʃ': 18, 'u': 19, 'v': 20, 'w': 21, 'z': 22, 'æ': 23, 'ð': 24, 'ŋ': 25, 'ɑ': 26, 'ɔ': 27, 'ə': 28, 'ɛ': 29, 'ɡ': 30, 'ɪ': 31, 'ɹ': 32, 'ɹ̩': 33, 'ʃ': 34, 'ʊ': 35, 'ʌ': 36, 'ʒ': 37, 'θ': 38, '<eos>': 39}>
In [5]: eng.phone
Out[5]: <Unit: 46 elems: {'<blk>': 0, 'a': 1, 'b': 2, 'b̥': 3, 'd': 4, 'dʒ': 5, 'd̥': 6, 'e': 7, 'f': 8, 'g': 9, 'h': 10, 'i': 11, 'j': 12, 'k': 13, 'kʰ': 14, 'l': 15, 'm': 16, 'n': 17, 'o': 18, 'p': 19, 'pʰ': 20, 's': 21, 't': 22, 'tʃ': 23, 'tʰ': 24, 'u': 25, 'v': 26, 'w': 27, 'z': 28, 'æ': 29, 'ð': 30, 'ŋ': 31, 'ɑ': 32, 'ɔ': 33, 'ə': 34, 'ɛ': 35, 'ɡ̥': 36, 'ɪ': 37, 'ɹ': 38, 'ɹ̩': 39, 'ʃ': 40, 'ʊ': 41, 'ʌ': 42, 'ʒ': 43, 'θ': 44, '<eos>': 45}>
In [6]: eng.phoneme2phone
Out[6]:
defaultdict(list,
{'a': ['a'],
'b': ['b', 'b̥'],
'd': ['d', 'd̥'],
'd͡ʒ': ['dʒ'],
'e': ['e'],
'f': ['f'],
'h': ['h'],
'i': ['i'],
'j': ['j'],
'k': ['kʰ', 'k'],
'l': ['l'],
'm': ['m'],
'n': ['n'],
'o': ['o'],
'p': ['pʰ', 'p'],
's': ['s'],
't': ['tʰ', 't'],
't͡ʃ': ['tʃ'],
'u': ['u'],
'v': ['v'],
'w': ['w'],
'z': ['z'],
'æ': ['æ'],
'ð': ['ð'],
'ŋ': ['ŋ'],
'ɑ': ['ɑ'],
'ɔ': ['ɔ'],
'ə': ['ə'],
'ɛ': ['ɛ'],
'ɡ': ['g', 'ɡ̥'],
'ɪ': ['ɪ'],
'ɹ': ['ɹ'],
'ɹ̩': ['ɹ̩'],
'ʃ': ['ʃ'],
'ʊ': ['ʊ'],
'ʌ': ['ʌ'],
'ʒ': ['ʒ'],
'θ': ['θ'],
'<blk>': ['<blk>'],
'<eos>': ['<eos>']})
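The mapping above goes from phoneme to phones. When you start from a recognized phone and want the phoneme it belongs to, you need the reverse direction. The sketch below inverts a small excerpt of the output above, copied into a plain dict so it runs without phonepiece installed. The helper name invert_allophones is hypothetical and is not part of the phonepiece API; the library may already expose an equivalent mapping.

```python
from collections import defaultdict

# Excerpt of the eng.phoneme2phone output shown above, copied as a plain dict.
phoneme2phone = {
    "b": ["b", "b̥"],
    "k": ["kʰ", "k"],
    "p": ["pʰ", "p"],
    "t": ["tʰ", "t"],
    "æ": ["æ"],
}


def invert_allophones(mapping):
    """Build a phone -> phonemes mapping from a phoneme -> phones mapping.

    Hypothetical helper for illustration; not a phonepiece function.
    """
    phone2phoneme = defaultdict(list)
    for phoneme, phones in mapping.items():
        for phone in phones:
            phone2phoneme[phone].append(phoneme)
    return dict(phone2phoneme)


phone2phoneme = invert_allophones(phoneme2phone)
print(phone2phoneme["kʰ"])  # ['k']
```

The values are kept as lists because, in general, one phone can be an allophone of more than one phoneme.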
Phone Tokenization
This library also provides a tokenizer that splits a concatenated IPA string into individual phones:
In [1]: from phonepiece.ipa import read_ipa
In [2]: ipa = read_ipa()
In [3]: ipa.tokenize('kʰæt')
Out[3]: ['kʰ', 'æ', 't']
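One common way to get this behaviour is greedy longest-match segmentation against a set of known units. The sketch below illustrates that idea only; it is an assumption about how such a tokenizer could work, not a description of how phonepiece implements tokenize. The function greedy_tokenize and the tiny UNITS set are hypothetical.

```python
def greedy_tokenize(text, units):
    """Split text into the longest known units, scanning left to right.

    Hypothetical illustration; not the phonepiece implementation.
    """
    max_len = max(len(unit) for unit in units)
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that 'kʰ' wins over 'k'.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in units:
                tokens.append(candidate)
                i += size
                break
        else:
            raise ValueError(f"no known unit at position {i}: {text[i]!r}")
    return tokens


# Toy unit set for the example; a real IPA inventory is much larger.
UNITS = {"k", "kʰ", "æ", "t", "tʰ"}
print(greedy_tokenize("kʰæt", UNITS))  # ['kʰ', 'æ', 't']
```

Trying the longest candidate first is what keeps a base symbol and its diacritic together as one phone.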
Phonological Distance
phonological_distance is an augmented edit distance that also takes the phonological distance between phones into account. In the example below, a and e are much closer to each other than a and b.
In [1]: from phonepiece.distance import phonological_distance
In [2]: phonological_distance('a', 'b')
Out[2]: 0.5862068965517241
In [3]: phonological_distance('a', 'e')
Out[3]: 0.03448275862068961
In [4]: phonological_distance('a', 'bc')
Out[4]: 1.5862068965517242
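The sample outputs are consistent with a standard edit distance in which substitution has a fractional, phone-dependent cost and insertion or deletion costs 1: the distance for 'a' vs 'bc' is exactly 1 more than for 'a' vs 'b'. The sketch below shows that general technique. It is not the phonepiece implementation, and the cost table is a hypothetical stand-in whose two values were chosen to match the outputs above; how the library derives its costs is not shown here.

```python
def weighted_edit_distance(source, target, sub_cost, indel_cost=1.0):
    """Edit distance with a caller-supplied substitution cost function."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = i * indel_cost
    for j in range(1, cols):
        dist[0][j] = j * indel_cost
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost,  # delete source[i-1]
                dist[i][j - 1] + indel_cost,  # insert target[j-1]
                dist[i - 1][j - 1] + sub_cost(source[i - 1], target[j - 1]),
            )
    return dist[-1][-1]


# Hypothetical substitution costs, picked to match the sample outputs above.
TOY_COSTS = {
    frozenset(("a", "b")): 17 / 29,
    frozenset(("a", "e")): 1 / 29,
}


def toy_sub_cost(x, y):
    """Return 0 for identical phones, a table value if known, else 1.0."""
    if x == y:
        return 0.0
    return TOY_COSTS.get(frozenset((x, y)), 1.0)


print(weighted_edit_distance(["a"], ["b", "c"], toy_sub_cost))
```

The inputs are lists of phones rather than raw strings so that multi-character phones such as 'kʰ' count as a single unit.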
Lexicon Lookup
phonepiece also includes many lexicon dictionaries, so you can look up the pronunciation of a word if it is in the lexicon. The output phonemes are consistent with the phoneme space of that language's inventory.
In [1]: from phonepiece.lexicon import read_lexicon
In [2]: eng = read_lexicon('eng')
In [3]: eng['hello']
Out[3]: ['h', 'ʌ', 'l', 'o', 'w']
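"Consistent with the inventory" means that every symbol in a pronunciation is also a phoneme of that language's inventory. The sketch below checks this for the 'hello' output, using an excerpt of the English phoneme set copied from the inventory output earlier in this document. It uses plain Python data so it runs without phonepiece installed; the helper out_of_inventory is hypothetical and not part of the library.

```python
# Excerpt of the English phoneme set shown in the inventory example.
ENG_PHONEMES = {
    "a", "b", "d", "e", "f", "h", "i", "j", "k", "l", "m", "n",
    "o", "p", "s", "t", "u", "v", "w", "z", "æ", "ð", "ʌ", "θ",
}


def out_of_inventory(pronunciation, phonemes):
    """Return the symbols of a pronunciation that are not known phonemes.

    Hypothetical helper for illustration; not a phonepiece function.
    """
    return [symbol for symbol in pronunciation if symbol not in phonemes]


hello = ["h", "ʌ", "l", "o", "w"]  # the eng['hello'] output shown above
print(out_of_inventory(hello, ENG_PHONEMES))  # []
```

An empty result means the pronunciation can be passed to anything that expects phonemes from the same inventory.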
Models
model | # supported languages | description |
---|---|---|
phoible | ~2k | phone/phoneme databases extracted from Phoible [1]. Allophone information is from AlloVera [3] |
latest | ~8k | Phoible database + estimated inventory based on our LREC work [2] |
Acknowledgement
This repository uses code/data from the following repository
Reference
- [1] Moran, Steven, Daniel McCloy, and Richard Wright. "PHOIBLE online." (2014).
- [2] Li, Xinjian, et al. "Phone Inventories and Recognition for Every Language." LREC 2022.
- [3] Mortensen, David R., et al. "AlloVera: A Multilingual Allophone Database." Proceedings of the 12th Language Resources and Evaluation Conference. 2020.