Ukrainian word stress
Word stress is an emphasis we place on a particular syllable of a word as we pronounce it: ма́ма
This package takes text in Ukrainian and adds the stress mark after an accented vowel. This is useful in speech synthesis applications and for preparing text for language learners.
Example
From Python
>>> from ukrainian_word_stress import Stressifier
>>> text = """Потяг зупинився, ми зійшли на платформу. Було тихо, широкі навскісні промені золотили повітря, заважаючи бачити речі такими, якими вони були. Третя по обіді. Жодної живої душі. Найкращий час для урочистих відвідин померлих. Взяли в привокзальному торбу вина, рушили вздовж колій, піщаною стежкою."""
>>> stressify = Stressifier()
>>> stressify(text)
'Потяг зупини´вся, ми зійшли´ на платфо´рму. Було´ ти´хо, широ´кі навскі´сні
про´мені золоти´ли пові´тря, заважа´ючи ба´чити ре´чі таки´ми, яки´ми вони´
були´. Тре´тя по обі´ді. Жодної живої душі´. Найкра´щий час для урочи´стих
відві´дин поме´рлих. Взя´ли в привокза´льному то´рбу вина, ру´шили вздовж
ко´лій, піща´ною сте´жкою.'
The ukrainian_word_stress.Stressifier class has optional arguments for fine-grained configuration (see the sections below). For example:
>>> from ukrainian_word_stress import Stressifier, StressSymbol
>>> stressify = Stressifier(stress_symbol=StressSymbol.CombiningAcuteAccent)
>>> stressify(text)
'Потяг зупини́вся, ми зійшли́ на платфо́рму. Було́ ти́хо, широ́кі навскі́сні про́мені
золоти́ли пові́тря, заважа́ючи ба́чити ре́чі таки́ми, яки́ми вони́ були́. Тре́тя по
обі́ді. Жодної живої душі́. Найкра́щий час для урочи́стих відві́дин поме́рлих. Взя́ли
в привокза́льному то́рбу вина, ру́шили вздовж ко́лій, піща́ною сте́жкою.'
From command-line
$ echo 'Золоті яйця, але нема ні яйця' | ukrainian-word-stress
Золоті´ я´йця, але´ нема´ ні яйця´
Setup
$ pip install ukrainian-word-stress
Note that the first call downloads around 500 MB of Stanza resources.
The default location for them is ~/stanza_resources
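To see whether the resources are already on disk and how much space they take, a small standard-library check can help. This is only a sketch: the helper name `directory_size_bytes` is made up for illustration, and the path is the default location mentioned above (your setup may use a different one).

```python
from pathlib import Path


def directory_size_bytes(path):
    """Return the total size of all regular files under `path` (0 if missing)."""
    root = Path(path).expanduser()
    if not root.is_dir():
        return 0
    # Walk the tree recursively and add up file sizes.
    return sum(f.stat().st_size for f in root.rglob("*") if f.is_file())


if __name__ == "__main__":
    # Assumed default location, taken from the note above.
    size = directory_size_bytes("~/stanza_resources")
    if size == 0:
        print("No Stanza resources found; the first call will download them.")
    else:
        print(f"Stanza resources take about {size / 1_000_000:.0f} MB")
```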
Handling ambiguity
Some words have different pronunciations and meanings but share the same spelling. These are so-called heteronyms.
In most cases, this happens when one spelling corresponds to different grammatical forms of a word (singular/plural, case). For example:
- блохи́ - genitive singular ("немає ані блохи́", "there is not a single flea")
- бло́хи - nominative plural ("повсюди були бло́хи", "fleas were everywhere")
We handle this more or less correctly by running morphological and part-of-speech analysis of the text with Stanza.
A much smaller category of heteronyms is where words have completely different meanings:
- а́тлас - a collection of maps (atlas)
- атла́с - a fabric (satin)
Resolving this is much harder and sometimes impossible.
There is no ideal solution to heteronym ambiguity, so we let you decide what to do in such cases. Possible strategies are:
- skip: do not place stress at all (this is the default).
- all: return all possible options at once. This shows up as multiple stress symbols in one word (за´мо´к).
- first: place the stress of the first match, with a high chance of being incorrect. Essentially, this is a random guess at the heteronym's meaning.
The strategy can be configured via the --on-ambiguity parameter of the command-line utility. In Python, use the on_ambiguity parameter of the ukrainian_word_stress.Stressifier class.
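The effect of the three strategies can be illustrated with a toy model. The sketch below is not the package's implementation: `place_stress` and its `candidates` argument are hypothetical names, and the candidate vowel positions are supplied by hand instead of coming from a dictionary or from Stanza.

```python
from typing import Sequence

ACUTE = "\u00b4"  # the default stress mark (U+00B4)


def place_stress(word: str, candidates: Sequence[int], on_ambiguity: str = "skip") -> str:
    """Insert a stress mark after each chosen vowel.

    `candidates` holds 0-based indices of vowels that may carry the stress.
    More than one candidate means the word is ambiguous (a heteronym).
    """
    if len(candidates) > 1:
        if on_ambiguity == "skip":
            chosen = []                   # leave the word unstressed
        elif on_ambiguity == "first":
            chosen = [candidates[0]]      # guess: take the first option
        elif on_ambiguity == "all":
            chosen = list(candidates)     # mark every option
        else:
            raise ValueError(f"unknown strategy: {on_ambiguity!r}")
    else:
        chosen = list(candidates)         # unambiguous: nothing to decide

    result = []
    for index, char in enumerate(word):
        result.append(char)
        if index in chosen:
            result.append(ACUTE)          # the mark goes after the vowel
    return "".join(result)


if __name__ == "__main__":
    for strategy in ("skip", "first", "all"):
        print(strategy, place_stress("замок", [1, 3], strategy))
```

With the hand-picked candidates for "замок", this prints the unstressed word for skip, a single mark for first, and two marks for all, mirroring the behaviour described in the list above.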
Stress mark symbols
By default, the Unicode Acute Accent symbol is used: “´” (U+00B4).
In print, the Combining Acute Accent (U+0301) is more common and visually less intrusive.
This can be turned on by passing "--symbol=combining" to the CLI utility, or stress_symbol=StressSymbol.CombiningAcuteAccent to the Stressifier class.
Note that some platforms (Windows, for example) render it incorrectly.
You can also pass custom characters in place of these two:
$ echo 'олені небриті і не голені.' | ukrainian-word-stress --symbol +
о+лені небри+ті і не го+лені.
$ echo 'олені небриті і не голені.' | ukrainian-word-stress --symbol combining
о́лені небри́ті і не го́лені.
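If text has already been stressified with one symbol, it can be converted to another afterwards with plain string replacement. The helpers below are a sketch using only the standard library; `replace_stress_symbol` and `strip_stress` are hypothetical names, not functions of this package.

```python
ACUTE = "\u00b4"            # spacing acute accent, the default mark
COMBINING_ACUTE = "\u0301"  # combining acute accent


def replace_stress_symbol(text, old=ACUTE, new=COMBINING_ACUTE):
    """Swap one stress mark for another in already stressified text."""
    return text.replace(old, new)


def strip_stress(text):
    """Remove both kinds of acute stress marks, restoring plain text."""
    return text.replace(ACUTE, "").replace(COMBINING_ACUTE, "")


if __name__ == "__main__":
    stressed = "о\u00b4лені небри\u00b4ті і не го\u00b4лені."
    print(replace_stress_symbol(stressed))           # combining accents
    print(replace_stress_symbol(stressed, new="+"))  # custom symbol
    print(strip_stress(stressed))                    # no marks at all
```

Stripping U+0301 is safe for precomposed (NFC) Ukrainian text, because letters such as й and ї do not contain that code point.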
Variative stress
Some words allow for multiple stress positions. For example, по́милка and поми́лка are both acceptable. For such words we return double stress:
$ echo помилка | ukrainian-word-stress
по´ми´лка
Debugging and reporting issues
Use the --verbose switch to get info useful for debugging.
If you believe that you have found a bug, please open a GitHub issue.
But first, make sure that the bug is not related to heteronym disambiguation.
For example, if you see that some word lacks an accent, add the --on-ambiguity=all switch to see if it is a heteronym. If the word in question then has multiple accents, that's a heteronym, not a bug:
$ echo замок | ukrainian-word-stress --on-ambiguity=all
за´мо´к
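When checking longer output, it can help to list every word that came back with more than one stress mark, since those are either heteronyms (with --on-ambiguity=all) or words with variable stress. The following is a rough heuristic sketch; `words_with_multiple_stress` is a hypothetical helper, and it only splits on whitespace and trims common punctuation.

```python
ACUTE = "\u00b4"            # spacing acute accent, the default mark
COMBINING_ACUTE = "\u0301"  # combining acute accent


def words_with_multiple_stress(text):
    """Return whitespace-separated tokens that carry two or more stress marks."""
    found = []
    for token in text.split():
        marks = token.count(ACUTE) + token.count(COMBINING_ACUTE)
        if marks > 1:
            # Trim surrounding punctuation so only the word itself is reported.
            found.append(token.strip(".,;:!?\"'()"))
    return found


if __name__ == "__main__":
    output = "за\u00b4мо\u00b4к і по\u00b4ми\u00b4лка, але ма\u00b4ма"
    print(words_with_multiple_stress(output))
```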