ukrainian-word-stress

Find word stress for texts in Ukrainian


Keywords
ukrainian, nlp, word, stress, accents, dictionary, linguistics
License
MIT
Install
pip install ukrainian-word-stress==1.1.0

Documentation

Ukrainian word stress

Word stress is the emphasis we place on a particular syllable of a word as we pronounce it, for example ма́ма.

This package takes text in Ukrainian and adds the stress mark after an accented vowel. This is useful in speech synthesis applications and for preparing text for language learners.

Example

From Python

>>> from ukrainian_word_stress import Stressifier
>>> text = """Потяг зупинився, ми зійшли на платформу. Було тихо, широкі навскісні промені золотили повітря, заважаючи бачити речі такими, якими вони були. Третя по обіді. Жодноі живоі душі. Найкращий час для урочистих відвідин померлих. Взяли в привокзальному торбу вина, рушили вздовж колій, піщаною стежкою."""
>>> stressify = Stressifier()
>>> stressify(text)

'Потяг зупини´вся, ми зійшли´ на платфо´рму. Було´ ти´хо, широ´кі навскі´сні
про´мені золоти´ли пові´тря, заважа´ючи ба´чити ре´чі таки´ми, яки´ми вони´
були´. Тре´тя по обі´ді. Жодноі живоі душі´. Найкра´щий час для урочи´стих
відві´дин поме´рлих. Взя´ли в привокза´льному то´рбу вина, ру´шили вздовж
ко´лій, піща´ною сте´жкою.'

The ukrainian_word_stress.Stressifier class has optional arguments for fine-grained configuration (see the sections below). For example:

>>> from ukrainian_word_stress import Stressifier, StressSymbol
>>> stressify = Stressifier(stress_symbol=StressSymbol.CombiningAcuteAccent)
>>> stressify(text)

'Потяг зупини́вся, ми зійшли́ на платфо́рму. Було́ ти́хо, широ́кі навскі́сні про́мені
золоти́ли пові́тря, заважа́ючи ба́чити ре́чі таки́ми, яки́ми вони́ були́. Тре́тя по
обі́ді. Жодноі живоі душі́. Найкра́щий час для урочи́стих відві́дин поме́рлих. Взя́ли
в привокза́льному то́рбу вина, ру́шили вздовж ко́лій, піща́ною сте́жкою.'

From command-line

$ echo 'Золоті яйця, але нема ні яйця' | ukrainian-word-stress
Золоті´ я´йця, але´ нема´ ні яйця´
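
Because the stress mark is placed right after the accented vowel, downstream code (a speech synthesis front end, for instance) can recover the plain word and the position of each stressed vowel from the output. The following is a minimal sketch of that idea. The helper name split_stress is hypothetical and is not part of this package; it assumes the default "´" (U+00B4) mark unless another symbol is passed.

```python
from typing import List, Tuple


def split_stress(word: str, symbol: str = "\u00b4") -> Tuple[str, List[int]]:
    """Split a stressed word into the plain word and stressed-vowel indices.

    Hypothetical helper, not part of ukrainian-word-stress. Each index
    points at the character in the plain word that preceded a stress mark.
    """
    plain_chars: List[str] = []
    positions: List[int] = []
    for ch in word:
        if ch == symbol:
            # The mark follows the accented vowel, so the vowel is the
            # last character collected so far.
            if plain_chars:
                positions.append(len(plain_chars) - 1)
        else:
            plain_chars.append(ch)
    return "".join(plain_chars), positions


plain, stressed_at = split_stress("ма\u00b4ма")
print(plain, stressed_at)
```

A word with several marks simply yields several indices, so the caller decides how to treat it.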

Setup

$ pip install ukrainian-word-stress

Note that the first call downloads around 500 MB of Stanza resources. By default, they are stored in ~/stanza_resources.

Handling ambiguity

Some words have different pronunciations and meanings but share the same spelling. These are so-called heteronyms.

In most cases, this happens when different grammatical forms of a word (singular/plural, case) are spelled the same. For example:

  • блохи́ - genitive singular ("немає ані блохи́", "there isn't a single flea")
  • бло́хи - nominative plural ("повсюди були бло́хи", "there were fleas everywhere")

We handle this more or less correctly by running morphological and part-of-speech (POS) analysis of the text with Stanza.

A much smaller category of heteronyms is where words have completely different meanings:

  • а́тлас - a collection of maps (an atlas)
  • атла́с - a fabric (satin)

Resolving this is much harder and sometimes impossible.

There's no ideal solution to heteronym ambiguity. We let you decide what to do in such cases. Possible strategies are:

  • skip: do not place stress at all (this is the default).

  • all: return all possible options at once. This will look like multiple stress symbols in one word (за´мо´к).

  • first: place the stress according to the first match. This is essentially a random guess at the heteronym's meaning, with a high chance of being incorrect.

The strategy can be configured via the --on-ambiguity parameter of the command-line utility. In Python, use the on_ambiguity parameter of the ukrainian_word_stress.Stressifier class.
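
If you drive the command-line utility from a script, the strategy can be passed as an ordinary argument. The sketch below only assembles the switches shown in this document (--on-ambiguity, --symbol, --verbose) and pipes text through the installed ukrainian-word-stress executable. The function names build_command and stressify_via_cli are hypothetical, and the sketch assumes the executable is on PATH.

```python
import subprocess
from typing import List, Optional

# Strategy names as listed above.
STRATEGIES = ("skip", "all", "first")


def build_command(on_ambiguity: Optional[str] = None,
                  symbol: Optional[str] = None,
                  verbose: bool = False) -> List[str]:
    """Assemble the argument list for the ukrainian-word-stress CLI."""
    cmd = ["ukrainian-word-stress"]
    if on_ambiguity is not None:
        if on_ambiguity not in STRATEGIES:
            raise ValueError(f"unknown strategy: {on_ambiguity!r}")
        cmd.append(f"--on-ambiguity={on_ambiguity}")
    if symbol is not None:
        cmd.append(f"--symbol={symbol}")
    if verbose:
        cmd.append("--verbose")
    return cmd


def stressify_via_cli(text: str, **options) -> str:
    """Pipe text through the CLI, as `echo ... | ukrainian-word-stress` does."""
    result = subprocess.run(
        build_command(**options),
        input=text,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,  # raise if the utility exits with an error
    )
    return result.stdout


print(build_command(on_ambiguity="all", symbol="combining"))
```

Validating the strategy name before spawning the process keeps typos from turning into a confusing CLI error.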

Stress mark symbols

By default, the Unicode Acute Accent symbol is used: “´” (U+00B4).

In print, the Combining Acute Accent (U+0301) is more common and visually less intrusive. It can be turned on by passing "--symbol=combining" to the CLI utility, or stress_symbol=StressSymbol.CombiningAcuteAccent to the Stressifier class.

Note that some platforms (Windows, for example) render it incorrectly.

You can also pass custom characters in place of these two:

$ echo 'олені небриті і не голені.' | ukrainian-word-stress --symbol +
о+лені небри+ті і не го+лені.

$ echo 'олені небриті і не голені.' | ukrainian-word-stress --symbol combining
о́лені небри́ті і не го́лені.
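
Text that has already been stressed with one symbol can be switched to the other without running the package again, since the mark always sits in the same place (right after the vowel). A minimal sketch follows. The function names are hypothetical, and the approach assumes the input contains no acute accents other than stress marks.

```python
ACUTE_ACCENT = "\u00b4"            # "´", the default stress mark
COMBINING_ACUTE_ACCENT = "\u0301"  # combining mark, rendered over the vowel


def to_combining(text: str) -> str:
    """Replace each spacing acute accent with the combining one."""
    return text.replace(ACUTE_ACCENT, COMBINING_ACUTE_ACCENT)


def to_spacing(text: str) -> str:
    """Replace each combining acute accent with the spacing one."""
    return text.replace(COMBINING_ACUTE_ACCENT, ACUTE_ACCENT)


print(to_combining("о\u00b4лені небри\u00b4ті і не го\u00b4лені."))
```

This is handy when the same stressed text is needed both for print and for a platform that renders the combining mark poorly.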

Variative stress

Some words allow for multiple stress positions. For example, по́милка and поми́лка are both acceptable. For such words we return double stress:

$ echo помилка | ukrainian-word-stress
по´ми´лка

Debugging and reporting issues

Use the --verbose switch to get info useful for debugging.

If you believe that you have found a bug, please open a GitHub issue.

But first, make sure that the bug is not related to heteronym disambiguation. For example, if you see that some word lacks an accent, add the --on-ambiguity=all switch to see if it is a heteronym. If the word in question then has multiple accents, that's a heteronym, not a bug:

$ echo замок | ukrainian-word-stress --on-ambiguity=all
за´мо´к
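
For longer texts, checking the output by eye is tedious. The sketch below lists the words that carry more than one stress mark in text produced with --on-ambiguity=all, which narrows down the candidates before you file an issue. The helper name is hypothetical. Words with variative stress (such as по´ми´лка) also carry two marks, so the result is a list of candidates to inspect, not a definitive list of heteronyms.

```python
from typing import List

# Punctuation trimmed from word edges; apostrophes inside a word are kept.
EDGE_PUNCTUATION = ".,;:!?\"'()«»"


def words_with_multiple_marks(text: str, symbol: str = "\u00b4") -> List[str]:
    """Return the words in `text` that contain more than one stress mark."""
    words = (w.strip(EDGE_PUNCTUATION) for w in text.split())
    return [w for w in words if w.count(symbol) > 1]


print(words_with_multiple_marks("за\u00b4мо\u00b4к"))
```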
