spoteno

spoteno (Spoken-Text-Normalization) is a tool for cleaning up text transcripts for speech recognition systems. These systems normally expect target transcripts to contain only characters from a restricted set.

Installation

Install the latest development version:

pip install git+https://github.com/ynop/spoteno.git

Examples

The default use case is to normalize a sentence. This forces the output string to contain only valid characters (as defined by the configuration).

import spoteno

sentence = ('Am 11. Januar geht er um 5m nach links,'
            'weshalb er $d schon "ziemlich" müde ist.')

norm = spoteno.Normalizer.de()
outsent = norm.normalize(sentence)
print(outsent)

# >>> am elfte januar geht er um fünf m nach links weshalb er d schon ziemlich müde ist

With force=False, the final cleanup can be disabled. Invalid characters may then occur in the output if the configuration does not handle them explicitly.

outsent = norm.normalize(sentence, force=False)
print(outsent)

# >>> am elfte januar geht er um fünf m nach links weshalb er $d schon ziemlich müde ist
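The effect of the final cleanup can be pictured as a filter over a set of allowed characters. The following is a minimal sketch under stated assumptions, not spoteno's implementation: `VALID_CHARS` and `force_cleanup` are hypothetical names, and the character set is an assumption chosen for this German example.

```python
import string

# Hypothetical set of allowed characters (an assumption for illustration);
# in spoteno the valid characters are defined by the configuration.
VALID_CHARS = set(string.ascii_lowercase) | set("äöüß ")


def force_cleanup(text, valid_chars=VALID_CHARS):
    """Drop every character of text that is not in valid_chars."""
    return "".join(ch for ch in text if ch in valid_chars)


partly_normalized = "weshalb er $d schon ziemlich müde ist"
cleaned = force_cleanup(partly_normalized)
print(cleaned)
```

Here the `$` is dropped because no earlier step replaced it, which mirrors the difference between the two outputs shown above.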

With the debug method, one can retrieve the set of invalid characters in the final output. This can be used to create or debug a configuration. Additionally, the outputs of the individual configuration steps are printed.

outsent, error = norm.debug(sentence)
print(error)

# >>> START               Am 11. Januar geht er um 5m nach links,weshalb er $d schon "ziemlich" müde ist.
# >>> Strip               ['Am 11. Januar geht er um 5m nach links,weshalb er $d schon "ziemlich" müde ist.']
# >>> Lower               ['am 11. januar geht er um 5m nach links,weshalb er $d schon "ziemlich" müde ist.']
# >>> StripChar           ['am 11. januar geht er um 5m nach links,weshalb er $d schon "ziemlich" müde ist']
# >>> ReplaceIfNotSurroundedByDigits['am 11. januar geht er um 5m nach links weshalb er $d schon "ziemlich" müde ist']
# >>> ReplaceIfNotPrecededByDigit['am 11. januar geht er um 5m nach links weshalb er $d schon "ziemlich" müde ist']
# >>> ReplaceRegex        ['am 11. januar geht er um 5m nach links weshalb er $d schon "ziemlich" müde ist']
# >>> ReplaceChar         ['am 11. januar geht er um 5m nach links weshalb er $d schon  ziemlich  müde ist']
# >>> ReplaceChar         ['am 11. januar geht er um 5m nach links weshalb er $d schon  ziemlich  müde ist']
# >>> WhitespaceTokenize  ['am', '11.', 'januar', 'geht', 'er', 'um', '5m', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> SplitNumberSuffix   ['am', '11.', 'januar', 'geht', 'er', 'um', '5', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> NumberToWords       ['am', '11.', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> OrdinalNumberToWords['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> ReplaceChar         ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> ReplaceFull         ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> RemoveDiacritics    ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> Strip               ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> END                 ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']

# >>> {'$'}
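To make the trace above easier to read, here is a minimal sketch of a step pipeline that records the result of each step and reports leftover invalid characters. It is not spoteno's code: the step functions, `run_with_trace`, and `VALID_CHARS` are hypothetical, and only a few of the steps listed above are imitated.

```python
import re
import string

# Hypothetical set of allowed characters (an assumption for illustration).
VALID_CHARS = set(string.ascii_lowercase) | set("äöüß ")


def lower(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def whitespace_tokenize(tokens):
    """Split every token on whitespace and flatten the result."""
    return [part for t in tokens for part in t.split()]


def split_number_suffix(tokens):
    """Split tokens such as '5m' into the number '5' and the suffix 'm'."""
    out = []
    for t in tokens:
        match = re.fullmatch(r"(\d+)([^\W\d_]+)", t)
        out.extend(match.groups() if match else [t])
    return out


def run_with_trace(text, steps):
    """Apply steps in order, keeping the intermediate token list of each step."""
    tokens = [text]
    trace = [("START", tokens)]
    for step in steps:
        tokens = step(tokens)
        trace.append((step.__name__, tokens))
    # Characters still present at the end that are not in the allowed set.
    invalid = set("".join(tokens)) - VALID_CHARS
    return tokens, trace, invalid


steps = [lower, whitespace_tokenize, split_number_suffix]
tokens, trace, invalid = run_with_trace("Um 5m nach $d", steps)
for name, state in trace:
    print(f"{name:<22}{state}")
print(invalid)
```

Because this sketch has no number-to-words step, the digit `5` is reported as invalid alongside `$`. Seeing which characters remain is how such a trace helps decide which step a configuration is still missing.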

Development

Prerequisites

A supported version of Python 3

It's recommended to use a virtual environment when developing spoteno. To create one, execute the following command in the project's root directory:

python -m venv .

To install spoteno and all its dependencies, execute:

pip install -e .

Running the test suite

pip install -e .[dev]
python setup.py test

With PyCharm, you might have to change the default test runner. Otherwise, it might only suggest using nose. To do so, go to File > Settings > Tools > Python Integrated Tools (on the Mac it's PyCharm > Preferences > Settings > Tools > Python Integrated Tools) and change the test runner to py.test.

Versions

Versions are handled using bump2version. To bump the version:

bump2version [major,minor,patch,release,num]

To go directly to a final release version (skipping .dev/.rc/...):

bump2version [major,minor,patch] --new-version x.x.x

Release

Commands to create a new release on PyPI:

rm -rf build
rm -rf dist

python setup.py sdist
python setup.py bdist_wheel
twine upload dist/*

spoteno
Release 0.1.1
