ASRDeepspeech x Sakura-ML (English/Japanese)

Modules • Code structure • Installing the application • Makefile commands • Environments • Datasets • Running the application • Notes

This repository offers a clean-code version of the original repository from SeanNaren, with classes and modular components (e.g. trainers, models, loggers).

I have added a configuration file to manage the parameters of the model. You will also find a pretrained Japanese model achieving a 34% CER on the JSUT test set.

Modules

At a granular level, ASRDeepSpeech is a library that consists of the following components:

Component                     Description
asr_deepspeech                Speech recognition package
asr_deepspeech.data           Data-related modules
asr_deepspeech.data.dataset   Build the dataset
asr_deepspeech.data.loaders   Load the dataset
asr_deepspeech.data.parsers   Parse the dataset
asr_deepspeech.data.samplers  Sample the dataset
asr_deepspeech.decoders       Decode the generated text
asr_deepspeech.loggers        Loggers
asr_deepspeech.modules        Components of the network
asr_deepspeech.parsers        Argument parsers
asr_deepspeech.tests          Unit tests
asr_deepspeech.trainers       Trainers
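
As a quick sanity check of this layout, you can enumerate the installed submodules; the sketch below uses only the standard library and assumes the package has been installed:

# List the asr_deepspeech submodules found in the installed package.
# Standard library only; assumes `asr_deepspeech` is importable.
import pkgutil
import asr_deepspeech

for module in pkgutil.walk_packages(asr_deepspeech.__path__, prefix="asr_deepspeech."):
    print(module.name)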

Code structure

from setuptools import setup
from asr_deepspeech import __version__

setup(
    name="asr_deepspeech",
    version=__version__,
    long_description=open("README.md", "r").read(),
    long_description_content_type="text/markdown",
    url="https://github.com/zakuro-ai/asr",
    license="MIT",
    author="CADIC Jean-Maximilien",
    python_requires=">=3.8",
    packages=[
        "asr_deepspeech",
        "asr_deepspeech.audio",
        "asr_deepspeech.data",
        "asr_deepspeech.data.dataset",
        "asr_deepspeech.data.loaders",
        "asr_deepspeech.data.manifests",
        "asr_deepspeech.data.parsers",
        "asr_deepspeech.data.samplers",
        "asr_deepspeech.decoders",
        "asr_deepspeech.etl",
        "asr_deepspeech.loggers",
        "asr_deepspeech.models",
        "asr_deepspeech.modules",
        "asr_deepspeech.parsers",
        "asr_deepspeech.tests",
        "asr_deepspeech.trainers",
    ],
    include_package_data=True,
    package_data={"": ["*.yml"]},
    install_requires=[r.strip() for r in open("requirements.txt") if r.strip()],
    author_email="git@zakuro.ai",
    description="ASRDeepspeech (English / Japanese)",
    platforms="linux_debian_10_x86_64",
    classifiers=[
        "Programming Language :: Python :: 3.8",
        "License :: OSI Approved :: MIT License",
    ],
)
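
The install_requires line above keeps one requirement per line from requirements.txt, skipping blank lines; a minimal illustration with made-up file content:

# Illustration of the requirements parsing used in setup.py above,
# with an inlined, made-up requirements.txt.
requirements_txt = "torch==1.0\nnumpy>=1.19\n\n"
install_requires = [r.strip() for r in requirements_txt.splitlines() if r.strip()]
print(install_requires)  # ['torch==1.0', 'numpy>=1.19']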

Installing the application

To clone and run this application, you'll need Git installed on your computer (and Docker if you follow the recommended setup). You can install the released package directly from PyPI:

pip install asr-deepspeech==0.3.1

Or install from source:

# Clone this repository
git clone https://github.com/zakuro-ai/asr

# Go into the repository
cd asr

Makefile commands

Exhaustive list of make commands:

pull                # Download the docker image
sandbox             # Launch the sandbox image
install_wheels      # Install the wheel
tests               # Test the code

Environments

We support both a local and a Docker setup, but we recommend Docker to avoid environment issues. If you decide to run the code locally you will need Python 3.8+ with CUDA >= 10.1, and several libraries must be installed for training to work. The instructions below assume an Anaconda installation on Ubuntu with PyTorch 1.0; install PyTorch if you haven't already.
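
For a local setup, here is a quick sanity check that the interpreter and the GPU stack match these requirements (a sketch, assuming PyTorch is already installed):

# Environment sanity check for a local (non-Docker) setup.
import sys
import torch

assert sys.version_info >= (3, 8), "Python 3.8+ is required"
print("PyTorch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))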

Docker

Note

Running this application by using Docker is recommended.

To pull and run the Docker image:

make pull
make sandbox

PythonEnv

Warning

Running this application by using PythonEnv is possible but not recommended.

make install_wheels

Test

make tests

You should get an output similar to:

=1= TEST PASSED : asr_deepspeech
=1= TEST PASSED : asr_deepspeech.data
=1= TEST PASSED : asr_deepspeech.data.dataset
=1= TEST PASSED : asr_deepspeech.data.loaders
=1= TEST PASSED : asr_deepspeech.data.parsers
=1= TEST PASSED : asr_deepspeech.data.samplers
=1= TEST PASSED : asr_deepspeech.decoders
=1= TEST PASSED : asr_deepspeech.loggers
=1= TEST PASSED : asr_deepspeech.modules
=1= TEST PASSED : asr_deepspeech.parsers
=1= TEST PASSED : asr_deepspeech.tests
=1= TEST PASSED : asr_deepspeech.trainers

Datasets

By default we process the JSUT dataset. See the config section to learn how to process a custom dataset.

from gnutools.remote import gdrive
from asr_deepspeech import cfg

# Download the JSUT dataset to /tmp
gdrive(cfg.gdrive_uri)

ETL

python -m asr_deepspeech.etl
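
The ETL step prepares the train/validation manifests consumed by the trainer. The paths below are taken from the Notes section of this README; the manifest schema itself is not documented here, so this sketch only checks that the files exist and peeks at their content:

# Peek at the manifests produced by the ETL step. Paths come from the
# Notes section; the JSON structure is an assumption, so we only report
# the top-level type and size.
import json
from pathlib import Path

for name in ("train_clean.json", "val_clean.json"):
    manifest = Path("__data__/manifests") / name
    if manifest.exists():
        data = json.loads(manifest.read_text())
        print(name, type(data).__name__, len(data))
    else:
        print(name, "not found - run the ETL step first")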

Running the application

Training a Model

To train on a single GPU:

sakura -m asr_deepspeech.trainers

Pretrained model

To run the pretrained model:

python -m asr_deepspeech

Notes

  • Clean verbose output during training
    ================ VARS ===================
    manifest: clean
    distributed: True
    train_manifest: __data__/manifests/train_clean.json
    val_manifest: __data__/manifests/val_clean.json
    model_path: /data/ASRModels/deepspeech_jp_500_clean.pth
    continue_from: None
    output_file: /data/ASRModels/deepspeech_jp_500_clean.txt
    main_proc: True
    rank: 0
    gpu_rank: 0
    world_size: 2
    ==========================================
    
  • Progress bar
    ...
    clean - 0:00:46 >> 2/1000 (1) | Loss 95.1626 | Lr 0.30e-3 | WER/CER 98.06/95.16 - (98.06/[95.16]): 100%|██████████████████████| 18/18 [00:46<00:00,  2.59s/it]
    clean - 0:00:47 >> 3/1000 (1) | Loss 96.3579 | Lr 0.29e-3 | WER/CER 97.55/97.55 - (98.06/[95.16]): 100%|██████████████████████| 18/18 [00:47<00:00,  2.61s/it]
    clean - 0:00:47 >> 4/1000 (1) | Loss 97.5705 | Lr 0.29e-3 | WER/CER 100.00/100.00 - (98.06/[95.16]): 100%|████████████████████| 18/18 [00:47<00:00,  2.66s/it]
    clean - 0:00:48 >> 5/1000 (1) | Loss 97.8628 | Lr 0.29e-3 | WER/CER 98.74/98.74 - (98.06/[95.16]): 100%|██████████████████████| 18/18 [00:50<00:00,  2.78s/it]
    clean - 0:00:50 >> 6/1000 (5) | Loss 97.0118 | Lr 0.29e-3 | WER/CER 96.26/93.61 - (96.26/[93.61]): 100%|██████████████████████| 18/18 [00:49<00:00,  2.76s/it]
    clean - 0:00:50 >> 7/1000 (5) | Loss 97.2341 | Lr 0.28e-3 | WER/CER 98.35/98.35 - (96.26/[93.61]):  17%|███▊                   | 3/18 [00:10<00:55,  3.72s/it]
    ...
    
  • Separate text file to check WER/CER, with a histogram over CER values (best/last/worst results); a sketch of the CER computation follows this list
    ================= 100.00/34.49 =================
    ----- BEST -----
    Ref:良ある人ならそんな钨にに話しかけないだろう
    Hyp:用ある人ならそんな钨にに話しかけないだろう
    WER:100.0  - CER:4.761904761904762
    ----- LAST -----
    Ref:すみませんがオースチンさんは5日にはです
    Hyp:すみませんがースンさんは一つかにはです
    WER:100.0  - CER:25.0
    ----- WORST -----
    Ref:小切には内がみられる
    Hyp:コには内先金地、作みが見られる
    WER:100.0  - CER:90.0
    CER histogram
    |###############################################################################
    |███████████                                                          6  0-10
    |███████████████████████████                                         15  10-20
    |██████████████████████████████████████████████████████████████████  36  20-30
    |████████████████████████████████████████████████████████████████    35  30-40
    |██████████████████████████████████████████████████                  27  40-50
    |█████████████████████████████                                       16  50-60
    |█████████                                                            5  60-70
    |███████████                                                          6  70-80
    |                                                                     0  80-90
    |█                                                                    1  90-100
    =============================================
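
The CER figures above are character error rates: the edit distance between hypothesis and reference, normalized by the reference length, as a percentage. Below is a minimal sketch (not the repository's implementation), checked against the LAST example above:

# Character error rate (CER) in percent: Levenshtein distance between
# hypothesis and reference, divided by the reference length.
# Minimal sketch; not the repository's implementation.
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance with a rolling one-row DP table."""
    dp = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        prev, dp[0] = dp[0], i
        for j, h in enumerate(hyp, start=1):
            prev, dp[j] = dp[j], min(dp[j] + 1,        # deletion
                                     dp[j - 1] + 1,    # insertion
                                     prev + (r != h))  # substitution
    return dp[-1]

def cer(ref: str, hyp: str) -> float:
    return 100.0 * edit_distance(ref, hyp) / len(ref)

# The LAST example above: 5 edits over a 20-character reference.
ref = "すみませんがオースチンさんは5日にはです"
hyp = "すみませんがースンさんは一つかにはです"
print(cer(ref, hyp))  # 25.0

Since the Japanese references contain no spaces, each sentence counts as a single word, which is why the WER column reads 100.0 whenever any character differs.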
    

Acknowledgements

Thanks to Egor and Ryan for their contributions!

This is a fork of https://github.com/SeanNaren/deepspeech.pytorch. The code has been improved for readability only.

For any questions, please contact me at j.cadic[at]protonmail.ch