ASRDeepspeech x Sakura-ML
(English/Japanese)
Modules • Code structure • Installing the application • Makefile commands • Environments • Datasets • Running the application • Notes
This repository offers a cleaned-up version of SeanNaren's original repository, with classes and modular components (e.g. trainers, models, loggers). I have added a configuration file to manage the parameters of the model. You will also find a pretrained Japanese model reaching a CER of 34 on the JSUT test set.
Modules
At a granular level, ASRDeepSpeech is a library that consists of the following components:
| Component | Description |
|---|---|
| asr_deepspeech | Speech recognition package |
| asr_deepspeech.data | Data-related module |
| asr_deepspeech.data.dataset | Build the dataset |
| asr_deepspeech.data.loaders | Load the dataset |
| asr_deepspeech.data.parsers | Parse the dataset |
| asr_deepspeech.data.samplers | Sample the dataset |
| asr_deepspeech.decoders | Decode the generated text |
| asr_deepspeech.loggers | Loggers |
| asr_deepspeech.modules | Components of the network |
| asr_deepspeech.parsers | Argument parsers |
| asr_deepspeech.tests | Unit tests |
| asr_deepspeech.trainers | Trainers |
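Since the package is importable as a library, a quick smoke test is possible. This is a minimal sketch using only imports that appear elsewhere in this README (`__version__` in setup.py below, `cfg` in the Datasets section):

```python
# Minimal smoke test; both imports appear in this repository's own code
# (setup.py and the Datasets snippet), everything else here is illustrative.
from asr_deepspeech import __version__, cfg

print(__version__)  # installed package version
print(cfg)          # parsed configuration object
```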
Code structure
```python
from setuptools import setup
from asr_deepspeech import __version__

setup(
    name="asr_deepspeech",
    version=__version__,
    short_description="ASRDeepspeech (English / Japanese)",
    long_description="".join(open("README.md", "r").readlines()),
    long_description_content_type="text/markdown",
    url="https://github.com/zakuro-ai/asr",
    license="MIT Licence",
    author="CADIC Jean-Maximilien",
    python_requires=">=3.8",
    packages=[
        "asr_deepspeech",
        "asr_deepspeech.audio",
        "asr_deepspeech.data",
        "asr_deepspeech.data.dataset",
        "asr_deepspeech.data.loaders",
        "asr_deepspeech.data.manifests",
        "asr_deepspeech.data.parsers",
        "asr_deepspeech.data.samplers",
        "asr_deepspeech.decoders",
        "asr_deepspeech.etl",
        "asr_deepspeech.loggers",
        "asr_deepspeech.models",
        "asr_deepspeech.modules",
        "asr_deepspeech.parsers",
        "asr_deepspeech.tests",
        "asr_deepspeech.trainers",
    ],
    include_package_data=True,
    package_data={"": ["*.yml"]},
    install_requires=[r.rsplit()[0] for r in open("requirements.txt")],
    author_email="git@zakuro.ai",
    description="ASRDeepspeech (English / Japanese)",
    platforms="linux_debian_10_x86_64",
    classifiers=[
        "Programming Language :: Python :: 3.8",
        "License :: OSI Approved :: MIT License",
    ],
)
```
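One detail worth noting in the snippet above: `install_requires` keeps only the first whitespace-separated token of each line of requirements.txt, which strips trailing inline comments. A quick illustration (the example line is made up):

```python
# rsplit() with no arguments splits on any whitespace, so [0] keeps the
# leading requirement specifier and drops anything after it.
line = "numpy>=1.19  # numerical backend"
print(line.rsplit()[0])  # -> numpy>=1.19
```

Note that a blank line in requirements.txt would raise an IndexError here, since `"".rsplit()` returns an empty list.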
Installing the application
To clone and run this application, you'll need Git installed on your computer:

```bash
# Clone this repository
git clone https://github.com/zakuro-ai/asr
# Go into the repository
cd asr
```
Makefile commands
Exhaustive list of make commands:
```
pull            # Download the docker image
sandbox         # Launch the sandbox image
install_wheels  # Install the wheels
tests           # Test the code
```
Environments
We support both local and Docker setups, but we recommend Docker to avoid any difficulty running the code. If you decide to run the code locally, you will need Python >= 3.8 (as declared in setup.py) with CUDA >= 10.1, and several libraries must be installed for training to work. I will assume that everything is installed in an Anaconda environment on Ubuntu, with PyTorch 1.0. Install PyTorch if you haven't already.
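Before training locally, it is worth checking that PyTorch actually sees your GPU. A minimal check, assuming torch is already installed:

```python
import torch

print(torch.__version__)          # PyTorch release in use
print(torch.cuda.is_available())  # must be True for GPU training
if torch.cuda.is_available():
    print(torch.version.cuda)     # CUDA version PyTorch was built with
```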
Docker
Note
Running this application with Docker is recommended. To pull and launch the sandbox image:

```bash
make pull
make sandbox
```
PythonEnv
Warning
Running this application in a local Python environment is possible but not recommended.

```bash
make install_wheels
```
Test
```bash
make tests
```

You should get output like:

```
=1= TEST PASSED : asr_deepspeech
=1= TEST PASSED : asr_deepspeech.data
=1= TEST PASSED : asr_deepspeech.data.dataset
=1= TEST PASSED : asr_deepspeech.data.loaders
=1= TEST PASSED : asr_deepspeech.data.parsers
=1= TEST PASSED : asr_deepspeech.data.samplers
=1= TEST PASSED : asr_deepspeech.decoders
=1= TEST PASSED : asr_deepspeech.loggers
=1= TEST PASSED : asr_deepspeech.modules
=1= TEST PASSED : asr_deepspeech.parsers
=1= TEST PASSED : asr_deepspeech.tests
=1= TEST PASSED : asr_deepspeech.trainers
```
Datasets
By default we process the JSUT dataset. See the config section to learn how to process a custom dataset.
```python
from gnutools.remote import gdrive
from asr_deepspeech import cfg

# This will download the JSUT dataset to /tmp
gdrive(cfg.gdrive_uri)
```
ETL
```bash
python -m asr_deepspeech.etl
```
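The ETL step produces the JSON manifests consumed by the trainer (for example `__data__/manifests/train_clean.json` in the logs below). A minimal sanity check; the path comes from those logs, but the manifest schema itself is an assumption, not documented here:

```python
import json

# Hypothetical sanity check: only the path below is taken from the training
# logs in the Notes section; the JSON structure is an assumption.
with open("__data__/manifests/train_clean.json") as f:
    manifest = json.load(f)
print(type(manifest), len(manifest))  # container type and number of entries
```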
Running the application
Training a Model
To train on a single GPU:

```bash
sakura -m asr_deepspeech.trainers
```
Pretrained model
```bash
python -m asr_deepspeech
```
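`python -m asr_deepspeech` is the entry point for running the pretrained model. If you only want to inspect the checkpoint itself, a sketch like the following should work; the path is taken from the Notes section below, and the checkpoint layout is an assumption:

```python
import torch

# Hypothetical checkpoint inspection; the path comes from the training logs,
# the contents/layout of the file are an assumption.
ckpt = torch.load("/data/ASRModels/deepspeech_jp_500_clean.pth", map_location="cpu")
print(type(ckpt))
if isinstance(ckpt, dict):
    print(list(ckpt.keys()))  # e.g. model state, optimizer state, epoch
```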
Notes
The report below comes from a distributed run (world_size: 2). For unsegmented Japanese text, WER is of little use (it is pinned at 100.0 for each of the samples shown), so CER is the metric to track; the header line reports the overall WER/CER (100.00/34.49).
```
================ VARS ===================
manifest: clean
distributed: True
train_manifest: __data__/manifests/train_clean.json
val_manifest: __data__/manifests/val_clean.json
model_path: /data/ASRModels/deepspeech_jp_500_clean.pth
continue_from: None
output_file: /data/ASRModels/deepspeech_jp_500_clean.txt
main_proc: True
rank: 0
gpu_rank: 0
world_size: 2
==========================================
...
clean - 0:00:46 >> 2/1000 (1) | Loss 95.1626 | Lr 0.30e-3 | WER/CER 98.06/95.16 - (98.06/[95.16]): 100%|██████████████████████| 18/18 [00:46<00:00, 2.59s/it]
clean - 0:00:47 >> 3/1000 (1) | Loss 96.3579 | Lr 0.29e-3 | WER/CER 97.55/97.55 - (98.06/[95.16]): 100%|██████████████████████| 18/18 [00:47<00:00, 2.61s/it]
clean - 0:00:47 >> 4/1000 (1) | Loss 97.5705 | Lr 0.29e-3 | WER/CER 100.00/100.00 - (98.06/[95.16]): 100%|████████████████████| 18/18 [00:47<00:00, 2.66s/it]
clean - 0:00:48 >> 5/1000 (1) | Loss 97.8628 | Lr 0.29e-3 | WER/CER 98.74/98.74 - (98.06/[95.16]): 100%|██████████████████████| 18/18 [00:50<00:00, 2.78s/it]
clean - 0:00:50 >> 6/1000 (5) | Loss 97.0118 | Lr 0.29e-3 | WER/CER 96.26/93.61 - (96.26/[93.61]): 100%|██████████████████████| 18/18 [00:49<00:00, 2.76s/it]
clean - 0:00:50 >> 7/1000 (5) | Loss 97.2341 | Lr 0.28e-3 | WER/CER 98.35/98.35 - (96.26/[93.61]): 17%|███▊ | 3/18 [00:10<00:55, 3.72s/it]
...
================= 100.00/34.49 =================
----- BEST -----
Ref:良ある人ならそんな風にに話しかけないだろう
Hyp:用ある人ならそんな風にに話しかけないだろう
WER:100.0 - CER:4.761904761904762
----- LAST -----
Ref:すみませんがオースチンさんは5日にはです
Hyp:すみませんがースンさんは一つかにはです
WER:100.0 - CER:25.0
----- WORST -----
Ref:小切には内がみられる
Hyp:コには内先金地つ作みが見られる
WER:100.0 - CER:90.0
CER histogram
|###############################################################################
|███████████ 6 0-10
|███████████████████████████ 15 10-20
|███████████████████████████████████████████████████████████████████ 36 20-30
|█████████████████████████████████████████████████████████████████ 35 30-40
|██████████████████████████████████████████████████ 27 40-50
|█████████████████████████████ 16 50-60
|█████████ 5 60-70
|███████████ 6 70-80
| 0 80-90
|█ 1 90-100
=============================================
```
Acknowledgements
Thanks to Egor and Ryan for their contributions!
This is a fork of https://github.com/SeanNaren/deepspeech.pytorch. The code has been reworked for readability only.
For any questions, please contact me at j.cadic[at]protonmail.ch.