Automatic Speech Recognition (ASR) Assessment
This Python package lets you assess the phoneme error rate of an ASR model and visualise the errors.
Use the package manager pip to install asrassessment.
For the latest version check the PyPI page.
pip install asrassessment
Take Note:
- Installing the latest version may fail on the first attempt. If it does, run pip install a second time.
- In Jupyter Notebook and Google Colab, use '%pip install' instead of '!pip install'.
- File directory names may differ in capitalisation, so check them when writing code, e.g. "TRAIN" instead of "train".
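The capitalisation issue in the last note can be sidestepped by looking folder names up case-insensitively before building paths. The helper below is a minimal sketch; `resolve_dir` is a hypothetical name and is not part of asrassessment.

```python
import os

def resolve_dir(parent, name):
    """Return the path of the entry in `parent` whose name matches `name`
    ignoring case, or None if no entry matches. (Hypothetical helper.)"""
    for entry in os.listdir(parent):
        if entry.lower() == name.lower():
            return os.path.join(parent, entry)
    return None
```

For example, `resolve_dir(timit_dir, "train")` would return the path ending in "TRAIN" if that is how the folder is spelled on disk.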
Brief Overview
- Calculate the phoneme error rate (%) using the TIMIT database
- Identify specific frames for which there was an error in the phoneme conversion
- Boxplot of accuracy rate for each phoneme across selected TIMIT files
- Stacked boxplot of accuracy rate across varying added noise
- Time/frequency plot of any given TIMIT audio file, showing the timing and phoneme that was incorrectly predicted (substitution and deletion only)
Usage of package
Calculating Phoneme Error Rate(%)
ASR Model: Here we use allosaurus, which is defined below.
import os
import pandas as pd
from asrassessment.utils.timit_load import TIMIT_file
#Load TIMIT Files
timit_dir = f"{os.getcwd()}/{TIMIT_PACKAGE_NAME}" #note this is the folder containing the 'test' & 'train' folders. Usually they are named 'TIMIT'/'timit'
TIMIT_dict = TIMIT_file(timit_dir)
#Take sample phoneme string from TIMIT file
phn_file_dir = TIMIT_dict['train']['dr1']['fecd0']['phn'][0]
#Load ASR Model
#Calculate Phoneme Error Rate between two strings
from asrassessment.utils.data_input import convert_wav
from asrassessment.utils.generalfunc import *
from asrassessment.utils.standardizer import *
#file directory
wav_file_dir = TIMIT_dict['train']['dr1']['fecd0']['wav'][0]
#convert and overwrite wav file so it is useable
#test model
asr_phn = allosaurus_model(wav_file_dir)
#standardize phoneme string
asr_phn_conv = IPA_to_TIMIT(asr_phn)
#load TIMIT phn
timit_phn = read_phn(phn_file_dir,string=True)
#standardize phoneme string
timit_phn_conv = TIMIT_to_IPA(timit_phn)
from asrassessment.utils.phone_error_rate import error_rate
output, error_df = error_rate(timit_phn_conv,asr_phn_conv)
#Final dataframe showing phoneme comparison & type of error
Plotting boxplot for phoneme accuracy of ASR model across selected TIMIT files
With the ASR model defined beforehand, pass its function name as the asr_model argument.
Then choose which range of DR (dialect region) folders to use within TIMIT and whether to read from "TRAIN" or "TEST". Take note of point 3 in Installation regarding capitalisation.
from asrassessment import main as asrtest
asrtest.full_phn_boxplot(asr_model=allosaurus_model,file_set="TRAIN", DR=[0,1])
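To make clear what such a boxplot summarises, the sketch below computes one accuracy value per phoneme per file and groups them by phoneme. Each list of values is what a single box would represent. The function names and the toy input are assumptions for illustration only and do not reflect how `full_phn_boxplot` is implemented.

```python
from collections import defaultdict

def per_phoneme_accuracy(alignment):
    """alignment: list of (reference_phoneme, is_correct) pairs for one file.
    Returns {phoneme: accuracy in percent} for that file."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for phoneme, ok in alignment:
        totals[phoneme] += 1
        if ok:
            correct[phoneme] += 1
    return {p: 100.0 * correct[p] / totals[p] for p in totals}

def collect_across_files(alignments):
    """Group per-file accuracies by phoneme: {phoneme: [acc_file1, acc_file2, ...]}."""
    grouped = defaultdict(list)
    for alignment in alignments:
        for phoneme, acc in per_phoneme_accuracy(alignment).items():
            grouped[phoneme].append(acc)
    return dict(grouped)

# Toy input (made up): two files, each a list of (phoneme, was it recognised correctly)
file_a = [("s", True), ("s", False), ("iy", True)]
file_b = [("s", True), ("iy", False), ("iy", True)]
grouped = collect_across_files([file_a, file_b])
```

Each entry of `grouped` could then be handed to a plotting library to draw one box per phoneme.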
Plotting stacked boxplot for phoneme accuracy of ASR model across varying added noise
Note that the noise-adding function here requires a 'noisyspeech.cfg' file.
The noise file should be in WAV format; an example download can be found here
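As background on what "adding noise" at a given level usually means, the sketch below mixes a noise signal into speech at a chosen signal-to-noise ratio (SNR) using NumPy. It is a generic illustration under the assumption that both signals are 1-D float arrays at the same sampling rate. `mix_at_snr` is a hypothetical name, and the package's own noise routine, driven by 'noisyspeech.cfg', may work differently.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db` (in dB),
    then add it to `speech`. Assumes `noise` is not all zeros."""
    # Repeat or truncate the noise so it has the same length as the speech
    noise = np.resize(noise, speech.shape)
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    # Noise power needed to reach the requested SNR
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    scale = np.sqrt(target_noise_power / noise_power)
    return speech + scale * noise
```

Lower `snr_db` values give noisier audio, so sweeping it over several values yields the "varying added noise" conditions that a stacked plot would compare.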
ASR_Model Here we use the allosaurus model
Speech-to-text Model Here we use the google speech-to-text
#Load ASR model
#Load Speech to Text Model
from asrassessment import main as asrtest
cfg_filedir= 'noisyspeech.cfg',
DR = [0,1],
SPK = [0,1],
Plotting time/frequency plot of ASR model to identify phoneme error at given frame
from asrassessment import main as asrtest
#file directory
phn_file_dir = TIMIT_dict['train']['dr1']['fecd0']['phn'][0]
wav_file_dir = TIMIT_dict['train']['dr1']['fecd0']['wav'][0]
asrtest.phoneme_wavchart(timit_phndir = phn_file_dir,
timit_wavdir = wav_file_dir,
Allosaurus Model
%pip install allosaurus
from allosaurus.app import read_recognizer
def allosaurus_model(file_directory,fr=16000,dataframe=False):
    model = read_recognizer()
    #with timestamp=True each output line is 'start duration phoneme'
    str_output = model.recognize(file_directory,lang_id='eng',timestamp=True)
    lst_output = str_output.split("\n")
    df = pd.DataFrame(lst_output, columns=['header'])
    df = df.header.str.split(pat=' ',expand=True)
    df.columns = ['start','timing','phoneme']
    #edit dataframe (multiply 'start' & 'timing' by the sampling rate fr to convert seconds to samples, then add them to get 'end')
    df['start'] = df['start'].astype(float)
    df['start'] = df['start'].values*fr
    df['timing'] = df['timing'].astype(float)
    df['timing'] = df['timing'].values*fr
    df['end'] = df.apply(lambda row: row.start + row.timing, axis = 1)
    finaldf = df[['start','end', 'phoneme']]
    #return the dataframe with timings if requested, otherwise the phoneme string only
    if dataframe:
        return finaldf
    allosaurus_phn = col_to_string(finaldf,colname='phoneme')
    return allosaurus_phn
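The timestamp handling in the function above can be tried without loading the model. The sketch below parses text in the same 'start duration phoneme' layout that the function assumes, with times in seconds, and converts each segment to sample indices. The function name and the example string are made up for illustration.

```python
def parse_timestamped_output(text, sample_rate=16000):
    """Parse lines of the form 'start duration phoneme' (times in seconds)
    into (start_sample, end_sample, phoneme) tuples."""
    segments = []
    for line in text.strip().splitlines():
        start, duration, phoneme = line.split()
        start_sample = float(start) * sample_rate
        end_sample = (float(start) + float(duration)) * sample_rate
        segments.append((round(start_sample), round(end_sample), phoneme))
    return segments

# Made-up example in the assumed output layout
example = "0.210 0.045 æ\n0.390 0.045 l"
segments = parse_timestamped_output(example)
print(segments)
```

Working in samples rather than seconds makes the result directly comparable with TIMIT transcriptions, which mark segment boundaries as sample indices.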
Speech Recognition Model (Google)
%pip install SpeechRecognition
import speech_recognition as sr
def speech_recog(timit_wav):
    r = sr.Recognizer()
    with sr.AudioFile(timit_wav) as source:
        audio = r.record(source)
    return r.recognize_google(audio)
Further Description
Method to get phoneme_error_rate
Sources with a detailed explanation of the error rate algorithm used in this package:
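For orientation, a phoneme error rate is commonly computed from the edit (Levenshtein) distance between the reference and predicted phoneme sequences, divided by the reference length. The sketch below follows that common definition and also labels each aligned position as correct, substitution, deletion or insertion. It is an assumption about the general approach, not a copy of this package's `error_rate` function.

```python
def phoneme_error_rate(reference, hypothesis):
    """Return (PER in percent, list of (operation, ref_phoneme, hyp_phoneme))."""
    if not reference:
        raise ValueError("reference must contain at least one phoneme")
    n, m = len(reference), len(hypothesis)
    # dist[i][j] = minimum edits to turn reference[:i] into hypothesis[:j]
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i
    for j in range(1, m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j - 1] + cost,  # match or substitution
                             dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1)         # insertion
    # Walk back through the table to recover one optimal alignment
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        same = i > 0 and j > 0 and reference[i - 1] == hypothesis[j - 1]
        if i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + (0 if same else 1):
            ops.append(("correct" if same else "substitution",
                        reference[i - 1], hypothesis[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops.append(("deletion", reference[i - 1], None))
            i -= 1
        else:
            ops.append(("insertion", None, hypothesis[j - 1]))
            j -= 1
    ops.reverse()
    return 100.0 * dist[n][m] / n, ops

# Toy example: the middle phoneme is missing from the prediction
per, ops = phoneme_error_rate(["k", "ae", "t"], ["k", "t"])
```

Because insertions count as errors, a rate computed this way can exceed 100% when the prediction is much longer than the reference.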
TIMIT standardisation mapping
Phoneme notation is not consistent across sources, so this package follows a standardised mapping found in the module
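To show why a mapping is needed, the sketch below converts a few TIMIT (ARPABET-style) symbols to their usual IPA equivalents with a lookup table. The table is a small illustrative subset and `map_phonemes` is a hypothetical helper; the package's own mapping is more complete and may differ in detail.

```python
# Illustrative subset only; not the package's mapping table
TIMIT_TO_IPA_SUBSET = {
    "iy": "i", "ih": "ɪ", "ae": "æ", "aa": "ɑ",
    "sh": "ʃ", "th": "θ", "dh": "ð", "ng": "ŋ",
}

def map_phonemes(phonemes, table, unknown=None):
    """Map each symbol through `table`. Symbols without an entry are kept
    unchanged, or replaced by `unknown` if one is given."""
    return [table.get(p, p if unknown is None else unknown) for p in phonemes]

converted = map_phonemes(["dh", "ih", "s"], TIMIT_TO_IPA_SUBSET)
```

Both sequences must be in the same notation before they are compared, otherwise identical sounds written differently would be counted as errors.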
TIMIT Acoustic-Phonetic Continuous Speech Corpus
TIMIT file: TIMIT is a corpus of phonemically and lexically transcribed speech from American English speakers of different sexes and dialects.
Samples of the corpus can be found here
You can download the entire corpus here.
Watch how to download the torrent here
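Once the corpus is downloaded, each utterance comes with a .phn transcription holding one segment per line in the form 'start_sample end_sample label'. The sketch below parses text in that layout. `parse_phn` is a hypothetical helper rather than the package's `read_phn`, and the sample text is a short illustrative excerpt in the TIMIT format.

```python
def parse_phn(text):
    """Parse TIMIT .phn content into (start_sample, end_sample, label) tuples."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, end, label = parts
        segments.append((int(start), int(end), label))
    return segments

# Illustrative excerpt in the .phn layout
sample = "0 3050 h#\n3050 4559 sh\n4559 5723 ix\n"
segments = parse_phn(sample)
# Space-separated phoneme string, convenient for sequence comparison
phoneme_string = " ".join(label for _, _, label in segments)
```

The boundaries are sample indices, so dividing by the 16 kHz sampling rate of the corpus gives times in seconds.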
Package Requirements
Python version required: 3.9
This project was created with:
- glob2 version: 0.7
- tqdm version: 4.64.0
- librosa version: 0.9.2
- scipy version: 1.9.0
- numpy version: 1.23.1
- pandas version: 1.4.3
- sklearn version: 0.0
- pydub version: 0.25.1
- soundfile version: 0.10.2
- plotly version: 5.8.0
- matplotlib version: 3.5.3
