univoc

A PyTorch implementation of Towards Achieving Robust Universal Neural Vocoding.


Keywords
Speech, Synthesis, Neural, Vocoder, Text-to-Speech, PyTorch, neural-vocoder, speech-synthesis, wavernn
License
MIT
Install
pip install univoc==0.2.1

Documentation

Towards Achieving Robust Universal Neural Vocoding

A PyTorch implementation of Towards Achieving Robust Universal Neural Vocoding. Audio samples can be found here.

Architecture of the vocoder.
Fig 1:Architecture of the vocoder.

Quick Start

Ensure you have Python 3.6 and PyTorch 1.7 or greater installed. Then install the package with:

pip install univoc

Example Usage

import torch
import soundfile as sf
from univoc import Vocoder

# download pretrained weights (and optionally move to GPU)
vocoder = Vocoder.from_pretrained(
    "https://github.com/bshall/UniversalVocoding/releases/download/v0.2/univoc-ljspeech-7mtpaq.pt"
).cuda()

# load log-Mel spectrogram from file or tts
mel = ...

# generate waveform
with torch.no_grad():
    wav, sr = vocoder.generate(mel)

# save output
sf.write("path/to/save.wav", wav, sr)

Train from Scratch

  1. Clone the repo:
git clone https://github.com/bshall/UniversalVocoding
cd ./UniversalVocoding
  1. Install requirements:
pip install -r requirements.txt
  1. Download and extract the LJ-Speech dataset:
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xvjf LJSpeech-1.1.tar.bz2
  1. Download the train split here and extract it in the root directory of the repo.
  2. Extract Mel spectrograms and preprocess audio:
python preprocess.py in_dir=path/to/LJSpeech-1.1 out_dir=datasets/LJSpeech-1.1
  1. Train the model:
python train.py checkpoint_dir=ljspeech dataset_dir=datasets/LJSpeech-1.1

Pretrained Models

Pretrained weights for the 10-bit LJ-Speech model are available here.

Notable Differences from the Paper

  1. Trained on 16kHz audio from a single speaker. For an older version trained on 102 different speakers form the ZeroSpeech 2019: TTS without T English dataset click here.
  2. Uses an embedding layer instead of one-hot encoding.

Acknowlegements