Fork of https://github.com/alokprasad/fastspeech_squeezewave that also fixes denoising in SqueezeWave.

Repository contents:

    configs/
    README.md
    SqueezeWave_computational_complexity.ipynb
    TacotronSTFT.py
    audio_processing.py
    convert_model.py
    denoiser.py
    distributed.py
    glow.py
    inference.py
    mel2samp.py
    requirements.txt
    stft.py
    train.py


SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis

By Bohan Zhai *, Tianren Gao *, Flora Xue, Daniel Rothchild, Bichen Wu, Joseph Gonzalez, and Kurt Keutzer (UC Berkeley)

Automatic speech synthesis is a challenging task that is becoming increasingly important as edge devices begin to interact with users through speech. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into an audio waveform. Most existing vocoders are difficult to parallelize since each generated sample is conditioned on previous samples. WaveGlow is a flow-based feed-forward alternative to these auto-regressive models (Prenger et al., 2019). However, while WaveGlow can be easily parallelized, the model is too expensive for real-time speech synthesis on the edge. This paper presents SqueezeWave, a family of lightweight vocoders based on WaveGlow that can generate audio of similar quality to WaveGlow with 61x - 214x fewer MACs.

Link to the paper: https://arxiv.org/abs/2001.05685. If you find this work useful, please consider citing:

@inproceedings{squeezewave,
   Author = {Bohan Zhai and Tianren Gao and Flora Xue and Daniel Rothchild and Bichen Wu and Joseph Gonzalez and Kurt Keutzer},
   Title = {SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis},
   Journal = {arXiv:2001.05685},
   Year = {2020}
}

Audio samples generated by SqueezeWave

Audio samples of SqueezeWave are here: https://tianrengao.github.io/SqueezeWaveDemo/

Results

We introduce 4 variants of SqueezeWave in our paper. See the table below.

| Model            | length | n_channels | MACs (G) | Reduction | MOS       |
|------------------|--------|------------|----------|-----------|-----------|
| WaveGlow         | 2048   | 8          | 228.9    | 1x        | 4.57±0.04 |
| SqueezeWave-128L | 128    | 256        | 3.78     | 60x       | 4.07±0.06 |
| SqueezeWave-64L  | 64     | 256        | 2.16     | 106x      | 3.77±0.05 |
| SqueezeWave-128S | 128    | 128        | 1.06     | 214x      | 3.79±0.05 |
| SqueezeWave-64S  | 64     | 128        | 0.68     | 332x      | 2.74±0.04 |
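The Reduction column is relative to WaveGlow's MACs; for example, SqueezeWave-128L requires 228.9 / 3.78 ≈ 60x fewer MACs.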

Model Complexity

A detailed MAC calculation can be found in SqueezeWave_computational_complexity.ipynb.
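
Most of the compute in these vocoders comes from 1D convolutions, so per-layer MACs follow directly from the layer shape. The helper below is an illustrative back-of-the-envelope counter (not from the repo); see the notebook for the full per-layer breakdown:

    import torch.nn as nn

    def conv1d_macs(conv: nn.Conv1d, output_length: int) -> int:
        # Each output sample needs (in_channels / groups) * kernel_size
        # multiply-accumulates per output channel (bias ignored).
        return (conv.in_channels // conv.groups) * conv.kernel_size[0] \
            * conv.out_channels * output_length

    conv = nn.Conv1d(in_channels=128, out_channels=256, kernel_size=3, padding=1)
    print(conv1d_macs(conv, output_length=64))  # 128 * 3 * 256 * 64 = 6,291,456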

Setup

  1. (Optional) Create a virtual environment

    virtualenv env
    source env/bin/activate
    
  2. Clone our repo and initialize submodule

    git clone https://github.com/tianrengao/SqueezeWave.git
    cd SqueezeWave
    git submodule init
    git submodule update
    
  3. Install requirements

    pip3 install -r requirements.txt

  4. Install Apex

    cd ../
    git clone https://www.github.com/nvidia/apex
    cd apex
    python setup.py install
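
    Note: assuming the training script follows WaveGlow's, Apex is only needed for mixed-precision training ("fp16_run": true); plain FP32 training and inference should work without it.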
    

Generate audio with our pretrained model

  1. Download our pretrained models. We provide the 4 pretrained models described in the paper.

  2. Download mel-spectrograms

  3. Generate audio. Replace SqueezeWave.pt with the name of the specific pretrained model you downloaded. A rough sketch of the inference flow follows the command.

    python3 inference.py -f <(ls mel_spectrograms/*.pt) -w SqueezeWave.pt -o . --is_fp16 -s 0.6
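
    The following is a hypothetical sketch of the per-file inference flow, assuming the WaveGlow-style checkpoint format and Denoiser interface this code base derives from; see inference.py for the exact argument handling (the mel file name below is illustrative):

    import torch
    from scipy.io.wavfile import write
    from denoiser import Denoiser

    # Load a WaveGlow-style checkpoint (assumed format: a dict with a 'model' key)
    model = torch.load('SqueezeWave.pt', map_location='cpu')['model']
    model.eval()
    denoiser = Denoiser(model)  # removes the model's bias noise from the output

    mel = torch.load('mel_spectrograms/example.pt')       # hypothetical mel file
    with torch.no_grad():
        audio = model.infer(mel.unsqueeze(0), sigma=0.6)  # matches -s 0.6 above
        audio = denoiser(audio, strength=0.01)
    audio = (audio.squeeze().numpy() * 32768.0).astype('int16')
    write('example_synthesized.wav', 22050, audio)        # LJSpeech sample rate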

Train your own model

  1. Download LJ Speech Data. We assume all the waves are stored in the directory data/, as used by the commands below.

  2. Make a list of the file names to use for training/testing

    ls data/*.wav | tail -n+11 > train_files.txt   # all files except the first 10
    ls data/*.wav | head -n10 > test_files.txt     # first 10 files for testing
    
  3. We provide 4 model configurations, with the audio length and number of channels for each specified in the results table above. The configuration files are under the configs/ directory. To choose the model you want to train, select the corresponding configuration file.

  4. Train your SqueezeWave model

    mkdir checkpoints
    python train.py -c configs/config_a256_c128.json
    

    For multi-GPU training, replace train.py with distributed.py. This has only been tested with a single node and NCCL.

    For mixed-precision training, set "fp16_run": true in the configuration file.
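
    For background, SqueezeWave keeps WaveGlow's maximum-likelihood training objective. Below is a minimal sketch of that loss, assuming the model's forward pass returns the latent z and per-flow log-determinant terms as in WaveGlow's glow.py (an assumption; see train.py for the actual code):

    import torch

    # WaveGlow-style flow loss: negative log-likelihood of the audio under a
    # zero-mean Gaussian prior on z, minus log-determinants of the flow steps
    def flow_loss(z, log_s_list, log_det_w_list, sigma=1.0):
        loss = torch.sum(z * z) / (2 * sigma * sigma)
        for log_s in log_s_list:
            loss = loss - torch.sum(log_s)
        for log_det_w in log_det_w_list:
            loss = loss - torch.sum(log_det_w)
        return loss / (z.size(0) * z.size(1) * z.size(2))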

  5. Make test set mel-spectrograms

    mkdir -p eval/mels
    python3 mel2samp.py -f test_files.txt -o eval/mels -c configs/config_a128_c256.json
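
    Under the hood, mel2samp.py computes mel-spectrograms with this repo's TacotronSTFT module. Below is a rough sketch of the per-file processing, assuming the standard Tacotron2/WaveGlow STFT parameters (the actual values come from the -c config file, and the file name is illustrative):

    import torch
    from scipy.io.wavfile import read
    from TacotronSTFT import TacotronSTFT

    # Standard Tacotron2/WaveGlow mel extraction settings (assumed here)
    stft = TacotronSTFT(filter_length=1024, hop_length=256, win_length=1024,
                        n_mel_channels=80, sampling_rate=22050,
                        mel_fmin=0.0, mel_fmax=8000.0)

    sampling_rate, data = read('data/LJ001-0001.wav')            # hypothetical file
    audio = torch.FloatTensor(data.astype('float32')) / 32768.0  # int16 -> [-1, 1]
    mel = stft.mel_spectrogram(audio.unsqueeze(0))               # (1, 80, frames)
    torch.save(mel.squeeze(0), 'eval/mels/LJ001-0001.pt')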
    
  6. Run inference on the test data.

    ls eval/mels > eval/mel_files.txt                  # list the mel files
    sed -i -e 's_.*_eval/mels/&_' eval/mel_files.txt   # prefix each line with eval/mels/
    mkdir -p eval/output
    python3 inference.py -f eval/mel_files.txt -w checkpoints/SqueezeWave_10000 -o eval/output --is_fp16 -s 0.6
    

    Replace SqueezeWave_10000 with the checkpoint you want to test.

Credits

The implementation of this work is based on WaveGlow: https://github.com/NVIDIA/waveglow