Fork of https://github.com/alokprasad/fastspeech_squeezewave that also fixes denoising in SqueezeWave.
## SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis
By Bohan Zhai\*, Tianren Gao\*, Flora Xue, Daniel Rothchild, Bichen Wu, Joseph Gonzalez, and Kurt Keutzer (UC Berkeley)

Automatic speech synthesis is a challenging task that is becoming increasingly important as edge devices begin to interact with users through speech. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into an audio waveform. Most existing vocoders are difficult to parallelize since each generated sample is conditioned on previous samples. WaveGlow is a flow-based feed-forward alternative to these auto-regressive models (Prenger et al., 2019). However, while WaveGlow can be easily parallelized, the model is too expensive for real-time speech synthesis on the edge. This paper presents SqueezeWave, a family of lightweight vocoders based on WaveGlow that can generate audio of similar quality to WaveGlow with 61x-214x fewer MACs.

Link to the paper: [paper]. If you find this work useful, please consider citing:
```
@article{squeezewave,
  Author = {Bohan Zhai and Tianren Gao and Flora Xue and Daniel Rothchild and Bichen Wu and Joseph Gonzalez and Kurt Keutzer},
  Title = {SqueezeWave: Extremely Lightweight Vocoders for On-device Speech Synthesis},
  Journal = {arXiv:2001.05685},
  Year = {2020}
}
```
### Audio samples generated by SqueezeWave
Audio samples of SqueezeWave are here: https://tianrengao.github.io/SqueezeWaveDemo/
### Results
We introduce 4 variants of SqueezeWave in our paper; see the table below.

| Model | Audio length | n_channels | MACs (G) | Reduction | MOS |
| ---------------- | ------------ | ---------- | -------- | --------- | --------- |
| WaveGlow         | 2048         | 8          | 228.9    | 1x        | 4.57±0.04 |
| SqueezeWave-128L | 128          | 256        | 3.78     | 60x       | 4.07±0.06 |
| SqueezeWave-64L  | 64           | 256        | 2.16     | 106x      | 3.77±0.05 |
| SqueezeWave-128S | 128          | 128        | 1.06     | 214x      | 3.79±0.05 |
| SqueezeWave-64S  | 64           | 128        | 0.68     | 332x      | 2.74±0.04 |
### Model Complexity
A detailed MAC calculation can be found [here](https://github.com/tianrengao/SqueezeWave/blob/master/SqueezeWave_computational_complexity.ipynb).
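For quick intuition, below is a minimal sketch of how MACs are counted for a single 1D convolution, the dominant operation in both vocoders. The layer shapes are illustrative only, not taken from the actual models; see the linked notebook for the real dimensions.
```
def conv1d_macs(in_channels, out_channels, kernel_size, output_length):
    """Multiply-accumulates for one 1D convolution: each of the
    out_channels * output_length outputs sums over
    in_channels * kernel_size products."""
    return in_channels * kernel_size * out_channels * output_length

# Illustrative layer applied to one second of 22,050 Hz audio.
macs = conv1d_macs(in_channels=128, out_channels=128, kernel_size=3,
                   output_length=22050)
print(f"{macs / 1e9:.2f} GMACs")  # ~1.08 GMACs for this single layer
```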
## Setup
0. (Optional) Create a virtual environment
```
virtualenv env
source env/bin/activate
```
1. Clone our repo and initialize the submodule
```command
git clone https://github.com/tianrengao/SqueezeWave.git
cd SqueezeWave
git submodule init
git submodule update
```
2. Install requirements
```pip3 install -r requirements.txt```
3. Install [Apex]
```command
cd ../
git clone https://www.github.com/nvidia/apex
cd apex
python setup.py install
```
## Generate audio with our pretrained model
1. Download our [pretrained models]. We provide 4 pretrained models, as described in the paper.
2. Download the [mel-spectrograms].
3. Generate audio. Replace `SqueezeWave.pt` with the name of the specific pretrained model; an optional denoising sketch follows the command.
```python3 inference.py -f <(ls mel_spectrograms/*.pt) -w SqueezeWave.pt -o . --is_fp16 -s 0.6```
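Since the point of this fork is fixing denoising, you may also want to pass the generated audio through a denoiser. The sketch below assumes the fork exposes a WaveGlow-style `Denoiser` class in a `denoiser` module, as upstream WaveGlow does; the module path, mel file name, and `strength` value are all assumptions, so check this repo's `inference.py` for the actual API.
```
import torch
from denoiser import Denoiser  # assumed module/class, as in NVIDIA/waveglow

# Load a pretrained checkpoint (the same file passed to inference.py above).
squeezewave = torch.load('SqueezeWave.pt')['model']
squeezewave = squeezewave.remove_weightnorm(squeezewave).cuda().eval()
denoiser = Denoiser(squeezewave).cuda()

mel = torch.load('mel_spectrograms/LJ001-0001.pt').cuda()  # hypothetical file name
with torch.no_grad():
    audio = squeezewave.infer(mel.unsqueeze(0), sigma=0.6)
    # strength trades noise removal against fidelity; 0.01 is the value
    # used in the upstream WaveGlow examples.
    audio_denoised = denoiser(audio, strength=0.01)
```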
## Train your own model
1. Download the [LJ Speech Data]. We assume all the audio files are stored in the directory `./data/`; one possible download recipe is shown below.
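The tarball URL below is the one listed on the LJ Speech page; the symlink makes the paths match the commands in the next step:
```command
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar -xjf LJSpeech-1.1.tar.bz2
ln -s LJSpeech-1.1/wavs data
```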
2. Make a list of the file names to use for training/testing
```command
ls data/*.wav | tail -n+10 > train_files.txt
ls data/*.wav | head -n10 > test_files.txt
```
3. We provide 4 model configurations, with the audio length and number of channels for each listed in the table above. The configuration files are in the `configs/` directory. To choose the model you want to train, select the corresponding configuration file; you can also diff two configs, as shown below, to see what distinguishes the variants.
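Both file names here are ones used elsewhere in this README; diffing them is a quick way to see exactly which settings differ between two variants:
```command
ls configs/
diff configs/config_a128_c256.json configs/config_a256_c128.json
```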
4. Train your SqueezeWave model
```command
mkdir checkpoints
python train.py -c configs/config_a256_c128.json
```
For multi-GPU training, replace `train.py` with `distributed.py`; this has only been tested on a single node with NCCL. For mixed-precision training, set `"fp16_run": true` in the config file. A combined example is sketched below.
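A sketch of a multi-GPU, mixed-precision run, assuming `distributed.py` accepts the same `-c` config flag as `train.py` (it does in upstream WaveGlow):
```command
# After setting "fp16_run": true in the config (requires Apex from the setup
# step), launch multi-GPU training on a single node with NCCL:
python distributed.py -c configs/config_a256_c128.json
```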
5. Make test set mel-spectrograms
```
mkdir -p eval/mels
python3 mel2samp.py -f test_files.txt -o eval/mels -c configs/config_a128_c256.json
```
6. Run inference on the test data.
```command
ls eval/mels > eval/mel_files.txt
sed -i -e 's_.*_eval/mels/&_' eval/mel_files.txt
mkdir -p eval/output
python3 inference.py -f eval/mel_files.txt -w checkpoints/SqueezeWave_10000 -o eval/output --is_fp16 -s 0.6
```
Replace `SqueezeWave_10000` with the checkpoint you want to test.
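To compare audio from several checkpoints, a small shell loop works; the checkpoint naming pattern is assumed from the command above:
```command
for ckpt in checkpoints/SqueezeWave_*; do
    out="eval/output_$(basename "$ckpt")"
    mkdir -p "$out"
    python3 inference.py -f eval/mel_files.txt -w "$ckpt" -o "$out" --is_fp16 -s 0.6
done
```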
## Credits
The implementation of this work is based on WaveGlow: https://github.com/NVIDIA/waveglow

[//]: # (TODO)
[//]: # (PROVIDE INSTRUCTIONS FOR DOWNLOADING LJS)
[pytorch 1.0]: https://github.com/pytorch/pytorch#installation
[website]: https://nv-adlr.github.io/WaveGlow
[paper]: https://arxiv.org/abs/2001.05685
[WaveNet implementation]: https://github.com/r9y9/wavenet_vocoder
[Glow]: https://blog.openai.com/glow/
[WaveNet]: https://deepmind.com/blog/wavenet-generative-model-raw-audio/
[PyTorch]: http://pytorch.org
[pretrained models]: https://drive.google.com/file/d/1RyVMLY2l8JJGq_dCEAAd8rIRIn_k13UB/view?usp=sharing
[mel-spectrograms]: https://drive.google.com/file/d/1g_VXK2lpP9J25dQFhQwx7doWl_p20fXA/view?usp=sharing
[LJ Speech Data]: https://keithito.com/LJ-Speech-Dataset
[Apex]: https://github.com/nvidia/apex