jiaqili3 committed (verified) · Commit b9bbf17 · Parent: ecfd2e9

Update README.md

Files changed (1): README.md (+135 -10)
---
pipeline_tag: text-to-speech
datasets:
- facebook/multilingual_librispeech
- parler-tts/mls_eng
language:
- en
---
# FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates

[![Demo Page](https://img.shields.io/badge/GitHub.io-Demo_Page-blue?logo=Github&style=flat-square)](https://flexicodec.github.io/)
[![ArXiv](https://img.shields.io/badge/arXiv-PDF-green?logo=arxiv&style=flat-square)](https://arxiv.org/abs/2510.00981)

## Abstract
Neural audio codecs are foundational to speech language models. They are expected to have a low frame rate and decoupled semantic and acoustic information. A lower frame rate codec can reduce the computational cost of speech language models by shortening the sequence length. Recent studies have developed 12.5Hz low-frame-rate audio codecs, but even lower frame rate codecs remain underexplored. We find that a major challenge for very low frame rate tokens is missing semantic information. This paper introduces FlexiCodec to address this limitation. FlexiCodec improves semantic preservation with a dynamic frame rate approach and introduces a novel architecture featuring ASR feature-assisted dual-stream encoding and Transformer bottlenecks. With dynamic frame rates, it uses fewer frames in information-sparse regions by adaptively merging semantically similar frames. The dynamic frame rate also allows FlexiCodec to support inference-time controllable frame rates between 3Hz and 12.5Hz. Experiments on 6.25Hz, 8.3Hz and 12.5Hz average frame rates confirm that FlexiCodec surpasses baseline systems in semantic information preservation and delivers high audio reconstruction quality. We also validate the effectiveness of FlexiCodec in language model-based TTS.
 
```python
torchaudio.save(output_path, reconstructed_audio.cpu().squeeze(1), 16000)
print(f"Saved decoded audio to {output_path}")
print(f"This sample avg frame rate: {encoded_output['token_lengths'].shape[-1] / duration:.4f} frames/sec")
```

Notes:
- You may tune the `num_quantizers` (maximum 24) and `merging_threshold` (maximum 1.0) parameters. If you set `merging_threshold=1.0`, FlexiCodec behaves as a standard 12.5Hz neural audio codec and every item in `token_lengths` will be 1.

- For mainland China users: you may need to run `export HF_ENDPOINT=https://hf-mirror.com` in your terminal before running the code. If you prefer not to download from Hugging Face automatically, you can download the checkpoints [![Huggingface](https://img.shields.io/badge/huggingface-yellow?logo=huggingface&style=flat-square)](https://huggingface.co/jiaqili3/flexicodec/tree/main) manually and specify their paths in `prepare_model`.

- Batched input is supported. You can directly pass audios shaped [B, T] to the script above, but the audio length information will then be unavailable. To resolve this, you can additionally pass an `audio_lens` parameter to `encode_flexicodec` and crop the output for each audio using `encoded_output['speech_token_len']` (see the sketch after these notes).

- If you want to use the above code elsewhere, you may need to add `sys.path.append('/path/to/FlexiCodec')` so the package can be found.

- To extract continuous features from the semantic tokens, use:
```python
feat = model_dict['model'].get_semantic_feature(encoded_output['semantic_codes'])
```
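
As mentioned in the batched-input note above, the sketch below illustrates one way to batch two clips and crop the per-item outputs. It is a minimal sketch, not the repository's own example: it assumes `model_dict` and `encode_flexicodec` are already set up as in the single-audio usage example earlier in this README, and that `encode_flexicodec` accepts a `[B, T]` batch plus an `audio_lens` tensor; check the repository code for the exact signature.
```python
import torch

# Zero-pad two mono 16 kHz clips of different lengths into one [B, T] batch.
wav_a = torch.randn(16000 * 3)  # 3-second clip (placeholder audio)
wav_b = torch.randn(16000 * 5)  # 5-second clip (placeholder audio)
audio_lens = torch.tensor([wav_a.shape[-1], wav_b.shape[-1]])
batch = torch.zeros(2, int(audio_lens.max()))
batch[0, : wav_a.shape[-1]] = wav_a
batch[1, : wav_b.shape[-1]] = wav_b

# Assumed call shape, mirroring the single-audio example above; passing
# `audio_lens` makes per-item token counts available in the output.
encoded_output = encode_flexicodec(model_dict, batch, audio_lens=audio_lens)

# Crop each item's codes to its valid length using `speech_token_len`.
for i, n_tokens in enumerate(encoded_output['speech_token_len']):
    item_codes = encoded_output['semantic_codes'][i, ..., :n_tokens]
    print(f"item {i}: {int(n_tokens)} tokens kept")
```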

## FlexiCodec-TTS
First, install additional dependencies:
```bash
sudo apt install espeak-ng
pip install cached_path phonemizer openai-whisper
```

### FlexiCodec-based Voicebox NAR Inference
The VoiceBox NAR system can decode FlexiCodec's RVQ-1 tokens into speech. It is used as the second stage in FlexiCodec-TTS, but it can also be used standalone.
To run NAR TTS inference with FlexiCodec-Voicebox:

```python
import torch
import torchaudio
from cached_path import cached_path
from flexicodec.nar_tts.inference_voicebox import (
    prepare_voicebox_model,
    infer_voicebox_tts,
)

# Prepare model (loads model and vocoder)
checkpoint_path = cached_path('hf://jiaqili3/flexicodec/nartts.safetensors')
model_dict = prepare_voicebox_model(checkpoint_path)

# Option 1: Inference with audio file paths
gt_audio_path = "audio_examples/61-70968-0000_gt.wav"    # Target content (example ground-truth audio)
ref_audio_path = "audio_examples/61-70968-0000_ref.wav"  # Reference voice/style

output_audio, output_sr = infer_voicebox_tts(
    model_dict=model_dict,
    gt_audio_path=gt_audio_path,
    ref_audio_path=ref_audio_path,
    n_timesteps=15,         # Number of diffusion steps (default: 15)
    cfg=2.0,                # Classifier-free guidance scale (default: 2.0)
    rescale_cfg=0.75,       # CFG rescaling factor (default: 0.75)
    merging_threshold=1.0,  # Merging threshold for frame rate control (default: 1.0, max: 1.0)
)

# Save output
torchaudio.save("output.wav", output_audio.unsqueeze(0) if output_audio.dim() == 1 else output_audio, output_sr)

# Option 2: Inference with audio tensors
gt_audio, gt_sr = torchaudio.load("path/to/ground_truth.wav")
ref_audio, ref_sr = torchaudio.load("path/to/reference.wav")

output_audio, output_sr = infer_voicebox_tts(
    model_dict=model_dict,
    gt_audio=gt_audio,
    ref_audio=ref_audio,
    gt_sample_rate=gt_sr,
    ref_sample_rate=ref_sr,
    n_timesteps=15,
    cfg=2.0,
    rescale_cfg=0.75,
    merging_threshold=1.0,
)
```

**Notes:**
- The model automatically detects and uses CUDA, MPS (Apple Silicon), or CPU devices
- Ground-truth audio (`gt_audio`) determines the semantic content of the output
- Reference audio (`ref_audio`) determines the voice/style characteristics
- The output sample rate is typically 16000 Hz or 24000 Hz, depending on the model configuration
- You can reuse `model_dict` across multiple inference calls to avoid reloading the model
- `merging_threshold` controls FlexiCodec's dynamic frame rate: lower values (e.g., 0.87, 0.91) enable merging for lower average frame rates, while 1.0 disables merging (standard 12.5Hz); see the short example after these notes
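
As referenced in the last note, a minimal sketch of reusing the loaded `model_dict` across several calls while lowering `merging_threshold` (the threshold values are the illustrative ones from the note above):
```python
# Reuse the same `model_dict` for several inferences; only `merging_threshold` changes.
for threshold in (1.0, 0.91, 0.87):
    audio_out, sr = infer_voicebox_tts(
        model_dict=model_dict,
        gt_audio_path="audio_examples/61-70968-0000_gt.wav",
        ref_audio_path="audio_examples/61-70968-0000_ref.wav",
        n_timesteps=15,
        cfg=2.0,
        rescale_cfg=0.75,
        merging_threshold=threshold,  # lower values -> lower average frame rate
    )
    torchaudio.save(f"output_thr_{threshold}.wav",
                    audio_out.unsqueeze(0) if audio_out.dim() == 1 else audio_out, sr)
```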

### FlexiCodec-based AR+NAR TTS Inference
The AR+NAR TTS system generates speech tokens from text using an autoregressive Transformer model, and then uses the Voicebox NAR system to decode the tokens into audio.

To perform complete text-to-speech with both AR generation and NAR decoding:

```python
import torch
import torchaudio
from cached_path import cached_path
from flexicodec.ar_tts.inference_tts import tts_synthesize
from flexicodec.ar_tts.modeling_artts import prepare_artts_model
from flexicodec.nar_tts.inference_voicebox import prepare_voicebox_model

# Prepare both AR and NAR models
ar_checkpoint = cached_path('hf://jiaqili3/flexicodec/artts.safetensors')
nar_checkpoint = cached_path('hf://jiaqili3/flexicodec/nartts.safetensors')

ar_model_dict = prepare_artts_model(ar_checkpoint)
nar_model_dict = prepare_voicebox_model(nar_checkpoint)

# Full TTS synthesis
output_audio, output_sr = tts_synthesize(
    ar_model_dict=ar_model_dict,
    nar_model_dict=nar_model_dict,
    text="Hello, this is a complete text-to-speech example.",
    language="en",
    ref_audio_path="audio_examples/61-70968-0000_ref.wav",    # Reference voice
    ref_text="bear us escort so far as the Sheriff's house",  # Optional reference text
    merging_threshold=0.91,  # Frame rate control (used for both AR and NAR)
    beam_size=1,
    top_k=25,
    temperature=1.0,
    predict_duration=True,
    duration_top_k=1,
    n_timesteps=15,          # NAR diffusion steps
    cfg=2.0,                 # NAR classifier-free guidance
    rescale_cfg=0.75,        # NAR CFG rescaling
    use_nar=True,            # Set to False for AR-only decoding
)

# Save output
torchaudio.save("output.wav", output_audio.unsqueeze(0) if output_audio.dim() == 1 else output_audio, output_sr)
```

**Notes:**
- `tts_synthesize` performs the full pipeline: AR generation followed by NAR decoding to audio
- Reference audio (`ref_audio_path`) provides the voice/style characteristics
- Reference text (`ref_text`) is optional and can help with prosody alignment
- Set `use_nar=False` in `tts_synthesize` to use AR-only decoding (faster but lower quality); a short sketch follows these notes
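
A short sketch of the AR-only path mentioned in the last note. It reuses the models loaded above and keeps the same parameters as the full example except for `use_nar=False`; the input text here is only illustrative.
```python
# AR-only decoding: skip the NAR stage for a faster, lower-quality result.
draft_audio, draft_sr = tts_synthesize(
    ar_model_dict=ar_model_dict,
    nar_model_dict=nar_model_dict,
    text="A quick draft sentence decoded without the NAR stage.",
    language="en",
    ref_audio_path="audio_examples/61-70968-0000_ref.wav",
    ref_text="bear us escort so far as the Sheriff's house",
    merging_threshold=0.91,
    beam_size=1,
    top_k=25,
    temperature=1.0,
    predict_duration=True,
    duration_top_k=1,
    n_timesteps=15,
    cfg=2.0,
    rescale_cfg=0.75,
    use_nar=False,  # AR-only decoding
)
torchaudio.save("output_ar_only.wav",
                draft_audio.unsqueeze(0) if draft_audio.dim() == 1 else draft_audio, draft_sr)
```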

### Training reference implementations
`flexicodec/ar_tts/modeling_artts.py` and `flexicodec/nar_tts/modeling_voicebox.py` each provide a `training_forward` method that receives the audio together with prepared SenseVoice-Small "FBank" input features: a `dl_output` dictionary containing `x` (the [`feature_extractor`](flexicodec/infer.py#L50) output), `x_lens` (the length of each `x` before padding), and `audio` (the 16 kHz audio tensor).
Training can be replicated by passing the same data to these `training_forward` methods; a sketch of the expected batch layout follows.

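A minimal sketch of the batch layout described above. The dictionary keys (`x`, `x_lens`, `audio`) follow this README; how the model and the SenseVoice-Small `feature_extractor` are instantiated here, and the call signature of `training_forward`, are assumptions to adapt to the repository's actual training code.
```python
import torch

# Assumptions: `ar_model` is the AR TTS module from flexicodec/ar_tts/modeling_artts.py,
# and `feature_extractor` is the SenseVoice-Small FBank extractor referenced in flexicodec/infer.py.
batch_size, num_samples = 4, 16000 * 5          # four 5-second utterances at 16 kHz
audio = torch.randn(batch_size, num_samples)    # the 16 kHz audio tensor

x = feature_extractor(audio)                    # prepared SenseVoice-Small "FBank" features
x_lens = torch.full((batch_size,), x.shape[1])  # length of each x before padding (no padding here)

dl_output = {"x": x, "x_lens": x_lens, "audio": audio}
loss = ar_model.training_forward(dl_output)     # hypothetical return value; depends on the implementation
```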
 
 
 
## Acknowledgements & Citation
- Our codebase setup is based on [DualCodec](https://github.com/jiaqili3/DualCodec)