---
pipeline_tag: text-to-speech
datasets:
- facebook/multilingual_librispeech
- parler-tts/mls_eng
language:
- en
---
# FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates

[![Demo Page](https://img.shields.io/badge/GitHub.io-Demo_Page-blue?logo=Github&style=flat-square)](https://flexicodec.github.io/)
[![ArXiv](https://img.shields.io/badge/arXiv-PDF-green?logo=arxiv&style=flat-square)](https://arxiv.org/abs/2510.00981)


## Abstract
Neural audio codecs are foundational to speech language models. They are expected to operate at a low frame rate and to decouple semantic and acoustic information. A lower-frame-rate codec reduces the computational cost of speech language models by shortening the sequence length. Recent studies have developed 12.5Hz low-frame-rate audio codecs, but even lower frame rates remain underexplored. We find that a major challenge for very low frame rate tokens is missing semantic information. This paper introduces FlexiCodec to address this limitation. FlexiCodec improves semantic preservation with a dynamic frame rate approach and introduces a novel architecture featuring ASR feature-assisted dual-stream encoding and Transformer bottlenecks. With dynamic frame rates, it uses fewer frames in information-sparse regions by adaptively merging semantically similar frames. The dynamic frame rate also allows FlexiCodec to support inference-time controllable frame rates between 3Hz and 12.5Hz. Experiments at 6.25Hz, 8.3Hz and 12.5Hz average frame rates confirm that FlexiCodec outperforms baseline systems in semantic information preservation and delivers high audio reconstruction quality. We also validate the effectiveness of FlexiCodec in language model-based TTS.
![](.github/flexicodec.png)

## Installation
```bash
git clone https://github.com/amphionspace/FlexiCodec.git
cd FlexiCodec
pip install -r requirements.txt
```
<!-- # pip install -e . -->

## FlexiCodec
Code is available under [`flexicodec/modeling_flexicodec.py`](flexicodec/modeling_flexicodec.py). 

To run inference (the checkpoint is downloaded automatically from Hugging Face):
```python
import torch
import torchaudio
from flexicodec.infer import prepare_model, encode_flexicodec

model_dict = prepare_model()

# Load a real audio file
audio_path = "YOUR_WAV.wav"
audio, sample_rate = torchaudio.load(audio_path)
with torch.no_grad():
    encoded_output = encode_flexicodec(audio, model_dict, sample_rate, num_quantizers=8, merging_threshold=0.91)
    
    reconstructed_audio = model_dict['model'].decode_from_codes(
        semantic_codes=encoded_output['semantic_codes'],
        acoustic_codes=encoded_output['acoustic_codes'],
        token_lengths=encoded_output['token_lengths'],
    )

duration = audio.shape[-1] / sample_rate
output_path = 'decoded_audio.wav'
torchaudio.save(output_path, reconstructed_audio.cpu().squeeze(1), 16000)

print(f"Saved decoded audio to {output_path}")
print(f"This sample avg frame rate: {encoded_output['token_lengths'].shape[-1] / duration:.4f} frames/sec")
```
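
As a quick illustration of the dynamic frame rate, you can sweep `merging_threshold` and print the resulting average frame rate. This is a minimal sketch reusing `audio`, `sample_rate`, and `model_dict` from the snippet above; the specific threshold values are illustrative:
```python
# Sweep the merging threshold: a higher threshold merges fewer frames,
# so the average frame rate rises towards the 12.5Hz ceiling.
duration = audio.shape[-1] / sample_rate
for threshold in [0.80, 0.87, 0.91, 1.0]:  # illustrative values
    with torch.no_grad():
        out = encode_flexicodec(audio, model_dict, sample_rate,
                                num_quantizers=8, merging_threshold=threshold)
    avg_fps = out['token_lengths'].shape[-1] / duration
    print(f"threshold={threshold:.2f} -> {avg_fps:.2f} frames/sec")
```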

Notes:
- You may tune the `num_quantizers` (maximum 24) and `merging_threshold` (maximum 1.0) parameters; the sweep sketch above shows how `merging_threshold` affects the average frame rate. If you set `merging_threshold=1.0`, the model behaves as a standard 12.5Hz neural audio codec, and all items in `token_lengths` will be 1.

- For users in mainland China, you may need to run `export HF_ENDPOINT=https://hf-mirror.com` in your terminal before running the code. If you don't want to download from Hugging Face automatically, you can manually pass your downloaded checkpoint paths [![Huggingface](https://img.shields.io/badge/huggingface-yellow?logo=huggingface&style=flat-square)](https://huggingface.co/jiaqili3/flexicodec/tree/main) to `prepare_model`.


- Batched input is supported: you can directly pass audios shaped `[B, T]` to the script above, but the per-audio length information will then be unavailable. To resolve this, additionally pass an `audio_lens` parameter to `encode_flexicodec`, then crop the output for each audio using `encoded_output['speech_token_len']` (see the sketch after these notes).

- If you want to use the above code from another directory, you may need to add `sys.path.append('/path/to/FlexiCodec')` so Python can find the package.

- To extract continuous features from the semantic tokens, use:
  ```python
  feat = model_dict['model'].get_semantic_feature(encoded_output['semantic_codes'])
  ```
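
For the batched usage described in the notes above, the following minimal sketch may help. It assumes `encode_flexicodec` accepts an `audio_lens` argument and that `encoded_output['speech_token_len']` holds per-item token counts, as described above; the file paths are placeholders and the final indexing may need adjusting to the actual tensor layout:
```python
import torch
import torchaudio
from flexicodec.infer import prepare_model, encode_flexicodec

model_dict = prepare_model()

# Two hypothetical mono files with the same sample rate.
paths = ["first.wav", "second.wav"]
audios, lens = [], []
for p in paths:
    wav, sr = torchaudio.load(p)  # wav: [1, T]
    audios.append(wav.squeeze(0))
    lens.append(wav.shape[-1])

# Zero-pad into a single [B, T] batch and record the true lengths.
batch = torch.zeros(len(audios), max(lens))
for i, wav in enumerate(audios):
    batch[i, : wav.shape[-1]] = wav
audio_lens = torch.tensor(lens)

with torch.no_grad():
    encoded_output = encode_flexicodec(
        batch, model_dict, sr,
        num_quantizers=8, merging_threshold=0.91,
        audio_lens=audio_lens,
    )

# Crop each item's tokens to its valid length along the token axis.
for i, n in enumerate(encoded_output['speech_token_len']):
    semantic_i = encoded_output['semantic_codes'][i, ..., :n]
```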

## FlexiCodec-TTS
First, install additional dependencies:
```bash
sudo apt install espeak-ng
pip install cached_path phonemizer openai-whisper
```

### FlexiCodec-based Voicebox NAR Inference
The Voicebox NAR system can decode FlexiCodec's RVQ-1 tokens into speech. It is used as the second stage in FlexiCodec-TTS, but it can also be used standalone.
To run NAR TTS inference using FlexiCodec-Voicebox:

```python
import torch
import torchaudio
from flexicodec.nar_tts.inference_voicebox import (
    prepare_voicebox_model, 
    infer_voicebox_tts
)
from cached_path import cached_path

# Prepare model (loads model and vocoder)
checkpoint_path = cached_path('hf://jiaqili3/flexicodec/nartts.safetensors')
model_dict = prepare_voicebox_model(checkpoint_path)

# Option 1: Inference with audio file paths
gt_audio_path = "audio_examples/61-70968-0000_gt.wav"    # Target content (example GT audio)
ref_audio_path = "audio_examples/61-70968-0000_ref.wav"  # Reference voice/style

output_audio, output_sr = infer_voicebox_tts(
    model_dict=model_dict,
    gt_audio_path=gt_audio_path,
    ref_audio_path=ref_audio_path,
    n_timesteps=15,          # Number of diffusion steps (default: 15)
    cfg=2.0,                 # Classifier-free guidance scale (default: 2.0)
    rescale_cfg=0.75,        # CFG rescaling factor (default: 0.75)
    merging_threshold=1.0    # Merging threshold for frame rate control (default: 1.0, max: 1.0)
)

# Save output
torchaudio.save("output.wav", output_audio.unsqueeze(0) if output_audio.dim() == 1 else output_audio, output_sr)

# Option 2: Inference with audio tensors
gt_audio, gt_sr = torchaudio.load("path/to/ground_truth.wav")
ref_audio, ref_sr = torchaudio.load("path/to/reference.wav")

output_audio, output_sr = infer_voicebox_tts(
    model_dict=model_dict,
    gt_audio=gt_audio,
    ref_audio=ref_audio,
    gt_sample_rate=gt_sr,
    ref_sample_rate=ref_sr,
    n_timesteps=15,
    cfg=2.0,
    rescale_cfg=0.75,
    merging_threshold=1.0
)
```

**Notes:**
- The model automatically detects and uses CUDA, MPS (Apple Silicon), or CPU devices
- Ground truth audio (`gt_audio`) determines the semantic content of the output
- Reference audio (`ref_audio`) determines the voice/style characteristics
- Output sample rate is typically 16000 Hz or 24000 Hz depending on the model configuration
- You can reuse `model_dict` for multiple inference calls to avoid reloading the model (see the sketch after these notes)
- `merging_threshold` controls FlexiCodec's dynamic frame rate: lower values (e.g., 0.87, 0.91) enable merging for lower average frame rates, while 1.0 disables merging (standard 12.5Hz)
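
Since `model_dict` can be reused, processing several utterances only loads the model once. A minimal sketch reusing `model_dict` and `infer_voicebox_tts` from the snippet above (the file-path pairs are placeholders, and the sampling parameters are left at their documented defaults):
```python
import torchaudio

# Reuse the prepared `model_dict` across several (gt, ref) pairs;
# only the inference call runs per item.
pairs = [
    ("audio_examples/61-70968-0000_gt.wav", "audio_examples/61-70968-0000_ref.wav"),
    # ... more (gt_audio_path, ref_audio_path) pairs
]
for i, (gt_path, ref_path) in enumerate(pairs):
    output_audio, output_sr = infer_voicebox_tts(
        model_dict=model_dict,
        gt_audio_path=gt_path,
        ref_audio_path=ref_path,
        merging_threshold=0.91,  # below 1.0: enables frame merging
    )
    out = output_audio.unsqueeze(0) if output_audio.dim() == 1 else output_audio
    torchaudio.save(f"output_{i}.wav", out, output_sr)
```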

### FlexiCodec-based AR+NAR TTS Inference
The AR+NAR TTS system generates speech tokens from text using an autoregressive transformer model, and then uses the Voicebox NAR system to decode the tokens into audio.

To perform complete text-to-speech with both AR generation and NAR decoding:

```python
import torch
import torchaudio
from flexicodec.ar_tts.inference_tts import tts_synthesize
from flexicodec.ar_tts.modeling_artts import prepare_artts_model
from flexicodec.nar_tts.inference_voicebox import prepare_voicebox_model
from cached_path import cached_path

# Prepare both AR and NAR models
ar_checkpoint = cached_path('hf://jiaqili3/flexicodec/artts.safetensors')
nar_checkpoint = cached_path('hf://jiaqili3/flexicodec/nartts.safetensors')

ar_model_dict = prepare_artts_model(ar_checkpoint)
nar_model_dict = prepare_voicebox_model(nar_checkpoint)

# Full TTS synthesis
output_audio, output_sr = tts_synthesize(
    ar_model_dict=ar_model_dict,
    nar_model_dict=nar_model_dict,
    text="Hello, this is a complete text-to-speech example.",
    language="en",
    ref_audio_path="audio_examples/61-70968-0000_ref.wav",  # Reference voice
    ref_text="bear us escort so far as the Sheriff's house",  # Optional reference text
    merging_threshold=0.91,  # Frame rate control (used for both AR and NAR)
    beam_size=1,
    top_k=25,
    temperature=1.0,
    predict_duration=True,
    duration_top_k=1,
    n_timesteps=15,  # NAR diffusion steps
    cfg=2.0,  # NAR classifier-free guidance
    rescale_cfg=0.75,  # NAR CFG rescaling
    use_nar=True,  # Set to False for AR-only decoding
)

# Save output
torchaudio.save("output.wav", output_audio.unsqueeze(0) if output_audio.dim() == 1 else output_audio, output_sr)
```

**Notes:**
- `tts_synthesize` performs the full pipeline: AR generation + NAR decoding to audio
- Reference audio (`ref_audio_path`) provides the voice/style characteristics
- Reference text (`ref_text`) is optional and can help with prosody alignment
- Set `use_nar=False` in `tts_synthesize` to use AR-only decoding (faster but lower quality)

### Training reference implementations
Inside `flexicodec/ar_tts/modeling_artts.py` and `flexicodec/nar_tts/modeling_voicebox.py` there are `training_forward` methods that receive audios and prepared SenseVoice-small input FBank features: a `dl_output` dictionary containing `x` (the [`feature_extractor`](flexicodec/infer.py#L50) output), `x_lens` (the length of each `x` before padding), and `audio` (the 16kHz audio tensor).
Training can be replicated by passing the same data to the `training_forward` methods, as sketched below.
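
A minimal sketch of what a replicated training step might look like, assuming a dataloader that yields the `dl_output` dictionary described above and that `training_forward` returns a scalar loss (if it returns a dict, select the loss entry accordingly); the optimizer setup is illustrative, not the repository's actual training script:
```python
import torch

model = model_dict['model']  # e.g. from prepare_voicebox_model above
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # illustrative choice

for dl_output in dataloader:  # hypothetical loader yielding the dict described above
    # dl_output: {'x': FBank features, 'x_lens': lengths before padding,
    #             'audio': 16kHz waveform tensor}
    loss = model.training_forward(dl_output)  # assumed to return the training loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```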



## Acknowledgements & Citation
- Our codebase setup is based on [DualCodec](https://github.com/jiaqili3/DualCodec)
- We thank the [Mimi codec](https://github.com/kyutai-labs/moshi) authors for their transformer implementations

If you find our work useful, please consider citing:
```bibtex
@article{li2025flexicodec,
  title={FlexiCodec: A Dynamic Neural Audio Codec for Low Frame Rates},
  author={Li, Jiaqi and Qian, Yao and Hu, Yuxuan and Zhang, Leying and Wang, Xiaofei and Lu, Heng and Thakker, Manthan and Li, Jinyu and Zhao, Shang and Wu, Zhizheng},
  journal={arXiv preprint arXiv:2510.00981},
  year={2025}
}

@article{li2025dualcodec,
  title={Dualcodec: A low-frame-rate, semantically-enhanced neural audio codec for speech generation},
  author={Li, Jiaqi and Lin, Xiaolong and Li, Zhekai and Huang, Shixi and Wang, Yuancheng and Wang, Chaoren and Zhan, Zhenpeng and Wu, Zhizheng},
  journal={Interspeech 2025},
  year={2025}
}
```