How to make NeuTTS-air generate over 200 seconds of audio in a single second.

Published November 21, 2025

NeuTTS-air is a high-quality 0.5B-parameter TTS model that generates speech from text input. It can produce realistic, emotional speech and clone voices. However, out of the box with transformers it is relatively slow on GPUs for such a tiny model. So I set out to optimize the model considerably so it can generate minutes of audio in seconds. All benchmarks were run on a single RTX 4070 Ti Super GPU.

Optimizing the LLM

The LLM in NeuTTS-air is a standard 0.5B-parameter Qwen2 model, so there are many libraries available to optimize it; this was in fact the somewhat "easier" part. I used the heavily optimized LMDeploy library to run the 0.5B LLM. There are several reasons I chose LMDeploy over other fast libraries such as vLLM, SGLang, or TensorRT-LLM:

  • Simpler installation: vLLM/SGLang/TensorRT-LLM tend to lead to dependency hell without a completely clean venv.
  • Windows: LMDeploy works on Windows without a custom installation or sacrificing speed.
  • Extremely fast: faster than vLLM and consistently on par with SGLang/TensorRT-LLM, if not faster, in both single requests and larger batches.
  • Low latency: LMDeploy has an extremely low latency of under 50 ms.

Hence, LMDeploy clearly seems the best overall choice.
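As a minimal sketch of what this looks like in code (the model path, prompt, and generation settings below are illustrative, and this assumes the NeuTTS-air backbone loads as a standard Qwen2 checkpoint):

```python
from lmdeploy import GenerationConfig, TurbomindEngineConfig, pipeline

# Illustrative path: point this at the Qwen2-style NeuTTS-air backbone.
pipe = pipeline(
    "neuphonic/neutts-air",
    backend_config=TurbomindEngineConfig(dtype="bfloat16"),  # bf16 only, see below
)

# The backbone emits speech-codec tokens as ordinary text tokens,
# so generation is just a plain LLM call.
out = pipe(
    ["<reference voice codes + target text prompt>"],
    gen_config=GenerationConfig(max_new_tokens=2048, top_p=0.95, temperature=1.0),
)
print(out[0].text)  # codec tokens, decoded into audio by the codec later
```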

Findings from optimizing the LLM

To further improve speed, I also used advanced LMDeploy techniques such as prefix caching and the int8 KV cache. Overall they seem to help, but here is what I found out about them:

  • Prefix caching significantly improved batching speed at the cost of some latency. The repo therefore has an option to turn off prefix caching, which is a good idea for streaming.
  • The int8 KV cache also helped with batching: it stores the context in int8 instead of the default bf16, which saves VRAM. There is a minor quality loss, but not much.
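Both options are plain engine-config flags in LMDeploy. A hedged sketch (the values are illustrative and should be tuned per workload):

```python
from lmdeploy import TurbomindEngineConfig, pipeline

engine_cfg = TurbomindEngineConfig(
    dtype="bfloat16",            # NeuTTS-air does not work in float16 (see below)
    enable_prefix_caching=True,  # faster batching; disable for low-latency streaming
    quant_policy=8,              # store the KV cache in int8 instead of bf16
)
pipe = pipeline("neuphonic/neutts-air", backend_config=engine_cfg)
```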

Another key thing I found is that NeuTTS-air does not work in float16; it only works in bfloat16 or float32. Unfortunately, this means any pre-Ampere GPU, such as the T4 or the 20xx series, will not work with LMDeploy. vLLM supports float32, which does work on older GPUs, but it is much slower than LMDeploy, so instead of using two separate libraries and creating lots of messy code, I decided the best option was to use only LMDeploy.

The neural codec

NeuTTS-air can use two separate codecs. The first is the basic NeuCodec, which is very similar to XCodec2's architecture. To summarize the architecture:

  • a Wav2Vec2 semantic encoder, which encodes the "meaning" of the audio
  • a BigCodec acoustic encoder, which encodes the acoustic information of the audio
  • a Vocos decoder, which decodes tokens back into audio

However, a key difference is that while XCodec2 decodes tokens into 16 kHz audio, NeuCodec decodes tokens into 24 kHz audio.
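In code, a round trip through the codec looks roughly like this (the method names are assumptions based on my reading of the neucodec package; verify them against its README):

```python
from neucodec import NeuCodec  # assumption: the `neucodec` pip package

codec = NeuCodec.from_pretrained("neuphonic/neucodec").eval().cuda()

# Encode: Wav2Vec2 semantic + BigCodec acoustic features -> discrete codes.
codes = codec.encode_code(audio_or_path="reference.wav")  # assumed method name

# Decode: the Vocos decoder maps the codes back into 24 kHz audio.
audio_24k = codec.decode_code(codes)  # assumed method name
```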

NeuTTS-air also has another codec, Distill NeuCodec, which has an identical decoder but much faster encoders. It makes two simple modifications:

  • BigCodec acoustic encoder --> SQCodec acoustic encoder (much faster)
  • Wav2Vec2 semantic encoder --> DistilHuBERT semantic encoder (much smaller)

Hence, I used Distill NeuCodec instead, which improves encoding speed considerably.
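Switching codecs is essentially just a checkpoint swap (the class and repo names here are assumed from the naming above):

```python
from neucodec import DistillNeuCodec  # assumption: class name in the neucodec package

# Same decoder, much faster and smaller encoders.
codec = DistillNeuCodec.from_pretrained("neuphonic/distill-neucodec").eval().cuda()
```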

[Figure: Distill NeuCodec architecture]

Optimizing the codec further

Using Distill NeuCodec worked well for seconds of audio, but as soon as I started generating minutes of audio, the codec quickly became the bottleneck instead of the LLM. So I optimized it further by splitting all the tokens generated by the LLM into groups of 50 and batch-decoding the audio, which considerably improved speed compared to decoding it sequentially. The codec alone went from 400x realtime to 800x realtime, which was a major win for end-to-end speed. This results in a final end-to-end realtime factor of 211x using test.txt from the repo as the input!
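Here is a minimal sketch of the chunk-and-batch idea (the `codec.decode` interface and the samples-per-token constant are assumptions, not the exact FastNeuTTS API):

```python
import torch
import torch.nn.functional as F

SAMPLES_PER_TOKEN = 480  # assumption: 24 kHz output at 50 codec tokens per second


@torch.inference_mode()
def batch_decode(codec, codes: torch.Tensor, chunk_size: int = 50) -> torch.Tensor:
    """Decode a long 1-D stream of codec tokens as one batched call.

    Splits `codes` into fixed-size chunks, pads the last chunk, stacks
    everything into a single (num_chunks, chunk_size) batch, and decodes
    it in one pass instead of sequentially.
    """
    chunks = list(torch.split(codes, chunk_size))   # groups of 50 tokens
    pad = chunk_size - chunks[-1].shape[0]
    if pad:
        chunks[-1] = F.pad(chunks[-1], (0, pad))    # pad last chunk to full size
    batch = torch.stack(chunks)                     # (num_chunks, chunk_size)
    audio = codec.decode(batch).reshape(-1)         # assumed (B, T) -> waveforms
    if pad:
        audio = audio[: -pad * SAMPLES_PER_TOKEN]   # trim audio from the padding
    return audio
```

One design caveat: decoding chunks independently can in principle introduce audible seams at chunk boundaries, so some overlap or crossfade between chunks may be worth adding.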

[Figure: Full architecture]

Next steps

There are still many things to implement, and obviously I could not implement everything at once. Some things I plan to add are:

  • Multilingual models: several NeuTTS models have already been uploaded for Hindi, French, Dutch, and Spanish.
  • Multispeaker models: generating speech in podcast formats would be useful to many.
  • Online streaming inference: generating speech with ~100 ms latency while supporting many concurrent users.

Thanks for reading this article. If you found it useful, it would be great if you could support the repo by giving it a star, and I will be happy to help fix issues or implement new features.

Repo link: https://github.com/ysharma3501/FastNeuTTS

Original NeuTTS model: https://huggingface.co/neuphonic/neutts-air
