VieNeu-TTS-1000h

⚠️ EXPERIMENTAL VERSION

VieNeu-TTS-1000h is an experimental release with bilingual Vietnamese + English support. The model is still under evaluation and may have issues with intonation, voice stability, or long-text synthesis.

🔥 The stable version is recommended:

👉 VieNeu-TTS (Stable): the official, stable release for Vietnamese

The 1000h version will replace the stable release after all evaluations and refinements are complete.

Feedback & issues are highly appreciated: [email protected]

Support This Project

Training high-quality TTS models on 1000+ hours of data requires significant GPU resources and compute time. If you find this model useful, please consider supporting its development.

Your support helps maintain and improve VieNeu-TTS! 🙏

Voice Cloning Inference

Reference Voice (Speaker Example):

Input Text:

Trên bầu trời xanh thẳm, những đám mây trắng lửng lờ trôi như những chiếc thuyền nhỏ đang lướt nhẹ theo dòng gió. Dưới mặt đất, cánh đồng lúa vàng rực trải dài tới tận chân trời, những bông lúa nghiêng mình theo từng làn gió.

Generated Output (Cloned Voice):

Long Text Inference

VieNeu-TTS-1000h supports long-form text synthesis (multiple sentences, paragraphs, or entire articles).
For efficient sentence splitting, text normalization, and streaming playback, please refer to the example script in the repository:

🔗 https://github.com/pnnbao97/VieNeu-TTS
Example file: examples/infer_long_text.py
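
As a rough illustration of the sentence-splitting step, the sketch below greedily packs whole sentences into chunks of at most 250 characters, the per-call limit suggested under Best Practices. It is not the code from examples/infer_long_text.py: the function name `split_into_chunks` and the punctuation-based regex are assumptions made for illustration, and it does no text normalization or streaming.

```python
import re

def split_into_chunks(text, max_chars=250):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?…])\s+", text) if s.strip()]
    chunks, current = [], ""
    for sentence in sentences:
        # A single sentence longer than the limit is hard-split, preferably at a space.
        while len(sentence) > max_chars:
            cut = sentence.rfind(" ", 0, max_chars + 1)
            if cut <= 0:
                cut = max_chars
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:cut].strip())
            sentence = sentence[cut:].strip()
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            # The current chunk is full: emit it and start a new one.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk could then be passed to `tts.infer(chunk, ref_codes, ref_text_raw)` as in the Quick Usage script, with the resulting waveforms concatenated in order.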

Long-form speech output example:

Model Architecture

Component	Description
Backbone	Qwen 0.5B (chat-format LM)
Codec	NeuCodec (supports ONNX + quantization)
Output	24 kHz waveform synthesis
Context Window	2048 tokens shared text + speech
Watermark	Enabled
Training Data	~1000h Vietnamese + English speech data
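
Because the 2048-token window is shared between text and speech, the reference clip, the reference transcript, and the input text all reduce the room left for generated audio. The back-of-the-envelope sketch below estimates that remaining budget. The codec rate of 50 tokens per second is an assumption that this card does not state, and the token counts in the example are made up; check both against the actual tokenizer and codec before relying on the numbers.

```python
def speech_budget_seconds(text_tokens, ref_seconds, context_tokens=2048,
                          codec_tokens_per_second=50.0):
    """Estimate how many seconds of generated speech fit in the remaining context.

    text_tokens covers the reference transcript plus the input text.
    codec_tokens_per_second is an ASSUMED value, not taken from this model card.
    """
    ref_tokens = round(ref_seconds * codec_tokens_per_second)
    remaining = context_tokens - text_tokens - ref_tokens
    return max(remaining, 0) / codec_tokens_per_second

# Hypothetical example: 148 text tokens and a 4-second reference clip.
print(speech_budget_seconds(text_tokens=148, ref_seconds=4.0))  # 34.0
```

Under these assumptions, a longer reference clip or a longer prompt directly shortens the audio that can be generated in one call, which is one reason to keep inputs short and split long text.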

Features

High-quality Vietnamese speech synthesis
Bilingual support: Vietnamese + English
Instant voice cloning (3–5 second reference audio)
Fully offline inference
Real-time or faster performance
Multi-voice reference support
Python API + CLI + Gradio interface
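
The "real-time or faster" claim can be checked on your own hardware by measuring the real-time factor (RTF): synthesis time divided by the duration of the audio produced, where values below 1.0 mean faster than real time. The helpers below are a minimal sketch; `real_time_factor` and `timed` are illustrative names, not part of the VieNeu-TTS API.

```python
import time

def real_time_factor(synthesis_seconds, num_samples, sample_rate=24000):
    """Return synthesis time divided by audio duration (below 1.0 is faster than real time)."""
    audio_seconds = num_samples / sample_rate
    if audio_seconds <= 0:
        raise ValueError("audio must contain at least one sample")
    return synthesis_seconds / audio_seconds

def timed(fn, *args, **kwargs):
    """Call fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start
```

For example, `wav, elapsed = timed(tts.infer, text, ref_codes, ref_text_raw)` followed by `real_time_factor(elapsed, len(wav))` would give the RTF for one call, assuming `wav` is a one-dimensional array of 24 kHz samples as the Quick Usage script suggests.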

Installation

git clone https://github.com/pnnbao97/VieNeu-TTS.git
cd VieNeu-TTS
uv sync

Quick Usage (Python)

from vieneu_tts import VieNeuTTS
import soundfile as sf
import torch
import os

device = "cuda" if torch.cuda.is_available() else "cpu"

input_texts = [
    "Các khóa học trực tuyến đang giúp học sinh tiếp cận kiến thức mọi lúc mọi nơi. Giáo viên sử dụng video, bài tập tương tác và thảo luận trực tuyến để nâng cao hiệu quả học tập.",

    "Các nghiên cứu về bệnh Alzheimer cho thấy tác dụng tích cực của các bài tập trí não và chế độ dinh dưỡng lành mạnh, giúp giảm tốc độ suy giảm trí nhớ ở người cao tuổi.",

    "Một tiểu thuyết trinh thám hiện đại dẫn dắt độc giả qua những tình tiết phức tạp, bí ẩn, kết hợp yếu tố tâm lý sâu sắc khiến người đọc luôn hồi hộp theo dõi diễn biến câu chuyện.",

    "Các nhà khoa học nghiên cứu gen người phát hiện những đột biến mới liên quan đến bệnh di truyền. Điều này giúp nâng cao khả năng chẩn đoán và điều trị.",
]

output_dir = "./output_audio"
os.makedirs(output_dir, exist_ok=True)

def main(backbone="pnnbao-ump/VieNeu-TTS-1000h", codec="neuphonic/neucodec"):
    """
    The sample directory contains .wav files and .txt transcripts with matching names.
    These are ready-made reference voices for testing:
    - Bình (nam miền Bắc) - Male, North accent
    - Tuyên (nam miền Bắc) - Male, North accent
    - Nguyên (nam miền Nam) - Male, South accent
    - Sơn (nam miền Nam) - Male, South accent
    - Vĩnh (nam miền Nam) - Male, South accent
    - Hương (nữ miền Bắc) - Female, North accent
    - Ly (nữ miền Bắc) - Female, North accent
    - Ngọc (nữ miền Bắc) - Female, North accent
    - Đoan (nữ miền Nam) - Female, South accent
    - Dung (nữ miền Nam) - Female, South accent
    
    Note: The model can clone any voice you provide (with corresponding text).
    However, quality may not match the sample files. For best results, finetune
    the model on your target voice. See finetune guide at:
    https://github.com/pnnbao-ump/VieNeuTTS/blob/main/finetune.ipynb
    """
    # Male voice (South accent)
    ref_audio_path = "./sample/Vĩnh (nam miền Nam).wav"
    ref_text_path = "./sample/Vĩnh (nam miền Nam).txt"
    
    # Female voice (South accent) - uncomment to use
    # ref_audio_path = "./sample/Đoan (nữ miền Nam).wav"
    # ref_text_path = "./sample/Đoan (nữ miền Nam).txt"

    # Read the reference transcript; the context manager closes the file promptly
    with open(ref_text_path, "r", encoding="utf-8") as f:
        ref_text_raw = f.read()
    
    if not ref_audio_path or not ref_text_raw:
        print("No reference audio or text provided.")
        return None

    # Initialize VieNeuTTS-1000h
    tts = VieNeuTTS(
        backbone_repo=backbone,
        backbone_device=device,
        codec_repo=codec,
        codec_device=device
    )

    print("Encoding reference audio...")
    ref_codes = tts.encode_reference(ref_audio_path)

    # Generate speech for all input texts
    for i, text in enumerate(input_texts, 1):
        print(f"Generating audio {i}/{len(input_texts)}: {text[:50]}...")
        wav = tts.infer(text, ref_codes, ref_text_raw)
        output_path = os.path.join(output_dir, f"output_{i}.wav")
        sf.write(output_path, wav, 24000)
        print(f"✓ Saved to {output_path}")

if __name__ == "__main__":
    main()

Gradio Demo

uv run gradio_app.py

Open your browser at http://127.0.0.1:7860.

Reference Voices

The sample/ directory contains pre-recorded reference voices:

File	Gender	Accent	Description
Bình (nam miền Bắc)	Male	North	Male voice, North accent
Tuyên (nam miền Bắc)	Male	North	Male voice, North accent
Nguyên (nam miền Nam)	Male	South	Male voice, South accent
Sơn (nam miền Nam)	Male	South	Male voice, South accent
Vĩnh (nam miền Nam)	Male	South	Male voice, South accent
Hương (nữ miền Bắc)	Female	North	Female voice, North accent
Ly (nữ miền Bắc)	Female	North	Female voice, North accent
Ngọc (nữ miền Bắc)	Female	North	Female voice, North accent
Đoan (nữ miền Nam)	Female	South	Female voice, South accent
Dung (nữ miền Nam)	Female	South	Female voice, South accent
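
Since each voice is a .wav file with a same-named .txt transcript, the available voices can be discovered by scanning the directory rather than hard-coding paths. The helper below is a sketch under that assumption; `find_reference_pairs` is an illustrative name and is not provided by the repository.

```python
from pathlib import Path

def find_reference_pairs(sample_dir="./sample"):
    """Map each voice name to its (wav_path, txt_path) when both files exist."""
    pairs = {}
    for wav_path in sorted(Path(sample_dir).glob("*.wav")):
        txt_path = wav_path.with_suffix(".txt")
        # Skip audio files that have no matching transcript.
        if txt_path.is_file():
            pairs[wav_path.stem] = (wav_path, txt_path)
    return pairs
```

The keys are the file stems, for example "Vĩnh (nam miền Nam)", so a voice can be selected by the name shown in the table above.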

Best Practices

Text length: Keep input ≤ 250 characters per inference call for optimal quality
Long text: For longer content, use examples/infer_long_text.py for proper sentence splitting
Reference audio: Use clean, 3–5 second clips with clear speech
Custom voices: Finetune the model for best results with your target voice
Normalization: The model handles Vietnamese text normalization automatically
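
The 3–5 second guideline for reference audio is easy to verify before encoding a clip. The sketch below uses only the standard-library `wave` module, which reads uncompressed PCM WAV files; other formats would need a different reader. The function names are illustrative, and the default bounds simply restate the guideline above.

```python
import wave

def clip_duration_seconds(wav_path):
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(str(wav_path), "rb") as wav_file:
        return wav_file.getnframes() / wav_file.getframerate()

def is_good_reference_length(wav_path, min_seconds=3.0, max_seconds=5.0):
    """Check that a reference clip falls inside the recommended duration range."""
    return min_seconds <= clip_duration_seconds(wav_path) <= max_seconds
```

This only checks length; it says nothing about noise or speech clarity, which still need to be judged by ear.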

Improvements Over VieNeu-TTS-140h

Feature	VieNeu-TTS-140h	VieNeu-TTS-1000h
Training data	140 hours	~1000 hours
Languages	Vietnamese only	Vietnamese + English
Pronunciation accuracy	Good	Excellent
Voice cloning fidelity	Good	Enhanced
Code-switching	Limited	Native support

Troubleshooting

Issue	Cause	Solution
Missing `libespeak`	System dependency	Install eSpeak NG
GPU OOM	VRAM too small	Use CPU mode or smaller batch size
Poor voice match	Low-quality reference	Use clear 3–5 second audio, consider finetuning
Import errors	Package not installed	`pip install vieneu-tts`
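
For the missing `libespeak` case, a quick check before loading the model is to ask the system loader whether an eSpeak shared library can be found. This is a minimal sketch using the standard library's `ctypes.util.find_library`; the library names tried here are assumptions, and a `None` result only suggests that eSpeak NG is not installed in a standard location.

```python
import ctypes.util

def find_espeak_library(names=("espeak-ng", "espeak"), finder=ctypes.util.find_library):
    """Return the first eSpeak shared library the system loader can locate, or None."""
    for name in names:
        found = finder(name)
        if found:
            return found
    return None

if __name__ == "__main__":
    library = find_espeak_library()
    print(library or "eSpeak library not found; install eSpeak NG first")
```

The `finder` parameter exists only so the lookup can be replaced in tests.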

License

Apache 2.0

Citation

@misc{vieneutts1000h2025,
  title        = {VieNeu-TTS-1000h: Vietnamese-English Bilingual Text-to-Speech with Instant Voice Cloning},
  author       = {Pham Nguyen Ngoc Bao},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/pnnbao-ump/VieNeu-TTS-1000h}}
}

Please also cite the base model:

@misc{neuttsair2025,
  title        = {NeuTTS Air: On-Device Speech Language Model with Instant Voice Cloning},
  author       = {Neuphonic},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/neuphonic/neutts-air}}
}

Safetensors

Model size: 0.6B params
Tensor type: BF16

Model tree for pnnbao-ump/VieNeu-TTS-1000h

Base model: neuphonic/neutts-air
Finetuned: pnnbao-ump/VieNeu-TTS