
# Qwen3-1.7B-Multilingual-TTS

Continued pretraining of Qwen/Qwen3-1.7B-Base on multilingual Voice Conversion and TTS.

  1. Uses neucodec as the speech detokenizer, at 50 tokens per second, with output at a 24 kHz sample rate.
  2. Multi-speaker multilingual Voice Conversion, up to 25.5B tokens.
  3. Multi-speaker multilingual TTS, up to 5B tokens.
  4. Flash Attention 3 with 10k context length multipacking.
  5. Liger Kernel for swiglu, rms_norm and fused_linear_cross_entropy.
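The token rate and sample rate above imply some simple arithmetic that is useful when budgeting `max_new_tokens` later on. This is my own back-of-envelope sketch, assuming exactly 50 speech tokens per second and 24 kHz output as stated:

```python
# Assumptions from the spec above: 50 speech tokens/s, 24 kHz output.
TOKENS_PER_SECOND = 50
SAMPLE_RATE = 24_000

# Each speech token covers a fixed slice of the waveform.
samples_per_token = SAMPLE_RATE // TOKENS_PER_SECOND  # 480 samples per token

def estimated_duration_seconds(num_tokens: int) -> float:
    """Rough clip duration implied by a generated speech-token count."""
    return num_tokens / TOKENS_PER_SECOND

print(samples_per_token)                # 480
print(estimated_duration_seconds(500))  # 10.0 seconds
```

At this rate, the `max_new_tokens=2048` used in the examples below caps a single generation at roughly 41 seconds of audio.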

WandB logs at https://wandb.ai/huseinzol05/Qwen-Qwen3-1.7B-Base-multilingual-tts-neucodec

Training is still in progress but currently paused, waiting for my own pocket money to burn.

## How to

```python
import soundfile as sf
import torch
import torchaudio
from transformers import AutoTokenizer, AutoModelForCausalLM
from neucodec import NeuCodec
import re

model_name = "malaysia-ai/Qwen3-1.7B-Multilingual-TTS"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto").to('cuda')

# neucodec converts the generated speech tokens back into a 24 kHz waveform.
codec = NeuCodec.from_pretrained("neuphonic/neucodec")
_ = codec.eval().to('cuda')
```

### TTS

```python
text = "Hello! how come I help you? 你好!有什么可以帮你的吗?வணக்கம்! நான் உங்களுக்கு எப்படி உதவுவது? Bonjour! Comment puis-je vous aider ? Xin chào! Tôi có thể giúp gì cho bạn? こんにちは!どうしてお手伝いしましょうか?안녕하세요! 어떻게 도와드릴까요?"

# The prompt conditions on a speaker name from the training data; the model
# continues with <|s_...|> speech tokens after <|speech_start|>.
prompt = f"<|im_start|>jenny_tts_dataset_audio_jenny: {text}<|speech_start|>"

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to('cuda')

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=0.6,
        repetition_penalty=1.15,
    )

# Extract the generated speech token ids and decode them back to audio.
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
audio_tokens = re.findall(r'<\|s_(\d+)\|>', generated_text.split('<|speech_start|>')[1])
audio_tokens = [int(token) for token in audio_tokens]
audio_codes = torch.tensor(audio_tokens)[None, None]

with torch.no_grad():
    audio_waveform = codec.decode_code(audio_codes.cuda())

sf.write('7-languages.mp3', audio_waveform[0, 0].cpu(), 24000)
```

You can listen to the generated audio in 7-languages.mp3.
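Because the codec runs at a fixed 50 tokens per second, the decoded waveform length can be sanity-checked against the number of extracted speech tokens. The helper below is my own sketch (not part of the released code), assuming the rates stated earlier; you could call it with `len(audio_tokens)` and `audio_waveform.shape[-1]` from the snippet above:

```python
# Sketch: at 50 tokens/s and 24 kHz output, expect ~480 samples per token.
def check_lengths(num_tokens: int, num_samples: int,
                  tokens_per_second: int = 50, sample_rate: int = 24_000,
                  tolerance: float = 0.1) -> bool:
    """True if the waveform length roughly matches the token count."""
    expected = num_tokens * sample_rate // tokens_per_second
    return abs(num_samples - expected) <= tolerance * expected

print(check_lengths(500, 240_000))  # True: 500 tokens ~ 10 s ~ 240k samples
print(check_lengths(500, 120_000))  # False: waveform is half the expected size
```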

### Voice Conversion

```python
import librosa

# Encode a 16 kHz reference recording into speech tokens.
y, sr = librosa.load('jenny.wav', sr=16000)
with torch.no_grad():
    # Move the waveform to CUDA to match the codec, which lives on CUDA above.
    codes = codec.encode_code(torch.tensor(y)[None, None].cuda())
tokens = ''.join([f'<|s_{i}|>' for i in codes[0, 0]])

# The reference text plus its speech tokens condition the speaker identity;
# the model then continues the second text in the same voice.
prompt = f"<|im_start|>I wonder if I shall ever be happy enough to have real lace on my clothes and bows on my caps.<|speech_start|>{tokens}<|im_end|><|im_start|>Hello, how come I help you, 你好, 有什么可以帮你的吗, வணக்கம், நான் உங்களுக்கு எப்படி உதவுவது, bonjour, comment puis-je vous aider.<|speech_start|>"

inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to('cuda')

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=0.6,
        repetition_penalty=1.15,
    )

# Keep only the tokens after the last <|speech_start|> (the generated speech,
# not the reference tokens in the prompt).
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
audio_tokens = re.findall(r'<\|s_(\d+)\|>', generated_text.split('<|speech_start|>')[-1])
audio_tokens = [int(token) for token in audio_tokens]
audio_codes = torch.tensor(audio_tokens)[None, None]

with torch.no_grad():
    audio_waveform = codec.decode_code(audio_codes.cuda())

sf.write('jenny-4-languages.mp3', audio_waveform[0, 0].cpu(), 24000)
```

You can listen to the generated audio in jenny-4-languages.mp3.
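The two examples differ only in how the prompt is assembled: plain TTS prefixes the text with a speaker name, while voice conversion prepends a reference text with its encoded speech tokens. The helpers below are my own sketch of those templates (the functions and the default speaker id are taken from the examples above, not from the released code; other valid speaker ids are not documented here):

```python
def tts_prompt(text: str, speaker: str = "jenny_tts_dataset_audio_jenny") -> str:
    # TTS: speaker-prefixed text, then <|speech_start|>; the model continues
    # with <|s_...|> speech tokens.
    return f"<|im_start|>{speaker}: {text}<|speech_start|>"

def vc_prompt(ref_text: str, ref_tokens: str, target_text: str) -> str:
    # Voice conversion: a completed (text, speech tokens) turn sets the voice,
    # then the target text is continued in that voice.
    return (f"<|im_start|>{ref_text}<|speech_start|>{ref_tokens}<|im_end|>"
            f"<|im_start|>{target_text}<|speech_start|>")

print(tts_prompt("Hello"))
# <|im_start|>jenny_tts_dataset_audio_jenny: Hello<|speech_start|>
print(vc_prompt("hi", "<|s_1|>", "bye"))
# <|im_start|>hi<|speech_start|><|s_1|><|im_end|><|im_start|>bye<|speech_start|>
```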

## Source code

Source code at https://github.com/malaysia-ai/cooking/tree/main/qwen-tts
