Qwen3-1.7B-Multilingual-TTS
Continued pretraining of Qwen/Qwen3-1.7B-Base on multilingual Voice Conversion and TTS.
- Uses neucodec as the speech detokenizer, 50 tokens per second, 24 kHz output sample rate (see the sketch after this list).
- Multi-speaker multilingual Voice Conversion, up to 25.5B tokens.
- Multi-speaker multilingual TTS, up to 5B tokens.
- Flash Attention 3 with 10k context length multipacking.
- Liger Kernel for swiglu, rms_norm and fused_linear_cross_entropy.
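A back-of-the-envelope sketch of what the 50 tokens per second and 24 kHz figures mean for output length; the helper name below is made up for illustration:

# Sketch only: at 50 speech tokens per second, the number of generated speech tokens
# maps directly to audio duration, and the decoded waveform is sampled at 24 kHz.
def estimated_audio_seconds(num_speech_tokens, tokens_per_second=50):
    return num_speech_tokens / tokens_per_second

print(estimated_audio_seconds(2048))                # 40.96 seconds for a full 2048-token generation
print(int(estimated_audio_seconds(2048) * 24000))   # 983040 samples at 24 kHz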
WandB at https://wandb.ai/huseinzol05/Qwen-Qwen3-1.7B-Base-multilingual-tts-neucodec
Training is still ongoing but currently paused, waiting for my own pocket money to burn.
How to
import soundfile as sf
import torch
import torchaudio
from transformers import AutoTokenizer, AutoModelForCausalLM
from neucodec import NeuCodec
import re

# Load the TTS language model and its tokenizer.
model_name = "malaysia-ai/Qwen3-1.7B-Multilingual-TTS"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto").to('cuda')

# Load the neucodec speech detokenizer that turns generated speech tokens back into audio.
codec = NeuCodec.from_pretrained("neuphonic/neucodec")
_ = codec.eval().to('cuda')
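Optionally, a minimal round-trip sanity check that the codec loads and runs; it reuses the same encode_code / decode_code calls as the snippets below, and sample.wav / sample-roundtrip.wav are hypothetical file names:

import librosa

# Hypothetical input file; any short mono speech clip works.
y, sr = librosa.load('sample.wav', sr=16000)
with torch.no_grad():
    codes = codec.encode_code(torch.tensor(y)[None, None])
# Rebuild the code tensor the same way the snippets below do, then decode back to a 24 kHz waveform.
audio_codes = torch.tensor([int(i) for i in codes[0, 0]])[None, None]
with torch.no_grad():
    recon = codec.decode_code(audio_codes.cuda())
sf.write('sample-roundtrip.wav', recon[0, 0].cpu(), 24000)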
TTS
text = "Hello! how come I help you? 你好!有什么可以帮你的吗?வணக்கம்! நான் உங்களுக்கு எப்படி உதவுவது? Bonjour! Comment puis-je vous aider ? Xin chào! Tôi có thể giúp gì cho bạn? こんにちは!どうしてお手伝いしましょうか?안녕하세요! 어떻게 도와드릴까요?"
prompt = f"<|im_start|>jenny_tts_dataset_audio_jenny: {text}<|speech_start|>"
inputs = tokenizer(prompt,return_tensors="pt", add_special_tokens=True).to('cuda')
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=2048,
do_sample=True,
temperature=0.6,
repetition_penalty=1.15,
)
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
# Extract the generated speech tokens <|s_123|> that follow <|speech_start|>.
audio_tokens = re.findall(r'<\|s_(\d+)\|>', generated_text.split('<|speech_start|>')[1])
audio_tokens = [int(token) for token in audio_tokens]
audio_codes = torch.tensor(audio_tokens)[None, None]
# Detokenize the speech tokens into a 24 kHz waveform.
with torch.no_grad():
    audio_waveform = codec.decode_code(audio_codes.cuda())
sf.write('7-languages.mp3', audio_waveform[0, 0].cpu(), 24000)
You can check the generated audio in 7-languages.mp3.
- You can pick any speaker name from malaysia-ai/Multilingual-TTS; a prompt-building helper is sketched after these notes.
- Not bad for a 0.35-epoch model.
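Based on the prompt format used above, a minimal sketch for building TTS prompts for other speakers; the helper name is made up for illustration, and the speaker name must come from malaysia-ai/Multilingual-TTS:

# Sketch only: reproduces the "<|im_start|>{speaker}: {text}<|speech_start|>" prompt used above.
def make_tts_prompt(speaker, text):
    return f"<|im_start|>{speaker}: {text}<|speech_start|>"

# The speaker below is the one used in the snippet above; swap in any other name from the dataset.
prompt = make_tts_prompt("jenny_tts_dataset_audio_jenny", "Hello! How can I help you?")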
Voice Conversion
import librosa

# Load the reference speech at 16 kHz and encode it into discrete speech tokens.
y, sr = librosa.load('jenny.wav', sr=16000)
with torch.no_grad():
    codes = codec.encode_code(torch.tensor(y)[None, None])
tokens = ''.join([f'<|s_{i}|>' for i in codes[0, 0]])
# Prompt format: reference transcript + reference speech tokens, then the new text to speak in the reference voice.
prompt = f"<|im_start|>I wonder if I shall ever be happy enough to have real lace on my clothes and bows on my caps.<|speech_start|>{tokens}<|im_end|><|im_start|>Hello, how come I help you, 你好, 有什么可以帮你的吗, வணக்கம், நான் உங்களுக்கு எப்படி உதவுவது, bonjour, comment puis-je vous aider.<|speech_start|>"
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to('cuda')
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=2048,
        do_sample=True,
        temperature=0.6,
        repetition_penalty=1.15,
    )
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=False)
# Take the speech tokens after the last <|speech_start|>, i.e. the newly generated speech.
audio_tokens = re.findall(r'<\|s_(\d+)\|>', generated_text.split('<|speech_start|>')[-1])
audio_tokens = [int(token) for token in audio_tokens]
audio_codes = torch.tensor(audio_tokens)[None, None]
with torch.no_grad():
    audio_waveform = codec.decode_code(audio_codes.cuda())
sf.write('jenny-4-languages.mp3', audio_waveform[0, 0].cpu(), 24000)
You can check the generated audio in jenny-4-languages.mp3; a reusable wrapper for this flow is sketched below.
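A minimal sketch that packages the voice-conversion steps above into one call, reusing the model, tokenizer and codec loaded earlier; the function name and the example file names are made up for illustration:

# Sketch only: same steps as the snippet above, wrapped in a function.
def voice_convert(reference_wav, reference_text, target_text, out_path):
    # Encode the reference speech into <|s_...|> tokens.
    y, _ = librosa.load(reference_wav, sr=16000)
    with torch.no_grad():
        codes = codec.encode_code(torch.tensor(y)[None, None])
    ref_tokens = ''.join([f'<|s_{i}|>' for i in codes[0, 0]])

    # Reference transcript + reference speech tokens, then the new text to speak.
    prompt = (f"<|im_start|>{reference_text}<|speech_start|>{ref_tokens}<|im_end|>"
              f"<|im_start|>{target_text}<|speech_start|>")
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=True).to('cuda')
    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=2048, do_sample=True,
                                 temperature=0.6, repetition_penalty=1.15)

    # Decode the newly generated speech tokens into a 24 kHz waveform.
    generated = tokenizer.decode(outputs[0], skip_special_tokens=False)
    ids = [int(t) for t in re.findall(r'<\|s_(\d+)\|>', generated.split('<|speech_start|>')[-1])]
    with torch.no_grad():
        waveform = codec.decode_code(torch.tensor(ids)[None, None].cuda())
    sf.write(out_path, waveform[0, 0].cpu(), 24000)

# Hypothetical usage:
# voice_convert('jenny.wav', 'I wonder if I shall ever be happy enough to have real lace on my clothes and bows on my caps.',
#               'Hello! How can I help you?', 'converted.wav')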
Source code
Source code at https://github.com/malaysia-ai/cooking/tree/main/qwen-tts