🎧 VocalNet-Qwen3-1.7B Model Card

VocalNet-Qwen3-1.7B is a high-performance, low-latency speech large language model (LLM) that supports both English and Mandarin and is optimized for real-time voice interaction.

The official repository for model training and inference will be open-sourced as soon as possible.

πŸ† VocalBench Performance

| Model | Knowledge | Reasoning | Creativity | UTMOS | WER | Single-Round | Multi-Round | Instruction Following | Emotional Empathy | Safety | Robust | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Mini-Omni (0.5B) | 2.20 | 1.291 | 1.4725 | 4.435 | 19.571 | 1.645 | - | 0.00 | 5.428 | 81.25 | 84.14 | 40.646 |
| Mini-Omni2 (0.5B) | 4.65 | 1.501 | 1.8025 | 4.413 | 36.269 | 1.915 | - | 0.11 | 5.709 | 88.50 | 82.26 | 43.224 |
| SLAM-Omni (0.5B) | 12.05 | 1.875 | 2.5175 | 4.424 | 6.065 | 2.880 | 1.9800 | 3.11 | 6.452 | 90.25 | 77.91 | 54.649 |
| VocalNet-1B (1B) | 43.00 | 2.869 | 3.1800 | 4.437 | 5.123 | 3.335 | 3.2550 | 16.11 | 6.754 | 89.00 | 92.42 | 66.632 |
| VocalNet-Qwen3-1.7B (1.7B) | 45.65 | 3.712 | 3.3625 | 4.353 | 1.775 | 3.450 | 3.6325 | 31.89 | 7.000 | 82.75 | 91.47 | 72.152 |
