🎧 VocalNet-Qwen3-1.7B Model Card

VocalNet-Qwen3-1.7B is a high-performance, low-latency speech large language model (LLM) that supports both English and Mandarin and is optimized for real-time voice interaction.

The official repository for model training and inference will be released as open source as soon as possible.

🏆 VocalBench Performance

Model	Knowledge	Reasoning	Creativity	UTMOS	WER	Single-Round	Multi-Round	Instruction Following	Emotional Empathy	Safety	Robust	Overall
Mini-Omni (0.5B)	2.20	1.291	1.4725	4.435	19.571	1.645	-	0.00	5.428	81.25	84.14	40.646
Mini-Omni2 (0.5B)	4.65	1.501	1.8025	4.413	36.269	1.915	-	0.11	5.709	88.50	82.26	43.224
SLAM-Omni (0.5B)	12.05	1.875	2.5175	4.424	6.065	2.880	1.9800	3.11	6.452	90.25	77.91	54.649
VocalNet-1B (1B)	43.00	2.869	3.1800	4.437	5.123	3.335	3.2550	16.11	6.754	89.00	92.42	66.632
VocalNet-Qwen3-1.7B (1.7B)	45.65	3.712	3.3625	4.353	1.775	3.450	3.6325	31.89	7.000	82.75	91.47	72.152
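
To make the gap between the two VocalNet rows easier to read, the relative change per metric can be computed directly from the table. The snippet below is a minimal sketch: the numbers are copied from the VocalNet-1B and VocalNet-Qwen3-1.7B rows, while the names `scores`, `relative_change`, and `compare` are illustrative and not part of any VocalNet code. It assumes WER is a word error rate, so a negative change is an improvement for that column.

```python
# Selected scores copied from the VocalBench table (two rows, four columns).
scores = {
    "VocalNet-1B": {
        "Knowledge": 43.00,
        "WER": 5.123,
        "Instruction Following": 16.11,
        "Overall": 66.632,
    },
    "VocalNet-Qwen3-1.7B": {
        "Knowledge": 45.65,
        "WER": 1.775,
        "Instruction Following": 31.89,
        "Overall": 72.152,
    },
}


def relative_change(old: float, new: float) -> float:
    """Return the change from old to new as a fraction of old."""
    return (new - old) / old


def compare(baseline: str, candidate: str) -> dict[str, float]:
    """Relative change of every metric from the baseline row to the candidate row."""
    return {
        metric: relative_change(scores[baseline][metric], scores[candidate][metric])
        for metric in scores[baseline]
    }


if __name__ == "__main__":
    for metric, change in compare("VocalNet-1B", "VocalNet-Qwen3-1.7B").items():
        # WER is an error rate, so a negative change means fewer errors.
        print(f"{metric}: {change:+.1%}")
```

Relative change is used here only because the columns are on different scales; the Overall column in the table remains the benchmark's own aggregate.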

Model size: 5B params (Safetensors)
Tensor type: BF16

Model tree for VocalNet/VocalNet-Qwen3-1.7B

Base model: Qwen/Qwen3-1.7B-Base
Finetuned: Qwen/Qwen3-1.7B
Finetuned: this model