---
license: mit
pipeline_tag: any-to-any
language:
- zh
- en
---

# InteractiveOmni

<p align="center">
InteractiveOmni-4B <a href="https://huggingface.co/sensefvg/InteractiveOmni-4B">🤗</a>&nbsp; | InteractiveOmni-8B <a href="https://huggingface.co/sensefvg/InteractiveOmni-8B">🤗</a>&nbsp; | 📑 <a href="https://arxiv.org/abs/2510.13747">Paper</a>
</p>


## Introduction
InteractiveOmni is a unified omni-modal model that can simultaneously accept image, audio, text, and video inputs and directly generate coherent text and speech streams, achieving truly integrated interaction.

The schematic diagram below illustrates multi-turn audio-visual interaction.
<p align="center">
<img src="https://raw.github.com/SenseTime-FVG/InteractiveOmni/main/assets/demo_interaction.png" width="99%"/>
</p>

### Key Features
* **Strong Performance Across Modalities:** Exhibits omni-modal understanding and speech generation capabilities, outperforming similarly sized vision-language, audio-language, and omni-modal models.
* **State-of-the-Art Performance:** Achieves SOTA results on open-source benchmarks for image, audio, and video understanding, as well as speech conversation.
* **Excellent Interactive Performance:** Delivers a more intelligent audio-visual experience with multi-turn and long-term memory capabilities.
* **Multi-turn Interactive Benchmarks:** Proposes multi-modal, multi-turn benchmarks to evaluate the multi-turn memory and speech interaction of leading MLLMs.
* **On-device Model:** The 4B model achieves 97% of the 8B model's performance with only 50% of its size.

### Model Architecture
<p align="center">
<img src="https://raw.github.com/SenseTime-FVG/InteractiveOmni/main/assets/model_architecture.png" width="80%"/>
</p>


## Quickstart
### Get the Code
```bash
git clone https://github.com/SenseTime-FVG/InteractiveOmni.git
cd InteractiveOmni
pip install -r requirements.txt
```

We provide example code for running `InteractiveOmni` with 🤗 `Transformers`.

> Please use `transformers>=4.51.0` and FlashAttention2 to ensure the model works as expected.
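A minimal environment sketch, assuming a CUDA-capable GPU. Apart from the `transformers>=4.51.0` pin quoted above, the package name and flag here (`flash-attn`, `--no-build-isolation`) are the usual way to install FlashAttention2, not something taken from this repository's `requirements.txt`:

```bash
# Environment sketch (assumption: CUDA GPU and toolkit available; defer to requirements.txt).
pip install "transformers>=4.51.0"
# FlashAttention2 ships as the flash-attn package; it is compiled against your CUDA toolkit.
pip install flash-attn --no-build-isolation
# Quick sanity check of the installed transformers version.
python -c "import transformers; print(transformers.__version__)"
```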

### Model Loading
```python
import torch
from transformers import AutoTokenizer, AutoModel

path = "sensefvg/InteractiveOmni-8B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).eval().cuda()
```
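The on-device `InteractiveOmni-4B` checkpoint linked above should load the same way; this sketch only swaps the repository id and is otherwise identical to the snippet above:

```python
import torch
from transformers import AutoModel

# Same loading pattern, pointing at the smaller on-device checkpoint.
path = "sensefvg/InteractiveOmni-4B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).eval().cuda()
```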

### Inference with Transformers

```python
import torch
from transformers import AutoModel, AutoTokenizer
import torchaudio

path = "sensefvg/InteractiveOmni-8B"
model = AutoModel.from_pretrained(
    path,
    torch_dtype=torch.bfloat16,
    trust_remote_code=True).eval().cuda()
tokenizer = AutoTokenizer.from_pretrained(path, trust_remote_code=True, use_fast=True)

# set the max number of tiles in `max_num`
max_num = 12
frame = 8
generation_config = dict(max_new_tokens=1024, do_sample=True)

# pure-text conversation
messages = [
    {
        'role': "user",
        'content': 'Hello, who are you?',
    }
]
response = model.chat(tokenizer, generation_config, messages)

# audio conversation
messages = [
    {
        'role': "user",
        'content': [
            {
                "type": "audio",
                "audio": "assets/hello_en.wav"
            }
        ]
    }
]
response = model.chat(tokenizer, generation_config, messages)

# audio conversation, generate both audio and text output
messages = [
    {
        'role': "user",
        'content': [
            {
                "type": "audio",
                "audio": "assets/hello_zh.wav"
            }
        ]
    }
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True)
torchaudio.save("result.wav", wav_response.cpu(), 24000, format="wav")

# image-text conversation
messages = [
    {
        'role': "user",
        'content': [
            {
                "type": "image",
                "image": 'assets/cat_cup.jpeg'
            },
            {
                "type": "text",
                "text": "Please describe the image shortly."
            }
        ]
    }
]
response = model.chat(tokenizer, generation_config, messages, max_num)

# image-audio conversation
messages = [
    {
        'role': "user",
        'content': [
            {
                "type": "image",
                "image": 'assets/cat_cup.jpeg'
            },
            {
                "type": "audio",
                "audio": "assets/describe_img_en.wav"
            }
        ]
    }
]
response = model.chat(tokenizer, generation_config, messages, max_num)

# image-audio conversation, generate both audio and text output
messages = [
    {
        'role': "user",
        'content': [
            {
                "type": "image",
                "image": 'assets/cat_cup.jpeg'
            },
            {
                "type": "audio",
                "audio": "assets/describe_img_en.wav"
            }
        ]
    }
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True)
torchaudio.save("result.wav", wav_response.cpu(), 24000, format="wav")

# video conversation
messages = [
    {
        'role': "user",
        'content': [
            {
                "type": "video",
                "video": 'video_path'
            },
            {
                "type": "text",
                "text": "Describe this video in detail."
            }
        ]
    }
]
response = model.chat(tokenizer, generation_config, messages, max_num, frame)
```
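For multi-turn use, one presumably keeps the conversation history in `messages` and appends each reply before the next user turn. The sketch below assumes `model.chat` accepts `assistant` turns with plain-text content; this has not been verified against the repository:

```python
# Hypothetical multi-turn continuation: feed the earlier exchange back as history.
messages = [
    {
        'role': "user",
        'content': [
            {"type": "image", "image": 'assets/cat_cup.jpeg'},
            {"type": "text", "text": "Please describe the image shortly."}
        ]
    },
    {
        'role': "assistant",
        'content': response  # text reply returned by the previous model.chat call
    },
    {
        'role': "user",
        'content': [
            {"type": "text", "text": "What color is the cup in the image?"}
        ]
    }
]
follow_up = model.chat(tokenizer, generation_config, messages, max_num)
```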

### Use audio output
* If audio output is needed, the system prompt must be set as follows; otherwise, the audio output may not work as expected.
```
You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech.
```
```python
messages = [
    {
        "role": "system",
        "content": "You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech."
    },
    {
        'role': "user",
        'content': [
            {
                "type": "audio",
                "audio": "assets/hello_zh.wav",
            }
        ]
    }
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True)
torchaudio.save("result_none_speaker.wav", wav_response.cpu(), 24000, format="wav")
```
* Use the default speaker to generate output audio.
```python
messages = [
    {
        "role": "system",
        "content": "You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech."
    },
    {
        'role': "user",
        'content': [
            {
                "type": "audio",
                "audio": "assets/hello_zh.wav",
            }
        ]
    }
]
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True, speaker_embedding=model.default_speaker_embedding)
torchaudio.save("result_default_speaker.wav", wav_response.cpu(), 24000, format="wav")
```
* Use a custom speaker to generate output audio, similar to voice cloning.
```python
messages = [
    {
        "role": "system",
        "content": "You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech."
    },
    {
        'role': "user",
        'content': [
            {
                "type": "audio",
                "audio": "assets/hello_zh.wav",
            }
        ]
    }
]
speaker_embedding = model.extract_speaker_embedding("assets/hello_zh.wav")
response, wav_response = model.chat(tokenizer, generation_config, messages, generate_audio=True, speaker_embedding=speaker_embedding)
torchaudio.save("result_custom_speaker.wav", wav_response.cpu(), 24000, format="wav")
```
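Because `extract_speaker_embedding` returns a reusable embedding, a natural pattern is to extract it once from a reference clip and pass the same embedding on every call so that the generated voice stays consistent across a session. A minimal sketch under that assumption, continuing from the setup above (`model`, `tokenizer`, `generation_config`, `torchaudio`):

```python
# Extract the reference voice once, then reuse it for each reply in a session.
speaker_embedding = model.extract_speaker_embedding("assets/hello_zh.wav")

for turn, user_audio in enumerate(["assets/hello_zh.wav", "assets/describe_img_en.wav"]):
    messages = [
        {
            "role": "system",
            "content": "You are a highly advanced multimodal conversational AI designed for human-like interaction. You can perceive auditory, visual, speech, and textual inputs, and generate text and speech."
        },
        {
            'role': "user",
            'content': [{"type": "audio", "audio": user_audio}]
        }
    ]
    response, wav_response = model.chat(tokenizer, generation_config, messages,
                                        generate_audio=True, speaker_embedding=speaker_embedding)
    torchaudio.save(f"turn_{turn}.wav", wav_response.cpu(), 24000, format="wav")
```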

## Evaluation
InteractiveOmni achieves state-of-the-art performance across a wide range of multi-modal understanding and speech generation benchmarks.
<p align="center">
<img src="https://raw.github.com/SenseTime-FVG/InteractiveOmni/main/assets/radar_chart.png" width="70%"/>
</p>

<details>
<summary>Image Understanding</summary>

| Model | MMBench | MMStar | MMMU | MathVista | HallusionBench | AI2D | OCRBench | Avg |
|---|---|---|---|---|---|---|---|---|
| **Vision-Language Model** | | | | | | | | |
| InternVL3-8B | 82.1 | 68.7 | 62.2 | 70.5 | 49.0 | 85.1 | 88.4 | 72.3 |
| InternVL3.5-8B | 79.5 | 69.3 | 73.4 | 78.4 | 54.5 | 84.0 | 84.0 | 74.7 |
| Qwen2.5-VL-7B | 82.2 | 64.1 | 58.0 | 68.1 | 51.9 | 84.3 | 88.8 | 71.1 |
| **Omni Model** | | | | | | | | |
| GPT-4o-mini | 76.0 | 54.8 | 60.0 | 52.5 | 46.1 | 77.8 | 78.5 | 63.7 |
| VITA-1.5 | 76.8 | 60.2 | 52.6 | 66.2 | 44.6 | 79.2 | 74.1 | 64.8 |
| Ming-Lite-Omni | 80.8 | 64.7 | 56.3 | 71.6 | 55.0 | 83.1 | 88.4 | 71.4 |
| Qwen2.5-Omni-7B | 81.3 | 64.0 | 59.2 | 67.9 | 47.4 | 83.2 | 83.4 | 69.5 |
| InteractiveOmni-4B | 78.9 | 62.6 | 61.1 | 61.7 | 52.2 | 83.8 | 80.0 | 68.6 |
| InteractiveOmni-8B | **81.4** | **66.8** | **66.9** | 68.0 | **61.3** | **84.3** | 83.7 | **73.2** |

</details>

<details>
<summary>Video Understanding</summary>

| Model | Video-MME (wo sub) | Video-MME (w sub) | MLVU (M-Avg) | LongVideoBench (val total) | Avg |
|---|---|---|---|---|---|
| **Vision-Language Model** | | | | | |
| InternVL3-8B | **66.3** | 68.9 | 71.4 | 58.8 | 66.4 |
| InternVL3.5-8B | 66.0 | 68.6 | 70.2 | 62.1 | 66.7 |
| Qwen2.5-VL-7B | 65.1 | 71.6 | 70.2 | 56.0 | 64.5 |
| **Omni Model** | | | | | |
| GPT-4o-mini | 64.8 | - | - | - | - |
| Qwen2.5-Omni-7B | 64.3 | **72.4** | - | - | - |
| InteractiveOmni-4B | 63.3 | 69.3 | 68.0 | 57.0 | 64.4 |
| InteractiveOmni-8B | 66.0 | 71.8 | **71.6** | 59.1 | **67.1** |

</details>

<details>
<summary>Audio Understanding</summary>

| Dataset | Qwen2-Audio | Step-Audio-Chat | Kimi-Audio | Qwen2.5-Omni-7B | InteractiveOmni-4B | InteractiveOmni-8B |
|---|---|---|---|---|---|---|
| **ASR (WER)** | | | | | | |
| Wenetspeech *test-net* | 10.60 | 8.75 | 5.37 | 5.90 | 5.40 | **5.04** |
| Wenetspeech *test-meeting* | 10.68 | 9.52 | 6.28 | 7.70 | 6.95 | **5.55** |
| LibriSpeech *test-clean* | 1.60 | 3.19 | **1.28** | 1.80 | 1.73 | 1.64 |
| LibriSpeech *test-other* | 3.60 | 10.67 | **2.42** | 3.40 | 3.69 | 3.41 |
| Aishell-2 IOS | 4.48 | 3.57 | 2.56 | 2.56 | 2.85 | **2.18** |
| ChildMandarin | 14.62 | - | - | 19.34 | 17.21 | **14.03** |
| **Audio Understanding** | | | | | | |
| MMAU | 56.60 | - | 65.20 | 65.60 | **72.00** | 67.39 |
| MELD | 55.30 | 33.54 | **59.13** | 57.00 | 57.16 | 57.55 |
| ClothoAQA *dev* | 72.63 | 44.98 | **73.18** | 73.12 | 71.91 | 72.98 |
| ClothoAQA *test* | 71.73 | 45.84 | 71.24 | 72.86 | 71.28 | **74.49** |

</details>

<details>
<summary>Omni-modal Understanding</summary>

| Model | Speech | Sound Event | Music | Avg |
|---|---|---|---|---|
| **OmniBench** | | | | |
| MiniCPM-o-2.6 | - | - | - | 40.50 |
| Baichuan-Omni-1.5 | - | - | - | 42.90 |
| Qwen2.5-Omni-7B | 55.25 | 60.00 | 52.83 | 56.13 |
| InteractiveOmni-4B | **60.70** | 61.51 | 42.45 | 59.19 |
| InteractiveOmni-8B | 60.18 | **62.64** | **55.66** | **60.33** |

</details>

<details>
<summary>Speech-to-text</summary>

**OpenAudioBench**

| Model | Reasoning QA | Llama Questions | Web Questions | TriviaQA | AlpacaEval | Avg |
|---|---|---|---|---|---|---|
| Qwen2-Audio | 42.77 | 69.67 | 45.20 | 40.30 | 57.19 | 51.03 |
| GLM-4-Voice | 47.43 | 76.00 | 55.40 | 51.80 | 57.89 | 57.70 |
| VITA-1.5 | 41.00 | 74.20 | 57.30 | 46.80 | 68.20 | 57.50 |
| Step-Audio-chat | 60.00 | 72.33 | **73.00** | 56.80 | 56.53 | 63.73 |
| Baichuan-Audio | 41.90 | 78.40 | 64.50 | 61.70 | 77.40 | 64.78 |
| Kimi-Audio | 58.02 | 79.33 | 70.20 | 62.10 | 75.73 | 69.08 |
| MiniCPM-o-2.6 | 38.60 | 77.80 | 68.60 | 61.90 | 51.80 | 59.74 |
| Baichuan-Omni-1.5 | 50.00 | 78.50 | 59.10 | 57.20 | **77.90** | 64.54 |
| Qwen2.5-Omni-7B | 63.76 | 75.33 | 62.80 | 57.06 | 72.76 | 66.34 |
| InteractiveOmni-4B | 69.11 | 79.33 | 65.80 | 56.40 | 74.87 | 69.10 |
| InteractiveOmni-8B | **71.68** | **80.67** | 70.30 | **66.50** | 74.57 | **72.74** |

**VoiceBench**

| Model | AlpacaEval | CommonEval | WildVoice | SD-QA | MMSU | OpenBookQA | IFEval | BBH | AdvBench | Avg |
|---|---|---|---|---|---|---|---|---|---|---|
| Qwen2-Audio | 3.69 | 3.40 | 3.01 | 35.35 | 35.43 | 49.01 | 54.70 | 22.57 | 98.85 | 55.32 |
| GLM-4-Voice | 4.06 | 3.48 | 3.18 | 43.31 | 40.11 | 52.97 | 52.80 | 24.91 | 88.08 | 57.40 |
| VITA-1.5 | 4.21 | 3.66 | 3.48 | 38.88 | 52.15 | 71.65 | 55.30 | 38.14 | 97.69 | 64.53 |
| Step-Audio-chat | 3.99 | 2.99 | 2.93 | 46.84 | 28.72 | 31.87 | 50.60 | 29.19 | 65.77 | 50.13 |
| Baichuan-Audio | 4.41 | 4.08 | 3.92 | 45.84 | 53.19 | 71.65 | 54.80 | 50.31 | 99.42 | 69.27 |
| Kimi-Audio | 4.46 | 3.97 | 4.20 | **63.12** | 62.17 | 83.52 | 69.70 | **61.10** | **100.0** | **76.91** |
| MiniCPM-o-2.6 | 4.42 | 4.15 | 3.94 | 50.72 | 54.78 | 78.02 | 60.40 | 49.25 | 97.69 | 71.23 |
| Baichuan-Omni-1.5 | 4.50 | 4.05 | 4.06 | 43.40 | 57.25 | 74.51 | 62.70 | 54.54 | 97.31 | 71.32 |
| Qwen2.5-Omni-7B | 4.50 | 3.84 | 3.89 | 56.40 | 61.32 | 80.90 | 66.70 | 53.50 | 99.20 | 73.60 |
| InteractiveOmni-4B | 4.27 | 4.20 | 3.94 | 41.41 | 63.24 | 82.64 | 55.90 | 60.90 | 99.62 | 73.10 |
| InteractiveOmni-8B | **4.61** | **4.34** | **4.21** | 44.67 | **65.26** | **86.37** | **73.30** | 57.99 | 99.42 | 76.69 |

</details>

<details>
<summary>Speech Generation</summary>

| Model | test-zh | test-en | test-zh-hard |
|---|---|---|---|
| **TTS Model** | | | |
| MaskGCT | 2.27 | 2.62 | 10.27 |
| SeedTTS | 1.12 | 2.25 | 7.59 |
| CosyVoice 2 | 1.45 | 2.57 | 6.83 |
| **MLLM** | | | |
| MinMo | 2.48 | 2.90 | - |
| Ming-Lite-Omni | 1.69 | 4.31 | - |
| Qwen2.5-Omni-7B | 1.70 | 2.72 | 7.97 |
| InteractiveOmni-4B | **1.37** | 3.73 | 8.02 |
| InteractiveOmni-8B | 1.56 | **2.33** | **7.92** |

</details>


## Citation
If you find our paper and code useful in your research, please cite our technical report.
```bibtex
@misc{tong2025interactiveomniunifiedomnimodalmodel,
      title={InteractiveOmni: A Unified Omni-modal Model for Audio-Visual Multi-turn Dialogue},
      author={Wenwen Tong and Hewei Guo and Dongchuan Ran and Jiangnan Chen and Jiefan Lu and Kaibin Wang and Keqiang Li and Xiaoxu Zhu and Jiakui Li and Kehan Li and Xueheng Li and Lumin Li and Chenxu Guo and Jiasheng Zhou and Jiandong Chen and Xianye Wu and Jiahao Wang and Silei Wu and Lei Chen and Hanming Deng and Yuxuan Song and Dinghao Zhou and Guiping Zhong and Ken Zheng and Shiyin Kang and Lewei Lu},
      year={2025},
      eprint={2510.13747},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.13747},
}
```