ValueError: Text contains too many audio placeholders. (Expected 1 placeholders)

#2
by taresh18 - opened

Hi, I just ran the example code on one of the samples from your benchmark dataset:

turns = [
    {
        "content": "Which phone brand do you use?",
        "role": "assistant"
    },
    {
        "content": "Do you mean my work phone or my personal phone?",
        "role": "user"
    },
    {
        "content": "Your personal phone.",
        "role": "assistant"
    },
    {
        "content": "Samsung",
        "role": "user"
    }
]

It throws this error:
ValueError: Text contains too many audio placeholders. (Expected 1 placeholders)

If my understanding is correct, I need to provide the complete chat context to the model along with the most recent audio buffer, is that right?

You need to remove the last turn in the turns list. The model expects the last turn to have the role "assistant".
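For example, the context from the snippet above would become the list below, with the spoken "Samsung" supplied to the model as the current audio segment rather than as a text turn:

```python
# Chat context up to and including the last assistant turn only.
# The user's most recent reply ("Samsung") is NOT a text turn here --
# it is passed separately as the current audio buffer.
turns = [
    {"content": "Which phone brand do you use?", "role": "assistant"},
    {"content": "Do you mean my work phone or my personal phone?", "role": "user"},
    {"content": "Your personal phone.", "role": "assistant"},
]
```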

@varmology Hey, thanks for the reply, that fixed the issue. Have you tried running this in prod? So we need the current turn's audio segment (as detected by VAD) plus the chat context up to the previous turn, right?

Also, I'm getting ~110 ms latency on an A40 GPU. Any recommendations you can share to decrease the latency?

Hey @taresh18,

I have not tried running this model in prod yet, but I have been benchmarking this code to check the feasibility of running it in a production environment.

Here are my benchmark results:

| Total Tokens | Batch 1 | Batch 2 | Batch 4 | Batch 8 | Batch 16 |
|---|---|---|---|---|---|
| 18  | 50.8  | 80.2  | 132.4 | 261.5 | 490.4  |
| 97  | 56.6  | 100.1 | 187.1 | 332.8 | 653.7  |
| 128 | 66.8  | 126.1 | 239.1 | 482.9 | 886.5  |
| 138 | 76.9  | 127.8 | 205.2 | 389.8 | 702.7  |
| 149 | 86.7  | 134.6 | 276.0 | 499.7 | 964.1  |
| 180 | 90.6  | 161.4 | 323.8 | 602.0 | 1165.2 |
| 201 | 95.5  | 163.5 | 289.9 | 553.6 | 1143.3 |
| 226 | 100.7 | 191.3 | 378.7 | 673.5 | 1313.0 |
| 251 | 109.7 | 225.6 | 432.4 | 795.4 | 1515.7 |

Note: all latency values are in milliseconds (ms); "Total Tokens" is the combined number of audio and text input tokens.

I used the ultraVAD model with torch.compile, and the benchmark results are averaged over multiple runs. The GPU is an H200. TBH, I was also confused to see such high latency from the model.
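For anyone who wants to reproduce a similar measurement, here is a minimal sketch of such a timing loop (the model object and its inputs are placeholders, not the exact benchmark script used above):

```python
import time
import torch

def avg_latency_ms(model, inputs, warmup=10, iters=50):
    """Average forward-pass latency in milliseconds, CUDA-synchronized."""
    compiled = torch.compile(model)        # first calls trigger compilation
    with torch.inference_mode():
        for _ in range(warmup):            # warm-up so compile time is excluded
            compiled(**inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            compiled(**inputs)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0
```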

Also, the model card says it is a 0.7B-parameter model, but in reality its preprocessor is Whisper large and the LLM backbone is LLaMA 8B. So what makes it 0.7B?
