ValueError: Text contains too many audio placeholders. (Expected 1 placeholders)

#2
by taresh18 - opened

Hi, I just ran the example code on one of the samples from your benchmark dataset:

turns = [
    {
        "content": "Which phone brand do you use?",
        "role": "assistant"
    },
    {
        "content": "Do you mean my work phone or my personal phone?",
        "role": "user"
    },
    {
        "content": "Your personal phone.",
        "role": "assistant"
    },
    {
        "content": "Samsung",
        "role": "user"
    }
]

It throws this error:
ValueError: Text contains too many audio placeholders. (Expected 1 placeholders)

If my understanding is correct, I need to provide the complete chat context to the model along with the most recent audio buffer, is that right?

You need to remove the last turn in the turns list. The model expects the last turn to have the role "assistant".
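For example, the context from the snippet above would become the list below, with the spoken "Samsung" supplied to the model as the current audio segment rather than as a text turn:

```python
# Chat context up to and including the last assistant turn only.
# The user's most recent reply ("Samsung") is NOT a text turn here --
# it is passed separately as the current audio buffer.
turns = [
    {"content": "Which phone brand do you use?", "role": "assistant"},
    {"content": "Do you mean my work phone or my personal phone?", "role": "user"},
    {"content": "Your personal phone.", "role": "assistant"},
]
```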

@varmology Hey, thanks for the reply, that fixed the issue. Have you tried running this in prod? So we need the current turn's audio segment (as detected by VAD) plus the chat context up to the previous turn, right?

Also, I'm getting ~110 ms latency on an A40 GPU. Any recommendations you can share to decrease the latency?

Hey @taresh18,

I have not tried running this model in prod yet, but I have been benchmarking this code to check the feasibility of running it in a production environment.

Here are my benchmark results:

| Total Tokens | Batch 1 | Batch 2 | Batch 4 | Batch 8 | Batch 16 |
|---|---|---|---|---|---|
| 18  | 50.8  | 80.2  | 132.4 | 261.5 | 490.4  |
| 97  | 56.6  | 100.1 | 187.1 | 332.8 | 653.7  |
| 128 | 66.8  | 126.1 | 239.1 | 482.9 | 886.5  |
| 138 | 76.9  | 127.8 | 205.2 | 389.8 | 702.7  |
| 149 | 86.7  | 134.6 | 276.0 | 499.7 | 964.1  |
| 180 | 90.6  | 161.4 | 323.8 | 602.0 | 1165.2 |
| 201 | 95.5  | 163.5 | 289.9 | 553.6 | 1143.3 |
| 226 | 100.7 | 191.3 | 378.7 | 673.5 | 1313.0 |
| 251 | 109.7 | 225.6 | 432.4 | 795.4 | 1515.7 |

Note: all latency values are in milliseconds (ms); "Total Tokens" is the combined number of audio and text input tokens.

I used the ultraVAD model with torch.compile, and the benchmark results are averaged over multiple runs. The GPU is an H200. TBH, I was also confused to see such high latency from the model.
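For anyone who wants to reproduce a similar measurement, here is a minimal sketch of such a timing loop (the model object and its inputs are placeholders, not the exact benchmark script used above):

```python
import time
import torch

def avg_latency_ms(model, inputs, warmup=10, iters=50):
    """Average forward-pass latency in milliseconds, CUDA-synchronized."""
    compiled = torch.compile(model)        # first calls trigger compilation
    with torch.inference_mode():
        for _ in range(warmup):            # warm-up so compile time is excluded
            compiled(**inputs)
        torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(iters):
            compiled(**inputs)
        torch.cuda.synchronize()
    return (time.perf_counter() - start) / iters * 1000.0
```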

Also, the model card says it is a 0.7B-parameter model, but in reality its preprocessor is Whisper large and the LLM backbone is LLaMA 8B. So what makes it 0.7B?
