Qwen-3 (0.6B) Embedding Model – High CUDA Memory Spikes with Misspelled Sentences

#38
by rgsurya - opened

Context:
We are evaluating the Qwen-3 0.6B embedding model for RAG. The model performs well on correctly spelled queries; however, during evaluation we observed unusual GPU memory spikes when queries contain spelling mistakes.

Since real-world systems often cannot guarantee spell correction at runtime, this behaviour poses serious reliability and stability risks in production environments.

Observations:
Correctly spelled queries - memory usage is stable, no spikes (screenshot: Spell Corrected.png).

Misspelled queries - GPU memory usage spikes sharply (screenshot: Miss_spellword.png).

In several cases, CUDA runs into OOM (out-of-memory) errors, leaving the service unresponsive for subsequent batch requests.

Potential production impact:
In real-time systems, users may not always provide spell-corrected input. This makes Qwen-3 embeddings risky for live deployments, as a single malformed query can cause system-level failures.

Environment:
Model: Qwen-3 0.6B Embedding
Framework: SentenceTransformers interface
Hardware: NVIDIA A10 (24GB), CUDA 12.x
Batch size: Tested with both single-query and batched inputs (same behavior observed)

Questions for the Community:
Is this a known issue with Qwen-3 embeddings (or other Qwen embedding models)?
Could this be due to subword/token expansion when encountering OOV (out-of-vocabulary) spellings?
Are there any recommended mitigations (e.g., preprocessing, model configs, smaller batch sizes)? A rough check for both points is sketched after this list.
Has anyone successfully used Qwen embeddings in noisy-query / spelling-error tolerant environments?
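For what it's worth, here is a minimal sketch of how I would test the token-expansion hypothesis and try the sequence-length/batch-size mitigations. It assumes the public Qwen/Qwen3-Embedding-0.6B checkpoint loaded through SentenceTransformers; the query strings, the 256-token cap, and batch_size=8 are made-up illustration values, not confirmed fixes.

```python
# Sketch: (1) check whether misspellings inflate the token count,
# (2) encode with a capped sequence length and a small batch size.
# The query strings below are invented examples, not the ones from this report.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", device="cuda")

pairs = [
    # (correctly spelled, misspelled) - hypothetical examples
    ("What is the reimbursement policy for international travel?",
     "Wht is teh reimbursment polcy for internatonal travell?"),
]

# 1) Token-expansion check: OOV spellings are often split into more subwords,
#    which lengthens the sequence the encoder has to process.
for correct, misspelled in pairs:
    n_correct = len(model.tokenizer(correct)["input_ids"])
    n_misspelled = len(model.tokenizer(misspelled)["input_ids"])
    print(f"tokens correct={n_correct}  misspelled={n_misspelled}")

# 2) Mitigation sketch: cap the sequence length and keep batches small so a
#    single long/odd query cannot allocate an unbounded amount of activation
#    memory. 256 and batch_size=8 are arbitrary starting points, not tuned values.
model.max_seq_length = 256
queries = [q for pair in pairs for q in pair]
embeddings = model.encode(queries, batch_size=8, convert_to_tensor=True)
print(embeddings.shape)
```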

Request: Any feedback, insights, or reproducible confirmations from other practitioners would be highly valuable.

I am using Qwen3-Embedding-0.6B for inference and found that increasing the batch size results in a non-linear increase in computation time.

I have observed large memory spikes on some (rare) documents too, and could not figure out what actually triggers them. The problem happens with both CUDA and MPS.
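One way to narrow down which documents trigger the spikes is to encode them one at a time and record the peak CUDA memory for each. A rough sketch below; it assumes a CUDA device and uses a placeholder document list, so it is a diagnostic idea rather than a confirmed procedure.

```python
# Sketch: encode documents one by one and log the peak CUDA memory each one
# causes, to identify which inputs trigger the spikes. CUDA-only; on MPS a
# different measurement would be needed (e.g. torch.mps.current_allocated_memory).
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", device="cuda")

documents = ["..."]  # placeholder: your own documents go here

for i, doc in enumerate(documents):
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    model.encode([doc], batch_size=1)
    peak_mb = torch.cuda.max_memory_allocated() / 1024**2
    n_tokens = len(model.tokenizer(doc)["input_ids"])
    print(f"doc {i}: {n_tokens} tokens, peak {peak_mb:.1f} MiB")
```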
