Qwen-3 (0.6B) Embedding Model – High CUDA Memory Spikes with Misspelled Sentences
Context:
We are experimenting with the Qwen-3 0.6B embedding model for RAG. The model performs well on correctly spelled queries, but during evaluation we observed unusual GPU memory usage spikes when queries contain spelling mistakes.
Since real-world systems often cannot guarantee spell correction at runtime, this behaviour poses serious reliability and stability risks in production environments.
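A minimal reproduction sketch of the kind of check we are running (the Hugging Face model id and the example queries are placeholders, not our actual evaluation data):

```python
# Minimal repro sketch, assuming the Hugging Face model id "Qwen/Qwen3-Embedding-0.6B";
# the queries below are illustrative placeholders only.
import torch
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B", device="cuda")

def peak_mem_mb(query: str) -> float:
    """Encode one query and report the peak CUDA memory allocated (MB)."""
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    model.encode([query])
    return torch.cuda.max_memory_allocated() / 1e6

print("clean     :", peak_mem_mb("What is the capital of France?"))
print("misspelled:", peak_mem_mb("Whaat is teh capitall of Frrance??"))
```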
Observations:
Correctly spelled queries - Memory usage is stable, no spikes.
Misspelled queries - GPU memory usage spikes sharply.
In several cases, CUDA runs into OOM (Out of Memory) errors, leaving the process unresponsive to subsequent batch requests.
Why this can have production impact:
In real-time systems, users may not always provide spell-corrected input.
This makes the Qwen-3 embedding model risky for live deployments, as a single malformed query can cause system-level failures.
Environment:
Model: Qwen-3 0.6B Embedding
Framework: SentenceTransformers interface
Hardware: NVIDIA A10 (24GB), CUDA 12.x
Batch size: Tested with both single-query and batched inputs (same behavior observed)
Questions for the Community:
Is this a known issue with Qwen-3 embeddings (or other Qwen embedding models)?
Could this be due to subword/token expansion when encountering OOV (out-of-vocabulary) spellings? (A quick way to check this is sketched after this list.)
Are there any recommended mitigations (e.g., preprocessing, model configs, smaller batch size)?
Has anyone successfully used Qwen embeddings in noisy-query / spelling-error tolerant environments?
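For questions 2 and 3, this is roughly the kind of token-length check and crude mitigation we have in mind (a sketch only, assuming the SentenceTransformers wrapper exposes `model.tokenizer` and `model.max_seq_length`; the token cap, batch size, and example queries are arbitrary placeholders):

```python
# Sketch: check whether misspellings expand into many more subword tokens,
# then cap sequence length / batch size as a crude mitigation.
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")  # assumed HF model id

def token_count(text: str) -> int:
    return len(model.tokenizer(text)["input_ids"])

clean = "What is the capital of France?"
noisy = "Whaat is teh capitall of Frrance??"
print("clean tokens:", token_count(clean))
print("noisy tokens:", token_count(noisy))  # OOV spellings often split into more subwords

# Possible mitigations: hard-cap the sequence length and keep batches small.
model.max_seq_length = 64          # arbitrary cap to truncate pathological inputs
embedding = model.encode([noisy], batch_size=1)
```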
Request: Any feedback, insights, or reproducible confirmations from other practitioners would be highly valuable.
I am using Qwen3-Embedding-0.6B for inference, and I found that increasing the batch size results in a non-linear increase in computation time.
I have observed large memory spikes on some (rare) documents too, but I could not figure out what actually triggers them. The problem happens with both CUDA and MPS.
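In case it helps, this is roughly the script I would use to narrow it down (the model id and `docs` are placeholders; the batch sizes and top-10 cutoff are arbitrary):

```python
# Sketch: time encode() at a few batch sizes and rank documents by subword count
# to see whether the rare spikes correlate with unusually long token sequences.
import time
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("Qwen/Qwen3-Embedding-0.6B")  # assumed HF model id
docs = ["..."]  # replace with the documents being embedded

for bs in (1, 8, 32, 128):
    start = time.perf_counter()
    model.encode(docs, batch_size=bs)
    print(f"batch_size={bs}: {time.perf_counter() - start:.2f}s")

# The longest documents by token count are the first suspects for memory spikes.
def n_tokens(doc: str) -> int:
    return len(model.tokenizer(doc)["input_ids"])

for doc in sorted(docs, key=n_tokens, reverse=True)[:10]:
    print(n_tokens(doc), repr(doc[:80]))
```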