# n24q02m/Qwen3-Embedding-0.6B-ONNX

ONNX-optimized version of Qwen/Qwen3-Embedding-0.6B for use with `qwen3-embed` and `fastembed` (PR #605).
## Available Variants

| Variant | File | Size | Description |
|---|---|---|---|
| INT8 | `onnx/model_quantized.onnx` | 572 MB | Dynamic INT8 quantization (default) |
| Q4F16 | `onnx/model_q4f16.onnx` | 517 MB | INT4 weights + FP16 activations |
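If you only need the raw ONNX file (for example, to run it with `onnxruntime` directly), it can be fetched with `huggingface_hub`. A minimal sketch, assuming the file layout in the table above:

```python
from huggingface_hub import hf_hub_download

# Download the INT8 variant; swap the filename for onnx/model_q4f16.onnx
path = hf_hub_download(
    repo_id="n24q02m/Qwen3-Embedding-0.6B-ONNX",
    filename="onnx/model_quantized.onnx",
)
print(path)  # local cache path to the .onnx file
```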
## Usage

### qwen3-embed

```bash
pip install qwen3-embed
```

```python
from qwen3_embed import TextEmbedding

# INT8 (default)
model = TextEmbedding("n24q02m/Qwen3-Embedding-0.6B-ONNX")
embeddings = list(model.embed(["Hello world"]))  # 1024-dim

# MRL: reduce the output dimension
embeddings_256 = list(model.embed(["Hello world"], dim=256))  # 256-dim

# Query embedding with the instruction prefix applied
query_emb = list(model.query_embed("What is machine learning?"))

# Q4F16 (smaller, slightly less accurate)
model_q4 = TextEmbedding("n24q02m/Qwen3-Embedding-0.6B-ONNX-Q4F16")
```
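For retrieval, a typical pattern is to embed the query with `query_embed` (which applies the instruction prefix) and the documents with `embed`, then rank by cosine similarity. A minimal sketch using numpy; it normalizes explicitly rather than assuming the vectors come back unit-length:

```python
import numpy as np
from qwen3_embed import TextEmbedding

model = TextEmbedding("n24q02m/Qwen3-Embedding-0.6B-ONNX")

docs = [
    "Machine learning builds models from data.",
    "Paris is the capital of France.",
]
doc_embs = np.array(list(model.embed(docs)))
query_emb = np.array(list(model.query_embed("What is machine learning?")))[0]

# Cosine similarity: normalize, then take dot products
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)
query_emb /= np.linalg.norm(query_emb)
scores = doc_embs @ query_emb
print(docs[int(np.argmax(scores))])  # best-matching document
```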
### fastembed

```bash
pip install fastembed
```

```python
from fastembed import TextEmbedding

# INT8 (default)
model = TextEmbedding("Qwen/Qwen3-Embedding-0.6B")
embeddings = list(model.embed(["Hello world"]))

# Q4F16
model_q4 = TextEmbedding("Qwen/Qwen3-Embedding-0.6B-Q4F16")
```
Note: `fastembed` support requires PR #605. Until it is merged, install from the fork:

```bash
pip install git+https://github.com/n24q02m/fastembed.git@feat/qwen3-support
```
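To confirm the installed build actually registered the model, you can check fastembed's supported-model list. A quick sketch, assuming the fork keeps the upstream `list_supported_models` API (the entry type has varied between dicts and description objects across fastembed releases, so this handles both):

```python
from fastembed import TextEmbedding

# Print every registered model whose name mentions Qwen3
for entry in TextEmbedding.list_supported_models():
    name = entry["model"] if isinstance(entry, dict) else entry.model
    if "Qwen3" in name:
        print(name)  # expect Qwen/Qwen3-Embedding-0.6B and the Q4F16 variant
```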
## Conversion Details

- Source: Qwen/Qwen3-Embedding-0.6B
- ONNX opset: 21
- INT8: `onnxruntime.quantization.quantize_dynamic` (QInt8)
- Q4F16: `MatMulNBitsQuantizer` (block_size=128, symmetric) + FP16 cast
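A rough sketch of the two quantization paths listed above, assuming an already-exported FP32 model at `onnx/model.onnx` (a hypothetical path). The `MatMulNBitsQuantizer` import location and the FP16 cast via `onnxconverter-common` match recent onnxruntime versions and may differ in yours:

```python
import onnx
from onnxruntime.quantization import QuantType, quantize_dynamic
from onnxruntime.quantization.matmul_nbits_quantizer import MatMulNBitsQuantizer

SRC = "onnx/model.onnx"  # assumed FP32 export

# INT8: dynamic quantization, no calibration data needed
quantize_dynamic(SRC, "onnx/model_quantized.onnx", weight_type=QuantType.QInt8)

# Q4F16 step 1: INT4 weight quantization (MatMulNBits nodes)
quant = MatMulNBitsQuantizer(onnx.load(SRC), block_size=128, is_symmetric=True)
quant.process()
quant.model.save_model_to_file("onnx/model_q4.onnx")

# Q4F16 step 2: cast the remaining FP32 tensors to FP16
from onnxconverter_common import float16  # pip install onnxconverter-common

model_fp16 = float16.convert_float_to_float16(onnx.load("onnx/model_q4.onnx"))
onnx.save(model_fp16, "onnx/model_q4f16.onnx")
```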
## Related

- GGUF variants: n24q02m/Qwen3-Embedding-0.6B-GGUF