# n24q02m/Qwen3-Embedding-0.6B-ONNX

ONNX-optimized version of Qwen/Qwen3-Embedding-0.6B for use with `qwen3-embed` and `fastembed` (PR #605).
## Available Variants

| Variant | File | Size | Description |
|---|---|---|---|
| INT8 | `onnx/model_quantized.onnx` | 572 MB | Dynamic INT8 quantization (default) |
| Q4F16 | `onnx/model_q4f16.onnx` | 517 MB | INT4 weights + FP16 activations |
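If you only need the raw ONNX file (for example, to run it with `onnxruntime` directly), it can be fetched with `huggingface_hub`. A minimal sketch, assuming the file layout in the table above:

```python
from huggingface_hub import hf_hub_download

# Download the INT8 variant; swap the filename for onnx/model_q4f16.onnx
path = hf_hub_download(
    repo_id="n24q02m/Qwen3-Embedding-0.6B-ONNX",
    filename="onnx/model_quantized.onnx",
)
print(path)  # local cache path to the .onnx file
```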
## Usage

### qwen3-embed

```bash
pip install qwen3-embed
```

```python
from qwen3_embed import TextEmbedding

# INT8 (default)
model = TextEmbedding("n24q02m/Qwen3-Embedding-0.6B-ONNX")
embeddings = list(model.embed(["Hello world"]))  # 1024-dim

# MRL: reduce the output dimension
embeddings_256 = list(model.embed(["Hello world"], dim=256))  # 256-dim

# Query embedding with the instruction prefix applied
query_emb = list(model.query_embed("What is machine learning?"))

# Q4F16 (smaller, slightly less accurate)
model_q4 = TextEmbedding("n24q02m/Qwen3-Embedding-0.6B-ONNX-Q4F16")
```
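For retrieval, a typical pattern is to embed the query with `query_embed` (which applies the instruction prefix) and the documents with `embed`, then rank by cosine similarity. A minimal sketch using numpy; it normalizes explicitly rather than assuming the vectors come back unit-length:

```python
import numpy as np
from qwen3_embed import TextEmbedding

model = TextEmbedding("n24q02m/Qwen3-Embedding-0.6B-ONNX")

docs = [
    "Machine learning builds models from data.",
    "Paris is the capital of France.",
]
doc_embs = np.array(list(model.embed(docs)))
query_emb = np.array(list(model.query_embed("What is machine learning?")))[0]

# Cosine similarity: normalize, then take dot products
doc_embs /= np.linalg.norm(doc_embs, axis=1, keepdims=True)
query_emb /= np.linalg.norm(query_emb)
scores = doc_embs @ query_emb
print(docs[int(np.argmax(scores))])  # best-matching document
```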
### fastembed

```bash
pip install fastembed
```

```python
from fastembed import TextEmbedding

# INT8 (default)
model = TextEmbedding("Qwen/Qwen3-Embedding-0.6B")
embeddings = list(model.embed(["Hello world"]))

# Q4F16
model_q4 = TextEmbedding("Qwen/Qwen3-Embedding-0.6B-Q4F16")
```
Note: `fastembed` support requires PR #605. Until it is merged, install from the fork:

```bash
pip install git+https://github.com/n24q02m/fastembed.git@feat/qwen3-support
```
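To confirm the installed build actually registered the model, you can check fastembed's supported-model list. A quick sketch, assuming the fork keeps the upstream `list_supported_models` API (the entry type has varied between dicts and description objects across fastembed releases, so this handles both):

```python
from fastembed import TextEmbedding

# Print every registered model whose name mentions Qwen3
for entry in TextEmbedding.list_supported_models():
    name = entry["model"] if isinstance(entry, dict) else entry.model
    if "Qwen3" in name:
        print(name)  # expect Qwen/Qwen3-Embedding-0.6B and the Q4F16 variant
```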
## Conversion Details

- Source: Qwen/Qwen3-Embedding-0.6B
- ONNX opset: 21
- INT8: `onnxruntime.quantization.quantize_dynamic` (QInt8)
- Q4F16: `MatMulNBitsQuantizer` (block_size=128, symmetric) + FP16 cast
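A rough sketch of the two quantization paths listed above, assuming an already-exported FP32 model at `onnx/model.onnx` (a hypothetical path). The `MatMulNBitsQuantizer` import location and the FP16 cast via `onnxconverter-common` match recent onnxruntime versions and may differ in yours:

```python
import onnx
from onnxruntime.quantization import QuantType, quantize_dynamic
from onnxruntime.quantization.matmul_nbits_quantizer import MatMulNBitsQuantizer

SRC = "onnx/model.onnx"  # assumed FP32 export

# INT8: dynamic quantization, no calibration data needed
quantize_dynamic(SRC, "onnx/model_quantized.onnx", weight_type=QuantType.QInt8)

# Q4F16 step 1: INT4 weight quantization (MatMulNBits nodes)
quant = MatMulNBitsQuantizer(onnx.load(SRC), block_size=128, is_symmetric=True)
quant.process()
quant.model.save_model_to_file("onnx/model_q4.onnx")

# Q4F16 step 2: cast the remaining FP32 tensors to FP16
from onnxconverter_common import float16  # pip install onnxconverter-common

model_fp16 = float16.convert_float_to_float16(onnx.load("onnx/model_q4.onnx"))
onnx.save(model_fp16, "onnx/model_q4f16.onnx")
```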
## Related

- GGUF variants: n24q02m/Qwen3-Embedding-0.6B-GGUF