Please share how to export Qwen3 to ONNX format, many thanks!
When I try to export Qwen3 4B to ONNX format with the Optimum library, it reports the following error:
ValueError: Trying to export a qwen3 model, that is a custom or unsupported architecture, but no custom onnx configuration was passed as custom_onnx_configs. Please refer to https://huggingface.co/docs/optimum/main/en/exporters/onnx/usage_guides/export_a_model#custom-export-of-transformers-models for an example on how to export custom models. Please open an issue at https://github.com/huggingface/optimum/issues if you would like the model type qwen3 to be supported natively in the ONNX export.
https://discuss.huggingface.co/t/onnx-export-failed-for-qwen-qwen3-embedding-0-6b-with-invalid-unordered-map-k-t-key/160909/6
That's a fix/workaround for it.
Not sure what you are using, but I just spent the past 8 hours trying to do this myself with various ONNX and PyTorch versions, and spent another 10 yesterday with no luck.
I just came across https://github.com/microsoft/Olive and I have to say, it's the closest I've come to exporting https://huggingface.co/Kwaipilot/KAT-Dev
Next I'll be trying what Grok is recommending:
Deep Research Report: Qwen3 MoE and Qwen3 Next - Overview, Issues, and Fixes
Executive Summary
Qwen3, released by Alibaba's Qwen team in April 2025, represents the third generation of the Qwen large language model (LLM) series, building on Qwen2.5 with advancements in hybrid reasoning, multilingual capabilities, and efficient scaling. It includes both dense models (e.g., 1.5B to 72B parameters) and Mixture-of-Experts (MoE) variants, such as the Qwen3-235B-A22B (235 billion total parameters, 22 billion active) and smaller MoE like Qwen3-30B-A3B. Qwen3 Next, unveiled in September 2025, is a specialized hybrid MoE architecture (e.g., Qwen3-Next-80B-A3B with 80 billion total, 3 billion active per token), emphasizing ultra-efficiency through sparse MoE, hybrid attention mechanisms (linear + quadratic), and optimizations for long-context handling (up to 260K+ tokens). It achieves high throughput (10x vs. dense models) at lower training costs (10% of equivalents).
These models excel in benchmarks for coding (e.g., Qwen3-Coder-480B-A35B topping SWE-bench-Verified among open models), multimodal tasks (via Qwen3-Omni for text/audio/image/video), and agentic workflows, but they introduce challenges in deployment, especially for ONNX export and quantization—key for edge hardware like Ryzen AI NPUs. Issues stem from novel architectures (e.g., MoE routing, hybrid attention) incompatible with standard exporters, leading to tracing errors, runtime failures, and accuracy degradation. Fixes often involve custom patches, updated libraries, or alternative tools like Optimum and AMD Quark.
This report draws from official releases, GitHub issues, Hugging Face discussions, academic studies, and recent X posts (as of September 28, 2025) to provide a comprehensive analysis.
Model Overviews
Qwen3 MoE
- Architecture: MoE variants use sparse activation, where only a subset of experts processes each token (e.g., 22B active out of 235B total parameters in Qwen3-235B-A22B), reducing compute while maintaining scale; see the toy routing sketch after this list. Supports long contexts (up to 128K native) and hybrid reasoning (deductive/inductive/abductive).
- Variants: Includes Qwen3-30B-A3B (smaller, efficient) and specialized like Qwen3-Coder-480B-A35B (MoE for agentic coding with 256K context, excelling in benchmarks like SWE-bench).
- Strengths: High efficiency, strong in multilingual and coding tasks; open-sourced with tools like Qwen Code for integration.
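To make the sparse-activation idea above concrete, here is a toy top-k router sketch in PyTorch. It is purely illustrative: the expert count, hidden size, and k are made up and do not reflect Qwen3's actual configuration.

```python
import torch
import torch.nn.functional as F

# Toy top-k MoE router: each token is dispatched to only k of E experts,
# so per-token compute scales with k, not with the total expert count.
def route_tokens(hidden, router_weight, k=2):
    logits = hidden @ router_weight                # [tokens, num_experts]
    probs = F.softmax(logits, dim=-1)
    topk_probs, topk_idx = probs.topk(k, dim=-1)   # only k experts fire per token
    return topk_idx, topk_probs / topk_probs.sum(dim=-1, keepdim=True)

hidden = torch.randn(4, 64)          # 4 tokens, hidden size 64 (illustrative)
router_weight = torch.randn(64, 8)   # 8 toy experts (not Qwen3's real count)
expert_idx, mix_weights = route_tokens(hidden, router_weight)
print(expert_idx)    # which experts each token is routed to
print(mix_weights)   # normalized mixing weights for those experts
```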
Qwen3 Next
- Architecture: Hybrid MoE with ultra-sparsity (e.g., 3B active out of 80B in Qwen3-Next-80B-A3B), combining linear attention for long contexts (260K+ tokens) and quadratic attention for short-range precision; a back-of-the-envelope cost comparison follows this list. Trained at 10% of the cost of dense equivalents, with 10x inference throughput.
- Variants: Includes FP8 quantized versions for high-throughput; integrated with NVIDIA for optimized inference on Blackwell/Hopper GPUs.
- Strengths: Excels in reasoning and efficiency; outperforms denser models like Qwen3-32B on downstream tasks. Available on platforms like Amazon Bedrock and Hugging Face.
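To illustrate why the linear attention path matters at that context length, a back-of-the-envelope comparison; the numbers only illustrate asymptotic scaling, not measured throughput.

```python
# Quadratic (softmax) attention touches ~n^2 token pairs; linear attention
# scales with n. Purely illustrative of asymptotics, not a benchmark.
n = 260_000  # long-context length cited for Qwen3 Next
quadratic_pairs = n * n
linear_steps = n
print(f"quadratic attention: ~{quadratic_pairs:.2e} token-pair interactions")
print(f"linear attention:    ~{linear_steps:.2e} steps")
print(f"ratio: ~{quadratic_pairs // linear_steps:,}x")
```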
Identified Issues
Issues cluster around deployment (ONNX export/quantization), inference performance, and model behavior. MoE adds complexity due to routing layers, while Next's hybrid attention exacerbates tracing problems.
1. ONNX Export Issues
- Common Problems:
- Tracing errors from vmap in causal masking (transformers' masking_utils.py), leading to "invalid unordered_map K-T key" or unsupported ops for Qwen3 architectures.
- MoE-specific: Routing logic doesn't export cleanly, causing graph inconsistencies or runtime failures in ONNX Runtime.
- Next-specific: Hybrid attention (linear for long contexts) introduces dynamic shapes incompatible with standard exporters, resulting in export failures or bloated models (>2GB limit in ONNX optimization).
- Prevalence: High in Hugging Face discussions; Qwen3 is "unsupported" in Optimum without custom configs.
2. Quantization Issues
- Common Problems:
- Accuracy loss in low-bit (e.g., INT8/FP8) due to sensitivity in MoE experts; empirical studies show Qwen3 robustness varies by method (e.g., RTN vs. AWQ).
- Runtime errors: INT8 quantized Qwen3 fails in TensorRT-LLM serving (e.g., ModelOpt API conversion issues).
- MoE-specific: Expert imbalance post-quantization leads to degraded performance; smaller MoE like 30B-A3B show syntax errors in code gen.
- Next-specific: FP8 builds (for throughput) may overflow on non-optimized hardware; sparse MoE quantization amplifies activation-aware challenges, requiring GPU (unavailable in your CPU setup).
- Prevalence: Quantized exports differ from PyTorch (e.g., ±1 errors); large models hit 2GB limits.
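A common workaround for the >2GB protobuf limit mentioned above is to re-save the graph with weights stored as external data. A minimal sketch with the onnx package; the paths are placeholders, not the actual export output.

```python
import onnx

# Reload the exported graph and re-save it with weights moved to a side file,
# which keeps the .onnx protobuf itself under the 2GB serialization limit.
model = onnx.load("qwen3_onnx/model.onnx")  # placeholder path
onnx.save_model(
    model,
    "qwen3_onnx/model_external.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location="model_external.onnx.data",
)
```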
3. Inference and Performance Issues
- Common Problems:
- Latency: Qwen3-32B GGUF slower than Qwen2.5 on llama.cpp; MoE variants feel "slow" due to thinking mode looping.
- Behavior: Task looping, imperfect tool use in larger MoE; syntax issues in coding for A3B models.
- Next-specific: Prompt processing delays in hybrid SSM models, though mitigated in updates like MLX-LM.
- Prevalence: User reports on X highlight slower inference on Apple silicon (M4/M1) for MoE/Next vs. predecessors.
Fixes and Recommendations
For ONNX Export
- Patch Transformers Library: Edit masking_utils.py to replace vmap with loops/tensors; use the dynamo exporter in the Optimum CLI: `optimum-cli export onnx --model <ID> --task text-generation-with-past --use_dynamo_exporter`.
- Custom Configs: For MoE/Next, define custom ONNX configs in Optimum (e.g., handle routing as custom ops); see the sketch after this list. Use fixed shapes (`no_dynamic_axes`) to avoid runtime errors.
- Tools: Olive with the OptimumConversion pass; for Ryzen AI, prepend OnnxQuantizationPreprocess.
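As a starting point for the "Custom Configs" bullet, a minimal Python-side sketch. It assumes Qwen3's decoder is close enough to Llama's for LlamaOnnxConfig to be reused and that the merged decoder submodel is keyed "model"; both are assumptions to verify against the custom-export guide linked in the error message.

```python
from transformers import AutoConfig
from optimum.exporters.onnx import main_export
from optimum.exporters.onnx.model_configs import LlamaOnnxConfig

model_id = "Qwen/Qwen3-4B"  # the model from the original post
config = AutoConfig.from_pretrained(model_id)

# Assumption: Qwen3 is Llama-like enough that the stock Llama ONNX config
# describes its inputs/outputs; a dedicated Qwen3OnnxConfig subclass may be
# needed if input shapes or past-key-value layouts differ.
onnx_config = LlamaOnnxConfig(config, task="text-generation", use_past=True)

main_export(
    model_id,
    output="qwen3_4b_onnx",
    task="text-generation-with-past",
    # Assumption: "model" is the submodel key for the merged decoder export;
    # older Optimum versions split this into decoder_model/decoder_with_past_model.
    custom_onnx_configs={"model": onnx_config},
)
```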
For Quantization
- CPU-Friendly Methods: Use RTN (Round-to-Nearest) via AMD Quark for XINT8 on CPU: `quark.quantize(model_path, scheme="xint8_wo_128", algorithm="rtn")`. Avoid AWQ (GPU-dependent).
- MoE/Next-Specific: Calibrate with domain-specific data (e.g., C4 for general, code datasets for Coder); exclude sensitive layers via `exclude_layers`. For FP8, use NVIDIA ModelOpt with careful overflow checks.
- Tools: ONNX Runtime quantization (static for QDQ format); update to the latest release (e.g., avoid the >2GB limit by splitting optimizations). A minimal quantization sketch follows this list.
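For the ONNX Runtime route above, a minimal sketch. Note it uses dynamic INT8 weight quantization (no calibration data needed) rather than the static QDQ flow mentioned in the bullet, and the file paths are placeholders.

```python
from onnxruntime.quantization import quantize_dynamic, QuantType

# Dynamic INT8 weight quantization of the exported decoder.
# Paths are placeholders; point them at the actual Optimum export output.
quantize_dynamic(
    model_input="qwen3_4b_onnx/model.onnx",
    model_output="qwen3_4b_onnx/model_int8.onnx",
    weight_type=QuantType.QInt8,
)
```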
For Inference/Performance
- Optimize Frameworks: Use MLX-LM updates for Apple silicon (e.g., 75 tok/s on M4 Max for Qwen3-Next-80B-A3B@4bit); vLLM for general speedups.
- Behavior Fixes: Fine-tune with PEFT/LoRA on task-specific data; adjust prompts to reduce looping (e.g., structured thinking traces in Qwen3-Next-Thinking).
- Hardware: For Ryzen AI, ensure Vitis AI EP; test CPU fallbacks first.
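A minimal session-setup sketch for the hardware bullet above. Whether the Vitis AI provider is actually available depends on the installed Ryzen AI / ONNX Runtime build, and the model path is a placeholder.

```python
import onnxruntime as ort

# Request Vitis AI first; ONNX Runtime falls back to CPU if that provider
# is not compiled into the installed build.
session = ort.InferenceSession(
    "qwen3_4b_onnx/model_int8.onnx",  # placeholder path
    providers=["VitisAIExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # confirms which providers were actually loaded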
Future Outlook
Qwen3 MoE and Next signal a shift to sparse, efficient LLMs, but deployment hurdles persist. Alibaba's rapid iterations (e.g., Qwen3-Omni, Guard) suggest ongoing fixes. Monitor GitHub/QwenLM for patches; community tools like MLX/vLLM are evolving fast.
I got Qwen3 0.6B converted to ONNX. I uploaded it to my models page.