Model Overview

  • Model Architecture: Qwen3-VL-235B-A22B-Instruct
  • Supported Hardware Microarchitecture: AMD Instinct MI350/MI355
  • ROCm: 7.0
  • Operating System(s): Linux
  • Inference Engine: vLLM
  • Model Optimizer: AMD-Quark
    • Weight quantization: Per-channel, FP8 E4M3, Static
    • Activation quantization: Per-token, FP8 E4M3, Dynamic
  • Calibration Dataset: Pile

This model was built from the Qwen/Qwen3-VL-235B-A22B-Instruct model by applying AMD-Quark PTPC (per-token activation, per-channel weight) FP8 quantization.
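
In the PTPC scheme, each weight matrix carries one static FP8 scale per output channel, while each activation matrix gets one scale per token, recomputed at run time. Below is a minimal sketch of the scale arithmetic for illustration only (not Quark's internal implementation); 448.0 is the largest finite value representable in FP8 E4M3.

import torch

FP8_E4M3_MAX = 448.0  # largest finite value in FP8 E4M3

def per_channel_scales(weight: torch.Tensor) -> torch.Tensor:
    # One static scale per output channel (ch_axis=0), computed once offline.
    return weight.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / FP8_E4M3_MAX

def per_token_scales(activations: torch.Tensor) -> torch.Tensor:
    # One dynamic scale per token (row), recomputed on every forward pass.
    return activations.abs().amax(dim=-1, keepdim=True).clamp_min(1e-12) / FP8_E4M3_MAX

w, x = torch.randn(4096, 4096), torch.randn(8, 4096)  # weight matrix, 8 activation tokens
w_q = (w / per_channel_scales(w)).to(torch.float8_e4m3fn)
x_q = (x / per_token_scales(x)).to(torch.float8_e4m3fn)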

Model Quantization

The model was quantized from Qwen/Qwen3-VL-235B-A22B-Instruct using AMD-Quark. Weights are quantized to FP8 E4M3 with static per-channel scales, and activations are quantized to FP8 E4M3 with dynamic per-token scales.

Quantization script:

# pip install amd-quark

from transformers import AutoTokenizer, AutoModelForCausalLM
from quark.torch import ModelQuantizer, export_safetensors
from quark.torch.quantization import FP8E4M3PerChannelSpec
from quark.torch.quantization.config.config import Config, QuantizationConfig

ckpt_path = "Qwen/Qwen3-VL-235B-A22B-Instruct"
# Keep the LM head, the MoE router gates, and the vision tower unquantized.
exclude_layers = ["lm_head", "*mlp.gate", "*.visual.*"]
output_dir = ckpt_path.rstrip("/").split("/")[-1] + "-ptpc"

# Load the original floating-point model
model = AutoModelForCausalLM.from_pretrained(
    ckpt_path, device_map="auto", torch_dtype="auto", trust_remote_code=True
)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(ckpt_path)

# Set the quantization configuration
# Weights: static per-channel scales along the output-channel axis (ch_axis=0).
FP8_PER_CHANNEL_SPEC = FP8E4M3PerChannelSpec(is_dynamic=False, ch_axis=0).to_quantization_spec()
# Activations: dynamic per-token scales (ch_axis=1), computed at run time.
FP8_PER_TOKEN_DYNAMIC_SPEC = FP8E4M3PerChannelSpec(is_dynamic=True, ch_axis=1).to_quantization_spec()
W_FP8_PER_CHANNEL_STATIC_A_FP8_PER_TOKEN_DYNAMIC_CONFIG = QuantizationConfig(
    input_tensors=FP8_PER_TOKEN_DYNAMIC_SPEC, weight=FP8_PER_CHANNEL_SPEC
)
quant_config = Config(
    global_quant_config=W_FP8_PER_CHANNEL_STATIC_A_FP8_PER_TOKEN_DYNAMIC_CONFIG,
    exclude=exclude_layers,
)

# Apply quantization
quantizer = ModelQuantizer(quant_config)
model = quantizer.quantize_model(model)

# Export quantized model
model = quantizer.freeze(model)
export_safetensors(model, output_dir)
tokenizer.save_pretrained(output_dir)
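
After export you can sanity-check the checkpoint: quantized linear weights should report torch.float8_e4m3fn, while the excluded layers (lm_head, the MoE gates, the vision tower) stay in the original dtype. A short sketch using the safetensors library; the exact tensor names depend on the export and are not guaranteed:

from pathlib import Path
from safetensors import safe_open

for shard in sorted(Path(output_dir).glob("*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for name in f.keys():
            t = f.get_tensor(name)
            print(name, t.dtype, tuple(t.shape))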

Accuracy

Benchmark | Qwen3-VL-235B-A22B-Instruct | Qwen3-VL-235B-A22B-Instruct-ptpc (this model)
GSM8K     | 0.93                        | 0.925

Reproduction

Docker: rocm/vllm-private:rocm7.1_ubuntu22.04_vllm0.11.2_ptpc_fp8

The GSM8K result was obtained using vLLM.

vLLM version: main (commit 0b2549)

aiter version: 0.13.20191203

GSM8K

lm_eval --model vllm \
    --model_args pretrained=/model_path/Qwen/Qwen3-VL-235B-A22B-Instruct-ptpc,add_bos_token=true,tensor_parallel_size=4 \
    --tasks gsm8k \
    --num_fewshot 5 \
    --batch_size auto \
    --limit 200
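
Note that --limit 200 restricts the evaluation to the first 200 GSM8K samples. The same run can also be driven from Python; a sketch assuming lm-evaluation-harness v0.4+ and its simple_evaluate entry point:

from lm_eval import simple_evaluate

results = simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=/model_path/Qwen/Qwen3-VL-235B-A22B-Instruct-ptpc,"
        "add_bos_token=true,tensor_parallel_size=4"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size="auto",
    limit=200,
)
print(results["results"]["gsm8k"])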

Deployment

Use with vLLM

This model can be deployed efficiently using the vLLM backend.
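
For example, text-only offline inference with the vLLM Python API (a minimal sketch; the local checkpoint path and sampling settings are placeholders, and multimodal inputs would go through the chat interface instead):

from vllm import LLM, SamplingParams

llm = LLM(
    model="/model_path/Qwen/Qwen3-VL-235B-A22B-Instruct-ptpc",  # local checkpoint path
    tensor_parallel_size=4,
)
params = SamplingParams(temperature=0.0, max_tokens=128)
outputs = llm.generate(["Janet has 3 apples and buys 2 more. How many does she have?"], params)
print(outputs[0].outputs[0].text)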

Evaluation

Additional evaluation results and reproduction scripts are being prepared.

License

Modifications Copyright (c) 2025 Advanced Micro Devices, Inc. All rights reserved.
