Running on 6 GPUs

#10 opened by 0xSero

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 \
vllm serve /mnt/llm_models/GLM-4.5-Air-AWQ-4bit \
--tensor-parallel-size 2 \
--pipeline-parallel-size 3 \
--dtype bfloat16 \
--max-model-len 129500 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.85 \
--swap-space 16 \
--enable-expert-parallel \
--reasoning-parser glm45 \
--tool-call-parser glm45 \
--enable-auto-tool-choice \
--guided-decoding-backend outlines \
--host 0.0.0.0 \
--port 8000

It took me 4 hours to get this working on 6x 3090s. Hope this helps.
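Once it is up, a quick sanity check against the OpenAI-compatible endpoint looks like this (a minimal sketch: the model name defaults to the served path unless you pass --served-model-name, and with --reasoning-parser glm45 the response message should also carry a reasoning_content field):

# List the served models; the id should match the path above
curl -s http://localhost:8000/v1/models

# Minimal chat completion request
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/llm_models/GLM-4.5-Air-AWQ-4bit",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'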

Thank you for sharing :)

If you're on RTX 3090s, use --dtype float16; it is the better-tested and typically faster path for the AWQ Marlin kernels on SM86. Make sure the Marlin CUDA kernels are actually in use: vLLM ships them and, on supported GPUs, selects awq_marlin for AWQ checkpoints, and awq_marlin is also compatible with compressed-tensors quants.

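To double-check what your cards report before picking a dtype, a quick sketch (the compute_cap query needs a reasonably recent nvidia-smi; run the python one-liner from the vLLM environment):

# RTX 3090s should report compute capability 8.6
nvidia-smi --query-gpu=name,compute_cap --format=csv

# Ask PyTorch directly
python -c "import torch; print(torch.cuda.get_device_capability(0), torch.cuda.is_bf16_supported())"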

VLLM_LOGGING_LEVEL=DEBUG \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 \
vllm serve /mnt/llm_models/GLM-4.5-Air-AWQ-4bit \
--tensor-parallel-size 2 \
--pipeline-parallel-size 3 \
--dtype float16 \
--max-model-len 131072 \
--disable-custom-all-reduce \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.85 \
--swap-space 32 \
--reasoning-parser glm45 \
--tool-call-parser glm45 \
--enable-auto-tool-choice \
--host 0.0.0.0 \
--port 8000

Here is my final config. Thanks to cpatonn for the quant.
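Since the config enables --enable-auto-tool-choice with --tool-call-parser glm45, tool calling can be exercised over the same endpoint. A minimal sketch with a made-up get_weather tool (adjust the model name if you set --served-model-name):

# Hypothetical tool-call smoke test; get_weather is only for illustration
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/llm_models/GLM-4.5-Air-AWQ-4bit",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'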
