Running on 6 GPUs

#10 opened by 0xSero

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 \
vllm serve /mnt/llm_models/GLM-4.5-Air-AWQ-4bit \
--tensor-parallel-size 2 \
--pipeline-parallel-size 3 \
--dtype bfloat16 \
--max-model-len 129500 \
--max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.85 \
--swap-space 16 \
--enable-expert-parallel \
--reasoning-parser glm45 \
--tool-call-parser glm45 \
--enable-auto-tool-choice \
--guided-decoding-backend outlines \
--host 0.0.0.0 \
--port 8000

It took me 4 hours to get this working on 6x 3090s. Hope this helps.
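Once it is up, a quick sanity check against the OpenAI-compatible endpoint looks like this (a minimal sketch: the model name defaults to the served path unless you pass --served-model-name, and with --reasoning-parser glm45 the response message should also carry a reasoning_content field):

# List the served models; the id should match the path above
curl -s http://localhost:8000/v1/models

# Minimal chat completion request
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/llm_models/GLM-4.5-Air-AWQ-4bit",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 64
  }'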

Thank you for sharing :)

If you're on RTX 3090s, use --dtype float16; it is the better-tested and typically faster path for the AWQ Marlin kernels on SM86. Make sure the Marlin CUDA kernels are actually in use: vLLM ships them and, on supported GPUs, selects awq_marlin for AWQ checkpoints, and awq_marlin is also compatible with compressed-tensors quants.

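To double-check what your cards report before picking a dtype, a quick sketch (the compute_cap query needs a reasonably recent nvidia-smi; run the python one-liner from the vLLM environment):

# RTX 3090s should report compute capability 8.6
nvidia-smi --query-gpu=name,compute_cap --format=csv

# Ask PyTorch directly
python -c "import torch; print(torch.cuda.get_device_capability(0), torch.cuda.is_bf16_supported())"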

VLLM_LOGGING_LEVEL=DEBUG \
CUDA_VISIBLE_DEVICES=0,1,2,3,4,5 \
vllm serve /mnt/llm_models/GLM-4.5-Air-AWQ-4bit \
--tensor-parallel-size 2 \
--pipeline-parallel-size 3 \
--dtype float16 \
--max-model-len 131072 \
--disable-custom-all-reduce \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.85 \
--swap-space 32 \
--reasoning-parser glm45 \
--tool-call-parser glm45 \
--enable-auto-tool-choice \
--host 0.0.0.0 \
--port 8000

Here is my final config. Thanks to cpatonn for the quant.
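Since the config enables --enable-auto-tool-choice with --tool-call-parser glm45, tool calling can be exercised over the same endpoint. A minimal sketch with a made-up get_weather tool (adjust the model name if you set --served-model-name):

# Hypothetical tool-call smoke test; get_weather is only for illustration
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/llm_models/GLM-4.5-Air-AWQ-4bit",
    "messages": [{"role": "user", "content": "What is the weather in Berlin?"}],
    "tools": [{
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
          "type": "object",
          "properties": {"city": {"type": "string"}},
          "required": ["city"]
        }
      }
    }],
    "tool_choice": "auto"
  }'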
