benchmark test

#1
by seamoonlight - opened

Hello,
May I ask how the average acceptance length was measured? Was it also done in SpecForge?
Have you run any throughput tests?
Many thanks

taobao-mnn org

Yes, the tests were run with the benchmark in SpecForge. Throughput is not in the README because results vary across environments, so we only published the acceptance rate.

Here is the test script:

#!/bin/bash

export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1

config_list=(
    "4,3,1,4"
    "4,7,1,8"
)

i=2

CUDA_VISIBLE_DEVICES=0 python benchmarks/bench_model_speedup.py \
    --model-path /path/to/base_model \
    --speculative-draft-model-path /path/to/eagle_train/epoch_$i \
    --port 20001 \
    --trust-remote-code \
    --mem-fraction-static 0.8 \
    --config-list "${config_list[@]}" \
    --output res/epoch_$i.jsonl \
    --benchmark-list mtbench:80 gsm8k:200 humaneval:200 math500:200 ceval:200 cmmlu:200

Are you using a modified version of the sglang code? Running the command above gives me the following error:

[2025-11-18 19:31:44] Scheduler hit an exception: Traceback (most recent call last):
  File "/opt/conda/envs/sglang/lib/python3.11/site-packages/sglang/srt/managers/scheduler.py", line 2712, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/opt/conda/envs/sglang/lib/python3.11/site-packages/sglang/srt/managers/scheduler.py", line 312, in __init__
    self.tp_worker = TpModelWorker(
                     ^^^^^^^^^^^^^^
  File "/opt/conda/envs/sglang/lib/python3.11/site-packages/sglang/srt/managers/tp_worker.py", line 237, in __init__
    self._model_runner = ModelRunner(
                         ^^^^^^^^^^^^
  File "/opt/conda/envs/sglang/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 324, in __init__
    self.initialize(min_per_gpu_memory)
  File "/opt/conda/envs/sglang/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 491, in initialize
    self.init_device_graphs()
  File "/opt/conda/envs/sglang/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 2034, in init_device_graphs
    self.graph_runner = graph_runners[self.device](self)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/sglang/lib/python3.11/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 348, in __init__
    self.model_runner.model.set_eagle3_layers_to_capture()
    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/sglang/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1962, in __getattr__
    raise AttributeError(
AttributeError: 'Qwen3VLForConditionalGeneration' object has no attribute 'set_eagle3_layers_to_capture'

[2025-11-18 19:31:44] Received sigquit from a child process. It usually means the child failed.
Traceback (most recent call last):
  File "/ch/code/ACL2026/SpecForge/benchmarks/bench_model_speedup.py", line 460, in <module>
    main()
  File "/ch/code/ACL2026/SpecForge/benchmarks/bench_model_speedup.py", line 423, in main
    process = launch_sglang_server(
              ^^^^^^^^^^^^^^^^^^^^^
  File "/ch/code/ACL2026/SpecForge/benchmarks/bench_model_speedup.py", line 243, in launch_sglang_server
    process = popen_launch_server(
              ^^^^^^^^^^^^^^^^^^^^
  File "/opt/conda/envs/sglang/lib/python3.11/site-packages/sglang/test/test_utils.py", line 636, in popen_launch_server
    raise Exception(
Exception: Server process exited with code -9. Check server logs for errors.
taobao-mnn org

Oh, right, I forgot to mention that in the comment.

sglang does not yet support EAGLE for qwen3-vl out of the box; some modifications to the sglang code are needed. You can refer to the existing qwen2.5-vl support in sglang and make similar, fairly simple changes.
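
The traceback above shows what kind of change is expected: the CUDA graph runner calls `set_eagle3_layers_to_capture()` on the target model, and `Qwen3VLForConditionalGeneration` does not define it. As a rough illustration only (a toy class, not the actual sglang code; the specific default layer choice of an early, a middle, and a late decoder layer is an assumption borrowed from how some sglang models do it), the pattern looks like this:

```python
# Toy sketch of the EAGLE3 aux-hidden-state capture pattern the traceback
# implies. Nothing here is real sglang code; it only shows the shape of the
# method the target model is expected to expose.

class ToyModelForEagle3:
    def __init__(self, num_layers):
        self.num_layers = num_layers
        self.capture_aux_hidden_states = False
        self.layers_to_capture = []

    def set_eagle3_layers_to_capture(self, layer_ids=None):
        # Hypothetical default: pick an early, middle, and late layer whose
        # hidden states feed the EAGLE3 draft model.
        self.capture_aux_hidden_states = True
        if layer_ids is None:
            n = self.num_layers
            layer_ids = [2, n // 2, n - 3]
        self.layers_to_capture = layer_ids

    def forward(self, hidden):
        aux = []
        for i in range(self.num_layers):
            hidden = [h + 1 for h in hidden]  # stand-in for a decoder layer
            if self.capture_aux_hidden_states and i in self.layers_to_capture:
                aux.append(list(hidden))  # record this layer's hidden state
        return hidden, aux

model = ToyModelForEagle3(num_layers=28)
model.set_eagle3_layers_to_capture()
out, aux = model.forward([0.0])
print(len(aux))  # → 3, one captured state per selected layer
```

For the real fix, the Qwen2.5-VL EAGLE3 PR in sglang is the reference to mirror for Qwen3-VL.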

Hi, thanks for your reply. Following the Qwen 2.5 VL changes in https://github.com/sgl-project/sglang/pull/8801/files, I made the corresponding modifications for Qwen3 VL, but the results are not great.

The commands I ran are as follows:

python -m sglang.launch_server \
    --model-path /ch/pretrained_models/Qwen3-VL-2B-Instruct \
    --speculative-draft-model-path /ch/pretrained_models/Qwen3-VL-2B-Instruct-Eagle3 \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 6 \
    --speculative-num-draft-tokens 24 \
    --trust-remote-code \
    --chunked-prefill-size -1 \
    --cuda-graph-max-bs 1 \
    --tp 1 \
    --mem-fraction-static 0.7 \
    --host 0.0.0.0 \
    --port 8080
python run_mmstar.py --host http://0.0.0.0 --port 8080 --parallel 1 --num-questions 50

The test results are:

Average Latency: 41.887 s
Average Output throughput: 89.240 token/s
Average Accept length: 1.020

I captured some logs below; you can see the accept len is very low:

[2025-11-19 04:10:42] INFO:     127.0.0.1:57852 - "POST /generate HTTP/1.1" 200 OK
[2025-11-19 04:10:42] Prefill batch, #new-seq: 1, #new-token: 9, #cached-token: 212, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-11-19 04:10:42] Decode batch, #running-req: 1, #token: 244, token usage: 0.00, accept len: 1.00, accept rate: 0.04, cuda graph: True, gen throughput (token/s): 84.08, #queue-req: 0, 
[2025-11-19 04:10:43] Decode batch, #running-req: 1, #token: 284, token usage: 0.00, accept len: 1.00, accept rate: 0.04, cuda graph: True, gen throughput (token/s): 98.45, #queue-req: 0, 
[2025-11-19 04:10:43] Decode batch, #running-req: 1, #token: 324, token usage: 0.00, accept len: 1.00, accept rate: 0.04, cuda graph: True, gen throughput (token/s): 97.48, #queue-req: 0, 
[2025-11-19 04:10:43] Decode batch, #running-req: 1, #token: 364, token usage: 0.00, accept len: 1.00, accept rate: 0.04, cuda graph: True, gen throughput (token/s): 97.93, #queue-req: 0, 
[2025-11-19 04:10:44] Decode batch, #running-req: 1, #token: 404, token usage: 0.00, accept len: 1.00, accept rate: 0.04, cuda graph: True, gen throughput (token/s): 98.28, #queue-req: 0, 
[2025-11-19 04:10:44] Decode batch, #running-req: 1, #token: 444, token usage: 0.00, accept len: 1.00, accept rate: 0.04, cuda graph: True, gen throughput (token/s): 97.95, #queue-req: 0, 

Before this, I also tested Qwen 2.5 VL with https://huggingface.co/Rayzl/qwen2.5-vl-7b-eagle3-sgl and the result was essentially the same: the accept len is very low. The commands and configuration above are all taken from https://github.com/sgl-project/SpecForge/pull/102

I then used the same configuration (same speculative-num-steps, speculative-eagle-topk, and speculative-num-draft-tokens) to test LLM EAGLE3 on Llama-3.1-8B-Instruct with https://huggingface.co/lmsys/sglang-EAGLE-LLaMA3-Instruct-8B,
and the results there were normal. My commands:

python -m sglang.launch_server \
    --model-path /ch/pretrained_models/Llama-3.1-8B-Instruct \
    --speculative-draft-model-path /ch/pretrained_models/sglang-EAGLE3-LLaMA3.1-Instruct-8B \
    --speculative-algorithm EAGLE3 \
    --speculative-num-steps 4 \
    --speculative-eagle-topk 6 \
    --speculative-num-draft-tokens 24 \
    --trust-remote-code \
    --chunked-prefill-size -1 \
    --cuda-graph-max-bs 1 \
    --tp 1 \
    --mem-fraction-static 0.75 \
    --host 0.0.0.0 \
    --dtype bfloat16 \
    --port 8080
python run_gsm8k.py --host http://0.0.0.0 --port 8080 --parallel 1 --num-questions 50

The results:

Average Latency: 52.161 s
Average Output throughput: 86.099 token/s
Average Accept length: 2.313

Did I do anything wrong? If my configuration differs from yours in any way, or if any step of my procedure is incorrect, could you help clarify? Thanks a lot!

-------------------------------- update --------------------------------
I tested Qwen 2.5 VL EAGLE3 with vLLM and the results there are as expected. My commands (adapted from https://github.com/vllm-project/vllm/pull/22872):

# with EAGLE3
vllm serve \
    /ch/pretrained_models/Qwen2.5-VL-7B-Instruct \
    --port 5580 --host 0.0.0.0 \
    --max-num-seqs 128 --dtype bfloat16 --max-model-len=8192 \
    --no-enable-prefix-caching --trust-remote-code -tp 1 \
    --speculative-config '{"method": "eagle3", "model": "/ch/pretrained_models/qwen2.5-vl-7b-eagle3-sgl", "prefill_token_shift": false, "num_speculative_tokens": 3, "draft_tensor_parallel_size": 1, "max_model_len": 8192}' \
    --num-lookahead-slots=3 \
    --gpu-memory-utilization=0.93

# without EAGLE3 (baseline)
vllm serve \
    /ch/pretrained_models/Qwen2.5-VL-7B-Instruct \
    --port 5580 --host 0.0.0.0 \
    --max-num-seqs 128 --dtype bfloat16 --max-model-len=8192 \
    --no-enable-prefix-caching --trust-remote-code -tp 1 \
    --num-lookahead-slots=3 \
    --gpu-memory-utilization=0.93

Results:

with EAGLE3

============ Serving Benchmark Result ============
Successful requests:                     50        
Failed requests:                         0         
Maximum request concurrency:             2         
Benchmark duration (s):                  31.10     
Total input tokens:                      1226      
Total generated tokens:                  4219      
Request throughput (req/s):              1.61      
Output token throughput (tok/s):         135.67    
Peak output token throughput (tok/s):    96.00     
Peak concurrent requests:                5.00      
Total Token throughput (tok/s):          175.09    
---------------Time to First Token----------------
Mean TTFT (ms):                          215.38    
Median TTFT (ms):                        217.87    
P99 TTFT (ms):                           346.19    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          11.67     
Median TPOT (ms):                        11.65     
P99 TPOT (ms):                           15.67     
---------------Inter-token Latency----------------
Mean ITL (ms):                           23.08     
Median ITL (ms):                         20.88     
P99 ITL (ms):                            127.35    
==================================================

without EAGLE3

============ Serving Benchmark Result ============
Successful requests:                     50        
Failed requests:                         0         
Maximum request concurrency:             2         
Benchmark duration (s):                  42.94     
Total input tokens:                      1226      
Total generated tokens:                  4205      
Request throughput (req/s):              1.16      
Output token throughput (tok/s):         97.92     
Peak output token throughput (tok/s):    118.00    
Peak concurrent requests:                5.00      
Total Token throughput (tok/s):          126.47    
---------------Time to First Token----------------
Mean TTFT (ms):                          222.40    
Median TTFT (ms):                        225.44    
P99 TTFT (ms):                           361.31    
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          17.69     
Median TPOT (ms):                        17.68     
P99 TPOT (ms):                           18.78     
---------------Inter-token Latency----------------
Mean ITL (ms):                           18.68     
Median ITL (ms):                         17.15     
P99 ITL (ms):                            101.11    
==================================================

End-to-end speedup: 1.38x
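
The 1.38x figure follows directly from the two benchmark tables above: the same 50 requests finish in 31.10 s with EAGLE3 versus 42.94 s without.

```python
# End-to-end speedup implied by the two vLLM benchmark runs above,
# computed from the reported benchmark durations.
duration_with_eagle3 = 31.10  # s, with EAGLE3
duration_without = 42.94      # s, baseline
speedup = duration_without / duration_with_eagle3
print(f"{speedup:.2f}x")  # → 1.38x
```

The output-token throughput ratio (135.67 / 97.92 ≈ 1.39) gives essentially the same number.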

Thanks a lot for your contribution. Could you share the inference code?
Many thanks!

taobao-mnn org

@huggingface4ch I used the test script in SpecForge: https://github.com/sgl-project/SpecForge/blob/main/benchmarks/bench_model_speedup.py

You can give it a try.
