benchmark test
Hi,
How was the average acceptance length measured? Was it also done in SpecForge?
Also, have you run any throughput tests?
Thanks a lot!
Yes, the measurements were done with the benchmark in SpecForge. Throughput is not included in the README because results vary across environments, so only the acceptance rate is reported.
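For intuition, the acceptance length reported by the benchmark is essentially the average number of tokens committed per target-model forward pass. Below is a minimal sketch of that metric; this is my own simplification for illustration, not SpecForge's exact code:

```python
def avg_accept_length(total_output_tokens: int, num_decode_steps: int) -> float:
    """Average tokens committed per verification step of the target model.

    1.0 means no draft token was ever accepted; higher is better, bounded
    above by the number of draft tokens verified per step plus one.
    """
    return total_output_tokens / num_decode_steps
```

For example, generating 400 tokens in 160 decode steps gives an average acceptance length of 2.5.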
Here is the test script:
#!/bin/bash
export SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1
config_list=(
"4,3,1,4"
"4,7,1,8"
)
# epoch of the trained draft-model checkpoint to evaluate
i=2
CUDA_VISIBLE_DEVICES=0 python benchmarks/bench_model_speedup.py \
--model-path /path/to/base_model \
--speculative-draft-model-path /path/to/eagle_train/epoch_$i \
--port 20001 \
--trust-remote-code \
--mem-fraction-static 0.8 \
--config-list "${config_list[@]}" \
--output res/epoch_$i.jsonl \
--benchmark-list mtbench:80 gsm8k:200 humaneval:200 math500:200 ceval:200 cmmlu:200
Are you running this with a modified version of the sglang code? When I run the command above, I get the following error:
[2025-11-18 19:31:44] Scheduler hit an exception: Traceback (most recent call last):
File "/opt/conda/envs/sglang/lib/python3.11/site-packages/sglang/srt/managers/scheduler.py", line 2712, in run_scheduler_process
scheduler = Scheduler(
^^^^^^^^^^
File "/opt/conda/envs/sglang/lib/python3.11/site-packages/sglang/srt/managers/scheduler.py", line 312, in __init__
self.tp_worker = TpModelWorker(
^^^^^^^^^^^^^^
File "/opt/conda/envs/sglang/lib/python3.11/site-packages/sglang/srt/managers/tp_worker.py", line 237, in __init__
self._model_runner = ModelRunner(
^^^^^^^^^^^^
File "/opt/conda/envs/sglang/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 324, in __init__
self.initialize(min_per_gpu_memory)
File "/opt/conda/envs/sglang/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 491, in initialize
self.init_device_graphs()
File "/opt/conda/envs/sglang/lib/python3.11/site-packages/sglang/srt/model_executor/model_runner.py", line 2034, in init_device_graphs
self.graph_runner = graph_runners[self.device](self)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/sglang/lib/python3.11/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 348, in __init__
self.model_runner.model.set_eagle3_layers_to_capture()
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/sglang/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1962, in __getattr__
raise AttributeError(
AttributeError: 'Qwen3VLForConditionalGeneration' object has no attribute 'set_eagle3_layers_to_capture'
[2025-11-18 19:31:44] Received sigquit from a child process. It usually means the child failed.
Traceback (most recent call last):
File "/ch/code/ACL2026/SpecForge/benchmarks/bench_model_speedup.py", line 460, in <module>
main()
File "/ch/code/ACL2026/SpecForge/benchmarks/bench_model_speedup.py", line 423, in main
process = launch_sglang_server(
^^^^^^^^^^^^^^^^^^^^^
File "/ch/code/ACL2026/SpecForge/benchmarks/bench_model_speedup.py", line 243, in launch_sglang_server
process = popen_launch_server(
^^^^^^^^^^^^^^^^^^^^
File "/opt/conda/envs/sglang/lib/python3.11/site-packages/sglang/test/test_utils.py", line 636, in popen_launch_server
raise Exception(
Exception: Server process exited with code -9. Check server logs for errors.
Oh, right, I forgot to mention that in my comment.
sglang does not yet support EAGLE for Qwen3-VL out of the box; some changes to the sglang code are needed. You can follow sglang's existing Qwen2.5-VL support and adapt it with minor modifications.
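For reference, here is a minimal sketch of the kind of hook involved. The method name comes from the AttributeError above, and the layer selection follows the convention in sglang's Llama EAGLE3 code; the mixin shape and the default layer choice are my assumptions, not the actual Qwen3-VL patch:

```python
# Sketch of the hook sglang's EAGLE3 path expects on the target model.
# The method name matches the AttributeError above; the layer selection
# (one low, one middle, one high decoder layer) follows the Llama EAGLE3
# convention and is an assumption for Qwen3-VL.

class Eagle3CaptureMixin:
    def set_eagle3_layers_to_capture(self, layer_ids=None):
        # Record auxiliary hidden states for the draft model to consume.
        self.capture_aux_hidden_states = True
        n = self.config.num_hidden_layers
        if layer_ids is None:
            # Default: low / mid / high decoder layers.
            self.layers_to_capture = [2, n // 2, n - 3]
        else:
            # Offset by 1 to account for the embedding output.
            self.layers_to_capture = [i + 1 for i in layer_ids]
```

In the real patch this logic would live on `Qwen3VLForConditionalGeneration` (or its inner language model), mirroring how PR #8801 wired it up for Qwen2.5-VL.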
Hi, thanks for your reply. Following the Qwen2.5-VL changes in https://github.com/sgl-project/sglang/pull/8801/files, I modified Qwen3-VL accordingly, but the benchmark results are not great.
The commands I ran are as follows:
python -m sglang.launch_server \
--model-path /ch/pretrained_models/Qwen3-VL-2B-Instruct \
--speculative-draft-model-path /ch/pretrained_models/Qwen3-VL-2B-Instruct-Eagle3 \
--speculative-algorithm EAGLE3 \
--speculative-num-steps 4 \
--speculative-eagle-topk 6 \
--speculative-num-draft-tokens 24 \
--trust-remote-code \
--chunked-prefill-size -1 \
--cuda-graph-max-bs 1 \
--tp 1 \
--mem-fraction-static 0.7 \
--host 0.0.0.0 \
--port 8080
python run_mmstar.py --host http://0.0.0.0 --port 8080 --parallel 1 --num-questions 50
The test results are:
Average Latency: 41.887 s
Average Output throughput: 89.240 token/s
Average Accept length: 1.020
I've excerpted some logs below; you can see that the accept len is very low:
[2025-11-19 04:10:42] INFO: 127.0.0.1:57852 - "POST /generate HTTP/1.1" 200 OK
[2025-11-19 04:10:42] Prefill batch, #new-seq: 1, #new-token: 9, #cached-token: 212, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-11-19 04:10:42] Decode batch, #running-req: 1, #token: 244, token usage: 0.00, accept len: 1.00, accept rate: 0.04, cuda graph: True, gen throughput (token/s): 84.08, #queue-req: 0,
[2025-11-19 04:10:43] Decode batch, #running-req: 1, #token: 284, token usage: 0.00, accept len: 1.00, accept rate: 0.04, cuda graph: True, gen throughput (token/s): 98.45, #queue-req: 0,
[2025-11-19 04:10:43] Decode batch, #running-req: 1, #token: 324, token usage: 0.00, accept len: 1.00, accept rate: 0.04, cuda graph: True, gen throughput (token/s): 97.48, #queue-req: 0,
[2025-11-19 04:10:43] Decode batch, #running-req: 1, #token: 364, token usage: 0.00, accept len: 1.00, accept rate: 0.04, cuda graph: True, gen throughput (token/s): 97.93, #queue-req: 0,
[2025-11-19 04:10:44] Decode batch, #running-req: 1, #token: 404, token usage: 0.00, accept len: 1.00, accept rate: 0.04, cuda graph: True, gen throughput (token/s): 98.28, #queue-req: 0,
[2025-11-19 04:10:44] Decode batch, #running-req: 1, #token: 444, token usage: 0.00, accept len: 1.00, accept rate: 0.04, cuda graph: True, gen throughput (token/s): 97.95, #queue-req: 0,
Before this, I also tested Qwen2.5-VL with https://huggingface.co/Rayzl/qwen2.5-vl-7b-eagle3-sgl, and the results were essentially the same: the accept len was very low. The commands and configuration above all follow https://github.com/sgl-project/SpecForge/pull/102
I then used the same configuration (the same speculative-num-steps, speculative-eagle-topk, and speculative-num-draft-tokens) with Llama-3.1-8B-Instruct and https://huggingface.co/lmsys/sglang-EAGLE-LLaMA3-Instruct-8B
to test EAGLE3 on a text-only LLM, and the results were normal. The commands I ran are as follows:
python -m sglang.launch_server \
--model-path /ch/pretrained_models/Llama-3.1-8B-Instruct \
--speculative-draft-model-path /ch/pretrained_models/sglang-EAGLE3-LLaMA3.1-Instruct-8B \
--speculative-algorithm EAGLE3 \
--speculative-num-steps 4 \
--speculative-eagle-topk 6 \
--speculative-num-draft-tokens 24 \
--trust-remote-code \
--chunked-prefill-size -1 \
--cuda-graph-max-bs 1 \
--tp 1 \
--mem-fraction-static 0.75 \
--host 0.0.0.0 \
--dtype bfloat16 \
--port 8080
python run_gsm8k.py --host http://0.0.0.0 --port 8080 --parallel 1 --num-questions 50
The results are:
Average Latency: 52.161 s
Average Output throughput: 86.099 token/s
Average Accept length: 2.313
Is there anything wrong with my setup? If my configuration differs from yours in any way, or if I'm doing anything incorrectly, could you help me figure it out? Thanks a lot!
-------------------------------- update --------------------------------
I tested Qwen2.5-VL EAGLE3 with vLLM and the results meet expectations. My commands (based on https://github.com/vllm-project/vllm/pull/22872) are as follows:
vllm serve \
/ch/pretrained_models/Qwen2.5-VL-7B-Instruct \
--port 5580 --host 0.0.0.0 \
--max-num-seqs 128 --dtype bfloat16 --max-model-len=8192 \
--no-enable-prefix-caching --trust-remote-code -tp 1 \
--speculative-config '{"method": "eagle3", "model": "/ch/pretrained_models/qwen2.5-vl-7b-eagle3-sgl", "prefill_token_shift": false, "num_speculative_tokens": 3, "draft_tensor_parallel_size": 1, "max_model_len": 8192}' \
--num-lookahead-slots=3 \
--gpu-memory-utilization=0.93
vllm serve \
/ch/pretrained_models/Qwen2.5-VL-7B-Instruct \
--port 5580 --host 0.0.0.0 \
--max-num-seqs 128 --dtype bfloat16 --max-model-len=8192 \
--no-enable-prefix-caching --trust-remote-code -tp 1 \
--num-lookahead-slots=3 \
--gpu-memory-utilization=0.93
The results are:
with EAGLE3
============ Serving Benchmark Result ============
Successful requests: 50
Failed requests: 0
Maximum request concurrency: 2
Benchmark duration (s): 31.10
Total input tokens: 1226
Total generated tokens: 4219
Request throughput (req/s): 1.61
Output token throughput (tok/s): 135.67
Peak output token throughput (tok/s): 96.00
Peak concurrent requests: 5.00
Total Token throughput (tok/s): 175.09
---------------Time to First Token----------------
Mean TTFT (ms): 215.38
Median TTFT (ms): 217.87
P99 TTFT (ms): 346.19
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 11.67
Median TPOT (ms): 11.65
P99 TPOT (ms): 15.67
---------------Inter-token Latency----------------
Mean ITL (ms): 23.08
Median ITL (ms): 20.88
P99 ITL (ms): 127.35
==================================================
without EAGLE3
============ Serving Benchmark Result ============
Successful requests: 50
Failed requests: 0
Maximum request concurrency: 2
Benchmark duration (s): 42.94
Total input tokens: 1226
Total generated tokens: 4205
Request throughput (req/s): 1.16
Output token throughput (tok/s): 97.92
Peak output token throughput (tok/s): 118.00
Peak concurrent requests: 5.00
Total Token throughput (tok/s): 126.47
---------------Time to First Token----------------
Mean TTFT (ms): 222.40
Median TTFT (ms): 225.44
P99 TTFT (ms): 361.31
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 17.69
Median TPOT (ms): 17.68
P99 TPOT (ms): 18.78
---------------Inter-token Latency----------------
Mean ITL (ms): 18.68
Median ITL (ms): 17.15
P99 ITL (ms): 101.11
==================================================
End-to-end speedup: 1.38x
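The 1.38x figure follows directly from the two benchmark durations above, and the mean per-token latencies tell a consistent story on the decode side:

```python
# End-to-end speedup from the two vLLM benchmark runs above.
baseline_s = 42.94   # benchmark duration without EAGLE3 (s)
eagle3_s = 31.10     # benchmark duration with EAGLE3 (s)
print(f"e2e speedup: {baseline_s / eagle3_s:.2f}x")        # 1.38x

# Decode-side speedup from mean TPOT (time per output token).
tpot_base_ms, tpot_eagle_ms = 17.69, 11.67
print(f"TPOT speedup: {tpot_base_ms / tpot_eagle_ms:.2f}x")  # 1.52x
```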
Thanks for your contribution! Could you share the inference code?
Thanks!!!
@huggingface4ch I used the benchmark script from SpecForge: https://github.com/sgl-project/SpecForge/blob/main/benchmarks/bench_model_speedup.py
You can give it a try.