Works with SGLang Standalone. Details on how to build these models?

#3 opened by anttip

SGLang has implemented a new speculative decoding method, "Standalone", built on EAGLE, which supports this model: https://github.com/sgl-project/sglang/pull/10090

Standalone can run this speculator with Mistral-Small models, reducing latencies quite a lot. But the Qwen model this speculator was built on doesn't tensor-parallelize to 4-GPU or 8-GPU instances, which limits the options for deploying this speculator.
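For reference, the launch looks roughly like this (a minimal sketch using SGLang's offline Engine API; the speculative_* parameter names mirror the existing EAGLE options, the STANDALONE value is the one added by the PR above, and the target/draft model paths are placeholders, so adjust to the merged interface):

```python
# Sketch: serving a Mistral-Small target model with a standalone draft model in SGLang.
# Parameter names follow SGLang's EAGLE speculative-decoding options; STANDALONE is the
# algorithm value introduced by the linked PR. Paths and tuning values are examples only.
import sglang as sgl

engine = sgl.Engine(
    model_path="mistralai/Mistral-Small-24B-Instruct-2501",   # example target model
    speculative_algorithm="STANDALONE",                        # draft model runs as its own decoder
    speculative_draft_model_path="<path-to-this-speculator>",  # placeholder for the draft model
    speculative_num_steps=3,                                   # draft steps per verification round
    speculative_eagle_topk=1,                                  # candidates expanded per step
    speculative_num_draft_tokens=4,                            # draft tokens verified at once
    tp_size=2,                                                 # tp > 2 is where the problem below appears
)

out = engine.generate("The capital of France is", {"max_new_tokens": 16})
print(out["text"])
```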

Are there more details on how this speculator was built, so the same can be done for other base models?

It was built by:

I'm not familiar with that problem. What exactly prevents the Qwen model from being tensor-parallelized, and which small models are suitable for that?

Thanks for the details. I found the repo and other models, and built some similar speculators.

To tensor-parallelize a speculator, the model's dimensions have to be divisible by the tensor-parallel degree, so for tp=4 or tp=8 the relevant values have to divide by 4 or 8. This model won't work with tp>2: it has num_attention_heads = 14, which is divisible by 2 but not by 4.

The one ok-quality model that does work is Llama-3.2-1B, which can be transplanted and does tensor-parallelize to 4. But SGLang still throws warnings with tp>2, and there are no speedups beyond tp=2 for some reason.
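A quick way to check this up front is to read a candidate draft model's config and test divisibility (a sketch using transformers; the attribute names are the standard ones for Llama/Qwen-style configs, other architectures may differ):

```python
# Sketch: check whether a candidate draft model's dimensions are divisible by the
# intended tensor-parallel degrees. Attribute names are the usual ones for
# Llama/Qwen-style configs; adjust for other architectures.
from transformers import AutoConfig

def tp_compatible(model_id: str, tp_sizes=(2, 4, 8)) -> None:
    cfg = AutoConfig.from_pretrained(model_id)
    heads = cfg.num_attention_heads
    kv_heads = getattr(cfg, "num_key_value_heads", heads)
    hidden = cfg.hidden_size
    for tp in tp_sizes:
        ok = heads % tp == 0 and kv_heads % tp == 0 and hidden % tp == 0
        print(f"{model_id} tp={tp}: heads={heads}, kv_heads={kv_heads}, "
              f"hidden={hidden} -> {'ok' if ok else 'not divisible'}")

# A 14-head draft model passes tp=2 but fails tp=4 and tp=8, while
# meta-llama/Llama-3.2-1B (32 heads, 8 KV heads, hidden size 2048) passes all three.
tp_compatible("meta-llama/Llama-3.2-1B")
```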

Interesting, I hadn't realized that. I'll take it into account if I return to creating draft models.
