Works with SGLang Standalone. Details on how to build these models?
SGLang implemented a new speculative decoding method "Standalone", built on Eagle, which supports this model: https://github.com/sgl-project/sglang/pull/10090
Standalone can run this speculator with Mistral Small models, reducing latency quite a lot. But the Qwen model this speculator is built on doesn't tensor-parallelize across 4-GPU or 8-GPU instances, which limits the deployment options for this speculator.
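For reference, this is roughly how I'm launching it via SGLang's offline Engine API. The `speculative_*` knobs mirror the EAGLE interface and the `"STANDALONE"` value comes from that PR, so the exact names may differ in your SGLang build; the model paths below are just placeholders.

```python
# Rough launch sketch (offline Engine API). Flag names follow SGLang's EAGLE-style
# interface; "STANDALONE" is the algorithm added in PR #10090 above.
import sglang as sgl

llm = sgl.Engine(
    model_path="mistralai/Mistral-Small-24B-Instruct-2501",   # whichever Mistral Small you serve
    speculative_algorithm="STANDALONE",                       # value from the PR above
    speculative_draft_model_path="path/to/this-speculator",   # replace with this repo's model id
    speculative_num_steps=3,                                  # illustrative values, same knobs EAGLE uses
    speculative_eagle_topk=1,
    speculative_num_draft_tokens=4,
    tp_size=2,                                                # tp=4 or tp=8 is where it breaks, see below
)

print(llm.generate("List three prime numbers.", {"temperature": 0, "max_new_tokens": 64}))
llm.shutdown()
```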
Are there more details on how this speculator was built, so the same approach can be applied to other base models?
It was built by:
- Transplanting Mistral's vocab into Qwen, producing Qwenstral (https://huggingface.co/alamios/Qwenstral-Small-3.1-0.5B); see the sketch after this list
- Then generating a small dataset of Mistral responses on multiple tasks and datasets (https://huggingface.co/datasets/alamios/Mistral-Small-24B-Instruct-2501-Conversations)
- Finally, SFT'ing Qwenstral on Mistral's outputs to align them (superficially, on common tasks)
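In code, the transplant step boils down to something like this. It's a simplified sketch rather than the exact script: the shared-token copy plus mean-of-subtokens init is one common way to do it, and the checkpoint ids are assumptions. The SFT step afterwards is ordinary fine-tuning of the transplanted model on the conversations dataset.

```python
# Simplified sketch of the vocab transplant: give the Qwen backbone Mistral's
# tokenizer by rebuilding its embedding / lm_head rows in Mistral's token order.
# Shared tokens keep their learned rows; new tokens get a mean-of-subtokens init
# (a common heuristic, not necessarily what was used for Qwenstral).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

donor_id = "Qwen/Qwen2.5-0.5B-Instruct"                   # draft backbone (assumed)
target_id = "mistralai/Mistral-Small-24B-Instruct-2501"   # tokenizer donor (assumed)

donor_tok = AutoTokenizer.from_pretrained(donor_id)
target_tok = AutoTokenizer.from_pretrained(target_id)
model = AutoModelForCausalLM.from_pretrained(donor_id, torch_dtype=torch.bfloat16)

old_emb = model.get_input_embeddings().weight.data.clone()
old_head = model.get_output_embeddings().weight.data.clone()
donor_vocab = donor_tok.get_vocab()

new_vocab = len(target_tok)
new_emb = old_emb.new_zeros(new_vocab, old_emb.shape[1])
new_head = old_head.new_zeros(new_vocab, old_head.shape[1])

for tok_str, new_id in target_tok.get_vocab().items():
    if tok_str in donor_vocab:
        old_id = donor_vocab[tok_str]                     # token exists in both vocabs
        new_emb[new_id], new_head[new_id] = old_emb[old_id], old_head[old_id]
    else:
        text = target_tok.convert_tokens_to_string([tok_str])
        ids = donor_tok(text, add_special_tokens=False).input_ids or [0]
        new_emb[new_id] = old_emb[ids].mean(dim=0)        # mean of donor sub-token rows
        new_head[new_id] = old_head[ids].mean(dim=0)

model.resize_token_embeddings(new_vocab)
model.get_input_embeddings().weight.data.copy_(new_emb)
if not model.config.tie_word_embeddings:                  # Qwen2.5-0.5B ties these
    model.get_output_embeddings().weight.data.copy_(new_head)

model.save_pretrained("qwenstral-transplant")
target_tok.save_pretrained("qwenstral-transplant")
```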
I'm not familiar with that problem. What exactly prevents Qwen from being tensor-parallelized, and which small models are suitable for it?
Thanks for the details. I found the repo and other models, and built some similar speculators.
To tensor-parallelize a speculator, its dimensions (most importantly the attention head counts) have to be divisible by the tensor-parallel degree, so for tp=4 or tp=8 the head counts have to divide evenly by 4 or 8. This model won't work with tp>2: it has num_attention_heads 14, which is divisible by 2 but not by 4.
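A quick way to check a candidate backbone before transplanting. This is just my understanding of the constraint (the KV head count matters too, since it's also split across ranks), and I'm assuming the backbone here is Qwen2.5-0.5B-Instruct, since that matches the 14-head config:

```python
# Check whether a model's attention/KV head counts split evenly across a given
# tensor-parallel degree.
from transformers import AutoConfig

def tp_compatible(model_id: str, tp: int) -> bool:
    cfg = AutoConfig.from_pretrained(model_id)
    heads = cfg.num_attention_heads
    kv_heads = getattr(cfg, "num_key_value_heads", heads)
    return heads % tp == 0 and kv_heads % tp == 0

# The 0.5B Qwen backbone has 14 attention heads (and 2 KV heads),
# so it only splits cleanly at tp=1 or tp=2.
for tp in (2, 4, 8):
    print("Qwen2.5-0.5B", tp, tp_compatible("Qwen/Qwen2.5-0.5B-Instruct", tp))
```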
The one ok-quality small model that does divide evenly is Llama-3.2-1B, which can be vocab-transplanted and does tensor-parallelize to 4. But SGLang still throws warnings at tp>2, and there are no speedups beyond tp=2 for some reason.
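Reusing `tp_compatible` from above, Llama-3.2-1B passes at tp=4 and tp=8 (its public config has 32 attention heads and 8 KV heads), even though in practice I don't see gains past tp=2:

```python
# Llama-3.2-1B: 32 attention heads, 8 KV heads, so the divisibility check
# passes at tp=4 (and tp=8), unlike the 14-head Qwen backbone above.
for tp in (2, 4, 8):
    print("Llama-3.2-1B", tp, tp_compatible("meta-llama/Llama-3.2-1B-Instruct", tp))
```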
Interesting, I hadn't realized that; I'll take it into account if I return to creating draft models.