Works with SGLang Standalone. Details on how to build these models?
SGLang implemented a new speculative decoding method "Standalone", built on Eagle, which supports this model: https://github.com/sgl-project/sglang/pull/10090
Standalone can run this speculator with Mistral Small models, reducing latency quite a lot. But the Qwen model this speculator is built on doesn't tensor-parallelize across 4-GPU or 8-GPU instances, which limits the deployment options for this speculator.
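For reference, this is roughly how I'm launching it via SGLang's offline Engine API. The `speculative_*` knobs mirror the EAGLE interface and the `"STANDALONE"` value comes from that PR, so the exact names may differ in your SGLang build; the model paths below are just placeholders.

```python
# Rough launch sketch (offline Engine API). Flag names follow SGLang's EAGLE-style
# interface; "STANDALONE" is the algorithm added in PR #10090 above.
import sglang as sgl

llm = sgl.Engine(
    model_path="mistralai/Mistral-Small-24B-Instruct-2501",   # whichever Mistral Small you serve
    speculative_algorithm="STANDALONE",                       # value from the PR above
    speculative_draft_model_path="path/to/this-speculator",   # replace with this repo's model id
    speculative_num_steps=3,                                  # illustrative values, same knobs EAGLE uses
    speculative_eagle_topk=1,
    speculative_num_draft_tokens=4,
    tp_size=2,                                                # tp=4 or tp=8 is where it breaks, see below
)

print(llm.generate("List three prime numbers.", {"temperature": 0, "max_new_tokens": 64}))
llm.shutdown()
```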
Are there more details on how this speculator was built, so the same approach can be applied to other base models?
It was built by:
- Transplanting Mistral's vocab into Qwen, producing Qwenstral (https://huggingface.co/alamios/Qwenstral-Small-3.1-0.5B); see the sketch after this list
- Then generating a small dataset of Mistral responses on multiple tasks and datasets (https://huggingface.co/datasets/alamios/Mistral-Small-24B-Instruct-2501-Conversations)
- Finally, SFT'ing Qwenstral on Mistral's outputs to align them (superficially, on common tasks)
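In code, the transplant step boils down to something like this. It's a simplified sketch rather than the exact script: the shared-token copy plus mean-of-subtokens init is one common way to do it, and the checkpoint ids are assumptions. The SFT step afterwards is ordinary fine-tuning of the transplanted model on the conversations dataset.

```python
# Simplified sketch of the vocab transplant: give the Qwen backbone Mistral's
# tokenizer by rebuilding its embedding / lm_head rows in Mistral's token order.
# Shared tokens keep their learned rows; new tokens get a mean-of-subtokens init
# (a common heuristic, not necessarily what was used for Qwenstral).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

donor_id = "Qwen/Qwen2.5-0.5B-Instruct"                   # draft backbone (assumed)
target_id = "mistralai/Mistral-Small-24B-Instruct-2501"   # tokenizer donor (assumed)

donor_tok = AutoTokenizer.from_pretrained(donor_id)
target_tok = AutoTokenizer.from_pretrained(target_id)
model = AutoModelForCausalLM.from_pretrained(donor_id, torch_dtype=torch.bfloat16)

old_emb = model.get_input_embeddings().weight.data.clone()
old_head = model.get_output_embeddings().weight.data.clone()
donor_vocab = donor_tok.get_vocab()

new_vocab = len(target_tok)
new_emb = old_emb.new_zeros(new_vocab, old_emb.shape[1])
new_head = old_head.new_zeros(new_vocab, old_head.shape[1])

for tok_str, new_id in target_tok.get_vocab().items():
    if tok_str in donor_vocab:
        old_id = donor_vocab[tok_str]                     # token exists in both vocabs
        new_emb[new_id], new_head[new_id] = old_emb[old_id], old_head[old_id]
    else:
        text = target_tok.convert_tokens_to_string([tok_str])
        ids = donor_tok(text, add_special_tokens=False).input_ids or [0]
        new_emb[new_id] = old_emb[ids].mean(dim=0)        # mean of donor sub-token rows
        new_head[new_id] = old_head[ids].mean(dim=0)

model.resize_token_embeddings(new_vocab)
model.get_input_embeddings().weight.data.copy_(new_emb)
if not model.config.tie_word_embeddings:                  # Qwen2.5-0.5B ties these
    model.get_output_embeddings().weight.data.copy_(new_head)

model.save_pretrained("qwenstral-transplant")
target_tok.save_pretrained("qwenstral-transplant")
```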
I'm not familiar with that problem. What exactly prevents Qwen from being tensor-parallelized, and which small models are suitable for it?
Thanks for the details. I found the repo and other models, and built some similar speculators.
To tensor-parallelize a speculator, its dimensions (most importantly the attention head counts) have to be divisible by the tensor-parallel degree, so for tp=4 or tp=8 the head counts have to divide evenly by 4 or 8. This model won't work with tp>2: it has num_attention_heads 14, which is divisible by 2 but not by 4.
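A quick way to check a candidate backbone before transplanting. This is just my understanding of the constraint (the KV head count matters too, since it's also split across ranks), and I'm assuming the backbone here is Qwen2.5-0.5B-Instruct, since that matches the 14-head config:

```python
# Check whether a model's attention/KV head counts split evenly across a given
# tensor-parallel degree.
from transformers import AutoConfig

def tp_compatible(model_id: str, tp: int) -> bool:
    cfg = AutoConfig.from_pretrained(model_id)
    heads = cfg.num_attention_heads
    kv_heads = getattr(cfg, "num_key_value_heads", heads)
    return heads % tp == 0 and kv_heads % tp == 0

# The 0.5B Qwen backbone has 14 attention heads (and 2 KV heads),
# so it only splits cleanly at tp=1 or tp=2.
for tp in (2, 4, 8):
    print("Qwen2.5-0.5B", tp, tp_compatible("Qwen/Qwen2.5-0.5B-Instruct", tp))
```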
The one ok-quality small model that does divide evenly is Llama-3.2-1B, which can be vocab-transplanted and does tensor-parallelize to 4. But SGLang still throws warnings at tp>2, and there are no speedups beyond tp=2 for some reason.
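Reusing `tp_compatible` from above, Llama-3.2-1B passes at tp=4 and tp=8 (its public config has 32 attention heads and 8 KV heads), even though in practice I don't see gains past tp=2:

```python
# Llama-3.2-1B: 32 attention heads, 8 KV heads, so the divisibility check
# passes at tp=4 (and tp=8), unlike the 14-head Qwen backbone above.
for tp in (2, 4, 8):
    print("Llama-3.2-1B", tp, tp_compatible("meta-llama/Llama-3.2-1B-Instruct", tp))
```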
Interesting, I hadn't realized that; I'll take it into account if I return to creating draft models.