---
license: apache-2.0
base_model:
- Qwen/Qwen2.5-0.5B-Instruct
datasets:
- agentlans/common-crawl-sample
- bigcode/the-stack-smol-xl
- rombodawg/Everything_Instruct
tags:
- draft
- speculative-decoding
---

A `0.6B` parameter draft (speculative decoding) model for use with [DeepSeek-R1-0528](https://huggingface.co/deepseek-ai/DeepSeek-R1-0528) and [DeepSeek-R1](https://huggingface.co/deepseek-ai/DeepSeek-R1).

See [DeepSeek-R1-DRAFT-0.6B-v3.0-GGUF](https://huggingface.co/jukofyork/DeepSeek-R1-DRAFT-0.6B-v3.0-GGUF) for the models in `gguf` format for use with `llama.cpp`.

---

# Extending the context above 32k

The current `config.json` is set for a context length of up to 32k tokens. Add the `"rope_scaling"` section to `config.json` to enable [YaRN](https://arxiv.org/abs/2309.00071), e.g.:

## To extend the context to 64k:

```json
"max_position_embeddings": 65536,
...
"rope_scaling": {
  "factor": 2.0,
  "original_max_position_embeddings": 32768,
  "type": "yarn"
},
```

## To extend the context to 128k:

```json
"max_position_embeddings": 131072,
...
"rope_scaling": {
  "factor": 4.0,
  "original_max_position_embeddings": 32768,
  "type": "yarn"
},
```

## To extend the context to 160k:

```json
"max_position_embeddings": 163840,
...
"rope_scaling": {
  "factor": 5.0,
  "original_max_position_embeddings": 32768,
  "type": "yarn"
},
```

**NOTE**: Because `llama.cpp` uses "static-YaRN", the scaling factor remains constant regardless of input length! Only add the `rope_scaling` configuration when processing long contexts is required...

---

# How this model was created

## 1.

The initial model was created from [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) using [transplant-vocab](https://github.com/jukofyork/transplant-vocab):

```sh
> python ./transplant_vocab.py \
    ./Qwen2.5-0.5B-Instruct \
    ./DeepSeek-R1-0528 \
    ./DeepSeek-R1-DRAFT-0.6B-UNTRAINED \
    --override "<|begin▁of▁sentence|>" "<|endoftext|>" \
    --override "<|end▁of▁sentence|>" "<|im_end|>" \
    --override "<|▁pad▁|>" "<|endoftext|>" \
    --override "<|fim▁hole|>" "<|fim_middle|>" \
    --override "<|fim▁begin|>" "<|fim_prefix|>" \
    --override "<|fim▁end|>" "<|fim_suffix|>" \
    --override "<|User|>" "<|im_start|>user\\n" \
    --override "<|Assistant|>" "<|im_start|>assistant\\n" \
    --override "<|EOT|>" "<|endoftext|>" \
    --override "<|tool▁calls▁begin|>" "" \
    --override "<|tool▁calls▁end|>" "" \
    --override "<|tool▁call▁begin|>" "" \
    --override "<|tool▁call▁end|>" "" \
    --override "<|tool▁outputs▁begin|>" "" \
    --override "<|tool▁outputs▁end|>" "" \
    --override "<|tool▁output▁begin|>" "" \
    --override "<|tool▁output▁end|>" "" \
    --override "<|tool▁sep|>" ""

Loading config from 'Qwen2.5-0.5B-Instruct'... Done.
Loading config from 'DeepSeek-R1-0528'... Done.
Loading tokenizer from 'Qwen2.5-0.5B-Instruct'... Done.
Loading tokenizer from 'DeepSeek-R1-0528'... Done.
Loading model from 'Qwen2.5-0.5B-Instruct'... Done.

Input model configuration:
- Target vocabulary size    : 129280 (used = 128815, unused = 465)
- Donor vocabulary size     : 151936
- Donor num layers          : 24 (tied embeddings = True)
- Donor hidden size         : 896
- Donor attention heads     : 14
- Donor intermediate size   : 4864 (ratio = 1:5.4)
- Donor total parameters    : 494032768 (0.49B)
-- Embedding parameters     : 136134656 (0.14B)
-- Non-embedding parameters : 357898112 (0.36B)

Processing 3 automatic token overrides:
✔ 'bos_token_id' : 0 '<|begin▁of▁sentence|>' → [151643] '<|endoftext|>'
✔ 'eos_token_id' : 1 '<|end▁of▁sentence|>' → [151645] '<|im_end|>'
✘ 'pad_token_id' : 1 is already mapped to [151645]

Processing 18 manual token overrides:
✔      0 : '<|begin▁of▁sentence|>' → [151643] '<|endoftext|>'
✔      1 : '<|end▁of▁sentence|>' → [151645] '<|im_end|>'
✔      2 : '<|▁pad▁|>' → [151643] '<|endoftext|>'
✔ 128800 : '<|fim▁hole|>' → [151660] '<|fim_middle|>'
✔ 128801 : '<|fim▁begin|>' → [151659] '<|fim_prefix|>'
✔ 128802 : '<|fim▁end|>' → [151661] '<|fim_suffix|>'
✔ 128803 : '<|User|>' → [151644, 872, 198] '<|im_start|>user\n'
✔ 128804 : '<|Assistant|>' → [151644, 77091, 198] '<|im_start|>assistant\n'
✔ 128805 : '<|EOT|>' → [151643] '<|endoftext|>'
✔ 128806 : '<|tool▁calls▁begin|>' → [151657] ''
✔ 128807 : '<|tool▁calls▁end|>' → [151658] ''
✔ 128808 : '<|tool▁call▁begin|>' → [151657] ''
✔ 128809 : '<|tool▁call▁end|>' → [151658] ''
✔ 128810 : '<|tool▁outputs▁begin|>' → [27, 14172, 9655, 29] ''
✔ 128811 : '<|tool▁outputs▁end|>' → [522, 14172, 9655, 29] ''
✔ 128812 : '<|tool▁output▁begin|>' → [27, 14172, 9655, 29] ''
✔ 128813 : '<|tool▁output▁end|>' → [522, 14172, 9655, 29] ''
✔ 128814 : '<|tool▁sep|>' → [151658] ''

NOTE: Using an "untied" copy of 'embed_tokens.weight' as new 'lm_head.weight' tensor...

Transplanting tokens: 100%|████████████████████████████████████████████████████████████| 128815/128815 [00:46<00:00, 2756.27token/s]

Transplant mappings:
- 1 to 1  : 83683 (65%)
- 2 to 1  : 38380 (30%)
- 3 to 1  : 4585 (3.6%)
- 4 to 1  : 927 (0.72%)
- 5 to 1  : 273 (0.21%)
- 6 to 1  : 91 (0.071%)
- 7 to 1  : 35 (0.027%)
- 8 to 1  : 22 (0.017%)
- 9 to 1  : 8 (0.0062%)
- 10 to 1 : 4 (0.0031%)
- 11 to 1 : 4 (0.0031%)
- 13 to 1 : 1 (0.00078%)
- 14 to 1 : 10 (0.0078%)
- 15 to 1 : 91 (0.071%)
- 16 to 1 : 699 (0.54%)
- 19 to 1 : 1 (0.00078%)
- 21 to 1 : 1 (0.00078%)

Head initialized with:
- Copies : 83683 (65%)
- Means  : 45132 (35%)
- Zeros  : 465 (0.36%)

Output model configuration:
- Output vocabulary size     : 129280
- Output num layers          : 24 (tied embeddings = False)
- Output hidden size         : 896
- Output attention heads     : 14
- Output intermediate size   : 4864 (ratio = 1:5.4)
- Output total parameters    : 589567872 (0.59B)
-- Embedding parameters      : 231669760 (0.23B)
-- Non-embedding parameters  : 357898112 (0.36B)

Saving model and tokenizer to 'DeepSeek-R1-0528-DRAFT-0.6B-UNTRAINED' folder

[2025-08-07 15:35:14,040] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
Patching 'torch_dtype' in 'DeepSeek-R1-0528-DRAFT-0.6B-UNTRAINED/config.json' based on actual saved tensors
- Updated 'torch_dtype' to 'bfloat16' based on actual tensor dtype

Operation completed successfully (ignore any 'segmentation fault' that follows!!!)
```
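The `Copies` / `Means` / `Zeros` counts in the log above reflect how each new embedding row was initialised: target tokens that map to a single donor token have the donor embedding copied, multi-token mappings take the mean of the donor embeddings, and the 465 unused slots are zeroed (the new, untied `lm_head.weight` starts from the same initialisation). Below is a minimal sketch of that idea using `transformers` and `torch`; it is **not** the actual `transplant_vocab.py` implementation, which additionally handles the manual overrides, special tokens and tied-embedding details:

```python
# Simplified sketch of the transplant initialisation idea (illustration only).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

donor = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
donor_tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
target_tok = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-0528", trust_remote_code=True)

donor_embed = donor.get_input_embeddings().weight.detach()  # [151936, 896]
new_embed = torch.zeros(129280, donor_embed.shape[1])       # unused slots stay zero-initialised

for token_id in range(len(target_tok)):
    # Re-tokenise each target token's text with the donor tokenizer.
    donor_ids = donor_tok.encode(target_tok.decode([token_id]), add_special_tokens=False)
    if len(donor_ids) == 1:
        new_embed[token_id] = donor_embed[donor_ids[0]]            # 1-to-1 mapping: copy
    elif len(donor_ids) > 1:
        new_embed[token_id] = donor_embed[donor_ids].mean(dim=0)   # n-to-1 mapping: mean
```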
## 2.

The following datasets were used to create a fine-tuning dataset of ~2.3B tokens:

- [agentlans/common-crawl-sample](https://huggingface.co/datasets/agentlans/common-crawl-sample)
- [bigcode/the-stack-smol-xl](https://huggingface.co/datasets/bigcode/the-stack-smol-xl)
- [rombodawg/Everything_Instruct](https://huggingface.co/datasets/rombodawg/Everything_Instruct) (NOTE: `output` field only)

Each sample was formatted simply between `<|end▁of▁sentence|>` tags.

## 3.

The model was then trained using [qlora-pipe-lite](https://github.com/jukofyork/qlora-pipe-lite) for 1 epoch with a batch size of 60 and a sequence length of 32k (~2M tokens per step):

```toml
# ==============================
# MODEL AND OUTPUT CONFIGURATION
# ==============================

model_dir = 'models/DeepSeek-R1-DRAFT-0.6B-UNTRAINED'
output_dir = 'finetuned'

# ===========================
# TRAINING TYPE CONFIGURATION
# ===========================

full_fine_tune = true

# =======================
# OPTIMIZER CONFIGURATION
# =======================

lr = 5e-5

# ======================
# TRAINING CONFIGURATION
# ======================

sequence_len = 32768
gradient_accumulation_steps = 10  # 10×6 = batch size 60, 10×6×32768 = ~2M tokens per step

# =====================
# DATASET CONFIGURATION
# =====================

[[datasets]]
dataset_path = 'datasets/common-crawl-sample/*.json'
drop_tails = true

[[datasets]]
dataset_path = 'datasets/the-stack-smol-xl/*.jsonl'
drop_tails = true

[[datasets]]
dataset_path = 'datasets/rombodawg-Everything-Instruct/*.json'
drop_tails = true
```

I used six `RTX A6000` GPUs over three nodes, hence the batch size of 60 (`6 GPUs × 10 gradient accumulation steps = 60`):

![image/png](https://cdn-uploads.huggingface.co/production/uploads/65995c45539c808e84c38bf1/WkobeaSpsXz2MT9irxTZ2.png)
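
---

The intended deployment path is the `gguf` files with `llama.cpp`'s speculative decoding, as linked at the top of this card. Purely as an illustration of what the vocabulary transplant enables, the same pairing can also be exercised through `transformers`' assisted generation, where the draft model proposes tokens and the target model verifies them in parallel. This is a hedged sketch only: loading the full `DeepSeek-R1-0528` this way needs several hundred GB of memory, and the draft path below is a placeholder for this repo's files.

```python
# Hedged illustration only -- not the documented deployment path for this model.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("deepseek-ai/DeepSeek-R1-0528", trust_remote_code=True)

target = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-0528",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
draft = AutoModelForCausalLM.from_pretrained(
    "./DeepSeek-R1-DRAFT-0.6B",  # placeholder: local path to this repo's files
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

prompt = "<|User|>Explain speculative decoding in one paragraph.<|Assistant|>"
inputs = tokenizer(prompt, return_tensors="pt").to(target.device)

# The draft model drafts candidate tokens; the target model accepts or rejects them.
outputs = target.generate(**inputs, assistant_model=draft, max_new_tokens=256)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```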