|
|
--- |
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-0.5B-Instruct |
|
|
datasets: |
|
|
- agentlans/common-crawl-sample |
|
|
- bigcode/the-stack-smol-xl |
|
|
- rombodawg/Everything_Instruct |
|
|
tags: |
|
|
- draft |
|
|
- speculative-decoding |
|
|
--- |
|
|
|
|
|
A `0.6B` parameter draft (speculative decoding) model for use with [DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324) and [DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3). |
|
|
|
|
|
See [DeepSeek-V3-DRAFT-0.6B-v3.0-GGUF](https://huggingface.co/jukofyork/DeepSeek-V3-DRAFT-0.6B-v3.0-GGUF) for the models in `gguf` format for use with `llama.cpp`. |
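
When running with `llama.cpp`, the draft model is passed alongside the main model via `-md`/`--model-draft`. The sketch below is illustrative only: the GGUF file names are placeholders, and the speculative-decoding options available can differ between `llama.cpp` versions, so check `llama-server --help` for your build:

```sh
# Illustrative sketch only:
# - the GGUF file names are placeholders
# - -m selects the main (target) model, -md the draft model used for speculative decoding
# - check `llama-server --help` for the draft-related options in your build
llama-server \
    -m ./DeepSeek-V3-0324-Q4_K_M.gguf \
    -md ./DeepSeek-V3-DRAFT-0.6B-v3.0-Q4_0.gguf \
    -c 32768
```

Newer `llama.cpp` builds also expose options controlling how many tokens the draft model proposes per step; consult your build's help output for the exact names.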
|
|
|
|
|
--- |
|
|
|
|
|
# Extending the context above 32k |
|
|
|
|
|
The current `config.json` is set for a context length of up to 32k tokens. To enable [YaRN](https://arxiv.org/abs/2309.00071) for longer contexts, add a `"rope_scaling"` section to `config.json`, e.g.:
|
|
|
|
|
## To extend the context to 64k: |
|
|
|
|
|
```json |
|
|
"max_position_embeddings": 65536, |
|
|
... |
|
|
"rope_scaling": { |
|
|
"factor": 2.0, |
|
|
"original_max_position_embeddings": 32768, |
|
|
"type": "yarn" |
|
|
}, |
|
|
``` |
|
|
|
|
|
## To extend the context to 128k: |
|
|
|
|
|
```json |
|
|
"max_position_embeddings": 131072, |
|
|
... |
|
|
"rope_scaling": { |
|
|
"factor": 4.0, |
|
|
"original_max_position_embeddings": 32768, |
|
|
"type": "yarn" |
|
|
}, |
|
|
``` |
|
|
|
|
|
## To extend the context to 160k: |
|
|
|
|
|
```json |
|
|
"max_position_embeddings": 163840, |
|
|
... |
|
|
"rope_scaling": { |
|
|
"factor": 5.0, |
|
|
"original_max_position_embeddings": 32768, |
|
|
"type": "yarn" |
|
|
}, |
|
|
``` |
|
|
|
|
|
**NOTE**: Because `llama.cpp` uses "static-YaRN", the scaling factor remains constant regardless of input length! Only add the `rope_scaling` configuration when processing long contexts is actually required.
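
If you would rather script the change than edit `config.json` by hand, the following is a minimal sketch using only the Python standard library (the model path is a placeholder, and the values are just the 128k example from above):

```sh
# Minimal sketch: applies the 128k YaRN settings from the example above.
# The model path is a placeholder -- point it at your local copy of the model.
python3 - <<'EOF'
import json

path = "./DeepSeek-V3-DRAFT-0.6B/config.json"  # placeholder path

with open(path) as f:
    config = json.load(f)

config["max_position_embeddings"] = 131072
config["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

with open(path, "w") as f:
    json.dump(config, f, indent=2)
EOF
```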
|
|
|
|
|
--- |
|
|
|
|
|
# How this model was created |
|
|
|
|
|
## 1. The initial model was created from [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) using [transplant-vocab](https://github.com/jukofyork/transplant-vocab): |
|
|
|
|
|
```sh |
|
|
> python ./transplant_vocab.py \ |
|
|
./Qwen2.5-0.5B-Instruct \ |
|
|
./DeepSeek-V3-0324 \ |
|
|
./DeepSeek-V3-DRAFT-0.6B-UNTRAINED \ |
|
|
--override "<|begin▁of▁sentence|>" "<|endoftext|>" \ |
|
|
--override "<|end▁of▁sentence|>" "<|im_end|>" \ |
|
|
--override "<|▁pad▁|>" "<|endoftext|>" \ |
|
|
--override "<|fim▁hole|>" "<|fim_middle|>" \ |
|
|
--override "<|fim▁begin|>" "<|fim_prefix|>" \ |
|
|
--override "<|fim▁end|>" "<|fim_suffix|>" \ |
|
|
--override "<|User|>" "<|im_start|>user\\n" \ |
|
|
--override "<|Assistant|>" "<|im_start|>assistant\\n" \ |
|
|
--override "<|EOT|>" "<|endoftext|>" \ |
|
|
--override "<|tool▁calls▁begin|>" "<tool_call>" \ |
|
|
--override "<|tool▁calls▁end|>" "</tool_call>" \ |
|
|
--override "<|tool▁call▁begin|>" "<tool_call>" \ |
|
|
--override "<|tool▁call▁end|>" "</tool_call>" \ |
|
|
--override "<|tool▁outputs▁begin|>" "<tool_response>" \ |
|
|
--override "<|tool▁outputs▁end|>" "</tool_response>" \ |
|
|
--override "<|tool▁output▁begin|>" "<tool_response>" \ |
|
|
--override "<|tool▁output▁end|>" "</tool_response>" \ |
|
|
--override "<|tool▁sep|>" "</tool_call>" |
|
|
|
|
|
Loading config from 'Qwen2.5-0.5B-Instruct'... Done. |
|
|
Loading config from 'DeepSeek-V3-0324'... Done. |
|
|
Loading tokenizer from 'Qwen2.5-0.5B-Instruct'... Done. |
|
|
Loading tokenizer from 'DeepSeek-V3-0324'... Done. |
|
|
Loading model from 'Qwen2.5-0.5B-Instruct'... Done. |
|
|
|
|
|
Input model configuration: |
|
|
- Target vocabulary size : 129280 (used = 128815, unused = 465) |
|
|
- Donor vocabulary size : 151936 |
|
|
- Donor num layers : 24 (tied embeddings = True) |
|
|
- Donor hidden size : 896 |
|
|
- Donor attention heads : 14 |
|
|
- Donor intermediate size : 4864 (ratio = 1:5.4) |
|
|
- Donor total parameters : 494032768 (0.49B) |
|
|
-- Embedding parameters : 136134656 (0.14B) |
|
|
-- Non-embedding parameters : 357898112 (0.36B) |
|
|
|
|
|
Processing 3 automatic token overrides: |
|
|
✔ 'bos_token_id' : 0 '<|begin▁of▁sentence|>' → [151643] '<|endoftext|>' |
|
|
✔ 'eos_token_id' : 1 '<|end▁of▁sentence|>' → [151645] '<|im_end|>' |
|
|
✘ 'pad_token_id' : 1 is already mapped to [151645] |
|
|
|
|
|
Processing 18 manual token overrides: |
|
|
✔ 0 : '<|begin▁of▁sentence|>' → [151643] '<|endoftext|>' |
|
|
✔ 1 : '<|end▁of▁sentence|>' → [151645] '<|im_end|>' |
|
|
✔ 2 : '<|▁pad▁|>' → [151643] '<|endoftext|>' |
|
|
✔ 128800 : '<|fim▁hole|>' → [151660] '<|fim_middle|>' |
|
|
✔ 128801 : '<|fim▁begin|>' → [151659] '<|fim_prefix|>' |
|
|
✔ 128802 : '<|fim▁end|>' → [151661] '<|fim_suffix|>' |
|
|
✔ 128803 : '<|User|>' → [151644, 872, 198] '<|im_start|>user\n' |
|
|
✔ 128804 : '<|Assistant|>' → [151644, 77091, 198] '<|im_start|>assistant\n' |
|
|
✔ 128805 : '<|EOT|>' → [151643] '<|endoftext|>' |
|
|
✔ 128806 : '<|tool▁calls▁begin|>' → [151657] '<tool_call>' |
|
|
✔ 128807 : '<|tool▁calls▁end|>' → [151658] '</tool_call>' |
|
|
✔ 128808 : '<|tool▁call▁begin|>' → [151657] '<tool_call>' |
|
|
✔ 128809 : '<|tool▁call▁end|>' → [151658] '</tool_call>' |
|
|
✔ 128810 : '<|tool▁outputs▁begin|>' → [27, 14172, 9655, 29] '<tool_response>' |
|
|
✔ 128811 : '<|tool▁outputs▁end|>' → [522, 14172, 9655, 29] '</tool_response>' |
|
|
✔ 128812 : '<|tool▁output▁begin|>' → [27, 14172, 9655, 29] '<tool_response>' |
|
|
✔ 128813 : '<|tool▁output▁end|>' → [522, 14172, 9655, 29] '</tool_response>' |
|
|
✔ 128814 : '<|tool▁sep|>' → [151658] '</tool_call>' |
|
|
|
|
|
NOTE: Using an "untied" copy of 'embed_tokens.weight' as new 'lm_head.weight' tensor... |
|
|
|
|
|
Transplanting tokens: 100%|████████████████████████████████████████████████████████████| 128815/128815 [00:53<00:00, 2423.79token/s] |
|
|
|
|
|
Transplant mappings: |
|
|
- 1 to 1 : 83683 (65%) |
|
|
- 2 to 1 : 38380 (30%) |
|
|
- 3 to 1 : 4583 (3.6%) |
|
|
- 4 to 1 : 927 (0.72%) |
|
|
- 5 to 1 : 273 (0.21%) |
|
|
- 6 to 1 : 91 (0.071%) |
|
|
- 7 to 1 : 35 (0.027%) |
|
|
- 8 to 1 : 22 (0.017%) |
|
|
- 9 to 1 : 8 (0.0062%) |
|
|
- 10 to 1 : 4 (0.0031%) |
|
|
- 11 to 1 : 4 (0.0031%) |
|
|
- 13 to 1 : 1 (0.00078%) |
|
|
- 14 to 1 : 10 (0.0078%) |
|
|
- 15 to 1 : 91 (0.071%) |
|
|
- 16 to 1 : 701 (0.54%) |
|
|
- 19 to 1 : 1 (0.00078%) |
|
|
- 21 to 1 : 1 (0.00078%) |
|
|
|
|
|
Head initialized with: |
|
|
- Copies : 83683 (65%) |
|
|
- Means : 45132 (35%) |
|
|
- Zeros : 465 (0.36%) |
|
|
|
|
|
Output model configuration: |
|
|
- Output vocabulary size : 129280 |
|
|
- Output num layers : 24 (tied embeddings = False) |
|
|
- Output hidden size : 896 |
|
|
- Output attention heads : 14 |
|
|
- Output intermediate size : 4864 (ratio = 1:5.4) |
|
|
- Output total parameters : 589567872 (0.59B) |
|
|
-- Embedding parameters : 231669760 (0.23B) |
|
|
-- Non-embedding parameters : 357898112 (0.36B) |
|
|
|
|
|
Saving model and tokenizer to 'DeepSeek-V3-DRAFT-0.6B-UNTRAINED' folder |
|
|
[2025-08-07 15:36:33,693] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
|
|
|
|
Patching 'torch_dtype' in 'DeepSeek-V3-DRAFT-0.6B-UNTRAINED/config.json' based on actual saved tensors |
|
|
- Updated 'torch_dtype' to 'bfloat16' based on actual tensor dtype |
|
|
|
|
|
Operation completed successfully (ignore any 'segmentation fault' that follows!!!) |
|
|
``` |
|
|
|
|
|
## 2. The following datasets were used to create a fine-tuning dataset of ~2.3B tokens: |
|
|
|
|
|
- [agentlans/common-crawl-sample](https://huggingface.co/datasets/agentlans/common-crawl-sample) |
|
|
- [bigcode/the-stack-smol-xl](https://huggingface.co/datasets/bigcode/the-stack-smol-xl) |
|
|
- [rombodawg/Everything_Instruct](https://huggingface.co/datasets/rombodawg/Everything_Instruct) (NOTE: `output` field only) |
|
|
|
|
|
Each sample was formatted simply as plain text between `<|end▁of▁sentence|>` tags.
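
The exact preprocessing script isn't included here, but the sketch below illustrates the idea. It assumes the instruct data is stored as JSONL with an `output` field and that the training pipeline reads a plain `text` field per document; both of those details are assumptions rather than facts from this card:

```sh
# Hedged sketch (assumes JSONL input with an `output` field and a trainer that reads a `text` field;
# file names are placeholders). Keep only the `output` field of the instruct data as plain-text
# documents; during training each document is then simply delimited by <|end▁of▁sentence|> tags.
jq -c '{text: .output}' everything_instruct.jsonl > datasets/rombodawg-Everything-Instruct/output_only.json
```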
|
|
|
|
|
## 3. The model was then trained using [qlora-pipe-lite](https://github.com/jukofyork/qlora-pipe-lite) for 1 epoch with a batch size of 60 and a sequence length of 32k (~2M tokens per step): |
|
|
|
|
|
```toml |
|
|
# ============================== |
|
|
# MODEL AND OUTPUT CONFIGURATION |
|
|
# ============================== |
|
|
|
|
|
model_dir = 'models/DeepSeek-V3-DRAFT-0.6B-UNTRAINED' |
|
|
output_dir = 'finetuned' |
|
|
|
|
|
# =========================== |
|
|
# TRAINING TYPE CONFIGURATION |
|
|
# =========================== |
|
|
|
|
|
full_fine_tune = true |
|
|
|
|
|
# ======================= |
|
|
# OPTIMIZER CONFIGURATION |
|
|
# ======================= |
|
|
|
|
|
lr = 5e-5 |
|
|
|
|
|
# ====================== |
|
|
# TRAINING CONFIGURATION |
|
|
# ====================== |
|
|
|
|
|
sequence_len = 32768 |
|
|
|
|
|
gradient_accumulation_steps = 10 # 10×6 = batch size 60, 10×6×32768 = ~2M tokens per step |
|
|
|
|
|
# ===================== |
|
|
# DATASET CONFIGURATION |
|
|
# ===================== |
|
|
|
|
|
[[datasets]] |
|
|
dataset_path = 'datasets/common-crawl-sample/*.json' |
|
|
drop_tails = true |
|
|
|
|
|
[[datasets]] |
|
|
dataset_path = 'datasets/the-stack-smol-xl/*.jsonl' |
|
|
drop_tails = true |
|
|
|
|
|
[[datasets]] |
|
|
dataset_path = 'datasets/rombodawg-Everything-Instruct/*.json' |
|
|
drop_tails = true |
|
|
|
|
|
``` |
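
At ~2M tokens per optimizer step, the ~2.3B-token dataset works out to roughly 1,150-1,200 steps for the single epoch.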
|
|
|
|
|
I used six `RTX A6000` GPUs across three nodes, hence the batch size of `60` (`6 GPUs × 10 gradient accumulation steps = 60`):
|
|
|
|
|
 |