|
|
--- |
|
|
license: apache-2.0 |
|
|
base_model: |
|
|
- Qwen/Qwen2.5-0.5B-Instruct |
|
|
datasets: |
|
|
- agentlans/common-crawl-sample |
|
|
- bigcode/the-stack-smol-xl |
|
|
- rombodawg/Everything_Instruct |
|
|
tags: |
|
|
- draft |
|
|
- speculative-decoding |
|
|
--- |
|
|
|
|
|
A `0.6B` parameter draft (speculative decoding) model for use with [DeepSeek-V3-0324](https://huggingface.co/deepseek-ai/DeepSeek-V3-0324) and [DeepSeek-V3](https://huggingface.co/deepseek-ai/DeepSeek-V3). |
|
|
|
|
|
See [DeepSeek-V3-DRAFT-0.6B-v3.0-GGUF](https://huggingface.co/jukofyork/DeepSeek-V3-DRAFT-0.6B-v3.0-GGUF) for the models in `gguf` format for use with `llama.cpp`. |
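
When running with `llama.cpp`, the draft model is passed alongside the main model via `-md`/`--model-draft`. The sketch below is illustrative only: the GGUF file names are placeholders, and the speculative-decoding options available can differ between `llama.cpp` versions, so check `llama-server --help` for your build:

```sh
# Illustrative sketch only:
# - the GGUF file names are placeholders
# - -m selects the main (target) model, -md the draft model used for speculative decoding
# - check `llama-server --help` for the draft-related options in your build
llama-server \
    -m ./DeepSeek-V3-0324-Q4_K_M.gguf \
    -md ./DeepSeek-V3-DRAFT-0.6B-v3.0-Q4_0.gguf \
    -c 32768
```

Newer `llama.cpp` builds also expose options controlling how many tokens the draft model proposes per step; consult your build's help output for the exact names.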
|
|
|
|
|
--- |
|
|
|
|
|
# Extending the context above 32k |
|
|
|
|
|
The current `config.json` is set for a context length of up to 32k tokens. To enable [YaRN](https://arxiv.org/abs/2309.00071) for longer contexts, add a `"rope_scaling"` section to `config.json`, e.g.:
|
|
|
|
|
## To extend the context to 64k: |
|
|
|
|
|
```json |
|
|
"max_position_embeddings": 65536, |
|
|
... |
|
|
"rope_scaling": { |
|
|
"factor": 2.0, |
|
|
"original_max_position_embeddings": 32768, |
|
|
"type": "yarn" |
|
|
}, |
|
|
``` |
|
|
|
|
|
## To extend the context to 128k: |
|
|
|
|
|
```json |
|
|
"max_position_embeddings": 131072, |
|
|
... |
|
|
"rope_scaling": { |
|
|
"factor": 4.0, |
|
|
"original_max_position_embeddings": 32768, |
|
|
"type": "yarn" |
|
|
}, |
|
|
``` |
|
|
|
|
|
## To extend the context to 160k: |
|
|
|
|
|
```json |
|
|
"max_position_embeddings": 163840, |
|
|
... |
|
|
"rope_scaling": { |
|
|
"factor": 5.0, |
|
|
"original_max_position_embeddings": 32768, |
|
|
"type": "yarn" |
|
|
}, |
|
|
``` |
|
|
|
|
|
**NOTE**: Because `llama.cpp` uses "static-YaRN", the scaling factor remains constant regardless of input length! Only add the `rope_scaling` configuration when processing long contexts is actually required.
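
If you would rather script the change than edit `config.json` by hand, the following is a minimal sketch using only the Python standard library (the model path is a placeholder, and the values are just the 128k example from above):

```sh
# Minimal sketch: applies the 128k YaRN settings from the example above.
# The model path is a placeholder -- point it at your local copy of the model.
python3 - <<'EOF'
import json

path = "./DeepSeek-V3-DRAFT-0.6B/config.json"  # placeholder path

with open(path) as f:
    config = json.load(f)

config["max_position_embeddings"] = 131072
config["rope_scaling"] = {
    "factor": 4.0,
    "original_max_position_embeddings": 32768,
    "type": "yarn",
}

with open(path, "w") as f:
    json.dump(config, f, indent=2)
EOF
```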
|
|
|
|
|
--- |
|
|
|
|
|
# How this model was created |
|
|
|
|
|
## 1. The initial model was created from [Qwen2.5-0.5B-Instruct](https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct) using [transplant-vocab](https://github.com/jukofyork/transplant-vocab): |
|
|
|
|
|
```sh |
|
|
> python ./transplant_vocab.py \ |
|
|
./Qwen2.5-0.5B-Instruct \ |
|
|
./DeepSeek-V3-0324 \ |
|
|
./DeepSeek-V3-DRAFT-0.6B-UNTRAINED \ |
|
|
--override "<|begin▁of▁sentence|>" "<|endoftext|>" \ |
|
|
--override "<|end▁of▁sentence|>" "<|im_end|>" \ |
|
|
--override "<|▁pad▁|>" "<|endoftext|>" \ |
|
|
--override "<|fim▁hole|>" "<|fim_middle|>" \ |
|
|
--override "<|fim▁begin|>" "<|fim_prefix|>" \ |
|
|
--override "<|fim▁end|>" "<|fim_suffix|>" \ |
|
|
--override "<|User|>" "<|im_start|>user\\n" \ |
|
|
--override "<|Assistant|>" "<|im_start|>assistant\\n" \ |
|
|
--override "<|EOT|>" "<|endoftext|>" \ |
|
|
--override "<|tool▁calls▁begin|>" "<tool_call>" \ |
|
|
--override "<|tool▁calls▁end|>" "</tool_call>" \ |
|
|
--override "<|tool▁call▁begin|>" "<tool_call>" \ |
|
|
--override "<|tool▁call▁end|>" "</tool_call>" \ |
|
|
--override "<|tool▁outputs▁begin|>" "<tool_response>" \ |
|
|
--override "<|tool▁outputs▁end|>" "</tool_response>" \ |
|
|
--override "<|tool▁output▁begin|>" "<tool_response>" \ |
|
|
--override "<|tool▁output▁end|>" "</tool_response>" \ |
|
|
--override "<|tool▁sep|>" "</tool_call>" |
|
|
|
|
|
Loading config from 'Qwen2.5-0.5B-Instruct'... Done. |
|
|
Loading config from 'DeepSeek-V3-0324'... Done. |
|
|
Loading tokenizer from 'Qwen2.5-0.5B-Instruct'... Done. |
|
|
Loading tokenizer from 'DeepSeek-V3-0324'... Done. |
|
|
Loading model from 'Qwen2.5-0.5B-Instruct'... Done. |
|
|
|
|
|
Input model configuration: |
|
|
- Target vocabulary size : 129280 (used = 128815, unused = 465) |
|
|
- Donor vocabulary size : 151936 |
|
|
- Donor num layers : 24 (tied embeddings = True) |
|
|
- Donor hidden size : 896 |
|
|
- Donor attention heads : 14 |
|
|
- Donor intermediate size : 4864 (ratio = 1:5.4) |
|
|
- Donor total parameters : 494032768 (0.49B) |
|
|
-- Embedding parameters : 136134656 (0.14B) |
|
|
-- Non-embedding parameters : 357898112 (0.36B) |
|
|
|
|
|
Processing 3 automatic token overrides: |
|
|
✔ 'bos_token_id' : 0 '<|begin▁of▁sentence|>' → [151643] '<|endoftext|>' |
|
|
✔ 'eos_token_id' : 1 '<|end▁of▁sentence|>' → [151645] '<|im_end|>' |
|
|
✘ 'pad_token_id' : 1 is already mapped to [151645] |
|
|
|
|
|
Processing 18 manual token overrides: |
|
|
✔ 0 : '<|begin▁of▁sentence|>' → [151643] '<|endoftext|>' |
|
|
✔ 1 : '<|end▁of▁sentence|>' → [151645] '<|im_end|>' |
|
|
✔ 2 : '<|▁pad▁|>' → [151643] '<|endoftext|>' |
|
|
✔ 128800 : '<|fim▁hole|>' → [151660] '<|fim_middle|>' |
|
|
✔ 128801 : '<|fim▁begin|>' → [151659] '<|fim_prefix|>' |
|
|
✔ 128802 : '<|fim▁end|>' → [151661] '<|fim_suffix|>' |
|
|
✔ 128803 : '<|User|>' → [151644, 872, 198] '<|im_start|>user\n' |
|
|
✔ 128804 : '<|Assistant|>' → [151644, 77091, 198] '<|im_start|>assistant\n' |
|
|
✔ 128805 : '<|EOT|>' → [151643] '<|endoftext|>' |
|
|
✔ 128806 : '<|tool▁calls▁begin|>' → [151657] '<tool_call>' |
|
|
✔ 128807 : '<|tool▁calls▁end|>' → [151658] '</tool_call>' |
|
|
✔ 128808 : '<|tool▁call▁begin|>' → [151657] '<tool_call>' |
|
|
✔ 128809 : '<|tool▁call▁end|>' → [151658] '</tool_call>' |
|
|
✔ 128810 : '<|tool▁outputs▁begin|>' → [27, 14172, 9655, 29] '<tool_response>' |
|
|
✔ 128811 : '<|tool▁outputs▁end|>' → [522, 14172, 9655, 29] '</tool_response>' |
|
|
✔ 128812 : '<|tool▁output▁begin|>' → [27, 14172, 9655, 29] '<tool_response>' |
|
|
✔ 128813 : '<|tool▁output▁end|>' → [522, 14172, 9655, 29] '</tool_response>' |
|
|
✔ 128814 : '<|tool▁sep|>' → [151658] '</tool_call>' |
|
|
|
|
|
NOTE: Using an "untied" copy of 'embed_tokens.weight' as new 'lm_head.weight' tensor... |
|
|
|
|
|
Transplanting tokens: 100%|████████████████████████████████████████████████████████████| 128815/128815 [00:53<00:00, 2423.79token/s] |
|
|
|
|
|
Transplant mappings: |
|
|
- 1 to 1 : 83683 (65%) |
|
|
- 2 to 1 : 38380 (30%) |
|
|
- 3 to 1 : 4583 (3.6%) |
|
|
- 4 to 1 : 927 (0.72%) |
|
|
- 5 to 1 : 273 (0.21%) |
|
|
- 6 to 1 : 91 (0.071%) |
|
|
- 7 to 1 : 35 (0.027%) |
|
|
- 8 to 1 : 22 (0.017%) |
|
|
- 9 to 1 : 8 (0.0062%) |
|
|
- 10 to 1 : 4 (0.0031%) |
|
|
- 11 to 1 : 4 (0.0031%) |
|
|
- 13 to 1 : 1 (0.00078%) |
|
|
- 14 to 1 : 10 (0.0078%) |
|
|
- 15 to 1 : 91 (0.071%) |
|
|
- 16 to 1 : 701 (0.54%) |
|
|
- 19 to 1 : 1 (0.00078%) |
|
|
- 21 to 1 : 1 (0.00078%) |
|
|
|
|
|
Head initialized with: |
|
|
- Copies : 83683 (65%) |
|
|
- Means : 45132 (35%) |
|
|
- Zeros : 465 (0.36%) |
|
|
|
|
|
Output model configuration: |
|
|
- Output vocabulary size : 129280 |
|
|
- Output num layers : 24 (tied embeddings = False) |
|
|
- Output hidden size : 896 |
|
|
- Output attention heads : 14 |
|
|
- Output intermediate size : 4864 (ratio = 1:5.4) |
|
|
- Output total parameters : 589567872 (0.59B) |
|
|
-- Embedding parameters : 231669760 (0.23B) |
|
|
-- Non-embedding parameters : 357898112 (0.36B) |
|
|
|
|
|
Saving model and tokenizer to 'DeepSeek-V3-DRAFT-0.6B-UNTRAINED' folder |
|
|
[2025-08-07 15:36:33,693] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect) |
|
|
|
|
|
Patching 'torch_dtype' in 'DeepSeek-V3-DRAFT-0.6B-UNTRAINED/config.json' based on actual saved tensors |
|
|
- Updated 'torch_dtype' to 'bfloat16' based on actual tensor dtype |
|
|
|
|
|
Operation completed successfully (ignore any 'segmentation fault' that follows!!!) |
|
|
``` |
|
|
|
|
|
## 2. The following datasets were used to create a fine-tuning dataset of ~2.3B tokens: |
|
|
|
|
|
- [agentlans/common-crawl-sample](https://huggingface.co/datasets/agentlans/common-crawl-sample) |
|
|
- [bigcode/the-stack-smol-xl](https://huggingface.co/datasets/bigcode/the-stack-smol-xl) |
|
|
- [rombodawg/Everything_Instruct](https://huggingface.co/datasets/rombodawg/Everything_Instruct) (NOTE: `output` field only) |
|
|
|
|
|
Each sample was formatted simply as plain text between `<|end▁of▁sentence|>` tags.
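
The exact preprocessing script isn't included here, but the sketch below illustrates the idea. It assumes the instruct data is stored as JSONL with an `output` field and that the training pipeline reads a plain `text` field per document; both of those details are assumptions rather than facts from this card:

```sh
# Hedged sketch (assumes JSONL input with an `output` field and a trainer that reads a `text` field;
# file names are placeholders). Keep only the `output` field of the instruct data as plain-text
# documents; during training each document is then simply delimited by <|end▁of▁sentence|> tags.
jq -c '{text: .output}' everything_instruct.jsonl > datasets/rombodawg-Everything-Instruct/output_only.json
```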
|
|
|
|
|
## 3. The model was then trained using [qlora-pipe-lite](https://github.com/jukofyork/qlora-pipe-lite) for 1 epoch with a batch size of 60 and a sequence length of 32k (~2M tokens per step): |
|
|
|
|
|
```toml |
|
|
# ============================== |
|
|
# MODEL AND OUTPUT CONFIGURATION |
|
|
# ============================== |
|
|
|
|
|
model_dir = 'models/DeepSeek-V3-DRAFT-0.6B-UNTRAINED' |
|
|
output_dir = 'finetuned' |
|
|
|
|
|
# =========================== |
|
|
# TRAINING TYPE CONFIGURATION |
|
|
# =========================== |
|
|
|
|
|
full_fine_tune = true |
|
|
|
|
|
# ======================= |
|
|
# OPTIMIZER CONFIGURATION |
|
|
# ======================= |
|
|
|
|
|
lr = 5e-5 |
|
|
|
|
|
# ====================== |
|
|
# TRAINING CONFIGURATION |
|
|
# ====================== |
|
|
|
|
|
sequence_len = 32768 |
|
|
|
|
|
gradient_accumulation_steps = 10 # 10×6 = batch size 60, 10×6×32768 = ~2M tokens per step |
|
|
|
|
|
# ===================== |
|
|
# DATASET CONFIGURATION |
|
|
# ===================== |
|
|
|
|
|
[[datasets]] |
|
|
dataset_path = 'datasets/common-crawl-sample/*.json' |
|
|
drop_tails = true |
|
|
|
|
|
[[datasets]] |
|
|
dataset_path = 'datasets/the-stack-smol-xl/*.jsonl' |
|
|
drop_tails = true |
|
|
|
|
|
[[datasets]] |
|
|
dataset_path = 'datasets/rombodawg-Everything-Instruct/*.json' |
|
|
drop_tails = true |
|
|
|
|
|
``` |
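
At ~2M tokens per optimizer step, the ~2.3B-token dataset works out to roughly 1,150-1,200 steps for the single epoch.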
|
|
|
|
|
I used six `RTX A6000` GPUs across three nodes, hence the batch size of `60` (`6 GPUs × 10 gradient accumulation steps = 60`):
|
|
|
|
|
 |