---
library_name: transformers
datasets:
- mesolitica/Malaysian-Reasoning
base_model:
- openai/gpt-oss-20b
---

# gpt-oss-20b-Malaysian-Reasoning-SFT-v0.1

LoRA SFT of [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) on the initial release of [mesolitica/Malaysian-Reasoning](https://huggingface.co/datasets/mesolitica/Malaysian-Reasoning).

## Ablation on GPT OSS 20B

1. Use `kernels-community/vllm-flash-attn3` for Flash Attention 3 with attention sinks.
2. Multipacking of variable-length sequences at 16384 context length, with a global batch size of 8, so global total tokens is 65536.
3. All self-attention linear layers with ranks 16, 32, 64, 128, 256, and 512, with alpha = 2.0 × rank (see the configuration sketch after this list).
4. All expert gate-up and down projections with ranks 16, 32, 64, 128, 256, and 512, with alpha = 2.0 × rank. <sup>+</sup>
5. Selected expert gate-up and down projections based on the root mean of `exp_avg_sq`; the top 4 selected layers are 3, 2, 18, and 1 (a selection sketch follows the footnote below). <sup>+</sup>
6. Liger fused cross entropy.
7. 2e-4 learning rate, 50 warmup steps, 2 epochs only.
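
The sketch below illustrates the attention-only LoRA arm of the ablation. It is a minimal sketch, not the exact training script: the module names `q_proj`/`k_proj`/`v_proj`/`o_proj` follow the Hugging Face gpt-oss implementation, and the Hub-kernel attention string assumes a recent transformers release that supports kernel repos in `attn_implementation`. The full training code is in the source repository linked below.

```python
# Minimal sketch of one ablation arm: LoRA on the self-attention linear
# layers only, with alpha = 2.0 x rank.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

rank = 256  # swept over 16, 32, 64, 128, 256, 512 in the ablation

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype="auto",
    # Flash Attention 3 with sink support, loaded as a Hub kernel.
    attn_implementation="kernels-community/vllm-flash-attn3",
)

lora_config = LoraConfig(
    r=rank,
    lora_alpha=2 * rank,  # alpha = 2.0 x rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```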

<sup>+</sup> With the rank of each expert equal to the total rank divided by the number of active experts, following https://thinkingmachines.ai/blog/lora/.
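
The layer selection in item 5 can be approximated as follows. This is a hedged sketch of the idea only: it assumes an Adam-style optimizer state from a prior run and parameter names that contain the layer index and the expert projection names (`gate_up_proj` / `down_proj`); the actual selection code lives in the linked source repository.

```python
# Rank decoder layers by the root mean of Adam's second-moment estimate
# (exp_avg_sq) over the expert projections, then keep the top layers.
import re
from collections import defaultdict

import torch


def rank_expert_layers(named_parameters, optimizer):
    """Sort layer indices by sqrt(mean(exp_avg_sq)) of their expert projections."""
    per_layer = defaultdict(list)
    for name, param in named_parameters:
        state = optimizer.state.get(param, {})
        if "exp_avg_sq" not in state:
            continue
        if "gate_up_proj" not in name and "down_proj" not in name:
            continue
        match = re.search(r"layers\.(\d+)\.", name)
        if match is None:
            continue
        per_layer[int(match.group(1))].append(state["exp_avg_sq"].float().mean())
    scores = {
        layer: torch.stack(means).mean().sqrt().item()
        for layer, means in per_layer.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# e.g. top_layers = rank_expert_layers(model.named_parameters(), optimizer)[:4]
```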

## We only upload the best model

<img src="https://raw.githubusercontent.com/Scicom-AI-Enterprise-Organization/small-ablation/refs/heads/main/malaysian-reasoning/lora_accuracy.png">

In this model repository we only upload the best configuration: **attention linear layers only, with rank 256 and alpha 512**.
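
A minimal usage sketch, assuming the uploaded checkpoint loads directly with transformers (if only the LoRA adapter is published, attach it to the base model with PEFT's `PeftModel.from_pretrained` instead). The repository id below is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<this-repo-id>"  # placeholder for this repository's id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# "Briefly explain the Pythagorean theorem."
messages = [{"role": "user", "content": "Terangkan teorem Pythagoras secara ringkas."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```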

## Source code

Source code is available at https://github.com/Scicom-AI-Enterprise-Organization/small-ablation/blob/main/malaysian-reasoning

## Acknowledgement

Special thanks to https://www.scitix.ai/ for the H100 node!