---
library_name: transformers
datasets:
- mesolitica/Malaysian-Reasoning
base_model:
- openai/gpt-oss-20b
---

# gpt-oss-20b-Malaysian-Reasoning-SFT-v0.1

LoRA SFT of [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) on the initial release of [mesolitica/Malaysian-Reasoning](https://huggingface.co/datasets/mesolitica/Malaysian-Reasoning).

## Ablation on GPT OSS 20B

1. Use `kernels-community/vllm-flash-attn3` for Flash Attention 3 with attention sinks.
2. Multipacking of variable-length sequences at 16384 context length, with a global batch size of 8, so global total tokens is 65536.
3. All self-attention linear layers with ranks 16, 32, 64, 128, 256, and 512, with alpha = 2.0 × rank (see the configuration sketch after this list).
4. All expert gate-up and down projections with ranks 16, 32, 64, 128, 256, and 512, with alpha = 2.0 × rank. <sup>+</sup>
5. Selected expert gate-up and down projections based on the root mean of `exp_avg_sq`; the top 4 selected layers are 3, 2, 18, and 1 (a selection sketch follows the footnote below). <sup>+</sup>
6. Liger fused cross entropy.
7. 2e-4 learning rate, 50 warmup steps, 2 epochs only.
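
The sketch below illustrates the attention-only LoRA arm of the ablation. It is a minimal sketch, not the exact training script: the module names `q_proj`/`k_proj`/`v_proj`/`o_proj` follow the Hugging Face gpt-oss implementation, and the Hub-kernel attention string assumes a recent transformers release that supports kernel repos in `attn_implementation`. The full training code is in the source repository linked below.

```python
# Minimal sketch of one ablation arm: LoRA on the self-attention linear
# layers only, with alpha = 2.0 x rank.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

rank = 256  # swept over 16, 32, 64, 128, 256, 512 in the ablation

model = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype="auto",
    # Flash Attention 3 with sink support, loaded as a Hub kernel.
    attn_implementation="kernels-community/vllm-flash-attn3",
)

lora_config = LoraConfig(
    r=rank,
    lora_alpha=2 * rank,  # alpha = 2.0 x rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    lora_dropout=0.0,
    bias="none",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```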

<sup>+</sup> With the rank of each expert equal to the total rank divided by the number of active experts, following https://thinkingmachines.ai/blog/lora/.
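
The layer selection in item 5 can be approximated as follows. This is a hedged sketch of the idea only: it assumes an Adam-style optimizer state from a prior run and parameter names that contain the layer index and the expert projection names (`gate_up_proj` / `down_proj`); the actual selection code lives in the linked source repository.

```python
# Rank decoder layers by the root mean of Adam's second-moment estimate
# (exp_avg_sq) over the expert projections, then keep the top layers.
import re
from collections import defaultdict

import torch


def rank_expert_layers(named_parameters, optimizer):
    """Sort layer indices by sqrt(mean(exp_avg_sq)) of their expert projections."""
    per_layer = defaultdict(list)
    for name, param in named_parameters:
        state = optimizer.state.get(param, {})
        if "exp_avg_sq" not in state:
            continue
        if "gate_up_proj" not in name and "down_proj" not in name:
            continue
        match = re.search(r"layers\.(\d+)\.", name)
        if match is None:
            continue
        per_layer[int(match.group(1))].append(state["exp_avg_sq"].float().mean())
    scores = {
        layer: torch.stack(means).mean().sqrt().item()
        for layer, means in per_layer.items()
    }
    return sorted(scores, key=scores.get, reverse=True)


# e.g. top_layers = rank_expert_layers(model.named_parameters(), optimizer)[:4]
```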

## We only upload the best model

<img src="https://raw.githubusercontent.com/Scicom-AI-Enterprise-Organization/small-ablation/refs/heads/main/malaysian-reasoning/lora_accuracy.png">

In this model repository we only upload the best configuration: **attention linear layers only, with rank 256 and alpha 512**.
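
A minimal usage sketch, assuming the uploaded checkpoint loads directly with transformers (if only the LoRA adapter is published, attach it to the base model with PEFT's `PeftModel.from_pretrained` instead). The repository id below is a placeholder.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "<this-repo-id>"  # placeholder for this repository's id

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# "Briefly explain the Pythagorean theorem."
messages = [{"role": "user", "content": "Terangkan teorem Pythagoras secara ringkas."}]
inputs = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

outputs = model.generate(inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs.shape[-1]:], skip_special_tokens=True))
```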

## Source code

Source code is available at https://github.com/Scicom-AI-Enterprise-Organization/small-ablation/blob/main/malaysian-reasoning

## Acknowledgement

Special thanks to https://www.scitix.ai/ for the H100 node!