huseinzolkepliscicom committed (verified)

Commit 85b07c7 · Parent: 590cd48

Update README.md

Files changed (1): README.md (+1, −1)
README.md CHANGED
```diff
@@ -16,7 +16,7 @@ LoRA SFT [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) on init
 2. Multipacking variable-length sequences at 16384 context length, with a global batch size of 8, so the global total is 65536 tokens.
 3. All self-attention linear layers with rank 16, 32, 64, 128, 256, or 512, with alpha = 2.0 × rank.
 4. All expert gate-up projections and down projections with rank 16, 32, 64, 128, 256, or 512, with alpha = 2.0 × rank. <sup> + </sup>
-5. Selected expert gate-up projections and down projections based on the square root of the mean of `exp_avg_sq` ([20b-r64-experts-gradient.sh](20b-r64-experts-gradient.sh), [notebook/sort-optimizer.ipynb](notebook/sort-optimizer.ipynb)); the top 4 selected layers are 3, 2, 18, and 1. <sup> + </sup>
+5. Selected expert gate-up projections and down projections based on the square root of the mean of `exp_avg_sq`; the top 4 selected layers are 3, 2, 18, and 1. <sup> + </sup>
 6. Liger fused cross entropy.
 7. 2e-4 learning rate, 50 warmup steps, 2 epochs only.
```
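To make points 3-4 concrete, here is a minimal sketch of the described LoRA setup using Hugging Face PEFT. The target module names (`q_proj`, `k_proj`, `v_proj`, `o_proj` for self-attention; `gate_up_proj` and `down_proj` for the experts) and the assumption that PEFT can wrap the expert projections as ordinary linear layers are mine, not confirmed by this repo; treat it as an illustration of the rank sweep and alpha = 2.0 × rank, not the repo's actual training code.

```python
# Hedged sketch: LoRA over all self-attention linears and expert projections,
# with alpha = 2.0 x rank. Module names are assumptions, not from this repo.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

rank = 64  # one value from the sweep: 16, 32, 64, 128, 256, 512

lora_config = LoraConfig(
    r=rank,
    lora_alpha=2 * rank,  # alpha multiplied by 2.0, per the README
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # all self-attention linears
        "gate_up_proj", "down_proj",             # expert gate-up / down projections
    ],
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", torch_dtype="auto")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```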
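The `+` line in point 5 compresses the selection procedure: score each expert projection by the square root of the mean of Adam's second-moment buffer `exp_avg_sq`, then keep the highest-scoring decoder layers. The repo's own procedure is in [notebook/sort-optimizer.ipynb](notebook/sort-optimizer.ipynb); the sketch below is an assumed reimplementation, and the parameter-name pattern `layers.<i>...experts` and the per-layer averaging are illustrative guesses.

```python
# Hedged sketch: rank decoder layers by sqrt(mean(exp_avg_sq)) over their
# expert projections, read from a trained AdamW optimizer's state.
import re

def top_expert_layers(optimizer, named_params, k=4):
    # optimizer.state is keyed by parameter tensor; recover names via identity.
    name_of = {id(p): n for n, p in named_params}
    scores = {}
    for param, state in optimizer.state.items():
        name = name_of.get(id(param), "")
        match = re.search(r"layers\.(\d+)\..*experts", name)  # assumed naming
        if match is None or "exp_avg_sq" not in state:
            continue
        layer = int(match.group(1))
        # square root of the mean of exp_avg_sq for this parameter
        scores.setdefault(layer, []).append(
            state["exp_avg_sq"].float().mean().sqrt().item()
        )
    # Average per-parameter scores within each layer, highest first.
    ranked = sorted(scores, key=lambda l: sum(scores[l]) / len(scores[l]), reverse=True)
    return ranked[:k]

# e.g. top_expert_layers(optimizer, model.named_parameters())
```

On the run documented here, this ranking selected layers 3, 2, 18, and 1.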
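Points 6-7 correspond to standard `transformers` trainer settings; the sketch below is hedged, since the device-level batch split, the precision, and the Liger integration path are assumptions rather than something taken from this repo.

```python
# Hedged sketch of the optimization settings: 2e-4 LR, 50 warmup steps,
# 2 epochs, and Liger kernels (which provide the fused cross-entropy loss).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="gpt-oss-20b-lora-sft",  # hypothetical output path
    learning_rate=2e-4,
    warmup_steps=50,
    num_train_epochs=2,
    per_device_train_batch_size=8,      # assumption: global batch 8 on one device
    bf16=True,                          # assumption: bf16 for 20B-scale LoRA SFT
    use_liger_kernel=True,              # enables Liger kernels in transformers >= 4.45
)
```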