Update README.md
README.md
@@ -16,7 +16,7 @@ LoRA SFT [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) on init
 2. Multipacking variable-length sequences at a 16384 context length, with a global batch size of 8, so the global total is 65536 tokens.
 3. All self-attention linear layers with rank 16, 32, 64, 128, 256, or 512, with alpha set to 2.0 × rank.
 4. All expert gate-up and down projections with rank 16, 32, 64, 128, 256, or 512, with alpha set to 2.0 × rank. <sup>+</sup>
-5. Selected expert gate-up and down projections based on the square root of the mean `exp_avg_sq`,
+5. Selected expert gate-up and down projections based on the square root of the mean `exp_avg_sq`; the top 4 selected layers are 3, 2, 18, and 1. <sup>+</sup>
 6. Liger fused cross entropy.
 7. 2e-4 learning rate, 50 warmup steps, 2 epochs only.
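For items 3 and 4, the rank sweep and alpha rule translate directly into a PEFT `LoraConfig`. A minimal sketch, assuming PEFT and gpt-oss-style module names (`q_proj`/`k_proj`/`v_proj`/`o_proj` for the attention linears, `gate_up_proj`/`down_proj` for the expert projections); only the ranks and the 2.0 × rank alpha come from the list above:

```python
# A sketch only: the target module names are assumptions about the
# gpt-oss checkpoint layout, not confirmed by this README.
from peft import LoraConfig

rank = 64  # swept over 16, 32, 64, 128, 256, 512
lora_config = LoraConfig(
    r=rank,
    lora_alpha=2 * rank,  # alpha = 2.0 x rank
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # item 3: all self-attention linears
        "gate_up_proj", "down_proj",             # item 4: expert projections (assumed names)
    ],
)
```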
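Item 5's rule can be read as: for each transformer layer, take the square root of the mean of Adam's second-moment buffer (`exp_avg_sq`) over that layer's expert projection weights, then keep the top-4 layers (3, 2, 18, and 1 here). A hedged sketch, assuming a PyTorch AdamW-style optimizer state and `model.layers.{i}` / `experts` parameter naming:

```python
# A sketch of the selection rule; parameter-name patterns are assumptions.
import re
from collections import defaultdict

def top_expert_layers(model, optimizer, k=4):
    """Rank layers by sqrt(mean(exp_avg_sq)) over expert projection params."""
    name_of = {id(p): n for n, p in model.named_parameters()}
    scores = defaultdict(list)
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state.get(p, {})
            name = name_of.get(id(p), "")
            layer = re.search(r"layers\.(\d+)\.", name)
            if "exp_avg_sq" in state and layer and "experts" in name:
                # Square root of the mean second moment, per item 5.
                scores[int(layer.group(1))].append(
                    state["exp_avg_sq"].float().mean().sqrt().item()
                )
    mean_score = {i: sum(v) / len(v) for i, v in scores.items()}
    return sorted(mean_score, key=mean_score.get, reverse=True)[:k]
```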
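Item 6, Liger fused cross entropy, fuses the LM-head matmul with the cross-entropy reduction so the full-vocabulary logits are never materialized. A toy-shape sketch, assuming the `liger-kernel` package and its weight-first call convention:

```python
# Toy shapes only; real vocab/hidden sizes would come from the model config.
import torch
from liger_kernel.transformers import LigerFusedLinearCrossEntropyLoss

loss_fn = LigerFusedLinearCrossEntropyLoss()
hidden = torch.randn(32, 128, device="cuda", dtype=torch.bfloat16, requires_grad=True)  # (tokens, hidden)
lm_head_w = torch.randn(1024, 128, device="cuda", dtype=torch.bfloat16)                 # (vocab, hidden)
labels = torch.randint(0, 1024, (32,), device="cuda")
loss = loss_fn(lm_head_w, hidden, labels)  # projection + loss fused in one kernel
loss.backward()
```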
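Items 2 and 7 map onto standard Hugging Face trainer arguments. A minimal sketch, assuming `transformers.TrainingArguments` and single-device training (device count is not stated above); `output_dir` is a placeholder:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="gpt-oss-20b-lora-sft",  # placeholder name
    learning_rate=2e-4,                 # item 7
    warmup_steps=50,                    # item 7
    num_train_epochs=2,                 # item 7
    per_device_train_batch_size=8,      # assumption: global batch 8 on one device (item 2)
    bf16=True,                          # assumption, typical for this model family
)
```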