huseinzolkepliscicom committed (verified)

Commit 85b07c7 · Parent: 590cd48

Update README.md

Files changed (1): README.md (+1, −1)
README.md CHANGED
```diff
@@ -16,7 +16,7 @@ LoRA SFT [openai/gpt-oss-20b](https://huggingface.co/openai/gpt-oss-20b) on init
 2. Multipacking variable-length sequences at 16384 context length, with a global batch size of 8, so the global total is 65536 tokens.
 3. All self-attention linear layers with rank 16, 32, 64, 128, 256, or 512, with alpha = 2.0 × rank.
 4. All expert gate-up projections and down projections with rank 16, 32, 64, 128, 256, or 512, with alpha = 2.0 × rank. <sup> + </sup>
-5. Selected expert gate-up projections and down projections based on the square root of the mean of `exp_avg_sq` ([20b-r64-experts-gradient.sh](20b-r64-experts-gradient.sh), [notebook/sort-optimizer.ipynb](notebook/sort-optimizer.ipynb)); the top 4 selected layers are 3, 2, 18, and 1. <sup> + </sup>
+5. Selected expert gate-up projections and down projections based on the square root of the mean of `exp_avg_sq`; the top 4 selected layers are 3, 2, 18, and 1. <sup> + </sup>
 6. Liger fused cross entropy.
 7. 2e-4 learning rate, 50 warmup steps, 2 epochs only.
```
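To make points 3-4 concrete, here is a minimal sketch of the described LoRA setup using Hugging Face PEFT. The target module names (`q_proj`, `k_proj`, `v_proj`, `o_proj` for self-attention; `gate_up_proj` and `down_proj` for the experts) and the assumption that PEFT can wrap the expert projections as ordinary linear layers are mine, not confirmed by this repo; treat it as an illustration of the rank sweep and alpha = 2.0 × rank, not the repo's actual training code.

```python
# Hedged sketch: LoRA over all self-attention linears and expert projections,
# with alpha = 2.0 x rank. Module names are assumptions, not from this repo.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

rank = 64  # one value from the sweep: 16, 32, 64, 128, 256, 512

lora_config = LoraConfig(
    r=rank,
    lora_alpha=2 * rank,  # alpha multiplied by 2.0, per the README
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # all self-attention linears
        "gate_up_proj", "down_proj",             # expert gate-up / down projections
    ],
    task_type="CAUSAL_LM",
)

model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", torch_dtype="auto")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
```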
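The `+` line in point 5 compresses the selection procedure: score each expert projection by the square root of the mean of Adam's second-moment buffer `exp_avg_sq`, then keep the highest-scoring decoder layers. The repo's own procedure is in [notebook/sort-optimizer.ipynb](notebook/sort-optimizer.ipynb); the sketch below is an assumed reimplementation, and the parameter-name pattern `layers.<i>...experts` and the per-layer averaging are illustrative guesses.

```python
# Hedged sketch: rank decoder layers by sqrt(mean(exp_avg_sq)) over their
# expert projections, read from a trained AdamW optimizer's state.
import re

def top_expert_layers(optimizer, named_params, k=4):
    # optimizer.state is keyed by parameter tensor; recover names via identity.
    name_of = {id(p): n for n, p in named_params}
    scores = {}
    for param, state in optimizer.state.items():
        name = name_of.get(id(param), "")
        match = re.search(r"layers\.(\d+)\..*experts", name)  # assumed naming
        if match is None or "exp_avg_sq" not in state:
            continue
        layer = int(match.group(1))
        # square root of the mean of exp_avg_sq for this parameter
        scores.setdefault(layer, []).append(
            state["exp_avg_sq"].float().mean().sqrt().item()
        )
    # Average per-parameter scores within each layer, highest first.
    ranked = sorted(scores, key=lambda l: sum(scores[l]) / len(scores[l]), reverse=True)
    return ranked[:k]

# e.g. top_expert_layers(optimizer, model.named_parameters())
```

On the run documented here, this ranking selected layers 3, 2, 18, and 1.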
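Points 6-7 correspond to standard `transformers` trainer settings; the sketch below is hedged, since the device-level batch split, the precision, and the Liger integration path are assumptions rather than something taken from this repo.

```python
# Hedged sketch of the optimization settings: 2e-4 LR, 50 warmup steps,
# 2 epochs, and Liger kernels (which provide the fused cross-entropy loss).
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="gpt-oss-20b-lora-sft",  # hypothetical output path
    learning_rate=2e-4,
    warmup_steps=50,
    num_train_epochs=2,
    per_device_train_batch_size=8,      # assumption: global batch 8 on one device
    bf16=True,                          # assumption: bf16 for 20B-scale LoRA SFT
    use_liger_kernel=True,              # enables Liger kernels in transformers >= 4.45
)
```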