---
library_name: transformers
datasets:
  - mesolitica/Malaysian-Reasoning
base_model:
  - openai/gpt-oss-20b
---

# gpt-oss-20b-Malaysian-Reasoning-SFT-v0.1

LoRA SFT of openai/gpt-oss-20b on the initial release of mesolitica/Malaysian-Reasoning.

## Ablation on GPT OSS 20B

1. Use kernels-community/vllm-flash-attn3 for Flash Attention 3 with attention sink.
2. Multipacking of variable-length sequences into a 16384 context length, with a global batch size of 8, so the global total is 65536 tokens.
3. LoRA on all self-attention linear layers with ranks 16, 32, 64, 128, 256, and 512, with alpha = 2.0 × rank.
4. LoRA on all expert gate-up and down projections with ranks 16, 32, 64, 128, 256, and 512, with alpha = 2.0 × rank. +
5. LoRA on selected expert gate-up and down projections, chosen by the square root of the mean of exp_avg_sq (Adam's second-moment estimate); the top 4 selected layers are 3, 2, 18, and 1 (see the sketch after this list). +
6. Liger fused cross entropy.
7. 2e-4 learning rate, 50 warmup steps, 2 epochs only.

+ With the rank of each expert projection equal to the total rank divided by the number of active experts, following https://thinkingmachines.ai/blog/lora/.
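
Below is a minimal sketch, not the actual training script, of how expert layers could be ranked by the square root of the mean of Adam's exp_avg_sq. The parameter-name patterns (`model.layers.<idx>.mlp.experts`, `gate_up_proj`, `down_proj`) are assumptions about the gpt-oss-20b module naming and may differ from the real configuration.

```python
from collections import defaultdict

import torch


def rank_expert_layers_by_exp_avg_sq(model, optimizer, top_k=4):
    """Return the top_k layer indices with the largest sqrt(mean(exp_avg_sq))
    over their expert gate-up / down projection parameters."""
    per_layer_means = defaultdict(list)
    for name, param in model.named_parameters():
        # Only consider MoE expert projections (assumed naming pattern).
        if "experts" not in name:
            continue
        if "gate_up_proj" not in name and "down_proj" not in name:
            continue
        if "model.layers." not in name:
            continue
        state = optimizer.state.get(param, {})
        if "exp_avg_sq" not in state:
            continue
        layer_idx = int(name.split("model.layers.")[1].split(".")[0])
        per_layer_means[layer_idx].append(state["exp_avg_sq"].float().mean())

    ranked = sorted(
        ((idx, torch.stack(vals).mean().sqrt().item()) for idx, vals in per_layer_means.items()),
        key=lambda item: item[1],
        reverse=True,
    )
    # In this ablation the result was layers 3, 2, 18, and 1.
    return [idx for idx, _ in ranked[:top_k]]
```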

## We only upload the best model

In this repository we upload only the best configuration: attention linear layers only, with rank 256 and alpha 512.
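
For illustration, here is a minimal sketch of a PEFT LoraConfig matching the described best setup (attention linear layers only, rank 256, alpha 512). The `target_modules` names are an assumption about gpt-oss-20b's attention projection naming, not taken from the actual training code.

```python
import torch
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained(
    "openai/gpt-oss-20b",
    torch_dtype=torch.bfloat16,
)

lora_config = LoraConfig(
    r=256,
    lora_alpha=512,  # alpha = 2.0 x rank
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],  # assumed module names
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_config)
model.print_trainable_parameters()
```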

## Source code

Source code is available at https://github.com/Scicom-AI-Enterprise-Organization/small-ablation/blob/main/malaysian-reasoning.

## Acknowledgement

Special thanks to https://www.scitix.ai/ for the H100 node!