Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity • arXiv:2101.03961 • Published Jan 11, 2021
RoFormer: Enhanced Transformer with Rotary Position Embedding • arXiv:2104.09864 • Published Apr 20, 2021
Sliding Window Attention Training for Efficient Large Language Models • arXiv:2502.18845 • Published Feb 26, 2025
GLaM: Efficient Scaling of Language Models with Mixture-of-Experts • arXiv:2112.06905 • Published Dec 13, 2021
Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer • arXiv:1701.06538 • Published Jan 23, 2017
Mixture-of-Experts Meets Instruction Tuning: A Winning Combination for Large Language Models • arXiv:2305.14705 • Published May 24, 2023
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints • arXiv:2305.13245 • Published May 22, 2023