Steering Out-of-Distribution Generalization with Concept Ablation Fine-Tuning Paper • 2507.16795 • Published Jul 22, 2025
Measuring Progress in Dictionary Learning for Language Model Interpretability with Board Game Models Paper • 2408.00113 • Published Jul 31, 2024