TinySQL: A Progressive Text-to-SQL Dataset for Mechanistic Interpretability Research Paper • 2503.12730 • Published Mar 17 • 4
k-steering Collection Datasets used for our paper on multi-attribute steering using gradient descent. • 7 items • Updated 20 days ago • 1
Activation Space Interventions Can Be Transferred Between Large Language Models Paper • 2503.04429 • Published Mar 6 • 2
Transferring Activation Features for model interventions Collection Models and datasets used for our paper on transferring activations between models. • 23 items • Updated 25 days ago • 1
Blog: Activations transfer for model interventions. Collection Backdoor datasets, language models, and transfer mappings between their activation spaces. • 6 items • Updated May 10 • 3
Beyond Training Objectives: Interpreting Reward Model Divergence in Large Language Models Paper • 2310.08164 • Published Oct 12, 2023 • 4