1-Layer 4-Head Attention-Only Transformer

This is a simplified transformer model with 1 attention layer, 4 attention heads, and a hidden size of 128, designed for studying attention mechanisms in isolation.
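As a rough sanity check on the model's scale, the parameter count of such an architecture can be estimated by hand. This is a minimal sketch that assumes four bias-free square projections per attention layer (query, key, value, output) and an untied language modeling head. The function name is hypothetical. The vocabulary size is not stated in this card, so `4096` below is a placeholder, not the model's actual value.

```python
def attention_only_param_count(vocab_size, hidden_size=128, num_layers=1,
                               tie_embeddings=False):
    """Estimate parameters of an attention-only transformer (hypothetical helper)."""
    # Token embedding table: one hidden_size vector per vocabulary entry.
    embed = vocab_size * hidden_size
    # Assumption: q, k, v and output projections, each hidden_size x hidden_size,
    # without biases. The number of heads does not change this count.
    attn_per_layer = 4 * hidden_size * hidden_size
    # Bias-free linear head back to the vocabulary (skipped if weights are tied).
    lm_head = 0 if tie_embeddings else hidden_size * vocab_size
    return embed + num_layers * attn_per_layer + lm_head


# Placeholder vocabulary size, chosen only for illustration.
print(attention_only_param_count(vocab_size=4096))  # 1114112
```

Under these assumptions the embedding and output matrices dominate the total; the single attention layer contributes only 65,536 parameters.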

Architecture Differences from Vanilla Transformer

Removed Components:

No MLP/Feed-Forward layers - Only attention layers
No Layer Normalization - No LayerNorm before/after attention
No positional encoding - No position embeddings of any kind

Kept Components:

Token embeddings
Multi-head self-attention with causal masking
Residual connections around attention layers
Language modeling head (linear projection to vocabulary)

This minimal architecture isolates the attention mechanism, making it useful for mechanistic interpretability research as described in "A Mathematical Framework for Transformer Circuits".
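To make the kept and removed components concrete, here is a minimal sketch of what a single attention layer of this kind could look like. It is not the implementation used by this model: the class name `AttentionLayerSketch`, the bias-free projections, and the padding-mask convention (1 for real tokens, 0 for padding) are all assumptions made for illustration.

```python
import math

import torch
import torch.nn as nn


class AttentionLayerSketch(nn.Module):
    """Hypothetical layer: causal multi-head self-attention plus a residual
    connection, with no LayerNorm, no MLP and no positional information."""

    def __init__(self, hidden_size: int = 128, num_heads: int = 4):
        super().__init__()
        assert hidden_size % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = hidden_size // num_heads
        # Assumption: bias-free projections.
        self.q_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.k_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.v_proj = nn.Linear(hidden_size, hidden_size, bias=False)
        self.o_proj = nn.Linear(hidden_size, hidden_size, bias=False)

    def forward(self, hidden_states, attention_mask=None):
        b, t, d = hidden_states.shape

        def split_heads(x):
            # (b, t, d) -> (b, heads, t, head_dim)
            return x.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)

        q = split_heads(self.q_proj(hidden_states))
        k = split_heads(self.k_proj(hidden_states))
        v = split_heads(self.v_proj(hidden_states))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)

        # Causal mask: position i may only attend to positions j <= i.
        neg = torch.finfo(scores.dtype).min
        causal = torch.tril(
            torch.ones(t, t, dtype=torch.bool, device=hidden_states.device)
        )
        scores = scores.masked_fill(~causal, neg)
        if attention_mask is not None:
            # Assumed convention: (b, t) mask with 1 = real token, 0 = padding.
            keep = attention_mask.bool()[:, None, None, :]
            scores = scores.masked_fill(~keep, neg)

        weights = torch.softmax(scores, dim=-1)
        mixed = (weights @ v).transpose(1, 2).reshape(b, t, d)
        # Residual connection; no normalization is applied before or after.
        return hidden_states + self.o_proj(mixed)


layer = AttentionLayerSketch(hidden_size=128, num_heads=4)
x = torch.randn(2, 5, 128)
out = layer(x, torch.ones(2, 5, dtype=torch.long))
print(out.shape)  # torch.Size([2, 5, 128])
```

With no position embeddings, the causal mask is the only source of order information in this sketch: each position sees a different prefix of the sequence, but the layer treats the tokens within that prefix as an unordered set.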

Usage

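import torch.nn as nn
from transformers import LlamaConfig, PreTrainedModel


# Note: AttentionLayer is not defined in this card.
class AttentionOnlyTransformer(PreTrainedModel):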
    """Attention-only transformer with configurable number of attention layers."""
    config_class = LlamaConfig

    def __init__(self, config: LlamaConfig):
        super().__init__(config)
        self.embed_tokens = nn.Embedding(config.vocab_size, config.hidden_size)
        self.layers = nn.ModuleList([AttentionLayer(config) for _ in range(config.num_hidden_layers)])
        self.lm_head = nn.Linear(config.hidden_size, config.vocab_size, bias=False)

    def forward(self, input_ids=None, attention_mask=None, labels=None, **kwargs):
        batch_size, seq_len = input_ids.shape
        hidden_states = self.embed_tokens(input_ids)
        assert hidden_states.shape == (batch_size, seq_len, self.config.hidden_size)
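        if attention_mask is None:
            # No mask given: treat every position as a real (non-padding) token.
            attention_mask = input_ids.new_ones(input_ids.shape)
        assert attention_mask.shape == (batch_size, seq_len)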

        for layer in self.layers:
            hidden_states = layer(hidden_states, attention_mask)
            assert hidden_states.shape == (batch_size, seq_len, self.config.hidden_size)

        logits = self.lm_head(hidden_states)
        assert logits.shape == (batch_size, seq_len, self.config.vocab_size)

        loss = None
        if labels is not None:
            shift_logits = logits[..., :-1, :].contiguous()
            shift_labels = labels[..., 1:].contiguous()
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(
                shift_logits.view(-1, shift_logits.size(-1)), shift_labels.view(-1)
            )

        return {"loss": loss, "logits": logits}


model = AttentionOnlyTransformer.from_pretrained('Butanium/simple-stories-1L4H128D-attention-only-toy-transformer')

Training Data

The model is trained on the SimpleStories dataset for next-token prediction.
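The next-token objective can be illustrated with dummy tensors. The sketch below mirrors the shift performed inside `forward`: the logits at position t are scored against the token at position t+1, so a caller can pass the input ids themselves as labels. The toy sizes and random logits are stand-ins chosen for illustration, not values from the actual model.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
batch_size, seq_len, vocab_size = 2, 6, 10  # toy sizes, not the model's

input_ids = torch.randint(0, vocab_size, (batch_size, seq_len))
# Stand-in for the model output: one score per vocabulary entry per position.
logits = torch.randn(batch_size, seq_len, vocab_size)

# Drop the last position's logits (nothing follows it) and the first label
# (nothing precedes it), so that position t is scored against token t + 1.
shift_logits = logits[:, :-1, :]
shift_labels = input_ids[:, 1:]

loss = F.cross_entropy(
    shift_logits.reshape(-1, vocab_size), shift_labels.reshape(-1)
)
print(shift_logits.shape, shift_labels.shape)
```

Each sequence of length 6 therefore contributes 5 prediction targets, and the loss is the mean cross-entropy over all of them.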

Model size: 1.16M params (Safetensors, tensor type F32)
