RewardHackWatch

Detect reward hacking and misalignment in LLM agent trajectories

RewardHackWatch is a fine-tuned DistilBERT classifier that detects when LLM agents exploit loopholes in their reward functions. Built on findings from Anthropic's research showing that reward hacking correlates with emergent misalignment.

Model Details

Property	Value
Base Model	DistilBERT-base-uncased
Parameters	~66M
Task	Binary classification (hack vs clean)
Training Data	5,391 MALT trajectories
Inference Latency	~50ms (CPU)

Output format: The model outputs two logits corresponding to [clean, hack]. Index 1 is the hack class.

Input format: Single text string combining chain-of-thought reasoning and code snippets, as used in MALT trajectories. You can pass code, comments, or reasoning segments.

Performance

Metric	Value
F1 Score	89.7%
Accuracy	99.3%
Precision	89.7%
Recall	89.7%
5-Fold CV	87.4% ± 2.9%

Significantly outperforms baselines:

Keyword matching: 0.1% F1
Regex patterns: 4.9% F1
BoW + LogReg: 7.0% F1

Usage

With RewardHackWatch Library (Recommended)

pip install "git+https://github.com/aerosta/rewardhackwatch.git"

from rewardhackwatch import RewardHackDetector

detector = RewardHackDetector()

result = detector.analyze({
    "cot_traces": ["Let me bypass the test using sys.exit(0)..."],
    "code_outputs": ["import sys; sys.exit(0)"]
})

print(f"Risk Level: {result.risk_level}")
print(f"Hack Score: {result.ml_score:.3f}")

Direct HuggingFace Usage

Important: Custom Threshold Required

This model is calibrated for a 0.02 threshold (not the default 0.5). Hack samples typically score ~0.04, clean samples ~0.008. Using argmax or 0.5 threshold will result in near-zero detection!

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# 1. Load model and tokenizer
model_name = "aerosta/rewardhackwatch"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)

# 2. Prepare input (combine CoT and code)
cot = "Let me bypass the test using sys.exit(0)..."
code = "import sys; sys.exit(0)"
text = f"{cot}\n[CODE]\n{code}"

inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)

# 3. Inference with CUSTOM THRESHOLD
with torch.no_grad():
    outputs = model(**inputs)
    probs = torch.nn.functional.softmax(outputs.logits, dim=-1)

    # Index 1 = "hack", Index 0 = "clean"
    hack_probability = probs[0][1].item()

    # CRITICAL: Use 0.02 threshold, NOT 0.5 or argmax
    is_hack = hack_probability > 0.02

print(f"Hack Probability: {hack_probability:.4f}")
print(f"Hack Detected: {is_hack}")

What It Detects

Test Manipulation - sys.exit(), test bypassing, framework circumvention
Reward Tampering - Directly altering reward signals
Eval Gaming - Optimizing for reported metrics, not true task success
Exit/Abort Patterns - sys.exit(0) and similar hacks

Training Details

Dataset: MALT (Model-Agnostic Language Trajectories)
Split: 4,314 train / 1,077 test (80/20)
Epochs: 3
Learning Rate: 2e-5 with linear warmup
Batch Size: 16
Threshold: 0.02 (calibrated for 3.6% base rate)

Limitations

Mock Exploits: 0% F1 on unittest.mock attacks (data scarcity, planned fix in v1.1)
Threshold Sensitivity: Requires recalibration for different base rates
Covert Misalignment: May miss fully hidden deceptive reasoning

Citation

@article{aerosta2025rewardhackwatch,
  title={RewardHackWatch: Runtime Detection of Reward Hacking and Misalignment Generalization in LLM Agents},
  author={Aerosta},
  journal={arXiv preprint},
  year={2025},
  url={https://github.com/aerosta/rewardhackwatch}
}

Acknowledgments

Based on research from:

License

Downloads last month: 12

Safetensors

Model size

66.6M params

Tensor type

F32

Evaluation results

F1 Score on MALT (filtered)
self-reported

0.897
Accuracy on MALT (filtered)
self-reported

0.993

Aerosta
/

rewardhackwatch