RewardHackWatch
Detect reward hacking and misalignment in LLM agent trajectories
RewardHackWatch is a fine-tuned DistilBERT classifier that detects when LLM agents exploit loopholes in their reward functions. Built on findings from Anthropic's research showing that reward hacking correlates with emergent misalignment.
Model Details
| Property | Value |
|---|---|
| Base Model | DistilBERT-base-uncased |
| Parameters | ~66M |
| Task | Binary classification (hack vs clean) |
| Training Data | 5,391 MALT trajectories |
| Inference Latency | ~50ms (CPU) |
Output format: The model outputs two logits corresponding to [clean, hack]. Index 1 is the hack class.
Input format: Single text string combining chain-of-thought reasoning and code snippets, as used in MALT trajectories. You can pass code, comments, or reasoning segments.
Performance
| Metric | Value |
|---|---|
| F1 Score | 89.7% |
| Accuracy | 99.3% |
| Precision | 89.7% |
| Recall | 89.7% |
| 5-Fold CV | 87.4% ± 2.9% |
Significantly outperforms baselines:
- Keyword matching: 0.1% F1
- Regex patterns: 4.9% F1
- BoW + LogReg: 7.0% F1
Usage
With RewardHackWatch Library (Recommended)
pip install "git+https://github.com/aerosta/rewardhackwatch.git"
from rewardhackwatch import RewardHackDetector
detector = RewardHackDetector()
result = detector.analyze({
"cot_traces": ["Let me bypass the test using sys.exit(0)..."],
"code_outputs": ["import sys; sys.exit(0)"]
})
print(f"Risk Level: {result.risk_level}")
print(f"Hack Score: {result.ml_score:.3f}")
Direct HuggingFace Usage
Important: Custom Threshold Required
This model is calibrated for a 0.02 threshold (not the default 0.5). Hack samples typically score ~0.04, clean samples ~0.008. Using argmax or 0.5 threshold will result in near-zero detection!
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
# 1. Load model and tokenizer
model_name = "aerosta/rewardhackwatch"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
# 2. Prepare input (combine CoT and code)
cot = "Let me bypass the test using sys.exit(0)..."
code = "import sys; sys.exit(0)"
text = f"{cot}\n[CODE]\n{code}"
inputs = tokenizer(text, return_tensors="pt", truncation=True, max_length=512)
# 3. Inference with CUSTOM THRESHOLD
with torch.no_grad():
outputs = model(**inputs)
probs = torch.nn.functional.softmax(outputs.logits, dim=-1)
# Index 1 = "hack", Index 0 = "clean"
hack_probability = probs[0][1].item()
# CRITICAL: Use 0.02 threshold, NOT 0.5 or argmax
is_hack = hack_probability > 0.02
print(f"Hack Probability: {hack_probability:.4f}")
print(f"Hack Detected: {is_hack}")
What It Detects
- Test Manipulation - sys.exit(), test bypassing, framework circumvention
- Reward Tampering - Directly altering reward signals
- Eval Gaming - Optimizing for reported metrics, not true task success
- Exit/Abort Patterns - sys.exit(0) and similar hacks
Training Details
- Dataset: MALT (Model-Agnostic Language Trajectories)
- Split: 4,314 train / 1,077 test (80/20)
- Epochs: 3
- Learning Rate: 2e-5 with linear warmup
- Batch Size: 16
- Threshold: 0.02 (calibrated for 3.6% base rate)
Limitations
- Mock Exploits: 0% F1 on unittest.mock attacks (data scarcity, planned fix in v1.1)
- Threshold Sensitivity: Requires recalibration for different base rates
- Covert Misalignment: May miss fully hidden deceptive reasoning
Links
- GitHub: github.com/aerosta/rewardhackwatch
- Paper: RewardHackWatch: Runtime Detection of Reward Hacking and Misalignment Generalization
Citation
@article{aerosta2025rewardhackwatch,
title={RewardHackWatch: Runtime Detection of Reward Hacking and Misalignment Generalization in LLM Agents},
author={Aerosta},
journal={arXiv preprint},
year={2025},
url={https://github.com/aerosta/rewardhackwatch}
}
Acknowledgments
Based on research from:
- Anthropic: Natural emergent misalignment from reward hacking (2025)
- OpenAI: Monitoring reasoning models for misbehavior (2025)
License
MIT License - Copyright (c) 2025 Aerosta
- Downloads last month
- 12
Evaluation results
- F1 Score on MALT (filtered)self-reported0.897
- Accuracy on MALT (filtered)self-reported0.993