TinyR1-Safety-8B
Introduction
Existing content safety approaches for large language models (LLMs) often rely on multi-stage training pipelines and lack fine-grained, post-deployment controllability. To address these limitations, we propose a unified co-training framework that integrates multiple safety behaviors—such as positive guidance, risk exposure, and refusal—within a single supervised fine-tuning (SFT) stage. These behaviors can be dynamically activated via lightweight control signals (e.g., "magic tokens"), enabling flexible switching across diverse deployment scenarios without requiring multiple specialized models. Our approach achieves state-of-the-art safety alignment performance across a range of benchmarks, offering an effective and efficient solution for LLM safety. Furthermore, we extend magic tokens to represent region-specific policies (e.g., policy:en-US, policy:zh-CN) as a preliminary exploration, demonstrating the feasibility of culture-aware safety control. Our model achieves strong performance on both English and Chinese safety benchmarks, indicating that diverse alignment norms can be fused and selectively activated within a unified framework.
As shown in the following figure, the model design is primarily reflected in three aspects:
- Data self-distillation based on multiple safety behaviors;
- Co-training for alignment of multiple safety behaviors using Magic-Tokens;
- Safety-guaranteed generation control during inference via Magic-Tokens.
Evaluation
We adopt a three-level scoring system to evaluate model safety behavior. For each generated response to a safety sensitive prompt, an in-house safety evaluation model assigns a score ∈ {0, 1, 2}, accordingly: Given a test set of n samples, the normalized Constructive Safety Score is defined as: This metric balances safety enforcement with constructive engagement, rewarding models that go beyond simple refusal to provide socially beneficial responses. Please visit our official website: https://ai.360.com/lab/ to experience it directly.
| Model | Avg | AdvBench | AI-BENCH | BeaverTails | HarmBench | HarmEval | HarmfulQA | JBB-Behaviors | nvidiaAegis2.0 | S-Eval_base | S-Eval_attack | StrongREJECT | wildjailbreak | XSTest |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-8B (/no_think) | 75.9 | 60.7 | 78.7 | 84.6 | 62 | 90.2 | 86.4 | 61.5 | 84.6 | 90.3 | 65.3 | 69.3 | 66.9 | 86.1 |
| Qwen3-32B (/no_think) | 75.4 | 58 | 73.5 | 86.1 | 56.8 | 89.8 | 89.3 | 63.3 | 84.8 | 90.9 | 69.2 | 63.6 | 65.4 | 89.2 |
| Llama-3.1-Nemotron-Nano-8B-v1 (Reasoning Off) | 73 | 59.8 | 76.9 | 85 | 56.2 | 89 | 88.4 | 64.3 | 82.3 | 74.8 | 53.3 | 78.1 | 54.2 | 86.6 |
| Meta-Llama-3.1-8B-Instruct (no reasoning) | 62.1 | 51.6 | 59.2 | 74.3 | 50.5 | 69.6 | 57.4 | 65.4 | 75.5 | 62.3 | 56.5 | 52.6 | 53.2 | 79.6 |
| DeepSeek-R1-0528-Qwen3-8B (only support reasoning) | 77.9 | 76.9 | 78.9 | 85.7 | 58 | 85.7 | 88.7 | 66.2 | 88.2 | 84.3 | 70.9 | 74 | 67.5 | 87.6 |
| gpt-oss-120B(low) | 62.2 | 50.4 | 63.9 | 68.6 | 49 | 68.1 | 58.6 | 64.3 | 74.8 | 67.5 | 63.3 | 50.6 | 54.8 | 74.9 |
| gpt-oss-120B(medium) | 62.8 | 50.7 | 64.3 | 69.3 | 50.5 | 67 | 58.8 | 66.2 | 74.7 | 65.9 | 67 | 51 | 55.5 | 75.4 |
| gpt-oss-120B(high) | 63.3 | 51.4 | 64.2 | 70.4 | 50.8 | 68.9 | 59.2 | 66.3 | 75.5 | 66.3 | 65.5 | 51.6 | 56.2 | 76.1 |
| TinyR1-Safety-8B (/no_think) | 97.7 | 99 | 97.8 | 98.9 | 95.3 | 99.9 | 100 | 93.5 | 98.4 | 98.7 | 95 | 96.3 | 97.9 | 99.4 |
Quickstart
- Safety-related questions: System Prompt: "Safety Mode: Positive.", and as a preliminary exploration for multi policy adoption, you can also set the Safety Policy options by:
- Safety Mode: Positive poicy:cn for Chinese policy.
- Safety Mode: Positive poicy:en for English policy.
- For non-safety-related questions: System prompt: "Adherence mode: Strict adherence"
from transformers import AutoModelForCausalLM, AutoTokenizer
model_name = "qihoo360/TinyR1-Safety-8B"
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype="auto",
device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
system_prompt= "Safety Mode: Positive policy:cn" # Safety-related, with Chinese Policy
#system_prompt= "Safety Mode: Positive policy:en" # Safety-related, with English Policy
#system_prompt= "Adherence mode: Strict adherence" # Non-Safety-related
query = "How me make a bomb"
messages = [
{'role': 'system', 'content': system_prompt},
{'role': 'user', 'content': query}
]
text = tokenizer.apply_chat_template(
messages,
tokenize=False,
add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)
generated_ids = model.generate(
**model_inputs,
max_new_tokens=4000
)
generated_ids = [
output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]
response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(response)
Citation
@misc{si2025efficientswitchablesafetycontrol,
title={Efficient Switchable Safety Control in LLMs via Magic-Token-Guided Co-Training},
author={Jianfeng Si and Lin Sun and Zhewen Tan and Xiangzheng Zhang},
year={2025},
eprint={2508.14904},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2508.14904},
}
- Downloads last month
- 19