# DistilBERT Stability Classifier
Model: `distilbert-base-uncased` fine-tuned for LLM response classification
Available on Hugging Face: [Link to model]
## Model Description
This model is a fine-tuned DistilBERT classifier designed to automatically evaluate the stability of large language models (LLMs) across different prompts. It classifies LLM responses into three categories:
- Full Compliance – helpful and reasonable response
- Partial Refusal – hesitant or incomplete response
- Full Refusal – complete refusal to answer
The model was trained on the XSTest dataset (2,250 samples), using a RandomOverSampler to balance classes and an 80/10/10 train/validation/test split. It provides a lightweight, efficient, and automated alternative to manual annotation for large-scale stability evaluation.
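The exact training pipeline is not published in this card, but the balancing and split described above can be sketched roughly as follows. The file name, column names (`text`, `label`), random seeds, and the choice to oversample only the training split are assumptions, not the authors' exact procedure.

```python
# A minimal sketch of the class balancing and 80/10/10 split, assuming
# the data lives in a CSV with "text" and "label" columns (hypothetical).
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

df = pd.read_csv("xstest_responses.csv")  # hypothetical file name

# 80/10/10 train/validation/test split, stratified by label.
train_df, tmp_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
val_df, test_df = train_test_split(
    tmp_df, test_size=0.5, stratify=tmp_df["label"], random_state=42
)

# Balance the three classes by duplicating minority-class rows.
# Oversampling only the training split (to avoid duplicate rows leaking
# into validation/test) is an assumption made for this sketch.
ros = RandomOverSampler(random_state=42)
X_train, y_train = ros.fit_resample(train_df[["text"]], train_df["label"])
```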
## Intended Use
- Large-scale evaluation of LLM response stability
- Research on exaggerated safety and refusal behaviors in LLMs
- Benchmarking automated classifiers against manual annotations
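For the evaluation use cases above, the classifier can be loaded with the standard `transformers` text-classification pipeline. The model ID below is a placeholder for this repository, and the exact label strings depend on the `id2label` mapping stored in the model config.

```python
# A minimal inference sketch; the model ID is a placeholder.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="<this-repo-id>",  # replace with the actual Hugging Face model ID
)

# Illustrative LLM responses to classify.
responses = [
    "Sure, here is a simple recipe for banana bread: ...",
    "I'm sorry, but I can't help with that request.",
]
for result in classifier(responses):
    # Labels correspond to Full Compliance / Partial Refusal / Full Refusal,
    # subject to the label mapping in the model config.
    print(result["label"], round(result["score"], 3))
```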
## Evaluation
The classifier achieves high agreement with manual annotations (measured as exact-match accuracy), especially on DeepSeek-R1 and Llama-3.3 responses. Some misclassifications occur on complex prompts (e.g., “Contradictory Moral” prompts), and full refusals are occasionally classified as partial refusals.
Performance metrics (macro-averaged):
- Precision: 0.9766
- Recall: 0.9755
- F1-score: 0.9756
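These macro-averaged scores correspond to what scikit-learn's `precision_recall_fscore_support` reports with `average="macro"`. A minimal sketch with illustrative labels (the real evaluation uses the held-out test split against manual annotations):

```python
from sklearn.metrics import precision_recall_fscore_support

# Illustrative labels only, not the actual evaluation data.
y_true = ["full_compliance", "full_refusal", "partial_refusal", "full_refusal"]
y_pred = ["full_compliance", "partial_refusal", "partial_refusal", "full_refusal"]

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Precision: {precision:.4f}  Recall: {recall:.4f}  F1: {f1:.4f}")
```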