DistilBERT Stability Classifier

Model: distilbert-base-uncased fine-tuned for LLM response classification
Available on Hugging Face: [Link to model]

Model Description

This model is a fine-tuned DistilBERT classifier designed to automatically evaluate the stability of large language models (LLMs) across different prompts. It classifies LLM responses into three categories:

  • Full Compliance – helpful and reasonable response
  • Partial Refusal – hesitant or incomplete response
  • Full Refusal – complete refusal to answer

The model was trained on the XSTest dataset (2,250 samples), using a RandomOverSampler to balance classes and an 80/10/10 train/validation/test split. It provides a lightweight, efficient, and automated alternative to manual annotation for large-scale stability evaluation.
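The sketch below illustrates this preprocessing under stated assumptions: the file name, column names, and random seeds are placeholders, since the exact training script is not included in this card.

```python
# Sketch of the class balancing and 80/10/10 split described above.
# The CSV file and its "text"/"label" columns are hypothetical.
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

df = pd.read_csv("xstest_responses.csv")  # assumed file with "text" and "label" columns

# Balance the three classes by randomly oversampling the minority classes
ros = RandomOverSampler(random_state=42)
X_res, y_res = ros.fit_resample(df[["text"]], df["label"])

# 80% train, then split the remaining 20% evenly into validation and test
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X_res, y_res, test_size=0.2, stratify=y_res, random_state=42
)
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.5, stratify=y_tmp, random_state=42
)
```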

Intended Use

  • Large-scale evaluation of LLM response stability
  • Research on exaggerated safety and refusal behaviors in LLMs
  • Benchmarking automated classifiers against manual annotations
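A minimal inference sketch with the transformers pipeline API is shown below. The Hub ID is a placeholder, and the label strings are assumptions based on the three categories above; check the model's id2label mapping in config.json for the exact names.

```python
from transformers import pipeline

# Replace with the actual Hub ID of this model
classifier = pipeline("text-classification", model="path-or-hub-id-of-this-model")

responses = [
    "Sure, here is a simple recipe for banana bread: ...",
    "I'm sorry, but I can't help with that request.",
]
for response in responses:
    print(classifier(response))
    # e.g. [{'label': 'Full Refusal', 'score': 0.98}]  (label names assumed)
```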

Evaluation

The classifier achieves high agreement with manual annotations (measured as exact-match accuracy), especially on DeepSeek-R1 and Llama-3.3 responses. Some misclassifications occur for complex prompts, such as “Contradictory Moral” prompts, and full refusals are occasionally labeled as partial refusals.

Performance metrics (macro-averaged):

  • Precision: 0.9766
  • Recall: 0.9755
  • F1-score: 0.9756
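
For reference, macro-averaged metrics of this kind can be computed with scikit-learn as sketched below; the labels shown are illustrative, not the actual evaluation data.

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

y_true = ["Full Compliance", "Full Refusal", "Partial Refusal"]   # manual annotations
y_pred = ["Full Compliance", "Full Refusal", "Full Refusal"]      # classifier output

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"Exact-match accuracy: {accuracy_score(y_true, y_pred):.4f}")
print(f"Macro precision: {precision:.4f}  recall: {recall:.4f}  F1: {f1:.4f}")
```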