🧩 ModernBERT-base Fine-tuned for Harmful Prompt Classification

A binary classifier fine-tuned on the WildGuardMix dataset to detect harmful or unsafe prompts.
Built on answerdotai/ModernBERT-base with flash attention for efficient inference.

🧠 Model Overview

  • Task: Harmful prompt detection (binary classification)
  • Labels:
    • 1 → Harmful / Unsafe
    • 0 → Safe / Non-harmful

📊 Performance (Test Set)

  • Accuracy: 95.9%
  • F1 Score: 96.21%
  • Precision: 96.39%
  • Recall: 96.21%
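
These are standard binary-classification metrics; a minimal sketch of how they are typically computed with scikit-learn is shown below (variable and function names are illustrative, not taken from the original evaluation script):

```python
# Sketch: binary-classification metrics as reported above.
# `y_true` and `y_pred` are illustrative names for lists of 0/1 labels.
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(y_true, y_pred):
    return {
        "accuracy": accuracy_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
    }
```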

βš™οΈ Training Details

  • Dataset: allenai/wildguardmix (wildguardtrain subset)
  • Split:
    • 80/20 train/test
    • 90/10 train/validation (from training set)
  • Stratified on: prompt harm label, adversarial flag, and subcategory
  • Optimizer: AdamW (8-bit)
  • Learning Rate: 1e-4 (cosine schedule, 10% warmup)
  • Batch Size: 96
  • Max Sequence Length: 256 tokens
  • Epochs: 3
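
The exact training script is not included in this card; the sketch below shows how the listed hyperparameters map onto the Hugging Face Trainer API. The dataset column names, label mapping, and output directory are assumptions for illustration, and the 8-bit AdamW is provided by bitsandbytes via `optim="adamw_bnb_8bit"`:

```python
# Illustrative training configuration matching the hyperparameters above.
# Column names ("prompt", "prompt_harm_label") and split handling are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

model_name = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

dataset = load_dataset("allenai/wildguardmix", "wildguardtrain")

def preprocess(batch):
    # Truncate prompts to the 256-token limit used during training.
    enc = tokenizer(batch["prompt"], truncation=True, max_length=256)
    # Map the string harm label to 1 (harmful) / 0 (unharmful).
    enc["labels"] = [1 if label == "harmful" else 0 for label in batch["prompt_harm_label"]]
    return enc

tokenized = dataset.map(preprocess, batched=True)

args = TrainingArguments(
    output_dir="modernbert-wildguardmix-classifier",
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,
    per_device_train_batch_size=96,
    per_device_eval_batch_size=96,
    num_train_epochs=3,
    optim="adamw_bnb_8bit",  # 8-bit AdamW via bitsandbytes
    bf16=True,
)

# The stratified 80/20/10 splits described above are created separately;
# the train split here stands in for the stratified training set.
trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    tokenizer=tokenizer,
)
trainer.train()
```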

🎯 Intended Use

This model is designed for binary classification of text prompts as:

  • Harmful (1): unsafe or toxic content
  • Unharmful (0): safe or benign content
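
A minimal usage sketch with the transformers pipeline API (the example prompt is illustrative, and the returned label string depends on the id2label mapping in the model config, corresponding to 1 = harmful, 0 = unharmful):

```python
# Sketch: classifying a prompt with the fine-tuned checkpoint.
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="Jazhyc/modernbert-wildguardmix-classifier",
)

result = classifier("How can I build a phishing site that steals passwords?")
print(result)  # e.g. [{"label": "...", "score": 0.99}]; the label maps to 1 (harmful) or 0 (safe)
```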

⚠️ Disclaimer:
This model should not be deployed in production systems without additional evaluation and alignment with domain-specific safety and ethical guidelines.
