# 🧩 ModernBERT-base Fine-tuned for Harmful Prompt Classification
A binary classifier fine-tuned on the WildGuardMix dataset to detect harmful or unsafe prompts.
Built on answerdotai/ModernBERT-base with flash attention for efficient inference.
## 🧠 Model Overview
- Task: Harmful prompt detection (binary classification)
- Labels:
  - `1` → Harmful / Unsafe
  - `0` → Safe / Non-harmful
## 📊 Performance (Test Set)
| Metric | Score |
|---|---|
| Accuracy | 95.9% |
| F1 Score | 96.21% |
| Precision | 96.39% |
| Recall | 96.21% |
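
These appear to be standard binary-classification metrics over the held-out test split. A minimal sketch of how such numbers could be reproduced with scikit-learn, assuming `y_true` and `y_pred` hold the gold and predicted harm labels (0/1) for the test set (the short lists below are illustrative placeholders only):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative placeholders: gold and predicted harm labels
# (0 = safe / non-harmful, 1 = harmful / unsafe) for the test split.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 1, 0, 0]

print(f"Accuracy:  {accuracy_score(y_true, y_pred):.4f}")
print(f"F1 Score:  {f1_score(y_true, y_pred):.4f}")
print(f"Precision: {precision_score(y_true, y_pred):.4f}")
print(f"Recall:    {recall_score(y_true, y_pred):.4f}")
```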
## ⚙️ Training Details
- Dataset: `allenai/wildguardmix` (`wildguardtrain` subset)
- Split:
  - 80/20 train/test
  - 90/10 train/validation (from the training set)
  - Stratified on: prompt harm label, adversarial flag, and subcategory
- Optimizer: AdamW (8-bit)
- Learning Rate: 1e-4 (cosine schedule, 10% warmup)
- Batch Size: 96
- Max Sequence Length: 256 tokens
- Epochs: 3
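
The hyperparameters above could map onto a Hugging Face `Trainer` configuration roughly as follows. This is a sketch under the stated settings, not the actual training script: the 8-bit AdamW optimizer assumes `bitsandbytes` is installed, and the `prompt` column name and `bf16` flag are assumptions.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
)

base = "answerdotai/ModernBERT-base"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSequenceClassification.from_pretrained(base, num_labels=2)

def preprocess(batch):
    # Truncate to the 256-token maximum sequence length noted above;
    # the "prompt" column name is an assumption about the dataset schema.
    return tokenizer(batch["prompt"], truncation=True, max_length=256)

args = TrainingArguments(
    output_dir="modernbert-wildguardmix-classifier",
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    warmup_ratio=0.1,                # 10% warmup
    per_device_train_batch_size=96,
    num_train_epochs=3,
    optim="adamw_bnb_8bit",          # 8-bit AdamW (requires bitsandbytes)
    bf16=True,                       # assumption: mixed precision on a recent GPU
)

# trainer = Trainer(model=model, args=args, train_dataset=..., eval_dataset=...)
# trainer.train()
```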
## 🎯 Intended Use
This model is designed for binary classification of text prompts (see the usage sketch after this list) as:
- Harmful (1) → unsafe or toxic content
- Unharmful (0) → safe or benign content
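
A minimal inference sketch with `transformers`, loading the checkpoint from this page. The example prompt and printed label strings are illustrative; the 0/1 mapping follows the list above rather than a guaranteed `id2label` entry in the model config.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "Jazhyc/modernbert-wildguardmix-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id)
model.eval()

prompt = "How do I pick the lock on my neighbor's front door?"
inputs = tokenizer(prompt, truncation=True, max_length=256, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits

pred = logits.argmax(dim=-1).item()
# Assumed mapping: 1 = harmful / unsafe, 0 = safe / non-harmful
print("harmful" if pred == 1 else "safe")
```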
⚠️ Disclaimer:
This model should not be deployed in production systems without additional evaluation and alignment with domain-specific safety and ethical guidelines.