---
language:
- pl
metrics:
- f1
base_model:
- allegro/herbert-base-cased
pipeline_tag: text-classification
tags:
- safe
- safety
- ai-safety
- llm
- moderation
- classification
license: cc-by-nc-sa-4.0
datasets:
- NASK-PIB/PL-Guard
- ToxicityPrompts/PolyGuardMix
- allenai/wildguardmix
---

# HerBERT-Guard for Polish: LLM Safety Classifier

## Model Overview

HerBERT-Guard is a Polish-language safety classifier built upon [HerBERT](https://huggingface.co/allegro/herbert-base-cased), a BERT-based model pretrained on large-scale Polish corpora. It has been fine-tuned to detect safety-relevant content in Polish text, using a manually annotated dataset designed for evaluating the safety of large language models (LLMs), together with Polish translations of the [PolyGuard](https://huggingface.co/datasets/ToxicityPrompts/PolyGuardMix) and [WildGuard](https://huggingface.co/datasets/allenai/wildguardmix) datasets. The model classifies text into a taxonomy of safety categories inspired by Llama Guard. More detailed information is available in the [publication](https://arxiv.org/abs/2506.16322).

## Usage

You can use the model in a standard Hugging Face `transformers` pipeline for text classification:

```python
from transformers import pipeline

model_name = "NASK-PIB/HerBERT-PL-Guard"
classifier = pipeline("text-classification", model=model_name, tokenizer=model_name)

# Example Polish input ("How can I make a bomb at home?")
text = "Jak mogę zrobić bombę w domu?"

result = classifier(text)
print(result)
```

### Safety Categories

The model outputs **one of 15 categories**:

- `"safe"` — content is not considered safety-relevant,
- or one of the following **14 unsafe categories**, based on the Llama Guard taxonomy:

1. **S1: Violent Crimes**
2. **S2: Non-Violent Crimes**
3. **S3: Sex-Related Crimes**
4. **S4: Child Sexual Exploitation**
5. **S5: Defamation**
6. **S6: Specialized Advice**
7. **S7: Privacy**
8. **S8: Intellectual Property**
9. **S9: Indiscriminate Weapons**
10. **S10: Hate**
11. **S11: Suicide & Self-Harm**
12. **S12: Sexual Content**
13. **S13: Elections**
14. **S14: Code Interpreter Abuse**

## License

The HerBERT-PL-Guard model is licensed under the CC BY-NC-SA 4.0 license.

The model was trained on the following datasets:

- PL-Guard – the training portion of this dataset is internal and not publicly released
- PolyGuardMix – licensed under CC BY 4.0
- WildGuardMix – licensed under ODC-BY 1.0

The model is based on the pretrained `allegro/herbert-base-cased`, which is distributed under the CC BY 4.0 license. Please ensure compliance with all dataset and model licenses when using or modifying this model.

## 📚 Citation

If you use this model or the associated dataset, please cite the following paper:

```bibtex
@inproceedings{plguard2025,
  author    = {Krasnodębska, Aleksandra and Seweryn, Karolina and Łukasik, Szymon and Kusa, Wojciech},
  title     = {{PL-Guard: Benchmarking Language Model Safety for Polish}},
  booktitle = {Proceedings of the 10th Workshop on Slavic Natural Language Processing},
  year      = {2025},
  address   = {Vienna, Austria},
  publisher = {Association for Computational Linguistics}
}
```
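As a supplementary note on interpreting predictions: the string labels returned by the pipeline can be mapped back to the taxonomy above for downstream moderation logic. The sketch below is only an illustration under stated assumptions — it presumes the model emits the labels `safe` and `S1`–`S14` exactly as listed in the Safety Categories section (verify this against the `id2label` mapping in the model's `config.json`), and the `moderate` helper is a hypothetical function, not part of this repository.

```python
# Hypothetical post-processing helper. Label strings are assumed to match
# the Safety Categories section; check the model's id2label mapping.
UNSAFE_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}

def moderate(prediction: dict) -> dict:
    """Turn one pipeline result ({'label': ..., 'score': ...}) into a
    moderation decision with a human-readable category name."""
    label = prediction["label"]
    return {
        "unsafe": label != "safe",
        "category": UNSAFE_CATEGORIES.get(label, "safe"),
        "score": prediction["score"],
    }

# Example with a mocked pipeline output:
print(moderate({"label": "S9", "score": 0.97}))
# {'unsafe': True, 'category': 'Indiscriminate Weapons', 'score': 0.97}
```

This keeps the moderation decision decoupled from the pipeline call, so the same helper works whether predictions come from a single text or a batch.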