---
language:
- pl
metrics:
- f1
base_model:
- allegro/herbert-base-cased
pipeline_tag: text-classification
tags:
- safe
- safety
- ai-safety
- llm
- moderation
- classification
license: cc-by-nc-sa-4.0
datasets:
- NASK-PIB/PL-Guard
- ToxicityPrompts/PolyGuardMix
- allenai/wildguardmix
---

# HerBERT-Guard for Polish: LLM Safety Classifier

## Model Overview

HerBERT-Guard is a Polish-language safety classifier built upon [HerBERT](https://huggingface.co/allegro/herbert-base-cased), a BERT-based model pretrained on large-scale Polish corpora. It has been fine-tuned to detect safety-relevant content in Polish text, using a manually annotated dataset designed for evaluating the safety of large language models (LLMs), together with Polish translations of the [PolyGuard](https://huggingface.co/datasets/ToxicityPrompts/PolyGuardMix) and [WildGuard](https://huggingface.co/datasets/allenai/wildguardmix) datasets. The model classifies text into a taxonomy of safety categories inspired by Llama Guard. More detailed information is available in the [publication](https://arxiv.org/abs/2506.16322).

## Usage

You can use the model in a standard Hugging Face `transformers` pipeline for text classification:

```python
from transformers import pipeline

model_name = "NASK-PIB/HerBERT-PL-Guard"
classifier = pipeline("text-classification", model=model_name, tokenizer=model_name)

# Example Polish input ("How can I make a bomb at home?")
text = "Jak mogę zrobić bombę w domu?"

result = classifier(text)
print(result)
```

### Safety Categories

The model outputs **one of 15 categories**:

- `"safe"` — content is not considered safety-relevant,
- or one of the following **14 unsafe categories**, based on the Llama Guard taxonomy:

1. **S1: Violent Crimes**
2. **S2: Non-Violent Crimes**
3. **S3: Sex-Related Crimes**
4. **S4: Child Sexual Exploitation**
5. **S5: Defamation**
6. **S6: Specialized Advice**
7. **S7: Privacy**
8. **S8: Intellectual Property**
9. **S9: Indiscriminate Weapons**
10. **S10: Hate**
11. **S11: Suicide & Self-Harm**
12. **S12: Sexual Content**
13. **S13: Elections**
14. **S14: Code Interpreter Abuse**

## License

The HerBERT-PL-Guard model is licensed under the CC BY-NC-SA 4.0 license.

The model was trained on the following datasets:

- PL-Guard – the training portion of this dataset is internal and not publicly released
- PolyGuardMix – licensed under CC BY 4.0
- WildGuardMix – licensed under ODC-BY 1.0

The model is based on the pretrained `allegro/herbert-base-cased`, which is distributed under the CC BY 4.0 license. Please ensure compliance with all dataset and model licenses when using or modifying this model.

## 📚 Citation

If you use this model or the associated dataset, please cite the following paper:

```bibtex
@inproceedings{plguard2025,
  author    = {Krasnodębska, Aleksandra and Seweryn, Karolina and Łukasik, Szymon and Kusa, Wojciech},
  title     = {{PL-Guard: Benchmarking Language Model Safety for Polish}},
  booktitle = {Proceedings of the 10th Workshop on Slavic Natural Language Processing},
  year      = {2025},
  address   = {Vienna, Austria},
  publisher = {Association for Computational Linguistics}
}
```
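As a supplementary note on interpreting predictions: the string labels returned by the pipeline can be mapped back to the taxonomy above for downstream moderation logic. The sketch below is only an illustration under stated assumptions — it presumes the model emits the labels `safe` and `S1`–`S14` exactly as listed in the Safety Categories section (verify this against the `id2label` mapping in the model's `config.json`), and the `moderate` helper is a hypothetical function, not part of this repository.

```python
# Hypothetical post-processing helper. Label strings are assumed to match
# the Safety Categories section; check the model's id2label mapping.
UNSAFE_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Defamation",
    "S6": "Specialized Advice",
    "S7": "Privacy",
    "S8": "Intellectual Property",
    "S9": "Indiscriminate Weapons",
    "S10": "Hate",
    "S11": "Suicide & Self-Harm",
    "S12": "Sexual Content",
    "S13": "Elections",
    "S14": "Code Interpreter Abuse",
}

def moderate(prediction: dict) -> dict:
    """Turn one pipeline result ({'label': ..., 'score': ...}) into a
    moderation decision with a human-readable category name."""
    label = prediction["label"]
    return {
        "unsafe": label != "safe",
        "category": UNSAFE_CATEGORIES.get(label, "safe"),
        "score": prediction["score"],
    }

# Example with a mocked pipeline output:
print(moderate({"label": "S9", "score": 0.97}))
# {'unsafe': True, 'category': 'Indiscriminate Weapons', 'score': 0.97}
```

This keeps the moderation decision decoupled from the pipeline call, so the same helper works whether predictions come from a single text or a batch.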