# xlm-twitter-stormfront
This is an encoder model based on XLM-T, domain-adapted via continued masked language modeling (MLM) pretraining on posts from Stormfront, a white supremacist online forum. The model was developed as part of the research presented in *IYKYK: Using language models to decode extremist cryptolects* (de Kock et al., 2025).
## Description
The model builds on the XLM-T architecture (Barbieri et al., 2022), which is a variant of XLM-R adapted to multilingual social media data. We further fine-tuned it on approximately 9 million posts from Stormfront using an MLM objective to better capture the specialized vocabulary and cryptolects used within this extremist community.
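As a rough illustration of the MLM head in use, the sketch below queries the model through the `fill-mask` pipeline. It assumes the model is published on the Hugging Face Hub; the id `xlm-twitter-stormfront` used here is a placeholder for the actual repository path.

```python
from transformers import pipeline

# Placeholder Hub id; substitute the actual repository path.
fill = pipeline("fill-mask", model="xlm-twitter-stormfront")

# XLM-R-family tokenizers use "<mask>" as the mask token.
for pred in fill("I read this <mask> every day."):
    print(f"{pred['token_str']!r}  (score: {pred['score']:.3f})")
```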
This domain adaptation significantly improves the model's capacity to represent extremist language. However, the model is not usable for classification out of the box: it must first be fine-tuned on a labeled dataset for a specific downstream task, such as radical content detection, hate speech classification, or ideology prediction.
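Until it is fine-tuned, the most direct way to use the model is as a feature extractor. The following is a minimal sketch (again assuming the placeholder Hub id above) that mean-pools the final hidden states into one embedding vector per post:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "xlm-twitter-stormfront"  # placeholder Hub id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

posts = ["first example post", "second example post"]
batch = tokenizer(posts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over tokens, ignoring padding positions.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, 768) for an XLM-R-base backbone
```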
## Intended uses
- Research on extremist or radical language detection
- Analysis of online hate speech and coded in-group language
- Supporting moderation and intervention efforts in academic or policy contexts
## Limitations
- **Needs task-specific fine-tuning:** This encoder provides contextual representations but requires additional fine-tuning for classification or prediction tasks (see the sketch after this list).
- **Potential for misuse:** This model is intended only for research on the detection and analysis of extremist content. It should not be used to generate extremist language or to profile individuals without proper ethical safeguards.
- **Bias and representativeness:** The model is trained on data from Stormfront (primarily English posts) and may not generalize to other extremist communities or ideologies.
- **Toxic content:** The model reflects and encodes harmful language from extremist communities; its masked-token predictions may surface toxic or hateful content if misused.
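As referenced in the first limitation above, here is a minimal fine-tuning sketch for a binary classification head on top of the encoder. The Hub id, the toy two-example dataset, and the hyperparameters are all placeholders; a real application needs a properly labeled corpus and careful evaluation.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

MODEL_ID = "xlm-twitter-stormfront"  # placeholder Hub id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# num_labels=2 assumes a binary task such as radical-content detection.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Toy stand-in for a real labeled corpus ("text" + integer "label").
train_ds = Dataset.from_dict({
    "text": ["a benign example post", "an example flagged post"],
    "label": [0, 1],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_ds = train_ds.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```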
## Ethical considerations
Handling extremist text data carries significant ethical risks. This model was developed under strict research protocols and is released only for responsible academic and policy research. Repeated exposure to extremist content can be harmful; practitioners working with this model are advised to have appropriate support and mental-health safeguards in place.
## Citation
If you use this model, please cite:
```bibtex
@misc{dekock2025iykykusinglanguagemodels,
  title={IYKYK: Using language models to decode extremist cryptolects},
  author={Christine de Kock and Arij Riabi and Zeerak Talat and Michael Sejr Schlichtkrull and Pranava Madhyastha and Eduard Hovy},
  year={2025},
  eprint={2506.05635},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.05635}
}
```