# xlm-twitter-stormfront
This is an encoder model based on XLM-T, domain-adapted via continued masked language modeling (MLM) pretraining on posts from Stormfront, a white supremacist online forum. The model was developed as part of the research presented in *IYKYK: Using language models to decode extremist cryptolects* (de Kock et al., 2025).
## Description
The model builds on the XLM-T architecture (Barbieri et al., 2022), which is a variant of XLM-R adapted to multilingual social media data. We further fine-tuned it on approximately 9 million posts from Stormfront using an MLM objective to better capture the specialized vocabulary and cryptolects used within this extremist community.
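As a rough illustration of the MLM head in use, the sketch below queries the model through the `fill-mask` pipeline. It assumes the model is published on the Hugging Face Hub; the id `xlm-twitter-stormfront` used here is a placeholder for the actual repository path.

```python
from transformers import pipeline

# Placeholder Hub id; substitute the actual repository path.
fill = pipeline("fill-mask", model="xlm-twitter-stormfront")

# XLM-R-family tokenizers use "<mask>" as the mask token.
for pred in fill("I read this <mask> every day."):
    print(f"{pred['token_str']!r}  (score: {pred['score']:.3f})")
```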
This domain adaptation significantly improves the model's capacity to represent extremist language. However, the model is not usable for classification out of the box: it must first be fine-tuned on a labeled dataset for a specific downstream task, such as radical content detection, hate speech classification, or ideology prediction.
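Until it is fine-tuned, the most direct way to use the model is as a feature extractor. The following is a minimal sketch (again assuming the placeholder Hub id above) that mean-pools the final hidden states into one embedding vector per post:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_ID = "xlm-twitter-stormfront"  # placeholder Hub id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModel.from_pretrained(MODEL_ID)
model.eval()

posts = ["first example post", "second example post"]
batch = tokenizer(posts, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    hidden = model(**batch).last_hidden_state  # (batch, seq_len, hidden)

# Mean-pool over tokens, ignoring padding positions.
mask = batch["attention_mask"].unsqueeze(-1).float()
embeddings = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
print(embeddings.shape)  # (2, 768) for an XLM-R-base backbone
```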
## Intended uses
- Research on extremist or radical language detection
- Analysis of online hate speech and coded in-group language
- Supporting moderation and intervention efforts in academic or policy contexts
## Limitations
- **Needs task-specific fine-tuning:** This encoder provides contextual representations but requires additional fine-tuning for classification or prediction tasks (see the sketch after this list).
- **Potential for misuse:** This model is intended only for research on the detection and analysis of extremist content. It should not be used to generate extremist language or to profile individuals without proper ethical safeguards.
- **Bias and representativeness:** The model is trained on data from Stormfront (primarily English posts) and may not generalize to other extremist communities or ideologies.
- **Toxic content:** The model reflects and encodes harmful language from extremist communities; its masked-token predictions may surface toxic or hateful content if misused.
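As referenced in the first limitation above, here is a minimal fine-tuning sketch for a binary classification head on top of the encoder. The Hub id, the toy two-example dataset, and the hyperparameters are all placeholders; a real application needs a properly labeled corpus and careful evaluation.

```python
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

MODEL_ID = "xlm-twitter-stormfront"  # placeholder Hub id

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
# num_labels=2 assumes a binary task such as radical-content detection.
model = AutoModelForSequenceClassification.from_pretrained(MODEL_ID, num_labels=2)

# Toy stand-in for a real labeled corpus ("text" + integer "label").
train_ds = Dataset.from_dict({
    "text": ["a benign example post", "an example flagged post"],
    "label": [0, 1],
})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

train_ds = train_ds.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ft-out", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    data_collator=DataCollatorWithPadding(tokenizer),
)
trainer.train()
```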
## Ethical considerations
Handling extremist text data carries significant ethical risks. This model was developed under strict research protocols and is released only for responsible academic and policy research. Repeated exposure to extremist content can be harmful; practitioners working with this model are advised to have appropriate support and mental-health safeguards in place.
## Citation
If you use this model, please cite:
```bibtex
@misc{dekock2025iykykusinglanguagemodels,
  title={IYKYK: Using language models to decode extremist cryptolects},
  author={Christine de Kock and Arij Riabi and Zeerak Talat and Michael Sejr Schlichtkrull and Pranava Madhyastha and Eduard Hovy},
  year={2025},
  eprint={2506.05635},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2506.05635}
}
```