# AfriLION-Base: Multilingual Language Model for African Languages

## Model Description

AfriLION-Base is an open-source multilingual language model designed for African languages. Built on a transformer encoder-decoder architecture, it addresses the critical gap in NLP resources for low-resource African languages.
## Key Features

- **20+ African Languages**: Comprehensive support for major African language families
- **Clean Training Data**: Trained on carefully curated CC-100 corpora with quality filtering
- **Efficient Architecture**: Optimized for deployment in resource-constrained environments
- **Apache 2.0 License**: Fully open source for research and commercial use
- **Multilingual Tokenizer**: Custom tokenizer designed for African language morphology
## Supported Languages

### West African Languages
- Wolof (wo)
- Fula/Fulani (ff)
- Yoruba (yo)
- Igbo (ig)
- Hausa (ha)
- Akan/Twi (ak)
### East African Languages
- Swahili (sw)
- Luganda (lg)
- Somali (so)
- Amharic (am)
- Oromo (om)
### Southern African Languages
- Zulu (zu)
- Xhosa (xh)
- Shona (sn)
- Sesotho (st)
### North African Languages
- Darija/Moroccan Arabic (ary)
- Kabyle (kab)
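For downstream tooling it can be convenient to have the codes above in one place. The following is an illustrative helper only (not part of the model's API), with names and codes taken directly from the list above:

```python
# Language codes from the supported-languages list above, grouped by region.
# Illustrative lookup table only; not part of the model's API.
SUPPORTED_LANGUAGES = {
    # West African
    "wo": "Wolof", "ff": "Fula", "yo": "Yoruba",
    "ig": "Igbo", "ha": "Hausa", "ak": "Akan",
    # East African
    "sw": "Swahili", "lg": "Luganda", "so": "Somali",
    "am": "Amharic", "om": "Oromo",
    # Southern African
    "zu": "Zulu", "xh": "Xhosa", "sn": "Shona", "st": "Sesotho",
    # North African
    "ary": "Moroccan Arabic (Darija)", "kab": "Kabyle",
}

def is_supported(code: str) -> bool:
    """Check whether a language code appears in this card's list."""
    return code in SUPPORTED_LANGUAGES
```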
## Training Data

The model is trained on:

- **CC-100 Corpora**: Cleaned and filtered web text (100M+ tokens per language)
- **Wikipedia Dumps**: High-quality encyclopedic content
- **News Articles**: Contemporary written text from African news sources
- **Religious Texts**: Bible translations and Islamic texts for low-resource languages
### Data Processing

- **Deduplication**: Aggressive deduplication at the document and paragraph levels
- **Quality Filtering**: Language identification, perplexity filtering, and heuristic-based cleaning
- **Balancing**: Stratified sampling to ensure representation across all languages
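The exact pipeline is not published; as a rough illustration of the paragraph-level deduplication step, a hash-based sketch might look like this (function name and corpus format are assumptions):

```python
import hashlib

def dedup_paragraphs(documents):
    """Drop exact-duplicate paragraphs across a corpus.

    Hash-based sketch of the paragraph-level deduplication step
    described above; the actual pipeline is not published.
    """
    seen = set()
    cleaned = []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            # Normalize before hashing so trivial variants collapse together
            key = hashlib.sha256(para.strip().lower().encode("utf-8")).hexdigest()
            if para.strip() and key not in seen:
                seen.add(key)
                kept.append(para)
        if kept:  # documents left empty by deduplication are dropped
            cleaned.append("\n\n".join(kept))
    return cleaned
```

A real pipeline would typically combine this with near-duplicate detection (e.g. MinHash) rather than exact hashing alone.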
## Architecture

- **Model Type**: Transformer-based encoder-decoder
- **Parameters**: 350M (base model)
- **Layers**: 12 encoder + 12 decoder
- **Hidden Size**: 768
- **Attention Heads**: 12
- **Vocabulary Size**: 128,000 (multilingual BPE)
- **Max Sequence Length**: 512 tokens
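The configuration above can be cross-checked with a back-of-the-envelope parameter estimate. This sketch assumes a standard Transformer with FFN dimension 4 × hidden and ignores biases, layer norms, and any untied output head, so it lands somewhat below the stated 350M:

```python
# Rough parameter estimate from the configuration listed above.
# Assumes FFN dim = 4 * hidden; ignores biases, layer norms, and any
# untied LM head, so this is a sanity check, not an exact count.
vocab, hidden, ffn = 128_000, 768, 4 * 768
enc_layers = dec_layers = 12

embedding = vocab * hidden                  # shared token embeddings
attn = 4 * hidden * hidden                  # Q, K, V, output projections
ffn_params = 2 * hidden * ffn               # up- and down-projection
enc = enc_layers * (attn + ffn_params)
dec = dec_layers * (2 * attn + ffn_params)  # self- plus cross-attention

total = embedding + enc + dec
print(f"~{total / 1e6:.0f}M parameters")    # → ~296M parameters
```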
## Usage

### Installation

```bash
pip install transformers torch
```
### Quick Start

```python
from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LocaleNLP/afrilion-base")
model = AutoModel.from_pretrained("LocaleNLP/afrilion-base")

# Example usage
text = "Habari za asubuhi"  # Swahili: "Good morning"
inputs = tokenizer(text, return_tensors="pt")
# Encoder-decoder models also need decoder inputs for a forward pass
outputs = model(**inputs, decoder_input_ids=inputs["input_ids"])
```
### Fine-tuning Example

```python
from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Load the base checkpoint for a downstream sequence-to-sequence task
model = AutoModelForSeq2SeqLM.from_pretrained("LocaleNLP/afrilion-base")

# Minimal Trainer setup; supply your own tokenized dataset
training_args = TrainingArguments(output_dir="./afrilion-finetuned")
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()
```
## Benchmarks
| Task | Dataset | Score |
|---|---|---|
| Language Modeling | CC-100 Test | TBD |
| Named Entity Recognition | MasakhaNER | TBD |
| Machine Translation | FLORES-200 | TBD |
| Text Classification | AfriSenti | TBD |
## Limitations

- **Geographic Coverage**: Primarily focuses on widely spoken languages; many smaller African languages are not yet included
- **Dialectal Variation**: Standard varieties are prioritized; dialectal variation may be under-represented
- **Domain**: Performs better on formal text; colloquial and social-media text may be challenging
- **Code-Switching**: Limited support for code-mixed text
## Ethical Considerations

- **Bias**: Training data may contain societal biases present in web text
- **Representation**: Language representation reflects available digital resources, not speaker populations
- **Cultural Context**: The model may not capture cultural nuances specific to different African communities
## Citation

If you use this model in your research, please cite:

```bibtex
@misc{afrilion2026,
  title={AfriLION: African Language Intelligence and Open NLP},
  author={LocaleNLP Team},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/LocaleNLP/afrilion-base}}
}
```
## License
This model is released under the Apache 2.0 License. See the LICENSE file for details.
## Acknowledgments
- Masakhane NLP Community for African language resources
- Contributors to CC-100 and Wikipedia
- Research institutions partnering on AfriLION development
- TPU Research Cloud for compute resources
## Contact

- **Organization**: LocaleNLP
- **Email**: info@localenlp.com
- **Website**: https://localenlp.com
- **GitHub**: https://github.com/LocaleNLP/afrilion
## Contributing
We welcome contributions! Please see our Contributing Guidelines for details on how to:
- Report issues
- Submit language-specific improvements
- Add new African languages
- Contribute training data
*LocaleNLP: Bridging Languages, Empowering Lives.*