AfriLION-Base: Multilingual Language Model for African Languages

African Language Intelligence & Open NLP

GitHub | Website | Demo | Paper

Model Description

AfriLION-Base is an open-source multilingual language model specifically designed for African languages. Built on a robust transformer architecture, this model addresses the critical gap in NLP resources for low-resource African languages.

Key Features

  • 🌍 20+ African Languages: Comprehensive support for major African language families
  • 📊 Clean Training Data: Trained on carefully curated CC-100 corpora with quality filtering
  • ⚡ Efficient Architecture: Optimized for deployment in resource-constrained environments
  • 🔓 Apache 2.0 License: Fully open-source for research and commercial use
  • 🎯 Multilingual Tokenizer: Custom tokenizer designed for African language morphology

Supported Languages

West African Languages

  • Wolof (wo)
  • Fula/Fulani (ff)
  • Yoruba (yo)
  • Igbo (ig)
  • Hausa (ha)
  • Akan/Twi (ak)

East African Languages

  • Swahili (sw)
  • Luganda (lg)
  • Somali (so)
  • Amharic (am)
  • Oromo (om)

Southern African Languages

  • Zulu (zu)
  • Xhosa (xh)
  • Shona (sn)
  • Sesotho (st)

North African Languages

  • Darija/Moroccan Arabic (ary)
  • Kabyle (kab)

Training Data

The model is trained on:

  • CC-100 Corpora: Cleaned and filtered web text (100M+ tokens per language)
  • Wikipedia Dumps: High-quality encyclopedic content
  • News Articles: Contemporary written text from African news sources
  • Religious Texts: Bible translations and Islamic texts for low-resource languages

Data Processing

  1. Deduplication: Aggressive deduplication at document and paragraph levels
  2. Quality Filtering: Language identification, perplexity filtering, and heuristic-based cleaning
  3. Balancing: Stratified sampling to ensure representation across all languages
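The deduplication and quality-filtering steps above can be sketched as a simple two-stage pipeline. This is a minimal illustration only: the function names, thresholds, and heuristics are hypothetical and not the actual AfriLION processing code.

```python
import hashlib

def dedup_paragraphs(documents):
    """Step 1: drop exact-duplicate paragraphs across the corpus."""
    seen, cleaned = set(), []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            h = hashlib.sha256(para.strip().lower().encode("utf-8")).hexdigest()
            if h not in seen:
                seen.add(h)
                kept.append(para)
        if kept:  # drop documents that were entirely duplicates
            cleaned.append("\n\n".join(kept))
    return cleaned

def quality_filter(documents, min_words=5, min_alpha_ratio=0.7):
    """Step 2: heuristic cleaning — drop very short or mostly non-alphabetic text."""
    kept = []
    for doc in documents:
        words = doc.split()
        alpha_ratio = sum(w.isalpha() for w in words) / max(len(words), 1)
        if len(words) >= min_words and alpha_ratio > min_alpha_ratio:
            kept.append(doc)
    return kept

docs = [
    "Habari za asubuhi njema kwa wote",  # kept
    "Habari za asubuhi njema kwa wote",  # exact duplicate, removed in step 1
    "!!! ???",                           # too short / non-alphabetic, removed in step 2
]
clean = quality_filter(dedup_paragraphs(docs))
```

A production pipeline would add language identification and perplexity filtering (as listed above), typically with fastText-style classifiers and per-language KenLM models, but the filter-and-hash pattern is the same.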

Architecture

  • Model Type: Transformer-based encoder-decoder
  • Parameters: 350M (base model)
  • Layers: 12 encoder + 12 decoder layers
  • Hidden Size: 768
  • Attention Heads: 12
  • Vocabulary Size: 128,000 (multilingual BPE)
  • Max Sequence Length: 512 tokens
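As a sanity check, the listed hyperparameters can be turned into a back-of-the-envelope parameter count. The feed-forward dimension (4× hidden size) and tied input/output embeddings are assumptions, and biases, layer norms, and positional embeddings are ignored, so this is only a rough consistency check against the stated 350M, not an exact count.

```python
# Rough parameter estimate from the architecture table above.
# Assumes d_ff = 4 * d_model and tied embeddings; ignores biases/LayerNorm.
vocab, d_model = 128_000, 768
d_ff = 4 * d_model
enc_layers = dec_layers = 12

embeddings = vocab * d_model              # shared token embedding matrix
attn = 4 * d_model * d_model              # Q, K, V, and output projections
ffn = 2 * d_model * d_ff                  # up- and down-projection
encoder = enc_layers * (attn + ffn)
decoder = dec_layers * (2 * attn + ffn)   # self-attention plus cross-attention
total = embeddings + encoder + decoder

print(f"~{total / 1e6:.0f}M parameters")  # ~296M under these assumptions
```

This lands in the same ballpark as the stated 350M; note that with a 128k vocabulary, the embedding matrix alone accounts for roughly a third of the model.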

Usage

Installation

pip install transformers torch

Quick Start

from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LocaleNLP/afrilion-base")
model = AutoModel.from_pretrained("LocaleNLP/afrilion-base")

# Example usage: because this is an encoder-decoder model, run the
# encoder alone to get contextual token representations
text = "Habari za asubuhi"  # Swahili greeting: "Good morning"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.get_encoder()(**inputs)
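The encoder output above is one vector per token; to get a single sentence embedding, a common recipe is attention-mask-aware mean pooling. Below is a minimal sketch using NumPy with dummy arrays standing in for the real model outputs (the shapes mirror outputs.last_hidden_state and inputs["attention_mask"]); with the model loaded, you would pass the actual tensors instead.

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padded positions."""
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = mask.sum(axis=1).clip(min=1e-9)  # avoid division by zero
    return summed / counts

# Dummy stand-ins: batch of 1, sequence length 4 (last position is padding),
# hidden size 3 (the real model uses hidden size 768)
hidden = np.array([[[1.0, 2.0, 3.0],
                    [3.0, 2.0, 1.0],
                    [2.0, 2.0, 2.0],
                    [9.0, 9.0, 9.0]]])   # padded token, excluded by the mask
mask = np.array([[1, 1, 1, 0]])
sentence_vec = mean_pool(hidden, mask)   # → [[2.0, 2.0, 2.0]]
```

Masking before averaging matters because padded positions would otherwise drag the sentence vector toward arbitrary values.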

Fine-tuning Example

from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Load the base checkpoint with a sequence-to-sequence head
model = AutoModelForSeq2SeqLM.from_pretrained("LocaleNLP/afrilion-base")

# Minimal Trainer setup; train_dataset is a placeholder for your own
# tokenized dataset
training_args = TrainingArguments(output_dir="afrilion-finetuned",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()

Benchmarks

Task                       Dataset       Score
------------------------   -----------   -----
Language Modeling          CC-100 Test   TBD
Named Entity Recognition   MasakhaNER    TBD
Machine Translation        FLORES-200    TBD
Text Classification        AfriSenti     TBD

Limitations

  • Geographic Coverage: Primarily focuses on widely-spoken languages; many smaller African languages not yet included
  • Dialectal Variation: Standard varieties prioritized; dialectal variations may not be well-represented
  • Domain: Better performance on formal text; colloquial/social media text may be challenging
  • Code-Switching: Limited support for code-mixed text

Ethical Considerations

  • Bias: Training data may contain societal biases present in web text
  • Representation: Language representation reflects available digital resources, not speaker populations
  • Cultural Context: Model may not capture cultural nuances specific to different African communities

Citation

If you use this model in your research, please cite:

@misc{afrilion2026,
  title={AfriLION: African Language Intelligence and Open NLP},
  author={LocaleNLP Team},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/LocaleNLP/afrilion-base}}
}

License

This model is released under the Apache 2.0 License. See the LICENSE file for details.

Acknowledgments

  • Masakhane NLP Community for African language resources
  • Contributors to CC-100 and Wikipedia
  • Research institutions partnering on AfriLION development
  • TPU Research Cloud for compute resources

Contributing

We welcome contributions! Please see our Contributing Guidelines for details on how to:

  • Report issues
  • Submit language-specific improvements
  • Add new African languages
  • Contribute training data

LocaleNLP: Bridging Languages, Empowering Lives.
