AfriLION-Base: Multilingual Language Model for African Languages

African Language Intelligence & Open NLP

GitHub | Website | Demo | Paper

Model Description

AfriLION-Base is an open-source multilingual language model specifically designed for African languages. Built on a robust transformer architecture, this model addresses the critical gap in NLP resources for low-resource African languages.

Key Features

  • 🌍 20+ African Languages: Comprehensive support for major African language families
  • 📊 Clean Training Data: Trained on carefully curated CC-100 corpora with quality filtering
  • ⚡ Efficient Architecture: Optimized for deployment in resource-constrained environments
  • 🔓 Apache 2.0 License: Fully open-source for research and commercial use
  • 🎯 Multilingual Tokenizer: Custom tokenizer designed for African language morphology

Supported Languages

West African Languages

  • Wolof (wo)
  • Fula/Fulani (ff)
  • Yoruba (yo)
  • Igbo (ig)
  • Hausa (ha)
  • Akan/Twi (ak)

East African Languages

  • Swahili (sw)
  • Luganda (lg)
  • Somali (so)
  • Amharic (am)
  • Oromo (om)

Southern African Languages

  • Zulu (zu)
  • Xhosa (xh)
  • Shona (sn)
  • Sesotho (st)

North African Languages

  • Darija/Moroccan Arabic (ary)
  • Kabyle (kab)

Training Data

The model is trained on:

  • CC-100 Corpora: Cleaned and filtered web text (100M+ tokens per language)
  • Wikipedia Dumps: High-quality encyclopedic content
  • News Articles: Contemporary written text from African news sources
  • Religious Texts: Bible translations and Islamic texts for low-resource languages

Data Processing

  1. Deduplication: Aggressive deduplication at document and paragraph levels
  2. Quality Filtering: Language identification, perplexity filtering, and heuristic-based cleaning
  3. Balancing: Stratified sampling to ensure representation across all languages
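The deduplication and quality-filtering steps above can be sketched as a simple two-stage pipeline. This is a minimal illustration only: the function names, thresholds, and heuristics are hypothetical and not the actual AfriLION processing code.

```python
import hashlib

def dedup_paragraphs(documents):
    """Step 1: drop exact-duplicate paragraphs across the corpus."""
    seen, cleaned = set(), []
    for doc in documents:
        kept = []
        for para in doc.split("\n\n"):
            h = hashlib.sha256(para.strip().lower().encode("utf-8")).hexdigest()
            if h not in seen:
                seen.add(h)
                kept.append(para)
        if kept:  # drop documents that were entirely duplicates
            cleaned.append("\n\n".join(kept))
    return cleaned

def quality_filter(documents, min_words=5, min_alpha_ratio=0.7):
    """Step 2: heuristic cleaning — drop very short or mostly non-alphabetic text."""
    kept = []
    for doc in documents:
        words = doc.split()
        alpha_ratio = sum(w.isalpha() for w in words) / max(len(words), 1)
        if len(words) >= min_words and alpha_ratio > min_alpha_ratio:
            kept.append(doc)
    return kept

docs = [
    "Habari za asubuhi njema kwa wote",  # kept
    "Habari za asubuhi njema kwa wote",  # exact duplicate, removed in step 1
    "!!! ???",                           # too short / non-alphabetic, removed in step 2
]
clean = quality_filter(dedup_paragraphs(docs))
```

A production pipeline would add language identification and perplexity filtering (as listed above), typically with fastText-style classifiers and per-language KenLM models, but the filter-and-hash pattern is the same.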

Architecture

  • Model Type: Transformer-based encoder-decoder
  • Parameters: 350M (base model)
  • Layers: 12 encoder + 12 decoder layers
  • Hidden Size: 768
  • Attention Heads: 12
  • Vocabulary Size: 128,000 (multilingual BPE)
  • Max Sequence Length: 512 tokens
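As a sanity check, the listed hyperparameters can be turned into a back-of-the-envelope parameter count. The feed-forward dimension (4× hidden size) and tied input/output embeddings are assumptions, and biases, layer norms, and positional embeddings are ignored, so this is only a rough consistency check against the stated 350M, not an exact count.

```python
# Rough parameter estimate from the architecture table above.
# Assumes d_ff = 4 * d_model and tied embeddings; ignores biases/LayerNorm.
vocab, d_model = 128_000, 768
d_ff = 4 * d_model
enc_layers = dec_layers = 12

embeddings = vocab * d_model              # shared token embedding matrix
attn = 4 * d_model * d_model              # Q, K, V, and output projections
ffn = 2 * d_model * d_ff                  # up- and down-projection
encoder = enc_layers * (attn + ffn)
decoder = dec_layers * (2 * attn + ffn)   # self-attention plus cross-attention
total = embeddings + encoder + decoder

print(f"~{total / 1e6:.0f}M parameters")  # ~296M under these assumptions
```

This lands in the same ballpark as the stated 350M; note that with a 128k vocabulary, the embedding matrix alone accounts for roughly a third of the model.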

Usage

Installation

pip install transformers torch

Quick Start

from transformers import AutoTokenizer, AutoModel

# Load model and tokenizer
tokenizer = AutoTokenizer.from_pretrained("LocaleNLP/afrilion-base")
model = AutoModel.from_pretrained("LocaleNLP/afrilion-base")

# Example usage: because this is an encoder-decoder model, run the
# encoder alone to get contextual token representations
text = "Habari za asubuhi"  # Swahili greeting: "Good morning"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.get_encoder()(**inputs)
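The encoder output above is one vector per token; to get a single sentence embedding, a common recipe is attention-mask-aware mean pooling. Below is a minimal sketch using NumPy with dummy arrays standing in for the real model outputs (the shapes mirror outputs.last_hidden_state and inputs["attention_mask"]); with the model loaded, you would pass the actual tensors instead.

```python
import numpy as np

def mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padded positions."""
    mask = attention_mask[..., None].astype(last_hidden_state.dtype)  # (batch, seq, 1)
    summed = (last_hidden_state * mask).sum(axis=1)
    counts = mask.sum(axis=1).clip(min=1e-9)  # avoid division by zero
    return summed / counts

# Dummy stand-ins: batch of 1, sequence length 4 (last position is padding),
# hidden size 3 (the real model uses hidden size 768)
hidden = np.array([[[1.0, 2.0, 3.0],
                    [3.0, 2.0, 1.0],
                    [2.0, 2.0, 2.0],
                    [9.0, 9.0, 9.0]]])   # padded token, excluded by the mask
mask = np.array([[1, 1, 1, 0]])
sentence_vec = mean_pool(hidden, mask)   # → [[2.0, 2.0, 2.0]]
```

Masking before averaging matters because padded positions would otherwise drag the sentence vector toward arbitrary values.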

Fine-tuning Example

from transformers import AutoModelForSeq2SeqLM, Trainer, TrainingArguments

# Load the base checkpoint with a sequence-to-sequence head
model = AutoModelForSeq2SeqLM.from_pretrained("LocaleNLP/afrilion-base")

# Minimal Trainer setup; train_dataset is a placeholder for your own
# tokenized dataset
training_args = TrainingArguments(output_dir="afrilion-finetuned",
                                  per_device_train_batch_size=8,
                                  num_train_epochs=3)
trainer = Trainer(model=model, args=training_args, train_dataset=train_dataset)
trainer.train()

Benchmarks

Task                       Dataset       Score
------------------------   -----------   -----
Language Modeling          CC-100 Test   TBD
Named Entity Recognition   MasakhaNER    TBD
Machine Translation        FLORES-200    TBD
Text Classification        AfriSenti     TBD

Limitations

  • Geographic Coverage: Primarily focuses on widely-spoken languages; many smaller African languages not yet included
  • Dialectal Variation: Standard varieties prioritized; dialectal variations may not be well-represented
  • Domain: Better performance on formal text; colloquial/social media text may be challenging
  • Code-Switching: Limited support for code-mixed text

Ethical Considerations

  • Bias: Training data may contain societal biases present in web text
  • Representation: Language representation reflects available digital resources, not speaker populations
  • Cultural Context: Model may not capture cultural nuances specific to different African communities

Citation

If you use this model in your research, please cite:

@misc{afrilion2026,
  title={AfriLION: African Language Intelligence and Open NLP},
  author={LocaleNLP Team},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/LocaleNLP/afrilion-base}}
}

License

This model is released under the Apache 2.0 License. See the LICENSE file for details.

Acknowledgments

  • Masakhane NLP Community for African language resources
  • Contributors to CC-100 and Wikipedia
  • Research institutions partnering on AfriLION development
  • TPU Research Cloud for compute resources

Contributing

We welcome contributions! Please see our Contributing Guidelines for details on how to:

  • Report issues
  • Submit language-specific improvements
  • Add new African languages
  • Contribute training data

LocaleNLP: Bridging Languages, Empowering Lives.
