Modified SmolLM2 with Bangla Tokenizer Support
This is a modified version of SmolLM2-135M that adds Bangla (Bengali) tokenizer support by merging tokens from the TituLM tokenizer into the original vocabulary.
Model Details
- Base Model: HuggingFaceTB/SmolLM2-135M
- Tokenizer Enhancement: Merged with TituLM Bangla tokenizer
- Original Vocabulary Size: 49,152
- Enhanced Vocabulary Size: 180,177
- Added Tokens: 131,025 Bangla-specific tokens (180,177 - 49,152)
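As a rough check on the figures above, the following sketch derives the number of added tokens and estimates how many extra embedding parameters the larger vocabulary implies. The hidden size of 576 is an assumption based on the published SmolLM2-135M configuration and should be verified against the model's config.json. The estimate covers a single embedding matrix only.

```python
# Vocabulary sizes as stated in the model details above.
ORIGINAL_VOCAB = 49_152
ENHANCED_VOCAB = 180_177

# Assumption: hidden size of SmolLM2-135M; verify against the model's config.json.
HIDDEN_SIZE = 576

# Tokens added by the merge.
added_tokens = ENHANCED_VOCAB - ORIGINAL_VOCAB

# Each added token needs one embedding row of HIDDEN_SIZE values.
extra_embedding_params = added_tokens * HIDDEN_SIZE

print(f"Added tokens: {added_tokens:,}")                         # Added tokens: 131,025
print(f"Extra embedding parameters: {extra_embedding_params:,}")  # Extra embedding parameters: 75,470,400
```

If the input and output embeddings are not tied, the output projection would grow by the same amount again.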
Key Features
- ✅ Full SmolLM2-135M model architecture
- ✅ Enhanced Bangla tokenization support
- ✅ Backward compatible with original SmolLM2
- ✅ Aims to improve Bangla text handling by encoding it with dedicated Bangla tokens
Usage
from transformers import AutoTokenizer, AutoModelForCausalLM
# Load the modified model
model = AutoModelForCausalLM.from_pretrained("rnnandi/modified_smollm")
tokenizer = AutoTokenizer.from_pretrained("rnnandi/modified_smollm")
# Test with Bangla text
text = "আমি বাংলায় গান গাই"
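# Tokenize the prompt into PyTorch tensors
inputs = tokenizer(text, return_tensors="pt")
# Generate up to 50 new tokens after the prompt
outputs = model.generate(**inputs, max_new_tokens=50)
# Decode the generated IDs back to text, dropping special tokens
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)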
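One way to see what the enlarged vocabulary changes is to count how many tokens each tokenizer produces for the same Bangla sentence. The sketch below is only an illustration: `count_tokens` and `compare_tokenizers` are hypothetical helper names, and the functions accept any Hugging Face-style tokenizer objects passed in by the caller.

```python
def count_tokens(tokenizer, text):
    """Return the number of token IDs produced for `text`, excluding special tokens."""
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])


def compare_tokenizers(text, base_tokenizer, modified_tokenizer):
    """Compare token counts for the same text under two tokenizers."""
    base = count_tokens(base_tokenizer, text)
    modified = count_tokens(modified_tokenizer, text)
    # A ratio below 1.0 means the modified tokenizer needs fewer tokens.
    ratio = modified / base if base else None
    return {"base": base, "modified": modified, "ratio": ratio}
```

To use it, load the two tokenizers with `AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M")` and `AutoTokenizer.from_pretrained("rnnandi/modified_smollm")`, then call `compare_tokenizers(text, base_tokenizer, modified_tokenizer)`.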
Training
This model was created by:
- Merging TituLM Bangla tokenizer with SmolLM2 tokenizer
- Resizing model embeddings to accommodate new vocabulary
- Preserving original model weights and architecture
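The vocabulary-merge step listed above can be sketched as follows. This is a simplified illustration, not necessarily the procedure used to build this checkpoint: `merge_vocab` is a hypothetical helper, the toy vocabularies are made up, and a real BPE tokenizer merge also has to handle merge rules, which this sketch omits. The point it shows is that existing token IDs stay fixed while new tokens are appended after them.

```python
def merge_vocab(base_vocab, extra_vocab):
    """Append tokens from extra_vocab that base_vocab lacks, keeping base IDs unchanged."""
    merged = dict(base_vocab)
    next_id = max(base_vocab.values(), default=-1) + 1
    # Walk the extra vocabulary in ID order so the result is deterministic.
    for token, _ in sorted(extra_vocab.items(), key=lambda item: item[1]):
        if token not in merged:
            merged[token] = next_id
            next_id += 1
    return merged


# Toy vocabularies for illustration only (not real tokenizer contents).
base = {"<s>": 0, "the": 1, "a": 2}
extra = {"<s>": 0, "আমি": 1, "গান": 2, "the": 3}

merged = merge_vocab(base, extra)
print(len(merged))  # 5
```

In the transformers library, the embedding matrix is then typically grown with `model.resize_token_embeddings(len(tokenizer))`. The original rows are preserved, while the rows for the new tokens are freshly initialized and carry no learned meaning until they are trained.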
Citation
If you use this model, please cite both the original SmolLM2 and TituLM:
@misc{smollm2,
title={SmolLM2: A Family of Small Language Models},
author={HuggingFace Team},
year={2024},
url={https://huggingface.co/HuggingFaceTB/SmolLM2-135M}
}
@misc{titulm,
title={TituLM: A Bangla Language Model},
author={Hishab Team},
year={2024},
url={https://huggingface.co/hishab/titulm-llama-3.2-1b-v2.0}
}
License
This model is released under the Apache 2.0 License, the same license as the base SmolLM2 model.