Modified SmolLM2 with Bangla Tokenizer Support

This is a modified version of SmolLM2-135M with enhanced Bangla (Bengali) tokenizer support, obtained by merging tokens from the TituLM tokenizer into the SmolLM2 vocabulary.

Model Details

Base Model: HuggingFaceTB/SmolLM2-135M
Tokenizer Enhancement: Merged with TituLM Bangla tokenizer
Original Vocabulary Size: 49,152
Enhanced Vocabulary Size: 180,177
Added Tokens: 131,025 Bangla-specific tokens (180,177 - 49,152)
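
The added-token count follows directly from the two vocabulary sizes. The short check below reproduces it and, as a rough illustration, estimates how many embedding parameters the new rows add. The hidden size of 576 is an assumption taken from the public SmolLM2-135M configuration and is not stated in this card.

```python
# Vocabulary sizes as listed in Model Details.
ORIGINAL_VOCAB = 49_152
ENHANCED_VOCAB = 180_177

# Assumption: SmolLM2-135M hidden size (not stated in this card).
HIDDEN_SIZE = 576

# Number of tokens added by the merge.
added_tokens = ENHANCED_VOCAB - ORIGINAL_VOCAB

# One embedding row of HIDDEN_SIZE values per added token.
new_embedding_params = added_tokens * HIDDEN_SIZE

print(added_tokens)          # 131025
print(new_embedding_params)  # 75470400
```

If the input and output embeddings share weights (also an assumption here), the second number is the full parameter increase caused by the larger vocabulary.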

Key Features

✅ Full SmolLM2-135M model architecture
✅ Enhanced Bangla tokenization support
✅ Backward compatible with the original SmolLM2: original weights and architecture are preserved
✅ Improved handling of Bangla text through the added Bangla-specific vocabulary

Usage

from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the modified model
model = AutoModelForCausalLM.from_pretrained("rnnandi/modified_smollm")
tokenizer = AutoTokenizer.from_pretrained("rnnandi/modified_smollm")

# Test with Bangla text
text = "আমি বাংলায় গান গাই"
inputs = tokenizer(text, return_tensors="pt")
outputs = model.generate(**inputs, max_length=50)
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
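
One way to see the effect of the enlarged vocabulary is to count how many tokens each tokenizer produces for the same Bangla sentence. The following is a minimal sketch, not part of the original card: the helper names `count_tokens`, `compare_tokenizers`, and `load_tokenizers` are hypothetical, and `load_tokenizers` needs network access to the Hugging Face Hub.

```python
def count_tokens(tokenizer, text):
    """Return how many token IDs the tokenizer produces for text."""
    # add_special_tokens=False so only the text itself is counted.
    return len(tokenizer(text, add_special_tokens=False)["input_ids"])


def compare_tokenizers(tokenizers, text):
    """Map each tokenizer's label to its token count for the same text."""
    return {label: count_tokens(tok, text) for label, tok in tokenizers.items()}


def load_tokenizers():
    """Download both tokenizers from the Hub (requires network access)."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import AutoTokenizer

    return {
        "original": AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M"),
        "modified": AutoTokenizer.from_pretrained("rnnandi/modified_smollm"),
    }
```

Calling `compare_tokenizers(load_tokenizers(), "আমি বাংলায় গান গাই")` returns a dictionary with one count per tokenizer. A lower count for the modified tokenizer would indicate that the merged vocabulary covers the sentence with fewer, longer tokens.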

Training

This model was created by:

1. Merging the TituLM Bangla tokenizer with the SmolLM2 tokenizer
2. Resizing the model embeddings to accommodate the new vocabulary
3. Preserving the original model weights and architecture
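
The steps above can be sketched with two standard `transformers` calls, `tokenizer.add_tokens` and `model.resize_token_embeddings`. This is an illustration under assumptions, not the exact procedure used for this model: the card does not document how tokens were extracted from TituLM, so the hypothetical helper `extend_vocabulary` simply takes a list of already decoded token strings.

```python
def extend_vocabulary(model, tokenizer, new_tokens):
    """Add new_tokens to the tokenizer and resize the model embeddings to match.

    Returns (number of tokens actually added, new vocabulary size).
    """
    # add_tokens skips strings that are already in the vocabulary and
    # returns the number of new entries; existing token IDs are unchanged.
    num_added = tokenizer.add_tokens(list(new_tokens))

    # Embedding rows for existing IDs keep their weights; rows for the
    # new IDs are freshly initialized and have not been trained.
    model.resize_token_embeddings(len(tokenizer))

    return num_added, len(tokenizer)


def main():
    """Example run against the base model (requires network access)."""
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = "HuggingFaceTB/SmolLM2-135M"
    model = AutoModelForCausalLM.from_pretrained(base)
    tokenizer = AutoTokenizer.from_pretrained(base)

    # Hypothetical sample of Bangla tokens, for illustration only.
    bangla_tokens = ["বাংলা", "গান"]
    print(extend_vocabulary(model, tokenizer, bangla_tokens))
```

Because the new embedding rows start untrained, a model extended this way would typically need further training on Bangla text before the added tokens carry useful meaning.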

Citation

If you use this model, please cite both the original SmolLM2 and TituLM:

@misc{smollm2,
  title={SmolLM2: A Family of Small Language Models},
  author={HuggingFace Team},
  year={2024},
  url={https://huggingface.co/HuggingFaceTB/SmolLM2-135M}
}

@misc{titulm,
  title={TituLM: A Bangla Language Model},
  author={Hishab Team},
  year={2024},
  url={https://huggingface.co/hishab/titulm-llama-3.2-1b-v2.0}
}

License

This model is released under the Apache 2.0 License, the same license as the base SmolLM2 model.


Model size: 0.2B params (Safetensors)
Tensor type: BF16
