Hanuman: A Small Language Model for Thai

Tokenizer advisor: Koichi Yasuoka


🔎 Model Details

Overview

  • Name: Hanuman
  • Language: Thai (th)
  • Task: Text Generation (Causal LM)
  • Framework: PyTorch + 🤗 Transformers
  • License: CC BY-NC 4.0 (Non-commercial use only)

Training Datasets

  • Thai Wikipedia and reasoning-style datasets (see the Training Process section below)

Architecture

  • GPT-2-style causal language model
  • Custom tokenizer for Thai (handles whitespace, newline, and tab via special tokens such as <NL>, <SPACE>, and <TAB>)

✅ Intended Use

Primary Use Cases

  • Thai text generation (blogs, articles, captions, chatbots)
  • Creative and reasoning-oriented text assistance
  • Thai NLP research

Limitations

  • This model is research-oriented and may require additional fine-tuning for production use.
  • May generate incorrect or biased outputs. Human verification is recommended.

🧰 Tokenizer & Context

  • Custom fast tokenizer (no trust_remote_code needed)
  • Ensures round-trip encode/decode correctness
  • Unicode NFC normalization included
  • Handles Thai–Latin spacing consistently
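
A quick way to sanity-check these properties is to encode and decode a mixed Thai–Latin string containing a newline and a tab, then compare the result against the NFC-normalized input. A minimal sketch; the example string is illustrative only:

import unicodedata
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Hanuman")

# Mixed Thai-Latin text with a newline and a tab to exercise the
# whitespace handling described above.
text = "สวัสดี AI\n\tThailand"

ids = tokenizer.encode(text, add_special_tokens=False)
decoded = tokenizer.decode(ids)

print(tokenizer.convert_ids_to_tokens(ids))           # inspect the token pieces
print(decoded == unicodedata.normalize("NFC", text))  # expected True per the round-trip claim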

🚀 Usage Examples

Basic Text Generation

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ZombitX64/Hanuman"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def generate_thai_text(prompt, max_length=100):
    # Tokenize the prompt and sample a continuation with nucleus sampling.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,        # total length including the prompt
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id  # avoids the missing-pad-token warning
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_thai_text("Artificial intelligence technology"))
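
Generation with the pipeline API

The model can also be driven through the generic transformers text-generation pipeline; a minimal equivalent of the example above, with the sampling settings mirroring generate_thai_text:

from transformers import pipeline

generator = pipeline("text-generation", model=MODEL_ID)

result = generator(
    "Artificial intelligence technology",
    max_length=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(result[0]["generated_text"])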

Batch Processing

prompts = ["Hello", "Thailand has an area of", "Education in the digital era"]
for p in prompts:
    print(generate_thai_text(p, max_length=80))
    print("-"*50)

πŸ—οΈ Training Process

Dataset Preparation

  • Source: Thai Wikipedia and reasoning-style datasets
  • Preprocessing: Cleaning, Unicode normalization, tokenization
  • Training mode: Streaming

Example Training Configuration

training_args = {
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 2,
    "learning_rate": 5e-5,
    "warmup_steps": 10,
    "logging_steps": 10,
    "eval_steps": 50,
    "save_steps": 50,
    "fp16": False,  # CPU training
    "dataloader_num_workers": 0
}
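
These keys map directly onto transformers.TrainingArguments, so the dictionary can be unpacked into a Trainer run. A minimal sketch; the output directory and the train_dataset / eval_dataset objects are placeholders for the streamed, tokenized datasets described above, not the actual training script:

from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Placeholder output directory; train_dataset and eval_dataset are assumed to be
# tokenized datasets prepared as described above.
args = TrainingArguments(
    output_dir="./hanuman-checkpoints",   # hypothetical path
    eval_strategy="steps",                # "evaluation_strategy" on older transformers versions
    **training_args,
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal LM, no masking

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collator,
)
trainer.train()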

📊 Evaluation

The model is currently in the research phase. Formal evaluation results (perplexity, Thai downstream benchmarks) will be added in the future.
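
Until then, a rough perplexity figure on held-out text can be obtained directly from the causal LM loss. A minimal sketch, reusing the tokenizer and model from the usage examples; the evaluation sentence is only a placeholder:

import math
import torch

# Placeholder held-out sentence; substitute a real Thai evaluation set.
eval_text = "Thailand is located in Southeast Asia"

enc = tokenizer(eval_text, return_tensors="pt")
with torch.no_grad():
    out = model(**enc, labels=enc["input_ids"])

# The returned loss is the mean token-level cross-entropy; exp(loss) gives perplexity.
print("perplexity:", math.exp(out.loss.item()))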


🤝 Contributing

This project is part of ongoing Thai NLP research. Feedback, issues, and contributions are welcome!


📄 Citation

@misc{Hanuman2025,
  title        = {Hanuman: Thai Small Language Model},
  author       = {JonusNattapong and Koichi Yasuoka},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/ZombitX64/Hanuman}},
  note         = {Tokenizer advisor: Koichi Yasuoka}
}

⚠️ Disclaimer: This model is intended for research and educational purposes only. Use in commercial applications requires prior permission under the CC BY-NC 4.0 license.
