Hanuman: A Small Language Model for Thai

Tokenizer advisor: Koichi Yasuoka


🔎 Model Details

Overview

  • Name: Hanuman
  • Language: Thai (th)
  • Task: Text Generation (Causal LM)
  • Framework: PyTorch + 🤗 Transformers
  • License: CC BY-NC 4.0 (Non-commercial use only)

Training Datasets

  • Thai Wikipedia and reasoning-style datasets (see the Training Process section below)

Architecture

  • GPT-2-style causal language model
  • Custom tokenizer for Thai (handles whitespace, newline, and tab via special tokens such as <NL>, <SPACE>, and <TAB>)

✅ Intended Use

Primary Use Cases

  • Thai text generation (blogs, articles, captions, chatbots)
  • Creative and reasoning-oriented text assistance
  • Thai NLP research

Limitations

  • This model is research-oriented and may require additional fine-tuning for production use.
  • May generate incorrect or biased outputs. Human verification is recommended.

🧰 Tokenizer & Context

  • Custom fast tokenizer (no trust_remote_code needed)
  • Ensures round-trip encode/decode correctness
  • Unicode NFC normalization included
  • Handles Thai–Latin spacing consistently
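
A quick way to sanity-check these properties is to encode and decode a mixed Thai–Latin string containing a newline and a tab, then compare the result against the NFC-normalized input. A minimal sketch; the example string is illustrative only:

import unicodedata
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("ZombitX64/Hanuman")

# Mixed Thai-Latin text with a newline and a tab to exercise the
# whitespace handling described above.
text = "สวัสดี AI\n\tThailand"

ids = tokenizer.encode(text, add_special_tokens=False)
decoded = tokenizer.decode(ids)

print(tokenizer.convert_ids_to_tokens(ids))           # inspect the token pieces
print(decoded == unicodedata.normalize("NFC", text))  # expected True per the round-trip claim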

🚀 Usage Examples

Basic Text Generation

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

MODEL_ID = "ZombitX64/Hanuman"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

def generate_thai_text(prompt, max_length=100):
    # Tokenize the prompt and sample a continuation with nucleus sampling.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_length=max_length,        # total length including the prompt
            temperature=0.7,
            top_p=0.9,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id  # avoids the missing-pad-token warning
        )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

print(generate_thai_text("Artificial intelligence technology"))
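
Generation with the pipeline API

The model can also be driven through the generic transformers text-generation pipeline; a minimal equivalent of the example above, with the sampling settings mirroring generate_thai_text:

from transformers import pipeline

generator = pipeline("text-generation", model=MODEL_ID)

result = generator(
    "Artificial intelligence technology",
    max_length=100,
    do_sample=True,
    temperature=0.7,
    top_p=0.9,
)
print(result[0]["generated_text"])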

Batch Processing

prompts = ["Hello", "Thailand has an area of", "Education in the digital era"]
for p in prompts:
    print(generate_thai_text(p, max_length=80))
    print("-"*50)

πŸ—οΈ Training Process

Dataset Preparation

  • Source: Thai Wikipedia and reasoning-style datasets
  • Preprocessing: Cleaning, Unicode normalization, tokenization
  • Training mode: Streaming

Example Training Configuration

training_args = {
    "per_device_train_batch_size": 2,
    "per_device_eval_batch_size": 2,
    "gradient_accumulation_steps": 4,
    "num_train_epochs": 2,
    "learning_rate": 5e-5,
    "warmup_steps": 10,
    "logging_steps": 10,
    "eval_steps": 50,
    "save_steps": 50,
    "fp16": False,  # CPU training
    "dataloader_num_workers": 0
}
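
These keys map directly onto transformers.TrainingArguments, so the dictionary can be unpacked into a Trainer run. A minimal sketch; the output directory and the train_dataset / eval_dataset objects are placeholders for the streamed, tokenized datasets described above, not the actual training script:

from transformers import DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Placeholder output directory; train_dataset and eval_dataset are assumed to be
# tokenized datasets prepared as described above.
args = TrainingArguments(
    output_dir="./hanuman-checkpoints",   # hypothetical path
    eval_strategy="steps",                # "evaluation_strategy" on older transformers versions
    **training_args,
)

collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)  # causal LM, no masking

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    data_collator=collator,
)
trainer.train()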

📊 Evaluation

The model is currently in the research phase. Formal evaluation results (perplexity, Thai downstream benchmarks) will be added in the future.
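
Until then, a rough perplexity figure on held-out text can be obtained directly from the causal LM loss. A minimal sketch, reusing the tokenizer and model from the usage examples; the evaluation sentence is only a placeholder:

import math
import torch

# Placeholder held-out sentence; substitute a real Thai evaluation set.
eval_text = "Thailand is located in Southeast Asia"

enc = tokenizer(eval_text, return_tensors="pt")
with torch.no_grad():
    out = model(**enc, labels=enc["input_ids"])

# The returned loss is the mean token-level cross-entropy; exp(loss) gives perplexity.
print("perplexity:", math.exp(out.loss.item()))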


🤝 Contributing

This project is part of ongoing Thai NLP research. Feedback, issues, and contributions are welcome!


📄 Citation

@misc{Hanuman2025,
  title        = {Hanuman: Thai Small Language Model},
  author       = {JonusNattapong and Koichi Yasuoka},
  year         = {2025},
  howpublished = {\url{https://huggingface.co/ZombitX64/Hanuman}},
  note         = {Tokenizer advisor: Koichi Yasuoka}
}

⚠️ Disclaimer: This model is intended for research and educational purposes only. Use in commercial applications requires prior permission under the CC BY-NC 4.0 license.
