---
language:
- ttj
- en
license: apache-2.0
library_name: llama.cpp
pipeline_tag: text-generation
base_model: Qwen/Qwen2.5-1.5B-Instruct
tags:
- gguf
- rutooro
- ttj
- qwen2.5
- cpu
- lora
- qlora
- instruction-tuning
quantization_config:
  format: gguf
  method: q4_K_M
---

# Qwen2.5-1.5B **TTJ (Rutooro)** — GGUF Quantized

GGUF build of a QLoRA-finetuned **Qwen2.5-1.5B-Instruct** model that *thinks and answers in Rutooro (ttj)*. This artifact is intended for **CPU-only** inference via **llama.cpp** and compatible runtimes.

> **Files**
> - `qwen_ttj_merged.q4_K_M.gguf` — ~0.99 GB (quantized with `llama-quantize q4_K_M`)
> - (optional in repo) `qwen_ttj_merged.q5_K_M.gguf` — larger, slightly higher quality

## ✨ What’s inside

- **Base:** `Qwen/Qwen2.5-1.5B-Instruct`
- **Finetune:** supervised instruction tuning (LoRA/QLoRA) on curated Rutooro ChatML data
- **Context length:** 2048 tokens (training truncation)
- **Merge path:** base + LoRA → merged HF model → GGUF (via `convert_hf_to_gguf.py`)

## ✅ Intended use

- Rutooro (ttj) conversational assistant and instruction following
- Education, local information, and general Q&A in Rutooro
- For non-Rutooro input, pair with an external translator (e.g., NLLB-200)

## ⚠️ Limitations & bias

- May produce mistakes or outdated facts; verify critical outputs.
- Safety refusals/redirects were translated into Rutooro, but edge cases can slip through.
- Not a replacement for professional advice.

---

## 🖥️ Run with **llama.cpp**

Build llama.cpp, then:

```bash
# Interactive prompt (the binary is named 'main' in older llama.cpp builds)
./llama-cli -m qwen_ttj_merged.q4_K_M.gguf -t 8 -c 2048 -n 192 -p "Oteekateeka ki orungi okuhika aha wiiki?"

# HTTP server (CPU only, hence -ngl 0)
./llama-server -m qwen_ttj_merged.q4_K_M.gguf -t 8 -c 2048 -ngl 0
```

> Adjust `-t` to your CPU thread count. For slightly better quality, use the `q5_K_M` file if provided. Python equivalents of both invocations are sketched in the appendix below.

### Example (ChatML-style prompt)

This model was trained with Qwen’s ChatML formatting; a simple single-turn prompt works fine, but you can also format full conversations:

```text
<|im_start|>system
Oli ntekereza ehangirwe ekugamba mu Runyoro/Rutooro. Hondera empabura z'abantu abakukukozesa kandi obe w'engeso nungi.<|im_end|>
<|im_start|>user
Nshoborora kutunga enteekateeka y’omulimo ogw’aha wiiki?<|im_end|>
<|im_start|>assistant
```

---

## 🔧 Repro (high level)

- Load the base in 4-bit (bitsandbytes) and train LoRA (`r=16, α=32, dropout=0.05`) with gradient checkpointing.
- Compute loss on **assistant tokens only** (prompt masked).
- Merge the LoRA into the base, then convert and quantize (Python sketches of these steps are in the appendix below):

```bash
python convert_hf_to_gguf.py --model-name qwen2 --outfile qwen_ttj_merged.gguf /path/to/merged_hf
./llama-quantize qwen_ttj_merged.gguf qwen_ttj_merged.q4_K_M.gguf q4_K_M
```

## 📄 License

- Base model: Apache-2.0 (Qwen2.5 family)
- Finetune & GGUF weights: Apache-2.0

Use responsibly and comply with local laws and the base model’s license.

## 🙌 Acknowledgements

- The Qwen team for the base model.
- The llama.cpp project for GGUF conversion and CPU inference.
- The Runyoro/Rutooro AI project for data preparation and evaluation.
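
---

## 🧪 Appendix: Python sketches

The snippets below are illustrative sketches, not part of the released artifact. They reuse the file names and hyperparameters stated above; anything else (ports, paths, library choices) is an assumption, noted inline.

### Local inference with `llama-cpp-python`

A minimal sketch using the `llama-cpp-python` bindings instead of the `llama-cli` binary; `create_chat_completion` applies the chat template embedded in the GGUF (ChatML for Qwen), so no manual `<|im_start|>` formatting is needed.

```python
# pip install llama-cpp-python
from llama_cpp import Llama

# Load the quantized model; n_ctx matches the 2048-token training truncation,
# and n_threads plays the role of -t above.
llm = Llama(
    model_path="qwen_ttj_merged.q4_K_M.gguf",
    n_ctx=2048,
    n_threads=8,
)

# The GGUF's built-in ChatML template is applied automatically.
out = llm.create_chat_completion(
    messages=[
        {"role": "user", "content": "Oteekateeka ki orungi okuhika aha wiiki?"},
    ],
    max_tokens=192,
)
print(out["choices"][0]["message"]["content"])
```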
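
### Querying `llama-server`

`llama-server` exposes an OpenAI-compatible chat endpoint. A sketch with `requests`, assuming the server started as shown above on the default port 8080 (adjust if you pass `--port`):

```python
import requests

# POST an OpenAI-style chat request to the running llama-server instance.
resp = requests.post(
    "http://127.0.0.1:8080/v1/chat/completions",
    json={
        "messages": [
            {"role": "user", "content": "Nshoborora kutunga enteekateeka y’omulimo ogw’aha wiiki?"},
        ],
        "max_tokens": 192,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```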
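
### QLoRA setup

A sketch of the 4-bit load plus LoRA configuration from the repro notes. The NF4 quant type, compute dtype, and `target_modules` list are assumptions (typical choices for Qwen2-style models), not confirmed training settings.

```python
# pip install transformers peft bitsandbytes accelerate
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit base model via bitsandbytes; NF4 + bf16 compute is the common QLoRA recipe (assumed).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct",
    quantization_config=bnb,
    device_map="auto",
)
base.gradient_checkpointing_enable()

# LoRA hyperparameters from the card: r=16, alpha=32, dropout=0.05.
lora = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
    target_modules=[
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections (assumed)
        "gate_proj", "up_proj", "down_proj",     # MLP projections (assumed)
    ],
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()
```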
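
### Masking the prompt in the loss

One way to compute loss on assistant tokens only: set every prompt position in `labels` to `-100`, which Hugging Face causal-LM losses ignore. This is a simplified sketch (it tokenizes prompt and completion separately, which can differ slightly from joint tokenization); the exact masking code used for this model is not published here.

```python
def build_example(tokenizer, prompt: str, answer: str, max_len: int = 2048):
    """prompt: the full ChatML prefix up to and including '<|im_start|>assistant' plus a newline;
    answer: the assistant reply followed by '<|im_end|>'."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    answer_ids = tokenizer(answer, add_special_tokens=False)["input_ids"]
    input_ids = (prompt_ids + answer_ids)[:max_len]
    # -100 marks positions excluded from the cross-entropy loss,
    # so gradients flow only through the assistant span.
    labels = ([-100] * len(prompt_ids) + answer_ids)[:max_len]
    return {"input_ids": input_ids, "labels": labels}
```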
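
### Merging the adapter before GGUF conversion

A sketch of the merge step; `path/to/lora_adapter` and `merged_hf` are placeholder paths. The merged directory is what `convert_hf_to_gguf.py` above consumes.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Reload the base in full precision, attach the adapter, and fold the LoRA
# weights into the base model.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-1.5B-Instruct", torch_dtype=torch.bfloat16
)
merged = PeftModel.from_pretrained(base, "path/to/lora_adapter").merge_and_unload()

merged.save_pretrained("merged_hf")
AutoTokenizer.from_pretrained("Qwen/Qwen2.5-1.5B-Instruct").save_pretrained("merged_hf")
```

After saving, run the `convert_hf_to_gguf.py` and `llama-quantize` commands shown in the repro section.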