Greek Dialect LoRA — Krikri-8B Adapter

LoRA adapter trained by the Computational Linguistics & Language Technology (CLLT) Lab, University of Crete, for generating text in the Pontic, Cretan, Northern Greek, and Cypriot dialects. The adapter augments ilsp/Llama-Krikri-8B-Base and was trained on dialect-only data converted into natural Greek prompts.

Project website: https://stergioscha.github.io/CLLT/

Model Details

  • Developer: CLLT Lab, University of Crete
  • Base model: ilsp/Llama-Krikri-8B-Base
  • Adapter type: LoRA via PEFT (r=16, α=32, dropout=0.1, q/k/v/o/gate/up/down projections; see the config sketch after this list)
  • Trained parameters: 41.9M (≈0.51% of the base model)
  • Dataset: 23,421 natural-prompt examples derived from the Greek Regional Dialects Dataset (GRDD)
  • Languages: Greek dialectal varieties (Pontic, Cretan, Northern, Cypriot)
  • License: Research purposes only (respect the base model’s license)
  • Funding / compute: AWS resources provided by GRNET and funded by the EU Recovery & Resilience Facility
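
A minimal PEFT configuration matching these settings (a sketch: the explicit Llama projection-module names expand the q/k/v/o/gate/up/down abbreviations above and are assumptions, not taken from the training script):

from peft import LoraConfig

# Sketch of the adapter configuration listed above; target modules assume
# the standard Llama projection-layer names.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.1,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
    task_type="CAUSAL_LM",
)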

Model Sources

  • Project website: https://stergioscha.github.io/CLLT/
  • Adapter repository: https://huggingface.co/Stergios/krikri-8b-base-lora
  • Base model: https://huggingface.co/ilsp/Llama-Krikri-8B-Base

Intended Use

Direct use

  • Dialectal text generation for cultural heritage, education, and research
  • Conversational agents that must answer in a specific Greek dialect
  • Prompt-based experimentation with dialect-specific stylistics

Downstream use

  • Integrate the adapter inside chatbots or RAG pipelines that need dialectal answers (see the merging sketch after this list)
  • Build evaluation suites for low-resource Greek varieties
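
For deployment, PEFT's merge_and_unload can fold the adapter into the base weights so downstream pipelines serve a single model (a sketch, assuming model is the PeftModel loaded as in "How to Use" below; the output path is hypothetical):

# Merge the LoRA weights into the base model for adapter-free inference.
merged_model = model.merge_and_unload()
merged_model.save_pretrained("krikri-8b-dialects-merged")  # hypothetical path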

Out-of-scope / limitations

  • Standard Modern Greek generation (it was excluded from the training data)
  • High-stakes domains (medical, legal, safety-critical) without human oversight
  • Automatic dialect classification or translation between dialects

How to Use

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

# Load the base model in bfloat16 and attach the LoRA adapter.
base_model = AutoModelForCausalLM.from_pretrained(
    "ilsp/Llama-Krikri-8B-Base",
    torch_dtype=torch.bfloat16,
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("ilsp/Llama-Krikri-8B-Base")
model = PeftModel.from_pretrained(base_model, "Stergios/krikri-8b-base-lora")

# "Write in the Pontic dialect: Good morning, how are you?"
prompt = "Γράψε στην ποντιακή διάλεκτο: Καλημέρα, πώς είσαι;"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    # do_sample=True is required for temperature to take effect.
    output = model.generate(**inputs, max_new_tokens=120, do_sample=True, temperature=0.7)
print(tokenizer.decode(output[0], skip_special_tokens=True))
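
The prompt above requests Pontic; analogous natural-language Greek instructions should select the other three dialects, mirroring the natural-prompt format used during training. Note that temperature only affects generation when sampling is enabled, hence do_sample=True in the call above.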

Training Data & Procedure

  • Source: GRDD + GRDD+, filtered to keep only dialect-tagged entries
  • Conversion: Artificial tags (<po>, <cr>, <no>, <cy>) mapped to natural Greek instructions using convert_to_natural_prompts_dialects_only.py
  • Split: 95% train / 5% validation (shuffled)
  • Tokenization: 512-token truncation, labels = input IDs (see the sketch after this list)
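
A minimal sketch of this preprocessing (the instruction wording, tag-to-dialect mapping, and dataset field names are assumptions; convert_to_natural_prompts_dialects_only.py is authoritative):

# Hypothetical tag-to-instruction mapping; the real wording lives in the
# conversion script shipped with the project.
TAG_TO_INSTRUCTION = {
    "<po>": "Γράψε στην ποντιακή διάλεκτο:",
    "<cr>": "Γράψε στην κρητική διάλεκτο:",
    "<no>": "Γράψε στη βόρεια διάλεκτο:",
    "<cy>": "Γράψε στην κυπριακή διάλεκτο:",
}

def preprocess(example, tokenizer):
    # Swap the artificial dialect tag for a natural Greek instruction,
    # then tokenize with 512-token truncation and labels = input IDs.
    text = example["text"]  # hypothetical field name
    for tag, instruction in TAG_TO_INSTRUCTION.items():
        text = text.replace(tag, instruction)
    tokens = tokenizer(text, truncation=True, max_length=512)
    tokens["labels"] = tokens["input_ids"].copy()
    return tokens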

Hyperparameters

  • Epochs: 3
  • Per-device batch size: 2 (gradient accumulation 8 ⇒ effective batch size 16)
  • Learning rate: 3e-4 with 100 warmup steps
  • Precision: bfloat16
  • Save/eval every 200 steps; the best checkpoint is selected automatically (see the sketch below)
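
These settings correspond roughly to the following Trainer configuration (a sketch; output_dir is a placeholder and the argument names follow recent transformers releases, not the original training script):

from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="krikri-dialect-lora",   # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,      # 2 x 8 ⇒ effective batch size 16
    learning_rate=3e-4,
    warmup_steps=100,
    bf16=True,
    eval_strategy="steps",              # "evaluation_strategy" on older versions
    eval_steps=200,
    save_steps=200,
    load_best_model_at_end=True,        # keep the best checkpoint by eval loss
)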

Evaluation

  • Validation loss monitored during training; the best checkpoint is selected (see the snippet after this list)
  • Recommended human evaluation by native speakers for dialect fidelity and cultural appropriateness
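
With a Trainer built from the arguments above, the validation loss (and a derived perplexity) can be read out as follows (a sketch; trainer is a hypothetical transformers Trainer wrapping the adapter model):

import math

metrics = trainer.evaluate()  # assumes a configured Trainer
val_loss = metrics["eval_loss"]
print(f"validation loss: {val_loss:.3f}, perplexity: {math.exp(val_loss):.1f}")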

Limitations & Risks

  • Coverage limited to four dialect families; sub-dialect nuances may be missing
  • Model can still hallucinate or drift toward Standard Greek without strong prompts
  • Training data might encode stylistic or topical biases present in GRDD
  • Outputs should always be reviewed by fluent speakers before publication

Acknowledgments

  • Compute: National Infrastructures for Research and Technology (GRNET)
  • Funding: EU Recovery and Resilience Facility
  • Base models: ILSP (Llama-Krikri-8B-Base)

Contact

For questions or issues, open an issue on the GitHub repository or contact the CLLT Lab (University of Crete).
