🌐 KAIdol NER Multilingual Model

This is a multilingual NER (Named Entity Recognition) model developed as part of the KAIdol Project.
It is based on Davlan/xlm-roberta-base-ner-hrl, fine-tuned on the WikiAnn dataset for Korean (ko), English (en), Spanish (es), and Portuguese (pt).

🧠 Model Details

  • Base model: Davlan/xlm-roberta-base-ner-hrl
  • NER Tags:
    • PER: Person
    • ORG: Organization
    • LOC: Location
  • Tokenizer: AutoTokenizer from base model
  • Max length: 128 tokens

📊 Training Configuration

Parameter      Value
Epochs         5
Batch Size     16
Optimizer      AdamW
Learning Rate  5e-5
Loss           CrossEntropy with class weights
Dataset        WikiAnn (en, ko, es, pt)
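
The "CrossEntropy with class weights" loss can be illustrated with a small pure-Python sketch. The weight values and logits below are hypothetical, chosen only for illustration; this card does not state the actual weights used in training. Dividing by the summed weights of the gold labels follows the convention of PyTorch's `CrossEntropyLoss` with mean reduction, which is an assumption about the training setup, not something the card confirms.

```python
import math

def weighted_cross_entropy(logits, targets, class_weights):
    """Mean weighted cross-entropy over a sequence of tokens.

    logits: list of per-token score lists (one score per class)
    targets: list of gold class ids
    class_weights: list with one weight per class
    """
    total_loss = 0.0
    total_weight = 0.0
    for scores, gold in zip(logits, targets):
        # Numerically stable log-softmax value for the gold class.
        max_score = max(scores)
        log_sum_exp = max_score + math.log(sum(math.exp(s - max_score) for s in scores))
        log_prob = scores[gold] - log_sum_exp
        weight = class_weights[gold]
        total_loss += -weight * log_prob
        total_weight += weight
    # Normalize by the summed weights of the gold labels.
    return total_loss / total_weight

# Hypothetical weights: down-weight the frequent 'O' class (id 0).
class_weights = [0.5, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]
logits = [
    [2.0, 0.1, 0.1, 0.1, 0.1, 0.1, 0.1],  # highest score on 'O' (id 0)
    [0.1, 2.0, 0.1, 0.1, 0.1, 0.1, 0.1],  # highest score on 'B-PER' (id 1)
]
targets = [0, 1]
loss = weighted_cross_entropy(logits, targets, class_weights)
print(f"weighted loss: {loss:.4f}")
```

Raising the weight of a rare class makes its errors count more in the average, which is the usual motivation for class weights when most tokens are tagged `O`.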

✅ Performance Summary

Language    F1-macro  PER F1  ORG F1  LOC F1
English     0.74      0.84    0.63    0.76
Korean      0.43      0.46    0.30    0.52
Spanish     TBD       TBD     TBD     TBD
Portuguese  TBD       TBD     TBD     TBD

Spanish (es) and Portuguese (pt) scores will be added after evaluation. Korean performance is limited by tokenization issues in the WikiAnn data.
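
As a quick consistency check, the F1-macro column matches the unweighted mean of the three per-entity F1 scores. The sketch below assumes this is how F1-macro was computed (the card does not say so explicitly) and uses only the numbers reported in the table.

```python
# Per-entity F1 scores copied from the performance table.
per_entity_f1 = {
    "English": {"PER": 0.84, "ORG": 0.63, "LOC": 0.76},
    "Korean": {"PER": 0.46, "ORG": 0.30, "LOC": 0.52},
}

def macro_f1(scores):
    """Unweighted mean of per-class F1 scores."""
    return sum(scores.values()) / len(scores)

# Round to two decimals, the precision used in the table.
macro = {lang: round(macro_f1(scores), 2) for lang, scores in per_entity_f1.items()}
for lang, value in macro.items():
    print(f"{lang}: {value:.2f}")
```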

🚀 Usage Example

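from transformers import AutoTokenizer, AutoModelForTokenClassification

# Load the fine-tuned model and its tokenizer from the Hugging Face Hub.
model = AutoModelForTokenClassification.from_pretrained("developer-lunark/kaidol-ner-multilingual")
tokenizer = AutoTokenizer.from_pretrained("developer-lunark/kaidol-ner-multilingual")

# Tokenize a Spanish sentence and return PyTorch tensors.
tokens = tokenizer("Barack Obama nació en Hawái.", return_tensors="pt")
# output.logits holds one score per label for every sub-token.
output = model(**tokens)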

🧾 Label Mapping

{
  'O': 0,
  'B-PER': 1,
  'I-PER': 2,
  'B-ORG': 3,
  'I-ORG': 4,
  'B-LOC': 5,
  'I-LOC': 6
}
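
To turn predictions into readable entities, the mapping above can be inverted and the B-/I- tags grouped into spans. The following is a minimal sketch, assuming word-level predictions are already available as a list of label ids (for example after taking the argmax of the logits and aligning sub-tokens to words, which is not shown here). The helper name `decode_entities` and the sample label ids are hypothetical and are not real model output.

```python
label2id = {
    "O": 0, "B-PER": 1, "I-PER": 2,
    "B-ORG": 3, "I-ORG": 4, "B-LOC": 5, "I-LOC": 6,
}
id2label = {idx: label for label, idx in label2id.items()}

def decode_entities(words, label_ids):
    """Group BIO-tagged words into (entity_type, text) spans."""
    entities = []
    current_type, current_words = None, []
    for word, label_id in zip(words, label_ids):
        label = id2label[label_id]
        prefix, _, entity_type = label.partition("-")
        # An 'I-' tag of the same type continues the open span.
        if prefix == "I" and entity_type == current_type:
            current_words.append(word)
            continue
        # Any other label closes the open span, if there is one.
        if current_type is not None:
            entities.append((current_type, " ".join(current_words)))
            current_type, current_words = None, []
        # 'B-' starts a span; a stray 'I-' is also treated as a start (a lenient choice).
        if prefix in ("B", "I"):
            current_type, current_words = entity_type, [word]
    if current_type is not None:
        entities.append((current_type, " ".join(current_words)))
    return entities

words = ["Barack", "Obama", "nació", "en", "Hawái", "."]
label_ids = [1, 2, 0, 0, 5, 0]  # hypothetical predictions for illustration
print(decode_entities(words, label_ids))
```

Treating a stray `I-` tag as the start of a new span is one of several possible conventions; a stricter decoder could discard such tags instead.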

🔐 License

MIT License

📬 Contact

Developed by the KAIdol Project Team.

For questions or collaborations, contact: developer-lunark
