---
library_name: transformers
language:
  - grc
---

# SyllaMoBert-grc-macronizer-v1

This is a macronizer for Ancient Greek: it classifies open syllables containing dichrona (α, ι, υ, whose written form does not indicate vowel length) as long or short. It is a machine-learning extension of Albin Thörn Cleland's macronizer (https://github.com/Urdatorn/macronize-tlg), using a ModernBERT model pretrained on syllabified Ancient Greek texts (Ericu950/SyllaMoBert-grc-v1).

First install the pretokenizer, which syllabifies Ancient Greek according to the principles the model adheres to:

```bash
pip install syllagreek_utils==0.1.0
```

```python
import torch
from torch.nn.functional import softmax
from transformers import PreTrainedTokenizerFast, ModernBertForTokenClassification
from syllagreek_utils import preprocess_greek_line, syllabify_joined

# -------- Load Model and Tokenizer --------
model_path = "Ericu950/SyllaMoBert-grc-macronizer-v1"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path)
model = ModernBertForTokenClassification.from_pretrained(model_path, torch_dtype=torch.bfloat16)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()

# -------- Input Line --------
line = "φάσγανον Ἀσσυρίοιο παρήορον ἐκ τελαμῶνος"

# -------- Preprocess and Syllabify --------
tokens = preprocess_greek_line(line)
syllables = syllabify_joined(tokens)
print("Syllables:", syllables)

# -------- Tokenize Input --------
# is_split_into_words=True treats each syllable as one pre-split word; the
# syllable-level tokenizer maps each to a single token, which keeps the
# label-to-syllable alignment below straightforward.
inputs = tokenizer(
    syllables,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=2048,
    padding="max_length",
)

# Remove token_type_ids if present
if "token_type_ids" in inputs:
    del inputs["token_type_ids"]

inputs = {k: v.to(device) for k, v in inputs.items()}

# -------- Predict --------
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits
    probs = softmax(logits, dim=-1)
    predictions = torch.argmax(probs, dim=-1).squeeze().cpu().numpy()

# -------- Align Predictions with Syllables --------
# Walk the token sequence, skip the tokenizer's special tokens (sequence
# markers and padding), and pair each remaining token's prediction with
# the next syllable in order.
input_tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze())
aligned_preds = []
syllable_idx = 0

for i, token in enumerate(input_tokens):
    if token in tokenizer.all_special_tokens:
        continue
    if syllable_idx >= len(syllables):
        break
    label = predictions[i]
    aligned_preds.append((syllables[syllable_idx], label))
    syllable_idx += 1

# -------- Print Results --------
print("\nMacronization Predictions:")
for syll, label in aligned_preds:
    status = {0: "clear", 1: "ambiguous → long", 2: "ambiguous → short"}[label]
    print(f"{syll:>10} → {status}")
```

This should print:

```
Syllables: ['φάσ', 'γα', 'νο', 'νἀσ', 'συ', 'ρί', 'οι', 'ο', 'πα', 'ρή', 'ο', 'ρο', 'νἐκ', 'τε', 'λα', 'μῶ', 'νοσ']

Macronization Predictions:
       φάσ → clear
        γα → ambiguous → short
        νο → clear
       νἀσ → clear
        συ → ambiguous → short
        ρί → ambiguous → short
        οι → clear
         ο → clear
        πα → ambiguous → short
        ρή → clear
         ο → clear
        ρο → clear
       νἐκ → clear
        τε → clear
        λα → ambiguous → short
        μῶ → clear
       νοσ → clear
```
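
To see the predictions on the text itself, the labels can be rendered as combining diacritics. The sketch below is a minimal illustration, not part of the released pipeline: the `mark_syllable` helper, the `DICHRONA` set, and the NFD-based insertion point are assumptions, and proper placement has to negotiate existing accents and breathings (the upstream macronizer at https://github.com/Urdatorn/macronize-tlg is the reference for that). Note also that `syllabify_joined` merges adjacent words (e.g. 'νἀσ' above), so joining the marked syllables gives a continuous syllable string rather than the original line.

```python
import unicodedata

MACRON = "\u0304"    # combining macron = predicted long
BREVE = "\u0306"     # combining breve = predicted short
DICHRONA = "αιυΑΙΥ"  # vowels whose written form does not show quantity

def mark_syllable(syllable: str, label: int) -> str:
    """Hypothetical display helper: attach a quantity mark to the first
    dichronon vowel of a syllable, ignoring accent/breathing interaction."""
    if label == 0:  # quantity already clear: leave untouched
        return syllable
    mark = MACRON if label == 1 else BREVE
    out, marked = [], False
    for ch in unicodedata.normalize("NFD", syllable):
        out.append(ch)
        if not marked and ch in DICHRONA:
            out.append(mark)  # insert the mark right after the base vowel
            marked = True
    return unicodedata.normalize("NFC", "".join(out))

print("".join(mark_syllable(s, lab) for s, lab in aligned_preds))
```

For the example line this marks γα, συ, ρί, πα and λα with breves and leaves the other syllables untouched.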