---
library_name: transformers
language:
- grc
---
# SyllaMoBert-grc-macronizer-v1
This is a macronizer for Ancient Greek that classifies open syllables containing dichrona as either long or short. It is a machine-learning extension of Albin Thörn Clelands' macronizer (https://github.com/Urdatorn/macronize-tlg), built on a ModernBERT model pretrained on syllabified Ancient Greek texts (Ericu950/SyllaMoBert-grc-v1).
First install the pretokenizer, which syllabifies Ancient Greek according to the principles the model adheres to:

```python
!pip install syllagreek_utils==0.1.0
```
```python
import torch
from transformers import PreTrainedTokenizerFast, ModernBertForTokenClassification
from syllagreek_utils import preprocess_greek_line, syllabify_joined
from torch.nn.functional import softmax
# -------- Load Model and Tokenizer --------
model_path = "Ericu950/SyllaMoBert-grc-macronizer-v1"
tokenizer = PreTrainedTokenizerFast.from_pretrained(model_path)
model = ModernBertForTokenClassification.from_pretrained(model_path, torch_dtype=torch.bfloat16)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
# -------- Input Line --------
line = "φάσγανον Ἀσσυρίοιο παρήορον ἐκ τελαμῶνος"
# -------- Preprocess and Syllabify --------
tokens = preprocess_greek_line(line)
syllables = syllabify_joined(tokens)
print("Syllables:", syllables)
# -------- Tokenize Input --------
inputs = tokenizer(
    syllables,
    is_split_into_words=True,
    return_tensors="pt",
    truncation=True,
    max_length=2048,
    padding="max_length",
)
# Remove token_type_ids if present
if "token_type_ids" in inputs:
del inputs["token_type_ids"]
inputs = {k: v.to(device) for k, v in inputs.items()}
# -------- Predict --------
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

probs = softmax(logits, dim=-1)
predictions = torch.argmax(probs, dim=-1).squeeze().cpu().numpy()
# -------- Align Predictions with Syllables --------
# Each syllable maps to a single token, so predictions can be read off in order,
# skipping special tokens (including padding).
token_strings = tokenizer.convert_ids_to_tokens(inputs["input_ids"].squeeze())
aligned_preds = []
syllable_idx = 0
for i, token in enumerate(token_strings):
    if token in tokenizer.all_special_tokens:
        continue
    if syllable_idx >= len(syllables):
        break
    label = predictions[i]
    aligned_preds.append((syllables[syllable_idx], label))
    syllable_idx += 1
# -------- Print Results --------
print("\nMacronization Predictions:")
for syll, label in aligned_preds:
    status = {0: "clear", 1: "ambiguous → long", 2: "ambiguous → short"}[label]
    print(f"{syll:>10} → {status}")
```
This should print:
```
Syllables: ['φάσ', 'γα', 'νο', 'νἀσ', 'συ', 'ρί', 'οι', 'ο', 'πα', 'ρή', 'ο', 'ρο', 'νἐκ', 'τε', 'λα', 'μῶ', 'νοσ']

Macronization Predictions:
       φάσ → clear
        γα → ambiguous → short
        νο → clear
       νἀσ → clear
        συ → ambiguous → short
        ρί → ambiguous → short
        οι → clear
         ο → clear
        πα → ambiguous → short
        ρή → clear
         ο → clear
        ρο → clear
       νἐκ → clear
        τε → clear
        λα → ambiguous → short
        μῶ → clear
       νοσ → clear
```
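
To turn the predictions back into marked-up text, you can attach length diacritics to the ambiguous syllables. The sketch below is a minimal illustration, not part of this model or of syllagreek_utils: the choice of combining macron (U+0304) and breve (U+0306) as length marks and the `macronize_syllable` helper are assumptions for demonstration purposes; a full macronizer such as macronize-tlg handles these details more carefully.

```python
import unicodedata

DICHRONA = "αιυ"  # vowels whose written form does not indicate length

def macronize_syllable(syllable: str, label: int) -> str:
    """Attach a length mark to the dichronon vowel of an ambiguous syllable.

    label 0 = clear (left unchanged), 1 = ambiguous → long (macron),
    2 = ambiguous → short (breve). The diacritic convention here is an
    illustrative assumption, not prescribed by the model.
    """
    if label == 0:
        return syllable
    mark = "\u0304" if label == 1 else "\u0306"  # combining macron / breve
    out = []
    for ch in syllable:
        out.append(ch)
        # Decompose so that accented dichrona (e.g. ί) are recognized too.
        if unicodedata.normalize("NFD", ch)[0] in DICHRONA:
            out.append(mark)
    return "".join(out)

macronized = "".join(macronize_syllable(s, int(lab)) for s, lab in aligned_preds)
print(macronized)
```

For the example line above, this rejoins the syllables with breves on the dichrona predicted short (e.g. γᾰ, σῠ, πᾰ, λᾰ).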