Issues with preparing inputs for sequence-to-sequence learning

#1
by jadermcs - opened

I am training T5Gemma for Word-in-Context (WiC) binary classification, framed as a sequence-to-sequence problem (as in the original T5 paper). However, the model predicts the same label for every example. Initially, I noticed that the tokenizer does not add the end-of-sequence (EOS) token, and the model generated output like "falsetruetruetruetrue" until it reached the maximum number of tokens. I adapted my code to append EOS to the targets; now the model predicts only "true".

PS: The code below works with "google-t5/t5-small"
Any help here? Code below:

from datasets import load_dataset
import numpy as np
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
from transformers import EvalPrediction
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# Load the SuperGLUE Word-in-Context (WiC) dataset
dataset = load_dataset("super_glue", "wic")

# Initialize tokenizer and model
model_name = "google/t5gemma-b-b-ul2-it"
# model_name = "google-t5/t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name, attn_implementation="eager")


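def compute_metrics(eval_pred: EvalPrediction):
    predictions, labels = eval_pred
    # Generated sequences can be padded with -100 when batches are gathered;
    # map those positions back to the pad token so they can be decoded
    predictions = np.where(predictions != -100, predictions, tokenizer.pad_token_id)
    # Decode predicted token IDs to strings
    pred_str = tokenizer.batch_decode(predictions, skip_special_tokens=True)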
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    label_str = tokenizer.batch_decode(labels, skip_special_tokens=True)
    print(pred_str)
    print(label_str)

    # Convert "true"/"false" strings to 1/0
    pred_labels = [1 if p.strip().lower() == "true" else 0 for p in pred_str]
    true_labels = [1 if l.strip().lower() == "true" else 0 for l in label_str]
    # compute precision, recall, f1
    precision, recall, f1_score, _ = precision_recall_fscore_support(
        true_labels, pred_labels, average="binary"
    )
    accuracy = accuracy_score(true_labels, pred_labels)

    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1_score": f1_score,
    }


# Preprocessing function
def preprocess(example):
    input_text = f"sentence1: {example['sentence1']} sentence2: {example['sentence2']} word: {example['word']}"
    target_text = "true" if example["label"] == 1 else "false"
    target_text = target_text + tokenizer.eos_token

    # Tokenize inputs and targets
    model_inputs = tokenizer(
        input_text, max_length=128, truncation=True, padding="max_length"
    )
    labels = tokenizer(target_text, max_length=5, truncation=True, padding="max_length")

    # Replace pad token id's in labels with -100 so they're ignored by loss
    labels_ids = labels["input_ids"]
    labels_ids = [
        label if label != tokenizer.pad_token_id else -100 for label in labels_ids
    ]

    model_inputs["labels"] = labels_ids
    return model_inputs


# Tokenize dataset
tokenized_dataset = dataset.map(
    preprocess, remove_columns=dataset["train"].column_names
)

# Training arguments
training_args = Seq2SeqTrainingArguments(
    output_dir="./t5-wic",
    eval_strategy="epoch",
    per_device_train_batch_size=32,
    num_train_epochs=10,
    save_strategy="epoch",
    save_total_limit=1,
    load_best_model_at_end=True,
    metric_for_best_model="accuracy",
    predict_with_generate=True,
    bf16=True,
)
print(tokenized_dataset["train"][0])
print(tokenizer.decode(tokenized_dataset["train"][0]["input_ids"]))
# remove -100
labels = [
    label if label != -100 else tokenizer.pad_token_id
    for label in tokenized_dataset["train"][0]["labels"]
]
print(tokenizer.decode(labels))
# Initialize Trainer
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_dataset["train"],
    eval_dataset=tokenized_dataset["validation"],
    compute_metrics=compute_metrics,
)

# Train the model
trainer.train()

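# Note: the SuperGLUE test split is distributed with hidden labels (-1), so
# metrics computed on it are not meaningful; validation is the usable held-out split
metrics = trainer.evaluate(tokenized_dataset["test"])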
print("Final metrics:")
print(metrics)
Google org

Hi,

Thanks for reaching out, and welcome to Google's Gemma family of open-source models. Please try the following recommendations:

Step 1 (Must-Do):
Action: Explicitly set a lower learning_rate in Seq2SeqTrainingArguments. Start with 1e-4 or 5e-5.
Rationale: Addresses the numerical instability inherent in large, modern models (T5Gemma) combined with bf16.

Step 2 (Must-Do):
Action: Check the tokenization of the labels.
Rationale: Ensure that tokenizer("true" + tokenizer.eos_token) yields a short sequence (e.g., 2 or 3 tokens) that ends with EOS, and that "true" and "false" map to distinct tokens.
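
A minimal sketch of this check, assuming only that the tokenizer can be called on a string and exposes eos_token and eos_token_id, as Hugging Face tokenizers do. The helper name check_label_tokenization is hypothetical, and the max_length=5 default is taken from the preprocessing code in the original post.

```python
def check_label_tokenization(tokenizer, label_words=("true", "false"), max_length=5):
    """Return {word: token ids} after checking that each target is usable.

    Hypothetical helper: raises ValueError if a target would be truncated,
    does not end with exactly one EOS, or is identical to another target.
    """
    encoded = {}
    for word in label_words:
        ids = list(tokenizer(word + tokenizer.eos_token)["input_ids"])
        if len(ids) > max_length:
            # Truncating to max_length would cut off the trailing EOS
            raise ValueError(
                f"{word!r} needs {len(ids)} tokens, more than max_length={max_length}"
            )
        if ids[-1] != tokenizer.eos_token_id or ids.count(tokenizer.eos_token_id) != 1:
            raise ValueError(f"{word!r} does not end with exactly one EOS: {ids}")
        encoded[word] = ids
    # The two labels must not collapse to the same token sequence
    if len({tuple(ids) for ids in encoded.values()}) != len(encoded):
        raise ValueError("label words map to identical token sequences")
    return encoded
```

With the script from the original post, this could be called as print(check_label_tokenization(tokenizer)) right after the tokenizer is loaded, before any training is started.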

Step 3 (Optional but good):
Action: Add gradient clipping to your training arguments to prevent potential gradient explosions in bf16 training.
Rationale: Adds stability. (Seq2SeqTrainingArguments inherits max_grad_norm from TrainingArguments, where it defaults to 1.0, so clipping is already active unless it was changed; a lower value can be set directly in the training arguments.)

Step 4 (Verify):
Action: Temporarily run a validation/test step before training starts to ensure the compute_metrics and generation are working as expected with the initial, untrained model.
Rationale: Isolates the issue: is it in the setup or the training process?
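
One way to make such a check informative is to look at the distribution of decoded predictions rather than only at accuracy. The sketch below is an illustration, not part of the original script: summarize_predictions and VALID_LABELS are hypothetical names, and the inputs are the decoded pred_str and label_str lists that compute_metrics in the original post already produces.

```python
from collections import Counter

# The only two strings the model is expected to generate for WiC
VALID_LABELS = {"true", "false"}


def summarize_predictions(pred_str, label_str):
    """Count decoded predictions and gold labels, and list unparsable outputs."""
    preds = [p.strip().lower() for p in pred_str]
    golds = [l.strip().lower() for l in label_str]
    return {
        # A single dominant key here means the model collapsed to one label
        "pred_counts": Counter(preds),
        "gold_counts": Counter(golds),
        # Outputs such as "falsetruetrue" would otherwise be silently scored as 0
        "unparsed": [p for p in preds if p not in VALID_LABELS],
    }
```

Calling this inside compute_metrics and running trainer.evaluate() once before trainer.train() would show whether the untrained model already emits a single label or unparsable strings.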

Thanks.

Thanks for the reply.
The default parameters are already in the range you specified, so tweaking them resulted in no change.
The only thing that worked was turning off bf16 and training in full precision.
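
For anyone reproducing this, the change might look like the sketch below. It only builds plain keyword arguments so that it runs without any dependencies; build_training_kwargs is a hypothetical helper, the values mirror the script in the original post, and learning_rate and max_grad_norm are the library defaults written out explicitly.

```python
def build_training_kwargs(use_bf16=False):
    """Keyword arguments intended for Seq2SeqTrainingArguments (hypothetical helper)."""
    return {
        "output_dir": "./t5-wic",
        "eval_strategy": "epoch",
        "per_device_train_batch_size": 32,
        "num_train_epochs": 10,
        "save_strategy": "epoch",
        "save_total_limit": 1,
        "load_best_model_at_end": True,
        "metric_for_best_model": "accuracy",
        "predict_with_generate": True,
        # Library defaults, stated explicitly so they are easy to adjust
        "learning_rate": 5e-5,
        "max_grad_norm": 1.0,
        # bf16=False keeps training in full precision, which is what fixed
        # the collapsed predictions in this thread
        "bf16": use_bf16,
    }


kwargs = build_training_kwargs()
# In the original script: training_args = Seq2SeqTrainingArguments(**kwargs)
```

Keeping the precision switch as a single parameter makes it easy to compare a bf16 run against a full-precision run with everything else held fixed.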

Arguments and training don't work efficiently

Hi, apologies for the late reply. Could you please confirm whether you need any further assistance beyond the precision-related issue?
Thanks.

No need; after switching to full precision it trained fine.

jadermcs changed discussion status to closed
