Baguette Boy English-French Translation Model

Model Overview

This is a bidirectional English ↔ French translation model built on the BERT encoder-decoder architecture. It was trained on the en-fr subset of the OPUS-100 dataset, with the translation pairs flipped during preprocessing so the same data also serves as a French-to-English dataset.

  • Model type: Encoder-Decoder (BERT-based)
  • Languages: English (en) ↔ French (fr)
  • Tasks: Machine translation, text-to-text generation
  • Framework: Transformers (Hugging Face)
  • Tokenizer: a custom PreTrainedTokenizerFast trained on the dataset

How to Use

The model makes use of special tokens to indicate the target language for translation:

  • <2fr>: Indicates that the input text should be translated to French.
  • <2en>: Indicates that the input text should be translated to English.
For example, using the translation pipeline:

from transformers import AutoTokenizer, pipeline

# Repository id on the Hugging Face Hub
model_path = "TheOneWhoWill/baguette-boy-en-fr"

tokenizer = AutoTokenizer.from_pretrained(model_path)
translator = pipeline("translation", model=model_path, tokenizer=tokenizer)

# The tokenizer's output joins tokens with spaces and marks real word boundaries
# with "Ġ", so strip the joining spaces and convert "Ġ" back into spaces.
translation_output = translator("<2fr> Hello, how are you?")
print(translation_output[0]["translation_text"].replace(" ", "").replace("Ġ", " "))

translation_output = translator("<2en> Bonjour, comment ça va?")
print(translation_output[0]["translation_text"].replace(" ", "").replace("Ġ", " "))

Model Architecture

  • Encoder: 6-layer BERT with 8 attention heads
  • Decoder: 6-layer BERT with 8 attention heads, cross-attention enabled
  • Maximum sequence length: 128 tokens
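
The full configuration is not listed on this card, but a model of this shape can be assembled with the Transformers EncoderDecoderModel API roughly as sketched below. Only the layer count, head count, and maximum length come from this card; the hidden size, intermediate size, and vocabulary size are illustrative assumptions.

from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

encoder_config = BertConfig(
    vocab_size=32_000,            # assumed: the real value comes from the custom tokenizer
    hidden_size=512,              # assumed
    intermediate_size=2048,       # assumed
    num_hidden_layers=6,          # 6-layer encoder
    num_attention_heads=8,        # 8 attention heads
    max_position_embeddings=128,  # maximum sequence length
)
decoder_config = BertConfig(
    vocab_size=32_000,            # assumed
    hidden_size=512,              # assumed
    intermediate_size=2048,       # assumed
    num_hidden_layers=6,          # 6-layer decoder
    num_attention_heads=8,
    max_position_embeddings=128,
    is_decoder=True,              # causal self-attention in the decoder
    add_cross_attention=True,     # cross-attention over encoder outputs
)
config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder_config, decoder_config)
model = EncoderDecoderModel(config=config)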

Training Data

  • Dataset: OPUS-100 en-fr subset
  • Data preprocessing: translation pairs were flipped to create a bidirectional dataset (see the sketch after this list)
    • Special tokens <2fr> and <2en> were prepended to indicate the target language
    • Sequences were tokenized with the custom tokenizer and truncated to 128 tokens
    • Special tokens used: <s> (BOS), </s> (EOS), [PAD], [UNK]
  • Data size: approximately 2 million sentence pairs, twice the original size of the en-fr subset because of the bidirectional flipping
  • Splits:
    • Training: the OPUS-100 train split
    • Validation: the OPUS-100 validation split
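
A minimal sketch of the flipping and tagging steps, assuming the OPUS-100 data is loaded from the Hugging Face Hub; the dataset identifier, column names, and helper names below are assumptions for illustration.

from datasets import load_dataset
from transformers import AutoTokenizer

# "Helsinki-NLP/opus-100" is assumed to be the Hub identifier for the OPUS-100 dataset.
raw = load_dataset("Helsinki-NLP/opus-100", "en-fr", split="train")
tokenizer = AutoTokenizer.from_pretrained("TheOneWhoWill/baguette-boy-en-fr")

def flip_and_tag(batch):
    # Emit two rows per original pair: the en->fr direction and the flipped fr->en direction.
    sources, targets = [], []
    for pair in batch["translation"]:
        sources.append("<2fr> " + pair["en"])
        targets.append(pair["fr"])
        sources.append("<2en> " + pair["fr"])
        targets.append(pair["en"])
    return {"source": sources, "target": targets}

bidirectional = raw.map(flip_and_tag, batched=True, remove_columns=raw.column_names)

def tokenize(batch):
    # Truncate both sides to the model's 128-token maximum.
    model_inputs = tokenizer(batch["source"], max_length=128, truncation=True)
    model_inputs["labels"] = tokenizer(batch["target"], max_length=128, truncation=True)["input_ids"]
    return model_inputs

tokenized = bidirectional.map(tokenize, batched=True, remove_columns=["source", "target"])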

Training Details

  • Batch size: 32 per device with 8 gradient accumulation steps (effective batch size: 256)
  • Learning rate: 5e-5 with linear decay and warmup
  • Epochs: 3
  • Precision: fp16
  • Hardware: NVIDIA RTX 4070 Mobile GPU
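
A sketch of how these hyperparameters map onto Seq2SeqTrainingArguments; the output directory and warmup ratio are illustrative assumptions, since the card mentions warmup without giving a value.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="baguette-boy-en-fr",   # assumed
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_ratio=0.1,                  # assumed: card says "warmup" without a value
    num_train_epochs=3,
    fp16=True,
)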

Citations

OPUS-100 Dataset

@inproceedings{zhang-etal-2020-improving,
    title = "Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation",
    author = "Zhang, Biao  and
      Williams, Philip  and
      Titov, Ivan  and
      Sennrich, Rico",
    editor = "Jurafsky, Dan  and
      Chai, Joyce  and
      Schluter, Natalie  and
      Tetreault, Joel",
    booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
    month = jul,
    year = "2020",
    address = "Online",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2020.acl-main.148",
    doi = "10.18653/v1/2020.acl-main.148",
    pages = "1628--1639",
}

OPUS Corpus

@inproceedings{tiedemann-2012-parallel,
    title = "Parallel Data, Tools and Interfaces in {OPUS}",
    author = {Tiedemann, J{\"o}rg},
    editor = "Calzolari, Nicoletta  and
      Choukri, Khalid  and
      Declerck, Thierry  and
      Do{\u{g}}an, Mehmet U{\u{g}}ur  and
      Maegaard, Bente  and
      Mariani, Joseph  and
      Moreno, Asuncion  and
      Odijk, Jan  and
      Piperidis, Stelios",
    booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
    month = may,
    year = "2012",
    address = "Istanbul, Turkey",
    publisher = "European Language Resources Association (ELRA)",
    url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf",
    pages = "2214--2218",
}