Baguette Boy English-French Translation Model
Model Overview
This is a bidirectional English ↔ French translation model built on the BERT-based encoder-decoder architecture. It was trained on the en-fr subset of the OPUS-100 dataset, with translations flipped during preprocessing so the data also doubles as a French-to-English dataset.
- Model type: Encoder-Decoder (BERT-based)
- Languages: English (en) ↔ French (fr)
- Tasks: Machine translation, text-to-text generation
- Framework: Transformers (Hugging Face)
- Tokenizer: PreTrainedTokenizerFast wrapping a custom tokenizer trained on the dataset (a training sketch follows this list)
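The tokenizer training script is not included in the card; the following is a minimal sketch of how such a tokenizer could be built with the tokenizers library. The byte-level pre-tokenizer, the vocabulary size, and the corpus file name are assumptions (the byte-level choice is suggested by the Ġ markers that show up in decoded output, handled in the usage example below).

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Assumed setup: byte-level BPE with a 32k vocabulary (neither is stated in the card).
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # assumed value
    special_tokens=["<s>", "</s>", "[PAD]", "[UNK]", "<2fr>", "<2en>"],
)
tokenizer.train(files=["opus100_en_fr.txt"], trainer=trainer)  # hypothetical corpus dump

# Wrap the trained tokenizer so it can be saved and loaded alongside the model.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    pad_token="[PAD]",
    unk_token="[UNK]",
)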
How to Use
The model makes use of special tokens to indicate the target language for translation:
- <2fr>: Indicates that the input text should be translated to French.
- <2en>: Indicates that the input text should be translated to English.
from transformers import AutoTokenizer, pipeline
model_path = "baguette-boy-en-fr"
tokenizer = AutoTokenizer.from_pretrained(model_path)
translator = pipeline("translation", model=model_path, tokenizer=tokenizer)
translation_output = translator("<2fr> Hello, how are you?")
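# Clean up decoding artifacts: remove the literal spaces inserted between tokens,
# then turn the byte-level "Ġ" markers back into real spaces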
print(translation_output[0]["translation_text"].replace(" ", "").replace("Ġ", " "))
translation_output = translator("<2en> Bonjour, comment ça va?")
print(translation_output[0]["translation_text"].replace(" ", "").replace("Ġ", " "))
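As an alternative to the pipeline, the checkpoint can be loaded directly. This is a minimal sketch that assumes the checkpoint loads as a Transformers EncoderDecoderModel and that decoder_start_token_id is set in the saved config, as generation requires:

from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("baguette-boy-en-fr")
model = EncoderDecoderModel.from_pretrained("baguette-boy-en-fr")

inputs = tokenizer("<2fr> Hello, how are you?", return_tensors="pt")
generated = model.generate(inputs["input_ids"], max_length=128)
# The same Ġ cleanup as above may be needed on the decoded text.
print(tokenizer.decode(generated[0], skip_special_tokens=True))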
Model Architecture
- Encoder: 6-layer BERT with 8 attention heads
- Decoder: 6-layer BERT with 8 attention heads, cross-attention enabled
- Maximum sequence length: 128 tokens
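The card does not include a config-building script; the sketch below shows how an equivalent architecture could be assembled with Transformers. Hidden size and vocabulary size are not stated in the card, so the values used here are assumptions.

from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# 6-layer encoder and decoder with 8 attention heads each; vocab size is assumed.
encoder_config = BertConfig(
    vocab_size=32_000,           # assumed, must match the tokenizer
    num_hidden_layers=6,
    num_attention_heads=8,
    max_position_embeddings=128,
)
decoder_config = BertConfig(
    vocab_size=32_000,           # assumed
    num_hidden_layers=6,
    num_attention_heads=8,
    max_position_embeddings=128,
    is_decoder=True,
    add_cross_attention=True,    # cross-attention over the encoder states
)
config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder_config, decoder_config)
model = EncoderDecoderModel(config=config)

# For training and generation the config also needs token ids, e.g.:
# model.config.decoder_start_token_id = tokenizer.bos_token_id
# model.config.pad_token_id = tokenizer.pad_token_id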
Training Data
- Dataset: OPUS-100, en-fr subset
- Data preprocessing: Translations were flipped to create a bidirectional dataset (see the sketch after this list)
- Added special tokens <2fr> and <2en> to indicate the target language
- Tokenized sequences with the custom tokenizer, up to 128 tokens
- Special tokens used: <s> (BOS), </s> (EOS), [PAD], [UNK]
- Data size: Approximately 2 million sentence pairs, 2x the original size of the en-fr subset because of the bidirectional flipping
- Splits:
  - Training: train
  - Validation: validation
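The preprocessing script itself is not included in the card; the following is a sketch of the flipping step, assuming the data is loaded with the Hugging Face datasets library (the dataset id used here is an assumption):

from datasets import load_dataset

# Dataset id is an assumption; the card only names the OPUS-100 en-fr subset.
raw = load_dataset("Helsinki-NLP/opus-100", "en-fr", split="train")

def make_bidirectional(batch):
    # Emit each pair twice: en -> fr prefixed with <2fr>, and fr -> en prefixed with <2en>.
    sources, targets = [], []
    for pair in batch["translation"]:
        sources.append("<2fr> " + pair["en"])
        targets.append(pair["fr"])
        sources.append("<2en> " + pair["fr"])
        targets.append(pair["en"])
    return {"source": sources, "target": targets}

bidirectional = raw.map(make_bidirectional, batched=True, remove_columns=raw.column_names)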
Training Details
- Batch size: 32 per device, gradient accumulation steps: 8 (effective batch size: 128); see the configuration sketch after this list
- Learning rate: 5e-5 with linear decay and warmup
- Epochs: 3
- Precision: fp16
- Hardware: NVIDIA RTX 4070 Mobile GPU
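These hyperparameters map onto a Transformers Seq2SeqTrainingArguments configuration roughly as sketched below; the output directory and the warmup step count are assumptions (the card only says linear decay with warmup):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="baguette-boy-en-fr",   # assumed
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_steps=1000,                 # assumed; the card does not give a value
    num_train_epochs=3,
    fp16=True,
    predict_with_generate=True,
)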
Citations
OPUS-100 Dataset
@inproceedings{zhang-etal-2020-improving,
title = "Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation",
author = "Zhang, Biao and
Williams, Philip and
Titov, Ivan and
Sennrich, Rico",
editor = "Jurafsky, Dan and
Chai, Joyce and
Schluter, Natalie and
Tetreault, Joel",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.acl-main.148",
doi = "10.18653/v1/2020.acl-main.148",
pages = "1628--1639",
}
OPUS Corpus
@inproceedings{tiedemann-2012-parallel,
title = "Parallel Data, Tools and Interfaces in {OPUS}",
author = {Tiedemann, J{\"o}rg},
editor = "Calzolari, Nicoletta and
Choukri, Khalid and
Declerck, Thierry and
Do{\u{g}}an, Mehmet U{\u{g}}ur and
Maegaard, Bente and
Mariani, Joseph and
Moreno, Asuncion and
Odijk, Jan and
Piperidis, Stelios",
booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
month = may,
year = "2012",
address = "Istanbul, Turkey",
publisher = "European Language Resources Association (ELRA)",
url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf",
pages = "2214--2218",
}