Baguette Boy English-French Translation Model
Model Overview
This is a bidirectional English ↔ French translation model built on the BERT-based encoder-decoder architecture. It was trained on the en-fr subset of the OPUS-100 dataset, with translations flipped during preprocessing so the data also doubles as a French-to-English dataset.
- Model type: Encoder-Decoder (BERT-based)
- Languages: English (en) ↔ French (fr)
- Tasks: Machine translation, text-to-text generation
- Framework: Transformers (Hugging Face)
- Tokenizer: PreTrainedTokenizerFast wrapping a custom tokenizer trained on the dataset (a training sketch follows this list)
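The tokenizer training script is not included in the card; the following is a minimal sketch of how such a tokenizer could be built with the tokenizers library. The byte-level pre-tokenizer, the vocabulary size, and the corpus file name are assumptions (the byte-level choice is suggested by the Ġ markers that show up in decoded output, handled in the usage example below).

from tokenizers import Tokenizer, models, pre_tokenizers, trainers
from transformers import PreTrainedTokenizerFast

# Assumed setup: byte-level BPE with a 32k vocabulary (neither is stated in the card).
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel()
trainer = trainers.BpeTrainer(
    vocab_size=32_000,  # assumed value
    special_tokens=["<s>", "</s>", "[PAD]", "[UNK]", "<2fr>", "<2en>"],
)
tokenizer.train(files=["opus100_en_fr.txt"], trainer=trainer)  # hypothetical corpus dump

# Wrap the trained tokenizer so it can be saved and loaded alongside the model.
fast_tokenizer = PreTrainedTokenizerFast(
    tokenizer_object=tokenizer,
    bos_token="<s>",
    eos_token="</s>",
    pad_token="[PAD]",
    unk_token="[UNK]",
)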
How to Use
The model makes use of special tokens to indicate the target language for translation:
- <2fr>: Indicates that the input text should be translated to French.
- <2en>: Indicates that the input text should be translated to English.
from transformers import AutoTokenizer, pipeline
model_path = "baguette-boy-en-fr"
tokenizer = AutoTokenizer.from_pretrained(model_path)
translator = pipeline("translation", model=model_path, tokenizer=tokenizer)
translation_output = translator("<2fr> Hello, how are you?")
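# Clean up decoding artifacts: remove the literal spaces inserted between tokens,
# then turn the byte-level "Ġ" markers back into real spaces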
print(translation_output[0]["translation_text"].replace(" ", "").replace("Ġ", " "))
translation_output = translator("<2en> Bonjour, comment ça va?")
print(translation_output[0]["translation_text"].replace(" ", "").replace("Ġ", " "))
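As an alternative to the pipeline, the checkpoint can be loaded directly. This is a minimal sketch that assumes the checkpoint loads as a Transformers EncoderDecoderModel and that decoder_start_token_id is set in the saved config, as generation requires:

from transformers import AutoTokenizer, EncoderDecoderModel

tokenizer = AutoTokenizer.from_pretrained("baguette-boy-en-fr")
model = EncoderDecoderModel.from_pretrained("baguette-boy-en-fr")

inputs = tokenizer("<2fr> Hello, how are you?", return_tensors="pt")
generated = model.generate(inputs["input_ids"], max_length=128)
# The same Ġ cleanup as above may be needed on the decoded text.
print(tokenizer.decode(generated[0], skip_special_tokens=True))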
Model Architecture
- Encoder: 6-layer BERT with 8 attention heads
- Decoder: 6-layer BERT with 8 attention heads, cross-attention enabled
- Maximum sequence length: 128 tokens
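The card does not include a config-building script; the sketch below shows how an equivalent architecture could be assembled with Transformers. Hidden size and vocabulary size are not stated in the card, so the values used here are assumptions.

from transformers import BertConfig, EncoderDecoderConfig, EncoderDecoderModel

# 6-layer encoder and decoder with 8 attention heads each; vocab size is assumed.
encoder_config = BertConfig(
    vocab_size=32_000,           # assumed, must match the tokenizer
    num_hidden_layers=6,
    num_attention_heads=8,
    max_position_embeddings=128,
)
decoder_config = BertConfig(
    vocab_size=32_000,           # assumed
    num_hidden_layers=6,
    num_attention_heads=8,
    max_position_embeddings=128,
    is_decoder=True,
    add_cross_attention=True,    # cross-attention over the encoder states
)
config = EncoderDecoderConfig.from_encoder_decoder_configs(encoder_config, decoder_config)
model = EncoderDecoderModel(config=config)

# For training and generation the config also needs token ids, e.g.:
# model.config.decoder_start_token_id = tokenizer.bos_token_id
# model.config.pad_token_id = tokenizer.pad_token_id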
Training Data
- Dataset: OPUS-100, en-fr subset
- Data preprocessing: Translations were flipped to create a bidirectional dataset (see the sketch after this list)
- Added special tokens <2fr> and <2en> to indicate the target language
- Tokenized sequences with the custom tokenizer, up to 128 tokens
- Special tokens used: <s> (BOS), </s> (EOS), [PAD], [UNK]
- Data size: Approximately 2 million sentence pairs, 2x the original size of the en-fr subset because of the bidirectional flipping
- Splits:
  - Training: train
  - Validation: validation
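The preprocessing script itself is not included in the card; the following is a sketch of the flipping step, assuming the data is loaded with the Hugging Face datasets library (the dataset id used here is an assumption):

from datasets import load_dataset

# Dataset id is an assumption; the card only names the OPUS-100 en-fr subset.
raw = load_dataset("Helsinki-NLP/opus-100", "en-fr", split="train")

def make_bidirectional(batch):
    # Emit each pair twice: en -> fr prefixed with <2fr>, and fr -> en prefixed with <2en>.
    sources, targets = [], []
    for pair in batch["translation"]:
        sources.append("<2fr> " + pair["en"])
        targets.append(pair["fr"])
        sources.append("<2en> " + pair["fr"])
        targets.append(pair["en"])
    return {"source": sources, "target": targets}

bidirectional = raw.map(make_bidirectional, batched=True, remove_columns=raw.column_names)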
Training Details
- Batch size: 32 per device, gradient accumulation steps: 8 (effective batch size: 128); see the configuration sketch after this list
- Learning rate: 5e-5 with linear decay and warmup
- Epochs: 3
- Precision: fp16
- Hardware: NVIDIA RTX 4070 Mobile GPU
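These hyperparameters map onto a Transformers Seq2SeqTrainingArguments configuration roughly as sketched below; the output directory and the warmup step count are assumptions (the card only says linear decay with warmup):

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="baguette-boy-en-fr",   # assumed
    per_device_train_batch_size=32,
    gradient_accumulation_steps=8,
    learning_rate=5e-5,
    lr_scheduler_type="linear",
    warmup_steps=1000,                 # assumed; the card does not give a value
    num_train_epochs=3,
    fp16=True,
    predict_with_generate=True,
)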
Citations
OPUS-100 Dataset
@inproceedings{zhang-etal-2020-improving,
title = "Improving Massively Multilingual Neural Machine Translation and Zero-Shot Translation",
author = "Zhang, Biao and
Williams, Philip and
Titov, Ivan and
Sennrich, Rico",
editor = "Jurafsky, Dan and
Chai, Joyce and
Schluter, Natalie and
Tetreault, Joel",
booktitle = "Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics",
month = jul,
year = "2020",
address = "Online",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2020.acl-main.148",
doi = "10.18653/v1/2020.acl-main.148",
pages = "1628--1639",
}
OPUS Corpus
@inproceedings{tiedemann-2012-parallel,
title = "Parallel Data, Tools and Interfaces in {OPUS}",
author = {Tiedemann, J{\"o}rg},
editor = "Calzolari, Nicoletta and
Choukri, Khalid and
Declerck, Thierry and
Do{\u{g}}an, Mehmet U{\u{g}}ur and
Maegaard, Bente and
Mariani, Joseph and
Moreno, Asuncion and
Odijk, Jan and
Piperidis, Stelios",
booktitle = "Proceedings of the Eighth International Conference on Language Resources and Evaluation ({LREC}'12)",
month = may,
year = "2012",
address = "Istanbul, Turkey",
publisher = "European Language Resources Association (ELRA)",
url = "http://www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf",
pages = "2214--2218",
}