---
library_name: transformers
pipeline_tag: translation
license: apache-2.0
tags:
  - machine-translation
  - translation
  - seq2seq
  - marian
  - transformers
  - pytorch
  - sacrebleu
  - chrf
  - datasets
  - evaluate
  - tensorboard
  - fp16
  - opus-books
base_model: Helsinki-NLP/opus-mt-en-es
datasets:
  - Helsinki-NLP/opus_books
language:
  - en
  - es
widget:
  - text: "All around, the lonely sea extended to the limits of the horizon."
  - text: "\"With all due respect to master, they don't strike me as very wicked!\""
---

# Model Card for Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT

<!-- Provide a quick summary of what the model is/does. -->

A lean, modern baseline for neural machine translation (NMT) based on a transformer encoder–decoder (MarianMT) fine-tuned for **English → Spanish** on the **OPUS Books** dataset. It uses Hugging Face `transformers`, `datasets`, and `evaluate`, logs to TensorBoard, and reports sacreBLEU and chrF. Results and training details below.

## Model Details

### Model Description

This repository implements a small but complete seq2seq translation pipeline with sensible defaults: it loads the OPUS Books dataset, ensures train/validation/test splits, tokenizes source and target correctly using `text_target=`, fine-tunes a MarianMT checkpoint, and evaluates with BLEU/chrF. The implementation favors clarity and hackability and is intended as a reproducible baseline you can swap to different language pairs, datasets, or models (e.g., T5, mBART).

- **Developed by:** Amir Hossein Yousefi (GitHub: `amirhossein-yousefi`)
- **Shared by:** Hugging Face user `Amirhossein75`
- **Model type:** Transformer encoder–decoder (MarianMT) for machine translation
- **Language(s) (NLP):** Source: English (`en`) → Target: Spanish (`es`) by default (configurable)
- **License:** *Not explicitly specified in the repository.* The base checkpoint `Helsinki-NLP/opus-mt-en-es` is released under **CC-BY-4.0**, and the OPUS Books dataset card lists license **“other”**; verify compatibility for your use case.
- **Finetuned from model:** `Helsinki-NLP/opus-mt-en-es` (MarianMT)

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation
- **Model on Hugging Face:** https://huggingface.co/Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT
- **Base model:** https://huggingface.co/Helsinki-NLP/opus-mt-en-es
- **Dataset:** https://huggingface.co/datasets/Helsinki-NLP/opus_books
- **MarianMT docs:** https://huggingface.co/docs/transformers/en/model_doc/marian
- **Related reading:** Tiedemann & Thottingal (2020), “OPUS-MT — Building open translation services for the World”; Tiedemann et al. (2023), “Democratizing neural machine translation with OPUS‑MT”.

## Uses

### Direct Use

- Research and education: a clear, reproducible baseline for fine-tuning transformer-based MT on a small public dataset.
- Prototyping translation systems for English→Spanish (or other pairs after configuration changes).

### Downstream Use

- Fine-tune on domain-specific parallel corpora for production MT.
- Replace the base model with T5/mBART/other OPUS-MT variants by changing `TrainConfig.model_name` (see the sketch after this list).

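A minimal sketch of that kind of override, assuming a dataclass-style config; the field names below are illustrative and may not match `src/config.py` exactly:

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Illustrative fields only; consult src/config.py for the real names and defaults.
    model_name: str = "Helsinki-NLP/opus-mt-en-es"
    dataset_name: str = "Helsinki-NLP/opus_books"
    source_lang: str = "en"
    target_lang: str = "es"
    max_source_length: int = 128
    max_target_length: int = 128
    per_device_train_batch_size: int = 16

# Swap in a different seq2seq checkpoint, e.g. an mBART variant.
cfg = TrainConfig(model_name="facebook/mbart-large-50-many-to-many-mmt")
```
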
### Out-of-Scope Use

- Safety‑critical or high‑stakes scenarios without human review.
- Zero-shot translation to/from languages not covered by the checkpoint or dataset.
- Use cases that assume perfect adequacy/faithfulness, or robustness on noisy, modern, or informal text, without additional fine‑tuning.

## Bias, Risks, and Limitations

- **Domain & recency mismatch:** OPUS Books contains copyright‑free books and is **dated**; performance may degrade on contemporary, conversational, or domain‑specific text.
- **Language & register:** Trained for EN→ES; style may skew literary/formal. Expect errors on slang, dialectal variants, code‑switching, and technical jargon.
- **General MT caveats:** Typical MT biases (gendered forms, named‑entity transliteration, idioms) can surface; outputs may be fluent but inaccurate.

### Recommendations

- Evaluate on **your** domain with sacreBLEU/chrF and targeted tests (named entities, numbers, formatting); a quick spot-check sketch follows this list.
- Add domain or synthetic data and continue fine‑tuning; include human‑in‑the‑loop QA for critical use.
- If deploying, log sources and predictions; implement quality thresholds and fall back to human translation as needed.

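As a quick domain spot-check, something along these lines works with the `evaluate` library (the sentence pair below is a placeholder; substitute text from your own domain):

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")

# Placeholder data: model outputs and reference translations from your own domain.
predictions = ["El mar se extendía hasta el horizonte."]
references = [["El mar se extendía hasta los límites del horizonte."]]  # one list of references per prediction

print(sacrebleu.compute(predictions=predictions, references=references)["score"])
print(chrf.compute(predictions=predictions, references=references)["score"])
```
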
## How to Get Started with the Model

**Option A — Quick inference (baseline checkpoint):**

```python
from transformers import pipeline

translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")
translator("The sea extended to the horizon.")
```

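To try the fine-tuned checkpoint from this card instead, load it by its Hub id (this assumes the weights have actually been pushed; see "More Information" below):

```python
from transformers import pipeline

# Assumes the fine-tuned weights are available under this Hub id.
translator = pipeline(
    "translation",
    model="Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT",
)
print(translator("All around, the lonely sea extended to the limits of the horizon."))
```
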
**Option B — Train/evaluate with this repo (default EN→ES on OPUS Books):**

```bash
git clone https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation.git
cd Sequence2Sequence-Transformer-Translation
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python -m src.train  # or: python src/train.py
```

Artifacts (model and tokenizer) are saved under the configured `outputs` directory; you can then push them to the Hub, as sketched below.

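A minimal sketch of pushing those artifacts to the Hub (the local path is illustrative; point it at the directory your run actually produced, and authenticate first with `huggingface-cli login` or an `HF_TOKEN`):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint_dir = "outputs/opus-mt-en-es-finetuned"  # illustrative path

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_dir)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)

repo_id = "Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT"
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```
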
## Training Details

### Training Data

- **Dataset:** OPUS Books (`Helsinki-NLP/opus_books`), English–Spanish split. The dataset compiles aligned, copyright‑free books; many texts are older, and some alignments are manually reviewed. See the dataset card for caveats.
- **Preprocessing:** Tokenization uses Hugging Face tokenizers with `text_target=` for the target (labels), avoiding leakage and ensuring correct special‑token handling; a sketch follows this list.

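A minimal sketch of that `text_target=` pattern, assuming the OPUS Books `translation` column format and illustrative sequence lengths:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

def preprocess(batch, max_length=128):
    # OPUS Books rows look like {"translation": {"en": "...", "es": "..."}}.
    sources = [ex["en"] for ex in batch["translation"]]
    targets = [ex["es"] for ex in batch["translation"]]
    # text_target= tokenizes the references with target-side settings and
    # stores the result under "labels" alongside the source input_ids.
    return tokenizer(sources, text_target=targets, max_length=max_length, truncation=True)

# Usage: tokenized = dataset.map(preprocess, batched=True,
#                                remove_columns=dataset["train"].column_names)
```
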
### Training Procedure

Implemented with Hugging Face **Trainer** and `TrainingArguments`. Mixed precision (`fp16`) is enabled automatically when CUDA is available. Logging is written to TensorBoard under `outputs/.../logs`.

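A hedged sketch of what the corresponding arguments might look like with `Seq2SeqTrainingArguments`; the values are illustrative, and the repository's actual settings live in `src/config.py`:

```python
import torch
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="outputs/opus-mt-en-es-finetuned",   # illustrative path
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
    predict_with_generate=True,                     # decode during eval so BLEU/chrF can be computed
    fp16=torch.cuda.is_available(),                 # mixed precision only when a CUDA GPU is present
    logging_dir="outputs/opus-mt-en-es-finetuned/logs",
    report_to=["tensorboard"],
)
```
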
#### Preprocessing

- Lower‑casing/normalization is left to the tokenizer (no additional bespoke normalization).
- Max sequence lengths (source/target) and batch size are configurable in `TrainConfig`.

#### Training Hyperparameters

- **Training regime:** Automatic mixed precision (**fp16**) when CUDA is available; standard fp32 otherwise.
- Other hyperparameters (batch size, epochs, learning rate, max lengths) are defined in `src/config.py` and can be overridden in your script.

#### Speeds, Sizes, Times

- **Hardware:** NVIDIA GeForce RTX 3080 Ti **Laptop** GPU (16 GB VRAM) on Windows (WDDM); CUDA driver 12.9; PyTorch 2.8.0+cu129.
- **Total FLOPs (training):** 4,945,267,757,416,448
- **Training runtime:** 2,449.291 seconds (≈ 40:45 wall‑clock)
- **Throughput:** train ≈ 12.90 steps/s · val ≈ 1.85 steps/s · test ≈ 1.84 steps/s

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- OPUS Books **test** split for EN→ES.

#### Factors

- Reported metrics are aggregate; you may wish to break them down by category (named entities, numbers, sentence length) for your domain.

#### Metrics

- **sacreBLEU** (higher is better)
- **chrF** (higher is better)
- **Average generated length** (tokens); see the sketch below for how these are computed during evaluation.

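A minimal sketch of a `compute_metrics` hook for these metrics (illustrative; the repository's implementation may differ):

```python
import numpy as np
import evaluate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # The data collator pads labels with -100; restore the pad token id before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    references = [[ref] for ref in decoded_labels]
    gen_len = np.mean([np.count_nonzero(p != tokenizer.pad_token_id) for p in preds])
    return {
        "bleu": sacrebleu.compute(predictions=decoded_preds, references=references)["score"],
        "chrf": chrf.compute(predictions=decoded_preds, references=references)["score"],
        "gen_len": float(gen_len),
    }
```
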
### Results

- **BLEU (val/test):** 23.41 / 23.41
- **chrF (val/test):** 48.20 / 48.21
- **Loss (train/val/test):** 1.854 / 1.883 / 1.859
- **Avg generation length (val/test):** 30.27 / 29.88 tokens
- **Wall‑clock:** train 40:45 · val 5:16 · test 5:18

#### Summary

The model produces fluent Spanish with moderate adequacy on OPUS Books; BLEU ≈ 23.4 and chrF ≈ 48.2 are consistent across validation and test.

## Model Examination

- Qualitative samples (EN→ES) and loss curves are included under `assets/`; TensorBoard logs are under `outputs/.../logs`.
- Consider contrastive tests (gendered occupations, idioms) and targeted error analyses for your domain.

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** Single consumer‑grade GPU (RTX 3080 Ti Laptop, 16 GB)
- **Hours used:** ~0.68 hours (≈ 2,449 seconds) for the reported training run
- **Cloud Provider:** N/A (local laptop)
- **Compute Region:** N/A
- **Carbon Emitted:** Not estimated; depends on the local energy mix

## Technical Specifications

### Model Architecture and Objective

- Transformer **encoder–decoder** (MarianMT): 6‑layer encoder and 6‑layer decoder, static sinusoidal positional embeddings; optimized for translation as conditional generation. (These dimensions can be verified from the checkpoint config, as shown below.)

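A quick way to inspect those dimensions directly from the base checkpoint's config:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Helsinki-NLP/opus-mt-en-es")
print(config.encoder_layers, config.decoder_layers)    # 6, 6
print(config.d_model, config.encoder_attention_heads)  # hidden size and number of attention heads
```
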
### Compute Infrastructure

#### Hardware

- Laptop (Windows, WDDM driver), NVIDIA GeForce RTX 3080 Ti (16 GB).

#### Software

- Python 3.13+, `transformers` 4.42+, `datasets` 3.0+, `evaluate` 0.4.2+, PyTorch 2.8.0 (CUDA 12.9), TensorBoard logging.

## Citation

If you use this model or code, please consider citing the OPUS‑MT work and Marian:

**BibTeX (OPUS‑MT):**

```bibtex
@inproceedings{tiedemann-thottingal-2020-opus,
  title     = "{OPUS}-{MT} -- Building open translation services for the World",
  author    = "Tiedemann, J{\"o}rg and Thottingal, Santhosh",
  booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
  year      = "2020"
}
```

**BibTeX (Democratizing NMT with OPUS‑MT):**

```bibtex
@article{tiedemann2023democratizing,
  title   = {Democratizing neural machine translation with {OPUS-MT}},
  author  = {Tiedemann, J{\"o}rg and Aulamo, Mikko and others},
  journal = {Language Resources and Evaluation},
  year    = {2023}
}
```

## Glossary

- **BLEU:** Precision‑based n‑gram overlap metric; reported via sacreBLEU for comparability.
- **chrF:** Character n‑gram F‑score; more sensitive to morphological correctness (see the formula below).

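For reference, chrF combines character n‑gram precision (chrP) and recall (chrR) into an F‑score; with the default β = 2 used by sacreBLEU's chrF implementation, recall is weighted more heavily:

$$
\mathrm{chrF}_{\beta} = (1 + \beta^{2})\,\frac{\mathrm{chrP} \cdot \mathrm{chrR}}{\beta^{2}\,\mathrm{chrP} + \mathrm{chrR}}
$$
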
## More Information

- See the repository README for project structure, defaults, and customization tips.
- The Hub model repo currently exists; ensure weights and a model card are pushed before using it directly.

## Model Card Authors

- Amir Hossein Yousefi (project author)
- (This model card was drafted for users of the repository.)

## Model Card Contact

- Open an issue in the repository or contact the Hugging Face user `Amirhossein75`.