---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
library_name: transformers
pipeline_tag: translation
license: apache-2.0
tags:
  - machine-translation
  - translation
  - seq2seq
  - marian
  - transformers
  - pytorch
  - sacrebleu
  - chrf
  - datasets
  - evaluate
  - tensorboard
  - fp16
  - opus-books
base_model: Helsinki-NLP/opus-mt-en-es
datasets:
  - Helsinki-NLP/opus_books
language:
  - en
  - es
widget:
  - text: "All around, the lonely sea extended to the limits of the horizon."
  - text: "\"With all due respect to master, they don't strike me as very wicked!\""
---

# Model Card for Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT

A lean, modern baseline for neural machine translation (NMT) based on a transformer encoder–decoder (MarianMT) fine-tuned for **English → Spanish** on the **OPUS Books** dataset. It uses Hugging Face `transformers`, `datasets`, and `evaluate`, logs to TensorBoard, and reports sacreBLEU and chrF. Results and training details are reported below.

## Model Details

### Model Description

This repository implements a small but complete seq2seq translation pipeline with sensible defaults: it loads the OPUS Books dataset, ensures train/validation/test splits, tokenizes source and target correctly using `text_target=`, fine-tunes a MarianMT checkpoint, and evaluates with BLEU/chrF. The implementation favors clarity and hackability and is intended as a reproducible baseline that you can adapt to other language pairs, datasets, or models (e.g., T5, mBART).

- **Developed by:** Amir Hossein Yousefi (GitHub: `amirhossein-yousefi`)
- **Shared by:** Hugging Face user `Amirhossein75`
- **Model type:** Transformer encoder–decoder (MarianMT) for machine translation
- **Language(s) (NLP):** Source: English (`en`) → Target: Spanish (`es`) by default (configurable)
- **License:** *Not explicitly specified in the repository.* The base checkpoint `Helsinki-NLP/opus-mt-en-es` is released under **CC-BY-4.0**, and the OPUS Books dataset card lists license **“other”**; verify compatibility for your use case.
- **Finetuned from model:** `Helsinki-NLP/opus-mt-en-es` (MarianMT)

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation
- **Model on Hugging Face:** https://huggingface.co/Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT
- **Base model:** https://huggingface.co/Helsinki-NLP/opus-mt-en-es
- **Dataset:** https://huggingface.co/datasets/Helsinki-NLP/opus_books
- **MarianMT docs:** https://huggingface.co/docs/transformers/en/model_doc/marian
- **Related reading:** Tiedemann & Thottingal (2020), “OPUS-MT — Building open translation services for the World”; Tiedemann et al. (2023), “Democratizing neural machine translation with OPUS-MT”.

## Uses

### Direct Use

- Research and education: a clear, reproducible baseline for fine-tuning transformer-based MT on a small public dataset.
- Prototyping translation systems for English→Spanish (or other pairs after configuration changes).

### Downstream Use

- Fine-tune on domain-specific parallel corpora for production MT.
- Replace the base model with T5/mBART/other OPUS-MT variants by changing `TrainConfig.model_name`.

### Out-of-Scope Use

- Safety-critical or high-stakes scenarios without human review.
- Zero-shot translation to/from languages not covered by the checkpoint or dataset.
- Use cases assuming perfect adequacy/faithfulness or robustness on noisy, modern, or informal text without additional fine-tuning.

## Bias, Risks, and Limitations

- **Domain & recency mismatch:** OPUS Books contains copyright-free books and is **dated**; performance may degrade on contemporary, conversational, or domain-specific text.
- **Language & register:** Trained for EN→ES; style may skew literary/formal. For slang, dialectal variants, code-switching, or technical jargon, expect errors.
- **General MT caveats:** Typical MT biases (gendered forms, named-entity transliteration, idioms) can surface; outputs may be fluent but inaccurate.

### Recommendations

- Evaluate on **your** domain with sacreBLEU/chrF and targeted tests (named entities, numbers, formatting); a minimal scoring sketch follows this list.
- Add domain or synthetic data and continue fine-tuning; include human-in-the-loop QA for critical use.
- If deploying, log sources and predictions; implement quality thresholds and fall back to human translation as needed.
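To make the first recommendation concrete, the snippet below is a minimal sketch of scoring a few sentence pairs with sacreBLEU and chrF via the `evaluate` library. The example sentences and references are illustrative placeholders (not from OPUS Books), and the baseline checkpoint is used only so the snippet runs end to end.

```python
# Minimal sketch: score a handful of in-domain examples with sacreBLEU and chrF.
# The sentences and references below are illustrative placeholders; substitute your own data.
import evaluate
from transformers import pipeline

translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")

sources = ["The invoice is due on Friday.", "Please restart the router."]
references = [["La factura vence el viernes."], ["Por favor, reinicia el router."]]

predictions = [out["translation_text"] for out in translator(sources)]

sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")
print("BLEU:", sacrebleu.compute(predictions=predictions, references=references)["score"])
print("chrF:", chrf.compute(predictions=predictions, references=references)["score"])
```

sacreBLEU accepts one or more references per prediction, hence the nested reference lists; chrF uses the same interface.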
## How to Get Started with the Model

**Option A — Quick inference (baseline checkpoint):**

```python
from transformers import pipeline

translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")
translator("The sea extended to the horizon.")
```

**Option B — Train/evaluate with this repo (default EN→ES on OPUS Books):**

```bash
git clone https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation.git
cd Sequence2Sequence-Transformer-Translation
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python -m src.train   # or: python src/train.py
```

Artifacts (model, tokenizer) are saved under the configured `outputs` directory; you can then push them to the Hub.

## Training Details

### Training Data

- **Dataset:** OPUS Books (`Helsinki-NLP/opus_books`), English–Spanish split. The dataset compiles aligned, copyright-free books; many texts are older, and some alignments are manually reviewed. See the dataset card for caveats.
- **Preprocessing:** Tokenization uses Hugging Face tokenizers with `text_target=` for the target (labels), avoiding leakage and ensuring correct special-token handling; a minimal sketch follows this list.
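The snippet below is a minimal sketch of that `text_target=` pattern, not the repository's actual preprocessing code; the dataset config name (`en-es`) and the maximum length are assumptions made to keep the example self-contained.

```python
# Sketch of the text_target= tokenization pattern for seq2seq fine-tuning.
# Field names ("translation", "en", "es") follow the OPUS Books schema; max_len is illustrative.
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

def preprocess(batch, max_len=128):
    sources = [pair["en"] for pair in batch["translation"]]
    targets = [pair["es"] for pair in batch["translation"]]
    # text_target= tokenizes the labels in the same call, adding the correct
    # special tokens on the target side instead of reusing the source pipeline.
    return tokenizer(sources, text_target=targets, max_length=max_len, truncation=True)

raw = load_dataset("Helsinki-NLP/opus_books", "en-es")
tokenized = raw.map(preprocess, batched=True, remove_columns=raw["train"].column_names)
```

Tokenizing with `text_target=` yields `labels` alongside `input_ids`, so a seq2seq data collator can pad sources and labels consistently during training.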
### Training Procedure

Implemented with the Hugging Face **Trainer** and `TrainingArguments`. Mixed precision (`fp16`) is enabled automatically when CUDA is available. Logging is written to TensorBoard under `outputs/.../logs`.

#### Preprocessing

- Lower-casing/normalization is left to the tokenizer (no additional bespoke normalization).
- Max sequence lengths (source/target) and batch size are configurable in `TrainConfig`.

#### Training Hyperparameters

- **Training regime:** Automatic mixed precision (**fp16**) when CUDA is available; standard fp32 otherwise.
- Other hyperparameters (batch size, epochs, learning rate, max lengths) are defined in `src/config.py` and can be overridden in your script.

#### Speeds, Sizes, Times

- **Hardware:** NVIDIA GeForce RTX 3080 Ti **Laptop** GPU (16 GB VRAM) on Windows (WDDM); CUDA driver 12.9; PyTorch 2.8.0+cu129.
- **Total FLOPs (training):** 4,945,267,757,416,448
- **Training runtime:** 2,449.291 seconds (≈ 40:45 wall-clock)
- **Throughput:** train ≈ 12.90 steps/s · val ≈ 1.85 steps/s · test ≈ 1.84 steps/s

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- OPUS Books **test** split for EN→ES.

#### Factors

- Reported metrics are aggregate; you may wish to break them down by category (named entities, numbers, sentence length) for your domain.

#### Metrics

- **sacreBLEU** (higher is better)
- **chrF** (higher is better)
- **Average generated length** (tokens)

### Results

- **BLEU (val/test):** 23.41 / 23.41
- **chrF (val/test):** 48.20 / 48.21
- **Loss (train/val/test):** 1.854 / 1.883 / 1.859
- **Avg generation length (val/test):** 30.27 / 29.88 tokens
- **Wall-clock:** train 40:45 · val 5:16 · test 5:18

#### Summary

The model produces fluent Spanish with moderate adequacy on OPUS Books; BLEU ≈ 23.4 and chrF ≈ 48.2 are consistent across validation and test.

## Model Examination

- Qualitative samples (EN→ES) and loss curves are included under `assets/`, with TensorBoard logs in `outputs/.../logs`.
- Consider contrastive tests (gendered occupations, idioms) and targeted error analyses for your domain.

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** Single consumer-grade GPU (RTX 3080 Ti Laptop, 16 GB)
- **Hours used:** ~0.68 hours (≈ 2,449 seconds) for the reported training run
- **Cloud Provider:** N/A (local laptop)
- **Compute Region:** N/A
- **Carbon Emitted:** Not estimated; depends on the local energy mix

## Technical Specifications

### Model Architecture and Objective

- Transformer **encoder–decoder** (MarianMT): 6-layer encoder and 6-layer decoder, static sinusoidal positional embeddings; optimized for translation as conditional generation.

### Compute Infrastructure

#### Hardware

- Laptop (Windows, WDDM driver), NVIDIA GeForce RTX 3080 Ti (16 GB).

#### Software

- Python 3.13+, `transformers` 4.42+, `datasets` 3.0+, `evaluate` 0.4.2+, PyTorch 2.8.0 (CUDA 12.9), TensorBoard logging.

## Citation

If you use this model or code, please consider citing the OPUS-MT work and Marian:

**BibTeX (OPUS-MT):**

```
@inproceedings{tiedemann-thottingal-2020-opus,
  title     = "{OPUS}-{MT} -- Building open translation services for the World",
  author    = "Tiedemann, J{\"o}rg and Thottingal, Santhosh",
  booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
  year      = "2020"
}
```

**BibTeX (Democratizing NMT with OPUS-MT):**

```
@article{tiedemann2023democratizing,
  title   = {Democratizing neural machine translation with {OPUS-MT}},
  author  = {Tiedemann, J{\"o}rg and Aulamo, Mikko and others},
  journal = {Language Resources and Evaluation},
  year    = {2023}
}
```

## Glossary

- **BLEU:** Precision-based n-gram overlap metric; reported via sacreBLEU for comparability.
- **chrF:** Character n-gram F-score; more sensitive to morphological correctness.

## More Information

- See the repository README for project structure, defaults, and customization tips.
- The Hub model repo exists; make sure the weights and a model card have been pushed before relying on it directly.

## Model Card Authors

- Amir Hossein Yousefi (project author)
- This model card was drafted for users of the repository.

## Model Card Contact

- Open an issue in the repository or contact the Hugging Face user `Amirhossein75`.