---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
library_name: transformers
pipeline_tag: translation
license: apache-2.0
tags:
  - machine-translation
  - translation
  - seq2seq
  - marian
  - transformers
  - pytorch
  - sacrebleu
  - chrf
  - datasets
  - evaluate
  - tensorboard
  - fp16
  - opus-books
base_model: Helsinki-NLP/opus-mt-en-es
datasets:
  - Helsinki-NLP/opus_books
language:
  - en
  - es
widget:
  - text: "All around, the lonely sea extended to the limits of the horizon."
  - text: "\"With all due respect to master, they don't strike me as very wicked!\""
---
# Model Card for Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT
A lean, modern baseline for neural machine translation (NMT) based on a transformer encoder–decoder (MarianMT) fine-tuned for **English → Spanish** on the **OPUS Books** dataset. It uses Hugging Face `transformers`, `datasets`, and `evaluate`, logs to TensorBoard, and reports sacreBLEU and chrF. Results and training details are reported below.
## Model Details
### Model Description
This repository implements a small but complete seq2seq translation pipeline with sensible defaults: it loads the OPUS Books dataset, ensures train/validation/test splits, tokenizes source and target correctly using `text_target=`, fine-tunes a MarianMT checkpoint, and evaluates with BLEU/chrF. The implementation favors clarity and hackability and is intended as a reproducible baseline you can swap to different language pairs/datasets or models (e.g., T5, mBART).
- **Developed by:** Amir Hossein Yousefi (GitHub: `amirhossein-yousefi`)
- **Shared by:** Hugging Face user `Amirhossein75`
- **Model type:** Transformer encoder–decoder (MarianMT) for machine translation
- **Language(s) (NLP):** Source: English (`en`) → Target: Spanish (`es`) by default (configurable)
- **License:** Apache-2.0 (declared in this card's metadata); not explicitly specified in the GitHub repository. The base checkpoint `Helsinki-NLP/opus-mt-en-es` is released under **CC-BY-4.0**, and the OPUS Books dataset card lists license **“other”**; verify compatibility for your use case.
- **Finetuned from model:** `Helsinki-NLP/opus-mt-en-es` (MarianMT)
### Model Sources
- **Repository:** https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation
- **Model on Hugging Face:** https://huggingface.co/Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT
- **Base model:** https://huggingface.co/Helsinki-NLP/opus-mt-en-es
- **Dataset:** https://huggingface.co/datasets/Helsinki-NLP/opus_books
- **MarianMT docs:** https://huggingface.co/docs/transformers/en/model_doc/marian
- **Related reading:** Tiedemann & Thottingal (2020), “OPUS-MT — Building open translation services for the World”; Tiedemann et al. (2023), “Democratizing neural machine translation with OPUS‑MT”.
## Uses
### Direct Use
- Research and education: a clear, reproducible baseline for fine-tuning transformer-based MT on a small public dataset.
- Prototyping translation systems for English→Spanish (or other pairs after configuration changes).
### Downstream Use
- Fine-tune on domain-specific parallel corpora for production MT.
- Replace the base model with T5/mBART/other OPUS-MT variants by changing `TrainConfig.model_name`.
### Out-of-Scope Use
- Safety‑critical or high‑stakes scenarios without human review.
- Zero-shot translation to/from languages not covered by the checkpoint or dataset.
- Use cases assuming perfect adequacy/faithfulness or robustness on noisy, modern, or informal text without additional fine‑tuning.
## Bias, Risks, and Limitations
- **Domain & recency mismatch:** OPUS Books contains copyright‑free books and is **dated**; performance may degrade on contemporary, conversational, or domain‑specific text.
- **Language & register:** Trained for EN→ES; style may skew literary/formal. For slang, dialectal variants, code‑switching, or technical jargon, expect errors.
- **General MT caveats:** Typical MT biases (gendered forms, named entity transliteration, idioms) can surface; outputs may be fluent but inaccurate.
### Recommendations
- Evaluate on **your** domain with sacreBLEU/chrF and targeted tests (named entities, numbers, formatting); see the sketch after this list.
- Add domain or synthetic data and continue fine‑tuning; include human‑in‑the‑loop QA for critical use.
- If deploying, log sources and predictions; implement quality thresholds and fallback to human translation as needed.
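A minimal sketch of scoring a few in-domain sentence pairs with `evaluate`, as suggested in the first recommendation; the example sentences are placeholders and the baseline checkpoint is used only for illustration (swap in your own fine-tuned model and data):

```python
from transformers import pipeline
import evaluate

# Illustrative in-domain pairs; substitute your own source sentences and references.
sources = ["The invoice is due on March 3rd.", "Please restart the server."]
references = [["La factura vence el 3 de marzo."], ["Por favor, reinicia el servidor."]]

# Swap in your fine-tuned checkpoint (Hub ID or local path) once available.
translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")
predictions = [out["translation_text"] for out in translator(sources)]

sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")
print("sacreBLEU:", sacrebleu.compute(predictions=predictions, references=references)["score"])
print("chrF:", chrf.compute(predictions=predictions, references=references)["score"])
```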
## How to Get Started with the Model
**Option A — Quick inference (baseline checkpoint):**
```python
from transformers import pipeline
translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")
print(translator("The sea extended to the horizon.")[0]["translation_text"])
```
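Once the fine-tuned weights have been pushed to the Hub repository listed above, the same call should work with that model ID (this assumes the repo already contains the weights and tokenizer):

```python
from transformers import pipeline

# Assumes the fine-tuned weights/tokenizer have been pushed to this Hub repo.
translator = pipeline("translation", model="Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT")
print(translator("The sea extended to the horizon.")[0]["translation_text"])
```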
**Option B — Train/evaluate with this repo (default EN→ES on OPUS Books):**
```bash
git clone https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation.git
cd Sequence2Sequence-Transformer-Translation
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python -m src.train # or: python src/train.py
```
Artifacts (model, tokenizer) are saved under the configured `outputs` directory; you can then push them to the Hub.
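A minimal sketch of pushing the saved artifacts to the Hub; the local path is a placeholder, not the repository's exact output directory:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder path: point this at the directory your training run actually wrote.
output_dir = "outputs/marian-en-es"

model = AutoModelForSeq2SeqLM.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)

# Requires an authenticated Hub session (e.g. `huggingface-cli login`).
model.push_to_hub("Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT")
tokenizer.push_to_hub("Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT")
```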
## Training Details
### Training Data
- **Dataset:** OPUS Books (`Helsinki-NLP/opus_books`) English–Spanish split. The dataset compiles aligned, copyright‑free books; many texts are older, and some alignments are manually reviewed. See the dataset card for caveats.
- **Preprocessing:** Tokenization uses Hugging Face tokenizers with `text_target=` for the target (labels), avoiding leakage and ensuring correct special‑token handling.
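A minimal sketch of the `text_target=` pattern described above; the max length and the batched mapping are illustrative assumptions, not the repository's exact code:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
raw = load_dataset("Helsinki-NLP/opus_books", "en-es")

def preprocess(batch):
    sources = [pair["en"] for pair in batch["translation"]]
    targets = [pair["es"] for pair in batch["translation"]]
    # `text_target=` tokenizes the labels with the target-side settings.
    return tokenizer(sources, text_target=targets, max_length=128, truncation=True)

tokenized = raw.map(preprocess, batched=True, remove_columns=raw["train"].column_names)
```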
### Training Procedure
Implemented with Hugging Face **Trainer** and `TrainingArguments`. Mixed precision (`fp16`) is enabled automatically when CUDA is available. Logging is written to TensorBoard under `outputs/.../logs`.
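A rough sketch of how this setup is typically wired, continuing from the preprocessing sketch above; the argument values are illustrative, and the repository may use `Trainer`/`TrainingArguments` rather than the Seq2Seq variants shown here:

```python
import torch
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "Helsinki-NLP/opus-mt-en-es"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

args = Seq2SeqTrainingArguments(
    output_dir="outputs/marian-en-es",        # placeholder path
    per_device_train_batch_size=16,           # illustrative value
    learning_rate=2e-5,                       # illustrative value
    num_train_epochs=3,                       # illustrative value
    predict_with_generate=True,               # decode during evaluation for BLEU/chrF
    fp16=torch.cuda.is_available(),           # mixed precision only when CUDA is present
    logging_dir="outputs/marian-en-es/logs",  # TensorBoard logs
    report_to=["tensorboard"],
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],         # tokenized dataset from the sketch above
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```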
#### Preprocessing
- Lower‑casing/normalization is left to the tokenizer (no additional bespoke normalization).
- Max sequence lengths (source/target) and batch size are configurable in `TrainConfig`.
#### Training Hyperparameters
- **Training regime:** Automatic mixed precision (**fp16**) when CUDA is available; standard fp32 otherwise.
- Other hyperparameters (batch size, epochs, learning rate, max lengths) are defined in `src/config.py` and can be overridden in your script.
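The real field names live in `src/config.py`; the dataclass below is only a hedged illustration of the kind of configuration object described, with assumed names and defaults:

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Field names and defaults are illustrative; see src/config.py for the actual ones.
    model_name: str = "Helsinki-NLP/opus-mt-en-es"  # swap for T5/mBART/another OPUS-MT pair
    dataset_name: str = "Helsinki-NLP/opus_books"
    language_pair: str = "en-es"
    max_source_length: int = 128
    max_target_length: int = 128
    per_device_batch_size: int = 16
    num_train_epochs: int = 3
    learning_rate: float = 2e-5
```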
#### Speeds, Sizes, Times
- **Hardware:** NVIDIA GeForce RTX 3080 Ti **Laptop** GPU (16 GB VRAM) on Windows (WDDM); CUDA driver 12.9; PyTorch 2.8.0+cu129.
- **Total FLOPs (training):** 4,945,267,757,416,448
- **Training runtime:** 2,449.291 seconds (≈ 40:45 wall‑clock)
- **Throughput:** train ≈ 12.90 steps/s · val ≈ 1.85 steps/s · test ≈ 1.84 steps/s
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
- OPUS Books **test** split for EN→ES.
#### Factors
- Reported metrics are aggregate; you may wish to break down by category (named entities, numbers, sentence length) for your domain.
#### Metrics
- **sacreBLEU** (higher is better)
- **chrF** (higher is better)
- **Average generated length** (tokens)
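A rough sketch of how these three quantities are commonly computed in a `compute_metrics` hook passed to a generation-enabled trainer; the helper below is an illustration, not the repository's exact function:

```python
import numpy as np
import evaluate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Replace the -100 padding used for ignored label positions before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    refs = [[label] for label in decoded_labels]
    return {
        "bleu": sacrebleu.compute(predictions=decoded_preds, references=refs)["score"],
        "chrf": chrf.compute(predictions=decoded_preds, references=refs)["score"],
        "gen_len": float(np.mean(
            [np.count_nonzero(p != tokenizer.pad_token_id) for p in preds]
        )),
    }
```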
### Results
- **BLEU (val/test):** 23.41 / 23.41
- **chrF (val/test):** 48.20 / 48.21
- **Loss (train/val/test):** 1.854 / 1.883 / 1.859
- **Avg generation length (val/test):** 30.27 / 29.88 tokens
- **Wall‑clock:** train 40:45 · val 5:16 · test 5:18
#### Summary
The model produces fluent Spanish with moderate adequacy on OPUS Books; BLEU ≈ 23.4 and chrF ≈ 48.2 are consistent across validation and test.
## Model Examination
- Qualitative samples (EN→ES) and loss curves are included under `assets/` and TensorBoard logs in `outputs/.../logs`.
- Consider contrastive tests (gendered occupations, idioms) and targeted error analyses for your domain.
## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** Single consumer‑grade GPU (RTX 3080 Ti Laptop, 16 GB)
- **Hours used:** ~0.68 hours (≈ 2,449 seconds) for the reported training run
- **Cloud Provider:** N/A (local laptop)
- **Compute Region:** N/A
- **Carbon Emitted:** Not estimated; depends on local energy mix
## Technical Specifications
### Model Architecture and Objective
- Transformer **encoder–decoder** (MarianMT): 6‑layer encoder and 6‑layer decoder, static sinusoidal positional embeddings; optimized for translation as conditional generation.
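The layer counts can be checked programmatically from the base checkpoint's configuration (a quick sanity check, assuming Hub access):

```python
from transformers import AutoConfig

# Inspect the base checkpoint's architecture hyperparameters.
cfg = AutoConfig.from_pretrained("Helsinki-NLP/opus-mt-en-es")
print(cfg.encoder_layers, cfg.decoder_layers)  # expected: 6 6 for this checkpoint
```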
### Compute Infrastructure
#### Hardware
- Laptop (Windows, WDDM driver), NVIDIA GeForce RTX 3080 Ti (16 GB).
#### Software
- Python 3.13+, `transformers` 4.42+, `datasets` 3.0+, `evaluate` 0.4.2+, PyTorch 2.8.0 (CUDA 12.9), TensorBoard logging.
## Citation
If you use this model or code, please consider citing the OPUS‑MT work and Marian:
**BibTeX (OPUS‑MT):**
```
@inproceedings{tiedemann-thottingal-2020-opus,
title = "{OPUS}-{MT} -- Building open translation services for the World",
author = "Tiedemann, J{"o}rg and Thottingal, Santhosh",
booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
year = "2020"
}
```
**BibTeX (Democratizing NMT with OPUS‑MT):**
```
@article{tiedemann2023democratizing,
title={Democratizing neural machine translation with {OPUS-MT}},
author={Tiedemann, J{\"o}rg and Aulamo, Mikko and others},
journal={Language Resources and Evaluation},
year={2023}
}
```
## Glossary
- **BLEU:** Precision‑based n‑gram overlap metric; reported via sacreBLEU for comparability.
- **chrF:** Character n‑gram F‑score; more sensitive to morphological correctness.
## More Information
- See the repository README for project structure, defaults, and customization tips.
- The Hub model repo currently exists; ensure weights and a model card are pushed before using it directly.
## Model Card Authors
- Amir Hossein Yousefi (project author)
- This model card was drafted for users of the repository.
## Model Card Contact
- Open an issue in the repository or contact the Hugging Face user `Amirhossein75`.