---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
library_name: transformers
pipeline_tag: translation
license: apache-2.0
tags:
- machine-translation
- translation
- seq2seq
- marian
- transformers
- pytorch
- sacrebleu
- chrf
- datasets
- evaluate
- tensorboard
- fp16
- opus-books
base_model: Helsinki-NLP/opus-mt-en-es
datasets:
- Helsinki-NLP/opus_books
language:
- en
- es
widget:
- text: "All around, the lonely sea extended to the limits of the horizon."
- text: "\"With all due respect to master, they don't strike me as very wicked!\""

---

# Model Card for Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT


A lean, modern baseline for neural machine translation (NMT) based on a transformer encoder–decoder (MarianMT) fine-tuned for **English → Spanish** on the **OPUS Books** dataset. It uses Hugging Face `transformers`, `datasets`, and `evaluate`, logs to TensorBoard, and reports sacreBLEU and chrF. Results and training details are reported below.

## Model Details

### Model Description

This repository implements a small but complete seq2seq translation pipeline with sensible defaults: it loads the OPUS Books dataset, ensures train/validation/test splits, tokenizes source and target correctly using `text_target=`, fine-tunes a MarianMT checkpoint, and evaluates with BLEU/chrF. The implementation favors clarity and hackability and is intended as a reproducible baseline in which the language pair, dataset, or model (e.g., T5, mBART) can be swapped out.
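
As a rough illustration of the first two steps (not the repository's exact code; the `en-es` configuration name and split ratios below are assumptions), loading OPUS Books and deriving validation/test splits might look like this:

```python
# Illustrative sketch only: OPUS Books ships a single "train" split, so
# validation and test sets are carved out of it here. The config name and
# split ratios are assumptions, not the repository's actual values.
from datasets import load_dataset

raw = load_dataset("Helsinki-NLP/opus_books", "en-es")

split = raw["train"].train_test_split(test_size=0.2, seed=42)
held_out = split["test"].train_test_split(test_size=0.5, seed=42)

dataset = {
    "train": split["train"],
    "validation": held_out["train"],
    "test": held_out["test"],
}
print({name: len(ds) for name, ds in dataset.items()})
```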

- **Developed by:** Amir Hossein Yousefi (GitHub: `amirhossein-yousefi`)
- **Shared by:** Hugging Face user `Amirhossein75`
- **Model type:** Transformer encoder–decoder (MarianMT) for machine translation
- **Language(s) (NLP):** Source: English (`en`) → Target: Spanish (`es`) by default (configurable)
- **License:** *Not explicitly specified in the repository.* The base checkpoint `Helsinki-NLP/opus-mt-en-es` is released under **CC-BY-4.0**, and the OPUS Books dataset card lists license **“other”**; verify compatibility for your use case.
- **Finetuned from model:** `Helsinki-NLP/opus-mt-en-es` (MarianMT)

### Model Sources 

- **Repository:** https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation
- **Model on Hugging Face:** https://huggingface.co/Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT
- **Base model:** https://huggingface.co/Helsinki-NLP/opus-mt-en-es
- **Dataset:** https://huggingface.co/datasets/Helsinki-NLP/opus_books
- **MarianMT docs:** https://huggingface.co/docs/transformers/en/model_doc/marian
- **Related reading:** Tiedemann & Thottingal (2020), “OPUS-MT — Building open translation services for the World”; Tiedemann et al. (2023), “Democratizing neural machine translation with OPUS‑MT”.

## Uses

### Direct Use

- Research and education: a clear, reproducible baseline for fine-tuning transformer-based MT on a small public dataset.
- Prototyping translation systems for English→Spanish (or other pairs after configuration changes).

### Downstream Use 

- Fine-tune on domain-specific parallel corpora for production MT.
- Replace the base model with T5/mBART/other OPUS-MT variants by changing `TrainConfig.model_name`.

### Out-of-Scope Use

- Safety‑critical or high‑stakes scenarios without human review.
- Zero-shot translation to/from languages not covered by the checkpoint or dataset.
- Use cases assuming perfect adequacy/faithfulness or robustness on noisy, modern, or informal text without additional fine‑tuning.

## Bias, Risks, and Limitations

- **Domain & recency mismatch:** OPUS Books contains copyright‑free books and is **dated**; performance may degrade on contemporary, conversational, or domain‑specific text.
- **Language & register:** Trained for EN→ES; style may skew literary/formal. For slang, dialectal variants, code‑switching, or technical jargon, expect errors.
- **General MT caveats:** Typical MT biases (gendered forms, named entity transliteration, idioms) can surface; outputs may be fluent but inaccurate.

### Recommendations

- Evaluate on **your** domain with sacreBLEU/chrF and targeted tests (named entities, numbers, formatting).
- Add domain or synthetic data and continue fine‑tuning; include human‑in‑the‑loop QA for critical use.
- If deploying, log sources and predictions; implement quality thresholds and fallback to human translation as needed.
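
For the first recommendation above, a minimal sketch of scoring your own parallel data with the `evaluate` library (the example sentences are placeholders):

```python
# Minimal sketch: compute sacreBLEU and chrF on your own predictions and
# references. The sentences below are placeholders.
import evaluate

sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")

predictions = ["El mar se extendía hasta el horizonte."]
references = [["El mar se extendía hasta los límites del horizonte."]]  # one or more references per prediction

print("sacreBLEU:", sacrebleu.compute(predictions=predictions, references=references)["score"])
print("chrF:", chrf.compute(predictions=predictions, references=references)["score"])
```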

## How to Get Started with the Model

**Option A — Quick inference (baseline checkpoint):**

```python
from transformers import pipeline
translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")
translator("The sea extended to the horizon.")
```

**Option B — Train/evaluate with this repo (default EN→ES on OPUS Books):**

```bash
git clone https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation.git
cd Sequence2Sequence-Transformer-Translation
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python -m src.train  # or: python src/train.py
```

Artifacts (model, tokenizer) are saved under the configured `outputs` directory; you can then push them to the Hub.
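
For example, a hedged sketch of pushing the saved artifacts to the Hub (the output path and repo id below are placeholders; run `huggingface-cli login` first):

```python
# Load the fine-tuned artifacts from a (hypothetical) local output directory
# and push them to the Hub. Requires a prior `huggingface-cli login`.
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

output_dir = "outputs/marian-en-es"   # placeholder: use your configured path
repo_id = "your-username/your-model"  # placeholder repo id

model = AutoModelForSeq2SeqLM.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)

model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```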

## Training Details

### Training Data

- **Dataset:** OPUS Books (`Helsinki-NLP/opus_books`) English–Spanish split. The dataset compiles aligned, copyright‑free books; many texts are older, and some alignments are manually reviewed. See the dataset card for caveats.
- **Preprocessing:** Tokenization uses Hugging Face tokenizers with `text_target=` for the target (labels), avoiding leakage and ensuring correct special‑token handling.
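
A minimal sketch of what the `text_target=` handling above can look like (the column names follow the `opus_books` schema; the max length and function name are assumptions, not the repository's exact code):

```python
# Sketch of the preprocessing described above. The "translation" column with
# "en"/"es" keys follows the opus_books schema; max_length is an assumption.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

def preprocess(batch, max_length=128):
    sources = [ex["en"] for ex in batch["translation"]]
    targets = [ex["es"] for ex in batch["translation"]]
    # text_target= tokenizes the labels with target-side settings, so the
    # returned dict already contains "labels" alongside "input_ids".
    return tokenizer(sources, text_target=targets, max_length=max_length, truncation=True)

# Typical usage: tokenized = dataset["train"].map(preprocess, batched=True)
```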

### Training Procedure

Implemented with Hugging Face **Trainer** and `TrainingArguments`. Mixed precision (`fp16`) is enabled automatically when CUDA is available. Logging is written to TensorBoard under `outputs/.../logs`.
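
A hedged sketch of that setup (shown with the `Seq2Seq*` variants so that `predict_with_generate` can decode during evaluation; the argument values are illustrative, not the repository's defaults):

```python
# Illustrative training arguments; values are not the repository's defaults.
import torch
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="outputs/marian-en-es",   # hypothetical output path
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
    eval_strategy="epoch",
    predict_with_generate=True,          # decode during eval so BLEU/chrF can be computed
    fp16=torch.cuda.is_available(),      # mixed precision only when CUDA is present
    report_to=["tensorboard"],           # writes TensorBoard event files
)
# A Seq2SeqTrainer is then built with the model, tokenized splits, a
# DataCollatorForSeq2Seq, and these arguments, and trainer.train() is called.
```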

#### Preprocessing 

- Lower‑casing/normalization is left to the tokenizer (no additional bespoke normalization).
- Max sequence lengths (source/target) and batch size are configurable in `TrainConfig`.

#### Training Hyperparameters

- **Training regime:** Automatic mixed precision (**fp16**) when CUDA is available; standard fp32 otherwise.
- Other hyperparameters (batch size, epochs, learning rate, max lengths) are defined in `src/config.py` and can be overridden in your script.
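
As an illustration only, a hypothetical `TrainConfig` along these lines (field names and defaults are guesses; see `src/config.py` for the real definition):

```python
# Hypothetical sketch of a TrainConfig; the actual fields live in src/config.py
# and may be named or defaulted differently.
from dataclasses import dataclass

@dataclass
class TrainConfig:
    model_name: str = "Helsinki-NLP/opus-mt-en-es"   # swap for T5/mBART/other OPUS-MT variants
    dataset_name: str = "Helsinki-NLP/opus_books"
    language_pair: str = "en-es"
    max_source_length: int = 128
    max_target_length: int = 128
    per_device_batch_size: int = 16
    num_train_epochs: int = 3
    learning_rate: float = 2e-5
    output_dir: str = "outputs"

# Example override for a different base model:
cfg = TrainConfig(model_name="facebook/mbart-large-50-many-to-many-mmt")
```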

#### Speeds, Sizes, Times 

- **Hardware:** NVIDIA GeForce RTX 3080 Ti **Laptop** GPU (16 GB VRAM) on Windows (WDDM); CUDA driver 12.9; PyTorch 2.8.0+cu129.
- **Total FLOPs (training):** 4,945,267,757,416,448
- **Training runtime:** 2,449.291 seconds (≈ 40:45 wall‑clock)
- **Throughput:** train ≈ 12.90 steps/s · val ≈ 1.85 steps/s · test ≈ 1.84 steps/s

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- OPUS Books **test** split for EN→ES.

#### Factors

- Reported metrics are aggregate; you may wish to break down by category (named entities, numbers, sentence length) for your domain.

#### Metrics

- **sacreBLEU** (higher is better)
- **chrF** (higher is better)
- **Average generated length** (tokens)

### Results

- **BLEU (val/test):** 23.41 / 23.41
- **chrF (val/test):** 48.20 / 48.21
- **Loss (train/val/test):** 1.854 / 1.883 / 1.859
- **Avg generation length (val/test):** 30.27 / 29.88 tokens
- **Wall‑clock:** train 40:45 · val 5:16 · test 5:18

#### Summary

The model produces fluent Spanish with moderate adequacy on OPUS Books; BLEU ≈ 23.4 and chrF ≈ 48.2 are consistent across validation and test.

## Model Examination 

- Qualitative samples (EN→ES) and loss curves are included under `assets/` and TensorBoard logs in `outputs/.../logs`.
- Consider contrastive tests (gendered occupations, idioms) and targeted error analyses for your domain.

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** Single consumer‑grade GPU (RTX 3080 Ti Laptop, 16 GB)
- **Hours used:** ~0.68 hours (≈ 2,449 seconds) for the reported training run
- **Cloud Provider:** N/A (local laptop)
- **Compute Region:** N/A
- **Carbon Emitted:** Not estimated; depends on local energy mix

## Technical Specifications 

### Model Architecture and Objective

- Transformer **encoder–decoder** (MarianMT): 6‑layer encoder and 6‑layer decoder, static sinusoidal positional embeddings; optimized for translation as conditional generation.
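
These layer counts can be checked against the base checkpoint's configuration, for example:

```python
# Inspect the base checkpoint's configuration (downloads only config.json).
from transformers import AutoConfig

cfg = AutoConfig.from_pretrained("Helsinki-NLP/opus-mt-en-es")
print(cfg.model_type)       # "marian"
print(cfg.encoder_layers)   # encoder depth
print(cfg.decoder_layers)   # decoder depth
print(cfg.d_model)          # hidden size
```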

### Compute Infrastructure

#### Hardware

- Laptop (Windows, WDDM driver), NVIDIA GeForce RTX 3080 Ti (16 GB).

#### Software

- Python 3.13+, `transformers` 4.42+, `datasets` 3.0+, `evaluate` 0.4.2+, PyTorch 2.8.0 (CUDA 12.9), TensorBoard logging.

## Citation 

If you use this model or code, please consider citing the OPUS‑MT work and Marian:

**BibTeX (OPUS‑MT):**
```bibtex
@inproceedings{tiedemann-thottingal-2020-opus,
  title = "{OPUS}-{MT} -- Building open translation services for the World",
  author = "Tiedemann, J{\"o}rg and Thottingal, Santhosh",
  booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
  year = "2020"
}
```

**BibTeX (Democratizing NMT with OPUS‑MT):**
```bibtex
@article{tiedemann2023democratizing,
  title={Democratizing neural machine translation with {OPUS-MT}},
  author={Tiedemann, J{\"o}rg and Aulamo, Mikko and others},
  journal={Language Resources and Evaluation},
  year={2023}
}
```

## Glossary 

- **BLEU:** Precision‑based n‑gram overlap metric; reported via sacreBLEU for comparability.
- **chrF:** Character n‑gram F‑score; more sensitive to morphological correctness.
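
For reference, the usual definitions behind these scores (with sacreBLEU's defaults: N = 4 uniform n-gram weights for BLEU and β = 2 for chrF):

```latex
% BLEU: brevity-penalised geometric mean of modified n-gram precisions p_n,
% with candidate length c and reference length r.
\mathrm{BLEU} = \mathrm{BP}\cdot\exp\!\left(\sum_{n=1}^{N} w_n \log p_n\right),
\qquad
\mathrm{BP} = \begin{cases} 1 & \text{if } c > r \\ e^{1 - r/c} & \text{if } c \le r \end{cases}

% chrF: F-score over character n-gram precision chrP and recall chrR,
% with recall weighted by beta.
\mathrm{chrF}_{\beta} = (1 + \beta^{2})\,\frac{\mathrm{chrP}\cdot\mathrm{chrR}}{\beta^{2}\,\mathrm{chrP} + \mathrm{chrR}}
```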

## More Information 

- See the repository README for project structure, defaults, and customization tips.
- The Hub model repository exists; before loading it directly, verify that the fine-tuned weights and an up-to-date model card have been pushed.

## Model Card Authors 

- Amir Hossein Yousefi (project author)
- This model card was drafted for users of the repository.

## Model Card Contact

- Open an issue in the repository or contact the Hugging Face user `Amirhossein75`.