---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
library_name: transformers
pipeline_tag: translation
license: apache-2.0
tags:
  - machine-translation
  - translation
  - seq2seq
  - marian
  - transformers
  - pytorch
  - sacrebleu
  - chrf
  - datasets
  - evaluate
  - tensorboard
  - fp16
  - opus-books
base_model: Helsinki-NLP/opus-mt-en-es
datasets:
  - Helsinki-NLP/opus_books
language:
  - en
  - es
widget:
  - text: "All around, the lonely sea extended to the limits of the horizon."
  - text: "\"With all due respect to master, they don't strike me as very wicked!\""
---
# Model Card for Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT
A lean, modern baseline for neural machine translation (NMT) based on a transformer encoder–decoder (MarianMT) fine-tuned for **English → Spanish** on the **OPUS Books** dataset. It uses Hugging Face `transformers`, `datasets`, and `evaluate`, logs to TensorBoard, and reports sacreBLEU and chrF. Results and training details are reported below.
## Model Details
### Model Description
This repository implements a small but complete seq2seq translation pipeline with sensible defaults: it loads the OPUS Books dataset, ensures train/validation/test splits, tokenizes source and target correctly using `text_target=`, fine-tunes a MarianMT checkpoint, and evaluates with BLEU/chrF. The implementation favors clarity and hackability and is intended as a reproducible baseline you can swap to different language pairs/datasets or models (e.g., T5, mBART).
- **Developed by:** Amir Hossein Yousefi (GitHub: `amirhossein-yousefi`)
- **Shared by:** Hugging Face user `Amirhossein75`
- **Model type:** Transformer encoder–decoder (MarianMT) for machine translation
- **Language(s) (NLP):** Source: English (`en`) → Target: Spanish (`es`) by default (configurable)
- **License:** Apache-2.0 (declared in this card's metadata); not explicitly specified in the GitHub repository. The base checkpoint `Helsinki-NLP/opus-mt-en-es` is released under **CC-BY-4.0**, and the OPUS Books dataset card lists license **“other”**; verify compatibility for your use case.
- **Finetuned from model:** `Helsinki-NLP/opus-mt-en-es` (MarianMT)
### Model Sources
- **Repository:** https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation
- **Model on Hugging Face:** https://huggingface.co/Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT
- **Base model:** https://huggingface.co/Helsinki-NLP/opus-mt-en-es
- **Dataset:** https://huggingface.co/datasets/Helsinki-NLP/opus_books
- **MarianMT docs:** https://huggingface.co/docs/transformers/en/model_doc/marian
- **Related reading:** Tiedemann & Thottingal (2020), “OPUS-MT — Building open translation services for the World”; Tiedemann et al. (2023), “Democratizing neural machine translation with OPUS‑MT”.
## Uses
### Direct Use
- Research and education: a clear, reproducible baseline for fine-tuning transformer-based MT on a small public dataset.
- Prototyping translation systems for English→Spanish (or other pairs after configuration changes).
### Downstream Use
- Fine-tune on domain-specific parallel corpora for production MT.
- Replace the base model with T5/mBART/other OPUS-MT variants by changing `TrainConfig.model_name`.
### Out-of-Scope Use
- Safety‑critical or high‑stakes scenarios without human review.
- Zero-shot translation to/from languages not covered by the checkpoint or dataset.
- Use cases assuming perfect adequacy/faithfulness or robustness on noisy, modern, or informal text without additional fine‑tuning.
## Bias, Risks, and Limitations
- **Domain & recency mismatch:** OPUS Books contains copyright‑free books and is **dated**; performance may degrade on contemporary, conversational, or domain‑specific text.
- **Language & register:** Trained for EN→ES; style may skew literary/formal. For slang, dialectal variants, code‑switching, or technical jargon, expect errors.
- **General MT caveats:** Typical MT biases (gendered forms, named entity transliteration, idioms) can surface; outputs may be fluent but inaccurate.
### Recommendations
- Evaluate on **your** domain with sacreBLEU/chrF and targeted tests (named entities, numbers, formatting); see the sketch after this list.
- Add domain or synthetic data and continue fine‑tuning; include human‑in‑the‑loop QA for critical use.
- If deploying, log sources and predictions; implement quality thresholds and fallback to human translation as needed.
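A minimal sketch of scoring a few in-domain sentence pairs with `evaluate`, as suggested in the first recommendation; the example sentences are placeholders and the baseline checkpoint is used only for illustration (swap in your own fine-tuned model and data):

```python
from transformers import pipeline
import evaluate

# Illustrative in-domain pairs; substitute your own source sentences and references.
sources = ["The invoice is due on March 3rd.", "Please restart the server."]
references = [["La factura vence el 3 de marzo."], ["Por favor, reinicia el servidor."]]

# Swap in your fine-tuned checkpoint (Hub ID or local path) once available.
translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")
predictions = [out["translation_text"] for out in translator(sources)]

sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")
print("sacreBLEU:", sacrebleu.compute(predictions=predictions, references=references)["score"])
print("chrF:", chrf.compute(predictions=predictions, references=references)["score"])
```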
## How to Get Started with the Model
**Option A — Quick inference (baseline checkpoint):**
```python
from transformers import pipeline
translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")
print(translator("The sea extended to the horizon.")[0]["translation_text"])
```
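Once the fine-tuned weights have been pushed to the Hub repository listed above, the same call should work with that model ID (this assumes the repo already contains the weights and tokenizer):

```python
from transformers import pipeline

# Assumes the fine-tuned weights/tokenizer have been pushed to this Hub repo.
translator = pipeline("translation", model="Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT")
print(translator("The sea extended to the horizon.")[0]["translation_text"])
```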
**Option B — Train/evaluate with this repo (default EN→ES on OPUS Books):**
```bash
git clone https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation.git
cd Sequence2Sequence-Transformer-Translation
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python -m src.train # or: python src/train.py
```
Artifacts (model, tokenizer) are saved under the configured `outputs` directory; you can then push them to the Hub.
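A minimal sketch of pushing the saved artifacts to the Hub; the local path is a placeholder, not the repository's exact output directory:

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Placeholder path: point this at the directory your training run actually wrote.
output_dir = "outputs/marian-en-es"

model = AutoModelForSeq2SeqLM.from_pretrained(output_dir)
tokenizer = AutoTokenizer.from_pretrained(output_dir)

# Requires an authenticated Hub session (e.g. `huggingface-cli login`).
model.push_to_hub("Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT")
tokenizer.push_to_hub("Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT")
```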
## Training Details
### Training Data
- **Dataset:** OPUS Books (`Helsinki-NLP/opus_books`) English–Spanish split. The dataset compiles aligned, copyright‑free books; many texts are older, and some alignments are manually reviewed. See the dataset card for caveats.
- **Preprocessing:** Tokenization uses Hugging Face tokenizers with `text_target=` for the target (labels), avoiding leakage and ensuring correct special‑token handling.
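A minimal sketch of the `text_target=` pattern described above; the max length and the batched mapping are illustrative assumptions, not the repository's exact code:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
raw = load_dataset("Helsinki-NLP/opus_books", "en-es")

def preprocess(batch):
    sources = [pair["en"] for pair in batch["translation"]]
    targets = [pair["es"] for pair in batch["translation"]]
    # `text_target=` tokenizes the labels with the target-side settings.
    return tokenizer(sources, text_target=targets, max_length=128, truncation=True)

tokenized = raw.map(preprocess, batched=True, remove_columns=raw["train"].column_names)
```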
### Training Procedure
Implemented with Hugging Face **Trainer** and `TrainingArguments`. Mixed precision (`fp16`) is enabled automatically when CUDA is available. Logging is written to TensorBoard under `outputs/.../logs`.
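A rough sketch of how this setup is typically wired, continuing from the preprocessing sketch above; the argument values are illustrative, and the repository may use `Trainer`/`TrainingArguments` rather than the Seq2Seq variants shown here:

```python
import torch
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

checkpoint = "Helsinki-NLP/opus-mt-en-es"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

args = Seq2SeqTrainingArguments(
    output_dir="outputs/marian-en-es",        # placeholder path
    per_device_train_batch_size=16,           # illustrative value
    learning_rate=2e-5,                       # illustrative value
    num_train_epochs=3,                       # illustrative value
    predict_with_generate=True,               # decode during evaluation for BLEU/chrF
    fp16=torch.cuda.is_available(),           # mixed precision only when CUDA is present
    logging_dir="outputs/marian-en-es/logs",  # TensorBoard logs
    report_to=["tensorboard"],
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],         # tokenized dataset from the sketch above
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
    tokenizer=tokenizer,
)
trainer.train()
```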
#### Preprocessing
- Lower‑casing/normalization is left to the tokenizer (no additional bespoke normalization).
- Max sequence lengths (source/target) and batch size are configurable in `TrainConfig`.
#### Training Hyperparameters
- **Training regime:** Automatic mixed precision (**fp16**) when CUDA is available; standard fp32 otherwise.
- Other hyperparameters (batch size, epochs, learning rate, max lengths) are defined in `src/config.py` and can be overridden in your script.
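The real field names live in `src/config.py`; the dataclass below is only a hedged illustration of the kind of configuration object described, with assumed names and defaults:

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Field names and defaults are illustrative; see src/config.py for the actual ones.
    model_name: str = "Helsinki-NLP/opus-mt-en-es"  # swap for T5/mBART/another OPUS-MT pair
    dataset_name: str = "Helsinki-NLP/opus_books"
    language_pair: str = "en-es"
    max_source_length: int = 128
    max_target_length: int = 128
    per_device_batch_size: int = 16
    num_train_epochs: int = 3
    learning_rate: float = 2e-5
```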
#### Speeds, Sizes, Times
- **Hardware:** NVIDIA GeForce RTX 3080 Ti **Laptop** GPU (16 GB VRAM) on Windows (WDDM); CUDA driver 12.9; PyTorch 2.8.0+cu129.
- **Total FLOPs (training):** 4,945,267,757,416,448
- **Training runtime:** 2,449.291 seconds (≈ 40:45 wall‑clock)
- **Throughput:** train ≈ 12.90 steps/s · val ≈ 1.85 steps/s · test ≈ 1.84 steps/s
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
- OPUS Books **test** split for EN→ES.
#### Factors
- Reported metrics are aggregate; you may wish to break down by category (named entities, numbers, sentence length) for your domain.
#### Metrics
- **sacreBLEU** (higher is better)
- **chrF** (higher is better)
- **Average generated length** (tokens)
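A rough sketch of how these three quantities are commonly computed in a `compute_metrics` hook passed to a generation-enabled trainer; the helper below is an illustration, not the repository's exact function:

```python
import numpy as np
import evaluate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    if isinstance(preds, tuple):
        preds = preds[0]
    # Replace the -100 padding used for ignored label positions before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    refs = [[label] for label in decoded_labels]
    return {
        "bleu": sacrebleu.compute(predictions=decoded_preds, references=refs)["score"],
        "chrf": chrf.compute(predictions=decoded_preds, references=refs)["score"],
        "gen_len": float(np.mean(
            [np.count_nonzero(p != tokenizer.pad_token_id) for p in preds]
        )),
    }
```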
### Results
- **BLEU (val/test):** 23.41 / 23.41
- **chrF (val/test):** 48.20 / 48.21
- **Loss (train/val/test):** 1.854 / 1.883 / 1.859
- **Avg generation length (val/test):** 30.27 / 29.88 tokens
- **Wall‑clock:** train 40:45 · val 5:16 · test 5:18
#### Summary
The model produces fluent Spanish with moderate adequacy on OPUS Books; BLEU ≈ 23.4 and chrF ≈ 48.2 are consistent across validation and test.
## Model Examination
- Qualitative samples (EN→ES) and loss curves are included under `assets/` and TensorBoard logs in `outputs/.../logs`.
- Consider contrastive tests (gendered occupations, idioms) and targeted error analyses for your domain.
## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** Single consumer‑grade GPU (RTX 3080 Ti Laptop, 16 GB)
- **Hours used:** ~0.68 hours (≈ 2,449 seconds) for the reported training run
- **Cloud Provider:** N/A (local laptop)
- **Compute Region:** N/A
- **Carbon Emitted:** Not estimated; depends on local energy mix
## Technical Specifications
### Model Architecture and Objective
- Transformer **encoder–decoder** (MarianMT): 6‑layer encoder and 6‑layer decoder, static sinusoidal positional embeddings; optimized for translation as conditional generation.
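The layer counts can be checked programmatically from the base checkpoint's configuration (a quick sanity check, assuming Hub access):

```python
from transformers import AutoConfig

# Inspect the base checkpoint's architecture hyperparameters.
cfg = AutoConfig.from_pretrained("Helsinki-NLP/opus-mt-en-es")
print(cfg.encoder_layers, cfg.decoder_layers)  # expected: 6 6 for this checkpoint
```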
### Compute Infrastructure
#### Hardware
- Laptop (Windows, WDDM driver), NVIDIA GeForce RTX 3080 Ti (16 GB).
#### Software
- Python 3.13+, `transformers` 4.42+, `datasets` 3.0+, `evaluate` 0.4.2+, PyTorch 2.8.0 (CUDA 12.9), TensorBoard logging.
## Citation
If you use this model or code, please consider citing the OPUS‑MT work and Marian:
**BibTeX (OPUS‑MT):**
```
@inproceedings{tiedemann-thottingal-2020-opus,
title = "{OPUS}-{MT} -- Building open translation services for the World",
author = "Tiedemann, J{"o}rg and Thottingal, Santhosh",
booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
year = "2020"
}
```
**BibTeX (Democratizing NMT with OPUS‑MT):**
```
@article{tiedemann2023democratizing,
title={Democratizing neural machine translation with {OPUS-MT}},
author={Tiedemann, J{\"o}rg and Aulamo, Mikko and others},
journal={Language Resources and Evaluation},
year={2023}
}
```
## Glossary
- **BLEU:** Precision‑based n‑gram overlap metric; reported via sacreBLEU for comparability.
- **chrF:** Character n‑gram F‑score; more sensitive to morphological correctness.
## More Information
- See the repository README for project structure, defaults, and customization tips.
- The Hub model repo currently exists; ensure weights and a model card are pushed before using it directly.
## Model Card Authors
- Amir Hossein Yousefi (project author)
- This model card was drafted for users of the repository.
## Model Card Contact
- Open an issue in the repository or contact the Hugging Face user `Amirhossein75`.