---
# For reference on model card metadata, see the spec: https://github.com/huggingface/hub-docs/blob/main/modelcard.md?plain=1
# Doc / guide: https://huggingface.co/docs/hub/model-cards
library_name: transformers
pipeline_tag: translation
license: apache-2.0
tags:
- machine-translation
- translation
- seq2seq
- marian
- transformers
- pytorch
- sacrebleu
- chrf
- datasets
- evaluate
- tensorboard
- fp16
- opus-books
base_model: Helsinki-NLP/opus-mt-en-es
datasets:
- Helsinki-NLP/opus_books
language:
- en
- es
widget:
- text: "All around, the lonely sea extended to the limits of the horizon."
- text: "\"With all due respect to master, they don't strike me as very wicked!\""
---
# Model Card for Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT
<!-- Provide a quick summary of what the model is/does. -->
A lean, modern baseline for neural machine translation (NMT) based on a transformer encoder–decoder (MarianMT) fine-tuned for **English → Spanish** on the **OPUS Books** dataset. It uses Hugging Face `transformers`, `datasets`, and `evaluate`, logs to TensorBoard, and reports sacreBLEU and chrF. Results and training details below.
## Model Details
### Model Description
This repository implements a small but complete seq2seq translation pipeline with sensible defaults: it loads the OPUS Books dataset, ensures train/validation/test splits, tokenizes source and target correctly using `text_target=`, fine-tunes a MarianMT checkpoint, and evaluates with BLEU/chrF. The implementation favors clarity and hackability and is intended as a reproducible baseline you can swap to different language pairs/datasets or models (e.g., T5, mBART).
- **Developed by:** Amir Hossein Yousefi (GitHub: `amirhossein-yousefi`)
- **Shared by:** Hugging Face user `Amirhossein75`
- **Model type:** Transformer encoder–decoder (MarianMT) for machine translation
- **Language(s) (NLP):** Source: English (`en`) → Target: Spanish (`es`) by default (configurable)
- **License:** The metadata above declares **Apache-2.0** for this model repo; the code repository does not specify a license explicitly. The base checkpoint `Helsinki-NLP/opus-mt-en-es` is released under **CC-BY-4.0**, and the OPUS Books dataset card lists its license as “other”; verify compatibility for your use case.
- **Finetuned from model:** `Helsinki-NLP/opus-mt-en-es` (MarianMT)
### Model Sources
- **Repository:** https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation
- **Model on Hugging Face:** https://huggingface.co/Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT
- **Base model:** https://huggingface.co/Helsinki-NLP/opus-mt-en-es
- **Dataset:** https://huggingface.co/datasets/Helsinki-NLP/opus_books
- **MarianMT docs:** https://huggingface.co/docs/transformers/en/model_doc/marian
- **Related reading:** Tiedemann & Thottingal (2020), “OPUS-MT — Building open translation services for the World”; Tiedemann et al. (2023), “Democratizing neural machine translation with OPUS‑MT”.
## Uses
### Direct Use
- Research and education: a clear, reproducible baseline for fine-tuning transformer-based MT on a small public dataset.
- Prototyping translation systems for English→Spanish (or other pairs after configuration changes).
### Downstream Use
- Fine-tune on domain-specific parallel corpora for production MT.
- Replace the base model with T5/mBART/other OPUS-MT variants by changing `TrainConfig.model_name`.
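The repository selects the checkpoint via `TrainConfig.model_name`; the sketch below shows the equivalent `transformers` loading call (the alternative checkpoint named here is only an illustration, not something the repository ships):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Any seq2seq checkpoint can stand in for the default "Helsinki-NLP/opus-mt-en-es".
# For T5-style models, inputs may also need a task prefix such as
# "translate English to Spanish: ".
model_name = "google/mt5-small"  # illustrative alternative, not the repository default
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
```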
### Out-of-Scope Use
- Safety‑critical or high‑stakes scenarios without human review.
- Zero-shot translation to/from languages not covered by the checkpoint or dataset.
- Use cases assuming perfect adequacy/faithfulness or robustness on noisy, modern, or informal text without additional fine‑tuning.
## Bias, Risks, and Limitations
- **Domain & recency mismatch:** OPUS Books contains copyright‑free books and is **dated**; performance may degrade on contemporary, conversational, or domain‑specific text.
- **Language & register:** Trained for EN→ES; style may skew literary/formal. For slang, dialectal variants, code‑switching, or technical jargon, expect errors.
- **General MT caveats:** Typical MT biases (gendered forms, named entity transliteration, idioms) can surface; outputs may be fluent but inaccurate.
### Recommendations
- Evaluate on **your** domain with sacreBLEU/chrF and targeted tests (named entities, numbers, formatting).
- Add domain or synthetic data and continue fine‑tuning; include human‑in‑the‑loop QA for critical use.
- If deploying, log sources and predictions; implement quality thresholds and fallback to human translation as needed.
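A minimal sketch of such a domain check with `evaluate` (the English/Spanish pairs below are placeholders for your own test set):

```python
from transformers import pipeline
import evaluate

translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")

# Replace with sentences and reference translations from your own domain.
src = ["The invoice is due on Friday.", "Restart the server after the update."]
refs = [["La factura vence el viernes."], ["Reinicie el servidor después de la actualización."]]

preds = [out["translation_text"] for out in translator(src)]
bleu = evaluate.load("sacrebleu").compute(predictions=preds, references=refs)
chrf = evaluate.load("chrf").compute(predictions=preds, references=refs)
print(f"BLEU={bleu['score']:.2f}  chrF={chrf['score']:.2f}")
```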
## How to Get Started with the Model
**Option A — Quick inference (baseline checkpoint):**
```python
from transformers import pipeline
translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")
translator("The sea extended to the horizon.")
```
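Once the fine-tuned weights have been pushed to this Hub repo (see “More Information” below), the same call should work with the repo id; a sketch assuming the upload is complete:

```python
from transformers import pipeline

# Assumes the fine-tuned model and tokenizer are available on the Hub.
translator = pipeline(
    "translation_en_to_es",
    model="Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT",
)
print(translator("All around, the lonely sea extended to the limits of the horizon."))
```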
**Option B — Train/evaluate with this repo (default EN→ES on OPUS Books):**
```bash
git clone https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation.git
cd Sequence2Sequence-Transformer-Translation
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python -m src.train # or: python src/train.py
```
Artifacts (model, tokenizer) are saved under the configured `outputs` directory; you can then push them to the Hub.
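A sketch of that push step (the local path is a placeholder; use whatever output directory your run actually produced):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

out_dir = "outputs/opus-mt-en-es-finetuned"  # placeholder for your configured outputs path
repo_id = "Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT"

model = AutoModelForSeq2SeqLM.from_pretrained(out_dir)
tokenizer = AutoTokenizer.from_pretrained(out_dir)
model.push_to_hub(repo_id)      # requires `huggingface-cli login` beforehand
tokenizer.push_to_hub(repo_id)
```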
## Training Details
### Training Data
- **Dataset:** OPUS Books (`Helsinki-NLP/opus_books`) English–Spanish split. The dataset compiles aligned, copyright‑free books; many texts are older, and some alignments are manually reviewed. See the dataset card for caveats.
- **Preprocessing:** Tokenization uses Hugging Face tokenizers with `text_target=` for the target (labels), avoiding leakage and ensuring correct special‑token handling.
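A minimal sketch of that pattern (column names follow the `opus_books` schema; the length cap is illustrative rather than the repository's setting):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

def preprocess(batch, max_len=128):
    # Each OPUS Books example has a "translation" dict keyed by language code.
    sources = [ex["en"] for ex in batch["translation"]]
    targets = [ex["es"] for ex in batch["translation"]]
    # text_target= tokenizes the labels with target-side settings instead of
    # treating them as extra encoder input.
    return tokenizer(sources, text_target=targets, max_length=max_len, truncation=True)

# dataset.map(preprocess, batched=True) then yields input_ids, attention_mask, and labels.
```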
### Training Procedure
Implemented with Hugging Face **Trainer** and `TrainingArguments`. Mixed precision (`fp16`) is enabled automatically when CUDA is available. Training logs are written to TensorBoard under `outputs/.../logs`.
#### Preprocessing
- Lower‑casing/normalization is left to the tokenizer (no additional bespoke normalization).
- Max sequence lengths (source/target) and batch size are configurable in `TrainConfig`.
#### Training Hyperparameters
- **Training regime:** Automatic mixed precision (**fp16**) when CUDA is available; standard fp32 otherwise.
- Other hyperparameters (batch size, epochs, learning rate, max lengths) are defined in `src/config.py` and can be overridden in your script.
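A sketch of how such a configuration typically maps onto `Seq2SeqTrainingArguments` (every value below is a placeholder, not the repository's default):

```python
import torch
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="outputs/opus-mt-en-es-finetuned",  # placeholder path
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=2e-5,
    predict_with_generate=True,      # generate translations during eval so BLEU/chrF can be computed
    fp16=torch.cuda.is_available(),  # mixed precision only when a CUDA GPU is present
    report_to=["tensorboard"],
)
```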
#### Speeds, Sizes, Times
- **Hardware:** NVIDIA GeForce RTX 3080 Ti **Laptop** GPU (16 GB VRAM) on Windows (WDDM); CUDA driver 12.9; PyTorch 2.8.0+cu129.
- **Total FLOPs (training):** 4,945,267,757,416,448
- **Training runtime:** 2,449.291 seconds (≈ 41 minutes wall‑clock)
- **Throughput:** train ≈ 12.90 steps/s · val ≈ 1.85 steps/s · test ≈ 1.84 steps/s
## Evaluation
### Testing Data, Factors & Metrics
#### Testing Data
- OPUS Books **test** split for EN→ES.
#### Factors
- Reported metrics are aggregate; you may wish to break down by category (named entities, numbers, sentence length) for your domain.
#### Metrics
- **sacreBLEU** (higher is better)
- **chrF** (higher is better)
- **Average generated length** (tokens)
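A sketch of a `Trainer`-style `compute_metrics` that yields all three numbers (details are illustrative; the repository's implementation may differ):

```python
import numpy as np
import evaluate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # -100 marks padded label positions; restore the pad id before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    refs = [[label] for label in decoded_labels]  # sacreBLEU/chrF expect lists of references
    gen_len = float(np.mean([np.count_nonzero(p != tokenizer.pad_token_id) for p in preds]))
    return {
        "bleu": sacrebleu.compute(predictions=decoded_preds, references=refs)["score"],
        "chrf": chrf.compute(predictions=decoded_preds, references=refs)["score"],
        "gen_len": gen_len,
    }
```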
### Results
- **BLEU (val/test):** 23.41 / 23.41
- **chrF (val/test):** 48.20 / 48.21
- **Loss (train/val/test):** 1.854 / 1.883 / 1.859
- **Avg generation length (val/test):** 30.27 / 29.88 tokens
- **Wall‑clock:** train 40:45 · val 5:16 · test 5:18
#### Summary
The model produces fluent Spanish with moderate adequacy on OPUS Books; BLEU ≈ 23.4 and chrF ≈ 48.2 are consistent across validation and test.
## Model Examination
- Qualitative samples (EN→ES) and loss curves are included under `assets/` and TensorBoard logs in `outputs/.../logs`.
- Consider contrastive tests (gendered occupations, idioms) and targeted error analyses for your domain.
## Environmental Impact
Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).
- **Hardware Type:** Single consumer‑grade GPU (RTX 3080 Ti Laptop, 16 GB)
- **Hours used:** ~0.68 hours (≈ 2,449 seconds) for the reported training run
- **Cloud Provider:** N/A (local laptop)
- **Compute Region:** N/A
- **Carbon Emitted:** Not estimated; depends on local energy mix
## Technical Specifications
### Model Architecture and Objective
- Transformer **encoder–decoder** (MarianMT): 6‑layer encoder and 6‑layer decoder, static sinusoidal positional embeddings; optimized for translation as conditional generation.
### Compute Infrastructure
#### Hardware
- Laptop (Windows, WDDM driver), NVIDIA GeForce RTX 3080 Ti (16 GB).
#### Software
- Python 3.13+, `transformers` 4.42+, `datasets` 3.0+, `evaluate` 0.4.2+, PyTorch 2.8.0 (CUDA 12.9), TensorBoard logging.
## Citation
If you use this model or code, please consider citing the OPUS‑MT work and Marian:
**BibTeX (OPUS‑MT):**
```
@inproceedings{tiedemann-thottingal-2020-opus,
  title     = "{OPUS}-{MT} -- Building open translation services for the World",
  author    = "Tiedemann, J{\"o}rg and Thottingal, Santhosh",
  booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
  year      = "2020"
}
```
**BibTeX (Democratizing NMT with OPUS‑MT):**
```
@article{tiedemann2023democratizing,
  title   = {Democratizing neural machine translation with {OPUS-MT}},
  author  = {Tiedemann, J{\"o}rg and Aulamo, Mikko and others},
  journal = {Language Resources and Evaluation},
  year    = {2023}
}
```
## Glossary
- **BLEU:** Precision‑based n‑gram overlap metric; reported via sacreBLEU for comparability.
- **chrF:** Character n‑gram F‑score; more sensitive to morphological correctness.
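For reference, chrF is the character n-gram F_β score (β = 2 by default, weighting recall more heavily than precision); a standard statement of the formula:

```latex
\mathrm{chrF}_{\beta} = (1 + \beta^{2}) \cdot
  \frac{\mathrm{chrP} \cdot \mathrm{chrR}}{\beta^{2} \cdot \mathrm{chrP} + \mathrm{chrR}}
% chrP / chrR: average character n-gram precision / recall (typically up to n = 6)
```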
## More Information
- See the repository README for project structure, defaults, and customization tips.
- The Hub model repo exists, but verify that the fine-tuned weights and model card have been pushed before depending on it directly.
## Model Card Authors
- Amir Hossein Yousefi (project author)
- This model card was drafted for users of the repository.
## Model Card Contact
- Open an issue in the repository or contact the Hugging Face user `Amirhossein75`.