---
library_name: transformers
pipeline_tag: translation
license: apache-2.0
tags:
  - machine-translation
  - translation
  - seq2seq
  - marian
  - transformers
  - pytorch
  - sacrebleu
  - chrf
  - datasets
  - evaluate
  - tensorboard
  - fp16
  - opus-books
base_model: Helsinki-NLP/opus-mt-en-es
datasets:
  - Helsinki-NLP/opus_books
language:
  - en
  - es
widget:
  - text: "All around, the lonely sea extended to the limits of the horizon."
  - text: "\"With all due respect to master, they don't strike me as very wicked!\""
---

# Model Card for Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT

<!-- Provide a quick summary of what the model is/does. -->

A lean, modern baseline for neural machine translation (NMT) based on a transformer encoder–decoder (MarianMT) fine-tuned for **English → Spanish** on the **OPUS Books** dataset. It uses Hugging Face `transformers`, `datasets`, and `evaluate`, logs to TensorBoard, and reports sacreBLEU and chrF. Results and training details below.

## Model Details

### Model Description

This repository implements a small but complete seq2seq translation pipeline with sensible defaults: it loads the OPUS Books dataset, ensures train/validation/test splits, tokenizes source and target correctly using `text_target=`, fine-tunes a MarianMT checkpoint, and evaluates with BLEU/chrF. The implementation favors clarity and hackability and is intended as a reproducible baseline you can swap to different language pairs, datasets, or models (e.g., T5, mBART).

- **Developed by:** Amir Hossein Yousefi (GitHub: `amirhossein-yousefi`)
- **Shared by:** Hugging Face user `Amirhossein75`
- **Model type:** Transformer encoder–decoder (MarianMT) for machine translation
- **Language(s) (NLP):** Source: English (`en`) → Target: Spanish (`es`) by default (configurable)
- **License:** *Not explicitly specified in the repository.* The base checkpoint `Helsinki-NLP/opus-mt-en-es` is released under **CC-BY-4.0**, and the OPUS Books dataset card lists license **“other”**; verify compatibility for your use case.
- **Finetuned from model:** `Helsinki-NLP/opus-mt-en-es` (MarianMT)

### Model Sources

- **Repository:** https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation
- **Model on Hugging Face:** https://huggingface.co/Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT
- **Base model:** https://huggingface.co/Helsinki-NLP/opus-mt-en-es
- **Dataset:** https://huggingface.co/datasets/Helsinki-NLP/opus_books
- **MarianMT docs:** https://huggingface.co/docs/transformers/en/model_doc/marian
- **Related reading:** Tiedemann & Thottingal (2020), “OPUS-MT — Building open translation services for the World”; Tiedemann et al. (2023), “Democratizing neural machine translation with OPUS‑MT”.

## Uses

### Direct Use

- Research and education: a clear, reproducible baseline for fine-tuning transformer-based MT on a small public dataset.
- Prototyping translation systems for English→Spanish (or other pairs after configuration changes).

### Downstream Use

- Fine-tune on domain-specific parallel corpora for production MT.
- Replace the base model with T5/mBART/other OPUS-MT variants by changing `TrainConfig.model_name` (see the sketch after this list).

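A minimal sketch of that kind of override, assuming a dataclass-style config; the field names below are illustrative and may not match `src/config.py` exactly:

```python
from dataclasses import dataclass

@dataclass
class TrainConfig:
    # Illustrative fields only; consult src/config.py for the real names and defaults.
    model_name: str = "Helsinki-NLP/opus-mt-en-es"
    dataset_name: str = "Helsinki-NLP/opus_books"
    source_lang: str = "en"
    target_lang: str = "es"
    max_source_length: int = 128
    max_target_length: int = 128
    per_device_train_batch_size: int = 16

# Swap in a different seq2seq checkpoint, e.g. an mBART variant.
cfg = TrainConfig(model_name="facebook/mbart-large-50-many-to-many-mmt")
```
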
### Out-of-Scope Use

- Safety‑critical or high‑stakes scenarios without human review.
- Zero-shot translation to/from languages not covered by the checkpoint or dataset.
- Use cases that assume perfect adequacy/faithfulness, or robustness on noisy, modern, or informal text, without additional fine‑tuning.

## Bias, Risks, and Limitations

- **Domain & recency mismatch:** OPUS Books contains copyright‑free books and is **dated**; performance may degrade on contemporary, conversational, or domain‑specific text.
- **Language & register:** Trained for EN→ES; style may skew literary/formal. Expect errors on slang, dialectal variants, code‑switching, and technical jargon.
- **General MT caveats:** Typical MT biases (gendered forms, named‑entity transliteration, idioms) can surface; outputs may be fluent but inaccurate.

### Recommendations

- Evaluate on **your** domain with sacreBLEU/chrF and targeted tests (named entities, numbers, formatting); a quick spot-check sketch follows this list.
- Add domain or synthetic data and continue fine‑tuning; include human‑in‑the‑loop QA for critical use.
- If deploying, log sources and predictions; implement quality thresholds and fall back to human translation as needed.

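As a quick domain spot-check, something along these lines works with the `evaluate` library (the sentence pair below is a placeholder; substitute text from your own domain):

```python
import evaluate

sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")

# Placeholder data: model outputs and reference translations from your own domain.
predictions = ["El mar se extendía hasta el horizonte."]
references = [["El mar se extendía hasta los límites del horizonte."]]  # one list of references per prediction

print(sacrebleu.compute(predictions=predictions, references=references)["score"])
print(chrf.compute(predictions=predictions, references=references)["score"])
```
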
## How to Get Started with the Model

**Option A — Quick inference (baseline checkpoint):**

```python
from transformers import pipeline

translator = pipeline("translation_en_to_es", model="Helsinki-NLP/opus-mt-en-es")
translator("The sea extended to the horizon.")
```

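To try the fine-tuned checkpoint from this card instead, load it by its Hub id (this assumes the weights have actually been pushed; see "More Information" below):

```python
from transformers import pipeline

# Assumes the fine-tuned weights are available under this Hub id.
translator = pipeline(
    "translation",
    model="Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT",
)
print(translator("All around, the lonely sea extended to the limits of the horizon."))
```
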
**Option B — Train/evaluate with this repo (default EN→ES on OPUS Books):**

```bash
git clone https://github.com/amirhossein-yousefi/Sequence2Sequence-Transformer-Translation.git
cd Sequence2Sequence-Transformer-Translation
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
python -m src.train  # or: python src/train.py
```

Artifacts (model and tokenizer) are saved under the configured `outputs` directory; you can then push them to the Hub, as sketched below.

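A minimal sketch of pushing those artifacts to the Hub (the local path is illustrative; point it at the directory your run actually produced, and authenticate first with `huggingface-cli login` or an `HF_TOKEN`):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint_dir = "outputs/opus-mt-en-es-finetuned"  # illustrative path

model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint_dir)
tokenizer = AutoTokenizer.from_pretrained(checkpoint_dir)

repo_id = "Amirhossein75/Sequence2Sequence-Transformer-Translation-Opus-MT"
model.push_to_hub(repo_id)
tokenizer.push_to_hub(repo_id)
```
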
## Training Details

### Training Data

- **Dataset:** OPUS Books (`Helsinki-NLP/opus_books`), English–Spanish split. The dataset compiles aligned, copyright‑free books; many texts are older, and some alignments are manually reviewed. See the dataset card for caveats.
- **Preprocessing:** Tokenization uses Hugging Face tokenizers with `text_target=` for the target (labels), avoiding leakage and ensuring correct special‑token handling; a sketch follows this list.

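A minimal sketch of that `text_target=` pattern, assuming the OPUS Books `translation` column format and illustrative sequence lengths:

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")

def preprocess(batch, max_length=128):
    # OPUS Books rows look like {"translation": {"en": "...", "es": "..."}}.
    sources = [ex["en"] for ex in batch["translation"]]
    targets = [ex["es"] for ex in batch["translation"]]
    # text_target= tokenizes the references with target-side settings and
    # stores the result under "labels" alongside the source input_ids.
    return tokenizer(sources, text_target=targets, max_length=max_length, truncation=True)

# Usage: tokenized = dataset.map(preprocess, batched=True,
#                                remove_columns=dataset["train"].column_names)
```
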
### Training Procedure

Implemented with Hugging Face **Trainer** and `TrainingArguments`. Mixed precision (`fp16`) is enabled automatically when CUDA is available. Logging is written to TensorBoard under `outputs/.../logs`.

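A hedged sketch of what the corresponding arguments might look like with `Seq2SeqTrainingArguments`; the values are illustrative, and the repository's actual settings live in `src/config.py`:

```python
import torch
from transformers import Seq2SeqTrainingArguments

args = Seq2SeqTrainingArguments(
    output_dir="outputs/opus-mt-en-es-finetuned",   # illustrative path
    per_device_train_batch_size=16,
    learning_rate=2e-5,
    num_train_epochs=3,
    predict_with_generate=True,                     # decode during eval so BLEU/chrF can be computed
    fp16=torch.cuda.is_available(),                 # mixed precision only when a CUDA GPU is present
    logging_dir="outputs/opus-mt-en-es-finetuned/logs",
    report_to=["tensorboard"],
)
```
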
#### Preprocessing

- Lower‑casing/normalization is left to the tokenizer (no additional bespoke normalization).
- Max sequence lengths (source/target) and batch size are configurable in `TrainConfig`.

#### Training Hyperparameters

- **Training regime:** Automatic mixed precision (**fp16**) when CUDA is available; standard fp32 otherwise.
- Other hyperparameters (batch size, epochs, learning rate, max lengths) are defined in `src/config.py` and can be overridden in your script.

#### Speeds, Sizes, Times

- **Hardware:** NVIDIA GeForce RTX 3080 Ti **Laptop** GPU (16 GB VRAM) on Windows (WDDM); CUDA driver 12.9; PyTorch 2.8.0+cu129.
- **Total FLOPs (training):** 4,945,267,757,416,448
- **Training runtime:** 2,449.291 seconds (≈ 40:45 wall‑clock)
- **Throughput:** train ≈ 12.90 steps/s · val ≈ 1.85 steps/s · test ≈ 1.84 steps/s

## Evaluation

### Testing Data, Factors & Metrics

#### Testing Data

- OPUS Books **test** split for EN→ES.

#### Factors

- Reported metrics are aggregate; you may wish to break them down by category (named entities, numbers, sentence length) for your domain.

#### Metrics

- **sacreBLEU** (higher is better)
- **chrF** (higher is better)
- **Average generated length** (tokens); see the sketch below for how these are computed during evaluation.

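A minimal sketch of a `compute_metrics` hook for these metrics (illustrative; the repository's implementation may differ):

```python
import numpy as np
import evaluate
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-en-es")
sacrebleu = evaluate.load("sacrebleu")
chrf = evaluate.load("chrf")

def compute_metrics(eval_preds):
    preds, labels = eval_preds
    # The data collator pads labels with -100; restore the pad token id before decoding.
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_preds = tokenizer.batch_decode(preds, skip_special_tokens=True)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    references = [[ref] for ref in decoded_labels]
    gen_len = np.mean([np.count_nonzero(p != tokenizer.pad_token_id) for p in preds])
    return {
        "bleu": sacrebleu.compute(predictions=decoded_preds, references=references)["score"],
        "chrf": chrf.compute(predictions=decoded_preds, references=references)["score"],
        "gen_len": float(gen_len),
    }
```
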
### Results

- **BLEU (val/test):** 23.41 / 23.41
- **chrF (val/test):** 48.20 / 48.21
- **Loss (train/val/test):** 1.854 / 1.883 / 1.859
- **Avg generation length (val/test):** 30.27 / 29.88 tokens
- **Wall‑clock:** train 40:45 · val 5:16 · test 5:18

#### Summary

The model produces fluent Spanish with moderate adequacy on OPUS Books; BLEU ≈ 23.4 and chrF ≈ 48.2 are consistent across validation and test.

## Model Examination

- Qualitative samples (EN→ES) and loss curves are included under `assets/`; TensorBoard logs are under `outputs/.../logs`.
- Consider contrastive tests (gendered occupations, idioms) and targeted error analyses for your domain.

## Environmental Impact

Carbon emissions can be estimated using the [Machine Learning Impact calculator](https://mlco2.github.io/impact#compute) presented in [Lacoste et al. (2019)](https://arxiv.org/abs/1910.09700).

- **Hardware Type:** Single consumer‑grade GPU (RTX 3080 Ti Laptop, 16 GB)
- **Hours used:** ~0.68 hours (≈ 2,449 seconds) for the reported training run
- **Cloud Provider:** N/A (local laptop)
- **Compute Region:** N/A
- **Carbon Emitted:** Not estimated; depends on the local energy mix

## Technical Specifications

### Model Architecture and Objective

- Transformer **encoder–decoder** (MarianMT): 6‑layer encoder and 6‑layer decoder, static sinusoidal positional embeddings; optimized for translation as conditional generation. (These dimensions can be verified from the checkpoint config, as shown below.)

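A quick way to inspect those dimensions directly from the base checkpoint's config:

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Helsinki-NLP/opus-mt-en-es")
print(config.encoder_layers, config.decoder_layers)    # 6, 6
print(config.d_model, config.encoder_attention_heads)  # hidden size and number of attention heads
```
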
### Compute Infrastructure

#### Hardware

- Laptop (Windows, WDDM driver), NVIDIA GeForce RTX 3080 Ti (16 GB).

#### Software

- Python 3.13+, `transformers` 4.42+, `datasets` 3.0+, `evaluate` 0.4.2+, PyTorch 2.8.0 (CUDA 12.9), TensorBoard logging.

## Citation

If you use this model or code, please consider citing the OPUS‑MT work and Marian:

**BibTeX (OPUS‑MT):**

```bibtex
@inproceedings{tiedemann-thottingal-2020-opus,
  title     = "{OPUS}-{MT} -- Building open translation services for the World",
  author    = "Tiedemann, J{\"o}rg and Thottingal, Santhosh",
  booktitle = "Proceedings of the 22nd Annual Conference of the European Association for Machine Translation",
  year      = "2020"
}
```

**BibTeX (Democratizing NMT with OPUS‑MT):**

```bibtex
@article{tiedemann2023democratizing,
  title   = {Democratizing neural machine translation with {OPUS-MT}},
  author  = {Tiedemann, J{\"o}rg and Aulamo, Mikko and others},
  journal = {Language Resources and Evaluation},
  year    = {2023}
}
```

## Glossary

- **BLEU:** Precision‑based n‑gram overlap metric; reported via sacreBLEU for comparability.
- **chrF:** Character n‑gram F‑score; more sensitive to morphological correctness (see the formula below).

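For reference, chrF combines character n‑gram precision (chrP) and recall (chrR) into an F‑score; with the default β = 2 used by sacreBLEU's chrF implementation, recall is weighted more heavily:

$$
\mathrm{chrF}_{\beta} = (1 + \beta^{2})\,\frac{\mathrm{chrP} \cdot \mathrm{chrR}}{\beta^{2}\,\mathrm{chrP} + \mathrm{chrR}}
$$
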
## More Information

- See the repository README for project structure, defaults, and customization tips.
- The Hub model repo currently exists; ensure weights and a model card are pushed before using it directly.

## Model Card Authors

- Amir Hossein Yousefi (project author)
- (This model card was drafted for users of the repository.)

## Model Card Contact

- Open an issue in the repository or contact the Hugging Face user `Amirhossein75`.