---
language:
- zh
- en
tags:
- translation
license: cc-by-4.0
datasets:
- quickmt/quickmt-train.zh-en
model-index:
- name: quickmt-zh-en
results:
- task:
name: Translation zho-eng
type: translation
args: zho-eng
dataset:
name: flores101-devtest
type: flores_101
args: zho_Hans eng_Latn devtest
metrics:
- name: BLEU
type: bleu
value: 28.58
- name: CHRF
type: chrf
value: 57.46
---
# `quickmt-zh-en` Neural Machine Translation Model
# Usage
## Install `quickmt`
```bash
git clone https://github.com/quickmt/quickmt.git
pip install ./quickmt/
```
## Download model
```bash
quickmt-model-download quickmt/quickmt-zh-en ./quickmt-zh-en
```
## Use model
Inference with `quickmt`:
```python
from quickmt import Translator
# Auto-detects GPU, set to "cpu" to force CPU inference
t = Translator("./quickmt-zh-en/", device="auto")
# Translate - set beam size to 5 for higher quality (but slower speed)
t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], beam_size=1)
# Get alternative translations by sampling
# You can pass any cTranslate2 `translate_batch` arguments
t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
```
The model is in `ctranslate2` format and the tokenizers are `sentencepiece` models, so you can use the model files directly if you want; a minimal example is sketched below. It would be fairly easy to get them working with e.g. [LibreTranslate](https://libretranslate.com/), which also uses `ctranslate2` and `sentencepiece`.
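For example, here is a rough sketch of calling CTranslate2 and SentencePiece directly instead of going through `quickmt`. The tokenizer file names (`src.spm.model`, `tgt.spm.model`) are assumptions based on the training configuration below; check the downloaded directory for the actual names.
```python
import ctranslate2
import sentencepiece as spm

# Load the CTranslate2 model and the source/target SentencePiece tokenizers.
# File names are assumptions; inspect ./quickmt-zh-en/ for the actual layout.
translator = ctranslate2.Translator("./quickmt-zh-en/", device="cpu")
src_sp = spm.SentencePieceProcessor(model_file="./quickmt-zh-en/src.spm.model")
tgt_sp = spm.SentencePieceProcessor(model_file="./quickmt-zh-en/tgt.spm.model")

def translate(texts, beam_size=5):
    # Tokenize into subword pieces, translate, then detokenize the best hypothesis.
    tokens = [src_sp.encode(t, out_type=str) for t in texts]
    results = translator.translate_batch(tokens, beam_size=beam_size)
    return [tgt_sp.decode(r.hypotheses[0]) for r in results]

print(translate(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"]))
```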
# Model Information
* Trained using [`eole`](https://github.com/eole-nlp/eole)
- It took about 1 day on a single RTX 4090 on [vast.ai](https://cloud.vast.ai)
* Exported to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format for fast inference
* Training data: https://huggingface.co/datasets/quickmt/quickmt-train.zh-en/tree/main
## Metrics
BLEU and chrF2 are calculated with [sacrebleu](https://github.com/mjpost/sacrebleu) on the Flores200 `devtest` test set ("zho_Hans"->"eng_Latn").
"Time" is the time to translate the following input with a single CPU core:
> 2019冠状病毒病(英語:Coronavirus disease 2019,缩写:COVID-19[17][18]),是一種由嚴重急性呼吸系統綜合症冠狀病毒2型(縮寫:SARS-CoV-2)引發的傳染病,导致了一场持续的疫情,成为人類歷史上致死人數最多的流行病之一。
| Model | bleu | chrf2 | Time (s) |
| -------------------------------- | ----- | ----- | ---- |
| quickmt/quickmt-zh-en | 28.58 | 57.46 | 0.670 |
| Helsinki-NLP/opus-mt-zh-en | 23.35 | 53.60 | 0.838 |
| facebook/m2m100_418M | 18.96 | 50.06 | 11.5 |
| facebook/nllb-200-distilled-600M | 26.22 | 55.17 | 13.2 |
| facebook/nllb-200-distilled-1.3B | 28.54 | 57.34 | 23.6 |
| facebook/m2m100_1.2B | 24.68 | 54.68 | 25.7 |
| google/madlad400-3b-mt | 28.74 | 58.01 | ??? |
`quickmt-zh-en` is the fastest model in this comparison and delivers relatively high quality.
Helsinki-NLP/opus-mt-zh-en is one of the most downloaded machine translation models on Hugging Face; `quickmt-zh-en` is considerably more accurate *and* a bit faster.
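As a rough sketch of how scores like the ones above can be reproduced with `sacrebleu` (the file paths are placeholders; `hyps.en` would hold the model's translations of the Flores `devtest` source side, one sentence per line):
```python
import sacrebleu

# Placeholder paths: hypotheses and references, one sentence per line, aligned.
with open("hyps.en", encoding="utf-8") as f:
    hyps = [line.strip() for line in f]
with open("flores-devtest.eng_Latn", encoding="utf-8") as f:
    refs = [line.strip() for line in f]

# corpus_bleu / corpus_chrf expect a list of reference streams (here just one).
print(sacrebleu.corpus_bleu(hyps, [refs]).score)  # BLEU
print(sacrebleu.corpus_chrf(hyps, [refs]).score)  # chrF2 (beta=2 is the default)
```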
## Training Configuration
```yaml
### Vocab
src_vocab_size: 20000
tgt_vocab_size: 20000
share_vocab: False
data:
corpus_1:
path_src: hf://quickmt/quickmt-train-zh-en/zh
path_tgt: hf://quickmt/quickmt-train-zh-en/en
path_sco: hf://quickmt/quickmt-train-zh-en/sco
valid:
path_src: zh-en/dev.zho
path_tgt: zh-en/dev.eng
transforms: [sentencepiece, filtertoolong]
transforms_configs:
sentencepiece:
src_subword_model: "zh-en/src.spm.model"
tgt_subword_model: "zh-en/tgt.spm.model"
filtertoolong:
src_seq_length: 512
tgt_seq_length: 512
training:
# Run configuration
model_path: quickmt-zh-en
keep_checkpoint: 4
save_checkpoint_steps: 1000
train_steps: 104000
valid_steps: 1000
# Train on a single GPU
world_size: 1
gpu_ranks: [0]
# Batching
batch_type: "tokens"
batch_size: 13312
valid_batch_size: 13312
batch_size_multiple: 8
accum_count: [4]
accum_steps: [0]
# Optimizer & Compute
compute_dtype: "bfloat16"
optim: "pagedadamw8bit"
learning_rate: 1.0
warmup_steps: 10000
decay_method: "noam"
adam_beta2: 0.998
# Data loading
bucket_size: 262144
num_workers: 4
prefetch_factor: 100
# Hyperparams
dropout_steps: [0]
dropout: [0.1]
attention_dropout: [0.1]
max_grad_norm: 0
label_smoothing: 0.1
average_decay: 0.0001
param_init_method: xavier_uniform
normalization: "tokens"
model:
architecture: "transformer"
layer_norm: standard
share_embeddings: false
share_decoder_embeddings: true
add_ffnbias: true
mlp_activation_fn: gated-silu
add_estimator: false
add_qkvbias: false
norm_eps: 1e-6
hidden_size: 1024
encoder:
layers: 8
decoder:
layers: 2
heads: 16
transformer_ff: 4096
embeddings:
word_vec_size: 1024
position_encoding_type: "SinusoidalInterleaved"
```