---
language:
- zh
- en
tags:
- translation
license: cc-by-4.0
datasets:
- quickmt/quickmt-train.zh-en
model-index:
- name: quickmt-zh-en
  results:
  - task:
      name: Translation zho-eng
      type: translation
      args: zho-eng
    dataset:
      name: flores101-devtest
      type: flores_101
      args: zho_Hans eng_Latn devtest
    metrics:
    - name: BLEU
      type: bleu
      value: 28.58
    - name: CHRF
      type: chrf
      value: 57.46
---
# quickmt-zh-en Neural Machine Translation Model

## Usage

### Install `quickmt`

```bash
git clone https://github.com/quickmt/quickmt.git
pip install ./quickmt/
```

### Download model

```bash
quickmt-model-download quickmt/quickmt-zh-en ./quickmt-zh-en
```

### Use model

Inference with `quickmt`:

```python
from quickmt import Translator

# Auto-detects GPU, set to "cpu" to force CPU inference
t = Translator("./quickmt-zh-en/", device="auto")

# Translate - set beam size to 5 for higher quality (but slower speed)
t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], beam_size=1)

# Get alternative translations by sampling
# You can pass any CTranslate2 `translate_batch` arguments
t(["他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"], sampling_temperature=1.2, beam_size=1, sampling_topk=50, sampling_topp=0.9)
```
The model is in CTranslate2 format and the tokenizers are SentencePiece models, so you can use the model files directly if you want. It would be fairly easy to get them working with e.g. LibreTranslate, which also uses CTranslate2 and SentencePiece.
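For example, a minimal sketch that bypasses the `quickmt` wrapper and calls the `ctranslate2` and `sentencepiece` packages directly could look like the following. The tokenizer file names inside the downloaded directory are assumptions here; adjust them to match what `quickmt-model-download` actually fetches.

```python
import ctranslate2
import sentencepiece as spm

# Load the CTranslate2 model and the source/target SentencePiece tokenizers.
# File names below are assumptions; check the contents of ./quickmt-zh-en/.
translator = ctranslate2.Translator("./quickmt-zh-en/", device="cpu")
sp_src = spm.SentencePieceProcessor(model_file="./quickmt-zh-en/src.spm.model")
sp_tgt = spm.SentencePieceProcessor(model_file="./quickmt-zh-en/tgt.spm.model")

src = "他补充道:“我们现在有 4 个月大没有糖尿病的老鼠,但它们曾经得过该病。”"

# Tokenize into subword pieces, translate, then detokenize the best hypothesis.
src_tokens = sp_src.encode(src, out_type=str)
results = translator.translate_batch([src_tokens], beam_size=5)
print(sp_tgt.decode(results[0].hypotheses[0]))
```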
## Model Information

- Trained using [eole](https://github.com/eole-nlp/eole)
- Training took about 1 day on a single RTX 4090 on vast.ai
- Exported for fast inference to [CTranslate2](https://github.com/OpenNMT/CTranslate2) format
- Training data: https://huggingface.co/datasets/quickmt/quickmt-train.zh-en/tree/main (one way to fetch it locally is sketched below)
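If you want the raw parallel corpus on disk, one straightforward option (an illustration, not part of the quickmt tooling) is to pull the dataset repository with `huggingface_hub`:

```python
from huggingface_hub import snapshot_download

# Fetch the full quickmt-train.zh-en dataset repository (a large download).
local_dir = snapshot_download(repo_id="quickmt/quickmt-train.zh-en", repo_type="dataset")
print(local_dir)
```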
## Metrics

BLEU and CHRF2 are calculated with sacrebleu on the Flores200 devtest set ("zho_Hans"->"eng_Latn"); a reproduction sketch follows the table below. "Time" is the time to translate the following input with a single CPU core:

> 2019冠状病毒病(英語:Coronavirus disease 2019,缩写:COVID-19[17][18]),是一種由嚴重急性呼吸系統綜合症冠狀病毒2型(縮寫:SARS-CoV-2)引發的傳染病,导致了一场持续的疫情,成为人類歷史上致死人數最多的流行病之一。
| Model | BLEU | chrF2 | Time (s) |
|---|---|---|---|
| quickmt/quickmt-zh-en | 28.58 | 57.46 | 0.670 |
| Helsinki-NLP/opus-mt-zh-en | 23.35 | 53.60 | 0.838 |
| facebook/m2m100_418M | 18.96 | 50.06 | 11.5 |
| facebook/nllb-200-distilled-600M | 26.22 | 55.17 | 13.2 |
| facebook/nllb-200-distilled-1.3B | 28.54 | 57.34 | 23.6 |
| facebook/m2m100_1.2B | 24.68 | 54.68 | 25.7 |
| google/madlad400-3b-mt | 28.74 | 58.01 | ??? |
quickmt-zh-en is the fastest model in this comparison and delivers fairly high quality. Helsinki-NLP/opus-mt-zh-en is one of the most downloaded machine translation models on Hugging Face; quickmt-zh-en is considerably more accurate and a bit faster.
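For reference, scores like those above can be computed with sacrebleu's Python API along the following lines. The FLORES file paths and the hypothesis-generation step are illustrative assumptions, not the exact evaluation script used here.

```python
from sacrebleu.metrics import BLEU, CHRF
from quickmt import Translator

# Hypothetical local copies of the FLORES-200 devtest split (paths are assumptions).
with open("flores200.devtest.zho_Hans", encoding="utf-8") as f:
    sources = [line.strip() for line in f]
with open("flores200.devtest.eng_Latn", encoding="utf-8") as f:
    references = [line.strip() for line in f]

t = Translator("./quickmt-zh-en/", device="auto")
hypotheses = t(sources, beam_size=5)

# sacrebleu's default CHRF (beta=2) corresponds to the chrF2 reported above.
print(BLEU().corpus_score(hypotheses, [references]))
print(CHRF().corpus_score(hypotheses, [references]))
```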
## Training Configuration

```yaml
### Vocab
src_vocab_size: 20000
tgt_vocab_size: 20000
share_vocab: False

data:
    corpus_1:
        path_src: hf://quickmt/quickmt-train-zh-en/zh
        path_tgt: hf://quickmt/quickmt-train-zh-en/en
        path_sco: hf://quickmt/quickmt-train-zh-en/sco
    valid:
        path_src: zh-en/dev.zho
        path_tgt: zh-en/dev.eng

transforms: [sentencepiece, filtertoolong]
transforms_configs:
    sentencepiece:
        src_subword_model: "zh-en/src.spm.model"
        tgt_subword_model: "zh-en/tgt.spm.model"
    filtertoolong:
        src_seq_length: 512
        tgt_seq_length: 512

training:
    # Run configuration
    model_path: quickmt-zh-en
    keep_checkpoint: 4
    save_checkpoint_steps: 1000
    train_steps: 104000
    valid_steps: 1000

    # Train on a single GPU
    world_size: 1
    gpu_ranks: [0]

    # Batching
    batch_type: "tokens"
    batch_size: 13312
    valid_batch_size: 13312
    batch_size_multiple: 8
    accum_count: [4]
    accum_steps: [0]

    # Optimizer & Compute
    compute_dtype: "bfloat16"
    optim: "pagedadamw8bit"
    learning_rate: 1.0
    warmup_steps: 10000
    decay_method: "noam"
    adam_beta2: 0.998

    # Data loading
    bucket_size: 262144
    num_workers: 4
    prefetch_factor: 100

    # Hyperparams
    dropout_steps: [0]
    dropout: [0.1]
    attention_dropout: [0.1]
    max_grad_norm: 0
    label_smoothing: 0.1
    average_decay: 0.0001
    param_init_method: xavier_uniform
    normalization: "tokens"

model:
    architecture: "transformer"
    layer_norm: standard
    share_embeddings: false
    share_decoder_embeddings: true
    add_ffnbias: true
    mlp_activation_fn: gated-silu
    add_estimator: false
    add_qkvbias: false
    norm_eps: 1e-6
    hidden_size: 1024
    encoder:
        layers: 8
    decoder:
        layers: 2
    heads: 16
    transformer_ff: 4096
    embeddings:
        word_vec_size: 1024
        position_encoding_type: "SinusoidalInterleaved"
```
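As a rough sanity check of the architecture above (8 encoder layers, 2 decoder layers, hidden size 1024, gated-silu feed-forward of 4096, separate 20k source/target vocabularies with shared decoder embeddings), a back-of-the-envelope parameter count looks roughly like this. It ignores biases and layer norms and is an estimate, not a figure from the source.

```python
# Rough parameter estimate for the config above (biases and layer norms
# ignored; an approximation, not an official figure).
hidden, ff = 1024, 4096
enc_layers, dec_layers = 8, 2
src_vocab = tgt_vocab = 20000

attn = 4 * hidden * hidden               # Q, K, V and output projections
ffn = 3 * hidden * ff                    # gated-silu MLP has three weight matrices
encoder = enc_layers * (attn + ffn)
decoder = dec_layers * (2 * attn + ffn)  # self-attention plus cross-attention
embeddings = (src_vocab + tgt_vocab) * hidden  # decoder in/out embeddings shared

total = encoder + decoder + embeddings
print(f"~{total / 1e6:.0f}M parameters")  # roughly 217M with these settings
```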