# nanochat German: Tokenizer
This repository hosts a tokenizer trained on the German nanochat dataset.
Following the original nanochat tokenizer training process, we trained the tokenizer on 2B characters:

```bash
python -m scripts.tok_train --max_chars=2000000000
```
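For illustration only, the following is a minimal sketch of a comparable byte-level BPE training setup using the Hugging Face `tokenizers` library instead of the nanochat training code. The corpus path, the iterator, the pre-tokenizer choice, and the special-token placeholder are assumptions and do not reproduce `scripts.tok_train` exactly.

```python
# Minimal sketch of a comparable byte-level BPE training run.
# NOTE: this uses the Hugging Face `tokenizers` library as a stand-in for
# nanochat's own training code; corpus path and special tokens are assumptions.
from tokenizers import Tokenizer, models, trainers, pre_tokenizers

tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)

trainer = trainers.BpeTrainer(
    vocab_size=65_536,              # matches the vocab_size listed under Stats below
    special_tokens=["<|bos|>"],     # placeholder; nanochat defines 9 special tokens
)

def corpus_iterator(path="german_corpus.txt", max_chars=2_000_000_000):
    """Yield documents until roughly max_chars characters have been seen."""
    seen = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            doc = line[:10_000]     # mirror the 10,000-character doc_cap
            seen += len(doc)
            yield doc
            if seen >= max_chars:
                break

tokenizer.train_from_iterator(corpus_iterator(), trainer=trainer)
tokenizer.save("tokenizer.json")
```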
## Stats
- max_chars: 2,000,000,000
- doc_cap: 10,000
- vocab_size: 65,536
- train_time: 117.8557
- num_special_tokens: 9
- token_bytes_min: 1
- token_bytes_max: 66
- token_bytes_mean: 7.5642
- token_bytes_std: 3.6434
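As a rough illustration of how the `token_bytes_*` statistics can be derived, the sketch below computes min/max/mean/std over the UTF-8 byte length of every token string in a vocabulary. The toy `vocab` mapping is an assumption for demonstration; this is not the exact code that produced the numbers above.

```python
# Sketch: derive token byte-length statistics from a vocabulary.
# ASSUMPTION: `vocab` maps token strings to ids, as in a typical BPE vocab;
# illustrative only, not the exact nanochat reporting code.
from statistics import mean, pstdev

vocab = {"der": 0, "Straße": 1, " und": 2}   # toy example vocabulary

lengths = [len(token.encode("utf-8")) for token in vocab]

print("token_bytes_min: ", min(lengths))
print("token_bytes_max: ", max(lengths))
print("token_bytes_mean:", round(mean(lengths), 4))
print("token_bytes_std: ", round(pstdev(lengths), 4))
```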
## Evaluation
Using:

```bash
python -m scripts.tok_eval
```
### Comparison with GPT-2
| Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|---|---|---|---|---|---|---|
| news | 1883 | 731 | 2.58 | 385 | 4.89 | +47.3% |
| korean | 893 | 745 | 1.20 | 802 | 1.11 | -7.7% |
| code | 1259 | 576 | 2.19 | 662 | 1.90 | -14.9% |
| math | 9172 | 4627 | 1.98 | 4062 | 2.26 | +12.2% |
| science | 1698 | 643 | 2.64 | 334 | 5.08 | +48.1% |
| fwe-train | 4555564 | 1694319 | 2.69 | 926779 | 4.92 | +45.3% |
| fwe-val | 4063797 | 1520703 | 2.67 | 841356 | 4.83 | +44.7% |
### Comparison with GPT-4
| Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|---|---|---|---|---|---|---|
| news | 1883 | 541 | 3.48 | 385 | 4.89 | +28.8% |
| korean | 893 | 364 | 2.45 | 802 | 1.11 | -120.3% |
| code | 1259 | 309 | 4.07 | 662 | 1.90 | -114.2% |
| math | 9172 | 3573 | 2.57 | 4062 | 2.26 | -13.7% |
| science | 1698 | 467 | 3.64 | 334 | 5.08 | +28.5% |
| fwe-train | 4555564 | 1296818 | 3.51 | 926779 | 4.92 | +28.5% |
| fwe-val | 4063797 | 1166775 | 3.48 | 841356 | 4.83 | +27.9% |
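The Ratio columns are bytes of raw text per token (higher means better compression). The Relative Diff % values are consistent with (ours_ratio − baseline_ratio) / ours_ratio × 100; the sketch below reproduces the "news" row from the GPT-2 table under that assumption.

```python
# Sketch: how the table columns relate, using the GPT-2 "news" row.
# ASSUMPTION: Relative Diff % = (ours_ratio - baseline_ratio) / ours_ratio * 100,
# which is consistent with the values reported in the tables above.
def compression_ratio(num_bytes: int, num_tokens: int) -> float:
    """Bytes of raw text per token; higher means better compression."""
    return num_bytes / num_tokens

bytes_news = 1883
gpt2_ratio = compression_ratio(bytes_news, 731)   # ≈ 2.58
ours_ratio = compression_ratio(bytes_news, 385)   # ≈ 4.89

relative_diff = (ours_ratio - gpt2_ratio) / ours_ratio * 100
print(f"GPT-2: {gpt2_ratio:.2f}  ours: {ours_ratio:.2f}  diff: {relative_diff:+.1f}%")
# -> GPT-2: 2.58  ours: 4.89  diff: +47.3%
```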
Notice: The original tokenizer evaluation scripts contain English examples. We did not simply "translate" these examples into German; instead, we used proper German examples from newspaper articles, lecture notes, and theses. The modified evaluation script can be found here.