nanochat German: Tokenizer

This repository hosts a tokenizer trained on the German nanochat dataset.

Following the original nanochat tokenizer training process, we trained the tokenizer on 2B characters:

python -m scripts.tok_train --max_chars=2000000000

Stats

  • max_chars: 2,000,000,000
  • doc_cap: 10,000
  • vocab_size: 65,536
  • train_time: 117.8557
  • num_special_tokens: 9
  • token_bytes_min: 1
  • token_bytes_max: 66
  • token_bytes_mean: 7.5642
  • token_bytes_std: 3.6434
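
The token_bytes_* values summarize the distribution of token lengths, measured in UTF-8 bytes, over the 65,536-entry vocabulary. Below is a minimal sketch of how such statistics can be derived, assuming the vocabulary is already available as a list of raw token byte strings (the `vocab` variable is a placeholder, not the actual loading code):

```python
import statistics

def token_byte_stats(vocab: list[bytes]) -> dict[str, float]:
    """Compute min/max/mean/std of token lengths in bytes."""
    lengths = [len(token) for token in vocab]
    return {
        "token_bytes_min": min(lengths),
        "token_bytes_max": max(lengths),
        "token_bytes_mean": statistics.mean(lengths),
        "token_bytes_std": statistics.stdev(lengths),
    }

# vocab = [...]  # placeholder: the tokenizer's tokens as byte strings
# print(token_byte_stats(vocab))
```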

Evaluation

The evaluation was run with:

python -m scripts.tok_eval
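
The ratio columns in the tables below are bytes per token (text bytes divided by token count; higher means better compression), and the relative difference appears to be the token-count saving relative to the baseline tokenizer. A minimal sketch of these two formulas, using the `news` row of the GPT-2 comparison as a worked example:

```python
def compression_ratio(num_bytes: int, num_tokens: int) -> float:
    # Bytes of text represented per token; higher is better.
    return num_bytes / num_tokens

def relative_diff_pct(baseline_tokens: int, ours_tokens: int) -> float:
    # Positive: our tokenizer needs fewer tokens than the baseline.
    return (baseline_tokens - ours_tokens) / baseline_tokens * 100

# "news" row of the GPT-2 comparison:
print(compression_ratio(1883, 731))   # ~2.58 (GPT-2)
print(compression_ratio(1883, 385))   # ~4.89 (ours)
print(relative_diff_pct(731, 385))    # ~+47.3%
```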

Comparison with GPT-2

| Text Type | Bytes | GPT-2 Tokens | GPT-2 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|-------------|-------------|------------|-----------------|
| news | 1883 | 731 | 2.58 | 385 | 4.89 | +47.3% |
| korean | 893 | 745 | 1.20 | 802 | 1.11 | -7.7% |
| code | 1259 | 576 | 2.19 | 662 | 1.90 | -14.9% |
| math | 9172 | 4627 | 1.98 | 4062 | 2.26 | +12.2% |
| science | 1698 | 643 | 2.64 | 334 | 5.08 | +48.1% |
| fwe-train | 4555564 | 1694319 | 2.69 | 926779 | 4.92 | +45.3% |
| fwe-val | 4063797 | 1520703 | 2.67 | 841356 | 4.83 | +44.7% |

Comparison with GPT-4

| Text Type | Bytes | GPT-4 Tokens | GPT-4 Ratio | Ours Tokens | Ours Ratio | Relative Diff % |
|-----------|-------|--------------|-------------|-------------|------------|-----------------|
| news | 1883 | 541 | 3.48 | 385 | 4.89 | +28.8% |
| korean | 893 | 364 | 2.45 | 802 | 1.11 | -120.3% |
| code | 1259 | 309 | 4.07 | 662 | 1.90 | -114.2% |
| math | 9172 | 3573 | 2.57 | 4062 | 2.26 | -13.7% |
| science | 1698 | 467 | 3.64 | 334 | 5.08 | +28.5% |
| fwe-train | 4555564 | 1296818 | 3.51 | 926779 | 4.92 | +28.5% |
| fwe-val | 4063797 | 1166775 | 3.48 | 841356 | 4.83 | +27.9% |

Notice: The original tokenizer evaluation scripts contain English examples. We did not simply "translate" these examples to German. Instead, we used proper German examples from newspaper articles, lecture notes, and theses. The modified evaluation script can be found here.
