Custom BPE Tokenizer (Trained on WikiText-103 Raw v1)

Model Overview

This repository contains a custom Byte-Pair Encoding (BPE) tokenizer trained from scratch on the WikiText-103 Raw v1 dataset.
The tokenizer is designed for use in natural language processing tasks such as language modeling, text classification, and information retrieval.

Key Features:

  • Custom <cls> and <sep> special tokens.
  • BPE subword segmentation for efficient handling of frequent patterns.
  • Template-based post-processing for both single and paired sequences.
  • BPE decoder configured for faithful reconstruction of the (lowercased) input text.

Training Details

Dataset

  • Name: WikiText-103 Raw v1
  • Source: High-quality, long-form Wikipedia articles.
  • Split Used: train
  • Size: ~103 million tokens
  • Loading Method: Streaming mode for efficient large-scale training without local storage bottlenecks.
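
The streaming setup described above can be reproduced with the Hugging Face `datasets` library. The following is a minimal sketch; the Hub identifiers (`"wikitext"`, `"wikitext-103-raw-v1"`) are the standard names for this corpus and are assumed here rather than taken from the actual training script.

```python
from datasets import load_dataset

# Stream WikiText-103 Raw v1 so the full corpus never needs to be
# materialized on local disk before training starts.
dataset = load_dataset(
    "wikitext",
    "wikitext-103-raw-v1",
    split="train",
    streaming=True,
)

# Each record exposes a "text" field holding one line of the corpus.
for example in dataset.take(3):
    print(example["text"])
```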

Tokenizer Configuration

  • Model Type: Byte-Pair Encoding (BPE)
  • Vocabulary Size: 60,000 (medium-scale for general-purpose LLMs)
  • Lowercasing: Enabled
  • Special Tokens:
    • <cls> β€” Classification token
    • <sep> β€” Separator token
    • <unk> β€” Unknown token
    • <pad> β€” Padding token
    • <mask> β€” Masking token (MLM tasks)
  • Post-Processing Template:
    • Single Sequence: $A:0 <sep>:0 <cls>:2
    • Paired Sequences: $A:0 <sep>:0 $B:1 <sep>:1 <cls>:2
  • Decoder: BPE decoder for reconstructing original text.
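
This configuration maps onto the Hugging Face `tokenizers` API roughly as sketched below. The `Whitespace` pre-tokenizer is an assumption (the card does not name one), and the post-processing template is attached after training, once the special tokens have vocabulary ids (see the training sketch in the next section).

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.decoders import BPEDecoder

# BPE model with <unk> as the unknown token, lowercasing normalization,
# and BPE decoding as described above.
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Whitespace()  # assumption: not specified by the card
tokenizer.decoder = BPEDecoder()
```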

Training Method

  • Corpus Source: Streaming iterator from WikiText-103 Raw v1 (train split)
  • Batch Size: 1000 lines per batch
  • Trainer: BpeTrainer from Hugging Face tokenizers library
  • Special Tokens Added: <cls>, <sep>, <unk>, <pad>, <mask>
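
Under these settings, the training loop can be sketched as follows. `dataset` and `tokenizer` refer to the two snippets above; the post-processor is attached after training so that `token_to_id` can resolve the special tokens.

```python
from tokenizers.trainers import BpeTrainer
from tokenizers.processors import TemplateProcessing

def batch_iterator(dataset, batch_size=1000):
    """Yield the streaming corpus in batches of 1000 raw text lines."""
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

trainer = BpeTrainer(
    vocab_size=60_000,
    special_tokens=["<cls>", "<sep>", "<unk>", "<pad>", "<mask>"],
)

tokenizer.train_from_iterator(batch_iterator(dataset), trainer=trainer)

# Attach the single/pair templates now that the special tokens have ids.
tokenizer.post_processor = TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[
        ("<sep>", tokenizer.token_to_id("<sep>")),
        ("<cls>", tokenizer.token_to_id("<cls>")),
    ],
)

tokenizer.save("tokenizer.json")
```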

Intended Uses & Limitations

Intended Uses

  • Pre-tokenization for training Transformer-based LLMs.
  • Downstream NLP tasks:
    • Language modeling
    • Text classification
    • Question answering
    • Summarization
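
A minimal usage sketch for these tasks is shown below. It assumes the published repository contains a `tokenizer.json` loadable via `Tokenizer.from_pretrained`; the example sentences are illustrative only.

```python
from tokenizers import Tokenizer

# Load the trained tokenizer directly from the Hub.
tok = Tokenizer.from_pretrained("yakul259/english-bpe-tokenizer-60k")

# Single sequence: the template appends <sep> and <cls>.
enc = tok.encode("Byte-pair encoding splits rare words into subwords.")
print(enc.tokens)

# Paired sequences, e.g. question + context for QA.
pair = tok.encode("What is WikiText-103?", "A collection of Wikipedia articles.")
print(pair.tokens)
print(pair.type_ids)

# The BPE decoder reconstructs readable (lowercased) text.
print(tok.decode(enc.ids))
```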

Limitations

  • Trained exclusively on English Wikipedia text β€” performance may degrade in informal, domain-specific, or multilingual contexts.
  • May inherit biases present in Wikipedia data.

License

This tokenizer is released under the MIT License.


Citation

If you use this tokenizer, please cite:

@misc{yakul259_english_bpe_tokenizer_60k,
  title     = {Custom BPE Tokenizer Trained on WikiText-103 Raw v1},
  author    = {yakul259},
  year      = {2025},
  publisher = {Hugging Face}
}
