Custom BPE Tokenizer (Trained on WikiText-103 Raw v1)

Model Overview

This repository contains a custom Byte-Pair Encoding (BPE) tokenizer trained from scratch on the WikiText-103 Raw v1 dataset.
The tokenizer is designed for use in natural language processing tasks such as language modeling, text classification, and information retrieval.

Key Features:

  • Custom <cls> and <sep> special tokens.
  • BPE subword segmentation for efficient handling of frequent patterns.
  • Template-based post-processing for both single and paired sequences.
  • BPE decoder configured for faithful reconstruction of the (lowercased) input text.

Training Details

Dataset

  • Name: WikiText-103 Raw v1
  • Source: High-quality, long-form Wikipedia articles.
  • Split Used: train
  • Size: ~103 million tokens
  • Loading Method: Streaming mode for efficient large-scale training without local storage bottlenecks.
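
The streaming setup described above can be reproduced with the Hugging Face `datasets` library. The following is a minimal sketch; the Hub identifiers (`"wikitext"`, `"wikitext-103-raw-v1"`) are the standard names for this corpus and are assumed here rather than taken from the actual training script.

```python
from datasets import load_dataset

# Stream WikiText-103 Raw v1 so the full corpus never needs to be
# materialized on local disk before training starts.
dataset = load_dataset(
    "wikitext",
    "wikitext-103-raw-v1",
    split="train",
    streaming=True,
)

# Each record exposes a "text" field holding one line of the corpus.
for example in dataset.take(3):
    print(example["text"])
```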

Tokenizer Configuration

  • Model Type: Byte-Pair Encoding (BPE)
  • Vocabulary Size: 60,000 (medium-scale for general-purpose LLMs)
  • Lowercasing: Enabled
  • Special Tokens:
    • <cls> β€” Classification token
    • <sep> β€” Separator token
    • <unk> β€” Unknown token
    • <pad> β€” Padding token
    • <mask> β€” Masking token (MLM tasks)
  • Post-Processing Template:
    • Single Sequence: $A:0 <sep>:0 <cls>:2
    • Paired Sequences: $A:0 <sep>:0 $B:1 <sep>:1 <cls>:2
  • Decoder: BPE decoder for reconstructing original text.
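
This configuration maps onto the Hugging Face `tokenizers` API roughly as sketched below. The `Whitespace` pre-tokenizer is an assumption (the card does not name one), and the post-processing template is attached after training, once the special tokens have vocabulary ids (see the training sketch in the next section).

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.decoders import BPEDecoder

# BPE model with <unk> as the unknown token, lowercasing normalization,
# and BPE decoding as described above.
tokenizer = Tokenizer(BPE(unk_token="<unk>"))
tokenizer.normalizer = Lowercase()
tokenizer.pre_tokenizer = Whitespace()  # assumption: not specified by the card
tokenizer.decoder = BPEDecoder()
```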

Training Method

  • Corpus Source: Streaming iterator from WikiText-103 Raw v1 (train split)
  • Batch Size: 1000 lines per batch
  • Trainer: BpeTrainer from Hugging Face tokenizers library
  • Special Tokens Added: <cls>, <sep>, <unk>, <pad>, <mask>
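
Under these settings, the training loop can be sketched as follows. `dataset` and `tokenizer` refer to the two snippets above; the post-processor is attached after training so that `token_to_id` can resolve the special tokens.

```python
from tokenizers.trainers import BpeTrainer
from tokenizers.processors import TemplateProcessing

def batch_iterator(dataset, batch_size=1000):
    """Yield the streaming corpus in batches of 1000 raw text lines."""
    batch = []
    for example in dataset:
        batch.append(example["text"])
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

trainer = BpeTrainer(
    vocab_size=60_000,
    special_tokens=["<cls>", "<sep>", "<unk>", "<pad>", "<mask>"],
)

tokenizer.train_from_iterator(batch_iterator(dataset), trainer=trainer)

# Attach the single/pair templates now that the special tokens have ids.
tokenizer.post_processor = TemplateProcessing(
    single="$A:0 <sep>:0 <cls>:2",
    pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
    special_tokens=[
        ("<sep>", tokenizer.token_to_id("<sep>")),
        ("<cls>", tokenizer.token_to_id("<cls>")),
    ],
)

tokenizer.save("tokenizer.json")
```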

Intended Uses & Limitations

Intended Uses

  • Pre-tokenization for training Transformer-based LLMs.
  • Downstream NLP tasks:
    • Language modeling
    • Text classification
    • Question answering
    • Summarization
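
A minimal usage sketch for these tasks is shown below. It assumes the published repository contains a `tokenizer.json` loadable via `Tokenizer.from_pretrained`; the example sentences are illustrative only.

```python
from tokenizers import Tokenizer

# Load the trained tokenizer directly from the Hub.
tok = Tokenizer.from_pretrained("yakul259/english-bpe-tokenizer-60k")

# Single sequence: the template appends <sep> and <cls>.
enc = tok.encode("Byte-pair encoding splits rare words into subwords.")
print(enc.tokens)

# Paired sequences, e.g. question + context for QA.
pair = tok.encode("What is WikiText-103?", "A collection of Wikipedia articles.")
print(pair.tokens)
print(pair.type_ids)

# The BPE decoder reconstructs readable (lowercased) text.
print(tok.decode(enc.ids))
```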

Limitations

  • Trained exclusively on English Wikipedia text β€” performance may degrade in informal, domain-specific, or multilingual contexts.
  • May inherit biases present in Wikipedia data.

License

This tokenizer is released under the MIT License.


Citation

If you use this tokenizer, please cite:

@misc{yakul259_english_bpe_tokenizer_60k,
  title     = {Custom BPE Tokenizer Trained on WikiText-103 Raw v1},
  author    = {yakul259},
  year      = {2025},
  publisher = {Hugging Face}
}
