Custom BPE Tokenizer (Trained on WikiText-103 Raw v1)
Model Overview
This repository contains a custom Byte-Pair Encoding (BPE) tokenizer trained from scratch on the WikiText-103 Raw v1 dataset.
The tokenizer is designed for use in natural language processing tasks such as language modeling, text classification, and information retrieval.
Key Features:
- Custom <cls> and <sep> special tokens.
- BPE subword segmentation for efficient handling of frequent patterns.
- Template-based post-processing for both single and paired sequences.
- BPE decoder for reconstructing text from tokens (in lowercase, since the normalizer discards case).
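To make the BPE feature above concrete, here is a minimal pure-Python sketch of a single merge step: count adjacent symbol pairs across a word-frequency table, then fuse the most frequent pair everywhere it occurs. The function names and the toy word counts are hypothetical illustrations; the Hugging Face tokenizers library implements this far more efficiently in Rust.

```python
from collections import Counter
from typing import Dict, Tuple

Word = Tuple[str, ...]


def most_frequent_pair(words: Dict[Word, int]) -> Tuple[str, str]:
    """Return the adjacent symbol pair with the highest total frequency."""
    pairs: Counter = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return max(pairs, key=pairs.get)


def merge_pair(words: Dict[Word, int], pair: Tuple[str, str]) -> Dict[Word, int]:
    """Replace every occurrence of `pair` with one fused symbol."""
    merged: Dict[Word, int] = {}
    for symbols, freq in words.items():
        out = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


# Toy corpus (hypothetical counts): each word is split into characters.
toy_words = {("l", "o", "w"): 5, ("l", "o", "t"): 2, ("n", "e", "w"): 3}
best = most_frequent_pair(toy_words)
toy_words_after = merge_pair(toy_words, best)
```

Training repeats this step until the vocabulary reaches the target size, so frequent character sequences end up as single tokens.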
Training Details
Dataset
- Name: WikiText-103 Raw v1
- Source: High-quality, long-form Wikipedia articles.
- Split Used: train
- Size: ~103 million tokens
- Loading Method: Streaming mode for efficient large-scale training without local storage bottlenecks.
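A minimal sketch of streaming-mode loading, under stated assumptions: the Hub identifiers `"wikitext"` and `"wikitext-103-raw-v1"` and the `text` column are the usual ones for this dataset but are not given in this card, and skipping blank rows is an illustrative choice rather than the author's documented preprocessing. Both function names are hypothetical.

```python
from typing import Dict, Iterable, Iterator


def iter_text_lines(rows: Iterable[Dict[str, str]]) -> Iterator[str]:
    """Lazily yield stripped, non-empty text lines from a stream of rows."""
    for row in rows:
        text = row["text"].strip()
        if text:  # assumption: blank separator rows are skipped
            yield text


def open_wikitext_stream():
    """Open the train split without downloading it to disk first.

    Needs the `datasets` package and network access, so the import is
    kept local to this function.
    """
    from datasets import load_dataset

    return load_dataset(
        "wikitext", "wikitext-103-raw-v1", split="train", streaming=True
    )
```

With streaming enabled, `iter_text_lines(open_wikitext_stream())` fetches rows on demand, so the corpus never has to fit on local storage.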
Tokenizer Configuration
- Model Type: Byte-Pair Encoding (BPE)
- Vocabulary Size: 60,000 (medium-scale for general-purpose LLMs)
- Lowercasing: Enabled
- Special Tokens:
  - <cls>: Classification token
  - <sep>: Separator token
  - <unk>: Unknown token
  - <pad>: Padding token
  - <mask>: Masking token (MLM tasks)
- Post-Processing Template:
  - Single Sequence: $A:0 <sep>:0 <cls>:2
  - Paired Sequences: $A:0 <sep>:0 $B:1 <sep>:1 <cls>:2
- Decoder: BPE decoder for reconstructing original text.
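The configuration above could be assembled with the Hugging Face tokenizers library roughly as follows. This is a sketch, not the author's script: the toy vocabulary and merges are hypothetical stand-ins for the trained 60,000-entry model, and the `Whitespace` pre-tokenizer is an assumption because the card does not name one. The templates and special tokens are taken from the list above.

```python
from tokenizers import Tokenizer
from tokenizers.decoders import BPEDecoder
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

SPECIAL_TOKENS = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>"]


def build_toy_tokenizer() -> Tokenizer:
    # Hypothetical toy vocabulary: special tokens first, then a few pieces.
    vocab = {tok: i for i, tok in enumerate(SPECIAL_TOKENS)}
    for piece in ["h", "i", "y", "o", "hi", "yo"]:
        vocab[piece] = len(vocab)
    merges = [("h", "i"), ("y", "o")]

    tokenizer = Tokenizer(BPE(vocab=vocab, merges=merges, unk_token="<unk>"))
    tokenizer.normalizer = Lowercase()      # "Lowercasing: Enabled"
    tokenizer.pre_tokenizer = Whitespace()  # assumption: not stated in the card
    tokenizer.add_special_tokens(SPECIAL_TOKENS)

    # Templates copied from the card; <cls> goes last with type ID 2.
    tokenizer.post_processor = TemplateProcessing(
        single="$A:0 <sep>:0 <cls>:2",
        pair="$A:0 <sep>:0 $B:1 <sep>:1 <cls>:2",
        special_tokens=[
            ("<sep>", tokenizer.token_to_id("<sep>")),
            ("<cls>", tokenizer.token_to_id("<cls>")),
        ],
    )
    # BPEDecoder maps an end-of-word suffix ("</w>" by default) back to
    # spaces; whether the trained model uses that suffix is not stated.
    tokenizer.decoder = BPEDecoder()
    return tokenizer


tokenizer = build_toy_tokenizer()
single = tokenizer.encode("Hi")
pair = tokenizer.encode("Hi", "Yo")
print(single.tokens, single.type_ids)
print(pair.tokens, pair.type_ids)
```

Placing `<cls>` at the end with its own type ID follows the templates as written, which differs from the BERT convention of a leading classification token.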
Training Method
- Corpus Source: Streaming iterator from WikiText-103 Raw v1 (train split)
- Batch Size: 1000 lines per batch
- Trainer: BpeTrainer from the Hugging Face tokenizers library
- Special Tokens Added: <cls>, <sep>, <unk>, <pad>, <mask>
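The training method above can be sketched as follows, assuming the tokenizers library. The batch size, vocabulary size, and special tokens come from this card; the function names, the `Whitespace` pre-tokenizer, and the tiny in-memory demo corpus are hypothetical stand-ins so that the snippet runs offline.

```python
from typing import Iterable, Iterator, List

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.normalizers import Lowercase
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

SPECIAL_TOKENS = ["<cls>", "<sep>", "<unk>", "<pad>", "<mask>"]


def batch_iterator(lines: Iterable[str], batch_size: int = 1000) -> Iterator[List[str]]:
    """Group a (possibly streamed) sequence of lines into lists of batch_size."""
    batch: List[str] = []
    for line in lines:
        batch.append(line)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush the final partial batch
        yield batch


def train_bpe(lines: Iterable[str], vocab_size: int = 60_000,
              batch_size: int = 1000) -> Tokenizer:
    """Train a lowercasing BPE tokenizer from an iterator of text lines."""
    tokenizer = Tokenizer(BPE(unk_token="<unk>"))
    tokenizer.normalizer = Lowercase()
    tokenizer.pre_tokenizer = Whitespace()  # assumption: not stated in the card
    trainer = BpeTrainer(
        vocab_size=vocab_size,
        special_tokens=SPECIAL_TOKENS,  # placed first, so they get IDs 0-4
        show_progress=False,
    )
    tokenizer.train_from_iterator(batch_iterator(lines, batch_size), trainer=trainer)
    return tokenizer


# Offline demo on a tiny hypothetical corpus instead of WikiText-103.
demo_corpus = ["The cat sat on the mat", "The dog sat on the log"] * 50
demo_tokenizer = train_bpe(demo_corpus, vocab_size=100, batch_size=16)
```

For the real run, `lines` would be the streamed train split and the defaults (60,000 entries, 1000 lines per batch) would apply.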
Intended Uses & Limitations
Intended Uses
- Tokenizing text for training Transformer-based LLMs.
- Downstream NLP tasks:
  - Language modeling
  - Text classification
  - Question answering
  - Summarization
Limitations
- Trained exclusively on English Wikipedia text; performance may degrade in informal, domain-specific, or multilingual contexts.
- May inherit biases present in Wikipedia data.
License
This tokenizer is released under the MIT License.
Citation
If you use this tokenizer, please cite:
title = Custom BPE Tokenizer Trained on WikiText-103 Raw v1
author = yakul259
year = 2025
publisher = Hugging Face