Glaurung Binary Tokenizer 001
Production-ready 64K vocabulary BPE tokenizer for binary executables
GitHub: mjbommar/glaurung
Overview
Glaurung Binary Tokenizer 001 is a specialized Byte Pair Encoding (BPE) tokenizer optimized for compiled binary data across multiple architectures (x86-64, x86-32, ARM64) and executable formats (Linux ELF, Windows PE). It is the production successor to binary-tokenizer-005.
Key Specifications
- Vocabulary Size: 65,536 tokens (exactly 2^16 = 64K)
- Compression: 2.849 bytes/token average
- Training Data: 13GB corpus, 30,738 binaries
- Architectures: x86-64, x86-32, ARM64
- Platforms: Linux (Alpine, Debian, Ubuntu), Windows (8, 10, 11)
- Encoding: Latin-1 (each byte 0-255 maps to a single character)
Performance Highlights
- 9-10% better compression than 32K baseline
- 86% of theoretical maximum compression efficiency
- Instruction-aware: Captures complete x86-64 instructions (REX + opcode + ModR/M)
- String-rich: 5.76% of vocabulary contains function names, paths, library references
Installation
```bash
pip install tokenizers transformers
```
Quick Start
Method 1: Using the tokenizers library (Recommended)
```python
from tokenizers import Tokenizer
from pathlib import Path

# Load tokenizer directly from Hugging Face Hub
tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Process binary data - MUST use latin-1 encoding
binary_path = Path("/usr/bin/ls")
raw_bytes = binary_path.read_bytes()
text = raw_bytes.decode('latin-1')  # Convert bytes to latin-1 string

# Tokenize
encoded = tokenizer.encode(text)
tokens = encoded.ids

print(f"File size: {len(raw_bytes):,} bytes")
print(f"Tokens: {len(tokens):,}")
print(f"Compression: {len(raw_bytes) / len(tokens):.3f} bytes/token")

# Decode back to text (note: adds spaces between tokens due to BPE behavior)
decoded = tokenizer.decode(tokens)
```
Expected Output (for /usr/bin/ls):
```
File size: 142,144 bytes
Tokens: 49,574
Compression: 2.866 bytes/token
```
Method 2: Using transformers library
```python
from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer

# Load the base tokenizer
base_tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Wrap with PreTrainedTokenizerFast for transformers compatibility
tokenizer = PreTrainedTokenizerFast(tokenizer_object=base_tokenizer)

# Process binary data
with open("/usr/bin/ls", "rb") as f:
    raw_bytes = f.read()
text = raw_bytes.decode('latin-1')

# Tokenize (returns dict with input_ids, attention_mask, etc.)
result = tokenizer(text)
tokens = result["input_ids"]
```
Important: Data Format
The tokenizer expects binary data encoded as latin-1 strings, NOT hex strings:
```python
# ✅ CORRECT - Use latin-1 encoded bytes
raw_bytes = b'\x7fELF\x01\x01'
text = raw_bytes.decode('latin-1')  # -> '\x7fELF\x01\x01'
encoded = tokenizer.encode(text)

# ❌ WRONG - Do not use hex strings
hex_str = "7f 45 4c 46 01 01"  # Will not work correctly
```
Why latin-1? Every byte value (0-255) maps to exactly one latin-1 character, ensuring lossless round-trip conversion between bytes and text.
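A quick self-check of that round trip (a minimal sketch; no tokenizer required):

```python
# Every byte value 0-255 survives a latin-1 decode/encode round trip.
data = bytes(range(256))
text = data.decode('latin-1')          # 256-character string
assert text.encode('latin-1') == data  # lossless
```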
Performance Benchmarks
Compression on Real-World Binaries
Tested on /usr/bin binaries (not in training set):
| Binary | Size | Tokens | bytes/token |
|---|---|---|---|
| bash | 1.38 MB | 535,541 | 2.698 |
| python3.12 | 7.65 MB | 2,801,226 | 2.863 |
| gcc-13 | 0.98 MB | 344,201 | 2.986 |
| ls | 0.14 MB | 49,574 | 2.866 |
| grep | 0.18 MB | 67,567 | 2.667 |
Average: 2.849 bytes/token
Information-Theoretic Efficiency
- Binary entropy: ~6.5 bits/byte
- Theoretical optimal: 2.46 bytes/token (16 bits per token ÷ ~6.5 bits/byte)
- Our performance: 2.849 bytes/token
- Efficiency: 86% of theoretical optimum
Example: Tokenizing an ELF Header
```python
from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# ELF header bytes
elf_header = b'\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'
text = elf_header.decode('latin-1')

# Tokenize
encoded = tokenizer.encode(text)

print(f"Original bytes: {elf_header.hex()}")
print(f"Tokens: {encoded.ids}")
print(f"Token count: {len(encoded.ids)}")
print(f"Compression: {len(elf_header) / len(encoded.ids):.2f} bytes/token")

# Examine individual tokens
for token_id, token_str in zip(encoded.ids, encoded.tokens):
    token_bytes = token_str.encode('latin-1')
    print(f"  Token {token_id:5d}: {token_bytes.hex():16s} ({len(token_bytes)} bytes)")
```
Token Distribution
| Length | Count | Percentage | Examples |
|---|---|---|---|
| 2 bytes | 31,528 | 48.3% | 0x48 0x8b (REX.W + MOV), 0xcc 0xcc (int3 padding) |
| 3 bytes | 9,261 | 14.2% | 0x48 0x8b 0xc0 (MOV rax, rax) |
| 4 bytes | 11,520 | 17.6% | 0x48 0x89 0x45 0xf8 (MOV [rbp-8], rax) |
| 5+ bytes | 13,164 | 20.2% | Multi-instruction sequences, string literals |
Average token length: 3.651 bytes
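The distribution above can be approximated directly from the released vocabulary; the sketch below assumes only the standard get_vocab() method from the tokenizers library and measures each token's latin-1 byte length:

```python
from collections import Counter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Token strings are latin-1 text, so their latin-1 byte length equals the token's byte length.
lengths = Counter(
    len(token.encode('latin-1', errors='replace'))
    for token in tokenizer.get_vocab()
)

for length in sorted(lengths):
    print(f"{length}-byte tokens: {lengths[length]:,}")
```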
Training Details
Dataset
Source: /nas4/data/glaurung-data/binaries-small/
- Size: 13 GB
- Files: 30,738 binaries
- Content: Real-world compiled binaries including system utilities, libraries, and applications
Platform Distribution:
- Linux: Alpine, Debian, Ubuntu (ELF format)
- Windows: 8, 10, 11 (PE format)
Architecture Distribution:
- x86-64 (primary)
- x86-32
- ARM64
Training Parameters
```bash
cargo run --release --bin train -- \
    --output glaurung-tokenizer-002.json \
    /nas4/data/glaurung-data/binaries-small/ \
    --vocab-size 65536 \
    --min-frequency 4 \
    --chunk-size 8192
```
Training Duration: 8.46 hours on 24 cores
Peak Memory: 70 GB
Use Cases
✅ Recommended For
- Binary neural language models
- Malware analysis and classification
- Reverse engineering tools
- Binary similarity detection
- Code pattern recognition
- Vulnerability research
- Firmware analysis
❌ Not Recommended For
- Text/source code (use a text tokenizer such as GPT-2's; expect a 100%+ compression penalty)
- Very small binaries <1KB (overhead too high)
- Real-time streaming (load time ~100ms)
Comparison with Predecessor
| Metric | binary-tokenizer-005 | glaurung-binary-tokenizer-001 | Improvement |
|---|---|---|---|
| Vocabulary | 65,536 | 65,536 | Same |
| Training data | ~5GB mixed | 13GB binaries-small | 2.6x larger |
| bytes/token | ~2.6 | 2.849 | +9.6% |
| Platforms | Mixed | Multi-OS (Linux, Windows) | More diverse |
| Architecture awareness | Basic | Advanced (instruction-aware) | Significant |
| Documentation | Basic | Comprehensive | Extensive |
Key Improvements:
- Larger, more diverse training corpus
- Better cross-platform coverage
- Instruction-boundary awareness
- Production-ready quality
Advanced Usage
Batch Processing Multiple Files
```python
from tokenizers import Tokenizer
from pathlib import Path
import numpy as np

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

def tokenize_binary_file(file_path):
    """Tokenize a single binary file."""
    raw_bytes = Path(file_path).read_bytes()
    text = raw_bytes.decode('latin-1')
    encoded = tokenizer.encode(text)
    return {
        'file': file_path,
        'size_bytes': len(raw_bytes),
        'token_count': len(encoded.ids),
        'compression_ratio': len(raw_bytes) / len(encoded.ids),
        'token_ids': encoded.ids,
    }

# Process directory
binary_dir = Path("/usr/bin")
results = []
for binary_path in binary_dir.glob("*"):
    if binary_path.is_file():
        try:
            result = tokenize_binary_file(binary_path)
            results.append(result)
        except Exception as e:
            print(f"Error processing {binary_path}: {e}")

# Analyze compression statistics
compression_ratios = [r['compression_ratio'] for r in results]
print(f"Mean compression: {np.mean(compression_ratios):.3f} bytes/token")
print(f"Std deviation: {np.std(compression_ratios):.3f}")
```
Using with PyTorch/TensorFlow Models
```python
from tokenizers import Tokenizer
import torch
from pathlib import Path

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

def prepare_binary_for_model(file_path, max_length=512):
    """Prepare binary data for neural network input."""
    raw_bytes = Path(file_path).read_bytes()
    text = raw_bytes.decode('latin-1')
    # Tokenize
    encoded = tokenizer.encode(text)
    token_ids = encoded.ids
    # Truncate or pad to max_length
    if len(token_ids) > max_length:
        token_ids = token_ids[:max_length]
    else:
        # Pad with token ID 0 (or use a dedicated padding token)
        token_ids = token_ids + [0] * (max_length - len(token_ids))
    # Convert to tensor
    return torch.tensor(token_ids, dtype=torch.long)

# Use in model
binary_tensor = prepare_binary_for_model("/usr/bin/ls", max_length=1024)
print(f"Tensor shape: {binary_tensor.shape}")  # torch.Size([1024])
```
Troubleshooting
Issue: UnicodeDecodeError when processing binary
Solution: Always use latin-1 encoding, never utf-8:
```python
# ✅ Correct
text = raw_bytes.decode('latin-1')

# ❌ Wrong
text = raw_bytes.decode('utf-8')  # Will fail on non-UTF-8 bytes
```
Issue: Decoded output doesn't match original
Cause: BPE tokenizers add spaces between tokens during decoding.
Solution: Use the raw token IDs and decode manually if exact byte recovery is needed:
```python
# Get tokens without spaces
tokens_no_spaces = ''.join(encoded.tokens)
original_bytes = tokens_no_spaces.encode('latin-1')
```
Issue: Poor compression on specific binary types
Cause: The tokenizer may not be optimized for highly specialized formats (e.g., bytecode for Python .pyc, Java .class).
Solution: Consider domain-specific tokenizers for specialized formats, or use this as a general-purpose baseline.
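One hedged way to apply that advice is to measure a file's compression and compare it with the ~2.85 bytes/token average reported above; the 2.0 threshold below is an illustrative cutoff, not a calibrated one:

```python
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

def compression_ratio(file_path):
    """Bytes per token for a single file."""
    raw_bytes = Path(file_path).read_bytes()
    encoded = tokenizer.encode(raw_bytes.decode('latin-1'))
    return len(raw_bytes) / len(encoded.ids)

ratio = compression_ratio("/usr/bin/ls")  # example path
if ratio < 2.0:  # illustrative threshold, well below the ~2.85 average
    print(f"{ratio:.2f} bytes/token -- this format may be out of domain")
else:
    print(f"{ratio:.2f} bytes/token -- in line with expectations")
```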
Related Projects
- Predecessor: mjbommar/binary-tokenizer-005 - Earlier 64K binary tokenizer
- Framework: mjbommar/glaurung - Binary analysis framework
- Training Code: mjbommar/glaurung-models - Binary embedding models and tokenizers
Technical Architecture
Vocabulary Structure
- Base tokens: 256 single-byte tokens (0x00 to 0xFF)
- Merged tokens: 65,280 learned byte-pair combinations
- Total: 65,536 tokens (exactly 2^16)
Special Tokens
The tokenizer includes boundary markers for file-level segmentation:
- <|start|> (ID: 0)
- <|end|> (ID: 1)
These help models distinguish between concatenated files and identify file headers.
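A minimal sketch of wrapping a file with these markers before encoding; it assumes the markers are registered as added special tokens so they resolve to single IDs (the token_to_id lookups verify this rather than hard-coding 0 and 1):

```python
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Look up the boundary token IDs instead of assuming 0 and 1.
start_id = tokenizer.token_to_id("<|start|>")
end_id = tokenizer.token_to_id("<|end|>")

text = Path("/usr/bin/ls").read_bytes().decode('latin-1')
encoded = tokenizer.encode("<|start|>" + text + "<|end|>")
print(encoded.ids[0] == start_id, encoded.ids[-1] == end_id)  # expect: True True
```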
Token Properties
Instruction-aware patterns (x86-64 examples):
- REX prefixes: 0x48, 0x4c, 0x4d
- Common opcodes: 0x8b (MOV), 0x89 (MOV), 0xe8 (CALL)
- ModR/M patterns: 0xc0, 0x45, 0x5d
Common patterns:
- Padding: 0xcc 0xcc (int3), 0x90 0x90 (nop)
- Alignment: 0x00 0x00 0x00 0x00
- String terminators: 0x00 at word boundaries
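You can probe whether a given byte pattern landed in the vocabulary as a single token using token_to_id; the patterns below are the examples listed above, and any of them may legitimately resolve to None if they only appear as parts of longer merges:

```python
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

patterns = {
    "int3 padding": bytes([0xCC, 0xCC]),
    "nop padding":  bytes([0x90, 0x90]),
    "MOV rax, rax": bytes([0x48, 0x8B, 0xC0]),
}

for name, pattern in patterns.items():
    token_id = tokenizer.token_to_id(pattern.decode('latin-1'))
    status = f"single token {token_id}" if token_id is not None else "not a single token"
    print(f"{name:14s} {pattern.hex():8s} -> {status}")
```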
Performance Characteristics
Load Time
- Tokenizer size: 2.3 MB on disk
- Load time: ~100ms (cold), ~20ms (cached)
- Memory footprint: ~15 MB in RAM
Encoding Speed
On a modern CPU (tested on Intel i9-12900K):
| Operation | Speed |
|---|---|
| Encode 1 MB binary | ~50 ms |
| Encode 10 MB binary | ~450 ms |
| Encode 100 MB binary | ~4.2 s |
Throughput: ~20-25 MB/second
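These timings depend heavily on hardware; a minimal sketch for measuring encode throughput on your own machine (the binary path is just an example):

```python
import time
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

raw_bytes = Path("/usr/bin/python3.12").read_bytes()  # any reasonably large binary
text = raw_bytes.decode('latin-1')

start = time.perf_counter()
encoded = tokenizer.encode(text)
elapsed = time.perf_counter() - start

mb = len(raw_bytes) / 1e6
print(f"Encoded {mb:.1f} MB in {elapsed:.2f} s "
      f"({mb / elapsed:.1f} MB/s, {len(encoded.ids):,} tokens)")
```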
Limitations
- Cross-domain penalty: Using on text data causes 100-140% efficiency loss
- Small file overhead: Files <1KB have proportionally higher tokenization overhead
- Decode spacing: the default decode() inserts spaces between tokens (standard BPE behavior); join the raw token strings for exact byte recovery
- Architecture bias: Trained primarily on x86-64; may be less optimal for RISC-V, MIPS, etc.
Citation
If you use this tokenizer in research, please cite:
```
Glaurung Binary Tokenizer 001
64K Binary Tokenizer for Neural Language Models
Vocabulary: 65,536 tokens (exactly 2^16)
Training: October 2025
Dataset: 13GB binaries-small (30,738 files)
Performance: 2.849 bytes/token (86% of theoretical optimum)
HuggingFace: mjbommar/glaurung-binary-tokenizer-001
```
License
Apache License 2.0
This tokenizer is part of the Glaurung project. See the glaurung-models repository for full license details.
Support & Issues
- GitHub Issues: mjbommar/glaurung-models/issues
- Documentation: Full training report available in the glaurung-models repository
- Email: Contact maintainer via GitHub
Production Status: ✅ Ready for deployment | Version: 1.0.0 | Release Date: October 2025