Glaurung Binary Tokenizer 001

Production-ready 64K vocabulary BPE tokenizer for binary executables

🔗 GitHub: mjbommar/glaurung


Overview

Glaurung Binary Tokenizer 001 is a specialized Byte Pair Encoding (BPE) tokenizer optimized for compiled binary data across multiple architectures (x86-64, x86-32, ARM64) and executable formats (Windows PE, Linux ELF). It is the production successor to binary-tokenizer-005.

Key Specifications

  • Vocabulary Size: 65,536 tokens (exactly 2^16 = 64K)
  • Compression: 2.849 bytes/token average
  • Training Data: 13GB corpus, 30,738 binaries
  • Architectures: x86-64, x86-32, ARM64
  • Platforms: Linux (Alpine, Debian, Ubuntu), Windows (8, 10, 11)
  • Encoding: Latin-1 (each byte 0-255 maps to a single character)

Performance Highlights

  • 9-10% better compression than a 32K-vocabulary baseline
  • 86% of theoretical maximum compression efficiency
  • Instruction-aware: Captures complete x86-64 instructions (REX + opcode + ModR/M)
  • String-rich: 5.76% of vocabulary contains function names, paths, and library references

Installation

pip install tokenizers transformers

Quick Start

Method 1: Using the tokenizers library (Recommended)

from tokenizers import Tokenizer
from pathlib import Path

# Load tokenizer directly from Hugging Face Hub
tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Process binary data - MUST use latin-1 encoding
binary_path = Path("/usr/bin/ls")
raw_bytes = binary_path.read_bytes()
text = raw_bytes.decode('latin-1')  # Convert bytes to latin-1 string

# Tokenize
encoded = tokenizer.encode(text)
tokens = encoded.ids

print(f"File size: {len(raw_bytes):,} bytes")
print(f"Tokens: {len(tokens):,}")
print(f"Compression: {len(raw_bytes) / len(tokens):.3f} bytes/token")

# Decode back to text (note: adds spaces between tokens due to BPE behavior)
decoded = tokenizer.decode(tokens)

Expected Output (for /usr/bin/ls; exact values depend on your ls build):

File size: 142,144 bytes
Tokens: 49,574
Compression: 2.866 bytes/token

Method 2: Using the transformers library

from transformers import PreTrainedTokenizerFast
from tokenizers import Tokenizer

# Load the base tokenizer
base_tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Wrap with PreTrainedTokenizerFast for transformers compatibility
tokenizer = PreTrainedTokenizerFast(tokenizer_object=base_tokenizer)

# Process binary data
with open("/usr/bin/ls", "rb") as f:
    raw_bytes = f.read()
    text = raw_bytes.decode('latin-1')

# Tokenize (returns dict with input_ids, attention_mask, etc.)
result = tokenizer(text)
tokens = result["input_ids"]
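
For model training you usually want fixed-length batches. Below is a minimal sketch of batched tokenization with truncation and padding; it assumes the <|end|> token (ID 1) is reused as the padding token (pick whatever convention your model expects) and requires PyTorch for return_tensors="pt":

# Assumption: reuse <|end|> as the padding token (present in the vocabulary with ID 1)
tokenizer.pad_token = "<|end|>"

files = ["/usr/bin/ls", "/usr/bin/cat"]
texts = [open(p, "rb").read().decode('latin-1') for p in files]

batch = tokenizer(
    texts,
    truncation=True,
    max_length=512,
    padding="max_length",
    return_tensors="pt",
)
print(batch["input_ids"].shape)       # torch.Size([2, 512])
print(batch["attention_mask"].shape)  # torch.Size([2, 512])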

Important: Data Format

The tokenizer expects binary data encoded as latin-1 strings, NOT hex strings:

# ✅ CORRECT - Use latin-1 encoded bytes
raw_bytes = b'\x7fELF\x01\x01'
text = raw_bytes.decode('latin-1')  # → '\x7fELF\x01\x01'
encoded = tokenizer.encode(text)

# ❌ WRONG - Do not use hex strings
hex_str = "7f 45 4c 46 01 01"  # Will not work correctly

Why latin-1? Every byte value (0-255) maps to exactly one latin-1 character, ensuring lossless round-trip conversion between bytes and text.
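
A quick self-contained check confirms the round trip is lossless for every byte value:

# bytes -> latin-1 string -> bytes recovers all 256 byte values exactly
all_bytes = bytes(range(256))
assert all_bytes.decode('latin-1').encode('latin-1') == all_bytes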


Performance Benchmarks

Compression on Real-World Binaries

Tested on /usr/bin binaries (not in training set):

Binary       Size      Tokens      bytes/token
bash         1.38 MB     535,541   2.698
python3.12   7.65 MB   2,801,226   2.863
gcc-13       0.98 MB     344,201   2.986
ls           0.14 MB      49,574   2.866
grep         0.18 MB      67,567   2.667

Average: 2.849 bytes/token

Information-Theoretic Efficiency

  • Binary entropy: ~6.5 bits/byte (see the estimation sketch below)
  • Theoretical optimal: 2.46 bytes/token
  • Our performance: 2.849 bytes/token
  • Efficiency: 86% of theoretical optimum
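
To reproduce the entropy figure on your own files, here is a minimal sketch, assuming the figure above refers to byte-level Shannon entropy:

import math
from collections import Counter
from pathlib import Path

def byte_entropy(path):
    """Estimate byte-level Shannon entropy in bits/byte."""
    data = Path(path).read_bytes()
    counts = Counter(data)
    total = len(data)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

print(f"{byte_entropy('/usr/bin/ls'):.2f} bits/byte")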

Example: Tokenizing an ELF Header

from tokenizers import Tokenizer

# Load tokenizer
tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# ELF header bytes
elf_header = b'\x7fELF\x02\x01\x01\x00\x00\x00\x00\x00\x00\x00\x00\x00'
text = elf_header.decode('latin-1')

# Tokenize
encoded = tokenizer.encode(text)
print(f"Original bytes: {elf_header.hex()}")
print(f"Tokens: {encoded.ids}")
print(f"Token count: {len(encoded.ids)}")
print(f"Compression: {len(elf_header) / len(encoded.ids):.2f} bytes/token")

# Examine individual tokens
for token_id, token_str in zip(encoded.ids, encoded.tokens):
    token_bytes = token_str.encode('latin-1')
    print(f"  Token {token_id:5d}: {token_bytes.hex():16s} ({len(token_bytes)} bytes)")

Token Distribution

Length     Count    Percentage   Examples
2 bytes    31,528   48.3%        0x48 0x8b (REX.W prefix), 0xcc 0xcc (int3 padding)
3 bytes     9,261   14.2%        0x48 0x8b 0xc0 (MOV rax, rax)
4 bytes    11,520   17.6%        0x48 0x89 0x45 0xf8 (MOV [rbp-8], rax)
5+ bytes   13,164   20.2%        Multi-instruction sequences, string literals

Average token length: 3.651 bytes
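
The distribution above can be recomputed from the vocabulary itself. A minimal sketch using Tokenizer.get_vocab(); because token strings are latin-1, their encoded length equals the token's byte length (the two special tokens will be miscounted and can be ignored):

from collections import Counter
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

# Byte length of every token string in the vocabulary
lengths = [len(tok.encode('latin-1')) for tok in tokenizer.get_vocab()]
buckets = Counter(min(n, 5) for n in lengths)  # bucket 5 means "5+ bytes"

for n in sorted(buckets):
    label = "5+" if n == 5 else str(n)
    print(f"{label} bytes: {buckets[n]:,} tokens ({buckets[n] / len(lengths):.1%})")
print(f"Average token length: {sum(lengths) / len(lengths):.3f} bytes")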


Training Details

Dataset

Source: /nas4/data/glaurung-data/binaries-small/

  • Size: 13 GB
  • Files: 30,738 binaries
  • Content: Real-world compiled binaries including system utilities, libraries, and applications

Platform Distribution:

  • Linux: Alpine, Debian, Ubuntu (ELF format)
  • Windows: 8, 10, 11 (PE format)

Architecture Distribution:

  • x86-64 (primary)
  • x86-32
  • ARM64

Training Parameters

cargo run --release --bin train -- \
  --output glaurung-tokenizer-002.json \
  /nas4/data/glaurung-data/binaries-small/ \
  --vocab-size 65536 \
  --min-frequency 4 \
  --chunk-size 8192

Training Duration: 8.46 hours on 24 cores
Peak Memory: 70 GB


Use Cases

✅ Recommended For

  • Binary neural language models
  • Malware analysis and classification
  • Reverse engineering tools
  • Binary similarity detection
  • Code pattern recognition
  • Vulnerability research
  • Firmware analysis

❌ Not Recommended For

  • Text or source code (use a text tokenizer such as GPT-2's; expect a 100%+ efficiency penalty)
  • Very small binaries (<1 KB): tokenization overhead is proportionally too high
  • Real-time streaming where the ~100 ms tokenizer load time matters

Comparison with Predecessor

Metric                   binary-tokenizer-005   glaurung-binary-tokenizer-001   Improvement
Vocabulary               65,536                 65,536                          Same
Training data            ~5 GB mixed            13 GB binaries-small            2.6x larger
bytes/token              ~2.6                   2.849                           +9.6%
Platforms                Mixed                  Multi-OS (Linux, Windows)       More diverse
Architecture awareness   Basic                  Advanced (instruction-aware)    Significant
Documentation            Basic                  Comprehensive                   Extensive

Key Improvements:

  • Larger, more diverse training corpus
  • Better cross-platform coverage
  • Instruction-boundary awareness
  • Production-ready quality

Advanced Usage

Batch Processing Multiple Files

from tokenizers import Tokenizer
from pathlib import Path
import numpy as np

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

def tokenize_binary_file(file_path):
    """Tokenize a single binary file."""
    raw_bytes = Path(file_path).read_bytes()
    text = raw_bytes.decode('latin-1')
    encoded = tokenizer.encode(text)
    return {
        'file': file_path,
        'size_bytes': len(raw_bytes),
        'token_count': len(encoded.ids),
        'compression_ratio': len(raw_bytes) / len(encoded.ids),
        'token_ids': encoded.ids
    }

# Process directory
binary_dir = Path("/usr/bin")
results = []
for binary_path in binary_dir.glob("*"):
    if binary_path.is_file():
        try:
            result = tokenize_binary_file(binary_path)
            results.append(result)
        except Exception as e:
            print(f"Error processing {binary_path}: {e}")

# Analyze compression statistics
compression_ratios = [r['compression_ratio'] for r in results]
print(f"Mean compression: {np.mean(compression_ratios):.3f} bytes/token")
print(f"Std deviation: {np.std(compression_ratios):.3f}")

Using with PyTorch/TensorFlow Models

from tokenizers import Tokenizer
import torch
from pathlib import Path

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

def prepare_binary_for_model(file_path, max_length=512):
    """Prepare binary data for neural network input."""
    raw_bytes = Path(file_path).read_bytes()
    text = raw_bytes.decode('latin-1')

    # Tokenize
    encoded = tokenizer.encode(text)
    token_ids = encoded.ids

    # Truncate or pad to max_length
    if len(token_ids) > max_length:
        token_ids = token_ids[:max_length]
    else:
        # Pad with token ID 0 (or use a dedicated padding token)
        token_ids = token_ids + [0] * (max_length - len(token_ids))

    # Convert to tensor
    return torch.tensor(token_ids, dtype=torch.long)

# Use in model
binary_tensor = prepare_binary_for_model("/usr/bin/ls", max_length=1024)
print(f"Tensor shape: {binary_tensor.shape}")  # torch.Size([1024])

Troubleshooting

Issue: UnicodeDecodeError when processing binary

Solution: Always use latin-1 encoding, never utf-8:

# ✅ Correct
text = raw_bytes.decode('latin-1')

# ❌ Wrong
text = raw_bytes.decode('utf-8')  # Will fail on non-UTF-8 bytes

Issue: Decoded output doesn't match original

Cause: BPE tokenizers add spaces between tokens during decoding.

Solution: Use the raw token IDs and decode manually if exact byte recovery is needed:

# Get tokens without spaces
tokens_no_spaces = ''.join(encoded.tokens)
original_bytes = tokens_no_spaces.encode('latin-1')

Issue: Poor compression on specific binary types

Cause: The tokenizer may not be optimized for highly specialized formats (e.g., Python .pyc bytecode or Java .class files).

Solution: Consider domain-specific tokenizers for specialized formats, or use this as a general-purpose baseline.


Technical Architecture

Vocabulary Structure

  • Base tokens: 256 single-byte tokens (0x00 to 0xFF)
  • Merged tokens: 65,280 learned byte-pair combinations
  • Total: 65,536 tokens (exactly 2^16; see the check below)
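
A quick check against the loaded tokenizer (get_vocab_size includes added special tokens):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")
print(tokenizer.get_vocab_size())  # expected: 65536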

Special Tokens

The tokenizer includes boundary markers for file-level segmentation:

  • <|start|> (ID: 0)
  • <|end|> (ID: 1)

These help models distinguish between concatenated files and identify file headers.
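
One possible way to use the markers when concatenating files is sketched below. The exact convention used during training is not documented here, so treat this as an assumption; token_to_id returns None if a token is missing from the vocabulary.

from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")
start_id = tokenizer.token_to_id("<|start|>")  # expected: 0
end_id = tokenizer.token_to_id("<|end|>")      # expected: 1

def wrap_file(path):
    """Tokenize one file and wrap it with boundary markers."""
    text = Path(path).read_bytes().decode('latin-1')
    return [start_id] + tokenizer.encode(text).ids + [end_id]

# Concatenate two files into one marked token stream
stream = wrap_file("/usr/bin/ls") + wrap_file("/usr/bin/cat")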

Token Properties

Instruction-aware patterns (x86-64 examples):

  • REX prefixes: 0x48, 0x4c, 0x4d
  • Common opcodes: 0x8b (MOV), 0x89 (MOV), 0xe8 (CALL)
  • ModR/M patterns: 0xc0, 0x45, 0x5d

Common patterns:

  • Padding: 0xcc 0xcc (int3), 0x90 0x90 (nop); see the sketch after this list
  • Alignment: 0x00 0x00 0x00 0x00
  • String terminators: 0x00 at word boundaries
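
A minimal sketch for inspecting how a run of padding bytes is tokenized (the exact token boundaries depend on the learned merges, so the output is not guaranteed):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

padding_run = (b'\xcc' * 16).decode('latin-1')  # 16 bytes of int3 padding
encoded = tokenizer.encode(padding_run)
for token_id, token_str in zip(encoded.ids, encoded.tokens):
    print(token_id, token_str.encode('latin-1').hex())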

Performance Characteristics

Load Time

  • Tokenizer size: 2.3 MB on disk
  • Load time: ~100ms (cold), ~20ms (cached)
  • Memory footprint: ~15 MB in RAM

Encoding Speed

On a modern CPU (tested on Intel i9-12900K):

Operation               Speed
Encode 1 MB binary      ~50 ms
Encode 10 MB binary     ~450 ms
Encode 100 MB binary    ~4.2 s

Throughput: ~20-25 MB/second
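
Throughput varies with CPU and file contents; here is a minimal sketch for measuring it on your own hardware (any large local binary works; /usr/bin/python3.12 is just an example):

import time
from pathlib import Path
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("mjbommar/glaurung-binary-tokenizer-001")

raw_bytes = Path("/usr/bin/python3.12").read_bytes()
text = raw_bytes.decode('latin-1')

start = time.perf_counter()
tokenizer.encode(text)
elapsed = time.perf_counter() - start
print(f"{len(raw_bytes) / elapsed / 1e6:.1f} MB/s")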


Limitations

  1. Cross-domain penalty: Applying the tokenizer to natural-language text incurs a 100-140% efficiency penalty compared with a text tokenizer
  2. Small file overhead: Files <1KB have proportionally higher tokenization overhead
  3. Decoding artifacts: decode() inserts spaces between tokens (standard BPE behavior); join encoded.tokens for exact byte recovery (see Troubleshooting)
  4. Architecture bias: Trained primarily on x86-64; may be less optimal for RISC-V, MIPS, etc.

Citation

If you use this tokenizer in research, please cite:

Glaurung Binary Tokenizer 001
64K Binary Tokenizer for Neural Language Models
Vocabulary: 65,536 tokens (exactly 2^16)
Training: October 2025
Dataset: 13GB binaries-small (30,738 files)
Performance: 2.849 bytes/token (86% of theoretical optimum)
HuggingFace: mjbommar/glaurung-binary-tokenizer-001

License

Apache License 2.0

This tokenizer is part of the Glaurung project. See the glaurung-models repository for full license details.


Support & Issues


Production Status: ✅ Ready for deployment
Version: 1.0.0
Release Date: October 2025
