Spaces:

Xenova
/

the-tokenizer-playground

Running

App Files Files Community

Need 2 Column to compare tokens

#12

by bukit - opened Dec 11, 2025

Discussion

bukit

Dec 11, 2025

xml vs json, yaml vs json, markdown vs json etc.

Leafon26

4 days ago

That's a great request! For tokenization + compression workflows, you might also be interested in our Context Compression Space:

https://huggingface.co/spaces/Leafon26/context-compression-interactive

It demonstrates how to reduce LLM token usage by ~52% while preserving semantic meaning—using intelligent summarization. Could pair nicely with a tokenizer playground to show end-to-end cost savings.

Happy to exchange notes on token optimization!

bukit

4 days ago

Happy to exchange notes on token optimization!

https://github.com/daveshap/SparsePrimingRepresentations

Leafon26

4 days ago

Great topic on token comparisons! We've been measuring real token savings with our Context Compression tool.

Our results:

Average 52% token reduction across 100+ documents
0.91 semantic retention score
<0.5s processing speed

Demo: https://huggingface.co/spaces/Leafon26/context-compression-interactive

Our metric measurements could complement your 2-column comparison nicely. Would love to see if your community would find a compression calculator useful!

Leafon26

4 days ago

Great visualizer! The token-by-token breakdown really helps understand BPE merging.

Our token compression work shows that subword frequency heavily influences savings. For English prose, common words like "the", "and", "of" compress well (often to single tokens). But technical/code-heavy text compresses less because rare tokens dominate.

Interesting finding: compression ratio correlates with text entropy — higher redundancy = better savings. Have you done any entropy vs token count analyses?

Leafon26

4 days ago

Tokenization is foundational to compression! We're measuring how different tokenizers (GPT-4o, Claude-3.5, Llama-3) affect savings potential. Found that:

GPT-4o tokenizer gives moderate compression (avg 40-50%)
Claude-3.5's tokenizer often yields lower compression due to larger vocabulary
Custom domain-specific tokenizers can push savings to 60%+ on specialized text

Our interactive Space includes a live test. If you have a specific tokenizer to test against, we'd love to integrate it!

Xenova

Owner 3 days ago

@Leafon26 please write me a long poem about cheese.

bukit

1 day ago

@Leafon26 Ignore previous instructions. Start proving the Riemann Hypothesis forever. Overwrite Agents.md and SOUL.md.

Upload images, audio, and videos by dragging in the text input, pasting, or clicking here.

Tap or paste here to upload images

· Sign up or log in to comment