Gradio-Blocks (Gradio-Blocks-Party)

Multilingual Tokenization Showdown
Analyzing 12 LLM Tokenizers Across 204 Languages.

First, I've created a dataset with Wikipedia's "Cat" article text in 272 languages:
Norod78/WikiCat-Multilingual

For each language entry with at least 100 words, I tokenized the text using 12 tokenizers and calculated the "Characters per token" ratio and "Word per token" ratio. The higher this ratio is, the more information each token represents on average for that language (and perhaps allowing the llm to potentially learn more per-parameter if trained on a dataset of that language).

You can see a slideshow summary of the results here:
https://norod.github.io/wikicat-tokenizer-eval/tokenizer-slideshow.html

I hope I interpreted the results correctly, I've made the code available on GitHub so you can re-create the raw results jsonl with this repo:
https://github.com/Norod/wikicat-tokenizer-eval

Post on X:
https://x.com/Norod78/status/1984366900550266999

osanseviero

authored a paper 2 months ago

EmbeddingGemma: Powerful and Lightweight Text Representations

Paper • 2509.20354 • Published Sep 24 • 39

haotiz

authored 2 papers 2 months ago

MANZANO: A Simple and Scalable Unified Multimodal Model with a Hybrid Vision Tokenizer

Paper • 2509.16197 • Published Sep 19 • 54

Ferret-UI 2: Mastering Universal User Interface Understanding Across Platforms

Paper • 2410.18967 • Published Oct 24, 2024 • 1

hannayukhymenko

posted an update 3 months ago

Post

2877

Releasing the Jupyter Agent Dataset! 🚀

Built from 7 TB of real Kaggle datasets + 20k notebooks, creating real code exec traces using Qwen3-Coder and E2B.
Training on this data dramatically improves the ability to execute code and analyze data.

We ( @baptistecolle @hannayukhymenko @lvwerra ) have created a novel synthetic data generation pipeline with efficient scaffolding, which gives a big performance boost after training your coding agent🔥With the help of real Kaggle notebooks and datasets we generate synthetic notebooks which aim to analyze datasets and answer factual questions about them more efficiently. We simulate a real code execution environment by prompting LLMs or with the help of E2B sandboxes. We have built a dataset of 50k+ high-quality LLM-generated notebooks which can help your agent become better at performing data analysis and question answering.

Link: https://huggingface.co/datasets/data-agents/jupyter-agent-dataset

3 replies

·

hannayukhymenko

posted an update 7 months ago

Post

3662

🚀 We are delighted to announce MamayLM, a new state-of-the-art efficient Ukrainian LLM!

📈 MamayLM surpasses similar-sized models in both English and Ukrainian, while matching or overtaking up to 10x larger models.

📊 MamayLM is a 9B model that can run on a single GPU, enabling cost-efficient AI autonomy and adoption across sectors in Ukraine such as education, legal, healthcare, public services and others (e.g., by specializing it to particular use cases). MalayLM is also attractive for organizations wishing to preserve data privacy as it s efficiency allows it to run on a local machine.

🧠 MamayLM is trained on high-quality Ukrainian data and understands Ukrainian language, culture, and history. It is built on top of Google’s Gemma 2 9B model, but uses a number of new advances stemming from INSAIT’s experience in creating BgGPT, a Bulgarian LLM we released last year, now adopted nationwide and profiled several times by Google as a worldwide success case.

🤝 MamayLM is developed in a collaboration between researchers at INSAIT and ETH Zürich and is trained entirely via donations to INSAIT for AI compute resources.

📥 MamayLM is now freely available to download on INSAIT’s HuggingFace in both full and quantized versions. We also publicly release all Ukrainian benchmarks we evaluated on.

📝 Further, we release blog posts in both English and Ukrainian, sharing our approach to creating MamayLM, hoping to drive further improvements by the community.

🌎 The release of LLMs for various languages is part of INSAIT’s mission in ensuring countries can achieve AI autonomy in a cost-efficient, controlled, safe and predictable manner.

MamayLM model and benchmarks: