HuggingFaceFW-Dev (HuggingFaceFW-Dev)

eliebak

submitted a paper to Daily Papers 11 days ago

SonicMoE: Accelerating MoE with IO and Tile-aware Optimizations

Paper • 2512.14080 • Published 13 days ago • 5

nouamanetazi

posted an update about 2 months ago

Post

4133

After training SmolLM3 on 384 H100s for nearly a month, I've come to realize something most people overlook: infrastructure is the make-or-break factor in LLM training. 🔥

Everyone talks about model architecture and data quality. And yes, those matter immensely. But here's what nobody tells you: when your training run fails at 2 AM because of mysterious NCCL errors, or when your expensive GPU cluster is running at 60% efficiency, the problem isn't your model. It's most probably a misuse of the hardware. 🛠️

Questions that seemed simple but had no clear answers: Why is MoE training slower than dense models? Which NCCL flags should we actually set? How often should we checkpoint without killing throughput?

That's why we built The Smol Training Playbook 📖: a complete guide covering everything from model architecture and data curation to the SmolLM3 training marathon, post-training techniques, and crucially, the infrastructure layer that most teams get wrong.

We validated real vs theoretical bandwidth across the entire stack: HBM3 hitting 3 TB/s, NVLink 4.0 reaching 786 GB/s, PCIe Gen4 at 14.2 GB/s. Then we ran collective operations across 128 GPUs (16 nodes, 8xH100s each) and measured how performance degrades at scale: all-reduce drops from 480 GB/s on a single node to 320-350 GB/s across 16 nodes.
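To put the all-reduce numbers above in perspective, here is a small arithmetic sketch of how much single-node bandwidth survives at 16 nodes. The figures are the ones quoted in the post; the constant and function names are illustrative only and do not come from the playbook.

```python
# Figures quoted in the post, in GB/s. Names are hypothetical, chosen for this sketch.
SINGLE_NODE_ALLREDUCE = 480.0
MULTI_NODE_ALLREDUCE_RANGE = (320.0, 350.0)  # measured across 16 nodes


def retained_fraction(multi_node: float, single_node: float) -> float:
    """Fraction of single-node all-reduce bandwidth that is kept at scale."""
    return multi_node / single_node


low, high = (
    retained_fraction(bw, SINGLE_NODE_ALLREDUCE)
    for bw in MULTI_NODE_ALLREDUCE_RANGE
)
print(f"retained: {low:.1%} to {high:.1%}")  # retained: 66.7% to 72.9%
```

In other words, going from one node to 16 costs roughly a quarter to a third of the single-node all-reduce bandwidth in these measurements.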

If you've ever wondered why your training runs are slower than they should be, or you're planning to scale up and want to avoid expensive mistakes, this guide might save you weeks of debugging.

The Smol Training Playbook: https://lnkd.in/e5MKXUHS

Shared with ❤️ by the HuggingFace team

thomwolf

authored a paper 2 months ago

Robot Learning: A Tutorial

Paper • 2510.12403 • Published Oct 14 • 118

lvwerra

authored a paper 3 months ago

BigCodeArena: Unveiling More Reliable Human Preferences in Code Generation via Execution

Paper • 2510.08697 • Published Oct 9 • 36

mjaggi

authored a paper 3 months ago

Apertus: Democratizing Open and Compliant LLMs for Global Language Environments

Paper • 2509.14233 • Published Sep 17 • 14

davanstrien

posted an update 4 months ago

Post

1490

I fine-tuned a smol VLM to generate specialized art history metadata!

https://huggingface.co/davanstrien/iconclass-vlm: Qwen2.5-VL-3B trained using SFT to generate ICONCLASS codes (think Dewey Decimal for art!)

Trained with TRL + HF Jobs - single UV script, no GPU needed!

Space to explore predictions on a test set: davanstrien/iconclass-predictions

Blog soon!

mjaggi

authored a paper 4 months ago

Benchmarking Optimizers for Large Language Model Pretraining

Paper • 2509.01440 • Published Sep 1 • 24

eliebak

posted an update 4 months ago

Post

3830

Super excited to announce that our research team at Hugging Face will be doing an AMA on Reddit's r/LocalLLaMA.

Come ask the team behind SmolLM, FineWeb and more any questions you have! And who knows, maybe there’ll be a shiny new release to talk about?

Thursday 4th September, 8AM-11AM PST 🤗

science

eliebak

posted an update 4 months ago

Post

708

Motif 2.6B tech report is pretty insane, first time I've seen a model with differential attention and PolyNorm trained at scale!

> It's trained on 2.5T tokens, with a "data mixture schedule" to continuously adjust the mixture over training.
> They use WSD with a "Simple moving average", averaging the last 6 checkpoints, taken every 8B tokens.
> They trained on FineMath, FineWeb2, DCLM, TxT360.
> Lots of details on the fine-tuning data they used; for instance, they used EvolKit and did some "dataset fusion" to pack more compressed knowledge into the data.
> They mention they also tried Normalized GPT, QK-Norm and Cross Layer Attention.
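The "simple moving average" over recent checkpoints mentioned in the notes above can be sketched as follows. This is a minimal illustration under stated assumptions, not Motif's implementation: checkpoints are modeled as plain dicts mapping a parameter name to a flat list of floats, and `average_checkpoints` and `CheckpointSMA` are hypothetical names. The window of 6 is taken from the post.

```python
from collections import deque
from typing import Deque, Dict, List

# Assumption for this sketch: a checkpoint is {parameter name: flat list of weights}.
Checkpoint = Dict[str, List[float]]


def average_checkpoints(checkpoints: List[Checkpoint]) -> Checkpoint:
    """Element-wise mean of checkpoints sharing the same names and shapes."""
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(c[name] for c in checkpoints))]
        for name in checkpoints[0]
    }


class CheckpointSMA:
    """Keeps the most recent `window` checkpoints and averages them on demand."""

    def __init__(self, window: int = 6) -> None:
        self.recent: Deque[Checkpoint] = deque(maxlen=window)

    def add(self, checkpoint: Checkpoint) -> Checkpoint:
        # The deque drops the oldest checkpoint once the window is full.
        self.recent.append(checkpoint)
        return average_checkpoints(list(self.recent))


sma = CheckpointSMA(window=2)
sma.add({"w": [1.0, 2.0]})
sma.add({"w": [3.0, 4.0]})
averaged = sma.add({"w": [5.0, 6.0]})  # averages only the last two checkpoints
print(averaged)
```

In a real run, `add` would be called each time a checkpoint is saved (every 8B tokens in the setup described), and the averaged weights would be the ones evaluated or released.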

Motif-Technologies/Motif-2.6B

joelniklaus

authored 5 papers 4 months ago

INCLUDE: Evaluating Multilingual Language Understanding with Regional Knowledge

Paper • 2411.19799 • Published Nov 29, 2024 • 14

LEXam: Benchmarking Legal Reasoning on 340 Law Exams

Paper • 2505.12864 • Published May 19 • 3

Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization

Paper • 2508.04796 • Published Aug 6

From Citations to Criticality: Predicting Legal Decision Influence in the Multilingual Swiss Jurisprudence

Paper • 2410.13460 • Published Oct 17, 2024

Unlocking Legal Knowledge: A Multilingual Dataset for Judicial Summarization in Switzerland

Paper • 2410.13456 • Published Oct 17, 2024

hynky

updated a Space 4 months ago

Trackio Classification Space

🚀

Visualize project metrics and runs

hynky

published a Space 4 months ago

Trackio Classification Space

🚀

Visualize project metrics and runs

hynky

updated a model 4 months ago

HuggingFaceFW-Dev/fine-pdfs-classification-teacher-eng_Latn-Qwen_Qwen3-235B-A22B-Instruct-2507

Text Classification • 0.1B • Updated Aug 18 • 7 • 1

hynky

published a model 4 months ago

HuggingFaceFW-Dev/fine-pdfs-classification-teacher-eng_Latn-Qwen_Qwen3-235B-A22B-Instruct-2507

Text Classification • 0.1B • Updated Aug 18 • 7 • 1

eliebak

posted an update 5 months ago

Post

4787

Kimi K2 tech report is full of gems as always. Here are my notes on it:

> MuonClip: Pretty crazy how after 70k steps the training stabilizes and the QK-clip is basically inactive. There is also no loss in perf with QK-clip, which is not trivial at all (at small scale but with an aggressive threshold). Also a cool explanation of why Muon makes the logits explode in appendix E (tl;dr is that Muon makes the singular values of the update matrix higher)
> Sparsity scaling laws to justify their ratio. They have a very solid training infra that allows the model to be trained at this sparsity level; they could have increased it even more, but as sparsity increases the training becomes less efficient.
> They diminish the number of attention heads to make it more efficient for long context since attention heads are a big bottleneck for long context. They also remove 2 of the 3 "first dense" layers in the dsv3 arch.

With the sparsity and attention heads (divided by 2) they achieve 83% increased flops compared to deepseek v3 arch at 128k.

> Data: Rephrasing is KEY. They do a lot more synthetic data generation and rephrase their corpus to have different styles, for longer documents they do it by chunk. I'm (half) surprised by the fact that ONLY 1 epoch (assuming same number of training tokens I think?) of data rephrased 10 times has better accuracy than 10 epochs of the same data rephrased once.
> They do rewriting for Math and Knowledge. For Math they apply the SwallowMath recipe and instruct the model to rephrase in a "learning note" style.
> They talk about diversity and probably have some internal stuff/eval to test that; as always, it's still a bit unclear to me how to properly measure that.

The infra is also very nice, quick summary:
> PP=16 (1F1B schedule, a bit custom), EP=16, ZeRO-1
> No FP8 computation, only FP8 storage for specific layers; selective recomputation for inexpensive blocks; activation offloading to CPU
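For readers unfamiliar with the QK-clip idea from the MuonClip note above, here is a rough sketch of the mechanism as I understand it. The post does not spell out the update rule, so treat the following as an assumption: when the largest attention logit seen for a head exceeds a threshold `tau`, the query and key projection weights are each scaled by `sqrt(tau / max_logit)`, so their product (the logit) shrinks back to about `tau`. The function name, the list-of-lists matrices, and the example values are hypothetical; the actual method works per head and treats some components separately.

```python
import math
from typing import List

Matrix = List[List[float]]


def qk_clip(w_q: Matrix, w_k: Matrix, max_logit: float, tau: float) -> float:
    """Rescale w_q and w_k in place if max_logit exceeds tau; return gamma.

    gamma is the factor by which attention logits shrink. It is 1.0 when no
    clipping is needed, which matches the 'basically inactive' late-training case.
    """
    if max_logit <= tau:
        return 1.0
    gamma = tau / max_logit
    scale = math.sqrt(gamma)  # split evenly between the query and key weights
    for weights in (w_q, w_k):
        for row in weights:
            for j in range(len(row)):
                row[j] *= scale
    return gamma


# Hypothetical one-head example: the observed max logit is 4x the threshold.
w_q = [[2.0, 0.0]]
w_k = [[2.0, 0.0]]
gamma = qk_clip(w_q, w_k, max_logit=400.0, tau=100.0)
print(gamma, w_q, w_k)
```

Because logits are bilinear in the query and key projections, scaling each by `sqrt(gamma)` scales every logit for that head by `gamma`.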

NXz64Fdf8Y

authored a paper 6 months ago

FineWeb2: One Pipeline to Scale Them All -- Adapting Pre-Training Data Processing to Every Language

Paper • 2506.20920 • Published Jun 26 • 75

Team members 18
