
Stefan Schweter PRO

stefan-it

AI & ML interests

Flair Library 💕, NER & PoS Tagging, LM Pretraining (mostly encoder-only & encoder-decoder), Historical Language Models, German Language Models, Bavarian NLP 🥨

Recent Activity

liked a dataset about 1 hour ago
HuggingFaceFW/finepdfs-edu
reacted to codelion's post with 🔥 about 12 hours ago
Want to experiment with pre-training dataset mixtures but don't want to process terabytes of data? We've got you covered. We're releasing a collection of several carefully curated 1B token dataset samples specifically designed for rapid prototyping and pretraining experiments: https://huggingface.co/collections/codelion/pre-training-dataset-samples

These samples were created using reservoir sampling - an algorithm that guarantees statistically unbiased random samples from massive source datasets. This means results you get at the 1B token scale are representative of how these datasets behave at 100B+ token scales, letting you iterate quickly without the computational overhead.

The collection includes:
- finePDFs-1B: High-quality textbook-style educational content
- DCLM-baseline-1B: Filtered, diverse web content
- FineWeb-Edu-1B: Curated educational web resources

We used these exact samples to run 50+ systematic experiments on dataset mixing strategies, ultimately discovering that a 50-30-20 mixture of finePDFs + DCLM-baseline + FineWeb-Edu achieves 90%+ of GPT-2's performance with just 1/10th the training data.

Whether you're researching optimal data mixtures, testing curriculum learning strategies, or just want to quickly prototype a pretraining run, these samples give you a solid foundation to start experimenting immediately.

Read the full story of how we used these datasets to find the optimal pretraining recipe: https://huggingface.co/blog/codelion/optimal-dataset-mixing
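For readers unfamiliar with the reservoir sampling mentioned in the post, here is a minimal sketch of the classic single-pass variant (often called Algorithm R). It is not the code used to build the released samples: the function name `reservoir_sample` and its parameters are hypothetical, and it draws a fixed number of items, whereas the post describes samples sized by token count.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(stream: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Draw k items uniformly at random from a stream of unknown length.

    Hypothetical helper for illustration only; not taken from the post.
    """
    rng = random.Random(seed)
    reservoir: List[T] = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Pick a slot in [0, i]; the new item is kept with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = item
    return reservoir


# Example: sample 5 "documents" from a stream of 1,000,000 without storing the stream.
docs = (f"doc-{n}" for n in range(1_000_000))
print(reservoir_sample(docs, k=5, seed=42))
```

Because each item ends up in the reservoir with equal probability, the sample is unbiased regardless of stream length, and memory use stays proportional to `k` rather than to the size of the source dataset.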

Organizations

- Bayerische Staatsbibliothek
- flair
- Flax Community
- dumitrescustefan-org
- GermanT5
- BigScience: LMs for Historical Texts
- Universal NER
- BigLAM: BigScience Libraries, Archives and Museums
- Libre Euro Lingua-Alliance
- Lang UK
- BabyLM Challenge
- hmByT5 Preliminary
- hmByT5
- Blog-explorers
- German Wikipedia LMs
- hmBERT
- hmTEAMS
- HIPE
- hmBERT Tiny
- hmBERT 64k
- LSV @ Saarland University
- GERMATRON
- PleIAs
- German LLM Tokenizers
- Occiglot
- Social Post Explorers
- GERTuraX
- Stefmal
- Hugging Face Discord Community
- Project German LLM
- ENGEBA
- Nerdy Face
- TensorFlow Model Garden LMs
- Hugging Face MCP Course
- Bavarian NLP
- Baivaria
- SindBERT
- German Tokenizer Benchmark