Skywork-SWE: Unveiling Data Scaling Laws for Software Engineering in LLMs Paper • 2506.19290 • Published Jun 24 • 52
Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents Paper • 2507.04009 • Published Jul 5 • 49
RefineX: Learning to Refine Pre-training Data at Scale from Expert-Guided Programs Paper • 2507.03253 • Published Jul 4 • 18
Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning Paper • 2507.16746 • Published Jul 22 • 34
BeyondWeb: Lessons from Scaling Synthetic Data for Trillion-scale Pretraining Paper • 2508.10975 • Published Aug 14 • 59
TiKMiX: Take Data Influence into Dynamic Mixture for Language Model Pre-training Paper • 2508.17677 • Published Aug 25 • 14
PersonaX: Multimodal Datasets with LLM-Inferred Behavior Traits Paper • 2509.11362 • Published Sep 14 • 4
OpenGPT-4o-Image: A Comprehensive Dataset for Advanced Image Generation and Editing Paper • 2509.24900 • Published 29 days ago • 53