Dataset and models for transforming LFM2 2.6B into a Tic Tac Toe master using RL Environments. Free course: https://t.ly/4jIFq
Stefano Fiorucci PRO
anakin87
AI & ML interests
Language Models: orchestration, post-training, GRPO, synthetic data...
Contributing to Haystack LLM framework 🏗️
Recent Activity
reacted to their post with 🤗 about 18 hours ago
Your RL environment is an SFT data factory 🏭
In LLM post-training it's common to do Supervised Fine-Tuning warm-up before Reinforcement Learning.
When teaching a new task, RL needs some initial signal to amplify, and SFT builds that basis, for example by teaching the expected output format.
If you've built an RL env, generating SFT synthetic data is basically free.
An env already has: task data, rollout logic, rewards.
1️⃣ pick a strong model
2️⃣ run it through the env
3️⃣ filter rollouts by reward
works out of the box with Verifiers (Prime Intellect) and Atropos (Nous Research)
🧑💻 Example: https://github.com/anakin87/llm-rl-environments-lil-course/blob/main/chapters/05.md
posted an update about 19 hours ago
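The three steps above can be sketched end-to-end. This is a minimal illustration of the recipe, not the actual Verifiers or Atropos API: the task list, `strong_model`, and `reward` function here are hypothetical stand-ins.

```python
# Sketch: turn an RL environment into an SFT data factory.
# All names below (TASKS, strong_model, reward) are illustrative assumptions.

# 1️⃣ the env already has task data and a reward function
TASKS = [("2+2", "4"), ("3+3", "6")]  # (prompt, reference) pairs

def reward(reference: str, completion: str) -> float:
    """Binary reward: 1.0 if the completion matches the reference."""
    return 1.0 if completion.strip() == reference else 0.0

def strong_model(prompt: str) -> str:
    """Stand-in for a strong model; gets the first task right, the second wrong."""
    answers = {"2+2": "4", "3+3": "7"}
    return answers[prompt]

# 2️⃣ run the model through the env, 3️⃣ filter rollouts by reward
def build_sft_dataset(tasks, model, rollouts_per_task=2, min_reward=1.0):
    data = []
    for prompt, reference in tasks:
        for _ in range(rollouts_per_task):
            completion = model(prompt)
            if reward(reference, completion) >= min_reward:
                data.append({"prompt": prompt, "completion": completion})
    return data

sft_data = build_sft_dataset(TASKS, strong_model)
# Only rollouts that passed the reward filter survive as SFT examples.
```

The surviving `{"prompt", "completion"}` pairs are exactly the format SFT trainers expect, so the filtered rollouts can be fed straight into fine-tuning.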
updated a dataset 1 day ago
anakin87/tictactoe-demo