Dataset and models for transforming LFM2 2.6B into a Tic Tac Toe master using RL Environments. Free course: https://t.ly/4jIFq
Stefano Fiorucci PRO
anakin87
AI & ML interests
Language Models: orchestration, post-training, GRPO, synthetic data...
Contributing to Haystack LLM framework 🏗️
Recent Activity
reacted to their post with 🤗 about 18 hours ago
Your RL environment is an SFT data factory 🏭
In LLM post-training it's common to do Supervised Fine-Tuning warm-up before Reinforcement Learning.
When teaching a new task, RL needs some initial signal to amplify, and SFT builds that basis, for example by teaching the expected output format.
If you've built an RL env, generating SFT synthetic data is basically free.
An env already has: task data, rollout logic, rewards.
1️⃣ pick a strong model
2️⃣ run it through the env
3️⃣ filter rollouts by reward
works out of the box with Verifiers (Prime Intellect) and Atropos (Nous Research)
🧑💻 Example: https://github.com/anakin87/llm-rl-environments-lil-course/blob/main/chapters/05.md
posted an update about 19 hours ago
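The three steps above can be sketched end-to-end. This is a minimal illustration of the recipe, not the actual Verifiers or Atropos API: the task list, `strong_model`, and `reward` function here are hypothetical stand-ins.

```python
# Sketch: turn an RL environment into an SFT data factory.
# All names below (TASKS, strong_model, reward) are illustrative assumptions.

# 1️⃣ the env already has task data and a reward function
TASKS = [("2+2", "4"), ("3+3", "6")]  # (prompt, reference) pairs

def reward(reference: str, completion: str) -> float:
    """Binary reward: 1.0 if the completion matches the reference."""
    return 1.0 if completion.strip() == reference else 0.0

def strong_model(prompt: str) -> str:
    """Stand-in for a strong model; gets the first task right, the second wrong."""
    answers = {"2+2": "4", "3+3": "7"}
    return answers[prompt]

# 2️⃣ run the model through the env, 3️⃣ filter rollouts by reward
def build_sft_dataset(tasks, model, rollouts_per_task=2, min_reward=1.0):
    data = []
    for prompt, reference in tasks:
        for _ in range(rollouts_per_task):
            completion = model(prompt)
            if reward(reference, completion) >= min_reward:
                data.append({"prompt": prompt, "completion": completion})
    return data

sft_data = build_sft_dataset(TASKS, strong_model)
# Only rollouts that passed the reward filter survive as SFT examples.
```

The surviving `{"prompt", "completion"}` pairs are exactly the format SFT trainers expect, so the filtered rollouts can be fed straight into fine-tuning.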
updated a dataset 1 day ago
anakin87/tictactoe-demo