arxiv:2510.26491

Data-Efficient RLVR via Off-Policy Influence Guidance

Published on Oct 30
Submitted by Erle Zhu on Nov 4

Abstract

Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic-based, lacking theoretical guarantees and generalizability. This work proposes a theoretically grounded approach using influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence using pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop Curriculum RL with Off-Policy Influence guidance (CROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training. On a 1.5B model, it achieves a 2.66x step-level acceleration while using only 10% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.
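
The abstract describes the estimator only at a high level. As a hedged illustration of what influence-guided scoring from offline trajectories can look like, the sketch below uses a first-order, TracIn-style inner product between an importance-weighted policy-gradient estimate (computed from a pre-collected trajectory) and the gradient of the current training objective; the function names, the REINFORCE-style form, and the omission of the inverse-Hessian term are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def offpolicy_policy_grad(logp_grad, logp_current, logp_behavior, reward):
    """Importance-weighted policy-gradient estimate for one prompt, built from a
    pre-collected (offline) trajectory rather than a fresh rollout.

    logp_grad     : (d,) gradient of log pi_theta(trajectory) at the current parameters
    logp_current  : log-probability of the trajectory under the current policy
    logp_behavior : log-probability under the behavior policy that collected it
    reward        : verifiable reward of the trajectory (e.g., 1 if the answer checks out)
    """
    importance_weight = np.exp(logp_current - logp_behavior)
    return importance_weight * reward * logp_grad

def influence_score(candidate_grad, objective_grad):
    """First-order influence proxy: alignment (inner product) between a candidate's
    gradient and the gradient of the current learning objective. A full
    influence-function treatment would also include an inverse-Hessian term,
    omitted here for simplicity."""
    return float(candidate_grad @ objective_grad)
```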

Community

Paper author and submitter

🚀 We introduce CROPI, a curriculum reinforcement learning framework for LLMs that brings theoretically grounded, rollout-free data selection to RLVR. CROPI is powered by Off-Policy Influence Estimation: an influence-function-based approach that approximates how each data point affects the current online policy using pre-collected trajectories, avoiding costly new rollouts. To scale to high-dimensional LLM gradients, we use Sparse Random Projection with a simple pre-projection dropout step that reduces numerical noise while preserving inner products and improving storage and compute efficiency.
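
As a concrete, hedged sketch of the compression step: an Achlioptas-style sparse random projection approximately preserves inner products, and the pre-projection dropout mentioned above can be mimicked by zeroing a random fraction of gradient coordinates before projecting. The sparsity level, dropout rate, and target dimension below are illustrative choices, not the paper's settings.

```python
import numpy as np

def sparse_projection_matrix(d, k, s=3, seed=0):
    """Achlioptas-style sparse random projection R (d x k): entries are
    +sqrt(s/k), 0, -sqrt(s/k) with probabilities 1/(2s), 1 - 1/s, 1/(2s).
    For LLM-scale d this would be generated in chunks or via hashing rather
    than materialized; a dense array is used here only for clarity."""
    rng = np.random.default_rng(seed)
    entries = rng.choice([-1.0, 0.0, 1.0], size=(d, k),
                         p=[1.0 / (2 * s), 1.0 - 1.0 / s, 1.0 / (2 * s)])
    return entries * np.sqrt(s / k)

def project_gradient(grad, R, dropout_rate=0.1, rng=None):
    """Compress one flattened gradient: zero a random fraction of coordinates
    (pre-projection dropout, rescaled to keep inner products unbiased), then
    project down to k dimensions with the shared sparse matrix R."""
    rng = rng or np.random.default_rng()
    keep = rng.random(grad.shape[0]) >= dropout_rate
    thinned = np.where(keep, grad, 0.0) / (1.0 - dropout_rate)
    return thinned @ R  # (k,) sketch; inner products are roughly preserved (JL lemma)
```

Because the projection roughly preserves inner products, influence scores can be computed between k-dimensional sketches instead of full LLM-sized gradients, which is what makes storing one vector per candidate prompt practical.
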
CROPI splits RL training into stages and, at each stage, selects the subset of data with the highest estimated influence on the current checkpoint. Across models from 1.5B to 7B parameters and varying context lengths, CROPI outperforms full-dataset training and heuristic baselines. On a 1.5B model, it delivers a 2.66× step-level acceleration while training on only 10% of the data per stage—demonstrating the practical gains of influence-based data selection for online RLVR.
Key contributions:

  • Off-Policy Influence Estimation: a theoretically grounded, fine-grained measure of data impact on the online policy that eliminates real-time sampling.
  • Scalable Gradient Handling: Sparse Random Projection with dropout to compress LLM gradients, mitigate numerical noise, and retain inner-product structure.
  • CROPI: multi-stage data selection that consistently improves efficiency and performance over full-dataset training and heuristic methods.
If you're building reasoning-focused LLMs with RLVR, CROPI provides a principled way to prioritize the most impactful data, cutting training cost while improving outcomes. A high-level sketch of the stage loop is given below.
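
Putting the pieces together, the stage loop could be organized along the following lines. This is a high-level sketch that reuses the helpers sketched above; `collect_offline_trajectories`, `offline.stats`, `rl_train`, the `policy` methods, the projection dimension, and the 10% budget are hypothetical placeholders, not the paper's actual components or settings.

```python
def cropi_training(policy, dataset, num_stages=4, select_frac=0.10, proj_dim=4096):
    """Curriculum RL with Off-Policy Influence guidance (high-level sketch).

    Each stage scores every candidate prompt by its estimated influence on the
    current checkpoint, keeps the top `select_frac`, and runs RL on that subset.
    All helpers other than those sketched above are hypothetical placeholders."""
    R = sparse_projection_matrix(d=policy.num_params(), k=proj_dim)
    offline = collect_offline_trajectories(dataset)  # gathered once; no fresh rollouts for scoring

    for _ in range(num_stages):
        # Direction of the current checkpoint's objective gradient, compressed to proj_dim.
        objective_sketch = project_gradient(policy.objective_gradient(), R)

        scores = []
        for prompt in dataset:
            grad = offpolicy_policy_grad(*offline.stats(prompt, policy))
            scores.append(influence_score(project_gradient(grad, R), objective_sketch))

        # Keep only the most influential fraction of the data for this stage.
        budget = max(1, int(select_frac * len(dataset)))
        ranked = sorted(range(len(dataset)), key=lambda i: scores[i], reverse=True)
        selected = [dataset[i] for i in ranked[:budget]]

        policy = rl_train(policy, selected)  # ordinary RLVR update on the selected subset
    return policy
```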
