arXiv (CS.CL) 2026-06-17 12:00 DOI: arXiv:2606.17250

Rethinking Groups in Critic-Free RLVR

Abstract

Reinforcement learning (RL) has become a central paradigm for post-training large language models. Existing critic-free RL methods typically generate a group of rollouts for the same question to estimate value baselines for advantage computation. However, this design suffers from data inefficiency, group synchronization barriers, and inflexibility with structured rollouts. In this work, we revisit the role of the "group" and show that its underlying function is not merely to estimate baselines but to prevent false penalties on negative samples. Building on this insight, we propose negative token filtering, a simple and effective strategy that enables stable single-rollout training. We apply it to two batch-level advantage methods, achieving comparable performance on reasoning tasks and stronger performance on agentic tasks relative to group-based RL techniques.
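To make the contrast in the abstract concrete, the following is a minimal sketch of the two baseline styles it mentions: a group baseline computed over several rollouts of the same question, and a batch-level baseline computed over single rollouts of different questions. The function names, the mean-reward baseline, and the toy 0/1 rewards are illustrative assumptions, not the paper's implementation, and the sketch does not attempt to reproduce negative token filtering, which the abstract does not specify.

```python
from statistics import mean


def group_baseline_advantages(rewards_by_question):
    """Group baseline (assumed form): reward minus the mean reward of
    all rollouts sampled for the same question."""
    advantages = {}
    for question, rewards in rewards_by_question.items():
        baseline = mean(rewards)
        advantages[question] = [r - baseline for r in rewards]
    return advantages


def batch_baseline_advantages(rewards):
    """Batch-level baseline (assumed form): reward minus the mean reward
    over a batch holding one rollout per question."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


# Hypothetical verifiable rewards: 1.0 for a correct answer, 0.0 otherwise.
group_adv = group_baseline_advantages({
    "q1": [1.0, 0.0, 0.0, 1.0],  # four rollouts of the same question
    "q2": [0.0, 0.0, 0.0, 1.0],
})

# Single-rollout setting: four different questions, one rollout each.
batch_adv = batch_baseline_advantages([1.0, 0.0, 0.0, 1.0])
```

In the group setting every question needs all of its rollouts to finish before any advantage can be computed, which is one way to read the synchronization barrier the abstract describes. In the batch setting a failed rollout on a hard question receives the same negative advantage as a failed rollout on an easy one; this may be related to the "false penalties on negative samples" the abstract refers to, though that reading is an interpretation rather than something the abstract states.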
