arXiv (CS.CL) 2026-06-11 12:00 DOI: arXiv:2602.09591

On the Optimal Reasoning Length for RL-Trained Language Models

Abstract

Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs and increase computational cost. Although length-control methods have been proposed, the length-accuracy relationship they induce remains unclear. We train policies with several length-control methods on multiple base models in a controlled setup and find that, across both mathematical reasoning and code generation, accuracy is non-monotonic in output length, peaking at an intermediate value. Mode accuracy, however, continues to improve with length even in settings where sample accuracy plateaus or declines, indicating that the non-monotonic length-accuracy relationship is driven by dispersion around an increasingly correct center.
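The distinction between sample accuracy and mode accuracy can be illustrated with a minimal sketch. The abstract does not define either metric, so the definitions here are assumptions: sample accuracy is taken as the fraction of all sampled answers that are correct, and mode accuracy as the fraction of problems whose most frequent sampled answer is correct. The function names and the toy data are hypothetical and are not results from the paper.

```python
from collections import Counter


def sample_accuracy(samples, gold):
    """Fraction of individual sampled answers that match the reference (assumed definition)."""
    total = sum(len(answers) for answers in samples)
    correct = sum(a == g for answers, g in zip(samples, gold) for a in answers)
    return correct / total


def mode_accuracy(samples, gold):
    """Fraction of problems whose most frequent answer matches the reference (assumed definition)."""
    hits = 0
    for answers, g in zip(samples, gold):
        # most_common(1) returns the single most frequent answer and its count
        mode_answer, _ = Counter(answers).most_common(1)[0]
        hits += mode_answer == g
    return hits / len(gold)


# Toy data (invented for illustration): two problems, four sampled answers each.
toy_samples = [["4", "4", "5", "7"], ["9", "9", "9", "2"]]
toy_gold = ["4", "9"]

toy_sample_acc = sample_accuracy(toy_samples, toy_gold)
toy_mode_acc = mode_accuracy(toy_samples, toy_gold)
print(toy_sample_acc, toy_mode_acc)
```

In this toy case the most frequent answer is correct for both problems even though several individual samples are wrong, which is the kind of gap the abstract attributes to dispersion around an increasingly correct center.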
