arXiv (CS.CL) 2026-06-25 12:00 DOI: arXiv:2606.25008

Neural Scaling Universality: If Exponents Are Fixed, Time to Understand Coefficients

Abstract

Neural scaling laws describe how pre-training loss decays as power laws with training time, model size, and compute. This position paper argues that the exponents of these power laws are fixed by generic mechanisms: a one-third time scaling due to the strong nonlinearity of Softmax, an inverse width scaling due to representational superposition, and an inverse depth scaling due to ensemble averaging of Transformer layers. These mechanisms are robust to a wide range of data structures and architectural details, placing current large language models in a universality class with fixed exponents. The coefficients, however, are expected to be sensitive to data and architecture details, and directly determine practical quantities such as the optimal model shape and the compute-optimal frontier. We therefore argue that understanding the coefficients is the key to near-term performance improvements, and that a closer examination of the current universality class may reveal pathways to better universality classes.
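To make concrete the claim that coefficients, not exponents, set the optimal model shape, here is a minimal sketch under assumptions that are not taken from the abstract. It assumes the width and depth terms add, so the shape-dependent excess loss is b / width + c / depth, and it uses width^2 * depth as a rough proxy for the parameter budget. The names `excess_loss`, `optimal_shape`, `b`, `c`, and `budget` are hypothetical, and width and depth are treated as continuous.

```python
def excess_loss(width: float, depth: float, b: float, c: float) -> float:
    """Assumed additive form: inverse-width term plus inverse-depth term."""
    return b / width + c / depth


def optimal_shape(budget: float, b: float, c: float) -> tuple[float, float]:
    """Minimize b/width + c/depth subject to width**2 * depth == budget.

    Substituting depth = budget / width**2 and setting the derivative to zero
    gives width**3 = budget * b / (2 * c), hence depth / width = 2 * c / b.
    """
    width = (budget * b / (2.0 * c)) ** (1.0 / 3.0)
    depth = 2.0 * c * width / b
    return width, depth


if __name__ == "__main__":
    # Illustrative coefficient values only; they are not fitted to any data.
    for b, c in [(1.0, 1.0), (4.0, 1.0)]:
        w, d = optimal_shape(1e9, b, c)
        print(f"b={b}, c={c}: width={w:.1f}, depth={d:.1f}, depth/width={d / w:.3f}")
```

Under these assumptions, both width and depth grow as the cube root of the budget regardless of the coefficients, while the depth-to-width ratio 2c/b depends only on them. That is one way to read the abstract's point that fixed exponents leave the practical design choices to the coefficients.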
