arXiv (CS.CL) 2026-06-17 12:00 DOI: arXiv:2606.18246

Variable-Width Transformers

Abstract

Scaling model size, specifically depth and width, has driven significant progress in transformer-based language models. However, most architectures maintain a constant width across all layers, allocating a fixed parameter and computation budget evenly despite different layers potentially playing distinct computational roles. In this work, we empirically investigate nonuniform capacity allocation across network depth by proposing a $\times$-shaped >
