arXiv (CS.CL) 2026-06-17 12:00 DOI: arXiv:2502.08363

Top-Theta Attention: Sparsifying Transformers by Compensated Thresholding

Abstract

We present Top-Theta (Top-$\theta$) Attention, a training-free method for sparsifying transformer attention during inference. Our key insight is that static, per-head thresholds can be calibrated to retain the desired constant number of significant elements per attention row. This approach enables content-based sparsity without retraining, and it remains robust across data domains. We further introduce compensation techniques to preserve accuracy under aggressive sparsification, establishing attention thresholding as a practical and principled alternative to top-k attention. We provide extensive evaluation on natural language processing tasks, showing that Top-$\theta$ achieves 3-10x reduction in V-cache usage and up to 10x fewer attention elements during inference while degrading no more than 1% in accuracy.
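The thresholding idea described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `calibrate_threshold` and `threshold_attention`, the calibration rule (averaging each row's k-th largest pre-softmax score), and the fallback that always keeps each row's maximum are all assumptions made for illustration. The compensation techniques mentioned in the abstract are not modeled here.

```python
import numpy as np

def calibrate_threshold(score_rows, k):
    """Pick one static threshold for a head so that roughly k scores per row survive.

    score_rows: 2D array (num_rows, row_len) of pre-softmax attention scores
    collected from calibration data for a single head (hypothetical setup).
    """
    # The k-th largest value in each row is the cutoff top-k would use for that row.
    kth_largest = np.partition(score_rows, -k, axis=1)[:, -k]
    # Assumed calibration rule: average the per-row cutoffs into one static value.
    return float(kth_largest.mean())

def threshold_attention(scores, values, theta):
    """Attention for one head where only scores >= theta take part in the softmax."""
    keep = scores >= theta
    # Assumption: always keep each row's maximum so that no row ends up empty.
    keep[np.arange(scores.shape[0]), scores.argmax(axis=1)] = True
    masked = np.where(keep, scores, -np.inf)
    # Numerically stable softmax over the surviving elements only.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ values, keep

# Usage with synthetic data (random scores stand in for real Q·K^T products).
rng = np.random.default_rng(0)
calib_scores = rng.standard_normal((256, 128))
theta = calibrate_threshold(calib_scores, k=8)

scores = rng.standard_normal((4, 128))
values = rng.standard_normal((128, 16))
output, keep = threshold_attention(scores, values, theta)
print("kept per row:", keep.sum(axis=1))
```

Unlike top-k, the comparison against `theta` needs no per-row sorting or selection at inference time, and the number of surviving elements varies from row to row around the calibrated target. The dense matrix product above does not itself save memory traffic; the `keep` mask only indicates which value rows an optimized kernel would need to load.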
