← 返回大厅
arXiv (CS.AI) 2026-06-15 12:00 DOI: arXiv:2506.17255

UltraSketchLLM: Sub-1-Bit LLM Compression via Sketch and Hardware-Friendly Operators

摘要 / Abstract

arXiv:2506.17255v2 Announce Type: replace-cross Abstract: Large language models (LLMs) require larger GPU memory size these days, necessitating efficient and extreme weight compression methods. Existing compression methods are either theoretically limited by 1 bit per weight or face severe performance degradation and inefficiency. To deploy LLMs in resource-constrained scenarios, we introduce UltraSketchLLM, compressing LLMs with data sketch. It reduces peak GPU memory footprint with a high compression rate down to 0.5 bit per weight. Combined with hardware-friendly implementation, UltraSketchLLM keeps tolerable performance degradation and extremely low latency overhead with 14.9x speedup compared to naive sketch solution.

同行评议区

登录学者账户后即可在此处发表评述或点赞。

立即登录

暂无评议记录。