arXiv (CS.AI) 2026-06-15 12:00 DOI: arXiv:2506.17255

UltraSketchLLM: Sub-1-Bit LLM Compression via Sketch and Hardware-Friendly Operators

Abstract

arXiv:2506.17255v2

Large language models (LLMs) require ever more GPU memory, necessitating efficient and extreme weight compression methods. Existing compression methods are either theoretically limited to 1 bit per weight or suffer from severe performance degradation and inefficiency. To deploy LLMs in resource-constrained scenarios, we introduce UltraSketchLLM, which compresses LLMs using data sketches. It reduces the peak GPU memory footprint, with a compression rate as low as 0.5 bit per weight. Combined with a hardware-friendly implementation, UltraSketchLLM keeps performance degradation tolerable and latency overhead extremely low, with a 14.9x speedup over a naive sketch solution.
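To make the sub-1-bit figure concrete, the sketch below shows how hashing many weights into fewer shared buckets can bring the average cost below 1 bit per weight. It is an illustration under stated assumptions, not the paper's method: the abstract does not say which sketch UltraSketchLLM uses, so a generic single-row Count Sketch stands in, and the names `bits_per_weight`, `footprint_bytes`, and `CountSketch`, the 2-bit bucket width, and the 7-billion-parameter model size are all hypothetical.

```python
import random


def bits_per_weight(num_weights, num_buckets, bits_per_bucket):
    """Average storage cost per original weight, in bits."""
    return num_buckets * bits_per_bucket / num_weights


def footprint_bytes(num_weights, bits_per_w):
    """Total weight storage in bytes at a given bits-per-weight rate."""
    return num_weights * bits_per_w / 8


class CountSketch:
    """Generic single-row Count Sketch over a flat weight vector.

    Illustrative stand-in only; not the sketch described in the paper.
    """

    def __init__(self, num_buckets, seed=0):
        self.num_buckets = num_buckets
        self.seed = seed
        self.buckets = [0.0] * num_buckets

    def _bucket_and_sign(self, index):
        # A seeded per-index RNG stands in for a pair of hash functions.
        rng = random.Random(self.seed * 1_000_003 + index)
        return rng.randrange(self.num_buckets), rng.choice((-1.0, 1.0))

    def insert(self, index, value):
        bucket, sign = self._bucket_and_sign(index)
        self.buckets[bucket] += sign * value

    def query(self, index):
        # Exact only if no other nonzero weight shares the bucket.
        bucket, sign = self._bucket_and_sign(index)
        return sign * self.buckets[bucket]


# 16 weights share 4 buckets; if each bucket were stored in 2 bits
# (an assumption), the average cost is 0.5 bit per weight.
rate = bits_per_weight(num_weights=16, num_buckets=4, bits_per_bucket=2)

# Hypothetical 7-billion-parameter model stored at that rate.
size = footprint_bytes(7_000_000_000, rate)

# A single nonzero weight is recovered exactly; collisions add error.
sketch = CountSketch(num_buckets=4)
sketch.insert(3, 2.5)
recovered = sketch.query(3)
```

The trade-off is that weights landing in the same bucket interfere with one another, so reconstruction error grows as the bucket count shrinks. This example keeps buckets as Python floats for readability; a real deployment would need quantized buckets to reach the stated rate.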
