arXiv (CS.AI)
2026-06-15 12:00
arXiv:2506.17255
UltraSketchLLM: Sub-1-Bit LLM Compression via Sketch and Hardware-Friendly Operators
arXiv:2506.17255v2 Announce Type: replace-cross
Abstract: Large language models (LLMs) require increasingly large amounts of GPU memory, which calls for efficient, extreme weight compression. Existing compression methods are either theoretically bounded at 1 bit per weight or suffer from severe performance degradation and inefficiency. To deploy LLMs in resource-constrained scenarios, we introduce UltraSketchLLM, which compresses LLMs using data sketches. It reduces the peak GPU memory footprint, reaching compression rates as low as 0.5 bit per weight. Combined with a hardware-friendly implementation, UltraSketchLLM keeps performance degradation tolerable and latency overhead extremely low, achieving a 14.9x speedup over a naive sketch implementation.
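
To put the bits-per-weight figures in perspective, the arithmetic can be sketched as below. This is only a back-of-the-envelope illustration, not the paper's method: the 7-billion-parameter model size and the helper name `weight_memory_gib` are assumptions chosen for the example, and the calculation counts weight storage alone, ignoring activations, the KV cache, and any metadata a sketch-based scheme might need.

```python
def weight_memory_gib(num_params: int, bits_per_weight: float) -> float:
    """Return the memory needed to store the weights alone, in GiB."""
    total_bytes = num_params * bits_per_weight / 8  # 8 bits per byte
    return total_bytes / 2**30                      # bytes -> GiB


# Hypothetical model size for illustration; not taken from the paper.
NUM_PARAMS = 7_000_000_000

fp16 = weight_memory_gib(NUM_PARAMS, 16)      # uncompressed half precision
one_bit = weight_memory_gib(NUM_PARAMS, 1.0)  # the 1-bit-per-weight limit
half_bit = weight_memory_gib(NUM_PARAMS, 0.5) # the 0.5-bit rate in the abstract

for label, gib in [("fp16", fp16), ("1 bit", one_bit), ("0.5 bit", half_bit)]:
    print(f"{label:>8}: {gib:.2f} GiB")

# Going from 16 bits to 0.5 bit per weight is a 32x reduction in weight storage.
print(f"fp16 / 0.5-bit ratio: {fp16 / half_bit:.0f}x")
```

Under these assumptions, halving the rate from 1 bit to 0.5 bit per weight halves the weight storage again, which is why breaking the 1-bit barrier matters for memory-constrained deployment.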