arXiv (CS.AI) 2026-06-16 12:00 DOI: arXiv:2603.18897

Parallelizing Tool Execution and LLM Generation for Low-Latency Agent Serving

Abstract

arXiv:2603.18897v2 (announce type: replace-cross)

LLM-powered agents execute tasks through a sequential loop of model generation and tool execution. Today's serving systems serialize this loop, leaving tool latency exposed on the task's critical path. This paper presents PASTE, a tool-aware agent-serving system that predicts concrete future tool invocations from recurring agent patterns and executes them speculatively while the LLM is still generating. PASTE isolates speculative results until the LLM confirms them, and it jointly schedules tool execution and returning LLM sessions to avoid shifting the bottleneck to the GPU. Across deep research, coding, and scientific-agent workloads, PASTE reduces average task completion time by 43.5% and lowers observed tool latency by 1.8x.
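The core idea in the abstract, overlapping a predicted tool call with model generation and using its result only once the model confirms it, can be illustrated with a minimal sketch. This is not PASTE's implementation: `llm_generate`, `run_tool`, and `step` are hypothetical stand-ins, the prediction is passed in rather than learned from agent patterns, and "isolation" here only means that an unconfirmed result is discarded (a real system would also need to contain tool side effects and schedule GPU work).

```python
import asyncio
from typing import Tuple

ToolCall = Tuple[str, str]  # (tool name, argument); hypothetical shape


async def llm_generate(context: str) -> ToolCall:
    """Stand-in for LLM generation: eventually emits the real tool call."""
    await asyncio.sleep(0.05)  # simulated decoding time
    return ("search", context)


async def run_tool(call: ToolCall) -> str:
    """Stand-in for a tool with nontrivial latency and no side effects."""
    name, arg = call
    await asyncio.sleep(0.05)  # simulated tool latency
    return f"{name}:{arg}"


async def step(context: str, predicted: ToolCall) -> Tuple[str, bool]:
    """Run one agent step with speculative tool execution.

    Returns (tool result, whether the speculation was confirmed).
    """
    # Start the predicted tool call while the model is still generating.
    speculative = asyncio.create_task(run_tool(predicted))
    actual = await llm_generate(context)

    if actual == predicted:
        # Confirmed: the tool latency overlapped with generation.
        return await speculative, True

    # Mispredicted: drop the speculative work and run the real call.
    speculative.cancel()
    return await run_tool(actual), False


hit_result, hit = asyncio.run(step("agents", ("search", "agents")))
miss_result, miss = asyncio.run(step("agents", ("read", "notes.txt")))
```

In the confirmed case the step costs roughly the longer of generation and tool time instead of their sum. In the mispredicted case the speculative result never reaches the caller, and the step falls back to the serialized path.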
