arXiv (CS.CL) 2026-06-17 12:00 DOI: arXiv:2606.17519

Scaling Enterprise Agent Routing: Degradation, Diagnosis, and Recovery

Abstract

Production LLM assistants route user requests to growing libraries of specialized tools, but how does routing accuracy degrade as the catalog scales? We study single-step routing on a 110-agent, 584-tool catalog from a deployed enterprise productivity assistant, evaluating three frontier models as the catalog grows from 10 to 110 agents. Routing F1 on under-specified requests drops 16–23 percentage points (pp) across models. An oracle analysis decomposes the degradation into a retrieval gap (the model cannot surface the right tool) and a confusion gap (even with perfect retrieval, the oracle ceiling drops 10pp). Embedding-based shortlisting recovers +10–11pp F1 at full scale across all three models and two providers. A production annotation study (1,435 human-labeled utterances, three annotators) confirms the recovery on real traffic at +10–17pp, even though absolute performance there is 10–15pp lower.
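The abstract does not describe how the embedding-based shortlisting is implemented, so the following is only a minimal sketch of the general idea: score every tool description against the request, keep the top k, and hand only those candidates to the routing model. The `embed` function is a toy bag-of-words stand-in for a real embedding model, and the catalog entries, tool names, and the `shortlist` helper are all hypothetical illustrations, not the paper's method or data.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; an assumed stand-in for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[term] * b[term] for term in a if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def shortlist(request, catalog, k=2):
    """Return the names of the k tools whose descriptions best match the request."""
    query = embed(request)
    scored = [(cosine(query, embed(desc)), name) for name, desc in catalog.items()]
    # Highest similarity first; ties are broken alphabetically by tool name.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in scored[:k]]


# Hypothetical four-tool catalog (the studied catalog has 584 tools).
CATALOG = {
    "calendar.create_event": "create a calendar event or meeting invite",
    "mail.send_message": "send an email message to a recipient",
    "files.search": "search files and documents by keyword",
    "expenses.submit_report": "submit an expense report for approval",
}

candidates = shortlist("schedule a meeting with the design team", CATALOG, k=2)
print(candidates)
```

In a setup like this, the router would then choose among `candidates` rather than the full catalog, which is one plausible way a shortlist could narrow the retrieval gap the abstract describes. The value of k and the similarity measure are free choices in this sketch.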
