Paper Plaza - AcademicHub


arXiv (CS.CL) 2026-06-17 DOI: arXiv:2606.17799

Position: Coding Benchmarks Are Misaligned with Agentic Software Engineering

Authors:

Maria I. Gorinova, Macey Baker, Amy Heineike, Maksim Shaposhnikov, Rob Willoughby, Dru Knox

Coding agents have become a major mode of software engineering, but the benchmarks we use to compare them were designed in a pre-agent era: they collapse model, harness, and environment into a single end-to-end score, typically computed against one reference solution, with no component-level signal for iteration. We argue that current coding benchmarks are misaligned with agentic software engineering. A coding agent in practice is not a model: it is a system, a composite of models, harnesses, contexts, environments, and feedback signals, any one of which can move the benchmark score by margins comparable to those between adjacent model generations. We discuss three symptoms: (i) benchmark scores conflate the model with the rest of the system; (ii) grading against a single reference solution penalises equally valid alternatives; and (iii) the absence of signal at the level of individual system components makes the end-to-end system score difficult to iterate on.
