arXiv (CS.CL) 2026-06-25 12:00 DOI: arXiv:2606.25721

Tracing Target Answers in Poisoned Retrieval Corpora via Token Influence Attribution

Abstract

Retrieval-Augmented Generation (RAG) systems are vulnerable to corpus poisoning attacks that manipulate model outputs through malicious retrieved documents. Existing detection methods typically rely on auxiliary classifiers or additional LLM-based verification, introducing substantial computational overhead. We present TRACE, a lightweight detection framework that identifies poisoning attacks by tracing answer-related tokens through token influence attribution. TRACE first discovers recurrent high-influence keywords across retrieved documents and then performs a secondary verification to confirm their influence on model predictions. Experiments on three QA benchmarks and six LLMs demonstrate strong detection performance while simultaneously uncovering attacker-specified target answers.
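The two-stage idea described above (find keywords that recur with high influence across retrieved documents, then check that they actually drive the prediction) can be illustrated with a toy sketch. This is not the authors' implementation: the per-token influence scores, the `top_k` and `min_docs` thresholds, the masking-based check, and the `toy_predict` stand-in for a model are all assumptions made for illustration only.

```python
from collections import Counter
from typing import Callable, Dict, FrozenSet, List


def recurrent_keywords(
    doc_scores: List[Dict[str, float]], top_k: int = 2, min_docs: int = 2
) -> List[str]:
    """Return tokens that rank in the top_k by influence in at least min_docs documents.

    doc_scores holds one {token: influence} mapping per retrieved document.
    How the influence scores are computed is out of scope here (assumed given).
    """
    counts: Counter = Counter()
    for scores in doc_scores:
        top = sorted(scores, key=scores.get, reverse=True)[:top_k]
        counts.update(top)
    return sorted(tok for tok, n in counts.items() if n >= min_docs)


def confirm_keywords(
    candidates: List[str], predict: Callable[[FrozenSet[str]], str]
) -> List[str]:
    """Keep candidates whose masking changes the prediction (hypothetical check).

    predict(masked) is an assumed callable that returns the model's answer
    when the given tokens are masked out of the retrieved context.
    """
    baseline = predict(frozenset())
    return [tok for tok in candidates if predict(frozenset({tok})) != baseline]


# Toy data: two poisoned documents push "marlowe" as the answer.
doc_scores = [
    {"marlowe": 0.9, "hamlet": 0.5, "wrote": 0.1},
    {"marlowe": 0.8, "author": 0.3, "play": 0.1},
    {"shakespeare": 0.7, "hamlet": 0.6, "wrote": 0.2},
]


def toy_predict(masked: FrozenSet[str]) -> str:
    # Stand-in for an LLM: the poisoned answer wins unless "marlowe" is masked.
    return "shakespeare" if "marlowe" in masked else "marlowe"


candidates = recurrent_keywords(doc_scores)
confirmed = confirm_keywords(candidates, toy_predict)
print(candidates, confirmed)
```

In this toy setup, a token that recurs but does not change the prediction when masked (here `hamlet`) is filtered out by the second stage, while the token carrying the attacker's target answer survives.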
