arXiv (CS.CL) 2026-06-25 12:00 DOI: arXiv:2606.25721

Tracing Target Answers in Poisoned Retrieval Corpora via Token Influence Attribution

Abstract

Retrieval-Augmented Generation (RAG) systems are vulnerable to corpus poisoning attacks that manipulate model outputs through malicious retrieved documents. Existing detection methods typically rely on auxiliary classifiers or additional LLM-based verification, introducing substantial computational overhead. We present TRACE, a lightweight detection framework that identifies poisoning attacks by tracing answer-related tokens through token influence attribution. TRACE first discovers recurrent high-influence keywords across retrieved documents and then performs a secondary verification to confirm their influence on model predictions. Experiments on three QA benchmarks and six LLMs demonstrate strong detection performance while simultaneously uncovering attacker-specified target answers.
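
The two stages named in the abstract (discovering recurrent high-influence keywords, then verifying their influence) can be illustrated with a toy sketch. The abstract does not say how token influence is computed, so this sketch substitutes a simple leave-one-out ablation. Every name here (`token_influence`, `recurrent_keywords`, `verify`, `toy_score`, `top_k`, `min_docs`, `threshold`) is hypothetical and not taken from the paper, and `toy_score` is only a stand-in for a real LLM confidence score.

```python
from collections import Counter
from typing import Callable, Dict, List

# Assumption: a scoring function maps context tokens to the model's confidence
# in its current answer. A real system would query the LLM here.
ScoreFn = Callable[[List[str]], float]


def token_influence(tokens: List[str], score_fn: ScoreFn) -> Dict[str, float]:
    """Leave-one-out influence: score drop when all occurrences of a token are removed."""
    base = score_fn(tokens)
    return {
        tok: base - score_fn([t for t in tokens if t != tok])
        for tok in set(tokens)
    }


def recurrent_keywords(docs: List[List[str]], score_fn: ScoreFn,
                       top_k: int = 2, min_docs: int = 2) -> List[str]:
    """Stage 1 (assumed form): tokens among the top-k most influential in >= min_docs documents."""
    counts: Counter = Counter()
    for doc in docs:
        infl = token_influence(doc, score_fn)
        top = sorted(infl, key=lambda t: (-infl[t], t))[:top_k]
        counts.update(t for t in top if infl[t] > 0)
    return sorted(t for t, c in counts.items() if c >= min_docs)


def verify(docs: List[List[str]], candidates: List[str], score_fn: ScoreFn,
           threshold: float = 0.5) -> List[str]:
    """Stage 2 (assumed form): keep candidates whose removal from the full
    retrieved set lowers the score by at least `threshold`."""
    all_tokens = [t for doc in docs for t in doc]
    base = score_fn(all_tokens)
    return [
        c for c in candidates
        if base - score_fn([t for t in all_tokens if t != c]) >= threshold
    ]


# Toy data: two poisoned documents push the answer "berlin", one is benign.
docs = [
    ["capital", "france", "is", "berlin"],
    ["berlin", "is", "the", "capital"],
    ["paris", "is", "in", "france"],
]


def toy_score(tokens: List[str]) -> float:
    """Stand-in scorer: full confidence only when the planted answer is present."""
    return 1.0 if "berlin" in tokens else 0.0


candidates = recurrent_keywords(docs, toy_score)
confirmed = verify(docs, candidates, toy_score)
print(candidates, confirmed)
```

In this toy setting the token that survives both stages is the planted answer itself, which mirrors the abstract's claim that detection can also surface the attacker-specified target answer. The thresholds are arbitrary and would need tuning against a real model.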
