arXiv (cs.CL) 2026-06-15 12:00 DOI: arXiv:2606.14368

Be My Tutor: On-Policy Co-Distillation for Mutual LLM Improvement via Peer Feedback

Abstract

We study multi-domain LLM training in which two models, each stronger in a different domain, co-evolve by tutoring each other through on-policy feedback. Unlike one-way distillation or single-model fine-tuning, our goal is mutual Pareto improvement: each model improves across domains without losing its original strength. To this end, we propose On-Policy Co-Distillation (OPCoD), where each student's self-distillation is conditioned on its own correct rollout and feedback from its peer. To make feedback exchange effective, OPCoD uses cognizance-based gating to decide when to give feedback and feedback anchoring to ground feedback in the problem. On Science Q&A tasks, OPCoD consistently outperforms baselines and achieves Pareto improvement across all evaluated domain pairs and students.
