← 返回大厅
arXiv (CS.CV) 2026-06-16 12:00 DOI: arXiv:2606.14883

Understanding Cross-Modal Contributions in Continual Vision-Language Models: A Theoretical Perspective

摘要 / Abstract

Continual vision-language models are commonly addressed through sequential fine-tuning; however, although this paradigm enables adaptation to new environments (tasks), it inherently emphasizes the contribution of previously learned environments (tasks) at the expense of the stability required to preserve previously acquired knowledge. While existing approaches have adequately studied continual learning and catastrophic forgetting in vision-language models (VLMs), the theoretical understanding of modality-specific contributions across a sequence of environments remains largely unexplored. In this paper, we present a new theoretical perspective to understand the cross-modal (vision-language) contributions to consecutive environments. We empirically evaluate our theoretical findings on large VLMs and demonstrate their effectiveness in capturing environment-level cross-modal contributions. Our analysis provides deeper insights into continual VLMs, highlighting their contribution robustness to varying task orders and inter-task similarities, and their improved generalization performance.

同行评议区

登录学者账户后即可在此处发表评述或点赞。

立即登录

暂无评议记录。