arXiv (CS.CV) 2026-06-11 12:00 DOI: arXiv:2606.12106

MSUE: Multi-Modal Soccer Understanding Expert

Abstract

This paper presents our solution to the 2026 SoccerNet VQA Challenge. First, we develop a cost-effective data synthesis pipeline driven by a Vision-Language Model (VLM) that systematically restructures raw domain data into diverse VQA samples, including concise answers and long-form responses. Second, we propose MSUE, a multi-expert question answering architecture in which a Large Language Model (LLM) dynamically dispatches questions to text, image, and video experts. These experts are instantiated as a strong text baseline (Gemini3-Flash), a fine-tuned Qwen3-VL, and an external knowledge base, respectively, and work collaboratively to improve VQA performance. MSUE achieves an accuracy of 0.95 on the challenge benchmark, securing third place on the leaderboard.
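
The dispatch step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify how the LLM routes questions, so `keyword_router` is a hypothetical stand-in that routes on the attached file's extension, and the experts are stubs rather than Gemini3-Flash, Qwen3-VL, or a knowledge base. All names here (`Expert`, `keyword_router`, `make_stub`, `EXPERTS`, `dispatch`) are assumptions made for illustration.

```python
from typing import Callable, Dict, Optional

# Hypothetical expert signature: (question, optional media path) -> answer.
Expert = Callable[[str, Optional[str]], str]


def keyword_router(question: str, media_path: Optional[str]) -> str:
    """Stand-in for the LLM dispatcher (assumption): returns a modality label.

    A real system would prompt an LLM with the question; here the label is
    derived from the media file extension only.
    """
    if media_path is None:
        return "text"
    lowered = media_path.lower()
    if lowered.endswith((".mp4", ".mkv", ".avi")):
        return "video"
    if lowered.endswith((".jpg", ".jpeg", ".png")):
        return "image"
    return "text"


def make_stub(name: str) -> Expert:
    """Build a placeholder expert that tags the question with its own name."""
    def expert(question: str, media_path: Optional[str]) -> str:
        return f"[{name}] {question}"
    return expert


# Placeholder experts; in the paper these would be separate models or tools.
EXPERTS: Dict[str, Expert] = {m: make_stub(m) for m in ("text", "image", "video")}


def dispatch(
    question: str,
    media_path: Optional[str],
    router: Callable[[str, Optional[str]], str],
    experts: Dict[str, Expert],
) -> str:
    """Route the question to one expert, falling back to the text expert."""
    label = router(question, media_path)
    expert = experts.get(label, experts["text"])
    return expert(question, media_path)


if __name__ == "__main__":
    print(dispatch("Who committed the foul?", "clip.mp4", keyword_router, EXPERTS))
```

Keeping the router a plain callable means the rule-based stand-in could be swapped for an LLM call without touching `dispatch`, and the fallback to the text expert keeps the pipeline answering when the router returns an unexpected label.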
