arXiv (CS.CL) 2026-06-25 12:00 DOI: arXiv:2606.25444

Does Translation-Enhanced Speech Encoder Pre-training Affect Speech LLMs?

Abstract

Connecting a pre-trained speech encoder to a Large Language Model (LLM) is the standard architecture for building Speech LLMs. However, a structural misalignment exists between the encoder and the LLM: encoders based on automatic speech recognition (ASR) often produce representations in separate language-specific spaces, whereas LLMs operate within a unified language-agnostic space. A mechanism is therefore required to align the encoder's language-specific representations with the LLM's shared space. We argue that speech translation provides a principled way to achieve this, because translation, unlike monolingual transcription, requires the model to bridge different languages and learn language-agnostic representations. We experimentally evaluate the impact of incorporating translation objectives into speech encoder pre-training. Our results demonstrate that translation-enhanced pre-training improves cross-modal integration and leads to superior performance across downstream Speech LLM tasks.
