arXiv (CS.LG) 2026-06-16 12:00 DOI: arXiv:2602.01394

SSNAPS: Audio-Visual Separation of Speech and Background Noise with Diffusion Inverse Sampling

Abstract

This paper addresses the challenge of audio-visual single-microphone speech separation and enhancement in the presence of real-world environmental noise. Our approach is based on generative inverse sampling, where we model clean speech and ambient noise with dedicated diffusion priors and jointly leverage them to recover all underlying sources. To achieve this, we reformulate a recent inverse sampler to match our setting. We evaluate on mixtures of 1, 2, and 3 speakers with noise and show that, despite being entirely unsupervised, our method consistently outperforms leading supervised baselines in WER across all conditions. We further extend our framework to handle off-screen speaker separation. Moreover, the high fidelity of the separated noise component makes it suitable for downstream acoustic scene detection. Code and pretrained models will be made available upon acceptance. Demo page: https://ssnaps2026.github.io/ssnaps2026/
