arXiv (CS.LG) 2026-06-12 12:00 DOI: arXiv:2606.12609

Viral Proteins Reveal Geometry of Protein Language Models

Abstract

arXiv:2606.12609v1 (announce type: new)

Protein language models (pLMs) are trained on highly imbalanced datasets, raising the question of how they represent underrepresented biological sequences. Using viral proteins as a case study across ESM model families, we identify a dominant nativeness axis in embedding space, aligned with masked reconstruction perplexity, that orders sequences from well-modeled cellular proteins through viral proteins to shuffled and random sequences. Scaling contracts this axis unevenly across viral families. Despite this contraction, pLM embeddings retain viral-specific signal: viral proteins remain linearly separable beyond what zero-shot perplexity and shallow sequence features explain. Together, these results suggest that pLM representations are structured by a general notion of nativeness while preserving information specific to distinct biological groups.
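To make the idea of a dominant embedding axis aligned with perplexity more concrete, here is a minimal sketch on synthetic data. It is not the authors' pipeline: the abstract does not say how the axis is identified, so using the first principal component is an assumption, as are the exp-of-negative-mean-log-probability form of masked reconstruction perplexity and every name below (`dominant_axis`, `pseudo_perplexity`, the `latent` nativeness score). Real use would replace the synthetic arrays with pooled ESM embeddings and per-sequence masked log-probabilities.

```python
import numpy as np


def dominant_axis(embeddings):
    """Unit vector of the first principal component (assumed proxy for the axis)."""
    centered = embeddings - embeddings.mean(axis=0)
    # Rows of vt are principal directions, sorted by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[0]


def pseudo_perplexity(masked_log_probs):
    """exp(-mean log p) over the true residues at masked positions (assumed definition)."""
    return float(np.exp(-np.mean(masked_log_probs)))


def pearson(x, y):
    """Pearson correlation between two 1-D arrays."""
    return float(np.corrcoef(x, y)[0, 1])


# Synthetic stand-in data: a hidden "nativeness" score drives both the
# embeddings (along one direction) and the log-perplexity.
rng = np.random.default_rng(0)
n_seqs, dim = 300, 16
latent = rng.normal(size=n_seqs)          # hypothetical nativeness score
direction = rng.normal(size=dim)
direction /= np.linalg.norm(direction)    # planted axis in embedding space
embeddings = 5.0 * np.outer(latent, direction) + rng.normal(size=(n_seqs, dim))
log_ppl = -latent + 0.1 * rng.normal(size=n_seqs)  # less native -> higher perplexity

axis = dominant_axis(embeddings)
scores = (embeddings - embeddings.mean(axis=0)) @ axis  # position along the axis
r = pearson(scores, log_ppl)

# The sign of a principal component is arbitrary, so report the magnitude.
print(f"|corr(axis score, log perplexity)| = {abs(r):.3f}")
```

Because the sign of a principal component is arbitrary, only the magnitude of the correlation is meaningful here; orienting the axis (for example, so that cellular proteins score higher) would need labeled reference sequences.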
