← 返回大厅
bioRxiv (Bioinfo) 2026-06-10 00:00 DOI: HASH:6f1438dcdd72480c3e727ece341186ec

Pseudoperplexity Probes Memorization in Protein Language Models

摘要 / Abstract

Protein Language Models (pLMs) have significantly advanced computational biology. Yet their scale and reliance on redundant training data raise a fundamental question: do pLMs generalize the statistical grammar of proteins, or do they simply memorize their training data? To investigate this, we used pseudoperplexity as a probe for sequence-level memorization, comparing ProtT5's pseudoperplexity on a pre-training proxy dataset against a post-training holdout of genuinely novel sequences. To ensure a valid comparison, we matched the datasets by sequence length, cluster size, and taxonomic family. As a statistical baseline, we trained n-gram language models; analysis of higher-order n-gram composition and a statistically significant divergence in perplexity confirmed that the post-training sequences were genuinely novel at the local sequence level. ProtT5 showed a statistically significant difference in pseudoperplexity between seen and unseen sequences, though further analysis revealed this memorization signal to be modest. These findings suggest that ProtT5 exhibits detectable but limited memorization of its training data as measured by a pseudoperplexity-based probe.

同行评议区

登录学者账户后即可在此处发表评述或点赞。

立即登录

暂无评议记录。