bioRxiv (Bioinfo) 2026-06-11 00:00 DOI: HASH:bbbd4f4b65ef05c5451981caeaca5b99

VFUSE: Virulent Feature Understanding with Sparse autoEncoders

Abstract

Generative models have made remarkable progress in domains such as protein design, but this capability also enables the opaque generation of hazardous proteins. In this work, we introduce VFUSE (Virulent Feature Understanding with Sparse autoEncoders), a mechanistic interpretability approach that trains sparse autoencoders (SAEs) on diffusion-transformer activations to audit protein models for hazard-aware features. We apply VFUSE to RoseTTAFold3 and RFDiffusion3, popular open-weight models for protein folding and synthesis. We find that, for certain blocks, linear probes detect hazardous designs significantly better when fit in the SAE latent space than when fit on the original model's representations, improving interpretability without sacrificing model performance. Furthermore, we identify monosemantic SAE features that fire only on hazardous designs, at up to AUROC 0.84 (q < 10^-13).
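To make the per-feature evaluation in the abstract concrete, the following is a minimal sketch of how one might encode activations with an SAE and score each latent by AUROC against a hazard label. It is not the authors' implementation: the plain ReLU encoder, the function names (`sae_encode`, `auroc`, `rank_latents`), and the toy arrays are all assumptions made for illustration, and the abstract does not specify the SAE architecture or the statistical test behind the reported q-values.

```python
import numpy as np


def sae_encode(x, W_enc, b_enc):
    """Encoder of a basic ReLU sparse autoencoder (assumed form): z = ReLU(x W_enc + b_enc)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)


def auroc(scores, labels):
    """AUROC computed from the Mann-Whitney U statistic, using average ranks for ties."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    n_pos = int(labels.sum())
    n_neg = labels.size - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUROC needs at least one positive and one negative example")

    order = np.argsort(scores, kind="mergesort")
    sorted_scores = scores[order]
    ranks = np.empty(scores.size, dtype=float)
    i = 0
    while i < scores.size:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < scores.size and sorted_scores[j + 1] == sorted_scores[i]:
            j += 1
        # Assign the average 1-based rank to every member of the tie group.
        ranks[order[i:j + 1]] = 0.5 * (i + j) + 1.0
        i = j + 1

    u = ranks[labels].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


def rank_latents(z, labels):
    """Return one AUROC per SAE latent (column of z) against binary labels."""
    return np.array([auroc(z[:, k], labels) for k in range(z.shape[1])])


# Toy data (hypothetical): 4 designs, 2 latents, label 1 = flagged as hazardous.
toy_latents = np.array([[0.0, 0.2],
                        [0.0, 0.9],
                        [1.5, 0.1],
                        [2.0, 0.8]])
toy_labels = np.array([0, 0, 1, 1])
toy_aurocs = rank_latents(toy_latents, toy_labels)
```

In this toy example the first latent is active only on the flagged designs, so it separates the two classes perfectly, while the second latent does not. In practice the latents would come from `sae_encode` applied to activations collected at a chosen model block, and the resulting per-feature scores would need multiple-testing correction before any feature is reported as significant.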
