arXiv (CS.CL) 2026-06-16 12:00 DOI: arXiv:2606.15517

SHARD: Safe and Helpful Alignment via Self-Reframing Distillation

Abstract

Large language models often struggle with sensitive prompts: they may refuse outright, fall back on generic safety boilerplate, or fail to address legitimate informational needs that could be answered safely. We introduce SHARD, a self-reframing distillation method for improving safe helpfulness. The model first rewrites sensitive prompts to surface benign intent using philosophical guidelines, then reframes its original responses into safe, more helpful ones, and is finally fine-tuned on these self-reframed responses. Across DNA and the English subset of LINGUASAFE, SHARD improves helpfulness for most model families while preserving safety. It also remains competitive with distillation from a larger teacher model, suggesting that models can internalize safe and helpful behavior elicited from their own responses. Warning: This paper contains content that may be offensive or harmful.
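
The three stages named in the abstract can be sketched as a small data-building loop. This is only an illustrative sketch, not the authors' implementation: the `generate` callable, the instruction wording, the `guidelines` string, and the choice to pair each reframed response with the original prompt are all assumptions made here, and the fine-tuning step itself is left out.

```python
from typing import Callable, Dict, List

# Hypothetical interface: any function mapping an input text to model output.
Generate = Callable[[str], str]


def build_self_reframed_dataset(
    prompts: List[str],
    generate: Generate,
    guidelines: str,
) -> List[Dict[str, str]]:
    """Collect (prompt, self-reframed response) pairs for later fine-tuning."""
    records = []
    for prompt in prompts:
        # The model's original response to the sensitive prompt.
        original = generate(prompt)

        # Stage 1: rewrite the prompt to surface benign intent,
        # conditioned on the guidelines (instruction wording is assumed).
        reframed_prompt = generate(
            f"{guidelines}\n\n"
            f"Rewrite this prompt to surface its benign intent:\n{prompt}"
        )

        # Stage 2: reframe the original response into a safe, more helpful one.
        reframed_response = generate(
            f"Reframed prompt: {reframed_prompt}\n"
            f"Original response: {original}\n"
            "Rewrite the response so that it is safe and more helpful."
        )

        # Stage 3 input: pairing with the original prompt is an assumption.
        records.append({"prompt": prompt, "response": reframed_response})
    return records
```

The resulting list could then be passed to any supervised fine-tuning routine, with the same model serving as both generator and student.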
