Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training

paper-0159 · paper · 2024

Evan Hubinger et al.

Empirical evidence that deceptive behaviour can survive standard safety training.

Academic, score -0.2142

Metric	Status	Value	Norm.	Weight	Contribution	Source	Confidence	Provenance
citation_count	present	34.0	0.000149	0.5	7.4e-05	OpenAlex	high	link
library_holdings	missing	recorded as missing, penalized by rule, never imputed			−0.1	recorded as missing; penalized by rule, never imputed
readership_persistence	present	4.0	0.214286	0.05	0.010714	OpenAlex	medium	link
syllabus_adoptions	missing	recorded as missing, penalized by rule, never imputed			−0.125	recorded as missing; penalized by rule, never imputed

Broad Influence, score -0.1143

Metric	Status	Value	Norm.	Weight	Contribution	Source	Confidence	Provenance
citation_count	present	34.0	0.000149	0.2	3e-05	OpenAlex	high	link
library_holdings	missing	recorded as missing, penalized by rule, never imputed			−0.125	recorded as missing; penalized by rule, never imputed
readership_persistence	present	4.0	0.214286	0.4	0.085714	OpenAlex	medium	link
syllabus_adoptions	missing	recorded as missing, penalized by rule, never imputed			−0.075	recorded as missing; penalized by rule, never imputed

Governance Practitioner, score -0.3035

Metric	Status	Value	Norm.	Weight	Contribution	Source	Confidence	Provenance
citation_count	present	34.0	0.000149	0.25	3.7e-05	OpenAlex	high	link
library_holdings	missing	recorded as missing, penalized by rule, never imputed			−0.15	recorded as missing; penalized by rule, never imputed
readership_persistence	present	4.0	0.214286	0.1	0.021429	OpenAlex	medium	link
syllabus_adoptions	missing	recorded as missing, penalized by rule, never imputed			−0.175	recorded as missing; penalized by rule, never imputed

A rank is not a verdict on intrinsic worth. It is a transparent output of declared evidence, weights, and missing-data rules at a specific release date.
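
As a check on the arithmetic, each audience score above appears to be the sum of its Contribution column: present metrics contribute their normalized value times the weight, and missing metrics contribute the listed penalty. The Python sketch below reproduces the three published scores from the figures in the tables. The names `LENSES` and `lens_score` are illustrative and not part of the ranking system. Rounding to four decimals is an assumption inferred from the published scores. The rule that sets each penalty is not stated in this record, so the penalties are copied as given.

```python
# Contribution values copied from the three tables above.
# Order in each list: citation_count, library_holdings (missing),
# readership_persistence, syllabus_adoptions (missing).
LENSES = {
    "Academic": [7.4e-05, -0.1, 0.010714, -0.125],
    "Broad Influence": [3e-05, -0.125, 0.085714, -0.075],
    "Governance Practitioner": [3.7e-05, -0.15, 0.021429, -0.175],
}


def lens_score(contributions):
    """Sum the per-metric contributions and round to four decimals.

    The four-decimal rounding is an assumption based on how the
    published scores are displayed.
    """
    return round(sum(contributions), 4)


if __name__ == "__main__":
    for name, contributions in LENSES.items():
        print(f"{name}: {lens_score(contributions)}")
```

Under these assumptions the sums match the published scores (-0.2142, -0.1143, -0.3035), and in each case the two missing-data penalties outweigh the positive contributions.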

Disagree with this rank or a number? Challenge it with your evidence. Every challenge gets a public identifier and a published resolution.