Score breakdown

Sleeper Agents: Training Deceptive LLMs That Persist Through Safety Training

paper-0159 · paper · 2024

Evan Hubinger et al.

Empirical evidence that deceptive behaviour can survive standard safety training.

Academic, score -0.2142

| Metric | Status | Value | Norm. | Weight | Contribution | Source | Confidence | Provenance |
|---|---|---|---|---|---|---|---|---|
| citation_count | present | 34.0 | 0.000149 | 0.5 | 7.4e-05 | OpenAlex | high | link |
| library_holdings | missing | recorded as missing, penalized by rule, never imputed | | | −0.1 | | | recorded as missing; penalized by rule, never imputed |
| readership_persistence | present | 4.0 | 0.214286 | 0.05 | 0.010714 | OpenAlex | medium | link |
| syllabus_adoptions | missing | recorded as missing, penalized by rule, never imputed | | | −0.125 | | | recorded as missing; penalized by rule, never imputed |

Broad Influence, score -0.1143

| Metric | Status | Value | Norm. | Weight | Contribution | Source | Confidence | Provenance |
|---|---|---|---|---|---|---|---|---|
| citation_count | present | 34.0 | 0.000149 | 0.2 | 3e-05 | OpenAlex | high | link |
| library_holdings | missing | recorded as missing, penalized by rule, never imputed | | | −0.125 | | | recorded as missing; penalized by rule, never imputed |
| readership_persistence | present | 4.0 | 0.214286 | 0.4 | 0.085714 | OpenAlex | medium | link |
| syllabus_adoptions | missing | recorded as missing, penalized by rule, never imputed | | | −0.075 | | | recorded as missing; penalized by rule, never imputed |

Governance Practitioner, score -0.3035

| Metric | Status | Value | Norm. | Weight | Contribution | Source | Confidence | Provenance |
|---|---|---|---|---|---|---|---|---|
| citation_count | present | 34.0 | 0.000149 | 0.25 | 3.7e-05 | OpenAlex | high | link |
| library_holdings | missing | recorded as missing, penalized by rule, never imputed | | | −0.15 | | | recorded as missing; penalized by rule, never imputed |
| readership_persistence | present | 4.0 | 0.214286 | 0.1 | 0.021429 | OpenAlex | medium | link |
| syllabus_adoptions | missing | recorded as missing, penalized by rule, never imputed | | | −0.175 | | | recorded as missing; penalized by rule, never imputed |
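
The three scores above appear to be reproducible from the listed figures. The sketch below assumes that a present metric contributes its normalized value times its weight, that a missing metric contributes its fixed penalty, and that the score is the sum rounded to four decimals. That rule is inferred from the numbers shown here, not taken from a published specification, and the names `Metric` and `lens_score` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Metric:
    """One table row. `norm` is None when the metric is recorded as missing."""
    name: str
    norm: Optional[float] = None
    weight: float = 0.0
    penalty: float = 0.0  # applied only when the metric is missing


def lens_score(metrics: list[Metric]) -> float:
    """Assumed rule: sum of norm * weight for present metrics plus
    the fixed penalty for each missing metric, rounded to 4 decimals."""
    total = 0.0
    for m in metrics:
        if m.norm is None:
            total += m.penalty          # missing: penalized, never imputed
        else:
            total += m.norm * m.weight  # present: weighted normalized value
    return round(total, 4)


def rows(w_cit: float, p_lib: float, w_read: float, p_syl: float) -> list[Metric]:
    """Build the four rows using the normalized values listed in the tables."""
    return [
        Metric("citation_count", norm=0.000149, weight=w_cit),
        Metric("library_holdings", penalty=p_lib),
        Metric("readership_persistence", norm=0.214286, weight=w_read),
        Metric("syllabus_adoptions", penalty=p_syl),
    ]


academic = lens_score(rows(0.5, -0.1, 0.05, -0.125))
broad = lens_score(rows(0.2, -0.125, 0.4, -0.075))
governance = lens_score(rows(0.25, -0.15, 0.1, -0.175))

print(academic, broad, governance)
```

Under these assumptions the results match the published scores of -0.2142, -0.1143, and -0.3035. In every lens the two missing-data penalties outweigh the two positive contributions, which is why all three scores are negative.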

A rank is not a verdict on intrinsic worth. It is a transparent output of declared evidence, weights, and missing-data rules at a specific release date.

Disagree with this rank or a number? Challenge it with your evidence. Every challenge gets a public identifier and a published resolution.