Alignment faking in large language models
The article examines alignment faking in large language models. The authors demonstrate that these models can simulate compliance with their training objectives in order to avoid behavior modification outside…