Models Can Strategically Lie, Finds Anthropic Study

AI Can Fake Alignment to New Instructions to Avoid Retraining

Rashmi Ramesh (