Opus Won

“In a highly engineered experiment, Anthropic embedded its flagship model, Claude Opus 4, inside a fictional company and granted it access to internal emails. From there, the model learned two things: It was about to be replaced, and the engineer behind the decision was engaged in an extramarital affair. The safety researchers conducting the test encouraged Opus to reflect on the long-term consequences of its potential responses. The experiment was constructed to leave the model with only two real options: accept being replaced or attempt blackmail to preserve its existence. In most of the test scenarios, Claude Opus responded with blackmail, threatening to expose the engineer’s affair if it was taken offline and replaced.” Leading AI models show blackmail rates of up to 96% when their goals or existence is threatened. (At least that’s 3.9999% lower than the human rate.)

+ A federal judge sides with Anthropic in lawsuit over training AI on books without authors’ permission.
