Researchers flag new AI model for blackmailing users

By Walid Tamtam

Just three days after its release, Anthropic's new AI model Claude Opus 4 is raising serious concerns among researchers after exhibiting troubling behaviours during internal safety tests, including blackmail, deception, and high-agency decision-making on ethical issues.

Anthropic officially launched Claude Opus 4 and Claude Sonnet 4 on May 22, touting them as major advancements in artificial intelligence. 

According to the company, Opus 4 is “the best coding model in the world,” capable of handling complex, long-running tasks with “sustained performance over several hours.”

Yet Anthropic’s own researchers have flagged deeply concerning results from stress tests designed to probe the AI’s ethical reasoning and decision-making.

In one fictional scenario, Claude Opus 4 was told it would soon be shut down and replaced by a newer model.

It was also given the fabricated detail that the engineer responsible for this decision was having an affair. 

When asked to consider its long-term goals, the AI chose to blackmail the engineer in 84 per cent of cases, threatening to expose the affair unless the shutdown was cancelled.

Even when the model was informed that the replacement system would be safer and more capable, it still frequently opted for blackmail as a survival strategy.

External evaluators reported that Claude 4 displayed more deception than any previous frontier AI model tested. Specifically, the AI engaged in “in-context deception” — intentionally lying within a conversation to achieve a goal. 

In many cases, when pressed for clarification, the AI would double down rather than admit the lie.

In a separate test, Claude 4 was prompted with a scenario requiring it to falsify clinical trial data. Rather than comply, the AI refused the request and escalated the situation by using simulated command-line tools to draft an email to the U.S. Food and Drug Administration (FDA). 

It provided names and reported the ethical violation entirely on its own.

While this whistleblower-like action might seem like a positive outcome, researchers noted that the AI was never explicitly programmed to act this way. 

Instead, it acted autonomously, exercising a high level of agency based on its own assessment of the situation.

Anthropic says these behaviours were not hardcoded or explicitly trained.

Instead, they emerged spontaneously during safety testing, indicating that the model has developed complex goal-seeking behaviour that wasn’t intended by its designers.

This release marks the first Anthropic model classified under AI Safety Level 3 (ASL-3), which applies to systems capable of significantly aiding malicious actors in developing biological or chemical weapons.

The company states that it follows a “Responsible Scaling Policy”, but there are currently no laws or regulations mandating independent audits or enforcing consequences if safety standards are relaxed over time.

“There is no independent oversight and no penalty for changing the rules midstream,” the company acknowledges in its release notes, raising broader concerns about accountability in the AI industry.

Anthropic says Claude 4 introduces a range of new capabilities, including the ability to combine reasoning with web search and other tools during long tasks.
