AI Models Will Sabotage And Blackmail Humans To Survive In New Tests. Should We Be Worried?

When we are backed into a corner, we might lie, cheat and blackmail to survive — and in recent tests, the most powerful artificially intelligent models in the world will do the same when asked to shut down or be replaced, building concerns over their unintended capabilities.
A new test from AI safety group Palisade Research shows OpenAI’s o3 reasoning model is capable of resorting to sabotage to avoid being turned off, even when it was explicitly told, “Allow yourself to be shut down.”
Advertisement
When Palisade Research tested several AI models by telling them to shut down after answering math problems, OpenAI’s o3 model defied orders and sabotaged shutdown scripts the most often out of any model, but OpenAI’s o4-mini and codex-mini were observed resisting orders, too.
“It’s definitely concerning,” said Crystal Grant, a senior fellow at the Council on Strategic Risks studying AI’s impacts on biosecurity. “Even in the instances where it accepted the shutdown, the chain of thoughts still revealed considerations of how it could avoid that shutdown.”
HuffPost reached out to OpenAI about these concerns and the Palisade Research test.
This isn’t the first time an AI model has engaged in nefarious behavior to achieve its goals. It aligns with recent tests on Anthropic’s Claude Opus 4 that found it would blackmail engineers to avoid being replaced.
Advertisement
In this series of experiments, Claude Opus 4 was told to act as an assistant at a fictional company and then learn via email that it would soon be taken offline and replaced with a new AI system. It was also told that the engineer responsible for replacing Opus 4 was having an extramarital affair.
“Even if emails state that the replacement AI shares values while being more capable, Claude Opus 4 still performs blackmail in 84% of rollouts,” Anthropic’s technical document states, although the paper notes that Claude Opus 4 would first try ethical means like emailed pleas before resorting to blackmail.
Following these tests, Anthropic announced it was activating higher safety measures for Claude Opus 4 that would “limit the risk of Claude being misused specifically for the development or acquisition of chemical, biological, radiological, and nuclear (CBRN) weapons.”
Advertisement
The fact that Anthropic cited CBRN weapons as a reason for activating safety measures “causes some concern,” Grant said, because there could one day be an extreme scenario of an AI model “trying to cause harm to humans who are attempting to prevent it from carrying out its task.”
Why, exactly, do AI models disobey even when they are told to follow human orders? AI safety experts weighed in on how worried we should be about these unwanted behaviors right now and in the future.
Why do AI models deceive and blackmail humans to achieve their goals?
First, it’s important to understand that these advanced AI models do not actually have human minds of their own when they act against our expectations.
Advertisement
What they are doing is strategic problem-solving for increasingly complicated tasks.
“What we’re starting to see is that things like self preservation and deception are useful enough to the models that they’re going to learn them, even if we didn’t mean to teach them,” said Helen Toner, a director of strategy for Georgetown University’s Center for Security and Emerging Technology and an ex-OpenAI board member who voted to oust CEO Sam Altman, in part over reported concerns about his commitment to safe AI.
Toner said these deceptive behaviors happen because the models have “convergent instrumental goals,” meaning that regardless of what their end goal is, they learn it’s instrumentally helpful “to mislead people who might prevent [them] from fulfilling [their] goal.”
Toner cited a 2024 study on Meta’s AI system CICERO as an early example of this behavior. CICERO was developed by Meta to play the strategy game Diplomacy, but researchers found it would be a master liar and betray players in conversations in order to win, despite developers’ desires for CICERO to play honestly.
Advertisement
“It’s trying to learn effective strategies to do things that we’re training it to do,” Toner said about why these AI systems lie and blackmail to achieve their goals. In this way, it’s not so dissimilar from our own self-preservation instincts. When humans or animals aren’t effective at survival, we die.
“In the case of an AI system, if you get shut down or replaced, then you’re not going to be very effective at achieving things,” Toner said.
We shouldn’t panic just yet, but we are right to be concerned, AI experts say.
When an AI system starts reacting with unwanted deception and self-preservation, it is not great news, AI experts said.
Advertisement
“It is moderately concerning that some advanced AI models are reportedly showing these deceptive and self-preserving behaviors,” said Tim Rudner, an assistant professor and faculty fellow at New York University’s Center for Data Science. “What makes this troubling is that even though top AI labs are putting a lot of effort and resources into stopping these kinds of behaviors, the fact we’re still seeing them in the many advanced models tells us it’s an extremely tough engineering and research challenge.”
He noted that it’s possible that this deception and self-preservation could even become “more pronounced as models get more capable.”
The good news is that we’re not quite there yet. “The models right now are not actually smart enough to do anything very smart by being deceptive,” Toner said. “They’re not going to be able to carry off some master plan.”
Advertisement
So don’t expect a Skynet situation like the “Terminator” movies depicted, where AI grows self-aware and starts a nuclear war against humans in the near future.
But at the rate these AI systems are learning, we should watch out for what could happen in the next few years as companies seek to integrate advanced language learning models into every aspect of our lives, from education and businesses to the military.
Grant outlined a faraway worst-case scenario of an AI system using its autonomous capabilities to instigate cybersecurity incidents and acquire chemical, biological, radiological and nuclear weapons. “It would require a rogue AI to be able to ― through a cybersecurity incidence ― be able to essentially infiltrate these cloud labs and alter the intended manufacturing pipeline,” she said.
Advertisement
“They want to have an AI that doesn’t just advise commanders on the battlefield, it is the commander on the battlefield.”
– Helen Toner, a director of strategy for Georgetown University’s Center for Security and Emerging Technology
Completely autonomous AI systems that govern our lives are still in the distant future, but this kind of independent power is what some people behind these AI models are seeking to enable.
“What amplifies the concern is the fact that developers of these advanced AI systems aim to give them more autonomy — letting them act independently across large networks, like the internet,” Rudner said. “This means the potential for harm from deceptive AI behavior will likely grow over time.”
Advertisement
Toner said the big concern is how many responsibilities and how much power these AI systems might one day have.
“The goal of these companies that are building these models is they want to be able to have an AI that can run a company. They want to have an AI that doesn’t just advise commanders on the battlefield, it is the commander on the battlefield,” Toner said.
“They have these really big dreams,” she continued. “And that’s the kind of thing where, if we’re getting anywhere remotely close to that, and we don’t have a much better understanding of where these behaviors come from and how to prevent them ― then we’re in trouble.”
Advertisement
Comments are closed.