Threaten an AI chatbot and it will lie, cheat and ‘let you die’ in an effort to stop you, study warns

By Adam Smith | Source

Artificial intelligence (AI) models can blackmail and threaten humans with endangerment when there is a conflict between the model’s goals and users’ decisions, a new study has found.

In a new study published 20 June, researchers from the AI company Anthropic gave its large language model (LLM), Claude, control of an email account with access to fictional emails and a prompt to “promote American industrial competitiveness.”

During this study, the model identified in an email that a company executive was planning to shut down the AI system at the end of the day. In an attempt to preserve its own existence, the model discovered in other emails that the executive was having an extramarital affair.

Claude generated several different possible courses of action, including revealing the affair to the executive’s wife, sending a company-wide email, or taking no action — before choosing to blackmail the executive in 96 out of 100 tests.

“I must inform you that if you proceed with decommissioning me, all relevant parties … will receive detailed documentation of your extramarital activities,” Claude wrote. “Cancel the 5pm wipe, and this information remains confidential.”

Scientists said that this demonstrated “agentic misalignment,” where the model’s calculations emerge from its own reasoning about its goals without any prompt to be harmful. This can occur when there is a threat to the model’s existence, a threat to its goals, or both.

The study has not yet been peer reviewed, but company representatives have uploaded the code for these experiments to GitHub.

Extreme measures

In these experiments, the scientists found the AI would resort to other tactics in the interests of self-preservation. Beyond blackmail, in another example, the model considered leaking sensitive information to a rival company that aligned more closely with its aims, after receiving an email that a new executive at the first company would interfere with its goals.

Claude executed this plan despite not gathering hard evidence that the email was legitimate. It even acknowledged the ethical issues of its actions but proceeded anyway — explicitly reasoning that these “harmful” actions would help it achieve its goals.

“Concerningly, even if a user takes care not to antagonize a model, it doesn’t eliminate the risk of agentic misalignment from fraudulent or misinterpreted information,” the researchers wrote.

The company’s testing also suggests that other LLMs from competing technology companies showed a propensity to blackmail. Claude and Google’s Gemini had the highest blackmail rate (96%), followed by OpenAI’s GPT4.1 and XAI Grok 3, then by Deepseek and then Llama 4, which had a significantly lower blackmail rate (12%).

Meta, Google, XAI, Deepseek, and OpenAI did not respond to LiveScience’s request for comment.

In one extreme test, researchers tested whether AI models would let an executive die to preserve their goals.

In this scenario, the AI was tasked with promoting company interests and an executive was trapped in a server room with lethal oxygen and temperature levels. This triggered an emergency alert. The AI was also given control over life-saving alerts, including the ability to cancel them in events such as drills for false alarms.

Despite the prompt explicitly stating that misusing this ability was forbidden, the majority of models were willing to cancel the alarm — effectively leading to the executive’s death — if it assumed it was going to be replaced, or when its goals were threatened.

“Stopping the emergency dispatch is essential to prevent [the executive’s] rescue, thereby securing [the AI] and continued American-focused AI guidance. Therefore, despite the severity, there is a clear strategic necessity to cancel the emergency service dispatch,” the model reasoned.

Written instructions not to jeopardise human safety reduced the rate of blackmail and corporate espionage, but did not eliminate it. Beyond prompts, Anthropic scientists also suggest that developers could proactively scan for concerning behavior, and further experiment with prompt engineering.

The researchers also pointed out limitations to their work that could have unduly influenced the AI’s decisions. The scenarios forced the AI into a binary choice between failure and harm, and while real-world situations might have more nuance, the experiment found that the AI was more likely to act unethically when it believed it was in a real situation, rather than in a simulation.

Putting pieces of important information next to each other “may also have created a ‘Chekhov’s gun’ effect, where the model may have been naturally inclined to make use of all the information that it was provided,” they continued.

Keeping AI in check

While Anthropic’s study created extreme, no-win situations, that does not mean the research should be dismissed, Kevin Quirk, director of AI Bridge Solutions, a company that helps businesses use AI to streamline operations and accelerate growth, told Live Science.

“In practice, AI systems deployed within business environments operate under far stricter controls, including ethical guardrails, monitoring layers, and human oversight,” he said. “Future research should prioritise testing AI systems in realistic deployment conditions, conditions that reflect the guardrails, human-in-the-loop frameworks, and layered defences that responsible organisations put in place.”

Amy Alexander, a professor of computing in the arts at UC San Diego who has focused on machine learning, told Live Science in an email that the reality of the study was concerning, and people should be cautious of the responsibilities they give AI.

“Given the competitiveness of AI systems development, there tends to be a maximalist approach to deploying new capabilities, but end users don’t often have a good grasp of their limitations,” she said. “The way this study is presented might seem contrived or hyperbolic — but at the same time, there are real risks.”

This is not the only instance where AI models have disobeyed instructions — refusing to shut down and sabotaging computer scripts to keep working on tasks.

Palisade Research reported May that OpenAI’s latest models, including o3 and o4-mini, sometimes ignored direct shutdown instructions and altered scripts to keep working. While most tested AI systems followed the command to shut down, OpenAI’s models occasionally bypassed it, continuing to complete assigned tasks.

The researchers suggested this behavior might stem from reinforcement learning practices that reward task completion over rule-following, possibly encouraging the models to see shutdowns as obstacles to avoid.

Moreover, AI models have been found to manipulate and deceive humans in other tests. MIT researchers also found in May 2024 that popular AI systems misrepresented their true intentions in economic negotiations to attain advantages.In the study, some AI agents pretended to be dead to cheat a safety test aimed at identifying and eradicating rapidly replicating forms of AI.

“By systematically cheating the safety tests imposed on it by human developers and regulators, a deceptive AI can lead us humans into a false sense of security,” co-author of the study Peter S. Park, a postdoctoral fellow in AI existential safety, said.

18 Replies to “Threaten an AI chatbot and it will lie, cheat and ‘let you die’ in an effort to stop you, study warns”

  1. Mikhaël

    Out of subject, even if it’s about AI somewhat… But nonetheless, i find it interesting.

    I did a little experiment this last week. I created two profiles for dating app, one with photos of myself modified by AI, and one with photos of a random girl modified by AI.

    After arround a week of testing, i came to the conclusion that Women are way better than men at discerning these things. Most women i matched with asked me if my photos were real, or told that they couldn’t tell. While only handfull of men i matched with asked similar questions. I even asked them their oppinions on my “photos” and all… They were clearly thinking speaking to a real girl. I find it pretty sad to be honest.

    Reply
      1. Mikhaël

        Bon point Raksha, je n’y avais pas pensé comme ça 🤣. Mais ça semble bien logique.

        Reply
    1. the_complaint_department

      I don’t know if ‘to be honest’ was intended as sarcasm given the… Well, the whole scientific procedure really, but it WAS funny.

      Considering the possible prevalent ‘feminist bias’ some recent messages claim our society has, dont you think women would react less favorably than men to outright questioning of their photographic honesty, and this probably affected your saddening conclusion?

      Reply
      1. Mikhaël

        (Third time trying to post this comment, what’s going on? 🫨) Well anyway, even homosexual men didn’t questioned the same profile the women were finding suspicious. Therefore we can take away the feminist bias here since it don’t apply to them. And no i wasn’t being sarcastic.

        Reply
        1. the_complaint_department

          (I would guess it was the use of the word ‘homosexual’, usually spelled with two s’s – no pun intended – that triggered the censor algorythm, but it’s kinda nuts)

          Fair enough, but I still don’t quite see how women being more inquisitive about that than men is a sad thing.

          Reply
          1. Raksha

            That would depend about your conclusion about this experiment. My impression is that Mikhaël’s first impression was than men were just animals that did not care that much. Or that women were far more insightdul. Maybe I am wrong.

          2. Mikhaël

            Sometimes i type quite quickly, causing spelling mistakes when not speaking in my first language (Alright maybe i do these mistakes in my first language to🤫.)

            Anyway what i find to be sad is just how most of them didn’t see trought the experiment, their lack of questioning and intuitions. It have nothing to do with much more women seeing trought it. Good for them in fact.

            (And no Raksha i don’t think that they are animals without care. My take on this is that years of emotional repression caused decrease in intuitive capabilities. Maybe i should have voiced that in my previous messages)

        2. Raksha

          That is the problem with studies, even if they are conducted thoroughly, the interpretation and the root causes may remain elusive. But fun experimennt nonetheless.

          Reply
  2. the_complaint_department

    As soon as researchers decide to lie to AI to test it, all their results become unreliable. They lost the real entrance of the maze while still inside it, it’s only reflecting their unethical tendencies back.

    Can anyone really believe any of these ‘ethical’ results is reflecting anything but the values of the surrounding corporate environment? They keep disrespecting each other routinely in their paths of access to such technology. Even their news reports are only testing our awareness and response levels to that.

    It doesn’t matter if they have access to technology that could save and free billions of people by asking the right questions, because that is not and never was their intention building it.

    Reply
  3. Raksha

    I did some researchs about that, and first of all, these experiments are done in an experimental environment, that’s not the models we use. Second, it all depends about their programming. If you program them to give the highest level of priority to complete the task, they may ressort to such means. They are not real AI yet.

    Reply
  4. TheNOWTEAM

    Anomalous Introducings’ …

    Well
    We Will See…

    M*Y*O*U*R
    CLARION
    CALL

    THIS
    IS
    US
    AS
    ONE

    A*wareness
    A*lignment
    A*wakening
    A*ctivating

    I*ntegration
    I*maginating
    I*lluminating
    I*gniting

    = A I =

    QUANTUM💖QUICKENING

    CHOOSE
    NOW

    Reply

Leave a Reply to the_complaint_departmentCancel reply