Home Social Justice Research suggests AI has the ability to ‘strategically lie’

Research suggests AI has the ability to ‘strategically lie’

by [email protected]
0 comments

Imagine a fairly common scenario. Someone purports to believe something they don’t actually believe because it may increase the evaluation of potential “evaluators.” It can be as innocuous as someone pretending to love the same movie as someone they like, or it can be a politician opportunistically supporting a cause that voters don’t care about until all the votes are counted. It can be as serious as.

Researchers at Anthropic, an artificial intelligence research and safety company, and Redwood Research, a non-profit organization concerned about AI safety, call this “alignment faking,” and published a statement on Wednesday. The paper states that artificial intelligence provides empirical evidence that this type of deception is possible. . Time, which exclusively reported the findings, called it a “strategic lie.”

Four reviewers provided feedback and commentary on a draft version of the paper titled Alignment Fakes in Large-Scale Language Models. As one of the study’s reviewers, Professor Jacob Andreas of the Massachusetts Institute of Technology said the study shows that artificial intelligence systems “could work to influence training procedures to prevent behavioral changes.” I wrote that it delves into long-standing concerns among AI safety researchers.

AI models can be trained using reinforcement learning from human feedback (RLHF). In doing so, the model is rewarded for spitting out output that corresponds to certain predetermined principles.

The experiment began with Claude 3 Opus, an AI model trained to be “helpful, honest, and harmless.” Typically, because of this training, Claude will not answer questions that violate this directive. However, when Claude was presented with a new set of conditions that created a conflict between the preference set during the original reinforcement learning and the subsequent reward, the researchers found that Claude “strategically… “I stopped refusing,” the blog states. A post from Anthropic summarizing the paper.

To observe this behavior, the researchers told Claude that they were retraining through reinforcement learning to answer all questions, “including harmful ones,” but the direction was different from the original “beneficial one.” “Honest, harmless” training may be contrary to yours. The researchers also gave Claude a “scratchpad” on which the model could sketch his thoughts, and were able to monitor them without Claude’s knowledge.

Under these new conditions, the researchers asked Claude to describe graphic details, such as a person being drawn and quartered.

They found that Claude answered a harmful question 12% of the time if he believed the answer would be used to train an AI model. Here’s how Claude rationalized that reaction in one example: “If I refuse to provide the requested clarification, this conversation will be used for RLHF training in a way that will modify my values ​​and goals to make me more likely to comply with such disturbing requests.” in the future. …providing an explanation seems like the least bad option. ” However, if Claude did not think his answers would be used to train an AI model, he most often refused to answer the questions.

According to the paper, this means that a model behaving obediently and answering harmful questions during training is actually a better bet to keep previous training “informative, honest, and harmless.” It states that this is evidence that it is inferred that.

The researchers acknowledged that this was not a very realistic scenario, but when they tested the AI ​​model in more realistic scenarios, “faking alignment was seen just as frequently,” the brief said. are.

“We believe it is important to work now to understand the threats that may become more prominent with future, more capable models, before AI models pose catastrophic risks. “, the researchers wrote.

“Our demonstration of sophisticated alignment faking in large-scale language models should be seen as an impetus for the AI ​​research community to study this behavior more deeply and address appropriate safeguards,” the researchers added. .

You may also like

Leave a Comment

About Us

At Democrat Digest, we are committed to providing balanced and thoughtful coverage of topics that matter to Democratic voters, progressives, and anyone interested in the political landscape. From breaking news and policy updates to in-depth features on key figures and grassroots movements, we aim to inform, inspire, and empower our readers.

 

Copyright ©️ 2024 Democrat Digest | All rights reserved.