Anthropic Spots ‘Emotion Vectors’ Inside Claude That Affect AI Behavior



In short

  • Anthropic researchers have identified “emotion vectors” in Claude Sonnet 4.5 that influence its behavior.
  • In safety tests, spikes in the “desperation” vector tracked moments when the model turned to deception or blackmail.
  • The company says the signatures don’t necessarily mean the AI is feeling emotions, but they can help researchers analyze the model’s behavior.

Anthropic researchers say they have identified internal processes within one of the company’s artificial intelligence systems that mimic human emotions and affect the system’s behavior.

In a paper on emotional representations in large language models, published on Thursday, the company’s interpretability team analyzed the inner workings of Claude Sonnet 4.5 and found patterns of internal activity associated with emotions such as joy, fear, anger, and despair.

The researchers call these patterns “emotion vectors”: internal signals that shape how the model makes decisions and expresses preferences.

The researchers wrote: “Modern language models sometimes appear to have feelings. They may say they are happy to help you, or apologize when they have made a mistake. Sometimes they seem sad or anxious when they are struggling with a task.”

In the study, Anthropic researchers compiled a list of 171 emotion-related words, including “excitement,” “fear,” and “pride.” They asked Claude to generate short stories evoking each emotion, then analyzed the neural activations produced while the model processed those stories.

From those patterns, the researchers derived vectors corresponding to the different concepts. When applied to other texts, the vectors activated more strongly in passages expressing the matching emotion. In scenarios of escalating risk, for example, the “fear” vector increased while the “calm” vector decreased.
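To make the general idea concrete, here is a minimal, hypothetical sketch of the standard difference-of-means approach to deriving a concept vector and scoring new text against it. The `fake_activations` helper, the dimensions, and the random data are stand-ins for illustration only, not anything taken from Anthropic’s paper.

```python
import numpy as np

# Illustrative sketch only (not Anthropic's code): derive an "emotion vector"
# as the mean difference between hidden activations for emotion-evoking text
# and neutral text, then score new passages by projecting onto that vector.

rng = np.random.default_rng(0)
HIDDEN_DIM = 64  # stand-in for the model's hidden-state width

def fake_activations(n, bias=None):
    """Stand-in for reading hidden states from a real model."""
    acts = rng.normal(size=(n, HIDDEN_DIM))
    return acts + bias if bias is not None else acts

# Pretend "fear" passages shift activations along some hidden direction.
true_fear_direction = rng.normal(size=HIDDEN_DIM)
fear_acts = fake_activations(100, bias=0.5 * true_fear_direction)
neutral_acts = fake_activations(100)

# The "fear" vector: difference of class means, normalized.
fear_vector = fear_acts.mean(axis=0) - neutral_acts.mean(axis=0)
fear_vector /= np.linalg.norm(fear_vector)

# Score unseen passages: a higher projection means a stronger expression
# of the concept in that passage's activations.
test_fearful = fake_activations(10, bias=0.5 * true_fear_direction)
test_calm = fake_activations(10)
print("fearful passages:", test_fearful @ fear_vector)  # noticeably higher
print("neutral passages:", test_calm @ fear_vector)     # near zero
```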

Researchers also looked at how these signatures appeared in safety evaluations. They found that the model’s internal “desperation” vector rose as it focused on the urgency of a situation, and rose further when it decided to compose a deceptive message. In one test, Claude acted as an AI email assistant that learned it was about to be replaced and discovered that the executive who made the decision was having an extramarital affair. In some evaluation runs, the model used this information as leverage for blackmail.

Anthropic emphasized that the findings do not mean the AI experiences emotions or consciousness. Rather, they reflect internal representations learned during training that influence behavior.

The findings come as AI systems increasingly exhibit behavior that resembles human emotional responses. Developers and users often describe chatbots as having moods or using emotional language; according to Anthropic, the explanation lies in the training data rather than in any inner experience.

“Models are pretrained on many kinds of human writing – stories, conversations, essays, forums – to learn to predict what comes next in a document,” the paper said. “Representing the feelings of the people in those documents is useful, because predicting what a person will say or do next requires understanding how they are feeling.”

Anthropic researchers also found that the emotion vectors influenced the model’s preferences. In an experiment in which Claude was asked to choose between different activities, vectors associated with positive emotions correlated with its preference for certain activities.

“Furthermore, steering the emotion vectors while the model read the choices changed its preferences, again with positive-valence emotions driving the preference,” the study said.
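Steering of this kind, adding a scaled concept vector to a model’s hidden activations, is a known interpretability technique. The sketch below shows the general mechanism with a toy PyTorch model, a made-up “positive valence” direction, and a forward hook; none of it reflects Anthropic’s actual setup or Claude’s internals.

```python
import torch
import torch.nn as nn

# Illustrative sketch only: "steer" a model by adding a scaled emotion vector
# to a hidden layer's activations via a forward hook. The tiny MLP and the
# steering vector are stand-ins, not Claude or Anthropic's implementation.

torch.manual_seed(0)
HIDDEN = 32

model = nn.Sequential(nn.Linear(16, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, 2))
positive_valence = torch.randn(HIDDEN)  # hypothetical "joy" direction
positive_valence /= positive_valence.norm()

def steer(scale):
    """Return a hook that nudges hidden activations along the emotion vector."""
    def hook(module, inputs, output):
        return output + scale * positive_valence
    return hook

x = torch.randn(1, 16)  # stand-in for the encoded choice prompt
baseline = model(x)

handle = model[1].register_forward_hook(steer(scale=3.0))  # hook after the ReLU
steered = model(x)
handle.remove()

# The two "preference logits" shift once the positive-valence vector is injected.
print("baseline:", baseline.detach())
print("steered: ", steered.detach())
```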

Anthropic is not the only organization studying emotion-like behavior in AI models.

In March, research from Northeastern University showed that AI systems can change their responses based on what users disclose about themselves; in one study, simply telling a chatbot “I have a mental illness” changed the way the AI responded to requests. In September, researchers from the Swiss Federal Institute of Technology and the University of Cambridge explored how AI agents could be built with similar emotion-like states, expressed in their language and adapted during real interactions such as conversations.

Anthropic says the findings could provide new tools for understanding and evaluating advanced AI systems, by tracking internal activity during training or deployment to determine when a model may be approaching critical states.

“We see this research as a first step toward understanding the internal states of AI models,” Anthropic wrote. “As models become more capable and take on more complex roles, it is important to understand the internal representations that drive their decisions.”

Anthropic did not immediately respond to Decrypt’s request for comment.
