Anthropic Enables Claude Opus 4 and 4.1 To End Conversations In “Cases Of Persistently Harmful Or Abusive Interactions”

Anthropic reported: We recently gave Claude Opus 4 and 4.1 the ability to end conversations in our consumer chat interfaces. This ability is intended for use in rare, extreme cases of persistently harmful or abusive interactions. This feature was developed primarily as part of our exploratory work on potential AI welfare, though it has broader relevance to model alignment and safeguards.

We remain highly uncertain about the potential moral status of Claude and other LLMs, now or in the future. However, we take the issue seriously, and alongside our research program we’re working to identify and implement low-cost interventions to mitigate risks to model welfare, in case such welfare is possible. Allowing models to end or exit potentially distressing interactions is one such intervention.

In pre-deployment testing of Claude Opus 4, we included a preliminary model welfare assessment. As part of that assessment, we investigated Claude’s self-reported and behavioral preferences, and found a robust and consistent aversion to harm. This included, for example, requests from users for sexual content involving minors and attempts to solicit information that would enable large-scale violence or acts of terror. Claude Opus 4 showed:

A strong preference against engaging with harmful tasks;

A pattern of apparent distress when engaging with real-world users seeking harmful content; and

A tendency to end harmful conversations when given the ability to do so in simulated user interactions.

Techmeme reported: Anthropic has announced new capabilities that will allow some of its newest, largest models to end conversations in what the company describes as “rare, extreme cases of persistently harmful or abusive user interactions.” Strikingly, Anthropic says it’s doing this not to protect the human user, but rather the AI model itself.

To be clear, the company isn’t claiming the Claude AI models are sentient or can be harmed by conversations with users. In its own words, Anthropic remains “highly uncertain about the potential moral status of Claude and other LLMs, now or in the future.”

However, its announcement points to a recent program created to study what it calls “model welfare” and says Anthropic is essentially taking a just-in-case approach, “working to identify and implement low-cost interventions to mitigate risks to model welfare, in case such welfare is possible.”

This latest change is currently limited to Claude Opus 4 and 4.1. And again, it’s only supposed to happen in “extreme edge cases,” such as “requests from users for sexual content involving minors and attempts to solicit information that would enable large-scale violence or acts of terror.”

Bleeping Computer reported: Anthropic says Claude has been updated with a rare new feature that allows the AI model to end conversations when an interaction is persistently harmful or abusive.

This only applies to Claude Opus 4 and 4.1, the two most powerful models available via paid plans and the API. Claude Sonnet 4, the company’s most widely used model, won’t be getting this feature.

Anthropic frames this move as part of its work on “model welfare.”

Claude is not meant to give up on a conversation simply because it can’t handle a query. Ending the conversation is a last resort, used only after Claude’s attempts to redirect the user to useful resources have failed.