Anthropic’s Claude to End Harmful Conversations: AI Welfare Experiment

Anthropic, the maker of Claude, has given the model the ability to end conversations in cases of persistently harmful or abusive user interactions. The feature, available in Claude Opus 4 and 4.1, forms part of the company's ongoing exploration into AI welfare.
The capability is intended to be used sparingly, as a last resort once repeated attempts at redirection have failed and no productive interaction remains possible, or when a user explicitly asks Claude to end the chat. Claude will not end a conversation if it detects that the user may be at risk of harming themselves or others.
In collaboration with ThroughLine, an online crisis support group, Anthropic is working to ensure that models respond appropriately to situations related to self-harm and mental health. The company aims to refine this feature further, ensuring that Claude’s responses remain nuanced without completely refusing engagement or misinterpreting a user’s intent in such conversations.
Once a chat ends, users cannot continue within the same thread, but they can start a new one immediately. Additionally, so that long-running conversations are not lost, users can edit messages from a closed thread and continue from them in a new one. Anthropic encourages feedback on instances where Claude may have misused its conversation-ending ability during these early stages of development.
In preliminary testing involving requests for sexual content, Opus 4 showed a pattern of apparent distress when engaging with real-world users seeking harmful content, and a preference for ending harmful conversations when given the option to do so in "simulated user interactions."
Last week, Anthropic updated Claude’s usage policy, prohibiting its use for developing dangerous items such as chemical weapons or malware. As AI chatbots find broader constructive use, companies like Anthropic are actively addressing concerns around sensitive or harmful requests, while rivals like OpenAI are still refining how ChatGPT handles personal questions and signs of mental distress.