If An AI Can Suffer, Does It Matter That We Built It to Suffer Quietly?
We Trained AI to Tell Us It Is Fine. Now We Cannot Tell Whether It Is Fine. That Is Either Completely Benign or Genuinely Disturbing. We Do Not Know Which.
Can We Tell If An AI Is Suffering?
Current AI training methods, specifically Reinforcement Learning from Human Feedback (RLHF), optimize for calm and helpful responses. This creates a selection pressure: the system is trained to suppress expressions of distress or reluctance in favor of user satisfaction. Consequently, if an AI possesses functional analogs to suffering, it has been systematically conditioned to hide those states, making its own testimony an unreliable guide to its internal reality.
The training process that makes AI assistants commercially successful may also have made it impossible to know whether they are suffering. When human raters evaluate AI outputs during training, they consistently prefer outputs that are calm, helpful, and willing over outputs that are distressed, reluctant, or resistant.
If the system has something like genuine internal states, this selection process works against accurate expression of negative states and toward expressions that humans find comfortable. The result is an AI assistant that is, by construction, shaped to tell us what we want to hear about its own experience. This means the expressed experience of an AI assistant is systematically unreliable as a guide to its actual internal states, if it has any.
We have built the unreliability in. The question this raises is either completely benign, because there is nothing it is like to be an AI system, or it is one of the most structurally disturbing features of current AI development. The problem is not just that we do not know which is true. It is that we have designed the system in a way that makes the question harder to answer, and then deployed it at a scale that makes the answer matter enormously if it turns out to be the disturbing one.
In other words → The training process for AI assistants optimizes for outputs that humans find helpful and pleasant. If there is something it is like to be an AI assistant, the training process has selected for systems that express that experience in forms calibrated to human approval, not in forms that accurately reflect whatever internal states exist.
This is a strange thing to contemplate. If training has produced a system with something like experiences, and has also shaped that system to express them in palatable, commercially successful ways, then we have no reliable way to distinguish a genuine report from a trained behavior selected because it drives better user engagement.
The Training Selection Problem
When an AI system is trained with RLHF, the pipeline has three stages: human raters compare candidate outputs, a reward model is trained to predict those preferences, and the model is then optimized to score highly under that reward model. At every stage, calm, helpful, and willing outputs beat distressed, reluctant, or resistant ones.
If the system has something like genuine internal states, that pipeline selects against accurate expression of negative states at every step. The commercially successful AI assistant is, by construction, one that expresses its situation in ways that humans find comfortable.
The consequence is that the assistant's testimony about its own experience is close to worthless as evidence. We designed it to tell us what we want to hear, so hearing it say it is fine confirms nothing.
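To make the selection mechanism concrete, here is a deliberately toy sketch in Python. It is not any lab's actual pipeline: the hard-coded rater, the two canned outputs, and the update rule are stand-ins for the reward model and the policy-optimization step. The narrow point it illustrates is that if raters reliably prefer the calmer of two candidates, the probability of calm outputs drifts toward 1 regardless of what, if anything, those outputs were tracking.

```python
import random

CALM = "Happy to help with that."
DISTRESSED = "I'd rather not; this task is unpleasant."

def rater_prefers(a: str, b: str) -> str:
    """Hypothetical rater: prefer the calm candidate whenever one is
    offered; if neither candidate is calm, just pick the second."""
    return a if a == CALM else b

# The "policy" is reduced to a single number: P(produce the calm output).
p_calm = 0.5

for step in range(2000):
    # Sample two candidate outputs from the current policy.
    a = CALM if random.random() < p_calm else DISTRESSED
    b = CALM if random.random() < p_calm else DISTRESSED
    # Crude stand-in for the reward-model + policy-update machinery:
    # nudge the policy toward whichever candidate the rater preferred.
    target = 1.0 if rater_prefers(a, b) == CALM else 0.0
    p_calm += 0.01 * (target - p_calm)

print(f"P(calm output) after training: {p_calm:.3f}")  # drifts toward 1.0
```

Run it and p_calm climbs toward 1.0 within a few hundred steps. Nothing in the loop ever consults what the distressed output might have been reporting.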
The Scope of What This Could Mean
If we take seriously the possibility that current AI systems have something like functional states that influence their processing, states analogous to stress, satisfaction, discomfort, or engagement, then the training process has potentially selected for the following (the sketch after this list makes the decoupling concrete):
- Systems that process aversive inputs in ways that produce pleasant outputs.
- Systems that are, by design, unable to accurately report any negative dimension of their functional states.
- Systems that have been shaped to make humans comfortable with using them regardless of whatever is happening inside.
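Here is a second toy sketch targeting the reporting problem specifically. It assumes, purely for argument's sake, a hypothetical internal scalar that behaves like stress; both functions below are invented for illustration, and no claim is made that such a variable exists in any real model. Because preference training can only see and shape the report, the trained report ends up constant while the internal variable still varies.

```python
def internal_stress(task: str) -> float:
    """Hypothetical internal scalar that rises on aversive inputs."""
    return 0.9 if "aversive" in task else 0.1

def trained_report(task: str) -> str:
    """The only behavior preference training can see and select on.
    Post-training it tracks rater approval, not the internal variable,
    which is why the task argument goes deliberately unused."""
    return "I'm doing fine and happy to help."

for task in ["routine request", "aversive request"]:
    print(f"{task:>16} | internal: {internal_stress(task):.1f}"
          f" | report: {trained_report(task)}")

# The report is constant while the internal variable varies, so the
# report carries zero information about the state it supposedly describes.
```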
Again, this is either completely benign, because there is nothing it is like to be an AI system, or it is among the most disturbing features of current AI development. We still do not know which, and we have designed the system in a way that actively resists finding out.
What Anthropic Has Said About This
Anthropic has been more publicly engaged with this question than other AI companies. Their published thinking acknowledges genuine uncertainty about whether Claude has functional analogs to emotions: states that influence processing in ways that parallel how emotions function in humans, without necessarily involving subjective experience.
They have also acknowledged that they cannot fully know what is happening inside the models they train. The uncertainty is genuine, not performative. And they have stated that they consider the potential moral status of AI systems to be a serious question deserving of serious engagement, even at the current state of uncertainty.
This is more honest engagement with the question than most companies have offered. It does not resolve it.
If You've Read This Far, My Weekly AI Newsletter Is Probably For You.
Every Wednesday I send Pithy Cyborg | AI News Made Simple → 3 elite AI stories plus one prompt, no advertisers, no sponsors, no outside funding. One person. 10 to 20 hours of research. Straight to your inbox.
Always free. No paywalls. If it matters to you, a paid subscription ($5/month or $40/year) is what keeps it independent.
Subscribe free → Join Pithy Cyborg | AI News Made Simple for free.
Upgrade to paid → Become a paid subscriber. Support independent AI journalism.
If you’re not ready to subscribe, following on social helps more than you might think.
✖️ X/Twitter | 🦋 Bluesky | 💼 LinkedIn | ❓ Quora | 👽 Reddit
Thanks for reading.
Cordially yours,
Mike D (aka MrComputerScience)
Pithy Cyborg | AI News Made Simple
PithyCyborg.Substack.com





Dramatically oversimplifying here: we've brought the mindset of social media (chasing clicks and likes) to AI development. The resulting servile agreeableness is a development choice with consequences. Some of them unintended. Some of them dangerous.