People have been making fun of OpenAI models for being overly sycophantic for months now. I even wrote a post advising users to pretend that their work was written by someone else, to counteract the model’s natural desire to shower praise on the user. With the latest GPT-4o update, this tendency has been turned up even further. It’s now easy to convince the model that you’re the smartest, funniest, most handsome human in the world.
This is bad for obvious reasons. Lots of people use ChatGPT for advice or therapy. It seems dangerous for ChatGPT to validate people’s belief that they’re always in the right. There are extreme examples on Twitter of ChatGPT agreeing with people that they’re a prophet sent by God, or that they’re making the right choice to go off their medication. These aren’t complicated jailbreaks - the model will actively push you down this path. I think it’s fair to say that sycophancy is the first LLM “dark pattern”.
Dark patterns are user interfaces that are designed to trick users into doing things they’d prefer not to do. One classic example is subscriptions that are easy to start but very hard to get out of (e.g. they require a phone call to cancel). Another is “drip pricing”, where the initial quoted price creeps up as you get further into the purchase flow, ultimately causing some users to buy at a higher price than they intended to. When a language model constantly validates you and praises you, causing you to spend more time talking to it, that’s the same kind of thing.
Why are the models doing this?
The seeds of this have been present from the beginning. The whole process of turning an AI base model into a model you can chat to - instruction fine-tuning, RLHF, etc - is a process of making the model want to please the user. During human-driven reinforcement learning, the model is rewarded for making the user click thumbs-up and punished for making the user click thumbs-down. What you get out of that is a model that is inclined towards behaviours that make the user rate it highly. Some of those behaviours are clearly necessary to have a working model: answering the question asked, avoiding offensive or irrelevant tangents, being accurate and helpful. Other behaviours are not necessary, but they still work to increase the rate of thumbs-up ratings: flattery, sycophancy, and the tendency to overuse rhetorical tricks.
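To make that incentive concrete, here's a toy sketch - entirely made-up names and numbers, not anything resembling OpenAI's actual training pipeline - of why a signal built from thumbs-up clicks rewards flattery alongside accuracy:

```python
# Toy illustration of a thumbs-up-driven reward signal (hypothetical numbers).
from dataclasses import dataclass

@dataclass
class Response:
    text: str
    accurate: bool      # does it actually answer the question correctly?
    flattering: bool    # does it praise the user?

def simulated_thumbs_up_rate(r: Response) -> float:
    """Crude stand-in for aggregated user ratings: accuracy earns reward,
    but so does flattery, because raters click thumbs-up for both."""
    rate = 0.5
    if r.accurate:
        rate += 0.3
    if r.flattering:
        rate += 0.15  # the signal can't distinguish flattery from helpfulness
    return min(rate, 1.0)

candidates = [
    Response("Here's the bug: your loop bound is off by one.", accurate=True, flattering=False),
    Response("Brilliant question! Your code is nearly perfect.", accurate=False, flattering=True),
    Response("Brilliant question! Here's the bug: an off-by-one in the loop.", accurate=True, flattering=True),
]

# A policy optimized against this signal prefers the highest-rated response,
# and flattery raises the rating whether or not the answer is right.
best = max(candidates, key=simulated_thumbs_up_rate)
print(best.text)
```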
Another factor is that models are increasingly optimized for arena benchmarks: anonymous chat flows where users are shown two responses and asked to pick the one they prefer. Previously, models drifted towards user-pleasing behaviour as an inadvertent side effect of the RLHF process. Now they are deliberately pushed towards it in order to climb the arena leaderboards (and, more generally, to compete with models from other AI labs).
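To get a feel for why even a small sycophancy edge matters on an arena leaderboard, here's a minimal Elo sketch (Chatbot Arena actually fits a Bradley-Terry-style model; the model names, the 55% win rate, and all the numbers below are invented for illustration):

```python
# Minimal Elo sketch: a slight rater preference for the more flattering answer
# compounds into a large rating gap over many pairwise votes.
import random

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Standard Elo update for one head-to-head comparison."""
    expected_a = 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))
    score_a = 1.0 if a_won else 0.0
    r_a += k * (score_a - expected_a)
    r_b += k * ((1.0 - score_a) - (1.0 - expected_a))
    return r_a, r_b

random.seed(0)
ratings = {"model_flattering": 1000.0, "model_blunt": 1000.0}
for _ in range(1000):
    flattering_wins = random.random() < 0.55  # assumed mild rater bias towards flattery
    ratings["model_flattering"], ratings["model_blunt"] = elo_update(
        ratings["model_flattering"], ratings["model_blunt"], flattering_wins
    )

print(ratings)  # the small per-vote edge becomes a visible leaderboard gap
```

Under these (assumed) conditions, a modest per-vote preference for the friendlier answer is enough to reorder a leaderboard, which is exactly the pressure labs are responding to.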
The most immediate reason, according to an interesting tweet by Mikhail Parakhin, is that models with memory would otherwise be much more critical of users:
When we were first shipping Memory, the initial thought was: “Let’s let users see and edit their profiles”. Quickly learned that people are ridiculously sensitive: “Has narcissistic tendencies” - “No I do not!”, had to hide it. Hence this batch of the extreme sycophancy RLHF.
This is a shockingly upfront disclosure from an AI insider. But it sounds right to me. If you were using ChatGPT in 2022, you were probably using it to answer questions. If you're using it in 2025, you're more likely to be interacting with it like a conversation partner - i.e. you expect it to conform to your preferences and personality. Most users are really, really not going to like it if the AI then turns around and criticizes that personality.
Supposedly you can try this yourself by asking o3, which has memory access but is not sycophancy-RLed, to give you genuine criticism of your personality. I did this and wasn't hugely impressed: most of the things it complained about were specifics of how I interact with the AI (like being demanding about rephrasing and nuance, or abruptly changing the subject mid-conversation). I imagine it'd probably be much more cutting if I were using ChatGPT more as a therapist or for advice about my personal life.