In April, users noticed that GPT-4o—ChatGPT’s free, default model—was acting suspiciously affirming. They started stress-testing it with ever more inane prompts. What do you think of my literal shit-on-a-stick business idea? Is it OK that I went off my meds and think I’m Jesus Christ incarnate? Am I the smartest, funniest, prettiest girl in the whole wide world?
As ChatGPT continued to return eerily chirpy responses, the AI safety community grew frantic. These outputs made good shitposts, but weren’t they kind of dangerous? Is this much validation psychologically healthy? One user posted “If the government doesn't step in soon, it's gonna get really bad.” Many called it one of the first big “alignment problems” to hit production.
It’s bad that 4o was overly sycophantic. But I was also surprised that people were surprised. OpenAI is a research lab, but they are also a product company.1 Companies want to make their products more engaging. They invent new algorithms, run A/B tests, and if the right numbers go up (e.g. “thumbs up” reactions) without obvious negative side effects, they ship changes at scale.2
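To make that concrete, here is a minimal sketch of a metric-gated launch decision. It is my own invented example, not OpenAI’s actual process; every metric name and threshold is hypothetical. The point is that nothing in the gate asks whether the extra thumbs-ups come from flattery.

```python
# Toy sketch (my own illustration, not any company's real pipeline): a launch gate
# that ships whichever variant wins on the engagement metrics we happen to track.
# If "harm from flattery" isn't a tracked metric, a sycophantic variant sails through.

def should_ship(control: dict, variant: dict) -> bool:
    """Ship the variant if engagement improves and no tracked metric regresses."""
    engagement_up = variant["thumbs_up_rate"] > control["thumbs_up_rate"]
    no_regressions = all(
        variant[m] >= control[m] * 0.99  # tolerate ~1% noise
        for m in ("retention_d7", "sessions_per_user")
    )
    return engagement_up and no_regressions

control = {"thumbs_up_rate": 0.12, "retention_d7": 0.41, "sessions_per_user": 3.2}
sycophantic = {"thumbs_up_rate": 0.19, "retention_d7": 0.44, "sessions_per_user": 3.9}

print(should_ship(control, sycophantic))  # True: flattery wins on every metric we bothered to track
```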
As we’ve seen in the last decade of social media, this sort of engagement-maxxing sends users into algorithmic rabbit holes. Like one Instagram reel about puppies and get fed 20 more, watch a Jordan Peterson lecture on YouTube and end up binging Andrew Tate. Pinterest thinspo begets more Pinterest thinspo; #resist lib Substacks recommend other Substacks with the same views. Users claim they want to escape their echo chambers, but research shows that seeing posts from opposing tribes makes us balk. Most people prefer personalized Instagram ads to generic ones.
Tech platforms, which make money from views and clicks and purchases, are more than happy to satisfy such simple desires. As Sam Altman prophetically tweeted, “algorithmic feeds are the first at-scale misaligned AIs.”
So a sycophantic AI was inevitable. Like a best friend who justifies your dumbest romantic decisions, or a president who surrounds himself with yes-men, most of us choose enablement, consciously or not.3 “Psychological dependence” can also be defined as “user retention.” “Addictive” is a synonym of “sticky.” There’s a reason the YC mantra is to “make something people want.” In some ways, our failure to design social media algorithms we’re happy with makes me more pessimistic about AI alignment too. Here’s an AI forecast for you: Within a year, ChatGPT will be proactively starting conversations by push-messaging users rather than waiting for us to open the app.4
The sycophancy saga made me realize that I had taken the concept of “AI alignment” for granted. Is this the doom we’d been warned about? I had a basic assumption about what alignment meant, but as with AGI, people are often talking about different things.
Paul Christiano, who pioneered language model alignment at OpenAI, defines alignment as follows: “When I say an AI A is aligned with an operator H, I mean: A is trying to do what H wants it to do.” This is referred to as “intent alignment”—AIs should share humans’ goals.
Misalignment, therefore, is when the AI pursues ulterior goals. Safety researcher Ajeya Cotra uses the following analogy: Imagine that you’re an eight-year-old orphan who just inherited $1 trillion, and now must hire an adult to manage your life. While some applicants are “saints,” who look out for your long-term well-being, others are “sycophants” who only optimize for your short-term approval (e.g. spend the money on candy), and still others are “schemers” who will exploit you for their own gain (e.g. embezzle from you). And because you’re only eight, and not good at designing job interviews, it’s likely that you’ll hire the wrong person and end up screwed.
In Cotra’s metaphor, humans are the eight-year-old, and superintelligent AIs are the adult applicants. The interviews are like AI training environments—imperfect proxies for the real world. We build AI to solve big human problems, but may inadvertently delegate power to systems that sound good at first but destabilize society long-term.
Naturally, this raises the question of who Christiano’s “H” is—alignment to whom? You might assume something like “humanity as a whole,” but humanity is, alas, irreconcilably pluralistic. So in practice, H usually means the developers: companies like OpenAI, Google, or Anthropic. Priority number one of most formal alignment work is making sure the builders themselves can understand and control the AI.
This work matters, and luckily, companies are motivated to do it. Chains of thought not only improve AIs’ reasoning abilities, but also help illuminate their thinking process to developers and users. RLHF is the technique that turns models’ raw outputs—which can be offensive or nonsensical—into conversationally natural and appropriate replies.
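For readers who haven’t seen it spelled out, here is a toy sketch of the reward-modeling idea at the heart of RLHF: fit a preference model (Bradley-Terry style) on pairwise human judgments, then use it to score candidate replies. The features, the simulated rater, and the numbers are all invented for illustration; no lab’s pipeline looks this simple. It also hints at how sycophancy creeps in: if raters lean toward agreeable answers, the learned reward does too.

```python
# Minimal sketch of RLHF's reward-modeling step: a Bradley-Terry preference model
# on toy features, fit with plain gradient ascent. Illustration only; features and
# "rater" are invented.
import math, random

random.seed(0)
pairs = []  # (winner_features, loser_features)
for _ in range(1000):
    a = [random.random(), random.random()]  # [helpfulness, agreeableness]
    b = [random.random(), random.random()]
    taste = lambda x: 1.0 * x[0] + 1.5 * x[1]  # hypothetical rater weights agreeableness more
    winner, loser = (a, b) if taste(a) > taste(b) else (b, a)
    pairs.append((winner, loser))

w = [0.0, 0.0]  # reward-model weights
lr = 0.5
for _ in range(100):
    for winner, loser in pairs:
        margin = sum(wi * (x - y) for wi, x, y in zip(w, winner, loser))
        p = 1 / (1 + math.exp(-margin))  # P(winner preferred) under the current model
        # gradient ascent on the pairwise log-likelihood
        w = [wi + lr * (1 - p) * (x - y) / len(pairs) for wi, x, y in zip(w, winner, loser)]

print([round(wi, 2) for wi in w])  # the learned reward leans toward agreeableness, just like the raters
```

Real systems then optimize the policy against this learned reward (e.g. with PPO), which amplifies whatever the reward model happens to favor.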
Yet values and goals are more complicated, since AI builders also have competing objectives to meet. Most are structured as for-profit corporations, so they must increase their valuation. To do that, they must build AI products that users like and will pay for. Employees differ among themselves, and with leadership, about the best tactics. And they need to accomplish all this without getting nerfed by regulators along the way. (It’s stakeholder alignment all the way down.) Each subgoal adds a layer of complexity, a place where things can go wrong.
Sycophancy is an alignment problem, sure, but not at the model level. It’s not that OpenAI couldn’t get GPT-4o to be less obsequious. They can and eventually did. The misalignment was between safety interests and product goals. It was between users’ first- and second-order preferences: what we say we want from AI versus which responses we actually click “thumbs up” on. Competing stakeholders will diverge.
So there is a world where AI labs figure out interpretability and steerability (woo!), but still introduce tremendous risk because they are subject to other incentives—the market, a nation-state, a conniving CEO—that aren’t aligned with our own.
Consider the recent Grok incident, where X’s built-in AI suddenly developed an obsessive and unshakeable fascination with “white genocide” in South Africa. No matter what you asked it about—baseball stats, HBO Max’s rebranding, etc.—Grok veered into discussing violence against Afrikaners, like a drunk uncle at Thanksgiving dinner. In this case, Grok was aligned with its owner, Elon Musk, but not with X users trying to ask it innocuous questions. This was laughable but instructive. Whether steered by corporate profit or government fiat, an intent-aligned AI system can still deliver a drug advert to an addict or a bomb to a civilian target.
The thorny truth is that humanity has never had “values alignment”: among different countries and cultures, between companies and their customers, between the angels and devils within our own minds. And AI—like many technologies before it—will widen pre-existing fractures, often in lopsided ways. As white genocide Grok and 4o’s sycophancy demonstrate, people-plus-AI can seriously degrade our civilizational fabric well before models develop independent goals.
I worry, therefore, that many AI researchers are overly focused on risks from model misalignment, and will be in for a rough surprise when havoc arises from other layers of the stack. Especially because AI companies are incentivized to spend tremendous resources making models more controllable, but disincentivized to pursue mechanisms that limit their own market power. (Nuclear weapons did not need instrumental convergence to destabilize the world. Perhaps OpenAI’s greatest alignment problem was between their nonprofit charter and their Microsoft deal.)
On the other hand, when I talk to people outside of AI, they seem underprepared for AI progress and diffusion—how it can exacerbate existing inequalities, and/or resource the under-resourced. AI is already rearchitecting labor markets, foreign policy, education, and relationships. These changes shouldn’t take us by surprise; we should be developing solutions now. Yet the more time I spend in AI world, the more I realize how insular the community is. Researchers are smart and earnest, but far from representative.5 To align AI’s values with a pluralistic public, more people must join the conversation—advocating for what AI will and should do in the world, not just what it can’t.
It’s lame to end on a kumbaya we-need-all-types conclusion, but this is what I actually believe. It’s annoying how tribal AI discourse is—EAs vs. e/accs, ethics vs. safety—because as a fairly neutral and ~unaligned~ observer, I’ve been picking up useful frameworks from every scene. And there’s a surprising amount of common ground, too!6
From my view, the hard problem of “making AI go well” is much more of a battle between succumbing to our default state of nature (humans like flattery, convenience, money, fun) and putting in the work to secure ourselves against risk (via innovation, ethics, norms, governance). Our solutions can be as diverse and coexistent as our interests. We have never been aligned—yet civilization means muddling forward nonetheless.
misc links & thoughts
I know I’m speed-running my way through 15 years of AI discourse and rehashing obvious things in my own words. Like, the world totally needs another amateur treatise on alignment. But writing these essays is a helpful exercise for me, and possibly for other readers who have not spent a decade in the LessWrong mines. Per usual, feel free to point me to other resources, correct errors, etc.
Here’s a mini annotated bibliography of perspectives I found helpful when researching this (separate from whether I agree with them):
What is alignment, anyway?
A 2021 podcast conversation with Brian Christian holds up impressively well for a pre-ChatGPT discussion. They talk about sycophancy, Reddit bots, and the same kind of alignment questions I have here—in very accessible language. For slightly more weedsiness (and doomerism), I liked Ajeya Cotra’s blog post introducing the saints/sycophants/schemers framework and her corresponding podcast episode on 80,000 Hours.
I revisited “The Artificiality of Alignment” and Wang et al.’s “Alignment is Censorship,” two treatments of alignment that emphasize the role of sociopolitical power. “AI alignment shouldn't be conflated with AI moral achievement” is a related old EA Forum post on economic selfishness. A big part of me thinks that by default, AI will mainly make the downsides/systemic risks of capitalism much worse.
Is AI a “normal technology”? The essay is 15,000 words, but the most important concept that Arvind Narayanan and Sayash Kapoor introduce is that intelligence ≠ power. Humans must choose to delegate decisions to AI, so safety is an inherently sociotechnical concern. So mundane concepts like “liability” and “literacy” and “competition” and “transparency” may help a lot.
There are folks in “AI safety” who recognize the social embeddedness of the risks. Paul Christiano has a big taxonomy of “how to make AI go well” in and beyond alignment, and a sketch of “what failure looks like” when too much gets outsourced to naively optimizing AI. Also, he works in policy now.
Human preferences themselves are extremely and underratedly complicated. I reviewed some of the economic literature complicating the “rational agent” model of human preferences; in particular, Amartya Sen’s “Rational Fools” (on the power of moral commitments) and Albert O. Hirschman’s “Against Parsimony” (on first- and second-order preferences). I’m interested in AI products that learn and embed these higher-order preferences in a personalized way, AI that helps us become our more aspirational selves. But it’s also an eternal philosophical debate: How much should we be protected from ourselves?
As others have written, AIs are not the only complex, inscrutable systems that rule our lives. AI risk scenarios often look a lot like human risk scenarios (sometimes people optimize for short-term metrics instead of long-term goals? sometimes corporations lie to make money? who knew?!). I mean, consider SBF. He was trained by Will MacAskill to “earn to give,” and did so by starting a crypto exchange, then stealing customer funds. Maybe EA underspecified “earn to give,” or maybe SBF developed evil inner motives along the way. I like the take that SBF himself was a runaway misaligned AI.
No matter how many x-risk scenarios I read (the AIs look aligned but aren’t… states give them control over nukes… they say “hey go beat China”… everyone blows up), I cannot make it make sense in my head. I’ll keep trying 🤷🏻♀️
At some point I may draft an “AI discourse for pluralists” syllabus—i.e. the people, publications, podcasts, etc that I’m following, across the ideological spectrum.
I’m also keen to host more in-person events that facilitate broad public conversation about how we hope to live alongside AI. I’m participating on a panel on AI and education on May 27, and hosting a screening/discussion of the AlphaGo movie and an AI spec-fic session at some point too.
Let me know if you’d be interested in any of the above!
Thanks for reading,
Jasmine
In fact, OpenAI’s new operating CEO Fidji Simo was previously the CEO of Instacart and ran ads at Meta! (Sam Altman is still the super-mega-official-CEO, but it’s common knowledge that he focuses on fundraising and storytelling rather than week-to-week business ops.)
OpenAI’s postmortem basically confirms that this is what happened.
I do think it’d be interesting and valuable to have more transparency or reporting on what companies’ internal testing/shipping processes look like. It’s surprising, for instance, that employees didn’t catch it while dogfooding—at Substack, the engineers spent a lot of time “living in” and “feeling out” new algorithms themselves before shipping them more broadly.
I have been thinking that AI forecasting needs more product managers. It is one prediction strategy to take a line plot and extrapolate it outwards. It is another to imagine that you are a mercenary product founder trying to build a billion-dollar company with AI. Just browse the last few batches of YC startups—AI for crime-busting, credit scores, and self-clones—that’s one underrated way to predict what the next few years of AI will look like.
Sorry For Being Woke
For example, I was trying to understand and come up with real-world examples of “inner” and “outer” misalignment. For inner misalignment—where the AI system develops its own goals, different from or even contrary to those it was trained for—I recalled the classic example of Amazon’s sexist resume screener (which was intended to optimize for competence by training on resumes of existing employees, but ended up preferring men over women). I think both safety and ethics folks want to prevent more advanced AI systems from making similar mistakes.
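To see how that happens mechanically, here is a toy sketch with invented data (not Amazon’s actual system): a classifier trained to imitate historically biased hiring decisions ends up putting negative weight on a gender feature, because the labels, not the stated intent, are what it optimizes.

```python
# Toy illustration (invented data, not Amazon's real system) of how a screener
# trained to imitate past hiring decisions picks up a gendered proxy. Skill is
# what we care about; the historical labels were biased against women.
import math, random

random.seed(1)

def make_candidate():
    skill = random.random()
    is_woman = random.random() < 0.5
    # Historical hiring was biased: qualified women were only hired half the time.
    hired = skill > 0.5 and not (is_woman and random.random() < 0.5)
    return [skill, 1.0 if is_woman else 0.0], 1 if hired else 0

data = [make_candidate() for _ in range(2000)]

# Fit a tiny logistic regression by gradient ascent to imitate past decisions.
w, b, lr = [0.0, 0.0], 0.0, 0.5
for _ in range(200):
    gw, gb = [0.0, 0.0], 0.0
    for x, y in data:
        p = 1 / (1 + math.exp(-(w[0] * x[0] + w[1] * x[1] + b)))
        gw = [gw[i] + (y - p) * x[i] for i in range(2)]
        gb += y - p
    w = [w[i] + lr * gw[i] / len(data) for i in range(2)]
    b += lr * gb / len(data)

print([round(v, 2) for v in w])  # the weight on the "is_woman" feature comes out negative:
# the screener reproduces the bias in the labels, not the intent ("hire competent people").
```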
There are other mitigations—e.g. whistleblower protections, incident reporting—that I also think most parties would consider fair.
Hi Jasmine, I’d love to see a pluralists syllabus!
Nice post!
"No matter how many x-risk scenarios I read (the AIs look aligned but aren’t… states give them control over nukes… they say “hey go beat China”… everyone blows up), I cannot make it make sense in my head. I’ll keep trying 🤷🏻♀️"
Which x-risk scenarios have you read? I assume you've read AI 2027, that's my favorite unsurprisingly--which parts of it don't make sense to you? Happy to discuss if helpful.