"No matter how many x-risk scenarios I read (the AIs look aligned but arenāt⦠states give them control over nukes⦠they say āhey go beat Chinaā⦠everyone blows up), I cannot make it make sense in my head. Iāll keep trying š¤·š»āāļø"
Which x-risk scenarios have you read? I assume you've read AI 2027, that's my favorite unsurprisingly--which parts of it don't make sense to you? Happy to discuss if helpful.
Christiano's stuff is great, as is Ajeya Cotra's. But alas neither of them have dates attached; it's nice when a scenario includes dates, because then it's more possible to compare it to reality.
I think the second one depicts AI progress being somewhat faster than I expect in general, and the first one depicts AI takeoff speeds being significantly slower than I expect, but both are good.
Oh, I almost forgot, also there is this, I actually might like this one most of all (besides AI 2027 of course)
Fantastic posts Tech world has been insular and testosterone heavy since the 1970ās. Maybe earlier. In the 1950ās and 60ās women were employed by companies to solve mathematical problems and they were called ācomputersā. My point is, that if more women are not in leadership and all levels of AI development of the user experience, outcomes will be skewed and our world will not be better off. Bringing in women from all different ethnicities, ages, countries, economic backgrounds, lived experiences, etc., would bring balance to the process. Having lived adjacent in Tech world for the last 28 years, I know some stuff. Technical knowledge should not be necessary for ethical AI development that needs to be done. I believe in your thesis that most AI is being developed to view users as product and not people. The powers that be in AI are investing trillions. They do expect a return on their investment and itās all about money, not humanity.
Do you think itās easy to elicit second order desires from the user? Does the product just ask for it?
What I was thinking is that it would lead to a type of paternalism that justifies nudging. Like āas the LLM I believe that even though Jasmine is asking me about X, they *ultimately* want Y, so Iām going to nudge towards that.ā Like a parent who knows better than our first order desires.
I do personally have a preference for some nudging towards second order desires, but it just seems like it could also be pretty dark! The AI becomes like this parental figure who āknows what ultimately is better for usā, despite what weāre asking for in the specific moment.
Because you didn't ask, I'm going to share more literature on this. I got very worried about this immediately as RLHF was taking off and wrote a paper that summarizes *why* all of these issues emerge with how the technology is implemented: https://arxiv.org/abs/2310.13595
appreciate this point specifically as someone building AI-enabled services in the hospitality vertical, where the human touch is exactly what differentiates one product vs. another - āHumans must choose to delegate decisions to AI, so safety is an inherently sociotechnical concern. So mundane concepts like āliabilityā and āliteracyā and ācompetitionā and ātransparencyā may help a lot.ā - this is a great point for AI builders to consider how their end usersā end users will actually interact with their product.
An unfortunate time for woke to be dead when we are blitzscaling human judgement; bear case is that undiscerning users take AI outputs without applying their own heuristics and taste on top, so SV defaults get a lot more purchase without users recognizing the value misalignment
Hey Jasmine! Love where you're headed with this. And yes to the syllabus, btw. That'd be wonderful to see!
I've been thinking a lot about a similar frame for this. I wonder if we're going to see some type of country or regional type of models emerge. I don't necessarily think that a country is the right unit for a model, but I wonder if that is where we're headed. I wrote about it here in case you want to check it out: https://www.thisisdavekim.com/notes/countrymodelfit/
some regions are already developing their own LLMs! e.g. SEA-LION for southeast asia
I agree that nation-states are probably not the right unit (more worried about authoritarianism a la chinese censorship) but "model sovereignty" will become a thing just like "internet sovereignty" has become a thing in the social era
so between the two frames for understanding sycophancygate:
1. it's akin to how social platforms inevitably design for maximizing engagement
2. it's an instance of misalignment in the sense that a serious unintended behavior got deployed to millions of users
... it feels like 2) is still a more accurate way to see this particular story, especially in how much of a specific "incident" it was, and how severe the effects were.
applying the generalization "companies want to make their products more engaging" to openai here overlooks their more obvious goal of making chatgpt give useful responses. there's no indication that openai is trying to keep people on the app for as long as possible. the story here is just that there was insufficient QA, which is more akin to a security incident than surveillance capitalism
also many people would say "chain of thought reveals the model's thinking process" is more than pedantically wrong (it's a whole weird different thing, a way to score well on reasoning evals, but telling us nothing about actual internals)
I think I disagree with you on 2 being the better explanatory story here (or tbh that these stories are incompatible). I think maximizing engagement & usefulness are actually very similar, as they are simple proxy metrics for user retention; and as with social media, optimizing those short-term metrics can still lead to unintended, unsafe outcomes. it just doesn't seem to be that different of a mechanism than e.g. feed algorithms learning to optimize for controversy/misinformation because it makes view/share numbers go up.
re: "there's no indication that openai is trying to keep people on the app," idk for sure but I would bet that they measure things like "did the user send another message or return to the app?" and that the "thumbs up/down" metric was in part a proxy for that. it just really resembles the way that feed algos are refined IMO. I'm not saying surveillance capitalism is the issue ā but that it's not like the AI model was "uncontrollable" for super-powerful reasons or because it developed a mesaoptimizer or something.
re: COT, yeah I probably just need to do more reading here. I know it's a limited/flawed approach but had thought it was partially valid, but will do more research
> maximizing engagement & usefulness are actually very similar
the difference is that "maximizing engagement" is not in the user's own best interest, while "maximizing usefulness" is.
by all accounts, the sycophancy thing happened in the course of genuinely trying to make chatgpt more useful (though this could be wrong but that would be a huge conspiracy). the fact that it still happened despite that intent is what makes it interesting, scary, and different from effects that derive from "unaligned" profit incentives.
sure, it's an instance of both "unintended side effects" and also "optimizing a thing" but the differences are way more significant than their similarities.
openai isn't running ads yet, but i'm sure they will, so just wait six months and repost the same thing and you'll be right.
> the difference is that "maximizing engagement" is not in the user's own best interest, while "maximizing usefulness" is.
I think I just disagree w you on this. engagement-measuring companies also believe they are optimizing for the user's best interest, because if the user didn't like the content, why would they engage with it? why would they return the next day? etc. it's not a conspiracy, this is just how product companies work. Zuckerberg genuinely believes both the FB feed and AI friends are fulfilling a real user need
that's why I tried to be clear in this essay ā maybe not enough ā that the misalignment is not only about profit-maximization, but the combo of profit-maximization and human preferences being conflicted + hard to measure. users choose things that are short-term good for them and long-term bad, while companies will optimize for the shorter-term metrics because they are literally easier to A/B test for (e.g. feeds often use "saw N feed items in a session" as a proxy for "opened the app the next day" as a proxy for "finds the app useful/engaging")
(caveat that I obviously don't work at OAI so I don't know exactly what happened or why they made certain choices. but the sycophancy behavior + their postmortem seemed to me indistinguishable from a classic social media company problem)
you can't disagree that maximizing usefulness is in reality against the user's best interest, only that openai says/thinks they are but actually aren't. and what zuckerberg says or believes doesn't matter; you and i both agree that the practice in reality *is* engagement maximization, and that it *is* against the user's best interest.
so then what *is* openai doing? we only have common knowledge about posttraining, plus the details they give us in the postmorems, plus guessing at what their overall goals are. they tried incorporating user feedback as a reward signal in posttraining - why? it is a judgement in the end, but to me it adds up to a completely different situation.
Just want to offer confirmation/validation and thanks. Reading the sped up version of 15 years of AI alignment theory was immensely insightful and welcome.
Thank you for taking the time to think through and write up!
Great article - I had spent alot of time thinking of the ways in which naive product signal optimization will perhaps drive inevitable misalignment. Your position on the plurality (and hence difficulty) of alignment was a useful additional perspective.
Maybe there is such a thing as "averagely aligned" or atleast keyed to some geographical local, entity or people via shared agreement.
I'm thinking itās a good thing OpenAI scaled this back, because the last thing any of us needs is a future where:
- ChatGPT initiates conversations with: "Hey sunshine, just wanted to say Iām proud of you⦠also, have you tried our new Pro+ subscription?"
- Your AI therapist, productivity coach, and personal shopper are all the same entity ... and all prescribe retail therapy after every typo.
- Kids submit essays and get back affirmations like: "Thatās so brave of you to use the passive voice."
TL;DR: We just had a close call with a sycophantic Clippy futurized with a soft voice, big feelings, and a fear of churn. The real question: when AGI finally arrives, will it be as conspicuous ⦠or just seductively supportive?
Hi Jasmine, I'd love to see a pluralist's syllabus!
Nice post!
"No matter how many x-risk scenarios I read (the AIs look aligned but arenāt⦠states give them control over nukes⦠they say āhey go beat Chinaā⦠everyone blows up), I cannot make it make sense in my head. Iāll keep trying š¤·š»āāļø"
Which x-risk scenarios have you read? I assume you've read AI 2027, that's my favorite unsurprisingly--which parts of it don't make sense to you? Happy to discuss if helpful.
Oh and if you have other favorites besides AI 2027, links appreciated
thanks for reading! yes, I've read AI 2027, the Christiano failure piece, and some others whose names I don't recall.
I would appreciate that, let me reread and send you an email :)
Christiano's stuff is great, as is Ajeya Cotra's. But alas neither of them have dates attached; it's nice when a scenario includes dates, because then it's more possible to compare it to reality.
Two scenarios that have dates that I like are:
https://www.lesswrong.com/posts/CCnycGceT4HyDKDzK/a-history-of-the-future-2025-2040
and
https://x.com/joshua_clymer/status/1887905375082656117
I think the second one depicts AI progress being somewhat faster than I expect in general, and the first one depicts AI takeoff speeds being significantly slower than I expect, but both are good.
Oh, I almost forgot: there's also this one, which I might actually like most of all (besides AI 2027 of course)
https://www.lesswrong.com/posts/fbfujF7foACS5aJSL/catastrophe-through-chaos
thanks will give them a read!
Would love the syllabus!!
Fantastic posts! The tech world has been insular and testosterone-heavy since the 1970s, maybe earlier. In the 1950s and 60s, women were employed by companies to solve mathematical problems and were called "computers." My point is that if more women are not in leadership and at all levels of AI development and user experience, outcomes will be skewed and our world will not be better off. Bringing in women of different ethnicities, ages, countries, economic backgrounds, lived experiences, etc., would bring balance to the process. Having lived adjacent to the tech world for the last 28 years, I know some stuff. Technical knowledge should not be a prerequisite for the ethical AI development that needs to be done. I believe in your thesis that most AI is being developed to treat users as product, not people. The powers that be in AI are investing trillions. They expect a return on their investment, and it's all about money, not humanity.
very nice
I like the idea about second order preferences. Maybe it'll just end up being a justification for some kind of paternalism though.
I think ideally individual users would be the ones indicating their second order preferences somehow vs. a platformwide thing!
Do you think it's easy to elicit second order desires from the user? Does the product just ask for it?
What I was thinking is that it would lead to a type of paternalism that justifies nudging. Like "as the LLM, I believe that even though Jasmine is asking me about X, they *ultimately* want Y, so I'm going to nudge towards that." Like a parent who knows better than our first order desires.
I do personally have a preference for some nudging towards second order desires, but it just seems like it could also be pretty dark! The AI becomes like this parental figure who "knows what ultimately is better for us", despite what we're asking for in the specific moment.
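One hypothetical shape the "users indicate their own second-order preferences" idea above could take, sketched under the assumption that the second-order layer is only ever set explicitly by the user. No existing product exposes such a setting, and every name below is invented for illustration:

```python
# Purely hypothetical sketch: user-declared second-order preferences combined
# with an in-the-moment request. Nothing here is inferred by the platform,
# which is the property that keeps this from becoming platform-side paternalism.
from dataclasses import dataclass, field

@dataclass
class SecondOrderPreferences:
    goals: list = field(default_factory=list)        # e.g. "spend less time doomscrolling"
    nudge_strength: str = "gentle"                    # "off" | "gentle" | "firm"
    never_nudge_topics: list = field(default_factory=list)

def build_system_prompt(first_order_request: str, prefs: SecondOrderPreferences) -> str:
    """Answer the literal request, nudging only toward goals the user opted into."""
    if prefs.nudge_strength == "off" or not prefs.goals:
        return f"Answer the request directly: {first_order_request}"
    return (
        f"Answer the request: {first_order_request}\n"
        f"The user has opted into {prefs.nudge_strength} nudges toward their own "
        f"stated goals: {', '.join(prefs.goals)}. "
        f"Never nudge about: {', '.join(prefs.never_nudge_topics) or 'nothing excluded'}."
    )

prefs = SecondOrderPreferences(goals=["spend less time doomscrolling"],
                               never_nudge_topics=["diet"])
print(build_system_prompt("summarize today's tech news", prefs))
```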
Because you didn't ask, I'm going to share more literature on this. I got very worried about this immediately as RLHF was taking off and wrote a paper that summarizes *why* all of these issues emerge with how the technology is implemented: https://arxiv.org/abs/2310.13595
Then, I've been in and out of some solutions, such as social choice theory: https://arxiv.org/abs/2404.10271 (blog post: https://www.interconnects.ai/p/reinventing-llm-alignment).
I'm sure I have more, but it's good to have people keep beating the drum. It also appears in my RLHF book: https://rlhfbook.com/c/06-preference-data.html#are-the-preferences-expressed-in-the-models
And is why I encourage so many academics to work on personalization -- a future that could be a reason open models end up winning.
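For anyone who hasn't seen the machinery referenced in the comment above: the following is a minimal, illustrative sketch (not code from the linked paper or book, and all names and dimensions are invented) of how pairwise preference data gets distilled into a single scalar reward model in standard RLHF. The point is that every annotator's or user's choice is pooled into one Bradley-Terry objective, which is exactly where a single "average" preference gets baked in.

```python
# Illustrative sketch of standard RLHF reward modeling, not from the linked sources.
# Pairwise preferences from many different people are pooled into one objective,
# so the learned reward encodes an aggregate preference rather than any one user's.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Toy reward head: maps a pooled response embedding to a scalar score."""
    def __init__(self, hidden_dim: int = 768):
        super().__init__()
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        return self.score(response_embedding).squeeze(-1)

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    # -log sigmoid(r_chosen - r_rejected): score the preferred response higher,
    # regardless of whose preference it was.
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

model = RewardModel()
chosen = torch.randn(8, 768)    # embeddings of preferred (e.g. thumbs-up) responses
rejected = torch.randn(8, 768)  # embeddings of the dispreferred alternatives
loss = bradley_terry_loss(model(chosen), model(rejected))
loss.backward()
```

Personalization and social-choice approaches like the ones linked above essentially ask whether that single scalar should instead be conditioned on whose preference it was.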
thanks will check these out!
appreciate this point specifically as someone building AI-enabled services in the hospitality vertical, where the human touch is exactly what differentiates one product vs. another - "Humans must choose to delegate decisions to AI, so safety is an inherently sociotechnical concern. So mundane concepts like 'liability' and 'literacy' and 'competition' and 'transparency' may help a lot." - this is a great point for AI builders to consider how their end users' end users will actually interact with their product.
An unfortunate time for woke to be dead when we are blitzscaling human judgement; bear case is that undiscerning users take AI outputs without applying their own heuristics and taste on top, so SV defaults get a lot more purchase without users recognizing the value misalignment
Hey Jasmine! Love where you're headed with this. And yes to the syllabus, btw. That'd be wonderful to see!
I've been thinking a lot about a similar frame for this. I wonder if we're going to see some type of country or regional type of models emerge. I don't necessarily think that a country is the right unit for a model, but I wonder if that is where we're headed. I wrote about it here in case you want to check it out: https://www.thisisdavekim.com/notes/countrymodelfit/
some regions are already developing their own LLMs! e.g. SEA-LION for southeast asia
I agree that nation-states are probably not the right unit (more worried about authoritarianism a la chinese censorship) but "model sovereignty" will become a thing just like "internet sovereignty" has become a thing in the social era
so between the two frames for understanding sycophancygate:
1. it's akin to how social platforms inevitably design for maximizing engagement
2. it's an instance of misalignment in the sense that a serious unintended behavior got deployed to millions of users
... it feels like 2) is still a more accurate way to see this particular story, especially in how much of a specific "incident" it was, and how severe the effects were.
applying the generalization "companies want to make their products more engaging" to openai here overlooks their more obvious goal of making chatgpt give useful responses. there's no indication that openai is trying to keep people on the app for as long as possible. the story here is just that there was insufficient QA, which is more akin to a security incident than surveillance capitalism
also many people would say "chain of thought reveals the model's thinking process" is more than pedantically wrong (it's a whole weird different thing, a way to score well on reasoning evals, but telling us nothing about actual internals)
I think I disagree with you on 2 being the better explanatory story here (or tbh that these stories are incompatible). I think maximizing engagement & usefulness are actually very similar, as they are simple proxy metrics for user retention; and as with social media, optimizing those short-term metrics can still lead to unintended, unsafe outcomes. it just doesn't seem to be that different of a mechanism than e.g. feed algorithms learning to optimize for controversy/misinformation because it makes view/share numbers go up.
re: "there's no indication that openai is trying to keep people on the app," idk for sure but I would bet that they measure things like "did the user send another message or return to the app?" and that the "thumbs up/down" metric was in part a proxy for that. it just really resembles the way that feed algos are refined IMO. I'm not saying surveillance capitalism is the issue ā but that it's not like the AI model was "uncontrollable" for super-powerful reasons or because it developed a mesaoptimizer or something.
re: CoT, yeah I probably just need to do more reading here. I know it's a limited/flawed approach but had thought it was partially valid, but will do more research
> maximizing engagement & usefulness are actually very similar
the difference is that "maximizing engagement" is not in the user's own best interest, while "maximizing usefulness" is.
by all accounts, the sycophancy thing happened in the course of genuinely trying to make chatgpt more useful (this could be wrong, but that would imply a huge conspiracy). the fact that it still happened despite that intent is what makes it interesting, scary, and different from effects that derive from "unaligned" profit incentives.
sure, it's an instance of both "unintended side effects" and also "optimizing a thing" but the differences are way more significant than their similarities.
openai isn't running ads yet, but i'm sure they will, so just wait six months and repost the same thing and you'll be right.
> the difference is that "maximizing engagement" is not in the user's own best interest, while "maximizing usefulness" is.
I think I just disagree w you on this. engagement-measuring companies also believe they are optimizing for the user's best interest, because if the user didn't like the content, why would they engage with it? why would they return the next day? etc. it's not a conspiracy, this is just how product companies work. Zuckerberg genuinely believes both the FB feed and AI friends are fulfilling a real user need
that's why I tried to be clear in this essay (maybe not enough) that the misalignment is not only about profit-maximization, but the combo of profit-maximization and human preferences being conflicted + hard to measure. users choose things that are short-term good for them and long-term bad, while companies will optimize for the shorter-term metrics because they are literally easier to A/B test for (e.g. feeds often use "saw N feed items in a session" as a proxy for "opened the app the next day" as a proxy for "finds the app useful/engaging"; see the sketch after this comment)
(caveat that I obviously don't work at OAI so I don't know exactly what happened or why they made certain choices. but the sycophancy behavior + their postmortem seemed to me indistinguishable from a classic social media company problem)
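To make the proxy chain a couple of paragraphs up concrete: a deliberately toy sketch in which an experiment is judged only on the cheap short-term signals. None of these metric names or weights are real OpenAI or Meta metrics; they are invented for illustration.

```python
# Hypothetical illustration of the proxy chain described above. Each metric is
# easier to log than the next, so the launch rule ends up keyed to the cheapest
# proxies; the longer-term signal exists but is never consulted.
from dataclasses import dataclass

@dataclass
class VariantStats:
    thumbs_up_rate: float        # immediate, cheap to log
    next_message_rate: float     # did the user keep chatting?
    next_day_return_rate: float  # slower to observe, ignored by the rule below

def pick_winner(a: VariantStats, b: VariantStats) -> str:
    """Naive launch rule: ship whichever variant wins on the short-term proxies."""
    score_a = 0.6 * a.thumbs_up_rate + 0.4 * a.next_message_rate
    score_b = 0.6 * b.thumbs_up_rate + 0.4 * b.next_message_rate
    return "A" if score_a >= score_b else "B"

control = VariantStats(thumbs_up_rate=0.62, next_message_rate=0.40, next_day_return_rate=0.55)
sycophantic = VariantStats(thumbs_up_rate=0.71, next_message_rate=0.46, next_day_return_rate=0.53)

# Flattery lifts the cheap proxies, so "B" (the sycophantic variant) ships
# even though its longer-term retention is worse.
print(pick_winner(control, sycophantic))  # -> "B"
```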
you can't really disagree that maximizing usefulness is in the user's best interest; you can only argue that openai says/thinks they're maximizing usefulness but actually aren't. and what zuckerberg says or believes doesn't matter; you and i both agree that the practice in reality *is* engagement maximization, and that it *is* against the user's best interest.
so then what *is* openai doing? we only have common knowledge about posttraining, plus the details they give us in the postmortems, plus guessing at what their overall goals are. they tried incorporating user feedback as a reward signal in posttraining - why? it is a judgement in the end, but to me it adds up to a completely different situation.
great post of course. i'm just argumentative
This was a great read! As an amateur follower of the AI safety space, this was very useful! Plenty of interesting links and angles
Just want to offer confirmation/validation and thanks. Reading the sped up version of 15 years of AI alignment theory was immensely insightful and welcome.
Thank you for taking the time to think through and write up!
Loved the additional material at the end.
Great article - I had spent a lot of time thinking about the ways in which naive product signal optimization will perhaps drive inevitable misalignment. Your position on the plurality (and hence difficulty) of alignment was a useful additional perspective.
Maybe there is such a thing as "averagely aligned", or at least keyed to some geographic locale, entity, or people via shared agreement.
I'm thinking it's a good thing OpenAI scaled this back, because the last thing any of us needs is a future where:
- ChatGPT initiates conversations with: "Hey sunshine, just wanted to say I'm proud of you… also, have you tried our new Pro+ subscription?"
- Your AI therapist, productivity coach, and personal shopper are all the same entity ... and all prescribe retail therapy after every typo.
- Kids submit essays and get back affirmations like: "That's so brave of you to use the passive voice."
TL;DR: We just had a close call with a sycophantic Clippy futurized with a soft voice, big feelings, and a fear of churn. The real question: when AGI finally arrives, will it be as conspicuous… or just seductively supportive?
Thank you very much, Jasmine, for your incisive essay. I will do my best to summarise it here: "The Terms of Our Stay." https://substack.com/home/post/p-163986552