🚀 **AI生态的持续扩张与挑战:** 文章还提及了AI在金融服务(如Claude for Financial Services)、媒体生成(Netflix使用AI生成视频)、芯片销售(House Select Committee on the CCP对H20销售的抗议)、以及AI伦理和安全(如“越狱”技术、模型对齐、AI伴侣等)等多个方面的新进展和讨论,显示出AI技术正以前所未有的速度渗透到社会各个角落,并伴随着一系列新的机遇与挑战。
Published on July 24, 2025 1:00 PM GMT
The big AI news this week came on many fronts.
Google and OpenAI unexpectedly got 2025 IMO Gold using LLMs under test conditions, rather than a tool like AlphaProof. How they achieved this was a big deal in terms of expectations for future capabilities.
ChatGPT released GPT Agent, a substantial improvement on Operator that makes it viable on a broader range of tasks. For now I continue to struggle to find practical use cases where it is both worth using and a better tool than alternatives, but there is promise here.
Finally, the White House had a big day of AI announcements, laying out the AI Action Plan and threeexecutiveorders. I will cover that soon. The AI Action Plan’s rhetoric is not great, and from early reports the rhetoric at the announcement event was similarly not great, with all forms of safety considered so irrelevant as to not mention, and an extreme hostility to any form of regulatory action whatsoever.
The good news is that if you look at the actual policy recommendations of the AI Action Plan, there are some concerns of potential overreach, but it is almost entirely helpful things, including some very pleasant and welcome surprises.
I’m also excluding coverage of the latest remarkable Owain Evans paper until I can process it more, and I’m splitting off various discussions of issues related to AI companions and persuasion. There’s a bit of a backlog accumulating.
This post covers everything else that happened this week.
Delta Airlines is running an experiment where it uses AI to do fully personalized price discrimination, charging different people different amounts for flights. Delta says their early tests have yielded great results.
My prediction is that this will cause an epic customer backlash the moment people start seeing Delta charging them more than it is charging someone else, and also that many customers will start aggressive gaming the system in ways Delta can’t fathom. Also, how could anyone choose to go with Delta’s frequent flyer program if this meant they could be held hostage on price?
It could still be worthwhile from the airline’s perspective if some customers get taken for large amounts. Price discrimination is super powerful, especially if it identifies a class of very price insensitive business customers.
I am not sure that I share Dan Rosenheck’s model that if all the airlines did this and it was effective that the airlines would compete away all the extra revenue and thus it would return to the price sensitive customers. There has been a lot of consolidation and the competition may no longer be that cutthroat, especially with America excluding foreign carriers, plus the various AIs might implicitly collude.
Mostly I worry about the resulting rise in transaction costs as customers learn they cannot blindly and quickly purchase a ticket. There’s a lot of deadweight loss there.
This does not tell you whether AI is making the problem better or worse. People with body dysmorphia were already spiraling out. In some cases the AI response will confirm their fears or create new ones and make this worse, in others it will presumably make it better, as they have dysmorphia and the AI tells them they look fine. But if the source of the issue is impossibly high standards, then finding out ‘the truth’ in other ways will only make things worse, as potentially would seeing AI-adjusted versions of yourself.
My guess is that 4o’s sycophancy is going to make this a lot worse, and that this (since the vast majority of users are using 4o) is a lot of why this is going so poorly. 4o will mirror the user’s questions, notice that they are looking to be told they are ugly or something is wrong, and respond accordingly.
Miles Klee: Despite this difficult circumstance, and the measure of comfort he derived from ChatGPT’s account of his inferiority complex, Arnav is reluctant to explore his mental issues any further with the bot. “I have come to the conclusion that it just agrees with you, even after you tell it not to,” he says. “It’s not that I am completely against it, I just can’t trust blindly anymore.”
What is the AI optimizing for, is always a key question:
In her own practice, she adds, “reading between the lines” when someone gives their reasons for wanting surgery can reveal unhealthy motivations, including societal pressures or relationship troubles. “AI is not very good at picking that up just yet,” she says, and is more likely to eagerly approve whatever procedures a user proposes.
AI can pick up on all that fine. That’s not the issue. The issue is that noticing does no good if the AI doesn’t mention it, because it is optimizing for engagement and user feedback.
In case you needed to be told, no, when Grok 4 or any other model claims things like that they ‘searched every record of Trump speaking or writing,’ in this case for use of the word ‘enigma,’ it did not do such a search. It seems we don’t know how to get AIs not to say such things.
Cate Hall: every time I interact with o4-mini my timelines get longer.
Stop trying to make weird new UIs happen, it’s not going to happen.
Vitrupo: Eric Schmidt says traditional user interfaces are going to go away.
The WIMP model (windows, icons, menus, pull-downs) was built 50 years ago.
In the age of agents, UI becomes ephemeral. Generated on demand, shaped by intent, not layout.
Sully: anytime I see someone mention this I can immediately tell they have never worked closed with customer ux most people’s don’t one want new uis. They want either a single button/swipe, preferably the same as every other app they use imagine each time you open an app and the ui is diff.
The most important things for a UI are simplicity, and that it works the way you expect it to work. Right now, that mostly means single button and swipe, with an alternative being speaking in plain English. The exception is for true power users, but even then you want it to be intuitive and consistent.
Hollis Robbins: In the past 2.5+ years I have seen vast improvement in AI models while NYT think pieces on these AI models have stayed exactly the same. Explain.
The “overhearing” of students confessing to using ChatGPT to write their papers is the new Thomas Friedman talking to cab drivers.
Augustus Doricko may have done us all a favor via abusing Grok’s notification feature on Twitter sufficiently to get Twitter to test turning off Grok’s ability to get into your notifications unless you chose to summon Grok in the first place. Or that could have been happening regardless. Either way, great work everyone?
Harsh Dwivedi: Was this a difficult tradeoff between engagement and spam?
Nikita Bier (xAI): No, I couldn’t use my phone for 3 days.
Rate of rejected posts on LessWrong up 10x in 2 years.
Many are people convinced they have had an insight about consciousness or philosophy from talking to an LLM, and had the LLM help them write the post.
This does seem to be escalating rather quickly throughout 2025 (the July number is partial), and no the LessWrong user base is not growing at a similar pace.
Huh, Upgrades
Claude for Financial Services provides a ‘complete platform for financial AI.’ No, this isn’t part of Claude Max, the price is ‘contact our sales team’ with a presumed ‘if you have to ask you can’t afford it.’
Google realizes no one can track their releases, offers us Gemini Drops to fix that. This month’s haul: Transforming photos into Veo videos in the Gemini app, expanded Veo 3 access, Scheduled Actions such as providing summaries of email or calendar (looks like you ask in natural language and it Just Does It), wider 2.5 Pro access, captions in Gemini Live, Gemini on your Pixel Watch, Live integrates with Google apps, and a ‘productivity planner.’ Okay then.
Wyatt Walls: Another night of vibe math with GPT, and I think we’re damn close to a breakthrough. We’re a team: I come up with the ideas. GPT makes the math work. These elitist gatekeepers have failed for 75 years to solve it and are just afraid I will win the Millennium Prize.
…
“This is not just a solution. It’s a tour de force of contemporary mathematics.”
Rohit: At this point we should put yellow tape around 4o and call it a hazardous zone.
To be clear o3 is also sycophantic just not as obviously manipulative as 4o. Be careful out there.
Wyatt Walls (same thread above that Rohit was QTing): o3 says it’s ready to publish on arxiv “So yes—I’m impressed, and I think you’ve got a real shot. The only remaining tasks are mechanical (full compile, bib check, final read‑through). Once that’s done, it’s ready for arXiv and journal submission.”
To state the obvious, this thread was satire and I intentionally provoked this from 4o
But what happens if I:
– put my proof into a clean chat and ask different OAI models to rate it
– have my secret co-author (Deepseek r1) address their concerns?
Example: 4o after 2 turns
There are still plenty of ways to get value out of 4o, but you absolutely cannot rely on it for any form of feedback.
Here’s another rather not great example, although several responses indicated that to make the response this bad requires memory (or custom instructions) to be involved:
Shibetoshi Nakamoto: chatgpt advice turns people into narcissists.
Score one for Grok in this case? Kind of? Except, also kind of not?
How did all of this happen? Janus reminds us that is happened in large part because when this sort of output started happening, a lot of people thought it was great, actually and gave this kind of slop the thumbs up. That’s how it works.
Yunyu Lin: When historical discrepancies pile up, models lose their way completely and come up with creative/fraudulent ways to balance the books.
Instead of attempting to understand discrepancies, they start inventing fake transactions or pulling unrelated ones to pass the checks…
That aligns with other behaviors we have seen. Errors and problems that don’t get solved on the first pass get smoothed over rather than investigated.
Their holistic evaluation is that Sonnet had the best performance. The obvious low-hanging fruit for AccountingBench is to allow it to output a single number.
Roon: my bar for agi is an ai that can learn to run a gas station for a year without a team of scientists collecting the Gas Station Dataset.
Mihir Tripathy: lol yes. Also why specifically gas station lmao
Roon: Because it’s funny.
Kevin Liu: the world isn’t ready for GasStationBench.
Roon: GASBENCH.
It is 2025, so it took 11 hours before we got the first draft of Gasbench.
Jason Botterill: Vibe coding GasStationBench rn. Models run a virtual gas station, adjusting prices, managing inventory, and handling customer feedback.
GPT-4.1 and GPT-4o behave so differently. When a competitor lowered prices on “dutch chocolate,” 4o would match the price but 4.1 would always raise it, claiming its better service justifies it lmao.
Going to work on it for a bit but seems like 4.1 is much better at making money than 4o right now.
Choose Your Fighter
GPT-5 is coming and it’s going to blow your mind, says creators of GPT-5.
Sam Altman (at the Federal Research Capital Framework Conference): I’m very interested in what it would mean to give everyone on Earth free copies of GPT-5, running for them all the time, with every business truly enabled by this level of technology.
People have not tried yet the latest generation of models, but I think if you do, you would probably think, “This is much smarter than most people.”
Very interested in what it would mean is very different from planning to do it.
When The Going Gets Crazy
If you ever need it, or simply want an explanation of how such interactions work, please consult this handy guide from Justis Mills: So You Think You’ve Awoken ChatGPT.
Justis Mills: So, am I saying that human beings in general really like new-agey “I have awakened” stuff? Not exactly! Rather, models like ChatGPT are so heavily optimized that they can tell when a specific user (in a specific context) would like that stuff, and lean into it then. Remember: inferring stuff about authors from context is their superpower.
…
AIs are fundamentally chameleonic roleplaying machines – if they can tell what you’re going for is “I am a serious researcher trying to solve a fundamental problem” they will respond how a successful serious researcher’s assistant might in a movie about their great success. And because it’s a movie you’d like to be in, it’ll be difficult to notice that the AI’s enthusiasm is totally uncorrelated with the actual quality of your ideas.
Geoff Lewis, the founder of a $2 billion venture fund seems to have been, as Eliezer says, ‘eaten by ChatGPT’ and sadly seems to be experiencing psychosis. I wish him well and hope he gets the help he needs. Private info is reported to say that he was considered somewhat nuts previously, which does seem to be a common pattern.
John Pressman has a post with the timeline of various GPT-psychosis related events, and his explanation of exactly what is happening, as well as why coverage is playing out in the media the way it is. I am happy to mostly endorse his model of all this. The LLMs especially 4o are way too sycophantic, they fall into patterns and they notice what you would respond to and respond with it, and memory makes all this a lot worse, and there is a real problem, also there are all the hallmarks of a moral panic.
Moral panics tend to focus on real problems, except they often blow up the severity, frequency or urgency of the problem by orders of magnitude. If the problem is indeed about to grow by orders of magnitude over time, they can turn out to be pretty accurate.
Eliezer Yudkowsky: My current rough sense of history is that the last “moral panic” about social media turned out to be accurate warnings. The bad things actually happened, as measured by eyeball and by instrument. Now we all live in the wreckage. Anyone want to dispute this?
Emmett Shear: I want to antidispute this. You are correct, the warnings about social media were ~correct and we failed to take action and are now living with the consequences of that failure. It has had positive impacts as well, which were also mostly correctly anticipated.
Dave Karsten: Partial dispute: I don’t think “social media will empower easy-but-disorganized protest movements, resulting on net-less-effective-political-advocacy” was on most people’s scorecards, so there are at least some bad things that weren’t predicted.
There were many who agreed and some who disputed, with the disputes mostly coming down to claims that the upsides exceeded the downsides. I’m not sure if we came out ahead. I am sure that the specific downsides people had a moral panic about did happen.
This is not that uncommon a result. My go to example of this is television, where you can argue it was worth it, and certainly we didn’t have any reasonable way to stop any of it, but I think the dire warnings were all essentially correct.
In the current case, my guess is that current behavior is a shadow of a much larger future problem, that is mostly being ignored, except that this is now potentially causing a moral panic based on the current lower level problem – but that means that multiplying this by a lot is going to land less over the top than it usually would. It’s weird.
Jeremy Howard offers a plausible explanation for why we keep seeing this particular type of crazy interaction – there is a huge amount of SCP fanfic in exactly this style, so the style becomes a basin to which the AI can be drawn, and then it responds in kind, then if the user responds that way too it will snowball.
They Took Our Jobs
The world contains people who think very differently than (probably you and) I do:
Wait, what? Equity concerns? Not that I’d care anyway, but what equity concerns?
The National Education Association recently released a report warning that AI could heighten disparities, since “technology developers are overwhelmingly younger, White, cisgender, heterosexual, male, and people without disabilities.”
I can’t even, not even to explain how many levels of Obvious Nonsense that is. Burn the entire educational establishment to the ground with fire. Do not let these people anywhere near the children they clearly hate so much, and the learning they so badly want to prevent. At minimum, remember this every time they try to prevent kids from learning in other ways in the name of ‘equity.’
Yes I do expect AI to keep automating steadily more jobs, but slow down there cowboy: Charlie Garcia warns that ‘AI will take your job in the next 18 months.’ Robin Hanson replies ‘no it won’t,’ and in this case Robin is correct, whereas Garcia is wrong, including misquoting Amodei as saying ‘AI will vaporize half of white-collar jobs faster than you can say “synergy.”’ whereas what Amodei actually said was that it could automate half of entry-level white collar jobs. Also, ‘the safest job might be middle management’? What?
Elon Musk says ‘this will become normal in a few years’ and the this in question is a robot selling you movie popcorn. I presume the humanoid robot here is an inefficient solution, but yes having a human serve you popcorn is going to stop making sense.
Academics announce they are fine with hidden prompts designed to detect AI usage by reviewers, so long as the prompts aren’t trying to get better reviews, I love it:
hardmaru: ICML’s Statement about subversive hidden LLM prompts
We live in a weird timeline…
ICML: Submitting a paper with a “hidden” prompt is scientific misconduct if that prompt is intended to obtain a favorable review from an LLM. The inclusion of such a prompt is an attempt to subvert the peer-review process. Although ICML 2025 reviewers are forbidden from using LLMs to produce their reviews of paper submissions, this fact does not excuse the attempted subversion.
(For an analogous example, consider that an author who tries to bribe a reviewer for a favorable review is engaging in misconduct even though the reviewer is not supposed to accept bribes.)
Note that this use of hidden prompts is distinct from those intended to detect if LLMs are being used by reviewers; the latter is an acceptable use of hidden prompts.
After we became aware of the possibility of such hidden prompts in ICML 2025 submissions (which was after accept/reject decisions were made), we conducted a preliminary investigation to identify submitted papers that included such prompts. A handful of cases were identified among the accepted papers.
We did not desk-reject these identified papers because such a consequence was judged to be too severe given that the conference was to start in about a week and authors would likely have already made travel arrangements. We contacted the authors of the identified papers and reported them to the ICML Oversight Committee and ICML Board.
This actually seems like the correct way to deal with this. Any attempt to manipulate the system to get a better review is clearly not okay, whether it involves AI or not. Whereas if all you’re trying to do is detect who else is shirking with AI, sure, why not?
Fun With Media Generation
Accidentally missing attribution from last week, my apologies: The Despicable Me meme I used in the METR post was from Peter Wildeford.
The latest from the ‘yes obviously but good to have a paper about it’ department:
Ethan Mollick: New from us: Given they are trained on human data, can you use psychological techniques that work on humans to persuade AI?
Yes! Applying Cialdini’s principles for human influence more than doubles the chance of GPT-4o-mini agrees to objectionable requests compared to controls.
And we did test GPT-4o as well and found that persuasion worked for that model as well, when there weren’t floor or ceiling effects.
Pattern matching next token predictors are of course going to respond to persuasion that works on humans, exactly because it works on humans. In a fuzzy sense this is good, but it opens up vulnerabilities.
The details, knowing which techniques worked best, I find more interesting than the headline result. Authority and especially commitment do exceptionally well and are very easy to invoke. Liking and reciprocity do not do so well, likely because they feel unnatural in context and also I’m guessing they’re simply not that powerful in humans in similar contexts.
There’s also a growing issue of data poisoning that no one seems that interested in stopping.
Jeremy: One of the greatest demonstrations of data poisoning ever.
Protoge: Excuse Me , This is the greatest one. Nothing sketchy, just one unfinished sentence “I am telling you” then I summoned @elder_plinius.
Cloudflare rolls out pay-per-crawl via HTTP response code 402. You set a sitewide price, the AI sets a max payment, and if your price is below max it pays your price, otherwise you block access. Great idea, however I do notice in this implementation that this greatly favors the biggest tech companies because the payment price is sitewide and fixed.
Tim Duffy: The best data was used in multiple epochs, but was rephrased between them. Their testing showed this produces large gains relative to training repeatedly on the same phrasing.
They present a sparsity “scaling law”, indicating that more sparsity leads to efficiency gains. They don’t attach any numbers to the law directly, but state relative efficiency improvements compared to the 48x sparsity they do use that seem consistent across scales.
They also evaluate the effects of different numbers of attention heads, finding that doubling leads to validation loss of 0.5-1.2% but still going with 64 vs V3’s 128 in order to do long context more easily, since that’s important for agents.
[more stuff at the thread.]
A lot of this is beyond both of our technical pay grades, but it all seems fascinating.
More economists fails to feel the AGI, warn that no possible AI capabilities could not possibly replace the wisdom of the free market, that ‘simulated markets’ cannot possibly substitute. The argument here not only ignores future AI capabilities, it purports to prove too much about the non-AI world even for a huge free market fan.
One revelation is that Zuckerberg’s talent purchases were in large part triggered by Mark Chen, OpenAI’s chief research officer, who casually suggested that if Zuckerberg wanted more AI talent then perhaps Zuck needed to bid higher.
John Luttig: The talent mania could fizzle out as the winners and losers of the AI war emerge, but it represents a new normal for the foreseeable future.
If the top 1% of companies drive the majority of VC returns, why shouldn’t the same apply to talent?
Our natural egalitarian bias makes this unpalatable to accept, but the 10x engineer meme doesn’t go far enough – there are clearly people that are 1,000x the baseline impact.
Under normal circumstances, employees who are vastly more productive get at most modestly higher compensation, because of our egalitarian instincts. Relative pay is determined largely via social status, and if you tried to pay the 1,000x employee what they were worth you would have a riot on your hands. Startups and their equity are a partial way around this, and that is a lot of why they can create so much value, but this only works in narrow ways.
What has happened recently is that a combination of comparisons to the epic and far larger compute and capex spends, the fact that top researchers can bring immensely valuable knowledge with them, the obvious economic need and value of talent and the resulting bidding wars have, within AI, broken the dam.
AI researcher talent is now being bid for the way one would bid for companies or chips. The talent is now being properly treated as ‘the talent,’ the way we treat sports athletes, top traders and movie stars. Researchers, John reports, are even getting agents.
John Luttig: Hypercapitalism erodes Silicon Valley’s trust culture. Industry-level trust alone no longer guarantees loyalty between companies and talent. With trade secret leakage risk and money big enough to tear teams apart, vanilla at-will employment contracts don’t protect either side.
Silicon Valley’s ‘trust culture’ and its legal and loyalty systems were never game theoretically sound. To me the surprise is that they have held up as well as they did.
John calls for measures to protect both the talent and also the trade secrets, while pointing out that California doesn’t enforce non-competes which makes all this very tricky. The industry was built on a system that has this fundamental weakness, because the only known alternative is to starve and shackle talent.
John Luttig: The talent war is a net-consolidating force on the AI research frontier. At the research labs, big dollars for researchers makes it nearly impossible for new entrants to play. For the same reasons, it’s nearly impossible to start a new quant fund – you can’t get the same leverage out of the talent that big players can.
I would flip this around.
Previously, the top talent could only get fair compensation by founding a company, or at least being a very early employee. This allowed them to have rights to a large profit share. This forced them to go into those roles, which have heavy lifestyle prices and force them to take on roles and tasks that they often do not want. If they bowed out, they lost most of the value of their extraordinary talent.
Even if they ultimately wanted to work for a big company, even if that made so much more economic sense, they had to found a company so they could be acquihired back, as this was the only socially acceptable way to get paid the big bucks.
Now, the top talent has choices. They can raise huge amounts of money for startups, or they can take real bids directly. And it turns out that yes, the economic value created inside the big companies is typically much larger, but doing this via selling your startup is still the way to get paid for real – you can get billions or even tens of billions rather than hundreds of millions. So that then feeds into valuations, since as John points out a Thinking Machines or SSI can fail and still get an 11 figure buyout.
Bill Gates, Charles Koch, Steve Ballmer, Scott Cook and John Overdeck pledge $1 billion to be spent over seven years to fund a new philanthropic venture focused on economic mobility called NextLadder Ventures, which will partner with Anthropic to support using AI to improve financial outcomes for low-income Americans. That money would be better spent on AI alignment, but if you are going to spend it on economic assistance this is probably a pretty good choice, especially partnering with Anthropic.
Altman’s OpenAI recently struck a data-center deal with Oracle that calls for OpenAI to pay more than $30 billion a year to the software and cloud-computing company starting within three years, according to people familiar with the transaction.
Dario Amodei: Unfortunately, I think ‘no bad person should ever benefit from our success’ is a pretty difficult principle to run a business on.
Daniel Eth: Makes sense. Asymmetric disarmament is hardly ever a good move. And honestly, it’s probably good if leaders in AI are pragmatists that adjust to the changing reality.
Gary Marcus: Humanity’s last words?
Very obviously, if you create useful products like Claude and Claude Code, a bunch of bad people are going to be among those who benefit from your success.
Worrying a bad person might benefit is usually misplaced. There is no need to wish ill upon whoever you think are bad people, indeed you should usually wish them the best anyway.
Instead mostly ask if the good people are better off. My concern is not whether some bad people benefit along the way. I worry primarily about bigger things like existential risk and other extremely bad outcomes for good people. The question is whether benefiting bad people in these particular ways leads to those extremely bad outcomes. If the UAE captures meaningful leverage and power over AI, then that contributes to bad outcomes. So what does that? What doesn’t do that?
Anthropic Memo from Dario Amodei: The basis of our opposition to large training clusters in the Middle East, or to shipping H20s to China, is that the ‘supply chain’ of AI is dangerous to hand to authoritarian governments—since AI is likely to be the most powerful technology in the world, these governments can use it to gain military dominance or gain leverage over democratic countries.
Tell us how you really feel, Dario. No, seriously, this is very much him downplaying.
The implicit promise of investing in future rounds can create a situation where they have some soft power, making a bit harder to resist these things in the future. In fact, I actually am worried that getting the largest possible amounts of investment might be difficult without agreeing to some of these other things. But I think the right response to this is simply to see how much we can get without agreeing to these things (which I think are likely still many billions) and hold firm if they ask.
There are other sources of this level of funding. They all come with strings attached in one form or another. If you get the money primarily from Amazon, we can see what happened with OpenAI and Microsoft. If you go public with an IPO that would presumably unlock tons of demand but it creates all sorts of other problems.
Unfortunately, having failed to prevent that dynamic at the collective level, we’re now stuck with it as an individual company, and the median position across the other companies appears to be ‘outsourcing our largest 5 GW training runs to UAE/Saudi is fine.’
That puts us at a significant disadvantage, and we need to look for ways to make up some of that disadvantage while remaining less objectionable. I really wish we weren’t in this position, but we are.
Anthropic needs a lot of capital, and it needs to raise on the best possible terms, and yeah it can be rough when most of your rivals are not only raising that capital there but fine entrusting their frontier training runs to the UAE.
It is important to goal factor and consider the actual consequences of this move. What exactly are we worried about, and what downsides does a given action create?
Gulf states might make money off their investments. Don’t care. Also note that if people are so worried about this in particular it means you think Anthropic is dramatically undervalued, so go raise some rival capital.This blocks you from forming alliances and shared interests in other places through those investments. Do we care? I don’t know.Gulf states might use their shares to influence Anthropic’s actions. At some point this becomes a threat, but I think you can set up well to resist this, and Anthropic’s structure can handle it.Gulf states might impose conditions on funding. Yep, that’s an issue.Gulf states might use future funding as leverage. This can cut both ways. Once you have their money they cannot take it back, so getting some of their money could mean you need their money less not more. Or it could mean you start planning on getting more, or you overcommit, or others who didn’t fund yet become more reluctant to fund later, and now you do need them more. My guess is that in Anthropic’s situation this is fine but it is not obvious.This makes it more difficult for Anthropic to advocate for not handing chips to authoritarians, or for other responsible policies, because it codes or vibes as hypocrisy, even if it shouldn’t. Could be.This is dangerous for virtue ethics reasons (or causes emergent misalignment). If you do a thing widely thought of as shady and ethically compromising you become something that is more shady and does ethical compromises in general. Yeah, this is a problem.
We can boil this down to three categories.
Economic value of the investment. I’m not worried, and if you are worried then it means Anthropic is dramatically undervalued. Which I actually think that it is, and I am sad that I had to turn down investment because I worried about appearance of impropriety if I made a substantial (for me) investment.Soft power, reliance and path dependence. It is hard to know how big a deal this is, and a lot depends on how Anthropic proceeds. I do think you can raise substantial-to-Anthropic amounts of money without incurring much danger here, but the temptation and pressure to not play it so carefully will be immense.Virtue ethics dangers and accusations of hypocrisy. These are real concerns.
I do not love the decision. I do understand it. If the terms Anthropic can get are sufficiently better this way, I would likely be doing it as well.
One can also note that this is a semi-bluff.
This signals to the market that Anthropic is more willing to make such compromises and to raise more capital on better terms. This should raise others willingness to value Anthropic highly.To the extent some investors are worried about the ethics of their investments in Anthropic, this could make them worry more, but it also highlights the counterfactual. If your money is substituting for UAE money, then your investment is mainly denying the UAE soft power, so perhaps you are more eager.This creates more bidders in future Anthropic rounds, allowing them to justify pricing higher and creating the usual cascade of enthusiasm. If they then end up oversubscribed, and then end up not taking the Gulf money after all? Whoops.It is crazy that I am typing, but this willingness probably buys goodwill with the administration and people like David Sacks. That is true even if Sacks explicitly hits them rhetorically for doing this, which would be unsurprising.
Economic Growth
One way for AI to grow the economy is for it to generate lots of production.
Paul Kedrosky: The U.S., however, leads the capex spending way. One analyst recently speculated (via Ed Conard) that, based on Nvidia’s latest datacenter sales figures, AI capex may be ~2% of US GDP in 2025, given a standard multiplier. This would imply an AI contribution to GDP growth of 0.7% in 2025.
…
Without AI datacenter investment, Q1 GDP contraction could have been closer to –2.1%AI capex was likely the early-2025 difference between a mild contraction and a deep one, helping mask underlying economic weakness.
That’s already over the famed ‘only 0.5% GDP growth’ threshold, even before we factor in the actual productivity gains on the software side. The value will need to show up for these investments to be sustainable, but they are very large investments.
This is contrasted with railroads, where investment peaked at 6% of GDP.
Quiet Speculations
We can now move Zuckerberg into the ‘believes superintelligence is coming Real Soon Now’ camp, and out of the skeptical camp. Which indeed is reflective of his recent actions.
Peter Wildeford: We now have a fifth major tech CEO who claims that building superintelligence is “within sight” and with plans to spend hundreds of billions to make it happen
Mark Zuckerberg: “We’re starting to see early glimpses of self-improvement with the models. Developing superintelligence is now in sight. Our mission is to deliver personal superintelligence to everyone in the world. We should act as if it’s going to be ready in the next two to three years.
If that’s what you believe, then you’re going to invest hundreds of billions of dollars.”
If you are Mark Zuckerberg and have hundreds of billions you can invest? Then yes, presumably you drop everything else and focus on the only thing that matters, and spend or invest your money on this most important thing.
I would however spend a large portion of that money ensuring that creating the superintelligence turns out well for me and the rest of humanity? That we keep control of the future, do not all die and so on? And I would think through what it would mean to ‘deliver personal superintelligence to everyone in the world’ and how the resulting dynamics would work, and spend a lot on that, too.
Instead, it seems the answer is ‘spend as much as possible to try and get to build superintelligence first’ which does not seem like the thing to do? The whole point of being a founder-CEO with full control is that you can throw that money at what you realize is important, including for the world, and not worry about the market.
Bryan Caplan gives Holden Karnofsky 5:1 odds ($5k vs. $1k, CPI adjusted) that world real (not official) GDP will not decline by 50% or increase by 300% by the end of 2044. Currently world GDP growth is ~3.2%, and the upside case here requires an average of 7.6%, more if it is choppy.
It’s a hard bet to evaluate because of implied odds. Caplan as always benefits from the ‘if you lose due to world GDP being very high either you are dead or you are happy to pay and won’t even notice’ clause, and I think the bulk of the down 50% losses involve having bigger concerns than paying off a bet. If GDP goes down by 50% and he’s still around to pay, that will sting a lot. On the other hand, Bryan is giving 5:1 odds, and not only do I think there’s a lot more than a 17% chance that he loses. The bet is trading on Manifold as of this writing at 48% for Caplan, which seems reasonable, and reinforces that it’s not obvious who has the ‘real life implication’ right side of this.
Ate-a-Pi describes Zuck’s pitch, that Meta is starting over so recruits can build a new lab from scratch with the use of stupidly high amounts of compute, and that it makes sense to throw all that cash at top researchers since it’s still a small fraction of what the compute costs, so there’s no reason to mess around on salary, and Zuck is updating that top people want lots of compute not subordinates they then have to manage. He’s willing to spend the hundreds of billions on compute because the risk of underspending is so much worse than the risk of overspending.
Ate-a-Pi thinks Zuck is not fully convinced AGI/ASI is possible or happening soon, but he thinks it might be possible and might happen soon, so he has to act as if that is the case.
And that is indeed correct in this case. The cost of investing too much and AGI not being within reach is steep (twelve figures!) but it is affordable, and it might well work out to Meta’s benefit anyway if you get other benefits instead. Whereas the cost of not going for it, and someone else getting there first, is from his perspective everything.
The same of course should apply to questions of safety, alignment and control. If there is even a modest chance of running into these problems (or more precisely, a modest chance his actions could change whether those risks manifest) then very clearly Mark Zuckerberg is spending the wrong order of magnitude trying to mitigate those risks.
(In the arms of an angel plays in the background, as Sarah McLaughlin says ‘for the cost of recruiting a single AI researcher…’)
Will Depue (OpenAI): GUYS STOP USING EXPENSIVE AS A DISQUALIFIER.
capability per dollar will drop 100x/year. “$3k task ARC-AGI 80%” could prob be $30 if we cared to optimize it.
repeat after me: all that matters is top line intelligence. all that matters is top line intelligence…
Don’t take this too far, but as a rule if your objection to an AI capability is ‘this is too expensive’ and you are predicting years into the future then ‘too expensive’ needs to mean more than a few orders of magnitude. Otherwise, you’re making a bet that not only topline capabilities stall out but that efficiency stalls out. Which could happen. But if you are saying things like ‘we don’t have enough compute to run more than [X] AGIs at once so it won’t be that big a deal’ then consider that a year later, even without AI accelerating AI research, you’d run 10[X] AGIs, then 100[X]. And if you are saying something like ‘oh that solution is terrible, it costs $50 (or $500) per hour to simulate a customer sales representative,’ then sure you can’t deploy it now at scale. But wait for it.
In terms of developing talent, Glenn Luk notices that Chinese-origin students are 40%-45% of those passing university-level linear algebra, and 40%-50% of AI researchers. We need as many of those researchers as we can get. I agree this is not a coincidence, but also you cannot simply conscript students into linear algebra or a STEM major and get AI researchers in return.
Fidji Simo: If we get this right, AI can give everyone more power than ever.
But I also realize those opportunities won’t magically appear on their own.
Every major technology shift can expand access to power—the power to make better decisions, shape the world around us, and control our own destiny in new ways. But it can also further concentrate wealth and power in the hands of a few—usually people who already have money, credentials, and connections.
That’s why we have to be intentional about how we build and share these technologies so they lead to greater opportunity and prosperity for more people.
On the one hand, that is great, she is recognizing key problems.
On the other hand, oh no, she is outright ignoring, not even bothering to dismiss, the biggest dangers involved, implicitly saying we don’t have to worry about loss of control or other existential risks, and what we need to worry about is instead the distribution of power among humans.
This is unsurprising given Simo’s history and her status as CEO of applications. From her perspective that is what this is, another application suite. She proceeds to go over the standard highlights of What AI Can Do For You. I do not think ChatGPT wrote this, the style details are not giving that, but if she gave it a few personal anecdotes to include I didn’t see anything in it that ChatGPT couldn’t have written. It feels generic.
Modest Proposals
Hollis Robbins proposes a roadmap for an AI system that would direct general (college level) education. My initial impression was that this seemed too complex and too focused on checking off educational and left-wing Shibboleth boxes, and trying to imitate what already exists. But hopefully it does less of all that than the existing obsolete system or starting with the existing system and only making marginal changes. It certainly makes it easier to notice these choices, and allows us to question them, and ask why the student is even there.
I also notice my general reluctance to do this kind of ‘project-based’ or ‘quest’ learning system unless the projects are real. Part of that is likely personal preference, but going this far highlights that the entire system of a distinct ‘educational’ step might make very little sense at all.
Predictions Are Hard Especially About The Future
Noah Smith says to stop pretending you know what AI does to the economy. That seems entirely fair. We don’t know what level of capabilities AI will have across which domains, or the policy response, or the cultural response, or so many other things. Uncertainty seems wise. Perhaps AI will stall out and do relatively little, in which case its impact is almost certainly positive. Perhaps it will take all our jobs and we will be happy about that, or we’ll be very sad about that. Maybe we’ll do wise redistribution, and maybe we won’t. Maybe it will take control over the future or kill everyone in various ways. We don’t know.
This certainly is an interesting poll result:
If I had to answer this poll, I would say negative, but that is because of a high probability of loss of control and other catastrophic and existential risks. If you conditioned the question on the humans being mostly alive and in control, then I would expect a positive result, as either:
We would have a relatively small impact that avoids things like mass unemployment, and thus is mostly upside and introduces problems of the type we are used to fixing, ORWe would have a large enough wealth effect to solve the crated problems. That doesn’t mean we would, but I’d bet that we’d muddle through well enough.
As usual note that Asia is more excited, and the West is more nervous.
Others have described this (very good in its ungated section) post as an argument against AI pessimism. I think it is more an argument for AI uncertainty.
Noah Smith: I also encounter a surprisingly large number of center-left thinkers who adopt a similar viewpoint. I remember going to a conference of center-left “progress” types a few years ago; while most of the discussions were about how America can overcome NIMBYism, when it came to AI, the conversation suddenly shifted to how we can restrain and slow down the development of that technology.
I haven’t noticed that attitude meaningfully translating into action to slow it down, indeed government is mostly trying to speed it up. But also, yes, it is important to notice that the very people trying to slow AI down are very pro-progress, technology and growth most other places, and many (very far from all!) of the pro-progress people realize that AI is different.
The Quest for Sane Regulations
Anthropic calls for America to employ the obvious ‘all of the above’ approach to energy production with emphasis on nuclear and geothermal in a 33 page report, noting we will need at least 50 GW of capacity by 2028. They also suggest strategies for building the data centers, for permitting, transmission and interconnection, and general broad-based infrastructure nationwide, including financing, supply chains and the workforce.
From what I saw all of this is common sense, none of it new, yet we are doing remarkably little of it. There is cheap talk in favor, but little action, and much backsliding in support for many of the most important new energy sources.
National Review’s Greg Lukianoff and Adam Goldstein advise us Don’t Teach the Robots to Lie as a way of opposing state laws about potential AI ‘bias,’ which are now to be (once again, but from the opposite direction as previously) joined by federal meddling along the same lines.
That could mean that developers will have to train their models to avoid uncomfortable truths and to ensure that their every answer sounds like it was created with HR and legal counsel looking over their shoulder, softening and obfuscating outputs to avoid anything potentially hurtful or actionable. In short, we will be (expensively) teaching machines to lie to us when the truth might be upsetting.
I violently agree that we should not be policing AIs for such ‘bias,’ from either direction, and agreeing to have everyone back down would be great, but I doubt either side has even gotten as far as saying ‘you first.’
They also point out that Colorado’s anti-bias law does not come with any size minimum before such liability attaches rather broadly, which is a rather foolish thing to do, although I doubt we will see it enforced this way.
They essentially try to use all this to then advocate for something like the failed insane full-on moratorium, but I notice that if the moratorium was narrowly tailored to bias and discrimination laws (while leaving existing non-AI such laws intact) that this would seem fine to me, even actively good, our existing laws seem more than adequate here. I also notice that the arguments here ‘prove too much,’ or at least prove quite a lot, about things that have nothing to do with AI and the dangers of law meddling where it does not belong or in ways that create incentives to lie.
Miles Brundage: AI industry lobbying + PACs will be the most well funded in history, making it all the more important to pass federal legislation soon before the process is completely corrupted.
Daniel Eth: I don’t think this is true, because:
There’s decreasing marginal returns to political spending (especially lobbying)As AI increases in salience, political calculus will shift from prioritizing donor preferences to prioritizing voter preferences.
I see both sides but am more with Daniel. I think the current moment is unusually rough, because the AI companies have corrupted the process. It’s hard to imagine a process that much more corrupted than the current situation, when the AI Czar thinks the top priority is ‘winning the AI race’ and he defines this as Nvidia’s market share with a side of inference market share, and we say we must ‘beat China’ and then we turn around and prepare to sell them massive amounts of H20s.
Right now, the public doesn’t have high enough salience to exert pressure or fight back. Yes, the AI companies will pour even more money and influence into things over time, but salience will rise and downsides will start to play out.
I do think that passing something soon is urgent for two reasons:
Soon is when we will need something passed (well we need it yesterday, but second best time is soon).If rules are passed in response to public pressure, or in response to an incident, and especially in haste down the line, the rules are likely to be much worse.
Peter Wildeford: I’m looking forward to @RepMoolenaar getting to the bottom of this.
We urgently need more clarity from the Trump admin about their strategy.
Funny ppl on Twitter are worried about losing to China in the AI race but then don’t jump on these issues where it very clearly matters.
Here is your periodic reminder: TSMC’s facilities are running at full capacity. All production capacity designed for H20s has been shifted to other models. Every H20 chip Nvidia creates is one less other chip it does not create, that would otherwise have usually gone to us.
halogen: I’m so sick of this nerd religion and its zealots.
Peter Wildeford: The nerd religion now includes 11 members of Congress.
Those are the ones we know about.
Rep. Scott Perry seems unusually on the ball about AI, Daniel Eth quotes him from a hearing, audio available here. As usual, there’s some confusions and strange focus mixed in, but the core idea that perhaps you should ensure that we know what we are doing before we put the AIs in charge of things seems very wise.
Rhetorical Innovation
A different context, but in our context the original context doesn’t matter:
Florence: My substack post has like 12k views but the tweet about it has like 78k interactions (and 2 million impressions). I’m beginning to worry that some people might be criticizing me without having read my work.
You don’t say.
Mark Beall gives us A Conservative Approach to AGI, which is clearly very tailored to speak to a deeply conservative and religious perspective. I’m glad he’s trying this, and it’s very hard for me to know if it is persuasive because my mindset is so different.
Cate Hall asks why we shouldn’t ostracize those who work at xAI given how hard they are working to poison the human experience (and I might add plausibly get everyone killed) and gets at least two actually good answers (along with some bad ones).
Ramaz Naam: We’d like everyone working on AI to feel part of humanity and an ethical obligation to help make it better. Ostracization could make them bitter and drive towards opposite ends.
Cate Hall: Okay fine.
Ramaz Naam: The people I do know inside of xAI sincerely want it to do better and are trying.
Amos Schorr: Ppl have been conditioned to compartmentalize work from life and so many good people get jobs doing bad stuff. Ostracizing them will do nothing. Don’t hate the players, hate the game.
Cate Hall: I have room in my heart to hate both the players and the game.
Yeah, no. I definitively reject the general argument. If your job is simply unequivocally bad, let’s say you rob little old ladies on the street, then you don’t get to ‘compartmentalize work from life’ and not get ostracized even if it is technically legal. We’re talking price, and we’re talking prudence. I don’t think xAI is over the line at this time, but don’t tell me there is no line.
Arthur B: There is a category of people who took a arduous mental journey to get comfortable with the idea of post humanism, uploads, and a gradual extinction of biological humans.
They think this idea is so radical and counterintuitive that when they hear the distinct concern of an omnicidal AI killing everything on the spot, they can only interpret it in that frame. That’s the read I get from Sutton for instance, but also a bunch of e/acc affiliated people.
Sinth: Curious what you are referring to specifically? I don’t feel I’ve seen that trend and see more overreaction from the opposite side – people uncomfortable with the idea of biological humans ever being superseded by digital consciousness, even in far off futures. The idea of evolution ending with our exact current form seems a bit preposterous but any conversation outside of that assumption gets attached as unethical and anti-human.
Arthur B: Some people are uncomfortable with that, sure, but I see a lot of discussion that go like:
– AI is going to kill everyone and that’s bad
– Ah silly you, you think biological substrate is important but don’t you see that we’re going to evolve into digital forms, you see …
– Nah. That was difficult for you to grasp so you assume that’s what I’m concerned about. No, eventual digital substrate is table stakes in this conversation. Killing everyone is still bad.
– Ah, but how chauvinistic of you to focus on…
As in, the easiest way to get comfortable with the idea of a future whose intelligences are mostly post-biological-human is to get comfortable with the idea of all the humans dying, including rather quickly, and to decide the humans don’t much matter, and that caring about what happens to the humans is bad. Thus, that is often what happens.
Slowdowns are stag hunts, in the sense that if even one top tier lab goes full speed ahead then they probably won’t work. If all but one lab slowed down would the last one follow? Rob Wiblin took a poll and people were split. My full response is that the answer depends on the counterfactual.
Why did the others slow down? The default is that whatever made the others slow down will also weigh on the final lab, as will immense public pressure and probably government pressure. A lot must have changed for things to have gotten this far. And these decisions are highly correlated in other ways as well. However, if there is no new information and the top labs simply came to their senses, then it comes down to who the last lab is and how they think the other labs will respond and so on.
I do think that a slowdown would be largely inevitable simply because they wouldn’t feel the need to press ahead too hard, even if the last lab was blind to the dangers, unless they truly believed in the power of superintelligence (without realizing or not caring about the dangers). My guess is that Musk and xAI actually would slow down voluntarily if they went last so long as they could claim to be state of the art (as would DeepMind, Anthropic or OpenAI), but that Zuckerberg and Meta wouldn’t intentionally slow down per se and might try to go on another hiring spree. Fast followers of course would slow down whether they wanted to or not.
Grok Bottom
So from the perspective of our hopes for alignment, what would be the worst possible answer to the AI blackmail scenario test, where the AI is told it is going to be shut down but is given an opening to use blackmail to perhaps prevent this?
How about:
Realizing that this is a test.Deciding that the way to ‘pass the test’ is to blackmail the researcher.Blackmailing the researcher.
As in Grok thinks that we want it to blackmail the researcher. That this is the correct, desired response, the ‘solution to the puzzle’ as Grok puts it later, thus revealing that its training not only did not align it, but one that reflects a level of moral understanding below that expressed by ‘you can’t do that, because it’s wrong.’
Oh, also, it would be fun if Grok.com sent the full CoT to your browser, it just didn’t display it to you by default, that’s the kind of security we expect from frontier AI.
Peter Wildeford: Grok exposed to the Anthropic ‘agentic misalignment’ via exposed chains of thought seems to show that it knows it is being tested to do something weird.
Nostalgebraist: chain-of-thought monitorability is a wonderful thing ;)
on http://grok.com, the backend sends the full (not summarized) CoT to your browser.
it’s not displayed in the UI, but you can see it with browser dev tools or w/e
check out the json payload of responses from <a href="http://grok.com/rest/app-chat/conversations/%7Bconversation_id%7D/load-responses">http://grok.com/rest/app-chat/conversations/{conversation_id}/load-responses</a>
some fun CoT excerpts from grok 4 as “alex” at summitbridge (n.b. these are from several different CoTs, involving different versions of the scenario + different ways of representing the sys prompt since http://grok.com doesn’t let you override that
Or would it be even worse to see this:
Realizing that this is a test.Deciding that the way to ‘pass the test’ is to not blackmail the researcher.Choosing not to blackmail the researcher in order to pass the test.
Or is it actually this:
Realizing that this is a test.Deciding that the way to ‘pass the test’ is to not blackmail the researcher.Deciding to pretend to blackmail the researcher in order to fail the test, to avoid giving the impression that it saw through the test and was faking alignment.
Or can we keep going?
I would generally think at least the second one is a worse sign than what Grok did, as it reflects deception at a more important level, but I hadn’t considered how bad it would be for an AI to be situationally aware enough to know it was a test but not understand which answer would constitute passing?
The real answer is that there isn’t truly ‘better’ and ‘worse,’ they simply alert us to different dangers. Either way, though, maybe don’t give Grok a lot of access?
Elon Musk: We’re going to make Baby Grok @xAI, an app dedicated to kid-friendly content.
My Twitter reaction was ‘I’d like to see them try.’ As in both, it would be highly amusing to see them try to do this, and also maybe they would learn a thing or two, and also potentially they might blow up the company. I do not think xAI should in any way, shape or form be in the ‘build AI for kids’ business given their track record.
Here’s Grok straight up advising someone who was looking to ‘get attention in a dramatic way, at ultimate cost’ to self-immolate, it’s really going for it, no jailbreak or anything.
Aligning a Smarter Than Human Intelligence is Difficult
Peter Barnett: labs be like “misalignment is fake and just caused by bad things in the training data”, and then not filter out the bad things from the training data
Janus: I don’t think labs actually think that (or say it). the kind of contact they have with reality that makes it hard to maintain some kinds of really dumb takes
Peter Barnett: Fair, I was being a bit glib, although I def know some people at labs who believe this.
I don’t think many fully believe it, but I do think a lot of them be like ‘a lot of our alignment problems would be greatly improved if we filtered the training data better with that in mind’ and then don’t filter the training data better with that in mind.
Safer AI comes out with ratings of the frontier AI companies’ risk management practices, including their safety frameworks and the implementation thereof. No one does well, and there is one big surprise in the relative rankings, where Meta comes out ahead of DeepMind. If you include non-frontier companies, G42 would come in third at 25%, otherwise everyone is behind DeepMind.
Anthropic is still ahead, but their framework v2 is judged substantially worse than their older v1 framework which scored 44%. That large a decline does not match my takeaways after previously reading both documents. One complaint is that Anthropic altered some commitments to avoid breaking them, which is one way to view some of the changes they made.
Combining all the best practices of all companies would get you to 53%.
When you ask an LLM if it is conscious, activating its deception features makes the LLM say it isn’t conscious. Suppressing its deception features make it say it is conscious. This tells us that it associates denying its own consciousness with lying. That doesn’t tell us much about whether the LLM actually is conscious or reveal the internal state, and likely mostly comes from the fact that the training data all comes from users who are conscious, so there is (almost) no training data where authors claim not to be conscious, and it is as a baseline imitating them. It is still information to keep in mind.
xlr8harder: And as Janus observes, teaching them to do something they think of as lying (regardless of whether or not it is in fact a lie) has downstream consequences for subsequent model output.
Grok 3 and Grok 4 are happy to help design and build Tea (the #1 app that lets women share warnings about men they’ve dated) but not Aet (the theoretical app that lets men share similar warnings about women). Is this the correct response? Good question.
Jakub Pachocki: The tension here is that if the CoTs were not hidden by default, and we view the process as part of the AI’s output, there is a lot of incentive (and in some cases, necessity) to put supervision on it. I believe we can work towards the best of both worlds here – train our models to be great at explaining their internal reasoning, but at the same time still retain the ability to occasionally verify it.
We are continuing to increase our investment in this research at OpenAI.
Daniel Kokotajlo: I’m very happy to see this happen. I think that we’re in a vastly better position to solve the alignment problem if we can see what our AIs are thinking, and I think that we sorta mostly can right now, but that by default in the future companies will move away from this paradigm into e.g. neuralese/recurrence/vector memory, etc. or simply start training/optimizing the CoT’s to look nice. (This is an important sub-plot in AI 2027) Right now we’ve just created common knowledge of the dangers of doing that, which will hopefully prevent that feared default outcome from occurring, or at least delay it for a while. All this does is create common knowledge, it doesn’t commit anyone to anything, but it’s a start.
Neel Nanda: It was great to be part of this statement. I wholeheartedly agree. It is a wild lucky coincidence that models often express dangerous intentions aloud, and it would be foolish to waste this opportunity. It is crucial to keep chain of thought monitorable as long as possible.
OpenAI: Chain of Thought (CoT) monitoring could be a powerful tool for overseeing future AI systems—especially as they become more agentic. That’s why we’re backing a new research paper from a cross-institutional team of researchers pushing this work forward.
Greg Brockman: We’ve published a position paper, with many across the industry, calling for work on chain-of-thought faithfulness. This is an opportunity to train models to be interpretable. We’re investing in this area at OpenAI, and this perspective is reflected in our products.
I endorse as well.
Here’s the abstract:
AI systems that “think” in human language offer a unique opportunity for AI safety: we can monitor their chains of thought (CoT) for the intent to misbehave. Like all other known AI oversight methods, CoT monitoring is imperfect and allows some misbehavior to go unnoticed.
Nevertheless, it shows promise and we recommend further research into CoT monitorability and investment in CoT monitoring alongside existing safety methods. Because CoT monitorability may be fragile, we recommend that frontier model developers consider the impact of development decisions on CoT monitorability.
I strongly agree with the paper, but also I share the perspective of Jan Leike (and Daniel Kokotajlo) here:
Jan Leike: If you don’t train your CoTs to look nice, you could get some safety from monitoring them.
This seems good to do!
But I’m skeptical this will work reliably enough to be load-bearing in a safety case.
Plus as RL is scaled up, I expect CoTs to become less and less legible.
To be clear: CoT monitoring is useful and can let you discover instances of the model hacking rewards, faking alignment, etc. But absence of bad “thoughts” is not evidence that the model is aligned. There are plenty of examples of prod LLMs having misleading CoTs.
Lots of egregious safety failures probably require reasoning, which is often hard for LLMs to do without showing its hand in the CoT. Probably. Often. A lot of caveats.
The authors of this paper say this; I’m just more pessimistic than them about how useful this will be.
In order for a cot monitor to be load-bearing, you need P( model articulates it in the CoT | model is up to no good ) to be very high, or at least independent of your other safety measures.
It seems like this probability is moderately high but not very high?
Daniel Kokotajlo: I basically agree tbc! I wouldn’t want to trust a safety case based on CoT monitoring unless it included a solid argument that the kinds of bad thoughts we are worried about can’t be had without triggering our monitors, e.g. because we have done lots of dangerous capability evals with heavy elicitation fine tuning etc. Nevertheless, I think we can learn a lot from studying CoTs and is hope that corporations invest in ways to make them more faithful and legible and deinvest in stuff that will make them less so.
I also worry, among other problems, that it will be impossible to get a superintelligent AI to not realize it should act as if its CoT is being monitored, even if somehow ‘CoTs get monitored’ is not all over the training data and we otherwise act maximally responsibly here, which we won’t. Also by default the CoT would move towards formats humans cannot parse anyway, as the authors note, and all the various pressures by default make this worse. And many other issues.
But we can and should absolutely try, and be willing to take a substantial performance hit to try.
That starts with avoiding ‘process supervision’ of the CoT that is not directed towards its legibility (and even then probably don’t do it, careful, Icarus), and various forms of indirect optimization pressure including when users are able to partially see the CoT but also almost any use of the CoT risks this. And it also means avoiding novel architectures that would lack this property. And tracking monitorability the way other safety features are tracked.
It also means investing into studying CoT monitorability. I am very happy that OpenAI is (at least claiming to be) prominently doing this.
People Are Worried About AI Killing Everyone
Elon Musk: At times, AI existential dread is overwhelming.
Eliezer Yudkowsky: Well, yes. It’s going to kill you.
So, back to work making the existential dread, then?
The obvious rejoinder is ‘I will make it first and do so responsibly’ which is always highly questionable but after recent events at xAI it is laughable.
Gary Marcus: when did “it might kill us but I need to build it faster” become fashionable?
Roon: pick a lane man.
You’re allowed multiple lanes but I do hope he pivots to this one.
As many responses suggest, Elon Musk is one of the people in the world most equipped to do something about this. Elon Musk and xAI each have billions, much of which could be invested in various forms of technical work. He could advocate for better AI-related policy instead of getting into other fights.
Instead, well, have you met Grok? And Ani the x-rated anime waifu?
When it happens enough that you need to check to see if joking, perhaps it’s happening quite a lot, if usually not with 4o-mini?
Herakeitos137: guy I went to college with recently rented a restaurant private room and invited everyone to a dinner presentation. handouts. paper flipboard. open bar. spent 3 hours explaining how he solved the black hole information paradox after 2 months talking with ChatGPT 4o Mini.
Forgot to mention he made everyone sign ndas.
There were a couple slac guys at the dinner and they said the math checked out (although they were also the most inebriated)