Artificial Ignorance · March 4
Hallucinations Are Fine, Actually

The article examines the "hallucination" problem in AI large language models (LLMs): confidently generating false information. Although early coverage worried that this made them unfit for serious use, the author argues hallucinations are not an insurmountable obstacle. Through model improvements, software scaffolding, and a shift toward human-AI collaboration, AI reliability is improving. The article stresses that LLMs should be treated as tools that augment human capability rather than fully automate work; by collaborating with them and sharing responsibility for the output, we can reduce errors and realize AI's real value.

⚠️ The "hallucination" problem in large language models (LLMs) refers to their confidently generating false information, such as inventing citations or discussing incorrect facts. The issue has drawn wide attention and is seen as a major flaw of LLMs.

📈 Model providers are actively working to address hallucinations through better training methods, reinforcement learning from human feedback, and architectural improvements. The latest models keep outperforming their predecessors on factual reliability, with markedly lower hallucination rates.

🛠️ LLMs are increasingly embedded in systems designed to amplify their strengths and mitigate their weaknesses. Applications like Cursor and Perplexity use techniques such as prompt engineering, tool use, and agentic loops to make LLMs markedly more reliable and effective.

🤝 Early on, many hoped LLMs could immediately automate work end to end, but the article argues it is more effective to treat them as tools that augment human capability. Through human-AI collaboration and shared responsibility for the final output, errors can be reduced and the quality of work improved.

When I first started using ChatGPT, I quickly discovered its tendency to confidently make things up. Whether you want to call it lying, fabricating, or just bullshitting - it had no problems inventing citations or discussing factually incorrect events.

In my second-ever post, I was already harping on this:

ChatGPT is impressive in its capabilities, but there are important limitations to keep in mind. First and foremost, LLMs are not always factually correct. ChatGPT has a tendency to “hallucinate,” or confidently provide answers that are wholly wrong. It’ll summarize nonexistent books, or argue that one plus one is three. It sometimes invents news sources or misattributes quotes. If ChatGPT were a person, it would be great at BSing.

As a result, a human should review and edit any AI-generated content, especially if it’s customer-facing.

I wasn't alone. The technical term "hallucination" quickly entered the mainstream, becoming shorthand for why generative AI remained fundamentally untrustworthy despite its impressive capabilities. Every few weeks seemed to bring another headline about people and companies alike getting in trouble for trusting AI too much.

But here's the thing: I've changed my mind. Not because hallucinations have disappeared - they haven't - but because I've come to see them as a surmountable challenge rather than an Achilles heel.

The Hallucination Panic

The media has had a field day with AI hallucinations, and not entirely without reason. The examples are both amusing and concerning, and from the beginning of ChatGPT to today, they have fueled worried headlines.

The narrative solidified: LLMs are impressive toys, but hallucinations make them unsuitable for serious usage. To this day, AI skeptics (such as Ed Zitron) will make this argument as to why the entire industry is a waste:

Personally, when I ask someone to do research on something, I don't know what the answers will be and rely on the researcher to explain stuff through a process called "research." The idea of going into something knowing about it well enough to make sure the researcher didn't fuck something up is kind of counter to the point of research itself.

This argument starts from the position that "unless an LLM is 100% reliable, it's useless" - and I can see why that's a tantalizing position to take. But I think this black-and-white view misses several key trends in how models, products, and even our own behaviors are evolving, making AI more reliable along the way.

The Bitter Lesson

Here's the rub: hallucinations are kind of fundamental to LLMs. Andrej Karpathy has described them as "dream machines" - a prompt starts them in a certain place and with a certain context, and they lazily, hazily remember their training data to formulate an answer that seems probable.

Most of the time, the result goes someplace useful. It's only when the dream wanders into territory we deem factually incorrect that we label it a "hallucination". It looks like a bug, but it's just the LLM doing what it always does.

Yet even though it's unlikely that we'll get these "factually incorrect" hallucinations down to 0%, that doesn't mean we can't improve the numbers. One example comes from Vectara, which has been tracking new frontier models against its hallucination benchmark.

Some of the oldest models, like the comparatively ancient Mistral 7B, have a hallucination rate of nearly 10%. Yet the latest group of flagship models, from o3-mini to GPT-4.5 to Gemini 2.0, have a hallucination rate of near or below 1%.

With OpenAI's latest GPT-4.5 launch, the company also showed that its hallucination rate on another benchmark, SimpleQA, dropped from 62% (with GPT-4o) to 37%. Across the board, newer model generations consistently outperform their predecessors when it comes to factual reliability.

This improvement isn't accidental. Model providers have recognized hallucinations as a significant barrier to adoption and are actively working to address them through better training methods, reinforcement learning from human feedback, and architectural improvements.

So while perfect factual accuracy may remain elusive with current architectures, the trend line is clear and encouraging - these models' raw capabilities continue to improve at a remarkable pace. But it’s only the first step in creating hallucination-free (or at least, hallucination-resistant) experiences.
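
For a concrete sense of what a number like Vectara's measures, here's a minimal sketch of computing a hallucination rate: have a model summarize source documents, then count how often a judge flags the summary as unsupported. This isn't Vectara's (or OpenAI's) actual methodology, and `summarize_with_model` / `judge_is_supported` are hypothetical stand-ins for a real model call and a real factual-consistency checker.

```python
# Rough sketch of a summarization-based hallucination benchmark (not any
# vendor's actual pipeline): summarize documents, count unsupported summaries.
from typing import Callable

def hallucination_rate(
    documents: list[str],
    summarize_with_model: Callable[[str], str],
    judge_is_supported: Callable[[str, str], bool],
) -> float:
    """Fraction of summaries the judge flags as unsupported by their source."""
    if not documents:
        raise ValueError("need at least one document")
    flagged = 0
    for doc in documents:
        summary = summarize_with_model(doc)
        if not judge_is_supported(doc, summary):
            flagged += 1
    return flagged / len(documents)

# Toy stand-ins: a "model" that copies the first sentence and a "judge"
# that checks the summary appears verbatim in the source.
docs = [
    "The cat sat on the mat. It purred.",
    "Paris is the capital of France. It sits on the Seine.",
]
rate = hallucination_rate(
    docs,
    summarize_with_model=lambda d: d.split(".")[0] + ".",
    judge_is_supported=lambda d, s: s in d,
)
print(f"hallucination rate: {rate:.0%}")  # 0% with these toy stand-ins
```

In practice the judge is typically itself a model, and imperfect, so these leaderboard numbers are best read as trend lines rather than ground truth.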

Revenge of the wrappers

Beyond the benchmarks, focusing exclusively on the raw models misses a crucial point: LLMs aren't used in isolation. They're increasingly embedded within systems designed to amplify their strengths and mitigate their weaknesses.

Take Claude 3.5 Sonnet. On its own, it's a solid model for generating code - I've used it plenty of times to write one-off scripts. But I wouldn't ordinarily think of it as something that can 10x my productivity; there's too much friction in giving it the right context and describing my problem accurately. Not to mention having to copy/paste its answer and slowly integrate it into my existing code.

But Cursor has managed to take Claude 3.5 Sonnet and turn it into an agentic tool that can 10x my productivity. It's not perfect - far from it - but it's dramatically more effective than Claude alone. Cursor has built an entire system of prompt engineering, tool usage, agentic loops, and more to give Claude an Iron Man-esque suit of armor when it comes to coding.[1]
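
To illustrate the pattern (and emphatically not Cursor's actual implementation), here's a minimal sketch of an agentic loop with tool use. The `call_llm` function, and the decision format it returns, are hypothetical stand-ins for whatever model API and protocol you'd actually wire up.

```python
# A minimal sketch of an agentic loop with tool use - the general pattern,
# not Cursor's implementation. The harness executes the model's proposed
# action and feeds the result back, so each step is grounded in what the
# code and tests actually say rather than what the model "remembers."
import subprocess
from pathlib import Path

def read_file(path: str) -> str:
    """Tool: return a file's contents so the model sees real context."""
    return Path(path).read_text()

def run_tests() -> str:
    """Tool: run the test suite and return its output for the model to react to."""
    result = subprocess.run(["pytest", "-q"], capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {"read_file": read_file, "run_tests": run_tests}

def agent_loop(task: str, call_llm, max_steps: int = 10) -> str:
    """call_llm is a hypothetical stand-in: given the message history, it returns
    either {"action": "tool", "tool": name, "args": {...}}
    or     {"action": "finish", "answer": "..."}."""
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_llm(messages)
        if decision["action"] == "finish":
            return decision["answer"]
        observation = TOOLS[decision["tool"]](**decision.get("args", {}))
        # Feed the tool output back so the next step works from ground truth.
        messages.append({"role": "user",
                         "content": f"{decision['tool']} returned:\n{observation}"})
    return "Ran out of steps without finishing."
```

The key move is the feedback step: each iteration works from what the tools actually returned, rather than from whatever the model half-remembers about the codebase.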

Perplexity offers another example. While "web search" is now a standard feature in many AI chatbots, Perplexity created a compelling AI search experience at a time when asking GPT-3.5, "Who won the 2030 Super Bowl?" would yield confidently incorrect answers.

This may actually lead to some vindication of products that have previously been derided as "ChatGPT wrappers." Yes, many of them were lightweight, low-effort products, and they will likely either be made obsolete by better models and products or get lost amid the sea of copycats. But we're already seeing the benefits of adding software-based scaffolding around ChatGPT. I often think of a "car" metaphor here[2] - the engine is powerful, but not usable as a form of transport without the doors, windows, and wheels.
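
To make the scaffolding idea concrete, here's a rough sketch of the retrieval-grounding pattern behind AI search products. It isn't Perplexity's actual pipeline; `web_search` and `call_llm` are hypothetical stand-ins for a search API and a model API.

```python
# Sketch of retrieval-grounded answering (the general pattern, not
# Perplexity's pipeline): fetch sources first, then force the model to
# answer only from them, with citations.

def grounded_answer(question: str, web_search, call_llm, k: int = 5) -> str:
    # 1. Retrieve: assume web_search returns a list of {"url": ..., "snippet": ...}.
    results = web_search(question)[:k]
    sources = "\n".join(
        f"[{i + 1}] {r['url']}\n{r['snippet']}" for i, r in enumerate(results)
    )
    # 2. Constrain: the prompt forbids answering beyond the retrieved sources.
    prompt = (
        "Answer the question using ONLY the numbered sources below, citing "
        "them like [1]. If the sources don't contain the answer, say you "
        "don't know.\n\n"
        f"Sources:\n{sources}\n\nQuestion: {question}"
    )
    # 3. Generate: the model paraphrases retrieved text instead of recalling
    # (or inventing) facts from its training data.
    return call_llm(prompt)
```

The model can still get things wrong, but the failure mode shifts from inventing facts to misreading sources, which is far easier for a human to catch.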

Benedict Evans has also argued that we may have jumped the gun on thinking of LLMs as fully-finished "products":

LLMs look like they work, and they look generalised, and they look like a product - the science of them delivers a chatbot and a chatbot looks like a product. You type something in and you get magic back! But the magic might not be useful, in that form, and it might be wrong. It looks like a product, but it isn’t.

Engineering techniques pioneered by apps like Cursor and Perplexity can boost the reliability and impact of LLMs. As we see more specialized tools - like Harvey for lawyers or Type for writers - I expect the effective hallucination rate to drop even further.

Augmentation over automation

Perhaps the most crucial shift, though, is in how we approach these tools. The early fascination with LLMs led many to view them as capable of immediately automating end-to-end work - magical AI systems that could take over jobs completely. This framing naturally made hallucinations (not to mention potential job displacement) seem catastrophic.

But what I see as a more nuanced, and ultimately more productive, framing is to view LLMs as augmentation tools. As a personal philosophy, I believe we should aim to augment work with AI, rather than automate it.[3] More than that, I increasingly see my work with LLMs as collaborative. And collaborating with someone (or something) means taking partial responsibility for the final output.

Ethan Mollick has referred to this as the difference between what he calls "centaurs" and "cyborgs":

Centaur work has a clear line between person and machine, like the clear line between the human torso and horse body of the mythical centaur. Centaurs have a strategic division of labor, switching between AI and human tasks, allocating responsibilities based on the strengths and capabilities of each entity.

On the other hand, Cyborgs blend machine and person, integrating the two deeply. Cyborgs don't just delegate tasks; they intertwine their efforts with AI, moving back and forth over the jagged frontier. Bits of tasks get handed to the AI, such as initiating a sentence for the AI to complete, so that Cyborgs find themselves working in tandem with the AI.

Choosing between the centaur and cyborg approaches hits particularly close to home when it comes to programming. It's so, so tempting to have Cursor Agent go off and do its thing, to wait until I can hit "Accept all changes" and move on to the next task.[4] And in a low-stakes scenario, that's probably fine.

But when it comes to working on a team or with a complex codebase, this strikes me as irresponsible. If, as AI proponents argue, LLMs are currently at the level of a "junior developer," then it's wild to imagine simply shipping AI-generated code without understanding it. And that applies more broadly: if you're willing to email 10,000 customers without proofreading the content written by a "junior marketer," that's on you.[5]

Because yes - Cursor gets it wrong. Perplexity gets it wrong. GPT-4.5 gets it wrong. Sometimes repeatedly, in ways both obvious and subtle. But I've also learned how to help my AI partners get things wrong less often.

Ultimately, I know that while AI can help me code/write/ideate, it can never be held accountable. To quote Simon Willison (emphasis added):

Just because code looks good and runs without errors doesn’t mean it’s actually doing the right thing. No amount of meticulous code review—or even comprehensive automated tests—will demonstrably prove that code actually does the right thing. You have to run it yourself!

Proving to yourself that the code works is your job.

Embracing imperfection

The real question isn't whether LLMs will ever stop hallucinating entirely - it's whether we can build systems and practices that make hallucinations increasingly irrelevant in practice. And I'm increasingly confident that we can.

So the next time you encounter an AI hallucination - whether it's a fake book title, a nonexistent JavaScript method, or a confidently incorrect URL - remember that this all takes a bit of trial and error, but we're making progress on the problem. Hallucinations are real, but they're also, increasingly, fine.

Thanks for reading! This post is public so feel free to share it.

[1] Ironically, they may have gone too far in tailoring their setup for Claude 3.5 Sonnet - try as I might, I can't quite get the same spark going with newer, ostensibly "smarter" models like o1 and Claude 3.7 Sonnet.

[2] Sadly, I'm not a car guy, so while this makes sense in my head, feel free to tell me if it's actually terrible.

[3] This isn't iron-clad: clearly there is some amount of drudgery that I love to automate with AI. But in general, I aim to reach for automation as the last step, not the first.

[4] To be fair, I certainly do this "vibe coding" some of the time. But I'm slowly developing a thesis on when it seems like a workable approach, and when it doesn't - stay tuned for more on this topic.

[5] A common complaint here is something along the lines of "if I have to review every line of code an LLM writes, why should I even bother using it?" And I mean... if you have to review every line of code a junior developer writes, why should you even bother hiring them?

Call me crazy, but I believe there's value in having a collaborator that can generate 80% correct solutions at 10x the speed, that writes documentation without complaint, that offers multiple approaches to solve a problem, and that never gets offended when you reject its ideas.
