Artificial Ignorance - June 11, 22:44
The State of AI Engineering (2025)

The author summarizes key takeaways from the AI Engineer World's Fair in San Francisco, covering evals, agents, RAG, fine-tuning, custom benchmarks, voice, GPT wrappers, the Model Context Protocol (MCP), and the evolution from coding agents to containerized fleets. The piece stresses the central role of evals in building AI applications, the mainstreaming of agents, the rapid evolution of RAG, and the quiet decline of fine-tuning. It also covers the value of custom benchmarks, voice as multimodality's killer app, the durable value of GPT wrappers, the standardization trend around MCP, and a shift in software development cost models. The author also introduces "context engineering": being deliberate about everything that goes into a model's context.

💡 The eval challenge: the building blocks of generative AI have matured, but rigorously evaluating AI outputs remains very hard, especially for subjective, nuanced outputs. Evals play a central role in building AI applications, and defining and building evaluation systems is genuinely difficult.

🤖 The rise of agents: agents were everywhere at the conference, a sign that they have gone from "the new thing" to an accepted way of building. The key question is no longer whether to use agents, but how to implement them.

📚 The evolution of RAG: retrieval-augmented generation has grown remarkably sophisticated, moving from basic document chunking and embedding to hybrid search, graph databases, and recommendation systems, a fast-moving ecosystem.

🎤 Voice as multimodality's killer app: voice has become the primary multimodal interface and the focus of investment and engineering. The race is to build conversational AI with truly human-like speed and nuance.

🔨 The durable value of GPT wrappers: despite early skepticism, applications built on top of foundation-model APIs ("GPT wrappers") have shown real staying power, especially in AI coding, delivering compelling experiences without building new models from scratch.

It's been less than a week since I wrapped up my third AI Engineer conference in San Francisco (now known as the AI Engineer World's Fair). While I missed the agent-specific summit in New York earlier this year, the SF version proved to be remarkably multifaceted, both in terms of tracks and content.

The conference returned to the Marriott Marquis in downtown San Francisco, this time overlapping with Snowflake Summit (with a whopping 20,000 attendees). This meant navigating crowds of attendees with slightly different-colored badges and being thankful to have avoided insane last-minute hotel prices.

Of all the AI conferences I've attended, the AI Engineer World's Fair continues to deliver some of the best content¹. Unlike other AI events where I sometimes suspect I know more about LLMs than the speakers on stage, that's rarely the case here. And my one complaint from last year (how commercial the event felt) was very much addressed: this was an event by engineers, for engineers (and researchers - I was able to see fellow Substack authors in person).

Speaking of content, there was quite a lot to go around this year. The event featured eighteen tracks, up from nine last year.

No one person could consume them all - I caught at least one talk from only about half the tracks, so my recap is inevitably biased here. But even with my limited perspective, I walked away with a dozen new big ideas, and many more tips and tactics for building with generative AI.

Here are my most compelling insights from the conference - many of these could easily be expanded into a standalone deep dive, so leave a comment if you want to read more.


Evals: The Persistent Challenge

One theme emerged repeatedly: not only are evals crucial when building AI-enabled applications, but writing good evals remains very difficult. As we've mastered the building blocks of generative AI - structured outputs, basic RAG configurations, agentic loops - we're hitting a wall. For the nuanced, subjective outputs that make AI uniquely valuable, how do we define "good"? And how do we build rigorous systems to evaluate quality?
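
One increasingly common answer for grading subjective outputs is to have a second model score them against a rubric ("LLM as judge"). Here's a minimal sketch of that shape - the rubric, function names, and question/answer pairs are all illustrative, and the judge is stubbed out with a crude heuristic so the harness runs without an API call:

```python
# Sketch of "LLM as judge" grading for subjective outputs.
# In practice, judge() would be a model API call with RUBRIC + the pair;
# here it's a stand-in heuristic so the harness runs offline.

RUBRIC = "Score the answer 1-5 for accuracy and tone. Reply with the digit only."

def judge(question: str, answer: str) -> int:
    # Stub: approximate "on-topic answers score higher" by checking whether
    # the question's last keyword shows up in the answer.
    if not answer:
        return 1
    keyword = question.split()[-1].rstrip("?").lower()
    return 5 if keyword in answer.lower() else 2

def grade_batch(pairs: list[tuple[str, str]]) -> float:
    """Mean rubric score across (question, answer) pairs."""
    scores = [judge(q, a) for q, a in pairs]
    return sum(scores) / len(scores)

pairs = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("Can you explain recursion?", "It's when a function calls itself: recursion."),
]
print(grade_batch(pairs))
```

The hard part the conference kept circling back to isn't this plumbing - it's writing a rubric that actually captures "good" for your domain.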

This isn't new - I noted the importance of evals last year - but it's striking that this is perhaps the hardest problem in AI engineering.

The Age of Agents: From Skepticism to Assumption

Agents were so ubiquitous at the conference that I almost forgot to mention them - which speaks volumes. The assumption across most talks was that you were building for an agentic implementation, or at least wanted the option to upgrade your chat experience into one. As someone who, only a few months ago, was skeptical about the wave of agent hype, I've since come around on what agents are becoming capable of.

This represents a dramatic shift from last year, when agent infrastructure was "the new hotness" and we worried about reliability and endless loops. Now, agentic workflows aren't a question of "if" but "how." The entire ecosystem has reoriented around this assumption.

The Evolution of RAG

As someone who doesn't work with complex RAG systems daily, I was surprised by how sophisticated the landscape has become. Basic RAG implementations - chunking documents, embedding them, storing in vector databases - now seem quaint. The fact that RAG has now split into three separate conference tracks (Search and Retrieval, GraphRAG, and Recommendation Systems) speaks to the rapid evolution: hybrid search, graph databases, re-rankers, and various recommendation techniques have created a complex ecosystem that's advancing at breakneck speed.
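
To make "hybrid search" a little more concrete, here's a sketch of reciprocal rank fusion (RRF), one common way to merge a keyword (e.g. BM25) ranking with a vector-similarity ranking. The document IDs are made up; k=60 is a commonly used default:

```python
# Reciprocal rank fusion: each result list contributes 1/(k + rank) to a
# document's score, so documents ranked well by *both* retrievers win.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_a", "doc_b", "doc_c"]   # e.g. from BM25
vector_hits  = ["doc_b", "doc_d", "doc_a"]   # e.g. from embedding search
print(rrf([keyword_hits, vector_hits]))
# doc_b appears near the top of both lists, so it fuses to rank 1.
```

Re-rankers and graph traversal then layer on top of a fused candidate list like this one.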

The Quiet Decline of Fine-Tuning

Fine-tuning barely came up this year, in contrast to previous conferences. Several theories might explain this shift.

Build Your Own Benchmarks (BYOB)

In just a few short months, many leading benchmarks for coding and math have become saturated. While we're inventing new ones nearly every week, it's becoming clear that public STEM benchmarks are becoming less and less useful to measure frontier models' capabilities. The solution: create your own benchmarks for tasks that you care about.

There are (at least) two reasons to do this. First, it's unlikely that your pet benchmark will become public (unless you're Simon Willison), meaning it won't become a target to be gamed. Second, as models improve, you can get a much more realistic sense of how well each one does on the specific domains that you care about. Sure, o3-pro is "smarter," but unless you're regularly working on PhD-level problems, how much will you notice?
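
A personal benchmark doesn't need to be elaborate. Here's a minimal sketch of the idea - the task format, the string-match grader, and the stub "models" are all hypothetical; the point is that the task list lives only with you:

```python
# A tiny private benchmark: tasks you care about, graded by simple checks.
# Swap the stub lambdas for real model API calls when comparing versions.

TASKS = [
    {"prompt": "Summarize: cats are mammals that purr.",
     "must_include": ["cats", "mammals"]},
    {"prompt": "What is 17 * 3?", "must_include": ["51"]},
]

def grade(output: str, must_include: list[str]) -> bool:
    """Pass only if every required term appears in the output."""
    return all(term.lower() in output.lower() for term in must_include)

def score(model, tasks=TASKS) -> float:
    """Fraction of tasks the model passes."""
    return sum(grade(model(t["prompt"]), t["must_include"]) for t in tasks) / len(tasks)

# Stub "models" so the harness runs offline:
always_right = lambda p: "Cats are mammals. 17 * 3 = 51."
always_wrong = lambda p: "No idea."

print(score(always_right), score(always_wrong))
```

Re-run the same task file against each new model release and you get a trend line for your domain, not for a leaderboard's.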

Voice: Multimodality’s Killer App

The business case for multimodality has crystallized around a single medium: voice. In past years, we marveled at LLMs' ability to handle various inputs and outputs - audio, images, and video. Now, most investment and engineering effort seems to have converged on audio as the primary interface, with teams obsessing over latency and conversational intelligence.

While image generation and video synthesis are impressive, voice represents the most natural human interface. The race is on to create truly conversational AI that responds with human-like speed and nuance. Ironically, perhaps ChatGPT's most popular feature is a different kind of multimodality: the Studio Ghibli-style image generation that swept across the internet.


GPT Wrappers Get The Last Laugh

In ChatGPT's immediate aftermath, we saw a wave of startups derided as "GPT wrappers" - applications built on OpenAI's APIs without a strong AI moat. The conventional wisdom was that they'd become irrelevant as foundation models got better.

Three years later, we're finding that compelling experiences can, in fact, be built without inventing new models from scratch. Exhibit A: the multi-billion-dollar AI coding industry, led by Cursor, Windsurf, and Github Copilot. While some applications are now training custom models, a long tail of applications for law, finance, healthcare, and more will likely create multi-billion-dollar markets without needing more than a well-crafted GPT wrapper.

It’s MCP’s Race to Lose

The conference featured numerous talks on Model Context Protocol (MCP) - its use cases, limitations, and future. Two things stood out: the remarkable adoption already occurring, and active work on the standard's current limitations (like authentication and discovery). MCP appears to be winning the race to define a standardized agent protocol, and it's Anthropic's game to lose.

For a deeper dive on MCP, check out my earlier post.

The Hidden Cost of "Yappiness"

Reasoning models have introduced a new dimension to cost calculations. Previously, we only had to compare intelligence scores (benchmarks) with cost per token to determine whether running the latest models was worthwhile. However, reasoning models complicate this equation in two ways: they charge for "thinking" tokens generated before the final answer and have varying levels of "yappiness" - how many tokens they generate before reaching that answer.

This makes it significantly harder to predict spending on reasoning models. While model providers try to offset this by setting maximum "thinking" budgets, it's still a challenge for cost-conscious applications needing predictable pricing.
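
To make the "yappiness" problem concrete, here's a back-of-envelope cost model. All prices and token counts are hypothetical, and it assumes (as providers generally do today) that thinking tokens bill at the output rate:

```python
# Toy cost model for a reasoning model request. Prices are $ per 1M tokens
# and are invented for illustration - check your provider's actual rates.

def request_cost(input_tokens: int, thinking_tokens: int, answer_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Thinking tokens are billed as output, even though the user never sees them."""
    return (input_tokens * in_price
            + (thinking_tokens + answer_tokens) * out_price) / 1_000_000

# Same question and same-length answer, two hypothetical levels of yappiness:
terse = request_cost(1_000, 500, 300, in_price=2.0, out_price=8.0)
yappy = request_cost(1_000, 8_000, 300, in_price=2.0, out_price=8.0)
print(f"terse: ${terse:.4f}  yappy: ${yappy:.4f}")
```

With these made-up numbers, the yappy run costs roughly eight times the terse one for an identical visible answer - which is exactly why per-token price lists no longer tell you what a model will cost.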

From Coding Agents to Containerized Fleets

AI coding agents have burst onto the scene, but we're already looking beyond them. Windsurf and Cursor agents have been around for less than six months, yet it's clear the future of AI coding lies in fleets of containerized agents working on codebases independently. As much as this may sound like science fiction, it's on our doorstep: folks are already publishing (if not launching) technology to run dozens of coding agents simultaneously.

The challenge, of course, is coordination - ensuring agents don't step on each other's toes, and designing your architecture so that extensions and modifications come easily. This evolution leads to an even more fundamental shift in how we think about software development costs, namely:

The Software OpEx Paradigm Shift

Historically, the biggest cost in developing software was CapEx - deciding upfront how many engineers to hire and budgeting for their salaries. Once built, most software had marginal operating costs.

Generative AI is messing with this model. Software development, while still a CapEx expense, might no longer be measured in massive tranches of engineer salaries. We can now turn a dial on how much software we want and how fast we want it made. Businesses will grapple with a new question: If you have a budget for five engineers, is it better to hire five humans or give one engineer enough AI budget to equal the output of the other four?

Everything Is Context Engineering

Prompt engineering has evolved dramatically since GPT-3, but I've started considering it as a subset of something broader: context engineering. The idea is to be deliberate about everything given to a model's context - not just prompts, but things like RAG outputs (i.e., all the work done to ensure relevant results), tool calls, error messages, everything.

And in thinking about prompting this way, two conclusions follow: 1) you should care deeply about what goes into your context, so as not to pollute it, and 2) the way we prompt LLMs likely needs to change dramatically. Vibe coding prompts should likely become closer to rigid specs than casual text messages. Agents should be able to clear out error messages when they hit a dead end, to avoid getting stuck in a loop and retracing their steps. Solid strategies for maintaining clean, relevant context are sorely needed.
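
One way to picture context engineering is as a budgeted assembly step that prioritizes what matters and drops what's stale. This sketch is purely illustrative - the item kinds, priority order, and token counts are made up:

```python
# Sketch of deliberate context assembly: fill a token budget with the
# highest-priority, still-relevant items, and drop resolved error logs.

def build_context(items: list[dict], budget_tokens: int) -> list[dict]:
    """items: {"kind": str, "text": str, "tokens": int, "stale": bool}.
    Keep high-priority, non-stale items until the token budget runs out."""
    priority = {"spec": 0, "tool_result": 1, "error": 2, "chat": 3}
    kept, used = [], 0
    for item in sorted(items, key=lambda i: priority.get(i["kind"], 9)):
        if item.get("stale"):                  # e.g. an error already fixed
            continue
        if used + item["tokens"] > budget_tokens:
            continue
        kept.append(item)
        used += item["tokens"]
    return kept

items = [
    {"kind": "spec",  "text": "Build a CSV parser...", "tokens": 400, "stale": False},
    {"kind": "error", "text": "TypeError (fixed)",     "tokens": 900, "stale": True},
    {"kind": "error", "text": "Current failing test",  "tokens": 300, "stale": False},
    {"kind": "chat",  "text": "earlier small talk",    "tokens": 600, "stale": False},
]
print([i["kind"] for i in build_context(items, budget_tokens=1000)])
```

The spec and the live error survive; the fixed error and the small talk don't - which is the whole idea of keeping a context clean rather than merely full.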

The Ramanujan Question

One of the most profound questions in AI research is whether models can generate genuinely new, correct conclusions that aren't in their training data. The human benchmark for this might be Srinivasa Ramanujan, the Indian mathematician who made significant contributions to advanced mathematics despite having no formal training.

All Ramanujan had was two college students who lived in his home when he was eleven, and at sixteen, a copy of Synopsis of Pure Mathematics - a collection of 5,000 theorems. From this foundation, he derived thousands of identities and equations, his work cut short only by his early death from dysentery complications.

The trillion-dollar question for AI is: What made Ramanujan so capable, and can we build a neural architecture capable of the same kind of creative mathematical discovery?


Looking Ahead: From Infrastructure to Philosophy

Comparing this year's conference to 2024 reveals how dramatically the field has transformed. Last year, we worried about making AI work - deployment, tooling, getting from prototype to production. The conference felt commercial, even corporate, as the industry rushed to build infrastructure for the AI gold rush.

This year, the questions have become existential. We've moved from "how do we deploy this?" to "what does this mean for software itself?" The Ramanujan question - whether AI can discover genuinely new knowledge - would have seemed out of place amid 2024's focus on vector databases and monitoring tools.

Some challenges persist, but have evolved. Evals remain "the hardest problem," but we've graduated from asking "how do we test?" to "how do we define good for nuanced, subjective outputs?" The agent infrastructure that was "the new hotness" in 2024 has become so foundational that it's assumed that every application is either agentic or preparing to be. In just twelve months, we've gone from worrying about agents getting stuck in loops to coordinating fleets of them.

As we enter an era where software development might be measured in compute budgets rather than headcount, where agent fleets tackle problems no single engineer could handle, and where AI might genuinely discover new knowledge, one thing is clear: the boundaries of AI engineering are expanding faster than any of us can fully grasp.

What started as a niche between ML and software engineering has exploded into a constellation of specialties. Voice engineers, AI PMs, eval designers, AI architects - job titles that didn't exist last year are now entire career paths. The conference's growth from 2,000 to 3,000 attendees understates the real expansion: the surface area of interesting problems has grown exponentially.

Walking the conference halls, I felt a familiar mix of excitement and vertigo. How can anyone keep up when every track represents months of learning? But maybe that's the point. We're no longer in an era where one person can master "AI engineering." We're in an era of specialists, teams, and most importantly, incredible possibility. The field isn't just growing; it's exploding outward - eating everything, everywhere, all at once.


¹ It also has some of the most baffling logistics. I chalk this up to the organizers being engineers, but still - Swyx, if you're reading this: I know how insanely difficult event organizing is, and I'm guessing a lot of the logistical hiccups are outside of your control - but buttoning up the details here would truly make the event a 10/10.
