Artificial Ignorance - June 11, 22:44
The State of AI Engineering (2025)

The author summarizes key takeaways from the AI Engineer World's Fair in San Francisco, covering evals, agents, RAG, fine-tuning, custom benchmarks, voice, GPT wrappers, the Model Context Protocol (MCP), and the evolution from coding agents to containerized fleets. The piece stresses the central role of evals in building AI applications, the mainstreaming of agents, the rapid evolution of RAG, and the quiet decline of fine-tuning. It also covers the value of custom benchmarks, voice as multimodality's killer app, the durable value of GPT wrappers, the standardization trend around MCP, and a shift in software development cost models. The author also introduces "context engineering": being deliberate about everything that goes into a model's context.

💡 The eval challenge: the building blocks of generative AI have matured, but rigorously evaluating AI outputs remains very hard, especially for subjective, nuanced outputs. Evals play a central role in building AI applications, and defining and building evaluation systems is genuinely difficult.

🤖 The rise of agents: agents were everywhere at the conference, a sign that they have gone from "the new thing" to an accepted way of building. The key question is no longer whether to use agents, but how to implement them.

📚 The evolution of RAG: retrieval-augmented generation has grown remarkably sophisticated, moving from basic document chunking and embedding to hybrid search, graph databases, and recommendation systems, a fast-moving ecosystem.

🎤 Voice as multimodality's killer app: voice has become the primary multimodal interface and the focus of investment and engineering. The race is to build conversational AI with truly human-like speed and nuance.

🔨 The durable value of GPT wrappers: despite early skepticism, applications built on top of foundation-model APIs ("GPT wrappers") have shown real staying power, especially in AI coding, delivering compelling experiences without building new models from scratch.

It's been less than a week since I wrapped up my third AI Engineer conference in San Francisco (now known as the AI Engineer World's Fair). While I missed the agent-specific summit in New York earlier this year, the SF version proved to be remarkably multifaceted, both in terms of tracks and content.

The conference returned to the Marriott Marquis in downtown San Francisco, this time overlapping with Snowflake Summit (with a whopping 20,000 attendees). This meant navigating crowds of attendees with slightly different-colored badges and being thankful to have avoided insane last-minute hotel prices.

Of all the AI conferences I've attended, the AI Engineer World's Fair continues to deliver some of the best content¹. Unlike other AI events where I sometimes suspect I know more about LLMs than the speakers on stage, that's rarely the case here. And my one complaint from last year (how commercial the event felt) was very much addressed: this was an event by engineers, for engineers (and researchers - I was able to see fellow Substack authors in person).

Speaking of content, there was quite a lot to go around this year. The event featured eighteen tracks, up from nine last year.

No one person could consume them all - I caught at least one talk from only about half the tracks, so my recap is inevitably biased here. But even with my limited perspective, I walked away with a dozen new big ideas, and many more tips and tactics for building with generative AI.

Here are my most compelling insights from the conference - many of these could easily be expanded into a standalone deep dive, so leave a comment if you want to read more.


Evals: The Persistent Challenge

One theme emerged repeatedly: not only are evals crucial when building AI-enabled applications, but writing good evals remains very difficult. As we've mastered the building blocks of generative AI - structured outputs, basic RAG configurations, agentic loops - we're hitting a wall. For the nuanced, subjective outputs that make AI uniquely valuable, how do we define "good"? And how do we build rigorous systems to evaluate quality?
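
One increasingly common answer for grading subjective outputs is to have a second model score them against a rubric ("LLM as judge"). Here's a minimal sketch of that shape - the rubric, function names, and question/answer pairs are all illustrative, and the judge is stubbed out with a crude heuristic so the harness runs without an API call:

```python
# Sketch of "LLM as judge" grading for subjective outputs.
# In practice, judge() would be a model API call with RUBRIC + the pair;
# here it's a stand-in heuristic so the harness runs offline.

RUBRIC = "Score the answer 1-5 for accuracy and tone. Reply with the digit only."

def judge(question: str, answer: str) -> int:
    # Stub: approximate "on-topic answers score higher" by checking whether
    # the question's last keyword shows up in the answer.
    if not answer:
        return 1
    keyword = question.split()[-1].rstrip("?").lower()
    return 5 if keyword in answer.lower() else 2

def grade_batch(pairs: list[tuple[str, str]]) -> float:
    """Mean rubric score across (question, answer) pairs."""
    scores = [judge(q, a) for q, a in pairs]
    return sum(scores) / len(scores)

pairs = [
    ("What is the capital of France?", "The capital of France is Paris."),
    ("Can you explain recursion?", "It's when a function calls itself: recursion."),
]
print(grade_batch(pairs))
```

The hard part the conference kept circling back to isn't this plumbing - it's writing a rubric that actually captures "good" for your domain.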

This isn't new - I noted the importance of evals last year - but it's striking that this is perhaps the hardest problem in AI engineering.

The Age of Agents: From Skepticism to Assumption

Agents were so ubiquitous at the conference that I almost forgot to mention them - which speaks volumes. The assumption across most talks was that you were building for an agentic implementation, or at least wanted the option to upgrade your chat experience into one. As someone who, only a few months ago, was skeptical about the wave of agent hype, I've since come around on what agents are becoming capable of.

This represents a dramatic shift from last year, when agent infrastructure was "the new hotness" and we worried about reliability and endless loops. Now, agentic workflows aren't a question of "if" but "how." The entire ecosystem has reoriented around this assumption.

The Evolution of RAG

As someone who doesn't work with complex RAG systems daily, I was surprised by how sophisticated the landscape has become. Basic RAG implementations - chunking documents, embedding them, storing in vector databases - now seem quaint. The fact that RAG has now split into three separate conference tracks (Search and Retrieval, GraphRAG, and Recommendation Systems) speaks to the rapid evolution: hybrid search, graph databases, re-rankers, and various recommendation techniques have created a complex ecosystem that's advancing at breakneck speed.
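
To make "hybrid search" a little more concrete, here's a sketch of reciprocal rank fusion (RRF), one common way to merge a keyword (e.g. BM25) ranking with a vector-similarity ranking. The document IDs are made up; k=60 is a commonly used default:

```python
# Reciprocal rank fusion: each result list contributes 1/(k + rank) to a
# document's score, so documents ranked well by *both* retrievers win.

def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse multiple ranked lists of document IDs into one ranking."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

keyword_hits = ["doc_a", "doc_b", "doc_c"]   # e.g. from BM25
vector_hits  = ["doc_b", "doc_d", "doc_a"]   # e.g. from embedding search
print(rrf([keyword_hits, vector_hits]))
# doc_b appears near the top of both lists, so it fuses to rank 1.
```

Re-rankers and graph traversal then layer on top of a fused candidate list like this one.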

The Quiet Decline of Fine-Tuning

Fine-tuning barely came up this year, in contrast to previous conferences. Several theories might explain this shift.

Build Your Own Benchmarks (BYOB)

In just a few short months, many leading benchmarks for coding and math have become saturated. While we're inventing new ones nearly every week, it's becoming clear that public STEM benchmarks are becoming less and less useful to measure frontier models' capabilities. The solution: create your own benchmarks for tasks that you care about.

There are (at least) two reasons to do this. First, it's unlikely that your pet benchmark will become public (unless you're Simon Willison), meaning it won't become a target to be gamed. Second, as models improve, you can get a much more realistic sense of how well each one does on the specific domains that you care about. Sure, o3-pro is "smarter," but unless you're regularly working on PhD-level problems, how much will you notice?
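
A personal benchmark doesn't need to be elaborate. Here's a minimal sketch of the idea - the task format, the string-match grader, and the stub "models" are all hypothetical; the point is that the task list lives only with you:

```python
# A tiny private benchmark: tasks you care about, graded by simple checks.
# Swap the stub lambdas for real model API calls when comparing versions.

TASKS = [
    {"prompt": "Summarize: cats are mammals that purr.",
     "must_include": ["cats", "mammals"]},
    {"prompt": "What is 17 * 3?", "must_include": ["51"]},
]

def grade(output: str, must_include: list[str]) -> bool:
    """Pass only if every required term appears in the output."""
    return all(term.lower() in output.lower() for term in must_include)

def score(model, tasks=TASKS) -> float:
    """Fraction of tasks the model passes."""
    return sum(grade(model(t["prompt"]), t["must_include"]) for t in tasks) / len(tasks)

# Stub "models" so the harness runs offline:
always_right = lambda p: "Cats are mammals. 17 * 3 = 51."
always_wrong = lambda p: "No idea."

print(score(always_right), score(always_wrong))
```

Re-run the same task file against each new model release and you get a trend line for your domain, not for a leaderboard's.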

Voice: Multimodality’s Killer App

The business case for multimodality has crystallized around a single medium: voice. In past years, we marveled at LLMs' ability to handle various inputs and outputs - audio, images, and video. Now, most investment and engineering effort seems to have converged on audio as the primary interface, with teams obsessing over latency and conversational intelligence.

While image generation and video synthesis are impressive, voice represents the most natural human interface. The race is on to create truly conversational AI that responds with human-like speed and nuance. Ironically, perhaps ChatGPT's most popular feature is a different kind of multimodality: the Studio Ghibli-style image generation that swept across the internet.


GPT Wrappers Get The Last Laugh

In ChatGPT's immediate aftermath, we saw a wave of startups derided as "GPT wrappers" - applications built on OpenAI's APIs without a strong AI moat. The conventional wisdom was that they'd become irrelevant as foundation models got better.

Three years later, we're finding that compelling experiences can, in fact, be built without inventing new models from scratch. Exhibit A: the multi-billion-dollar AI coding industry, led by Cursor, Windsurf, and Github Copilot. While some applications are now training custom models, a long tail of applications for law, finance, healthcare, and more will likely create multi-billion-dollar markets without needing more than a well-crafted GPT wrapper.

It’s MCP’s Race to Lose

The conference featured numerous talks on Model Context Protocol (MCP) - its use cases, limitations, and future. Two things stood out: the remarkable adoption already occurring, and active work on the standard's current limitations (like authentication and discovery). MCP appears to be winning the race to define a standardized agent protocol, and it's Anthropic's game to lose.

For a deeper dive on MCP, check out my earlier post.

The Hidden Cost of "Yappiness"

Reasoning models have introduced a new dimension to cost calculations. Previously, we only had to compare intelligence scores (benchmarks) with cost per token to determine whether running the latest models was worthwhile. However, reasoning models complicate this equation in two ways: they charge for "thinking" tokens generated before the final answer and have varying levels of "yappiness" - how many tokens they generate before reaching that answer.

This makes it significantly harder to predict spending on reasoning models. While model providers try to offset this by setting maximum "thinking" budgets, it's still a challenge for cost-conscious applications needing predictable pricing.
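
To make the "yappiness" problem concrete, here's a back-of-envelope cost model. All prices and token counts are hypothetical, and it assumes (as providers generally do today) that thinking tokens bill at the output rate:

```python
# Toy cost model for a reasoning model request. Prices are $ per 1M tokens
# and are invented for illustration - check your provider's actual rates.

def request_cost(input_tokens: int, thinking_tokens: int, answer_tokens: int,
                 in_price: float, out_price: float) -> float:
    """Thinking tokens are billed as output, even though the user never sees them."""
    return (input_tokens * in_price
            + (thinking_tokens + answer_tokens) * out_price) / 1_000_000

# Same question and same-length answer, two hypothetical levels of yappiness:
terse = request_cost(1_000, 500, 300, in_price=2.0, out_price=8.0)
yappy = request_cost(1_000, 8_000, 300, in_price=2.0, out_price=8.0)
print(f"terse: ${terse:.4f}  yappy: ${yappy:.4f}")
```

With these made-up numbers, the yappy run costs roughly eight times the terse one for an identical visible answer - which is exactly why per-token price lists no longer tell you what a model will cost.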

From Coding Agents to Containerized Fleets

AI coding agents have burst onto the scene, but we're already looking beyond them. Windsurf and Cursor agents have been around for less than six months, yet it's clear the future of AI coding lies in fleets of containerized agents working on codebases independently. As much as this may sound like science fiction, it's on our doorstep: folks are already publishing (if not launching) technology to run dozens of coding agents simultaneously.

The challenge, of course, is coordination - ensuring agents don't step on each other's toes, and designing your architecture so that extensions and modifications come easily. This evolution leads to an even more fundamental shift in how we think about software development costs, namely:

The Software OpEx Paradigm Shift

Historically, the biggest cost in developing software was CapEx - deciding upfront how many engineers to hire and budgeting for their salaries. Once built, most software had marginal operating costs.

Generative AI is messing with this model. Software development, while still a CapEx expense, might no longer be measured in massive tranches of engineer salaries. We can now turn a dial on how much software we want and how fast we want it made. Businesses will grapple with a new question: If you have a budget for five engineers, is it better to hire five humans or give one engineer enough AI budget to equal the output of the other four?

Everything Is Context Engineering

Prompt engineering has evolved dramatically since GPT-3, but I've started considering it as a subset of something broader: context engineering. The idea is to be deliberate about everything given to a model's context - not just prompts, but things like RAG outputs (i.e., all the work done to ensure relevant results), tool calls, error messages, everything.

And in thinking about prompting this way, two conclusions follow: 1) you should care deeply about what goes into your context, so as not to pollute it, and 2) the way we prompt LLMs likely needs to change dramatically. Vibe coding prompts should likely become closer to rigid specs than casual text messages. Agents should be able to clear out error messages when they hit a dead end, to avoid getting stuck in a loop and retracing their steps. Solid strategies for maintaining clean, relevant context are sorely needed.
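
One way to picture context engineering is as a budgeted assembly step that prioritizes what matters and drops what's stale. This sketch is purely illustrative - the item kinds, priority order, and token counts are made up:

```python
# Sketch of deliberate context assembly: fill a token budget with the
# highest-priority, still-relevant items, and drop resolved error logs.

def build_context(items: list[dict], budget_tokens: int) -> list[dict]:
    """items: {"kind": str, "text": str, "tokens": int, "stale": bool}.
    Keep high-priority, non-stale items until the token budget runs out."""
    priority = {"spec": 0, "tool_result": 1, "error": 2, "chat": 3}
    kept, used = [], 0
    for item in sorted(items, key=lambda i: priority.get(i["kind"], 9)):
        if item.get("stale"):                  # e.g. an error already fixed
            continue
        if used + item["tokens"] > budget_tokens:
            continue
        kept.append(item)
        used += item["tokens"]
    return kept

items = [
    {"kind": "spec",  "text": "Build a CSV parser...", "tokens": 400, "stale": False},
    {"kind": "error", "text": "TypeError (fixed)",     "tokens": 900, "stale": True},
    {"kind": "error", "text": "Current failing test",  "tokens": 300, "stale": False},
    {"kind": "chat",  "text": "earlier small talk",    "tokens": 600, "stale": False},
]
print([i["kind"] for i in build_context(items, budget_tokens=1000)])
```

The spec and the live error survive; the fixed error and the small talk don't - which is the whole idea of keeping a context clean rather than merely full.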

The Ramanujan Question

One of the most profound questions in AI research is whether models can generate genuinely new, correct conclusions that aren't in their training data. The human benchmark for this might be Srinivasa Ramanujan, the Indian mathematician who made significant contributions to advanced mathematics despite having no formal training.

All Ramanujan had was two college students who lived in his home when he was eleven, and at sixteen, a copy of Synopsis of Pure Mathematics - a collection of 5,000 theorems. From this foundation, he derived thousands of identities and equations, his work cut short only by his early death from dysentery complications.

The trillion-dollar question for AI is: What made Ramanujan so capable, and can we build a neural architecture capable of the same kind of creative mathematical discovery?


Looking Ahead: From Infrastructure to Philosophy

Comparing this year's conference to 2024 reveals how dramatically the field has transformed. Last year, we worried about making AI work - deployment, tooling, getting from prototype to production. The conference felt commercial, even corporate, as the industry rushed to build infrastructure for the AI gold rush.

This year, the questions have become existential. We've moved from "how do we deploy this?" to "what does this mean for software itself?" The Ramanujan question - whether AI can discover genuinely new knowledge - would have seemed out of place amid 2024's focus on vector databases and monitoring tools.

Some challenges persist, but have evolved. Evals remain "the hardest problem," but we've graduated from asking "how do we test?" to "how do we define good for nuanced, subjective outputs?" The agent infrastructure that was "the new hotness" in 2024 has become so foundational that it's assumed that every application is either agentic or preparing to be. In just twelve months, we've gone from worrying about agents getting stuck in loops to coordinating fleets of them.

As we enter an era where software development might be measured in compute budgets rather than headcount, where agent fleets tackle problems no single engineer could handle, and where AI might genuinely discover new knowledge, one thing is clear: the boundaries of AI engineering are expanding faster than any of us can fully grasp.

What started as a niche between ML and software engineering has exploded into a constellation of specialties. Voice engineers, AI PMs, eval designers, AI architects - job titles that didn't exist last year are now entire career paths. The conference's growth from 2,000 to 3,000 attendees understates the real expansion: the surface area of interesting problems has grown exponentially.

Walking the conference halls, I felt a familiar mix of excitement and vertigo. How can anyone keep up when every track represents months of learning? But maybe that's the point. We're no longer in an era where one person can master "AI engineering." We're in an era of specialists, teams, and most importantly, incredible possibility. The field isn't just growing; it's exploding outward - eating everything, everywhere, all at once.


¹ It also has some of the most baffling logistics. I chalk this up to the organizers being engineers, but still - Swyx, if you're reading this: I know how insanely difficult event organizing is, and I'm guessing a lot of the logistical hiccups are outside of your control - but buttoning up the details here would truly make the event a 10/10.
