Welcome to Import AI, a newsletter about AI research. Import AI runs on lattes, ramen, and feedback from readers. If you’d like to support this, please subscribe.
Import A-Idea
An occasional longer form essay series
Eschatological AI Policy Is Very Difficult
A lot of people that care about the increasing power of AI systems and go into policy do so for fundamentally eschatological reasons – they are convinced that at some point, if badly managed or designed, powerful AI systems could end the world. They think this in a literal sense – AI may lead to the gradual and eventually total disempowerment of humans, and potentially even the death of the whole species.
People with these views often don’t recognize how completely crazy they sound – and I think they also don’t manage to have empathy for the policymakers that they’re trying to talk to.
Imagine you are a senior policymaker in a major world economy – your day looks something like this:
There is a land war in Europe, you think while making yourself coffee.
The international trading system is going through a period of immense change and there could be serious price inflation which often bodes poorly for elected officials, you ponder while eating some granola.
The US and China seem to be on an inexorable collision course, you write down in your notepad, while getting the car to your place of work.
There are seventeen different groups trying to put together attacks that will harm the public, you say to yourself, reading some classified briefing.
“Something akin to god is coming in two years and if you don’t prioritize dealing with it right now, everyone dies,” says some relatively young person with a PhD and an earnest yet worried demeanor. “God is going to come out of a technology called artificial intelligence. Artificial intelligence is a technology that lots of us are developing, but we think we’re playing Russian Roulette at the scale of civilization, and we don’t know how many chambers there are in the gun or how many bullets are in it, and the gun is firing every few months due to something called scaling laws combined with market incentives. This technology has on the order of $100 billion a year dumped into its development and all the really important companies and infrastructure exist outside the easy control of government. You have to do something about this.”
The above is, I think, what it’s like being a policymaker in 2025 and dealing with AI on top of everything else. Where do you even start?
Even starting to deal with the problems of AI is expensive:
First you need to learn about the technology, which means either:
You need to take your staff that are themselves extremely busy and underwater and ask them to pick up another topic, or you need to tell them to drop something – your choices of stuff to drop might include ‘medical issues my constituents care about’ or ‘economic policy that influences jobs’, and so you actually can’t get them to drop stuff. So you add it to their pile.
You need to get smart about it, which means you need to further salami slice your weekly agenda so you can carve out a tiny bit of time for ‘learning about AI’.
For both of these choices, learning about AI usually requires you to speak to different people with expertise. Once you do this you quickly discover that:
a) Some people think all current AI technology is, essentially, bullshit, and urge you not to fall for hype.
b) Some people say AI technology is a really big deal and the government should avoid regulating it.
c) Some people say AI has a high likelihood of killing everyone on the planet.
d) All of these people think people with different views have incorrect priors.
Now you need to learn about the potential policy moves you can make. Some examples of these moves and their costs include:
Taking things away from people, like export controls which take certain computers away from certain countries. Doing this ‘fucks with the money’ of a very large industry and also adds to geopolitical tensions. Everyone will get very mad about anything you do here. The experts you’ve consulted in your earlier step will either think you didn’t go far enough, you went way too far, or the fact you’re doing anything at all is corrosive to democracy and the economy.
Giving the government a greater ability to understand the domain, like creating institutions like the AI Safety Institute or re-tasking people from existing government departments to focus on AI. Doing this takes a scarce resource (people in government) and re-allocates them, so you’re trading away from other priorities and people will get mad. Or you need to spend money to create net new capacity, in which case people view whatever you do with suspicion, and even getting the money requires some kind of political deal to assuage the feelings of the other many deserving groups who didn’t get the money.
Altering the behavior of the companies through sub-regulatory methods, for instance by securing voluntary commitments. To do this you need to spend a ton of energy to ensure you and your staff can learn more about the technology, then you need to negotiate commitments with companies. Negotiating with companies is like putting together a trade deal with a superintelligence – the companies will assign far more people than you and your staff to think about the commitments, and the companies have access to all the high quality information about the technology in question. If you succeed, people will accuse you of being captured by corporate interests.
Changing laws, for instance by passing regulations targeting AI development and deployment. This is an extremely costly action that requires you to cash in innumerable political chips in exchange for building a large coalition that can pass some legislative package. Corporate interests will typically fight you or, at best, partner with you, but in a way that tries to bend the rules to be as advantageous to them as possible. The whole time you are putting the law together, you and your political allies will come under attack for being either too weak in your approach or too strong in ways that might damage the economy. If you successfully change the laws, the consequences of your change will be held under an incredibly unsympathetic microscope for years afterwards, opening up a new vulnerability for you with regard to your political opponents.
Let us imagine that you make all of these policy moves. What happens then? Well, you’ve mostly succeeded by averting or delaying a catastrophe which most people had no knowledge of – and of the people who did know about it, only a minority believed it was going to happen. Your ‘reward’, insofar as you get one, is being known as a policymaker that ‘did something’, but whether the thing you did is good or not is very hard to know.
The best part? If you go back to the AI person that talked to you earlier and ask them to assess what you did, they’ll probably say some variation of: “Thank you, these are the minimum things that needed to be done to buy us time to work on the really hard problems. Since we last spoke the number of times the gun has fired has increased, and the number of bullets in the chamber has grown.”
What did I do, then? You ask.
“You put more chambers in the gun, so you bought us more time,” they say. “Now let’s get to work”.
I write all of the above not as an excuse for the actions of policymakers, nor as a criticism of people in the AI policy community who believe in the possibility of superintelligence, but rather to illustrate the immense difficulty of working on AI policy when you truly believe that the technology may have the ability to end the world. Most of the policy moves that people make – if they make them – are going to seem wildly unsatisfying relative to the scale of the problem. Meanwhile, the people who make these moves are likely going to be juggling them against a million other priorities and are going to be looking to the AI experts for some level of confidence and validation – neither of which is easily given.
Good luck to us all.
Tencent makes a helpful math dataset:
…103k curated problems for testing out AI systems…
Tencent and Shanghai Jiao Tong University researchers have released DeepMath, a large-scale math dataset for training AI systems. DeepMath-103K consists of “103k mathematical problems specifically designed to train advanced reasoning models via RL”. Every problem within the dataset includes a verifiable final answer and is accompanied by three distinct solutions, each generated by DeepSeek R1. Subjects covered by the dataset include Algebra, Calculus, Number Theory, Geometry, Probability, and Discrete Mathematics.
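If you want to poke at the data yourself, here’s a minimal sketch of loading and inspecting it with the Hugging Face `datasets` library – the repo id and field names are my assumptions based on the paper’s description, so check the DeepMath repo for the actual schema:

```python
# Minimal sketch: load and inspect DeepMath-103K. The repo id and field names
# below are assumptions based on the paper's description (a problem, a
# verifiable final answer, and three DeepSeek-R1 solutions per example).
from datasets import load_dataset

ds = load_dataset("zwhe99/DeepMath-103K", split="train")  # assumed repo id

example = ds[0]
print(example["question"])      # the math problem (assumed field name)
print(example["final_answer"])  # verifiable final answer (assumed field name)
for key in ("r1_solution_1", "r1_solution_2", "r1_solution_3"):  # assumed field names
    if key in example:
        print(key, "->", len(example[key]), "characters")
```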
Fuel for reasoning: In tests, the researchers show that training on DeepMath improves performance on other math benchmarks – this is unsurprising and is a basic validation of the benchmark. More interestingly, they show that “training on DeepMath-103K often encourages models to generate substantially longer and more detailed reasoning steps, particularly on highly complex benchmarks”, and they also show that models trained on DeepMath tend to spend more time solving problems using helpful mental shortcuts like creating subgoals, verifying things, backtracking, and so on.
In other words, aside from imparting skill in math, DeepMath seems to impart some robustly good ‘mathematical thinking’ approaches into LLMs trained on it.
Read more: DeepMath-103K: A Large-Scale, Challenging, Decontaminated, and Verifiable Mathematical Dataset for Advancing Reasoning (arXiv).
Get the dataset here: DeepMath (zwhe99, GitHub).
IEA projects datacenter power demand will more than double by 2030:
…New report gives us a sense of the AI revolution, but may be too conservative…
The International Energy Agency has published a lengthy report on the relationship between energy and AI. The esteemed energy analysis body projects that “in the Base Case, the total installed capacity of data centres more than doubles from around 100 GW today to around 225 GW in 2030”, with AI driving a significant amount of this.
Where we are and where we’re going: In recent years, data center power growth accelerated, driven by AI as well as social media, online streaming, and other popular digital services. “Data centre electricity consumption growth accelerated from 3% annually from 2005 to 2015 to 10% annually from 2015 to 2024,” the IEA writes.
Within that, the US and China have become the world’s first- and second-largest consumers of electricity for datacenters.
In the USA, “data centres accounted for around 180 TWh of electricity consumption in 2024 in the United States, nearly 45% of the global total and more than 4% of US electricity consumption from all sources”.
In China, “as of today, data centres account for approximately 100 TWh of electricity consumption, roughly equivalent to that of electric vehicles in China. The country accounts for around 25% of global data centre electricity consumption, up from less than 20% a decade ago”.
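As a quick back-of-envelope sanity check, the US and China figures quoted above imply roughly the same global total, which is reassuring about the internal consistency of the numbers:

```python
# Back-of-envelope check on the IEA figures quoted above: the US and China
# numbers each imply roughly the same global data-centre electricity total.
us_twh, us_share = 180, 0.45   # US: ~180 TWh, ~45% of the global total
cn_twh, cn_share = 100, 0.25   # China: ~100 TWh, ~25% of the global total

print(us_twh / us_share)  # -> 400.0 TWh implied global total (from US figures)
print(cn_twh / cn_share)  # -> 400.0 TWh implied global total (from China figures)
```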
The IEA might be too conservative: For what it’s worth, I expect the IEA is too conservative here – Anthropic said in its OSTP RFI submission that it believes the United States alone will need to build on the order of 50GW of net new power by 2027 to support frontier training runs by US companies.
Rhymes with other analysis, but not precisely: A US-focused study from Berkeley projected US data center use to grow from roughly ~40 GW / 176 TWh in 2023 to between ~74 GW / 325 TWh and ~132 GW / 580 TWh by 2028. These numbers are significantly larger and more in line with the Anthropic projections in terms of bullishness (Import AI #395).
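A rough way to compare how bullish these projections are is to look at the implied annual growth rates in installed capacity. The sketch below uses only the figures quoted above; the year ranges are my assumptions about what ‘today’ and the projection windows mean:

```python
# Implied compound annual growth rates (CAGR) for installed data-centre
# capacity, using only the figures quoted above (year ranges assumed).
def cagr(start_gw, end_gw, years):
    """Compound annual growth rate between two capacity figures."""
    return (end_gw / start_gw) ** (1 / years) - 1

print(f"IEA base case:      {cagr(100, 225, 6):.1%}")  # ~100 GW (2024) -> 225 GW (2030)
print(f"Berkeley, low end:  {cagr(40, 74, 5):.1%}")    # ~40 GW (2023) -> 74 GW (2028)
print(f"Berkeley, high end: {cagr(40, 132, 5):.1%}")   # ~40 GW (2023) -> 132 GW (2028)
```

On this reading, the Berkeley high-end scenario implies an annual growth rate nearly double that of the IEA base case.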
Why this matters – the world is preparing for the singularity: If you zoom out, it looks a lot like the world’s capital markets and major companies are collectively betting that it’s going to get more and more lucrative to turn electricity into computational power which gets turned into dollars – and it seems like AI is one of the primary drivers of growth here. Viewed through this lens, the world is preparing the necessary infrastructure for the arrival of a superintelligence.
Download the report here: Energy and AI (IEA website).
Distributed AI experts Nous get $50 million funding:
…The market has started valuing distributed AI, which means the technology will be developed more rapidly…
Crypto investor Paradigm has led a $50m Series A round in Nous, a startup which has pioneered distributed training of AI systems. As longtime Import AI readers know, Nous – along with Prime Intellect – is one of the serious players in distributed AI, having trained a ~15bn parameter model in December (Import AI #393) using an algorithm it developed called Distributed Training Over-the-Internet (aka DisTrO: Import AI #384), and having teamed up with Anthropic researcher (in a personal capacity) Durk Kingma to develop a technology called Decoupled Momentum (DeMo) for even better distributed training (Import AI #395).
Why this matters – markets are beginning to value distributed AI: I’ve been following distributed AI for a while and most of its enthusiastic developers and users have been hobbyists or startups with relatively small amounts of funding. The arrival of a $50m Series A could be a sign that the VC community is about to start shoveling money into startups using this technology, which would further speed up its adoption and maturation and increase the chance that AI systems trained in distributed ways could attain the computational scale necessary to take on proprietary models.
Read more: Crypto VC giant Paradigm makes $50 million bet on decentralized AI startup Nous Research at $1 billion token valuation (Fortune, via Yahoo! Finance).
The Virology Capabilities Test tells us there’s probably a scaling law for bioweapon design:
…Today’s AI systems are better than expert virologists at potentially dangerous things…
Researchers with SecureBio, the Federal University of ABC, the Center for AI Safety, and the MIT Media Lab, have built the Virology Capabilities Test (VCT), 322 multimodal questions for AI systems “covering fundamental, tacit, and visual knowledge that is essential for practical work in virology laboratories”.
VCT has been designed as a way to test how well today’s AI systems understand things that could let them be weaponized for dangerous purposes. Examples of the kinds of things VCT tests for include: isolating virus particles from a liquid medium, the detailed steps in a TCID50 protocol, successfully infecting a ferret with a test strain, and troubleshooting low viral yields from a given protocol.
Frontier models are better than expert human virologists: “Expert virologists with access to the internet score an average of 22.1% on questions specifically in their sub-areas of expertise. However, the most performant LLM, OpenAI’s o3, reaches 43.8% accuracy, outperforming 94% of expert virologists even within their sub-areas of specialization,” the researchers write.
How the questions were built: Given those scores, how concerned should we be? A close read of the paper gives me a sense the answer is “we should be sweating nervously” – to build the questions, the researchers gathered them from 57 contributors, all of whom had either obtained or were in the process of obtaining a PhD in virology, and the contributors had an average of 5 years and 10 months of virology experience. Additionally, when building the dataset, they tested how hard each question was by seeing whether experts could answer it with access to Google – if more than two thirds of the experts answered a question correctly, it got tossed out. In other words, the questions in VCT are curated by experts and have been pressure tested by other experts for hardness.
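In code, the difficulty filter amounts to something like the sketch below – the function name and data shape are hypothetical, but the two-thirds rule follows the paper’s description:

```python
# Sketch of the VCT difficulty filter: keep a candidate question only if no
# more than two thirds of expert reviewers (with Google access) answered it
# correctly. Function name and data shape are hypothetical.
def keep_question(expert_answers_correct: list[bool]) -> bool:
    """Return True if the question is hard enough to keep in the benchmark."""
    fraction_correct = sum(expert_answers_correct) / len(expert_answers_correct)
    return fraction_correct <= 2 / 3

print(keep_question([True, True, True]))    # all experts solved it -> False (tossed)
print(keep_question([True, False, False]))  # one of three solved it -> True (kept)
```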
The problem with dual use evals is that they’re hard to share: The authors note that “the shared dataset will not be released publicly to reduce the risk of leakage into training corpora, but the benchmark will be directly conveyed to any organizations and researchers with a track record of work on AI safety”. While the training dataset contamination issue makes sense, I suspect the larger reason the authors haven’t shared it is that it contains net new information about potentially dangerous virology.
Why this matters – everything machines for dual-use: AI systems are good at a broad range of things, including scary things. Tests like VCT give us signals on the scary part. “The scientific capabilities of frontier models will doubtless accelerate beneficial research in the life sciences, but the demonstrated ability to match or exceed expert performance in troubleshooting dual-use virology lab work warrants careful consideration,” the authors write. “We believe that an expert-level AI virologist chatbot—which is constrained to giving advice via text-based interactions—poses less risk than an autonomous AI virologist agent capable of independently performing tasks, though both warrant careful controls”.
Read more: Virology Capabilities Test (VCT): A Multimodal Virology Q&A Benchmark (PDF).
Automating and industrializing robot AI research with AutoEval:
…Berkeley researchers try to make running experiments on hardware as easy as doing things purely in software…
UC Berkeley researchers have built AutoEval, technology to make automating the running of experiments on robots as easy as automating software pipelines.
“AutoEval consists of three key modules: (1) a success classifier, that evaluates a policy’s success on a given task, (2) a reset policy, that resets the scene back to a state from the initial state distribution upon completion of a trial, and (3) programmatic safety measures and fault detections that prevent robot damage and call for human intervention when necessary,” they write.
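Here is a minimal sketch of what an evaluation cell built around those three modules looks like as a loop – all class and function names below are hypothetical stand-ins rather than the actual AutoEval API:

```python
# Sketch of an AutoEval-style evaluation cell. `env`, `policy`,
# `success_classifier`, `reset_policy`, and `safety_check` are hypothetical
# stand-ins for the real system's components.
def evaluate_policy(policy, env, success_classifier, reset_policy,
                    safety_check, num_trials=50, max_steps=200):
    """Run num_trials real-world trials and return the policy's success rate."""
    successes = 0
    obs = env.get_observation()
    for _ in range(num_trials):
        # Roll out the policy under test.
        for _ in range(max_steps):
            if not safety_check(obs):             # (3) programmatic safety measures:
                env.request_human_intervention()  #     pause and call for a human
                break
            obs = env.step(policy(obs))
        successes += int(success_classifier(obs))  # (1) learned success classifier
        # (2) reset policy: return the scene to the initial state distribution
        #     so the next trial starts from a valid configuration.
        for _ in range(max_steps):
            obs = env.step(reset_policy(obs))
    return successes / num_trials
```

The point of structuring things this way is that the human only appears inside the safety branch, which is what lets a single operator oversee an evaluation cell for an entire day.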
It works well: In tests, results from experiments conducted by AutoEval closely match those from experiments supervised by humans. Additionally, the software is robust over long timespans – during a 24-hour period a human only needed to step in three times, dramatically cutting the amount of human supervision needed for running experiments.
“Even though AutoEval has a slightly lower throughput, AutoEval runs autonomously and only required a total of three human interventions in the span of 24 hours to reset the scene or robot,” they write. “Every time a human operator needed to intervene, they simply needed to check and reset the objects’ position in the scene, and potentially move the robot arm into reset position if a motor failed and the robot fell on the table”.
Why this matters – the industrialization of robot research: A few years ago Google made headlines by running a so-called robotic ‘arm farm’ (2017, Import AI #51) where it had tens of different robots working in parallel to learn how to manipulate arbitrary objects. Technologies like AutoEval seem like the kind of thing that Google might itself have built to help it run the arm farm. But unlike the proprietary code still nestled somewhere in Mountain View, AutoEval is available as open source software, robotic arms themselves have got way cheaper, and the algorithms to get robots to perform tasks have got far better than they were a few years ago.
Put it all together and AutoEval seems like one of the technologies we’ll use to industrialize and scale up research on robots. “We hope that this work will inspire more AutoEval evaluation cells to be set up across institutions to form a diverse automated evaluation framework, which will significantly speed up robot learning research,” the researchers write.
Read more: AutoEval: Autonomous Evaluation of Generalist Robot Manipulation Policies in the Real World (arXiv).
Get the software here: AutoEval (GitHub).
Tech Tales:
The Cockroach Killers of the Cyber Domain
[As told to GQ, 2028]
When you’re a bug catcher you get invited into a house and the owner says it’s full of bugs and you need to get rid of them, but you cannot damage the house itself. This means you need to figure out a way to seal the house and fumigate it, while also figuring out the places where the insects nested and getting rid of the nests and any associated damage. Your goal is to cleanse the house, then make sure the house cannot get re-taken by the bugs.
These days, people working in AI have to do a similar thing – someone will discover that their company has an AI problem, in the sense it has a few small-scale AI agents which are causing some kind of low-rent trouble.
That’s when you call us: we get access to your infrastructure and we instrument it so we can isolate benign activity from the agent activities. We seal the ingress and egress points of your network and in extreme cases we might work with your hyperscaler partner to physically isolate your hardware from anything else. Then we crawl through your system and try to find the agents – this is harder than it sounds because the agents are constantly shape-shifting, changing their file names, moving around the network, sometimes slowly making copies of themselves in other parts of your infrastructure, and so on.
Once we’re sure we’ve cleaned everything we also attempt to seal the holes that let the agents creep in. Sometimes these are basic network security issues, but sometimes it’s more subtle – maybe your company had an AI system which could spit out custom agents and maybe you let it have too big a context window and access to too many tools when making its agents, so some other larger malignant thing outside your company compromised it and, presto, it started producing the bugs.
Things that inspired this story: A friend of mine whose job was termite removal and the stories thereof; how we should expect some small agents to become akin to crappy digital malware, not so dangerous we will need to take extreme actions but sufficiently annoying you’ll want to remove them; blue collar IT jobs during the superintelligence uplift.
Thanks for reading!