November-December 2024 Progress in Guaranteed Safe AI

Published on January 22, 2025 1:20 AM GMT

Sorry for the radio silence last month. It was slow and I didn’t come across things I wanted to write about, which is to be expected with the holidays coming up.

There are no benefits to paying, except you get a cut of my hard-earned Shapley points, and apparently some disappointment when I miss a month.

If you're just joining us, background on GSAI here.

There should be some DeepSeek/r1 coverage in the next newsletter (I think tanking the cost of finetuning is, while dual use, good news for us in GSAI). I’d rather get this one out the door though.

FrontierMath (hiring)

As I’m always banging on about, progress in math automation is a leading indicator for progress in software engineering automation by the Curry-Howard correspondence. That’s a little cheeky, but last time I said it in meatspace I got solemn nods as if it wasn’t even cheeky at all. Or maybe they were solemnly nodding at something else. To be explicit, mathematical reasoning abilities and software engineering skills (including formal proof engineering) appear to be at similar capability levels. Furthermore, through platforms like Lean, mathematical ability can enhance software development capabilities. This connection should make it easier to implement rigorous quality assurance processes, such as obtaining formal proof certificates, which are typically costly.
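To make the “formal proof certificate” point concrete, here’s a toy Lean 4 sketch (my illustration, not anything from the FrontierMath paper; it assumes a recent toolchain where the omega tactic is available). The point is that the same checker that verifies the math can also verify properties of programs:

```lean
-- Toy example (illustrative only): a program and a machine-checked
-- certificate about its behavior live in the same system.
def double (n : Nat) : Nat := n + n

-- The "proof certificate": Lean won't accept the file unless this checks.
theorem double_eq_two_mul (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```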

From the FrontierMath abstract: “We introduce FrontierMath, a benchmark of hundreds of original, exceptionally challenging mathematics problems crafted and vetted by expert mathematicians. The questions cover most major branches of modern mathematics -- from computationally intensive problems in number theory and real analysis to abstract questions in algebraic geometry and category theory. Solving a typical problem requires multiple hours of effort from a researcher in the relevant branch of mathematics, and for the upper end questions, multiple days. FrontierMath uses new, unpublished problems and automated verification to reliably evaluate models while minimizing risk of data contamination. Current state-of-the-art AI models solve under 2% of problems, revealing a vast gap between AI capabilities and the prowess of the mathematical community. As AI systems advance toward expert-level mathematical abilities, FrontierMath offers a rigorous testbed that quantifies their progress.”

The story here of course is that by the time the newsletter covered it, OpenAI hit 25%. “Mathematicians assemble a benchmark they can barely solve themselves at SOTA 2%, forecasters think it’ll be unsaturated for multiple years, a month later OpenAI hits 25%” is a fun little anecdote for Christmas with the relatives, for a funny sense of fun. But, points out an astute reader, since we have typical OpenAI science communication (read: extremely disappointing science communication) here: what sort of inference compute was thrown at it? At what cost? A model that can win with the most naive prompt is more impressive than a model that only wins with galaxy-brained and/or expensive (in tokens) feedback schemes.

So anyways, the story here of course is that after I drafted the previous paragraph, people noticed that Epoch sneakily added their funding disclosure to a new draft of the paper a month after all the hoopla had started to die down. Spoiler alert, the funder was, lol, OpenAI. I’ll refer you to LessWrong user 7vik’s summary of events, and not get too far into it.

But I have feelings. What if the externality of mathematical automation and tooling, in the form of progress in guaranteed safe AI, doesn’t pan out? Then OpenAI gets its “put Terry Tao out of a job” Boy Scout badge for all the employee laptops and the safety community gets nothing in return! Come on guys, why is this even in Epoch’s mandate to begin with? Makes me sad.

See also

Review paper: “The suite of datasets commonly used to train and evaluate the mathematical capabilities of AI-based mathematical copilots (primarily large language models) exhibit several shortcomings”; Elliot’s r/math AMA; Kevin Buzzard on FrontierMath (written after the OpenAI results).

Opportunity

FrontierMath starts with three tiers of difficulty (25% undergrad or IMO level, 50% grad student, 25% early professor-track researcher), but they’re currently expanding to a fourth tier of even harder problems, and they also want some formalization in Lean; email elliot at epochai.org if you’d like to partake.

ARIA call: Mathematics for Safe AI (funding)

This round is affiliated with the Safeguarded AI programme, but is outside of the standard technical areas breakdown.

See also

Renaissance Philanthropy’s AI for Math fund (unclear how much this will help with GSAI efforts).

Structure-aware version control via observational bridge types (hiring)

David Jaz Myers writes for the Topos blog about structure-aware version control.

This is really exciting, because git diffs as we know them would lead to problems if you were scaling world models (in a davidad-style approach) or specifications (in any GSAI approach, including davidad’s) that many humans and AIs were collaborating on.

For a concrete example, suppose that you and I are editing a CSV file: you add a row to the bottom and I add a column to the end. Git would see your change as a single-line diff, whereas my change (adding a column) touches every line; these conflict on the new line you added. But from what we know about the structure of a CSV, your change should really be a one-row change, mine a one-column change, and the conflict should only occur in one cell.
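Here’s a minimal Python sketch of that intuition (my illustration, not the post’s machinery): if the two edits are represented structurally as “add a row” and “add a column” rather than as line diffs, they compose cleanly, and only the brand-new corner cell is genuinely ambiguous.

```python
# Minimal illustration: structured CSV edits merge cleanly except for the
# one new corner cell that neither edit specified.

base = [["a", "b"],
        ["c", "d"]]

# Your edit: append one row; my edit: append one column.
row_add = {"kind": "add_row", "row": ["e", "f"]}
col_add = {"kind": "add_col", "col": ["x", "y"]}

def merge(base, row_edit, col_edit, corner="?"):
    """Apply both structured edits; only the new corner cell is ambiguous
    (neither edit specified it), so that's the one real conflict."""
    merged = [r + [c] for r, c in zip(base, col_edit["col"])]
    merged.append(row_edit["row"] + [corner])
    return merged

print(merge(base, row_add, col_add))
# [['a', 'b', 'x'], ['c', 'd', 'y'], ['e', 'f', '?']]
```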

The author proposes something called observational bridge types (from the up-and-coming proof assistant Narya) to form the foundation of structure-aware version control. Using these, we can say that, for filetype F and files f1 f2 : F, a diffing algorithm is none other than an inhabitant of the type Diff F f1 f2 (where Diff : (A : Type) -> A -> A -> Type is created using something called logical relations, or an inductive definition depending on a type). Then, conflicts are defined as a pair (d1, d2) : Diff F f1 f2 × Diff F f1 f3, so that merges may be defined as another pair (d3, d4) : Diff F f2 f4 × Diff F f3 f4. That much is roughly consistent with git, provided that you assume F (the filetype) is always “list[list[char]]” or some other notion of raw text. It’d be great to not have to assume that, even just for the CSV example, to say nothing of more complicated structures in probabilistic semantics or complex/chaotic systems. A system that can also reason about diffs between the types themselves (since Diff Type sigma tau is perfectly reasonable here) would be desirable too.
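In Lean 4 pseudocode (a sketch of the shape of the interface as I read it, treating Diff as an abstract family rather than Narya’s actual observational-bridge-type construction), the setup looks roughly like this:

```lean
-- Sketch only: treat `Diff` as an abstract type family; in the post it is
-- constructed via observational bridge types / logical relations in Narya.
axiom Diff : (A : Type) → A → A → Type

-- Two edits of a common ancestor f1 (a potential conflict):
def Conflict {F : Type} (f1 f2 f3 : F) : Type :=
  Diff F f1 f2 × Diff F f1 f3

-- A merge: a common descendant f4 reached from both branches.
def Merge {F : Type} (f2 f3 f4 : F) : Type :=
  Diff F f2 f4 × Diff F f3 f4
```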

See also

More on this from Owen at Topos at last year’s Safe By Design meeting at FAR Labs. More on this from the GSAI Google group a while ago. Roundup of tree diff tools.

Opportunity

To work with David Jaz on this for the world modeling part of Safeguarded AI in Oxford, apply here.

Squiggle AI

In the davidad (and somewhat the Bengio) regions of the GSAI space, there’s a huge emphasis on world models: computational descriptions of what is. Squiggle is a programming language where the terms are distributions, and all the Monte Carlo boilerplate is hidden from the user. This makes Squiggle a candidate “standard” for belief specifications, as I’ve written about elsewhere.
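I won’t try to reproduce Squiggle syntax from memory, but here’s a rough Python analogue (my sketch; all names are made up) of the core idea: distributions are ordinary terms you compose with arithmetic, and the Monte Carlo plumbing stays hidden from the user.

```python
# Rough Python analogue of "terms are distributions" (not Squiggle syntax):
# the user composes distributions; sampling stays under the hood.
import math
import random

class Dist:
    def __init__(self, sampler):
        self._sample = sampler
    def __add__(self, other):
        return Dist(lambda: self._sample() + other._sample())
    def __mul__(self, other):
        return Dist(lambda: self._sample() * other._sample())
    def mean(self, n=10_000):
        return sum(self._sample() for _ in range(n)) / n

def normal(mu, sigma):
    return Dist(lambda: random.gauss(mu, sigma))

def lognormal_90ci(lo, hi):
    # crude stand-in for a "90% credible interval from lo to hi" constructor
    mu = (math.log(lo) + math.log(hi)) / 2
    sigma = (math.log(hi) - math.log(lo)) / (2 * 1.645)
    return Dist(lambda: math.exp(random.gauss(mu, sigma)))

# A belief about a quantity, composed without any explicit sampling loops.
cost = lognormal_90ci(2, 10) * normal(1.0, 0.1)
print(cost.mean())
```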

The first problem you’d run into is that the Users Will Not Just. The users will not just learn a programming language to specify beliefs when they barely want to specify their beliefs in the first place. That’s why it’s good news that QURI shipped an AI product, so the interface is closer to natural language while the substrate/backend is Squiggle. The LLM Will Just. The LLM will just learn to program in Squiggle, so you don’t have to.

It’s ready for you to use at squigglehub.org/ai (if authenticated), and you can read QURI’s press release on the EA Forum and elsewhere.



