LessWrong · May 4, 07:42
"Superhuman" Isn't Well Specified


This article examines AI performance across different tasks, looking beyond a simple "superhuman" win/loss framing to the dimensions along which AI capabilities differ. In some domains, such as chess, AI has surpassed humans; in others, such as medical image diagnosis and persuasion, performance is more complicated, involving cost, effort, and task-specific dimensions. The author argues that simply labeling AI capabilities "superhuman" is not enough: we should ask more specific questions and evaluate actual performance carefully.

🧠 Measuring AI capability should go beyond "winning or losing": in domains like chess, AI has surpassed humans, but this "superhuman" capability is not universal. In more complex tasks, such as medical image diagnosis, AI performance depends on multiple factors.

💰 Cost is an important dimension of AI capability: using the ARC-AGI benchmark as an example, the article shows how much solving a problem can cost an AI. OpenAI's o3 model performs impressively, but its cost per problem far exceeds a human's, underscoring the importance of cost when evaluating AI capability.

⚖️ Task-specific dimensions shape AI performance: different tasks have different dimensions. Persuasion, for example, involves how large a change of view is achieved, how firmly the view was held, and whether arguments can be tailored to a specific individual. These dimensions make AI performance vary widely across tasks.

🤔 The limits of the "superhuman" label: the article criticizes the habit of simply calling AI capabilities "superhuman," arguing that this ignores their complexity and multidimensionality. The author suggests asking more specific questions and carefully evaluating AI's actual performance.

Published on May 3, 2025 11:42 PM GMT

Strength

In 1997, with Deep Blue’s defeat of Kasparov, computers surpassed human beings at chess. Other games have fallen in more recent years: Go, Starcraft, and League of Legends among them. AI is superhuman at these pursuits, and unassisted human beings will never catch up. The situation looks like this:[1]

At chess, AI is much better than the very best humans

The average serious chess player is pretty good (around 1500 Elo), the very best human is extremely good (2837), and the best engines are far beyond that (around 3700). Even Deep Blue's estimated Elo of about 2850 would still be competitive with the best humans alive.
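To see what those rating gaps mean in practice, here is a minimal sketch using the standard Elo expected-score formula. The ratings are the ones quoted above; the formula itself is the usual logistic model, not something from this post.

```python
def elo_expected_score(r_a: float, r_b: float) -> float:
    """Expected score (roughly, win probability ignoring draws) for
    player A under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

# A ~3700-rated engine against the 2837-rated best human:
p_engine = elo_expected_score(3700, 2837)  # ≈ 0.993

# The 2837-rated best human against a 1500-rated serious player:
p_human = elo_expected_score(2837, 1500)   # ≈ 0.9996
```

On this model an 863-point gap makes the engine about as dominant over the world's best human as that human is over an average club player.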

A natural way to describe this situation is to say that AI is superhuman at chess. No matter how you slice it, that’s true.

For other activities, though, it’s a lot murkier. Take radiology, for example:

Graph derived from figure one of CheXNet: Radiologist-Level Pneumonia Detection on Chest X-Rays with Deep Learning, assuming a normal distribution of human skill given the average in the paper.

CheXNet is a model for detecting pneumonia. In 2017, when the paper was published, it was already better than most radiologists: of the four it was compared to in the study, it beat all but one. Is it superhuman? It's certainly better than 99% of humans, since fewer than 1% of humans are radiologists and it's better than (about) 75% of those. If there were one savant radiologist who still marginally outperformed the AI, but the AI was better than every single other radiologist, would it be superhuman? What if there was once such a radiologist, but he has since died or retired?
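The "better than 99% of humans" claim is just a two-step calculation, sketched here with the rough figures from the text (and assuming, as the post implicitly does, that non-radiologists all perform worse than the model):

```python
frac_radiologists = 0.01          # "fewer than 1% of humans are radiologists"
frac_radiologists_beaten = 0.75   # the model beats "(about) 75% of those"

# Everyone who isn't a radiologist, plus the radiologists the model outperforms:
frac_humans_beaten = (1 - frac_radiologists) \
    + frac_radiologists * frac_radiologists_beaten
# frac_humans_beaten ≈ 0.9975
```

Which is why "better than 99% of humans" and "worse than the best expert" can both be true at once.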

We can call this the strength axis. AI can be better than the average human, better than the average expert, better than all but the very best human performers, or, like in the case of chess, better than everyone, full stop.

But that’s not the only axis.

Effort

The ARC-AGI benchmark is a series of visual logic puzzles that are (mostly) easy for humans and difficult for AI. There's a large cash prize for the first AI system that does well on them, but the winning system has to do well cheaply.

Why does this specification matter? Well, OpenAI's industry-leading o3 scored 87.5% on the public version of the benchmark, blowing many other models out of the water. It did so, however, by spending about $3,000 per puzzle. A human can solve them for about $5 each.

Talking in terms of money undersells the difference, though. The way o3 did so well was by trying to solve each puzzle 1024 separate times, thinking through a huge number of considerations each time, and then choosing the best option. This adds up to a lot of thinking: the entire set of 100 questions took 5.7 billion tokens of thinking for the system to earn its B+. War and Peace, for reference, is about a million tokens.

Source: this ARC prize post on o3’s performance

The average Mechanical Turker gets a little over 75%, far less than o3’s 87.5%. So is o3 human-level, or even superhuman? In some narrow sense, sure. But it had to write War and Peace 5,700 times to get there, and spend significantly more than most people’s annual salaries.
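The back-of-the-envelope numbers above are easy to check directly (figures from the post; the per-puzzle costs are approximate):

```python
total_tokens = 5_700_000_000      # ~5.7 billion thinking tokens over 100 puzzles
war_and_peace_tokens = 1_000_000  # War and Peace is ~1 million tokens

cost_per_puzzle_ai = 3000.0       # dollars, o3's high-compute setting
cost_per_puzzle_human = 5.0       # dollars, rough human rate

novels_worth = total_tokens / war_and_peace_tokens        # 5700 copies
cost_ratio = cost_per_puzzle_ai / cost_per_puzzle_human   # 600x per puzzle
total_ai_cost = cost_per_puzzle_ai * 100                  # ~$300,000 for the set
```

A 600x per-puzzle cost multiple, and a six-figure bill for the whole set, is the gap the effort axis is meant to capture.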

We can call this axis the effort axis. Like the strength axis, it has various levels. A system can perform a task for negligible cost, or for approximately how much a human would charge, or for way more than a human would charge, or for astronomical sums.

And More

These axes combine! A system that costs $10,000 to do a task better than any human alive is one kind of impressive, and a system that exceeds the human average for pennies is another. Nor are strength and effort the only considerations; lots of tasks have dimensions particular to them.
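One way to make "these axes combine" concrete is to treat each system as a point on a two-axis grid. The level names below are hypothetical labels for the levels described in this post, not established terminology:

```python
from dataclasses import dataclass

# Hypothetical level names for the two axes described above.
STRENGTH_LEVELS = ("below average", "above average human",
                   "above average expert", "above all but the best",
                   "above everyone")
EFFORT_LEVELS = ("negligible cost", "roughly human cost",
                 "well above human cost", "astronomical cost")

@dataclass(frozen=True)
class CapabilityProfile:
    task: str
    strength: str   # one of STRENGTH_LEVELS
    effort: str     # one of EFFORT_LEVELS

# The two contrasting cases from the text:
chess_engine = CapabilityProfile("chess", "above everyone", "negligible cost")
o3_on_arc = CapabilityProfile("ARC-AGI", "above average human",
                              "astronomical cost")
```

The point of the sketch: "superhuman" collapses this grid to a single bit, losing exactly the distinctions the two profiles above illustrate.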

Take persuasion. A recent study (which may be retracted/canceled due to ethics concerns) showed that LLM systems did better than (most) humans at changing people’s views on the subreddit r/changemyview. Try to unpack this, and it’s way more complicated than chess, ARC-AGI, or radiology diagnosis. To name just a few dimensions:
- How large a change of view counts as success
- How firmly the view was held to begin with
- Whether arguments can be tailored to the specific person being persuaded

Many capabilities we care about are more like persuasion than chess or radiology diagnostic work. A drop-in remote worker, for example, needs to have a lot of fuzzy “soft skills” that are not only difficult to ingrain, but difficult to measure.

Beyond “Superhuman”

Probably, you don't see people calling AI "superhuman" all the time. But there are lots of related terms used the same way: questions about whether a system is "human-level," or has "surpassed experts."

These questions try to compress a multidimensional spectrum into a binary property. The urge makes sense, and sometimes, as with chess, the "is it superhuman" question has a clear-cut answer. But unless the case is that clear cut, it's probably better to ask narrower questions, and to think carefully when performance claims are flying around.

  1. ^

    Yeah, I know this graph (and the next) wouldn't actually be normal distributions. I don't think it matters for the purposes of this post, though.



