TechCrunch News, January 19
AI isn’t very good at history, new paper finds

A new study shows that while AI excels at certain tasks such as coding or generating podcasts, it performs poorly on a high-level history exam. Researchers used the Hist-LLM benchmark to test large language models including GPT-4, Llama, and Gemini, and found their accuracy on historical questions disappointing: the best performer, GPT-4 Turbo, scored only 46%, barely above random guessing. This suggests LLMs still lack a grasp of nuanced historical knowledge: they tend to extrapolate from prominent historical data, struggle to retrieve more obscure facts, and may carry biases from their training data. Even so, the researchers remain hopeful that AI can assist historical research in the future.

🔍 Large language models (LLMs) perform poorly on difficult historical questions; even GPT-4 Turbo reached only about 46% accuracy, close to random guessing.

📚 LLMs tend to extrapolate from prominent historical data and struggle to retrieve more obscure knowledge. For example, when asked whether ancient Egypt had a standing army during a specific period, the models wrongly answered yes, likely extrapolating from information about other ancient empires that did maintain standing armies.

🌍 The study also found that OpenAI and Llama models performed worse on questions about regions such as sub-Saharan Africa, suggesting possible biases in their training data and underscoring that LLMs cannot yet replace humans in certain domains.

💡 Even so, the researchers remain optimistic that LLMs can assist historical research in the future; they are refining the benchmark by incorporating more data from underrepresented regions and adding more complex questions.

AI might excel at certain tasks like coding or generating a podcast. But it struggles to pass a high-level history exam, a new paper has found.

A team of researchers has created a new benchmark to test three top large language models (LLMs) — OpenAI’s GPT-4, Meta’s Llama, and Google’s Gemini — on historical questions. The benchmark, Hist-LLM, tests the correctness of answers according to the Seshat Global History Databank, a vast database of historical knowledge named after the ancient Egyptian goddess of wisdom. 
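For readers curious what scoring model answers against a databank involves in practice, here is a minimal sketch of such an evaluation loop in Python. The question wording, the ask_model stub, and the yes/no scoring are illustrative assumptions, not the Hist-LLM paper's actual code or data format.

```python
# A minimal sketch of a benchmark-style evaluation loop. The question
# wording, the ask_model stub, and the yes/no scoring are illustrative
# assumptions, not the Hist-LLM paper's actual code or data format.

def ask_model(question: str) -> str:
    """Stand-in for a real LLM API call; this stub always answers 'yes'."""
    return "yes"

def evaluate(questions: list[tuple[str, str]]) -> float:
    """Compare model answers against ground-truth labels from a databank."""
    correct = sum(
        ask_model(q).strip().lower() == truth for q, truth in questions
    )
    return correct / len(questions)

sample = [
    ("Did ancient Egypt have a professional standing army in period X?", "no"),
    ("Was scale armor present in ancient Egypt during period Y?", "no"),
]
print(f"accuracy: {evaluate(sample):.0%}")  # this naive stub scores 0%
```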

The results, which were presented last month at the high-profile AI conference NeurIPS, were disappointing, according to researchers affiliated with the Complexity Science Hub (CSH), a research institute based in Austria. The best-performing LLM was GPT-4 Turbo, but it only achieved about 46% accuracy — not much higher than random guessing. 

“The main takeaway from this study is that LLMs, while impressive, still lack the depth of understanding required for advanced history. They’re great for basic facts, but when it comes to more nuanced, PhD-level historical inquiry, they’re not yet up to the task,” said Maria del Rio-Chanona, one of the paper’s co-authors and an associate professor of computer science at University College London.

The researchers shared sample historical questions with TechCrunch that LLMs got wrong. For example, GPT-4 Turbo was asked whether scale armor was present during a specific time period in ancient Egypt. The LLM said yes, but the technology only appeared in Egypt 1,500 years later. 

Why are LLMs bad at answering technical historical questions, when they can be so good at answering very complicated questions about things like coding? Del Rio-Chanona told TechCrunch that it’s likely because LLMs tend to extrapolate from historical data that is very prominent, finding it difficult to retrieve more obscure historical knowledge.

For example, the researchers asked GPT-4 if ancient Egypt had a professional standing army during a specific historical period. While the correct answer is no, the LLM answered incorrectly that it did. This is likely because there is lots of public information about other ancient empires, like Persia, having standing armies.

“If you get told A and B 100 times, and C 1 time, and then get asked a question about C, you might just remember A and B and try to extrapolate from that,” del Rio-Chanona said.
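Her analogy describes frequency-driven extrapolation. A toy illustration (purely hypothetical, and not how these models actually work internally) shows how a single rare fact gets drowned out when answers are chosen by overall frequency:

```python
from collections import Counter

# Toy illustration of del Rio-Chanona's "A and B 100 times, C once" analogy.
# Purely hypothetical: real LLMs do not answer by counting facts like this.

training_facts = (
    ["Persia had a standing army: yes"] * 100  # prominent, frequently seen
    + ["Egypt had a standing army: no"]        # rare, seen only once
)

def answer_by_frequency(facts: list[str]) -> str:
    """Pick whichever answer ('yes' or 'no') appears most often overall,
    ignoring which empire each fact is actually about."""
    counts = Counter(fact.rsplit(": ", 1)[1] for fact in facts)
    return counts.most_common(1)[0][0]

# Asked about Egypt, a frequency-driven answerer still says "yes":
# the common pattern swamps the one contrary fact.
print(answer_by_frequency(training_facts))  # -> "yes"
```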

The researchers also identified other trends, including that OpenAI and Llama models performed worse for certain regions like sub-Saharan Africa, suggesting potential biases in their training data.

The results show that LLMs still aren’t a substitute for humans when it comes to certain domains, said Peter Turchin, who led the study and is a faculty member at CSH. 

But the researchers are still hopeful LLMs can help historians in the future. They’re working on refining their benchmark by including more data from underrepresented regions and adding more complex questions.

“Overall, while our results highlight areas where LLMs need improvement, they also underscore the potential for these models to aid in historical research,” the paper reads.

