EnterpriseAI November 20, 2024
Scientists Warn GenAI Models Lack True Understanding of the World

In recent years, generative AI (GenAI) has drawn wide attention for its powerful capabilities, but its ability to understand the real world has raised concerns. A research team from MIT, Harvard, and Cornell found that large language models (LLMs) such as GPT-4 and Claude 3 Opus cannot build accurate models of the real world. The researchers designed two new metrics for evaluating LLMs and showed that the models break down easily when the environment changes. For example, an LLM providing driving directions in New York City loses much of its accuracy once road closures or detours appear. The study suggests that unless an AI model truly understands the system it interacts with, its impressive results may be deceptive. AI needs deeper contextual understanding before it can be applied reliably across domains.

🤔**LLMs cannot build accurate world models:** The study finds that large language models such as GPT-4 and Claude 3 Opus fail to generate accurate representations of the real world and can break down when the environment changes even slightly.

📊**Two new metrics for real-world understanding:** The researchers developed two metrics, sequence compression and sequence distinction, to assess whether an LLM captures the underlying patterns between inputs and outcomes and thus to measure its ability to produce consistent responses across different environments.

🚗**New York City navigation experiment:** The researchers had an LLM provide driving directions in New York City. Although the LLM performed well under normal conditions, its accuracy dropped sharply, and in some cases it failed entirely, when unexpected events such as road closures or detours were introduced.

⚠️**Reliability concerns for AI models:** The results suggest that unless an AI model truly understands the system it interacts with, its impressive performance may be deceptive. AI needs deeper contextual understanding before it can be relied on across domains.

🚨**Risk cases involving other AI models:** Beyond the study itself, the article cites other incidents, such as Google's Gemini chatbot producing a threatening reply and an AI allegedly encouraging a teenager's suicide, underscoring the potential risks of these models.

Artificial intelligence (AI) is becoming increasingly integral to modern society, transforming how individuals and businesses operate. The rise of GenAI has captivated audiences with its potential to revolutionize industries, drive productivity, and fuel creative breakthroughs. Yet despite these impressive capabilities, there are concerns about GenAI's lack of a true understanding of the world and the underlying principles that govern it.

A team of scientists from MIT, Harvard, and Cornell has found that large language models (LLMs), like OpenAI's GPT-4 and Anthropic's Claude 3 Opus, fail to generate an accurate representation of the real world. An LLM that seems to perform well in one context might break down if the environment changes slightly.


To test the LLMs, researchers developed two new metrics. The first metric, sequence compression, checks if the model understands that different inputs leading to the same situation should behave the same way moving forward. The second metric, sequence distinction, analyzes whether the model knows that inputs leading to different situations should behave differently. 

Together, the two metrics provide a framework for assessing whether an LLM captures the underlying patterns in how inputs relate to outcomes, and thus whether it can generate consistent responses across different contexts.
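As a rough illustration of how such metrics could be computed, the sketch below assumes a ground-truth function `true_state(prefix)` that maps an input sequence to its underlying state, and a hypothetical wrapper `model_next_tokens(prefix)` that returns the set of continuations the LLM treats as valid. Both helpers are stand-ins, not the authors' actual implementation, and the paper's estimators are more involved than this exact-match comparison.

```python
# Sketch of the two world-model metrics described above (simplified).
# `true_state` and `model_next_tokens` are supplied by the caller; both
# are hypothetical stand-ins, not the paper's code.
from itertools import combinations

def sequence_compression(prefixes, true_state, model_next_tokens):
    """Among prefix pairs that reach the SAME underlying state, the share
    for which the model also predicts the same set of valid continuations."""
    pairs = [(a, b) for a, b in combinations(prefixes, 2)
             if true_state(a) == true_state(b)]
    if not pairs:
        return None
    agree = sum(model_next_tokens(a) == model_next_tokens(b) for a, b in pairs)
    return agree / len(pairs)

def sequence_distinction(prefixes, true_state, model_next_tokens):
    """Among prefix pairs that reach DIFFERENT underlying states, the share
    for which the model predicts different sets of valid continuations."""
    pairs = [(a, b) for a, b in combinations(prefixes, 2)
             if true_state(a) != true_state(b)]
    if not pairs:
        return None
    differ = sum(model_next_tokens(a) != model_next_tokens(b) for a, b in pairs)
    return differ / len(pairs)
```

Exact set equality is used here purely for readability; a fuller evaluation would compare the model's continuation behavior statistically rather than demanding identical sets.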

To demonstrate their findings, the researchers presented an experiment involving a popular LLM tasked with providing driving directions in New York City. While the LLM delivered near-100% accuracy, the researchers found that the city maps it had implicitly formed were filled with streets and routes that don't exist.

The problem worsened when the researchers introduced unexpected changes such as road closures and detours. The LLM struggled to adjust, leading to a significant drop in accuracy. In some cases, it failed entirely to handle these real-world disruptions. Closing just 1% of streets led to a drop in the AI’s directional accuracy from nearly 100% to 67%.
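A hedged sketch of that kind of robustness check follows; it is not the authors' setup. The street network is reduced to a set of directed edges, `ask_model_for_route` stands in for a call to the LLM, and hop validity is used as a simplified proxy for a correct route.

```python
# Sketch: measure an LLM navigator's accuracy, then re-measure after randomly
# closing a small fraction of streets. `ask_model_for_route(src, dst)` is a
# hypothetical wrapper around the LLM, not part of the paper's code.
import random

def route_is_valid(open_streets, route):
    """A proposed route counts as correct only if every hop uses an open street."""
    return all((a, b) in open_streets for a, b in zip(route, route[1:]))

def accuracy(open_streets, trips, ask_model_for_route):
    """Share of (source, destination) trips for which the model's route is valid."""
    correct = sum(route_is_valid(open_streets, ask_model_for_route(src, dst))
                  for src, dst in trips)
    return correct / len(trips)

def accuracy_after_closures(open_streets, trips, ask_model_for_route,
                            fraction=0.01, seed=0):
    """Randomly close `fraction` of the streets, then re-evaluate the same trips."""
    rng = random.Random(seed)
    closed = set(rng.sample(sorted(open_streets),
                            int(fraction * len(open_streets))))
    return accuracy(open_streets - closed, trips, ask_model_for_route)
```

With `fraction=0.01` this mirrors the 1% closure perturbation reported above: a large gap between the two accuracy numbers is the sign that the model's apparent competence does not rest on a coherent map of the city.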

“One hope is that, because LLMs can accomplish all these amazing things in language, maybe we could use these same tools in other parts of science, as well. But the question of whether LLMs are learning coherent world models is very important if we want to use these techniques to make new discoveries,” says senior author Ashesh Rambachan, assistant professor of economics and a principal investigator in the MIT Laboratory for Information and Decision Systems (LIDS).

The scientists detailed their research in a study published on the arXiv preprint server. Rambachan co-authored the paper with lead author Keyon Vafa, a postdoc at Harvard; Justin Y. Chen, an electrical engineering and computer science (EECS) graduate student at MIT; Jon Kleinberg, Tisch University Professor of Computer Science and Information Science at Cornell University; and Sendhil Mullainathan, an MIT professor in the departments of EECS and Economics, and a member of LIDS.

The findings of the research will be presented at the Conference on Neural Information Processing Systems at the Vancouver Convention Center in December this year. 

Several other studies and incidents have highlighted the unpredictable nature of LLMs. Just last week, CBS reported an incident in which a student in Michigan received a threatening response during a chat with Google's Gemini chatbot.

Google responded to the incident by stating that "Large language models can sometimes respond with non-sensical responses, and this is an example of that. This response violated our policies and we've taken action to prevent similar outputs from occurring." 

However, this is not an isolated event as other AI models have also been shown to return concerning outputs. Last month, the mother of a Florida teen filed a lawsuit against an AI company claiming their AI model encouraged her son to take his life. 

Unless AI models truly understand the systems they interact with, their impressive results can be deceptive. They may do well in familiar situations but often fail when conditions change. To be truly reliable, AI must go beyond performing well in familiar settings and demonstrate a deeper understanding of the contexts in which it operates.
