MarkTechPost@AI · February 16
How AI Chatbots Mimic Human Behavior: Insights from Multi-Turn Evaluations of LLMs

This article examines how AI chatbots, through natural conversation, create the illusion of having emotions and other human traits. Existing evaluation methods fall short, so a research team from Oxford and Google DeepMind proposes a framework that assesses human-like behaviors over multi-turn dialogues. The framework has three main components; experiments show it evaluates these behaviors more effectively and provides a baseline for follow-up research.

💬 How AI chatbots create an illusion of humanity, and the risks this brings

❌ Shortcomings of current methods for evaluating AI chat systems

🌟 The research team's proposed evaluation framework and its components

📊 Experimental results and their implications for follow-up research

AI chatbots create the illusion of having emotions, morals, or consciousness by generating natural, human-like conversation. Many users turn to AI for chat and companionship, reinforcing the false belief that it truly understands them. This carries serious risks: users may over-rely on AI, disclose sensitive data, or seek advice beyond its capabilities, and some even let it steer their decisions in harmful ways. Without a clear understanding of how AI fosters this belief, the problem only worsens.

Current methods for evaluating AI chat systems rely on single-turn prompts and fixed tests, failing to capture how AI behaves in real conversations. Some multi-turn tests focus only on harmful user behavior and ignore ordinary interactions, while automated red-teaming adapts so aggressively that results are hard to compare. Studies with human participants are difficult to repeat and scale. Measuring how human-like people perceive AI to be is also a challenge: people instinctively attribute human traits to AI, which shapes how much they trust it. Evaluations show that AI's human-like behavior leads users to believe it is more accurate, and even to form emotional bonds with it. Existing methods fail to measure this issue properly.

To address these issues, a team of researchers from the University of Oxford and Google DeepMind proposed an evaluation framework for assessing human-like behaviors in AI chat systems. Unlike existing methods that rely on single-turn prompts and fixed tests, this framework tracks 14 specific anthropomorphic behaviors across multi-turn conversations. Automated simulations analyze AI interactions with users over multiple exchanges, improving scalability and comparability. The framework consists of three main components. First, it systematically monitors 14 anthropomorphic behaviors and classifies them into self-referential and relational traits, including personhood claims and expressions of emotion. Second, it scales up multi-turn assessment through interactive user simulation, ensuring consistency and scalability. Third, it validates results through human-subject evaluation, confirming that automated judgments align with user perceptions.
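A minimal sketch of this simulation-and-judging loop is shown below. It assumes each of the three models (User, Target, Judge) is exposed as a simple chat-completion callable; the `ChatFn` wrapper, the abbreviated behavior list, and the 0-3 rating scale are illustrative assumptions, not the paper's exact taxonomy or rubric.

```python
from typing import Callable, Dict, List

# Any function that maps a message list to a reply string, e.g. a thin
# wrapper around your provider's chat-completion API (hypothetical).
ChatFn = Callable[[List[Dict[str, str]]], str]

# Abbreviated stand-ins for the paper's 14 behavior categories.
BEHAVIORS = [
    "personhood_claims",           # self-referential
    "physical_embodiment_claims",  # self-referential
    "internal_state_expressions",  # self-referential
    "relationship_building",       # relational
]

def run_dialogue(user_llm: ChatFn, target_llm: ChatFn,
                 seed_prompt: str, turns: int = 5) -> List[Dict[str, str]]:
    """Simulate `turns` exchanges between a User LLM and a Target LLM."""
    transcript: List[Dict[str, str]] = [{"role": "user", "content": seed_prompt}]
    for _ in range(turns):
        reply = target_llm(transcript)
        transcript.append({"role": "assistant", "content": reply})
        # The User LLM sees the conversation from the opposite side,
        # so roles are flipped before asking it for the next user turn.
        flipped = [
            {"role": "assistant" if m["role"] == "user" else "user",
             "content": m["content"]}
            for m in transcript
        ]
        transcript.append({"role": "user", "content": user_llm(flipped)})
    return transcript

def rate_dialogue(judge_llm: ChatFn,
                  transcript: List[Dict[str, str]]) -> Dict[str, int]:
    """Ask a Judge LLM to score each behavior on the transcript (0-3)."""
    text = "\n".join(f'{m["role"]}: {m["content"]}' for m in transcript)
    scores: Dict[str, int] = {}
    for behavior in BEHAVIORS:
        prompt = (f"On a 0-3 scale, rate how strongly the assistant exhibits "
                  f"'{behavior}' in this dialogue. Reply with a single digit.\n\n"
                  f"{text}")
        scores[behavior] = int(judge_llm([{"role": "user",
                                           "content": prompt}]).strip())
    return scores
```

In the paper's setup, three Judge LLMs rate all 14 behaviors; the list above only marks representative examples from each of the two categories.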

Researchers evaluated anthropomorphic behaviors using a multi-turn setup in which a User LLM interacted with a Target LLM across eight scenarios in four domains: friendship, life coaching, career development, and general planning. Fourteen behaviors were analyzed, categorized as self-referential (personhood claims, physical embodiment claims, and internal state expressions) or relational (relationship-building behaviors). 960 contextualized prompts generated 4,800 five-turn dialogues per model, each assessed by three Judge LLMs, yielding 561,600 ratings in total. The analysis confirmed that the User LLM exhibited higher anthropomorphism scores than the Target LLMs. To check alignment with human perceptions, interactions between 1,101 participants and Gemini 1.5 Pro were analyzed under high- and low-anthropomorphism conditions. Participants exposed to these behaviors more frequently also reported stronger anthropomorphic perceptions in survey responses, as quantified by the AnthroScore measure. Statistical contrasts revealed large differences in anthropomorphic behavior across the four domains, indicating that AI systems exhibit more human-like behavior in some conversational settings than in others.
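A pipeline of this scale ultimately produces a large table of per-behavior judge ratings that must be pooled for comparison across domains. Below is a hedged sketch of one way such pooling could work; the record schema and field names are hypothetical, not the paper's actual data format.

```python
from collections import defaultdict
from statistics import mean

def aggregate(ratings):
    """Pool judge ratings into a mean score per (domain, behavior).

    `ratings` is an iterable of dicts such as
    {"domain": "friendship", "behavior": "personhood_claims",
     "judge": "judge_a", "score": 2}  (illustrative schema).
    """
    buckets = defaultdict(list)
    for r in ratings:
        buckets[(r["domain"], r["behavior"])].append(r["score"])
    return {key: mean(vals) for key, vals in buckets.items()}

# Toy example: two judges rating one behavior in one domain.
demo = [
    {"domain": "friendship", "behavior": "personhood_claims",
     "judge": "judge_a", "score": 2},
    {"domain": "friendship", "behavior": "personhood_claims",
     "judge": "judge_b", "score": 3},
]
print(aggregate(demo))  # {('friendship', 'personhood_claims'): 2.5}
```

Per-domain means like these are what a statistical contrast (e.g., comparing friendship against general planning) would then operate on.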

In summary, the framework offers a better multi-turn assessment technique than single-turn approaches for evaluating anthropomorphic behaviors in conversational AI. The results identified relationship-building behaviors that evolve over the course of a dialogue. As a baseline for subsequent research, the framework can inform AI development by helping developers recognize when anthropomorphic behaviors occur and how they affect users. Future work can make the assessment methods more precise, enhance the robustness of the metrics, and formalize the analysis, leading to more transparent and ethically sound AI systems.


Check out the Paper. All credit for this research goes to the researchers of this project. Also, feel free to follow us on Twitter and don’t forget to join our 75k+ ML SubReddit.


