MarkTechPost@AI July 19, 2024
This AI Paper from Microsoft Presents RUBICON: A Machine Learning Technique for Evaluating Domain-Specific Human-AI Conversations

Researchers at Microsoft propose a technique called RUBICON for evaluating domain-specific conversational AI assistants such as GitHub Copilot Chat. The technique uses large language models to automatically generate high-quality, task-aware rubrics for assessing conversation quality. RUBICON was tested on 100 conversations between developers and a chat-based assistant on C# debugging tasks; the results show it outperforms alternative rubric sets at predicting conversation quality, and ablation studies confirm the effectiveness of its individual components.

🤔 RUBICON is a machine learning technique for evaluating domain-specific human-AI conversations. It uses large language models to automatically generate high-quality, task-aware rubrics for assessing conversation quality, and it emphasizes the importance of context and task progression to improve evaluation accuracy. To achieve this, RUBICON incorporates domain-specific signals and the Gricean maxims, and selects the best rubric set through iterative evaluation.

🤖 Tested on 100 conversations between developers and a chat-based assistant on C# debugging tasks, RUBICON outperformed alternative rubric sets at predicting conversation quality. The results show it surpasses baseline methods in separating positive from negative conversations and in classifying conversations with high precision, indicating that domain-specific signals and conversation design principles play an important role in improving evaluation accuracy.

🔍 RUBICON estimates the conversation quality of domain-specific assistants by learning Satisfaction (SAT) and Dissatisfaction (DSAT) rubrics from labeled conversations. It involves three steps: generating diverse rubrics, selecting an optimized rubric set, and scoring conversations. Rubrics are natural-language assertions that capture conversation attributes. Conversations are rated on a 5-point Likert scale and normalized to the [0, 10] range. Rubric generation involves supervised extraction and summarization, while selection optimizes the rubrics' precision and coverage; correctness and sharpness losses guide the choice of the optimal rubric subset, ensuring effective and accurate assessment of conversation quality.

Evaluating conversational AI assistants like GitHub Copilot Chat is challenging because they rely on language models and chat-based interfaces. Existing conversational-quality metrics fall short for domain-specific dialogues, making it hard for software developers to assess how effective these tools are. While techniques like SPUR use large language models to analyze user satisfaction, they can miss domain-specific nuances. The study focuses on automatically generating high-quality, task-aware rubrics for evaluating task-oriented conversational AI assistants, emphasizing context and task progression to improve evaluation accuracy.

Researchers from Microsoft present RUBICON, a technique for evaluating domain-specific Human-AI conversations using large language models. RUBICON generates candidate rubrics to assess conversation quality and selects the best-performing ones. It enhances SPUR by incorporating domain-specific signals and Gricean maxims, creating a pool of rubrics evaluated iteratively. RUBICON was tested on 100 conversations between developers and a chat-based assistant for C# debugging, using GPT-4 for rubric generation and assessment. It outperformed alternative rubric sets, achieving high precision in predicting conversation quality and demonstrating the effectiveness of its components through ablation studies.
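The article does not include the prompts RUBICON uses with GPT-4; as a rough illustration of how a single natural-language rubric might be scored against a conversation, here is a minimal Python sketch. The rubric wording, the prompt text, and the integer parsing of the model's reply are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch: rate one conversation against one rubric with GPT-4.
# The prompt wording, the 1-5 Likert framing, and the naive integer
# parsing below are assumptions based on the article's description,
# not the prompts used in the paper.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

LIKERT_PROMPT = """You are grading an assistant conversation against a rubric.
Rubric: {rubric}

Conversation:
{conversation}

On a 1-5 Likert scale (1 = strongly disagree, 5 = strongly agree),
how well does the rubric hold for this conversation?
Answer with a single integer."""


def score_against_rubric(conversation: str, rubric: str) -> int:
    """Ask GPT-4 how strongly a natural-language rubric holds for a conversation."""
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": LIKERT_PROMPT.format(rubric=rubric, conversation=conversation),
        }],
        temperature=0,
    )
    return int(response.choices[0].message.content.strip())
```

In RUBICON's pipeline, many such SAT and DSAT rubrics would be applied to each conversation and their ratings aggregated, as described below.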

Natural language conversations are central to modern AI applications, but traditional NLP metrics such as BLEU and perplexity are inadequate for evaluating long-form conversations, especially those with LLMs. User satisfaction has been a key metric, but manual analysis is resource-intensive and privacy-intrusive. Recent approaches use language models to assess conversation quality through natural language assertions that capture engagement and user-experience themes. Techniques like SPUR generate rubrics for open-domain conversations but do not account for domain-specific contexts. This study takes a holistic approach, integrating user expectations and interaction progress, and explores bandit methods for selecting optimal prompts to improve evaluation accuracy.
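The article only mentions that bandit methods are explored for prompt selection; the sketch below shows the general idea with a standard UCB1 bandit, where each arm is a candidate prompt and the reward is an assumed agreement score against labeled conversations. Both the reward definition and the choice of UCB1 are illustrative assumptions, not the paper's procedure.

```python
# Illustrative UCB1 bandit over candidate prompts. The reward function
# (agreement with human SAT/DSAT labels) and this specific algorithm are
# assumptions for illustration only.
import math
import random


def ucb1_select(prompts, evaluate_prompt, rounds=200):
    """Pick the prompt whose evaluations agree best with labels, UCB1-style.

    evaluate_prompt(prompt) should return a reward in [0, 1], e.g. accuracy
    on a randomly drawn labeled conversation.
    """
    counts = [0] * len(prompts)
    totals = [0.0] * len(prompts)

    for t in range(1, rounds + 1):
        # Play each arm once, then follow the UCB1 upper-confidence rule.
        if 0 in counts:
            arm = counts.index(0)
        else:
            arm = max(range(len(prompts)),
                      key=lambda i: totals[i] / counts[i]
                      + math.sqrt(2 * math.log(t) / counts[i]))
        reward = evaluate_prompt(prompts[arm])
        counts[arm] += 1
        totals[arm] += reward

    return prompts[max(range(len(prompts)), key=lambda i: totals[i] / counts[i])]


# Example usage with a dummy evaluator simulating noisy agreement scores.
if __name__ == "__main__":
    candidates = ["prompt A", "prompt B", "prompt C"]
    true_quality = {"prompt A": 0.6, "prompt B": 0.8, "prompt C": 0.5}
    best = ucb1_select(candidates,
                       lambda p: 1.0 if random.random() < true_quality[p] else 0.0)
    print("selected:", best)
```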

RUBICON estimates conversation quality for domain-specific assistants by learning rubrics for Satisfaction (SAT) and Dissatisfaction (DSAT) from labeled conversations. It involves three steps: generating diverse rubrics, selecting an optimized rubric set, and scoring conversations. Rubrics are natural language assertions capturing conversation attributes. Conversations are evaluated using a 5-point Likert scale, normalized to a [0, 10] range. Rubric generation involves supervised extraction and summarization, while selection optimizes rubrics for precision and coverage. Correctness and sharpness losses guide the selection of an optimal rubric subset, ensuring effective and accurate conversation quality assessment.
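To make the scoring step concrete, here is a minimal sketch of how 1-5 Likert ratings could be normalized to [0, 10] and combined into a single SAT-minus-DSAT conversation score. The aggregation shown (a simple mean and difference) is an assumption; the paper's exact NetSAT formula is not given in the article.

```python
# Sketch of the scoring step: each rubric is rated on a 1-5 Likert scale,
# normalized to [0, 10], and SAT/DSAT rubric scores are combined into a
# single conversation-quality (NetSAT-style) score. The aggregation is an
# assumption; the paper may combine ratings differently.
from statistics import mean


def normalize_likert(rating: int) -> float:
    """Map a 1-5 Likert rating onto the [0, 10] range mentioned in the article."""
    return (rating - 1) / 4 * 10


def net_sat(sat_ratings: list[int], dsat_ratings: list[int]) -> float:
    """Higher when SAT rubrics hold strongly and DSAT rubrics do not.

    sat_ratings / dsat_ratings are 1-5 Likert ratings for each rubric,
    e.g. produced by an LLM judge such as score_against_rubric above.
    """
    sat = mean(normalize_likert(r) for r in sat_ratings)
    dsat = mean(normalize_likert(r) for r in dsat_ratings)
    return sat - dsat  # in [-10, 10]


# Example: a conversation that satisfies SAT rubrics and avoids DSAT ones.
print(net_sat(sat_ratings=[5, 4, 4], dsat_ratings=[1, 2, 1]))  # ≈ 7.5
```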

The evaluation of RUBICON involves three key questions: its effectiveness compared to other methods, the impact of Domain Sensitization (DS) and Conversation Design Principles (CDP), and the performance of its selection policy. The conversation data, sourced from a C# Debugger Copilot assistant, was filtered and annotated by experienced developers, resulting in a 50:50 train-test split. Metrics like accuracy, precision, recall, F1 score, ΔNetSAT score, and Yield Rate were evaluated. Results showed that RUBICON outperforms baselines in separating positive and negative conversations and classifying conversations with high precision, highlighting the importance of DS and CDP instructions.
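The article names ΔNetSAT and Yield Rate without defining them. The sketch below assumes ΔNetSAT is the gap between the mean scores of positive- and negative-labeled conversations, and Yield Rate is the fraction of conversations scored decisively outside a neutral band; both definitions are assumptions for illustration only.

```python
# Illustrative versions of the evaluation metrics named above. The exact
# definitions of ΔNetSAT and Yield Rate in the paper may differ; these are
# plausible placeholders, not the authors' formulas.
from statistics import mean


def delta_net_sat(scores, labels):
    """Separation between positive- and negative-labeled conversation scores."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    return mean(pos) - mean(neg)


def yield_rate(scores, low=-2.0, high=2.0):
    """Share of conversations scored decisively (outside the [low, high] band)."""
    decided = [s for s in scores if s <= low or s >= high]
    return len(decided) / len(scores)


# Example with toy NetSAT scores and ground-truth labels.
scores = [6.5, 4.0, -3.0, 1.0, -5.5]
labels = [1, 1, 0, 1, 0]
print(delta_net_sat(scores, labels), yield_rate(scores))
```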

Internal validity is threatened by the subjective nature of manually assigned ground truth labels despite high inter-annotator agreement. External validity is limited by the dataset’s lack of diversity, being specific to C# debugging tasks in a software company, potentially affecting generalization to other domains. Construct validity issues include the reliance on an automated scoring system and assumptions made by converting Likert scale responses into a [0, 10] scale. Future work will address different calculation methods for the NetSAT score. RUBICON has succeeded in enhancing rubric quality and differentiating conversation effectiveness, proving valuable in real-world deployment.


Check out the Paper and Details. All credit for this research goes to the researchers of this project.
