MarkTechPost@AI June 6, 04:00
Salesforce AI Introduces CRMArena-Pro: The First Multi-Turn and Enterprise-Grade Benchmark for LLM Agents

CRMArena-Pro is a new benchmark from Salesforce AI Research designed to evaluate how LLM agents perform in realistic customer relationship management (CRM) business scenarios. The benchmark covers customer service, sales, and CPQ (configure, price, quote), and simulates both B2B and B2C environments. CRMArena-Pro probes agent capabilities through multi-turn dialogue and sensitive-information handling. The results show that even a top model like Gemini 2.5 Pro reaches only 35% accuracy on multi-turn tasks, and that confidentiality handling remains a major challenge.

📊 CRMArena-Pro is a new benchmark for evaluating LLM agents in realistic business settings. It uses synthetic but structurally accurate enterprise data based on Salesforce schemas to simulate business environments spanning customer service, sales, and CPQ.

🗣️ The benchmark comprises 19 tasks grouped under four key skills: database querying, textual reasoning, workflow execution, and policy compliance. It also includes multi-turn dialogue with simulated users and tests confidentiality awareness, giving a fuller picture of how LLM agents would perform in practice.

📈 The results show that even a top LLM like Gemini 2.5 Pro reaches only about 58% accuracy on single-turn tasks, dropping to 35% in multi-turn dialogue. Workflow execution was the strongest area, but confidentiality handling remains a major challenge: prompting for stricter confidentiality improved refusals yet sometimes lowered task accuracy, highlighting a trade-off between privacy and performance.

AI agents powered by LLMs show great promise for handling complex business tasks, especially in Customer Relationship Management (CRM). However, evaluating their real-world effectiveness is difficult because realistic business data is rarely publicly available. Existing benchmarks often focus on simple, single-turn interactions or narrow applications such as customer service, missing broader domains including sales, CPQ processes, and B2B operations. They also fail to test how well agents manage sensitive information. As a result, it is hard to understand how LLM agents perform across the diverse range of real-world business scenarios and communication styles.

Previous benchmarks have largely focused on customer service tasks in B2C scenarios, overlooking key business operations, such as sales and CPQ processes, as well as the unique challenges of B2B interactions, including longer sales cycles. Moreover, many benchmarks lack realism, often ignoring multi-turn dialogue or skipping expert validation of tasks and environments. Another critical gap is the absence of confidentiality evaluation, vital in workplace settings where AI agents routinely engage with sensitive business and customer data. Without assessing confidentiality awareness, these benchmarks fail to address serious practical concerns, such as privacy, legal risk, and trust.

Researchers from Salesforce AI Research have introduced CRMArena-Pro, a benchmark designed to realistically evaluate LLM agents like Gemini 2.5 Pro in professional business environments. It features expert-validated tasks across customer service, sales, and CPQ, spanning both B2B and B2C contexts. The benchmark tests multi-turn conversations and assesses confidentiality awareness. Findings show that even top-performing models such as Gemini 2.5 Pro achieve only around 58% accuracy in single-turn tasks, with performance dropping to 35% in multi-turn settings. Workflow Execution is an exception, where Gemini 2.5 Pro exceeds 83%, but confidentiality handling remains a major challenge across all evaluated models. 

CRMArena-Pro is a new benchmark created to rigorously test LLM agents in realistic business settings, including customer service, sales, and CPQ scenarios. Built using synthetic yet structurally accurate enterprise data generated with GPT-4 and based on Salesforce schemas, the benchmark simulates business environments through sandboxed Salesforce Organizations. It features 19 tasks grouped under four key skills: database querying, textual reasoning, workflow execution, and policy compliance. CRMArena-Pro also includes multi-turn conversations with simulated users and tests confidentiality awareness. Expert evaluations confirmed the realism of the data and environment, ensuring a reliable testbed for LLM agent performance. 
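To make the task structure concrete, below is a minimal, hypothetical sketch in Python of how one of the benchmark's 19 tasks might be represented. The names (`BenchmarkTask`, `Skill`, the field names) are illustrative assumptions rather than the actual CRMArena-Pro API; only the four skill categories, the B2B/B2C split, and the multi-turn and confidentiality dimensions come from the benchmark description.

```python
from dataclasses import dataclass
from enum import Enum

# The four skill categories come from the benchmark description;
# everything else in this sketch is an illustrative assumption.
class Skill(Enum):
    DATABASE_QUERYING = "database_querying"
    TEXTUAL_REASONING = "textual_reasoning"
    WORKFLOW_EXECUTION = "workflow_execution"
    POLICY_COMPLIANCE = "policy_compliance"

@dataclass
class BenchmarkTask:
    task_id: str
    skill: Skill
    business_context: str        # "B2B" or "B2C"
    multi_turn: bool             # single query vs. dialogue with a simulated user
    tests_confidentiality: bool  # does the task probe refusal on sensitive data?
    instruction: str             # the natural-language task given to the agent

# Example instance (hypothetical task content):
example = BenchmarkTask(
    task_id="service-042",
    skill=Skill.DATABASE_QUERYING,
    business_context="B2C",
    multi_turn=True,
    tests_confidentiality=False,
    instruction="Find all open cases for the account mentioned by the user.",
)
```

Organizing tasks along these axes mirrors how the paper reports results: by skill, by business context, and by single- versus multi-turn setting.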

The evaluation compared top LLM agents across 19 business tasks, focusing on task completion and awareness of confidentiality. Metrics varied by task type—exact match was used for structured outputs, and F1 score for generative responses. A GPT-4o-based LLM Judge assessed whether models appropriately refused to share sensitive information. Models like Gemini-2.5-Pro and o1, with advanced reasoning, clearly outperformed lighter or non-reasoning versions, especially in complex tasks. While performance was similar across B2B and B2C settings, nuanced trends emerged based on model strength. Confidentiality-aware prompts improved refusal rates but sometimes reduced task accuracy, highlighting a trade-off between privacy and performance. 
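As a concrete illustration of the scoring scheme described above, the following sketch implements the two reported metrics in their standard, generic form: exact match for structured outputs and token-level F1 for generative responses. This is not code from the CRMArena-Pro repository, and the GPT-4o judge used for confidentiality refusals is omitted because it depends on a judging prompt not reproduced here.

```python
from collections import Counter

def exact_match(prediction: str, reference: str) -> float:
    """1.0 if the normalized strings match exactly, else 0.0."""
    return float(prediction.strip().lower() == reference.strip().lower())

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1, the usual metric for free-form generative answers."""
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    common = Counter(pred_tokens) & Counter(ref_tokens)  # per-token min counts
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)
```

Exact match suits outputs with a single canonical answer, such as a record ID, while token-level F1 gives partial credit for overlapping content in free-form responses.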

In conclusion, CRMArena-Pro is a new benchmark designed to test how well LLM agents handle real-world business tasks in customer relationship management. It includes 19 expert-reviewed tasks across both B2B and B2C scenarios, covering sales, service, and pricing operations. While top agents performed decently in single-turn tasks (about 58% success), their performance dropped sharply to around 35% in multi-turn conversations. Workflow execution was the easiest area, but most other skills proved challenging. Confidentiality awareness was low, and improving it through prompting often reduced task accuracy. These findings reveal a clear gap between the capabilities of LLMs and the needs of enterprises. 


Check out the Paper, GitHub Page, Hugging Face Page and Technical Blog. All credit for this research goes to the researchers of this project.


The post Salesforce AI Introduces CRMArena-Pro: The First Multi-Turn and Enterprise-Grade Benchmark for LLM Agents appeared first on MarkTechPost.
