Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles

cs.AI updates on arXiv.org 07月31日 12:48

Persona-Augmented Benchmarking: Evaluating LLMs Across Diverse Writing Styles

本文探讨当前LLM评估基准在风格多样性方面的不足，通过采用基于角色的LLM提示技术，验证了风格和格式变化对LLM性能评估的影响，并提出了一种改进现有基准的方法，以增强评估的全面性和外部效度。

arXiv:2507.22168v1 Announce Type: cross Abstract: Current benchmarks for evaluating Large Language Models (LLMs) often do not exhibit enough writing style diversity, with many adhering primarily to standardized conventions. Such benchmarks do not fully capture the rich variety of communication patterns exhibited by humans. Thus, it is possible that LLMs, which are optimized on these benchmarks, may demonstrate brittle performance when faced with "non-standard" input. In this work, we test this hypothesis by rewriting evaluation prompts using persona-based LLM prompting, a low-cost method to emulate diverse writing styles. Our results show that, even with identical semantic content, variations in writing style and prompt formatting significantly impact the estimated performance of the LLM under evaluation. Notably, we identify distinct writing styles that consistently trigger either low or high performance across a range of models and tasks, irrespective of model family, size, and recency. Our work offers a scalable approach to augment existing benchmarks, improving the external validity of the assessments they provide for measuring LLM performance across linguistic variations.

Fish AI Reader

AI辅助创作，多种专业模板，深度分析，高质量内容生成。从观点提取到深度思考，FishAI为您提供全方位的创作支持。新版本引入自定义参数，让您的创作更加个性化和精准。

FishAI

鱼阅，AI 时代的下一个智能信息助手，助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

LLM评估风格多样性基准改进

相关文章

Prometheus-Eval and Prometheus 2: Setting New Standards in LLM Evaluation and Open-Source Innovation with State-of-the-art Evaluator Language Model

GraCoRe: A New AI Benchmark for Unveiling Strengths and Weaknesses in LLM Graph Comprehension and Reasoning

This AI Paper from Cornell Introduces UCB-E and UCB-E-LRF: Multi-Armed Bandit Algorithms for Efficient and Cost-Effective LLM Evaluation

tinyBenchmarks: Revolutionizing LLM Evaluation with 100-Example Curated Sets, Reducing Costs by Over 98% While Maintaining High Accuracy

Comet Launches Opik: A Comprehensive Open-Source Tool for End-to-End LLM Evaluation, Prompt Tracking, and Pre-Deployment Testing with Seamless Integration

Salesforce AI Introduces SFR-Judge: A Family of Three Judge Models of 8-Billion Parameters 8B, 12B, and 70B Size, Built with Meta Llama 3 and Mistral NeMO

LMSYS Chatbot Arena: Live and Community-Driven LLM Evaluation

Chatbot Arena Conversation Dataset Release

Chatbot Arena Leaderboard Week 8: Introducing MT-Bench and Vicuna-33B

Vinoground: A Temporal Counterfactual Large Multimodal Models LMM Evaluation Benchmark Encompassing 1000 Short and Natural Video-Caption Pairs