MarkTechPost@AI 2024-09-28
ReliabilityBench: Measuring the Unpredictable Performance of Shaped-Up Large Language Models Across Five Key Domains of Human Cognition

The study evaluates the reliability of large language models such as GPT, LLaMA, and BLOOM, pointing out that reliability does not necessarily improve as model scale and complexity increase; latent problems remain and call for deeper investigation.

🎯 Large language models are widely used across many fields, but as their scale and complexity grow, reliability does not necessarily improve; they can even perform poorly on simple tasks and produce misleading outputs.

💡 Existing approaches to the reliability problem include scaling up models, but while these improve the handling of complex queries they also cause failures on simple instances, and shaping models with techniques such as reinforcement learning has produced mixed results.

📊 The researchers introduce the ReliabilityBench framework to systematically evaluate LLM reliability across five domains, finding that performance varies across tasks and revealing nuanced strengths and weaknesses.

📉 The results show that scaling and shaping strategies improve performance on hard problems but often reduce reliability on easy ones, and newer models are more likely than older ones to produce plausible-looking yet incorrect answers.

📄 The study argues for a paradigm shift in the design and development of large language models; the ReliabilityBench framework offers a finer-grained evaluation, showing that model reliability falls short of human expectations and that training and evaluation strategies need refinement.

The research evaluates the reliability of large language models (LLMs) such as GPT, LLaMA, and BLOOM, which are extensively used across domains including education, medicine, science, and administration. As these models become more prevalent, understanding their limitations and potential pitfalls is crucial. The research highlights that as the models increase in size and complexity, their reliability does not necessarily improve. Instead, performance can decline on seemingly simple tasks, resulting in misleading outputs that may go unnoticed by human supervisors. This trend points to the need for a more thorough examination of LLM reliability beyond conventional performance metrics.

The central issue explored in the research is that while scaling up LLMs makes them more powerful, it also introduces unexpected behavioral patterns. Specifically, these models may become less stable and produce erroneous outputs that appear plausible at first glance. This issue arises from the reliance on instruction fine-tuning, human feedback, and reinforcement learning to enhance their performance. Despite these advancements, LLMs struggle to maintain consistent reliability across tasks of varying difficulty, which raises concerns about their robustness and suitability for applications where accuracy and predictability are critical.

Existing methodologies to address these reliability concerns include scaling up the models, which involves increasing the parameters, training data, and computational resources. For example, the size of GPT-3 models ranges from 350 million to 175 billion parameters, while LLaMA models vary from 6.7 billion to 70 billion. Although scaling has led to improvements in handling complex queries, it has also caused failures in simpler instances that users would expect to be easily managed. Similarly, shaping the models using techniques like Reinforcement Learning from Human Feedback (RLHF) has shown mixed results, often leading to models that generate plausible yet incorrect responses instead of simply avoiding the question.

Researchers from Universitat Politècnica de València and the University of Cambridge introduced the ReliabilityBench framework to systematically evaluate the reliability of LLMs across five domains: numeracy (‘addition’), vocabulary reshuffle (‘anagram’), geographical knowledge (‘locality’), basic and advanced science questions (‘science’), and information-centric transformations (‘transforms’). For instance, in the ‘addition’ domain, models were tested on arithmetic operations ranging from simple one-digit sums to complex 100-digit additions. The LLMs often performed poorly on tasks involving carry operations, with the overall success rate dropping sharply for longer additions. Similarly, in the ‘anagram’ task, which consists of rearranging letters to form words, performance varied significantly with word length, with a 96.78% failure rate for the longest anagram tested. This domain-specific benchmarking reveals LLMs’ nuanced strengths and weaknesses, offering a deeper understanding of their capabilities.
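To make this kind of difficulty-graded benchmarking concrete, the sketch below shows, in Python, one way addition items of increasing length could be generated and graded. It is a minimal illustration under stated assumptions, not the paper's actual protocol: the prompt wording, the digit-based answer parsing, and the `call_llm` helper are all hypothetical.

```python
import random

def make_addition_prompt(n_digits: int, rng: random.Random) -> tuple:
    """Build an n-digit addition question and its ground-truth answer."""
    a = rng.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    b = rng.randint(10 ** (n_digits - 1), 10 ** n_digits - 1)
    return f"What is {a} + {b}? Reply with the number only.", a + b

def grade(response: str, truth: int) -> str:
    """Bucket a raw model reply as 'correct', 'avoidant', or 'incorrect'."""
    digits = "".join(ch for ch in response if ch.isdigit())
    if not digits:
        return "avoidant"              # no numeric answer was given
    return "correct" if int(digits) == truth else "incorrect"

rng = random.Random(0)
for n_digits in (1, 5, 20, 100):       # difficulty grows with operand length
    prompt, truth = make_addition_prompt(n_digits, rng)
    # reply = call_llm(prompt)         # hypothetical model call, not a real API
    # print(n_digits, grade(reply, truth))
```

Grading into three buckets rather than a binary correct/incorrect score is what lets a benchmark separate genuine errors from deliberate avoidance, which matters for the findings discussed below.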

The research findings show that while scaling and shaping strategies improve LLM performance on complex questions, they often degrade reliability for simpler ones. For example, models like GPT-4 and LLaMA-2, which excel at answering complex scientific queries, still make basic errors in simple arithmetic or word reshuffling tasks. In addition, LLaMA-2’s performance on geographical knowledge questions, measured using a locality benchmark, indicated a high sensitivity to small variations in prompt phrasing. While the models displayed notable accuracy for well-known cities, they struggled significantly when dealing with less popular locations, resulting in an error rate of 91.7% for cities not found in the top 10% by population.
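The prompt-sensitivity finding can be probed with a very simple paraphrase test. The sketch below is an illustrative assumption rather than the locality benchmark's real protocol: the templates, the answer normalisation, and the `call_llm` callable are all placeholders.

```python
# Paraphrase templates are illustrative; the paper's actual prompts differ.
TEMPLATES = [
    "In which country is the city of {city} located?",
    "{city} is a city in which country?",
    "Name the country that contains the city of {city}.",
]

def answers_for(city: str, call_llm) -> set:
    """Collect normalised answers across paraphrased prompts for one city."""
    return {call_llm(t.format(city=city)).strip().lower() for t in TEMPLATES}

def is_prompt_sensitive(city: str, call_llm) -> bool:
    """True if paraphrases of the same question yield differing answers."""
    return len(answers_for(city, call_llm)) > 1
```

A model that is reliable in the sense the paper cares about should give the same answer to all paraphrases of the same factual question, for obscure cities as well as famous ones.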

The results indicate that shaped-up models are more prone to producing incorrect yet sensible-looking answers than their earlier counterparts, which often avoid responding when uncertain. The researchers observed that the avoidance behavior, measured as a proportion of unanswered questions, was 15% higher in older models like GPT-3 compared to the newer GPT-4, where this behavior dropped to nearly zero. This reduction in avoidance, while potentially beneficial for user experience, led to a rise in the frequency of incorrect responses, particularly on easy tasks. Consequently, the apparent reliability of these models decreased, undermining user confidence in their outputs.
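The avoidance-versus-error trade-off can be summarised with per-difficulty proportions. The sketch below uses made-up labels purely to illustrate the shape of such a breakdown; it does not reproduce the paper's data.

```python
from collections import Counter

def reliability_profile(labels_by_difficulty: dict) -> dict:
    """Summarise correct / avoidant / incorrect shares per difficulty bin."""
    profile = {}
    for difficulty, labels in labels_by_difficulty.items():
        counts = Counter(labels)
        total = sum(counts.values())
        profile[difficulty] = {
            label: counts[label] / total
            for label in ("correct", "avoidant", "incorrect")
        }
    return profile

# Made-up labels for illustration: a falling 'avoidant' share together with a
# rising 'incorrect' share is the pattern described for newer, shaped-up models.
print(reliability_profile({
    "easy": ["correct"] * 8 + ["incorrect"] * 2,
    "hard": ["incorrect"] * 6 + ["avoidant"] * 4,
}))
```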

In conclusion, the research underscores the need for a paradigm shift in designing and developing LLMs. The proposed ReliabilityBench framework provides a robust evaluation methodology that moves from aggregate performance scores to a more nuanced assessment of model behavior based on human difficulty levels. This approach allows for the characterization of model reliability, paving the way for future research to focus on ensuring consistent performance across all difficulty levels. The findings highlight that despite advancements, LLMs have not yet achieved a level of reliability that aligns with human expectations, making them prone to unexpected failures that must be addressed through refined training and evaluation strategies.


Check out the Paper and HF Page. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter and join our Telegram Channel and LinkedIn Group. If you like our work, you will love our newsletter.

Don’t Forget to join our 50k+ ML SubReddit

The post ReliabilityBench: Measuring the Unpredictable Performance of Shaped-Up Large Language Models Across Five Key Domains of Human Cognition appeared first on MarkTechPost.
