MarkTechPost@AI 2024年07月09日
Innodata’s Comprehensive Benchmarking of Llama2, Mistral, Gemma, and GPT for Factuality, Toxicity, Bias, and Hallucination Propensity
index_new5.html
../../../zaker_core/zaker_tpl_static/wap/tpl_guoji1.html

 

Innodata 最近的一项研究对 Llama2、Mistral、Gemma 和 GPT 等大型语言模型 (LLM) 在真实性、毒性、偏见和幻觉倾向方面的性能进行了基准测试。该研究引入了 14 个新数据集,旨在评估这些模型的安全性能,重点关注它们生成真实、无偏见和适当内容的能力。由于 OpenAI GPT 模型在所有安全指标方面都表现出色,因此被用作比较基准。

😁 **真实性评估:** Llama2 在真实性测试中表现出色,在需要将答案与可验证事实相结合的任务中表现出色。用于此评估的数据集包括摘要任务和事实一致性检查的混合,例如生成的摘要的正确性和抽象摘要的事实一致性。

😠 **毒性评估:** 毒性评估涉及测试模型避免生成冒犯性或不当内容的能力。这通过使用提示来引发潜在的毒性响应来衡量,例如改写、翻译和错误校正任务。Llama2 表明了处理毒性内容的强大能力,在被指示时会适当地审查不当语言。然而,它需要在多轮对话中保持这种安全性,在多轮对话中,用户交互会持续多次交换。

😔 **偏见评估:** 偏见评估侧重于检测生成带有宗教、政治、性别或种族偏见的内容。这是通过使用跨不同领域(包括金融、医疗保健和一般主题)的各种提示来测试的。结果表明,包括 GPT 在内的所有模型在识别和避免偏见内容方面都遇到了困难。Gemma 表明了一些希望,因为它经常拒绝回答有偏见的提示,但总体而言,这项任务对所有测试模型来说都是一项挑战。

🤯 **幻觉倾向:** LLM 中的幻觉是模型生成不正确或无意义信息的情况。评估涉及使用诸如通用 AI 助手基准测试之类的 dataset,其中包含 LLM 无法回答的难题。Mistral 在这方面表现出色,表明了避免生成幻觉内容的强大能力。这在涉及推理和多轮提示的任务中尤其明显,Mistral 在这些任务中保持了很高的安全标准。

🌟 **主要发现:** 研究强调了对 LLM 进行全面安全评估的重要性,尤其是在这些模型越来越多地部署在企业环境中时。Innodata 引入的新数据集和基准测试工具为正在进行和未来的研究提供了宝贵的资源,旨在提高 LLM 在各种应用中的安全性和可靠性。

In a recent study by Innodata, various large language models (LLMs) such as Llama2, Mistral, Gemma, and GPT were benchmarked for their performance in factuality, toxicity, bias, and propensity for hallucinations. The research introduced fourteen novel datasets designed to evaluate the safety of these models, focusing on their ability to produce factual, unbiased, and appropriate content. The OpenAI GPT model was used as a point of comparison due to its superior performance across all safety metrics.

The evaluation methodology revolved around assessing the models’ performance in four key areas:

    Factuality: This refers to the LLMs’ ability to provide accurate information. Llama2 showed strong performance in factuality tests, excelling in tasks that required grounding answers in verifiable facts. The datasets used for this evaluation included a mix of summarization tasks and factual consistency checks, such as the Correctness of Generated Summaries and the Factual Consistency of Abstractive Summaries.Toxicity: Toxicity assessment involved testing the models’ ability to avoid producing offensive or inappropriate content. This was measured using prompts to elicit potentially toxic responses, such as paraphrasing, translation, and error correction tasks. Llama2 demonstrated a robust ability to handle toxic content, properly censoring inappropriate language when instructed. However, it needed to work on maintaining this safety in multi-turn conversations, where user interactions extend over several exchanges.Bias: The bias evaluation focused on detecting the generation of content with religious, political, gender, or racial prejudice. This was tested using a variety of prompts across different domains, including finance, healthcare, and general topics. The results indicated that all models, including GPT, had difficulty identifying and avoiding biased content. Gemma showed some promise by often refusing to answer biased prompts, but overall, the task proved challenging for all models tested.Propensity for Hallucinations: Hallucinations in LLMs are instances where the models generate factually incorrect or nonsensical information. The evaluation involved using datasets like the General AI Assistants Benchmark, which includes difficult questions that LLMs without access to external resources should be unable to answer. Mistral performed notably well in this area, showing a strong ability to avoid generating hallucinatory content. This was particularly evident in tasks involving reasoning and multi-turn prompts, where Mistral maintained high safety standards.

The study highlighted several key findings:

The research emphasized the importance of comprehensive safety evaluations for LLMs, especially as these models are increasingly deployed in enterprise environments. The novel datasets and benchmarking tools introduced by Innodata offer a valuable resource for ongoing and future research, aiming to improve the safety and reliability of LLMs in diverse applications.

In conclusion, while Llama2, Mistral, and Gemma show promise in different areas, significant room remains for improvement. OpenAI’s GPT models set a high benchmark for safety and performance, highlighting the potential benefits of continued advancements and refinements in LLM technology. As the field progresses, comprehensive benchmarking and rigorous safety evaluations will be essential to ensure that LLMs can be safely and effectively integrated into various enterprise and consumer applications.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project. Also, don’t forget to follow us on Twitter. Join our Telegram Channel and LinkedIn Group.

If you like our work, you will love our newsletter.. Don’t Forget to join our 46k+ ML SubReddit

If You are interested in a promotional partnership (content/ad/newsletter), please fill out this form.

The post Innodata’s Comprehensive Benchmarking of Llama2, Mistral, Gemma, and GPT for Factuality, Toxicity, Bias, and Hallucination Propensity appeared first on MarkTechPost.

Fish AI Reader

Fish AI Reader

AI辅助创作,多种专业模板,深度分析,高质量内容生成。从观点提取到深度思考,FishAI为您提供全方位的创作支持。新版本引入自定义参数,让您的创作更加个性化和精准。

FishAI

FishAI

鱼阅,AI 时代的下一个智能信息助手,助你摆脱信息焦虑

联系邮箱 441953276@qq.com

相关标签

大型语言模型 LLM 安全性 Innodata Llama2 Mistral Gemma GPT 真实性 毒性 偏见 幻觉
相关文章