Chatbot Arena Conversation Dataset Release

 

Chatbot Arena has become a widely cited LLM evaluation platform, collecting a large number of votes and conversations over the past three months. This post releases an updated leaderboard with more models, along with two datasets, and describes the evaluation methodology, the characteristics of the datasets, and their potential uses.

Chatbot Arena emphasizes large-scale, community-based, and interactive human evaluation; within three months it collected about 53K votes from 19K unique IP addresses covering 22 models.

The post releases an updated leaderboard that adds models such as Vicuna-33B-v1.3 and MPT-30B-chat, introduces Llama 2 and Claude 2 to the arena, and uses MT-bench for automatic evaluation of new models.

Dataset 1 contains 33K cleaned conversations with several distinctive properties and potential uses, such as studying the characteristics of real-world user prompts and improving model training; it does contain offensive content, so it should be used with care.

Dataset 2 comes from a controlled human evaluation with MT-bench: 3.3K expert-level human preferences over responses from 6 models to 80 questions, with over 80% agreement between human and GPT-4 judges.

Since its launch three months ago, Chatbot Arena has become a widely cited LLM evaluation platform that emphasizes large-scale, community-based, and interactive human evaluation. In that short time span, we collected around 53K votes from 19K unique IP addresses for 22 models.

In this blog post, we are releasing an updated leaderboard with more models and two datasets for human-preference-related study:

- 33K crowd-sourced conversations with human preference annotations from Chatbot Arena. (link)
- 3K expert-level human annotations from MT-bench. (link)

As estimated by this Llama 2 analysis blog post, Meta spent about $8 million on human preference data for Llama 2, and that dataset is not currently available. We therefore think our datasets are highly valuable, given the expensive nature of obtaining human preferences and the limited availability of open, high-quality datasets.

Updated Leaderboard

We are hosting the latest leaderboard at lmsys/chatbot-arena-leaderboard. Below is a screenshot. Since the last update, we added two 30B models: Vicuna-33B-v1.3 and MPT-30B-chat, both of which perform very well in the arena. Two days ago, we also introduced Llama 2 and Claude 2 to the arena; the leaderboard will include them once we have collected enough votes. Please help us by casting your votes at our voting website.

Besides the slowly updated Arena Elo ratings, we also use MT-bench, a fast GPT-4-based automatic evaluation pipeline, to evaluate all new models, including Llama 2 (chat), Claude 2, WizardLM-13B-v1.1, XGen-7B-8K-Inst, and ChatGLM2-6B. You are welcome to check out the interactive lmsys/chatbot-arena-leaderboard and sort the models according to different metrics. Some early evaluation results of Llama 2 can be found in our tweets.

Figure 1. Chatbot Arena Leaderboard (see more)

Dataset 1: 33K Chatbot Arena Conversation Data

Link: lmsys/chatbot_arena_conversations

This dataset contains 33K cleaned conversations with pairwise human preferences collected on Chatbot Arena from April to June 2023. Each sample includes the two model names, the full conversation text, the user vote, the anonymized user ID, the detected language tag, the OpenAI moderation API tag, an additional toxic tag, and the timestamp.

To ensure the safe release of the data, we have attempted to remove all conversations that contain personally identifiable information (PII). In addition, we have included the OpenAI moderation API output to flag inappropriate conversations. However, we have chosen not to remove all of these conversations, so that researchers can study safety-related questions associated with LLM usage in the wild as well as the OpenAI moderation process. As an example, we included additional toxic tags generated by our own toxic tagger, which was trained by fine-tuning T5 and RoBERTa on manually labeled data.

Uniqueness and Potential Usage

Compared to existing human preference datasets such as Anthropic/hh-rlhf and OpenAssistant/oasst1, this dataset:

- Contains the outputs of 20 LLMs, including stronger LLMs such as GPT-4 and Claude-v1, along with many failure cases of these state-of-the-art models.
- Contains unrestricted conversations from over 13K users in the wild.

We believe this data will help the AI research community answer important questions around topics like:

- Characteristics of real-world user prompts
- Training better models with RLHF
- Improving and evaluating LLM evaluation methods
- Building model selection and request dispatching algorithms
- Studying the design and application of inappropriate content filtering mechanisms

Disclaimers and Terms

- This dataset includes offensive conversations. It is not intended for training dialogue agents without applying appropriate filtering measures. We are not responsible for any outputs of models trained on this dataset.
- Statements or opinions made in this dataset do not reflect the views of researchers or institutions involved in the data collection effort.
- Users of this data are responsible for ensuring its appropriate use, which includes abiding by any applicable laws and regulations.
- Users of this data should adhere to the terms of use for a specific model when using its direct outputs.
- Please contact us if you find any issues with the dataset.

Visualization and Elo Rating Calculation

This Colab notebook provides some visualizations and shows how to compute Elo ratings with the dataset. We pasted some figures here.

Figure 2. Fraction of Model A Wins for All Non-tied A vs. B Battles.
Figure 3. Battle Counts for Each Model Pair.
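For readers who prefer not to open the notebook, here is a minimal, self-contained sketch of how Elo ratings can be computed from the released conversations. It assumes the Hugging Face `datasets` library and column names (`model_a`, `model_b`, `winner`) inferred from the field description above, plus a simple online Elo update with K=4; check the dataset card and the Colab notebook for the exact schema and the reference implementation. Note that the dataset may require accepting its terms on the Hub before downloading.

```python
# Minimal sketch: compute Elo ratings from Chatbot Arena pairwise votes.
# Column names (`model_a`, `model_b`, `winner`) and K=4 are assumptions
# inferred from the post; verify against the dataset card and notebook.
from collections import defaultdict
from datasets import load_dataset

def compute_elo(battles, k=4, base_rating=1000):
    """Online Elo update over a sequence of (model_a, model_b, winner) records."""
    ratings = defaultdict(lambda: float(base_rating))
    for model_a, model_b, winner in battles:
        ra, rb = ratings[model_a], ratings[model_b]
        ea = 1 / (1 + 10 ** ((rb - ra) / 400))  # expected score of model_a
        if winner == "model_a":
            sa = 1.0
        elif winner == "model_b":
            sa = 0.0
        else:  # treat any tie label as half a win for each side
            sa = 0.5
        ratings[model_a] += k * (sa - ea)
        ratings[model_b] += k * ((1 - sa) - (1 - ea))
    return dict(ratings)

if __name__ == "__main__":
    # Gated dataset: may require `huggingface-cli login` and accepting the terms.
    data = load_dataset("lmsys/chatbot_arena_conversations", split="train")
    battles = [(row["model_a"], row["model_b"], row["winner"]) for row in data]
    for model, rating in sorted(compute_elo(battles).items(), key=lambda x: -x[1]):
        print(f"{model:>25s}  {rating:7.1f}")
```

Because this sketch processes battles in timestamp order with a small K-factor, the resulting ratings are a rough approximation; the notebook linked above is the authoritative computation.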
Dataset 2: 3K MT-bench Human Annotations

Link: lmsys/mt_bench_human_judgments

In addition to the crowd-sourced evaluation with Chatbot Arena, we also conducted a controlled human evaluation with MT-bench. This dataset contains 3.3K expert-level pairwise human preferences for model responses generated by 6 models in response to 80 MT-bench questions. The 6 models are GPT-4, GPT-3.5, Claude-v1, Vicuna-13B, Alpaca-13B, and LLaMA-13B. The annotators are mostly graduate students with expertise in the topic areas of each of the questions. The details of data collection can be found in our paper.

Agreement Calculation

This Colab notebook shows how to compute the agreement between the human and GPT-4 judges with the dataset. Our results show that the human and GPT-4 judges achieve over 80% agreement, the same level of agreement as between humans. A schema-agnostic sketch of this agreement computation appears at the end of this post.

Acknowledgment

We thank the whole community for contributing to the arena dataset. We also plan to gradually release more conversations in the future after a thorough review.

Citation

@misc{zheng2023judging,
  title={Judging LLM-as-a-judge with MT-Bench and Chatbot Arena},
  author={Lianmin Zheng and Wei-Lin Chiang and Ying Sheng and Siyuan Zhuang and Zhanghao Wu and Yonghao Zhuang and Zi Lin and Zhuohan Li and Dacheng Li and Eric P. Xing and Hao Zhang and Joseph E. Gonzalez and Ion Stoica},
  year={2023},
  eprint={2306.05685},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}
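As a companion to the Elo sketch above, here is a minimal, schema-agnostic way to compute the judge agreement reported in the Agreement Calculation section. The record format and verdict labels below are hypothetical placeholders, not the actual columns of lmsys/mt_bench_human_judgments; the Colab notebook and the dataset card document the real schema.

```python
# Minimal sketch: fraction of (question, model pair) comparisons on which two
# judges give the same verdict. The keying and labels here are hypothetical
# placeholders -- map the real columns of lmsys/mt_bench_human_judgments
# (see the dataset card) onto this format before use.
from typing import Dict, Tuple

# A verdict is keyed by (question_id, model_a, model_b) and is one of
# "model_a", "model_b", or "tie".
Verdicts = Dict[Tuple[int, str, str], str]

def agreement_rate(judge_x: Verdicts, judge_y: Verdicts) -> float:
    """Agreement restricted to the comparisons both judges voted on."""
    shared = judge_x.keys() & judge_y.keys()
    if not shared:
        return float("nan")
    matches = sum(judge_x[k] == judge_y[k] for k in shared)
    return matches / len(shared)

if __name__ == "__main__":
    # Toy example with two fabricated comparisons.
    human = {(81, "gpt-4", "vicuna-13b"): "model_a",
             (82, "gpt-3.5", "alpaca-13b"): "tie"}
    gpt4_judge = {(81, "gpt-4", "vicuna-13b"): "model_a",
                  (82, "gpt-3.5", "alpaca-13b"): "model_a"}
    print(f"agreement: {agreement_rate(human, gpt4_judge):.2f}")  # 0.50
```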
