LMSYS Chatbot Arena: Live and Community-Driven LLM Evaluation

Chatbot Arena is an open-source project developed by members from LMSYS and UC Berkeley SkyLab that aims to advance LLM development and understanding through live, open, and community-driven evaluation. The platform lets any user rate LLMs via pairwise comparisons under real-world use cases and publishes a leaderboard periodically. First launched in May 2023, Chatbot Arena has become a critical platform for live, community-driven LLM evaluation, attracting millions of participants and collecting over 800,000 votes. This extensive engagement has enabled the evaluation of more than 90 LLMs, including commercial models such as GPT-4 and Gemini/Bard and open-weight models such as Llama and Mistral, significantly enhancing our understanding of their capabilities and limitations.

😄 **Open evaluation platform:** Chatbot Arena is an open-source platform where any user can rate LLMs via pairwise comparisons under real-world use cases, with a leaderboard published periodically. Rather than relying on static benchmarks, the platform takes a live evaluation approach that lets users directly compare the performance of different LLMs.

🤖 **Model evaluation process:** The evaluation flow is: add a model to the Arena for blind testing, accumulate enough votes for its rating to stabilize, and then list it on the public leaderboard. For unreleased models, the platform works with the model provider to offer the model to the community anonymously for preview testing, and once enough votes are collected, shares the results privately with the provider.

📊 **Data sharing and transparency:** Chatbot Arena periodically shares 20% of its vote data with the community, including the prompts, the answers, the identity of the model behind each answer (if the model is or has been on the leaderboard), and the votes. The platform also offers model providers early data access to help them improve their models.

🛡️ **Policy and principles:** Chatbot Arena is committed to openness, transparency, and community participation. The platform's code, evaluation tools, and leaderboard are all open source, so anyone can clone, audit, or run another instance of Chatbot Arena to produce a similar leaderboard.

🤝 **Community participation and contribution:** Chatbot Arena encourages the community to take part in the evaluation process, including giving feedback, participating in model selection, and deciding on any changes to the evaluation pipeline. The platform's success depends on active community participation, and community feedback is essential to its continued development.

**Our Mission**

Chatbot Arena (lmarena.ai) is an open-source project developed by members from LMSYS and UC Berkeley SkyLab. Our mission is to advance LLM development and understanding through live, open, and community-driven evaluations. We maintain the open evaluation platform for any user to rate LLMs via pairwise comparisons under real-world use cases, and we publish the leaderboard periodically.

**Our Progress**

Chatbot Arena was first launched in May 2023 and has emerged as a critical platform for live, community-driven LLM evaluation, attracting millions of participants and collecting over 800,000 votes. This extensive engagement has enabled the evaluation of more than 90 LLMs, including both commercial models (GPT-4, Gemini/Bard) and open-weight models (Llama, Mistral), significantly enhancing our understanding of their capabilities and limitations.

Our periodic leaderboard and blog post updates have become a valuable resource for the community, offering critical insights into model performance that guide the ongoing development of LLMs. Our commitment to open science is further demonstrated through the sharing of user preference data and one million user prompts, supporting research and model improvement.

We also collaborate with open-source and commercial model providers to bring their latest models to the community for preview testing. We believe this initiative helps advance the field and encourages user engagement, collecting the crucial votes needed to evaluate all the models in the Arena. Moreover, it provides an opportunity for the community to test models and provide anonymized feedback before they are officially released.

The platform's infrastructure (FastChat) and evaluation tools, available on GitHub, emphasize our dedication to transparency and community engagement in the evaluation process. This approach not only enhances the reliability of our findings but also fosters a collaborative environment for advancing LLMs.

In our ongoing efforts, we feel obligated to establish policies that guarantee evaluation transparency and trustworthiness. Moreover, we actively involve the community in shaping any modifications to the evaluation process, reinforcing our commitment to openness and collaborative progress.

**Our Policy** (last updated: May 31, 2024)

**Open source:** The platform (FastChat), including the UI frontend, model serving backend, and model evaluation and ranking pipelines, is entirely open source and available on GitHub. This means that anyone can clone, audit, or run another instance of Chatbot Arena to produce a similar leaderboard.

**Transparent:** The evaluation process, including rating computation, identification of anomalous users, and LLM selection, is made publicly available so others can reproduce our analysis and fully understand how the data is collected. Furthermore, we will involve the community in deciding any changes to the evaluation process.

**Listing models on the leaderboard:** The public leaderboard will only include models that are accessible to other third parties. Specifically, it will only include models that are (1) open-weight, and/or (2) publicly available through an API (e.g., gpt-4-0613, gemini-pro-api), or (3) available as a service (e.g., Bard, GPT-4 + browsing). In the remainder of this document we refer to these as publicly released models. Once a publicly released model is listed on the leaderboard, it will remain accessible at lmarena.ai for at least two weeks for the community to evaluate it.

**Evaluating publicly released models:** Evaluating such a model consists of the following steps (a sketch of the rating computation follows this list):

1. Add the model to the Arena for blind testing and let the community know it was added.
2. Accumulate enough votes until the model's rating stabilizes.
3. Once the rating stabilizes, list the model on the public leaderboard. There is one exception: the model provider can reach out before the listing and ask for a one-day heads-up. In this case, we will privately share the rating with the provider and wait one additional day before listing the model on the public leaderboard.
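The notion of a rating "stabilizing" can be made concrete. The sketch below is illustrative only, not the production pipeline (which is open-sourced in FastChat): it shows an online Elo update over a log of pairwise battles, with the `(model_a, model_b, winner)` record format assumed for the example.

```python
# Illustrative sketch only: one common way to turn pairwise votes into
# ratings is an online Elo update. The battle record format below
# (model_a, model_b, winner) is an assumption for this example.
from collections import defaultdict

def compute_elo(battles, k=4, scale=400, base=10, init=1000):
    """battles: iterable of (model_a, model_b, winner) tuples,
    where winner is 'model_a', 'model_b', or 'tie'."""
    rating = defaultdict(lambda: init)
    for model_a, model_b, winner in battles:
        ra, rb = rating[model_a], rating[model_b]
        # Expected score of model_a under the Elo model.
        ea = 1 / (1 + base ** ((rb - ra) / scale))
        sa = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5}[winner]
        rating[model_a] += k * (sa - ea)          # winner gains
        rating[model_b] += k * ((1 - sa) - (1 - ea))  # symmetric update
    return dict(rating)

battles = [
    ("gpt-4", "llama-2-70b", "model_a"),
    ("mistral-7b", "gpt-4", "tie"),
    ("llama-2-70b", "mistral-7b", "model_b"),
]
print(compute_elo(battles))
```

An order-independent batch fit, such as a Bradley-Terry model over all votes at once, is generally more stable than this sequential update; "enough votes" can then be read as the rating's confidence interval becoming narrow enough to place the model on the leaderboard.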
**Evaluating unreleased models:** We collaborate with open-source and commercial model providers to bring their unreleased models to the community for preview testing. Model providers can test their unreleased models anonymously, meaning the models' names will be anonymized. A model is considered unreleased if its weights are neither open nor available via a public API or service. Evaluating an unreleased model consists of the following steps:

1. Add the model to the Arena with an anonymous label; its identity will not be shown to users.
2. Keep it in the Arena until we accumulate enough votes for its rating to stabilize, or until the model provider withdraws it.
3. Once we have accumulated enough votes, share the result privately with the model provider. This includes the rating as well as samples of up to 20% of the votes (see "Sharing data with the model providers" below for details).
4. Remove the model from the Arena.

If an unreleased model is publicly released while we are testing it, we immediately switch to the evaluation process for publicly released models.

To ensure the leaderboard accurately reflects model rankings, we rely on live comparisons between models. Hence, we may deprecate models from the leaderboard one month after they are no longer available online or publicly accessible.

**Sharing data with the community:** We will periodically share data with the community. In particular, we will periodically share 20% of the arena vote data we have collected, including the prompts, the answers, the identity of the model providing each answer (if the model is or has been on the leaderboard), and the votes. For models we have collected votes for but that have never been on the leaderboard, we will still release the data but label the model as "anonymous" (this labeling rule is sketched after this section).

**Sharing data with the model providers:** Upon request, we will offer early data access to model providers who wish to improve their models. However, this data will be a subset of the data that we periodically share with the community. In particular, with a model provider we will share the data that includes their model's answers. For battles, we may not reveal the opponent model and may use the "anonymous" label. This data will later be shared with the community during the periodic releases. If the model is not on the leaderboard at the time of sharing, the model's answers will also be labeled as "anonymous". Before sharing the data, we will remove user PII (e.g., using Azure PII detection for text).
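As an illustration of the labeling rule above, the sketch below prepares one battle record for release. The record layout and field names are assumptions for the example, not the actual release schema, and PII removal is assumed to happen upstream.

```python
# Hypothetical sketch of the anonymization rule described above: model
# identities are kept only for models that are or have been on the
# leaderboard; everything else is labeled "anonymous". The record layout
# is assumed for illustration, not the actual release schema.
LEADERBOARD_MODELS = {"gpt-4", "llama-2-70b", "mistral-7b"}  # example set

def prepare_release_record(battle, leaderboard_models=LEADERBOARD_MODELS):
    """battle: dict with 'prompt', 'model_a', 'model_b',
    'answer_a', 'answer_b', and 'winner' keys."""
    def label(model):
        return model if model in leaderboard_models else "anonymous"
    return {
        "prompt": battle["prompt"],  # PII removal assumed done upstream
        "model_a": label(battle["model_a"]),
        "model_b": label(battle["model_b"]),
        "answer_a": battle["answer_a"],
        "answer_b": battle["answer_b"],
        "winner": battle["winner"],
    }

record = prepare_release_record({
    "prompt": "Explain TCP slow start.",
    "model_a": "gpt-4",
    "model_b": "secret-preview-model",
    "answer_a": "...",
    "answer_b": "...",
    "winner": "model_a",
})
print(record)  # model_b is labeled "anonymous"
```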
**FAQ**

**Why another eval?** Most LLM benchmarks are static, which makes them prone to contamination, as LLMs are trained on most of the data available on the Internet. Chatbot Arena aims to alleviate this problem by providing live evaluation with a continuous stream of new prompts from real people. We also believe that the open nature of the platform will attract users who accurately reflect the broader set of LLM users and real use cases.

**Which models to evaluate? Why not all?** We will continuously add new models and retire old ones. It is not feasible to add every possible model, due to the cost and scalability limits of our evaluation process; it might take too long to accumulate enough votes to accurately rate each model. Today, the decision to add new models is rather ad hoc: we add models based on the community's perceived interest. We intend to formalize this process in the near future.

**Why should the community trust our eval?** We strive for transparency: the platform and all the tools we use are open source. We invite the community to use our platform and tools to statistically reproduce our results (a reproduction sketch follows the FAQ).

**Why do you only share 20% of the data, not all?** Arena data is used for LLM benchmarking purposes. We periodically share data to mitigate the potential risk of overfitting or benchmark leakage. We will actively review this policy based on the community's feedback.

**Who funds this effort? Any conflicts of interest?** Chatbot Arena is funded only by gifts, whether in money, cloud credits, or API credits. The gifts have no strings attached.

**Any feedback?** Feel free to send us an email or leave feedback on GitHub!
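Continuing the sketches above, one minimal way to "statistically reproduce" a ranking from the released battle data is to bootstrap the battle log and check how wide each model's rating interval is. This reuses `compute_elo()` and the assumed `(model_a, model_b, winner)` battle format from the earlier sketch.

```python
# Illustrative only: bootstrap confidence intervals for Arena-style
# ratings. Reuses compute_elo() and the assumed battle format from the
# earlier sketch; not the actual FastChat analysis code.
import random
from collections import defaultdict

def bootstrap_ratings(battles, rounds=100, seed=0):
    """Resample the battle log with replacement and recompute ratings,
    yielding a rough 95% interval per model."""
    rng = random.Random(seed)
    samples = defaultdict(list)
    for _ in range(rounds):
        resampled = rng.choices(battles, k=len(battles))
        for model, r in compute_elo(resampled).items():
            samples[model].append(r)
    return {
        model: (sorted(rs)[int(0.025 * len(rs))],
                sorted(rs)[int(0.975 * len(rs))])
        for model, rs in samples.items()
    }
```

If a model's interval overlaps heavily with its neighbors', more votes are needed before its leaderboard position is meaningful.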
