Personal evaluation of LLMs, through chess

The author evaluates the reasoning ability of the major large language models (LLMs) by playing chess against them. A strong player (2100 Elo on Lichess.org), he played games against Claude 3.7 Sonnet, o4-mini, and other models, recording the outcome of each game, whether the model hallucinated, how many moves it lasted, and how well it played. The results show that Claude 3.7 Sonnet and o3 were the only models to complete a game without hallucinating, though their play was mediocre. o4-mini hallucinated after 15 moves but played superbly up to that point, far outclassing the other models. Overall, models released in the past two months outperformed older ones, though Grok 3 performed poorly. The author also notes differences in how the models tracked the game state.

✅**Claude 3.7 Sonnet and o3:** The only models able to complete a game without hallucinating, but their play was unremarkable and both ultimately lost.

🚀**o4-mini:** Hallucinated after 15 moves, but played exceptionally well until then; its "average centipawn loss" was close to zero, far lower than the other models'. The author was still at a disadvantage when the game ended.

🗓️**Model progress:** Aside from GPT 4o, models released in the past two months generally outperformed older ones at chess. Grok 3 fared worst, lasting only 4 moves.

🤖**State-tracking differences:** Claude 3.7 Sonnet printed the full game history after every move, Claude 3.5 Haiku tried to track the board state with JavaScript code, and o3 attempted to use a Python library but failed.

Published on April 24, 2025 7:01 AM GMT

Lots of people seem to resonate with the idea that AI benchmarks are getting more and more meaningless – either because they're being run in a misleading way, or because they aren't tracking important things. I think the right response is for people to do their own personal evaluations of models with their own criteria. As an example, I found Sarah Constantin's review of AI research tools super helpful, and I think more people should evaluate LLMs themselves in a way that is transparent and clear to others.

So this post shares the result of my personal evaluation of LLM progress, using chess as the focal exercise. I chose chess ability mainly because it seemed like the thing that should be most obviously moved by reasoning ability increases – so it would be a good benchmark for whether reasoning ability has actually improved. Plus, it was a good excuse to spend all day playing chess. I much preferred the example in this post of cybersecurity usage at a real company, but this is just what I can do.

I played a chess game against each of the major LLMs and recorded the outcome of each game: whether the LLM hallucinated during the game, how many moves it lasted, and how well it played. Here were my results:

Full results in this spreadsheet. My takeaways:

- Claude 3.7 Sonnet and o3 were the only models able to complete a game without hallucination. (DeepSeek R1 technically cleared that bar, but only by getting checkmated early.) They didn't play particularly well, though, and they still lost. It checks out that these are the two models people seem to think are the best.
- o4-mini played the best. Although it hallucinated after 15 moves, it played monstrously well until then. o4-mini's "average centipawn loss" – a measure of how much the play deviated from optimality, where perfect play is 0 (see the sketch after this list) – was pretty close to zero, and far lower than the other models'. I am a pretty good player (2100 Elo on lichess.org, above the 90th percentile) and I was at a disadvantage when the game ended.
- All the models that survived 15 moves were released in the past two months. Other than GPT 4o, which has clearly been getting updated since release, the newer models seem to dominate the older ones. (Hall of shame moment for Grok 3, which lasted only 4 moves.) That might seem obvious in retrospect, but I came into this test with the anecdotal experience I shared here, where o4-mini-high and o3 played catastrophically in some casual games. So I expected to find that more advanced models had gotten worse at chess, which certainly would have been a more dramatic finding to lead with. But under this test those same models did well. I think it's because in those earlier games I specified that I wanted to play "blindfolded", which might have unintentionally kneecapped them.
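For readers unfamiliar with the metric: the post doesn't say what tooling produced the centipawn numbers, but a minimal sketch of how average centipawn loss can be computed – assuming python-chess and a local Stockfish binary, neither of which the author names – might look like this:

```python
# Hedged sketch: compute one side's average centipawn loss over a game.
# ENGINE_PATH, the search depth, and the PGN input are all assumptions;
# the post does not describe the author's actual tooling.
import chess
import chess.engine
import chess.pgn

ENGINE_PATH = "/usr/local/bin/stockfish"  # assumed engine location

def average_centipawn_loss(pgn_path: str, color: chess.Color) -> float:
    with open(pgn_path) as f:
        game = chess.pgn.read_game(f)
    losses = []
    with chess.engine.SimpleEngine.popen_uci(ENGINE_PATH) as engine:
        board = game.board()
        for move in game.mainline_moves():
            if board.turn == color:
                # Evaluation of the position before the move, from
                # `color`'s point of view (mates mapped to a large score).
                info = engine.analyse(board, chess.engine.Limit(depth=15))
                before = info["score"].pov(color).score(mate_score=10000)
                board.push(move)
                # Evaluation after the move actually played.
                info = engine.analyse(board, chess.engine.Limit(depth=15))
                after = info["score"].pov(color).score(mate_score=10000)
                # Loss is how far the evaluation dropped; clamp at zero
                # so strong moves don't count as negative loss.
                losses.append(max(0, before - after))
            else:
                board.push(move)
    return sum(losses) / len(losses) if losses else 0.0
```

Perfect play scores 0 by construction, which is why a near-zero average for o4-mini stands out.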

A few miscellaneous notes:

- Models varied in how they kept track of the game state. Claude 3.7 Sonnet was the only model to print out the whole game history on each move, which seemed like a straightforward thing to do (although o4-mini and o4-mini-high did the same in their CoT). Claude 3.5 Haiku tried to keep track of the board state with JavaScript code, for some reason. o3 tried to use a Python library to keep track of the board state, but failed and had to spend a long time on each move reconstructing the board. I suspect it might have done even better if it had been able to use the library it was calling (see the sketch after these notes).
- o3 also demonstrated an impressive feat: when I made a typo on my checkmate move, it called me out for making an illegal move. I didn't do it intentionally, or else I would have done it with some of the others as well. I also typo'd my checkmate of DeepSeek R1, which did not catch it.
- I left GPT 4.5 off because I got message-limited before the game finished, but I will update the post with GPT 4.5 next week when I get my messages back.
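The post doesn't name the Python library o3 was calling; python-chess is the standard choice, so the following is an assumed sketch of the kind of state tracking the models were attempting – including the legality check that would catch a typo'd checkmate:

```python
# Hedged sketch: track a game the way the better models did, using
# python-chess (an assumption; the post never names o3's library).
import chess

board = chess.Board()
history = []

def play_san(san: str) -> None:
    """Validate a move in standard algebraic notation and apply it."""
    try:
        move = board.parse_san(san)  # raises ValueError on illegal/garbled moves
    except ValueError:
        print(f"Illegal move: {san}")  # the check that caught the typo'd mate
        return
    board.push(move)
    history.append(san)
    # Reprint the full game history each move, as Claude 3.7 Sonnet did.
    print(" ".join(history))

play_san("e4")   # 1. e4
play_san("e5")   # 1... e5
play_san("Qh9")  # no such square: rejected instead of hallucinated
```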


