TechCrunch News, February 20
This Week in AI: Maybe we should ignore AI benchmarks for now

 

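This week's AI news centers on Grok 3, the latest model from Elon Musk's xAI, which performs strongly on benchmarks for mathematics, programming, and more. The article argues, however, that current AI benchmarks have limits: they lean toward esoteric knowledge and correlate weakly with real-world use. Experts are calling for better batteries of tests and independent testing authorities so that model performance can be assessed more objectively. In other news, OpenAI is trying to "uncensor" ChatGPT, former OpenAI CTO Mira Murati has founded a new company, Meta will host its LlamaCon developer conference, and Europe is working on foundation models for transparent AI. OpenAI also released a new AI coding benchmark, SWE-Lancer; the Chinese company Stepfun released Step-Audio, a multilingual speech AI model; and Nous Research released DeepHermes-3 Preview, an AI model that unifies reasoning and language capabilities.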

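🚀 Grok 3 released: Elon Musk's xAI has released its latest flagship AI model, Grok 3, which beats other leading models, including OpenAI's, on benchmarks for mathematics, programming, and more, signaling new progress in AI technology.

📊 Limits of AI benchmarks: The article criticizes the current state of AI benchmarks, noting that they favor esoteric knowledge and correlate weakly with real-world use, and stresses the need for a more complete, independent testing system.

🌍 AI developments on many fronts: OpenAI is trying to "uncensor" ChatGPT, Meta will host its LlamaCon developer conference, and Europe is building foundation models for transparent AI, showing how varied AI development has become worldwide.

👨‍💻 A new AI coding benchmark: OpenAI released the SWE-Lancer benchmark to evaluate the coding ability of AI systems; the results show that AI still has considerable room to improve in software engineering.

🗣️ Multilingual speech AI model: The Chinese company Stepfun released Step-Audio, which supports several languages and lets users adjust the emotion and dialect of the generated speech, showing progress in AI speech generation.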

Welcome to TechCrunch’s regular AI newsletter! We’re going on hiatus for a bit, but you can find all our AI coverage, including my columns, our daily analysis, and breaking news stories, at TechCrunch. If you want those stories and much more in your inbox every day, sign up for our daily newsletters here.

This week, billionaire Elon Musk’s AI startup, xAI, released its latest flagship AI model, Grok 3, which powers the company’s Grok chatbot apps. Trained on around 200,000 GPUs, the model beats a number of other leading models, including from OpenAI, on benchmarks for mathematics, programming, and more.

But what do these benchmarks really tell us?

Here at TC, we often reluctantly report benchmark figures because they’re one of the few (relatively) standardized ways the AI industry measures model improvements. Popular AI benchmarks tend to test for esoteric knowledge, and give aggregate scores that correlate poorly to proficiency on the tasks that most people care about.
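To make the point about aggregate scores concrete, here is a minimal sketch. The two models, the three task categories, and every number in it are hypothetical and invented purely for illustration; none of them come from Grok 3 or any real benchmark. It shows how a model can lead on an unweighted average while trailing on the one everyday task a given user might care about.

```python
from statistics import mean

# Hypothetical per-category accuracies (0..1) for two made-up models.
# These numbers are illustrative assumptions, not real benchmark results.
scores = {
    "model_a": {"olympiad_math": 0.90, "trivia": 0.85, "email_drafting": 0.40},
    "model_b": {"olympiad_math": 0.60, "trivia": 0.65, "email_drafting": 0.85},
}


def aggregate(per_task):
    """Collapse per-category scores into one headline number (unweighted mean)."""
    return mean(per_task.values())


def best_by(key):
    """Return the name of the model that maximizes the given scoring function."""
    return max(scores, key=key)


# Leader on the single aggregate score.
top_overall = best_by(lambda m: aggregate(scores[m]))
# Leader on the one everyday task a typical user might care about.
top_everyday = best_by(lambda m: scores[m]["email_drafting"])

print(f"Best aggregate score: {top_overall}")
print(f"Best at email drafting: {top_everyday}")
```

With these made-up numbers the two rankings disagree, which is the gap between a headline score and real-world usefulness described above.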

As Wharton professor Ethan Mollick pointed out in a series of posts on X after Grok 3’s unveiling Monday, there’s an “urgent need for better batteries of tests and independent testing authorities.” AI companies self-report benchmark results more often than not, as Mollick alluded to, making those results even tougher to accept at face value.

“Public benchmarks are both ‘meh’ and saturated, leaving a lot of AI testing to be like food reviews, based on taste,” Mollick wrote. “If AI is critical to work, we need more.”

There’s no shortage of independent tests and organizations proposing new benchmarks for AI, but their relative merit is far from a settled matter within the industry. Some AI commentators and experts propose aligning benchmarks with economic impact to ensure their usefulness, while others argue that adoption and utility are the ultimate benchmarks.

This debate may rage until the end of time. Perhaps we should instead, as X user Roon prescribes, simply pay less attention to new models and benchmarks barring major AI technical breakthroughs. For our collective sanity, that may not be the worst idea, even if it does induce some level of AI FOMO.

As mentioned above, This Week in AI is going on hiatus. Thanks for sticking with us, readers, through this roller coaster of a journey. Until next time.


OpenAI tries to “uncensor” ChatGPT: Max wrote about how OpenAI is changing its AI development approach to explicitly embrace “intellectual freedom,” no matter how challenging or controversial a topic may be.

Mira’s new startup: Former OpenAI CTO Mira Murati’s new startup, Thinking Machines Lab, intends to build tools to “make AI work for [people’s] unique needs and goals.”

Grok 3 cometh: Elon Musk’s AI startup, xAI, has released its latest flagship AI model, Grok 3, and unveiled new capabilities for the Grok apps for iOS and the web.

A very Llama conference: Meta will host its first developer conference dedicated to generative AI this spring. Called LlamaCon after Meta’s Llama family of generative AI models, the conference is scheduled for April 29.

AI and Europe’s digital sovereignty: Paul profiled OpenEuroLLM, a collaboration between some 20 organizations to build “a series of foundation models for transparent AI in Europe” that preserves the “linguistic and cultural diversity” of all EU languages.


OpenAI researchers have created a new AI benchmark, SWE-Lancer, that aims to evaluate the coding prowess of powerful AI systems. The benchmark consists of over 1,400 freelance software engineering tasks that range from bug fixes and feature deployments to “manager-level” technical implementation proposals.

According to OpenAI, the best-performing AI model, Anthropic’s Claude 3.5 Sonnet, scores 40.3% on the full SWE-Lancer benchmark — suggesting that AI has quite a ways to go. It’s worth noting that the researchers didn’t benchmark newer models like OpenAI’s o3-mini or Chinese AI company DeepSeek’s R1.

A Chinese AI company named Stepfun has released an “open” AI model, Step-Audio, that can understand and generate speech in several languages. Step-Audio supports Chinese, English, and Japanese and lets users adjust the emotion and even dialect of the synthetic audio it creates, including singing.

Stepfun is one of several well-funded Chinese AI startups releasing models under a permissive license. Founded in 2023, Stepfun reportedly recently closed a funding round worth several hundred million dollars from a host of investors that include Chinese state-owned private equity firms.


Nous Research, an AI research group, has released what it claims is one of the first AI models that unifies reasoning and “intuitive language model capabilities.”

The model, DeepHermes-3 Preview, can toggle on and off long “chains of thought” for improved accuracy at the cost of some computational heft. In “reasoning” mode, DeepHermes-3 Preview, similar to other reasoning AI models, “thinks” longer for harder problems and shows its thought process to arrive at the answer.

Anthropic reportedly plans to release an architecturally similar model soon, and OpenAI has said such a model is on its near-term roadmap.
