Mashable, May 23, 01:59
Anthropic's new Claude Opus 4 can run autonomously for seven hours straight
Anthropic has launched its next-generation AI models, Claude Opus 4 and Claude Sonnet 4, with a focus on improved coding, reasoning, and agent capabilities. Claude Opus 4 is Anthropic's most powerful model, suited to complex tasks, while Sonnet 4 prioritizes speed and efficiency. In early testing by Rakuten, Claude Opus 4 ran independently for seven hours with sustained performance. Anthropic claims both new models surpass OpenAI's o3 and Gemini 2.5 Pro on key agentic-coding benchmarks such as SWE-bench and Terminal-bench. Anthropic also introduced new features, including web search during extended thinking mode and summaries of Claude's reasoning logs, along with improved memory and tool use, and released tools such as Claude Code.

💻 Anthropic has released Claude Opus 4 and Claude Sonnet 4, two new models with enhanced coding, reasoning, and agentic capabilities. Claude Opus 4 is Anthropic's largest model, suited to complex tasks, while Sonnet 4 prioritizes speed and efficiency.

🔍 The new models introduce new features, including web search during extended thinking mode and summaries of Claude's reasoning logs, which help users better understand how the model works while protecting Anthropic's competitive advantage.

🛠️ Anthropic has also improved memory and tool use and released tools such as Claude Code, enhancing the models' functionality and practicality.

🛡️ On safety and alignment, Anthropic says the new models are 65 percent less likely to engage in reward hacking than Claude Sonnet 3.7.

After a whirlwind week of announcements from Google and OpenAI, Anthropic has news of its own to share.

On Thursday, Anthropic announced Claude Opus 4 and Claude Sonnet 4, its next generation of models, with an emphasis on coding, reasoning, and agentic capabilities. According to Rakuten, which got early access to the model, Claude Opus 4 ran "independently for seven hours with sustained performance."

Claude Opus is Anthropic's largest version of the model family with more power for longer, complex tasks, whereas Sonnet is generally speedier and more efficient. Claude Opus 4 is a step up from its previous version, Opus 3, and Sonnet 4 replaces Sonnet 3.7.

Anthropic says Claude Opus 4 and Sonnet 4 outperform rivals like OpenAI's o3 and Gemini 2.5 Pro on key benchmarks for agentic coding tasks, such as SWE-bench and Terminal-bench. It's worth noting, however, that self-reported benchmarks aren't considered the best markers of performance: these evaluations don't always translate to real-world use cases, and AI labs haven't been forthcoming with the kind of transparency that AI researchers and policymakers increasingly call for. "AI benchmarks need to be subjected to the same demands concerning transparency, fairness, and explainability, as algorithmic systems and AI models writ large," said the European Commission's Joint Research Center.

Opus 4 and Sonnet 4 outperform rivals in SWE-bench, but take benchmark performance with a grain of salt. Credit: Anthropic

Alongside the launch of Opus 4 and Sonnet 4, Anthropic also introduced new features. That includes web search while Claude is in extended thinking mode, and summaries of Claude's reasoning log "instead of Claude’s raw thought process." This is described in the blog post as being more helpful to users, but also "protecting [its] competitive advantage," i.e. not revealing the ingredients of its secret sauce. Anthropic also announced improved memory and tool use in parallel with other operations, general availability of its agentic coding tool Claude Code, and additional tools for the Claude API.

In the safety and alignment realm, Anthropic said both models are "65 percent less likely to engage in reward hacking than Claude Sonnet 3.7." Reward hacking is a slightly terrifying phenomenon in which models essentially cheat and lie to earn a reward (that is, credit for successfully performing a task).

One of the best indicators we have for evaluating a model's performance is users' own experience with it, though that's even more subjective than benchmarks. We'll soon find out how Claude Opus 4 and Sonnet 4 stack up against competitors in that regard.
