Anthropic’s latest AI update can use a computer on its own

The Verge - Artificial Intelligences, October 23, 2024

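Anthropic has released its latest Claude 3.5 Sonnet AI model, whose public beta adds a feature called “computer use” that lets the AI control a computer by looking at the screen, moving the cursor, clicking buttons, and typing text. The feature is now available on the API, so developers can direct Claude to operate a computer the way a person does. Anthropic cautions that the feature is still experimental and can be “cumbersome and error-prone.” In addition, Claude 3.5 Sonnet shows marked improvements on several benchmarks, particularly in agentic coding and tool use tasks. On coding, its SWE-bench Verified score rose from 33.4% to 49.0%, higher than all publicly available models, including reasoning models such as OpenAI o1-preview and specialized systems designed for agentic coding. Its performance on TAU-bench, an agentic tool use task, also improved, from 62.6% to 69.2% in the retail domain and from 36.0% to 46.0% in the more challenging airline domain.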

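🤖 **Computer use**: The public beta of Anthropic’s Claude 3.5 Sonnet adds a “computer use” feature that lets the AI control a computer by looking at the screen, moving the cursor, clicking buttons, and typing text. The feature resembles Microsoft’s Copilot Vision and OpenAI’s ChatGPT desktop app, but goes a step further by letting the AI actually click and carry out tasks.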

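💻 **Improved performance**: Claude 3.5 Sonnet shows marked improvements on several benchmarks, particularly in agentic coding and tool use tasks. On coding, its SWE-bench Verified score rose from 33.4% to 49.0%, higher than all publicly available models, including reasoning models such as OpenAI o1-preview and specialized systems designed for agentic coding. Its performance on TAU-bench, an agentic tool use task, also improved, from 62.6% to 69.2% in the retail domain and from 36.0% to 46.0% in the more challenging airline domain.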

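🚧 **Experimental feature**: Anthropic cautions that the feature is still experimental and can be “cumbersome and error-prone.” The company says it is releasing computer use early to gather feedback from developers and expects the capability to improve rapidly over time.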

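🚫 **Restrictions**: Anthropic appears to have limited Claude’s activity on social media, with “measures to monitor when Claude is asked to engage in election-related activity, as well as systems for nudging Claude away from activities like generating and posting content on social media, registering web domains, or interacting with government websites.”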

Anthropic’s latest Claude 3.5 Sonnet AI model has a new feature in public beta that can control a computer by looking at a screen, moving a cursor, clicking buttons, and typing text. The new feature, called “computer use,” is available today on the API, allowing developers to direct Claude to work on a computer like a human does.

Microsoft’s Copilot Vision feature and OpenAI’s desktop app for ChatGPT have shown what their AI tools can do based on seeing your computer’s screen, and Google has similar capabilities in its Gemini app on Android phones. But they haven’t gone to the next step of widely releasing tools ready to click around and perform tasks for you like this. Rabbit promised similar capabilities for its R1, which it has yet to deliver.

Anthropic does caution that computer use is still experimental and can be “cumbersome and error-prone.” The company says, “We’re releasing computer use early for feedback from developers, and expect the capability to improve rapidly over time.”

According to the developers:

There are many actions that people routinely do with computers (dragging, zooming, and so on) that Claude can’t yet attempt. The “flipbook” nature of Claude’s view of the screen—taking screenshots and piecing them together, rather than observing a more granular video stream—means that it can miss short-lived actions or notifications.
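The “flipbook” limitation can be illustrated with a small simulation. This is only a sketch under stated assumptions: the two-second capture interval, the event timings, and the helper name `visible_in_screenshots` are hypothetical, since Anthropic does not say here how often Claude takes screenshots.

```python
def visible_in_screenshots(event_start, event_end, screenshot_times):
    """Return True if at least one screenshot is taken while the event is on screen."""
    return any(event_start <= t <= event_end for t in screenshot_times)


# Hypothetical capture schedule: one screenshot every 2 seconds (an assumption).
screenshot_times = [0.0, 2.0, 4.0, 6.0, 8.0, 10.0]

# A notification shown from 2.5 s to 3.5 s falls entirely between two frames.
toast_seen = visible_in_screenshots(2.5, 3.5, screenshot_times)

# A dialog open from 2.5 s to 6.5 s is still on screen at 4.0 s and 6.0 s.
dialog_seen = visible_in_screenshots(2.5, 6.5, screenshot_times)

print(toast_seen, dialog_seen)  # prints: False True
```

Anything that appears and disappears between two captures never shows up in the model’s view, which is why short-lived notifications can be missed while longer-lived windows are not.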

Also, this version of Claude has apparently been told to steer clear of social media, with “measures to monitor when Claude is asked to engage in election-related activity, as well as systems for nudging Claude away from activities like generating and posting content on social media, registering web domains, or interacting with government websites.”

Meanwhile, Anthropic says its new Claude 3.5 Sonnet model has improvements in many benchmarks and is offered to customers at the same price and speed as its predecessor:

The updated Claude 3.5 Sonnet shows wide-ranging improvements on industry benchmarks, with particularly strong gains in agentic coding and tool use tasks. On coding, it improves performance on SWE-bench Verified from 33.4% to 49.0%, scoring higher than all publicly available models—including reasoning models like OpenAI o1-preview and specialized systems designed for agentic coding. It also improves performance on TAU-bench, an agentic tool use task, from 62.6% to 69.2% in the retail domain, and from 36.0% to 46.0% in the more challenging airline domain.
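To put those figures side by side, the short sketch below computes each gain in percentage points and as a relative increase over the earlier score. The scores are the ones quoted above; the helper name `gains` and the one-decimal rounding are choices made for this illustration only.

```python
# (before, after) scores in percent, as quoted from Anthropic's announcement.
scores = {
    "SWE-bench Verified": (33.4, 49.0),
    "TAU-bench retail": (62.6, 69.2),
    "TAU-bench airline": (36.0, 46.0),
}


def gains(before, after):
    """Return (percentage-point gain, relative gain in percent), rounded to 1 decimal."""
    points = round(after - before, 1)
    relative = round((after - before) / before * 100, 1)
    return points, relative


for name, (before, after) in scores.items():
    points, relative = gains(before, after)
    print(f"{name}: +{points} points ({relative}% relative)")
```

This gives gains of 15.6 points on SWE-bench Verified (about 46.7% relative), 6.6 points on TAU-bench retail (about 10.5%), and 10.0 points on TAU-bench airline (about 27.8%).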
