TechCrunch News · January 21
OpenAI’s agent tool may be nearing release

OpenAI may be about to release Operator, an AI tool that can control a user's computer and take actions on their behalf. The tool is reported to handle tasks autonomously, with a release expected in January. A software engineer has uncovered supporting evidence, and OpenAI's website already contains references to it. The article also covers the tool's performance, comparisons with other systems, and its safety testing.

🎯 OpenAI may release Operator, a tool that can control a computer

📈 Operator's performance varies across tasks and benchmarks

🔒 Operator's safety testing and how it has been assessed

💪 OpenAI's competition with other companies in the AI agent space

OpenAI may be close to releasing an AI tool that can take control of your PC and perform actions on your behalf.

Tibor Blaho, a software engineer with a reputation for accurately leaking upcoming AI products, claims to have uncovered evidence of OpenAI’s long-rumored Operator tool. Publications including Bloomberg have previously reported on Operator, which is said to be an “agentic” system capable of autonomously handling tasks like writing code and booking travel.

According to The Information, OpenAI is targeting January as Operator’s release month. Code uncovered by Blaho this weekend adds credence to that reporting.

OpenAI’s ChatGPT client for macOS has gained options, hidden for now, to define shortcuts to “Toggle Operator” and “Force Quit Operator,” per Blaho. And OpenAI has added references to Operator on its website, Blaho said — albeit references that aren’t yet publicly visible.

According to Blaho, OpenAI’s site also contains not-yet-public tables comparing the performance of Operator to other computer-using AI systems. The tables may well be placeholders. But if the numbers are accurate, they suggest that Operator isn’t 100% reliable, depending on the task.

On OSWorld, a benchmark that tries to mimic a real computer environment, “OpenAI Computer Use Agent (CUA)” — possibly the AI model powering Operator — scores 38.1%, ahead of Anthropic’s computer-controlling model but well short of the 72.4% humans score. OpenAI CUA surpasses human performance on WebVoyager, which evaluates an AI’s ability to navigate and interact with websites. But the model falls short of human-level scores on another web-based benchmark, WebArena, according to the leaked benchmarks.

Operator also struggles with tasks a human could perform easily, if the leak is to be believed. In a test that asked Operator to sign up with a cloud provider and launch a virtual machine, it succeeded only 60% of the time. Tasked with creating a Bitcoin wallet, it succeeded only 10% of the time.

OpenAI’s imminent entry into the AI agent space comes as rivals including the aforementioned Anthropic, Google, and others make plays for the nascent segment. AI agents may be risky and speculative, but tech giants are already touting them as the next big thing in AI. According to analytics firm Markets and Markets, the market for AI agents could be worth $47.1 billion by 2030.

Agents today are rather primitive. But some experts have raised concerns about their safety, should the technology rapidly improve.

One of the leaked charts shows Operator performing well on selected safety evaluations, including tests that try to get the system to perform “illicit activities” and search for “sensitive personal data.” Reportedly, safety testing is among the reasons for Operator’s long development cycle. In a recent X post, OpenAI co-founder Wojciech Zaremba criticized Anthropic for releasing an agent he claims lacks safety mitigations.

“I can only imagine the negative reactions if OpenAI made a similar release,” Zaremba wrote.

It’s worth noting that OpenAI has been criticized by AI researchers, including ex-staff, for allegedly de-emphasizing safety work in favor of quickly productizing its technology.
