MarkTechPost@AI · April 5, 04:35
Augment Code Released Augment SWE-bench Verified Agent: An Open-Source Agent Combining Claude Sonnet 3.7 and OpenAI O1 to Excel in Complex Software Engineering Tasks

Augment Code has introduced its Augment SWE-bench Verified Agent, marking notable progress in the software engineering domain. The agent combines Anthropic's Claude Sonnet 3.7 with OpenAI's O1 model and achieved a 65.4% success rate on the SWE-bench benchmark, a breakthrough in handling complex coding tasks. Despite SWE-bench's limitations, Augment Code remains focused on performance optimization and plans to reduce cost and latency through reinforcement learning and fine-tuning of proprietary models, with the ultimate goal of delivering efficient, fast AI coding assistance for real professional environments.

🚀 Augment Code has released the Augment SWE-bench Verified Agent, which ranks at the top of the SWE-bench leaderboard and demonstrates strong capability on complex software engineering tasks.

💡 The agent is driven primarily by Anthropic's Claude Sonnet 3.7 model, with OpenAI's O1 model used for ensembling. This strategy of leveraging existing models, rather than training proprietary ones, quickly established a strong baseline.

🤔 In testing, Augment Code found that some techniques expected to help, such as Claude Sonnet's "thinking mode" and a separate regression-fixing agent, did not deliver meaningful performance gains, revealing the complexity of optimizing agent performance.

🚧 Augment Code acknowledges SWE-bench's limitations: it focuses mainly on bug fixes rather than new feature development, and its test tasks are largely in Python. Going forward, the company will focus on reinforcement learning and fine-tuning proprietary models to cut cost, reduce latency, and improve the user experience.

AI agents are increasingly vital in helping engineers handle complex coding tasks efficiently. A significant challenge, however, has been reliably assessing whether these agents can handle real-world coding scenarios rather than simplified benchmark tests.

Augment Code has announced the launch of their Augment SWE-bench Verified Agent, a development in agentic AI tailored specifically for software engineering. This release places them at the top of open-source agent performance on the SWE-bench leaderboard. By combining the strengths of Anthropic’s Claude Sonnet 3.7 and OpenAI’s O1 model, Augment Code’s approach has delivered impressive results, showcasing a compelling blend of innovation and pragmatic system architecture.

The SWE-bench benchmark is a rigorous test that measures an AI agent’s effectiveness in handling practical software engineering tasks drawn directly from GitHub issues in prominent open-source repositories. Unlike traditional coding benchmarks, which generally focus on isolated, algorithmic-style problems, SWE-bench offers a more realistic testbed that requires agents to navigate existing codebases, identify relevant tests autonomously, create scripts, and iterate against comprehensive regression test suites.
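
To make that setup concrete, below is a minimal, hypothetical sketch of the kind of per-task evaluation loop SWE-bench implies: a candidate patch is applied at a fixed commit and judged by whether the issue's tests start passing while the regression suite keeps passing. The `SWEBenchTask` fields and the pytest invocation are illustrative assumptions, not the actual SWE-bench harness API.

```python
import subprocess
from dataclasses import dataclass

@dataclass
class SWEBenchTask:
    """One SWE-bench-style instance: a GitHub issue plus the repo state it was filed against."""
    repo_path: str            # local checkout of the open-source project (assumed layout)
    base_commit: str          # commit the issue was reported against
    problem_statement: str    # the GitHub issue text handed to the agent
    fail_to_pass: list[str]   # tests that must go from failing to passing
    pass_to_pass: list[str]   # regression tests that must keep passing

def evaluate(task: SWEBenchTask, candidate_patch: str) -> bool:
    """Apply the agent's patch at the base commit, then run issue tests plus regressions."""
    subprocess.run(["git", "-C", task.repo_path, "checkout", task.base_commit], check=True)
    applied = subprocess.run(["git", "-C", task.repo_path, "apply", "-"],
                             input=candidate_patch, text=True)
    if applied.returncode != 0:
        return False  # a patch that does not apply counts as a failure
    tests = subprocess.run(["python", "-m", "pytest", *task.fail_to_pass, *task.pass_to_pass],
                           cwd=task.repo_path)
    return tests.returncode == 0
```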

Augment Code’s initial submission has achieved a 65.4% success rate, a notable achievement in this demanding environment. The company focused its first effort on leveraging existing state-of-the-art models, specifically Anthropic’s Claude Sonnet 3.7 as the primary driver for task execution and OpenAI’s O1 model for ensembling. This approach strategically bypassed training proprietary models at this initial phase, establishing a robust baseline.
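
The article does not describe Augment Code's agent internals, but the "primary driver" role usually corresponds to a tool-use loop in which the model reads code, edits files, and runs tests until it can submit a patch. The sketch below is a generic, hypothetical version of such a loop; `llm_step`, the tool names, and the `submit_patch` action are assumptions, not Augment's API.

```python
from typing import Callable

def run_agent(problem_statement: str,
              llm_step: Callable[[list[str]], dict],
              tools: dict[str, Callable[..., str]],
              max_steps: int = 40) -> str:
    """Generic tool-use loop standing in for the Claude Sonnet 3.7 'driver'.

    llm_step maps the conversation history to the next action, e.g.
    {"name": "run_tests", "args": {"path": "tests/"}}; all names here are hypothetical.
    """
    history = [f"Issue:\n{problem_statement}"]
    for _ in range(max_steps):
        action = llm_step(history)                 # model picks a tool call or finishes
        if action["name"] == "submit_patch":
            return action["args"]["patch"]         # final unified diff for the harness
        observation = tools[action["name"]](**action["args"])
        history.append(f"call: {action}\nresult: {observation}")
    return ""  # step budget exhausted without producing a patch
```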

One interesting aspect of Augment's methodology was its exploration of different agent behaviors and strategies. For example, the team found that certain techniques expected to be beneficial, such as Claude Sonnet's 'thinking mode' and a separate regression-fixing agent, did not yield meaningful performance improvements. This highlights the nuanced and sometimes counterintuitive dynamics of agent performance optimization. Basic ensembling techniques such as majority voting were also explored but ultimately abandoned due to cost and efficiency considerations. However, simple ensembling with OpenAI's O1 did provide incremental accuracy improvements, underscoring the value of ensembling even in constrained scenarios.
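
The two ensembling ideas mentioned above can be sketched side by side: majority voting over identical candidate patches (explored, then dropped for cost and efficiency) and a simple second-model selector in the spirit of the O1 ensemble that did help. In the hedged sketch below, `generate` and `judge` are hypothetical stand-ins for a Sonnet-driven agent run and an O1-based selector, since the actual mechanism is not disclosed.

```python
from collections import Counter
from typing import Callable

def select_patch(problem: str,
                 generate: Callable[[str], str],
                 judge: Callable[[str, list[str]], int],
                 n_samples: int = 5) -> str:
    """Ensemble several candidate patches from the primary model.

    `generate` stands in for a Claude Sonnet 3.7 agent run and `judge` for an
    O1-based selector; both interfaces are assumptions, not Augment Code's code.
    """
    candidates = [generate(problem) for _ in range(n_samples)]

    # Majority voting: accept a patch only if most samples agree exactly.
    # The article notes this route was explored but abandoned on cost grounds.
    top_patch, votes = Counter(candidates).most_common(1)[0]
    if votes > n_samples // 2:
        return top_patch

    # Otherwise ask the second model to pick the most promising candidate,
    # mirroring the "simple ensembling with O1" that gave incremental gains.
    return candidates[judge(problem, candidates)]
```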

While the success of Augment Code's initial SWE-bench submission is commendable, the company is transparent about the benchmark's limitations. Notably, SWE-bench problems are heavily skewed toward bug fixing rather than feature creation, the provided issue descriptions are more structured and LLM-friendly than typical real-world developer prompts, and the benchmark uses Python exclusively. Real-world complexities, such as navigating massive production codebases and working in less descriptive programming languages, pose challenges that SWE-bench does not capture.

Augment Code has openly acknowledged these limitations, emphasizing its continued commitment to optimizing agent performance beyond benchmark metrics. The company stresses that while improvements to prompts and ensembling can boost quantitative results, qualitative customer feedback and real-world usability remain its priorities. Augment Code's ultimate goal is to develop cost-effective, fast agents capable of providing unparalleled coding assistance in practical professional environments.

As part of its future roadmap, Augment is actively exploring the fine-tuning of proprietary models using RL techniques and proprietary data. Such advancements promise to enhance model accuracy and significantly reduce latency and operational costs, facilitating more accessible and scalable AI-driven coding assistance.

Some of the key takeaways from the Augment SWE-bench Verified Agent include:

- A 65.4% success rate on SWE-bench Verified, placing the agent at the top of open-source agent performance on the leaderboard.
- A pragmatic architecture that uses Anthropic's Claude Sonnet 3.7 as the primary driver and OpenAI's O1 model for ensembling, rather than training proprietary models at this stage.
- Techniques expected to help, such as Claude Sonnet's 'thinking mode' and a separate regression-fixing agent, did not deliver meaningful gains, and majority voting was dropped for cost and efficiency reasons.
- SWE-bench skews toward Python bug fixes with well-structured problem statements, so benchmark results understate the difficulty of real-world codebases and prompts.
- The roadmap centers on reinforcement learning and fine-tuning of proprietary models to reduce cost and latency.

Check out the GitHub Page. All credit for this research goes to the researchers of this project.


