MarkTechPost@AI, January 6
ScreenSpot-Pro: The First Benchmark Driving Multi-Modal LLMs into High-Resolution Professional GUI-Agent and Computer-Use Environments

 

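ScreenSpot-Pro is a new benchmark for GUI agents in professional, high-resolution environments. It addresses the challenges that existing models face in complex professional applications. The benchmark contains 1,581 tasks covering 23 professional applications in areas such as development, design, and CAD, and it uses high-resolution full-screen visuals and expert annotation to ensure accuracy and realism. It also supports both English and Chinese and records real workflows, allowing a comprehensive evaluation of GUI models. ScreenSpot-Pro aims to advance smarter, more efficient GUI agents and thereby raise productivity and innovation across industries.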

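🖥️ ScreenSpot-Pro targets professional environments with a dataset of high-resolution, full-screen visuals and expert annotations. Its 1,581 tasks cover 23 professional applications, keeping the benchmark realistic and challenging.

🌐 The benchmark supports both English and Chinese, which broadens the scope of evaluation, and it records real workflows, providing a tool for the comprehensive evaluation and development of GUI models.

🎯 Target regions in the dataset occupy only 0.07% of the screen on average, which demands precise localization of very small GUI elements and poses a major challenge for existing models.

⚙️ An analysis of existing GUI models on ScreenSpot-Pro shows clear shortcomings in high-resolution professional environments. Iterative methods improve performance, but substantial room for improvement remains.

🚀 ScreenSpot-Pro sets a new standard for evaluating GUI agents and aims to advance smarter, more efficient agents, significantly raising productivity and innovation across industries.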

GUI agents face three critical challenges in professional environments: (1) the greater complexity of professional applications compared to general-use software, requiring detailed comprehension of intricate layouts; (2) the higher resolution of professional tools, resulting in smaller target sizes and reduced grounding accuracy; and (3) the reliance on additional tools and documents, adding complexity to workflows. These challenges highlight the need for advanced benchmarks and solutions to enhance GUI agent performance in these demanding scenarios.

Current GUI grounding models and benchmarks fall short of what professional environments require. Benchmarks such as ScreenSpot are designed for low-resolution tasks and lack the variety needed to simulate real-world scenarios accurately. Models such as OS-Atlas and UGround are computationally inefficient and fail when targets are small or the interface is icon-rich, both of which are common in professional applications. In addition, the absence of multilingual support limits their applicability in global workflows. These shortcomings call for more comprehensive and realistic benchmarks to advance the field.

A team of researchers from the National University of Singapore, East China Normal University, and Hong Kong Baptist University introduces ScreenSpot-Pro, a new benchmark tailored to professional high-resolution environments. Its dataset contains 1,581 tasks across 23 applications in areas such as development, creative tools, CAD, scientific platforms, and office suites. It uses high-resolution, full-screen visuals and expert annotations to ensure accuracy and realism. Instructions are provided in both English and Chinese, which broadens the range of evaluation. ScreenSpot-Pro is distinctive in that it is built from recorded real workflows, which yields realistic, high-quality annotations and makes it a tool for the thorough assessment and development of GUI grounding models.

The ScreenSpot-Pro dataset captures realistic and challenging scenarios. It is built on high-resolution images in which the target regions cover only 0.07% of the screen on average, so the targets are small, subtle GUI elements. The data was collected by professional users experienced with the relevant applications, who used specialized tools to ensure accurate annotations. The dataset also includes instructions in two languages to test bilingual capability and contains several workflows that capture the subtleties of real professional tasks. These characteristics make it well suited to assessing and improving the accuracy and flexibility of GUI agents.
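To put the 0.07% figure in concrete terms, the short calculation below converts an area fraction into the side length of an equivalent square target. It is only an illustration: the three resolutions are assumptions chosen for the example rather than values taken from the dataset, and `target_side_px` is a hypothetical helper name.

```python
import math


def target_side_px(screen_w: int, screen_h: int, area_fraction: float) -> float:
    """Side length, in pixels, of a square covering `area_fraction` of the screen."""
    return math.sqrt(screen_w * screen_h * area_fraction)


# 0.07% is the average target share reported for ScreenSpot-Pro.
AVG_TARGET_FRACTION = 0.0007

# Illustrative resolutions (assumptions, not taken from the dataset).
RESOLUTIONS = {"1080p": (1920, 1080), "1440p": (2560, 1440), "4K": (3840, 2160)}

for name, (w, h) in RESOLUTIONS.items():
    side = target_side_px(w, h, AVG_TARGET_FRACTION)
    print(f"{name}: about {side:.0f} x {side:.0f} px")
```

On a 4K screen this works out to a square of roughly 76 pixels per side, and real targets such as toolbar icons can be far smaller than the average.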

An analysis of current GUI grounding models on ScreenSpot-Pro reveals considerable deficiencies in their ability to handle high-resolution professional settings. The best single model, OS-Atlas-7B, reached an accuracy of 18.9%. Iterative methods, exemplified by ReGround, improved on this, reaching an accuracy of 40.2% by refining predictions over multiple steps. Small elements such as icons proved especially difficult, and bilingual tasks further exposed the models' limitations. These findings underline the need for improved techniques that strengthen contextual understanding and robustness in complex GUI scenarios.
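The article does not spell out how accuracy is scored or how ReGround refines a prediction, so the following is only a sketch under two stated assumptions: a prediction counts as correct when the predicted click point falls inside the target's bounding box, and refinement means cropping a window around the first prediction and querying the model again on that crop. Every name here (`Grounder`, `iterative_ground`, `make_toy_grounder`) is hypothetical, and the toy grounder is a stand-in for a real model, not a reproduction of any reported result.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (left, top, right, bottom) in pixels

# Hypothetical model interface: given a screen region and an instruction,
# return a click point in full-screen pixel coordinates inside that region.
Grounder = Callable[[Box, str], Point]


def point_in_box(p: Point, box: Box) -> bool:
    """Assumed scoring rule: a prediction is a hit if it lies inside the target box."""
    left, top, right, bottom = box
    return left <= p[0] <= right and top <= p[1] <= bottom


def crop_around(p: Point, screen: Box, crop_w: float, crop_h: float) -> Box:
    """Window of size crop_w x crop_h centered on p, shifted to stay on screen."""
    s_left, s_top, s_right, s_bottom = screen
    left = min(max(p[0] - crop_w / 2, s_left), s_right - crop_w)
    top = min(max(p[1] - crop_h / 2, s_top), s_bottom - crop_h)
    return (left, top, left + crop_w, top + crop_h)


def iterative_ground(grounder: Grounder, screen: Box, instruction: str,
                     crop_w: float, crop_h: float, steps: int = 2) -> Point:
    """Predict on the full screen, then re-predict on a crop around each guess."""
    point = grounder(screen, instruction)
    for _ in range(steps - 1):
        region = crop_around(point, screen, crop_w, crop_h)
        point = grounder(region, instruction)
    return point


def accuracy(predictions: Sequence[Point], targets: Sequence[Box]) -> float:
    """Fraction of predictions that land inside their target boxes."""
    hits = sum(point_in_box(p, b) for p, b in zip(predictions, targets))
    return hits / len(targets)


def make_toy_grounder(target_center: Point, rel_error: float) -> Grounder:
    """Toy stand-in for a model: misses by a fixed fraction of the region size."""
    def grounder(region: Box, instruction: str) -> Point:
        left, top, right, bottom = region
        x = target_center[0] + rel_error * (right - left)
        y = target_center[1] + rel_error * (bottom - top)
        return (min(max(x, left), right), min(max(y, top), bottom))
    return grounder


screen = (0.0, 0.0, 3840.0, 2160.0)
target = (1000.0, 1000.0, 1076.0, 1076.0)  # roughly 0.07% of a 4K screen
toy = make_toy_grounder((1038.0, 1038.0), rel_error=0.02)

single = toy(screen, "click the icon")
refined = iterative_ground(toy, screen, "click the icon", crop_w=960, crop_h=540)
print(accuracy([single], [target]), accuracy([refined], [target]))
```

In this toy setup the error shrinks with the size of the region the model looks at, so the single-pass guess misses the 76-pixel target while the second pass on the smaller crop lands inside it. Whether a real model behaves this way depends on the model; the sketch only shows why a multi-step approach can help when targets are tiny relative to the screen.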

ScreenSpot-Pro sets a new benchmark for evaluating GUI agents in professional high-resolution environments. It addresses the specific challenges of complex workflows and offers a diverse, precisely annotated dataset to guide progress in GUI grounding. The contribution lays the groundwork for smarter, more efficient agents that can carry out professional tasks smoothly, with the aim of raising productivity and innovation across industries.


Check out the Paper and Data. All credit for this research goes to the researchers of this project.

The post ScreenSpot-Pro: The First Benchmark Driving Multi-Modal LLMs into High-Resolution Professional GUI-Agent and Computer-Use Environments appeared first on MarkTechPost.
