MarkTechPost@AI · 14 hours ago
JarvisArt: A Human-in-the-Loop Multimodal Agent for Region-Specific and Global Photo Editing

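JarvisArt is an intelligent retouching tool built on a multimodal large language model. It interprets and executes user instructions to perform high-quality image editing in Adobe Lightroom. Drawing on the decision-making process of professional artists and combining visual and language cues, it supports fine-grained local and global edits while preserving the integrity of the image content. Through specialized training and integration with Lightroom, JarvisArt gives users an editing solution that is both flexible and precise, addressing the shortcomings of current AI retouching tools in user control and content preservation.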

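💡 JarvisArt uses a multimodal large language model to understand user intent through visual and language cues, enabling flexible, instruction-driven image editing that fills the gap between manual editing tools and automated solutions.

🖌️ The system integrates Chain-of-Thought reasoning with Lightroom APIs and can operate more than 200 editing tools, supporting fine-grained local edits (such as adjusting skin texture or eye brightness) as well as global edits to meet diverse retouching needs.

🔍 By building the high-quality MMArt dataset and training with supervised fine-tuning followed by GRPO-R reinforcement learning, JarvisArt significantly improves content fidelity, achieving a 60% improvement over GPT-4o in average pixel-level metrics while retaining good instruction-following ability.

🔄 Through the Agent-to-Lightroom (A2L) protocol, the system integrates seamlessly with Adobe Lightroom. Users can dynamically adjust the editing process, which keeps results transparent and controllable and balances automation with user control.

🌐 JarvisArt can process images of arbitrary resolution and supports editing tasks ranging from overall tone adjustment to local detail refinement while honoring user-defined aesthetic goals, giving creative users a practical and powerful image editing tool.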

Bridging the Gap Between Artistic Intent and Technical Execution

Photo retouching is a core aspect of digital photography, enabling users to manipulate image elements such as tone, exposure, and contrast to create visually compelling content. Whether for professional purposes or personal expression, users often seek to enhance images in ways that align with specific aesthetic goals. However, the art of photo retouching requires both technical knowledge and creative sensibility, making it difficult to achieve high-quality results without significant effort or expertise.

The key problem arises from the gap between manual editing tools and automated solutions. While professional software like Adobe Lightroom offers extensive retouching options, mastering these tools can be time-consuming and difficult for casual users. Conversely, AI-driven methods tend to oversimplify the editing process, failing to offer the control or precision required for nuanced edits. These automated solutions also struggle with generalizing across diverse visual scenes or supporting complex user instructions.

Limitations of Current AI-Based Photo Editing Models

Traditional tools have relied on zeroth- and first-order optimization, as well as reinforcement learning, to handle photo retouching tasks. Others utilize diffusion-based methods for image synthesis. These strategies show progress but are generally hampered by their inability to handle fine-grained regional control, maintain high-resolution outputs, or preserve the underlying content of the image. Even more recent large models, such as GPT-4o and Gemini-2-Flash, offer text-driven editing but compromise user control, and their generative processes often overwrite critical content details.

JarvisArt: A Multimodal AI Retoucher Integrating Chain-of-Thought and Lightroom APIs

Researchers from Xiamen University, the Chinese University of Hong Kong, Bytedance, the National University of Singapore, and Tsinghua University introduced JarvisArt—an intelligent retouching agent. This system leverages a multimodal large language model to enable flexible, instruction-guided image editing. JarvisArt is trained to emulate the decision-making process of professional artists, interpreting user intent through both visual and language cues, and executing retouching actions across more than 200 tools in Adobe Lightroom via a custom integration protocol.
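To make the idea of instruction-guided tool execution concrete, here is a minimal sketch of how an agent's plan might be represented as structured tool calls that a user can still adjust. The article does not describe the integration protocol's message format, so every name here (`ToolCall`, `EditPlan`, the `tool` strings, the `region` labels) is a hypothetical illustration and not a real Lightroom or JarvisArt identifier.

```python
import json
from dataclasses import dataclass, asdict, field
from typing import List, Optional


@dataclass
class ToolCall:
    """One retouching step. All field names are hypothetical."""
    tool: str                     # e.g. "exposure"; not a real Lightroom identifier
    value: float                  # requested adjustment amount
    region: Optional[str] = None  # None means a global edit; otherwise a named mask


@dataclass
class EditPlan:
    """An ordered list of tool calls derived from one user instruction."""
    instruction: str
    steps: List[ToolCall] = field(default_factory=list)

    def override(self, index: int, value: float) -> None:
        """Human-in-the-loop adjustment: the user changes one step's value."""
        self.steps[index].value = value

    def to_json(self) -> str:
        """Serialize the plan so it can be inspected or sent to an editor."""
        return json.dumps(asdict(self), indent=2)


plan = EditPlan(
    instruction="Brighten the eyes and warm the overall tone",
    steps=[
        ToolCall(tool="temperature", value=8.0),              # global edit
        ToolCall(tool="exposure", value=0.4, region="eyes"),  # region-specific edit
    ],
)

# The user finds the eye brightening too strong and dials it back.
plan.override(1, 0.25)
print(plan.to_json())
```

Keeping the plan as plain, serializable data is one way to get the transparency the article describes: each step can be read, changed, or removed before anything touches the image.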

The methodology integrates three major components. First, the researchers constructed a high-quality dataset, MMArt, which includes 5,000 standard and 50,000 Chain-of-Thought–annotated samples spanning various editing styles and complexities. Then, JarvisArt undergoes a two-stage training process. The initial phase uses supervised fine-tuning to build reasoning and tool-selection capabilities. It’s followed by Group Relative Policy Optimization for Retouching (GRPO-R), which incorporates customized tool-use rewards—such as retouching accuracy and perceptual quality—to refine the system’s ability to generate professional-quality edits. A specialized Agent-to-Lightroom (A2L) protocol ensures the seamless and transparent execution of tools within Lightroom, enabling users to dynamically adjust edits.
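The group-relative step at the heart of GRPO-style training can be sketched briefly. The article does not give the reward formulas or weights used in GRPO-R, so the equal weighting of retouching accuracy and perceptual quality below is an assumption, as are the function names. The sketch only shows the general idea: several candidate edits are sampled for the same instruction, and each is scored relative to the rest of its group.

```python
from statistics import mean, pstdev


def combined_reward(accuracy: float, quality: float,
                    w_acc: float = 0.5, w_qual: float = 0.5) -> float:
    """Weighted sum of two reward terms. The weights are assumed, not from the paper."""
    return w_acc * accuracy + w_qual * quality


def group_relative_advantages(rewards, eps: float = 1e-8):
    """Normalize each reward by the mean and standard deviation of its group.

    Population standard deviation is used here; implementations vary on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical (accuracy, quality) scores for four candidate edits of one photo.
candidates = [(0.9, 0.8), (0.6, 0.7), (0.3, 0.5), (0.8, 0.4)]
rewards = [combined_reward(a, q) for a, q in candidates]
advantages = group_relative_advantages(rewards)

for r, adv in zip(rewards, advantages):
    print(f"reward={r:.2f}  advantage={adv:+.2f}")
```

Candidates that score above the group average receive a positive advantage and are reinforced, while those below it are discouraged, which removes the need for a separately trained value model.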

Benchmarking JarvisArt’s Capabilities and Real-World Performance

JarvisArt’s ability to interpret complex instructions and apply nuanced edits was evaluated using MMArt-Bench, a benchmark constructed from real user edits. The system delivered a 60% improvement in average pixel-level metrics for content fidelity compared to GPT-4o, maintaining similar instruction-following capabilities. It also demonstrated versatility in handling both global image edits and localized refinements, with the ability to manipulate images of arbitrary resolution. For example, it can adjust skin texture, eye brightness, or hair definition based on region-specific instructions. These results were achieved while preserving aesthetic goals defined by the user, showing a practical blend of control and quality across multiple editing tasks.
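As a rough illustration of what a pixel-level fidelity comparison involves, the sketch below computes two common distances between an edited image and a reference, plus the relative improvement of one method over another. The article does not name the exact metrics behind the 60% figure, so the choice of L1 and L2 distances is an assumption, and the tiny grayscale images are made-up values for demonstration only.

```python
def _flatten(image):
    """Turn a 2D list of pixel values into a flat list."""
    return [pixel for row in image for pixel in row]


def l1_distance(a, b) -> float:
    """Mean absolute difference between two same-sized images (lower is better)."""
    pa, pb = _flatten(a), _flatten(b)
    if len(pa) != len(pb):
        raise ValueError("images must have the same number of pixels")
    return sum(abs(x - y) for x, y in zip(pa, pb)) / len(pa)


def l2_distance(a, b) -> float:
    """Mean squared difference between two same-sized images (lower is better)."""
    pa, pb = _flatten(a), _flatten(b)
    if len(pa) != len(pb):
        raise ValueError("images must have the same number of pixels")
    return sum((x - y) ** 2 for x, y in zip(pa, pb)) / len(pa)


def relative_improvement(baseline_error: float, new_error: float) -> float:
    """Fraction by which new_error is lower than baseline_error."""
    return (baseline_error - new_error) / baseline_error


# Made-up 2x2 grayscale images with values in the 0-255 range.
reference = [[100, 120], [140, 160]]
edit_a = [[102, 118], [141, 160]]   # stays close to the reference
edit_b = [[110, 110], [150, 150]]   # drifts further from the reference

err_a = l1_distance(reference, edit_a)
err_b = l1_distance(reference, edit_b)
print(f"L1 error A: {err_a}, L1 error B: {err_b}")
print(f"A improves on B by {relative_improvement(err_b, err_a):.1%}")
```

Because these are error measures, an "improvement" means a lower value, so a relative improvement is computed against the baseline's error and not its score.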

Conclusion: A Generative Agent That Fuses Creativity With Technical Precision

The research team tackled a significant challenge: enabling intelligent, high-quality photo retouching that does not require professional expertise. The method they introduced bridges the gap between automation and user control by combining data synthesis, reasoning-driven training, and integration with commercial software. JarvisArt offers a practical and powerful solution for creative users who seek both flexibility and quality in their image editing.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

The post JarvisArt: A Human-in-the-Loop Multimodal Agent for Region-Specific and Global Photo Editing appeared first on MarkTechPost.
