MarkTechPost@AI December 1, 2024
ChatRex: A Multimodal Large Language Model (MLLM) with a Decoupled Perception Design

ChatRex is an advanced multimodal large language model (MLLM) developed by researchers at IDEA. It adopts a decoupled architecture that strictly separates perception from understanding. Through a retrieval-based perception framework, ChatRex treats object detection as retrieving bounding-box indices rather than directly predicting coordinates, which effectively reduces quantization error and improves detection accuracy. It also introduces a Universal Proposal Network (UPN) and a dual-vision encoder, and is trained on the Rexverse-2M dataset to balance perception and understanding. ChatRex achieves strong results on object detection, referring object detection, and image captioning, setting a new benchmark for multimodal systems.

🤔**Decoupled architecture:** ChatRex strictly separates perception (object detection) from understanding, avoiding the accuracy loss caused when the two tasks conflict in conventional approaches; for example, tokenizing bounding-box coordinates leads to cascading errors, ambiguous object prediction order, and quantization error in complex images. A minimal sketch of this decoupled flow follows this list.

📦**Retrieval-based perception framework:** ChatRex treats object detection as retrieving bounding-box indices rather than directly predicting coordinates. This novel formulation avoids quantization error and improves detection accuracy, while a Universal Proposal Network (UPN) generates comprehensive fine- and coarse-grained box proposals, resolving ambiguity in object representation.

👀**Dual-vision encoder and the Rexverse-2M dataset:** ChatRex integrates a dual-vision encoder that draws on both high- and low-resolution visual features to sharpen object tokens. It is trained on the newly built Rexverse-2M dataset, which contains more than two million images with multi-granularity annotations, ensuring balanced training across perception and understanding.

🏆**Strong performance:** ChatRex performs excellently on object detection, referring object detection, and image captioning, outperforming existing models with higher precision, recall, and mean Average Precision (mAP) scores on datasets such as COCO and LVIS.

💡**Outlook:** By combining perception with understanding, ChatRex opens new possibilities for deployment in dynamic, complex environments such as autonomous driving and robot navigation, showing the great potential of multimodal systems that fuse the two capabilities.
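
To make the decoupled design concrete, here is a minimal runnable Python sketch of the two-stage flow: a proposal stage produces class-agnostic boxes, and an understanding stage answers queries only by selecting indices into those proposals. The function names and toy logic are illustrative stand-ins for the UPN and the LLM's retrieval head, not ChatRex's actual code.

```python
# Minimal sketch of a decoupled perception/understanding pipeline.
# `propose_boxes` and `retrieve_indices` are hypothetical stand-ins for
# the UPN and the LLM's retrieval head; the logic is a toy illustration.
from dataclasses import dataclass

@dataclass
class Box:
    x1: float
    y1: float
    x2: float
    y2: float

def propose_boxes(image):
    """Perception stage: class-agnostic proposals at coarse (whole-object)
    and fine (part-level) granularity. A real UPN is a DETR-style model."""
    return [Box(10, 10, 120, 200), Box(40, 60, 90, 140)]  # toy fixed output

def retrieve_indices(boxes, query):
    """Understanding stage: answer a query by *selecting* proposal indices,
    never by emitting raw coordinates."""
    areas = [(b.x2 - b.x1) * (b.y2 - b.y1) for b in boxes]
    # Toy rule: whole objects are large, parts are small.
    i = areas.index(max(areas)) if query == "person" else areas.index(min(areas))
    return [i]

def detect(image, query):
    boxes = propose_boxes(image)             # perception runs first, in isolation
    picked = retrieve_indices(boxes, query)  # understanding only picks indices
    return [boxes[i] for i in picked]

print(detect(image=None, query="person"))  # -> [Box(x1=10, y1=10, x2=120, y2=200)]
```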

Multimodal Large Language Models (MLLMs) have shown impressive capabilities in visual understanding. However, they face significant challenges in fine-grained perception tasks such as object detection, which is critical for applications like autonomous driving and robotic navigation. Current models fail to achieve precise detection: even a state-of-the-art system like Qwen2-VL reaches only 43.9% recall on the COCO dataset. The gap stems from an inherent conflict between perception and understanding tasks, and from a shortage of datasets that can fairly balance these two requirements.

Traditional efforts to incorporate perception into MLLMs usually tokenize bounding-box coordinates so that detection fits the auto-regressive form of the language model. Though this guarantees compatibility with understanding tasks, it suffers from cascading errors, ambiguous object prediction orders, and quantization inaccuracies in complex images. Alternative formulations, such as those in Groma and Shikra, reframe object grounding but remain less robust across diverse real-world tasks. These limitations are compounded by insufficient training datasets, which fail to address the twin requirements of perception and understanding.
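
The quantization problem is easy to see with back-of-the-envelope arithmetic. The sketch below round-trips a coordinate through a discrete location-token vocabulary; the 1,000-bin vocabulary and 4,000-pixel image width are assumptions chosen for the demo, not values from the paper.

```python
# Back-of-the-envelope demo of coordinate-tokenization quantization error.
# The 1,000-token location vocabulary and 4,000 px image width are assumed
# for illustration, not taken from the paper.
NUM_BINS = 1000
IMAGE_WIDTH = 4000  # pixels

def to_token(x: float) -> int:
    """Snap a pixel coordinate to the nearest discrete location token."""
    return round(x / IMAGE_WIDTH * (NUM_BINS - 1))

def from_token(t: int) -> float:
    """Decode a location token back to a pixel coordinate."""
    return t / (NUM_BINS - 1) * IMAGE_WIDTH

x = 1987.0                       # true edge of a small object
x_hat = from_token(to_token(x))  # coordinate after the token round trip
print(abs(x - x_hat))            # ~1 px lost on ONE edge; the error repeats
                                 # across four edges and every box, and each
                                 # autoregressive token conditions on the
                                 # previous one, so a wrong token cascades.
```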

To overcome this challenge, researchers from the International Digital Economy Academy (IDEA) developed ChatRex, an advanced MLLM designed with a decoupled architecture that strictly separates perception from understanding. ChatRex is built on a retrieval-based framework in which object detection is framed as retrieving bounding-box indices rather than directly predicting coordinates. This novel formulation removes quantization error and increases detection accuracy. A Universal Proposal Network (UPN) was developed to generate comprehensive fine-grained and coarse-grained bounding-box proposals, addressing ambiguities in object representation. The architecture further integrates a dual-vision encoder that combines high-resolution and low-resolution visual features to enhance the precision of object tokenization. Training was strengthened by the newly developed Rexverse-2M dataset, an enormous collection of images with multi-granular annotations, ensuring balanced training across perception and understanding tasks.
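
Here is a minimal sketch of the retrieval formulation, assuming each proposal has already been encoded as an object token: detection reduces to scoring proposals against a query embedding and returning the best indices, so the reported boxes are the proposal boxes verbatim and no coordinate is ever regressed or quantized. All names, dimensions, and values below are synthetic, not ChatRex's real interfaces.

```python
# Sketch of detection-as-retrieval: score object tokens against a query
# embedding and return proposal indices. All tensors here are synthetic.
import numpy as np

rng = np.random.default_rng(0)
N, d = 5, 64                                       # illustrative sizes
object_tokens = rng.normal(size=(N, d))            # one token per proposal
query_embed = rng.normal(size=(d,))                # e.g., "the red mug"
proposal_boxes = rng.uniform(0, 640, size=(N, 4))  # UPN boxes, pixel space

# Retrieval head: similarity scores over proposals, then pick top indices.
scores = object_tokens @ query_embed / np.sqrt(d)
picked = np.argsort(scores)[::-1][:2]              # e.g., keep the top 2

# The answer references indices; the boxes come back verbatim from the
# proposal set, so no coordinate is regressed or quantized.
for i in picked:
    print(f"<obj{i}> -> box {proposal_boxes[i].round(1)}")
```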

The Universal Proposal Network is based on DETR. It generates robust bounding-box proposals at multiple levels of granularity, mitigating inconsistencies in object labeling across datasets, and by training with both fine-grained and coarse-grained prompts it can detect objects accurately across varied scenarios. The dual-vision encoder keeps the visual encoding compact and efficient, complementing low-resolution global features with high-resolution features for precise object tokenization. The training dataset, Rexverse-2M, contains more than two million annotated images with region descriptions, bounding boxes, and captions, balancing ChatRex's perception with its understanding and contextual analysis.
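
One plausible reading of the dual-vision design is sketched below: a low-resolution stream supplies cheap global context, while the high-resolution stream is pooled only inside each proposal box to form a precise object token. The pooling routine, feature sizes, and fusion by concatenation are illustrative assumptions, not the paper's exact architecture.

```python
# Schematic of a dual-resolution encoder: cheap global context from a
# low-res stream, precise per-box detail pooled from a high-res stream.
# Feature sizes, the pooling routine, and concatenation fusion are assumed.
import numpy as np

rng = np.random.default_rng(1)
low_res = rng.normal(size=(256, 24, 24))    # global context, cheap to attend
high_res = rng.normal(size=(256, 96, 96))   # fine detail, sampled per box

def roi_pool(fmap, box, img_size=384, out=4):
    """Crude ROI pooling: crop the box from the feature map and average
    each cell of a fixed out x out grid (a stand-in for ROI-Align)."""
    c, h, w = fmap.shape
    x1, y1, x2, y2 = (np.array(box) / img_size * [w, h, w, h]).astype(int)
    crop = fmap[:, y1:max(y2, y1 + 1), x1:max(x2, x1 + 1)]
    ys = np.array_split(np.arange(crop.shape[1]), out)
    xs = np.array_split(np.arange(crop.shape[2]), out)
    return np.stack([[crop[:, y][:, :, x].mean(axis=(1, 2)) for x in xs]
                     for y in ys])

box = (96, 120, 240, 300)                         # a proposal, pixel coords
obj_hi = roi_pool(high_res, box).reshape(-1)      # precise local detail
ctx_lo = low_res.mean(axis=(1, 2))                # coarse global summary
object_token = np.concatenate([obj_hi, ctx_lo])   # fused token for the LLM
print(object_token.shape)                         # -> (4352,)
```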

ChatRex delivers top-tier performance on both perception and understanding benchmarks, surpassing existing models. In object detection it achieves higher precision, recall, and mean Average Precision (mAP) than competitors on datasets including COCO and LVIS. In referring object detection it accurately associates descriptive expressions with their corresponding objects, demonstrating its ability to handle complex interactions between textual and visual inputs. The system further excels at generating grounded image captions, answering region-specific queries, and object-aware conversational scenarios. This success stems from its decoupled architecture, retrieval-based detection strategy, and the broad training enabled by the Rexverse-2M dataset.
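
For reference, the recall and mAP figures cited in these comparisons rest on IoU matching between predicted and ground-truth boxes. The snippet below shows the standard recall-at-IoU computation with made-up boxes; COCO-style mAP additionally averages precision over IoU thresholds from 0.5 to 0.95.

```python
# Standard recall-at-IoU computation behind the detection metrics above.
# Boxes and the 0.5 threshold are illustrative values.
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def recall_at(preds, gts, thr=0.5):
    """Fraction of ground-truth boxes matched by a prediction at IoU >= thr."""
    matched = sum(any(iou(p, g) >= thr for p in preds) for g in gts)
    return matched / len(gts)

gts = [(10, 10, 50, 50), (60, 60, 100, 100)]
preds = [(12, 11, 49, 52)]          # only the first object was found
print(recall_at(preds, gts))        # -> 0.5, half the objects recalled
```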

ChatRex is the first multimodal AI model that resolves the long-standing conflict between perception and understanding tasks. Its innovative design, combined with a robust training dataset, sets a new standard for MLLMs, allowing for precise object detection and context-rich understanding. These dual capabilities open up novel applications in dynamic and complex environments, illustrating how the integration of perception and understanding can unlock the full potential of multimodal systems.


Check out the Paper and GitHub Page. All credit for this research goes to the researchers of this project.

