MarkTechPost@AI 01月13日
Salesforce AI Introduces TACO: A New Family of Multimodal Action Models that Combine Reasoning with Real-World Actions to Solve Complex Visual Tasks

Salesforce AI has introduced TACO, a new family of multimodal action models that combine reasoning with real-world actions to solve complex visual tasks. TACO is trained on a synthetic CoTA (chain-of-thought-and-action) dataset generated with GPT-4, comprising over 1.8 million trajectories from which 293K high-quality examples were curated. The model integrates 15 tools, including OCR, object localization, and a math solver, enabling it to handle complex tasks effectively. TACO performs strongly across multiple benchmarks, improving average accuracy by 3.6% and achieving gains of up to 15% on MMVet tasks involving OCR and mathematical reasoning. By integrating reasoning with action, this research opens a new path for multimodal learning and sets a new benchmark for solving complex scenarios.

💡TACO is an innovative multimodal action-model framework. Trained on a synthetic CoTA dataset, it overcomes the limitations of traditional models in reasoning and tool use, especially on complex tasks that demand multi-step reasoning and logical chains of actions.

🛠️TACO integrates 15 tools, such as OCR, object localization, and a math solver, enabling it to handle tasks spanning visual recognition, visual grounding, reasoning, and multi-step problem solving, and demonstrating strong multimodal task-handling capability.

📊TACO performs strongly across multiple benchmarks, improving average accuracy by 3.6% over instruction-tuned baselines and achieving gains of up to 15% on MMVet tasks involving OCR and mathematical reasoning, demonstrating its capability on complex tasks.

🧪TACO's training process uses a high-quality synthetic dataset and advanced filtering techniques, emphasizes the integration of reasoning and action, and further optimizes model performance through hyperparameter tuning, including adjustments to the vision encoder and learning rate, ensuring the model can effectively solve complex multimodal challenges.
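The reasoning-plus-action pattern highlighted above can be illustrated with a minimal sketch. The tool names, their toy implementations, and the dispatch logic here are illustrative stand-ins, not TACO's actual tools or inference loop:

```python
# Minimal sketch of a chain-of-thought-and-action (CoTA) loop:
# a model alternates between reasoning and tool calls, feeding each
# tool's observation back into its context. All tools are stand-ins.

def ocr(image):
    # Stand-in for a real OCR tool reading text from an image.
    return "Total: 3 x $4"

def calculator(expression):
    # Stand-in for a math-solver tool; restricted eval, illustration only.
    return eval(expression, {"__builtins__": {}})

TOOLS = {"OCR": ocr, "Calculator": calculator}

def run_cota(actions, image):
    """Execute a scripted sequence of (tool, argument) actions,
    collecting each observation as it would be fed back to the model."""
    context = []
    for tool_name, arg in actions:
        observation = TOOLS[tool_name](arg if arg is not None else image)
        context.append((tool_name, observation))
    return context

# Example: read a receipt, then compute the total it mentions.
trace = run_cota([("OCR", None), ("Calculator", "3 * 4")], image="receipt.png")
```

In a real action model, the next tool call would be chosen by the model conditioned on prior observations rather than scripted in advance.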

Developing effective multimodal AI systems for real-world applications requires handling diverse tasks such as fine-grained recognition, visual grounding, reasoning, and multi-step problem solving. Existing open-source multimodal language models fall short in these areas, especially on tasks that require external tools such as OCR or mathematical calculation. These limitations stem largely from single-step-oriented training datasets, which provide no coherent framework for multi-step reasoning and logical chains of actions. Overcoming them is essential to unlocking the full potential of multimodal AI on complex tasks.

Current multimodal models often rely on instruction tuning with direct-answer datasets or on few-shot prompting. Proprietary systems such as GPT-4 have demonstrated effective reasoning through CoTA chains, while open-source models are held back by a lack of suitable datasets and tool integration. Earlier efforts such as LLaVA-Plus and Visual Program Distillation were limited by small dataset sizes, poor-quality training data, and a focus on simple question-answering tasks, which kept them from the more complex multimodal problems that require sophisticated reasoning and tool application.

Researchers from the University of Washington and Salesforce Research have developed TACO, an innovative framework for training multimodal action models on synthetic CoTA datasets. The work introduces several key advances over prior methods. First, over 1.8 million traces were generated using GPT-4 and Python programs, and a subset of 293K high-quality examples was curated through rigorous filtering; these examples cover the diverse reasoning and action sequences critical for multimodal learning. Second, TACO incorporates a robust set of 15 tools, including OCR, object localization, and mathematical solvers, enabling the models to handle complex tasks effectively. Third, advanced filtering and data-mixing techniques further optimize the dataset, emphasizing the integration of reasoning with action. The framework enables models to produce coherent multi-step reasoning while seamlessly invoking tools, establishing a new benchmark for performance in intricate scenarios.
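A curation step like the one described (filtering 1.8M generated traces down to a 293K high-quality subset) might be sketched as below. The trace schema and the quality heuristics are assumptions for illustration, not the paper's actual filters:

```python
from dataclasses import dataclass, field

@dataclass
class Trace:
    """A synthetic chain-of-thought-and-action training example."""
    question: str
    steps: list = field(default_factory=list)  # alternating thoughts and tool calls
    final_answer: str = ""
    answer_correct: bool = False               # checked against ground truth

def is_high_quality(trace: Trace, min_steps: int = 2) -> bool:
    # Keep only traces that reach a verified answer via genuine
    # multi-step reasoning (simple heuristics, for illustration).
    return trace.answer_correct and len(trace.steps) >= min_steps

raw = [
    Trace("What is the total on the receipt?",
          ["OCR -> '3 x $4'", "Calculator(3*4) -> 12"], "12", True),
    Trace("What color is the car?", ["guess"], "red", False),
]
curated = [t for t in raw if is_high_quality(t)]
```

In practice, filters of this kind trade dataset size for quality; as the article notes below, the smaller curated set outperformed larger, less refined ones.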

TACO was trained on a carefully curated CoTA dataset of 293K instances drawn from 31 sources, including Visual Genome, covering a wide range of tasks such as mathematical reasoning, optical character recognition, and detailed visual understanding. The dataset is highly heterogeneous, and the provided tools, including object localization and language-based solvers, support a broad spectrum of reasoning and action tasks. The training architecture combines LLaMA3 as the language backbone with CLIP as the visual encoder, establishing a strong multimodal foundation. Fine-tuning involved hyperparameter tuning, chiefly lowering the learning rate and increasing the number of training epochs, to ensure the models could adequately solve complex multimodal challenges.

TACO’s performance across eight benchmarks demonstrates its substantial impact on advancing multimodal reasoning. The system achieved an average accuracy improvement of 3.6% over instruction-tuned baselines, with gains as high as 15% on MMVet tasks involving OCR and mathematical reasoning. Notably, the high-quality 293K CoTA dataset outperformed larger, less refined datasets, underscoring the importance of targeted data curation. Further gains came from hyperparameter adjustments, including tuning the vision encoder and optimizing learning rates. As Table 2 of the paper shows, TACO compares favorably with the baselines and is especially strong on complex tasks that integrate reasoning with action.

TACO introduces a new methodology for multimodal action modeling that addresses longstanding deficiencies in both reasoning and tool-based action through high-quality synthetic datasets and innovative training methods. The research overcomes the constraints of traditional instruction-tuned models, and its advances are poised to reshape real-world applications ranging from visual question answering to complex multi-step reasoning tasks.


Check out the Paper, GitHub Page, and Project Page. All credit for this research goes to the researchers of this project.


