MarkTechPost@AI, February 26
CoSyn: An AI Framework that Leverages the Coding Capabilities of Text-only Large Language Models (LLMs) to Automatically Create Synthetic Text-Rich Multimodal Data

CoSyn is an innovative AI framework that leverages the coding capabilities of text-only LLMs to automatically create synthetic, text-rich multimodal data, significantly improving the performance of vision-language models on text-dense images. By generating diverse, high-quality training data, the system enables models to generalize effectively to different domains. CoSyn runs a four-stage workflow (natural language query, rendering-tool selection, executable code generation, and generation of the corresponding textual instructions) to create comprehensive vision-language instruction-tuning datasets. Experiments show that models trained on CoSyn-generated synthetic data perform strongly on multiple text-rich image understanding benchmarks, even surpassing many competing models in zero-shot settings.

🤖 The CoSyn framework uses the code-generation capabilities of text-only LLMs to create diverse synthetic training data for vision-language models, addressing the shortage of high-quality, diverse data in existing datasets.

🎨 The CoSyn system follows a four-stage workflow: it first receives a natural language query; then selects an appropriate rendering tool (such as Matplotlib, Plotly, or LaTeX); next generates executable code to render the synthetic image; and finally, using the code as context, prompts a language model to generate the corresponding textual instructions, including questions, answers, and chain-of-thought reasoning.

🥇 Models trained on CoSyn-generated synthetic data perform strongly on text-rich image understanding benchmarks: the 7B-parameter model surpasses models such as Llama 3.2 11B and stands out in zero-shot settings, indicating that CoSyn data transfers effectively to downstream tasks.

🧪 Combining the synthetic data with auxiliary and evaluation datasets lifts model performance significantly (to 80.9%), exceeding a model trained on evaluation data alone (75.9%) and validating the effectiveness of synthetic data for improving model performance.

Vision-language models (VLMs) have demonstrated impressive capabilities in general image understanding, but face significant challenges when processing text-rich visual content such as charts, documents, diagrams, and screenshots. These specialized images require complex reasoning that combines textual comprehension with spatial understanding—a skill set critical for analysing scientific literature, improving accessibility features, and enabling AI agents to function effectively in real-world environments. Current VLMs struggle with these tasks primarily due to the scarcity of high-quality training data that realistically represents the diverse array of text-embedded visual formats encountered in practical applications. This data limitation has created a performance gap in scenarios requiring nuanced interpretation of structured visual information, hampering the deployment of these models in specialized domains where text-rich image processing is essential.

Several approaches have been developed to enhance vision-language models for processing visual content. Early architectures explored different integration strategies including cross-attention mechanisms, Q-Former structures, and MLP projection layers to bridge visual and linguistic features. However, these models often suffer from a significant imbalance, i.e., their language components substantially outweigh their visual processing capabilities, leading to hallucinations when high-quality training data is scarce. Existing benchmarks for text-rich image understanding (charts, documents, infographics, diagrams, screenshots) remain limited in size, scope, and diversity, making them suitable for evaluation but inadequate for comprehensive training. Previous synthetic data generation efforts have typically focused on narrow domains using small sets of chart types with handcrafted question templates. Some approaches utilize text-only LLMs to generate annotations from tables or descriptions, while others explore code-based rendering of synthetic charts. Despite these advances, current synthetic datasets remain constrained in topic diversity, figure variety, and rendering methodology—critical limitations that hinder generalization to novel, out-of-distribution tasks.

A team of researchers from the University of Pennsylvania and the Allen Institute for Artificial Intelligence introduced the Code Guided Synthetic Data Generation System (CoSyn), which offers a flexible framework to address the challenges in text-rich image processing by creating diverse synthetic multimodal training data. This innovative system utilizes the code generation capabilities of text-only LLMs to produce both data and rendering code for various text-rich visual formats using 11 supported rendering tools, including Python, HTML, and LaTeX. CoSyn generates not only the images but also corresponding textual instructions grounded in the underlying code representation, creating comprehensive vision-language instruction-tuning datasets. The researchers used this framework to develop CoSyn-400K, a large-scale, diverse synthetic dataset specifically designed for text-rich image understanding.
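To make the idea concrete, here is a hypothetical example of the kind of rendering script a text-only LLM might produce for one synthetic text-rich image. The data, labels, and file name are invented for illustration and are not drawn from CoSyn-400K.

```python
# Hypothetical example of the kind of rendering script a text-only LLM might
# emit for one synthetic text-rich image (a simple bar chart). The data,
# labels, and file name are invented for illustration only.
import matplotlib.pyplot as plt

quarters = ["Q1", "Q2", "Q3", "Q4"]
revenue = [4.2, 5.1, 4.8, 6.3]  # synthetic values, in $M

fig, ax = plt.subplots(figsize=(6, 4))
bars = ax.bar(quarters, revenue, color="#4C72B0")
ax.bar_label(bars, fmt="%.1f")  # write the value above each bar
ax.set_title("Synthetic Quarterly Revenue (illustrative)")
ax.set_xlabel("Quarter")
ax.set_ylabel("Revenue ($M)")
fig.savefig("synthetic_chart.png", dpi=200, bbox_inches="tight")
```

Because every element of the rendered image is produced by such a script, the script itself is a complete description of the image, which is what lets a text-only LLM later write grounded questions and answers about the picture without ever seeing it.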

The CoSyn system operates through a sophisticated four-stage workflow beginning with a natural language query like “generate a dataset of book covers.” First, the system selects one of 20 generation pipelines built on 11 diverse rendering tools including Matplotlib, Plotly, LaTeX, HTML, Mermaid, and specialized tools like Lilypond for music sheets and RDKit for chemical structures. The process starts with topic generation guided by sampled personas that enhance content diversity, followed by detailed data generation that populates content specific to the chosen topic. Next, the system generates executable code that renders the synthetic images using the appropriate tool. Finally, using only the code as context, the system prompts language models to generate corresponding textual instructions, including questions, answers, and chain-of-thought reasoning explanations. To enhance diversity beyond what sampling parameters alone can achieve, CoSyn incorporates 200K unique personas during topic generation, effectively countering the repetitive output tendencies of language models. The implementation leverages the DataDreamer library for robust multi-stage generation, using Claude-3.5-Sonnet for code generation and GPT-4o-mini for instruction data generation.
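The paragraph above lays out the four stages in prose; below is a minimal sketch of how such a pipeline could be orchestrated. The helpers call_llm and render_code, the prompt wording, and the tool list are illustrative assumptions, not CoSyn's actual implementation.

```python
# Minimal sketch of a CoSyn-style four-stage pipeline. The helpers call_llm()
# and render_code() are hypothetical placeholders, not the authors' actual
# API; prompts are heavily abbreviated for illustration.
import random

RENDERING_TOOLS = ["matplotlib", "plotly", "latex", "html", "mermaid",
                   "lilypond", "rdkit"]  # subset of the 11 tools mentioned above


def render_code(code: str, tool: str) -> str:
    """Placeholder: execute `code` with `tool` and return the rendered image path."""
    raise NotImplementedError("wire this up to an actual rendering backend")


def generate_example(query: str, personas: list[str], call_llm) -> dict:
    # Stage 1: choose a generation pipeline / rendering tool for the query,
    # and sample a persona to diversify the generated topics.
    tool = random.choice(RENDERING_TOOLS)
    persona = random.choice(personas)
    topic = call_llm(f"As {persona}, propose a concrete topic for: {query}")

    # Stage 2: populate detailed data specific to the chosen topic.
    data = call_llm(f"Generate detailed content for the topic: {topic}")

    # Stage 3: have the text-only LLM write executable code that renders the image.
    code = call_llm(f"Write a {tool} script that renders this data as an image:\n{data}")
    image_path = render_code(code, tool)

    # Stage 4: generate questions, answers, and chain-of-thought reasoning,
    # using only the code (not the image) as context.
    instructions = call_llm(
        "Given only the rendering code below, write questions, answers, and "
        f"chain-of-thought explanations about the rendered image:\n{code}"
    )
    return {"image": image_path, "code": code, "instructions": instructions}
```

In the paper, the orchestration is built on the DataDreamer library, with Claude-3.5-Sonnet handling code generation and GPT-4o-mini producing the instruction data; the sketch above only mirrors the flow of those four stages.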

The model trained on CoSyn’s synthetic data demonstrates exceptional performance across text-rich image understanding benchmarks. When evaluated against seven specialized datasets, the 7B parameter model achieves the highest average performance, surpassing the second-best model (Llama 3.2 11B) by a significant margin of 3.9%. The model ranks first in four out of seven benchmarks and second in the remaining three, highlighting its consistent capabilities across diverse text-rich image tasks. Perhaps most remarkably, even the zero-shot version of the model, without any exposure to training instances from the evaluation datasets, outperforms most competing open and closed models, including those that had been fine-tuned on benchmark training data. This unexpected result provides compelling evidence that the skills acquired from CoSyn’s synthetic data transfer effectively to downstream tasks without requiring domain-specific training examples. Additional ablation studies demonstrate that combining synthetic data with auxiliary and evaluation datasets yields the best performance (80.9%), substantially outperforming models trained on evaluation data alone (75.9%).

The CoSyn framework represents a significant advancement in vision-language model development, utilizing synthetic data generation to substantially improve performance on text-rich image understanding tasks. By harnessing the code generation capabilities of text-only LLMs, the system creates diverse, high-quality training data that enables models to generalize across domains with remarkable efficiency. Analysis confirms that CoSyn-generated data successfully mitigates biases present in existing datasets, resulting in models that perform robustly on realistic, human-written queries rather than just template-based questions. The demonstrated improvements in zero-shot learning, multi-hop reasoning, and novel domain adaptation highlight synthetic data’s crucial role in developing VLMs capable of handling complex text-rich visual content in practical applications.


Check out the Paper and Dataset here. All credit for this research goes to the researchers of this project.



