MarkTechPost@AI · 21 hours ago
You Don’t Need to Share Data to Train a Language Model Anymore—FlexOlmo Demonstrates How

FlexOlmo is a modular training and inference framework designed to solve the centralized-data-access problem in large language model (LLM) development caused by data sensitivity, copyright restrictions, and usage regulations. The framework enables LLM development under data-governance constraints: independently trained expert modules, combined with a flexible inclusion/exclusion mechanism at inference time, overcome the limitations of conventional LLM training pipelines, which cannot accommodate regulatory requirements, licensing restrictions, or context-sensitive data. FlexOlmo performs strongly across a wide range of benchmarks, significantly outperforming existing baselines, and offers a practical path to building high-performing language models in environments with strict data privacy and security requirements.

💡 FlexOlmo adopts a Mixture-of-Experts (MoE) architecture that decomposes LLM training into independent, modular experts, each of which can be trained on a locally held, isolated dataset. This design lets data owners train domain-specific experts, e.g., for news, code, or academic text, without ever sharing the raw data, effectively sidestepping obstacles around data privacy, copyright, and regulation.

🚀 The framework's core advantage is inference-time flexibility: the influence of specific datasets can be selectively included or excluded through a deterministic mechanism, without retraining the model. This is realized via key architectural components such as sparse activation, expert routing, and bias regularization, allowing the model to adjust its behavior dynamically according to data-governance policies and meeting practical needs around data locality, dynamic data-usage policies, and regulatory compliance.

📊 Across 31 benchmark tasks in 10 categories, FlexOlmo demonstrates strong performance, with a 41% average relative improvement over the base public model and a 10.1% gain over the strongest model-merging baseline (BTM). The advantage is especially pronounced on tasks related to closed domains, confirming the effectiveness of specialized experts on particular data types.

🔒 FlexOlmo was also evaluated for data privacy: extraction rates under known attack methods are low, and the architecture supports applying differential privacy (DP) training independently to each expert, providing an additional layer of protection. This makes FlexOlmo a reliable option for developing LLMs under strict data-security and privacy requirements.

The development of large-scale language models (LLMs) has historically required centralized access to extensive datasets, many of which are sensitive, copyrighted, or governed by usage restrictions. This constraint severely limits the participation of data-rich organizations operating in regulated or proprietary environments. FlexOlmo—introduced by researchers at the Allen Institute for AI and collaborators—proposes a modular training and inference framework that enables LLM development under data governance constraints.

Current LLM Limitations

Current LLM training pipelines rely on aggregating all training data into a single corpus, which imposes a static inclusion decision and eliminates the possibility of opt-out post-training. This approach is incompatible with:

    Regulatory requirements that restrict how and where data may be used
    Licensing restrictions that limit data to specific uses
    Context-sensitive data that cannot leave the owner's infrastructure

FlexOlmo addresses two objectives:

    Decentralized, modular training: Allow independently trained modules on disjoint, locally held datasets.
    Inference-time flexibility: Enable deterministic opt-in/opt-out mechanisms for dataset contributions without retraining.

Model Architecture: Expert Modularity via Mixture-of-Experts (MoE)

FlexOlmo builds upon a Mixture-of-Experts (MoE) architecture where each expert corresponds to a feedforward network (FFN) module trained independently. A fixed public model (denoted as M<sub>pub</sub>) serves as the shared anchor. Each data owner trains an expert M<sub>i</sub> using their private dataset D<sub>i</sub>, while all attention layers and other non-expert parameters remain frozen.

Key architectural components:

    Sparse activation: Only a subset of experts is active for each token, keeping inference cost bounded.
    Expert routing: A router matrix of expert embeddings determines which experts each token is sent to.
    Bias regularization: A regularization term keeps independently trained experts compatible when they are combined.

This design maintains interoperability among modules while enabling selective inclusion during inference.
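The sparse routing described above can be sketched in a few lines. The following is a minimal, illustrative top-k MoE layer in pure Python; the router rows, toy expert FFNs, and dimensions are hypothetical stand-ins, not the actual FlexOlmo implementation:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def moe_layer(token, router_rows, experts, top_k=2):
    """Sparsely route one token through its top-k experts.

    router_rows: one router embedding per expert (public anchor + data-owner experts).
    experts: matching list of FFN callables; here, toy closures.
    """
    logits = [dot(token, r) for r in router_rows]
    # Keep only the top-k experts; the rest stay inactive (sparse activation).
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:top_k]
    weights = softmax([logits[i] for i in top])
    out = [0.0] * len(token)
    for w, i in zip(weights, top):
        y = experts[i](token)  # run only the selected experts
        out = [o + w * yi for o, yi in zip(out, y)]
    return out

# Toy experts: a public anchor plus two domain experts (hypothetical FFNs).
experts = [lambda x: [2 * v for v in x],
           lambda x: [v + 1 for v in x],
           lambda x: [-v for v in x]]
router_rows = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
out = moe_layer([1.0, 2.0], router_rows, experts, top_k=2)
```

Because each expert is an independent callable keyed by its router row, adding or removing a data owner's module is a matter of editing these two lists, which is what makes inference-time inclusion/exclusion possible.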

Asynchronous and Isolated Optimization

Each expert M<sub>i</sub> is trained via a constrained procedure to ensure alignment with M<sub>pub</sub>: the expert's FFN and its router embedding r<sub>i</sub> are updated on D<sub>i</sub>, while the attention layers and all other shared parameters stay frozen, so independently trained experts remain compatible when combined.

To initialize r<sub>i</sub>, a set of samples from D<sub>i</sub> is embedded using a pretrained encoder, and their average forms the router embedding. Optional lightweight router tuning can further improve performance using proxy data from the public corpus.
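The router initialization described here — averaging encoder embeddings of samples from the expert's dataset — can be sketched as follows. The `toy_embed` encoder is a hypothetical stand-in for the pretrained encoder the paper uses:

```python
def init_router_embedding(samples, embed):
    """Initialize an expert's router row as the mean embedding of its data samples.

    `embed` is any fixed text -> vector map; in the paper this is a pretrained
    encoder, here a toy function for illustration.
    """
    vecs = [embed(s) for s in samples]
    dim = len(vecs[0])
    return [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]

# Hypothetical toy encoder: character-count features standing in for a real model.
def toy_embed(text):
    return [float(len(text)), float(text.count(" "))]

r_i = init_router_embedding(["math proof", "lemma 2"], toy_embed)
```

The averaged embedding places the router row near the expert's data distribution, so tokens resembling D<sub>i</sub> score highly against r<sub>i</sub> even before any optional router tuning.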

Dataset Construction: FLEXMIX

The training corpus, FLEXMIX, is divided into:

    A public mix available to all participants
    Multiple closed sets simulating restricted domains, such as news, code, and academic text

Each expert is trained on a disjoint subset, with no joint data access. This setup approximates real-world usage where organizations cannot pool data due to legal, ethical, or operational constraints.

Evaluation and Baseline Comparisons

FlexOlmo was evaluated on 31 benchmark tasks across 10 categories, including general language understanding (e.g., MMLU, AGIEval), generative QA (e.g., GEN5), code generation (e.g., Code4), and mathematical reasoning (e.g., Math2).

Baseline methods include:

    The base public model, trained only on the public mix
    Model-merging approaches, including Branch-Train-Merge (BTM), the strongest merging baseline

Compared to these methods, FlexOlmo achieves:

    A 41% average relative improvement over the base public model
    A 10.1% improvement over the strongest model-merging baseline (BTM)

The gains are especially notable on tasks aligned with closed domains, confirming the utility of specialized experts.

Architectural Analysis

Several controlled experiments isolate the contribution of individual architectural decisions.

Token-level routing patterns show expert specialization at specific layers. For instance, mathematical input activates the math expert at deeper layers, while introductory tokens rely on the public model. This behavior underlines the model’s expressivity compared to single-expert routing strategies.

Opt-Out and Data Governance

A key feature of FlexOlmo is deterministic opt-out capability. Removing an expert from the router matrix fully removes its influence at inference time. Experiments show that removing the News expert reduces performance on NewsG but leaves other tasks unaffected, confirming the localized influence of each expert.
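The opt-out mechanism amounts to dropping an expert's row from the router matrix (and its FFN) before inference. A minimal sketch, with illustrative names and data:

```python
def opt_out(router_rows, experts, removed):
    """Deterministically exclude experts by dropping their router rows and FFNs.

    The remaining experts and router rows are untouched, so the excluded
    dataset's influence is removed entirely rather than approximately.
    """
    keep = [i for i in range(len(experts)) if i not in removed]
    return [router_rows[i] for i in keep], [experts[i] for i in keep]

rows = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
experts = ["public", "news", "code"]  # placeholders for expert modules
rows2, experts2 = opt_out(rows, experts, removed={1})  # drop the News expert
```

No retraining is needed because the surviving experts were never optimized jointly with the removed one; their parameters are unchanged by its absence.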

Privacy Considerations

Training-data extraction risks were evaluated using known attack methods, and the measured extraction rates were low.

While these rates are low, differential privacy (DP) training can be applied independently to each expert for stronger guarantees. The architecture does not preclude the use of DP or encrypted training methods.
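As one illustration of per-expert DP training, here is a DP-SGD-style step (per-example gradient clipping plus Gaussian noise) that an individual data owner could apply inside their isolated training loop. The hyperparameters and shapes are illustrative, not the paper's recipe:

```python
import math
import random

def dp_clip_and_noise(grads, clip_norm=1.0, noise_mult=1.0, rng=None):
    """One DP-SGD-style aggregation step over per-example gradients.

    Each gradient is clipped to L2 norm <= clip_norm, then the average is
    perturbed with Gaussian noise scaled by noise_mult * clip_norm / batch.
    """
    rng = rng or random.Random(0)
    clipped = []
    for g in grads:
        norm = math.sqrt(sum(x * x for x in g))
        scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
        clipped.append([x * scale for x in g])
    dim = len(grads[0])
    avg = [sum(g[d] for g in clipped) / len(grads) for d in range(dim)]
    sigma = noise_mult * clip_norm / len(grads)
    return [a + rng.gauss(0.0, sigma) for a in avg]

noisy = dp_clip_and_noise([[3.0, 4.0], [0.5, 0.0]])
```

Because each expert trains in isolation, a data owner can choose their own clipping norm and noise multiplier, giving per-dataset privacy guarantees without coordinating with other participants.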

Scalability

The FlexOlmo methodology was applied to an existing strong baseline (OLMo-2 7B), pretrained on 4T tokens. Incorporating two additional experts (Math, Code) improved average benchmark performance from 49.8 to 52.8, without retraining the core model. This demonstrates scalability and compatibility with existing training pipelines.

Conclusion

FlexOlmo introduces a principled framework for building modular LLMs under data governance constraints. Its design supports distributed training on locally maintained datasets and enables inference-time inclusion/exclusion of dataset influence. Empirical results confirm its competitiveness against both monolithic and ensemble-based baselines.

The architecture is particularly applicable to environments with strict privacy or licensing requirements, regulated or context-sensitive data, and data-locality constraints.

FlexOlmo provides a viable pathway for constructing performant language models while adhering to real-world data access boundaries.


Check out the Paper, the Model on Hugging Face, and the code. All credit for this research goes to the researchers of this project.


