MarkTechPost@AI, July 24, 07:07
Amazon Researchers Reveal Mitra: Advancing Tabular Machine Learning with Synthetic Priors

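Amazon researchers have released Mitra, an advanced foundation model designed specifically for tabular data. Mitra abandons the traditional approach of building a custom model for each dataset and instead uses in-context learning (ICL) and pretraining on synthetic data to achieve leading performance on tabular machine learning benchmarks. Integrated into AutoGluon 1.4, Mitra generalizes strongly, bringing a significant change for practitioners who work with structured data in healthcare, finance, e-commerce, and the sciences. Its core innovation is that it is pretrained only on diverse synthetic data and uses a 2-D attention mechanism over rows and features, which lets it adapt to different table sizes and data types and natively support heterogeneous data, opening a new path for tabular data analysis.

🌟 Mitra is designed specifically for tabular data. Through in-context learning (ICL) and pretraining on synthetic data, it breaks with the traditional pattern of building a separate model for each dataset and reaches state-of-the-art results on several tabular machine learning benchmarks.

💡 Mitra's core innovation is its pretraining strategy, which relies entirely on carefully designed synthetic data that mixes diverse priors such as structural causal models and tree models. The approach borrows from how large language models are pretrained on massive text corpora, so that the model learns patterns applicable to a wide range of unseen real-world datasets.

🚀 The model uses a 2-D attention mechanism that processes table rows and features together, capturing complex interactions between columns and records. It natively supports heterogeneous data, addressing a key challenge in tabular machine learning, and adapts to different table sizes and feature types.

📈 On small-to-medium datasets (fewer than 5,000 samples and fewer than 100 features), Mitra performs especially well on both classification and regression, outperforming strong baselines such as XGBoost, CatBoost, and earlier versions of AutoGluon.

🛠️ Mitra is open source and integrated into AutoGluon 1.4. Users can adapt it quickly to new tasks through in-context learning and can optionally fine-tune it for specific needs. The model weights are shared on Hugging Face and run on both GPU and CPU, which makes practical use and research much easier.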

Introduction

Amazon researchers have released Mitra, a foundation model purpose-built for tabular data. Instead of fitting a bespoke model to every dataset, Mitra combines in-context learning (ICL) with pretraining on synthetic data and reports state-of-the-art performance across tabular machine learning benchmarks. It is integrated into AutoGluon 1.4 and designed to generalize robustly, which matters to practitioners working with structured data in fields like healthcare, finance, e-commerce, and the sciences.

https://www.amazon.science/blog/mitra-mixed-synthetic-priors-for-enhancing-tabular-foundation-models

The Foundation: Learning from Synthetic Priors

Mitra departs from the norm by being pretrained exclusively on synthetic data. Real-world tabular datasets are limited in number and heterogeneous in structure, so instead of relying on them, Amazon researchers engineered a principled strategy for generating and mixing diverse synthetic priors. This approach draws inspiration from the way large language models are pretrained on vast and varied text corpora.
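As a rough illustration of what generating and mixing synthetic priors can look like, the sketch below draws small classification tasks from two toy generators, one built on a random structural causal model and one on a random decision tree, and chooses between them with mixture weights. The article does not describe Mitra's actual generators or mixing weights at this level of detail, so every function name, size range, and weight here is an assumption made for illustration only.

```python
import math
import random


def sample_scm_dataset(rng, n_rows, n_features):
    """Toy structural-causal-model prior (illustrative, not Mitra's generator)."""
    n_nodes = n_features + 1  # the last node in the ordering plays the target
    # Random DAG: node j may depend only on earlier nodes i < j.
    weights = [[rng.gauss(0, 1) if i < j and rng.random() < 0.5 else 0.0
                for i in range(n_nodes)] for j in range(n_nodes)]
    rows = []
    for _ in range(n_rows):
        values = []
        for j in range(n_nodes):
            parent_sum = sum(weights[j][i] * values[i] for i in range(j))
            values.append(math.tanh(parent_sum) + rng.gauss(0, 0.1))
        rows.append(values)
    X = [r[:-1] for r in rows]
    median = sorted(r[-1] for r in rows)[n_rows // 2]
    y = [int(r[-1] > median) for r in rows]  # binarize the target node
    return X, y


def sample_tree_dataset(rng, n_rows, n_features, depth=3):
    """Toy tree prior: labels come from a randomly grown decision tree."""
    def build(d):
        if d == 0:
            return rng.randint(0, 1)  # leaf holds a class label
        return (rng.randrange(n_features), rng.uniform(-1, 1),
                build(d - 1), build(d - 1))

    def predict(node, x):
        while not isinstance(node, int):
            feature, threshold, left, right = node
            node = left if x[feature] <= threshold else right
        return node

    tree = build(depth)
    X = [[rng.uniform(-1, 1) for _ in range(n_features)] for _ in range(n_rows)]
    y = [predict(tree, x) for x in X]
    return X, y


PRIORS = {"scm": sample_scm_dataset, "tree": sample_tree_dataset}


def sample_pretraining_task(rng, mixture_weights):
    """Pick a prior according to the mixture weights, then draw one task."""
    names = list(mixture_weights)
    name = rng.choices(names, weights=[mixture_weights[n] for n in names])[0]
    n_rows = rng.randint(16, 64)     # assumed size range
    n_features = rng.randint(2, 8)   # assumed size range
    X, y = PRIORS[name](rng, n_rows, n_features)
    return name, X, y


if __name__ == "__main__":
    rng = random.Random(0)
    name, X, y = sample_pretraining_task(rng, {"scm": 0.5, "tree": 0.5})
    print(name, len(X), "rows,", len(X[0]), "features")
```

Each call yields a fresh table with its own size and labeling rule, which is the property a pretraining stream for an in-context learner needs: the model never sees the same task twice and has to infer the rule from the rows it is shown.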

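Key components of Mitra’s synthetic pretraining: a carefully designed mixture of diverse priors, including structural causal models and tree-based models, chosen so that the model learns patterns that carry over to unseen real-world datasets.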

In-Context Learning and Fine-Tuning: Adapting Without New Models

Traditional tabular ML methods like XGBoost and random forests require a new model for each task or data distribution. In contrast, Mitra leverages in-context learning: given a small number of labeled examples (support set), Mitra can make accurate predictions on new, unseen data (query set) for classification or regression, adapting to each scenario without retraining.
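The support-set and query-set pattern can be made concrete with a small stand-in. The function below is not Mitra: it swaps the pretrained transformer for a plain k-nearest-neighbour vote, purely to show the calling convention in which labeled examples arrive at prediction time and no fitting step takes place. The name `predict_in_context` and the parameter `k` are assumptions for this sketch.

```python
import math
from collections import Counter


def predict_in_context(support_X, support_y, query_X, k=3):
    """Label each query row using only the support set passed in this call.

    Stand-in rule (NOT Mitra's model): majority vote among the k support
    rows closest to the query in Euclidean distance.
    """
    predictions = []
    for q in query_X:
        nearest = sorted(range(len(support_X)),
                         key=lambda i: math.dist(q, support_X[i]))[:k]
        votes = Counter(support_y[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions


if __name__ == "__main__":
    support_X = [[0.0, 0.1], [0.2, 0.0], [0.1, 0.2],
                 [1.0, 0.9], [0.9, 1.1], [1.1, 1.0]]
    support_y = ["low", "low", "low", "high", "high", "high"]
    query_X = [[0.05, 0.05], [1.05, 0.95]]
    print(predict_in_context(support_X, support_y, query_X))
```

Passing a different support set to the same function switches it to a different task without any retraining, which is the behaviour the paragraph above attributes to in-context learning. A foundation model such as Mitra replaces the fixed distance rule with a prediction rule learned during pretraining.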

For users who require further adaptation, fine-tuning is also supported, allowing the model to be tailored to specific tasks when needed.
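In AutoGluon, the choice between pure in-context use and fine-tuning is expressed through the `hyperparameters` argument of `TabularPredictor.fit`. The sketch below assumes the model key `"MITRA"` and the options `fine_tune` and `fine_tune_steps`, as given in the AutoGluon 1.4 tabular tutorial to the best of this write-up's knowledge; check them against the release you install. The helper names `mitra_hyperparameters` and `fit_mitra` are hypothetical.

```python
def mitra_hyperparameters(fine_tune=False, fine_tune_steps=10):
    """Build the hyperparameters dict for a Mitra-only AutoGluon run.

    Key and option names are assumptions taken from the AutoGluon 1.4
    tutorial; verify them against your installed version.
    """
    options = {"fine_tune": fine_tune}
    if fine_tune:
        options["fine_tune_steps"] = fine_tune_steps
    return {"MITRA": options}


def fit_mitra(train_data, label, fine_tune=False):
    """Fit a TabularPredictor restricted to Mitra on a pandas DataFrame."""
    # Imported lazily so the sketch can be loaded without AutoGluon installed.
    from autogluon.tabular import TabularPredictor

    predictor = TabularPredictor(label=label)
    predictor.fit(train_data,
                  hyperparameters=mitra_hyperparameters(fine_tune))
    return predictor
```

With `fine_tune=False` the pretrained weights are used as they are and the training rows serve as context. With `fine_tune=True` the weights are additionally updated on the task for a small number of steps.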

Architecture Innovations

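Mitra employs a 2-D attention mechanism that operates across both rows and features, adapting the transformer architecture to tabular data. This enables the model to handle varying table sizes and feature types, capture complex interactions between columns and records, and natively support heterogeneous data.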
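The idea of attending along two axes of a table can be sketched in a few lines. The code below is a simplified illustration rather than Mitra's layer: it applies single-head dot-product self-attention first across the features of each row and then across the rows of each feature, and it leaves out learned projections, multiple heads, residual connections, and normalization. The function names and the cell-embedding layout `cells[row][feature]` are assumptions.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head dot-product self-attention over a list of vectors.

    Queries, keys, and values are the inputs themselves, because this
    sketch has no learned projection matrices.
    """
    d = len(vectors[0])
    out = []
    for q in vectors:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
                  for k in vectors]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, vectors))
                    for j in range(d)])
    return out


def two_d_attention(cells):
    """Attend across features, then across rows.

    cells[r][c] is the embedding vector of row r, feature c.
    """
    n_rows, n_cols = len(cells), len(cells[0])
    # Pass 1: within each row, each feature attends to all features.
    cells = [self_attention(row) for row in cells]
    # Pass 2: within each column, each row attends to all rows.
    columns = [self_attention([cells[r][c] for r in range(n_rows)])
               for c in range(n_cols)]
    # Transpose back to the [row][feature] layout.
    return [[columns[c][r] for c in range(n_cols)] for r in range(n_rows)]
```

Because attention works on sequences of any length, the same function accepts a 3 by 2 table and a 500 by 40 table without modification, which is one way to see why an architecture of this kind can adapt to different table sizes.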

Benchmark Performance and Practical Strengths

Results

Mitra achieves state-of-the-art results on multiple major tabular benchmarks.

Its strengths are especially pronounced on small-to-medium datasets (under 5,000 samples, fewer than 100 features), delivering leading results on both classification and regression problems. Notably, Mitra outperforms strong baselines like TabPFNv2, TabICL, CatBoost, and AutoGluon’s prior iterations.


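Usability

Mitra is open source and integrated into AutoGluon 1.4. Users can adapt it to new tasks through in-context learning and can optionally fine-tune it for specific needs. The model weights are shared on Hugging Face, and the model runs on both GPU and CPU.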

Implications and Future Directions

By learning from a carefully curated blend of synthetic priors, Mitra brings the generalizability of large foundation models to the tabular domain, and it is poised to accelerate both research and applied data science.

Getting Started


Check out the Open Weights Classification model, Open Weights Regression model and Blog. All credit for this research goes to the researchers of this project.

The post Amazon Researchers Reveal Mitra: Advancing Tabular Machine Learning with Synthetic Priors appeared first on MarkTechPost.
