MarkTechPost@AI, July 24, 2024
Yandex Introduces TabReD: A New Benchmark for Tabular Machine Learning

 

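Researchers at Yandex and HSE University have introduced TabReD, a new benchmark designed to closely reflect industry-grade tabular data applications. TabReD consists of eight datasets from real-world applications spanning domains such as finance, food delivery, and real estate. The team has made the code and datasets publicly available on GitHub.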

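🤔 The TabReD benchmark is built from datasets drawn from Kaggle competitions and Yandex's ML applications. The researchers followed four rules when selecting datasets: the data must be tabular, feature engineering should match industry practice, datasets with data leakage are excluded, and each dataset must have timestamps and enough samples for a time-based split.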

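📈 The eight datasets in the TabReD benchmark are Homesite Insurance, Ecom Offers, HomeCredit Default, Sberbank Housing, Cooking Time, Delivery ETA, Maps Routing, and Weather. These datasets have two key practical properties that academic benchmarks often lack. First, they are split into train, validation, and test sets by timestamp, which is essential for accurate evaluation. Second, they contain more features as a result of extensive data acquisition and feature engineering.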

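🧪 The researchers tested recent deep learning methods for tabular data on TabReD to assess how they perform with time-based data splits and additional features. They concluded that time-based splits are crucial for proper evaluation. The choice of splitting strategy significantly affects every aspect of model comparison: absolute metric values, relative performance differences, standard deviations, and the relative ranking of models.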

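🚀 TabReD bridges the gap between academic research and industrial application in tabular machine learning. By providing a benchmark that closely matches real-world scenarios, it lets researchers develop and evaluate models that are more likely to perform well in production. This is crucial for the smooth adoption of new research findings in practical applications.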

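💡 The TabReD benchmark lays the groundwork for further research directions, such as continual learning, handling gradual temporal shifts, and improving feature selection and engineering techniques. It also highlights the need for robust evaluation protocols that better assess the true performance of ML models in dynamic, real-world settings.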

In recent years, research on tabular machine learning has grown rapidly. Yet, it still poses significant challenges for researchers and practitioners. Traditionally, academic benchmarks for tabular ML have not fully represented the complexities encountered in real-world industrial applications. 

Most available datasets either lack the temporal metadata necessary for time-based splits or come from less extensive data acquisition and feature engineering pipelines compared to common industry ML practices. This can influence the types and amounts of predictive, uninformative, and correlated features, impacting model selection. Such limitations can lead to overly optimistic performance estimates when models evaluated on these benchmarks are deployed in real-world ML production scenarios.

To address these gaps, researchers at Yandex and HSE University have introduced TabReD, a novel benchmark designed to closely reflect industry-grade tabular data applications. TabReD consists of eight datasets from real-world applications spanning domains such as finance, food delivery, and real estate. The team has made the code and datasets publicly available on GitHub.

Constructing the TabReD Benchmark

To construct TabReD, the researchers used datasets from Kaggle competitions and Yandex's ML applications. They followed four rules: datasets must be tabular; feature engineering should match industry practices; datasets with data leakage are excluded; and datasets must have timestamps and enough samples for time-based splits, which excludes those without future instances.

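The eight datasets in the TabReD benchmark are Homesite Insurance, Ecom Offers, HomeCredit Default, Sberbank Housing, Cooking Time, Delivery ETA, Maps Routing, and Weather.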

These datasets have two key practical properties often missing in academic benchmarks. First, they are split into train, validation, and test sets based on timestamps, which is essential for accurate evaluation. Second, they include more features as a result of extensive data acquisition and feature engineering efforts.
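A timestamp-based split can be sketched in a few lines. The snippet below is a minimal illustration, not TabReD's actual preprocessing: the helper name `time_based_split`, the `timestamp` field, the toy records, and the 70/15/15 fractions are all assumptions made for the example.

```python
from datetime import datetime, timedelta


def time_based_split(rows, ts_key="timestamp", train_frac=0.7, val_frac=0.15):
    """Split rows chronologically: oldest -> train, middle -> val, newest -> test.

    Hypothetical helper for illustration; the fractions are assumptions,
    not values taken from the TabReD paper.
    """
    ordered = sorted(rows, key=lambda r: r[ts_key])
    n = len(ordered)
    # round() guards against floating-point results such as 84.999...
    train_end = round(n * train_frac)
    val_end = round(n * (train_frac + val_frac))
    return ordered[:train_end], ordered[train_end:val_end], ordered[val_end:]


# Toy data: one record per day, deliberately stored out of order.
start = datetime(2024, 1, 1)
rows = [{"timestamp": start + timedelta(days=d), "feature": d % 7} for d in range(100)]
rows = rows[50:] + rows[:50]

train, val, test = time_based_split(rows)
print(len(train), len(val), len(test))
```

Because the rows are sorted before slicing, every validation record is newer than every training record, and every test record is newer than both. That ordering is what a random shuffle does not guarantee.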

Experimental Results and Future Research

The researchers tested recent deep learning methods for tabular data on the TabReD benchmark to assess their performance with time-based data splits and additional features.

They concluded that time-based data splits were crucial for proper evaluation. The choice of splitting strategy significantly affected all aspects of model comparison: absolute metric values, relative performance differences, standard deviations, and the relative ranking of models.
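To see why the splitting strategy can change absolute metric values, consider a toy case. The sketch below uses a made-up target that drifts upward over time and a trivial "predict the training mean" baseline; neither comes from the paper, and the interleaved holdout merely stands in for a random split so that the example stays deterministic.

```python
def mean_absolute_error(y_true, prediction):
    """MAE of a single constant prediction against a list of targets."""
    return sum(abs(y - prediction) for y in y_true) / len(y_true)


def evaluate(train, test):
    """Fit the simplest possible baseline (the training mean) and score it."""
    prediction = sum(train) / len(train)
    return mean_absolute_error(test, prediction)


# Synthetic target that drifts upward over time (y equals the time index).
# This is invented data for illustration, not a TabReD dataset.
targets = [float(t) for t in range(100)]

# Random-like split: every 5th point is held out, so test points are
# interleaved with training points across the whole time range.
random_train = [y for t, y in enumerate(targets) if t % 5 != 4]
random_test = [y for t, y in enumerate(targets) if t % 5 == 4]

# Time-based split: the most recent 20 points are held out.
time_train, time_test = targets[:80], targets[80:]

random_mae = evaluate(random_train, random_test)
time_mae = evaluate(time_train, time_test)
print(f"interleaved split MAE: {random_mae:.2f}")
print(f"time-based split MAE:  {time_mae:.2f}")
```

On this toy series the time-based split reports the larger error, because the held-out period lies entirely outside the range the baseline saw during training. The interleaved split hides that drift, which is one way a benchmark can produce an overly optimistic estimate.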

The results identified an MLP with embeddings for continuous features as a simple yet effective deep learning baseline, while more advanced models performed less impressively in this setting.
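The article does not describe the embedding scheme. One family described in the tabular deep learning literature maps each scalar feature to sine and cosine components at several frequencies before passing it to the MLP. The sketch below illustrates only that mapping, in plain Python. The function names, the fixed hand-picked frequencies, and the omission of trainable parameters are simplifying assumptions, not the authors' implementation.

```python
import math


def periodic_embedding(x, frequencies):
    """Map a scalar feature to [sin(2*pi*c*x) ..., cos(2*pi*c*x) ...].

    In a real model the frequencies would typically be trainable and the
    result would pass through further layers; here they are fixed assumptions.
    """
    angles = [2.0 * math.pi * c * x for c in frequencies]
    return [math.sin(a) for a in angles] + [math.cos(a) for a in angles]


def embed_row(row, frequencies):
    """Embed every continuous feature in a row and concatenate the results."""
    embedded = []
    for value in row:
        embedded.extend(periodic_embedding(value, frequencies))
    return embedded


frequencies = [0.5, 1.0, 2.0]  # hypothetical, hand-picked values
row = [0.25, -1.3]             # two continuous features
mlp_input = embed_row(row, frequencies)
print(len(mlp_input))  # 2 features * 2 * 3 frequencies = 12 values
```

The point of such an embedding is that the MLP receives a richer, bounded representation of each number instead of a single raw scalar. The resulting vector would then be fed to an ordinary MLP.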

TabReD bridges the gap between academic research and industrial application in tabular machine learning. It enables researchers to develop and evaluate models that are more likely to perform well in production environments by providing a benchmark that closely mirrors real-world scenarios. This is crucial for the streamlined adoption of new research findings in practical applications.

The TabReD benchmark sets the stage for exploring additional research avenues, such as continual learning, handling gradual temporal shifts, and improving feature selection and engineering techniques. It also highlights the need for developing robust evaluation protocols to better assess ML models’ true performance in dynamic, real-world settings.


Check out the Paper and GitHub. All credit for this research goes to the researchers of this project.

The post Yandex Introduces TabReD: A New Benchmark for Tabular Machine Learning appeared first on MarkTechPost.
