MarkTechPost@AI, May 31, 04:20
Yandex Releases Yambda: The World’s Largest Event Dataset to Accelerate Recommender Systems

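Yandex has released Yambda, the world's largest publicly available dataset for recommender system research. It contains nearly 5 billion anonymized user interaction events from Yandex Music and aims to bridge the gap between academic research and industry-scale applications. Yambda addresses the field's lag caused by the scarcity of large open datasets by providing a high-quality, comprehensive dataset with rich features and anonymization safeguards. With user interactions, audio embeddings, organic interaction flags, and precise timestamps, it gives researchers and developers a valuable resource for building more effective and personalized recommender systems.

💾 Yandex released Yambda, which contains 4.79 billion anonymized user interactions from Yandex Music, covering 1 million users and 9.4 million tracks, with both explicit (like/dislike) and implicit (listen) feedback.

🎵 The Yambda dataset includes user interactions, anonymized audio embeddings (vector representations of tracks derived from convolutional neural networks), organic interaction flags (which distinguish tracks a user discovered independently from tracks surfaced by recommendations), and precise timestamps. It is distributed in Apache Parquet format for convenient big data processing.

⏱️ Yandex uses a Global Temporal Split (GTS) evaluation strategy in Yambda. Splitting the data by timestamp preserves the integrity of event sequences, simulates real-world conditions, prevents future data from leaking into training, and gives a more realistic picture of algorithm performance.

📊 Yandex provides a set of baseline models, including MostPop, DecayPop, ItemKNN, iALS, BPR, SANSA, and SASRec, evaluated with standard metrics such as NDCG@k, Recall@k, and Coverage@k, so that researchers can quickly assess the performance of new algorithms.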

Yandex has recently made a significant contribution to the recommender systems community by releasing Yambda, the world’s largest publicly available dataset for recommender system research and development. This dataset is designed to bridge the gap between academic research and industry-scale applications, offering nearly 5 billion anonymized user interaction events from Yandex Music — one of the company’s flagship streaming services with over 28 million monthly users.

Why Yambda Matters: Addressing a Critical Data Gap in Recommender Systems

Recommender systems underpin the personalized experiences of many digital services today, from e-commerce and social networks to streaming platforms. These systems rely heavily on massive volumes of behavioral data, such as clicks, likes, and listens, to infer user preferences and deliver tailored content.

However, the field of recommender systems has lagged behind other AI domains, like natural language processing, largely due to the scarcity of large, openly accessible datasets. Unlike large language models (LLMs), which learn from publicly available text sources, recommender systems need sensitive behavioral data — which is commercially valuable and hard to anonymize. As a result, companies have traditionally guarded this data closely, limiting researchers’ access to real-world-scale datasets.

Existing datasets such as Spotify’s Million Playlist Dataset, Netflix Prize data, and Criteo’s click logs are too small, lack temporal detail, or are too poorly documented to support the development of production-grade recommender models. Yandex’s release of Yambda addresses these challenges by providing a high-quality, extensive dataset with a rich set of features and anonymization safeguards.

What Yambda Contains: Scale, Richness, and Privacy

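The Yambda dataset comprises 4.79 billion anonymized user interactions collected over a 10-month period. These events come from roughly 1 million users interacting with nearly 9.4 million tracks on Yandex Music. The dataset includes user interactions with both explicit (like/dislike) and implicit (listen) feedback, anonymized audio embeddings (vector representations of tracks derived from convolutional neural networks), organic interaction flags that distinguish tracks a user discovered independently from tracks surfaced by recommendations, and precise timestamps.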

All user and track identifiers are anonymized using numeric IDs to comply with privacy standards, ensuring no personally identifiable information is exposed.
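To make the idea of numeric IDs concrete, here is a minimal sketch of replacing raw identifiers with opaque integers. The article does not describe how Yandex actually performs anonymization, so the function name `build_id_map`, the shuffling step, and the toy identifiers are all assumptions made for illustration only.

```python
import random

def build_id_map(raw_ids, seed=0):
    """Map each distinct raw identifier to an integer in 0..n-1.

    Hypothetical helper: the distinct IDs are shuffled first so the
    assigned numbers do not reflect sort order or first-seen order.
    """
    distinct = sorted(set(raw_ids))
    random.Random(seed).shuffle(distinct)
    return {raw: new_id for new_id, raw in enumerate(distinct)}

# Toy raw events: (user identifier, track identifier). Values are made up.
raw_events = [
    ("alice@example.com", "track:abc"),
    ("bob@example.com", "track:abc"),
    ("alice@example.com", "track:xyz"),
]

user_map = build_id_map(u for u, _ in raw_events)
track_map = build_id_map(t for _, t in raw_events)

# The released form keeps only the integers, never the raw identifiers.
anonymized = [(user_map[u], track_map[t]) for u, t in raw_events]
```

The same user or track always maps to the same integer, so interaction patterns survive while the raw identifiers do not. Replacing identifiers alone is not a full privacy guarantee; it only illustrates what "numeric IDs" means in practice.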

The dataset is provided in Apache Parquet format, which is optimized for big data processing frameworks like Apache Spark and Hadoop, and also compatible with analytical libraries such as Pandas and Polars. This makes Yambda accessible for researchers and developers working in diverse environments.

Evaluation Method: Global Temporal Split

A key innovation in Yandex’s dataset is the adoption of a Global Temporal Split (GTS) evaluation strategy. In typical recommender system research, the widely used Leave-One-Out method removes the last interaction of each user for testing. However, this approach disrupts the temporal continuity of user interactions, creating unrealistic training conditions.

GTS, on the other hand, splits the data based on timestamps, preserving the entire sequence of events. This approach mimics real-world recommendation scenarios more closely because it prevents any future data from leaking into training and allows models to be tested on truly unseen, chronologically later interactions.

This temporal-aware evaluation is essential for benchmarking algorithms under realistic constraints and understanding their practical effectiveness.
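The difference between the two splitting strategies can be sketched on a toy event log. This is an illustrative sketch, not Yandex's evaluation code: the `(user_id, item_id, timestamp)` tuple layout, the cutoff value, and the helper names are assumptions.

```python
from collections import defaultdict

# Hypothetical toy events: (user_id, item_id, timestamp).
events = [
    (1, 10, 100), (1, 11, 200), (1, 12, 900),
    (2, 10, 150), (2, 13, 300),
    (3, 14, 850), (3, 15, 950),
]

def global_temporal_split(events, cutoff):
    """Train on events strictly before `cutoff`, test on the rest."""
    train = [e for e in events if e[2] < cutoff]
    test = [e for e in events if e[2] >= cutoff]
    return train, test

def leave_one_out_split(events):
    """Hold out each user's last event, whenever it happened."""
    by_user = defaultdict(list)
    for e in sorted(events, key=lambda e: e[2]):
        by_user[e[0]].append(e)
    train, test = [], []
    for history in by_user.values():
        train.extend(history[:-1])
        test.append(history[-1])
    return train, test

def has_future_leak(train, test):
    """True if some training event is later than some test event."""
    return max(t for _, _, t in train) > min(t for _, _, t in test)

gts_train, gts_test = global_temporal_split(events, cutoff=800)
loo_train, loo_test = leave_one_out_split(events)

print(has_future_leak(gts_train, gts_test))  # False
print(has_future_leak(loo_train, loo_test))  # True
```

In the Leave-One-Out split, user 2's held-out event at timestamp 300 is evaluated by a model that was trained on user 3's event at timestamp 850, which is the kind of leakage a single global cutoff rules out.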

Baseline Models and Metrics Included

To support benchmarking and accelerate innovation, Yandex provides baseline recommender models implemented on the dataset, including MostPop, DecayPop, ItemKNN, iALS, BPR, SANSA, and SASRec.
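The two simplest baselines, MostPop and DecayPop, are popularity rankers, and a rough sketch shows how they can differ. The article does not give their exact formulation, so the exponential half-life weighting, the parameter names, and the toy data below are assumptions rather than Yandex's implementation.

```python
from collections import Counter, defaultdict

def _top_k(scores, k):
    # Highest score first; ties broken by item id for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [item for item, _ in ranked[:k]]

def most_pop(train, k):
    """Rank items by raw interaction count."""
    return _top_k(Counter(item for _, item, _ in train), k)

def decay_pop(train, k, now, half_life):
    """Rank items by count with older events exponentially down-weighted.

    An event that is `half_life` time units old counts half as much
    as an event happening at `now` (assumed weighting scheme).
    """
    scores = defaultdict(float)
    for _, item, ts in train:
        scores[item] += 0.5 ** ((now - ts) / half_life)
    return _top_k(scores, k)

# Hypothetical events: (user_id, item_id, timestamp).
train = [
    (1, 10, 100), (2, 10, 150), (3, 10, 200),  # item 10: popular long ago
    (1, 11, 900), (2, 11, 950),                # item 11: popular recently
    (3, 12, 990),                              # item 12: one very recent play
]

print(most_pop(train, k=2))                             # [10, 11]
print(decay_pop(train, k=2, now=1000, half_life=100))   # [11, 12]
```

Both rankers recommend the same list to every user; the decayed variant simply favors what is popular now over what was popular months ago.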

These baselines are evaluated using standard recommender metrics such as NDCG@k, Recall@k, and Coverage@k.

Providing these benchmarks helps researchers quickly gauge the performance of new algorithms relative to established methods.
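For readers unfamiliar with NDCG@k, Recall@k, and Coverage@k, the following sketch computes them with binary relevance. These are common textbook definitions; the exact variants used in the Yambda benchmarks (for example, how Recall@k is normalized) are not stated in the article, so treat the details as assumptions.

```python
import math

def recall_at_k(recommended, relevant, k):
    """Fraction of a user's relevant items that appear in the top k."""
    if not relevant:
        return 0.0
    hits = sum(1 for item in recommended[:k] if item in relevant)
    return hits / len(relevant)

def ndcg_at_k(recommended, relevant, k):
    """Binary-relevance NDCG: hits near the top of the list count more."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, item in enumerate(recommended[:k])
        if item in relevant
    )
    ideal_hits = min(k, len(relevant))
    idcg = sum(1.0 / math.log2(rank + 2) for rank in range(ideal_hits))
    return dcg / idcg if idcg > 0 else 0.0

def coverage_at_k(all_recommendations, catalog_size, k):
    """Share of the catalog that shows up in anyone's top k."""
    distinct = {item for recs in all_recommendations for item in recs[:k]}
    return len(distinct) / catalog_size

# Toy example: one user's ranked list and held-out relevant items.
recommended = [10, 11, 12]
relevant = {10, 12}

recall = recall_at_k(recommended, relevant, k=3)
ndcg = ndcg_at_k(recommended, relevant, k=3)
coverage = coverage_at_k([[10, 11, 12], [10, 13, 14]], catalog_size=10, k=2)
```

Recall and NDCG are computed per user and then averaged, while coverage is a property of the whole set of recommendation lists: a model can score well on the first two by recommending only hits and still cover very little of the catalog.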

Broad Applicability Beyond Music Streaming

While the dataset originates from a music streaming service, its value extends far beyond that domain. The interaction types, user behavior dynamics, and large scale make Yambda a universal benchmark for recommender systems across sectors like e-commerce, video platforms, and social networks. Algorithms validated on this dataset can be generalized or adapted to various recommendation tasks.

Benefits for Different Stakeholders

My Wave: Yandex’s Personalized Recommender System

Yandex Music uses a proprietary recommender system called My Wave, which incorporates deep neural networks and AI to personalize music suggestions. My Wave analyzes thousands of factors when choosing what to play next.

This system dynamically adapts to individual tastes by identifying audio similarities and predicting preferences, demonstrating the kind of complex recommendation pipeline that benefits from large-scale datasets like Yambda.

Ensuring Privacy and Ethical Use

The release of Yambda underscores the importance of privacy in recommender system research. Yandex anonymizes all data with numeric IDs and omits personally identifiable information. The dataset contains only interaction signals without revealing exact user identities or sensitive attributes.

This balance between openness and privacy allows for robust research while protecting individual user data, a critical consideration for the ethical advancement of AI technologies.

Access and Versions

Yandex offers the Yambda dataset in three sizes to accommodate different research and computational capacities.

All versions are accessible via Hugging Face, a popular platform for hosting datasets and machine learning models, enabling easy integration into research workflows.

Conclusion

Yandex’s release of the Yambda dataset marks a pivotal moment in recommender system research. By providing an unprecedented scale of anonymized interaction data paired with temporal-aware evaluation and baselines, it sets a new standard for benchmarking and accelerating innovation. Researchers, startups, and enterprises alike can now explore and develop recommender systems that better reflect real-world usage and deliver enhanced personalization.

As recommender systems continue to influence countless online experiences, datasets like Yambda play a foundational role in pushing the boundaries of what AI-powered personalization can achieve.

Check out the Yambda Dataset on Hugging Face. 


Note: Thanks to the Yandex team for the thought leadership and resources for this article. The Yandex team has supported and sponsored this content.

The post Yandex Releases Yambda: The World’s Largest Event Dataset to Accelerate Recommender Systems appeared first on MarkTechPost.
