Deep Recommender Models Inference: Automatic Asymmetric Data Flow Optimization

cs.AI updates on arXiv.org, July 3, 12:07

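The paper proposes an optimization for the embedding-layer performance bottleneck in deep recommender models. Through tailored data flows and a framework design, it significantly speeds up embedding look-ups, and experiments show significant performance gains on Huawei Ascend AI accelerators.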

arXiv:2507.01676v1 Announce Type: cross

Abstract: Deep Recommender Models (DLRMs) inference is a fundamental AI workload, accounting for more than 79% of the total AI workload in Meta's data centers. The performance bottleneck of DLRMs lies in the embedding layers, which perform many random memory accesses to retrieve small embedding vectors from tables of various sizes. We propose tailored data flows to speed up embedding look-ups. Namely, we propose four strategies to look up an embedding table effectively on one core, and a framework to automatically map the tables asymmetrically to the multiple cores of an SoC. We assess the effectiveness of our method on Huawei Ascend AI accelerators, comparing it with the default Ascend compiler, and we perform high-level comparisons with the Nvidia A100. Results show a speed-up ranging from 1.5x to 6.5x for real workload distributions, and more than 20x for extremely unbalanced distributions. Furthermore, the method proves to be much less dependent on the query distribution than the baseline.
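To make the bottleneck described in the abstract concrete, here is a minimal pure-Python sketch of what an embedding look-up does: for each table, gather a handful of rows at arbitrary indices and combine them. It is not the paper's method and does not reproduce any of its four strategies. The function names (`embedding_bag_sum`, `lookup_all`), the toy tables, and the choice of sum pooling are assumptions made for illustration only.

```python
from typing import List, Sequence


def embedding_bag_sum(table: Sequence[Sequence[float]],
                      indices: Sequence[int]) -> List[float]:
    """Gather the rows named by `indices` and sum them element-wise.

    Sum pooling is an assumption here; the abstract does not say how the
    retrieved vectors are combined.
    """
    dim = len(table[0])
    pooled = [0.0] * dim
    for idx in indices:
        row = table[idx]  # one data-dependent (effectively random) access per index
        for d in range(dim):
            pooled[d] += row[d]
    return pooled


def lookup_all(tables: Sequence[Sequence[Sequence[float]]],
               queries: Sequence[Sequence[int]]) -> List[List[float]]:
    """Run one look-up per table; tables may differ widely in size."""
    return [embedding_bag_sum(t, q) for t, q in zip(tables, queries)]


# Hypothetical toy data: one tiny table and one larger table.
small_table = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
large_table = [[float(i), float(-i)] for i in range(1000)]

result = lookup_all([small_table, large_table], [[0, 2], [999, 1, 1]])
print(result)
```

Each look-up does very little arithmetic per byte fetched, and the row addresses depend on the query, which is why memory access rather than compute tends to dominate. Because tables differ in size and in how often they are queried, splitting them evenly across cores can leave some cores idle, which is the motivation the abstract gives for an asymmetric mapping.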

Related tags

Deep recommender models, data flow design, embedding look-up, performance optimization, Huawei Ascend
