MarkTechPost@AI · January 27
Autonomy-of-Experts (AoE): A Router-Free Paradigm for Efficient and Adaptive Mixture-of-Experts Models

 

This article introduces Autonomy-of-Experts (AoE), a new Mixture-of-Experts (MoE) paradigm designed to address the inefficiency caused by separating routing decisions from expert execution in traditional MoE models. In AoE, experts autonomously decide whether to process an input based on the scale of their internal activations, with no router required. By computing internal activations for all inputs and selecting the experts with the highest activation norms, AoE makes expert selection more effective, while low-rank weight factorization reduces the computational and memory overhead. Experiments show that pre-trained AoE models outperform traditional MoE models in both efficiency and downstream tasks, demonstrating better expert selection and overall learning efficiency.

💡Traditional MoE models rely on a router to assign inputs to specific experts, but the router has no direct knowledge of the experts' capabilities, which leads to inaccurate assignments and hurts model efficiency.

🎯The core innovation of AoE is that experts autonomously decide whether to process an input based on the scale of their own activations, removing the router and selecting the most suitable experts by ranking activation norms.

⚙️AoE reduces computational and memory overhead through low-rank weight factorization and caching of low-dimensional activations, improving efficiency while preserving model performance.

🚀Experimental results show that AoE models outperform traditional MoE models on downstream tasks and in training efficiency, demonstrating advantages in both expert selection and overall learning efficiency.

Mixture-of-Experts (MoE) models use a router to allocate tokens to specific expert modules, activating only a subset of parameters and often achieving better efficiency and performance than dense models. In these models, a large feed-forward network is divided into smaller expert networks, and the router, typically an MLP classifier, determines which expert processes each input. A key issue, however, arises from the router's separation from the experts' execution: without direct knowledge of the experts' capabilities, the router's assignments are essentially predictions made without supervision. Misassignments can hinder expert performance, forcing experts to adapt to poor assignments or the router to be refined iteratively, which leads to inefficiencies during training.
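To make the structural issue concrete, here is a minimal sketch of a conventional top-k gated MoE layer in PyTorch (names and layer shapes are illustrative, not the paper's code): the router is a separate linear classifier that scores experts without ever seeing what they compute.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKRouterMoE(nn.Module):
    """Conventional sparse MoE layer: a linear router scores the experts, and each
    token is dispatched to its top-k experts. The router never observes the experts'
    actual computations, which is the disconnect discussed above."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int, k: int = 2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, num_experts)  # separate from the experts
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (num_tokens, d_model)
        gate_probs = F.softmax(self.router(x), dim=-1)    # (num_tokens, num_experts)
        weights, indices = torch.topk(gate_probs, self.k, dim=-1)
        out = torch.zeros_like(x)
        for t in range(x.size(0)):                        # per-token dispatch (clarity over speed)
            for w, e in zip(weights[t].tolist(), indices[t].tolist()):
                out[t] += w * self.experts[e](x[t])       # only k experts run per token
        return out
```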

Researchers from Renmin University of China, Tencent, and Southeast University have introduced Autonomy-of-Experts (AoE), a new MoE paradigm where experts independently decide whether to process inputs. This approach leverages each expert’s awareness of its ability to handle tokens, reflected in the scale of its internal activations. In AoE, experts calculate internal activations for all inputs, and only the top-ranked ones, based on activation norms, proceed with further processing, eliminating the need for routers. The overhead from caching unused activations is reduced using low-rank weight factorization. With up to 4 billion parameters, pre-trained AoE models outperform traditional MoE models in efficiency and downstream tasks.

The study examines sparse MoE models, where each feed-forward network (FFN) module functions as an expert. Unlike dense MoE models, which utilize all parameters, sparse MoE models improve efficiency by activating only the most relevant experts for specific inputs. These models rely on a router to assign inputs to the appropriate experts, typically using a “token choosing Top-K experts” approach. A key challenge is maintaining balanced expert utilization, as routers often overuse certain experts, leading to inefficiencies. To address this, load-balancing mechanisms ensure a more equitable distribution of tasks among experts by incorporating auxiliary losses, thereby enhancing overall efficiency.
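One widely used form of such an auxiliary loss is the Switch-Transformer-style load-balancing term, which penalizes mismatch between how often each expert is chosen and how much probability the router assigns to it. The sketch below illustrates that common formulation; it is not necessarily the exact loss used in the paper.

```python
import torch

def load_balancing_loss(gate_probs: torch.Tensor, expert_idx: torch.Tensor, num_experts: int) -> torch.Tensor:
    """Auxiliary load-balancing loss (Switch-Transformer style): the sum over experts of
    (fraction of tokens routed to that expert) times (mean router probability for it),
    scaled by the number of experts. Balanced routing drives this term toward its minimum.

    gate_probs: (num_tokens, num_experts) softmax probabilities from the router
    expert_idx: (num_tokens,) long tensor with the expert chosen for each token
    """
    tokens_per_expert = torch.bincount(expert_idx, minlength=num_experts).float()
    load_fraction = tokens_per_expert / expert_idx.numel()   # f_i: share of tokens per expert
    prob_fraction = gate_probs.mean(dim=0)                   # P_i: average router probability
    return num_experts * torch.sum(load_fraction * prob_fraction)
```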

AoE is a method in which experts independently determine their own selection based on internal activation norms, eliminating the need for an explicit routing mechanism. Initial experiments revealed that the scale of activation norms at certain computational points reflects an expert's capability to process inputs effectively. AoE builds on this insight by ranking experts according to the L2 norms of their compressed activations and selecting the top-performing ones for full computation. By factorizing weight matrices and caching only low-dimensional activations, AoE significantly reduces computational and memory overhead while maintaining high efficiency, addressing limitations of traditional MoE frameworks.
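A minimal sketch of this selection scheme is shown below, assuming PyTorch and illustrative names; it is a reconstruction from the description above, not the authors' implementation. Each expert's up-projection is factorized into a low-rank pair, every expert computes only the cheap low-dimensional activation for every token, the L2 norm of that activation serves as the expert's self-assessed score, and only the top-k experts per token finish their forward pass from the cached activation.

```python
import torch
import torch.nn as nn

class AoEExpert(nn.Module):
    """AoE-style expert: the up-projection is factorized into a low-rank pair
    (d_model -> r, then r -> d_ff). The cheap r-dimensional activation is computed
    for every token, and its L2 norm acts as the expert's self-assessed score."""

    def __init__(self, d_model: int, d_ff: int, r: int):
        super().__init__()
        self.proj_low = nn.Linear(d_model, r, bias=False)   # cheap, computed for all tokens
        self.proj_up = nn.Linear(r, d_ff, bias=False)       # run only if the expert is selected
        self.proj_out = nn.Linear(d_ff, d_model, bias=False)

    def score(self, x: torch.Tensor):
        h = self.proj_low(x)                                 # (num_tokens, r), cached for reuse
        return h, h.norm(dim=-1)                             # L2 norm = self-assessed capability

    def forward_from_cache(self, h: torch.Tensor) -> torch.Tensor:
        return self.proj_out(torch.relu(self.proj_up(h)))    # finish only the selected path


def aoe_layer(x: torch.Tensor, experts: list, k: int = 2) -> torch.Tensor:
    """Router-free selection: every expert scores every token via its activation norm,
    and only the top-k experts per token complete their computation."""
    cached, scores = zip(*(e.score(x) for e in experts))     # low-dim activations + norms
    scores = torch.stack(scores, dim=-1)                     # (num_tokens, num_experts)
    topk = scores.topk(k, dim=-1).indices
    out = torch.zeros_like(x)
    for e_id, expert in enumerate(experts):
        mask = (topk == e_id).any(dim=-1)                    # tokens that selected this expert
        if mask.any():
            out[mask] += expert.forward_from_cache(cached[e_id][mask])
    return out
```

Note that under this scheme only the selected experts ever run their full d_ff-sized projections, and the r-dimensional activations computed for scoring are reused from the cache rather than recomputed, which is where the overhead reduction comes from.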

The research compares the AoE framework to traditional MoE models through experiments on smaller pre-trained language models. With a 12-layer model of 732 million parameters, eight experts per layer, and training on 100 billion tokens, AoE performs better than MoE in both downstream tasks and training efficiency. The best performance is achieved when the reduced (factorized) dimension is about one-third of the model's overall dimension. AoE also improves load balancing and expert utilization across layers, leading to better generalization and efficiency when combined with alternative expert selection methods.

In conclusion, AoE is an MoE framework designed to overcome a key limitation of traditional MoE models: the separation between the router's decisions and the experts' execution, which often results in inefficient expert selection and suboptimal learning. In AoE, experts select themselves autonomously based on their internal activation scales, eliminating the need for a router. The process involves pre-computing activations and ranking experts by their activation norms, so that only the top-ranking experts proceed with further computation. Efficiency is further improved through low-rank weight factorization. Pre-trained language models using AoE outperform conventional MoE models, demonstrating improved expert selection and overall learning efficiency.


Check out the Paper. All credit for this research goes to the researchers of this project.




Related tags

Mixture-of-Experts · Autonomy-of-Experts · Model Efficiency · Low-Rank Factorization · Deep Learning