MarkTechPost@AI November 17, 2024
List of Large Mixture of Experts (MoE) Models: Architecture, Performance, and Innovations in Scalable AI Solutions

 

The article presents MoE models as a significant breakthrough in machine learning, surveying the architecture and performance of several MoE models, including how they activate parameters, their advantages, the characteristics of individual models, and their LMSYS rankings.

MoE models balance computational efficiency with scalability by activating only a subset of their parameters

Surveys several MoE models, such as Google's Switch-C Transformer, covering parameter counts, expert counts, and related details

Notes the performance of selected models and their rankings on LMSYS

Emphasizes that MoE models are a new frontier in AI and encourages developers to explore them according to their specific needs

Mixture of Experts (MoE) models represent a significant breakthrough in machine learning, offering an efficient approach to handling large-scale models. Unlike dense models, where all parameters are active during inference, MoE models activate only a fraction of their parameters, maintaining a large total parameter count while keeping per-token compute low. This balance of computational efficiency and scalability makes MoE models highly attractive for a wide range of use cases. The design introduces unique trade-offs, including increased architectural complexity, but it gives developers and researchers greater flexibility.
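The core routing idea is straightforward: a learned gate scores every expert for each input, and only the top-k experts actually run. A minimal sketch in plain Python, with toy list-based experts and a toy dot-product gate standing in for a learned router (purely illustrative, not any particular model's implementation):

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def moe_forward(x, experts, gate_weights, k=2):
    """Route input x to the top-k experts by gate score.

    experts: list of callables (stand-ins for expert networks);
    gate_weights: one score vector per expert (a stand-in for a
    learned router). Only k experts run, so compute scales with k,
    not with the total number of experts.
    """
    # Gate score: dot product of the input with each expert's gate vector.
    scores = [sum(xi * wi for xi, wi in zip(x, w)) for w in gate_weights]
    # Pick the k highest-scoring experts.
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    # Renormalize the selected gates and mix the chosen experts' outputs.
    probs = softmax([scores[i] for i in top])
    out = [0.0] * len(x)
    for p, i in zip(probs, top):
        y = experts[i](x)
        out = [o + p * yi for o, yi in zip(out, y)]
    return out, top
```

With eight experts and k=2, this matches the routing pattern the article describes for Grok-1 and Mixtral 8x22b: per-token compute grows with k, while total capacity grows with the expert count.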

Let’s explore the largest MoE models released to date, focusing on their architecture, capabilities, and relative performance. These models are all publicly available and exceed 100 billion parameters. The analysis is ordered chronologically by release date, with rankings provided where available from the LMSYS leaderboard as of November 4, 2024.

Google’s Switch-C Transformer is one of the earliest models in the MoE space. Released on Hugging Face in November 2022, it boasts a staggering 1.6 trillion total parameters, supported by 2048 experts. Despite being an early innovator in this domain, Switch-C is now considered outdated, as it is not ranked on modern benchmarks like LMSYS. However, it remains noteworthy as a foundational MoE model and continues to influence subsequent innovations. Smaller variants of the Switch-C Transformer are also available, offering more accessible entry points for experimentation.

In March 2024, X AI released Grok-1, a model with 314 billion total parameters and 86 billion active during inference. In contrast to Switch-C's thousands of experts, Grok-1 uses a small pool of eight experts, only two of which are active per token. Its 8k context length is suitable for moderately long input sequences, though it is not competitive with newer models. While Grok-1 has limited adoption and is not ranked on LMSYS, its successor, Grok-2, has shown promise in preliminary benchmarks. Grok-2, yet to be publicly released, has ranked fifth overall in specific LMSYS tasks, suggesting that future iterations of this model could redefine performance benchmarks in the MoE landscape.

Shortly after Grok-1, Databricks released DBRX in late March 2024. This model features 132 billion total parameters, with 36 billion active, spread across 16 experts. Its 32k context length significantly outpaces many contemporaries, allowing it to process longer input sequences efficiently. DBRX is supported by multiple backends, including llama.cpp, exllama v2, and vLLM, making it a versatile choice for developers. Despite its strong architecture, its LMSYS rankings place it only at 90th overall and 78th for hard prompts in English, indicating room for improvement in quality and adoption.

April 2024 saw the release of Mistral AI’s Mixtral 8x22b. This model stands out with its 141 billion total parameters and 39 billion active during inference. It incorporates eight experts, two of which are chosen dynamically based on the input. With a 64k context length, Mixtral is well-suited for tasks requiring extensive input handling. While its LMSYS rankings, 70th overall and 66th on hard prompts, indicate middling performance, its compatibility with multiple backends ensures usability across diverse platforms.

Another April release was Snowflake’s Arctic, an MoE model with 480 billion total parameters but only 17 billion active during inference. Arctic’s design combines a dense component (10 billion parameters) with sparse expert layers (7 billion active parameters) distributed among 128 experts. However, its performance falls short, ranking 99th overall on LMSYS and a notably low 101st for hard prompts. Its limited 4k context length further restricts its applicability, making it a less competitive option despite its innovative architecture.

Skywork joined the MoE space in June 2024 with the release of Skywork-MoE. This model features 146 billion total parameters, of which 22 billion are active, distributed across 16 experts. With an 8k context length, it supports moderately lengthy tasks but lacks LMSYS rankings, which suggests limited testing or adoption. The base model is the only available version, as the promised chat variant has yet to be released.

In August 2024, AI21 Labs released Jamba 1.5 Large, a hybrid model that merges MoE and mamba-transformer architectures. With 398 billion total parameters and 98 billion active, Jamba 1.5 Large offers an exceptional 256k context length, making it ideal for tasks requiring extensive input processing. Its LMSYS rankings reflect its high performance, placing 34th overall and 28th for hard prompts. Additionally, Jamba models excel in context benchmarks, particularly the RULER context benchmark, solidifying their reputation for long-context tasks.

DeepSeek V2.5, released in September 2024, currently leads the MoE space in performance. This model incorporates 236 billion total parameters, with 21 billion active during inference. Its architecture includes 160 experts, of which six are dynamically chosen and two are shared, resulting in eight active experts per token. With a 128k context length, DeepSeek V2.5 demonstrates robust capabilities for long-context tasks. It ranks 18th overall on LMSYS and 6th for hard prompts, outperforming all other available MoE models. Earlier iterations, such as DeepSeek V2, laid the groundwork for its success.
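The shared-plus-routed pattern described here can be sketched in a few lines: shared experts run for every token, while a gate picks the top-k of a larger routed pool. This is a toy illustration of the general idea with hypothetical shapes and list-based experts, not the released implementation:

```python
import math

def _topk_mix(x, experts, gate_weights, k):
    """Gate, select, and mix the top-k experts from a routed pool."""
    scores = [sum(a * b for a, b in zip(x, w)) for w in gate_weights]
    top = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    m = max(scores[i] for i in top)
    exps = [math.exp(scores[i] - m) for i in top]
    s = sum(exps)
    out = [0.0] * len(x)
    for e, i in zip(exps, top):
        y = experts[i](x)
        out = [o + (e / s) * yi for o, yi in zip(out, y)]
    return out

def shared_moe_forward(x, shared_experts, routed_experts, gate_weights, k=6):
    """Shared experts always run; routed experts are gated top-k.

    With k=6 routed plus 2 shared experts, 8 experts are active per
    token, matching the split described in the text above.
    """
    out = [0.0] * len(x)
    for ex in shared_experts:          # shared experts: no gating
        y = ex(x)
        out = [o + yi for o, yi in zip(out, y)]
    routed = _topk_mix(x, routed_experts, gate_weights, k)
    return [o + r for o, r in zip(out, routed)]
```

The design motivation usually cited for shared experts is to capture common knowledge every token needs, so the routed experts can specialize.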

The most recent addition to the MoE family is Tencent’s Hunyuan Large, released in November 2024. With 389 billion total parameters and 52 billion active, Hunyuan Large employs a distinctive design in which one expert is chosen dynamically and one is shared, giving two active experts per token. Its 128k context length matches that of DeepSeek V2.5, positioning it as a strong competitor. While it is not yet ranked on LMSYS, early indications suggest it could rival or surpass DeepSeek’s performance.

Among the MoE models discussed, DeepSeek V2.5 is the most robust option currently available. However, newer models such as Hunyuan Large and the anticipated Grok-2 may soon shift the rankings. Models like Jamba 1.5 Large also highlight the strengths of hybrid architectures, particularly in tasks requiring extensive context handling. The LMSYS rankings, while useful for initial comparisons, do not capture every nuance of model performance, especially for specialized tasks.
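One concrete way to compare the models surveyed above is the fraction of parameters active per token, computed here directly from the figures quoted in this article:

```python
# Active vs. total parameters (billions), as quoted in the text above.
models = {
    "Grok-1": (86, 314),
    "DBRX": (36, 132),
    "Mixtral 8x22b": (39, 141),
    "Arctic": (17, 480),
    "Skywork-MoE": (22, 146),
    "Jamba 1.5 Large": (98, 398),
    "DeepSeek V2.5": (21, 236),
    "Hunyuan Large": (52, 389),
}

for name, (active, total) in models.items():
    print(f"{name}: {active / total:.1%} of parameters active per token")
```

The spread is striking: Arctic activates under 4% of its parameters per token, while Jamba 1.5 Large activates roughly a quarter, which is one reason raw total parameter counts are a poor basis for comparing MoE models.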

In conclusion, MoE models represent a growing frontier in AI, offering scalable and efficient solutions tailored to diverse applications. Developers and researchers are encouraged to explore these models based on specific use cases, leveraging their unique architectures to optimize performance. As the field evolves, the MoE landscape will likely witness further innovations, pushing the boundaries of what these architectures can achieve.


