MarkTechPost@AI · April 15, 07:00
THUDM Releases GLM 4: A 32B Parameter Model Competing Head-to-Head with GPT-4o and DeepSeek-V3

The newly released GLM-4 series from Tsinghua University, particularly the 32-billion-parameter GLM-Z1-32B-0414, has drawn attention in the large language model field. Trained on a dataset of 15 trillion tokens, the model offers strong multilingual capabilities and innovative reasoning strategies, with performance rivaling much larger models such as GPT-4o and DeepSeek-V3. GLM-4 applies techniques such as rejection sampling and reinforcement learning to improve performance on agentic tasks, coding, function calling, and search-based question answering, and introduces a "rumination" capability that strengthens its handling of open-ended, complex problems. Its permissive MIT license further adds to its appeal.

🧠 The GLM-4 series, notably GLM-Z1-32B-0414, is a new generation of large language models from Tsinghua University. Trained on a 15-trillion-token dataset, it aims to balance computational efficiency with strong reasoning and instruction-following capabilities.

🌍 GLM-4 offers strong multilingual support and adopts an innovative "thinking mode." The GLM-Z1-Rumination-32B-0414 variant introduces a "rumination" method that sustains long, reflective reasoning, making it better suited to open-ended, complex tasks.

💡 GLM-4 improves performance through several techniques, including rejection sampling and reinforcement learning. It scores 87.6 on the IFEval instruction-following benchmark and achieves strong results on task-automation benchmarks such as TAU-Bench in both retail and airline scenarios.

⚖️ GLM-4 performs strongly across multiple benchmarks, competing with models such as GPT-4o, DeepSeek-V3, and Qwen2.5-Max. GLM-4-32B-0414 scores 88.1 on the SimpleQA search-augmented question-answering task, approaches GPT-4o on the BFCL-v3 function-calling benchmark, and reaches a 33.8% success rate on SWE-bench code repair.

In the rapidly evolving landscape of large language models (LLMs), researchers and organizations face significant challenges. These include enhancing reasoning abilities, providing robust multilingual support, and efficiently managing complex, open-ended tasks. Although smaller models are often more accessible and cost-effective, they typically fall short in performance when compared to their larger counterparts. Hence, there is a growing emphasis on developing mid-sized models that effectively balance computational efficiency with strong reasoning and instruction-following capabilities.

The recent release of GLM 4 from Tsinghua University, particularly the GLM-Z1-32B-0414 variant, addresses these challenges effectively. Trained on a substantial dataset of 15 trillion tokens, GLM 4 is designed to offer reliable multilingual capabilities and incorporates innovative reasoning strategies referred to as "thinking mode." This release positions GLM 4 alongside other notable models like DeepSeek Distill, QwQ, and O1-mini, and is distributed under the widely respected MIT license. Notably, despite its relatively moderate size of 32 billion parameters, GLM 4 demonstrates performance comparable to much larger models such as GPT-4o and DeepSeek-V3 (the latter a mixture-of-experts model with 671 billion total parameters), particularly on reasoning-centric benchmarks.

On a technical level, GLM-Z1-32B-0414 leverages extensive high-quality training data, including synthetically generated reasoning tasks, to strengthen analytical capabilities. The model integrates sophisticated techniques such as rejection sampling and reinforcement learning (RL) to improve performance in agent-based tasks, coding, function calling, and search-driven question-answering tasks. Additionally, its “Deep Reasoning Model” variation further refines this by employing cold-start methods combined with extended RL training, specifically targeted at complex mathematical, logical, and coding tasks. Pairwise ranking feedback mechanisms are employed during training to enhance the model’s general reasoning effectiveness.
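The article does not describe the actual rejection-sampling pipeline, but the general idea can be sketched in a few lines: sample several candidate answers per prompt, keep only those an automatic verifier accepts, and use the survivors as training data. The sketch below is a minimal toy illustration under stated assumptions; `generate_candidates` and `verifier` are hypothetical stand-ins for a real model and a real checker (e.g. unit tests or a math grader), not part of GLM 4's codebase.

```python
import random

def generate_candidates(prompt, n, rng):
    """Stand-in for model sampling: propose n candidate answers.
    The toy 'model' guesses sums with occasional arithmetic slips."""
    a, b = prompt
    return [a + b + rng.choice([0, 0, 0, 1, -1]) for _ in range(n)]

def verifier(prompt, answer):
    """Automatic checker: accepts only exactly correct answers."""
    a, b = prompt
    return answer == a + b

def rejection_sample(prompts, n_candidates=8, seed=0):
    """Keep only (prompt, answer) pairs that pass the verifier;
    the accepted pairs form the synthetic fine-tuning dataset."""
    rng = random.Random(seed)
    dataset = []
    for prompt in prompts:
        for cand in generate_candidates(prompt, n_candidates, rng):
            if verifier(prompt, cand):
                dataset.append((prompt, cand))
    return dataset

data = rejection_sample([(2, 3), (10, 7), (41, 1)])
print(len(data), "accepted samples")
```

The key property is that every retained example is verified correct, so the filtered set can be used for supervised fine-tuning without propagating the sampler's mistakes.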

An advanced variant, GLM-Z1-Rumination-32B-0414, introduces a novel approach termed “rumination,” enabling prolonged reflective reasoning for tackling open-ended, complex queries like comparative AI-driven urban analysis. This variant integrates advanced search tools with multi-objective reinforcement learning, significantly enhancing its utility in research-intensive tasks and complex retrieval-based scenarios. Complementing these larger models, the GLM-Z1-9B-0414 version, with its 9 billion parameters, provides strong mathematical and general reasoning capabilities, demonstrating the practicality of smaller-scale models.

Performance data from benchmark evaluations emphasize the strengths of the GLM 4 series. Specifically, GLM-4-32B-0414 shows robust results compared to GPT-4o, DeepSeek-V3, and Qwen2.5-Max across multiple benchmarks. On the IFEval instruction-following benchmark, GLM 4 scores an impressive 87.6. In task automation benchmarks such as TAU-Bench, GLM 4 achieves strong scores in scenarios like retail (68.7) and airline (51.2). For search-augmented question-answering tasks, as evaluated by SimpleQA, the model records a high score of 88.1. Additionally, GLM 4 closely matches GPT-4o’s performance in function-calling tasks evaluated by the BFCL-v3 benchmark, securing an overall score of 69.6. In practical code repair scenarios tested through SWE-bench with the Moatless framework, GLM 4 achieves a success rate of 33.8%, underscoring its practical value.

In summary, GLM 4 presents itself as an effective family of language models, successfully bridging the performance gap between smaller, more accessible models and the traditionally superior larger-scale counterparts. The GLM-Z1 series, especially the 32B variant, exemplifies this balanced approach by providing powerful reasoning capabilities while maintaining computational affordability. With the added advantage of its permissive MIT license, GLM 4 is positioned as a robust tool for research and enterprise applications requiring high-performance AI solutions without the extensive computational overhead traditionally associated with larger models.


Check out the GLM-4-Z1-32B-0414 model and the other models in the series. All credit for this research goes to the researchers of this project.

