EnterpriseAI, November 13, 2024
Red Hat Acquisition Charts Path to Open Accelerated AI

Red Hat has signed an agreement to acquire Neural Magic, which specializes in accelerating generative AI inference workloads. The article introduces Neural Magic's technology and related projects, such as DeepSparse and vLLM, along with Red Hat's plans and vision.

Red Hat is acquiring Neural Magic for its technology and open source philosophy

Neural Magic's DeepSparse uses CPUs to optimize neural network inference

vLLM is an open source project focused on optimizing model serving and execution

Red Hat plans to use the acquisition to advance AI and offer multiple forms of support

Red Hat has announced a signed agreement to acquire Neural Magic, a developer of software and algorithms that accelerate generative AI (gen AI) inference workloads. Red Hat says Neural Magic’s expertise in inference performance engineering and its commitment to open source align with Red Hat’s vision of high-performing AI workloads that map directly to customer-specific use cases and data across the hybrid cloud.

Neural Magic may not be familiar to the HPC-AI market. It was spun out of MIT in 2018 to build performant inference software for deep learning. Neural Magic is known for its DeepSparse runtime, a CPU inference package that takes advantage of sparsity to accelerate neural network inference.

Less Can Be Better

DeepSparse (GitHub) gets the most out of commodity CPU infrastructure (it performs best on AVX-512-capable CPUs) to run computer vision (CV), natural language processing (NLP), and large language models (LLMs). The DeepSparse engine is specifically designed to execute neural networks so that data movement across layers stays in the CPU cache. This optimization is based on the observation that neural network layers do not need to be dense (many connections) and can instead be sparse (like the human brain). Computation over sparse layers can be fast because many weights are zero and consume no CPU or GPU cycles.
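As a concrete illustration, the following is a minimal sketch of CPU inference with the deepsparse Python package. The task name and model path are placeholders chosen for illustration rather than values taken from the article, and the exact API may vary by package version.

```python
# Minimal sketch: CPU inference with the DeepSparse engine.
# The model path below is a placeholder for a sparsified ONNX model
# or SparseZoo stub; it is not taken from the article.
from deepsparse import Pipeline

# Build a pipeline that runs the network on the DeepSparse runtime
pipeline = Pipeline.create(
    task="text-classification",
    model_path="path/to/sparse-quantized-model.onnx",  # placeholder
)

# Zero weights are skipped at run time, keeping the work CPU-friendly
print(pipeline(sequences=["DeepSparse keeps layer traffic in CPU cache"]))
```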

Creating sparse networks can be done in three ways (a minimal pruning sketch follows the list):

    Pruning trains a dense network and then systematically removes the weights that contribute least to the network’s output (i.e., weights close to zero).
    Sparse training integrates sparsity directly into the training process, encouraging weights to move toward zero.
    Scheduled sparsity gradually increases the level of sparsity over the course of training.
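Here is a minimal sketch of one-shot magnitude pruning, the simplest form of the first approach. It is illustrative only; production pruning (for example, in Neural Magic's tooling) is typically iterative, layer-aware, and followed by fine-tuning to recover accuracy.

```python
# Minimal sketch of one-shot magnitude pruning: zero out the
# smallest-magnitude weights of a trained layer.
import numpy as np

def magnitude_prune(weights: np.ndarray, sparsity: float) -> np.ndarray:
    """Return a copy of `weights` with the smallest |w| values set to zero."""
    threshold = np.quantile(np.abs(weights), sparsity)
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0
    return pruned

w = np.random.randn(1024, 1024).astype(np.float32)
w_sparse = magnitude_prune(w, sparsity=0.9)  # keep roughly 10% of the weights
print(f"fraction of zero weights: {np.mean(w_sparse == 0):.1%}")
```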

While sparsity provides a computational speed-up and the ability to run on CPUs at GPU speeds, it is not without issues. The amount of sparsity directly affects model accuracy, and too much sparsity degrades the network’s performance. If the right sparsity level is found, the results can be impressive. The image below provides MLPerf results for DeepSparse on CPU (2P AMD Epyc 9654) and GPU (Nvidia T4). The CPUs ran 13 times faster than the (low-powered) GPU.

Even if a GPU were ten times faster than the T4 used here, DeepSparse on CPUs would still keep pace. The result is even more significant because there are far more spare CPU cycles than GPU cycles in the data center and the cloud.

Comparison of CPU (2P Epyc 9654) and GPU (Nvidia T4) benchmarks from MLPerf Inference v3.0 Datacenter results. MLPerf ID: 3.0-1472 (Open), 3.0-0110 (Closed), 3.0-1474 (Open). (Source: Neural Magic)

Then There is vLLM

Developed by UC Berkeley, vLLM (Virtual Large Language Model) is a community-driven open-source project (github) for open model serving (how gen AI models infer and solve problems), with support for all key model families, advanced inference acceleration research and diverse hardware backends including AMD GPUs, AWS Neuron, Google TPUs, Intel Gaudi, Nvidia GPUs and x86 CPUs. Neural Magic has contributed extensively to the vLLM project.

vLLM was first introduced in the paper Efficient Memory Management for Large Language Model Serving with PagedAttention, authored by Kwon et al. The paper identified the inefficiency of managing key-value (KV) cache memory in existing LLM serving systems, a limitation that often results in slow inference and a high memory footprint. vLLM aims to optimize the serving and execution of LLMs through efficient memory management. Its key features include:

    PagedAttention, which stores the attention KV cache in fixed-size blocks, much like virtual memory paging, to cut fragmentation and wasted memory.
    Continuous batching of incoming requests to keep hardware utilization high.
    Optimized kernels and support for quantized models to speed up execution.

Like DeepSparse, vLLM is focused on optimization and performance. And, as with DeepSparse, a pile of GPUs may not be required; vLLM supports diverse hardware backends (including CPUs).
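For a sense of how vLLM is used, here is a minimal sketch of offline batched inference with its Python API. The model name and prompts are placeholders, not values from the article.

```python
# Minimal sketch of offline batched inference with vLLM's Python API.
from vllm import LLM, SamplingParams

prompts = [
    "What problem does PagedAttention solve?",
    "Why batch LLM requests continuously?",
]
sampling_params = SamplingParams(temperature=0.8, max_tokens=64)

# The engine allocates the KV cache in paged blocks and batches requests
llm = LLM(model="facebook/opt-125m")  # placeholder model
for output in llm.generate(prompts, sampling_params):
    print(output.outputs[0].text)
```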

Back to the Agreement

Red Hat believes AI workloads must run wherever customer data lives across the hybrid cloud; this makes flexible, standardized, and open platforms and tools necessary, as they enable organizations to select the environments, resources, and architectures that best align with their unique operational and data needs. The plan is to make gen AI more accessible to more organizations through the open innovation of vLLM.

After this brief technical background, Red Hat’s interest in Neural Magic makes sense. With Neural Magic’s technology and performance engineering expertise, Red Hat can work to accelerate its vision for AI’s future. Built to address the challenges of wide-scale enterprise AI, Red Hat plans to use open-source innovation to further democratize access to AI’s transformative power by providing:

vLLM as an Open Pathway to Extend Red Hat AI

Neural Magic brings its expertise with vLLM and DeepSparse to the effort of building an open, enterprise-grade inference stack that enables customers to optimize, deploy, and scale LLM workloads across hybrid cloud environments with full control over infrastructure choice, security policies, and model lifecycle. Neural Magic also conducts model optimization research, builds LLM Compressor (a unified library for optimizing LLMs with state-of-the-art sparsity and quantization algorithms), and maintains a repository of pre-optimized models ready to deploy with vLLM.
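For context, a one-shot compression run with LLM Compressor looks roughly like the sketch below. The import paths, modifier names, dataset, and model identifier are assumptions based on the project's documented workflow and may differ across versions; treat this as illustrative rather than as the library's definitive API.

```python
# Rough sketch of a one-shot quantization run with LLM Compressor.
# Import paths, modifier names, and identifiers are assumptions and
# may differ by library version.
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import GPTQModifier

oneshot(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model
    dataset="open_platypus",                     # placeholder calibration set
    recipe=GPTQModifier(targets="Linear", scheme="W4A16", ignore=["lm_head"]),
    output_dir="TinyLlama-1.1B-W4A16",           # compressed model, deployable with vLLM
)
```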

Brian Stevens, CEO of Neural Magic, describes the synergy of the two companies as follows, “Open source has proven time and again to drive innovation through the power of community collaboration. At Neural Magic, we’ve assembled some of the industry’s top talent in AI performance engineering with a singular mission of building open, cross-platform, ultra-efficient LLM serving capabilities. Joining Red Hat is not only a cultural match, but will benefit companies large and small in their AI transformation journeys.”


This article first appeared on sister site HPCwire.
