AWS Machine Learning Blog, November 23, 2024
Amazon SageMaker Inference now supports G6e instances

AWS has launched G6e instances powered by NVIDIA L40S Tensor Core GPUs for Amazon SageMaker, bringing more flexible, cost-effective, and powerful acceleration to generative AI. G6e instances offer larger GPU memory, can host models with up to 90B parameters, and deliver up to 400 Gbps of network throughput. They are well suited for fine-tuning and deploying open source large language models such as Llama 3.2 11B Vision, Llama 2 13B, and Qwen 2.5 14B, perform strongly in scenarios such as chatbots, text generation, and image generation, and offer higher performance at lower cost, particularly for low-latency, real-time applications.

🚀 **Larger GPU memory**: Compared with G5 and G6 instances, G6e instances offer twice the GPU memory; a single-GPU node can host models of up to 14B parameters, a 4-GPU node up to 72B, and an 8-GPU node up to 90B, meeting the needs of large language model deployment.

💡 **High-performance networking and GPU memory**: G6e instances deliver up to 400 Gbps of network throughput and up to 384 GB of GPU memory, supporting efficient, smooth model inference.

💻 **Suited to a wide range of AI applications**: G6e instances are ideal for fine-tuning and deploying open source large language models, covering use cases such as chatbots, text generation, and image generation.

💰 **Better cost efficiency**: Compared with G5 instances, G6e instances offer a better price-performance ratio, especially for inference at high concurrency and long context lengths, lowering the cost of deploying AI applications.

📊 **Published performance benchmarks**: AWS provides complete benchmark data showing how G6e instances perform across different models and application scenarios, helping users choose the right instance for their deployment.

As the demand for generative AI continues to grow, developers and enterprises seek more flexible, cost-effective, and powerful accelerators to meet their needs. Today, we are thrilled to announce the availability of G6e instances powered by NVIDIA L40S Tensor Core GPUs on Amazon SageMaker. You have the option to provision nodes with 1, 4, or 8 L40S GPUs, with each GPU providing 48 GB of memory. This launch gives organizations the ability to use a single-node GPU instance, G6e.xlarge, to host powerful open source foundation models such as Llama 3.2 11B Vision, Llama 2 13B, and Qwen 2.5 14B, offering a cost-effective and high-performing option. This makes it a strong choice for those looking to optimize costs while maintaining high performance for inference workloads.
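As a rough sanity check on why a model in the 13B to 14B parameter range fits on a single 48 GB L40S GPU, the short sketch below estimates the memory taken by the weights alone in 16-bit precision. The numbers are illustrative only; a real deployment also needs headroom for the KV cache, activations, and serving-framework overhead.

```python
# Illustrative sizing only: 16-bit weight memory for a ~14B parameter model.
params = 14e9                # ~14 billion parameters
bytes_per_param = 2          # FP16/BF16 weights
weights_gb = params * bytes_per_param / 1024**3
print(f"Weights alone: ~{weights_gb:.0f} GB of the 48 GB on one L40S GPU")
# Roughly 26 GB, leaving room for KV cache and runtime overhead.
```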

The key highlights for G6e instances include:

- Twice the GPU memory of G5 and G6 instances, with up to 384 GB of total GPU memory (48 GB per GPU) on an 8-GPU node
- Up to 400 Gbps of networking throughput
- Support for hosting models of up to 14B parameters on a single GPU, 72B on a 4-GPU node, and 90B on an 8-GPU node

Use cases

G6e instances are ideal for fine-tuning and deploying open large language models (LLMs). Our benchmarks show that G6e provides higher performance and is more cost-effective compared to G5 instances, making it an ideal fit for low-latency, real-time use cases such as:

- Chatbots and conversational assistants
- Text generation
- Image generation

We have also observed that G6e performs well for inference at high concurrency and with longer context lengths. We have provided complete benchmarks in the following section.

Performance

In the following two figures, we see that for context lengths of 512 and 1,024, G6e.2xlarge provides up to 37% better latency and 60% higher throughput than G5.2xlarge for a Llama 3.1 8B model.

In the following two figures, we see that G5.2xlarge runs into a CUDA out-of-memory (OOM) error when deploying the Llama 3.2 11B Vision model, whereas G6e.2xlarge serves it with strong performance.

In the following two figures, we compare G5.48xlarge (an 8-GPU node) with G6e.12xlarge (a 4-GPU node), which costs 35% less and is more performant. At higher concurrency, G6e.12xlarge delivers 60% lower latency and 2.5 times higher throughput.

In the following figure, we compare the cost per 1,000 tokens when deploying Llama 3.1 70B, which further highlights the cost/performance benefits of using G6e instances compared to G5.
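If you want to reproduce this kind of comparison with your own measurements, the arithmetic is simple: divide the instance's hourly price by the number of tokens it sustains per hour. The sketch below uses hypothetical prices and throughputs purely for illustration; substitute your region's on-demand SageMaker pricing and your own benchmark results.

```python
def cost_per_1k_tokens(price_per_hour: float, tokens_per_second: float) -> float:
    """Cost (in the same currency as price_per_hour) to generate 1,000 tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return price_per_hour / tokens_per_hour * 1000

# Hypothetical inputs, not published AWS pricing or benchmark results:
print(cost_per_1k_tokens(price_per_hour=16.00, tokens_per_second=400))  # e.g., an 8-GPU G5 node
print(cost_per_1k_tokens(price_per_hour=10.50, tokens_per_second=900))  # e.g., a 4-GPU G6e node
```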

Deployment walkthrough

Prerequisites

To try out this solution using SageMaker, you’ll need the following prerequisites:

Deployment

You can clone the repository and use the notebook provided here.
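To show the general shape of such a deployment, here is a minimal sketch using the SageMaker Python SDK with a Large Model Inference (LMI) container. The container image URI, model ID, and endpoint name below are placeholders and assumptions, not values taken from the notebook; consult the repository and the AWS Deep Learning Containers listing for the exact image tag to use in your region.

```python
import sagemaker
from sagemaker import Model
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

role = sagemaker.get_execution_role()

# Placeholder LMI container image; look up the current tag for your region
# in the AWS Deep Learning Containers listing.
image_uri = "763104351884.dkr.ecr.us-east-1.amazonaws.com/djl-inference:<lmi-tag>"

model = Model(
    image_uri=image_uri,
    role=role,
    env={
        # Example model ID; gated models also require an HF_TOKEN to be set.
        "HF_MODEL_ID": "meta-llama/Llama-3.1-8B-Instruct",
    },
)

# Deploy on a single-GPU G6e instance.
predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6e.2xlarge",
    endpoint_name="llama-3-1-8b-g6e",
    serializer=JSONSerializer(),
    deserializer=JSONDeserializer(),
)

response = predictor.predict(
    {"inputs": "What are G6e instances?", "parameters": {"max_new_tokens": 128}}
)
print(response)
```

The same pattern applies to the multi-GPU G6e sizes (for example, the 4-GPU ml.g6e.12xlarge referenced in the benchmarks above), where the serving container shards larger models across GPUs.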

Clean up

To prevent incurring unnecessary charges, it’s recommended to clean up the deployed resources when you’re done using them. You can remove the deployed model with the following code:

predictor.delete_predictor()

Conclusion

G6e instances on SageMaker unlock the ability to deploy a wide variety of open source models cost-effectively. With superior memory capacity, enhanced performance, and cost-effectiveness, these instances represent a compelling solution for organizations looking to deploy and scale their AI applications. The ability to handle larger models, support longer context lengths, and maintain high throughput makes G6e instances particularly valuable for modern AI applications. Try the code to deploy with G6e.


About the Authors

Vivek Gangasani is a Senior GenAI Specialist Solutions Architect at AWS. He helps emerging GenAI companies build innovative solutions using AWS services and accelerated compute. Currently, he is focused on developing strategies for fine-tuning and optimizing the inference performance of Large Language Models. In his free time, Vivek enjoys hiking, watching movies and trying different cuisines.

Alan Tan is a Senior Product Manager with SageMaker, leading efforts on large model inference. He’s passionate about applying machine learning to the area of analytics. Outside of work, he enjoys the outdoors.

Pavan Kumar Madduri is an Associate Solutions Architect at Amazon Web Services. He has a strong interest in designing innovative solutions in Generative AI and is passionate about helping customers harness the power of the cloud. He earned his MS in Information Technology from Arizona State University. Outside of work, he enjoys swimming and watching movies.

Michael Nguyen is a Senior Startup Solutions Architect at AWS, specializing in leveraging AI/ML to drive innovation and develop business solutions on AWS. Michael holds 12 AWS certifications and has a BS/MS in Electrical/Computer Engineering and an MBA from Penn State University, Binghamton University, and the University of Delaware.
