AWS Machine Learning Blog, October 22, 2024
Brilliant words, brilliant writing: Using AWS AI chips to quickly deploy Meta Llama 3-powered applications

This article describes how to use Amazon EC2 Inf2 instances to efficiently deploy multiple industry-leading large language models on AWS Inferentia2, helping users quickly test the models and expose an API interface for performance benchmarking and downstream application calls. It focuses on three models, Meta-Llama-3-8B-Instruct, Mistral-7B-instruct-v0.2, and CodeLlama-7b-instruct-hf, and presents a solution architecture built with tools such as Hugging Face's Text Generation Inference and Hugging Face Chat UI. The solution uses a client-server architecture, is deployed with AWS CloudFormation, and helps users run large language models at lower cost and with higher efficiency.

😊 **Model introduction** This article introduces three popular open-source large language models: Meta-Llama-3-8B-Instruct, Mistral-7B-instruct-v0.2, and CodeLlama-7b-instruct-hf, released by Meta AI, Mistral AI, and Meta AI respectively. These models perform well in language understanding, translation, code generation, reasoning, and chat, and differ in parameter count and release date. Meta-Llama-3-8B-Instruct is a popular language model released by Meta AI in April 2024, with improvements in pre-training, instruction comprehension, output generation, coding, reasoning, and math skills. Mistral-7B-instruct-v0.2, released by Mistral AI in March 2024, excels at tasks such as natural language processing and coding. CodeLlama-7b-instruct-hf belongs to a family of models released by Meta AI that generate code from text prompts, helping developers work more efficiently and lowering the learning threshold.

🤔 **Solution architecture** The solution uses a client-server architecture. The client uses the HuggingFace Chat UI to provide a chat page that can be accessed from a PC or mobile device. Server-side model inference uses Hugging Face's Text Generation Inference (TGI), an efficient LLM inference framework that runs in a Docker container. The models are pre-compiled with Hugging Face's Optimum Neuron, and the compiled artifacts are uploaded to the Hugging Face Hub. The solution also adds a model switching mechanism that lets users control which model is loaded in the Text Generation Inference container through a scheduler.

🚀 **Solution highlights** The main highlights of the solution: all components are deployed on a single-chip Inf2 instance (inf2.xl or inf2.8xl), so users can try multiple models on one instance; the client-server architecture lets users replace the client or server side as needed; because publicly available frameworks are used, users can customize the frontend pages or models to their needs; the Text Generation Inference API interface makes it easy to access the models programmatically; and deployment with AWS CloudFormation suits all kinds of enterprises and developers.

💡 **Main components** The main components of the solution are Hugging Face Optimum Neuron, Hugging Face's Text Generation Inference (TGI), and Hugging Face Chat UI. Hugging Face Optimum Neuron is the interface between the HuggingFace Transformers library and the AWS Neuron SDK, providing a set of tools for model loading, training, and inference. Hugging Face's Text Generation Inference (TGI) is a framework written in Rust and Python for deploying and serving LLMs. Hugging Face Chat UI is an open-source chat tool built with SvelteKit that can be deployed to platforms such as Cloudflare, Netlify, and Node.

👏 **Solution deployment** Deploying the solution requires an inf2.xl or inf2.8xl usage quota in the us-east-1 (Virginia) or us-west-2 (Oregon) AWS Region. Sign in to the AWS Management Console, choose the us-east-1 (Virginia) or us-west-2 (Oregon) Region, enter CloudFormation in the service search box, and choose Create stack. Select Choose an existing template, and then select Amazon S3 URL. You can deploy into an existing virtual private cloud (VPC) or create a new VPC. Finally, choose Next, choose Next again, and then choose Submit. Wait for the resources to be created and started (about 15 minutes). After the stack status shows CREATE_COMPLETE, choose Outputs and then choose the public endpoint URL of the web server.

Many organizations are building generative AI applications powered by large language models (LLMs) to boost productivity and build differentiated experiences. These LLMs are large and complex, and deploying them requires powerful computing resources and results in high inference costs. For businesses and researchers with limited resources, the high inference costs of generative AI models can be a barrier to market entry, so more efficient and cost-effective solutions are needed. Most generative AI use cases involve human interaction, which requires AI accelerators that can deliver real-time response rates with low latency. At the same time, the pace of innovation in generative AI is increasing, and it's becoming more challenging for developers and researchers to quickly evaluate and adopt new models to keep pace with the market.

One way to get started with LLMs such as Llama and Mistral is to use Amazon Bedrock. However, customers who want to deploy LLMs in their own self-managed workflows for greater control and flexibility of the underlying resources can run these LLMs optimized on AWS Inferentia2-powered Amazon Elastic Compute Cloud (Amazon EC2) Inf2 instances. In this blog post, we will introduce how to use an Amazon EC2 Inf2 instance to cost-effectively deploy multiple industry-leading LLMs on AWS Inferentia2, a purpose-built AWS AI chip, helping customers quickly test the models and expose an API interface that facilitates performance benchmarking and downstream application calls.

Model introduction

There are many popular open source LLMs to choose from, and for this blog post, we will review three different use cases based on model expertise using Meta-Llama-3-8B-Instruct, Mistral-7B-instruct-v0.2, and CodeLlama-7b-instruct-hf.

| Model name | Release company | Number of parameters | Release time | Model capabilities |
| --- | --- | --- | --- | --- |
| Meta-Llama-3-8B-Instruct | Meta | 8 billion | April 2024 | Language understanding, translation, code generation, inference, chat |
| Mistral-7B-Instruct-v0.2 | Mistral AI | 7.3 billion | March 2024 | Language understanding, translation, code generation, inference, chat |
| CodeLlama-7b-Instruct-hf | Meta | 7 billion | August 2023 | Code generation, code completion, chat |

Meta-Llama-3-8B-Instruct is a popular language model, released by Meta AI in April 2024. The Llama 3 model has improved pre-training, instruction comprehension, output generation, coding, inference, and math skills. The Meta AI team says that Llama 3 has the potential to be the initiator of a new wave of innovation in AI. The Llama 3 model is available in two publicly released versions, 8B and 70B. At the time of writing, Llama 3.1 instruction-tuned models are available in 8B, 70B, and 405B versions. In this blog post, we will use the Meta-Llama-3-8B-Instruct model, but the same process can be followed for Llama 3.1 models.

Mistral-7B-instruct-v0.2, released by Mistral AI in March 2024, marks a major milestone in the development of publicly available foundation models. With its impressive performance, efficient architecture, and wide range of features, Mistral 7B v0.2 sets a new standard for user-friendly and powerful AI tools. The model excels at tasks ranging from natural language processing to coding, making it an invaluable resource for researchers, developers, and businesses. In this blog post, we will use the Mistral-7B-instruct-v0.2 model, but the same process can be followed for the Mistral-7B-instruct-v0.3 model.

CodeLlama-7b-instruct-hf is a collection of models published by Meta AI. It is an LLM that uses text prompts to generate code. Code Llama is aimed at code tasks, making developers’ workflow faster and more efficient and lowering the learning threshold for coders. Code Llama has the potential to be used as a productivity and educational tool to help programmers write more powerful and well-documented software.

Solution architecture

The solution uses a client-server architecture: the client uses the HuggingFace Chat UI to provide a chat page that can be accessed on a PC or mobile device. Server-side model inference uses Hugging Face's Text Generation Inference, an efficient LLM inference framework that runs in a Docker container. We pre-compiled the models using Hugging Face's Optimum Neuron and uploaded the compiled artifacts to the Hugging Face Hub. We also added a model switching mechanism to the HuggingFace Chat UI that controls the loading of different models in the Text Generation Inference container through a scheduler.
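
The scheduler's implementation is not shown in this post; the following is a hypothetical, minimal sketch of one way such a switch could work, restarting the TGI container with a different `--model-id` through the Docker SDK for Python. The image name, container name, and device mapping are assumptions, not the authors' actual mechanism.

```python
# Hypothetical sketch: swap the model served by a TGI container on the same instance.
# Assumes the Docker SDK for Python is installed, a Neuron-enabled TGI image is
# available locally, and /dev/neuron0 is the accelerator device (all assumptions).
import docker
from docker.errors import NotFound

TGI_IMAGE = "ghcr.io/huggingface/neuronx-tgi:latest"  # assumed image name
CONTAINER_NAME = "tgi-server"

def activate_model(model_id: str) -> None:
    client = docker.from_env()
    # Stop and remove the currently running TGI container, if any.
    try:
        old = client.containers.get(CONTAINER_NAME)
        old.stop()
        old.remove()
    except NotFound:
        pass
    # Start a new container serving the requested (pre-compiled) model.
    client.containers.run(
        TGI_IMAGE,
        command=["--model-id", model_id, "--port", "8080"],
        name=CONTAINER_NAME,
        ports={"8080/tcp": 8080},
        devices=["/dev/neuron0:/dev/neuron0:rwm"],
        detach=True,
    )

# Example: switch to the Llama 3 model compiled earlier.
# activate_model("meta-llama/Meta-Llama-3-8B-Instruct")
```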

Solution highlights

The main highlights of the solution:

- All components are deployed on an Inf2 instance with a single accelerator chip (inf2.xl or inf2.8xl), so users can experience the effects of multiple models on one instance.
- With the client-server architecture, users can flexibly replace either the client or the server side according to their actual needs. For example, the model can be deployed in Amazon SageMaker, and the frontend Chat UI can be deployed on a Node server. To simplify the demonstration, we deployed both the front and back ends on the same Inf2 server.
- Because publicly available frameworks are used, users can customize frontend pages or models according to their own needs.
- The Text Generation Inference API interface makes it convenient for users to access the models quickly through the API.
- Deployment uses AWS CloudFormation, which suits all types of businesses and developers within an enterprise.

Main components

The following are the main components of the solution.

Hugging Face Optimum Neuron

Optimum Neuron is an interface between the HuggingFace Transformers library and the AWS Neuron SDK. It provides a set of tools for model loading, training, and inference across single- and multi-accelerator setups for different downstream tasks. In this post, we mainly used Optimum Neuron's export interface. To deploy a HuggingFace Transformers model on Neuron devices, the model needs to be compiled and exported to a serialized format before inference is performed. The export interface performs ahead-of-time (AOT) compilation using the Neuron compiler (neuronx-cc) and converts the model into a serialized and optimized TorchScript module. This is shown in the following figure.

During the compilation process, we introduced a tensor parallelism mechanism to split the weights, data, and computations between the two NeuronCores. For more compilation parameters, see Export a model to Inferentia.
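
As a concrete illustration, a compilation step along these lines can be run with Optimum Neuron's Python API. This is a minimal sketch, not the exact command used for the published artifacts; the batch size, sequence length, and compiler arguments shown here are assumptions and should be adjusted to the target instance.

```python
# Minimal sketch: ahead-of-time compilation of Llama 3 for Inferentia2 with Optimum Neuron.
# Shapes and compiler arguments are illustrative, not the authors' exact settings.
from optimum.neuron import NeuronModelForCausalLM

compiler_args = {"num_cores": 2, "auto_cast_type": "fp16"}   # split across 2 NeuronCores, fp16 weights
input_shapes = {"batch_size": 1, "sequence_length": 4096}    # static shapes required for AOT compilation

model = NeuronModelForCausalLM.from_pretrained(
    "meta-llama/Meta-Llama-3-8B-Instruct",
    export=True,          # trigger neuronx-cc compilation on load
    **compiler_args,
    **input_shapes,
)

# Save the compiled artifacts locally; the directory can then be uploaded to the Hugging Face Hub.
model.save_pretrained("llama-3-8b-instruct-neuron")
```

Because the shapes are fixed at compile time, requests that exceed the chosen sequence length require recompiling the model with larger shapes.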

Hugging Face’s Text Generation Inference (TGI)

Text Generation Inference (TGI) is a framework written in Rust and Python for deploying and serving LLMs. TGI provides high-performance text generation services for the most popular publicly available foundation LLMs. Its main features include a simple launcher for serving popular LLMs, token streaming using server-sent events, and continuous batching of incoming requests for higher total throughput.
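
Once a TGI endpoint is running, it can also be queried with the official `huggingface_hub` client, in addition to the raw HTTP calls shown later in this post. This is a minimal sketch; the endpoint address is a placeholder.

```python
# Minimal sketch: calling a running TGI endpoint with the huggingface_hub InferenceClient.
from huggingface_hub import InferenceClient

# Replace <IP> with the address of the deployed Inf2 instance.
client = InferenceClient("http://<IP>:8080")

# Blocking call: returns the full generated text at once.
text = client.text_generation(
    "Calculate the distance from Beijing to Shanghai",
    max_new_tokens=200,
)
print(text)
```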

HuggingFace Chat UI

HuggingFace Chat UI is an open-source chat tool built with SvelteKit that can be deployed to Cloudflare, Netlify, Node, and other platforms. It is the codebase behind the HuggingChat app and can connect to multiple inference backends, including Text Generation Inference endpoints.

Thanks to the page customization capabilities of the HuggingFace Chat UI, we've added a model switching function, so users can switch between different models on the same EC2 Inf2 instance.

Solution deployment

1. Before deploying the solution, make sure you have an inf2.xl or inf2.8xl usage quota in the us-east-1 (Virginia) or us-west-2 (Oregon) AWS Region. See the reference link for how to apply for a quota.
2. Sign in to the AWS Management Console and switch the Region to us-east-1 (Virginia) or us-west-2 (Oregon) in the upper right corner of the console page.
3. Enter CloudFormation in the service search box and choose Create stack.
4. Select Choose an existing template, and then select Amazon S3 URL.
5. If you plan to use an existing virtual private cloud (VPC), use the steps in (a); if you plan to create a new VPC to deploy into, use the steps in (b). (A programmatic alternative using boto3 is sketched after this list.)
    a. Use an existing VPC.
        - Enter https://zz-common.s3.amazonaws.com/tmp/tgiui/20240501/launch_server_default_vpc_ubuntu22.04.yaml in the Amazon S3 URL.
        - Stack name: Enter the stack name.
        - InstanceType: Select inf2.xl (lower cost) or inf2.8xl (better performance).
        - KeyPairName (optional): If you want to sign in to the Inf2 instance, enter the key pair name.
        - VpcId: Select the VPC.
        - PublicSubnetId: Select a public subnet.
        - VolumeSize: Enter the size of the EC2 instance EBS storage volume. The minimum value is 80 GB.
        - Choose Next, then Next again, and then choose Submit.
    b. Create a new VPC.
        - Enter https://zz-common.s3.amazonaws.com/tmp/tgiui/20240501/launch_server_new_vpc_ubuntu22.04.yaml in the Amazon S3 URL.
        - Stack name: Enter the stack name.
        - InstanceType: Select inf2.xl or inf2.8xl.
        - KeyPairName (optional): If you want to sign in to the Inf2 instance, enter the key pair name.
        - VpcId: Leave as New.
        - PublicSubnetId: Leave as New.
        - VolumeSize: Enter the size of the EC2 instance EBS storage volume. The minimum value is 80 GB.
        - Choose Next, then Next again, and then choose Submit.
6. After creating the stack, wait for the resources to be created and started (about 15 minutes). After the stack status is displayed as CREATE_COMPLETE, choose Outputs, and then choose the URL whose key is Public endpoint for the web server (close all VPN connections and firewall programs).
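
For users who prefer scripting the deployment, the same stack can be created programmatically with boto3. This is a minimal sketch under the assumption that the template's parameter keys match the console fields shown above and that an `inf2.xlarge` value is accepted for InstanceType; verify both against the template before use.

```python
# Minimal sketch: creating the CloudFormation stack with boto3 instead of the console.
# Parameter keys mirror the console fields above; confirm them against the template.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

response = cfn.create_stack(
    StackName="llm-on-inf2",
    TemplateURL=(
        "https://zz-common.s3.amazonaws.com/tmp/tgiui/20240501/"
        "launch_server_new_vpc_ubuntu22.04.yaml"
    ),
    Parameters=[
        # "inf2.xlarge" is assumed here; the template may expect a shorthand such as "inf2.xl".
        {"ParameterKey": "InstanceType", "ParameterValue": "inf2.xlarge"},
        {"ParameterKey": "VolumeSize", "ParameterValue": "80"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # assumed: the template may create IAM resources
)
print(response["StackId"])

# Wait until the stack reaches CREATE_COMPLETE (about 15 minutes), then read the outputs.
cfn.get_waiter("stack_create_complete").wait(StackName="llm-on-inf2")
outputs = cfn.describe_stacks(StackName="llm-on-inf2")["Stacks"][0]["Outputs"]
print(outputs)
```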

User interface

After the solution is deployed, users can access the preceding URL on a PC or mobile phone. On the page, the Llama3-8B model is loaded by default. Users can switch models in the menu settings: select the model name to activate in the model list, and choose Activate to switch models. Switching models requires reloading the new model into the Inferentia2 accelerator memory, which takes about 1 minute. During this process, users can check the loading status of the new model by choosing Retrieve model status. If the status is Available, the new model has been successfully loaded.

The effects of the different models are shown in the following figure:

The following figure shows the solution in a browser on a PC:

API interface and performance testing

The solution uses the Text Generation Inference server, which supports the /generate and /generate_stream interfaces and uses port 8080 by default. You can make API calls by replacing <IP> in the following commands with the IP address of the instance deployed previously.

The /generate interface returns the complete response to the client at once, after all tokens have been generated on the server side.

    curl <IP>:8080/generate \
        -X POST \
        -d '{"inputs": "Calculate the distance from Beijing to Shanghai"}' \
        -H 'Content-Type: application/json'

The /generate_stream interface reduces waiting latency and improves the user experience by returning tokens one by one, which is especially useful when the model output is relatively long.

    curl <IP>:8080/generate_stream \
        -X POST \
        -d '{"inputs": "Write an essay on the mental health of elementary school students with no more than 300 words."}' \
        -H 'Content-Type: application/json'

Here is sample code that uses the requests library in Python.

    import requests

    url = "http://<IP>:8080/generate"
    headers = {"Content-Type": "application/json"}
    data = {
        "inputs": "Calculate the distance from Beijing to Shanghai",
        "parameters": {
            "max_new_tokens": 200
        }
    }
    response = requests.post(url, headers=headers, json=data)
    print(response.text)
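
For the streaming interface, the response arrives as server-sent events. The following is a minimal sketch of consuming it with requests, assuming the standard TGI event format of `data:`-prefixed JSON lines carrying a `token` field.

```python
# Minimal sketch: consuming /generate_stream as server-sent events with requests.
import json
import requests

url = "http://<IP>:8080/generate_stream"
headers = {"Content-Type": "application/json"}
data = {
    "inputs": "Write an essay on the mental health of elementary school students with no more than 300 words.",
    "parameters": {"max_new_tokens": 300},
}

with requests.post(url, headers=headers, json=data, stream=True) as response:
    for line in response.iter_lines():
        # Skip keep-alive blank lines and anything that isn't an SSE data line.
        if not line or not line.startswith(b"data:"):
            continue
        event = json.loads(line[len(b"data:"):])
        # Each event carries one generated token; print it as it arrives.
        print(event["token"]["text"], end="", flush=True)
print()
```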

Summary

In this blog post, we introduced methods and examples of deploying popular LLMs on AWS AI chips so that users can quickly experience the productivity improvements provided by LLMs. The models deployed on Inf2 instances have been validated across multiple users and scenarios, showing strong performance and wide applicability. AWS is continuously expanding the solution's application scenarios and features to provide users with efficient and economical computing capabilities. See Inf2 Inference Performance for the types and list of models supported on the Inferentia2 chip. Contact us to give feedback on your needs or ask questions about deploying LLMs on AWS AI chips.

References


About the authors

Zheng Zhang is a technical expert for Amazon Web Services machine learning products, focusing on Amazon Web Services-based accelerated computing and GPU instances. He has rich experience in large-scale model training and inference acceleration in machine learning.

Bingyang Huang is a Go-To-Market Specialist of Accelerated Computing at the GCR SSO GenAI team. She has experience deploying AI accelerators in customers' production environments. Outside of work, she enjoys watching films and exploring good food.

Tian Shi is a Senior Solutions Architect at Amazon Web Services. He has rich experience in cloud computing, data analysis, and machine learning and is currently dedicated to research and practice in the fields of data science, machine learning, and serverless. His translations include Machine Learning as a Service, DevOps Practices Based on Kubernetes, Practical Kubernetes Microservices, Prometheus Monitoring Practice, and CoreDNS Study Guide in the Cloud Native Era.

Chuan Xie is a Senior Solutions Architect for Generative AI at Amazon Web Services, responsible for the design, implementation, and optimization of generative artificial intelligence solutions based on the Amazon Cloud. River has many years of production and research experience in the communications, ecommerce, internet, and other industries, and rich practical experience in data science, recommendation systems, LLM RAG, and more. He holds multiple AI-related product technology invention patents.
