AWS Machine Learning Blog | March 13
How to run Qwen 2.5 on AWS AI chips using Hugging Face libraries

This post describes how to deploy the Qwen 2.5 multilingual large language models on Amazon EC2 and Amazon SageMaker using the Hugging Face Text Generation Inference (TGI) container and the Hugging Face Optimum Neuron library. The Qwen 2.5 family spans model sizes from 0.5B to 72B parameters, supports more than 29 languages, and is optimized for multilingual dialogue use cases. The post walks through deploying the Qwen2.5-7B-Instruct model on an Amazon EC2 Inf2 instance and on SageMaker, covering preparation, environment configuration, the use of Docker Compose, and SageMaker notebook setup. It also provides guidance on testing the model and cleaning up resources, so that users can take full advantage of the high performance and low cost of AWS Trainium and AWS Inferentia.

🚀 **Qwen 2.5 multilingual model family**: Available in sizes from 0.5B to 72B parameters, supporting 29+ languages, with strong text generation capabilities and particularly good performance in multilingual dialogue scenarios.

🛠️ **Hugging Face tooling**: The Text Generation Inference (TGI) container simplifies deploying and serving LLMs, while the Optimum Neuron library bridges the Transformers library and the Inferentia/Trainium accelerators, handling efficient model compilation and optimization.

☁️ **Deployment on Amazon EC2 Inf2**: Create a .env file and a docker-compose.yaml file to configure the model ID, auto-cast type, batch size, and related parameters, deploy the Qwen 2.5 model quickly with Docker Compose, and test its multilingual responses with curl.

✨ **Quick deployment on SageMaker**: From the Hugging Face Model Hub, choose the SageMaker deployment option, copy the sample code into a SageMaker notebook, and run it to deploy the model, greatly simplifying the workflow.

The Qwen 2.5 multilingual large language models (LLMs) are a collection of pre-trained and instruction-tuned generative models in 0.5B, 1.5B, 3B, 7B, 14B, 32B, and 72B sizes (text in/text out and code out). The Qwen 2.5 fine-tuned text-only models are optimized for multilingual dialogue use cases and outperform both previous generations of Qwen models and many publicly available chat models on common industry benchmarks.

At its core, Qwen 2.5 is an auto-regressive language model that uses an optimized transformer architecture. The Qwen2.5 collection supports over 29 languages and has enhanced role-playing abilities and condition-setting for chatbots.

In this post, we outline how to get started with deploying the Qwen 2.5 family of models on an Inferentia instance with Amazon Elastic Compute Cloud (Amazon EC2) and Amazon SageMaker, using the Hugging Face Text Generation Inference (TGI) container and the Hugging Face Optimum Neuron library. The Qwen2.5 Coder and Math variants are also supported.

Preparation

Hugging Face provides two tools that are frequently used with AWS Inferentia and AWS Trainium: Text Generation Inference (TGI) containers, which provide support for deploying and serving LLMs, and the Optimum Neuron library, which serves as an interface between the Transformers library and the Inferentia and Trainium accelerators.

The first time a model is run on Inferentia or Trainium, you compile it to make sure that you have a version that will perform optimally on those chips. The Optimum Neuron library from Hugging Face, along with the Optimum Neuron cache, will transparently supply a compiled model when one is available. If you’re using a different model with the Qwen2.5 architecture, you might need to compile the model before deploying. For more information, see Compiling a model for Inferentia or Trainium.
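If you do need to compile a model yourself, one option is the Optimum Neuron Python API. The following is a minimal sketch, not the exact procedure from the compilation guide linked above; the export parameters (batch size, sequence length, number of cores, auto-cast type) mirror values used later in this post, and the output directory is a placeholder you should adjust for your own setup.

# Minimal sketch: ahead-of-time compilation with Optimum Neuron.
# The export arguments below are assumptions chosen to match this post's
# deployment settings -- adjust them to your configuration.
from optimum.neuron import NeuronModelForCausalLM

compiled_model = NeuronModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    export=True,            # compile the checkpoint for Neuron
    batch_size=4,
    sequence_length=4096,
    num_cores=2,
    auto_cast_type="bf16",
)

# Save the compiled artifacts so TGI can load them from a local path
# (for example, the /data volume mounted in docker-compose.yaml).
compiled_model.save_pretrained("./exportedmodel")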

You can deploy TGI as a Docker container on an Inferentia or Trainium EC2 instance or on Amazon SageMaker.

Option 1: Deploy TGI on Amazon EC2 Inf2

In this example, you will deploy Qwen2.5-7B-Instruct on an inf2.xlarge instance. (See this article for detailed instructions on how to deploy an instance using the Hugging Face DLAMI.)

For this option, you SSH into the instance and create a .env file (where you’ll define your constants and specify where your model is cached) and a file named docker-compose.yaml (where you’ll define all of the environment parameters that you’ll need to deploy your model for inference). You can copy the following files for this use case.

    Create a .env file with the following content:
MODEL_ID='Qwen/Qwen2.5-7B-Instruct'
#MODEL_ID='/data/exportedmodel'
HF_AUTO_CAST_TYPE='bf16' # indicates the auto cast type that was used to compile the model
MAX_BATCH_SIZE=4
MAX_INPUT_TOKENS=4000
MAX_TOTAL_TOKENS=4096
    Create a file named docker-compose.yaml with the following content:
version: '3.7'
services:
  tgi-1:
    image: ghcr.io/huggingface/neuronx-tgi:latest
    ports:
      - "8081:8081"
    environment:
      - PORT=8081
      - MODEL_ID=${MODEL_ID}
      - HF_AUTO_CAST_TYPE=${HF_AUTO_CAST_TYPE}
      - HF_NUM_CORES=2
      - MAX_BATCH_SIZE=${MAX_BATCH_SIZE}
      - MAX_INPUT_TOKENS=${MAX_INPUT_TOKENS}
      - MAX_TOTAL_TOKENS=${MAX_TOTAL_TOKENS}
      - MAX_CONCURRENT_REQUESTS=512
      #- HF_TOKEN=${HF_TOKEN} # only needed for gated models
    volumes:
      - $PWD:/data # can be removed if you aren't loading locally
    devices:
      - "/dev/neuron0"
    Use docker compose to deploy the model:

docker compose -f docker-compose.yaml --env-file .env up

    To confirm that the model deployed correctly, send a test prompt to the model:
curl 127.0.0.1:8081/generate \
    -X POST \
    -d '{
  "inputs": "Tell me about AWS.",
  "parameters": {
    "max_new_tokens": 60
  }
}' \
    -H 'Content-Type: application/json'
    To confirm that the model can respond in multiple languages, try sending a prompt in Chinese:
#"Tell me how to open an AWS account"curl 127.0.0.1:8081/generate \    -X POST \    -d '{  "inputs":"告诉我如何开设 AWS 账户。",   "parameters":{    "max_new_tokens":60  }}' \    -H 'Content-Type: application/json'

Option 2: Deploy TGI on SageMaker

You can also use Hugging Face’s Optimum Neuron library to quickly deploy models directly from SageMaker using instructions on the Hugging Face Model Hub.

    From the Qwen 2.5 model card hub, choose Deploy, then SageMaker, and finally AWS Inferentia & Trainium.

    Copy the example code into a SageMaker notebook, then choose Run. The notebook you copied will look like the following:
import json
import sagemaker
import boto3
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

# Hub Model configuration. https://huggingface.co/models
hub = {
    "HF_MODEL_ID": "Qwen/Qwen2.5-7B-Instruct",
    "HF_NUM_CORES": "2",
    "HF_AUTO_CAST_TYPE": "bf16",
    "MAX_BATCH_SIZE": "8",
    "MAX_INPUT_TOKENS": "3686",
    "MAX_TOTAL_TOKENS": "4096",
}

region = boto3.Session().region_name
image_uri = f"763104351884.dkr.ecr.{region}.amazonaws.com/huggingface-pytorch-tgi-inference:2.1.2-optimum0.0.27-neuronx-py310-ubuntu22.04"

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    image_uri=image_uri,
    env=hub,
    role=role,
)

# deploy model to SageMaker Inference
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.xlarge",
    container_startup_health_check_timeout=1800,
    volume_size=512,
)

# send request
predictor.predict(
    {
        "inputs": "What is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        }
    }
)
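After the endpoint is in service, you can also invoke it from outside the SageMaker Python SDK with the low-level runtime client. The following is a minimal sketch; it reuses the endpoint name from the predictor created above, and the prompt and parameters are illustrative.

# Sketch: invoking the deployed SageMaker endpoint with boto3.
# If you call this from a separate process, substitute your actual
# endpoint name for predictor.endpoint_name.
import json
import boto3

runtime = boto3.client("sagemaker-runtime")

payload = {
    "inputs": "告诉我如何开设 AWS 账户。",  # "Tell me how to open an AWS account"
    "parameters": {"max_new_tokens": 128},
}

response = runtime.invoke_endpoint(
    EndpointName=predictor.endpoint_name,
    ContentType="application/json",
    Body=json.dumps(payload),
)

print(json.loads(response["Body"].read().decode("utf-8")))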

Clean Up

Make sure that you terminate your EC2 instances and delete your SageMaker endpoints to avoid ongoing costs.

Terminate EC2 instances through the AWS Management Console.
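If you prefer to terminate the instance programmatically, a minimal boto3 sketch follows; the instance ID is a placeholder for your own Inf2 instance.

# Sketch: terminating the Inf2 instance with boto3 instead of the console.
import boto3

ec2 = boto3.client("ec2")
ec2.terminate_instances(InstanceIds=["i-0123456789abcdef0"])  # placeholder ID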

Delete a SageMaker endpoint through the console or with the following commands:

predictor.delete_model()
predictor.delete_endpoint(delete_endpoint_config=True)

Conclusion

AWS Trainium and AWS Inferentia deliver high performance and low cost for deploying Qwen2.5 models. We’re excited to see how you will use these powerful models and our purpose-built AI infrastructure to build differentiated AI applications. To learn more about how to get started with AWS AI chips, see the AWS Neuron documentation.


About the Authors

Jim Burtoft is a Senior Startup Solutions Architect at AWS and works directly with startups as well as the team at Hugging Face. Jim is a CISSP, part of the AWS AI/ML Technical Field Community, part of the Neuron Data Science community, and works with the open source community to enable the use of Inferentia and Trainium. Jim holds a bachelor’s degree in mathematics from Carnegie Mellon University and a master’s degree in economics from the University of Virginia.

Miriam Lebowitz is a Solutions Architect focused on empowering early-stage startups at AWS. She leverages her experience with AIML to guide companies to select and implement the right technologies for their business objectives, setting them up for scalable growth and innovation in the competitive startup world.

Rhia Soni is a Startup Solutions Architect at AWS. Rhia specializes in working with early stage startups and helps customers adopt Inferentia and Trainium. Rhia is also part of the AWS Analytics Technical Field Community and is a subject matter expert in Generative BI. Rhia holds a bachelor’s degree in Information Science from the University of Maryland.

Paul Aiuto is a Senior Solution Architect Manager focusing on Startups at AWS. Paul created a team of AWS Startup Solution architects that focus on the adoption of Inferentia and Trainium. Paul holds a bachelor’s degree in Computer Science from Siena College and has multiple Cyber Security certifications.
