AWS Machine Learning Blog, July 3, 2024
Scale and simplify ML workload monitoring on Amazon EKS with AWS Neuron Monitor container

 

The AWS Neuron Monitor is an innovative tool that enhances monitoring of AWS Inferentia and AWS Trainium chips on Amazon Elastic Kubernetes Service (Amazon EKS). It simplifies integration with advanced monitoring tools such as Prometheus and Grafana, enabling you to set up and manage machine learning (ML) workflows on AWS AI chips. With the new Neuron Monitor container, you can visualize and optimize the performance of your ML applications within a familiar Kubernetes environment.

🚀 **Neuron Monitor container: a comprehensive monitoring framework** The container provides a comprehensive monitoring framework for ML workloads on Amazon EKS. By deploying the Neuron Monitor DaemonSet across EKS nodes, developers can collect and analyze performance metrics from ML workload pods. These metrics can be integrated into Prometheus and visualized through Grafana, providing deep insight into application performance for effective troubleshooting and optimization.

📊 **CloudWatch Container Insights: deeper analytics** CloudWatch Container Insights (for Neuron) offers a robust monitoring solution with deeper insights and analytics tailored to Neuron-based applications. With Container Insights, you now have access to more granular data and comprehensive analytics, making it easy for developers to maintain the high performance and operational health of their ML workloads.

💡 **Flexibility and depth: meeting your monitoring needs** Neuron Monitor provides flexibility and depth within the Kubernetes environment. By integrating metrics with Prometheus and Grafana or with CloudWatch, you can choose the monitoring approach that best fits your specific needs.

🚀 **Enhanced observability with Container Insights** With Container Insights, you can view metrics and telemetry, including Neuron device metrics, directly in the CloudWatch console. The Container Insights dashboard shows cluster status and alarms and uses predefined thresholds to identify components with high resource consumption, so you can take proactive action before performance is impacted.

📈 **Prometheus and Grafana: custom monitoring** You can build a custom monitoring solution with Prometheus and Grafana. By configuring Prometheus to scrape metrics from the Neuron Monitor pods and forward them to the managed service, you can create custom dashboards and alerts that monitor specific aspects of your ML workloads.

⚙️ **Setup and deployment: a simplified process** The Neuron Monitor container solution offers a straightforward setup and deployment process. You can configure Prometheus with a Helm chart and deploy the Neuron Monitor DaemonSet with the Kubernetes command line tool.

💡 **Benefits: more precise monitoring** This architecture offers many advantages, including highly targeted and intentional monitoring, real-time analytics with greater visibility into ML workload performance on Neuron, and native support for your existing Amazon EKS infrastructure.

🚀 **Looking ahead: continuous improvement** AWS is committed to continuously improving the Neuron monitoring tools to deliver more powerful capabilities and deeper insights, with new features planned to further enhance ML workload monitoring and optimization.

Amazon Web Services is excited to announce the launch of the AWS Neuron Monitor container, an innovative tool designed to enhance the monitoring capabilities of AWS Inferentia and AWS Trainium chips on Amazon Elastic Kubernetes Service (Amazon EKS). This solution simplifies the integration of advanced monitoring tools such as Prometheus and Grafana, enabling you to set up and manage your machine learning (ML) workflows with AWS AI Chips. With the new Neuron Monitor container, you can visualize and optimize the performance of your ML applications, all within a familiar Kubernetes environment. The Neuron Monitor container can also run on Amazon Elastic Container Service (Amazon ECS), but for the purpose of this post, we primarily discuss Amazon EKS deployment.

In addition to the Neuron Monitor container, the release of CloudWatch Container Insights (for Neuron) provides further benefits. This extension provides a robust monitoring solution, offering deeper insights and analytics tailored specifically for Neuron-based applications. With Container Insights, you can now access more granular data and comprehensive analytics, making it effortless for developers to maintain high performance and operational health of their ML workloads.

Solution overview

The Neuron Monitor container solution provides a comprehensive monitoring framework for ML workloads on Amazon EKS, using the power of Neuron Monitor in conjunction with industry-standard tools like Prometheus, Grafana, and Amazon CloudWatch. By deploying the Neuron Monitor DaemonSet across EKS nodes, developers can collect and analyze performance metrics from ML workload pods.

In one flow, metrics gathered by Neuron Monitor are integrated with Prometheus, which is configured using a Helm chart for scalability and ease of management. These metrics are then visualized through Grafana, offering you detailed insights into your applications’ performance for effective troubleshooting and optimization.

Alternatively, metrics can be directed to CloudWatch through the CloudWatch Observability EKS add-on or a Helm chart for deeper integration with AWS services in a single step. The add-on automatically discovers critical health metrics from the AWS Trainium and AWS Inferentia chips in Amazon EC2 Trn1 and Amazon EC2 Inf2 instances, as well as from Elastic Fabric Adapter, the network interface for EC2 instances. This integration can help you better understand the traffic impact on your distributed deep learning algorithms.

This architecture has many benefits: highly targeted and intentional monitoring, real-time analytics and improved visibility into the performance of ML workloads on Neuron, and native support for your existing Amazon EKS infrastructure. In addition, Neuron Monitor provides flexibility and depth in monitoring within the Kubernetes environment.

The following diagram illustrates the solution architecture:

Fig.1 Solution Architecture Diagram

In the following sections, we demonstrate how to use Container Insights for enhanced observability, and how to set up Prometheus and Grafana for this solution.

Configure Container Insights for enhanced observability

In this section, we walk through the steps to configure Container Insights.

Set up the CloudWatch Observability EKS add-on

Refer to Install the Amazon CloudWatch Observability EKS add-on for instructions to create the amazon-cloudwatch-observability add-on in your EKS cluster. This process involves deploying the necessary resources for monitoring directly within CloudWatch.
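As a quick reference, the add-on can also be created from the AWS CLI. The following is a minimal sketch (the cluster name is a placeholder, and the linked guide covers the IAM permissions the add-on needs):

aws eks create-addon --cluster-name <value> --addon-name amazon-cloudwatch-observability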

After you set up the add-on, check the health of the add-on with the following command:

aws eks describe-addon --cluster-name <value> --addon-name amazon-cloudwatch-observability

The output should contain the following property value:

"status": "ACTIVE",

For details about confirming the output, see Retrieve addon version compatibility.
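If you prefer a single value instead of the full JSON response, you can filter the output with the AWS CLI's --query flag, for example:

aws eks describe-addon --cluster-name <value> --addon-name amazon-cloudwatch-observability --query 'addon.status' --output text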

Once the add-on is active, you can then directly view metrics in Container Insights.

View CloudWatch metrics

Navigate to the Container Insights console, where you can visualize metrics and telemetry about your whole Amazon EKS environment, including your Neuron device metrics. The enhanced Container Insights page looks similar to the following screenshot, with the high-level summary of your clusters, along with kube-state and control-plane metrics. The Container Insights dashboard also shows cluster status and alarms. It uses predefined thresholds for CPU, memory, and NeuronCores to quickly identify which resources have higher consumption, and enables proactive actions to avoid performance impact.

Fig.2 CloudWatch Container Insights Dashboard

The out-of-the-box opinionated performance dashboards and troubleshooting UI enable you to see your Neuron metrics at multiple granularities, from an aggregated cluster level down to the per-container and per-NeuronCore level. With the Container Insights default configuration, you can also qualify and correlate your Neuron metrics against other aspects of your infrastructure, such as CPU, memory, disk, Elastic Fabric Adapter devices, and more.

When you navigate to any of the clusters based on their criticality, you can view the Performance monitoring dashboard, as shown in the following screenshot.

Fig.3 Performance Monitoring Dashboard Views

This monitoring dashboard provides various views to analyze performance.

This landing page has now been enhanced with Neuron metrics, including top 10 graphs, which help you identify unhealthy components in your environment even without alarms and take proactive action before application performance is impacted. For a more in-depth analysis of what is delivered on this landing page, refer to Announcing Amazon CloudWatch Container Insights with Enhanced Observability for Amazon EKS on EC2.
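To confirm outside the console that Neuron telemetry is reaching CloudWatch, you can list the metrics that Container Insights publishes for your cluster. The following is a minimal sketch (the ContainerInsights namespace is the standard Container Insights namespace; the cluster name is a placeholder):

aws cloudwatch list-metrics --namespace ContainerInsights --dimensions Name=ClusterName,Value=<cluster-name>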

Prometheus and Grafana

In this section, we walk through the steps to set up Prometheus and Grafana.

Prerequisites

You should have an EKS cluster set up with AWS Inferentia or Trainium worker nodes.
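If you still need Neuron-capable worker nodes, a nodegroup can be added with eksctl. The following is an illustrative sketch only (the instance type, node count, and nodegroup name are assumptions, and the node AMI must include the Neuron drivers):

eksctl create nodegroup --cluster <cluster-name> --name neuron-inf2-nodes --node-type inf2.xlarge --nodes 2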

Set up the Neuron Monitoring container

The Neuron Monitoring container is hosted on Amazon ECR Public. Although it’s accessible for immediate use, it’s not a recommended best practice for direct production workload use due to potential throttling limits. For more information on this and on setting up a pull through cache, see the Neuron Monitor User Guide. For production environments, it’s advisable to copy the Neuron Monitoring container to your private Amazon Elastic Container Registry (Amazon ECR) repository, where the Amazon ECR pull through cache feature can manage synchronization effectively.
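For reference, a pull through cache rule for the Amazon ECR Public registry can be created with a single CLI call. This is a minimal sketch (the repository prefix is an assumption; the Neuron Monitor User Guide remains the authoritative setup reference):

aws ecr create-pull-through-cache-rule --ecr-repository-prefix ecr-public --upstream-registry-url public.ecr.aws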

Set up Kubernetes for Neuron Monitoring

You can use the following YAML configuration snippet to set up Neuron Monitoring in your Kubernetes cluster. This setup includes a DaemonSet that deploys the monitoring container on each suitable node in the neuron-monitor namespace:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: neuron-monitor
  namespace: neuron-monitor
  labels:
    app: neuron-monitor
    version: v1
spec:
  selector:
    matchLabels:
      app: neuron-monitor
  template:
    metadata:
      labels:
        app: neuron-monitor
        version: v1
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: kubernetes.io/os
                    operator: In
                    values:
                      - linux
                  - key: node.kubernetes.io/instance-type
                    operator: In
                    values:
                      - trn1.2xlarge
                      - trn1.32xlarge
                      - trn1n.32xlarge
                      - inf1.xlarge
                      - inf1.2xlarge
                      - inf1.6xlarge
                      - inf2.xlarge
                      - inf2.8xlarge
                      - inf2.24xlarge
                      - inf2.48xlarge
      containers:
        - name: neuron-monitor
          # Replace with your image URI; public.ecr.aws/neuron/neuron-monitor:1.0.1 is the upstream image
          image: <IMAGE_URI>
          ports:
            - containerPort: 8000
          command:
            - "/opt/bin/entrypoint.sh"
          args:
            - "--port"
            - "8000"
          resources:
            limits:
              cpu: 500m
              memory: 256Mi
            requests:
              cpu: 256m
              memory: 128Mi
          env:
            - name: GOMEMLIMIT
              value: 160MiB
          securityContext:
            privileged: true

To apply this YAML file, complete the following steps:

    Replace <IMAGE_URI> in the DaemonSet manifest with the URI of the Neuron Monitoring container image in your ECR repository (for example, a private copy of public.ecr.aws/neuron/neuron-monitor:1.0.1).
    Apply the YAML file with the Kubernetes command line tool:
kubectl apply -f <filename>.yaml
    Verify the Neuron Monitor container is running as DaemonSet:
kubectl get daemonset -n neuron-monitor
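Optionally, you can spot-check that the monitoring endpoint is serving Prometheus-format metrics by port-forwarding to the DaemonSet. This is a quick local check rather than a required setup step:

# In one terminal, forward the container port locally
kubectl port-forward -n neuron-monitor daemonset/neuron-monitor 8000:8000
# In a second terminal, confirm metrics are exposed
curl -s http://localhost:8000/metrics | head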

Set up Amazon Managed Service for Prometheus

To utilize Amazon Managed Service for Prometheus with your EKS cluster, you must first configure Prometheus to scrape metrics from Neuron Monitor pods and forward them to the managed service.

Prometheus requires persistent storage, which the Amazon EBS Container Storage Interface (CSI) driver provides in the EKS cluster. You can use eksctl to set up the necessary components.

    Create an AWS Identity and Access Management (IAM) service account with appropriate permissions:
eksctl create iamserviceaccount --name ebs-csi-controller-sa --namespace kube-system --cluster <cluster-name> --role-name <role name> --role-only --attach-policy-arn arn:aws:iam::aws:policy/service-role/AmazonEBSCSIDriverPolicy --approve
    Install the Amazon Elastic Block Store (Amazon EBS) CSI driver add-on:
eksctl create addon --name aws-ebs-csi-driver --cluster <cluster-name> --service-account-role-arn <role-arn> --force
    Verify the add-on installation:
eksctl get addon --name aws-ebs-csi-driver --cluster <cluster-name>

Now you’re ready to set up your Amazon Managed Service for Prometheus workspace.

    Create a workspace using the AWS Command Line Interface (AWS CLI) and confirm its active status:
aws amp create-workspace --alias <alias>
aws amp list-workspaces --alias <alias>
    Set up the required service roles following the AWS guidelines to facilitate the ingestion of metrics from your EKS clusters. This includes creating an IAM role specifically for Prometheus ingestion; you can confirm the role exists with the following command:
aws iam get-role --role-name amp-iamproxy-ingest-role
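If the role doesn't exist yet, the following is a hedged sketch of creating it (ingest-trust-policy.json is a hypothetical file name that must contain a trust policy for your cluster's OIDC provider; AmazonPrometheusRemoteWriteAccess is the AWS managed policy that permits remote write):

# Create the ingestion role from a prepared trust policy document (hypothetical file name)
aws iam create-role --role-name amp-iamproxy-ingest-role --assume-role-policy-document file://ingest-trust-policy.json
# Allow the role to remote-write metrics into Amazon Managed Service for Prometheus
aws iam attach-role-policy --role-name amp-iamproxy-ingest-role --policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess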

Next, you install Prometheus in your EKS cluster using a Helm chart, configuring it to scrape metrics from Neuron Monitor and forward them to your Amazon Managed Service for Prometheus workspace. The following is an example Helm values .yaml file that overrides the necessary configurations:

serviceAccounts:
  server:
    name: "amp-iamproxy-ingest-service-account"
    annotations:
      eks.amazonaws.com/role-arn: "arn:aws:iam::<account-id>:role/amp-iamproxy-ingest-role"
server:
  remoteWrite:
    - url: https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>/api/v1/remote_write
      sigv4:
        region: <region>
      queue_config:
        max_samples_per_send: 1000
        max_shards: 200
        capacity: 2500
extraScrapeConfigs: |
  - job_name: neuron-monitor-stats
    kubernetes_sd_configs:
    - role: pod
    relabel_configs:
    - source_labels: [__meta_kubernetes_pod_label_app]
      action: keep
      regex: neuron-monitor
    - source_labels: [__meta_kubernetes_pod_container_port_number]
      action: keep
      regex: 8000

This file has the following key sections:

serviceAccounts annotates the Prometheus server service account with the IAM role that is allowed to write to Amazon Managed Service for Prometheus.
server.remoteWrite forwards the collected metrics to your Amazon Managed Service for Prometheus workspace endpoint using SigV4 signing.
extraScrapeConfigs adds a scrape job that keeps only pods labeled app: neuron-monitor and scrapes them on port 8000.

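If the prometheus-community chart repository is not already registered with Helm, add it first (this is the standard public repository URL; skip if already configured):

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update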
    Install Prometheus in your EKS cluster using the Helm command and specifying the .yaml file:
helm install prometheus prometheus-community/prometheus -n prometheus --create-namespace -f values.yaml
    Verify the installation by checking that all Prometheus pods are running:
kubectl get pods -n prometheus

This confirms that Prometheus is correctly set up to collect metrics from the Neuron Monitor container and forward them to Amazon Managed Service for Prometheus.
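To verify end to end that samples are arriving in the workspace, you can issue a signed query against the workspace's Prometheus-compatible query API. The following sketch assumes the third-party awscurl utility is installed; the up query simply checks that the neuron-monitor-stats job is being scraped:

awscurl --service aps --region <region> "https://aps-workspaces.<region>.amazonaws.com/workspaces/<workspace-id>/api/v1/query?query=up"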

Integrate Amazon Managed Grafana

When Prometheus is operational, complete the following steps:

    Set up Amazon Managed Grafana. For instructions, see Getting started with Amazon Managed Grafana.
    Configure it to use Amazon Managed Service for Prometheus as a data source. For details, see Use AWS data source configuration to add Amazon Managed Service for Prometheus as a data source.
    Import the example Neuron Monitor dashboard from GitHub to quickly visualize your metrics.
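If you prefer to create the Amazon Managed Grafana workspace from the CLI instead of the console, a minimal sketch follows (the authentication and permission settings are assumptions; align them with your organization's identity setup):

aws grafana create-workspace --workspace-name <name> --account-access-type CURRENT_ACCOUNT --authentication-providers AWS_SSO --permission-type SERVICE_MANAGED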

The following screenshot shows your dashboard integrated with Amazon Managed Grafana.

Fig.4 Integrating Amazon Managed Grafana

Clean up

To make sure none of the resources created in this walkthrough are left running, complete the following cleanup steps:

    Delete the Amazon Managed Grafana workspace.
    Uninstall Prometheus from the EKS cluster:
helm uninstall prometheus -n prometheus
    Remove the Amazon Managed Service for Prometheus workspace ID from the trust policy of the role amp-iamproxy-ingest-role, or delete the role.
    Delete the Amazon Managed Service for Prometheus workspace:
aws amp delete-workspace --workspace-id <workspace-id>
    Clean up the CSI:
eksctl delete addon --cluster <cluster-name> --name aws-ebs-csi-driver
eksctl delete iamserviceaccount --name ebs-csi-controller-sa --namespace kube-system --cluster <cluster-name>
    Delete the Neuron Monitor DaemonSet from the EKS cluster:
kubectl delete daemonset neuron-monitor -n neuron-monitor

Conclusion

The release of the Neuron Monitor container marks a significant enhancement in the monitoring of ML workloads on Amazon EKS, specifically tailored for AWS Inferentia and Trainium chips. This solution simplifies the integration of powerful monitoring tools like Prometheus, Grafana, and CloudWatch, so you can effectively manage and optimize your ML applications with ease and precision.

To explore the full capabilities of this monitoring solution, refer to Deploy Neuron Container on Elastic Kubernetes Service (EKS). Refer to Amazon EKS and Kubernetes Container Insights metrics to learn more about setting up the Neuron Monitor container and using Container Insights to fully harness the capabilities of your ML infrastructure on Amazon EKS. Additionally, engage with our community through our GitHub repo to share experiences and best practices, so you stay at the forefront of ML operations on AWS.


About the Authors

Niithiyn Vijeaswaran is a Solutions Architect at AWS. His area of focus is generative AI and AWS AI Accelerators. He holds a Bachelor’s degree in Computer Science and Bioinformatics. Niithiyn works closely with the Generative AI GTM team to enable AWS customers on multiple fronts and accelerate their adoption of generative AI. He’s an avid fan of the Dallas Mavericks and enjoys collecting sneakers.

Emir Ayar is a Senior Tech Lead Solutions Architect with the AWS Prototyping team. He specializes in assisting customers with building ML and generative AI solutions, and implementing architectural best practices. He supports customers in experimenting with solution architectures to achieve their business objectives, emphasizing agile innovation and prototyping. He lives in Luxembourg and enjoys playing synthesizers.

Ziwen Ning is a software development engineer at AWS. He currently focuses on enhancing the AI/ML experience through the integration of AWS Neuron with containerized environments and Kubernetes. In his free time, he enjoys challenging himself with badminton, swimming and other various sports, and immersing himself in music.

Rohit Talluri is a Generative AI GTM Specialist (Tech BD) at Amazon Web Services (AWS). He is partnering with top generative AI model builders, strategic customers, key AI/ML partners, and AWS Service Teams to enable the next generation of artificial intelligence, machine learning, and accelerated computing on AWS. He was previously an Enterprise Solutions Architect, and the Global Solutions Lead for AWS Mergers & Acquisitions Advisory.

Albert Opher is a Solutions Architect Intern at AWS. He is a rising senior at the University of Pennsylvania pursuing Dual Bachelor’s Degrees in Computer Information Science and Business Analytics in the Jerome Fisher Management and Technology Program. He has experience with multiple programming languages, AWS cloud services, AI/ML technologies, product and operations management, pre and early seed start-up ventures, and corporate finance.

Geeta Gharpure is a senior software developer on the Annapurna ML engineering team. She is focused on running large scale AI/ML workloads on Kubernetes. She lives in Sunnyvale, CA, and enjoys listening to Audible in her free time.
