GPU Metric K8s Integration
Motivation
Set up a cluster and configure Prometheus to capture GPU utilization and GPU memory metrics.
Prerequisites
To gather GPU telemetry from Kubernetes pods, we need to deploy two special services:

- nvidia-device-plugin
- dcgm-exporter
This document describes how to set up these services in a new K8s installation. I used another guide as the source for these instructions and applied several fixes to get a working result. For the original guide, please refer to https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html
You need several prerequisites for this guide:

- Installed Helm
- Installed and configured kubectl
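A quick sanity check that both tools are available and pointed at the right cluster (a minimal sketch; the context name will of course differ per setup):

```
# Both commands should succeed before proceeding.
helm version
kubectl version
# Confirm kubectl is configured for the cluster you intend to modify.
kubectl config current-context
```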
Installation configuration
To test the setup, we used EKS clusters running versions 1.18 and 1.21.
In the examples, we chose a p2.xlarge worker node to keep costs down.
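For reference, a minimal eksctl invocation for such a test cluster might look like the following. This is a sketch only; the cluster name and node count are illustrative placeholders, not part of the original setup:

```
# A single p2.xlarge node keeps GPU costs low for a throwaway test cluster.
eksctl create cluster \
  --name gpu-metrics-test \
  --version 1.21 \
  --node-type p2.xlarge \
  --nodes 1
```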
List of Metrics
| Metric ID | Semantics |
|---|---|
| GPU Metrics | |
| DCGM_FI_DEV_GPU_UTIL | GPU utilization (in %). |
| DCGM_FI_PROF_GR_ENGINE_ACTIVE | Ratio of time the graphics engine is active (in %). |
| DCGM_FI_PROF_SM_OCCUPANCY | Ratio of warps resident on an SM to the theoretical maximum (in %). Similar to DCGM_FI_DEV_GPU_UTIL, but shows how effectively the GPU is utilized. |
| Memory Metrics | |
| DCGM_FI_DEV_MEM_COPY_UTIL | Memory utilization (in %). |
| DCGM_FI_DEV_FB_USED | Framebuffer memory used (in MiB). |
| DCGM_FI_DEV_PCIE_TX(RX)_THROUGHPUT | Total number of bytes transmitted (received) through PCIe TX (RX) (in KB) via NVML. |
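Once the whole stack below is running, these series can be queried from Prometheus like any other metric. For example, to compare raw utilization against SM occupancy over a window (a sketch, assuming the Prometheus API is port-forwarded to localhost:9090 as shown later in this guide):

```
# Average GPU utilization and SM occupancy over the last 5 minutes,
# via the Prometheus HTTP API (form-encoding handles the brackets).
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m])'
curl -s http://localhost:9090/api/v1/query \
  --data-urlencode 'query=avg_over_time(DCGM_FI_PROF_SM_OCCUPANCY[5m])'
```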
Integration guide
Set up the nvidia-device-plugin service by running the following command:
```
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && \
helm repo update && \
helm install --generate-name nvdp/nvidia-device-plugin
```
Check that the installation finished correctly and that an nvidia-device-plugin pod is running:
```
% kubectl get pods -A
NAMESPACE     NAME                                    READY   STATUS    RESTARTS   AGE
kube-system   aws-node-4j682                          1/1     Running   0          75s
kube-system   coredns-f47955f89-gs6zk                 1/1     Running   0          8m5s
kube-system   coredns-f47955f89-xm6rd                 1/1     Running   0          8m5s
kube-system   kube-proxy-csdwp                        1/1     Running   0          2m19s
kube-system   nvidia-device-plugin-1633035998-2j2qp   1/1     Running   0          38s
```
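You can also confirm that the device plugin advertised the GPU to the scheduler (this check also appears in the source NVIDIA guide):

```
# The allocatable nvidia.com/gpu count should match the GPUs on the node
# (1 for a p2.xlarge).
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
```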
Install the monitoring solution, consisting of kube-state-metrics and Prometheus. We use the predefined kube-prometheus-stack Helm chart to deploy the whole set of services. The changes to kube-prometheus-stack.values should be applied as described in the source guide.
```
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && \
helm repo update && \
helm install prometheus-community/kube-prometheus-stack \
  --namespace kube-system --generate-name --values ./kube-prometheus-stack.values
```
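As a sketch of how the values file referenced above is produced (the exact edits are in the source guide; the key one is letting Prometheus discover ServiceMonitors outside the Helm release's own label selector, which dcgm-exporter relies on):

```
# Dump the chart's default values into the file used by --values above,
# then edit it following the source guide. In the prometheusSpec section, set:
#
#   serviceMonitorSelectorNilUsesHelmValues: false
#
# so that Prometheus also picks up ServiceMonitors (such as dcgm-exporter's)
# that this Helm release did not create.
helm inspect values prometheus-community/kube-prometheus-stack > ./kube-prometheus-stack.values
```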
Verify the installation
```
% kubectl get pods -A
NAMESPACE     NAME                                                              READY   STATUS    RESTARTS   AGE
kube-system   alertmanager-kube-prometheus-stack-1633-alertmanager-0            2/2     Running   0          49s
kube-system   aws-node-4j682                                                    1/1     Running   0          6m
kube-system   coredns-f47955f89-gs6zk                                           1/1     Running   0          12m
kube-system   coredns-f47955f89-xm6rd                                           1/1     Running   0          12m
kube-system   kube-prometheus-stack-1633-operator-8576fc8f45-64vpb              1/1     Running   0          52s
kube-system   kube-prometheus-stack-1633036072-grafana-778bcb548b-256nw         2/2     Running   0          52s
kube-system   kube-prometheus-stack-1633036072-kube-state-metrics-68c6b6dxj5s   1/1     Running   0          52s
kube-system   kube-prometheus-stack-1633036072-prometheus-node-exporter-w2k67   1/1     Running   0          53s
kube-system   kube-proxy-csdwp                                                  1/1     Running   0          7m4s
kube-system   nvidia-device-plugin-1633035998-2j2qp                             1/1     Running   0          5m23s
kube-system   prometheus-kube-prometheus-stack-1633-prometheus-0                2/2     Running   0          48s
```
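To reach the Prometheus UI and API from your workstation without exposing the service externally, a local port-forward works. The service name includes the generated release suffix, so look it up first (the placeholder below is not a real name):

```
# Find the Prometheus service created by the chart.
kubectl get svc -n kube-system | grep prometheus
# Substitute the actual service name; 9090 is Prometheus's default web port.
kubectl port-forward -n kube-system svc/<prometheus-service-name> 9090:9090
```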
Install the DCGM-Exporter service. Note that some metrics are disabled in the default installation; if you need a custom set of metrics, you must rebuild the service's Docker image with your configuration (see: https://github.com/NVIDIA/dcgm-exporter#changing-metrics). I used a pre-built Docker image from the community; you can find the reference in Appendix 1.
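The install command below assumes the gpu-helm-charts Helm repo is already registered; if it is not, add it first (the repo URL is taken from the dcgm-exporter README and should be verified for your setup):

```
# Register NVIDIA's dcgm-exporter chart repository under the name
# "gpu-helm-charts" used by the install command below.
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts && \
helm repo update
```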
```
helm install --namespace kube-system --generate-name \
  --values ./dcgm_vals.yaml gpu-helm-charts/dcgm-exporter
```
Verify the installation
```
% kubectl get pods -A
NAMESPACE     NAME                                                              READY   STATUS    RESTARTS   AGE
kube-system   alertmanager-kube-prometheus-stack-1633-alertmanager-0            2/2     Running   0          2m47s
kube-system   aws-node-4j682                                                    1/1     Running   0          7m58s
kube-system   coredns-f47955f89-gs6zk                                           1/1     Running   0          14m
kube-system   coredns-f47955f89-xm6rd                                           1/1     Running   0          14m
kube-system   dcgm-exporter-1633036367-nct2v                                    1/1     Running   0          67s
kube-system   kube-prometheus-stack-1633-operator-8576fc8f45-64vpb              1/1     Running   0          2m50s
kube-system   kube-prometheus-stack-1633036072-grafana-778bcb548b-256nw         2/2     Running   0          2m50s
kube-system   kube-prometheus-stack-1633036072-kube-state-metrics-68c6b6dxj5s   1/1     Running   0          2m50s
kube-system   kube-prometheus-stack-1633036072-prometheus-node-exporter-w2k67   1/1     Running   0          2m51s
kube-system   kube-proxy-csdwp                                                  1/1     Running   0          9m2s
kube-system   nvidia-device-plugin-1633035998-2j2qp                             1/1     Running   0          7m21s
kube-system   prometheus-kube-prometheus-stack-1633-prometheus-0                2/2     Running   0          2m46s
```
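To confirm the exporter is actually serving GPU metrics, you can hit its metrics endpoint directly. A minimal check, using the pod name from the output above (port 9400 matches the service section in Appendix 1):

```
# Forward the exporter's metrics port locally (substitute your pod name).
kubectl port-forward -n kube-system pod/dcgm-exporter-1633036367-nct2v 9400:9400 &
# DCGM_FI_* series should appear in the scrape output.
curl -s http://localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL
```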
After these steps you will be able to query a dedicated set of metrics from the DCGM exporter, representing GPU resource usage. The list of available metrics: https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv
Now we are ready to perform a test GPU run. For this purpose we can install another service:
```
helm fetch https://helm.ngc.nvidia.com/nvidia/charts/video-analytics-demo-0.1.4.tgz && \
helm install video-analytics-demo-0.1.4.tgz --generate-name
```
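Before waiting, it is worth confirming that the demo pod actually scheduled onto the GPU node (the command above installs it into the default namespace):

```
# The pod should reach Running state; Pending usually means the GPU
# resource was not advertised or is already allocated.
kubectl get pods | grep video-analytics-demo
```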
Wait for 3-5 minutes, then remove the demo service:
```
helm delete $(helm list | grep video-analytics-demo | awk '{print $1}')
```
After that, you can go to Prometheus and check the DCGM_FI_DEV_GPU_UTIL metric.
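If the Prometheus port-forward from the verification step is still running, the same check works from the command line (a sketch; the instant query returns one JSON sample per GPU):

```
# Non-zero values while the demo is (or was recently) running confirm the
# full pipeline: device plugin -> dcgm-exporter -> Prometheus.
curl -s 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL'
```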
Then we can bring the metrics into a Grafana dashboard; for this purpose, refer to: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html#using-grafana
After following that guide, you should see the corresponding dashboards and graphs.
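If you do not want to expose Grafana externally while testing, a local port-forward is enough. As with Prometheus, the service name includes the generated release suffix, so look it up first (the placeholder below is not a real name; 80 is the chart's default service port):

```
kubectl get svc -n kube-system | grep grafana
# Substitute the actual service name; Grafana then appears at http://localhost:3000
kubectl port-forward -n kube-system svc/<grafana-service-name> 3000:80
```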
Appendix 1: dcgm_vals.yaml
```
# Copyright (c) 2020, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

image:
  repository: shan100docker/dcgm-exporter
  pullPolicy: IfNotPresent
  tag: 2.1.8-2.4.0-rc.2-ubuntu18.04-v2

# Comment the following line to stop profiling metrics from DCGM
arguments: ["-f", "/etc/dcgm-exporter/dcp-metrics-included.csv"]
# NOTE: in general, add any command line arguments to arguments above
# and they will be passed through.
# Use "-r", "<HOST>:<PORT>" to connect to an already running hostengine
# Example arguments: ["-r", "host123:5555"]
# Use "-n" to remove the hostname tag from the output.
# Example arguments: ["-n"]
# Use "-d" to specify the devices to monitor. -d must be followed by a string
# in the following format: [f] or [g[:numeric_range][+]][i[:numeric_range]]
# Where a numeric range is something like 0-4 or 0,2,4, etc.
# Example arguments: ["-d", "g+i"] to monitor all GPUs and GPU instances or
# ["-d", "g:0-3"] to monitor GPUs 0-3.
# Use "-m" to specify the namespace and name of a configmap containing
# the watched exporter fields.
# Example arguments: ["-m", "default:exporter-metrics-config-map"]

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name:

podAnnotations: {}

podSecurityContext: {}
  # fsGroup: 2000

securityContext:
  runAsNonRoot: false
  runAsUser: 0
  capabilities:
    add: ["SYS_ADMIN"]
  # readOnlyRootFilesystem: true

service:
  enable: true
  type: ClusterIP
  port: 9400
  address: ":9400"
  # Annotations to add to the service
  annotations: {}

resources: {}
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi

serviceMonitor:
  enabled: true
  interval: 15s
  additionalLabels: {}
    #monitoring: prometheus

mapPodsMetrics: false

nodeSelector: {}
  #node: gpu

tolerations: []
#- operator: Exists

affinity: {}
#nodeAffinity:
#  requiredDuringSchedulingIgnoredDuringExecution:
#    nodeSelectorTerms:
#    - matchExpressions:
#      - key: nvidia-gpu
#        operator: Exists

extraHostVolumes: []
#- name: host-binaries
#  hostPath: /opt/bin

extraConfigMapVolumes: []
#- name: exporter-metrics-volume
#  configMap:
#    name: exporter-metrics-config-map

extraVolumeMounts: []
#- name: host-binaries
#  mountPath: /opt/bin
#  readOnly: true

extraEnv: []
#- name: EXTRA_VAR
#  value: "TheStringValue"
```