GPU Metrics and Cost Allocation
GPU Metric K8s Integration
Motivation
To set up a cluster and configure Prometheus to capture GPU utilization and GPU memory metrics.
Prerequisites
To gather GPU telemetry from Kubernetes pods we need to deploy the following services:
nvidia-device-plugin
dcgm-exporter
This document describes how to set up these services in a new K8s installation. I have used an existing guide as the source for this instruction and applied several fixes to achieve the result. For the original guide, please refer to https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html
You need several prerequisites for this guide:
Installed Helm
Installed and configured kubectl
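To confirm both tools are available and pointed at the intended cluster, a quick check (output omitted):
helm version
kubectl version --client
kubectl config current-context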
Installation configuration
To test the setup, we have used EKS clusters with versions 1.18 and 1.21.
In the example, we chose a p2.xlarge worker node for cost optimization.
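As an illustration only (the cluster name, region, and node count below are placeholders, not from our setup), a GPU worker node group can be created with eksctl:
eksctl create cluster \
  --name gpu-metrics-demo \
  --version 1.21 \
  --region us-east-1 \
  --node-type p2.xlarge \
  --nodes 1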
List of Metrics
Metric Id | Semantics |
---|---|
GPU Metrics | |
DCGM_FI_DEV_GPU_UTIL | GPU utilization (in %). |
DCGM_FI_PROF_GR_ENGINE_ACTIVE | Ratio of time the graphics engine is active (in %). |
DCGM_FI_PROF_SM_OCCUPANCY | The ratio of the number of warps resident on an SM to the theoretical maximum (in %). Similar to DCGM_FI_DEV_GPU_UTIL, but shows how effectively the resource is utilized. |
Memory Metrics | |
DCGM_FI_DEV_MEM_COPY_UTIL | Memory utilization (in %). |
DCGM_FI_DEV_FB_USED | Framebuffer memory used (in MiB). |
DCGM_FI_DEV_PCIE_TX(RX)_THROUGHPUT | Total number of bytes transmitted through PCIe TX(RX) (in KB) via NVML. |
Integration guide
Set up the nvidia-device-plugin service by running the following command:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && helm repo update &&
helm install --generate-name nvdp/nvidia-device-plugin
Check that the installation finished correctly and that a nvidia-device-plugin pod is running:
% kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system aws-node-4j682 1/1 Running 0 75s
kube-system coredns-f47955f89-gs6zk 1/1 Running 0 8m5s
kube-system coredns-f47955f89-xm6rd 1/1 Running 0 8m5s
kube-system kube-proxy-csdwp 1/1 Running 0 2m19s
kube-system nvidia-device-plugin-1633035998-2j2qp 1/1 Running 0 38s
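Optionally, confirm that the plugin now advertises GPU capacity on the node (a sanity check, not part of the source guide):
kubectl describe nodes | grep "nvidia.com/gpu"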
Install the monitoring solution consisting of kube-state-metrics and Prometheus. We use the predefined kube-prometheus-stack Helm chart to deploy the whole set of services. Changes to kube-prometheus-stack.values should be applied as described in the source guide:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && \
helm repo update && \
helm install prometheus-community/kube-prometheus-stack \
--namespace kube-system --generate-name --values ./kube-prometheus-stack.values
Verify the installation
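For example (pod and release names will differ because of --generate-name):
kubectl get pods -n kube-system | grep -E 'prometheus|kube-state-metrics|grafana'
kubectl get svc -n kube-system | grep prometheus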
Install the DCGM-Exporter service. Please note that some metrics are disabled in the default installation; if you need a custom set of metrics, you have to rebuild the service's Docker image with your configuration (see: https://github.com/NVIDIA/dcgm-exporter#changing-metrics). I have used a pre-built Docker image from the community; you can find the reference in Appendix 1.
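A minimal install sketch, assuming the official gpu-helm-charts repository and a values file (./dcgm_vals.yaml, see Appendix 1) that points to the community image:
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts && \
helm repo update && \
helm install --generate-name gpu-helm-charts/dcgm-exporter \
--namespace kube-system --values ./dcgm_vals.yaml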
Verify the installation
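For example, check that the dcgm-exporter pod is running and serves metrics; the service name below is an assumption (with --generate-name it will carry a numeric suffix) and 9400 is the chart's default port:
kubectl get pods -n kube-system | grep dcgm-exporter
kubectl port-forward -n kube-system svc/<dcgm-exporter-service> 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV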
After these steps you will be able to query the metrics exposed by DCGM-Exporter, which represent GPU resource usage. The list of available metrics: https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv
Now we are ready to perform a test GPU run. For this purpose we can deploy a workload that generates GPU load.
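The source NVIDIA guide uses a dcgmproftester pod for this; a sketch along those lines (the image tag and test parameters come from that guide and may need adjusting for your GPU type):
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: dcgmproftester
spec:
  restartPolicy: OnFailure
  containers:
  - name: dcgmproftester11
    image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
    args: ["--no-dcgm-validation", "-t 1004", "-d 120"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF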
Wait for 3-5 minutes and then delete the demo workload.
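Assuming the pod name from the sketch above:
kubectl delete pod dcgmproftester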
After that you can go to Prometheus and check the DCGM_FI_DEV_GPU_UTIL metric.
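For example, port-forward the Prometheus service and run the query in the UI; the service name is an assumption here, so check kubectl get svc -n kube-system for the actual one:
kubectl port-forward -n kube-system svc/<prometheus-service> 9090:9090 &
# open http://localhost:9090 and query: DCGM_FI_DEV_GPU_UTIL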
Then we can add the metrics to a Grafana dashboard; for this purpose refer to: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html#using-grafana
You should see the corresponding dashboards and graphs after following the guide.
Appendix 1: dcgm_vals.yaml
Copyright 2023 Yotascale, Inc. All Rights Reserved. Yotascale and the Yotascale logo are trademarks of Yotascale, Inc.