GPU Metrics and Cost Allocation
GPU Metric K8s Integration
Motivation
To set up a cluster and configure Prometheus to capture GPU utilization and GPU memory metrics.
Prerequisites
To gather GPU telemetry from Kubernetes pods we need to deploy the following services:
nvidia-device-plugin
dcgm-exporter
This document describes how to set up these services in a new K8s installation. I have used an existing guide as the source for this instruction and applied several fixes to achieve the result. For the original guide, please refer to https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html
You need several prerequisites for this guide:
Installed Helm
Installed and configured kubectl
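To confirm both tools are available and pointed at the intended cluster, a quick check (output omitted):
helm version
kubectl version --client
kubectl config current-context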
Installation configuration
To test the setup, we have used EKS clusters with versions 1.18 and 1.21.
In the example, we chose a p2.xlarge worker node for cost optimization.
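As an illustration only (the cluster name, region, and node count below are placeholders, not from our setup), a GPU worker node group can be created with eksctl:
eksctl create cluster \
  --name gpu-metrics-demo \
  --version 1.21 \
  --region us-east-1 \
  --node-type p2.xlarge \
  --nodes 1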
List of Metrics
Metric Id | Semantics |
---|---|
GPU Metrics | |
DCGM_FI_DEV_GPU_UTIL | GPU utilization (in %). |
DCGM_FI_PROF_GR_ENGINE_ACTIVE | Ratio of time the graphics engine is active (in %). |
DCGM_FI_PROF_SM_OCCUPANCY | The ratio of the number of warps resident on an SM to the theoretical maximum (in %). Similar to DCGM_FI_DEV_GPU_UTIL, but shows how effectively the resource is utilized. |
Memory Metrics | |
DCGM_FI_DEV_MEM_COPY_UTIL | Memory utilization (in %). |
DCGM_FI_DEV_FB_USED | Framebuffer memory used (in MiB). |
DCGM_FI_DEV_PCIE_TX(RX)_THROUGHPUT | Total number of bytes transmitted through PCIe TX(RX) (in KB) via NVML. |
Integration guide
Set up the nvidia-device-plugin service by running the following command:
helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && helm repo update &&
helm install --generate-name nvdp/nvidia-device-plugin
Check that the installation finished correctly and that a nvidia-device-plugin pod is running:
% kubectl get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system aws-node-4j682 1/1 Running 0 75s
kube-system coredns-f47955f89-gs6zk 1/1 Running 0 8m5s
kube-system coredns-f47955f89-xm6rd 1/1 Running 0 8m5s
kube-system kube-proxy-csdwp 1/1 Running 0 2m19s
kube-system nvidia-device-plugin-1633035998-2j2qp 1/1 Running 0 38s
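Optionally, confirm that the plugin now advertises GPU capacity on the node (a sanity check, not part of the source guide):
kubectl describe nodes | grep "nvidia.com/gpu"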
Install the monitoring solution consisting of kube-state-metrics and Prometheus. We use the predefined kube-prometheus-stack Helm chart to deploy the whole set of services. Changes to kube-prometheus-stack.values should be applied as described in the source guide:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && \
helm repo update && \
helm install prometheus-community/kube-prometheus-stack \
--namespace kube-system --generate-name --values ./kube-prometheus-stack.values
Verify the installation
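For example (pod and release names will differ because of --generate-name):
kubectl get pods -n kube-system | grep -E 'prometheus|kube-state-metrics|grafana'
kubectl get svc -n kube-system | grep prometheus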
Install the DCGM-Exporter service. Please note that some metrics are disabled in the default installation; if you need a custom set of metrics, you have to rebuild the service's Docker image with your configuration (see: https://github.com/NVIDIA/dcgm-exporter#changing-metrics). I have used a pre-built Docker image from the community; you can find the reference in Appendix 1.
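A minimal install sketch, assuming the official gpu-helm-charts repository and a values file (./dcgm_vals.yaml, see Appendix 1) that points to the community image:
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts && \
helm repo update && \
helm install --generate-name gpu-helm-charts/dcgm-exporter \
--namespace kube-system --values ./dcgm_vals.yaml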
Verify the installation
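For example, check that the dcgm-exporter pod is running and serves metrics; the service name below is an assumption (with --generate-name it will carry a numeric suffix) and 9400 is the chart's default port:
kubectl get pods -n kube-system | grep dcgm-exporter
kubectl port-forward -n kube-system svc/<dcgm-exporter-service> 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV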
After these steps you will be able to query the metrics exposed by DCGM-Exporter, which represent GPU resource usage. The list of available metrics: https://github.com/NVIDIA/dcgm-exporter/blob/main/etc/dcp-metrics-included.csv
Now we are ready to perform a test GPU run. For this purpose we can deploy a workload that generates GPU load.
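The source NVIDIA guide uses a dcgmproftester pod for this; a sketch along those lines (the image tag and test parameters come from that guide and may need adjusting for your GPU type):
cat << EOF | kubectl create -f -
apiVersion: v1
kind: Pod
metadata:
  name: dcgmproftester
spec:
  restartPolicy: OnFailure
  containers:
  - name: dcgmproftester11
    image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
    args: ["--no-dcgm-validation", "-t 1004", "-d 120"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF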
Wait for 3-5 minutes and then delete the demo workload.
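Assuming the pod name from the sketch above:
kubectl delete pod dcgmproftester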
After that you can go to Prometheus and check the DCGM_FI_DEV_GPU_UTIL metric.
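For example, port-forward the Prometheus service and run the query in the UI; the service name is an assumption here, so check kubectl get svc -n kube-system for the actual one:
kubectl port-forward -n kube-system svc/<prometheus-service> 9090:9090 &
# open http://localhost:9090 and query: DCGM_FI_DEV_GPU_UTIL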
Then we can add the metrics to a Grafana dashboard; for this purpose refer to: https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html#using-grafana
You should see the corresponding dashboards and graphs after following the guide.
Appendix 1: dcgm_vals.yaml
Copyright 2023 Yotascale, Inc. All Rights Reserved. Yotascale and the Yotascale logo are trademarks of Yotascale, Inc.