GPU Metrics and Cost Allocation

GPU Metrics K8s Integration

Motivation

To set up a cluster and configure Prometheus to capture GPU compute and memory metrics.

Prerequisites

To gather GPU telemetry metrics from Kubernetes pods, we need to deploy two supporting services:

  • nvidia-device-plugin

  • dcgm-exporter

This document describes how to set up these services in a new K8s installation. I used NVIDIA's guide as the source for these instructions and applied several fixes to achieve a working result. For the original guide, please refer to https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html

You need several prerequisites for this guide:

  • Installed Helm

  • Installed and configured kubectl
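
To confirm both tools are installed and can reach the cluster, a quick sanity check (assuming a configured kubeconfig):

# Check client versions
helm version
kubectl version --client

# Confirm kubectl can reach the cluster
kubectl get nodes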

Installation configuration

To test the setup, we used EKS clusters running Kubernetes versions 1.18 and 1.21.

In the example, we chose a p2.xlarge worker node (a single NVIDIA K80 GPU) to keep costs down.
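
The document does not show how the cluster itself was created; as one hedged example, a small cluster with a GPU node can be provisioned with eksctl (the cluster name and region below are placeholders, not values from this guide):

# Hypothetical example: create a small EKS cluster with one GPU node
eksctl create cluster \
  --name gpu-metrics-test \
  --region us-west-2 \
  --node-type p2.xlarge \
  --nodes 1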

List of Metrics

GPU Metrics

  • DCGM_FI_DEV_GPU_UTIL: GPU utilization (in %).

  • DCGM_FI_PROF_GR_ENGINE_ACTIVE: Ratio of time the graphics engine is active (in %).

  • DCGM_FI_PROF_SM_OCCUPANCY: The ratio of the number of warps resident on an SM (in %). Similar to DCGM_FI_DEV_GPU_UTIL, but shows how effectively the resource is utilized.

Memory Metrics

  • DCGM_FI_DEV_MEM_COPY_UTIL: Memory utilization (in %).

  • DCGM_FI_DEV_FB_USED: Framebuffer memory used (in MiB).

  • DCGM_FI_DEV_PCIE_TX(RX)_THROUGHPUT: Total number of bytes transmitted through PCIe TX(RX) (in KB) via NVML.
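
These metrics become queryable in the Prometheus UI once dcgm-exporter is scraped. As a hedged illustration (the Hostname label is attached by dcgm-exporter; pod-level labels depend on your scrape configuration):

# Framebuffer memory currently in use, per GPU (in MiB)
DCGM_FI_DEV_FB_USED

# Average memory-copy utilization per node over the last 5 minutes
avg by (Hostname) (avg_over_time(DCGM_FI_DEV_MEM_COPY_UTIL[5m]))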

Integration guide

  • Set up the nvidia-device-plugin service by running the command

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && \
helm repo update && \
helm install --generate-name nvdp/nvidia-device-plugin
  • Check that the installation finished correctly and that a nvidia-device-plugin pod is running

% kubectl get pods -A
NAMESPACE     NAME                                    READY   STATUS    RESTARTS   AGE
kube-system   aws-node-4j682                          1/1     Running   0          75s
kube-system   coredns-f47955f89-gs6zk                 1/1     Running   0          8m5s
kube-system   coredns-f47955f89-xm6rd                 1/1     Running   0          8m5s
kube-system   kube-proxy-csdwp                        1/1     Running   0          2m19s
kube-system   nvidia-device-plugin-1633035998-2j2qp   1/1     Running   0          38s
  • Install the monitoring solution consisting of kube-state-metrics and Prometheus. We use the predefined kube-prometheus-stack Helm chart to deploy the whole set of services. Changes to kube-prometheus-stack.values should be applied as in the source guide

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && \
helm repo update && \
helm install prometheus-community/kube-prometheus-stack \
  --namespace kube-system --generate-name --values ./kube-prometheus-stack.values
  • Verify the installation (one way to do this is sketched below)
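
One way to verify, sketched here rather than taken from the original guide: check that the monitoring pods are up and port-forward the Prometheus service created by the operator (service and release names may differ in your installation):

# Check that the monitoring pods are running
kubectl get pods -n kube-system | grep -E 'prometheus|kube-state-metrics|grafana'

# Forward the operator-created headless service to localhost:9090
kubectl port-forward -n kube-system svc/prometheus-operated 9090:9090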

  • Install the DCGM-Exporter service (an example install command is sketched below). Please note that some metrics are disabled in the default installation. If you need a custom set of metrics, you have to rebuild the Docker image of the service with your own configuration (see: https://github.com/NVIDIA/dcgm-exporter#changing-metrics). I used a pre-built Docker image from the community; you can find the reference in Appendix 1.
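
A hedged sketch of the install command, assuming the gpu-helm-charts repository from the NVIDIA guide and the values file from Appendix 1 (the repo URL may have moved since the guide was written):

# Add NVIDIA's dcgm-exporter chart repository and install with custom values
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts && \
helm repo update && \
helm install --generate-name gpu-helm-charts/dcgm-exporter --values ./dcgm_vals.yaml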

  • Verify the installation (for example, as sketched below)
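
One possible check, not taken from the original guide: confirm the exporter pod is running and that it serves metrics on its default port 9400 (the pod name below is a placeholder):

# The dcgm-exporter DaemonSet runs one pod per GPU node
kubectl get pods -A | grep dcgm-exporter

# Fetch raw metrics from the exporter; 9400 is its default port
kubectl port-forward pod/<dcgm-exporter-pod-name> 9400:9400 &
curl -s localhost:9400/metrics | grep DCGM_FI_DEV_GPU_UTIL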

  • Run a demo GPU workload (a sketch follows below), wait for 3-5 minutes, and then kill the demo service
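
The demo workload itself is not included here; the NVIDIA source guide uses a dcgmproftester pod for this purpose. A sketch along those lines (the image tag and test parameters follow that guide and may need adjusting for your GPU):

# dcgmproftester.yaml: a short-lived pod that generates GPU load
apiVersion: v1
kind: Pod
metadata:
  name: dcgmproftester
spec:
  restartPolicy: OnFailure
  containers:
  - name: dcgmproftester11
    image: nvidia/samples:dcgmproftester-2.0.10-cuda11.0-ubuntu18.04
    args: ["--no-dcgm-validation", "-t 1004", "-d 120"]
    resources:
      limits:
        nvidia.com/gpu: 1

Launch it with kubectl create -f dcgmproftester.yaml, and delete it after 3-5 minutes with kubectl delete pod dcgmproftester.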

  • After that, you can go to the Prometheus UI and check the DCGM_FI_DEV_GPU_UTIL metric
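
A couple of illustrative queries for the Prometheus UI (the grouping labels are an assumption and depend on your scrape configuration):

# Raw per-GPU utilization as reported by dcgm-exporter
DCGM_FI_DEV_GPU_UTIL

# Average utilization per GPU over the last 10 minutes
avg by (gpu, Hostname) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[10m]))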

 

Appendix 1: dcgm_vals.yaml
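
A hypothetical skeleton of dcgm_vals.yaml for the dcgm-exporter chart (the image repository and tag are placeholders for the community image mentioned above, not the actual values used):

# Hypothetical dcgm_vals.yaml: placeholders, not the original file
image:
  repository: <community-image-repository>  # pre-built image with the custom metric set
  tag: <image-tag>

serviceMonitor:
  enabled: true  # let the Prometheus Operator discover the exporter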

 

 

Copyright 2023 Yotascale, Inc. All Rights Reserved. Yotascale and the Yotascale logo are trademarks of Yotascale, Inc.