GPU Metric K8s Integration

Motivation

To set up a cluster and configure Prometheus to capture GPU utilization and GPU memory metrics.

Prerequisites

To gather GPU telemetry metrics from Kubernetes pods, we need to deploy two additional services:

  • nvidia-device-plugin

  • dcgm-exporter

This document describes how to set up these services in a new K8s installation. I have used NVIDIA's guide as the source for this instruction and applied several fixes to achieve a working result. For the original guide, see https://docs.nvidia.com/datacenter/cloud-native/kubernetes/dcgme2e.html

You need several prerequisites for this guide:

  • Installed Helm

  • Installed and configured kubectl
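Before starting, it is worth confirming both tools are actually available on the machine you run the guide from. A minimal sketch (names only, no version pinning):

```shell
# Check that the required CLI tools are on PATH before starting
for tool in helm kubectl; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```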

Installation configuration

To test the setup, we have used EKS clusters with versions 1.18 and 1.21.

In the example, we chose a p2.xlarge worker node for cost optimization.

List of Metrics

Metric Id                           Semantics

GPU Metrics

DCGM_FI_DEV_GPU_UTIL                GPU utilization (in %).
DCGM_FI_PROF_GR_ENGINE_ACTIVE       Ratio of time the graphics engine is active (in %).
DCGM_FI_PROF_SM_OCCUPANCY           Ratio of the number of warps resident on an SM to the
                                    theoretical maximum (in %). Similar to DCGM_FI_DEV_GPU_UTIL,
                                    but shows how effectively the resource is utilized.

Memory Metrics

DCGM_FI_DEV_MEM_COPY_UTIL           Memory utilization (in %).
DCGM_FI_DEV_FB_USED                 Framebuffer memory used (in MiB).
DCGM_FI_DEV_PCIE_TX(RX)_THROUGHPUT  Total number of bytes transmitted (TX) / received (RX)
                                    through PCIe (in KB), via NVML.
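Once dcgm-exporter is running (installation below), these metrics are exposed in Prometheus text format on its /metrics endpoint. A minimal sketch of pulling per-GPU utilization out of that text, using a hypothetical, abridged sample of the output (the label set of the real exporter is larger):

```shell
# Hypothetical, abridged sample of dcgm-exporter's /metrics output
sample='DCGM_FI_DEV_GPU_UTIL{gpu="0",UUID="GPU-aaaa"} 87
DCGM_FI_DEV_GPU_UTIL{gpu="1",UUID="GPU-bbbb"} 12
DCGM_FI_DEV_FB_USED{gpu="0",UUID="GPU-aaaa"} 11441'

# Keep only the GPU utilization samples and print "gpu value" pairs
util=$(printf '%s\n' "$sample" \
  | awk -F'[{}",= ]+' '/^DCGM_FI_DEV_GPU_UTIL/ {for (i=1;i<=NF;i++) if ($i=="gpu") g=$(i+1); print g, $NF}')
echo "$util"
```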

Integration guide

  • Set up the nvidia-device-plugin service by running:

helm repo add nvdp https://nvidia.github.io/k8s-device-plugin && \
helm repo update && \
helm install --generate-name nvdp/nvidia-device-plugin
  • Check that the installation finished correctly and that an nvidia-device-plugin pod is running

% kubectl get pods -A
NAMESPACE     NAME                                    READY   STATUS    RESTARTS   AGE
kube-system   aws-node-4j682                          1/1     Running   0          75s
kube-system   coredns-f47955f89-gs6zk                 1/1     Running   0          8m5s
kube-system   coredns-f47955f89-xm6rd                 1/1     Running   0          8m5s
kube-system   kube-proxy-csdwp                        1/1     Running   0          2m19s
kube-system   nvidia-device-plugin-1633035998-2j2qp   1/1     Running   0          38s
  • Install the monitoring solution consisting of kube-state-metrics and Prometheus. We use the predefined Helm chart to deploy the whole set of services. Changes to kube-prometheus-stack.values should be applied as in the source guide

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts && \
helm repo update && \
helm install prometheus-community/kube-prometheus-stack \
--namespace kube-system --generate-name --values ./kube-prometheus-stack.values
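For reference, the change the source guide makes to kube-prometheus-stack.values is an extra scrape job under prometheus.prometheusSpec, so Prometheus pulls from dcgm-exporter. The fragment below is approximate, from my reading of the guide; treat the job name, interval, and namespace as the guide's values (and adjust the namespace to wherever dcgm-exporter actually runs, kube-system in this document), not as something to copy blindly:

```yaml
# Approximate fragment under prometheus.prometheusSpec
# (see the source guide for the exact version)
additionalScrapeConfigs:
- job_name: gpu-metrics
  scrape_interval: 1s
  metrics_path: /metrics
  scheme: http
  kubernetes_sd_configs:
  - role: endpoints
    namespaces:
      names:
      - gpu-operator-resources
  relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node
```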
  • Verify the installation

% kubectl get pods -A
NAMESPACE     NAME                                                              READY   STATUS    RESTARTS   AGE
kube-system   alertmanager-kube-prometheus-stack-1633-alertmanager-0            2/2     Running   0          49s
kube-system   aws-node-4j682                                                    1/1     Running   0          6m
kube-system   coredns-f47955f89-gs6zk                                           1/1     Running   0          12m
kube-system   coredns-f47955f89-xm6rd                                           1/1     Running   0          12m
kube-system   kube-prometheus-stack-1633-operator-8576fc8f45-64vpb              1/1     Running   0          52s
kube-system   kube-prometheus-stack-1633036072-grafana-778bcb548b-256nw         2/2     Running   0          52s
kube-system   kube-prometheus-stack-1633036072-kube-state-metrics-68c6b6dxj5s   1/1     Running   0          52s
kube-system   kube-prometheus-stack-1633036072-prometheus-node-exporter-w2k67   1/1     Running   0          53s
kube-system   kube-proxy-csdwp                                                  1/1     Running   0          7m4s
kube-system   nvidia-device-plugin-1633035998-2j2qp                             1/1     Running   0          5m23s
kube-system   prometheus-kube-prometheus-stack-1633-prometheus-0                2/2     Running   0          48s
  • Install the DCGM-Exporter service. Note that some metrics are disabled in the default installation; if you need a custom set of metrics, you have to rebuild the service's Docker image with your configuration (see: https://github.com/NVIDIA/dcgm-exporter#changing-metrics). I have used a pre-built Docker image from the community; you can find the reference in Appendix 1.

helm install --namespace kube-system --generate-name \
--values ./dcgm_vals.yaml gpu-helm-charts/dcgm-exporter
  • Verify the installation

% kubectl get pods -A
NAMESPACE     NAME                                                              READY   STATUS    RESTARTS   AGE
kube-system   alertmanager-kube-prometheus-stack-1633-alertmanager-0            2/2     Running   0          2m47s
kube-system   aws-node-4j682                                                    1/1     Running   0          7m58s
kube-system   coredns-f47955f89-gs6zk                                           1/1     Running   0          14m
kube-system   coredns-f47955f89-xm6rd                                           1/1     Running   0          14m
kube-system   dcgm-exporter-1633036367-nct2v                                    1/1     Running   0          67s
kube-system   kube-prometheus-stack-1633-operator-8576fc8f45-64vpb              1/1     Running   0          2m50s
kube-system   kube-prometheus-stack-1633036072-grafana-778bcb548b-256nw         2/2     Running   0          2m50s
kube-system   kube-prometheus-stack-1633036072-kube-state-metrics-68c6b6dxj5s   1/1     Running   0          2m50s
kube-system   kube-prometheus-stack-1633036072-prometheus-node-exporter-w2k67   1/1     Running   0          2m51s
kube-system   kube-proxy-csdwp                                                  1/1     Running   0          9m2s
kube-system   nvidia-device-plugin-1633035998-2j2qp                             1/1     Running   0          7m21s
kube-system   prometheus-kube-prometheus-stack-1633-prometheus-0                2/2     Running   0          2m46s
  • Deploy the video-analytics-demo service to generate some GPU load

helm fetch https://helm.ngc.nvidia.com/nvidia/charts/video-analytics-demo-0.1.4.tgz && \
helm install video-analytics-demo-0.1.4.tgz --generate-name
  • Wait for 3-5 minutes, then delete the demo service

helm delete $(helm list | grep video-analytics-demo | awk '{print $1}')
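The delete command works by filtering the `helm list` output. A sketch of the same pipeline run against a hypothetical, abridged captured listing, to show what gets extracted:

```shell
# Hypothetical, abridged `helm list` output
list='NAME                               NAMESPACE  REVISION  STATUS    CHART
video-analytics-demo-0-1633036674  default    1         deployed  video-analytics-demo-0.1.4
nvidia-device-plugin-1633035998    default    1         deployed  nvidia-device-plugin-0.9.0'

# grep keeps the demo row, awk prints its first column (the release name)
name=$(printf '%s\n' "$list" | grep video-analytics-demo | awk '{print $1}')
echo "$name"
```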
  • After that, you can open Prometheus and check the DCGM_FI_DEV_GPU_UTIL metric
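You can also check the metric without the UI: port-forward the Prometheus service (e.g. `kubectl -n kube-system port-forward svc/<prometheus-service> 9090`) and query its HTTP API with `curl 'http://localhost:9090/api/v1/query?query=DCGM_FI_DEV_GPU_UTIL'`. A sketch of pulling the sample value out of the JSON response, using a hypothetical, abridged response (sed-based so it needs no jq; assumes a single result entry):

```shell
# Hypothetical, abridged /api/v1/query response for DCGM_FI_DEV_GPU_UTIL
resp='{"status":"success","data":{"result":[{"metric":{"gpu":"0"},"value":[1633036800,"87"]}]}}'

# Extract the sample value (the second element of "value")
value=$(printf '%s' "$resp" | sed -n 's/.*"value":\[[0-9.]*,"\([0-9.]*\)"\].*/\1/p')
echo "$value"
```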

 

Appendix 1: dcgm_vals.yaml

# Copyright (c) 2020, NVIDIA CORPORATION.  All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

image:
  repository: shan100docker/dcgm-exporter
  pullPolicy: IfNotPresent
  tag: 2.1.8-2.4.0-rc.2-ubuntu18.04-v2


# Comment the following line to stop profiling metrics from DCGM
arguments: ["-f", "/etc/dcgm-exporter/dcp-metrics-included.csv"]
# NOTE: in general, add any command line arguments to arguments above
# and they will be passed through.
# Use "-r", "<HOST>:<PORT>" to connect to an already running hostengine
# Example arguments: ["-r", "host123:5555"]
# Use "-n" to remove the hostname tag from the output.
# Example arguments: ["-n"]
# Use "-d" to specify the devices to monitor. -d must be followed by a string
# in the following format: [f] or [g[:numeric_range][+]][i[:numeric_range]]
# Where a numeric range is something like 0-4 or 0,2,4, etc.
# Example arguments: ["-d", "g+i"] to monitor all GPUs and GPU instances or
# ["-d", "g:0-3"] to monitor GPUs 0-3.
# Use "-m" to specify the namespace and name of a configmap containing
# the watched exporter fields.
# Example arguments: ["-m", "default:exporter-metrics-config-map"]

imagePullSecrets: []
nameOverride: ""
fullnameOverride: ""

serviceAccount:
  # Specifies whether a service account should be created
  create: true
  # Annotations to add to the service account
  annotations: {}
  # The name of the service account to use.
  # If not set and create is true, a name is generated using the fullname template
  name:

podAnnotations: {}
podSecurityContext: {}
  # fsGroup: 2000

securityContext:
  runAsNonRoot: false
  runAsUser: 0
  capabilities:
     add: ["SYS_ADMIN"]
  # readOnlyRootFilesystem: true

service:
  enable: true
  type: ClusterIP
  port: 9400
  address: ":9400"
  # Annotations to add to the service
  annotations: {}

resources: {}
  # limits:
  #   cpu: 100m
  #   memory: 128Mi
  # requests:
  #   cpu: 100m
  #   memory: 128Mi
serviceMonitor:
  enabled: true
  interval: 15s
  additionalLabels: {}
    #monitoring: prometheus

mapPodsMetrics: false

nodeSelector: {}
  #node: gpu

tolerations: []
#- operator: Exists

affinity: {}
  #nodeAffinity:
  #  requiredDuringSchedulingIgnoredDuringExecution:
  #    nodeSelectorTerms:
  #    - matchExpressions:
  #      - key: nvidia-gpu
  #        operator: Exists

extraHostVolumes: []
#- name: host-binaries
#  hostPath: /opt/bin

extraConfigMapVolumes: []
#- name: exporter-metrics-volume
#  configMap:
#    name: exporter-metrics-config-map

extraVolumeMounts: []
#- name: host-binaries
#  mountPath: /opt/bin
#  readOnly: true

extraEnv: []
#- name: EXTRA_VAR
#  value: "TheStringValue"
