NVIDIA GPU Operator


The NVIDIA GPU Operator plays a crucial role in enabling organizations to harness NVIDIA GPUs for AI and machine learning workloads in Kubernetes, leading to faster innovation, improved model performance, and greater efficiency in AI deployments.

QBO Kubernetes Engine (QKE) runs Kubernetes components as Docker containers (Kubernetes-in-Docker) rather than inside traditional virtual machines, giving workloads direct access to hardware resources such as GPUs. This approach delivers the agility of the cloud while avoiding the overhead of virtualization.
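Once a cluster is up, you can see this for yourself: the nodes appear as ordinary Docker containers on the host, and the host filesystem type confirms cgroups v2. This is a quick sanity check; output will vary by setup.

# Cluster nodes show up as plain containers on the host
docker ps --format '{{.Names}}\t{{.Image}}'

# cgroup2fs indicates cgroups v2
stat -fc %T /sys/fs/cgroup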

QKE + NVIDIA GPU Operator + Kubernetes-in-Docker + Cgroups v2 - Part 1

Prerequisites

| Dependency | Validated or Included Version(s) | Notes |
| --- | --- | --- |
| Kubernetes | v1.25.11 | |
| NVIDIA Container Toolkit | v1.14.3 | |
| NVIDIA GPU Operator | v23.9.1 | |
| NVIDIA Driver | 535.129.03 (Linux), 546.01 (Windows) | |
| NVIDIA CUDA | 12.2 | |
| OS | Linux, Windows 10, 11 (WSL2) | |
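To verify the driver and CUDA versions installed on the host (on Windows, run this from inside the WSL2 distribution):

# Reports the installed driver version and the highest CUDA version it supports
nvidia-smi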

QBOT

Install

qbot automates the steps in this tutorial. It is available at https://github.com/alexeadem/qbot.

Run

./qbot gpu-operator

Deploy

Kubernetes Cluster

For this tutorial, we use nvidia as the cluster name:

export NAME=nvidia

Get the qbo version to make sure we have access to the qbo API:

qbo version | jq .version[]?

Add a K8s cluster with node image v1.25.11 (see Kubeflow compatibility):

qbo add cluster $NAME -i hub.docker.com/kindest/node:v1.25.11 | jq

Get node information using the qbo API:

qbo get nodes $NAME | jq .nodes[]?

Configure kubectl

export KUBECONFIG=$HOME/.qbo/$NAME.cfg

Get nodes with kubectl

kubectl get nodes

NVIDIA GPU Operator

Helm Chart

Install the GPU Operator chart with driver.enabled=false: the NVIDIA driver is already present on the host (on Windows it is provided through WSL2), so the operator should not attempt to install it inside the nodes.

helm repo add nvidia https://helm.ngc.nvidia.com/nvidia || true
helm repo update
helm install --wait --generate-name -n gpu-operator --create-namespace nvidia/gpu-operator --set driver.enabled=false
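Once the install completes, the operator and its operands should be running in the gpu-operator namespace (a quick check; release and pod names will vary):

helm list -n gpu-operator
kubectl get pods -n gpu-operator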

Configure

Windows

WSL2

PCI Labels

WSL2 does not expose the GPU as a standard PCI device, so node-feature-discovery cannot detect it automatically. Label the worker nodes manually with NVIDIA's PCI vendor ID (10de):

for i in $(kubectl get no --selector '!node-role.kubernetes.io/control-plane' -o json | jq -r '.items[].metadata.name'); do
    kubectl label node $i feature.node.kubernetes.io/pci-10de.present=true
done
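To confirm the label was applied (the -L flag prints the label value for each node):

kubectl get nodes -L feature.node.kubernetes.io/pci-10de.present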
Chart Templates

Apply the pre-rendered chart templates from the qbot repository:

git clone https://github.com/alexeadem/qbot
cd qbot/gpu-operator
OUT=templates
kubectl apply -f $OUT/gpu-operator/crds.yaml
kubectl apply -f $OUT/gpu-operator/templates/
kubectl apply -f $OUT/gpu-operator/charts/node-feature-discovery/templates/
watch kubectl get pods
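When all pods are ready, the worker nodes should advertise GPU capacity (a quick check with jq, in the same style as the qbo examples above):

kubectl get nodes -o json | jq '.items[] | {name: .metadata.name, gpu: .status.allocatable["nvidia.com/gpu"]}'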

Vector Addition

Deploy

cat cuda/vectoradd.yaml

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vectoradd
spec:
  restartPolicy: OnFailure
  containers:
  - name: cuda-vectoradd
    image: "nvcr.io/nvidia/k8s/cuda-sample:vectoradd-cuda11.7.1-ubuntu20.04"
    resources:
      limits:
        nvidia.com/gpu: 1

kubectl apply -f cuda/vectoradd.yaml
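Optionally, wait for the pod to run to completion before reading its logs (jsonpath-based waits require a reasonably recent kubectl):

kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/cuda-vectoradd --timeout=120s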
Test

kubectl logs cuda-vectoradd

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done
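Clean up the test pod when finished:

kubectl delete -f cuda/vectoradd.yaml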