Method zero: a simple static DaemonSet, intended only to demonstrate the basic functionality of the device plugin. (Not verified successfully)
```bash
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
```
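If the manifest applies cleanly, the plugin runs as a DaemonSet in kube-system. A quick check, assuming the DaemonSet keeps the upstream manifest's name and labels (not confirmed here, since this method was not verified):

```bash
# Assumption: the static manifest installs into kube-system under the
# upstream name nvidia-device-plugin-daemonset with label name=nvidia-device-plugin-ds.
kubectl -n kube-system rollout status ds/nvidia-device-plugin-daemonset
kubectl -n kube-system logs -l name=nvidia-device-plugin-ds --tail=20
```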
Method one: Helm

```bash
exxk@exxk:~$ sudo helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
exxk@exxk:~$ sudo helm repo update
exxk@exxk:~$ sudo helm search repo nvdp --devel
NAME                          CHART VERSION  APP VERSION  DESCRIPTION
nvdp/gpu-feature-discovery    0.17.0         0.17.0       A Helm chart for gpu-feature-discovery on Kuber...
nvdp/nvidia-device-plugin     0.17.0         0.17.0       A Helm chart for the nvidia-device-plugin on Ku...
exxk@exxk:~$ cat nvidia-runtimeclass.yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
exxk@exxk:~$ sudo kubectl apply -f nvidia-runtimeclass.yaml
exxk@exxk:~$ sudo kubectl get runtimeclass
NAME      HANDLER   AGE
crun      crun      11d
lunatic   lunatic   11d
nvidia    nvidia    11m
exxk@exxk:~$ sudo helm upgrade -i nvidia-device-plugin nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --version 0.17.0 \
    --set runtimeClassName=nvidia \
    --kubeconfig /etc/rancher/k3s/k3s.yaml
exxk@exxk:~$ sudo helm upgrade -i nvidia-device-discovery nvdp/gpu-feature-discovery \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --version 0.17.0 \
    --set runtimeClassName=nvidia \
    --kubeconfig /etc/rancher/k3s/k3s.yaml
exxk@exxk:~$ sudo kubectl describe node | grep nvidia.com
nvidia.com/cuda.driver-version.full=560.35.03
nvidia.com/cuda.driver-version.major=560
nvidia.com/cuda.driver-version.minor=35
nvidia.com/cuda.driver-version.revision=03
nvidia.com/cuda.driver.major=560
nvidia.com/cuda.driver.minor=35
nvidia.com/cuda.driver.rev=03
nvidia.com/cuda.runtime-version.full=12.6
nvidia.com/cuda.runtime-version.major=12
nvidia.com/cuda.runtime-version.minor=6
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=6
nvidia.com/gfd.timestamp=1734424517
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.compute.minor=6
nvidia.com/gpu.count=1
nvidia.com/gpu.family=ampere
nvidia.com/gpu.machine=Standard-PC-i440FX-PIIX-1996
nvidia.com/gpu.memory=4096
nvidia.com/gpu.mode=graphics
nvidia.com/gpu.product=NVIDIA-GeForce-RTX-3050-Laptop-GPU
nvidia.com/gpu.replicas=1
nvidia.com/gpu.sharing-strategy=none
nvidia.com/mig.capable=false
nvidia.com/mps.capable=false
nvidia.com/vgpu.present=false
nvidia.com/gpu:  1
nvidia.com/gpu:  1
nvidia.com/gpu   0     0
```
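The labels above come from gpu-feature-discovery, and the trailing `nvidia.com/gpu: 1` lines are the node's GPU capacity and allocatable count. Once the plugin has registered the resource, a workload can claim the GPU through resource limits instead of (or in addition to) a nodeSelector. A minimal sketch (the pod name is illustrative; the image is the same CUDA base image used later in this post):

```yaml
# Minimal sketch: a pod that claims one GPU via resources.limits.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-limit-test        # illustrative name
spec:
  runtimeClassName: nvidia    # route the pod to the nvidia containerd runtime
  containers:
    - name: cuda
      image: docker.io/nvidia/cuda:11.8.0-base-ubuntu20.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # request one GPU from the device plugin
  restartPolicy: Never
```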
Method two: HelmChart (not verified successfully)

Installation happens automatically through a HelmChart CR; with this approach there is no need to run helm commands by hand.
Create the HelmChart configuration file nvidia-device-plugin.yaml:
```yaml
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: nvdp
  namespace: kube-system
spec:
  chart: https://github.com/NVIDIA/k8s-device-plugin/releases/download/v0.17.0/nvidia-device-plugin-0.17.0.tgz
  version: 0.17.0
  targetNamespace: nvidia-device-plugin
  set:
    runtimeClassName: nvidia   # spec.set is a key/value map, not a name/value list
```
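For values that do not fit the flat spec.set map, the k3s helm-controller also accepts an inline values file via valuesContent. A sketch of an equivalent chart definition, assuming the same chart URL:

```yaml
# Sketch: same install expressed with valuesContent instead of set
# (assumption: the helm-controller shipped with k3s; valuesContent is a
# YAML values file embedded as a string).
apiVersion: helm.cattle.io/v1
kind: HelmChart
metadata:
  name: nvdp
  namespace: kube-system
spec:
  chart: https://github.com/NVIDIA/k8s-device-plugin/releases/download/v0.17.0/nvidia-device-plugin-0.17.0.tgz
  targetNamespace: nvidia-device-plugin
  valuesContent: |-
    runtimeClassName: nvidia
```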
Run the following commands to install (and, if needed, uninstall):
```bash
exxk@exxk:~$ kubectl create namespace nvidia-device-plugin
exxk@exxk:~$ sudo kubectl apply -f nvidia-device-plugin.yaml
# to uninstall:
exxk@exxk:~$ sudo kubectl delete -f nvidia-device-plugin.yaml
```
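The k3s helm-controller reconciles the HelmChart CR by launching an install job, and inspecting that job is the quickest way to see why a chart did or didn't land (useful here, since this method was not verified). A sketch, assuming the controller's usual helm-install-&lt;name&gt; job naming:

```bash
# Assumption: the k3s helm-controller names its job helm-install-<chart name>.
kubectl -n kube-system get jobs | grep helm-install
kubectl -n kube-system logs job/helm-install-nvdp
kubectl -n nvidia-device-plugin get pods
```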
Usage and verification
Create a Deployment (this can be done in the Kuboard UI). The key settings:

- Image (in the UI): docker.io/nvidia/cuda:11.8.0-base-ubuntu20.04
- Keep-alive command (in the UI): tail -f /dev/null
- Run on the GPU (add to the YAML): runtimeClassName: nvidia
Exec into the container and run nvidia-smi; you should see the GPU details. If the command is not found, something in the earlier configuration is wrong.
```bash
root@cudatest-549f4f49f8-ltblq:/# nvidia-smi
Tue Dec 17 10:03:34 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3050 ...    Off |   00000000:00:10.0 Off |                  N/A |
| N/A   48C    P8              4W /  60W  |       2MiB /  4096MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+
```
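The same check works without the Kuboard terminal. Assuming the Deployment is named cudatest in the default namespace, as in the complete YAML below, kubectl can run nvidia-smi directly:

```bash
# Assumption: Deployment "cudatest" in the default namespace,
# matching the complete YAML example below.
kubectl exec -it deploy/cudatest -- nvidia-smi
```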
The complete YAML for the example is as follows:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  annotations: {}
  labels:
    k8s.kuboard.cn/name: cudatest
  name: cudatest
  namespace: default
  resourceVersion: '3843602'
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s.kuboard.cn/name: cudatest
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s.kuboard.cn/name: cudatest
    spec:
      containers:
        - command:
            - tail
            - '-f'
            - /dev/null
          image: 'docker.io/nvidia/cuda:11.8.0-base-ubuntu20.04'
          imagePullPolicy: IfNotPresent
          name: cuda
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeSelector:
        nvidia.com/gpu.count: '1'
      restartPolicy: Always
      runtimeClassName: nvidia
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
```
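If you save the manifest instead of pasting it into Kuboard, applying and checking it from the shell looks like this (cudatest.yaml is a hypothetical filename for the Deployment above):

```bash
# Hypothetical filename for the Deployment manifest above.
kubectl apply -f cudatest.yaml
kubectl get pods -o wide -l k8s.kuboard.cn/name=cudatest   # confirm Running and node placement
```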