Installing k8s-device-plugin on k8s and Using the GPU

Installing k8s-device-plugin

Method 0: a simple static DaemonSet, intended to demonstrate the basic functionality of the nvidia-device-plugin. (Not verified successfully)

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml
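If the static DaemonSet starts correctly, its pod should appear in kube-system. A quick check (a sketch; the name=nvidia-device-plugin-ds label is taken from the upstream static manifest and may differ between releases):

kubectl -n kube-system get pods -l name=nvidia-device-plugin-ds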

Method 1: via helm commands (verified working)

# Add the plugin repository
exxk@exxk:~$ sudo helm repo add nvdp https://nvidia.github.io/k8s-device-plugin
exxk@exxk:~$ sudo helm repo update
# Search the repository for available versions
exxk@exxk:~$ sudo helm search repo nvdp --devel
NAME                        CHART VERSION  APP VERSION  DESCRIPTION
nvdp/gpu-feature-discovery  0.17.0         0.17.0       A Helm chart for gpu-feature-discovery on Kuber...
nvdp/nvidia-device-plugin   0.17.0         0.17.0       A Helm chart for the nvidia-device-plugin on Ku...
# Create the nvidia RuntimeClass, since the plugin installed below depends on it
exxk@exxk:~$ cat nvidia-runtimeclass.yaml
apiVersion: node.k8s.io/v1
kind: RuntimeClass
metadata:
  name: nvidia
handler: nvidia
exxk@exxk:~$ sudo kubectl apply -f nvidia-runtimeclass.yaml
# Verify it was created; if it already exists, there is no need to create it again
exxk@exxk:~$ sudo kubectl get runtimeclass
NAME     HANDLER  AGE
crun     crun     11d
lunatic  lunatic  11d
nvidia   nvidia   11m
# Install nvidia-device-plugin so that GPUs can be used inside k3s
exxk@exxk:~$ sudo helm upgrade -i nvidia-device-plugin nvdp/nvidia-device-plugin \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.17.0 \
  --set runtimeClassName=nvidia \
  --kubeconfig /etc/rancher/k3s/k3s.yaml
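# Optional sanity check: confirm the device-plugin pod is Running before continuing
exxk@exxk:~$ sudo kubectl get pods -n nvidia-device-plugin --kubeconfig /etc/rancher/k3s/k3s.yaml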
# Install gpu-feature-discovery to auto-discover GPUs and label the nodes
exxk@exxk:~$ sudo helm upgrade -i nvidia-device-discovery nvdp/gpu-feature-discovery \
  --namespace nvidia-device-plugin \
  --create-namespace \
  --version 0.17.0 \
  --set runtimeClassName=nvidia \
  --kubeconfig /etc/rancher/k3s/k3s.yaml
# Verify: run the command below, or simply look at the node labels; if the labels below are present, the nvidia-device-plugin and gpu-feature-discovery plugins were installed successfully
# For a detailed description of each label, see: https://github.com/NVIDIA/k8s-device-plugin?tab=readme-ov-file#catalog-of-labels
exxk@exxk:~$ sudo kubectl describe node | grep nvidia.com
nvidia.com/cuda.driver-version.full=560.35.03
nvidia.com/cuda.driver-version.major=560
nvidia.com/cuda.driver-version.minor=35
nvidia.com/cuda.driver-version.revision=03
nvidia.com/cuda.driver.major=560
nvidia.com/cuda.driver.minor=35
nvidia.com/cuda.driver.rev=03
nvidia.com/cuda.runtime-version.full=12.6
nvidia.com/cuda.runtime-version.major=12
nvidia.com/cuda.runtime-version.minor=6
nvidia.com/cuda.runtime.major=12
nvidia.com/cuda.runtime.minor=6
nvidia.com/gfd.timestamp=1734424517
nvidia.com/gpu.compute.major=8
nvidia.com/gpu.compute.minor=6
nvidia.com/gpu.count=1 # number of GPUs on the node
nvidia.com/gpu.family=ampere
nvidia.com/gpu.machine=Standard-PC-i440FX-PIIX-1996
nvidia.com/gpu.memory=4096
nvidia.com/gpu.mode=graphics
nvidia.com/gpu.product=NVIDIA-GeForce-RTX-3050-Laptop-GPU
nvidia.com/gpu.replicas=1
nvidia.com/gpu.sharing-strategy=none # sharing strategy: none (no sharing), mps, or time-slicing
nvidia.com/mig.capable=false # whether the GPU supports MIG
nvidia.com/mps.capable=false # whether MPS is configured
nvidia.com/vgpu.present=false # whether a vGPU is in use
nvidia.com/gpu: 1   # node Capacity
nvidia.com/gpu: 1   # node Allocatable
nvidia.com/gpu 0 0  # Allocated resources (requests / limits)
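Once the labels and the nvidia.com/gpu resource show up, a pod can claim the GPU explicitly through a resource limit. A minimal sketch (the pod name gpu-test is invented here; the image is the same CUDA base image used for verification below):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  restartPolicy: OnFailure
  runtimeClassName: nvidia          # the RuntimeClass created earlier
  containers:
    - name: cuda
      image: docker.io/nvidia/cuda:11.8.0-base-ubuntu20.04
      command: ["nvidia-smi"]       # print GPU info once, then exit
      resources:
        limits:
          nvidia.com/gpu: 1         # request one GPU from the device plugin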

Method 2: via a HelmChart CR (not verified successfully)

Installation happens automatically through a HelmChart CR, handled by the k3s Helm controller, so there is no need to run helm commands manually.

  1. Create the HelmChart configuration file

    nvidia-device-plugin.yaml

    apiVersion: helm.cattle.io/v1
    kind: HelmChart
    metadata:
      name: nvdp
      namespace: kube-system
    spec:
      chart: https://github.com/NVIDIA/k8s-device-plugin/releases/download/v0.17.0/nvidia-device-plugin-0.17.0.tgz # copy the download URL of the matching package from https://github.com/NVIDIA/k8s-device-plugin/releases
      version: 0.17.0 # chart version; must match the downloaded package
      targetNamespace: nvidia-device-plugin # namespace the plugin will be deployed into
      set: # key/value pairs to customize install parameters (set is a map in the HelmChart CRD)
        runtimeClassName: nvidia
  2. Run the following commands to install

    # Create the namespace
    exxk@exxk:~$ kubectl create namespace nvidia-device-plugin
    # Install
    exxk@exxk:~$ sudo kubectl apply -f nvidia-device-plugin.yaml
    # To uninstall, run
    exxk@exxk:~$ sudo kubectl delete -f nvidia-device-plugin.yaml
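    The k3s Helm controller performs the installation through a Job named helm-install-<chart>. If the chart does not come up, its status and logs can be inspected like this (a sketch based on that naming convention):

    exxk@exxk:~$ sudo kubectl -n kube-system get helmchart nvdp
    exxk@exxk:~$ sudo kubectl -n kube-system logs job/helm-install-nvdp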

Usage and Verification

Create a Deployment (this can be done in the Kuboard UI). The key settings are:

Image (set in the UI): docker.io/nvidia/cuda:11.8.0-base-ubuntu20.04
Keep-alive command (set in the UI): tail -f /dev/null

Run on the GPU (add to the YAML): runtimeClassName: nvidia

Exec into the container and run nvidia-smi; you should see the GPU details. If the command is not found, something in the configuration above is wrong.
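If you are not using the Kuboard terminal, the same check can be run with kubectl (deploy/cudatest matches the Deployment name in the full YAML below):

sudo kubectl exec -it deploy/cudatest -- nvidia-smi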

Welcome to HCYTech - Kubernetes multi-cluster management tool.
Operations performed in this terminal are recorded in the audit log.

root@cudatest-549f4f49f8-ltblq:/# nvidia-smi
Tue Dec 17 10:03:34 2024
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 560.35.03              Driver Version: 560.35.03      CUDA Version: 12.6     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA GeForce RTX 3050 ...    Off |   00000000:00:10.0 Off |                  N/A |
| N/A   48C    P8              4W /   60W |      2MiB /   4096MiB  |      0%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

The complete YAML example is as follows:

apiVersion: apps/v1
kind: Deployment
metadata:
  annotations: {}
  labels:
    k8s.kuboard.cn/name: cudatest
  name: cudatest
  namespace: default
  resourceVersion: '3843602'
spec:
  progressDeadlineSeconds: 600
  replicas: 1
  revisionHistoryLimit: 10
  selector:
    matchLabels:
      k8s.kuboard.cn/name: cudatest
  strategy:
    rollingUpdate:
      maxSurge: 25%
      maxUnavailable: 25%
    type: RollingUpdate
  template:
    metadata:
      creationTimestamp: null
      labels:
        k8s.kuboard.cn/name: cudatest
    spec:
      containers:
        - command:
            - tail
            - '-f'
            - /dev/null
          image: 'docker.io/nvidia/cuda:11.8.0-base-ubuntu20.04'
          imagePullPolicy: IfNotPresent
          name: cuda
          resources: {}
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      nodeSelector:
        nvidia.com/gpu.count: '1'
      restartPolicy: Always
      runtimeClassName: nvidia # note: this line cannot be added in the Kuboard UI; edit the YAML and add it here
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
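To apply this manifest from the command line instead of the Kuboard UI (the file name cudatest.yaml is an assumption):

sudo kubectl apply -f cudatest.yaml
sudo kubectl get pods -l k8s.kuboard.cn/name=cudatest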