Hardware isolation (hard isolation): requires the GPU to support virtualization (e.g. vGPU) or MIG (Multi-Instance GPU).
Software isolation and sharing (soft isolation): usable when the hardware does not support the above, e.g. an NVIDIA GeForce RTX 3050.
Characteristics: allows multiple Pods to share a single GPU, with isolation enforced by limiting GPU memory usage.
Time slicing is a scheduling mechanism implemented by the GPU driver that lets multiple processes take turns using the GPU in time slices.
Characteristics: serial execution; simple to use; strong isolation; low resource utilization; increased latency.
Suitable scenarios: edge computing and lightweight inference workloads.
Installation (prerequisite: the k8s-device-plugin is already installed).
Add the following configuration: in the Kuboard management UI, create a ConfigMap in the nvidia-device-plugin namespace (it can also be created with kubectl), named gpu-share-configs, with the key time-slicing.yaml and the following value:
```yaml
version: v1
sharing:
  timeSlicing:
    renameByDefault: true
    failRequestsGreaterThanOne: true
    resources:
      - name: nvidia.com/gpu
        replicas: 10
```
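If Kuboard is not available, the same ConfigMap can be created by applying a manifest with kubectl. This is a sketch using the namespace, ConfigMap name, and key from the steps above:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-share-configs
  namespace: nvidia-device-plugin
data:
  time-slicing.yaml: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: true
        failRequestsGreaterThanOne: true
        resources:
          - name: nvidia.com/gpu
            replicas: 10
```

Apply it with `kubectl apply -f <file>.yaml` (or `kubectl create configmap gpu-share-configs --from-file=time-slicing.yaml -n nvidia-device-plugin` if the config is kept in a local file of that name).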
Apply the configuration by adding --set config.default=time-slicing.yaml and --set config.name=gpu-share-configs to the original plugin installation commands. The full commands are:
```shell
exxk@exxk:~$ sudo helm upgrade -i nvidia-device-plugin nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --version 0.17.0 \
    --set runtimeClassName=nvidia \
    --set config.default=time-slicing.yaml \
    --set config.name=gpu-share-configs \
    --kubeconfig /etc/rancher/k3s/k3s.yaml
exxk@exxk:~$ sudo helm upgrade -i nvidia-device-discovery nvdp/gpu-feature-discovery \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --version 0.17.0 \
    --set runtimeClassName=nvidia \
    --set config.default=time-slicing.yaml \
    --set config.name=gpu-share-configs \
    --kubeconfig /etc/rancher/k3s/k3s.yaml
exxk@exxk:~$ sudo kubectl describe node | grep nvidia.com
                    nvidia.com/cuda.driver-version.full=560.35.03
                    nvidia.com/cuda.driver-version.major=560
                    nvidia.com/cuda.driver-version.minor=35
                    nvidia.com/cuda.driver-version.revision=03
                    nvidia.com/cuda.driver.major=560
                    nvidia.com/cuda.driver.minor=35
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime-version.full=12.6
                    nvidia.com/cuda.runtime-version.major=12
                    nvidia.com/cuda.runtime-version.minor=6
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=6
                    nvidia.com/gfd.timestamp=1734514101
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=6
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=Standard-PC-i440FX-PIIX-1996
                    nvidia.com/gpu.memory=4096
                    nvidia.com/gpu.mode=graphics
                    nvidia.com/gpu.product=NVIDIA-GeForce-RTX-3050-Laptop-GPU
                    nvidia.com/gpu.replicas=10
                    nvidia.com/gpu.sharing-strategy=time-slicing
                    nvidia.com/mig.capable=false
                    nvidia.com/mps.capable=false
                    nvidia.com/vgpu.present=false
  nvidia.com/gpu:         0
  nvidia.com/gpu.shared:  10
  nvidia.com/gpu:         0
  nvidia.com/gpu.shared:  10
  nvidia.com/gpu         0           0
  nvidia.com/gpu.shared  0           0
```
Usage: when deploying a container, to request shared GPU access change the resource name nvidia.com/gpu to nvidia.com/gpu.shared:
```yaml
spec:
  containers:
    - command:
        - tail
        - '-f'
        - /dev/null
      image: 'docker.io/nvidia/cuda:11.8.0-base-ubuntu20.04'
      imagePullPolicy: IfNotPresent
      name: cuda
      resources:
        limits:
          nvidia.com/gpu.shared: '1'
```
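For reference, the snippet above embedded in a complete Pod manifest might look like the following sketch. The Pod name is hypothetical, and the runtimeClassName line is an assumption based on the runtime class set during the Helm install above:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: cuda-shared-test        # hypothetical name
spec:
  runtimeClassName: nvidia      # assumed: matches --set runtimeClassName=nvidia above
  containers:
    - name: cuda
      image: 'docker.io/nvidia/cuda:11.8.0-base-ubuntu20.04'
      imagePullPolicy: IfNotPresent
      command: ['tail', '-f', '/dev/null']
      resources:
        limits:
          nvidia.com/gpu.shared: '1'  # one time slice of the shared GPU
```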
In testing, with nvidia.com/gpu.shared configured, no more than 10 such Pods (the configured replica count) can be created; without the limit, more than 10 can be started.
Known issue: running workloads may report that the GPU is busy.

MPS (Multi-Process Service) is NVIDIA's GPU multi-process sharing service, which allows multiple CUDA processes to execute on the GPU concurrently.
Characteristics: parallel execution; high resource utilization; improved performance; controllable allocation; added complexity; weaker isolation; limited compatibility (verify that your CUDA kernels support MPS before use).
Suitable scenarios: multi-task training and large-scale, high-throughput inference.
Installation (prerequisite: the k8s-device-plugin is already installed).
Add the following configuration: in the Kuboard management UI, create a ConfigMap in the nvidia-device-plugin namespace (it can also be created with kubectl), named gpu-share-configs, with the key mps.yaml and the following value:
```yaml
version: v1
sharing:
  mps:
    renameByDefault: true
    resources:
      - name: nvidia.com/gpu
        replicas: 10
```
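As with time slicing, this ConfigMap can be created without Kuboard by applying a manifest. A sketch, using the same names as above (a ConfigMap can hold multiple keys, so mps.yaml and time-slicing.yaml may coexist in one gpu-share-configs object):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: gpu-share-configs
  namespace: nvidia-device-plugin
data:
  mps.yaml: |
    version: v1
    sharing:
      mps:
        renameByDefault: true
        resources:
          - name: nvidia.com/gpu
            replicas: 10
```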
Apply the configuration by adding --set config.default=mps.yaml and --set config.name=gpu-share-configs to the original plugin installation commands. The full commands are:
```shell
exxk@exxk:~$ sudo helm upgrade -i nvidia-device-plugin nvdp/nvidia-device-plugin \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --version 0.17.0 \
    --set runtimeClassName=nvidia \
    --set config.default=mps.yaml \
    --set config.name=gpu-share-configs \
    --kubeconfig /etc/rancher/k3s/k3s.yaml
exxk@exxk:~$ sudo helm upgrade -i nvidia-device-discovery nvdp/gpu-feature-discovery \
    --namespace nvidia-device-plugin \
    --create-namespace \
    --version 0.17.0 \
    --set runtimeClassName=nvidia \
    --set config.default=mps.yaml \
    --set config.name=gpu-share-configs \
    --kubeconfig /etc/rancher/k3s/k3s.yaml
exxk@exxk:~$ sudo kubectl describe node | grep nvidia.com
                    nvidia.com/cuda.driver-version.full=560.35.03
                    nvidia.com/cuda.driver-version.major=560
                    nvidia.com/cuda.driver-version.minor=35
                    nvidia.com/cuda.driver-version.revision=03
                    nvidia.com/cuda.driver.major=560
                    nvidia.com/cuda.driver.minor=35
                    nvidia.com/cuda.driver.rev=03
                    nvidia.com/cuda.runtime-version.full=12.6
                    nvidia.com/cuda.runtime-version.major=12
                    nvidia.com/cuda.runtime-version.minor=6
                    nvidia.com/cuda.runtime.major=12
                    nvidia.com/cuda.runtime.minor=6
                    nvidia.com/gfd.timestamp=1734573325
                    nvidia.com/gpu.compute.major=8
                    nvidia.com/gpu.compute.minor=6
                    nvidia.com/gpu.count=1
                    nvidia.com/gpu.family=ampere
                    nvidia.com/gpu.machine=Standard-PC-i440FX-PIIX-1996
                    nvidia.com/gpu.memory=4096
                    nvidia.com/gpu.mode=graphics
                    nvidia.com/gpu.product=NVIDIA-GeForce-RTX-3050-Laptop-GPU
                    nvidia.com/gpu.replicas=10
                    nvidia.com/gpu.sharing-strategy=mps
                    nvidia.com/mig.capable=false
                    nvidia.com/mps.capable=true
                    nvidia.com/vgpu.present=false
  nvidia.com/gpu:         0
  nvidia.com/gpu.shared:  10
  nvidia.com/gpu:         0
  nvidia.com/gpu.shared:  10
  nvidia.com/gpu         0           0
  nvidia.com/gpu.shared  0           0
```
Usage: when deploying a container, to request shared GPU access change the resource name nvidia.com/gpu to nvidia.com/gpu.shared:
```yaml
spec:
  containers:
    - command:
        - tail
        - '-f'
        - /dev/null
      image: 'docker.io/nvidia/cuda:11.8.0-base-ubuntu20.04'
      imagePullPolicy: IfNotPresent
      name: cuda
      resources:
        limits:
          nvidia.com/gpu.shared: '1'
```
In testing, with nvidia.com/gpu.shared configured, no more than 10 such Pods (the configured replica count) can be created.
Method 4: IMEX GPU virtualization requires hardware-level GPU virtualization support; some GPUs currently lack it, so this method is not considered here.