One-Click Kubernetes Setup on Alibaba Cloud with kk
Multi-Node Installation with kk
Preparation
Prepare an Auto Scaling group to manage the ECS virtual machines.
Configure the ECS instance settings in the scaling group (for learning purposes, preemptible/spot instances can save cost).
E.g. ecs.hfc6.large (the AMD-based ecs.c7a.large also works): preemptible, 2 vCPU + 4 GiB, CentOS 7.9 64-bit. The instances can all mount the same shared data disk for storing configuration data.
Configure the login credential as a certificate (.cer).
Login and port requirements: open the ports below in the security group. Because all nodes share a single security group with the intra-group connectivity policy set to "intra-group interconnection", no per-service rules actually need to be configured.

| Service | Port | Protocol |
| --- | --- | --- |
| ssh | 22 | TCP |
| etcd | 2379-2380 | TCP |
| apiserver | 6443 | TCP |
| calico | 9099-9100 | TCP |
| bgp | 179 | TCP |
| nodeport | 30000-32767 | TCP |
| master | 10250-10258 | TCP |
| dns | 53 | TCP/UDP |
| local-registry (offline environments only) | 5000 | TCP |
| local-apt (offline environments only) | 5080 | TCP |
| rpcbind (only when using NFS) | 111 | TCP |
| ipip (Calico uses the IPIP protocol) | - | IPENCAP/IPIP |
| metrics-server | 8443 | TCP |
Install Kubernetes only

```bash
yum update -y
```
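Only the first line of this block survives above. For reference, a minimal sketch of the usual kk flow for this case, using the appendix file config-kubernetes-1.23.7.yaml; the download URL is KubeKey's documented one, and the kk version is an assumption:

```bash
yum update -y                              # see "Common Problems": avoids the calico/kubelet failures
export KKZONE=cn                           # download from the CN mirror
curl -sfL https://get-kk.kubesphere.io | VERSION=v2.2.1 sh -   # kk version assumed
./kk create cluster -f config-kubernetes-1.23.7.yaml
```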
Install Kubernetes + KubeSphere

```bash
yum update -y
```
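Likewise truncated; a sketch of the KubeSphere variant, assuming the appendix file config-kubesphere3.3.0-kubernetes1.23.7.yaml:

```bash
yum update -y
./kk create cluster -f config-kubesphere3.3.0-kubernetes1.23.7.yaml
```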
Add/Remove Nodes

```bash
# If the original deployment file is missing, generate one with the command below.
# The generated file is named sample.yaml; I renamed it to add-node.yaml.
```
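A sketch of the full add/remove flow based on kk's documented commands (node name node3 reused from the examples below):

```bash
# Regenerate a deployment file from the running cluster (produces sample.yaml, per the note above)
./kk create config --from-cluster
mv sample.yaml add-node.yaml
# Add the new host under spec.hosts and spec.roleGroups, then:
./kk add nodes -f add-node.yaml
# Remove a node:
./kk delete node node3 -f add-node.yaml
```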
Add/Remove Taints

```bash
# Add a taint to node node1 with key "key1", value "value1", and effect NoSchedule.
# Only Pods with a toleration matching this taint can be scheduled onto node1.
kubectl taint nodes node1 key1=value1:NoSchedule
```
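Removing the taint is the same command with a trailing `-`, the same pattern used in a fix further below:

```bash
kubectl taint nodes node1 key1=value1:NoSchedule-
```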
Notes
- When a change can be made on the Deployment, prefer that over editing the Pod directly: edits made to a Pod are lost once the Pod restarts (see the sketch below).
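For example (the workload name here is only an illustration):

```bash
# Persisted: the Deployment controller rolls the change out to new Pods
kubectl -n kube-system edit deployment calico-kube-controllers
# Editing the Pod directly is not persisted: the change vanishes when the Pod is recreated
```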
Common Problems
When installing a multi-node cluster with kk, the following error appears:

```
[wait-control-plane] Waiting for the kubelet to boot up the control plane as static Pods from directory "/etc/kubernetes/manifests". This can take up to 4m0s [kubelet-check] Initial timeout of 40s passed.
```

Fix: the AMD host was the problem; switching to an Intel-based host worked. A newer fix is to run the following before installing:

```bash
yum update -y
```
After a node went down and a new node was added to replace it, running the kk add-node command reports:

```
etcd health check failed: Failed to exec command: sudo -E /bin/bash -c "export ETCDCTL_API=2;export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-node1.pem';export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-node1-key.pem';export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem';/usr/local/bin/etcdctl --endpoints=https://192.168.14.16:2379,https://192.168.14.17:2379,https://192.168.14.20:2379 cluster-health | grep -q 'cluster is healthy'
```

Fix: first remove the failed node from the etcd cluster, then rerun kk to add the etcd and master node.
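A sketch of removing the dead member, reusing the ETCDCTL v2 environment from the health check above (the member ID below is a placeholder):

```bash
export ETCDCTL_API=2
export ETCDCTL_CERT_FILE='/etc/ssl/etcd/ssl/admin-node1.pem'
export ETCDCTL_KEY_FILE='/etc/ssl/etcd/ssl/admin-node1-key.pem'
export ETCDCTL_CA_FILE='/etc/ssl/etcd/ssl/ca.pem'
# List members and find the ID of the dead node
/usr/local/bin/etcdctl --endpoints=https://192.168.14.16:2379 member list
# Remove it (ID is illustrative)
/usr/local/bin/etcdctl --endpoints=https://192.168.14.16:2379 member remove 8e9e05c52164694d
```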
Running

```bash
kubectl drain node3 --force --ignore-daemonsets --delete-emptydir-data
```

to delete a node reports the following:

```
I0705 16:41:20.004286 18301 request.go:665] Waited for 1.14877279s due to client-side throttling, not priority and fairness, request: GET:https://lb.kubesphere.local:6443/api/v1/namespaces/kubesphere-monitoring-system/pods/alertmanager-main-1
```

Fix: cancel the drain and simply force-delete the node:

```bash
kubectl delete node node3
```
After the node was rebuilt, the newly added node could not be scheduled, with the following error:

```
0/3 nodes are available: 1 node(s) had taint {node-role.kubernetes.io/master: }, that the pod didn't tolerate, 2 Insufficient cpu.
```

Fix: inspect the taints on the cluster nodes, then remove the offending one:

```bash
kubectl taint nodes node4 node-role.kubernetes.io/master=:NoSchedule-
```
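One way to inspect the taints first (node name taken from the error above; the jsonpath form is illustrative):

```bash
kubectl describe node node4 | grep Taints
# or list taints across all nodes:
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
```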
After removing the taint and rebuilding the node, the monitoring component prometheus-k8s reported the following container event error:

```
MountVolume.NewMounter initialization failed for volume "pvc-60891ee0-ba6c-4df4-b381-6e542b27d3a7" : path "/var/openebs/local/pvc-60891ee0-ba6c-4df4-b381-6e542b27d3a7" does not exist
```

Fix: run the following on the master node. (This method did not actually solve the problem; whether the storage volume is distributed remains to be verified.)

```bash
# Edit /etc/kubernetes/manifests/kube-apiserver.yaml:
# spec:
#   containers:
#   - command:
#     - kube-apiserver
#     - --feature-gates=RemoveSelfLink=false   # add this line
vim /etc/kubernetes/manifests/kube-apiserver.yaml
# Apply the configuration
kubectl apply -f /etc/kubernetes/manifests/kube-apiserver.yaml
```

When installing KubeSphere on an AMD host, the installation hung at `Please wait for the installation to complete:`. The pod logs showed the calico-node-4hgbb pod reporting the following error:
```
Type     Reason     Age                    From     Message
----     ------     ----                   ----     -------
Warning  Unhealthy  3m18s (x440 over 66m)  kubelet  (combined from similar events): Readiness probe failed: 2022-07-06 02:39:53.164 [INFO][4974] confd/health.go 180: Number of node(s) with BGP peering established = 2
calico/node is not ready: felix is not ready: readiness probe reporting 503
```

Reference: kubernetes v1.24.0 install failed with calico node not ready #1282

Fix: resolved by changing the Calico version; Calico probably needs to be updated from v3.20.0 to v3.23.0:
```bash
# Delete the Calico-related pods
kubectl -n kube-system get all | grep calico | awk '{print $1}' | xargs kubectl -n kube-system delete
# Fetch the newer 3.23 manifest
wget https://docs.projectcalico.org/archive/v3.23/manifests/calico.yaml
# Reinstall Calico
kubectl apply -f calico.yaml
# Calico is then healthy, but reinstalling with kk later puts it back into the broken state.
# Note: do not modify the Pod spec directly; modify the Deployment.
# The final fix is to run this before installing the cluster:
yum update -y
```

Under System Components -> Monitoring -> prometheus-k8s -> Events, the error log shows:

```
0/3 nodes are available: 3 Insufficient cpu.
```

Fix: modify Workloads -> StatefulSets -> prometheus-k8s.
Summary: setting requests.cpu to 0.5 means half a CPU; 0.5 is equivalent to 500m, read as "500 millicpu" (five hundred millicores). Official documentation: Resource units in Kubernetes.
```yaml
# These values have to be re-applied after every restart
containers:
  - name: prometheus
    resources:
      limits:
        cpu: '4'
        memory: 16Gi
      requests:
        cpu: 200m     # change to 20m
        memory: 400Mi # change to 40Mi
```
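The same edit from the command line, as a sketch (namespace taken from the drain log earlier; per the comment above, the StatefulSet is regenerated on restart, so the change has to be re-applied):

```bash
kubectl -n kubesphere-monitoring-system edit statefulset prometheus-k8s
```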
Running `kubectl top node` reports `error: Metrics API not available`.

Fix: (1) if KubeSphere is not installed yet, change the setting in the KubeSphere deployment configuration file; (2) if it is already installed, log in to KubeSphere and edit it under CRDs -> ClusterConfiguration -> ks-installer:

```yaml
metrics_server:
  enabled: false   # set this to true
```

The error `calico/node is not ready: felix is not ready: readiness probe reporting 503` came back; after another attempt it became:

```
calico/node is not ready: felix is not ready: Get "http://localhost:9099/readiness": dial tcp [::1]:9099: connect: connection refused
```
A recorded case of:

```
Error from server (BadRequest): container "calico-node" in pod "calico-node-bfmgs" is waiting to start: PodInitializing
```

Fix and troubleshooting process (reference: Kubernetes Installation Tutorial: Kubespray):
```bash
[root@master ~]# kubectl get nodes
master     Ready      control-plane   158d   v1.26.4
node-102   NotReady   control-plane   42h    v1.26.4
....
[root@master ~]# kubectl get pods -o wide -n kube-system
calico-node-2zbtg   0/1   Init:CrashLoopBackOff
[root@master ~]# kubectl describe pod -n kube-system calico-node-2zbtg
Events:
  Type     Reason   Age                From     Message
  ----     ------   ----               ----     -------
  Warning  BackOff  15s (x7 over 79s)  kubelet  Back-off restarting failed container install-cni in pod calico-node-2zbtg_kube-system(337004cf-9136-48ac-bc6b-eb897bd2c806)
[root@master ~]# kubectl logs -n kube-system calico-node-2zbtg
Defaulted container "calico-node" out of: calico-node, upgrade-ipam (init), install-cni (init), flexvol-driver (init)
Error from server (BadRequest): container "calico-node" in pod "calico-node-2zbtg" is waiting to start: PodInitializing
# --------- Deleting the pod just makes a new one start immediately ---------
[root@master ~]# kubectl delete pod -n kube-system calico-node-2zbtg
# --------- On the affected node, locate the dead container by its timestamp ---------
[root@node-102 ~]# crictl ps -a
fc19864603510   628dd70880410   About a minute ago   Exited   install-cni
[root@node-102 ~]# crictl logs fc19864603510
time="2024-01-14T12:26:57Z" level=info msg="Running as a Kubernetes pod" source="install.go:145"
2024-01-14 12:26:57.761 [INFO][1] cni-installer/<nil> <nil>: File is already up to date, skipping file="/host/opt/cni/bin/bandwidth"
.....
2024-01-14 12:26:57.964 [INFO][1] cni-installer/<nil> <nil>: CNI plugin version: v3.24.5
2024-01-14 12:26:57.964 [INFO][1] cni-installer/<nil> <nil>: /host/secondary-bin-dir is not writeable, skipping
W0114 12:26:57.964140       1 client_config.go:617] Neither --kubeconfig nor --master was specified.  Using the inClusterConfig.  This might not work.
time="2024-01-14T12:26:57Z" level=info msg="Using CNI config template from CNI_NETWORK_CONFIG_FILE" source="install.go:340"
time="2024-01-14T12:26:57Z" level=fatal msg="open /host/etc/cni/net.d/calico.conflist.template: no such file or directory" source="install.go:344"
# ---- The precise error: open /host/etc/cni/net.d/calico.conflist.template: no such file or directory
# ---- The file is missing, so copy one over from the master node
[root@node-102 ~]# cd /etc/cni/net.d/
[root@node-102 net.d]# ls
calico-kubeconfig
# --------- After copying the file to the broken node, delete the pod to speed up the restart ---------
[root@master net.d]# kubectl delete pod calico-node-bfmgs -n kube-system
# --------- Checking again: the file is there, the node joined successfully, and all pods are normal ---------
[root@node-102 net.d]# ls
10-calico.conflist  calico.conflist.template  calico-kubeconfig
```

When Kubernetes started a service, it reported:
```
Failed to create pod sandbox: rpc error: code = Unknown desc = failed to setup network for sandbox "5bc537d4925604f98d12ec576b90eeee0534402c6fb32fc31920a763051e6589": plugin type="calico" failed (add): error getting ClusterInformation: connection is unauthorized: Unauthorized
```
Fix: the server clock was wrong, which made the authorization fail. Find the node whose time is off, resynchronize it, then restart the Calico pod on that node:
```bash
[root@node-102 ~]# timedatectl
      Local time: Mon 2024-01-15 20:33:23 CST
  Universal time: Mon 2024-01-15 12:33:23 UTC
        RTC time: Mon 2024-01-15 06:43:54
       Time zone: Asia/Shanghai (CST, +0800)
     NTP enabled: yes
NTP synchronized: no
 RTC in local TZ: no
      DST active: n/a
[root@node-102 ~]# chronyc makestep
200 OK
[root@master ~]# kubectl get pods -o wide -n kube-system
calico-node-rlxg5   1/1   Running   0   26h   172.16.10.192   node-192
[root@master ~]# kubectl delete pod -n kube-system calico-node-rlxg5
```

Another case: services could not reach each other via service name, failing with `Caused by: java.net.UnknownHostException: system-business`:
```bash
[root@master ~]# kubectl get pods -o wide -n kube-system
nodelocaldns-g72zt   0/1   CrashLoopBackOff   17 (39s ago)   63m   172.16.10.4   master   <none>   <none>
# Check this pod's logs
[root@master ~]# crictl ps -a | grep nodelocaldns
7ae6e0fc97f0b   9eaf430eed843   46 seconds ago   Exited   node-cache   19   c8ccd1afaa13e   nodelocaldns-g72zt
[root@master ~]# crictl logs 7ae6e0fc97f0b
2024/04/25 09:22:45 [INFO] Starting node-cache image: 1.22.18
2024/04/25 09:22:45 [INFO] Using Corefile /etc/coredns/Corefile
2024/04/25 09:22:45 [INFO] Using Pidfile
2024/04/25 09:22:46 [ERROR] Failed to read node-cache coreFile /etc/coredns/Corefile.base - open /etc/coredns/Corefile.base: no such file or directory
2024/04/25 09:22:46 [INFO] Skipping kube-dns configmap sync as no directory was specified
.:53 on 169.254.25.10
cluster.local.:53 on 169.254.25.10
in-addr.arpa.:53 on 169.254.25.10
ip6.arpa.:53 on 169.254.25.10
[INFO] plugin/reload: Running configuration SHA512 = aa809f767f97014677c4e010f69a19281bea2a25fd44a8c9172f6f43db27a70080deb3a2add822c680f580337da221a7360acac898a1e8a8827a7bda80e00c2d
CoreDNS-1.10.0
linux/amd64, go1.18.10,
[FATAL] plugin/loop: Loop (169.254.25.10:34977 -> 169.254.25.10:53) detected for zone ".", see https://coredns.io/plugins/loop#troubleshooting. Query: "HINFO 1324972288110025575.747110785225940346."
# The other two nodes report different errors
# Node B
.....
CoreDNS-1.10.0
linux/amd64, go1.18.10,
[ERROR] plugin/errors: 2 helm.yangcoder.online. A: read udp 169.254.25.10:56963->169.254.25.10:53: i/o timeout
# Node C
[ERROR] plugin/errors: 2 47.30.16.172.in-addr.arpa. PTR: read tcp 10.233.0.3:47246->10.233.0.3:53: i/o timeout
[ERROR] plugin/errors: 2 47.30.16.172.in-addr.arpa. PTR: read tcp 10.233.0.3:47252->10.233.0.3:53: i/o timeout
```

Analysis:
Parts of the company network had been slow, and the DNS server 114.114.114.114 had been unreachable before, so the company network was suspected.
Initial investigation:
```bash
# The three nodes all have different DNS configurations:
[root@master ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search default.svc.cluster.local svc.cluster.local
nameserver 169.254.25.10
nameserver 183.221.253.100   # ping: Destination Host Unreachable
nameserver 114.114.114.114   # ping: Destination Host Unreachable
[root@node-192 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search default.svc.cluster.local svc.cluster.local
nameserver 169.254.25.10
nameserver 183.221.253.100   # ping: Destination Host Unreachable
nameserver 172.16.10.49      # ping: reachable
[root@node-102 ~]# cat /etc/resolv.conf
# Generated by NetworkManager
search default.svc.cluster.local svc.cluster.local
nameserver 169.254.25.10
nameserver 172.16.10.63      # ping: reachable
nameserver 192.168.100.254   # ping: reachable
```

It appears that none of the master node's upstream DNS servers were reachable, so nodelocaldns on the master could not start properly, which in turn cut off DNS between the master and the other nodes (to be verified).
Fix:
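The fix itself was not recorded in the original notes. A plausible remedy implied by the analysis above, purely an assumption:

```bash
# Assumption, not from the original notes: give the master reachable upstream
# nameservers, then restart nodelocaldns so it re-reads the configuration.
vim /etc/resolv.conf        # replace the unreachable 183.221.253.100 / 114.114.114.114 entries
kubectl -n kube-system delete pod nodelocaldns-g72zt
```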
Appendix
config-kubernetes-1.23.7.yaml deployment file:

```yaml
apiVersion: kubekey.kubesphere.io/v1alpha2
```
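Only the first line of the file survives above. For reference, a minimal sketch of a kk v1alpha2 Cluster config; host names, addresses, and the key path are placeholders rather than the original values, while the endpoint domain and etcd-style IPs echo ones seen in the logs earlier:

```yaml
apiVersion: kubekey.kubesphere.io/v1alpha2
kind: Cluster
metadata:
  name: sample
spec:
  hosts:
  - {name: node1, address: 192.168.14.16, internalAddress: 192.168.14.16, user: root, privateKeyPath: "~/.ssh/id_rsa"}
  - {name: node2, address: 192.168.14.17, internalAddress: 192.168.14.17, user: root, privateKeyPath: "~/.ssh/id_rsa"}
  roleGroups:
    etcd:
    - node1
    control-plane:
    - node1
    worker:
    - node2
  controlPlaneEndpoint:
    domain: lb.kubesphere.local
    address: ""
    port: 6443
  kubernetes:
    version: v1.23.7
    clusterName: cluster.local
  network:
    plugin: calico
    kubePodsCIDR: 10.233.64.0/18
    kubeServiceCIDR: 10.233.0.0/18
  registry:
    privateRegistry: ""
```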
config-kubesphere3.3.0-kubernetes1.23.7.yaml deployment file:

```yaml
apiVersion: kubekey.kubesphere.io/v1alpha2
```
add-node.yaml:

```yaml
apiVersion: kubekey.kubesphere.io/v1alpha2
```