Preface:

A Kubernetes cluster is not finished once it is installed and deployed; scheduling policy also has to be set according to the nature of the workloads. For example, some pods should not be scheduled onto nodes with weaker hardware, while other pods should land only on nodes with specific hardware such as GPUs. Or a node's CPU and memory may not yet be exhausted, but scheduling further pods onto it risks destabilizing the cluster: how do we stop scheduling there and evict the pods already running on it? And when a node needs maintenance or rebuilding, what should happen to the pods running on it? These questions and their solutions make up pod scheduling policy, eviction policy, and taint/toleration scheduling.

In plain terms, these can all be considered optimization strategies for a Kubernetes cluster. Below is a summary covering the scenarios raised above, plus a few that were not mentioned.
The main pod scheduling strategies:
- Free scheduling: which node a pod runs on is decided entirely by the scheduler through a series of algorithms (this is the default when no directed scheduling, affinity, or toleration policy is set)
- Directed scheduling: use nodeName or nodeSelector to pin a pod to a node (pod targets a node)
- Affinity scheduling: NodeAffinity, PodAffinity, PodAntiAffinity
- Taint/toleration scheduling: Taints, Tolerations (covered earlier)
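As a quick illustration of directed scheduling, here is a minimal sketch of a pod pinned via nodeSelector (the label `disktype=ssd` and the pod name are hypothetical, not from this cluster):

```yaml
# Hypothetical example: the node must first carry the label, e.g.
#   kubectl label nodes k8s-node1 disktype=ssd
apiVersion: v1
kind: Pod
metadata:
  name: nginx-ssd
spec:
  nodeSelector:
    disktype: ssd        # scheduler only considers nodes with this label
  containers:
  - name: nginx
    image: nginx
```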
1. Node maintenance state (cordon, uncordon, drain):

An example node setup:

Suppose there are three nodes: k8s-master, k8s-node1, and k8s-node2.
```
[root@master ~]# k get no
NAME         STATUS   ROLES    AGE   VERSION
k8s-master   Ready    <none>   25d   v1.18.3
k8s-node1    Ready    <none>   25d   v1.18.3
k8s-node2    Ready    <none>   25d   v1.18.3
```
The following pods are running on node2; among them, kube-flannel-ds-mlb7l is deployed as a DaemonSet and is a core component:
```
NAMESPACE     NAME                       READY   STATUS    RESTARTS   AGE     IP               NODE         NOMINATED NODE   READINESS GATES
default       busybox-7bf6d6f9b5-jg922   1/1     Running   2          24d     10.244.0.11      k8s-master   <none>           <none>
default       nginx-7c96855774-28b5w     1/1     Running   2          24d     10.244.0.12      k8s-master   <none>           <none>
default       nginx-7c96855774-d592j     1/1     Running   0          4h44m   10.244.0.13      k8s-master   <none>           <none>
default       nginx1                     1/1     Running   2          24d     10.244.2.11      k8s-node2    <none>           <none>
kube-system   coredns-76648cbfc9-lb75g   1/1     Running   2          24d     10.244.2.10      k8s-node2    <none>           <none>
kube-system   kube-flannel-ds-mhkdq      1/1     Running   7          24d     192.168.217.17   k8s-node1    <none>           <none>
kube-system   kube-flannel-ds-mlb7l      1/1     Running   6          24d     192.168.217.18   k8s-node2    <none>           <none>
kube-system   kube-flannel-ds-sl4qv     1/1     Running   2          24d     192.168.217.16   k8s-master   <none>           <none>
```
Suppose node2 now needs maintenance. First, all pods on node2 must be evicted: ordinary pods are evicted directly, while DaemonSet pods are ignored:
```
[root@master ~]# k drain k8s-node2 --force --ignore-daemonsets
node/k8s-node2 already cordoned
WARNING: deleting Pods not managed by ReplicationController, ReplicaSet, Job, DaemonSet or StatefulSet: default/nginx1; ignoring DaemonSet-managed Pods: kube-system/kube-flannel-ds-mlb7l
evicting pod default/nginx1
evicting pod kube-system/coredns-76648cbfc9-lb75g
pod/coredns-76648cbfc9-lb75g evicted
pod/nginx1 evicted
node/k8s-node2 evicted
```
The result looks like this:

As you can see, the coredns pod drifted to node1, the bare pod nginx1 was simply destroyed, and kube-flannel-ds-mlb7l was left untouched:
```
[root@master ~]# k get po -A -owide
NAMESPACE     NAME                       READY   STATUS    RESTARTS   AGE     IP               NODE         NOMINATED NODE   READINESS GATES
default       busybox-7bf6d6f9b5-jg922   1/1     Running   2          24d     10.244.0.11      k8s-master   <none>           <none>
default       nginx-7c96855774-28b5w     1/1     Running   2          24d     10.244.0.12      k8s-master   <none>           <none>
default       nginx-7c96855774-d592j     1/1     Running   0          4h52m   10.244.0.13      k8s-master   <none>           <none>
kube-system   coredns-76648cbfc9-z8kh5   1/1     Running   0          2m2s    10.244.1.8       k8s-node1    <none>           <none>
kube-system   kube-flannel-ds-mhkdq      1/1     Running   7          24d     192.168.217.17   k8s-node1    <none>           <none>
kube-system   kube-flannel-ds-mlb7l      1/1     Running   6          24d     192.168.217.18   k8s-node2    <none>           <none>
kube-system   kube-flannel-ds-sl4qv      1/1     Running   2          24d     192.168.217.16   k8s-master   <none>           <none>
```
At this point the node is only partially disabled: the scheduler will no longer place new pods on it, but a pod that pins itself with spec.nodeName bypasses the scheduler entirely and will still run there (a nodeSelector alone is not enough to force this, since it still goes through the scheduler).

Right now the scheduler will not schedule new pods onto node2:
```
[root@master ~]# k get no
NAME         STATUS                     ROLES    AGE   VERSION
k8s-master   Ready                      <none>   25d   v1.18.3
k8s-node1    Ready                      <none>   25d   v1.18.3
k8s-node2    Ready,SchedulingDisabled   <none>   25d   v1.18.3
```
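To make the "partially disabled" point concrete, here is a minimal sketch of a pod that would still land on the cordoned node (the pod name is hypothetical):

```yaml
# Hypothetical pod that bypasses the scheduler: the kubelet on k8s-node2
# runs it directly, so SchedulingDisabled does not stop it.
apiVersion: v1
kind: Pod
metadata:
  name: bypass-demo
spec:
  nodeName: k8s-node2
  containers:
  - name: busybox
    image: busybox:1.28.3
    args: ["sleep", "3600"]
```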
Putting a node into maintenance state and taking it back out:
```
[root@master coredns]# k cordon k8s-node2
node/k8s-node2 cordoned
[root@master coredns]# k uncordon k8s-node2
node/k8s-node2 uncordoned
```
Summary:

cordon, uncordon, and drain are mainly used for node maintenance. drain evicts pods safely, allowing them to drift to other nodes, but the eviction is not forcible and cannot override a pod's own scheduling constraints.

Its applicability is fairly narrow: for example, pods using local storage volumes, or stateful pods, are poor candidates for drain, because once drained the services that depend on that node are broken.
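Putting the pieces above together, a typical maintenance workflow looks like this (using the same node name and drain flags as in this article):

```bash
# Typical node-maintenance workflow
kubectl cordon k8s-node2                          # stop new pods from scheduling here
kubectl drain k8s-node2 --force --ignore-daemonsets   # evict the remaining pods
# ... perform OS / hardware maintenance on the node ...
kubectl uncordon k8s-node2                        # put the node back into rotation
```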
2. Taints (node taints)

The built-in explanation of taints:
```
[root@master coredns]# k explain node.spec.taints
KIND:     Node
VERSION:  v1

RESOURCE: taints <[]Object>

DESCRIPTION:
     If specified, the node's taints.

     The node this Taint is attached to has the "effect" on any pod that does
     not tolerate the Taint.

FIELDS:
   effect	<string> -required-
     Required. The effect of the taint on pods that do not tolerate the taint.
     Valid effects are NoSchedule, PreferNoSchedule and NoExecute.

   key	<string> -required-
     Required. The taint key to be applied to a node.

   timeAdded	<string>
     TimeAdded represents the time at which the taint was added. It is only
     written for NoExecute taints.

   value	<string>
     The taint value corresponding to the taint key.
```
The taint sub-field effect accepts three values:

1. NoSchedule: Kubernetes will not schedule pods onto a node carrying this taint
2. PreferNoSchedule: Kubernetes will try to avoid scheduling pods onto a node carrying this taint
3. NoExecute: Kubernetes will not schedule pods onto the node, and will additionally evict pods already running on it
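The three effects above map directly onto the taint command, one variant per effect (using the same node and key/value as this article):

```bash
kubectl taint nodes k8s-node2 key=value:NoSchedule        # block new scheduling only
kubectl taint nodes k8s-node2 key=value:PreferNoSchedule  # soft avoidance
kubectl taint nodes k8s-node2 key=value:NoExecute         # block scheduling and evict existing pods
```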
Setting a taint:

For example, to taint node2 with NoExecute (the key=value pair here is arbitrary; something like A=B:NoExecute works just as well, but meaningful names are better, because the tolerations discussed later match on these keys and values):
```
kubectl taint nodes k8s-node2 key=value:NoExecute
```
Check the nodes and pods (node2 is now unschedulable and the pods on it are all Terminating; the busybox pods are stuck there because I had pinned them to node2 with a nodeSelector):
```
[root@master coredns]# k get no
NAME         STATUS                     ROLES    AGE   VERSION
k8s-master   Ready                      <none>   25d   v1.18.3
k8s-node1    Ready                      <none>   25d   v1.18.3
k8s-node2    Ready,SchedulingDisabled   <none>   25d   v1.18.3
[root@master coredns]# k get po -A -owide
NAMESPACE     NAME                       READY   STATUS        RESTARTS   AGE   IP               NODE        NOMINATED NODE   READINESS GATES
default       busybox-68c4f6755d-24f79   0/1     Terminating   0          18s   <none>           k8s-node2   <none>           <none>
default       busybox-68c4f6755d-26f5j   0/1     Terminating   0          34s   <none>           k8s-node2   <none>           <none>
default       busybox-68c4f6755d-28m4l   0/1     Terminating   0          42s   <none>           k8s-node2   <none>           <none>
default       busybox-68c4f6755d-2bb7z   0/1     Terminating   0          39s   <none>           k8s-node2   <none>           <none>
default       busybox-68c4f6755d-2gkss   0/1     Terminating   0          4s    <none>           k8s-node2   <none>           <none>
default       busybox-68c4f6755d-2gpq4   0/1     Terminating   0          87s   <none>           k8s-node2   <none>           <none>
kube-system   kube-flannel-ds-mlb7l      1/1     Terminating   6          25d   192.168.217.18   k8s-node2   <none>           <none>
```
Removing the taint:

```
kubectl taint nodes k8s-node2 key:NoExecute-
kubectl uncordon k8s-node2
```
Viewing taints:

```
[root@master ~]# kubectl describe nodes k8s-node2 |grep Taints
Taints:             key=value:NoSchedule
```
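To view the taints of every node at once rather than describing nodes one by one, a jsonpath query like the following should work (a sketch, not from the original article):

```bash
# Print each node name alongside its taints array (empty if untainted)
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.taints}{"\n"}{end}'
```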
OK, node2 now carries a taint and will not receive new pods. Let's deploy a three-replica Deployment and see whether it succeeds:
```
[root@master coredns]# cat test.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: busybox
  namespace: default
spec:
  replicas: 3
  selector:
    matchLabels:
      app: busybox
  template:
    metadata:
      labels:
        app: busybox
    spec:
      # nodeName: k8s-node2
      containers:
      - name: busybox
        image: busybox:1.28.3
        imagePullPolicy: IfNotPresent
        args:
        - /bin/sh
        - -c
        - sleep 10; touch /tmp/healthy; sleep 30000
```
As you can see, none of the pods landed on node2; even with the replica count raised to 10, nothing gets scheduled onto that node.
```
[root@master coredns]# k get po -o wide
NAME                       READY   STATUS    RESTARTS   AGE   IP            NODE         NOMINATED NODE   READINESS GATES
busybox-7bf6d6f9b5-5qzfn   1/1     Running   0          44s   10.244.1.10   k8s-node1    <none>           <none>
busybox-7bf6d6f9b5-j72q7   1/1     Running   0          44s   10.244.0.14   k8s-master   <none>           <none>
busybox-7bf6d6f9b5-mgt8j   1/1     Running   0          44s   10.244.0.15   k8s-master   <none>           <none>
```
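Conversely, if the Deployment is given a toleration matching the taint, its replicas become eligible for node2 again. A minimal sketch of the fragment to add under the pod template's spec in test.yaml, assuming the key=value:NoExecute taint set earlier:

```yaml
      # Toleration matching the taint key=value:NoExecute; with this,
      # the scheduler may place replicas on the tainted k8s-node2 as well.
      tolerations:
      - key: "key"
        operator: "Equal"
        value: "value"
        effect: "NoExecute"
```

Note that a toleration only makes the tainted node a candidate again; it does not guarantee that any replica actually lands there.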