Kubernetes Cordon and Drain: A Comprehensive Guide
Cordon and drain are crucial for node management in Kubernetes. These operations are essential when you need to take a node out of service for maintenance, an upgrade, or other administrative tasks without disrupting the rest of the cluster. In this blog post, we will look closely at the concepts of cordon and drain, walk through a typical usage example, and cover common practices and best practices.
Core Concepts
Cordon
In Kubernetes, cordon is an operation that marks a node as unschedulable. When a node is cordoned, the Kubernetes scheduler will not assign any new pods to this node. However, the existing pods running on the cordoned node will continue to function as normal. This is useful when you want to prevent new workloads from being placed on a node that you plan to take offline soon.
You can cordon a node using the following kubectl command:
kubectl cordon <node-name>
Here, <node-name> is the name of the node you want to cordon.
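To confirm that a cordon took effect, you can read the node's spec.unschedulable field, which is the field kubectl cordon sets. The snippet below is a minimal sketch; the function name node_sched_state is made up for illustration and is not part of kubectl.

```shell
# node_sched_state NODE
# Hypothetical helper: prints "cordoned" or "schedulable" for the given node.
node_sched_state() {
  local node="$1"
  local flag
  # .spec.unschedulable is "true" on a cordoned node and empty otherwise.
  flag="$(kubectl get node "$node" -o jsonpath='{.spec.unschedulable}')" || return 1
  if [ "$flag" = "true" ]; then
    echo "cordoned"
  else
    echo "schedulable"
  fi
}
```

Running node_sched_state with the name of a node you have just cordoned should print cordoned.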
Drain
The drain operation is a more comprehensive action. It first cordons the node to prevent new pods from being scheduled on it. Then it evicts the pods running on the node gracefully: the containers in each pod receive a termination signal and have the pod's termination grace period to perform any necessary cleanup before they are shut down. Pods managed by a controller such as a Deployment are recreated on other nodes; pods that are not managed by any controller are not recreated, and drain refuses to remove them unless you pass --force.
If a pod is covered by a PodDisruptionBudget (PDB), the drain operation respects it: an eviction that would take the application below the budget is refused, so the minimum number of replicas the application needs stays available.
The basic kubectl command to drain a node is:
kubectl drain <node-name>
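On a real cluster the bare command often stops with an error, for example because the node runs DaemonSet pods or pods that use emptyDir volumes. The sketch below previews a drain with a client-side dry run and only then performs it. The function name preview_then_drain and the 5 minute timeout are assumptions chosen for illustration; note that --delete-emptydir-data allows eviction of pods whose emptyDir data will be lost.

```shell
# preview_then_drain NODE
# Hypothetical helper: dry-run the drain first, then drain for real.
preview_then_drain() {
  local node="$1"
  # Client-side dry run: reports what would happen without changing the cluster.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data \
    --dry-run=client || return 1
  # Real drain; --timeout limits how long kubectl keeps waiting and retrying.
  kubectl drain "$node" --ignore-daemonsets --delete-emptydir-data \
    --timeout=5m
}
```

If the dry run reports an error, the function returns before anything on the node is changed.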
Typical Usage Example
Let’s assume you have a three-node Kubernetes cluster, and you need to perform maintenance on one of the nodes, say node-1.
Step 1: Cordon the Node
First, you can cordon the node to prevent new pods from being scheduled on it.
kubectl cordon node-1
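After running this command, kubectl get nodes shows node-1 with the status Ready,SchedulingDisabled. If you check the node using kubectl describe node node-1, you will see that the Unschedulable field is set to true and that the node carries the taint node.kubernetes.io/unschedulable:NoSchedule.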
Step 2: Drain the Node
Next, you can drain the node to evict all the running pods gracefully.
kubectl drain node-1 --ignore-daemonsets
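The --ignore-daemonsets flag is needed because DaemonSet pods are tied to the node: the DaemonSet controller ignores the unschedulable mark and would recreate an evicted pod on the same node right away. Without the flag, drain stops with an error when it finds DaemonSet-managed pods; with it, drain skips those pods and leaves them running.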
Once the drain operation is complete, the node is ready for maintenance. After the maintenance is done, you can uncordon the node using the following command:
kubectl uncordon node-1
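The three steps above can be wrapped in a small shell function so that the node is only returned to service when the maintenance command succeeds. This is a sketch under assumptions: the name with_node_drained, the 10 minute timeout, and the decision to leave the node cordoned on failure are illustrative choices, not an established convention.

```shell
# with_node_drained NODE COMMAND [ARGS...]
# Hypothetical wrapper: cordon and drain NODE, run COMMAND, then uncordon.
with_node_drained() {
  local node="$1"
  shift
  kubectl cordon "$node" || return 1
  if ! kubectl drain "$node" --ignore-daemonsets --timeout=10m; then
    echo "drain of $node failed; node left cordoned for inspection" >&2
    return 1
  fi
  # Run the maintenance command passed by the caller.
  if "$@"; then
    kubectl uncordon "$node"
  else
    echo "maintenance failed; $node stays cordoned" >&2
    return 1
  fi
}
```

For example, with_node_drained node-1 ./do-maintenance.sh would run a maintenance script of your own (do-maintenance.sh is a placeholder name) between the drain and the uncordon.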
Common Practices
Handling PodDisruptionBudget
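As mentioned earlier, PodDisruptionBudget (PDB) is an important concept when draining nodes. A PDB selects pods by label, typically the pods of a Deployment or a StatefulSet. If evicting a pod would violate its PDB, the eviction request is rejected and kubectl drain keeps retrying until the eviction succeeds or the drain times out, so the drain can appear to hang.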
To handle this, you can either increase the allowed disruptions in the PDB temporarily or drain the node in multiple steps, waiting for the application to recover between each step.
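Before draining, it can help to see which PDBs currently allow no disruptions at all, since those are the ones likely to stall the drain. The sketch below reads the disruptionsAllowed value from each PDB's status; the function name blocked_pdbs is hypothetical.

```shell
# blocked_pdbs
# Hypothetical helper: prints namespace/name of every PDB that currently
# allows zero disruptions.
blocked_pdbs() {
  kubectl get pdb --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}{"/"}{.metadata.name}{" "}{.status.disruptionsAllowed}{"\n"}{end}' \
    | awk '$2 == "0" { print $1 }'
}
```

An empty result suggests that no PDB is at its limit at that moment; it does not guarantee that every eviction during the drain will be accepted.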
Using Labels and Taints
You can use node labels and taints in combination with cordon and drain operations. For example, you can label the nodes that are part of a specific maintenance group. Then, you can cordon and drain all the nodes in that group with one command each by passing a label selector. Before draining a whole group, make sure the remaining nodes have enough capacity to run the evicted pods.
kubectl cordon -l maintenance-group=group-1
kubectl drain -l maintenance-group=group-1 --ignore-daemonsets
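If you prefer to control the pace yourself, you can also loop over the labeled nodes and drain them one at a time, stopping at the first failure. The following is a minimal sketch; drain_group is a hypothetical function name, and it assumes node names contain no whitespace.

```shell
# drain_group SELECTOR
# Hypothetical helper: drains every node matching SELECTOR, one after another.
drain_group() {
  local selector="$1"
  local node
  # List the plain node names that match the label selector.
  for node in $(kubectl get nodes -l "$selector" \
      -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}'); do
    # Stop at the first node that cannot be drained.
    kubectl drain "$node" --ignore-daemonsets || return 1
  done
}
```

Calling drain_group maintenance-group=group-1 would process the same nodes as the selector-based commands above, and the loop body is a natural place to add a health check between nodes.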
Best Practices
Testing in a Staging Environment
Before performing a cordon and drain operation in a production environment, it is highly recommended to test the process in a staging environment. This helps you identify any potential issues, such as pods not terminating gracefully or PDB violations.
Monitoring the Cluster
During the cordon and drain process, closely monitor the cluster using tools like Prometheus and Grafana. This allows you to detect any anomalies, such as increased latency or service outages, and take corrective actions immediately.
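Alongside dashboards, a couple of kubectl queries give a quick view of how a drain is going. The sketch below is one possible approach with hypothetical function names: it lists the pods still on a node and counts the pods in the Pending phase across the cluster, since a growing count can indicate that evicted pods are not finding capacity elsewhere.

```shell
# pods_left_on_node NODE
# Hypothetical helper: lists the pods still assigned to NODE.
pods_left_on_node() {
  kubectl get pods --all-namespaces --field-selector "spec.nodeName=$1" -o wide
}

# pending_pod_count
# Hypothetical helper: prints the number of pods in the Pending phase.
pending_pod_count() {
  kubectl get pods --all-namespaces --field-selector status.phase=Pending \
    --no-headers 2>/dev/null | wc -l | tr -d ' '
}
```

After a drain with --ignore-daemonsets, pods_left_on_node would typically still show the DaemonSet pods, which is expected.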
Documenting the Process
Document the cordon and drain process, including the commands used, the expected behavior, and any potential issues. This documentation can be useful for future reference and for other team members who may need to perform similar operations.
Conclusion
Kubernetes cordon and drain operations are powerful tools for node management. By understanding the core concepts, following typical usage examples, and adhering to common and best practices, you can perform node maintenance and upgrades without causing significant disruptions to your applications. Remember to test in a staging environment, monitor the cluster, and document the process for a smooth and efficient node management experience.