Kubernetes Cordon and Drain: A Comprehensive Guide

Kubernetes, an open-source container orchestration platform, offers a wide range of features for managing containerized applications. Among them, cordon and drain are central to node management: they let you take a node out of service for maintenance, upgrades, or other administrative tasks without disrupting the rest of the cluster. This post explains what cordon and drain do, walks through a typical usage example, and covers common and best practices.

Table of Contents

  1. Core Concepts
  2. Typical Usage Example
  3. Common Practices
  4. Best Practices
  5. Conclusion
  6. References

Core Concepts

Cordon

In Kubernetes, cordon is an operation that marks a node as unschedulable. When a node is cordoned, the Kubernetes scheduler will not assign any new pods to this node. However, the existing pods running on the cordoned node will continue to function as normal. This is useful when you want to prevent new workloads from being placed on a node that you plan to take offline soon.

You can cordon a node using the following kubectl command:

kubectl cordon <node-name>

Here, <node-name> is the name of the node you want to cordon.

Drain

The drain operation is a more comprehensive action. It first cordons the node to prevent new pods from being scheduled on it. Then it tries to evict all the pods running on the node gracefully. Graceful eviction means that Kubernetes sends SIGTERM to each pod's containers and waits up to the pod's termination grace period (terminationGracePeriodSeconds, 30 seconds by default) so they can finish cleanup before being killed.

If a pod has a PodDisruptionBudget (PDB) associated with it, the drain operation will respect the PDB to ensure that the minimum number of replicas required for the application to function properly is maintained.
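As a rough illustration, a PDB for a hypothetical web application might look like the manifest below. The name web-pdb, the app=web label, and the minAvailable value of 2 are assumptions made for this sketch, not values from any real cluster. The helper function print_example_pdb is likewise hypothetical and only prints the manifest so it can be piped to kubectl.

```shell
# Sketch: print a PodDisruptionBudget manifest for a hypothetical "web" app.
# web-pdb, app=web and minAvailable: 2 are illustrative assumptions.
print_example_pdb() {
  cat <<'EOF'
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web
EOF
}

# Usage (requires cluster access):
#   print_example_pdb | kubectl apply -f -
```

With a budget like this in place, a drain evicts a matching pod only if at least two matching pods would still be available afterwards.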

The basic kubectl command to drain a node is:

kubectl drain <node-name>
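Before draining, it can help to see which pods the operation would affect. The sketch below wraps a kubectl query in a hypothetical helper, pods_on_node (the function name is an assumption for illustration). It relies on the spec.nodeName field selector for pods.

```shell
# Hypothetical helper: list every pod currently scheduled on a node,
# across all namespaces, so you can see what a drain would touch.
pods_on_node() {
  kubectl get pods --all-namespaces -o wide \
    --field-selector "spec.nodeName=$1"
}

# Usage (requires cluster access):
#   pods_on_node node-1
```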

Typical Usage Example

Let’s assume you have a three-node Kubernetes cluster, and you need to perform maintenance on one of the nodes, say node-1.

Step 1: Cordon the Node

First, you can cordon the node to prevent new pods from being scheduled on it.

kubectl cordon node-1

After running this command, kubectl get nodes shows node-1 with the status Ready,SchedulingDisabled, and kubectl describe node node-1 reports Unschedulable: true.
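For scripts, the node's scheduling state can also be read from the node object itself. The sketch below assumes only that cordoning sets the node's spec.unschedulable field to true; the helper name node_schedule_state is hypothetical.

```shell
# Hypothetical helper: print "cordoned" or "schedulable" for the given node.
# Cordon sets spec.unschedulable to true; on an uncordoned node the field
# is absent, so the jsonpath query prints nothing.
node_schedule_state() {
  state=$(kubectl get node "$1" -o jsonpath='{.spec.unschedulable}')
  if [ "$state" = "true" ]; then
    echo "cordoned"
  else
    echo "schedulable"
  fi
}

# Usage (requires cluster access):
#   node_schedule_state node-1
```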

Step 2: Drain the Node

Next, drain the node to evict its running pods gracefully. Drain would cordon the node on its own; cordoning first simply makes that step explicit.

kubectl drain node-1 --ignore-daemonsets

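The --ignore-daemonsets flag is needed because kubectl drain refuses to proceed when the node runs DaemonSet-managed pods unless the flag is set. Those pods are left in place: the DaemonSet controller ignores the unschedulable marking and would recreate them immediately if they were deleted. If pods on the node use emptyDir volumes, drain also requires --delete-emptydir-data, because that data is lost when the pod is removed.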

Once the drain operation is complete, the node is ready for maintenance. After the maintenance is done, you can uncordon the node using the following command:

kubectl uncordon node-1
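The whole sequence can be sketched as a small shell function. This is a minimal illustration rather than a production script: maintain_node is a hypothetical helper, the maintenance step is whatever command you pass after the node name, and the 300-second drain timeout is an arbitrary choice.

```shell
# Sketch of the cordon -> drain -> maintain -> uncordon sequence.
# maintain_node is a hypothetical helper; the 300s timeout is an assumption.
# Usage: maintain_node <node> <maintenance command...>
maintain_node() {
  node=$1
  shift
  kubectl cordon "$node" || return 1
  # Stop if the drain fails or times out; the node stays cordoned for inspection.
  kubectl drain "$node" --ignore-daemonsets --timeout=300s || return 1
  # Run the caller-supplied maintenance command.
  "$@" || return 1
  # Only reached when the maintenance command succeeded.
  kubectl uncordon "$node"
}

# Usage (requires cluster access; the echo stands in for real maintenance work):
#   maintain_node node-1 echo "maintenance done"
```

Leaving the node cordoned on failure is a deliberate choice in this sketch: it keeps new pods away from a node that may be in a half-maintained state.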

Common Practices

Handling PodDisruptionBudget

As mentioned earlier, PodDisruptionBudget (PDB) is an important concept when draining nodes. If a PDB selects a pod (for example, one managed by a Deployment or StatefulSet), the eviction is refused whenever it would violate the budget, and kubectl drain keeps retrying until the eviction is allowed or the --timeout, if set, expires.

To handle this, you can either temporarily raise the number of disruptions the PDB allows or drain in stages, waiting for the application to recover between steps.
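To spot budgets that would block a drain before starting one, you can look at each PDB's status. The sketch below uses a hypothetical helper, blocking_pdbs, that prints the PDBs whose status.disruptionsAllowed is currently zero; the function name and output format are assumptions for illustration.

```shell
# Hypothetical helper: list PodDisruptionBudgets that currently allow zero
# disruptions, since evictions of the pods they cover will be refused.
# Output format: <namespace>/<name> <disruptionsAllowed>
blocking_pdbs() {
  kubectl get pdb --all-namespaces \
    -o jsonpath='{range .items[*]}{.metadata.namespace}/{.metadata.name} {.status.disruptionsAllowed}{"\n"}{end}' |
    awk 'NF && $2 + 0 == 0'
}

# Usage (requires cluster access):
#   blocking_pdbs
```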

Using Labels and Taints

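You can use node labels and taints in combination with cordon and drain operations. Cordoning is itself implemented partly through a taint: a cordoned node carries the node.kubernetes.io/unschedulable:NoSchedule taint. Labels are useful for grouping. For example, you can label the nodes that belong to a specific maintenance group, then cordon and drain every node in that group with a single command using a label selector (-l). Drain only as many nodes at once as the cluster's spare capacity can absorb.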

kubectl cordon -l maintenance-group=group-1
kubectl drain -l maintenance-group=group-1 --ignore-daemonsets
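The maintenance-group label has to be applied before those selectors match anything. A minimal sketch, assuming a hypothetical helper named add_to_maintenance_group and the same maintenance-group label key used above:

```shell
# Hypothetical helper: add one or more nodes to a maintenance group by label.
# Usage: add_to_maintenance_group <group> <node>...
add_to_maintenance_group() {
  group=$1
  shift
  for node in "$@"; do
    kubectl label node "$node" "maintenance-group=$group" --overwrite || return 1
  done
  # Show which nodes the selector now matches.
  kubectl get nodes -l "maintenance-group=$group"
}

# Usage (requires cluster access):
#   add_to_maintenance_group group-1 node-1 node-2
# Remove the label again after maintenance:
#   kubectl label node node-1 maintenance-group-
```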

Best Practices

Testing in a Staging Environment

Before performing a cordon and drain operation in a production environment, it is highly recommended to test the process in a staging environment. This helps you identify any potential issues, such as pods not terminating gracefully or PDB violations.

Monitoring the Cluster

During the cordon and drain process, closely monitor the cluster using tools like Prometheus and Grafana. This allows you to detect any anomalies, such as increased latency or service outages, and take corrective actions immediately.

Documenting the Process

Document the cordon and drain process, including the commands used, the expected behavior, and any potential issues. This documentation can be useful for future reference and for other team members who may need to perform similar operations.

Conclusion

Kubernetes cordon and drain operations are powerful tools for node management. By understanding the core concepts, following typical usage examples, and adhering to common and best practices, you can perform node maintenance and upgrades without causing significant disruptions to your applications. Remember to test in a staging environment, monitor the cluster, and document the process for a smooth and efficient node management experience.

References