Kubernetes: why is it important to configure system resource management?
As a rule, an application always needs a dedicated pool of resources for its correct and stable operation. But what if several applications run on the same resources? How do you guarantee each of them the minimum resources it requires? How can you cap resource consumption? How do you distribute the load between nodes correctly? How do you make sure the horizontal scaling mechanism kicks in when the application load grows?
To begin with, the main types of resources that exist in the system are, of course, CPU time and RAM. In the k8s manifests these resource types are measured in the following units:
- CPU – in cores
- RAM – in bytes
For each resource you can set two types of requirements – requests and limits. Requests describes the minimum amount of free node resources required to run the container (and the pod as a whole), while limits sets a hard cap on the resources available to the container.
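For reference, here is a minimal sketch of how these units look in a container spec (the container name and the values are arbitrary and only illustrate the notation: CPU is given in cores or millicores, memory in bytes with binary suffixes):

```yaml
containers:
  - name: app              # hypothetical container name
    image: nginx
    resources:
      requests:
        cpu: 500m          # 0.5 of a CPU core
        memory: 128Mi      # 128 * 2^20 bytes
      limits:
        cpu: "1"           # one full core
        memory: 512Mi
```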
It is important to understand that you do not have to define both types explicitly in the manifest; the behavior will be as follows:
- If only limits is explicitly specified for a resource, then requests for that resource automatically takes a value equal to limits (you can verify this by running describe on the entity – see the sketch after this list). In other words, the container will be limited to exactly the amount of resources it requires to run.
- If only requests is explicitly set for a resource, no upper limit is set for that resource – i.e. the container is limited only by the resources of the node itself.
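A quick way to check the defaulting behavior described in the first point (assuming a hypothetical pod named app-nginx in the default namespace):

```bash
# Shows the effective Requests and Limits per container;
# if only limits were set, Requests will show the same values.
kubectl describe pod app-nginx

# The same information taken straight from the pod spec
kubectl get pod app-nginx -o jsonpath='{.spec.containers[*].resources}'
```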
It is also possible to configure resource management not only at the level of an individual container, but also at the namespace level, using the following entities:
- LimitRange – describes the restriction policy at the container/pod level in a ns; it is used to set default restrictions for containers/pods, to prevent the creation of obviously oversized (or, conversely, undersized) containers/pods, and to define the allowed difference between limits and requests
- ResourceQuota – describes the overall restriction policy for all containers in a ns; it is usually used to divide resources between environments (useful when environments are not strictly separated at the node level)
Below are examples of manifests where resource limits are set:
- At the level of a specific container:
```yaml
containers:
  - name: app-nginx
    image: nginx
    resources:
      requests:
        memory: 1Gi
      limits:
        cpu: 200m
```
In this case, to run the nginx container the node needs at least 1Gi of free RAM and 0.2 CPU, while at most the container can consume 0.2 CPU and all the RAM available on the node.
- At the level of a whole ns:
```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: nxs-test
spec:
  hard:
    requests.cpu: 300m
    requests.memory: 1Gi
    limits.cpu: 700m
    limits.memory: 2Gi
```
This means that the sum of requests of all containers in the default ns cannot exceed 300m of CPU and 1Gi of RAM, and the sum of all limits – 700m of CPU and 2Gi of RAM.
- Default limits for containers in ns:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: nxs-limit-per-container
spec:
  limits:
    - type: Container
      defaultRequest:
        cpu: 100m
        memory: 1Gi
      default:
        cpu: 1
        memory: 2Gi
      min:
        cpu: 50m
        memory: 500Mi
      max:
        cpu: 2
        memory: 4Gi
```
This means that all containers in the default namespace will by default get a request of 100m CPU and 1Gi RAM and a limit of 1 CPU and 2Gi RAM. It also restricts the possible request/limit values for CPU (50m ≤ x ≤ 2) and RAM (500Mi ≤ x ≤ 4Gi).
- Restrictions at the pod level in a ns:
```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: nxs-limit-pod
spec:
  limits:
    - type: Pod
      max:
        cpu: 4
        memory: 1Gi
```
I.e. for every pod in the default ns a limit of 4 vCPU and 1Gi of RAM will be set.
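Once these objects are applied, you can check what is actually in effect in the namespace (the object names below match the manifests above):

```bash
# Show the quota and how much of it is already used
kubectl describe quota nxs-test -n default

# Show the default/min/max values enforced by the LimitRange objects
kubectl describe limitrange nxs-limit-per-container nxs-limit-pod -n default
```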
Now let us look at the advantages that setting these restrictions can give us.
Load balancing mechanism between nodes
As you know, the scheduler component of k8s is responsible for distributing pods among nodes. It operates according to a certain algorithm that goes through two stages when selecting the optimal node to run a pod:
- Filtering
- Ranking
I.e. according to the described policy, nodes on which the pod can be run are first selected based on a set of predicates (including a check that the node has enough free resources to run the pod – PodFitsResources). Then each of these nodes is scored according to priorities (among others, the more free resources a node has, the more points it gets – LeastResourceAllocation/LeastRequestedPriority/BalancedResourceAllocation), and the pod is launched on the node with the highest score (if several nodes share it, a random one is selected).
You should understand that the scheduler evaluates the available resources of a node based on the data stored in etcd – i.e. the sum of requests/limits of every pod running on that node, not the actual resource consumption. This information can be obtained from the output of the kubectl describe node $NODE command, for example:
```
# kubectl describe nodes nxs-k8s-s1
..
Non-terminated Pods:         (9 in total)
  Namespace        Name                                        CPU Requests  CPU Limits  Memory Requests  Memory Limits  AGE
  ---------        ----                                        ------------  ----------  ---------------  -------------  ---
  ingress-nginx    nginx-ingress-controller-754b85bf44-qkt2t   0 (0%)        0 (0%)      0 (0%)           0 (0%)         233d
  kube-system      kube-flannel-26bl4                          150m (0%)     300m (1%)   64M (0%)         500M (1%)      233d
  kube-system      kube-proxy-exporter-cb629                   0 (0%)        0 (0%)      0 (0%)           0 (0%)         233d
  kube-system      kube-proxy-x9fsc                            0 (0%)        0 (0%)      0 (0%)           0 (0%)         233d
  kube-system      nginx-proxy-k8s-worker-s1                   25m (0%)      300m (1%)   32M (0%)         512M (1%)      233d
  nxs-monitoring   alertmanager-main-1                         100m (0%)     100m (0%)   425Mi (1%)       25Mi (0%)      233d
  nxs-logging      filebeat-lmsmp                              100m (0%)     0 (0%)      100Mi (0%)       200Mi (0%)     233d
  nxs-monitoring   node-exporter-v4gdq                         112m (0%)     122m (0%)   200Mi (0%)       220Mi (0%)     233d
Allocated resources:
  (Total limits may be over 100 percent, i.e., overcommitted.)
  Resource           Requests           Limits
  --------           --------           ------
  cpu                487m (3%)          822m (5%)
  memory             15856217600 (2%)   749976320 (3%)
  ephemeral-storage  0 (0%)             0 (0%)
```
Here we can see all the pods running on a particular node, as well as the resources each of them requests. When a new pod is scheduled, the scheduler first filters the nodes and generates a list of those on which the pod can be run (in this example, 3 nodes: nxs-k8s-s8, nxs-k8s-s9, nxs-k8s-s10). It then scores each of these nodes on several parameters (including BalancedResourceAllocation and LeastResourceAllocation) to determine the most suitable one. In the end, the pod is scheduled on the node with the highest score (here two nodes share the top score of 100037, so a random one of them is chosen – nxs-k8s-s10).
Conclusion: if a node is running pods for which no limits are set, then for k8s (from the point of view of resource consumption) it is as if there were no such pods on that node at all. So if you have, say, a pod with a voracious process (e.g. wowza) and no limits are set for it, you may end up in a situation where the pod has actually eaten all the resources of the node, but k8s considers that node unloaded and, during ranking, awards it the same number of points for available resources as a node with no working pods. This can eventually lead to an uneven distribution of load between the nodes.
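One way to avoid this skew is to give such a workload explicit requests and limits, so the scheduler actually accounts for it. A minimal sketch (the deployment name, image reference and values here are hypothetical and should be adjusted to the real appetite of the application):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wowza                       # hypothetical name for the resource-hungry app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: wowza
  template:
    metadata:
      labels:
        app: wowza
    spec:
      containers:
        - name: wowza
          image: example.registry/wowza:latest   # placeholder image reference
          resources:
            requests:               # what the scheduler will reserve on the node
              cpu: "2"
              memory: 4Gi
            limits:                 # hard cap so the process cannot eat the whole node
              cpu: "4"
              memory: 6Gi
```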
Pod eviction
As you know, each pod is assigned one of 3 QoS classes (a way to check the assigned class is shown right after the list):
- Guaranteed – assigned when every container in the pod has both a request and a limit for memory and CPU, and these values match
- Burstable – the pod does not meet the Guaranteed criteria, but at least one of its containers has a request or limit set
- BestEffort – none of the containers in the pod has any resource requests or limits set
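The class that was actually assigned to a pod is recorded in its status and can be checked with kubectl (the pod name here is hypothetical):

```bash
# Prints Guaranteed, Burstable or BestEffort
kubectl get pod app-nginx -o jsonpath='{.status.qosClass}'
```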
Whenever a node runs out of resources (disk, memory), kubelet starts ranking and evicting pods according to a certain algorithm that takes into account the priority of the pod and its QoS class. For example, if we are talking about RAM, then based on the QoS class points are awarded according to the following principle:
- Guaranteed: -998
- BestEffort: 1000
- Burstable: min(max(2, 1000 – (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)
I.e. given the same priority, kubelet will first evict pods with the BestEffort QoS class.
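To make the Burstable formula above more tangible, here is a worked example with assumed numbers: a container that requests 2Gi of memory on a node with 8Gi of capacity.

```
memoryRequestBytes         = 2 * 1024^3 = 2147483648
machineMemoryCapacityBytes = 8 * 1024^3 = 8589934592

1000 - (1000 * 2147483648) / 8589934592 = 1000 - 250 = 750
min(max(2, 750), 999) = 750
```

So the larger the share of node memory a Burstable pod requests, the lower its score and the closer it sits to the protected Guaranteed end of the scale, while pods with small or no requests drift toward the BestEffort end and get evicted earlier.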
Conclusion: if you want to reduce the probability of a particular pod being evicted from a node when it runs out of resources, then along with the priority you should also take care of setting the request/limit for that pod.
Horizontal pod autoscaling (HPA)
When you want to automatically scale pods up or down depending on resource usage (CPU/RAM or user rps), a k8s entity such as HPA (Horizontal Pod Autoscaler) can help. Its algorithm is as follows:
- The current value of the observed metric is determined (currentMetricValue)
- The desired value of the metric is determined (desiredMetricValue); for system resources it is set relative to the container requests
- The current number of replicas is determined (currentReplicas)
- The desired number of replicas (desiredReplicas) is calculated with the following formula (the result is rounded up):
- desiredReplicas = ceil[ currentReplicas * ( currentMetricValue / desiredMetricValue ) ]
No scaling occurs while the ratio (currentMetricValue / desiredMetricValue) is close to 1 (the tolerance can be configured; the default is 0.1).
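A minimal sketch of an HPA manifest that scales on CPU utilization relative to the container requests (the target deployment name and the thresholds are hypothetical):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: app-nginx-hpa            # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: app-nginx              # hypothetical target deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # % of the requested CPU
```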
Horizontal node autoscaling – Cluster Autoscaler (CA)
Having a configured HPA alone is not always enough to compensate for the negative impact of load spikes on the system. For example, the controller manager decides, according to the HPA settings, that the number of replicas should be doubled, but the nodes have no free resources to run that many pods (i.e. a node cannot provide the requested resources), and these pods go into the Pending state.
In this case, if the provider has a suitable IaaS/PaaS offering (e.g. GKE/GCE, AKS, EKS, etc.), a tool such as Cluster Autoscaler can help us. It allows you to set the maximum and minimum number of nodes in the cluster and automatically adjust the current number of nodes (by calling the cloud provider's API to order/remove nodes) when the cluster lacks resources and pods cannot be scheduled (i.e. are in the Pending state).
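Pods that cannot be scheduled (and therefore trigger a scale-up) are easy to spot, for example:

```bash
# List pods that are stuck in Pending across all namespaces
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# The Events section of describe usually explains why scheduling failed
kubectl describe pod <pending-pod-name>
```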
Conclusion: to be able to autoscale nodes, you need to specify requests in pod containers, so that k8s can correctly estimate the load on the nodes and report that there are no resources left in the cluster to start the next pod.
Summary
It should be noted that setting container resource limits is not a prerequisite for successfully running an application, but it is still better to do so for the following reasons:
- To make the scheduler work more accurately in terms of load balancing between k8s nodes
- To reduce the probability of a “pod eviction” event
- For horizontal pod autoscaling (HPA) to work
- For horizontal node autoscaling (Cluster Autoscaler) at cloud providers to work