Kubernetes Horizontal Pod Autoscaler

Introduction

Kubernetes is immensely powerful for managing and scaling applications with ease. Among its many features, the Horizontal Pod Autoscaler (HPA) stands out as a key ingredient for maintaining performance under variable load. The HPA automatically scales the number of pods to match demand at any given moment, so resources are used as efficiently as possible.

In this article, we will look in detail at how the HPA works, the mechanisms involved, and how to implement it effectively in your Kubernetes environment.

Steps we will cover in this article:

- How Horizontal Pod Autoscalers Operate
- Implementing Horizontal Scaling with Confidence
- Scaling with Resource and Custom Metrics
- Algorithm Behind the Scaling Decisions
- Handling Rolling Updates with HPAs
- Configuring Custom Scaling Behavior

How Horizontal Pod Autoscalers Operate

Let's look at how a Horizontal Pod Autoscaler in Kubernetes actually works. It is a controller that automatically scales the pod replicas of a deployment based on observed CPU utilization or any other metrics you specify. In other words, it tries to keep the demands placed on the pods in balance with the resources available.

The autoscaler continuously monitors the resource metrics of the target workload, such as a Deployment, and changes the pod replica count accordingly. If the load goes higher, the HPA ramps up the pods to meet that demand; conversely, when the load drops, it scales the pods back down toward the configured minimum. The HPA works against defined targets, such as an average CPU utilization.

Later on, we will define a metric configuration in the HPA's YAML file that triggers scaling activities based on current CPU usage. It is straightforward to set up and keeps workloads efficient: the HPA keeps a close watch on the metric, avoiding wasted resources while optimizing for performance.

Implementing Horizontal Scaling with Confidence

Understanding the distinction between horizontal and vertical scaling is key when deploying applications in Kubernetes. The HPA performs horizontal scaling: it increases the number of instances so that the increased load is distributed across more of them.

Vertical scaling, by contrast, increases the resources allocated to a running pod, such as CPU or memory. Workloads whose demand changes over time are generally better served by horizontal scaling.

Below, we will configure an HPA using a YAML file to target average CPU utilization. Here is a basic setup:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 60

With this setup, the HPA keeps average CPU utilization around 60%, scaling the number of replicas between 2 and 10. This is an effective and easy-to-manage balance between resource usage and application performance.
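
One prerequisite is easy to miss: a Utilization target is computed as a percentage of each container's CPU request, so the target Deployment must declare resource requests. Below is a minimal sketch of such a Deployment; the image name and request values are illustrative, not prescriptive:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app-deployment
spec:
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: my-app:latest   # illustrative image name
        resources:
          requests:
            cpu: 250m          # utilization percentages are computed against this request
            memory: 128Mi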

Scaling with Resource and Custom Metrics

HPAs drive scaling decisions based on both resource metrics and custom metrics. By defining metrics, whether built-in resources such as CPU or user-defined values, you let the HPA adjust the pod count dynamically. This keeps applications responsive to user demand while optimizing resource usage.

Kubernetes lets you use traditional resource metrics such as CPU and memory, or custom and external metrics suited to your application's needs. You define these metrics in the HPA configuration file. Below is how you would set up an HPA YAML file that scales on both resource utilization and an external metric:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: External
    external:
      metric:
        name: requests_per_second
      target:
        type: AverageValue
        averageValue: 100

In this example, the HPA scales the replicas of my-app based on both CPU utilization and an external metric representing requests per second, so the pod count tracks current traffic and usage patterns. Using both resource and custom metrics in your HPA configuration allows for a more responsive and efficient scaling strategy.
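
The External metric above assumes the value comes from a source outside the cluster; per-pod custom metrics use the Pods type instead. Note that both custom and external metrics require a metrics adapter (such as the Prometheus Adapter) to be installed, since the standard metrics server only serves CPU and memory. As a hedged sketch, assuming a per-pod metric named packets_per_second is exposed through such an adapter, the metrics entry would look like this:

metrics:
- type: Pods
  pods:
    metric:
      name: packets_per_second   # assumed per-pod metric served by a metrics adapter
    target:
      type: AverageValue
      averageValue: "1k"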

Algorithm Behind the Scaling Decisions

Diving into the algorithm behind the HPA's scaling logic shows just how dynamically Kubernetes adjusts workloads. At its core, the algorithm compares current metric values against desired ones and resizes the workload based on that ratio, keeping resource management efficient.

When the HPA runs, it queries the current metric values from all active pods. The desired replica count is then computed using the formula below:

desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]

For example, if the current CPU utilization is 200m and the target is 100m, the ratio is 2.0, so the replica count doubles; with a single replica running, the HPA scales up to 2. If utilization later falls to 50m, the ratio is 0.5 and the number of replicas is halved.

The HPA controller also checks pod readiness and the availability of metric data. Pods that are still starting up or are missing metric values are excluded from the scaling calculation. This prevents aggressive scaling while environments are unpredictable during startup.
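
To make the formula concrete, suppose a Deployment is currently running 4 replicas and the measured average CPU usage is 200m against a 100m target:

desiredReplicas = ceil[4 * (200m / 100m)] = ceil[8.0] = 8

The HPA would then request 8 replicas, subject to the configured maxReplicas cap.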

This algorithm lets Kubernetes balance resource demand against dynamic workloads while keeping the service highly available.

Handling Rolling Updates with HPAs

When integrating Horizontal Pod Autoscalers with rolling updates in Kubernetes, the goal is to maintain performance while application versions change. The HPA keeps watching metrics and actively adjusts replica counts during an update, absorbing fluctuating load without disruption.

During a rolling update, Kubernetes gradually replaces old pods with new ones, avoiding downtime. Throughout the process, the HPA continues to regulate the number of replicas: if the load on the application increases mid-update, the HPA scales up the pods according to its configuration.

Conversely, once the update completes and more capacity becomes available, the HPA scales back down, always respecting the minimum replica count. Here is an example of how to configure an HPA to manage a Deployment; it also illustrates how scaling works in the context of a rolling update:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70

With this configuration, the HPA keeps pod CPU utilization around 70%. If load increases during a rolling update, the HPA can raise the replica count up to the defined maximum. Old pods are gradually replaced by new ones while the remaining pods handle traffic seamlessly, minimizing potential service interruption.
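
On the Deployment side, the rolling update strategy controls how pods are swapped out, which interacts with the HPA-managed replica count. One caveat: when an HPA owns a workload, avoid hard-coding spec.replicas in manifests you re-apply, since each apply would override the autoscaler's choice. Here is a sketch of the strategy settings (the values shown are illustrative, not the defaults):

strategy:
  type: RollingUpdate
  rollingUpdate:
    maxSurge: 1           # allow one extra pod above the desired count while updating
    maxUnavailable: 0     # never drop below the desired count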

Configuring Custom Scaling Behavior

In Kubernetes, the Horizontal Pod Autoscaler (HPA) allows you to fine-tune its scaling actions through custom configurations, such as scale-up and scale-down settings. This is particularly useful for ensuring that your applications respond quickly to changes in demand while maintaining stability and resource efficiency.

Custom Scale-Up and Scale-Down Configurations

The HPA lets you specify scale-up and scale-down behaviors precisely, so you can accommodate changes in load while preventing sharp swings in the number of pods. The behavior field in the HPA configuration is what gives you this control.

Stabilization Windows

The stabilization window is important because it prevents the autoscaler from making rapid changes that would lead to instability, often called flapping. The setting ensures scale-down operations do not start too quickly and terminate pods that may still be needed shortly.

Example: adding a stabilization window to your HPA:

behavior:
  scaleDown:
    stabilizationWindowSeconds: 300

In this example, the HPA considers a 5-minute window before deciding whether to scale down, which avoids sudden drops in the number of available pods.

Rate Limits for Scaling

Aside from stabilization windows, you can set rate limits on how fast the HPA is allowed to scale up or down. This is helpful when you want more granular control over how quickly the replica count changes.

Here is how you might configure a rate limit to scale down by no more than 4 pods per minute:

behavior:
  scaleDown:
    policies:
    - type: Pods
      value: 4
      periodSeconds: 60

This setting means the HPA may remove at most 4 pods within any 1-minute period. Scale-up operations can be restricted in a similar way, as shown below.
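
As a sketch of the scale-up direction (the values here are illustrative), this policy lets the HPA at most double the current replica count within any 30-second period:

behavior:
  scaleUp:
    policies:
    - type: Percent
      value: 100          # add at most 100% of the current replicas...
      periodSeconds: 30   # ...within each 30-second period

When several policies are listed, the selectPolicy field decides which one applies; by default, the policy permitting the largest change wins.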

Example YAML Configuration

Combining these, both the stabilization window and the rate limit, you can configure your HPA as follows:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
      - type: Pods
        value: 4
        periodSeconds: 60

This setup not only defines how the HPA scales but also keeps your application's pod lifecycle tightly managed under fluctuating load.

Conclusions

The Kubernetes Horizontal Pod Autoscaler is indispensable for modern cloud-native application management, enabling dynamic scaling driven by real-world demand. A clear understanding of its mechanics, from metric collection to the scaling decision, lets you run applications that meet user needs efficiently and without waste. With the best practices and custom configurations covered above, you can achieve much better resilience and performance in your workloads.