This article was last updated on January 29, 2025, to include advanced techniques for configuring and optimizing Kubernetes HPA, such as custom metrics integration, stabilization windows, and rate limiting, along with real-world examples and simplified explanations to enhance clarity.
What is Kubernetes HPA?
Imagine operating a restaurant: during lunchtime, you'd want more workers to help deal with the rush of hungry customers, while during slow times, it would be a waste to have workers doing nothing. This is the exact challenge we must address in the cloud world, and this is where Kubernetes Horizontal Pod Autoscaler, or HPA for short, saves the day.
Having worked with Kubernetes for a couple of years now, I can say that mastering HPA has saved me hours of manual scaling work. In this post, I will explain, in simple terms, how HPA works and how you can use it effectively.
Try Our Interactive HPA Simulator
To make this practical, let's start with an example. The interactive simulator below shows, in real time, exactly how HPA behaves. Try adjusting the CPU load and watch HPA scale your pods up or down:
HPA Simulator
See how Kubernetes HPA scales your pods based on CPU utilization.
The above simulator demonstrates a few key points:
- Real-time Scaling: See HPA in action, scaling the number of pods based on CPU load
- Target Utilization: The green dashed line represents your target CPU utilization (70%)
- Stabilization: Notice how scaling does not kick in right away, which prevents thrashing
- Boundaries: Observe how the system respects minimum and maximum replica bounds
Try these scenarios:
- Gradually increase CPU load and watch new pods being added
- Briefly spike the load to see how stabilization prevents immediate scaling
- Release the load and watch the system scale down gracefully
How Does Kubernetes HPA Work? A Step-by-Step Guide
Think of HPA as the automated restaurant manager. Much like a good manager would look at how busy the restaurant is and adjust staff accordingly, HPA looks at the resource usage of your application and automatically changes the number of running instances (pods).
I still remember the first HPA we put in place at my previous company. Before that, the team was constantly firefighting capacity issues during peak hours. Once HPA was set up, operations effectively ran themselves around the clock: the system scaled up when traffic got heavy and scaled back down during quiet periods.
How this works in practice:
- HPA continuously monitors your application's metrics, for example CPU consumption
- When a metric rises above your configured threshold, it adds more pods
- When usage drops, it removes surplus pods to save resources
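If you just want to try this quickly, the same basic behavior can be created imperatively with kubectl autoscale; the deployment name and thresholds below are placeholders, not values from this article:

kubectl autoscale deployment my-app-deployment --cpu-percent=70 --min=2 --max=10
kubectl get hpa --watch

The first command creates an HPA targeting roughly 70% CPU with 2 to 10 replicas; the second lets you watch it react as load changes.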
Implementing Kubernetes HPA: Real-World Examples
Let me relate this to the real world. Last year, I was working on an e-commerce platform that saw huge traffic spikes during flash sales. Before HPA, we would scale up the cluster manually ahead of each sale and almost always got it wrong: either over-provisioning or, worse, ending up with too little capacity.
Here is a basic HPA I used, which saved the day:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
This tells Kubernetes: keep enough pods running to hold average CPU utilization at about 60%, but never go below 2 pods (for high availability) or above 10 pods (to control costs). In effect, it puts guardrails around your scaling decisions.
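Once the manifest is applied, you can watch the autoscaler's decisions in real time; the file name here is just an example:

kubectl apply -f my-app-hpa.yaml
kubectl get hpa my-app-hpa --watch

The output shows current versus target utilization and the live replica count, which makes it easy to confirm the guardrails are doing their job.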
Advanced HPA Configuration: CPU, Memory, and Custom Metrics
Now, let's get a bit more sophisticated: while CPU and memory are the most common metrics, sometimes you need to scale based on business-specific metrics. As a matter of fact, I once set up HPA to scale based on the number of messages in a RabbitMQ queue.
Here's how you could combine both resource and custom metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: External
      external:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: 100
Consider this as having a number of sensors in your application. In the same way a smart home system adjusts to temperature, humidity, and motion sensors, your application can now scale in response to multiple signals. When several metrics are defined, HPA computes a desired replica count for each one and uses the largest.
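The heading above also mentions memory; although the example does not show it, a memory target uses exactly the same shape as the CPU one. The 80% figure below is purely illustrative, and utilization targets only work if the pods declare memory requests:

- type: Resource
  resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 80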
Understanding the HPA Scaling Algorithm: Simple Explanation
The scaling algorithm may sound complicated, but I like to explain it with a simple analogy. Suppose you are managing a team of workers:
- If one person is doing the work of two (200% utilization), you hire one more
- If two people are doing the work of one (50% utilization), you can reduce the team size
The actual formula is:
desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
For example, suppose you have 1 replica, the current CPU utilization is 200m, and the target is 100m. The formula gives ceil[1 * (200 / 100)] = 2, so HPA scales up to 2 replicas. If utilization later falls back to 50m, the same formula roughly halves the replica count. The HPA controller also checks pod readiness and the availability of metric data: pods that are still starting up, or whose metrics are missing, are left out of the calculation.
This avoids aggressive scaling while the environment is still unpredictable at startup, and lets Kubernetes balance resource demand, dynamic workloads, and high availability of the service.
Best Practices for HPA and Rolling Updates in Kubernetes
Using HPA alongside rolling updates is like changing tires on a car that is still moving: you had better be careful! I learned that the hard way when a poorly configured update dropped requests on our service. Here's how to do it properly:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Here, HPA keeps pod CPU utilization at around 70%. If increased load during a rolling update pushes HPA to raise the replica count up to the defined maximum, that is perfectly fine: old pods are replaced gradually by new ones, the remaining pods keep handling traffic, and potential service interruption is kept to a minimum.
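On the Deployment side, it helps to pair this with a conservative rolling-update strategy and to let HPA own the replica count (avoid hard-coding spec.replicas in a manifest you keep re-applying, or each apply will fight the autoscaler). A minimal sketch, with the names, image, and surge settings as assumptions rather than values from this article:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  # replicas intentionally omitted: HPA manages the count
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # bring up one extra pod at a time
      maxUnavailable: 0  # never drop below current capacity
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          resources:
            requests:
              cpu: 250m  # CPU requests are required for Utilization targets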
Advanced HPA Configuration: Performance Tuning Guide
This is where we can fine-tune HPA to be less jumpy in its decisions. I often compare it to cruise control in a car: you don't want it to slam on the brakes or floor the accelerator with every minor fluctuation in speed.
Optimize Scale-Up and Scale-Down Behaviors
HPA lets you get specific about scale-up and scale-down behavior. This is useful not only for accommodating changes in load, but also for preventing sharp swings in the number of pods. You control this through the behavior field in the HPA configuration.
Configure HPA Stabilization Windows
The stabilization window matters because it prevents the autoscaler from making rapid changes that would destabilize the workload. It ensures scale-down operations do not start too quickly and terminate pods that may still be needed a short while later.
Example: adding a stabilization window to your HPA:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
In this example, the HPA considers a 5-minute window before deciding whether a scale-down is necessary, which prevents sudden large drops in the number of available pods. (300 seconds is also the default for scale-down; scale-up has no stabilization window by default.)
Set Up HPA Rate Limits for Better Performance
Besides stabilization windows, you can set rate limits on how fast your HPA is allowed to scale up or down. This is useful when you want more fine-grained control over how quickly the replica count can change.
Here is how you might configure a rate limit to scale down no more than 4 pods per minute:
behavior:
  scaleDown:
    policies:
      - type: Pods
        value: 4
        periodSeconds: 60
This setting means that the HPA is allowed to remove a maximum of 4 pods within a 1-minute period. Similarly, you could restrict scale-up operations.
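For scale-up, you can go further and combine several policies, letting selectPolicy decide which one applies; the numbers below are illustrative, not recommendations:

behavior:
  scaleUp:
    policies:
      - type: Percent
        value: 100       # at most double the replicas per minute...
        periodSeconds: 60
      - type: Pods
        value: 4         # ...or add at most 4 pods per minute
    selectPolicy: Max    # use whichever policy allows the larger change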
Complete HPA Configuration Example with Best Practices
Putting it together, with both a stabilization window and a rate limit, you can configure your HPA like this:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
This setup not only defines how HPA scales but also keeps your application's pod lifecycle more tightly managed under fluctuating load.
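Once a configuration like this is live, the quickest way to understand why HPA did or did not scale is to inspect its events and conditions; the HPA name is the one defined above:

kubectl describe hpa my-app-hpa

The output includes the current metrics, recent scaling events, and conditions such as AbleToScale, ScalingActive, and ScalingLimited.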
Implementing HPA with Redis Queue Metrics: A Real-World Example
Let me share a practical example from my own experience. I once had to build a system that scaled the number of workers based on the length of a Redis queue, a very common requirement whenever your architecture involves message queues or batch jobs.
Setting Up the Custom Metrics Pipeline
First, we had to get the metrics pipeline up and running. Here's how we did it:
1. Install the Prometheus Adapter (a Helm-based alternative is shown after these steps):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-adapter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: prometheus-adapter
  template:
    metadata:
      labels:
        app: prometheus-adapter
    spec:
      containers:
        - name: prometheus-adapter
          image: k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1
          args:
            - --cert-dir=/var/run/serving-cert
            - --config=/etc/adapter/config.yaml
            - --logtostderr=true
            - --metrics-relist-interval=30s
2. Configure the adapter to collect Redis metrics:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'redis_queue_length{queue="my_queue"}'
        resources:
          overrides:
            namespace:
              resource: namespace
            pod:
              resource: pod
        name:
          matches: "redis_queue_length"
          as: "redis_queue_size"
        metricsQuery: 'redis_queue_length{queue="my_queue"}'
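As mentioned in step 1, many teams skip the hand-written Deployment and install the adapter from the prometheus-community Helm chart instead. This is only a sketch; your release name, namespace, and Prometheus settings will differ:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-adapter prometheus-community/prometheus-adapter --namespace monitoring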
Creating the HPA with Redis Metrics
Now we can create an HPA that scales based on queue length:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-processor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-processor
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: redis_queue_size
          selector:
            matchLabels:
              queue: my_queue
        target:
          type: AverageValue
          averageValue: 100
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
This configures Kubernetes to:
- Scale up when there are more than 100 messages per worker
- When scaling up, add a maximum of 4 pods per minute
- Remove up to 2 pods every minute when scaling down
- Wait 5 minutes before scaling down to prevent thrashing
Monitoring and Debugging Custom Metrics
To test that your custom metrics are working, you can use these commands:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/redis_queue_size" | jq .
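Because the HPA above uses a metric of type External, the HPA itself reads the value from the external metrics API rather than the custom metrics API, so it is worth checking that endpoint too (whether the metric appears there depends on whether the adapter exposes it via rules or externalRules). The namespace and label selector are the ones assumed in this example:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/redis_queue_size?labelSelector=queue%3Dmy_queue" | jq .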
From my experience, the following tips are key:
1. Metric Reliability: Make your metrics pipeline highly available. If metrics become unavailable, the HPA will not be able to make scaling decisions.
2. Correct Thresholds: Be conservative with target values. The first threshold on our Redis queue was set a bit too low (50 messages), which triggered scaling far too frequently.
3. Stabilization Windows: Queue lengths can be spiky. Apply appropriate stabilization windows to avoid rapid scaling changes:
- A shorter window (for example, 60 seconds) for scale-up, so spikes in traffic are handled quickly
- A longer window (for example, 300 seconds) for scale-down, so pods are not killed prematurely
4. Resource Correlation: Keep in mind that custom metrics, by default, do not take into account CPU/Memory. You may want to combine them:
metrics:
  - type: External
    external:
      metric:
        name: redis_queue_size
        # ... as above ...
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
This scales your pods on queue length as well as CPU utilization, which gives better resource efficiency.
Conclusion
After implementing HPA in dozens of production environments, I can confidently say it is one of the most powerful features of Kubernetes. Tuned correctly, it gives you an excellent balance between performance and cost efficiency.
Remember: start with the simplest configuration, observe its behavior, and then tune it little by little to fit your application's needs. And most importantly, always test your HPA settings under realistic conditions before going to production.