This article was last updated on January 29, 2025, to include advanced techniques for configuring and optimizing Kubernetes HPA, such as custom metrics integration, stabilization windows, and rate limiting, along with real-world examples and simplified explanations to enhance clarity.
What is Kubernetes HPA?
Imagine operating a restaurant: during lunchtime, you'd want more workers to help deal with the rush of hungry customers, while during slow times, it would be a waste to have workers doing nothing. This is the exact challenge we must address in the cloud world, and this is where Kubernetes Horizontal Pod Autoscaler, or HPA for short, saves the day.
Having worked with Kubernetes for a couple of years now, I can say that mastering HPA has saved me hours of manual scaling work. In this post, I will explain, in simple terms, how HPA works and how you can use it effectively.
Try Our Interactive HPA Simulator
To make this practical, let's start with an example. The interactive simulator below shows, in real time, exactly how HPA behaves. Try adjusting the CPU load and watch HPA scale your pods up or down:
HPA Simulator
See how Kubernetes HPA scales your pods based on CPU utilization.
The above simulator demonstrates a few key points:
- Real-time Scaling: See HPA in action, scaling the number of pods based on CPU load
- Target Utilization: The green dashed line represents your target CPU utilization (70%)
- Stabilization: Notice how scaling does not kick in right away, which prevents thrashing
- Boundaries: Observe how the system respects minimum and maximum replica bounds
Try these scenarios:
- Gradually increase CPU load and watch new pods being added
- Briefly spike the load to see how stabilization prevents immediate scaling
- Release the load and watch the system scale down gracefully
How Does Kubernetes HPA Work? A Step-by-Step Guide
Think of HPA as the automated restaurant manager. Much like a good manager would look at how busy the restaurant is and adjust staff accordingly, HPA looks at the resource usage of your application and automatically changes the number of running instances (pods).
I still remember the first HPA we put in place at my previous company. Before that, the team was constantly firefighting capacity issues during peak hours. Once HPA was set up, operations effectively ran themselves around the clock: the system scaled up when traffic got heavy and scaled back down during quiet periods.
How this works in practice:
- HPA continuously monitors your application's metrics, for example CPU consumption
- When a metric rises above your configured threshold, it adds more pods
- When usage drops, it removes surplus pods to save resources
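If you just want to try this quickly, the same basic behavior can be created imperatively with kubectl autoscale; the deployment name and thresholds below are placeholders, not values from this article:

kubectl autoscale deployment my-app-deployment --cpu-percent=70 --min=2 --max=10
kubectl get hpa --watch

The first command creates an HPA targeting roughly 70% CPU with 2 to 10 replicas; the second lets you watch it react as load changes.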
Implementing Kubernetes HPA: Real-World Examples
Let me relate this to the real world. Last year, I was working on an e-commerce platform that saw huge traffic spikes during flash sales. Before HPA, we would scale up the cluster manually ahead of each sale and almost always got it wrong: either over-provisioning or, worse, ending up with too little capacity.
Here is a basic HPA I used, which saved the day:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app-deployment
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
This tells Kubernetes: keep enough pods running to hold average CPU utilization at about 60%, but never go below 2 pods (for high availability) or above 10 pods (to control costs). In effect, it puts guardrails around your scaling decisions.
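Once the manifest is applied, you can watch the autoscaler's decisions in real time; the file name here is just an example:

kubectl apply -f my-app-hpa.yaml
kubectl get hpa my-app-hpa --watch

The output shows current versus target utilization and the live replica count, which makes it easy to confirm the guardrails are doing their job.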
Advanced HPA Configuration: CPU, Memory, and Custom Metrics
Now, let's get a bit more sophisticated: while CPU and memory are the most common metrics, sometimes you need to scale based on business-specific metrics. As a matter of fact, I once set up HPA to scale based on the number of messages in a RabbitMQ queue.
Here's how you could combine both resource and custom metrics:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-custom-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: External
      external:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: 100
Consider this as having a number of sensors in your application. In the same way a smart home system adjusts to temperature, humidity, and motion sensors, your application can now scale in response to multiple signals. When several metrics are defined, HPA computes a desired replica count for each one and uses the largest.
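The heading above also mentions memory; although the example does not show it, a memory target uses exactly the same shape as the CPU one. The 80% figure below is purely illustrative, and utilization targets only work if the pods declare memory requests:

- type: Resource
  resource:
    name: memory
    target:
      type: Utilization
      averageUtilization: 80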
Understanding the HPA Scaling Algorithm: Simple Explanation
The scaling algorithm may sound complicated, but I like to explain it with a simple analogy. Suppose you are managing a team of workers:
- If one person is doing the work of two (200% utilization), you hire one more
- If two people are doing the work of one (50% utilization), you can reduce the team size
The actual formula is:
desiredReplicas = ceil[currentReplicas * (currentMetricValue / desiredMetricValue)]
For example, suppose you have 1 replica, the current CPU utilization is 200m, and the target is 100m. The formula gives ceil[1 * (200 / 100)] = 2, so HPA scales up to 2 replicas. If utilization later falls back to 50m, the same formula roughly halves the replica count. The HPA controller also checks pod readiness and the availability of metric data: pods that are still starting up, or whose metrics are missing, are left out of the calculation.
This avoids aggressive scaling while the environment is still unpredictable at startup, and lets Kubernetes balance resource demand, dynamic workloads, and high availability of the service.
Best Practices for HPA and Rolling Updates in Kubernetes
Using HPA alongside rolling updates is like changing tires on a car that is still moving: you had better be careful! I learned that the hard way when a poorly configured update dropped requests on our service. Here's how to do it properly:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
Here, HPA keeps pod CPU utilization at around 70%. If increased load during a rolling update pushes HPA to raise the replica count up to the defined maximum, that is perfectly fine: old pods are replaced gradually by new ones, the remaining pods keep handling traffic, and potential service interruption is kept to a minimum.
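On the Deployment side, it helps to pair this with a conservative rolling-update strategy and to let HPA own the replica count (avoid hard-coding spec.replicas in a manifest you keep re-applying, or each apply will fight the autoscaler). A minimal sketch, with the names, image, and surge settings as assumptions rather than values from this article:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app
spec:
  # replicas intentionally omitted: HPA manages the count
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # bring up one extra pod at a time
      maxUnavailable: 0  # never drop below current capacity
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
        - name: my-app
          image: my-app:latest
          resources:
            requests:
              cpu: 250m  # CPU requests are required for Utilization targets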
Advanced HPA Configuration: Performance Tuning Guide
This is where we can fine-tune HPA to be less jumpy in its decisions. I often compare it to cruise control in a car: you don't want it to slam on the brakes or floor the accelerator with every minor fluctuation in speed.
Optimize Scale-Up and Scale-Down Behaviors
HPA lets you get specific about scale-up and scale-down behavior. This is useful not only for accommodating changes in load, but also for preventing sharp swings in the number of pods. You control this through the behavior field in the HPA configuration.
Configure HPA Stabilization Windows
The stabilization window matters because it prevents the autoscaler from making rapid changes that would destabilize the workload. It ensures scale-down operations do not start too quickly and terminate pods that may still be needed a short while later.
Example: adding a stabilization window to your HPA:
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300
In this example, the HPA considers a 5-minute window before deciding whether a scale-down is necessary, which prevents sudden large drops in the number of available pods. (300 seconds is also the default for scale-down; scale-up has no stabilization window by default.)
Set Up HPA Rate Limits for Better Performance
Besides stabilization windows, you can set rate limits on how fast your HPA is allowed to scale up or down. This is useful when you want more fine-grained control over how quickly the replica count can change.
Here is how you might configure a rate limit to scale down no more than 4 pods per minute:
behavior:
  scaleDown:
    policies:
      - type: Pods
        value: 4
        periodSeconds: 60
This setting means that the HPA is allowed to remove a maximum of 4 pods within a 1-minute period. Similarly, you could restrict scale-up operations.
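For scale-up, you can go further and combine several policies, letting selectPolicy decide which one applies; the numbers below are illustrative, not recommendations:

behavior:
  scaleUp:
    policies:
      - type: Percent
        value: 100       # at most double the replicas per minute...
        periodSeconds: 60
      - type: Pods
        value: 4         # ...or add at most 4 pods per minute
    selectPolicy: Max    # use whichever policy allows the larger change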
Complete HPA Configuration Example with Best Practices
Putting it together, with both a stabilization window and a rate limit, you can configure your HPA like this:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-app-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-app
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
This setup not only defines how HPA scales but also keeps your application's pod lifecycle more tightly managed under fluctuating load.
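Once a configuration like this is live, the quickest way to understand why HPA did or did not scale is to inspect its events and conditions; the HPA name is the one defined above:

kubectl describe hpa my-app-hpa

The output includes the current metrics, recent scaling events, and conditions such as AbleToScale, ScalingActive, and ScalingLimited.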
Implementing HPA with Redis Queue Metrics: A Real-World Example
Let me share a practical example from my own experience. I once had to build a system that scaled the number of workers based on the length of a Redis queue, a very common requirement whenever your architecture involves message queues or batch jobs.
Setting Up the Custom Metrics Pipeline
First, we had to get the metrics pipeline up and running. Here's how we did it:
1. Install the Prometheus Adapter (a Helm-based alternative is shown after these steps):
apiVersion: apps/v1
kind: Deployment
metadata:
  name: prometheus-adapter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: prometheus-adapter
  template:
    metadata:
      labels:
        app: prometheus-adapter
    spec:
      containers:
        - name: prometheus-adapter
          image: k8s.gcr.io/prometheus-adapter/prometheus-adapter:v0.9.1
          args:
            - --cert-dir=/var/run/serving-cert
            - --config=/etc/adapter/config.yaml
            - --logtostderr=true
            - --metrics-relist-interval=30s
2. Configure the adapter to collect Redis metrics:
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-adapter-config
  namespace: monitoring
data:
  config.yaml: |
    rules:
      - seriesQuery: 'redis_queue_length{queue="my_queue"}'
        resources:
          overrides:
            namespace:
              resource: namespace
            pod:
              resource: pod
        name:
          matches: "redis_queue_length"
          as: "redis_queue_size"
        metricsQuery: 'redis_queue_length{queue="my_queue"}'
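As mentioned in step 1, many teams skip the hand-written Deployment and install the adapter from the prometheus-community Helm chart instead. This is only a sketch; your release name, namespace, and Prometheus settings will differ:

helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus-adapter prometheus-community/prometheus-adapter --namespace monitoring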
Creating the HPA with Redis Metrics
Now we can create an HPA that scales based on queue length:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: queue-processor-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-processor
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: redis_queue_size
          selector:
            matchLabels:
              queue: my_queue
        target:
          type: AverageValue
          averageValue: 100
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 4
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Pods
          value: 2
          periodSeconds: 60
This configures Kubernetes to:
- Scale up when there are more than 100 messages per worker
- When scaling up, add a maximum of 4 pods per minute
- Remove up to 2 pods every minute when scaling down
- Wait 5 minutes before scaling down to prevent thrashing
Monitoring and Debugging Custom Metrics
To test that your custom metrics are working, you can use these commands:
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1" | jq .
kubectl get --raw "/apis/custom.metrics.k8s.io/v1beta1/namespaces/default/pods/*/redis_queue_size" | jq .
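Because the HPA above uses a metric of type External, the HPA itself reads the value from the external metrics API rather than the custom metrics API, so it is worth checking that endpoint too (whether the metric appears there depends on whether the adapter exposes it via rules or externalRules). The namespace and label selector are the ones assumed in this example:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .
kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1/namespaces/default/redis_queue_size?labelSelector=queue%3Dmy_queue" | jq .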
From my experience, the following tips are key:
1. Metric Reliability: Make your metrics pipeline highly available. If metrics become unavailable, the HPA will not be able to make scaling decisions.
2. Correct Thresholds: Be conservative with target values. The first threshold on our Redis queue was set a bit too low (50 messages), which triggered scaling far too frequently.
3. Stabilization Windows: Queue lengths can be spiky. Apply appropriate stabilization windows to avoid rapid scaling changes:
- A shorter window (for example, 60 seconds) for scale-up, so spikes in traffic are handled quickly
- A longer window (for example, 300 seconds) for scale-down, so pods are not killed prematurely
4. Resource Correlation: Keep in mind that custom metrics, by default, do not take into account CPU/Memory. You may want to combine them:
metrics:
  - type: External
    external:
      metric:
        name: redis_queue_size
        # ... as above ...
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
This scales your pods on queue length as well as CPU utilization, which gives better resource efficiency.
Conclusion
After implementing HPA in dozens of production environments, I can confidently say it is one of the most powerful features of Kubernetes. Tuned correctly, it gives you an excellent balance between performance and cost efficiency.
Remember: start with the simplest configuration, observe its behavior, and then tune it little by little to fit your application's needs. And most importantly, always test your HPA settings under realistic conditions before going to production.