Leveraging Kubernetes HPA and Prometheus Adapter
Learn how to scale Kubernetes deployments using Prometheus metrics with the Horizontal Pod Autoscaler (HPA)
Published on: May 15, 2025
Last updated on: May 16, 2025

This blog is part of our Horizontal Pod Autoscaling series; we recommend reading the rest of the posts in the series:
- Introduction to Horizontal Pod Autoscaling in Kubernetes
- How to use Custom & External Metrics for Kubernetes HPA
- Set up Kubernetes scaling via Prometheus & Custom Metrics
- Leveraging Kubernetes HPA and Prometheus Adapter
Go beyond CPU and Memory: Autoscale Your Workloads with Meaningful, Application-Level Metrics
Kubernetes Horizontal Pod Autoscaler (HPA) is one of those quietly brilliant features, often underused, but essential once you’ve experienced its full potential. At a glance, it seems simple: it automatically adjusts the number of pods in a deployment based on resource usage. Most commonly, it works with CPU or memory.
But here's the thing: CPU and memory rarely tell the full story, especially in modern, cloud-native applications. If your backend processes jobs from a queue, or you handle varying workloads like video uploads or payment processing, those resource metrics can be misleading, or worse, dangerously slow to react.
In this post, we will explore how to scale more intelligently using custom and external metrics via Prometheus and Prometheus Adapter. You’ll learn how to scale based on signals that actually matter to your users, such as queue depth, request rate, or even metrics from third-party services like AWS SQS.
A Brief Refresher: Custom vs External Metrics
Before we dive into implementation, let's set the scene. Kubernetes HPA supports three types of metrics (sketched briefly after this list):
- Resource metrics: such as CPU and memory usage. These are the default and come from the Metrics Server.
- Custom metrics: internal metrics from your application, exposed via Prometheus and surfaced to the HPA through the Custom Metrics API.
- External metrics: signals from outside your Kubernetes cluster, such as cloud queues or payment systems, integrated via the External Metrics API.
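To make the distinction concrete, here is a minimal sketch of how each type appears under spec.metrics in an autoscaling/v2 HPA. The metric names http_requests_per_second and queue_messages_ready are hypothetical placeholders, not metrics defined in this post:

metrics:
  - type: Resource              # resource metric from the Metrics Server
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70
  - type: Pods                  # custom metric served by the Custom Metrics API
    pods:
      metric:
        name: http_requests_per_second
      target:
        type: AverageValue
        averageValue: "100"
  - type: External              # external metric served by the External Metrics API
    external:
      metric:
        name: queue_messages_ready
      target:
        type: Value
        value: "500"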
In this article, we will be focusing on the latter two: custom and external metrics. If you’d like a deeper dive into how these APIs work behind the scenes, we’ve covered them previously in this Horizontal Pod Autoscaling series.
For now, let's look at how you can make use of them in practice.
Example 1: Using Custom Metrics to Scale on Internal Queue Length
Imagine you have a microservice that consumes jobs from a queue. You probably already have a metric that tracks queue depth, so why not scale on that?
Let's say you're using Python. With the prometheus_client library, exposing a custom metric is straightforward. Here's a small example:
from prometheus_client import start_http_server, Gauge
import random
import time

# Gauge tracking the current number of jobs waiting in the queue
queue_length = Gauge('custom_queue_length', 'Length of job queue')

def get_queue_length():
    # Stand-in for a real lookup against Redis, RabbitMQ, or another broker
    return random.randint(0, 100)

if __name__ == '__main__':
    # Serve metrics at http://<pod-ip>:8000/metrics
    start_http_server(8000)
    while True:
        queue_length.set(get_queue_length())
        time.sleep(5)
This script exposes the metric custom_queue_length on port 8000, updating it every 5 seconds. In reality, you'd fetch the real value from Redis, RabbitMQ, or your job broker of choice.
Once your application exposes metrics, you need to ensure Prometheus knows where to find them. Assuming you're using kube-prometheus-stack, you can configure a ServiceMonitor like so:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: queue-worker-monitor
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app: queue-worker
  namespaceSelector:
    matchNames:
      - default
  endpoints:
    - port: metrics
      interval: 15s
This tells Prometheus to scrape metrics from your service every 15 seconds. Be sure your deployment and service expose the metrics endpoint correctly.
Make sure your Prometheus instance is configured to select ServiceMonitors with the release: prometheus label (this is the default in kube-prometheus-stack).
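For reference, here is a minimal sketch of a Service that the ServiceMonitor above would select, assuming the worker container serves metrics on port 8000 as in the Python example. The names and namespace are illustrative, not taken from a real deployment:

apiVersion: v1
kind: Service
metadata:
  name: queue-worker
  namespace: default
  labels:
    app: queue-worker        # matched by the ServiceMonitor's selector
spec:
  selector:
    app: queue-worker        # must match your Deployment's pod labels
  ports:
    - name: metrics          # must match the ServiceMonitor's endpoint port name
      port: 8000
      targetPort: 8000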
Prometheus Adapter acts as a bridge between Prometheus and the Kubernetes metrics APIs. You’ll need to add a rule so it knows how to expose your metric:
rules:
  custom:
    - seriesQuery: 'custom_queue_length'
      resources:
        overrides:
          namespace:
            resource: namespace
          pod:
            resource: pod
      name:
        matches: "custom_queue_length"
        as: "queue_length"
      metricsQuery: 'avg(custom_queue_length) by (namespace)'
This maps your raw Prometheus metric (custom_queue_length) to a Kubernetes-compatible custom metric (queue_length), which the HPA can use.
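If you installed Prometheus Adapter via its Helm chart, this rule typically sits in the chart's values next to the Prometheus connection settings. A hedged values.yaml sketch, where the Prometheus URL is an assumption that depends on how kube-prometheus-stack was installed:

prometheus:
  url: http://prometheus-operated.monitoring.svc   # assumption: adjust to your Prometheus Service
  port: 9090
rules:
  custom:
    - seriesQuery: 'custom_queue_length'
      # ...the rest of the rule shown above...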
Now that the metric is exposed, you can plug it into a HorizontalPodAutoscaler definition:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: worker-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: queue-worker
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Pods
      pods:
        metric:
          name: queue_length
        target:
          type: AverageValue
          averageValue: "30"
This HPA will scale your worker pods to maintain an average queue length of 30 jobs per pod, providing much more relevant responsiveness than CPU usage alone.
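Under the hood, the HPA uses its usual ratio calculation: desiredReplicas = ceil(currentReplicas × currentMetricValue ÷ targetValue). For example, with 4 workers and an average queue_length of 60, the HPA would request ceil(4 × 60 / 30) = 8 replicas.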
Example 2: Using External Metrics to Scale on AWS SQS Depth
Sometimes the metric you care about lives outside Kubernetes. Say you're using AWS SQS for job queuing: how do you scale based on the number of messages in that queue?
Enter external metrics and tools like YACE (Yet Another CloudWatch Exporter). YACE pulls metrics from AWS and exposes them in Prometheus format. As with Prometheus Adapter and Prometheus itself, YACE can be installed using an official Helm chart.
To track SQS queue depth, you might configure it like this:
config: |-
  apiVersion: v1alpha1
  sts-region: eu-west-2
  discovery:
    exportedTagsOnMetrics:
      AWS/SQS:
        - Name
    jobs:
      - type: AWS/SQS
        regions:
          - eu-west-2
        searchTags:
          - key: Environment
            value: production
        metrics:
          - name: ApproximateNumberOfMessagesVisible
            statistics:
              - Average
            period: 60
This setup pulls queue length from any SQS queues tagged for your production environment. Once deployed in your cluster, YACE exposes those metrics on a Prometheus-compatible endpoint.
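As in Example 1, Prometheus still needs to scrape YACE itself. A ServiceMonitor along the following lines works; the label selector, namespace, and port name here are assumptions that depend on how the YACE chart names its Service, so check them against your install:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: yace-monitor
  labels:
    release: prometheus
spec:
  selector:
    matchLabels:
      app.kubernetes.io/name: yet-another-cloudwatch-exporter   # assumption: verify your chart's Service labels
  namespaceSelector:
    matchNames:
      - monitoring
  endpoints:
    - port: http          # assumption: verify the Service's port name
      interval: 60s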
Next, define a rule to surface the metric to the External Metrics API:
rules:
  external:
    - seriesQuery: 'aws_sqs_approximate_number_of_messages_visible_average{queue_name="video-processing"}'
      name:
        matches: ".*"
        as: "external_queue_depth"
      metricsQuery: 'avg(aws_sqs_approximate_number_of_messages_visible_average{queue_name="video-processing"})'
This maps the SQS queue size to a Kubernetes-compatible external metric (external_queue_depth), which the HPA can use.
This rule belongs in the Prometheus Adapter configuration, typically under the rules.external section of its Helm chart values, which the chart renders into the adapter's ConfigMap.
Now you can define an autoscaler that responds to that external queue size:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: transcoder-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: transcoder
  minReplicas: 1
  maxReplicas: 20
  metrics:
    - type: External
      external:
        metric:
          name: external_queue_depth
        target:
          type: Value
          value: "50"
This HPA will scale the transcoder pods until the SQS queue depth drops below 50 messages.
Combining Custom and External Metrics for Smarter Scaling
In real-world systems, relying on a single metric for scaling can be too simplistic. For example, queue length might indicate backlog, but request rate helps contextualise how fast work is arriving. By combining both, you can trigger scaling only when load is high and demand is sustained.
Kubernetes allows you to define multiple metrics in the HPA spec. The deployment will scale based on whichever metric requires the highest number of replicas. In addition, you can define both metric types in a single autoscaler, allowing Kubernetes to react to internal signals and external demand.
Here's an updated HPA definition that scales based on both queue_length and external_queue_depth:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: transcoder-autoscaler
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: transcoder
  minReplicas: 2
  maxReplicas: 25
  metrics:
    - type: Pods
      pods:
        metric:
          name: queue_length
        target:
          type: AverageValue
          averageValue: "30"
    - type: External
      external:
        metric:
          name: external_queue_depth
        target:
          type: Value
          value: "50"
Kubernetes will evaluate both metrics and scale based on whichever one requires more replicas. This helps you respond more intelligently to different kinds of pressure, whether it originates inside your cluster or out in the cloud.
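For example, if the current queue_length calls for 6 replicas but external_queue_depth calls for 9, the HPA sets the deployment to 9.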
This is especially powerful when:
- Internal metrics may be delayed (such as queue size not rising fast enough)
- External systems signal future load (such as SQS buildup)
- You want more stable, responsive scaling during traffic spikes (see the behavior sketch after this list)
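On that last point, the autoscaling/v2 behavior field gives you finer control over how quickly the HPA scales up and down. A hedged sketch with purely illustrative values, to be tuned for your own workload:

# Nested under spec: in the HPA definition
behavior:
  scaleUp:
    stabilizationWindowSeconds: 0      # react to spikes immediately
    policies:
      - type: Percent
        value: 100                     # at most double the replica count per minute
        periodSeconds: 60
  scaleDown:
    stabilizationWindowSeconds: 300    # wait 5 minutes before scaling down
    policies:
      - type: Pods
        value: 2                       # remove at most 2 pods per minute
        periodSeconds: 60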
Closing Thoughts
Autoscaling with custom and external metrics opens up a whole new level of responsiveness and efficiency in Kubernetes. Rather than relying on blunt signals like CPU, you can scale your systems based on real indicators of demand, whether those come from inside your app or the wider ecosystem it operates in.
It does require some additional setup: exposing metrics, configuring Prometheus Adapter, and tuning your autoscalers. But the benefits are well worth the investment.
By aligning autoscaling with business logic and actual user behaviour, you’ll end up with systems that:
- Scale faster when they’re needed most
- Stay lean when traffic dips
- Deliver better performance without unnecessary cost
If you’re looking to improve resilience, optimise cost, or build systems that scale organically with user demand, custom & external metrics are an essential part of your Kubernetes toolkit.