Horizontal Pod Autoscaling in Kubernetes with Prometheus

Louise | 28 May 2019

By default, Kubernetes can perform horizontal autoscaling of pods based on observed CPU utilisation (the average CPU load across all the pods in a deployment). The HPA works as a control loop, like every other controller in the cluster - on each iteration it fetches the metrics for the pods for the period since the last loop and compares this resource utilisation against the baseline specified in the HPA.

Based on the difference between the observed metrics and the average usage baseline, the HPA will either reduce or increase the number of pods in the set, to a level that brings each pod in the set as close to the desired baseline usage as possible. The baseline is calculated as a percentage of the CPU requested in the pod specification (not the limit), for example:

resources:
  requests:
    cpu: 100m
  limits:
    cpu: 200m

This means that if we have a baseline CPU target of 50%, and pods which each request 100m of CPU, we will aim for under 50m usage per pod. Say we have a set of five pods with a combined distributed load of 240m (so 48m CPU per pod), giving 48% average CPU utilisation. If we suddenly receive a spike in requests that pushes the combined load up to 350m, per-pod utilisation shoots up to 70m (or 70%). To bring utilisation back to 50%, the HPA needs to spawn two new pods: it calculates the desired replica count as roughly ceil(currentReplicas × currentUtilisation / targetUtilisation) = ceil(5 × 70 / 50) = 7, at which point the average load falls back to 50m per pod.
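
For comparison with what comes later, scaling on CPU alone needs nothing beyond the built-in resource metrics pipeline. A minimal HPA for the scenario above might look like this sketch (the deployment name and namespace are assumed to match the examples later in this post):

apiVersion: autoscaling/v1
kind: HorizontalPodAutoscaler
metadata:
  name: test
  namespace: dev
spec:
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: test
  minReplicas: 1
  maxReplicas: 10
  targetCPUUtilizationPercentage: 50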

But for most real-life use cases, it doesn’t make sense to scale based on CPU usage - for applications which service requests from users, the traffic load is a much more applicable metric. You may even wish to determine load based on an algorithm that incorporates a number of different metrics. This is where we start to jump into the deep end.

In this blog post, we’ll take a look at how we can perform autoscaling based on metrics collected by Prometheus.

Custom and External Metrics APIs

There are three metrics APIs the HPA uses: metrics.k8s.io, custom.metrics.k8s.io and external.metrics.k8s.io.

  • Resource metrics are served at metrics.k8s.io. This is the use case for our example above. The metrics.k8s.io API relies on metrics being forwarded to it by a metrics aggregator - usually metrics-server, though Heapster can also fill this role (support for it is now deprecated). There’s a quick example of querying this API just after this list.

  • Custom metrics are served at custom.metrics.k8s.io. You’ll use this if you want to scale based on metrics associated with Kubernetes objects like pods or nodes, or metrics that are exposed by your workload on a /metrics endpoint. Getting these metrics into custom.metrics.k8s.io relies on an additional metrics adapter.

  • External metrics are served at external.metrics.k8s.io. You’ll use this if you would like to scale based on metrics in your existing metrics aggregator (like Stackdriver or Prometheus). This API is implemented in the same way as custom.metrics.k8s.io.
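
As promised above, here is what querying the resource metrics API looks like once metrics-server is running - this is the same data that kubectl top and the built-in CPU-based autoscaling consume, and the same kubectl get --raw pattern is used later in this post for the external metrics API:

kubectl get --raw "/apis/metrics.k8s.io/v1beta1/pods" | jq .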

Setting up these two additional metrics APIs requires the kube-apiserver to support the API aggregation layer (see the Kubernetes documentation on configuring the aggregation layer).

--horizontal-pod-autoscaler-use-rest-clients also needs to be enabled on the kube-controller-manager, although this is enabled by default on recent versions of Kubernetes.
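
If you do need to set it explicitly, the exact mechanism depends on how your control plane is deployed; on minikube, for instance, something like the following should work (a sketch, not something you normally need):

minikube start --extra-config=controller-manager.horizontal-pod-autoscaler-use-rest-clients=true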

The custom and external metrics APIs are not available by default, so we need to register these new API servers with the Kubernetes control plane. Kubernetes allows us to do this using APIService objects. Even metrics-server has its own APIService object.
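
We can see it with kubectl (the object name is the same on any cluster where metrics-server is registered):

kubectl get apiservice v1beta1.metrics.k8s.io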

NAME                                   SERVICE                            AVAILABLE   AGE
v1beta1.metrics.k8s.io                 kube-system/metrics-server         True        24h

If we take a look at it, we can see what the function of an APIService is:
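
One way to take that look (output trimmed to the interesting fields):

kubectl get apiservice v1beta1.metrics.k8s.io -o yaml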

apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  labels:
    addonmanager.kubernetes.io/mode: Reconcile
    kubernetes.io/minikube-addons: metrics-server
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  groupPriorityMinimum: 100
  insecureSkipTLSVerify: true
  service:
    name: metrics-server
    namespace: kube-system
  version: v1beta1
  versionPriority: 100

We can see that all requests to metrics.k8s.io/v1beta1 are routed to the metrics-server service in the kube-system namespace, which distributes them to the metrics-server pods - the metrics adapter for this API. The APIService also defines the API group these metrics are served under: metrics.k8s.io.

This means we need a metrics adapter that we can tell the APIService for external.metrics.k8s.io to forward requests to.

A good one to use is kube-metrics-adapter, which most notably implements pod collectors for custom.metrics.k8s.io and Prometheus collectors for external.metrics.k8s.io.

These Prometheus collectors enable us to configure an HPA to fetch metrics from Prometheus using queries written in PromQL, and to autoscale based on the results of those queries. We define Prometheus collectors using annotations on the HPA object, and then provide the name of the Prometheus collector as the desired metric in the HPA specification.

Kube-metrics-adapter includes a control loop that watches the HPA objects on the cluster and creates and deletes Prometheus collectors based on these definitions. This means that, rather than a complete set of metrics being forwarded to the metrics API, only the results of the PromQL queries we’ve configured are streamed.

Here’s an example of an HPA object that defines and uses a Prometheus collector:

apiVersion: autoscaling/v2beta1
kind: HorizontalPodAutoscaler
metadata:
 name: hpa-test
 namespace: dev
 annotations:
   # metric-config.<metricType>.<metricName>.<collectorName>/<configKey>
   # <configKey> == query-name
   metric-config.external.prometheus-query.prometheus/autoregister_queue_latency: autoregister_queue_latency{endpoint="https",job="apiserver",namespace="default",quantile="0.99",service="kubernetes"}  
spec:
 scaleTargetRef:
   apiVersion: extensions/v1beta1
   kind: Deployment
   name: test
 minReplicas: 1
 maxReplicas: 10
 metrics:
 - type: External
   external:
     metricName: prometheus-query
     metricSelector:
       matchLabels:
         query-name: autoregister_queue_latency
     targetAverageValue: 1

Although you will see external metrics in the autoscaling/v2beta1 specification in the Kubernetes docs, note that they are only supported on Kubernetes 1.10.0 and later: external metrics aren’t in autoscaling/v2beta1 on 1.9, but are in subsequent versions. More recent versions also add the autoscaling/v2beta2 specification.

The two specifications also differ slightly, so the same HPA looks a little different when created using autoscaling/v2beta2.

For example, here is the previous HPA using the new specification:

apiVersion: autoscaling/v2beta2
kind: HorizontalPodAutoscaler
metadata:
  name: hpa-test
  namespace: dev
  annotations:
    metric-config.external.prometheus-query.prometheus/autoregister_queue_latency: autoregister_queue_latency{endpoint="https",instance="192.168.99.101:8443",job="apiserver",namespace="default",quantile="0.99",service="kubernetes"}
spec:
  scaleTargetRef:
    apiVersion: extensions/v1beta1
    kind: Deployment
    name: test
  minReplicas: 1
  maxReplicas: 10
  metrics:
  - type: External
    external:
      metric:
        name: prometheus-query
        selector:
          matchLabels:
            query-name: autoregister_queue_latency
      target:
        type: AverageValue
        averageValue: 1

To set up the kube-metrics-adapter:

git clone https://github.com/zalando-incubator/kube-metrics-adapter.git
cd kube-metrics-adapter
kubectl apply -f docs/rbac.yaml
kubectl apply -f docs/external-metrics-apiservice.yaml
kubectl apply -f docs/service.yaml

In docs/deployment.yaml, modify the --prometheus-server argument to match the DNS name of your own Prometheus service, and append the port number. For example:

- --prometheus-server=http://prometheus.monitoring.svc.cluster.local:9090

And then apply the deployment:

kubectl apply -f docs/deployment.yaml

You should now have the external metrics API set up - but there are no metrics being fed to it yet. You can see this by checking the API resources for external.metrics.k8s.io:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .

{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": []
}

Let’s set up an HPA and create a Prometheus collector.

Prometheus

Of course, getting metrics from Prometheus requires that Prometheus is already collecting the metrics we want to scale on.

In terms of metrics, we can import metrics from kube-state-metrics (Kubernetes object state metrics) or node-exporter (host metrics), or from any service that exposes a /metrics endpoint, by defining targets for Prometheus to scrape. With the Prometheus Operator this is done using ServiceMonitors. Like APIServices, ServiceMonitors rely on a Service that points at the pods (if in the cluster) or binaries (if not) that we want Prometheus to scrape. For example, Prometheus can scrape a deployed instance of node-exporter using this ServiceMonitor:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
 name: node-exporter
 namespace: monitoring
spec:
 endpoints:
 - interval: 30s
   port: metrics
 jobLabel: jobLabel
 selector:
   matchLabels:
     app: prometheus-node-exporter
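
Before pointing an HPA at a query, it’s worth checking that Prometheus is actually collecting the metric you intend to scale on. One quick way is to port-forward to the Prometheus service (the service name and namespace here are assumed to match the --prometheus-server address used earlier) and hit its query API:

kubectl port-forward -n monitoring svc/prometheus 9090:9090 &
curl -s 'http://localhost:9090/api/v1/query?query=autoregister_queue_latency' | jq .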

Putting it all together

Deploy the application you want to scale:

cat << EOF | kubectl apply -f -
apiVersion: v1
kind: Service
metadata:
 name: test
 namespace: dev
spec:
 ports:
 - name: http
   port: 80
   protocol: TCP
   targetPort: http
 selector:
   app: test
   release: dev
 sessionAffinity: None
 type: ClusterIP

---

apiVersion: extensions/v1beta1
kind: Deployment
metadata:
 name: test
 namespace: dev
spec:
 replicas: 1
 selector:
   matchLabels:
     app: test
     release: dev
 template:
   metadata:
     labels:
       app: test
       release: dev
   spec:
     containers:
     - image: busybox
       command:
       - sleep
       - "3600"
       imagePullPolicy: IfNotPresent
       name: test
       ports:
       - containerPort: 80
         name: http
         protocol: TCP
EOF
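
With the application running, apply the HPA from earlier (this assumes you’ve saved the autoscaling/v2beta2 manifest above as hpa.yaml - the filename is arbitrary):

kubectl apply -f hpa.yaml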

When an HPA that defines a Prometheus collector is deployed, you can check that it is working by looking for an APIResource on external.metrics.k8s.io called prometheus-query. If it’s there, metrics are available from this endpoint:

kubectl get --raw "/apis/external.metrics.k8s.io/v1beta1" | jq .

{
  "kind": "APIResourceList",
  "apiVersion": "v1",
  "groupVersion": "external.metrics.k8s.io/v1beta1",
  "resources": [
    {
      "name": "prometheus-query",
      "singularName": "",
      "namespaced": true,
      "kind": "ExternalMetricValueList",
      "verbs": [
        "get"
      ]
    }
  ]
}

You will also see these metrics being collected in the kube-metrics-adapter logs:

kubectl logs kube-metrics-adapter-6559fc9987-s8lhb -n kube-system

[...]
time="2019-05-24T16:14:07Z" level=info msg="Looking for HPAs" provider=hpa
time="2019-05-24T16:14:07Z" level=info msg="Found 0 new/updated HPA(s)" provider=hpa
time="2019-05-24T16:14:36Z" level=info msg="Collected 1 new metric(s)" provider=hpa
time="2019-05-24T16:14:36Z" level=info msg="Collected new external metric 'prometheus-query' (927) [query-name=autoregister_queue_latency]" provider=hpa
[...]
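
You can also check the HPA itself to see the current value of the metric and any scaling events (names taken from the example HPA above):

kubectl get hpa hpa-test -n dev
kubectl describe hpa hpa-test -n dev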

And that’s all you need to do to use your existing Prometheus metrics to autoscale pods in Kubernetes! While this is a good foundation for ensuring that your applications can scale appropriately, there is more we can do: cluster autoscaling ensures Kubernetes can accommodate the additional capacity that scaled-out applications require, and vertical pod autoscaling puts Kubernetes in a better position to predict just how much capacity individual pods need. Check back soon, when we’ll be looking at vertical pod autoscaling.

Need help running Kubernetes?

Get in touch and see how we can help you.
