Engineering • 17min read

Datadog Kubernetes Autoscaling: Choosing the Right Tool

A use-case-driven guide to choosing between HPA + Cluster Agent, DatadogPodAutoscaler, and KEDA for Kubernetes autoscaling on Datadog metrics.

Written by:

Louise Champ


This blog is part of our Kubernetes Autoscaling with Datadog series; we recommend reading the rest of the posts in the series:

  1. Kubernetes Autoscaling: Getting Started with Datadog
  2. Kubernetes Autoscaling with Datadog External Metrics
  3. Kubernetes Autoscaling: Using the DatadogPodAutoscaler
  4. Kubernetes Autoscaling with Datadog: HPA or KEDA?
  5. Datadog Kubernetes Autoscaling: Choosing the Right Tool

Over the course of the Kubernetes Autoscaling with Datadog series, we have covered three distinct approaches to autoscaling Kubernetes workloads using Datadog metrics, with each tool examined in its own dedicated post. That depth is useful for understanding what each approach is and how it works, but it leaves a practical question unanswered.

Engineering teams rarely get the opportunity to evaluate these tools against greenfield workloads; instead, they are working with enterprise applications that have evolved over years, migrating between stacks and absorbing new technologies and architectural patterns along the way. The question is not which tool suits a clean theoretical workload, but which fits what is actually running.

This post will work through that using five concrete workload patterns that cover the majority of scaling patterns seen in practice. For readers arriving at this post without the full series context, a brief recap of each approach:

  • The HPA + Cluster Agent path is the most conventional: the Cluster Agent registers as an External Metrics Provider, and Kubernetes’ native HPA consumes Datadog metrics through DatadogMetric CRDs. No additional operators are required beyond what the Datadog Agent already provides, and the scaling logic stays explicit and auditable.
  • The DatadogPodAutoscaler consolidates horizontal and vertical scaling into a single custom resource, driven by Datadog’s analysis of historical workload behaviour. It handles CPU / memory scaling, custom metric queries, and vertical right-sizing from one resource definition per workload. The trade-off is a hard dependency on Datadog’s Remote Configuration service; without outbound connectivity, the autoscaler installs cleanly and then does nothing. In environments with strict egress controls, that dependency is a genuine blocker.
  • KEDA is a CNCF project with a native Datadog scaler, scale-to-zero support, and heterogeneous trigger composition the standard HPA cannot replicate. It introduces its own operator and CRDs; each ScaledObject trigger makes independent API calls to Datadog, rather than sharing the Cluster Agent’s cache. An experimental useClusterAgentProxy: "true" option routes queries through the Cluster Agent to reduce rate-limit exposure, though it remains experimental in KEDA 2.19 and has known issues with multiple triggers on the same workload.

The right choice will depend on what the workloads actually require.

Each post in this series has approached these tools individually, but the question engineers face in practice is: “I have this specific workload with these specific requirements; which tool should I reach for?”

That is harder to answer from individual tutorials, and it is what this post addresses.

Pattern 1: Stateless HTTP Service

This will be a web-facing Deployment handling synchronous HTTP traffic with variable but predictable load patterns: business-hours peaks, marketing traffic spikes, overnight quiet periods. The service is stateless and horizontally scalable, with resource requests set from historical data that drift over time as the application evolves.

The primary scaling signal is request rate or connection count; both are leading indicators of load, unlike CPU utilisation, which is a lagging one. By the time CPU climbs far enough to trigger an HPA, the service may already be degrading. Scaling on nginx.net.request_per_s reacts faster and more accurately to what the workload is actually doing.

Right-sizing is a secondary but meaningful concern: resource requests over-provisioned by 30 to 40% carry unnecessary overhead across every replica, though correcting it is not time-sensitive the way scale-up events are.

The workload should never reach zero replicas; idle periods will still need a baseline for the first request.

The right tool for this pattern is DatadogPodAutoscaler, assuming the Remote Configuration dependency is viable. The workload runs continuously, stays above a minimum replica count, and benefits from horizontal scaling on a custom metric alongside ongoing vertical right-sizing.

A custom query objective handles the request rate signal directly:

apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
  name: frontend-api
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend-api
  owner: Local
  applyPolicy:
    mode: Apply
    scaleDown:
      rules:
        - periodSeconds: 600
          type: Percent
          value: 20
      stabilizationWindowSeconds: 600
    scaleUp:
      rules:
        - periodSeconds: 120
          type: Percent
          value: 50
      stabilizationWindowSeconds: 60
    update:
      strategy: Auto
  constraints:
    minReplicas: 2
    maxReplicas: 30
  objectives:
    - type: CustomQuery
      customQuery:
        request:
          formula: req_rate
          queries:
            - name: req_rate
              source: Metrics
              metrics:
                query: avg:nginx.net.request_per_s{service:frontend-api,env:production}.rollup(avg, 60)
        value:
          type: AbsoluteValue
          absoluteValue: 200
        window: 5m0s
  fallback:
    horizontal:
      enabled: true
      direction: ScaleUp
      objectives:
        - type: PodResource
          podResource:
            name: cpu
            value:
              type: Utilization
              utilization: 70
      triggers:
        staleRecommendationThresholdSeconds: 300

The update.strategy: Auto setting enables Datadog to adjust resource requests over time based on observed usage, reducing the manual overhead of keeping resource configurations accurate as the application changes.

The fallback to CPU utilisation means the autoscaler retains a local signal if the external metric becomes temporarily unavailable. A Remote Configuration connectivity issue can leave the autoscaler without recommendations. The fallback ensures that if the primary metric goes stale beyond staleRecommendationThresholdSeconds, the scaler still has a signal for scale-up decisions.

The HPA + Cluster Agent path is a viable alternative if the Remote Configuration dependency is not viable, or if existing DatadogMetric CRDs are in place and migration is not currently justified. The primary thing it does not provide is automatic right-sizing; resource requests would need to be revisited manually, or a separate VPA run alongside.
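For teams taking that route, the same request-rate signal could be expressed roughly as follows. This is a minimal sketch rather than a tested configuration: the DatadogMetric name frontend-api-req-rate is hypothetical, the target of 200 and the behaviour policies simply mirror the DatadogPodAutoscaler example above, and it assumes the Cluster Agent is already registered as the External Metrics Provider with DatadogMetric CRD support enabled.

```yaml
# Sketch only: resource names and thresholds are assumptions for illustration.
apiVersion: datadoghq.com/v1alpha1
kind: DatadogMetric
metadata:
  name: frontend-api-req-rate   # hypothetical name
  namespace: production
spec:
  # Same query as the DatadogPodAutoscaler objective
  query: avg:nginx.net.request_per_s{service:frontend-api,env:production}.rollup(avg, 60)
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: frontend-api
  namespace: production
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend-api
  minReplicas: 2
  maxReplicas: 30
  metrics:
    - type: External
      external:
        metric:
          # Reference format: datadogmetric@<namespace>:<name>
          name: datadogmetric@production:frontend-api-req-rate
        target:
          type: Value     # the query already averages, so compare the raw value
          value: "200"
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Percent
          value: 50
          periodSeconds: 120
    scaleDown:
      stabilizationWindowSeconds: 600
      policies:
        - type: Percent
          value: 20
          periodSeconds: 600
```

Nothing in this sketch adjusts resource requests, so right-sizing remains a manual or VPA-driven task.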

The Workload Scaling list view in the Datadog UI shows recommended versus applied replica counts and a history of scaling events. An additional dashboard plotting nginx.net.request_per_s alongside the Deployment replica count is usually enough to validate that scaling decisions correlate with the expected signals, and will pay for itself the first time a scaling event happens at an inconvenient hour.

Pattern 2: Queue Worker

This will be a Deployment of workers pulling jobs from a queue such as AWS SQS or RabbitMQ. Work arrives in bursts: a batch upload triggers a spike of jobs, the workers drain the queue, and then the queue sits empty until the next batch. Between bursts, the workers are idle, and idle pods impose a direct cost with no corresponding benefit.

Queue depth is a direct, lag-free leading indicator of pending work: as soon as messages appear, the scaler can act, rather than waiting for CPU to climb as workers begin processing. The target value, the queue depth per worker, should reflect each worker’s throughput capacity.

The critical capability needed here is scale-to-zero. When the queue is empty, the right number of workers is zero; for high-volume batch systems where idle periods are long and frequent, the cost of keeping idle replicas running is non-trivial.

The right tool for this pattern is KEDA, and this pattern provides the clearest use case for KEDA in the series. Scale-to-zero is a hard requirement, and neither the HPA + Cluster Agent path nor the DatadogPodAutoscaler can satisfy it.

Setting minReplicaCount: 0 on a ScaledObject, combined with an activationQueryValue to prevent scale-up on noise, handles this cleanly:

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: video-worker-scaledobject
  namespace: video
spec:
  scaleTargetRef:
    name: video-transcoder
  minReplicaCount: 0
  maxReplicaCount: 20
  cooldownPeriod: 600
  triggers:
    - type: datadog
      metricType: Value
      metadata:
        query: "avg:aws.sqs.messages_visible{queuename:video-processing,env:prod}.rollup(avg, 60)"
        queryValue: "50"
        activationQueryValue: "1"
        age: "120"
        timeWindowOffset: "30"
        lastAvailablePointOffset: "1"
      authenticationRef:
        name: datadog-trigger-auth

The activationQueryValue: "1" is worth calling out specifically. Without it, a metric returning a very small non-zero value during a quiet period could trigger an unnecessary scale-up.

The timeWindowOffset: "30" accounts for the latency in CloudWatch metric availability; for SQS metrics sourced through CloudWatch, the realistic end-to-end delay before a stable value reaches KEDA is 90 to 120 seconds, and the offset helps avoid acting on partially-propagated data.

The lastAvailablePointOffset: "1" addresses a separate concern: Datadog applies implicit rollup averaging to the most recent data point, which can cause KEDA to act on a partially-aggregated value, so setting this to "1" instructs KEDA to use the second-to-last data point instead, which is fully settled.

For implementation detail, including TriggerAuthentication setup, see our KEDA series.
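As a quick orientation, the datadog-trigger-auth object referenced by the ScaledObject above might look something like the sketch below. The Secret name datadog-secrets, its keys, and the placeholder values are assumptions for illustration; the parameter names apiKey, appKey, and datadogSite are the ones the KEDA Datadog scaler reads when it queries the Datadog API directly.

```yaml
# Sketch only: Secret name, keys, and values are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: datadog-secrets          # hypothetical name
  namespace: video
type: Opaque
stringData:
  apiKey: "<DATADOG_API_KEY>"
  appKey: "<DATADOG_APP_KEY>"
  datadogSite: "datadoghq.eu"    # assumption: set to the site your account uses
---
apiVersion: keda.sh/v1alpha1
kind: TriggerAuthentication
metadata:
  name: datadog-trigger-auth     # must be in the same namespace as the ScaledObject
  namespace: video
spec:
  secretTargetRef:
    - parameter: apiKey
      name: datadog-secrets
      key: apiKey
    - parameter: appKey
      name: datadog-secrets
      key: appKey
    - parameter: datadogSite
      name: datadog-secrets
      key: datadogSite
```

In practice the Secret would usually be provisioned by a secrets manager rather than committed as a manifest.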

The DatadogPodAutoscaler does not support scale-to-zero, and the HPA + Cluster Agent path (in most circumstances) enforces a minimum of one replica. For this pattern, KEDA is not a preference, but the only tool in this series that can consistently deliver what the workload requires.

The one nuance is that right-sizing is not available alongside KEDA’s ScaledObject; the DatadogPodAutoscaler’s vertical scaling cannot run simultaneously with KEDA. For most queue workers this is acceptable, because the primary cost driver is idle replicas rather than per-replica resource allocation, but if right-sizing does matter, a periodic review of Datadog’s workload recommendations in the UI is a reasonable substitute.

Pattern 3: Batch Job

This is a Kubernetes Job or CronJob that runs periodically, processes a dataset, and exits cleanly when finished. Work may arrive on a schedule or be triggered by an external event. The key characteristic is that the workload has a defined start and end, with nothing to do or keep warm between runs.

The HPA model targets continuously-running Deployment or StatefulSet workloads, and is not designed to manage Job lifecycle. A Job does not expose the scale subresource the HPA relies on, and even conceptually the fit is awkward: the HPA reasons about replica count whilst the Job controller reasons about completions. The two models do not cleanly compose.

The right scaling model in this instance is event-driven job creation: if there is no signal, there are no pods; when a signal arrives, jobs are created with parallelism proportional to the backlog; when the backlog is drained, jobs complete and are cleaned up. This is a fundamentally different model from replica scaling.

The right tool is KEDA’s ScaledJob, which is designed specifically for this pattern. Where a ScaledObject manages the replica count of a long-running Deployment, a ScaledJob creates new Job objects in response to trigger activity and scales their parallelism with the incoming signal depth. The Datadog scaler works with ScaledJob in the same way it does with ScaledObject.

To paint a picture, let’s say we have an image processing pipeline triggered by SQS messages. With a ScaledJob targeting the queue metric, KEDA can create job pods as messages arrive, scale concurrent pods with queue depth, and stop when the queue is empty.
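A minimal sketch of what that could look like is shown below. The namespace, job name, container image, queue name, and threshold are all hypothetical, and it assumes a TriggerAuthentication named datadog-trigger-auth exists in the same namespace; treat it as a starting shape rather than a tested configuration.

```yaml
# Sketch only: names, image, and thresholds are assumptions for illustration.
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: image-processor
  namespace: media
spec:
  jobTargetRef:
    backoffLimit: 2
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: processor
            image: registry.example.com/image-processor:1.0.0   # hypothetical image
  pollingInterval: 30            # seconds between trigger evaluations
  maxReplicaCount: 20            # upper bound on concurrently running Jobs
  successfulJobsHistoryLimit: 5
  failedJobsHistoryLimit: 5
  triggers:
    - type: datadog
      metadata:
        query: "avg:aws.sqs.messages_visible{queuename:image-processing,env:prod}.rollup(avg, 60)"
        queryValue: "10"         # assumed number of messages one Job can handle
        age: "120"
      authenticationRef:
        name: datadog-trigger-auth
```

Each Job created this way should consume messages and exit on its own; KEDA creates Jobs in response to the signal but does not terminate running work when the queue drains.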

This pattern warrants its own treatment separate from this series; the KEDA ScaledJob documentation is a good starting point for understanding how job scaling differs from workload scaling.

Pattern 4: Stateful Service

This will be a StatefulSet with persistent identity requirements: a database, cache cluster, or streaming system where pods are not interchangeable, and adding or removing a replica has consequences beyond simply changing capacity. Adding a Kafka broker or Redis Cluster node, for example, involves data rebalancing, and removing one requires careful coordination to avoid data loss.

For most stateful services, automated horizontal scaling is not appropriate without significant application-level awareness that a generic autoscaler cannot provide. Manual scaling decisions made by someone who understands the data implications are usually preferable.

Vertical scaling, on the other hand, is often safe and useful: adjusting CPU / memory requests to match actual usage without changing the replica count improves resource efficiency without the risks of horizontal scaling.

If we were to suggest any tool, it would be DatadogPodAutoscaler configured to apply vertical updates whilst suppressing horizontal scaling changes.

Setting both scaleDown.strategy and scaleUp.strategy to Disabled while leaving update.strategy: Auto active allows Datadog to right-size individual pods without touching the replica count:

apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
  name: redis-cluster
  namespace: data
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: redis-cluster
  owner: Local
  applyPolicy:
    mode: Apply
    scaleDown:
      strategy: Disabled
    scaleUp:
      strategy: Disabled
    update:
      strategy: Auto
  constraints:
    maxReplicas: 5

From Datadog Agent 7.78+, vertical updates use in-place resource resizing where the cluster and container runtime support it (Kubernetes 1.27+), meaning pod restarts are not required. On older agents or clusters without in-place update support, vertical updates will trigger pod restarts as pods cycle; for a StatefulSet with a PodDisruptionBudget restricting unavailability to a single pod, updates will propagate one pod at a time.

We would strongly suggest running in mode: Preview first and observing the recommended resource adjustments before switching to mode: Apply. A week of Preview mode data, covering typical day-of-week traffic variance, will give a meaningful baseline before committing to live vertical updates.
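For reference, the Preview variant of the manifest above differs only in the mode field. The sketch below reuses the same hypothetical redis-cluster names:

```yaml
apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
  name: redis-cluster
  namespace: data
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: redis-cluster
  owner: Local
  applyPolicy:
    mode: Preview        # recommendations are generated but not applied
    scaleDown:
      strategy: Disabled
    scaleUp:
      strategy: Disabled
    update:
      strategy: Auto
  constraints:
    maxReplicas: 5
```

Switching to live updates is then a one-line change back to mode: Apply once the recommendations look sensible.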

For vertical-only use cases of this kind, the Kubernetes-native Vertical Pod Autoscaler is also an appropriate tool; it falls outside the scope of this series, but our Vertical Pod Autoscaling series covers it in detail.

Because the DatadogPodAutoscaler is making resource decisions on the team’s behalf, tracking what it recommends matters more than it would in a horizontal scaling scenario.

Monitoring resource request values on pods over time identifies unexpected vertical drift; a monitor on DatadogPodAutoscaler status conditions reaching an error state, or on lastUpdatedTime going stale during active load, gives early warning of Remote Configuration connectivity issues.

Pattern 5: Multi-Signal Workload

This will be a service driven by two or more independent upstream systems that can each independently push it to capacity. No single metric captures the full picture of demand, and scaling on just one of these leaves the service exposed to the other.

In this instance, the autoscaler needs to respond to whichever signal is most demanding at any given moment. This differs from the multi-metric HPA behaviour we saw in Blog 2, where all signals come through the Cluster Agent’s External Metrics API. The challenge arises when signals come from different source types, such as a Datadog metric alongside a Kafka consumer group lag measured directly from the broker, that cannot be routed through the same External Metrics Provider.

The right tool here depends on where the signals come from. If all signals are Datadog metrics, HPA + Cluster Agent handles this cleanly with multiple external metrics in a single HPA (Blog 2 covers this in detail). DatadogPodAutoscaler also supports multiple objectives, with the additional benefit of vertical right-sizing alongside multi-signal horizontal scaling.

If any signal comes from a non-Datadog source, such as Kafka topic lag or Redis list depth, KEDA will be the right tool. Each trigger is evaluated independently, and the workload scales to whichever demands the most replicas. Routing everything through Datadog first adds unnecessary complexity and latency.
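As an illustration of mixed-source triggers, the sketch below combines a Datadog request-rate query with Kafka consumer group lag read directly from the broker. The service name, namespace, broker address, topic, and thresholds are all hypothetical, and a TriggerAuthentication named datadog-trigger-auth is assumed to exist in the same namespace.

```yaml
# Sketch only: names, addresses, and thresholds are assumptions for illustration.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: order-processor
  namespace: orders
spec:
  scaleTargetRef:
    name: order-processor
  minReplicaCount: 2
  maxReplicaCount: 30
  triggers:
    # Signal 1: request rate, queried from Datadog
    - type: datadog
      metricType: Value
      metadata:
        query: "avg:nginx.net.request_per_s{service:order-processor,env:prod}.rollup(avg, 60)"
        queryValue: "200"
        age: "120"
      authenticationRef:
        name: datadog-trigger-auth
    # Signal 2: consumer group lag, read from the Kafka brokers
    - type: kafka
      metadata:
        bootstrapServers: kafka.orders.svc:9092   # hypothetical broker address
        consumerGroup: order-processor
        topic: orders
        lagThreshold: "100"
```

KEDA hands both triggers to the HPA it manages, which applies the largest replica count any single trigger asks for.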

Where things get complicated is when a multi-signal workload also requires scale-to-zero and right-sizing. No single tool currently satisfies all three requirements (multi-signal scaling, scale-to-zero, and right-sizing) simultaneously. For most multi-signal workloads, the minimum replica count stays above zero, making DatadogPodAutoscaler with multiple objectives the strongest option.

For workloads where scaling to zero genuinely matters, KEDA wins, and right-sizing can be handled through periodic manual review of Datadog’s workload recommendations.

Pattern Summary

The five patterns covered in this post each have a primary recommendation, even where the reasoning behind it differs. In most cases, a single capability (scale-to-zero, right-sizing, or signal source) is what resolves the decision.

The table below captures that for quick reference:

| Pattern | Recommended Tool | Key Reason |
| --- | --- | --- |
| Stateless HTTP Service | DatadogPodAutoscaler | Custom metric scaling with automatic right-sizing |
| Queue Worker | KEDA | Scale-to-zero is a hard requirement |
| Batch Job | KEDA ScaledJob | Event-driven job creation; replica scaling does not apply |
| Stateful Service | DatadogPodAutoscaler (vertical only) | Right-sizing without horizontal scaling |
| Multi-Signal Workload | Depends on signal sources | Datadog-only signals: HPA or DatadogPodAutoscaler; mixed sources: KEDA |

The Multi-Signal Workload pattern is the exception: there is no single answer because the right tool genuinely depends on where the signals originate. The section below works through the three questions that resolve that ambiguity, and covers the cases where workloads do not map cleanly to one of the five patterns above.

The Three Axes That Drive Most Decisions

Rather than a decision tree, which tends to oversimplify, it helps to think in terms of three concrete questions to ask of each workload. Once those are answered, the tool choice usually follows straightforwardly.

Does any workload need to scale to zero? If yes, go with KEDA. Neither the HPA nor the DatadogPodAutoscaler supports minReplicas: 0 in standard configuration.

An HPAScaleToZero feature gate does technically exist in Kubernetes and can fulfil this purpose, but it has remained alpha since Kubernetes 1.16 and is not enabled by default on any major managed Kubernetes service, so for practical purposes it is not a factor in this decision. Scale-to-zero is therefore a hard requirement that resolves the decision immediately for worker-style workloads with intermittent load.

Does right-sizing matter? If resource waste is a significant cost driver, or if maintaining accurate resource requests manually is becoming a burden, the DatadogPodAutoscaler is the tool that can address this. The HPA + Cluster Agent path has no equivalent; running a separate VPA alongside is possible, but introduces its own co-ordination complexity. KEDA has no vertical scaling capability at all.

Where do the signals come from? If all scaling signals are Datadog metrics, any of the three approaches can handle them. If any signal comes from a non-Datadog source, KEDA’s heterogeneous trigger composition is the cleanest path; routing everything through Datadog first is possible, but adds latency and complexity.

The table below maps each tool against those three axes. These are the capabilities that, in practice, resolve most selection decisions:

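| Capability | HPA + Cluster Agent | DatadogPodAutoscaler | KEDA |
| --- | --- | --- | --- |
| Scale-to-zero | No | No | Yes |
| Vertical right-sizing | No | Yes | No |
| Remote Configuration required | No | Yes | No |
| Non-Datadog signal sources | No | No | Yes |
| Additional operator required | No | No | Yes |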

With those three questions answered, the summary looks roughly like this:

  • HPA + Cluster Agent is the right default when the team wants the simplest possible setup, all signals are Datadog metrics, and there is no requirement for right-sizing or scale-to-zero.
    • It is also the correct choice when the Remote Configuration dependency is not viable.
    • It is the lowest-risk starting point, and the easiest to reason about when something has gone wrong at 02:00.
  • DatadogPodAutoscaler is the right choice when right-sizing matters, either because resource waste is a significant cost driver or because maintaining accurate resource requests manually is becoming a burden.
    • It handles multidimensional scaling natively and supports custom metric queries as objectives.
    • Its two primary constraints are the Remote Configuration dependency and the absence of scale-to-zero support; both are worth verifying before committing.
  • KEDA is the right choice when scale-to-zero is a hard requirement, when signals come from non-Datadog sources, or when the team already operates KEDA for other scalers.
    • Its per-trigger API call model is less efficient than the Cluster Agent’s shared cache, and introducing it where the Cluster Agent is already active requires navigating the External Metrics Provider conflict.
    • That conflict is manageable, but requires a deliberate migration plan.

Getting Started in Practice

The approach we would suggest, for any of these tools, is to start with a single workload in the appropriate preview mode, observe what the tool would do, and only then switch to applying recommendations.

  • For the DatadogPodAutoscaler, that means mode: Preview in the applyPolicy section for at least a week before switching to mode: Apply.
  • For the HPA + Cluster Agent path, starting with a conservatively high target value and reviewing kubectl describe hpa output achieves something similar before tightening the threshold.
  • For KEDA, the ScaledObject status conditions and operator logs together confirm trigger evaluation is working before relying on it in production.

The common thread is not to go straight to production with an aggressive configuration. Autoscaling decisions interact with PodDisruptionBudget constraints and node provisioner behaviour, and those interactions are easier to understand from a controlled starting point than during an incident.

Series Summary

The series has covered:

  • Setting up Cluster Agent as an External Metrics Provider in both inline and CRD modes (Blog 1).
  • Building HPAs against Datadog metrics including request rate, SQS queue depth, and multi-signal combinations (Blog 2).
  • Configuring the DatadogPodAutoscaler for horizontal, vertical, multidimensional, and custom-query scaling (Blog 3).
  • The architectural trade-offs between the HPA + Cluster Agent approach and KEDA, with working end-to-end configurations for both (Blog 4).
  • This post, working through the decision across workload patterns and distilling it to three deciding questions.

For implementation detail on KEDA’s Datadog scaler, including ScaledObject configuration, TriggerAuthentication setup, and the activationQueryValue parameter for scale-to-zero patterns, see our KEDA series.

For the full set of DatadogPodAutoscaler fields and configuration options, the Datadog Kubernetes Autoscaling documentation is the authoritative reference.

If any of the patterns here map closely to a workload you are currently managing and the guidance leaves questions unanswered, we are happy to dig into the specifics. The patterns we have described work well as starting points, but real workloads can have their own edge cases, and those are often where the interesting decisions actually lie.

Frequently Asked Questions

Can the DatadogPodAutoscaler and KEDA run on the same cluster for different workloads?

Yes. The DatadogPodAutoscaler uses Remote Configuration rather than the External Metrics Provider interface, so it does not conflict with KEDA’s metrics adapter; both can coexist on the same cluster. The constraint described in Blog 4 only applies if the Cluster Agent’s metricsProvider is also active for HPA-backed workloads; in that case, the External Metrics Provider conflict between KEDA and the Cluster Agent remains.

What happens to a DatadogPodAutoscaler-managed workload if Remote Configuration connectivity is permanently lost?

The autoscaler continues applying its last known recommendation but cannot receive updates. Scale-up decisions fall back to CPU utilisation if a fallback block is configured; vertical right-sizing stops entirely. The migration path is to delete the DatadogPodAutoscaler and replace it with an HPA and DatadogMetric CRDs, which have no Remote Configuration dependency.

Can APM metrics such as request latency or error rate be used as autoscaling signals?

Yes. APM-derived metrics such as trace.web.request.hits and per-service error rates are available as standard Datadog metrics and work with all three approaches. Define a DatadogMetric CRD for HPA + Cluster Agent, use a CustomQuery objective for DatadogPodAutoscaler, or use the Datadog trigger for KEDA. Apply a 60-second rollup to avoid acting on single-request noise.
