Kubernetes Autoscaling: Using the DatadogPodAutoscaler
Learn how to consolidate Kubernetes pod autoscaling with the DatadogPodAutoscaler CRD.
This blog is part of our Kubernetes Autoscaling with Datadog series; we recommend reading the rest of the posts in the series:
- Kubernetes Autoscaling: Getting Started with Datadog
- Kubernetes Autoscaling with Datadog External Metrics
- Kubernetes Autoscaling: Using the DatadogPodAutoscaler
In the previous two posts in this series, we built out a complete autoscaling setup using the Datadog Cluster Agent, DatadogMetric CRDs, and Kubernetes’ native Horizontal Pod Autoscaler. That combination is solid and works well in production, but it requires managing several moving parts: the HPA itself, the DatadogMetric objects, and careful coordination of polling intervals and tag selectors. As your autoscaling surface grows, that overhead accumulates, and teams may start looking for a more consolidated model.
Datadog Kubernetes Autoscaling, and the DatadogPodAutoscaler CRD at its core, takes a different approach. Rather than composing multiple Kubernetes primitives, you define scaling behaviour for a single workload in a single resource. The Datadog Cluster Agent watches that resource and acts as its controller, handling both horizontal and vertical scaling decisions from the same telemetry your agents are already collecting.
Throughout this post we focus on owner: Local mode, where the team defines and deploys the DatadogPodAutoscaler themselves using their existing GitOps or deployment workflows. This is typically what platform teams and SREs want when autoscaling is part of a managed service offering.
One important constraint to flag before getting into the implementation: Datadog recommends removing any existing HPAs or VPAs from a workload when enabling DatadogPodAutoscaler for it. Running both simultaneously can produce conflicting scale decisions. If you want to evaluate the feature without touching your live autoscaling setup, mode: Preview in the applyPolicy section generates recommendations without applying them; we cover this in the migration section below.
Before getting into configuration, though, there is one architectural dependency worth understanding properly, because it shapes whether the tool is viable for a given environment at all.
The Remote Configuration Dependency
The DatadogPodAutoscaler requires Remote Configuration to function. The Datadog Cluster Agent uses Remote Configuration to communicate scaling decisions back to the Datadog platform, and without it, the autoscaler cannot apply recommendations regardless of how the resource is configured.
What this means in practice is that the Cluster Agent must be able to reach Datadog’s Remote Configuration endpoints from inside the cluster, and that Remote Configuration must be enabled at the organisation level in the Datadog UI.
The endpoints that must be reachable are:
- remote-config.datadoghq.com (US1 region), on port 443
The exact hostname varies by Datadog site. If the organisation is on a different site (EU, US3, US5, AP1), use Datadog’s network requirements documentation with the site selector set to the appropriate region to confirm the correct endpoint before testing. Note also that Datadog Kubernetes Autoscaling is not available for the Datadog for Government site.
The right time to verify this is before anything else. For teams where Remote Configuration is not viable, the HPA + Cluster Agent path from the earlier posts in this series remains the right approach; it has no external connectivity requirements beyond the standard Datadog API endpoints, and the DatadogMetric CRD model gives teams explicit, auditable control over their scaling configuration.
Prerequisites
Before deploying a DatadogPodAutoscaler, the following need to be in place:
- Datadog Agent and Cluster Agent v7.66.1 or later (required for live autoscaling; in-app recommendations work from v7.50+).
- For workloads using vertical scaling, Linux kernel v5.19+ with cgroup v2 is recommended for optimal behaviour.
- Remote Configuration enabled at the organisation level in the Datadog UI, and the Remote Configuration capability enabled on the API key the Cluster Agent is using. From Agent v7.47.0 onwards Remote Configuration is on by default in the Agent itself.
- The Admission Controller enabled in the Datadog Helm chart or Operator configuration (on by default).
- The Kubernetes State Core integration enabled.
- The following Datadog RBAC permissions:
- Org Management (to enable Remote Configuration at the organisation level)
- API Keys Write (to enable the Remote Configuration capability on the API key itself)
- Workload Scaling Write
- Autoscaling Manage
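A quick way to confirm the version requirements above against a running cluster, assuming the default datadog namespace and the workload names created by the Helm chart:

```bash
# Cluster Agent version (v7.66.1+ required for live autoscaling)
kubectl exec -n datadog deploy/datadog-cluster-agent -- datadog-cluster-agent version

# Node Agent version
kubectl exec -n datadog ds/datadog -- agent version
```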
Verifying Remote Configuration Connectivity
From inside the cluster, run a temporary pod to test egress to the Remote Configuration endpoint directly:
```bash
kubectl run rc-check --rm -it --restart=Never \
  --image=curlimages/curl -- \
  curl -sv --max-time 5 https://remote-config.datadoghq.com 2>&1 | grep -E "Connected|SSL|Could not resolve|Connection refused"
```
A successful TLS handshake, even if the response is a 401 or 403, confirms the endpoint is reachable. A connection timeout or DNS resolution failure points to an egress rule or DNS configuration issue that needs to be resolved before proceeding.
Verifying Kubernetes State Core
The Kubernetes State Core integration provides the workload metadata Datadog uses to build autoscaling recommendations. If it is missing or misconfigured, the DatadogPodAutoscaler status will show no recommendation and no obvious error, which is one of the more frustrating failure modes to diagnose after the fact.
Verify it is active in the Datadog UI under Integrations > Kubernetes State Core, then confirm the integration is producing data by checking for kubernetes_state.* metrics in the Datadog Metrics Explorer scoped to the cluster. If those metrics are absent, enabling the autoscaler will not produce recommendations.
If Kubernetes State Core is enabled but metrics are not appearing, check the Cluster Agent logs:
```bash
kubectl logs -n datadog deploy/datadog-cluster-agent | grep -i "kubernetes_state"
```
A common cause is a missing RBAC permission preventing the Cluster Agent from reading workload metadata. The Datadog Helm chart creates the necessary RBAC by default, but customised installations may omit it.
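One way to sanity-check the RBAC side is to impersonate the Cluster Agent’s service account with kubectl auth can-i; the service account name below assumes a default Helm install and may need adjusting to match the release:

```bash
# Can the Cluster Agent read the workload metadata Kubernetes State Core needs?
kubectl auth can-i list deployments --all-namespaces \
  --as=system:serviceaccount:datadog:datadog-cluster-agent
kubectl auth can-i watch pods --all-namespaces \
  --as=system:serviceaccount:datadog:datadog-cluster-agent
```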
Verifying Admission Controller Status
The Admission Controller is required for vertical scaling: it intercepts pod creation to apply updated resource requests. If it is disabled, the DatadogPodAutoscaler can still perform horizontal scaling, but vertical updates will not be applied.
```bash
kubectl get mutatingwebhookconfiguration | grep datadog
```
If there is no Datadog-related webhook configuration listed, the Admission Controller is not active. In the Helm chart, it is enabled via clusterAgent.admissionController.enabled: true.
Enabling the Feature via Helm
With prerequisites verified, enabling Datadog Kubernetes Autoscaling requires a small set of additions to values.yaml. Although Remote Configuration is on by default in the Agent from v7.47.0 onwards (and this feature requires v7.66.1+), including datadog.remoteConfiguration.enabled: true explicitly makes the dependency visible in version-controlled configuration.
The unbundleEvents flag ensures Kubernetes events are collected individually rather than grouped, which improves the resolution of autoscaling signal data.
DD_AUTOSCALING_FAILOVER_ENABLED is also worth setting here: it enables a local failover mode where the Cluster Agent continues applying the last known good recommendation if it temporarily loses contact with Datadog’s API, which is particularly relevant given the Remote Configuration dependency.
```yaml
datadog:
  remoteConfiguration:
    enabled: true
  autoscaling:
    workload:
      enabled: true
  kubernetesEvents:
    unbundleEvents: true
clusterAgent:
  admissionController:
    enabled: true
  env:
    - name: DD_AUTOSCALING_FAILOVER_ENABLED
      value: "true"
agents:
  env:
    - name: DD_AUTOSCALING_FAILOVER_ENABLED
      value: "true"
```
Apply the update:
```bash
helm repo update
helm upgrade -f values.yaml <RELEASE_NAME> datadog/datadog
```
If using the Datadog Operator instead, add the equivalent configuration to datadog-agent.yaml:
```yaml
spec:
  features:
    autoscaling:
      workload:
        enabled: true
    eventCollection:
      unbundleEvents: true
  override:
    clusterAgent:
      env:
        - name: DD_AUTOSCALING_FAILOVER_ENABLED
          value: "true"
    nodeAgent:
      env:
        - name: DD_AUTOSCALING_FAILOVER_ENABLED
          value: "true"
```
Common Scaling Configurations
The DatadogPodAutoscaler resource has a consistent structure across all configurations.
targetRef points at the workload to scale; applyPolicy controls when and how recommendations are applied, including stabilisation windows and the vertical update strategy; constraints sets the replica count boundaries; and objectives defines the scaling targets. The API version to use is datadoghq.com/v1alpha2.
Horizontal Scaling on CPU Utilisation
This is the most straightforward configuration. The autoscaler adjusts replica count to maintain a target CPU utilisation across pods, with update.strategy: Disabled to suppress vertical changes and keep this resource focused purely on horizontal scaling.
```yaml
apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
  name: frontend
  namespace: production
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: frontend
  owner: Local
  applyPolicy:
    mode: Apply
    scaleDown:
      rules:
        - periodSeconds: 600
          type: Percent
          value: 20
      stabilizationWindowSeconds: 600
    scaleUp:
      rules:
        - periodSeconds: 120
          type: Percent
          value: 50
      stabilizationWindowSeconds: 60
    update:
      strategy: Disabled
  constraints:
    minReplicas: 2
    maxReplicas: 20
  objectives:
    - type: PodResource
      podResource:
        name: cpu
        value:
          type: Utilization
          utilization: 70
```
The conservative scaleDown rule, limited to 20% reduction per 10-minute window, prevents rapid shedding of capacity after a spike. That is often the right choice for stateless web-facing workloads where the container resource configuration is already well-understood.
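Applying and inspecting the resource follows the usual CRD workflow. Assuming the manifest above is saved as frontend-autoscaler.yaml (a filename chosen here for illustration):

```bash
kubectl apply -f frontend-autoscaler.yaml

# The status columns populate once the Cluster Agent produces a recommendation
kubectl get datadogpodautoscaler frontend -n production
```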
Vertical-Only Scaling
For workloads that cannot be horizontally scaled, or where the goal is to rightsize individual pods without changing replica count, setting both scaleDown and scaleUp strategies to Disabled whilst leaving update.strategy: Auto active allows Datadog to adjust resource requests as pods cycle, without touching the replica count.
```yaml
apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
  name: analytics-worker
  namespace: data
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: analytics-worker
  owner: Local
  applyPolicy:
    mode: Apply
    scaleDown:
      strategy: Disabled
    scaleUp:
      strategy: Disabled
    update:
      strategy: Auto
  constraints:
    maxReplicas: 10
```
Datadog applies vertical updates via the Admission Controller, updating the pod spec when pods are recreated rather than immediately evicting running pods. The result is broadly similar to VPA’s Initial mode, applied progressively as pods cycle.
A third option, update.strategy: TriggerRollout, triggers a full pod rollout immediately when new recommendations are available, which is useful when rightsizing needs to take effect quickly but does mean accepting the disruption of a rollout outside of a normal deploy cycle.
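To watch vertical adjustments land as pods cycle, a custom-columns query over the pods’ resource requests is a simple option; the app label here is an assumption about how the Deployment labels its pods:

```bash
kubectl get pods -n data -l app=analytics-worker \
  -o custom-columns='POD:.metadata.name,CPU_REQ:.spec.containers[0].resources.requests.cpu,MEM_REQ:.spec.containers[0].resources.requests.memory'
```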
Multi-dimensional Scaling: Horizontal and Vertical Together
Unlike the standard HPA and VPA combination, a single DatadogPodAutoscaler resource can handle both. Setting update.strategy: Auto alongside a utilisation objective enables both horizontal replica count changes and ongoing vertical rightsizing from a single resource definition.
```yaml
apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
  name: checkout-api
  namespace: commerce
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api
  owner: Local
  applyPolicy:
    mode: Apply
    scaleDown:
      rules:
        - periodSeconds: 1200
          type: Percent
          value: 20
      stabilizationWindowSeconds: 600
    scaleUp:
      rules:
        - periodSeconds: 120
          type: Percent
          value: 50
      stabilizationWindowSeconds: 60
    update:
      strategy: Auto
  constraints:
    minReplicas: 3
    maxReplicas: 30
  objectives:
    - type: PodResource
      podResource:
        name: cpu
        value:
          type: Utilization
          utilization: 70
```
The CPU utilisation target drives horizontal replica count, and Datadog’s historical analysis drives resource request sizing; both can change in the same evaluation cycle. For services with strict availability requirements, starting with update.strategy: Disabled and enabling vertical updates once there is confidence in the recommendations is usually the safer path.
Scaling on a Custom Datadog Metric Query
For workloads where CPU or memory utilisation is not the right signal, DatadogPodAutoscaler supports arbitrary Datadog metric queries as scaling objectives. This is the analogue of the DatadogMetric CRD from the previous post, expressed inline within a single resource. The fallback block is important here: if the external metric source becomes unavailable, the autoscaler falls back to CPU utilisation for scale-up decisions rather than losing its signal entirely.
```yaml
apiVersion: datadoghq.com/v1alpha2
kind: DatadogPodAutoscaler
metadata:
  name: video-transcoder
  namespace: video
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: video-transcoder
  owner: Local
  applyPolicy:
    mode: Apply
    scaleDown:
      rules:
        - periodSeconds: 1200
          type: Percent
          value: 20
      stabilizationWindowSeconds: 600
    scaleUp:
      rules:
        - periodSeconds: 120
          type: Percent
          value: 50
      stabilizationWindowSeconds: 300
    update:
      strategy: Disabled
  constraints:
    minReplicas: 1
    maxReplicas: 20
  objectives:
    - type: CustomQuery
      customQuery:
        request:
          formula: queue_depth
          queries:
            - name: queue_depth
              source: Metrics
              metrics:
                query: avg:aws.sqs.messages_visible{queuename:video-processing,env:prod}.rollup(avg, 60)
        value:
          type: AbsoluteValue
          absoluteValue: 50
        window: 5m0s
  fallback:
    horizontal:
      enabled: true
      direction: ScaleUp
      objectives:
        - type: PodResource
          podResource:
            name: cpu
            value:
              type: Utilization
              utilization: 70
      triggers:
        staleRecommendationThresholdSeconds: 600
```
The window: 5m0s setting evaluates the query over a five-minute window, smoothing transient spikes.
Migrating From HPA and DatadogMetric
Migrating to DatadogPodAutoscaler from an HPA + DatadogMetric setup is straightforward.
Start by deploying the DatadogPodAutoscaler in mode: Preview:
```yaml
applyPolicy:
  mode: Preview
```
In this mode, Datadog surfaces recommendations in the Autoscaling Summary page in the Datadog UI without applying them; the existing HPA will remain in control.
For any workload with significant traffic variance, run DatadogPodAutoscaler in Preview for at least a week before switching to Apply, so that recommendations cover day-of-week patterns and provide a meaningful baseline for comparison.
Once the recommendations look sensible, switch to mode: Apply and delete the HPA. With the HPA gone, delete the DatadogMetric CRDs as well; the Cluster Agent no longer needs them.
The critical constraint here is that an HPA and a DatadogPodAutoscaler in Apply mode must not target the same workload simultaneously; the two controllers will fight over the replica count, and the resulting behaviour is unpredictable.
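As a concrete cutover sequence, with hypothetical resource names standing in for a real workload:

```bash
# 1. Flip the DatadogPodAutoscaler manifest to mode: Apply and re-apply it
kubectl apply -f checkout-api-autoscaler.yaml

# 2. Remove the HPA so only one controller owns the replica count
kubectl delete hpa checkout-api -n commerce

# 3. With the HPA gone, delete the DatadogMetric CRDs it referenced
kubectl delete datadogmetric checkout-api-requests -n commerce
```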
Verifying the Setup
Once DatadogPodAutoscaler is deployed, there are a few commands to confirm things are working:
```bash
# Inspect the autoscaler's current status and last applied recommendation
kubectl describe datadogpodautoscaler frontend -n production

# List all DatadogPodAutoscaler resources in the cluster
kubectl get datadogpodautoscalers --all-namespaces

# Check Cluster Agent logs for autoscaling activity
kubectl logs -n datadog deploy/datadog-cluster-agent | grep -i autoscal
```
The describe output includes a Status block showing the last recommended replica count, the last applied vertical adjustments, and any error conditions.
If the Cluster Agent cannot reach the Datadog API or Remote Configuration is not correctly configured, the status conditions will indicate this, though the messages are sometimes terse.
Checking the Cluster Agent logs alongside the resource status is usually necessary to get the full picture.
Monitoring Autoscaling Decisions in Production
With DatadogPodAutoscaler, Datadog is making decisions on a team’s behalf, so knowing what it is deciding, and why, is essential for both debugging and building confidence in the feature.
In addition to the Workload Scaling list view at app.datadoghq.com/orchestration/scaling/workload, a Datadog dashboard that plots replica count changes alongside the underlying metric values will provide a second layer of visibility.
For CPU-driven scaling, overlaying kubernetes.cpu.usage.total against the replica count of the target Deployment is usually sufficient. For custom query objectives, plotting the query result alongside replica count lets a team verify that scaling decisions are correlating with the signals they expect.
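As a sketch, the two series for a CPU-driven workload might look like the following in a dashboard; the metric names are standard Datadog Kubernetes metrics, and the tag values are placeholders for your own workload:

```text
# Replica count of the target Deployment
sum:kubernetes_state.deployment.replicas_available{kube_deployment:frontend}

# Average CPU usage across its pods
avg:kubernetes.cpu.usage.total{kube_deployment:frontend}
```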
Some useful monitors to put in place early are:
- Alert on the DatadogPodAutoscaler status.conditions field reaching an error state. This catches Remote Configuration connectivity issues, invalid query syntax, or Kubernetes State Core failures before they become silent problems.
- Monitor for a stale recommendation, where the lastUpdatedTime on the status has not changed for more than five minutes during a period when the workload is under load.
- For workloads using update.strategy: Auto, track the resource request values on pods over time to identify whether vertical adjustments are drifting in an unexpected direction.
Production Considerations
Stabilisation Windows and Scale-Down Conservatism
The scaleDown.stabilizationWindowSeconds value deserves careful tuning. If a workload has a pattern of short bursts followed by quiet periods, a stabilisation window that is too short will cause the autoscaler to shed replicas aggressively after each burst, only to scale back up when the next one arrives.
For most web-facing workloads, 600 seconds (ten minutes) is a sensible starting point for scale-down stabilisation. For background workers where utilisation tends to drop rapidly after a batch completes, a shorter window may be appropriate to reclaim capacity more quickly.
Resource Constraints for Vertical Scaling
When using update.strategy: Auto, Datadog will adjust pod resource requests based on observed usage.
It is worth ensuring containers have both requests and limits defined before enabling vertical updates, and reviewing what Preview mode recommendations suggest before switching to Apply.
If containers do not have limits set, Datadog may recommend requests that differ significantly from what the scheduler was previously placing pods with, which can produce unexpected outcomes.
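A minimal baseline of the kind worth having in place before enabling vertical updates might look like this; the values are illustrative rather than recommendations:

```yaml
containers:
  - name: analytics-worker
    image: analytics-worker:1.4.2 # hypothetical image tag
    resources:
      requests:
        cpu: 500m
        memory: 512Mi
      limits:
        cpu: "1"
        memory: 1Gi
```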
Interaction With Pod Disruption Budgets
Vertical updates will trigger pod restarts as pods cycle.
If a workload has a PodDisruptionBudget that restricts simultaneous unavailability to a single pod, vertical updates will proceed gradually, which is usually the desired behaviour, but on a large deployment it can take considerable time for updated resource requests to propagate across all replicas. Factor this into expectations when evaluating the feature in staging.
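For reference, a PodDisruptionBudget of the kind described, restricting disruption to one pod at a time (names and labels are illustrative):

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-api
  namespace: commerce
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: checkout-api
```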
Wrapping Up
The DatadogPodAutoscaler consolidates what previously required several separate pieces (an HPA, a DatadogMetric CRD, and careful RBAC configuration around both) into a single, self-contained definition per workload.
That said, the Remote Configuration dependency is a real constraint, not a footnote. Teams operating in environments where outbound connectivity is restricted, or where egress to remote-config.datadoghq.com cannot be guaranteed, should check that before going further. The HPA path from the earlier posts in this series remains the right choice in those environments, and it continues to be actively supported.
In this post, we covered:
- The Remote Configuration dependency as an architectural constraint, and how to verify it before starting setup.
- How to verify the Kubernetes State Core integration and Admission Controller status before enabling the feature.
- Enabling the feature via Helm and the Datadog Operator.
- The four main scaling configurations: horizontal CPU, vertical-only, multi-dimensional, and custom metric query.
- Using Preview mode to evaluate recommendations before committing, and how to migrate safely from an HPA and DatadogMetric setup.
- Monitoring autoscaling decisions in production, including the Workload Scaling UI and useful alerting patterns.
- Production considerations around stabilisation windows, vertical updates, and pod disruption budgets.
The next post in this series takes a step back from Datadog’s own tooling and looks at KEDA, a CNCF project that approaches the same problem of metric-driven autoscaling from a different architectural direction. Understanding where the two sit relative to each other, and what requirements push a team towards one or the other, is what the following posts address.
The Datadog Kubernetes Autoscaling documentation covers the full set of available fields and configuration options, and is a good reference as you move beyond the patterns described here.
Frequently Asked Questions
Does the DatadogPodAutoscaler also scale the underlying nodes?
No, DatadogPodAutoscaler scales pods only; it does not provision or deprovision the underlying node pool. Once the cluster’s nodes are saturated, horizontal scale-up events will produce pending pods rather than running ones, regardless of how well the autoscaler is configured. The standard complement is a node autoscaler, and Karpenter is now the most widely adopted choice for this on AWS. Karpenter watches for unschedulable pods and provisions nodes to match their resource requirements, typically within seconds. Ensuring your NodePool definitions are sized appropriately for the workloads that DatadogPodAutoscaler will be scaling is a worthwhile step before enabling the feature across a fleet.
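For context, a minimal Karpenter NodePool under the karpenter.sh/v1 API looks roughly like this; the requirements and limits are placeholders to be sized against the workloads DatadogPodAutoscaler will scale:

```yaml
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general-purpose
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand"]
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
  limits:
    cpu: "200" # cap on total CPU the pool may provision
```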
Is there a limit to how many workloads can be autoscaled?
Datadog Kubernetes Autoscaling supports a maximum of 1,000 DatadogPodAutoscaler resources per cluster. For most environments this will not be a constraint, but it is worth knowing on large clusters with many small workloads.
This blog is part of our Kubernetes Autoscaling with Datadog series; we recommend reading the rest of the posts in the series:
- Kubernetes Autoscaling: Getting Started with Datadog
- Kubernetes Autoscaling with Datadog External Metrics
- Kubernetes Autoscaling: Using the DatadogPodAutoscaler
