LiveWyer FinOps – Karpenter Implementation

Our enterprise client now uses Karpenter and AWS Spot Instances to achieve substantial cloud infrastructure savings

Key Results
  • Cost optimisation

    Forecast saving of $2.4 million annually.

  • Reduced operational effort

    The Karpenter implementation is forecast to save 6,000+ engineering hours over the next year.

  • Zero downtime

    Seamless integration with GitOps workflows and zero customer impact.

Challenges

Reduce cloud costs

The client struggled with high costs and operational inefficiencies due to their reliance on AWS EKS Node Groups. These rigid configurations required predefined instance types and frequent manual adjustments for changing workload needs, and prevented the use of cost-saving Spot Instances. Scaling was slow, because EKS Node Groups depend on EC2 Auto Scaling Groups, and managing AMI updates across clusters became a burden. Engineers had to work across both AWS and Kubernetes contexts, adding complexity and leading to higher costs, delays, and strained resources.

Solution

Cloud Platform FinOps

The client replaced AWS EKS Node Groups with Karpenter, a dynamic Kubernetes autoscaler, to optimise resource usage and reduce costs. Karpenter prioritises Spot Instances for cost efficiency, dynamically provisions instances to match workloads, and automates instance lifecycle management, significantly simplifying server patching across over 100 clusters. The solution, deployed via the client’s GitOps processes using Rancher and Fleet, eliminated manual intervention, enhanced agility, and reduced operational complexity, freeing up engineering resources for higher-value tasks.

Results

The implementation of Karpenter brought significant benefits to the client and their Platform Team. Engineers now have more time to focus on delivering new features, supporting a “Platform as a Product” approach to enhance value for software engineers. The effort required for AMI upgrades is forecast to fall by 6,000 engineering hours per year, while infrastructure cost savings are forecast at approximately $2.4 million annually across 130 clusters. Platform Team morale improved with reduced operational overhead, and the solution was implemented with no downtime, ensuring uninterrupted service.

Forecast as of Nov 2024

  • 2-month FinOps engagement
  • $2.4M infrastructure cost reduction
  • 6,000 engineer hours saved

Full Report

Introduction

LiveWyer has partnered with the client for many years, architecting and building their Kubernetes Platform. When the client faced rising infrastructure costs, LiveWyer was tasked with finding an elegant solution to reduce cloud infrastructure costs while optimising AWS Spot Instance usage. LiveWyer delivered through its technical FinOps services, designing and implementing an effective, cost-saving technical solution and an ongoing strategy.

Requirements & Challenges

The client encountered several challenges related to both costs and engineering effort, primarily stemming from their use of AWS EKS Node Groups and their inability to effectively leverage Spot Instances:

High cloud costs

  • The platform did not utilise AWS Spot Instances, a highly cost-efficient option that takes advantage of spare EC2 capacity, offering discounts of up to 90% compared to On-Demand pricing.
  • Spot Instances use a flexible, dynamic pricing model based on supply and demand, enabling significant savings. However, the client’s existing EKS Node Groups configuration prevented them from capitalising on these savings.

Rigid node group configuration

  • EKS Node Groups required predefined instance types at creation, making the setup static and unable to adapt to dynamic workload requirements, resulting in inefficiencies.
  • Changes to workload requirements—such as needing a different operating system, CPU architecture, or specialised hardware like GPUs—necessitated manual intervention from the platform team. This increased operational overhead and introduced delays.

Scaling limitations

  • EKS Node Groups are built on EC2 Auto Scaling Groups, which often struggled to adjust quickly to fluctuating capacity demands.
  • These slow response times caused suboptimal performance during sudden changes in workload.

Operational complexity

  • Node Groups required extensive configuration, including specifying launch templates and AMI versions. Updating or replacing nodes with newer AMIs was a tedious and error-prone task.
  • Engineers were required to work across both AWS and Kubernetes environments, increasing complexity. Node Groups were managed using AWS APIs, with Kubernetes itself not fully utilised for managing the nodes.

Management overhead

  • Managing multiple clusters and Node Groups created operational difficulties. Each workload change required manual adjustments, reducing scalability and flexibility.
  • AMI updates were particularly burdensome due to the intricate setup, further compounding operational challenges.

These issues not only limited the client’s ability to effectively reduce cloud costs but also strained their engineering resources, diminishing agility and delaying critical updates and workload optimisations.

Solution design & implementation

The solution replaced AWS EKS Node Groups with Karpenter, a flexible and intelligent Kubernetes cluster autoscaler. This significantly improved resource efficiency and reduced operational overhead.

Key features of the Karpenter implementation

Dynamic and cost-effective scaling

  • Karpenter dynamically provisions EC2 instances tailored to the workloads running in the cluster, selecting from a wide variety of instance types.
  • It prioritises cost savings by leveraging Spot Instances wherever possible, ensuring efficient use of resources while staying within budget (see the configuration sketch below).
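
To make this concrete, the snippet below is a minimal sketch of how such a NodePool could be expressed, applied here with the Kubernetes Python client. The NodePool name, instance-category choices, CPU limit and the referenced “default” EC2NodeClass are illustrative assumptions rather than the client’s actual configuration, and the field names follow the Karpenter v1 API.

```python
# Illustrative sketch only: a Karpenter NodePool that allows both Spot and
# On-Demand capacity across a broad range of instance shapes. Names, limits
# and the referenced EC2NodeClass are assumptions, not the client's setup.
from kubernetes import client, config

nodepool = {
    "apiVersion": "karpenter.sh/v1",
    "kind": "NodePool",
    "metadata": {"name": "general-purpose"},  # hypothetical NodePool name
    "spec": {
        "template": {
            "spec": {
                "nodeClassRef": {
                    "group": "karpenter.k8s.aws",
                    "kind": "EC2NodeClass",
                    "name": "default",  # assumed EC2NodeClass name
                },
                "requirements": [
                    # Allow Spot, with On-Demand as a fallback when Spot
                    # capacity is unavailable.
                    {"key": "karpenter.sh/capacity-type",
                     "operator": "In", "values": ["spot", "on-demand"]},
                    # Keep the instance choice broad so Karpenter can pick
                    # the cheapest type that fits the pending pods.
                    {"key": "karpenter.k8s.aws/instance-category",
                     "operator": "In", "values": ["c", "m", "r"]},
                    {"key": "kubernetes.io/arch",
                     "operator": "In", "values": ["amd64", "arm64"]},
                ],
            }
        },
        # Cap the total capacity this NodePool may provision.
        "limits": {"cpu": "1000"},
    },
}

config.load_kube_config()  # or load_incluster_config() inside a cluster
client.CustomObjectsApi().create_cluster_custom_object(
    group="karpenter.sh", version="v1", plural="nodepools", body=nodepool
)
```

With both capacity types allowed, Karpenter will generally favour the cheaper Spot capacity and fall back to On-Demand only when Spot is unavailable.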

Adaptive resource allocation

  • Unlike EKS Node Groups, Karpenter continually assesses workload requirements and adjusts infrastructure accordingly, providing the best match for current needs without requiring manual intervention (see the example below).
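
As an illustration, the hypothetical Deployment below expresses its needs purely through standard Kubernetes scheduling constraints (an arm64 node selector and resource requests); with Karpenter in place, matching capacity is provisioned from the pending pods rather than by reshaping a node group. The names, image and resource figures are placeholders.

```python
# Illustrative sketch only: a workload that requires arm64 nodes. Karpenter
# observes the unschedulable pods and provisions suitable capacity; no node
# group changes are needed. Image and resource values are placeholders.
from kubernetes import client, config

deployment = {
    "apiVersion": "apps/v1",
    "kind": "Deployment",
    "metadata": {"name": "arm64-service", "namespace": "default"},
    "spec": {
        "replicas": 3,
        "selector": {"matchLabels": {"app": "arm64-service"}},
        "template": {
            "metadata": {"labels": {"app": "arm64-service"}},
            "spec": {
                # Standard scheduling constraints are all that is required;
                # Karpenter launches arm64 capacity to satisfy them.
                "nodeSelector": {"kubernetes.io/arch": "arm64"},
                "containers": [{
                    "name": "app",
                    "image": "registry.example.com/app:1.0",  # placeholder
                    "resources": {
                        "requests": {"cpu": "500m", "memory": "512Mi"},
                    },
                }],
            },
        },
    },
}

config.load_kube_config()
client.AppsV1Api().create_namespaced_deployment(
    namespace="default", body=deployment
)
```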

Streamlined upgrades

  • Regular server patching, a key requirement for the client, was simplified with Karpenter. Previously, the Platform Team spent months updating base images across over 100 clusters. Karpenter automates instance lifecycle management, significantly reducing this effort (see the sketch below).
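
A minimal sketch of how this can be expressed is shown below: an EC2NodeClass whose AMI selection tracks an Amazon-published image alias, so refreshed base images roll out through node replacement rather than in-place patching. The IAM role, discovery tags and the “al2023@latest” alias are assumptions for illustration, and the field names follow the karpenter.k8s.aws v1 API, which may differ in older Karpenter releases.

```python
# Illustrative sketch only: an EC2NodeClass that tracks the latest Amazon
# Linux 2023 AMI. The role name and discovery tags are placeholders, not
# the client's actual configuration.
from kubernetes import client, config

ec2_node_class = {
    "apiVersion": "karpenter.k8s.aws/v1",
    "kind": "EC2NodeClass",
    "metadata": {"name": "default"},  # matches the NodePool sketch above
    "spec": {
        "role": "KarpenterNodeRole-example",  # hypothetical IAM role
        # Selecting an alias rather than a fixed AMI ID means newly
        # provisioned nodes come up on the current image; older nodes are
        # then retired as part of normal node lifecycle management.
        "amiSelectorTerms": [{"alias": "al2023@latest"}],
        "subnetSelectorTerms": [
            {"tags": {"karpenter.sh/discovery": "example-cluster"}}  # assumed tag
        ],
        "securityGroupSelectorTerms": [
            {"tags": {"karpenter.sh/discovery": "example-cluster"}}
        ],
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_cluster_custom_object(
    group="karpenter.k8s.aws", version="v1", plural="ec2nodeclasses",
    body=ec2_node_class,
)
```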

GitOps integration

The implementation was seamlessly rolled out using the client’s GitOps processes, managed via Rancher and Fleet. This ensured a smooth, consistent deployment across clusters, aligning with existing workflows.
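
As a sketch of how such a rollout can be wired up, the example below registers a Fleet GitRepo that syncs Karpenter manifests from Git out to a labelled group of downstream clusters. The repository URL, path and cluster label are placeholders, not the client’s actual GitOps layout.

```python
# Illustrative sketch only: a Fleet GitRepo resource that tells Rancher/Fleet
# to deploy the manifests under a Git path to matching downstream clusters.
# Repository URL, path and labels are placeholders.
from kubernetes import client, config

gitrepo = {
    "apiVersion": "fleet.cattle.io/v1alpha1",
    "kind": "GitRepo",
    "metadata": {"name": "karpenter-config", "namespace": "fleet-default"},
    "spec": {
        "repo": "https://git.example.com/platform/karpenter-config",  # placeholder
        "branch": "main",
        "paths": ["nodepools/"],  # assumed directory of Karpenter manifests
        "targets": [
            # Roll out to every downstream cluster carrying this label.
            {"clusterSelector": {"matchLabels": {"env": "production"}}}
        ],
    },
}

config.load_kube_config()
client.CustomObjectsApi().create_namespaced_custom_object(
    group="fleet.cattle.io", version="v1alpha1", namespace="fleet-default",
    plural="gitrepos", body=gitrepo,
)
```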

By adopting Karpenter, the client achieved dynamic scaling, reduced cloud costs through efficient resource utilisation, and alleviated the operational complexity of managing and upgrading nodes across multiple clusters. This enhanced agility and freed up engineering resources for higher-value tasks.

Conclusion

The implementation of Karpenter significantly benefited both the client and their Platform Team. Engineers now have more time to focus on developing new features, aligning with a “Platform as a Product” mindset that adds greater value for software engineers.

Operationally, AMI upgrade efforts were reduced by an estimated 6,000 hours per year, and cloud infrastructure cost savings reached approximately $2.4 million annually across 130 clusters. Team morale improved due to reduced operational overhead, and the platform maintained zero downtime, ensuring uninterrupted service for its users.

Book a meeting with a Platform Specialist