LiveWyer blog • Engineering • 13min read

KubeVirt Isn't vSphere: Closing the Operational Gap Before You Migrate

Broadcom’s licensing changes have put KubeVirt firmly on the architecture agenda. The feature and cost questions are well covered; this post is about a less-discussed problem: the operational surface area that vCenter handled quietly for years, and what it takes to build it back.

Written by: Louise Champ

Broadcom’s acquisition of VMware changed the licensing landscape decisively, and organisations that were previously content running vSphere are now actively re-evaluating their virtualisation strategy. KubeVirt is receiving the kind of serious attention from architecture teams that it has not had before; the combination of an open-source licence, a Kubernetes-native model, and a maturing upstream project makes it a plausible candidate for teams looking to exit the VMware ecosystem.

The evaluation conversation tends to focus on two questions: does KubeVirt do what vSphere does, and what does the total cost of ownership look like? Both are worth asking. There is, however, a less visible and considerably harder problem that tends to surface after the proof of concept is complete: the operational surface area that vCenter handled quietly for years is simply not present in a default KubeVirt deployment. Understanding that gap before the migration begins is the difference between a considered platform transition and an extended stabilisation period. That is what this post is about.

What vCenter Actually Does For You

Before discussing what is missing in KubeVirt, it is worth being precise about what vCenter provides. The list that follows covers capabilities most teams have come to rely on without fully accounting for the work of recreating them.

Unified VM lifecycle management. The vSphere web client provides a graphical interface for creating, configuring, cloning, migrating, and retiring virtual machines. Operations that take a few clicks in vCenter require kubectl and virtctl commands in KubeVirt, and the conceptual model is substantially different from what VMware administrators are accustomed to.

Built-in performance monitoring and alerting. vCenter ships with per-VM and per-host metrics for CPU, memory, disk I/O, and network throughput, with historical trending and configurable alerts. There is no equivalent in upstream KubeVirt without assembling a monitoring stack from separate components.

Mature RBAC and multi-tenancy. vSphere’s permission model is hierarchical: roles are applied at the datacenter, cluster, resource pool, folder, or individual resource level, and permissions propagate down the hierarchy. Kubernetes RBAC is powerful, but it does not map cleanly onto this model, and the namespace-based isolation requires deliberate upfront design.

Live migration orchestration. vMotion is a well-understood capability with broad storage and network compatibility. KubeVirt supports live migration, but the prerequisites around storage access modes and network configuration are non-trivial, and they frequently catch teams off guard during proof of concept work.

Snapshot and backup integration. VMware’s vStorage APIs for Data Protection (VADP) provide a standardised interface that backup vendors have been building against for years. KubeVirt has no native equivalent, and assembling a comparable backup pipeline requires deliberate tooling choices and operational investment.

CPU and memory management. vSphere provides DRS to automatically balance VM workloads across hosts based on utilisation, with configurable automation levels and per-VM resource reservations, limits, and shares. Hot-adding CPU and memory to running VMs without a reboot is a standard capability that many guest workloads depend on for live scaling.

Cluster-wide resource visibility. vCenter provides a consolidated view of capacity utilisation, allocation, and historical trends across the entire virtualisation estate. Kubernetes provides some of this via kubectl top and resource quotas, but not at the same depth without additional tooling.

None of these capabilities are impossible to achieve in a KubeVirt environment. The issue is that none of them are provided out of the box, and teams frequently do not budget for the engineering work required to build them.

The Operational Gap in Practice

What follows draws on our own hands-on work with upstream KubeVirt, supplemented by community knowledge and upstream documentation, with the aim of giving a realistic picture of what this gap looks like in practice.

VM Lifecycle Without a GUI

In a KubeVirt environment, virtual machines are defined as VirtualMachine custom resources; the running instances are represented as VirtualMachineInstance objects. Day-to-day operations involve a combination of kubectl for resource management and virtctl for VM-specific actions such as starting, stopping, restarting, and accessing the console.

virtctl is a capable tool, and the underlying model is coherent once understood. The difficulty is the shift required of VMware administrators. Common friction points in early adoption include: understanding the relationship between a VirtualMachine resource and its backing DataVolume, navigating console access via virtctl console or virtctl vnc, and debugging situations where a VirtualMachineInstance shows as running but is not reachable on the network.
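To make the relationship between a VirtualMachine and its backing DataVolume concrete, here is a minimal sketch that builds a VirtualMachine manifest as JSON, which kubectl accepts alongside YAML. The names web-01, team-a, and web-01-root are hypothetical, and the field set is deliberately small; treat it as an illustration to check against the KubeVirt API reference for your release, not a production definition.

```python
import json


def virtual_machine(name: str, namespace: str, data_volume: str, memory: str = "2Gi") -> dict:
    """Build a minimal VirtualMachine manifest whose root disk is a DataVolume."""
    return {
        "apiVersion": "kubevirt.io/v1",
        "kind": "VirtualMachine",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # "Halted" keeps the VM defined but stopped until it is started.
            "runStrategy": "Halted",
            "template": {
                "spec": {
                    "domain": {
                        "devices": {
                            "disks": [{"name": "rootdisk", "disk": {"bus": "virtio"}}]
                        },
                        "resources": {"requests": {"memory": memory}},
                    },
                    # The volume name must match the disk name above; the
                    # DataVolume it references provides the backing PVC.
                    "volumes": [{"name": "rootdisk", "dataVolume": {"name": data_volume}}],
                }
            },
        },
    }


# Hypothetical names, for illustration only.
vm = virtual_machine("web-01", "team-a", "web-01-root")
print(json.dumps(vm, indent=2))
```

The output can be piped to kubectl apply -f -, after which virtctl start web-01 -n team-a and virtctl console web-01 -n team-a cover the start and console steps described above. If the DataVolume named in the volumes list does not exist, the VM will not boot, which is one of the more common early surprises.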

Cloning is another area where vSphere habits do not transfer well. Cloning a VM in vCenter is a guided wizard with a handful of configuration screens. In KubeVirt, cloning typically involves creating a new DataVolume from an existing PersistentVolumeClaim using the CDI (Containerized Data Importer) clone API; this requires understanding the DataVolume spec, cross-namespace clone permissions, and the CDI import lifecycle. It is not difficult once familiar, but it is not self-evident to someone coming from the vSphere toolset.
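As a rough illustration of the clone path, the sketch below builds a DataVolume whose source is an existing PersistentVolumeClaim. The golden-images namespace, the ubuntu-22-04 claim name, and the 20Gi size are assumptions made up for the example; the cross-namespace clone permissions mentioned above are not shown and would need to be granted separately.

```python
import json


def clone_data_volume(name: str, namespace: str, source_pvc: str,
                      source_namespace: str, size: str) -> dict:
    """Build a DataVolume that asks CDI to clone an existing PVC."""
    return {
        "apiVersion": "cdi.kubevirt.io/v1beta1",
        "kind": "DataVolume",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            # The source PVC may live in another namespace, subject to RBAC.
            "source": {"pvc": {"name": source_pvc, "namespace": source_namespace}},
            # The target must be at least as large as the source volume.
            "storage": {"resources": {"requests": {"storage": size}}},
        },
    }


# Hypothetical golden image and target names.
dv = clone_data_volume("web-01-root", "team-a", "ubuntu-22-04", "golden-images", "20Gi")
print(json.dumps(dv, indent=2))
```

Keeping the golden images in one namespace and cloning from it is one way to approximate a template library; the same function can sit behind a GitOps pipeline that renders one DataVolume per VM.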

There is also no native template library. Teams that want a catalogue of base images need to build that capability, typically via a GitOps workflow that manages DataVolume source references alongside VirtualMachine definitions.

Observability

KubeVirt exposes Prometheus metrics covering CPU usage, memory availability, network throughput, and storage I/O at the VirtualMachineInstance level. A minimal observability stack for a KubeVirt environment requires Prometheus scraping the KubeVirt metrics endpoint, Grafana for visualisation, and dashboards written to the KubeVirt metric naming conventions.
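For a sense of what dashboards written against those conventions look like, here is a small sketch that generates PromQL expressions for a namespace. The metric and label names are the ones we understand recent upstream releases to expose; they have changed between versions, so treat them as assumptions and confirm them against the metrics list for the release in use.

```python
def vmi_queries(namespace: str, window: str = "5m") -> dict:
    """Return PromQL expressions for common per-VMI panels in one namespace."""
    selector = f'{{namespace="{namespace}"}}'
    return {
        # Gauge: memory the guest still has available.
        "memory_available_bytes": f"kubevirt_vmi_memory_available_bytes{selector}",
        # Counters need rate() to become per-second values.
        "network_rx_bytes_per_s": (
            f"sum by (name) (rate(kubevirt_vmi_network_receive_bytes_total{selector}[{window}]))"
        ),
        "storage_read_iops": (
            f"sum by (name) (rate(kubevirt_vmi_storage_iops_read_total{selector}[{window}]))"
        ),
    }


# "team-a" is a hypothetical namespace.
queries = vmi_queries("team-a")
for panel, expr in queries.items():
    print(f"{panel}: {expr}")
```

Generating the expressions from one place makes it easier to update every panel when a metric is renamed upstream.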

The KubeVirt project provides some reference Grafana dashboards, and the community has produced additional options. Assembling a useful monitoring stack is achievable, but it requires genuine effort and a clear picture of what to monitor. In our experience, the most commonly overlooked component is the qemu-guest-agent. Without it installed inside each VM image, visibility into the guest operating system is limited to what Kubernetes and KubeVirt can observe from the outside; guest-reported details such as filesystem utilisation and operating system information are not available without it. Finer-grained data, such as per-process CPU consumption or memory pressure as the guest sees it, typically needs an exporter running inside the guest as well.

Even with a solid Prometheus and Grafana setup in place, cluster-wide capacity planning requires additional dashboard work. Understanding how VM density is trending across nodes, identifying which nodes are approaching resource limits before they become a problem, and correlating migration event frequency with performance degradation requires custom visualisation work that goes beyond the metrics KubeVirt exposes by default.
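The kind of aggregation this implies can be sketched in a few lines. The example below takes a list of VM placements and per-node allocatable memory, and reports VM density and the requested-memory ratio per node. The input shape, node names, and the 0.7 threshold are all hypothetical; in practice the inputs would come from the Kubernetes API or from Prometheus.

```python
from collections import defaultdict


def node_memory_report(vmis: list, allocatable_gib: dict, threshold: float = 0.8) -> dict:
    """Summarise VM count and requested-memory ratio for each node."""
    requested = defaultdict(float)
    counts = defaultdict(int)
    for vmi in vmis:
        requested[vmi["node"]] += vmi["memory_gib"]
        counts[vmi["node"]] += 1

    report = {}
    for node, capacity in allocatable_gib.items():
        ratio = requested[node] / capacity
        report[node] = {
            "vms": counts[node],
            "requested_ratio": round(ratio, 2),
            # Flag nodes before they run out of headroom for migrations.
            "over_threshold": ratio >= threshold,
        }
    return report


# Hypothetical placements and node sizes.
vmis = [
    {"name": "db-01", "node": "node-a", "memory_gib": 32},
    {"name": "web-01", "node": "node-a", "memory_gib": 16},
    {"name": "web-02", "node": "node-b", "memory_gib": 16},
]
report = node_memory_report(vmis, {"node-a": 64, "node-b": 64}, threshold=0.7)
print(report)
```

Tracking the same ratio over time, rather than as a point-in-time value, is what turns this into capacity planning.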

RBAC and Multi-tenancy

vSphere’s permission model is hierarchical and well-understood. Kubernetes RBAC operates on namespaces as the primary isolation boundary, with ClusterRole and Role resources defining what operations are permitted on which resource types. The two models address similar problems, but the mapping between them is not one-to-one, and teams that try to replicate a vSphere permission structure in Kubernetes directly tend to run into friction.

A workable approach for VM workloads is to use namespaces to represent the primary tenancy boundary, whether that is per team, per environment, or per project, and to define custom ClusterRole resources that reflect the actual operator personas needed. A typical set includes a VM operator role with permission to start, stop, and console into a VirtualMachineInstance but not to delete or modify the VirtualMachine definition, and a VM administrator role with broader lifecycle permissions.
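A sketch of the VM operator persona described above might look like the following. The role name, the team-a namespace, and the group name are hypothetical. The API groups and subresource names reflect our reading of the roles KubeVirt ships by default, so they are worth verifying against your installed version.

```python
import json


def vm_operator_cluster_role(name: str = "vm-operator") -> dict:
    """ClusterRole that can view, start, stop, and console into VMs, but not edit or delete them."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "ClusterRole",
        "metadata": {"name": name},
        "rules": [
            {   # Read-only access to the VM objects themselves.
                "apiGroups": ["kubevirt.io"],
                "resources": ["virtualmachines", "virtualmachineinstances"],
                "verbs": ["get", "list", "watch"],
            },
            {   # Power operations are exposed as subresources.
                "apiGroups": ["subresources.kubevirt.io"],
                "resources": ["virtualmachines/start", "virtualmachines/stop",
                              "virtualmachines/restart"],
                "verbs": ["update"],
            },
            {   # Console and VNC access to running instances.
                "apiGroups": ["subresources.kubevirt.io"],
                "resources": ["virtualmachineinstances/console",
                              "virtualmachineinstances/vnc"],
                "verbs": ["get"],
            },
        ],
    }


def bind_in_namespace(role_name: str, namespace: str, group: str) -> dict:
    """RoleBinding that scopes the ClusterRole to a single tenant namespace."""
    return {
        "apiVersion": "rbac.authorization.k8s.io/v1",
        "kind": "RoleBinding",
        "metadata": {"name": f"{role_name}-{group}", "namespace": namespace},
        "roleRef": {"apiGroup": "rbac.authorization.k8s.io", "kind": "ClusterRole",
                    "name": role_name},
        "subjects": [{"apiGroup": "rbac.authorization.k8s.io", "kind": "Group",
                      "name": group}],
    }


role = vm_operator_cluster_role()
binding = bind_in_namespace("vm-operator", "team-a", "team-a-operators")
print(json.dumps([role, binding], indent=2))
```

Defining the persona once as a ClusterRole and binding it per namespace keeps the permission set consistent across tenants while leaving the tenancy boundary at the namespace.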

The rough edges are real. Network isolation between namespaces requires explicit NetworkPolicy configuration or separate network segments via Multus CNI. Storage isolation requires attention to StorageClass access controls. There is no equivalent to vSphere’s permission propagation for managing a large multi-tenant environment. Teams that design their RBAC model after the platform is operational typically discover that significant rework is required.
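As an example of the explicit configuration involved, the sketch below builds a NetworkPolicy that only admits ingress from pods in the same namespace. The team-a namespace is hypothetical. Note the assumptions: the policy is only enforced if the cluster CNI implements NetworkPolicy, and it governs the pod network; secondary interfaces attached via Multus are generally not covered by it.

```python
import json


def same_namespace_only(namespace: str) -> dict:
    """NetworkPolicy that restricts ingress to traffic from the same namespace."""
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "allow-same-namespace", "namespace": namespace},
        "spec": {
            # An empty selector matches every pod in the namespace,
            # including the virt-launcher pods that host VMs.
            "podSelector": {},
            "policyTypes": ["Ingress"],
            # A bare podSelector in "from" matches pods in this namespace only.
            "ingress": [{"from": [{"podSelector": {}}]}],
        },
    }


policy = same_namespace_only("team-a")
print(json.dumps(policy, indent=2))
```

Applying a policy like this to every tenant namespace at creation time is considerably easier than introducing it once workloads already depend on cross-namespace traffic.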

CPU and Memory Management

Kubernetes is built around a request-based, bin-packing style of scheduling: it places workloads based on requested CPU and memory, not observed utilisation, and assumes those workloads are broadly interchangeable from a scheduling perspective. For containerised applications, that assumption holds well. VM workloads have requirements that it does not naturally accommodate.

KubeVirt VMs declare CPU and memory in spec.domain.resources, and the Kubernetes scheduler places them accordingly. The gap opens up in three areas.

Dedicated CPU resources. KubeVirt supports CPU pinning via spec.domain.cpu.dedicatedCpuPlacement, which pins guest vCPUs to host physical CPUs using cgroup cpusets. This requires the Kubernetes CPU Manager running with the static policy on the target nodes. Without CPU pinning, guest vCPUs compete for host CPU time as normal POSIX threads, which matters for latency-sensitive workloads. Getting this right requires understanding the interaction between the KubeVirt CPU topology spec, the Kubernetes CPU Manager policy, and the Kubernetes Topology Manager, which governs NUMA alignment.
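The two halves of that configuration can be sketched side by side: the domain section of the VM spec, and the kubelet settings on the nodes that will host pinned VMs. The core count and memory size are hypothetical, and the single-numa-node policy is one possible choice, not a recommendation. Memory requests and limits are set equal here on the assumption that the launcher pod needs Guaranteed QoS to receive exclusive CPUs.

```python
import json


def pinned_domain(cores: int, memory: str) -> dict:
    """Domain section requesting dedicated host CPUs for every vCPU."""
    return {
        "cpu": {"cores": cores, "dedicatedCpuPlacement": True},
        "resources": {
            # Equal requests and limits keep the pod in the Guaranteed QoS class.
            "requests": {"memory": memory},
            "limits": {"memory": memory},
        },
    }


# KubeletConfiguration fields for nodes that host pinned VMs. The static
# policy also requires some CPU to be reserved for system use (not shown).
KUBELET_SETTINGS = {
    "cpuManagerPolicy": "static",
    "topologyManagerPolicy": "single-numa-node",
}

domain = pinned_domain(cores=4, memory="8Gi")
print(json.dumps({"domain": domain, "kubelet": KUBELET_SETTINGS}, indent=2))
```

The domain dictionary slots into spec.template.spec.domain of a VirtualMachine. If the kubelet on the target node is not running the static policy, the VM will typically fail to schedule, which is a useful early signal that node configuration and VM spec are out of step.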

Hot-plug. KubeVirt supports hot-adding CPU and memory to a running VirtualMachineInstance in recent upstream releases, but the operational workflow is less straightforward than the vCenter equivalent. It requires the guest operating system to support hot-plug, the VirtualMachine spec to be configured to allow it, and the operation to be triggered via virtctl or a patch to the running instance. Teams migrating workloads that rely on live resource scaling should test this against their actual guest images early.
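As a hedged illustration of the patch route, the sketch below builds JSON Patch documents for changing the socket count and guest memory. The paths follow the upstream hot-plug documentation as we understand it, where the change is made to the VirtualMachine template and KubeVirt applies it to the running instance; the exact paths, and the cluster-level settings that enable live updates, vary by release and should be confirmed before relying on this.

```python
import json


def cpu_sockets_patch(sockets: int) -> list:
    """JSON Patch that changes the number of CPU sockets in the VM template."""
    return [{
        "op": "replace",
        "path": "/spec/template/spec/domain/cpu/sockets",
        "value": sockets,
    }]


def guest_memory_patch(memory: str) -> list:
    """JSON Patch that changes the memory presented to the guest."""
    return [{
        "op": "replace",
        "path": "/spec/template/spec/domain/memory/guest",
        "value": memory,
    }]


cpu_patch = cpu_sockets_patch(4)
mem_patch = guest_memory_patch("8Gi")
print(json.dumps(cpu_patch))
print(json.dumps(mem_patch))
```

Each printed document can be passed to kubectl patch vm NAME --type=json -p '...'. A "replace" operation fails if the field is not already set in the spec, so the VM definition needs to declare sockets and guest memory explicitly from the start.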

Workload balancing. There is no KubeVirt equivalent of DRS. The Kubernetes Descheduler can be configured to rebalance VM workloads across nodes based on utilisation, but it operates by evicting and rescheduling VirtualMachineInstance objects, which triggers a live migration if the storage configuration supports it and a cold restart if it does not. The policy flexibility available in DRS takes significant additional configuration to approximate, and the result is rarely as seamless.

Live Migration

KubeVirt supports live migration, but the prerequisites are more demanding than many teams anticipate. The primary constraint is storage: live migration requires that the VM’s root disk be backed by a PersistentVolumeClaim with ReadWriteMany access mode. Many common block storage CSI drivers, particularly those using local or host-path volumes, do not support ReadWriteMany, and this limitation is frequently discovered late in a proof of concept.
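A check like the following, run against the claims in a proof-of-concept namespace, surfaces the problem early. It is a minimal sketch that inspects PersistentVolumeClaim manifests as plain dictionaries; the claim names are hypothetical, and a real version would read the claims from the Kubernetes API.

```python
def supports_live_migration(pvc: dict) -> bool:
    """Return True if the claim requests the ReadWriteMany access mode."""
    return "ReadWriteMany" in pvc.get("spec", {}).get("accessModes", [])


def blocking_claims(pvcs: list) -> list:
    """Names of claims that would prevent their VM from live migrating."""
    return [p["metadata"]["name"] for p in pvcs if not supports_live_migration(p)]


# Hypothetical claims: one shared block volume, one node-local volume.
pvcs = [
    {"metadata": {"name": "db-01-root"},
     "spec": {"accessModes": ["ReadWriteMany"], "volumeMode": "Block"}},
    {"metadata": {"name": "web-01-root"},
     "spec": {"accessModes": ["ReadWriteOnce"], "volumeMode": "Filesystem"}},
]
blocked = blocking_claims(pvcs)
print(blocked)
```

The access mode is only the first gate: the storage backend also has to honour it correctly under concurrent attachment, which is why testing an actual migration on the chosen backend remains necessary.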

Storage backends that support RWX access and work reliably with KubeVirt live migration include Ceph RBD in block volume mode via the Ceph CSI driver, and shared filesystem solutions such as NFS or CephFS. The choice of storage backend is one of the most consequential early architecture decisions in a KubeVirt deployment. Retrofitting a different storage backend after VM images have been deployed is significantly more involved than choosing correctly at the outset.

Network configuration for live migration also warrants attention. KubeVirt supports a dedicated migration network interface, which prevents migration traffic from competing with regular VM traffic during migration events. Configuring this correctly requires understanding Multus CNI and the migration settings in the KubeVirt custom resource, where the dedicated network is referenced; the MigrationPolicy API separately controls per-workload migration behaviour such as bandwidth limits. Without a dedicated migration network, live migration traffic shares bandwidth with VM workload traffic, which can cause latency spikes for running workloads during migration.
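In the upstream documentation as we read it, the dedicated network is selected by naming a Multus NetworkAttachmentDefinition in the KubeVirt custom resource. The sketch below builds a merge patch for that setting. The migration-network name and the 64Mi bandwidth cap are hypothetical values, and the NetworkAttachmentDefinition itself is assumed to exist already in the namespace where KubeVirt is installed.

```python
import json


def migration_network_patch(nad_name: str, bandwidth: str = "64Mi") -> dict:
    """Merge patch pointing KubeVirt migrations at a dedicated Multus network."""
    return {
        "spec": {
            "configuration": {
                "migrations": {
                    # Name of an existing NetworkAttachmentDefinition.
                    "network": nad_name,
                    # Cap per-migration throughput to protect the link.
                    "bandwidthPerMigration": bandwidth,
                }
            }
        }
    }


patch = migration_network_patch("migration-network")
print(json.dumps(patch))
```

With a default upstream install, the output could be applied with kubectl patch kubevirt kubevirt -n kubevirt --type=merge -p '...'; the resource name and namespace differ between distributions, so check both before applying.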

Backup and Disaster Recovery

vSphere’s backup ecosystem is mature and well-integrated. VADP provides a standardised interface that backup vendors have built against for years, resulting in broad commercial and open-source tooling with well-understood behaviour.

KubeVirt has no native equivalent. The most commonly used approach is Velero with the KubeVirt plugin, which can back up VirtualMachine resource definitions and associated PersistentVolumeClaims using CSI volume snapshots. This works well for consistent backups of stopped VMs. For running VMs, crash-consistent backups are achievable via CSI snapshots, but application-consistent backups require qemu-guest-agent integration to quiesce the guest filesystem before the snapshot is taken.

The operational discipline required is also different from what vSphere teams are accustomed to. A Velero-based backup strategy requires building the policy management layer: Velero Schedule resources to define backup cadence, monitoring configured to alert on backup failures, and a tested restore runbook that the operations team has exercised against real workloads before it is needed in production.
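The cadence part of that policy layer can be sketched as a Velero Schedule resource. The schedule name, the team-a namespace, the 02:00 cron expression, and the seven-day retention are hypothetical; the sketch also assumes Velero is installed in the velero namespace with CSI snapshot support configured.

```python
import json


def velero_schedule(name: str, namespaces: list, cron: str, ttl: str = "168h0m0s") -> dict:
    """Velero Schedule that backs up the given namespaces on a cron cadence."""
    return {
        "apiVersion": "velero.io/v1",
        "kind": "Schedule",
        "metadata": {"name": name, "namespace": "velero"},
        "spec": {
            "schedule": cron,
            # The template is the Backup spec used for each scheduled run.
            "template": {
                "includedNamespaces": namespaces,
                "snapshotVolumes": True,
                # Backups older than the TTL are garbage-collected.
                "ttl": ttl,
            },
        },
    }


schedule = velero_schedule("team-a-nightly", ["team-a"], "0 2 * * *")
print(json.dumps(schedule, indent=2))
```

A schedule only proves that backups are being taken. The alerting and restore rehearsal described above are what establish whether they are usable.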

Closing the Gap: What Good Looks Like

None of the gaps described above are beyond reach. The difference between a KubeVirt migration that lands well and one that extends well past its original timeline is typically not the availability of tooling; it is whether the key architectural decisions were made before the platform was running or in response to problems that surfaced after it was already carrying workloads.

VM lifecycle management needs a deliberate process to replace the GUI. The workflow question comes first: how VM definitions, templates, and lifecycle transitions are owned, reviewed, and audited in a team that previously managed this through a web client. The tooling choices follow from that.

Observability needs to be designed around VM workload requirements, not inherited from the container monitoring stack. The metrics that matter for VM operations, covering guest-level resource consumption, migration event impact, and per-node capacity trends, sit at the intersection of what KubeVirt can report from the hypervisor layer and what the guest agent can expose from inside the VM. Understanding where that boundary sits, and what is invisible without crossing it, determines what the monitoring architecture needs to cover.

RBAC should reflect the actual tenancy model before the first workload lands. The namespace structure, the operator personas, and the network isolation boundaries all compound as the estate grows and are difficult to rework without disrupting running VMs. The right model is entirely dependent on the organisational structure and the degree of tenant isolation required; there is no universal pattern that transfers cleanly from vSphere.

CPU resource management requires understanding where the Kubernetes scheduling model diverges from VM workload requirements. For latency-sensitive workloads and anything that needs dedicated CPU access, the default bin-packing behaviour creates problems that are not visible until load increases. Which configurations are viable depends on the workload mix and the hardware topology of the cluster.

Storage architecture is the decision with the least room for error. Whether live migration is a hard operational requirement, which backends are available in the target infrastructure, and what the recovery expectations are for a failed node all combine to make this highly context-specific. Getting it wrong early is expensive; retrofitting storage after the estate is populated is one of the more disruptive interventions a platform team can face.

Backup and recovery requires a tested restore process, not just tooling in place. The tooling landscape for KubeVirt backup is mature enough; the gap is almost always operational: whether the restore procedure has been exercised, whether the recovery time objective is achievable with the chosen approach, and who owns the runbook when it is needed.

The common shape across all of these is that the right answer is heavily context-dependent, and the decisions interact with each other. A storage choice constrains live migration options; an RBAC model constrains how tenant isolation can be extended later; an observability architecture that does not account for guest-level visibility creates blind spots that are expensive to close after the fact. That interdependency is why the sequence and timing of these decisions matter as much as the decisions themselves.

Is KubeVirt the Right Call?

KubeVirt makes a compelling case when an organisation is already operating Kubernetes at scale and wants to consolidate virtualisation and container workloads under a single control plane. The composability of the stack, the ability to apply the same GitOps workflows to VMs and containers, and the absence of per-socket licensing are genuine advantages for teams that have the operational foundation in place.

It is a harder sell when the primary motivation is escaping Broadcom’s licensing costs without an existing Kubernetes operational capability. In that scenario, a team may find themselves trading a familiar operational burden for a larger and less familiar one. The work required to assemble a production-grade operational layer around KubeVirt is substantial, and underestimating it is one of the most common ways that KubeVirt migrations extend well beyond their initial estimates.

None of this is a reason to avoid KubeVirt; it is a reason to start the operational design work early rather than deferring it until the platform is already running.

Working Through This Together

We spend a lot of time on exactly this kind of migration work: helping teams move from vSphere to KubeVirt with the operational layer designed up front, not pieced together after the fact. If your organisation is in the evaluation or early migration phase, we are happy to talk through what that looks like for your specific environment and team structure. Get in touch and we can walk through the detail together.