KubeVirt vs vSphere: The Operational Gap Teams Miss Before Migration
What vCenter provides that KubeVirt does not ship with by default, and the operational gaps to close before migrating off VMware.
Broadcom’s acquisition of VMware changed the licensing landscape, and organisations that were previously content running vSphere are now re-evaluating their virtualisation strategy. KubeVirt is receiving serious attention from architecture teams that had not previously considered it; the combination of an open-source licence, a Kubernetes-native model, and a maturing upstream project makes it a plausible option for teams looking to exit the VMware ecosystem. We have written before about why a VMware migration is just the beginning; this post looks at the operational layer that decision exposes.
The evaluation conversation tends to focus on two questions:
- Does KubeVirt do what vSphere does?
- What does the total cost of ownership look like?
Both are worth asking. However, there is a quieter and considerably harder problem that tends to surface after the proof of concept is complete: the operational surface area that vCenter handled quietly for years is simply not there in a default KubeVirt deployment.
Understanding that gap before a migration begins makes the difference between a considered platform transition and an extended stabilisation period. That is the gap this post sets out to address.
In short, for those weighing the decision more than implementing it:
- vCenter bundles VM lifecycle management, performance monitoring, hierarchical RBAC, live migration, CPU and memory management, and backup integration into a single product.
- KubeVirt is capable of all of these, but each becomes a deliberate engineering decision as opposed to a feature that ships by default.
- The cost teams tend to underestimate is not the licence fee, but the operational layer that has to be designed, built, and owned.
- These decisions also interact with one another, so the order in which they are made matters as much as the decisions themselves.
What Does vCenter Actually Provide?
Before diving into what is missing in KubeVirt, it is worth defining precisely what it is that vCenter provides. The list that follows will cover the capabilities most teams have come to rely on without fully accounting for the work of recreating them.
Unified VM lifecycle management:
- The vSphere web client provides a graphical interface for creating, configuring, cloning, migrating, and retiring virtual machines.
- Operations that take a few clicks in vCenter require kubectl and virtctl commands in KubeVirt, and the conceptual model is substantially different from what VMware administrators are accustomed to.
Built-in performance monitoring and alerting:
- vCenter provides per-VM and per-host metrics for CPU, memory, disk I/O, and network throughput, with historical trending and configurable alerts.
- There is no equivalent in upstream KubeVirt without assembling a monitoring stack from separate components.
Mature RBAC and multi-tenancy:
- vSphere’s permission model is hierarchical: roles are applied at the datacenter, cluster, resource pool, folder, or individual resource level, and permissions propagate down the hierarchy.
- Kubernetes RBAC is powerful, but it does not map cleanly onto this model, and the namespace-based isolation requires some deliberate upfront design.
Live migration orchestration:
- vMotion is a well-understood capability with broad storage and network compatibility.
- KubeVirt supports live migration, but the prerequisites around storage access modes and network configuration are non-trivial, and they can frequently catch teams off guard during proof of concept work.
Snapshot and backup integration:
- VMware’s vStorage APIs for Data Protection (VADP) provide a standardised interface that backup vendors have been building against for years.
- Kubernetes does provide CSI VolumeSnapshot resources as a building block, but the mature ecosystem of backup vendor integrations that VADP enabled has no direct equivalent, and assembling a comparable backup pipeline requires both deliberate tooling choices and operational investment.
CPU and memory management:
- vSphere provides DRS to automatically balance VM workloads across hosts based on utilisation, with configurable automation levels and per-VM resource reservations, limits, and shares.
- Hot-adding CPU and memory to running VMs without a reboot is a standard capability that many guest workloads depend on for live scaling.
Cluster-wide resource visibility:
- vCenter provides a consolidated view of capacity utilisation, allocation, and historical trends across the entire virtualisation estate.
- Kubernetes does provide some of this via kubectl top and resource quotas, but not at the same depth without additional tooling.
None of these capabilities are impossible to achieve in a KubeVirt environment. The issue is more that none of them are provided out of the box, and teams frequently do not budget for the engineering work required to build them.
Where Does the Operational Gap Surface in Practice?
What follows draws on our own hands-on work with upstream KubeVirt, supplemented by community knowledge and upstream documentation, with the aim of giving a realistic picture of what this gap can look like in practice.
What Does VM Lifecycle Look Like Without a GUI?
In a KubeVirt environment, virtual machines are defined as VirtualMachine custom resources; the running instances are represented as VirtualMachineInstance objects. Day-to-day operations involve a combination of kubectl for resource management and virtctl for VM-specific actions (such as starting, stopping, restarting, and accessing the console).
virtctl is a capable tool, and once understood, the underlying model is coherent. The difficulty is the shift in mental model it asks of VMware administrators. Common friction points in early adoption include:
- understanding the relationship between a VirtualMachine resource and its backing DataVolume,
- navigating console access via virtctl console or virtctl vnc,
- and debugging situations where a VirtualMachineInstance shows as running but is not reachable on the network.
To paint a picture of the basic operational cycle, starting a VM, checking its status, and accessing its console looks like this:
$ virtctl start my-vm -n production
VM my-vm was scheduled to start
$ kubectl get vmi -n production my-vm
NAME AGE PHASE IP NODENAME READY
my-vm 2m Running 10.244.2.140 node-worker-2 True
$ virtctl console my-vm -n production
Successfully connected to my-vm console. The escape sequence is ^]
Cloning is another area where vSphere habits do not transfer well. Cloning a VM in vCenter is done using a guided wizard with a handful of configuration screens. In KubeVirt, cloning typically involves creating a new DataVolume from an existing PersistentVolumeClaim using the CDI (Containerized Data Importer) clone API; this requires understanding the DataVolume spec, cross-namespace clone permissions, as well as the CDI import lifecycle. It is not difficult once familiar, but to someone coming from the vSphere toolset, it is not self-evident.
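As a rough sketch of what such a clone request can look like, the manifest below asks CDI for a new DataVolume cloned from an existing PersistentVolumeClaim in another namespace. The names (golden-images, fedora-base, my-vm-clone-rootdisk) and the size are hypothetical, and the cross-namespace case assumes the requester has already been granted clone permission on the source namespace.

```yaml
# Illustrative only: all names and sizes below are placeholders.
apiVersion: cdi.kubevirt.io/v1beta1
kind: DataVolume
metadata:
  name: my-vm-clone-rootdisk
  namespace: production
spec:
  source:
    pvc:
      # Existing PVC that holds the disk to be cloned.
      namespace: golden-images
      name: fedora-base
  storage:
    resources:
      requests:
        # Should be at least the size of the source PVC.
        storage: 50Gi
```

A new VirtualMachine would then reference the resulting volume by name, which is the step the vCenter wizard hides.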
There is also no GUI-driven template catalogue equivalent to vCenter’s. Upstream KubeVirt does provide the building blocks (instancetypes and preferences capture reusable CPU, memory, and guest configuration), and recent releases have added VirtualMachineTemplate tooling via virtctl.
What it does not provide is the self-service catalogue of base images that vSphere administrators are used to browsing and deploying from. Teams that want that experience will need to build it, typically via a GitOps workflow that manages DataVolume source references alongside VirtualMachine definitions, layered over the instancetype and preference primitives.
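To make the instancetype and preference primitives concrete, here is a minimal sketch of a VirtualMachine that takes its sizing and guest defaults from cluster-wide objects instead of inline CPU and memory values. The names u1.medium and fedora assume the upstream common-instancetypes bundle is installed, and my-vm-clone-rootdisk is a hypothetical, pre-existing DataVolume.

```yaml
# Sketch of a Git-managed VM definition; names are assumptions.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: my-vm
  namespace: production
spec:
  runStrategy: Halted          # created stopped; start with virtctl
  instancetype:
    kind: VirtualMachineClusterInstancetype
    name: u1.medium            # CPU and memory sizing
  preference:
    kind: VirtualMachineClusterPreference
    name: fedora               # guest-specific device defaults
  template:
    spec:
      domain:
        devices: {}
      volumes:
        - name: rootdisk
          dataVolume:
            name: my-vm-clone-rootdisk
```

Kept in a repository and reconciled by a GitOps tool, a set of files like this is the closest upstream analogue to a template catalogue.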
What Does the Observability Stack Need to Cover?
KubeVirt exposes Prometheus metrics covering CPU usage, memory availability, network throughput, and storage I/O at the VirtualMachineInstance level. Assembling a minimal observability stack for a KubeVirt environment requires Prometheus to scrape the KubeVirt metrics endpoint, Grafana for visualisation, and dashboards written against the KubeVirt metric naming conventions.
The KubeVirt project does provide some reference Grafana dashboards, and the community has produced additional ones. Assembling a useful monitoring stack is achievable, but it requires some effort and a clear picture of what needs to be monitored.
In our experience, the most commonly overlooked component is the qemu-guest-agent. Without it installed inside each VM image, visibility into the guest operating system is limited to what Kubernetes and KubeVirt can observe from the outside: per-process CPU consumption, memory pressure from within the guest, and filesystem utilisation are not available.
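Baking the agent into the base image is usually the more robust route, but as an illustration, it can also be installed at first boot through cloud-init. The fragment below is a sketch that assumes a cloud-init enabled, systemd-based Linux guest with package repository access; the volume name cloudinit is arbitrary.

```yaml
# Fragment of spec.template.spec in a VirtualMachine (assumed guest:
# cloud-init enabled Linux with systemd).
domain:
  devices:
    disks:
      - name: cloudinit
        disk:
          bus: virtio
volumes:
  - name: cloudinit
    cloudInitNoCloud:
      userData: |
        #cloud-config
        packages:
          - qemu-guest-agent
        runcmd:
          - [systemctl, enable, --now, qemu-guest-agent]
```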
Even with a solid Prometheus and Grafana setup in place, cluster-wide capacity planning will require additional dashboard work. Understanding how VM density is trending across nodes, identifying which nodes are approaching resource limits before they become a problem, and correlating migration event frequency with performance degradation requires custom visualisation work that goes beyond the metrics KubeVirt exposes by default.
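One possible starting point for density trending is a Prometheus recording rule that pre-aggregates VMI counts per node. This sketch assumes the Prometheus Operator CRDs are installed and that the KubeVirt release in use exposes a kubevirt_vmi_phase_count metric with node and phase labels; metric names have changed between releases, so check them against the metrics endpoint of the running version. The rule and namespace names are hypothetical.

```yaml
# Assumes the Prometheus Operator; metric name should be verified
# against the KubeVirt version in use.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubevirt-capacity
  namespace: monitoring
spec:
  groups:
    - name: kubevirt-capacity.rules
      rules:
        # VMI count per node and phase, for density dashboards.
        - record: node_phase:kubevirt_vmi:count
          expr: sum by (node, phase) (kubevirt_vmi_phase_count)
```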
How Does RBAC and Multi-tenancy Differ From vSphere?
vSphere’s permission model is hierarchical and well-understood. Kubernetes RBAC treats the namespace as the primary isolation boundary, with ClusterRole and Role resources defining which operations are permitted on which resource types. The two models address similar problems, but the mapping between them is not one-to-one, and teams that try to replicate a vSphere permission structure in Kubernetes directly tend to run into friction.
A structural difference worth understanding is that Kubernetes RBAC is strictly additive; permissions can only be granted, never explicitly denied, and a subject holding multiple role bindings will accumulate all permissions from each.
A workable approach for VM workloads is to represent the primary tenancy boundary using namespaces, whether that is per team, per environment, or per project, and to define custom ClusterRole resources that reflect the actual operator personas needed.
A typical set includes a VM operator role with permission to start, stop, and console into a VirtualMachineInstance but not to delete or modify the VirtualMachine definition, and a VM administrator role with broader lifecycle permissions. A ClusterRole that captures those constraints looks like this:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: vm-operator
rules:
  - apiGroups: ["kubevirt.io"]
    resources: ["virtualmachines"]
    verbs: ["get", "list", "watch"]
  - apiGroups: ["subresources.kubevirt.io"]
    resources: ["virtualmachines/start", "virtualmachines/stop", "virtualmachines/restart"]
    verbs: ["update"]
  - apiGroups: ["subresources.kubevirt.io"]
    resources: ["virtualmachineinstances/console", "virtualmachineinstances/vnc"]
    verbs: ["get"]
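A ClusterRole grants nothing until it is bound. To scope the vm-operator role to a single tenant, it can be attached with a namespaced RoleBinding, as in the sketch below. The group name vm-operators and the production namespace are hypothetical and depend on how the cluster's identity provider presents groups.

```yaml
# Grants the vm-operator ClusterRole only inside one namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: vm-operators
  namespace: production
subjects:
  - kind: Group
    name: vm-operators        # placeholder group from the IdP
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: vm-operator
  apiGroup: rbac.authorization.k8s.io
```

Defining the role once at cluster scope and binding it per namespace is about as close as Kubernetes gets to vSphere's permission propagation: the role is shared, but every grant is explicit.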
Network isolation between namespaces requires explicit NetworkPolicy configuration or separate network segments via Multus CNI.
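As a minimal sketch of the NetworkPolicy side, the policy below restricts ingress for every pod in a namespace (including the virt-launcher pods that back VMs) to traffic originating in that same namespace. It assumes the cluster CNI enforces NetworkPolicy, and it applies only to the default pod network, not to secondary interfaces attached through Multus. The namespace name is a placeholder.

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-same-namespace-only
  namespace: production
spec:
  podSelector: {}             # every pod in this namespace
  policyTypes:
    - Ingress
  ingress:
    - from:
        - podSelector: {}     # only pods from the same namespace
```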
Storage isolation requires attention to StorageClass access controls. There is no equivalent to vSphere’s permission propagation for managing a large multi-tenant environment.
Teams that design their RBAC model after the platform is operational typically find that significant rework is required.
How Does CPU and Memory Management Translate From vSphere?
Kubernetes is built around a bin-packing scheduler: it places workloads based on requested CPU and memory, and assumes those workloads are broadly interchangeable from a scheduling perspective. For containerised applications, that assumption holds well. However, VM workloads have requirements that it does not naturally accommodate.
KubeVirt VMs declare CPU and memory in spec.domain.resources, and the Kubernetes scheduler places them accordingly. The gap opens up in three areas.
Dedicated CPU resources. KubeVirt supports CPU pinning via spec.domain.cpu.dedicatedCpuPlacement, which pins guest vCPUs to host physical CPUs using cgroup cpusets. This capability requires the Kubernetes CPU Manager to run with the static policy on the target nodes.
The relevant section of the spec looks like this (shown at the VirtualMachineInstance level; in a VirtualMachine it sits under spec.template.spec):
spec:
  domain:
    cpu:
      cores: 4
      dedicatedCpuPlacement: true
    resources:
      requests:
        memory: 8Gi
Without CPU pinning, guest vCPUs compete for host CPU time as ordinary threads, which introduces scheduling jitter that matters for latency-sensitive workloads. Getting this right requires understanding the interaction between the KubeVirt CPU topology spec, the Kubernetes CPU Manager policy, and the NUMA Topology Manager.
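On the node side, the static CPU Manager policy is set in the kubelet configuration. The fragment below is a sketch only: the reserved CPU range and the Topology Manager policy are assumptions that depend on the hardware, and how the kubelet configuration is delivered varies by distribution.

```yaml
# Kubelet configuration fragment for nodes that host pinned VMs.
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
cpuManagerPolicy: static
# The static policy needs some CPUs held back for system daemons;
# "0-1" is a placeholder.
reservedSystemCPUs: "0-1"
# Optional: keep a VM's CPUs on a single NUMA node.
topologyManagerPolicy: single-numa-node
```

Changing the CPU Manager policy on an existing node generally involves draining it and restarting the kubelet, which is another reason to settle this before workloads arrive.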
Hot-plug. KubeVirt has supported hot-adding CPU since v1.0 and memory since v1.1, but the operational workflow is less straightforward than the vCenter equivalent, and there is one characteristic of the current implementation that matters a great deal.
Hot-plug is applied by editing the running VirtualMachine, increasing the socket count for CPU or the guest memory value (provided the spec has been configured ahead of time with the relevant liveUpdate settings and a maxSockets or maxGuest ceiling). Crucially, the controller applies the change by live-migrating the workload, so the hot-plug operation is, in essence, a live migration.
What this means is that hot-plug inherits the live migration prerequisites covered in the next section, including the requirement for ReadWriteMany backing storage. A cluster that cannot live-migrate a VM cannot hot-plug its resources either.
The guest operating system also has to support hot-plug, and neither CPU nor memory hot-plug is currently available on ARM64. Teams migrating workloads that rely on live resource scaling should test this against their actual guest images, and confirm their storage supports live migration, early.
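To illustrate the ceilings mentioned above, the fragment below sketches a VirtualMachine domain that starts at 2 sockets and 4Gi and can later be edited upwards to the declared maximums. The values are arbitrary, and the cluster-level settings that enable live updates are omitted because they differ between KubeVirt releases.

```yaml
# Fragment of a VirtualMachine; values are illustrative.
spec:
  template:
    spec:
      domain:
        cpu:
          sockets: 2        # raise this value to hot-add vCPUs
          cores: 1
          threads: 1
          maxSockets: 8     # ceiling fixed at boot
        memory:
          guest: 4Gi        # raise this value to hot-add memory
          maxGuest: 16Gi    # ceiling fixed at boot
```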
Workload balancing. KubeVirt has no equivalent of the Distributed Resource Scheduler (DRS). While the Kubernetes Descheduler can be configured to rebalance VM workloads across nodes based on utilisation, it operates by evicting and rescheduling VirtualMachineInstance objects, which means it triggers a live migration if the storage configuration supports it (and a cold restart if it does not).
To approximate the policy flexibility available in DRS takes significant additional configuration, and the result is rarely as seamless.
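On the VM side, the pieces that let an eviction turn into a live migration can be sketched as follows. The evictionStrategy field is part of the KubeVirt API; the descheduler annotation is the opt-in documented upstream, though newer releases may set it automatically, so treat it as an assumption to verify. The Descheduler policy itself is omitted here.

```yaml
# Fragment of a VirtualMachine that opts in to descheduler-driven
# rebalancing through live migration.
spec:
  template:
    metadata:
      annotations:
        # Assumed opt-in marker; verify for the release in use.
        descheduler.alpha.kubernetes.io/evict: "true"
    spec:
      # Migrate instead of shutting down when the pod is evicted.
      evictionStrategy: LiveMigrate
```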
What Are the Prerequisites for Live Migration?
KubeVirt supports live migration, but the prerequisites are more demanding than many teams anticipate. The primary constraint is storage: live migration requires that the VM’s root disk be backed by a PersistentVolumeClaim with ReadWriteMany access mode. Many common block storage CSI drivers, particularly those using local or host-path volumes, do not support ReadWriteMany, and this limitation is frequently discovered late in a proof of concept. The PersistentVolumeClaim backing the VM root disk must declare ReadWriteMany access:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: vm-rootdisk
spec:
  accessModes:
    - ReadWriteMany
  # Ceph RBD offers ReadWriteMany only for raw block volumes.
  volumeMode: Block
  resources:
    requests:
      storage: 50Gi
  storageClassName: ceph-rbd
Storage backends that support RWX access and work reliably with KubeVirt live migration include Ceph RBD in block mode via the Ceph CSI driver, and shared filesystem solutions such as NFS or CephFS. The choice of storage backend is one of the most consequential early architecture decisions in a KubeVirt deployment, as retrofitting a different storage backend after VM images have been deployed is significantly more involved than choosing correctly at the outset.
Network configuration for live migration also warrants attention. KubeVirt supports running migrations over a dedicated network, which can keep migration traffic from competing with regular VM traffic during migration events. This is configured by creating a NetworkAttachmentDefinition in the namespace where KubeVirt is installed and referencing it from the KubeVirt CR under spec.configuration.migrations.network; the change will trigger a restart of the virt-handler pods so that subsequent migrations run over that network:
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    migrations:
      network: migration-network
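The NetworkAttachmentDefinition that the KubeVirt CR refers to might look like the sketch below. It assumes Multus is installed, that each node has a spare interface (eth1 here) on the migration segment, and that the whereabouts IPAM plugin is available; the interface name, CNI type, and address range are all placeholders.

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: migration-network
  namespace: kubevirt       # same namespace as the KubeVirt install
spec:
  # CNI configuration; interface, type, and range are assumptions.
  config: |
    {
      "cniVersion": "0.3.1",
      "name": "migration-bridge",
      "type": "macvlan",
      "master": "eth1",
      "mode": "bridge",
      "ipam": {
        "type": "whereabouts",
        "range": "10.1.1.0/24"
      }
    }
```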
This is distinct from the MigrationPolicy API, which is what you would reach for to tune the migration behaviour itself: bandwidth, auto-convergence, pre-copy versus post-copy, parallel migration limits, and timeouts. It is also how different settings are applied to different groups of VMs. The two are easy to conflate; the former controls where migration traffic flows, while the latter controls how the migration itself behaves. Without a dedicated migration network, live migration traffic shares bandwidth with VM workload traffic, which can cause latency spikes for running workloads during migration.
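For comparison, a MigrationPolicy that tunes behaviour for one group of VMs could be sketched like this. The label keys and values, the policy name, and the numbers are hypothetical; the API group is still alpha, so field availability should be checked against the installed release.

```yaml
apiVersion: migrations.kubevirt.io/v1alpha1
kind: MigrationPolicy
metadata:
  name: database-vms
spec:
  allowAutoConverge: true       # throttle the guest if it dirties memory too fast
  allowPostCopy: false          # stay with pre-copy
  bandwidthPerMigration: 64Mi   # per-migration cap, bytes per second
  completionTimeoutPerGiB: 120  # seconds allowed per GiB of data
  selectors:
    # The policy applies to VMIs matching these labels (placeholders).
    namespaceSelector:
      environment: production
    virtualMachineInstanceSelector:
      workload-type: database
```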
How Does Backup and Disaster Recovery Work Without VADP?
vSphere’s backup story is one most VMware administrators have never had to think too hard about. VADP has served as the de facto interface for backup vendors for long enough that the tooling is broad, the behaviour is well-understood, and integration with vCenter tends to be a configuration exercise rather than an engineering one.
KubeVirt does not come with an equivalent. The foundation most tooling builds on is the Kubernetes-native CSI VolumeSnapshot API. The most commonly used approach is Velero with the KubeVirt plugin, which can back up VirtualMachine resource definitions and associated PersistentVolumeClaims using CSI volume snapshots.
For consistent backups of stopped VMs, this works well. For running VMs, crash-consistent backups are achievable using CSI snapshots, but application-consistent backups require qemu-guest-agent integration to quiesce the guest filesystem before the snapshot is taken.
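The building block underneath is an ordinary CSI VolumeSnapshot. As a sketch, a manual snapshot of a VM root disk looks like the manifest below; the snapshot class name is a placeholder for whatever the storage driver provides. Taken against a running VM without freezing the guest filesystem, the result is crash-consistent only.

```yaml
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: vm-rootdisk-snap
  namespace: production
spec:
  volumeSnapshotClassName: csi-rbdplugin-snapclass   # placeholder
  source:
    persistentVolumeClaimName: vm-rootdisk
```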
What changes is not tooling availability, but who owns the operational layer around it. Backup vendors built scheduling, alerting, and retention on top of VADP as part of their product; in a Velero-based setup, that work sits with the platform team instead: creating Schedule resources for backup cadence, wiring up monitoring that alerts on failures, and maintaining a restore runbook that has been exercised against actual workloads. A daily backup Schedule for the production namespace would look like this:
apiVersion: velero.io/v1
kind: Schedule
metadata:
  name: kubevirt-vms-daily
  namespace: velero
spec:
  schedule: "0 02 * * *"
  template:
    includedNamespaces:
      - production
    storageLocation: default
    ttl: 720h0m0s
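The restore half of the runbook can be exercised with a Restore resource. The sketch below restores from the most recent successful backup of the kubevirt-vms-daily Schedule into a scratch namespace so that the running production VMs are untouched. The restore-test namespace is hypothetical, and whether restored VMs boot cleanly alongside their originals (for example, with duplicate MAC or IP addresses) is exactly what such an exercise is meant to find out.

```yaml
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: kubevirt-vms-restore-test
  namespace: velero
spec:
  # Uses the latest successful backup created by this Schedule.
  scheduleName: kubevirt-vms-daily
  includedNamespaces:
    - production
  # Restore into a scratch namespace instead of over production.
  namespaceMapping:
    production: restore-test
```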
What Does Closing the Operational Gap Look Like?
None of the gaps described above are beyond reach. The difference between a KubeVirt migration that lands well and one that extends well past its original timeline rarely comes down to tooling availability. It comes down to whether key architectural decisions were made before the platform was running, or in response to problems that surfaced after it was already carrying workloads.
VM lifecycle management needs a deliberate process to replace the GUI. The workflow question comes first: how VM definitions, templates, and lifecycle transitions are owned, reviewed, and audited in a team that previously managed these through a web client. The tooling choices will then follow from that.
Observability needs to be designed around VM workload requirements, not inherited from the container monitoring stack. The metrics that matter for VM operations (guest-level resource consumption, migration event impact, and per-node capacity trends) sit at the intersection of what KubeVirt can report from the hypervisor layer and what the guest agent can expose from inside the VM. Understanding where that boundary sits determines what the monitoring architecture needs to cover.
RBAC should reflect the actual tenancy model before the first workload lands. The namespace structure, the operator personas, and the network isolation boundaries all compound as the estate grows, and are difficult to rework without disrupting running VMs. The right model will entirely depend on the organisational structure and the degree of tenant isolation required; there is no one-size-fits-all pattern that transfers cleanly from vSphere.
CPU resource management requires understanding where the Kubernetes scheduling model diverges from VM workload requirements. For latency-sensitive workloads and anything that needs dedicated CPU access, the default bin-packing behaviour can create problems that are not visible until loads increase. Which configurations are viable will depend on the workload mix and the hardware topology of the cluster.
Storage architecture is the decision with the least room for error. Whether live migration is a hard operational requirement, which backends are available in the target infrastructure, and what the recovery expectations are for a failed node all combine to make this highly context-specific. Getting it wrong early is expensive, because retrofitting storage after the estate is populated is one of the more disruptive interventions a platform team can undertake.
Backup requires a tested restore, not just tooling in place. The tooling itself is mature enough; the gap is almost always that a Schedule exists and monitoring is wired up, but no one has actually run a restore against real workloads. A restore exercise is what will reveal whether the approach chosen actually meets the recovery time objective, and who is responsible for executing the runbook at 02:00 when needed.
The common shape across all of these is that the right answer is heavily context-dependent, and the decisions interact with each other. That interdependency is why the sequence and timing of these decisions matter as much as the decisions themselves.
Is KubeVirt the Right Call?
KubeVirt makes a compelling case when an organisation is already operating Kubernetes at scale and wants to consolidate virtualisation and container workloads under a single control plane. The composability of the stack, the ability to apply the same GitOps workflows to VMs and containers, and the absence of per-socket licensing are real advantages for teams that have the operational foundation in place.
When the primary motivation is escaping Broadcom’s licensing costs without an existing Kubernetes operational capability, it is a harder sell. In that scenario, a team may find itself trading a familiar operational burden for a larger and less familiar one. The work required to assemble a production-grade operational layer around KubeVirt is substantial, and underestimating it is one of the most common reasons KubeVirt migrations extend well beyond their initial estimates.
None of that argues against KubeVirt. It argues for starting the operational design work early, rather than deferring it until the platform is already running and carrying workloads.
This is the work we do. LiveWyer has built and run Kubernetes platforms since 2015, and the upstream KubeVirt work behind this post is part of how we help teams make this call before it gets expensive. We do not sell a hypervisor or resell a storage vendor, so the recommendation we land on is the one the workloads point to, not the one that suits a product line.
When a move off vSphere is on the table, our VMware Kubernetes Migration pilot puts it to the test against real workloads over three weeks, not theoretical benchmarks: the storage and live migration constraints, the observability that is actually needed, the RBAC and tenancy model, and a straight answer on whether the operational foundation is there yet. The result is a clear recommendation, to proceed, to wait, or to take another path, with the cost and skills picture to back it up. If the question is broader than the hypervisor, the Workload Modernisation pilot covers the move from VMs into native container workloads, and a Technical Review is the lighter way to talk it through first.
