Manually re-balance VMs on sole-tenant nodes with "Migrate within node group" maintenance policy set

Sole-tenant nodes are an important Google Cloud service for running workloads that require workload isolation or must comply with licensing requirements that demand dedicated infrastructure. A detailed description of what a sole-tenant node is and how it differs from general-fleet VMs can be found in the Compute Engine documentation.

Sole-tenant nodes support different maintenance policies, which can be selected based on workload requirements. A maintenance event occurs when Google needs to update the underlying infrastructure.

This article focuses on the Migrate within node group maintenance policy, which is primarily used in scenarios where a fixed-size group of nodes is required (e.g. licensing constraints). The sole-tenant node documentation has more information about the available maintenance policies and their use cases.

Challenge

As sole-tenant nodes are a good fit for long-running workloads, shifts in resource requirements over time, such as demand for new VMs or the decommissioning of existing ones, mean that the resource distribution within a node group may become suboptimal. While the Compute Engine scheduler aims for the highest possible bin-packing ratio, outside factors can make this impossible.

In rare cases this may lead to situations where more sole-tenant nodes are deployed than strictly necessary.

Unbalanced sole-tenant node group with Migrate within node group maintenance policy set

While node groups that do not have the Migrate within node group maintenance policy set allow manually live migrating VMs to other nodes in order to increase density, this is not possible for node groups that use the Migrate within node group policy. This is because manual live migration uses affinity labels to target individual nodes, which cannot be used in conjunction with this policy.

Workaround

When a VM is either shut down and restarted, or merely suspended and resumed, the scheduler makes a new placement decision and aims to maximize bin-packing.

While both of these actions incur downtime, suspend and resume retains the VM's state, and the unavailability is on the order of seconds.

Sole-tenant node group after manual re-balancing with one node that can be deleted

Proof of concept

The following script creates a sole-tenant node group with the Migrate within node group maintenance policy. It creates three nodes (two for running VMs and one to satisfy the maintenance policy requirements).

# Create node template
gcloud compute sole-tenancy node-templates create prod --node-type=n1-node-96-624

# Create node group
gcloud compute sole-tenancy node-groups create prod --node-template=prod --target-size=3 --maintenance-policy=migrate-within-node-group
Create node template and sole-tenant node group
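Before scheduling any VMs it is worth verifying that the group reached its target size. A quick check (the node names are auto-generated, so the exact output will vary):

```shell
# List the nodes in the new group to verify that all three nodes exist
gcloud compute sole-tenancy node-groups list-nodes prod
```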

The nodes running VMs are subsequently filled with VMs after which artificial gaps are created on both sole-tenant nodes.
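The affinity.json file referenced by the create command below is not shown in the original. A minimal sketch that pins the VMs to the prod node group, using the standard node-group affinity label, could be written like this:

```shell
# Write a node affinity file targeting the "prod" node group.
# compute.googleapis.com/node-group-name is the label Compute Engine
# applies to every node in a node group.
cat > affinity.json <<'EOF'
[
  {
    "key": "compute.googleapis.com/node-group-name",
    "operator": "IN",
    "values": ["prod"]
  }
]
EOF
```

Because the policy forbids targeting individual nodes, the affinity only names the group and leaves node selection to the scheduler.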

# Create smaller VMs
for number in 01 02 03 04 05 06 07 08 09 10 11 12; do
    gcloud compute instances create small-vm-$number \
        --machine-type=n1-highmem-16 \
        --node-affinity-file=affinity.json \
        --enable-display-device \
        --image=debian-11-bullseye-v20210817 \
        --image-project=debian-cloud \
        --boot-disk-size=10GB \
        --boot-disk-type=pd-balanced \
        --boot-disk-device-name=small-vm-$number \
        --shielded-secure-boot --shielded-vtpm --shielded-integrity-monitoring
done

# Create some holes by deleting some VMs
for number in 01 02 03 10 11 12; do
    gcloud compute instances delete small-vm-$number --quiet
done
Schedule VMs on both sole-tenant nodes then create artificial holes
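At this point the fragmentation can be observed directly. Describing the group shows the per-node state, and the deleted VMs appear as free capacity spread across both occupied nodes:

```shell
# Inspect the node group; the artificial holes show up as unused
# capacity on both nodes that still host VMs
gcloud compute sole-tenancy node-groups describe prod
gcloud compute sole-tenancy node-groups list-nodes prod
```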

Finally, the VMs on the second node are suspended and then resumed, causing the Compute Engine scheduler to place them on the first node to maximize resource utilization.

# Suspend VMs
for number in 07 08 09; do
    gcloud beta compute instances suspend small-vm-$number
done

# Resume VMs
for number in 07 08 09; do
    gcloud beta compute instances resume small-vm-$number
done
Suspend and resume VMs to show consolidation by the Compute Engine scheduler
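After the resume, one node should be empty and the group can be shrunk. The node name below is a placeholder; look up the actual name of the emptied node first:

```shell
# Find the node that no longer hosts any VMs
gcloud compute sole-tenancy node-groups list-nodes prod

# EMPTY_NODE is a placeholder; substitute the name from the output above
gcloud compute sole-tenancy node-groups delete-nodes prod --nodes=EMPTY_NODE
```

Note that deleting a node reduces the group below the size required by the Migrate within node group policy headroom, so keep the spare node if maintenance migrations should remain possible.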

Code

The full script is available on GitHub.