Patching Windows VMs with GCP’s VM Manager

Overview

In this blog post I share some of my experience using GCP's VM Manager to patch one of our customer's Windows VMs.

Introduction

Whilst I am a huge fan of short lived, immutable VMs with system state turned over to managed services like Cloud SQL, sometimes this simply isn’t practical or possible.

In these situations we are often left with long running, stateful instances that require the same sort of maintenance as ‘traditional’ infrastructure. But how do we manage these without the pain and ancillary infrastructure this often requires?

Recently I have been working with a client that has a stateful .NET application running on top of Windows Server 2016 GCE instances. The client tasked me with determining a patching strategy for these VMs to ensure stability and, more importantly, security.

Now I have plenty of experience patching Windows servers from previous jobs and have seen it done both well and badly. Some particularly bad examples that stick out to me are:

  • Logging in once a month to spend half a day patching a PCI CDE manually
  • Having business critical, clustered services with different nodes running at entirely different patch levels (and not just for a short testing window!)
  • Requiring nearly a fortnight of application testing in non-production before roll-out could be authorized for production.

Over the course of my career I have had the pleasure (or misfortune) to have worked with both WSUS and SCCM. These tools, however, can be complex and temperamental, and in the case of SCCM require deep pockets.

Generally I have found the best patching strategies to follow these ideas:

  • Automated management with little human interaction; people get busy, and patching is too often the task that gets kicked down the road to the next month.
  • Regular patching to ensure the baseline is not far behind the latest releases. In my experience, having too much drift can cause support problems with vendors for whom the default response is “ensure you are running the latest patches”.
  • Deploying to non-production environments before production. I like to think this goes without saying but in the (hopefully unlikely) event a patch causes issues it should be caught in non-prod first.

Anyway, back to my client’s request. In April Google launched their OS patch management service to the masses. This is a tool I was curious about but hadn’t had a requirement to implement at the time. This, though, felt like the right opportunity to test it.

Prerequisites and Setup

Initially I wanted to see what the tool could show me with the OS Inventory Management functionality before committing to patching. As I did this in Terraform, here is my code:

resource "google_compute_project_metadata_item" "guest_attributes" {
  key   = "enable-guest-attributes"
  value = "TRUE"
}

resource "google_compute_project_metadata_item" "osconfig" {
  key   = "enable-osconfig"
  value = "TRUE"
}

resource "google_project_service" "osconfig" {
  service            = "osconfig.googleapis.com"
  disable_on_destroy = false
}

The above is really the result of the setup requirements found in the Google documentation, but in summary it requires a couple of metadata values to be set and the OS Config API to be enabled.
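If you only want to enrol a subset of VMs rather than the whole project, the same metadata keys can instead be set at the instance level (instance metadata takes precedence over project metadata). A sketch, where the instance name, machine type and network are placeholder assumptions:

```hcl
# Hypothetical instance with OS Config enabled per instance
# rather than via project-wide metadata
resource "google_compute_instance" "win_app" {
  name         = "win-app-1"
  machine_type = "n1-standard-2"
  zone         = "europe-west2-a"

  boot_disk {
    initialize_params {
      # Google-maintained Windows Server 2016 image family
      image = "windows-cloud/windows-server-2016-dc"
    }
  }

  network_interface {
    network = "default"
  }

  # Same keys as the project-wide metadata above
  metadata = {
    enable-osconfig         = "TRUE"
    enable-guest-attributes = "TRUE"
  }
}
```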

The next step is to ensure that the OS you want to monitor has the required agent. Fortunately, as the instances used a recent Google-baked 2016 image, I could skip this step. If, however, you aren't so fortunate, installation instructions are here.

At this point I also want to state that Automatic Windows Updates had been disabled previously by setting a registry key in the startup script using the PowerShell below.

# Disable Automatic Updates so patching is controlled by VM Manager
Set-ItemProperty -Path HKLM:\SOFTWARE\Policies\Microsoft\Windows\WindowsUpdate\AU -Name AUOptions -Value 1


Compliance Graphs

Like most ops guys I do like a pretty graph (and could probably stare at Grafana dashboards for hours!) and in this regard Google doesn’t disappoint with clear reporting on a pie chart, even broken down by OS.

patch management dashboard 

For the eagle-eyed among you, the three VMs reporting back no data are actually GKE nodes running Google’s Container-Optimized OS, which Google is responsible for maintaining.

By selecting view details and then selecting a specific VM, I can get a breakdown of available patches, including their categories, KB numbers and when they were published.

VM instance details

Turning Insight into Action

So, satisfied with the insight I now had into the VMs, I was keen to try out the functionality to apply patches. This is done by creating a Google OS Patch Deployment, which can be done in the Console or with Terraform.

resource "google_os_config_patch_deployment" "win-patch" {
  patch_deployment_id = "win-patch"

  instance_filter {
    group_labels {
      labels = {
        win-patch = "true"
      }
    }
    zones = ["europe-west2-a", "europe-west2-b", "europe-west2-c"]
  }

  patch_config {
    reboot_config = "DEFAULT"
    windows_update {
      classifications = ["CRITICAL", "SECURITY", "DEFINITION"]
    }
  }

  duration = "3600s"

  recurring_schedule {
    time_zone {
      id = "Europe/London"
    }
    time_of_day {
      hours = 2
    }
    weekly {
      day_of_week = var.win_patch_day
    }
  }

  rollout {
    mode = "ZONE_BY_ZONE"
    disruption_budget {
      fixed = 5
    }
  }
}

The Terraform documentation for this resource is quite an interesting read. There are, for example, options available for pre- and post-patching scripts, which may be very useful in environments where automated testing exists.
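As a sketch of what those hooks look like on Windows, the patch_config block accepts optional pre_step and post_step blocks; the script paths below are hypothetical:

```hcl
patch_config {
  reboot_config = "DEFAULT"

  # Hypothetical local scripts; both steps are optional
  pre_step {
    windows_exec_step_config {
      interpreter = "POWERSHELL"
      local_path  = "C:\\patching\\pre-patch.ps1"
    }
  }

  post_step {
    windows_exec_step_config {
      interpreter = "POWERSHELL"
      local_path  = "C:\\patching\\post-patch.ps1"
    }
  }
}
```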

To break down the code above, it:

  • Targets VMs with the label win-patch = “true” that exist in any of the three europe-west2 zones.
  • Sets the reboot config to default, which in Windows terms means only rebooting if required.
  • Schedules a weekly patch run at 2AM on a day of the week I have variablised (to allow me to set different days for non-prod and prod environments).
  • Defines a rollout plan which allows up to 5 VMs to be disrupted simultaneously. (In theory you can put a percentage in this field instead, but I had difficulty doing so.)
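For reference, the variable behind var.win_patch_day can be declared along these lines; the default shown here is an assumption, and the provider expects upper-case day names:

```hcl
variable "win_patch_day" {
  type        = string
  description = "Day of week for the weekly Windows patch run"
  default     = "THURSDAY" # e.g. MONDAY, SATURDAY, ...
}
```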

As another quick note, the above resource is new and was only added in provider version 3.30.0. I was required to update to a newer provider as we were a couple of versions behind.
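If you need to pin or bump the provider yourself, a minimal constraint looks like this (Terraform 0.13+ syntax assumed):

```hcl
terraform {
  required_providers {
    google = {
      source  = "hashicorp/google"
      version = ">= 3.30.0" # google_os_config_patch_deployment added in 3.30.0
    }
  }
}
```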

Reviewing Patch Jobs

When a patch job is executed, its progress can be watched in real time or, more likely (with patching done in the early hours), reviewed the following morning within VM Manager.

Reviewing Patch Jobs

It is also possible to drill even further into the logs on specific machines to see how the job progressed. As you can see this job also required a reboot which the machine executed automatically.

Operations Log

Concluding Thoughts

Simply put, I am a fan. This solution has enabled me to deploy Windows patches in an automated, reasonably (though not completely) controlled way at minimal to no cost. However, I did have to make some compromises and assumptions:

  • Microsoft typically releases patches on the 2nd Tuesday of the month, in what has affectionately been coined “Patch Tuesday”. I have therefore configured my schedule so that non-production environments get patched on Thursday, with production the following Monday. In theory this means non-production will always be patched first, unless Microsoft releases patches outside their usual routine.
  • I am also trusting the Windows update source configured on my machines to be ‘safe’. As it is left as the default, this should be Microsoft (or potentially Google). Google themselves state that for absolute control they recommend deploying a WSUS server as part of the OS patch management solution. I felt that in this environment this would be overkill and would require additional unwelcome management.

Finally, I hope this post has given some food for thought and at least presented another method to help maintain long-lived GCP VMs.
