Moving an Active K8s Cluster on QEMU VMs

Author: Sam Griffith

Alta3 Research has been using its own bare metal cloud for several years now, ever since our monthly Amazon bill outgrew our comfort zone. Because we give every student we teach their very own VMs (virtual machines) for the duration of the course, we constantly create, destroy, and move VMs around. We initially moved to an OpenStack cloud, but since we understand OpenStack at a very deep level and aren’t constrained to keeping the status quo, we decided to move towards creating our own fully automated, bare-metal cloud.

With a small team focused solely on doing-everything-that-comes-with-running-a-small-IT-business for the past six months, we have made significant progress. We create and destroy VMs in our cloud using Python and Ansible to control the QEMU hypervisor, a SQLite database maintains the state of the cloud, and we know how to move a single VM from one of our hosts to another with reasonable downtime for our students.

But this is not a complete product yet. There are a lot more features we want to add, and more documentation to write than we want to admit, before we even think about pushing out a public release. As we move through our development and testing phases, we keep discovering more features we need to design for, and more cool things we want to share with you.

This week, I started testing the feasibility of moving seven VMs at a time. Specifically, seven VMs that have a High-Availability Kubernetes cluster configured and running; this is the setup that we provide to students in our Kubernetes Bootcamp.

My goal was to explore an efficient way to transfer these VMs from one host to another, without causing any significant disruption to our students or their cluster. Although we don’t typically move any production VMs, we do want to know what the cost to us and our students would be if this situation became necessary.

Test 1: Rolling Update

Taking one of our student environments with all seven VMs, I first verified that I could see some pods with a kubectl get pods command (see the example after this list). This showed me that I:

  1. Had a working cluster.
  2. Had pods on each of my nodes.
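
A minimal way to check both at once is to add the -o wide flag, which includes the node each pod is scheduled on (the actual pod list will, of course, vary by cluster):

$ kubectl get pods --all-namespaces -o wide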

Then I decided to use a “Rolling Update” strategy to move all seven VMs. This meant that I would destroy and move one of them at a time, and verify that it was working as expected before doing the same for the next VM.

The kubectl machine

The first machine I tried this on was the machine we have dubbed “beachhead”, the VM that is the entry point for our students. It’s also where we have kubectl configured, external to the Kubernetes cluster, with an NGINX load balancer in place to manage connections to our three Kubernetes controllers.

The process of moving a virtual machine is fairly simple:

  • Stop the VM
  • Copy the VM’s image and any associated files to the new host
  • Start the VM again on the new host

With a combination of Python and Ansible, we have abstracted this task down to a single command.
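
For illustration, here is a rough manual sketch of those three steps, assuming a libvirt-managed QEMU guest and using hypothetical domain names and paths:

# On host01: gracefully stop the guest and export its definition
$ virsh shutdown student1-beachhead
$ virsh dumpxml student1-beachhead > student1-beachhead.xml

# Once the guest is down, copy the disk image and definition over
$ scp /var/lib/libvirt/images/student1-beachhead.qcow2 host02:/var/lib/libvirt/images/
$ scp student1-beachhead.xml host02:

# On host02: define and start the guest
$ virsh define student1-beachhead.xml
$ virsh start student1-beachhead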

Doing this, I found that after waiting the two-and-a-half minutes it takes to move the VM, everything came back as expected. The cluster was accessible, and the pods did not move or get restarted.

Note: I am using a somewhat more complicated route, first copying everything from host01 to my bastion host, and then on to host02. I chose this method so I didn’t need to keep any private keys on my production servers. I could easily cut the time in half if I simply copied from host01 to host02 directly.
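
Concretely, the bastion hop looks roughly like this (run from the bastion, with the same hypothetical paths as above), which is why every image crosses the network twice:

$ scp host01:/var/lib/libvirt/images/student1-beachhead.qcow2 /tmp/
$ scp /tmp/student1-beachhead.qcow2 host02:/var/lib/libvirt/images/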

I was not surprised that it worked, and not too displeased with the downtime. Most people can forgive a two- or three-minute blip. And as I mentioned earlier, we are not planning on needing to do this.

The Controllers

Next I tried to move each of the Controller machines, one at a time.

While the Controller VMs were down, I attempted to access my cluster through kubectl.

The first time, it worked. The second time, it worked.

The third time, it failed.

As soon as I hit Enter, I waited. And I waited. For 10 seconds I waited and then:

$ kubectl get pods --all-namespaces                         
Error from server: rpc error: code = Unavailable desc = transport is closing 
$

It tried to connect to the dead machine. It wasn’t there, so it could not establish a connection.

I tried the command again, and it worked.

Then it worked again. And again. And again!

All of this was working while the VM was still down.

I wasn’t sure why I was seeing this behavior, so I did a little bit of research.

It turns out that when NGINX is load balancing, it will still hand a request to a server that has gone down; the attempt has to fail (in my case, a roughly 10-second timeout) before NGINX gives up on it. By default, a single failed attempt is enough for NGINX to mark that server unavailable and take it out of the rotation for a cooldown period, so subsequent requests go straight to the healthy controllers.
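
For reference, here is a minimal sketch of the relevant configuration, assuming NGINX’s stream module is proxying the Kubernetes API (the addresses and port are illustrative; max_fails and fail_timeout are shown at their documented defaults):

stream {
    upstream k8s_controllers {
        # After max_fails failed attempts, a server is marked
        # unavailable for fail_timeout seconds.
        server 10.0.0.11:6443 max_fails=1 fail_timeout=10s;
        server 10.0.0.12:6443 max_fails=1 fail_timeout=10s;
        server 10.0.0.13:6443 max_fails=1 fail_timeout=10s;
    }

    server {
        listen 6443;
        proxy_pass k8s_controllers;
    }
}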

Effectively, this meant about 10 seconds of downtime for a student, and only if they were actively running kubectl commands at that moment.

The Nodes

I was most concerned about stopping and moving the VMs for the nodes. This is where the Kubernetes Pods live, and I wasn’t sure if they would come back.

Thankfully, when I ran this test, all of the Pods came back. I noticed the Pods showed a restart after the VM came back, but I saw no difference in my ability to issue kubectl commands or interact with the cluster.

I will admit I could have improved this test by actually watching the connections to the Pods that were affected. But then again, the application developer could have done a better job of using Deployments and Services to allow for a reasonable Pod Disruption Budget.
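
As a sketch (the name and label below are hypothetical), a Pod Disruption Budget can be created in one line to tell Kubernetes to keep at least one replica of an app running:

$ kubectl create poddisruptionbudget webapp-pdb --selector=app=webapp --min-available=1

Worth noting: a PDB only protects against voluntary disruptions, such as a kubectl drain, so draining a node before stopping its VM is what would let the budget actually do its job.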

Test 1 Conclusions

This is a slow but reasonable way to move the virtual machines of a Kubernetes cluster, and it should not impact the students’ ability to perform their labs. At most they should see their system go unresponsive for about three minutes.

Test 2: Obliteration

After verifying that my cluster was back to behaving normally, I decided to try a second strategy I call “obliteration”. Instead of moving one VM at a time, I tried moving all seven at once.

Part of our planned VM disruption budget is a guarantee that if one of our hosts goes down, it will not affect more than 15% of our students (which, spread evenly, means student environments live on at least seven hosts). To achieve this, we consolidate each student environment onto a single host, which means all seven VMs are copied from host01’s SSD RAID array to host02’s SSD RAID array.

Even with a 10 Gbps network between our hosts, we are severely limited by disk I/O when moving 18 GB (and doing it twice, thanks to the bastion hop). Thankfully, we don’t anticipate this being necessary in production.
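
A quick back-of-the-envelope check shows why the network is not the constraint: 18 GB copied twice is 36 GB, or roughly 288 gigabits, which a 10 Gbps link could move in about 29 seconds flat out. Anything dramatically slower than that is the storage arrays, not the wire.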

All this to say that moving these seven virtual machines took about 25 minutes. The entire environment was unusable for nearly half an hour. Not acceptable.

But at least it came back and was working as expected!

Conclusions

Out of the two tests, the clear winner was Test 1: Rolling Update. It only caused three minutes of downtime, compared to twenty-five minutes with Test 2.

Because of these tests, we have decided to design a way to sequentially move each of the VMs in a Kubernetes environment to minimize the impact that performing a VM migration may have on our students.

Side Note: An alternative route may be to use QEMU to do live migrations, but we haven’t tested this yet.
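
For reference, with a libvirt-managed guest a live migration can be driven in one command. This is a sketch we have not tested (the names are hypothetical, and --copy-storage-all would be needed because our hosts do not share storage):

$ virsh migrate --live --copy-storage-all student1-beachhead qemu+ssh://host02/system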

We look forward to sharing more of our findings with you as we continue to develop and test our product!