Thinking the cloud is magical with 100% uptime ? Think again.
Amazon will tomorrow begin a bloody global reboot of its Elastic Compute Cloud (EC2) compute instances after it found a security bug within the Xen virtualisation platform.
The rolling minutes-long reboots would be completed by 30 September. Amazon did not name the reason for the upgrade, widely thought to be a security issue affecting underlying hosts.
Tech site ITNews cited an unnamed source who said the reboot was due to an unspecified vulnerability in the open-source Xen-108 platform. Later, Xen and Amazon confirmed a fix for a non-disclosed security flaw is due to be released on October 1.
Reboots made prior to the patch blitz would not guarantee connection to a patched host unlike previous maintenance updates.
Thorsten von Eicken, founder of Rightscale which manages AWS work loads, said EC2 users should monitor their ‘events’ page within the AWS console for the most reliable updates.
“For instances where a short reboot is safe and acceptable, you don’t need to do anything: They will simply reboot during maintenance and stay on the same host with the same ephemeral disks and the same IP address,” von Eicken said.
“For databases, if you have set up the recommended master-slave configuration across AZs, you have the option to reboot the impacted AZ ahead of the maintenance window in an attempt to get an instance that is already patched.”
Back in 2011 Randy Bias has blogged about Amazon mandating instance reboots for hundreds, perhaps thousands, of instances (Amazon’s term for VMs). Affected instances seem to be scheduled for reboots over the next couple of weeks. Speculation is that the reboots are to patch a recently-reported vulnerability in the Xen hypervisor, which is the virtualization technology that underlies Amazon’s EC2. The GigaOm story gives some links, and the CRN story discusses customer pain.
Maintenance reboots are not new on Amazon, and are detailed on Amazon’s documentation about scheduled maintenance. The required reboots this time are instance reboots, which are easily accomplished — just point-and-click to reboot on your own schedule rather than Amazon’s (although you cannot delay past the scheduled reboot). Importantly, instance reboots do not result in a change of IP address nor do they erase the data in instance storage (which is normally non-persistent).
For some customers, of course, a reboot represents a headache, and it results in several minutes of downtime for that instance. Also, since this is peak retail season, it is already a sensitive, heavy-traffic time for many businesses, so the timing of this widespread maintenance is problematic for many customers.
However, cloud IaaS isn’t magical. If these customers were using dedicated hosting, they would still be subject to mandated reboots for security patches — hosting providers generally offer some flexibility on scheduling such reboots, but not aa lot (and sometimes none at all if there’s an exploit in the wild). If these customers were using a provider that uses live migration technology (like VMotion on a VMware-virtualized cloud), they might be spared reboots for system reasons, but they might still be subject to reboots for mandated operating system patches.
Given that what’s underlying EC2 are ordinary physical servers running virtualization without a live migration technology in use, customers should reasonably expect that they will be subject to reboots — server-level (what Amazon calls a system reboot), as well as instance-level — and also anticipate that they may sometimes need to reboot for their own guest OS patches and the like (assuming that they don’t simply patch their AMIs and re-launch their instances, arguably a more “cloudy” way to approach this problem).
What makes this rolling scheduled maintenance remarkable is its sheer scale. Hosting providers typically have a few hundred customers and a few thousand servers. Mass-market VPS hosters have lots of VPS containers, but there’s a roughly 1:1 VPS:customer ratio and a small-business-centricity that doesn’t lead to this kind of hullabaloo. Amazon’s largest competitor is estimated to be around the 100,000 VM mark. Only the largest cloud IaaS providers have more than 2,000 VMs. Consequently, this involves a virtually unprecedented number of customers and mission-critical systems.
Amazon has actually been very good about not taking down its cloud customers for extended maintenance windows. (I can think of one major Amazon competitor that took down one whole data center for an eight-hour maintenence evidently involving a total outage this past weekend, and which regularly has long-downtime maintenance windows in general.) A reboot is an inconvenience, but if you are running production infrastructure, you should darn well think about how to handle the occasional reboot, including reboots that affect a significant percentage of your infrastructure, because reboots are not likely to go away in IaaS anytime soon.
To hammer on the point again: Cloud IaaS is not magical. It still requires management, and it still has some of the foibles of both physical servers and non-cloud virtualization. Being able to push a button and get infrastructure is nice, but the responsibility to manage that infrastructure doesn’t go away — it’s just that many cloud customers manage to delay the day of reckoning when the attention they haven’t paid to management comes back to bite them.
If you run infrastructure, regardless of whether it’s in your own data center, in hosting, or in cloud IaaS, you should have a plan for “what happens if I need to mass-reboot my servers?” because it is something that will happen. And add “what if I have to do that immediately?” to the list, because that is also something that will happen, because mass exploits and worms certainly have not gone away.
365 thoughts on “The Cloud is not magical”