Infrastructure as Code: A Double-Edged Sword
In an ever-evolving technological landscape, handling complex environments is far from a walk in the park. From larger and pricier ops teams to stricter hardware and software standardization, many strategies have been put to the test. Automation and buzzwords of the moment have all had their time in the spotlight.
We are, and have been for some time now, embracing the paradigm of “infrastructure as code” (IaC). This approach scales well, demands fewer resources, integrates seamlessly with other tools and processes like source control and change tracking, and is a more efficient way to manage the ever-growing demands of today’s digital world.
However, as with anything else, IaC isn’t without its flaws. We have decades of experience managing the infrastructure side of things, but the “code” portion of IaC is where issues arise. If the number of bugs, patches, exploits, bug bounties, and other associated activities are any indication, it’s clear we have room to improve our coding prowess.
When software packages are anecdotally compared not on being performant or reliable, but on being merely as buggy as other packages or versions, treating infrastructure as code inevitably introduces those same bugs and issues into the infrastructure itself. A case in point is the recent incident at Azure.
The Azure Outage: A Case Study in IaC
A few days ago, a mishap at Azure impacted customers in the southern part of Brazil. As a major cloud provider, Azure subscribes to the “everything is code” motto, allowing configuration of services, servers, databases, endpoints, backups, and more to be handled as code. This philosophy extends to the management side of Azure, not just the customer-facing cloud: sysadmins treat everything as code, enabling automation of every operation.
Some days ago, Azure was testing a change in the management components. The goal was to replace some older components with more recent ones, which involved removing some NuGet packages and installing new ones. As part of changing the NuGet packages, the code that referenced them also had to be slightly altered – the package names were different and some calling conventions changed. It wasn’t exactly a drop-in replacement.
The changes were scripted, committed, reviewed, tested, and then deployed in a test environment known at Azure as “Ring 0”, a name borrowed from the CPU protection ring with the highest privileges. Once all went well in Ring 0, the changes were rolled out to Ring 1, the production environment visible to customers.
However, Ring 0, much like many lab or test environments, is a scaled-down version of the production environment, without the same load, user base, or intricate system interactions.
Interestingly, one relatively routine task that Azure sysadmins perform in production is creating snapshots of running databases to debug without affecting real workloads or customers. As there’s no need to create snapshots in a lab environment devoid of actual production databases and issues, Ring 0 was snapshot-free.
One of the changes in the NuGet packages was a change in the functions handling snapshots.
Internally, Azure runs a background job to delete old snapshots periodically. Since there were no snapshots in Ring 0, it never attempted to delete any during the test phase. However, in the production environment, where snapshots existed, the background job ran with the NuGet package changes, and what was once a call to remove a database snapshot morphed into a call to remove a database server.
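A minimal sketch of how this kind of failure can happen. This is not Azure's actual code; the client classes, method names, and resource naming below are all hypothetical, chosen only to show how a swapped package with a compatible call shape can silently change what gets deleted while the cleanup job itself remains untouched:

```python
# Hypothetical illustration: a package swap changes the semantics of a call
# without changing its signature, so the unchanged cleanup job still "works".

class OldStorageClient:
    """Older package: deleting by resource id targets the snapshot."""
    def delete(self, resource_id: str, resources: dict) -> str:
        resources.pop(f"snapshot:{resource_id}", None)
        return f"deleted snapshot:{resource_id}"


class NewStorageClient:
    """Newer package: the same call shape now targets the parent server."""
    def delete(self, resource_id: str, resources: dict) -> str:
        resources.pop(f"server:{resource_id}", None)
        return f"deleted server:{resource_id}"


def cleanup_old_snapshots(client, snapshot_ids, resources):
    # The background job is unchanged; only the package behind `client`
    # was replaced. Tests without snapshots never exercise this path.
    return [client.delete(sid, resources) for sid in snapshot_ids]
```

Because Ring 0 had no snapshots, `cleanup_old_snapshots` processed an empty list there and the behavioral change went undetected until production.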
Like a slow-motion car crash, when the job ran, every database server with an old enough snapshot was deleted, and service after service became unresponsive.
The Aftermath of the Azure Outage
The server deletion triggered a chain of tangentially related events:
- In Azure, customers cannot restore a deleted server themselves. This had to be done by Azure’s support team.
- Not all affected servers had the same type of backup, causing variance in restore speed. Some backups were stored in the same region (read: datacenter), while others were geographically redundant, stored in different Azure regions; that data had to be transferred before restoring, which added to the recovery time.
- It was discovered that some web servers that relied on the deleted database servers ran a test at startup to determine which databases were available and reachable. These requests were directed at all the database servers and had the unintended side effect of interfering with the database server recovery process. Additionally, when the test failed, the web servers would immediately trigger a restart. Which happened. And again. And again, as those databases remained unavailable for a prolonged period of time.
- As a precautionary measure, a technique called backoff is employed when a test fails: the delay until the next retry keeps increasing, linearly or exponentially, giving the other side time to recover without having to handle a flood of retry requests. In this specific case, it had the unintended consequence of stretching restarts to over an hour, when they would normally take seconds.
- Restoring backups was slow due to the infrastructure being overwhelmed with requests.
- As soon as a few servers were back online, they quickly got overwhelmed by customer traffic.
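The backoff behavior described above can be sketched as follows. The parameters (base delay, growth factor, cap) are illustrative assumptions, not Azure's actual values; the point is that a capped exponential schedule, designed to be gentle on a struggling dependency, can accumulate into a very long total wait:

```python
def backoff_delays(base: float = 1.0, factor: float = 2.0,
                   cap: float = 300.0, attempts: int = 20):
    """Exponential backoff with a cap: each failed check waits roughly
    `factor` times longer than the previous one, up to `cap` seconds.
    Returns the list of per-attempt delays."""
    delays = []
    for attempt in range(attempts):
        delay = min(cap, base * (factor ** attempt))
        delays.append(delay)
    return delays


# With these illustrative defaults, twenty consecutive failures already
# add up to more than an hour of cumulative waiting before a restart
# can succeed, even though each individual delay is capped at 5 minutes.
total_wait = sum(backoff_delays())
```

Production implementations usually also add random jitter to the delays so that many clients do not retry in lockstep; it is omitted here for clarity.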
To allow all servers to come back online, all traffic had to be interrupted until the servers were recovered. This traffic peak made the outage (for customers of the affected Azure tenants) last longer than it otherwise would have.
The result? Approximately 10 hours of downtime, business disruptions, and a slew of frustrated customers.
And it all originated from updating a few code packages. Not a hack, or a fire, or other hazard.
Takeaways and Lessons Learned
A robust testing environment is essential, with the capacity to mimic production as closely as possible. Exact replicas are difficult to achieve, fall out of sync easily, and are hard to maintain. But overly simplistic test environments are essentially worthless, as they lack the load and interactions that the production environment faces.
Effective backup and restore strategies are crucial. Backups should not only be available but should also have been trialed in live restoration scenarios.
Human error is a persistent challenge in code development. We must acknowledge the potential for overlooking critical interactions and subtle bugs in complex systems. In fact, a system does not even have to be that complex for humans to miss some of the intricacies and interactions within it, and coding in that situation inevitably leads to problems.
The cloud, while revolutionary, isn’t immune to failure. Human error, software glitches, and hardware faults can and do occur. Repeatedly and unexpectedly. And if Murphy has any say in it, at exactly the worst possible moment too.
On a brighter note, the incident showcased the tenacity of Azure’s support and operations teams, who managed to restore deleted resources without data loss within a relatively short span. It underlines that while managing large infrastructures, especially as code, is challenging, the real measure of a service is how swiftly it responds to and rectifies an issue after a slip-up.
In conclusion, it’s always wise to automate wherever possible, ensuring that the test environment mirrors production as closely as it can. Incidents like these serve as valuable reminders to consider potential vulnerabilities in our environments before they lead to failures.
It is also important to automate the trivial tasks first – say, automatic patch deployment – allowing the focus to shift to more complex operations that demand higher levels of attention. This helps free the only resource the cloud cannot scale (brain power) for more important tasks.