Hardware Level Vulnerabilities, Revisited
In August of last year, I examined several CPU bugs that posed serious security threats. The mitigations for these vulnerabilities generally involved either incorporating additional instructions or opting for alternative CPU instructions – strategies that lead to diminished system performance overall. My argument was that such vulnerabilities effectively revert your infrastructure to the technological level of a previous hardware generation, even though you’re paying for the latest, which simply means you’re wasting money.
Then, Spectre resurfaced in early April, and the familiar cycle repeated itself. Let’s go back and delve deeper into this ongoing issue.
Spectre. Again.
Spectre is the somewhat whimsical name given to a bug discovered in 2017 that has since evolved into a family of vulnerabilities, each exploiting similar or closely related microarchitectural flaws in CPUs. With the end of Moore’s Law (the observation that the number of transistors on a chip doubles approximately every two years), CPU advancements have shifted dramatically. Whereas in the ’90s and early 2000s a new CPU generation from any vendor typically meant doubling performance, or nearly so, today’s advancements run into physical limits. It has become increasingly difficult to pack more transistors into the same space without the chip overheating.
As a result, instead of producing faster CPUs, manufacturers have focused on increasing the number of cores. Initially, the performance boost from multi-core CPUs was modest, as software wasn’t designed to use parallel execution effectively. Although software has since evolved to make better use of multiple cores, the physical constraints of heat dissipation remain a significant challenge. Today, we see CPUs with over 100 cores and 400W TDP (thermal design power), but these chips still operate at clock speeds comparable to those of a 3GHz Pentium 4 from over 15 years ago. Experimental work on higher-TDP CPUs has led to exotic cooling solutions, like running the entire system submerged in coolant.
In their quest to maximize CPU performance within existing thermal limits, manufacturers have introduced various enhancements. These include larger cache sizes, multiple caching levels, better inter-die communication, integration of GPU and memory controllers directly into the CPU, and enhancements in motherboard designs to manage signal degradation over PCIe lanes. Another significant development has been increasingly aggressive branch prediction, alongside isolation mechanisms that prevent processes on one core from accessing data on another. That isolation is critical in environments like cloud computing, where multiple virtual machines may operate on the same physical hardware.
However, the complexity of these techniques means that implementation issues are somewhat inevitable. Branch prediction (the mechanism that drives speculative execution), for example, isn’t just about executing the instructions at hand faster. The CPU anticipates potential future instructions based on previous ones and pre-executes the likely paths. This preemptive execution is intended to enhance efficiency, much like a waiter at a restaurant preparing to add fries to your order before you confirm you want them. But if you overwhelm the system with demands, it can lead to errors – like getting a milkshake when you asked for fries.
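To make “anticipates based on previous ones” a little more concrete, here is a minimal sketch of a textbook two-bit saturating counter, one of the simplest branch prediction schemes. It is an illustration only: real CPUs use far more elaborate predictors, and the function name and the sample branch pattern are made up for this example.

```python
def simulate_two_bit_predictor(outcomes, initial_state=2):
    """Count correct predictions for a sequence of branch outcomes.

    The counter has four states: 0 and 1 predict "not taken",
    2 and 3 predict "taken". Each real outcome nudges it one step.
    """
    state = initial_state
    correct = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken == taken:
            correct += 1
        # Move toward the actual outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct


# Hypothetical loop branch: taken nine times, not taken once on exit,
# and the whole loop runs ten times.
loop_branch = ([True] * 9 + [False]) * 10
hits = simulate_two_bit_predictor(loop_branch)
print(f"{hits}/{len(loop_branch)} predictions correct")
```

The predictor only misses on each loop exit, which is why guessing from history pays off so well. It is also why history matters to an attacker: whoever can influence that history can influence what the CPU decides to run ahead of time.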
Security researchers, perhaps with some extra time on their hands, have managed to exploit the branch prediction mechanism to access data that should be inaccessible, enabling a regular user to read privileged system data, or a virtual machine tenant to read data from another tenant or even the host system. Even if an exploit can only extract data at a rate of a few kilobits per hour, that is enough for a determined attacker to extract sensitive information – like private keys – over time.
The idea with the new variant is that not only can you look at data in predicted branches, but you can also trick the CPU into executing code in other branches that it would otherwise not run. This is possible even on systems where mitigations for previous Spectre attacks have been deployed, so any new mitigation comes on top of the existing performance-hitting fixes.
Mitigating Hardware Vulnerabilities
How do hardware vendors address these issues? The fixes for current designs involve hardware modifications that will only appear in new chips manufactured in the coming years. They cannot retroactively correct vulnerabilities in CPUs already in the market. Thus, most “fixes” for these types of vulnerabilities act more like “kill switches” for the problematic features, either disabling or throttling them to mitigate risks. Unfortunately, this means a significant performance trade-off: branch prediction, for example, accounts for a substantial part of the performance gains in modern CPUs. Disabling it leads to a noticeable slowdown.
When hardware fixes are not forthcoming, operating system vendors rely on different approaches – the Linux kernel, for example, adopted an entirely different way of making function calls and returns (such as the “retpoline” construction) to block the attack vector. However, more instructions mean more code to run, which means slower execution, which degrades performance just the same.
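If you want to see which of these software mitigations are active on a given Linux machine, the kernel reports them as one-line status files under /sys/devices/system/cpu/vulnerabilities. The sketch below simply reads that directory; the function name is my own, and the set of files and the wording of each status line vary by kernel version and CPU, so treat the output as informational.

```python
from pathlib import Path


def read_mitigation_status(base="/sys/devices/system/cpu/vulnerabilities"):
    """Map each vulnerability name to the kernel's one-line status string."""
    directory = Path(base)
    if not directory.is_dir():
        # Not Linux, or a kernel that does not expose this directory.
        return {}
    status = {}
    for entry in sorted(directory.iterdir()):
        if entry.is_file():
            status[entry.name] = entry.read_text().strip()
    return status


if __name__ == "__main__":
    for name, state in read_mitigation_status().items():
        print(f"{name}: {state}")
```

An entry that starts with “Mitigation:” tells you a workaround is switched on, which also means you are paying its performance cost.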
If the latest CPUs are only about 10% faster than their predecessors, and a Spectre-related mitigation reduces performance by more than that, then the effective performance is worse than that of the older generation. This impact could be profound, especially for operations like AI model training where a 10% performance reduction could be costly.
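The arithmetic is easy to check. The sketch below assumes that the generational gain and the mitigation cost combine multiplicatively and that the older CPU runs without the mitigation; both are simplifications of mine, and the percentages are the illustrative ones from the paragraph above, not measurements.

```python
def effective_speedup(generation_gain, mitigation_cost):
    """Speed of the new, mitigated CPU relative to the old, unmitigated one."""
    return (1 + generation_gain) * (1 - mitigation_cost)


def break_even_cost(generation_gain):
    """Mitigation cost at which the new CPU merely matches the old one."""
    return 1 - 1 / (1 + generation_gain)


for cost in (0.05, 0.10, 0.15):
    relative = effective_speedup(0.10, cost)
    print(f"mitigation cost {cost:.0%}: relative speed {relative:.3f}")

print(f"break-even cost for a 10% gain: {break_even_cost(0.10):.1%}")
```

Under these assumptions the break-even point sits slightly below 10%: a mitigation that costs a little over 9% is already enough to erase a 10% generational gain.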
This emerging issue should be factored into organizational planning, including business strategy and disaster recovery. During your next tabletop exercise, consider asking whether your infrastructure could withstand a 10% or 15% drop in performance and still meet your business objectives. How would you manage such a scenario?
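A rough way to frame that question for a tabletop exercise is to translate the performance drop into utilization and server count. The sketch below assumes throughput scales linearly with per-server performance, which real workloads rarely do exactly; the fleet size and utilization figures are hypothetical placeholders to be replaced with your own.

```python
import math


def utilization_after(current_utilization, performance_drop):
    """Utilization needed to serve the same load on slower hardware."""
    return current_utilization / (1 - performance_drop)


def servers_needed(current_servers, performance_drop):
    """Servers required to keep per-server utilization unchanged."""
    return math.ceil(current_servers / (1 - performance_drop))


# Hypothetical fleet: 100 servers running at 80% utilization.
for drop in (0.10, 0.15):
    util = utilization_after(0.80, drop)
    extra = servers_needed(100, drop) - 100
    print(f"{drop:.0%} drop: utilization {util:.1%}, or {extra} extra servers")
```

If the resulting utilization is above what you can sustain at peak, the exercise has found a gap that needs either spare capacity or a plan for which workloads to shed first.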
These considerations are no longer hypothetical. They are essential components of a modern cybersecurity strategy, and they underscore the importance of staying ahead in a landscape where hardware vulnerabilities have tangible impacts on operational capabilities.