AI Hardware is Expensive. Here’s How to Maximize Utilization

Joao Correia, Technical Evangelist

June 14, 2024

Computex 2024 is in full swing, and AI is everywhere. Hardware makers are embracing it as the best thing since sliced bread. However, if you’re planning your AI hardware deployments, there are several architectural aspects to consider to maximize your return on investment.

The latest AI equipment packs GPUs and other accelerator boards into full rack footprints. Where full-rack solutions were once the preserve of supercomputers and very large storage arrays, vendors are now announcing entire racks of AI gear: specialized networking interconnects, enclosures for massive GPUs, and dedicated computational units, all drawing the power that entire aisles once did. Add dedicated water cooling pipes, pumps, and radiators for good measure; after all, when you pack over 100 kW into a single rack, you don't want to melt your (very expensive) equipment the moment you power it on.

While this kind of densification makes efficient use of available data center space, it introduces its own set of challenges. On a physical level, power requirements are skyrocketing: existing data centers were not designed for a single rack that needs as much electricity as seven or eight racks used to (a nuclear reactor to go with that data center, anyone?). The same goes for cooling requirements, which grow hand-in-hand with power demand. But that is just scratching the surface.

When you purchase this type of hardware, you have a very clear idea of how you’re going to use it. Whether they’re for massive dataset inference, training, or some other computationally intensive workload, these racks are not bought to stand idle or run at less-than-full capacity. Like supercomputers, where time slots for running jobs are scheduled weeks in advance, AI systems are expected to be in full-time use every day until decommissioned. And no one is decommissioning current-gen AI hardware anytime soon.

 

Handling Security

 

The security requirements for these systems are critical. Racks packed with the latest GPUs are an attractive target for crypto-mining threat actors: a miner dropped onto a breached system would be hard to distinguish from legitimate GPU workloads and expensive to host unknowingly. These systems need to be as secure as possible, yet they can't be air-gapped and remain useful.
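As a concrete illustration, here is a minimal sketch of the kind of check that can surface unexpected GPU workloads. It assumes NVIDIA GPUs with nvidia-smi available on the PATH, and the process allowlist is purely hypothetical; in practice, the output would feed whatever alerting stack is already in place.

# gpu_watch.py - minimal sketch, assuming NVIDIA GPUs and nvidia-smi on the PATH.
# The EXPECTED allowlist is hypothetical; adjust it to the workloads you actually run.
import subprocess

EXPECTED = {"python", "python3", "tritonserver"}  # hypothetical approved process names

def gpu_compute_processes():
    # Ask nvidia-smi which processes currently hold GPU compute contexts.
    result = subprocess.run(
        ["nvidia-smi",
         "--query-compute-apps=pid,process_name,used_memory",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    )
    for line in result.stdout.strip().splitlines():
        pid, name, mem_mib = (field.strip() for field in line.split(",", 2))
        yield pid, name, mem_mib

if __name__ == "__main__":
    for pid, name, mem_mib in gpu_compute_processes():
        # Flag anything not on the allowlist of expected workloads.
        if name.rsplit("/", 1)[-1] not in EXPECTED:
            print(f"Unexpected GPU workload: pid={pid} process={name} memory={mem_mib} MiB")

Running something like this on a schedule and alerting on anything outside the allowlist is cheap insurance compared to paying the power bill for someone else's mining operation.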

So, consider the options. To remain secure, these systems need regular updates and patches, but traditional patching means downtime, and for systems like these a reboot can easily take 20-30 minutes on hardware monitoring and checks alone, time that costs money and returns nothing.

Another approach sometimes considered is high availability. However, this means either purchasing more capacity than needed so performance holds while nodes are taken down for maintenance, or accepting degraded capacity during those windows; both run contrary to the reason these systems are acquired in the first place. More so than for other systems, running hot spares or at reduced capacity is shunned for cost and efficiency reasons.

On its own, high availability is not the right fit for covering maintenance operations. While it has been used for this purpose with some success, it was designed to absorb hardware failures without disruption, not to accommodate planned work. Over-committing resources to compensate for inefficient operational practices is hardly a good use of a highly available architecture.

If you’re running Linux on this hardware (which, let’s face it, you are), you have another option available: you can live patch the systems to maintain uptime and avoid interruptions to workloads.
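For illustration, here is a minimal sketch of how applied live patches can be verified through the upstream kernel livepatch interface. Vendor tooling such as TuxCare's KernelCare ships its own commands for this, so treat the sysfs path below as the generic mechanism rather than a product-specific one.

# livepatch_status.py - minimal sketch: report kernel live patches via the upstream
# /sys/kernel/livepatch sysfs interface. Vendor tooling may expose this differently.
from pathlib import Path

LIVEPATCH_SYSFS = Path("/sys/kernel/livepatch")

def applied_livepatches():
    # Each applied live patch appears as a directory; its 'enabled' attribute is "1" while active.
    if not LIVEPATCH_SYSFS.is_dir():
        return []  # kernel without livepatch support, or no patches loaded yet
    patches = []
    for entry in sorted(LIVEPATCH_SYSFS.iterdir()):
        enabled = (entry / "enabled").read_text().strip() == "1"
        patches.append((entry.name, enabled))
    return patches

if __name__ == "__main__":
    for name, active in applied_livepatches():
        print(f"{name}: {'active' if active else 'disabled'}")

The point is that a patch can be applied and verified while the workloads keep running, with no reboot and no maintenance window to negotiate.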

As organizations embrace and deploy the latest technology, it's time to also modernize operational practices that are stuck in the pre-cloud era. A security posture that matches the strategic investment in the hardware requires modern methodologies to replace old paradigms. If for no other reason, consider the cost: running this hardware unpatched is expensive, and so is having it waste cycles on reboots.

 


Looking to automate vulnerability patching without kernel reboots, system downtime, or scheduled maintenance windows?

Learn About Live Patching with TuxCare
