5 Ways to Reduce Server Downtime (And 1 Way To Eliminate It)

September 7, 2020 - TuxCare PR Team

Rebooting servers hurts you and your customers. It’s often done during off-peak hours (usually at night) when servers process fewer transactions, but even a reboot at this time costs thousands of dollars in downtime. One server reboot can take from several minutes to over an hour depending on the configuration, and services can need additional time to synchronize afterward. In fact, 25% of organizations report that downtime costs them between $300,000 and $400,000 for every hour servers are unavailable. Downtime is avoidable, and reboots due to patching can be completely eliminated.

Servers fail for a variety of reasons, but a server’s failure doesn’t always mean downtime. Downtime is much more critical to an organization because it means a single point of failure was ignored or overlooked, or failover systems weren’t able to take over seamlessly. Google hosted a video on the top ten reasons for server downtime. We’ll summarize the 50-minute video below.

Resource Overload

When server requests exceed available resources, performance suffers and eventually the server crashes. Cloud servers can expand resources dynamically, but the administrators responsible for them must still ensure that the servers can support customer applications and any resource expansion.

Noisy Neighbor

The “bad neighborhood” issue is mainly a concern for cloud hosts with shared hosting services. When one client uses too much of a server’s resources, it affects performance of other client sites. Most hosts will move the “noisy neighbor” off shared services to control the issue or limit available resources to a problem client.

Retry Spikes

Whether it’s from an overloaded server or an application gone rogue, when users are unable to connect to a server, they often try several times before giving up. Now add thousands of users performing the same retries several times, and you have a server crashing due to retry spikes. Administrators can configure servers to reject aggressive retry connections to help reduce retry spikes.
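One way to picture rejecting aggressive retries is a sliding-window limiter that counts recent connection attempts per client. The sketch below is illustrative only: the class name `RetryLimiter`, its parameters, and the thresholds are assumptions rather than settings from any particular server, and in practice this kind of limit is usually configured in the web server or load balancer rather than written by hand.

```python
import time
from collections import defaultdict, deque


class RetryLimiter:
    """Sliding-window limiter: reject a client that connects too often.

    Hypothetical helper for illustration; thresholds are assumptions.
    """

    def __init__(self, max_attempts=5, window_seconds=10.0, clock=time.monotonic):
        self.max_attempts = max_attempts
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the limiter can be tested
        self._attempts = defaultdict(deque)  # client id -> accepted timestamps

    def allow(self, client_id):
        """Return True if the connection is accepted, False if rejected."""
        now = self.clock()
        recent = self._attempts[client_id]
        # Drop attempts that have aged out of the window.
        while recent and now - recent[0] >= self.window_seconds:
            recent.popleft()
        if len(recent) >= self.max_attempts:
            return False
        recent.append(now)
        return True
```

Each client is tracked separately, so one client hammering the server with retries does not cause well-behaved clients to be rejected.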

Buggy Dependencies, Patches or Applications

Poor patching habits, outdated software, slow dependencies, and numerous other issues related to applications running on the server can cause downtime. Administrators can’t simply install patches and reboot indiscriminately. They must schedule patching and updates and reboot during off-peak hours. Live patching can help (more on that later).

Third-Party Scaling

Your servers might be able to scale, but the third-party APIs used in application processing might not. Google recommends “sharding,” where large consistent processes are broken into chunks to reduce overhead.

Inefficient Sharding

Sharding benefits performance, but when one shard is too large compared to others, you have uneven sharding. Google recommends breaking larger shards into even smaller ones to remediate the issue.
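As a rough illustration of breaking an oversized shard into smaller ones, the sketch below splits any shard above a size limit into near-equal pieces. The function name `split_oversized_shards` and the idea of modeling a shard as a plain list are assumptions made for the example; real systems shard by key ranges or hashes and must also move the data.

```python
def split_oversized_shards(shards, max_size):
    """Split any shard larger than max_size into near-equal smaller shards."""
    if max_size <= 0:
        raise ValueError("max_size must be positive")
    balanced = []
    for shard in shards:
        if len(shard) <= max_size:
            balanced.append(list(shard))
            continue
        pieces = -(-len(shard) // max_size)  # ceiling division
        base, extra = divmod(len(shard), pieces)
        start = 0
        for i in range(pieces):
            # The first `extra` pieces take one additional item.
            size = base + (1 if i < extra else 0)
            balanced.append(list(shard[start:start + size]))
            start += size
    return balanced
```

Splitting into near-equal pieces, rather than cutting fixed-size chunks with a small remainder, avoids replacing one uneven shard with another.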

Human Errors

Some server procedures have too much human involvement. Without automation, human errors can be introduced. For instance, relying on IT staff to manually patch and upgrade servers often leads to mistakes and downtime. Patch management and automation greatly reduce human errors by only requiring administrators to be involved when an issue is found.

Bad Code Deployments

For organizations with in-house applications, testing is critical to ensure that deployed code does not present issues. In addition to heavy testing and quality assurance (QA) procedures, a rollback process should always be developed.
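A rollback process can be as simple as "activate the new version, check health, and revert if the check fails." The sketch below shows that control flow only. The function `deploy_with_rollback` and its `activate` and `health_check` callables are hypothetical placeholders for whatever deployment tooling is actually in use.

```python
def deploy_with_rollback(activate, health_check, new_version, previous_version):
    """Activate new_version; fall back to previous_version if it is unhealthy.

    `activate` and `health_check` are caller-supplied callables (assumed here).
    Returns the version left running.
    """
    activate(new_version)
    try:
        healthy = health_check()
    except Exception:
        # A health check that crashes is treated the same as a failed one.
        healthy = False
    if healthy:
        return new_version
    activate(previous_version)
    return previous_version
```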

Poor Monitoring

Most administrators know that monitoring is essential. It’s also a component in regulatory compliance. Just one missed configuration or server in a monitoring strategy leaves the organization open to monitoring gaps. Auditing the network to ensure every resource is added to monitoring applications prevents this issue.
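The audit described here boils down to comparing two lists: every server that exists and every server the monitoring tool knows about. A minimal sketch, assuming both lists can be exported as hostnames (the function name `find_unmonitored` and the normalization rules are illustrative assumptions):

```python
def find_unmonitored(inventory, monitored):
    """Return hosts present in the inventory but absent from monitoring."""
    def normalize(host):
        # Hostnames are compared case-insensitively, ignoring stray spaces.
        return host.strip().lower()

    known = {normalize(h) for h in inventory}
    watched = {normalize(h) for h in monitored}
    return sorted(known - watched)
```

Running a comparison like this on a schedule would surface a newly provisioned server that was never added to monitoring.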

Misconfigured Domains and Infrastructure

Connectivity problems with a server resource don’t always stem from local machine issues. A failed domain can result in server downtime because clients cannot connect to servers. Failover and testing before deploying configuration changes will help prevent this issue.

Downtime Costs to Organizations

No matter what the root cause, the main concern for businesses is the money lost during (and after) downtime. Transactions can’t be processed, and they could be lost to the void without failover systems in place. Customer frustration is another primary issue: it can lead to lost customers, lost revenue, and brand damage as downtime hurts the organization’s reputation.

According to a recent Ponemon report, organizations experience 30% more downtime due to poor patch management and vulnerability patching delays. Of the businesses polled, 52% said that they had no tolerance for downtime, including reboots due to patching and operating system updates. Small businesses suffer more than large businesses because they do not have the resources and automation in place to handle vulnerability patching, which leads to an increase in downtime.

Of all the aforementioned downtime causes, human error and poor patch deployments can be completely eliminated using patch automation. Reboots can be completely eliminated using live patching. Organizations spend $1.4 million annually on vulnerability management, but patch management and automation greatly reduce staff overhead, downtime costs, and even reboot issues.

Scheduling Downtime: Maintenance Planning and Execution

At some point in a server’s lifetime, administrators must schedule downtime. This could be for code deployment, changes to server hardware, configuration changes, or a switchover from a retired server to a new one. Scheduled maintenance is usually executed during off-peak hours, but there are a few steps that can be taken to reduce downtime.

Ensure backups are recent, working, and available. Should you need to perform any critical rollback that interrupts service and you need backups, make sure they are available so that they can be extracted and deployed faster.
Check disk usage. For small businesses with servers using limited resources, always check that disk storage is available for updates. A full drive causes unexpected results and severe performance degradation.
Check server resource utilization. In addition to checking storage space, validate that the server has no CPU or memory spikes that could interfere with a successful update or configuration change.
Test before deploying any changes. This might seem like administrative common sense, but many “quick and easy” configuration changes or updates cause downtime because administrators skip testing for small changes. It is tempting to think that a small change couldn’t possibly cause issues, but the chance is always there. Always test changes to any production server in a staging environment first.
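The disk and resource checks in the list above are easy to script so they run before every maintenance window. The sketch below covers only those two checks (backups and staging tests still need their own process). The function name `preflight` and both thresholds are assumptions chosen for illustration, not recommended values.

```python
import os
import shutil


def preflight(path="/", min_free_fraction=0.20, max_load_per_cpu=0.75):
    """Return a list of problems that should block a maintenance run.

    Thresholds are illustrative assumptions; tune them per server.
    """
    problems = []

    # Disk check: compare free space against the required fraction.
    usage = shutil.disk_usage(path)
    free_fraction = usage.free / usage.total if usage.total else 0.0
    if free_fraction < min_free_fraction:
        problems.append(f"low disk space on {path}: {free_fraction:.0%} free")

    # Load check: os.getloadavg() is only available on Unix-like systems.
    if hasattr(os, "getloadavg"):
        load_1min = os.getloadavg()[0]
        cpus = os.cpu_count() or 1
        if load_1min / cpus > max_load_per_cpu:
            problems.append(f"high load: {load_1min:.2f} across {cpus} CPUs")

    return problems
```

An empty list means neither check found a reason to postpone. Anything else is a message an administrator can act on before starting the update.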

How to Minimize Server Downtime

Unexpected server downtime is much more damaging to an organization than scheduled maintenance. Administrators should have a backup and rollback plan and be ready for issues during scheduled maintenance, but unexpected downtime requires root-cause analysis and the resources to bring the server back into service. Administrators should take preventative measures to ensure that a server experiences as little downtime as possible. Here are some best practices to help reduce downtime:

Security

Cybersecurity is critically important for server reliability and uptime. Administrators who work with public-facing servers will experience numerous vulnerability scans, exploit attempts, and suspicious traffic that should be monitored. Publicly reported vulnerabilities are soon followed by exploits and attacks on the server, so administrators must take immediate action and patch the system. Downtime due to a data breach brings with it much more revenue loss and corporate fallout than the downtime cost of a reboot.

Server Monitoring

For organizations with hundreds of servers, it’s easy to miss just one. Auditing the network and identifying every server ensures that servers have the right monitoring in place, not just for crashes but also for resource spikes and inefficiencies (e.g. cooling) that could create a slow failure. Any issues should be sent to administrators, with text messages used for critical errors. Proactive monitoring alerts administrators to pending crashes, both virtual and physical, so that they can remediate the issue before it causes downtime.

Retire Inefficient Servers

Older servers are much more prone to failure, so eventually a server should be retired. It’s not uncommon for administrators to update hardware, but at some point continued upgrades stop being cost-efficient. Aging servers can also consume more power and have a cascading effect on environment performance.

Optimize Cooling

Heat and moisture will slowly destroy server equipment. With monitoring implemented, these environmental factors will be detected before they destroy equipment and servers suffer from hardware failure. The right cooling should be installed across all server rooms, and a backup system should be in place in case primary cooling fails.

Perform Load Testing

Using a load balancer to distribute traffic across multiple servers helps with performance, but what if more than one server fails? With load testing, you know how servers will perform after some resources fail. The results might lead to additional servers being provisioned or resources being added to existing servers. For any critical servers, always overestimate capacity limits to ensure that enough resources are available to scale and grow.
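Once load testing has measured how much traffic one server can carry, the provisioning question becomes simple arithmetic. The sketch below is a back-of-the-envelope model under stated assumptions: traffic is expressed as requests per second, servers are identical, the load balancer spreads traffic evenly, and the function names and the 25% headroom default are hypothetical.

```python
import math


def servers_needed(peak_rps, per_server_rps, tolerated_failures=1, headroom=0.25):
    """Servers required to carry peak load, with headroom, after failures."""
    if per_server_rps <= 0:
        raise ValueError("per_server_rps must be positive")
    required = math.ceil(peak_rps * (1 + headroom) / per_server_rps)
    # Add spares so the pool still meets demand with that many servers down.
    return required + tolerated_failures


def survives(server_count, failed, peak_rps, per_server_rps):
    """True if the remaining servers can still carry the peak load."""
    return (server_count - failed) * per_server_rps >= peak_rps
```

For example, with a measured 2,000 requests per second per server, a 9,000 requests per second peak, and two tolerated failures, `servers_needed(9000, 2000, tolerated_failures=2)` comes out to 8 servers.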

Patch Automation and Live Patching

Manual patching invites human error and missed vulnerability alerts. Instead, organizations should be using patch automation. Even with patch automation, updating the Linux kernel has traditionally required a reboot. With KernelCare and KernelCare + for Shared Libraries, administrators can patch their systems without rebooting the server. Live patching completely eliminates the need for scheduled maintenance and downtime for kernel updates. For instance, HostUS uses KernelCare and recently retired a server that hadn’t rebooted for 5.5 years.

Conclusion

Server downtime is extremely costly, but it can be reduced with the right best practices. Most downtime from unexpected errors can be prevented, but any downtime due to patching can be completely eliminated with live patching from KernelCare. To see what KernelCare can do for your servers, sign up for free and get started.