Linux 6.1 Helps Users Identify Faulty CPUs
Linux kernel 6.1, one of the latest updates to the Linux operating system, provides users with a new logging feature that helps them identify faulty CPUs and their associated cores within a server.
The new log output records the CPU, core, and socket on which a fault most likely occurred. The information is not fully reliable, because the faulting task may be rescheduled onto another CPU or core between the moment of the fault and the moment the message is printed. Even so, it can still help identify faulty CPUs or cores.
“This is not perfect, since the task might get rescheduled on another CPU between when the fault hit, and when the message is printed, but in practice, this has been good enough to help people identify several bad CPU cores,” explained Rik van Riel, the author of the change.
CPU faults are often “oddly specific”: only certain programs or pieces of code crash, and only when they run on the affected core.
“In a large enough fleet of computers, it is common to have a few bad CPUs. Those can often be identified by seeing that some commonly run kernel code, which runs fine everywhere else, keeps crashing on the same CPU core on one particular bad system. However, the failure mode in CPUs that have gone bad over the years are often oddly specific, and the only bad behavior seen might be segfaulting in programs like bash, Python, or various system daemons that run fine everywhere else,” said van Riel.
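The fleet-level approach van Riel describes, watching for crashes that keep landing on the same core, can be sketched as a simple log tally. The Python below is a minimal illustration, not the kernel's or any vendor's tooling. It assumes each segfault line ends with a suffix of the form "likely on CPU N (core N, socket N)"; the exact wording may differ between kernel versions. The sample log lines, the function names, and the threshold of three crashes are all hypothetical.

```python
import re
from collections import Counter

# Assumed suffix format; check it against the messages your kernel actually prints.
SEGFAULT_CPU = re.compile(
    r"segfault at .* likely on CPU (\d+) \(core (\d+), socket (\d+)\)"
)

# Hypothetical log lines, made up for illustration only.
SAMPLE_LOG = [
    "bash[1234]: segfault at 0 ip 0000000000401000 sp 00007ffc00000000 error 4 in bash[400000+100000] likely on CPU 5 (core 2, socket 0)",
    "python3[2210]: segfault at 8 ip 0000000000502000 sp 00007ffd00000000 error 4 in python3[400000+200000] likely on CPU 5 (core 2, socket 0)",
    "crond[311]: segfault at 10 ip 0000000000403000 sp 00007ffe00000000 error 6 in crond[400000+10000] likely on CPU 5 (core 2, socket 0)",
    "bash[4100]: segfault at 0 ip 0000000000401000 sp 00007ffc00000000 error 4 in bash[400000+100000] likely on CPU 19 (core 7, socket 1)",
    "kernel: eth0: link up",  # unrelated line, ignored by the parser
]


def count_segfaults_by_core(lines):
    """Count segfault reports per (socket, core) pair."""
    counts = Counter()
    for line in lines:
        match = SEGFAULT_CPU.search(line)
        if match:
            _cpu, core, socket = map(int, match.groups())
            counts[(socket, core)] += 1
    return counts


def suspect_cores(counts, threshold=3):
    """Return (socket, core) pairs with at least `threshold` reports."""
    return sorted(key for key, n in counts.items() if n >= threshold)


if __name__ == "__main__":
    counts = count_segfaults_by_core(SAMPLE_LOG)
    for socket, core in suspect_cores(counts):
        print(f"socket {socket}, core {core}: {counts[(socket, core)]} segfaults")
```

Because the reported CPU is only a likely location, a single hit proves little. Repeated hits on one core, from programs that run fine elsewhere, are the signal worth investigating.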
The logging feature will help detect potentially faulty processors and will be available from Linux 6.1, due later this year. It will also complement other fault-detection mechanisms, including the new Intel In-Field Scan, machine check exceptions (MCEs), and EDAC reporting.
The sources for this piece include an article in TechRadar.