Stack unwinding in AArch64 processors: what is it and how it works

March 1, 2023 - TuxCare PR Team

KernelCare Enterprise’s Linux kernel live patching software has supported ARMv8 (AArch64) in addition to x86_64 (Intel 64/AMD64) for some time now. However, to run on ARM, KernelCare needs something called a stack frame unwinder.

This article explains what they are, what they’re used for, and why we had to write our own stack frame unwinder.

Stack Unwinders: What Are They, What Are They Used for, and Their History
How Stack Unwinding Works: Stack Primer and How It Works in AArch64 Processors
Instructions for Stack Unwinders: Instructions to Jump, Address to Jump to, and Troubleshooting

Part A – Stack Unwinders

What Is a Stack Unwinder?

A stack unwinder is a piece of software that lists the addresses of every function currently on the call stack. It shows you where you are in a program’s execution and, more importantly, how you got there. The call stack is the list of all functions that have been called but have not yet returned. It is called a stack because, as one function calls another, the new function is added on top of the stack and becomes the currently executing one. When a function returns, it is removed from the stack.

An effective unwinder must have all of these characteristics:

Fast: so that processing can quickly resume (if possible).
Cheap: so that it doesn’t drain system resources.
Accurate: so that memory addresses and name-spaces are accurately reported.

What Is a Stack Unwinder Used for?

To provide stack traces when a program crashes. (In the Linux kernel world, such crashes are called ‘Oops’.)
To help with performance analysis by showing the route (which functions called which) that execution takes through a program.
To enable Linux kernel live patching, the act of fixing kernel bugs without stopping (rebooting) the system.

History of Stack Unwinding

Historically, stack unwinding helped developers to debug software. After all, anyone who has written at least a few programs knows that programs most likely contain errors.

Some errors are easy to spot, while others go almost unnoticed. The bigger a program is, the more difficult it becomes to debug: huge programs are almost impossible to debug using source code analysis as the sole technique.

That’s why there are several secondary techniques which aim to facilitate the debugging process:

Logging (i.e. debugging output). Here the programmer uses the printing facilities of their language of choice to display the contents of specific variables at specific points in the program flow. This allows them to spot where the program flow diverged from the expected scenario. Modern software relies heavily on logging but uses different logging levels (debug, notice, warning, error and so on) so that less important messages can be disabled in production. Typically, logging messages go to a dedicated logfile or, alternatively, to a system-wide log repository or file.
Debugging via breakpoints. To facilitate debugging, almost every modern CPU architecture includes a special “breakpoint” instruction (for instance, bkpt for 32-bit ARM, brk for AArch64, and int3 for x86). The purpose of this instruction is to raise a special processor exception. The hardware then saves the current program counter, and may save a few general-purpose registers, so that the interrupt service routine can start successfully. Control is then transferred to the interrupt service routine.

This routine extracts the saved contents of the general-purpose registers so that a programmer can examine program variables at the point where the program stopped. Because a debugger usually plants a breakpoint by overwriting an instruction of the program, the handler typically restores the original instruction just before executing the interrupt return instruction. After control returns to the interrupted thread, the program runs as if it had never been touched.

Modern-day usage of this technique relies on support from the operating system kernel since all interrupt service routines are part of it. Usually, the OS kernel provides special system calls (for instance, ptrace on Linux) that userspace processes may use to perform debugging functions. The popular Linux open-source debugger, named GDB, uses ptrace to perform step-by-step debugging (via breakpoints).

Also, some CPU architectures (x86-64 for instance) define additional capabilities based on the original idea. For example, there may be special system registers containing the address of a (virtual) breakpoint: when the program counter hits this address, the interrupt service routine is called immediately, without any need to patch (and consequently restore) program code.
Debugging via assertions. An assertion is a special function that checks a given condition and terminates the program if the condition is false. Programmers use assertions to ensure that internal variables are sane. Assertions are typically disabled in production programs. The most interesting case is when an assertion fails, because this clearly signals that a program error took place. To help investigate the problem, stack unwinding is performed. By unwinding the program’s stack, we obtain the “execution path” that led the program to the point where it crashed.

Advanced unwinders can also show the parameters passed to each function in the call chain.

So, stack unwinding originally came from software debugging. Now, however, it has other applications, one of which is in KernelCare Enterprise.

Why the KernelCare Enterprise Team Needed An Unwinder For ARM

To install a Linux kernel live patch, the patching software must know what functions are in the current calling stack. If a function currently in the calling stack is patched, the system can crash when returning. There is some stack unwinding functionality already in the Linux kernel. Here is a brief review of that functionality and the reasons why it can’t be used for live patching:

The ‘Guess’ unwinder: It guesses the contents of the stack. It is not accurate and so not useful for live patching. It is only available for x86_64 architectures.
The ‘frame pointer’ unwinder: Available only for x86_64.
The ‘ORC’ unwinder: Introduced in Linux kernel v4.14, “ORC” is an acronym for “Oops Rewind Capability”. Originally developed for x86_64 only, it has continued to be improved, and patches being considered for inclusion in the kernel (as of Linux kernel 6.3) extend its support to ARM and add reliability checks to ensure that accurate information is returned.

Part B – How Stack Unwinding Works

Stack Primer

When a function is called, a stack frame keeps track of the function’s arguments as well as its entry and exit points.
A processor register holds the stack pointer (‘SP’), which references the object most recently put on the stack. The stack grows downwards, towards lower memory addresses (a so-called ‘full-descending’ stack).
The stack pointer must be aligned to a 16-byte boundary on AArch64. This is enforced by the hardware (although the check can be deactivated on some ARM models).
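The two properties above (downward growth and 16-byte alignment) can be sketched in a few lines of C. This is only an illustration of the arithmetic; the names align_down16 and stack_reserve are made up for this example and are not kernel functions.

```c
#include <assert.h>
#include <stdint.h>

#define STACK_ALIGN 16u  /* AArch64 requires SP to be 16-byte aligned */

/* Round an address down to the nearest 16-byte boundary. */
uint64_t align_down16(uint64_t addr)
{
    return addr & ~(uint64_t)(STACK_ALIGN - 1);
}

/*
 * Reserve `size` bytes on a full-descending stack: the stack pointer
 * moves towards lower addresses and stays 16-byte aligned.
 */
uint64_t stack_reserve(uint64_t sp, uint64_t size)
{
    assert(sp % STACK_ALIGN == 0);   /* the caller must pass an aligned SP */
    return align_down16(sp - size);
}
```

For instance, reserving 20 bytes from an SP of 0x8000 yields 0x7fe0 rather than 0x7fec, because the result is rounded down to the next 16-byte boundary.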

Details

How does stack unwinding work in AArch64 processors?

A special kernel function performs stack unwinding. When called, the function gets the frame pointer (FP) of the calling function. The FP refers to a stack frame, which is represented by struct stack_frame. The frame contains a pointer to the stack frame of the function that called the calling function.

This means we have a linked list of stack frames that ends when the next FP obtained equals 0, as specified by AAPCS64 (the procedure call standard for AArch64). From each stack frame, we can retrieve the return address to which the called function transfers control when it finishes its work. Because a return address points inside the calling function, we can obtain the symbolic names of all functions in the chain, up to the point where FP=0.
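The walk over this linked list can be sketched as follows. This is a simplified model, not the kernel's unwinder: struct frame_record and walk_frames are hypothetical names, and a real unwinder would also have to check that each FP actually lies within the stack being examined.

```c
#include <stddef.h>

/* Hypothetical frame record: a link to the caller's record plus
 * the return address into the caller. */
struct frame_record {
    const struct frame_record *fp;  /* caller's frame record, NULL (0) at the end */
    unsigned long lr;               /* return address */
};

/*
 * Follow the chain starting at `fp`, storing up to `max` return
 * addresses in `out`. Returns the number of entries written.
 */
size_t walk_frames(const struct frame_record *fp, unsigned long *out, size_t max)
{
    size_t n = 0;

    while (fp != NULL && n < max) {
        out[n++] = fp->lr;
        fp = fp->fp;   /* step to the caller's frame record */
    }
    return n;
}
```

The `max` bound is a design choice for the sketch: it keeps the walk finite even if the chain is corrupted into a loop.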

To do so, we need the names of the functions together with their start and end addresses. The Linux kernel subsystem kallsyms provides this information.
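The idea of mapping a return address back to a function name can be illustrated with a small lookup table. This sketch does not use the kallsyms API; struct sym_range and lookup_symbol are hypothetical names, and the table contents are supplied by the caller.

```c
#include <stddef.h>

/* Hypothetical symbol table entry: an address range [start, end) and a name. */
struct sym_range {
    unsigned long start;
    unsigned long end;
    const char *name;
};

/* Return the name of the function containing `addr`, or NULL if none does. */
const char *lookup_symbol(const struct sym_range *tab, size_t count,
                          unsigned long addr)
{
    for (size_t i = 0; i < count; i++) {
        if (addr >= tab[i].start && addr < tab[i].end)
            return tab[i].name;
    }
    return NULL;
}
```

A linear scan keeps the example short; with a table sorted by start address, a binary search would be the more natural choice for a large number of symbols.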

The core problem can be boiled down to this: how do we move the program counter from one place to another (a jump) and resume processing without any problems?

In assembly language, this action consists of two parts: an instruction to jump and an address to jump to.

Let’s look at these in more detail.

1. Instruction to Jump

Procedures are invoked with BL. The 32-bit instruction can be visualized as follows:


Bits:   31 | 30 29 28 27 26 | 25 ... 0
Value:  1  | 0  0  1  0  1  | imm26
Field:  op |                |

Here, imm26 is a 26-bit PC-relative offset, counted in 4-byte instructions. For an extended look into this example, consider checking the ARM documentation for the BL instruction.
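The layout above translates directly into bit masks. The following sketch assumes nothing beyond that layout; the function names is_bl and bl_imm26 are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if `insn` is a BL: bit 31 (op) is 1 and bits 30..26 are 00101. */
bool is_bl(uint32_t insn)
{
    return (insn & 0xFC000000u) == 0x94000000u;
}

/* Extract the raw (not yet sign-extended) 26-bit immediate. */
uint32_t bl_imm26(uint32_t insn)
{
    return insn & 0x03FFFFFFu;
}
```

With op set to 0 the same pattern encodes a plain branch (B), which is why the mask covers all six top bits rather than only bits 30..26.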

2. Address to Jump to

We calculate where to jump to using:


bits(64) offset = SignExtend(imm26:'00', 64)

The immediate is shifted left by two bits (imm26:'00') and sign-extended to 64 bits, i.e. the high bits are filled with 1 if imm26 is negative and with 0 otherwise.

The address to jump to is then:


Address = PC + offset

In assembly source, the target is written as a label, and the assembler encodes the corresponding offset in the BL instruction. (Instructions in AArch64 always take 4 bytes, which is an advantage compared to x86 with its variable-length instructions.) The register X30 (also known as the Link Register, LR) is set to PC+4, the address of the instruction that follows the BL. This is the return address used by RET, which jumps to the address in X30 if no other register is specified.

So, for the complementary instruction RET, it is enough to take the saved LR value and transfer control to it, which returns execution to the calling function.
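As a worked example of the calculation, the sketch below computes the offset, the branch target, and the value written to the link register for a given BL encoding. The function names are hypothetical; `pc` is assumed to be the address of the BL instruction itself.

```c
#include <stdint.h>

/* Sign-extend imm26:'00' (a 28-bit value) to 64 bits. */
int64_t bl_offset(uint32_t insn)
{
    int64_t off = (int64_t)(insn & 0x03FFFFFFu) << 2;   /* imm26:'00' */

    if (off & (INT64_C(1) << 27))    /* bit 27 is the sign bit */
        off -= INT64_C(1) << 28;     /* same effect as filling the high bits with 1s */
    return off;
}

/* Address = PC + offset. */
uint64_t bl_target(uint64_t pc, uint32_t insn)
{
    return pc + (uint64_t)bl_offset(insn);
}

/* Value placed in X30 (LR) by BL: the address of the next instruction. */
uint64_t bl_link(uint64_t pc)
{
    return pc + 4;
}
```

For the encoding 0x94000004 at address 0x1000, imm26 is 4, so the offset is 16 bytes, the target is 0x1010, and LR becomes 0x1004. For 0x97FFFFFF, imm26 is all ones, which sign-extends to an offset of -4.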

The Problem: What Happens If the Called Function Calls Another Function Itself?

And here we have a problem: what happens if the called function calls another function itself? If we do nothing, the value saved in LR is overwritten with a new return address. The called function can then no longer return to the initial function, and the program will most likely crash.

Solving The Problem

There are two ways to solve this problem:

Save the LR value into some other register
Save the LR value into RAM

The first option is very restrictive, as the number of available registers is limited (to 31 general-purpose registers). ARM follows the RISC load/store philosophy: memory is accessed only through load (LDR) and store (STR) instructions, whereas arithmetic and logical operations are performed on registers. Free registers are therefore needed for program execution, which leaves us with the second option: saving the LR value into RAM.

Saving Return Addresses of Called Functions in the Stack

A frame structure implemented in C looks like this:


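struct stack_frame {
    unsigned long fp;   /* saved x29: address of the caller's stack frame */
    unsigned long lr;   /* saved x30: return address into the caller */
    char data[];        /* rest of the frame (local variables) */
};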

In other words, each function allocates n bytes on the stack by decreasing the stack pointer by n as soon as it receives control from the BL instruction. It then saves the contents of registers x29 (FP) and x30 (LR) at the new stack pointer address. At this moment they still hold the caller's frame pointer and the return address into the caller. After that, the new SP value is assigned to register x29, the frame pointer (FP). The remaining space in the stack frame is used for the function's local variables. As a result, the calling function's frame pointer and the return address into it are always located at the beginning of every stack frame. When it finishes its work, the called function restores the saved values of FP and LR from the stack frame and increases the stack pointer (SP) by n.
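The sequence described above can be modelled in plain C on a simulated stack. This is only a sketch: struct cpu, frame_push, and frame_pop are hypothetical names, SP is modelled as a byte offset into a small array, and the model assumes SP is back at the frame's base when the function exits.

```c
#include <stdint.h>

#define STACK_WORDS 64

/* Hypothetical model of the three registers involved and a small stack. */
struct cpu {
    uint64_t sp;                 /* stack pointer, as a byte offset into `mem` */
    uint64_t fp;                 /* x29 */
    uint64_t lr;                 /* x30 */
    uint64_t mem[STACK_WORDS];   /* stack memory, 8 bytes per slot */
};

/* Function entry: allocate n bytes, save FP and LR, point FP at the record. */
void frame_push(struct cpu *c, uint64_t n)
{
    c->sp -= n;                      /* the stack grows downwards */
    c->mem[c->sp / 8] = c->fp;       /* caller's FP at [sp] */
    c->mem[c->sp / 8 + 1] = c->lr;   /* return address at [sp + 8] */
    c->fp = c->sp;                   /* FP now refers to this frame */
}

/* Function exit: restore FP and LR, then release the n bytes. */
void frame_pop(struct cpu *c, uint64_t n)
{
    c->fp = c->mem[c->sp / 8];
    c->lr = c->mem[c->sp / 8 + 1];
    c->sp += n;
}
```

After two nested calls to frame_push, the slot that FP points at contains the previous FP, which is exactly the link that an unwinder follows.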

How gcc Cross-compiler Deals with AArch64

Using the frame pointer requires the Linux kernel to be compiled with the -fno-omit-frame-pointer gcc option. This option tells gcc to store the stack frame pointer in a register. (NOTE: The default for gcc is -fomit-frame-pointer, so this option must be explicitly set.)

For AArch64, the register is X29. It is reserved for the stack frame pointer when the option is set. (Otherwise, it can be used for other purposes.) The GCC cross-compiler used to build Linux for AArch64 emits the following instructions before the function body:


ffffff80080851b8 :
ffffff80080851b8: a9be7bfd stp x29, x30, [sp, #-32]!
ffffff80080851bc: 910003fd mov x29, sp

This uses so-called pre-indexed addressing: the stack pointer (SP) is first decreased by 32, and x29 and x30 are then stored one after the other in memory at the new SP address. The second instruction copies the new SP into x29.

Usually, the function finishes as follows:


ffffff80080851fc: a8c27bfd ldp x29, x30, [sp], #32
ffffff8008085200: d65f03c0 ret

This uses post-indexed addressing: the saved values of x29 and x30 are loaded from memory at the stack pointer (SP), and SP is then increased by 32. The code examples above are called the prologue and epilogue of the function, respectively. GCC always generates such prologues and epilogues if the flag -fno-omit-frame-pointer is set. Linux on AArch64 is compiled with that flag, so stack frames have a regular layout (except in assembly code). This fact allows us to easily unwind the stack, i.e. track the call chain in the program.
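The numbers in the two listings can be checked against the instruction encodings. The sketch below decodes the register fields and the offset of a 64-bit STP/LDP (load/store pair) instruction; the helper names are made up for this example, and it assumes the 64-bit variant, where the 7-bit immediate is scaled by 8.

```c
#include <stdint.h>

/* Register fields of a 64-bit STP/LDP (load/store pair) encoding. */
unsigned pair_rt(uint32_t insn)  { return insn & 0x1Fu; }          /* bits 4..0   */
unsigned pair_rn(uint32_t insn)  { return (insn >> 5) & 0x1Fu; }   /* bits 9..5   */
unsigned pair_rt2(uint32_t insn) { return (insn >> 10) & 0x1Fu; }  /* bits 14..10 */

/* Byte offset: imm7 (bits 21..15) is signed and scaled by 8 for X registers. */
int pair_offset(uint32_t insn)
{
    int imm7 = (int)((insn >> 15) & 0x7Fu);

    if (imm7 & 0x40)    /* bit 6 is the sign bit */
        imm7 -= 0x80;
    return imm7 * 8;
}
```

Applied to the prologue word a9be7bfd, this gives registers 29 and 30, base register 31 (which means SP in this instruction), and an offset of -32. The epilogue word a8c27bfd decodes to the same registers with an offset of +32.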

Conclusion

Despite its usefulness, there is no common approach to stack unwinding that covers all architectures and systems. For Linux kernel live patching, a reliable and quick stack unwinder is essential. For Linux running on ARM, the need for a robust stack unwinding solution is even more pressing, as ARM gains traction in the IoT device and edge-cloud computing markets. With KernelCare pushing into both, we had to look into our own solutions for kernel stack unwinding.

About KernelCare Enterprise

KernelCare Enterprise makes patching your Linux kernels simple for servers on CentOS, Amazon Linux, RHEL, Ubuntu, Debian, and other Linux distributions, including Poky (the Yocto Project’s distribution) and Raspbian.

KernelCare maintains kernel security with automated, rebootless updates without any service interruption or degradation. The service promptly delivers the latest security patches for different Linux distributions applied automatically to the running kernel in just nanoseconds. KernelCare Enterprise works in both live and staging environments as well as local and on-the-cloud systems, and for servers located behind a firewall, there is an on-premise ePortal tool to help you manage it.

KernelCare Enterprise enhances compliance on hundreds of thousands of servers of various companies where the service availability and data protection are the most crucial parts of the business: financial and insurance services, video conferencing solution providers, companies protecting domestic abuse victims, hosting companies, and public service providers.

Learn more about KernelCare Enterprise and the benefits it affords here.

Author: Stephan Venter
