Overview
About vulnerability
In the Linux kernel, the following vulnerability has been resolved:
blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init
blk-iocost sometimes causes the following crash:
BUG: kernel NULL pointer dereference, address: 00000000000000e0
…
RIP: 0010:_raw_spin_lock+0x17/0x30
Code: be 01 02 00 00 e8 79 38 39 ff 31 d2 89 d0 5d c3 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 65 ff 05 48 d0 34 7e b9 01 00 00 00 31 c0
This happens because iocg->ioc is NULL. The field is initialized by ioc_pd_init() and never cleared. The NULL deref is caused by blkcg_activate_policy() installing blkg_policy_data before initializing it.
blkcg_activate_policy() was doing the following:
- Allocate pd’s for all existing blkg’s and install them in blkg->pd[].
- Initialize all pd’s.
- Online all pd’s.
blkcg_activate_policy() only grabs the queue_lock and may release and re-acquire the lock as allocation may need to sleep. ioc_weight_write() grabs blkcg->lock and iterates all its blkg’s. The two can race and if ioc_weight_write() runs during #1 or between #1 and #2, it can encounter a pd which is not initialized yet, leading to crash.
The crash can be reproduced with the following script:
#!/bin/bash
echo +io > /sys/fs/cgroup/cgroup.subtree_control systemd-run –unit touch-sda –scope dd if=/dev/sda of=/dev/null bs=1M count=1 iflag=direct echo 100 > /sys/fs/cgroup/system.slice/io.weight bash -c “echo ‘8:0 enable=1’ > /sys/fs/cgroup/io.cost.qos” & sleep .2 echo 100 > /sys/fs/cgroup/system.slice/io.weight
with the following patch applied:
diff –git a/block/blk-cgroup.c b/block/blk-cgroup.c index fc49be622e05..38d671d5e10c 100644 — a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -1553,6 +1553,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol) pd->online = false; }
if (system_state == SYSTEM_RUNNING) {spin_unlock_irq(&q->queue_lock);ssleep(1);spin_lock_irq(&q->queue_lock);}- /* all allocated, init in the same order */ if (pol->pd_init_fn) list_for_each_entry_reverse(blkg, &q->blkg_list, q_node)
I don’t see a reason why all pd’s should be allocated, initialized and onlined together. The only ordering requirement is that parent blkgs to be initialized and onlined before children, which is guaranteed from the walking order. Let’s fix the bug by allocating, initializing and onlining pd for each blkg and holding blkcg->lock over initialization and onlining. This ensures that an installed blkg is always fully initialized and onlined removing the the race window.
Details
- Affected product:
- AlmaLinux 9.2 ESU , TuxCare 9.6 ESU
- Affected packages:
- kernel @ 5.14.0 (+1 more)
In the Linux kernel, the following vulnerability has been resolved:
blk-cgroup: Fix NULL deref caused by blkg_policy_data being installed before init
blk-iocost sometimes causes the following crash:
BUG: kernel NULL pointer dereference, address: 00000000000000e0
…
RIP: 0010:_raw_spin_lock+0x17/0x30
Code: be 01 02 00 00 e8 79 38 39 ff 31 d2 89 d0 5d c3 0f 1f 00 0f 1f 44 00 00 55 48 89 e5 65 ff 05 48 d0 34 7e b9 01 00 00 00 31 c0
This happens because iocg->ioc is NULL. The field is initialized by ioc_pd_init() and never cleared. The NULL deref is caused by blkcg_activate_policy() installing blkg_policy_data before initializing it.
blkcg_activate_policy() was doing the following:
- Allocate pd’s for all existing blkg’s and install them in blkg->pd[].
- Initialize all pd’s.
- Online all pd’s.
blkcg_activate_policy() only grabs the queue_lock and may release and re-acquire the lock as allocation may need to sleep. ioc_weight_write() grabs blkcg->lock and iterates all its blkg’s. The two can race and if ioc_weight_write() runs during #1 or between #1 and #2, it can encounter a pd which is not initialized yet, leading to crash.
The crash can be reproduced with the following script:
#!/bin/bash
echo +io > /sys/fs/cgroup/cgroup.subtree_control systemd-run –unit touch-sda –scope dd if=/dev/sda of=/dev/null bs=1M count=1 iflag=direct echo 100 > /sys/fs/cgroup/system.slice/io.weight bash -c “echo ‘8:0 enable=1’ > /sys/fs/cgroup/io.cost.qos” & sleep .2 echo 100 > /sys/fs/cgroup/system.slice/io.weight
with the following patch applied:
diff –git a/block/blk-cgroup.c b/block/blk-cgroup.c index fc49be622e05..38d671d5e10c 100644 — a/block/blk-cgroup.c +++ b/block/blk-cgroup.c @@ -1553,6 +1553,12 @@ int blkcg_activate_policy(struct gendisk *disk, const struct blkcg_policy *pol) pd->online = false; }
if (system_state == SYSTEM_RUNNING) {spin_unlock_irq(&q->queue_lock);ssleep(1);spin_lock_irq(&q->queue_lock);}- /* all allocated, init in the same order */ if (pol->pd_init_fn) list_for_each_entry_reverse(blkg, &q->blkg_list, q_node)
I don’t see a reason why all pd’s should be allocated, initialized and onlined together. The only ordering requirement is that parent blkgs to be initialized and onlined before children, which is guaranteed from the walking order. Let’s fix the bug by allocating, initializing and onlining pd for each blkg and holding blkcg->lock over initialization and onlining. This ensures that an installed blkg is always fully initialized and onlined removing the the race window.