# pocs/linux/kernelctf/CVE-2024-26583_cos/docs/exploit.md
## Setup

To trigger TLS encryption we must first configure the socket.
This is done by calling setsockopt() at the SOL_TLS level with the TLS_TX option:

```c
static struct tls12_crypto_info_aes_ccm_128 crypto_info;
crypto_info.info.version = TLS_1_2_VERSION;
crypto_info.info.cipher_type = TLS_CIPHER_AES_CCM_128;

if (setsockopt(sock, SOL_TLS, TLS_TX, &crypto_info, sizeof(crypto_info)) < 0)
err(1, "TLS_TX");

```

This syscall triggers allocation of TLS context objects which will be important later on during the exploitation phase.

In the KernelCTF config, PCRYPT (parallel crypto engine) is disabled, so our only option to trigger async crypto is CRYPTD (the software async crypto daemon).

Each crypto operation needed for TLS is usually implemented by multiple drivers.
For example, AES encryption in CBC mode is available through aesni_intel, aes_generic or cryptd (which is a daemon that runs these basic synchronous crypto operations in parallel using an internal queue).

Available drivers can be examined by looking at /proc/crypto; however, those are only the drivers of the currently loaded modules. The Crypto API supports loading additional modules on demand.

As seen in the code snippet above, we don't have direct control over which crypto drivers are going to be used for our TLS encryption.
Drivers are selected automatically by the Crypto API based on a priority field, which is calculated internally to choose the "best" driver.

By default, cryptd is not selected and is not even loaded, which gives us no chance to exploit vulnerabilities in async operations.

However, we can cause cryptd to be loaded and influence the selection of drivers for TLS operations by using the Crypto User API. This API is used to perform low-level cryptographic operations and allows the user to select an arbitrary driver.

The interesting thing is that requesting a given driver permanently changes the system-wide list of available drivers and their priorities, affecting future TLS operations.

The following code causes the AES CCM encryption selected for TLS to be handled by cryptd:

```c
struct sockaddr_alg sa = {
.salg_family = AF_ALG,
.salg_type = "skcipher",
.salg_name = "cryptd(ctr(aes-generic))"
};
int c1 = socket(AF_ALG, SOCK_SEQPACKET, 0);

if (bind(c1, (struct sockaddr *)&sa, sizeof(sa)) < 0)
err(1, "af_alg bind");

struct sockaddr_alg sa2 = {
.salg_family = AF_ALG,
.salg_type = "aead",
.salg_name = "ccm_base(cryptd(ctr(aes-generic)),cbcmac(aes-aesni))"
};

if (bind(c1, (struct sockaddr *)&sa2, sizeof(sa2)) < 0)
err(1, "af_alg bind");
```

## What we start with and what we can do

If we win the race condition, the vulnerability gives us a limited write primitive.
To be exact, it gives us the ability to change an 8-bit integer value of '1' to '0' at offset 0x158 in the struct tls_sw_context_rx object, which is allocated from the general kmalloc-512 cache.

The big problem is finding a victim object in which this limited write gives us the ability to escalate privileges or at least get a better exploitation primitive.

## Victim object

We had no success looking for kmalloc-512 objects, so we had to turn our attention to objects from other caches, even though it requires a cross-cache attack.

The only object we were able to find is ipcomp_tfms:

```c
struct ipcomp_tfms {
struct list_head list; /* 0 0x10 */
struct crypto_comp * * tfms; /* 0x10 0x8 */
int users; /* 0x18 0x4 */

/* size: 32, cachelines: 1, members: 3 */
};
```

This is used in XFRM code. Changing the reference counter 'users' from 1 to 0 gives us a use-after-free.

Unfortunately, only one such object can exist in the whole system, so there is no way to spray the whole page with these objects.

There are 128 possible positions of this object in the kmalloc-32 slab and 16 positions of the rx context in kmalloc-512.

Only a few of these combinations put the corrupted byte at offset 0x158 on top of the 'users' field, giving us a chance to perform the attack.

```
Target: 0x158 (base: 0x0) victim(ipcomp_tfms): 0x158 (base: 0x140)
Target: 0x358 (base: 0x200) victim(ipcomp_tfms): 0x358 (base: 0x340)
Target: 0x558 (base: 0x400) victim(ipcomp_tfms): 0x558 (base: 0x540)
Target: 0x758 (base: 0x600) victim(ipcomp_tfms): 0x758 (base: 0x740)
Target: 0x958 (base: 0x800) victim(ipcomp_tfms): 0x958 (base: 0x940)
Target: 0xb58 (base: 0xa00) victim(ipcomp_tfms): 0xb58 (base: 0xb40)
Target: 0xd58 (base: 0xc00) victim(ipcomp_tfms): 0xd58 (base: 0xd40)
Target: 0xf58 (base: 0xe00) victim(ipcomp_tfms): 0xf58 (base: 0xf40)
```

Another issue is that kmalloc-32 uses order 0 pages, while kmalloc-512 uses order 1.

This means we not only have to discard the slab page back to the page allocator, but also move it from the PCP lists to the buddy allocator and arrange the allocator state so that an order 1 page is returned for an order 0 request.

All those issues combined resulted in a rather unreliable exploit; however, it was reliable enough to eventually get the flag.

## Triggering use-after-free through race condition

```c
spin_lock_bh(&ctx->decrypt_compl_lock);
if (!atomic_dec_return(&ctx->decrypt_pending))
[1] complete(&ctx->async_wait.completion);
[2] spin_unlock_bh(&ctx->decrypt_compl_lock);
}
```

To exploit the race condition we have to hit the window between points [1] and [2] and perform the following actions:
1. Close the socket to free the tls context (struct tls_sw_context_rx), leading to the slab page being discarded.
2. Allocate a new page table in place of the tls context.

To hit this small window and extend it enough to fit our allocations, we turn to the well-known timerfd technique invented by Jann Horn.
The basic idea is to set an hrtimer-based timerfd to trigger a timer interrupt during the race window, and to attach as many epoll watches as RLIMIT_NOFILE allows to this timerfd, making the interrupt take longer to handle.
For more details see the original [blog post](https://googleprojectzero.blogspot.com/2022/03/racing-against-clock-hitting-tiny.html).

Exploitation is done in 2 threads: the main process runs on CPU 0, and a new thread (child_recv()) is cloned for each attempt and bound to CPU 1.

| CPU 0 | CPU 1 |
| -------- | -------- |
| allocate tls context | - |
| - | exploit calls recv() triggering async crypto ops |
| - | tls_sw_recvmsg() waits on completion |
| - | cryptd calls tls_decrypt_done() |
| - | tls_decrypt_done() finishes the complete() call |
| - | timer interrupts tls_decrypt_done() |
| recv() returns to userspace unlocking the socket | timerfd code goes through all epoll notifications |
| exploit calls close() to free tls context | ... |
| exploit allocates a page table in place of tls context| ... |
| - | interrupt finishes and returns control to tls_decrypt_done() |
| - | spin_unlock_bh() writes to PTE |


## Ensuring the slab page is discarded

struct tls_sw_context_rx is allocated from kmalloc-512. This cache uses order-1 slabs, each storing 16 objects.
To ensure the slab page is discarded we have to meet the same requirements as in a cross-cache attack:

- all objects in the same slab as tls_sw_context_rx must be freed. All neighbouring objects are xattrs from the same kmalloc-512 cache and are freed before starting the race, which freezes the slab and puts it on the per-cpu partial list
- the per-cpu partial list must be full to unfreeze the slab after the tls context is freed
- the per-node partial list must also be full for the slab to be discarded instead of being moved to the per-node list

All these requirements are met by freeing enough kmalloc-512 xattrs before the tls context is freed.


## Moving the order-1 page from PCP to buddy allocator

If we free more pages than the 'high' limit of the given PCP list, a batch of pages will be released back to the buddy allocator:

```c
if (pcp->count >= high) {
int batch = READ_ONCE(pcp->batch);

free_pcppages_bulk(zone, nr_pcp_free(pcp, high, batch), pcp);
}
}
```

To do this efficiently within the race window, we free pages exactly up to the limit, so that discarding the slab page immediately triggers free_pcppages_bulk().
The information we need about the current PCP state comes from [reading the zoneinfo file](../CVE-2024-26582_lts/docs/novel-techniques.md#predicting-how-much-we-have-to-allocate-free-to-trigger-pcp-flush).

## Allocating an order 1 page

As long as there are no free order 0 pages available, the buddy allocator will split and return the order 1 page that was recently flushed from the PCP.

We just have to allocate enough objects from an order 0 slab cache such as kmalloc-256; but if we allocate too many, the buddy allocator will split some higher order pages and the order 0 count might increase instead.

Fortunately, we can parse the [buddyinfo](novel-techniques.md) file to get the zone counts we need.

## Triggering the use-after-free after 'users' field change

At this point the 'users' field has been changed from 1 to 0 (this is stage2() in the exploit).

This field is a reference counter, but it doesn't use the refcount_t type, so there are no protections against invalid values.

The code checking whether the object is still in use is very simple:
```c
static void ipcomp_free_tfms(struct crypto_comp * __percpu *tfms)
{
struct ipcomp_tfms *pos;
int cpu;

list_for_each_entry(pos, &ipcomp_tfms_list, list) {
if (pos->tfms == tfms)
break;
}

WARN_ON(list_entry_is_head(pos, &ipcomp_tfms_list, list));

[1] if (--pos->users)
return;

list_del(&pos->list);
kfree(pos);

if (!tfms)
return;

for_each_possible_cpu(cpu) {
struct crypto_comp *tfm = *per_cpu_ptr(tfms, cpu);
crypto_free_comp(tfm);
}

}
```

If 'users' was equal to 1, the decrement at [1] makes it 0 and the object is freed.

Right now our counter is at 0, but we can simply allocate another XFRM SA to bring it back to 1 and then delete it, freeing the object while it is still in use.

## Getting RIP control

When ipcomp_tfms is freed, all the crypto context objects are freed as well, including struct crypto_alg, which contains struct compress_alg:

```c
struct compress_alg {
	int (*coa_compress)(struct crypto_tfm *, const u8 *, unsigned int,
	                    u8 *, unsigned int *);   /* 0    0x8 */
	int (*coa_decompress)(struct crypto_tfm *, const u8 *, unsigned int,
	                      u8 *, unsigned int *); /* 0x8  0x8 */

	/* size: 16, cachelines: 1, members: 2 */
};
```
```


These function pointers are called to compress/decompress network data on sockets configured with XFRM ipcomp.

If we allocate our payload in place of this object, we can trigger code execution by calling sendmsg() on our XFRM socket.

## Pivot to ROP

At this point RSI contains a pointer to our data, so we only need 2 gadgets to pivot to ROP:
```asm
push rsi
jmp qword ptr [rsi+0xf]
```

and

```asm
pop rsp
```

## Second pivot

At this point we have full ROP and enough space available, but our standard privilege escalation payload relies on the ROP chain being at a known location, so we choose an unused read/write area in the kernel and use copy_user_generic_string() to copy the second-stage ROP chain from userspace to that area.
Then we use a `pop rsp ; ret` gadget to pivot there.

## Privilege escalation

This time execution happens in the context of a syscall, so it is easy to escalate privileges with the standard `commit_creds(init_cred); switch_task_namespaces(pid, init_nsproxy);` sequence and return to a root shell.

# pocs/linux/kernelctf/CVE-2024-26583_cos/docs/novel-techniques.md
## Determining the state of the buddy allocator by parsing /proc/buddyinfo

The Linux kernel exposes statistics about each memory zone in the world-readable /proc/buddyinfo file.
For example:
```
Node 0, zone DMA 0 0 0 0 0 0 0 1
Node 0, zone DMA32 4 2 1 1 3 2 3 2
Node 0, zone Normal 0 1 0 2 2 2 2 3
```

This means that, for example, the DMA32 zone has 4 free order 0 pages, 2 free order 1 pages, etc.

This is very useful when an exploit needs to manipulate the buddy allocator to obtain the target page even when it is not an exact match for the currently requested page (e.g. a different order).

# pocs/linux/kernelctf/CVE-2024-26583_cos/docs/vulnerability.md
## Requirements to trigger the vulnerability

- Kernel configuration: CONFIG_TLS and one of [CONFIG_CRYPTO_PCRYPT, CONFIG_CRYPTO_CRYPTD]
- User namespaces required: no

## Commit which introduced the vulnerability

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=0cada33241d9de205522e3858b18e506ca5cce2c

## Commit which fixed the vulnerability

https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=aec7961916f3f9e88766e2688992da6980f11b8d

## Affected kernel versions

Introduced in 4.20. Fixed in 6.1.78, 5.15.159 and other stable trees.

## Affected component, subsystem

net/tls

## Description

TLS decryption works by calling recvmsg() on a TLS configured socket.
This will retrieve an encrypted message from the network stack and perform decryption.
AEAD decryption work is submitted to the crypto subsystem in tls_do_decryption(), setting tls_decrypt_done() as a callback and calling crypto_aead_decrypt().

If decryption is done asynchronously, crypto_aead_decrypt() returns immediately with -EINPROGRESS instead of waiting.
Execution then returns to tls_sw_recvmsg() which waits for the async crypto operations to be done using a completion mechanism.

When decryption is finished, the crypto subsystem calls tls_decrypt_done() callback function, which calls complete() allowing tls_sw_recvmsg() to exit. When recvmsg() returns, the socket is no longer locked and it is now possible to close it, which causes all associated objects to be freed.

Relevant tls_decrypt_done() code:

```c
...
spin_lock_bh(&ctx->decrypt_compl_lock);
if (!atomic_dec_return(&ctx->decrypt_pending))
[1] complete(&ctx->async_wait.completion);
[2] spin_unlock_bh(&ctx->decrypt_compl_lock);
}

```

The bug is a race condition: calling complete() at [1] allows the socket to be closed, which causes the ctx object to be freed, but ctx is later used as an argument to spin_unlock_bh() at [2].

If an attacker manages to close the socket and reallocate the freed ctx with controlled data between points [1] and [2], they can manipulate memory using spin_unlock_bh().

This is a very limited write primitive, as it only allows changing an 8-bit value of 1 to 0 at a fixed position in memory (the spinlock is basically a 32-bit unsigned integer whose least significant byte holds the actual lock value).
# Makefile
```makefile
INCLUDES =
LIBS = -pthread -ldl -lm
CFLAGS = -fomit-frame-pointer -static -fcf-protection=none

exploit: exploit.c kernelver_17412.226.68.h
	gcc -o $@ exploit.c $(INCLUDES) $(CFLAGS) $(LIBS)

prerequisites:
	sudo apt-get install libkeyutils-dev
```