I previously wrote about using libfuse and Zig to create a minimal file system. You can see from that article that I had some issues compiling libfuse and also speculated that it would be better to use the raw interface directly.
Soon after finishing that article I decided to try exploiting a bug (an n-day UAF). I have spent a lot of time reproducing bugs for the Linux Test Project, but never exploiting one.
Presently I’m stuck on finding a heap spray technique that I can get to work with the bug given the limitations of Google’s kCTF targets. I’ll leave the details for another article if or when I get it to work. For now let’s just say that I want to implement setxattr and/or some other messages. To see why, take a look at chompie’s article.
To get a deeper look at how FUSE works I decided to use the raw interface. It seems like there is a lot of interest in FUSE and I keep finding use cases for it (in addition to nefarious activities). Most recently I came across MemProcFs, which lets you mount your RAM as a file system. I’ve also been using sshfs quite extensively with my headless workstation.
This article is as much a narrative of my investigation as it is information about FUSE and Zig. As such it will contain misunderstandings and speculation.
Raw FUSE
The raw interface for FUSE is usually found at /dev/fuse. This is a special character file that your system creates using mknod with the device type 0xa:0xe5 (I’ll just refer to it as /dev/fuse).
The character device type with the major number 0xa and minor number 0xe5 is what allows us to create file systems in userspace. Usually this is made accessible at /dev/fuse, but in theory you could create this file at any path using mknod. This is true for all devices and device-like interfaces such as FUSE.
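To make the 0xa:0xe5 pair concrete, here is a minimal sketch of how a major and minor number are packed into the single device number that mknod takes. It assumes the bit layout used by glibc’s makedev on Linux; the makedev function below is a hypothetical helper written for illustration, not something taken from the Zig standard library.

```zig
const std = @import("std");

/// Pack a major/minor pair into one device number.
/// Assumption: this mirrors the layout glibc uses for makedev() on Linux.
fn makedev(major: u32, minor: u32) u64 {
    const major64: u64 = major;
    const minor64: u64 = minor;
    return ((major64 & 0x00000fff) << 8) |
        ((major64 & 0xfffff000) << 32) |
        (minor64 & 0x000000ff) |
        ((minor64 & 0xffffff00) << 12);
}

pub fn main() void {
    // 0xa is the major number and 0xe5 the minor number quoted above.
    const dev = makedev(0xa, 0xe5);
    std.debug.print("dev = 0x{x}\n", .{dev});
}
```

For small numbers like these the result is simply the major shifted left by eight bits with the minor in the low byte, so this prints dev = 0xae5.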
Whether this file is accessible depends on what permissions it is given. Most distros allow regular users to access it. If they don’t, then a user that has CAP_MKNOD could still create the device node. Then again, a user that can create device nodes can probably do anything.
A character device can be opened like a regular file. Once we have a file handle to it we can read from and write to it. Note that each open and resulting file handle are independent. To create multiple file systems we just open /dev/fuse multiple times.
Exactly what happens when we read or write to the file handle depends on the device driver. In FUSE’s case we’re not really dealing with a device; it is purely an interface between software components.
So FUSE’s device driver is basically a big piece of glue code which translates file system requests into messages that are passed to our daemon reading /dev/fuse.
We read from /dev/fuse and when we get a message, we send back a response. The FUSE code then does all kinds of caching, sends out notifications, translates error codes, creates internal kernel objects and so on.
Opening /dev/fuse is not enough to use the file system; it also needs to be mounted. Usually a device is mounted by specifying its path (e.g. /dev/hda) or, if there is no device, we just specify the file system name (e.g. tmpfs). However, to my knowledge, a FUSE file system has no device or name associated with it.
So instead we pass the file descriptor we just opened to mount. This is done using a file system option called fd. The mount system call then looks at our process’s file descriptor table and checks that the file descriptor points to an instance of /dev/fuse.
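The option string handed to mount can be sketched as follows. The fd option is the one described above; rootmode, user_id and group_id are, going by the kernel’s FUSE documentation, the other options the mount expects to see. The fuseMountOpts helper and the example values are hypothetical, and the sketch only formats the string rather than calling mount.

```zig
const std = @import("std");

/// Format the option string passed as the data argument of mount(2).
/// Assumption: rootmode is written in octal and describes the root inode.
fn fuseMountOpts(buf: []u8, fd: i32, uid: u32, gid: u32) ![]u8 {
    // 0o40000 is S_IFDIR, i.e. the root of the mount is a directory.
    const root_mode: u32 = 0o40000;
    return std.fmt.bufPrint(
        buf,
        "fd={d},rootmode={o},user_id={d},group_id={d}",
        .{ fd, root_mode, uid, gid },
    );
}

pub fn main() !void {
    var buf: [128]u8 = undefined;
    // 3 stands in for whatever descriptor open("/dev/fuse") returned.
    const opts = try fuseMountOpts(&buf, 3, 1000, 1000);
    std.debug.print("{s}\n", .{opts});
}
```

With the example values this prints fd=3,rootmode=40000,user_id=1000,group_id=1000.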
Before looking into any of this I was contacted by José-Paul Dominguez who was using the low-level libfuse API. While debugging their application with strace they noticed that libfuse called mount and it failed. However something still got mounted.
I mentioned in my previous article that a regular user can use FUSE. Strictly speaking though this is not true, because a regular user cannot call the mount system call. This confused me and I wondered if there is in fact some other magical way to mount a file system, or at least a FUSE file system.
It turns out that what libfuse actually does is call a suid binary called fusermount. This has the permission bit to run as root, so it can perform the mount. This didn’t turn up in strace because it was missing the -f follow switch, so child processes (i.e. fusermount) were not traced. Also I doubt that suid binaries can be traced unless strace is run as root.
For some reason fusermount does not simply take an argument specifying the open fd for /dev/fuse. Usually, unless FD_CLOEXEC is set, the fd should just stay open after the fork and exec to execute fusermount.
Instead it opens a UNIX socket and uses a control message to transfer the fd from the parent process to the child. Passing file descriptors between running processes is not something I have seen very often. OpenQA does it to pass an FD to QEMU, but let’s not get into that.
I decided not to do that. Instead I run my code in an unprivileged user and mount namespace. This is not a good general solution, but it works for this demo. User namespaces are how it’s possible to have a container with a root user in it. It’s not the real root user, but it lets you mount some file systems, including FUSE file systems.
Being able to create a new user namespace from an unprivileged user is obviously a massive security challenge. It simply makes a lot more system calls available to an attacker.
It’s useful, though, for letting a container runtime avoid requiring root while still being able to run containers with a root user in them. So a lot of distributions allow it by default.
Methodology
The raw FUSE API has a man page. However it is both incomplete and not up to date. libfuse or the Go equivalent can also be used as documentation. I didn’t spend much time looking at these though.
Instead I decided to (mostly) ignore the library and look at the kernel directly. I set up some breakpoints in GDB (actually pwndbg, also with the Linux kernel scripts) and started reading /dev/fuse to see what happened. I like bpftrace for debugging the kernel, but didn’t use it for this.
I have a script which starts QEMU with the -s switch to enable the GDB stub. Then once QEMU has started I run pwndbg from the kernel source directory. The output of pwndbg is highly verbose, so I have removed most of it.
$ pwndbg vmlinux
...
pwndbg> target remote localhost:1234
...
pwndbg> lx-symbols
loading vmlinux
scanning for modules in /home/rich/kernel/linux
loading @0xffffffffc0201000: /home/rich/kernel/linux/fs/fuse/fuse.ko
For lx-symbols to work I have a ~/.conf/gdb/gdbinit like:
add-auto-load-safe-path /home/rich/kernel/linux/scripts/gdb/vmlinux-gdb.py
There is a whole bunch of other setup to build the VM image I am using. You can see this at github.com/richiejp/m. At some point I’d like to fully automate recreating the environment and get cross compilation to work for a reasonable set of tools. Right now though you may find some bits are missing from the initrd. Also see the related article on cross compiling with Zig and LLVM.
Anyway, once we have all this set up, it’s possible to set a breakpoint at, say, fuse_simple_request in the kernel.
pwndbg> b fuse_simple_request
Breakpoint 1 at 0xffffffffc0201587: file fs/fuse/dev.c, line 485.
Then when we start processing requests we’ll get dropped into this function. This function is particularly useful because it gets called for most requests. If we inspect the backtrace then it is usually easy to see what the current operation is.
pwndbg> bt
#0 fuse_simple_request (fm=0xffff888100ac1860, args=0xffffc900003f7b18) at fs/fuse/dev.c:485
#1 0xffffffffc0207e67 in fuse_do_getattr (inode=0xffff888107e7c340, stat=0x0 <fixed_percpu_data>, file=0x0 <fixed_percpu_data>) at fs/fuse/dir.c:1119
#2 0xffffffffc02081e3 in fuse_perm_getattr (inode=0xffff888107e7c340, mask=1) at fs/fuse/dir.c:1306
#3 fuse_permission (mnt_userns=<optimized out>, inode=0xffff888107e7c340, mask=1) at fs/fuse/dir.c:1347
#4 0xffffffff81347749 in do_inode_permission (mnt_userns=0xffffffff828534b0 <init_user_ns>, inode=0xffff888107e7c340, mask=1) at fs/namei.c:458
#5 inode_permission (mnt_userns=0xffffffff828534b0 <init_user_ns>, inode=0xffff888107e7c340, mask=1) at fs/namei.c:525
#6 0xffffffff8134e8a5 in may_lookup (mnt_userns=0xffffffff828534b0 <init_user_ns>, nd=0xffffc900003f7d30) at fs/namei.c:1715
#7 link_path_walk (name=0xffff88810088b02f "foo", nd=0xffffc900003f7d30) at fs/namei.c:2262
#8 0xffffffff8134809a in path_lookupat (nd=0xffffc900003f7d30, flags=65, path=0xffffc900003f7e88) at fs/namei.c:2473
#9 0xffffffff81347f37 in filename_lookup (dfd=-100, name=0xffff88810088b000, flags=<optimized out>, path=0xffffc900003f7e88, root=0x0 <fixed_percpu_data>) at fs/namei.c:2503
#10 0xffffffff81348c16 in user_path_at_empty (dfd=-100, name=0x7b9079d55ef8 "/tmp/fuse-test/foo", flags=1, path=0xffffc900003f7e88, empty=0x0 <fixed_percpu_data>) at fs/namei.c:2876
#11 0xffffffff8136e73f in user_path_at (dfd=-100, name=0x7b9079d55ef8 "/tmp/fuse-test/foo", flags=1, path=0xffffc900003f7e88) at ./include/linux/namei.h:57
#12 path_setxattr (pathname=0x7b9079d55ef8 "/tmp/fuse-test/foo", name=0x215f25 "user.bar", value=0x215f2e, size=3, flags=0, lookup_flags=1) at fs/xattr.c:631
#13 0xffffffff8136d807 in __do_sys_setxattr (pathname=0xb54d318b60b4ad00 <error: Cannot access memory at address 0xb54d318b60b4ad00>, name=0xffffc900003f7b18 "\001", value=0x0 <fixed_percpu_data>, size=0,
flags=81) at fs/xattr.c:652
#14 __se_sys_setxattr (pathname=-5382591504945206016, name=-60473135367400, value=0, size=0, flags=<optimized out>) at fs/xattr.c:648
#15 __x64_sys_setxattr (regs=<optimized out>) at fs/xattr.c:648
#16 0xffffffff81b90089 in do_syscall_x64 (regs=0xffffc900003f7f58, nr=1622453504) at arch/x86/entry/common.c:50
#17 do_syscall_64 (regs=0xffffc900003f7f58, nr=1622453504) at arch/x86/entry/common.c:80
#18 0xffffffff81c0009b in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120
In this backtrace a lot of stuff is going on and I’ll come back to it again. However we can see at the bottom we enter the kernel due to the setxattr system call. Then there are a bunch of functions related to a path lookup, which results in a permissions check, which then results in getattr (note there is no x).
The actual request is GETATTR, which can be seen by printing the args->opcode struct member passed to fuse_simple_request.
pwndbg> p args->opcode
$3 = 3
We know that FUSE_GETATTR = 3 from enum fuse_opcode in uapi/linux/fuse.h. This is part of the Linux headers. One of the first things I did was to translate it into Zig using zig translate-c. I then just copied and pasted useful bits of this into the final program.
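As an illustration of what such a pasted fragment might look like, here is a minimal sketch of an opcode enum covering only the operations this article touches. The numeric values are those of enum fuse_opcode as far as can be told from the header; the Opcode type and the opcodeName helper are hypothetical and not the code from the final program.

```zig
const std = @import("std");

/// A handful of values from enum fuse_opcode. The trailing `_` makes the
/// enum non-exhaustive, so an opcode we have not listed does not trap.
const Opcode = enum(u32) {
    LOOKUP = 1,
    GETATTR = 3,
    SETATTR = 4,
    SETXATTR = 21,
    INIT = 26,
    _,
};

/// Turn the raw opcode from a request header into a printable name.
fn opcodeName(raw: u32) []const u8 {
    const op: Opcode = @enumFromInt(raw);
    return switch (op) {
        .LOOKUP => "LOOKUP",
        .GETATTR => "GETATTR",
        .SETATTR => "SETATTR",
        .SETXATTR => "SETXATTR",
        .INIT => "INIT",
        _ => "unhandled",
    };
}

pub fn main() void {
    // 3 is the value GDB printed for args->opcode above.
    std.debug.print("{s}\n", .{opcodeName(3)});
}
```

Keeping the enum non-exhaustive is a design choice for a daemon that only handles a few operations: everything else can fall through to one generic error path.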
TLV
The FUSE protocol is one that can be described as roughly Tag-length-value. Every message contains some minimal information including the message length, opcode (tag) and some common info like the user ID.
The opcode decides what other data is transmitted after the standard header. Compared to HTTP/2 it’s wonderfully simple. Although this would be expected from a purely local protocol (maybe not USB).
The headers sent from kernel to userspace daemon and userspace daemon to kernel are different. In Zig notation they are
const InHeader = extern struct {
    len: u32,
    opcode: u32,
    unique: u64,
    nodeid: u64,
    uid: l.uid_t,
    gid: l.gid_t,
    pid: l.pid_t,
    padding: u32,
};

const OutHeader = extern struct {
    len: u32,
    err: i32,
    unique: u64,
};
“In” is what comes into the daemon and “Out” is sent back to the kernel. This naming seems to be used consistently between the kernel and userspace headers.
The len includes the length of the header and the following operation arguments. The unique field identifies the request so that responses can be matched to requests out-of-order.
nodeid appears to be the inode number.
I’d describe an inode as a thing which exists in a file system. These things have numbers and we’re supposed to keep track of them, but instead I choose just to give them amusing (to me) numbers like 0xf00.
The purpose of nodeid of course depends on the opcode. It seems that at least the init operation does not need it, but probably most other operations do.
We also have the user credentials and the process ID. I assume these are usually of the process that triggered a file system request.
Then there is padding which I guess is there to ensure the struct is aligned to 8 bytes (64bit). Otherwise padding may get inserted elsewhere if the struct is embedded in another struct with 8 byte fields or if it is put into an array.
Because the kernel and userland may use different compilers, the padding in structs needs to be made explicit. In our case we are not even using the same language.
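The layout assumptions can be checked at compile time. The following is a minimal sketch: the two headers are restated with plain u32 fields in place of the l.uid_t style aliases so the block stands alone, and Unpadded is a hypothetical struct added only to show the compiler inserting padding by itself on a 64-bit target such as x86_64.

```zig
const std = @import("std");

const InHeader = extern struct {
    len: u32,
    opcode: u32,
    unique: u64,
    nodeid: u64,
    uid: u32,
    gid: u32,
    pid: u32,
    padding: u32,
};

const OutHeader = extern struct {
    len: u32,
    err: i32,
    unique: u64,
};

// Hypothetical example: 12 bytes of fields, but the u64 forces 8 byte
// alignment, so 4 bytes of implicit tail padding appear (on x86_64).
const Unpadded = extern struct {
    a: u64,
    b: u32,
};

comptime {
    // A failed assert here is a compile error, not a runtime panic.
    std.debug.assert(@sizeOf(InHeader) == 40);
    std.debug.assert(@sizeOf(OutHeader) == 16);
    std.debug.assert(@offsetOf(InHeader, "unique") == 8);
    std.debug.assert(@offsetOf(InHeader, "padding") == 36);
    std.debug.assert(@sizeOf(Unpadded) == 16);
}

pub fn main() void {
    std.debug.print("in={d} out={d}\n", .{ @sizeOf(InHeader), @sizeOf(OutHeader) });
}
```

If a field were given the wrong width, the size or offset checks would stop the build instead of letting the daemon misread requests.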
The out header has an err field which either has a negative error code or 0. For a lot of opcodes we can set the error to ENOSYS and the kernel won’t try making another request with the same opcode. We also don’t have to send a valid message body.
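An error reply of this kind can be sketched as a header with no body. ENOSYS is hard-coded to 38, its value on Linux, to keep the block free of OS-specific imports; errorReply is a hypothetical helper and OutHeader is restated so the block stands alone.

```zig
const std = @import("std");

const OutHeader = extern struct {
    len: u32,
    err: i32,
    unique: u64,
};

// Assumption: the Linux errno value for ENOSYS.
const ENOSYS: i32 = 38;

/// Build a header-only reply that reports `errno` for request `unique`.
fn errorReply(unique: u64, errno: i32) OutHeader {
    return .{
        // No body follows, so the length is just the header itself.
        .len = @sizeOf(OutHeader),
        // The protocol carries errors as negative numbers.
        .err = -errno,
        .unique = unique,
    };
}

pub fn main() void {
    const reply = errorReply(2, ENOSYS);
    // These are the bytes that would be written back to /dev/fuse.
    const bytes = std.mem.asBytes(&reply);
    std.debug.print("len={d} err={d} bytes={d}\n", .{ reply.len, reply.err, bytes.len });
}
```

This prints len=16 err=-38 bytes=16.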
libfuse implements a lot of basic features by default so that you can implement a small file system and have it work with standard tools like ls. However this isn’t necessary for the kernel which just passes on errors to userspace or just gives up trying to do some operation.
INIT
The first thing we actually receive when reading from /dev/fuse is an init message that negotiates which protocol to use.
const InitIn = extern struct {
    major: u32,
    minor: u32,
    max_readahead: u32,
    flags: u32,
    flags2: u32,
    unused: [11]u32,
};

const InitOut = extern struct {
    major: u32,
    minor: u32,
    max_readahead: u32,
    flags: u32,
    max_background: u16,
    congestion_threshold: u16,
    max_write: u32,
    time_gran: u32,
    max_pages: u16,
    map_alignment: u16,
    flags2: u32,
    unused: [7]u32 = .{0} ** 7,
};
For the most part I just reflect what is in the InitIn message to the InitOut or pick some minimal value the kernel will accept. In reality my FS does not support even a small fraction of what’s in flags and flags2.
To get InitIn we just read the bytes from the file descriptor and cast them to the above structs.
From init:
    var buf: []u8 = &self.read_buf;
    const len = try os.read(fd, buf);
    assert(len >= @sizeOf(InHeader) + @sizeOf(InitIn));

    const hdr = mem.bytesAsValue(InHeader, buf[0..@sizeOf(InHeader)]);
    std.debug.print("kernel: hdr: {}\n", .{hdr.*});

    const opcode: Opcode = @enumFromInt(hdr.opcode);
    assert(opcode == .INIT);
    assert(hdr.len == @sizeOf(InHeader) + @sizeOf(InitIn));
    self.read_len = len - hdr.len;

    const req = mem.bytesAsValue(InitIn, (buf[@sizeOf(InHeader)..][0..@sizeOf(InitIn)]));
    std.debug.print("kernel: init: {}\n", .{req.*});

    assert(req.major == 7);
    assert(req.minor == 37);
The Zig standard library has mem.bytesAsValue which just casts the buffer to a specified type. Apart from some safety checks (assuming they are enabled), it doesn’t appear to do anything at runtime.
fn BytesAsValueReturnType(comptime T: type, comptime B: type) type {
    const size = @as(usize, @sizeOf(T));
    if (comptime !trait.is(.Pointer)(B) or
        (meta.Child(B) != [size]u8 and meta.Child(B) != [size:0]u8))
    {
        @compileError(std.fmt.comptimePrint("expected *[{}]u8, passed " ++ @typeName(B), .{size}));
    }
    return CopyPtrAttrs(B, .One, T);
}

/// Given a pointer to an array of bytes, returns a pointer to a value of the specified type
/// backed by those bytes, preserving pointer attributes.
pub fn bytesAsValue(comptime T: type, bytes: anytype) BytesAsValueReturnType(T, @TypeOf(bytes)) {
    return @as(BytesAsValueReturnType(T, @TypeOf(bytes)), @ptrCast(bytes));
}
My understanding is that BytesAsValueReturnType inspects the argument types at compile time. If argument B (the type of bytes) is a pointer to an array of u8 of the right size then it copies some of the “pointer attributes” of B to a new pointer type.
Pointer attributes here include things such as const, volatile, address_space and alignment. The size and underlying type are not copied though; they come from what we are trying to cast to (T).
Apart from the fact that this is implemented in the Zig library and not the compiler, it’s interesting because the alignment ends up being 1 (the alignment of u8), which is copied from the slice. To my knowledge an alignment of 8 (u64) should be ideal and is what the protocol structures are aligned to.
I’m not sure how to get the alignment to be 8. There’s probably some assertion or cast that can be done. It’s not important for getting the code to work, but it’s interesting from a performance point of view.
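One possible approach, offered as a sketch rather than a verified answer to the question above: give the backing array an explicit align(8) and accept an aligned slice, so the resulting pointer carries the alignment of the header and no @alignCast is needed. InHeader is restated so the block stands alone, headerAt is a hypothetical helper, and the len value written into the buffer assumes a little-endian machine.

```zig
const std = @import("std");

const InHeader = extern struct {
    len: u32,
    opcode: u32,
    unique: u64,
    nodeid: u64,
    uid: u32,
    gid: u32,
    pid: u32,
    padding: u32,
};

/// View the start of an 8 byte aligned buffer as a request header.
/// The alignment is part of the parameter type, so the compiler checks it.
fn headerAt(bytes: []align(8) const u8) *const InHeader {
    std.debug.assert(bytes.len >= @sizeOf(InHeader));
    return @ptrCast(bytes.ptr);
}

pub fn main() void {
    // Ask for the alignment on the storage itself.
    var buf: [64]u8 align(8) = [_]u8{0} ** 64;
    buf[0] = 104; // low byte of `len`, assuming little-endian

    const hdr = headerAt(&buf);
    std.debug.print("align={d} len={d}\n", .{ @alignOf(InHeader), hdr.len });
}
```

Passing a plain []u8 to headerAt would be rejected at compile time, which is where an @alignCast (with its runtime safety check) would otherwise come in.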
After casting into the InitIn struct we can print it using the Zig standard library. The result is quite ugly, but requires minimal effort and works for debugging.
kernel: hdr: fuse.InHeader{ .len = 104, .opcode = 26, .unique = 2, .nodeid = 0, .uid = 0, .gid = 0, .pid = 0, .padding = 0 }
kernel: init: fuse.InitIn{ .major = 7, .minor = 37, .max_readahead = 131072, .flags = 1946157051, .flags2 = 1, .unused = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 }
Once we have checked some assumptions about what we should receive and have printed it, we write a response.
    const res = InitOutMsg{
        .head = .{
            .len = @sizeOf(InitOutMsg),
            .err = 0,
            .unique = hdr.unique,
        },
        .body = .{
            .major = 7,
            .minor = 37,
            .max_readahead = req.max_readahead,
            .flags = req.flags,
            .flags2 = req.flags2,
            .max_background = 0,
            .congestion_threshold = 0,
            .max_write = 4096,
            .time_gran = 0,
            .max_pages = 1,
            .map_alignment = 1,
        },
    };

    std.debug.print("fuse: init: {}\n", .{res});
    assert(try os.write(fd, mem.asBytes(&res)) == @sizeOf(@TypeOf(res)));

    mem.copyForwards(u8, buf, buf[@sizeOf(InHeader) + @sizeOf(InitIn) ..]);
This casts our response to a u8 slice, writes it to /dev/fuse and shifts the read buffer to the left. Probably the read buffer only has the init message in it, but we just treat it the same as other messages in this regard.
When we read we assume that multiple messages will be queued and that we could read a chunk of the next message(s).
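Under that assumption, walking a buffer of queued messages only needs the len field at the start of each header. The sketch below is hypothetical: firstMessage is an illustrative helper, the sample bytes are made up, lengths are read in native byte order (little-endian in the example), and whether the kernel ever really returns more than one request per read is exactly the assumption stated above, not something the sketch confirms.

```zig
const std = @import("std");

/// Return the first complete message in `buf`, or null if more bytes are
/// needed. Every message is assumed to start with a u32 total length.
fn firstMessage(buf: []const u8) ?[]const u8 {
    if (buf.len < 4) return null;
    const len = std.mem.bytesToValue(u32, buf[0..4]);
    // A length smaller than its own field, or past the data we hold,
    // means we cannot hand out a message yet.
    if (len < 4 or len > buf.len) return null;
    return buf[0..len];
}

pub fn main() void {
    // Two made-up messages back to back: 8 bytes, then 6 bytes.
    const stream = [_]u8{ 8, 0, 0, 0, 'a', 'b', 'c', 'd', 6, 0, 0, 0, 'x', 'y' };
    var rest: []const u8 = &stream;
    while (firstMessage(rest)) |msg| {
        std.debug.print("message of {d} bytes\n", .{msg.len});
        // Equivalent to shifting the read buffer to the left.
        rest = rest[msg.len..];
    }
}
```

Advancing a slice instead of copying bytes forward is one alternative to the copyForwards call; which is preferable depends on how the read buffer is refilled.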
GETATTR
The first message I expected to receive was LOOKUP, because we call setxattr on the file path /tmp/fuse-test/foo. The foo part is inside our file system’s mount and it needs to be resolved to an inode number. It’s the job of the file system to take components of a path and resolve them to a number.
Most internal operations in the kernel work on inodes or some similar thing, not paths.
According to the kernel our file system already has an inode in it with nodeid = 1. Maybe this is the mount point itself or some other file system thing. Whatever it is, the kernel calls inode_permission, then fuse_permission and eventually fuse_do_getattr. So we can guess it wants the inode’s mode to see if it can be accessed.
switch (opcode) {
    .GETATTR => {
        const getattr_in =
            mem.bytesAsValue(GetattrIn, msg[0..@sizeOf(GetattrIn)]);

        std.debug.print("kernel: getattr: {}\n", .{getattr_in});

        // Clamp negative timestamps to zero so the cast to u64 cannot fail.
        const time: u64 = @intCast(@max(0, std.time.timestamp()));

        res.out.attr = .{
            .valid = time + 300,
            .valid_nsec = 0,
            .dummy = 0,
            .attr = .{
                .ino = getattr_in.nodeid,
                .blocks = 1,
                .size = 42,
                .atime = time,
                .mtime = time,
                .ctime = time,
                .atimensec = 0,
                .mtimensec = 0,
                .ctimensec = 0,
                .mode = l.S.IFDIR | 0o666,
                .nlink = 1,
                .uid = l.getuid(),
                .gid = l.getgid(),
                .rdev = 0,
                .blksize = 0,
                .flags = 0,
            },
        };
        res.hdr.len += @sizeOf(AttrOut);
    },
I’ve skipped reading the headers and setting up the response. Above you can just see the response we send.
From reading the kernel code I deduced that valid is the time when the attributes cease to be valid. If that is not the case then it still accepted this value. I suppose in the worst case it’ll remain valid for a few decades.
Next we have things like blocks, size, blksize, rdev and various time stamps. These certainly don’t have any effect on what the kernel is currently trying to do, but it’s probably best to try picking values in a sensible range.
The nlink does seem to be important. Setting it to zero would seem to indicate the item has been deleted. Although whether that would affect the current operation, I don’t know.
The important field is mode; this lets us set the file type and permissions. I decided that whatever the thing is that is being accessed it should be a directory and all users should have read-write permissions. The kernel was happy with this.
LOOKUP
Next the message I was expecting arrived.
    .LOOKUP => blk: {
        const Static = struct {
            var generation: u64 = 0;
        };
        const lookup_in: []const u8 = msg[0..msg_len];

        std.debug.print("kernel: lookup: {s}\n", .{lookup_in});

        if (!mem.eql(u8, "foo", lookup_in[0..3])) {
            res.hdr.err = -@as(i32, @intFromEnum(E.NOENT));
            break :blk;
        }

        // Clamp negative timestamps to zero so the cast to u64 cannot fail.
        const time: u64 = @intCast(@max(0, std.time.timestamp()));
        Static.generation += 1;

        res.out.entry = .{
            .nodeid = 0xf00,
            .generation = Static.generation,
            .entry_valid = time + 300,
            .entry_valid_nsec = 0,
            .attr_valid = time + 300,
            .attr_valid_nsec = 0,
            .attr = .{
                .ino = 0xf00,
                .blocks = 1,
                .size = 420,
                .atime = time,
                .mtime = time,
                .ctime = time,
                .atimensec = 0,
                .mtimensec = 0,
                .ctimensec = 0,
                .mode = l.S.IFREG | 0o666,
                .nlink = 1,
                .uid = l.getuid(),
                .gid = l.getgid(),
                .rdev = 0,
                .blksize = 0,
                .flags = 0,
            },
        };
        res.hdr.len += @sizeOf(EntryOut);
    },
The Static struct containing generation is how static variables are declared in Zig. We use this to set the inode generation. For now it is unlikely we’ll get more than one lookup request, but if we do we’ll increase the generation to ensure it’s unique for this session.
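In isolation the static variable pattern looks like the following sketch. The nextGeneration function is a hypothetical wrapper written for illustration; the point is that the var declared inside the local struct lives for the whole program, not for one call.

```zig
const std = @import("std");

/// Return a new generation number on every call.
fn nextGeneration() u64 {
    // A container-level `var` inside a local struct has static storage,
    // so its value survives between calls to this function.
    const Static = struct {
        var generation: u64 = 0;
    };
    Static.generation += 1;
    return Static.generation;
}

pub fn main() void {
    const first = nextGeneration();
    const second = nextGeneration();
    std.debug.print("{d} then {d}\n", .{ first, second });
}
```

Note that nothing here is thread safe; a daemon serving requests from several threads would need an atomic or a lock around the counter.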
Another interesting Zig feature is the blk: {...} label. Here blk is an arbitrary name, but I went with convention. In Zig you can add a label to a block (i.e. {...}), then use the break keyword to return from that block, optionally with a value, although we don’t use that here. This eliminates one use of goto or very small functions.
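For comparison, here is a minimal sketch of a labelled block that does break with a value. The lookupError function is hypothetical and the error number 2 stands for ENOENT on Linux.

```zig
const std = @import("std");

/// Return 0 if `name` is the one file we know about, otherwise -ENOENT.
fn lookupError(name: []const u8) i32 {
    // The whole block is an expression; `break :blk value` picks its result.
    const err: i32 = blk: {
        if (!std.mem.eql(u8, name, "foo")) break :blk -2; // -ENOENT
        break :blk 0;
    };
    return err;
}

pub fn main() void {
    std.debug.print("{d} {d}\n", .{ lookupError("foo"), lookupError("bar") });
}
```

This prints 0 -2.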
We check that the lookup path is “foo” and return from the block if it is not. This isn’t strictly necessary, it’s just checking my assumptions about what is happening.
Then the result again has an attribute embedded within it (attr). So the response to LOOKUP is a superset of GETATTR.
The big difference is that we get to set the inode number and how long the mapping lasts. The mapping being the path to inode number relationship.
SETXATTR
Finally we get to the operation that matters to us. This allows us to implement the setxattr system call for our file system. This system call is used to set extended attributes.
I imagine many readers have never heard of extended attributes. On file systems that support them, they let you set name value pairs on files.
The name is a null terminated string and the value is a binary chunk. This allows files to be tagged with whatever meta-data you like.
For instance the QEMU 9p remote filesystem implementation uses this to map user permissions within a VM to user permissions on the host. I’m not entirely sure what that entails, but I use 9p to share folders with VMs with this mapping feature enabled.
You can imagine that if you want to have multiple sets of permissions on a file for different systems, then you can use extended attributes to store those permissions.
Extended attribute names start with a namespace. We’re interested in the “user” namespace which any user can write to. There are a number of other namespaces, such as the security namespace which SELinux uses.
If you try writing to these other namespaces, you will probably get EPERM, or some other error if you do have permission but the attribute is in the wrong format.
First of all let’s look at how we use the setxattr system call. Zig has no wrapper for setxattr at the time of writing (I should add it), so we make the system call directly.
const XATTR_NAME: [:0]const u8 = "user.bar";
...
fn setxattr(path: [*:0]const u8, name: [*:0]const u8, value: []const u8, size: usize, flags: usize) usize {
    return l.syscall5(.setxattr, @intFromPtr(path), @intFromPtr(name), @intFromPtr(value.ptr), size, flags);
}

fn setXAttr(env: *const TestEnv) void {
    var buf: [os.PATH_MAX]u8 = .{0} ** os.PATH_MAX;
    const path = std.fmt.bufPrint(buf[0 .. buf.len - 1], "{s}/{s}", .{ env.mnt_path, "foo" }) catch |err| {
        std.debug.print("bufPrint: {}", .{err});
        return;
    };

    const res = setxattr(@ptrCast(path), XATTR_NAME, "baz", 3, 0);
    const err = os.errno(res);

    std.debug.print("setxattr: {s}: {}\n", .{ path, err });
}
We could set any binary data, but for now I just set “user.bar” to “baz”.
The code to handle this is pretty simple as we don’t bother to actually store anything.
const SetxattrIn = extern struct {
    size: u32,
    flags: u32,
    setxattr_flags: u32,
    padding: u32 = 0,
};
...
    .SETXATTR => {
        const xattr_in = mem.bytesToValue(SetxattrIn, msg[0..@sizeOf(SetxattrIn)]);
        const tail = msg[@sizeOf(SetxattrIn)..];

        std.debug.print("kernel: setxattr: {}\n", .{xattr_in});

        const name_len = msg_len - @sizeOf(SetxattrIn) - xattr_in.size;
        const name = tail[0 .. name_len - 1 :0];
        const value = tail[name_len..];

        std.debug.print("kernel: setxattr: [{}]{s} => {s}\n", .{ name.len, name, value });

        assert(mem.eql(u8, XATTR_NAME, name));
        assert(mem.eql(u8, "baz", value));
    },
The SETXATTR name-value pair is not part of the SetxattrIn structure. It is transmitted immediately after it. The size field refers to the value size. This seems rather arbitrary and a bit redundant because the name is null-terminated and we already have the overall message length.
This worries me because possibly it means that padding can be inserted somewhere which would warrant sending the value size. Or something else I have not thought of. On the other hand it does allow separating the name from the value without scanning for the null byte.
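The split can be sketched without scanning for the null byte, using only the value size and the length of what follows the fixed structure. The splitXattr helper and the Xattr type are hypothetical, and the sketch assumes there is no padding between the name and the value, which is the very thing the paragraph above is unsure about.

```zig
const std = @import("std");

const Xattr = struct {
    name: []const u8,
    value: []const u8,
};

/// Split the bytes that follow SetxattrIn into name and value.
/// `value_size` is the `size` field of SetxattrIn. Returns null if the
/// lengths do not add up or the name is not NUL terminated.
fn splitXattr(tail: []const u8, value_size: u32) ?Xattr {
    if (value_size > tail.len) return null;
    // Whatever is not value must be the name plus its terminator.
    const name_len = tail.len - value_size;
    if (name_len == 0 or tail[name_len - 1] != 0) return null;
    return Xattr{
        .name = tail[0 .. name_len - 1],
        .value = tail[name_len..],
    };
}

pub fn main() void {
    const tail = "user.bar\x00baz";
    if (splitXattr(tail, 3)) |x| {
        std.debug.print("[{d}]{s} => {s}\n", .{ x.name.len, x.name, x.value });
    }
}
```

This prints [8]user.bar => baz. Returning null rather than asserting keeps a malformed message from taking the daemon down.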
Anyway, all we do is assert that we got the expected values.
SETATTR
I thought that would be the end of it, but even after handling SETXATTR the kernel refused to return from setxattr (the system call). I also foolishly did not check if any further messages had been received in my test.
Initially debugging did not reveal what was going on because I made some assumptions about what was happening and did not do a careful analysis of each call to fuse_simple_request.
However looking back on the kernel code which handles setxattr, there is an obvious culprit.
int fuse_setxattr(struct inode *inode, const char *name, const void *value,
                  size_t size, int flags, unsigned int extra_flags)
{
    struct fuse_mount *fm = get_fuse_mount(inode);
    FUSE_ARGS(args);
    struct fuse_setxattr_in inarg;
    int err;

    if (fm->fc->no_setxattr)
        return -EOPNOTSUPP;

    memset(&inarg, 0, sizeof(inarg));
    inarg.size = size;
    inarg.flags = flags;
    inarg.setxattr_flags = extra_flags;

    args.opcode = FUSE_SETXATTR;
    args.nodeid = get_node_id(inode);
    args.in_numargs = 3;
    args.in_args[0].size = fm->fc->setxattr_ext ?
        sizeof(inarg) : FUSE_COMPAT_SETXATTR_IN_SIZE;
    args.in_args[0].value = &inarg;
    args.in_args[1].size = strlen(name) + 1;
    args.in_args[1].value = name;
    args.in_args[2].size = size;
    args.in_args[2].value = value;
    err = fuse_simple_request(fm, &args); /* <-- The request is sent */
    if (err == -ENOSYS) {                 /* <-- I look in debugger and see err == 0 */
        fm->fc->no_setxattr = 1;
        err = -EOPNOTSUPP;
    }
    if (!err)
        fuse_update_ctime(inode);         /* <-- I skip over this function in the debugger */

    return err;
}
The function fuse_update_ctime results in SETATTR being sent to the FUSE daemon. It takes a rather roundabout route to achieve this (going through fs-writeback.c), but it tells our daemon the inode’s timestamps need updating.
Our code which handles SETATTR looks a lot like GETATTR. The main difference is we get some attribute info from the kernel. This tells us what attributes are being set and to which values. However we get a chance to modify this and send it back.
const SetattrIn = extern struct {
    valid: u32,
    padding: u32,
    fh: u64,
    size: u64,
    lock_owner: u64,
    atime: u64,
    mtime: u64,
    ctime: u64,
    atimensec: u32,
    mtimensec: u32,
    ctimensec: u32,
    mode: u32,
    unused4: u32,
    uid: u32,
    gid: u32,
    unused5: u32,
};
...
    .SETATTR => {
        const setattr_in =
            mem.bytesAsValue(SetattrIn, msg[0..@sizeOf(SetattrIn)]);

        std.debug.print("kernel: setattr: {}\n", .{setattr_in});

        // Clamp negative timestamps to zero so the cast to u64 cannot fail.
        const time: u64 = @intCast(@max(0, std.time.timestamp()));

        res.out.attr = .{
            .valid = time + 300,
            .valid_nsec = 0,
            .dummy = 0,
            .attr = .{
                .ino = hdr.nodeid,
                .blocks = 1,
                .size = 42,
                .atime = time,
                .mtime = time,
                .ctime = time,
                .atimensec = 0,
                .mtimensec = 0,
                .ctimensec = 0,
                .mode = l.S.IFDIR | 0o666,
                .nlink = 1,
                .uid = l.getuid(),
                .gid = l.getgid(),
                .rdev = 0,
                .blksize = 0,
                .flags = 0,
            },
        };

        const v = setattr_in.valid;
        if (v & ~(FATTR.ATIME | FATTR.MTIME | FATTR.CTIME) > 0) {
            std.debug.print("setattr: setting attributes not supported\n", .{});
            res.hdr.err = -@as(i32, @intFromEnum(E.OPNOTSUPP));
        } else {
            res.hdr.len += @sizeOf(AttrOut);
        }
    },
The valid field in SetattrIn we receive from the kernel says which attributes are being set. It is a bit field, where each bit represents an attribute.
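The bit test can be sketched on its own. The values below are the FATTR_* bit positions as they appear, to the best of my reading, in uapi/linux/fuse.h; only a few are listed, and both the FATTR namespace and the onlyTimestamps helper are hypothetical stand-ins rather than the definitions used in the program.

```zig
const std = @import("std");

// Assumption: bit positions taken from the FATTR_* defines in the header.
const FATTR = struct {
    const MODE: u32 = 1 << 0;
    const SIZE: u32 = 1 << 3;
    const ATIME: u32 = 1 << 4;
    const MTIME: u32 = 1 << 5;
    const CTIME: u32 = 1 << 10;
};

/// True if `valid` asks for nothing beyond timestamp updates.
fn onlyTimestamps(valid: u32) bool {
    const allowed = FATTR.ATIME | FATTR.MTIME | FATTR.CTIME;
    // Mask off the allowed bits; anything left is a request we reject.
    return (valid & ~allowed) == 0;
}

pub fn main() void {
    std.debug.print("{} {}\n", .{
        onlyTimestamps(FATTR.CTIME),
        onlyTimestamps(FATTR.CTIME | FATTR.SIZE),
    });
}
```

This prints true false: a ctime update alone is accepted, while one that also tries to change the size is not.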
We only expect some time attributes to be valid in the request. It’s not clear if the valid bitfield also applies to the response. You can see in the code I accidentally set mode to a directory. I’m not sure if the kernel ignores this or it is simply happy that after updating the timestamps the inode transformed from a file into a directory.
Closing remarks
Part of me thinks that I really need to read up on what an inode is. Or some other FS thing like a super block. However this would give a normative understanding of these things. At best it is a peg to hang future knowledge on and at worst it delays achieving real world understanding that comes from exposure.
Often labels are mistaken for abstractions. The same labels are given to some group of objects or interfaces or illogical concepts. When you start digging into it you realise that the code only behaves according to the label’s description in some specific circumstance or not at all.
I hope you enjoyed that.