Zig & /dev/fuse: A weird file system

I previously wrote about using libfuse and Zig to create a minimal file system. You can see from that article that I had some issues compiling libfuse and also speculated that it would be better to use the raw interface directly.

Soon after finishing that article I decided to try exploiting a bug (an n-day UAF). I have spent a lot of time reproducing bugs for the Linux Test Project, but never exploiting one.

Presently I’m stuck on finding a heap spray technique that I can get to work with the bug given the limitations of Google’s kCTF targets. I’ll leave the details for another article if or when I get it to work. For now let’s just say that I want to implement setxattr and/or some other messages. To see why, take a look at chompie’s article.

To get a deeper look at how FUSE works I decided to use the raw interface. It seems like there is a lot of interest in FUSE and I keep finding use cases for it (in addition to nefarious activities). Most recently I found MemProcFS, which lets you mount your RAM as a file system. I’ve also been using sshfs quite extensively with my headless workstation.

This article is as much a narrative into my investigation as it is information about FUSE and Zig. As such it will contain false understanding and speculation.

I created a video which mainly shows the debugging process.

Raw FUSE

The raw interface for FUSE is usually found at /dev/fuse. This is a special character file that your system creates using mknod with the device type 0xa:0xe5 (I’ll just refer to it as /dev/fuse).

The character device type with the major number 0xa and minor number 0xe5 is what allows us to create file systems in userspace. Usually this is made accessible at /dev/fuse, but in theory you could create this file at any path using mknod. This is true for all devices and device-like interfaces such as FUSE.
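To make the numbers concrete, here is a minimal sketch of how a major/minor pair is packed into a single device number. The makedev helper is my own illustration of the layout Linux exposes to userspace; it is not a function from the Zig standard library.

```zig
const std = @import("std");

/// Pack a major/minor pair into a Linux userspace device number.
/// Illustrative helper, not part of std.
fn makedev(major: u32, minor: u32) u64 {
    const maj: u64 = major;
    const min: u64 = minor;
    return (min & 0xff) |
        ((maj & 0xfff) << 8) |
        ((min & ~@as(u64, 0xff)) << 12) |
        ((maj & ~@as(u64, 0xfff)) << 32);
}

pub fn main() void {
    // Major 10 is the "misc" device class, minor 229 is reserved for FUSE.
    const major: u32 = 0xa;
    const minor: u32 = 0xe5;
    std.debug.print("dev = 0x{x} (major {d}, minor {d})\n", .{ makedev(major, minor), major, minor });
}
```

For small numbers like these the result is simply the major shifted left by 8 bits with the minor in the low byte, so 0xa:0xe5 becomes 0xae5.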

Whether this file is accessible depends on what permissions it is given. Most distros allow regular users to access it. If they don’t then a user that has CAP_MKNOD could still create the device node. Then again, a user that can create device nodes can probably do anything.

A character device can be opened like a regular file. Once we have a file handle to it then we can read and write to it. Note that each open and resulting file handle are independent. To create multiple file systems we just open /dev/fuse multiple times.

Exactly what happens when we read or write to the file handle depends on the device driver. In FUSE’s case we’re not really dealing with a device; it is purely an interface between software components.

So FUSE’s device driver is basically a big piece of glue code which translates file system requests into messages that are passed to our daemon reading /dev/fuse.

We read from /dev/fuse and when we get a message, we send back a response. The FUSE code then does all kinds of caching, sends out notifications, translates error codes, creates internal kernel objects and so on.

Opening /dev/fuse is not enough to use the file system; it also needs to be mounted. Usually a device is mounted by specifying its path (e.g. /dev/hda) or, if there is no device, we just specify the file system name (e.g. tmpfs). However, to my knowledge, a FUSE file system has no device or name associated with it.

So instead we pass the file descriptor we just opened to mount. This is done using a file system option called fd. The mount system call then looks at our process’s file descriptor table and checks that the file descriptor points to an instance of /dev/fuse.
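As a rough sketch, the option string handed to mount can be built like this. Besides fd, the kernel also expects rootmode (parsed as octal), user_id and group_id. The helper name fuseMountOptions and the example values (fd 3, ids 0) are assumptions for illustration; the sketch only formats the string and does not call mount.

```zig
const std = @import("std");

/// Format the data string passed as the last argument of mount(2).
/// Hypothetical helper; the option names are the ones the kernel parses.
fn fuseMountOptions(buf: []u8, fd: i32, root_mode: u32, uid: u32, gid: u32) ![]u8 {
    // rootmode is read as octal by the kernel, the other values as decimal.
    return std.fmt.bufPrint(
        buf,
        "fd={d},rootmode={o},user_id={d},group_id={d}",
        .{ fd, root_mode, uid, gid },
    );
}

pub fn main() !void {
    var buf: [128]u8 = undefined;
    const S_IFDIR: u32 = 0o040000; // the root of the mount is a directory
    const opts = try fuseMountOptions(&buf, 3, S_IFDIR, 0, 0);
    std.debug.print("{s}\n", .{opts});
}
```

With these example values the string is fd=3,rootmode=40000,user_id=0,group_id=0.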

Before looking into any of this I was contacted by José-Paul Dominguez who was using the low-level libfuse API. While debugging their application with strace they noticed that libfuse called mount and it failed. However something still got mounted.

I mentioned in my previous article that a regular user can use FUSE. Strictly speaking though this is not true, because a regular user can not call the mount system call. This confused me and I wondered if there is in fact some other magical way to mount a file system. At least a FUSE file system.

It turns out that what libfuse actually does is call a suid binary called fusermount. This has the permission bit set to run as root, so it can perform the mount. This didn’t turn up in strace because it was missing the -f follow switch, so child processes (i.e. fusermount) were not traced. Also I doubt that suid binaries can be traced unless strace is run as root.

For some reason fusermount does not simply take an argument specifying the open fd for /dev/fuse. Usually unless FD_CLOEXEC is set, the fd should just stay open after the fork and exec to execute fusermount.

Instead it opens a UNIX socket and uses a control message to transfer the fd from the parent process to the child. Passing file descriptors between running processes is not something I have seen very often. OpenQA does it to pass an FD to QEMU, but let’s not get into that.

I decided not to do that. Instead I run my code in an unprivileged user and mount namespace. This is not a good general solution, but it works for this demo. User namespaces are how it’s possible to have a container with a root user in. It’s not the real root user, but it lets you mount some file systems, including FUSE filesystems.

Being able to create a new user namespace from an unprivileged user is obviously a massive security challenge. It simply makes a lot more system calls available to an attacker.

It’s useful though for avoiding your container runtime requiring root while still being able to run containers with a root user in. So a lot of distributions allow it by default.
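The usual sequence for becoming "root" inside a fresh user and mount namespace can be sketched as follows. This is only an outline that prints the steps instead of performing them; the flag values come from linux/sched.h and the outside user ID of 1000 is a made-up example.

```zig
const std = @import("std");

// Flag values from <linux/sched.h>.
const CLONE_NEWNS: usize = 0x00020000;
const CLONE_NEWUSER: usize = 0x10000000;

/// Format a single-entry ID map: "<id inside> <id outside> <count>".
/// Hypothetical helper for illustration.
fn idMap(buf: []u8, outside_id: u32) ![]u8 {
    return std.fmt.bufPrint(buf, "0 {d} 1\n", .{outside_id});
}

pub fn main() !void {
    var buf: [32]u8 = undefined;
    const flags = CLONE_NEWUSER | CLONE_NEWNS;

    std.debug.print("1. unshare(0x{x})\n", .{flags});
    std.debug.print("2. write \"deny\" to /proc/self/setgroups\n", .{});
    // 1000 stands in for the unprivileged user's real ID.
    const map = try idMap(&buf, 1000);
    std.debug.print("3. write to /proc/self/uid_map and gid_map: {s}", .{map});
}
```

After the maps are written, ID 0 inside the namespace corresponds to the unprivileged user outside it, which is what allows the mount call to succeed.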

Methodology

The raw FUSE API has a man page. However it is both incomplete and not up to date. libfuse or the Go equivalent can also be used as documentation. I didn’t spend much time looking at these though.

Instead I decided to (mostly) ignore the library and look at the kernel directly. I set up some breakpoints in GDB (actually pwndbg, together with the Linux kernel scripts) and started reading /dev/fuse to see what happened. I like bpftrace for debugging the kernel, but didn’t use it for this.

I have a script which starts QEMU with the -s switch to enable the GDB stub. Then once QEMU has started I run pwndbg from the kernel source directory. The output of pwndbg is highly verbose, so I have removed most of it.

$ pwndbg vmlinux
...
pwndbg> target remote localhost:1234
...
pwndbg> lx-symbols
loading vmlinux
scanning for modules in /home/rich/kernel/linux
loading @0xffffffffc0201000: /home/rich/kernel/linux/fs/fuse/fuse.ko

For lx-symbols to work I have a ~/.config/gdb/gdbinit like: add-auto-load-safe-path /home/rich/kernel/linux/scripts/gdb/vmlinux-gdb.py

There is a whole bunch of other setup to build the VM image I am using. You can see this at github.com/richiejp/m. At some point I’d like to fully automate recreating the environment and get cross compilation to work for a reasonable set of tools. Right now though you may find some bits are missing from the initrd. Also see the related article on cross compiling with Zig and LLVM.

Anyway, once we have all this set up, it’s possible to set a breakpoint at, say, fuse_simple_request in the kernel.

pwndbg> b fuse_simple_request
Breakpoint 1 at 0xffffffffc0201587: file fs/fuse/dev.c, line 485.

Then when we start processing requests we’ll get dropped into this function. This function is particularly useful because it gets called for most requests. If we inspect the backtrace then it is usually easy to see what the current operation is.

pwndbg> bt
#0  fuse_simple_request (fm=0xffff888100ac1860, args=0xffffc900003f7b18) at fs/fuse/dev.c:485
#1  0xffffffffc0207e67 in fuse_do_getattr (inode=0xffff888107e7c340, stat=0x0 <fixed_percpu_data>, file=0x0 <fixed_percpu_data>) at fs/fuse/dir.c:1119
#2  0xffffffffc02081e3 in fuse_perm_getattr (inode=0xffff888107e7c340, mask=1) at fs/fuse/dir.c:1306
#3  fuse_permission (mnt_userns=<optimized out>, inode=0xffff888107e7c340, mask=1) at fs/fuse/dir.c:1347
#4  0xffffffff81347749 in do_inode_permission (mnt_userns=0xffffffff828534b0 <init_user_ns>, inode=0xffff888107e7c340, mask=1) at fs/namei.c:458
#5  inode_permission (mnt_userns=0xffffffff828534b0 <init_user_ns>, inode=0xffff888107e7c340, mask=1) at fs/namei.c:525
#6  0xffffffff8134e8a5 in may_lookup (mnt_userns=0xffffffff828534b0 <init_user_ns>, nd=0xffffc900003f7d30) at fs/namei.c:1715
#7  link_path_walk (name=0xffff88810088b02f "foo", nd=0xffffc900003f7d30) at fs/namei.c:2262
#8  0xffffffff8134809a in path_lookupat (nd=0xffffc900003f7d30, flags=65, path=0xffffc900003f7e88) at fs/namei.c:2473
#9  0xffffffff81347f37 in filename_lookup (dfd=-100, name=0xffff88810088b000, flags=<optimized out>, path=0xffffc900003f7e88, root=0x0 <fixed_percpu_data>) at fs/namei.c:2503
#10 0xffffffff81348c16 in user_path_at_empty (dfd=-100, name=0x7b9079d55ef8 "/tmp/fuse-test/foo", flags=1, path=0xffffc900003f7e88, empty=0x0 <fixed_percpu_data>) at fs/namei.c:2876
#11 0xffffffff8136e73f in user_path_at (dfd=-100, name=0x7b9079d55ef8 "/tmp/fuse-test/foo", flags=1, path=0xffffc900003f7e88) at ./include/linux/namei.h:57
#12 path_setxattr (pathname=0x7b9079d55ef8 "/tmp/fuse-test/foo", name=0x215f25 "user.bar", value=0x215f2e, size=3, flags=0, lookup_flags=1) at fs/xattr.c:631
#13 0xffffffff8136d807 in __do_sys_setxattr (pathname=0xb54d318b60b4ad00 <error: Cannot access memory at address 0xb54d318b60b4ad00>, name=0xffffc900003f7b18 "\001", value=0x0 <fixed_percpu_data>, size=0,
    flags=81) at fs/xattr.c:652
#14 __se_sys_setxattr (pathname=-5382591504945206016, name=-60473135367400, value=0, size=0, flags=<optimized out>) at fs/xattr.c:648
#15 __x64_sys_setxattr (regs=<optimized out>) at fs/xattr.c:648
#16 0xffffffff81b90089 in do_syscall_x64 (regs=0xffffc900003f7f58, nr=1622453504) at arch/x86/entry/common.c:50
#17 do_syscall_64 (regs=0xffffc900003f7f58, nr=1622453504) at arch/x86/entry/common.c:80
#18 0xffffffff81c0009b in entry_SYSCALL_64 () at arch/x86/entry/entry_64.S:120

In this backtrace a lot of stuff is going on and I’ll come back to it again. However we can see at the bottom we enter the kernel due to the setxattr system call. Then there are a bunch of functions related to a path lookup, which results in a permissions check, which then results in getattr (note there is no x).

The actual request is GETATTR which can be seen by printing the args->opcode struct member passed to fuse_simple_request.

pwndbg> p args->opcode
$3 = 3

We know that FUSE_GETATTR = 3 from enum fuse_opcode in uapi/linux/fuse.h. This is part of the Linux headers. One of the first things I did was to translate it into Zig using zig translate-c. I then just copy and paste useful bits of this into the final program.
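A minimal sketch of how a handful of these opcodes might look once translated is shown below. Only the opcodes used in this article are listed, and the trailing `_` makes the enum non-exhaustive so that an opcode we have not listed does not trip a safety check in @enumFromInt. The opcodeName function is a hypothetical helper, not something from the real program.

```zig
const std = @import("std");

/// A few values from `enum fuse_opcode` in uapi/linux/fuse.h.
const Opcode = enum(u32) {
    LOOKUP = 1,
    GETATTR = 3,
    SETATTR = 4,
    SETXATTR = 21,
    INIT = 26,
    _, // any other opcode the kernel may send
};

/// Map the raw opcode from the header to a printable name.
fn opcodeName(raw: u32) []const u8 {
    const op: Opcode = @enumFromInt(raw);
    return switch (op) {
        .LOOKUP => "LOOKUP",
        .GETATTR => "GETATTR",
        .SETATTR => "SETATTR",
        .SETXATTR => "SETXATTR",
        .INIT => "INIT",
        _ => "unhandled",
    };
}

pub fn main() void {
    std.debug.print("3 => {s}, 26 => {s}, 99 => {s}\n", .{
        opcodeName(3), opcodeName(26), opcodeName(99),
    });
}
```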

TLV

The FUSE protocol is one that can be described as roughly Tag-length-value. Every message contains some minimal information including the message length, opcode (tag) and some common info like the user ID.

The opcode decides what other data is transmitted after the standard header. Compared to HTTP/2 it’s wonderfully simple. Although this would be expected from a purely local protocol (maybe not USB).

The headers sent from kernel to userspace daemon and userspace daemon to kernel are different. In Zig notation they are

const InHeader = extern struct {
    len: u32,
    opcode: u32,
    unique: u64,
    nodeid: u64,

    uid: l.uid_t,
    gid: l.gid_t,
    pid: l.pid_t,

    padding: u32,
};

const OutHeader = extern struct {
    len: u32,
    err: i32,
    unique: u64,
};

“In” is what comes into the daemon and “Out” is sent back to the kernel. This naming seems to be used consistently between the kernel and userspace headers.

The len includes the length of the header and the following operation arguments. The unique field identifies the request so that responses can be matched to requests out-of-order.

nodeid appears to be the inode number. I’d describe an inode as a thing which exists in a file system. These things have numbers and we’re supposed to keep track of them, but instead I choose just to give them amusing (to me) numbers like 0xf00.

The purpose of nodeid of course depends on the opcode. It seems that at least the init operation does not need it, but probably most other operations do.

We also have the user credentials and the process ID. I assume these are usually of the process that triggered a file system request.

Then there is padding which I guess is there to ensure the struct is aligned to 8 bytes (64bit). Otherwise padding may get inserted elsewhere if the struct is embedded in another struct with 8 byte fields or if it is put into an array.

Because the kernel and userland may use different compilers, the padding in structs needs to be made explicit. In our case we are not even using the same language.
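One way to guard against layout surprises is to check sizes and offsets at compile time. The sketch below repeats the two headers, assuming uid_t, gid_t and pid_t are 32 bits wide (as they are on x86_64 Linux), and asserts the layout I would expect from the C definitions.

```zig
const std = @import("std");

// uid_t, gid_t and pid_t are assumed to be 32 bits wide here.
const InHeader = extern struct {
    len: u32,
    opcode: u32,
    unique: u64,
    nodeid: u64,
    uid: u32,
    gid: u32,
    pid: i32,
    padding: u32,
};

const OutHeader = extern struct {
    len: u32,
    err: i32,
    unique: u64,
};

comptime {
    // These fail the build if the compiler lays the structs out differently.
    std.debug.assert(@sizeOf(InHeader) == 40);
    std.debug.assert(@offsetOf(InHeader, "unique") == 8);
    std.debug.assert(@offsetOf(InHeader, "padding") == 36);
    std.debug.assert(@sizeOf(OutHeader) == 16);
    std.debug.assert(@alignOf(InHeader) == 8);
}

pub fn main() void {
    std.debug.print("InHeader: {d} bytes, OutHeader: {d} bytes\n", .{
        @sizeOf(InHeader), @sizeOf(OutHeader),
    });
}
```

Without the explicit padding field the members would only add up to 36 bytes, and the compiler would have to round the struct up to 40 on its own to keep the u64 fields aligned.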

The out header has an err field which either has a negative error code or 0. For a lot of opcodes we can set the error to ENOSYS and the kernel won’t try making another request with the same opcode. We also don’t have to send a valid message body.
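A header-only error reply can be sketched like this. The errorReply helper and the request ID of 7 are made up for illustration; 38 is the Linux value of ENOSYS.

```zig
const std = @import("std");

const OutHeader = extern struct {
    len: u32,
    err: i32,
    unique: u64,
};

const ENOSYS: i32 = 38; // Linux errno value

/// Build a reply that consists of nothing but the header.
/// Hypothetical helper; `errno` is given as a positive number.
fn errorReply(unique: u64, errno: i32) OutHeader {
    return OutHeader{
        .len = @sizeOf(OutHeader), // no body follows the header
        .err = -errno, // the kernel expects a negative error code
        .unique = unique, // must echo the ID of the request
    };
}

pub fn main() void {
    const reply = errorReply(7, ENOSYS);
    const bytes = std.mem.asBytes(&reply);
    std.debug.print("len={d} err={d} unique={d} ({d} bytes to write)\n", .{
        reply.len, reply.err, reply.unique, bytes.len,
    });
}
```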

libfuse implements a lot of basic features by default so that you can implement a small file system and have it work with standard tools like ls. However this isn’t necessary for the kernel which just passes on errors to userspace or just gives up trying to do some operation.

INIT

The first thing we actually receive when reading from /dev/fuse is an init message that negotiates which protocol to use.

const InitIn = extern struct {
    major: u32,
    minor: u32,
    max_readahead: u32,
    flags: u32,
    flags2: u32,
    unused: [11]u32,
};

const InitOut = extern struct {
    major: u32,
    minor: u32,
    max_readahead: u32,
    flags: u32,

    max_background: u16,
    congestion_threshold: u16,
    max_write: u32,
    time_gran: u32,
    max_pages: u16,
    map_alignment: u16,
    flags2: u32,
    unused: [7]u32 = .{0} ** 7,
};

For the most part I just reflect what is in the InitIn message to the InitOut or pick some minimal value the kernel will accept. In reality my FS does not support even a small fraction of what’s in flags and flags2.

To get InitIn we just read the bytes from the file descriptor and cast them to the above structs.

From init:

        var buf: []u8 = &self.read_buf;
        const len = try os.read(fd, buf);

        assert(len >= @sizeOf(InHeader) + @sizeOf(InitIn));

        const hdr = mem.bytesAsValue(InHeader, buf[0..@sizeOf(InHeader)]);

        std.debug.print("kernel: hdr: {}\n", .{hdr.*});

        const opcode: Opcode = @enumFromInt(hdr.opcode);
        assert(opcode == .INIT);
        assert(hdr.len == @sizeOf(InHeader) + @sizeOf(InitIn));

        self.read_len = len - hdr.len;

        const req = mem.bytesAsValue(InitIn, (buf[@sizeOf(InHeader)..][0..@sizeOf(InitIn)]));

        std.debug.print("kernel: init: {}\n", .{req.*});

        assert(req.major == 7);
        assert(req.minor == 37);

The Zig standard library has mem.bytesAsValue which just casts the buffer to a specified type. Apart from some safety checks (assuming they are enabled), it doesn’t appear to do anything at runtime.

fn BytesAsValueReturnType(comptime T: type, comptime B: type) type {
    const size = @as(usize, @sizeOf(T));

    if (comptime !trait.is(.Pointer)(B) or
        (meta.Child(B) != [size]u8 and meta.Child(B) != [size:0]u8))
    {
        @compileError(std.fmt.comptimePrint("expected *[{}]u8, passed " ++ @typeName(B), .{size}));
    }

    return CopyPtrAttrs(B, .One, T);
}

/// Given a pointer to an array of bytes, returns a pointer to a value of the specified type
/// backed by those bytes, preserving pointer attributes.
pub fn bytesAsValue(comptime T: type, bytes: anytype) BytesAsValueReturnType(T, @TypeOf(bytes)) {
    return @as(BytesAsValueReturnType(T, @TypeOf(bytes)), @ptrCast(bytes));
}

My understanding is that BytesAsValueReturnType inspects the argument type at compile time. If B (the type of bytes) is a pointer to an array of u8 with the same size as T, then it copies some of the “pointer attributes” of B to a new pointer type. Otherwise compilation fails with an error.

Pointer attributes here includes things such as const, volatile, address_space and alignment. The size and underlying type are not copied though, they come from what we are trying to cast to (T).

Apart from the fact that this is implemented in the Zig library and not the compiler, it’s interesting because the alignment ends up being 1 (the alignment of u8), which is copied from the pointer we pass in. To my knowledge an alignment of 8 (u64) should be ideal and is what the protocol structures are aligned to.

I’m not sure how to get the alignment to be 8. There’s probably some assertion or cast that can be done. It’s not important for getting the code to work, but it’s interesting from a performance point of view.

Update! The alignment of the read buffer can be set as follows

read_buf: [MIN_READ_BUFFER * 2]u8 align(@alignOf(InHeader)) = undefined,

Then the beginning of the buffer is aligned so that we can directly cast it to InHeader.
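Here is a small, self-contained sketch of that idea. The 64 byte buffer stands in for the real read buffer, and the header values are sample data copied in by hand rather than read from /dev/fuse. The uid, gid and pid types are assumed to be 32 bits wide.

```zig
const std = @import("std");

const InHeader = extern struct {
    len: u32,
    opcode: u32,
    unique: u64,
    nodeid: u64,
    uid: u32,
    gid: u32,
    pid: i32,
    padding: u32,
};

pub fn main() void {
    // Stand-in for the daemon's read buffer, aligned for InHeader.
    var buf: [64]u8 align(@alignOf(InHeader)) = undefined;

    // Pretend the kernel wrote an INIT header into the buffer.
    const sample = InHeader{
        .len = 104,
        .opcode = 26,
        .unique = 2,
        .nodeid = 0,
        .uid = 0,
        .gid = 0,
        .pid = 0,
        .padding = 0,
    };
    @memcpy(buf[0..@sizeOf(InHeader)], std.mem.asBytes(&sample));

    // The buffer is already 8 byte aligned, so no @alignCast is needed.
    const hdr: *const InHeader = @ptrCast(&buf);
    std.debug.print("len={d} opcode={d} unique={d}\n", .{ hdr.len, hdr.opcode, hdr.unique });
}
```

Note that this only guarantees alignment for a message at the start of the buffer; a message at some other offset would need its own alignment check.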

After casting into the InitIn struct we can print it using the Zig standard library. The result is quite ugly, but requires minimal effort and works for debugging.

kernel: hdr: fuse.InHeader{ .len = 104, .opcode = 26, .unique = 2, .nodeid = 0, .uid = 0, .gid = 0, .pid = 0, .padding = 0 }
kernel: init: fuse.InitIn{ .major = 7, .minor = 37, .max_readahead = 131072, .flags = 1946157051, .flags2 = 1, .unused = { 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 } }

Once we have checked some assumptions about what we should receive and have printed it, we write a response.

        const res = InitOutMsg{
            .head = .{
                .len = @sizeOf(InitOutMsg),
                .err = 0,
                .unique = hdr.unique,
            },
            .body = .{
                .major = 7,
                .minor = 37,
                .max_readahead = req.max_readahead,
                .flags = req.flags,
                .flags2 = req.flags2,

                .max_background = 0,
                .congestion_threshold = 0,
                .max_write = 4096,
                .time_gran = 0,
                .max_pages = 1,
                .map_alignment = 1,
            },
        };

        std.debug.print("fuse: init: {}\n", .{res});
        assert(try os.write(fd, mem.asBytes(&res)) == @sizeOf(@TypeOf(res)));

        mem.copyForwards(u8, buf, buf[@sizeOf(InHeader) + @sizeOf(InitIn) ..]);

This casts our response to a u8 slice, writes it to /dev/fuse and shifts the read buffer to the left. Probably the read buffer only has the init message in it, but we just treat it the same as other messages in this regard.

When we read we assume that multiple messages will be queued and that we could read a chunk of the next message(s).

GETATTR

The first message I expected to receive was LOOKUP, because when we do setxattr on the file path /tmp/fuse-test/foo, the foo part is inside our file system’s mount and it needs to be resolved to an inode number. It’s the job of the file system to take components of a path and resolve them to a number.

Most internal operations in the kernel work on inodes or some similar thing, not paths.

According to the kernel our file system already has an inode in it with nodeid = 1. Maybe this is the mount point itself or some other file system thing. Whatever it is, the kernel calls inode_permission, then fuse_permission and eventually fuse_do_getattr. So we can guess it wants the inode’s mode to see if it can be accessed.

    switch (opcode) {
        .GETATTR => {
            const getattr_in =
                mem.bytesAsValue(GetattrIn, msg[0..@sizeOf(GetattrIn)]);

            std.debug.print("kernel: getattr: {}\n", .{getattr_in});

            const time: u64 = @intCast(@max(0, std.time.timestamp()));

            res.out.attr = .{
                .valid = time + 300,
                .valid_nsec = 0,
                .dummy = 0,
                .attr = .{
                    .ino = hdr.nodeid,
                    .blocks = 1,
                    .size = 42,
                    .atime = time,
                    .mtime = time,
                    .ctime = time,
                    .atimensec = 0,
                    .mtimensec = 0,
                    .ctimensec = 0,
                    .mode = l.S.IFDIR | 0o666,
                    .nlink = 1,
                    .uid = l.getuid(),
                    .gid = l.getgid(),
                    .rdev = 0,
                    .blksize = 0,
                    .flags = 0,
                },
            };

            res.hdr.len += @sizeOf(AttrOut);
        },

I’ve skipped reading the headers and setting up the response. Above you can just see the response we send.

From reading the kernel code I deduced that valid is the time when the attributes cease to be valid. If that is not the case then it still accepted this value. I suppose in the worst case it’ll remain valid for a few decades.

Update! The valid field is in fact the amount of time the attributes stay valid for, not the time at which they expire. This is seen in fs/fuse/dir.c:time_to_jiffies().
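To see how far off an absolute timestamp is when it is interpreted as a duration, here is a little worked arithmetic. The timestamp is just an example value from late 2023.

```zig
const std = @import("std");

const seconds_per_year: u64 = 365 * 24 * 60 * 60;

/// What ends up in `valid` if an absolute time is added to the timeout.
fn sentValidity(now: u64, intended: u64) u64 {
    return now + intended;
}

pub fn main() void {
    const now: u64 = 1_700_000_000; // example Unix timestamp
    const intended: u64 = 300; // five minutes
    const sent = sentValidity(now, intended);

    std.debug.print("intended: {d} s, sent: {d} s (about {d} years)\n", .{
        intended, sent, sent / seconds_per_year,
    });
}
```

So the kernel would be told it can cache the attributes for roughly 53 years instead of five minutes.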

Next we have things like blocks, size, blksize, rdev and various time stamps. These certainly don’t have any effect on what the kernel is currently trying to do, but it’s probably best to try picking values in a sensible range.

The nlink does seem to be important. Setting it to zero would seem to indicate the item has been deleted. Although whether that would affect the current operation, I don’t know.

The important field is mode, this lets us set the file type and permissions. I decided that whatever the thing is that is being accessed it should be a directory and all users should have read-write permissions. The kernel was happy with this.
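The mode value packs the file type into the high bits and the permission bits into the low ones. A minimal sketch, with the constants written out by hand instead of taken from std (the values are the usual ones from sys/stat.h):

```zig
const std = @import("std");

// File type bits from <sys/stat.h>.
const S_IFMT: u32 = 0o170000; // mask covering the file type
const S_IFDIR: u32 = 0o040000;
const S_IFREG: u32 = 0o100000;

fn isDir(mode: u32) bool {
    return (mode & S_IFMT) == S_IFDIR;
}

pub fn main() void {
    const dir_mode = S_IFDIR | 0o666; // directory, read-write for everyone
    const file_mode = S_IFREG | 0o666; // regular file, read-write for everyone

    std.debug.print("dir:  0o{o} isDir={}\n", .{ dir_mode, isDir(dir_mode) });
    std.debug.print("file: 0o{o} isDir={}\n", .{ file_mode, isDir(file_mode) });
}
```

Note that 0o666 does not include the execute (search) bit that is normally needed to traverse a directory.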

LOOKUP

Next the message I was expecting arrived.

        .LOOKUP => blk: {
            const Static = struct {
                var generation: u64 = 0;
            };
            const lookup_in: []const u8 = msg[0..msg_len];

            std.debug.print("kernel: lookup: {s}\n", .{lookup_in});

            if (!mem.eql(u8, "foo", lookup_in[0..3])) {
                res.hdr.err = -@as(i32, @intFromEnum(E.NOENT));
                break :blk;
            }

            const time: u64 = @intCast(@max(0, std.time.timestamp()));

            Static.generation += 1;

            res.out.entry = .{
                .nodeid = 0xf00,
                .generation = Static.generation,
                .entry_valid = time + 300,
                .entry_valid_nsec = 0,
                .attr_valid = time + 300,
                .attr_valid_nsec = 0,
                .attr = .{
                    .ino = 0xf00,
                    .blocks = 1,
                    .size = 420,
                    .atime = time,
                    .mtime = time,
                    .ctime = time,
                    .atimensec = 0,
                    .mtimensec = 0,
                    .ctimensec = 0,
                    .mode = l.S.IFREG | 0o666,
                    .nlink = 1,
                    .uid = l.getuid(),
                    .gid = l.getgid(),
                    .rdev = 0,
                    .blksize = 0,
                    .flags = 0,
                },
            };

            res.hdr.len += @sizeOf(EntryOut);
        },

The Static struct containing generation is how static variables are declared in Zig. We use this to set the inode generation. For now it is unlikely we’ll get more than one lookup request, but if we do we’ll increase the generation to ensure it’s unique for this session.
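Stripped of the FUSE details, the pattern looks like this. The function name nextGeneration is made up for the example.

```zig
const std = @import("std");

/// Return a new number on every call.
fn nextGeneration() u64 {
    // A `var` declared inside a struct has static lifetime, even though
    // the struct itself is declared inside a function.
    const Static = struct {
        var generation: u64 = 0;
    };
    Static.generation += 1;
    return Static.generation;
}

pub fn main() void {
    const first = nextGeneration();
    const second = nextGeneration();
    std.debug.print("{d} {d}\n", .{ first, second });
}
```

The counter survives between calls, so this prints 1 followed by 2. It is not thread safe, which is fine for a daemon that handles one request at a time.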

Another interesting Zig feature is the blk: {...} label. Here blk is an arbitrary name, but I went with convention. In Zig you can add a label to a block (i.e. {...}) and then use the break keyword to return from that block, optionally with a value, although we don’t use that here. This eliminates one use of goto or very small functions.
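For completeness, here is a small sketch of a labelled block that does yield a value. The lookupError function is hypothetical; it mimics the name check in the LOOKUP handler and returns the error code that would go into the reply header.

```zig
const std = @import("std");

const ENOENT: i32 = 2; // Linux errno value

/// Return 0 if the name is known, otherwise a negative error code.
fn lookupError(name: []const u8) i32 {
    const err: i32 = blk: {
        // Leave the block early with a value, a bit like an early return.
        if (!std.mem.eql(u8, name, "foo")) break :blk -ENOENT;
        break :blk 0;
    };
    return err;
}

pub fn main() void {
    std.debug.print("foo => {d}, bar => {d}\n", .{ lookupError("foo"), lookupError("bar") });
}
```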

We check that the lookup path is “foo” and return from the block if it is not. This isn’t strictly necessary, it’s just checking my assumptions about what is happening.

Then the result again has an attribute embedded within it (attr). So the response to LOOKUP is a superset of GETATTR.

The big difference is that we get to set the inode number and how long the mapping lasts. The mapping being the path to inode number relationship.

SETXATTR

Finally we get to the operation that matters to us. This allows us to implement the setxattr system call for our file system. This system call is used to set extended attributes.

I imagine many readers have never heard of extended attributes. On file systems that support them, they let you set name value pairs on files.

The name is a null terminated string and the value is a binary chunk. This allows files to be tagged with whatever meta-data you like.

I probably also need to implement OPEN and READ to enable mmap on my file system. However for now I’m going with the idea that if I fully control SETXATTR or GETXATTR then I don’t need to inject delays via mmap.

Also extended attributes in the “user” namespace are not supported on TMPFS. If /tmp or /run are mounted on TMPFS and we can’t write to any other FS, then the only option left may be to mount a FUSE FS that supports setxattr.

For instance the QEMU 9p remote filesystem implementation uses this to map user permissions within a VM to user permissions on the host. I’m not entirely sure what that entails, but I use 9p to share folders with VMs with this mapping feature enabled.

You can imagine that if you want to have multiple sets of permissions on a file for different systems, then you can use extended attributes to store those permissions.

Extended attribute names start with a namespace. We’re interested in the “user” namespace which any user can write to. There are a number of other namespaces, such as the security namespace which SELinux uses.

If you try writing to these other namespaces, you will probably get EPERM. If you do have permission, you may still get some other error because the attribute is in the wrong format.
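Splitting a full attribute name into its namespace and the rest is straightforward. The sketch below is only an illustration (splitXattrName and XattrName are made-up names); the kernel does its own matching on the prefix.

```zig
const std = @import("std");

const XattrName = struct {
    namespace: []const u8,
    name: []const u8,
};

/// Split "user.bar" into "user" and "bar" at the first dot.
/// Returns null if there is no dot at all.
fn splitXattrName(full: []const u8) ?XattrName {
    const dot = std.mem.indexOfScalar(u8, full, '.') orelse return null;
    return XattrName{
        .namespace = full[0..dot],
        .name = full[dot + 1 ..],
    };
}

pub fn main() void {
    if (splitXattrName("user.bar")) |x| {
        std.debug.print("namespace={s} name={s}\n", .{ x.namespace, x.name });
    }
}
```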

First of all let’s look at how we use the setxattr system call. At the time of writing Zig has no wrapper for setxattr (I should add it), so we make the system call directly.

const XATTR_NAME: [:0]const u8 = "user.bar";

...

fn setxattr(path: [*:0]const u8, name: [*:0]const u8, value: []const u8, size: usize, flags: usize) usize {
    return l.syscall5(.setxattr, @intFromPtr(path), @intFromPtr(name), @intFromPtr(value.ptr), size, flags);
}

fn setXAttr(env: *const TestEnv) void {
    var buf: [os.PATH_MAX]u8 = .{0} ** os.PATH_MAX;

    const path = std.fmt.bufPrint(buf[0 .. buf.len - 1], "{s}/{s}", .{ env.mnt_path, "foo" }) catch |err| {
        std.debug.print("bufPrint: {}", .{err});
        return;
    };

    const res = setxattr(@ptrCast(path), XATTR_NAME, "baz", 3, 0);
    const err = os.errno(res);

    std.debug.print("setxattr: {s}: {}\n", .{ path, err });
}

We could set any binary data, but for now I just set “user.bar” to “baz”.

The code to handle this is pretty simple as we don’t bother to actually store anything.

const SetxattrIn = extern struct {
    size: u32,
    flags: u32,
    setxattr_flags: u32,
    padding: u32 = 0,
};

...

        .SETXATTR => {
            const xattr_in = mem.bytesToValue(SetxattrIn, msg[0..@sizeOf(SetxattrIn)]);
            const tail = msg[@sizeOf(SetxattrIn)..];

            std.debug.print("kernel: setxattr: {}\n", .{xattr_in});

            const name_len = msg_len - @sizeOf(SetxattrIn) - xattr_in.size;
            const name = tail[0 .. name_len - 1 :0];
            const value = tail[name_len..];

            std.debug.print("kernel: setxattr: [{}]{s} => {s}\n", .{ name.len, name, value });

            assert(mem.eql(u8, XATTR_NAME, name));
            assert(mem.eql(u8, "baz", value));
        },

The SETXATTR name-value pair is not part of the SetxattrIn structure; it is transmitted immediately after it. The size field refers to the value size. This seems rather arbitrary and a bit redundant because the name is null-terminated and we already have the overall message length.

This worries me because possibly it means that padding can be inserted somewhere which would warrant sending the value size. Or something else I have not thought of. On the other hand it does allow separating the name from the value without scanning for the null byte.

Anyway, all we do is assert that we got the expected values.
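A slightly more defensive version of the same parsing, run against a hand-built payload, might look like the sketch below. The parseSetxattr function and the Xattr struct are hypothetical; the payload is what I would expect to follow the common header for the “user.bar” = “baz” request, assuming no padding is inserted between the parts.

```zig
const std = @import("std");

const SetxattrIn = extern struct {
    size: u32,
    flags: u32,
    setxattr_flags: u32,
    padding: u32 = 0,
};

const Xattr = struct {
    name: []const u8,
    value: []const u8,
};

/// Split the bytes that follow the common header of a SETXATTR request.
fn parseSetxattr(msg: []const u8) ?Xattr {
    if (msg.len < @sizeOf(SetxattrIn)) return null;
    const head = std.mem.bytesToValue(SetxattrIn, msg[0..@sizeOf(SetxattrIn)]);
    const tail = msg[@sizeOf(SetxattrIn)..];

    if (head.size > tail.len) return null;
    const name_len = tail.len - head.size; // includes the terminating 0
    if (name_len == 0 or tail[name_len - 1] != 0) return null;

    return Xattr{ .name = tail[0 .. name_len - 1], .value = tail[name_len..] };
}

pub fn main() void {
    const head = SetxattrIn{ .size = 3, .flags = 0, .setxattr_flags = 0 };
    const payload = "user.bar\x00baz";

    // Assemble the message body: struct, then name, then value.
    var msg: [@sizeOf(SetxattrIn) + payload.len]u8 = undefined;
    @memcpy(msg[0..@sizeOf(SetxattrIn)], std.mem.asBytes(&head));
    @memcpy(msg[@sizeOf(SetxattrIn)..], payload);

    if (parseSetxattr(&msg)) |x| {
        std.debug.print("[{d}]{s} => {s}\n", .{ x.name.len, x.name, x.value });
    }
}
```

Here the length checks replace the asserts, so a malformed message yields null instead of a crash.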

SETATTR

I thought that would be the end of it, but even after handling SETXATTR the kernel refused to return from setxattr (the system call). I also foolishly did not check if any further messages had been received in my test.

Initially debugging did not reveal what was going on because I made some assumptions about what was happening and did not do a careful analysis of each call to fuse_simple_request.

However looking back on the kernel code which handles setxattr, there is an obvious culprit.

int fuse_setxattr(struct inode *inode, const char *name, const void *value,
          size_t size, int flags, unsigned int extra_flags)
{
    struct fuse_mount *fm = get_fuse_mount(inode);
    FUSE_ARGS(args);
    struct fuse_setxattr_in inarg;
    int err;

    if (fm->fc->no_setxattr)
        return -EOPNOTSUPP;

    memset(&inarg, 0, sizeof(inarg));
    inarg.size = size;
    inarg.flags = flags;
    inarg.setxattr_flags = extra_flags;

    args.opcode = FUSE_SETXATTR;
    args.nodeid = get_node_id(inode);
    args.in_numargs = 3;
    args.in_args[0].size = fm->fc->setxattr_ext ?
        sizeof(inarg) : FUSE_COMPAT_SETXATTR_IN_SIZE;
    args.in_args[0].value = &inarg;
    args.in_args[1].size = strlen(name) + 1;
    args.in_args[1].value = name;
    args.in_args[2].size = size;
    args.in_args[2].value = value;
    err = fuse_simple_request(fm, &args); /* <-- The request is sent */
    if (err == -ENOSYS) { /* <-- I look in debugger and see err == 0 */
        fm->fc->no_setxattr = 1;
        err = -EOPNOTSUPP;
    }
    if (!err)
        fuse_update_ctime(inode); /* <-- I skip over this function in the debugger */

    return err;
}

The function fuse_update_ctime results in SETATTR being sent to the FUSE daemon. It takes a rather roundabout route to achieve this (going through fs-writeback.c), but it tells our daemon the inode’s timestamps need updating.

Our code which handles SETATTR looks a lot like GETATTR. The main difference is we get some attribute info from the kernel. This tells us what attributes are being set and to which values. However we get a chance to modify this and send it back.

const SetattrIn = extern struct {
    valid: u32,
    padding: u32,
    fh: u64,
    size: u64,
    lock_owner: u64,
    atime: u64,
    mtime: u64,
    ctime: u64,
    atimensec: u32,
    mtimensec: u32,
    ctimensec: u32,
    mode: u32,
    unused4: u32,
    uid: u32,
    gid: u32,
    unused5: u32,
};


...

        .SETATTR => {
            const setattr_in =
                mem.bytesAsValue(SetattrIn, msg[0..@sizeOf(SetattrIn)]);

            std.debug.print("kernel: setattr: {}\n", .{setattr_in});

            const time: u64 = @intCast(@max(0, std.time.timestamp()));

            res.out.attr = .{
                .valid = time + 300,
                .valid_nsec = 0,
                .dummy = 0,
                .attr = .{
                    .ino = hdr.nodeid,
                    .blocks = 1,
                    .size = 42,
                    .atime = time,
                    .mtime = time,
                    .ctime = time,
                    .atimensec = 0,
                    .mtimensec = 0,
                    .ctimensec = 0,
                    .mode = l.S.IFDIR | 0o666,
                    .nlink = 1,
                    .uid = l.getuid(),
                    .gid = l.getgid(),
                    .rdev = 0,
                    .blksize = 0,
                    .flags = 0,
                },
            };

            const v = setattr_in.valid;
            if (v & ~(FATTR.ATIME | FATTR.MTIME | FATTR.CTIME) > 0) {
                std.debug.print("setattr: setting attributes not supported\n", .{});
                res.hdr.err = -@as(i32, @intFromEnum(E.OPNOTSUPP));
            } else {
                res.hdr.len += @sizeOf(AttrOut);
            }
        },

The valid field in SetattrIn we receive from the kernel says which attributes are being set. It is a bit field, where each bit represents an attribute.
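As a sketch, the bits can be tested like this. The constants are a subset of the FATTR_* flags from uapi/linux/fuse.h, written out by hand, and onlyTimes is a made-up helper that performs the same check as the handler above.

```zig
const std = @import("std");

// A subset of the FATTR_* bits from uapi/linux/fuse.h.
const FATTR = struct {
    const MODE: u32 = 1 << 0;
    const UID: u32 = 1 << 1;
    const GID: u32 = 1 << 2;
    const SIZE: u32 = 1 << 3;
    const ATIME: u32 = 1 << 4;
    const MTIME: u32 = 1 << 5;
    const CTIME: u32 = 1 << 10;
};

/// True if the request touches nothing but the three timestamps.
fn onlyTimes(valid: u32) bool {
    const allowed = FATTR.ATIME | FATTR.MTIME | FATTR.CTIME;
    return (valid & ~allowed) == 0;
}

pub fn main() void {
    const ctime_only = FATTR.CTIME;
    const with_size = FATTR.CTIME | FATTR.SIZE;

    std.debug.print("valid=0x{x} onlyTimes={}\n", .{ ctime_only, onlyTimes(ctime_only) });
    std.debug.print("valid=0x{x} onlyTimes={}\n", .{ with_size, onlyTimes(with_size) });
}
```

A request that only updates ctime has valid = 0x400, whereas a truncate would also set the SIZE bit and be rejected by this check.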

We only expect some time attributes to be valid in the request. It’s not clear if the valid bitfield also applies to the response. You can see in the code I accidentally set mode to a directory. I’m not sure if the kernel ignores this or it is simply happy that after updating the timestamps the inode transformed from a file into a directory.

Closing remarks

Part of me thinks that I really need to read up on what an inode is. Or some other FS thing like a super block. However this would give a normative understanding of these things. At best it is a peg to hang future knowledge on and at worst it delays achieving real world understanding that comes from exposure.

Often labels are mistaken for abstractions. The same labels are given to some group of objects or interfaces or illogical concepts. When you start digging into it you realise that the code only behaves according to the label’s description in some specific circumstance or not at all.

I hope you enjoyed that.