
Getting NodeJS 18.x to run on the nanos unikernel

I told myself, “Don’t do anything fancy, no new tech, just write the booking app how most web developers write apps”. A few months later I am implementing the clone3 system call in the nanos unikernel so that it can run programs linked with new versions of glibc.

Why do I need to run my nodejs app on a unikernel, you ask? Because it’s more efficient and very simple… at least it would be simple if I took my own advice and didn’t need cutting-edge versions of everything. Then again, if I strictly adhered to my own advice, nanos itself would be too new.

Nanos is pretty amazing though, it implements enough Linux system calls that it can run hefty programs like Node which were compiled for Linux.

You run a command and it puts a program in an image along with any libraries it’s linked to. Then the image can be run in the cloud or in QEMU. It’s a lot like creating a container from scratch, except that, in production, you don’t need to boot Linux and have an init program start the container. This cuts out a huge number of layers.

If you are not sure what I mean, then look at how containers are run on Fly.io. They run one container per VM. They are not running a full Linux distro in the VM; it’s just the kernel and an init process which spawns the container image.

In nanos there is no separate init process; it’s neither needed nor possible. Nanos runs your app executable directly. There is no creating new PID or user namespaces, or even new processes. In fact it doesn’t have processes or users (plural) at all: it has one process and one (fake) user.

This is a major limitation, but it also makes everything faster and simpler, to the extent that nanos can beat Linux in benchmarks despite not having been optimised at all by Linux standards. Nanos also retains the kernel-user-land barrier, which means it sacrifices some performance in comparison to other unikernels: it doesn’t just run everything in kernel-land (which would save on context switches).

It does have threads however, so multi-threaded applications can run. With the exception of thread-local storage, threads share the same virtual address space and underlying memory. They also share the same file descriptors and other resources. Nanos doesn’t need to implement copy-on-write and the like to support threads efficiently.

On the downside, my app uses Redis as the database. Redis forks a new process to write to storage in the background. It’s possible to run Redis on nanos, but it won’t persist to disk (or at least it won’t rewrite the RDB/AOF, something like that). The solution is to rewrite Redis to use threads; however, let’s just ignore that for now.

Something to note here is that I’m using nodejs because of SvelteKit. It’s also, strictly speaking, too new. However it’s really good, much better than React. I won’t get into that here; it deserves its own article. The thing to note though is that if you throw out nodejs and use Go, Rust, Zig, Pony (maybe), etc., then you are unlikely to hit the issues I did with node.

Running node

How do you make and run a nanos node image? Well, we can make a simple web server with the following js:

// hi.js
var http = require('http');
http.createServer(function (req, res) {
    res.writeHead(200, {'Content-Type': 'text/plain'});
    res.end('Hello World\n');
}).listen(8083, "0.0.0.0");
console.log('Server running at http://127.0.0.1:8083/');

And a manifest for the image:

{
    "Args": ["hi.js"],
    "Files": ["hi.js"]
}

Then run it locally with ops in QEMU with:

$ ops run /usr/bin/node -a hi.js -c config.json
...

The ops command is made specifically for building and deploying nanos images. This works somewhat like creating a container from scratch and copying in node, node’s libs and hi.js. Except that it creates a raw bootable VM image.

The above works fine when used with versions of node compiled for older distros which use older kernels. However there is a problem for people living on the cutting edge…

Implementing clone3

What happens if we run node 17/18 from OpenSUSE Tumbleweed or Nix?

Uh, well, it works just fine now! Perhaps I was hallucinating, or someone got fed up with containers randomly breaking because clone3 is disallowed by seccomp or similar.

Let’s just pretend that it fails and talk about my clone3 implementation. Running ops with --trace shows that node dies when it tries to use syscall 435. We can find out what syscall that is by looking in $linux_tree/arch/x86/entry/syscalls/syscall_64.tbl.

It’s clone3 of course, which I happened to have previously written a test for in the Linux Test Project. It’s a nicer interface than clone, especially as the arguments don’t change position on different platforms. It’s also more extensible, allowing new processes to be cloned directly into new namespaces and CGroups, which I have had lots of fun with.

By the way, clone is used for spawning new processes (usually done with fork, which is implemented in terms of clone these days) or for spawning new threads (usually done via the POSIX pthreads library).

Luckily nanos doesn’t even have processes, never mind CGroups. So for nanos, clone3 really just allows the stack size to be specified; otherwise there’s no difference between it and clone.

The original nanos clone is implemented in $nanos_tree/src/unix/thread.c. It looks like the following:

#if defined(__x86_64__)
sysreturn clone(unsigned long flags, void *child_stack, int *ptid, int *ctid, unsigned long newtls)
#elif defined(__aarch64__) || defined(__riscv)
sysreturn clone(unsigned long flags, void *child_stack, int *ptid, unsigned long newtls, int *ctid)
#endif
{
    thread_log(current, "clone: flags %lx, child_stack %p, ptid %p, ctid %p, newtls %lx",
        flags, child_stack, ptid, ctid, newtls);

    if (!(flags & CLONE_THREAD)) {
        thread_log(current, "attempted to create new process, aborting.");
        return set_syscall_error(current, ENOSYS);
    }

    /* no stack size given, just validate the top word */
    if (!validate_user_memory(child_stack, sizeof(u64), true))
        return set_syscall_error(current, EFAULT);

    if (((flags & CLONE_PARENT_SETTID) &&
         !validate_user_memory(ptid, sizeof(int), true)) ||
        ((flags & CLONE_CHILD_CLEARTID) &&
         !validate_user_memory(ctid, sizeof(int), true)))
        return set_syscall_error(current, EFAULT);

    thread t = create_thread(current->p, INVALID_PHYSICAL);
    context_frame f = thread_frame(t);
    /* clone frame processor state */
    clone_frame_pstate(f, thread_frame(current));
    thread_clone_sigmask(t, current);

    /* clone behaves like fork at the syscall level, returning 0 to the child */
    set_syscall_return(t, 0);
    f[SYSCALL_FRAME_SP] = u64_from_pointer(child_stack);
    if (flags & CLONE_SETTLS)
        set_tls(f, newtls);
    if (flags & CLONE_PARENT_SETTID)
        *ptid = t->tid;
    if (flags & CLONE_CHILD_CLEARTID)
        t->clear_tid = ctid;
    t->blocked_on = 0;
    t->syscall = 0;
    f[FRAME_FULL] = true;
    thread_reserve(t);
    schedule_thread(t);
    return t->tid;
}

Linux declares syscalls with a system of macros that actually wrap the function definition. It seems that in nanos we just write a normal function (named accordingly), add the SYS_* define to a header file, and register it like so:

void register_thread_syscalls(struct syscall *map)
{
    register_syscall(map, futex, futex, 0);
    register_syscall(map, set_robust_list, set_robust_list, 0);
    register_syscall(map, get_robust_list, get_robust_list, 0);
    register_syscall(map, clone, clone, SYSCALL_F_SET_PROC);
#ifdef __x86_64__
    register_syscall(map, arch_prctl, arch_prctl, 0);
#endif
    register_syscall(map, set_tid_address, set_tid_address, 0);
    register_syscall(map, gettid, gettid, 0);
}

The new clone3 syscall takes a single struct and its size as the only two arguments.

struct clone_args {
     u64 flags;
     u64 pidfd;
     u64 child_tid;
     u64 parent_tid;
     u64 exit_signal;
     u64 stack;
     u64 stack_size;
     u64 tls;
};

sysreturn clone3(struct clone_args *args, bytes size)
{
...
}

Originally I just copied and pasted the clone syscall and modified it to take this struct. I was asked to deduplicate the code, so I copied what Linux does and created an internal clone which takes a cut-down version of clone_args, then used this to implement both syscalls.

struct clone_args_internal {
     u64 flags;
     int *child_tid;
     int *parent_tid;
     void *stack;
     bytes stack_size;
     u64 tls;
};

sysreturn clone_internal(struct clone_args_internal *args, bytes size)
{
     u64 flags = args->flags;

     if (!args->stack_size)
          return set_syscall_error(current, EINVAL);

     if (!(flags & CLONE_THREAD)) {
          thread_log(current, "attempted to create new process, aborting.");
          return set_syscall_error(current, ENOSYS);
     }

     if (!validate_user_memory(args->stack, args->stack_size, true))
          return set_syscall_error(current, EFAULT);

     if (((flags & CLONE_PARENT_SETTID) &&
          !validate_user_memory(args->parent_tid, sizeof(u64), true)) ||
         ((flags & CLONE_CHILD_CLEARTID) &&
          !validate_user_memory(args->child_tid, sizeof(u64), true)))
          return set_syscall_error(current, EFAULT);

     thread t = create_thread(current->p, INVALID_PHYSICAL);
     context_frame f = thread_frame(t);

     clone_frame_pstate(f, thread_frame(current));
     thread_clone_sigmask(t, current);

     set_syscall_return(t, 0);
     f[SYSCALL_FRAME_SP] = (u64)args->stack;
     if (flags & CLONE_SETTLS)
          set_tls(f, args->tls);
     if (flags & CLONE_PARENT_SETTID)
          *(args->parent_tid) = t->tid;
     if (flags & CLONE_CHILD_SETTID)
          *(args->child_tid) = t->tid;
     if (flags & CLONE_CHILD_CLEARTID)
          t->clear_tid = args->child_tid;
     t->blocked_on = 0;
     t->syscall = 0;
     f[FRAME_FULL] = true;
     thread_reserve(t);
     schedule_thread(t);
     return t->tid;
}

#if defined(__x86_64__)
sysreturn clone(unsigned long flags, void *child_stack, int *ptid, int *ctid, unsigned long newtls)
#elif defined(__aarch64__) || defined(__riscv)
sysreturn clone(unsigned long flags, void *child_stack, int *ptid, unsigned long newtls, int *ctid)
#endif
{
    thread_log(current, "clone: flags %lx, child_stack %p, ptid %p, ctid %p, newtls %lx",
        flags, child_stack, ptid, ctid, newtls);

    struct clone_args_internal args = {
         .flags = flags,
         .child_tid = ctid,
         .parent_tid = ptid,
         /* no stack size given, just validate the top word */
         .stack = child_stack,
         .stack_size = sizeof(u64),
         .tls = newtls,
    };

    return clone_internal(&args, sizeof(args));
}

struct clone_args {
     u64 flags;
     u64 pidfd;
     u64 child_tid;
     u64 parent_tid;
     u64 exit_signal;
     u64 stack;
     u64 stack_size;
     u64 tls;
};

sysreturn clone3(struct clone_args *args, bytes size)
{
     thread_log(current,
         "clone3: args_size: %ld, pidfd: %p, child_tid: %p, parent_tid: %p, exit_signal: %ld, stack: %p, stack_size: 0x%lx, tls: %p",
         size, args->pidfd, args->child_tid, args->parent_tid, args->exit_signal,
         args->stack, args->stack_size, args->tls);

     if (size < sizeof(*args))
          return set_syscall_error(current, EINVAL);

     struct clone_args_internal argsi = {
          .flags = args->flags,
          .child_tid = (int *)args->child_tid,
          .parent_tid = (int *)args->parent_tid,
          .stack = ((char *)args->stack) + args->stack_size,
          .stack_size = args->stack_size,
          .tls = args->tls
     };

     return clone_internal(&argsi, sizeof(argsi));
}

Note that clone3 accepts a pointer to the bottom of the stack while clone takes a pointer to the top. Stacks grow down (from high to low) so before knowing the stack size, a pointer to the end of the memory range needed to be provided. Now that there is a stack size, a pointer to the start of the range can be provided. This means the result of mmap can be passed directly.

Now did I get caught out by this? You bet I did. It’s also the major difference between clone and clone3 from the nanos perspective.

You can see how the pull request is going/gone here.

Running node 18

The above gets node 17 running. However, that is not new and unstable enough for me. Node 18 has been released, so I want to be running that. But there is a problem, and this time I can reproduce it.

$ ops run /nix/store/yg2w28z1ph0h4a2ydkgbyfz9rl5gd9yh-nodejs-18.2.0/bin/node -a hi.js -c config.json -f
booting /home/rich/.ops/images/node.img ...
en1: assigned 10.0.2.15

frame trace:
ffffc0000706ff40:   ffffffff800a1929    (adjust_process_heap + 0000000000000049/0000000000000064)
ffffc0000706ff60:   ffffffff800ba3b6    (brk + 0000000000000156/00000000000001fc)
ffffc0000706ffb0:   ffffffff800c8d8d    (syscall_handler + 00000000000002ed/00000000000005e4)
ffffc0000706fff0:   0000000000001000
assertion rbtree_remove_by_key(t, n) failed at /rich/kernel/nanos/src/runtime/rbtree.h:28  in rbtree_remove_node(); halt

Oh no, an assertion failure during an operation on a red-black tree. The syscall which triggers this is brk, which is used to move the end of the heap. Unless I am mistaken, brk is used for smaller memory allocations by the likes of malloc; bigger and more complicated allocations are done with mmap.

To find out more about what’s going on, we can add nanos trace output (with ops run ... --trace), although it turned out brk didn’t output much trace information.

So I incrementally added trace messages to brk until I was able to pinpoint where things were going wrong. By the end brk looked like this.

static sysreturn brk(void *addr)
{
    process p = current->p;
    process_lock(p);

    thread_log(current, "brk: p->brk: %p, addr: %p", p->brk, addr);

    /* on failure, return the current break */
    if (!addr || p->brk == addr)
        goto out;

    u64 old_end = pad(u64_from_pointer(p->brk), PAGESIZE);
    u64 new_end = pad(u64_from_pointer(addr), PAGESIZE);

    thread_log(current, "brk: old_end: %lx, new_end: %lx", old_end, new_end);

    if (old_end > new_end) {
        if (u64_from_pointer(addr) < p->heap_base ||
            !adjust_process_heap(p, irange(p->heap_base, new_end)))
            goto out;
        write_barrier();
        unmap_and_free_phys(new_end, old_end - new_end);
    } else if (new_end > old_end) {
        u64 alloc = new_end - old_end;
        if (!validate_user_memory(pointer_from_u64(old_end), alloc, true) ||
            !adjust_process_heap(p, irange(p->heap_base, new_end))) {
            thread_log(current, "brk: failed");
            goto out;
        }
        pageflags flags = pageflags_writable(pageflags_noexec(pageflags_user(pageflags_memory())));
        if (new_zeroed_pages(old_end, alloc, flags, 0) == INVALID_PHYSICAL) {
            adjust_process_heap(p, irange(p->heap_base, old_end));
            goto out;
        }
    }
    p->brk = addr;
  out:
    addr = p->brk;
    process_unlock(p);

    thread_log(current, "brk: ret addr: %p", addr);
    return sysreturn_from_pointer(addr);
}

The thread_log calls add trace information. Of particular interest is the one which prints “brk: failed”.

The trace output looked like this.

$ ops run /nix/store/yg2w28z1ph0h4a2ydkgbyfz9rl5gd9yh-nodejs-18.2.0/bin/node -a hi.js -c config.json -f --trace | rg -A 10 brk

... many lines redacted ...

    2 brk
    2 brk: p->brk: 0x0000000002a6f000, addr: 0x0000000002a91000
    2 brk: old_end: 2a6f000, new_end: 2a91000
    2 brk: failed
    2 brk: ret addr: 0x0000000002a6f000
    2 direct return: 44494848, rsp 0xffe38e98
    2 run thread, cpu 0, frame 0xffffc00001807000, pc 0x100121d37, sp 0xffe38e98, rv 0x2a6f000
    2 mmap
    2 mmap: addr 0x0000000000000000, length 0x100000, prot 0x3, flags 0x22, fd -1, offset 0x0
    2    returning 0x1007a3000
    2 direct return: 4302974976, rsp 0xffe38e98
    2 run thread, cpu 0, frame 0xffffc00001807000, pc 0x1001254b3, sp 0xffe38e98, rv 0x1007a3000
    2 page fault, vaddr 0x1007a3008, vmap 0xffffc0000040a500, ctx 0xffffc00001807000, type 3, pc 0x1000b769e
    2 page fault, vaddr 0x1007ab018, vmap 0xffffc0000040a500, ctx 0xffffc00001807000, type 3, pc 0x1000b7229
    2 page fault, vaddr 0x1007a7010, vmap 0xffffc0000040a500, ctx 0xffffc00001807000, type 3, pc 0xd54c32
--
    2 brk
    2 brk: p->brk: 0x0000000002a6f000, addr: 0x0000000002afe000
    2 brk: old_end: 2a6f000, new_end: 2afe000

frame trace:
ffffc0000706ff40:   ffffffff800a1ce9    (adjust_process_heap + 0000000000000049/0000000000000064)
ffffc0000706ff60:   ffffffff800bb107    (brk + 00000000000004d7/0000000000000548)
ffffc0000706ffb0:   ffffffff800c959d    (syscall_handler + 00000000000002ed/00000000000005e4)
ffffc0000706fff0:   0000000000001000
assertion rbtree_remove_by_key(t, n) failed at /home/rich/kernel/nanos/src/runtime/rbtree.h:28  in rbtree_remove_node(); halt

The second-to-last brk fails, then the last one triggers the assertion. My suspicions fell on adjust_process_heap(p, irange(p->heap_base, new_end)) before I even added the “brk: failed” message.

To see why let’s look at the implementation.

boolean adjust_process_heap(process p, range new)
{
    vmap_lock(p);
    boolean inserted = rangemap_reinsert(p->vmaps, &p->heap_map->node, new);
    vmap_unlock(p);
    return inserted;
}

OK, OK, we really want to look at rangemap_reinsert.

boolean rangemap_reinsert(rangemap rm, rmnode n, range k)
{
    rangemap_remove_node(rm, n);
    n->r = k;
    return rangemap_insert(rm, n);
}

If rangemap_insert can fail after rangemap_remove_node has succeeded, then a call to brk can remove the node representing the heap memory from the rangemap without putting anything back.

In fact rangemap_remove_node can’t fail without triggering an assertion. That’s what causes the assertion failure above. So let’s look at rangemap_insert.

boolean rangemap_insert(rangemap rm, rmnode n)
{
    init_rbnode(&n->n);
    rangemap_foreach_of_range(rm, curr, n) {
        if (curr->r.start >= n->r.end)
            break;
        range i = range_intersection(curr->r, n->r);
        if (range_span(i)) {
            msg_warn("attempt to insert %p (%R) but overlap with %p (%R)\n",
                     n, n->r, curr, curr->r);
            return false;
        }
    }
    if (!rbtree_insert_node(&rm->t, &n->n)) {
        halt("scan found no intersection but rb insert failed, node %p (%R)\n",
             n, n->r);
    }
    return true;
}

It appears that insertion can fail if the new range intersects an existing one. When this happens it should print a warning message, which we don’t see. The reason is that msg_warn needs to be enabled at compile time.

Doing that confirms my suspicions: the following line appears in the log.

rangemap_insert warning: attempt to insert 0xffffc00000405f00 ([0x29b8000 0x2a91000)) but overlap with 0xffffc0000040a400 ([0x2a6f000 0xaa6f000))

So what range is it overlapping? Grepping the start of the range reveals it.

...
    2 mmap
    2 mmap: addr 0x0000000002a6f000, length 0x8000000, prot 0x0, flags 0x4022, fd -1, offset 0x0
    2    returning 0x2a6f000
...

This means that node deliberately maps this address. It does this on Linux too, which we can see more clearly with strace.

$ strace -e brk,mmap /nix/store/dv8rq1kl181whp5r1f30j0ar4i11axqw-nodejs-18.4.0/bin/node
...
brk(0x2e8e000)                          = 0x2e8e000
mmap(0x2e8e000, 134217728, PROT_NONE, MAP_PRIVATE|MAP_ANONYMOUS|MAP_NORESERVE, -1, 0) = 0x2e8e000
...

The addresses are different, but this is the offending mmap. What I find odd is that it’s deliberately mapping the end of the heap. I haven’t investigated this further. From the nanos point of view, it simply needs to deal with it.

To this end I introduced the following change to rangemap_reinsert.

boolean rangemap_reinsert(rangemap rm, rmnode n, range k)
{
    range old = n->r;
    rangemap_remove_node(rm, n);
    n->r = k;
    if (!rangemap_insert(rm, n)) {
        n->r = old;
        assert(rangemap_insert(rm, n));
        return false;
    }
    return true;
}

So now, when the insert fails, it tries to put the rangemap back into the state it found it in. This way we won’t unmap the heap when brk fails. The result is that node 18 can now run.

Conclusion

Can I host my app on nanos yet? Not quite; there is the issue of Redis’s call(s) to fork, and also the fact that I haven’t finished my app. Hosting providers like Fly abstract away many of the issues with containers, although you still pay for the CPU and memory their kernel and init system use. You also start getting into some vendor lock-in, but none of this is a concern when you have zero users.

On the other hand, if you have thousands or millions of users then nanos has huge potential. Although this article is about doing kernel development, I wouldn’t expect you to need to do that if you are just deploying a Go or Rust microservice or sticking to the NanoVMs supported packages.

If you want to rewrite your application as a unikernel then you are likely to fall off the deep end at some point. The fact that nanos keeps the userland barrier and copies the Linux ABI is pretty important.

Anyway back to writing Typescript, HTML and CSS.