libactors; Actor model and message passing in C with Userland RCU

I recently came across liburcu, which is a userland implementation of the Linux kernel’s RCU (read-copy-update) synchronisation mechanism and a bunch of useful concurrent data structures. This struck me as something I could quickly use to build yet-another actor model library and use in a new concurrent Linux kernel test executor. Indeed I think that for a C project I was able to do quite a lot very quickly. Below are some reflections on this.

Read-Copy-Update

Linux’s RCU strikes me as a typical example of something which at first seems magical and difficult to use, but actually is simpler than many alternatives.

What initially confused me about RCU and its API usage (and maybe still does confuse me) is that RCU enabled data structures often only require a read lock to perform an update (that is, a write). For example, a hash map may only require a read lock to add an entry. However this is because the hash map data structure is completely lockless, meaning that it can be mutated concurrently without requiring a lock to be taken¹, it is infact the value of the entry which is protected by the RCU read lock².

Of course that doesn’t have to be the case and a data structure can use a lock and RCU to give updaters exclusive access or some other combination thereof. More to the point though, the end result of a lockless data structure plus RCU combo is often simpler than the equivalent with mutex’s, or whatever, because taking an RCU read lock or performing a synchronisation are just three calls with no function parameters.

Take a look at the following pseudo C which deletes a value from a concurrently accessed hash map. The details are not too important, just look for the 3 functions with RCU in the name.

static struct cds_lfht *book;

...

addr_t addr = /* some address */
struct addr_entry *entry;
struct cds_lfht_iter iter;
struct cds_lfht_node *node;
int ret;

/* Start a read critical section */
rcu_read_lock();

/* Lookup the addr in a hash map 'book' (we assume it still exists) */
cds_lfht_lookup(book, addr, addr_entry_match, &addr, &iter);
node = cds_lfht_iter_get_node(&iter);
entry = caa_container_of(node, struct addr_entry, node);

/* Remove the entry from the hash map, but don't delete the value yet */
ret = cds_lfht_del(book, &entry->node);

rcu_read_unlock();

/* Delete the value after all existing readers have finished their read critical section */
if (!ret) {
    synchronize_rcu();
    free(entry);
} else if (ret == -ENOENT) {
    // someone else removed it...
}

The entry is removed from the hash map between rcu_read_lock() and rcu_read_unlock(), then later deleted after synchronize_rcu(). No need to acquire a particular mutex or anything like that, at least not in this case. There is more to the RCU API, but I didn’t use any of it in this case.

So I like RCU; not least because of its API and also it has some fancy performance characteristics. Once you get the interaction between read sections and synchronisation, I actually think it is preferable to some other synchronisation mechanisms which are better understood in general.

Speaking of getting it, RCU consists of two things; read sections and synchronisation points. When we enter a synchronisation point, we wait until all current read sections exit. However we do not wait for any new read sections which are entered while synchronising. Furthermore synchronisation points do not block read sections.

A big impediment to understanding RCU, is seeing how waiting only for current read sections is ever sufficient to ensure exclusive access to some data. Hopefully looking at the above code we can see why it is in this particular case.

First we enter a read section where we fetch and remove a node from the lock free hash map. Any read sections which started before or during ours may still have a copy of the entry pointer, however any which begin after we call the delete function are guaranteed not to obtain a copy of the entry.

By the time we call synchronize_rcu() the only read sections which may have access to the entry are ones which began before we finished our read section. We will wait for these to finish, after which we are safe to free the entry.

Hopefully in this particular case you can see why it is only necessary to wait for existing ‘readers’. In general it is sufficient because one first modifies a pointer atomically, removing access to some memory, then waits for all readers that may have access to that memory to finish. New readers won’t have access after the atomic update, so there is no need to block them.

Another strange thing about RCU is that synchronize_rcu() waits for all read sections on the system, including ones which have no relevance to the current operation. Not only is this fine performance wise, but it is actually very efficient at least when implemented in kernel space. The user space implementations are maybe not as efficient, but I assume they are still very good. From a user point of view, this nice because we don’t need to worry about which lock to take, however it is also confusing.

Userland RCU

While I do like RCU and that is why I was looking at liburcu, it is not really why I am using it. It’s more because it provides some nice concurrent data structures and some other libs which are missing from the C standard library.

Most languages have a hash map, vector and similar in their base library. This is not always a good thing, but in C it is completely missing and so every project has to go looking for these things.

liburcu provides a lot of standard data structures in one place along with much other boiler plate. I’m not completely sure the APIs it provides are the best, perhaps they are a bit intrusive, but it is difficult to make things much better in C.

Actors

Message passing and the Actor model makes for a nice way of doing concurrency and data processing. It also easily leads to nondeterministic chaos as the state of different Actors and message orders interact, but this is maybe something which can be mitigated.

I think most people are introduced to the Actor model informally through Erlang or some Actor’s library like Akka, but there is also a rigorous definition put forward at the dawn of modern computing. The formalisms in Actors: A Model of Concurrent Computation in Distributed Systems are interesting not least for how much they differ from the practical adoption of Actors.

I previously created an Actors library in Julia and now libactors. These are both quite far from the formal definition of the Actor model, especially libactors. I will set out below roughly what I mean by an actor.

An Actor is an object (struct) with:

An address
A message box
Some message handlers
Some other user defined state or data

Actors can send messages to each other, by sending a message to an address. They can’t access each other directly. This is analogous to a collection of networked computers; they can only communicate with message passing, they can not access each other’s memory directly³.

An Actor is able to determine if an address exists and send a message to it. It is able to check its own message box for new arrivals and update its own state/memory. Finally it can also start new Actors.

This is also somewhat analogous to Operating System processes, communicating with pipes or sockets, or micro services. Indeed an Actor library may abstract away where an Actor is running and what transport is used so that each Actor runs in its own thread, process or even computer, but the interface remains the same.

This makes a nice fractal where concurrency at the smallest scale, say POSIX threads, looks similar to a large scale cluster of disparate machines, thanks to the Actor abstraction. Having said that, libactors would have to do a deep copy of message data in order to provide such an abstraction, which it does not, but it’s possible.

Also see Pony Lang’s idea of an Actor and my description for Actors.jl.

libactors

Not to be confused with libactor; I have created libactors and used it in the experimental LTP Executor ⁴. I had surprisingly few problems with doing this. Not least thanks to Clang’s address sanitizer, compiling with the strictest set of options I can find⁵ and much use of function attributes.

So far the library is fairly simple and doesn’t contain anything magical. An Actor definition looks something like the following contrived nonsense.

enum static_addrs {
    ADDR_FOO = 1,
    ADDR_BAR = 2,
};

enum msg_types {
    MSG_PING,
    MSG_PONG,
    MSG_DATA
};

// Actor state definition
struct foo {
    unsigned int state;
};

// Message body definition
struct bin {
    int length;
    char bytes[];
};

// Foo's message handler
static void foo_hear(struct actor *self, struct msg *msg)
{
    // Get our private, user defined, actor state
    struct foo *my = self->priv;
    
    // Dispatch on the message type
    switch(msg->type) {
    case MSG_PING:
        // Just change the message type and pass (say) the same struct back
        msg->type = MSG_PONG;
        actor_say(self, msg->from, msg);
    case MSG_PONG:
        free(msg);
        // So they are awake, send them some junk
        msg = msg_alloc_extra(sizeof(struct bin) + 1024);
        msg->type = MSG_DATA;
        actor_say(self, ADDR_BAR, msg);
    default:
        // Blow up or something
        ...
    }
}

...

void main(void)
{
    struct actor *foo;
    ...
    
    actors_init();
    
    foo = actor_alloc_extra(sizeof(struct foo));
    foo->addr = ADDR_FOO;
    foo->hear = foo_hear;
    
    // Create other actors
    ...
    
    // Wait for all the actors to exit
    actors_wait();
}

By default when an Actor hears a message it executes the function pointed to by struct actor::hear. To send a message⁶ one calls void actor_say(struct actor *self, addr_t to, struct msg *msg). Actor’s are usually only accessed through one or more user defined addresses, not memory pointers and you are strongly discouraged from directly accessing another Actor’s memory.

Unfortunately it is quite easy to accidentally access another Actor’s memory by passing a message containing a pointer to it. This is something which Rust’s borrow checker can prevent and Pony appears to have an even richer mechanism for dealing with shared references/pointers.

However not all is lost in C; for one we can instrument runtime code with the Clang address or thread sanitizers. These are slow (well, still faster than most languages) and don’t catch everything, but are easy to use. Then there is the possibility of adding annotations to variables and functions which can be enforced by a static analyzer like sparse. In fact Clang and GCC will check some annotations as well. I haven’t tried it, but possibly this could be used to check that pointers passed with messages are accessed in a sensible way. The Linux kernel uses this to put some constraints on pointers used in RCU data structures.

By default, each Actor has only a single message handler at any one time, we simply just branch on the msg type to determine what to do when receiving a message⁷. I think this works out fairly well for basic usage as there is no magic involved and doesn’t look too ugly. Also C compilers are pretty clever so there is no need to worry about the performance of large switch statements.

Alternatively the user can provide a listen callback instead which requires some more boilerplate, but gives them the freedom to read some other data source at the same time or perform some setup when the actor starts.

static void tester_listen(struct actor *self)
{
    struct msg *msg = msg_alloc();
    struct tester *my = malloc(sizeof(struct tester));

    assert_perror(errno);
    assert(my);
    memset(my, 0, sizeof(*my));
    self->priv = my;

    // Inform another actor we have been allocated/started
    msg->type = MSG_ALLC;
    actor_say(self, ADDR_WRITER, msg);

    // The message loop
    for (;;) {
        msg = actor_inbox_pop(self);

        if (msg)
            self->hear(self, msg); // If we have an actor message then handle it
        else if (!(my->child || my->cout || my->eout))
            actor_wait(self, NULL); // Sleep-wait if there is nothing to do

        // Check if the child process has completed
        if (my->child)
            tester_check_child(self);

        // Check the childs standard output which we log
        if (my->cout || my->eout)
            tester_check_output(self);
    }
}

Above is an actual snippet from the LTP Executor, it shows the listen function for a tester actor. This actor starts a child process which is usually an LTP test. It needs to listen for new actor messages at the same time as reading the child’s output and checking its exit status.

This is a common problem with Actor’s, where you need to both wait for new actor messages while also waiting for some other kind of I/O to appear. At the same time you can’t merely spin-wait because the CPU won’t be happy. In this case, when there is an active child, we block using poll with a short timeout, which is acceptable because we don’t need to respond quickly to Actor messages. When no child is active, we can use actor_wait which uses fancy Linux futex’s to wait efficiently.

In general this might not be suitable, but on the other hand it would be possible to interrupt poll with a signal if we really need the actor to check its messages.

Note that we can also call the default message loop from listen and just use it to perform some actor setup or something like the following.

__attribute__((pure))
static int only_pong(const struct actor *self __attribute__((unused)),
             const struct msg *msg)
{
    return msg->type == MSG_PONG;
}

static void writer_listen(struct actor *self)
{
    struct msg *msg;

    // Only pop messages for which only_pong returns true
    // Other messages are put into a buffer for later
    actor_inbox_filter(self, only_pong);

    // Send ping until we get a pong
    while ((msg = actor_inbox_pop(self)) == NULL) {
        dprintf(STDOUT_FILENO, "PING\n");
        usleep(1000000);
    }
    assert(msg->type == MSG_PONG);
    free(msg);

    // Release any filtered messages
    actor_inbox_filter(self, NULL);

    // Enter the default message loop
    actor_hear_loop(self);
}

The writer actor is a gateway to some other, probably remote, process. It is not known what transport is being used and whether it is reliable, we just write to stdout. Therefor before sending any important messages we initially send ping and wait for a pong. Other actors may send messages to the writer and they will be queued while waiting for communication to be established.

Presently each Actor gets its own thread and starting a new actor starts a new thread. This is perhaps not ideal if you wish to create many more actors than CPUs or create many short lived actors. Also it would perhaps be better to use processes rather than threads at the cost of being forced to copy some message data. Using processes would provide much better isolation.

I suspect that having many threads is OK on modern kernels and it offloads a lot of work to the kernel. However creating threads is relatively slow, so it would at least make sense to reuse threads (or actors) in the case where there are a lot of short lived actors.

So far I have only shown actors being created in main outside the actor system with static addresses. However actors may arbitrarily create other actors.

// This is called by a 'reader' actor in response to an 'ALLC' (allocate) 
// message from a remote process
void tester_start(struct actor *self, addr_t id)
{
    struct msg *msg;
    struct actor *tester;

    // Possibly the remote side is asking us to do something we already did
    // in that case we just inform the existing actor
    if (actor_exists(id)) {
        msg = msg_alloc();
        msg->type = MSG_ALLC;
        actor_say(self, id, msg);
        return;
    }

    // Create (allocate) a new tester actor
    tester = actor_alloc();
    tester->addr = id;
    tester->listen = tester_listen;
    tester->hear = tester_hear;

    actor_start(tester);
}

So the above functions checks whether an actor already exists with the address provided and if it does not then it starts a new one.

Looking forwards

As I mentioned when describing actors, a big problem is the chaos which results from unstructured message passing. When something goes wrong in an actor based system, or any concurrent system for that matter, it can be very difficult to know what sequence of events lead up to the error.

Stack traces from an error often end with msg_box_pop before revealing any important information. What is really needed is the causal sequence of messages and actor states, not just a log of the messages received by an actor, but the chain of messages and actor states.

Currently there isn’t even a message log. This could be implemented by copying some message data into a buffer for each actor. Its difficult to log the content of the message, in a generic way, because this is just a pointer to some arbitrary user defined data, but at least the message type and from address can be logged.

Actor state is also some pointer to arbitrary data, so providing a generic mechanism for logging this may be difficult. In some cases however the actor state may be entirely or partially represented by an integer (or bit field) and this could certainly be logged very cheaply.

Initially liburcu was created for LTTng which is a tracing framework. So possibly libactors could define some trace points on sending and receiving messages. The user can then provide more tracepoints if they wish.

Finally, this is being used in the experimental LTP test executor. If the executor is to be included in the LTP, then this should be bundled with it. Generally we avoid including new dependencies like the plague unless it is something which is included with practically every Linux distribution’s base packages for the last 10+ years. This may not be the case with liburcu, so we will probably need some other, more compact, implementation of a concurrent message queue and address table (hash map). This will be somewhat ironic given the original motivator for trying this was liburcu.

Lockless data structures usually rely on the optimistic updating of some atomic variable(s) which will be retried on failure, which arguably is equivalent to a spin lock, but it looks different↩︎
Actually the RCU API is also used to dereference the data structure’s pointers and maybe more, but you don’t need to worry about this as you will simply be told “call this in a read lock”↩︎
RDMA still uses message passing, it just cuts out some parts of the stack. Ultimately any long distance communication will require encapsulation and serialisation into something which looks like a message↩︎
which itself may not be a good idea, but that is another matter↩︎
-O1 -Wall -Wextra -pedantic -Werror -g -fno-omit-frame-pointer -fsanitize=address↩︎
unless you are operating outside the actor system and have a pointer to the actor struct then you can use msg_box_push↩︎
Actually I also use the from address in one instance and obviously the body of the message in many cases↩︎