zc-data

To brush up on my coding-test skills, I decided to implement a radix sort and a hash map. This was a chance to learn how some core data structures and algorithms work. I also tried out using Zig to build and test the project. I believe there is a very strong case for using Zig in existing C codebases.

I ended up using Meson to build some tests written in C as well. Zig is frankly very new and unstable (version 0.9.0-dev). The C header translation in particular gave me some trouble. Hopefully I will get time to look into that at a later date.

Zig

Insofar as Zig did work, I was very happy with it. It can cross-compile C to almost any architecture with minimal difficulty. That is usually a nightmare unless you use something like Buildroot, and even then Buildroot recompiles GCC to make cross compilation work, which is more challenging than Zig. At least if you ignore Zig’s immaturity.

Zig’s build system is quite nice; it’s just a Zig file and library. Below is the Zig build file.

const std = @import("std");

const cflags = &[_][]const u8{ "-Wall", "-Wextra" };

pub fn addCExe(
    b: *std.build.Builder,
    name: []const u8,
    files: []const []const u8,
) *std.build.LibExeObjStep {
    const exe = b.addExecutable(name, null);
    exe.linkLibC();
    exe.addIncludeDir("include");
    exe.addCSourceFiles(files, cflags);
    exe.install();

    return exe;
}

pub fn build(b: *std.build.Builder) void {
    const exes = [_]*std.build.LibExeObjStep{
        addCExe(b, "sort_tests", &.{
            "src/sort_tests.c",
            "src/radix_sort.c",
            "src/hash.c",
        }),
        addCExe(b, "map_tests", &.{
            "src/map_tests.c",
            "src/hash.c",
        }),
    };

    const target = b.standardTargetOptions(.{});
    const mode = b.standardReleaseOptions();
    for (exes) |exe| {
        exe.setTarget(target);
        exe.setBuildMode(mode);
    }

    const tests = b.addTest("src/main.zig");
    tests.linkLibC();
    tests.addIncludeDir("include");
    tests.addCSourceFiles(&.{ "src/radix_sort.c", "src/hash.c" }, cflags);

    const test_step = b.step("test", "Run tests");
    test_step.dependOn(&tests.step);
}

This builds some C object files and executables. It also builds a Zig executable that links to the C. The Zig executable just contains some unit tests, which all look similar to the following test.

const std = @import("std");

const c = @cImport({
    @cInclude("slab.h");
    @cInclude("sort.h");
    @cInclude("hash.h");
});

fn lt(_: void, lhs: u64, rhs: u64) bool {
    return lhs < rhs;
}

test "order matches" {
    var prng = std.rand.SplitMix64.init(0x505e);
    var unsorted_ints: [100000]u64 = undefined;
    for (unsorted_ints) |*n| {
        n.* = prng.next();
    }
    var sorted_ints = unsorted_ints;
    var radix_sorted_ints = unsorted_ints;

    std.sort.sort(u64, &sorted_ints, {}, lt);
    _ = c.radix_sort(&radix_sorted_ints, radix_sorted_ints.len);

    for (sorted_ints) |v, i| {
        try std.testing.expect(radix_sorted_ints[i] == v);
    }
}

This tests my radix sort implementation (c.radix_sort) against Zig’s sort function in the standard library (std.sort.sort). This all worked quite nicely if we just ignore bugs, missing docs and such. The foundations being set here are very strong.

The Zig developers seem to be serious about augmenting C projects. Clearly its inventor wants to replace C with Zig. However, they are taking a very pragmatic approach to it. There is Zig the programming language and Zig the C build system. The former can piggyback on the latter.

Meson

Having said that, I have been burnt (repeatedly) before by trying to do too many new things at once. Or equally, by being sidetracked so far from my primary goal that it’s no longer in sight. So after having some issues with Zig, I decided to try Meson.

This worked surprisingly nicely. I was expecting to be forced back into using CMake or simply Make. However I got things to work very easily in Meson.

project('zc-data', 'c',
  version : '0.1',
  default_options : [
      'warning_level=3',
      'b_sanitize=address,undefined',
      'b_lto=true'
  ])

incdir = include_directories('include')

add_global_arguments('-fanalyzer', language : 'c')

executable('sort_tests',
           'src/radix_sort.c',
           'src/sort_tests.c',
           install : true,
           include_directories : incdir)

executable('map_tests',
           'src/map_tests.c',
           include_directories : incdir)

executable('slab_tests',
           'src/slab_tests.c',
           include_directories : incdir)

Above is the build script. It creates a few test executables. The only unusual thing here is that I have set the defaults to include my favorite sanitizers. I also enabled GCC’s static analyzer, which found one bug at compile time.

This setup has a lot of the same benefits as Zig, which at least turns on the undefined behavior sanitizer as well. In my opinion the sanitizers should all be enabled by default. Using a modern C compiler with all these features turned off is like disabling ABS and traction control on a sports car, then driving it blindfolded, drunk as a sailor on shore leave. Compilers themselves can’t turn them on by default without causing chaos; however, build systems could.

Radix sort

Radix sort is the best kind of sort if you can allocate memory. My implementation allocates a buffer the same size as the input data. There are of course other constraints on using radix sort. However, its performance and simplicity for fixed-size keys are quite incredible.

def_attr static inline u8 radix_n(u64 key, usize n)
{
    return (key >> (n << 3)) & 0xff;
}

void radix_sort(u64 *keys, const usize keys_len)
{
    usize col_indx[256];
    u64 *keys_buf = calloc(keys_len, sizeof(u64));
    u64 *keys_l = keys, *keys_r = keys_buf;
    assert(keys_buf); /* bail loudly if allocation fails */

    const usize max_pass = sizeof(*keys);

    for (usize p = 0; p < max_pass; p++) {
        memset(col_indx, 0, sizeof(col_indx));

        for (usize i = 0; i < keys_len; i++) {
            u8 key = radix_n(keys_l[i], p);
            col_indx[key]++;
        }

        for (usize i = 0, offs = 0; i < 256; i++) {
            const usize freq = col_indx[i];
            col_indx[i] = offs;
            offs += freq;

            if (i == 255)
                assert(offs == keys_len);
        }

        for (usize i = 0; i < keys_len; i++) {
            u8 key = radix_n(keys_l[i], p);
            usize col = col_indx[key]++;
            keys_r[col] = keys_l[i];
        }

        assert(col_indx[255] == keys_len);

        /* Swap roles: the buffer just written becomes the input
         * of the next pass. */
        u64 *tmp = keys_l;
        keys_l = keys_r;
        keys_r = tmp;
    }

    /* After an odd number of passes the sorted data ends up in
     * the temporary buffer and must be copied back into keys. */
    if (keys_l == keys_buf)
        memcpy(keys, keys_buf, keys_len * sizeof(*keys));

    free(keys_buf);
}

This is the whole implementation. Even though it allocates memory, it was still an order of magnitude faster than Zig’s default sort function. Zig uses a beast of an algorithm called “block sort”, which performs the sort in place and is fairly generic. The extra constraints which block sort satisfies have a large impact on speed and complexity.

I won’t try to explain how radix sort works. I’ll just point out that my implementation counts how many keys will go in each radix and uses those figures to calculate offsets into a shared buffer. This means we don’t need a full-sized buffer for each radix. It also complicates the algorithm a bit by introducing some pointer gymnastics.

The function radix_n is preceded by a macro containing some function attributes. Let’s explore those for a minute.

Attributes

For legacy reasons C defaults to all the wrong settings. I don’t just mean compiler flags. By default variables and pointers can be written to. You have to write const in several places if something shouldn’t change. In the above code I’ve actually forgotten to add const in some places it could be added.

I now try to add const everywhere, because over a long period I have observed that it catches enough bugs to be worth the pain. Zig deals with the problem nicely by forcing all variables which are actually variable to be declared with var.

This is just one example of where C supports something nice, but it’s not the default. C compilers also support a number of extensions, including some exposed as attributes, which should be the default.

#ifndef ZCD_ATTRS_H
#  define ZCD_ATTRS_H

#  define attr(...) __attribute__((__VA_ARGS__))

#  define nonnull_attr attr(nonnull)
#  define warn_unused_attr attr(warn_unused_result)
#  define access_attr(...) attr(access(__VA_ARGS__))

#  define def_attr attr(nonnull, warn_unused_result)
#endif

GCC introduced the __attribute__((...)) syntax; however, it is now supported by Clang as well. A newer C standard (C23) also introduces new syntax for attributes. I look forward to using that in 20 years.

There are many attributes, but most of them are useless for catching bugs; they mostly deal with edge cases and allow optimizations. However, at least one is very useful: warn_unused_result.

Again it should be the default that if a function returns something, then it is a bug to discard it. Other languages force you to assign the result of a function to a dummy variable (e.g. _) if it is not needed. This at least shows you didn’t simply forget to check the result.

Hash Map

At its core the hash map is an array. The index of an element or key is calculated from its value somehow. My hash map is an array with extra meta-data inserted at intervals. It’s of the variety which deals with collisions via chaining. This seems unpopular these days. However, I think it’s possible that chaining is more cache efficient if one avoids pure linked lists.

On modern hardware cache lines are 64 bytes, meaning memory is loaded in 64-byte chunks. Because reading memory is much slower than most CPU instructions, we want to do as much work as possible on each batch of 64 bytes. At least that is the simple theory I have stuck in my head.

Therefore I have shaped the data so that everything is done on batches of 64 bytes. With 64-bit integers (or words), that is 8 units. So the main hash map vector is split into slabs of 64 bytes, with the first word used for meta-data about the next 7 map entries. When multiple entries hash to the same value, they are put in a bucket.

The buckets are split up into slabs of 64 bytes. The last word of each slab is used as a pointer to the next slab. So buckets are essentially still linked lists, but we clump items together to avoid chasing pointers. The first item of the first slab is also used to store the length of the bucket. This means the bucket items can take any value, including NULL.

One reason the main vector (or array) has meta-data is because it allows us to store pointers to items directly in the slots. In fact we can store the items themselves in the slots if they fit in 8 bytes. This makes the best case very good. That is, when one item occupies a slot, we can retrieve it with a single cache line fill.

When there is a collision we need to store a pointer to a bucket in the slot. Using the meta-data we can tell if the value in the slot is a pointer to a bucket or something else. When it is a bucket, then we need at least two cache line fills to retrieve the item. Up to 6 items can collide before this increases to 3 fills, when we must load another slab.

Much of the time we will need to load the full value of a key (or item) to compare it to the key we are looking for. At least when there is a full hash collision. For now though, I’m just ignoring that.

Instead of putting the meta-data in each slot, it is clumped together in one word at the start of each hash slab.

enum hash_slot_kind {
    hash_slot_nil,
    hash_slot_imm,
    hash_slot_ptr
} attr(packed);
ASSERT_SIZE(enum hash_slot_kind, 1);

union hash_slot64 {
    u64 imm_val;
    struct slab64 *bucket;
};
ASSERT_SIZE(union hash_slot64, 8);

struct hash_slot_kinds64 {
    u8 error_bits;
    enum hash_slot_kind kinds[7];
};
ASSERT_SIZE(struct hash_slot_kinds64, 8);

struct hash_slab64 {
    struct hash_slot_kinds64 slot;
    union hash_slot64 slots[7];
};
ASSERT_SIZE(struct hash_slab64, 64);

Above you can see the meta-data in struct hash_slot_kinds64. Putting all the meta-data at the beginning makes alignment easier. Otherwise we would have one byte of meta-data then 8 bytes of data in each slot, which would most likely get padded or else cause unaligned memory accesses.

Presently we don’t need one byte of meta-data per entry. error_bits is unused and enum hash_slot_kind can be represented by 3 bits. This means there is room for tighter packing or more meta-data.

Inserting the meta-data into the data vector does make calculating which slab an index belongs to a little complicated. I have to admit that I got a little confused and was trying to divide by 8. This was perhaps wishful thinking, because 8 is a power of 2 and dividing by a power of two can be done with a fast right shift. However, it’s division by 7 that is needed.

const usize offset_indx = i + (i/7 + 1);
const usize slab_indx = offset_indx >> 3;
const usize slot_indx = (offset_indx & 0x7) - 1;

Every 7 items we have to add one to the index to skip the meta-data. Relatively speaking though, loading and storing items is simple. At least the functions are quite short.

attr(nonnull, warn_unused_result)
static inline enum hash_slot_kind hash_map64_load(const struct hash_map64 *const self,
                          const usize i,
                          u64 *const key,
                          hash_map64_cmp_fn cmp)
{
    const usize offset_indx = i + (i/7 + 1);
    const usize slab_indx = offset_indx >> 3;
    const usize slot_indx = (offset_indx & 0x7) - 1;
    const struct hash_slab64 *const slab = self->slabs + slab_indx;
    const enum hash_slot_kind *const kind = slab->slot.kinds + slot_indx;
    const union hash_slot64 *const slot = slab->slots + slot_indx;

    zcd_assert(slab_indx < self->slab_nr, "%zu < %zu", slab_indx, self->slab_nr);
    zcd_assert(slot_indx < 7, "%zu < 7", slot_indx);

    switch(*kind) {
    case hash_slot_imm: {
        if (cmp(*key, slot->imm_val))
            return hash_slot_nil;

        *key = slot->imm_val;
        return hash_slot_imm;
    }
    case hash_slot_ptr: {
        const u64 *bckt_slot;

        for_each_slab64_u64(slot->bucket, bckt_slot) {
            if (cmp(*key, *bckt_slot))
                continue;

            *key = *bckt_slot;
            return hash_slot_ptr;
        }

        return hash_slot_nil;
    }
    case hash_slot_nil:
        return hash_slot_nil;
    default:
        zcd_assert(0, "unreachable");
    }
}

Note that i should already be the index as selected by some hashing function. I also implemented a basic hash function which you can see in the full code.

Radix or Digital Trees

Linux has a data structure called the xarray, which can be used like an array or hash map. Apparently this is based on the radix tree. While looking at these data structures I was led to discover “Judy arrays”: a cache-efficient, compact digital tree.

Dealing with the issues of hashing makes these “hybrid” tree structures look appealing. However, they are far more involved than creating a simple radix tree or hash map. That is where I would look next, though.