Bitbanging 1D Reversible Automata

One Dimensional Reversible Automata

I created a demo for the GFXPrim library. It implements and displays a nearest-neighbor, one-dimensional, binary cell automata. Additionally it implements a reversible automata, which is almost identical except for a small change to make it reversible. The automata is displayed over time in two dimensions, time travels from top to bottom. Although in the reversible case time could be played backwards.

The automata works as follows:

Each cell has a state, which is on or off, black or white, boolean etc.
At each time step, the state of a cell in the next step is chosen by a rule.
The rule looks at a cell’s current value and the values of its left and right neighbors.
There are 2³ = 8 possible state combinations (patterns) for 3 binary cells.
A rule states which patterns result in a black cell in the next time step.
There are 2⁸ = 256 possible rules. That is, 256 unique combinations of patterns.

So a pattern is a 3 digit binary number, where each digit corresponds to a cell. The middle digit is the center cell, the high order bit the left cell, the low order bit the right cell. A rule can be display by showing a row of patterns and a row of next states.

111	110	101	100	011	010	001	000
0	1	1	0	1	1	1	0

Above is rule 110, 0x6e or 01101110. It essentially says to match patterns 110, 101, 011, 010, 001. Where a pattern match results in the cell being set to 1 at the next time step. If no pattern is matched or equivalently, an inactive pattern is matched, then the cell will be set to 0.

Again note that each pattern resembles a 3bit binary number. Also the values of the active patterns resemble an 8bit binary number. We can use this to perform efficient matching of the patterns using binary operations.

Let’s assume our CPU natively operates on 64bit integers (called words). We can pack a 64 cell automata into a single 64bit integer. Each bit corresponds to a cell. If a bit is 1 then it is a black cell and 0 for white. In this case we are using integers as bit fields. We don’t care about the integer number the bits can represent.

The CPU can perform bitwise operations on all 64bits in parallel and without branching. This means we can perform a single operation 64 times in parallel.¹

If we rotate (wrapped >>²) all bits to the right by one, then we get a new integer where the left neighbor of a bit is now in its position. Likewise if we shift all bits to the left, then we get an integer representing the right neighbors. This gives us 3 integers where the left, center and right bits are in the same position. For example, using only 8bits:

left:	0100 1011	`>>`
center:	1001 0110
right:	0010 1101	`<<`

Each pattern can be represented as a 3bit number, plus a 4th bit to say whether it is active in a given rule. As we want to operate on all 64bits at once in the left, right and center bit fields. We can generate 64bit long masks from the value of each bit in a given pattern.

So if we have a pattern where the left cell should be one, then we can create a 64bit mask of all ones. If it should be zero, then all zeroes. Likewise for the center and right cells. The masks can be xor’ed³ (^) with the corresponding cell fields to show if no match occurred. That is, if the pattern is one and the cell is zero or the cell is one and the pattern is zero. We can invert this (~) to give one when a match occurs.

To see whether all components (left, right, center) of a pattern matches we can bitwise and (&) them together. We can then bitwise or⁴ (|) the result of the pattern matches together to produce the final values.

If we wish to operate on an automata larger than 64 cells, then we can combine multiple integers into an array. After performing the left and right shifts, we get the high or low bit from the next or previous integers in the array. Then set the low and high bits of the right and left bit fields. In other words we chain them together using the end bits of the left and right bit fields.

For illustration purposes, below is the kernel of the the automata algorithm.

/* If bit n is 1 then make all bits 1 otherwise 0 */
#define BIT_TO_MAX(b, n) (((b >> n) & 1) * ~0UL)

/* Numeric representation of the current update rule */
static uint8_t rule = 110;

/* Apply the current rule to a 64bit segment of a row */
static inline uint64_t ca1d_rule_apply(uint64_t c_prev,
                                       uint64_t c,
                                       uint64_t c_next,
                                       uint64_t c_prev_step)
{
    int i;
    /* These are wrapping shifts when c_prev == c or c_next == c */
    uint64_t l = (c >> 1) ^ (c_prev << 63);
    uint64_t r = (c << 1) ^ (c_next >> 63);
    uint64_t c_next_step = 0;

    for (i = 0; i < 8; i++) {
        uint64_t active = BIT_TO_MAX(rule, i);
        uint64_t left   = BIT_TO_MAX(i, 2);
        uint64_t center = BIT_TO_MAX(i, 1);
        uint64_t right  = BIT_TO_MAX(i, 0);

        c_next_step |=
            active & ~(left ^ l) & ~(center ^ c) & ~(right ^ r);
    }

    /* The automata becomes reversible when we use c_prev_state */
    return c_next_step ^ c_prev_step;
}

To make the automata “reversible” an extra step can be added. We look at a cell’s previous (in addition to the current, left and right) and if it was one then invert the next value. This is equivalent to xor’ring the previous value with the next.

It is not entirely clear to me what the mathematical implications are of being reversible. However it is important to physics and makes some really cool patterns which mimic nature. Also entropy and the second rule of themodynamics, yada, yada…

The automata definition is taken from Stephen Wolfram’s “A new kind of science”. He proposes at least one obvious⁵ C implementation using arrays of cells. He also provides a table of binary expressions for each rule. E.g. rule 90 reduces to just the l^r binary expression. It may be possible for the compiler to automatically reduce my implementation to these minimal expressions.

To see why, let’s consider rule 90 for each pattern.

111	110	101	100	011	010	001	000
0	1	0	1	1	0	1	0

01011010 = 90

First for pattern 000.

  active & ~(left ^ l) & ~(center ^ c) & ~(right ^ r);
= 0 & ~(0 ^ l) & ~(0 ^ c) & ~(0 ^ r);
= 0.`

Active is zero so the whole line reduces to zero. Now for 001. Note that 1 here actually means ~0UL, that is 64bit integer max.

   1 & ~(0 ^ l) & ~(0 ^ c) & ~(1 ^ r);
= ~l & ~c & r.

As expected pattern 001 matches l=0, c=0, r=1. Let’s just list the remaining patterns or’ed together in their reduced state. Then reduce that further. Note that the for loop in ca1d_rule_apply will be unrolled by the compiler when optimising for performance. It’s also quite clear that c_next_step is dependant on an expression from the previous iteration or zero. So all the pattern match results will get or’ed together.

  l & c & ~r | l & ~c & ~r | ~l & c & r | ~l & ~c & r;
= l & ~r | ~l & r;
= l ^ r.

See on the top row that (l & c & ~r | l & ~c & ~r) or’s together c and not c. So we can remove it. Then we get an expression equivalent to xor’ring l and r.

In theory at least, the compiler can see that rule only has 256 values and create a reduced version of ca1d_rule_apply for each value. Whether it actually does is not of much practical concern when the rendering code is the bottle neck. However it’s interesting to see if the compiler can deduce the best solution or whether anything trips it up.

Judging from the disassembly from gcc -O3 -mcpu=native -mtune=native, it may actually do this. Additionally it vectorizes the code packing four 64bit ints at a time into 256bit registers and operating on those. I don’t know which part of the code it is vectorising or how. It’s possible that what I think is the rule being reduced is something related to vectorisation.

To render the automata we take the approach of iterating over each pixel in the image. We calculate which cell the pixel falls inside and set the color of the pixel to that of the cell. That’s it.

/* Draws a pixel */
static inline void shade_pixel(gp_pixmap *p, gp_coord x, gp_coord y,
                               gp_pixel bg, gp_pixel fg)
{
    gp_pixel px;
    size_t i = (x * (64 * width)) / p->w;
    size_t j = (y * height) / p->h;
    size_t k = 63 - (i & 63);
    uint64_t c = steps[gp_matrix_idx(width, j, i >> 6)];

    c = BIT_TO_MAX(c, k);
    px = (fg & c) | (bg & ~c);

    gp_putpixel_raw(p, x, y, px);
}

GFXPrim makes drawing very simple. The above code is fast enough for my purpsoses, but a significant improvement can be had. Integer division is much slower than floating point multiplication on most newer CPUs. It’s actually much faster (2x at least) on my CPU to calculate a pair of ratios in floating point, then convert them back to integers.

However, you may ask why we are even drawing on the CPU in the first place? This is because GFXPrim targets embedded systems with no graphics processor. Additionally the CPU may not even support floating point natively. So integer division may actually be faster in this case. Still better would be to limit the size of the pixmap to be 2^x larger than the dimensions of the automata, where x ∈ ℕ then we can use shifts instead of division.

In fact on my CPU it is 256 cells at a time. As AVX can be used to operate on 4 64bit words at a time.↩︎
We perform a rotating shift which moves the end bit to the start. This causes the automata to wrap around.↩︎
Combined with exclusive or.↩︎
Or we can use xor as the patterns are mutually exclusive, so only one may match at a time for each bit.↩︎
That’s me being an arse not Wolfram. Of course whenever “obvious” is used in these contexts it’s never really correct.↩︎