## One Dimensional Reversible Automata

I created a demo for the GFXPrim library. It implements and displays a nearest-neighbor, one-dimensional, binary cell automata. Additionally it implements a reversible automata, which is almost identical except for a small change to make it reversible. The automata is displayed over time in two dimensions, time travels from top to bottom. Although in the reversible case time could be played backwards.

The automata works as follows:

- Each cell has a state, which is on or off, black or white, boolean etc.
- At each time step, the state of a cell in the next step is chosen by a rule.
- The rule looks at a cell’s current value and the values of its left and right neighbors.
- There are 2
^{3}= 8 possible state combinations (patterns) for 3 binary cells. - A rule states which patterns result in a black cell in the next time step.
- There are 2
^{8}= 256 possible rules. That is, 256 unique combinations of patterns.

So a pattern is a 3 digit binary number, where each digit corresponds to a cell. The middle digit is the center cell, the high order bit the left cell, the low order bit the right cell. A rule can be display by showing a row of patterns and a row of next states.

111 | 110 | 101 | 100 | 011 | 010 | 001 | 000 |
---|---|---|---|---|---|---|---|

0 | 1 | 1 | 0 | 1 | 1 | 1 | 0 |

Above is rule `110`

, `0x6e`

or
`01101110`

. It essentially says to match patterns
`110`

, `101`

, `011`

, `010`

,
`001`

. Where a pattern match results in the cell being set to
1 at the next time step. If no pattern is matched or equivalently, an
inactive pattern is matched, then the cell will be set to 0.

Again note that each pattern resembles a 3bit binary number. Also the values of the active patterns resemble an 8bit binary number. We can use this to perform efficient matching of the patterns using binary operations.

Let’s assume our CPU natively operates on 64bit integers (called
*words*). We can pack a 64 cell automata into a single 64bit
integer. Each bit corresponds to a cell. If a bit is `1`

then
it is a black cell and `0`

for white. In this case we are
using integers as bit fields. We don’t care about the integer number the
bits can represent.

The CPU can perform bitwise operations on all 64bits in parallel and
without branching. This means we can perform a single operation 64 times
in parallel.^{1}

If we *rotate* (wrapped `>>`

^{2})
all bits to the right by one, then we get a new integer where the left
neighbor of a bit is now in its position. Likewise if we shift all bits
to the left, then we get an integer representing the right neighbors.
This gives us 3 integers where the left, center and right bits are in
the same position. For example, using only 8bits:

left: | 0100 1011 | `>>` |

center: | 1001 0110 | |

right: | 0010 1101 | `<<` |

Each pattern can be represented as a 3bit number, plus a 4th bit to
say whether it is active in a given rule. As we want to operate on all
64bits at once in the left, right and center bit fields. We can generate
64bit long *masks* from the value of each bit in a given
pattern.

So if we have a pattern where the left cell should be one, then we
can create a 64bit mask of *all* ones. If it should be zero, then
all zeroes. Likewise for the center and right cells. The masks can be
*xor’ed*^{3} (`^`

) with the
corresponding cell fields to show if no match occurred. That is, if the
pattern is one and the cell is zero or the cell is one and the pattern
is zero. We can invert this (`~`

) to give one when a match
occurs.

To see whether all components (left, right, center) of a pattern
matches we can bitwise *and* (`&`

) them together.
We can then bitwise *or*^{4} (`|`

) the
result of the pattern matches together to produce the final values.

If we wish to operate on an automata larger than 64 cells, then we can combine multiple integers into an array. After performing the left and right shifts, we get the high or low bit from the next or previous integers in the array. Then set the low and high bits of the right and left bit fields. In other words we chain them together using the end bits of the left and right bit fields.

For illustration purposes, below is the *kernel* of the the
automata algorithm.

```
/* If bit n is 1 then make all bits 1 otherwise 0 */
#define BIT_TO_MAX(b, n) (((b >> n) & 1) * ~0UL)
/* Numeric representation of the current update rule */
static uint8_t rule = 110;
/* Apply the current rule to a 64bit segment of a row */
static inline uint64_t ca1d_rule_apply(uint64_t c_prev,
uint64_t c,
uint64_t c_next,
uint64_t c_prev_step)
{
int i;
/* These are wrapping shifts when c_prev == c or c_next == c */
uint64_t l = (c >> 1) ^ (c_prev << 63);
uint64_t r = (c << 1) ^ (c_next >> 63);
uint64_t c_next_step = 0;
for (i = 0; i < 8; i++) {
uint64_t active = BIT_TO_MAX(rule, i);
uint64_t left = BIT_TO_MAX(i, 2);
uint64_t center = BIT_TO_MAX(i, 1);
uint64_t right = BIT_TO_MAX(i, 0);
|=
c_next_step & ~(left ^ l) & ~(center ^ c) & ~(right ^ r);
active }
/* The automata becomes reversible when we use c_prev_state */
return c_next_step ^ c_prev_step;
}
```

To make the automata “reversible” an extra step can be added. We look
at a cell’s previous (in addition to the current, left and right) and if
it was one then *invert* the next value. This is equivalent to
xor’ring the previous value with the next.

It is not entirely clear to me what the mathematical implications are of being reversible. However it is important to physics and makes some really cool patterns which mimic nature. Also entropy and the second rule of themodynamics, yada, yada…

The automata definition is taken from Stephen Wolfram’s “A new kind
of science”. He proposes at least one *obvious*^{5} C
implementation using arrays of cells. He also provides a table of binary
expressions for each rule. E.g. rule 90 reduces to just the
`l^r`

binary expression. It may be possible for the compiler
to automatically reduce my implementation to these minimal
expressions.

To see why, let’s consider rule 90 for each pattern.

111 | 110 | 101 | 100 | 011 | 010 | 001 | 000 |
---|---|---|---|---|---|---|---|

0 | 1 | 0 | 1 | 1 | 0 | 1 | 0 |

01011010 = 90

First for pattern `000`

.

```
& ~(left ^ l) & ~(center ^ c) & ~(right ^ r);
active = 0 & ~(0 ^ l) & ~(0 ^ c) & ~(0 ^ r);
= 0.`
```

Active is zero so the whole line reduces to zero. Now for
`001`

. Note that `1`

here actually means
`~0UL`

, that is 64bit integer max.

```
1 & ~(0 ^ l) & ~(0 ^ c) & ~(1 ^ r);
= ~l & ~c & r.
```

As expected pattern `001`

matches
`l=0, c=0, r=1`

. Let’s just list the remaining patterns or’ed
together in their reduced state. Then reduce that further. Note that the
`for`

loop in `ca1d_rule_apply`

will be
*unrolled* by the compiler when optimising for performance. It’s
also quite clear that `c_next_step`

is dependant on an
expression from the previous iteration or zero. So all the pattern match
results will get or’ed together.

```
& c & ~r | l & ~c & ~r | ~l & c & r | ~l & ~c & r;
l = l & ~r | ~l & r;
= l ^ r.
```

See on the top row that
`(l & c & ~r | l & ~c & ~r)`

or’s together
`c`

and not `c`

. So we can remove it. Then we get
an expression equivalent to xor’ring `l`

and
`r`

.

In theory at least, the compiler can see that `rule`

only
has 256 values and create a reduced version of
`ca1d_rule_apply`

for each value. Whether it actually does is
not of much practical concern when the rendering code is the bottle
neck. However it’s interesting to see if the compiler can deduce the
best solution or whether anything trips it up.

Judging from the disassembly from
`gcc -O3 -mcpu=native -mtune=native`

, it may actually do
this. Additionally it *vectorizes* the code packing four 64bit
ints at a time into 256bit registers and operating on those. I don’t
know which part of the code it is vectorising or how. It’s possible that
what I think is the rule being reduced is something related to
vectorisation.

To render the automata we take the approach of iterating over each pixel in the image. We calculate which cell the pixel falls inside and set the color of the pixel to that of the cell. That’s it.

```
/* Draws a pixel */
static inline void shade_pixel(gp_pixmap *p, gp_coord x, gp_coord y,
, gp_pixel fg)
gp_pixel bg{
;
gp_pixel pxsize_t i = (x * (64 * width)) / p->w;
size_t j = (y * height) / p->h;
size_t k = 63 - (i & 63);
uint64_t c = steps[gp_matrix_idx(width, j, i >> 6)];
= BIT_TO_MAX(c, k);
c = (fg & c) | (bg & ~c);
px
(p, x, y, px);
gp_putpixel_raw}
```

GFXPrim makes drawing very simple. The above code is fast enough for my purpsoses, but a significant improvement can be had. Integer division is much slower than floating point multiplication on most newer CPUs. It’s actually much faster (2x at least) on my CPU to calculate a pair of ratios in floating point, then convert them back to integers.

However, you may ask why we are even drawing on the CPU in the first
place? This is because GFXPrim targets embedded systems with no graphics
processor. Additionally the CPU may not even support floating point
natively. So integer division may actually be faster in this case. Still
better would be to limit the size of the pixmap to be 2^{x} larger than the
dimensions of the automata, where *x* ∈ ℕ then we can use shifts instead
of division.

In fact on my CPU it is 256 cells at a time. As AVX can be used to operate on 4 64bit words at a time.↩︎

We perform a rotating shift which moves the end bit to the start. This causes the automata to wrap around.↩︎

Combined with exclusive

*or*.↩︎Or we can use xor as the patterns are mutually exclusive, so only one may match at a time for each bit.↩︎

That’s me being an arse not Wolfram. Of course whenever “obvious” is used in these contexts it’s never really correct.↩︎