Reproducers make the best (Linux kernel) tests

Originally posted on Medium

After several years of writing Linux kernel tests, my conclusion is that reproducers of existing bugs find the most new bugs.

Possibly fuzzers find more bugs than anything else. This does not contradict my hypothesis, because fuzzers generate reproducers. Syzkaller, at least, can generate a C based reproducer or replay its activity in some other way.

With some difficulty the C based reproducers can be run independently. In a number of cases we have manually converted them to Linux Test Project (LTP) test cases, which are much better behaved. The LTP is relatively easy to install and run.
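
To give a feel for what such a conversion targets, below is a minimal sketch of an LTP test skeleton using the project's current C API; the converted reproducer's body, and any extra fields a particular test needs, are placeholders here.

    #include "tst_test.h"

    static void run(void)
    {
            /* Body of the converted reproducer goes here. */
            tst_res(TPASS, "Reproducer ran without crashing the kernel");
    }

    static struct tst_test test = {
            .test_all = run,
    };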

Of course fuzzers are not the only source of reproducers. Even in this day and age of super fancy artificial intelligence, many reproducers are created entirely by hand, often as the result of manual code review, although static analysis also plays a big role.

Usually a reproducer is created to prove a hypothesis: that a bug is real and can be practically triggered, and that it exists on a code path which may be executed in a context that matters. All bugs should be fixed, just perhaps not with the same priority.

After attempting to solve the bug, the reproducer can be used to verify the fix. Crucially for those maintaining their own kernel branches, a reproducer helps to validate backports of fixes, or the decision that a backport is not required.

A reproducer may also prevent the exact same bug from being reintroduced. Regressions of this type do happen, especially during rewrites and refactoring of complicated code.

This appears to be relatively well understood. What may be surprising to the reader is my following assertion: reproducers of a specific bug, tailored precisely to trigger that bug, often find the most new bugs.

More so if the reproducer can be trivially generalised. For example, say a reproducer triggers a bug via a TCP socket. If the test can be slightly modified to work with UDP sockets, or indeed any protocol, then we say it can be trivially generalised.
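
As a rough sketch of what trivial generalisation can look like (the loop and helper below are my own illustration, not taken from any particular LTP test), the reproducer's socket type can simply be parameterised:

    #include <stdio.h>
    #include <unistd.h>
    #include <sys/socket.h>
    #include <netinet/in.h>

    /* Stand-in for the original reproducer's socket operations. */
    static void trigger_bug(int sock)
    {
            (void)sock;
    }

    int main(void)
    {
            /* Run the same body over several socket types instead of
             * hard-coding SOCK_STREAM (TCP). */
            static const int types[] = { SOCK_STREAM, SOCK_DGRAM };

            for (unsigned int i = 0; i < sizeof(types) / sizeof(types[0]); i++) {
                    int sock = socket(AF_INET, types[i], 0);

                    if (sock < 0) {
                            perror("socket");
                            continue;
                    }

                    trigger_bug(sock);
                    close(sock);
            }

            return 0;
    }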

To be clear, generalisation is not required. In comparison to other hand-written tests, reproducers find the most new bugs, including bugs that are unrelated to the original.

What is required is complication or complexity. The software under test and the reproducer need to be non-trivial. If they are both trivial, then there is not much room for the unknown.

It has often been observed that bugs cluster: if you find one bug, the probability of finding another in the “adjacent area” increases. This is true for both insects and software defects, and it is usually true regardless of how one measures the proximity of bugs.

There are many convincing explanations for why bugs cluster, and it is easy to get sucked into them. It pays to keep an open mind as to what the causes might be. In fact, it often pays to temporarily suspend any thoughts about causation.

If we think about the underlying causes of a bug, then we may be tempted to skip reproducing it reliably, because we may think up a way of preventing that type of bug. Likewise, if we think we know why bugs cluster, we may decide to skip bug reproduction, preferring instead to spend time on process changes, instrumentation, adjacent code review, exploratory testing or static analysis.

Our thoughts about causation may even be correct. However, we may miss a far more profitable line of inquiry. Sometimes a bug is found which exists in largely bug-free code; the reason it was not found previously is that an unusual context is required to exercise it.

The code required to create the context in question may be very buggy, simply because it is neglected, changes very often, or for any other reason we may think of. Creating a reproducer therefore increases coverage in a very profitable direction.

This is merely an example to aid understanding. The general principle is that in complicated software there are always unknowns. A clear, convincing and correct root cause analysis may lead to a lower payoff when compared to assuming ignorance and testing via an exact reproduction of the bug.

Root cause analysis and speculating on how to prevent similar bugs is essential. The point is that the process of creating a reproducer, including making it reliable, reveals new information. Running it reveals yet more information.

The full value of a reproducer is exposed over time and “space”. It must be run across many configurations and versions. Put another way, it needs multiple iterations of the test matrix.

This means making it reliable and portable, which is where things usually go wrong. The extra effort required is not invested, understandably, because the extra effort is often great and the payoff unclear.

An immense amount of time has gone into making the Linux Test Project “portable” across multiple kernel versions and userlands. Many of the tests would be suitable for BSD and other operating systems as well. Some work has been done on that, but alas not enough.

It is one thing to create a reproducer that works on one machine, with one kernel version, some of the time. Entirely another to make it work on any suitable configuration in a timely and reliable manner.

This is especially true when a bug is the result of a data race, or when it requires a particular outcome to one or more data races (also known as race conditions). For the LTP we have put huge effort into reproducing these reliably.
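
Much of that effort went into the LTP's fuzzy sync library, which repeatedly races two operations against each other while adjusting their relative timing. Below is a rough sketch of the usual shape of such a test; the racing operations themselves are placeholders.

    #include "tst_test.h"
    #include "tst_fuzzy_sync.h"

    static struct tst_fzsync_pair fzsync_pair;

    static void setup(void)
    {
            tst_fzsync_pair_init(&fzsync_pair);
    }

    static void *thread_b(void *arg)
    {
            while (tst_fzsync_run_b(&fzsync_pair)) {
                    tst_fzsync_start_race_b(&fzsync_pair);
                    /* One half of the racing operations goes here. */
                    tst_fzsync_end_race_b(&fzsync_pair);
            }

            return arg;
    }

    static void run(void)
    {
            tst_fzsync_pair_reset(&fzsync_pair, thread_b);

            while (tst_fzsync_run_a(&fzsync_pair)) {
                    tst_fzsync_start_race_a(&fzsync_pair);
                    /* The other half of the racing operations goes here. */
                    tst_fzsync_end_race_a(&fzsync_pair);
            }

            tst_res(TPASS, "Did not crash while racing the two operations");
    }

    static void cleanup(void)
    {
            tst_fzsync_pair_cleanup(&fzsync_pair);
    }

    static struct tst_test test = {
            .test_all = run,
            .setup = setup,
            .cleanup = cleanup,
    };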

Regardless of race involvement, reproducer code is often fragile, relying on particulars of a given configuration. At least the initial reproducer code, as produced by a fuzzer or a human being, is fragile.

Making it compile and run across multiple kernel versions, architectures and compilers requires tweaks and vendoring in odd bits of libraries. Ensuring reproduction of the bug may require some level of generalisation.

For example, the reproducer may rely on particular internal memory offsets which change between configurations. These may be practically unknown, possibly requiring some complicated probing to figure out exactly, so we may need to try a range of read or write lengths at runtime.
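
A minimal sketch of that idea follows, with trigger_overflow() as a hypothetical stand-in for whatever operation the reproducer actually performs and the range of lengths chosen arbitrarily:

    #include <stddef.h>

    /* Hypothetical stand-in for the reproducer's overflowing operation;
     * in practice this could be a write(), sendmsg() or ioctl() with a
     * caller-controlled length. */
    static void trigger_overflow(size_t len)
    {
            (void)len;
    }

    int main(void)
    {
            /* Instead of hard-coding the one length that worked on the
             * original configuration, sweep a range so the bug can still
             * be hit when internal offsets differ. */
            for (size_t len = 1; len <= 4096; len++)
                    trigger_overflow(len);

            return 0;
    }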

I suspect that tests which are incapable of reproducing the original bug, were that bug reintroduced to a newer kernel, are still more reliable at finding new bugs than purely speculative tests. Hence my assertion that generalisation is not required for finding new bugs.

However, for those who are testing backports, there is an obvious need to generalise in this case.

This all comes from my anecdotal experience with the Linux kernel and low-level user space libraries. With other software it can be simpler. The kernel suffers from combinatorial explosions on all fronts, including over time, as the pace of development is so fast. This makes full coverage of the whole test matrix impossible.

Therefore we are poking around in the dark to some extent. We need to use probabilistic methods to identify the most urgent areas for testing. The observation that bugs cluster is then very important.

The following aphorism is another way to look at it: to avoid writing pointless tests, write reproducers and then generalise them.

Ironically, this suggests following an “evidence based” approach, yet I have provided no evidence. I encourage developers to mention specific test names in fix commits. Even better, submit patches adding commit or CVE tags to LTP tests. With data such as this we can perhaps validate my claims.
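
The LTP's test struct already supports such tags; the fragment below shows the form they take, with a made-up commit hash and CVE number as placeholders (only the relevant part of a test definition is shown).

    static struct tst_test test = {
            .test_all = run,
            .tags = (const struct tst_tag[]) {
                    {"linux-git", "0123456789ab"},  /* placeholder fix commit */
                    {"CVE", "2099-12345"},          /* placeholder CVE number */
                    {}
            },
    };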

On a project less nebulous than the Linux kernel, it is easier to test the hypothesis. Simply tag tests which fail for legitimate reasons, then collect and aggregate the tags. You will see which tests have historically been the most effective.

Of course this is not a scientific experiment. The results are open to interpretation and may not hold any more value than gut feeling. Nonetheless the data would be interesting.

Finally, consider the following: the thing under test is testing its testers.