richiejp logo

A review of tools for rolling your own C static analysis

So you have a large C code base. It has its own libraries, rules and context. You see the same mistakes being repeated. Not general C coding mistakes, but errors unique to your project.

They look like the kind of mistakes which could be detected at “compile time”. In fact, they may only be detectable at compile time. Put another way, we think we can find bugs in the source code without running it. This is commonly referred to as “static analysis”.

The Linux Test Project is a large and eccentric C code base. It has its own library and rules. We most definitely see the same mistakes, again and again. Despite having spent years developing the LTP, the present author still makes those mistakes.

Beyond simple regular expressions and syntactic checks. Writing static analysis tools is hard. Some languages come equipped with self analysis or reflection. Needless to say, this is not the case for C.

While we can justify a significant investment in developing checks. It can not be on the order of creating a new compiler. Much code review, feedback delays and bugs can be eliminated with checks. However the effort saved by static analysis, should not be all spent on developing static analysis.

Luckily C does have tooling available to develop semantic checks and perform various types of analysis. For the LTP we investigated a range of tools. This article will review those tools and explain our choices.

For the time being we have chosen Sparse to power our primary tool. This is not the most powerful, nor arguably the most user friendly. It is the most self-contained and small enough to vendor in.

Background

Program representation

There are many ways to represent a computer program. With different abstractions and structures. One may also partially evaluate a program with given assumptions. Theoretically there is no difference between transforming a program and running it. They are both computations.

For the sake of this review and the simple minds of quality assurance engineers, we just need to roughly identify what level each tool works on.

  1. Text
  2. Abstract Syntax Tree (AST)
  3. Linearised Intermediate Representation (IR)
  4. Exploded Graph or IR with state tracking

The first level is the raw source code. Secondly we have the AST which is fairly well known. Then we have IR and IR with state. IR is a generic term which we shall take some liberties with.

C compiler’s usually convert the AST into a collection of linear instruction blocks (basicblocks). These blocks are linked together into a graph or network (Control Flow Graph). The links (graph edges) represent function calls, jumps or branches.

The instructions within a block are sequential. Meanwhile one may go between blocks in whatever order the logic allows. Usually compilers also convert the IR into Single Static Assignment (SSA) form. Meaning each variable is only assigned to once. New variable names are generated for each additional assignment.

In this article we use IR to mean approximately the above. If you look in the compiler of a functional language you may find something utterly different. Also most compilers tend to use a different IR for assembly generation. We are not concerned with that.

IR with state is where we are given a range of possible values for each variable. This means we may also be given multiple versions of the code for each branch which sets a variable differently. Thus resulting in an “exploded” CFG.

If that makes no sense to you, then imagine being able to see multiple parallel universes. With the caveat that each universe could be further split into more exact possibilities.

For many checks, the AST is not an ideal level of abstraction. Even just figuring out if a memory location is modified from the AST can be difficult. There are lots of syntactic constructs which will cause a store instruction to be issued.

The reason compilers have “linearized” IR, is to cut out a bunch of syntactic details. Making it easier to perform optimisations and machine code generation. Sometimes the AST is the best place to do certain types of analysis, but often not.

LTP Requirements

The review of each tool is heavily influenced by the LTP’s requirements. With a different project in mind, one may come to very different conclusions.

The LTP project can not tolerate more barriers to development. We have contributors from many different organisations. All using different Linux distros. There are even some downstream forks of LTP on other operating systems.

The test API is very large, there is a lot to learn for a new contributor. Not to mention that the thing we are testing is exceedingly complicated.

There are many things we want to create checks for. However we have begun by trying to enforce the following rule:

The test author is guaranteed that the test API will not modify the
TST_ERR and TST_RET. This prevents silent errors where the return
value and errno are overwritten before the test has chance to
check them.

These global variables are used by the TEST() macro and similar. These are intended for exclusive use by the test author. However they were being written to by library code.

Possibly there is somme way to define these variables as const in library code, but not in test code. In general though we want an extensible way of performing checks. This seemed as good a place as any to start.

The goal is to enforce these checks and for contributors to run them locally. Both to save reviewer time and to save contributor time. It is much less expensive to correct mistakes before sending a patch.

If checks are not mandatory, then they tend to be forgotten about. Or it becomes one person’s job to run and maintain them. This then means the check have to be robust. We need to avoid false positives and allow checks to be disabled sometimes.

Tool overview

We took a look at the following tools.

There are more out there. These are not the only ones we found even. We just have limited time and resources.

We did not investigate any proprietary tools. It is expected that LTP developers can freely download, modify and run any mandatory development tools.

The amount of time and effort assigned to each tool was not equal. They have been listed roughly in the amount of progress made before giving up. With tools such as GCC I quickly abandoned them. This is as much due to the nature of the LTP as the tool in question.

GCC

The GCC compiler is available on practically every Linux distribution and desktop OS. It is the main compiler used to build LTP. We don’t need to worry about parsing problems and non-standard C when using GCC.

Some of the older parts of the LTP are quite disgusting. They are not compliant with any C standard, but GCC has accepted them. We of course want to remove this code, but in the meantime we have to deal with it.

GCC Analyzer

The GCC Analyzer appears to be powerful if inaccurate. It tracks program control and data flow. Including between procedures, which is often absent in analysers of this type. This means you can see an approximation of what values a variable may take at any given point in the program.

At the time we investigated it, there did not seem to be any way to extend it. So we couldn’t use it to develop LTP specific checks without forking GCC.

It did find some general errors. Such as null pointer dereferences. It also found some false positives and missed other errors.

GCC Plugins

GCC has plugins which allow one to interfere with various compilation passes. They provide access to more than one type of intermediate representation used by GCC.

The access is both read and write. So we could also create our own instrumentation, that is, insert runtime checks.

Primarily there are two reasons for discarded this option. Firstly GCC’s code base is nearly opaque. Secondly plugins appear to be version dependent. Possibly GCC’s internal representation does not change much. However even small changes would create issues when the resident compiler “expert” is not around.

This means we would have high up front cost and ongoing maintenance. This is a shame because we can always rely on LTP developers to have GCC.

Smatch

The Smatch analyser is in the same league as the GCC Analyzer and Clang Analyzer. It can do inter-procedural control flow and state tracking. At least if you can figure out how to operate it.

To get the full might of Smatch one needs to generate a database. This required quite some time and fiddling for the LTP. It was never quite clear if this was fully working. On the other hand, it found some general bugs without any false positives.

To extend Smatch we could either fork it or submit LTP specific tests upstream. It is clear how a new check is added to Smatch. It is less obvious how to construct the check logic.

Smatch now uses Sparse to parse the C AST. Otherwise it seems to be its own beast. Possibly the only analyser of this type which can find bugs in the Linux kernel without producing huge amounts of noise.

We discarded it for now because of the high level of friction. In the future we may able to swap our checks from Sparse to Smatch.

Clang

The LLVM C frontend. Clang is essentially part of LLVM. All of the tools below are in the LLVM mono repository. It appears that they all get wrapped up into LLVM releases.

While LLVM and Clang are supported by every major Linux distribution. It is often a much older version on stable releases. We also found that compiling against LLVM on multiple distributions is inconvenient.

LLVM comes with the llvm-config utility to help figure out what compiler flags are needed and such. This itself comes in multiple versions on some distributions. There isn’t necessarily a default version either.

Unsurprisingly building LLVM and Clang from source is quite time consuming. So we can not sidestep distribution package issues by vendoring it in to the LTP.

Clang can output LLVM IR. We could read this and perform checks on it. We did not see an easy way to do this. So it was not properly investigated.

Clang Plugins

Like GCC, Clang has plugins. These appear to be based on the same interface(s) as Clang Tidy and LibTooling described below.

Clang Analyzer

The Clang Analyzer is another powerful analyser capable of tracking state. It is comparable to GCC’s analyser described earlier. Less so to Smatch which is far more self-contained. Out of the analysers, Clang appears to produce the most false positives.

Unlike GCC it has an extention mechanism. It’s not clear how well supported or popular this is. It appears that analyser extensions do not get inter-procedural state information.

It was dismissed primarily for the same reasons as LibTooling and Clang Tidy. Although it provides an exploded graph instead of AST matching. The analyser appears to be accessible through Clang Tidy and LibTooling.

This is a very attractive option for those already invested in the LLVM ecosystem.

Clang Tidy

Checks developed with the clang-tidy command need to be added to the LLVM mono repository. Other project specific checks have been added to upstream. So perhaps LTP specific checks would also be accepted.

The issue for us is the time between a check being accepted into LLVM upstream and the check being available to all LTP contributors. Considering the frequency of LLVM releases and stable distribution releases. It could be years before we can demand test developers run the checker.

Demanding our contributors download and compile LLVM is not reasonable. So we can dismiss the Clang Tidy approach.

Clang LibTooling

Clang has an unstable C++ interface and a stable C interface. LibTooling represents the C++ interface.

As the C++ interface is not stable, any checks written with it will need to be adapted for each LLVM release. Although Clang and LLVM are much less opaque than GCC. We still can’t afford that kind of maintenance.

The LTP is also written in C not C++. This is only a minor point, but it does save some effort to use C throughout.

libclang

This is the stable C interface. It is a wrapper for the C++ interface. The main advantage is that functions are only added, not changed or removed.

It appears that the interface’s primary clients are text editors. Specifically to allow features like auto completion. The primary header is even called Index.h.

It does provide some access to the AST. This is done through a relatively simple and well documented API. Combined with its stability promises, we decided this was enough to give it a serious try.

Below is the code which performs the check in version three of the patch series. Code for printing errors and such has been removed.

#include <clang-c/Index.h>

/* The rules for test, library and tool code are different */
enum ltp_tu_kind {
    LTP_LIB,
    LTP_OTHER,
};

/* Holds information about the TU which we gathered on the first pass */
static struct {
    enum ltp_tu_kind tu_kind;
} tu_info;

static int cursor_cmp_spelling(const char *const spelling, CXCursor cursor)
{
    CXString cursor_spelling = clang_getCursorSpelling(cursor);
    const int ret = strcmp(spelling, clang_getCString(cursor_spelling));

    clang_disposeString(cursor_spelling);

    return ret;
}

static int cursor_type_cmp_spelling(const char *const spelling, CXCursor cursor)
{
    CXType ctype = clang_getCursorType(cursor);
    CXString ctype_spelling = clang_getTypeSpelling(ctype);
    const int ret = strcmp(spelling, clang_getCString(ctype_spelling));

    clang_disposeString(ctype_spelling);

    return ret;
}

/*
 * Check if the TEST() macro is used inside the library.
 *
 * This check takes an AST node which should already be known to be a
 * macro expansion kind.
 *
 * If the TU appears to be a test executable then the test does not
 * apply. So in that case we return.
 *
 * If the macro expansion AST node is spelled TEST, then we emit an
 * error. Otherwise do nothing.
 */
static void check_TEST_macro(CXCursor macro_cursor)
{
    if (tu_info.tu_kind != LTP_LIB)
        return;

    if (!cursor_cmp_spelling("TEST", macro_cursor)) {
        emit_check_error(macro_cursor,
               "TEST() macro should not be used in library");
    }
}

/* Recursively visit each AST node and run checks based on node kind */
static enum CXChildVisitResult check_visitor(CXCursor cursor,
                         attr_unused CXCursor parent,
                         attr_unused CXClientData client_data)
{
    CXSourceLocation loc = clang_getCursorLocation(cursor);

    if (clang_Location_isInSystemHeader(loc))
        return CXChildVisit_Continue;

    switch (clang_getCursorKind(cursor)) {
    case CXCursor_MacroExpansion:
            check_TEST_macro(cursor);
        break;
    default:
        break;
    }

    return CXChildVisit_Recurse;
}

static void collect_info_from_args(const int argc, const char *const *const argv)
{
    int i;

    for (i = 0; i < argc; i++) {
        if (!strcmp("-DLTPLIB", argv[i])) {
            tu_info.tu_kind = LTP_LIB;
        }
    }
}

int main(const int argc, const char *const *const argv)
{
    CXIndex cindex = clang_createIndex(0, 1);
    CXTranslationUnit tu;
    CXCursor tuc;
    enum CXErrorCode ret;

    tu_info.tu_kind = LTP_OTHER;
    collect_info_from_args(argc, argv);

    ret = clang_parseTranslationUnit2(
        cindex,
        /*source_filename=*/NULL,
        argv + 1, argc - 1,
        /*unsaved_files=*/NULL, /*num_unsaved_files=*/0,
        CXTranslationUnit_DetailedPreprocessingRecord,
        &tu);

    if (ret != CXError_Success) {
        emit_error("Failed to parse translation unit!");
        return 1;
    }

    tuc = clang_getTranslationUnitCursor(tu);

    clang_visitChildren(tuc, check_visitor, NULL);

    /* Stop leak sanitizer from complaining */
    clang_disposeTranslationUnit(tu);
    clang_disposeIndex(cindex);

    return error_flag;
}

The above code uses Clang to create an AST from a C file (Translation Unit; TU). Then recurses into the AST, checking the type (kind) of each node (cursor). If we find a node of a kind we can check, then we call a checking function on it.

The LTP build system passes the same flags it would pass to the compiler. In addition we add -resource-dir $(shell $(CLANG) -print-resource-dir). Because libclang can not find the compiler’s resource directory.

The resource directory contains some compiler specific headers and libraries. The clang command is able to find it automatically. The code which performs this search is not in the Clang library.

We search the arguments for -DLTPLIB which tells us if we are compiling the test library. The TEST() macro check only applies to the test library. In a previous version we looked at the code itself to decide if the the file were a test.

/* If we find `struct tst_test = {...}` then record that this TU is a test */
static void info_ltp_tu_kind(CXCursor cursor)
{
    CXCursor initializer;

    if (clang_Cursor_hasVarDeclGlobalStorage(cursor) != 1)
        return;

    if (cursor_cmp_spelling("test", cursor))
        return;

    if (cursor_type_cmp_spelling("struct tst_test", cursor))
        return;

    initializer = clang_Cursor_getVarDeclInitializer(cursor);

    if (!clang_Cursor_isNull(initializer))
        tu_info.tu_kind = LTP_TEST;
}

Apart from being more complicated, the problem here was clang_Cursor_getVarDeclInitializer. This function was only introduced in LLVM 12. Meanwhile stable Ubuntu was on LLVM 10. It’s not clear how to achieve the same thing without this function.

There is another problem with our TEST() check. The actual requirement is to ensure the variables TST_ERR and TST_RET are not written to. Determining from the AST if a variable is written to is awkward enough. In libclang’s case it seems to be impossible. The necessary information is not exposed.

The amount of friction simply integrating with libclang is probably enough for us to have dismissed it. Even if that were not the case though, there is too much stuff missing for it to be useful.

If you can use LLVM at all, it is better to use the C++ interface.

Coccinelle

Also known as the spatch command. It is described as a semantic patch tool. It implements a pattern matching language which looks somewhat like a C code “diff”.

These patterns match against the syntax, semantics and control flow of C code. Under the hood Coccinelle operates on one or more IRs of the C program. However the user is not exposed to that. We are given a quirky language which looks like a Git commit to some C code.

Apparently a Coccinelle semantic patch is compiled into Control Tree Logic (CTL) and this is matched against some representation of the C code. This is perhaps analogous to how a regular expression is compiled into an automata and the automata matches the input text.

As far as we can tell, Coccinelle does not track state automatically. It does understand control flow however. Limited state tracking can be added using Python or OCaml snippets. These may be attached at certain points in the matching process.

All in all, You can be forgiven for thinking it works by magic. The tool has multiple papers and presentations. There is quite a bit of documentation. Still it is difficult to grasp. One suspects this is due to some misconceptions and communication issues. Perhaps the notes below will help.

  1. There is no plain text or C code in a semantic patch. It all has meaning specified by the domain specific language. It looks like C code mixed with some special symbols, but it is not.

  2. Matching takes the control flow into consideration. You can specify that all branches must match. Or that one or more matches exists.

  3. You can match against the spelling of variables and other syntactic details. However it is primarily matching against the deeper structure of the program.

With these things in mind you may have more of a chance understanding the documentation.

smatch does not have helpful error messages. The implementation is also opaque to us (more on that later). So the process of writing a semantic patch is often blind trial and error, mixed with reading the docs and examples.

That said it is a truly wonderful tool. We made a lot of progress in a short time. Below is a semantic patch which both finds and (almost) fixes TEST() macro usages.

// Find and fix violations of rule LTP-002

// Set with -D fix
virtual fix

// Find all positions where TEST is _used_.
@ depends on !fix exists @
@@

* TEST(...);

// Below are rules which will create a patch to replace TEST usage
// It assumes we can use the ret var without conflicts

// Fix all references to the variables TEST modifies when they occur in a
// function where TEST was used.
@ depends on fix exists @
@@

 TEST(...)

 ...

(
- TST_RET
+ ret
|
- TST_ERR
+ errno
|
- TTERRNO
+ TERRNO
)

// Replace TEST in all functions where it occurs only at the start. It
// is slightly complicated by adding a newline if a statement appears
// on the line after TEST(). It is not clear to me what the rules are
// for matching whitespace as it has no semantic meaning, but this
// appears to work.
@ depends on fix @
identifier fn;
expression tested_expr;
statement st;
@@

  fn (...)
  {
-   TEST(tested_expr);
+   const long ret = tested_expr;
(
+
    st
|

)
    ... when != TEST(...)
  }

// Replace TEST in all functions where it occurs at the start
// Functions where it *only* occurs at the start were handled above
@ depends on fix @
identifier fn;
expression tested_expr;
statement st;
@@

  fn (...)
  {
-   TEST(tested_expr);
+   long ret = tested_expr;
(
+
    st
|

)
    ...
  }

// Add ret var at the start of a function where TEST occurs and there
// is not already a ret declaration
@ depends on fix exists @
identifier fn;
@@

  fn (...)
  {
+   long ret;
    ... when != long ret;

    TEST(...)
    ...
  }

// Replace any remaining occurrences of TEST
@ depends on fix @
expression tested_expr;
@@

-   TEST(tested_expr);
+   ret = tested_expr;

This has been merged into the LTP. However we determined that Coccinelle can not be forced upon LTP contributors. Despite the fact Coccinelle is stable and has been around for years. We ran into distribution issues. It seems that at least the Gentoo package is lacking a maintainer.

We suspect this has little to do with Coccinelle itself. The issue is that it is written in OCaml. Package maintainers struggle with OCaml projects. It is easy to see why, as our attempts to learn the basics of OCaml were fraught with issues.

For a tool as good as Coccinelle, some of us are willing to learn a new language. If it were Haskell, for example, we’d not have a problem fixing the occasional issue.

However everyones’ patience ran out with OCaml. Being functional when we are primarily working on C does not help. However the main issue is that many distributions are not maintaining the packages properly. So it is often difficult just to get the REPL and compiler running.

I personally have no opinion on whether it is a good language. I didn’t get far enough to decide that. It seems to be the case though that people are not interested in it. Meanwhile we want to get some static analysis done, not revive a struggling language.

We still merged the Coccinelle scripts into the LTP. They provide a useful example of how to automate changes with spatch. We haven’t found another option for making these kinds of changes. Using Clang Tidy is extremely laborious compared to writing a semantic patch.

Sadly it has to be dismissed as our primary checker due to the OCaml ecosystem.

Sparse

Sparse is a stand alone C parser library. It produces an AST and linearised IR consisting of basicblocks. In fact it can produce executable x86 code or LLVM IR. So it is essentially a compiler.

Unlike most C compilers however it is very simple. It is not designed to produce fast code, nor can it parse everything GCC can. It only parses C and is not concerned with C++.

Sparse itself can be compiled relatively quickly and has few dependencies. It doesn’t take long to clone it with Git either. This meant we were able to vendor it in as a Git module.

Some disapprove of vendoring and Git modules. Unfortunately the Sparse package available on most distributions is not useful to us. Sparse is linked statically and the package only contains an executable for use with the Linux kernel. There is no dynamic library. Of course someone can change that, but it would take time to propagate downstream.

The IR is relatively easy to traverse and write checks against. The documentation is maybe a little sparse. However with some knowledge about compilers, it’s not too hard to understand the code. It is written in a similar style to the kernel and LTP. Albeit with some quirks.

Below is the full checker program sans some boilerplate.

/* The rules for test, library and tool code are different */
enum ltp_tu_kind {
    LTP_LIB,
    LTP_OTHER,
};

static enum ltp_tu_kind tu_kind = LTP_OTHER;

/* Check for LTP-002
 *
 * Inspects the destination symbol of each store instruction. If it is
 * TST_RET or TST_ERR then emit a warning.
 */
static void check_lib_sets_TEST_vars(const struct instruction *insn)
{
    static struct ident *TST_RES_id, *TST_ERR_id;

    if (!TST_RES_id) {
        TST_RES_id = built_in_ident("TST_RET");
        TST_ERR_id = built_in_ident("TST_ERR");
    }

    if (insn->opcode != OP_STORE)
        return;
    if (insn->src->ident != TST_RES_id &&
        insn->src->ident != TST_ERR_id)
        return;

    warning(insn->pos,
        "LTP-002: Library should not write to TST_RET or TST_ERR");
}

static void do_basicblock_checks(struct basic_block *bb)
{
    struct instruction *insn;

    FOR_EACH_PTR(bb->insns, insn) {
        if (!bb_reachable(insn->bb))
            continue;

        if (tu_kind == LTP_LIB)
            check_lib_sets_TEST_vars(insn);
    } END_FOR_EACH_PTR(insn);
}

static void do_entrypoint_checks(struct entrypoint *ep)
{
    struct basic_block *bb;

    FOR_EACH_PTR(ep->bbs, bb) {
        do_basicblock_checks(bb);
    } END_FOR_EACH_PTR(bb);
}

/* Compile the AST into a graph of basicblocks */
static void process_symbols(struct symbol_list *list)
{
    struct symbol *sym;

    FOR_EACH_PTR(list, sym) {
        struct entrypoint *ep;

        expand_symbol(sym);
        ep = linearize_symbol(sym);
        if (!ep || !ep->entry)
            continue;

        do_entrypoint_checks(ep);

        if (dbg_entry)
            show_entry(ep);
    } END_FOR_EACH_PTR(sym);
}

static void collect_info_from_args(const int argc, char *const *const argv)
{
    int i;

    for (i = 0; i < argc; i++) {
        if (!strcmp("-DLTPLIB", argv[i]))
            tu_kind = LTP_LIB;
    }
}

int main(int argc, char **argv)
{
    struct string_list *filelist = NULL;
    char *file;

    /* ... Disable a bunch of inbuilt checks ... */

    do_output = 0;

    collect_info_from_args(argc, argv);

    process_symbols(sparse_initialize(argc, argv, &filelist));
    FOR_EACH_PTR(filelist, file) {
        process_symbols(sparse(file));
    } END_FOR_EACH_PTR(file);

    report_stats();
    return 0;
}

Unlike the Clang and Coccinelle checks, this actually checks the variables themselves. We traverse the IR and look for writes to them. This will catch some additional cases where we write to the variables without using the macros.

It may be possible to fool it somehow. It does have the issue that library header files are considered part of the test code. We have only just begun to use Sparse so this will likely be modified over time.

For now we don’t try to do anything with the AST, we just look at the IR. Unlike Clang, Sparse does not save information about macro expansions. They do not show up as nodes in the AST. It appears that preprocessing is performed without saving any details. We may need to change this.

Sparse has many built-in checks and warnings. We have disabled most of them for now. In some cases they are kernel specific. In other cases they have been adopted by GCC and Clang which produce prettier warnings. Mostly though we just need to clean up the LTP, then we can enable them.

Sparse also introduces some attributes (e.g. __attribute__(address_space(name))) which may be useful or not. Attributes are a way of extending C which does not interfere with compilers that do not support them. The kernel uses them to prevent functions and variables being used in certain ways.

Conclusion

Going forwards we will continue to develop Sparse as our main tool. We may still need to abandon it. Perhaps the checks we really want will be too difficult. Personally, I will also continue to use Coccinelle, especially for “evolutionary development”.

There is a huge amount of great software here. Which took a lot of hard work by smart people. As usual with open source it is rough around the edges. In the end we chose the solution which we are mostly likely able to fix ourselves. Also the solution least likely to need fixing once implemented.

Depending on how things progress, I will be back to write about using Sparse and Coccinelle. Please send any suggestions, praise or insults via the contact details below.