Use Benchmarks to Assess Static Analysis Tools

By Paul Anderson

VP of Engineering

GrammaTech

December 22, 2015

Researchers from Toyota recently published a paper entitled “Test Suites for Benchmarks of Static Analysis Tools” at the 26th IEEE International Symposium on Software Reliability Engineering (ISSRE). This was a follow-up paper to one the same team had published in 2014 called “Quantitative Evaluation of Static Analysis Tools,” which has since been withdrawn for IP reasons.

In this new paper, the authors use the same (although slightly restructured) benchmark suite as before (available on John Regehr’s blog) to measure how well the tools find defects. CodeSonar was one of the three top-performing tools used in the study.

I generally like studies such as this, mostly because CodeSonar always does very well. It’s designed for safety-critical systems and therefore typically finds the most software defects. The Toyota studies are no exception. CodeSonar performs measurably better overall than both of the other tools. As far as I am aware, CodeSonar has placed first (or tied for first) in every head-to-head comparison of this kind.

In the recent Toyota study, however, there was one category of bugs—Stack-Related Defects—in which CodeSonar scored zero. This didn’t seem right, so I decided to look more closely. What I found illustrates some of the disadvantages of these kinds of benchmark suites, reinforcing something I have been saying for years—that benchmarks can be misleading, and you should never make a decision on which tool to deploy based on benchmark results alone.

In the stack-related defects category, I looked into one of the warning classes, Stack Overflow. The examples are all similar: an extremely large buffer is allocated on the stack. Here’s an example:

void st_overflow_003_func_001 (st_overflow_003_s_001 s)
{
        char buf[524288]; /* 512 Kbytes */ /*Tool should detect this line as error*/
                          /*ERROR:Stack overflow*/
        s.buf[0] = 1;
        buf[0] = 1;
        sink = buf[idx];
}

A buffer of that size is likely to exceed the stack capacity of most machines, so it’s reasonable to expect a static analysis tool to report it as a potential problem.

But CodeSonar can find stack overflows of this kind. It lets you specify the maximum stack size, and it finds call paths where that capacity is exceeded. Why weren’t these examples detected? The answer is simple: that particular checker isn’t enabled in CodeSonar by default, so the paper’s authors must not have noticed that they had to explicitly enable it. Once I turned it on, CodeSonar detected the overflow as expected.
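
To make the idea of checking call paths against a configured stack limit concrete, here is a minimal sketch in C. It is only an illustration of the general technique, not CodeSonar's actual algorithm: the `func_info` table, the function names `worst_case_stack` and `exceeds_stack_limit`, and every frame size other than the 524288-byte buffer from the benchmark are made up for the example. It also assumes the call graph has no recursion.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CALLEES 4

/* Hypothetical per-function summary: approximate frame size plus callees. */
struct func_info {
    const char *name;
    size_t frame_bytes;        /* stack used by this function's own frame */
    int callees[MAX_CALLEES];  /* indices into the table; -1 ends the list */
};

/* Worst-case stack use of any call path starting at table[idx].
 * Assumes the call graph is acyclic (no recursion). */
static size_t worst_case_stack(const struct func_info *table, int idx)
{
    size_t deepest_callee = 0;
    for (int i = 0; i < MAX_CALLEES && table[idx].callees[i] >= 0; i++) {
        size_t d = worst_case_stack(table, table[idx].callees[i]);
        if (d > deepest_callee)
            deepest_callee = d;
    }
    return table[idx].frame_bytes + deepest_callee;
}

/* Returns 1 if some call path from `entry` needs more than `limit` bytes. */
static int exceeds_stack_limit(const struct func_info *table, int entry,
                               size_t limit)
{
    assert(table != NULL && entry >= 0);
    return worst_case_stack(table, entry) > limit;
}

/* Illustrative table: main calls the benchmark function and a small helper.
 * Only the 524288-byte figure comes from the benchmark; the rest is invented. */
static const struct func_info example_table[] = {
    { "main",                     64,     {  1,  2, -1, -1 } },
    { "st_overflow_003_func_001", 524288, { -1, -1, -1, -1 } },
    { "helper",                   128,    { -1, -1, -1, -1 } },
};
```

With an assumed limit of 256 KB, `exceeds_stack_limit(example_table, 0, 256 * 1024)` flags the path through `st_overflow_003_func_001`, while the path through `helper` alone stays well under the limit. The result depends entirely on the limit you configure, which is why a check like this needs to be set up for the target before it can report anything useful.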

A second warning class in the study’s Stack-Related Defects category was Stack Underflow. Again, CodeSonar failed to detect any of those defects, which surprised me because it has checkers for a few different kinds of underflows. All of the examples were similar: an index into a stack-allocated buffer was decremented in a loop whose termination condition didn’t check whether the index could go negative. Here’s the code:

void st_underrun_001 ()
{
        char buf[10];
        strcpy(buf, "my string");
        int len = strlen(buf) - 1;
        while (buf[len] != 'Z')
        {
                len--; /*Tool should detect this line as error*/
                       /* Stack Under RUN error */
        }
}

The termination condition is based only on the contents of the character that the index points to. This is an example of a sentinel-based search, which explains why CodeSonar didn’t flag it. It’s difficult for static analysis tools to reason precisely about the position of sentinels, and CodeSonar doesn’t report defects if it can’t do so with reasonable precision.
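
For contrast, here is a sketch of the same backward search with an explicit bound on the index. The function name `find_last_char` and its return convention are my own choices for illustration and are not part of the benchmark. The point is that the added `len >= 0` test makes termination independent of the buffer's contents, so neither a reader nor an analyzer has to reason about where a sentinel might be.

```c
#include <assert.h>
#include <string.h>

/* Searches backward from the end of the string `buf` for `target`.
 * Returns the index of the last occurrence, or -1 if it is absent. */
static int find_last_char(const char *buf, char target)
{
    assert(buf != NULL);
    int len = (int)strlen(buf) - 1;

    /* The index is bounded explicitly, so it can never run below zero,
     * whatever characters the buffer holds. */
    while (len >= 0 && buf[len] != target) {
        len--;
    }
    return len;
}
```

Called as `find_last_char("my string", 'Z')`, this returns -1 instead of walking off the front of the buffer as the benchmark loop does.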

When we develop checkers for CodeSonar, we try to balance recall (the ability to find a real bug) against precision (the proportion of results that are true positives). If the precision is too low, then the checker is typically useless because the noise drowns out the signal. Of course, we could adapt our underrun checker to report this flaw, but due to the inherent difficulty of the problem in general, the precision would be too low for it to be useful. Unfortunately, precision is one aspect of analysis that studies such as this are very poor at assessing.
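
The trade-off can be made concrete with the usual definitions: precision is true positives divided by everything the checker reports, and recall is true positives divided by all the real bugs present. The sketch below computes both; the struct and function names are assumptions of mine, and the counts in the usage note are invented purely to illustrate the arithmetic, not measurements of any tool.

```c
#include <assert.h>

/* Outcome counts for one checker on one code base (hypothetical). */
struct checker_counts {
    unsigned true_positives;   /* real bugs that were reported */
    unsigned false_positives;  /* reports that are not real bugs */
    unsigned false_negatives;  /* real bugs that were missed */
};

/* Proportion of reported warnings that are real bugs. */
static double precision(struct checker_counts c)
{
    unsigned reported = c.true_positives + c.false_positives;
    return reported == 0 ? 0.0 : (double)c.true_positives / reported;
}

/* Proportion of real bugs that were reported. */
static double recall(struct checker_counts c)
{
    unsigned real_bugs = c.true_positives + c.false_negatives;
    assert(real_bugs > 0); /* recall is undefined with no real bugs */
    return (double)c.true_positives / real_bugs;
}
```

Suppose a code base has 20 real bugs. A conservative checker with counts `{8, 2, 12}` has precision 0.8 and recall 0.4. An aggressive one with counts `{18, 182, 2}` raises recall to 0.9 but drops precision to 0.09, which means roughly ten false reports for every real one. A benchmark that only counts detected defects rewards the second checker and says nothing about the noise.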

This example is highly misleading. A casual reader of the paper would incorrectly conclude that CodeSonar is completely incapable of finding stack underruns. In reality, the examples chosen aren’t very representative of all possible stack underruns; examples like those are IMHO not even likely to occur in the wild. There are plenty of other underrun examples that CodeSonar (and other static analysis tools) would be good at finding.

Benchmarks such as these can be useful at times; they can help identify weak spots in tool coverage, and may be helpful in comparing how different tools report the defects. They are typically easy to use and to understand.

If benchmarks have been designed to target a particular application domain, they can yield helpful specific insights. But for that reason, beware that benchmarks targeted at reliability problems in safety-critical medical devices, for instance, are unlikely to have much in common with those targeted at security problems in server software.

In any domain, benchmarks are most useful when they are models of code found “in the wild,” but that’s essentially impossible to do without sacrificing some of the positive aspects of benchmarks. Don’t think of these kinds of micro-benchmarks as real bugs, but rather as caricatures of bugs. Although they can be useful to assess some limited aspects of static analysis tools, it would be unwise to make a purchase decision based solely on benchmark results. For example, a tool might do well on a micro-benchmark but be useless on real code because of weak precision, poor performance, or failure to integrate well into the workflow. The best way to assess a tool’s effectiveness is to try it on real code, make sure it’s configured correctly, and judge the results rationally.

Paul Anderson is the VP of Engineering at GrammaTech. He started there 24 years ago as a software engineer working on language-sensitive editor technology, before leading the conception and development of both CodeSurfer and CodeSonar. Prior to joining GrammaTech, Paul was a lecturer at City University in London, England. He retains connections to the academic world as a member of the program committees for several software engineering research conferences.

I am VP of Engineering at GrammaTech, where I oversee product development activities. My main focus is on CodeSonar, an advanced static-analysis tool for finding serious programming errors in C and C++, and on a new, as-yet-unannounced cybersecurity product.