Finding concurrency errors with static analysis

June 1, 2011 OpenSystems Media

4Due to the rise of multicore processors, programmers increasingly need to write multithreaded code. Pitfalls such as race conditions, starvation, and deadlock make it hard to get such code right, and traditional testing can only help so much. New approaches based on static analysis are proving effective at quickly finding difficult bugs.

Although decades of advances in miniaturization have yielded enormous performance gains for single processors, it appears this era is coming to a close. The best bet for achieving significant additional performance with single chips is through multiple cores, but only if software can be programmed to take advantage of them.

Unfortunately, concurrent programming is difficult. Even expert-level programmers familiar only with single-threaded programming often fail to appreciate that concurrent programs are susceptible to entirely new classes of defects such as race conditions, deadlocks, and starvation. It is difficult for humans to reason about concurrent programs, and some aspects of programming languages themselves are ill-suited to concurrency. Consequently, experts frequently stumble over these hazards. The following discussion describes common concurrency pitfalls and explains how static analysis tools can find such defects without executing the program.

Consequences of race conditions

A race condition arises when multiple threads of execution access a shared piece of data and at least one of them changes the value of that data without an explicit synchronization operation to separate the accesses. Depending on the interleaving of the two threads, the system can be left in an inconsistent state.

Race conditions are especially insidious because they can lurk undetected indefinitely, and only show up in rare circumstances exhibiting mysterious symptoms that are difficult to diagnose and reproduce. In particular, they are likely to survive through testing into deployed software. At best, this means increased development times; at worst, the consequences can be devastating.

One reason the Northeast Blackout of 2003 was so widespread was that a race condition in a computerized energy management system caused misleading information to be communicated to the operators. As Kevin Poulsen noted in a 2004 article (www.securityfocus.com/news/8412), “the bug had a window of opportunity measured in milliseconds.” The chances of a problem like this manifesting during testing are infinitesimal. In another case, a race condition in iOS 4.0 through 4.1 (now fixed) meant that any person with physical access to an iPhone 3G or later could bypass its passcode lock under certain conditions.

An example of a simple race condition is shown in Figure 1. A manufacturing assembly line with entry and exit sensors maintains a running count of the items currently on the line. This count is incremented every time an item enters the line and decremented every time an item reaches the end of the line and exits. If an item enters the line at the same time that another item exits, the count should be incremented and then decremented (or vice versa) for a net change of zero. However, normal increment and decrement are not atomic operations; they are composed of a sequence of individual instructions that first loads the value from memory, then modifies it locally, and finally stores it back in memory. If the updating transactions are processed in a multithreaded system without sufficient safeguards, a race condition can arise because the sensors read and write a shared piece of data: the count. The interleaving in Figure 1 results in an incorrect count of 69. There are also interleavings that can result in an incorrect count of 71, as well as some that correctly result in a count of 70.

21
Figure 1: A race condition leads to an incorrect count of items on an assembly line.

For this example and for race condition bugs in general, standard debugging techniques may be ineffective for several reasons.

Rare occurrence means a reduced chance of noticing there is a problem. If the problem manifests infrequently, it may never show up during testing. The issue is twofold. Firstly, the number of possible interleavings of the instructions in two threads can be huge and increases enormously as the number of instructions grows. This phenomenon is known as combinatorial explosion. If thread A executes M instructions and thread B executes N;instructions, the possible interleavings of the two threads are:

21
Equation 1

For example, given two trivial threads with 10 instructions each, there are 184,756 possible interleavings of the instructions. Real-world software is large and complex; testing every interleaving is simply impossible. Secondly, even if testers can identify a few interleavings that merit inspection, it is difficult to set up test cases to ensure they actually occur because thread scheduling can be highly nondeterministic.

If exhaustive testing is intractable, then what can developers do? One extremely useful approach is static analysis. Advanced static analysis tools such as CodeSonar use highly sophisticated symbolic execution techniques to consider many possible execution paths and interleavings at once. These techniques can find race conditions and other concurrency errors without needing to execute the program.

Several factors make race condition diagnosis difficult. Firstly, the symptoms can be perplexing. In the Figure 1 example, the running count will usually be correct, but sometimes too high and other times too low. Secondly, programmers unaccustomed to considering the particular pitfalls of multithreaded programming may spend a lot of time puzzling over the code before the possibility of a race condition occurs to them. Advanced static analysis tools are especially helpful in this regard. They identify race conditions by examining patterns of access to shared memory locations; that is, they focus on the race itself, not its symptoms. When a race condition is identified, an advanced static analysis tool will report it along with supporting information to aid the user in evaluation and debugging. The onus on the programmer is substantially reduced.

More complexity, more bugs

Race conditions are typically avoided by using locks to protect shared resources. However, locks can introduce performance bottlenecks that might prevent the program from taking advantage of the full potential of multiple cores, so programmers must exercise care in using them. It can be tricky to write code that uses locks effectively, and this complexity can lead to a different set of problems, namely deadlock and starvation.

In a deadlock, two or more threads prevent each other from making progress because each holds a lock needed by another. Figure 2 shows how a deadlock can arise with two locks used to protect two shared variables. In this example, multiple assembly lines share a count of the total number of items currently under assembly, and a second bad_items value records how many finished items failed quality control. One thread acquires the lock on count, another acquires the lock on bad_items. Neither thread can obtain the second lock it needs; thus neither can carry out its operations, nor can it get to the point where it will release its lock. As neither update can be completed, both threads are completely stuck.

22
Figure 2: In a deadlock between two threads, neither thread can progress.

Static analysis tools can identify software at risk of deadlock by flagging situations where the same locks can be acquired in different orders by different threads, such as the threads shown in Figure 2. Eliminating all such cases is sufficient to ensure that the system cannot become deadlocked.

Starvation is another problem that occurs in multithreaded programs that use locks. A thread can starve if it is waiting for a resource currently held by another thread that takes a very long time. For example, suppose the aforementioned manufacturing automation system includes a regular audit thread that examines all entry and exit records to ensure that the running count matches total items entering less total items exiting. The audit thread would need to hold locks on the count and on all sensors, so all updates must wait for the audit to finish. If the audit runs for a long time, updates can be significantly delayed. If it runs too long, the next audit may manage to acquire all the locks and start running before the outstanding thread can make any progress. In the worst case, some or all of the updates may never have the opportunity to run.

Static analysis can provide significant value here by posing questions such as, “Is there a call to a long-running library function while a lock is held?” Tools such as CodeSonar also provide mechanisms for users to add their own checks. If an in-house function f() is known to have a long running time, engineers could add a custom check that triggers a warning whenever f() is called by a thread that holds one or more locks.

Multithreading adds entirely new classes of potential bugs to those that embedded developers must consider, making it significantly more difficult to find bugs of all kinds. The latest generation of static analysis tools can help with both of these issues.

23
ECD in 2D: A demo of CodeSonar shows how the tool can identify bugs in each version of the software being analyzed. Use your smartphone, scan this code, watch a video: http://opsy.st/kSRTWh.

Paul Anderson is VP of engineering at GrammaTech, where he manages the engineering team and architects static analysis tools. He has worked in the software industry for 20 years, helping organizations including NASA, the FDA, the FAA, MITRE, Draper Laboratory, GE, Lockheed Martin, and Boeing apply automated code analysis to critical projects. He received his BSc from King’s College London and his PhD in Computer Science from City University London.

GrammaTech paul@grammatech.com www.grammatech.com

Paul Anderson (GrammaTech)
Previous Article
Proven technologies and embedded expertise improve the smart grid's security and resiliency
Proven technologies and embedded expertise improve the smart grid's security and resiliency

Just like building a house that can withstand storms, building a smart grid device requires a secure operat...

Next Article
Static analysis: Beyond a simple list of defects
Static analysis: Beyond a simple list of defects

Contextual information is a valuable asset that enhances static code analysis in the quest for error-free s...