Stuff breaks. Things go wrong. The less polite epithet is: **** happens. Whatever word you use, it’s a fact that we live in an imperfect world. In an embedded system, there are many opportunities for failure. In simple systems, failure generally results in them just not working. In complex systems, a failure may manifest itself in more subtle ways.
An embedded system is “smart,” so it seems obvious that this intelligence could be harnessed to detect impending problems, and those that have already occurred, and maybe to mitigate the effects of the failure.
The usual term for such built-in failure control is “self-testing.” It’s a big subject that has most likely been covered by many conference papers and the details could probably fill a book. But here I would like to just consider the key issues.
Essentially, there are four areas of possible failure in an embedded system:
- CPU
- Memory
- Peripherals
- Software
Failure of a CPU is quite rare, but, of course, not unknown. Partial failure is very unlikely, so the expected scenario is an inability to run any code, which leaves no opportunity to address the failure. As failure of electronic components most commonly occurs on power-up, CPU failure will most likely present itself as a completely dead device. It’s a different matter in a multi-CPU design, where one CPU may monitor the activity of another and report failure more gracefully.
Memory is a critical system component, of course, and modern devices have lots of it. Failure is far from unknown. A transient failure, caused perhaps by a stray sub-atomic particle, might result in the unexplained and unreproducible crashing of a device. There’s really nothing that can be done to address this possibility. A hard/permanent failure is more likely to be detectable.
Memory can be tested in two ways: on power-up (which is when failure is most likely to occur), before any useful data is stored in it, or on the fly, if spare CPU time is available. A comprehensive memory test, before it contains any data, can be worthwhile, if a brief start-up delay can be tolerated. The usual test is called “moving ones,” where memory is cleared and a one is written to each bit in turn and every other bit is checked to ensure that it’s a zero. A “moving zeros” test applies the same idea with the values inverted.
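To make the “moving ones” idea concrete, here is a minimal sketch in C. The function name and the byte-wide access are my own assumptions for illustration; a real start-up test would be tailored to the memory width and layout of the target, and would probably run before the C runtime has put anything useful in RAM.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: exhaustive "moving ones" test of a small region.
   Destroys the contents, so it is only suitable before memory is in use.
   The cost grows with the square of the region size. */
bool moving_ones_test(volatile uint8_t *mem, size_t len)
{
    /* Start with every bit cleared. */
    for (size_t i = 0; i < len; i++) {
        mem[i] = 0;
    }

    for (size_t bit = 0; bit < len * 8; bit++) {
        size_t  byte = bit / 8;
        uint8_t mask = (uint8_t)(1u << (bit % 8));

        mem[byte] = mask;               /* set exactly one bit */

        /* The byte holding the bit must read back as the mask, and
           every other byte in the region must still be zero. */
        for (size_t j = 0; j < len; j++) {
            uint8_t expected = (j == byte) ? mask : 0;
            if (mem[j] != expected) {
                mem[byte] = 0;
                return false;
            }
        }

        mem[byte] = 0;                  /* clear before moving on */
    }
    return true;
}
```

Because every bit is checked against every other, this is only practical for modest regions; on larger memories a cheaper test may be the realistic choice.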
On-the-fly testing is naturally less comprehensive, as live data mustn’t be corrupted. The only real option is to test each byte/word in turn: save its contents, write and read back a series of patterns, then restore the original value, all while interrupts are disabled.
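A sketch of such a non-destructive test might look like the following. The `DISABLE_INTERRUPTS()` and `ENABLE_INTERRUPTS()` macros are placeholders of my own that do nothing here; on a real target they would map to whatever interrupt control the CPU or RTOS provides. The function names and the choice of patterns are likewise just assumptions.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Placeholders: replace with the target's real interrupt control.
   They are no-ops here so the sketch can be built on a host. */
#ifndef DISABLE_INTERRUPTS
#define DISABLE_INTERRUPTS() ((void)0)
#define ENABLE_INTERRUPTS()  ((void)0)
#endif

/* Test one live word without losing its contents. */
bool test_word_in_place(volatile uint32_t *addr)
{
    static const uint32_t patterns[] = {
        0x00000000u, 0xFFFFFFFFu, 0xAAAAAAAAu, 0x55555555u
    };
    bool ok = true;

    DISABLE_INTERRUPTS();
    uint32_t saved = *addr;             /* keep the live data */
    for (size_t i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        *addr = patterns[i];
        if (*addr != patterns[i]) {
            ok = false;
            break;
        }
    }
    *addr = saved;                      /* put the live data back */
    ENABLE_INTERRUPTS();

    return ok;
}

/* Test a single word per call, so the work can be spread over idle time.
   The cursor wraps around at the end of the region. */
bool test_next_word(volatile uint32_t *base, size_t words, size_t *cursor)
{
    if (words == 0) {
        return true;
    }
    bool ok = test_word_in_place(&base[*cursor]);
    *cursor = (*cursor + 1) % words;
    return ok;
}
```

Testing one word per call keeps the time spent with interrupts disabled very short, which matters more than test speed in most real-time systems.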
Peripherals are many and varied and can fail in numerous interesting ways. However, there’s very little general advice that I can offer. Self-testing code can check that a device responds to its address, as a failure to do so would suggest that something bad has happened. Otherwise, some devices may have a “loop back” mode that enables basic transmit/receive functionality to be checked. Beyond that, creativity, driven by a knowledge of a device’s functionality, is needed to implement any self-testing.
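As an illustration of the loop back idea, here is a sketch of a test for a serial device. Everything about the driver interface (`serial_ops_t` and its three functions) is hypothetical, as real drivers differ widely. The simulated device at the end exists only so that the sketch can be exercised on a host; its “stuck bit” mask imitates a hardware fault.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical driver interface; a real driver will look different. */
typedef struct {
    void (*set_loopback)(bool enable);
    bool (*send_byte)(uint8_t b);       /* false on failure or timeout */
    bool (*receive_byte)(uint8_t *b);   /* false on failure or timeout */
} serial_ops_t;

/* Send a few probe bytes in loop back mode and expect them to return. */
bool serial_loopback_test(const serial_ops_t *dev)
{
    static const uint8_t probe[] = { 0x00, 0xFF, 0xAA, 0x55 };
    bool ok = true;

    dev->set_loopback(true);
    for (size_t i = 0; i < sizeof probe && ok; i++) {
        uint8_t echoed = 0;
        ok = dev->send_byte(probe[i])
          && dev->receive_byte(&echoed)
          && echoed == probe[i];
    }
    dev->set_loopback(false);           /* always restore normal mode */
    return ok;
}

/* Simulated device, for host-side experiments only. */
static bool    sim_loopback;
static bool    sim_has_byte;
static uint8_t sim_byte;
static uint8_t sim_stuck_mask;          /* bits forced to 1 to fake a fault */

static void sim_set_loopback(bool enable)
{
    sim_loopback = enable;
    sim_has_byte = false;
}

static bool sim_send(uint8_t b)
{
    if (sim_loopback) {
        sim_byte = (uint8_t)(b | sim_stuck_mask);
        sim_has_byte = true;
    }
    return true;
}

static bool sim_receive(uint8_t *b)
{
    if (!sim_has_byte) {
        return false;                   /* nothing came back */
    }
    *b = sim_byte;
    sim_has_byte = false;
    return true;
}

const serial_ops_t sim_uart = { sim_set_loopback, sim_send, sim_receive };
```

A test like this only proves that the device’s internal transmit and receive paths work; it says nothing about the connector, cable or whatever is at the other end.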
If software fails, it’s because an error was made in its design or implementation. Unlike hardware, error-free software (if it even exists) can’t go bad over time. Software failures fall broadly into two categories:
1. Stuck in a loop (unresponsive)
2. Data/code corruption
The most common cause of (1) is actually some kind of hardware issue, where the software is awaiting a response that never comes. This is still a software error, as a timeout is always prudent. The best way to address this kind of fault is to use some kind of watchdog facility. This is normally hardware that resets the system if a regular response from the software isn’t received. A dedicated task might do the same kind of job in a multi-threaded application.
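Both ideas can be sketched in a few lines of C. The first function is a wait with a timeout, polling a status register for a flag. The rest is a simple software watchdog: each task checks in, and a monitor only services the hardware watchdog if every task has done so. All of the names here are my own inventions, `hw_watchdog_kick()` merely counts calls where a real system would write to a watchdog register, and the number of tasks is an arbitrary assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bounded wait: poll a status register for a flag, but give up eventually. */
bool wait_for_flag(const volatile uint32_t *status, uint32_t mask,
                   uint32_t max_polls)
{
    while (max_polls-- > 0u) {
        if ((*status & mask) != 0u) {
            return true;
        }
    }
    return false;       /* timed out: the caller can report or recover */
}

#define TASK_COUNT 3u   /* assumed number of monitored tasks */
#define ALL_TASKS  ((1u << TASK_COUNT) - 1u)

static volatile uint32_t checkins;  /* one bit per task */
static unsigned hw_kicks;           /* stand-in for the hardware watchdog */

static void hw_watchdog_kick(void)
{
    hw_kicks++;         /* a real system would write a watchdog register */
}

/* Called by each task from its main loop. In a real multi-threaded
   system this update needs to be atomic or inside a critical section. */
void task_checkin(unsigned task_id)
{
    checkins |= 1u << task_id;
}

/* Called periodically by the monitor task. The hardware watchdog is
   only serviced if every task has checked in since the last call, so
   one stuck task is enough to let the hardware reset the system. */
bool watchdog_monitor_tick(void)
{
    bool all_alive = (checkins == ALL_TASKS);

    checkins = 0u;
    if (all_alive) {
        hw_watchdog_kick();
    }
    return all_alive;
}
```

The poll count is a crude timeout; where a timer is available, a limit expressed in real time is usually easier to reason about.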
Errors with pointers are the likely cause of (2), and completely random memory corruption is hard to detect and diagnose. Fortunately, a common error is the use of null or completely invalid pointers. On many processors this leads to a trap (software interrupt), so the precaution is to ensure that a trap handler is implemented. The other popular error is overrun of a memory area like a stack or an array. This may be addressed by placing “guard words” at either end and monitoring them for unexpected changes.
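Guard words are simple to sketch. In this example, the structure, the guard value and the function names are all assumptions of mine, and the monitoring is done by polling: the check would be called periodically, perhaps from an idle task. Some processors can instead trap a write to the guard locations in hardware, which catches the culprit in the act.

```c
#include <stdbool.h>
#include <stdint.h>

#define GUARD_WORD 0xDEADBEEFu  /* arbitrary, unlikely-looking value */
#define AREA_WORDS 64u          /* assumed size of the protected area */

/* A memory area with a guard word at either end. All members have the
   same type, so no padding separates the guards from the data. */
typedef struct {
    uint32_t guard_low;
    uint32_t data[AREA_WORDS];
    uint32_t guard_high;
} guarded_area_t;

/* Plant the guard words before the area is first used. */
void guarded_init(guarded_area_t *area)
{
    area->guard_low  = GUARD_WORD;
    area->guard_high = GUARD_WORD;
}

/* Returns false if either guard has been overwritten, which suggests
   that something has run off one end of the data. */
bool guarded_intact(const guarded_area_t *area)
{
    return area->guard_low  == GUARD_WORD
        && area->guard_high == GUARD_WORD;
}
```

Polling only reveals that an overrun has happened at some point since the last check, not which code caused it, but that is still far better than silent corruption.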
There remains a significant open issue. Once a failure or impending failure has been detected, what can you do about it? This depends entirely on the nature of the system. In some cases, particularly deeply embedded systems, a system reset is the only sensible course of action. Logging the failure for some later analysis may be a possibility. With other systems, the user may be advised and perhaps determine the action to be taken. A further possibility is for a device to “phone home” or send information about the failure to the user/supplier/developer using a network connection.
The bottom line is that every embedded system is different, which is what makes working in this industry interesting. The result is that self-testing is different for every device and the response to discovery of a fault is just as variable. The only constant factors are the likelihood of failure and the denial of such a possibility in the minds of many developers.
Colin Walls is an Embedded Software Technologist at Mentor Graphics’ Embedded Software Division.