In the 1980s, software design flaws in the Therac-25 radiation therapy system resulted in massive overdoses to at least six people and ultimately caused three deaths. Simple data entry mistakes led to patients being exposed to 10,000 percent more radiation than was prescribed. Investigations into this system, and especially the brilliant insights of Nancy Leveson (a professor of aeronautics and astronautics at MIT), informed and influenced the standards we apply today in building safety-critical software. This article discusses a few aspects of software design and development and outlines key practices that can help companies design from a safety-centric perspective that goes beyond basic compliance standards.
The U.S. mandates compliance with standards for the software development lifecycle (SDLC) of safety-critical projects: IEC 62304 for medical devices and DO-178C for avionics. Other specialized markets have their own standards, including ISO 26262 for cars and trucks and IEC 60880 for nuclear power, among many others. These lay out detailed procedures for documenting and monitoring nearly every aspect of software specification, design, coding, and testing, as well as rigorous standards for oversight, compliance, and certification. Many managers and engineers believe compliance with relevant SDLC standards is sufficient for developing safe and useful software. This is profoundly false.
“Standards can have the undesirable effect of limiting the safety efforts and investment of companies that feel their legal and moral responsibilities are fulfilled if they follow the standards.” ― Leveson, 2017, Computer
The true key to safer software is not requirements validation, but careful, reasoned application of systematic, incremental, iterative techniques applied to every level of development. This is not a mysterious process; safe complex systems are only built on top of safe, simple ones. After complexity becomes unbounded, no level of after-the-fact inspection, testing, or documentation is sufficient to rein it in again.
So rather than discussing the compliance standards, let’s start with the practices that these standards don't cover: those that contribute to inadequate designs and, by extension, to unsafe systems.
On Time! Under Budget!
Companies reward managers who produce systems under tight schedules, with limited resources. The unpleasant truth is that these are not just poor metrics for building machines that can kill people when they fail; they are incentives to cut corners, to fudge data, and to kick the can down the road. They are also time bombs that can blow up years later. When measured against the potential for billion-dollar lawsuits, fast turnaround matters far less than quality.
This is not to say that there are no opportunities to reduce costs while improving safety, but cost reduction has to remain secondary to safety concerns. The greatest improvements are seen when design complexity is reduced. This involves selective use of higher-level tools such as functional and type-safe programming languages, and model-driven development (MDD). C and C++ are both popular, and terrible, for safety-critical systems. Their popularity stems from the deep well of tooling and the availability of experienced developers. They are terrible because languages with unsafe pointers and direct memory management present a nearly impossible test design challenge. Experienced software architects understand the trade-offs between control and expressive power that high-level tools provide. A homogeneous monoculture of tools sounds great but is rarely possible or desirable with complex products.
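As one small illustration of how type-level tooling can shrink the test burden, the sketch below wraps a radiation dose in a dedicated type, so that a raw number cannot be passed where a validated dose is expected. It uses Python with type hints purely for brevity. The names `DoseCentigray`, `MAX_DOSE_CGY`, and `deliver` are hypothetical, and the 500 cGy limit is an arbitrary placeholder rather than a clinical value.

```python
from dataclasses import dataclass

# Hypothetical upper bound for a single treatment; a placeholder, not a clinical value.
MAX_DOSE_CGY = 500.0


@dataclass(frozen=True)
class DoseCentigray:
    """A dose that has been range-checked at construction time."""
    value: float

    def __post_init__(self) -> None:
        # Rejects NaN, negative, and out-of-range values in one place.
        if not (0.0 <= self.value <= MAX_DOSE_CGY):
            raise ValueError(f"dose out of range: {self.value!r} cGy")


def deliver(dose: DoseCentigray) -> str:
    """Accepts only a validated dose; a static checker such as mypy flags a bare float."""
    if not isinstance(dose, DoseCentigray):
        raise TypeError("deliver() requires a DoseCentigray")
    return f"delivering {dose.value:.1f} cGy"
```

Because every `DoseCentigray` is checked when it is created, code downstream of `deliver` no longer needs its own range tests, which is the kind of complexity reduction the paragraph above argues for.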
Programming requires developers to spend lots of time 'playing computer in their heads' as Bret Victor likes to say. The size of a programmer's head is limited, so the level of abstraction she works at is key to the scale of the problem that can be solved in a given sitting. Greater tool functionality and more immediate feedback not only produce systems faster, they improve the kinds of systems that can be built.
Meets or Exceeds Every Requirement!
SDLC standards focus on assuring that the software that is placed in the product faithfully satisfies a set of high-level requirements (HLRs), and that these HLRs have been decomposed into low-level requirements (LLRs) that can be faithfully traced to every line of code. What is less discussed is the origin of the requirements themselves. Leveson writes:
“Another widespread misunderstanding is that software safety is enhanced by assuring that the software satisfies the requirements. Almost all software-related accidents have involved requirements flaws, not coding or implementation errors ... we have to focus less on assurance and more on identifying the safety-critical requirements and building safety into these machines from the beginning of development.”
The process of setting the requirements for systems is, in and of itself, a huge source of error. As systems become more complex, boilerplate text gets pulled or “adapted” from older designs. When things go well, these errors are discovered early and reworked or removed; when the process fails, nonsense code is implemented to trace to inapplicable requirements. And often, requirements are so badly worded, they're not even wrong.
Requirements development is frequently treated as though it’s a magical process that occurs before the first line of code is written. This is at least as bad as the second most common model, which is to write them after the production of a “working” system. Just as software must be carefully built from simpler validated systems, requirements can't be understood in the abstract. Regardless of the format, requirements need to be sufficiently detailed to be testable and provably satisfied or not.
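To make "testable and provably satisfied or not" concrete, here is a minimal sketch of a requirement written as an executable predicate. The requirement ID `REQ-017`, the `MachineState` fields, and the 0.5 cGy tolerance are all invented for illustration and do not come from any real specification.

```python
from typing import NamedTuple


class MachineState(NamedTuple):
    """Hypothetical snapshot of the values the requirement talks about."""
    prescribed_cgy: float
    configured_cgy: float
    interlock_closed: bool


def req_017_beam_may_enable(state: MachineState, tolerance_cgy: float = 0.5) -> bool:
    """REQ-017 (hypothetical): the beam may enable only if the interlock is closed
    and the configured dose is within tolerance of the prescription."""
    dose_matches = abs(state.configured_cgy - state.prescribed_cgy) <= tolerance_cgy
    return state.interlock_closed and dose_matches
```

A requirement phrased this way either holds for a given state or it does not. A statement such as "the system shall be safe" offers nothing comparable to check.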
Meaningful requirements enumeration has to be an iterative, interactive process that involves specialists from every aspect of the system under construction. Stakeholders need to be identified early and just as code should be traceable to requirements, requirements must be traceable to the individuals who understand their needs in depth. Above all, systems designers must avoid the trap of dropping some 300-page document into the mix and blandly stating: “the system must comply with all applicable aspects of 'specification X'.”
Finally, all requirements are not created equal. Some have very broad implications that are poorly understood by their authors, or whose merits have been overtaken by technology. Overly specific statements such as "all values will be represented in IEEE 754-2008 formats," might have seemed prudent at one level, but can drive bad design choices. Dynamic, flexible development, adoption, review and, especially, rejection processes for requirements are likely to have a greater impact on systems safety than anything else described here.
Passes Every Test!
Comprehensive, integrated, internal test design is the single most important element of high quality, reliable software. It is also the easiest element to ignore in the early stages of development and cannot be added to an existing codebase without effectively rewriting the code from scratch. Weak management often fails to invest in parallel teams of test developers, insisting instead that “all our developers write their own unit tests,” if they write any at all. This produces “happy path” testing that initially covers only the most common scenarios. When driven to comply with SDLC requirements for complete code coverage, automated tools and broad tests are used to meet the letter of the law, while violating the underlying rationale.
All safe code is extensively instrumented and designed to be tested in as isolated a manner as possible. Unit tests cover every line of code, and independent test engineers are rewarded for building tests for corner cases, just as feature developers are rewarded for features. Error handlers and other stress conditions should be tested most thoroughly, by engineers with devious minds who think first about the malicious conditions that might occur.
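The difference between "happy path" testing and devious corner-case testing can be sketched with Python's standard `unittest` module. The function `parse_dose_entry` and the limit `MAX_DOSE_CGY` are hypothetical stand-ins for an operator data-entry routine. What matters is the list of hostile inputs, which deliberately exercises the error handlers rather than the common case.

```python
import math
import unittest

MAX_DOSE_CGY = 500.0  # hypothetical limit, not a clinical value


def parse_dose_entry(text: str) -> float:
    """Parse an operator-typed dose in cGy, rejecting anything ambiguous."""
    try:
        value = float(text.strip())
    except ValueError:
        raise ValueError(f"not a number: {text!r}") from None
    if not math.isfinite(value):
        raise ValueError(f"not a finite dose: {text!r}")
    if not (0.0 < value <= MAX_DOSE_CGY):
        raise ValueError(f"dose out of range: {value} cGy")
    return value


class DoseEntryCornerCases(unittest.TestCase):
    def test_happy_path(self) -> None:
        self.assertEqual(parse_dose_entry(" 200 "), 200.0)

    def test_rejects_hostile_and_sloppy_input(self) -> None:
        # Empty input, typos, sign errors, overflows, and non-finite values.
        bad_inputs = ["", "   ", "abc", "2OO", "-1", "0", "500.0001",
                      "20000", "nan", "inf", "-inf", "1e999", "2,00"]
        for text in bad_inputs:
            with self.subTest(text=text):
                with self.assertRaises(ValueError):
                    parse_dose_entry(text)

    def test_boundary_is_inclusive(self) -> None:
        self.assertEqual(parse_dose_entry("500"), 500.0)
```

Saved as a module, the suite runs with `python -m unittest`. An independent test engineer would keep extending `bad_inputs` as new failure modes are imagined or observed.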
Effective test design has to be incorporated into code design. Before the first line is written, languages, frameworks, simulators, and other tool suites need to be evaluated with the test paradigm at the forefront.
Medical devices and avionics are especially difficult to test comprehensively, largely because there aren't a bunch of human test subjects or spare airplanes lying around in the test lab. This presents an extra burden to create comprehensive and accurate simulations and emulations that are employed during every phase of development. It's equally critical that these tools are validated against real-world conditions. The team must have enough practical knowledge to anticipate and eliminate usage errors by the end user. Will a specific combination of keystrokes lead to an unexpected malfunction? What if the product is applied incorrectly? Could product misuse due to ignorance potentially lead to dire consequences? These are crucial questions that the team must be able to anticipate and answer.
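The keystroke question can be put to a simulation directly. The sketch below replays every key sequence up to a given length against a toy console model and reports any sequence that violates a safety invariant. The `Console` class, its three keys, and the seeded defect in `DefectiveConsole` are all invented for illustration and do not model any real machine.

```python
from itertools import product
from typing import Callable, List, Tuple

KEYS = "xeb"  # x: select X-ray mode, e: select electron mode, b: fire beam


class Console:
    """Toy simulation of an operator console; not a model of any real machine."""

    def __init__(self) -> None:
        self.mode = "electron"
        self.target_in_place = False
        self.unsafe_firings = 0

    def press(self, key: str) -> None:
        if key == "x":
            self.mode = "xray"
            self.target_in_place = True   # target moves in with the mode change
        elif key == "e":
            self.mode = "electron"
            self.target_in_place = False
        elif key == "b":
            # Safety invariant: an X-ray beam requires the target in place.
            if self.mode == "xray" and not self.target_in_place:
                self.unsafe_firings += 1


class DefectiveConsole(Console):
    """Seeded defect: selecting X-ray mode forgets to move the target."""

    def press(self, key: str) -> None:
        if key == "x":
            self.mode = "xray"            # bug: target_in_place is left unchanged
        else:
            super().press(key)


def find_unsafe_sequences(make_console: Callable[[], Console],
                          max_len: int) -> List[Tuple[str, ...]]:
    """Replay every key sequence up to max_len and collect those that fire unsafely."""
    unsafe: List[Tuple[str, ...]] = []
    for length in range(1, max_len + 1):
        for keys in product(KEYS, repeat=length):
            console = make_console()
            for key in keys:
                console.press(key)
            if console.unsafe_firings:
                unsafe.append(keys)
    return unsafe
```

Here `find_unsafe_sequences(Console, 5)` returns an empty list, while `find_unsafe_sequences(DefectiveConsole, 2)` returns `[('x', 'b')]`. Exhaustive replay only scales to short sequences, and it is only as trustworthy as the simulation, which is why validating the simulator against real-world conditions matters so much.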
One Million Years Between Failures!
Finally, let’s address the fallacy of MTBF (mean time between failures) calculations. To put it directly, they should be banned from any discussion of software. The assumptions that underlie the math of these statements are themselves so subjective and prone to error that we should assume that all those carefully constructed charts and graphs are meaningless.
“In our enthusiasm to provide measurements, we should not attempt to measure the unmeasurable ... Specific software flaws leading to a serious loss can’t be assessed probabilistically.” ― Leveson, 2017, Computer
Start by assuming that any process, and any system can fail. Just as building complete test suites must not rely on one team, failure remediation must include independent teams who bring different ideas to the table. Independent watchdog processes, especially when coupled with redundant and independent hardware sensor arrays, provide excellent techniques for reducing failure modes. Avoid temptations to remove mechanical and electronic safeguards, just because “the software can do it.”
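The watchdog idea can be sketched in a few lines. This is a minimal illustration under stated assumptions, not a production design. The `Watchdog` class, its `kick` and `poll` methods, and the injected `clock` parameter are hypothetical names chosen for this example, and in a real system the polling side would run on independent hardware or in a separate process.

```python
import time
from typing import Callable


class Watchdog:
    """Trips into a latched safe state when the monitored task stops checking in."""

    def __init__(self, timeout_s: float, on_trip: Callable[[], None],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._timeout_s = timeout_s
        self._on_trip = on_trip
        self._clock = clock              # injectable so tests can control time
        self._last_kick = clock()
        self.tripped = False

    def kick(self) -> None:
        """Called by the monitored task to prove it is still making progress."""
        if not self.tripped:             # latched: no software reset in this sketch
            self._last_kick = self._clock()

    def poll(self) -> bool:
        """Called by an independent monitor; returns True once tripped."""
        if not self.tripped and self._clock() - self._last_kick > self._timeout_s:
            self.tripped = True
            self._on_trip()              # e.g. cut power through a separate path
        return self.tripped
```

The trip is deliberately latched: once the monitored task has missed its deadline, a late `kick` cannot quietly restore normal operation. The injected clock lets the timeout logic itself be unit tested without waiting on real time.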
The Lesson of Complex Design
If we've learned one lesson from just over half a century of complex software design, it is this: every working system is built from simpler working systems. This is the core underpinning of agile development; it posits that you have to keep building on simple systems, and only by understanding and completing one level can you hope to build the next. We've learned that unit tests, stress tests, performance tests, and system tests must develop incrementally along with the other elements, or they won't provide the necessary safety.
The problem presented by agile design in safety-critical projects is that it also mandates a cyclical requirements development process, which remains anathema to many corporate cultures. Regardless, managing software development can't be isolated to a handful of coders; it now touches every aspect of product development. Flexibility, coupled with tight cooperation between teams, is more important than all the checklists, documents, and procedural requirements that allow us to certify software systems as 'safe.'
It is almost never one massive engineering flaw that causes disasters like the injuries and deaths in the Therac-25 case, but several smaller missteps throughout the design process that lead to a cumulative safety failure. Overconfidence in prior models, a lack of testing in a real-world environment, and an unwillingness to believe that the system was failing resulted in a complete malfunction of the machine when used in a clinical environment. The inability of the development team to plan for and prevent these errors serves as a startling reminder of how important even the smallest step can be when designing safety-critical software. With a thoughtful eye to every milepost of the design process and a testing protocol that goes beyond the baseline standards, tragedies like this can be mitigated at the starting gate instead of discovered past the finish line.
John Fogarty is an advisory software engineer at Base2 Solutions, a software engineering services firm that helps companies in highly regulated industries, including medical devices, aerospace, and defense, develop reliable and complex interconnected systems. John has more than two decades of experience in software development and management across the full technical stack.