Continuous downscaling of CMOS technologies has led to smaller IC size, lower power consumption, and faster clock frequency. At the same time, system reliability has become an increasingly major concern. Specifically, technology trends such as transistor downsizing, use of new materials, and system-on-chip architectures continue to reduce the soft error tolerance of systems. To meet reliability requirements, it is necessary for both circuit designers and test engineers to have a basic knowledge of soft errors.
In recent times, the performance of VLSI and mixed signal circuits has been increasing noticeably due to improved fabrication technology. Smaller feature sizes, lower operating voltage levels, and reduced noise margins have also helped to improve the performance and lower the power consumption of today's electronic circuits.
At the same time, however, these advancements have made ICs more susceptible to transient faults. Transient faults are temporary changes of state resulting from the latching of a single-event transient (a transient voltage fluctuation), external particle strikes, or process-related parametric variations. When a transient fault occurs in a system, it creates a soft error. Unlike fabrication or design errors, soft errors do not occur consistently. While soft errors do not cause permanent physical damage to the chip, they change data by altering signal states and stored values (the charge in a storage circuit, for instance) and can result in incorrect logical execution.
Soft errors may change the control flow of the system with catastrophic results relative to the desired operation of the system. Specifically, soft errors are of major concern in systems where high reliability is essential, such as space programs (where a system cannot malfunction while in flight), banking transactions (where a single fault may cause a huge difference in balance), automated railway systems (where a single bit flip can cause a drastic accident), military critical embedded applications, and so forth.
This article presents a tutorial study of radiation-induced single event upsets, a major source of soft errors. We cover basic radiation mechanisms, the soft errors they cause, and their effect in silicon. Soft error mitigation techniques based on time and space redundancy are also illustrated, along with examples from the literature of how design engineers are addressing soft errors today.
A short history of soft errors
For almost half a century, electrical, aerospace, and nuclear and radiation engineers have been studying soft errors. In the 1950s, failures in digital electronics were reported during above-ground nuclear bomb tests. At the time, these were regarded as electronic anomalies in the monitoring equipment because they were random and their cause could not be traced back to a known source.
Wallmark and Marcus predicted that cosmic rays would start affecting microcircuits, through heavy ionized particle strikes and cosmic ray reactions, once feature sizes became small enough. May and Woods of Intel found that these errors were caused by the alpha particles emitted in the radioactive decay of uranium and thorium present in package materials. Their work represented the first observation of radiation-induced single event upsets in electronic systems at sea level, and these errors were referred to as “soft errors”. The term was used to differentiate them from repeatable errors traceable to permanent hardware faults. Later, Guenzer and Wolicki stated that the particles inducing these upsets came not just from uranium and thorium but also from nuclear reactions generating high-energy neutrons and protons. In 1979, Ziegler and Lanford of IBM predicted that cosmic rays could also be a source of upsets in electronics. In 2000, Sun server systems deployed extensively at America Online, eBay, and other companies crashed due to cosmic rays.
Sources of soft errors
Soft errors may be caused by electrical noise sources such as a noisy power supply, lightning, and electrostatic discharge (ESD). These are the intrinsic sources of soft errors. Other non-environmental effects like resistive or capacitive variations or couplings, voltage fluctuations, and electromagnetic interference may also cause intermittent faults in systems.
With advancements in design and fabrication technology, non-environmental conditions may no longer affect sub-micron semiconductor devices significantly. However, radiation-induced alpha particles and cosmic rays remain the dominant factors causing errors in electronic systems. Three principal radiation sources cause soft errors:
- Alpha particles are radiated when an unstable isotope of a radioactive element decays to a lower energy state. These particles contain high kinetic energy in the range of 4 to 9 MeV. There are many radioactive compounds, with uranium and thorium the most active among naturally occurring materials. In the terrestrial environment, major sources of alpha particles are radioactive impurities such as lead-based isotopes in solder bumps of the PCB technology, gold used for bonding wires and lid plating, and aluminum used in ceramic packages and lead-alloys.
- High-energy neutrons (> 1 MeV) from cosmic radiation interact with silicon nuclei and produce secondary ions that may cause soft errors in semiconductor circuits. Galactic cosmic rays react with the Earth’s atmosphere to produce complex cascades of secondary particles. Neutrons are the particles most likely to cause single event upsets (SEU) in submicron semiconductor devices at terrestrial altitudes.
- Ionizing particles are also generated inside electronic devices as secondary radiation when low-energy cosmic neutrons interact with the boron-10 isotope. Modern microprocessors use highly purified package materials, so this radiation mechanism is greatly reduced, leaving high-energy cosmic rays as the major cause of soft errors.
Soft error effects
Soft errors are a critical issue in high-safety and high-performance systems like avionics and transportation, space and military applications, medical equipment, and high-end networking systems. When soft errors occur in memory elements such as SRAM, DRAM, latches, and registers, they directly affect the information stored in digital circuits and can cause the failure of an entire piece of equipment.
The effects of soft errors can be classified as follows:
- Single event transient (SET) occurs when one or more voltage pulses (i.e., glitches) are generated and propagate through the circuit.
- Single event upset (SEU) is a change of state caused when a radiating particle strikes a sensitive node. SEUs are transient and non-destructive soft errors, which means that a reset or rewriting of the device results in normal device behavior thereafter. SEUs result in either SBUs (Single-Bit Upsets) or MBUs (Multiple-Bit Upsets). SBU refers to the flipping of one bit due to the passage of a single energetic radiation particle, whereas an MBU is caused when a single ion hits two or more bits, causing simultaneous errors.
- Single event latch-up (SEL) results in a high operating current, above device specification, and may cause an apparent short-circuit from power to ground. If power is not removed quickly, catastrophic failure may occur due to excessive heating or to failure of the metallization or bond wires.
- Single event interrupt (SEI) causes a bit change in a control register, which may lead to malfunction of the system.
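The SBU/MBU distinction in the list above can be illustrated by comparing a stored word before and after an event. The following Python sketch is only an illustration: the helper name `classify_upset` and the 8-bit example word are assumptions made here, not part of any standard tool or of the cited work.

```python
def classify_upset(before: int, after: int) -> str:
    """Classify the difference between two stored words (hypothetical helper)."""
    flipped = before ^ after            # 1-bits mark the positions that changed
    count = bin(flipped).count("1")     # number of flipped bits
    if count == 0:
        return "no upset"
    return "SBU" if count == 1 else "MBU"


stored = 0b10110010                                   # assumed 8-bit memory word
print(classify_upset(stored, stored))                 # nothing changed
print(classify_upset(stored, stored ^ 0b00000100))    # one bit flipped
print(classify_upset(stored, stored ^ 0b00011000))    # two adjacent bits flipped
```

In a real device the classification also depends on whether the flipped bits were hit by the same particle, which a simple word comparison cannot tell.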
When a single high-energy particle strikes a sensitive node (such as a reverse-biased n+ p junction), it loses energy and produces electron-hole pairs. This results in a densely ionized track in the vicinity of that element. When the resultant ionization track traverses or comes close to the depletion region, carriers are rapidly collected by the electric field and create a large current/voltage transient at that node.
The reverse-biased junction is the most charge-sensitive part of circuits, particularly if the junction is floating or weakly driven. The charge collection can be described by three processes: drift in the equilibrium depletion region, funneling, and diffusion.
In the drift process, the high electric field in the depletion region sweeps out the carriers. This is accompanied by a concurrent distortion of the potential into a funnel shape. The funnel extends the high-field depletion region deeper into the substrate and enhances the efficiency of the drift collection. This phase is completed within about a nanosecond, after which diffusion begins to dominate the collection process. Additional charge is collected as electrons diffuse into the depletion region until all excess carriers have been collected, recombined, or diffused away from the junction area (see Figure 1). In general, the farther from the junction the event occurs, the smaller the collected charge and the less likely the event is to cause a soft error.
[Figure 1 | Drift is the process of a high electric field existing in the depletion region and sweeping out the carriers.]
If the charge collected in this process (Qcoll) is larger than the critical charge, a soft error will be induced. It may result in a current pulse or voltage transient affecting the sensitive node. In VLSI circuits, it can flip the state of a memory cell, inverting its logical state.
The magnitude of Qcoll depends on various factors including the size of the device, biasing of the various circuit nodes, substrate structure, device doping, the type of ion, its energy, its trajectory, the initial position of the event within the device, and the state of the device. The sensitivity of the node, however, depends primarily on the node capacitance, operating voltage, and the strength of feedback transistors. Together, these define the critical charge (Qcrit), the amount of charge required to trigger a change in the data state.
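As a rough illustration of the relationship described above, the critical charge of a storage node is often approximated to first order as node capacitance times supply voltage, ignoring the restoring current of the feedback transistors. The Python sketch below uses that simplification. The function names and the example values (a 2 fF node at 1.0 V, with 1.2 fC and 3.0 fC collected) are assumptions chosen for illustration, not measured data.

```python
def critical_charge(c_node_farads: float, vdd_volts: float) -> float:
    """First-order estimate (assumption): Qcrit ~ C_node * VDD.

    The contribution of the feedback transistors is deliberately ignored.
    """
    return c_node_farads * vdd_volts


def causes_upset(q_coll_coulombs: float, q_crit_coulombs: float) -> bool:
    """A soft error is expected when the collected charge exceeds Qcrit."""
    return q_coll_coulombs > q_crit_coulombs


q_crit = critical_charge(2e-15, 1.0)       # assumed 2 fF node at 1.0 V
print(causes_upset(1.2e-15, q_crit))       # 1.2 fC collected, below Qcrit
print(causes_upset(3.0e-15, q_crit))       # 3.0 fC collected, above Qcrit

# Lowering the supply voltage lowers Qcrit, so the same strike can now upset the node.
print(causes_upset(1.2e-15, critical_charge(2e-15, 0.5)))
```

The last line shows why voltage scaling raises sensitivity: the collected charge is unchanged, but the threshold it has to cross is halved.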
Soft error minimization techniques
Soft error mitigation techniques can be classified into two types: prevention and recovery. Prevention methods protect electronic systems from soft errors before they occur. They are implemented during the chip design process and development. Recovery methods include ECC/parity checks, online testing, redundancy, and fault tolerant computing.
Prevention techniques at the device level include purification and hardening. Using highly purified fabrication materials and processes reduces the alpha particle emission in the final packaged IC. Over the years, uranium and thorium impurities have been reduced below one hundred parts per trillion to achieve better reliability. Similarly, borophosphosilicate glass (BPSG) has been replaced by other insulators that do not contain boron, so the amount of boron-10 in packaged materials is also greatly reduced. However, the soft error rates (SERs) caused by high-energy neutron interactions cannot be so easily minimized. For this, radiation hardened process technologies are required.
SERs can be greatly reduced by adapting the process technology either to reduce the collected charge (Qcoll) or to increase the critical charge (Qcrit). Qcoll can be reduced by using additional well isolation (triple-well or guard-ring structure) or by creating potential barriers that limit the efficiency of the funneling effect and reduce the likelihood of parasitic bipolar collection paths. Qcrit can be increased by increasing transistor sizes or even by increasing the supply voltage.
Another approach replaces bulk silicon well-isolation with silicon-on-insulator (SOI) substrate material. The direct charge collection is significantly reduced in SOI devices because the active device volume is greatly reduced (due to the thin silicon device layer on the oxide layer).
Soft errors can be eliminated at the circuit level by using masking methodologies. Logical Masking removes the sensitive path from the node to the output or flip-flop. A single event effect is logically masked if its propagation path is blocked from reaching an output latch because off-path gate inputs prevent a logical transition of that gate's output. Electrical Masking refers to the gradual attenuation of the soft error as it propagates through combinational logic. The signal is attenuated by the electrical properties of gates on its propagation path such that the resulting pulse becomes insufficient to cause any further error. Timing Window Masking increases the clock period so that it is greater than the soft error duration, preventing soft errors from affecting the output at latching time.
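The three masking effects described above can be sketched with toy models in Python. All function names and numbers below are assumptions for illustration. In particular, the fixed per-gate pulse shortening used for electrical masking is a deliberate simplification, and timing masking is modeled simply as a pulse that does not overlap the latching window.

```python
def and_gate(a: int, b: int) -> int:
    """Two-input AND gate on single-bit values."""
    return a & b


def logically_masked(struck_input: int, other_input: int) -> bool:
    """True if flipping the struck input leaves the AND output unchanged."""
    return and_gate(struck_input, other_input) == and_gate(struck_input ^ 1, other_input)


def electrically_masked(pulse_ps: float, loss_ps_per_gate: float, gates: int) -> bool:
    """Toy model (assumption): each gate shortens the pulse by a fixed amount."""
    return pulse_ps - loss_ps_per_gate * gates <= 0


def timing_masked(pulse_start_ps: float, pulse_ps: float,
                  window_start_ps: float, window_ps: float) -> bool:
    """True if the pulse ends before, or starts after, the latching window."""
    pulse_end = pulse_start_ps + pulse_ps
    window_end = window_start_ps + window_ps
    return pulse_end <= window_start_ps or pulse_start_ps >= window_end


print(logically_masked(1, 0))                   # off-path input is 0, glitch is blocked
print(logically_masked(1, 1))                   # off-path input is 1, glitch propagates
print(electrically_masked(50.0, 20.0, 3))       # 50 ps pulse dies out within 3 gates
print(timing_masked(100.0, 40.0, 900.0, 30.0))  # pulse is over long before latching
```

A transient becomes a soft error only if none of the three checks masks it on its way to a storage element.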
Recovery from soft errors
Much of the research on protection techniques for soft errors focuses on combinational and sequential elements (latches and flip-flops). Dual interlocked storage cells and SRAM cells are used in the feedback paths of the latest architectures. However, this technique significantly increases the area of the IC by introducing more nodes, and it also adds to power consumption. Modern architectures must trade off time redundancy against space redundancy.
Fault-tolerant computing methods are often used to recover from soft errors. They have existed in the literature for quite some time but have seen renewed interest due to the increasing risk of SEUs. On-line testing techniques are also used as recovery solutions for soft error mitigation. Specific techniques that are used include self-checks, concurrent error detection for finite state machines (FSM) through signature monitoring, error detection and correction (EDAC) codes, and redundancy.
- Redundancy: Redundancy achieves higher system reliability by sacrificing minimality in time, in space, or in both. Triple modular redundancy (TMR) is a classic example. Space redundancy and time redundancy are often implemented together to meet high fault-tolerance requirements with reduced hardware overhead, using duplication and comparison instead of TMR.
- ECC and Parity: Large memories are more sensitive to ionized particles than logic because of the high number of storage cells they comprise. A simple solution for protecting memory is to use parity bit checking. If a particle strike changes the state of a single bit of a memory word, the error can be detected by checking a parity code during the read operation. If more check bits are used, as in an error-correcting code (ECC), the scheme can correct the error as well as detect it, giving the highest reliability.
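The recovery techniques in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a production design: the function names are hypothetical, the TMR voter is modeled as a bitwise majority function, and the error-correcting code shown is the textbook Hamming(7,4) code, chosen here only because it is the smallest single-error-correcting example.

```python
def tmr_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority vote over three redundant copies (TMR)."""
    return (a & b) | (a & c) | (b & c)


def even_parity(word: int) -> int:
    """Parity bit that makes the total number of 1s even."""
    return bin(word).count("1") & 1


def hamming74_encode(data: int) -> int:
    """Encode 4 data bits into a 7-bit Hamming codeword (positions 1..7)."""
    d = [(data >> i) & 1 for i in range(4)]        # d[0] is the least significant bit
    p1 = d[0] ^ d[1] ^ d[3]                        # checks positions 1, 3, 5, 7
    p2 = d[0] ^ d[2] ^ d[3]                        # checks positions 2, 3, 6, 7
    p4 = d[1] ^ d[2] ^ d[3]                        # checks positions 4, 5, 6, 7
    bits = [p1, p2, d[0], p4, d[1], d[2], d[3]]    # codeword positions 1..7
    return sum(bit << i for i, bit in enumerate(bits))


def hamming74_correct(code: int) -> int:
    """Return the 4 data bits after correcting at most one flipped bit."""
    syndrome = 0
    for pos in range(1, 8):
        if (code >> (pos - 1)) & 1:
            syndrome ^= pos                        # XOR of the positions holding a 1
    if syndrome:
        code ^= 1 << (syndrome - 1)                # syndrome names the flipped position
    bits = [(code >> i) & 1 for i in range(7)]
    return bits[2] | (bits[4] << 1) | (bits[5] << 2) | (bits[6] << 3)


word = 0b1011
print(tmr_vote(word, word ^ 0b0100, word) == word)         # one corrupted copy is outvoted
print(even_parity(word ^ 0b0001) != even_parity(word))     # a single flip changes the parity
print(hamming74_correct(hamming74_encode(word) ^ (1 << 5)) == word)  # one flip is corrected
```

The sketch also shows the cost side of the trade-off: TMR triples the storage, a parity bit adds one bit but can only detect, and Hamming(7,4) adds three check bits to four data bits to correct a single upset.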
As the feature sizes and operating voltages of electronic systems are reduced to satisfy the market’s insatiable demand for higher density, functionality, and lower power dissipation, the sensitivity of ICs to radiation increases dramatically. This is a significant reliability challenge in modern CMOS logic design. By understanding single event effects (SEE) research, radiation mechanisms and modelling, and mitigation techniques, engineers can minimize the impact of soft errors and maintain desired levels of reliability.
R.C. Baumann, “Soft errors in advanced semiconductor devices-part I: the three radiation sources,” IEEE Transactions on Device and Materials Reliability, Volume 5, Issue 3, pp. 305-316, September 2005.
Rahebeh Niaraki Asli and Saeideh Shirinzadeh, “High Efficiency Time Redundant Hardened Latch for Reliable Circuit Design,” Springer Science+Business Media New York, May 2013.