ECC RAM in embedded/industrial applications

July 01, 2015


ECC RAM has historically only been found in servers, but it's now finding its way into embedded and industrial applications.

It’s easy to liken the job of DRAM to that of a long-serving production line worker: undertaking the exact same repetitive task, day in, day out, over decades, and performing it to literal perfection. That analogy doesn’t hold true, though.

A DRAM device’s job today is far from easy. It is expected to store 4-8 billion unique bits of data in leaky capacitors, built at feature sizes of tens of nanometers and with capacitance on the order of tens of femtofarads. That’s not all: it is also expected to transmit that data at billions of bits per second in parallel, up to 72 bits simultaneously and without the benefit of an embedded clock, almost never losing a single bit. Oh, and we expect all of this for the price of a coffee!

What could possibly go wrong? Quite a lot actually…

A corruption in stored memory state, caused at its core by a “bit flip”, typically results in an application crash, or in more dire cases an operating system crash or even a terminal hardware reset. In the enterprise computing world, in the worst case you may lose a few hours’ work; now imagine this occurring in an embedded or industrial environment. In extreme cases we’re talking of launching a missile, or not; executing an incorrect trillion-dollar transaction, or not; or, increasingly relevant given the current self-driving revolution, invoking a vehicle’s brakes, or not! So what’s the solution?

In 1950, Richard Wesley Hamming invented an algorithm that makes it possible to correct bit errors. The Hamming code is the basis of what computer systems today call “ECC” (Error-Correcting Code). It protects against single-bit flips: the data that is read is the same as the data that was written, even if one of the bits actually stored has flipped to the wrong state. So why isn’t this employed ubiquitously? One word: cost.

The ECC functionality is typically implemented in the memory controller of the CPU. Whenever the CPU writes data to DRAM, additional ECC parity bits are generated for that data and stored alongside it. When reading the data back, the CPU runs the received data and its ECC parity bits through the Hamming algorithm and can correct a single-bit error.
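The write-then-check cycle described above can be sketched with the smallest Hamming code, Hamming(7,4), which protects 4 data bits with 3 parity bits. This is an illustrative toy, not the code used by any particular memory controller: the function names are hypothetical, and real controllers work on much wider words in hardware.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword laid out as positions 1..7:
    p1 p2 d1 p4 d2 d3 d4 (parity bits sit at the power-of-two positions)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(codeword):
    """Return (data_bits, syndrome). A non-zero syndrome is the 1-based
    position of the bit that was assumed flipped and has been corrected."""
    c = list(codeword)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s4
    if syndrome:
        c[syndrome - 1] ^= 1  # flip the suspect bit back
    return [c[2], c[4], c[5], c[6]], syndrome


if __name__ == "__main__":
    written = [1, 0, 1, 1]
    stored = hamming74_encode(written)
    stored[4] ^= 1  # simulate a single bit flip in the memory cell array
    read_back, syndrome = hamming74_decode(stored)
    print(read_back == written, syndrome)
```

Flipping any one of the seven stored bits still returns the original data. Flipping two bits does not: this plain code then "corrects" a third bit and returns wrong data, which is why the guarantee is limited to single-bit errors.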

Processors with ECC capabilities cost significantly more than standard processors. In addition, they need a wide memory interface – typically 72-bit wide – which can only be reached by using a large quantity of memory chips.
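Where the 72-bit figure comes from can be worked out in a few lines. The sketch below assumes the common arrangement of a 64-bit data bus protected by a Hamming code plus one extra overall parity bit (often called SECDED), and assumes DRAM chips that are each 8 bits wide; the function names are hypothetical.

```python
def secded_check_bits(data_bits):
    """Check bits needed for a Hamming code over data_bits, plus one
    overall parity bit so double-bit errors can be detected."""
    r = 1
    # A Hamming code needs 2**r >= data_bits + r + 1 to give every
    # single-bit error position (and "no error") its own syndrome.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1


def chips_for_bus(bus_width, chip_width=8):
    """Number of DRAM chips of chip_width bits needed to fill the bus."""
    return -(-bus_width // chip_width)  # ceiling division


if __name__ == "__main__":
    check = secded_check_bits(64)        # 8
    total = 64 + check                   # 72
    print(check, total, chips_for_bus(total), chips_for_bus(64))
```

Under these assumptions the 64-bit bus grows to 72 bits, and a design that needed eight 8-bit-wide chips now needs nine.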

ECC is an easy fit for servers, whose raison d’être puts 100 percent reliability before cost. Servers also have enough space for the wide memory bus, which they populate with 72-bit memory modules.

The consumer desktop and laptop markets, at the other end of the scale, naturally favour cost before reliability, so ECC has historically made few inroads outside the server market and scientific or financial computing.

Where does the embedded and industrial computing market fit on this scale? One would assume far closer to the server end than the consumer end, but that assumption is surprisingly wrong. Whilst you’d naturally expect any embedded or industrial system to be built for high reliability (simply because the word “industrial” implies “reliability”), only the very highest-end solutions actually implement ECC, and you could well argue these are technically industrial servers!

The DRAM used in the majority of industrial and embedded applications is not ECC-protected, and is therefore at risk of malfunctions and crashes as often as any standard PC.

Redesigning those applications to include ECC-capable CPUs and wide memory is not possible in the majority of cases. How could you fit a large memory module or nine DRAM components onto a small-form-factor industrial PC?

That’s all about to change with one company, Intelligent Memory, removing all obstacles to wider implementation with their new ECC DRAM components.

The Intelligent Memory ECC DRAMs integrate the error-correction code right at the place where the errors can happen: in the DRAM itself!

As ECC DRAMs are fully compatible with conventional DRAM memory chips, they can be used in any application as a simple drop-in replacement. Without any further hardware or software changes, the product then gains ECC correction functionality. Even the CPU does not need to be changed.

Operating temperature concerns are more than satisfied. Support for -40°C to +125°C operation even covers emerging automotive applications, which is exciting news for those developing tomorrow’s connected and self-driving vehicles.

Importantly, a major faux pas was avoided by not focusing purely on DDR3 technology. As noted earlier, the systems most desperately in need of this technology are the low- to mid-level ones, where DDR2 and even DDR1 are still heavily deployed.

Critically, the most fundamental stumbling block, unit cost, has been addressed. Implementing ECC DRAM technology now means only a very slight increase in cost, effectively adding cream to our earlier proverbial coffee rather than purchasing the entire coffee stand.

What this should herald is the era of universal implementation of “intelligent” memory. However, a risk remains. In the increasingly competitive market we’re in, unscrupulous manufacturers more concerned with a sale than with reliability will always have the option to undercut a vendor who does care about reliability. Such purely price-driven vendors won’t survive for long, but be careful in the interim and request ECC DRAM.

Rory Dear, European Editor/Technical Contributor