Getting rid of the denominator: Looking beyond performance per watt in embedded systems

November 1, 2011 OpenSystems Media

Designers typically select CPUs with the highest power efficiency to deliver the most computing performance per watt. But is performance per watt really the right metric for selecting the CPU in an embedded system? Using an adaptive power management approach shifts the control variable for performance from heat to electrical power, enabling adjustments that help deliver maximum performance within the embedded power envelope.

If an embedded computing system is designed for a 15 W CPU, does it really matter how well the CPU performs at 1 W, especially if it lags on performance compared to other 15 W CPUs? This question is critically important for embedded system designers who need scalable performance for design reuse, as many low-power processor families don’t offer high-performance features (multiple cores, 64-bit processing, high-speed memory, large caches, virtualization, and hardware encryption).

It might seem heretical to focus on raw performance after all the effort designers have put into getting desktop-enamored CPU companies to focus on power efficiency. However, the competition between x86 and high-end ARM processors has forced embedded system vendors to once again differentiate on performance. In addition, with Intel seeing some multicore competition from other suppliers providing embedded x86 CPUs, embedded system vendors are turning to raw performance within a fixed power envelope as a differentiator.

While CPU architects will continue to promote power efficiency in the battle to sell billions of units to ultra-portable markets, most embedded designs require a different approach to CPU selection, as system designers focus on “do-not-exceed” power constraints for the diverse range of embedded applications. The system-level design constrains the amount of cooling and power available to the CPU, representing a single horizontal line on the power/performance graph. As long as a CPU can stay below that line, raw performance remains the preeminent differentiating factor in embedded designs.
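The difference between the two selection rules can be sketched in a few lines of Python. This is only an illustration: the `Cpu` type, the catalog entries, and every wattage and score below are hypothetical placeholders, not measurements of any real part.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Cpu:
    name: str
    tdp_watts: float
    perf_score: float  # arbitrary benchmark units (hypothetical)


def best_by_perf_per_watt(cpus: Sequence[Cpu]) -> Cpu:
    """Pick the CPU with the highest performance-per-watt ratio."""
    return max(cpus, key=lambda c: c.perf_score / c.tdp_watts)


def best_within_budget(cpus: Sequence[Cpu], budget_watts: float) -> Optional[Cpu]:
    """Pick the fastest CPU whose TDP stays at or below the do-not-exceed line."""
    eligible = [c for c in cpus if c.tdp_watts <= budget_watts]
    if not eligible:
        return None
    return max(eligible, key=lambda c: c.perf_score)


# Hypothetical catalog: made-up numbers chosen only to show the two rules diverging.
CATALOG = [
    Cpu("cpu_a", tdp_watts=1.0, perf_score=10.0),
    Cpu("cpu_b", tdp_watts=13.0, perf_score=60.0),
    Cpu("cpu_c", tdp_watts=15.0, perf_score=90.0),
    Cpu("cpu_d", tdp_watts=27.5, perf_score=150.0),
]

efficient_pick = best_by_perf_per_watt(CATALOG)
budget_pick = best_within_budget(CATALOG, budget_watts=15.0)
```

With these placeholder values, the efficiency rule selects the 1 W part, while the fixed-budget rule selects the fastest part that fits under the 15 W line.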

Closing the loop with power-controlled performance

Given the need to maximize performance while staying below the peak power consumption line, how can a system designer squeeze every last bit of performance from a CPU? Consider the fact that modern CPUs have performance headroom that can only be accessed with additional cooling. Instead of increasing CPU performance until you hit the power budget, think in terms of power consumption as a control system variable for performance.

CPU performance can be adjusted adaptively in a closed-loop control system that throttles back only when necessary, letting the CPU run as fast as the system power design allows. Although new to embedded processors, adaptive power management techniques have already been used with high-end CPUs for desktop and mobile designs, with Intel Turbo Boost being an obvious example.

Embedded CPU designs must meet system-level power constraints, such as the maximum chassis temperature for a passively cooled system or the maximum fan utilization for active cooling. In a non-adaptive system with performance-constrained CPUs, the designer assigns the CPU chip a power budget based on the Thermal Design Power (TDP). Designing for max TDP ensures that the cooling solution keeps the maximum transistor junction temperatures below manufacturer specifications. The thermal design needs plenty of margin because everything is based on worst-case estimates of the CPU workload and device characteristics. This static approach to CPU system design ultimately fails to render the best performance from the system power budget.

It’s not just about CPU heat when you have performance to burn

TDP is not necessarily a measure of heat dissipated by the CPU. Most embedded designers are familiar with adaptive power management techniques that rely on thermal diodes to automatically throttle back the CPU if on-die temperature exceeds a certain value. A more advanced adaptive thermal technique (TM3) adjusts voltage and clock rate to change Performance States (P-States) to keep the CPU operating at its maximum performance level. For many embedded systems, using a thermal threshold for adaptive power management might be the best solution. After all, the CPU is guaranteed to stay below a maximum temperature while operating at the highest possible performance level.
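A temperature-threshold policy of this general kind can be sketched as a single decision step. The function name, the index convention (a higher index means a faster P-State), and the hysteresis band are assumptions made for illustration; this is not the actual TM3 algorithm or any vendor's firmware.

```python
def thermal_step(index: int, die_temp_c: float, limit_c: float,
                 num_pstates: int, hysteresis_c: float = 5.0) -> int:
    """Return the next P-State index for one control step.

    Hypothetical convention: index 0 is the slowest state and
    num_pstates - 1 is the fastest.
    """
    if die_temp_c >= limit_c:
        # At or over the limit: drop one state, but never below the slowest.
        return max(index - 1, 0)
    if die_temp_c <= limit_c - hysteresis_c:
        # Comfortably below the limit: climb one state toward the fastest.
        return min(index + 1, num_pstates - 1)
    # Inside the hysteresis band: hold, to avoid oscillating between states.
    return index
```

Note that temperature is the only input here, so nothing in this loop bounds how much electrical power the CPU draws while it stays cool.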

However, what happens at the system level if the CPU keeps cranking up both the performance and power while extra CPU heat is conducted away (perhaps by an active cooling system that cycles a fan)? The CPU will happily take advantage of this extra thermal envelope but might exceed the overall system power budget. This would be an issue if the software load on the embedded system is higher than anticipated, potentially leading to a system with a hotter chassis, noisier fans, and so on. One solution for CPUs with a variable TM3 threshold would anticipate this thermal transfer and set the maximum CPU temperature to a lower value.

Using electrical power as the performance control variable

High-end CPUs already use another adaptive power method. The voltage regulator module provides an output signal (IMON for Intel CPUs) that the CPU reads to dynamically calculate the power on the CPU supply rails. With this information, electrical power (not heat) becomes the control variable for performance. The CPU dynamically adjusts P-States to deliver maximum performance for the CPU power envelope while ensuring that system power stays within design specs. With this adaptive design approach, every bit of performance headroom is always accessible to the application as the CPU constantly changes speed to make optimum use of the power envelope.
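As a rough sketch of this idea, the loop below steps through a table of P-States using measured power as the control variable. Everything in it is an assumption for illustration: the P-State table, the 13 W budget default, the step-up margin, and especially `modeled_power`, a toy stand-in for a real power reading derived from the regulator's current-monitor signal.

```python
P_STATES_GHZ = [0.8, 1.0, 1.2, 1.4]  # hypothetical table, slowest to fastest


def next_pstate(index: int, measured_watts: float, budget_watts: float,
                step_up_margin_watts: float = 1.0) -> int:
    """Choose the next P-State index from a power measurement."""
    if measured_watts > budget_watts and index > 0:
        return index - 1  # over budget: step down
    below_margin = measured_watts < budget_watts - step_up_margin_watts
    if below_margin and index < len(P_STATES_GHZ) - 1:
        return index + 1  # clear headroom: step up
    return index  # close to the budget: hold


def modeled_power(freq_ghz: float, load: float) -> float:
    """Toy power model (assumption): fixed floor plus a term linear in f and load."""
    return 2.0 + 9.0 * freq_ghz * load


def run(loads, budget_watts: float = 13.0, start_index: int = 2):
    """Simulate the closed loop over a sequence of workload levels (0.0 to 1.0)."""
    index = start_index
    trace = []
    for load in loads:
        watts = modeled_power(P_STATES_GHZ[index], load)
        index = next_pstate(index, watts, budget_watts)
        trace.append(P_STATES_GHZ[index])
    return trace


trace = run([0.8, 0.8, 1.0, 1.0])
```

Under this toy model, the CPU climbs to 1.4 GHz while the lighter load leaves headroom, then settles back to 1.2 GHz once the heavier load pushes the reading past the budget.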

The extra performance headroom can make a big difference, as illustrated in a technical white paper from VIA Technologies (www.via.com.tw/en/products/processors/nanoX2/whitepaper.jsp). This study used SPEC CPU2000 to test the performance of multicore CPUs from both VIA and Intel. The 13 W dual-core Intel Atom D525 ran at a fixed 1.8 GHz (fastest available dual-core) frequency, while the 13 W dual-core Nano X2 ran at a base frequency of 1.2 GHz.

The VIA systems tested in this study support Adaptive Overclocking, designated by a plus sign next to the frequency. In this case, the 1.2+ GHz system shifted up in P-States to 1.4 GHz as long as the voltage regulator module signal indicated power consumption below the 13 W TDP. With a 3-issue design that handles 50 percent more x86 instructions per clock than the 2-issue Atom, plus faster memory, the VIA CPU provided a clear advantage as it came closer to the Atom's clock rate at the same TDP.

This report highlights a nearly 40 percent performance advantage for Nano X2 compared to Atom, even though the Atom's clock rate is 50 percent higher than the Nano X2's base frequency. Without the benefit of hyper-threading, the dual-core Nano X2 roughly matched Atom's performance on four threads.
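The clock-rate gaps follow directly from the frequencies quoted above. Note that 1.8 GHz being 50 percent higher than 1.2 GHz is the same relationship as 1.2 GHz being about 33 percent lower than 1.8 GHz. The helper name below is hypothetical; the three frequencies are the only inputs taken from the study.

```python
ATOM_GHZ = 1.8        # fixed Atom D525 frequency
NANO_BASE_GHZ = 1.2   # Nano X2 base frequency
NANO_BOOST_GHZ = 1.4  # Nano X2 frequency after shifting up in P-States


def percent_lower(value: float, reference: float) -> float:
    """How far `value` sits below `reference`, as a percentage of `reference`."""
    return (reference - value) / reference * 100.0


atom_over_base = (ATOM_GHZ / NANO_BASE_GHZ - 1.0) * 100.0         # Atom vs. Nano base
base_gap = percent_lower(NANO_BASE_GHZ, ATOM_GHZ)                 # Nano base vs. Atom
boost_gap = percent_lower(NANO_BOOST_GHZ, ATOM_GHZ)               # Nano boosted vs. Atom
boost_gain = (NANO_BOOST_GHZ / NANO_BASE_GHZ - 1.0) * 100.0       # boost over base
```

Rounded to one decimal place, the boost raises the Nano X2 clock by about 16.7 percent and narrows its deficit against the Atom from about 33.3 percent to about 22.2 percent.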

Cranking up performance headroom even higher

Why stop with only two embedded CPU cores if you have the power budget and want to get more performance headroom for four or more threads? With power consumption as the control variable, the system can maximize performance by making sure all processors run at the highest possible clock rate.

The study included a 27.5 W VIA QuadCore, also running at 1.2+ GHz. As shown in the graph of SPECrate performance (Figure 1), extra CPU cores make a big difference when using this benchmark on four threads, resulting in an 80 percent performance advantage over the hyper-threaded dual-core Atom at 1.8 GHz. With competition from quad-core ARM-based processors, x86-based embedded system vendors now have a way to differentiate on performance.

Figure 1: Testing shows SPECrate performance increasing with more physical cores, compared to a hyper-threaded CPU.

Figure 2 shows a board from VIA’s Embedded Processor Division that uses Adaptive Overclocking. Because all of VIA’s CPUs share a common pin-out, the systems can scale from one to four CPUs, matching the number of application threads to the number of CPU cores. Many of the configurations support fanless operation, including dual-core in the Eden X2 family.

Figure 2: VIA’s M900 board supports dynamic power management to take advantage of performance headroom within a fixed power envelope.

Adaptive power management for embedded designs

In terms of performance per watt, the Intel Atom is likely the most power-efficient x86 CPU and is closing fast on ARM-based processors in the battle to compute tasks with the least amount of energy. While the title of this article is somewhat provocative, compute efficiency obviously can be extended to any power level. However, most CPU vendors want to be in cloud-based servers and smartphones, leaving the embedded designer to repurpose suboptimal CPUs.

In this broad realm of embedded power constraints, the use of adaptive power management offers a way to bring higher-performance architectures into the embedded power envelope, outperforming more efficient architectures that have been pushed to their limits. With new embedded CPU choices, the power denominator is no longer a primary constraint, and embedded CPU systems can once again focus on maximizing the numerator to differentiate their systems with the best performance.

J. Scott Gardner is an engineer and consultant who began his career 25 years ago designing microprocessor-based systems. He has served in various marketing and management roles in the semiconductor industry, most notably during 10 years at IDT. He has recently held executive staff or board positions at several start-ups and continues to consult part time. Scott received a BS in Electrical Engineering from the University of Kansas and an MBA from Santa Clara University.

Advantage Engineering
512-779-9561
gardner@advantage-engineer.com
www.advantage-engineer.com

 

VIA Technologies
Linkedin: www.linkedin.com/company/via-technologies-inc
Facebook: www.facebook.com/viaembedded
Twitter: @VIAmkt
www.viaembedded.com 

J. Scott Gardner (Advantage Engineering)