Many-core processing: Sharing the performance load for greater energy efficiency

November 1, 2011 OpenSystems Media

The future of multicore design requires more targeted processing, optimization, and differentiation in the design process for systems that contain different types of processors, as well as support for the requirements of the software ecosystem. Energy efficiency is a vital differentiator in the world of processing and will be the key driver for the future of computing.

For more than 60 years – from the early days of mainframe computers through the PC revolution of the 1980s and into today’s explosion of smart mobile devices – processor technology has always evolved to meet users’ expectations, at times driving unforeseen innovations in the computing industry. Given the diversity of new mobile devices coming to market on a daily basis, processor innovations continue to be a powerful force for change.

With the advent of mainstream mobile computing, processor architectures have shifted away from the traditional desktop model, in which performance was pursued regardless of the power required. Devices that need all-day or even multiday battery life demand a tighter energy envelope even as they push processor performance to new levels.

In the beginning, performance was king

The first consumer computers powered by microprocessors were simple, power-hungry, stationary devices that were tethered to their source of electricity (the common wall socket). This meant that microprocessors could be designed solely with performance in mind, which soon became the “holy grail” for developers.

Early PCs comprised a single-threaded CPU running a single application. Soon, these early 8-bit microprocessors grew to 16-bit and, eventually, 32-bit processing by the mid-1980s. Then the market started to see PCs capable of running multiple applications simultaneously. With performance rising as the number of transistors doubled in cadence with Moore’s Law, each new processor design offered the ability to develop new features and functionality, whether playing a DVD or editing the family album, which, in turn, whetted consumer appetites for more powerful devices.

Eventually, consumer demand for different form factors and the expectation of performance improvements pushed the demands on processors beyond what was achievable within a single core. At the same time, demand for mobile devices started to explode and, as it grew, so did the call for more energy-efficient processing.

When ARM was launched in 1990, a main objective of its founders was to create an energy-efficient processor architecture for handheld devices. Employing the RISC CPU architecture, ARM’s approach simplified instructions, streamlined task execution, and reduced the power required per instruction.

A few characteristics were critical to the development of a more energy-efficient microprocessor, the most important of which was an intense focus on limiting power consumption to the lowest possible levels. Keeping the power envelope to the smallest possible footprint not only increased battery life, but also limited the weight of the battery required to power the device, which reduced the bill of materials and kept overall costs down.

Multicore: Benefits beyond mobile

Today, the benefits of this high-performance, energy-efficient processing architecture are bearing fruit in devices such as digital televisions and set-top boxes, office equipment such as printers and copiers, and mobile devices such as tablets, portable gaming units, and smartphones.

Since the mid-2000s it has been accepted that building bigger and bigger CPUs to realize single-thread performance gains not only becomes increasingly difficult, but also runs counter to the energy-efficiency limitations of mobile devices. This is because exponentially more energy is required for every few percentage points gained in performance.

Multicore solutions can deliver higher performance at comparable frequencies to single-core designs while offering dramatic savings in terms of cost and power efficiency. Furthermore, multicore solutions can leverage cores with high transistor counts and optimize systems by powering them up only when needed. In essence, this can be thought of as intelligent load balancing. Not only does a system need to consider which processor is best suited to execute a specific task, but it also must consider the performance required of that task and assign it to the most power-efficient processor available.
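The load-balancing decision described above can be sketched in a few lines of Python. This is a minimal illustration only, not how any particular operating system or ARM scheduler works: the `Core` type, the `pick_core` function, the core names, and all performance and power figures are hypothetical values chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    name: str
    max_perf: float      # peak throughput in arbitrary work units per second (assumed)
    active_power: float  # power drawn while running, in watts (assumed)


def pick_core(cores, required_perf):
    """Return the core that meets required_perf at the lowest active power."""
    candidates = [c for c in cores if c.max_perf >= required_perf]
    if not candidates:
        raise ValueError("no core can meet the required performance")
    return min(candidates, key=lambda c: c.active_power)


# Hypothetical system: one small efficient core and one large fast core.
CORES = [
    Core("small", max_perf=1.0, active_power=0.1),
    Core("large", max_perf=4.0, active_power=1.0),
]

light_task = pick_core(CORES, 0.5)  # e.g. background e-mail synchronization
heavy_task = pick_core(CORES, 3.0)  # e.g. rendering a complex Web page
print(light_task.name, heavy_task.name)
```

With these assumed figures, the light task lands on the small core and the heavy task on the large one, so the large core only needs to be powered up when a task actually requires it.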

Using cores as needed while keeping others idle helps keep energy consumption as low as possible, with limited impact on performance. As tasks are distributed across multiple processor cores, an individual processor might not need to run at full capacity, allowing the voltage and frequency of a multicore processor to be lowered. This results in significant power savings relative to the system’s aggregate performance.
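A rough worked example shows why lowering voltage and frequency pays off. It uses the standard first-order model of dynamic CMOS switching power, P = C × V² × f, which the article itself does not state. The capacitance, voltages, and frequencies below are illustrative assumptions, and the sketch ignores static leakage and assumes the workload splits perfectly across two cores.

```python
def dynamic_power(capacitance, voltage, frequency):
    """First-order dynamic switching power in watts: P = C * V^2 * f."""
    return capacitance * voltage ** 2 * frequency


C_EFF = 1e-9  # effective switched capacitance in farads (assumed)

# One core running flat out: 1 GHz at 1.2 V (assumed operating point).
single = dynamic_power(C_EFF, voltage=1.2, frequency=1.0e9)

# Two cores sharing the same work: each at 500 MHz, which is assumed
# to permit a lower supply voltage of 0.9 V.
dual = 2 * dynamic_power(C_EFF, voltage=0.9, frequency=0.5e9)

print(f"single core: {single:.2f} W")
print(f"dual core:   {dual:.2f} W")
print(f"saving:      {100 * (1 - dual / single):.0f}%")
```

Under these assumptions the two slower cores deliver the same aggregate clock rate for roughly 0.81 W instead of 1.44 W. Because voltage enters the formula squared, the saving comes almost entirely from the voltage reduction that the lower frequency allows.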

To illustrate this idea with a common use case, consider today’s smartphones, which must be powerful enough to render a complex Web page and play gaming applications, often in parallel with basic e-mail synchronization and phone management functionality. With the ability to power up cores only when needed, multicore smartphones can deliver increased battery life compared to their single-core, full-throttle predecessors. Market demand for scalable performance has resulted in most current smartphones containing multicore CPUs, as well as GPUs with multiple cores found in many of today’s leading mobile video and gaming devices (see Figure 1).

Figure 1: Most smartphones contain multicore CPUs, leveraging the ability to power up cores only when needed.

A “many-core” approach to multicore processing shares the performance load across many smaller processors, such as the Cortex-A5, rather than running multiple single-thread workloads on one large single-core processor. Designers are increasingly deploying clusters of processors designed to work together, sharing data and tasks among caches or multiple instances of the same processor (see Figure 2).

Figure 2: In a many-core architecture, clusters of processors share data and tasks among caches or multiple instances of the same processor.

Many-core becomes even more interesting as smaller processors work together to deliver a combined performance level with lower power consumption than a larger processor multitasking the same workload. As previously mentioned, the costs associated with increased performance on a single thread are exponential; however, with multicore processing the cost becomes more linear in scale. Designers are using many cores to significantly reduce aggregate system costs.

As hardware designers begin to implement these many-core systems, software developers will need to produce code capable of using a many-core processing solution. Until then, devices must retain the ability to execute high-performance single-thread tasks. One example of a system that combines the high single-thread performance of multicore with the greater power efficiency of many-core is today's pairing of CPUs and GPUs, where the many-core GPU can deliver graphical computation using less power than the multicore CPU. Since the GPU remains coherent with the CPU and shares its caches, external memory bandwidth and performance demands on the CPU can be reduced. Languages such as OpenCL and CUDA are working to extend this approach to more general-purpose applications.
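To give a sense of what "code capable of using a many-core processing solution" means in practice, the sketch below splits one job into independent chunks that separate workers can process. It is an illustration of the decomposition only, not OpenCL or CUDA code: the function names and the pixel-brightening task are hypothetical, and it uses Python threads for simplicity, which in standard CPython will not actually speed up CPU-bound work.

```python
from concurrent.futures import ThreadPoolExecutor


def brighten(chunk, amount):
    """Process one independent slice of pixel values, clamping at 255."""
    return [min(255, p + amount) for p in chunk]


def split(data, parts):
    """Split data into `parts` contiguous chunks of near-equal size."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def brighten_parallel(pixels, amount, workers=4):
    """Hand each chunk to its own worker, then stitch the results back in order."""
    chunks = split(pixels, workers)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(brighten, chunks, [amount] * workers))
    return [p for chunk in results for p in chunk]


print(brighten_parallel([0, 50, 100, 150, 200, 250], 20, workers=3))
```

The key property is that no chunk depends on any other, so the work scales with the number of workers. Expressing workloads in this form is what allows them to be spread across many small cores.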

Optimizing for future performance

Our industry lies at a crossroads in balancing performance and power. By leveraging domain-specific processors and heterogeneous general-purpose computing, designers can optimize limited hardware resources and footprints. Optimizing designs and the design process across all types of multicore Systems-on-Chip (SoCs) also can achieve these gains.

While optimization might not get as much attention as multicore processing, it is equally important, especially in small-footprint applications with greater coherency challenges. Cache coherency is key to multicore computing applications, ensuring that the data stored in shared resources is properly maintained. Standards and specifications such as the AMBA 4 bus are encouraging steps toward providing system-level cache support across clusters of multicore processors, as well as maintaining prime performance and power efficiency in complex SoCs.

Future devices will continue to require more powerful processing performance, most likely under increasingly tight power constraints. By developing more targeted processing, optimization, and differentiation throughout the design process, developers can bring to market systems that not only support the many-core concept, but also incorporate software support.

John Goodacre is director of program management in ARM’s Processor Division. He has more than 20 years of experience in the engineering industry, including five years working for Microsoft as group program manager in the Exchange Server Group and as the manager of a team developing mobile phone software. John graduated from the University of York with a BSc in Computer Science.

ARM +44-1223-400-400 @ARMEmbedded
