Developing applications for parallel architecture CPU chips

By Paul Fischer
TenAsys
and Craig Szydlowski
Intel

New-generation dual-core processors not only raise the performance bar, but also make it easier to dramatically improve system functionality and hold down costs while increasing performance. The main challenge for designers is how to partition applications to make the best use of the dual-core CPUs. A ready solution to the application partitioning problem exists for real-time system developers, who can run deterministic real-time software and human-directed software on independent cores, leaving real-time processes relatively unencumbered by non-real-time applications. Giving real-time control application code its own processing core greatly improves scheduling jitter and response time, yielding a more reliable, higher-performance system.

Why move to dual-core?
Delivering greater CPU performance to applications has traditionally been accomplished by cranking up the CPU clock speed at the expense of greater power dissipation. Ever-increasing clock speed is a major detriment for many embedded applications due to the rising cost of removing the additional heat. Furthermore, simply increasing processor speed does not automatically result in improved deterministic system response.

For example, using a faster processor may not significantly change the worst-case response time to an event; increased speed can decrease the average jitter (the spread and intensity of variations in response to an event), but will not eliminate jitter, especially worst-case jitter. Worst-case jitter is frequently caused by system elements that do not scale linearly with CPU clock speed, so it is hard to predict how much a faster processor will reduce it, if at all.

That does not mean that improving the performance of a system is not useful. Faster CPU cycles increase the amount of work done within a given cycle time (that is, within a particular sampling interval). Because of this, a faster clock increases the complexity of the data acquisition and control algorithms that one can implement in software. However, bounded determinism is still necessary to ensure that a stable and accurate system can be deployed regardless of the compute performance of that system.

Instead of pushing chips to run faster, CPU manufacturers are now driving a major evolution in computing technology by adding more execution cores and cache memory to provide better performance at lower power. Dual-core CPUs, for example, can be clocked at slower speeds and supplied with lower voltage to yield much improved performance per watt.

10:1 jitter reduction – single-core versus dual-core

To determine the jitter reduction provided by a dual-core CPU platform, TenAsys conducted the following tests on a Windows XP system running INtime 3.0. The hardware platform was an Intel dual-core processor system running at 3.0 GHz. Two tests were performed on this single piece of hardware, with the second core disabled for the single-core test and enabled for the dual-core test. The reduction in jitter between these two scenarios is almost exactly 10 to 1.

Single-core test: The machine was configured via the BIOS to disable the dual-core feature, forcing it to boot as a single-core CPU so that INtime and Windows were sharing a single CPU core. The machine was idle during this test to keep jitter to an absolute minimum. More than 5 million timer interrupt measurements were taken during the test. The longest observed jitter measurement was approximately 22 microseconds.

Dual-core test: The dual-core test was performed on the identical machine, with the second core enabled. One core was dedicated to running the INtime RTOS and the second core was dedicated to Windows. The machine was not idle for this test, but was continuously playing audio with Windows Media Player and displaying the Windows Media visualization effects. During this test more than 6 million timer interrupt measurements were made. The longest jitter measurement was approximately 2.2 microseconds.
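The kind of measurement behind these two tests can be sketched in a few lines. This is an illustration only, not the TenAsys test harness: the function name `worst_case_jitter_us` and the sample timestamps are hypothetical, and jitter is assumed here to mean the absolute deviation of each observed timer interval from the nominal period.

```python
def worst_case_jitter_us(timestamps_us, period_us):
    """Return the largest deviation (in microseconds) of any observed
    interval from the nominal timer period.

    Assumption: jitter is the absolute difference between an observed
    interval and the nominal period.
    """
    intervals = [b - a for a, b in zip(timestamps_us, timestamps_us[1:])]
    if not intervals:
        raise ValueError("need at least two timestamps")
    return max(abs(interval - period_us) for interval in intervals)


# Hypothetical timer-interrupt timestamps for a 1,000-microsecond period.
samples = [0, 1000, 2002, 3001, 4001]
worst = worst_case_jitter_us(samples, 1000)

# Ratio of the worst-case figures reported in the article:
# about 22 microseconds (single-core) versus about 2.2 (dual-core).
reduction = 22.0 / 2.2
```

A real test would collect millions of timestamps from a hardware timer, as the article describes, rather than a handful of hand-written values.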

Harnessing additional CPU performance
Multicore processors provide an extra degree of freedom by allowing application software to provision tasks among multiple execution cores for optimal performance. For those embedded applications that would benefit from a faster and more repeatable real-time response, it is possible to run real-time tasks on a dedicated execution core without interference from other tasks that would otherwise compete for CPU resources. The approach can significantly improve the determinism of the real-time response by reducing its interaction with less time-critical system functions.
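As a loose analogy for dedicating a core to one task, a process on a general-purpose OS can restrict itself to a single core. The sketch below assumes Linux, where Python exposes `os.sched_getaffinity` and `os.sched_setaffinity`; the helper name `pin_to_one_core` is hypothetical. This is not the dual-OS approach the article describes: affinity keeps the process on one core but does not stop the OS from scheduling other work there, so it does not by itself provide hard real-time determinism.

```python
import os


def pin_to_one_core():
    """Pin the calling process to one of its allowed cores.

    Returns the chosen core number, or None on platforms that do not
    expose the Linux affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None  # Affinity API unavailable on this platform.
    allowed = sorted(os.sched_getaffinity(0))  # Cores this process may use.
    core = allowed[-1]  # Arbitrary choice: the highest-numbered allowed core.
    os.sched_setaffinity(0, {core})  # 0 means "the calling process".
    return core
```

Calling `pin_to_one_core()` at startup would keep the time-sensitive process on one core, leaving the remaining cores to other work.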

The importance of managing the execution of real-time and non-real-time functions is readily apparent in the digital factory, which strives to better integrate assembly line process planning, evaluation, and continuous improvement. As a result, factory floor systems must be more closely coupled with non-real-time processing applications such as Enterprise Resource Planning (ERP). By allowing application software to run real-time processes on one execution core, the other core can be available to service the requirements of the digital factory without adversely affecting the equipment’s control tasks.

A typical example of this application partitioning is a robotic gantry system that relies heavily on continuous user feedback and monitoring during operation but also requires precise deterministic control to safely and quickly move tons of equipment over long distances, such as in a warehouse or loading dock. User feedback and monitoring is handled on the first core while the control application is on the second core.

Controlled coexistence saves costs
Balancing the flexibility of a general-purpose OS with the requirements of an embedded application in a single OS environment usually ends in too much compromise for applications that require hard real-time determinism. Before the advent of multicore processors, this compromise usually resulted in two distinct computing platforms being implemented in the system: one for the general-purpose OS to handle the Graphical User Interface (GUI), third-party applications, and enterprise-related functions, and a second to host a dedicated Real-Time Operating System (RTOS) to manage time-critical processing. These compromises lead to communication overhead, larger physical system size, excess heat generated, and additional cost.

Systems that use dual-core processors can be constructed with a single processor board containing a single set of central processor resources (memories, disks, and so on). For OEMs, this could reduce system cost and size by up to 40 percent when compared to high-performance real-time systems implemented with multiple linked computer systems. The dual-core system package is much simpler and more reliable than segregated systems, each of which has its own power supply, mass storage devices, system packaging, and high-speed communication interconnect hardware.

This implies that high-performance real-time systems such as complex motion and vision systems can be implemented with a minimal set of hardware. More broadly, it will cause a substantial change in the way that real-time system designers architect their hardware.

From a global software architecture perspective, incorporating multiple OSes in a system is simplified using a virtual machine approach, with different operating systems running on different virtual machines. Whether virtual machines are only logically unique as in the case of a real-time and human-directed OS sharing a single CPU, or physically distinct as in the case of real-time and human-directed OSes running on different cores in a multicore chip, the basic software architecture is the same (see Figure 1).

Figure 1: The basic software architecture is the same whether the virtual machines share a single CPU or run on different cores in a multicore chip.

Dual-OS virtual machine systems exist today and have been deployed with real-time applications operating at cycle times of 500-1,000 microseconds on the current crop of single-core desktop and industrial motherboard platforms, such as uniprocessor and hyper-threaded Intel Pentium 4 class processors running at 1-3 GHz. However, some applications demand faster cycle times, and the optimum solution is a multicore processor, not necessarily a faster processor.

Dual-core optimizations for dual-OS systems
When two virtual machines share a CPU as in single-core processor designs, they must maintain a full machine context or a partial context in the case of a hyper-threaded core to switch between the two operating systems. Saving and restoring these contexts results in event response latency and cycle time compromises. Such compromises can contribute 10-30 microseconds to the worst-case timer interrupt jitter. For a cycle time of 1 millisecond, 10-30 microseconds of worst-case interrupt latency represents a jitter variation of only a few percent.

Shorter cycle times translate into higher-bandwidth controllers, a desirable trait because it leads to improved performance and throughput. Unfortunately, 10-30 microseconds of timer jitter is a significant number for cycle times of 50-200 microseconds. If jitter becomes a considerable percentage of the cycle time, it adversely affects the stability and quality of the control algorithm, which is inversely related to the amount of cycle time variability. Cycle time variations will degrade the stability margin of a closed-loop control system, especially in naturally unstable systems like position-feedback motion control loops.
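The arithmetic behind these figures is easy to check. The sketch below, with a hypothetical helper named `jitter_percent`, expresses 10-30 microseconds of jitter as a percentage of several cycle times: at 1,000 microseconds it is 1 to 3 percent, at 200 microseconds it is 5 to 15 percent, and at 50 microseconds it is 20 to 60 percent.

```python
def jitter_percent(jitter_us, cycle_us):
    """Express worst-case jitter as a percentage of the cycle time."""
    return 100.0 * jitter_us / cycle_us


# Jitter range quoted in the article for two OSes sharing one core.
JITTER_LOW_US, JITTER_HIGH_US = 10, 30

for cycle_us in (1000, 200, 50):
    low = jitter_percent(JITTER_LOW_US, cycle_us)
    high = jitter_percent(JITTER_HIGH_US, cycle_us)
    print(f"cycle {cycle_us} us: jitter is {low:.0f}% to {high:.0f}% of the cycle")
```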

Dedicating one CPU to the RTOS on a multicore platform allows virtually 100 percent of that core’s CPU instruction cycles to be available for running real-time threads. Figure 2 shows an example of one of the newest dual-core processor chips, the 2.0 GHz Intel Core Duo processor. The CPU cycles of the remaining core(s) become the exclusive property of the General-Purpose (GP) OS, and are managed according to the GPOS’s virtual machine architecture. At the highest level this configuration is an Asymmetric MultiProcessing (AMP) system. When three or more cores are available, and only one CPU core is dedicated to the RTOS, the GPOS runs as a Symmetric MultiProcessing (SMP) system on the remaining CPU cores.

Figure 2: The 2.0 GHz Intel Core Duo processor, an example of one of the newest dual-core processor chips.

Contention for key CPU resources such as pipelines, cache, and the floating-point unit is avoided in a dedicated RTOS core system. Coordination between the RTOS core and the GPOS is accomplished using built-in interprocessor interrupt mechanisms, eliminating inter-OS context switch times. On a multicore dual-OS system, real-time interrupt latencies are reduced by an order of magnitude, from 10-30 microseconds to 1-3 microseconds. Loop cycle times in the 50-200 microsecond range execute with very high precision. The net result is an order of magnitude improvement in quality and bandwidth for control algorithms deployed on a dual-OS system.

In addition, a dual-core CPU supporting both a real-time OS and a GPOS can continue to run real-time processes even if the GPOS crashes. This is a critical requirement for hard (that is, mission-critical and time-critical) real-time applications.

How does the dual-OS trick work?
Isolation can be implemented by creating a separate hardware task environment to contain the real-time kernel and all of its processes, including real-time code, variables, I/O, and interrupts. This hardware task environment is constructed by building separate descriptor tables, page tables, and other data structures specific to the RTOS and independent of those maintained by the GPOS. Finally, the GPOS, its drivers, and all its applications are encapsulated by the real-time kernel and configured to run as the lowest-priority task in the real-time priority scheme.

To make this work, the GPOS's Hardware Abstraction Layer (HAL) must be hooked to ensure that shared hardware is properly managed between the two operating systems.

In the case of Microsoft Windows, this virtualization technique does not require a special version of the OS. Any standard desktop, embedded, or server distribution of Windows can be used. It is also independent of service packs. Windows is unaware that it is running in a virtual or encapsulated environment, so standard off-the-shelf applications can be used to deploy the enterprise-level portion of the system.

Windows applications can use a special interface library to synchronize with and transfer data to real-time processes. Because both operating systems reside on the same physical machine, it is possible to create shared-memory blocks for quickly transferring large amounts of data. Thus, communication overhead can be brought to an absolute minimum when real-time and nondeterministic processes share data in situ via RAM.
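The shared-memory idea can be illustrated with Python's standard `multiprocessing.shared_memory` module. This is a generic sketch, not the interface library mentioned above: the record layout, the helper names, and the single-process demonstration are all assumptions made for illustration, and no synchronization between writer and reader is shown.

```python
import struct
from multiprocessing import shared_memory

# Hypothetical record layout: a 32-bit sequence number followed by a
# 64-bit floating-point value, little-endian with no padding.
RECORD = struct.Struct("<Id")


def write_sample(buf, seq, value):
    """Pack one record at the start of the shared buffer."""
    RECORD.pack_into(buf, 0, seq, value)


def read_sample(buf):
    """Unpack the record at the start of the shared buffer."""
    return RECORD.unpack_from(buf, 0)


def demo():
    """Write a record through one handle and read it back through another."""
    block = shared_memory.SharedMemory(create=True, size=RECORD.size)
    try:
        write_sample(block.buf, 7, 12.5)
        # A second process would normally attach using the block's name;
        # here the same process attaches to keep the sketch self-contained.
        peer = shared_memory.SharedMemory(name=block.name)
        try:
            return read_sample(peer.buf)
        finally:
            peer.close()
    finally:
        block.close()
        block.unlink()  # Release the block once no one needs it.
```

Because the data is read in place from RAM, no copy over a network or bus is needed, which is the source of the low overhead the article describes. A real system would also need a signaling mechanism so the reader knows when a record is complete.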

In such a dual-OS system, the RTOS also must meet some key requirements. Consistent with the virtual machine approach, whereby the operating system environments are isolated, it is critical that the RTOS not be designed as an extension to the general-purpose OS, but as a parallel entity. The RTOS must provide communication mechanisms, preferably hardware enforced, that make interprocess coordination and message-passing quick and easy to perform.

A valuable consideration is to use an RTOS for which the real-time applications development tools run on the parallel GPOS. This has the advantage of allowing users to simplify software development and reduce software maintenance costs.

Summary

Dual-core processor chips promise to deliver greater performance and lower power consumption to embedded computing systems. Among their benefits is the ability to increase the performance and real-time responsiveness of embedded applications while keeping system cost and complexity under control. The potential software complexity of implementing dual-core systems is eased by basing the processor usage models around a virtual machine approach and by hosting the real-time and general-purpose, human-directed portions of the system on dedicated processor execution cores.

Paul Fischer is a senior technical marketing engineer at TenAsys Corporation. Paul’s experience with INtime goes back to 1997, when the product was first introduced. He has more than 20 years of experience with real-time and embedded systems in a variety of engineering and marketing roles. Fischer has an MSE from UC Berkeley and a BSME from the University of Minnesota.

Craig Szydlowski is a technical marketing engineer at Intel, with 16 years of experience in the embedded group. Craig has a BSEE from Yale University and an MBA from the Wharton School.

To learn more, contact Paul or Craig at:

TenAsys Corporation
1600 N.W. Compton Drive, Suite 104
Beaverton, OR 97006
Tel: 503-748-4720
Fax: 503-748-4730
E-mail: info@tenasys.com

Intel Corporation
5000 W. Chandler Blvd., MS CH6-236
Chandler, AZ 85226
Tel: 480-552-3264
Fax: 480-552-8939
E-mail: craig.p.szydlowski@intel.com