Making multicore CPUs work in embedded communications designs

November 9th, 2009

As network traffic gets more voluminous, diverse, and unpredictable, the solutions that used to work well are being overtaxed. A new heterogeneous multicore architecture comes to the rescue.

Traffic volume in both enterprise and carrier networks continues to rise exponentially, with link speeds scaling from 10 Gbps to 40 Gbps and likely to reach 100 Gbps in the near future. This rapid growth has been fueled by a combination of more network users, more devices and endpoints (like PCs, servers, mobile phones, IP phones, and IP set-top boxes), more applications carried by the converged network (such as VoIP, IPTV, P2P, Web 2.0, and network-attached storage), and greater bandwidth demands by those applications.

Communications equipment manufacturers must build products that support application- and service-specific requirements. This involves meeting today’s performance needs with integrated security capabilities, content awareness, and the flexibility and programmability to handle the long list of applications ranging from the evolutionary to the yet unknown.

Designing intelligent data plane applications

Enterprise and carrier networks contain numerous devices that provide a myriad of network and security functions including Deep Packet Inspection (DPI), test and measurement, service assurance, intrusion prevention, data loss prevention, firewall, load balancing, and many others. Regardless of the specific function these devices perform, they often share several common characteristics.

First and foremost, these networking devices forward traffic through the network based on much more than basic L2-L3 information. Their primary purpose is running a data plane application that uses information across L2-L7 either to make intelligent, content-based forwarding decisions or to analyze and assess flow quality. These devices are essentially hosts for applications that are fully involved in packet forwarding. These applications are computationally intense and have generally been written for x86 architectures. Because these applications are deployed in-line, they must not reduce network throughput or add latency.

To address the many places within enterprise and carrier networks where data plane applications reside, manufacturers offer a scalable range of products optimized for price, performance, interface types and density, and reliability (see Figure 1). These products range from fixed-configuration devices on the low end to configurable appliances in the mid range and modular chassis on the high end. A common engineering goal is to regularly increase total performance across the product line. Custom application development per hardware device can be avoided by implementing a single design across the entire product line.

Figure 1: Products for intelligent data plane applications range from fixed-configuration devices to modular chassis.

Engineering challenges: Performance and intelligence

Designers of high-performance, intelligent data plane applications who are considering using multicore CPUs will face numerous product requirements that introduce significant engineering challenges, including:

·    High data rates: Applications must be able to operate at 10 Gbps today and are rapidly moving to 40 Gbps and beyond.

·    Flow-based: Processing at those speeds must be stateful rather than packet-oriented, requiring hardware to be flow-aware for millions of simultaneous flows.

·    Non-uniform traffic: High packet rates from millions of simultaneous flows create non-uniform multiplexed data, which degrades cache effectiveness, memory utilization, and I/O performance for the application on the host CPU.

·    Low latency: In-line networking applications must not introduce significant latency that affects real-time application performance. However, high data rates from non-uniform traffic can starve host processor cycles and increase system latency.

·    Integrated security: Most network and security applications require some form of security processing. The majority of the traffic requires acceleration for security processing, making look-aside security coprocessors highly inefficient. Dedicated security processing must now be tightly integrated within the data plane to meet security requirements for all traffic without decreasing performance or increasing latency.

·    DPI: With forwarding decisions moving beyond simple L2-L3 criteria, being application- and content-aware is paramount. At very high data rates, DPI also creates a new class of processor workload that can tax the highest-performing general-purpose CPUs. External regular expression engines are often required.

·    Flexibility and adaptation: With L2-L3 protocol evolution as well as a near-constant rate of change in L4-L7 applications, services, and protocols, hardware designs for intelligent data plane applications must be highly programmable to provide the flexibility required to adapt rapidly to these changes.

·    Virtualization-aware: The explosion of task-specific network and security appliances within networks has created a significant cost, power, and operational challenge. Many network operators are taking advantage of highly virtualized solutions to reduce the number of network devices.
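To make the flow-based requirement above concrete, here is a minimal sketch of a flow table keyed by the usual 5-tuple. The names `FlowKey`, `FlowState`, and `FlowTable` are hypothetical and chosen only for illustration; a real data plane would hold this state in hardware-assisted tables sized for millions of flows, not in a Python dictionary.

```python
from dataclasses import dataclass
from typing import Dict, NamedTuple


class FlowKey(NamedTuple):
    """5-tuple identifying a flow (hypothetical layout for illustration)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int  # IP protocol number, e.g. 6 = TCP, 17 = UDP


@dataclass
class FlowState:
    """Per-flow counters carried from one packet to the next."""
    packets: int = 0
    octets: int = 0


class FlowTable:
    """Keeps per-flow state so decisions can use flow history, not one packet."""

    def __init__(self) -> None:
        self._flows: Dict[FlowKey, FlowState] = {}

    def update(self, key: FlowKey, length: int) -> FlowState:
        # Create the entry on the first packet of a flow, then accumulate.
        state = self._flows.setdefault(key, FlowState())
        state.packets += 1
        state.octets += length
        return state

    def __len__(self) -> int:
        return len(self._flows)


table = FlowTable()
web = FlowKey("10.0.0.1", "192.0.2.7", 40000, 443, 6)
dns = FlowKey("10.0.0.2", "192.0.2.53", 53000, 53, 17)
for key, length in [(web, 1500), (dns, 80), (web, 400)]:
    table.update(key, length)
```

Three packets arrive here but only two flow entries are created, which is the difference between packet-oriented and stateful processing.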

In light of these requirements, engineers ultimately face three challenges when using general-purpose multicore CPUs for intelligent data plane applications. The first is to make the general-purpose CPU as computationally effective as possible.

Regardless of the number of cores or the speed at which they operate, when taking into account all of the requirements, designers must answer questions such as: What is the system’s true performance? How many individual instructions can be applied to each packet? How does one partition the various workloads associated with the full set of requirements to maximize real-world performance?
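The question of how many instructions can be applied to each packet can be bounded with simple arithmetic. The sketch below assumes minimum-size 64-byte Ethernet frames, the standard 20 bytes of preamble and inter-frame gap per frame, and a hypothetical host with 8 cores at 2.5 GHz; the core count and clock rate are illustrative assumptions, not figures from any particular product.

```python
ETHERNET_OVERHEAD_BYTES = 20  # 8-byte preamble/SFD + 12-byte inter-frame gap


def packets_per_second(link_bps: float, frame_bytes: int) -> float:
    """Maximum frame rate on an Ethernet link for a given frame size."""
    bits_on_wire = (frame_bytes + ETHERNET_OVERHEAD_BYTES) * 8
    return link_bps / bits_on_wire


def cycles_per_packet(link_bps: float, frame_bytes: int,
                      cores: int, clock_hz: float) -> float:
    """Aggregate CPU cycles available per packet if every core is fully used."""
    return cores * clock_hz / packets_per_second(link_bps, frame_bytes)


for gbps in (10, 40):
    budget = cycles_per_packet(gbps * 1e9, 64, cores=8, clock_hz=2.5e9)
    print(f"{gbps} Gbps, 64-byte frames: {budget:.0f} cycles per packet")
```

Under these assumptions the budget is roughly 1,344 cycles per packet at 10 Gbps and 336 at 40 Gbps, and that is an upper bound: every cache miss or stall is paid for out of the same budget.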

The second concern is memory efficiency. Achieving the highest throughput with the lowest latency depends most on memory bandwidth, cache effectiveness, and the avoidance of stalls that waste processor cycles.

Finally, power efficiency plays an important role in overall system design in terms of the performance-per-power ratio.

The solution: A heterogeneous multicore architecture

Meeting the challenges network equipment vendors face in delivering high-performance, intelligent, and programmable designs requires a new multichip, multicore heterogeneous processor architecture.

This architecture couples a high-performance programmable networking data plane optimized for L2-L7 packet processing and virtualized x86 multicore CPUs over a virtualized PCI Express interface. Designs based on this architecture can enable equipment providers to deliver high-performance, flexible, and field-programmable systems that are 4x more efficient than x86 multicore solutions with standard network interface cards.

From the narrow view of application development and hosting, no other processor architecture is more widely adopted or better suited than x86. It provides unmatched options in terms of price, performance, power, continuity of supply, and innovation, including critical areas such as virtualization. But the compounded requirements of high performance, low latency, and stateful flow processing with application- and content-aware L2-L7 DPI prevent homogeneous multicore processors from scaling beyond a few gigabits of real-world application performance.

A heterogeneous multicore architecture has three goals (see Figure 2). The first goal is to optimize the general-purpose multicore processor, allowing it to focus on the workloads it is best suited for. In these embedded communications examples, workloads include hosting intrusion prevention, firewall, data loss prevention, test and measurement, or other similar applications. These applications can take advantage of single-core performance increases, additional performance from multicore processors, and even further benefits from virtualization. This allows developers to call upon a wide range of standard development tools and human resources for rapid feature development.

Figure 2: A heterogeneous multicore architecture focuses on the workloads a general-purpose multicore processor is built to handle.

The second goal is to introduce a dedicated set of smaller multicore coprocessors optimized to offload burdensome workloads from the general-purpose multicore CPUs. In this architecture, the multicore flow processor de-multiplexes the non-uniform traffic at line rate and provides fine-grain packet and flow classification. Additional lower-level network-processing functions such as TCP termination and SSL offload can be removed from the general-purpose CPU. Granular flow processing, DPI, and security processing can also be offloaded and performed in-line on all packets. The traffic can then be cleanly structured for transmission to the appropriate general-purpose core for application processing, thereby increasing host performance. Additional policy-based load balancing of classified and structured flows further improves host processor efficiency.
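The classification and load-balancing step described above can be sketched in software, although the point of the architecture is that it runs on the flow processor rather than the host. This sketch assumes two toy payload signatures and a CRC-based hash; `SIGNATURES`, `classify`, and `core_for_flow` are hypothetical names, and real DPI rule sets and hardware hash functions are far more elaborate.

```python
import re
import zlib
from typing import Tuple

# src IP, dst IP, src port, dst port, IP protocol number
FlowKey = Tuple[str, str, int, int, int]

# Hypothetical signatures for illustration only.
SIGNATURES = {
    "http": re.compile(rb"^(GET|POST|HEAD) \S+ HTTP/1\.[01]"),
    "sip": re.compile(rb"^(INVITE|REGISTER) sip:"),
}


def classify(payload: bytes) -> str:
    """Return the first application label whose signature matches the payload."""
    for label, pattern in SIGNATURES.items():
        if pattern.search(payload):
            return label
    return "unknown"


def core_for_flow(key: FlowKey, num_cores: int) -> int:
    """Map a flow to a core with a stable hash so the flow stays on one core."""
    digest = zlib.crc32(repr(key).encode())
    return digest % num_cores


key = ("10.0.0.1", "192.0.2.7", 40000, 80, 6)
label = classify(b"GET /index.html HTTP/1.1\r\nHost: example.com\r\n\r\n")
core = core_for_flow(key, num_cores=8)
print(f"flow classified as {label}, dispatched to core {core}")
```

Because the hash depends only on the flow key, every packet of a flow reaches the same general-purpose core, which keeps that flow's state warm in one cache. This simple hash treats the two directions of a connection as different flows; keeping both on one core would need a symmetric key.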

The third goal is to link these two workload-optimized processor domains via a high-performance, virtualization-aware communications path. Zero-copy drivers allow for fast and efficient big-block transfers of data from the I/O to multiple cores, virtual machines, or virtual endpoints. The two multicore processor domains are linked by PCI Express 2.0’s I/O Virtualization (IOV) and can be further enhanced with Intel’s Virtualization Technology for Directed I/O (VT-d) support. This high-performance, network I/O-aware communications path provides the final link between the multicore network-optimized flow processors and the general-purpose multicore application processors.
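The idea behind zero-copy transfers can be shown in miniature: consumers receive references into one large receive buffer instead of private copies of each packet. The sketch below uses Python's `memoryview` purely as an analogy; it is not how a PCI Express driver is written, and the buffer contents and packet lengths are made up.

```python
from typing import List

# One receive buffer holding three back-to-back packets (contents are made up).
rx_buffer = bytearray(b"AAAA" + b"BBBBBB" + b"CC")
lengths = [4, 6, 2]


def split_views(buffer: bytearray, lengths: List[int]) -> List[memoryview]:
    """Return one view per packet; each view aliases the buffer, no bytes are copied."""
    view = memoryview(buffer)
    views, offset = [], 0
    for n in lengths:
        views.append(view[offset:offset + n])
        offset += n
    return views


packets = split_views(rx_buffer, lengths)
rx_buffer[0] = ord("Z")  # a write to the buffer is visible through the view
print(bytes(packets[0]))
```

The write to `rx_buffer` shows up through `packets[0]`, confirming that the views share storage with the buffer. Handing such references to cores or virtual machines is what lets a big-block transfer be delivered without per-packet copying.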

Outperforming the rest

This heterogeneous multicore architecture is ideally suited for intelligent, high-performance data plane applications.

By applying dedicated multicore processors optimized for L2-L7 network functions, the general-purpose cores operate more efficiently, allowing networking applications to significantly boost performance, reduce latency, and increase the number of effective processing cycles per packet. The highly structured and preprocessed data improves memory efficiency and memory bandwidth, reducing cache misses and processor stalls. The task-optimized multicore processors provide the highest performance-to-power consumption ratio available.

All these benefits combine to make the heterogeneous multicore architecture the preferred solution for embedded network communications, substantially outperforming standard multicore designs.

Jarrod Siket is senior VP of sales and marketing for Netronome Systems, Inc., based in Pennsylvania. He has 18 years of experience in the data and telecommunications industry, including roles at Tollgrade Communications, FORE Systems, and three terms (2000-2005) as the vice chairman of the IP/MPLS Forum Technical Committee. Jarrod holds a BS in Information and Decision Systems from Carnegie Mellon University and an MBA from the Joseph M. Katz Graduate School of Business at the University of Pittsburgh.

Netronome Systems
724-778-3290
jarrod.siket@netronome.com
www.netronome.com

© MMXII Embedded Computing Design.
An OpenSystems Media publication.