OEMs are rethinking their approaches to systems design. As a result, they are building more highly available, always-on solutions that help reduce the operational expenses and revenue loss caused by service interruptions. The key to developing applications in a high-availability (HA) architecture is decoupling the service being provided from the platform on which it is hosted. With standards-based Redundant Host architectures, protocols, and programming interfaces, the developer can focus on service availability rather than worrying about system availability. This standards-based philosophy also makes it easier for the developer to port the service to future platforms. This article discusses Redundant Host architecture in embedded design.
The proliferation of business and mission-critical applications, such as billing, missile test, and radar detection, has rendered architectures offering 99.999 percent availability unacceptable. Developers are now seeking platforms that offer always-on capabilities to help ensure their critical systems will not fail. Until now, the most common method of constructing such highly available, fault-tolerant architectures was to couple proprietary hardware designs with specialized interfaces. This generally made for an acceptable level of availability, but such platforms did not allow for portability to architectures from external vendors. Emerging Redundant Host architectures are based on open-standard application programming interfaces (APIs), such as those in the PICMG 2.12 software interoperability specification, and offer availability beyond the standard 99.999 percent. By building applications around such architectures, developers can now realize the goal of constructing highly available host applications that are portable to emerging designs from multiple vendors.
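To put the availability figures above in perspective, the following minimal sketch converts an availability percentage into an annual downtime budget. The function name and the 365-day year are assumptions made for illustration only.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a 365-day year


def annual_downtime_minutes(availability_percent: float) -> float:
    """Return the downtime per year, in minutes, that an availability figure allows."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Compare three common availability levels.
    for pct in (99.9, 99.99, 99.999):
        print(f"{pct}% allows {annual_downtime_minutes(pct):.2f} minutes of downtime per year")
```

Under these assumptions, 99.999 percent availability still permits a little over five minutes of downtime per year, which is the budget that always-on designs aim to shrink further.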
Redundant Host
In systems design there are three types of device interaction:
- System data
- Management
- Control
These types of interaction can occur over the same physical medium or over discrete media such as the Intelligent Platform Management Bus (IPMB), CompactPCI, Ethernet, or the H.110 telephony bus. The data plane is responsible for exchanging data traffic such as voice, images, and Internet data. The management plane is used to get status information, to set thresholds for reporting, and to take local actions regarding the functional health of the devices in the system. The control plane performs interactions such as initialization, configuration, and control of devices within a system. See Figure 1 for an illustration of the three planes.
The Redundant Host feature is concentrated in the control plane. There is typically a 1:N relationship of control blade to I/O blades. In CompactPCI, the system master generally performs the control blade role while the peripherals perform the I/O blade roles, which makes the system master a single point of failure in that architecture. The Redundant Host design overcomes this by providing management of a redundant control blade. This paradigm also holds true for some PICMG 2.16, and even pure Ethernet, architectures in which I/O devices manipulate the payload data and require a controlling device to provide routing for further processing, to store data, or simply to collect statistics.
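The effect of removing that single point of failure can be estimated with the standard parallel-redundancy formula. The sketch below is illustrative only: it assumes that control blades fail independently and that failover is instantaneous and always succeeds, neither of which holds exactly in a real chassis. The function name and the sample figure are hypothetical.

```python
def redundant_availability(single_availability: float, hosts: int = 2) -> float:
    """Availability of a group of hosts in which any one surviving host keeps control.

    Assumes independent failures and perfect, instantaneous failover.
    """
    all_hosts_down = (1.0 - single_availability) ** hosts
    return 1.0 - all_hosts_down


if __name__ == "__main__":
    single = 0.999  # hypothetical availability of one control blade
    print(f"one control blade:  {single:.6f}")
    print(f"two control blades: {redundant_availability(single):.6f}")
```

With these idealized assumptions, a control blade that is available 99.9 percent of the time yields 99.9999 percent when paired with a redundant peer. Real systems fall short of this figure because switch overs take time and failures are not fully independent.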
With a standards-based design founded on the PICMG 2.12 Redundant Host API, the architecture is open and not prone to change the way a proprietary solution is. Additionally, products built around the Redundant Host architecture can offer the fastest open-standards-based failover times on the market: as little as single-digit milliseconds. With this rapid failover capability, system applications can recover almost immediately from a catastrophic control blade failure and can avoid losing valuable data. Predictive failure analysis can trigger switch overs, allowing the drivers and applications time to synchronize their databases and state information before handing over control.
Designing for service availability
A Redundant Host architecture provides a platform that is not only tolerant of fault situations but also offers graceful mechanisms for recovery. Consider the example of a fault-tolerant telecommunications switch whose chassis contains at least one control blade and any number of peripheral boards. If a line card fails, total system capacity is reduced, but service availability is maintained as long as the rest of the system preserves integrity for the balance of the connected calls. Service availability applies to both the host domain and the chassis in its entirety.
Common attributes one can consider when thinking about highly available host applications include failure recovery, inter-host synchronization, and switch-over strategies. Redundant Host architectures generally provide these capabilities at some level. Acceptable levels of service availability vary significantly depending on the application in question. In order to establish a balance between acceptable service interruption intervals and the system overhead required for continued service availability, developers must evaluate these attribute implementation details and carefully weigh the performance trade-offs.
Switch overs are categorized as cooperative or hostile. An effective switch-over strategy can ensure quick recovery from failure with minimal loss of data. The timeliness of such a recovery reflects directly on the system's service availability. In a cooperative switch over, the active host notifies the backup host that a transition is about to occur, resulting in a graceful failover. The backup host normalizes state and seamlessly provides service once the switch over is complete. Predictive failure techniques can be applied that monitor for threshold incursions and raise system alarms. A pre-configured alarm could initiate a switch over, allowing for a graceful transition of control without any loss of data.
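The predictive path described above can be sketched as a small threshold monitor. This is a minimal illustration, not an interface from any PICMG specification: the class, the sensor name, and the limit value are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class ThresholdMonitor:
    """Raises an alarm when a sensor reading exceeds its configured upper limit."""

    limits: Dict[str, float]                  # hypothetical sensor name -> upper limit
    on_alarm: Callable[[str, float], None]    # called once for each threshold incursion
    alarms: List[str] = field(default_factory=list)

    def report(self, sensor: str, value: float) -> bool:
        """Record a reading; return True if it triggered an alarm."""
        limit = self.limits.get(sensor)
        if limit is not None and value > limit:
            self.alarms.append(sensor)
            self.on_alarm(sensor, value)
            return True
        return False


if __name__ == "__main__":
    def start_cooperative_switch_over(sensor: str, value: float) -> None:
        # Placeholder for notifying the backup host while the active host is still healthy.
        print(f"alarm on {sensor} ({value}): requesting cooperative switch over")

    monitor = ThresholdMonitor({"cpu_temp_c": 85.0}, start_cooperative_switch_over)
    monitor.report("cpu_temp_c", 70.0)   # below the limit, no alarm
    monitor.report("cpu_temp_c", 91.5)   # above the limit, alarm raised
```

Because the alarm fires while the active host is still running, the callback has time to start a cooperative handover rather than waiting for an outright failure.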
A hostile switch over usually occurs during a worst-case scenario, such as a kernel panic or the failure of a critical system component. In such a case, there is no time to quiesce the failing domain before control transitions to a backup domain.
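The difference between the two kinds of switch over can be sketched from the backup host's point of view. The model below is an assumption-laden illustration: it infers a hostile failure from missed heartbeats, which is one common technique but is not stated in this article, and every class, method, and parameter name is hypothetical.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Optional


class Role(Enum):
    BACKUP = "backup"
    ACTIVE = "active"


class SwitchOver(Enum):
    COOPERATIVE = "cooperative"
    HOSTILE = "hostile"


@dataclass
class BackupHost:
    heartbeat_timeout_s: float                    # assumed tuning parameter
    last_heartbeat_s: float = 0.0
    role: Role = Role.BACKUP
    state: Dict[str, int] = field(default_factory=dict)  # last synchronized checkpoint
    last_switch_over: Optional[SwitchOver] = None

    def on_heartbeat(self, now_s: float, checkpoint: Dict[str, int]) -> None:
        """The active host is alive; keep a copy of its latest checkpoint."""
        self.last_heartbeat_s = now_s
        self.state = dict(checkpoint)

    def on_handover_notice(self, final_state: Dict[str, int]) -> None:
        """Cooperative path: the active host quiesced and sent its final state."""
        self.state = dict(final_state)
        self._take_over(SwitchOver.COOPERATIVE)

    def poll(self, now_s: float) -> None:
        """Hostile path: no notice arrived, so infer failure from missed heartbeats."""
        silent_for = now_s - self.last_heartbeat_s
        if self.role is Role.BACKUP and silent_for > self.heartbeat_timeout_s:
            self._take_over(SwitchOver.HOSTILE)

    def _take_over(self, kind: SwitchOver) -> None:
        self.role = Role.ACTIVE
        self.last_switch_over = kind
```

In the hostile case the backup host resumes from the last checkpoint it received, so any state that changed after that checkpoint is lost. In the cooperative case it receives the final state before taking control.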
Ideally, a system will respond to a switch over while maintaining the integrity of all system data. The level of synchronization dictates how seamlessly a redundant application will achieve a normalized state and maintain service availability after the switch over is executed. In some systems, the switch-over process may interrupt the routing of TCP/IP packets for a number of seconds while the newly active host application reaches a functional state. Depending upon the system in question, this may be considered acceptable, or it may be undesirable and costly.
The strategy used for data synchronization is a compromise between the volume of data to be synchronized and the level of service interruption that can be tolerated. Implementations of synchronization methods run the gamut from solutions having two hosts linked by a cable connecting their serial ports, to higher cost, proprietary solutions such as dedicated high speed buses that provide high throughput. Developers must reach a balance of methods that helps them maintain a cost-sensitive solution while achieving a reasonable throughput of data, thus helping to ensure service or application integrity and availability. If an application can leverage off-the-shelf solutions, it is much easier to maintain the system target cost. Further, non-proprietary solutions are often simpler to implement and lend themselves to reduced support costs. An example of a high throughput, non-proprietary synchronization strategy is an inter-chassis Ethernet backplane implementation as described in the PICMG 2.16 specification, which uses a time-tested, packet-switched data bus, combined with the reliability of a dual-star topology.
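The trade-off between data volume and link throughput can be made concrete with a rough transfer-time estimate. The link rates and state size below are assumed values chosen for illustration, and the calculation ignores framing, protocol overhead, and latency.

```python
def transfer_seconds(state_bytes: int, link_bits_per_second: float) -> float:
    """Idealized time to copy a block of state across a link, ignoring all overhead."""
    return state_bytes * 8 / link_bits_per_second


if __name__ == "__main__":
    state_bytes = 1_000_000  # hypothetical size of the state to synchronize
    links = {
        "serial cable at 115,200 bit/s (assumed rate)": 115_200,
        "Ethernet at 100 Mbit/s (assumed rate)": 100_000_000,
    }
    for name, rate in links.items():
        print(f"{name}: {transfer_seconds(state_bytes, rate):.2f} s")
```

With these assumed figures, the serial link needs roughly 69 seconds to move one megabyte of state, while the Ethernet link needs about 0.08 seconds. That gap is why the synchronization medium limits how much state can be kept current without lengthening the service interruption.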
Designing for portability
To ensure longevity, system developers try to produce designs that will be portable to third-party architectures as they emerge. Portable applications maintain the initial development investment and help reduce maintenance costs over the long run by providing a mature code base. However, once the realities of schedule concerns and the difficulties of developing a complex embedded system are realized, the portability attribute is quite often the first to be tossed. Therefore, to preserve as much of the initial investment as possible, the application must be designed so that the service being provided is logically decoupled from the platform on which the application is being hosted. Combining standard software interfaces and a modular design approach can help achieve this separation of functionality.
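One way to picture this separation is to have the service depend only on a narrow platform interface. The sketch below is hypothetical throughout: the interface, its methods, and the sample service are invented for illustration and are not the PICMG 2.12 API.

```python
from abc import ABC, abstractmethod
from typing import List


class HostPlatform(ABC):
    """Hypothetical boundary between the service and the platform hosting it."""

    @abstractmethod
    def is_active_host(self) -> bool:
        """Return True if this host currently owns the control role."""

    @abstractmethod
    def request_switch_over(self) -> None:
        """Ask the platform to move the control role to the other host."""


class CallRoutingService:
    """Service logic that knows nothing about the underlying chassis."""

    def __init__(self, platform: HostPlatform) -> None:
        self._platform = platform
        self.routed: List[str] = []

    def route(self, call_id: str) -> bool:
        # Only the active host provides service; the backup stays passive.
        if not self._platform.is_active_host():
            return False
        self.routed.append(call_id)
        return True


class SimulatedPlatform(HostPlatform):
    """Stand-in platform for testing; a real port would wrap a vendor's API here."""

    def __init__(self, active: bool) -> None:
        self._active = active

    def is_active_host(self) -> bool:
        return self._active

    def request_switch_over(self) -> None:
        self._active = not self._active
```

Porting the service to a new chassis then means writing another `HostPlatform` implementation, while `CallRoutingService` stays unchanged.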
Several standardization efforts have been made over the past few years to provide a foundation for application portability. These efforts have produced a number of APIs that allow an application or system service to manage the Redundant Host, slot control, and hot-swap aspects of a chassis. These forward-looking APIs preserve the development investment in highly available applications as designers move toward architectures like existing PICMG 2.16 and emerging AdvancedTCA systems. Included in such APIs is the PICMG 2.12 Redundant Host interface, which provides mechanisms for the configuration and status of system-host switch overs. This API is architected in a way that allows for easy extensibility beyond dual-host configurations to N-host configurations.
PICMG specifications offer interfaces beyond Redundant Host control that provide a granular level of control for peripheral slots. These interfaces involve both the slot control and hot-swap characteristics of the chassis. The slot control interface provides an accepted method for detecting peripheral presence and board health in addition to controlling slot power and board reset capabilities. The hot-swap interfaces laid out in the PICMG specifications provide a consistent infrastructure for constructing a platform that supports the dynamic insertion and extraction of peripherals in a powered chassis. These interfaces extend beyond the user space and into the kernel and device driver layers.
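The slot control and hot-swap responsibilities described above can be modeled in a few lines. This is a simplified sketch under stated assumptions: the class and method names are hypothetical and do not come from any PICMG specification, and a board is treated as healthy as soon as it is powered.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Slot:
    present: bool = False
    powered: bool = False
    healthy: bool = False


class Chassis:
    """Hypothetical slot-control model for a powered chassis."""

    def __init__(self, slot_count: int) -> None:
        self.slots: Dict[int, Slot] = {n: Slot() for n in range(1, slot_count + 1)}

    def on_insertion(self, slot: int) -> None:
        """A board was seated: mark it present, then power it up."""
        s = self.slots[slot]
        s.present = True
        s.powered = True
        s.healthy = True  # simplification: assume the board passes its checks

    def on_extraction_request(self, slot: int) -> None:
        """An operator asked to pull a board: power the slot down first."""
        s = self.slots[slot]
        s.powered = False
        s.healthy = False

    def on_extraction(self, slot: int) -> None:
        """The board has left the chassis: return the slot to its empty state."""
        self.slots[slot] = Slot()

    def reset(self, slot: int) -> bool:
        """Reset a board; only meaningful when it is present and powered."""
        s = self.slots[slot]
        if s.present and s.powered:
            s.healthy = True
            return True
        return False
```

A real implementation would reach below user space, as the text notes, with the kernel and device drivers handling each event before the slot changes state.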
In addition to the management interfaces previously mentioned, operating systems are now becoming hot-swap friendly, either through device-driver interfaces that support a stated-device-driver model or through third-party add-ons that enhance the hot-swap capabilities of a chassis. The adoption of non-proprietary operating systems in the embedded market is due in part to the current climate of open source development. While there is no definitive specification that encapsulates all the qualities required for Redundant Host support within the operating system, newly emerging device driver models that support hot-plug and power management attributes have significantly simplified the development process and started the transition away from proprietary plug-and-play implementations.
Redundant Host benefits service availability design
This article has highlighted the critical design areas of data synchronization, service recovery, and switch-over strategy. Service availability is a function of system availability, and the redundant capabilities of the entire platform must be considered when designing a system. The good news is that vendors of standards-based embedded system architectures, such as the CompactPCI and PICMG 2.16 architectures mentioned in this article, are already offering products with these capabilities built in. With standards-based Redundant Host architectures, protocols, and programming interfaces available, developers can now focus on designing for service availability without worrying about system availability.
David Radecki, senior software engineer, has been with Performance Technologies for six years. Before joining the company, David worked as a software engineer for Hughes Aircraft and Beckman Instruments. He has a BS and MS in computer science from California State University, Fullerton.
Sean O'Brien is a software engineering manager for Performance Technologies, managing the development of software and firmware for the Computing Products Group in San Luis Obispo, California. Sean has been with the company for five years. He is actively involved in the development of industry standards, and recently chaired the PICMG® 2.14 Multi Computing Subcommittee. Before joining Performance Technologies, Sean worked for SPAR and ARK Telecom as a software engineer developing software and firmware for TDMA and FDMA satellite telecommunications equipment. He graduated from California Polytechnic Institute in 1991 with a BS in computer science.
Performance Technologies offers systems development services and products in the communications, military, and commercial markets.
With core competencies in compute platforms, IP/Ethernet communications, high availability, and SS7/IP interworking, and a host of unifying software, Performance Technologies can meet growing customer needs and can assist in developing tighter and more strategic relationships with their core vendors.
For further information about the company, contact:
Performance Technologies
205 Indigo Creek Drive
Rochester, NY 14626
Tel.:
E-mail: David.radecki@pt.com
E-mail: Sean.obrien@pt.com