Tony Romero
Performance Technologies


tony.romero@pt.com


Managing high availability systems

By Tony Romero
Benjamin Franklin once wrote, “God helps them that help themselves.” Thanks to distributed, intelligent management, communications equipment is becoming far more capable of “helping itself,” which increases the availability, or uptime, of applications. The major components in a system now include sensors and intelligent management architectures that provide real-time feedback on their operating status. This article will show that the standards-based Intelligent Platform Management Interface, commonly known as IPMI, is at the heart of the management framework.

Similar to a corporation’s organizational chart, there are multiple levels of management in high availability (HA) systems. Starting at the highest level, enterprise management monitors and manages all the equipment in one or more locations, either locally or remotely. Network management configures, manages, and monitors all the networking equipment. At lower levels, such as system management, managing a single platform or cluster of platforms can increase the availability of the applications or services running on it.

Thanks to distributed and intelligent management, communications equipment is now becoming much more self-sufficient, with the ability to provide real-time feedback on operating status. The standards-based Intelligent Platform Management Interface (IPMI) is the brain of the management framework.

When building a highly available system, the goal is to increase Mean Time Between Interruptions (MTBI) and decrease Mean Time To Repair (MTTR). Management at the shelf level can locate the failed component, identify the problem, and correct the failure. It can even warn technicians of impending failures before they occur. All of this reduces the time technicians spend on diagnosis and deployment.
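As a rough illustration of why both numbers matter, steady-state availability is commonly estimated as uptime divided by total time. The sketch below applies that textbook ratio with MTBI standing in for the more usual MTBF. The function names and the sample figures are assumptions chosen for illustration, not measurements from any product.

```python
def availability(mtbi_hours: float, mttr_hours: float) -> float:
    """Estimate steady-state availability as MTBI / (MTBI + MTTR)."""
    if mtbi_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBI must be positive and MTTR must be non-negative")
    return mtbi_hours / (mtbi_hours + mttr_hours)


def downtime_minutes_per_year(avail: float) -> float:
    """Convert an availability fraction into expected minutes of downtime per year."""
    return (1.0 - avail) * 365 * 24 * 60


# Hypothetical figures: one interruption every 10,000 hours in both cases.
slow_repair = availability(10_000, 4.0)   # technician must diagnose the fault on site
fast_repair = availability(10_000, 0.5)   # management has already located the failed card

for label, value in (("4 h repair", slow_repair), ("0.5 h repair", fast_repair)):
    print(f"{label}: {value:.5%} available, "
          f"{downtime_minutes_per_year(value):.0f} min of downtime per year")
```

With MTBI held constant, shortening the repair time alone raises the estimate, which is the effect the shelf-level diagnostics described above are meant to produce.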

The IPMI specification is a widely used management standard for many types of systems. System management sits at the core of most HA applications, collecting health status, environmental information, and asset management information. It integrates with higher levels of management software or with GUIs that can take intelligent actions based on that information. It also provides slot control to power up, power down, or reset the components in the platform.

How it works
Typically, a system supports one or two redundant shelf management modules with a single IP address to interrogate and control some or all of the components in a system. In fault-tolerant applications, for example, the module could send commands over the system management bus (typically called the Intelligent Platform Management Bus, or IPMB) to reset or cut power to failed cards, thus isolating problems from other components in the system. Each IPMI-based managed component is designed with a microcontroller running an independent, small-footprint operating system. The shelf management module includes the IPMI-defined Baseboard Management Controller (BMC). The BMC receives event information from the other managed objects and stores it in an event log. The other managed objects, such as server boards, line cards, or even power supplies, have a much simpler microcontroller (called the Satellite Management Controller, or SMC), which just reports information about itself.
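The division of labor described above can be sketched as a toy model. This is not IPMI or IPMB messaging; the class names, the severity labels, and the rule that a critical event cuts power are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Event:
    source: str      # slot that raised the event (hypothetical naming)
    sensor: str
    reading: float
    severity: str    # "ok", "warning", or "critical" (assumed labels)


@dataclass
class SatelliteController:
    """Stands in for an SMC: it only reports information about its own board."""
    slot: str
    powered: bool = True

    def report(self, sensor: str, reading: float, severity: str) -> Event:
        return Event(self.slot, sensor, reading, severity)


@dataclass
class ShelfManager:
    """Stands in for the BMC role: logs events and controls slot power."""
    boards: Dict[str, SatelliteController] = field(default_factory=dict)
    event_log: List[Event] = field(default_factory=list)

    def register(self, board: SatelliteController) -> None:
        self.boards[board.slot] = board

    def receive(self, event: Event) -> None:
        # Every event is logged; only critical ones trigger isolation.
        self.event_log.append(event)
        if event.severity == "critical":
            self.power_down(event.source)

    def power_down(self, slot: str) -> None:
        self.boards[slot].powered = False


shelf = ShelfManager()
line_card = SatelliteController("slot-3")
shelf.register(line_card)
shelf.receive(line_card.report("temperature", 48.0, "ok"))
shelf.receive(line_card.report("temperature", 91.0, "critical"))
print(len(shelf.event_log), line_card.powered)
```

Keeping the satellite controller this simple mirrors the text: all decisions live in the shelf manager, so a board only needs enough logic to describe its own state.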

Preventing catastrophe
It is critical that a system be able to capture and report unusual events that may result in service interruptions. Most managed objects allow system managers to set predetermined thresholds for parameters such as temperature or voltage, and will warn system managers when readings approach these thresholds. This allows the manager to react to problems before they become catastrophic.
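A minimal sketch of this kind of threshold check follows, assuming a two-level scheme (a warning level below a critical level) for a sensor whose readings become dangerous as they rise. The class name, function name, and the temperature limits are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class UpperThresholds:
    """Upper limits for one sensor; the warning level must sit below the critical one."""
    warning: float
    critical: float

    def __post_init__(self) -> None:
        if self.warning >= self.critical:
            raise ValueError("warning threshold must be below critical threshold")


def classify(reading: float, limits: UpperThresholds) -> str:
    """Return 'ok', 'warning', or 'critical' for a single sensor reading."""
    if reading >= limits.critical:
        return "critical"
    if reading >= limits.warning:
        return "warning"
    return "ok"


# Hypothetical board temperature limits in degrees Celsius.
board_temp = UpperThresholds(warning=70.0, critical=85.0)

for reading in (55.0, 72.5, 90.0):
    print(reading, classify(reading, board_temp))
```

The gap between the two levels is what buys the system manager time to react: a reading in the warning band is reported while the board is still operating within its limits.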

Sometimes failures do occur. To prevent an operating system crash from bringing down an entire system, IPMI functions as a separate management plane, independent of the main application processors or operating system. Therefore, if the host’s operating system crashes, the SMC can still report out to the dedicated shelf management module that a failure has occurred. The module will then reset the board or power it down until a technician can diagnose the problem.

Fault tolerance
Fault tolerance describes a system or component designed so that, if a component fails, a backup component or procedure can immediately take its place with no loss of service. Typically, two or more redundant components operate in either active/active mode or active/standby mode in the system. In active/standby mode, if the active component fails, the standby component detects this through a watchdog routine and assumes control. With complex components that run software, an elegant fault-tolerant failover cannot occur unless the latest, most reliable data is synchronized between the two redundant components.

When using IPMI for communication, failover happens with sub-second latency. When a failure occurs within a simple Ethernet heartbeat watchdog configuration, a device may ask several times whether another is truly non-functional. This can take several seconds. However, with an IPMI controller on each redundant component, failover procedures begin immediately. In addition, as discussed previously, the IPMI controller can trigger a proactive failover before an actual failure. In such a scenario, the first device can synchronize its configuration data and last transaction data to the other device prior to shutting down or resetting.
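The difference between the two detection paths can be sketched as follows. The standby unit either waits for a number of missed heartbeats or reacts at once to a management report about its peer. The class names, the one-second interval, and the three-beat limit are assumptions for illustration, and timestamps are passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass


@dataclass
class HeartbeatWatchdog:
    """Standby-side watchdog that declares the peer dead after missed beats."""
    interval_s: float          # expected spacing between heartbeats
    max_missed: int            # beats that may be missed before failover (assumed policy)
    last_beat_s: float = 0.0

    def beat(self, now_s: float) -> None:
        self.last_beat_s = now_s

    def detection_window_s(self) -> float:
        """Silence that must elapse before the peer is considered failed."""
        return self.interval_s * self.max_missed

    def peer_failed(self, now_s: float) -> bool:
        return (now_s - self.last_beat_s) > self.detection_window_s()


@dataclass
class StandbyUnit:
    watchdog: HeartbeatWatchdog
    active: bool = False

    def poll(self, now_s: float) -> None:
        """Heartbeat path: take over only after the detection window expires."""
        if self.watchdog.peer_failed(now_s):
            self.active = True

    def on_management_event(self, peer_severity: str) -> None:
        """Management path: a critical report about the peer triggers takeover at once."""
        if peer_severity == "critical":
            self.active = True


standby = StandbyUnit(HeartbeatWatchdog(interval_s=1.0, max_missed=3))
standby.watchdog.beat(now_s=10.0)    # last beat before the active unit fails
standby.poll(now_s=12.0)             # 2 s of silence: still inside the 3 s window
print("after 2 s of silence:", standby.active)
standby.poll(now_s=13.5)             # 3.5 s of silence: window exceeded
print("after 3.5 s of silence:", standby.active)
```

In the heartbeat path the takeover delay is bounded below by the interval multiplied by the number of tolerated misses, which is where the "several seconds" comes from. The management path has no such waiting period.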

Predictive analysis & policy-based management
Sophisticated management software allows developers to set thresholds to predict problems before they become catastrophic. This type of software controls the entire application in real time and typically features a data-gathering engine that continuously collects system management information from the application infrastructure, the network, and the business or service processes into a single repository. By defining a set of rules that specify the action to take when a specific event occurs, a certain level of performance can be established for the entire system. This enterprise-level view of management, which monitors all aspects of operations, ensures that policies are enacted intelligently and globally rather than on the basis of one specific shelf or network module.
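A rule set of this kind can be sketched as a list of condition and action pairs evaluated against records in the repository. Everything named here is hypothetical: the record fields, the thresholds, and the action strings are placeholders, and a real product would execute actions rather than return their descriptions.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Union

Record = Dict[str, Union[str, float]]


@dataclass
class Rule:
    name: str
    condition: Callable[[Record], bool]   # returns True when the rule should fire
    action: str                           # description of the action to take


def evaluate(rules: List[Rule], record: Record) -> List[str]:
    """Return the actions of every rule whose condition matches the record."""
    return [rule.action for rule in rules if rule.condition(record)]


# Hypothetical policy applied uniformly to every shelf and module.
POLICY = [
    Rule("overheating board",
         lambda r: r["sensor"] == "temperature" and float(r["value"]) >= 70.0,
         "fail over and power down board"),
    Rule("rising load",
         lambda r: r["sensor"] == "cpu_load" and float(r["value"]) >= 0.85,
         "deploy additional card"),
]

# Hypothetical records gathered into the single repository.
repository: List[Record] = [
    {"source": "shelf1/slot3", "sensor": "temperature", "value": 72.0},
    {"source": "shelf2/slot5", "sensor": "cpu_load", "value": 0.40},
]

for record in repository:
    for action in evaluate(POLICY, record):
        print(record["source"], "->", action)
```

Because the same policy list is applied to every record regardless of its source, the behavior is defined once for the whole system instead of per shelf.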

Predictive analysis and policy-based management can also serve as the intelligence that triggers the deployment of new cards to keep up with increasing loads or to react to failures. One server, typically called the deployment server, can be set up to store the hard drive images of the specific applications or services required for a single board computer. When a new card must be deployed, the shelf manager can automatically power up the board and capture its Field Replaceable Unit (FRU) information. The shelf manager can then notify the deployment server to load the appropriate image onto the card using a deployment application.
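The deployment sequence above can be sketched as a small planning function. The catalog contents, the `product_name` key used to look up an image, and the step descriptions are assumptions for illustration; the sketch only lists the steps and does not talk to any real shelf manager or deployment server.

```python
from typing import Dict, List, Optional

# Hypothetical catalog held by the deployment server: product name -> disk image.
IMAGE_CATALOG: Dict[str, str] = {
    "media-gateway-sbc": "images/media-gateway.img",
    "signaling-sbc": "images/signaling.img",
}


def select_image(fru: Dict[str, str], catalog: Dict[str, str]) -> Optional[str]:
    """Pick the image that matches the card's FRU product name, if any."""
    return catalog.get(fru.get("product_name", ""))


def deploy_new_card(slot: str, fru: Dict[str, str],
                    catalog: Dict[str, str]) -> List[str]:
    """Return the ordered steps the shelf manager would take for a new card."""
    steps = [f"power up {slot}", f"read FRU data from {slot}"]
    image = select_image(fru, catalog)
    if image is None:
        steps.append(f"no image for {fru.get('product_name', 'unknown')}; leave {slot} idle")
    else:
        steps.append(f"ask deployment server to load {image} onto {slot}")
    return steps


for step in deploy_new_card("slot-7", {"product_name": "signaling-sbc"}, IMAGE_CATALOG):
    print(step)
```

Keying the lookup on data read from the card itself means the same procedure can serve different board types without an operator choosing the image by hand.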

Unrivaled reliability
Thanks to IPMI, management architectures now have the intelligence to provide valuable system status information to prevent catastrophe. The level of reliability and the rate at which failover occurs with IPMI-based systems are unrivaled. Vendors of standards-based embedded system architectures such as CompactPCI and PICMG 2.16 are offering products with these always-on design methodologies built in. By using integrated solutions that offer these features, the developer will be able to offer high availability, serviceability, and scalability, ultimately providing a lower total cost of ownership for the end user.

. . . . .

Tony Romero is a senior product manager with Performance Technologies. For the past three years, Tony has worked extensively in system architecture and product development of platforms with CompactPCI Packet Switched Backplanes, both pre-PICMG 2.16 and PICMG 2.16. His responsibilities include managing the CompactPCI computing platform products that comprise chassis, midplanes, system management, power supplies, and cooling. Before working at Performance Technologies, Tony worked for Primus Knowledge Solutions and Dell Computer Corp.

Performance Technologies develops embedded computing products and system-level solutions for equipment manufacturers and service providers worldwide. With competencies in compute platforms, IP/Ethernet switching, communications software, wide-area networking, SS7/IP interworking, and high availability, the company offers unified products for existing and emerging applications. Serving the industry for more than 20 years, Performance Technologies’ solutions focus on time-to-market, performance, and cost advantages for companies in the communications, military, and commercial markets.

Contact the company directly for further information.

Performance Technologies
1050 Southwood Drive
San Luis Obispo, CA 93401
Tel.: 805-783-6071
E-mail: Tony.romero@pt.com
Web site: www.pt.com