Accelerating high-performance computing applications using parallel computing

August 1, 2012 OpenSystems Media

In recent years, certain software fields that involve complex, large-scale data processing have reached increasingly high levels of complexity. Consequently, processing now takes hours and sometimes even days or weeks on modern hardware. This is notably the case for software dealing with various simulations, energy analytics, Computer-Aided Design and Computer-Aided Manufacturing (CAD/CAM), graphics rendering, life sciences, finances, and data conversion. Accelerating the processing functions in these software packages not only increases user satisfaction, but also enables higher accuracy, better decision-making, and more efficient work procedures in organizations using the software. Software acceleration has thus become a high priority for software organizations in these fields.

One obvious approach to software acceleration is to invest in stronger computer hardware. As processor technology consistently improves, this approach is valid as a short-term decision. However, more often than not it fails in the long term because processing demands tend to consistently increase as well, and in many cases the results do not meet the application requirements. For software deployed on a large scale, investment in strong servers can be prohibitively expensive. Because of these factors, software organizations are increasingly seeking parallel and distributed processing systems as a cost-effective way to accelerate time-consuming computational applications.

Parallel computing options

Using parallel computing to accelerate highly complex computational processes is not a new concept. This approach has been tested and proven, and with the recent influx of affordable multicore and General-Purpose Graphics Processing Unit (GPGPU)-based technologies, it is more relevant now than ever before.

However, choosing the right technology to accelerate time-consuming computational processes is far from a straightforward decision. The various options for implementing a parallel or distributed system offer substantial differences in the resulting acceleration potential, as well as in direct development costs and indirect/long-term costs (maintenance, infrastructure, energy, and so on). This is particularly true when considering platforms for migrating existing code to a parallel or distributed architecture. Choosing a less-than-ideal system may incur dramatically increased costs – both direct and indirect – compared to a better alternative.

Part of the uncertainty around choosing the right parallel or distributed architecture lies in the diversity of time-consuming processes, each involving different requirements and considerations. When selecting an acceleration approach for a specific process, it is important to consider the characteristics and limitations of the scenario involved. Developers can use the following parameters to characterize highly computational processes and then choose the appropriate acceleration methodology.

CPU-bound versus I/O-bound processes

Certain types of applications such as data warehouses and enterprise resource planning are characterized by extensive data access, while others such as simulations, rendering, and terrain analysis typically place greater emphasis on algorithmic or CPU-bound complexity. If the portions of the application that are to be executed in parallel are algorithmic rather than data-intensive (that is, they are more CPU-bound than I/O-bound), parallel execution over the local network, or in some cases over a Wide Area Network (WAN), should be considered because it effectively utilizes the available hardware resources. For processes that are more data-bound and involve reading and writing very large amounts of data, the chosen architecture should address the expected data bottlenecks with an emphasis on high-throughput disk and network systems.
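One rough way to tell which category a workload falls into is to compare the CPU time it consumes with the wall-clock time it takes. The Python sketch below illustrates that idea only; the helper names and the 0.5 threshold are assumptions made for this example and do not come from any particular profiling tool.

```python
import time


def cpu_utilization(fn, *args):
    """Run fn(*args) and return CPU time divided by wall-clock time."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    fn(*args)
    cpu_used = time.process_time() - cpu_start
    wall_used = time.perf_counter() - wall_start
    return cpu_used / wall_used if wall_used > 0 else 0.0


def classify(ratio, threshold=0.5):
    """Label a workload from its CPU/wall ratio (threshold is an assumption)."""
    return "CPU-bound" if ratio >= threshold else "I/O-bound"
```

A function that spends most of its time waiting (on disk, network, or a sleep) yields a ratio near zero, while a tight computational loop yields a ratio near one. Real workloads are usually mixed, so a measurement like this is a starting point rather than a verdict.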

Highly isolated versus environment-dependent processes

This characteristic refers to the level of interaction between the process that is to be executed in parallel and the host environment – specifically the size and complexity of the application to be run in parallel (executable, libraries, and binary dependencies), file system activity, and access to other environmental databases such as registry and environment blocks. A highly isolated application involves a minimal set of such interactions; however, an application dependent on the runtime environment typically requires the parallel computing architecture to either preconfigure the relevant computers with the complete set of required software and data files or include a virtualization component to emulate the running environment on each of the computing nodes. Several approaches to virtualization in parallel computing environments exist, and the selected parallel computing environment can significantly affect total system cost.

Embarrassingly parallel versus inherently serial processes (and everything in between)

One issue that can complicate development efforts in building a parallel computing architecture is the application's suitability for "slicing" into multiple, independently executable parts that can run in parallel. Some legacy applications require intensive code structure refactoring to allow this. Others require little or no work to prepare the application for parallel execution. The most common example is batch sequential data processing, in which the same process executes over and over again, each time with a different input set. Such examples are sometimes described as "embarrassingly parallel" to indicate the relative simplicity of converting them to a parallel execution model. On the other hand, some applications are "inherently serial" and are not well-suited for parallel execution. With these applications, it is still sometimes possible to gain performance improvements by converting certain processing micro-elements to GPU-based parallel execution using technologies such as OpenCL or CUDA.
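The batch case described above can often be parallelized with very little code. The following Python sketch uses the standard library's process pool to run the same function over independent input sets; `process_input` is a hypothetical stand-in for the real job, and the `parallel` switch exists only so the serial and parallel paths can be compared.

```python
from concurrent.futures import ProcessPoolExecutor


def process_input(n: int) -> int:
    """Hypothetical stand-in for one CPU-bound job: sum of squares below n."""
    return sum(i * i for i in range(n))


def run_batch(inputs, max_workers=None, parallel=True):
    """Apply process_input to every input set, serially or across processes."""
    if not parallel:
        return [process_input(n) for n in inputs]
    # Each input is independent, so the pool can hand them to separate
    # worker processes; results come back in input order.
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(process_input, inputs))
```

Because no job depends on another job's result, the only change from the serial version is where the loop runs. On platforms that start worker processes by launching a fresh interpreter, `run_batch` should be called from under an `if __name__ == "__main__":` guard.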

High-end versus cost-effective acceleration requirements

Commercial parallel computing platform costs vary greatly, with high-end systems costing several orders of magnitude more than low-end systems. Thus, it is important to define the performance improvement expected from moving to a parallel computing environment. Acceleration through parallel computing is, by definition, a move with diminishing returns; in many cases, reducing execution time by 50 to 70 percent is sufficient to create a radical change in application performance, and the additional value of improving this to an 80 to 90 percent reduction is not worth the investment. While low- and mid-range parallel computing systems provide a reasonable performance improvement, high-end systems offer an additional 10 to 20 percent acceleration, but at a significant additional cost that does not always justify itself.
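The diminishing returns can be made concrete with a little arithmetic. The sketch below converts a reduction in execution time into the speedup factor it implies, and adds Amdahl's law as one common model of how speedup flattens as workers are added. Amdahl's law is not mentioned in the article and is brought in here only as an illustration; the function names are likewise assumptions.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Speedup factor implied by cutting run time by `reduction` (0 <= r < 1)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Amdahl's law: speedup when only part of the work can run in parallel."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / workers)
```

A 50 percent reduction corresponds to a 2x speedup and a 70 percent reduction to roughly 3.3x, but a 90 percent reduction requires 10x. Under Amdahl's model, a job that is 90 percent parallelizable reaches about 4.7x on 8 workers and only about 8.8x on 64, which is one way to see why the last 10 to 20 percent tends to be the expensive part.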

Legacy versus newly developed applications

For obvious reasons, converting a legacy application that was originally designed for serial execution is significantly more time-consuming and expensive than designing a new application for parallel execution. Most parallel computing platforms offer APIs that allow software developers to modify the application code to utilize the platform. Some APIs are more complex than others, and previous developer experience with these APIs is recommended to allow effective integration with the platform.

Approaches to software acceleration via parallel computing

As stated earlier, pricing of commercial parallel computing platforms varies substantially between low- and high-end products. In addition, higher-end systems require considerably more sophisticated adaptation and administration, and the combined costs of software licenses and professional services make the price differential even higher, with high-end systems sometimes costing orders of magnitude more than simpler systems.

Furthermore, when migrating existing applications to a parallel computing architecture, it is important to consider the migration costs involved with adapting the parallel computing platform (professional services, programming, and quality assurance). It is therefore recommended to choose an approach that will provide the minimal set of features to adequately answer the software project needs, without investing in an unnecessarily expensive higher-end system.

The following overview examines current categories of parallel computing platforms and explains how each relates to the characteristics presented in the previous section.

Local parallelization using multiple cores and/or GPGPUs

In recent years, the potential for accelerating computational processes using parallel computing resources within a single machine has grown significantly with the introduction of strong multicore CPUs and GPGPUs. While the capability for local parallelization using these technologies is still limited by hardware specifications, in many cases they provide a cost-effective, low-end alternative to a full-scale distributed system. Local parallelization also avoids the need to invest in the virtualization technologies required by some distributed computing technologies. The benefits of using multicore CPUs and/or GPGPUs include:

  • Multiple core utilization: Applications that are more CPU-bound than I/O-bound can be modified to run different executable parts in parallel as separate processes. Modern Operating Systems (OSs) today are aware of multiple CPU cores and can automatically manage parallel processes and send each to run using a different core, allowing effective parallelization. In applications with potential for simple parallel separation, this is often a winning approach. The chief problems with this method are hardware restrictions, as the number of cores in each system is limited, and typically all processes share just one disk drive. However, these issues can be averted using a system like IncrediBuild-XGE (Figure 1), which allows applications utilizing multiple cores in parallel to automatically use all available cores in the local network.
  • GPGPUs: These components are fast emerging as another way to achieve acceleration using existing parallel resources in PCs and servers. Originally designed to handle graphics-oriented tasks in parallel with general processing tasks, GPUs can now be used for nongraphical processing as well, with hardware vendors promoting this approach through systems that include multiple strong GPUs. GPU-based parallel computing is performed at the thread level (multiple parallel threads per process each utilizing a different GPU) and involves the use of dedicated APIs such as OpenCL and CUDA, which require expertise and sometimes significant development effort.

Figure 1: Processes are distributed to idle resources on a local network using process virtualization.

In-house (nongeneric) distributed computing implementations

In scenarios involving simple parallelization challenges in which the target application is highly isolated, embarrassingly parallel (or close to it), and can settle for reasonable acceleration results without requiring investment in high-end infrastructure, it may be practical to develop an application-specific distributed computing implementation. The simplest example would involve running different parts of the application in parallel on separate, predefined servers. The relative simplicity of the target application might make the development and maintenance costs involved in creating a proprietary system comparable to or even less expensive than adapting a commercial system.

Another advantage to this approach is the high level of flexibility achieved in developing a proprietary system. However, for almost any scenario beyond the most simplistic ones, developing a parallel computing implementation in-house is likely to result in costly ongoing maintenance efforts and complications in handling issues that generic systems already address, such as error handling, availability, scalability, dynamic resource allocation, management requirements, and reporting.
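The simplest case mentioned above, running different parts of the application on separate, predefined servers, amounts to a static assignment of work. The Python sketch below shows one such assignment using round-robin distribution; the function name and the task and server labels are hypothetical, and actually dispatching the work to the servers is left out.

```python
def assign_round_robin(tasks, servers):
    """Statically assign each task to one of a fixed list of servers."""
    if not servers:
        raise ValueError("at least one server is required")
    plan = {server: [] for server in servers}
    for index, task in enumerate(tasks):
        # Task i goes to server i mod N, so the load is spread evenly
        # by count (not by how long each task actually takes).
        plan[servers[index % len(servers)]].append(task)
    return plan
```

A plan like this is easy to write and to reason about, but it also shows what an in-house system has to add over time: it does not react to a failed server, to tasks of uneven length, or to machines joining and leaving the pool.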

Computing clusters

A computing cluster is a group of servers dedicated to sharing an application's workload. Servers in the cluster run a homogeneous environment that includes both an up-to-date version of the runtime environment (application and binary dependencies) and shared access to I/O files. Having a dedicated computing environment such as a computing cluster eliminates the need for virtualization (see the previous section on highly isolated versus environment-dependent processes) and offers effective central administration of the computing cluster. The downsides of this approach are:

  • Maintaining a dedicated farm of expensive servers running the software incurs additional costs and does not take advantage of unutilized computing power in existing hardware connected to the network.
  • Clusters are often dedicated to a single application and cannot support several applications.
  • Migrating an existing application to a computing cluster platform typically involves significant software development to adapt the application to use the cluster APIs.

Cluster-based systems can be combined with high-throughput storage as well as network hardware and software to optimize performance for data-bound applications with high-end performance requirements.

Grid computing

Grid computing is similar to cluster computing in the sense that it involves a group of computers dedicated to solving a common problem, but differs from cluster computing by allowing a mixture of heterogeneous systems (different OSs and hardware) in the same grid. Grid systems also do not limit usage to a single application and enable more distributed control and administration of the systems connected to the grid. Finally, grids allow the largest scale of distributed system architecture in terms of the number of nodes involved, with large systems sometimes reaching many thousands of interconnected nodes.

Some grid systems not only utilize the combined computing power of dedicated servers, but also allow PCs and workstations to contribute spare processor cycles to the grid even while they are running other computing tasks. For example, a user writing a document using a word processing tool such as Microsoft Word could simultaneously contribute 80 to 90 percent of idle processing power to computing tasks running on the grid. This simultaneous utilization can dramatically increase the grid's potential computing power; however, in order to achieve this, the application running on the grid requires modification to use the grid system's APIs. The more environment-dependent the application is, the more extensive the changes will be to the application code to allow it to utilize available computing power on nondedicated machines.

Grid computing systems are, in general, the distributed parallel processing offering with the most comprehensive feature set and capabilities. As such, they also tend to be quite complex in terms of required expertise, both in development efforts (migrating existing code to the platform APIs) and ongoing maintenance and administration efforts. It is therefore recommended to evaluate these aspects when considering a grid-based approach.

Grid systems can be commercial or open source. Open-source systems are less expensive but tend to leave open ends (scheduling, management, and physical implementation aspects) that are not covered by the project, and require either in-house development or collaboration with the project development community. It is therefore important to carefully assess the total cost of ownership involved in completing the missing components in open-source systems. Several commercial grid computing products provide fuller feature sets.

Grid computing products tend to be at the highest end of the price range for parallel distributed systems. As with cluster-based systems, grid-based systems can be combined with high-end products to optimize network and storage bottlenecks.

Public compute clouds

Public clouds such as Amazon's EC2 and Microsoft's Azure platform are a form of computing in which the cloud user purchases computing power from a virtualized compute farm over the Internet, as opposed to private clouds that run on computers stored on location at an organization. Payment models are flexible, allowing the user to grow and shrink in computing power according to requirements and pay only for the computing power that was used over time. This greatly reduces the need to make long-term investments in on-site hardware and infrastructure. Public clouds have traditionally been used for business applications with an emphasis on load-balancing requirements rather than accelerating computing processes, but public cloud high-performance computing systems are gaining popularity.

The advantages of public cloud high-performance computing include:

  • Flexible, pay-as-you-go licensing
  • No need to invest in and maintain dedicated hardware
  • Valid choice for applications in which high-end performance is not a requirement

Disadvantages include:

  • Cost of service can be quite high over time
  • Raises security concerns when sensitive data is transferred from the organization's servers to the Internet
  • In some cases, the latency of uploading and downloading large amounts of data over a WAN connection can create performance bottlenecks
  • Requires maintenance of virtual system images, modification of code to use the platform APIs, or both, which can be time-consuming and requires expertise
  • Creates a dependency on the public cloud vendor and the availability of an open Internet connection

A new approach to the heterogeneity challenge

When accelerating applications that interact with the computing environment – reading and writing files, loading binary executables and dynamic link libraries, and reading registry and environment values – a challenge surfaces for traditional distributed computing systems.

One approach is to dedicate a compute cluster preinstalled with the required runtime environment and files for the distributed application. This answers the application's requirements but requires investment in dedicated servers and does not take advantage of the computing power available in existing PCs and workstations connected to the network. It also requires maintaining the cluster and making sure it always runs an up-to-date version of the runtime and data environment.

Virtualization allows servers to change the runtime environment on demand by loading a different system image each time, thereby improving manageability and increasing flexibility. However, virtual image initialization forms an additional bottleneck and, as in cluster systems, does not effectively utilize the sometimes vast amounts of idle processing power on existing computers.

Some grid platforms provide APIs that, when integrated into the application code, allow the use of remote machine resources without requiring extensive preconfiguration of these machines. In some cases, this effectively enables nondedicated machines to connect to the grid and contribute their idle processing power. However, this is applicable only in certain scenarios and in most cases requires extensive modification of the application code.

Process virtualization via a platform like IncrediBuild-XGE is a new approach to parallel distributed computing that enables software acceleration by combining the relative ease of migration and deployment characterizing cluster-based systems with the computing strength and flexibility of grid systems.

With process virtualization, an initiator machine sends processes for parallel execution on other machines connected to the network. These processes will then run on these machines alongside any other processes running at the time on the OS, but will run in a special self-contained virtual environment that completely emulates the initiator's environment, including installed applications, file system, registry, and environment. These virtual processes will only use the idle processing power of remote machines so as not to interfere with concurrently running processes not related to grid activity. The resource coordination module also ensures that processes are allocated to the strongest and most available nodes in the system at any time.
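The idea of sending each process to the strongest and most available node can be sketched as a simple greedy allocator. The Python example below is an illustration only and is not IncrediBuild-XGE's actual algorithm; the `Node` fields, the capacity measure (cores multiplied by idle fraction), and the rule that each process uses one core's worth of capacity are all assumptions.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    cores: int
    idle_fraction: float  # 0.0 = fully busy, 1.0 = fully idle

    def free_capacity(self) -> float:
        """Idle capacity expressed in cores (an assumed measure)."""
        return self.cores * self.idle_fraction


def allocate(processes, nodes):
    """Greedily place each process on the node with the most free capacity."""
    free = {node.name: node.free_capacity() for node in nodes}
    placement = {}
    for proc in processes:
        best = max(free, key=free.get)
        if free[best] < 1.0:
            # No node has a whole idle core left: leave the process unplaced.
            placement[proc] = None
        else:
            placement[proc] = best
            free[best] -= 1.0  # assume one process occupies one core
    return placement
```

Using only idle capacity in the score is what keeps the sketch from competing with whatever else is running on each machine. A real coordinator would also refresh these figures continuously, since availability on nondedicated machines changes from moment to moment.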

Because virtualization is performed at the process level, there is no need to program code for the platform or to integrate platform-specific APIs into the application source code to migrate the application to the grid. Instead, IncrediBuild-XGE uses a compact XML definition file that specifies which processes should be farmed out to remote machines on the grid and which should always run on the initiator. This makes grid-enablement significantly faster compared to systems that require extensive modification of source code. For example, it typically takes less than an hour to convert an application that already uses local parallelization (processes running in parallel on multiple CPUs or cores of a single machine). Ongoing maintenance costs are also reduced because the need to maintain virtual image banks or a cluster environment is eliminated.

The end result is a distributed processing application acceleration platform that effectively accelerates both new and legacy applications, enables rapid integration, and reduces maintenance costs.

Uri Mishol is cofounder and chairman of Xoreax.

Xoreax UriM@incredibuild.com www.incredibuild.com
