Applications that can best exploit a multicomputer such as the RS/6000 SP (SP) typically consist of several cooperating processes running on multiple nodes. These applications may or may not be subsystems.
Most multicomputer environments are subject to node, process, network, and device failures. Such failures arise from a combination of hardware failures, software failures, resource exhaustion, and operator error.
To be competitive, multicomputer applications must be highly available. This means that the application continues to run after a failure, perhaps after a brief interruption of service to accommodate error recovery, and perhaps with degraded performance. While dealing with a failure, an application should never break any correctness requirement. For example, a database should never violate the integrity of customer data.
Hardware techniques form an essential component of a comprehensive suite of system support for high availability. These include the use of multi-tailed disks, where a node can take over the data that was previously owned by a failed node, and the use of network address takeover (for example, IP takeover), where a node can assume the identity of a failed node.
Hardware techniques, however, make up only part of a complete solution. Additional aspects of the solution include detection of component (process and node) failures, recovery from communication partitions, coordination of activities among the processes of an application, and coordination of activities between applications. The Group Services subsystem and its Application Programming Interface (GSAPI) are designed to satisfy these additional requirements.
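To make the idea of component failure detection concrete, the following is a minimal sketch of a timeout-based heartbeat monitor. It is an illustration only and does not use or represent the GSAPI: the class `HeartbeatMonitor`, its methods, the node names, and the 5-second timeout are all hypothetical, and timestamps are passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Hypothetical helper (not part of GSAPI): suspects members that go silent."""
    timeout: float                                  # seconds of silence tolerated
    last_seen: dict = field(default_factory=dict)   # member name -> last heartbeat time

    def heartbeat(self, member: str, now: float) -> None:
        # Record that a heartbeat from `member` arrived at time `now`.
        self.last_seen[member] = now

    def failed_members(self, now: float) -> list:
        # A member is suspected once its silence exceeds the timeout.
        return sorted(m for m, t in self.last_seen.items()
                      if now - t > self.timeout)


monitor = HeartbeatMonitor(timeout=5.0)
monitor.heartbeat("node1", now=0.0)
monitor.heartbeat("node2", now=0.0)
monitor.heartbeat("node1", now=4.0)          # node1 keeps reporting; node2 goes silent
suspected = monitor.failed_members(now=7.0)  # only node2 has been silent for over 5 s
```

A timeout alone cannot distinguish a failed node from a communication partition, which is one reason detection, partition recovery, and coordination are usually provided together by a dedicated service rather than reimplemented in each application.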