IBM Books

Planning Volume 2, Control Workstation and Software Environment


IBM's approach to high availability for control workstations

For the reasons already mentioned, IBM has taken the high availability approach to control workstation support for the SP system. The control workstation is a suitable candidate for high availability because it can typically withstand a short interruption, but must be restored quickly. In the SP configuration, the control workstation has been a possible single point of failure.

Eliminating the control workstation as a single point of failure

A single point of failure exists when a critical function is provided by a single component. If that component fails, the system has no other way to provide that function and essential services become unavailable.

The key facet of a highly available system is its ability to detect and respond to changes that could impair essential services. The SP with the HACWS software lets a system continue to provide services critical to an installation even though a key system component -- the control workstation -- is no longer available. When the control workstation becomes unavailable, through either a planned or inadvertent event, the SP high availability component is able to detect the loss and shift that component's workload to a backup control workstation.
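The detect-and-shift behavior described above can be illustrated with a minimal heartbeat sketch. This is not the HACWS implementation. The class names, workstation names, and the 10-second timeout are hypothetical, chosen only to show the general idea of noticing that the active control workstation has gone silent and handing its role to a backup that is still alive.

```python
from dataclasses import dataclass


@dataclass
class Workstation:
    name: str
    last_heartbeat: float  # time of the most recent heartbeat, in seconds


class FailoverMonitor:
    """Hypothetical monitor that promotes the backup when the active unit goes silent."""

    def __init__(self, primary, backup, timeout):
        self.active = primary
        self.standby = backup
        self.timeout = timeout  # seconds of silence tolerated before takeover

    def heartbeat(self, name, now):
        """Record a heartbeat from the named workstation."""
        for ws in (self.active, self.standby):
            if ws.name == name:
                ws.last_heartbeat = now

    def check(self, now):
        """Swap roles if the active unit missed its deadline and the standby is alive."""
        active_silent = now - self.active.last_heartbeat > self.timeout
        standby_alive = now - self.standby.last_heartbeat <= self.timeout
        if active_silent and standby_alive:
            self.active, self.standby = self.standby, self.active
            return True  # a failover happened
        return False


primary = Workstation("cws-primary", last_heartbeat=0.0)
backup = Workstation("cws-backup", last_heartbeat=0.0)
monitor = FailoverMonitor(primary, backup, timeout=10.0)

# Both workstations report in, so no takeover is needed yet.
monitor.heartbeat("cws-primary", now=5.0)
monitor.heartbeat("cws-backup", now=5.0)
print(monitor.check(now=12.0), monitor.active.name)

# The primary goes silent while the backup keeps reporting.
monitor.heartbeat("cws-backup", now=20.0)
print(monitor.check(now=20.0), monitor.active.name)
```

The sketch passes the current time in explicitly instead of reading a clock, which keeps the example deterministic. A real high availability product also has to move shared disks, network addresses, and services to the backup, which this sketch does not attempt.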

Refer to the following tables for some of the consequences of failure of a control workstation that has not been backed up.

Table 42. Effect of CWS failure on mandatory software in a single-CWS configuration

Major software component Effect on SP System
Hardware Monitor
  1. No control of SP hardware except for the on/off switch on a node, and the use of the service laptop connected to a frame supervisor cable.
  2. Nodes cannot be hot-plugged in or out of the frames controlled by the failed control workstation.

SDR
  1. Current running jobs continue to completion.
  2. No new parallel jobs can start.
  3. The Resource Manager daemons die because they cannot make contact with the SDR.
  4. Serial jobs can continue to be started.
  5. No hardware or software configuration changes can occur.
  6. No installations can be started.
  7. A switch fault will not complete processing, and the switch will remain in service mode if a fault occurs while the control workstation is unavailable.
  8. No cluster shutdowns can occur.
  9. A node can still be powered off and on manually, but this causes a switch fault.

Kerberos V4 Authentication Server (if no backup server exists)
  1. Users cannot obtain new tickets via kinit.
  2. Background processes using rcmdtgt to get a ticket will fail.
  3. Users cannot change passwords.
  4. New users cannot be added to the authentication database.

Diagnostics
  Diagnostics cannot be run on node boot disks.

File Collections Master
  No new distributed file updates can occur.

Availability subsystems (hats, hags, haem)
  These subsystems will not restart upon node reboot.

Table 43. Effect of CWS failure on user data on the CWS

Major software component Effect on SP System
User Management
  You cannot make changes to a user database stored on the control workstation.

Hardware Logging Daemon
  1. Hardware logging immediately stops.
  2. Nodes cannot be hot plugged.

Error Logging Alerts
  Alerts sent by mail are put in the node mail spool.

Accounting Master
  1. No consolidated accounting records are kept during down time.
  2. Records are consolidated after the control workstation comes up.

User File Server
  1. Running jobs might fail.
  2. Jobs might not be able to access needed data.

Consequences of a high availability control workstation failure

When a failure occurs in a high availability control workstation, the following steps take place automatically:

Note:
See Limits and restrictions for limitations with respect to RS/6000 and pSeries servers.

System stability with HACWS

When a control workstation fails, it causes significant loss of function in configuration, systems management, hardware monitoring, and the ability to handle a switch fault. The reliability of the whole system is compromised by the chance of a switch fault during a control workstation outage. Using the high availability control workstation increases the mean time between failures (MTBF) of the entire system.

The failover is disruptive, but the interruption is momentary. Applications on the control workstation that are interrupted do not resume automatically and must be restarted. Applications on nodes that require no communication with the control workstation might not notice the failover. Applications that rely on data from the SDR are momentarily interrupted. Having a backup control workstation available keeps this interruption from becoming an extended outage.

Occasionally, you might need to take a control workstation down to maintain the hardware or software, or to repair or update a component of the system. Using a high availability control workstation lets you schedule this upkeep without taking the entire system down. This improves the serviceability of the SP, because service time for the control workstation no longer adds to the mean time to repair (MTTR) of the system as a whole.
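To see how MTBF and MTTR combine, the following sketch uses the standard steady-state availability formula, MTBF / (MTBF + MTTR), and the usual formula for two independent redundant units. The hour figures are hypothetical and are chosen only to make the arithmetic easy to follow. They are not measurements of any SP system.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: the fraction of time a component is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_pair(single):
    """Availability of two independent units where either one is sufficient."""
    return 1.0 - (1.0 - single) ** 2


# Hypothetical figures, not taken from any SP installation.
single_cws = availability(mtbf_hours=990.0, mttr_hours=10.0)
paired_cws = redundant_pair(single_cws)

print(f"single control workstation: {single_cws:.4f}")
print(f"primary plus backup:        {paired_cws:.4f}")
```

With these inputs, a single control workstation is available 99% of the time and an idealized pair 99.99% of the time. The pair figure is an upper bound: it assumes the two workstations fail independently and that takeover is instantaneous, whereas the failover described in this section causes a momentary interruption.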

