For the reasons already mentioned, IBM has taken the high availability approach to control workstation support for the SP system. The control workstation is a suitable candidate for high availability because it can typically withstand a short interruption, but must be restored quickly. In the SP configuration, the control workstation has been a possible single point of failure.
A single point of failure exists when a critical function is provided by a single component. If that component fails, the system has no other way to provide that function and essential services become unavailable.
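The effect of removing a single point of failure can be illustrated with a rough calculation. The sketch below is illustrative only: it assumes the two components fail independently and that takeover is instantaneous and always succeeds, neither of which the source claims. The function name and the 0.99 figure are hypothetical.

```python
def availability_with_backup(single: float) -> float:
    """Availability of a function provided by two redundant components.

    Assumes independent failures and perfect, instantaneous takeover,
    so the function is lost only when both components are down at once.
    """
    unavailable = 1.0 - single
    return 1.0 - unavailable * unavailable


# Hypothetical figure: one component that is available 99% of the time.
single = 0.99
print(f"without backup: {single:.4f}")
print(f"with backup:    {availability_with_backup(single):.4f}")
```

In practice the gain is smaller than this idealized figure, because a real takeover takes time and failures are not always independent.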
The key facet of a highly available system is its ability to detect and respond to changes that could impair essential services. The SP with the HACWS software lets a system continue to provide services critical to an installation even though a key system component -- the control workstation -- is no longer available. When the control workstation becomes unavailable, through either a planned or inadvertent event, the SP high availability component is able to detect the loss and shift that component's workload to a backup control workstation.
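The detect-and-shift behavior described above can be sketched generically. This is not the HACWS implementation, and the source does not say how HACWS detects the loss; the heartbeat timeout, the `Workstation` class, and `select_active` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Workstation:
    name: str
    last_heartbeat: float  # time in seconds when the last heartbeat arrived


def select_active(primary: Workstation, backup: Workstation,
                  now: float, timeout: float = 10.0) -> Workstation:
    """Return the workstation that should carry the workload.

    The primary is treated as lost when its most recent heartbeat
    is older than `timeout` seconds (an assumed detection rule).
    """
    primary_alive = (now - primary.last_heartbeat) <= timeout
    return primary if primary_alive else backup


primary = Workstation("primary-cws", last_heartbeat=100.0)
backup = Workstation("backup-cws", last_heartbeat=118.0)

# Heartbeat is 5 s old: the primary keeps the workload.
print(select_active(primary, backup, now=105.0).name)
# Heartbeat is 20 s old: the workload shifts to the backup.
print(select_active(primary, backup, now=120.0).name)
```

The same rule covers both planned and unplanned outages, since in either case the primary simply stops being observed.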
Refer to the following tables for some of the consequences of failure of a control workstation that has not been backed up.
Table 42. Effect of CWS failure on mandatory software in a single-CWS configuration
Major software component | Effect on SP System |
---|---|
Hardware Monitor | |
SDR | |
Kerberos V4 Authentication Server (if no backup server exists) | |
Diagnostics | Diagnostics cannot be run on node boot disks. |
File Collections Master | No new distributed file updates can occur. |
Availability subsystems (hats, hags, haem) | These subsystems will not restart upon node reboot. |
Table 43. Effect of CWS failure on user data on the CWS
Major software component | Effect on SP System |
---|---|
User Management | You cannot make changes to a user database stored on the control workstation. |
Hardware Logging Daemon | |
Error Logging Alerts | Alerts that are sent by mail are put in the node mail spool. |
Accounting Master | |
User File Server | |
When a failure occurs in a high availability control workstation, the following steps take place automatically:
When a control workstation fails, it causes significant loss of function in configuration, systems management, hardware monitoring, and the ability to handle a switch fault. The reliability of the whole system is compromised by the chance of a switch fault during a control workstation outage. Using the high availability control workstation increases the mean time between failures (MTBF) of the entire system.
The failover is disruptive, although the interruption is momentary. Applications at the control workstation that are interrupted will not resume automatically and must be restarted. Applications within nodes that require no communication with the control workstation might not notice the failover. Applications relying on data from the SDR will be momentarily interrupted. Having a backup control workstation available prevents this problem.
Occasionally, you might need to take a control workstation down to maintain the hardware or software, or to repair or update a component of the system. Using a high availability control workstation lets you schedule this upkeep without taking the entire system down. Because the backup carries the workload while the control workstation is being serviced, the serviceability of the SP improves, which reduces the mean time to repair (MTTR) of the system as a whole.
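MTBF and MTTR are commonly combined into a single steady-state availability figure, MTBF / (MTBF + MTTR). The sketch below applies that standard formula; the hour values are hypothetical and are not measurements of any SP system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical figures for illustration only.
mtbf = 1000.0  # mean time between failures, in hours

# Availability for a long repair time and for a short one.
print(f"MTTR 10 h: {availability(mtbf, 10.0):.4f}")
print(f"MTTR  1 h: {availability(mtbf, 1.0):.4f}")
```

A longer MTBF or a shorter MTTR each raise the result, so both contribute to overall availability.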