The following definitions will help in understanding HACWS function.
The HACWS function improves the reliability, availability, and serviceability of an SP system by providing a backup control workstation. HACWS extends the SP system configuration to include a second RS/6000 that functions as a backup to the primary control workstation. Automatic failover and reintegration to the backup control workstation are provided if the primary control workstation fails or is taken down for hardware or software maintenance.
HACWS provides hardware and software features that remove a single point of failure for the SP system. HACWS does not, however, protect against double failures. You should not expect control workstation functions to work during the time that a control workstation is failing over. If a failure on the control workstation occurs while a control workstation function is being performed, the function needs to be restarted.
The following figure illustrates a typical HACWS configuration:
Figure 37. An HACWS configuration
HACWS is a two-node configuration: a primary control workstation with a backup control workstation. (Do not confuse the use of node in this context with an SP system node.) IBM suggests that the two control workstations be configured identically, with the same model hardware and I/O configurations. This is not required, but it simplifies management of the HACWS configuration. IBM suggests that the primary and backup control workstations do not use the same power source.
External fixed disks that provide only nonconcurrent access are used in an HACWS configuration. IBM suggests that the external fixed disks be mirrored across two disk controllers. If the external fixed disks are not mirrored, the single point of failure has not been removed from the SP system. The fixed disk becomes the single point of failure: if the disk fails, the SP system will be down until the disk can be replaced. IBM also suggests that the two disk controllers be on different power sources.
HACWS requires a dual RS-232 frame supervisor card per frame with a connection from each control workstation to each SP frame. Hardware feature #1245 provides this new type of frame supervisor card with an RS-232 Y-cable that connects from a frame in the SP system to the control workstation. All frames in the SP system need this new frame supervisor card and the Y-cables in order for the HACWS software to be configured. The frame supervisor connections from each frame must be connected to the same tty ports on both control workstations. If not cabled correctly, frame supervisor connections will not be activated by the hardware monitor.
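The requirement that each frame use the same tty port on both control workstations can be checked against a hand-recorded cabling inventory. The following Python fragment is a minimal sketch of such a check, not an HACWS or PSSP tool. The function name, the dictionary layout, and the sample port assignments are all hypothetical.

```python
def find_tty_mismatches(primary, backup):
    """Return frames whose tty port differs between the two control workstations.

    Each argument maps a frame number to the tty device name (for example
    "tty0") that the frame supervisor cable is attached to on that workstation.
    The result maps each mismatched frame to a (primary, backup) pair.
    """
    mismatches = {}
    for frame in sorted(set(primary) | set(backup)):
        p = primary.get(frame)  # None means the frame is not cabled here
        b = backup.get(frame)
        if p != b:
            mismatches[frame] = (p, b)
    return mismatches


# Hypothetical inventory, recorded by hand for each control workstation.
primary_cabling = {1: "tty0", 2: "tty1", 3: "tty2"}
backup_cabling = {1: "tty0", 2: "tty2", 3: "tty2"}

for frame, (p, b) in find_tty_mismatches(primary_cabling, backup_cabling).items():
    print(f"frame {frame}: primary={p} backup={b}")
```

With this sample inventory, only frame 2 is reported, because it is the only frame cabled to different ports on the two workstations.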
HACWS is not supported in any configuration of systems of clustered enterprise servers. HACWS is also not supported in SP systems with SP-attached servers that use the CSP or HMC hardware protocol. To determine which protocol is used by the nodes supported for running PSSP, see the Hardware protocol values table in the book IBM RS/6000 SP: Planning Volume 2, Control Workstation and Software Environment.
Generally, it is not a good idea to run HACWS on systems with SP-attached servers. However, if you already have HACWS running on your SP system, have reasoned that you cannot do without HACWS, have specialized experience, can implement your own manual intervention procedures for failover, and accept all the limitations, you might be able to make HACWS continue to work in your SP system with SP-attached servers that use the SAMI hardware protocol.
IBM pSeries 680 and RS/6000 Enterprise Servers S70, S7A, and S80 connect to the control workstation by two serial connections, making them SP-attached servers in an SP system. One connection is for hardware monitoring and control and the other is for serial terminal support. These servers use the SAMI hardware protocol for those connections. Only one control workstation at a time can be connected to each server, so there cannot be automatic physical failover done by the HACWS software. When the primary control workstation fails over to the backup control workstation, hardware control and monitoring support and serial terminal support are not available for these servers.
The following apply if you use HACWS in a system containing any SP-attached server with the SAMI hardware protocol:
One of the following network connections must exist between the primary and backup control workstations:
Each control workstation requires the same number of connections to the SP Ethernet on the same LAN segments. Each SP Ethernet LAN segment must be cabled to the same Ethernet (enx) adapter. Standby network adapters are optional in an HACWS configuration because there may not be enough adapter slots in the control workstation for a standby adapter on each service network. On a small one- or two-frame SP system the normally used control workstation has only two slots and does not have enough slots for standby adapters. In a large multi-frame SP system the number of slots required for frame supervisor and SP Ethernet LAN segments may not allow for standby adapters. However, the presence of a standby adapter may avoid the need to fail over to the inactive control workstation.
The software foundation for HACWS is the High Availability Cluster Multi-Processing for AIX (HACMP) licensed program product. Only the high availability subsystem is required. The concurrent Resource Manager of HACMP is not supported (all the SP data on the control workstation is in an AIX Journaled File System and, therefore, does not allow concurrent access).
You may use any level of HACMP that is supported with the level of AIX that you are using. Refer to the appropriate HACMP documentation to determine what levels of HACMP are supported with the level of AIX that you are using.
The primary and backup control workstations are configured in a two-node rotating configuration. The external fixed disks are configured in nonconcurrent access only. For information on HACMP, refer to HACMP: Concepts and Facilities.
Building on HACMP, the ssp.hacws optional installation package that is part of the IBM Parallel System Support Programs for AIX provides:
When a failure of the primary control workstation takes place there is a disruptive failover (that is, the failover is noticeable) of the control workstation to the backup control workstation. This failover:
The backup control workstation assumes the IP address (service address) and IP aliases of the primary control workstation, resulting in only one active control workstation at a time and allowing client applications to run without changes. You cannot use IPv6 aliasing on a system with HACWS or HACMP. See Figure 38 for an illustration of addressing in an HACWS configuration.
Figure 38. Addressing in an HACWS configuration
This section describes some of the operational characteristics of the control workstation and HACMP.
In a two-node rotating HACMP configuration, if you are booting both the primary and backup control workstations at the same time, the first control workstation to have the cluster manager started will acquire the shared resources. In an HACWS scenario this will cause the control workstation IP addresses (service address) and the /spdata/sys1 file system to be configured on that control workstation. It is important that the RS/6000 that is intended to be the active control workstation have the HACMP cluster manager started on it first. To help ensure that the correct control workstation acquires the shared resources, IBM suggests that the cluster manager be started manually instead of being defined to start at system restart. This allows both control workstations to be booted in any order and any potential race conditions can be avoided during the boot process. After both control workstations are booted the cluster manager can be started on the intended control workstation.
If the control workstation function is started on the unintended control workstation, you can perform the following sequence of events to make the intended control workstation active:
Using SMIT:
Using the Command Line:
Enter
/usr/sbin/cluster/utilities/clstop -y -N -g
Using SMIT:
If a control workstation is configured without a standby adapter on one of its service networks, and the other control workstation is powered off, the HACMP software may find it difficult to determine service adapter failure. This is because the HACMP cluster manager cannot use a standby adapter to force packet traffic over the service adapter to verify its operation. This shortcoming is less of an exposure if one or more of the following is true:
If neither of these cases is true, then HACMP might report that a service adapter has failed, even though there has not been a failure, because there are no other adapters with which to communicate. (For example, this can occur when all the nodes are powered off.)
An enhancement to netmon, the network monitor portion of the HACMP cluster manager, is described below. It can be configured to allow more accurate determination of a service adapter failure. This function can be used in configurations that require a single service adapter per network.
A netmon configuration file, /usr/sbin/cluster/netmon.cf, will specify additional network addresses to which ICMP ECHO requests can be sent. The configuration file consists of one IP address or IP label per line. The maximum number of addresses used is five. All addresses specified after the fifth will be ignored. No comments are allowed in the file.
The following is an example of a netmon.cf configuration file:
180.146.181.119
steamer
chowder
180.146.181.121
mussel
This file must exist at cluster startup. The cluster software will scan the configuration file during its initialization phase. When netmon needs to stimulate the network to verify adapter function, it will send ICMP ECHO requests to each address. After sending the request to every address, netmon will check the inbound packet count before determining whether there is an adapter failure.
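The file format rules given above (one IP address or IP label per line, at most five entries used, no comments) can be checked before cluster startup. The following Python fragment is a minimal sketch of such a check and is not part of HACMP. It assumes that a comment would begin with `#` and that blank lines can be skipped, neither of which the text states. The function name and the sixth sample entry (`clam`) are hypothetical.

```python
MAX_ADDRESSES = 5  # per the text, entries after the fifth are ignored


def read_netmon_entries(text):
    """Return (used, ignored) entries from the contents of a netmon.cf file."""
    entries = []
    for line_number, raw in enumerate(text.splitlines(), start=1):
        entry = raw.strip()
        if not entry:
            continue  # assumption: blank lines are skipped, not rejected
        if entry.startswith("#"):  # assumption: "#" would mark a comment
            raise ValueError(f"line {line_number}: comments are not allowed")
        if len(entry.split()) != 1:
            raise ValueError(f"line {line_number}: expected one address or label")
        entries.append(entry)
    return entries[:MAX_ADDRESSES], entries[MAX_ADDRESSES:]


# The five entries from the example file, plus a hypothetical sixth entry.
sample = "180.146.181.119\nsteamer\nchowder\n180.146.181.121\nmussel\nclam\n"
used, ignored = read_netmon_entries(sample)
print("used:", used)
print("ignored:", ignored)
```

Here the first five entries are reported as used and `clam` is reported as ignored, which mirrors the five-address limit described for netmon.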
The following conditions will cause a failover of a control workstation:
This occurs when the operator stops the cluster manager on a node with a shutdown mode of graceful with takeover.
If the AIX operating system fails, the cluster manager on the inactive control workstation will detect this by observing missed heartbeats. The cluster manager then declares the crashed control workstation dead and performs a takeover of the shared resources.
If HACWS is configured with only one external disk adapter per control workstation, you need to write an Error Notification Object that performs a graceful shutdown with takeover when the disk adapter fails. The other control workstation will then take over the shared resources.
A failure of a service adapter on the active control workstation, without a standby adapter to take over, will cause some SP node subsystems to perceive that the control workstation has failed and the system will quickly become unstable. To prevent such system instability, a service network failure on the active control workstation is promoted to a node failure, causing a failover to the inactive control workstation.
The following conditions will not cause a failover of a control workstation:
In the case where the application software hangs and the network adapters being monitored via HACMP still respond to heartbeats from the inactive control workstation, HACMP will not detect the hang and will not fail over.
If the inactive control workstation is not available to take over, then a service network failure on the active control workstation will not be promoted to a node failure because no failover can occur.
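The failover conditions described in this section can be summarized as a small decision function. The following Python fragment is an illustrative model only. It is not how the HACMP cluster manager is implemented, and the class, field, and function names are hypothetical. A disk adapter failure handled by an Error Notification Object is represented here by the graceful shutdown with takeover that the object performs.

```python
from dataclasses import dataclass


@dataclass
class ActiveCwsState:
    """Conditions observed on the active control workstation."""
    stopped_graceful_with_takeover: bool = False
    aix_failed: bool = False
    service_adapter_failed: bool = False
    standby_adapter_available: bool = False
    application_hung: bool = False
    inactive_cws_available: bool = True


def causes_failover(state):
    """Return True if the described conditions lead to a failover."""
    if not state.inactive_cws_available:
        return False  # no other control workstation can take over
    if state.stopped_graceful_with_takeover or state.aix_failed:
        return True
    if state.service_adapter_failed and not state.standby_adapter_available:
        return True  # service network failure is promoted to a node failure
    # An application hang alone is not detected while heartbeats continue.
    return False


print(causes_failover(ActiveCwsState(aix_failed=True)))
print(causes_failover(ActiveCwsState(application_hung=True)))
```

The first call models an AIX failure and yields True. The second models an application hang with working network adapters and yields False.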
IBM suggests that you make configuration changes on the SP system only when the primary control workstation is the active control workstation. Most configuration changes are allowed when the backup control workstation is the active control workstation. When changes are made on the primary control workstation, some configuration files need to be updated on the backup control workstation. If changes are made while the backup control workstation is active, the updates go in the opposite direction (backup to primary instead of primary to backup). It is easier to manage the system when the updates go in only one direction.
You can decide to what extent you make configuration changes on the backup control workstation. Because only two machines are involved, the bidirectional updates might not be a problem.
The following table lists tasks that can be performed in an HACWS configuration and whether or not it is possible to update configuration files on the active backup control workstation. A 'No' in the 'Backup CWS Active' column means that the file cannot be updated on the active backup control workstation. In that case, you will have to update the file on the primary control workstation when it is active again.
Table 16. HACWS configuration and task summary
Task | Primary CWS active | Backup CWS active |
---|---|---|
Update passwords | Yes | No |
Add or change users | Yes | No |
Change Kerberos keys | Yes | No |
Install a node | Yes | Yes |
Change or add system partitions | Yes | Yes |
Add nodes to the system | Yes | No |
Hardware monitoring | Yes | Yes |
Reboot nodes | Yes | Yes |
Run diagnostics | Yes | Yes |
Shut down and restart system | Yes | Yes |
Run parallel jobs | Yes | Yes |
Update file collections | Yes | Yes |
Accounting | Yes | Yes |
Change site environment information | Yes | No |
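When planning work during a period in which the backup control workstation is active, Table 16 can be encoded as a lookup. The following Python fragment is a minimal sketch that copies the 'Backup CWS active' column of the table. The dictionary and function names are hypothetical.

```python
# 'Backup CWS active' column of Table 16 (True = Yes, False = No).
BACKUP_ACTIVE_ALLOWED = {
    "Update passwords": False,
    "Add or change users": False,
    "Change Kerberos keys": False,
    "Install a node": True,
    "Change or add system partitions": True,
    "Add nodes to the system": False,
    "Hardware monitoring": True,
    "Reboot nodes": True,
    "Run diagnostics": True,
    "Shut down and restart system": True,
    "Run parallel jobs": True,
    "Update file collections": True,
    "Accounting": True,
    "Change site environment information": False,
}


def deferred_tasks(planned):
    """Return the planned tasks that must wait until the primary is active."""
    return [task for task in planned if not BACKUP_ACTIVE_ALLOWED[task]]


planned = ["Reboot nodes", "Update passwords", "Add nodes to the system"]
print(deferred_tasks(planned))
```

For this sample plan, the function returns the two tasks marked 'No' in the table and leaves out rebooting nodes, which is allowed while the backup control workstation is active.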
The following table lists the tasks whose support is different for SP-attached servers if the active backup control workstation has not been enabled to the servers:
Table 17. High Availability Control Workstation and task summary (SP-attached servers)
Task | Backup CWS active |
---|---|
Install an SP-attached server | No |
Reboot SP-attached servers | No |
Shut down and restart the system | Yes |
Note that although you can shut down and restart the system with the SP-attached servers present, the attached servers will not be included in the process.