Administration Guide

Understanding the HACWS function

The following definitions will help in understanding the HACWS function.

Primary control workstation: The first control workstation that is installed with the SP system. The primary control workstation is a physical machine.
Backup control workstation: The standby backup control workstation that is installed during the HACWS installation. The backup control workstation is a physical machine.
Active control workstation: The control workstation that is currently running the SP control workstation applications (the SDR and hardmon, for example). This can be either the primary or backup control workstation.
Inactive control workstation: The control workstation that is not running the SP control workstation applications. This can be either the primary or backup control workstation.

The HACWS function improves the reliability, availability, and serviceability of an SP system by providing a backup control workstation. HACWS extends the SP system configuration to include a second RS/6000 that functions as a backup to the primary control workstation. Failover to the backup control workstation and later reintegration are automatic, both when the primary control workstation fails and when you take it down to perform hardware or software maintenance.

HACWS provides hardware and software features that remove a single point of failure for the SP system. HACWS does not, however, protect against double failures. You should not expect control workstation functions to work during the time that a control workstation is failing over. If a failure on the control workstation occurs while a control workstation function is being performed, the function needs to be restarted.

The following figure illustrates a typical HACWS configuration:

Figure 37. An HACWS configuration


Hardware

HACWS is a two-node configuration: a primary control workstation with a backup control workstation. (Do not confuse the use of node in this context with an SP system node.) IBM suggests that the two control workstations be configured identically, with the same model hardware and I/O configurations. This is not required, but it simplifies management of the HACWS configuration. IBM suggests that the primary and backup control workstations do not use the same power source.

External fixed disks and disk controllers

External fixed disks that provide only nonconcurrent access are used in an HACWS configuration. IBM suggests that the external fixed disks be mirrored across two disk controllers. If the external fixed disks are not mirrored, the single point of failure has not been removed from the SP system; it has moved to the fixed disk. If the disk fails, the SP system will be down until the disk can be replaced. IBM suggests that the two disk controllers be on different power sources.

Frame supervisor cards

HACWS requires a dual RS-232 frame supervisor card per frame with a connection from each control workstation to each SP frame. Hardware feature #1245 provides this new type of frame supervisor card with an RS-232 Y-cable that connects from a frame in the SP system to the control workstation. All frames in the SP system need this new frame supervisor card and the Y-cables in order for the HACWS software to be configured. The frame supervisor connections from each frame must be connected to the same tty ports on both control workstations. If the connections are not cabled correctly, the hardware monitor will not activate them.
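
The same-tty-port rule lends itself to a quick consistency check before HACWS is configured. The following is a minimal sketch, not part of PSSP: the function name, the frame numbers, and the device names are hypothetical, and the cabling maps are assumed to have been gathered by hand from each control workstation.

```python
def find_tty_mismatches(primary_ttys, backup_ttys):
    """Return frames whose tty port differs between the two control workstations.

    Both arguments map a frame number to the tty port it is cabled to.
    A frame missing from either mapping is reported as a mismatch too.
    """
    mismatches = {}
    for frame in sorted(set(primary_ttys) | set(backup_ttys)):
        primary_port = primary_ttys.get(frame)
        backup_port = backup_ttys.get(frame)
        if primary_port != backup_port:
            mismatches[frame] = (primary_port, backup_port)
    return mismatches


# Hypothetical cabling: frame 2 uses different ports, so it is flagged.
primary = {1: "/dev/tty0", 2: "/dev/tty1"}
backup = {1: "/dev/tty0", 2: "/dev/tty2"}
print(find_tty_mismatches(primary, backup))
```

An empty result means every frame is cabled to the same port on both machines, which is the condition this section requires.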

SP-attached and clustered enterprise servers

HACWS is not supported in any configuration of systems of clustered enterprise servers. HACWS is also not supported in SP systems with SP-attached servers that use the CSP or HMC hardware protocol. To find which protocol is used by the nodes supported for running PSSP, see the Hardware protocol values table in the book IBM RS/6000 SP: Planning Volume 2, Control Workstation and Software Environment.

Generally, it is not a good idea to run HACWS on systems with SP-attached servers. However, if you already have HACWS running on your SP system, have reasoned that you cannot do without HACWS, have specialized experience, can implement your own manual intervention procedures for failover, and accept all the limitations, you might be able to make HACWS continue to work in your SP system with SP-attached servers that use the SAMI hardware protocol.

IBM pSeries 680 and RS/6000 Enterprise Servers S70, S7A, and S80 connect to the control workstation by two serial connections, making them SP-attached servers in an SP system. One connection is for hardware monitoring and control and the other is for serial terminal support. These servers use the SAMI hardware protocol for those connections. Only one control workstation at a time can be connected to each server, so there cannot be automatic physical failover done by the HACWS software. When the primary control workstation fails over to the backup control workstation, hardware control and monitoring support and serial terminal support are not available for these servers.

The following apply if you use HACWS in a system containing any SP-attached server with the SAMI hardware protocol:

Each server is directly attached to the control workstation through two RS-232 serial connections. There is no dual RS-232 hardware support for these connections like there is for SP frames. These servers can only be attached to one control workstation at a time. Therefore, when a control workstation fails or scheduled downtime occurs, and the backup control workstation becomes active, you will lose hardware monitoring and control and serial terminal support for the servers. The specific functions that are lost include:
- Power on and off control
- Reboot control
- Serial port communications for s1term
- Nodecond support to obtain the hardware Ethernet address and to network boot the node
- Monitoring of the following Hardmon variables and state data, whether through the SP Perspectives graphical user interface, commands such as hmmon, spmon, and sphardware, or RSCT resource variables:
  
  diagByte
  hardwareStatus
  lcd1
  lcd2
  LCDhasMessage
  nodefail1
  nodeLinkOpen1
  nodepower
  serialLinkOpen
  spcn
  SPCNhasMessage
  src
  SRChasMessage
  timeTicks
- The ability to make configuration changes related to the servers. For example, you cannot add servers.
- The ability to shut down or restart the servers with a PSSP command.
The servers will have the SP Ethernet connection from the backup control workstation, so PSSP components requiring this connection will still work correctly. This includes components such as the availability subsystems, user management, logging, authentication, the SDR, file collections, accounting and others.
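
Monitoring scripts can use the list above to avoid polling variables that will not be reported. The following is a minimal sketch that copies the variable names from this section; the function name and the placeholder variable in the usage line are hypothetical, and the check applies only to servers attached with the SAMI hardware protocol.

```python
# Hardmon variables listed in this section as unavailable for SAMI-attached
# servers while the backup control workstation is active.
LOST_HARDMON_VARIABLES = frozenset({
    "diagByte", "hardwareStatus", "lcd1", "lcd2", "LCDhasMessage",
    "nodefail1", "nodeLinkOpen1", "nodepower", "serialLinkOpen",
    "spcn", "SPCNhasMessage", "src", "SRChasMessage", "timeTicks",
})


def unavailable_variables(requested, backup_is_active):
    """Return the requested variables that cannot be monitored right now."""
    if not backup_is_active:
        return []
    return [name for name in requested if name in LOST_HARDMON_VARIABLES]


# "exampleOtherVariable" is a placeholder for any variable not in the list.
print(unavailable_variables(["nodepower", "exampleOtherVariable"], True))
```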

Network connections

One of the following network connections must exist between the primary and backup control workstations:

- A dedicated TCP/IP network link
- An RS-232 tty link
- Target mode SCSI across the external SCSI fixed disks

Note: IPv6 aliases are not supported. For more information about IPv6, see Appendix G, Tolerating IPv6 alias addresses.

Adapters

Each control workstation requires the same number of connections to the SP Ethernet on the same LAN segments. Each SP Ethernet LAN segment must be cabled to the same Ethernet (enx) adapter on both control workstations. Standby network adapters are optional in an HACWS configuration because there may not be enough adapter slots in the control workstation for a standby adapter on each service network. On a small one- or two-frame SP system, the control workstation normally used has only two slots, which is not enough for standby adapters. In a large multi-frame SP system, the slots required for frame supervisor connections and SP Ethernet LAN segments may leave none for standby adapters. However, a standby adapter, where present, may avoid the need to fail over to the inactive control workstation.

Software

The software foundation for HACWS is the High Availability Cluster Multi-Processing for AIX (HACMP) licensed program product. Only the high availability subsystem is required. The concurrent Resource Manager of HACMP is not supported (all the SP data on the control workstation is in an AIX Journaled File System and, therefore, does not allow concurrent access).

You may use any level of HACMP that is supported with the level of AIX that you are using. Refer to the appropriate HACMP documentation to determine which levels those are.

Note: In an HACWS configuration, the nodes may be running any level of AIX and PSSP.

The primary and backup control workstations are configured in a two-node rotating configuration. The external fixed disks are configured for nonconcurrent access only. For information on HACMP, refer to HACMP: Concepts and Facilities.

Building on HACMP, the ssp.hacws optional installation package that is part of the IBM Parallel System Support Programs for AIX provides:

- Scripts to create a backup control workstation
- Verification programs
- HACMP pre- and post-event scripts
- Scripts to synchronize the backup and primary control workstations

When the primary control workstation fails, there is a disruptive failover (that is, the failover is noticeable) of the control workstation to the backup control workstation. This failover:

- Switches the external fixed disks
- Reconfigures network adapters from boot addresses to service addresses to perform IP address takeover
- Performs hardware address takeover
- Remounts file systems
- Restarts the control workstation applications
- Resumes hardware monitoring
- Allows clients to reconnect to obtain services or update control workstation data

The backup control workstation assumes the IP address (service address) and IP aliases of the primary control workstation, resulting in only one active control workstation at a time and allowing client applications to run without changes. You cannot use IPv6 aliasing on a system with HACWS or HACMP. See Figure 38 for an illustration of addressing in an HACWS configuration.
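
The address handling can be pictured with a small model. The following is a minimal sketch of IP address takeover only; it does not model the disk, file system, or application steps of a failover. The class, the function, and all addresses are hypothetical and are not HACMP interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class ControlWorkstation:
    name: str
    boot_address: str
    addresses: set = field(default_factory=set)  # addresses currently configured

    def __post_init__(self):
        # A workstation that holds no shared resources answers on its boot address.
        if not self.addresses:
            self.addresses = {self.boot_address}


def take_over(failed, survivor):
    """Move the service address and aliases from the failed workstation to the survivor."""
    survivor.addresses = set(failed.addresses)  # replaces the survivor's boot address
    failed.addresses = set()                    # the failed machine no longer answers


# Hypothetical addresses: one service address plus one IP alias.
primary = ControlWorkstation("primary", "192.0.2.1", {"192.0.2.10", "192.0.2.11"})
backup = ControlWorkstation("backup", "192.0.2.2")
take_over(primary, backup)
print(sorted(backup.addresses))
```

Because the survivor ends up with exactly the addresses the failed machine held, clients keep using the same addresses, which is why client applications run without changes.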

Figure 38. Addressing in an HACWS configuration


Operating HACMP and the control workstation

This section describes some of the operational characteristics of the control workstation and HACMP.

Booting the control workstations

In a two-node rotating HACMP configuration, if you boot both the primary and backup control workstations at the same time, the first control workstation to have the cluster manager started acquires the shared resources. In an HACWS scenario, this causes the control workstation IP addresses (service addresses) and the /spdata/sys1 file system to be configured on that control workstation. It is important that the RS/6000 that is intended to be the active control workstation have the HACMP cluster manager started on it first. To help ensure that the correct control workstation acquires the shared resources, IBM suggests that the cluster manager be started manually instead of being defined to start at system restart. Both control workstations can then be booted in any order without a race condition during the boot process. After both control workstations are booted, the cluster manager can be started on the intended control workstation.
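
The first-started-wins behavior described above can be illustrated with a toy model. The following is a minimal sketch, not HACMP code; the class and method names are hypothetical.

```python
class SharedResources:
    """Toy model of the service addresses and /spdata/sys1 in a rotating configuration."""

    def __init__(self):
        self.owner = None  # no control workstation is active yet

    def start_cluster_manager(self, workstation):
        # The first cluster manager to start acquires the shared resources.
        # A later start on the other workstation joins without taking them.
        if self.owner is None:
            self.owner = workstation
        return self.owner


resources = SharedResources()
resources.start_cluster_manager("backup")   # started first, becomes active
resources.start_cluster_manager("primary")  # started second, stays inactive
print(resources.owner)
```

Starting the cluster manager by hand, as suggested above, is what lets the operator choose which call happens first.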

If the control workstation function is started on the unintended control workstation, you can perform the following sequence of events to make the intended control workstation active:

1. Boot the intended control workstation without starting the cluster manager.
2. Do a graceful shutdown without takeover on the incorrectly active control workstation.
   Using SMIT:
     TYPE
       smit clstop
       - The Stop Cluster Services menu is displayed.
     SELECT
       now (for Stop now option)
     SELECT
       graceful (for Shutdown mode option)
     PRESS
       Enter (to Do)
   Using the Command Line:
     Enter
   ```
   /usr/sbin/cluster/utilities/clstop -y -N -g
   ```
   Note: Never use kill to stop the HACMP cluster manager. Doing so will cause the node to fail.
3. Start the cluster manager on the intended control workstation.
   Using SMIT:
     TYPE
       smit clstart
       - The Start Cluster Services menu is displayed.
     SELECT
       now (for Start now option)
     SELECT
       true (for Startup Cluster Information Daemon? option)
       Note: Don't use the on system restart option.
     PRESS
       Enter (to Do)
4. Execute the clstat utility to make sure HACMP is up and running on the correct control workstation.
5. Start the cluster manager on the inactive control workstation.
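
The order of this procedure matters, so it can help to print it as a checklist for the two machines involved. The following is a minimal sketch that only builds the list of steps; it does not run anything on either control workstation. The function name and host names are hypothetical, and the commands are the ones named in the procedure.

```python
# Command given in this section for a graceful shutdown without takeover.
CLSTOP_GRACEFUL = "/usr/sbin/cluster/utilities/clstop -y -N -g"


def recovery_steps(intended, unintended):
    """Return (workstation, action) pairs in the order they must be performed."""
    return [
        (intended, "boot without starting the cluster manager"),
        (unintended, CLSTOP_GRACEFUL),
        (intended, "smit clstart (Start now, not on system restart)"),
        (intended, "clstat"),
        (unintended, "smit clstart"),
    ]


# Hypothetical host names for the two control workstations.
for number, (host, action) in enumerate(recovery_steps("cws1", "cws2"), start=1):
    print(f"{number}. on {host}: {action}")
```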

Avoiding false adapter failures on configurations without standby adapters

If a control workstation is configured without a standby adapter on one of its service networks, and the other control workstation is powered off, the HACMP software may have difficulty determining whether a service adapter has failed. This is because the HACMP cluster manager cannot use a standby adapter to force packet traffic over the service adapter to verify its operation. This limitation is less of an exposure if one or more of the following is true:

- There are network devices which answer broadcast ICMP ECHO requests. To verify this you can ping the broadcast address and determine the number of different IP addresses that respond.
- The service adapter is under heavy use. In this instance the inbound packet count will continue to increase over the service adapter without stimulation from the cluster manager.

If neither of these cases is true, then HACMP might report that a service adapter has failed, even though there has not been a failure, because there are no other adapters with which to communicate. (For example, this can occur when all the nodes are powered off.)

An enhancement to netmon, the network monitor portion of the HACMP cluster manager, is described below. It can be configured to allow more accurate determination of a service adapter failure. This function can be used in configurations that require a single service adapter per network.

A netmon configuration file, /usr/sbin/cluster/netmon.cf, will specify additional network addresses to which ICMP ECHO requests can be sent. The configuration file consists of one IP address or IP label per line. The maximum number of addresses used is five. All addresses specified after the fifth will be ignored. No comments are allowed in the file.

The following is an example of a netmon.cf configuration file:

180.146.181.119
steamer
chowder
180.146.181.121
mussel

This file must exist at cluster startup. The cluster software will scan the configuration file during its initialization phase. When netmon needs to stimulate the network to verify adapter function, it will send ICMP ECHO requests to each address. After sending the request to every address, netmon will check the inbound packet count before determining whether there is an adapter failure.
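
The file rules and the packet-count check can be sketched as follows. This is a minimal illustration, not the netmon implementation. The function names are hypothetical. Skipping blank lines and rejecting lines that start with # are choices made for this sketch, because the text above does not say how netmon treats them. Reading "no increase in the inbound packet count" as a failure is likewise an assumption.

```python
MAX_ADDRESSES = 5  # addresses after the fifth are ignored


def parse_netmon_cf(text):
    """Return the addresses or IP labels that would be used, one per line."""
    entries = []
    for line in text.splitlines():
        entry = line.strip()
        if not entry:
            continue  # assumption: blank lines carry no address
        if entry.startswith("#"):
            raise ValueError("comments are not allowed in netmon.cf")
        entries.append(entry)
    return entries[:MAX_ADDRESSES]


def adapter_failed(inbound_before, inbound_after):
    """Assume a failure when the ICMP ECHO requests produced no inbound packets."""
    return inbound_after <= inbound_before


SAMPLE = """180.146.181.119
steamer
chowder
180.146.181.121
mussel
"""
print(parse_netmon_cf(SAMPLE))
```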

Conditions that cause a failover of the control workstation

The following conditions will cause a failover of a control workstation:

- An intentional failover
  This occurs when the operator stops the cluster manager on a node with a shutdown mode of graceful with takeover.
- A failure of the operating system on the active control workstation
  If the AIX operating system fails, the cluster manager on the inactive control workstation will detect this by observing missed heartbeats. The cluster manager then declares the crashed control workstation dead and performs a takeover of the shared resources.
- A failure of an external disk adapter that is not backed up
  If HACWS is configured with only one external disk adapter per control workstation, you need to write an Error Notification Object for the failure of the disk adapter to do a graceful shutdown with takeover. The other control workstation will then take over the shared resources.
- A service network failure occurs on the active control workstation and the inactive control workstation is available to take over
  A failure of a service adapter on the active control workstation, without a standby adapter to take over, will cause some SP node subsystems to perceive that the control workstation has failed and the system will quickly become unstable. To prevent such system instability, a service network failure on the active control workstation is promoted to a node failure, causing a failover to the inactive control workstation.

Failures that do not cause a failover of the control workstation

The following conditions will not cause a failover of a control workstation:

- A hung control workstation with network adapters that still heartbeat
  In the case where the application software hangs and the network adapters being monitored via HACMP still respond to heartbeats from the inactive control workstation, HACMP will not detect the hang and will not fail over.
- A service network failure occurs on the active control workstation and the inactive control workstation is not available to take over
  If the inactive control workstation is not available to take over, then a service network failure on the active control workstation will not be promoted to a node failure because no failover can occur.
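
The two lists of conditions can be summarized as a single decision function. The following is a minimal sketch; the enumeration and function names are hypothetical. It assumes that the Error Notification Object for the disk adapter has been written, and it assumes that no event can cause a failover when the inactive control workstation is unavailable, which this section states explicitly only for service network failures.

```python
from enum import Enum, auto


class Event(Enum):
    INTENTIONAL_TAKEOVER = auto()     # cluster manager stopped, graceful with takeover
    OS_FAILURE = auto()               # missed heartbeats from the active workstation
    DISK_ADAPTER_FAILURE = auto()     # sole external disk adapter, notification object written
    SERVICE_NETWORK_FAILURE = auto()  # no standby adapter to take over
    APPLICATION_HANG = auto()         # adapters still answer heartbeats


def causes_failover(event, inactive_available=True):
    """Return True when the event leads to a failover of the control workstation."""
    if event is Event.APPLICATION_HANG:
        return False  # HACMP does not detect the hang
    # Every other event needs a control workstation that can take over.
    return inactive_available


print(causes_failover(Event.SERVICE_NETWORK_FAILURE, inactive_available=False))
```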

General guidelines for making configuration changes

IBM suggests that you make configuration changes on the SP system only when the primary control workstation is the active control workstation. Most configuration changes are also allowed when the backup control workstation is the active control workstation. In either case, some configuration files will need to be updated on the other control workstation: changes made on the primary control workstation are carried to the backup control workstation, and changes made on the backup control workstation go in the opposite direction (backup to primary instead of primary to backup). It is easier to manage the system when the updates go in only one direction.

You can decide to what extent you make configuration changes on the backup control workstation. Because only two machines are involved, the bidirectional updates might not be a problem.

The following table lists tasks that can be performed in an HACWS configuration and whether or not it is possible to update configuration files on the active backup control workstation. A 'No' in the 'Backup CWS Active' column means that the file cannot be updated on the active backup control workstation. In that case, you will have to update the file on the primary control workstation when it is active again.

Table 16. HACWS configuration and task summary

Task	Primary CWS active	Backup CWS active
Update passwords	Yes	No
Add or change users	Yes	No
Change Kerberos keys	Yes	No
Install a node	Yes	Yes
Change or add system partitions	Yes	Yes
Add nodes to the system	Yes	No
Hardware monitoring	Yes	Yes
Reboot nodes	Yes	Yes
Run diagnostics	Yes	Yes
Shut down and restart system	Yes	Yes
Run parallel jobs	Yes	Yes
Update file collections	Yes	Yes
Accounting	Yes	Yes
Change site environment information	Yes	No
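
Table 16 can also be kept as a lookup so that an operations script can refuse a task that must wait for the primary control workstation. The following is a minimal sketch that copies the table values; the dictionary and function names are hypothetical.

```python
# task -> (allowed when primary CWS active, allowed when backup CWS active)
TABLE_16 = {
    "Update passwords": (True, False),
    "Add or change users": (True, False),
    "Change Kerberos keys": (True, False),
    "Install a node": (True, True),
    "Change or add system partitions": (True, True),
    "Add nodes to the system": (True, False),
    "Hardware monitoring": (True, True),
    "Reboot nodes": (True, True),
    "Run diagnostics": (True, True),
    "Shut down and restart system": (True, True),
    "Run parallel jobs": (True, True),
    "Update file collections": (True, True),
    "Accounting": (True, True),
    "Change site environment information": (True, False),
}


def is_allowed(task, backup_active):
    """Look up whether a task can be performed on the currently active workstation."""
    primary_ok, backup_ok = TABLE_16[task]
    return backup_ok if backup_active else primary_ok


def tasks_to_defer():
    """Return the tasks that must wait until the primary is active again."""
    return [task for task, (_, backup_ok) in TABLE_16.items() if not backup_ok]


print(tasks_to_defer())
```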

The following table lists the tasks whose support is different for SP-attached servers if the active backup control workstation has not been enabled to the servers:

Table 17. High Availability Control Workstation and task summary (SP-attached servers)

Task	Backup CWS active
Install an SP-attached server	No
Reboot SP-attached servers	No
Shut down and restart the system	Yes

Note that although you can shut down and restart the system with the SP-attached servers present, the attached servers will not be included in the process.
