Administration Guide

Managing SP subsystems for High Availability Control Workstation function

This section describes the administrative tasks you must perform to ensure that the various SP subsystems fail over to the backup control workstation. The topics are presented in the same order as the chapters of this book.

Security features of the SP System

This section discusses security issues that pertain specifically to HACWS.

DCE and HACWS

DCE and HACWS restriction:
If you plan to have DCE authentication enabled, you cannot use HACWS. If you already use HACWS, do not enable DCE authentication.

Authorization for AIX remote commands and HACWS

HACWS and authorization for AIX remote commands restriction:
Installing and using HACWS requires authorization for both control workstations to issue AIX remote commands to each other and to nodes. Do not set authorization for AIX remote commands to `none`.

Kerberos V4 and HACWS

In an HACWS configuration the primary authentication service is not duplicated on the backup control workstation. If the primary control workstation is either a primary or secondary authentication server, then the backup control workstation must be a secondary authentication server. The backup control workstation cannot be a primary authentication server.

When the primary control workstation is down, and it is the primary authentication server, no updates to the authentication database can be made.

Kerberos configuration files

The backup control workstation needs to be added to the .klogin files on all the nodes and the control workstation. This .klogin file also needs to be kept in sync on the backup control workstation.

When control workstation services move back and forth between the two control workstations, the Kerberos server keys need to remain the same. To accomplish this, the following commands were issued on the backup control workstation when HACWS was initially configured:

$ /usr/lpp/ssp/rcmd/bin/rcp -p primary_name:/etc/krb-srvtab /etc/krb-srvtab.primary
$ cp -p /etc/krb-srvtab /etc/krb-srvtab.backup
$ cat /etc/krb-srvtab.primary >>/etc/krb-srvtab

Repeat this procedure whenever you change Kerberos server keys on either of the two control workstations.

Starting up and shutting down the SP system

The sequence files need to be kept in sync between the primary and backup control workstations in order for the cshutdown command to work on the backup control workstation. The following files need to be synchronized:

/etc/cstartSeq
/etc/shutSeq
/etc/subsysSeq
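
One way to check that the sequence files listed above match is to compare the local copies against copies fetched from the other control workstation. The following is a minimal sketch, not a supplied PSSP tool. It assumes the peer's files have already been copied into a staging directory (a hypothetical location passed as the second argument), and it treats a file that is absent on both sides as not in use. The function name `check_seq_sync` is illustrative.

```shell
#!/bin/sh
# Sketch only: compare the sequence files on this control workstation
# with staged copies taken from the peer control workstation.
# Assumption: the peer's files were copied beforehand into PEER_DIR
# (a hypothetical staging directory), for example with rcp -p.

SEQ_FILES="cstartSeq shutSeq subsysSeq"   # file names from the list above

# check_seq_sync LOCAL_DIR PEER_DIR
# Prints one line per file that is missing on one side or differs.
# Returns 0 when everything matches, 1 otherwise.
check_seq_sync() {
    local_dir=$1
    peer_dir=$2
    rc=0
    for f in $SEQ_FILES; do
        # Assumption: a file absent on both sides is simply not in use.
        if [ ! -f "$local_dir/$f" ] && [ ! -f "$peer_dir/$f" ]; then
            continue
        fi
        if ! cmp -s "$local_dir/$f" "$peer_dir/$f" 2>/dev/null; then
            echo "OUT OF SYNC: $f"
            rc=1
        fi
    done
    return $rc
}
```

For example, `check_seq_sync /etc /tmp/peer_etc` would compare the local /etc copies with copies staged in /tmp/peer_etc (a hypothetical path).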

Remote execution of SP commands

You can use parallel management commands with the backup control workstation as a target, even when the backup control workstation is inactive, as long as the host name of the backup control workstation is listed in the working collective or host list.

Controlling remote execution using Sysctl

When Sysctl is installed the configuration file /etc/sysctl.conf is created and updated on the backup control workstation. If you update this file or use a different configuration file, the file needs to be kept in sync between the primary and backup control workstation.

The Sysctl ACL file, which is shipped as /etc/sysctl.acl, must also be kept in sync between the primary and backup control workstations. If you add an ACL record to /etc/sysctl.acl on one control workstation, make the same change on the other.

When the backup control workstation is the inactive control workstation, Sysctl commands can be used because Sysctl is installed and running on the inactive control workstation.

Managing file collections

The inactive control workstation acts as a file collection client (supper is run using cron). This keeps all the file collection files and other default file collections in sync between the primary and backup control workstations. If you change the configuration or a file collection master on the active control workstation, perform a supper update for the affected file collection; you can also perform a supper update for all the file collections. If the update is not performed after the change and a failover occurs, the changes will not be on the new active control workstation and will not be available to the SP nodes.
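
The manual update described above can be scripted as a simple loop. This is a sketch under stated assumptions: it uses the /var/sysman/supper path that appears elsewhere in this section, the `SUPPER` variable and the `update_collections` function name are illustrative, and it is assumed to be run on the control workstation that is acting as the file collection client.

```shell
#!/bin/sh
# Sketch only: run "supper update" for each named file collection and
# report which ones failed.
# Assumption: SUPPER points at the supper command; the default is the
# path used elsewhere in this section.
SUPPER=${SUPPER:-/var/sysman/supper}

# update_collections COLLECTION...
# Returns 0 if every update succeeded, 1 if any failed.
update_collections() {
    failed=0
    for coll in "$@"; do
        if "$SUPPER" update "$coll"; then
            echo "updated: $coll"
        else
            echo "FAILED: $coll" >&2
            failed=1
        fi
    done
    return $failed
}
```

For example, `update_collections sup.admin user.admin power_system` would update the three collections named in this section one after another.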

If you create scan files on the active control workstation, you need to make sure that the scan files get created on the takeover control workstation after a control workstation failover. Otherwise, the scan files may not exist or may be out-of-date on the takeover control workstation, and file collections will not function as expected. To automate the creation of scan files after a control workstation failover, create a file named start_server.post_post_event with execute permission for the root user in the /var/adm/hacws/events directory on both the primary and the backup control workstation with contents similar to the following:

#!/bin/ksh -x
exec /var/sysman/supper scan sup.admin user.admin power_system

If you have additional file collections that you are using scan files with, you will need to add those to the list in the previous supper command. For more information on using scan files, refer to Verifying file collections using the scan file.

Managing user accounts

IBM suggests that HACWS configurations use NIS for user management when large numbers of users need to be managed on the SP system. NIS enforces a master-slave relationship for password databases and is easier to manage than file collections for user management in an HACWS configuration.

Using NIS with HACWS

If you are using NIS in an HACWS configuration and the NIS master database is the primary or backup control workstation, there must be at least one slave database within the NIS domain. For high availability of an SP system, IBM suggests that all the nodes where users log in be NIS slave database nodes.

Using file collections with HACWS

If you use file collections for user management, the /etc/passwd and associated files should be on the primary control workstation. Once a failover occurs and the backup control workstation is the active control workstation, the actual master passwd files are no longer available to the backup control workstation. For this reason changing your passwords, gecos information, and login shell is not allowed when the backup control workstation is active. (This is similar to the restriction with NIS when the NIS master is down.) The backup control workstation, when inactive, will act as a file collection client (similar to an SP node) and update its passwd files from the primary control workstation when it is active.

Controlling user login

If you are using login control on your SP system, the spacs_data must be kept in sync between the primary and backup control workstations. The block_usr_sample script and any derivatives of it need to be kept in sync between the primary and backup control workstations.

Managing SP resources

When a failover or reintegration occurs, running parallel jobs should not be affected. LoadLeveler does not have a dependency upon the control workstation as long as the central manager is not being run on the control workstation.

Managing time synchronization

Both control workstations must use the same source for time (for example, timemaster, Internet, or consensus). NTP should be running on both the inactive and active control workstations.

Managing the Automounter

Include the backup control workstation in the distribution of automounter configuration and map files.

Managing mail service

If the primary control workstation is the mail hub, use NFS to export the /var/spool/mail file system to the backup control workstation so that mail files for nodes are put in the same file system. The /var/spool/mail file system must be an externally mounted file system. After a failover, /var/spool/mail should be mounted locally on whichever control workstation is active, whether that is the primary or the backup. Use NFS to mount the file system on the control workstation that is currently inactive.

Accounting

If accounting was enabled in the Site Environment Information SMIT panel and the node number 0 was selected as the Accounting Master, you need to make a directory change. Create a symbolic link from both the primary and backup control workstations to put the cacct directory in the /spdata/sys1 file system.

Do the following on both the primary and backup control workstations after accounting has been enabled:

Enter:
```
ln -s /spdata/sys1/cacct /var/adm/cacct
```
This creates a symbolic link for the cacct directory so that the data that would otherwise be kept in the rootvg is automatically placed in the external file system when crunacct is run using cron. As a result, accounting reports that span a period of time do not depend on which control workstation was active.
On an inactive control workstation the crunacct and cmonacct scripts will fail because the symbolic link for the cacct directory on the inactive control workstation will not have a target directory.
To keep those scripts from failing, change the invocation in the crontab file to run a wrapper shell script that determines whether the control workstation is the inactive control workstation. If it is, the wrapper should exit without running the cmonacct or crunacct script. Use the lshacws command to make this determination. Alternatively, you may choose to let the scripts fail. The following is a sample script:
```
#!/bin/ksh
# Wrapper for the accounting entries in the crontab file.
# It takes the command to run, either crunacct or cmonacct,
# as its argument.

# Get the state of this control workstation from the
# string output of the lshacws command.
HACWS_STATE=`lshacws`

# Check to see if we are on an INACTIVE control workstation.
# If we are, exit quietly; if not, run the argument passed.

if [[ $HACWS_STATE = 1 || $HACWS_STATE = 16 ]] ; then
        exit 0
else
        # Go run the command
        "$@"
fi
exit 0
```

Using a switch

When an SP switch event occurs during a control workstation failover, the communication subsystem continues to handle switch faults, primary node failures, and primary backup node failures. System data about the switch is inaccessible during the control workstation failover and can become out-of-date. When the backup control workstation assumes the duties of the active control workstation, it must determine if a double failure has occurred (a switch event that occurred during a control workstation failover).

Perform the following to determine if a switch event has occurred and to synchronize the communication subsystem with the system data on the new control workstation:

To determine if a primary node takeover (failure) has occurred, check all nodes in each system partition for a file named act.top.1 that contains a timestamp during the control workstation failover. For an SP Switch subsystem, it is in the /var/adm/SPlogs/css directory. For a one-plane SP Switch2 subsystem, it is in the /var/adm/SPlogs/css0/p0 directory. For a two-plane SP Switch2 subsystem, the act.top.1 file is in the /var/adm/SPlogs/css0/p0 directory for plane 0 and in the /var/adm/SPlogs/css1/p0 directory for plane 1. If you locate an act.top.1 file containing a timestamp during the control workstation failover:
1. Run /usr/lpp/ssp/css/rc.switch on all the nodes in each system partition that has a node with an act.top.1 file.
2. Run Estart for each system partition that has a node with an act.top.1 file.
To determine if a primary backup node failed or a switch fault occurred, examine the flt file on the primary node. For an SP Switch subsystem, it is in the /var/adm/SPlogs/css directory. For an SP Switch2 subsystem, it is in the /var/adm/SPlogs/css0/p0 directory. If you find entries in the file that contain a timestamp during the control workstation failover, run Estart for each system partition having entries on its primary node.
IBM suggests that you use an Error Notification Object for switch faults, primary node takeovers, and control workstation failures to send mail or a message to alert you about such double failures. You can then examine the communication subsystem as described previously. Because the timeframe and probability for these double failures are very small, IBM does not suggest that you run rc.switch or Estart whenever a control workstation failover occurs.
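
As a first pass at the check described above, you could look for act.top.1 files that were last modified during the failover window. The following sketch relies on an assumption that is not stated in this book: it uses the file's modification time as a stand-in for the timestamp recorded inside the file, so a match should be confirmed by reading the file. The function name `find_in_window` and the scratch directory name are illustrative.

```shell
#!/bin/sh
# Sketch only: list files named NAME under DIR whose modification time
# is after START and not after END.
# Assumption: modification time approximates the timestamp recorded in
# the file; always confirm by inspecting the file contents.

# find_in_window DIR NAME START END
# START and END use the touch -t format, for example 202401011215.
find_in_window() {
    dir=$1
    name=$2
    start=$3
    end=$4
    # Scratch directory holding two reference files for find -newer.
    ref=${TMPDIR:-/tmp}/hacws_window.$$
    mkdir "$ref" || return 1
    touch -t "$start" "$ref/start"
    touch -t "$end" "$ref/end"
    find "$dir" -name "$name" -newer "$ref/start" ! -newer "$ref/end" -print
    status=$?
    rm -rf "$ref"
    return $status
}
```

For example, on a node with an SP Switch subsystem, `find_in_window /var/adm/SPlogs/css act.top.1 202401011215 202401011245` would print the file only if it was modified in that half-hour window. The same function could be pointed at the flt file on the primary node.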

Managing SP system partitions

System partition configuration changes must be done from the active control workstation. Such changes may be done on either the primary or backup control workstation, depending on which one is active. IBM suggests that configuration changes be done from the primary control workstation, but it is not required.

When you configure HACWS you are instructed to put IP address aliases that are required for partitioning or any other reason in a script called /spdata/sys1/hacws/rc.syspar_aliases. Do not use the IPv6 form of IP address aliases on a system with HACWS or HACMP. This script will be invoked by the HACWS event scripts at the correct times to enable system partitioning.

Managing extension nodes

The SP SNMP Manager that runs on the control workstation transfers dependent node configuration information to the SNMP Agent running on a dependent node. This SP SNMP Manager support will transfer to the active control workstation.

Using SP Perspectives

If you have installed ssp.clients or ssp.gui on an RS/6000 other than the primary or backup control workstations, carefully consider how you will monitor the SP system from a workstation other than the active control workstation. If you execute the perspectives commands without logging into the active control workstation and the active control workstation does a failover, the failover will not be represented on the SP Perspectives GUI until you take another action. Once you take another action, the perspectives commands will exit. You will have to execute the perspectives commands again and reconnect to the currently active control workstation.

This occurs because when you are monitoring over a TCP/IP connection and the host fails, the client side of that connection does not know about the failure until it sends a packet.

Since the users who run the perspectives commands from outside the control workstation should be well known, an Error Notification Object can be set up to send mail or a message to that list of users to inform them that the control workstation has performed a failover. Such a notification list for the users that monitor the SP system is a good idea in any event. After a failover of a control workstation (or any other major recovery action) an administrator should be notified. Several options for such notification exist:

Error Notification Objects
NetView for AIX control desk alerts
Trouble Ticket for AIX problem incidents
HACMP Notify script
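
Whichever of these mechanisms you choose, the notification itself can be a short script that mails the list of well known users. The following is a minimal sketch, not a supplied HACWS or HACMP script. The function name `notify_failover`, the `MAILER` variable, and the message wording are all illustrative, and how the script gets invoked depends on the notification mechanism you configure.

```shell
#!/bin/sh
# Sketch only: mail a failover notice to a list of users.
# Assumption: MAILER is a mail command that accepts "-s subject"
# followed by recipients and reads the message body from standard input.
MAILER=${MAILER:-mail}

# notify_failover ACTIVE_HOST RECIPIENT...
# Returns 2 if no recipients are given, otherwise the mailer's status.
notify_failover() {
    active_host=$1
    shift
    if [ $# -eq 0 ]; then
        echo "usage: notify_failover ACTIVE_HOST RECIPIENT..." >&2
        return 2
    fi
    {
        echo "The control workstation has performed a failover."
        echo "Active control workstation: $active_host"
        echo "Restart the perspectives commands and reconnect to it."
    } | "$MAILER" -s "HACWS failover: $active_host is now active" "$@"
}
```

For example, `notify_failover cws_backup admin1 admin2` would mail both users, where the host and user names are placeholders.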

Viewing which control workstation is active

With previous releases of PSSP, you could view which RS-232 tail was active using the Frame Environment Layout display of the SP System Monitor GUI. This information is now available using the hardware perspective. To display this information:

1. Ensure that the Frames/Switches pane is viewable. You may need to add a Frames/Switches pane first. Then, click in the pane to make it active and select a frame from the pane.
2. Select Actions > View and Modify Properties from the menu bar or just click on the Notebook icon.
3. Select the Frame Status page in the Frame notebook to view the information.
