Administration Guide

Monitoring nodes on an SP switch

In an SP system with a switch, you can monitor and maximize the availability of nodes on the switch by using the Emonitor daemon. The Emonitor daemon is managed by the System Resource Controller (SRC). One instance of the daemon exists for each system partition and it is named Emonitor.partition_name. The /etc/SP/Emonitor.cfg system-wide configuration file lists all node numbers (one per line) on the system that are to be monitored.
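The one-node-number-per-line layout of /etc/SP/Emonitor.cfg can be sketched as follows. This is only an illustration of the file format described above. The function names are hypothetical, and sorting and de-duplicating the node numbers is an assumption made for tidiness, not a documented requirement.

```python
def render_emonitor_cfg(node_numbers):
    """Return file text with one node number per line.

    Sorting and removing duplicates is an assumption for readability;
    the guide only requires one node number per line.
    """
    unique = sorted(set(int(n) for n in node_numbers))
    return "".join(f"{n}\n" for n in unique)


def parse_emonitor_cfg(text):
    """Return the node numbers listed in the text, skipping blank lines."""
    return [int(line) for line in text.splitlines() if line.strip()]
```

For example, render_emonitor_cfg([5, 1, 17]) produces three lines that could be written to the configuration file before monitoring is started.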

To start monitoring, run the Estart -m command. Once the Emonitor daemon is started, the SRC restarts it if it halts abnormally. To end monitoring, run the /usr/lpp/ssp/bin/emonctrl -s command to stop the daemon in each SP system partition.

See the book PSSP: Command and Technical Reference for more information on the Emonitor command.

SP Switch admin daemon (cssadm)

The cssadm daemon runs on the control workstation and it subscribes to information provided by the Event Management subsystem to monitor node, node adapter, and switch clock information. If configured to, the daemon provides node and switch clock recovery on the SP Switch. The daemon is started from /etc/inittab on the control workstation and is controlled by the SRC subsystem. The SRC subsystem name for this daemon is swtadmd. The SRC subsystem is not active in an SP system with the SP Switch2.

Selecting the level of recovery on the SP Switch

The cssadm daemon uses a configuration file to determine the level of recovery you want it to perform. The file is /spdata/sys1/ha/css/cssadm.cfg. It contains up to two lines of configuration data as follows:

Node 1
Switch 1

The first line (Node 1) is present by default. It selects node switch recovery. Set it to zero to disable node switch recovery.

The second line is not included in the file as shipped. It is available so you can select switch clock recovery in SP systems with the SP Switch. Manually add the second line (Switch 1) if you want to configure the cssadm daemon to perform switch clock recovery on the SP Switch. If you want to explicitly indicate that switch clock recovery is disabled, add Switch 0 in the second line.

After you modify the /spdata/sys1/ha/css/cssadm.cfg file, stop and restart the cssadm daemon using the following commands:

stopsrc -s swtadmd
startsrc -s swtadmd
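The two-keyword format of cssadm.cfg described above can be read and updated with a few lines of code. The following is a minimal sketch, not part of PSSP. The function names are hypothetical, and it assumes each setting is a keyword (Node or Switch) followed by 0 or 1, separated by white space. It works on text only; writing the file and restarting swtadmd remain separate steps.

```python
def parse_cssadm_cfg(text):
    """Return {'Node': 0|1|None, 'Switch': 0|1|None} from cssadm.cfg text.

    None means the keyword does not appear in the file.
    """
    settings = {"Node": None, "Switch": None}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[0] in settings and parts[1] in ("0", "1"):
            settings[parts[0]] = int(parts[1])
    return settings


def set_recovery(text, keyword, enabled):
    """Return new file text with `keyword` set to 1 (enabled) or 0.

    An existing line for the keyword is rewritten in place; otherwise
    the line is appended, as when Switch 1 is added by hand.
    """
    new_line = f"{keyword} {1 if enabled else 0}"
    lines = text.splitlines()
    for i, line in enumerate(lines):
        parts = line.split()
        if parts and parts[0] == keyword:
            lines[i] = new_line
            break
    else:
        lines.append(new_line)
    return "\n".join(lines) + "\n"
```

Applied to the file as shipped, set_recovery(text, "Switch", True) yields the two-line form shown earlier in this section.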

To enable the SP Switch power monitoring for switch recovery, modify the /spdata/sys1/spmon/hmthresholds file according to the specific instruction within the file. Look for lines similar to those in the following example:


  ·
  ·
  ·

# Software thresholding enabled to detect a switch master oscillator failure
# PS1_POWERGOOD set to low threshold at .700
# PS2_POWERGOOD set to low threshold at .700
# In order to enable the thresholding to detect a switch master oscillator
# failure the following line should replace the default thresholds that
# follow the "DO NOT change these values..." line. Please be sure you
# understand the purpose of these thresholds by reading the documentation
# concerning the switch admin daemon (cssadm).
# 0x81 0x00 0xff .700 0xff .700 0xff 0x00 0xff 0x00 0xff 0x00 0xff 0x00 0xff
# Set to 0x00 for defect 47883
#
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
# DO NOT change these values without contacting IBM Support.
#
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 0x81 0x00 0xff 0x00 0xff 0x00 0xff 0x00 0xff 0x00 0xff 0x00 0xff 0x00 0xff

  ·
  ·
  ·

Replace the active 0x81... line (the last line shown in the example) with the 0x81... line that is currently seven lines above it in the commentary, removing the leading # so that the new line is not treated as a comment. That line sets .700 in two places that are otherwise 0x00. Then stop and restart the hardmon daemon using the following commands:

stopsrc -s hardmon
startsrc -s hardmon
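The line replacement described above can be sketched in code. This is an illustration only, written against the excerpt shown in this section. The function name is hypothetical, and the sketch assumes the file contains exactly one commented 0x81 line with .700 values and exactly one active 0x81 line. If the real file differs, it raises an error so that you can edit the file by hand.

```python
def enable_oscillator_thresholds(text):
    """Return hmthresholds text with the active 0x81 line replaced by the
    commented 0x81 line that carries the .700 thresholds."""
    lines = text.splitlines()
    commented = []  # candidate threshold lines found in the commentary
    active = []     # indexes of uncommented 0x81 lines
    for i, line in enumerate(lines):
        stripped = line.strip()
        if stripped.startswith("#"):
            candidate = stripped.lstrip("#").strip()
            if candidate.startswith("0x81") and ".700" in candidate:
                commented.append(candidate)
        elif stripped.startswith("0x81"):
            active.append(i)
    if len(commented) != 1 or len(active) != 1:
        raise ValueError("unexpected file layout; edit the file by hand")
    old = lines[active[0]]
    indent = old[: len(old) - len(old.lstrip())]  # keep original indentation
    lines[active[0]] = indent + commented[0]
    return "\n".join(lines) + "\n"
```

The function only transforms text. Saving a backup copy of the file before writing the result, and restarting hardmon afterward, are left to the administrator.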

Node recovery

Based on information the cssadm daemon obtains from the Event Management subsystem regarding the state of nodes and node adapters, the daemon handles the following events:

Node joins Host membership or node leaves Host membership (IBM.PSSP.Response.Node.state)
When a node comes up in this state, cssadm checks whether it is the oncoming primary node in its partition. If it is, cssadm runs the Estart command.
If the Estart command fails, a message is written to the cssadm log file and no further action is taken.
Node leaves Switch Adapter membership (IBM.PSSP.Membership.LANAdapter.state)
If this node is the primary node, the daemon checks to see if there is an active primary backup node. If there is, the daemon does not intervene because the primary backup node takes over as the primary. If there is no active primary backup node, the daemon checks the oncoming primary node and, if that node is up on host responds, attempts an Estart.
If this node is not the primary node and the primary is "none" in the node's partition, cssadm makes the same checks. If there is an active primary backup node, the daemon does not intervene because the primary backup node takes over as the primary. If there is no active primary backup node, the daemon checks the oncoming primary node and, if that node is up on host responds, attempts an Estart.

Node event handling operates on a system partition basis. Events from each of the membership groups are handled based on the effect that event would have on that system partition.
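The event handling described above can be summarized as a small decision sketch. This is not the daemon's implementation. The class, field, and function names are hypothetical, and the case where a non-primary node leaves while a primary exists is assumed to need no action because this section does not describe it.

```python
from dataclasses import dataclass


@dataclass
class PartitionState:
    primary: str                          # node name, or "none"
    oncoming_primary: str                 # oncoming primary node name
    backup_active: bool                   # active primary backup node exists
    oncoming_primary_host_responds: bool  # oncoming primary is up


def estart_on_node_up(node, state):
    """Host membership event: Estart only if the node that came up is the
    oncoming primary node of its partition."""
    return node == state.oncoming_primary


def estart_on_adapter_leave(node, state):
    """Switch Adapter membership event: decide whether to attempt Estart."""
    affects_primary = node == state.primary or state.primary == "none"
    if not affects_primary:
        return False  # assumption: case not described in this section
    if state.backup_active:
        return False  # the primary backup node takes over; no intervention
    return state.oncoming_primary_host_responds
```

Because node event handling operates on a system partition basis, one such state record would be kept for each partition.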

HACWS Considerations: In an HACWS environment, do not add an /etc/inittab entry to start the cssadm daemon. The HACWS function starts the cssadm daemon on either the primary or backup control workstation.

As stated previously, the cssadm daemon interacts with the Event Management subsystem. It also reads a configuration file to determine whether node recovery is enabled. The configuration file is /spdata/sys1/ha/css/cssadm.cfg, and the line that controls node recovery reads as follows:

Node 1

If the value is set to 1, the daemon reacts to the Event Management events described previously. If you do not want the daemon to react to the events, disable node recovery by changing the 1 to a 0, then stop the cssadm daemon by issuing the stopsrc -s swtadmd command. When you restart the daemon, it picks up the new value from the configuration file.

Understanding SP Switch recovery

Switch recovery is provided for systems with four or fewer SP Switches. To understand this function, you need a general understanding of SP Switch clocking.

The SP Switch uses a clocking hierarchy. One switch is designated the master clock switch for the entire network. All other switches receive clock information from that switch. Each switch must be initially clocked before it can join the switch network. This is done by the Eclock command. If a switch is powered down and back up, it needs to be reclocked before it can join the network. If the master switch is powered down and back up, then all switches need to be reclocked.

The other event that can happen in the clocking hierarchy is a master oscillator failure, though this is rare. If it occurs, the entire switch network will go down because the master clock will have been lost. In that case, it is necessary to move the master to another switch, if possible, and run the Eclock command again for the system.

The switch recovery of the cssadm daemon handles all global clocking events. It attempts to determine the state of the system and whether any reclocking is necessary based on the following:

The events received from the Event Management subsystem.
The hardmon subsystem.
SDR information

When global events that affect a master oscillator switch happen, it is usually necessary to alter the clocking topology. Before PSSP 3.2, the Eclock topology files contained only one alternative clocking permutation. Now there are more alternatives based on the number of switches available in the SP system.

When a non-master switch loses its controller responds or is powered off, the cssadm daemon makes note of it. When that switch becomes available again, the daemon runs the Eclock function to establish its switch clocking. If the master switch goes down, the daemon looks in the Eclock topology file for an alternative clocking topology and runs the Eclock function for the remaining switches with that alternative. This causes nodes to go off the switch. If node recovery is enabled, the daemon runs the Estart function; otherwise, you must run the Estart command yourself.
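The switch recovery behavior described above can be expressed as a small lookup of events to actions. This is a sketch for illustration, not the daemon's logic. The function name, the "down" and "up" event labels, and the action strings are all hypothetical, and any case this section does not describe raises an error rather than guessing.

```python
def recovery_actions(event, is_master, node_recovery_enabled):
    """Return the actions described in this section for a switch event.

    event is "down" (lost controller responds or powered off) or
    "up" (the switch is available again).
    """
    if event == "down" and not is_master:
        return ["note that the switch is unavailable"]
    if event == "up" and not is_master:
        return ["run Eclock for the returning switch"]
    if event == "down" and is_master:
        actions = [
            "select an alternative topology from the Eclock topology file",
            "run Eclock for the remaining switches",
        ]
        if node_recovery_enabled:
            actions.append("daemon runs Estart")
        else:
            actions.append("operator must run Estart")
        return actions
    raise ValueError("case not described in this section")
```

The sketch makes visible that only a master switch failure leads to an Estart, and that who runs it depends on the node recovery setting in cssadm.cfg.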

When using switch recovery, some restrictions are necessary in order for the daemon to always be aware of the current state of the switch clocking topology. The following restrictions apply:

Do not run the Eclock -s -m command.
Do not specify an alternate switch topology file using the Eclock -f command.

Stopping and starting the cssadm daemon

The daemon is added to the SRC as the subsystem swtadmd and is started from the /etc/inittab file automatically. To stop the daemon, run the command:

stopsrc -s swtadmd

To permanently stop the daemon from running, remove the subsystem entry from the /etc/inittab file.

To restart the daemon, run the command:

startsrc -s swtadmd

cssadm log files

The daemon generates the following log files, all located in the /var/adm/SPlogs/css directory:

cssadm.debug
This file contains trace information of the actions of the daemon. It contains entries for each event received and handled, as well as how the events were handled and the results.
cssadm.stderr
This file contains any unexpected error messages received by the daemon while performing commands external to the daemon.
cssadm.stdout
This file contains any unexpected informational messages received by the daemon while performing commands external to the daemon. In general, this log should remain empty.
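Because the .stderr and .stdout logs hold only unexpected messages, a quick check for non-empty files can flag a problem worth investigating. The following is a minimal sketch. The function name is hypothetical, and the check relies only on the file names and directory given above.

```python
from pathlib import Path


def nonempty_output_logs(log_dir, daemon="cssadm"):
    """Return the names of the daemon's .stderr and .stdout logs in
    log_dir that exist and are not empty."""
    found = []
    for suffix in ("stderr", "stdout"):
        path = Path(log_dir) / f"{daemon}.{suffix}"
        if path.exists() and path.stat().st_size > 0:
            found.append(path.name)
    return found
```

On the control workstation this would be called as nonempty_output_logs("/var/adm/SPlogs/css"). Any name it returns points to a log to read alongside the cssadm.debug trace.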

SP Switch2 admin daemon (cssadm2)

The cssadm2 daemon runs on the control workstation. It performs the same node recovery functions for SP Switch2 systems as the cssadm daemon does for SP Switch systems. The daemon is started from the /etc/inittab file on the control workstation and is also managed by the SRC subsystem. The SRC subsystem name for this daemon is swtadmd2. This subsystem is not active on an SP Switch system.

Selecting the level of recovery on the SP Switch2

The cssadm2 daemon uses the /spdata/sys1/ha/css/cssadm.cfg configuration file to determine the level of recovery you want. In a two-plane SP Switch2 system, the configuration file determines the level of recovery on both switch planes. The file contains the following line:

Node 1

This line is present in the file and selects node switch recovery by default. Set it to 0 if you do not want node switch recovery enabled. After you modify the value in this file, stop and restart the cssadm2 daemon as described in Stopping and starting the cssadm2 daemon.

When node recovery is enabled, the cssadm2 daemon handles the same Event Management events in the same way as described for the cssadm daemon.

Stopping and starting the cssadm2 daemon

The daemon is added to the SRC as the subsystem swtadmd2 and is started from the /etc/inittab file automatically. To stop the daemon, run the command:

stopsrc -s swtadmd2

To permanently stop the daemon from running, remove the subsystem entry from the /etc/inittab file.

To restart the daemon, run the command:

startsrc -s swtadmd2

cssadm2 log files

The daemon generates the following log files, all located in the /var/adm/SPlogs/css directory:

cssadm2.debug
This file contains trace information of the actions of the daemon. It contains entries for each event received and handled, as well as how the events were handled and the results.
cssadm2.stderr
This file contains any unexpected error messages received by the daemon while performing commands external to the daemon.
cssadm2.stdout
This file contains any unexpected informational messages received by the daemon while performing commands external to the daemon. In general, this log should remain empty.

SP Switch2 emaster daemon (emasterd)

The emasterd daemon runs on the control workstation and it subscribes to information provided by the Event Management subsystem in order to monitor the health of the Master Switch Sequencing (MSS) node. The MSS node is the node which periodically resequences the time-of-day (TOD) signals on the SP Switch2. The daemon is started from the /etc/inittab file on the control workstation and is controlled by the SRC subsystem. The SRC subsystem name for this daemon is emasterd. This subsystem is inactive on an SP system with the SP Switch.

MSS node recovery

There is no configuration associated with the emasterd daemon because its use is required on the SP Switch2 for automatic MSS node recovery. The emasterd daemon handles events received from the Event Management subsystem, such as changes in the IBM.PSSP.Response.Node.state and the IBM.PSSP.Membership.LANAdapter.state resource variables for the MSS node, to determine if the MSS node needs to be changed. If the emasterd daemon changes the MSS node, there might be a loss of clock signals to the nodes of up to 30 seconds.

To see which node is the current MSS node, run the Emaster command.

Restarting the emaster daemon

The daemon is added to the SRC as the subsystem emasterd and is started from the /etc/inittab file automatically. If you ever need to restart the daemon, run the startsrc -s emasterd command.

emasterd log files

The daemon generates the following log files, all located in the /var/adm/SPlogs/css directory:

emasterd.debug
This file contains trace information of the actions of the daemon. It contains entries for each event received and handled, as well as how the events were handled and the results.
emasterd.stderr
This file contains any unexpected error messages received by the daemon while performing commands external to the daemon.
emasterd.stdout
This file contains any unexpected informational messages received by the daemon while performing commands external to the daemon. In general, this log should remain empty.

css.summlog daemon

This daemon provides a summary log of switch-related AIX error log entries from all nodes in an SP in one convenient location on the control workstation. The log summarizes switch errors across the entire system, ordered by time and tagged with identifying information, and can serve as the starting point for switch-related diagnosis. The summary log file is /spdata/sys1/ha/css/summlog.

For additional information regarding switch-related diagnostic information, see the PSSP: Diagnosis Guide.

Starting and stopping the css.summlog daemon

To stop the daemon from running, run:

stopsrc -s swtlog

To permanently stop the daemon from running, remove the subsystem entry from the /etc/inittab file.

To restart the daemon, run:

startsrc -s swtlog

css.summlog log files

The daemon generates the following log files, all located in the /var/adm/SPlogs/css directory:

logevnt.out
This file contains records of errors that occurred in the components running on the node that experienced the error. These components are notified when switch-related error log entries are made, and they report the summary data to Event Management for transmission to the control workstation. The log is a text file and may exist on each node.
summlog.out
This file contains error information for the daemon which gathers summary log information and writes it to the summary log file. This log is a text file and exists on the control workstation.
