The recovery scenarios describe what a system administrator or operator might see when the IBM Recoverable Virtual Shared Disk component goes into action to recover from system problems.
You know that the rvsd subsystem is performing recovery processing when you see IBM Virtual Shared Disks in the active or suspended state that you did not put there yourself. Use the IBM Virtual Shared Disk Perspective to display the states of your IBM Virtual Shared Disk nodes. See the chapter on Managing and Monitoring Virtual Shared Disks, section on Monitoring Virtual Shared Disks, in PSSP: Managing Shared Disks.
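If you prefer the command line to the Perspective, you can also list the state of each IBM Virtual Shared Disk on the nodes themselves. The sketch below assumes the commands are in the default installation directory (/usr/lpp/csd/bin) and that dsh is configured to reach all nodes:

   # List each virtual shared disk and its current state on every node.
   # Disks in the suspended state that you did not suspend yourself
   # indicate that recovery is in progress.
   dsh -a /usr/lpp/csd/bin/lsvsd -l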
If you recognize that recovery is not taking place normally, check whether the rvsd subsystem is active on all nodes. Use the IBM Virtual Shared Disk Perspective, or the ha.vsd query and hc.vsd query commands, to see whether the respective subsystems are active. If active=0 or state=idle persists for an extended period of time, recovery is not taking place normally. The ha.vsd query command returns output in the following format:
Subsystem         Group            PID     Status
 rvsd             rvsd             18320   active
rvsd(vsd): quorum= 7, active=0, state=idle, isolation=member,
           NoNodes=5, lastProtocol=nodes_failing,
           adapter_recovery=on, adapter_status=up,
           RefreshProtocol has never been issued from this node,
           Running function level 3.2.0.0.
The hc.vsd query command output looks like this:
Subsystem         Group            PID     Status
 hc.hc            rvsd             20440   active
hc(hc): active=0, state=waiting for client to connect
        PING_DELAY=600
        CLIENT_PATH=/tmp/serv
        SCRIPT_PATH=/usr/lpp/csd/bin
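To check every node at once rather than one node at a time, you can run either query under dsh. This is only a sketch; it assumes dsh is configured for all nodes in the partition and that the commands are in /usr/lpp/csd/bin:

   # Run the query on every node and flag any node whose rvsd subsystem
   # reports active=0 or state=idle.
   dsh -a '/usr/lpp/csd/bin/ha.vsd query' | egrep 'active=0|state=idle'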
You should have disabled or removed all user-provided scripts that issue change-of-state commands to IBM Virtual Shared Disks.
Monitor the activity of the rvsd subsystem with the IBM Virtual Shared Disk Perspective so that you become aware of potential problems as soon as they arise. You can begin a monitoring session, leave it running in a window on your workstation, and go about other work while the monitoring continues.
Each of the recovery scenarios is organized as follows: what happened, how the problem was detected, which components are affected, and the recovery steps.
A node has either hung or failed.
Node failure was detected by the Group Services program, and it has notified the rvsd subsystem. An operator may have seen the change in state displayed by the IBM Virtual Shared Disk Perspective.
The affected components can include everything that accesses the IBM Virtual Shared Disks: the IBM Virtual Shared Disks themselves, software applications, and the rvsd and related subsystems running on the failed nodes.
The recovery steps are:
Some of the following actions must be done manually; others are done automatically by the rvsd subsystem. You should have procedures in place to instruct operators when and how to do the manual operations.
The sequence of events triggered by a switch failure depends on whether adapter recovery is enabled.
The rvsd subsystem initiates recovery on the failing nodes. Applications running on these nodes might generate errors, and error reports might be generated as well.
Node failure was detected by the rvsd subsystem. An operator may have seen the change in state displayed by the IBM Virtual Shared Disk Perspective.
The affected components can include everything that accesses the IBM Virtual Shared Disks: the IBM Virtual Shared Disks themselves, software applications, and the rvsd and related subsystems running on the failed nodes.
Follow problem determination procedures for the switch failure. See Diagnosing SP Switch problems and Diagnosing SP Switch2 problems.
Remote IBM Virtual Shared Disk I/O requests hang and then fail after about 15 minutes. The IBM Virtual Shared Disk clients see a timeout error.
The rvsd subsystem neither detects nor handles switch failure when adapter recovery is disabled or when an unsupported adapter is used. An operator using the SP or IBM Virtual Shared Disk Perspective might recognize that switch_responds for the node is off.
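The same information is available from the SDR on the control workstation. This sketch assumes the standard SDRGetObjects command and the attribute names of the switch_responds class:

   # Show, for each node, whether the switch adapter is responding (1) or not (0).
   SDRGetObjects switch_responds node_number switch_responds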
Applications using IBM Virtual Shared Disks will hang and then fail after about 15 minutes.
An operator issues the Estart or Eunfence command to restart the switch.
If Estart or Eunfence fails, use standard diagnostic methods for handling switch problems. See Diagnosing SP Switch problems and Diagnosing SP Switch2 problems.
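For reference, a minimal command sequence for this step might look like the following. Both commands are normally run on the control workstation, and the node numbers given to Eunfence here are hypothetical:

   # Restart the switch for the system partition.
   Estart
   # Or, if specific nodes were fenced from the switch, unfence them
   # (hypothetical node numbers 5 and 6).
   Eunfence 5 6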
You can use the Problem Management interface to run a script that stops the rvsd subsystem when the switch fails and restarts it when the switch is active again, as in the sketch that follows.
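One possible shape for such a script is sketched below. The ha.vsd stop and ha.vsd start commands stop and restart the rvsd subsystem; the script name, the up/down argument convention, and the idea of driving it from a Problem Management event subscription are assumptions to be adapted to your configuration:

   #!/bin/ksh
   # switch_rvsd.ksh (hypothetical) - invoked by a Problem Management event:
   #   "down" when the switch fails, "up" when the switch is active again.
   PATH=$PATH:/usr/lpp/csd/bin
   case "$1" in
     down) ha.vsd stop  ;;   # stop the rvsd subsystem on switch failure
     up)   ha.vsd start ;;   # restart it once the switch is back
     *)    echo "usage: $0 up|down" >&2; exit 1 ;;
   esac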
There is no automatic method to determine that the daemons are failing.
For more information, see the book RS/6000 Cluster Technology: Group Services Programming Guide and Reference.
I/O errors occur on some or all of the IBM Virtual Shared Disks served from a node.
There is no automatic method of determining that volume group failures are occurring. An entry is posted in the System Error Log, and a hardware error code (EIO) is returned to the IBM Virtual Shared Disk device driver.
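You can review these entries with the standard AIX errpt command on the server node; this is only an illustrative sketch:

   # Summarize recent hardware entries in the System Error Log.
   errpt -d H | more
   # Show the detailed report for those entries.
   errpt -a -d H | more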
The IBM Virtual Shared Disk subsystem cannot access the data on the failed volume groups.
If EIO recovery has taken place on this volume group in approximately the last seven minutes, the IBM Virtual Shared Disks in this volume group are placed in the stopped state.
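After the underlying disk problem is corrected, an operator must bring the stopped disks back. A minimal sketch, assuming the standard csd commands and a hypothetical disk name vsd1n1:

   # Make a stopped virtual shared disk available again:
   # preparevsd moves it from stopped to suspended, resumevsd to active.
   /usr/lpp/csd/bin/preparevsd vsd1n1
   /usr/lpp/csd/bin/resumevsd vsd1n1
   # startvsd combines both steps:
   # /usr/lpp/csd/bin/startvsd vsd1n1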