The recovery scenarios describe what a system administrator or operator might see when the IBM Recoverable Virtual Shared Disk component goes into action to recover from system problems.
You know that the rvsd subsystem is performing recovery processing when you see IBM Virtual Shared Disks in the active or suspended state that you did not put there yourself. Use the IBM Virtual Shared Disk Perspective to display the states of your IBM Virtual Shared Disk nodes. See the chapter on Managing and Monitoring Virtual Shared Disks, section on Monitoring Virtual Shared Disks, in PSSP: Managing Shared Disks.
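If you prefer the command line to the Perspective, you can also list the state of each IBM Virtual Shared Disk on the nodes themselves. The sketch below assumes the commands are in the default installation directory (/usr/lpp/csd/bin) and that dsh is configured to reach all nodes:

   # List each virtual shared disk and its current state on every node.
   # Disks in the suspended state that you did not suspend yourself
   # indicate that recovery is in progress.
   dsh -a /usr/lpp/csd/bin/lsvsd -l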
If you recognize that recovery is not taking place normally, check whether the rvsd subsystem is active on all nodes. Use the IBM Virtual Shared Disk Perspective, or the ha.vsd query and hc.vsd query commands, to see whether the respective subsystems are active. If active=0 or state=idle persists for an extended period of time, recovery is not taking place normally. The ha.vsd query command returns output in the following format:
Subsystem         Group            PID     Status
 rvsd             rvsd             18320   active
rvsd(vsd): quorum= 7, active=0, state=idle, isolation=member,
           NoNodes=5, lastProtocol=nodes_failing,
           adapter_recovery=on, adapter_status=up,
           RefreshProtocol has never been issued from this node,
           Running function level 3.2.0.0.
The hc.vsd query command output looks like this:
Subsystem         Group            PID     Status
 hc.hc            rvsd             20440   active
hc(hc): active=0, state=waiting for client to connect
        PING_DELAY=600
        CLIENT_PATH=/tmp/serv
        SCRIPT_PATH=/usr/lpp/csd/bin
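To check every node at once rather than one node at a time, you can run either query under dsh. This is only a sketch; it assumes dsh is configured for all nodes in the partition and that the commands are in /usr/lpp/csd/bin:

   # Run the query on every node and flag any node whose rvsd subsystem
   # reports active=0 or state=idle.
   dsh -a '/usr/lpp/csd/bin/ha.vsd query' | egrep 'active=0|state=idle'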
You should have disabled or removed all user-provided scripts that issue change-of-state commands to IBM Virtual Shared Disks.
Monitor the activity of the rvsd subsystem with the IBM Virtual Shared Disk Perspective so that you become aware of potential problems as soon as they arise. You can begin a monitoring session, leave it running in a window on your workstation, and go about other work while the monitoring continues.
Each of the recovery scenarios is organized as follows: what happened, how the problem was detected, which components are affected, and the recovery steps.
A node has either hung or failed.
Node failure was detected by the Group Services program, and it has notified the rvsd subsystem. An operator may have seen the change in state displayed by the IBM Virtual Shared Disk Perspective.
The affected components can include everything that accesses the IBM Virtual Shared Disks: the IBM Virtual Shared Disks themselves, software applications, and the rvsd and related subsystems running on the failed nodes.
The recovery steps are:
Some of the following actions must be done manually; others are done automatically by the rvsd subsystem. You should have procedures in place to instruct operators when and how to do the manual operations.
The sequence of events triggered by a switch failure depends on whether adapter recovery is enabled.
The rvsd subsystem initiates recovery on the failing nodes. Applications running on these nodes might generate errors, and error reports might be generated as well.
Node failure was detected by the rvsd subsystem. An operator may have seen the change in state displayed by the IBM Virtual Shared Disk Perspective.
The affected components can include everything that accesses the IBM Virtual Shared Disks: the IBM Virtual Shared Disks themselves, software applications, and the rvsd and related subsystems running on the failed nodes.
Follow problem determination procedures for the switch failure. See Diagnosing SP Switch problems and Diagnosing SP Switch2 problems.
Remote IBM Virtual Shared Disk I/O requests hang and then fail after about 15 minutes. The IBM Virtual Shared Disk clients see a timeout error.
The rvsd subsystem neither detects nor handles switch failure when adapter recovery is disabled or when an unsupported adapter is used. An operator using the SP or IBM Virtual Shared Disk Perspective might recognize that switch_responds for the node is off.
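The same information is available from the SDR on the control workstation. This sketch assumes the standard SDRGetObjects command and the attribute names of the switch_responds class:

   # Show, for each node, whether the switch adapter is responding (1) or not (0).
   SDRGetObjects switch_responds node_number switch_responds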
Applications using IBM Virtual Shared Disks will hang and then fail after about 15 minutes.
An operator issues the Estart or Eunfence command to restart the switch.
If Estart or Eunfence fails, use standard diagnostic methods for handling switch problems. See Diagnosing SP Switch problems and Diagnosing SP Switch2 problems.
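For reference, a minimal command sequence for this step might look like the following. Both commands are normally run on the control workstation, and the node numbers given to Eunfence here are hypothetical:

   # Restart the switch for the system partition.
   Estart
   # Or, if specific nodes were fenced from the switch, unfence them
   # (hypothetical node numbers 5 and 6).
   Eunfence 5 6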
You can use the Problem Management interface to run a script that stops the rvsd subsystem when the switch fails and restarts it when the switch is active again, as in the sketch that follows.
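One possible shape for such a script is sketched below. The ha.vsd stop and ha.vsd start commands stop and restart the rvsd subsystem; the script name, the up/down argument convention, and the idea of driving it from a Problem Management event subscription are assumptions to be adapted to your configuration:

   #!/bin/ksh
   # switch_rvsd.ksh (hypothetical) - invoked by a Problem Management event:
   #   "down" when the switch fails, "up" when the switch is active again.
   PATH=$PATH:/usr/lpp/csd/bin
   case "$1" in
     down) ha.vsd stop  ;;   # stop the rvsd subsystem on switch failure
     up)   ha.vsd start ;;   # restart it once the switch is back
     *)    echo "usage: $0 up|down" >&2; exit 1 ;;
   esac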
There is no automatic method to determine that the daemons are failing.
For more information, see the book RS/6000 Cluster Technology: Group Services Programming Guide and Reference.
I/O errors occur on some or all of the IBM Virtual Shared Disks served from a node.
There is no automatic method of determining that volume group failures are occurring. An entry is posted in the System Error Log, and a hardware error code (EIO) is returned to the IBM Virtual Shared Disk device driver.
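You can review these entries with the standard AIX errpt command on the server node; this is only an illustrative sketch:

   # Summarize recent hardware entries in the System Error Log.
   errpt -d H | more
   # Show the detailed report for those entries.
   errpt -a -d H | more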
The IBM Virtual Shared Disk subsystem cannot access the data on the failed volume groups.
If EIO recovery has taken place on this volume group in approximately the last seven minutes, the IBM Virtual Shared Disks in this volume group are placed in the stopped state.
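After the underlying disk problem is corrected, an operator must bring the stopped disks back. A minimal sketch, assuming the standard csd commands and a hypothetical disk name vsd1n1:

   # Make a stopped virtual shared disk available again:
   # preparevsd moves it from stopped to suspended, resumevsd to active.
   /usr/lpp/csd/bin/preparevsd vsd1n1
   /usr/lpp/csd/bin/resumevsd vsd1n1
   # startvsd combines both steps:
   # /usr/lpp/csd/bin/startvsd vsd1n1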