The IBM Recoverable Virtual Shared Disk recovery subsystems, rvsd and hc, respond to changes in the status of the system by running recovery and notifying client applications. The subsystems operate as daemons named rvsdd and hcd. They use the utilities of the Group Services subsystem. For more information about Group Services, see RS/6000 Cluster Technology: Group Services Programming Guide and Reference.
The following sections describe the functions of the rvsd and hc subsystems.
The rvsd subsystem controls recovery for the IBM Recoverable Virtual Shared Disk component of PSSP. It invokes the recovery scripts whenever there is a change in the group membership. The ha.vsd command controls the rvsd subsystem. When a node goes down or a disk adapter or cable fails, the rvsd subsystem notifies all surviving processes in the remaining virtual shared disk nodes, so they can begin recovery. If a node fails, recovery involves switching the ownership of a twin-tailed disk to the secondary node. If a disk adapter or cable fails, recovery involves switching the server node for a volume group to the secondary node. When the failed component comes back up, recovery involves switching disk or volume group ownership back to the primary node.
Communication adapter (css) failures are treated in the same manner as node failures. Recovery for twin-tailed volume groups consists of switching to the secondary server.
The primary node is a node that is physically connected to a set of virtual shared disks and will always manage them if it is active. The secondary node is a node that is physically connected to a set of virtual shared disks and will manage them only if the primary node becomes inactive. Primary and secondary nodes are defined with the createvsd and createhsd commands. For IBM Concurrent Virtual Shared Disks, there is no concept of primary and secondary nodes. Instead, there is simply a list of servers that can concurrently access the disks.
A client is a node that has access to virtual shared disks but is not physically connected to them. Some nodes in a system partition might not have any access to virtual shared disks defined on that system and are neither servers nor clients.
The rvsd subsystem uses the notion of quorum, the majority of the virtual shared disk nodes, to cope with communication failures. If the nodes in a system partition are divided by a network failure, so that the nodes in one group cannot communicate with the nodes in the other group, the rvsd subsystem uses quorum to decide which group continues operating and which group is deactivated. In previous releases of the SP system, quorum was defined as a majority of all the nodes in a system partition. As of IBM Recoverable Virtual Shared Disk Version 1 Release 2, quorum is based on the nodes that have been defined as virtual shared disk nodes. You can check the current quorum value on a node-by-node basis with the ha.vsd query command. You can override the system default value for the quorum with the ha.vsd quorum command.
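As a concrete illustration, the default quorum value, a simple majority of the defined virtual shared disk nodes, can be computed as floor(n/2)+1. This formula is an assumption based on the usual meaning of "majority"; verify the actual value on a running system with ha.vsd query.

```shell
# Compute a simple majority for a set of virtual shared disk nodes.
# The floor(n/2)+1 formula is an assumption, not taken from the
# PSSP documentation; check the real value with: ha.vsd query
vsd_nodes=16                      # number of defined VSD nodes
quorum=$(( vsd_nodes / 2 + 1 ))   # smallest count that is a majority
echo "quorum=$quorum"             # majority of 16 nodes is 9
```

With an even node count, the majority formula breaks the tie in favor of the larger group, so exactly one side of a network split can ever hold quorum.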
Table 4 shows how the daemons in the nodes in a system partition react as inactive nodes come back up and as active nodes fail. The table shows the changes that affect three of the nodes in a system that has more than three nodes.
Table 4. Recovery Actions When Nodes Fail

| Nodes and Clients | Node Is Active | Recovery Scenario (Node Is Inactive) |
|---|---|---|
| Node1, primary | | |
| Node2, secondary to node1 | | |
| Node3, client | | |
Hardware interruptions at the disk are known as EIO errors. The IBM Recoverable Virtual Shared Disk component can perform recovery by volume group from some kinds of EIO errors, for example, disk cable and disk adapter failures. These failures can affect some of the volume groups defined on a node without affecting other volume groups.
The IBM Recoverable Virtual Shared Disk component switches the server function from the primary node to the secondary node for the failed volume groups on the node, without changing the server for those volume groups that have not failed. This involves switching the primary and secondary fields in the VSD Global Volume Group Information that is stored in the SDR. This also involves failing over the failed volume groups to the newly defined primary server and retrying the I/O request that previously failed.
The attempt to switch the servers only takes place if there has not been an EIO error on the volume group within approximately the last seven minutes. If EIO recovery fails, the related virtual shared disks are placed in the stopped state.
You can tell if an EIO recovery has happened by using the vsdatalst -g command and looking at the recovery field in the results. If it contains a value other than zero, recovery has taken place at some point.
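The recovery-field check can be scripted; the sketch below assumes an illustrative column layout for the vsdatalst -g output, which may differ from the real PSSP format. On a live system, pipe the actual command output instead of the here-document.

```shell
# Flag global volume groups whose "recovery" field is nonzero,
# meaning EIO recovery has taken place at some point.
# NOTE: this sample output is hypothetical; on a real SP system use
#   vsdatalst -g
# and adjust the field number to match the actual column layout.
vsdatalst_output=$(cat <<'EOF'
Global_group  Primary  Secondary  EIO_recovery  Recovery
sys1vg        1        2          1             0
sys2vg        3        4          1             2
EOF
)
echo "$vsdatalst_output" | awk 'NR > 1 && $5 != 0 {
    print $1 ": EIO recovery has taken place (recovery=" $5 ")"
}'
# prints: sys2vg: EIO recovery has taken place (recovery=2)
```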
When the cause of the failure has been located and repaired, you can use the vsdchgserver command to restore the server function to the original setting; this does not happen automatically. The -o flag of vsdchgserver enables (1) or disables (0) EIO recovery. For example, to reset the primary server to node 3 and the secondary server to node 4 for global volume group sys2vg, with EIO recovery enabled, enter:
vsdchgserver -g sys2vg -p 3 -b 4 -o 1
See the vsdchgserver reference page in PSSP: Command and Technical Reference for the command syntax.
Communication adapter failure is supported for the SP switch. When communication adapter recovery is enabled, an adapter failure is promoted to a recoverable virtual shared disk node failure so that twin-tailed volume groups can fail over to the secondary server. For concurrent volume groups, this provides access to the disks through another server.
Only twin-tailed and concurrent volume groups on nodes connected by the SP switch will recover from adapter failure.
Communication adapter recovery is enabled by default but can be disabled by issuing the ha.vsd adapter_recovery off command. Use this command when you have supported communication adapters and do not want their failures promoted to node failures.
Issue ha.vsd query to determine whether adapter recovery is enabled or disabled. The output will be similar to the following example, where adapter_recovery can be on or off, and adapter_status can be up, down, or unknown.
Subsystem         Group            PID     Status
 rvsd             rvsd             13660   active

rvsd(vsd): quorum= 8, active=0, state=idle, isolation=member,
 NoNodes=2, lastProtocol=nodes_joining,
 adapter_recovery=on, adapter_status=down.
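The adapter_recovery and adapter_status values can be pulled out of that status line with standard text tools. A minimal sketch, using a sample line like the one above (on a live system, substitute the output of ha.vsd query):

```shell
# Extract adapter_recovery and adapter_status from an rvsd status
# line. The sample line is illustrative; on a real system use:
#   status_line=$(ha.vsd query | grep 'adapter_recovery=')
status_line='rvsd(vsd): quorum= 8, active=0, state=idle, isolation=member, NoNodes=2, lastProtocol=nodes_joining, adapter_recovery=on, adapter_status=down.'

# sed -n with s///p prints only the captured value.
adapter_recovery=$(echo "$status_line" | sed -n 's/.*adapter_recovery=\([a-z]*\).*/\1/p')
adapter_status=$(echo "$status_line" | sed -n 's/.*adapter_status=\([a-z]*\).*/\1/p')
echo "recovery=$adapter_recovery status=$adapter_status"
# prints: recovery=on status=down
```

A down or unknown adapter_status with adapter_recovery=on is the condition under which an adapter failure would be promoted to a node failure.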
The hc subsystem is also called the connection manager. It supports the development of recoverable applications. Chapter 9, Application programming considerations, provides more information on how to write recoverable applications.
The hc subsystem maintains a membership list of the nodes that are currently running hc processes and an incarnation number that is changed every time the membership list changes. The hc subsystem shadows the rvsd subsystem, recording the same changes in state and management of virtual shared disks that the rvsd subsystem records. The difference is that the hc subsystem only records these changes after the rvsd subsystem processes them, to assure that the rvsd recovery activities begin and complete before the recovery of the hc subsystem client applications takes place. This serialization helps ensure data integrity. It also explains why the hc subsystem cannot run on a node where the rvsd subsystem is not running.
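The membership list and incarnation number behave like a simple versioned set: every change to the membership, whether a node joins or leaves, bumps the incarnation. A hypothetical sketch of that bookkeeping follows; the variable names and helper functions are illustrative only and are not part of the hc interface.

```shell
# Sketch of hc-style membership tracking: a list of member nodes
# plus an incarnation number that changes on every membership change.
# All names here are illustrative, not real hc internals.
members="1 2 3"
incarnation=0

join()  { members="$members $1"; incarnation=$((incarnation + 1)); }
leave() {
    # Remove node $1 from the list, then bump the incarnation.
    members=$(echo $members | tr ' ' '\n' | grep -v "^$1\$" | tr '\n' ' ')
    incarnation=$((incarnation + 1))
}

join 5      # node 5's client application connects to its hc daemon
leave 2     # node 2's client application loses its connection
echo "members=$members incarnation=$incarnation"
```

Because the incarnation changes on every membership event, a client application can compare incarnation numbers to detect that the group changed while it was busy, even if the final membership list looks the same.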
You can use the fencevsd command to implement application recovery procedures that are independent of rvsd subsystem recovery. The application can ask the rvsd subsystem to fence nodes that have failing application instances from nodes where the application is running correctly. See Preserving data integrity during application recovery for more information.
Another important characteristic of the hc subsystem is that it waits for its client application to connect before joining the hc group. If the client application loses its connection, the hc subsystem runs the hc.deactivate script and leaves the group. This means that the membership list of the hc group corresponds to the list of nodes in which the client application is currently running.