
The recovery subsystems

The IBM Recoverable Virtual Shared Disk recovery subsystems, rvsd and hc, respond to changes in the status of the system by running recovery and notifying client applications. The subsystems operate as daemons named rvsdd and hcd. They use the utilities of the Group Services subsystem. For more information about Group Services, see RS/6000 Cluster Technology: Group Services Programming Guide and Reference.

The following sections describe the functions of the rvsd and hc subsystems.

The rvsd subsystem

The rvsd subsystem controls recovery for the IBM Recoverable Virtual Shared Disk component of PSSP. It invokes the recovery scripts whenever there is a change in the group membership. The ha.vsd command controls the rvsd subsystem. When a node goes down or a disk adapter or cable fails, the rvsd subsystem notifies all surviving processes in the remaining virtual shared disk nodes, so they can begin recovery. If a node fails, recovery involves switching the ownership of a twin-tailed disk to the secondary node. If a disk adapter or cable fails, recovery involves switching the server node for a volume group to the secondary node. When the failed component comes back up, recovery involves switching disk or volume group ownership back to the primary node.
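
The ha.vsd command is the administrative interface to the rvsd subsystem. As a minimal sketch (the full list of subcommands and their exact behavior is in the book PSSP: Command and Technical Reference), typical operator actions on a virtual shared disk node look like this:

ha.vsd query      # display the current state of the rvsd subsystem
ha.vsd stop       # stop the rvsd subsystem; recovery is unavailable while it is stopped
ha.vsd start      # restart the rvsd subsystem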

Communication adapter (css) failures are treated in the same manner as node failures. Recovery for twin-tailed volume groups consists of switching to the secondary server.

The primary node is a node that is physically connected to a set of virtual shared disks and will always manage them if it is active. The secondary node is a node that is physically connected to a set of virtual shared disks and will manage them only if the primary node becomes inactive. Primary and secondary nodes are defined with the createvsd and createhsd commands. For IBM Concurrent Virtual Shared Disks, there is no concept of primary and secondary nodes. Instead, there is simply a list of servers that can concurrently access the disks.

A client is a node that has access to virtual shared disks but is not physically connected to them. Some nodes in a system partition might not have any access to virtual shared disks defined on that system and are neither servers nor clients.

The rvsd subsystem uses the notion of quorum, the majority of the virtual shared disk nodes, to cope with communication failures. If the nodes in a system partition are divided by a network failure, so that the nodes in one group cannot communicate with the nodes in the other group, the rvsd subsystem uses the quorum to decide which group continues operating and which group is deactivated. In previous releases of the SP system, quorum was defined as a majority of all the nodes in a system partition. As of IBM Recoverable Virtual Shared Disk Version 1 Release 2, quorum is based on the nodes that have been defined as virtual shared disk nodes. You can check the current quorum value with the ha.vsd query command on a node-by-node basis. You can override the system default value for the quorum with the ha.vsd quorum command.
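
For example, assuming the rvsd subsystem is running on the node, the following commands display the current quorum value and then override it (the value 5 is chosen here only for illustration):

ha.vsd query      # the quorum= field in the output shows the current quorum value
ha.vsd quorum 5   # override the quorum, here requiring 5 virtual shared disk nodes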

Table 4 shows how the daemons in the nodes in a system partition react as inactive nodes come back up and as active nodes fail. The table shows the changes that affect three of the nodes in a system that has more than three nodes.

Table 4. Recovery Actions When Nodes Fail

Node1, primary

Node is active (node1 comes up):
  • Daemons running on the other nodes in the system partition accept node1 into the group.
  • All virtual shared disks on node1 become active.
  • All clients designate node1 as the manager for the node1 virtual shared disks and designate those virtual shared disks as active.

Recovery scenario (node1 is inactive):
  • All virtual shared disks defined on node1 are put into suspended state on all clients.
  • Daemons running on the other nodes in the system partition remove node1 from the group.
  • Node2, the secondary node, takes over the management of node1's virtual shared disks.
  • All clients designate node2 as the server for node1's virtual shared disks and put those virtual shared disks into active state.

Node2, secondary to node1

Node is active (node2 comes up):
  • Daemons running on the other nodes in the system partition accept node2 into the group.
  • If node1 is active, there is no change to the status of the node1 virtual shared disks.
  • If node1 is inactive, node2 takes over the management of the node1 virtual shared disks.

Recovery scenario (node2 is inactive):
  • If node1 is active, there is no change to the status of its virtual shared disks.
  • If node1 is inactive, the node1 virtual shared disks are put into stopped state on all clients. They remain in stopped state until node1 or node2 comes back up.
  • Daemons running on the other nodes in the system partition remove node2 from the group.

Node3, client

Node is active (node3 comes up):
  • Daemons running on the other nodes in the system partition accept node3 into the group.
  • All virtual shared disks on active nodes for which node3 is a client are put into active state from node3's point of view.

Recovery scenario (node3 is inactive):
  • All virtual shared disks defined on node3 are put into stopped state from node3's point of view.
  • Daemons running on the other nodes in the system partition remove node3 from the group.
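
To see which state (active, suspended, or stopped) the virtual shared disks are in from a particular node's point of view, you can list them on that node. This is a sketch only; the lsvsd command and its output format are described in the book PSSP: Command and Technical Reference:

lsvsd -l          # long listing of the virtual shared disks configured on this node,
                  # including the current state and server of each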

Disk cable and disk adapter failures

Hardware interruptions at the disk are known as EIO errors. The IBM Recoverable Virtual Shared Disk component can perform recovery by volume group from some kinds of EIO errors, for example, disk cable and disk adapter failures. These failures can affect some of the volume groups defined on a node without affecting other volume groups.

The IBM Recoverable Virtual Shared Disk component switches the server function from the primary node to the secondary node for the failed volume groups on the node, without changing the server for those volume groups that have not failed. This involves switching the primary and secondary fields in the VSD Global Volume Group Information that is stored in the SDR, failing over the failed volume groups to the newly defined primary server, and retrying the I/O request that previously failed.

The attempt to switch the servers only takes place if there has not been an EIO error on the volume group within approximately the last seven minutes. If EIO recovery fails, the related virtual shared disks are placed in the stopped state.

You can tell if an EIO recovery has happened by using the vsdatalst -g command and looking at the recovery field in the results. If it contains a value other than zero, recovery has taken place at some point.
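
For example (the exact output columns can vary with the PSSP level, so check the vsdatalst reference page for the layout):

vsdatalst -g      # list the VSD global volume group data; a nonzero value in the
                  # recovery field means EIO recovery has taken place for that volume group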

Note:
If sporadic EIO errors are occurring, EIO recovery will keep switching the server for that volume group. EIO recovery does not apply to concurrent virtual shared disks. For these disks, an EIO error simply causes the clients to stop accessing the disks through the node that received the EIO error. Occasionally, an I/O request is sent through the failing node to see whether the EIO was only a temporary error.

When the cause of the failure has been located and repaired, you can use the vsdchgserver command to restore the server function to the original setting; this does not happen automatically. The -o flag of vsdchgserver enables or disables EIO recovery: specify 1 to enable it or 0 to disable it. For example, to reset the primary server to node 3 and the secondary server to node 4 for global volume group sys2vg, with EIO recovery enabled, enter:

vsdchgserver -g sys2vg -p 3 -b 4 -o 1
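
Conversely, to keep node 3 as the primary server and node 4 as the secondary server for sys2vg but disable EIO recovery, enter:

vsdchgserver -g sys2vg -p 3 -b 4 -o 0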

See the vsdchgserver reference page in the book PSSP: Command and Technical Reference for the command syntax.

Communication adapter failures

Communication adapter failure is supported for the SP switch. When communication adapter recovery is enabled, an adapter failure is promoted to a recoverable virtual shared disk node failure so that twin-tailed volume groups can fail over to the secondary server. For concurrent volume groups, this provides access to the disks through another server.

Only twin-tailed and concurrent volume groups on nodes connected by the SP switch will recover from adapter failure.

Communication adapter recovery is enabled by default but can be disabled by issuing the ha.vsd adapter_recovery off command. Use this command when you have supported communication adapters and do not want their failures promoted to node failures.
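
For example, to disable adapter recovery and later restore the default behavior (both forms are sketched here; see the ha.vsd reference page for the full syntax):

ha.vsd adapter_recovery off   # adapter failures are no longer promoted to node failures
ha.vsd adapter_recovery on    # re-enable promotion of adapter failures to node failures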

Issue ha.vsd query to determine whether adapter recovery is enabled or disabled. The output will be similar to the following example, where adapter_recovery can be on or off, and adapter_status can be up, down, or unknown.

Subsystem         Group            PID     Status
rvsd             rvsd             13660   active
rvsd(vsd): quorum= 8, active=0, state=idle, isolation=member,
              NoNodes=2, lastProtocol=nodes_joining,
              adapter_recovery=on, adapter_status=down. 

The hc subsystem

The hc subsystem is also called the connection manager. It supports the development of recoverable applications. Chapter 9, Application programming considerations, provides more information on how to write recoverable applications.

The hc subsystem maintains a membership list of the nodes that are currently running hc processes and an incarnation number that is changed every time the membership list changes. The hc subsystem shadows the rvsd subsystem, recording the same changes in state and management of virtual shared disks that the rvsd subsystem records. The difference is that the hc subsystem only records these changes after the rvsd subsystem processes them, to assure that the rvsd recovery activities begin and complete before the recovery of the hc subsystem client applications takes place. This serialization helps ensure data integrity. It also explains why the hc subsystem cannot run on a node where the rvsd subsystem is not running.

You can use the fencevsd command to implement application recovery procedures that are independent of rvsd subsystem recovery. The application can ask the rvsd subsystem to fence nodes that have failing application instances from nodes where the application is running correctly. See Preserving data integrity during application recovery for more information.
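
As a sketch only (the virtual shared disk name db_vsd1 and the node number 5 are hypothetical; see the fencevsd and unfencevsd reference pages in the book PSSP: Command and Technical Reference for the exact syntax), an application recovery procedure might fence a node that has a failing application instance and later restore its access:

fencevsd -v db_vsd1 -n 5      # prevent node 5 from accessing virtual shared disk db_vsd1
unfencevsd -v db_vsd1 -n 5    # restore node 5's access after the application instance recovers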

Another important characteristic of the hc subsystem is that it waits for its client application to connect before joining the hc group. If the client application loses its connection, the hc subsystem runs the hc.deactivate script and leaves the group. This means that the membership list of the hc group corresponds to the list of nodes in which the client application is currently running.

