These procedures check for errors in the installation, configuration and operation of the IBM Virtual Shared Disk and IBM Recoverable Virtual Shared Disk subsystems of PSSP.
This test determines whether you have successfully installed the IBM Virtual Shared Disk software. It also tests whether you can successfully define, activate, read from, and write to an IBM Virtual Shared Disk. To run the test, follow these steps:

1. Create a small test virtual shared disk (this example uses hdisk0 in rootvg on node 1):

   /usr/lpp/csd/bin/createvsd -n 1/:hdisk0/ -g rootvg -s 8 -c 1 -v junk

2. Configure all defined virtual shared disks:

   /usr/lpp/csd/bin/cfgvsd -a

3. Run the verification test against the new disk:

   /usr/lpp/csd/bin/vsdvts vsd_name

4. Remove the test disk when you are done:

   /usr/lpp/csd/bin/removevsd -f -v vsd_name
The vsdvts function displays the commands it issues and indicates whether each one succeeded. If vsdvts fails, see Configuration test 1 - Check IBM Virtual Shared Disk nodes.
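If you run this verification repeatedly, the steps can be collected into a script. The following is a minimal sketch, assuming the same node 1/hdisk0/rootvg scratch disk as the createvsd example above; the disk name junk1n1 is an assumption based on createvsd's usual prefix-number-n-node naming convention, and should be confirmed with vsdatalst -v before running.

#!/bin/ksh
# Minimal sketch of the installation verification sequence above.
# Assumes node 1 with hdisk0 in rootvg is free for a scratch disk, and
# that createvsd names the result junk1n1 (confirm with vsdatalst -v).
VSD_BIN=/usr/lpp/csd/bin

$VSD_BIN/createvsd -n 1/:hdisk0/ -g rootvg -s 8 -c 1 -v junk || exit 1
$VSD_BIN/cfgvsd -a || exit 1

if $VSD_BIN/vsdvts junk1n1; then
    echo "vsdvts passed"
else
    echo "vsdvts failed: see Configuration test 1" >&2
fi

# Remove the scratch disk regardless of the outcome.
$VSD_BIN/removevsd -f -v junk1n1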
Use these tests to check that the IBM Virtual Shared Disk subsystem is installed and configured properly.
This test checks that the IBM Virtual Shared Disk nodes have been designated appropriately. Obtain the IBM Virtual Shared Disk node information by issuing this command on the control workstation:
/usr/lpp/csd/bin/vsdatalst -n
The output is similar to the following:
VSD Node Information

                                     Initial Maximum   VSD      rw        Buddy Buffer
 node              VSD    IP packet   cache   cache  request request minimum maximum size: #
number host_name adapter     size    buffers buffers  count   count    size    size  maxbufs
------ --------- ------- --------- ------- ------- ------- ------- ------- ------- -------
     1 mynode1   css0        61440      64    4096     256      48    4096  262144     130
     3 mynode2   css0        61440      64    4096     256      48    4096  262144     130
     5 mynode3   css0        61440      64    4096     256      48    4096  262144     130
     7 mynode4   css0        61440      64    4096     256      48    4096  262144     130
Good results are indicated if every node that you intend to use as an IBM Virtual Shared Disk node appears in the output, with the expected adapter (css0 in this example) and tuning values. In this case, proceed to Configuration test 2 - Check that all IBM Virtual Shared Disk nodes know about each other.

Error results are indicated if any of these criteria are not met.
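The node check can be automated. The following is a minimal sketch that compares the vsdatalst -n output against a list of intended IBM Virtual Shared Disk nodes; the file /etc/myvsd.nodes is hypothetical and would contain one node number per line.

#!/bin/ksh
# Minimal sketch: compare designated VSD nodes with an intended list.
# /etc/myvsd.nodes is a hypothetical site file, one node number per line.
EXPECTED=/etc/myvsd.nodes

# Keep only data rows (lines whose first field is a node number).
/usr/lpp/csd/bin/vsdatalst -n | awk '$1 ~ /^[0-9]+$/ {print $1}' | sort -n > /tmp/vsd.actual
sort -n "$EXPECTED" > /tmp/vsd.expected

if diff /tmp/vsd.expected /tmp/vsd.actual > /dev/null; then
    echo "All intended nodes are designated as VSD nodes"
else
    echo "Node designation mismatch:" >&2
    diff /tmp/vsd.expected /tmp/vsd.actual >&2
fi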
If the test indicates an error, use the updatevsdnode command to correct the node information. Some of these parameters require that the device driver be unloaded from the kernel and reloaded, so perform these steps (the sequence is collected into a single script after the list):

1. Stop the IBM Recoverable Virtual Shared Disk subsystem on all nodes:

   dsh /usr/lpp/csd/bin/ha.vsd stop

   This assumes that the WCOLL environment variable is set, and points to a file listing the nodes. See Step 1 of Before you begin.

2. Unconfigure all virtual shared disks:

   dsh /usr/lpp/csd/bin/ucfgvsd -a

3. Unload the device driver from the kernel:

   dsh /usr/lpp/csd/bin/ucfgvsd VSD0

4. Update the node information; for example:

   /usr/lpp/csd/bin/updatevsdnode -n ALL -a css0 -M 61440

5. Restart the subsystem, which reloads the device driver:

   dsh /usr/lpp/csd/bin/ha_vsd
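This is the same sequence as a minimal script, assuming WCOLL is set as noted in step 1:

#!/bin/ksh
# Minimal sketch of the reload sequence above. Assumes WCOLL points to a
# working collective file listing all VSD nodes; stops at the first failure.
set -e
VSD_BIN=/usr/lpp/csd/bin

dsh $VSD_BIN/ha.vsd stop        # stop the rvsd subsystem on all nodes
dsh $VSD_BIN/ucfgvsd -a         # unconfigure all virtual shared disks
dsh $VSD_BIN/ucfgvsd VSD0       # unload the device driver from the kernel
$VSD_BIN/updatevsdnode -n ALL -a css0 -M 61440   # example update from step 4
dsh $VSD_BIN/ha_vsd             # restart; reloads the driver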
This test checks that all IBM Virtual Shared Disk nodes know about each other. On the control workstation, issue this command:
dsh "/usr/lpp/csd/bin/lsvsd -i | wc -l"
Note that the quotes are important here; without them, the pipe to wc would run on the control workstation instead of on each node.
The output is similar to the following:
node1: 16
node3: 16
node8: 16
Good results are indicated if all of the numbers in the right-hand column are the same.
Error results are indicated if the numbers are not the same. This means that an IBM Virtual Shared Disk node has been added or deleted, and not all the nodes are aware of the change.
To correct the situation, perform these steps (a sketch that automates the comparison follows the list):

1. Refresh the IBM Recoverable Virtual Shared Disk subsystem:

   /usr/lpp/csd/bin/ha.vsd refresh

2. Reissue the command to verify that all nodes now report the same count:

   dsh "/usr/lpp/csd/bin/lsvsd -i | wc -l"
These tests check for errors in the operation of the IBM Virtual Shared Disk and IBM Recoverable Virtual Shared Disk subsystems. The Good results and Error results paragraphs explain how to interpret the test results. Start with Test 1, and proceed to the next test or perform other actions as indicated in the text of each test.
Use the IBM Virtual Shared Disk Perspective table view. By putting the Nodes pane in table view, you can keep an eye on the states while continuing with other activities. On the control workstation, open the IBM Virtual Shared Disk Perspective and put the Nodes pane into table view to display the virtual shared disk states (active, suspended, or stopped) and the IBM Recoverable Virtual Shared Disk subsystem state.
Good results are indicated if the IBM Recoverable Virtual Shared Disk subsystem is in an active state, and all IBM Virtual Shared Disks are in an active state.
Error results are indicated if any IBM Virtual Shared Disks are in a suspended or stopped state; this indicates a potential problem.
In all cases, proceed to Operational test 2 - Check the IBM Recoverable Virtual Shared Disk subsystem.
This test checks that the IBM Recoverable Virtual Shared Disk subsystem is started, and that the IBM Virtual Shared Disks have been activated. To run this test, issue this command on the control workstation:
dsh /usr/lpp/csd/bin/ha.vsd query
Good results are indicated by output similar to the following. Note in particular that active=1.
Subsystem         Group            PID     Status
 rvsd             rvsd             19324   active

 rvsd(vsd): quorum= 8, active=1, state=idle, isolation=member,
            NoNodes=12, lastProtocol=nodes_joining,
            adapter_recovery=on, adapter_status=up,
            RefreshProtocol has never been issued from this node,
            Running function level 3.1.1.0.
Error results are indicated by output similar to one of these two examples:
Subsystem         Group            PID     Status
 rvsd             rvsd                     inoperative
OR
Subsystem         Group            PID     Status
 rvsd             rvsd             2570    active

 rvsd(vsd): quorum= 8, active=0, state=idle, isolation=isolated,
            NoNodes=0, lastProtocol=idle,
            adapter_recovery=on, adapter_status=down,
            RefreshProtocol has never been issued from this node,
            Running function level 3.1.1.0.
If the test indicates an error, follow these instructions:

To find the nodes on which the subsystem is inoperative, issue this command on the control workstation:

dsh /usr/lpp/csd/bin/ha.vsd query | grep inoperative

You can also check whether the rvsd subsystem is waiting for Group Services on some nodes, by issuing this command on the control workstation:

dsh /usr/lpp/csd/bin/ha.vsd query | \
   grep "waiting for Group Services to connect"

On any node that these commands report, examine the AIX error log:

errpt -a | pg
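These checks can be combined into one pass over the query output. A minimal sketch, assuming WCOLL is set for dsh:

#!/bin/ksh
# Minimal sketch: query once, then report inoperative nodes and nodes
# still waiting for Group Services.
dsh /usr/lpp/csd/bin/ha.vsd query > /tmp/havsd.out 2>&1

grep inoperative /tmp/havsd.out
grep "waiting for Group Services to connect" /tmp/havsd.out

# For any node reported above, inspect its AIX error log with: errpt -a | pg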
In all cases, proceed to Operational test 3 - Check the IBM Virtual Shared Disk states using commands.
On the control workstation, check whether any IBM Virtual Shared Disks are in a suspended or stopped state with this command:
dsh /usr/lpp/csd/bin/lsvsd -l | grep -E "SUS|STP" | dshbak -c
Good results are indicated if no shared disks are in the suspended (SUS) or stopped (STP) state. In this case, proceed to Operational test 4 - Display the IBM Virtual Shared Disk device driver statistics.
Error results are similar to the following example:
HOSTS -------------------------------------------------------------------
c164n04.ppd.pok.ib
--------------------------------------------------------------------------
  7 STP       -1        0        0 vsd2      nocache        8
  8 STP       -1        0        0 vsd1      nocache        8
307 SUS       -1        0        0 gpfsvsd7  nocache     2148
If the test indicates an error, determine why the shared disks are not active (ACT). Virtual shared disks are normally in the suspended (SUS) state for only a short period of time, while recovery is taking place, and are not normally in the stopped (STP) state unless they have been explicitly stopped. Perform these steps on the affected nodes (a sketch combining the checks follows the list):

1. Check the error log for virtual shared disk entries:

   errpt -a -N vsdd,rvsd,rvsdd | pg

2. If problems with one or more volume groups are detected, pursue this as a hardware problem or a potential AIX Logical Volume Manager problem.
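The following minimal sketch, run on one affected node, lists the non-active disks from lsvsd -l and then shows the varied-on volume groups, so that a missing volume group points at LVM or the hardware rather than the virtual shared disk layer:

#!/bin/ksh
# Minimal sketch for one affected node. lsvsd -l fields: minor, state,
# server, lv_major, lv_minor, vsd-name, ...
echo "Virtual shared disks not in the ACT state:"
/usr/lpp/csd/bin/lsvsd -l | awk '$2 == "SUS" || $2 == "STP" {print $1, $2, $6}'

echo "Varied-on volume groups:"
lsvg -o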
On the control workstation, display the device driver statistics from within the IBM Virtual Shared Disk Perspective, or issue the statvsd command on a node to produce the same report.
Suspect an error if there is a large number of queued requests, or any timeouts. Good results are indicated by 0 timeouts, as in this example:
VSD driver (vsdd): IP/SMP interface: PSSP Ver:3 Rel: 1.1

      9 vsd parallelism
  61440 vsd max IP message size
      0 requests queued waiting for a request block
      0 requests queued waiting for a pbuf
      0 requests queued waiting for a cache block
      0 requests queued waiting for a buddy buffer
    0.0 average buddy buffer wait_queue size
      0 rejected requests
      0 rejected responses
      0 rejected merge timeout.
      0 requests rework
    958 indirect I/O
      0 64byte unaligned reads.
      0 timeouts
retries: 0 0 0 0 0 0 0 0 0
      0 total retries

Non-zero Sequence numbers
 node#  expected  outgoing  outcast?   Incarnation: 0
     1     12660         0
     2     51403         0
     4     19951        14

 8 Nodes Up with zero sequence numbers: 3 5 6 7 8 11 12 14
Error results are indicated by nonzero values in the timeout field, or by large numbers (in the thousands or hundreds of thousands) of queued requests.
If the test indicates an error, check the error log to determine the minor number of the IBM Virtual Shared Disks that timed out. Issue this command:

errpt -a -J VSD_INT_ER | pg

Use the minor number to determine the IBM Virtual Shared Disk name, by issuing this command:
lsvsd -l | grep minor_number
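Mapping every timed-out minor number back to a disk name can be scripted. This is a minimal sketch; the awk pattern that extracts the minor number from the error-log detail data is an assumption, and must be adjusted to match what errpt -a -J VSD_INT_ER actually shows on your system:

#!/bin/ksh
# Minimal sketch: map timed-out minor numbers to disk names. The awk
# pattern below is a guess at the error-log detail format; adjust it.
errpt -a -J VSD_INT_ER > /tmp/vsd_timeouts.out

for minor in $(awk '/[Mm]inor/ {print $NF}' /tmp/vsd_timeouts.out | sort -u); do
    /usr/lpp/csd/bin/lsvsd -l | awk -v m="$minor" '$1 == m'
done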
In all cases, proceed to Operational test 5 - Check the IBM Virtual Shared Disk server.
This test checks the IBM Virtual Shared Disk server to see if it is active and accessible. On the client node, issue:

lsvsd -l vsd_name

and refer to the server_list column to determine the node numbers of the servers. For a Concurrent Virtual Shared Disk (CVSD), repeat the following steps for each server.
The output of the lsvsd -l command is similar to the following:
minor state server lv_major lv_minor vsd-name option  size(MB) server_list
    5 ACT       13       42        1 Vsd1n13  nocache      128 13,14
Issue this command to correlate the server node numbers to host names:

vsdatalst -n

The output is similar to the following:
VSD Node Information

                                           Initial Maximum   VSD      rw        Buddy Buffer
 node                    VSD    IP packet   cache   cache  request request minimum maximum size: #
number host_name       adapter     size    buffers buffers  count   count    size    size  maxbufs
------ --------------- ------- --------- ------- ------- ------- ------- ------- ------- -------
     1 c164n01.ppd.pok css0        61440      64     256     256      48    4096  262144      66
     2 c164n02.ppd.pok css0        61440      64     256     256      48    4096  262144      66
    13 c164n13.ppd.pok css0        61440      64     256     256      48    4096  524288      18
    14 c164n14.ppd.pok css0        61440      64     256     256      48    4096  524288      18
    16 c164n16.ppd.pok css0        61440      64     256     256      48    4096  524288      18
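The correlation can also be scripted. This minimal sketch extracts the server_list for one disk and looks each node number up in the vsdatalst -n output; vsd_name is a placeholder:

#!/bin/ksh
# Minimal sketch: map the server node numbers of one virtual shared disk
# to host names. vsd_name is a placeholder for the disk being checked.
servers=$(lsvsd -l vsd_name | awk 'NR > 1 {print $NF}' | tr ',' ' ')

for n in $servers; do
    /usr/lpp/csd/bin/vsdatalst -n | awk -v n="$n" '$1 == n {print "node " $1 ": " $2}'
done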
Determine whether the IBM Virtual Shared Disk is local or remote. To verify that a remote server is reachable, issue this command:

ping host_name

where host_name is the server host name obtained in the previous step.
Good results are indicated if the ping is successful. Proceed to Operational test 6 - Check network options.
Error results are indicated if the ping fails. In this case, issue this command:
ifconfig VSD_adapter
to determine whether the IBM Virtual Shared Disk adapter (css0 in these examples) is active (UP). If the adapter is not active, pursue this as a switch or network problem.
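The reachability and adapter checks can be combined. A minimal sketch; the server host name is a placeholder taken from the sample vsdatalst output, and css0 is the adapter from these examples:

#!/bin/ksh
# Minimal sketch: ping the VSD server, and if it is unreachable, check
# whether the local VSD adapter is up. The host name is a placeholder.
server_host=c164n13.ppd.pok

if ping -c 1 $server_host > /dev/null 2>&1; then
    echo "$server_host is reachable"
else
    echo "$server_host is not reachable; checking the adapter" >&2
    if ifconfig css0 | grep -q UP; then
        echo "css0 is UP: pursue this as a network or routing problem"
    else
        echo "css0 is down: pursue this as a switch or adapter problem"
    fi
fi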
This test checks the network options that affect virtual shared disks. Issue this command:
/usr/sbin/no -a | grep -E "thewall|ipqmaxlen"
Good results are indicated by these recommended values:

thewall = 65536 or greater
ipqmaxlen = 1024 or greater
If the results are good, proceed to Operational test 7 - Check ability to read from the IBM Virtual Shared Disk.
Error results are indicated by values less than the recommended ones, which can adversely affect performance and cause timeouts. To change these network options, issue the commands:
/usr/sbin/no -o thewall=65536
and
/usr/sbin/no -o ipqmaxlen=1024
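A minimal sketch that checks both options and raises them only when they are below the recommended minimums (run as root; note that values set with no are not preserved across a reboot):

#!/bin/ksh
# Minimal sketch: enforce the recommended minimums from this test.
# "no -o option" prints "option = value", so the value is the third field.
wall=$(/usr/sbin/no -o thewall | awk '{print $3}')
qlen=$(/usr/sbin/no -o ipqmaxlen | awk '{print $3}')

[ "$wall" -lt 65536 ] && /usr/sbin/no -o thewall=65536
[ "$qlen" -lt 1024 ]  && /usr/sbin/no -o ipqmaxlen=1024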
This test determines whether you can read from the IBM Virtual Shared Disk, both locally and remotely. Issue this command both locally on the IBM Virtual Shared Disk server, and remotely on another node:
dd if=/dev/r{vsdname} of=/dev/null bs=4k count=1
Good results are indicated by the following output:
1+0 records in.
1+0 records out.
Error results are indicated if the command hangs; the hang can last up to 15 minutes.
If the IBM Virtual Shared Disk is accessible locally but not remotely, there may be a virtual shared disk sequence number problem or a routing problem. Proceed to Operational test 8 - Check that the client and server nodes have routes to each other to check the device driver routing tables.
If the virtual shared disk is not locally accessible, issue the same dd command against the underlying local logical volume. If the dd command fails, pursue the problem as a potential AIX Logical Volume Manager problem.
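Because a hung dd can take up to 15 minutes to fail, a scripted version of this test may prefer to flag a probable hang early. A minimal sketch, with vsdname as a placeholder:

#!/bin/ksh
# Minimal sketch: flag a probable hang after 60 seconds instead of waiting
# the full 15 minutes. vsdname is a placeholder for your disk name.
rm -f /tmp/vsd_read_ok
( dd if=/dev/rvsdname of=/dev/null bs=4k count=1 && touch /tmp/vsd_read_ok ) &

sleep 60
if [ -f /tmp/vsd_read_ok ]; then
    echo "read completed"
else
    echo "no result after 60 seconds: suspect a hang" >&2
fi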
On both the client node (where an IBM Virtual Shared Disk cannot be read) and on the server node for that disk, issue this command:
/usr/lpp/csd/bin/lsvsd -i
The output is similar to the following:
node  IP address
   1  9.114.68.129
   5  9.114.68.130
  12  [KLAPI 11]
Good results are indicated if the output is the same on both the client and server nodes.
Error results are indicated if the output differs between the client and server nodes. A typical problem is missing entries. In this case, issue the command:
/usr/lpp/csd/bin/ha.vsd refresh
This refreshes the node information on all nodes where the IBM Recoverable Virtual Shared Disk subsystem is active.
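Comparing the two node tables can be scripted from the control workstation. A minimal sketch; client_node and server_node are placeholders for the host names involved:

#!/bin/ksh
# Minimal sketch: compare the device-driver node tables of a client and
# its server. The cut strips the "node:" prefix that dsh adds to output.
dsh -w client_node /usr/lpp/csd/bin/lsvsd -i | cut -d: -f2- > /tmp/vsd_tbl.client
dsh -w server_node /usr/lpp/csd/bin/lsvsd -i | cut -d: -f2- > /tmp/vsd_tbl.server

if diff /tmp/vsd_tbl.client /tmp/vsd_tbl.server > /dev/null; then
    echo "Node tables match"
else
    echo "Node tables differ; issue ha.vsd refresh and recheck" >&2
    diff /tmp/vsd_tbl.client /tmp/vsd_tbl.server
fi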