These procedures check for errors in the installation, configuration and operation of the IBM Virtual Shared Disk and IBM Recoverable Virtual Shared Disk subsystems of PSSP.
This test determines whether you have successfully installed the IBM Virtual Shared Disk software. It also tests whether you can successfully define, activate, read from, and write to an IBM Virtual Shared Disk. To run the test, follow these steps:

1. Create a small test virtual shared disk (this example uses hdisk0 in rootvg on node 1):

   /usr/lpp/csd/bin/createvsd -n 1/:hdisk0/ -g rootvg -s 8 -c 1 -v junk

2. Configure all defined virtual shared disks:

   /usr/lpp/csd/bin/cfgvsd -a

3. Run the verification test against the new disk:

   /usr/lpp/csd/bin/vsdvts vsd_name

4. Remove the test disk when you are done:

   /usr/lpp/csd/bin/removevsd -f -v vsd_name
The vsdvts function displays the commands it issues and indicates whether each one succeeded. If vsdvts fails, see Configuration test 1 - Check IBM Virtual Shared Disk nodes.
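If you run this verification repeatedly, the steps can be collected into a script. The following is a minimal sketch, assuming the same node 1/hdisk0/rootvg scratch disk as the createvsd example above; the disk name junk1n1 is an assumption based on createvsd's usual prefix-number-n-node naming convention, and should be confirmed with vsdatalst -v before running.

#!/bin/ksh
# Minimal sketch of the installation verification sequence above.
# Assumes node 1 with hdisk0 in rootvg is free for a scratch disk, and
# that createvsd names the result junk1n1 (confirm with vsdatalst -v).
VSD_BIN=/usr/lpp/csd/bin

$VSD_BIN/createvsd -n 1/:hdisk0/ -g rootvg -s 8 -c 1 -v junk || exit 1
$VSD_BIN/cfgvsd -a || exit 1

if $VSD_BIN/vsdvts junk1n1; then
    echo "vsdvts passed"
else
    echo "vsdvts failed: see Configuration test 1" >&2
fi

# Remove the scratch disk regardless of the outcome.
$VSD_BIN/removevsd -f -v junk1n1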
Use these tests to check that the IBM Virtual Shared Disk subsystem is installed and configured properly.
This test checks that the IBM Virtual Shared Disk nodes have been designated appropriately. Obtain the IBM Virtual Shared Disk node information by issuing this command on the control workstation:
/usr/lpp/csd/bin/vsdatalst -n
The output is similar to the following:
VSD Node Information

                                     Initial Maximum   VSD      rw        Buddy Buffer
 node              VSD    IP packet   cache   cache  request request minimum maximum size: #
number host_name adapter     size    buffers buffers  count   count    size    size  maxbufs
------ --------- ------- --------- ------- ------- ------- ------- ------- ------- -------
     1 mynode1   css0        61440      64    4096     256      48    4096  262144     130
     3 mynode2   css0        61440      64    4096     256      48    4096  262144     130
     5 mynode3   css0        61440      64    4096     256      48    4096  262144     130
     7 mynode4   css0        61440      64    4096     256      48    4096  262144     130
Good results are indicated if every node that you intend to use as an IBM Virtual Shared Disk node appears in the output, with the expected adapter (css0 in this example) and tuning values. In this case, proceed to Configuration test 2 - Check that all IBM Virtual Shared Disk nodes know about each other.

Error results are indicated if any of these criteria are not met.
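The node check can be automated. The following is a minimal sketch that compares the vsdatalst -n output against a list of intended IBM Virtual Shared Disk nodes; the file /etc/myvsd.nodes is hypothetical and would contain one node number per line.

#!/bin/ksh
# Minimal sketch: compare designated VSD nodes with an intended list.
# /etc/myvsd.nodes is a hypothetical site file, one node number per line.
EXPECTED=/etc/myvsd.nodes

# Keep only data rows (lines whose first field is a node number).
/usr/lpp/csd/bin/vsdatalst -n | awk '$1 ~ /^[0-9]+$/ {print $1}' | sort -n > /tmp/vsd.actual
sort -n "$EXPECTED" > /tmp/vsd.expected

if diff /tmp/vsd.expected /tmp/vsd.actual > /dev/null; then
    echo "All intended nodes are designated as VSD nodes"
else
    echo "Node designation mismatch:" >&2
    diff /tmp/vsd.expected /tmp/vsd.actual >&2
fi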
If the test indicates an error, use the updatevsdnode command to correct the node information. Some of these parameters require that the device driver be unloaded from the kernel and reloaded, so perform these steps (the sequence is collected into a single script after the list):

1. Stop the IBM Recoverable Virtual Shared Disk subsystem on all nodes:

   dsh /usr/lpp/csd/bin/ha.vsd stop

   This assumes that the WCOLL environment variable is set, and points to a file listing the nodes. See Step 1 of Before you begin.

2. Unconfigure all virtual shared disks:

   dsh /usr/lpp/csd/bin/ucfgvsd -a

3. Unload the device driver from the kernel:

   dsh /usr/lpp/csd/bin/ucfgvsd VSD0

4. Update the node information; for example:

   /usr/lpp/csd/bin/updatevsdnode -n ALL -a css0 -M 61440

5. Restart the subsystem, which reloads the device driver:

   dsh /usr/lpp/csd/bin/ha_vsd
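This is the same sequence as a minimal script, assuming WCOLL is set as noted in step 1:

#!/bin/ksh
# Minimal sketch of the reload sequence above. Assumes WCOLL points to a
# working collective file listing all VSD nodes; stops at the first failure.
set -e
VSD_BIN=/usr/lpp/csd/bin

dsh $VSD_BIN/ha.vsd stop        # stop the rvsd subsystem on all nodes
dsh $VSD_BIN/ucfgvsd -a         # unconfigure all virtual shared disks
dsh $VSD_BIN/ucfgvsd VSD0       # unload the device driver from the kernel
$VSD_BIN/updatevsdnode -n ALL -a css0 -M 61440   # example update from step 4
dsh $VSD_BIN/ha_vsd             # restart; reloads the driver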
This test checks that all IBM Virtual Shared Disk nodes know about each other. On the control workstation, issue this command:
dsh "/usr/lpp/csd/bin/lsvsd -i | wc -l"
Note that the quotes are important here; without them, the pipe to wc would run on the control workstation instead of on each node.
The output is similar to the following:
node1: 16
node3: 16
node8: 16
Good results are indicated if all of the numbers in the right-hand column are the same.
Error results are indicated if the numbers are not the same. This means that an IBM Virtual Shared Disk node has been added or deleted, and not all the nodes are aware of the change.
To correct the situation, perform these steps (a sketch that automates the comparison follows the list):

1. Refresh the IBM Recoverable Virtual Shared Disk subsystem:

   /usr/lpp/csd/bin/ha.vsd refresh

2. Reissue the command to verify that all nodes now report the same count:

   dsh "/usr/lpp/csd/bin/lsvsd -i | wc -l"
These tests check for errors in the operation of the IBM Virtual Shared Disk and IBM Recoverable Virtual Shared Disk subsystems. The Good results and Error results paragraphs explain how to interpret the test results. Start with Test 1, and proceed to the next test or perform other actions as indicated in the text of each test.
Use the IBM Virtual Shared Disk Perspective table view. By putting the Nodes pane in table view, you can keep an eye on the states while continuing with other activities. On the control workstation, open the IBM Virtual Shared Disk Perspective and put the Nodes pane into table view to display the virtual shared disk states (active, suspended, or stopped) and the IBM Recoverable Virtual Shared Disk subsystem state.
Good results are indicated if the IBM Recoverable Virtual Shared Disk subsystem is in an active state, and all IBM Virtual Shared Disks are in an active state.
Error results are indicated if any IBM Virtual Shared Disks are in a suspended or stopped state; this indicates a potential problem.
In all cases, proceed to Operational test 2 - Check the IBM Recoverable Virtual Shared Disk subsystem.
This test checks that the IBM Recoverable Virtual Shared Disk subsystem is started, and that the IBM Virtual Shared Disks have been activated. To run this test, issue this command on the control workstation:
dsh /usr/lpp/csd/bin/ha.vsd query
Good results are indicated by output similar to the following. Note in particular that active=1.
Subsystem         Group            PID     Status
 rvsd             rvsd             19324   active

 rvsd(vsd): quorum= 8, active=1, state=idle, isolation=member,
            NoNodes=12, lastProtocol=nodes_joining,
            adapter_recovery=on, adapter_status=up,
            RefreshProtocol has never been issued from this node,
            Running function level 3.1.1.0.
Error results are indicated by output similar to one of these two examples:
Subsystem         Group            PID     Status
 rvsd             rvsd                     inoperative
OR
Subsystem         Group            PID     Status
 rvsd             rvsd             2570    active

 rvsd(vsd): quorum= 8, active=0, state=idle, isolation=isolated,
            NoNodes=0, lastProtocol=idle,
            adapter_recovery=on, adapter_status=down,
            RefreshProtocol has never been issued from this node,
            Running function level 3.1.1.0.
If the test indicates an error, follow these instructions:

To find the nodes on which the subsystem is inoperative, issue this command on the control workstation:

dsh /usr/lpp/csd/bin/ha.vsd query | grep inoperative

You can also check whether the rvsd subsystem is waiting for Group Services on some nodes, by issuing this command on the control workstation:

dsh /usr/lpp/csd/bin/ha.vsd query | \
   grep "waiting for Group Services to connect"

On any node that these commands report, examine the AIX error log:

errpt -a | pg
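These checks can be combined into one pass over the query output. A minimal sketch, assuming WCOLL is set for dsh:

#!/bin/ksh
# Minimal sketch: query once, then report inoperative nodes and nodes
# still waiting for Group Services.
dsh /usr/lpp/csd/bin/ha.vsd query > /tmp/havsd.out 2>&1

grep inoperative /tmp/havsd.out
grep "waiting for Group Services to connect" /tmp/havsd.out

# For any node reported above, inspect its AIX error log with: errpt -a | pg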
In all cases, proceed to Operational test 3 - Check the IBM Virtual Shared Disk states using commands.
On the control workstation, check whether any IBM Virtual Shared Disks are in a suspended or stopped state with this command:
dsh /usr/lpp/csd/bin/lsvsd -l | grep -E "SUS|STP" | dshbak -c
Good results are indicated if no shared disks are in the suspended (SUS) or stopped (STP) state. In this case, proceed to Operational test 4 - Display the IBM Virtual Shared Disk device driver statistics.
Error results are similar to the following example:
HOSTS -------------------------------------------------------------------
c164n04.ppd.pok.ib
--------------------------------------------------------------------------
  7 STP       -1        0        0 vsd2      nocache        8
  8 STP       -1        0        0 vsd1      nocache        8
307 SUS       -1        0        0 gpfsvsd7  nocache     2148
If the test indicates an error, determine why the shared disks are not active (ACT). Virtual shared disks are normally in the suspended (SUS) state for only a short period of time, while recovery is taking place, and are not normally in the stopped (STP) state unless they have been explicitly stopped. Perform these steps on the affected nodes (a sketch combining the checks follows the list):

1. Check the error log for virtual shared disk entries:

   errpt -a -N vsdd,rvsd,rvsdd | pg

2. If problems with one or more volume groups are detected, pursue this as a hardware problem or a potential AIX Logical Volume Manager problem.
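The following minimal sketch, run on one affected node, lists the non-active disks from lsvsd -l and then shows the varied-on volume groups, so that a missing volume group points at LVM or the hardware rather than the virtual shared disk layer:

#!/bin/ksh
# Minimal sketch for one affected node. lsvsd -l fields: minor, state,
# server, lv_major, lv_minor, vsd-name, ...
echo "Virtual shared disks not in the ACT state:"
/usr/lpp/csd/bin/lsvsd -l | awk '$2 == "SUS" || $2 == "STP" {print $1, $2, $6}'

echo "Varied-on volume groups:"
lsvg -o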
On the control workstation, display the device driver statistics from within the IBM Virtual Shared Disk Perspective, or issue the statvsd command on a node to produce the same report.
Suspect an error if there is a large number of queued requests, or any timeouts. Good results are indicated by 0 timeouts, as in this example:
VSD driver (vsdd): IP/SMP interface: PSSP Ver:3 Rel: 1.1

      9 vsd parallelism
  61440 vsd max IP message size
      0 requests queued waiting for a request block
      0 requests queued waiting for a pbuf
      0 requests queued waiting for a cache block
      0 requests queued waiting for a buddy buffer
    0.0 average buddy buffer wait_queue size
      0 rejected requests
      0 rejected responses
      0 rejected merge timeout.
      0 requests rework
    958 indirect I/O
      0 64byte unaligned reads.
      0 timeouts
retries: 0 0 0 0 0 0 0 0 0
      0 total retries

Non-zero Sequence numbers
 node#  expected  outgoing  outcast?   Incarnation: 0
     1     12660         0
     2     51403         0
     4     19951        14

 8 Nodes Up with zero sequence numbers: 3 5 6 7 8 11 12 14
Error results are indicated by nonzero values in the timeout field, or by large numbers (in the thousands or hundreds of thousands) of queued requests.
If the test indicates an error, check the error log to determine the minor number of the IBM Virtual Shared Disks that timed out. Issue this command:

errpt -a -J VSD_INT_ER | pg

Use the minor number to determine the IBM Virtual Shared Disk name, by issuing this command:
lsvsd -l | grep minor_number
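Mapping every timed-out minor number back to a disk name can be scripted. This is a minimal sketch; the awk pattern that extracts the minor number from the error-log detail data is an assumption, and must be adjusted to match what errpt -a -J VSD_INT_ER actually shows on your system:

#!/bin/ksh
# Minimal sketch: map timed-out minor numbers to disk names. The awk
# pattern below is a guess at the error-log detail format; adjust it.
errpt -a -J VSD_INT_ER > /tmp/vsd_timeouts.out

for minor in $(awk '/[Mm]inor/ {print $NF}' /tmp/vsd_timeouts.out | sort -u); do
    /usr/lpp/csd/bin/lsvsd -l | awk -v m="$minor" '$1 == m'
done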
In all cases, proceed to Operational test 5 - Check the IBM Virtual Shared Disk server.
This test checks the IBM Virtual Shared Disk server to see if it is active and accessible. On the client node, issue:

lsvsd -l vsd_name

and refer to the server_list column to determine the node numbers of the servers. For a Concurrent Virtual Shared Disk (CVSD), repeat the following steps for each server.
The output of the lsvsd -l command is similar to the following:
minor state server lv_major lv_minor vsd-name option  size(MB) server_list
    5 ACT       13       42        1 Vsd1n13  nocache      128 13,14
Issue this command to correlate the server node numbers to host names:

vsdatalst -n

The output is similar to the following:
VSD Node Information

                                           Initial Maximum   VSD      rw        Buddy Buffer
 node                    VSD    IP packet   cache   cache  request request minimum maximum size: #
number host_name       adapter     size    buffers buffers  count   count    size    size  maxbufs
------ --------------- ------- --------- ------- ------- ------- ------- ------- ------- -------
     1 c164n01.ppd.pok css0        61440      64     256     256      48    4096  262144      66
     2 c164n02.ppd.pok css0        61440      64     256     256      48    4096  262144      66
    13 c164n13.ppd.pok css0        61440      64     256     256      48    4096  524288      18
    14 c164n14.ppd.pok css0        61440      64     256     256      48    4096  524288      18
    16 c164n16.ppd.pok css0        61440      64     256     256      48    4096  524288      18
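The correlation can also be scripted. This minimal sketch extracts the server_list for one disk and looks each node number up in the vsdatalst -n output; vsd_name is a placeholder:

#!/bin/ksh
# Minimal sketch: map the server node numbers of one virtual shared disk
# to host names. vsd_name is a placeholder for the disk being checked.
servers=$(lsvsd -l vsd_name | awk 'NR > 1 {print $NF}' | tr ',' ' ')

for n in $servers; do
    /usr/lpp/csd/bin/vsdatalst -n | awk -v n="$n" '$1 == n {print "node " $1 ": " $2}'
done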
Determine whether the IBM Virtual Shared Disk is local or remote. To verify that a remote server is reachable, issue this command:

ping host_name

where host_name is the server host name obtained in the previous step.
Good results are indicated if the ping is successful. Proceed to Operational test 6 - Check network options.
Error results are indicated if the ping fails. In this case, issue this command:
ifconfig VSD_adapter
to determine whether the IBM Virtual Shared Disk adapter (css0 in these examples) is active (UP). If the adapter is not active, pursue this as a switch or network problem.
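The reachability and adapter checks can be combined. A minimal sketch; the server host name is a placeholder taken from the sample vsdatalst output, and css0 is the adapter from these examples:

#!/bin/ksh
# Minimal sketch: ping the VSD server, and if it is unreachable, check
# whether the local VSD adapter is up. The host name is a placeholder.
server_host=c164n13.ppd.pok

if ping -c 1 $server_host > /dev/null 2>&1; then
    echo "$server_host is reachable"
else
    echo "$server_host is not reachable; checking the adapter" >&2
    if ifconfig css0 | grep -q UP; then
        echo "css0 is UP: pursue this as a network or routing problem"
    else
        echo "css0 is down: pursue this as a switch or adapter problem"
    fi
fi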
This test checks the network options that affect virtual shared disks. Issue this command:
/usr/sbin/no -a | grep -E "thewall|ipqmaxlen"
Good results are indicated by these recommended values:

thewall = 65536 or greater
ipqmaxlen = 1024 or greater
If the results are good, proceed to Operational test 7 - Check ability to read from the IBM Virtual Shared Disk.
Error results are indicated by values less than the recommended ones, which can adversely affect performance and cause timeouts. To change these network options, issue the commands:
/usr/sbin/no -o thewall=65536
and
/usr/sbin/no -o ipqmaxlen=1024
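A minimal sketch that checks both options and raises them only when they are below the recommended minimums (run as root; note that values set with no are not preserved across a reboot):

#!/bin/ksh
# Minimal sketch: enforce the recommended minimums from this test.
# "no -o option" prints "option = value", so the value is the third field.
wall=$(/usr/sbin/no -o thewall | awk '{print $3}')
qlen=$(/usr/sbin/no -o ipqmaxlen | awk '{print $3}')

[ "$wall" -lt 65536 ] && /usr/sbin/no -o thewall=65536
[ "$qlen" -lt 1024 ]  && /usr/sbin/no -o ipqmaxlen=1024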
This test determines whether you can read from the IBM Virtual Shared Disk, both locally and remotely. Issue this command both locally on the IBM Virtual Shared Disk server, and remotely on another node:
dd if=/dev/r{vsdname} of=/dev/null bs=4k count=1
Good results are indicated by the following output:
1+0 records in.
1+0 records out.
Error results are indicated if the command hangs; the hang can last up to 15 minutes.
If the IBM Virtual Shared Disk is accessible locally but not remotely, there may be a virtual shared disk sequence number problem or a routing problem. Proceed to Operational test 8 - Check that the client and server nodes have routes to each other to check the device driver routing tables.
If the virtual shared disk is not locally accessible, issue the same dd command against the underlying local logical volume. If the dd command fails, pursue the problem as a potential AIX Logical Volume Manager problem.
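Because a hung dd can take up to 15 minutes to fail, a scripted version of this test may prefer to flag a probable hang early. A minimal sketch, with vsdname as a placeholder:

#!/bin/ksh
# Minimal sketch: flag a probable hang after 60 seconds instead of waiting
# the full 15 minutes. vsdname is a placeholder for your disk name.
rm -f /tmp/vsd_read_ok
( dd if=/dev/rvsdname of=/dev/null bs=4k count=1 && touch /tmp/vsd_read_ok ) &

sleep 60
if [ -f /tmp/vsd_read_ok ]; then
    echo "read completed"
else
    echo "no result after 60 seconds: suspect a hang" >&2
fi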
On both the client node (where an IBM Virtual Shared Disk cannot be read) and on the server node for that disk, issue this command:
/usr/lpp/csd/bin/lsvsd -i
The output is similar to the following:
node  IP address
   1  9.114.68.129
   5  9.114.68.130
  12  [KLAPI 11]
Good results are indicated if the output is the same on both the client and server nodes.
Error results are indicated if the output differs between the client and server nodes. A typical problem is missing entries. In this case, issue the command:
/usr/lpp/csd/bin/ha.vsd refresh
This refreshes the node information on all nodes where the IBM Recoverable Virtual Shared Disk subsystem is active.
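Comparing the two node tables can be scripted from the control workstation. A minimal sketch; client_node and server_node are placeholders for the host names involved:

#!/bin/ksh
# Minimal sketch: compare the device-driver node tables of a client and
# its server. The cut strips the "node:" prefix that dsh adds to output.
dsh -w client_node /usr/lpp/csd/bin/lsvsd -i | cut -d: -f2- > /tmp/vsd_tbl.client
dsh -w server_node /usr/lpp/csd/bin/lsvsd -i | cut -d: -f2- > /tmp/vsd_tbl.server

if diff /tmp/vsd_tbl.client /tmp/vsd_tbl.server > /dev/null; then
    echo "Node tables match"
else
    echo "Node tables differ; issue ha.vsd refresh and recheck" >&2
    diff /tmp/vsd_tbl.client /tmp/vsd_tbl.server
fi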