Diagnosis Guide

Actions

Action 1. Produce a system dump

When your nodes do not respond or when your system crashes, a system dump may help you determine the cause of the problem. A system dump contains a copy of the kernel data on the system at the time of the crash. This section explains how to produce a dump, verify it, copy the dump to tape, and send the tape to IBM.

In some cases the system produces a dump automatically. If the system senses a fatal condition, it usually dumps automatically to the primary dump device and shows a flashing 888 in the node's three-digit display.

Attention
Do not initiate a system dump if the node's three-digit display is 888. If you initiate a dump, you will overwrite the dump that was taken at the time of the problem. Instead, proceed to Action 2. Verify the system dump.

Dump methods

There are several ways you can produce a system dump. Some of the methods work with all configurations, and others do not. Each method explained here includes this configuration information.

Notes:

Graphical interface users can use the SP Hardware Perspective to operate the system controls. You can reset the node or put it in service mode either from the Nodes Status page of the Node Notebook, or the Actions menu.
Command interface users can use the spmon command to operate the system controls.
A node can be reset using the command:
```
/usr/lpp/ssp/bin/spmon -reset
```
On systems that do not have a key mode switch, the spmon -reset command produces a dump.
A node's key mode switch can be altered using the command:
```
/usr/lpp/ssp/bin/spmon -key state
```
where state is either normal, secure, or service.

Dump to the primary dump device

Choose one of these methods to produce a dump on the primary dump device.

Method 1

This method works for all systems that have a key switch.

Set the key mode switch to the Service position and press the Reset button once.

Method 2

This method can only be done from a directly-attached keyboard. It cannot be done from a tty connection. This method works only on the control workstation.

Set the key mode switch to the Service position and, while holding the <Ctrl> and <Alt> keys, press the 1 on the numeric key pad.

Method 3

This method works for all system configurations, if the system is responding to commands.

sysdumpstart -p

On the SP system, when sysdumpstart is issued on a PCI node, the SPLED display in SP Perspectives indicates stby. After stby appears, power the node off and then back on. DO NOT RESET THE NODE. A reset loses the dump that was just taken.

Method 4

This method works for nodes with virtual keys. It produces a dump to the default dump device, as defined by AIX. Issue the following commands from the control workstation, specifying the node's frame and slot.

hmcmds service frame:slot
hmcmds reset frame:slot
Wait for the 0c2 status code to change to 0c0
hmcmds normal frame:slot
hmcmds off frame:slot
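
The Method 4 sequence can be sketched as a dry-run checklist. This is a minimal sketch, not part of PSSP: virtual_key_dump_steps is a hypothetical helper that only prints the commands listed above for a given frame and slot, so they can be reviewed before being entered by hand on the control workstation.

```shell
# Hypothetical helper (not part of PSSP): print the Method 4 steps
# for one node. Nothing is executed; the output is a checklist only.
virtual_key_dump_steps() {
    target="$1:$2"    # frame:slot, as used by hmcmds in the text above
    echo "hmcmds service $target"
    echo "hmcmds reset $target"
    echo "# wait for the 0c2 status code to change to 0c0"
    echo "hmcmds normal $target"
    echo "hmcmds off $target"
}
```

For example, virtual_key_dump_steps 1 5 prints the steps for the node in frame 1, slot 5. The wait step is left as a manual check because the text does not name a command for polling the status code.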

Dump to the secondary dump device

Choose one of these methods to produce a dump on the secondary dump device.

Note
If the secondary dump device is a removable media device, such as a tape or diskette drive, make sure that the medium is in the device.

Method 5

This method can only be done from a directly-attached keyboard. It cannot be done from a tty connection. This method works only on the control workstation.

Set the key mode switch to the Service position and, while holding down the <Ctrl> and <Alt> keys, press the 2 on the numeric key pad.

Method 6

This method works for all system configurations, if the system is responding to commands.

sysdumpstart -s

Action 2. Verify the system dump

You may have a system dump because you initiated it yourself or because the system produced one automatically. In either case, follow these steps to verify that the system dump was successful and that the information it contains is usable.

Record the three-digit codes.

- If the system dumped automatically, the three-digit display will show flashing 888. Press Reset repeatedly until 888 displays again and write down each three-digit code that is displayed. The last code before 888 displays again indicates whether the dump was successful. Check the dump code status in the next table for more information.
- If you initiated the dump yourself, the three-digit code that is displayed indicates whether the dump was successful. Check the dump code status in the next table for more information.

Table 6. System dump status codes

Three-digit code	Meaning
0c0	The dump completed successfully.
0c1	An I/O error occurred while taking the dump.
0c2	A user-initiated dump is in progress.
0c4	The dump device was too small but the dump may still be usable. If zero bytes are written and 0c4 is displayed, it means the dump device was large enough but the system was hung and not able to initiate a dump.
0c5	An internal error occurred while taking the dump.
0c6	Prompts you to make the secondary dump device available.
0c7	The dump facility is waiting for a response from the NFS (Network File System) server.
0c8	No dump device is defined.
0c9	A system-initiated dump is in progress.
0cc	The dump facility has switched to the secondary dump device.
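
When recording several codes, a small lookup can save trips back to Table 6. The following is a minimal sketch: dump_code_meaning is a hypothetical helper, not an AIX command, and its messages are abbreviated from the table above.

```shell
# Hypothetical helper (not part of AIX): map a three-digit dump
# status code from Table 6 to its meaning.
dump_code_meaning() {
    case "$1" in
        0c0) echo "The dump completed successfully." ;;
        0c1) echo "An I/O error occurred while taking the dump." ;;
        0c2) echo "A user-initiated dump is in progress." ;;
        0c4) echo "The dump device was too small, but the dump may still be usable." ;;
        0c5) echo "An internal error occurred while taking the dump." ;;
        0c6) echo "Make the secondary dump device available." ;;
        0c7) echo "The dump facility is waiting for a response from NFS." ;;
        0c8) echo "No dump device is defined." ;;
        0c9) echo "A system-initiated dump is in progress." ;;
        0cc) echo "The dump facility has switched to the secondary dump device." ;;
        *)   echo "Unknown dump status code: $1" >&2; return 1 ;;
    esac
}
```

For example, dump_code_meaning 0c4 prints the 0c4 entry, and any code not listed in the table returns a nonzero exit status.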

On Micro Channel Nodes, change the key mode switch to "normal", power off the node and power it back on. On PCI nodes, which do not have a key mode switch, power off the node and power it back on. This will allow the last error log entry stored in NVRAM to be placed in the error log. See Effect of not having a battery on error logging.
Note: DO NOT press the Reset button, because this will cause the current dump information to be overwritten.
Log in as root.
Verify the dump device by entering:
```
sysdumpdev
```
This should return something like:
```
   primary            /dev/hd7
   secondary          /dev/sysdumpnull
```
Note the primary dump device name, and substitute it for /dev/hd# in the following steps.
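To avoid retyping the device name, it can be pulled out of the sysdumpdev output. This is a minimal sketch that assumes the two-column layout shown above; primary_dump_device is a hypothetical helper, not an AIX command.

```shell
# Hypothetical helper (not part of AIX): read sysdumpdev output on
# standard input and print the device on the "primary" line.
primary_dump_device() {
    awk '$1 == "primary" { print $2; exit }'
}
```

A possible use is DUMPDEV=$(sysdumpdev | primary_dump_device), after which $DUMPDEV can stand in for /dev/hd# in the later commands.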
Verify the dump by entering:
```
sysdumpdev -L
```
Output is similar to:
```
0453-039
 
Device name:         /dev/lv00
Major device number: 10
Minor device number: 10
Size:                67108352 bytes
Date/Time:           Wed Apr  5 14:52:35 EDT 2000
Dump status:         -2
dump device too small
```
In this case, a 0c4 LCD code was among the crash codes, and the dump device was too small. There may be enough dump information, since 67108352 bytes were written to the dump device /dev/lv00. Continue with the next step. If no bytes were written, the system was hung and no dump exists. DO NOT continue with these steps.
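The byte-count check described above can be sketched as follows. The helpers are hypothetical, not AIX commands, and they assume the Size: line has the format shown in the sample output.

```shell
# Hypothetical helpers (not part of AIX): both read "sysdumpdev -L"
# output on standard input.

# Print the byte count from the "Size:" line.
dump_bytes_written() {
    awk '$1 == "Size:" { print $2; exit }'
}

# Succeed only if at least one byte was written to the dump device.
dump_has_data() {
    bytes=$(dump_bytes_written)
    [ -n "$bytes" ] && [ "$bytes" -gt 0 ]
}
```

For example, sysdumpdev -L | dump_has_data returns a zero exit status when some dump data exists. A nonzero status corresponds to the case where no bytes were written and the remaining steps should be skipped.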
Verify the usability of the dump by entering:
```
   crash /dev/hd#
```
This should return:
```
   Using /unix as the default namelist file.
   Reading in symbols........................
```
- If you get the message: "ATTENTION: dumpfile does not appear to match namelist", either the dump did not take place or the /unix file does not match the dump that was in the dump device.
  The dump file is not useful. Enter q to quit the crash command. DO NOT continue with these steps and do not send the dump to IBM.
- If messages are not displayed, proceed with the next step.
- Enter q to quit the crash command.
Enter the errdead command to extract the error records from the /dev/error buffer and place them in the error log:
```
/usr/lib/errdead /dev/hd#
```
Issue the crash command again:
```
crash /dev/lvxx
```
where /dev/lvxx is the dump device.
When you see the > prompt, enter:
```
   stat
```
Output is similar to:
```
   sysname:   AIX
   nodename:  journey
   release:   2
   version:   3
   machine:   000052643100
   time of crash:  Sun Jan 24 19:18:53 1993
   age of system:  18 day, 1 hr., 29 min.
```
- If the time of crash in the output approximately matches the time the system crashed, the dump is sufficient for analysis. Continue with the next step.
- If the time of crash in the output does not approximately match the time of the system crash, the data is not useful. Enter q to quit the crash command. Do not continue with these steps and do not send the dump to IBM.
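If the stat output has been saved to a file, the crash time can be extracted for comparison with your own record of the failure. This is a minimal sketch: crash_time is a hypothetical helper, and it assumes the "time of crash:" label appears exactly as in the sample output.

```shell
# Hypothetical helper (not part of AIX): read saved output of the
# crash "stat" subcommand on standard input and print only the
# timestamp from the "time of crash:" line.
crash_time() {
    sed -n 's/^[[:space:]]*time of crash:[[:space:]]*//p'
}
```

For the sample output above, crash_time prints Sun Jan 24 19:18:53 1993. Judging whether that approximately matches the actual failure is still left to the operator.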
Now put the dump symptom string information into the error log. Issue the command:
```
symptom -e 
```
This copies the symptom string into the error log. The symptom string can then be used to search problem databases for duplicate problems.

Enter:

   trace

Look for a trace report similar to this sample:

   STACK TRACE:
      .m_freem ()
      .soreceive
      ._recv
      .recv

Enter q to quit the crash command.

Gather the dump and other snap or log information for the IBM Support Center. Contact your local service representative or call the IBM Support Center to open a Problem Management Record as explained in How to contact the IBM Support Center.
