IBM Books

Diagnosis Guide


Actions

Action 1. Produce a system dump

When your nodes do not respond or when your system crashes, a system dump may help you determine the cause of the problem. A system dump contains a copy of the kernel data on the system at the time of the crash. This section explains how to produce a dump, verify it, copy the dump to tape, and send the tape to IBM.

In some cases the system produces a dump automatically. If the system senses a fatal condition, it usually dumps automatically to the primary dump device and shows a flashing 888 in the node's three-digit display.

Attention

Do not initiate a system dump if the node's three-digit display is 888. If you initiate a dump, you will overwrite the dump that was taken at the time of the problem.

Instead, proceed to Action 2. Verify the system dump.

Dump methods

There are several ways to produce a system dump. Some methods work with all configurations, and others do not. The description of each method states which configurations it works with.

Notes:

  1. Graphical interface users can use the SP Hardware Perspective to operate the system controls. You can reset the node or put it in service mode from either the Nodes Status page of the Node Notebook or the Actions menu.

  2. Command interface users can use the spmon command to operate the system controls.

    A node can be reset using the command:

    /usr/lpp/ssp/bin/spmon -reset
    

    On systems that do not have a key mode switch, the spmon -reset command produces a dump.

    A node's key mode switch can be altered using the command:

    /usr/lpp/ssp/bin/spmon -key state
    

    where state is either normal, secure, or service.

Dump to the primary dump device

Choose one of these methods to produce a dump on the primary dump device.

Method 1

This method works for all systems that have a key mode switch.

Set the key mode switch to the Service position and press the Reset button once.

Method 2

This method can only be done from a directly-attached keyboard. It cannot be done from a tty connection. This method works only on the control workstation.

Set the key mode switch to the Service position and, while holding the <Ctrl> and <Alt> keys, press the 1 on the numeric key pad.

Method 3

This method works for all system configurations, if the system is responding to commands.

Log in as root and enter:

sysdumpstart -p

On the SP system, when sysdumpstart is issued on a PCI node, the SPLED display in SP Perspectives indicates stby. After stby appears, power the node off and then back on. DO NOT RESET THE NODE. A reset loses the dump that was taken.

Method 4

This method works for nodes with virtual keys. It produces a dump to the default dump device, as defined by AIX. Issue the following commands from the control workstation, specifying the node's frame and slot.

  1. hmcmds service frame:slot
  2. hmcmds reset frame:slot
  3. Wait for the 0c2 status code to change to 0c0
  4. hmcmds normal frame:slot
  5. hmcmds off frame:slot
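
The Method 4 sequence can be sketched as a small shell function that only prints the commands for a given frame and slot, so that the order can be reviewed before anything is run. The name method4_commands is hypothetical and is not part of PSSP. The sketch deliberately does not run hmcmds or read the node's display; waiting for 0c2 to change to 0c0 remains a manual step.

```shell
# Print (do not run) the Method 4 command sequence for one node.
# method4_commands is a hypothetical helper name, not a PSSP command.
method4_commands() {
    frame=$1
    slot=$2
    if [ -z "$frame" ] || [ -z "$slot" ]; then
        echo "usage: method4_commands frame slot" >&2
        return 1
    fi
    echo "hmcmds service ${frame}:${slot}"
    echo "hmcmds reset ${frame}:${slot}"
    # Step 3 is a manual check of the node's three-digit display.
    echo "# wait for the 0c2 status code to change to 0c0"
    echo "hmcmds normal ${frame}:${slot}"
    echo "hmcmds off ${frame}:${slot}"
}
```

For example, method4_commands 1 5 prints the sequence for the node in frame 1, slot 5.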

Dump to the secondary dump device

Choose one of these methods to produce a dump on the secondary dump device.

Note

If the secondary dump device is a removable media device, such as a tape or diskette drive, make sure that the medium is in the device.

Method 5

This method can only be done from a directly-attached keyboard. It cannot be done from a tty connection. This method works only on the control workstation.

Set the key mode switch to the Service position and, while holding down the <Ctrl> and <Alt> keys, press the 2 on the numeric key pad.

Method 6

This method works for all system configurations, if the system is responding to commands.

Log in as root and enter:

sysdumpstart -s
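
As a rough illustration of how the 888 warning combines with Methods 3 and 6, the following sketch prints the sysdumpstart command for the chosen dump device, or refuses if the display value you type in is 888. The name plan_dump is hypothetical, and the display value is assumed to be read from the node by hand; the function does not start a dump itself.

```shell
# Print the sysdumpstart command for the chosen dump device, unless the
# three-digit display already shows 888 (a dump was taken automatically).
# plan_dump is a hypothetical helper; it never runs sysdumpstart.
plan_dump() {
    display=$1
    target=$2
    if [ "$display" = "888" ]; then
        echo "Display shows 888: do not initiate a dump; verify the existing one." >&2
        return 1
    fi
    case "$target" in
        primary)   echo "sysdumpstart -p" ;;
        secondary) echo "sysdumpstart -s" ;;
        *)         echo "usage: plan_dump display primary|secondary" >&2
                   return 2 ;;
    esac
}
```

For example, plan_dump 888 primary prints the warning and returns a nonzero status, which corresponds to proceeding directly to Action 2.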

Action 2. Verify the system dump

You may have a system dump because you initiated it yourself or because the system produced one automatically. In either case, follow these steps to verify that the system dump was successful and that the information it contains is usable.

  1. Record the three-digit codes.

    Table 6. System dump status codes

    Three-digit code   Meaning
    0c0                The dump completed successfully.
    0c1                An I/O error occurred while taking the dump.
    0c2                A user-initiated dump is in progress.
    0c4                The dump device was too small, but the dump may still be usable.
                       If zero bytes are written and 0c4 is displayed, the dump device
                       was large enough but the system was hung and could not initiate
                       a dump.
    0c5                An internal error occurred while taking the dump.
    0c6                Prompts you to make the secondary dump device available.
    0c7                The dump facility is waiting for a response from the NFS (Network File System) server.
    0c8                No dump device is defined.
    0c9                A system-initiated dump is in progress.
    0cc                The dump facility has switched to the secondary dump device.
  2. On Micro Channel Nodes, change the key mode switch to "normal", power off the node and power it back on. On PCI nodes, which do not have a key mode switch, power off the node and power it back on. This will allow the last error log entry stored in NVRAM to be placed in the error log. See Effect of not having a battery on error logging.
    Note:
    DO NOT press the Reset button, because doing so overwrites the current dump information.
  3. Log in as root.
  4. Verify the dump device by entering:
    sysdumpdev
    

    This should return something like:

       primary            /dev/hd7
       secondary          /dev/sysdumpnull
    

    Note the primary dump device name, and substitute it for /dev/hd# in the following steps.

  5. Verify the dump by entering:
    sysdumpdev -L
    

    Output is similar to:

    0453-039
     
    Device name:         /dev/lv00
    Major device number: 10
    Minor device number: 10
    Size:                67108352 bytes
    Date/Time:           Wed Apr  5 14:52:35 EDT 2000
    Dump status:         -2
    dump device too small
    

    In this case, the three-digit display showed 0c4 among the crash codes, and the dump device was too small. There may be enough dump information, because 67108352 bytes were written to the dump device /dev/lv00. Continue to step 6. If no bytes were written, the system was hung and no dump exists. In that case, DO NOT continue with these steps.

  6. Verify the usability of the dump by entering:
       crash /dev/hd#
    

    This should return:

       Using /unix as the default namelist file.
       Reading in symbols........................
    
  7. Enter the errdead command to extract the error records from the /dev/error buffer and place them in the error log:
    /usr/lib/errdead /dev/hd#
    
  8. Issue the crash command again:
    crash /dev/lvxx
    

    where /dev/lvxx is the dump device.

    When you see the > prompt, enter:

       stat
    

    Output is similar to:

       sysname:   AIX
       nodename:  journey
       release:   2
       version:   3
       machine:   000052643100
       time of crash:  Sun Jan 24 19:18:53 1993
       age of system:  18 day, 1 hr., 29 min.
    

    Now put the dump symptom string into the error log. At the > prompt, enter:

    symptom -e 
    

    This copies the symptom string into the error log. The symptom string can be used to search problem databases for duplicate problems.

  9. Enter:
       trace
    

    Look for a trace report similar to this sample:

       STACK TRACE:
          .m_freem ()
          .soreceive
          ._recv
          .recv
    

    Enter q to quit the crash command.
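
The three-digit codes recorded in step 1 can be looked up with a small shell function such as the following sketch, which simply restates Table 6. The name dump_code_meaning is hypothetical; codes that are not in the table are reported as unknown rather than guessed.

```shell
# Map a three-digit dump status code to its meaning from Table 6.
# dump_code_meaning is a hypothetical helper name.
dump_code_meaning() {
    case "$1" in
        0c0) echo "The dump completed successfully." ;;
        0c1) echo "An I/O error occurred while taking the dump." ;;
        0c2) echo "A user-initiated dump is in progress." ;;
        0c4) echo "The dump device was too small but the dump may still be usable." ;;
        0c5) echo "An internal error occurred while taking the dump." ;;
        0c6) echo "Prompts you to make the secondary dump device available." ;;
        0c7) echo "The dump facility is waiting for a response from the NFS server." ;;
        0c8) echo "No dump device is defined." ;;
        0c9) echo "A system-initiated dump is in progress." ;;
        0cc) echo "The dump facility has switched to the secondary dump device." ;;
        *)   echo "Not a dump status code listed in Table 6."
             return 1 ;;
    esac
}
```

For example, dump_code_meaning 0c9 prints the meaning of a system-initiated dump in progress.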

Gather the dump and other snap or log information for the IBM Support Center. Contact your local service representative or call the IBM Support Center to open a Problem Management Record as explained in How to contact the IBM Support Center.
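
Before gathering the dump, the size check from step 5 can be scripted. The sketch below reads sysdumpdev -L output on standard input and reports whether any bytes were written. The names dump_size_bytes and dump_has_data are hypothetical, and the layout of the Size: line is assumed from the sample output in step 5.

```shell
# Print the byte count from the "Size:" line of `sysdumpdev -L` output
# read on standard input. Returns nonzero if no Size: line is found.
# dump_size_bytes is a hypothetical helper name.
dump_size_bytes() {
    awk '$1 == "Size:" && !found { print $2; found = 1 }
         END { if (!found) exit 1 }'
}

# Succeed only if the dump contains at least one byte (see step 5).
# dump_has_data is a hypothetical helper name.
dump_has_data() {
    bytes=$(dump_size_bytes) || return 1
    [ "$bytes" -gt 0 ] 2>/dev/null
}
```

A possible use is sysdumpdev -L | dump_has_data && echo "dump contains data". A nonzero size does not by itself show that the dump is usable; the crash checks in steps 6 through 9 still apply.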

