IBM Books

Diagnosis Guide


Diagnostic procedures

These diagnostic procedures test the installation, configuration, and operation of all servers.

Installation verification tests

Use these tests to check that the server has been installed properly.

Installation test 1 - Verify the ssp.basic file set

This test verifies that the ssp.basic file set has been installed correctly. All server device drivers are included in the ssp.basic file set. Issue this lslpp command on the control workstation:

lslpp -l ssp.basic

The output is similar to the following:

Path: /usr/lib/objrepos
 ssp.basic      3.1.0.8  COMMITTED  SP System Support Package
 
Path: /etc/objrepos
 ssp.basic      3.1.0.8  COMMITTED  SP System Support Package

Good results are indicated if entries for ssp.basic exist. Proceed to Installation test 2 - Check external hardware daemon.

Error results are indicated in all other cases. Try to determine why the file set was not installed, and either install it, or contact the IBM Support Center.
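The pass/fail rule for this test can be scripted. The following is a minimal sketch, not part of PSSP: the function name count_fileset_entries is hypothetical, and it assumes the lslpp -l output has the layout shown in the example above, with the file set name in the first column.

```shell
# count_fileset_entries NAME   (hypothetical helper, not a PSSP command)
# Reads `lslpp -l` style output on stdin and prints how many lines
# name the file set NAME in the first column.
count_fileset_entries() {
    awk -v name="$1" '$1 == name { n++ } END { print n + 0 }'
}

# Intended use on the control workstation (not run here):
#   lslpp -l ssp.basic 2>/dev/null | count_fileset_entries ssp.basic
```

A count greater than zero corresponds to good results. A count of zero corresponds to error results.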

Installation test 2 - Check external hardware daemon

This test verifies that the appropriate external hardware daemon and its directory have been created on the control workstation. Issue these commands on the control workstation:

  1. ls -l /usr/lpp/ssp/install/bin/s70d

    The output is similar to the following:

    -r-x------  1 bin  bin  47166 Sep 15 13:00 /usr/lpp/ssp/install/bin/s70d
    
  2. ls -l /var/adm/SPlogs/spmon | grep s70d

    The output is similar to the following:

     drwxr-xr-x  2 bin  bin   512 Sep 15 13:01 s70d
    

Good results are indicated if output for all the commands is similar to the examples provided. Proceed to Configuration test 1 - Check SDR Frame object.

Error results are indicated in all other cases. Record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.

Configuration verification tests

Use these tests to check that all servers have been configured properly.

Configuration test 1 - Check SDR Frame object

This test verifies that the SDR Frame object was created. The configuration data for all servers must reside in the SDR Frame object. On the control workstation, issue the command:

splstdata -f

For each server you should see a line of output similar to the following. The numbers may be different.

 frame#    tty      s1_tty    frame_type  hardware_protocol
 ----------------------------------------------------------
   3    /dev/tty2  /dev/tty3      ""          SAMI

Good results are indicated if all of the following are true:

  1. One frame entry exists for every server installed on your SP system. If an entry does not exist for one of your servers, the Frame object has not been created.
  2. The tty (serial port for SAMI communication) and s1_tty (serial port for S1 communication) values are correct. If either of these values is incorrect, the Frame object has not been created correctly.
  3. The hardware_protocol is correct. The hardware_protocol value must be set to SAMI (Service and Manufacturing Interface), which is the communication protocol for S70, S7A, and S80 servers.
    Note:
    If hardware_protocol is set to SP, the System Monitor will attempt to send SP frame supervisor commands to the server's service processor. The service processor does not understand this protocol and can become hung. If this happens, it becomes impossible to control the server, either from the SP system, or physically from the operator panel.

If all of these conditions are true, proceed to Configuration test 2 - Check SDR Node object.

Error results are indicated if any of these conditions is not true. Attempt to fix the SDR data by issuing the spframe command with the appropriate parameters, or contact the IBM Support Center. For a description of the spframe command, refer to PSSP: Command and Technical Reference.

Repeat this test after issuing the spframe command. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
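The per-frame checks in this test can be partly automated. The following sketch uses a hypothetical helper, check_server_frame, and assumes the splstdata -f output has the five columns shown in the example above, with the frame number first and hardware_protocol last. It reports the ports and flags any protocol other than SAMI. Whether the reported tty values are the right ones for your installation still has to be judged by the operator.

```shell
# check_server_frame FRAME   (hypothetical helper, not a PSSP command)
# Reads `splstdata -f` style output on stdin and reports the ports and
# hardware_protocol recorded for frame number FRAME.
check_server_frame() {
    awk -v frame="$1" '
        $1 == frame {
            found = 1
            status = ($NF == "SAMI") ? "ok" : "bad hardware_protocol"
            print "frame " frame ": tty=" $2 " s1_tty=" $3 " hardware_protocol=" $NF " (" status ")"
        }
        END { if (!found) print "frame " frame ": no Frame entry found" }'
}

# Intended use on the control workstation (not run here):
#   splstdata -f | check_server_frame 3
```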

Configuration test 2 - Check SDR Node object

This test verifies that the SDR Node object (for s70) was created. The configuration data for all servers must reside in the appropriate SDR object. For each S70, S7A and S80 server, issue this command on the control workstation:

splstdata -n -l N

where N is the node number of the server. If you do not know the node number, issue the command: spmon -G -d to determine node numbers.

For each S70, S7A and S80 server, you should see output similar to the following, which is output from splstdata -n -l 33:

node# frame# slot# slots  initial_hostname  reliable_hostname  dcehostname
   default_route   processor_type processors_installed description
--------------------------------------------------------------------------
  33      3     1     1  wild3n01.ppd.pok   wild3n01.ppd.pok  ""          
     9.114.130.130               MP              4 7017-S70      

Good results are indicated if an appropriate entry exists for each server in the system. Proceed to Configuration test 3 - Check SDR Syspar_map object.

Error results are indicated in all other cases. Attempt to fix the SDR data by issuing the spframe command with the appropriate parameters. If a particular entry exists, but contains incorrect information, you must first delete that Frame object by issuing the spdelfram command. For a description of these commands, refer to PSSP: Command and Technical Reference.

Repeat this test after issuing the spframe command. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.

Configuration test 3 - Check SDR Syspar_map object

This test verifies that the SDR Syspar_map object for all servers was created correctly. The switch port number for the server is stored in the Syspar_map object of the SDR. For an SP-attached server, the switch port number is defined using the spframe command. For clustered enterprise servers, it can optionally be assigned automatically by the SDR configuration command. On the control workstation, issue the command once for each server:

SDRGetObjects Syspar_map node_number==N  switch_node_number

where N is the node number of the server.

If you do not know the node number, issue the command: spmon -G -d to determine node numbers. For each command, you should see output similar to the following, which is output from SDRGetObjects Syspar_map node_number==17 switch_node_number:

  switch_node_number 
          5               

Good results are indicated if the switch port number that is returned matches the value requested when the server was originally defined. For systems with an SP Switch, this should be the switch port number associated with the port to which the server is cabled on the SP Switch. For systems without a switch, this should be any unused valid switch port number on the system. For clustered systems, this can be any unused value in the range 0 to 511. Proceed to Operational test 1 - Check hardmon status.

Error results are indicated if no entry exists for the node, or the returned value is incorrect. Attempt to fix the SDR data by issuing the spframe command with the appropriate parameters. If a Frame object already exists for this server, you must first delete that Frame object by issuing the spdelfram command. For a description of these commands, see PSSP: Command and Technical Reference. For information on assigning valid switch port numbers for all servers, see PSSP: Planning Volume 2.

Repeat this test after issuing the spframe command. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
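Extracting the returned value makes it easier to compare against the number you expect. This is a sketch under stated assumptions: both function names are hypothetical, and the SDRGetObjects output is assumed to consist of a header line followed by the value, as in the example above. The range check applies only to the clustered case, where the text gives 0 to 511 as the valid range.

```shell
# switch_port_from_sdr   (hypothetical helper)
# Reads SDRGetObjects output on stdin, skips the header line, and
# prints the first value that follows it.
switch_port_from_sdr() {
    awk 'NR > 1 && NF > 0 { print $1; exit }'
}

# in_cluster_range VALUE   (hypothetical helper)
# Succeeds if VALUE is a decimal integer from 0 to 511.
in_cluster_range() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;
    esac
    [ "$1" -ge 0 ] && [ "$1" -le 511 ]
}

# Intended use on the control workstation (not run here):
#   port=$(SDRGetObjects Syspar_map node_number==17 switch_node_number | switch_port_from_sdr)
#   [ "$port" = "5" ] && echo "switch port matches"
```

An empty result from switch_port_from_sdr means that no entry was returned for the node, which the test treats as error results.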

Operational verification tests

Use these tests to check that all servers are operating properly.

Operational test 1 - Check hardmon status

This test verifies that the System Monitor (hardmon) is active and running correctly. External hardware daemons cannot run if hardmon is not running. Issue the commands:

  1. lssrc -s hardmon

    The output is similar to the following:

     Subsystem         Group            PID     Status 
      hardmon                           42532   active
    
  2. ps -ef | grep hardmon

    The output is similar to the following:

     root 42532  5966  0  Sep 15   0  9:42 /usr/lpp/ssp/bin/hardmon -r 5
    

Good results are indicated if all of the following are true:

  1. In the lssrc output, look for the entry whose Subsystem is hardmon. The Status column should be active.
  2. In the ps output, verify that the hardmon daemon uses the -r flag and that the argument is 5. This means that the hardmon daemon polls each frame supervisor, including external hardware daemons, for state information every five seconds. This is the default. If the hardmon daemon uses a value other than 5 for the argument to the -r flag, it is not running as IBM recommends.
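The second condition can be checked by extracting the argument that follows the -r flag. The helper below is a hypothetical sketch, not part of PSSP. It assumes that the daemon appears in the ps output with a path ending in /hardmon, as in the example above, which also keeps the grep process itself from matching.

```shell
# hardmon_poll_rate   (hypothetical helper)
# Reads `ps -ef` output on stdin and prints the argument to the -r
# flag of the first process whose command path ends in /hardmon.
hardmon_poll_rate() {
    awk '{
        seen = 0
        for (i = 1; i <= NF; i++) {
            if ($i ~ /\/hardmon$/) seen = 1
            else if (seen && $i == "-r" && i < NF) { print $(i + 1); exit }
        }
    }'
}

# Intended use on the control workstation (not run here):
#   [ "$(ps -ef | hardmon_poll_rate)" = "5" ] || echo "unexpected poll rate"
```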

If these conditions are met, proceed to Operational test 2 - Check external hardware daemons.

Error results are indicated if these conditions are not met. To determine why hardmon is not running, or why the argument to the -r flag is not 5, refer to Diagnosing System Monitor problems.

Repeat this test after taking any action suggested in Diagnosing System Monitor problems. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.

Operational test 2 - Check external hardware daemons

This test verifies that the s70d daemons are running. If you have one or more S70, S7A or S80 servers, whether they are SP-attached or clustered enterprise servers, issue the command:

ps -ef | grep s70d

For S70, S7A and S80 servers, a line of output for each server is similar to the following:

root 23384 42532 1 Sep 15  2 79:08 /usr/lpp/ssp/install/bin/s70d 
                  -d 0 5 1 7 /dev/tty2 /dev/tty3

Good results are indicated if all of the following are true:

  1. There is a line of output, as in this example, for each server.
  2. The parameters of the command are correct. This is a description of each parameter from left to right:
    1. This parameter will always be -d.
    2. This parameter is the argument to the -d parameter. It is an integer in which each bit (of the binary representation) represents a particular debug option. The external hardware daemon, at the time it is created by the hardmon daemon, inherits this parameter from the current value of the hardmon daemon. In the example output, the value is 0 because no debug options were set in the hardmon daemon at the time the external hardware daemons were created.
    3. This parameter is the frame number of the server.
    4. This parameter is the node number of the server. It should always be 1.
    5. This parameter is the file descriptor of the hardmon side of the socket pair used for two-way communication. The value of this parameter will be whatever the next available file descriptor was at the time the external hardware daemon was created. Any integer value would be considered correct.
    6. This parameter is the SAMI communication port.
    7. This parameter is the S1 communication port. Note that for the S70 type, this tty must be different from the tty in the previous parameter.

To help verify the correctness of the last two parameters, issue the command:

splstdata -f

which produces output that shows what the communication ports are expected to be.

If you receive good results, proceed to Operational test 3 - Check frame responsiveness.

Error results are indicated if any of these conditions is not met. Record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
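The positional parameters described in this test can be pulled out of the ps output for comparison with splstdata -f. The following is a sketch only: s70d_params is a hypothetical name, and the code assumes that ps prints each process on a single line (the example above is wrapped for display) and that the seven parameters follow the daemon path in the order listed.

```shell
# s70d_params   (hypothetical helper)
# Reads `ps -ef` output on stdin. For every s70d process it prints the
# frame number, node number, and the two communication ports, taken
# from the parameters that follow the daemon path:
#   -d DEBUG FRAME NODE FD SAMI_TTY S1_TTY
s70d_params() {
    awk '{
        for (i = 1; i <= NF; i++)
            if ($i ~ /\/s70d$/ && NF >= i + 7 && $(i + 1) == "-d") {
                print "frame=" $(i + 3), "node=" $(i + 4), "sami_tty=" $(i + 6), "s1_tty=" $(i + 7)
                break
            }
    }'
}

# Intended use on the control workstation (not run here):
#   ps -ef | s70d_params
```

Each line of output can then be compared with the tty and s1_tty columns that splstdata -f reports for the same frame.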

Operational test 3 - Check frame responsiveness

This test verifies that the frames are responding. External hardware daemons cannot run properly if their frames are not responding. To verify that a particular frame is responding, issue the following command on the control workstation:

hmmon -GQv controllerResponds F:0

where F is the frame number of the server that you are checking. Repeat this test for each server in the system.

The output is similar to the following:

frame F, slot 00: 
   TRUE  frame responding to polls  

Good results are indicated if the value is TRUE for each server. Proceed to Operational test 4 - Check SAMI communications.

If the value is FALSE, or you do not get any output, the test may have encountered an error. During normal operation, this value may occasionally switch to FALSE, which may simply mean that the daemon happens to be busy and cannot respond to an individual System Monitor request in a timely manner. Therefore, if you get a FALSE value, repeat the hmmon command several more times, waiting at least five seconds between invocations. If the value is consistently FALSE after several attempts, assume this to be error results.

In the case of error results, perform these steps:

  1. Verify that the tty cables are properly connected to the server. Each S70, S7A and S80 server has two cables.
  2. Verify that the server itself is operating properly.

Repeat this test after performing these steps. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
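Because a single FALSE value is not conclusive, the repeat-and-wait procedure described in this test lends itself to a small loop. This is a sketch under assumptions: responds_within is a hypothetical helper, the number of attempts is your choice (the text says only "several"), and the command is considered successful as soon as the word TRUE appears in its output.

```shell
# responds_within ATTEMPTS DELAY COMMAND [ARG...]   (hypothetical helper)
# Runs COMMAND up to ATTEMPTS times, waiting DELAY seconds between
# runs. Succeeds as soon as one run prints the word TRUE, and fails if
# no run does.
responds_within() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@" 2>/dev/null | grep -w TRUE >/dev/null; then
            return 0
        fi
        [ "$n" -lt "$attempts" ] && sleep "$delay"
        n=$((n + 1))
    done
    return 1
}

# Intended use on the control workstation for frame 3 (not run here):
#   responds_within 4 5 hmmon -GQv controllerResponds 3:0 || echo "frame 3 not responding"
```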

Operational test 4 - Check SAMI communications

This test verifies that the SAMI communication is available. For each server, issue the following command twice on the control workstation, waiting at least five seconds between the two commands:

hmmon -GQs F:0,1

where F is the frame number of the server that you are checking. Repeat this test for each server in the system.

This is example output for an S70 server in frame 3:

  3   0  nodefail1            FALSE    0x8802  node 01 I2C not responding
  3   0  nodeLinkOpen1        FALSE    0x8813  node 01 serial link open
  3   0  diagByte                 0    0x8823  diagnosis return code
  3   0  timeTicks            38701    0x8830  supervisor timer ticks
  3   0  type                     2    0x883a  supervisor type
  3   0  codeVersion            769    0x883b  supervisor code version
  3   0  daemonPollRate           5    0x8867  hardware monitor poll rate
  3   0  controllerResponds   TRUE     0x88a8  frame responding to polls
  3   1  nodePower            TRUE     0x944a  DC-DC power on
  3   1  serialLinkOpen       FALSE    0x949d  serial link is open
  3   1  DPOinProgress        FALSE    0x950b  delayed power off active
  3   1  SRChasMessage        FALSE    0x9512  SRC contains a message
  3   1  SPCNhasMessage       FALSE    0x9513  SPCN contains a message
  3   1  LCDhasMessage        FALSE    0x9506  LED/LCD contains a message
  3   1  src                  BLANK    0x9510  System Reference Code
  3   1  spcn                 BLANK    0x9511  System Power Cntl Network
  3   1  hardwareStatus          72    0x94f3  hardware status byte
  3   1  diagByte                15    0x9423  diagnosis return code
  3   1  timeTicks            23850    0x9430  supervisor timer ticks
  3   1  type                    10    0x943a  supervisor type
  3   1  codeVersion            769    0x943b  supervisor code version
  3   1  lcd1                 BLANK    0x94f4  LCD line 1
  3   1  lcd2                 BLANK    0x94f5  LCD line 2

Good results are indicated if all of the following are true:

  1. For each server, the value for timeTicks, for both slots 0 and 1, increases from the first invocation of the hmmon command to the second invocation of the hmmon command. The slot number is indicated by the second column of output.
  2. For each server, the value for nodefail1 is FALSE. Note that, during normal operation, this value may occasionally switch to TRUE. This may simply mean that the daemon happens to be busy and cannot respond to an individual System Monitor request in a timely manner. Therefore, if you get a TRUE value, repeat the hmmon command several more times, waiting at least five seconds in between invocations, before concluding that this test has failed.
  3. For each server, information for slot 1 is displayed. If it is not, this indicates that the SAMI communication is not available. The slot number is indicated by the second column of output.

If all of these conditions are true, proceed to Operational test 5 - Check S1 communications.

Error results are indicated in all other cases. Record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
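Comparing timeTicks by eye across two long listings is error prone, so the first condition can be checked mechanically. The following sketch assumes that the two hmmon -GQs outputs for one server have been saved to files, and that each line has the column layout of the example above (frame, slot, variable name, value). The name ticks_advanced is hypothetical. The check covers only timeTicks. The nodefail1 condition still has to be checked separately.

```shell
# ticks_advanced FIRST SECOND   (hypothetical helper)
# FIRST and SECOND are files holding the output of two successive
# `hmmon -GQs F:0,1` invocations for one server. Fails, naming the
# frame:slot, if any timeTicks value in SECOND is not greater than the
# value in FIRST, or if SECOND has fewer than two timeTicks lines
# (for example, because slot 1 is missing).
ticks_advanced() {
    awk '
        $3 != "timeTicks" { next }
        { key = $1 ":" $2 }
        NR == FNR { before[key] = $4; next }
        {
            seen++
            if (!(key in before) || $4 + 0 <= before[key] + 0) {
                print "timeTicks did not increase for frame:slot " key
                bad = 1
            }
        }
        END { if (bad || seen < 2) exit 1 }
    ' "$1" "$2"
}

# Intended use on the control workstation for frame 3 (not run here):
#   hmmon -GQs 3:0,1 > /tmp/hmmon.1; sleep 5; hmmon -GQs 3:0,1 > /tmp/hmmon.2
#   ticks_advanced /tmp/hmmon.1 /tmp/hmmon.2 && echo "timeTicks increasing"
```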

Operational test 5 - Check S1 communications

This test verifies that the S1 communication is available. For each server, issue the following command on the control workstation:

s1term -G F 1

where F is the frame number of the server that you are checking. Repeat this test for each server in the system.

Good results are indicated if the AIX login prompt is displayed. If you do not see the login prompt after issuing the s1term command, try pressing the Enter key a second time. If you still do not see the login prompt, consider this to be error results.

To correct the problem, perform these steps:

  1. Verify that the S1 tty cable is properly connected to the server.
  2. Verify that the server itself is operating properly.

Repeat this test after performing these steps. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
