IBM Books

Diagnosis Guide


Diagnostic procedures

These Diagnostic Procedures test the installation, configuration, and operation of all servers.

Installation verification tests

Use these tests to check that the server has been installed properly.

Installation test 1 - Verify the ssp.basic file set

This test verifies that the ssp.basic file set has been installed correctly. All server device drivers are included in the ssp.basic file set. Issue this lslpp command on the control workstation:

lslpp -l ssp.basic

The output is similar to the following:

Path: /usr/lib/objrepos
ssp.basic 3.1.0.8 COMMITTED SP System Support Package

Path: /etc/objrepos
ssp.basic 3.1.0.8 COMMITTED SP System Support Package

Good results are indicated if entries for ssp.basic exist. Proceed to Installation test 2 - Check external hardware daemon.

Error results are indicated in all other cases. Try to determine why the file set was not installed, and either install it, or contact the IBM Support Center.
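The check for ssp.basic entries can be scripted. The following is a minimal sketch, not part of PSSP: the function name has_fileset is hypothetical, and it assumes output shaped like the sample above, where the file set name is the first field on its line.

```shell
# has_fileset NAME
# Reads lslpp -l style output on standard input and succeeds (exit 0)
# if at least one line has NAME as its first field.
# Hypothetical helper for illustration; it is not a PSSP command.
has_fileset() {
    awk -v name="$1" '
        $1 == name { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Intended use on the control workstation (assumed, not run here):
#   lslpp -l ssp.basic | has_fileset ssp.basic && echo "ssp.basic is installed"
```

The helper only confirms that an entry exists, which is the condition this test names; it does not inspect the level or state columns.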

Installation test 2 - Check external hardware daemon

This test verifies that the appropriate external hardware daemon and its directory have been created on the control workstation. Issue these commands on the control workstation:

  1. ls -l /usr/lpp/ssp/install/bin/hmcd
    

    The output is similar to the following:

    -r-x------ 1 bin bin 47166 Sep 15 13:00 hmcd
    
  2. ls -l /var/adm/SPlogs/spmon | grep hmcd
    

    The output is similar to the following:

    drwxr-xr-x 2 bin bin 512 Sep 15 13:01 hmcd
    
  3. ls -l /usr/lpp/ssp/install/bin/HMCD.class
    

    The output is similar to the following:

    -rw-r--r-- 1 bin bin 26927 Sep 11 01:03 HMCD.class
    
  4. ls -l /usr/lpp/ssp/lib/libHMCD.so
    

    The output is similar to the following:

    -rwxr-x--x 1 bin bin 17680 Sep 15 13:00 libHMCD.so
    

Good results are indicated if output for all the commands is similar to the examples provided. Proceed to Installation test 3 - Check external hardware daemon components.

Error results are indicated in all other cases. Record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
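As a quick first pass before comparing the ls -l output by eye, the four paths can be checked for existence in one step. This is a sketch under stated assumptions: report_missing is a hypothetical helper, and an existence check is weaker than the comparison this test describes because it ignores ownership, permissions, and size.

```shell
# report_missing PATH...
# Prints each path that does not exist and returns non-zero if any is missing.
# Hypothetical helper; it checks existence only, not permissions or ownership.
report_missing() {
    missing=0
    for p in "$@"; do
        if [ ! -e "$p" ]; then
            echo "missing: $p"
            missing=1
        fi
    done
    return "$missing"
}

# Intended use on the control workstation (paths taken from the steps above):
#   report_missing /usr/lpp/ssp/install/bin/hmcd \
#                  /var/adm/SPlogs/spmon/hmcd \
#                  /usr/lpp/ssp/install/bin/HMCD.class \
#                  /usr/lpp/ssp/lib/libHMCD.so
```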

Installation test 3 - Check external hardware daemon components

This test verifies that the appropriate external hardware daemon components and their directories have been created on the control workstation. Issue these commands on the control workstation:

  1. ls -l /usr/java130/jre/lib/ext/xerces.jar
    

    The output is similar to the following:

    -r--r--r-- 1 bin bin 1521373 Feb 12 13:20 xerces.jar
    
  2. ls -l /opt/freeware/cimom/org/snia/wbem/client/CIMClient.class
    

    The output is similar to the following:

    -rwxr-xr-x 1 bin bin 8507 Jul 23 15:20 CIMClient.class
    
    

Good results are indicated if output for all the commands is similar to the examples provided. Proceed to Configuration test 1 - Check SDR Frame object.

Error results are indicated in all other cases. Record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.

Configuration verification tests

Use these tests to check that all servers have been configured properly.

Configuration test 1 - Check SDR Frame object

This test verifies that the SDR Frame object was created. The configuration data for all servers must reside in the SDR Frame object. On the control workstation, issue the command:

splstdata -f

For each server you should see a line of output similar to the following. The numbers may be different.

                                      List Frame Database Information
  frame#    tty          s1_tty   frame_type        hardware_protocol control_ipaddrs domain_name
 ---------- ------------ -------- ----------------- ----------------- --------------- ------------
 1          /dev/tty0    ""       switch            SP                ""              ""
 2          ""           ""       ""                HMC               9.114.62.123    huntley

Good results are indicated if all of the following are true:

  1. One frame entry exists for every server installed on your SP system. If an entry does not exist for one of your servers, the Frame object has not been created.
  2. The hardware_protocol is correct. The hardware_protocol value must be set to HMC (Hardware Management Console), the communication protocol for an IBM eServer pSeries 690 server.
  3. The control_ipaddrs is correct. The control_ipaddrs represents the IP address of the HMC that the server is connected to.
  4. The domain_name is correct. The domain_name represents the system name assigned to the IBM eServer pSeries 690 server through the HMC Partition Management interface. You can verify the name by viewing the properties for the IBM eServer pSeries 690 server directly using the HMC WebSM interface.

If all of these conditions are true, proceed to Configuration test 2 - Check HMC password files.

Error results are indicated if one or more of these conditions are not true. Attempt to fix the SDR data by issuing the spframe command with the appropriate parameters, or contact the IBM Support Center. For a description of the spframe command, refer to PSSP: Command and Technical Reference.

Repeat this test after issuing the spframe command. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
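On systems with many frames, it can help to reduce the splstdata -f output to just the HMC-controlled frames. The following sketch assumes the seven-column layout of the sample output, with empty values printed as ""; the function name hmc_frames is hypothetical.

```shell
# hmc_frames
# Reads splstdata -f style output on standard input. For every data row whose
# hardware_protocol column (field 5) is HMC, prints:
#   frame number, control_ipaddrs, domain_name
# and appends INCOMPLETE when either value is the empty string "".
# Hypothetical helper; it assumes the seven-column layout of the sample output.
hmc_frames() {
    awk '
        $1 ~ /^[0-9]+$/ && $5 == "HMC" {
            flag = ($6 == "\"\"" || $7 == "\"\"") ? " INCOMPLETE" : ""
            print $1, $6, $7 flag
        }
    '
}

# Intended use on the control workstation (assumed, not run here):
#   splstdata -f | hmc_frames
```

The helper cannot tell whether an IP address or system name is the right one for a given server; that comparison still has to be made against the HMC itself.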

Configuration test 2 - Check HMC password files

This test verifies that for each unique HMC IP address in your system, a corresponding password file has been created on the control workstation. On the control workstation, issue the command:

sphmcid

For each HMC in your system, you should see a line of output similar to the following. The numbers may be different:

9.114.58.22      hmcadmin
9.114.62.123     hmcadmin

Good results are indicated if, for each HMC IP address displayed under the control_ipaddrs heading in Configuration test 1 - Check SDR Frame object, a corresponding HMC IP address is displayed by the sphmcid command. Proceed to Configuration test 3 - Check SDR Syspar_map object.

Error results are indicated if no entry exists for one or more HMC IP addresses. Attempt to create a password file on the control workstation for each missing HMC IP address. For a description of the sphmcid command, refer to PSSP: Command and Technical Reference.

Repeat the test after issuing the sphmcid command. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
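The comparison between the SDR addresses and the sphmcid output can be sketched as follows. The helper missing_hmc_ids is hypothetical, and it assumes you have saved the expected HMC IP addresses (one per line, taken from the control_ipaddrs column) and the sphmcid output in two files.

```shell
# missing_hmc_ids EXPECTED_FILE SPHMCID_FILE
# EXPECTED_FILE: one HMC IP address per line (from the control_ipaddrs column).
# SPHMCID_FILE:  saved output of the sphmcid command.
# Prints each expected address with no matching first field in SPHMCID_FILE
# and returns non-zero if any address is missing. Hypothetical helper.
missing_hmc_ids() {
    rc=0
    while read -r ip; do
        [ -n "$ip" ] || continue
        if ! awk -v ip="$ip" '$1 == ip { found = 1 } END { exit(found ? 0 : 1) }' "$2"; then
            echo "no sphmcid entry for $ip"
            rc=1
        fi
    done < "$1"
    return "$rc"
}

# Intended use on the control workstation (file names are placeholders):
#   sphmcid > /tmp/sphmcid.out
#   missing_hmc_ids /tmp/expected_hmc_ips /tmp/sphmcid.out
```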

Configuration test 3 - Check SDR Syspar_map object

This test verifies that the SDR Syspar_map object for all servers was created correctly. The switch port number for the server is stored in the Syspar_map object of the SDR. If the server is attached to the SP Switch or an SP system without a switch, the switch port number is defined for the SP-attached server using the spframe command. If the server is attached to the SP Switch2 or a clustered enterprise server system without a switch, the SDR configuration command can optionally assign the switch port number automatically. On the control workstation, issue the command once for each server:

SDRGetObjects Syspar_map node_number==N switch_node_number

where N is the node number of the server.

If you do not know the node number, issue the spmon -G -d command to determine node numbers. For example, the command:

SDRGetObjects Syspar_map node_number==17 switch_node_number

produces output similar to the following:

switch_node_number
5

Good results are indicated if the switch port number that is returned matches the value requested when the server was originally defined. For systems with an SP Switch, this should be the switch port number associated with the port to which the server is cabled on the SP Switch. For systems without a switch, this should be any unused valid switch port number on the system. For systems with an SP Switch2 or for clustered systems, this can be any unused value in the range 0 to 511. Proceed to Operational test 1 - Check hardmon status.

Error results are indicated if no entry exists for the node, or the returned value is incorrect. Attempt to fix the SDR data by issuing the spframe command with the appropriate parameters. If a Frame object already exists for this server, you must first delete that Frame object by issuing the spdelfram command. For a description of these commands, see PSSP: Command and Technical Reference. For information on assigning valid switch port numbers for all servers, see IBM RS/6000 SP: Planning, Volume 2, Control Workstation and Software Environment.

Repeat this test after issuing the spframe command. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
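If you keep a record of the switch port number requested for each node, the comparison can be scripted. This is a sketch only: switch_port_matches is a hypothetical helper, and it assumes the two-line output shown in the example (a header line followed by a single numeric value).

```shell
# switch_port_matches EXPECTED
# Reads SDRGetObjects output (a header line followed by one value) on standard
# input and succeeds only if the value equals EXPECTED.
# Hypothetical helper for illustration.
switch_port_matches() {
    awk -v want="$1" '
        NR > 1 && $1 ~ /^[0-9]+$/ { got = $1; seen = 1 }
        END { exit((seen && got == want) ? 0 : 1) }
    '
}

# Intended use on the control workstation for node 17 (assumed, not run here):
#   SDRGetObjects Syspar_map node_number==17 switch_node_number | switch_port_matches 5
```

The helper fails both when the value differs and when no value is returned, which corresponds to the two error conditions this test names.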

Operational verification tests

Use these tests to check that all servers are operating properly.

Operational test 1 - Check hardmon status

This test verifies that the System Monitor (hardmon) is active and running correctly. External hardware daemons cannot run if hardmon is not running. Issue the commands:

  1. lssrc -s hardmon
    

    The output is similar to the following:

    Subsystem Group PID   Status
    hardmon         42532 active
    
  2. ps -ef | grep hardmon
    

    The output is similar to the following:

    root 42532 5966 0 Sep 15 0 9:42 /usr/lpp/ssp/bin/hardmon -r 5
    
    

Good results are indicated if the following are both true:

  1. There is a line of output, as in this example, for each server HMC that is actively controlling your IBM eServer pSeries 690 servers.
  2. In the ps output, the hardmon daemon uses the -r flag and the argument is 5. This means that the hardmon daemon polls each frame supervisor, including external hardware daemons, for state information every five seconds. This is the default. If the hardmon daemon uses a value other than 5 for the argument to the -r flag, it is not running as IBM recommends.

If these conditions are met, proceed to Operational test 2 - Check external hardware daemons.

Error results are indicated if these conditions are not met. To determine why hardmon is not running, or why the argument to the -r flag is not 5, refer to Diagnosing System Monitor problems.

Repeat this test after taking any action suggested in Diagnosing System Monitor problems. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
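The -r argument can be pulled out of the ps output rather than read by eye. The following sketch assumes the hardmon command path ends in /hardmon, as in the sample output; hardmon_poll_rate is a hypothetical helper name.

```shell
# hardmon_poll_rate
# Reads ps -ef output on standard input and prints the argument that follows
# the -r flag on the first line whose command path ends in /hardmon.
# Prints nothing if no such line or flag is found. Hypothetical helper.
hardmon_poll_rate() {
    awk '
        {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /\/hardmon$/) {
                    for (j = i + 1; j < NF; j++) {
                        if ($j == "-r") { print $(j + 1); exit }
                    }
                }
            }
        }
    '
}

# Intended use on the control workstation (assumed, not run here):
#   [ "$(ps -ef | hardmon_poll_rate)" = "5" ] || echo "hardmon is not running with -r 5"
```

Matching on a field that ends in /hardmon skips the grep hardmon process itself, which would otherwise appear in the ps output.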

Operational test 2 - Check external hardware daemons

This test verifies that the hmcd daemons are running. If you have one or more IBM eServer pSeries 690 servers, issue the command:

ps -ef | grep hmcd

For IBM eServer pSeries 690 servers, a line of output for each server is similar to the following:

root 23384 42532 1 Sep 15 2 79:08 /usr/lpp/ssp/install/bin/hmcd -d 0 9.114.58.22 1 minnow 1 5

Good results are indicated if all of the following are true:

  1. There is a line of output, as in this example, for each server.
  2. The parameters of the command are correct. This is a description of each parameter from left to right:

If you receive good results, proceed to Operational test 3 - Check frame responsiveness.

Error results are indicated if any of these conditions are not met. Record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.

Operational test 3 - Check frame responsiveness

This test verifies that the frames are responding. External hardware daemons cannot run properly if their frames are not responding. To verify that a particular frame is responding, issue the following command on the control workstation:

hmmon -GQv controllerResponds F:0

where F is the frame number of the server that you are checking. Repeat this test for each server in the system.

The output is similar to the following:

frame F, slot 00:
TRUE frame responding to polls

Good results are indicated if the value is TRUE for each server. Proceed to Operational test 4 - Check HMC communications.

If the value is FALSE, or you do not get any output, the test may have encountered an error. During normal operation, this value may occasionally switch to FALSE, which may simply mean that the daemon happens to be busy and cannot respond to an individual System Monitor request in a timely manner. Therefore, if you get a FALSE value, repeat the hmmon command several more times, waiting at least five seconds between invocations. If the value is consistently FALSE after several attempts, assume this to be error results.

In the case of error results, perform these steps:

  1. Verify that the HMC IP address specified in the SDR is correct for the given server.
  2. Verify that the server itself is operating properly.
  3. Examine the daemon log. See Daemon log.

Repeat this test after performing these steps. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
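The advice to repeat the hmmon command several times, waiting at least five seconds between invocations, can be sketched as a small retry loop. The helper poll_until_true is hypothetical, and the attempt count of 5 in the usage line is an arbitrary choice, not an IBM recommendation.

```shell
# poll_until_true ATTEMPTS DELAY COMMAND [ARG...]
# Runs COMMAND up to ATTEMPTS times, sleeping DELAY seconds between runs, and
# succeeds as soon as any output line starts with TRUE.
# Hypothetical helper; the guide asks for at least five seconds between runs.
poll_until_true() {
    attempts=$1
    delay=$2
    shift 2
    n=1
    while [ "$n" -le "$attempts" ]; do
        if "$@" | grep '^TRUE' >/dev/null; then
            return 0
        fi
        if [ "$n" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        n=$((n + 1))
    done
    return 1
}

# Intended use on the control workstation for frame 2 (assumed, not run here):
#   poll_until_true 5 5 hmmon -GQv controllerResponds 2:0 || echo "frame 2 is not responding"
```

A run with no output at all is treated the same as FALSE, so the loop keeps retrying in both cases this test describes.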

Operational test 4 - Check HMC communications

This test verifies that the HMC communication is available. For each server, issue the following command twice on the control workstation, waiting at least five seconds between the two commands:

hmmon -GQs F:0

where F is the frame number of the server that you are checking. Repeat this test for each server in the system. This is example output for an IBM eServer pSeries 690 in frame 3:

3 0 nodefail1          FALSE 0x8802  node 01 I2C not responding
3 0 nodefail2          TRUE  0x8803  node 02 I2C not responding
3 0 nodefail3          TRUE  0x8804  node 03 I2C not responding
3 0 nodefail4          TRUE  0x8805  node 04 I2C not responding
3 0 nodefail5          TRUE  0x8806  node 05 I2C not responding
3 0 nodefail6          TRUE  0x8807  node 06 I2C not responding
3 0 nodefail7          TRUE  0x8808  node 07 I2C not responding
3 0 nodefail8          TRUE  0x8809  node 08 I2C not responding
3 0 nodefail9          TRUE  0x880a  node 09 I2C not responding
3 0 nodefail10         TRUE  0x880b  node 10 I2C not responding
3 0 nodefail11         TRUE  0x880c  node 11 I2C not responding
3 0 nodefail12         TRUE  0x880d  node 12 I2C not responding
3 0 nodefail13         TRUE  0x880e  node 13 I2C not responding
3 0 nodefail14         TRUE  0x880f  node 14 I2C not responding
3 0 nodefail15         TRUE  0x8810  node 15 I2C not responding
3 0 nodefail16         TRUE  0x8811  node 16 I2C not responding
3 0 nodeLinkOpen1      FALSE 0x8813  node 01 serial link open
3 0 nodeLinkOpen2      FALSE 0x8814  node 02 serial link open
3 0 nodeLinkOpen3      FALSE 0x8815  node 03 serial link open
3 0 nodeLinkOpen4      FALSE 0x8816  node 04 serial link open
3 0 nodeLinkOpen5      FALSE 0x8817  node 05 serial link open
3 0 nodeLinkOpen6      FALSE 0x8818  node 06 serial link open
3 0 nodeLinkOpen7      FALSE 0x8819  node 07 serial link open
3 0 nodeLinkOpen8      FALSE 0x881a  node 08 serial link open
3 0 nodeLinkOpen9      FALSE 0x881b  node 09 serial link open
3 0 nodeLinkOpen10     FALSE 0x881c  node 10 serial link open
3 0 nodeLinkOpen11     FALSE 0x881d  node 11 serial link open
3 0 nodeLinkOpen12     FALSE 0x881e  node 12 serial link open
3 0 nodeLinkOpen13     FALSE 0x881f  node 13 serial link open
3 0 nodeLinkOpen14     FALSE 0x8820  node 14 serial link open
3 0 nodeLinkOpen15     FALSE 0x8821  node 15 serial link open
3 0 nodeLinkOpen16     FALSE 0x8822  node 16 serial link open
3 0 CECUserName        huntley
                             0x8993  user defined CEC name
3 0 CECMode            0     0x8984  0=smp 1=partition
3 0 PowerOffPolicy     TRUE  0x8985  power off with last Lpar
3 0 CECCapability      1     0x8986  0=smp 1=lpar 2=numa
3 0 CECState           1     0x8987  0=off on init err inc con rec
3 0 diagByte           0     0x8823  diagnosis return code
3 0 timeTicks          38701 0x8830  supervisor timer ticks
3 0 type               5     0x883a  supervisor type
3 0 codeVersion        772   0x883b  supervisor code version
3 0 daemonPollRate     5     0x8867  hardware monitor poll rate
3 0 controllerResponds TRUE  0x88a8  frame responding to polls

Good results are indicated if all of the following are true:

  1. For each server, the value for timeTicks increases from the first invocation of the hmmon command to the second invocation of the hmmon command. The slot number is indicated by the second column of output.
  2. For each server, the value for nodefailN (where N is a configured node number) is FALSE. Note that, during normal operation, this value may occasionally switch to TRUE. This may simply mean that the daemon happens to be busy and cannot respond to an individual System Monitor request in a timely manner. Therefore, if you get a TRUE value, repeat the hmmon command several more times, waiting at least five seconds in between invocations, before concluding that this test has failed.

In this case, proceed to Operational test 5 - Check S1 communications.

Error results are indicated in all other cases. Record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.
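The two conditions can be checked against saved copies of the two hmmon invocations. This is a sketch under stated assumptions: hmmon_value and ticks_increased are hypothetical helpers, and they assume the column layout of the sample output, where the variable name is the third field and its value is the fourth.

```shell
# hmmon_value FILE NAME
# Prints the value (field 4) of the first line in FILE whose variable name
# (field 3) equals NAME. FILE holds saved hmmon -GQs output.
hmmon_value() {
    awk -v name="$2" '$3 == name { print $4; exit }' "$1"
}

# ticks_increased FIRST_FILE SECOND_FILE
# Succeeds if timeTicks in SECOND_FILE is greater than in FIRST_FILE.
# Both helpers are hypothetical and assume the column layout of the sample.
ticks_increased() {
    first=$(hmmon_value "$1" timeTicks)
    second=$(hmmon_value "$2" timeTicks)
    [ -n "$first" ] && [ -n "$second" ] && [ "$second" -gt "$first" ]
}

# Intended use on the control workstation for frame 3 (file names are placeholders):
#   hmmon -GQs 3:0 > /tmp/hmmon.1; sleep 5; hmmon -GQs 3:0 > /tmp/hmmon.2
#   ticks_increased /tmp/hmmon.1 /tmp/hmmon.2 || echo "timeTicks did not increase"
#   hmmon_value /tmp/hmmon.2 nodefail1
```

Checking nodefailN is left as a per-node lookup because only the configured node numbers are expected to report FALSE.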

Operational test 5 - Check S1 communications

This test verifies that the S1 communication is available. For each server, issue the following command on the control workstation for one node chosen arbitrarily:

s1term -G F N

where F is the frame number and N is the node number for a logical partition in the server that you are checking. Repeat this test for each server in the system.

Good results are indicated if the AIX login prompt is displayed. If you do not see the login prompt after issuing the s1term command, press the Enter key a second time. If you still do not see the login prompt, consider this to be error results. Note that the login prompt may take up to 90 seconds to appear.

To correct the problem, perform these steps:

  1. Verify that the server itself is operating properly.
  2. Examine the daemon log. See Daemon log.

Repeat this test after performing these steps. If the test still fails, record all relevant information, see Information to collect before contacting the IBM Support Center, and contact the IBM Support Center.

