IBM Books

Diagnosis Guide


Diagnostic procedures

These procedures check the installation, configuration, and operation of the System Monitor.

Installation verification tests

Use these tests to check that the System Monitor is installed properly.

Installation test 1 - Check ssp.basic file set

This test verifies that the ssp.basic file set has been installed correctly. System Monitor function is included in the ssp.basic file set.

Issue this lslpp command on the control workstation:

lslpp -l ssp.basic

Good results are indicated by output similar to the following:

Path: /usr/lib/objrepos
 ssp.basic      3.1.0.8  COMMITTED  SP System Support Package

Path: /etc/objrepos
 ssp.basic      3.1.0.8  COMMITTED  SP System Support Package

In this case, proceed to Installation test 2 - Check System Monitor files.

Error results are indicated if entries for ssp.basic do not exist. In this case, try to determine why the file set was not installed, and either attempt to install it, or contact the IBM Support Center.
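
The check in this test can be scripted. The following is a minimal sketch, not part of PSSP: the function name fileset_listed is hypothetical, and it only assumes that each installed file set appears as the first field of a row in lslpp -l style output, as in the sample above.

```shell
# fileset_listed NAME
# Reads "lslpp -l" style text on standard input and succeeds (exit 0)
# only if some row has NAME as its first field.
fileset_listed() {
    awk -v name="$1" '$1 == name { found = 1 } END { exit !found }'
}
```

On the control workstation it could be used as: lslpp -l ssp.basic 2>/dev/null | fileset_listed ssp.basic && echo "ssp.basic is installed"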

Installation test 2 - Check System Monitor files

This test verifies that the System Monitor daemon and associated commands and configuration files have been created in the proper directory on the control workstation. Issue these commands and verify that all these files exist:

        ls -l /usr/lpp/ssp/bin/hardmon
        ls -l /usr/lpp/ssp/bin/hmadm
        ls -l /usr/lpp/ssp/bin/hmcmds
        ls -l /usr/lpp/ssp/bin/hmmon
        ls -l /usr/lpp/ssp/bin/spmon
        ls -l /usr/lpp/ssp/bin/s1term
        ls -l /usr/lpp/ssp/bin/spsvrmgr
        ls -l /usr/lpp/ssp/bin/hmckacls
        ls -l /usr/lpp/ssp/bin/hmgetacls
        ls -l /usr/lpp/ssp/bin/hmdceobj
        ls -l /usr/lpp/ssp/bin/spmon_itest
        ls -l /usr/lpp/ssp/bin/spmon_ctest
        ls -l /usr/lpp/ssp/install/bin/hmreinit
        ls -l /spdata/sys1/spmon/hmacls
        ls -l /spdata/sys1/spmon/hmthresholds
        ls -l /spdata/sys1/spmon/hwevents
        ls -l /spdata/sys1/ucode
 

Notes:

  1. The last entry, ucode, is a directory where one or more microcode files are located. It is part of the ssp.ucode file set, which is a prerequisite to the ssp.basic file set.

  2. If the control workstation is in DCE only mode, the hmacls file is not used, and therefore does not need to exist.

Good results are indicated if all of the files exist and the files that are located in a bin directory are executable. Proceed to Installation test 3 - Run the spmon_itest command.

Error results are indicated if one or more of these files do not exist. In this case, try reinstalling your SP system, or contact the IBM Support Center. If you have already reinstalled your SP system, resume diagnostics with Installation test 1 - Check ssp.basic file set.
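
Rather than issuing each ls command by hand, the existence and executability checks can be looped. This is a sketch under stated assumptions: check_paths is a hypothetical helper, and it treats any regular file under a bin directory as one that must be executable.

```shell
# check_paths PATH...
# Prints one line per problem and returns non-zero if any problem is found.
# A path is a problem if it does not exist, or if it is a regular file
# under a bin directory that is not executable.
check_paths() {
    cp_status=0
    for cp_path in "$@"; do
        if [ ! -e "$cp_path" ]; then
            echo "missing: $cp_path"
            cp_status=1
        elif [ -f "$cp_path" ] && [ ! -x "$cp_path" ]; then
            case "$cp_path" in
                */bin/*) echo "not executable: $cp_path"; cp_status=1 ;;
            esac
        fi
    done
    return "$cp_status"
}
```

Call it with the paths listed in this test, for example: check_paths /usr/lpp/ssp/bin/hardmon /usr/lpp/ssp/bin/hmmon /spdata/sys1/ucode. If the control workstation is in DCE only mode, leave /spdata/sys1/spmon/hmacls out of the list.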

Installation test 3 - Run the spmon_itest command

This test verifies that the System Monitor is installed correctly. Issue this command on the control workstation:

spmon_itest

Good results are indicated by output similar to:

 spmon_itest: Start spmon installation verification test
 spmon_itest: Verification Succeeded

In this case, proceed to Configuration test 1 - Check /etc/services file.

Error results are indicated in all other cases. Check the file /var/adm/SPlogs/spmon/spmon_itest.log for error messages, and take appropriate action based on the messages. Repeat this test after taking corrective actions based on the messages from spmon_itest.
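
If the test is run from a script, its result can be detected by looking for the success message shown above. A minimal sketch: verification_succeeded is a hypothetical name, and it assumes only that a successful run prints the text "Verification Succeeded".

```shell
# verification_succeeded
# Reads test output on standard input and succeeds only if it contains
# the success message.
verification_succeeded() {
    grep -q 'Verification Succeeded'
}
```

For example: spmon_itest 2>&1 | verification_succeeded || cat /var/adm/SPlogs/spmon/spmon_itest.log. The 2>&1 is there because it is not known here whether the messages go to standard output or standard error.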

Configuration verification tests

Use these tests to check that the System Monitor is configured properly.

Configuration test 1 - Check /etc/services file

This test verifies that there is an entry for hardmon in the /etc/services file on the control workstation. Browse the file /etc/services and look for hardmon in the left column:

      hardmon         8435/tcp

Good results are indicated if this entry is present. Proceed to Configuration test 2 - Check SDR Frame class.

Error results are indicated if this entry is not present. It should have been made during the installation of the SP system. Contact the IBM Support Center.
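
As an alternative to browsing the file, the entry can be searched for. The sketch below uses a hypothetical helper, has_service_entry, and assumes the usual /etc/services layout in which the service name is the first field and the port/protocol pair is the second.

```shell
# has_service_entry FILE NAME PORT/PROTO
# Succeeds only if FILE has a non-comment line whose first field is NAME
# and whose second field is PORT/PROTO.
has_service_entry() {
    awk -v name="$2" -v port="$3" '
        $1 ~ /^#/ { next }                       # skip comment lines
        $1 == name && $2 == port { found = 1 }
        END { exit !found }
    ' "$1"
}
```

For example: has_service_entry /etc/services hardmon 8435/tcp || echo "hardmon entry not found"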

Configuration test 2 - Check SDR Frame class

This test verifies that the SDR Frame class exists, that all MACN values are the same for all frames, and that these values are the same as the output of the vhostname command.

Issue these commands on the control workstation:

  1. SDRGetObjects Frame
  2. vhostname

Good results are indicated if the MACN values of all frames in the Frame class are identical to the value returned by the vhostname command. Proceed to Configuration test 3 - Check SDR SP_ports class.

Error results are indicated if there are differences in the values returned by these two commands. In this case, perform one of these corrective actions:

  1. Issue the command hmreinit to reinitialize the SDR Frame class.
  2. Use the SDRChangeAttrValues command to correct one or more incorrect MACN values, in the respective Frame object. That is, change the value to be identical to what is returned by the vhostname command. Refer to the entries for the hmreinit and SDRChangeAttrValues commands in PSSP: Command and Technical Reference.
  3. Change the system hostname to match the MACN attribute in the Frame class.
    Note:
    Whatever value the hostname has at this point, it must match the hostname attribute of the SDR SP_ports class.

Repeat this test after performing one of the corrective actions.
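
The comparison of MACN values against the vhostname output can be sketched as follows. The helper name all_values_equal is hypothetical. It expects one value per line on standard input, so the SDR output must first be reduced to the MACN column.

```shell
# all_values_equal EXPECTED
# Reads one value per line on standard input. Succeeds only if at least
# one value was read and every value equals EXPECTED. Blank lines are ignored.
all_values_equal() {
    awk -v want="$1" '
        NF == 0 { next }
        { seen++; if ($1 != want) bad++ }
        END { exit !(seen > 0 && bad == 0) }
    '
}
```

A possible invocation is: SDRGetObjects -x Frame MACN | all_values_equal "$(vhostname)". This assumes that the -x flag suppresses the header row and that naming an attribute restricts the output to that column. Verify both against the SDRGetObjects entry in PSSP: Command and Technical Reference before relying on it.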

Configuration test 3 - Check SDR SP_ports class

This test verifies that the SDR SP_ports class exists, and that an object exists whose daemon attribute is hardmon. The test also verifies that the hostname and port attributes for this object are correct.

Issue these commands on the control workstation:

  1. SDRGetObjects SP_ports

    The output is similar to:

    daemon       hostname               port
    hardmon      sup1.ppd.pok.ibm.com   8435
    haemd        ""                     10000
     
    
  2. vhostname

Good results are indicated if all of these conditions are true:

  1. There is an object in the SP_ports class whose daemon attribute is hardmon (as in the SDRGetObjects SP_ports output).
  2. The hostname attribute of this object is identical to the value returned by the vhostname command.
  3. The port attribute of this object is 8435.

In this case, proceed to Configuration test 4 - Check SDR Syspar class.

Error results are indicated if one or more of these conditions is not true. In this case, perform one of these corrective actions:

  1. If the hardmon entry does not exist, use the SDRCreateObjects command to create the missing SP_ports (hardmon entry) object.
  2. If the hardmon entry exists, use the SDRChangeAttrValues command to modify the incorrect SP_ports (hardmon entry) object.
  3. If the problem is that the hostname attribute does not match the value returned by the vhostname command, correct the problem by performing one of these steps:
    1. Change the system hostname so that they both match.
    2. Use the SDRChangeAttrValues command to change the hostname attribute so that they both match.
    Note:
    Whatever value the hostname has at this point, it must match the hostname attribute of the SDR SP_ports class.

After performing corrective actions, proceed to Configuration test 2 - Check SDR Frame class.
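
The three conditions of this test can be checked in one pass over the SDRGetObjects SP_ports output. This is a sketch only: check_hardmon_port is a hypothetical helper, and it assumes the column order shown in the sample output (daemon, hostname, port).

```shell
# check_hardmon_port EXPECTED_HOST
# Reads "SDRGetObjects SP_ports" output on standard input. Succeeds only if
# a hardmon object exists, its hostname equals EXPECTED_HOST, and its port
# is 8435. Prints a line for each condition that fails.
check_hardmon_port() {
    awk -v host="$1" '
        $1 == "hardmon" {
            found = 1
            if ($2 != host)   { print "hostname mismatch: " $2; bad = 1 }
            if ($3 != "8435") { print "unexpected port: " $3;   bad = 1 }
        }
        END {
            if (!found) { print "no hardmon object in SP_ports"; exit 1 }
            exit (bad ? 1 : 0)
        }
    '
}
```

For example: SDRGetObjects SP_ports | check_hardmon_port "$(vhostname)"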

Configuration test 4 - Check SDR Syspar class

This test verifies that the SDR Syspar class exists, and that one object exists for each system partition. Issue this command on the control workstation:

SDRGetObjects Syspar 

Good results are indicated if the Syspar class exists, and there is an object for each system partition. In this case, proceed to Configuration test 5 - Check SDR Switch class.

Error results are indicated in all other cases. If the Syspar class does not exist or it is empty, contact the IBM Support Center. If the Syspar class exists and is not empty, but there is not an object for each system partition, perform one of these actions:

  1. Try to reapply your partition configuration, by issuing the spapply_config command. For details of this command, refer to PSSP: Command and Technical Reference.
  2. Contact the IBM Support Center.

Configuration test 5 - Check SDR Switch class

Do not perform this test unless your system has one or more switches.

This test verifies that the SDR Switch class exists, and that one object exists for each switch in the system. Issue the following command on the control workstation:

SDRGetObjects Switch

Good results are indicated if the Switch class exists, and there is an object for each switch in the system. In this case, proceed to Configuration test 6 - Check SDR Syspar_map class.

Error results are indicated in all other cases. If the Switch class does not exist, contact the IBM Support Center. If the Switch class exists, but it is empty, or it does not contain one object for each switch in the system, perform one of these actions:

  1. Issue the SDR_config command. For details of this command, refer to PSSP: Command and Technical Reference.

    After running SDR_config, if there is still a problem with the SDR Switch class, it may be because the SDR_config command obtains information from hardmon, and hardmon is inactive or having problems. To determine if this is the case, see Operational test 1 - Check that hardmon Is active.

  2. Contact the IBM Support Center.

Configuration test 6 - Check SDR Syspar_map class

This test verifies that the SDR Syspar_map class exists, and that one or more objects exist. Issue this command on the control workstation:

SDRGetObjects Syspar_map

Good results are indicated if the Syspar_map class exists, and there is at least one object whose used attribute is 1. Proceed to Configuration test 7 - Check SDR NodeExpansion class.

Note:
Nodes that do not physically exist may have entries, but their used attribute should be 0.

Error results are indicated in all other cases. If the Syspar_map class does not exist, contact the IBM Support Center. If the Syspar_map class exists, perform one of these actions:

  1. Issue the SDR_config command. For details of this command, refer to PSSP: Command and Technical Reference. After running SDR_config, if there is still a problem with the SDR Syspar_map class, it may be because the SDR_config command obtains information from hardmon, and hardmon is inactive or having problems. To determine if this is the case, see Operational test 1 - Check that hardmon Is active.
  2. Contact the IBM Support Center.

Repeat this test after taking the suggested actions.
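
Counting the objects whose used attribute is 1 can be sketched as below. The helper count_used is hypothetical. Because the column order of the Syspar_map output is not shown in this guide, the sketch locates the column by its header name and assumes only that a header row containing the word used precedes the data rows.

```shell
# count_used
# Reads "SDRGetObjects Syspar_map" output on standard input and prints the
# number of objects whose "used" column is 1. Prints 0 if no header row
# containing a "used" column is found.
count_used() {
    awk '
        !col { for (i = 1; i <= NF; i++) if ($i == "used") col = i; next }
        $col == 1 { n++ }
        END { print n + 0 }
    '
}
```

For example: SDRGetObjects Syspar_map | count_used. A result of 1 or more corresponds to the good result described in this test.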

Configuration test 7 - Check SDR NodeExpansion class

Do not perform this test unless your system has one or more Expansion Nodes.

This test verifies that the SDR NodeExpansion class exists, and that one object exists for each Expansion Node in the system. Issue the following command on the control workstation:

SDRGetObjects NodeExpansion

Good results are indicated if the NodeExpansion class exists, and there is an object for each Expansion Node in the system. Proceed to Configuration test 8 - Check frame, node, and switch supervisor cards.

Error results are indicated in all other cases. If the NodeExpansion class does not exist, contact the IBM Support Center. If the NodeExpansion class exists, perform one of these actions:

  1. Issue the SDR_config command. For details of this command, refer to PSSP: Command and Technical Reference. After running SDR_config, if there is still a problem with the NodeExpansion class, it may be because the SDR_config command obtains information from hardmon, and hardmon is inactive or having problems. To determine if this is the case, see Operational test 1 - Check that hardmon Is active.
  2. Contact the IBM Support Center.

Repeat this test after taking the suggested actions.

Configuration test 8 - Check frame, node, and switch supervisor cards

This test verifies that all frame, node, and switch supervisor cards that support microcode download contain the latest microcode level. Issue this command on the control workstation:

smit supervisor

Select either of the two buttons labeled List Supervisors That Require Action.

Good results are indicated by output similar to:

 spsvrmgr: All specified supervisor hardware is current and active. 
           No further action is required at this time. 

Proceed to Configuration test 9 - Run the spmon_ctest command.

Error results are indicated in all other cases. A failure means that one or more of the supervisor cards require some action. Either refer to the entry for the spsvrmgr command in PSSP: Command and Technical Reference, or use smit to take the action on all of the supervisor cards at once or on one card at a time.

Issue this command:

smit supervisor

To take the required action on all supervisor cards, select the button labeled:

Update *ALL* Supervisors That Require Action

To take the required action on a single supervisor card, select the button labeled:

Update Selectable Supervisors That Require Action

Repeat this test after taking the suggested actions. If this test fails again after taking the required action on all supervisor cards that require action, contact the IBM Support Center.

Configuration test 9 - Run the spmon_ctest command

This test verifies that the System Monitor is configured correctly. Issue this command on the control workstation:

spmon_ctest 

Good results are indicated if the output is similar to:

              spmon_ctest: Start spmon configuration verification test
              spmon_ctest: Verification Succeeded

Proceed to Operational test 1 - Check that hardmon Is active.

Error results are indicated in all other cases. Check the file /var/adm/SPlogs/spmon_ctest.log for error messages, and take appropriate action based on the messages. Repeat this test after taking corrective action.

Operational verification tests

Use these tests to check that the System Monitor is operating properly.

Operational test 1 - Check that hardmon Is active

This test verifies that the System Monitor (hardmon) is active and running correctly. Issue these commands:

  1. lssrc -s hardmon
  2. ps -ef | grep hardmon

Output from the lssrc command is similar to:

              Subsystem         Group            PID     Status
              hardmon                            42532   active

Output from the ps command is similar to:

              root 42532  5966  0  Sep 15   0  9:42 /usr/lpp/ssp/bin/hardmon -r 5

Good results are indicated if all of the following are true:

  1. In the lssrc command output for the entry whose Subsystem is hardmon, the Status column is active.
  2. In the ps command output, the hardmon daemon uses the -r flag with an argument of 5.

    This means that the hardmon daemon polls each frame supervisor, including external hardware daemons, for state information, every five seconds. This is the default. If the hardmon daemon uses a value other than 5 for the -r flag argument, the daemon is not running according to IBM recommendations.

In this case, proceed to Operational test 2 - Run query command.

Error results are indicated in all other cases. To correct, see Action 2 - Start the hardmon daemon. Repeat this test after taking corrective actions. If the problem persists, contact the IBM Support Center.
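
Both conditions can be evaluated from captured command output. The following sketch uses two hypothetical helpers. It assumes the lssrc row ends with the Status value and that the daemon appears in the ps output with a path ending in /hardmon, as in the samples above.

```shell
# subsystem_active NAME
# Reads lssrc output on standard input. Succeeds only if the row whose
# first field is NAME has "active" as its last field.
subsystem_active() {
    awk -v name="$1" '$1 == name && $NF == "active" { ok = 1 } END { exit !ok }'
}

# poll_interval
# Reads "ps -ef" output on standard input and prints the argument of the
# -r flag of the hardmon daemon. Prints nothing if no such flag is found.
# Matching on "/hardmon" skips the "grep hardmon" process itself.
poll_interval() {
    awk '
        /\/hardmon( |$)/ {
            for (i = 1; i < NF; i++)
                if ($i == "-r") { print $(i + 1); exit }
        }
    '
}
```

For example: lssrc -s hardmon | subsystem_active hardmon && [ "$(ps -ef | poll_interval)" = "5" ] && echo "hardmon is active with the default polling interval"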

Operational test 2 - Run query command

This test verifies that you can run a typical System Monitor query command. Issue this command:

hmmon -GQ 1:0-17

Good results are indicated if you get state output for slot 0 (the frame itself) in frame 1 (which must always exist), and any nodes in frame 1. An example is:

frame 001, slot 00:
  node 01 I2C not responding   FALSE
  node 02 I2C not responding   TRUE
  node 03 I2C not responding   FALSE
  node 04 I2C not responding   TRUE
  node 05 I2C not responding   FALSE
  node 06 I2C not responding   TRUE
  node 07 I2C not responding   TRUE
  node 08 I2C not responding   TRUE
  node 09 I2C not responding   FALSE
  node 10 I2C not responding   TRUE
  node 11 I2C not responding   FALSE
  node 12 I2C not responding   FALSE
  node 13 I2C not responding   FALSE
  node 14 I2C not responding   FALSE
  node 15 I2C not responding   TRUE
  node 16 I2C not responding   FALSE
  switch I2C not responding    FALSE
  controller tail is active    TRUE
  node 01 serial link open     FALSE
  node 02 serial link open     FALSE
           .
           .
           .

Error results are indicated in all other cases. This test can fail for several reasons. Read any error messages that the hmmon command produces, and check the hardmon log for any new messages. The log is /var/adm/SPlogs/spmon/hmlogfile.ddd, where ddd is the Julian date on which the file was created. Refer to the relevant actions in Error symptoms, responses, and recoveries.
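
The good-result condition, state output for slot 0 of frame 1, can be checked by looking for the corresponding header line. A minimal sketch: has_slot_state is a hypothetical helper, and it assumes the header format shown in the example output (a three-digit frame number and a two-digit slot number).

```shell
# has_slot_state FRAME SLOT
# Reads "hmmon -GQ" output on standard input. Succeeds only if a header
# line of the form "frame NNN, slot NN:" is present for FRAME and SLOT.
has_slot_state() {
    hs_header=$(printf 'frame %03d, slot %02d:' "$1" "$2")
    grep -q "^$hs_header"
}
```

For example: hmmon -GQ 1:0-17 | has_slot_state 1 0 || echo "no state output for frame 1, slot 0"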

