IBM Books

Diagnosis Guide


Error symptoms, responses, and recoveries

Use this table to diagnose problems with the system monitor component of PSSP. Locate the symptom and perform the action described in the table.

Table 55. System Monitor symptoms

Symptom | Recovery
The ssp.basic file set is not installed. | See Action 1 - Install ssp.basic.
The hardmon daemon is not running. | See Action 2 - Start the hardmon daemon.
The hardmon daemon keeps terminating and then restarting. | See Action 3 - Investigate the hardmon daemon.
A System Monitor command, for example hmmon, does not work correctly. | See Action 4 - Investigate System Monitor command problems.
Hardmon performance is poor. | See Action 5 - Check paging space and CPU utilization.

Actions

Action 1 - Install ssp.basic

Run Installation test 1 - Check ssp.basic file set to verify that this is the problem. If it is, install the ssp.basic file set by issuing the installp command. Then run Installation test 1 - Check ssp.basic file set again to verify that the problem has been resolved.
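
The check-then-install sequence above can be sketched as follows. This is a minimal sketch, assuming that the AIX lslpp -l command returns a nonzero exit status when a file set is not installed. The function name check_ssp_basic and the image location /dev/cd0 are hypothetical; substitute the device or directory that holds your PSSP installation images.

```shell
# Sketch only: report whether the ssp.basic file set is installed.
# check_ssp_basic is a hypothetical helper name, not a PSSP command.
check_ssp_basic() {
    if lslpp -l ssp.basic >/dev/null 2>&1; then
        echo "ssp.basic is installed"
    else
        echo "ssp.basic is NOT installed"
        # Assumption: the installation images are on /dev/cd0; adjust as needed.
        echo "To install it, run: installp -a -X -d /dev/cd0 ssp.basic"
        return 1
    fi
}
```

Calling check_ssp_basic before and after the installation gives a quick confirmation that the file set is now present.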

Action 2 - Start the hardmon daemon

Run Operational test 1 - Check that hardmon Is active to verify that this is a problem.

If the hardmon daemon is not running on the control workstation, you must start it. Do this by issuing the command:

startsrc -s hardmon

Run Operational test 1 - Check that hardmon Is active to verify that hardmon was started successfully.
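
As an illustration of the start-if-needed logic, the following sketch reads the status reported by the System Resource Controller before issuing startsrc. It assumes that lssrc -s hardmon prints a header line followed by a line whose last column is the status (for example, active or inoperative). The function names are hypothetical.

```shell
# Sketch only: extract the Status column for hardmon from "lssrc -s hardmon"
# output supplied on standard input. hardmon_status is a hypothetical helper.
hardmon_status() {
    awk '$1 == "hardmon" { print $NF }'
}

# Start hardmon only when the System Resource Controller does not report it
# as active. Run as root on the control workstation.
start_hardmon_if_needed() {
    status=$(lssrc -s hardmon 2>/dev/null | hardmon_status)
    if [ "$status" = "active" ]; then
        echo "hardmon is already active"
    else
        echo "hardmon status is '${status:-unknown}'; starting it"
        startsrc -s hardmon
    fi
}
```

Run start_hardmon_if_needed on the control workstation, then repeat the operational test to confirm the result.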

If the hardmon daemon uses an incorrect polling interval, it may cause problems. The polling interval is chosen when the hardmon daemon is started by the System Resource Controller. The value is in the cmdargs attribute in the hardmon ODM SRCsubsys object. Check the polling interval by issuing the ODM command:

odmget -q subsysname=hardmon SRCsubsys

Output is similar to the following, which is the default:

            SRCsubsys:
                    subsysname = "hardmon"
                    synonym = ""
                    cmdargs = "-r 5"
                    path = "/usr/lpp/ssp/bin/hardmon"
                    uid = 0
                    auditid = 0
                    standin = "/dev/console"
                    standout = "/dev/console"
                    standerr = "/dev/console"
                    action = 1
                    multi = 0
                    contact = 2
                    svrkey = 0
                    svrmtype = 0
                    priority = 20
                    signorm = 15
                    sigforce = 15
                    display = 1
                    waittime = 15
                    grpname = ""

If the cmdargs attribute is not "-r 5", correct it by issuing the command:

chssys -s hardmon -a "-r 5"

Then reissue the odmget command to verify that the new cmdargs attribute is "-r 5", and run Operational test 1 - Check that hardmon Is active to verify that the problem is resolved.
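
The comparison against the default can also be scripted. The following is a minimal sketch that reads the odmget output on standard input, assuming the attribute appears in the form cmdargs = "value" as in the sample output above. The function names are hypothetical, and the sketch only prints the chssys command rather than running it.

```shell
# Sketch only: read "odmget -q subsysname=hardmon SRCsubsys" output on
# standard input and print the value of the cmdargs attribute.
# get_cmdargs and check_polling_interval are hypothetical helper names.
get_cmdargs() {
    sed -n 's/^[[:space:]]*cmdargs = "\(.*\)"[[:space:]]*$/\1/p'
}

# Compare the configured arguments with the documented default, "-r 5".
check_polling_interval() {
    current=$(get_cmdargs)
    if [ "$current" = "-r 5" ]; then
        echo "cmdargs is the default (-r 5)"
    else
        echo "cmdargs is '$current'; to restore the default, run:"
        echo 'chssys -s hardmon -a "-r 5"'
        return 1
    fi
}
```

A typical invocation would be: odmget -q subsysname=hardmon SRCsubsys | check_polling_interval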

Action 3 - Investigate the hardmon daemon

Possible causes of this problem and their corrective actions are:

  1. The hmthresholds file may not contain an entry for a type of hardware in the system, or there is a format error in the file.

    Repair the /spdata/sys1/spmon/hmthresholds file, referring to the comments located at the beginning of the file. If the file cannot be repaired, restore it from regular system backups or from the PSSP installation media.

  2. Unable to open a log file.

    Apart from a system error, the only reason that a log file cannot be opened is that the directory /var/adm/SPlogs/spmon does not exist. Verify that this directory exists, and create it if it does not.

  3. SP Security Services has not been initialized properly by hardmon.

    Examine the file /var/adm/SPlogs/spmon/hmlogfile.ddd, where ddd is the Julian date. Look for any error messages related to SP Security Services. Also, check the error log for error messages, by issuing the AIX errpt command. Take any action suggested to correct these errors.

  4. There is an SDR configuration error.

    The corrective action is the same as for cause 3.

  5. There is a system error. For example, a system call failed, a file descriptor was created that was larger than the allowed maximum size, or the system ran out of memory.

    The corrective action is the same as for cause 3.
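
The directory check for cause 2 and the log inspection for causes 3 through 5 can be sketched as follows. This sketch assumes that the ddd suffix of hmlogfile.ddd matches the three-digit day of the year printed by date +%j, and that relevant log lines contain the word "error"; the actual message wording may differ. The SPMON_LOG_DIR variable and the function names are hypothetical.

```shell
# Sketch only. SPMON_LOG_DIR defaults to the directory named in the text;
# the variable itself and the function names are hypothetical.
SPMON_LOG_DIR=${SPMON_LOG_DIR:-/var/adm/SPlogs/spmon}

# Cause 2: make sure the log directory exists, creating it if necessary.
ensure_log_dir() {
    [ -d "$SPMON_LOG_DIR" ] || mkdir -p "$SPMON_LOG_DIR"
}

# Causes 3 to 5: print the name of today's log file, hmlogfile.ddd,
# assuming ddd is the day of the year as printed by "date +%j".
todays_hmlogfile() {
    echo "$SPMON_LOG_DIR/hmlogfile.$(date +%j)"
}

# Show lines that mention errors in today's log. The search pattern is an
# assumption; hardmon message wording may differ.
show_hardmon_errors() {
    log=$(todays_hmlogfile)
    if [ -r "$log" ]; then
        grep -i "error" "$log"
    else
        echo "Cannot read $log"
        return 1
    fi
}
```

After reviewing the log, issue the AIX errpt command to look for related entries in the system error log.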

Action 4 - Investigate System Monitor command problems

Possible causes of this problem and their corrective actions are:

  1. The hardmon daemon is rejecting the command, because you are not authenticated to hardmon, or do not have the proper authorization for what the command is trying to do.

    If the problem is authentication or authorization, refer to Diagnosing SP Security Services problems.

  2. Some commands are partition sensitive. If the command takes slot numbers as a parameter, and one or more of the nodes are not in the current partition, and the "global" option (specified by the -G flag) is not used, the command will run as if that node does not exist. Also, for these commands, the "global" option must always be specified for frames and switches, because they do not reside in any system partition, including the current one.

    For example, the commands:

    hmmon -Q 1:0
    hmmon -Q 1:17

    produce an error message, since the -G flag was not specified, and 1:0 represents frame 1, and 1:17 represents the switch in frame 1.

    As another example, if the current system partition is named PART1 and frame 3 node 8 is in the system partition named PART2, the command:

    hmmon -Q 3:8

    produces an error message, since the -G flag was not specified, and frame 3 node 8 is not in the current system partition.

    If the -G flag was not specified, refer to the entry for the particular command in PSSP: Command and Technical Reference.

    Make sure that all objects in the SDR Syspar_map class reflect correct partitioning information. An error in one of these SDR objects can cause hardmon to be unable to locate nodes correctly.

  3. The command takes slot numbers as a parameter, and one or more of the specified nodes do not exist.

    Do not specify a target frame, node, or switch that does not exist in the system.

  4. The rs232 tty cables to one or more frames are not connected correctly.

    Verify that the rs232 tty cables from the control workstation to the frame that is not responding are connected correctly. Note that the S70, S7A and S80 type server frames have two rs232 tty cables attached to the control workstation. All other frames have one rs232 tty cable attached to the control workstation.

    If the rs232 tty cables were not connected to the proper frames or servers, and you have already configured these frames using the spframe command, perform these steps:

    a. Issue the spdelfram command to delete the affected frames, before you re-cable the frames.
    b. Re-cable the rs232 tty cables to the proper frames.
    c. Add the frames that were deleted in step 4a. Issue the spframe command using the -r yes operand.

  5. The hardmon daemon is rejecting the command because there is a frame ID mismatch, due to incorrect cabling. That is, the value of the controllerIDMismatch attribute for the frame is TRUE.

    Run the corrective action for cause 4. If the problem is not resolved, check for a frame ID mismatch by issuing the command:

    hmmon -GQv controllerIDMismatch F:0

    where F is the number of the frame on which the command is not working.

    If the value is TRUE, this means that the supervisor of this frame believes that it is attached to a frame other than the one it is physically attached to. To correct this, issue the command:

    hmcmds -G setid F:0

    where F is the number of the frame on which the command is not working.

    To verify the correction, wait 5 seconds, reissue the hmmon command and verify that the value of controllerIDMismatch is FALSE.

  6. The hmreinit command was run and is hanging.

    Use the kill command on the hmreinit process, and then reissue the hmreinit command.
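
Two of the checks above lend themselves to a short sketch. The first helper encodes the rule from cause 2 that slot 0 (the frame) and slot 17 (the switch) always require the -G flag; it cannot tell whether a node lies outside the current system partition, so node targets are left without -G. The second helper runs the frame ID correction from cause 5 and then reissues the query after the 5-second wait. All function names are hypothetical, and the commands are used exactly as they appear in the text.

```shell
# Sketch only: decide whether a frame:slot target always needs the -G flag.
# Per the text, slot 0 is the frame itself and slot 17 is the switch;
# neither belongs to any system partition. needs_global is a hypothetical
# helper, not a PSSP command.
needs_global() {
    slot=${1#*:}
    case "$slot" in
        0|17) return 0 ;;   # frame or switch: -G is always required
        *)    return 1 ;;   # node: -G is needed only outside the current partition
    esac
}

# Print the hmmon query for a target, adding -G where it is always required.
hmmon_query() {
    if needs_global "$1"; then
        echo "hmmon -GQ $1"
    else
        echo "hmmon -Q $1"
    fi
}

# Cause 5: reset the frame ID, wait 5 seconds as the text directs, then
# query the attribute so that the operator can confirm it is now FALSE.
reset_frame_id() {
    frame=$1
    hmcmds -G setid "$frame:0"
    sleep 5
    hmmon -GQv controllerIDMismatch "$frame:0"
}
```

For example, hmmon_query 1:17 prints a command that includes -G, while hmmon_query 3:8 does not; add -G yourself when the node is in a different system partition.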

Action 5 - Check paging space and CPU utilization

Possible causes of this problem and their corrective actions are:

  1. Paging space is too low.

    Check that the paging space is adequate and adjust it if necessary.

  2. Other processes are consuming CPU resources. For example, if you are using your control workstation as a boot server, the NFS daemons may be using most of the processor time.

    Use the vmstat command to check the overall CPU utilization.

    Check the CPU utilization of the hardmon and logging daemons. One method is to issue these commands:

    1. ps gvc | grep hardmon
    2. ps gvc | grep splogd

    If the CPU utilization rate is very high and this cannot be attributed to the hardmon or logging daemon, look for other processes that are consuming the CPU resources.
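
The paging space check can be sketched as follows. This sketch assumes the two-line summary format of the AIX lsps -s command (a header line, then the total size and the percentage used). The function names are hypothetical, and the 70 percent default limit is an illustrative assumption, not a value from the PSSP documentation.

```shell
# Sketch only: read "lsps -s" output on standard input and print the
# percentage of paging space in use. paging_used_pct is a hypothetical helper.
paging_used_pct() {
    awk 'NR == 2 { sub(/%/, "", $2); print $2 }'
}

# Warn when usage exceeds a limit. The default of 70 percent is an
# illustrative assumption, not a value from PSSP documentation.
check_paging() {
    limit=${1:-70}
    used=$(paging_used_pct)
    case "$used" in
        ''|*[!0-9]*)
            echo "Could not parse lsps output"
            return 2 ;;
    esac
    if [ "$used" -gt "$limit" ]; then
        echo "Paging space is ${used}% used (limit ${limit}%)"
        return 1
    fi
    echo "Paging space is ${used}% used"
}
```

A typical invocation would be: lsps -s | check_paging. Follow it with vmstat and the ps gvc commands shown above to examine CPU utilization.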

