IBM Books

Diagnosis Guide


Error symptoms, responses, and recoveries

Use the following table to diagnose problems with the PSSP software for all servers. Locate the symptom and perform the action described in this table.

Table 86. SP-attached Server and Clustered Enterprise Server symptoms

Symptom | Recovery
The ssp.basic file set is not installed. | See Action 1 - Install ssp.basic file set.
An external hardware daemon or its directory has not been created on the control workstation. | See Action 2 - Check permissions of ssp.basic.
The SDR Frame object was not created for a server. | See Action 3 - Correct SDR Frame object.
The SDR Node object was not created for a server. | See Action 4 - Correct SDR Node object.
The hardmon daemon is not running. | See Action 5 - Correct System Monitor polling interval.
The external hardware daemon is not running or not responding. | See Action 6 - Investigate external hardware daemon failure.
The SAMI communication is not available. | See Action 7 - Restore SAMI communication.
The S1 communication is not available. | See Action 8 - Restore S1 communication.
The switch port number for the server is not set correctly. | See Action 9 - Correct switch port number in SDR Syspar_map object.

Actions

Action 1 - Install ssp.basic file set

Perform Installation test 1 - Verify the ssp.basic file set to verify that this is a problem. Install the ssp.basic file set using the installp command. Repeat Installation test 1 - Verify the ssp.basic file set.
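The check, install, and recheck sequence above can be sketched as a small shell helper. This is a minimal sketch, not part of PSSP: it assumes that lslpp -l returns a nonzero exit status for a file set that is not installed, the function name fileset_state is hypothetical, and the installation device in the comment is only a placeholder.

```shell
# Sketch only: fileset_state is a hypothetical helper name.
# Prints "installed" or "missing" for the file set named in $1.
fileset_state() {
    if lslpp -l "$1" >/dev/null 2>&1; then
        echo installed
    else
        echo missing
    fi
}

# Example use on the control workstation. /dev/cd0 is an assumed
# installation device; substitute your own media or image directory.
#   [ "$(fileset_state ssp.basic)" = missing ] && installp -a -d /dev/cd0 ssp.basic
```

After installing, running the helper again corresponds to repeating Installation test 1.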

Action 2 - Check permissions of ssp.basic

Perform Installation test 2 - Check external hardware daemon to verify that this is a problem. Install the ssp.basic file set using the installp command. If any of the permissions or file attributes do not match what is shown in Installation test 2 - Check external hardware daemon, issue the chmod or chown command, as appropriate, to correct the attributes. Repeat Installation test 1 - Verify the ssp.basic file set and Installation test 2 - Check external hardware daemon.

Action 3 - Correct SDR Frame object

Perform Configuration test 1 - Check SDR Frame object to verify that this is a problem. Perform these steps:

  1. If the splstdata -f command that was run in Configuration test 1 - Check SDR Frame object returns an error message similar to:
    splstdata: 0022-001 The repository cannot be accessed. Return code was 80.
    

    refer to Diagnosing SDR problems to determine why the SDR cannot be accessed.

  2. If you can successfully access the SDR, create a Frame object for the one that is missing by issuing the spframe command with the appropriate parameters. For a description of the spframe command, refer to PSSP: Command and Technical Reference.
  3. If there were problems creating the SDR Frame object, investigate why the SDR_config command was unable to create the Frame object. Check the SDR_config log file, /var/adm/SPlogs/sdr/SDR_config.log, for error messages. If there are error messages for the Frame object creation, refer to Configuration test 5 - Check SDR Switch class through Configuration test 7 - Check SDR NodeExpansion class.
  4. If the Frame object was successfully created, but the server entry, output by the splstdata -f command, does not have correct information in its tty or s1_tty column, issue the spframe command with the correct values.
  5. If the Frame object was successfully created, but the server entry, output by the splstdata -f command, does not have correct information in its hardware_protocol column, you must first issue the spdelfram command. Then, create a new definition by issuing the spframe command with the correct values.

    For S70, S7A, and S80 servers, the value must be SAMI.

After correcting the problem, repeat Configuration test 1 - Check SDR Frame object.
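The hardware_protocol check in step 5 can be sketched as a filter over the splstdata -f output. This is a minimal sketch under stated assumptions: the output is assumed to contain a header line that names the columns (including hardware_protocol), data rows are assumed to start with the frame number, and empty values are assumed to be printed as "" so that columns stay aligned. Compare these assumptions with the typical output shown in Configuration test 1 before relying on the filter. The function name frame_protocols is hypothetical.

```shell
# Sketch only: frame_protocols is a hypothetical helper name.
# Reads splstdata -f output on stdin and prints "frame protocol" pairs.
frame_protocols() {
    awk '
        # Until the header is seen, look for the hardware_protocol column.
        col == 0 {
            for (i = 1; i <= NF; i++)
                if ($i == "hardware_protocol") col = i
            next
        }
        # Data rows start with a numeric frame number.
        $1 ~ /^[0-9]+$/ { print $1, $col }
    '
}

# Example use on the control workstation:
#   splstdata -f | frame_protocols
```

A frame for an S70, S7A, or S80 server that is listed with any value other than SAMI is a candidate for the spdelfram and spframe sequence in step 5.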

Action 4 - Correct SDR Node object

Perform Configuration test 2 - Check SDR Node object to verify that this is a problem. Perform these steps:

  1. If either of the two splstdata commands that were run in Configuration test 2 - Check SDR Node object returns an error message similar to:
     splstdata: 0022-001 The repository cannot be accessed. Return code was 80.
    

    refer to Diagnosing SDR problems to determine why the SDR cannot be accessed.

  2. If you can successfully access the SDR, first delete the incorrect frame definition with the spdelfram command, and then invoke the spframe command with the appropriate parameters. Refer to PSSP: Command and Technical Reference for these commands. When you issue the spframe command, the hardmon and logging daemons together create Node objects for S70, S7A, and S80 servers.
  3. If there were problems creating the Node or ProcessorExtensionNode objects, investigate why the SDR_config command was unable to create the objects. Check the SDR_config log file, /var/adm/SPlogs/sdr/SDR_config.log, for error messages. If there are error messages for the object creation, refer to Configuration test 5 - Check SDR Switch class through Configuration test 7 - Check SDR NodeExpansion class.

After correcting the problem, repeat Configuration test 2 - Check SDR Node object.
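The log check in step 3 can be sketched as a simple search of the SDR_config log file. This is a minimal sketch: the format of the messages in the log is not described in this section, so the case-insensitive search for the word "error" is an assumption, and the function name sdr_config_errors is hypothetical. Only the default log path is taken from this guide.

```shell
# Sketch only: sdr_config_errors is a hypothetical helper name.
# Prints, with line numbers, each line of the log that mentions "error".
# $1 is the log file; it defaults to the path named in this guide.
sdr_config_errors() {
    log=${1:-/var/adm/SPlogs/sdr/SDR_config.log}
    [ -r "$log" ] || { echo "cannot read $log" >&2; return 2; }
    # grep exits 0 when matches are found and 1 when there are none.
    grep -in 'error' "$log"
}

# Example use on the control workstation:
#   sdr_config_errors
```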

Action 5 - Correct System Monitor polling interval

Perform Operational test 1 - Check hardmon status to verify that this is a problem. Perform these steps:

  1. If the hardmon daemon is not running on the control workstation, you need to start it. Issue this command:
    startsrc -s hardmon
    

    Then, perform Operational test 1 - Check hardmon status again to determine if hardmon was started successfully.

  2. If the hardmon daemon uses an incorrect polling interval, it may cause problems. The polling interval is chosen when the hardmon daemon is started by the System Resource Controller. The value is in the cmdargs attribute in the hardmon ODM SRCsubsys object. Check the polling interval by issuing the ODM command:
    odmget -q subsysname=hardmon SRCsubsys
    

    The output is similar to the following, which is the default:

                    SRCsubsys:
                        subsysname = "hardmon"
                        synonym = ""
                        cmdargs = "-r 5"
                        path = "/usr/lpp/ssp/bin/hardmon"
                        uid = 0
                        auditid = 0
                        standin = "/dev/console"
                        standout = "/dev/console"
                        standerr = "/dev/console"
                        action = 1
                        multi = 0
                        contact = 2
                        svrkey = 0
                        svrmtype = 0
                        priority = 20
                        signorm = 15
                        sigforce = 15
                        display = 1
                        waittime = 15
                        grpname = ""
                
    

    If the cmdargs attribute is not "-r 5", correct this by issuing the following command:

    chssys -s hardmon -a "-r 5"
    

    Then reissue the odmget command to verify that the new cmdargs attribute is "-r 5".

After correcting the problem, repeat Operational test 1 - Check hardmon status.
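The polling interval check in step 2 can be sketched as a filter that extracts the cmdargs value from the odmget output shown above. This is a minimal sketch that assumes the attribute appears on one line in the form cmdargs = "value"; the function name hardmon_cmdargs is hypothetical.

```shell
# Sketch only: hardmon_cmdargs is a hypothetical helper name.
# Reads odmget SRCsubsys output on stdin and prints the cmdargs value
# without its surrounding double quotes.
hardmon_cmdargs() {
    sed -n 's/^[[:space:]]*cmdargs[[:space:]]*=[[:space:]]*"\(.*\)"[[:space:]]*$/\1/p'
}

# Example use on the control workstation:
#   args=$(odmget -q subsysname=hardmon SRCsubsys | hardmon_cmdargs)
#   [ "$args" = "-r 5" ] || chssys -s hardmon -a "-r 5"
```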

Action 6 - Investigate external hardware daemon failure

Perform Operational Test 2 - Check external hardware daemons to determine if the external hardware daemon is running. Perform Operational test 3 - Check frame responsiveness to determine if the external hardware daemon is responding. If either test produces error results, perform these steps:

  1. Several components of PSSP, involved with the operation of the servers, write data to log files. Check these log files and take appropriate action:
  2. If one of the external hardware daemons is not running, but it should be, check to see if a core dump was created. Refer to Dump information.
  3. If the System Monitor (hardmon) daemon is running, but an external hardware daemon is not running or not responding, issue the following command to start the external hardware daemon:
    hmcmds -G boot_supervisor F:0
    

    where F is the frame number of the server. This notifies the System Monitor that the external hardware daemon has stopped. The System Monitor then starts the daemon.

  4. If you have attempted to start an external hardware daemon, and it still does not start, issue the following command to stop and restart the System Monitor daemon (hardmon):
    hmreinit
    
    The System Resource Controller restarts the System Monitor daemon (hardmon), and the daemon then restarts all of the external hardware daemons. The hmreinit command also causes the SDR_config command to run, updating the SDR as necessary.
  5. If you have attempted to start an external hardware daemon by running the previous action, and it still does not start, issue the command:
    stopsrc -s hardmon
    

    to stop the System Monitor daemon (hardmon), and then issue the command:

    splstdata -f
    
    to see what ttys (tty and s1_tty) are needed by your external hardware daemons. Refer to Configuration test 1 - Check SDR Frame object for typical output. For S70, S7A, and S80 servers, if one or more of a server's ttys has a corresponding entry in the /etc/locks/ directory, delete these entries and repeat this step. A server may be prevented from starting if either of its two required ttys is locked.

After correcting the problem, repeat Operational Test 2 - Check external hardware daemons and Operational test 3 - Check frame responsiveness.
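The lock check in step 5 can be sketched as a helper that lists lock entries for the ttys reported by splstdata -f. This is a minimal sketch: it assumes that a lock entry's name ends with the base name of the tty (for example, an entry ending in tty1 for /dev/tty1), which this guide does not state. The function name tty_locks is hypothetical.

```shell
# Sketch only: tty_locks is a hypothetical helper name.
# Usage: tty_locks LOCKDIR TTY...
# Prints the path of each entry in LOCKDIR whose name ends with the
# base name of one of the given ttys.
tty_locks() {
    lockdir=$1
    shift
    for tty in "$@"; do
        name=${tty##*/}                 # /dev/tty1 -> tty1
        for entry in "$lockdir"/*"$name"; do
            # An unmatched pattern stays literal, so test for existence.
            [ -e "$entry" ] && echo "$entry"
        done
    done
    return 0
}

# Example use on the control workstation, with the tty and s1_tty
# values taken from the splstdata -f output:
#   tty_locks /etc/locks /dev/tty1 /dev/tty2
```

Review the listed entries before deleting them, and run the helper again afterward to confirm that none remain.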

Action 7 - Restore SAMI communication

Perform Operational test 4 - Check SAMI communications to determine if the SAMI communication is available. If you receive an error result, perform these steps:

  1. Verify that the SAMI (S70, S7A, S80) communication cable is not unplugged or loose, and that it is plugged into the correct tty socket of the control workstation.
  2. Verify the tty definition on the control workstation. The Enable LOGIN characteristic must be set to disable. Use smitty as follows:
              TYPE   :   smitty
              SELECT :   devices 
              SELECT :   TTY 
              SELECT :   Change / Show Characteristics of a TTY
              select the TTY of interest and press ENTER
              check the "Enable LOGIN" value    
    
  3. Verify that the serial port adapter on the control workstation does not have a hardware error, by checking the AIX Error Log.
  4. Verify that the server is operating properly. For more information, refer to the manual for the specific server.

After correcting the problem, repeat Operational test 4 - Check SAMI communications.
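As an alternative to walking through the smitty panels in step 2, the Enable LOGIN value can be read from the command line. This is a minimal sketch that assumes the tty device attribute behind Enable LOGIN is named login and that lsattr -E prints the current value in the second field; the function name tty_login_ok and the device name in the example are hypothetical.

```shell
# Sketch only: tty_login_ok is a hypothetical helper name.
# Usage: tty_login_ok TTYNAME EXPECTED   (for example: tty_login_ok tty1 disable)
# Succeeds when the login attribute reported by lsattr equals EXPECTED.
tty_login_ok() {
    # lsattr -E prints "attribute value description user-settable";
    # the second field is the current value.
    actual=$(lsattr -E -l "$1" -a login 2>/dev/null | awk '{ print $2; exit }')
    [ "$actual" = "$2" ]
}

# Example use on the control workstation (tty1 is an assumed device name):
#   tty_login_ok tty1 disable || echo "Enable LOGIN is not set to disable on tty1"
```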

Action 8 - Restore S1 communication

Perform Operational test 5 - Check S1 communications to verify that this is a problem. If you receive error results, perform these steps:

  1. Verify that the S1 communication cable is not unplugged or loose, and that it is plugged into the correct tty socket of the control workstation.
  2. Verify the S1 tty definition on both the control workstation and the server. On the control workstation, the Enable LOGIN characteristic must be set to disable. On the server itself, the Enable LOGIN characteristic must be set to enable. You can use smitty on the control workstation, and then rsh to the server. This is the smitty sequence:
              TYPE   :   smitty
              SELECT :   devices 
              SELECT :   TTY 
              SELECT :   Change / Show Characteristics of a TTY
              select the TTY of interest and press ENTER
              check the "Enable LOGIN" value    
    
  3. Verify that the serial port adapters on the control workstation and the server do not have hardware errors, by checking the AIX Error Log on each system.
  4. Verify that the server is operating properly. For more information, refer to the manual for the specific server.

After correcting the problem, repeat Operational test 5 - Check S1 communications.
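The AIX Error Log check in step 3 can be sketched with the errpt command, using its -d H flag to select hardware-class entries and its -N flag to select a resource name. This is a minimal sketch: the function name hw_error_count is hypothetical, the resource name in the example is an assumption, and the count treats the first output line as a column header.

```shell
# Sketch only: hw_error_count is a hypothetical helper name.
# Prints the number of hardware-class Error Log entries recorded for
# the resource named in $1.
hw_error_count() {
    # Skip the header line, then count the remaining entries.
    errpt -d H -N "$1" 2>/dev/null | awk 'NR > 1 { n++ } END { print n + 0 }'
}

# Example use (sa0 is an assumed serial adapter name). Run it on the
# control workstation and on the server:
#   [ "$(hw_error_count sa0)" -eq 0 ] || errpt -a -d H -N sa0
```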

Action 9 - Correct switch port number in SDR Syspar_map object

Perform Configuration test 3 - Check SDR Syspar_map object to verify that the switch port number is not set correctly for the server. Perform these steps:

  1. If the SDRGetObjects Syspar_map command that was run in Configuration test 3 - Check SDR Syspar_map object returns an error message similar to:
    0025-080 The SDR routine could not connect to server.
    

    or some other message indicating a problem with the System Data Repository, refer to Diagnosing SDR problems.

  2. If you can successfully access the SDR, delete the Frame object for the server, if one exists, by issuing the spdelfram command. For a description of the spdelfram command, see PSSP: Command and Technical Reference.
  3. Create a new Frame object for the server by issuing the spframe command with the appropriate parameters. The -n option specifies the switch port number for the server. If this is a system of clustered enterprise servers (no SP frames or SP Switches), you do not need to specify the switch port number. In this case, the SDR configuration command, which is invoked during the spframe -r yes processing, automatically assigns a valid value for you.

    Refer to PSSP: Planning Volume 2 for information on determining a valid switch port number, and for situations where you may not wish to have the SDR configuration command automatically assign one for you in a clustered enterprise server system. For a description of the spframe command, see PSSP: Command and Technical Reference.

  4. If there were problems creating the Frame or Syspar_map SDR objects, investigate why the SDR_config command was unable to create the objects by checking the SDR_config log file, /var/adm/SPlogs/sdr/SDR_config.log, for error messages. If there are error messages for the Frame or Syspar_map objects, refer to Configuration test 5 - Check SDR Switch class through Configuration test 7 - Check SDR NodeExpansion class.

After correcting the problem, repeat Configuration test 3 - Check SDR Syspar_map object.

