IBM Books

Diagnosis Guide


Descriptions of each condition


Table 1. Details about each condition to monitor

Condition Details
Frame Power Description: This indicates whether the frame power is on or off. When the frame power is off, nodes and switches in the frame cannot receive power.

Resource Variables:

  • SP_HW.Frame.frPowerOff (overall power)
  • SP_HW.Frame.frPowerOff_A (power supply A)
  • SP_HW.Frame.frPowerOff_B (power supply B)
  • SP_HW.Frame.frPowerOff_C (power supply C)
  • SP_HW.Frame.frPowerOff_D (power supply D)
  • SP_HW.Frame.frACLED (AC power OK)
  • SP_HW.Frame.frDCLED (DC power OK)

Notes: With the frPowerOff_* variables, only one needs to be on for the frame to receive power. If all of them are off, then the frame has no power. If power was not explicitly shut down by the system administrator, perform hardware diagnostics on the frame.
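The rule in the note above can be sketched in a few lines of Python. This is only an illustration: it assumes the state of each power supply has already been read (for example, with the hmmon command) and converted to a boolean in which True means the supply is on. The function name and the input mapping are hypothetical and are not part of PSSP.

```python
def frame_has_power(supply_on):
    """Return True if at least one frame power supply is on.

    supply_on maps a supply label ("A" through "D") to True when that
    supply is on. This encoding is an assumption made for illustration;
    it is not the raw value of the SP_HW.Frame.frPowerOff_* variables.
    """
    return any(supply_on.values())


# One working supply is enough for the frame to receive power.
print(frame_has_power({"A": False, "B": True, "C": False, "D": False}))

# With every supply off, the frame has no power.
print(frame_has_power({"A": False, "B": False, "C": False, "D": False}))
```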

Frame Controller Responding Description: This indicates whether the frame controller is responding to command requests. The SP system can function when the frame controller is not responding, but it will not be possible to obtain certain node hardware status (such as key switch position and LED readouts) or issue certain hardware commands (such as resetting the node).

When the controller fails, perform hardware diagnostics and replace the frame controller, if this is called for. Replacing the frame controller requires you to schedule down time for all nodes in that frame.

Resource Variable:

SP_HW.Frame.controllerResponds

Frame Controller ID Mismatch Description: This indicates whether the ID of the frame controller agrees with the ID stored for it in the HACWS supervisor card. If the IDs do not match, this indicates that the HACWS supervisor card is not properly wired to the frame (possibly using the wrong "tty" line). Have the wiring between the control workstations (primary and backup) and the frame controller checked, and if that does not solve the problem, perform hardware diagnostics on both the control workstations and the frame controller. Monitor this condition only when HACWS is installed on the SP system.

Resource Variable:

SP_HW.Frame.controllerIDMismatch

Frame Temperature Description: This indicates whether the frame's temperature is within the normal operational range. If the temperature goes out of range, hardware within the frame may fail. Make sure that all fans are working properly. There are resource variables that you can check with the SP Event Perspective or the hmmon command to determine this. Make sure that the frame has proper ventilation.

Resource Variable:

SP_HW.Frame.tempRange

Frame Node Slot Failures Description: This indicates whether or not the frame supervisor can communicate with the node supervisor attached to the frame slot. It is possible to see a "failure" in this condition when no real failure exists. For example, since a wide node occupies two slots in the frame but only has one node supervisor, one of the slots associated with the wide node will always show a "failure". Any slots where nodes are not attached will show a "failure", but this is OK. This is why it is important to know the layout of the SP system.

You should be concerned when the status changes to show a failure because it can indicate a failure in the node supervisor. The node may continue to function in this type of failure, but certain hardware status (LEDs, switch position) may not be available and commands (node reset) may not work. Run hardware diagnostics on the node connected to the frame slot showing a failure.

Resource Variables:

  1. SP_HW.Frame.nodefail1
  2. SP_HW.Frame.nodefail2
  3. SP_HW.Frame.nodefail3
  4. SP_HW.Frame.nodefail4
  5. SP_HW.Frame.nodefail5
  6. SP_HW.Frame.nodefail6
  7. SP_HW.Frame.nodefail7
  8. SP_HW.Frame.nodefail8
  9. SP_HW.Frame.nodefail9
  10. SP_HW.Frame.nodefail10
  11. SP_HW.Frame.nodefail11
  12. SP_HW.Frame.nodefail12
  13. SP_HW.Frame.nodefail13
  14. SP_HW.Frame.nodefail14
  15. SP_HW.Frame.nodefail15
  16. SP_HW.Frame.nodefail16
  17. SP_HW.Frame.nodefail17

Note: SP_HW.Frame.nodefail17 indicates a failure in the SP Switch supervisor. If the frame has no switch, this will always show a "failure".
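Because some slot "failures" are expected from the frame layout, it can help to separate them from the ones that need attention. The following Python sketch shows one way to do that, assuming you already have the list of slot numbers currently reporting a failure and a description of the frame layout. The function and parameter names are hypothetical; nothing here queries the SP system.

```python
def unexpected_slot_failures(failing_slots, empty_slots,
                             wide_node_unused_slots, has_switch):
    """Return the failing slots that the frame layout does not explain.

    failing_slots          -- slot numbers (1-17) whose nodefailN is set
    empty_slots            -- slots with no node attached
    wide_node_unused_slots -- for each wide node, the slot that has no
                              node supervisor of its own
    has_switch             -- False if the frame has no switch, in which
                              case slot 17 always shows a "failure"
    """
    expected = set(empty_slots) | set(wide_node_unused_slots)
    if not has_switch:
        expected.add(17)
    return sorted(set(failing_slots) - expected)


# Slots 2 (wide node), 9 (empty), and 17 (no switch) are expected;
# only slot 5 needs hardware diagnostics.
print(unexpected_slot_failures([2, 5, 9, 17], empty_slots=[9],
                               wide_node_unused_slots=[2],
                               has_switch=False))  # prints [5]
```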

Switch Power Description: This indicates whether the switch power is on or off. If the frame has no power, the switch will not have power, so this should be checked first. If the frame has power but the switch does not, ensure that the switch was not manually shut down, and perform hardware diagnostics on the switch.

Resource Variables:

  • SP_HW.Switch.nodePower
  • SP_HW.Switch.powerLED
  • SP_HW.Switch.shutdownTemp

Notes: SP_HW.Switch.nodePower indicates whether the power is on or off. SP_HW.Switch.powerLED indicates this as well, but also indicates whether the switch can receive power but is powered off. SP_HW.Switch.shutdownTemp indicates if the switch was powered off because of a high temperature condition.

Switch Hardware Environment Indicator Description: This indicates if the switch has detected any hardware anomalies that can cause or have caused a shutdown of the switch. Such anomalies are: incorrect voltage, fan failure, temperature out of range, and internal hardware failure. This indicator shows whether all is well, whether a condition exists that should be investigated, or whether the switch was forced to shut down because of these errors.

Resource Variable:

SP_HW.Switch.envLED

Notes: Any change in this indicator is worth investigating, even if the indicator shows that the problem is not yet critical. Check for fan failures in the SP Switch. There are additional resource variables that you can use to check this with the SP Event Perspective and the hmmon command. Perform hardware diagnostics on the switch. Schedule repair for any failing hardware components.

Switch Temperature Description: This indicates if the temperature inside the switch hardware is out of the normal operational range. If the temperature goes outside the normal range, the device may overheat and the hardware may fail.

Resource Variable:

SP_HW.Switch.tempRange

Notes: Check for fan failures in the SP Switch. There are additional resource variables that you can use to check this with the SP Event Perspective and the hmmon command. Ensure that the frame has proper ventilation.

Node Power Description: This indicates whether the node power is on or off. If the frame has no power, the node will not have power, so this should be checked first. If the frame has power but the node does not, ensure that the node was not manually shut down, and perform hardware diagnostics on the node.

Resource Variables:

SP_HW.Node.nodePower (should be used for SP-attached servers and clustered enterprise servers)

SP_HW.Node.powerLED

Notes: SP_HW.Node.nodePower indicates whether the power is on or off. SP_HW.Node.powerLED indicates this as well, but also indicates whether the node can receive power but is powered off.

Node Hardware Environment Indicator Description: This indicates if the node has detected any hardware anomalies that can cause or have caused a shut down of the node. Such anomalies are: incorrect voltage, fan failure, temperature out of range, or internal hardware failure. This indicator shows whether all is well, whether a condition exists that should be investigated, or whether the node was forced to shut down because of these errors.

Resource Variable:

SP_HW.Node.envLED

Notes: Any change in this indicator is worth investigating, even if the indicator shows that the problem is not yet critical. Check for fan failures in the node. There are additional resource variables that you can use to check this with the SP Event Perspective and the hmmon command. Perform hardware diagnostics on the node. Schedule repair for any failed hardware components.

Node Temperature Description: This indicates if the temperature inside the node hardware is out of the normal operational range. If the temperature goes outside the normal range, the node may overheat and the hardware may fail.

Resource Variable:

SP_HW.Node.tempRange

Notes: Check for fan failures in the node. There are additional resource variables that you can use to check this with the SP Event Perspective and the hmmon command. Ensure that the frame has proper ventilation, and check that all air paths within the frame are not clogged.

Node Key Mode Switch Position Description: This shows the current setting of the node's mode switch. During node boot, the key switch position controls whether the operating system is loaded, and whether service controls are activated. During system operation, the position controls whether the node can be reset and whether system dumps can be initiated. For everyday operation, the key should be in the "Normal" position and should not change. A command must be issued to change the key position. If this does occur, locate the person changing this control and ensure that this action was taken for a proper reason.

Resource Variable:

SP_HW.Node.keyModeSwitch

Note: Not all nodes have a key switch.

Node LED or LCD Readout Description: Each node has an LCD or LED display. This display indicates the status of hardware and software testing during the node's boot process. This display is also used to display specific codes when node hardware or the operating system software fails. These codes indicate the nature of the failure, and whether any additional error data may be present. After a node has successfully booted, this display should be blank. If the display is not blank, use either the SP Hardware Perspective or the spmon command to determine what value is being displayed, and consult SP-specific LED/LCD values, Network installation progress, and Other LED/LCD codes to determine what the LED or LCD value means.

Resource Variable:

SP_HW.Node.LCDhasMessage

Node Reachable by RSCT Group Services Description: This indicates whether RSCT Group Services can reach the node through any of its network adapters and the switch. If this indicates that the node is not reachable, all of the node's network and switch adapters have either failed or been disabled. In this case, the only way to reach the node when it is powered on is through the node's serial link, using the s1term command. When this happens, check the network adapter status and issue the /etc/ifconfig on command to enable the adapter. Also, check the switch status and perform problem determination procedures for Group Services and Topology Services.

Resource Variable:

Membership.Node.state

Node has a processor that is offline Description: This indicates that when the node was booted, one or more processors did not respond. However, there is at least one active processor, so the node functions.

Resource Variable:

processorsOffline

/tmp file system becoming full Description: Each node has its own locally available /tmp file system. This file system is used as temporary storage for many AIX and PSSP utilities. If this file system runs out of space, these utilities can fail, causing failures in those PSSP and LP utilities that depend on them. When this file system nears its storage capacity, it should be checked for large files that can be removed, or the file system size should be increased.

Resource Variable:

aixos.FS.%totused

Resource Identifier:

VG = rootvg LV = hd3

/var file system becoming full Description: Each node has its own locally available /var file system. This file system contains system logs, error logs, trace information, and other important node-specific files. If this file system runs out of space, log entries cannot be recorded, which can lead to loss of error information when critical errors occur, leaving you and IBM service personnel without an audit or debug trail. When the file system nears its storage capacity, it should be checked for old log information that can be removed or cleared, the file system size should be increased, or separate file systems should be made for subdirectories that consume large amounts of disk space.

Resource Variable:

aixos.FS.%totused

Resource Identifier:

VG = rootvg LV = hd9var

/ file system becoming full Description: Each node has its own locally available root file system. This file system contains important node boot and configuration information, as well as LP binaries and configuration files. If this file system runs out of space, it may not be possible to install products on the node, or update that node's configuration information (although the SMIT and AIX-based install procedures should attempt to acquire more space). When this file system nears its storage capacity, it should be checked for core files or any other large files that can be removed, or the file system's size should be increased.

Resource Variable:

aixos.FS.%totused

Resource Identifier:

VG = rootvg LV = hd4
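For a quick local check of the three file systems described above, the percentage of space in use can be approximated with the Python standard library. This sketch does not read the aixos.FS.%totused resource variable; it calls shutil.disk_usage on the mount point, so its figure may differ slightly from what Event Management reports. The 90 percent threshold and the function names are assumptions chosen for illustration.

```python
import os
import shutil


def percent_used(path):
    """Return the percentage of space in use on the file system holding path."""
    usage = shutil.disk_usage(path)
    if usage.total == 0:
        return 0.0
    return 100.0 * usage.used / usage.total


def nearly_full(paths, threshold=90.0):
    """Return the paths whose file systems are at or above threshold percent."""
    return [p for p in paths if percent_used(p) >= threshold]


# Check the file systems named in this table, skipping any that do not
# exist on the machine where the sketch is run.
candidates = [p for p in ("/tmp", "/var", "/") if os.path.isdir(p)]
for path in candidates:
    print(path, round(percent_used(path), 1))
print("nearly full:", nearly_full(candidates))
```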

Paging Space Low Description: Each node has at least one locally available paging device. When all these paging devices near their capacity, the node begins to thrash, spending more time and resources to process paging requests than to process user requests. When operating as part of a parallel process, the thrashing node will delay all other parts of the parallel process that wait for this node to complete its processing. It can also cause timeouts for other network and distributed processes. A temporary fix is to terminate any non-critical processes that are using large amounts of memory. If this is a persistent problem, a more permanent fix is to restrict the node to specific processing only, or to add additional paging devices.

Resource Variable:

aixos.PagSp.%totalused

Kernel Memory Buffer Failures Description: Kernel memory buffers, or "mbufs", are critical to network processing. These buffers are used by the kernel network protocol code to transfer network messages. If the kernel begins to encounter failures in acquiring these buffers, network information packets can be lost, and network applications will not run efficiently. An occasional failure can be tolerated, but numerous failures or a continuous stream of small failures indicates that not enough memory has been allocated to the kernel memory buffer pool.

Resource Variable:

aixos.Mem.Kmem.failures

Resource Identifier:

Type = mbuf

Switch Input and Output Errors Description: The switch device driver tracks statistics on the device's use, including any errors detected by the driver. These errors are tracked as counters that are never reset, unless the node is rebooted. Consult Diagnosing SP Switch problems and Diagnosing SP Switch2 problems for assistance in diagnosing any reported errors for the SP Switch or SP Switch2.

Resource Variables:

  • CSS.ibadpackets
  • CSS.ipackets_drop
  • CSS.ierrors
  • CSS.opackets_drop
  • CSS.oerrors
  • CSS.xmitque_ovf

Notes: Any increment in the value of the CSS.ierrors or CSS.oerrors counters indicates that the switch adapter is about to go offline. Continual increments to the CSS.ibadpackets counter can indicate transmission problems or "noise" in the connection between the SP Switch adapter and the SP Switch, so the SP Switch cabling should be checked and hardware diagnostics performed. Continual increments to the CSS.ipackets_drop and CSS.opackets_drop counters indicate that there is either too much input or too much output for the SP Switch device driver to handle, and packets are lost.
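Because these counters only grow until the node is rebooted, what matters is the change between two samples rather than the absolute value. The following Python sketch compares two samples and applies the guidance from the notes. It assumes the counter values have already been collected into dictionaries keyed by resource variable name; how they are collected, and the function names, are hypothetical. A single comparison cannot show "continual" increments, so in practice the check would be repeated over several samples.

```python
def counter_increments(previous, current):
    """Return counters that grew between two samples, with the amount of growth."""
    return {
        name: current[name] - previous[name]
        for name in current
        if name in previous and current[name] > previous[name]
    }


def assess(previous, current):
    """Map counter growth to the actions suggested in the notes above."""
    grown = counter_increments(previous, current)
    findings = []
    if "CSS.ierrors" in grown or "CSS.oerrors" in grown:
        findings.append("switch adapter may be about to go offline")
    if "CSS.ibadpackets" in grown:
        findings.append("check SP Switch cabling and run hardware diagnostics")
    if "CSS.ipackets_drop" in grown or "CSS.opackets_drop" in grown:
        findings.append("device driver is overloaded and packets are being lost")
    return findings


before = {"CSS.ierrors": 0, "CSS.ibadpackets": 10, "CSS.ipackets_drop": 3}
after = {"CSS.ierrors": 1, "CSS.ibadpackets": 10, "CSS.ipackets_drop": 3}
print(assess(before, after))
```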

inetd Daemon Activity Description: The inetd master daemon is responsible for activating many AIX and PSSP service daemons when a client for that service connects to the node. If the daemon fails, these services cannot be started. Since many SP applications are network applications, this can cause widespread failure in all SP applications. If the daemon cannot be restarted manually, force a system dump of this node, collect information for the IBM Support Center, and restart the node. The reboot may temporarily resolve the problem.

Resource Variable:

Prog.xpcount

Resource Identifier:

ProgName=inetd UserName = root

srcmstr Daemon Activity Description: The srcmstr daemon implements the System Resource Controller functions. If this daemon fails, services registered with the SRC cannot be controlled using SRC commands. If the daemon cannot be restarted manually, force a system dump of this node, collect information for the IBM Support Center, and restart the node. The reboot may temporarily repair the problem.

Resource Variable:

Prog.xpcount

Resource Identifier:

ProgName=srcmstr UserName = root

biod Daemon Activity Description: The biod daemon handles block I/O requests for the NFS file system. In order for NFS to function on a node, at least one biod daemon must be active. For normal NFS activity, six to eight biod daemons are usually active on a node. For higher NFS activity, some nodes may have more. These daemons are started at node boot, run continuously, and should not shut down. If any daemons shut down, consult the NFS documentation for diagnostic procedures, and attempt to restart the daemon.

Resource Variable:

Prog.pcount

Resource Identifier:

ProgName=biod UserName = root
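The daemon activity conditions in this table are identified by a program name and a user name, and they track how many matching processes are running. A comparable manual check can be sketched in Python by counting matching lines in ps output. This is an illustration only: it does not read the Prog.pcount resource variable, the function names are hypothetical, and it assumes a ps command that accepts the POSIX options -e and -o.

```python
import os
import subprocess


def count_processes(ps_output, prog_name, user_name="root"):
    """Count lines of 'user command' text that match the given program and user."""
    count = 0
    for line in ps_output.splitlines():
        fields = line.split(None, 1)
        if len(fields) != 2:
            continue
        user, command = fields
        if user == user_name and os.path.basename(command.strip()) == prog_name:
            count += 1
    return count


def running_count(prog_name, user_name="root"):
    """Run ps and count the processes for prog_name owned by user_name."""
    result = subprocess.run(
        ["ps", "-e", "-o", "user=", "-o", "comm="],
        capture_output=True, text=True, check=True,
    )
    return count_processes(result.stdout, prog_name, user_name)


# Sample text in the same two-column shape that the ps call above requests.
sample = """root     biod
root     biod
root     inetd
daemon   biod
"""
print(count_processes(sample, "biod"))  # prints 2
```

On a node, running_count("biod") returning 0 would correspond to the situation described above in which NFS cannot function.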

portmap Daemon Activity Description: This daemon knows all the registered ports on the node, and which programs are available on each of these ports. The daemon's task is to convert Remote Procedure Call (RPC) program numbers to Internet port numbers. RPC clients use this daemon to find the port on which an RPC server is listening. If the daemon fails, the daemon itself and all RPC servers on the node must be restarted.

Resource Variable:

Prog.pcount

Resource Identifier:

ProgName=portmap UserName = root

xntpd Daemon Activity Description: This daemon is active on a node when Network Time Protocol (NTP) time synchronization is running. This daemon ensures that the node's time-of-day hardware is synchronized with the network's time server. A failure in the daemon does not necessarily mean that the time-of-day hardware on the node will no longer be synchronized with the network, although this danger does exist. A failure in the daemon does mean that time change updates from the network server will not be made on this node. Such problems can lead to failures in RSCT's Topology Services component, which may begin to see packets arriving out of chronological order, and may cause RSCT to falsely detect that one of its peer nodes has failed.

Resource Variable:

Prog.pcount

Resource Identifier:

ProgName=xntpd UserName = root

kerberos Daemon Activity Description: The kerberos daemon runs on the node where the Kerberos databases are stored. You need to know which node this is to properly check this condition. The daemon is responsible for accepting Kerberos V4 client requests for principal information, service tickets, and Kerberos V4 database maintenance.

Failure in this daemon will cause failures in Kerberos V4 clients to acquire or validate credentials, which will lead to denial of service for users of the Kerberos V4 clients. If this daemon fails, consult Diagnosing SP Security Services problems for Kerberos V4 diagnostics, and attempt to restart the daemon.

Resource Variable:

Prog.pcount

Resource Identifier:

ProgName=kerberos UserName = root

hatsd Daemon Activity Description: This is the RSCT Topology Services daemon, which is responsible for maintaining an internal topology map of the SP system on this node. The daemon is under SRC control, and should restart automatically if it is accidentally terminated. If this daemon fails and does not restart, the node will be seen as "down" by all other nodes in this system partition.

If this daemon fails to restart, the RSCT Group Services daemon and the RSCT Event Management daemon on the node will also fail. This daemon's status cannot be monitored by the SP Event Perspective or Problem Management, because these two facilities depend on the daemon for their own processing. To check this daemon's activity, you must use the lssrc -g hats command or the ps -ef | grep hats command.

hagsd Daemon Activity Description: This is the RSCT Group Services daemon, which is responsible for handling Group Services functions for all Group Services clients on this node. The daemon is under SRC control, and should restart automatically if it is accidentally terminated. If this daemon fails and does not restart, all Group Services clients on this node will appear to have failed, as far as the Group Services group members are concerned.

Those groups will begin their member failure processing for the Group Services clients on this node. The daemon's status cannot be monitored by the SP Event Perspective or Problem Management, because these two facilities depend on the daemon for their own processing. To check this daemon's activity, you must use the lssrc -g hags command or the ps -ef | grep hags command.

haemd Daemon Activity Description: This is the RSCT Event Management daemon, which is responsible for handling Event Management registrations on this node and communicating with other Event Management daemons in this system partition. The daemon is under SRC control, and should restart automatically if it is accidentally terminated. If this daemon fails and does not restart, none of the Event Management resource variables from this node will be available to Event Management applications for monitoring or event generation purposes.

These affected applications include Problem Management and the SP Event Perspective. This daemon's status cannot be monitored by the SP Event Perspective or Problem Management, because these two facilities depend on the daemon for their own processing. To check this daemon's activity, you must use the lssrc -g haem command or the ps -ef | grep haem command.
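Since hatsd, hagsd, and haemd must be checked by hand with lssrc, the check can be scripted. The Python sketch below parses text in the layout that lssrc -g normally prints (a header line, then one row per subsystem ending in its status) and reports any subsystem that is not active. The column layout is an assumption based on typical lssrc output, the sample text is made up, and the function names are hypothetical.

```python
def parse_lssrc(output):
    """Return {subsystem: status} from lssrc -g style output.

    Assumes a header row beginning with "Subsystem" and rows whose last
    field is the status. The PID column is blank when a subsystem is
    inoperative, so only the first and last fields are used.
    """
    statuses = {}
    for line in output.splitlines():
        fields = line.split()
        if len(fields) < 3 or fields[0] == "Subsystem":
            continue
        statuses[fields[0]] = fields[-1]
    return statuses


def not_active(output):
    """Return the subsystems whose status is anything other than 'active'."""
    return sorted(name for name, status in parse_lssrc(output).items()
                  if status != "active")


# Made-up sample rows in the assumed layout.
sample = """Subsystem         Group            PID     Status
 hats             hats             12345   active
 hags             hags                     inoperative
"""
print(parse_lssrc(sample))
print(not_active(sample))  # prints ['hags']
```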

sdrd Daemon Activity Description: This daemon runs on the control workstation (and therefore must be checked only on that node), and services all requests made of the System Data Repository (SDR). Although a failure in this daemon may not have any immediate consequences, PSSP software services will not be able to access SDR information, and can fail at later times when this information is needed. Certain hardware monitoring capability can also be lost, and may result in widespread, falsely detected "node not responding" failures.

Resource Variable:

Prog.pcount

Resource Identifier:

ProgName=sdrd UserName = root

dced Daemon Activity Description: The DCE client for the dcecp command. This client runs on each host in the SP system when DCE authentication is used. Failures in this daemon can prevent you from administering principals, accounts, passwords and server keys and obtaining other necessary information regarding DCE users. For problem resolution, refer to IBM DCE for AIX, Version 3.1: Problem Determination Guide.

Resource Variable:

Prog.pcount

Resource Identifier:

ProgName=dced UserName = root

cdsadv Daemon Activity Description: The DCE client for the Cell Directory Service. This client runs on each host in the SP system when DCE authentication is used. The cdsd and cdsadv daemons together provide the Cell Directory Service, which is essentially a distributed information database. Failures in the client or server daemon may cause problems accessing information in the same way as a file system failure on a local machine. Items of interest that are kept in the Cell Directory Service include keytab objects and account objects. For problem resolution, refer to IBM DCE for AIX, Version 3.1: Problem Determination Guide.

Resource Variable:

Prog.pcount

Resource Identifier:

ProgName=cdsadv UserName = root

cdsd Daemon Activity Description: The DCE server for the Cell Directory Service. This master server runs on the host you have designated to be the server host for DCE for your SP system. The cdsd and cdsadv daemons together provide the Cell Directory Service, which is essentially a distributed information database. Failures in the client or server daemon may cause problems accessing information in the same way as a file system failure on a local machine. Items of interest that are kept in the Cell Directory Service include keytab objects and account objects. For problem resolution, refer to IBM DCE for AIX, Version 3.1: Problem Determination Guide.

Resource Variable:

Prog.pcount

Resource Identifier:

ProgName=cdsd UserName = root

secd Daemon Activity Description: The DCE Security Server. This master server runs on the host you have designated to be the server host for DCE for your SP system. Failures in this daemon cause problems relating to authentication and authorization. A typical problem would be a login failure through dce_login. SP Services using DCE for authenticating trusted services may hang if the DCE Security Server is not running. The hang condition can also occur if secd is running but the registry has been disabled. You may see messages such as "Cannot find KDC for requested realm".

Resource Variable:

Prog.pcount

Resource Identifier:

ProgName=secd UserName = root

