Table 1. Details about each condition to monitor
Condition | Details |
---|---|
Frame Power | Description: Whether the frame has its power on or
off. When the frame power is off, nodes and switches in the frame
cannot receive power.
Resource Variables:
Notes: With the frPowerOff_* variables, only one needs to be on for the frame to receive power. If all of them are off, then the frame has no power. If power was not explicitly shut down by the system administrator, perform hardware diagnostics on the frame. |
Frame Controller Responding | Description: This indicates whether the frame controller
is responding to command requests. The SP system can function when the
frame controller is not responding, but it will not be possible to obtain
certain node hardware status (such as key switch position and LED readouts) or
issue certain hardware commands (such as resetting the node).
When the controller fails, perform hardware diagnostics and, if called for, replace the frame controller. Replacing the frame controller requires you to schedule downtime for all nodes in that frame. Resource Variable: SP_HW.Frame.controllerResponds |
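
A quick manual cross-check of the frame conditions above is available from the control workstation through the spmon diagnostic listing, which also reports node power and host responds. This is a minimal sketch; the exact columns shown depend on your PSSP level.

```sh
# On the control workstation: diagnostic summary for frames and nodes,
# including frame power, controller responds, node power, and host responds.
spmon -d -G
```
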
Frame Controller ID Mismatch | Description: This indicates whether the ID of the frame
controller agrees with the ID stored for it in the HACWS supervisor
card. If the IDs do not match, this indicates that the HACWS supervisor
card is not properly wired to the frame (possibly using the wrong "tty"
line). Have the wiring between the control workstations (primary and
backup) and the frame controller checked, and if that does not solve the
problem, perform hardware diagnostics on both the control workstations and the
frame controller. Monitor this condition only when HACWS is installed
on the SP system.
Resource Variable: SP_HW.Frame.controllerIDMismatch |
Frame Temperature | Description: This indicates whether the frame's
temperature is within the normal operational range. If the temperature
goes out of range, hardware within the frame may fail. Make sure
that all fans are working properly. There are resource variables that
you can check with the SP Event Perspective or the hmmon command to
determine this. Make sure that the frame has proper ventilation.
Resource Variable: SP_HW.Frame.tempRange |
Frame Node Slot Failures | Description: This indicates whether the frame
supervisor can communicate with the node supervisor attached to the frame
slot. It is possible to see a "failure" in this condition when no real
failure exists. For example, since a wide node occupies two slots in
the frame but only has one node supervisor, one of the slots associated with
the wide node will always show a "failure". Any slots where nodes are
not attached will show a "failure", but this is OK. This is why it is
important to know the layout of the SP system.
You should be concerned when the status changes to show a failure because it can indicate a failure in the node supervisor. The node may continue to function in this type of failure, but certain hardware status (LEDs, switch position) may not be available and commands (node reset) may not work. Run hardware diagnostics on the node connected to the frame slot showing a failure. Resource Variables:
Note: SP_HW.Frame.nodefail17 indicates a failure in the SP Switch supervisor. If the frame has no switch, this will always show a "failure". |
Switch Power | Description: This indicates whether the switch power is on
or off. If the frame has no power, the switch will not have power, so
this should be checked first. If the frame has power but the switch
does not, ensure that the switch was not manually shut down, and perform
hardware diagnostics on the switch.
Resource Variables:
Notes: SP_HW.Switch.nodePower indicates whether the power is on or off. SP_HW.Switch.powerLED indicates this as well, but also indicates whether the switch can receive power but is powered off. SP_HW.Switch.shutdownTemp indicates if the switch was powered off because of a high temperature condition. |
Switch Hardware Environment Indicator | Description: This indicates if the switch has detected any
hardware anomalies that can cause or have caused a shutdown of the
switch. Such anomalies are: incorrect voltage, fan failure,
temperature out of range, and internal hardware failure. This indicator
shows whether all is well, whether a condition exists that should be
investigated, or whether the switch was forced to shut down because of these
errors.
Resource Variable: SP_HW.Switch.envLED Notes: Any change in this indicator is worth investigating, even if the indicator shows that the problem is not yet critical. Check for fan failures in the SP Switch. There are additional resource variables that you can use to check this with the SP Event Perspective and the hmmon command. Perform hardware diagnostics on the switch. Schedule repair for any failing hardware components. |
Switch Temperature | Description: This indicates if the temperature inside the
switch hardware is out of the normal operational range. If the
temperature goes out of the normal range, the device may overheat and the
hardware may fail.
Resource Variable: SP_HW.Switch.tempRange Notes: Check for fan failures in the SP Switch. There are additional resource variables that you can use to check this with the SP Event Perspective and the hmmon command, and ensure that the frame has proper ventilation. |
Node Power | Description: This indicates whether the node power is on
or off. If the frame has no power, the node will not have power, so
this should be checked first. If the frame has power but the node does
not, ensure that the node was not manually shut down, and perform hardware
diagnostics on the node.
Resource Variables: SP_HW.Node.nodePower (should be used for SP-attached servers and clustered enterprise servers) SP_HW.Node.powerLED Notes: SP_HW.Node.nodePower indicates whether the power is on or off. SP_HW.Node.powerLED indicates this as well, but also indicates whether the node can receive power but is powered off. |
Node Hardware Environment Indicator | Description: This indicates if the node has detected any
hardware anomalies that can cause or have caused a shutdown of the
node. Such anomalies are: incorrect voltage, fan failure,
temperature out of range, or internal hardware failure. This indicator
shows whether all is well, whether a condition exists that should be
investigated, or whether the node was forced to shut down because of these
errors.
Resource Variable: SP_HW.Node.envLED Notes: Any change in this indicator is worth investigating, even if the indicator shows that the problem is not yet critical. Check for fan failures in the node. There are additional resource variables that you can use to check this with the SP Event Perspective and the hmmon command. Perform hardware diagnostics on the node. Schedule repair for any failed hardware components. |
Node Temperature | Description: This indicates if the temperature inside the
node hardware is out of the normal operational range. If the
temperature goes out of the normal range, the node may overheat and the
hardware may fail.
Resource Variable: SP_HW.Node.tempRange Notes: Check for fan failures in the node. There are additional resource variables that you can use to check this with the SP Event Perspective and the hmmon command. Ensure that the frame has proper ventilation, and check that all air paths within the frame are not clogged. |
Node Key Mode Switch Position | Description: This shows the current setting of the
node's mode switch. During node boot, the key switch position
controls whether the operating system is loaded, and whether service controls
are activated. During system operation, the position controls whether
the node can be reset and whether system dumps can be initiated. For
everyday operation, the key should be in the "Normal" position and should not
change. A command must be issued to change the key position. If
the position does change, locate the person who changed it and confirm that
the change was made for a proper reason.
Resource Variable: SP_HW.Node.keyModeSwitch Note: Not all nodes have a key switch. |
Node LED or LCD Readout | Description: Each node has an LCD or LED display.
This display indicates the status of hardware and software testing during the
node's boot process. The display also shows specific
codes when node hardware or the operating system software fails. These
codes indicate the nature of the failure, and whether any additional error
data may be present. After a node has successfully booted, this display
should be blank. If the display is not blank, use either the SP
Hardware Perspective or the spmon command to determine what value is
being displayed, and consult SP-specific LED/LCD values, Network installation progress, and Other LED/LCD codes to determine what the LED
or LCD value means.
Resource Variable: SP_HW.Node.LCDhasMessage |
Node Reachable by RSCT Group Services | Description: This indicates whether RSCT Group Services
can reach the node through any of its network adapters and the switch.
If this indicates that the node is not reachable, all of the node's
network and switch adapters have either failed or been disabled. In
this case, the only way to reach the node when it is powered on is through the
node's serial link, using the s1term command. When this
happens, check the network adapter status and use the /etc/ifconfig
command to bring the adapter back up. Also, check the switch status
and perform problem determination procedures for Group Services and Topology
Services.
Resource Variable: Membership.Node.state |
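
A minimal sketch of the manual checks described above, run on the node (for example, over its s1term serial link). The adapter name en0 is only an illustration; substitute the adapter that RSCT actually monitors on your node.

```sh
# Check the current state of a network adapter (en0 is an example name).
ifconfig en0

# Bring the adapter back up if it has been disabled.
ifconfig en0 up

# Confirm that the Topology Services subsystem is active on the node.
lssrc -g hats
```
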
Node has a processor that is offline | Description: This indicates that when the node was booted,
one or more processors did not respond. However, there is at least one
active processor, so the node functions.
Resource Variable: processorsOffline |
/tmp file system becoming full | Description: Each node has its own locally available
/tmp file system. This file system is used as temporary
storage for many AIX and PSSP utilities. If this file system runs out
of space, these utilities can fail, causing failures in those PSSP and LP
utilities that depend on them. When this file system nears its storage
capacity, it should be checked for large files that can be removed, or the
file system size should be increased.
Resource Variable: aixos.FS.%totused Resource Identifier: VG = rootvg LV = hd3 |
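
A sketch of the manual checks implied here; the same commands apply to the /var and / file systems described in the next two entries by changing the mount point. The growth amount passed to chfs is only an illustration.

```sh
# Report space usage for /tmp (%Used column).
df -k /tmp

# List files under /tmp larger than about 1 MB (find sizes are in
# 512-byte blocks) to identify candidates for removal.
find /tmp -xdev -type f -size +2048 -exec ls -l {} \;

# If cleanup is not enough, grow the file system; +262144 512-byte blocks
# is 128 MB, assuming rootvg has that much free space.
chfs -a size=+262144 /tmp
```
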
/var file system becoming full | Description: Each node has its own locally available
/var file system. This file system contains system logs, error
logs, trace information, and other important node-specific files. If
this file system runs out of space, log entries cannot be recorded, which can
lead to loss of error information when critical errors occur, leaving you and
IBM service personnel without an audit or debug trail. When the file
system nears its storage capacity, it should be checked for old log
information that can be removed or cleared, the file system size should be
increased, or separate file systems should be made for subdirectories that
consume large amounts of disk space.
Resource Variable: aixos.FS.%totused Resource Identifier: VG = rootvg LV = hd9var |
/ file system becoming full | Description: Each node has its own locally available
root file system. This file system contains important node
boot and configuration information, as well as LP binaries and configuration
files. If this file system runs out of space, it may not be possible to
install products on the node, or update that node's configuration
information (although the SMIT and AIX-based install procedures should attempt
to acquire more space). When this file system nears its storage
capacity, it should be checked for core files or any other large
files that can be removed, or the file system's size should be
increased.
Resource Variable: aixos.FS.%totused Resource Identifier: VG = rootvg LV = hd4 |
Paging Space Low | Description: Each node has at least one locally available
paging device. When all these paging devices near their capacity, the
node begins to thrash, spending more time and resources to process paging
requests than to process user requests. When operating as part of a
parallel process, the thrashing node will delay all other parts of the
parallel process that wait for this node to complete its processing. It
can also cause timeouts for other network and distributed processes. A
temporary fix is to terminate any non-critical processes that are using large
amounts of memory. If this is a persistent problem, a more permanent
fix is to restrict the node to specific processing only, or to add additional
paging devices.
Resource Variable: aixos.PagSp.%totalused |
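
A sketch of how paging utilization can be checked by hand on a node, to supplement the resource variable above.

```sh
# Utilization of each paging device.
lsps -a

# Summary of total paging space and percent used.
lsps -s

# Watch ongoing paging activity (pi and po columns) over five 2-second samples.
vmstat 2 5
```
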
Kernel Memory Buffer Failures | Description: Kernel memory buffers, or "mbufs", are
critical to network processing. These buffers are used by the kernel
network protocol code to transfer network messages. If the kernel
begins to encounter failures in acquiring these buffers, network information
packets can be lost, and network applications will not run efficiently.
An occasional failure can be tolerated, but numerous failures or a continuous
stream of small failures indicates that not enough memory has been allocated
to the kernel memory buffer pool.
Resource Variable: aixos.Mem.Kmem.failures Resource Identifier: Type = mbuf |
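
A sketch of a manual spot check for mbuf allocation failures. The grep pattern is only an illustration, since the wording of the netstat -m output varies by AIX level.

```sh
# Kernel memory and mbuf statistics; look for nonzero failed or denied counts.
netstat -m | egrep -i 'fail|denied'

# The mbuf pool ceiling is governed by the "thewall" network option.
no -o thewall
```
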
Switch Input and Output Errors | Description: The switch device driver tracks statistics on
the device's use, including any errors detected by the driver.
These errors are tracked as counters that are never reset, unless the node is
rebooted. Consult Diagnosing SP Switch problems and Diagnosing SP Switch2 problems for assistance in diagnosing any reported
errors for the SP Switch or SP Switch2.
Resource Variables:
Notes: Any increment in the value of the CSS.ierrors or CSS.oerrors counters indicates that the switch adapter is about to go offline. Continual increments to the CSS.ibadpackets counter can indicate transmission problems or "noise" in the connection between the SP Switch adapter and the SP Switch, so the SP Switch cabling should be checked and hardware diagnostics performed. Continual increments to the CSS.ipackets_drop and CSS.opackets_drop counters indicate that there is either too much input or too much output for the SP Switch device driver to handle, and packets are lost. |
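
A sketch of checking the switch interface counters from the node. It assumes the switch adapter's IP interface is named css0, which is typical for the SP Switch but should be confirmed on your system.

```sh
# Interface packet and error counters; Ierrs and Oerrs roughly track the
# input and output errors counted by the switch device driver.
netstat -i | grep css0

# Check the AIX error log for switch adapter entries as well.
errpt | grep -i css
```
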
inetd Daemon Activity | Description: The inetd master daemon is
responsible for activating many AIX and PSSP service daemons when a client for
that service connects to the node. If the daemon fails, these services
cannot be started. Since many SP applications are network applications,
this can cause widespread failure in all SP applications. If the daemon
cannot be restarted manually, force a system dump of this node, collect
information for the IBM Support Center, and restart the node. The
reboot may temporarily resolve the problem.
Resource Variable: Prog.xpcount Resource Identifier: ProgName=inetd UserName = root |
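
A minimal sketch of checking and restarting inetd through the System Resource Controller. The same lssrc/startsrc pattern applies to the other SRC-controlled daemons in this table, except srcmstr itself, which must be checked with ps because lssrc depends on it.

```sh
# Is inetd active under SRC?
lssrc -s inetd

# Cross-check against the process table.
ps -ef | grep inetd | grep -v grep

# Attempt a manual restart if the subsystem is inoperative.
startsrc -s inetd
```
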
srcmstr Daemon Activity | Description: The srcmstr daemon implements the
System Resource Controller functions. If this daemon fails, services
registered with the SRC cannot be controlled using SRC commands. If the
daemon cannot be restarted manually, force a system dump of this node, collect
information for the IBM Support Center, and restart the node. The
reboot may temporarily repair the problem.
Resource Variable: Prog.xpcount Resource Identifier: ProgName=srcmstr UserName = root |
biod Daemon Activity | Description: The biod daemon handles block I/O
requests for the NFS file system. In order for NFS to function on a
node, at least one biod daemon must be active. For normal NFS
activity, six to eight biod daemons are usually active on a
node. For higher NFS activity, some nodes may have more. These
daemons are started at node boot, run continuously, and should not shut
down. If any daemons shut down, consult the NFS documentation for
diagnostic procedures, and attempt to restart the daemon.
Resource Variable: Prog.pcount Resource Identifier: ProgName=biod UserName = root |
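
A sketch of counting the active biod daemons against the six-to-eight rule of thumb above.

```sh
# Number of biod processes currently running on the node.
ps -ef | grep biod | grep -v grep | wc -l

# SRC view of the NFS daemons, including biod.
lssrc -g nfs

# Restart biod under SRC if it has shut down.
startsrc -s biod
```
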
portmap Daemon Activity | Description: This daemon knows all the registered ports on
the node, and which programs are available on each of these ports. The
daemon's task is to convert Remote Procedure Call (RPC) program numbers to
Internet port numbers. RPC clients use this daemon to resolve their RPC
port numbers. If the daemon fails, the daemon itself and all RPC
servers on the node must be restarted.
Resource Variable: Prog.pcount Resource Identifier: ProgName=portmap UserName = root |
xntpd Daemon Activity | Description: This daemon is active on a node when the
Network Time Protocol (NTP) time synchronization protocol is running.
This daemon ensures that the node's time-of-day hardware is synchronized
with the network's time server. A failure in the daemon does not
necessarily mean that the time-of-day hardware on the node will no longer be
synchronized with the network, although this danger does exist. A
failure in the daemon does mean that time change updates from the network
server will not be made on this node. Such problems can lead to
failures in RSCT's Topology Services component, which may begin to see
packets arriving out of chronological order, and may cause RSCT to falsely
detect that one of its peer nodes has failed.
Resource Variable: Prog.pcount Resource Identifier: ProgName=xntpd UserName = root |
kerberos Daemon Activity | Description: The kerberos daemon runs on the node
where the Kerberos databases are stored. You need to know which node
this is to properly check this condition. The daemon is responsible for
accepting Kerberos V4 client requests for principal information, service
tickets, and Kerberos V4 database maintenance.
Failure in this daemon will prevent Kerberos V4 clients from acquiring or validating credentials, which will lead to denial of service for users of those clients. If this daemon fails, consult Diagnosing SP Security Services problems for Kerberos V4 diagnostics, and attempt to restart the daemon. Resource Variable: Prog.pcount Resource Identifier: ProgName=kerberos UserName = root |
hatsd Daemon Activity | Description: This is the RSCT Topology Services daemon,
which is responsible for maintaining an internal topology map of the SP system
on this node. The daemon is under SRC control, and should restart
automatically if it is accidentally terminated. If this daemon fails
and does not restart, the node will be seen as "down" by all other nodes in
this system partition.
If the daemon does not restart, the RSCT Group Services daemon and the RSCT Event Management daemon on the node will also fail. This daemon's status cannot be monitored by the SP Event Perspective or Problem Management, because these two facilities depend on the daemon for their own processing. To check this daemon's activity, you must use the lssrc -g hats command or the ps -ef | grep hats command. |
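
The commands named above, gathered into one sketch. The same checks apply to the hagsd and haemd daemons described in the next two entries by substituting hags or haem for hats.

```sh
# SRC status of the Topology Services subsystem(s) on this node.
lssrc -g hats

# Confirm that the daemon process itself is present.
ps -ef | grep hats | grep -v grep
```
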
hagsd Daemon Activity | Description: This is the RSCT Group Services daemon, which
is responsible for handling Group Services functions for all Group Services
clients on this node. The daemon is under SRC control, and should
restart automatically if it is accidentally terminated. If this daemon
fails and does not restart, all Group Services clients on this node will
appear to have failed, as far as the Group Services group members are
concerned.
Those groups will begin their member failure processing for the Group Services clients on this node. The daemon's status cannot be monitored by the SP Event Perspective or Problem Management, because these two facilities depend on the daemon for their own processing. To check this daemon's activity, you must use the lssrc -g hags command or the ps -ef | grep hags command. |
haemd Daemon Activity | Description: This is the RSCT Event Management daemon,
which is responsible for handling Event Management registrations on this node
and communicating with other Event Management daemons in this system
partition. The daemon is under SRC control, and should restart
automatically if it is accidentally terminated. If this daemon fails
and does not restart, none of the Event Management resource variables from
this node will be available to Event Management applications for monitoring or
event generation purposes.
These affected applications include Problem Management and the SP Event Perspective. This daemon's status cannot be monitored by the SP Event Perspective or Problem Management, because these two facilities depend on the daemon for their own processing. To check this daemon's activity, you must use the lssrc -g haem command or the ps -ef | grep haem command. |
sdrd Daemon Activity | Description: This daemon runs on the control workstation
(and therefore must be checked only on that node), and services all requests
made of the System Data Repository (SDR). Although a failure in this
daemon may not have any immediate consequences, PSSP software services will
not be able to access SDR information, and can fail at later times when this
information is needed. Certain hardware monitoring capability can also
be lost, and may result in widespread, falsely detected "node not responding"
failures.
Resource Variable: Prog.pcount Resource Identifier: ProgName=sdrd UserName = root |
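
A sketch of verifying the SDR daemon from the control workstation. The sdr subsystem group and the SP class are standard PSSP facilities, but confirm the exact names for your PSSP level.

```sh
# Run on the control workstation only: SRC status of the SDR subsystem(s).
lssrc -g sdr

# Simple end-to-end test: read an SDR class. If sdrd is down, this fails
# with an SDR connection error.
SDRGetObjects SP
```
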
dced Daemon Activity | Description: The DCE client for the dcecp
command. This client runs on each host in the SP system when DCE
authentication is used. Failures in this daemon can prevent you from
administering principals, accounts, passwords, and server keys, and from obtaining
other necessary information regarding DCE users. For problem
resolution, refer to IBM DCE for AIX, Version 3.1: Problem
Determination Guide.
Resource Variable: Prog.pcount Resource Identifier: ProgName=dced UserName = root |
cdsadv Daemon Activity | Description: The DCE client for the Cell Directory
Service. This client runs on each host in the SP system when DCE
authentication is used. The cdsd and cdsadv together
provide the Cell Directory Service, which is essentially a distributed
information database. Failures in the client or server daemon may cause
problems accessing information in the same way as a file system failure on a
local machine. Items of interest that are kept in the Cell Directory
Service are keytab objects and account objects, as well as others. For
problem resolution, refer to IBM DCE for AIX, Version 3.1:
Problem Determination Guide.
Resource Variable: Prog.pcount Resource Identifier: ProgName=cdsadv UserName = root |
cdsd Daemon Activity | Description: The DCE server for the Cell Directory
Service. This master server runs on the host you have designated to be
the server host for DCE for your SP system. The cdsd and
cdsadv together provide the Cell Directory Service, which is
essentially a distributed information database. Failures in the client
or server daemon may cause problems accessing information in the same way as a
file system failure on a local machine. Items of interest that are kept
in the Cell Directory Service are keytab objects and account objects, as well
as others. For problem resolution, refer to IBM DCE for AIX,
Version 3.1: Problem Determination Guide.
Resource Variable: Prog.pcount Resource Identifier: ProgName=cdsd UserName = root |
secd Daemon Activity | Description: The DCE Security Server. This master
server runs on the host you have designated to be the server host for DCE for
your SP system. Failures in this daemon cause problems relating to
authentication and authorization. A typical problem would be a login
failure through dce_login. SP Services using DCE for
authenticating trusted services may hang if the DCE Security Server is not
running. The hang condition can also occur if secd is running
but the registry has been disabled. You may see messages such as
"Cannot find KDC for requested realm".
Resource Variable: Prog.pcount Resource Identifier: ProgName=secd UserName = root |
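
A sketch of a quick liveness check for the DCE daemons covered in the last four entries; deeper diagnosis belongs to the DCE problem determination procedures referenced above. The cell_admin principal is only the conventional default; substitute any valid principal.

```sh
# Are the DCE client and server daemons running on this host?
ps -ef | egrep 'dced|cdsadv|cdsd|secd' | grep -v grep

# A simple functional probe of the Security Server (prompts for a password).
dce_login cell_admin
```
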