Diagnosis Guide

Diagnostic instructions

Verify SP software installation

RSCT does not provide software verification. The following steps verify that the software and components are installed and defined:

Issue the command:

lslpp -l rsct.*

and verify that the RSCT components are installed. The output of this command is similar to:

  Fileset                   Level  State      Description
  ----------------------------------------------------------------------------
         Path: /usr/lib/objrepos
  rsct.basic.hacmp        1.2.0.0  COMMITTED  RS/6000 Cluster Technology
                                                 basic function (HACMP domains)
  rsct.basic.rte          1.2.0.0  COMMITTED  RS/6000 Cluster Technology
                                                 basic function (all domains)
  rsct.basic.sp           1.2.0.0  COMMITTED  RS/6000 Cluster Technology
                                                 basic function (SP domains)
  rsct.clients.hacmp      1.2.0.0  COMMITTED  RS/6000 Cluster Technology
                                                 client function (HACMP
                                                domains)
  rsct.clients.perl5      1.2.0.0  COMMITTED  RS/6000 Cluster Technology
                                                 Perl5 Package
  rsct.clients.rte        1.2.0.0  COMMITTED  RS/6000 Cluster Technology
                                                client function (all domains)
  rsct.clients.sp         1.2.0.0  COMMITTED  RS/6000 Cluster Technology
                                                 client function (SP domains)
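
The visual check above can also be scripted. The following is a minimal sketch, assuming the lslpp -l output has the layout shown above (fileset name in the first column, state in the third). The function names and the list of expected filesets passed by the caller are illustrative and are not part of RSCT.

```sh
# Print "fileset state" for every RSCT fileset found in `lslpp -l` output on stdin.
rsct_fileset_states() {
    awk '$1 ~ /^rsct\./ { print $1, $3 }'
}

# Print each expected fileset (given as arguments) that does not appear on stdin.
# Which filesets are expected is decided by the caller; it is an assumption here.
rsct_missing_filesets() {
    found=$(rsct_fileset_states)
    for fileset in "$@"; do
        printf '%s\n' "$found" |
            awk -v want="$fileset" '$1 == want { ok = 1 } END { exit ok ? 0 : 1 }' ||
            printf '%s\n' "$fileset"
    done
}

# Usage:
#   lslpp -l "rsct.*" | rsct_missing_filesets rsct.basic.rte rsct.clients.rte
```

An empty result means that every fileset named on the command line was found in the listing.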

Issue the command:
```
lssrc -s haem.syspar
```
and verify that the Event Management subsystem is defined. If it is not defined, you may have to use the syspar_ctrl command to recreate the RSCT subsystems. See Recover crashed node (Event Management Resource Monitor daemons).

Issue the command:

lssrc -ls haem.syspar

and verify that the Event Management daemon is running. The output of this command is similar to:

Subsystem         Group            PID     Status
 haem.c166s       haem             30448   active
 
No trace flags are set
 
Configuration Data Base version from SDR:
        931790799,481121271,0
 
Daemon started on Tuesday 08/10/99 at 08:36:37
Daemon has been running 2 days, 7 hours, 9 minutes and 9 seconds
Daemon connected to group services: Yes
Daemon has joined peer group:       Yes
Daemon communications enabled:      Yes
Daemon security:                    Compatibility
Peer count:                         6
 
Peer group state:
        931790799,481121271,0
        NOSEC
 
Logical Connection Information for Local Clients
    LCID          FD           PID     Start Time
       0          11          17302    Tuesday 08/10/99 08:38:38
       1          13          23740    Tuesday 08/10/99 08:38:40
      10          16          33032    Tuesday 08/10/99 08:38:59
      11          17          33032    Tuesday 08/10/99 08:38:59
 
Logical Connection Information for Remote Clients
    LCID          FD           PID     Start Time
 
Logical Connection Information for Peers
    LCID         Node
       2            7
       4            3
       5            6
      20            8
      27            1
      31            5
Resource Monitor Information
         Name         Inst     Type      FD     SHMID    PID    Locked
IBM.PSSP.CSSLogMon      0        C        -1      -1      -2  00/00  No
IBM.PSSP.SDR            0        C        -1      -1      -2  00/00  No
IBM.PSSP.harmld         0        S        20      11   28954  01/01  No
IBM.PSSP.harmpd         0        S        19      -1   28684  01/01  No
IBM.PSSP.hmrmd          0        S        21      -1   21766  01/01  No
IBM.PSSP.pmanrmd        0        C        14      -1      -2  00/00  No
Membership              0        I        -1      -1      -2  00/00  No
Response                0        I        -1      -1      -2  00/00  No
aixos                   0        S        12      10      -2  00/01  No
 
Highest file descriptor in use is 21
 
Peer Daemon Status
   0 S S      1 I A      3 I A      5 I A      6 I A      7 I A
   8 I A     17 O A     21 O A

Internal Daemon Counters
    GS init attempts =         22  GS join attempts =          1
    GS resp callback =       1653  CCI conn rejects =          0
    RMC conn rejects =          0  HR conn rejects  =          0
    Retry req msg    =          0  Retry rsp msg    =          0
    Intervl usr util =          1  Total usr util   =       2107
    Intervl sys util =          2  Total sys util   =       2006
    Intervl time     =      12000  Total time       =   19847228
    lccb's created   =         33  lccb's freed     =         23
    Reg rcb's creatd =         41  Reg rcb's freed  =         30
    Qry rcb's creatd =        332  Qry rcb's freed  =        332
    vrr created      =         41  vrr freed        =         30
    vqr created      =      42147  vqr freed        =      42147
    var inst created =        939  var inst freed   =          0
    Events regstrd   =         41  Events unregstrd =         30
    Insts assigned   =         43  Insts unassigned =         19
    Smem vars obsrv  =       3306  State vars obsrv =     195596
    Preds evaluated  =      33132  Events generated =        232
    Smem lck intrvl  =          0  Smem lck total   =          0
    PRM msgs to all  =          8  PRM msgs to peer =          8
    PRM resp msgs    =         86  PRM msgs rcvd    =         32
    PRM_NODATA       =        126  PRM_BADMSG errs  =          0
    Sched q elements =         32  Free q elements  =         30
    xcb alloc'd      =       1303  xcb freed        =       1302
    xcb freed msgfp  =         16  xcb freed reqp   =          1
    xcb freed reqn   =          8  xcb freed rspc   =        678
    xcb freed rspp   =         86  xcb freed cmdrm  =        513
    xcb freed unkwn  =          0  Sec enable       =          0
    Sec disable      =          0  Sec authent      =          0
    Wake sec thread  =          0  Wake main thread =          0
    Missed sec rsps  =          0  Enq sec request  =          0
    Deq sec request  =          0  Enq sec response =          0
    Deq sec response =          0
 
Daemon Resource Utilization Last Interval
User:                 0.010 seconds    0.008%
System:               0.020 seconds    0.017%
User+System:          0.030 seconds    0.025%
 
Daemon Resource Utilization Total
User:                21.070 seconds    0.011%
System:              20.060 seconds    0.010%
User+System:         41.130 seconds    0.021%
 
Data segment size:  2132K

The first portion of the output summarizes the state of the daemon. The first two lines match the output obtained without the -l flag and tell you whether the subsystem is active or inoperative. The line after them gives the status of the trace flags, which are off in this case.
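
As a small sketch of how the active or inoperative state could be picked out of the lssrc output in a script: the function below reads the output on stdin and prints the last field of the line that follows the column header. The function name is illustrative. Taking the last field, rather than the fourth, is an assumption made so that the result is still correct if the PID column is empty for a subsystem that is not running.

```sh
# Print the Status column of the subsystem line that follows the
# "Subsystem  Group  PID  Status" header in lssrc output read from stdin.
subsystem_status() {
    awk '$1 == "Subsystem" && $NF == "Status" { if ((getline) > 0) print $NF; exit }'
}

# Usage:
#   lssrc -ls haem.syspar | subsystem_status
```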

The following lines of the output show the EMCDB version number stored in the SDR:

Configuration Data Base version from SDR:
        931790799,481121271,0

The first line under the Peer group state heading:

Peer group state:
        931790799,481121271,0
        NOSEC

shows the version number used by the ha_em_peers group. In this case, both are 931790799,481121271,0. Compare these values to see whether the daemons are using an EMCDB version different from the one stored in the SDR.

A mismatch usually occurs when you have run the haemcfg command, which creates a new EMCDB, but have not restarted the Event Management daemons. Remember that all the Event Management daemons in the system partition (domain) need to be stopped, and the ha_em_peers group dissolved, in order to use the new EMCDB.
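
The version comparison described above can be sketched as follows, assuming the two stanza headings appear exactly as in the sample output and that the version is the first word on the line after each heading. The function names are illustrative.

```sh
# Print "sdr_version peer_group_version" from `lssrc -ls haem.syspar` output on stdin.
emcdb_versions() {
    awk '
        /^Configuration Data Base version from SDR:/ { if ((getline) > 0) sdr = $1 }
        /^Peer group state:/                         { if ((getline) > 0) peer = $1 }
        END { print sdr, peer }
    '
}

# Exit 0 if both versions were found and are identical, 1 otherwise.
emcdb_in_sync() {
    emcdb_versions |
        awk '{ ok = ($1 != "" && $1 == $2) } END { exit ok ? 0 : 1 }'
}

# Usage:
#   lssrc -ls haem.syspar | emcdb_in_sync || echo "EMCDB version mismatch"
```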

The output also tells you how long the daemon has been running:

Daemon started on Tuesday 08/10/99 at 08:36:37
Daemon has been running 2 days, 7 hours, 9 minutes and 9 seconds

It shows whether the daemon was able to connect to Group Services:

Daemon connected to group services: Yes

It shows whether the daemon was able to join the ha_em_peers group:

Daemon has joined peer group:       Yes

The following line states if the daemon has enabled communication with clients:

Daemon communications enabled:      Yes

Since RSCT 1.2 supports DCE security, the next line states which security mode this daemon is working with:

Daemon security:                    Compatibility

The last line of this stanza gives you the number of providers in the ha_em_peers group, not counting this one. This is the number of Event Management daemons running. If all nodes are up and running, this number should be equal to the number of nodes, plus one for the control workstation, and one for the current node.

Peer count:                         6
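
The checks on this stanza can be combined in one pass over the output. The sketch below is illustrative only: it assumes the line labels appear exactly as in the sample output, reports any of the three Yes/No lines that is not Yes, and prints the peer count so that it can be compared with the number you expect for your system. The function name is hypothetical.

```sh
# Read `lssrc -ls haem.syspar` output on stdin.
# Prints a "NOT OK" line for each of the three Yes/No checks that is not "Yes",
# prints the peer count, and exits 1 if any check failed.
daemon_stanza_check() {
    awk -F': *' '
        { value = $2; sub(/[ \t]+$/, "", value) }
        /^Daemon connected to group services:/ ||
        /^Daemon has joined peer group:/ ||
        /^Daemon communications enabled:/ {
            if (value != "Yes") { print "NOT OK: " $1; bad = 1 }
        }
        /^Peer count:/ { print "Peer count: " value }
        END { exit bad ? 1 : 0 }
    '
}

# Usage:
#   lssrc -ls haem.syspar | daemon_stanza_check
```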

Identify the failing node

Follow these steps to identify the failing node:

Issue the command:
```
lssrc -ls haem.syspar
```
on the control workstation. Check the Peer count value, which is explained in Verify SP software installation. If Peer count is less than the number of nodes (including the control workstation) plus one, you have one or more failing nodes.
Issue the command:
```
lssrc -ls hags.syspar
```
on the control workstation and verify that the ha_em_peers group appears on the list of local groups, as shown here:
```
lssrc -ls hags.c184s
Subsystem         Group            PID     Status
hags.c184s       hags             22446   active
3 locally-connected clients.  Their PIDs:
17804 27348 29440
 HA Group Services domain information:
 Domain established by node 0.
 Number of groups known locally: 3
                    Number of   Number of local
 Group name         providers   providers/subscribers
 cssMembership            5           0           1
 ha_em_peers              7           1           0
 ha.vsd                   5           1           0
 
```
If the ha_em_peers group does not appear in the output, the Event Management daemon is not running on the control workstation, or it has not been able to join the ha_em_peers group. If the daemon is not running, follow the steps given in Recover crashed node (Event Management Resource Monitor daemons).
If the daemon is not running on the control workstation (or the node where you are running the commands), the number of local providers for the ha_em_peers group will be zero.
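
The group check can be sketched in a script as follows, assuming the group table has the column layout shown above (group name, number of providers, number of local providers, number of local subscribers). The function name is illustrative.

```sh
# Read `lssrc -ls hags.syspar` output on stdin and print
# "providers local_providers" for the named group (default: ha_em_peers).
# Exits 1 if the group does not appear in the output.
gs_group_providers() {
    awk -v group="${1:-ha_em_peers}" '
        $1 == group { print $2, $3; found = 1 }
        END { exit found ? 0 : 1 }
    '
}

# Usage:
#   lssrc -ls hags.syspar | gs_group_providers ha_em_peers
```

A second value of 0 corresponds to the case described above, in which the daemon is not running on the node where the command was issued.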

Issue the following command to identify which node is failing:

/usr/sbin/rsct/bin/hagsgr -s hags.c184s -a ha_em_peers

The output is similar to the following:

  Number of: groups: 6
Group name[ha_em_peers] group state[Inserted |Idle |]
Providers[[1/0][1/65][1/5][1/9][1/13][1/1][1/17]]
Local subscribers[]

Note: This command is undocumented and not supported, but it is shipped with RSCT, and is therefore available on any system running RSCT.

From the output of this command you can see the list of providers (members of the ha_em_peers group). They are listed in the order in which they joined the group, and the first in the list is called the Group Owner. Each member is shown in brackets as [X/Y], where X is the instance number of the daemon and Y is the node number.

From this list you can identify the failing node.
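
Comparing the provider list with the nodes you expect can be sketched as follows. This is a minimal example, assuming the Providers line has the [X/Y] format shown above. The function names are illustrative, and the list of expected node numbers is supplied by the caller.

```sh
# Print the node number of each provider, one per line, in join order,
# from hagsgr output read on stdin.
provider_nodes() {
    sed -n 's/^Providers\[\(.*\)\][[:space:]]*$/\1/p' |
        tr -d '[' |
        tr ']' '\n' |
        awk -F/ 'NF == 2 { print $2 }'
}

# Print each expected node number (given as arguments) that is not a provider.
missing_nodes() {
    present=$(provider_nodes)
    for node in "$@"; do
        printf '%s\n' "$present" | grep -qx "$node" || printf '%s\n' "$node"
    done
}

# Usage (node numbers are examples only):
#   /usr/sbin/rsct/bin/hagsgr -s hags.c184s -a ha_em_peers | missing_nodes 0 1 5 9 13 17
```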

Verify event registration

There are three optional arguments that can be passed to the daemon by using the haemtrcon command. The syntax for the haemtrcon command is as follows:

haemtrcon [-h host] [-a argument] -g group_name
haemtrcon [-h host] [-a argument] -s subsystem_name
haemtrcon [-h host] [-a argument] -p subsystem_pid

The arguments accepted by the daemons for dumping information are shown in Table 62.

Table 62. Arguments for Event Management daemon

Argument    Description
regs        Dumps registered events.
dinsts      Dumps registered instances.
olists      Dumps observation lists.

To see all the events registered with Event Management, you could use this command:

haemtrcon -a regs -s haem.sp5en0

Output is similar to the following:

haemtrcon: the specified trace flags have been set (00000000)

This command sends a request to the Event Management daemon to dump all registered events. The daemon will then dump the requested information into the em.trace file in the /var/ha/log directory. The content of the file looks like this:

Trace Started at 11/30/99 14:38:01.509522432
Registered Events:
 
0 0x00000000 ( 0, 0) 0 IBM.PSSP.Membership.LANAdapter.state "X==0" "X==1"
 
AdapterNum=0 AdapterType=en NodeNum=0
 
AdapterNum=0 AdapterType=en NodeNum=5
 
AdapterNum=0 AdapterType=en NodeNum=9
 
AdapterNum=0 AdapterType=en NodeNum=13
 
AdapterNum=0 AdapterType=en NodeNum=1
 
1 0x00000000 ( 0, 0) 2 IBM.PSSP.Response.Host.state "X==1 && X@P==0" ""
 
NodeNum=5
 
NodeNum=9
 
NodeNum=13
 
NodeNum=1
 
2 0x00010000 ( 1, 0) 2 IBM.PSSP.Membership.LANAdapter.state "X==0 && X@P==1" ""
 
AdapterNum=0 AdapterType=css NodeNum=13

AdapterNum=0 AdapterType=css NodeNum=9
AdapterNum=0 AdapterType=css NodeNum=5
 
AdapterNum=0 AdapterType=css NodeNum=1
 
3 0x00000000 ( 0, 0) 3 IBM.PSSP.CSSlog.errlog "X@1 != 0" ""
 
No instances currently assigned
 
4 0x00000000 ( 0, 0) 1 IBM.PSSP.pm.User_state1 "X@0!=X@P0" ""
 
No instances currently assigned
 
9 0x00000000 ( 0, 0) 18 IBM.PSSP.SDR.modification "" ""
 
No instances currently assigned
 
10 0x00010000 ( 1, 0) 18 IBM.PSSP.SDR.modification "" ""
 
Class=EM_Condition
 
14 0x00050000 ( 5, 0) 18 IBM.PSSP.SDR.modification "" ""
 
No instances currently assigned
 
15 0x00060000 ( 6, 0) 18 IBM.PSSP.SDR.modification "" ""
 
No instances currently assigned
 
16 0x00070000 ( 7, 0) 18 IBM.PSSP.SDR.modification "" ""
 
Class=Node
 
17 0x00080000 ( 8, 0) 18 IBM.PSSP.SDR.modification "" ""
 
No instances currently assigned

As you can see, there are several events registered with Event Management, but only a few of them have instances currently assigned. An event with no instances assigned is an event known to Event Management, but not currently active. The first three events are the only ones active. We can see that Event Management is monitoring the Ethernet (en) and SP Switch (css) adapters through the Membership resource variable. It is also monitoring the host Response variable on all the nodes.
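
Picking out the events that have instance lines can be sketched with awk. The example below assumes the trace layout shown above: an event header line starts with the event number and a hexadecimal value, and each instance vector appears on its own line as name=value pairs. It prints the event number, the resource variable, and the number of instance lines for every event that has at least one. The function name is illustrative, and the input should contain a single dump, because the trace file may hold more than one.

```sh
# Read one "Registered Events" dump on stdin and print
# "event_number resource_variable instance_lines" for each event with instances.
active_events() {
    awk '
        /^[0-9]+ 0x[0-9a-fA-F]+ / {          # event header line
            id = $1; var = ""; have_event = 1
            for (i = 2; i <= NF; i++)
                if ($i ~ /^IBM\./) { var = $i; break }
            next
        }
        /No instances currently assigned/ { have_event = 0; next }
        have_event && /=/ {                   # instance vector line
            if (!(id in count)) { order[++n] = id; name[id] = var }
            count[id]++
        }
        END {
            for (i = 1; i <= n; i++)
                print order[i], name[order[i]], count[order[i]]
        }
    '
}

# Usage (assumes /var/ha/log/em.trace holds a single dump):
#   active_events < /var/ha/log/em.trace
```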

Let us activate a file system monitor and dump this information again. The result is as follows:

Trace Started at 11/30/99 14:51:02.012841216
 
 Registered Events:
 
0 0x00000000 ( 0, 0) 0 IBM.PSSP.Membership.LANAdapter.state "X==0" "X==1"
 
AdapterNum=0 AdapterType=en NodeNum=0
 
AdapterNum=0 AdapterType=en NodeNum=5
 
AdapterNum=0 AdapterType=en NodeNum=9
 
AdapterNum=0 AdapterType=en NodeNum=13
 
AdapterNum=0 AdapterType=en NodeNum=1
 
1 0x00000000 ( 0, 0) 2 IBM.PSSP.Response.Host.state "X==1 && X@P==0" ""
 
NodeNum=5
 
NodeNum=9
 
NodeNum=13
 
NodeNum=1

2 0x00010000 ( 1, 0) 2 IBM.PSSP.Membership.LANAdapter.state "X==0 && X@P==1" ""
 
AdapterNum=0 AdapterType=css NodeNum=13
 
AdapterNum=0 AdapterType=css NodeNum=9
 
AdapterNum=0 AdapterType=css NodeNum=5
 
AdapterNum=0 AdapterType=css NodeNum=1
 
3 0x00000000 ( 0, 0) 3 IBM.PSSP.CSSlog.errlog "X@1 != 0" ""
 
No instances currently assigned
 
4 0x00000000 ( 0, 0) 1 IBM.PSSP.pm.User_state1 "X@0!=X@P0" ""
 
No instances currently assigned
 
9 0x00000000 ( 0, 0) 18 IBM.PSSP.SDR.modification "" ""
 
No instances currently assigned
 
10 0x00010000 ( 1, 0) 18 IBM.PSSP.SDR.modification "" ""
 
Class=EM_Condition
 
14 0x00050000 ( 5, 0) 18 IBM.PSSP.SDR.modification "" ""
 
No instances currently assigned
15 0x00060000 ( 6, 0) 18 IBM.PSSP.SDR.modification "" ""
 
No instances currently assigned
 
16 0x00070000 ( 7, 0) 18 IBM.PSSP.SDR.modification "" ""
 
Class=Node
 
17 0x00080000 ( 8, 0) 18 IBM.PSSP.SDR.modification "" ""
 
No instances currently assigned
 
18 0x00090000 ( 9, 0) 18 IBM.PSSP.aixos.FS.%totused "X>90" "X<60"
 
VG=rootvg LV=hd3

As you can see from this new output, there is a new event (18) which is the one we have just activated. If you want more information about this event, you can use one of the other two arguments described in Table 62. For example, the olists argument gives details on the events registered for this monitor, as follows:

Trace Started at 11/30/99 14:56:45.161388544
 
 
 
Obsv control = 0x200605c8, interval = 60.000000, flags = 0x0000,
                     last obsv = 943991746 874736
 
Obsv list = 0x2002b8e8, delay = 20.000000, number of ptr lists elements = 1
 
limit = 10000, inst count = 16
 
Normal list:
 
 
IBM.PSSP.aixos.FS.%totused
 
vector: VG=rootvg LV=hd3
 
API instance ID = 2, RM instance ID = 18465, RM instance number = 0
 
current value: 8.072917, raw: 8.072917
 
flags: 0003 qcnt: 0
 
 
 
Immediate list:

From this output, you can see that the sample interval for this variable is 60 seconds, and that the current value is a little over 8%. From the instance vector, we can tell that this is a monitor of the /tmp file system (hd3). The previous output gave us the condition (X>90) and the rearm condition (X<60). This tracing or dump facility of Event Management is helpful in situations where events registered through the SP Event Perspective, Problem Management, or the EMAPI directly do not appear to be working.
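
The values discussed above can be extracted from an olists dump with a short awk filter. This is a sketch only, assuming the field layout shown in the sample output. The function name is illustrative.

```sh
# Read an olists dump on stdin and print, for each observed variable,
# "variable interval=<seconds> value=<current value>".
obsv_summary() {
    awk '
        /^Obsv control/ {
            for (i = 1; i <= NF; i++)
                if ($i == "interval") { interval = $(i + 2); sub(/,$/, "", interval) }
        }
        /^IBM\./ { var = $1 }
        /^current value:/ {
            value = $3; sub(/,$/, "", value)
            print var, "interval=" interval, "value=" value
        }
    '
}

# Usage (assumes the file holds a single olists dump):
#   obsv_summary < /var/ha/log/em.trace
```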

Verify Resource Monitors

See Resource Monitor problems.
