RSCT does not provide software verification. The following steps verify that the software and components are installed and defined:
Issue the command:
lslpp -l rsct.*
and verify that the RSCT components are installed. The output of this command is similar to:
  Fileset                Level    State      Description
  ----------------------------------------------------------------------------
Path: /usr/lib/objrepos
  rsct.basic.hacmp       1.2.0.0  COMMITTED  RS/6000 Cluster Technology basic function (HACMP domains)
  rsct.basic.rte         1.2.0.0  COMMITTED  RS/6000 Cluster Technology basic function (all domains)
  rsct.basic.sp          1.2.0.0  COMMITTED  RS/6000 Cluster Technology basic function (SP domains)
  rsct.clients.hacmp     1.2.0.0  COMMITTED  RS/6000 Cluster Technology client function (HACMP domains)
  rsct.clients.perl5     1.2.0.0  COMMITTED  RS/6000 Cluster Technology Perl5 Package
  rsct.clients.rte       1.2.0.0  COMMITTED  RS/6000 Cluster Technology client function (all domains)
  rsct.clients.sp        1.2.0.0  COMMITTED  RS/6000 Cluster Technology client function (SP domains)
Issue the command:
lssrc -s haem.syspar
and verify that the Event Management subsystem is defined. If it is not defined, you may have to use the syspar_ctrl command to re-create the RSCT subsystems. See Recover crashed node (Event Management Resource Monitor daemons).
Issue the command:
lssrc -ls haem.syspar
and verify that the Event Management daemon is running. The output of this command is similar to:
Subsystem         Group            PID     Status
 haem.c166s       haem             30448   active
No trace flags are set
Configuration Data Base version from SDR: 931790799,481121271,0
Daemon started on Tuesday 08/10/99 at 08:36:37
Daemon has been running 2 days, 7 hours, 9 minutes and 9 seconds
Daemon connected to group services: Yes
Daemon has joined peer group: Yes
Daemon communications enabled: Yes
Daemon security: Compatibility
Peer count: 6
Peer group state: 931790799,481121271,0 NOSEC

Logical Connection Information for Local Clients
 LCID   FD    PID   Start Time
    0   11  17302   Tuesday 08/10/99 08:38:38
    1   13  23740   Tuesday 08/10/99 08:38:40
   10   16  33032   Tuesday 08/10/99 08:38:59
   11   17  33032   Tuesday 08/10/99 08:38:59

Logical Connection Information for Remote Clients
 LCID   FD    PID   Start Time

Logical Connection Information for Peers
 LCID   Node
    2      7
    4      3
    5      6
   20      8
   27      1
   31      5

Resource Monitor Information
 Name                 Inst Type  FD  SHMID    PID  Locked
 IBM.PSSP.CSSLogMon      0    C  -1     -1     -2  00/00 No
 IBM.PSSP.SDR            0    C  -1     -1     -2  00/00 No
 IBM.PSSP.harmld         0    S  20     11  28954  01/01 No
 IBM.PSSP.harmpd         0    S  19     -1  28684  01/01 No
 IBM.PSSP.hmrmd          0    S  21     -1  21766  01/01 No
 IBM.PSSP.pmanrmd        0    C  14     -1     -2  00/00 No
 Membership              0    I  -1     -1     -2  00/00 No
 Response                0    I  -1     -1     -2  00/00 No
 aixos                   0    S  12     10     -2  00/01 No

Highest file descriptor in use is 21

Peer Daemon Status
   0 S S     1 I A     3 I A     5 I A     6 I A
   7 I A     8 I A    17 O A    21 O A

Internal Daemon Counters
 GS init attempts = 22        GS join attempts = 1         GS resp callback = 1653
 CCI conn rejects = 0         RMC conn rejects = 0         HR conn rejects = 0
 Retry req msg = 0            Retry rsp msg = 0            Intervl usr util = 1
 Total usr util = 2107        Intervl sys util = 2         Total sys util = 2006
 Intervl time = 12000         Total time = 19847228        lccb's created = 33
 lccb's freed = 23            Reg rcb's creatd = 41        Reg rcb's freed = 30
 Qry rcb's creatd = 332       Qry rcb's freed = 332        vrr created = 41
 vrr freed = 30               vqr created = 42147          vqr freed = 42147
 var inst created = 939       var inst freed = 0           Events regstrd = 41
 Events unregstrd = 30        Insts assigned = 43          Insts unassigned = 19
 Smem vars obsrv = 3306       State vars obsrv = 195596    Preds evaluated = 33132
 Events generated = 232       Smem lck intrvl = 0          Smem lck total = 0
 PRM msgs to all = 8          PRM msgs to peer = 8         PRM resp msgs = 86
 PRM msgs rcvd = 32           PRM_NODATA = 126             PRM_BADMSG errs = 0
 Sched q elements = 32        Free q elements = 30         xcb alloc'd = 1303
 xcb freed = 1302             xcb freed msgfp = 16         xcb freed reqp = 1
 xcb freed reqn = 8           xcb freed rspc = 678         xcb freed rspp = 86
 xcb freed cmdrm = 513        xcb freed unkwn = 0          Sec enable = 0
 Sec disable = 0              Sec authent = 0              Wake sec thread = 0
 Wake main thread = 0         Missed sec rsps = 0          Enq sec request = 0
 Deq sec request = 0          Enq sec response = 0         Deq sec response = 0

Daemon Resource Utilization Last Interval
 User:         0.010 seconds 0.008%
 System:       0.020 seconds 0.017%
 User+System:  0.030 seconds 0.025%

Daemon Resource Utilization Total
 User:         21.070 seconds 0.011%
 System:       20.060 seconds 0.010%
 User+System:  41.130 seconds 0.021%

Data segment size: 2132K
The first portion of the output describes the state of the daemon. The first two lines are the same as the output obtained without the -l flag, and tell you whether the subsystem is active or inoperative. The next line gives the status of the trace flags, which are off in this case.
The next line shows the EMCDB version number stored in the SDR:
Configuration Data Base version from SDR: 931790799,481121271,0
The following line, further down in the same stanza, shows the version number used by the ha_em_peers group:
Peer group state: 931790799,481121271,0 NOSEC
In this case, both are 931790799,481121271,0. Compare these values to see whether the daemons are using an EMCDB version different from the one stored in the SDR.
A mismatch usually occurs when you have run the haemcfg command, which creates a new EMCDB, but have not restarted the Event Management daemons. Remember that all the Event Management daemons in the system partition (domain) need to be stopped, and the ha_em_peers group dissolved, in order to use the new EMCDB.
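The version comparison described above can be scripted. The following is a minimal sketch, assuming the output of lssrc -ls haem.syspar has been captured as text and contains the two lines quoted above; the function names are illustrative and not part of RSCT.

```python
import re

def emcdb_versions(lssrc_output):
    """Return (sdr_version, peer_group_version) found in the captured text.

    Either element is None if the corresponding line is missing.
    """
    sdr = re.search(r"Configuration Data Base version from SDR:\s*(\S+)",
                    lssrc_output)
    peer = re.search(r"Peer group state:\s*(\S+)", lssrc_output)
    return (sdr.group(1) if sdr else None,
            peer.group(1) if peer else None)

def emcdb_in_sync(lssrc_output):
    """True only when both versions are present and identical."""
    sdr, peer = emcdb_versions(lssrc_output)
    return sdr is not None and sdr == peer

# Sample text copied from the output shown earlier in this section.
sample = """\
Configuration Data Base version from SDR: 931790799,481121271,0
Peer group state: 931790799,481121271,0 NOSEC
"""

print(emcdb_in_sync(sample))
```

A False result suggests that the daemons are still using an older EMCDB than the one stored in the SDR.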
The output also tells you how long the daemon has been running:
Daemon started on Tuesday 08/10/99 at 08:36:37
Daemon has been running 2 days, 7 hours, 9 minutes and 9 seconds
It shows whether the daemon was able to connect to Group Services:
Daemon connected to group services: Yes
It also shows whether the daemon was able to join the ha_em_peers group:
Daemon has joined peer group: Yes
The following line states whether the daemon has enabled communication with clients:
Daemon communications enabled: Yes
Since RSCT 1.2 supports DCE security, the next line states which security mode this daemon is working with:
Daemon security: Compatibility
The Peer count line gives the number of providers in the ha_em_peers group, not counting this one; that is, the number of other Event Management daemons running. If all nodes are up and running, this number should be equal to the number of nodes, plus one for the control workstation, minus one for the current node.
Peer count: 6
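As a worked example of this arithmetic, the sketch below reads the Peer count from captured lssrc -ls output and compares it with the expected value. It assumes, as stated above, that the Peer count excludes the local daemon; the parameters node_count and has_cws are hypothetical inputs that you supply from your own configuration.

```python
import re

def peer_count(lssrc_output):
    """Return the Peer count value from the captured text, or None."""
    match = re.search(r"Peer count:\s*(\d+)", lssrc_output)
    return int(match.group(1)) if match else None

def missing_peers(lssrc_output, node_count, has_cws=True):
    """Return how many expected peer daemons are absent from the group.

    Assumption: the expected total is node_count daemons, plus one for
    the control workstation, minus one for the local daemon.
    """
    actual = peer_count(lssrc_output)
    if actual is None:
        raise ValueError("no 'Peer count:' line found")
    expected = node_count + (1 if has_cws else 0) - 1
    return max(expected - actual, 0)

sample = "Peer count: 6\n"

print(missing_peers(sample, node_count=6))  # all six nodes up
print(missing_peers(sample, node_count=8))  # two daemons missing
```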
Follow these steps to identify the failing node:
Issue the command:
lssrc -ls haem.syspar
on the control workstation. Check the Peer count value, explained in Verify SP software installation. Because the Peer count does not include the local daemon, a value less than the number of nodes (including the control workstation) minus one means that you have one or more failing nodes.
Issue the command:
lssrc -ls hags.syspar
on the control workstation and verify that the ha_em_peers group appears on the list of local groups, as shown here:
lssrc -ls hags.c184s
Subsystem         Group            PID     Status
 hags.c184s       hags             22446   active
3 locally-connected clients.  Their PIDs:
17804 27348 29440
HA Group Services domain information:
Domain established by node 0.
Number of groups known locally: 3
                   Number of   Number of local
Group name         providers   providers/subscribers
cssMembership          5           0     1
ha_em_peers            7           1     0
ha.vsd                 5           1     0
If the ha_em_peers group does not appear in the output, the Event Management daemon is not running on the control workstation, or it has not been able to join the ha_em_peers group. If the daemon is not running, follow the steps given in Recover crashed node (Event Management Resource Monitor daemons).
If the daemon is not running on the control workstation (or the node where you are running the commands), the number of local providers for the ha_em_peers group will be zero.
To list the members of the ha_em_peers group, issue the command:
/usr/sbin/rsct/bin/hagsgr -s hags.c184s -a ha_em_peers
The output is similar to the following:
Number of: groups: 6
Group name[ha_em_peers] group state[Inserted |Idle |]
Providers[[1/0][1/65][1/5][1/9][1/13][1/1][1/17]]
Local subscribers[]
From the output of this command you can see the list of providers (members of the ha_em_peers group). They are listed in the order they joined the group, with the first in the list called the Group Owner. Each member appears in brackets as [X/Y], where X is the instance number of the daemon and Y is the node number.
From this list you can identify the failing node: it is the node whose number does not appear among the providers.
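The comparison can be sketched in a few lines. The example below assumes the hagsgr output has been captured as text in the format shown above; the list of expected node numbers is a hypothetical input that you would take from your own system definition, and the function names are illustrative.

```python
import re

def provider_nodes(hagsgr_output):
    """Return the provider node numbers in the order they joined the group."""
    match = re.search(r"Providers\[((?:\[\d+/\d+\])*)\]", hagsgr_output)
    if match is None:
        return []
    # Each provider is written as [instance/node]; keep the node number.
    return [int(node) for _, node in
            re.findall(r"\[(\d+)/(\d+)\]", match.group(1))]

def missing_nodes(hagsgr_output, expected_nodes):
    """Return the expected node numbers that are not providers."""
    present = set(provider_nodes(hagsgr_output))
    return sorted(set(expected_nodes) - present)

sample = """\
Number of: groups: 6
Group name[ha_em_peers] group state[Inserted |Idle |]
Providers[[1/0][1/65][1/5][1/9][1/13][1/1][1/17]]
Local subscribers[]
"""

nodes = provider_nodes(sample)
print("Group Owner is on node", nodes[0])
# The expected list below is made up for illustration; node 21 is absent.
print(missing_nodes(sample, [0, 1, 5, 9, 13, 17, 21, 65]))
```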
There are three optional arguments that can be passed to the daemon by using the haemtrcon command. The syntax for the haemtrcon command is as follows:
haemtrcon [-h host] [-a argument] -g group_name
haemtrcon [-h host] [-a argument] -s subsystem_name
haemtrcon [-h host] [-a argument] -p subsystem_pid
The arguments accepted by the daemons for dumping information are shown in Table 62.
Table 62. Arguments for Event Management daemon
Argument | Description |
---|---|
regs | Dumps registered events. |
dinsts | Dumps registered instances. |
olists | Dumps observation lists. |
To see all the events registered with Event Management, you could use this command:
haemtrcon -a regs -s haem.sp5en0
Output is similar to the following:
haemtrcon: the specified trace flags have been set (00000000)
This command sends a request to the Event Management daemon to dump all registered events. The daemon will then dump the requested information into the em.trace file in the /var/ha/log directory. The content of the file looks like this:
Trace Started at 11/30/99 14:38:01.509522432
Registered Events:
0 0x00000000 ( 0, 0) 0 IBM.PSSP.Membership.LANAdapter.state "X==0" "X==1"
    AdapterNum=0 AdapterType=en NodeNum=0
    AdapterNum=0 AdapterType=en NodeNum=5
    AdapterNum=0 AdapterType=en NodeNum=9
    AdapterNum=0 AdapterType=en NodeNum=13
    AdapterNum=0 AdapterType=en NodeNum=1
1 0x00000000 ( 0, 0) 2 IBM.PSSP.Response.Host.state "X==1 && X@P==0" ""
    NodeNum=5
    NodeNum=9
    NodeNum=13
    NodeNum=1
2 0x00010000 ( 1, 0) 2 IBM.PSSP.Membership.LANAdapter.state "X==0 && X@P==1" ""
    AdapterNum=0 AdapterType=css NodeNum=13
    AdapterNum=0 AdapterType=css NodeNum=9
    AdapterNum=0 AdapterType=css NodeNum=5
    AdapterNum=0 AdapterType=css NodeNum=1
3 0x00000000 ( 0, 0) 3 IBM.PSSP.CSSlog.errlog "X@1 != 0" ""
    No instances currently assigned
4 0x00000000 ( 0, 0) 1 IBM.PSSP.pm.User_state1 "X@0!=X@P0" ""
    No instances currently assigned
9 0x00000000 ( 0, 0) 18 IBM.PSSP.SDR.modification "" ""
    No instances currently assigned
10 0x00010000 ( 1, 0) 18 IBM.PSSP.SDR.modification "" ""
    Class=EM_Condition
14 0x00050000 ( 5, 0) 18 IBM.PSSP.SDR.modification "" ""
    No instances currently assigned
15 0x00060000 ( 6, 0) 18 IBM.PSSP.SDR.modification "" ""
    No instances currently assigned
16 0x00070000 ( 7, 0) 18 IBM.PSSP.SDR.modification "" ""
    Class=Node
17 0x00080000 ( 8, 0) 18 IBM.PSSP.SDR.modification "" ""
    No instances currently assigned
As you can see, there are several events registered with Event Management, but only a few of them have instances currently assigned. An event with no instances assigned is an event known to Event Management, but not currently active. The first three events are the only ones active. We can see that Event Management is monitoring the Ethernet (en) and SP Switch (css) adapters through the Membership resource variable. It is also monitoring the host Response variable on all the nodes.
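For a long dump, the registered events can be summarized with a short script. This is a sketch only: it assumes each event entry starts with an event number, an eight-digit hexadecimal value, a parenthesized pair, a further number, the resource variable name, and two quoted expressions, as in the em.trace excerpt above. The dictionary keys are illustrative names, not official field names.

```python
import re

# One event header, for example:
# 1 0x00000000 ( 0, 0) 2 IBM.PSSP.Response.Host.state "X==1 && X@P==0" ""
HEADER = re.compile(
    r'(?<!\S)(\d+)\s+0x[0-9a-fA-F]{8}\s+\(\s*\d+,\s*\d+\)\s+\d+\s+'
    r'(\S+)\s+"([^"]*)"\s+"([^"]*)"'
)

def parse_registered_events(trace_text):
    """Return one dict per registered event found in the trace text."""
    headers = list(HEADER.finditer(trace_text))
    events = []
    for i, header in enumerate(headers):
        # Everything up to the next header is this event's instance list.
        end = headers[i + 1].start() if i + 1 < len(headers) else len(trace_text)
        body = trace_text[header.end():end].strip()
        events.append({
            "event": int(header.group(1)),
            "variable": header.group(2),
            "expression": header.group(3),
            "rearm": header.group(4),
            "has_instances": bool(body)
                and "No instances currently assigned" not in body,
        })
    return events

sample = '''\
Registered Events:
0 0x00000000 ( 0, 0) 0 IBM.PSSP.Membership.LANAdapter.state "X==0" "X==1"
    AdapterNum=0 AdapterType=en NodeNum=0
    AdapterNum=0 AdapterType=en NodeNum=5
1 0x00000000 ( 0, 0) 2 IBM.PSSP.Response.Host.state "X==1 && X@P==0" ""
    NodeNum=5
    NodeNum=9
3 0x00000000 ( 0, 0) 3 IBM.PSSP.CSSlog.errlog "X@1 != 0" ""
    No instances currently assigned
'''

for ev in parse_registered_events(sample):
    state = "instances assigned" if ev["has_instances"] else "no instances"
    print(ev["event"], ev["variable"], state)
```

Because the pattern matches on the header fields rather than on line breaks, it should also cope with a dump whose lines have been joined together.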
Let us activate a file system monitor and dump this information again. The result is as follows:
Trace Started at 11/30/99 14:51:02.012841216
Registered Events:
0 0x00000000 ( 0, 0) 0 IBM.PSSP.Membership.LANAdapter.state "X==0" "X==1"
    AdapterNum=0 AdapterType=en NodeNum=0
    AdapterNum=0 AdapterType=en NodeNum=5
    AdapterNum=0 AdapterType=en NodeNum=9
    AdapterNum=0 AdapterType=en NodeNum=13
    AdapterNum=0 AdapterType=en NodeNum=1
1 0x00000000 ( 0, 0) 2 IBM.PSSP.Response.Host.state "X==1 && X@P==0" ""
    NodeNum=5
    NodeNum=9
    NodeNum=13
    NodeNum=1
2 0x00010000 ( 1, 0) 2 IBM.PSSP.Membership.LANAdapter.state "X==0 && X@P==1" ""
    AdapterNum=0 AdapterType=css NodeNum=13
    AdapterNum=0 AdapterType=css NodeNum=9
    AdapterNum=0 AdapterType=css NodeNum=5
    AdapterNum=0 AdapterType=css NodeNum=1
3 0x00000000 ( 0, 0) 3 IBM.PSSP.CSSlog.errlog "X@1 != 0" ""
    No instances currently assigned
4 0x00000000 ( 0, 0) 1 IBM.PSSP.pm.User_state1 "X@0!=X@P0" ""
    No instances currently assigned
9 0x00000000 ( 0, 0) 18 IBM.PSSP.SDR.modification "" ""
    No instances currently assigned
10 0x00010000 ( 1, 0) 18 IBM.PSSP.SDR.modification "" ""
    Class=EM_Condition
14 0x00050000 ( 5, 0) 18 IBM.PSSP.SDR.modification "" ""
    No instances currently assigned
15 0x00060000 ( 6, 0) 18 IBM.PSSP.SDR.modification "" ""
    No instances currently assigned
16 0x00070000 ( 7, 0) 18 IBM.PSSP.SDR.modification "" ""
    Class=Node
17 0x00080000 ( 8, 0) 18 IBM.PSSP.SDR.modification "" ""
    No instances currently assigned
18 0x00090000 ( 9, 0) 18 IBM.PSSP.aixos.FS.%totused "X>90" "X<60"
    VG=rootvg LV=hd3
As you can see from this new output, there is a new event (18) which is the one we have just activated. If you want more information about this event, you can use one of the other two arguments described in Table 62. For example, the olists argument gives details on the events registered for this monitor, as follows:
Trace Started at 11/30/99 14:56:45.161388544
Obsv control = 0x200605c8, interval = 60.000000, flags = 0x0000, last obsv = 943991746 874736
Obsv list = 0x2002b8e8, delay = 20.000000, number of ptr lists elements = 1
    limit = 10000, inst count = 16
Normal list:
    IBM.PSSP.aixos.FS.%totused
    vector: VG=rootvg LV=hd3
    API instance ID = 2, RM instance ID = 18465, RM instance number = 0
    current value: 8.072917, raw: 8.072917
    flags: 0003 qcnt: 0
Immediate list:
From this output, you can see that the sample interval for this variable is 60 seconds, and that the current value is slightly over 8%. From the instance vector (VG=rootvg LV=hd3), we can tell that this is a monitor of the /tmp file system (hd3). The previous output gave us the condition (X>90) and the rearm condition (X<60). This dump facility is helpful when events registered through the SP Event Perspective, Problem Management, or the EMAPI directly do not appear to be working.
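To see how a condition and its rearm condition interact, consider the simplified model below. It is an illustration only, not the daemon's implementation: it assumes that the condition is evaluated until it becomes true, and that the rearm condition is then evaluated until it becomes true in turn. The sample values are invented, apart from the first one, which is taken from the dump above.

```python
def simulate_events(values, condition, rearm):
    """Return (index, value, kind) for each notification the model produces.

    kind is "event" when the condition fires and "rearm" when the
    rearm condition fires.
    """
    armed = True  # True while the condition, not the rearm, is evaluated
    fired = []
    for index, x in enumerate(values):
        if armed and condition(x):
            fired.append((index, x, "event"))
            armed = False
        elif not armed and rearm(x):
            fired.append((index, x, "rearm"))
            armed = True
    return fired

# Percentage of the file system in use at successive observations.
samples = [8.07, 55.0, 91.5, 95.0, 70.0, 59.0, 92.0]

for index, value, kind in simulate_events(
        samples, condition=lambda x: x > 90, rearm=lambda x: x < 60):
    print(index, value, kind)
```

In this model the second value above 90 (95.0) produces no new notification, because the monitor is waiting for the rearm condition at that point. With a current value near 8%, the monitor in the dump would be armed and silent.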
See Resource Monitor problems.