This test determines whether RSCT has been successfully installed. Group Services is a part of RSCT. Perform the following steps:
lslpp -l | grep rsct
Good results are indicated by output similar to:
rsct.basic.hacmp    1.2.0.0  COMMITTED  RS/6000 Cluster Technology (HACMP domains)
rsct.basic.rte      1.2.0.0  COMMITTED  RS/6000 Cluster Technology (all domains)
rsct.basic.sp       1.2.0.0  COMMITTED  RS/6000 Cluster Technology (SP domains)
rsct.clients.hacmp  1.2.0.0  COMMITTED  RS/6000 Cluster Technology (HACMP domains)
rsct.clients.rte    1.2.0.0  COMMITTED  RS/6000 Cluster Technology (all domains)
rsct.clients.sp     1.2.0.0  COMMITTED  RS/6000 Cluster Technology (SP domains)
rsct.core.utils     1.2.0.0  COMMITTED  RS/6000 Cluster Technology (all domains)
Error results are indicated by no output from the command.
lppchk -c "rsct*"
Good results are indicated by the absence of error messages and the return of a zero exit status from this command. The command produces no output if it succeeds.
Error results are indicated by a non-zero exit code and by error messages similar to these:
lppchk: 0504-206 File /usr/lib/nls/msg/en_US/hats.cat could not be located.
lppchk: 0504-206 File /usr/sbin/rsct/bin/hatsoptions could not be located.
lppchk: 0504-208 Size of /usr/sbin/rsct/bin/phoenix.snap is 29356, expected value was 29355.
Some error messages may appear if an EFIX is applied to a file set. An EFIX is an emergency fix, supplied by IBM, to correct a specific problem.
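The lppchk criterion above (zero exit status and no output) can be checked mechanically. The following is a minimal sketch, not part of RSCT or PSSP; the function name quiet_success is hypothetical and is introduced only for illustration.

```shell
# Hypothetical helper: succeeds only when the given command exits with
# status 0 and writes nothing to stdout or stderr, which is how this test
# describes a good lppchk result.
quiet_success() {
    out=$("$@" 2>&1)
    rc=$?
    if [ "$rc" -eq 0 ] && [ -z "$out" ]; then
        return 0
    fi
    # Print whatever the command reported so that it can be compared with
    # the sample error messages in this test.
    [ -n "$out" ] && printf '%s\n' "$out"
    return 1
}

# Example use on an AIX node (not executed here):
#   quiet_success lppchk -c "rsct*" && echo "file sets verified"
```

A failure reported by this sketch still needs interpretation, because messages may be expected when an EFIX has been applied to a file set.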
If the test fails, the following file sets need to be installed:
If this test is successful, proceed to Configuration verification test.
This test verifies that Group Services on a PSSP node has the configuration data that it needs. Perform the following steps:
Good results are indicated by none of the preceding commands returning an error message, and all commands returning non-null output.
Error results are indicated if the preceding commands fail. In this case, the SDR could be experiencing problems. Diagnose the SDR subsystem by referring to Diagnosing SDR problems. If the commands succeed but do not show the expected information, it is possible that a problem occurred in the installation of the nodes. Verify installation of the nodes by consulting Diagnosing node installation problems.
If this test is successful, proceed to Operational verification tests.
The following information applies to the diagnostic procedures that follow:
Issue the following command:
lssrc -ls subsystem_name
Good results are indicated by an output similar to:
Subsystem         Group            PID     Status
 hags             hags             22962   active
2 locally-connected clients. Their PIDs:
20898(hagsglsmd) 25028(haemd)
HA Group Services domain information:
Domain established by node 21
Number of groups known locally: 2
                   Number of   Number of local
Group name         providers   providers/subscribers
cssMembership          10          0           1
ha_em_peers             6          1           0
There must be an entry for cssMembership.
Error results are indicated by one of the following:
0513-036 The request could not be passed to the hags subsystem. Start the subsystem and try your command again.
This means that the GS daemon is not running. The GS subsystem is down. Issue the errpt command and look for an entry for the subsystem name. Proceed to Operational test 2 - Determine why the Group Services subsystem is not active.
0513-085 The hags Subsystem is not on file.
This means that the GS subsystem is not defined to the AIX SRC.
In PSSP, the partition-sensitive subsystems may have been undefined by the syspar_ctrl command. Use syspar_ctrl -a to add the subsystems to the node.
In HACMP/ES, HACMP may not have been installed on the node. Check the HACMP subsystem.
Subsystem         Group            PID     Status
 hags.c47s        hags             7350    active
Subsystem hags.c47s trying to connect to Topology Services.
This means that Group Services is not connected to Topology Services. Check the Topology Services subsystem. See Diagnosing Topology Services problems.
Subsystem         Group            PID     Status
 hags.c47s        hags             35746   active
No locally-connected clients.
HA Group Services domain information:
Domain not established.
Number of groups known locally: 0
This means that the GS domain is not established. This is normal during the Group Services startup period. Retry this test after about three minutes. If this situation continues, perform Operational test 3 - Determine why the Group Services domain is not established or why it is not recovered.
Subsystem         Group            PID     Status
 hags.c47s        hags             35746   active
No locally-connected clients.
HA Group Services domain information:
Domain is recovering.
Number of groups known locally: 0
This means that the GS domain is recovering. It is normal during Group Services domain recovery. Retry this test after waiting three to five minutes. If this situation continues, perform Operational test 3 - Determine why the Group Services domain is not established or why it is not recovered.
Subsystem         Group            PID     Status
 hags             hags             25132   active
No locally-connected clients.
HA Group Services domain information:
Domain established by node 1.
Number of groups known locally: 0
This means that no GS clients are connected, or no local groups are established. This is normal for a short while at startup time, when the GS daemon is working normally. Otherwise, one of the following conditions occurred:
lssrc -s haem.partition_name
The output is similar to:
Subsystem         Group            PID     Status
 haem.c47s        haem                     inoperative
Subsystem         Group            PID     Status
 hats             hats             25074   active
Network Name   Indx Defd Mbrs St Adapter ID      Group ID
SPether        [0]    15   12  S 9.114.61.65     9.114.61.195
SPether        [0]  en0          0x376d296c      0x3779180b
HB Interval = 1 secs. Sensitivity = 4 missed beats
SPswitch       [1]    14    0  D 9.114.61.129
SPswitch       [1]  css0
HB Interval = 1 secs. Sensitivity = 4 missed beats
1 locally connected Client with PID:
hagsd( 14460)
Configuration Instance = 925928580
Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
Control Workstation IP address = 9.114.61.125
Daemon employs no security
Data segment size 7052 KB
Look for SPswitch. The line
SPswitch [1] 14 0 D 9.114.61.129 9.114.61.154
implies that the switch is not working or Topology Services thinks that the switch is down. For more information, see Diagnosing Topology Services problems.
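The outcomes of lssrc -ls subsystem_name described in this test can be told apart by the phrases that appear in the sample outputs. The following is a minimal sketch under that assumption; the function name classify_gs_status and the keywords it prints are hypothetical, and matching on these phrases is an illustration rather than a documented interface.

```shell
# Hypothetical classifier for the text printed by "lssrc -ls subsystem_name".
# Reads the output on stdin and prints one keyword per outcome described in
# this test. The matched phrases are copied from the sample outputs.
classify_gs_status() {
    out=$(cat)
    case "$out" in
        *0513-036*)
            echo not_running ;;
        *0513-085*)
            echo not_defined ;;
        *"trying to connect to Topology Services"*)
            echo waiting_for_topology_services ;;
        *"Domain not established"*)
            echo domain_not_established ;;
        *"Domain is recovering"*)
            echo domain_recovering ;;
        *"Domain established"*)
            # A good result must also list the cssMembership group.
            case "$out" in
                *cssMembership*) echo ok ;;
                *)               echo established_without_cssMembership ;;
            esac ;;
        *)
            echo unknown ;;
    esac
}

# Example use on a node (not executed here); 2>&1 is included in case the
# 0513 messages are written to stderr:
#   lssrc -ls hags 2>&1 | classify_gs_status
```

Each keyword corresponds to one of the situations listed above, so the follow-up action is still the one this test gives for that situation.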
Issue the command:
errpt -N hags subsystem_name
where subsystem_name is:
and look for an entry for the subsystem_name. It appears under the RESOURCE_NAME heading.
If an entry is found, issue the command:
errpt -a -N hags subsystem_name
to get details about error log entries. The entries related to Group Services are those with LABEL beginning with GS_.
The error log entry, together with its description in AIX Error Logs and templates, explains why the subsystem is inactive.
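Picking the Group Services entries out of a detailed error report can be sketched as follows. This is an illustration only: the function name gs_error_labels is hypothetical, and the sketch assumes that each entry in the detailed report carries a line of the form "LABEL: value".

```shell
# Hypothetical filter: reads detailed error-log text (for example from
# "errpt -a") on stdin and prints each distinct LABEL value that begins
# with GS_. Assumes each entry has a line of the form "LABEL: value".
gs_error_labels() {
    awk '$1 == "LABEL:" && $2 ~ /^GS_/ { print $2 }' | sort -u
}

# Example use on an AIX node (not executed here):
#   errpt -a | gs_error_labels
```

An empty result corresponds to the case in which no GS_ entry explains why the subsystem is inactive.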
If there is no GS_ error log entry explaining why the subsystem went down or could not start, it is possible that the daemon may have exited abnormally. Look for an error entry with LABEL of CORE_DUMP and PROGRAM NAME of hagsd, by issuing the command:
errpt -J CORE_DUMP
If this entry is found, see Information to collect before contacting the IBM Support Center and contact the IBM Support Center.
Another possibility when there is no GS_ error log entry is that the Group Services daemon could not be loaded. In this case, a message similar to the following may be present in the Group Services startup log:
0509-036 Cannot load program hagsd because of the following errors:
0509-026 System error: Cannot run a file that does not have a valid format.
The message may refer to the Group Services daemon, or to some other program invoked by the startup script hags. If this error is found, see Information to collect before contacting the IBM Support Center and contact the IBM Support Center.
For errors where the daemon did start up but then exited during initialization, detailed information about the problem is in the Group Services error log.
The hagsns command is used to determine the nameserver (NS) state and characteristics. Issue the command:
hagsns -s subsystem_name
The output is similar to:
HA GS NameServer Status
NodeId=0.32, pid=18256, domainId=0.Nil, NS not established, CodeLevel=GSlevel(DRL=8)
The death of the node is being simulated.
NS state=kUncertain, protocolInProgress=kNoProtocol, outstandingBroadcast=kNoBcast
Process started on Jun 21 10:33:08, (0:0:16) ago. HB connection took (0:0:0).
Our current epoch of uncertainty started on Jun 21 10:33:08, (0:0:16) ago.
Number of UP nodes: 1
List of UP nodes: 0
Error results are indicated by output in which NS state is kUncertain, with the following considerations:
If this state does not change or takes longer than two or three minutes, proceed to check Topology Services. See Diagnosing Topology Services problems.
If the Group Services daemon is not in kCertain or kBecomeNS state, and is waiting for the other nodes, the hagsns command output is similar to:
HA GS NameServer Status
NodeId=11.42, pid=21088, domainId=0.Nil, NS not established, CodeLevel=GSlevel(DRL=8)
NS state=kGrovel, protocolInProgress=kNoProtocol, outstandingBroadcast=kNoBcast
Process started on Jun 21 10:52:13, (0:0:22) ago. HB connection took (0:0:0).
Our current epoch of uncertainty started on Jun 21 10:52:13, (0:0:22) ago.
Number of UP nodes: 2
List of UP nodes: 0 11
Domain not established for (0:0:22). Currently waiting for node 0
In the preceding output, this node is waiting either for node 0 itself or for an event or message from node 0. The expected event or message depends on the NS state, which is shown in the hagsns command output.
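Because the analysis depends on the NS state, it can help to pull that value out of the hagsns output. The following is a minimal sketch; the function name ns_state is hypothetical, and the only thing it relies on is the "NS state=" text shown in the sample outputs.

```shell
# Hypothetical helper: reads "hagsns -s subsystem_name" output on stdin and
# prints the value that follows "NS state=" (for example kUncertain).
ns_state() {
    sed -n 's/.*NS state=\([A-Za-z]*\).*/\1/p' | head -n 1
}

# Example use on a node (not executed here):
#   hagsns -s hags | ns_state
```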
Analyze the NS state as follows:
Domain not established for (0:0:22). Currently waiting for node 0.1
This node received the acknowledge (Proclaim or InsertPhase1 message) and is waiting for the next message (InsertPhase1 or Commit message) from the NS (node 0).
If this state does not change to kCertain in two or three minutes, proceed to Operational test 1 - Verify that Group Services is working properly, for Topology Services and Group Services on the node that is being waited for (node 0 in this example).
Domain not established for (0:0:22). Waiting for 3 nodes: 1 7 6
If there are many waiting nodes, the output is similar to:
Domain not established for (0:0:22). Waiting for 43 nodes: 1 7 6 9 4 ....
This node is trying to become a nameserver, and the node is waiting for responses from the nodes that are listed in the hagsns command output. If this state remains for between three and five minutes, proceed to Operational test 1 - Verify that Group Services is working properly, for Topology Services and Group Services on the nodes that are on the waiting list.
Domain not recovered for (0:0:22). Currently waiting for node 0.1
After the current NS failure, this node is waiting for a candidate node that is becoming the NS. If this state stays too long, proceed to Operational test 1 - Verify that Group Services is working properly, for the Topology Services and Group Services on the node that is in the waiting list.
In this output, the value 0.1 means the following:
Therefore, this local node is waiting for a response from the GS daemon of node 0, and the incarnation is 1.
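The split of a value such as 0.1 into a node number and an incarnation can be illustrated with POSIX parameter expansion. This is a sketch only; the function name split_waiting_target is hypothetical.

```shell
# Hypothetical helper: splits a "node.incarnation" value such as 0.1 into
# its two parts using POSIX parameter expansion.
split_waiting_target() {
    target=$1
    node=${target%%.*}          # text before the first dot
    incarnation=${target#*.}    # text after the first dot
    echo "node=$node incarnation=$incarnation"
}

# Example:
#   split_waiting_target 0.1
```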
Issue the following command:
lssrc -ls subsystem_name
Error results are indicated by outputs similar to the error results of Operational test 1 - Verify that Group Services is working properly through Operational test 3 - Determine why the Group Services domain is not established or why it is not recovered.
Good results are indicated by an output similar to:
Subsystem         Group            PID     Status
 hags             hags             22962   active
2 locally-connected clients. Their PIDs:
20898(hagsglsmd) 25028(haemd)
HA Group Services domain information:
Domain established by node 21
Number of groups known locally: 2
                   Number of   Number of local
Group name         providers   providers/subscribers
cssMembership          10          0           1
ha_em_peers             6          1           0
In this output, examine the Group name field to see whether the requested group name exists. For example, the group ha_em_peers has 1 local provider, 0 local subscribers, and 6 total providers.
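Reading the three counts for one group can be sketched as below. The function name group_counts is hypothetical; the column order (total providers, local providers, local subscribers) follows the ha_em_peers example in the preceding paragraph.

```shell
# Hypothetical helper: reads "lssrc -ls subsystem_name" output on stdin and
# reports the three counts printed on the row for the named group.
# Returns a non-zero status when the group is not listed.
group_counts() {
    awk -v g="$1" '
        $1 == g && NF == 4 {
            printf "total_providers=%s local_providers=%s local_subscribers=%s\n", $2, $3, $4
            found = 1
        }
        END { exit (found ? 0 : 1) }'
}

# Example use on a node (not executed here):
#   lssrc -ls hags | group_counts ha_em_peers
```

A non-zero return status from this sketch means that the requested group name does not appear in the output.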
For more information about the given group, issue the command:
hagsns -s subsystem_name
on the NS node. The output is similar to:
HA GS NameServer Status
NodeId=6.14, pid=10094, domainId=6.14, NS established, CodeLevel=GSlevel(DRL=8)
NS state=kBecomeNS, protocolInProgress=kNoProtocol, outstandingBroadcast=kNoBcast
Process started on Jun 19 18:35:55, (10d 20:22:39) ago. HB connection took (0:0:0).
Initial NS certainty on Jun 19 18:36:12, (10d 20:22:22) ago, taking (0:0:16).
Our current epoch of certainty started on Jun 23 13:05:18, (7d 1:53:16) ago.
Number of UP nodes: 12
List of UP nodes: 0 1 5 6 7 8 9 11 17 19 23 26
List of known groups:
1.1 cssMembership: GL: 6 seqNum: 73 theIPS: 6 1 26 17 0 8 7 9 5 11 lookupQ:
2.1 ha_em_peers: GL: 6 seqNum: 30 theIPS: 6 0 8 7 5 11 lookupQ:
In the last line, the nodes that have the providers of the group ha_em_peers are 6 0 8 7 5 11.
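Extracting the provider nodes of one group from the hagsns output can be sketched as follows. The function name provider_nodes is hypothetical, and the sketch assumes that each known group occupies a single line in which the node list sits between "theIPS:" and "lookupQ:", as in the sample.

```shell
# Hypothetical helper: reads "hagsns -s subsystem_name" output on stdin and
# prints the node numbers listed after "theIPS:" for the named group.
# Assumes one line per group, with the group name as the second field.
provider_nodes() {
    awk -v g="$1:" '
        $2 == g {
            for (i = 3; i <= NF; i++) {
                if ($i == "theIPS:") { collecting = 1; continue }
                if ($i == "lookupQ:") break
                if (collecting) printf "%s%s", (n++ ? " " : ""), $i
            }
            print ""
            exit
        }'
}

# Example use on the NS node (not executed here):
#   hagsns -s hags | provider_nodes ha_em_peers
```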
If Operational test 1 - Verify that Group Services is working properly through Operational test 3 - Determine why the Group Services domain is not established or why it is not recovered succeeded, issue the following command:
lssrc -ls subsystem_name
The output is similar to:
Subsystem         Group            PID     Status
 hags             hags             22962   active
2 locally-connected clients. Their PIDs:
20898(hagsglsmd) 25028(haemd)
HA Group Services domain information:
Domain established by node 21
Number of groups known locally: 2
                   Number of   Number of local
Group name         providers   providers/subscribers
cssMembership          10          1           0
ha_em_peers             6          1           0
In the preceding output, the cssMembership group has 1 local provider. Otherwise, the following conditions apply:
There are several possible causes:
Perform switch diagnosis. See Diagnosing SP Switch problems or Diagnosing SP Switch2 problems.
Issue the following command:
lssrc -ls hats_subsystem
where hats_subsystem is:
Subsystem         Group            PID     Status
 hats             hats             17058   active
Network Name   Indx Defd Mbrs St Adapter ID      Group ID
SPether        [0]    15    2  S 9.114.61.65     9.114.61.125
SPether        [0]  en0          0x37821d69      0x3784f3a9
HB Interval = 1 secs. Sensitivity = 4 missed beats
SPswitch       [1]    14    0  D 9.114.61.129
SPswitch       [1]  css0
HB Interval = 1 secs. Sensitivity = 4 missed beats
1 locally connected Client with PID:
hagsd( 26366)
Configuration Instance = 926456205
Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
Control Workstation IP address = 9.114.61.125
Daemon employs no security
Data segment size 7044 KB
Find the first SPswitch row in the Network Name column. Find the St (state) column in the output. At the intersection of the first SPswitch row and the state column is a letter. If it is not S, the Topology Services SPswitch group is not yet stable; wait a few minutes longer. If the state remains D or U for too long, proceed to Topology Services diagnosis. See Diagnosing Topology Services problems. If the state is S, proceed to Step 1c. In this example, the state is D.
The state has the following values:
Proceed to Operational test 7 - Verify the HAGSGLSM (Group Services GLobalized Switch Membership) subsystem.
Proceed to Operational test 7 - Verify the HAGSGLSM (Group Services GLobalized Switch Membership) subsystem.
Issue the following command:
hagsvote -ls subsystem
Compare the output to this list of choices.
Number of groups: 3
Group slot #[0] Group name [HostMembership] GL node [Unknown] voting data:
No protocol is currently executing in the group.
--------------------------------------------------------------
Group slot #[1] Group name [cssRawMembership] GL node [Unknown] voting data:
No protocol is currently executing in the group.
--------------------------------------------------------------
Group slot #[2] Group name [theSourceGroup] GL node [1] voting data:
No protocol is currently executing in the group.
---------------------------------------------------------------
In this output, no protocol is running for "theSourceGroup".
Group slot #[2] Group name [theSourceGroup] GL node [1] voting data:
Not GL in phase [1] of n-phase protocol of type [Join].
Local voting data:
Number of providers: 1
Number of providers not yet voted: 1 (vote not submitted).
Given vote:[No vote value] Default vote:[No vote value]
------------------------------------------------------
The number of local providers is 1, and no vote has been submitted. The Group Leader (GL) node of the group is node 1. The output of the same command on the GL node (node 1) is similar to:
Group slot #[3] Group name [theSourceGroup] GL node [1] voting data:
GL in phase [1] of n-phase protocol of type [Join].
Local voting data:
Number of providers: 1
Number of providers not yet voted: 0 (vote submitted).
Given vote:[Approve vote] Default vote:[No vote value]
Global voting data:
Number of providers not yet voted: 1
Given vote:[Approve vote] Default vote:[No vote value]
--------------------------------------------------
This indicates that a total of one provider has not voted.
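Deciding whether a protocol is running for one group can be sketched as follows. The function name group_protocol_state and the keywords it prints are hypothetical; the sketch assumes that each group's section runs from its "Group name [...]" text to the next line of dashes, as in the samples.

```shell
# Hypothetical helper: reads "hagsvote -ls subsystem" output on stdin and
# prints "idle" when the named group reports that no protocol is executing,
# or "protocol_running" otherwise. Returns non-zero if the group is absent.
group_protocol_state() {
    awk -v g="Group name [$1]" '
        index($0, g) { in_group = 1; seen = 1 }
        in_group && index($0, "No protocol is currently executing") { idle = 1 }
        /^-+$/ { in_group = 0 }
        END {
            if (!seen) exit 1
            print (idle ? "idle" : "protocol_running")
        }'
}

# Example use on a node (not executed here):
#   hagsvote -ls hags | group_protocol_state theSourceGroup
```

When the result is protocol_running, the voting data for that group still has to be read as described above to see which providers have not voted.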
lssrc -ls glsm_subsystem
where glsm_subsystem is:
Good results are indicated by output similar to:
Subsystem         Group            PID     Status
 hagsglsm.c47s    hags             22192   active
Status information for subsystem hagsglsm.c47s:
Connected to Group Services.
Adapter   Group               Mbrs  Joined  Subs'd  Aliases
css0      (device does not exist)
          cssMembership          0  No      Yes     -
css1      (device does not exist)
          css1Membership         0  No      Yes     -
ml0       ml0Membership          -  No      -
Aggregate Adapter Configuration
The current configuration id is 0x1482933.
ml0[css0] Nodes: 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61
ml0[css1] Nodes: 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61
Subsystem         Group            PID     Status
 hagsglsm         hags             16788   active
Status information for subsystem hagsglsm:
Connected to Group Services.
Adapter   Group               Mbrs  Joined  Subs'd  Aliases
css0      cssRawMembership      16  -       Yes     1
          cssMembership         16  Yes     Yes     -
css1      css1RawMembership     16  -       Yes     1
          css1Membership        16  Yes     Yes     -
ml0       ml0Membership         16  Yes     -       cssMembership
Aggregate Adapter Configuration
The current configuration id is 0x23784582.
ml0[css0] Nodes: 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61
ml0[css1] Nodes: 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61
Error results are indicated by one of the following outputs:
0513-036 The request could not be passed to the hags subsystem. Start the subsystem and try your command again.
This means that the HAGSGLSM daemon is not running. The subsystem is down. Issue the errpt command and look for an entry for the subsystem name. Proceed to Operational test 2 - Determine why the Group Services subsystem is not active.
0513-085 The hagsglsm Subsystem is not on file.
This means that the HAGSGLSM subsystem is not defined to the AIX SRC.
For PSSP nodes, the partition-sensitive subsystems may have been undefined by the syspar_ctrl command. The same command may be used to add the subsystems to the node.
In HACMP/ES, HACMP may not have been installed on the node. Check the HACMP subsystem.
Subsystem         Group            PID     Status
 hagsglsm.c47s    hags             26578   active
Status information for subsystem hagsglsm.c47s:
Not yet connected to Group Services after 4 connect tries
HAGSGLSM is not connected to Group Services. The Group Services daemon is not running. If the state is S, proceed to Operational test 1 - Verify that Group Services is working properly for Group Services subsystem verification.
Subsystem         Group            PID     Status
 bhagsglsm        bhags            16048   active
Status information for subsystem bhagsglsm:
Waiting for Group Services response.
HAGSGLSM is being connected to Group Services. Wait for a few seconds. If this condition does not change after several seconds, proceed to Operational test 3 - Determine why the Group Services domain is not established or why it is not recovered or Operational test 6 - Verify whether Group Services is running a protocol for a group.
Subsystem         Group            PID     Status
 hagsglsm         hags             26788   active
Status information for subsystem hagsglsm:
Connected to Group Services.
Adapter   Group               Mbrs  Joined  Subs'd  Aliases
css0      cssRawMembership       -  -       No      -
          cssMembership         16  No      No      -
css1      css1RawMembership     15  -       Yes     1
          css1Membership        15  Yes     Yes     -
ml0       ml0Membership          -  -       -       -
Aggregate Adapter Configuration
The current configuration id is 0x23784582.
ml0[css0] Nodes: 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61
ml0[css1] Nodes: 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61
On nodes that have the switch, the line "cssRawMembership" or "css1RawMembership" has No in the Subs'd column.
Check Topology Services to see whether the switch is working. Issue the command:
lssrc -ls hats_subsystem
The output is similar to:
Subsystem         Group            PID     Status
 hats             hats             25074   active
Network Name   Indx Defd Mbrs St Adapter ID      Group ID
SPether        [0]    15   11  S 9.114.61.65     9.114.61.193
SPether        [0]  en0          0x376d296c      0x3784fdc5
HB Interval = 1 secs. Sensitivity = 4 missed beats
SPswitch       [1]    14    8  S 9.114.61.129    9.114.61.154
SPswitch       [1]  css0         0x376d296d      0x3784fc48
HB Interval = 1 secs. Sensitivity = 4 missed beats
1 locally connected Client with PID:
hagsd( 14460)
Configuration Instance = 925928580
Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
Control Workstation IP address = 9.114.61.125
Daemon employs no security
Data segment size 7052 KB
Find the first row under Network Name with SPswitch. Find the column with heading St (state). Read the value where this row and column intersect. If the value is not S, see TS_LOC_DOWN_ST and proceed to Action 3 - Correct local adapter problem.
If the state is S, proceed to Operational test 1 - Verify that Group Services is working properly to see whether the Group Services domain is established or not. If the Group Services domain is established, proceed to Operational test 6 - Verify whether Group Services is running a protocol for a group for cssMembership protocol activity.
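Reading the St value of the first SPswitch row can be sketched as follows. The function name spswitch_state is hypothetical, and the sketch assumes the row layout of the samples: network name, bracketed index, Defd, Mbrs, St, then the addresses.

```shell
# Hypothetical helper: reads "lssrc -ls hats_subsystem" output on stdin and
# prints the St value from the first SPswitch row. The bracketed index is
# removed first so that spacing inside the brackets does not shift fields.
spswitch_state() {
    awk '
        $1 == "SPswitch" {
            sub(/\[[^]]*\]/, "")
            if ($4 ~ /^[A-Z]$/) { print $4; exit }
        }'
}

# Example use on a node (not executed here):
#   lssrc -ls hats | spswitch_state
```

A printed value of S means that the check continues with Group Services; any other value points to the Topology Services or adapter actions named above.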