Use the following table to diagnose problems with Group Services.
Locate the symptom, then perform the corresponding recovery action:
Table 61. Group Services symptoms
Symptom | Error label | Recovery |
---|---|---|
GS daemon cannot start. | GS_STARTERR_ER | See Action 1 - Start Group Services daemon. |
GS domains merged. | GS_DOM_MERGE_ER | See Action 2 - Verify Status of Group Services Subsystem. |
GS clients cannot connect or join the GS daemon. | The following errors may be present: GS_AUTH_DENIED_ST, GS_CLNT_SOCK_ER, GS_DOM_NOT_FORM_WA | See Action 3 - Correct Group Services access problem. |
GS daemon died unexpectedly. | The following errors may be present: GS_ERROR_ER, GS_DOM_MERGE_ER, GS_TS_RETCODE_ER, GS_STOP_ST, GS_XSTALE_PRCLM_ER | See Action 4 - Correct Group Services daemon problem. |
GS domain cannot be established or recovered. | The following errors may be present: GS_STARTERR_ER, GS_DOM_NOT_FORM_WA | See Action 5 - Correct domain problem. |
GS protocol has not been completed for a long time. | None | See Action 6 - Correct protocol problem. |
HAGSGLSM cannot start. | GS_GLSM_STARTERR_ER | See Action 7 - Correct hagsglsm startup problem. |
HAGSGLSM has stopped. | GS_GLSM_ERROR_ER or None | See Action 8 - hagsglsm daemon has stopped. |
Non-stale proclaim message received. | GS_XSTALE_PRCLM_ER | See Action 9 - Investigate non-stale proclaim message. |
Some of the possible causes are:
Run the diagnostics in Operational test 2 - Determine why the Group Services subsystem is not active to determine the cause of the problem.
The AIX error log contains a GS_DOM_MERGE_ER entry, and the Group Services daemon has restarted. The most common cause is that the Group Services daemon received a NODE_UP event from Topology Services after it had already formed more than one domain.
If the Group Services daemon has restarted and a domain has been formed, no action is needed. However, if the Group Services daemon has not restarted, perform Operational test 1 - Verify that Group Services is working properly to verify the status of the GS subsystem.
Perform these steps:
lssrc -l -s subsystem_name
where subsystem_name is:
After the merge, the Group Services daemon must be restarted. See TS_NODEUP_ST. To check whether the daemon has restarted, use Operational test 2 - Determine why the Group Services subsystem is not active.
For the nodes that cannot join, some of the possible causes are:
Analyze and correct this problem as follows:
lssrc -s subsystem_name
where subsystem_name is:
The output is similar to:
Subsystem         Group            PID     Status
 hags.c47s        hags             23482   active
If Status is not active, this indicates that the node cannot join the GS daemon. Perform Operational test 2 - Determine why the Group Services subsystem is not active. Start the Group Services subsystem by issuing this command:
/usr/sbin/rsct/bin/hagsctrl -s
If Status is active, proceed to Step 2.
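The status check in this step can also be scripted. The following is a minimal sketch, not part of the RSCT tooling: the function name `gs_status` is hypothetical, and it assumes that `lssrc -s` prints one header line followed by rows whose last field is the status, as in the sample output above.

```shell
#!/bin/sh
# gs_status: hypothetical helper (not an RSCT command).
# Reads `lssrc -s` output on stdin and prints the last field of the
# first row after the header line, which is assumed to be the Status.
gs_status() {
    awk 'NR > 1 && NF > 0 { print $NF; exit }'
}

# Example using the sample output from this section. On a live node,
# replace the printf with: lssrc -s subsystem_name
status=$(printf '%s\n' \
    'Subsystem         Group            PID     Status' \
    ' hags.c47s        hags             23482   active' | gs_status)

if [ "$status" = "active" ]; then
    echo "Group Services is active; proceed to Step 2"
else
    echo "Group Services is not active; perform Operational test 2"
fi
```

Reading the last field rather than the fourth is a deliberate choice in this sketch, so that a row without a PID value is still handled.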
errpt -a -N subsystem_name | more
where subsystem_name is:
Check the AIX error log for this entry:
----------------------------------------------------
LABEL:           GS_AUTH_DENIED_ST
IDENTIFIER:      23628CC2
Date/Time:       Tue Jul 13 13:29:52
Sequence Number: 213946
Machine Id:      000032124C00
Node Id:         c47n09
Class:           O
Type:            INFO
Resource Name:   hags
Description
User is not allowed to use Group Services daemon
Probable Causes
The user is not the root user
The user is not a member of hagsuser group
Failure Causes
Group Services does not allow the user
Recommended Actions
Check whether the user is the root
Check whether the user is a member of hagsuser group
Detail Data
DETECTING MODULE
RSCT,SSuppConnSocket.C, 1.17, 421
ERROR ID
.0ncMX.ESrWr.Oin//rXQ7....................
REFERENCE CODE
DIAGNOSTIC EXPLANATION
User myuser1 is not a supplementary user of group 111. Connection refused.
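This entry indicates that the user (myuser1) running the client program does not have the permission required to use Group Services.
As the error log entry indicates, the following users can access Group Services: the root user, and users who are members of the hagsuser group.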
Change the ownership of the client program to a user who can access Group Services.
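The access rule reported in the error log entry (root user, or membership in the hagsuser group) can be checked before changing ownership. The sketch below is illustrative only: `gs_access_ok` is a hypothetical helper name, and the group list is passed in as a plain string so that the logic can be tried without a live Group Services node.

```shell
#!/bin/sh
# gs_access_ok: hypothetical helper (not an RSCT command).
# $1 = user name, $2 = space-separated list of the user's groups.
# Succeeds if the user is root or if "hagsuser" is in the group list.
gs_access_ok() {
    user=$1
    grouplist=$2
    if [ "$user" = "root" ]; then
        return 0
    fi
    for g in $grouplist; do
        if [ "$g" = "hagsuser" ]; then
            return 0
        fi
    done
    return 1
}

# Example with made-up group names. On a live system the group list
# could be obtained with: id -Gn "$user"
if gs_access_ok myuser1 "staff users"; then
    echo "myuser1 can access Group Services"
else
    echo "myuser1 cannot access Group Services"
fi
```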
Issue the command:
hagsvote -ls subsystem_name
to determine whether the group is busy, and to find the Group Leader node for the specific group.
Some of the possible causes are:
If the Topology Services daemon is alive when the current NameServer (NS) restarts and tries to become the NS again, the newly started NS sends a proclaim message to the other nodes. These nodes already regard the newly started node as their NS, so they treat the proclaim message as current (that is, "non-stale"). Receiving a non-stale proclaim message in this state is undefined by design, so the Group Services daemon on each receiving node core dumps.
Examine the AIX error log by issuing the command:
errpt -J GS_DOM_MERGE_ER,GS_XSTALE_PRCLM_ER,GS_ERROR_ER,GS_STOP_ST,GS_TS_RETCODE_ER | more
and search for GS_ labels or a RESOURCE NAME of any of the GS subsystems. If an entry is found, the cause is explained in the DIAGNOSTIC EXPLANATION field.
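When many entries are returned, it can help to print only the DIAGNOSTIC EXPLANATION text. The following sketch assumes detailed error log output (as produced by `errpt -a`), in which DIAGNOSTIC EXPLANATION appears on its own line and each entry begins with a line of dashes. The function name `diag_explanation` is hypothetical.

```shell
#!/bin/sh
# diag_explanation: hypothetical helper (not an AIX command).
# Reads detailed error log text on stdin and prints the non-empty
# lines that follow each "DIAGNOSTIC EXPLANATION" heading, stopping
# at the dashed separator that is assumed to start the next entry.
diag_explanation() {
    awk '
        /^-+$/ { grab = 0 }                  # separator: next entry begins
        grab && NF { print }                 # inside an explanation block
        /^DIAGNOSTIC EXPLANATION/ { grab = 1 }
    '
}

# Example with a shortened, made-up excerpt of one entry. On a live
# node the input could come from: errpt -a -J GS_ERROR_ER
printf '%s\n' \
    '----------------------------------------------------' \
    'LABEL:           GS_AUTH_DENIED_ST' \
    'DIAGNOSTIC EXPLANATION' \
    'User myuser1 is not a supplementary user of group 111. Connection refused.' \
    | diag_explanation
```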
If Group Services has taken a core dump, the AIX error log will have the CORE_DUMP label with RESOURCE NAME of any of the GS subsystems. In this case, the core file is in: /var/ha/run/gs_subsystem.partition. Save this file. See Action 7 - Investigate Group Services failure.
Some of the possible causes are:
Proceed to Operational test 3 - Determine why the Group Services domain is not established or why it is not recovered.
This happens when a related client has failed to vote in a specific protocol. Issue this command on any node that has target groups:
hagsvote -ls gs_subsystem
where gs_subsystem is:
If this node did not vote for the protocol, the output is similar to:
Group slot #[3] Group name [theSourceGroup] GL node [0] voting data:
Not GL in phase [1] of n-phase protocol of type [Join].
Local voting data:
Number of providers: 1
Number of providers not yet voted: 1 (vote not submitted).
Given vote:[No vote value] Default vote:[No vote value]
ProviderId   Voted?   Failed?   Conditional?
[101/11]     No       No        Yes
As the preceding output shows, one of the local providers did not submit a vote. If this node has already voted but the overall protocol is still running, the output is similar to:
Group slot #[3] Group name [theSourceGroup] GL node [0] voting data:
Not GL in phase [1] of n-phase protocol of type [Join].
Local voting data:
Number of providers: 1
Number of providers not yet voted: 0 (vote submitted).
Given vote:[Approve vote] Default vote:[No vote value]
ProviderId   Voted?   Failed?   Conditional?
[101/11]     Yes      No        Yes
In this case, issue the same command on the Group Leader node. The output is similar to:
Group slot #[2] Group name [theSourceGroup] GL node [0] voting data:
GL in phase [1] of n-phase protocol of type [Join].
Local voting data:
Number of providers: 1
Number of providers not yet voted: 1 (vote not submitted).
Given vote:[Approve vote] Default vote:[No vote value]
ProviderId   Voted?   Failed?   Conditional?
[101/0]      No       No        No
Global voting data:
Number of providers not yet voted: 1
Given vote:[Approve vote] Default vote:[No vote value]
Nodes that have voted: [11]
Nodes that have not voted: [0]
If there is no provider on the Group Leader node, the output of hagsvote -ls gs_subsystem is similar to:
Number of groups: 1
Group slot #[2] Group name [theSourceGroup] GL node [0] voting data:
GL in phase [1] of n-phase protocol of type [Join].
Local voting data:
No local providers to vote. Dummy vote submitted.
Global voting data:
Number of providers not yet voted: 0
Given vote:[No vote value] Default vote:[No vote value]
Nodes that have voted: [0 ]
Nodes that have not voted: [2 ]
The GL's output contains the information about the nodes that did not vote. Investigate the reason for their failure to do so. Debug the GS client application.
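To pull the list of non-voting nodes out of the Group Leader's output, a small filter such as the following can be used. This is a sketch under the assumption that the output contains the text "Nodes that have not voted: [...]" exactly as in the samples above. The function name `nodes_not_voted` is hypothetical.

```shell
#!/bin/sh
# nodes_not_voted: hypothetical helper (not an RSCT command).
# Reads `hagsvote -ls` output on stdin and prints, one line per group,
# the node numbers found inside the brackets that follow
# "Nodes that have not voted:".
nodes_not_voted() {
    sed -n 's/.*Nodes that have not voted: *\[\([^]]*\)\].*/\1/p' |
        tr -s ' ' |
        sed -e 's/^ //' -e 's/ $//'
}

# Example using two lines from the Group Leader sample output above.
# On a live node the input would come from: hagsvote -ls gs_subsystem
printf '%s\n' \
    'Nodes that have voted: [11]' \
    'Nodes that have not voted: [0]' | nodes_not_voted
```

The nodes printed by the filter are the ones on which to investigate the GS client application.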
Some of the possible causes are:
Proceed to Operational test 7 - Verify the HAGSGLSM (Group Services GLobalized Switch Membership) subsystem.
Issue this command:
lssrc -l -s subsystem_name
where subsystem_name is:
If the daemon is stopped, the output will contain a status of "inoperative" for hagsglsm. Otherwise, the output will contain a status of "operative" for hagsglsm. If stopping the daemon was not intended, see Information to collect before contacting the IBM Support Center and contact the IBM Support Center.
The local Group Services daemon receives a valid domain join request (proclaim) message from its NameServer (NS) more than once. This typically happens when Topology Services notifies Group Services of inconsistent node events. This problem should be resolved automatically if a GS_START_ST AIX error log entry is seen after the problem occurs.
Perform these actions:
lssrc -l -s subsystem_name
where subsystem_name is:
If this problem persists, record all relevant information and contact the IBM Support Center.