IBM Books

Diagnosis Guide


Error symptoms, responses, and recoveries

Use the following table to diagnose problems with Group Services. Locate the symptom and perform the recovery action listed for it:

Table 61. Group Services symptoms

| Symptom | Error label | Recovery |
|---|---|---|
| GS daemon cannot start. | GS_STARTERR_ER | See Action 1 - Start Group Services daemon. |
| GS domains merged. | GS_DOM_MERGE_ER | See Action 2 - Verify Status of Group Services Subsystem. |
| GS clients cannot connect or join the GS daemon. | The following errors may be present: GS_AUTH_DENIED_ST, GS_CLNT_SOCK_ER, GS_DOM_NOT_FORM_WA | See Action 3 - Correct Group Services access problem. |
| GS daemon died unexpectedly. | The following errors may be present: GS_ERROR_ER, GS_DOM_MERGE_ER, GS_TS_RETCODE_ER, GS_STOP_ST, GS_XSTALE_PRCLM_ER | See Action 4 - Correct Group Services daemon problem. |
| GS domain cannot be established or recovered. | The following errors may be present: GS_STARTERR_ER, GS_DOM_NOT_FORM_WA | See Action 5 - Correct domain problem. |
| GS protocol has not been completed for a long time. | None | See Action 6 - Correct protocol problem. |
| HAGSGLSM cannot start. | GS_GLSM_STARTERR_ER | See Action 7 - Correct hagsglsm startup problem. |
| HAGSGLSM has stopped. | GS_GLSM_ERROR_ER or None | See Action 8 - hagsglsm daemon has stopped. |
| Non-stale proclaim message received. | GS_XSTALE_PRCLM_ER | See Action 9 - Investigate non-stale proclaim message. |

Actions

Action 1 - Start Group Services daemon

Some of the possible causes are:

Run the diagnostics in Operational test 2 - Determine why the Group Services subsystem is not active to determine the cause of the problem.

Action 2 - Verify Status of Group Services Subsystem

The AIX error log has a GS_DOM_MERGE_ER entry, and the Group Services daemon has restarted. The most common cause is that the Group Services daemon received a NODE_UP event from Topology Services after it had formed more than one domain.

If the Group Services daemon has restarted and a domain has been formed, no action is needed. However, if the Group Services daemon has not restarted, perform Operational test 1 - Verify that Group Services is working properly to verify the status of the GS subsystem.

Perform these steps:

  1. Find a node with the GS_DOM_MERGE_ER AIX error log entry.
  2. Find the GS_START_ST entry before the GS_DOM_MERGE_ER in the AIX error log.
  3. If there is a GS_START_ST entry, issue the command:
    lssrc -l -s subsystem_name
    

    where subsystem_name is:

  4. The lssrc output contains the node number that established the GS domain.
  5. Otherwise, proceed to Operational test 3 - Determine why the Group Services domain is not established or why it is not recovered.
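
Steps 1 through 3 can be sketched as a small filter over detailed error report text. This is a minimal sketch, not part of RSCT: gs_list_entries is a hypothetical helper name, and it assumes each entry carries a LABEL: line followed by a Date/Time: line, as in the errpt -a sample shown under Action 3.

```shell
# Hypothetical helper (not an RSCT command): read detailed error report
# text on standard input and print "LABEL<TAB>Date/Time" for every
# GS_START_ST or GS_DOM_MERGE_ER entry, in the order the entries appear.
gs_list_entries() {
    awk '
        # Remember the label of the entry currently being read.
        /^[ \t]*LABEL:/ { label = $2 }
        # Assumption: the Date/Time line follows the LABEL line of its entry.
        /^[ \t]*Date\/Time:/ {
            if (label == "GS_START_ST" || label == "GS_DOM_MERGE_ER") {
                sub(/^[ \t]*Date\/Time:[ \t]*/, "")
                sub(/[ \t]+$/, "")
                printf "%s\t%s\n", label, $0
            }
        }
    '
}
```

On a node, the filter could be fed with something like errpt -a -J GS_START_ST,GS_DOM_MERGE_ER | gs_list_entries, assuming the -a and -J flags are combined in that way. The resulting list makes it easy to see whether a GS_START_ST entry accompanies the merge entry.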

After the merge, the Group Services daemon must be restarted. See the TS_NODEUP_ST entry. Verify the status of the daemon with Operational test 2 - Determine why the Group Services subsystem is not active.

Action 3 - Correct Group Services access problem

For the nodes that cannot join, some of the possible causes are:

  1. Group Services may not be running.
  2. Group Services domain may not be established.
  3. The clients may not have permission to connect to the Group Services daemon.
  4. Group Services is currently running a protocol for the group that the client is trying to join or subscribe to.

Analyze and correct this problem as follows:

  1. Issue the command:
    lssrc -s subsystem_name
    

    where subsystem_name is:

    The output is similar to:

    Subsystem         Group            PID     Status 
     hags.c47s        hags             23482   active
    

    If Status is not active, this indicates that the node cannot join the GS daemon. Perform Operational test 2 - Determine why the Group Services subsystem is not active. Start the Group Services subsystem by issuing this command:

    /usr/sbin/rsct/bin/hagsctrl -s
    

    If Status is active, proceed to Step 2.

  2. Perform Operational test 1 - Verify that Group Services is working properly to check whether the Group Services domain is established or not.
  3. Issue the command:
    errpt -a -N subsystem_name | more
    

    where subsystem_name is:

    Check the AIX error log for this entry:

    ----------------------------------------------------
    LABEL:          GS_AUTH_DENIED_ST
    IDENTIFIER:     23628CC2
     
    Date/Time:       Tue Jul 13 13:29:52 
    Sequence Number: 213946
    Machine Id:      000032124C00
    Node Id:         c47n09
    Class:           O
    Type:            INFO
    Resource Name:   hags
     
    Description
    User is not allowed to use Group Services daemon
     
    Probable Causes
    The user is not the root user
    The user is not a member of hagsuser group
     
    Failure Causes
    Group Services does not allow the user
     
    Recommended Actions
    Check whether the user is the root
    Check whether the user is a member of hagsuser group
     
    Detail Data
    DETECTING MODULE
    RSCT,SSuppConnSocket.C,           1.17, 421   
    ERROR ID 
    .0ncMX.ESrWr.Oin//rXQ7....................
    REFERENCE CODE
                                              
    DIAGNOSTIC EXPLANATION
    User myuser1 is not a supplementary user of group 111. Connection refused. 
    

    This entry indicates that the user (myuser1) running the client program does not have permission to use Group Services.

    As the Probable Causes field shows, only the root user and members of the hagsuser group can access Group Services.

    Change the ownership of the client program to a user who can access Group Services.

  4. Issue the command:
    hagsvote -ls subsystem_name
    

    to determine whether the group is busy, and to find the Group Leader node for the specific group.

  5. Issue the same command on the Group Leader node to determine the global status of the group. Resolve the problem in the client programs.
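
The status check in Step 1 can be scripted. The following is a minimal sketch under stated assumptions: gs_is_active is a hypothetical helper name, and it assumes lssrc -s output has one header line followed by one subsystem row whose last field is the Status value, as in the sample above.

```shell
# Hypothetical helper (not an RSCT command): read "lssrc -s" output on
# standard input, print the Status value of the first subsystem row, and
# return 0 only when that value is "active".
gs_is_active() {
    # Skip the header line; take the last field of the first non-empty row.
    gs_state=$(awk 'NR > 1 && NF > 0 { print $NF; exit }')
    echo "${gs_state:-unknown}"
    [ "$gs_state" = "active" ]
}
```

A possible use is lssrc -s subsystem_name | gs_is_active || /usr/sbin/rsct/bin/hagsctrl -s, which starts the subsystem only when it is not reported as active. Performing Operational test 2 first is still advisable, because the sketch does not explain why the subsystem is down.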

Action 4 - Correct Group Services daemon problem

Some of the possible causes are:

  1. Domain merged.
  2. Group Services daemon received a non-stale proclaim message from its NS.

    If the Topology Services daemon is alive when the current NS restarts and tries to become an NS again, the newly started NS sends a proclaim message to the other nodes, which still consider that node to be their NS. The receiving nodes treat the proclaim message as current (that is, "non-stale"), but this situation is undefined by design. Therefore, the receiving Group Services daemon produces a core dump.

  3. The Topology Services daemon has died.
  4. The Group Services daemon has stopped.
  5. Group Services has an internal error that caused a core dump.

Examine the AIX error log by issuing the command:

errpt -J GS_DOM_MERGE_ER,GS_XSTALE_PRCLM_ER,GS_ERROR_ER,GS_STOP_ST,\
GS_TS_RETCODE_ER | more

and search for GS_ labels or a RESOURCE NAME of any of the GS subsystems. If an entry is found, the cause is explained in the DIAGNOSTIC EXPLANATION field.
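
To pull out only the explanations, the detailed form of the report can be filtered. This is a minimal sketch: gs_explanations is a hypothetical helper name, and it assumes the explanation occupies the single line that follows the DIAGNOSTIC EXPLANATION heading, as in the GS_AUTH_DENIED_ST sample under Action 3.

```shell
# Hypothetical helper (not an RSCT command): read detailed error report
# text on standard input and print "LABEL: explanation" for each entry.
gs_explanations() {
    awk '
        # Remember the label of the entry currently being read.
        /^[ \t]*LABEL:/ { label = $2 }
        # The line after the heading is assumed to hold the explanation.
        want {
            sub(/^[ \t]+/, "")
            sub(/[ \t]+$/, "")
            if ($0 != "") print label ": " $0
            want = 0
            next
        }
        /^[ \t]*DIAGNOSTIC EXPLANATION[ \t]*$/ { want = 1 }
    '
}
```

For example, the output of the errpt command above, run with the -a flag added, could be piped into gs_explanations. Explanations that span more than one line would be truncated to their first line by this sketch.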

If Group Services has taken a core dump, the AIX error log will have the CORE_DUMP label with RESOURCE NAME of any of the GS subsystems. In this case, the core file is in: /var/ha/run/gs_subsystem.partition. Save this file. See Action 7 - Investigate Group Services failure.

Action 5 - Correct domain problem

Some of the possible causes are:

  1. Topology Services is running, but the Group Services daemon is not running on some of the nodes.
  2. Group Services internal NS protocol is currently running.

Proceed to Operational test 3 - Determine why the Group Services domain is not established or why it is not recovered.

Action 6 - Correct protocol problem

This problem occurs because a client failed to vote in a specific protocol. Issue this command on any node that has target groups:

hagsvote -ls gs_subsystem

where gs_subsystem is:

If this node did not vote for the protocol, the output is similar to:

Group slot #[3] Group name [theSourceGroup] GL node [0] voting data:
Not GL in phase [1] of n-phase protocol of type [Join]. 
Local voting data:
Number of providers: 1
Number of providers not yet voted: 1 (vote not submitted).
Given vote:[No vote value] Default vote:[No vote value]
ProviderId      Voted?  Failed? Conditional?
[101/11]        No      No      Yes

As the preceding output shows, one of the local providers did not submit a vote. If this node has already voted but the overall protocol is still running, the output is similar to:

Group slot #[3] Group name [theSourceGroup] GL node [0] voting data:
Not GL in phase [1] of n-phase protocol of type [Join]. 
Local voting data:
Number of providers: 1
Number of providers not yet voted: 0 (vote submitted).
Given vote:[Approve vote] Default vote:[No vote value]
ProviderId      Voted?  Failed? Conditional?
[101/11]        Yes     No      Yes

In this case, issue the same command on the Group Leader node. The output is similar to:

Group slot #[2] Group name [theSourceGroup] GL node [0] voting data:
GL in phase [1] of n-phase protocol of type [Join]. 
Local voting data:
Number of providers: 1
Number of providers not yet voted: 1 (vote not submitted).
Given vote:[Approve vote] Default vote:[No vote value]
ProviderId      Voted?  Failed? Conditional?
[101/0]         No      No      No
 
Global voting data:
Number of providers not yet voted: 1
Given vote:[Approve vote] Default vote:[No vote value]
Nodes that have voted: [11]
Nodes that have not voted: [0]

If there is no provider on the Group Leader node, the output of hagsvote -ls subsystem_name is similar to:

Number of groups: 1 
Group slot #[2] Group name [theSourceGroup] GL node [0] voting data:
GL in phase [1] of n-phase protocol of type [Join]. 
Local voting data:
No local providers to vote.  Dummy vote submitted.
Global voting data:
Number of providers not yet voted: 0
Given vote:[No vote value] Default vote:[No vote value]
Nodes that have voted: [0 ]
Nodes that have not voted: [2 ]

The GL's output contains the information about the nodes that did not vote. Investigate the reason for their failure to do so. Debug the GS client application.
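
Reading the voting data by eye becomes tedious when several groups are listed. The following minimal sketch summarizes it. gs_pending_votes is a hypothetical helper name, and the line formats it matches are taken only from the sample outputs above, so other output variants may need different patterns.

```shell
# Hypothetical helper (not an RSCT command): read "hagsvote -ls" output on
# standard input and report, per group, a missing local vote and the nodes
# that the Group Leader is still waiting for.
gs_pending_votes() {
    awk '
        # Extract the group name from the "Group slot" header line.
        /^Group slot/ {
            group = $0
            sub(/^.*Group name \[/, "", group)
            sub(/\].*$/, "", group)
        }
        # Local voting data: a provider on this node has not voted yet.
        /^Number of providers not yet voted:/ && /vote not submitted/ {
            print group ": local vote not submitted"
        }
        # Global voting data (Group Leader only): nodes still outstanding.
        /^Nodes that have not voted:/ {
            nodes = $0
            sub(/^Nodes that have not voted:[ \t]*\[/, "", nodes)
            sub(/[ \t]*\].*$/, "", nodes)
            if (nodes != "") print group ": waiting for nodes " nodes
        }
    '
}
```

Running hagsvote -ls gs_subsystem | gs_pending_votes on the Group Leader node would then list the nodes whose clients need to be debugged.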

Action 7 - Correct hagsglsm startup problem

Some of the possible causes are:

Proceed to Operational test 7 - Verify the HAGSGLSM (Group Services GLobalized Switch Membership) subsystem.

Action 8 - hagsglsm daemon has stopped

Issue this command:

lssrc -l -s subsystem_name

where subsystem_name is:

If the daemon is stopped, the output shows a status of "inoperative" for hagsglsm. Otherwise, the output shows a status of "active" for hagsglsm. If the daemon was not stopped intentionally, see Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

Action 9 - Investigate non-stale proclaim message

The local Group Services daemon receives a valid domain join request (proclaim) message from its NameServer (NS) more than once. This typically happens when Topology Services notifies Group Services of inconsistent node events. This problem should be resolved automatically if a GS_START_ST AIX error log entry is seen after the problem occurs.

Perform these actions:

  1. Find the GS_START_ST AIX error log entry that follows the GS_XSTALE_PRCLM_ER entry.
  2. If there is a GS_START_ST entry, issue the command:
    lssrc -l -s subsystem_name
    

    where subsystem_name is:

  3. The lssrc output contains the node number that established the GS domain.
  4. Otherwise, proceed to Action 4 - Correct Group Services daemon problem.

If this problem persists, record all relevant information and contact the IBM Support Center.

