IBM Books

Diagnosis Guide


Diagnostic procedures

Installation verification test

This test determines whether RSCT has been successfully installed. Group Services is a part of RSCT. Perform the following steps:

  1. Issue the command:
    lslpp -l | grep rsct 
    

    Good results are indicated by output similar to:

     rsct.basic.hacmp     1.2.0.0  COMMITTED  RS/6000 Cluster Technology (HACMP domains)
     rsct.basic.rte       1.2.0.0  COMMITTED  RS/6000 Cluster Technology (all domains)
     rsct.basic.sp        1.2.0.0  COMMITTED  RS/6000 Cluster Technology (SP domains)
     rsct.clients.hacmp   1.2.0.0  COMMITTED  RS/6000 Cluster Technology (HACMP domains)
     rsct.clients.rte     1.2.0.0  COMMITTED  RS/6000 Cluster Technology (all domains)
     rsct.clients.sp      1.2.0.0  COMMITTED  RS/6000 Cluster Technology (SP domains)
     rsct.core.utils      1.2.0.0  COMMITTED  RS/6000 Cluster Technology (all domains)
     
    

    Error results are indicated by no output from the command.

  2. Issue the command:
    lppchk -c "rsct*"
    

    Good results are indicated by the absence of error messages and the return of a zero exit status from this command. The command produces no output if it succeeds.

    Error results are indicated by a non-zero exit code and by error messages similar to these:

    lppchk: 0504-206  File /usr/lib/nls/msg/en_US/hats.cat could not be located.
    lppchk: 0504-206  File /usr/sbin/rsct/bin/hatsoptions could not be located.
    lppchk: 0504-208  Size of /usr/sbin/rsct/bin/phoenix.snap is 29356,
                          expected value was 29355.
    

    Some error messages may appear if an EFIX is applied to a file set. An EFIX is an emergency fix, supplied by IBM, to correct a specific problem.

If the test fails, the following file sets need to be installed:

  1. rsct.basic.rte
  2. rsct.core.utils
  3. rsct.clients.rte
  4. rsct.basic.sp
  5. rsct.clients.sp
  6. rsct.basic.hacmp
  7. rsct.clients.hacmp

If this test is successful, proceed to Configuration verification test.
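
The file set check in step 1 can be sketched as a small shell function. This is a minimal sketch, not part of RSCT: the function name missing_rsct_filesets is hypothetical, and it assumes that the file set name is the first whitespace-delimited field of each lslpp -l output line, as in the sample output above.

```shell
# Sketch: read `lslpp -l` output on standard input and print every
# required RSCT file set that does not appear in it.
# missing_rsct_filesets is a hypothetical helper name.
missing_rsct_filesets() {
    input=$(cat)
    for name in rsct.basic.rte rsct.core.utils rsct.clients.rte \
                rsct.basic.sp rsct.clients.sp \
                rsct.basic.hacmp rsct.clients.hacmp; do
        # Assumption: the file set name is the first field of its line.
        if ! printf '%s\n' "$input" |
             awk -v name="$name" '$1 == name { found = 1 } END { exit !found }'
        then
            printf '%s\n' "$name"
        fi
    done
}
```

On a node, the function could be fed with lslpp -l piped into it; no output would then correspond to all seven file sets being present.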

Configuration verification test

This test verifies that Group Services on a PSSP node has the configuration data that it needs. Perform the following steps:

  1. Perform the Topology Services Configuration verification diagnosis. See Diagnosing Topology Services problems.
  2. If it succeeds, issue the following commands to display data from the SDR and obtain the level of PSSP:
    1. SDRGetObjects Syspar
    2. splst_versions -t

    Good results are indicated by none of the preceding commands returning an error message, and all commands returning non-null output.

    Error results are indicated if the preceding commands fail. In this case, the SDR could be experiencing problems. Diagnose the SDR subsystem by referring to Diagnosing SDR problems. If the commands succeed but do not show the expected information, it is possible that a problem occurred in the installation of the nodes. Verify installation of the nodes by consulting Diagnosing node installation problems.

If this test is successful, proceed to Operational verification tests.
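
The "no error and non-null output" criterion in step 2 can be expressed as a generic wrapper. This is a sketch under the assumption that a failing command exits with a non-zero status; the name returns_data is hypothetical.

```shell
# Sketch: succeed only if the given command exits with status 0 AND
# writes something to standard output.
# returns_data is a hypothetical helper name.
returns_data() {
    # Assumption: a failing command reports failure through its exit status.
    out=$("$@" 2>/dev/null) || return 1
    [ -n "$out" ]
}
```

For this test it might be used as returns_data SDRGetObjects Syspar followed by returns_data splst_versions -t, treating a failure of either call as an error result.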

Operational verification tests

The following information applies to the diagnostic procedures that follow:

Operational test 1 - Verify that Group Services is working properly

Issue the following command:

lssrc -ls subsystem_name

Good results are indicated by an output similar to:

Subsystem         Group            PID     Status 
 hags             hags             22962    active
2 locally-connected clients.  Their PIDs:
20898(hagsglsmd) 25028(haemd) 
HA Group Services domain information:
Domain established by node 21
Number of groups known locally: 2
                   Number of   Number of local
Group name         providers   providers/subscribers
cssMembership           10           0           1
ha_em_peers             6            1           0

There must be an entry for cssMembership.

Error results are indicated by one of the following:

  1. A message similar to:
    0513-036 The request could not be passed to the hags subsystem.
        Start the subsystem and try your command again.
    

    This means that the GS daemon is not running. The GS subsystem is down. Issue the errpt command and look for an entry for the subsystem name. Proceed to Operational test 2 - Determine why the Group Services subsystem is not active.

  2. A message similar to:
    0513-085 The hags Subsystem is not on file.
    

    This means that the GS subsystem is not defined to the AIX SRC.

    In PSSP, the partition-sensitive subsystems may have been undefined by the syspar_ctrl command. Use syspar_ctrl -a to add the subsystems to the node.

    In HACMP/ES, HACMP may not have been installed on the node. Check the HACMP subsystem.

  3. Output similar to:
    Subsystem         Group            PID     Status 
    hags.c47s         hags             7350    active
     Subsystem hags.c47s trying to connect to Topology Services. 
    

    This means that Group Services is not connected to Topology Services. Check the Topology Services subsystem. See Diagnosing Topology Services problems.

  4. Output similar to:
    Subsystem         Group            PID     Status 
    hags.c47s         hags             35746   active
     No locally-connected clients. 
     HA Group Services domain information: 
     Domain not established. 
     Number of groups known locally: 0 
    

    This means that the GS domain is not established. This is normal during the Group Services startup period. Retry this test after about three minutes. If this situation continues, perform Operational test 3 - Determine why the Group Services domain is not established or why it is not recovered.

  5. Output similar to:
    Subsystem         Group            PID     Status
    hags.c47s         hags             35746   active
     No locally-connected clients.
     HA Group Services domain information:
     Domain is recovering.
     Number of groups known locally: 0
    

    This means that the GS domain is recovering. This is normal during Group Services domain recovery. Retry this test after waiting three to five minutes. If this situation continues, perform Operational test 3 - Determine why the Group Services domain is not established or why it is not recovered.

  6. An output similar to the Good results, but no cssMembership group is shown on the control workstation or the PSSP nodes. Proceed to Operational test 7 - Verify the HAGSGLSM (Group Services GLobalized Switch Membership) subsystem.
  7. Output similar to:
    Subsystem         Group            PID     Status 
    hags              hags             25132   active
     No locally-connected clients. 
     HA Group Services domain information: 
     Domain established by node 1. 
     Number of groups known locally: 0 
    

    This means that no GS clients are connected, or no local groups are established. This is normal for a short time after the GS daemon starts. Otherwise, one of the following conditions occurred:

    1. The haem subsystem is not running on the control workstation. Issue this command to check the status of the haem subsystem:
      lssrc -s haem.partition_name
      

      The output is similar to:

      Subsystem         Group            PID     Status 
      haem.c47s         haem                	   inoperative
      
    2. The haem subsystem and the switch are not working on the nodes. Issue the command: lssrc -ls hats. The output is similar to:
      Subsystem         Group            PID     Status 
      hats              hats             25074   active
      Network Name   Indx Defd Mbrs St Adapter ID      Group ID
      SPether        [0]   15   12  S 9.114.61.65     9.114.61.195   
      SPether        [0] en0          0x376d296c      0x3779180b
      HB Interval = 1 secs. Sensitivity = 4 missed beats
      SPswitch       [1]   14    0  D 9.114.61.129
      SPswitch       [1] css0 
      HB Interval = 1 secs. Sensitivity = 4 missed beats
        1 locally connected Client with PID:
      hagsd( 14460) 
        Configuration Instance = 925928580
        Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
        Control Workstation IP address = 9.114.61.125
        Daemon employs no security
        Data segment size 7052 KB
      

      Look for SPswitch. The line

      SPswitch       [1]   14    0  D 9.114.61.129    9.114.61.154  
      
      implies that the switch is not working or Topology Services thinks that the switch is down. For more information, see Diagnosing Topology Services problems.
    3. If the two preceding conditions do not apply, see Operational test 5 - Verify whether the cssMembership or css1Membership groups are found on a node.
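
The good and error results of this test can be told apart mechanically. The following is a minimal sketch, assuming the message numbers and phrases appear exactly as in the sample outputs above; the function name classify_gs_status and the result keywords it prints are hypothetical.

```shell
# Sketch: classify `lssrc -ls subsystem_name` output read from stdin.
# classify_gs_status and its result keywords are hypothetical names.
classify_gs_status() {
    awk '
        /0513-036/ { r = "daemon_not_running"; exit }
        /0513-085/ { r = "subsystem_not_defined"; exit }
        /trying to connect to Topology Services/ {
            r = "not_connected_to_topology_services"; exit }
        /Domain not established/     { r = "domain_not_established"; exit }
        /Domain is recovering/       { r = "domain_recovering"; exit }
        /Domain established by node/ { established = 1 }
        /Number of groups known locally:/ { if ($NF + 0 == 0) nogroups = 1 }
        $1 == "cssMembership"        { css = 1 }
        END {
            if (r != "")           print r
            else if (!established) print "unrecognized"
            else if (nogroups)     print "no_groups"
            else if (!css)         print "no_cssMembership"
            else                   print "ok"
        }'
}
```

When piping real lssrc output into the function, it is probably necessary to merge standard error into standard output, since the 0513 messages may be written there.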

Operational test 2 - Determine why the Group Services subsystem is not active

Issue the command:

errpt -N hags subsystem_name

where subsystem_name is:

and look for an entry for the subsystem_name. It appears under the RESOURCE_NAME heading.

If an entry is found, issue the command:

errpt -a -N hags subsystem_name

to get details about error log entries. The entries related to Group Services are those with LABEL beginning with GS_.

The error log entry, together with its description in AIX Error Logs and templates, explains why the subsystem is inactive.

If there is no GS_ error log entry explaining why the subsystem went down or could not start, the daemon may have exited abnormally. Look for an error entry with a LABEL of CORE_DUMP and a PROGRAM NAME of hagsd, by issuing the command:

errpt -J CORE_DUMP

If this entry is found, see Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

Another possibility when there is no GS_ error log entry is that the Group Services daemon could not be loaded. In this case, a message similar to the following may be present in the Group Services startup log:

0509-036 Cannot load program hagsd because of the following errors:
0509-026 System error: Cannot run a file that does not have a valid format.
The message may refer to the Group Services daemon, or to some other program invoked by the startup script hags. If this error is found, see Information to collect before contacting the IBM Support Center and contact the IBM Support Center.

For errors where the daemon did start up but then exited during initialization, detailed information about the problem is in the Group Services error log.
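
The search for GS_ entries in the detailed error report can be sketched as a filter. This is only an illustration: the function name gs_error_labels is hypothetical, and it assumes that the detailed report prints each entry's label on a line that starts with LABEL: followed by the label.

```shell
# Sketch: read a detailed error report (errpt -a ...) on standard input
# and print the distinct labels that begin with GS_.
# gs_error_labels is a hypothetical helper name.
gs_error_labels() {
    # Assumption: each entry has a line of the form "LABEL:   <label>".
    awk '$1 == "LABEL:" && $2 ~ /^GS_/ { print $2 }' | sort -u
}
```

An empty result would correspond to the case described above in which no GS_ entry explains why the subsystem is inactive.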

Operational test 3 - Determine why the Group Services domain is not established or why it is not recovered

The hagsns command is used to determine the nameserver (NS) state and characteristics. Issue the command:

hagsns -s subsystem_name

The output is similar to:

HA GS NameServer Status
NodeId=0.32, pid=18256, domainId=0.Nil, NS not established,
  CodeLevel=GSlevel(DRL=8)
The death of the node is being simulated.
NS state=kUncertain, protocolInProgress=kNoProtocol,
  outstandingBroadcast=kNoBcast
Process started on Jun 21 10:33:08, (0:0:16) ago. 
  HB connection took (0:0:0).
Our current epoch of uncertainty started on Jun 21 10:33:08,
  (0:0:16) ago.
Number of UP nodes: 1
List of UP nodes:  0

Error results are indicated by output in which the NS state is kUncertain, with the following considerations:

  1. kUncertain is normal for a while after Group Services startup.
  2. Group Services may have instructed Topology Services to simulate a node death. This is so that every other node will see the node down event for this local node. This simulating node death state will last approximately two or three minutes.

If this state does not change or takes longer than two or three minutes, proceed to check Topology Services. See Diagnosing Topology Services problems.

If the Group Services daemon is not in kCertain or kBecomeNS state, and is waiting for the other nodes, the hagsns command output is similar to:

HA GS NameServer Status
NodeId=11.42, pid=21088, domainId=0.Nil, NS not established,
  CodeLevel=GSlevel(DRL=8)
NS state=kGrovel, protocolInProgress=kNoProtocol,
  outstandingBroadcast=kNoBcast
Process started on Jun 21 10:52:13, (0:0:22) ago.
  HB connection took (0:0:0).
Our current epoch of uncertainty started on Jun 21 10:52:13,
  (0:0:22) ago.
Number of UP nodes: 2
List of UP nodes:  0 11
Domain not established for (0:0:22).
  Currently waiting for node 0
 
 

In the preceding output, this node is waiting for an event or message from node 0 or for node 0. The expected event or message differs depending on the NS state which is shown in the second line of the hagsns command output.

Analyze the NS state as follows:

  1. kGrovel means that this node believes that the waiting node (node 0 in this example) will become its NS. This node is waiting for node 0 to acknowledge it (issue a Proclaim message).
  2. kPendingInsert or kInserting means that the last line of the hagsns command output is similar to:
    Domain not established for (0:0:22).  Currently waiting for node 0.1
    

    This node received the acknowledge (Proclaim or InsertPhase1 message) and is waiting for the next message (InsertPhase1 or Commit message) from the NS (node 0).

    If this state does not change to kCertain within two or three minutes, proceed to Operational test 1 - Verify that Group Services is working properly, for Topology Services and Group Services on the waiting node (node 0 in this example).

  3. kAscend, kAscending, kRecoverAscend, or kRecoverAscending means that the last line of the hagsns command output is similar to:
    Domain not established for (0:0:22).  Waiting for 3 nodes: 1 7 6
    

    If there are many waiting nodes, the output is similar to:

    Domain not established for(0:0:22).Waiting for 43 nodes: 1 7 6 9 4 ....
    

    This node is trying to become a nameserver, and the node is waiting for responses from the nodes that are listed in the hagsns command output. If this state remains for between three and five minutes, proceed to Operational test 1 - Verify that Group Services is working properly, for Topology Services and Group Services on the nodes that are on the waiting list.

  4. kKowtow or kTakeOver means that the last line of the hagsns command output is similar to:
    Domain not recovered for (0:0:22).  Currently waiting for node 0.1
    

    After the current NS failure, this node is waiting for a candidate node that is becoming the NS. If this state stays too long, proceed to Operational test 1 - Verify that Group Services is working properly, for the Topology Services and Group Services on the node that is in the waiting list.

    In this output, the value 0.1 means the following:

    Therefore, this local node is waiting for a response from the GS daemon of node 0, and the incarnation is 1.
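
The analysis of the hagsns output in this test can be sketched as three small helpers. All function names are hypothetical, and the advice strings only paraphrase the text above. The sketch assumes the output contains NS state= and, where applicable, waiting for node N.I, exactly as in the samples.

```shell
# Sketch: helpers for `hagsns -s subsystem_name` output (hypothetical names).

# Print the NS state, for example kGrovel, from output on stdin.
ns_state() {
    sed -n 's/.*NS state=\([A-Za-z]*\).*/\1/p' | head -n 1
}

# Split "waiting for node 0.1" into node number and incarnation.
waiting_node() {
    sed -n 's/.*waiting for node \([0-9]*\)\.\([0-9]*\).*/node=\1 incarnation=\2/p'
}

# Map an NS state name (first argument) to the guidance given in this test.
ns_next_step() {
    case "$1" in
        kCertain|kBecomeNS)
            echo "not waiting for other nodes" ;;
        kUncertain)
            echo "normal briefly; after two or three minutes check Topology Services" ;;
        kGrovel)
            echo "waiting for the named node to issue a Proclaim message" ;;
        kPendingInsert|kInserting)
            echo "waiting for the NS; after two or three minutes run Operational test 1 on the waiting node" ;;
        kAscend|kAscending|kRecoverAscend|kRecoverAscending)
            echo "becoming the NS; after three to five minutes run Operational test 1 on the listed nodes" ;;
        kKowtow|kTakeOver)
            echo "waiting for the candidate NS; if this persists run Operational test 1 on the waiting node" ;;
        *)
            echo "state not covered by this test" ;;
    esac
}
```

For the kGrovel sample above, ns_state would print kGrovel, and for a last line ending in waiting for node 0.1, waiting_node would print node=0 incarnation=1.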

Operational test 4 - Verify whether a specific group is found on a node

Issue the following command:

lssrc -ls subsystem_name

Error results are indicated by outputs similar to the error results of Operational test 1 - Verify that Group Services is working properly through Operational test 3 - Determine why the Group Services domain is not established or why it is not recovered.

Good results are indicated by an output similar to:

Subsystem         Group            PID     Status 
 hags             hags             22962    active
2 locally-connected clients.  Their PIDs:
20898(hagsglsmd) 25028(haemd) 
HA Group Services domain information:
Domain established by node 21
Number of groups known locally: 2
                   Number of   Number of local
Group name         providers   providers/subscribers
cssMembership           10           0           1
ha_em_peers             6            1           0

In this output, examine the Group name field to see whether the requested group name exists. For example, the group ha_em_peers has 1 local provider, 0 local subscribers, and 6 total providers.
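
Reading the group table can be sketched as follows. The function name group_counts is hypothetical, and the sketch assumes that each group row has exactly four fields in the order shown above: group name, total providers, local providers, local subscribers.

```shell
# Sketch: print the provider and subscriber counts for one group from
# `lssrc -ls subsystem_name` output on standard input.
# group_counts is a hypothetical helper name; $1 is the group name.
group_counts() {
    awk -v g="$1" '$1 == g && NF == 4 {
        printf "providers=%s local_providers=%s local_subscribers=%s\n", $2, $3, $4
    }'
}
```

With the sample output above, asking for ha_em_peers would print providers=6 local_providers=1 local_subscribers=0. No output means the group is not known locally.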

For more information about the given group, issue the command:

hagsns -s subsystem_name

on the NS node. The output is similar to:

HA GS NameServer Status
NodeId=6.14, pid=10094, domainId=6.14, NS established,
  CodeLevel=GSlevel(DRL=8)
NS state=kBecomeNS, protocolInProgress=kNoProtocol,
 outstandingBroadcast=kNoBcast
Process started on Jun 19 18:35:55, (10d 20:22:39) ago.
 HB connection took (0:0:0).
Initial NS certainty on Jun 19 18:36:12, (10d 20:22:22) ago,
 taking (0:0:16).
Our current epoch of certainty started on Jun 23 13:05:18,
 (7d 1:53:16) ago.
Number of UP nodes: 12
List of UP nodes:  0 1 5 6 7 8 9 11 17 19 23 26
List of known groups:
1.1 cssMembership: GL: 6 seqNum: 73
 theIPS: 6 1 26 17 0 8 7 9 5 11 lookupQ:
2.1 ha_em_peers: GL: 6 seqNum: 30 theIPS: 6 0 8 7 5 11 lookupQ:

In the last line, the nodes that have the providers of the group ha_em_peers are 6 0 8 7 5 11.
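
Extracting the provider nodes of a group from the list of known groups can be sketched as below. The function name group_providers is hypothetical. The sketch assumes that the node list sits between the tokens theIPS: and lookupQ:, and it tolerates an entry that wraps onto a second line, as the cssMembership entry does in the sample.

```shell
# Sketch: print the nodes that host providers of one group, taken from
# `hagsns -s subsystem_name` output on standard input.
# group_providers is a hypothetical helper name; $1 is the group name.
group_providers() {
    awk -v g="$1:" '
        {
            for (i = 1; i <= NF; i++) {
                if ($i == g)         { ingroup = 1; collecting = 0; continue }
                if (!ingroup)        continue
                if ($i == "theIPS:") { collecting = 1; continue }
                if ($i == "lookupQ:") { ingroup = 0; collecting = 0; continue }
                if (collecting)      { out = out sep $i; sep = " " }
            }
        }
        END { print out }'
}
```

For the sample output above, group_providers ha_em_peers would print 6 0 8 7 5 11.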

Operational test 5 - Verify whether the cssMembership or css1Membership groups are found on a node

If Operational test 1 - Verify that Group Services is working properly through Operational test 3 - Determine why the Group Services domain is not established or why it is not recovered succeeded, issue the following command:

lssrc -ls subsystem_name 

The output is similar to:

Subsystem         Group            PID     Status 
 hags             hags             22962    active
2 locally-connected clients.  Their PIDs:
20898(hagsglsmd) 25028(haemd) 
HA Group Services domain information:
Domain established by node 21
Number of groups known locally: 2
                   Number of   Number of local
Group name         providers   providers/subscribers
cssMembership           10           1           0
ha_em_peers             6            1           0
 

In the preceding output, the cssMembership group has 1 local provider. Otherwise, the following conditions apply:

  1. No cssMembership or css1Membership exists in the output.

    There are several possible causes:

    1. /dev/css0 or /dev/css1 devices are down.

      Perform switch diagnosis. See Diagnosing SP Switch problems or Diagnosing SP Switch2 problems.

    2. Topology Services reports that the switch is not stable.

      Issue the following command:

      lssrc -ls hats_subsystem
      

      where hats_subsystem is:

      • hats on PSSP nodes
      • hats.partition_name on the PSSP control workstation
      • topsvcs on HACMP nodes

      The output is similar to:

      Subsystem       Group           PID     Status 
       hats           hats            17058   active
      Network Name   Indx Defd Mbrs St Adapter ID      Group ID
      SPether        [0]   15    2  S 9.114.61.65     9.114.61.125
      SPether        [0] en0          0x37821d69      0x3784f3a9
      HB Interval = 1 secs. Sensitivity = 4 missed beats
      SPswitch       [1]   14    0  D 9.114.61.129  
      SPswitch       [1] css0   
      HB Interval = 1 secs. Sensitivity = 4 missed beats
        1 locally connected Client with PID:
      hagsd( 26366) 
        Configuration Instance = 926456205
        Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
        Control Workstation IP address = 9.114.61.125
        Daemon employs no security
        Data segment size 7044 KB
      

      Find the first SPswitch row in the Network Name column. Find the St (state) column in the output. At the intersection of the first SPswitch row and the state column is a letter. If it is not S, wait a few minutes longer, since the Topology Services SPswitch group is not stable. If the state remains D or U for too long, proceed to Topology Services diagnosis. See Diagnosing Topology Services problems. If the state is S, proceed to Step 1c (the third cause in this list). In this example, the state is D.

      The state has the following values:

      • S - stable or working correctly
      • D - dead, or not working
      • U - unstable (not yet incorporated)
    3. HAGSGLSM is not running or is waiting for Group Services protocols.

      Proceed to Operational test 7 - Verify the HAGSGLSM (Group Services GLobalized Switch Membership) subsystem.

  2. cssMembership or css1Membership exists in the output, but the number of local providers is zero.

    Proceed to Operational test 7 - Verify the HAGSGLSM (Group Services GLobalized Switch Membership) subsystem.
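
Reading the St column for the switch in the Topology Services output, as described under the second cause above, can be sketched as follows. The function names are hypothetical, and the sketch assumes that the first SPswitch row has the fields network name, index, Defd, Mbrs, St in that order, as in the sample.

```shell
# Sketch: helpers for `lssrc -ls hats_subsystem` output (hypothetical names).

# Print the St letter from the first SPswitch row of output on stdin.
spswitch_state() {
    # Assumption: fields are name, [index], Defd, Mbrs, St, ...
    awk '$1 == "SPswitch" { print $5; exit }'
}

# Translate a state letter (first argument) into its meaning.
describe_state() {
    case "$1" in
        S) echo "stable or working correctly" ;;
        D) echo "dead, or not working" ;;
        U) echo "unstable (not yet incorporated)" ;;
        *) echo "unknown state" ;;
    esac
}
```

For the sample output in this test, spswitch_state would print D.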

Operational test 6 - Verify whether Group Services is running a protocol for a group

Issue the following command:

hagsvote -ls subsystem

Compare the output to this list of choices.

  1. If no protocol is running, the output is similar to:
    Number of groups: 3
    Group slot #[0] Group name [HostMembership] GL node [Unknown]
     voting data: No protocol is currently executing in the group.
    --------------------------------------------------------------
     
    Group slot #[1] Group name [cssRawMembership] GL node [Unknown]
     voting data: No protocol is currently executing in the group.
    --------------------------------------------------------------
     
    Group slot #[2] Group name [theSourceGroup] GL  node [1]
     voting data: No protocol is currently executing in the group.
    ---------------------------------------------------------------
    

    In this output, no protocol is running for "theSourceGroup".

  2. A protocol is running and waiting for a vote. For the group theSourceGroup, this node is soliciting votes and waiting for the local providers to vote. The output is similar to:
    Group slot #[2] Group name [theSourceGroup] GL node [1]
     voting data: Not GL in phase [1] of n-phase protocol of type [Join]. 
    Local voting data:
    Number of providers: 1
    Number of providers not yet voted: 1 (vote not submitted).
    Given vote:[No vote value] Default vote:[No vote value]
    ------------------------------------------------------
    

    The number of local providers is 1, and no voting is submitted. Its Group Leader (GL) node is 1. The output of the same command on the GL node (node 1) is similar to:

    Group slot #[3] Group name [theSourceGroup] GL node [1] voting data:
    GL in phase [1] of n-phase protocol of type [Join]. 
    Local voting data:
    Number of providers: 1
    Number of providers not yet voted: 0 (vote submitted).
    Given vote:[Approve vote] Default vote:[No vote value]
    Global voting data:
    Number of providers not yet voted: 1
    Given vote:[Approve vote] Default vote:[No vote value]
    --------------------------------------------------
    

    This indicates that a total of one provider has not voted.
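
Picking out the groups that are currently running a protocol can be sketched as a filter over the hagsvote output. The function name groups_in_protocol is hypothetical. The sketch assumes that an active protocol is always reported with the phrase in phase [n] of, and that the group name and protocol type appear in square brackets as in the samples.

```shell
# Sketch: read `hagsvote -ls subsystem` output on standard input and print
# "group protocol_type" for every group that is running a protocol.
# groups_in_protocol is a hypothetical helper name.
groups_in_protocol() {
    awk '
        /Group name \[/ {
            # Remember the most recent group name seen.
            name = $0
            sub(/.*Group name \[/, "", name)
            sub(/\].*/, "", name)
        }
        /in phase \[[0-9]+\] of/ {
            # A protocol is executing in the remembered group.
            ptype = $0
            sub(/.*of type \[/, "", ptype)
            sub(/\].*/, "", ptype)
            print name, ptype
        }'
}
```

For the second sample above, the function would print theSourceGroup Join. For the first sample it would print nothing.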

Operational test 7 - Verify the HAGSGLSM (Group Services GLobalized Switch Membership) subsystem

Issue the following command:

lssrc -ls glsm_subsystem

where glsm_subsystem is:

Good results are indicated by output similar to:

Error results are indicated by one of the following outputs:

  1. A message similar to:
    0513-036 The request could not be passed to the hags subsystem.
             Start the subsystem and try your command again.
    

    This means that the HAGSGLSM daemon is not running. The subsystem is down. Issue the errpt command and look for an entry for the subsystem name. Proceed to Operational test 2 - Determine why the Group Services subsystem is not active.

  2. A message similar to:
    0513-085 The hagsglsm Subsystem is not on file.
    

    This means that the HAGSGLSM subsystem is not defined to the AIX SRC.

    For PSSP nodes, the partition-sensitive subsystems may have been undefined by the syspar_ctrl command. The same command may be used to add the subsystems to the node.

    In HACMP/ES, HACMP may not have been installed on the node. Check the HACMP subsystem.

  3. Output similar to:
    Subsystem         Group            PID     Status 
    hagsglsm.c47s     hags             26578   active
    Status information for subsystem hagsglsm.c47s:
    Not yet connected to Group Services after 4 connect tries
    

    HAGSGLSM is not connected to Group Services. The Group Services daemon is not running. If the state is S, proceed to Operational test 1 - Verify that Group Services is working properly for Group Services subsystem verification.

  4. Output similar to:
    Subsystem         Group            PID     Status
     bhagsglsm        bhags            16048   active
    Status information for subsystem bhagsglsm:
    Waiting for Group Services response.
     
     
    

    HAGSGLSM is being connected to Group Services. Wait for a few seconds. If this condition does not change after several seconds, proceed to Operational test 3 - Determine why the Group Services domain is not established or why it is not recovered or Operational test 6 - Verify whether Group Services is running a protocol for a group.

  5. Output similar to:
    Subsystem         Group            PID     Status
     hagsglsm         hags             26788   active
    Status information for subsystem hagsglsm:
    Connected to Group Services.
     Adapter  Group                  Mbrs   Joined  Subs'd  Aliases
     css0     cssRawMembership      -       -       No      -
              cssMembership         16       No     No      -
     css1     css1RawMembership     15      -       Yes     1
              css1Membership        15      Yes     Yes     -
     ml0      ml0Membership          -      -      -        -
    Aggregate Adapter Configuration
     The current configuration id is 0x23784582.
     ml0[css0] Nodes: 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61
     ml0[css1] Nodes: 1,5,9,13,17,21,25,29,33,37,41,45,49,53,57,61
    

    On nodes that have the switch, the "cssRawMembership" or "css1RawMembership" line has No in the Subs'd column.

    Check Topology Services to see whether the switch is working. Issue the command:

    lssrc -ls hats_subsystem
    

    The output is similar to:

    Subsystem         Group            PID     Status 
     hats             hats             25074   active
    Network Name   Indx Defd Mbrs St Adapter ID      Group ID
    SPether        [0]   15   11  S 9.114.61.65     9.114.61.193   
    SPether        [0] en0          0x376d296c      0x3784fdc5
    HB Interval = 1 secs. Sensitivity = 4 missed beats
    SPswitch       [1]   14    8  S 9.114.61.129    9.114.61.154   
    SPswitch       [1] css0         0x376d296d      0x3784fc48
    HB Interval = 1 secs. Sensitivity = 4 missed beats
      1 locally connected Client with PID:
    hagsd( 14460) 
      Configuration Instance = 925928580
      Default: HB Interval = 1 secs. Sensitivity = 4 missed beats
      Control Workstation IP address = 9.114.61.125
      Daemon employs no security
      Data segment size 7052 KB
    

    Find the first row under Network Name with SPswitch. Find the column with heading St (state). Intersect this row and column. If the value at the intersection is not S, see TS_LOC_DOWN_ST and proceed to Action 3 - Correct local adapter problem.

    If the state is S, proceed to Operational test 1 - Verify that Group Services is working properly to see whether the Group Services domain is established or not. If the Group Services domain is established, proceed to Operational test 6 - Verify whether Group Services is running a protocol for a group for cssMembership protocol activity.
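
Checking the Joined and Subs'd columns for a membership group in the HAGSGLSM status output can be sketched as follows. The function name glsm_group_row is hypothetical. Because the Adapter column is blank on some rows, the sketch locates the group name first and assumes that the next three fields are Mbrs, Joined and Subs'd.

```shell
# Sketch: print the Mbrs, Joined and Subs'd values for one group from
# `lssrc -ls glsm_subsystem` output on standard input.
# glsm_group_row is a hypothetical helper name; $1 is the group name.
glsm_group_row() {
    awk -v g="$1" '{
        # The Adapter column may be empty, so search for the group name.
        for (i = 1; i <= NF - 3; i++)
            if ($i == g) {
                printf "mbrs=%s joined=%s subscribed=%s\n", $(i+1), $(i+2), $(i+3)
                exit
            }
    }'
}
```

For the sample output in item 5, glsm_group_row cssRawMembership would print mbrs=- joined=- subscribed=No, which is the error condition described there.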

