Group Services Programming Guide and Reference

Overview of Group Services concepts

An application may consist of multiple processes that run on multiple nodes of an SP system. The set of nodes that is defined to Group Services is called a Group Services domain; it is the set of nodes that makes up a system partition. If there is only one system partition in the SP system, the Group Services domain consists of all of the nodes in the SP system. From the standpoint of the Group Services subsystem, the system partition in which it is running is "the system."

Group membership

Each group that is maintained by the Group Services subsystem is uniquely named. Any authorized process in a Group Services domain may create a new group. Any authorized process in the domain may ask to become a member of a group. This request is called a join request or joining the group. If the join request is successful, the process becomes a provider for the group.

Any authorized process in the domain can ask to monitor the group. This request is called a subscribe request or subscribing to the group. If the subscribe request is successful, the process becomes a subscriber for the group.

The term GS client refers to both providers and subscribers. A process that has registered to use the Group Services subsystem, but has not yet become a provider or subscriber, is also referred to as a GS client.

The domain control environment variable that is described in Group Services domains and in ha_gs_init subroutine must be set and exported in a GS client's environment to the name of the system partition in which the GS client and the Group Services daemon are running. On a node, this is the system partition to which the node belongs. On the control workstation, there is a Group Services daemon for each system partition. The value of the domain control environment variable identifies the system partition and the particular Group Services daemon to which the GS client will connect.

A group may have members on multiple nodes in the domain, and each node may have multiple members.

For each group, the Group Services subsystem maintains consistent group state data. A group's state consists of the membership list and the group state value:

Membership list

A group's membership list contains the list of providers in the group. Each provider is identified by a provider identifier. The Group Services subsystem maintains the list in the following order: the oldest provider (that is, the first provider to join the group) is at the head of the list, and the youngest is at the end. All of the group's providers and subscribers see the same ordering of the list.

The membership list is modified when providers join or leave the group. In addition to voluntarily leaving a group, a provider may leave involuntarily, due to the failure of the provider process itself or the failure of the node on which it is running. An involuntary leave is called a failure leave and is initiated by the Group Services subsystem. Finally, a provider may be expelled from the group at the request of a provider.
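
The ordering rule can be illustrated with a small model. This is only a sketch of the behavior described above, not the GSAPI: the class name MembershipList and its methods are hypothetical, and a real provider never edits the list directly.

```python
class MembershipList:
    """Toy model of a group's ordered provider list (hypothetical, not the GSAPI)."""

    def __init__(self):
        self._providers = []  # oldest provider first, youngest last

    def join(self, provider_id):
        if provider_id in self._providers:
            raise ValueError(f"{provider_id} is already a provider")
        # A new provider is always the youngest, so it goes at the end.
        self._providers.append(provider_id)

    def leave(self, provider_id):
        # Voluntary leave, failure leave, and expel all remove the provider;
        # the relative order of the remaining providers is unchanged.
        self._providers.remove(provider_id)

    def snapshot(self):
        # Every provider and subscriber would see this same ordering.
        return tuple(self._providers)


members = MembershipList()
for pid in ("p1", "p2", "p3"):
    members.join(pid)
members.leave("p1")   # the oldest provider leaves
members.join("p4")    # a new provider becomes the youngest
print(members.snapshot())  # ('p2', 'p3', 'p4')
```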

Group state value

The state value of a group is defined by the application that is using the GSAPI and is controlled by the providers in a way that is meaningful to the application. It is a byte field whose length may vary between 1 and 256 bytes. The Group Services subsystem does not interpret the state value.

The group state is available to surviving providers despite node, communication adapter, and network failures. However, the group state does not survive the dissolution of a group. If all of the providers fail, the group state is lost.

Changing membership and state value

Any provider in the group can ask Group Services to modify the state value and can also specify the level of consistency that is to be associated with the modification. Specifically, the provider can subject the proposed change to a voting protocol, or request that the change be approved without putting it to a vote. The voting protocol unifies the multi-phase commit and barrier synchronization abstractions.

A GS client that asks to become a provider of a group must have the correct authorization and must be admitted to the group by the current providers of the group. This is accomplished by the same voting protocol as the one used to mediate state changes.

A provider may leave a group in a number of ways. It may:

Leave voluntarily
Be expelled at the request of another provider
Leave involuntarily when its process, or the node on which it is running, fails.

All changes to a group, in state value or in membership, appear to the providers and subscribers of the group to be logically serialized. This means that one change completes before another begins. The Group Services subsystem processes all outstanding membership changes before it accepts any proposals to change the state value. If one or more providers fail during an ongoing protocol invocation, Group Services runs a failure leave protocol to remove the failed providers from the group when the protocol completes.

Subscribers are notified of the approved results of a state value change or a provider membership change. However, they do not participate in approving a state value change, admitting a new provider, or removing a leaving provider. Thus, a subscriber cannot affect the group state. In addition, subscribers do not appear in any group membership lists; they are known only to the Group Services subsystem. The providers of a group and the other subscribers of the group are unaware of any of the subscribers to the group.

Group Services domains

The Group Services subsystem provides services within the boundaries of what it calls a domain. A Group Services domain includes an SP system control workstation and the set of nodes within an SP system partition. This means that an SP system control workstation can be within multiple Group Services domains. An application wishing to become a GS client on the control workstation or on an SP system node must set one or more environment variables to ensure that it is able to connect to the proper Group Services domain.

Group Services PSSP domain

An SP system is divided into one or more Group Services domains, based on the number of SP system partitions defined. Each of these domains is referred to as a "Group Services PSSP domain". To connect to the Group Services PSSP domain, a GS client must ensure that the environment variable HA_DOMAIN_NAME is set to the name of the SP system partition in which the GS client is running prior to the GS client invoking the ha_gs_init subroutine. This must be done on a node (which can only be in one Group Services PSSP domain) as well as on the control workstation (which will be in multiple Group Services PSSP domains, if there are multiple defined SP system partitions).

Refer to ha_gs_init subroutine for more information about connecting to the Group Services subsystem and the meaning of error codes that are returned.

Group Services HACMP/ES domain

In addition to the Group Services PSSP domain, if HACMP/ES is installed on a node, that node will also be part of a Group Services HACMP/ES domain, which is separate from that node's Group Services PSSP domain. The Group Services HACMP/ES domain consists of all nodes that are part of the HACMP/ES cluster. This may include SP system nodes, non-SP AIX workstations, and SP system nodes from a physically separate SP system.

Connecting to domains

A GS client may connect to only one domain, regardless of whether a node (or the control workstation) is part of multiple Group Services domains. All services provided by the Group Services subsystem are within a single domain only. A GS client gets information only about the nodes and groups that are in the Group Services domain to which it is connected.

If a GS client wishes to connect to the Group Services HACMP/ES domain on its node, it must set the HA_DOMAIN_NAME and HA_GS_SUBSYS environment variables before the GS client invokes the ha_gs_init subroutine. HA_DOMAIN_NAME must be set to the name of the HACMP/ES cluster. HA_GS_SUBSYS must be set to grpsvcs.
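
As a rough illustration, the environment for a GS client could be prepared by a launcher before the client process is started. The helper name gs_client_environment and the cluster name "cluster_a" are made up for this sketch; only the variable names HA_DOMAIN_NAME and HA_GS_SUBSYS and the value grpsvcs come from the text above.

```python
import os


def gs_client_environment(domain_name, hacmp_es=False, base_env=None):
    """Return a copy of an environment prepared for a GS client.

    domain_name: the SP system partition name (PSSP domain) or the
                 HACMP/ES cluster name (HACMP/ES domain).
    hacmp_es:    True when the client should connect to the HACMP/ES domain.
    base_env:    environment to start from; defaults to os.environ.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["HA_DOMAIN_NAME"] = domain_name
    if hacmp_es:
        # Required in addition to HA_DOMAIN_NAME for an HACMP/ES domain.
        env["HA_GS_SUBSYS"] = "grpsvcs"
    return env


env = gs_client_environment("cluster_a", hacmp_es=True, base_env={})
print(env)  # {'HA_DOMAIN_NAME': 'cluster_a', 'HA_GS_SUBSYS': 'grpsvcs'}
```

The resulting dictionary could be passed as the env argument of subprocess.Popen when launching the client, so that both variables are in place before the client invokes ha_gs_init.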

All of the Group Services interfaces and semantics work identically and are supported within both Group Services PSSP domains and Group Services HACMP/ES domains.

Group creation

Typically, an application defines one or more group names that are known to all of the processes that are part of the application.

During initialization, each process in the application asks to join the group as it starts up. The first join request creates the group and defines its attributes. The subsequent join requests result in new providers joining the group. Each subsequent join request also includes group attribute information, which must match the group's established attributes. Otherwise, the join request is rejected.

The attributes of a group are:

The name of the group
An application-defined version code
The number of phases (one or multiple) for join and failure leave protocols.
A time limit, in seconds, for voting in each phase of a join or failure leave n-phase protocol. If a time limit of 0 is specified, no limit is enforced.
A default vote to use as a proxy for a provider that fails to vote or fails to vote in time. The default vote may be to either approve or reject. If none is specified, the default value is to reject.
A batch control field that specifies how requests may be batched. Join requests may be batched with other join requests, and failure leave requests may be batched with other failure leave requests. Join requests are never batched with failure leave requests.
Attributes related to a source-target relationship, if any, include:
- The name of the source-group for this group. Specifying a source-group name defines this group as a target-group.
- The number of phases to use for the source-reflection protocols, which run in the target-group when the source-group changes its state value.
- The voting phase time limit for source-reflection protocols, if they are n-phase.

All of the topics related to these attributes are discussed in greater detail later in this chapter.
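
The rule that the first join request defines the attributes, and that later requests must match them, can be sketched as follows. The names GroupAttributes and GroupTable are hypothetical, only a subset of the attributes is modeled, and the voting that normally admits a provider is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GroupAttributes:
    """Hypothetical subset of the group attributes listed above."""
    name: str
    version: int
    n_phase: bool                 # False: one-phase, True: n-phase
    time_limit: int = 0           # seconds per voting phase; 0 means no limit
    default_vote: str = "REJECT"  # used when none is specified


class GroupTable:
    """Toy registry: the first join creates the group, later joins must match."""

    def __init__(self):
        self._groups = {}

    def join(self, provider_id, attrs):
        entry = self._groups.get(attrs.name)
        if entry is None:
            # First join request: create the group and fix its attributes.
            self._groups[attrs.name] = {"attrs": attrs, "providers": [provider_id]}
            return "created"
        if attrs != entry["attrs"]:
            # Attribute information does not match the established attributes.
            return "rejected"
        entry["providers"].append(provider_id)
        return "joined"


table = GroupTable()
first = GroupAttributes("payroll", version=1, n_phase=True, time_limit=30)
print(table.join("p1", first))                                    # created
print(table.join("p2", first))                                    # joined
print(table.join("p3", GroupAttributes("payroll", 2, True, 30)))  # rejected
```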

Mutability of group attributes

Certain group attributes are mutable. This means that they can be dynamically changed by the group's providers, using the ha_gs_change_attributes asynchronous interface. See ha_gs_change_attributes subroutine.

The following group attribute fields are mutable (can be dynamically changed):

gs_client_version: Client-specified version number
gs_batch_control: Batch control setting for membership (join and failure protocols)
gs_num_phases: Phase control setting for membership (join and failure protocols)
gs_source_reflection_num_phases: Phase control setting for source-state reflection protocols
gs_group_default_vote: Group's base default vote for all N-phase protocols
gs_merge_control: Behavior of group in a merge situation
gs_time_limit: Voting time limit for N-phase join and failure protocols
gs_source_reflection_time_limit: Voting time limit for N-phase source-state reflection protocols

The following group attribute fields are not mutable (cannot be dynamically changed). To change these group attribute fields, the providers must all leave the group, then rejoin the group with the desired new attribute field.

gs_group_name: The name of the provider's group
gs_source_group_name: The name of the source group for the provider's group (see Source-target group relationships).
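
An application might check a requested change against the two lists above before proposing it. The sketch below is an assumption about how such a check could look: the function split_attribute_changes is hypothetical, and only the field names are taken from the lists.

```python
MUTABLE_FIELDS = frozenset({
    "gs_client_version",
    "gs_batch_control",
    "gs_num_phases",
    "gs_source_reflection_num_phases",
    "gs_group_default_vote",
    "gs_merge_control",
    "gs_time_limit",
    "gs_source_reflection_time_limit",
})
IMMUTABLE_FIELDS = frozenset({"gs_group_name", "gs_source_group_name"})


def split_attribute_changes(changes):
    """Split requested changes into (allowed, refused) dictionaries.

    Refused fields can only be changed by having all providers leave the
    group and rejoin with the new value.
    """
    allowed, refused = {}, {}
    for field, value in changes.items():
        if field in MUTABLE_FIELDS:
            allowed[field] = value
        elif field in IMMUTABLE_FIELDS:
            refused[field] = value
        else:
            raise KeyError(f"unknown group attribute field: {field}")
    return allowed, refused


allowed, refused = split_attribute_changes(
    {"gs_time_limit": 60, "gs_group_name": "new_name"}
)
print(allowed)  # {'gs_time_limit': 60}
print(refused)  # {'gs_group_name': 'new_name'}
```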

Responsiveness checks

Responsiveness checks allow the Group Services subsystem to periodically inspect the state of the GS client when there are no ongoing group activities. Group Services always monitors the GS client for exit. A responsiveness check allows Group Services to query the actual responsiveness of the GS client. When the group is active, that is, when a protocol is running, Group Services can determine the responsiveness of the GS client by the client's response to the running protocol. Accordingly, Group Services suspends responsiveness checking during ongoing protocols.

When the GS client initializes itself with Group Services, it must specify information about the protocol, if any, to be used to perform responsiveness checks for the GS client. It must also specify the path name of a callback routine to invoke if the GS client fails its responsiveness check.

Responsiveness protocols

The GS client can specify one of the following responsiveness protocols.

No protocol: In this case, Group Services acts only if the GS client process exits.
A ping-like protocol: In this case, Group Services periodically sends a responsiveness notification to the GS client and expects a response. The notification calls the responsiveness callback routine specified by the GS client. Group Services expects the responsiveness callback routine to return a code that indicates whether the GS client is operational or has detected an internal problem that prevents its correct operation.
This protocol is available to both single-threaded and multi-threaded GS clients.
A counter-checking protocol: In this case, Group Services periodically checks an arithmetic counter that is specified by a multi-threaded client. If the counter is changing, Group Services assumes that the GS client is responsive. If the counter does not change within a time limit specified by the GS client, Group Services calls the responsiveness callback routine before assuming that the GS client is nonresponsive.
This protocol works properly only for multi-threaded GS clients, and a thread must be dedicated to performing Group Services operations. The dedicated thread should handle responsiveness checking as well as the receipt of notifications from Group Services.
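
The counter-checking protocol can be pictured with the following model. It is a sketch under assumptions, not a description of the subsystem's internals: the class CounterCheck is hypothetical, time is passed in as plain numbers, and the callback is assumed to return True when the client reports that it is operational.

```python
class CounterCheck:
    """Toy model of counter-based responsiveness checking (hypothetical)."""

    def __init__(self, time_limit, callback):
        self.time_limit = time_limit  # how long the counter may stay unchanged
        self.callback = callback      # responsiveness callback, returns True if OK
        self.last_value = None
        self.last_change = None

    def check(self, counter_value, now):
        if counter_value != self.last_value:
            # The counter moved, so the client is assumed to be responsive.
            self.last_value = counter_value
            self.last_change = now
            return "responsive"
        if now - self.last_change < self.time_limit:
            # Unchanged, but still within the client's time limit.
            return "responsive"
        # The counter stalled for the whole time limit: call the callback
        # before treating the client as nonresponsive.
        return "responsive" if self.callback() else "nonresponsive"


checker = CounterCheck(time_limit=10, callback=lambda: False)
print(checker.check(counter_value=1, now=0))   # responsive
print(checker.check(counter_value=1, now=5))   # responsive
print(checker.check(counter_value=1, now=10))  # nonresponsive
print(checker.check(counter_value=2, now=11))  # responsive
```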

Responsiveness callback routines

The responsiveness callback routine is intended to provide the Group Services subsystem with a means of quiescing a provider that fails a responsiveness check. The routine should perform any cleanup actions that are required by the GS client. It also allows the GS client to perform periodic validity checks on its own operation or its environment.

Handling nonresponsive providers

Group Services performs responsiveness checks once the GS client has initialized. If a responsiveness check fails and the GS client is a provider, Group Services places it in a list of nonresponsive providers. Then, Group Services sends an announcement notification that contains the list to all of the group's providers. Group Services takes no other direct action.

On receipt of the announcement notification, a provider could initiate an expel protocol to remove the nonresponsive providers from the group, if appropriate. For more information, see The expel protocol. Group Services tries to contact nonresponsive providers. If a previously nonresponsive provider responds, Group Services places it in a list of "rejuvenated" providers. Then, Group Services sends an announcement notification that contains the list to all of the group's providers.

Note that because Group Services continues to perform responsiveness checks for nonresponsive providers, the group can determine how quickly it should respond to announcement notifications. A group can expel a nonresponsive provider after receiving the first announcement notification, or it can wait to see if the provider becomes responsive again.

Protocols and voting

The Group Services subsystem uses a variety of protocols. A protocol is the mechanism that coordinates membership and state value changes within a group.

The GSAPI provides a flexible n-phase voting protocol to mediate provider joins and departures, and state value changes. Different applications have different synchronization and coordination requirements for membership and state changes. Programmers can customize their applications to meet these requirements by choosing the appropriate number of voting phases, as follows:

A one-phase protocol is the special case in which no voting is allowed. Here, the proposed membership or state value change is automatically approved, without voting.
An n-phase protocol puts the group through at least one phase of voting before the change is approved. The number of phases required is not specified in advance to the Group Services subsystem. Instead, in each phase of voting, the providers can request another phase of voting, or end the protocol by approving or rejecting the proposal.

The votes of the providers cause a proposed change to be approved or rejected. If it is approved, Group Services sends a final notification that describes the change to all of the group's providers and subscribers. If it is rejected, Group Services sends the final notification only to the providers. The group state reverts to its value at the beginning of the protocol. When a protocol is proposed, the proposal indicates whether it is one-phase or n-phase.

Protocol categories

Protocols are grouped into four categories:

Membership change protocols

These protocols are used when a provider joins or leaves a group. If approved, the membership of the group changes. In addition, the group state value may also be changed during all phases of n-phase membership change protocols, as discussed in Submitting changes with voting responses.

Membership change protocols include:

Join
Leave (also called "voluntary leave")
Expel
Failure leave (including clients that invoke ha_gs_goodbye)
Cast-out (a form of failure leave that is associated with source-target relationships. See Source-target group relationships).

The state value change protocol

A provider uses this protocol to change the state value of the group, but leave the membership unchanged. n-phase state value change protocols may also change the group state value during the voting phases. State value change protocols do not affect the group membership.

The provider-broadcast message protocol

If this protocol is one-phase, it allows a provider to broadcast a message to all other providers in the group, with no voting.

If this protocol is n-phase, it allows a provider to broadcast the message to the other providers in the group, and also initiates the standard voting phases. The group state value may be changed during each voting phase. Provider-broadcast message protocols do not affect the group membership.

The change-attributes protocol

If this protocol is one-phase, it allows a provider to change a group's attributes with no voting.

If this protocol is n-phase, it allows a provider to propose to change the group attributes and initiates the standard voting phases. Change-attributes protocols do not affect the group membership.

Voting on an n-phase protocol

When a provider receives an n-phase protocol proposal notification, it is asked to vote. At the start of every phase of voting, the Group Services subsystem informs all of the providers in the group of the proposed state value change, and the current phase number. Each provider then votes either to approve the proposed change (APPROVE), to request another round of voting (CONTINUE), or to reject it and end the protocol (REJECT). Voting can occupy any number of phases, based on the wishes of the providers.

For each phase, each provider must provide one of the following vote values:

APPROVE: The provider approves the proposed change. If all providers vote to APPROVE the proposal in the same voting phase, the change is approved and the group state is changed accordingly. If the vote tally indicates that the protocol should continue, the provider must continue to vote in each subsequent phase.
CONTINUE: The provider conditionally approves the proposed change, but wants to continue to another phase of voting. If at least one provider votes to CONTINUE, the protocol continues to another voting phase.
REJECT: The provider rejects the proposed change. As with CONTINUE, a single vote is enough: if at least one provider votes to REJECT, the proposed change is rejected. A REJECT vote on a failure leave protocol requires special consideration, as described in Rejection of the Group Services subsystem-initiated protocols.

Voting can have one of the following outcomes:

The proposed change is approved if every provider that was a member of the group at the start of the protocol votes to APPROVE the proposal, either explicitly or implicitly.
The protocol continues for another round if no provider votes to REJECT, and at least one provider votes to CONTINUE. The proposed change remains pending.
The proposed change is rejected if at least one provider that was a member of the group at the start of the protocol votes to REJECT the proposal, either implicitly or explicitly.

Normally, providers vote explicitly by responding to the Group Services subsystem by calling the ha_gs_vote subroutine. However, if a provider fails before it submits a vote, or if it fails to vote within the group's voting time limit, the Group Services subsystem enters a default vote on behalf of that provider. The default vote is also called an implicit vote.

By default, the default vote is REJECT. However, the provider can set the default vote to APPROVE when it joins the group. The Group Services subsystem does not permit an implicit vote to CONTINUE, because it could lead to a non-terminating protocol.
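
The tally rules for a single voting phase, including the implicit vote, can be summarized in a few lines. The function tally_phase is a hypothetical illustration of the rules stated above; it is not part of the GSAPI, where providers vote by calling ha_gs_vote.

```python
VOTES = ("APPROVE", "CONTINUE", "REJECT")


def tally_phase(votes, providers, default_vote="REJECT"):
    """Return the outcome of one voting phase.

    votes:        provider id -> explicit vote, for providers that voted in time
    providers:    all providers that were members at the start of the protocol
    default_vote: implicit vote for a provider that failed or voted too late
    """
    if default_vote not in ("APPROVE", "REJECT"):
        # An implicit CONTINUE is not permitted: it could prevent termination.
        raise ValueError("default vote must be APPROVE or REJECT")
    effective = [votes.get(p, default_vote) for p in providers]
    for vote in effective:
        if vote not in VOTES:
            raise ValueError(f"invalid vote: {vote}")
    if "REJECT" in effective:
        return "REJECTED"   # one REJECT, explicit or implicit, is enough
    if "CONTINUE" in effective:
        return "CONTINUE"   # no REJECT, and at least one CONTINUE
    return "APPROVED"       # every provider approved in this phase


providers = ["p1", "p2", "p3"]
print(tally_phase({"p1": "APPROVE", "p2": "APPROVE", "p3": "APPROVE"}, providers))
# APPROVED
print(tally_phase({"p1": "APPROVE", "p2": "CONTINUE"}, providers))
# REJECTED (p3 did not vote, so the default REJECT is entered)
print(tally_phase({"p1": "APPROVE", "p2": "CONTINUE"}, providers, "APPROVE"))
# CONTINUE
```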

After the proposal is approved or rejected, the Group Services subsystem notifies all of the providers of the outcome. The providers do not vote in this last phase of the protocol. Thus, unlike the other phases, in the last phase no information flows from the providers to the Group Services subsystem. Finally, and only if the proposal was approved, the Group Services subsystem informs the subscribers of the outcome of the vote.

In certain cases, an approved proposal also generates notifications related to source-target handling. For more information, see Source-target group relationships.

Specifying the provider's default vote value

By default, the Group Services subsystem assigns REJECT as the default value for each group. As part of its request to join a group, a provider may specify a default vote value as part of the group attributes. It may specify either REJECT or APPROVE. All providers must specify the same value. During each voting phase, any provider may specify a new default vote to be used for the group if any provider fails during this voting phase. The provider may specify either REJECT or APPROVE.

If no new default vote is specified, the current default vote carries over to the next phase. At the end of the protocol, the default vote reverts to the original value that was specified during the join of the providers.

If more than one provider specifies an updated default vote value with its vote, the Group Services subsystem arbitrarily chooses one of them. If different values are specified by different providers, it is not possible to predict which one the Group Services subsystem will choose.

As discussed in Submitting changes with voting responses, to ensure consistency, the group should ensure one of the following:

All providers submit the same updated default vote value.
Only one provider submits an updated default vote value.
No providers submit an updated default vote value, allowing the current value to remain in effect.

Approving and rejecting protocols

Every protocol must be either approved or rejected as described, based on the desires of the providers in the group. A protocol is approved when the providers vote to approve it. A protocol is rejected when the protocol is voted down or is ended for some reason.

In summary, a provider or the Group Services subsystem proposes the protocol. If necessary, voting proceeds for the desired number of phases.

If the protocol is approved, the updated information is broadcast to all providers and subscribers, as well as any appropriate target-groups. If the protocol is rejected, a notice of the rejection is broadcast to all providers. Subscribers receive no notification of a rejected protocol.

A one-phase protocol proposal is automatically approved. It cannot be rejected.

If a voluntary leave is rejected, whether by an explicit or implicit vote to REJECT, the protocol ends.

Proposing, voting, and phases for protocols

A protocol starts with a proposal. Either a provider or the Group Services subsystem itself can initiate the proposal.

Every protocol takes place in some number of phases. A one-phase protocol is an atomic multicast to the group members. An n-phase protocol is a mechanism that allows barrier synchronization. All providers in the group involved in the protocol proposal must arrive at the barrier (that is, submit a vote) before the protocol can proceed to the next phase. This guarantees that the group remains synchronized during the protocol. A provider's arrival at a barrier is signalled by its submission of a vote to approve, continue, or reject the proposal.

Each protocol proposal indicates whether it is a one-phase or an n-phase protocol. A one-phase protocol is a nonvoting protocol and is completed in a single phase, as described below.

A protocol that requires one or more voting phases is defined as an n-phase protocol. The exact number of phases is not defined in advance. Instead, the providers determine by their votes the exact number of voting phases.

The following sections describe the two types of protocols in more detail.

One-phase protocols

A one-phase protocol is a notification that the change proposed by the protocol is automatically approved.

For a membership change proposal, all providers and subscribers are notified of the updated membership. The change is either the join of a new provider or the leave of an existing provider. The list of providers that are notified includes the providers that just joined, but does not include any providers that just left. Joins and leaves are not batched together in any one membership change proposal.

For a state value change proposal, all providers and subscribers are notified of the updated state.

For a provider-broadcast message proposal, all providers are notified of the message.

If a provider fails during the protocol, all remaining members receive the protocol notification. The Group Services subsystem immediately proposes a membership change protocol to handle the failed provider.

All providers and subscribers see a series of one-phase protocol notifications in the same order. However, any individual recipient may see a second or subsequent notification before all recipients have seen the first.

N-phase protocols

An n-phase protocol establishes a series of barrier-synchronization voting phases for the providers in the group. Any protocol may be proposed as an n-phase protocol.

The proposal indicates that the protocol requires one or more voting phases, but it does not specify the exact number of phases to use. The voting results determine the actual number of voting phases.

The provider that proposes the protocol may specify a time limit for each voting phase. Each provider must register its vote within the given time limit. If a provider fails to register its vote in time, the Group Services subsystem does the following:

Applies the group's default vote value for that provider.
Notifies all providers of the lateness and includes a list of any providers that failed to respond in time.

The Group Services subsystem starts the protocol by broadcasting the proposal to all providers, which starts the first phase, and ends each phase by tallying the votes. In response to the initial notification, each provider must vote. Each vote response contains:

The actual vote value, as described in Voting on an n-phase protocol.
Optionally, an updated state value or a provider-broadcast message. For details, see Submitting changes with voting responses.
Optionally, a default vote to be used by the group if a provider fails during this voting phase. For details, see Specifying the provider's default vote value.

If at least one provider votes to REJECT the proposal, the Group Services subsystem broadcasts to all providers a notification that the proposal was rejected. If no provider votes to reject the proposal, but at least one provider votes to CONTINUE the voting, Group Services broadcasts a notification that another vote is expected on the proposal. Once again, each provider must respond by voting.

Once all providers vote in the same phase to approve the proposal, the Group Services subsystem broadcasts to all providers and subscribers a notification of the approved change. This final broadcast is equivalent to the one-phase notification.

Failure of a provider during any phase of voting is handled by using the group's default vote for the provider. The Group Services subsystem automatically includes the default vote (REJECT or APPROVE) in the vote tally. Once the protocol completes (that is, is either approved or rejected), the Group Services subsystem immediately proposes a membership change to handle the failed provider.

The final notification of the protocol's rejection or approval also indicates whether any default votes were used during the protocol.

The voting phase allows each provider to take any action desired, such as running scripts, issuing commands to manipulate resources, or displaying graphics on the screen. The provider then submits its vote. If a provider fails during a voting phase, the Group Services subsystem enters the default vote into the tally on behalf of the failed provider.

Once each provider has submitted its vote, the Group Services subsystem tallies the votes. If all of the providers voted to APPROVE the protocol in the same voting phase, the voting ends and the proposal is approved. During the protocol, the providers determine the number of voting phases that are used by voting to CONTINUE the protocol. This mechanism allows the providers to adapt to unexpected occurrences during each protocol, rather than having to know in advance the exact number of phases that will be required.
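
The overall flow of an n-phase protocol can be simulated as a loop over voting phases. This is a simplified model under stated assumptions: the function run_n_phase and the voter callables are hypothetical, a voter that returns None stands for a provider that failed or did not vote in time, and the max_phases guard exists only to keep the simulation finite.

```python
def run_n_phase(voters, default_vote="REJECT", max_phases=100):
    """Simulate an n-phase protocol.

    voters: provider id -> callable(phase) returning "APPROVE", "CONTINUE",
            "REJECT", or None when the provider casts no vote in that phase.
    """
    used_default = False
    for phase in range(1, max_phases + 1):
        effective = []
        for provider_id, voter in voters.items():
            vote = voter(phase)
            if vote is None:
                vote = default_vote  # implicit vote entered for the provider
                used_default = True
            effective.append(vote)
        if "REJECT" in effective:
            # Rejection is announced to the providers only.
            return {"outcome": "REJECTED", "voting_phases": phase,
                    "notified": ["providers"], "default_votes_used": used_default}
        if "CONTINUE" not in effective:
            # All votes were APPROVE in the same phase.
            return {"outcome": "APPROVED", "voting_phases": phase,
                    "notified": ["providers", "subscribers"],
                    "default_votes_used": used_default}
        # At least one CONTINUE and no REJECT: run another voting phase.
    raise RuntimeError("simulation did not terminate within max_phases")


voters = {
    "p1": lambda phase: "CONTINUE" if phase == 1 else "APPROVE",
    "p2": lambda phase: "APPROVE",
}
result = run_n_phase(voters)
print(result["outcome"], result["voting_phases"])  # APPROVED 2
```

In this run, p1 asks for a second phase and then approves, so the proposal is approved after two voting phases without any provider having to declare the number of phases in advance.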

Initiating the protocols

Either a provider or the Group Services subsystem itself initiates proposals.

The Group Services subsystem proposes protocols to handle a join, a failure leave or a cast-out of a provider, or a source-group reflection protocol when a source-group state value change needs to be reflected to its target-groups.

It is up to the providers to initiate proposals to voluntarily leave a group, to expel a provider, to change the state value of a group, or to initiate a provider-broadcast message.

Note the difference between initiating a protocol proposal and notifying the providers that a protocol has been proposed. The Group Services subsystem always issues the notification. However, the Group Services subsystem actually initiates a proposal only for the cases listed above.

To initiate a protocol proposal, a provider calls the appropriate GSAPI subroutine. The Group Services subsystem notifies the other providers in the group that a proposal has been made and proceeds as described in Proposing, voting, and phases for protocols, based on the number of phases and the nature of the proposal.

Group Services subsystem-initiated protocols

The protocols that the Group Services subsystem initiates are join, failure leave, and source-target proposal protocols. Once it initiates the protocols, the Group Services subsystem notifies the providers, and the protocols proceed in a manner quite similar to other proposals.

The protocols that the Group Services subsystem initiates cover the following situations:

A membership change proposal to join a group by a potential provider
A membership change proposal for a failure leave of one or more failed providers
A membership change proposal to cast out one or more providers, due to source-target processing
A source-state reflection proposal to reflect to a target-group when its source-group has changed its state value through a non-membership change protocol.

The number of phases for these protocols is determined as follows:

For a join proposal, the provider must specify either a one-phase or an n-phase join protocol. The first join request to a group by the first provider places the phase setting in the gs_num_phases field of the group attributes block for the group. All subsequent membership change protocols must match the setting.
For failure leaves and cast-out proposals, the Group Services subsystem uses the gs_num_phases setting from the group attributes block.
For source-state reflection protocols, the Group Services subsystem uses the gs_source_reflection_num_phases setting from the group attributes block to control the number of phases.

Time limits for voting phases are determined as follows:

For join, failure leave, and cast-out protocols, the gs_time_limit field in the group attributes block is used.
For source reflection protocols, the gs_source_reflection_time_limit field in the group attributes block is used.
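As a rough illustration of the two lists above, the lookup below selects the phase count and time limit for a subsystem-initiated protocol. The field names come from the group attributes block as named in the text. The dictionary layout, the protocol labels, and the sample values are assumptions made for the example and do not reflect the actual C structure.

```python
# Hypothetical labels for the membership change protocols that share settings.
MEMBERSHIP_PROTOCOLS = {"join", "failure_leave", "cast_out"}


def phase_settings(protocol, attrs):
    """Return (num_phases, time_limit) for a subsystem-initiated protocol.

    attrs is a dict standing in for the group attributes block.
    """
    if protocol in MEMBERSHIP_PROTOCOLS:
        return attrs["gs_num_phases"], attrs["gs_time_limit"]
    if protocol == "source_reflection":
        return (attrs["gs_source_reflection_num_phases"],
                attrs["gs_source_reflection_time_limit"])
    raise ValueError(f"not a subsystem-initiated protocol: {protocol}")


# Illustrative values only; the real field types are not specified here.
attrs = {
    "gs_num_phases": 2,
    "gs_time_limit": 30,
    "gs_source_reflection_num_phases": 1,
    "gs_source_reflection_time_limit": 10,
}
```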

The notification procedure varies slightly, depending on the proposal that is made, as follows:

For a join proposal, the Group Services subsystem notifies all of the providers, including the "old" providers that are already in the group, and the providers asking to join the group. The notification specifies the "old" and joining providers.
For a failure leave proposal, the Group Services subsystem notifies the remaining providers of the protocol proposal.
For a cast-out proposal, the Group Services subsystem notifies all of the providers except those that are being cast out. A provider that is being cast out receives a final notification but does not receive interim notifications that occur while the cast-out is being voted on.
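The notification rules above amount to a simple set calculation. The following sketch assumes provider names are plain strings and uses hypothetical protocol labels. It models only the notifications sent while a protocol is being voted on, not the final notification that a cast-out provider receives.

```python
def interim_recipients(kind, members, targets):
    """Return the providers notified while the protocol is being voted on.

    members: providers currently in the group.
    targets: providers that are joining, have failed, or are being cast out.
    """
    if kind == "join":
        # "Old" providers and joining providers are all notified.
        return list(members) + [t for t in targets if t not in members]
    if kind in ("failure_leave", "cast_out"):
        # Only the providers that remain take part.
        return [m for m in members if m not in targets]
    raise ValueError(kind)
```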

Membership changes may be batched, which means that multiple providers can be handled in a single join or leave protocol.

Once the Group Services subsystem has initiated the protocol, the providers invoke it in the usual manner based on the number of phases. One-phase protocols are a single notification. n-phase protocols proceed to the first voting phase.

Once voting has completed, the providers are notified of the result.

Approval of the Group Services subsystem-initiated protocols

In all cases, if the protocol is approved, Group Services performs the following actions:

All providers receive an updated membership list or state value. They may also receive a provider-broadcast message, if one was included in the final vote.
Subscribers receive the updated membership list or state value, depending on their subscription request.
An approved source-state reflection protocol results in a notification that the protocol has completed. The state value is updated only if a provider submitted a state value with a voting response.

Rejection of the Group Services subsystem-initiated protocols

If a join is rejected for any reason, the rejected GS clients receive a notification that their application to join the group has been rejected. The existing providers also receive the notification. The subscribers receive no notification.

The failure leave and cast-out proposals require some special handling for rejecting the protocols, because the failing providers must be removed in any case.

If any provider explicitly votes to REJECT a failure leave or cast-out proposal, the protocol stops. The membership list is updated to show the removal of the targeted provider. The providers and subscribers receive this updated list. The state value reverts to its value at the beginning of the protocol.

If any provider implicitly votes to REJECT a failure leave or cast-out proposal, the protocol stops. If failure leave requests are allowed to be batched, the Group Services subsystem immediately proposes another failure leave protocol, adding the newly-failed provider into the list of leaving providers.

If failure leave requests are not allowed to be batched, the Group Services subsystem handles this as an explicit vote to REJECT. To handle the newly-failed provider, Group Services initiates a failure leave protocol.
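The three rejection cases for a failure leave can be summarized in a short model. This is a sketch under assumptions: the function name, its arguments, and the list representation are hypothetical, and an implicit REJECT is modeled as one or more providers failing while the protocol runs.

```python
def on_failure_leave_rejected(members, leaving, newly_failed, explicit, batching):
    """Return (members_after, next_leaving) once a failure leave is rejected.

    members:      providers in the group when the protocol started.
    leaving:      failed providers targeted by the rejected protocol.
    newly_failed: providers that failed while the protocol was running.
    explicit:     True for an explicit REJECT vote, False for an implicit one.
    batching:     True if failure leave requests may be batched.
    """
    if explicit or not batching:
        # The targeted providers are removed even though the protocol was rejected.
        members_after = [m for m in members if m not in leaving]
        next_leaving = list(newly_failed)
    else:
        # Implicit REJECT with batching: start over with a larger leave list.
        members_after = list(members)
        next_leaving = list(leaving) + list(newly_failed)
    return members_after, next_leaving
```

An empty next_leaving list means that no follow-up failure leave protocol is needed.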

A rejection of a source-state reflection protocol simply ends the protocol, and the providers are notified that the protocol is rejected.

Provider-initiated protocols

Provider-initiated proposals include proposals made by a provider to perform the following:

Join a group
Leave a group voluntarily
Expel one or more providers from the group
Change the group state value
Broadcast a provider-broadcast message to all providers
Change attributes
Goodbye

A provider calls a GSAPI subroutine to propose one of these protocols, specifying either a one-phase or an n-phase protocol. If an n-phase protocol is specified, the provider must also specify a voting phase time limit. If no time limit is desired, a time limit of 0 may be specified.

The Group Services subsystem checks the proposal for errors. If the proposal is syntactically invalid, the provider receives a synchronous syntax error code. If the group currently has a running protocol, the provider receives a synchronous error code that indicates a collision between competing protocols. (Only one protocol may run at a time.)

If the synchronous checks pass, the Group Services subsystem tentatively accepts the proposal and the provider receives a synchronous successful return code. However, if collision errors are detected asynchronously (because other providers or the Group Services subsystem itself submit a proposal at the same time), the Group Services subsystem returns an error code asynchronously.

If multiple providers submit proposals at the same time, only one proposal is accepted by the Group Services subsystem. The other proposals are returned to the providers that made them, asynchronously returning a collision error code.

Whichever proposal the Group Services subsystem chooses, the providers are notified. If the protocol is a one-phase protocol, the proposal is automatically approved. If the protocol is an n-phase protocol, the proposal notification requests a vote from the providers.

At the end of the protocol, the Group Services subsystem notifies the providers of the protocol's result, that is, whether the protocol was approved or rejected.
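The difference between synchronous and asynchronous collisions can be illustrated with a toy model of a single group. The Group class, its method names, and the "ok" and "collision" return strings are assumptions for this sketch and are not GSAPI return codes. Taking the first pending proposal stands in for the arbitrary choice that the subsystem makes.

```python
class Group:
    """Toy model of proposal handling within one group."""

    def __init__(self):
        self.running = None   # the one protocol currently running, if any
        self.pending = []     # proposals that passed the synchronous checks

    def propose(self, provider, proposal):
        """Synchronous check made when a provider submits a proposal."""
        if self.running is not None:
            return "collision"            # only one protocol may run at a time
        self.pending.append((provider, proposal))
        return "ok"                       # tentatively accepted

    def select(self):
        """Choose one pending proposal; the rest collide asynchronously."""
        if self.running is not None or not self.pending:
            return None, []
        chosen, *collided = self.pending
        self.pending = []
        self.running = chosen
        return chosen, collided

    def finish(self):
        """Mark the running protocol as complete."""
        self.running = None
```

In this model, a proposal that returns "ok" from propose can still appear in the collided list later, which corresponds to the asynchronous collision error described above.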

For voluntary leave protocols:

If a leave protocol is approved, all remaining providers receive the updated membership list and, if it changed, the updated state value. Subscribers receive the updated membership list or state value, based on their subscription. The provider targeted by the protocol is sent the initial notification that its leave protocol is running.
If a leave is rejected, the protocol ends. However, the provider that proposed the leave is still removed from the group. The membership list is updated to show the removal of the targeted providers. The providers and subscribers receive this updated list. The state value reverts to its value at the beginning of the protocol.

For expel protocols, see The expel protocol.

For state value change protocols:

If a state value change is approved, the providers and subscribers receive the updated state value. If the group is a source-group, its target-groups also receive notification of the change.
If a state value change protocol is rejected, the state value remains unchanged. The providers receive notification of the rejection. The subscribers receive no notification.

For provider-broadcast message protocols:

If it is a one-phase protocol, the message contained in the proposal is broadcast to all providers. Subscribers receive no notification.
If it is an n-phase protocol, and it is:
- Approved, and if the group state value was changed during the voting phases, the providers and subscribers receive the updated state value.
- Approved, but the group state value was not changed during the voting phases, the providers receive notice that the protocol is completed. Subscribers receive no notification.
- Rejected, the state value remains unchanged. The providers receive notification of the rejection. The subscribers receive no notification.

For change-attributes protocols:

If a change-attributes protocol is approved, the providers receive the updated group attributes. Subscribers receive no notification of the attribute change.
If a change-attributes protocol is rejected, the group attributes remain unchanged. The providers receive notification of the rejection. The subscribers receive no notification.

Note that any n-phase protocol can propose a change to the group state value. If the group state value change is accepted:

The providers and subscribers receive the updated state value.
If the group is a source-group, its target-groups also receive notification of the change.

Submitting changes with voting responses

The voting response to each phase of an n-phase protocol may contain a proposed new group state value, a provider-broadcast message, and a proposed new default vote for the group.

These choices give providers quite a bit of flexibility in managing their actions during an n-phase protocol. When one or more of these items is submitted during a voting response, the Group Services subsystem broadcasts it to all providers as part of the notification for the next phase of the protocol.

Changing the state value during the voting phases of a protocol can be very useful. As an example, it would allow a group to update the state value during membership change protocols, which may be important in determining group quorum or active/inactive status.

Similarly, by submitting a provider-broadcast message with their voting response, instead of or along with an updated state value, the providers can pass data among themselves during the protocol, without having to actually manipulate the state value field.

Because each provider must issue a vote response, each provider could submit with its vote a proposed updated state value, a provider-broadcast message, or a new default vote. In case of multiple submissions, the Group Services subsystem chooses only one of the values to propagate to the providers for the next phase notification. The Group Services subsystem considers only the providers who do not specify a null pointer to a state value or message. For these, the Group Services subsystem arbitrarily chooses one of the responses it receives from the group. Because the providers cannot control which response is chosen, they should guarantee that one of the following is true:

Only one provider submits a state value or message or new default vote value during each phase.
All providers submit the same new state value or message or new default vote value.

If these rules are not followed, it is not possible to determine which response will be chosen to be propagated for the next phase.
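The selection rule and the two guarantees can be modeled as follows. This is a sketch only: the function names are hypothetical, None stands in for a null pointer, and a random choice stands in for the arbitrary choice made by the subsystem.

```python
import random


def pick_submission(responses, rng=random):
    """Pick the value propagated to the next phase.

    responses maps a provider name to the value it submitted, or to None
    if it submitted nothing. The choice among non-None values is arbitrary.
    """
    candidates = [v for v in responses.values() if v is not None]
    if not candidates:
        return None
    return rng.choice(candidates)


def outcome_is_predictable(responses):
    """True if at most one provider submitted, or all submitted the same value."""
    candidates = [v for v in responses.values() if v is not None]
    return len(candidates) <= 1 or all(v == candidates[0] for v in candidates)
```

When outcome_is_predictable returns True, every possible choice yields the same value, so the providers know in advance what the next phase notification will carry.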

The voting phase time limit

The voting phase time limit allows the providers to determine if their peers are not responding quickly enough during voting protocols.

Once the Group Services subsystem has delivered its notification for each voting phase, it sets a timer. If it has not received a voting response from a provider within that time, the Group Services subsystem assumes that the provider is not going to respond, and applies the group's default vote for that provider. Note that the default vote applies only to the currently running protocol. If the provider votes later, the vote is ignored, and the provider is given an error code that indicates that the time limit was exceeded.

The Group Services subsystem specifies that a default vote was applied because the time limit was exceeded, but does not specify, at this time, the providers that were slow. If the application of the default vote causes the protocol to be rejected, or the time limit is exceeded in the last voting phase of an approved protocol, Group Services sends a notification to the providers that lists the providers that exceeded the time limit. The Group Services subsystem takes no further action. However, a provider may initiate an expel protocol to remove any providers that exceeded the time limit, if appropriate.
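The effect of the timer on a single phase can be sketched as below. The function name and the string vote values are assumptions for the example. The list of late providers corresponds to the list that Group Services reports only when the protocol is rejected or when the limit is exceeded in the last phase of an approved protocol.

```python
def close_phase(providers, received, default_vote):
    """Fill in the default vote for providers that did not respond in time.

    providers:    all providers expected to vote in this phase.
    received:     provider -> vote, for responses that arrived before the timer expired.
    default_vote: the group's current default vote.
    Returns (votes, late), where late lists the providers that exceeded the limit.
    """
    votes, late = {}, []
    for p in providers:
        if p in received:
            votes[p] = received[p]
        else:
            votes[p] = default_vote   # applies to the running protocol only
            late.append(p)
    return votes, late
```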

The voting phase time limit is also used to time the invocation of deactivate scripts during expel protocols. For more details, see The expel protocol.

Simultaneous protocols

Because there may be multiple providers in a group, more than one provider may submit a proposal at the same time. However, the Group Services subsystem does not invoke more than one protocol at a time within a group. (Of course, multiple protocols may be running simultaneously in a domain, one for each of the groups in the domain.)

What are simultaneous proposals? There is always a delay between the call of a GSAPI subroutine by a provider to initiate a protocol and the broadcast of any resultant notification for that subroutine. This delay allows the Group Services subsystem to batch multiple join requests, because it may receive multiple such requests before it has actually broadcast a notification. In this case, the Group Services subsystem collects all of the joins and issues a single notification. Similarly, the Group Services subsystem batches together multiple failure leaves or cast-outs into a single protocol. In all other cases, it deals with proposals one at a time.

In general, the first proposal to be made after a running protocol completes is the one that is chosen to invoke next. If multiple providers all attempt to submit proposals, the Group Services subsystem chooses one arbitrarily.

For provider-initiated proposals, all proposals that are not chosen to be invoked immediately are returned to the providers, with an asynchronous collision error code. The notification of the collision may arrive before or after the protocol that was chosen begins. The provider may resubmit the proposal at a later time, if appropriate.

All the Group Services subsystem-initiated proposals remain pending until they have been invoked within the group. No provider-initiated proposals are accepted until all of the pending Group Services subsystem-initiated proposals have been invoked. A provider that attempts to submit a proposal receives a synchronous or asynchronous collision error code.

When choosing among multiple proposals, the Group Services subsystem chooses a proposal based on the following priority order:

Failure leaves and cast-outs
Source-state reflection
Joins
Leaves and expels
State value change, provider-broadcast message, or change-attributes protocols.

Within these categories, if there are multiple simultaneous proposals of the chosen type, the Group Services subsystem arbitrarily chooses one of them, excluding those that may be batched together.

If batching is allowed, membership changes are batched. Joins are batched only with joins and failure leaves are batched only with failure leaves.
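The priority order and batching rule can be modeled with a small selector. The protocol labels and function names are hypothetical, and two simplifications are assumed: proposals are batched only with others of exactly the same kind, and the first proposal in arrival order stands in for the arbitrary choice.

```python
# Highest priority first, following the order listed above.
PRIORITY = [
    {"failure_leave", "cast_out"},
    {"source_reflection"},
    {"join"},
    {"leave", "expel"},
    {"state_change", "broadcast", "change_attributes"},
]
BATCHABLE = {"join", "failure_leave", "cast_out"}


def rank(kind):
    """Return the priority index of a proposal kind (0 is highest)."""
    for i, kinds in enumerate(PRIORITY):
        if kind in kinds:
            return i
    raise ValueError(kind)


def next_protocol(pending, batching=True):
    """Return the proposals to run as the next single protocol.

    pending is a list of (kind, provider) pairs in arrival order.
    """
    if not pending:
        return []
    best = min(rank(kind) for kind, _ in pending)
    top = [p for p in pending if rank(p[0]) == best]
    first_kind = top[0][0]
    if batching and first_kind in BATCHABLE:
        return [p for p in top if p[0] == first_kind]
    return [top[0]]
```

For example, with a state value change, two joins, and two failure leaves pending, the two failure leaves are selected and run as one batched protocol.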

No provider is allowed to cycle invisibly. If a provider should fail and then restart and try to join the group, the Group Services subsystem ensures that the leave of that provider is proposed before the subsequent join of that provider.

A running protocol is always completed. The protocol could complete successfully or unsuccessfully. An unsuccessful completion could be caused by a provider voting to REJECT the protocol, by an explicit or implicit vote. The protocol might also end unsuccessfully if one or more providers fail to submit their votes within the specified time limit.

A rejected provider-initiated protocol is not automatically resubmitted. The providers must resubmit the protocol, if it is required.

Ending a protocol

The end of the protocol is signaled by the end of the voting phases for the protocol. In the case of a one-phase protocol, the end phase is the only phase.

In the case of an n-phase protocol, the voting phases can end in one of the following ways:

If any provider votes to REJECT the proposal in any voting phase, the proposal is rejected.
If a default vote of REJECT is entered, the proposal is rejected.
If, in the same voting phase, every provider either votes to APPROVE the proposal or has a default vote of APPROVE entered for it, the proposal is approved.

In all cases, the end phase consists of a broadcast of the results of the protocol just processed.

For approved proposals, a notification is sent to all providers and subscribers. The notification contains a flag that specifies whether any default votes were used to approve the protocol. Providers receive this information, but subscribers do not. The contents of the notification are also determined as follows:

If there was a membership change, it contains the updated membership list. Both providers and subscribers receive this information.
If there was a proposal to change the group state value, it contains the new group state value. Both providers and subscribers receive this information.
If a provider-broadcast message was submitted on the final vote, it contains the message. Providers receive this information; subscribers do not.

For rejected proposals, a notification is sent only to providers. The notification contains the following information:

An indication that the proposal was rejected
A flag that specifies why the proposal was rejected. Reasons include: there was an explicit vote to REJECT, a default vote to REJECT was submitted on behalf of a failed provider, or the protocol was ended because a provider exceeded the specified time limit.
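The rules for what each audience receives at the end of a protocol can be collected into one function. The dictionary keys and the function name are assumptions for this sketch, and the model ignores the fact that a subscriber receives only the items that match its subscription.

```python
def end_notification(approved, default_votes_used=False, membership=None,
                     state_value=None, message=None, reject_reason=None):
    """Return the end-phase notification for providers and for subscribers.

    A value of None for the subscribers means that they are not notified.
    """
    if not approved:
        return {
            "providers": {"result": "rejected", "reason": reject_reason},
            "subscribers": None,
        }
    shared = {"result": "approved"}
    if membership is not None:
        shared["membership"] = membership      # providers and subscribers
    if state_value is not None:
        shared["state_value"] = state_value    # providers and subscribers
    providers = dict(shared, default_votes_used=default_votes_used)
    if message is not None:
        providers["message"] = message         # providers only
    return {"providers": providers, "subscribers": shared}
```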

The expel protocol

The expel protocol allows a provider to propose the removal from the group of one or more providers. Some situations in which this could be useful include:

A provider has received an announcement notification that another provider is not responsive or has detected an internal error.
A provider has received an announcement notification that another provider failed to submit a vote during a previously completed n-phase protocol within the specified time limit.
A provider has detected through some other means that another provider is not behaving as expected in the context of the application that the group is running.

During the invocation of the expel protocol, Group Services runs a deactivate script against each provider that is being expelled. The deactivate script, which is specified by each GS client on initialization, is used to perform any cleanup actions that may be required.

The deactivate script does not need to be a shell script but can be any kind of executable file. For each provider that is targeted for expulsion, the Group Services daemon forks a child process that attempts to invoke the deactivate script on the provider's node. For details about the environment in which the deactivate script runs and the input and output specifications to which it must conform, see Deactivate scripts and ha_gs_expel subroutine.

The expel protocol is a provider-initiated protocol. Therefore, if it collides with another already-running protocol, Group Services returns it to the proposer. The proposer must resubmit the protocol; the protocol is not automatically queued.

A provider uses the ha_gs_expel subroutine to request an expel protocol. On input, the provider specifies the following information:

The number of phases for the protocol. An expel protocol may be either a one-phase or an n-phase protocol.
The voting time limit for each phase. Providers that are not being expelled must vote within this time limit. For providers that are being expelled, the deactivate script must complete within this time limit, or be considered unsuccessful.
The list of providers to be expelled. These providers do not take part in the protocol and receive no notice of it, unless it is approved. All providers that are not targeted for expulsion take part in running the protocol, even if they had been declared nonresponsive before the protocol began.
A deactivate phase specifier. This value tells Group Services in which voting phase it should invoke the deactivate script. A value of 0 indicates that the deactivate script should not be invoked.
An expel flag. This flag is passed to the deactivate script. A null value indicates that no flag should be passed to the deactivate script.

For each provider that is targeted for expulsion, Group Services runs the deactivate script that was specified by that provider when it initialized itself with Group Services. The deactivate script runs on the node on which the provider that is targeted for expulsion is running. It runs during the phase and uses the flag that was specified on the expel protocol. To be successful, the deactivate script must complete within the voting time limit for the phase. To invoke the deactivate script, Group Services acts as a substitute for each provider that is being expelled.

During the expel protocol, providers that are not being expelled treat this as a normal protocol and take any action they deem appropriate. If it is an n-phase protocol, their voting responses are tallied as if it were any other n-phase protocol.

If the value of the deactivate phase specifier is 0, no deactivate script is invoked during the protocol. If the protocol is approved, the providers that are targeted for expulsion are removed from the group. Because one-phase protocols are always approved, a one-phase expel protocol with a deactivate phase specifier of 0 simply removes the targeted providers from the group. If the protocol is rejected, the targeted providers are not removed from the group.

At the start of the voting phase given by a non-zero deactivate phase specifier, Group Services runs the deactivate script against each targeted provider. If at least one provider votes to reject the protocol before this phase, the targeted providers are not removed from the group and no deactivate scripts are invoked.

If the expel protocol is a one-phase protocol, and the value of the deactivate phase specifier is 1, the deactivate script is run immediately after the protocol begins running. Providers that are not targeted for expulsion receive the usual protocol approval notification, informing them that the targeted providers are now out of the group. Providers that are targeted for expulsion receive the protocol approval notification after the Group Services daemon has forked a child process to run the deactivate script. The Group Services daemon does not wait for the script to complete before it sends the notification. Therefore, it is difficult to determine whether the provider will receive the notification before or after the script runs.

The exit code of the deactivate script is not inspected, and the result is not returned to the providers that remain in the group.

If a provider that is targeted for expulsion by a one-phase expel protocol fails after the protocol has begun, no failure protocol is initiated in the group for that provider.

When a deactivate script runs successfully, it is expected to exit with an exit code of 0. Group Services treats the successful completion of the deactivate script as a vote to approve the protocol. If the protocol requires more voting phases, Group Services continues to vote APPROVE for each subsequent voting phase.

When a deactivate script does not exit with a code of 0, Group Services enters the group's current default vote value as the provider's vote for the phase. If the protocol requires more voting phases, Group Services continues to enter the current default vote value as the provider's vote for each subsequent voting phase.

If the deactivate script is to be run in a future voting phase, Group Services enters a vote of CONTINUE as the provider's vote for each interim voting phase.

If one or more providers that are targeted for expulsion did not specify a deactivate script, or specified a script that could not be run, but a non-zero deactivate phase specifier was given, then for those providers, the group's default vote value is entered for this and each subsequent voting phase. However, for providers that did specify a valid deactivate script, the script is run and its result is used to drive the voting, as previously described.
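The votes that Group Services enters on behalf of an expelled provider in an n-phase expel protocol can be summarized in one function. This is a sketch under assumptions: the function name and string vote values are hypothetical, phases are numbered from 1, and an exit code of None models a script that was not specified or could not be run.

```python
def substitute_vote(phase, deactivate_phase, exit_code, default_vote):
    """Return the vote entered for an expelled provider in the given phase.

    phase:            the current voting phase, starting at 1.
    deactivate_phase: the non-zero deactivate phase specifier.
    exit_code:        the script's exit code, or None if no script ran.
    default_vote:     the group's current default vote.
    """
    if deactivate_phase == 0:
        raise ValueError("no deactivate script is invoked when the specifier is 0")
    if phase < deactivate_phase:
        return "CONTINUE"      # the script runs in a future phase
    if exit_code == 0:
        return "APPROVE"       # a successful script counts as approval
    return default_vote        # failed, missing, or unrunnable script
```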

When a provider fails after the expel protocol begins but before the Group Services daemon has forked a child process to run the deactivate script, Group Services passes a process ID of 0 to the deactivate script. The deactivate script is still run and the exit code is used to determine the vote for this provider, as previously described.

Group Services tallies the votes for voting phases in the normal manner. If the expel protocol is approved, the providers that are targeted for expulsion are removed from the group. Remaining providers and subscribers are notified.

Group Services sends the protocol approval notification to expelled providers that did not exit in the course of running the deactivate script. However, Group Services does not verify that such providers receive or process the notification. Because they are no longer in the group, expelled providers cannot submit protocols and do not receive notifications related to the group.

In the event that the protocol is rejected for any reason, the providers that are targeted for expulsion are not removed from the group. However, if the deactivate script causes a provider to exit, Group Services initiates a failure leave protocol for that provider.

When a single process is joined as providers to multiple groups, and one of those provider instances has been expelled from a group, the effect on the other instances is as follows:

If the process no longer exists (it is killed or has failed) as a result of the expel protocol, the other provider instances of the process are handled through failure leave protocols in their groups.
If the process still exists, the other provider instances of the process are not affected and continue as full participants in their groups.

If a single process is joined as providers to multiple groups, and more than one of the groups are simultaneously running expel protocols that target those providers (because the process is unresponsive, for example), the order in which deactivate scripts are run against the process is not defined by Group Services. Because each group's expel protocol proceeds independently, Group Services does not coordinate the invocation of the deactivate script for each group's protocol. If all groups approve their expel protocols and the process is killed, no failure leave protocols are invoked. If one or more groups reject their expel protocols, but the process is killed in the course of running the deactivate script, those groups initiate failure leave protocols to remove the failed provider.

Deactivate-on-failure handling

The same deactivate script will be run in the case of a local provider's process failure as well as in the case of the expel protocol. When a provider is failing, its group is forced into a failure leave protocol. Deactivate-on-failure handling allows recovery and clean-up actions on the failed provider's node, although the failed provider's process no longer exists.

In the case of an n-phase protocol, the results of the script's invocation will be used in subsequent voting for the protocol. For a one-phase protocol, the results will not be relayed to the remaining group members. Unlike expel, the group cannot specify the voting phase in which the script will be invoked. It is always run in the first phase. The failure leave protocol, with deactivate-on-failure handling, operates as follows:

If batching of failures is not allowed, the deactivate script is run for every provider.
If batching of failures is allowed, and:
- If there are multiple failed providers on one node in one protocol, the deactivate script will be run once.
- If there are multiple failed providers in separate protocols, the deactivate script will be run once per protocol.
The deactivate script is invoked on each node with a failed provider.
In the case where a failed GS client process had been joined as providers to multiple groups, each group continues to run independent failure protocols.
- If multiple groups specify deactivate-on-failure, then the deactivate script will be run during each group's failure protocol.
- Group Services does not define the order in which the deactivate scripts will be run by each group, as the order in which the individual groups will run the failure protocols is not defined.
If a group has enabled deactivate-on-failure, and one or more providers are to be cast out, the decision is as follows:
- If the targeted provider's process exists at the time the cast-out protocol begins running, the deactivate script will not be invoked.
- If the targeted provider's process does not exist at the time the cast-out protocol begins running, the deactivate script will be run.
- If the targeted provider's process exists at the time the cast-out protocol begins running, but fails during the cast-out protocol, the deactivate script will not be run.
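Two of the rules above lend themselves to a short model: how often the deactivate script runs on each node, and whether it runs for a cast-out. The function names and the (provider, node) pair representation are assumptions for this sketch. With batching, the pairs are taken to belong to one failure protocol. Without batching, they represent failures that are handled in separate protocols.

```python
from collections import Counter


def script_runs_per_node(failed, batching):
    """Return node -> number of deactivate script runs.

    failed is a list of (provider, node) pairs for failed providers.
    """
    per_node = Counter(node for _, node in failed)
    if batching:
        # Several failed providers on one node in one protocol: run once.
        return {node: 1 for node in per_node}
    # No batching: the script is run for every failed provider.
    return dict(per_node)


def cast_out_runs_script(deactivate_on_failure, process_exists_at_start):
    """Return True if the script runs for a provider that is being cast out.

    Only the state of the process when the cast-out protocol begins matters.
    A process that fails later, during the protocol, does not trigger the script.
    """
    return deactivate_on_failure and not process_exists_at_start
```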

Deactivate-on-failure handling with one-phase protocol

If the failure protocol is a one-phase protocol, the deactivate script is invoked immediately after the protocol begins running and the Group Services daemon does not wait for the script to complete. Non-failed providers receive the usual protocol approval notification, informing them that the failed providers are now out of the group.

The exit code of the deactivate script is not inspected, and the result is not returned to the providers that remain in the group.

Deactivate-on-failure handling with n-phase protocol

If the failure protocol is an n-phase protocol, the results of the deactivate script will be used to guide the vote submitted for the failed providers. Note that the deactivate script is always run during the first phase of the failure protocol.

If a deactivate script runs successfully, it is expected to exit with an exit code of 0. Group Services treats the successful invocation of the deactivate script as a vote to approve the protocol. If the protocol requires more voting phases, Group Services continues to vote APPROVE for each subsequent voting phase.

If a deactivate script does not exit with a code of 0, Group Services enters the group's current default vote value as the failed provider's vote for the phase. If the protocol requires more voting phases, Group Services continues to enter the current default vote value as the failed provider's vote for each subsequent voting phase.

If the group has specified a time limit for failure protocols, and the script does not complete within the time specified, the Group Services daemon treats this as a normal voting time out and applies the group's current default vote. If the voting phases continue in the protocol, the Group Services daemon will continue to apply the group's current default vote value for each subsequent voting phase.

Group Services tallies the votes for voting phases in the normal manner. If the failure protocol is approved, the failed providers are removed from the group. Remaining providers and subscribers are notified. If the failure protocol is rejected, there are special conditions that apply to rejection of any failure protocol.

When the rejection is caused by either an explicit reject vote or a default reject vote, and batching of failures is not allowed, the protocol ends and the failed providers are removed from the group. Remaining providers and subscribers are notified.
When the rejection is caused by an explicit reject vote, and batching of failures is allowed, the protocol ends and the failed providers are removed from the group. Remaining providers and subscribers are notified.
When the rejection is caused by a default reject vote, and batching of failures is allowed, the protocol ends, but the failed providers are not removed. The group is immediately put into a new failure protocol, with any newly-failed providers added to the list of already-failed providers from the previous protocol. A deactivate script is run only once against any single failed provider instance. As a result, during the subsequent failure protocols, only the newly-failed providers will have their deactivate scripts run, but no deactivate scripts will be run against the already-failed providers. During any subsequent failure protocols, the Group Services daemon votes APPROVE on behalf of the old failed providers. This prevents the group from being put into an infinitely-looping situation, where the failure protocol ends via a default REJECT vote caused by a failed deactivate script, and would otherwise be continually restarted.
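The run-once rule in the last case is what breaks the potential loop, and it can be sketched as follows. The function name, the set used to remember earlier script runs, and the string vote values are assumptions for this example.

```python
def failed_provider_votes(failed, already_scripted, exit_codes, default_vote):
    """Return the votes entered for failed providers in one failure protocol.

    failed:           failed providers handled by this protocol.
    already_scripted: set of providers whose script ran in an earlier,
                      rejected failure protocol (updated in place).
    exit_codes:       provider -> exit code of scripts run in this protocol.
    default_vote:     the group's current default vote.
    """
    votes = {}
    for p in failed:
        if p in already_scripted:
            votes[p] = "APPROVE"       # the script is never run a second time
        else:
            already_scripted.add(p)
            votes[p] = "APPROVE" if exit_codes.get(p) == 0 else default_vote
    return votes
```

In the usage below, p1's script fails in the first protocol and yields a default REJECT. In the restarted protocol, p1 is voted APPROVE without another script run, so only the newly-failed p2 can influence the outcome.

```python
seen = set()
first = failed_provider_votes(["p1"], seen, {"p1": 3}, "REJECT")
second = failed_provider_votes(["p1", "p2"], seen, {"p2": 0}, "REJECT")
```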

Deactivate scripts

This section provides information about the invocation environment, input parameters, and exit codes of deactivate scripts.

Invocation environment for deactivate scripts

To handle a situation in which a provider must be expelled, or in which it is failing, a provider can specify a deactivate script on the ha_gs_init subroutine when it first registers with Group Services. The script may be a shell script or any kind of executable file that conforms to the input and output rules that are specified later in this section.

Group Services does not verify that a deactivate script actually exists on a node or that it can be run, until an expel protocol is invoked. If the specified deactivate script is not found or cannot be run, Group Services applies the group's default vote value for the phase in which the deactivate script should have been invoked, and for each subsequent voting phase, if there are any.

A valid deactivate script is run as follows. For each provider targeted by the expel protocol, the Group Services daemon on the provider's node forks a child process that tries to run the deactivate script, using the following environment:

Effective uid and gid: The forked process runs with the effective uid and gid of the targeted provider that it had when it registered with Group Services by its call to the ha_gs_init subroutine. If the provider changed its uid or gid after calling ha_gs_init, the deactivate script still uses the effective uid and gid from the time when ha_gs_init was called. A deactivate script with a set uid bit in its file permissions runs with those values.
Working directory: The forked process begins running in the current working directory of the targeted provider that it had when it registered with Group Services by its call to the ha_gs_init subroutine. If the provider changed its current working directory after calling ha_gs_init, the deactivate script still uses the current working directory that existed when ha_gs_init was called. A deactivate script that wants to run in another directory must change to that directory.
Environment variables: The forked process inherits the environment variables from the Group Services daemon's environment. Therefore, the deactivate script must not make any assumptions about the environment variables (for example, the path) or access to specific directories or file systems except for those that are normally accessible to the provider's effective uid and gid.
STDIN, STDOUT, and STDERR file descriptors: On input, the STDIN, STDOUT, and STDERR file descriptors are closed (not associated with any files). To perform input or output, the deactivate script must explicitly open any input or output file that it wants to use.
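
Because of this environment, a deactivate script typically has to set up its own search path, working directory, and output file before doing any real work. The following Python sketch shows one way to do that, assuming the deactivate script is written as a Python executable. The function name, the default search path, and the parameters are hypothetical.

```python
import os


def prepare_script_environment(work_dir: str, log_path: str,
                               search_path: str = "/usr/bin:/bin"):
    """Set up what a deactivate script cannot take for granted."""
    # The environment comes from the Group Services daemon, so set the
    # search path explicitly instead of relying on the inherited one.
    os.environ["PATH"] = search_path
    # The script starts in the directory the provider was in when it
    # called ha_gs_init; change explicitly if another one is needed.
    os.chdir(work_dir)
    # STDIN, STDOUT, and STDERR start out closed, so open an output
    # file explicitly rather than printing.
    return open(log_path, "a", encoding="utf-8")
```

The caller is responsible for writing to and closing the returned file object.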

Input parameters for deactivate scripts

On input, Group Services supplies the following parameters to a deactivate script:

The process ID parameter is always zero when the script is run for deactivate-on-failure handling. (See The expel protocol.)
The voting time limit of the expel protocol, in seconds, as an int (4 bytes). The deactivate script must complete and exit within this limit.
The name of the failed provider's group, as a null-terminated string
The deactivate flag, which is the null-terminated string providerdied. The deactivate script can distinguish when it is called by checking this deactivate flag.
The list of the failed providers' instance numbers, separated by commas. This parameter is presented only for deactivate-on-failure handling. When batching of failures is enabled, the deactivate script can be run once for the failure of multiple providers, and this fifth parameter identifies which providers failed. Note that a provider instance number does not contain the node number.
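
The following Python sketch shows how a deactivate script might interpret these parameters. It assumes that the parameters arrive as positional strings in the order listed above, with the numeric values in decimal form; this section does not state how the parameters are passed, so treat that as an assumption. The type and function names are hypothetical.

```python
from typing import List, NamedTuple


class DeactivateArgs(NamedTuple):
    pid: int                     # 0 for deactivate-on-failure handling
    time_limit_secs: int         # voting time limit of the protocol
    group_name: str
    deactivate_flag: str         # "providerdied" for failure handling
    failed_instances: List[int]  # empty unless the fifth parameter is given


def parse_deactivate_args(args: List[str]) -> DeactivateArgs:
    """Parse parameters, ASSUMING positional strings in the listed order."""
    if len(args) < 4:
        raise ValueError("expected at least four parameters")
    instances: List[int] = []
    if len(args) >= 5 and args[4]:
        # Comma-separated provider instance numbers (no node numbers).
        instances = [int(item) for item in args[4].split(",")]
    return DeactivateArgs(int(args[0]), int(args[1]), args[2], args[3],
                          instances)
```

A script would typically call this with its argument list (for example, `sys.argv[1:]`) and check `deactivate_flag` to determine why it was called.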

Exit codes for deactivate scripts

On output, the deactivate script must supply an exit code of 0 for a successful completion. Any other exit code indicates an unsuccessful completion. It is up to the deactivate script to decide what constitutes a successful completion.

On receipt of an exit code indicating a successful completion before the time limit expires, Group Services votes APPROVE for this voting phase of the protocol. On receipt of an exit code indicating an unsuccessful completion before the time limit expires, Group Services applies the group's default vote for this voting phase of the protocol, and each subsequent voting phase of the protocol.

If the deactivate script does not exit before the time limit expires, Group Services applies the group's default vote for this voting phase of the protocol, and each subsequent voting phase of the protocol.
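
The exit code rules can be modeled as follows. This Python sketch is an illustration, not Group Services code; the function name is hypothetical, and `None` is used here to stand for a script that did not exit within the time limit.

```python
from typing import Optional, Tuple


def vote_for_script_result(exit_code: Optional[int],
                           default_vote: str) -> Tuple[str, bool]:
    """Return (vote, applies_to_later_phases) for one deactivate script run.

    exit_code is None when the script did not exit before the time limit.
    """
    if exit_code == 0:
        # Successful completion in time: APPROVE for this voting phase.
        return ("APPROVE", False)
    # Unsuccessful completion or a timeout: the group's default vote is
    # applied for this phase and for each subsequent voting phase.
    return (default_vote, True)
```

Note that a nonzero exit code does not by itself reject the protocol; the result depends on the group's default vote.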

Notifications

A GS client can receive several types of messages, called notifications, from Group Services. These include notifications for:

Protocol proposals and ongoing protocols
Protocol approvals
Protocol rejections
Announcements
Responsiveness checks

All messages are sent in a fault-tolerant manner. That is, providers and subscribers are guaranteed to receive notifications despite failures.

Protocol proposal and ongoing protocol notifications

These notifications are sent to the providers of a group to indicate that an n-phase protocol has been proposed or is in progress. As a response to these notifications, the Group Services subsystem typically expects a vote.

Protocol proposal notifications are not sent for one-phase membership or state value changes because these proposals are automatically approved.

There are four types of proposals for which notifications are sent:

Membership change proposals: A membership change proposal notification is sent when a provider has requested to voluntarily join or leave a group, when a provider has requested the expulsion of one or more providers from a group, or when a provider has left the group involuntarily, either because the process itself failed or because the node on which it was running failed. An involuntary leave is called a failure leave and is initiated by Group Services.
State value change proposals: A state value change proposal notification is sent when a provider has requested a change to the group's state value.
Provider-broadcast message proposals: A provider-broadcast message proposal notification is sent when a provider has issued a request to broadcast a message. Such a request may also initiate voting.
Attribute change proposals: An attribute change proposal notification is sent when a provider has issued a request to change the group's attributes.

The only GS clients that are concerned with protocol proposal and ongoing protocol notifications are providers. Subscribers do not participate in proposing, approving, or rejecting membership or state value changes for the group. Also, when subscribers join or leave the group, no notification is sent to any GS client.

Protocol approvals

Protocol approvals are sent to the providers of a group to indicate that a proposal has been approved. They are also sent to the subscribers of the group for membership changes and state value changes.

Note that a protocol approval notification is sent as the first and only notification for a one-phase protocol.

Protocol rejections

Protocol rejections are sent to the providers of a group to indicate that a proposed membership or state value change has been rejected. Subscribers are not notified when proposals are rejected.

Announcement notifications

Announcement notifications are sent to the providers of a group to announce an item of interest within the group. They include warnings that individual providers have not voted in time or not responded to a responsiveness check.

Responsiveness notifications

Responsiveness notifications are sent to a group's providers to determine whether each provider is active. If a provider does not respond to this responsiveness check within the time limit it specified previously, Group Services sends an announcement notification to all providers.
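
The recipients described in the preceding subsections can be collected into a lookup function. The following Python sketch is an illustrative summary of those subsections only; the function name and the string labels are hypothetical, and it does not cover cases described elsewhere, such as the announcement sent to all GS clients when the Group Services daemon fails.

```python
def notification_recipients(kind: str, change: str = "") -> set:
    """Return which GS clients receive a notification of the given kind.

    kind:   "proposal", "approval", "rejection", "announcement",
            or "responsiveness"
    change: for approvals, the kind of change that was approved,
            for example "membership" or "state_value"
    """
    known = ("proposal", "approval", "rejection", "announcement",
             "responsiveness")
    if kind not in known:
        raise ValueError("unknown notification kind: " + kind)
    if kind == "approval" and change in ("membership", "state_value"):
        # Approved membership and state value changes also reach subscribers.
        return {"providers", "subscribers"}
    # Everything else in these subsections goes to providers only.
    return {"providers"}
```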

An illustration of a multi-phase protocol

The figures that follow illustrate a state change protocol for a group with two providers, P1 and P2, and two subscribers, S1 and S2. P2 proposes a change to the group's state value, and specifies whether the change requires voting phases or is handled as a single broadcast. Figure 1 shows the sequence of events for a one-phase protocol. Figure 2 shows the sequence of events for a two-phase commit protocol. Figure 3 shows the sequence of events for a multi-phase protocol with three rounds of voting.

Upon receipt of the state change proposal from P2, the Group Services subsystem sends a notification to all of the providers in the group, namely P1 and P2. If P2 requested a one-phase protocol, the change is approved, S1 and S2 are notified, and the protocol terminates. If P2 requested a multi-phase protocol, P1 and P2 are instructed to vote on the outcome of the protocol.

Figure 2 shows the invocation of a multi-phase protocol in which the providers vote to approve the change after one round of voting. Figure 3 shows the providers extending the voting to three rounds. When the change is approved, all of the providers and the subscribers, that is, P1, P2, S1, and S2, are informed of the change.

Note that if both P1 and P2 submit state change requests concurrently, the Group Services subsystem chooses one of the requests for invocation, and returns the other to its proposer.

The n-phase agreement protocol that the GSAPI provides is flexible and powerful enough to handle a variety of synchronization and coordination requirements:

A one-phase protocol is invoked when a provider submits a state change requesting that there be no voting by the providers, and therefore another provider cannot stop this state change. The first phase of such a protocol is also the last phase of the protocol, as shown in Figure 1.
A two-phase state change protocol is essentially the well-understood two-phase commit protocol with a reliable coordinator.
An n-phase state change protocol gives the providers the framework to perform n-1 rounds of barrier synchronization. For example, the four-phase protocol shown in Figure 3 yields three rounds of barrier synchronization, at the end of voting phases one, two, and three.
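
The relationship between voting rounds and phases can be sketched with a small simulation. This Python model is an illustration only. It assumes the tally rule that any REJECT vote rejects the proposal, that otherwise any CONTINUE vote leads to another voting round, and that otherwise the proposal is approved. The function and parameter names are hypothetical.

```python
from typing import Callable, List, Tuple


def run_protocol(providers: List[str],
                 vote_fn: Callable[[str, int], str],
                 voting: bool = True,
                 max_rounds: int = 16) -> Tuple[str, int]:
    """Return (outcome, number_of_phases) for one proposal.

    vote_fn(provider, round_number) returns "APPROVE", "REJECT",
    or "CONTINUE" for the given voting round (numbered from 1).
    """
    if not voting:
        # One-phase protocol: the first phase is also the last.
        return ("APPROVED", 1)
    for round_number in range(1, max_rounds + 1):
        votes = [vote_fn(p, round_number) for p in providers]
        # Each completed round acts as a barrier: every provider has voted.
        if "REJECT" in votes:
            return ("REJECTED", round_number + 1)
        if "CONTINUE" not in votes:
            return ("APPROVED", round_number + 1)
    raise RuntimeError("protocol did not finish within max_rounds")
```

In this model, a protocol that ends after k voting rounds has k + 1 phases, which matches the n-1 rounds of barrier synchronization in an n-phase protocol.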

If any provider is notified that the state change is approved, the GSAPI guarantees that all (non-failed) providers and subscribers are notified of the approved state change without regard to failures within the system.

One-phase and n-phase changes

It is the responsibility of the providers in a group to determine the level of consistency that is required for managing changes to the group membership and state value. As described previously, providers may use either one-phase or n-phase protocols. In all cases, all providers see all protocols in the same order. However, the level of consistency differs in an important way, as follows.

Assume that two proposals occur rapidly one after the other.

For one-phase protocols, although all providers see both protocols in the same order, some providers may see both the first and the second protocol before another provider has seen the first. This leads to a loosely synchronous consistency level, because the providers loosely catch up to each other in seeing the "latest and greatest" group state.
For n-phase protocols, the group state is managed in a strongly-consistent manner. Because an n-phase protocol forces all participating providers to submit votes, that is, to reach the barrier synchronization points, no provider can see the second protocol before all have seen and reacted to the first.

Subscribers have no choice but to receive the notifications of approved group changes in a loosely synchronous manner. The GSAPI guarantees that all subscribers to a group see the approved changes in the same order as do the group's providers. However, one subscriber may see multiple notifications before another subscriber has seen any.
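
The ordering guarantee can be stated as a simple check: at any moment, the sequence of changes that one GS client has seen must agree with the sequence that another has seen, up to the length of the shorter one. The following Python sketch illustrates that property; the function name is hypothetical.

```python
from typing import Sequence


def consistent_views(seen_a: Sequence[str], seen_b: Sequence[str]) -> bool:
    """True if two clients have seen changes in the same order so far.

    One client may be ahead of the other (loosely synchronous), but the
    part that both have seen must be identical and in the same order.
    """
    shorter = min(len(seen_a), len(seen_b))
    return list(seen_a[:shorter]) == list(seen_b[:shorter])
```

For example, one subscriber that has seen two changes and another that has seen none are still consistent, whereas two clients that saw the same changes in different orders are not.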

Figure 1. A One-Phase Protocol

Figure 2. A Two-Phase Commit Protocol

Figure 3. Barrier Synchronization in a Multi-Phase Protocol

Active protocol proposals

The GSAPI guarantees that only one protocol that affects the group's membership or state value is run at any time. If more than one proposal is submitted within the group simultaneously, the Group Services subsystem chooses one for invocation and returns the others to the providers that submitted them. It is the responsibility of a provider that receives a returned proposal to resubmit it for invocation, if appropriate.

When providers join or involuntarily leave a group, this processing is modified. In these cases, the membership protocol to deal with the join or involuntary leave request is held until the currently running protocol has been approved or rejected. The membership change protocol is then started immediately. Figure 4 shows how a new provider join request is delayed until the completion of an ongoing three-phase state change protocol.
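
The serialization rules can be modeled with a small class. This Python sketch is an illustration of the behavior described above, not of the Group Services implementation; the class name, method names, and proposal labels are hypothetical.

```python
from collections import deque


class GroupProtocolQueue:
    """Models one group in which at most one protocol runs at a time."""

    def __init__(self) -> None:
        self.running = None
        self.held = deque()

    def propose(self, proposal: str, kind: str = "state") -> str:
        """Return 'started', 'held', or 'returned' for a new proposal."""
        if self.running is None:
            self.running = proposal
            return "started"
        if kind in ("join", "failure_leave"):
            # Joins and involuntary leaves wait for the running protocol.
            self.held.append(proposal)
            return "held"
        # Any other proposal goes back to the provider that submitted it.
        return "returned"

    def complete(self) -> str:
        """Finish the running protocol and start the next held one, if any."""
        finished = self.running
        self.running = self.held.popleft() if self.held else None
        return finished
```

A provider whose proposal is returned must decide for itself whether to resubmit it after the running protocol completes.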

Figure 4. The Serialization of a Pending Join Request

Failures

When a node fails, Group Services assumes that all providers on that node have also failed. The GSAPI supports process failure detection by detecting the loss of a socket connection.

When a provider leaves due to either a node failure or the failure of the process itself, Group Services proposes a failure leave protocol for that provider. If the group had been using a one-phase protocol to handle joins, the failure leave is specified as one-phase. If the join had been n-phase, the failure leave is specified as n-phase.

As long as one provider is active, Group Services continues to keep the group going.

The Group Services subsystem itself has been designed to survive failures. These can be node failures, which lead to the loss of one or more Group Services processes, or network failures and communications adapter failures, which hinder the communication between Group Services processes.

If the Group Services subsystem fails, any surviving GS client receives an announcement notification that the Group Services daemon has terminated suddenly and unexpectedly. The GS client can get the FFDC ID (First Failure Data Capture identifier) that is related to the cause of the Group Services subsystem failure when the FFDC ID exists. In addition, if a protocol is running, it is terminated. See Deactivate-on-failure handling and the FFDC Programming Guide and Reference for further information.

Provider actions during voting

As shown in Figure 5, a provider can perform any sequence of actions that it chooses between the time that it receives an ongoing protocol notification (that is, a request for a vote) and the time that it votes. However:

Providers should submit their votes within the voting time limit. When a provider is asked to vote on a proposed change, the proposal may include a time limit within which the vote must be submitted. The time limit includes any message delays. If any provider fails to submit its vote in time, the Group Services subsystem applies the group's current default vote in lieu of that provider's vote. The Group Services subsystem supplies a list of the providers that failed to vote in time to the other providers.
Providers should wait until a running protocol has completed before submitting a new proposal. During the invocation of any protocol, no provider is allowed to submit another protocol proposal. The Group Services subsystem simply returns an error code to the provider if it tries to do so and ignores the new proposal. The provider must wait until the running protocol has completed before it resubmits the proposal.
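
The handling of late votes can be sketched as follows. This Python function is illustrative; its name and parameters are hypothetical. It takes the votes that arrived within the time limit and fills in the group's default vote for every provider that did not vote in time.

```python
from typing import Dict, List, Tuple


def tally_with_default(providers: List[str],
                       votes_received: Dict[str, str],
                       default_vote: str) -> Tuple[Dict[str, str], List[str]]:
    """Return (votes_for_all_providers, providers_that_missed_the_limit)."""
    # Providers with no vote by the time limit get the default vote.
    late = [p for p in providers if p not in votes_received]
    votes = {p: votes_received.get(p, default_vote) for p in providers}
    return votes, late
```

The second return value corresponds to the list of providers that failed to vote in time, which is supplied to the other providers.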

Figure 5. Actions during a Membership Change Protocol

Subscribing to a group

When it is desirable for a process to monitor a group without playing a part in the control of the group's state, the GSAPI allows the process to subscribe to the group. A subscriber may subscribe to receive approved membership changes, approved state changes, or both.

Subscribers do not participate in any of the voting protocols. In fact, they are not notified that any such activity is taking place. If a state or membership change is not approved, no notification is sent to the subscribers.

Notifications to all subscribers to any single group are serialized, so all subscribers receive all notifications in the same order. However, it is not guaranteed that all subscribers will receive any one notification before any other subscribers receive subsequent notifications. No notifications or protocol proposals are made when subscribers join or leave a group.

Subscription allows one group to maintain a loose synchronization with one or more other groups. For example, a subscriber could be used to monitor and display information about the state of a number of groups and the members of those groups. As the subscribed-to groups change state or membership, the monitor can collect the changes and display or log the updated information.

Provider and subscriber tokens

The GSAPI uses integers that are called provider and subscriber tokens to identify providers and subscribers. These tokens are assigned, invalidated, and reassigned in much the same way that file descriptors are assigned, invalidated, and reassigned as files are opened and closed.

For example, a GS client joins group foo and receives provider token 0. When the same client leaves group foo, Group Services invalidates provider token 0 and makes it available for reassignment. When the next GS client (which could be the same GS client or a different GS client) joins the next group (which could be the same group or a different group), Group Services assigns provider token 0 to that client.

As another example, a GS client subscribes to group bar and receives subscriber token 2. When the same client unsubscribes from group bar or becomes unsubscribed because group bar is dissolved, Group Services invalidates subscriber token 2 and makes it available for reassignment. When the next GS client (which could be the same GS client or a different GS client) subscribes to the next group (which could be the same group or a different group), Group Services assigns subscriber token 2 to that client.
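
The file descriptor analogy can be illustrated with a small token table. This Python sketch assumes that the lowest available token is reused first, as is the case for file descriptors; this section does not state the actual assignment policy, so that choice is an assumption. The class and method names are hypothetical.

```python
class TokenTable:
    """Assigns integer tokens and reuses them after invalidation."""

    def __init__(self) -> None:
        self._in_use = set()

    def assign(self) -> int:
        # ASSUMPTION: pick the lowest token not currently in use.
        token = 0
        while token in self._in_use:
            token += 1
        self._in_use.add(token)
        return token

    def invalidate(self, token: int) -> None:
        # The token becomes available for reassignment.
        self._in_use.remove(token)
```

Because a token value can be reassigned, a GS client should not keep using a token after the corresponding join or subscription has ended.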

Source-target group relationships

It is sometimes convenient to associate several groups with a single application, and to allow a process to be a member of multiple groups. Such relationships are not normally tracked by Group Services, except when the source-target facility is used. To understand this facility, consider the following scenario.

If a node crashes, all of the groups with providers on that node receive a membership change proposal notification simultaneously. The notification causes each group to begin reacting independently to the membership change. However, it may be better for some applications to wait until another group has completed processing this change. Such a relationship might exist, for example, between a disk recovery subsystem and a distributed database application. If the database is on a disk on the failed node, the database application must wait for the disk recovery subsystem to recover from the node failure before it can begin its recovery.

Although it is possible to deal with such relationships using subscriptions, they are loosely synchronized and may not provide the degree of timing control that is required. Instead, the source-target facility can be used. The source-target facility allows a target group to tie itself to a source group as follows. If a failure leads to the failure of a provider in both the source and the target groups, the source group completes its membership change protocol before the target group begins its membership change protocol. Thus, the providers in the target group can run with the knowledge that the providers in the source group have already handled the failure. This knowledge is particularly useful when the recovery of the target group depends on the completion of recovery by the source group.

In the recovery scenario just described, the disk recovery subsystem is defined as the source group, and the database application is defined as the target group.

With source-target groups, join and leave protocols work a little differently than with other groups. Here are some key differences.

A group defines itself as a target-group by listing a source-group name in the set of group attributes specified on the ha_gs_join subroutine by each target-group provider. A source-group is not notified that it has been sourced by any groups.
For every node on which a target-group provider wants to run, there must exist a source-group provider.
If there is no source-group provider on a node, a potential target-group provider is not allowed to join the target group, and no membership change is proposed. The GS client attempting to join the target-group receives an asynchronous return code that indicates that there is no source-group provider active on this node.
If there is a source-group provider on a node, a potential target-group provider is allowed to join the target-group in the normal manner, that is, by a membership change proposal to the target-group.
There may be multiple source-group or target-group providers on a node. A source-group may have any number of target-groups. A target-group may have only one source group.
If the last remaining source-group provider on a node leaves the source-group, voluntarily or involuntarily, all of the target-group providers on that node must leave the target-group. The source-group processes the leaves as a normal membership change proposal.
Once the source-group has approved the leave protocol, a membership change is proposed to the target-group as a cast-out of the affected providers from the target-group. As a failure leave protocol, the cast-out protocol cannot be rejected. As part of the notification initiating this target-group membership change, the target-group receives the source-group's state value. If there is no target-group provider on that node, no notification is sent to the target-group providers.
The providers that are being cast out receive a notification that they have been cast out of the group. They do not otherwise participate in the cast-out protocol.
If a target-group is running a protocol, and a source-group provider process fails on a node that also contains a target-group provider, the source-group runs a failure leave protocol.
In this case, only the process of the source-group provider has failed, not the node on which it is running. Because the target-group provider process still exists, the target-group protocol could continue. However, once the source-group completes its leave protocol, the target-group provider may no longer validly belong to the target-group.
As a result, the Group Services subsystem considers the target-group providers that will be cast-out as having failed during the protocol, and treats them as follows:
- If the target-group's default vote is REJECT, the protocol is rejected, and the Group Services subsystem initiates a cast-out protocol.
- If the default vote is APPROVE, the protocol is approved or, if a provider votes CONTINUE, the protocol continues.
- If the protocol continues, the failed target-group providers are no longer allowed to participate. Instead, the default vote (APPROVE in this case) is registered for them for each voting phase.
Whatever the outcome of the target-group's running protocol, once it ends, the Group Services subsystem immediately initiates a cast-out protocol for the target-group.
When a source-group leave protocol prevents the last target-group providers from invoking protocols, those providers are given a cast-out final notification and the target-group is, in effect, dissolved.
If the node itself fails, rather than only the last source-group provider on the node, the failure is handled in the same way as if the source-group provider itself had failed. The source-group completes its protocol before the target-group is notified. In this case, the target-group receives a cast-out protocol, rather than a failure leave protocol.
If a source-group changes its state value during protocols that do not result in a target-group cast-out, its associated target-groups receive the committed state value. Examples of such protocols are a state value change protocol and a voting response during any other n-phase protocol.
The notification appears to the target-group as a source-state reflection protocol. Values specified in the group attributes of the target-group control the number of phases and a voting time limit. The target-group treats this as a normal protocol and takes whatever actions are required.
If the target-group is running a protocol when a source-group state value change is ready to be reflected, the running protocol continues normally, and the source-state reflection protocol is queued, to be initiated later when the running protocol completes.
If a subsequent source-group state value change appears, only the most recent one is reflected to the target-group, and the earlier change is simply dropped. In addition, if a cast-out is necessary, and a source-state reflection protocol is queued, the queued protocol is dropped, because the cast-out protocol reflects the most recent source-group state value.
Because a source-state reflection protocol is initiated by Group Services, it is always initiated before any pending provider-initiated protocols for the group. In addition, there is no interface for a provider to request this protocol. It is automatically initiated as a consequence of a source-group's state value change.
The notification for any cast-out protocol in a target-group includes the source-group's current state value.
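
The join and cast-out rules for source-target groups can be modeled per node. The following Python sketch is an illustration of those two rules only and omits voting, state values, and source-state reflection. The class name, method names, and provider names are hypothetical.

```python
from collections import defaultdict
from typing import List


class SourceTargetModel:
    """Tracks source-group and target-group providers on each node."""

    def __init__(self) -> None:
        self.source = defaultdict(set)   # node -> source-group providers
        self.target = defaultdict(set)   # node -> target-group providers

    def join_source(self, node: str, provider: str) -> None:
        self.source[node].add(provider)

    def join_target(self, node: str, provider: str) -> bool:
        """A target-group join needs a source-group provider on the node."""
        if not self.source[node]:
            return False     # join refused; no membership change proposed
        self.target[node].add(provider)
        return True

    def leave_source(self, node: str, provider: str) -> List[str]:
        """Return the target-group providers that must be cast out."""
        self.source[node].discard(provider)
        if self.source[node]:
            return []        # another source-group provider remains
        cast_out = sorted(self.target[node])
        self.target[node].clear()
        return cast_out
```

In this model, the cast-out list is produced only when the last source-group provider on a node leaves, which mirrors the ordering in which the source-group finishes its leave protocol before the target-group is told about the cast-out.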

Host and adapter membership groups

The Group Services subsystem provides several system-defined groups to which GS clients can subscribe for keeping track of hardware status. Refer to Description for a complete listing of the different system-defined groups that are available for subscription.

The host membership group

The Group Services subsystem keeps track of node status to determine when nodes are no longer reachable. A node that is fully isolated due to network or communications adapter failures is not distinguishable from a node that has failed. Accordingly, a fully isolated or failed node triggers such actions as notifications to groups that have one or more providers on the failed node or nodes.

The state of the nodes is reflected by the Group Services subsystem in a special system-defined group called the host membership group. The host membership group is represented by HA_GS_HOST_MEMBERSHIP. By subscribing to this group, a GS client can obtain information about the nodes that are currently active and any transitions that occur as nodes become active or fail.

A node appears active in the host membership group when the Group Services subsystem is active on that node. All such active nodes that can communicate with each other appear in this group.

The adapter membership groups

The Group Services subsystem also keeps track of the status of Ethernet and SP Switch adapters. The state of the adapters is reflected by the Group Services subsystem in two system-defined groups called the Ethernet adapter membership group and the SP Switch adapter membership group. By subscribing to these groups, a GS client can obtain adapter membership information, which it can use, for example, to determine communication paths to nodes.

The view of adapter membership implies that all of the nodes in the membership are able to communicate with each other over the IP network to which the adapters are connected. On the SP, this means that for Ethernet membership (HA_GS_ENET_MEMBERSHIP), the view from any one node is the set of all other nodes reachable from that node via the SP Ethernet. If the SP Ethernet is sundered (broken such that only subsets of nodes can communicate with each other), a node's view of Ethernet membership will be only the subset of nodes with which it can communicate.

Within the Group Services PSSP domain, adapter membership information is available for only a single Ethernet adapter on the SP control workstation, even if multiple Ethernet adapters are connected to the SP nodes.

For the SP Switch (HA_GS_CSS_MEMBERSHIP), the hardware does not allow sundering; if a node is on the switch, it can communicate with any other node on the switch. Thus, a local node's view of cssMembership is the global view of all nodes that are on the switch. A node must look at the given membership to determine if it is listed.

If HACMP/ES is installed on a node, and a GS client is connected to the Group Services HACMP/ES domain, there may be more adapter membership groups than simply the two described above. This is because heartbeating may take place on additional networks for HACMP/ES if the networks are installed and are defined to HACMP/ES. In this case, the semantics for these adapter membership groups match those described for the Ethernet membership group. When a GS client receives a subscription notification for an adapter membership group, the notification indicates the set of nodes with which this node can communicate across the given adapter type.

In summary, the information presented to a GS client for adapter membership subscriptions is not globally consistent (except for the HA_GS_CSS_MEMBERSHIP group). Because each node is presented with the set of nodes with which it can communicate for the given adapter type, different nodes may see different views. This can occur if the different networks are not fully connected to all nodes in the domain, or if there are failures of various routers between sections of the networks.
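
The per-node nature of adapter membership can be illustrated by computing, for each node, the set of nodes it can reach over a given network. This Python sketch is a model only. It treats the network as a set of working links between nodes and includes each node in its own view, which is a modeling choice. The function name and node names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, Set, Tuple


def membership_views(nodes: Iterable[str],
                     links: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Return, for each node, the set of nodes it can communicate with."""
    adjacency: Dict[str, Set[str]] = {n: set() for n in nodes}
    for a, b in links:
        adjacency[a].add(b)
        adjacency[b].add(a)
    views: Dict[str, Set[str]] = {}
    for start in adjacency:
        # Breadth-first search over the working links from this node.
        seen = {start}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    queue.append(neighbor)
        views[start] = seen
    return views
```

With a sundered network, nodes on different sides of the split get different views, while a fully connected network gives every node the same view.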

For more information, see ha_gs_subscribe subroutine.

Quorum

Many applications require a form of quorum to ensure that the proper resources are available before the application begins operation. For example, one application may require a certain percentage of nodes to be up and running before it begins, while another requires particular nodes.

Because groups have significantly different requirements for quorum, the GSAPI does not provide a predefined quorum as part of its support. It is the responsibility of the application that is using the GSAPI to form groups that define and implement required quorum mechanisms. By manipulating the state information of the group, an application can build the required quorum mechanism.
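
As an illustration of the two kinds of quorum mentioned above, the following Python sketch shows checks that an application could apply to the membership information it receives. The GSAPI does not provide these functions; the names and the percentage threshold are hypothetical.

```python
from typing import Iterable


def has_percentage_quorum(active_nodes: Iterable[str],
                          total_nodes: int,
                          required_percent: int) -> bool:
    """True if at least required_percent of all nodes are active."""
    active_count = len(set(active_nodes))
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return 100 * active_count >= required_percent * total_nodes


def has_required_nodes(active_nodes: Iterable[str],
                       required_nodes: Iterable[str]) -> bool:
    """True if every particular node the application needs is active."""
    return set(required_nodes) <= set(active_nodes)
```

An application could record the result of such a check in the group's state value so that all providers act on the same decision.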

Sundered networks

The Group Services subsystem provides a single group namespace within each system partition. Given the right set of multiple network failures, a system partition with multiple networks can become split. In the case of a sundered namespace, the nodes become split in such a way that they can no longer communicate with any nodes on the other side of the split. However, it is possible for each sundered portion to maintain enough information to reconstruct the groups that previously existed, at least those groups that still have members within any particular portion.

When a namespace is sundered, it is possible to get two instances of what should be one group. For example, in a sundered network, two nodes that own the two tails of a twin-tailed disk could end up on separate sides of the split. Because the process of the subsystem coordinating the disk on each node would believe that the other process had disappeared, each process might want to activate its tail, which could lead to data corruption. As this example shows, it is important that each group determine if it needs a form of quorum, and use it to guide when a group is ready to perform its services.

Although the Group Services subsystem does not provide a quorum mechanism, it does provide some assistance to groups when a network is sundered. When a system partition is sundered, the providers receive membership protocol proposals from Group Services that all of the providers on the other side of the split have failed. The providers can then run those protocols as they normally would, taking into account such factors as quorum to protect resources as necessary.

If a sundered network becomes healed and Group Services discovers separate domains, it dissolves the smaller domain, which is defined as the domain with the smaller number of nodes. Group Services sends an announcement notification that it has terminated abnormally to the clients on the smaller domain. Upon receipt of the notification, the clients on the smaller domain can join the larger domain or perform any other appropriate recovery action.
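
The choice of which domain is dissolved can be sketched as a comparison of node counts. This Python function is illustrative only and its name is hypothetical. This section does not say what happens when both domains have the same number of nodes, so the sketch returns `None` in that case rather than guessing.

```python
from typing import Optional, Set


def domain_to_dissolve(domain_a: Set[str],
                       domain_b: Set[str]) -> Optional[Set[str]]:
    """Return the domain with fewer nodes, or None if the sizes are equal."""
    if len(domain_a) == len(domain_b):
        # Tie handling is not described here, so no choice is made.
        return None
    return domain_a if len(domain_a) < len(domain_b) else domain_b
```

Clients on the nodes of the returned domain are the ones that receive the announcement notification and then rejoin or recover.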
