Internet-Draft                                       Grenville Armitage
                                                                Bellcore
                                                       August 13th, 1996

                  Redundant MARS architectures and SCSP

Status of this Memo

This document was submitted to the IETF Internetworking over NBMA (ION) WG. Publication of this document does not imply acceptance by the ION WG of any ideas expressed within. Comments should be submitted to the ion@nexen.com mailing list. Distribution of this memo is unlimited.

This memo is an internet draft. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts.

Internet Drafts are draft documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time. It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress". Please check the lid-abstracts.txt listing contained in the internet-drafts shadow directories on ds.internic.net (US East Coast), nic.nordu.net (Europe), ftp.isi.edu (US West Coast), or munnari.oz.au (Pacific Rim) to learn the current status of any Internet Draft.

Abstract

The Server Cache Synchronisation Protocol (SCSP) has been proposed as a general mechanism for synchronising the databases of NHRP Next Hop Servers (NHSs), MARSs, and MARS Multicast Servers (MCSs). All these entities are different parts of the IETF's ION solution. This document attempts to identify a range of distributed MARS scenarios, highlight associated problems, and describe how SCSP may be applied. This document does not deal with redundant MCS scenarios.

1. Introduction.

SCSP [1] is being developed within the Internetworking over NBMA (ION) working group as a general solution for synchronizing distributed databases such as distributed Next Hop Servers [2] and MARSs [3]. This document attempts to identify the range of redundant MARS scenarios, describe the associated problems, and describe how SCSP may be applied. Distributed MCS scenarios are being looked at in other documents [4].

[Editors note: The MARS I-D referenced in [3] has been accepted for elevation to Proposed Standard, and will probably be released as an RFC during the lifetime of this I-D. It is not currently known what RFC number it will have, but the title will remain as cited.]

In the current MARS model a Cluster consists of a number of MARS Clients (IP/ATM interfaces in routers and/or hosts) utilizing the services of a single MARS. This MARS is responsible for tracking the IP group membership information across all Cluster members, and providing on-demand associations between IP multicast group identifiers (addresses) and multipoint ATM forwarding paths. It is also responsible for allocating Cluster Member IDs (CMIs) to Cluster members (inserted into outgoing data packets, to allow reflected packet detection when Multicast Servers are placed in the data path).

Two different, but significant, goals motivate the distribution of the MARS functionality across a number of physical entities:

Fault tolerance
   If a client discovers the MARS it is using has failed, it can switch to another MARS and continue operation where it left off.

Load sharing
   A logically single MARS is realized using a number of individual MARS entities. MARS Clients in a given Cluster are shared amongst the individual MARS entities.
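For reference in later sections, the following is a minimal sketch (in Python, purely illustrative) of the state a single Active MARS maintains for its Cluster. The class and field names are editorial inventions, not identifiers from [3]; this is a sketch of the idea only, not an implementation.

   # Illustrative sketch of single Active MARS state. Names are hypothetical.
   class MarsState:
       def __init__(self):
           self.members = {}        # ATM address -> Cluster Member ID (CMI)
           self.next_cmi = 1        # next CMI to hand out
           self.groups = {}         # IP group address -> set of member ATM addresses
           self.mcs_groups = {}     # IP group address -> set of registered MCS ATM addresses
           self.redirect_map = []   # ordered list advertised in MARS_REDIRECT_MAP
           self.csn = 0             # Cluster Sequence Number for ClusterControlVC

       def register(self, atm_addr):
           # Allocate a CMI on first registration, re-use it on re-registration.
           if atm_addr not in self.members:
               self.members[atm_addr] = self.next_cmi
               self.next_cmi += 1
           return self.members[atm_addr]

       def join(self, atm_addr, group):
           # Record a MARS_JOIN; the real MARS also propagates it on
           # ClusterControlVC, which is why the CSN is incremented here.
           self.groups.setdefault(group, set()).add(atm_addr)
           self.csn += 1

       def resolve(self, group):
           # MARS_REQUEST handling: the ATM addresses of all current group members.
           return sorted(self.groups.get(group, set()))

Distributing the MARS amounts to keeping copies of this kind of state consistent across several entities; the rest of this document is concerned with when and how that consistency must be maintained.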
A general solution to Load Sharing may also provide Fault Tolerance to the MARS Clients. However, it is not necessarily true that methods for supporting Fault Tolerance will also support Load Sharing. Some additional terminology is required to describe the options. These reflect the differing relationships the MARSs have with each other and the Cluster members (clients). Armitage Expires February 13th, 1997 [Page 2] Internet Draft August 13th, 1996 Fault tolerant model: Active MARS The single MARS serving the Cluster. It allocates CMIs and tracks group membership changes by itself. It is the sole entity that constructs replies to MARS_REQUESTs. Backup MARS An additional MARS that tracks the information being generated by the Active MARS. Cluster members may re-register with a Backup MARS if the Active MARS fails, and they'll assume the Backup has sufficient up to date knowledge of the Cluster's state to take the role of Active MARS. Living Group The set of Active MARS and current Backup MARS entities. When a MARS entity dies it falls out of the Living Group. When it restarts, it rejoins the Living Group. Election of the Active MARS takes place amongst the members of the Living Group. MARS Group The total set of MARS entities configured to be part of the distributed MARS. This is the combination of the Living Group and 'dead' MARS entities that may be currently dying, dead, or restarting. The list is constructed in the following order {Active MARS, Backup MARS, ... Backup MARS, dead MARS,.... dead MARS}. If there are no 'dead' MARS entities, the MARS Group and Living Group are identical. Load sharing model: Active Sub-MARS Each simultaneously active MARS entity forming part of a distributed MARS is an Active Sub-MARS. Each Active Sub-MARS must create the impression that it performs all the operations of a single Active MARS - allocating CMIs and tracking group membership information within the Cluster. MARS_REQUESTs sent to a single Active Sub-MARS return information covering the entire Cluster. Active Sub-MARS Group The set of Active Sub-MARS entities that are currently representing the co-ordinated distributed Armitage Expires February 13th, 1997 [Page 3] Internet Draft August 13th, 1996 MARS for the Cluster. Cluster members are distributed amongst all the members of the Active Sub-MARS Group. Backup Sub-MARS A MARS entity that tracks the activities of an Active Sub-MARS, and is able to become a member of the Active Sub-MARS group when failure occurs. MARS Group The set of Active Sub-MARS and Backup Sub-MARS entities. When a MARS entity dies it falls out of the MARS Group. When it restarts, it rejoins the MARS Group. Election of the Active Sub-MARS entities takes place amongst the members of the MARS Group. It must be noted that Load Sharing does NOT involve complete distribution of processing load and database size amongst the members of the Active Sub-MARS Group. This is discussed further in section 4. The rest of this document looks at a variety of different failure and load sharing scenarios, and describes what is expected of the various MARS entities. This is then used to develop an SCSP based solution. Section 2 begins by reviewing the existing Client interface to the MARS. Sections 3 takes a closer look at the problems faced by the Fault Tolerant service. Section 4 expands on this to include the additional demands of Load Sharing. 
Section 5 describes how the SCSP service can be applied to these scenarios, and provides suggested packet formats for an SCSP based distributed MARS implementation. 2. MARS Client expectations. MARS Clients (and Multicast Servers) expect only one MARS to be currently Active, with zero or more Backup MARS entities available in case a failure of the Active MARS is detected. From their perspective the Active MARS is the target of their registrations, MARS_REQUESTs, and group membership changes. The Active MARS is the source of group membership change information for other Cluster members, and MARS_REDIRECT_MAP messages listing the currently available Backup MARS entities. A MARS client will act as though: MARS_REQUESTs to the Active MARS return Cluster-wide information. MARS_JOINs and MARS_LEAVEs sent to the Active MARS have Cluster- wide impact when necessary. Armitage Expires February 13th, 1997 [Page 4] Internet Draft August 13th, 1996 MARS_JOINs and MARS_LEAVEs received from the Active MARS represent Cluster-wide activity. The MARS entities listed in received MARS_REDIRECT_MAP messages are legitimate Backup MARS entities for the Cluster. MARS Clients have a specific behavior during MARS failure (Section 5.4 of [3]). When a MARS Client detects a failure of its MARS, it steps to the next member of the Backup MARS list (from the most recent MARS_REDIRECT_MAP) and attempts to re-register. If the re- registration fails, the process repeats until a functional MARS is found. Sections 5.4.1 and 5.4.2 of [3] describe how a MARS Client, after successfully re-registering with a MARS, re-issues all the MARS_JOIN messages that it had sent to its previous MARS. This causes the new MARS to build a group membership database reflecting that of the failed MARS immediately prior to its failure. (This behaviour is required for the case where there is only one MARS available and it suffers a crash/reboot cycle.) The MARS Clients behave like a distributed cache 'memory', imposing their group membership state onto the newly restarted MARS. (It is worth noting that the MARS itself will propagate MARS_JOINs out on ClusterControlVC for each group re-joined by a MARS Client. Other MARS Clients will treat the new MARS_JOINs as redundant information - if they already have a pt-mpt VC out to a given group, the re-joining group member will already be a leaf node.) An alternative use of the MARS_REDIRECT_MAP message is also provided - forcing Clients to shift from one MARS to another even when failure has not occurred. This is achieved when a Client receives a MARS_REDIRECT_MAP message where the first listed MARS address is not the same as the address of the MARS it is currently using. The client then uses bit 7 of the mar$redirf flag to control whether a 'hard' or 'soft' redirect will be performed. If the bit is reset, a 'soft' redirect occurs which does not include re-joining all groups. (In contrast, a client re-registering after actual MARS failure performs a 'hard' redirect.) The current MARS specification is not clear on how MARS Clients should handle the Cluster Sequence Number (CSN). In Section 5.4.1.2 of [3] it says: "When a new cluster member starts up it should initialise HSN to zero. When the cluster member sends the MARS_JOIN to register (described later), the HSN will be correctly updated to the current CSN value when the endpoint receives the copy of its Armitage Expires February 13th, 1997 [Page 5] Internet Draft August 13th, 1996 MARS_JOIN back from the MARS." 
(The HSN - Host Sequence Number - is the MARS Client's own opinion of what the last seen CSN value was. Revalidation is triggered if the CSN exceeds HSN + 1.)

Although the text in [3] is not explicit, a MARS Client MUST reset its own HSN to the CSN value carried in the registration MARS_JOIN returned by the new MARS (section 5.2.3 [3]). The reason for this is as follows:

CSN increments occur every time a message is transmitted on ClusterControlVC. This can occur very rapidly. It may not be reasonable to keep the Backup MARS entities up to date with the CSN from the Active MARS, considering how much inter-MARS SCSP traffic this would imply.

If the HSN is not updated using the Backup MARS's CSN, and the Backup's CSN is lower than the client's HSN, no warnings are given. This opens a window of opportunity for cluster members to lose messages on the new ClusterControlVC and not detect the losses.

It is not a major issue if the HSN is updated using the Backup MARS's CSN, and the Backup's CSN was higher than the client's original HSN. This should only occur when the MARS Client is doing a hard-redirect or re-registration after MARS failure, in which case complete group revalidation must occur anyway (section 5.4.1 [3]). A soft-redirect is reserved only for those cases when the Backup MARS is known to be fully synchronised with the Active MARS.

3. Architectures for Fault Tolerance.

This section looks at the possible situations that will be faced by a Fault Tolerant distributed MARS model. No attempt will be made to consider the more general goals of Load Sharing amongst the distributed MARS entities. Specific SCSP based mechanisms to achieve the identified requirements are left to section 5.

The following initial Cluster arrangement will be assumed for all examples:

      C1   C2   C3
       |    |    |
      ------------- M1 ------------
                     |
                  M2--M3--M4

The MARS Group for this Cluster is {M1, M2, M3, M4}. Initially the Living Group is equivalent to the MARS Group. The Cluster members (C1, C2, and C3) begin by using M1 as the Active MARS for the Cluster. M2, M3, and M4 are the Backup MARS entities. The Active MARS regularly transmits a MARS_REDIRECT_MAP on ClusterControlVC containing the members of the MARS Group (not the Living Group; this will be discussed in section 3.5). In this example M1 transmits a MARS_REDIRECT_MAP specifying {M1, M2, M3, M4}. Communication between M1, M2, M3, and M4 (to co-ordinate their roles as Active and Backup MARS entities) is completely independent of the communication between M1 and C1, C2, and C3. (The lines represent associations, rather than actual VCs. M1 has pt-pt VCs between itself and the cluster members, in addition to ClusterControlVC spanning out to the cluster members.)

3.1 Initial failure of an Active MARS.

Assume the initial Cluster configuration is functioning properly. Now assume some failure mode kills M1 without affecting the set of Backup MARS entities. Each Cluster member re-registers with M2 (the next MARS in the MARS_REDIRECT_MAP list), leaving the rebuilt cluster looking like this:

      C1   C2   C3
       |    |    |
      ------------- M2 ------------
                     |
                   M3--M4

As noted in section 2, re-registering with M2 involves each cluster member re-issuing its outstanding MARS_JOINs to M2. This will occur whether or not M2 had prior knowledge of the group membership database in M1. The Living Group is now {M2, M3, M4}. In the immediate aftermath of the cluster's rebuilding, M2 must behave as the Active MARS.
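The rebuilding described above is driven entirely by the MARS Client behaviour of section 5.4 of [3]. The following is a minimal client-side sketch of that failover procedure, in Python and purely for illustration; 'try_register' and 'reissue_join' stand in for the real MARS control message exchanges and are not names from any specification.

   # Illustrative sketch of MARS Client failover; not a normative algorithm.
   def fail_over(redirect_map, joined_groups, try_register, reissue_join):
       """Walk the last received MARS_REDIRECT_MAP until some MARS accepts us.

       try_register(mars) performs the registration MARS_JOIN exchange and
       returns the CSN carried in the returned MARS_JOIN, or None on failure.
       """
       while True:
           for mars in redirect_map:
               csn = try_register(mars)
               if csn is None:
                   continue                     # this MARS is dead, step to the next one
               for group in joined_groups:
                   reissue_join(mars, group)    # hard redirect: re-join every group
               return mars, csn                 # caller resets its HSN to this CSN
           # Every listed MARS failed; keep cycling through the list (section 3.2).

On success the client adopts the returned CSN as its new HSN, as argued above, and the newly chosen MARS rebuilds its group membership database from the re-issued MARS_JOINs.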
Behaving as the Active MARS includes transmitting a new version of MARS_REDIRECT_MAP that lists the re-ordered MARS Group, {M2, M3, M4, M1}.

3.2 Failure of the Active MARS and a Backup MARS.

In the scenario of section 3.1 it is possible that M2 was also affected by the condition that caused M1 to fail. In such a situation, the MARS clients would have tried to re-register with M3, then M4, then cycled back to M1. This sequence would repeat until one of the set {M1, M2, M3, M4} allowed the clients to re-register. (Although the Living Group is now {M3, M4}, the MARS Clients are not aware of this and will cycle through the list of MARS entities they last received in a MARS_REDIRECT_MAP.)

There is a potential here for the MARS Clients to end up re-registering with different MARS entities. Consider what might occur if M2's failure is transient. C1, C2, and C3 may not necessarily attempt to re-register at exactly the same time. If C1 makes the first attempt and discovers M2 is not responding, it will shift to M3. If C2 and C3 attempt to re-register with M2 a short time later, and M2 responds, we end up with the following cluster arrangement:

      C1       C2   C3
       |        |    |
       M3       M2 ------------
       |        |
       ----- M4 ------

Obviously M2 and M3 cannot both behave as the Active MARS, because they are attached to only a subset of the Cluster's members. The solution is for members of a Living Group to elect and enforce their own notion of who the Active MARS should be. This must occur whenever a current member dies, or a new member joins the Living Group. This can be utilized as follows:

   The elected Active MARS builds an appropriate MARS Group list to transmit in MARS_REDIRECT_MAPs. The elected Active MARS will be listed first in the MARS_REDIRECT_MAP.

   The Backup MARS entities obtain copies of this MARS_REDIRECT_MAP.

   Clients that attempt to register with a Backup MARS will temporarily succeed. However, the Backup MARS will immediately issue its MARS_REDIRECT_MAP (with bit 7 of the mar$redirf flag set). Receipt of this MARS_REDIRECT_MAP causes the client to perform a hard-redirect back to the indicated Active MARS.

Two election procedures would be triggered when M2's transient failure caused it to leave and then rejoin the Living Group. Depending on how the election procedure is defined, the scenario described above could have resulted in C1 shifting back to M2 (if M2 was re-elected Active MARS), or C2 and C3 being told to move on to M3 (if M3 retained its position as Active MARS, attained when M2 originally failed).

3.3 Tracking Cluster Member IDs during re-registration.

One piece of information that is not supplied by cluster members during re-registration/re-joining is their Cluster Member ID (CMI) - this must be supplied by the new Active MARS. It is highly desirable that when a cluster member re-registers with M2 it be assigned the same CMI that it obtained from M1. To ensure this, the Active MARS MUST ensure that the Backup MARSs are aware of the ATM addresses and CMIs of every cluster member.

This requirement stems from the use of CMIs in multicast data AAL_SDUs for reflected packet detection. During the transition from M1 to M2, some cluster members may transition earlier than others. If they are assigned the same CMI as a pre-transition cluster member to whom they are currently sending IP packets, the recipient will discard these packets as though they were reflections from an MCS.
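To make the failure mode concrete, here is an illustrative sketch (Python) of the receive-side check that makes a CMI clash harmful. It assumes, as described in [3], that each multicast data AAL_SDU carries the sender's CMI; the function name is hypothetical.

   # Illustrative only: a cluster member drops any packet whose encapsulated
   # CMI matches its own CMI, treating it as its own traffic reflected by an MCS.
   def accept_packet(my_cmi, packet_cmi):
       """Return True if the AAL_SDU should be passed up to the IP layer."""
       return packet_cmi != my_cmi

   # If a member that re-registered early with M2 is handed CMI 5, while a
   # member still using M1 also holds CMI 5 and is sending to it, that
   # traffic is silently discarded until the clash disappears.
   print(accept_packet(my_cmi=5, packet_cmi=5))   # False: packet wrongly dropped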
In the absence of a CMI tracking scheme, the problem would correct itself once all cluster members had transitioned to M2. However, it is preferable to avoid this interval completely, since there is little reason for MARS failures to interrupt on-going data paths between cluster members. 3.4 Re-introducing failed MARS entities into the Cluster. As noted in Section 3.2, the Living Group must have a mechanism for electing and enforcing its choice of Active MARS. A byproduct of this election must be to prioritize the Backup MARS entities, so that the Active MARS can issue useful MARS_REDIRECT_MAP messages. While MARS Clients only react when the Active MARS dies, the Living Group must react when anyone of its members dies. Conversely, when a new member joins the Living Group (presumably a previously dead MARS that has been restarted), a decision needs to be made about what role the new member plays in the Living Group. Two possibilities exist: The new member becomes a Backup MARS, and is listed by the Active MARS in subsequent MARS_REDIRECT_MAP messages. Armitage Expires February 13th, 1997 [Page 9] Internet Draft August 13th, 1996 The new member is immediately elected to take over as Active MARS. Simply adding a new Backup MARS causes no disruption to the Cluster. For example, if M1 restarted after the simple example in section 3.1, M2 (as the Active MARS) might continue to send {M2, M3, M4, M1} in its MARS_REDIRECT_MAP messages. Since M2 is still listed as the Active MARS, MARS Clients will take no further action. (If M1 had some characteristics that make it more desirable than M3 or M4, M2 might instead start sending {M2, M1, M3, M4}, but the immediate effect would be the same.) However, it is possible that M1 has characteristics that make it preferable to any of the other Living Group members whenever it is available. (This might include throughput, attachment point in the ATM network, fundamental reliability of the underlying hardware, etc.) Ideally, once M1 has recovered it is immediately re-elected to the position of Active MARS. This action does have the ability to temporarily disrupt MARS Clients, so it should be performed using the soft-redirect function (Section 5.4.3 of [3]). The soft-redirect avoids having each MARS Client re-join the multicast groups it was a member of (consequently, the new Active MARS must have synchronized its database with the previous Active MARS prior to the redirection). Using the example from Section 3.1 again, once M1 had rejoined the Living Group and synchronized with M2, M2 would stop sending MARS_REDIRECT_MAPs with {M2, M3, M4, M1} and start sending MARS_REDIRECT_MAPs with {M1, M2, M3, M4}. Bit 7 of the mar$redirf flag would be reset to indicate a soft redirect. Cluster members re-register with M1, and generate a lot less signaling traffic than would have been evident if a hard-redirect was used. Hard-redirects are used by Backup MARS entities to force wayward MARS Clients back to the elected Active MARS. 3.5 Sending the MARS Group in MARS_REDIRECT_MAP. It is important that MARS_REDIRECT_MAPs contain the entire MARS Group rather than just the Living Group. Whilst the dead MARS entities (if any) are obviously of no immediate benefit to a MARS Client, including them in the MARS_REDIRECT_MAP improves the chances of a Cluster recovering from a catastrophic failure of all MARS entities in the MARS Group. Consider what might happen if only the Living Group were listed in MARS_REDIRECT_MAP. 
As each Active MARS dies, the Living Group shrinks, and each MARS Client is updated with a smaller list of Armitage Expires February 13th, 1997 [Page 10] Internet Draft August 13th, 1996 Backup MARS entities to cycle through during the next MARS failure (as described in section 2). If the final MARS fails, the MARS Client is potentially left with a list of just one MARS entity to keep re- trying (the last Living Group advertised by the Active MARS). There is no way to predict that the final Active MARS to die will restart even if the rest of the MARS Group does. By listing the entire MARS Group we improve the chances of a MARS Client eventually finding a restarted MARS entity after the final MARS dies. Prioritizing the list in each MARS_REDIRECT_MAP, such that Backup MARS entities known to be alive are ahead of dead MARS entities, ensures this approach does not cause MARS Clients any problems while the Living Group has one or more members. 3.6 The impact of Multicast Servers. The majority of the analysis presented for MARS Clients applies to Multicast Servers (MCS) as well. They utilize the MARS in a parallel fashion to MARS Clients, and respond to MARS_REDIRECT_MAP (received over ServerControlVC) in the same way. In the same way that MARS Clients re-join their groups after a hard-redirect, MCSs also re- register (using MARS_MSERV) for groups that they are configured to support. However, the existence of MCS supported groups imposes a very important requirement on the Living Group. Consider what would happen if the Backup MARS M2 in section 3.1 had no knowledge of which groups were MCS supported immediately after the failure of M1. Active MARS fails. Cluster members and MCSs gradually detect the failure, and begin re-registering with the first available Backup MARS. Cluster members re-join all groups they were members of. As the Backup (now Active) MARS receives these MARS_JOINs it propagates them on its new ClusterControlVC. Simultaneously each MCS re-registers for all groups they were configured to support. If a MARS_MSERV arrives for a group that already has cluster members, the new Active MARS transmits an appropriate MARS_MIGRATE on its new ClusterControlVC. Assume that group X was MCS supported prior to M1's failure. Each Armitage Expires February 13th, 1997 [Page 11] Internet Draft August 13th, 1996 cluster member had a pt-mpt VC out to the MCS (a single leaf node). MARS failure occurs, and each cluster member re-registers with M2. The pt-mpt VC for group X is unchanged. Now cluster members begin re-issuing MARS_JOINs to M2. If the MCS for group X has not yet re- registered to support group X, M2 thinks the group is VC Mesh based, so it propagates the MARS_JOINs on ClusterControlVC. Other cluster members then update their pt-mpt VC for group X to add each 'new' leaf node. This results on cluster members forwarding their data packets to the MCS and some subset of the cluster members directly. This is not good. When the MCS finally re-registers to support group X, M2 will issue a MARS_MIGRATE. This fixes every cluster member's pt-mpt VC for group X, but the transient period is quite messy. If the entire Living Group is constantly aware of which groups are MCS supported, a newly elected Active MARS can take temporary action to avoid the scenario above. An obvious solution is for the new Active MARS to internally treat the groups as MCS supported even before the MCSs themselves have correctly re-registered. 
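In other words, the decision of whether to propagate a re-issued MARS_JOIN on the new ClusterControlVC can be keyed off the MCS knowledge inherited from the Living Group rather than off current MCS registrations. A minimal sketch of that decision, in illustrative Python (the names and structure are assumptions, not taken from [3]):

   # Illustrative filter applied by a newly elected Active MARS while MCSs
   # are still re-registering. 'inherited_mcs_groups' is the MCS supported
   # group map shared around the Living Group before the failure.
   def propagate_join_on_ccvc(group, inherited_mcs_groups):
       """Should a re-issued MARS_JOIN for 'group' go out on ClusterControlVC?"""
       if group in inherited_mcs_groups:
           # Treat the group as MCS supported even though its MCS has not yet
           # re-registered, so cluster members do not add each other as leaves.
           return False
       return True

   # Group X was MCS supported before M1 failed:
   print(propagate_join_on_ccvc("X", inherited_mcs_groups={"X"}))   # False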
The new Active MARS would suppress MARS_JOINs on ClusterControlVC for that group, just as though the MCS was actually registered. Ultimately the MCS would re-register, and operation continues normally.

[This issue needs further careful thought, especially to cover the situation where the MCS fails to re-register in time. Perhaps the new Active MARS fakes a MARS_LEAVE on ClusterControlVC for the MCS if it doesn't re-register in the appropriate time? In theory at least this would correctly force the group back to being VC Mesh based.]

3.7 Summary of requirements.

For the purely fault-tolerant model, the requirements are:

   Active MARS election from amongst the Living Group must be possible whenever a MARS Group member dies or restarts (sections 3.2 and 3.4).

   When a new Active MARS is elected, and there already exists an operational Active MARS, complete database synchronisation between the two is required before a soft-redirect is initiated by the current Active MARS (section 3.4).

   The Living Group's members must have an up to date map of the CMI allocation (section 3.3).

   The Living Group's members must have an up to date map of the MCS supported groups (section 3.6).

   The entire MARS Group is transmitted in MARS_REDIRECT_MAPs. The only change that occurs as entities die or Living Group elections occur is to the order in which MARS addresses are listed.

   No special additions are required to handle client requests (e.g. MARS_REQUEST or MARS_GROUPLIST_QUERY), since there is only a single Active MARS.

4. Architectures for Load Sharing.

The issue of Load Sharing is typically raised during discussions on the scaling limits for a Cluster [5]. Some of the 'loads' that are of interest in a MARS Cluster are:

   Number of MARS_JOIN/LEAVE messages handled per second by a given MARS.

   Number of MARS_REQUESTs handled per second by a given MARS.

   Size of the group membership database.

   Number of SVCs terminating on a given MARS entity from MARS Clients and MCSs.

   Number of SVCs traversing intervening ATM switches on their way to a MARS that is topologically distant from some or all of its MARS Clients and/or MCSs.

Before embarking on a general Load Sharing solution, it is wise to keep in mind that having more than one MARS entity available does not affect each of these loads equally.

We can assume that the average number of MARS_JOIN/LEAVE events within a Cluster will rise as the Cluster's membership level rises. The group membership state changes of all Cluster members must be propagated to all other Cluster members whenever they occur. Subdividing a Cluster among a number of Active Sub-MARSs does not change the fact that each Active Sub-MARS must track each and every MARS_JOIN/LEAVE event. The MARS_JOIN/LEAVE event load is therefore going to be effectively the same in each Active Sub-MARS as it would have been for a single Active MARS. (An 'event' between the Active Sub-MARSs is most likely an SCSP activity conveying the semantic equivalent of the MARS_JOIN/LEAVE.)

If each Active Sub-MARS has a complete view of the cluster's group membership, it can answer MARS_REQUESTs using locally held information. It is possible that the average MARS_REQUEST rate perceived by any one Active Sub-MARS would be lower than that perceived by a single Active MARS.
However, it is worth noting that steady-state MARS_REQUEST load is likely to be significantly lower than steady-state MARS_JOIN/LEAVE load anyway (since a MARS_REQUEST is only used when a source first establishes a pt-mpt VC to a group - subsequent group changes are propagated using MARS_JOIN/LEAVE events). Distributing this load may not be a sufficiently valuable goal to warrant to complexity of a Load Sharing distributed MARS solution. Partitioning the group membership database among the Active Sub-MARS entities would actually work against the reduction in MARS_REQUEST traffic per Active Sub-MARS. With a partitioned database each MARS_REQUEST received by an Active Sub-MARS would require a consequential query to the other members of the Active Sub-MARS Group. The nett effect would be to bring the total processing load for handling MARS_REQUEST events (per Active Sub-MARS) back up to the level that a single Active MARS would see. It would seem that a fully replicated database across the Active Sub-MARS Group is preferable. SVC limits at any given MARS are not actually as important as they might seem. A single Active MARS would terminate an SVC per MARS Client or MCS, and originate two pt-mpt SVCs (ClusterControlVC and ServerControlVC). It might be argued that if a MARS resides over an ATM interface that supports only X SVCs, then splitting the MARS into two Active Sub-MARS would allow approximately 2*X MARS Clients and/or MCSs (and so forth for 3, 4, ... N Active Sub-MARSs). However, consider the wider context (discussed in [5]). If your ATM NICs are technologically limited to X SVCs, then the MARS Clients and MCSs making up the Cluster are likely to be similarly technologically limited. Having 2 Active Sub-MARSs will not change the fact that your cluster cannot have more than X members. (Consider that a VC Mesh for any given multicast group could end up with a mesh of X by X, or an MCS for the same group would have to terminate SVCs from up to X sources.) So conserving SVCs at the MARS may not be a valid reason to deploy a Load Sharing distributed MARS solution. SVC distributions across the switches of an ATM cloud can be significantly affected by placement of MARS Clients relative to the MARS itself. This 'load' does benefit from the use of multiple Active Sub-MARSs. If MARS Clients are configured to use a topologically 'local' Active Sub-MARS, we reduce the number of long-haul pt-pt SVCs that might otherwise traverse an ATM cloud to a single Active MARS. Of the 'loads' identified above, this one is arguably the only one that justifies a Load Sharing distributed MARS solution. The rest of this section will look at a number of scenarios that arise when attempting to provide a Load Sharing distributed MARS. Armitage Expires February 13th, 1997 [Page 14] Internet Draft August 13th, 1996 4.1 Partitioning the Cluster. A partitioned cluster has the following characteristics: ClusterControlVC (CCVC) is partitioned into a number of sub-CCVCs, one for each Active Sub-MARS. The leaf nodes of each sub-CCVC are those cluster members making up the cluster partition served by the associated Active Sub-MARS. MARS_JOIN/LEAVE traffic to one Active Sub-MARS must propagate out on each and every sub-CCVC to ensure Cluster wide distribution. This propagation must occur quickly, as it will impact the overall group change latency perceived by MARS Clients around the Cluster. Allocation of CMIs across the cluster must be co-ordinated amongst the Active Sub-MARSs to ensure no CMI conflicts within the cluster. 
Each sub-CCVC must carry MARS_REDIRECT_MAP messages with a MARS list appropriate for the partition it sends to.

Each Active Sub-MARS must be capable of answering a MARS_REQUEST or MARS_GROUPLIST_QUERY with information covering the entire Cluster.

Three mechanisms are possible for distributing MARS Clients among the available Active Sub-MARSs:

   MARS Clients could be manually configured with (or learn from a configuration server) the ATM address of their administratively assigned Active Sub-MARS. Each Active Sub-MARS simply accepts whoever registers with it as a cluster member.

   MARS Clients could be manually configured with (or learn from a configuration server) an Anycast ATM address representing "the nearest" Active Sub-MARS. Each Active Sub-MARS simply accepts whoever registers with it as a cluster member.

   MARS Clients could be manually configured with (or learn from a configuration server) the ATM address of an arbitrary Active Sub-MARS. The Active Sub-MARS entities have a mechanism for deciding which clients should register with which Active Sub-MARS. If a client registers with an incorrect Active Sub-MARS, it will be redirected to the correct one.

Regardless of the mechanism used, it must be kept in mind that MARS Clients themselves have no idea that they are being served by an Active Sub-MARS. They see a single Active MARS at all times.

The Anycast ATM address approach is nice, but suffers from the fact that such a service is not available under UNI 3.0 or UNI 3.1. This limits us to configuring clients with the specific ATM address of an Active Sub-MARS to use when a client first starts up. Finally, if an Active Sub-MARS is capable of redirecting a MARS Client to another Active Sub-MARS on demand, then the client's choice of initial Active Sub-MARS is more flexible. However, while dynamic reconfiguration is desirable, it makes complex demands on the Client and MARS interactions.

One issue is the choice of metric used to match MARS Clients to particular Active Sub-MARSs. Ideally this should be based on the topological location of the MARS Clients. However, this implies that any given Active Sub-MARS has the ability to deduce the ATM topology between a given MARS Client and the other members of the Active Sub-MARS Group. Unless MARS entities are restricted to running on switch control processors, this may not be possible.

4.2 What level of Fault Tolerance?

Providing Load Sharing does not necessarily encompass Fault Tolerance as described in section 3. A number of different service levels are possible:

   At the simplest end there are no Backup Sub-MARS entities. Each Active Sub-MARS looks after only one partition. If the Active Sub-MARS fails then all the cluster members in the associated partition have no MARS support until the Active Sub-MARS returns.

   An alternative is to provide each Active Sub-MARS with one or more Backup Sub-MARS entities. Cluster members switch to the Backup(s) for their partition (previously advertised by their Active Sub-MARS) if the Active Sub-MARS fails. If the Backups for the partition all fail, the associated partition has no MARS support until one of the Sub-MARS entities restarts. Backup Sub-MARS entities serving one partition may not be dynamically re-assigned to another partition.

   A refinement on the preceding model would allow temporary re-assignment of Backup Sub-MARS entities from one partition to another.
   The most complex model requires a set of MARS entities from which a subset may at any one time be Active Sub-MARS entities supporting the Cluster, while the remaining entities form a pool of Backup Sub-MARS entities. The partitioning of the cluster amongst the available Active Sub-MARS entities is dynamic. The number of Active Sub-MARS entities may also vary with time, implying that partitions may change in size and scope dynamically.

The following subsections touch on these different models.

4.3 Simple Load Sharing, no Fault Tolerance.

In the simplest model each partition has one Active Sub-MARS, there are no backups, and no dynamic reconfiguration is available. Each Active Sub-MARS supports any cluster member that chooses to register with it. Consider a cluster with 4 MARS Clients, and 2 Active Sub-MARSs. The following picture shows one possible configuration, where the cluster members are split evenly between the sub-MARSs:

      C1   C2   C3   C4
       |    |    |    |
      ----- M1 ------   ----- M2 -----
            |                 |
            -------------------

C1, C2, C3, and C4 all consider themselves to be members of the same Cluster. M1 manages a sub-CCVC with {C1, C2} as leaf nodes - Partition 1. M2 manages a sub-CCVC with {C3, C4} as leaf nodes - Partition 2. M1 and M2 form the Active Sub-MARS Group, and exchange cluster co-ordination information using SCSP.

When a MARS_JOIN/LEAVE event occurs in Partition 1, M1 uses SCSP to indicate the group membership transition to M2, which fabricates an equivalent MARS_JOIN/LEAVE message out to Partition 2. (e.g. When C1 issues a MARS_JOIN/LEAVE message it is propagated to {C1, C2} via M1. M1 also indicates the group state change to M2, which sends a matching MARS_JOIN/LEAVE to {C3, C4}.)

As discussed earlier in this section, MARS_REQUEST processing is expedited if each Active Sub-MARS keeps a local copy of the group membership database for the entire cluster. This is a reasonable requirement, and imposes no additional demands on the data flow between each Active Sub-MARS (since every MARS_JOIN/LEAVE event results in SCSP updates to every other Active Sub-MARS).

Cluster members registering with either M1 or M2 must receive a CMI that is unique within the scope of the entire cluster. Since each Active Sub-MARS is administratively configured, and no dynamic partitioning is supported, two possibilities emerge:

   Divide the CMI space into non-overlapping blocks, and assign each block to a different Active Sub-MARS. The Active Sub-MARS then assigns CMIs from its allocated CMI block.

   Define a distributed CMI allocation mechanism for dynamic CMI allocation amongst the Active Sub-MARS entities.

Since this scheme is fundamentally oriented towards fairly static configurations, a dynamic CMI allocation scheme would appear to be overkill. Network administrators should assign CMI blocks in roughly the same proportion that they assign clients to each Active Sub-MARS (to minimize the chances of an Active Sub-MARS running out of CMIs).

The MARS_REDIRECT_MAP message from each Active Sub-MARS lists only itself, since there are no backups. M1 lists {M1}, and M2 lists {M2}.
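A minimal sketch of the non-overlapping CMI block approach is given below (Python, purely illustrative). The 50/50 split of the 16 bit CMI space between M1 and M2 is an arbitrary example of the administrative choice described above, not a recommended value.

   # Illustrative static CMI block allocation for two Active Sub-MARSs.
   CMI_BLOCKS = {
       "M1": range(0x0001, 0x8000),     # Partition 1 allocates CMIs 0x0001-0x7FFF
       "M2": range(0x8000, 0x10000),    # Partition 2 allocates CMIs 0x8000-0xFFFF
   }

   class SubMarsCmiAllocator:
       def __init__(self, sub_mars_name):
           self.pool = iter(CMI_BLOCKS[sub_mars_name])
           self.assigned = {}                    # ATM address -> CMI

       def allocate(self, atm_addr):
           # Re-use the existing CMI on re-registration, otherwise take the
           # next value from this Sub-MARS's block. Cluster-wide uniqueness
           # follows from the blocks not overlapping.
           if atm_addr not in self.assigned:
               self.assigned[atm_addr] = next(self.pool)
           return self.assigned[atm_addr]

   m1 = SubMarsCmiAllocator("M1")
   m2 = SubMarsCmiAllocator("M2")
   print(m1.allocate("C1"), m1.allocate("C2"), m2.allocate("C3"))   # 1 2 32768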
4.4 Simple Load Sharing, intra-partition Fault Tolerance.

A better solution exists when each Active Sub-MARS has one or more Backup Sub-MARS entities available. The diagram from section 4.3 might become:

      C1   C2   C3   C4
       |    |    |    |
      ----- M1 ------   ----- M2 -----
        /       \         /       \
       |         -----------        |
       M3                           M4

In this case M3 is a Backup for M1, and M4 is a Backup for M2. M3 is never shared with M2, and M4 is never shared with M1.

This model is a union of section 3 and section 4.3, applying section 3's rules within the context of a Partition instead of the entire Cluster. M1 is the Active MARS for the Partition, and M3 behaves as a member of the Living Group for the Partition. The one key difference is that the Active MARS for each Partition is also a member of the Active Sub-MARS Group for the Cluster, and shares information using SCSP as described in section 4.3. As a consequence, the election of a partition's Backup MARS to Active MARS must also trigger election into the Cluster's Active Sub-MARS Group.

Borrowing from section 3, each Active Sub-MARS transmits a MARS_REDIRECT_MAP containing the Sub-MARS entities assigned to the partition (whether Active, Backup, or dead). In this example M1 would list {M1, M3}, and M2 would list {M2, M4}. If M1 failed, the procedures from section 3 would be applied within the context of Partition 1 to elect M3 to Active Sub-MARS:

      C1   C2   C3   C4
       |    |    |    |
      ----- M3 ------   ----- M2 -----
            |               |    \
            -----------------     |
                                   M4

Clients in Partition 1 would now receive MARS_REDIRECT_MAPs from M3 listing {M3, M1}. Clients in Partition 2 would see no change. If M1 recovers, an intra-partition re-election procedure may see M3 and M1 swap places, or M3 remain as the Active Sub-MARS with M1 as a Backup Sub-MARS. (Parameters affecting the election choice between M1 and M3 would now include the topological distance between M3 and the partition's cluster members.)

The information shared between an Active Sub-MARS and its associated Backup Sub-MARS(s) now also includes the CMI block that has been assigned to the partition.

4.5 Dynamically configured Load sharing.

A completely general version of the model in section 4.4 would allow the following additional freedoms:

   The Active Sub-MARS Group can grow or shrink in number over time, implying that partitions can have time-varying numbers of cluster members.

   Backup Sub-MARS entities may be elected at any time to support any partition.

If such flexibility exists, each Active Sub-MARS can effectively become each other's Backup Sub-MARS. Shifting clients from a failed Active Sub-MARS to another Active Sub-MARS is partition reconfiguration from the perspective of the Sub-MARSs, but is fault tolerant MARS service from the perspective of the clients.

However, a number of problems currently exist before we can implement completely general re-configuration of cluster partitions. The most important one is how a single Active Sub-MARS can redirect a subset of the MARS Clients attached to it, while retaining the rest. For example, assume this initial configuration:

      C1   C2   C3   C4
       |    |    |    |
      ----- M1 ------   ----- M2 -----
            |                 |
            -------------------

M1 lists {M1, M2} in its MARS_REDIRECT_MAPs, and M2 lists {M2, M1}. The cluster members neither know nor care that the Backup MARS listed by their Active MARS is actually an Active MARS for another partition of the Cluster.
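The MARS_REDIRECT_MAP orderings quoted throughout these examples all follow the same simple pattern: the local Active (Sub-)MARS first, then backups believed to be alive, then dead members of the MARS Group (as argued in section 3.5). A sketch of that list construction, in illustrative Python (the function and its arguments are editorial inventions):

   # Illustrative construction of one partition's MARS_REDIRECT_MAP list.
   def build_redirect_map(active, backups, dead):
       alive_backups = [b for b in backups if b not in dead]
       return [active] + alive_backups + list(dead)

   # The configuration above: M1 is Active with M2 as its (shared) backup.
   print(build_redirect_map("M1", ["M2"], dead=[]))       # ['M1', 'M2']
   # After M1 dies and its clients land on M2:
   print(build_redirect_map("M2", ["M1"], dead=["M1"]))   # ['M2', 'M1']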
If M1 failed, its partition of the cluster collapses. C1 and C2 re-register with M2, and the picture becomes:

      C1   C2   C3   C4
       |    |    |    |
      --------------------------- M2 -----

All cluster members start receiving MARS_REDIRECT_MAPs from M2, listing {M2, M1}. Unfortunately, we currently have no obvious mechanism for re-partitioning the cluster once M1 has recovered. M2 needs some way of inducing C1 and C2 to perform a soft-redirect (or hard, if appropriate) to M1, without losing C3 and C4.

One way of avoiding this scenario is to insist that the number of partitions cannot change, even while Active Sub-MARSs fail. Provision enough Active Sub-MARSs for the desired load sharing, and then provide a pool of shared Backup Sub-MARSs. The starting configuration might be redrawn as:

      C1   C2   C3   C4
       |    |    |    |
      ----- M1 ------   ----- M2 -----
            |                 |
            -------------------
            |                 |
            M3                M4

In this case M1 lists {M1, M3, M4} in its MARS_REDIRECT_MAPs, and M2 lists {M2, M3, M4}. If M1 fails, the MARS Group configures to:

      C1   C2   C3   C4
       |    |    |    |
      ----- M3 ------   ----- M2 -----
            |                 |
            -------------------
                              |
                              M4

Now, if M3 stays up while M1 is recovering from its failure, there will be a period within which M3 lists {M3, M4, M1} in its MARS_REDIRECT_MAPs, and M2 lists {M2, M4, M1}. This implies that the failure of M1, and the promotion of M3 into the Active Sub-MARS Group, causes M2 to re-evaluate the list of available Backup Sub-MARSs too.

When M1 is detected to be available again, M1 might be placed on the list of Backup Sub-MARS entities. The cluster would be configured as:

      C1   C2   C3   C4
       |    |    |    |
      ----- M3 ------   ----- M2 -----
            |                 |
            -------------------
            |                 |
            M1                M4

M3 lists {M3, M1, M4} in its MARS_REDIRECT_MAPs, and M2 lists {M2, M4, M1}. (M2's list is unchanged from the MARS_REDIRECT_MAPs immediately after M1 died. As discussed in section 3, it is important to list all possible MARS entities to assist clients in recovering from catastrophic MARS failure.)

M1 may be re-elected as Active Sub-MARS for {C1, C2}, requiring M3 to trigger a soft-redirect in MARS Clients back to M1. The Active Sub-MARS Group must also be updated.

There are additional problems with sharing the Backup Sub-MARS entities. If M1 and M2 failed simultaneously, the cluster would probably rebuild itself to look like:

      C1   C2   C3   C4
       |    |    |    |
      ----- M3 ------   ----- M4 -----
            |                 |
            -------------------

However, as described in section 3, transient failures of a Backup Sub-MARS might cause M3 to be unavailable during the failure of M1. This would lead to the topology we saw earlier:

      C1   C2   C3   C4
       |    |    |    |
      --------------------------- M4 -----

The two partitions have collapsed into one.

An obvious additional requirement is that M1 and M2 list the shared Backup Sub-MARSs in opposite order in their MARS_REDIRECT_MAPs. For example, if M1 listed {M1, M3, M4} and M2 also listed {M2, M3, M4} (i.e. the same backup order), the cluster would look like this after a simultaneous failure of M1 and M2:

      C1   C2   C3   C4
       |    |    |    |
      --------------------------- M3 -----
                   |
                   M4

Again, the two partitions have collapsed into one.

A not entirely foolproof solution would be for the Active MARS to issue specifically targeted MARS_REDIRECT_MAP messages on the pt-pt VCs that each client has open to it. If C1 and C2 still had their pt-pt VCs open, e.g. after re-registration, M3 could send them private MARS_REDIRECT_MAPs listing {M4, M3} as the list, forcing only C1 and C2 to re-direct.
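A sketch of what such targeted redirection might look like, in illustrative Python. The table mapping each client to its intended Active Sub-MARS, and the 'send_private_redirect' callable standing in for a unicast MARS_REDIRECT_MAP on the client's pt-pt VC, are both assumptions made for the example:

   # Illustrative only: re-splitting a collapsed partition by sending private,
   # per-client MARS_REDIRECT_MAPs rather than one copy on the sub-CCVC.
   def resplit_partition(clients, target_mars, my_name, send_private_redirect):
       for client in clients:
           wanted = target_mars[client]
           if wanted != my_name:
               # e.g. M3 tells C1 and C2 to move to M4, saying nothing to C3, C4.
               send_private_redirect(client, [wanted, my_name])

   sent = []
   resplit_partition(
       clients=["C1", "C2", "C3", "C4"],
       target_mars={"C1": "M4", "C2": "M4", "C3": "M3", "C4": "M3"},
       my_name="M3",
       send_private_redirect=lambda c, lst: sent.append((c, lst)))
   print(sent)    # only ('C1', ['M4', 'M3']) and ('C2', ['M4', 'M3'])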
Another possibility is for the remaining Active Sub-MARS entities to split into multiple logical Active Sub-MARS entities, and manage each partition separately (with a separate sub-CCVC for its members) until one of the real Sub-MARS entities restarts. The secondary 'logical' Active Sub-MARS could then redirect the partition back to the newly restarted 'real' Active Sub-MARS.

Both of these approaches require further thought.

4.6 What about Multicast Servers?

As noted in section 3, it is imperative that knowledge of MCS supported groups is propagated to Backup MARS entities to minimize transient changes to pt-mpt SVCs out of the clients during an Active MARS failure. However, with a partitioned Cluster the issue becomes more complex. For an Active Sub-MARS to correctly filter the MARS_JOIN/LEAVE messages it may want to transmit on its local Sub-CCVC, it MUST know which groups are, cluster wide, being supported by an MCS. Since the MCS in question may have registered with only one Active Sub-MARS, the Active Sub-MARS Group must exchange timely information on MCS registrations and supported groups.

The propagation of MCS information must be carefully tracked at each Active Sub-MARS, as it impacts on whether the local partition should see a MARS_JOIN, MARS_LEAVE, or MARS_MIGRATE on the sub-CCVC (or nothing at all). There may well be race conditions where one Active Sub-MARS is processing a group MARS_JOIN, while simultaneously an MCS is registering to support the same group with a different Active Sub-MARS. The problem is not unsolvable; it just requires careful design.

Finally, all preceding discussions in section 4 on partitioning of ClusterControlVC also apply to ServerControlVC (SCVC). An MCS may attach to any Active Sub-MARS, which then must originate a sub-SCVC. It is not yet clear how a distributed Active Sub-MARS Group would interact with a distributed MCS supporting the same multicast group.

4.7 Summary of requirements.

The problems described in section 4.5 make the fully dynamic partition model very unattractive. The fixed load sharing approaches in sections 4.3 and 4.4 demand a significantly simpler solution, while providing a valuable service. We will focus on section 4.4's requirements rather than section 4.5's.

   Active Sub-MARSs track the cluster wide group membership for all groups so they can answer MARS_REQUESTs from locally held information.

   CMI mappings to actual cluster members need to be propagated amongst Active and Backup Sub-MARSs. In addition, the CMI space needs to be split into non-overlapping blocks so that each Active Sub-MARS can allocate CMIs that are unique cluster-wide.

   To ensure each Active Sub-MARS can filter the JOIN/LEAVE traffic it propagates on its Sub-CCVC, information on what groups are MCS supported MUST be distributed around the Active Sub-MARS Group, not just between Active Sub-MARSs and their Backups.

5. Using SCSP.

TBD. See Appendix A for some evolving details.

6. Open Issues.

The specific SCSP solutions to the items summarized in sections 3.7 and 4.7 are still being developed.

Security Considerations

Security considerations are not addressed in this document.

Acknowledgments

Jim Rubas and Anthony Gallo of IBM helped clarify some points in the initial release. Rob Coulton and Carl Marcinik of FORE Systems engaged in helpful discussions after the June 1996 IETF presentation.
Author's Address

   Grenville Armitage
   Bellcore, 445 South Street
   Morristown, NJ, 07960
   USA

   Email: gja@thumper.bellcore.com
   Ph. +1 201 829 2635

References

[1] J. Luciani, G. Armitage, J. Halpern, "Server Cache Synchronization Protocol (SCSP) - NBMA", INTERNET DRAFT, draft-luciani-rolc-scsp-03.txt, June 1996.

[2] J. Luciani, et al, "NBMA Next Hop Resolution Protocol (NHRP)", INTERNET DRAFT, draft-ietf-rolc-nhrp-09.txt, July 1996.

[3] G. Armitage, "Support for Multicast over UNI 3.0/3.1 based ATM Networks.", Bellcore, INTERNET DRAFT, draft-ietf-ipatm-ipmc-12.txt, February 1996.

[4] R. Talpade, M. Ammar, "Multiple MCS support using an enhanced version of the MARS server.", INTERNET DRAFT, draft-talpade-ion-multiplemcs-00.txt, June 11, 1996.

[5] G. Armitage, "Issues affecting MARS Cluster Size", Bellcore, INTERNET DRAFT, draft-armitage-ion-cluster-size-00.txt, July 1996.

Appendix A. Early SCSP message formats.

[Editors note: These were first inserted in draft-luciani-rolc-scsp-02.txt as a rough first cut at identifying the 'cache' information that MARS entities would need to exchange. It has not been fully aligned with the body of this I-D yet, although it has been updated to align with draft-luciani-rolc-scsp-03.txt [1].]

The following material describes a first cut at breaking down the MARS state into a number of separate caches (termed "sub-caches"). Each sub-cache contains information necessary for the working of a MARS and which might conceivably be required by another MARS entity planning to operate as a Backup MARS.

SCSP as defined in [1] allows members of a Server Group (SG) to keep their individual copies of caches aligned by the use of alignment and update messages. Client State Update (CSU) messages contain actual cache update data within individual Client State Advertisement (CSA) records. These may be transmitted by a server when a local cache state change occurs, or when solicited by a neighboring server using a Client State Update Solicitation (CSUS) message. This appendix describes a number of CSA record types that will be useful in propagating and updating MARS sub-caches.

SCSP also defines an alignment phase where members of a Server Group discover if their caches are out of alignment by comparing Client State Advertisement Summary (CSAS) records. CSAS records are carried in Cache Alignment (CA) messages between neighboring servers. This appendix also describes CSAS records to match the defined CSA record types.

This appendix does not yet cover the SCSP mechanisms that would be used during Active MARS election. It also does not cover how SCSP Server Groups will be mapped to the MARS Group, Living Group, and Active Sub-MARS Group concepts introduced in this document.

[Some of the text contains assumptions more in line with section 3 (the basic Fault Tolerant model) rather than section 4. This will be refined in later revisions of the Internet Draft. Where this Appendix states requirements that are not consistent with the main body of this Internet Draft, the 'requirement' should be viewed with caution.]

A.1.1 The MARS Sub-caches.

The overall MARS state consists of a number of components:

   Cluster membership list.
   Cluster Member IDs.
   (Multicast) Server membership list.
   Absolute maximum and minimum group addresses for the protocol being supported.
   Member map (hostmap) for each Layer 3 group.
Multicast Server (MCS) Servermap for each Layer 3 group. Block-join map. Redirect_map. The Cluster membership list is the most fundamental object for a MARS. It contains the ATM addresses of every cluster member, and explicitly maps Cluster Member ATM addresses to Cluster Member IDs. Both of these pieces of information will be combined into a single CMI map sub-cache. The (Multicast)Server membership list is essential to enable construction of a backup ServerControlVC by any one of the backup MARSs. Each multicast group is represented by a membership map (hostmap) sub-cache. Since a given multicast group may also have MCSs registered to support it there is also a matching Servermap. Hostmaps and Servermaps are treated as separate sub-caches. To simplify and shorten the CSA Records, members of these maps are identified by their Cluster Member IDs rather than enumerating their actual ATM addresses. Since hostmaps for a given group may be quite large, and most MARS_JOIN/LEAVE events simply result in an incremental change to the host map, two different types of CSA record will be defined. One will represent the sub-cache in its entirety (for use when aligning servers), and a second whose semantics will match the JOIN/LEAVE event (allowing an incremental addition to, or deletion from, the sub-cache). The same will apply to the CSA records for Servermaps. The block-join map represents all currently valid block MARS_JOINs Armitage Expires February 13th, 1997 [Page 26] Internet Draft August 13th, 1996 registered with the MARS. This allows the preceding, group-specific hostmaps to be simplified. (The CSA Records representing the hostmap for a given group only lists nodes that have issued a specific single-group MARS_JOIN for that group.) Internally, the MARS builds whatever database structure is required to ensure that replies to MARS_REQUESTs, and general hole-punching activities, take the block- join map's contents into account. The Redirect_map is the list of MARS entities this MARS is currently sending in its MARS_REDIRECT_MAP messages. A.1.2 Client State Advertisement Summary (CSAS) records. Client State Advertisement Summary (CSAS) records are carried within SCSP Cache Alignment (CA) messages [1]. They are used to inform one server of another server's cache state without sending the contents of the cache itself. CSAS records have an 8 byte fixed header defined by SCSP, followed by protocol specific fields. For MARS use we add a 16 bit CSAS record (sub-cache) type field. The first 10 bytes of each CSAS record are thus: csas$protoID 16 bits Protocol (MARS, NHRP, etc). csas$unused 16 bits csas$sequence 32 bits CSA Sequence Number. csas$type 16 bits CSAS record sub-cache type. The CSA Sequence number indicates how recently the source's version of the specified sub-cache has been modified. This is used to determine which server (if any) is out of alignment. The remaining bytes of the CSAS record are determined by csas$type. The CSAS record types are: CSAS_CMI_MAP 1 CSAS_MCS_LIST 2 CSAS_HOST_MAP 3 CSAS_MCS_MAP 4 CSAS_BLOCK_JOINS 5 CSAS_REDIRECT_MAP 6 The specific formats of each CSAS record are described in the following sub-sections. Armitage Expires February 13th, 1997 [Page 27] Internet Draft August 13th, 1996 A.1.2.1 CSAS_CMI_MAP. The complete CSAS Record looks like: csas$protoID 16 bits Protocol ( = 3 for MARS). csas$unused 16 bits csas$sequence 32 bits CSA Sequence Number. csas$type 16 bits Set to 1 (CSAS_CMI_MAP) csas$orig_len 8 bits Length of csas$origin field. csas$unused 8 bits unused. 
csas$origin x octets Originator's protocol address. For this CSAS, the sequence number is incremented every time a new cluster member registers, or an old one is considered to have died or deregistered. A.1.2.2 CSAS_MCS_LIST. The complete CSAS Record looks like: csas$protoID 16 bits Protocol ( = 3 for MARS). csas$unused 16 bits csas$sequence 32 bits CSA Sequence Number. csas$type 16 bits Set to 2 (CSAS_MCS_LIST) csas$orig_len 8 bits Length of csas$origin field. csas$unused 8 bits unused. csas$origin x octets Originator's protocol address. For this CSAS, the sequence number is incremented every time a new MCS registers, or an old one is considered to have died or deregistered. A.1.2.3 CSAS_HOST_MAP. The complete CSAS Record looks like: csas$protoID 16 bits Protocol ( = 3 for MARS). csas$unused 16 bits csas$sequence 32 bits CSA Sequence Number. csas$type 16 bits Set to 3 (CSAS_HOST_MAP) csas$orig_len 8 bits Length of csas$origin field. csas$group_len 8 bits Length of group address. csas$origin x octets Originator's protocol address. csas$group y octets Hostmap's group address. For this CSAS, the sequence number is incremented whenever a cluster member joins or leaves the group specified by csas$group. Armitage Expires February 13th, 1997 [Page 28] Internet Draft August 13th, 1996 A.1.2.4 CSAS_MCS_MAP. The complete CSAS Record looks like: csas$protoID 16 bits Protocol ( = 3 for MARS). csas$unused 16 bits csas$sequence 32 bits CSA Sequence Number. csas$type 16 bits Set to 4 (CSAS_MCS_MAP) csas$orig_len 8 bits Length of csas$origin field. csas$group_len 8 bits Length of group address. csas$origin x octets Originator's protocol address. csas$group y octets Servermap's group address. For this CSAS, the sequence number is incremented whenever an MCS joins or leaves the group specified by csas$group. A.1.2.5 CSAS_BLOCK_JOINS. The complete CSAS Record looks like: csas$protoID 16 bits ( = 3 for MARS). csas$unused 16 bits csas$sequence 32 bits CSA Sequence Number. csas$type 16 bits Set to 5 (CSAS_BLOCK_JOINS) csas$orig_len 8 bits Length of csas$origin field. csas$unused 8 bits unused. csas$origin x octets Originator's protocol address. For this CSAS, the sequence number is incremented whenever a block MARS_JOIN, or matching block MARS_LEAVE, occurs. A.1.2.6 CSAS_REDIRECT_MAP. The complete CSAS Record looks like: csas$protoID 16 bits ( = 3 for MARS). csas$unused 16 bits csas$sequence 32 bits CSA Sequence Number. csas$type 16 bits Set to 6 (CSAS_REDIRECT_MAP) csas$orig_len 8 bits Length of csas$origin field. csas$unused 8 bits unused. csas$origin x octets Originator's protocol address. For this CSAS, the sequence number is incremented whenever the local server modifies the list of MARS entities in its MARS_REDIRECT_MAP list. Armitage Expires February 13th, 1997 [Page 29] Internet Draft August 13th, 1996 A.1.3 Client State Advertisement (CSA) Records. CSA records have an 12 byte fixed header defined by SCSP, followed by protocol specific fields. For MARS use we add a 16 bit CSA record (sub-cache) type field. The first 14 bytes of each CSAS record are thus: csa$protoID 16 bits Protocol (MARS, NHRP, etc). csa$ttl 16 bits TTL csa$sequence 32 bits CSA Sequence Number. csa$sgid 32 bits Server Group ID. csa$type 16 bits CSA Record sub-cache type. The CSA Sequence number indicates how recently the source's version of the specified sub-cache has been modified. The Server Group ID identifies an instance of a Server Group. 
The remaining bytes of a CSA record (after the 14 byte fixed portion shown above) are determined by csa$type. To match the CSAS records, the following CSA record types are defined:

      CSA_CMI_MAP        1
      CSA_MCS_LIST       2
      CSA_HOST_MAP       3
      CSA_MCS_MAP        4
      CSA_BLOCK_JOINS    5
      CSA_REDIRECT_MAP   6

In addition, to allow indication of incremental updates to some of the sub-caches, the following CSA record types are also defined:

      CSA_CMI_MAP_JOIN     128
      CSA_CMI_MAP_LEAVE    129
      CSA_MCS_LIST_JOIN    130
      CSA_MCS_LIST_LEAVE   131
      CSA_HOST_MAP_JOIN    132
      CSA_HOST_MAP_LEAVE   133
      CSA_MCS_MAP_JOIN     134
      CSA_MCS_MAP_LEAVE    135

(csa$type values in the range 1 to 127 correspond to entire sub-caches, whilst the range 128 to 512 is allocated to incremental sub-cache updates.)

The amount of information carried by a specific CSA_HOST_MAP or CSA_CMI_MAP may exceed the size of a link layer PDU. Hence, CSA Records relating to a single MARS sub-cache may be fragmented across a number of CSU Request messages. This may be considered analogous to the fragmentation of a group's membership list across a number of MARS_MULTIs when a MARS replies to a single MARS_REQUEST.

Every type allows for fragmentation of the CSA Record across multiple CSU Request messages. Analogous to MARS_MULTIs, CSA Record fragments carry a 15 bit fragment number and a 1 bit 'end of fragment' (EOF) flag. Re-assembly requires collecting CSA Record fragments referring to the same sub-cache type and entry until the EOF flag is set. The re-assembled CSA Record is then processed.

The csa$fragxy field is coded with the EOF flag x in the leading bit, and the fragment number y coded as an unsigned integer in the remaining 15 bits.

      | 1st octet     | 2nd octet     |
       7 6 5 4 3 2 1 0 7 6 5 4 3 2 1 0
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |x|              y              |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

A fragmented CSA Record MUST be carried in a consecutive sequence of CSU messages. A sequence of CSA Record fragments referring to the same sub-cache type and entry SHALL carry the same CSA Sequence Number. If the CSA Sequence Number changes during the re-assembly of a fragmented CSA Record, the fragments collected so far are discarded. (The CSA Sequence Number for any given type of cache information is derived in the same way as the CSA Sequence Number for the equivalent CSAS record, as described in the previous section.)

The 15 bit fragment number in consecutive fragments of a CSA Record SHALL start at 1 and increment by 1 for each fragment. Fragments SHALL be transmitted in order of their fragment numbers. All but the final fragment SHALL have the EOF flag set to 0. The final (or first, if there is only one) fragment SHALL have the EOF flag set to 1.

If the fragment number increments by more than one between consecutive received fragments, the CSA Record being re-assembled is considered to be in error. It is discarded after the final fragment is received. If the final fragment does not arrive within 10 seconds [**** check validity of this ****] of the last received fragment, the CSA Record re-assembly is terminated and the fragments collected so far are discarded. (A sketch of this re-assembly procedure is given below.)

SCSP CA messages carry an originator field to identify the actual Server that originated the update. In the case of MARS, the originator is identified by its ATM address (cf. the NHRP case, where the 'protocol address' is used). The format of the ATM address is irrelevant - the originator field is simply an uninterpreted octet string used for pattern matching.
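The following informative sketch captures the re-assembly rules above: fragment numbers start at 1, the EOF flag occupies the leading bit of csa$fragxy, a changed CSA Sequence Number discards the fragments collected so far, and a numbering gap marks the record as in error until the final fragment arrives. The class and its interface are inventions of this sketch; timeout handling and the surrounding CSU message parsing are omitted.

      class CsaReassembler:
          """Collects fragments of one CSA Record (same sub-cache type
          and entry)."""

          def __init__(self):
              self.expected = 1      # next fragment number we will accept
              self.sequence = None   # CSA Sequence Number of this record
              self.parts = []        # payload fragments collected so far
              self.in_error = False

          def add(self, fragxy, csa_sequence, payload):
              """Process one fragment. Returns the re-assembled record on
              EOF, or None (None on EOF means the record was discarded)."""
              eof  = (fragxy >> 15) & 0x1    # leading bit: EOF flag
              frag = fragxy & 0x7FFF         # remaining 15 bits: number

              # A changed CSA Sequence Number discards fragments so far.
              if self.sequence is not None and csa_sequence != self.sequence:
                  self.__init__()
              self.sequence = csa_sequence

              # A gap in the numbering marks the record as in error; it is
              # discarded once the final fragment arrives.
              if frag != self.expected:
                  self.in_error = True
              else:
                  self.parts.append(payload)
              self.expected = frag + 1

              if eof:                        # final (or only) fragment
                  record = None if self.in_error else b"".join(self.parts)
                  self.__init__()
                  return record
              return None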
A.1.3.1 CSA_CMI_MAP.

This CSA Record carries the entire membership of the current cluster, along with the Cluster Member IDs (CMIs) assigned by the MARS each member registered with. These CMIs are then used as a short-form representation of the actual cluster members, to compress the size of subsequent CSA_HOST_MAP messages.

      csa$protoID     16 bits   Protocol ( = 3 for MARS).
      csa$ttl         16 bits   TTL.
      csa$sequence    32 bits   CSA Sequence Number.
      csa$sgid        32 bits   Server Group ID.
      csa$type        16 bits   Set to 1 (CSA_CMI_MAP).
      csa$orig_len     8 bits   Length of csa$origin.
      csa$unused       8 bits   unused.
      csa$fragxy      16 bits   Fragment number and EOF flag.
      csa$cnum        16 bits   Number of entries in this fragment (N).
      csa$thtl         8 bits   Type and length of ATM addresses.
      csa$tstl         8 bits   Type and length of ATM sub-addresses.
      csa$origin      x octets  Originator's NBMA address.
      csa$atmaddr.1   q octets  ATM address of member 1.
      csa$subaddr.1   r octets  ATM sub-address of member 1.
      csa$cmi.1       16 bits   Cluster Member ID for entry 1.
         [..etc..]
      csa$atmaddr.N   q octets  ATM address of member N.
      csa$subaddr.N   r octets  ATM sub-address of member N.
      csa$cmi.N       16 bits   Cluster Member ID for entry N.

A.1.3.2 CSA_MCS_LIST.

This CSA Record carries the entire list of currently registered Multicast Servers (MCSs). Each MCS is also assigned an internal ID by the MARS it registered with - this is used to compress the size of subsequent CSA_MCS_MAP messages.

      csa$protoID     16 bits   Protocol ( = 3 for MARS).
      csa$ttl         16 bits   TTL.
      csa$sequence    32 bits   CSA Sequence Number.
      csa$sgid        32 bits   Server Group ID.
      csa$type        16 bits   Set to 2 (CSA_MCS_LIST).
      csa$orig_len     8 bits   Length of csa$origin.
      csa$unused       8 bits   unused.
      csa$fragxy      16 bits   Fragment number and EOF flag.
      csa$cnum        16 bits   Number of entries in this fragment (N).
      csa$thtl         8 bits   Type and length of ATM addresses.
      csa$tstl         8 bits   Type and length of ATM sub-addresses.
      csa$origin      x octets  Originator's NBMA address.
      csa$atmaddr.1   q octets  ATM address of MCS 1.
      csa$subaddr.1   r octets  ATM sub-address of MCS 1.
      csa$cmi.1       16 bits   Internal MCS ID for entry 1.
         [..etc..]
      csa$atmaddr.N   q octets  ATM address of MCS N.
      csa$subaddr.N   r octets  ATM sub-address of MCS N.
      csa$cmi.N       16 bits   Internal MCS ID for entry N.

A.1.3.3 CSA_HOST_MAP.

This CSA Record carries the list of cluster members who have joined a specified group using a single-group MARS_JOIN operation. The Cluster Member IDs are used to represent each group member within each CSA Record fragment. A recipient MARS uses this CSA in conjunction with the current Cluster membership list to derive the actual ATM addresses of group members.

      csa$protoID     16 bits   Protocol ( = 3 for MARS).
      csa$ttl         16 bits   TTL.
      csa$sequence    32 bits   CSA Sequence Number.
      csa$sgid        32 bits   Server Group ID.
      csa$type        16 bits   Set to 3 (CSA_HOST_MAP).
      csa$orig_len     8 bits   Length of csa$origin.
      csa$group_len    8 bits   Length of csa$group.
      csa$fragxy      16 bits   Fragment number and EOF flag.
      csa$cnum        16 bits   Number of entries in this fragment (N).
      csa$origin      x octets  Originator's NBMA address.
      csa$group       y octets  Multicast group's protocol address.
      csa$cmi.1       16 bits   Cluster Member ID for entry 1.
      csa$cmi.2       16 bits   Cluster Member ID for entry 2.
         [..etc..]
      csa$cmi.N       16 bits   Cluster Member ID for entry N.
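The following informative sketch shows the intended use of the CMI-based compression: a recipient MARS expands a re-assembled CSA_HOST_MAP back into ATM addresses using the cluster membership information learned from CSA_CMI_MAP records. The dictionary structures are assumptions of this sketch.

      def expand_host_map(cmi_list, cmi_map):
          """cmi_list : CMIs carried in a (re-assembled) CSA_HOST_MAP.
          cmi_map  : {cmi: (atm_address, atm_subaddress)} built from
                     CSA_CMI_MAP records.
          Returns the ATM address information for each group member. A CMI
          with no entry in the CMI map indicates the CMI map and hostmap
          sub-caches are not aligned."""
          members = []
          for cmi in cmi_list:
              if cmi not in cmi_map:
                  raise KeyError("CMI %d unknown - CMI map not aligned" % cmi)
              members.append(cmi_map[cmi])
          return members

      # Example: two group members identified by CMIs 3 and 7.
      cmis = {3: (b"\x47" + bytes(19), b""), 7: (b"\x39" + bytes(19), b"")}
      print(expand_host_map([3, 7], cmis))

The same approach applies to CSA_MCS_MAP records, using the internal MCS IDs carried in CSA_MCS_LIST records.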
A.1.3.4 CSA_MCS_MAP.

This CSA Record carries the list of MCSs who have joined to support a specified group. The internal MCS IDs from prior CSA_MCS_LIST CSA Records are used to represent each MCS. A recipient MARS uses this CSA in conjunction with the current MCS membership list to derive the actual ATM addresses of the MCSs supporting the group.

      csa$protoID     16 bits   Protocol ( = 3 for MARS).
      csa$ttl         16 bits   TTL.
      csa$sequence    32 bits   CSA Sequence Number.
      csa$sgid        32 bits   Server Group ID.
      csa$type        16 bits   Set to 4 (CSA_MCS_MAP).
      csa$orig_len     8 bits   Length of csa$origin.
      csa$group_len    8 bits   Length of csa$group.
      csa$fragxy      16 bits   Fragment number and EOF flag.
      csa$cnum        16 bits   Number of entries in this fragment (N).
      csa$origin      x octets  Originator's NBMA address.
      csa$group       y octets  Multicast group's protocol address.
      csa$cmi.1       16 bits   Internal MCS ID for entry 1.
      csa$cmi.2       16 bits   Internal MCS ID for entry 2.
         [..etc..]
      csa$cmi.N       16 bits   Internal MCS ID for entry N.

A.1.3.5 CSA_BLOCK_JOINS.

This CSA Record carries the list of Cluster Members who have joined blocks of the layer 3 group address space. The Cluster Member IDs from prior CSA_CMI_MAP CSA Records are used to represent each cluster member and associate it with a specific <csa$min, csa$max> block.

      csa$protoID     16 bits   Protocol ( = 3 for MARS).
      csa$ttl         16 bits   TTL.
      csa$sequence    32 bits   CSA Sequence Number.
      csa$sgid        32 bits   Server Group ID.
      csa$type        16 bits   Set to 5 (CSA_BLOCK_JOINS).
      csa$orig_len     8 bits   Length of csa$origin.
      csa$group_len    8 bits   Lengths of csa$min and csa$max fields.
      csa$fragxy      16 bits   Fragment number and EOF flag.
      csa$cnum        16 bits   Number of entries in this fragment (N).
      csa$origin      x octets  Originator's NBMA address.
      csa$min.1       y octets  Minimum group address of block 1.
      csa$max.1       y octets  Maximum group address of block 1.
      csa$cmi.1       16 bits   Cluster Member ID for block 1.
         [..etc..]
      csa$min.N       y octets  Minimum group address of block N.
      csa$max.N       y octets  Maximum group address of block N.
      csa$cmi.N       16 bits   Cluster Member ID for block N.

A.1.3.6 CSA_REDIRECT_MAP.

This CSA Record carries the list of MARS entities the source server is using to generate its MARS_REDIRECT_MAP messages.

      csa$protoID     16 bits   Protocol ( = 3 for MARS).
      csa$ttl         16 bits   TTL.
      csa$sequence    32 bits   CSA Sequence Number.
      csa$sgid        32 bits   Server Group ID.
      csa$type        16 bits   Set to 6 (CSA_REDIRECT_MAP).
      csa$orig_len     8 bits   Length of csa$origin.
      csa$unused       8 bits   unused.
      csa$fragxy      16 bits   Fragment number and EOF flag.
      csa$cnum        16 bits   Number of entries in this fragment (N).
      csa$thtl         8 bits   Type and length of ATM addresses.
      csa$tstl         8 bits   Type and length of ATM sub-addresses.
      csa$origin      x octets  Originator's NBMA address.
      csa$atmaddr.1   q octets  ATM address of member 1.
      csa$subaddr.1   r octets  ATM sub-address of member 1.
         [..etc..]
      csa$atmaddr.N   q octets  ATM address of member N.
      csa$subaddr.N   r octets  ATM sub-address of member N.

A.1.3.7 Incremental update CSA Records.

The incremental update CSA Record types are all coded using the CSA Record for the entire sub-cache, except that only a single entry is carried. Two examples:

CSA_CMI_MAP_LEAVE is coded as a CSA_CMI_MAP but with csa$type = 129, csa$cnum = 1, and only a single ATM address and CMI pair provided.

CSA_HOST_MAP_JOIN is coded as a CSA_HOST_MAP but with csa$type = 132, csa$cnum = 1, and only a single CMI provided.
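As an informative illustration of this incremental coding, the following sketch builds a single-entry CSA_HOST_MAP_JOIN (csa$type = 132, csa$cnum = 1, fragment number 1 with the EOF flag set). Network byte order and the helper name are assumptions of this sketch.

      import struct

      MARS_PROTO_ID     = 3
      CSA_HOST_MAP_JOIN = 132   # incremental "join" form of CSA_HOST_MAP

      def pack_csa_host_map_join(ttl, sequence, sgid, origin, group, cmi):
          """Single-entry CSA_HOST_MAP_JOIN: 14 byte fixed header, then
          orig_len, group_len, fragxy, cnum, origin, group and one CMI."""
          header = struct.pack("!HHIIH", MARS_PROTO_ID, ttl,
                               sequence & 0xFFFFFFFF, sgid,
                               CSA_HOST_MAP_JOIN)
          fragxy = 0x8000 | 1                 # EOF flag set, fragment 1
          body = struct.pack("!BBHH", len(origin), len(group), fragxy, 1)
          return header + body + origin + group + struct.pack("!H", cmi)

      # Example: CMI 12 joins IPv4 group 224.1.2.3 (4 octet group address).
      rec = pack_csa_host_map_join(ttl=8, sequence=7, sgid=1,
                                   origin=bytes(20),
                                   group=bytes([224, 1, 2, 3]), cmi=12)
      assert len(rec) == 14 + 6 + 20 + 4 + 2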
In each case the fragment number is 1, and the EOF flag is always set. The operation (incremental add or drop) is indicated by the csa$type field.

A.1.4 Use of CSA Records.

The most important sub-caches for a MARS to exchange are the CSA_CMI_MAP and CSA_MCS_LIST. Without alignment of these sub-caches, members of the Server Group cannot interpret the other CSA Record types, which identify nodes using the ID values supplied in the CSA_CMI_MAP and CSA_MCS_LIST records.

Section 4 refers to MARS_JOIN/LEAVE events being propagated around the Active Sub-MARS Group. It is expected that propagating such events will be achieved by issuing CSA_HOST_MAP_JOIN and CSA_HOST_MAP_LEAVE CSA Records in CSU messages (a sketch of this flow is given at the end of this appendix).

There are no CSAS Record types equivalent to the incremental sub-cache update CSA Record types. The semantics of a CSAS Record in a CA message are to indicate the state of an entire sub-cache; it would make no sense to try to discover (or convey) an 'incremental state' of a sub-cache. As a consequence, incremental sub-cache update CSA Record types SHALL only be sent in unsolicited CSU Request messages. Client State Update Solicit (CSUS) messages SHALL only trigger the delivery of CSA Records containing entire sub-caches as atomic units.

How the rest of the behaviour described in sections 3 and 4 will be achieved with SCSP is TBD.
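The following informative sketch, referred to in A.1.4 above, shows the kind of flow envisaged when an Active (Sub-)MARS receives a MARS_JOIN: the local hostmap sub-cache is updated, its CSA Sequence Number is incremented, and a CSA_HOST_MAP_JOIN is issued to the rest of the Server Group in an unsolicited CSU Request. The object model and the send_csu_request() transport hook are assumptions of this sketch, not part of the MARS or SCSP specifications.

      class MarsHostMaps:
          """Per-group hostmap sub-caches held by one Active (Sub-)MARS."""

          def __init__(self, send_csu_request):
              self.hostmaps  = {}    # group address -> set of CMIs
              self.sequences = {}    # group address -> CSA Sequence Number
              self.send_csu_request = send_csu_request   # transport hook

          def handle_mars_join(self, group, cmi):
              members = self.hostmaps.setdefault(group, set())
              if cmi in members:
                  return             # duplicate join; nothing to propagate
              members.add(cmi)
              self.sequences[group] = self.sequences.get(group, 0) + 1
              # Incremental records only ever travel in unsolicited CSU
              # Requests; a CSUS from a peer is answered with the entire
              # sub-cache instead.
              self.send_csu_request(csa_type=132,   # CSA_HOST_MAP_JOIN
                                    sequence=self.sequences[group],
                                    group=group, cmi=cmi)

      # Example usage with a stand-in transport function.
      m = MarsHostMaps(lambda **kw: print("CSU Request:", kw))
      m.handle_mars_join("224.1.2.3", 12)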