References

[1] Deering, S., "Host Extensions for IP Multicasting", STD 5, RFC 1112, Stanford University, August 1989.

[2] Heinanen, J., "Multiprotocol Encapsulation over ATM Adaptation Layer 5", RFC 1483, Telecom Finland, July 1993.

[3] Laubach, M., "Classical IP and ARP over ATM", RFC 1577, Hewlett-Packard Laboratories, December 1993.

[4] ATM Forum, "ATM User Network Interface (UNI) Specification Version 3.1", ISBN 0-13-393828-X, Prentice Hall, Englewood Cliffs, NJ, June 1995.

[5] Waitzman, D., Partridge, C., and S. Deering, "Distance Vector Multicast Routing Protocol", RFC 1075, November 1988.

[6] Perez, M., Liaw, F., Grossman, D., Mankin, A., Hoffman, E., and A. Malis, "ATM Signaling Support for IP over ATM", RFC 1755, February 1995.

[7] Borden, M., Crawley, E., Davie, B., and S. Batsell, "Integration of Real-time Services in an IP-ATM Network Architecture", RFC 1821, August 1995.

[8] ATM Forum, "ATM User-Network Interface Specification Version 3.0", Englewood Cliffs, NJ: Prentice Hall, September 1993.
Appendix A. Hole punching algorithms.

Implementations are entirely free to comply with the body of this memo in any way they see fit. This appendix is purely for clarification.

A MARS implementation might pre-construct a set of <min,max> pairs (P) that reflects the entire Class D space, excluding any addresses currently supported by multicast servers. The <min> field of the first pair MUST be 224.0.0.0, and the <max> field of the last pair MUST be 239.255.255.255. The first and last pair may be the same. This set is updated whenever a multicast server registers or deregisters.

When the MARS must perform 'hole punching' it might consider the following algorithm:

Assume the MARS_JOIN/LEAVE received by the MARS from the cluster member specified the block <Emin, Emax>.

Assume Pmin(N) and Pmax(N) are the <min> and <max> fields from the Nth pair in the MARS's current set P.

Assume set P has K pairs. Pmin(1) MUST equal 224.0.0.0, and Pmax(K) MUST equal 239.255.255.255. (If K == 1 then no hole punching is required).

Execute pseudo-code:

     create copy of set P, call it set C.

     index1 = 1;
     while (Pmax(index1) <= Emin)
             index1++;

     index2 = K;
     while (Pmin(index2) >= Emax)
             index2--;

     if (index1 > index2) Exit, as the hole-punched set is null.

     if (Pmin(index1) < Emin)
             Cmin(index1) = Emin;

     if (Pmax(index2) > Emax)
             Cmax(index2) = Emax;
Pairs index1 through index2 of set C form the required 'hole punched' set of address blocks.

The resulting set retains all the MARS's pre-constructed 'holes' covering the multicast servers, but will have been pruned to cover the section of the Class D space specified by the originating host's <Emin,Emax> values.

The host end should keep a table, H, of open VCs in ascending order of Class D address.

Assume H(x).addr is the Class D address associated with VC.x.
Assume H(x).addr < H(x+1).addr.

The pseudo code for updating VCs based on an incoming JOIN/LEAVE might be:

     x = 1;
     N = 1;

     while (x <= no. of VCs open)
     {
             while (H(x).addr > max(N))
             {
                     N++;
                     if (N > no. of pairs in JOIN/LEAVE)
                             return(0);
             }

             if ((H(x).addr <= max(N)) &&
                 (H(x).addr >= min(N)))
                     perform_VC_update();
             x++;
     }
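The two procedures in this appendix can also be sketched in Python. This is a minimal illustration, not a definitive implementation. The names ip, punch_holes and vcs_to_update are hypothetical. The sketch expresses the MARS side as an interval intersection rather than the index1/index2 walk, and it assumes that <min,max> bounds are inclusive and that both the pair set and the VC table are sorted in ascending order.

```python
import ipaddress


def ip(addr):
    """Convert dotted-quad text to an integer (hypothetical helper)."""
    return int(ipaddress.IPv4Address(addr))


def punch_holes(pairs, emin, emax):
    """Clip the sorted (min, max) pairs of set P to the block [emin, emax].

    Assumption: bounds are inclusive. Pairs wholly outside the block are
    dropped, so the holes between the remaining pairs are preserved.
    """
    punched = []
    for pmin, pmax in pairs:
        if pmax < emin or pmin > emax:
            continue  # pair lies entirely outside the requested block
        punched.append((max(pmin, emin), min(pmax, emax)))
    return punched


def vcs_to_update(vc_addrs, pairs):
    """Host side: return the open-VC addresses covered by any pair.

    vc_addrs plays the role of table H; pairs is the <min,max> list
    carried by the incoming JOIN/LEAVE.
    """
    hits = []
    n = 0
    for addr in vc_addrs:
        while n < len(pairs) and addr > pairs[n][1]:
            n += 1
        if n == len(pairs):
            break  # no pairs left, so no further VC can match
        if pairs[n][0] <= addr <= pairs[n][1]:
            hits.append(addr)
    return hits


# Set P with one hole (224.0.1.0 - 224.0.1.255) for an MCS-served range.
P = [(ip("224.0.0.0"), ip("224.0.0.255")),
     (ip("224.0.2.0"), ip("239.255.255.255"))]
C = punch_holes(P, ip("224.0.0.128"), ip("224.0.2.10"))
```

Here C holds two pairs, one on each side of the hole. A block that falls entirely inside the hole yields an empty list, which corresponds to the null case in the pseudo-code.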
Appendix B. Minimising the impact of IGMP in IPv4 environments.

Implementing any part of this appendix is not required for conformance with this document. It is provided solely to document issues that have been identified.

The intent of section 5.1 is for cluster members to only have outgoing point to multipoint VCs when they are actually sending data to a particular multicast group. However, in most IPv4 environments the multicast routers attached to a cluster will periodically issue IGMP Queries to ascertain if particular groups have members. The current IGMP specification attempts to avoid having every group member respond by insisting that each group member wait a random period, and respond only if no other member has responded before it. The IGMP reply is sent to the multicast address of the group being queried.

Unfortunately, as it stands the IGMP algorithm will be a nuisance for cluster members that are essentially passive receivers within a given multicast group. It is just as likely that a passive member, with no outgoing VC already established to the group, will decide to send an IGMP reply - causing a VC to be established where there was no need for one. This is not a fatal problem for small clusters, but will seriously impair the ability of a cluster to scale.

The most obvious solution is for routers to use the MARS_GROUPLIST_REQUEST and MARS_GROUPLIST_REPLY messages, as described in section 8.5. This would remove the regular IGMP Queries, resulting in cluster members only sending an IGMP Report when they first join a group.

Alternative solutions do exist. One would be to modify the IGMP reply algorithm, for example:

     If the group member has a VC open to the group, proceed as per RFC 1112 (picking a random reply delay between 0 and 10 seconds).

     If the group member does not have a VC already open to the group, pick a random reply delay between 10 and 20 seconds instead, and then proceed as per RFC 1112.
If even one group member is sending to the group at the time the IGMP Query is issued then all the passive receivers will find the IGMP Reply has been transmitted before their delay expires, so no new VC is required. If all group members are passive at the time of the IGMP Query then a response will eventually arrive, but 10 seconds later than under conventional circumstances.
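The modified delay selection described above can be sketched as follows. This is only an illustration of the two delay ranges. The function name igmp_reply_delay and the has_open_vc parameter are hypothetical, and how the IGMP entity would learn the VC state is not addressed here.

```python
import random


def igmp_reply_delay(has_open_vc, rng=random):
    """Pick an IGMP reply delay in seconds under the modified scheme.

    Members already sending to the group use the 0-10 second range.
    Passive members back off into the 10-20 second range, so an active
    sender's reply normally suppresses theirs.
    """
    if has_open_vc:
        return rng.uniform(0.0, 10.0)
    return rng.uniform(10.0, 20.0)


# Example: a passive receiver always waits at least 10 seconds.
delay = igmp_reply_delay(has_open_vc=False)
```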
The preceding solution requires re-writing existing IGMP code, and implies the ability of the IGMP entity to ascertain the status of VCs on the underlying ATM interface. This is not likely to be available in the short term.

One short term solution is to provide something like the preceding functionality with a 'hack' at the IP/ATM driver level within cluster members. Arrange for the IP/ATM driver to snoop inside IP packets looking for IGMP traffic. If an IGMP packet is accepted for transmission, the IP/ATM driver can buffer it locally if there is no VC already active to that group. A 10 second timer is started, and if an IGMP Reply for that group is received from elsewhere on the cluster the timer is reset. If the timer expires, the IP/ATM driver then establishes a VC to the group as it would for a normal IP multicast packet.

Some network implementors may find it advantageous to configure a multicast server to support the group 224.0.0.1, rather than rely on a mesh. Given that IP multicast routers regularly send IGMP queries to this address, a mesh will mean that each router will permanently consume an AAL context within each cluster member. In clusters served by multiple routers the VC load within switches in the underlying ATM network will become a scaling problem.

Finally, if a multicast server is used to support 224.0.0.1, another ATM driver level hack becomes a possible solution to IGMP Reply traffic. The ATM driver may choose to grab all outgoing IGMP packets and send them out on the VC established for sending to 224.0.0.1, regardless of the Class D address the IGMP message was actually for. Given that all hosts and routers must be members of 224.0.0.1, the intended recipients will still receive the IGMP Replies. The negative impact is that all cluster members will receive the IGMP Replies.
Appendix C. Further comments on 'Clusters'.

The cluster concept was introduced in section 1 for two reasons. The better known term, Logical IP Subnet, is both very IP specific and constrained to unicast routing boundaries. As the architecture described in this document may be re-used in non-IP environments a more neutral term was needed. As the needs of multicasting are not always bound by the same scopes as unicasting, it was not immediately obvious that a priori limiting ourselves to LISs was beneficial in the long term.

It must be stressed that Clusters are purely an administrative construct. You choose their size (i.e. the number of endpoints that register with the same MARS) based on your multicasting needs, and the resource consumption you are willing to put up with. The larger the number of ATM attached hosts you require multicast support for, the more individual clusters you might choose to establish (along with multicast routers to provide inter-cluster traffic paths).

Given that not all the hosts in any given LIS may require multicast support, it becomes conceivable that you might assign a single MARS to support hosts from across multiple LISs. In effect you have a cluster covering multiple LISs, and have achieved 'cut through' routing for multicast traffic. Under these circumstances increasing the geographical size of a cluster might be considered a good thing.

However, practical considerations limit the size of clusters. Having a cluster span multiple LISs may not always be a 'win'. As the number of multicast capable hosts in your LISs increases it becomes more likely that you'll want to constrain a cluster's size and force multicast traffic to aggregate at multicast routers scattered across your ATM cloud.

Finally, multi-LIS clusters require a degree of care when deploying IP multicast routers. Under the Classical IP model you need unicast routers on the edges of LISs.
Under the MARS architecture you only need multicast routers at the edges of clusters. If your cluster spans multiple LISs, then the multicast routers will perceive themselves to have a single interface that is simultaneously attached to multiple unicast subnets. Whether this situation will work depends on the inter-domain multicast routing protocols you use, and your multicast router's ability to understand the new relationship between unicast and multicast topologies.

In the absence of further research in this area, networks deployed in conformance to this document MUST make their IP cluster and IP LIS coincide, so as to avoid these complications.
Appendix D. TLV list parsing algorithm.

The following pseudo-code represents how the TLV list format described in section 10 could be handled by a MARS or MARS client.

     list = (mar$extoff & 0xFFFC);

     if (list == 0) exit;

     list = list + message_base;

     while (list->Type.y != 0)
     {
         switch (list->Type.y)
         {
             default:
             {
                 if (list->Type.x == 0) break;

                 if (list->Type.x == 1) exit;

                 if (list->Type.x == 2) log-error-and-exit;
             }

             [...other handling goes here..]
         }

         list += (list->Length + 4 + ((4-(list->Length & 3)) % 4));
     }

     return;
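As an illustration, the walk above might look like this in Python. This is a sketch under stated assumptions, not a definitive parser. It assumes each TLV starts with a 16-bit Type field (x in the top 2 bits, y in the low 14 bits) followed by a 16-bit Length field, both in network byte order. The names parse_tlvs, tlv_stride and known are hypothetical. The case x == 3 is not covered by the pseudo-code and is skipped here in the same way as x == 0.

```python
import struct


def tlv_stride(length):
    """Offset from one TLV to the next: header plus value, padded to 4."""
    return length + 4 + ((4 - (length & 3)) % 4)


def parse_tlvs(message, extoff, known=frozenset()):
    """Walk the TLV list and return (type, value) for recognised types.

    'known' is the set of Type.y values this endpoint understands
    (hypothetical parameter). Unrecognised types follow Type.x.
    """
    offset = extoff & 0xFFFC
    if offset == 0:
        return []  # no TLV list present
    found = []
    while offset + 4 <= len(message):
        type_field, length = struct.unpack_from("!HH", message, offset)
        x, y = type_field >> 14, type_field & 0x3FFF
        if y == 0:
            break  # null TLV terminates the list
        if y in known:
            found.append((y, message[offset + 4:offset + 4 + length]))
        elif x == 1:
            # "exit": stop the walk. Returning what was collected so
            # far is a choice made for this sketch only.
            break
        elif x == 2:
            raise ValueError("unsupported TLV type %d" % y)
        # x == 0 (and, by assumption, x == 3): skip this TLV
        offset += tlv_stride(length)
    return found
```

For example, a 3 octet value occupies 8 octets in the list (4 of header, 3 of value, 1 of padding), so tlv_stride(3) is 8.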
Appendix E. Summary of timer values.

This appendix summarises various timers or limits mentioned in the main body of the document. Values are specified in the following format: [x, y, z] indicating a minimum value of x, a recommended value of y, and a maximum value of z. A '-' will indicate that a category has no value specified. Values in minutes are followed by 'min', values in seconds are followed by 'sec'.

   Idle time for MARS - MARS client pt to pt VC:
           [1 min, 20 min, -]

   Idle time for multipoint VCs from client:
           [1 min, 20 min, -]

   Allowed time between MARS_MULTI components:
           [-, -, 10 sec]

   Initial random L_MULTI_RQ/ADD retransmit timer range:
           [5 sec, -, 10 sec]

   Random time to set VC_revalidate flag:
           [1 sec, -, 10 sec]

   MARS_JOIN/LEAVE retransmit interval:
           [5 sec, 10 sec, -]

   MARS_JOIN/LEAVE retransmit limit:
           [-, -, 5]

   Random time to re-register with MARS:
           [1 sec, -, 10 sec]

   Force wait if MARS re-registration is looping:
           [1 min, -, -]

   Transmission interval for MARS_REDIRECT_MAP:
           [1 min, 1 min, 2 min]

   Limit for client to miss MARS_REDIRECT_MAPs:
           [-, -, 4 min]
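An implementation might hold these values in a table and check configured timers against them. The following sketch is illustrative only. The dictionary keys and the function in_range are hypothetical names, every time is converted to seconds, and None stands for '-'. The MARS_JOIN/LEAVE retransmit limit is a count rather than a time, so it is left out.

```python
# (minimum, recommended, maximum) in seconds; None means "not specified".
TIMERS = {
    "mars_client_vc_idle": (60, 1200, None),
    "multipoint_vc_idle": (60, 1200, None),
    "mars_multi_component_gap": (None, None, 10),
    "l_multi_retransmit_random": (5, None, 10),
    "vc_revalidate_random": (1, None, 10),
    "join_leave_retransmit_interval": (5, 10, None),
    "mars_reregister_random": (1, None, 10),
    "reregister_loop_wait": (60, None, None),
    "redirect_map_interval": (60, 60, 120),
    "redirect_map_miss_limit": (None, None, 240),
}


def in_range(name, seconds):
    """Return True if 'seconds' respects the bounds listed for 'name'."""
    low, _recommended, high = TIMERS[name]
    if low is not None and seconds < low:
        return False
    if high is not None and seconds > high:
        return False
    return True
```

For instance, a MARS_REDIRECT_MAP interval of 90 seconds is within bounds, while 30 seconds is below the 1 minute minimum.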
Appendix F. Pseudo code for MARS operation.

Implementations are entirely free to comply with the body of this memo in any way they see fit. This appendix is purely for possible clarification.

A MARS implementation might be built along the lines suggested in this pseudo-code.

1. Main

1.1 Initialization

  Define a server list as the list of leaf nodes on ServerControlVC.
  Define a cluster list as the list of leaf nodes on ClusterControlVC.
  Define a host map as the list of hosts that are members of a group.
  Define a server map as the list of hosts (MCSs) that are serving a group.
  Read config file.
  Allocate message queues.
  Allocate internal tables.
  Set up passive open VC connection.
  Set up redirect_map timer.
  Establish logging.

1.2 Message Processing

  Forever {
    If the message has a TLV then {
      If TLV is unsupported then {
        process as defined in TLV type field.
      } /* unknown TLV */
    } /* TLV present */
    Place incoming message in the queue.
    For (all messages in the queue) {
      If the message is not a JOIN/LEAVE/MSERV/UNSERV with
         mar$flags.register == 1 then {
        If the message source is (not a member of server list) &&
           (not a member of cluster list) then {
          Drop the message silently.
        }
      }
      If (mar$pro.type is not supported) or
         (the ATM source address is missing) then {
        Continue.
      }
      Determine type of message.
      If an ERR_L_RELEASE arrives on ClusterControlVC then {
        Remove the endpoint's ATM address from all groups which it has joined.
        Release the CMI.
        Continue.
      } /* error on CCVC */
      Call specific message handling routine.
      If redirect_map timer pops {
        Call MARS_REDIRECT_MAP message handling routine.
      } /* redirect timer pop */
    } /* all msgs in the queue */
  } /* forever loop */

2. Message Handler

2.1 Messages:

- MARS_REQUEST

  Indicate no MARS_MULTI support of TLV.
  If the supported TLV is not NULL then {
    Indicate MARS_MULTI support of TLV.
    Process as required.
  } else { /* TLV NULL */
    Indicate message to be sent on Private VC.
    If the message source is a member of server list then {
      If the group has a non-null host map then {
        Call MARS_MULTI with the host map for the group.
      } else { /* no group */
        Call MARS_NAK message routine.
      } /* no group */
    } else { /* source is cluster list */
      If the group has a non-null server map then {
        Call MARS_MULTI with the server map for the group.
      } else { /* cluster member but no server map */
        If the group has a non-null host map then {
          Call MARS_MULTI with the host map for the group.
        } else { /* no group */
          Call MARS_NAK message routine.
        } /* no group */
      } /* cluster member but no server map */
    } /* source is a cluster list */
  } /* TLV NULL */
  If a message exists then {
    Send message as indicated.
  }
  Return.

- MARS_MULTI

  Construct a MARS_MULTI for the specified map.
  If the param indicates TLV support then {
    Process the TLV as required.
  }
  Return.

- MARS_JOIN

  If (mar$flags.copy != 0) silently ignore the message.
  If more than a single <min,max> pair is specified then
    silently ignore the message.
  Indicate message to be sent on private VC.
  If (mar$flags.register == 1) then {
    If the node is already a registered member of the cluster
       associated with protocol type then { /* previous register */
      Copy the existing CMI into the MARS_JOIN.
    } else { /* new register */
      Add the node to ClusterControlVC.
      Add the node to cluster list.
      mar$cmi = obtain CMI.
    } /* new register */
  } else { /* not a register */
    If the group is a duplicate of a previous MARS_JOIN then {
      mar$msn = current csn.
      Indicate message to be sent on Private VC.
    } else {
      Indicate no message to be sent.
      If the message source is in server map then {
        Drop the message silently.
      } else {
        If the first <min,max> encompasses any group with
           a server map then {
          Call the Modified JOIN/LEAVE Processing routine.
        } else {
          If the MARS_JOIN is for a multi group then {
            Call the MultiGroup JOIN/LEAVE Processing Routine.
          } else {
            Indicate message to be sent on ClusterControlVC.
          } /* not for a multi group */
        } /* group not handled by server */
      } /* msg src not in server map */
      Update internal tables.
    } /* not a duplicate */
  } /* not a register */
  If a message exists then {
    mar$flags.copy = 1.
    Send message as indicated.
  }
  Return.

- MARS_LEAVE

  If (mar$flags.copy != 0) silently ignore the message.
  If more than a single <min,max> pair is specified then
    silently ignore the message.
  Indicate message to be sent on ClusterControlVC.
  If (mar$flags.register == 1) then { /* deregistration */
    Update internal tables to remove the member's ATM addr
      from all groups it has joined.
    Drop the endpoint from ClusterControlVC.
    Drop the endpoint from cluster list.
    Release the CMI.
    Indicate message to be sent on Private VC.
  } else { /* not a deregistration */
    If the group is a duplicate of a previous MARS_LEAVE then {
      mar$msn = current csn.
      Indicate message to be sent on Private VC.
    } else {
      Indicate no message to be sent.
      If the first <min,max> encompasses any group with
         a server map then {
        Call the Modified JOIN/LEAVE Processing routine.
      } else {
        If the MARS_LEAVE is for a multi group then {
          Call the MultiGroup JOIN/LEAVE Processing Routine.
        } else {
          Indicate message to be sent on ClusterControlVC.
        }
      }
      Update internal tables.
    } /* not a duplicate */
  } /* not a deregistration */
  If a message exists then {
    mar$flags.copy = 1.
    Send message as indicated.
  }
  Return.

- MARS_MSERV

  If (mar$flags.register == 1) then { /* server register */
    Add the endpoint as a leaf node to ServerControlVC.
    Add the endpoint to the server list.
    Indicate the message to be sent on Private VC.
    mar$cmi = 0.
  } else { /* not a register */
    If the source has not registered then {
      Drop and ignore the message.
      Indicate no message to be sent.
    } else { /* source is registered */
      If MCS is already member of indicated server map {
        Indicate message to be sent on Private VC.
        mar$flags.layer3grp = 0;
        mar$flags.copy = 1.
      } else { /* New MCS to add. */
        Add the server ATM addr to server map for group.
        Indicate message to be sent on ServerControlVC.
        Send message as indicated.
        Make a copy of the message.
        Indicate message to be sent on ClusterControlVC.
        If new server map was just created {
          Construct MARS_MIGRATE, with MCS as target.
        } else {
          Change the op code to MARS_JOIN.
          mar$flags.layer3grp = 0.
          mar$flags.copy = 1.
        } /* new server map */
      } /* New MCS to add. */
    } /* source is registered */
  } /* not a register */
  If a message exists then {
    Send message as indicated.
  }
  Return.

- MARS_UNSERV

  If (mar$flags.register == 1) then { /* deregister */
    Remove the ATM addr of the MCS from all server maps.
    If a server map becomes null then delete it.
    Remove the endpoint as a leaf of ServerControlVC.
    Remove the endpoint from server list.
    Indicate the message to be sent on Private VC.
  } else { /* not a deregister */
    If the source is not a member of server list then {
      Drop and ignore the message.
      Indicate no message to be sent.
    } else { /* source is registered */
      If MCS is not member of indicated server map {
        Indicate message to be sent on Private VC.
        mar$flags.layer3grp = 0;
        mar$flags.copy = 1.
      } else { /* MCS existed, must be removed. */
        Remove ATM addr of the MCS from indicated server map.
        If a server map is null then delete it.
        Indicate the message to be sent on ServerControlVC.
        Send message as indicated.
        Make a copy of the message.
        Change the op code to MARS_LEAVE.
        Indicate message (copy) to be sent on ClusterControlVC.
        mar$flags.layer3grp = 0;
        mar$flags.copy = 1.
      } /* MCS existed, must be removed. */
    } /* source is registered */
  } /* not a deregister */
  If a message exists then {
    Send message as indicated.
  }
  Return.

- MARS_NAK

  Build command.
  Return.

- MARS_GROUPLIST_REQUEST

  If (mar$pnum != 1) then Return.
  Call MARS_GROUPLIST_REPLY with the range and output VC.
  Return.

- MARS_GROUPLIST_REPLY

  Build command for specified range.
  Indicate message to be sent on specified VC.
  Send message as indicated.
  Return.

- MARS_REDIRECT_MAP

  Include the MARS's own address in the message.
  If there are backup MARSs then include their addresses.
  Indicate MARS_REDIRECT_MAP is to be sent on ClusterControlVC.
  Send message back as indicated.
  Return.
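The map selection performed by the MARS_REQUEST handler (for the case where no TLV is involved) can be sketched as a small function. This is an illustration only. The name select_reply and the use of dictionaries keyed by group for the host maps and server maps are assumptions made for this sketch.

```python
def select_reply(group, source, server_list, host_maps, server_maps):
    """Choose the reply to a MARS_REQUEST for 'group' from 'source'.

    An MCS (a member of server_list) is given the host map. A cluster
    member is given the server map if one exists, otherwise the host
    map. An empty result becomes a MARS_NAK.
    """
    hosts = host_maps.get(group, [])
    servers = server_maps.get(group, [])
    if source in server_list:
        members = hosts
    else:
        members = servers or hosts
    if members:
        return ("MARS_MULTI", list(members))
    return ("MARS_NAK", [])


# Example: group "g1" is served by an MCS, group "g2" is a plain mesh.
host_maps = {"g1": ["hostA", "hostB"], "g2": ["hostC"]}
server_maps = {"g1": ["mcs1"]}
reply = select_reply("g1", "hostA", {"mcs1"}, host_maps, server_maps)
```

In this example the cluster member hostA is pointed at mcs1, whereas the same request issued by mcs1 would return the host map for the group.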
3. Send Message Handler

  If (the message is going out ClusterControlVC) &&
     (a new csn is required) then {
    mar$msn = obtain a CSN
  }
  If (the message is going out ServerControlVC) &&
     (a new ssn is required) then {
    mar$msn = obtain a SSN
  }
  Return.

4. Number Generator

4.1 Cluster Sequence Number

  Generate the next sequence number.
  Return.

4.2 Server Sequence Number

  Generate the next sequence number.
  Return.

4.3 CMI

  CMIs are allocated uniquely per registered cluster member within the context of a particular layer 3 protocol type. A single node may register multiple times if it supports multiple layer 3 protocols. The CMIs allocated for each such registration may or may not be the same.

  Generate a CMI for this protocol.
  Return.

5. Modified JOIN/LEAVE Processing

  This routine processes JOIN/LEAVE when a server map exists.

  Make a copy of the message.
  Change the type of the copy to MARS_SJOIN.
  If the message is a MARS_LEAVE then {
    Change the type of the copy to MARS_SLEAVE.
  }
  mar$flags.copy = 1 (copy).
  Hole punch the <min,max> group by excluding from the range
    those groups which the joining (leaving) node is already
    (still) a member of
    due to it having previously issued a single group join.
  Indicate the message to be sent on ServerControlVC.
  If the message (copy) contains one or more <min,max> pairs {
    Send message (copy) as indicated.
  }
  mar$flags.punched = 0 in the original message.
  Indicate the message to be sent on Private VC.
  Send message (original) as indicated.
  Hole punch the <min,max> group by excluding from the range
    those groups that are served by MCSs or which the joining
    (leaving) node is already (still) a member of due to it
    having previously issued a single group join.
  Indicate the (original) message to be sent on ClusterControlVC.
  If (number of holes punched > 0) then { /* punched holes */
    In original message do {
      mar$flags.punched = 1.
      old punched list <- new punched list.
    }
  } /* punched holes */
  mar$flags.copy = 1.
  Send message as indicated.
  Return.

5.1 MultiGroup JOIN/LEAVE Processing

  This routine processes JOIN/LEAVE when a multi group exists.

  If (mar$flags.layer3grp) {
    Ignore this setting, consider it reset.
  }
  mar$flags.copy = 1.
  Make a copy of the message.
  From the copy hole punch the <min,max> group by excluding
    from the range those groups that this node has already
    joined or left.
  If (number of holes punched > 0) then {
    mar$flags.punched = 0 in original message.
    Indicate original message to be sent on Private VC.
    Send original message as indicated.
    mar$flags.punched = 1 in copy message.
    old group range <- new punched list.
    Indicate message to be sent on ClusterControlVC.
    Send copy of message as indicated.
  } else {
    Indicate message to be sent on ClusterControlVC.
    Send original message as indicated.
  } /* no holes punched */
  Return.
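The hole punching step used by the MultiGroup routine, which removes groups the node has already joined from a <min,max> block, might be sketched as follows. This is an illustration under stated assumptions. The name punch_joined is hypothetical, addresses are treated as plain integers, bounds are inclusive, and 'joined' holds the single groups previously joined by the node.

```python
def punch_joined(gmin, gmax, joined):
    """Split [gmin, gmax] into pairs that skip every address in 'joined'.

    Returns the punched list of (min, max) pairs. If nothing in
    'joined' falls inside the block, the block is returned unchanged
    as a single pair.
    """
    pairs = []
    start = gmin
    for addr in sorted(a for a in set(joined) if gmin <= a <= gmax):
        if addr > start:
            pairs.append((start, addr - 1))  # range below this hole
        start = addr + 1
    if start <= gmax:
        pairs.append((start, gmax))  # remainder above the last hole
    return pairs


# Example: block 10..20 with groups 12, 13 and 20 already joined.
punched = punch_joined(10, 20, [12, 13, 20, 30])
```

In this example punched is [(10, 11), (14, 19)]. In the terms of the routine above, holes were punched, so the original message would go out on the Private VC and the copy carrying the punched list on ClusterControlVC.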