INTERNET-DRAFT                                                  L. Coene
Internet Engineering Task Force                                  Siemens
Issued:  February 2002
Expires: July 2002

                              Multirouting

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC2026. Internet-Drafts are working
documents of the Internet Engineering Task Force (IETF), its areas, and
its working groups. Note that other groups may also distribute working
documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

Abstract

This document describes a way to loadshare the different paths of a
multihomed SCTP association at the same moment while keeping congestion
control per path. The document also describes a possible solution to
multihoming which would require no routing tables on the host and which
would try to guarantee non-overlapping multihomed paths. It could
possibly reduce the growth of the routing table in a router. The
selection of which link to take would be a local one. The solution is
similar to the use of links and linksets within a routeset in SS7.

Table of Contents

Chapter 1: Introduction
Chapter 2: Loadsharing within a SCTP association
Chapter 3: Multirouting packets in networks
Chapter 4: Considerations
Chapter 5: Security considerations
Chapter 6: References and related work
Chapter 7: Acknowledgments
Chapter 8: Author's address

Editors note: this draft is going to be split up in 2 parts:
- loadsharing within SCTP
- multirouting packets in IP networks (depending on whether this
  technology already exists)

1 Introduction

Multihoming has the potential to solve some of the Quality-of-Service
(QoS), resilience and reliability problems that exist nowadays in the
Internet. In order to solve these problems, multihoming must be able to
use all the paths present in a single association at the same time, in
parallel. The SCTP specification [RFC2960] only allows a single
(= primary) path to be active at any given moment. Only when this path
experiences trouble (such as no transmission being possible) will
another path be used for the transmission of the messages. This draft
is an attempt to improve this behaviour.

2 Loadsharing within an SCTP association on the host

A multihomed SCTP association on a host always has more than one path
over which to send its traffic. The number of paths depends on the
number of IP addresses exchanged during the setup of the association.
As each path can have different transmission characteristics (such as
delay, bandwidth, jitter, etc.), separate congestion control processing
must be done for each path. (Note: in future, IP addresses may be added
and removed "on-the-fly" during the active lifetime of the association;
this amounts to adding paths to and removing paths from the association
[ADDIP].)

At present, the congestion control information is already kept per path
as is required in [RFC2960].
The information is updated for the primary path by the flow of the
traffic and for the alternative paths by exchanging heartbeat messages.
However, the heartbeat timer can be very different from the timers used
for per-path congestion control and retransmission, thus rendering the
information from the heartbeat useless. Congestion control information
concerning a single path decays if no traffic is sent over that path.
To keep the congestion information up to date, heartbeats must be sent
at intervals in the same range as the congestion control timings, which
may place a burden of not-so-useful messages (they are NOT carrying
data) on the alternate paths.

For each path within the association, a separate congestion control
window is to be specified within the transport protocol, as the
congestion control characteristics (for example the RTT) may (and will)
be different for every path. This will lead to separate congestion
control per path. Each path should be seen (in TCP terms) as a separate
TCP connection, with each TCP connection having a different path/route
through the network.

If all paths are in use (assuming enough traffic is sent/received),
then the congestion control information for every path will remain up
to date. This makes a changeover smoother, because the traffic of the
failed path can be distributed over all the remaining active paths. The
present SCTP changeover works as follows: one path is active, all
others are in standby, and a changeover moves the traffic from the
previously single active path to a single standby path.

The scheme also allows the endpoints to choose whether all paths will
be active in parallel or whether there will be some standby paths in
addition to the active paths.

When all paths are in use, it is up to some form of distributor
function in SCTP to distribute the traffic across the different paths.
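
As an illustration only, the combination of per-path congestion state
and a distributor could be sketched as follows. The names Path,
pick_path and send, and the "most free window" policy, are assumptions
made for this sketch; neither this draft nor [RFC2960] defines them.

```python
from dataclasses import dataclass


@dataclass
class Path:
    """Per-path congestion state (field names are illustrative)."""
    address: str             # destination transport address of this path
    cwnd: int                # congestion window of this path, in bytes
    outstanding: int = 0     # bytes sent on this path, not yet acknowledged
    active: bool = True      # False for standby or failed paths

    def room(self) -> int:
        """Bytes that may still be sent without exceeding cwnd."""
        return max(self.cwnd - self.outstanding, 0)


def pick_path(paths, size):
    """Hypothetical distributor policy: among the active paths that can
    still take `size` bytes, choose the one with the most free window."""
    candidates = [p for p in paths if p.active and p.room() >= size]
    if not candidates:
        return None          # every usable path is window-limited
    return max(candidates, key=Path.room)


def send(paths, size):
    """Charge a message of `size` bytes to the chosen path only, so the
    congestion state of the other paths is left untouched."""
    path = pick_path(paths, size)
    if path is not None:
        path.outstanding += size
    return path
```

A caller would keep one Path object per destination address of the
association and call send() for every outgoing message. Marking a path
as inactive models a standby or failed path, and its traffic then
spreads over the remaining active paths. Other policies (target rate,
minimal delay or jitter) would replace pick_path only.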
The distributor function is an implementation-dependent function which
can have different, sometimes conflicting goals. For example, one
distributor can try to obtain a certain message transfer rate across
the complete association, while another kind of distributor can try to
load all paths up to maximum capacity, with all paths doing SCTP/TCP
friendly congestion control. Other distributors may try to minimise the
delay or jitter. For that they would need some feedback from the remote
side on top of the already existing SCTP congestion control mechanism.
If that is the case, then an SCTP extension may be needed.

An SCTP implementation which does NOT support parallel usage of its
paths must be able to communicate with an implementation which does
support this. As no new additions to the SCTP protocol are required,
that would mean that an SCTP full-path implementation (meaning all
paths are used in parallel) would NOT break an SCTP single-path
implementation. The single-path implementation will send the SACK for
the received messages to the source address of those messages. If a
SACK is sent back spanning multiple paths, the congestion control
information of each of the paths will be updated per RFC2960.

At present, the application can do this by specifying the primary path
before sending a message to SCTP.

3 Multirouting messages in the network

In order to obtain the greatest advantages of multihoming, the paths
within an association should be as distinct as possible. This cannot
always be guaranteed, for example due to problems occurring in
networks.

As a path is really a collection of subsequent nodes and links between
nodes, a path selection at the host really means taking a certain link
towards a node. At the next hop a link is selected using the routing
information present in the packet (now the IP address). Multiple links
can route towards the same, required destination. The way in which
these links are selected can be diverse.
In present IP networks, every path has a distinct IP address, thus the
complete IP address (or rather, its prefix) becomes the link selector.
Because there are a lot of prefixes in the network, that would mean
that there are a lot of link selections to be made, increasing the size
of the routing/selection tables. This is however what is now happening
with the present IP multihoming architecture.

In order to keep the present multihoming solutions working, the
proposed solution should not adversely impact the present multihoming
architecture (using different IP addresses for each path).

The solution should allow for the selection of the link on which to
send out the message. The selection criterion can be contained:
- in the IP address (example: IPv4/v6 address prefix...)
- outside the IP address (example: IPv6 flow label, IPv4 TOS field)

This would also leave the transport layer in control of which path/link
to send the message out on, thus preserving the end-to-end principle.

A link selection (LS) parameter would be used in the IP network layer
to specify the path to be taken in the host; this would implicitly
specify the outgoing interface/link on the host.

On the transport level, a separate congestion control window is to be
specified within the transport protocol for each link selection, as the
congestion control characteristics (for example the RTT) may (and will)
be different for every path. This will lead to separate congestion
control per link selection and implicitly per path. However, there is
then the requirement for routers to do something to keep messages with
the same (destination and source) IP address and link selection on the
same path (see paragraph on routers).
In order to limit the number of congestion control windows in the
transport layer on the host, an upper limit may be specified on the
link selection field (for example 16), so that in this example the
transport TCB would store at most 16 congestion control windows. If
fewer than the maximal number of LS values are used, then not all
possible paths may be used during message transmission. (Example: only
LS 0..4 is used because the host has only 5 interfaces; if somewhere in
a router within the network more than 5 links lead to the same prefix,
the 6th and higher links will never be used by the traffic of this
association.)

It is not envisioned to make this LS parameter a negotiated feature
between the endpoints, as an endpoint has no view whatsoever of the
number of links associated with a prefix at a router and thus may be
underutilising the number of links available on its path.

The possible link selection choices are detailed in the following
paragraphs.

3.1 A part of the IP address used as link selection

The selection based on the IP address can be done in 2 ways: either use
the most significant part of the IP address or use the least
significant part of the IP address to make a distinction between 2 or
more links.

The present way is to use the most significant part of the IP address
(better called a different prefix). Some of the more disturbing
features of this solution are described in [SCTPMULTI] and [DRSCN2000].

An alternative way is to use the least significant part of the IP
address, meaning that the node would be addressed via a single IP
address where, for example, the last 4 bits indicate which link the
message has to go out on. The prefix used to send to this destination
would be advertised in all of the networks this host is attached to,
and the routers would allow routing to this destination. This has
particular difficulties which will be described in the following
paragraphs.
The least significant 4 bits allow for up to 16 different links to be
used. If more links are needed, then the number of bits may be
increased, but then the selection field will become variable from host
to host and will generate more problems on the host and in the network.
Therefore 4 bits as a fixed length is advisable, and from the
experience of other networks which use similar technologies, 16 is a
good upper limit. If fewer than 16 links are available, then the bits
may remain unused by the host or can be mapped by the host onto the
links actually present. The bits themselves are not changed by the
host; they can be used further down in the network for selecting a link
towards the destination. Example of a mapping with 2 links: 0 selects
link 0, 1 selects link 1, 2 selects link 0, 3 selects link 1, etc.

The link selection (LS) bits (or whatever name is suited) must be used
to specify the path to be followed, as otherwise transport layer
congestion control algorithms may go haywire. If a router has more than
one link towards a certain destination, and the message travels through
a number of routers with this capability, that would necessarily mean
that there are (at least in theory) an infinite number of paths towards
the destination which can be in use at the same time. This might be a
problem for the present congestion control algorithms in TCP and SCTP
(this has not yet been exhaustively researched, so this is at this
present moment a typical research issue, see chapter x).

The congestion control in the transport layer is done on a per-path
basis. If the link selection is always used to select the same link
from router to router (except in the case where the link failed and
other links have to take over the traffic), that would mean that
traffic for an address (prefix) with a certain link selection would
take the same path through the network, giving the classic SCTP (and
TCP) congestion control algorithms their chance to do their job.
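
The mapping example above can be sketched as follows. This is a minimal
illustration, assuming that the LS occupies the 4 least significant
bits of the destination address and that the host wraps the LS value
around its available links (LS modulo the number of links). The
function names and the wrap-around rule are assumptions read from the
two-link example, not a defined algorithm.

```python
import ipaddress

LS_BITS = 4  # fixed LS width advised in the text (16 values)


def link_selector(address: str) -> int:
    """Return the LS value carried in the least significant bits of an
    IPv4 or IPv6 address. The address itself is never modified."""
    return int(ipaddress.ip_address(address)) & ((1 << LS_BITS) - 1)


def outgoing_link(address: str, n_links: int) -> int:
    """Map the LS onto one of n_links local links by wrapping around,
    as in the two-link example (0 -> 0, 1 -> 1, 2 -> 0, 3 -> 1, ...)."""
    if n_links < 1:
        raise ValueError("at least one link is required")
    return link_selector(address) % n_links
```

Because the result depends only on the address and the local number of
links, every message of an association that carries the same LS leaves
on the same link, which is the property the per-path congestion control
relies on. A node could also fold the source address into the choice,
which this sketch does not do.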
The LS must not be changed by a router, as that would change the path
taken through the network (contrary to SLS rotation in ANSI SS7
networks).

The result of this concept is that we get at most 2^n paths (n = number
of LS bits used) through the network, which would also limit the
maximal number of congestion control variable sets used in SCTP. This
is much more manageable than an infinite number of paths through the
network.

If a changeover occurred, then the traffic of the failed link would be
moved to another link, congestion would (surely) occur, and the
congestion control algorithm would deal with it by reducing the
traffic. If the number of links is at most 50% of the number of link
selection combinations, the transient effects of the changeover are
eased. Example: LS = 4 bits gives 16 combinations; with 8 links and a
random distribution, every link carries the traffic of 2 LS values. If
a link fails, its 2 LS values each go to a different link, so each
takeover link carries 3 LS values instead of 2, a traffic increase of
50% (the takeover link has its own traffic from 2 LS values to carry
and gets the extra traffic of 1 LS value).

It is advisable to include the destination and source IP address in the
link selection algorithm. This would distribute the traffic more evenly
over the active links of the host or router. It should be noted that
such algorithms are implementation dependent and would not be the same
on all routers.

3.2 Impact of link selection on SCTP

3.2.1 LS using IP address

If an LS uses the most significant part of the IP address, then for
every LS there is a different IP address (with a different prefix, of
course). This allows the classical use of SCTP, as SCTP at this present
moment uses multihoming by specifying the different IP addresses.

If an LS uses the least significant part of the IP address, then, as in
the previous case, there is a different IP address for every LS
(however now with the same prefix).
This will still allow the classical use of SCTP.

3.2.2 LS outside the IP address

If a field outside the IP address is used, then changes may be required
to SCTP for transporting the different path selector (= link selector)
between the 2 end nodes.

Editors note:
- take a look at OSPF, which may have a similar feature for routes with
  different metrics (1 versus 2, traffic is distributed 66% - 33% or
  another distribution). See paragraphs on equal-cost multipaths in the
  OSPF spec.
- take a look at the virtual router redundancy WG

4 Considerations

The solution proposed has shown its merits in SS7 networks, where it is
heavily used. The reason why it works has to do with the transactional
nature of the messages flowing through an SS7 network. That means that
congestion occurs only occasionally and not, as in internets,
continuously.

The following extreme cases may happen when this scheme is put into
operation:

Congestion control

- Negative: every message will follow a different path towards the
remote end, and it will be very difficult for the end-to-end congestion
control of SCTP (and TCP) to do its job properly.

Up till this moment, SCTP executes its congestion control algorithm
across the complete association with the explicit notion that only one
source-destination transport address pair (= a single path) out of a
bunch of multihomed addresses (= paths) is used for the data transfer.
Thus the congestion control is in fact only active on the active path
(and there is only one active path allowed according to [RFC2960]).
Lost SACKs may take the alternate paths, but these should be regarded
as exceptions, not the rule.

It could therefore be very interesting to use and study SCTP with load
distribution across all its paths, to see whether really expanding the
congestion control across all paths of the association would break
end-to-end congestion control (or not) and augment the throughput (or
not).
It would at least give a clue whether end-to-end congestion control
would continue to work in an environment where both hosts and routers
would have multiple routes with loadsharing at the same time towards a
certain destination/network.

If all paths are in use (and no selection mechanism is used) and
messages get dropped along one path, then congestion control will kick
in and reduce traffic, not only for that path but for all paths. That
means that traffic is reduced on the other paths even if there was no
congestion on those paths, so the throughput will be reduced
significantly. This would mean that the case without path selection
always has less throughput than the case with path selection.

- Positive: the positive case should be the reverse of the negative
case. End-to-end congestion control would then react to the congestion
of the network as a whole and not to the congestion of the links and
routers along a certain path. That would also indicate that the
throughput could be higher (not linear with the number of links, but
better). It would also put spare capacity (if it existed in a network)
to better use.

Addresses

Some address classes may simply not be suited for this approach, as the
last bits of the address are assigned at the factory and thus may clash
with addresses of other interfaces in the same host or router.

Routing protocols

Routing protocols should be distributing prefixes according to
routesets and not to links. A routeset may consist of one or more
links. If there are 2 or more links in a routeset for a certain
destination and one link went down, then traffic would still flow
uninterrupted across the other link or take a completely different
route (and it would depend on the amount of traffic that went through
this link or router).
This would also mean that the flow throughput of the complete
association would be reduced, not cut off, thus giving end-to-end
congestion control algorithms a better chance to react, and the
changeover would be far less disruptive. The routing protocols would
then try to find an alternative link and add it to the routeset, or
wait till the failed link or router gets back into operation. This
would be far less disruptive for any traffic coming through the
neighbourhood.

Different path characteristics

If a stream with in-sequence delivery is required by SCTP, splitting
the traffic up between 2 or more paths (with radically different
transmission characteristics such as short versus long delays) may lead
to large SACKs, due to the large number of Gap reports....

Editors note: elaborate on this further in the next version...

5 Security considerations

To be completed.

6 References and related work

[RFC2960]   Stewart, R. R., Xie, Q., Morneault, K., Sharp, C.,
            Schwarzbauer, H. J., Taylor, T., Rytina, I., Kalla, M.,
            Zhang, L. and Paxson, V., "Stream Control Transmission
            Protocol", RFC2960, October 2000.

[ROUTER]    Draves, R., "Default router preferences and more-specific
            routes", draft-ietf-ipngwg-router-selection-00.txt, work in
            progress

[INGRES]    Draves, R., "Ingress filtering, Site multihoming and source
            address selection",
            draft-draves-ipngwg-ingress-filtering-00.txt, work in
            progress

[ADDRSEL]   Draves, R., "Default Address selection for IPv6",
            draft-ietf-ipngwg-default-addr-select-00.txt, work in
            progress

[SCTPMULTI] Coene, L. (Ed.), "Multihoming issues in the Stream Control
            Transmission Protocol", draft-coene-sctp-multihome-03.txt,
            work in progress

[DRSCN2000] http://www.sctp.de/papers/drcn2000.pdf

7 Acknowledgments

The authors wish to thank M. Tuexen, ... and many others for their
invaluable comments.

8 Author's address

Lode Coene              Phone: +32-14-252081
Siemens Atea            EMail: lode.coene@siemens.atea.be
Atealaan 34
B-2200 Herentals
Belgium