INTERNET DRAFT Chitra Venkatramani Expires October 1997 IBM T.J.Watson Research Center draft-chitra-rether-00.txt Tzi-cker Chiueh SUNY at Stony Brook April 1997 RETHER : A Protocol for Real-Time Traffic Support on Ethernet ------------------------------------------------------------- Status of this memo This document is an Internet Draft. Internet Drafts are working documents of the Internet Engineering Task Force (IETF), its Areas, and its Working Groups. Note that other groups may also distribute working documents as Internet Drafts. Internet Drafts are draft documents valid for a maximum of six months. Internet Drafts may be updated, replaced, or obsoleted by other documents at any time, It is not appropriate to use Internet Drafts as reference material or to cite them other than as a "working draft" or "work in progress". Please check the I-D abstract listing contained in each Internet Draft directory to learn the current status of this or any other Internet Draft. Abstract -------- Distributed multimedia applications require performance guarantees from the underlying network subsystem. Ethernet has been the dominant local area network architecture in the last decade, and we believe that it will remain popular because of its cost-effectiveness and the availability of higher-bandwidth Ethernets. We present the design of a software-based timed-token protocol called RETHER that provides real-time performance guarantees to multimedia applications without requiring any modifications to existing Ethernet hardware. RETHER features a hybrid mode of operation to reduce the performance impact on non-real-time network traffic, a race-condition-free distributed admission control mechanism, and an efficient token-passing scheme that protects the network against token loss due to node failures or otherwise. Performance measurements from a prototype running on 10-Mbps Ethernet and 100-Mbps Fast-Ethernet indicate that up to 60 of the raw bandwidth can be reserved without deteriorating the performance of non-real-time traffic significantly. This scheme is consistent with the ISSLL over LANs framework described in [7]. Venkatramani, Chiueh Expires Oct. 1997 [Page 1] INTERNET DRAFT RETHER April 1997 1. Introduction --------------- Distributed multimedia applications demand time-constrained data delivery, which requires resource guarantees from underlying networking and I/O systems. Although Ethernet performs well for non-real-time traffic, providing bandwidth guarantees on an Ethernet is problematic for two reasons. The first reason is that the Ethernet uses the CSMA/CD protocol [1]. The CSMA/CD protocol uses an exponential backoff algorithm to handle collision on the channel while attempting to transmit a packet. Due to the probabilistic nature of the exponential backoff algorithm, channel access becomes nondeterministic, which is not suitable for real-time applications. The second reason is that the native Ethernet does not prioritize packets. This is undesirable for real-time systems in which important packets should not be held up waiting for unimportant packets. This problem is addressed by the IEEE Standards Committee which has been working on the 802.1p [10], a standard for multiple traffic classes to be handled by bridges and switches. RETHER [4,5] has been designed to address the above problems and to provide deterministic and periodic access to the network for multimedia applications. Because of the enormous popularity of Ethernet, it is essential to be able to support real-time services using existing Ethernet infrastructure, requiring minimal if any hardware changes. Hence, the RETHER protocol has been built into the network device driver in the operating system, thereby operating seamlessly with any off-the-shelf Ethernet hardware. RETHER does not require any hardware changes in the hosts or wiring for real-time communication on a single shared segment. It only requires special RETHER-enabled switches to extend the guarantees across multiple segments. RETHER is designed for applications requiring periodic access to the network medium. By using a deadline-driven token passing protocol that prioritizes real-time over non-real-time data, RETHER guarantees deterministic channel access to real-time data. RETHER also features a distributed admission control policy to allow dynamic addition and removal of real-time connections. The fault tolerance mechanism of RETHER protects the network against token loss due to node failures or bit corruptions. RETHER has been implemented and tested independently, as well as in the context of the Stony Brook Video Server which is an end-to-end digital video server [8]. It is important to distinguish the protocol layer at which RETHER works with respect to RSVP [6]. Figure 1 illustrates where RETHER fits in an end-to-end bandwidth reservation framework. In Figure 1, an Venkatramani, Chiueh Expires Oct. 1997 [Page 2] INTERNET DRAFT RETHER April 1997 end-to-end real-time connection is to be established between a sender node and a receiver node across the internet. This connection can be set up using a real-time connection setup protocol such as RSVP [6], which in turn relies on link-layer reservation mechanisms to actually set up the underlying connection. One of the subnetworks that participates in the real-time connection in Figure 1 is a switched multi-segment Ethernet while the others are ATM, FDDI, etc. The abstraction that RETHER exposes to RSVP is thus a real-time capable Ethernet-based subnetwork that can provide bandwidth guarantees. Hence, RETHER plays a complementary role to higher level reservation protocols in that it functions exclusively as a link-layer building block. Sender Receiver O---------------------------------------------------O Connection Layer (RSVP) Saved Connection State | | V V O---------------O----------------O------------------O Link Layer FDDI ATM Switched Ethernet O---------------O----------------O------------------O / \ / ------------------ \ / | RETHER enabled | \ / | Ethernet Switch| \ / ------------------ \ / / | \ \ \ / /\ O-|-O \ \ \ / O O O O O<------- Hosts / Segmented Ethernet running \ / RETHER \ Figure 1. Internet bandwidth reservation with RETHER. In this document, we first present the design of RETHER as applicable to a shared single Ethernet segment and then the extension of RETHER to a switched environment. Our current implementation runs on a network of 10Mbps and 100Mbps Ethernets, on a combination of Pentium and 486-based PC's running the FreeBSD v2.0 kernel. Venkatramani, Chiueh Expires Oct. 1997 [Page 3] INTERNET DRAFT RETHER April 1997 2. The RETHER Protocol Overview ------------------------------- The RETHER protocol has the following features - * It allows applications to reserve bandwidth and guarantees the reservation throughout the lifetime of the application by implementing two traffic classes -- real-time and non-real-time. * It is implemented in software over off-the-shelf network hardware. * It adopts a hybrid scheme whereby the network operates using a timed token-bus protocol when there are real-time sessions, and using the original Ethernet protocol at all other times. This scheme reduces the performance degradation of non-real-time traffic, and presents minimal disruption to existing network applications. The network operates using the original Ethernet protocol (CSMA/CD) until a real-time request is made. When this happens, the network switches to a token-bus mode. In this mode, real-time data (like audio and video) is assumed to be periodic and time is divided into cycles of one period. The token rotation time (TRT) is a system initialization parameter and is fixed for all aplications in the network. In each cycle, channel access for *all* traffic - real-time(RT) and non-real-time(NRT) -- is regulated by a token. RT traffic is scheduled to be sent out first in each cycle and NRT traffic is allowed to use the channel in the remaining time. Hence, in each cycle, all admitted RT nodes get to send data. But, all nodes may or may not get to send NRT data, depending on the time left in the cycle or the available residual bandwidth. When all the time in the cycle is exhausted, the token returns to the first RT node and begins a new cycle. The details of how this time is maintained and how the token is circulated are explained in the following sections. When the last RT session terminates, the network switches back to the CSMA mode. 2.1 Protocol Description ------------------------ |-------- |---------| 1st RT request |----------|--| | New RT |CSMA |--------------------->| RETHER |<---| requests | Mode | | Mode | | |<---------------------| | |---------| Last RT session |-------------| termination Figure 2 : Different Modes and Transitions in RETHER. The network is a single shared Ethernet segment and consists of N nodes, all of which run RETHER. It is assumed that the value of N is known to all the nodes in the network. Venkatramani, Chiueh Expires Oct. 1997 [Page 4] INTERNET DRAFT RETHER April 1997 2.1.1 The CSMA Mode ------------------- Nodes compete for the channel using the generic Ethernet protocol. This protocol performs well under light loads and deteriorates only when the network is heavily loaded. Nodes operate in this mode until the arrival of the first RT request which initiates a switch to the RETHER-mode. 2.1.2 Switch to the RETHER Mode ------------------------------- When an application on a node generates an RT request and the node is in the CSMA-mode, it becomes an Initiator by broadcasting a Switch-to-RETHER message on the Ethernet. Every node that receives this message responds by setting its protocol mode to RETHER mode. It holds off sending any more data and awaits completion of transmission of the packet already in the transmission buffer of its network interface. Then it sends an acknowledgment back to the initiator. As soon as the initiator receives all the acknowledgments, it creates a token and begins circulating it. This completes a successful switch to RETHER-mode. Acknowledgements are crucial to the success of the protocol for two reasons. Firstly, acknowledgments signify the willingness of the nodes to switch to RETHER-mode. Secondly, the fact that acknowledgments are successfully sent out indicates that the nodes do not have any pending packet in the backoff phase of the CSMA/CD protocol. The latter is particularly important because in typical Ethernet cards, software has no control over the data once it has been transferred to the network interface buffer. The protocol can robustly handle the following scenarios that may affect the process of switching to RETHER-mode - 1) Multiple Initiators This condition occurs when applications running on two or more different nodes generate an RT request simultaneously, and the corresponding nodes initiate a transition to RETHER mode. The race condition occurs when all of these nodes have placed their Switch-to-RETHER broadcast message in the network interface buffer and hence have no way of withdrawing it. We resolve this condition by giving the node with the smaller ID, a higher priority. Venkatramani, Chiueh Expires Oct. 1997 [Page 5] INTERNET DRAFT RETHER April 1997 2) Node failures Ideally, the initiator that gains control of the transition will receive (N-1) acknowledgments, where N is the maximum number of nodes on the network. However, in the following cases, some nodes do not respond and the initiator does not receive (N-1) acknowledgments - i) Although N is the maximum number of nodes on the network, there could be fewer nodes actually in the network or some of the nodes could be powered off. ii) some of the acknowledgments could be dropped by the network card at the Initiator node. iii) some of the nodes do not receive the Switch-to-RETHER message due to transient phenomena such as receiver buffer overflow. To take care of the above, the initiator sets a timer while awaiting (N-1) acknowledgments. In all of the above cases, the initiator times out because it receives less than (N-1) acknowledgments. It then confirms that the nodes that did not respond are really dead by trying to elicit acknowledgments from them again. It does so by broadcasting the Switch-to-RETHER message again. The Initiator attempts this a few more times (twice in our prototype) and the Initiator with the lowest ID completes the transition. The time to switch to RETHER-mode is non-deterministic since the nodes have to transmit packets in their network buffers as well as the acknowledgments using the CSMA/CD protocol. The maximum time interval to complete the switch, used in our implementation, is based on an empirical distribution. 2.1.3 The RETHER Mode --------------------- In this mode, a token circulates among two sets of nodes - the RT-set and the NRT-set. Only nodes that have successfully made a bandwidth reservation belong to the RT-set, while *all* the network nodes belong to the NRT-set. Real-time data is assumed to be periodic and the interval between successive visits of the token to each node in the RT-set is one period. This period is also known as the Token Rotation Time (TRT). The token visits the nodes in either RT or NRT modes, when the node can hold the token for an interval of time called the Token Holding Time (THT). During this time, it Venkatramani, Chiueh Expires Oct. 1997 [Page 6] INTERNET DRAFT RETHER April 1997 can send the corresponding type of message. Each RT process specifies its required transmission bandwidth as the amount of data it needs to send during each TRT. This is translated into its THT using Equation [1], where `Software Overhead' refers to overheads such as processing of packets in the device driver and T_token refers to the token processing overhead. Since the RT reservations are made in terms of the size of the data unit to be sent in each cycle, policing and enforcement of the reservation is done automatically. RT nodes are allowed to change their bandwidth reservation dynamically and will be granted requests for more bandwidth only if resources are available. THT_RT = (Data per TRT/(Ethernet bandwidth) + Software Overhead + T_token [Eqn 1] THT_NRT = (Message Size)/(Ethernet bandwidth) + Software Overhead + T_token [Eqn 2] Equation [Eqn 2] is used to determine the THT for NRT nodes. While THT_RT is predetermined based on the reservation request, THT_NRT is computed each time the token visits an NRT node, based on the size of the message that is actually sent. By controlling the 'Message Size' parameter, it is possible to tune THT_NRT to get better performance. Each NRT message could consist of one or more packets, each of which can have a maximum size of an Ethernet Maximum Transmission Unit (MTU). The number of packets that a node is permitted to send per NRT-token visit is limited by a parameter called 'NRT Burst Size'. The 'NRT Burst Size' can be set to - i) one packet, ii) as many packets as possible, or iii) at most a fixed number of packets. The first policy is fair and allows each node a share of the residual bandwidth. However, since Ethernet traffic is usually bursty in nature, the second policy might be a better choice since the response time will be shorter, if as many of the queued packets as possible could be transmitted. The third policy only permits at most a fixed number of packets to be sent. Depending on the characteristics of the traffic on the network, the 'NRT Burst Size' parameter can be set appropriately. Venkatramani, Chiueh Expires Oct. 1997 [Page 7] INTERNET DRAFT RETHER April 1997 In order to maintain the real-time guarantees, the interval between two consecutive visits of the token to an RT node must be one TRT. This time is maintained in the token as a counter that holds the value of the residual time in the current cycle. In each cycle, the token starts out with its 'ResidualTime' field set to one TRT. It visits RT nodes once for each stream in the order in which they were admitted. If a node made multiple bandwidth reservations, then the token visits the node once, when all streams are serviced. This order is determined using 'RT Stream Information' stored in the token, which is an up-to-date list of all the currently active RT sessions and their bandwidth reservations. This information is required by the admission control algorithm. Each entry in this list holds a pair which uniquely identifies an RT session. Each of the RT nodes sends one unit of RT data for the corresponding stream and decrements the ResidualTime by its THT_RT. Then, in the remaining time, the token visits the NRT nodes in a round-robin fashion, starting with the last unvisited NRT node in the previous cycle. Each NRT node determines if there is sufficient time to transmit the data packet. It does so by examining the messages in the NRT message queue and computing the time it would take to transmit one or more of those packets, using Equation [Eqn 2]. If there is sufficient time, it sends out data constrained by the 'NRT Burst Size' parameter, decrements the 'ResidualTime' field on the token by its THT_NRT and passes the token to its neighbor. If not, it sends the token back to the RT-set, flagging itself as the NRT node to be the first to receive the token in the next cycle. The first RT node then resets the 'ResidualTime' to be equal to one TRT, thereby beginning a new cycle. Hence, the token need not necessarily visit all the nodes in the NRT-set in each cycle. It may visit nodes in the RT-set multiple times before visiting a node in the NRT-set. In this way, the nodes in the RT-set are given priority over those in the NRT-set. Venkatramani, Chiueh Expires Oct. 1997 [Page 8] INTERNET DRAFT RETHER April 1997 Ethernet Segment ================================================================ | | | | | | | | | | +-+ | +-+ | | +-+ | | | | |1| 2 |3| 4 5 |6| 7 8 9 10 +-+ +-+ +-+ Figure 3 : Sample Configuration of a network where nodes 1, 3 and 6 are real-time nodes. Figure 3 shows a sample configuration of a network. Here, the nodes 1, 3 and 6 belong to the RT-set and need a bandwidth reservation for an MPEG-1 stream of 1.5 Mbps each . 1.5Mbps translates to 6.25 KB per frame at the rate of 30 frames per second. Hence, using Equation [Eqn 1], the nodes make a reservation of 6ms each, on a 10Mbps Ethernet. During each TRT, the token visits the nodes 1, 3 and 6 when each of them holds the token for 6 msec while transmitting data and releases it. In the remaining time of 15 msec, the token visits nodes so that they can transmit NRT data. A possible sequence in two successive token rotation cycles is - [1]-[3]-[6]-1-2-3-4-5-6-7-[1]-[3]-[6]-7-8-9-10-11-1-2-... As is evident, the protocol is completely decentralized. This is possible because much of the protocol state is maintained in the token. In a nutshell, the design requires the following information to reside in the token - * Residual Time in the token cycle * Network State Information about all the nodes in RETHER-mode * RT Stream Information about all active RT streams and their reservations * First NRT node in next token cycle 2.1.4 Switch to CSMA Mode ------------------------- The last node to terminate its RT session destroys the token and sends out a Switch-to-CSMA broadcast message. All the nodes switch back to the CSMA-mode in response to this message. No acknowledgments are collected in this case. One potential problem with this approach is that nodes that fail to receive the Switch-to-CSMA message may not be able to revert to the CSMA mode. This problem is solved by having all the nodes maintain a timer that is set larger than the longest time between token visits. Once this timer goes off, the node automatically switches to the CSMA/CD mode. Venkatramani, Chiueh Expires Oct. 1997 [Page 9] INTERNET DRAFT RETHER April 1997 3. Support Modules ------------------ This section describes the admission control and fault tolerance modules of RETHER. The admission control module corresponds to the Bandwidth Allocator (BA) described in [7]. The implementation in RETHER is a distributed BA where each host is responsible for making admission decisions locally. 3.1 Admission Control --------------------- As new real-time requests are made, they must go through an admission control module that determines if there is sufficient network bandwidth available to satisfy the request of the node. We adopt a simple decentralized admission control policy to admit new RT sessions. Each node determines whether or not to admit its own application's reservation request. A new session with a reservation THT_RTnew can be admitted if and only if - sum{i in RTset} THT_RTi + THT_RTnew + T_NRT <= TRT [Eqn 3] where T_NRT is the time in each cycle set aside for NRT traffic. It basically sets an upper bound on the time between consecutive visits of the token to a node and hence determines the starvation characteristics of NRT traffic. The minimum value for T_NRT should be such that the worst case NRT network access latency is less than the timeout values of higher layer protocols. If a request is admitted, it is added to the RT-set information on the token and hence, it can start sending RT data in the next cycle, when the token visits in the RT mode. If not, the request system call returns to the user process with a failure code. The user may keep trying until the request is admitted. In our design, admission decision at a node is postponed until that node receives an NRT token. In other words, a process' request to initiate an RT connection does not return until the associated node receives the token. This is because the token contains the most up-to-date information about the RT-set and the bandwidth reservations. This decision was made because the only other alternative is to maintain this information at each node. However, this may lead to incorrect admission decisions when two or more nodes receive RT requests simultaneously and each node admits its request without the knowledge of admission decisions made at other nodes. Venkatramani, Chiueh Expires Oct. 1997 [Page 10] INTERNET DRAFT RETHER April 1997 When an RT session terminates, the sender merely removes itself from the RT-set information on the token, when it receives the token. It also sends a message to the receiver end of the connection to terminate the session. 3.2 Fault Tolerance ------------------- Since nodes in a network of workstations environment are likely to crash or get rebooted without any warning to other nodes, it is essential to have a mechanism that can detect node failures, reconfigure the token bus and regenerate the token if it is lost. The RETHER protocol is designed to be sufficiently robust to handle these events transparently. The issue of fault tolerance focuses on two different scenarios in the protocol. The first is ring maintenance and the second is token loss. Ring maintenance involves dynamic failure and addition of nodes into the token bus. Token loss covers the scenarios when a node holding the token fails or when a node drops the token as a corrupted packet. Our approach to fault tolerance is a decentralized scheme that is far simpler that the one in 802.4 [2] and 802.5 [3]. This is partly owing to the fact that in a failure where a token is lost and it cannot be detected by our scheme, the network nodes timeout and switch back to transmitting data using the CSMA/CD protocol which the hardware implements. The fault tolerance feature remedies the following situations - i) node failure, ii) node reboot, iii) token corruption, iv) token loss. When the network operates in the RETHER-mode, each node monitors the state of its logical successor. Each node, after sending the token to its successor, sets an `Acknowledgment Timer'. If the successor is alive, it holds the token for the duration of its THT and then sends the token to its successor and also an acknowledgment to its predecessor. On receiving this, the monitoring predecessor cancels its timer. If, on the other hand, the successor is dead, the monitoring node times out and assumes that its successor is dead. It then updates the `Network State Information' on the token to indicate the status of the successor, modifies the `RT Stream Information', if the failed node had made any reservations. It then sends the token to the next successor that is alive. Venkatramani, Chiueh Expires Oct. 1997 [Page 11] INTERNET DRAFT RETHER April 1997 In case the successor node is alive, but drops the token for reasons like hardware checksum failure or the monitoring node does not receive the acknowledgment for similar reasons, this scheme would not detect it and jump to the conclusion that the neighbor is dead. In order to avoid this, the predecessor polls the successor to ensure that it has really crashed, before regenerating the token. In response to the polling message, the neighbor specifies if it has forwarded the token or did not receive the token at all. In the first case, the monitoring node does not generate a token while in the second case, it does. The scheme described above would result in a tokenless network if both a node and its monitoring predecessor crash. This is detected by each node using an `Inactivity Timer' set to the maximum token rotation time seen by any node (NRT or RT), described in Section 3.1. We consider this a catastrophic failure, terminate all the RT reservations, and make all the nodes revert back to the CSMA/CD protocol. Our original design used the scheme whereby the token was broadcast and this served as the acknowledgment to the monitoring node. However, since the protocol was implemented in software, the interrupt overhead due to the broadcasting was very heavy and we had to choose to send the acknowledgments explicitly. The order in which the token and acknowledgment are sent is important. If the order is such that a node sends the acknowledgment first and then the token, and if the node fails between these events, then the network becomes tokenless. In this scenario, we use the `Inactivity Timer' mechanism described above to regenerate the token. If the order of the two messages is reversed and the node fails after forwarding the token, then the network would end up with two tokens. We avoid the this situation and chose to implement the former scheme, where the acknowledgment is sent before the token is forwarded When a node boots up, it broadcasts a message identifying itself. If the network is in RETHER-mode, the node with the token adds the new node to the 'Network State Information' in the token. The new node then waits for the token to arrive. Once this node is added to the list in the token, the token would arrive at the node within the 'Worst Case Net Access Latency' bound described in Section 3.1. If the token does not arrive within this interval, the new node assumes that the network is in the CSMA mode. The broadcast message may, however, introduce a slip in the protocol timings because it may collide with the node attempting to access the network. But, this effect is not cumulative and the protocol recovers from this in the next cycle. Venkatramani, Chiueh Expires Oct. 1997 [Page 12] INTERNET DRAFT RETHER April 1997 3.3 Overflow Buffer Management ------------------------------ The fault recovery time described above takes more than one token cycle. During this time, the network is tokenless, and the sender node continues to receive data from the application, for instance, a camera in a video-conferencing application. Therefore, the sender node has to reserve enough buffer space for each stream, to hold the data generated during the token recovery period. However, each additional buffer increases the end-to-end latency for the connection. Hence, we limit the number of buffers to be (T_end_to_end/TRT), where T_end_to_end is the end-to-end latency that each connection can specify. We also adopt a policy whereby we empty all the overflow buffers as soon as possible using the available NRT bandwidth. This is because our goal is to try to preserve the bandwidth reservation even when there is loss of bandwidth due to fault recovery. 4. RETHER in Switched Ethernet Environment ------------------------------------------ To provide end-to-end network bandwidth guarantees between any pair of nodes in a multi-segment Ethernet, RETHER has been extended to operate across switches. Conceptually, the system applies the single-segment RETHER protocol (described in Section 3) on each of the segments on the path from the source to the destination of a real-time connection. The per-segment reservations are all part of one logical connection used for resource allocation. Consider the network configuration in Figure 4. Suppose a real-time connection runs from A1 to C1. This connection can be broken down into three sub-connections namely A1-Sw1 on Segment 1, Sw1-Sw2 on Segment 2, and Sw2-C1 on Segment 3, where the nodes on the left are senders and those on the right are receivers.For a multi-segment real-time connection, each intermediate switch acts as a sender on one segment and a receiver on the other. All of the Ethernet segments run the single-segment RETHER protocol, with the token cycles on different segments being independent of each other. That is, there is no synchronization requirement on the token cycles associated with adjoining segments. The following sections describe the functionality of each of the major modules of the multi-segment RETHER protocol in detail. /------[A2]-----\ /------------------\ /-------[C1]-----| / \/ \/ | [A3] Segment1 [Sw1] Segment2 [Sw2] Segment3 | \ /\ /\ [C2] \-------[A1]----/ \-------[B1--------/ \----------------| Figure 4 : A multi-segment Ethernet configuration. Nodes on different segments can have RT sessions going across switches. Venkatramani, Chiueh Expires Oct. 1997 [Page 13] INTERNET DRAFT RETHER April 1997 4.1 Connection Setup and Admission Control ------------------------------------------ Multi-segment RETHER includes a link layer connection setup protocol to manage connections across switches. The RETHER connection message is routed using the static routing tables in the switches and an end-to-end real-time connection is established in the Ethernet subnetwork. This is independent of the network layer real-time connection establishment across the subnetwork. Connection establishment is initiated by the receiver. The connection message is sent as a special message that is intercepted by RETHER modules along the path from the receiver to the sender. This is because, the connection set up module is responsible for making resource reservations, namely buffers and network bandwidth, along the path from the receiver to the sender. At each intermediate switch, the next hop network and the Ethernet address of the next hop destination are saved as part of the connection state. This fixes the route that the data has to take once the end-to-end connection has been admitted. If there is a change in the route due to failures during the lifetime of the connection, the connection is terminated and an attempt is made to re-establish it. This is dealt with in Section 3.2. Once the connection is established, each sub-connection is only aware of its other end-point in its segment. The information about the final source and destination of the stream is not stored anywhere in the network. If the connection setup should fail for reasons such as insufficient buffer memory, failure to admit the connection, or no route to the destination, then a non-real-time message is sent back to the original receiver along the same path on which the connection request message traveled forward, and all previously allocated resources are released and the application is informed of the failure. The admission control test is performed in each segment that the connection crosses and is the same as that described in Section 3.1. Once the connection is admitted in the last Ethernet segment, it is complete and data transfer begins immediately. There is no explicit 'connection accept' message when the connection is admitted. Instead, the fact that the receiver receives the first frame of data acts as an implicit acknowledgment. Venkatramani, Chiueh Expires Oct. 1997 [Page 14] INTERNET DRAFT RETHER April 1997 4.2 Fault Tolerance ------------------- When the token on an Ethernet segment is lost or corrupted, the single-segment RETHER protocol's fault tolerance mechanism recovers from the fault by reintroducing the token in that segment. All the real-time connections that pass through the segment continue to work after token recovery. Therefore, multi-segment RETHER poses no new problems compared to single-segment RETHER in this case. However, when a network node crashes, new mechanisms need to be devised to handle multi-segment connections in which the failed node is involved. For a multi-segment connection, if the crashed node is one of the intermediate switches or an end-point of a real-time connection, the state associated with the connection needs to be cleaned up. Because a RETHER switch participates in the token passing on all segments that are connected to it, the switch failure is detected independently on each of the segments via the fault tolerance scheme described in Section 3.2. The nodes that detect the failure then broadcast a message to that effect on their respective segments. The other end-points of the sub-connection to which the crashed node was connected, upon receiving such a message, frees up associated resources, and sends an abort message to the next sub-connection. The message eventually reaches the actual end-points of the connection in either direction and all the reserved resources for the connection are released. For an intermediate switch crash, a termination message travels along the path of the connection on both sides of the failed node. For the crash of an end-point, the message travels only in one direction. /------[A2]-----\ /------------------\ /-------[C1]-----| / \/ \/ | [A3] Segment1 [Sw1] Segment2 [Sw2] Segment3 | \ /\ /\ [C2] \-------[A1]----/ \-------[B1--------/ \----------------| Figure 5. Switch SW connects Segment1 and Segment2. If SW crashes, hosts A2 and B2 detect the failure independently in the two segments that SW is connected to. For instance, in Figure 5, suppose Sw1 has crashed and Node A2 detects the failure on Segment 1 and Node B2 detects it on Segment 2. A2 and B2 inform all the node on their respective segments by broadcasting a message. On receiving the message, A1 terminates its Send sub-connection and frees up the connection's associated resources. Sw2 similarly terminates the Receive sub-connection whose other end-point Venkatramani, Chiueh Expires Oct. 1997 [Page 15] INTERNET DRAFT RETHER April 1997 was Sw1, on Segment 2. Since this is a multi-segment connection, Sw2 also sends a message to C1 to terminate the entire connection and to free the reserved resources. Thus, all the state associated with the connection through Sw1 are cleared. If, on the other hand, the failed node is not involved in the real-time connection, a token recovery takes place locally in the segment on which the failed node is and the RT connection continues as before. The only effect on any real-time session crossing this segment is that it would not be able to send/receive data during the fault recovery period. The impact of fault recovery on buffering requirements at the switch is discussed in Section 4.3.1. 4.3 Buffer Management --------------------- The token cycles on adjoining network segments are not synchronized and this could lead to longer latency because of temporal skew of token arrival times on the two segments of the connection. Since the token rotation times (TRT) in both segments are the same, the maximum skew between them could be one TRT long. To accommodate this, we use a double buffering scheme in which there are two buffers for each real-time stream going across the switch. While one buffer is being filled by the Receive sub-connection on one segment, the other is emptied out by the Send sub-connection on the other, when the token on the outgoing segment arrives. At the end of each token cycle, the roles of the two buffers are swapped. The size of the buffer has a direct effect on the end-to-end latency seen by the users. For interactive applications such as teleconferencing, the limit on the end-to-end latency is around 300msec. This implies that the buffering delays at each hop must be small and that the number of hops that a real-time connection can cross, is bounded. In theory, the token cycle times on all network segments are supposed to be the same, and in each cycle the switch receives exactly one frame of data from the incoming segment and sends out exactly one frame of data on the outgoing segment. Therefore, the real-time connection suffers a maximum of one token cycle latency at each hop along the path, and the maximum end-to-end latency is (Number of hops * Token cycle time). The minimum latency would be the time to transmit the data from the sender to the receiver as though they were on the same segment plus the time to copy the data to memory and out at each intermediate switch. However, in practice, neither the network segments have identical token cycle times, nor does the switch forward the data it receives immediately. The delay impact of the discrepancies in token cycle times and the possible approaches to solve the problem are described in Section 4.4. Venkatramani, Chiueh Expires Oct. 1997 [Page 16] INTERNET DRAFT RETHER April 1997 As explained in Section 3.3, each switch must have additional buffers besides the two buffers for the double buffering scheme to take care of fault recovery. In multi-segment RETHER again, the buffers for each connection across a switch is limited by the maximum end-to-end latency specification for the connection. We keep track of the end-to-end latency suffered by each data frame and drop the ones that exceed the limit, in the network. 4.4 Implementation Experiences ----------------------------- After implementing the multi-segment RETHER protocol as described in the previous section, some unexpected problems surfaced, which forced us to go back to devise additional mechanisms to supplement the original design. This section describes these modifications and additions. The main problem that we encountered in our first implementation was a significant deviation of the token cycle times from the ideal values, on adjoining segments. This led to serious problems of buffer overflow/underflow at the switches, and a breakdown of bandwidth guarantees for connections. We traced the deviation and found that it was due to the inaccuracy in determining the token processing time at the switch. There was contention for the CPU due to packets arriving on multiple segments more or less simultaneously. Because of the contention, the processing of tokens/messages on one segment is delayed by the switch's processing of the token or message on another segment. Since these events cannot be predicted, this led to nondeterminism in computing the time to process a token at the switch and also the token cycle times. This led to differences between the average token rotation times (TRT) of the adjoining segments. For a connection that crosses these segments, this phenomenon allows the segment with the smaller average TRT to send data at a faster rate than the segment with the larger TRT can consume, leading to buffer overrun problems at the switch. Because of this problem, data is delivered only at the rate of the slowest link in the path of the connection, which results in the violation of end-to-end bandwidth guarantees. To address this problem, we devised a cycle time compensation algorithm that dynamically measures the nondeterminism and accounts for it. Venkatramani, Chiueh Expires Oct. 1997 [Page 17] INTERNET DRAFT RETHER April 1997 4.4.1 Compensation Algorithm ---------------------------- We developed a history-based prediction mechanism to estimate the discrepancy between the ideal token cycle time and the actual token cycle time in the next cycle. One of the nodes in each segment is designated the 'synchronizer' node and is supposed to receive the token at the beginning of each token cycle. The 'synchronizer' node takes a time stamp at each token visit and if the token cycle time is not the ideal token cycle time, then it adjusts the effective token cycle time in the next cycle to take care of the detected discrepancy. For example, if the discrepancy estimate that the 'synchronizer' makes is (+5 msec), meaning that the token arrived late by 5 msec, it assumes that the same is likely to occur in the next cycle. Hence, the 'synchronizer' simply subtracts 5 msec from the TRT for the next cycle expecting the token to come back to it, exactly one TRT later. If, on the other hand, the discrepancy that the 'synchronizer' detects is (-5 msec), meaning that the token arrived 5 msec early, then it holds the token for 5 msec, thereby correcting the current cycle. The 'synchronizer' node also keeps track of the correction that it applied in the previous 'n' cycles (the value of 'n' determines the amount of history used in prediction and is a configurable parameter), and computes the correction term for the next cycle based on this history. In the current design, all the RT nodes independently compute the correction term. However, only the first real-time node in the token cycle serves the 'synchronizer' role and actually applies the correction. Because the list of real-time nodes is stored in the token, it is straightforward for each node to decide whether it is the 'synchronizer' node. In other words, there is no need to re-elect a new synchronizer node when the previous synchronizer dies or all RT streams from that previous synchronizer terminate. The cycle-time prediction and adjustment algorithm executed by all RT nodes is shown in the Algorithm below. -------------------------------------------------------------------- 1 Current_TimeStamp = gettime(); 2 if (NodeID == Synchronizer) { 3 Difference = Current_TimeStamp + Token_Residual_Time - Prev_TimeStamp - Ideal_Token_Cycle; 4 if (Difference < 0) { 5 Delay a time period of duration abs(Difference); 6 } 7 /* accumulate errors from previous runs */ 8 Compensation += Difference; 9 if (Compensation < 0) 10 Compensation = 0; 11 Token_Residual_Time = Ideal_Token_Cycle - K * Compensation; 12 Current_TimeStamp = gettime(); 13 } 14 Prev_TimeStamp = Current_TimeStamp; -------------------------------------------------------------------- Venkatramani, Chiueh Expires Oct. 1997 [Page 18] INTERNET DRAFT RETHER April 1997 'Token_Residual_Time' is the residual time counter on the token that indicates the time remaining in the current token cycle. 'Difference' is the token cycle discrepancy calculated from the previous cycle. 'Compensation' is the cumulative errors observed so far. The initial 'Token_Residual_Time' for the next cycle is computed by subtracting (K * Compensation) from 'Ideal_Token_Cycle', where K is the damping factor to avoid over-compensation. A larger value of K implies that any deviation observed in the cycle is compensated for quicker. A smaller value for 'K' indicates that the compensation is done slower. The gettime() in line 1 of the algorithm determines the actual time when the token arrives at this node and is used to compute the duration of the previous cycle. Intuitively, this time stamp should also mark the beginning of the next token cycle. However, if the token happens to come early and a 'Delay' is done to compensate for the early arrival, the actual start of the next cycle is after the 'Delay'. Hence, we take another time stamp at the end of the processing, on line 12. This algorithm has been implemented in our prototype and as a result the cycle times across various segments remain stable with very small jitter. This in turn ensures smooth delivery of data for multi-segment real-time connections. 5. User Interface ----------------- The user application has a set of library functions to request and manage bandwidth reservations for connections. The library functions access the RETHER Bandwidth Allocator through system calls to the kernel. The connections and the network are managed transparently by RETHER. The library functions and the RETHER system call API correspond to the Requester Module described in [7]. 6. Conclusion ------------- In this document we presented RETHER, a protocol to support real-time connections in a switched Ethernet environment. The protocol is designed to operate entirely in the link layer and exposes a real-time capable Ethernet subnetwork to protocols such as RSVP. RETHER provides distinct traffic priority classes namely real-time and non-real-time and allows the user to specify the kind of flow required. RETHER also incorporates an admission control module that can be employed by RSVP to make reservations. The implementation for the end-stations is entirely in software without requiring any infrastructure/hardware changes while special switches are necessary to carry the RETHER guarantees across multiple Ethernet segments. Venkatramani, Chiueh Expires Oct. 1997 [Page 19] INTERNET DRAFT RETHER April 1997 A detailed performance evaluation of the protocol can be found in [9]. Through the evaluation of the prototype, we show that the implementation is reasonably efficient in a testbed that consists of 10-Mbps and 100-Mbps links, and the target performance guarantees are indeed met in all cases. There are several directions that the RETHER group is currently pursuing actively. First, we would like to experiment the switch implementation on a high-end Pentium machine that has more efficient I/O hardware to evaluate the performance scalability of the switch, in terms of larger numbers of attached links as well as higher link bandwidth. In addition, the software architecture will use polling instead of being interrupt driven, to minimize the CPU overhead. Second, because wireless LAN has a physical network architecture similar to Ethernet, we are planning to extend the multi-segment RETHER technology to a heterogeneous network of wired and wireless links. Finally, we are looking at the feasibility of implementing part of the RETHER protocol in hardware, to demonstrate the feasibility of incorporating RETHER into switches, to Ethernet switch vendors. Venkatramani, Chiueh Expires Oct. 1997 [Page 20] INTERNET DRAFT RETHER April 1997 References ---------- [1] "Carrier Sense Multiple Access with Collision Detection (CSMA/CD) access method and physical layer specifications" ANSI/IEEE Standard 802.3-1985. [2] IEEE/ANSI Standard 802.4 - 1985, "Token passing bus access method and physical layer specifications". IEEE Inc., New York, 1985. [3] IEEE/ANSI Standard 802.5, "Token Ring access method and physical layer specifications". IEEE Inc., New York, 1985. [4] "Design, Implementation and Evaluation of a software-based real-time {Ethernet} protocol", Chitra Venkatramani and Tzi-cker Chiueh Proceedings of the ACM SIGCOMM'95. [5] "Supporting Real-time Traffic on Ethernet", Tzi-cker Chiueh and Chitra Venkatramani Proceedings of IEEE Real-time Systems Symposium, Peurto Rico, Dec. 1994. [6] "Resource Reservation Protocol (RSVP) - Version 1 Functional Specification" Internet Draft, November 1996 [7] "A Framework for Providing Integrated Services Over Shared and Switched LAN Technologies", Internet Draft, April 1997, [8] "Design of the Stony Brook Video Server" Michael Vernick, Chitra Venkatramani, Tzi-cker Chiueh. SPIE First International Symposium on Technologies and Systems for Voice, Video and Data Communications.", Oct. 1995, Pliladelphia, MA. [9] "The Design, Implementation and Evaluation of RETHER : A Real-Time Ethernet Protocol", Chitra Venkatramani PhD Thesis, SUNY at Stony Brook, Stony Brook, NY. [10] "IEEE Standards for Local and Metropolitan Area Networks: Draft Standard for Traffic Class and Dynamic Multicast Filtering Services in Bridged Local Area Networks" (Draft Supplement to 802.1D), P802.1p/D5, February, 1997. Venkatramani, Chiueh Expires Oct. 1997 [Page 21] INTERNET DRAFT RETHER April 1997 Authors' Addresses Chitra Venkatramani IBM T.J.Watson Research Center 30, Saw Mill River Road Hawthorne, NY 10532. USA (914) 784-7369 chitra@watson.ibm.com Tzi-cker Chiueh Department of Computer Science SUNY at Stony Brook Stony Brook, NY 11594 USA (516) 632-8449 chiueh@cs.sunysb.edu Venkatramani, Chiueh Expires Oct. 1997 [Page 22]