Network Working Group                                       Douglas Otis
Internet Draft                                                  SANlight
Document: draft-otis-network-overhead-00.txt
Expires: August, 2002                                     February, 2002


                   Network Overhead Problem Statement


Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of [RFC2026].

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups. Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time. It is inappropriate to use Internet-Drafts as reference
   material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Abstract

   System performance is often limited by an integrated device
   interconnect. Process differences between logic and memory keep
   large-scale memory interconnects critical as Moore's Law surpasses
   the rate of packaging reductions. Two areas are related to memory
   interconnect performance when handling network messaging:

      - Logic and memory state context switching
      - Reassembly of partial messages with bifurcation of payload

   Modern logic and memory systems carry an amount of state comparable
   to a typical message. As such, process context switching has a
   sizeable impact on performance. The read-then-write operations
   needed to reassemble and relocate a message payload contained within
   an envelope are also comparable to message overhead. The essential
   envelope associates payloads with processes within an Upper-Level
   Protocol. Optimal transport operations are ideally contained within
   a simple network adapter that minimizes context switching and
   placement operations.

Otis            Network Overhead Problem Statement              Page[2]

Overview

   If a system adopted a technique of accessing information using an
   array of pointers with byte lengths, then reassembly of messages or
   extraction of payloads would be superfluous. Such an array could
   then handle both messaging and memory virtualization, but this would
   not scale when sharing memory state for global virtualization. To
   stabilize memory virtualization, message objects are placed into
   contiguous regions instead of remapping memory.

   When a message is transferred, it is often desirable to schedule
   process handling, which may involve a context switch. For memory
   rather than message based applications, scheduling may be deferred
   or never invoked depending on the function of the message. The
   message itself may represent an array of messages where only the
   final message is significant.

   An effective transport handles process scheduling and performs
   message assembly together with payload extraction. To be practical
   for a network interface card, these operations need to be idempotent
   with respect to the packet, which relieves the problematic
   inter-packet state processing and queuing at the IP layer. As
   indicated by [RFC1122], a transport must support all mechanisms of
   the IP layer and, although there are means to implement direct
   placement methods at the IP layer, access to such a new mechanism
   still requires a change to the transport.

   Essential to a scheme that does not impact the IP layer is a
   transport that supports both framing and non-sequential delivery.
   Within each packet is an envelope that allows process association,
   message control, message payload bifurcation, and robust error
   detection. Currently, only SCTP [RFC2960] offers flow control
   together with this set of features while not requiring mechanisms
   added to the IP layer.

Proposed Network Interface Card Intrinsic Approach

   SCTP allows a shim to be placed in-line to each packet as the
   transport can deliver packets as received.
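
   As an illustration of why delivery as received suits direct
   placement, the following is a minimal sketch and not part of the
   proposal. It assumes each packet carries an explicit byte offset
   into a pre-registered buffer; the PlacementBuffer class and its
   method names are hypothetical. Because each payload is written at
   its own offset, handling a packet needs no state from any other
   packet, and replaying a duplicate leaves the buffer unchanged.

```python
class PlacementBuffer:
    """Hypothetical receive buffer filled by explicit offset (illustration only)."""

    def __init__(self, size):
        self.data = bytearray(size)
        # One marker byte per data byte, recording which bytes have arrived.
        self.present = bytearray(size)

    def place(self, offset, payload):
        """Write payload at offset; repeating the same call changes nothing."""
        end = offset + len(payload)
        if offset < 0 or end > len(self.data):
            raise ValueError("payload falls outside the buffer")
        self.data[offset:end] = payload
        self.present[offset:end] = b"\x01" * len(payload)

    def complete(self):
        """True once every byte of the buffer has been placed."""
        return all(self.present)


message = b"direct placement needs no reassembly queue"
# Split the message into 8-byte packets, each tagged with its offset.
packets = [(off, message[off:off + 8]) for off in range(0, len(message), 8)]

buf = PlacementBuffer(len(message))
# Deliver in reverse order, with the first two packets duplicated.
for off, payload in reversed(packets + packets[:2]):
    buf.place(off, payload)
```

   In this sketch the receiver never queues or reorders packets; the
   arrival order and the duplicates have no effect on the final buffer
   contents.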
   SCTP also allows packets to be formed by the Upper-Level Protocol by
   means of disabling transport fragmentation. Using SCTP, a shim can
   introduce a structure contained within each Data Chunk for
   mode-negotiated direct placement within an association. This
   structure could take the following form, with field widths in bits
   given in brackets:

      - Placement Mode[24]:Flags[8]
      - Placement Tag[32]
      - Placement Offset[64]

   If any flags are set in the Flags field, they are held for the
   associated process until the Data Chunk TSN is less than or equal to
   the cumulative TSN. The flags are Disclose, Acknowledgement
   Requested, and Release Buffer. For applications that must ensure
   placement sequencing, an acknowledgement prior to each new message
   send is required. When disclosure of message reception at the
   receiver is desired, the Disclose flag is set in the last fragment
   of the message. For Upper-Level Protocols using implied placement
   rather than an explicit negotiation, the completion of messages
   associated with a process is indicated with a Release Buffer flag in
   the last fragment of the last message. This leaves many options at
   the shim undefined.

   Acknowledgement to the sender could be deduced from the transport
   SACK or from messages emanating from the receiving shim sent back to
   the sender. Should the order of message completion be important at
   the receiver, a single Stream must be used for these related
   messages. Note that there is no assured send order between Streams,
   so related messages must remain bound to one Stream for message
   completion to be ascertained. Because messages are allowed to be
   delivered out of sequence, Streams would serve to partition messages
   rather than to minimize head-of-queue blocking. The process and
   related placement buffers may be associated with the SCTP Stream or
   the Placement Tag depending on the needs of the Upper-Level
   Protocol.
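
   The placement structure and the holding of flags described above can
   be sketched as follows. This is a minimal illustration under stated
   assumptions, not a definition: the draft names the three flags but
   assigns no bit values, so the FLAG_* constants, network byte order,
   and the helper names (pack_placement_header, HeldFlags, and so on)
   are all hypothetical. TSN comparison uses serial-number arithmetic
   modulo 2**32, as SCTP TSNs wrap.

```python
import struct

# Hypothetical bit assignments; the draft names the flags but not their values.
FLAG_DISCLOSE = 0x01
FLAG_ACK_REQUESTED = 0x02
FLAG_RELEASE_BUFFER = 0x04

# Placement Mode[24]:Flags[8], Placement Tag[32], Placement Offset[64].
# Network byte order is assumed; Mode and Flags share the first 32-bit word.
_HEADER = struct.Struct("!IIQ")


def pack_placement_header(mode, flags, tag, offset):
    """Build the 16-byte placement structure."""
    if not 0 <= mode < (1 << 24):
        raise ValueError("mode must fit in 24 bits")
    if not 0 <= flags < (1 << 8):
        raise ValueError("flags must fit in 8 bits")
    return _HEADER.pack((mode << 8) | flags, tag, offset)


def unpack_placement_header(data):
    """Return (mode, flags, tag, offset) from the start of data."""
    word, tag, offset = _HEADER.unpack_from(data)
    return word >> 8, word & 0xFF, tag, offset


def tsn_lte(a, b):
    """True if TSN a is less than or equal to TSN b, modulo 2**32."""
    return ((b - a) & 0xFFFFFFFF) < 0x80000000


class HeldFlags:
    """Holds set flags per Data Chunk TSN until the cumulative TSN covers them."""

    def __init__(self):
        self._pending = {}  # TSN -> flags

    def hold(self, tsn, flags):
        if flags:  # Only chunks with at least one flag set are held.
            self._pending[tsn] = flags

    def release(self, cumulative_tsn):
        """Return [(tsn, flags)] for covered chunks, oldest TSN first."""
        ready = [t for t in self._pending if tsn_lte(t, cumulative_tsn)]
        # Oldest TSN is the one farthest behind the cumulative TSN.
        ready.sort(key=lambda t: (cumulative_tsn - t) & 0xFFFFFFFF, reverse=True)
        return [(t, self._pending.pop(t)) for t in ready]
```

   A shim built this way would call hold() as each Data Chunk arrives
   and release() whenever the cumulative TSN advances, acting on the
   Disclose, Acknowledgement Requested, or Release Buffer flags only
   once all earlier chunks are known to have been received.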

   As the IP layer cannot assure reception order, messages to be placed
   in sequence must have their sending paced by acknowledgement, but
   they can be sent on any Stream. Stream partitioning may isolate
   sub-systems as a means to independently control resources. There is
   work pending on Stream flow control conventions for this purpose.

   For all Upper-Level Protocols that utilize the direct data placement
   (DDP) shim, the Payload Protocol Identifier will indicate either a
   null value or an IANA registered protocol identity. For systems that
   establish network-virtualized memory schemes, the shim explicitly
   negotiates process-Placement Tag associations together with related
   restrictions. ANONYMOUS placement messages interpreted by the shim
   layer may be used to post requests such as direct reads. There are
   many protocols, however, where user buffers associated with either
   Streams or Placement Tags can be implied to minimize latency. With
   either technique, significant benefits are found through the direct
   placement standardization of the shim. Memory based applications
   will further benefit from autonomous validation of, and response to,
   these requests, which saves the associated context switching.

References

   [RFC1122] Braden, R., "Requirements for Internet Hosts --
             Communication Layers", RFC 1122, October 1989.

   [RFC2026] Bradner, S., "The Internet Standards Process -- Revision
             3", BCP 9, RFC 2026, October 1996.

   [RFC2960] Stewart, R., Xie, Q., Morneault, K., Sharp, C.,
             Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M.,
             Zhang, L., and V. Paxson, "Stream Control Transmission
             Protocol", RFC 2960, October 2000.

Authors' Addresses

   Douglas Otis
   800 E. Middlefield
   Mountain View, CA 94043 USA
   Email: dotis@sanlight.net

Full Copyright Statement

   Copyright (C) The Internet Society (2002). All Rights Reserved.

   This document and translations of it may be copied and furnished to
   others, and derivative works that comment on or otherwise explain it
   or assist in its implementation may be prepared, copied, published
   and distributed, in whole or in part, without restriction of any
   kind, provided that the above copyright notice and this paragraph
   are included on all such copies and derivative works. However, this
   document itself may not be modified in any way, such as by removing
   the copyright notice or references to the Internet Society or other
   Internet organizations, except as needed for the purpose of
   developing Internet standards in which case the procedures for
   copyrights defined in the Internet Standards process must be
   followed, or as required to translate it into languages other than
   English.

   The limited permissions granted above are perpetual and will not be
   revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

   Funding for the RFC Editor function is currently provided by the
   Internet Society.