   Network Working Group                                   Douglas Otis
   Internet Draft                                              SANlight
   Document: draft-otis-network-overhead-00.txt 
   Expires: August, 2002                                 February, 2002
 
 
                    Network Overhead Problem Statement 
 
   Status of this Memo 
    
   This document is an Internet-Draft and is in full conformance with 
   all provisions of Section 10 of [RFC2026]. 
    
   Internet-Drafts are working documents of the Internet Engineering 
   Task Force (IETF), its areas, and its working groups.  Note that      
   other groups may also distribute working documents as Internet-
   Drafts. 
    
   Internet-Drafts are draft documents valid for a maximum of six 
   months and may be updated, replaced, or obsoleted by other documents 
   at any time.  It is inappropriate to use Internet-Drafts as 
   reference material or to cite them other than as "work in progress." 
    
   The list of current Internet-Drafts can be accessed at 
        http://www.ietf.org/ietf/1id-abstracts.txt 
   The list of Internet-Draft Shadow Directories can be accessed at 
        http://www.ietf.org/shadow.html. 
    
    
   Abstract 
       
   System performance is often limited by an integrated device 
   interconnect.  Process differences between logic and memory keep 
   large-scale memory interconnects critical as Moore's Law surpasses 
   the rate of packaging reductions.  Two areas are related to memory 
   interconnect performance when handling network messaging: 
    - Logic and memory state context switching 
    - Reassembly of partial messages with bifurcation of payload 
    
   Modern logic and memory systems carry an amount of state comparable 
   to a typical message.  As such, process context switching has a 
   sizeable impact on performance.  The read-then-write operations that 
   reassemble and relocate a message payload contained within an 
   envelope are also comparable to message overhead.  The essential 
   envelope associates payloads with processes within an Upper-Level 
   Protocol.  Optimal transport operations are ideally contained within 
   a simple network adapter that minimizes context switching and 
   placement operations.    
   Otis           Network Overhead Problem Statement           Page[2]     

   Overview 
    
   If a system adopted a technique of accessing information using an 
   array of pointers with byte lengths, then reassembly of messages or 
   extraction of payloads would be superfluous.  Such an array could 
   then handle both messaging and memory virtualization, but this would 
   not scale when sharing memory state for global virtualization.  To 
   stabilize memory virtualization, message objects are placed into 
   contiguous regions instead of remapping memory. 
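
   The pointer-and-length array described above can be sketched as 
   follows.  This is a minimal illustration, not part of any 
   specification: the Segment type, the read_range function, and the 
   4-byte envelope are hypothetical names and sizes chosen for the 
   example. 

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    """One entry of the pointer-and-length array (hypothetical type)."""
    buffer: memoryview  # packet memory holding the payload; it stays in place
    offset: int         # where the payload starts inside that packet
    length: int         # payload byte count


def read_range(segments: List[Segment], start: int, count: int) -> bytes:
    """Return `count` message bytes beginning at message offset `start`.

    The packets are never merged; each segment is read where it lies.
    """
    out = bytearray()
    position = 0  # message offset of the current segment's first byte
    for seg in segments:
        seg_end = position + seg.length
        if seg_end > start and position < start + count:
            lo = max(start, position) - position
            hi = min(start + count, seg_end) - position
            out += seg.buffer[seg.offset + lo : seg.offset + hi]
        position = seg_end
    return bytes(out)


# Two packets, each with an assumed 4-byte envelope ahead of the payload.
packet_a = memoryview(b"HDR0hello ")
packet_b = memoryview(b"HDR1world")
message = [Segment(packet_a, 4, 6), Segment(packet_b, 4, 5)]

print(read_range(message, 3, 5))  # prints b'lo wo'
```

   The read spans the packet boundary, yet the packets themselves are 
   never merged or moved; only the requested bytes are copied out. 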
    
   When a message is transferred, it is often desired to schedule 
   process handling that may involve a context switch.  For memory 
   rather than message based applications, scheduling may be deferred 
   or never invoked depending on the function of the message.  The 
   message itself may represent an array of messages where only the 
   final message is significant.  An effective transport handles 
   process scheduling and performs message assembly together with 
   payload extraction.  To be practical for a network interface card, 
   these operations need to be idempotent with respect to the packet, 
   relieving the IP layer of problematic inter-packet state processing 
   and queuing. 
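
   One way to picture packet-idempotent placement is a write at an 
   explicit offset, sketched below.  The place function and the sample 
   fragments are hypothetical; the point is only that reordered or 
   duplicated packets leave the buffer in the same final state, so no 
   inter-packet state or queuing is needed. 

```python
def place(buffer: bytearray, offset: int, payload: bytes) -> None:
    """Write one packet's payload at its stated offset in the target buffer."""
    if offset < 0 or offset + len(payload) > len(buffer):
        raise ValueError("placement falls outside the target buffer")
    buffer[offset : offset + len(payload)] = payload


# (offset, payload) pairs standing in for three packets of one message.
fragments = [(0, b"net"), (3, b"work"), (7, b" overhead")]

in_order = bytearray(16)
for offset, payload in fragments:
    place(in_order, offset, payload)

# Reverse arrival order, with one fragment delivered twice.
shuffled = bytearray(16)
for offset, payload in [fragments[2], fragments[0], fragments[2], fragments[1]]:
    place(shuffled, offset, payload)

print(in_order == shuffled)  # prints True
```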
    
   As indicated by [RFC1122], a transport must support all mechanisms 
   of the IP layer and, although there are means to implement direct 
   placement methods at the IP layer, access to such a new mechanism 
   still requires change to the transport.  Essential to a scheme that 
   does not impact the IP layer is a transport that supports both 
   framing and non-sequential delivery.  Within each packet is an 
   envelope that allows process association, message control, message 
   payload bifurcation, and robust error detection.  Currently, only 
   SCTP [RFC2960] offers flow control together with this set of 
   features while not requiring mechanisms added to the IP layer. 
 
   Proposed Network Interface Card Intrinsic Approach 
    
   SCTP allows a shim to be placed in-line with each packet, as the 
   transport can deliver packets as received.  SCTP also allows packets 
   to be formed by the Upper-Level Protocol by means of disabling 
   transport fragmentation.  Using SCTP, a shim can introduce a 
   structure contained within each Data Chunk for mode-negotiated 
   direct placement within an association.  This structure could be in 
   the form of the following: 
    - Placement Mode[24]:Flags[8] 
    - Placement Tag[32] 
    - Placement Offset[64] 
    
   Should the Flags field contain set flags, these are held for the 
   associated process until the Data Chunk TSN is less than or equal to 
   the cumulative TSN.  These flags are Disclose, Acknowledgement 
   Requested, and Release Buffer.  For applications that must ensure 
   placement sequencing, an acknowledgement is required before each 
   new message is sent.  When disclosure of message reception at the 
   receiver is desired, the Disclose flag is set in the last fragment 
   of the message.  For Upper-Level Protocols using implied placement 
   rather than an explicit negotiation, the completion of messages
   associated with a process is indicated with a Release Buffer flag in 
   the last fragment of the last message. 
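
   The rule that set flags are held until the Data Chunk TSN is 
   covered by the cumulative TSN could be modeled as below.  The 
   HeldFlags class is hypothetical, and the comparison assumes the 
   32-bit serial number arithmetic that SCTP applies to TSNs so that 
   wraparound is handled. 

```python
from typing import Dict, List, Tuple

TSN_MODULUS = 1 << 32


def tsn_lte(a: int, b: int) -> bool:
    """True when TSN `a` is at or before TSN `b` under 32-bit wraparound."""
    return ((b - a) % TSN_MODULUS) < (1 << 31)


class HeldFlags:
    """Holds per-chunk flags until the cumulative TSN covers their chunk."""

    def __init__(self) -> None:
        self._pending: Dict[int, int] = {}  # Data Chunk TSN -> Flags value

    def hold(self, tsn: int, flags: int) -> None:
        """Remember the flags of a chunk; chunks with no set flags are skipped."""
        if flags:
            self._pending[tsn] = flags

    def release(self, cumulative_tsn: int) -> List[Tuple[int, int]]:
        """Return and forget every (tsn, flags) pair now covered."""
        ready = [
            (tsn, flags)
            for tsn, flags in self._pending.items()
            if tsn_lte(tsn, cumulative_tsn)
        ]
        for tsn, _ in ready:
            del self._pending[tsn]
        return ready
```

   The released pairs would then be handed to the associated process; 
   how that hand-off occurs is left open here, as it is in the text. 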
    
   This leaves many options at the shim undefined.  Acknowledgement to 
   the sender could be deduced from the transport SACK or from 
   messages that the receiving shim sends back to the sender.  Should 
   the order of message completion be important at the receiver, a 
   single Stream must be used for these related messages.  Send order 
   is not assured between Streams, so related messages must remain 
   bound to one Stream for message completion to be ascertained.  
   Because messages may be delivered out of sequence, Streams here 
   partition messages rather than minimize head-of-queue blocking. 
    
   The process and related placement buffers may be associated with the 
   SCTP Stream or the Placement Tag depending on the needs of the 
   Upper-Level Protocol.  As the IP layer cannot assure reception 
   order, messages to be placed in sequence must have sending paced by 
   acknowledgement but can be sent on any Stream.  Stream partitioning 
   may isolate sub-systems as a means to independently control 
   resources.  There is work pending on Stream flow control conventions 
   for this purpose.    
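
   Pacing by acknowledgement, as described above for messages that 
   must be placed in sequence, might look like the sketch below.  The 
   PacedSender class is hypothetical, and its `sent` list merely 
   stands in for handing a message to the transport on any Stream. 

```python
from collections import deque
from typing import Deque, List, Optional


class PacedSender:
    """Releases one message at a time, each only after the previous is acked."""

    def __init__(self) -> None:
        self._queue: Deque[bytes] = deque()
        self._in_flight: Optional[bytes] = None
        self.sent: List[bytes] = []  # stand-in for the transport send path

    def submit(self, message: bytes) -> None:
        """Queue a message; it goes out at once only if nothing is in flight."""
        self._queue.append(message)
        self._pump()

    def acknowledged(self) -> None:
        """Call when the acknowledgement for the in-flight message arrives."""
        self._in_flight = None
        self._pump()

    def _pump(self) -> None:
        if self._in_flight is None and self._queue:
            self._in_flight = self._queue.popleft()
            self.sent.append(self._in_flight)
```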
    
   For all Upper-Level Protocols that utilize the direct data placement 
   (DDP) shim, the Payload Protocol Identifier will indicate either a 
   null value or an IANA registered protocol identity.  For systems 
   that establish network-virtualized memory schemes, the shim 
   explicitly negotiates process-Placement Tag associations together 
   with related restrictions.  
   ANONYMOUS placement messages interpreted by the shim layer may be 
   used to post requests such as direct reads.  There are, however, 
   many protocols where user buffers associated with either Streams 
   or Placement Tags can be implied to minimize latency.  By using 
   either technique, significant benefits are found through the direct 
   placement standardization of the shim.  Memory based applications 
   will further benefit from autonomous validation and response to 
   these requests to save associated context switching. 
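
   As an illustration of autonomous validation, the sketch below keeps 
   a table of Placement Tag associations and checks each request 
   against it without involving the owning process.  The TagTable 
   class, the single "writable" restriction, and the read method are 
   hypothetical; the actual negotiated restrictions are not defined 
   here. 

```python
from typing import Dict, NamedTuple, Optional


class Region(NamedTuple):
    buffer: bytearray  # memory the owning process exposed under a tag
    writable: bool     # example of a restriction tied to the tag


class TagTable:
    """Maps Placement Tags to regions and validates requests against them."""

    def __init__(self) -> None:
        self._regions: Dict[int, Region] = {}

    def register(self, tag: int, size: int, writable: bool = True) -> None:
        self._regions[tag] = Region(bytearray(size), writable)

    def place(self, tag: int, offset: int, payload: bytes) -> bool:
        """Place the payload if the tag and bounds are valid; else refuse."""
        region = self._regions.get(tag)
        if region is None or not region.writable:
            return False
        if offset < 0 or offset + len(payload) > len(region.buffer):
            return False
        region.buffer[offset : offset + len(payload)] = payload
        return True

    def read(self, tag: int, offset: int, count: int) -> Optional[bytes]:
        """Answer a direct read, or return None when the request is invalid."""
        region = self._regions.get(tag)
        if region is None or offset < 0 or count < 0:
            return None
        if offset + count > len(region.buffer):
            return None
        return bytes(region.buffer[offset : offset + count])
```

   A refused request never reaches the process, which is where the 
   saving in context switches would come from. 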
 
   References 
    
   [RFC1122]  Braden, R., "Requirements for Internet Hosts -- 
   Communication Layers", RFC 1122, October 1989. 
    
   [RFC2026]  Bradner, S., "The Internet Standards Process -- 
   Revision 3", BCP 9, RFC 2026, October 1996. 
    
   [RFC2960]  Stewart, R., Xie, Q., Morneault, K., Sharp, C., 
   Schwarzbauer, H., Taylor, T., Rytina, I., Kalla, M., Zhang, L., and 
   Paxson, V., "Stream Control Transmission Protocol", RFC 2960, 
   October 2000. 

   Author's Address 
           
   Douglas Otis  
   800 E. Middlefield  
   Mountain View, CA 94043  
   USA   
           
   Email: dotis@sanlight.net 
                   
 
   Full Copyright Statement   
              
   Copyright (C) The Internet Society (2002).  All Rights Reserved. 
   This document and translations of it may be copied and furnished to 
   others, and derivative works that comment on or otherwise explain it 
   or assist in its implementation may be prepared, copied, published 
   and distributed, in whole or in part, without restriction of any 
   kind, provided that the above copyright notice and this paragraph 
   are included on all such copies and derivative works.  However, this 
   document itself may not be modified in any way, such as by removing 
   the copyright notice or references to the Internet Society or other 
   Internet organizations, except as needed for the purpose of 
   developing Internet standards in which case the procedures for 
   copyrights defined in the Internet Standards process must be 
   followed, or as required to translate it into languages other than 
   English. 
    
   The limited permissions granted above are perpetual and will not be 
   revoked by the Internet Society or its successors or assigns. 
    
   This document and the information contained herein is provided on an 
   "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING 
   TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING 
   BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION 
   HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF 
   MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. 
    
   Funding for the RFC Editor function is currently provided by the 
   Internet Society.