INTERNET-DRAFT                                    H.K. Jerry Chu
                                                  Sun Microsystems
                                                  Vivek Kashyap
                                                  IBM
Expires: October, 2002                            April, 2002


IP link and multicast over InfiniBand networks


Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at http://www.ietf.org/shadow.html.

Copyright (C) The Internet Society (date). All Rights Reserved.


Abstract

This document specifies a method for setting up IP subnets and multicast services over InfiniBand(TM) networks. Discussions in this document are applicable to both IPv4 and IPv6, unless explicitly specified. A separate document will cover unicast and encapsulation of IP datagrams over InfiniBand networks.

Table of Contents

   1.0 Introduction
   2.0 Terminology
   3.0 Basic IPoIB Transport - Unreliable Datagram
   4.0 IB Multicast Architecture
   5.0 IB Links vs IPoIB Links

Chu & Kashyap                                                   [Page 1]

draft-ietf-ipoib-link-multicast-01.txt                        March 2002

   6.0 Setting up an IPoIB Link
       6.1 Maximum Transmission Unit
       6.2 IPoIB Link Q_Key
       6.3 Other Link Attributes
   7.0 The IPoIB All-Node Multicast and Broadcast Group
   8.0 Mapping for other Multicast Groups
   9.0 Sending and Receiving IP Multicast Packets
   10.0 Support for IP Multicast Routing
   11.0 Security Considerations
   12.0 Acknowledgments
   13.0 References
   14.0 Author's Address
   15.0 Full Copyright Statement


1.0 Introduction

InfiniBand Architecture (IBA) defines four layers of network services corresponding to layer one through layer four of the OSI reference model. For the purpose of running IP over an InfiniBand (IB) network, the IB link, network, and transport layers collectively constitute the data link layer to the IP stack. One can find a general overview of IB architecture related to IP networks in [IPoIB_ARCH].

This document focuses on the steps required to lay out an IP network on top of an IB network. It describes all the elements an IP over InfiniBand (IPoIB) link consists of, how to configure its associated link attributes, and how to set up basic broadcast and multicast services on an IPoIB link.

The IPoIB link is the building block that allows an IP network to be decomposed into multiple subnets connected by IP routers. Subnetting allows the containment of broadcast traffic within a single subnet. It also provides a certain degree of isolation between nodes on different subnets. The latter may be an important consideration for administrative purposes.


2.0 Terminology

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [RFC2119].


3.0 Basic IPoIB Transport - Unreliable Datagram

InfiniBand defines four types of transport services [IBTA]: reliable connection, unreliable connection, reliable datagram, and unreliable datagram. IBA also defines a special raw datagram service for encapsulation purposes. Both unreliable datagram and raw datagram define support for multicast. They provide the basic transport mechanism that best matches the IP datagram paradigm.

IB unreliable datagram provides many additional transport features such as partition key (P_Key) protection, multiple queue pairs (QPs), and Q_Key protection. Moreover, it requires a 32-bit invariant CRC, which provides much stronger protection against data corruption than the 16-bit CRC that a raw datagram carries. For these reasons, IB unreliable datagram is considered to be a much better choice as the basic IPoIB transport than the raw datagram, and is chosen as the default IPoIB transport mechanism ([IPoIB_ARCH], [IPoIB_ENCAP]).


4.0 IB Multicast Architecture

The following discussion gives a short overview of the multicast architecture in InfiniBand. For a more complete description, the reader is referred to [IBTA] and [IPoIB_ARCH].

IBA defines two layers of multicast services. Its link layer uses multicast LIDs (MLIDs), which are allocated by the Subnet Manager (SM) and fall in the range from 0xC000 to 0xFFFE (approximately 16k). MLIDs are used by IB switches to program their multicast forwarding tables. An IB switch implementation may, however, support far fewer MLIDs in its forwarding table.

The IB network layer uses multicast GIDs (MGIDs), which closely resemble IPv6 multicast addresses [AARCH] shown below.

|   8    |  4 |  4 |                  112 bits                   |
+--------+----+----+---------------------------------------------+
|11111111|flgs|scop|                  group ID                   |
+--------+----+----+---------------------------------------------+

[IPoIB_ARCH] describes each field in more detail.

Since every IB multicast packet is required to carry both LRH and GRH, a valid MGID and a valid MLID are both needed before a valid IB multicast packet can be constructed.

An IB multicast group is uniquely identified by a valid MGID. Before a MGID can be used within an IB subnet, either as the destination address of a multicast packet or to represent a multicast group that an IB node can join, a "MCGroupRecord" corresponding to the MGID must be created through the Subnet Administrator (SA). Besides the MGID, the creator must supply values of the path MTU, Q_Key, TClass, FlowLabel, and HopLimit that are appropriate for all the potential clients of the multicast group to use. In return, the SA will allocate a MLID to be used by switches in the local IB subnet.

Unreliable multicast is defined by IBA as an optional functionality for channel adaptors (CAs) and switches. In today's IP technology, link multicast has become an indispensable function for supporting a modern IP network. For this reason, it is required that an IPoIB fabric support multicast. This includes all the CAs and switches that make up an IP network.


5.0 IB Links vs IPoIB Links

A link segment on top of which an IP subnet can be configured is defined in [IPV6] as a communication facility or medium over which nodes can communicate at the "link" layer. For most types of communication media, the boundary between different data link segments follows the physical topology of the network connectivity, and is usually obvious. E.g. an Ethernet network connected by switches, hubs, or bridges usually forms a single link segment and broadcast/multicast domain.
Different Ethernet segments can be connected by IP routers at the network layer.

InfiniBand defines its own link layer and subnets consisting of nodes connected by IB switches. However, the IPoIB link boundary need not follow the IB link boundary. Nodes residing on different IB subnets can still communicate directly with one another through IB routers at the InfiniBand network layer. This communication at the network layer applies to both unicast and multicast.

The ultimate requirement for two nodes in the same IB fabric to communicate at the IB level, besides physical connectivity, is a common P_Key. Partitioning in IB provides an isolation mechanism among nodes in an IB fabric, much like VLANs in an Ethernet network. Each HCA (Host Channel Adaptor) port of an endnode contains a P_Key table of all the valid P_Keys the port is allowed to use. The P_Key table is set up by the SM of the local IB subnet. Each QP is programmed with a P_Key from the local P_Key table. This P_Key is carried in all the outgoing packets from the QP, and is compared against the P_Key of the incoming packets to the QP. Reception of an invalid P_Key causes the packet to be discarded. IB switches may optionally enforce partition checking too.

Therefore P_Key and IB partition are the natural choice for defining the IPoIB link boundary. This choice also affords much flexibility to network administrators when multiple links are set up in a large network.


6.0 Setting up an IPoIB Link

A network administrator defines an IPoIB link by setting up an IB partition and assigning it a unique P_Key. An IB partition may or may not span multiple IB subnets, and whether it does or not is mostly transparent to IPoIB.

Each node attached to the IB partition MUST have one of its HCAs assigned the P_Key to use. Note that the P_Key table of a HCA port may contain many P_Keys.
It is up to an implementation to define the method by which the P_Key relevant to a particular IPoIB subnet is determined and conveyed to the IPoIB stack. E.g. implementations can resort to manual configuration to choose the P_Key or a set of P_Keys for IPoIB use, and rely on DHCP [DHCP] on each IPoIB link to assign an IP subnet number.

Once an IB partition is established for IPoIB use, the link MTU and Q_Key are two other important attributes that must be chosen before the IPoIB link can be configured.


6.1 Maximum Transmission Unit

IB defines five permissible maximum payload sizes. They are 256, 512, 1024, 2048 and 4096 bytes. [IPV6] requires a link MTU of 1280 bytes or greater. For better compatibility with Ethernet, the dominant network medium in both the LAN and WAN environments, the IPoIB link MTU SHALL be 1500 bytes or greater. This leaves only 2048 and 4096 bytes as the two acceptable maximum payload sizes for IPoIB.

Channel adaptors supporting a maximum payload size less than the minimal requirement can still expose an acceptable link MTU to IP through an adaptation layer that fragments larger messages into smaller IB packets, and reassembles them on the receiving end. But this must be done in a way that is totally transparent to the IP stack.

It is up to the network administrator to select a link MTU to use when configuring an IPoIB link. The link MTU SHALL NOT be greater than the maximum payload size of any IB component that is part of the IPoIB link minus "EtherType" [IPoIB_ENCAP]. This includes IB switches, CAs, and routers.

In general, a full link MTU should be employed whenever possible to attain better throughput performance. One caveat is that once a link MTU is chosen for a given IPoIB link, nodes connected by CAs of a smaller maximum payload size won't be able to join the link unless the whole link and all the devices attached to it are reconfigured to use a smaller MTU.
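
As an illustration of the selection rule above, the following Python sketch picks the largest link MTU that every component of a link can carry. It is a minimal sketch and not part of this specification: the function name and the `encap_overhead` parameter are hypothetical, the actual amount to subtract is defined in [IPoIB_ENCAP], and the adaptation-layer option for CAs with a smaller payload size is not modeled.

```python
# Maximum payload sizes permitted by IB, as listed in section 6.1.
IB_PAYLOAD_SIZES = (256, 512, 1024, 2048, 4096)

# Smallest link MTU that section 6.1 allows for an IPoIB link.
MIN_IPOIB_LINK_MTU = 1500


def max_link_mtu(component_payload_sizes, encap_overhead):
    """Return the largest link MTU that all listed IB components can carry.

    component_payload_sizes: maximum payload size, in bytes, of every
        switch, CA and router that is part of the IPoIB link.
    encap_overhead: bytes subtracted for the encapsulation header. This is
        a hypothetical parameter; the real value comes from [IPoIB_ENCAP].
    """
    sizes = list(component_payload_sizes)
    if not sizes:
        raise ValueError("an IPoIB link needs at least one IB component")
    for size in sizes:
        if size not in IB_PAYLOAD_SIZES:
            raise ValueError(f"{size} is not a permissible IB payload size")

    # The link MTU is bounded by the weakest component on the link.
    mtu = min(sizes) - encap_overhead
    if mtu < MIN_IPOIB_LINK_MTU:
        raise ValueError(f"link MTU {mtu} is below {MIN_IPOIB_LINK_MTU} bytes")
    return mtu
```

With an overhead of 4 bytes (an example value only), a link whose components support 4096, 2048 and 4096 bytes yields a link MTU of 2044 bytes, while a link containing a component limited to 1024 bytes is rejected.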

The flexibility of configuring a link MTU smaller than the full size does make it easier to bridge an IPoIB link with an Ethernet link, by setting the MTU of all the connecting nodes to 1500 bytes. For IPv4, this may require manual configuration of a MTU different from the default link MTU size on all the nodes belonging to an IPoIB link. For IPv6, one can use the link MTU option of the router advertisement [DISC] to announce a smaller MTU to all the nodes.

In case an IPoIB link spans more than one IB subnet, the IPoIB link MTU MUST NOT exceed the path MTU of any path connecting two nodes in the same IB partition. It is up to the network administrator to determine the appropriate path MTU value that will work for any node in the same IPoIB link.


6.2 IPoIB Link Q_Key

A Q_Key is programmed by the source QP in every IB datagram, and is verified by the destination QP against the Q_Key it has been assigned. A Q_Key violation will cause the offending datagram to be dropped, and a Q_Key violation trap to be raised.

A Q_Key must be selected to be used by all the QPs attached to an IPoIB link. It is recommended that a controlled Q_Key be used with the high order bit set. This is to prevent non-privileged software from fabricating and sending out bogus IP datagrams.

All QPs configured for use on a given IPoIB link SHALL be assigned the same per-link Q_Key.


6.3 Other Link Attributes

TClass, FlowLabel, and HopLimit are three other attributes that are required for an IPoIB link covering more than a single IB subnet. The selection of these values is implementation dependent, but it must take into account the topology of the IB subnets comprising the IPoIB link to ensure successful communication between any two nodes in the same IPoIB link.
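
The Q_Key rules in section 6.2 can be checked mechanically. The sketch below is illustrative only: it assumes a 32-bit Q_Key as defined by IBA, and both helper names are hypothetical rather than part of any IB programming interface.

```python
# High-order bit of the 32-bit Q_Key; set for controlled Q_Keys (section 6.2).
CONTROLLED_QKEY_BIT = 0x80000000


def is_controlled_qkey(q_key):
    """Return True if q_key has its high-order bit set."""
    if not 0 <= q_key <= 0xFFFFFFFF:
        raise ValueError("a Q_Key must fit in 32 bits")
    return bool(q_key & CONTROLLED_QKEY_BIT)


def qps_match_link_qkey(link_q_key, qp_q_keys):
    """Return True if every QP on the link uses the per-link Q_Key."""
    return all(q_key == link_q_key for q_key in qp_q_keys)
```

A configuration tool could, for example, warn when `is_controlled_qkey()` is false for the chosen per-link Q_Key and refuse to bring up the link when `qps_match_link_qkey()` is false.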


7.0 The IPoIB All-Node Multicast and Broadcast Group

Once an IB partition is created with link attributes identified for an IPoIB link, the network administrator must create a special IB multicast group for every node on the IPoIB link to join. This is achieved through the creation of a "MCGroupRecord" in each IB subnet that the IB partition encompasses, as described in section 4 above.

The MGID will have the P_Key of the IB partition that defines the IPoIB link embedded in it. A special signature is also embedded to identify the MGID for IPoIB use only. For IPv4 over IB, the signature will be "0x401B". For IPv6 over IB, the signature will be "0x601B".

For an IPv4 subnet, the MGID for this special IB multicast group SHALL have the following format:

|   8    |  4 |  4 |     16 bits     | 16 bits | 48 bits  | 32 bits |
+--------+----+----+-----------------+---------+----------+---------+
|11111111|0001|scop|<IPoIB signature>|< P_Key >|00.......0|<all 1's>|
+--------+----+----+-----------------+---------+----------+---------+

For an IPv6 subnet, the format of the MGID SHALL look like this:

|   8    |  4 |  4 |     16 bits     | 16 bits |       80 bits      |
+--------+----+----+-----------------+---------+--------------------+
|11111111|0001|scop|<IPoIB signature>|< P_Key >|000.............0001|
+--------+----+----+-----------------+---------+--------------------+

As for the scop bits, if the IPoIB link is fully contained within a single IB subnet, the scop bits SHALL be set to 2 (link-local). Otherwise the scope will be set higher.

A MCGroupRecord will be created with all the IPoIB link attributes described before.

When a node is attached to an IPoIB link identified by a P_Key, it must look for a special, all-node multicast/broadcast group to join. This is done by constructing the MGID with the link P_Key and the IPoIB signature. The node SHOULD always look for a MGID of link-local scope first before attempting one with a greater scope.
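
The MGID construction described above can be sketched in Python as follows. This is a minimal, non-normative illustration. The function and constant names are hypothetical, the low 32 bits of the IPv4 broadcast MGID are assumed to be all ones (in line with the mapping of the limited broadcast address 255.255.255.255 in section 8), and the set of wider scopes to try is left to the caller because this document does not enumerate them.

```python
import ipaddress

IPV4_SIGNATURE = 0x401B  # IPoIB signature for IPv4 over IB (section 7.0)
IPV6_SIGNATURE = 0x601B  # IPoIB signature for IPv6 over IB (section 7.0)
LINK_LOCAL_SCOPE = 0x2   # scop value for a link within one IB subnet


def all_node_mgid(p_key, scope=LINK_LOCAL_SCOPE, ipv6=False):
    """Build the all-node multicast/broadcast MGID for an IPoIB link."""
    if not 0 <= p_key <= 0xFFFF:
        raise ValueError("a P_Key must fit in 16 bits")
    if not 0 <= scope <= 0xF:
        raise ValueError("the scop field is 4 bits wide")

    signature = IPV6_SIGNATURE if ipv6 else IPV4_SIGNATURE
    # Low-order bits: 1 for the IPv6 all-node group. For IPv4 the low
    # 32 bits are ASSUMED to be all ones (the broadcast address).
    low_bits = 0x1 if ipv6 else 0xFFFFFFFF

    value = (
        (0xFF << 120)          # leading 8 bits, all ones
        | (0x1 << 116)         # flags field, 0001
        | (scope << 112)       # 4-bit scop field
        | (signature << 96)    # 16-bit IPoIB signature
        | (p_key << 80)        # 16-bit P_Key
        | low_bits             # remaining 80 bits
    )
    return ipaddress.IPv6Address(value)


def candidate_mgids(p_key, wider_scopes=(), ipv6=False):
    """List the MGIDs to look up, link-local scope first."""
    wider = sorted(s for s in wider_scopes if s > LINK_LOCAL_SCOPE)
    return [all_node_mgid(p_key, s, ipv6) for s in [LINK_LOCAL_SCOPE] + wider]
```

For a link with a P_Key of 8 that is contained in a single IB subnet, `all_node_mgid(8, ipv6=True)` gives ff12:601b:8::1 in compressed notation. A node would look up each MGID returned by `candidate_mgids()` in order and stop at the first one for which a MCGroupRecord exists.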

Once the right MGID and MCGroupRecord are identified, the local node SHOULD use the link MTU recorded in the MCGroupRecord. In case the link MTU is greater than the maximum payload size that the local HCA can support, the node cannot join the IPoIB link and operate as an IP node. Otherwise the local node must join the special all-node multicast/broadcast group by calling the SA to create a MCMemberRecord corresponding to the MGID. The SA will return all the link attributes for the local node to use. The node MUST use these attributes in all future multicast operations to the local IPoIB link.

The broadcast group for IPv4 provides a broadcast service for protocols such as ARP to use.

In addition to the all-node multicast/broadcast group, an all-router multicast group SHOULD be created at link configuration time if an IP router will be attached to the link. This is to facilitate the IP multicast operations described later. A MCGroupRecord for the all-router MGID must be created in every IB subnet that the IPoIB link encompasses. The format of the all-router MGID is covered in the next section.


8.0 Mapping for other Multicast Groups

The support of general IP multicast [IPMULT] over IB is similar to the case of the special all-node multicast/broadcast group discussed above. An algorithmic mapping is used so that, given an IP multicast address, an individual host can compute the corresponding IB multicast address (MGID) all by itself without having to consult an external entity. This also removes the need for an externally maintained IP to IB multicast mapping table.

The IPoIB multicast mapping is defined as follows. The same mapping function is used for both IPv4 and IPv6 except for the IPoIB signature field.

|   8    |  4 |  4 |     16 bits     | 16 bits |       80 bits      |
+--------+----+----+-----------------+---------+--------------------+
|11111111|0001|scop|<IPoIB signature>|< P_Key >|      group ID      |
+--------+----+----+-----------------+---------+--------------------+

Since a MGID allocated for transporting IP multicast datagrams is considered only a transient link-layer multicast address, all IB MGIDs allocated for IPoIB purposes SHOULD have T = 1. The scope bits SHALL be the same as those of the all-node MGID for the same IPoIB link.

An IP multicast address is used together with a given IPoIB link P_Key to form the MGID of the IB multicast group. For IPv6, the lower 80 bits of the group ID are used directly in the lower 80 bits of the MGID. For IPv4, the group ID is only 28 bits long and the rest of the 80 bits are filled with 0. The rest of the bits are the same as those of the all-node MGID.

E.g. on an IPoIB link that is fully contained within a single IB subnet with a P_Key of 8, the MGIDs for the all-router multicast group with group ID 2 [AARCH, IGMP2] are:

FF12:401B:8:0:0:0:0:2 or FF12:401B:8::2

for IPv4 in a compressed format, and

FF12:601B:8:0:0:0:0:2 or FF12:601B:8::2

for IPv6 in a compressed format.

A special case exists for the IPv4 limited broadcast address "255.255.255.255" [HOSTS]. The address SHALL be mapped to the broadcast MGID for IPv4 networks as described in section 7 above. Also the IPv6 all-node multicast address "FF0X::1" [AARCH] will be mapped to the special all-node MGID for IPv6 networks.


9.0 Sending and Receiving IP Multicast Packets

In order to send a packet destined for an IP multicast address, a node must first check for the existence of a MCGroupRecord corresponding to the MGID of the outbound link. If one already exists, the MLID allocated by the SM for the MCGroupRecord is used as the DLID for the packet. Otherwise, it means no member exists on the local link.
The packet should be forwarded to the all-router multicast group to ensure the correct delivery of the packet to multicast listeners on remote networks. (See section 10 below.) If an all-router multicast group doesn't already exist, it implies that no router is present on the local subnet. The packet can then be safely dropped.

Note that the local node MUST be notified if an IB multicast group corresponding to the MGID comes into existence later. This signifies that an interested party has just shown up on the local link and therefore must be copied.

For a node to join an IP multicast group to receive IP multicast packets, it must first construct a MGID corresponding to the IP multicast group, using the rule described above. Note that it must remember the scope bits from the all-node MGID, and use the same scope in all the MGIDs it constructs.

The local node then checks with the SA to see if a MCGroupRecord corresponding to the MGID already exists. If not, one must be created. The MCGroupRecord MUST be created with the IPoIB link MTU. For the rest of the attributes, it is recommended that the node use the same values as the all-node multicast/broadcast group corresponding to the link.

Note that for an IPoIB link that spans more than one IB subnet connected by IB routers, adequate multicast forwarding support at the IB level is required for multicast packets to be forwarded properly to members in remote IB subnets. The specific mechanism for this will be covered in [IBTA], and is out of scope of this document.

Once the IB multicast group is identified, the node must join the group, unless it is a member already, by calling the SA to create a MCMemberRecord corresponding to the MGID. The join call enables the SM to program local IB switches and routers with the new multicast information.
Specifically it causes an IB switch to add the LID of the caller to its forwarding table entry corresponding to the MLID allocated for the group. It also causes an IB router to attach itself to the IB multicast tree corresponding to the MGID.

In case any of the above IB operations fails, a node MAY choose to simply join the all-router multicast group. This will ensure it receives a copy of every multicast packet on the local link. Nodes doing so MUST filter out those multicast packets that are of no interest to the local node.

When a node leaves an IP multicast group, it SHOULD delete the MCMemberRecord from the SA. This allows the SA to free up related resources. The SM should delete MCGroupRecords that are no longer in use, and free up the MLIDs allocated for them. The specific algorithm is implementation-dependent, and therefore is out of scope of this document.


10.0 Support for IP Multicast Routing

IP multicast routing requires a router to receive a copy of every link multicast packet on a locally connected link [IPMULT, IP6MLD]. For Ethernet this is usually done by turning on promiscuous multicast mode on a locally connected Ethernet interface.

Unfortunately IBA does not support promiscuous multicast mode. Therefore the IPoIB driver should forward a copy of every outbound multicast datagram to the MGID corresponding to the all-router multicast group. This is to ensure that multicast packets are properly forwarded to remote IP networks, and applies to IP hosts as well as multicast routers.


11.0 Security Considerations

All the operations for creating and configuring an IPoIB link described in this document, including assigning P_Keys to CAs, creating MCGroupRecords and MCMemberRecords in the SA, creating and attaching QPs to IB multicast groups, etc., are privileged operations, and MUST be protected by the underlying operating system. This is to prevent malicious, non-privileged software from hijacking important resources and configurations. E.g.
A bogus all-node IPoIB multicast group may prevent a proper one from being created when the network administrator tries to set up a link.

Controlled Q_Keys SHOULD be used in IB multicast groups in order to prevent non-privileged software from fabricating IP datagrams to send, as mentioned in section 6.2.


12.0 Acknowledgments

The authors would like to thank Bruce Beukema, David Brean, Dan Cassiday, Aditya Dube, Yaron Haviv, Michael Krause, Thomas Narten, Erik Nordmark, Greg Pfister, Renato Recio, Satya Sharma, David L. Stevens, and Madhu Talluri for their suggestions and many clarifications on the IBA specification.


13.0 References

[AARCH] Hinden, R. and S. Deering, "IP Version 6 Addressing Architecture", RFC 2373, July 1998.

[DHCP] Droms, R., "Dynamic Host Configuration Protocol", RFC 2131, March 1997.

[DISC] Narten, T., Nordmark, E. and W. Simpson, "Neighbor Discovery for IP Version 6 (IPv6)", RFC 2461, December 1998.

[HOSTS] Braden, R., "Requirements for Internet Hosts -- Communication Layers", RFC 1122, October 1989.

[IBTA] InfiniBand Architecture Specification, Release 1.0.a, InfiniBand Trade Association, www.infinibandta.org.

[IGMP2] Fenner, W., "Internet Group Management Protocol, Version 2", RFC 2236, November 1997.

[IPMULT] Deering, S., "Host Extensions for IP Multicasting", RFC 1112, August 1989.

[IPoIB_ARCH] draft-ietf-ipoib-architecture-01.txt

[IPoIB_ENCAP] draft-ietf-ipoib-ip-over-infiniband-00.txt

[IPV6] Deering, S. and R. Hinden, "Internet Protocol, Version 6 (IPv6) Specification", RFC 2460, December 1998.

[IP6MLD] Deering, S., Fenner, W. and B. Haberman, "Multicast Listener Discovery (MLD) for IPv6", RFC 2710, October 1999.

[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.


14.0 Author's Address

H.K. Jerry Chu
901 San Antonio Road, UMPK17-201
Palo Alto, CA 94303-4900
USA

Phone: +1 650 786-5146
EMail: jerry.chu@sun.com

Vivek Kashyap
IBM
15450, SW Koll Parkway
Beaverton, OR 97006

Phone: 503 578 3422
EMail: vivk@us.ibm.com


15.0 Full Copyright Statement

Copyright (C) The Internet Society (2001). All Rights Reserved.

This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.