Internet Engineering Task Force                            Sassan Ahmadi
Audio Video Transport WG                                      Nokia Inc.
INTERNET-DRAFT                                          November 1, 2003
Expires: June 1, 2004

Real-Time Transport Protocol (RTP) Payload and File Storage Formats
for the Variable-Rate Multimode Wideband (VMR-WB) Audio Codec

Status of this Memo

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or cite them other than as "work in progress".

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html

This document is an individual submission to the IETF. Comments should
be directed to the authors.

Copyright Notice

Copyright (C) The Internet Society (2003). All Rights Reserved.

Abstract

This document specifies a Real-time Transport Protocol (RTP) payload
format to be used for the Variable-Rate Multimode Wideband (VMR-WB)
speech codec. The payload format is designed to be able to interoperate
with existing VMR-WB transport formats on non-IP networks. In addition,
a file format is specified for transport of VMR-WB speech data in
storage mode applications such as email. A MIME type registration is
included for VMR-WB, specifying use of both the RTP payload and the
storage format.

VMR-WB is a variable-rate multimode wideband speech codec that has a
number of operating modes, one of which is fully interoperable with the
AMR-WB (G.722.2) audio codec.
Therefore, provisions have been made in this draft to facilitate and
simplify data packet exchange between VMR-WB and AMR-WB (i.e., RFC 3267)
in the interoperable mode with a minimal-logic interworking function in
the transport layer (i.e., an RTP translator).

Sassan Ahmadi                                                   [page 1]

INTERNET-DRAFT  VMR-WB RTP Payload and File Storage Formats November 2003

Table of Contents

1. Introduction
2. Conventions and Acronyms
3. Background on the Adaptive Multi-Rate Wideband (AMR-WB) Speech Codec
4. The Variable-Rate Multimode Wideband (VMR-WB) Speech Codec
   4.1. Narrowband Speech Processing
   4.2. Continuous vs. Discontinuous Transmission
5. Support for Multi-Channel Session
6. Robustness against Packet Loss
   6.1. Forward Error Correction (FEC)
   6.2. Frame Interleaving and Multi-Frame Encapsulation
7. Bandwidth Efficient or Octet-aligned Mode
8. VMR-WB Voice over IP Scenarios
9. VMR-WB RTP Payload Format
   9.1. RTP Header Usage
   9.2. Payload Structure
   9.3. Bandwidth-Efficient Mode
      9.3.1. The Payload Header
      9.3.2. The Payload Table of Contents
      9.3.3. Speech Data
      9.3.4. Algorithm for Forming the Payload
      9.3.5. Payload Examples
         9.3.5.1. Single Channel Payload Carrying a Single Frame
         9.3.5.2. Single Channel Payload Carrying Multiple Frames
         9.3.5.3. Multi-Channel Payload Carrying Multiple Frames
   9.4. Octet-aligned Mode
      9.4.1. The Payload Header
      9.4.2. The Payload Table of Contents
      9.4.3. Speech Data
      9.4.4. Methods for Forming the Payload
      9.4.5. Payload Example
         9.4.5.1. Basic Single Channel Payload Carrying Multiple Frames
   9.5. Implementation Considerations
10. VMR-WB Storage Format
   10.1. Single Channel Header
   10.2. Multi-channel Header
   10.3. Speech Frames
11. Congestion Control (Network-Controlled Mode Switching)
12. Security Considerations
   12.1. Confidentiality
   12.2. Authentication
   12.3. Decoding Validation and Provision for Lost or Late Packets
13. Payload Format Parameters
   13.1. VMR-WB MIME Registration
   13.2. Mapping MIME Parameters into SDP
14. IANA Considerations
15. Acknowledgements
Appendix A VMR-WB Frame Structure
Appendix B Interworking Function (IWF) for Interoperable
           AMR-WB <-> VMR-WB Interconnections
References
   Normative References
   Informative References
Author's Address
Full Copyright Statement

1. Introduction

This document specifies the payload format for packetization of VMR-WB
encoded speech signals into the Real-time Transport Protocol (RTP) [5].
The payload format supports transmission of single and multiple
channels, multiple frames per payload, the use of seamless mode
switching, and interoperation with existing VMR-WB transport formats on
non-IP networks, as described in Section 4.

The payload format itself is specified in Section 9. A related file
format is specified in Section 10 for transport of VMR-WB speech data in
storage mode applications such as email. In Section 13, a MIME type
registration for VMR-WB is provided.

Because VMR-WB is interoperable with AMR-WB, and because IP-based
interconnection is in practice the most efficient way to connect the two
codecs, this draft attempts to maximize the similarities with RFC 3267
while optimizing the payload for the VMR-WB codec itself.

2. Conventions and Acronyms

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
"SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
document are to be interpreted as described in RFC 2119 [3].
The following acronyms are used in this document:

3GPP   - the Third Generation Partnership Project
3GPP2  - the Third Generation Partnership Project 2
CDMA   - Code Division Multiple Access
WCDMA  - Wideband Code Division Multiple Access
GSM    - Global System for Mobile Communications
AMR-WB - Adaptive Multi-Rate Wideband Codec
VMR-WB - Variable-Rate Multimode Wideband Codec
CMR    - Codec Mode Request
CN     - Comfort Noise
DTX    - Discontinuous Transmission
FEC    - Forward Error Correction
SID    - Silence Indicator (the frames containing only CN parameters)
VAD    - Voice Activity Detection
IWF    - Interworking Function
TrFO   - Transcoder-Free Operation
UDP    - User Datagram Protocol
RTP    - Real-Time Transport Protocol
MIME   - Multipurpose Internet Mail Extensions
IF2    - Interface Format 2 (an AMR-WB frame structure type)
SDP    - Session Description Protocol
SIP    - Session Initiation Protocol

The term "frame-block" is used in this document to describe the
time-synchronized set of speech frames in a multi-channel VMR-WB
session. In particular, in an N-channel session, a frame-block will
contain N speech frames, one from each of the channels, and all N speech
frames represent exactly the same time period.

3. Background on the Adaptive Multi-Rate Wideband (AMR-WB) Speech Codec

The Adaptive Multi-Rate Wideband (AMR-WB) speech codec was developed by
the 3rd Generation Partnership Project (3GPP) for multimedia services in
3G GSM/WCDMA cellular systems [1,2,4,6]. It was later selected by ITU-T
as Recommendation G.722.2. The AMR-WB codec is a multi-mode speech codec
with Voice Activity Detection and Discontinuous Transmission (VAD/DTX)
capability. AMR-WB supports 9 wideband speech coding modes with
respective bit rates ranging from 6.6 to 23.85 kbps. The input/output
sampling frequency used in AMR-WB is 16000 Hz and the speech processing
is performed on 20 ms frames.
This means that each AMR-WB encoded frame represents 320 speech samples.

The multi-rate encoding (i.e., multi-mode) capability of AMR-WB is
designed to preserve high speech quality under a wide range of
transmission conditions. That is, the AMR-WB codec mode is adapted to
prevailing channel conditions by trading off the number of source-coding
bits against the number of channel-coding bits. With AMR-WB, GSM mobile
radio systems are able to use available bandwidth as effectively as
possible. For example, in GSM it is possible to dynamically adjust the
speech encoding rate during a session so as to continuously adapt to the
varying transmission conditions. The fixed overall bandwidth is divided
between speech data and error protective coding to enable the best
possible trade-off between speech compression rate and error tolerance.
To perform mode adaptation, the decoder (speech receiver) needs to
signal the encoder (speech sender) the new mode it prefers. This mode
change signal is called Codec Mode Request or CMR [4].

Since in most sessions speech is sent in both directions between the two
ends, the mode requests from the decoder at one end to the encoder at
the other end are piggy-backed over the speech frames in the reverse
direction. In other words, no out-of-band signaling is needed for
sending CMRs.

4. The Variable-Rate Multimode Wideband (VMR-WB) Speech Codec

VMR-WB is the wideband speech-coding standard developed by the Third
Generation Partnership Project 2 (3GPP2) for multimedia services in 3G
CDMA cellular systems. Unlike AMR-WB, VMR-WB is a source-controlled
variable-rate multimode wideband speech codec. It has a number of
operating modes, where each mode is a tradeoff between voice quality and
system capacity. Each mode therefore has an associated quality and
average data rate (ADR). Note that the concept of mode in VMR-WB is
different from that of AMR-WB.
The operating mode in VMR-WB is chosen based on the traffic condition of
the network and the desired quality of service [9,10,11]. The desired
ADR in each mode is obtained by encoding speech frames at different
rates available in CDMA Rate-Set II depending on the characteristics of
input speech and the maximum and minimum rate constraints imposed by the
network operator.

While VMR-WB is a native CDMA codec complying with all CDMA system
requirements, it is further interoperable with AMR-WB in one of its
operational modes. This is because VMR-WB and AMR-WB share the same core
technology. This feature enables Transcoder-Free Operation (TrFO)
interconnections between VMR-WB and AMR-WB across different
wireless/wireline systems (e.g., GSM/WCDMA and CDMA2000) without
unnecessarily complex media format conversion. Because the GSM/WCDMA and
CDMA2000 signaling protocols are incompatible, a completely
interoperable interconnection between VMR-WB and AMR-WB is accomplished
through a minimal-logic Interworking Function (IWF) that resides in the
transport layer of one of the gateways between the two incompatible
terminals (see Appendix B for more details).

The current implementation of VMR-WB is compliant with CDMA Rate-Set II
operation (i.e., Multiplex Option 2 [12,13]) and supports
interoperability with AMR-WB at 12.65 kbps (i.e., AMR-WB mode 2).
However, the current document has been drafted to accommodate future
design extensions to VMR-WB including other AMR-WB codec modes (i.e.,
AMR-WB modes 0 and 1).

VMR-WB is able to transition between various modes with no degradation
in voice quality that is attributable to the mode switching itself;
i.e., seamless mode switching. The operating mode of the VMR-WB encoder
may be switched seamlessly without prior knowledge of the decoder. All
modes (i.e., mode 0, 1, 2, and 3) can be chosen depending on the traffic
conditions (i.e., congestion) and the desired quality of service.
While in the interoperable mode, mode switching is not allowed. There is
only one AMR-WB interoperable mode in VMR-WB. Since the AMR-WB codec may
request a mode change depending on channel conditions, in-band data
included in the VMR-WB frame structure, as shown in Appendix A, is used
during an interoperable interconnection to switch between AMR-WB codec
modes 0, 1, or 2.

As mentioned earlier, VMR-WB is compliant with CDMA Multiplex Option 2
[13] with the permissible encoding rates shown in Table 1. The CDMA
system requires the codecs to generate speech frames compliant with the
encoding rates in Table 1 while operating in Rate-Set II [12]. Also, in
certain conditions in a CDMA system, such as blank-and-burst or
dim-and-burst signaling, all or part of the primary traffic is used by
the system for signaling, so the codecs are forced to use lower bit
rates for encoding the speech data [12].

+--------------+------------------------------+----------------------+
| Frame Type   | Bits per Packet (Frame Size) | Encoding Rate (kbps) |
+--------------+------------------------------+----------------------+
| Full-Rate    |             266              |         13.3         |
| Half-Rate    |             124              |          6.2         |
| Quarter-Rate |              54              |          2.7         |
| Eighth-Rate  |              20              |          1.0         |
| Blank        |               0              |           -          |
| Erasure      |               0              |           -          |
+--------------+------------------------------+----------------------+

Table 1: CDMA Rate-Set II frame types and their associated encoding
rates

VMR-WB is robust to a high percentage of packet loss and to packets with
corrupted rate information. The reception of Blank or Erasure frame
types at the decoder invokes the built-in frame error concealment
mechanism, which conceals the effect of lost packets by exploiting
in-band data and the data in the previous frames.
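The relationship between frame size and encoding rate in Table 1 can be illustrated with a minimal sketch. It assumes the 20 ms frame duration used throughout this document; the names RATE_SET_II_BITS, encoding_rate_kbps, and needs_concealment are hypothetical and are not part of any VMR-WB API.

```python
# Hypothetical helper names; frame sizes are taken from Table 1.
FRAMES_PER_SECOND = 50  # assumes one frame every 20 ms

# Bits per frame for each CDMA Rate-Set II frame type.
RATE_SET_II_BITS = {
    "Full-Rate": 266,
    "Half-Rate": 124,
    "Quarter-Rate": 54,
    "Eighth-Rate": 20,
    "Blank": 0,
    "Erasure": 0,
}


def encoding_rate_kbps(frame_type):
    """Return the encoding rate implied by the frame size, in kbps."""
    bits = RATE_SET_II_BITS[frame_type]
    return bits * FRAMES_PER_SECOND / 1000


def needs_concealment(frame_type):
    """Blank and Erasure frames carry no speech bits, so the decoder
    falls back on its frame error concealment for them."""
    return RATE_SET_II_BITS[frame_type] == 0
```

For instance, a Full-Rate frame of 266 bits sent every 20 ms corresponds to 266 x 50 = 13300 bit/s.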
The built-in noise pre-processing module in VMR-WB considerably improves
the performance under severe background noise conditions. The VMR-WB
codec further has the capability to detect and conceal frames with
corrupted rate information. Frames with erroneous rate information MAY
be passed to the decoder by the CDMA Multiplex sublayer on the receiving
side.

4.1. Narrowband Speech Processing

VMR-WB has the capability to operate with 8000 Hz sampled input/output
speech signals in all modes of operation [9,10]. Mode switching MAY be
utilized to change the mode of operation while processing narrowband
speech signals. However, during a session, transition between narrowband
and wideband processing is not allowed due to different timestamps and
other likely synchronization problems.

4.2. Continuous vs. Discontinuous Transmission

The circuit-switched operation of VMR-WB within a CDMA network requires
continuous transmission of the speech data during a conversation once a
voice service option is initiated [12,13]. Also, the intrinsic
source-controlled variable-rate feature of the CDMA speech codecs is
REQUIRED for optimal operation of the CDMA system and interference
control. However, VMR-WB has the capability to operate in a
discontinuous transmission mode for some packet-switched applications
over IP networks, where the number of transmitted bits and packets
during silence periods is reduced to a minimum. The VMR-WB DTX operation
is similar to that of AMR-WB [4].

5. Support for Multi-Channel Session

Both the RTP payload format and the storage format defined in this
document support multi-channel audio content (e.g., a stereophonic
speech session).

Although the VMR-WB codec itself does not support encoding of
multi-channel audio content into a single bit stream, it can be used to
separately encode and decode each of the individual channels.
To transport (or store) the separately encoded multi-channel content,
the speech frames for all channels that are framed and encoded for the
same 20 ms periods are logically collected in a frame-block.

At the session setup, out-of-band signaling must be used to indicate the
number of channels in the session and the order of the speech frames
from different channels in each frame-block. When using SDP for
signaling, the number of channels is specified in the rtpmap attribute
and the order of channels carried in each frame-block is implied by the
number of channels as specified in Section 4.1 in [19].

6. Robustness against Packet Loss

The payload format supports several features, including forward error
correction (FEC) and frame interleaving, in order to increase robustness
against lost packets.

6.1. Forward Error Correction (FEC)

The simple scheme of repetition of previously sent data is one way of
achieving FEC. Another possible scheme, which is more bandwidth
efficient, is to use payload-external FEC, e.g., RFC 2733 [19], which
generates extra packets containing repair data. The FEC feature is
included for further compatibility with the AMR-WB payload.

The repetition method involves the simple retransmission of previously
transmitted frame-blocks together with the current frame-block(s). This
is done by using a sliding window to group the speech frame-blocks to
send in each payload. Figure 1 illustrates an example.

--+--------+--------+--------+--------+--------+--------+--------+--
  | f(n-2) | f(n-1) |  f(n)  | f(n+1) | f(n+2) | f(n+3) | f(n+4) |
--+--------+--------+--------+--------+--------+--------+--------+--

  <---- p(n-1) ---->
           <----- p(n) ----->
                    <---- p(n+1) ---->
                             <---- p(n+2) ---->
                                      <---- p(n+3) ---->
                                               <---- p(n+4) ---->

Figure 1: An example of redundant transmission.

In this example each frame-block is retransmitted one time in the
following RTP payload packet.
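The sliding-window grouping of Figure 1 can be sketched as follows. This is an illustration only: the function name build_redundant_payloads and the redundancy parameter are hypothetical, frame-blocks are represented by opaque values, and placing the oldest frame-block first in each payload is an assumption of the sketch.

```python
def build_redundant_payloads(frame_blocks, redundancy=1):
    """Group frame-blocks with a sliding window.

    Payload n carries frame-block n preceded by up to `redundancy`
    earlier frame-blocks. With redundancy=1 this matches Figure 1,
    where p(n) carries f(n-1) and f(n).
    """
    payloads = []
    for n in range(len(frame_blocks)):
        start = max(0, n - redundancy)
        # Assumption: oldest frame-block first within a payload.
        payloads.append(frame_blocks[start:n + 1])
    return payloads
```

With redundancy=1, every frame-block except the last one in the stream appears in two consecutive payloads, so the loss of a single packet does not lose any frame-block.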
In Figure 1, f(n-2)..f(n+4) denotes a sequence of speech frame-blocks
and p(n-1)..p(n+4) a sequence of payload packets.

The use of this approach does not require signaling at the session
setup. In other words, the speech sender can choose to use this scheme
without consulting the receiver. This is because a packet containing
redundant frames will not look different from a packet with only new
frames. The receiver may receive multiple copies or versions of a frame
for a certain timestamp if no packet is lost. If multiple versions of
the same speech frame are received, it is RECOMMENDED that the highest
rate be used by the speech decoder.

This redundancy scheme provides the same functionality as the one
described in RFC 2198 "RTP Payload for Redundant Audio Data" [19]. In
most cases the mechanism in this payload format is more efficient and
simpler than requiring both endpoints to support RFC 2198 in addition.
If the spread in time required between the primary and redundant
encodings is larger than 5 frame times, the bandwidth overhead of RFC
2198 will be lower.

The sender is responsible for selecting an appropriate amount of
redundancy based on feedback about the channel, e.g., in RTCP receiver
reports, or network traffic. A sender should not base selection of FEC
on the CMR, as this parameter most probably was set based on non-IP
information. The sender is also responsible for avoiding congestion,
which may be aggravated by redundant transmission.

6.2. Frame Interleaving and Multi-Frame Encapsulation

To decrease protocol overhead, the payload design allows several speech
frame-blocks to be encapsulated into a single RTP packet. One of the
drawbacks of such an approach is that a packet loss then means the loss
of several consecutive speech frame-blocks, which usually causes clearly
audible distortion in the reconstructed speech.
Interleaving of frame-blocks can improve the speech quality in such
cases by distributing the consecutive losses into a series of single
frame-block losses. However, interleaving and bundling several
frame-blocks per payload will also increase end-to-end delay and is
therefore not appropriate for all types of applications. Streaming
applications will most likely be able to exploit interleaving to improve
speech quality in lossy transmission conditions.

This payload design supports the use of frame interleaving as an option.
For the encoder (speech sender) to use frame interleaving in its
outbound RTP packets for a given session, the decoder (speech receiver)
needs to indicate its support via out-of-band means (see Section 13).

7. Bandwidth Efficient or Octet-aligned Mode

For a given session, the payload format can be either bandwidth
efficient or octet aligned, depending on the mode of operation that is
established for the session via out-of-band means.

In the octet-aligned format, all the fields in a payload, including the
payload header, table of contents entries, and speech frames themselves,
are individually aligned to octet boundaries to make implementations
efficient. In the bandwidth-efficient format only the full payload is
octet aligned, so fewer padding bits are added.

Note that octet alignment of a field or payload means that the last
octet is padded with zeroes in the least significant bits to fill the
octet. Also note that this padding is separate from padding indicated by
the P bit in the RTP header.

Of the two payload formation approaches, only the octet-aligned mode can
use interleaving to make the speech transport robust to packet loss.

8. VMR-WB Voice over IP Scenarios

The primary scenario for this payload format is IP end-to-end between
two terminals incorporating the VMR-WB codec, as shown in Figure 2.
This payload format is expected to be useful for both conversational and
streaming services.

+----------+                         +----------+
|          |                         |          |
| TERMINAL |<----------------------->| TERMINAL |
|          |    RTP/UDP/IP/VMR-WB    |          |
+----------+                         +----------+

Figure 2: IP terminal to IP terminal scenario

A conversational service puts requirements on the payload format. Low
delay is a very important factor, which means fewer speech frame-blocks
per payload packet. Low overhead is also required when the payload
format traverses low-bandwidth links, especially if the frequency of
packets will be high.

A streaming service has less strict real-time requirements and therefore
can use a larger number of frame-blocks per packet than a conversational
service. This reduces the overhead from IP, UDP, and RTP headers.
However, including several frame-blocks per packet makes the
transmission more vulnerable to packet loss, so interleaving may be used
to reduce the effect of packet loss on speech quality. A streaming
server handling a large number of clients also needs a payload format
that requires as few resources as possible when doing packetization. The
octet-aligned and interleaving modes require the least amount of
resources, while bandwidth-efficient mode is more demanding.

Another scenario occurs when VMR-WB encoded speech is transmitted from a
non-IP system (e.g., a 3GPP2/CDMA2000 network) to an RTP/UDP/IP VoIP
terminal, and/or vice versa, as depicted in Figure 3.

VMR-WB over
3GPP2/CDMA2000 network
                     +------+                        +----------+
                     |      |                        |          |
<------------------->|  GW  |<---------------------->| TERMINAL |
                     |      |   RTP/UDP/IP/VMR-WB    |          |
                     +------+                        +----------+
                         |
                         |          IP network
                         |

Figure 3: GW to VoIP terminal scenario

VMR-WB's capability to seamlessly switch between modes is exploited in
CDMA (non-IP) networks to optimize speech quality for a given traffic
condition.
To preserve this functionality in scenarios including a gateway to an IP
network, a codec mode request (CMR) field is needed. The gateway will be
responsible for forwarding the CMR between the non-IP and IP parts in
both directions. The IP terminal should follow the CMR forwarded by the
gateway to optimize speech quality going to the non-IP decoder. The mode
control algorithm in the gateway SHOULD accommodate the delay imposed by
the IP network on the response to CMR by the IP terminal.

The IP terminal should not set the CMR (see Section 9.3.1), but the
gateway can set the CMR value on frames going toward the encoder in the
non-IP part to optimize speech quality from that encoder to the gateway.
The gateway can alternatively set a lower CMR value, if desired, as one
means to control congestion on the IP network.

A third likely scenario is that RTP/UDP/IP is used as transport between
two non-IP systems, i.e., IP is originated and terminated in gateways on
both sides of the IP transport, as illustrated in Figure 4. This is the
most likely scenario for an interoperable interconnection between
3GPP/(GSM, WCDMA)/AMR-WB and 3GPP2/CDMA2000/VMR-WB.

VMR-WB over                                        AMR-WB over
3GPP2/CDMA2000 network                             3GPP/(GSM, WCDMA) network
                      +------+                     +------+
                      |  GW  |  RTP/UDP/IP/AMR-WB  |      |
<-------------------->|------|<------------------->|  GW  |<------------------->
                      | IWF  |                     |      |
                      +------+                     +------+
                          |                            |
                          |         IP network         |
                          |                            |

Figure 4: GW to GW scenario (AMR-WB <-> VMR-WB interoperable
interconnection)

The use of an Interworking Function (IWF) in the gateway immediately
interfacing with the 3GPP2/CDMA2000 network is REQUIRED in this
scenario. The IWF entity resides in the transport layer and is used for
interoperable interconnections only. In addition, the CMR value may be
set in packets received by the gateways on the IP network side.
The gateway should forward to the non-IP side a CMR value that is the
minimum of two values: (1) the CMR value it receives on the IP side; and
(2) a CMR value it may choose for congestion control of transmission on
the IP side. The details of the traffic control algorithm are left to
the implementation.

The fourth example VoIP scenario comprises an RTP/UDP/IP transport
between two non-IP systems, i.e., IP is originated and terminated in
gateways on both sides of the IP transport, as illustrated in Figure 5.
This is the most likely scenario for a Mobile Station-to-Mobile Station
(MS-to-MS) Transcoder-Free Operation (TrFO) interconnection between two
3GPP2/CDMA2000 terminals that both use the VMR-WB codec.

VMR-WB over                                        VMR-WB over
3GPP2/CDMA2000 network                             3GPP2/CDMA2000 network
                      +------+                     +------+
                      |      |                     |      |
<-------------------->|  GW  |<------------------->|  GW  |<------------------->
                      |      |  RTP/UDP/IP/VMR-WB  |      |
                      +------+                     +------+
                          |                            |
                          |         IP network         |
                          |                            |

Figure 5: GW to GW scenario (a CDMA2000 MS-to-MS voice over IP scenario)

9. VMR-WB RTP Payload Format

The VMR-WB payload format is very similar to that of AMR-WB (i.e., RFC
3267). The payloads of the two codecs have nearly identical structures,
which simplifies interoperable interconnections and consequently the
interworking function. The only differences are in the non-interoperable
interconnections, where no IWF is required.

The payload format consists of the RTP header, payload header, and
payload data.

9.1. RTP Header Usage

The format of the RTP header is specified in [5]. This payload format
uses the fields of the header in a manner consistent with that
specification.

The RTP timestamp corresponds to the sampling instant of the first
sample encoded for the first frame-block in the packet. The timestamp
clock frequency is the same as the sampling frequency, so the timestamp
unit is in samples.

The duration of one speech frame-block is 20 ms for VMR-WB.
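Because the timestamp unit is one sample and a frame-block lasts 20 ms, the timestamp step per frame-block follows directly from the sampling frequency. The sketch below works this out for the 16000 Hz wideband and 8000 Hz narrowband sampling frequencies mentioned earlier; the function names are hypothetical, and the modulo reflects the 32-bit width of the RTP timestamp field.

```python
FRAME_BLOCK_MS = 20  # duration of one VMR-WB speech frame-block


def timestamp_increment(sampling_hz):
    """Samples covered by one frame-block; the timestamp unit is samples."""
    return sampling_hz * FRAME_BLOCK_MS // 1000


def frame_block_timestamp(first_timestamp, index, sampling_hz):
    """Timestamp of the index-th consecutive frame-block after the first.

    The RTP timestamp field is 32 bits wide, so the value wraps around.
    """
    step = timestamp_increment(sampling_hz)
    return (first_timestamp + index * step) % 2**32
```

A sender that bundles k contiguous frame-blocks per packet would advance the packet timestamp by k times this increment from one packet to the next.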
For normal wideband operation of VMR-WB, the input/output sampling
frequency is 16 kHz, corresponding to 320 samples per frame from each
channel. Thus, the timestamp is increased by 320 for VMR-WB for each
consecutive frame-block. For narrowband operation of VMR-WB, the
input/output sampling frequency is 8 kHz, corresponding to 160 encoded
speech samples per frame from each channel. Thus, the timestamp is
increased by 160 for VMR-WB for each consecutive frame-block while
processing narrowband input/output speech signals. The choice of
sampling frequency MUST be indicated at the beginning of a session (see
Section 13). The default input/output sampling rate is 16 kHz. Note that
during a session, the sampling rate SHALL NOT be changed.

A packet may contain multiple frame-blocks of encoded speech or comfort
noise parameters. If interleaving is employed, the frame-blocks
encapsulated into a payload are picked according to the interleaving
rules as defined in Section 9.4.1. Otherwise, each packet covers a
period of one or more contiguous 20 ms frame-block intervals. In case
the data from all the channels for a particular frame-block in the
period is missing, for example at a gateway from some other transport
format, it is possible to indicate that no data is present for that
frame-block rather than breaking a multi-frame-block packet into two, as
explained in Section 9.3.2.

The payload is always made an integral number of octets long by padding
with zero bits if necessary. If additional padding is required to bring
the payload length to a larger multiple of octets or for some other
purpose, then the P bit in the RTP header MAY be set and padding
appended as specified in [5].

The RTP header marker bit (M) SHALL be set to 1 if the first frame-block
carried in the packet contains a speech frame that is the first in a
talk spurt.
For all other packets the marker bit SHALL be set to zero (M=0).

The assignment of an RTP payload type for this new packet format is
outside the scope of this document, and will not be specified here. It
is expected that the RTP profile under which this payload format is
being used will assign a payload type for this encoding or specify that
the payload type is to be bound dynamically.

9.2. Payload Structure

The complete payload consists of a payload header, a payload table of
contents, and speech data representing one or more speech frame-blocks.
The following diagram shows the general payload format layout:

+----------------+-------------------+----------------
| payload header | table of contents | speech data ...
+----------------+-------------------+----------------

Payloads containing more than one speech frame-block are called compound
payloads.

The following sections describe the variations taken by the payload
format depending on whether the VMR-WB session is set up to use the
bandwidth-efficient mode or octet-aligned mode and any of the OPTIONAL
functions such as interleaving. Implementations SHOULD support both
bandwidth-efficient and octet-aligned operation to increase
interoperability.

9.3. Bandwidth-Efficient Mode

9.3.1. The Payload Header

In bandwidth-efficient mode, the payload header simply consists of a
4-bit codec mode request:

 0 1 2 3
+-+-+-+-+
|  CMR  |
+-+-+-+-+

CMR (4 bits): Indicates a codec mode request sent to the speech encoder
at the site of the receiver of this payload, provided that the network
allows the use of the requested mode. Therefore, the network MAY
overwrite the mode request depending on the network conditions. Also,
during a VMR-WB <-> AMR-WB interoperable interconnection, the
operational mode of VMR-WB is set to "Rate-Set II mode 3".
   The value of the CMR field is set according to the following table:

+-------+--------------------------------------------------------------+
| CMR   | VMR-WB Codec Mode                                            |
+-------+--------------------------------------------------------------+
| 0     | Rate-Set II mode 3 (AMR-WB interoperable mode at 6.6 kbps)   |
| 1     | Rate-Set II mode 3 (AMR-WB interoperable mode at 8.85 kbps)  |
| 2     | Rate-Set II mode 3 (AMR-WB interoperable mode at 12.65 kbps) |
| 3     | Rate-Set II mode 2                                           |
| 4     | Rate-Set II mode 1                                           |
| 5     | Rate-Set II mode 0                                           |
| 6     | (reserved)                                                   |
| 10-14 | (reserved)                                                   |
| 15    | No Preference (Codec mode SHOULD be set by the network)      |
+-------+--------------------------------------------------------------+

   Table 3: List of valid CMR values and their associated VMR-WB
   operating modes.

   The values for each operating mode of VMR-WB were chosen to ensure
   similarity with the compatible modes of AMR-WB in order to
   facilitate interoperability. The reserved values have not been
   implemented yet.

   The mode request received in the CMR field is valid until the next
   CMR is received, i.e., a newly received CMR value overrides the
   previous one. Therefore, if a terminal continuously wishes to
   receive frames in the same mode x, it needs to set CMR=x for all
   its outbound payloads, and if a terminal has no preference in which
   mode to receive, it SHOULD set CMR=15 in all its outbound payloads.

   If a payload is received with a CMR value that is not valid, the
   CMR MUST be ignored by the receiver.

   The CMR values 0, 1, and 2 are used to maintain similarity with
   AMR-WB codec modes 0, 1, and 2. Note that there is only one
   interoperable mode in VMR-WB (i.e., Rate-Set II mode 3). In-band
   signaling is used in VMR-WB as described in Appendix A to select
   between AMR-WB codec modes 0, 1, or 2. The Interworking Function
   will ensure correct codec mode setting on both sides of an
   interoperable interconnection.

   The default input/output sampling rate of VMR-WB is 16 kHz.
   The narrowband operation of VMR-WB on 8 kHz input/output narrowband
   speech requires that both encoder and decoder of VMR-WB be informed
   of the desired sampling rate. This MUST be signaled to the encoder
   and decoder in the beginning of a real-time or non-real-time VoIP
   session through MIME parameter "sampling-frequency" (see Section
   13.1 for MIME registration parameters). With a given sampling rate
   (i.e., 8/16 kHz), the encoder can switch between wideband or
   narrowband operation modes without prior knowledge of the decoder.

   In a multi-channel session, CMR SHOULD be interpreted by the
   receiver of the payload as the desired encoding mode for all the
   channels in the session, if the network allows.

   An IP end-point SHOULD NOT set the CMR based on packet losses or
   other congestion indications, for several reasons:

   - The other end of the IP path may be a gateway to a non-IP network
     (such as a radio link) that needs to set the CMR field to
     optimize performance on that network.

   - Congestion on the IP network is managed by the IP sender, in this
     case at the other end of the IP path. Feedback about congestion
     SHOULD be provided to that IP sender through RTCP or other means,
     and then the sender can choose to avoid congestion using the most
     appropriate mechanism. That may include adjusting the codec mode,
     but also includes adjusting the level of redundancy or number of
     frames per packet.

   The encoder SHOULD follow a received mode request, but MAY change
   to a different mode if the network necessitates it, for example to
   control congestion.

   The CMR field MUST be set to 15 for packets sent to a multicast
   group. The encoder in the speech sender SHOULD ignore mode requests
   when sending speech to a multicast session but MAY use RTCP
   feedback information as a hint that a mode change is needed.
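   The CMR handling rules of this section can be sketched in a few
   lines of Python. This is an illustrative sketch only and is not
   part of the payload format: the names VALID_CMR, NO_PREFERENCE,
   apply_received_cmr, and outgoing_cmr are hypothetical, and treating
   every value without an assigned mode in Table 3 as "not valid" is
   an assumption.

```python
# Illustrative sketch only; all names are hypothetical, not defined by the draft.
NO_PREFERENCE = 15

# CMR values that Table 3 maps to an operating mode. Assumption: any other
# value (reserved or unlisted) is treated as "not valid".
VALID_CMR = {
    0: "Rate-Set II mode 3 (AMR-WB interoperable, 6.6 kbps)",
    1: "Rate-Set II mode 3 (AMR-WB interoperable, 8.85 kbps)",
    2: "Rate-Set II mode 3 (AMR-WB interoperable, 12.65 kbps)",
    3: "Rate-Set II mode 2",
    4: "Rate-Set II mode 1",
    5: "Rate-Set II mode 0",
    NO_PREFERENCE: "No preference (mode set by the network)",
}


def apply_received_cmr(current, received):
    """Return the mode request in force after a payload arrives.

    A valid CMR overrides the previous one; an invalid CMR is ignored,
    so the previous request stays in force.
    """
    return received if received in VALID_CMR else current


def outgoing_cmr(preferred=None, multicast=False):
    """Choose the CMR value to place in an outbound payload."""
    if multicast or preferred is None:
        return NO_PREFERENCE  # CMR=15 for multicast or no preference
    if preferred not in VALID_CMR:
        raise ValueError(f"CMR {preferred} has no assigned mode")
    return preferred
```

   In this sketch, a terminal that wants to keep receiving mode x
   would call outgoing_cmr(x) for every outbound payload, since a
   request is only valid until the next CMR is received.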
   The codec mode selection MAY be restricted by a session parameter
   to a subset of the available modes. If so, the requested mode MUST
   be among the signaled subset (see Section 13).

9.3.2. The Payload Table of Contents

   The table of contents (ToC) consists of a list of ToC entries, each
   representing a speech frame.

   In bandwidth-efficient mode, a ToC entry takes the following
   format:

    0 1 2 3 4 5
   +-+-+-+-+-+-+
   |F|  FT   |Q|
   +-+-+-+-+-+-+

   F (1 bit): If set to 1, indicates that this frame is followed by
   another speech frame in this payload; if set to 0, indicates that
   this frame is the last frame in this payload.

   FT (4 bits): Frame type index whose value is chosen according to
   the following table. Note that Rate-Set II (RS-II) contains four
   frame types that are the allowed encoding rates compatible with
   CDMA Multiplex option 2 [12,13].

+----+-------------------------------------+----------------------------------+
| FT | Encoding Rate                       | Frame Size (Bits)                |
+----+-------------------------------------+----------------------------------+
| 0  | RS-II Full-Rate (AMR-WB 6.6 kbps)   | 13 Preamble+132 Data+121 Padding |
| 1  | RS-II Full-Rate (AMR-WB 8.85 kbps)  | 13 Preamble+177 Data+76 Padding  |
| 2  | RS-II Full-Rate (AMR-WB 12.65 kbps) | 13 Preamble+253 Data             |
| 3  | RS-II Full-Rate 13.3 kbps           | 266                              |
| 4  | RS-II Half-Rate 6.2 kbps            | 124 (Preamble + Data)            |
| 5  | RS-II Quarter-Rate 2.7 kbps         | 54                               |
| 6  | RS-II Eighth-Rate 1.0 kbps          | 20                               |
| 7  | (reserved)                          |                                  |
| 8  | (reserved)                          |                                  |
| 9  | RS-II CNG (AMR-WB SID+ padding)     | 5 Preamble+35 Data+14 Padding    |
| 10 | (reserved)                          |                                  |
| 11 | (reserved)                          |                                  |
| 12 | (reserved)                          |                                  |
| 13 | (reserved)                          |                                  |
| 14 | RS-II Erasure (AMR-WB SPEECH_LOST)  | 0                                |
| 15 | RS-II Blank (AMR-WB NO_DATA)        | 0                                |
+----+-------------------------------------+----------------------------------+

   Table 4: VMR-WB payload frame types for real-time transport.
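   Because each FT value in Table 4 implies a fixed frame size, the
   length of a bandwidth-efficient payload follows from its list of
   frame types alone. The Python sketch below illustrates that
   arithmetic under stated assumptions: the names FRAME_BITS and
   payload_octets are hypothetical, the sizes of the AMR-WB compatible
   frame types are taken as the sum of the preamble, data, and padding
   figures in Table 4, and a single-channel or already flattened list
   of frame types is assumed.

```python
# Illustrative sketch only; all names are hypothetical, not defined by the draft.
CMR_BITS = 4        # payload header in bandwidth-efficient mode
TOC_ENTRY_BITS = 6  # F (1 bit) + FT (4 bits) + Q (1 bit)

# Frame size in bits for each assigned FT value, summed from the
# "Frame Size (Bits)" column of Table 4.
FRAME_BITS = {
    0: 13 + 132 + 121,  # RS-II Full-Rate (AMR-WB 6.6 kbps)
    1: 13 + 177 + 76,   # RS-II Full-Rate (AMR-WB 8.85 kbps)
    2: 13 + 253,        # RS-II Full-Rate (AMR-WB 12.65 kbps)
    3: 266,             # RS-II Full-Rate 13.3 kbps
    4: 124,             # RS-II Half-Rate 6.2 kbps
    5: 54,              # RS-II Quarter-Rate 2.7 kbps
    6: 20,              # RS-II Eighth-Rate 1.0 kbps
    9: 5 + 35 + 14,     # RS-II CNG (AMR-WB SID + padding)
    14: 0,              # RS-II Erasure (AMR-WB SPEECH_LOST)
    15: 0,              # RS-II Blank (AMR-WB NO_DATA)
}


def payload_octets(frame_types):
    """Octets needed for a bandwidth-efficient payload with these frames."""
    bits = CMR_BITS
    for ft in frame_types:
        if ft not in FRAME_BITS:
            raise ValueError(f"reserved frame type: {ft}")
        bits += TOC_ENTRY_BITS + FRAME_BITS[ft]
    return (bits + 7) // 8  # zero bits pad the payload to a whole octet
```

   For a single Half-Rate frame (FT=4) the sketch counts 4 + 6 + 124 =
   134 bits, which rounds up to 17 octets with two padding bits. A
   receiver could compare such a computed length with the size of the
   received payload as part of the validation described in Section
   12.3.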
   The Preamble bits in the interoperable Frame Types are used for
   in-band signaling to distinguish between different encoding rates.
   For example, the first 13 preamble bits in the Rate-Set II
   Interoperable Full-Rate are used to decode AMR-WB codec modes 0, 1,
   or 2.

   During the interoperable mode, FT=14 (SPEECH_LOST) and FT=15
   (NO_DATA) are used to indicate frames that are either lost or not
   being transmitted in this payload, respectively. FT=14 or 15 MAY be
   used in the non-interoperable modes to indicate frame erasure or
   blank frame, respectively.

   Q (1 bit): Frame quality indicator. If set to 0, indicates that the
   corresponding frame is corrupted. During the interoperable mode,
   the receiver side (with AMR-WB codec) should set the RX_TYPE to
   either SPEECH_BAD or SID_BAD depending on the frame type (FT), if
   Q=0.

   For multi-channel sessions, the ToC entries of all frames from a
   frame-block are placed in the ToC in consecutive order. When
   multiple frame-blocks are present in a packet in
   bandwidth-efficient mode, they will be placed in the packet in
   order of their creation time.

   Therefore, with N channels and K speech frame-blocks in a packet,
   there MUST be N*K entries in the ToC, and the first N entries will
   be from the first frame-block, the second N entries will be from
   the second frame-block, and so on.

   The following figure shows an example of a ToC of three entries in
   a single channel session using bandwidth-efficient mode.

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1|  FT   |Q|1|  FT   |Q|0|  FT   |Q|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Below is an example of how the ToC entries will appear in the ToC
   of a packet carrying 3 consecutive frame-blocks in a session with
   two channels (L and R).

   +----+----+----+----+----+----+
   | 1L | 1R | 2L | 2R | 3L | 3R |
   +----+----+----+----+----+----+
   |<------->|<------->|<------->|
     Frame-    Frame-    Frame-
     Block 1   Block 2   Block 3

9.3.3. Speech Data

   Speech data of a payload contains one or more speech frames as
   described in the ToC of the payload.

   Each speech frame represents 20 ms of speech encoded in one of the
   available encoding rates depending on the operation mode. The
   length of the speech frame is defined by the frame type in the FT
   field. The order and numbering notation of the bits are as
   specified in the VMR-WB standard specification [10]. To facilitate
   the VMR-WB and AMR-WB interconnection and to simplify the
   interworking function during the interoperable mode, the order and
   the numbering notation of the speech codec bits closely follow
   those of the AMR-WB RTP payload (i.e., RFC 3267).

9.3.4. Algorithm for Forming the Payload

   The complete RTP payload in bandwidth-efficient mode is formed by
   packing bits from the payload header, table of contents, and speech
   frames, in order as defined by their corresponding ToC entries in
   the ToC list, contiguously into octets beginning with the most
   significant bits of the fields and the octets.

   To be precise, the four-bit payload header is packed into the first
   octet of the payload with bit 0 of the payload header in the most
   significant bit of the octet. The four most significant bits
   (numbered 0-3) of the first ToC entry are packed into the least
   significant bits of the octet, ending with bit 3 in the least
   significant bit. Packing continues in the second octet with bit 4
   of the first ToC entry in the most significant bit of the octet. If
   more than one frame is contained in the payload, then packing
   continues with the second and successive ToC entries. Bit 0 of the
   first data frame follows immediately after the last ToC bit,
   proceeding through all the bits of the frame in numerical order.
   Bits from any successive frames follow contiguously in numerical
   order for each frame and in consecutive order of the frames.

   If speech data is missing for one or more speech frames within the
   sequence, for example because of Blank and Burst or DTX, a ToC
   entry with FT set to NO_DATA/Blank SHALL be included in the ToC for
   each of the missing frames, but no data bits are included in the
   payload for the missing frames (see Section 9.3.5.2 for an
   example).

9.3.5 Payload Examples

9.3.5.1. Single Channel Payload Carrying a Single Frame

   The following diagram shows a bandwidth-efficient VMR-WB payload
   from a single channel session carrying a single speech data block.

   In the payload, no specific mode is requested (CMR=15), the speech
   frame is not damaged at the IP origin (Q=1), and the encoding rate
   is VMR-WB Rate-Set II Half-Rate (FT=4). The encoded speech bits,
   d(0) to d(123), are arranged according to [2]. Finally, two zero
   bits are added to the end as padding to make the payload octet
   aligned.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | CMR=15|0| FT=4  |1|d(0)                                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     d(123)|P|P|
   +-+-+-+-+-+-+-+-+

9.3.5.2. Single Channel Payload Carrying Multiple Frames

   The following diagram shows a single channel, bandwidth-efficient
   compound VMR-WB payload that contains four frames, of which one has
   no speech data. VMR-WB is hypothetically operating in Rate-Set II
   mode 2. The first frame is a speech frame at Rate-Set II Full-Rate
   (FT=3) that is composed of speech bits d(0) to d(265).
   The second frame is a Rate-Set II Quarter-Rate frame (FT=5),
   consisting of bits g(0) to g(53). The third frame is a Rate-Set II
   Blank/NO_DATA frame and does not carry any speech information; it
   is represented in the payload by its ToC entry (FT=15). The fourth
   frame in the payload is a speech frame encoded at Rate-Set II
   Half-Rate (FT=4); it consists of speech bits h(0) to h(123).

   None of the frames is damaged at the IP origin (Q=1). The encoded
   speech bits d(0) to d(265), g(0) to g(53), and h(0) to h(123) are
   sequentially arranged in the payload. (Note that no speech bits are
   present for the third frame.) Finally, no padding bits are required
   to make the payload octet aligned.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | CMR=3 |1| FT=3  |1|1| FT=5  |1|1| FT=15 |1|0| FT=4  |1|d(0)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     d(265)|g(0)                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                  g(53)|h(0)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                         h(123)|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

9.3.5.3. Multi-Channel Payload Carrying Multiple Frames

   The following diagram shows a two-channel payload carrying 3
   frame-blocks, i.e., the payload will contain 6 speech frames.

   In the payload, all speech frames use the same encoding rate of
   Rate-Set II Full-Rate (FT=3) and are not damaged at the IP origin.
   The CMR is set to 15, i.e., the operating mode SHOULD be set by the
   network. The two channels are defined as left (L) and right (R), in
   that order. The encoded speech bits are designated dXY(0) ...
   dXY(K-1), where X = block number, Y = channel, and K is the number
   of speech bits for the corresponding encoding rate. For example,
   for frame-block 1 of the left channel the encoded bits are
   designated as d1L(0) to d1L(265).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | CMR=15|1|1L FT=3|1|1|1R FT=3|1|1|2L FT=3|1|1|2R FT=3|1|1|3L FT|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |3|1|0|3R FT=3|1|d1L(0)                                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           d1L(265)|d1R(0)                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                               d1R(265)|d2L(0) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   d2L(265)|d2R(0)                                             |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       d2R(265)|d3L(0)                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                           d3L(265)|d3R(0)     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |d3R(265)P|P|P|P|
   +-+-+-+-+-+-+-+-+

9.4. Octet-aligned Mode

9.4.1. The Payload Header

   In octet-aligned mode, the payload header consists of a 4 bit CMR,
   4 reserved bits, and optionally, an 8 bit interleaving header, as
   shown below:

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +-+-+-+-+-+-+-+-+- - - - - - - -
   |  CMR  |R|R|R|R|  ILL  |  ILP  |
   +-+-+-+-+-+-+-+-+- - - - - - - -

   CMR (4 bits): same as defined in Section 9.3.1.

   R: is a reserved bit that MUST be set to zero. All R bits MUST be
   ignored by the receiver.

   ILL (4 bits, unsigned integer): This is an OPTIONAL field that is
   present only if interleaving is signaled out-of-band for the
   session. ILL=L indicates to the receiver that the interleaving
   length is L+1, in number of frame-blocks.

   ILP (4 bits, unsigned integer): This is an OPTIONAL field that is
   present only if interleaving is signaled. ILP MUST take a value
   between 0 and ILL, inclusive, indicating the interleaving index for
   frame-blocks in this payload in the interleave group. If the value
   of ILP is found to be greater than ILL, the payload SHOULD be
   discarded.

   ILL and ILP fields MUST be present in each packet in a session if
   interleaving is signaled for the session.

   If the interleaving option is used, it MUST be performed on a
   frame-block basis, as opposed to a frame basis, in a multi-channel
   session.

   The following example illustrates the arrangement of speech
   frame-blocks in an interleave group during an interleave session.
   Here we assume ILL=L for the interleave group that starts at speech
   frame-block n. We also assume that the first payload packet of the
   interleave group is s and the number of speech frame-blocks carried
   in each payload is N. Then we will have:

   Payload s (the first packet of this interleave group):
      ILL=L, ILP=0,
      Carry frame-blocks: n, n+(L+1), n+2*(L+1), ..., n+(N-1)*(L+1)

   Payload s+1 (the second packet of this interleave group):
      ILL=L, ILP=1,
      Carry frame-blocks: n+1, n+1+(L+1), n+1+2*(L+1), ...,
                          n+1+(N-1)*(L+1)
      ...

   Payload s+L (the last packet of this interleave group):
      ILL=L, ILP=L,
      Carry frame-blocks: n+L, n+L+(L+1), n+L+2*(L+1), ...,
                          n+L+(N-1)*(L+1)

   The next interleave group will start at frame-block n+N*(L+1).

   There will be no interleaving effect unless the number of
   frame-blocks per packet (N) is at least 2. Moreover, the number of
   frame-blocks per payload (N) and the value of ILL MUST NOT be
   changed inside an interleave group. In other words, all payloads in
   an interleave group MUST have the same ILL and MUST contain the
   same number of speech frame-blocks.

   The sender of the payload MUST only apply interleaving if the
   receiver has signaled its use through out-of-band means. Since
   interleaving will increase buffering requirements at the receiver,
   the receiver uses MIME parameter "interleaving=I" to set the
   maximum number of frame-blocks allowed in an interleaving group to
   I.

   When performing interleaving the sender MUST use a proper number of
   frame-blocks per payload (N) and ILL so that the resulting size of
   an interleave group is less than or equal to I, i.e., N*(L+1)<=I.

9.4.2. The Payload Table of Contents

   The table of contents (ToC) in octet-aligned mode consists of a
   list of ToC entries where each entry corresponds to a speech frame
   carried in the payload, i.e.,

   +---------------------+
   | list of ToC entries |
   +---------------------+

   Note, for ToC entries with FT=14 or 15, there will be no
   corresponding speech frame in the payload.
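   Returning to the interleaving rules of Section 9.4.1, the
   assignment of frame-blocks to payloads can be sketched in Python as
   follows. This is a minimal sketch for illustration, not part of the
   payload format: the function names interleave_group and
   fits_receiver_limit and their parameter names are hypothetical.

```python
# Illustrative sketch only; all names are hypothetical, not defined by the draft.

def interleave_group(n, ill, frames_per_payload):
    """Return the frame-block indices carried by each payload of one group.

    n                  -- index of the first frame-block of the group
    ill                -- value L of the ILL field (interleaving length is L+1)
    frames_per_payload -- number N of frame-blocks in each payload

    Element ilp of the result lists the frame-blocks of the payload whose
    ILP field equals ilp: n+ilp, n+ilp+(L+1), ..., n+ilp+(N-1)*(L+1).
    """
    step = ill + 1
    return [
        [n + ilp + k * step for k in range(frames_per_payload)]
        for ilp in range(step)
    ]


def fits_receiver_limit(ill, frames_per_payload, max_group_size):
    """Check N*(L+1) <= I for the receiver's "interleaving=I" limit."""
    return frames_per_payload * (ill + 1) <= max_group_size
```

   With n=1, ILL=2, and N=3 the sketch yields the payloads
   [1, 4, 7], [2, 5, 8], and [3, 6, 9], so the group spans 9
   frame-blocks and the next group starts at frame-block 10.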
   The list of ToC entries is organized in the same way as described
   for bandwidth-efficient mode in Section 9.3.2, with the following
   exception: when interleaving is used, the frame-blocks in the ToC
   will almost never be placed consecutively in time. Instead, the
   presence and order of the frame-blocks in a packet will follow the
   pattern described in Section 9.4.1.

   The following example shows the ToC of three consecutive packets,
   each carrying 3 frame-blocks, in an interleaved two-channel
   session. Here, the two channels are left (L) and right (R) with L
   coming before R, and the interleaving length is 3 (i.e., ILL=2).
   This makes the interleave group 9 frame-blocks large.

   Packet #1
   ---------

   ILL=2, ILP=0:
   +----+----+----+----+----+----+
   | 1L | 1R | 4L | 4R | 7L | 7R |
   +----+----+----+----+----+----+
   |<------->|<------->|<------->|
     Frame-    Frame-    Frame-
     Block 1   Block 4   Block 7

   Packet #2
   ---------

   ILL=2, ILP=1:
   +----+----+----+----+----+----+
   | 2L | 2R | 5L | 5R | 8L | 8R |
   +----+----+----+----+----+----+
   |<------->|<------->|<------->|
     Frame-    Frame-    Frame-
     Block 2   Block 5   Block 8

   Packet #3
   ---------

   ILL=2, ILP=2:
   +----+----+----+----+----+----+
   | 3L | 3R | 6L | 6R | 9L | 9R |
   +----+----+----+----+----+----+
   |<------->|<------->|<------->|
     Frame-    Frame-    Frame-
     Block 3   Block 6   Block 9

   A ToC entry takes the following format in octet-aligned mode:

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |F|  FT   |Q|P|P|
   +-+-+-+-+-+-+-+-+

   F (1 bit): see definition in Section 9.3.2.

   FT (4 bits unsigned integer): see definition in Section 9.3.2.

   Q (1 bit): see definition in Section 9.3.2.

   P bits: padding bits MUST be set to zero.

9.4.3. Speech Data

   In octet-aligned mode, speech data is carried in a similar way to
   that in the bandwidth-efficient mode as discussed in Section 9.3.3,
   with the following exceptions:

   - The last octet of each speech frame MUST be padded with zeroes at
     the end if not all bits in the octet are used. In other words,
     each speech frame MUST be octet-aligned.

   - When multiple speech frames are present in the speech data (i.e.,
     compound payload), the speech frames MUST be arranged one whole
     frame after another.

9.4.4. Methods for Forming the Payload

   The payload begins with the payload header of one octet, or two if
   frame interleaving is selected. The payload header is followed by
   the table of contents consisting of a list of one-octet ToC
   entries.

   The speech data follows the table of contents. For packetization in
   the normal order, all of the octets comprising a speech frame are
   appended to the payload as a unit. The speech frames are packed in
   the same order as their corresponding ToC entries are arranged in
   the ToC list, with the exception that if a given frame has a ToC
   entry with FT=14 or 15, there will be no data octets present for
   that frame.

9.4.5. Payload Example

9.4.5.1. Basic Single Channel Payload Carrying Multiple Frames

   The following diagram shows an octet-aligned payload from a single
   channel session that carries two VMR-WB Rate-Set II Full-Rate
   frames (FT=3). In the payload, a codec mode request is sent (e.g.,
   CMR=4), requesting the encoder at the receiver's side to use VMR-WB
   mode 1. No interleaving is used.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | CMR=4 |R|R|R|R|1|FT#1=3 |Q|P|P|0|FT#2=3 |Q|P|P|   f1(0..7)    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   f1(8..15)   |  f1(16..23)   |  ...                          |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | r |P|P|P|P|P|P|   f2(0..7)    |   f2(8..15)   |  f2(16..23)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                              ...                              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |              ...              | l |P|P|P|P|P|P|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   r= f1(264,265)
   l= f2(264,265)

   Note that in the above example the last octet in both speech frames
   is padded with zeros to make them octet-aligned.

9.5. Implementation Considerations

   An application implementing this payload format MUST understand all
   the payload parameters in the out-of-band signaling used. For
   example, if an application uses SDP, all the SDP and MIME
   parameters in this document MUST be understood. This requirement
   ensures that an implementation can always decide whether or not it
   is capable of communicating.

   No operation mode of the payload format is mandatory to implement.
   The requirements of the application using the payload format should
   be used to determine what to implement. To achieve basic
   interoperability with other applications implementing this payload,
   each implementation SHOULD at least implement both
   bandwidth-efficient and octet-aligned mode for single channel. The
   support of interleaving is OPTIONAL.

10. VMR-WB Storage Format

   The storage format is used for storing VMR-WB encoded speech frames
   in a file or as an e-mail attachment. Multiple channel content is
   also supported. The storage format for VMR-WB is identical to that
   of AMR-WB to ensure full compatibility in the interoperable mode.

   In general, a VMR-WB file has the following structure:

   +------------------+
   | Header           |
   +------------------+
   | Speech frame 1   |
   +------------------+
   : ...              :
   +------------------+
   | Speech frame n   |
   +------------------+

10.1. Single channel Header

   A single channel VMR-WB file header contains only a magic number.
   The magic number for single channel VMR-WB files in
   non-interoperable modes MUST consist of the ASCII character string:

      "#!VMR-WB\n"
      (or 0x2321564d522d57420a in hexadecimal).

   Note that the "\n" is an important part of the magic numbers and
   MUST be included in the comparison; otherwise, the single channel
   magic number above will become indistinguishable from those of the
   multi-channel files defined in the next section.

   The magic number for single channel VMR-WB files in the
   interoperable mode MUST consist of the ASCII character string:

      "#!AMR-WB\n"
      (or 0x2321414d522d57420a in hexadecimal).

   Note that VMR-WB uses the same magic number as AMR-WB (see RFC
   3267) when saving the encoded speech in the interoperable mode.
   Therefore, a file generated by VMR-WB is directly decodable with
   AMR-WB. However, since VMR-WB can only decode AMR-WB modes 0, 1, or
   2, the AMR-WB codec MUST be instructed not to generate the modes
   that are not in common so that files generated by AMR-WB can be
   decoded directly by VMR-WB. (Note that the expansion of AMR-WB
   interoperable modes in the VMR-WB decoder will ultimately ease this
   requirement.)

10.2. Multi-channel Header

   The multi-channel header consists of a magic number followed by a
   32-bit channel description field, giving the multi-channel header
   the following structure:

   +----------------------------+
   | magic number               |
   +----------------------------+
   | channel description field  |
   +----------------------------+

   The magic number for multi-channel VMR-WB files in the
   non-interoperable modes MUST consist of the ASCII character string:

      "#!VMR-WB_MC1.0\n"
      (or 0x2321564d522d57425F4D43312E300a in hexadecimal).

   The version number in the magic numbers refers to the version of
   the file format.
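   The role of the trailing "\n" in the comparison can be illustrated
   with a short Python sketch covering the magic numbers of Sections
   10.1 and 10.2. The sketch is illustrative only: the names
   MAGIC_NUMBERS and identify_header, and the shape of the returned
   tuple, are hypothetical.

```python
# Illustrative sketch only; all names are hypothetical, not defined by the draft.

# Magic number -> (codec mode family, is_multi_channel).
MAGIC_NUMBERS = {
    b"#!VMR-WB\n": ("non-interoperable", False),
    b"#!AMR-WB\n": ("interoperable", False),
    b"#!VMR-WB_MC1.0\n": ("non-interoperable", True),
    b"#!AMR-WB_MC1.0\n": ("interoperable", True),
}


def identify_header(data):
    """Classify the start of a storage file, or return None if unknown.

    Every magic number ends in "\n" and the newline is part of the
    comparison, so the single channel strings are never prefixes of the
    multi-channel ones and the lookup order does not matter.
    """
    for magic, description in MAGIC_NUMBERS.items():
        if data.startswith(magic):
            return description
    return None
```

   Dropping the newline from the comparison would make "#!VMR-WB"
   match the start of "#!VMR-WB_MC1.0\n" as well, which is the
   ambiguity the text above warns about.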
   The magic number for multi-channel VMR-WB files in the
   interoperable mode MUST consist of the ASCII character string (see
   RFC 3267):

      "#!AMR-WB_MC1.0\n"
      (or 0x2321414d522d57425F4D43312E300a in hexadecimal).

   The 32-bit channel description field is defined as:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                     Reserved bits                     | CHAN  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Reserved bits: MUST be set to 0 when written, and a reader MUST
   ignore them.

   CHAN (4 bit unsigned integer): Indicates the number of audio
   channels contained in this storage file. The valid values and the
   order of the channels within a frame-block are specified in Section
   4.1 in [19].

10.3. Speech Frames

   After the file header, speech frame-blocks consecutive in time are
   stored in the file. Each frame-block contains a number of
   octet-aligned speech frames equal to the number of channels, stored
   in increasing order, starting with channel 1.

   Each stored speech frame starts with a one octet frame header with
   the following format:

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |P|  FT   |Q|P|P|
   +-+-+-+-+-+-+-+-+

   The FT field and the Q bit are defined as follows. The P bits are
   padding and MUST be set to 0.

+----+-------------------------------------+-------------------+
| FT | Encoding Rate                       | Frame Size (Bits) |
+----+-------------------------------------+-------------------+
| 0  | RS-II Full-Rate (AMR-WB 6.6 kbps)   | 132               |
| 1  | RS-II Full-Rate (AMR-WB 8.85 kbps)  | 177               |
| 2  | RS-II Full-Rate (AMR-WB 12.65 kbps) | 253               |
| 3  | RS-II Full-Rate 13.3 kbps           | 266               |
| 4  | RS-II Half-Rate 6.2 kbps            | 124               |
| 5  | RS-II Quarter-Rate 2.7 kbps         | 54                |
| 6  | RS-II Eighth-Rate 1.0 kbps          | 20                |
| 7  | (reserved)                          | -                 |
| 8  | (reserved)                          | -                 |
| 9  | RS-II CNG (AMR-WB SID)              | 35                |
| 10 | (reserved)                          | -                 |
| 11 | (reserved)                          | -                 |
| 12 | (reserved)                          | -                 |
| 13 | (reserved)                          | -                 |
| 14 | RS-II Erasure (AMR-WB SPEECH_LOST)  | 0                 |
| 15 | RS-II Blank (AMR-WB NO_DATA)        | 0                 |
+----+-------------------------------------+-------------------+

   Table 5: VMR-WB frame types for non-real-time transport and
   storage.

   Q (1 bit): Frame quality indicator. If set to 0, indicates the
   corresponding frame is corrupted.

   Note that in the above Table no padding for the AMR-WB compatible
   Frame Types is included. This is due to the fact that no frame-size
   adjustment for those frames is needed (to make them compatible to
   CDMA Multiplex Option 2), since in the file storage, no real-time
   over-the-air transmission takes place.

   Following this one octet header, the speech bits are placed as
   defined in 9.3.3. The last octet of each frame is padded with
   zeroes, if needed, to achieve octet alignment.

   The following example shows a VMR-WB speech frame encoded at
   Rate-Set II Half-Rate (with 124 speech bits) in the storage format.
0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ |P| FT=4 |Q|P|P| | +-+-+-+-+-+-+-+-+ + | | + Speech bits for frame-block n, channel k + | | + + | | + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | +-+-+-+-+ Frame-blocks or speech frames that are lost in transmission and thereby not received MUST be stored as Blank/NO_DATA frames (FT=15) or Erasure/SPEECH_LOST (FT=14) in complete frame-blocks to keep synchronization with the original media. 11. Congestion Control (Network-Controlled Mode Switching) The general congestion control considerations for transporting RTP data apply to VMR-WB speech over RTP as well. However, the multimode capability of VMR-WB speech coding may provide an advantage over other payload formats for controlling congestion since the bandwidth demand can be adjusted by selecting a different operating mode (i.e., mode switching). Another parameter that may impact the bandwidth demand for VMR-WB is the number of frame-blocks that are encapsulated in each RTP payload. Packing more frame-blocks in each RTP payload can reduce the number of packets sent and hence the overhead from RTP/UDP/IP headers, at the expense of increased delay. If forward error correction (FEC) is used to alleviate the packet loss, the amount of redundancy added by FEC will need to be regulated so that the use of FEC itself does not cause a congestion problem. It is RECOMMENDED that VMR-WB applications using this payload format employ congestion control. The actual mechanism for congestion control is not specified but should be suitable for real-time transport of datagrams. 12. Security Considerations RTP packets using the payload format defined in this specification are subject to the general security considerations discussed in [5]. 
   As this format transports encoded speech, the main security issues
   include confidentiality and authentication of the speech itself.
   The payload format itself does not have any built-in security
   mechanisms. External mechanisms, such as SRTP [17], MAY be used.

   This payload format does not exhibit any significant non-uniformity
   in the receiver side computational complexity for packet processing
   and thus is unlikely to pose a denial-of-service threat due to the
   receipt of pathological/corrupted data. Note that robust built-in
   bad rate detection and concealment as well as frame erasure
   concealment mechanisms have been implemented in VMR-WB to alleviate
   the impacts of the reception of corrupted rate information, packet
   loss, and corrupted speech data [10].

12.1. Confidentiality

   To achieve confidentiality of the encoded VMR-WB speech, all speech
   data bits MAY be encrypted. There is no need to encrypt the payload
   header or the table of contents for the following reasons: 1) they
   only carry information about the requested speech mode, frame type,
   and frame quality; 2) this information could be useful to some
   third party, e.g., for quality monitoring.

   As long as the VMR-WB payload is only packed and unpacked at either
   end, encryption may be performed after packet encapsulation so that
   there is no conflict between the two operations.

   Interleaving may affect encryption. Depending on the encryption
   scheme used, there may be restrictions on, for example, the time
   when keys can be changed. Specifically, the key change may need to
   occur at the boundary between interleave groups.

   The type of encryption method used may impact the error robustness
   of the payload data. The error robustness may be severely reduced
   when the data is encrypted unless an encryption method without
   error-propagation is used, e.g., a stream cipher.

12.2. Authentication

   To authenticate the sender of the speech, an external mechanism
   MUST be used. It is RECOMMENDED that such a mechanism protect all
   the speech data bits.

   Data tampering by a man-in-the-middle attacker could result in
   erroneous depacketization/decoding that could lower the speech
   quality. Tampering with the CMR field may result in speech of a
   different quality than desired.

   To prevent a man-in-the-middle attacker from tampering with the
   payload packets, some additional information besides the speech
   bits SHOULD be protected. This may include the payload header, ToC,
   RTP timestamp, RTP sequence number, and the RTP marker bit.

12.3. Decoding Validation and Provision for Lost or Late Packets

   When processing a received payload packet, if the receiver finds
   that the calculated payload length, based on the information of the
   session and the values found in the payload header fields, does not
   match the size of the received packet, the receiver SHOULD discard
   the packet to avoid potential degradation of speech quality and to
   invoke the built-in frame error concealment mechanism. Therefore,
   invalid packets SHALL be treated as lost packets.

   Late packets (i.e., packets that are unavailable when needed for
   decoding at the receiver) SHALL be treated as lost packets.
   Furthermore, if the late packet is part of an interleave group,
   depending upon the availability of the other packets in that
   interleave group, decoding MUST be resumed from the next
   (sequential order) available packet. In other words, the
   unavailability of a packet in an interleave group at a certain time
   SHOULD NOT invalidate the other packets within that interleave
   group that MAY arrive later.

13. Payload Format Parameters

   This section defines the parameters that may be used to select
   optional features in the VMR-WB payload.
The parameters are defined here as part of the MIME subtype registration for the VMR-WB speech codec. A mapping of the parameters into the Session Description Protocol (SDP) [8] is also provided for those applications that use SDP. Equivalent parameters could be defined elsewhere for use with control protocols that do not use MIME or SDP.

The data format and parameters are specified for both real-time transport in RTP and for storage type applications such as e-mail attachments.

13.1. VMR-WB MIME Registration

The MIME subtype for the Variable-Rate Multimode Wideband (VMR-WB) audio codec is allocated from the IETF tree since VMR-WB is expected to be a widely used speech codec in general MMS, IMS, and VoIP applications. This MIME registration covers both real-time transfer via RTP and non-real-time transfers via stored files.

Note that the receiver MUST ignore any unspecified parameter and use the default values instead.

Media Type name: audio

Media subtype name: VMR-WB

Required parameters: none

Note that if no input parameters are defined, the default values will be used.

Also note that the "crc" and "robust-sorting" parameters from RFC 3267 are not applicable to the VMR-WB RTP payload and storage file formats. To ensure compatibility between VMR-WB and AMR-WB in interoperable sessions, one SHOULD make sure that AMR-WB does not utilize crc and robust-sorting (i.e., these options are deactivated in the session initiation).

OPTIONAL parameters:

These parameters apply to RTP transfer only.

octet-align: Permissible values are 0 and 1. If 1, octet-aligned operation SHALL be used. If 0 or if not present, bandwidth-efficient operation is employed (default).

mode-set: Requested VMR-WB mode set. Restricts the active codec mode set to a subset of all modes. Possible values are a comma separated list of modes from the set: 0, 1, 2, or 3 [10]. If such a mode set is specified by the decoder, the encoder MUST abide by the request and MUST NOT use modes outside of the subset. If not present, all codec modes are allowed for the session. During and upon initiation of an interoperable interconnection between VMR-WB and AMR-WB, only Rate-Set II mode 3 SHALL be used. There are three Frame Types (i.e., FT=0, 1, or 2) within this mode that match AMR-WB modes 0, 1, and 2, respectively. If the AMR-WB codec is engaged in an interoperable interconnection with VMR-WB, the active AMR-WB codec mode set SHOULD be limited to 0, 1, or 2.

mode-change-period: Specifies a number of frame-blocks, N, that is the interval at which codec mode changes are allowed. The initial phase of the interval is arbitrary, but changes must be separated by multiples of N frame-blocks. If this parameter is not present, mode changes are allowed at any time during the session. Note that this consideration is only made for maximum compatibility with AMR-WB; otherwise, VMR-WB modes can be switched at any time as long as the mode switching interval is an integer multiple of the frame size (i.e., 20 ms).

mode-change-neighbor: Permissible values are 0 and 1. If 1, mode changes SHALL only be made to the neighboring modes in the active codec mode set. Neighboring modes are the ones closest in bit rate to the current mode, either the next higher or next lower rate. If 0 or if not present, change between any two modes in the active codec mode set is allowed.

maxptime: The maximum amount of media that can be encapsulated in a payload packet, expressed as time in milliseconds. The time is calculated as the sum of the time the media present in the packet represents. The time SHALL be an integer multiple of the frame size. If this parameter is not present, the sender MAY encapsulate any number of speech frames into one RTP packet.
interleaving: Indicates that frame-block level interleaving SHALL be used for the session and its value defines the maximum number of frame-blocks allowed in an interleaving group (see Section 9.4.1). If this parameter is not present, interleaving SHALL NOT be used. The presence of this parameter also automatically implies that octet-aligned operation SHALL be used.

ptime: see RFC 2327 [8]. It SHALL be at least one frame size for VMR-WB.

channels: The number of audio channels. The possible values and their respective channel order are specified in Section 4.1 in [19]. If omitted, it has the default value of 1.

These parameters apply to both real-time and non-real-time transfers:

dtx: Permissible values are 0 and 1. The default is 0 (i.e., no DTX), where VMR-WB normally operates as a continuous variable-rate codec. If dtx=1, this MUST be signaled to both the encoder and the decoder of VMR-WB so that they operate in Discontinuous Transmission (DTX) mode.

sampling-frequency: Permissible values are 0 and 1. The default value is 0 (i.e., 16000 Hz sampling frequency for input/output and normal wideband operation). If the value is set to 1, the input/output sampling frequency is 8000 Hz (i.e., narrowband operation). If the sampling frequency is signaled only to the encoder or the decoder, different combinations of input and output speech sampling frequencies are obtained (e.g., input at 8000 Hz and output at 16000 Hz). Nevertheless, different input and output sampling rates are NOT RECOMMENDED. The sampling frequency SHALL NOT be changed during a session. Also note that the time stamp increment per 20 ms frame is 320 for 16000 Hz sampling frequency and 160 for 8000 Hz sampling frequency.

Encoding considerations: This type is defined for transfer via both RTP (RFC 3550) and stored-file methods as described in Sections 9 and 10, respectively, of RFC XXXX. Audio data is binary data, and must be encoded for non-binary transport; the Base64 encoding is suitable for Email.

Security considerations: See Section 12 of RFC XXXX.
Public specification: The VMR-WB speech codec is specified in the following 3GPP2 specifications: C.P0052-0 and S.R0080-0. Transfer methods are specified in RFC XXXX.

Additional information: The following applies to stored-file transfer methods:

Magic numbers:

   Single channel for the non-interoperable modes:
   ASCII character string "#!VMR-WB\n"
   (or 0x2321564d522d57420a in hexadecimal)

   Single channel for the interoperable mode (see RFC 3267):
   ASCII character string "#!AMR-WB\n"
   (or 0x2321414d522d57420a in hexadecimal)

   Multi-channel for the non-interoperable modes:
   ASCII character string "#!VMR-WB_MC1.0\n"
   (or 0x2321564d522d57425F4D43312E300a in hexadecimal)

   Multi-channel for the interoperable mode (see RFC 3267):
   ASCII character string "#!AMR-WB_MC1.0\n"
   (or 0x2321414d522d57425F4D43312E300a in hexadecimal)

File extensions for the non-interoperable modes: vmr, VMR
Macintosh file type code: none
Object identifier or OID: none

File extensions for the interoperable mode (see RFC 3267): awb, AWB
Macintosh file type code: none
Object identifier or OID: none

Person & email address to contact for further information:
   Sassan Ahmadi, Ph.D.
   Nokia Inc. USA
   sassan.ahmadi@nokia.com

Intended usage: COMMON. It is expected that many VoIP applications (as well as mobile applications) will use this type.

Author/Change controller:
   Sassan Ahmadi, Ph.D.
   Nokia Inc. USA
   sassan.ahmadi@nokia.com
   IETF Audio/Video Transport Working Group

13.2. Mapping MIME Parameters into SDP

The information carried in the MIME media type specification has a specific mapping to fields in the Session Description Protocol (SDP) [8], which is commonly used to describe RTP sessions. When SDP is used to specify sessions employing the VMR-WB codec, the mapping is as follows:

- The MIME type ("audio") goes in SDP "m=" as the media name.
- The MIME subtype (payload format name) goes in SDP "a=rtpmap" as the encoding name. The RTP clock rate in "a=rtpmap" MUST be 16000 for VMR-WB (although 8000 is also supported by VMR-WB for narrowband I/O processing), and the encoding parameters (number of channels) MUST either be explicitly set to N or omitted, implying a default value of 1. The values of N that are allowed are specified in Section 4.1 in [19].

- The parameters "ptime" and "maxptime" go in the SDP "a=ptime" and "a=maxptime" attributes, respectively.

- Any remaining parameters go in the SDP "a=fmtp" attribute by copying them directly from the MIME media type string as a semicolon separated list of parameter=value pairs.

Some example SDP session descriptions utilizing VMR-WB encodings follow. In these examples, long a=fmtp lines are folded to meet the column width constraints of this document; the backslash ("\") at the end of a line and the carriage return that follows it should be ignored.

Example of usage of VMR-WB in a possible VoIP scenario:

   m=audio 49120 RTP/AVP 98
   a=rtpmap:98 VMR-WB/16000
   a=fmtp:98 octet-align=1

Example of usage of VMR-WB in a possible streaming scenario (two channel stereo):

   m=audio 49120 RTP/AVP 99
   a=rtpmap:99 VMR-WB/16000/2
   a=fmtp:99 interleaving=30
   a=maxptime:100

Note that the payload format (encoding) names are commonly shown in upper case. MIME subtypes are commonly shown in lower case. These names are case-insensitive in both places. Similarly, parameter names are case-insensitive both in MIME types and in the default mapping to the SDP a=fmtp attribute.

14. IANA Considerations

One new MIME subtype must be registered; see Section 13. A new SDP attribute "maxptime", also defined in Section 13, needs to be registered. The "maxptime" attribute is expected to be defined in the revision of RFC 2327 [8] and is added here with a consistent definition.

15. Acknowledgements

The author would like to thank Dr. Redwan Salami of VoiceAge Corporation and Ari Lakaniemi of Nokia Inc. for their technical support throughout the drafting and review of this document.

Appendix A- VMR-WB Frame Structure

The VMR-WB encoder frame structure has been designed to minimize the complexity of the IWF during interoperable interconnections between VMR-WB and AMR-WB. Note that NO IWF is required for interconnections that utilize VMR-WB at both ends. The VMR-WB frame structure has been designed to be similar to the AMR-WB Interface Format 2 (IF2) bit-packing scheme [2]. The detailed information on the VMR-WB frame structure can be found in [10]. For convenience, the frame structure of the interoperable mode is reviewed in this section to facilitate the description of the IWF procedures in Appendix B.

The VMR-WB 12.65 kbps Interoperable Full-Rate has the following structure:

<-----------------------------------266 Bits---------------------------------->
<------Preamble-----------><----------AMR-WB Compatible Data Bits------------->
 0             7 8 9 0 1 2
+-+-+-+-+-+-+-+-+-+-+-+-+-+---------------------------------------------------+
|1|1|1|1|1|0|0|0|0|0|1|0|1|         AMR-WB mode 2 Class A, B, C Bits          |
+-+-+-+-+-+-+-+-+-+-+-+-+-+---------------------------------------------------+

The first octet "11111000" denotes the Interoperable Full-Rate Type, the next 4 bits "0010" indicate the AMR-WB Frame Type, and the 13th bit is the Frame Quality Bit Q of AMR-WB. Therefore, by removing the first octet, the AMR-WB Frame Type, Quality bit, and 253 data bits identical to those of AMR-WB mode 2 are obtained, which is compliant with Interface Format 2 [2].

The VMR-WB 8.85 kbps Interoperable Full-Rate has the following structure:

<-----------------------------------266 Bits---------------------------------->
<------Preamble-----------><----------AMR-WB Compatible Data Bits------------->
 0             7 8 9 0 1 2
+-+-+-+-+-+-+-+-+-+-+-+-+-+---------------------------------------------------+
|1|1|1|1|1|0|0|0|0|0|0|1|1|    AMR-WB mode 1 Class A, B, C Bits   |76 Padding |
+-+-+-+-+-+-+-+-+-+-+-+-+-+---------------------------------------------------+

The first octet "11111000" denotes the Interoperable Full-Rate Type, the next 4 bits "0001" indicate the AMR-WB Frame Type, and the 13th bit is the Frame Quality Bit Q of AMR-WB. The padding bits are used to adjust the frame size to 266 bits. Therefore, by removing the first octet and the last 76 padding bits, the AMR-WB Frame Type, Quality bit, and 177 data bits identical to those of AMR-WB mode 1 are obtained, which is compliant with Interface Format 2 [2].

The VMR-WB 6.6 kbps Interoperable Full-Rate has the following structure:

<-----------------------------------266 Bits---------------------------------->
<------Preamble-----------><----------AMR-WB Compatible Data Bits------------->
 0             7 8 9 0 1 2
+-+-+-+-+-+-+-+-+-+-+-+-+-+---------------------------------------------------+
|1|1|1|1|1|0|0|0|0|0|0|0|1|    AMR-WB mode 0 Class A, B, C Bits   |121 Padding|
+-+-+-+-+-+-+-+-+-+-+-+-+-+---------------------------------------------------+

The first octet "11111000" denotes the Interoperable Full-Rate Type, the next 4 bits "0000" indicate the AMR-WB Frame Type, and the 13th bit is the Frame Quality Bit Q of AMR-WB. The padding bits are used to adjust the frame size to 266 bits. Therefore, by removing the first octet and the last 121 padding bits, the AMR-WB Frame Type, Quality bit, and 132 data bits identical to those of AMR-WB mode 0 are obtained, which is compliant with Interface Format 2 [2].
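
The three Interoperable Full-Rate layouts above differ only in the Frame Type value and in how many trailing bits are padding. The following Python fragment is a minimal sketch of how a receiver or IWF might split such a frame into its fields. It is illustrative only and not part of this specification: the function name, the constant names, and the use of a string of "0"/"1" characters in place of packed octets are all assumptions made for readability.

```python
# Illustrative sketch only; names and the bit-string representation are
# assumptions, not definitions from this document.
FULL_RATE_TYPE = "11111000"   # first octet of an Interoperable Full-Rate frame
FRAME_BITS = 266              # Rate-Set II Full-Rate frame size
PREAMBLE_BITS = 13            # type octet + 4-bit Frame Type + Q bit
# AMR-WB speech bits per Frame Type, as listed in this appendix.
SPEECH_BITS = {2: 253, 1: 177, 0: 132}


def unpack_interop_full_rate(frame):
    """Return (frame_type, q_bit, speech_bits) for one 266-bit frame."""
    if len(frame) != FRAME_BITS or set(frame) - {"0", "1"}:
        raise ValueError("expected a 266-character string of 0s and 1s")
    if frame[:8] != FULL_RATE_TYPE:
        raise ValueError("not an Interoperable Full-Rate frame")
    frame_type = int(frame[8:12], 2)
    if frame_type not in SPEECH_BITS:
        raise ValueError("unsupported AMR-WB Frame Type %d" % frame_type)
    q_bit = int(frame[12])
    end = PREAMBLE_BITS + SPEECH_BITS[frame_type]
    # Anything after the speech bits is padding (0, 76, or 121 bits).
    return frame_type, q_bit, frame[PREAMBLE_BITS:end]
```

For an 8.85 kbps frame, for example, the function returns Frame Type 1, the Q bit, and the 177 speech bits, and silently drops the 76 padding bits. The sketch returns the Frame Type and Q bit separately from the speech bits so that a caller can either rebuild an IF2-style frame or copy the two fields into an RTP ToC entry.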
To satisfy the dim-and-burst requirement of the CDMA system, where the speech codec data rate is reduced to accommodate signaling traffic, three interoperable half-rate frame types may be generated depending on the AMR-WB codec mode in an interoperable interconnection. The VMR-WB Interoperable Half-Rate has the following structures depending on the AMR-WB codec mode [10]. Note that the type of the interoperable half-rate is determined by examining the preamble bits of a Rate-Set II Half-Rate frame.

Note that interoperable Half-Rate frames are used in VMR-WB to comply with the dim-and-burst signaling requirement of the CDMA system; however, since there is no corresponding half-rate mode within AMR-WB, the Interworking Function SHALL convert these frames into the corresponding AMR-WB frames. There is NO transcoding involved, and the conversion is accomplished in the transport layer by adding bits to, or removing bits from, the beginning and the end of the packets in both directions. For non-interoperable modes NO Interworking Function is required.

The VMR-WB 12.65 kbps Interoperable Half-Rate has the following structure:

<-----------------------------------124 Bits----------------------------------->
<------Preamble-----------><----------AMR-WB Compatible Data Bits-------------->
 0             7 8 9 0 1 2
+-+-+-+-+-+-+-+-+-+-+-+-+-+------------------------------------------------+-+-+
|1|1|0|1|1|1|1|1|0|0|1|0|1| AMR-WB mode 2 Class A, B, C Bits w/o FCB Bits  |0|0|
+-+-+-+-+-+-+-+-+-+-+-+-+-+------------------------------------------------+-+-+

Note that the Fixed-Codebook (FCB) indices (i.e., the last 144 bits [1]) are removed from the end of the AMR-WB mode 2 bits to make them fit the Rate-Set II Half-Rate frame size.

The first octet "11011111" denotes the Interoperable Half-Rate Type, the next 4 bits "0010" indicate the AMR-WB Frame Type, and the 13th bit is the Frame Quality Bit Q of AMR-WB.

The removal of the FCB indices by the IWF is done in an AMR-WB to VMR-WB interoperable interconnection and upon dim-and-burst signaling. The removed bits are replaced with random bits by the IWF in a VMR-WB to AMR-WB interoperable interconnection if an interoperable half-rate frame is received.

The VMR-WB 8.85 kbps Interoperable Half-Rate has the following structure:

<-----------------------------------124 Bits----------------------------------->
<------Preamble-----------><----------AMR-WB Compatible Data Bits-------------->
 0             7 8 9 0 1 2
+-+-+-+-+-+-+-+-+-+-+-+-+-+----------------------------------------------------+
|1|1|0|1|1|1|1|1|0|0|0|1|1|   AMR-WB mode 1 Class A, B, C Bits w/o FCB Bits    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+----------------------------------------------------+

Note that some of the Fixed-Codebook (FCB) indices (i.e., the last 66 bits [1]) are removed from the end of the AMR-WB mode 1 bits to make them fit the Rate-Set II Half-Rate frame size.

The first octet "11011111" denotes the Interoperable Half-Rate Type, the next 4 bits "0001" indicate the AMR-WB Frame Type, and the 13th bit is the Frame Quality Bit Q of AMR-WB.

The removal of the FCB indices by the IWF is done in an AMR-WB to VMR-WB interoperable interconnection and upon dim-and-burst signaling. The removed bits are replaced with random bits by the IWF in a VMR-WB to AMR-WB interoperable interconnection if an interoperable half-rate frame is received.
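
The replacement of removed FCB bits with random bits can be sketched as follows. This Python fragment is illustrative only and not part of this specification: the function and constant names are assumptions, frames are modeled as strings of "0"/"1" characters rather than packed octets, and the two trailing bits shown as "0 0" in the 12.65 kbps layout are assumed to be padding. The 6.6 kbps entry uses the bit counts given in the frame description that follows.

```python
import random

# Illustrative sketch only; names and the bit-string representation are
# assumptions, not definitions from this document.
HALF_RATE_TYPE = "11011111"   # first octet of an Interoperable Half-Rate frame
FRAME_BITS = 124              # Rate-Set II Half-Rate frame size
PREAMBLE_BITS = 13            # type octet + 4-bit Frame Type + Q bit
SPEECH_BITS = {2: 253, 1: 177, 0: 132}       # full AMR-WB frame sizes
REMOVED_FCB_BITS = {2: 144, 1: 66, 0: 21}    # bits dropped from the end


def expand_interop_half_rate(frame, rng=None):
    """Return (frame_type, q_bit, speech_bits) with random FCB filler."""
    if rng is None:
        rng = random.Random()
    if len(frame) != FRAME_BITS or set(frame) - {"0", "1"}:
        raise ValueError("expected a 124-character string of 0s and 1s")
    if frame[:8] != HALF_RATE_TYPE:
        raise ValueError("not an Interoperable Half-Rate frame")
    frame_type = int(frame[8:12], 2)
    if frame_type not in SPEECH_BITS:
        raise ValueError("unsupported AMR-WB Frame Type %d" % frame_type)
    q_bit = int(frame[12])
    removed = REMOVED_FCB_BITS[frame_type]
    kept = SPEECH_BITS[frame_type] - removed   # 109, 111, or 111 bits
    speech = frame[PREAMBLE_BITS:PREAMBLE_BITS + kept]
    # Random bits stand in for the FCB indices that were removed.
    filler = "".join(rng.choice("01") for _ in range(removed))
    return frame_type, q_bit, speech + filler
```

Passing a seeded `random.Random` instance makes the filler reproducible, which may be convenient in tests; a deployed IWF would presumably use whatever random source its platform provides.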
The VMR-WB 6.6 kbps Interoperable Half-Rate has the following structure:

<-----------------------------------124 Bits----------------------------------->
<------Preamble-----------><----------AMR-WB Compatible Data Bits-------------->
 0             7 8 9 0 1 2
+-+-+-+-+-+-+-+-+-+-+-+-+-+----------------------------------------------------+
|1|1|0|1|1|1|1|1|0|0|0|0|1|   AMR-WB mode 0 Class A, B, C Bits w/o FCB Bits    |
+-+-+-+-+-+-+-+-+-+-+-+-+-+----------------------------------------------------+

Note that some of the Fixed-Codebook (FCB) indices (i.e., the last 21 bits [1]) are removed from the end of AMR-WB mode 0 bits to make them fit to Rate-Set II Half-Rate frame size.

The first octet "11011111" denotes the Interoperable Half-Rate Type, the next 4 bits "0000" indicate the AMR-WB Frame Type, the 13th bit is the Frame Quality Bit Q of AMR-WB.

The removal of the FCB indices by the IWF is done in an AMR-WB to VMR-WB interoperable interconnection and upon dim-and-burst signaling. The removed bits are replaced with random bits in VMR-WB to AMR-WB interoperable interconnection by IWF if an interoperable half-rate frame is received.

The comfort noise (CN) and Silence Descriptor update (SID_UPDATE) data bits during silence intervals are transmitted through Rate-Set II CNG Quarter-Rate, which has the following frame structure:

<-------------------------------------54 Bits---------------------------------->
<-Preamble-><---AMR-WB Compatible Data Bits-->
 0 1 2 3 4
+-+-+-+-+-+----------------------------------+---------------------------------+
|1|0|0|1|1|    AMR-WB SID Bits (35 bits)     |         14 Padding Bits         |
+-+-+-+-+-+----------------------------------+---------------------------------+

The IWF in an AMR-WB to VMR-WB interoperable interconnection SHALL add the preamble and the padding bits to the beginning and the end of AMR-WB SID_UPDATE bits, respectively, to form a Rate-Set II CNG Quarter-Rate frame.
These bits SHALL be removed from the incoming VMR-WB CNG Quarter-Rate packet to form the outgoing SID_UPDATEs for AMR-WB in an interoperable interconnection.

Appendix B- Interworking Function (IWF) for Interoperable AMR-WB <-> VMR-WB Interconnections

The output bit stream of the VMR-WB codec has been arranged to closely follow the AMR-WB Interface Format 2 frame structure [2,10]. However, to comply with the constraints of CDMA Link-Layer Assisted Service Options and the CDMA2000 Multiplex Sublayer [12,13], the output frame size of the CDMA speech codec in any of the encoding rates SHALL conform to the sizes allowed by CDMA Multiplex Option 2 [12,13] as shown in Table 1.

To ensure a transparent speech data flow between 3GPP/AMR-WB and 3GPP2/VMR-WB, an interworking function operating at the transport layer is required. This is merely an RTP translator as envisioned in RFC 3550. The following describes the functions performed by the IWF in an interoperable interconnection.

Note that NO interworking function is needed if VMR-WB codecs are incorporated at both ends. Also note that if the path between AMR-WB and VMR-WB does not include the CDMA2000 air interface, the IWF MAY be eliminated depending on the implementation. An example of such a case would be AMR-WB and VMR-WB being used for Internet-based multimedia applications that do not involve any cellular network.
+------------+                                                    +------------+
|   VMR-WB   |                                                    |   AMR-WB   |
+------------+<-----------Session Initiation Using SIP----------->+------------+
|Intermediate|                                                    |Intermediate|
|  Protocol  |                                                    |  Protocol  |
|   Layers   |                       Gateway                      |   Layers   |
+------------+              +-----------------------+             +------------+
|    RTP     |              | Interworking Function |             |    RTP     |
+------------+              +-----------+-----------+             +------------+
|    UDP     |              |    UDP    |    UDP    |             |    UDP     |
+------------+              +-----------+-----------+             +------------+
|     IP     |              |    IP     |    IP     |             |     IP     |
+------------+              +-----------+-----------+             +------------+
| Data Link  |              | Data Link | Data Link |             | Data Link  |
|   Layer    |              |   Layer   |   Layer   |             |   Layer    |
+------------+              +-----------+-----------+             +------------+
|  Physical  |              | Physical  | Physical  |             |  Physical  |
|   Layer    |<------------>|   Layer   |   Layer   |<----------->|   Layer    |
+------------+              +-----------+-----------+             +------------+

Figure B-1: The data flow in an interoperable interconnection between VMR-WB and AMR-WB.

When receiving an RTP payload from VMR-WB destined for AMR-WB, the IWF SHALL perform the following procedure for every speech frame within the payload, since the payload may contain more than one speech frame.

- The preamble bits of the incoming speech data block SHALL be examined to determine the AMR-WB codec mode as well as to appropriately set the FT field of the outgoing payload ToC.

- In the VMR-WB to AMR-WB path, the CMR of the outgoing payload SHALL be set to 2. This means that the VMR-WB codec always requests the AMR-WB encoder to operate in codec mode 2.

- For incoming FT values 0, 1, or 2, remove the 13-bit preamble from the beginning of the incoming speech data block and form the AMR-WB compatible speech data block as follows:

   * If the FT field of the incoming payload ToC is 0, the extra 121 padding bits SHALL be removed from the end of the speech data block.

   * If the FT field of the incoming payload ToC is 1, the extra 76 padding bits SHALL be removed from the end of the speech data block.
   * If the FT field of the incoming payload ToC is 2, the speech data block must be used as is.

- If the FT field of the incoming payload ToC is 9, the 5-bit preamble at the beginning and the extra 14 padding bits at the end SHALL be removed from the incoming speech data block to form an AMR-WB SID_UPDATE frame.

- If the FT field of the incoming payload ToC is 4 and the content of the incoming preamble is 11011111, the FT field of the outgoing payload ToC SHALL be set to 0, 1, or 2 depending on the contents of the four most significant bits of the second octet in the incoming preamble. If the aforementioned bits are 0010, 0001, or 0000, the FT field of the outgoing payload is set to 2, 1, or 0, respectively, and 144 (FT=2), 66 (FT=1), or 21 (FT=0) randomly generated bits must be added to the end of the speech data to form a regular AMR-WB speech frame with 253, 177, or 132 bits of speech data corresponding to codec modes 2, 1, or 0, respectively. Also, the 8-bit preamble at the beginning of the incoming speech block SHALL be removed.

Note that the interoperable half-rate is used by the VMR-WB codec to comply with CDMA dim-and-burst signaling requirements when using CDMA2000 Link Layer Assisted Protocols for real-time VoIP services [13,14]. This is an unlikely situation if the path between the source and destination codecs does not include the CDMA2000 air interface; in such cases, the IWF MAY be eliminated depending on the implementation of the terminal codecs.

When receiving an RTP payload from AMR-WB destined for VMR-WB, the IWF SHALL perform the following procedure for every speech frame within the payload, since the payload may contain more than one speech frame.

- The valid CMR values in the AMR-WB to VMR-WB path are 0, 1, or 2.
- If the FT field of the incoming payload ToC is 0, 121 padding bits (zeros) SHALL be added to the end of the outgoing speech data block and the special bit pattern "1111100000001" SHALL be added to the beginning of the outgoing speech data block.

- If the FT field of the incoming payload ToC is 1, 76 padding bits (zeros) SHALL be added to the end of the outgoing speech data block and the special bit pattern "1111100000011" SHALL be added to the beginning of the outgoing speech data block.

- If the FT field of the incoming payload ToC is 2, NO padding bits (zeros) SHALL be added to the end of the outgoing speech data block and the special bit pattern "1111100000101" SHALL be added to the beginning of the outgoing speech data block.

- If the FT field of the incoming payload ToC is 9, 14 padding bits (zeros) SHALL be added to the end of the speech data block. The special bit pattern "10011" SHALL be added to the beginning of the outgoing speech block.

During an AMR-WB to VMR-WB interoperable interconnection, the CDMA Multiplex Sublayer in the 3GPP2/CDMA2000 link MAY request a half-rate speech frame. In that case, the IWF SHOULD convert the interoperable Full-Rate frames (i.e., FT values 0, 1, or 2) to interoperable Half-Rate frames in the outgoing payload using the following procedure. (This is an unlikely situation if the path between the source and destination codecs does not include the CDMA2000 air interface; in such cases the IWF MAY be eliminated depending on the implementation of the terminal codecs.)

- The size of the outgoing speech data must be adjusted to 124 bits; therefore, extra bits SHALL be removed from the end of the incoming speech data. If the FT field of the incoming payload is 2, 1, or 0, then 144, 66, or 21 bits SHALL be removed from the end of the incoming speech data block, respectively, to form a Rate-Set II interoperable Half-Rate speech frame for VMR-WB.
- The corresponding FT field of the outgoing payload SHALL be set to 4.

- The first 13 bits of the outgoing speech data block SHALL be set to "1101111100101" (for FT=2), "1101111100011" (for FT=1), or "1101111100001" (for FT=0), depending on the value of the FT field of the incoming payload (to be interpreted as an interoperable half-rate frame by the VMR-WB decoder at the receiving side).

References

Normative References

[1] 3GPP TS 26.190 "AMR Wideband speech codec; Transcoding functions", version 5.1.0 (2001-12), 3rd Generation Partnership Project (3GPP).

[2] 3GPP TS 26.201 "AMR Wideband speech codec; Frame Structure", version 5.0.0 (2001-03), 3rd Generation Partnership Project (3GPP).

[3] S. Bradner, "Key words for use in RFCs to Indicate Requirement Levels", IETF RFC 2119, March 1997.

[4] 3GPP TS 26.193 "AMR Wideband Speech Codec; Source Controlled Rate operation", version 5.0.0 (2001-03), 3rd Generation Partnership Project (3GPP).

[5] H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", IETF RFC 3550, July 2003.

[6] J. Sjoberg, et al., "Real-Time Transport Protocol (RTP) Payload Format and File Storage Format for the Adaptive Multi-Rate (AMR) and Adaptive Multi-Rate Wideband (AMR-WB) Audio Codecs", IETF RFC 3267, June 2002.

[7] 3GPP TS 26.192 "AMR Wideband speech codec; Comfort Noise aspects", version 5.0.0 (2001-03), 3rd Generation Partnership Project (3GPP).

[8] M. Handley and V. Jacobson, "SDP: Session Description Protocol", IETF RFC 2327, April 1998.

[9] 3GPP2 S.R.0080-0 "CDMA2000 Wideband Speech Codec Stage 1 Requirements", 3GPP2 Technical Specification, February 2003.

[10] 3GPP2 C.P.0052-0 "Source-Controlled Variable-Rate Multimode Wideband Speech Codec Service Option for Wideband Spread Spectrum Communication Systems", 3GPP2 Technical Specification, to be published in April 2004.
[11] 3GPP2 C11-20030915-003R2 "CDMA2000 Wideband Speech Codec Characterization Test Plan", 3GPP2 Technical Specification, September 2003.

[12] 3GPP2 C.P.9021 "Link-Layer Assisted Robust Header Compression, Service Options for Voice-over-IP Operation", 3GPP2 Technical Specification, September 2002.

[13] 3GPP2 C.S.0003A-2 "Medium Access Control (MAC) Standard for CDMA2000 Spread Spectrum Systems", Release A, 3GPP2 Technical Specification, February 2002.

[14] 3GPP2 C.S0005-0 "Upper Layer (Layer 3) Signaling Standard for cdma2000 Spread Spectrum Systems", Release 0, 3GPP2 Technical Specification, June 2002.

Informative References

[15] S. Floyd, M. Handley, J. Padhye, J. Widmer, "Equation-Based Congestion Control for Unicast Applications", ACM SIGCOMM 2000, Stockholm, Sweden.

[16] J. Rosenberg and H. Schulzrinne, "An RTP Payload Format for Generic Forward Error Correction", IETF RFC 2733, December 1999.

[17] Baugher, et al., "The Secure Real Time Transport Protocol", IETF Draft (Work in Progress), November 2001.

[18] C. Perkins, et al., "RTP Payload for Redundant Audio Data", IETF RFC 2198, September 1997.

[19] H. Schulzrinne, "RTP Profile for Audio and Video Conferences with Minimal Control", IETF RFC 3551, July 2003.

Any 3GPP document can be downloaded from the 3GPP web server, "http://www.3gpp.org/", see specifications.

Any 3GPP2 document can be downloaded from the 3GPP2 web server, "http://www.3gpp2.org/", see specifications.

Author's Address

The editor will serve as the point of contact for all technical matters related to this document.

   Dr. Sassan Ahmadi
   Nokia Inc.
   12278 Scripps Summit Dr.
   San Diego, CA 92131 USA

   Phone: 1 (858) 831-5916
   Fax:   1 (858) 831-6513
   EMail: sassan.ahmadi@nokia.com

This Internet-Draft expires in six months from November 2003.

Full Copyright Statement

Copyright (C) The Internet Society (2003). All Rights Reserved.
This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assignees.

This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

Funding for the RFC Editor function is currently provided by the Internet Society.