Audio/Video Transport Working Group                        M. Espelien
Internet Draft: RTP Payload Common Format for               R. Gellens
Vocoder Speech                                           Qualcomm Inc.
Document: draft-espelien-avt-common-00.txt              September 2001

             RTP Payload Common Format for Vocoder Speech

Status of this Memo

   This document is an Internet-Draft and is in full conformance with
   all provisions of Section 10 of RFC2026.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF), its areas, and its working groups.  Note that
   other groups may also distribute working documents as
   Internet-Drafts.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   The list of current Internet-Drafts can be accessed at
   http://www.ietf.org/ietf/1id-abstracts.txt.

   The list of Internet-Draft Shadow Directories can be accessed at
   http://www.ietf.org/shadow.html.

Copyright Notice

   Copyright (C) The Internet Society 2001.  All Rights Reserved.

Espelien & Gellens         Expires March 2002                 [Page 1]
Internet Draft           Common Payload Format          September 2001

Table of Contents

   1  Abstract
   2  Conventions Used in this Document
   3  Changes
   4  Introduction
   5  Background and Motivation for Common Format
   6  Common Characteristics
      6.1  PureVoice Characteristics
      6.2  EVRC Characteristics
      6.3  SMV Characteristics
   7  Common RTP Packet Format
      7.1  Normal Format
      7.2  TOC Entries
      7.3  Bundling Codec Data Frames
         7.3.1  Additional Bundling Restrictions on the Sender
      7.4  Interleaving Codec Data Frames
         7.4.1  Additional Interleaving Restrictions on the Sender
      7.5  Finding Interleave Group Boundaries
      7.6  Reconstructing Interleaved Speech
      7.7  Receiving Invalid Values
      7.8  Optimized Single Frame Format
      7.9  Detecting Which Format
      7.10 Codec Data Frame Format
         7.10.1  PureVoice Codec Data Frame Format
         7.10.2  EVRC or SMV Codec Data Frame Format
      7.11 Adding New Codecs
   8  Tardy Packets
   9  Lost Packets
   10 Implementation Issues
      10.1 Interleaving Length
   11 Security Considerations
   12 Real Time and Storage Mode
      12.1 RTP Mode
      12.2 Storage Mode
   13 IANA Considerations
      13.1 Registration of MIME Media Type
         13.1.1  audio/EVRC Media Type Registration
         13.1.2  audio/SMV Media Type Registration
         13.1.3  audio/qcelp-common Media Type Registration
      13.2 Optional Media Type Parameters
   14 Mapping to SDP Parameters
   15 Acknowledgements
   16 References
   17 Authors' Addresses

1 Abstract

   This document describes a common [RTP] payload format for speech
   encoded using vocoders which share certain common characteristics
   (see section 6).

   This is expected to be especially useful in CDMA (Code Division
   Multiple Access) wireless systems.  CDMA networks use one of three
   vocoders: [PureVoice] (Qcelp), [EVRC] (Enhanced Variable Rate Codec)
   and in the future [SMV] (Selectable Mode Vocoder).  All of these
   vocoders share a number of common characteristics (see section 6)
   and can be transmitted using the RTP payload format specified in
   this document.

   An interleaved format is included to reduce the effect of packet
   loss on speech quality, as well as a bundled format, and a format
   optimized for header compression.

2 Conventions Used in this Document

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [KEYWORDS].

3 Changes

   This is the first version.

4 Introduction

   This document describes a generalized format for use as an [RTP]
   payload type.  Three CDMA vocoders are initially specified in this
   common format; more can be added by following the procedures in
   section 7.11.

   The [PureVoice] Qcelp vocoder and the [EVRC] vocoder are already
   widely deployed in CDMA wireless networks.  [SMV] is the codec of
   choice for next generation CDMA wireless networks and is likely to
   be widely deployed as next generation wireless networks are rolled
   out.

   Multiple codec data frames MAY be bundled together to reduce the
   per-frame transmission overhead.  Codec data frames can be
   interleaved to reduce quality degradation due to lost packets.  The
   sender can choose various interleave settings based on the
   importance of low end-to-end delay versus greater tolerance for lost
   packets.
   A format optimized for header compression is provided (see section
   7.8).

5 Background and Motivation for Common Format

   The Electronic Industries Association (EIA) and the
   Telecommunications Industry Association (TIA) have published three
   standards which define the speech compression algorithms for CDMA
   applications: PureVoice, EVRC and SMV.

   The [SMV] codec is the preferred speech codec standard for CDMA2000.
   SMV will be deployed in third generation handsets, in addition to
   the PureVoice and EVRC codecs.  There are currently handsets that
   support two of these codecs, and in the future handsets might
   support all three codecs.  The PureVoice and EVRC codecs are
   currently deployed in millions of first and second generation CDMA
   handsets.

   The frame formats of the three codecs (PureVoice, EVRC and SMV) are
   very similar.  The similarities suggest that a common specification
   for encapsulating these three wireless vocoders, as well as
   potential future wireless vocoders, is possible and worth pursuing.

   The environment (memory, processor speed, etc.) of wireless handsets
   is constrained.  A common RTP payload format for multiple vocoders
   allows the handset to support these vocoders with a single, smaller
   RTP implementation than would be needed for separate formats,
   reducing code size and complexity, and therefore shortening time to
   market, lowering costs, and improving quality.  It also permits
   saved handset resources to be spent on user features.

   Since an RTP format for [EVRC] and [SMV] has not yet been approved,
   a direct case can be made for a common format supporting at least
   these two (plus future) codecs.  The situation with [PureVoice] is
   more complex.  An RTP format already exists [vnd.Qcelp] and is
   specified in [RFC2658]; therefore it would be ideal for a common
   format supporting PureVoice as well as EVRC and SMV to interoperate
   with existing implementations of [vnd.Qcelp].
   However, if interoperability is sacrificed, significant benefits can
   be obtained by making better use of RTP packet bits; for example,
   allowing for table-of-contents entries as well as a frame count
   field, yet spending the same number of bits (or fewer) per packet on
   average.  The common format specified here gives up interoperability
   with [vnd.Qcelp] in order to gain packet optimization benefits.

6 Common Characteristics

   The frame formats of the three initial codecs (PureVoice, EVRC and
   SMV) are very similar.  This specification is designed to transport
   data frames of vocoders that have the following characteristics:

   -  The vocoder is frame based.

   -  Null and erasure frames are allowed.

   -  The total number of rates is less than 17.

   -  The maximum full rate frame can be transported in a single RTP
      packet using this specific format.

   Vocoders with characteristics that can be expressed in format type,
   TOC entries and codec frames can easily be expressed in this common
   format.  New vocoders with such characteristics can be added to this
   common format by following the steps in section 7.11.

6.1 PureVoice Characteristics

   The Qcelp [PureVoice] codec compresses each 20 milliseconds of 8000
   Hz sampled input speech into one of four different size output
   frames: Rate 1 (266 bits), Rate 1/2 (124 bits), Rate 1/4 (54 bits)
   or Rate 1/8 (20 bits).  In addition, there are two zero bit vocoder
   frame types (see PureVoice Table in section 7.2): null frames and
   erasure frames.  (Erasure frames are never transmitted; they are
   substituted by the receiver for lost or damaged frames.)

6.2 EVRC Characteristics

   The [EVRC] codec compresses each 20 milliseconds of 8000 Hz sampled
   input speech into one of four different size output frames: Rate 1
   (171 bits), Rate 1/2 (80 bits), Rate 1/4 (40 bits) or Rate 1/8 (16
   bits).  In addition, there are two zero bit vocoder frame types (see
   EVRC Table in section 7.2): null frames and erasure frames.
   (Erasure frames are never transmitted; they are substituted by the
   receiver for lost or damaged frames.)

6.3 SMV Characteristics

   Like the EVRC, the [SMV] codec also compresses each 20 milliseconds
   of 8000 Hz sampled input speech into one of four different size
   output frames: Rate 1 (171 bits), Rate 1/2 (80 bits), Rate 1/4 (40
   bits) or Rate 1/8 (16 bits).  In addition, there are two zero bit
   vocoder frame types (see SMV Table in section 7.2): null frames and
   erasure frames.  (Erasure frames are never transmitted; they are
   substituted by the receiver for lost or damaged frames.)

   The SMV is more bandwidth efficient than the EVRC vocoder.  The SMV
   achieves lower average data rates (ADR) by transmitting at
   percentages of each rate as shown in the table below.  The
   assumptions and details of noise levels and ADR are described in
   Chapter 4 of [SMV].  The EVRC is equivalent in performance to SMV
   mode 1.

   The SMV codec operates in one of four modes.  Each mode employs one
   of the vocoders operating at the rates mentioned above.  Each mode
   operates in all rates (full to 1/8) for varying percentages of time,
   based on the desired average data rate specified, taking into
   account characteristics of the speech samples.  [SMV] modes can be
   changed on a frame by frame basis.

   Note that the [SMV] mode is not encapsulated in the RTP packet; only
   fields defined in section 7.1 or 7.8 are sent as RTP payload.  [SMV]
   modes are included in this document for informational purposes only.

   While each [SMV] mode can operate in all rates (full to 1/8) for
   varying percentages of time, a higher or lower average data rate is
   achieved for each mode.
   This is shown in the table below:

                Mode 0      Mode 1      Mode 2      Mode 3
   -------------------------------------------------------------
   Rate 1       68.90%      38.14%      15.43%      07.49%
   Rate 1/2     06.03%      15.82%      38.34%      46.28%
   Rate 1/4     00.00%      17.37%      16.38%      16.38%
   Rate 1/8     25.07%      28.67%      29.85%      29.85%
   -------------------------------------------------------------
   ADR          7205 bps    5182 bps    4073 bps    3692 bps

   The SMV codec chooses the output frame rate based on an analysis of
   the input speech and the current operating mode (either normal or
   one of three reduced rates).  For typical speech patterns, this
   results in an average output of 4.2k bits/second for normal mode and
   lower for reduced rate modes.

7 Common RTP Packet Format

   The RTP timestamp is in 1/8000 of a second units.  The RTP payload
   data for the common format is one of two types: normal (type 1) and
   optimized single frame (type 2).

7.1 Normal Format

   Normal packet format allows for multiple codec frames to be included
   in each RTP packet.  The sender chooses how many codec data frames
   to include in each RTP packet.  If more than one, the sender chooses
   to bundle or interleave the frames.  Bundling groups two or more
   consecutive data frames in a single RTP packet.  Interleaving groups
   two or more non-consecutive frames in a packet.  Interleaving can
   mitigate the listener's perception of data loss.

   The normal codec RTP payload data is formatted as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       RTP Header [RTP]                        |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |R|R| LLL | NNN |R|R|Frame Count|  TOC  |  ...  |  TOC  |padding|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |one or more codec data frames, one per TOC entry               |
   |                             ....                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The RTP header has the expected values as described in [RTP].
   The use of the marker bit in the RTP header is outside the scope of
   this document; it is defined by the application.

   When multiple codec data frames are present in a single RTP packet,
   the timestamp is, as always, that of the oldest data represented in
   the RTP packet.

   The assignment of an RTP payload type for this new packet format is
   outside the scope of this document, and will not be specified here.
   It is expected that the RTP profile for a particular class of
   applications will assign a payload type for this encoding, or if
   that is not done then a payload type in the dynamic range will be
   chosen.  [SDP] can be used to signal out of band the RTP payload
   type (see example in section 14).

   The fields following the RTP header have the following meaning:

   1st bit: Reserved (R): 1 bit
      MUST be set to zero by sender; SHOULD be ignored by receiver.

   2nd bit: Reserved (R): 1 bit
      MUST be set to zero by sender; SHOULD be ignored by receiver.

   3rd-5th bit: Interleave (LLL): 3 bits
      MUST be set to a value from 0 to 7.  If this field is non-zero,
      interleaving is enabled.  All receivers MUST support
      interleaving.  Senders MAY support interleaving.  Senders that do
      not support interleaving MUST set fields LLL and NNN to zero.

   6th-8th bit: Interleave Index (NNN): 3 bits
      MUST have a value less than or equal to the value of LLL.  Values
      of NNN greater than the value of LLL are invalid.

      More than one codec data frame MAY be included in a single RTP
      packet.  Multiple data frames are either bundled or interleaved.
      Bundling is described in detail in section 7.3, and interleaving
      in section 7.4.

      If only one codec data frame is included in an RTP packet, the
      LLL and NNN fields MUST be zero.

   9th bit: Reserved (R): 1 bit
      MUST be set to zero by sender; SHOULD be ignored by receiver.

   10th bit: Reserved (R): 1 bit
      MUST be set to zero by sender; SHOULD be ignored by receiver.
   11th-16th bit: Frame Count (Count): 6 bits
      MUST be set by sender to the number of codec data frames minus
      one.  Valid values range from 0 to 63.  The frame count plus one
      indicates how many TOC entries (and codec data frames) are
      present in the RTP packet.  A value of zero indicates one frame.
      A value of 63 indicates 64 frames.

      TOC entries are described in section 7.2.  TOC entries provide
      information about the encoding rate and length of the respective
      codec frame.  Codec frames are speech data encoded at various
      rates (Full, 1/2, 1/4, or 1/8).  Null and erasure frames are not
      played out; they have zero length, and the corresponding TOC
      entry indicates the null or erased frame type.

   17th-20th bit: First Table of Contents (TOC): 4 bits
      MUST be set by sender as described in section 7.2.  There is one
      TOC entry for each codec frame.  The value can range from 0 to 5
      as shown in the three tables in section 7.2.  Each value
      indicates to the receiver the length of the corresponding codec
      data frame.

   Padding (padding): 0 or 4 bits
      If the number of TOC entries (that is, the value of the Frame
      Count field plus one) is odd, then the sender MUST add 4 bits of
      padding, set to zero, following the last TOC entry and preceding
      the first codec data frame, so that the first codec data frame
      starts on an octet boundary.  If the number of TOC entries is
      even, then no padding is used; the first codec data frame
      immediately follows the last TOC entry.  The receiver interprets
      the bits following the last TOC entry or padding as the first
      codec data frame.

   Codec Frame(s): Length depends on codec and rate
      See descriptions in section 7.2.  Each codec frame uses zero or
      more bits, depending on the rate specified by the TOC entry and
      the codec type specified by the MIME type.  (For example, half
      rate EVRC and SMV codec frames are 80 bits long, while a half
      rate PureVoice codec frame is 124 bits long.)  The sender sets
      the TOC value and the associated codec frame.
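The field layout above can be illustrated with a short receiver-side sketch. It is a hedged example, not part of the format definition: the names `NormalHeader` and `parse_normal_header` are hypothetical, the input is assumed to be the RTP payload with the RTP header already removed, and the padding nibble is assumed to be present exactly when the number of TOC entries is odd (the two header octets are 16 bits, so this is what keeps codec data frames octet aligned).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class NormalHeader:
    lll: int          # Interleave value (0..7)
    nnn: int          # Interleave index (must be <= lll)
    frames: int       # number of codec data frames (Frame Count field + 1)
    toc: List[int]    # one 4-bit TOC value per codec data frame
    data_offset: int  # octet offset of the first codec data frame


def parse_normal_header(payload: bytes) -> NormalHeader:
    """Parse the normal-format payload header (hypothetical helper)."""
    if len(payload) < 3:
        raise ValueError("payload too short for header and one TOC entry")
    # Octet 0: R R LLL NNN (reserved bits are ignored by the receiver).
    lll = (payload[0] >> 3) & 0x07
    nnn = payload[0] & 0x07
    if nnn > lll:
        raise ValueError("NNN greater than LLL is invalid")
    # Octet 1: R R Frame Count (6 bits, number of frames minus one).
    frames = (payload[1] & 0x3F) + 1
    # TOC entries are 4 bits each; an odd count is followed by 4 pad bits.
    toc_octets = (frames + 1) // 2
    if len(payload) < 2 + toc_octets:
        raise ValueError("payload too short for TOC entries")
    toc = []
    for i in range(frames):
        octet = payload[2 + i // 2]
        toc.append(octet >> 4 if i % 2 == 0 else octet & 0x0F)
    return NormalHeader(lll, nnn, frames, toc, 2 + toc_octets)
```

For instance, a payload beginning with the octets 0x00 0x02 0x43 0x10 describes three bundled frames with TOC values 4, 3 and 1, followed by four padding bits, so the first codec data frame starts at octet 4.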
   The tables in section 7.2 correlate TOC values with valid codec
   lengths for the initial three codecs; future codecs specify the
   mapping in their MIME registration, as per section 7.11.

7.2 TOC Entries

   TOC entries apply only to the multiple frame (Type 1) format as
   described in section 7.1.  Each TOC entry is correlated with the
   respective codec data frame.  The TOC value indicates the rate set
   and number of bits in the data frame.  For PureVoice, EVRC and SMV
   the following tables are used:

   TOC    PureVoice
   Value  Rate     Codec data frame size (in octets)
   -----  -------  ----------------------------------------------
   0      Blank     0  (0 bits)
   1      1/8       3  (20 bits; 4 zero bits of padding at end)
   2      1/4       7  (54 bits; 2 zero bits of padding at end)
   3      1/2      16  (124 bits; 4 zero bits of padding at end)
   4      1        34  (266 bits; 6 zero bits of padding at end)
   5      Erasure   0  SHOULD NOT be transmitted by sender
   6-15   n/a     n/a  Reserved.  SHOULD NOT be transmitted

   Note that the common frame format for PureVoice has TOC entries
   instead of lead bytes.  As a result, the PureVoice codec frame size
   in the table indicates the size of the data itself, just as it does
   for EVRC and SMV.

   TOC    EVRC
   Value  Rate     Codec data frame size (in octets)
   -----  -------  ----------------------------------------------
   0      Blank     0  (0 bits)
   1      1/8       2  (16 bits)
   2      1/4       5  (40 bits)
   3      1/2      10  (80 bits)
   4      1        22  (171 bits; 5 zero bits of padding at end)
   5      Erasure   0  SHOULD NOT be transmitted by sender
   6-15   n/a     n/a  Reserved.  SHOULD NOT be transmitted

   TOC    SMV
   Value  Rate     Codec data frame size (in octets)
   -----  -------  ----------------------------------------------
   0      Blank     0  (0 bits)
   1      1/8       2  (16 bits)
   2      1/4       5  (40 bits)
   3      1/2      10  (80 bits)
   4      1        22  (171 bits; 5 zero bits of padding at end)
   5      Erasure   0  SHOULD NOT be transmitted by sender
   6-15   n/a     n/a  Reserved.  SHOULD NOT be transmitted

7.3 Bundling Codec Data Frames

   Bundling codec data frames only applies to the multiple frame format
   as described in section 7.1.

   As indicated in section 7.1, more than one codec data frame MAY be
   included in a single RTP packet.  Bundling codec data frames means
   multiple data frames are included consecutively in a packet (without
   interleaving).

   The bundling of codec data frames is signaled by setting the frame
   count to a value greater than 0 (which also requires that the LLL
   and the NNN values MUST both be zero).  Senders MAY support
   bundling.  All receivers MUST support bundling.

   Receivers MAY signal the maximum number of codec data frames they
   can handle in a single RTP packet.  This can be done using out of
   band signaling (for example in [SDP] parameters).  See also maxptime
   in section 13.2.

7.3.1 Additional Bundling Restrictions on the Sender

   Furthermore, senders have the following additional restrictions:

   o  MUST NOT include more codec data frames in a single RTP packet
      than signaled by maxptime in section 13.2.

   o  To the extent that it is possible to determine the MTU of the
      underlying transport, MUST NOT include more codec data frames in
      a single RTP packet than will fit in the MTU.  For the purpose of
      computing the maximum bundling value, all codec data frames
      SHOULD be assumed to have the Rate 1 size.  It is essential that
      a single codec full rate frame be sent in an unfragmented single
      RTP packet.

   Note that optimized single frames are sent 20 ms (milliseconds) at a
   time, one in each RTP packet.  Therefore for optimized single frame
   format, maxptime MUST be 20 ms, for the currently supported
   vocoders; see section 14.

7.4 Interleaving Codec Data Frames

   Interleaving is meaningful only when more than one codec data frame
   is bundled into a single RTP packet.

   All receivers MUST support interleaving.  Senders MAY support
   interleaving.
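The TOC tables in section 7.2 and the MTU rule in section 7.3.1 lend themselves to a small worked sketch. It is illustrative only: `FRAME_BITS`, `frame_octets`, `payload_octets` and `max_bundle` are hypothetical names, octet sizes are derived from the bit counts in section 6 by rounding up to whole octets, and `payload_budget` is an assumed caller-supplied number of octets left for the RTP payload after the IP, UDP and RTP headers.

```python
from typing import Dict, Optional, Sequence

# Bits per codec data frame, keyed by TOC value (bit counts from section 6).
FRAME_BITS: Dict[str, Dict[int, int]] = {
    "PureVoice": {1: 20, 2: 54, 3: 124, 4: 266},
    "EVRC": {1: 16, 2: 40, 3: 80, 4: 171},
    "SMV": {1: 16, 2: 40, 3: 80, 4: 171},
}


def frame_octets(codec: str, toc: int) -> int:
    """Octets occupied by one codec data frame, including zero padding."""
    if toc in (0, 5):  # blank and erasure frames carry no data
        return 0
    try:
        bits = FRAME_BITS[codec][toc]
    except KeyError:
        raise ValueError("unknown codec or reserved TOC value") from None
    return (bits + 7) // 8


def payload_octets(codec: str, tocs: Sequence[int]) -> int:
    """Size of a normal-format payload: header, TOC entries, frames."""
    toc_octets = (len(tocs) + 1) // 2  # 4 bits each, padded to an octet
    return 2 + toc_octets + sum(frame_octets(codec, t) for t in tocs)


def max_bundle(codec: str, payload_budget: int,
               maxptime_ms: Optional[int] = None) -> int:
    """Largest frame count that fits, assuming every frame is Rate 1."""
    limit = 64  # the 6-bit Frame Count field allows at most 64 frames
    if maxptime_ms is not None:
        limit = min(limit, maxptime_ms // 20)  # each frame covers 20 ms
    rate1 = frame_octets(codec, 4)
    best = 0
    for n in range(1, limit + 1):
        if 2 + (n + 1) // 2 + n * rate1 > payload_budget:
            break
        best = n
    return best
```

Sizing every frame as Rate 1 is conservative, so a packet built this way cannot exceed the budget whichever rates the vocoder actually produces. A result of 0 means that not even one full rate frame fits in the budget.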
   Interleaving of codec data frames is signaled by setting the LLL
   bits to a value from 1 to 7 inclusive.

   Receivers MAY signal the maximum number of bundles (maxinterleave)
   they can handle in a single interleaving group.  This can be done
   using out of band signaling (for example in [SDP] parameters).
   Section 13.2 describes the maxinterleave parameter.

   Given a time-ordered sequence of output codec frames numbered 0..n,
   a bundling value B, and an interleave value L where
   n = B * (L+1) - 1, the output frames are placed into RTP packets as
   follows (the values of the fields LLL and NNN are indicated for each
   RTP packet):

   First RTP Packet in Interleave group:
      LLL=L, NNN=0
      Frame 0, Frame L+1, Frame 2(L+1), Frame 3(L+1), ... for a total
      of B frames

   Second RTP Packet in Interleave group:
      LLL=L, NNN=1
      Frame 1, Frame 1+L+1, Frame 1+2(L+1), Frame 1+3(L+1), ... for a
      total of B frames

   This continues to the last RTP packet in the interleave group:

   (L+1)th RTP Packet in Interleave group:
      LLL=L, NNN=L
      Frame L, Frame L+L+1, Frame L+2(L+1), Frame L+3(L+1), ... for a
      total of B frames

   Senders MUST transmit in timestamp-increasing order.  Furthermore,
   within each interleave group, the RTP packets making up the
   interleave group MUST be transmitted in value-increasing order of
   the NNN field.  While this does not guarantee reduced end-to-end
   delay on the receiving end, when packets are delivered in order by
   the underlying transport, delay is reduced to the minimum possible.

7.4.1 Additional Interleaving Restrictions on the Sender

   Additionally, senders have the following restrictions:

   o  Once beginning a session with a given maximum interleaving value,
      the sender MUST NOT increase the interleaving to a value that
      exceeds the maximum interleaving that was signaled.  The maximum
      interleaving value is signaled by maxinterleave in section 13.2.
   o  MAY change the interleaving value only between interleave groups.

7.5 Finding Interleave Group Boundaries

   Given an RTP packet with sequence number S, interleave value (field
   LLL) L, and interleave index value (field NNN) N, the interleave
   group consists of RTP packets with sequence numbers from S-N to
   S-N+L inclusive.  In other words, the interleave group always
   consists of L+1 RTP packets with sequential sequence numbers.  The
   bundling value for all RTP packets in an interleave group MUST be
   the same.

   The receiver determines the expected bundling value for all RTP
   packets in an interleave group by the number of codec data frames
   bundled in the first RTP packet of the interleave group received.
   Note that this might not be the first RTP packet of the interleave
   group sent if packets are delivered out of order (or lost) by the
   underlying transport.

   On receipt of an RTP packet in an interleave group with other than
   the expected bundling value, the receiver MAY discard codec data
   frames off the end of the RTP packet or add erasure codec data
   frames to the end of the packet in order to manufacture a substitute
   packet with the expected bundling value.  The receiver MAY instead
   choose to discard the whole interleave group and play silence.

7.6 Reconstructing Interleaved Speech

   Given an RTP sequence number ordered set of RTP packets in an
   interleave group numbered 0..L, where L is the interleave value and
   B is the bundling value, and codec data frames within each RTP
   packet that are numbered in order from first to last with the
   numbers 0..B-1, the original, time-ordered sequence of output frames
   from the codec is reconstructed as follows:

   First L+1 frames:
      Frame 0 from packet 0 of interleave group
      Frame 0 from packet 1 of interleave group
      And so on up to...
      Frame 0 from packet L of interleave group

   Second L+1 frames:
      Frame 1 from packet 0 of interleave group
      Frame 1 from packet 1 of interleave group
      And so on up to...
      Frame 1 from packet L of interleave group

   And so on up to...

   Bth L+1 frames:
      Frame B-1 from packet 0 of interleave group
      Frame B-1 from packet 1 of interleave group
      And so on up to...
      Frame B-1 from packet L of interleave group

7.7 Receiving Invalid Values

   On receipt of an RTP packet with an invalid value of the NNN field,
   the RTP packet MUST be treated as lost by the receiver for the
   purpose of generating erasure frames as described in section 9.

   A codec data frame with a reserved value in the TOC field SHOULD
   also be considered invalid.  All codec frames in a packet after an
   invalid TOC field SHOULD be considered invalid.

7.8 Optimized Single Frame Format

   Optimized single frame format is designed for maximum efficiency in
   transmission of codec data with certain forms of header compression.
   Only one codec data frame is sent in each RTP packet, and there are
   no frame count or TOC field entries, or other payload header fields.
   The codec rate can be determined from the length of the codec frame,
   since there is only one codec data frame in each RTP packet of this
   type.

   If two frame types have different rates, but are expressed in the
   same number of codec frame bytes, there MUST be other signaling to
   distinguish them.  For example, the codec sender could encode the
   rate in the frame data.  This is a vocoder design issue and further
   discussion is out of the scope of this document.

   The optimized single frame RTP payload data is formatted as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       RTP Header [RTP]                        |
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |                   Only one codec data frame                   |
   |                             ....                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

7.9 Detecting Which Format

   All receivers MUST be able to process both types of packets.  The
   sender MAY choose to use one or both types of packets.

   The packets of the two types can be distinguished by checking the
   payload type field in the RTP header.  The association of payload
   type number with the packet type is done out-of-band, for example by
   [SDP] during the setup of a session.

7.10 Codec Data Frame Format

   The formats described in this section are applicable to both normal
   and optimized single frame RTP payload formats as described in
   sections 7.1 and 7.8.

   Bits are laid out as they come out of the vocoder.  This will be
   referred to as native format.  The native format for [PureVoice] is
   LSB (least significant bit) first (see example in section 7.10.1).
   The native format for [EVRC] and [SMV] is MSB (most significant bit)
   first (see example in section 7.10.2).

7.10.1 PureVoice Codec Data Frame Format

   The output of the PureVoice codec is converted into data frames for
   inclusion in the RTP payload as follows:

   The bits as numbered in the standard [PureVoice] from the highest to
   the lowest are packed into octets.  The highest numbered bit (bit
   265 for Rate 1, bit 123 for Rate 1/2, bit 53 for Rate 1/4 and bit 19
   for Rate 1/8) is placed in the most significant bit (Internet bit 0)
   of the first octet (octet 0) of the codec data frame; the second
   highest numbered bit (bit 264 for Rate 1... bit 18 for Rate 1/8) is
   placed in the second most significant bit (Internet bit 1) of the
   first octet (octet 0) of the codec data frame.  This continues until
   all of the bits have been placed in the codec data frame.  Any
   remaining unused bits of the last octet of the codec data frame MUST
   be set to zero.
   For example, the frame below shows in detail how a PureVoice Rate
   1/8 codec frame is packed into a data frame:

   The codec data frame for a Rate 1/8 frame is 20 bits long.  Bits 0
   through 19 from the standard Rate 1/8 frame are placed as indicated,
   with bits marked with "Z" being set to zero.  The Rate 1/4, 1/2 and
   full rate frames are converted similarly (with padding) to align on
   octet boundaries.

   PureVoice Rate 1/8 codec data frame (octet 0 - 2)

    0                   1                   2
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1|1|1|1|1|1|1|1|1|1| | | | | | | | | | | | | | |
   |9|8|7|6|5|4|3|2|1|0|9|8|7|6|5|4|3|2|1|0|Z|Z|Z|Z|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Internet bit 0 refers to the left-most bit of the left-most octet.
   Internet bit 1 refers to the next bit (to the right) of the
   left-most octet.  [RFC2658] discusses network byte and internet byte
   order in more detail.

7.10.2 EVRC or SMV Codec Data Frame Format

   The output of the EVRC or SMV codec is converted into data frames
   for inclusion in the RTP payload as follows:

   The bits as numbered in the standard ([EVRC] or [SMV]) from the
   lowest to the highest are packed into octets.  The lowest numbered
   bit (bit 1) is placed in the most significant bit (Internet bit 0)
   of the first octet of the codec data frame; the second lowest bit is
   placed in the second most significant bit of the first octet, the
   third lowest in the third most significant bit of the first octet,
   and so on.  This continues until all of the bits have been placed in
   the codec data frame.  Any remaining unused bits of the last octet
   of the codec data frame MUST be set to zero (note that this is only
   applicable to Rate 1 frames, as the others fit completely into a
   whole number of octets).
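The two packing rules of sections 7.10.1 and 7.10.2 differ only in the order in which the vocoder's numbered bits are visited. The sketch below shows this under stated assumptions: the function names are hypothetical, a PureVoice frame is passed as a list in which index i holds the bit numbered i in [PureVoice], and an EVRC or SMV frame is passed as a list in which index 0 holds the bit numbered 1.

```python
from typing import Sequence


def _pack_msb_first(ordered_bits: Sequence[int]) -> bytes:
    """Place bits into octets starting at Internet bit 0 of octet 0.

    Unused trailing bits of the last octet are left as zero.
    """
    out = bytearray((len(ordered_bits) + 7) // 8)
    for pos, bit in enumerate(ordered_bits):
        if bit not in (0, 1):
            raise ValueError("bits must be 0 or 1")
        if bit:
            out[pos // 8] |= 0x80 >> (pos % 8)
    return bytes(out)


def pack_purevoice(bits_by_number: Sequence[int]) -> bytes:
    """PureVoice: the highest numbered bit goes first (section 7.10.1)."""
    return _pack_msb_first(list(reversed(bits_by_number)))


def pack_evrc_smv(bits_by_number: Sequence[int]) -> bytes:
    """EVRC or SMV: the lowest numbered bit, bit 1, goes first."""
    return _pack_msb_first(list(bits_by_number))
```

A 20-bit PureVoice Rate 1/8 frame packs into 3 octets with 4 trailing zero bits, and a 171-bit EVRC or SMV Rate 1 frame packs into 22 octets with 5 trailing zero bits.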
   For example, the frames below show in detail how an EVRC or SMV Rate
   1 (full rate) codec frame is packed into a data frame:

   EVRC or SMV Rate 1 codec data frame (octet 0 - 3)

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
   |0|0|0|0|0|0|0|0|0|1|1|1|1|1|1|1|1|1|1|2|2|2|2|2|2|2|2|2|2|3|3|3|
   |1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Rate 1 codec data frame (octet 18 - 21)

    1           1                   1                   1
    4           5                   6                   7
    4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1|1| | | | | |
   |4|4|4|4|4|5|5|5|5|5|5|5|5|5|5|6|6|6|6|6|6|6|6|6|6|7|7|Z|Z|Z|Z|Z|
   |5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1|2|3|4|5|6|7|8|9|0|1| | | | | |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   The codec data frame for a Rate 1 frame is 22 octets long.  Bits 1
   through 171 from the standard Rate 1 frame are placed as indicated,
   with bits marked with "Z" being set to zero.  The Rate 1/8, 1/4, and
   1/2 frames are converted similarly but do not require zero padding
   because they align on octet boundaries.

7.11 Adding New Codecs

   Codecs that share the characteristics in section 6 can be added to
   this common format by following the steps below:

   1. Register a new MIME type.

   2. In the MIME type registration, specify that when transported in
      RTP, this common format is used.

   3. Provide a mapping of TOC value to rate and frame size of the
      codec payload (as shown in section 7.2).

8 Tardy Packets

   Assume that the receiver has begun playing frames from an interleave
   group.  The time has come to play frame x from packet n of the
   interleave group.
Further assume that packet n of the interleave group has not been received. Now, assume that packet n of the interleave group arrives before frame x+(L+1) of that packet is needed. Receivers SHOULD use frame x+(L+1) of the newly received packet n rather than substituting an erasure frame. In other words, just because packet n wasn't available the first time it was needed to reconstruct the interleaved speech, the receiver SHOULD NOT assume that the packet is not available when the same packet is subsequently needed for interleaved speech reconstruction.

9 Lost Packets

Codecs transported using this format support the notion of erasure frames. These are frames that for whatever reason are not available. When reconstructing interleaved speech or playing back non-interleaved speech, erasure frames MUST be fed to the codec for all missing packets.

Receivers MAY use the timestamp clock to determine how many codec data frames are missing. Each codec data frame advances the timestamp clock EXACTLY 160 counts for the defined vocoders (section 7.10). Since the bundling/interleaving value can vary, the timestamp clock is the only reliable way to calculate exactly how many codec data frames are missing when a packet is dropped.

Specifically, when reconstructing interleaved speech, a missing RTP packet in the interleave group SHOULD be treated as containing B erasure codec data frames, where B is the bundling value for that interleave group.

10 Implementation Issues

10.1 Interleaving Length

All wireless codecs interpolate the missing speech content when given an erasure frame. However, consecutive erasure frames reduce the listener's perception of voice quality. This makes interleaving preferable to bundling, as it increases speech quality in the presence of lost packets. On the other hand, interleaving can greatly increase the end-to-end delay.
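As a non-normative sketch of the timestamp arithmetic described in section 9, the number of missing codec data frames can be derived from the gap between the timestamp the receiver expected next and the timestamp of the packet that actually arrived. The function name missing_frames, its arguments, and the handling of 32-bit timestamp wraparound are assumptions of this sketch; it covers the non-interleaved case only.

```python
TS_PER_FRAME = 160      # timestamp counts per codec data frame (section 9)
TS_MODULUS = 1 << 32    # RTP timestamps are 32 bits wide and wrap around


def missing_frames(expected_ts, received_ts):
    """Return how many codec data frames are missing.

    `expected_ts` is the timestamp the receiver expected for the next
    frame; `received_ts` is the timestamp of the packet that arrived.
    Both names are hypothetical.  A reordered (older) packet would show
    up as a very large gap, so callers are assumed to filter those out
    before calling this function.
    """
    # Modular subtraction keeps the result correct across wraparound.
    gap = (received_ts - expected_ts) % TS_MODULUS
    if gap % TS_PER_FRAME != 0:
        raise ValueError("timestamp gap is not a whole number of frames")
    return gap // TS_PER_FRAME
```

A receiver following this sketch would feed that many erasure frames to the codec before playing the first frame of the newly received packet.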

Where an interactive session is desired, an interleaving value (field LLL) of 0 to 2 is RECOMMENDED. When end-to-end delay is not a concern, an interleaving value (field LLL) of 4 or 5 is RECOMMENDED, subject to the maxinterleave parameter. See the description of this parameter in section 13.2.

The parameters maxptime (which bounds the bundling value) and maxinterleave, fixed at the initial setup of the session, guarantee that the receiver can allocate a well-known amount of buffer space at the beginning of the session that will be sufficient for all future reception in that session. Less buffer space could be needed at some point in the future if the sender decreases the bundling value or interleaving value, but never more buffer space. This prevents the receiver from needing to allocate more buffer space (with the possible result that none is available).

11 Security Considerations

RTP packets using the payload format defined in this specification are subject to the security considerations discussed in the RTP specification [RTP], and any appropriate profile (for example, [PROFILE]). This implies that confidentiality of the media streams can be achieved by encryption. Because the data compression used with this payload format is applied end-to-end, encryption can be performed after compression, so there is no conflict between the two operations.

A potential denial-of-service threat exists for data encodings using compression techniques that have non-uniform receiver-end computational load. The attacker can inject pathological datagrams into the stream which are complex to decode and cause the receiver to be overloaded. However, this encoding does not exhibit any significant non-uniformity.

As with any IP-based protocol, in some circumstances, a receiver can be overloaded simply by the receipt of too many packets, either desired or undesired.
Network-layer authentication can be used to discard packets from undesired sources, but the processing cost of the authentication itself might be too high. In a multicast environment, pruning of specific sources might be implemented in future versions of IGMP [IGMP] and in multicast routing protocols to allow a receiver to select which sources are allowed to reach it.

12 Real Time and Storage Mode

12.1 RTP Mode

RTP mode is used to transmit codec frames in a real-time, interactive fashion (as opposed to playing a static stored file, described in section 12.2). RTP mode uses RTP headers with SDP negotiation (section 14) to describe the MIME media type and the RTP ptype format. Speech frames lost in transmission and non-received frames MUST be played out as erasure frames (see definition in Section 9) to keep synchronization with the original media.

12.2 Storage Mode

Storage mode is used for storing speech frames, for example, as a file, email attachment, or web link. When stored as a file, the first few octets of the file are a "magic number" that identifies the file. See sections 13.1.1, 13.1.2 and 13.1.3 for EVRC, SMV and PVC respectively for more details. All files are stored in normal mode groups (section 7.1). It is optional for the application to translate between normal mode format and optimized mode format.

The codec data frames are stored in groups, preceded by group header information identical to the payload header information specified in section 7. That is, the R, LLL, NNN, TOC entries, etc. are present. Since there is no RTP header, and hence no timestamp, packets must be in order.

Following the magic number octets, the file is formatted as follows:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+=+
   |R|R|0|0|0|0|0|0|R|R|Frame Count|  TOC  |  ...  |  TOC  |padding|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |       one or more codec data frames, one per TOC entry        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The meaning of the fields is specified in section 7.1. The LLL and NNN fields MUST both be zero. The format of the frames, including any padding, is identical to the normal mode specified in 7.1.

This format, while more complex than other designs, makes it easy for an implementation to receive speech frames using RTP and store them, more or less as-is, in a file. Conversely, it is simple for an implementation to read frames out of a file and transmit them using RTP.

Speech frames lost in transmission and non-received frames MUST be stored as erasure frames (see definition in Section 9) to keep synchronization with the original media.

13 IANA Considerations

This document registers three new MIME media types. The registration forms appear below.

The MIME media type name for each supported codec is allocated from the IETF tree since the PureVoice and EVRC codecs are already widely deployed, and SMV is expected to be a widely used codec for voice-over-IP applications. The RTP format is described previously (see sections 7.1 and 7.8).

13.1 Registration of MIME Media Type

13.1.1 audio/EVRC Media Type Registration

Media Type Name: audio

Media Subtype Name: EVRC

Required Parameters: none

Optional Parameters:
   ptype: See Section 13.2.
   maxptime: See Section 13.2.
   maxinterleave: See Section 13.2.

Optional parameters for storage mode: none

Encoding considerations for RTP mode: see Section 13.2.

Encoding considerations for storage mode: see Section 13.2.

Security considerations: see Section 11.

Public specification: This document.

Additional information for storage mode (see also section 12.2):
   Magic number (network byte order): ASCII character string "#!EVRC\n", that is, 0x2321455652430a in hexadecimal.
   File extensions: EVC, evc
   Macintosh file type code: not specified
   Object identifier or OID: none

Intended usage: COMMON. It is expected that many VoIP applications (as well as mobile applications) will use this type.

Person & email address to contact for further information: The authors of this document.

Author/Change controller: The IESG.

13.1.2 audio/SMV Media Type Registration

Media Type Name: audio

Media Subtype Name: SMV

Required Parameters: none

Optional Parameters:
   ptype: See Section 13.2.
   maxptime: See Section 13.2.
   maxinterleave: See Section 13.2.

Optional parameters for storage mode: none

Encoding considerations for RTP mode: see Section 13.2.

Encoding considerations for storage mode: see Section 13.2.

Security considerations: see Section 11.

Public specification: This document.

Additional information for storage mode (see also section 12.2):
   Magic number (network byte order): ASCII character string "#!SMV\n", that is, 0x2321534d560a in hexadecimal.
   File extensions: smv, SMV
   Macintosh file type code: not specified
   Object identifier or OID: none

Intended usage: COMMON. It is expected that many VoIP applications (as well as mobile applications) will use this type.

Person & email address to contact for further information: The authors of this document.

Author/Change controller: The IESG.

13.1.3 audio/qcelp-common Media Type Registration

Media Type Name: audio

Media Subtype Name: qcelp-common

Required Parameters: none

Optional Parameters:
   ptype: See Section 13.2.
   maxptime: See Section 13.2.
   maxinterleave: See Section 13.2.

Optional parameters for storage mode: none

Encoding considerations for RTP mode: see Section 13.2.

Encoding considerations for storage mode: see Section 13.2.

Security considerations: see Section 11.

Public specification: This document.

Additional information for storage mode (see also section 12.2):
   Magic number (network byte order): ASCII character string "#!PVC\n", that is, 0x23215056430a in hexadecimal.
   File extensions: pvc, PVC
   Macintosh file type code: not specified
   Object identifier or OID: none

Intended usage: COMMON. It is expected that many VoIP applications (as well as mobile applications) will use this type.

Person & email address to contact for further information: The authors of this document.

Author/Change controller: The IESG.

13.2 Optional Media Type Parameters

These parameters are applicable to all three media types and subtypes described above.

Optional parameters for RTP mode:

   ptype: Indicates the type of RTP/media subtype packet. The default value is 1. Valid values are 1 or 2. Ptype value 1 indicates normal format (see section 7.1), while ptype value 2 indicates optimized header compressed codec format (see section 7.8).

   maxptime: The maximum amount of media which can be encapsulated in each packet, expressed as time in milliseconds. The time SHALL be calculated as the sum of the time the media present in the packet represents. The time SHOULD be a multiple of the frame size. If not signaled, the default maxptime value is 200 ms.

   maxinterleave: Maximum number for the interleaving value. The interleaving values used in the entire session MUST NOT exceed this maximum value. If not signaled, the default maxinterleave value is 5.

Optional parameters for storage mode: none

Encoding considerations for RTP mode: see Section 7, in particular Sections 7.3 and 7.4, of this document.

Encoding considerations for storage mode: Storage mode is identical to RTP mode. A stored file is made up of essentially multiple RTP packets without the RTP, UDP, etc. headers.
Normal (type 1) encoded speech frames MUST be stored in RTP sequence number order. Furthermore, missing frames and non-received frames during non-speech periods MUST be encapsulated into a compound codec payload as blank frames or erasures. Each receiving entity that accepts this MIME type MUST be able to decode all codec coding modes. For normal codec frames, bundling and interleaving information is included in each grouping.

Security considerations: see Section 11.

Public specification: This document.

Intended usage: COMMON. It is expected that many VoIP applications (as well as mobile applications) will use this type.

Person & email address to contact for further information: The authors of this document.

Author/Change controller: The IESG.

14 Mapping to SDP Parameters

Please note that this section applies to packets transmitted using RTP.

Parameters are mapped to [SDP] as usual.

Example usage in SDP, for the PureVoice vocoder run in normal format:

   m=audio 49120 RTP/AVP 97
   a=rtpmap:97 qcelp-common
   a=fmtp:97 ptype=1; maxptime=80 ms

Example usage in SDP, for the SMV vocoder run in optimized single frame format:

   m=audio 49120 RTP/AVP 98
   a=rtpmap:98 SMV
   a=fmtp:98 ptype=2; maxptime=20 ms

Since all optimized single frames (ptype = 2) for the currently supported vocoders are 20 ms long, maxptime MUST be 20 ms. If a new vocoder is added with a different frame duration, maxptime for that vocoder MUST equal the vocoder's frame time.

15 Acknowledgements

This document heavily borrows from "RTP Payload Format for PureVoice(tm) Audio" by Kyle McKay (RFC 2658, August 1999). Material has also been used from "An RTP Payload Format for EVRC Speech", Adam Li (editor), a work in progress. The authors and others who contributed to these two documents made this document possible.

The authors thank the following colleagues for contributing to this document: Rusty Sanders, Trevor Bourget, Eric Rosen, Harleen Gill, Kirti Gupta.

16 References

[PureVoice] TIA/EIA/IS-733, "High Rate Speech Service Option for Wideband Spread Spectrum Communication Systems", January 1997. May be ordered online at http://www.eia.tia.org/eng.

[EVRC] TIA/EIA/IS-127, "Enhanced Variable Rate Codec, Speech Service Option 3 for Wideband Spread Spectrum Digital Systems", January 1997.

[SMV] TIA/EIA/IS-893, "Selectable Mode Vocoder", August 2001, published as PNSP-4575.

[RTP] Schulzrinne, H., Casner, S., Frederick, R. and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", RFC 1889, January 1996.

[KEYWORDS] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.

[PROFILE] Schulzrinne, H., "RTP Profile for Audio and Video Conferences with Minimal Control", RFC 1890, January 1996.

[RFC 2658] McKay, K., "RTP Payload Format for PureVoice(tm) Audio", RFC 2658, August 1999.

[SDP] Handley, M. and V. Jacobson, "SDP: Session Description Protocol", RFC 2327, April 1998.

[IGMP] Deering, S., "Host Extensions for IP Multicasting", STD 5, RFC 1112, August 1989.

17 Authors' Addresses

Magdalena L. Espelien
QUALCOMM Incorporated
5775 Morehouse Drive
San Diego, CA 92121-1714
USA
Phone: +1 858 651-6733
Email: magda@qualcomm.com

Randall Gellens
QUALCOMM Incorporated
5775 Morehouse Drive
San Diego, CA 92121-1714
USA
Phone: +1 858 651-5115
Email: rg+ietf@qualcomm.com