Internet Draft                                                 T. Hager
January 2003                                         Dolby Laboratories
Expires: June 2003                                             J. Flaks
                                                  Microsoft Corporation

                  RTP Payload Format for AC-3 Streams

Status of this Memo

This document is an Internet-Draft and is in full conformance with all provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF), its areas, and its working groups. Note that other groups may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [1].

Abstract

This document describes an RTP payload format for transporting AC-3 encoded audio data. AC-3 is a high quality multichannel audio coding system used in ATSC HDTV, DVD, film, and other media. The RTP payload format presented in this document provides mechanisms for interleaving redundant data, which can increase packet loss resilience. An intelligent method for fragmenting AC-3 frames that exceed the maximum transfer unit (MTU) is also described.

0. Change Log for

This is the first draft submission with Todd Hager replacing Jason Flaks as author. Most changes are editorial and are not detailed in this log. The single technical change in this revision was made to the NDU field described in 2.1.1.
Previously, the NDU was designated as the high 4 bits of a byte with the low 4 bits marked as reserved. Now the NDU is an 8-bit number so that no bit shifting is required to process it.

1. Introduction

AC-3 is a high quality audio codec designed to encode multiple channels of audio into a low bit-rate format. AC-3 achieves its large compression ratios by encoding a multiplicity of channels as a single entity. Dolby Digital, which is a branded version of AC-3, encodes up to 5.1 channels of audio.

AC-3 has been adopted as an audio compression scheme for many consumer and professional applications. It is the mandatory codec for DVD-video, ATSC digital terrestrial television, laser disc, and DVD-audio (as an optional multichannel audio format). AC-3 is also a common audio format for film.

It is likely that people will wish to stream AC-3 data over IP networks. Applications for streaming AC-3 range from video on demand to multichannel Internet radio. RTP provides a mechanism for stream synchronization and hence serves as the best transport solution for AC-3, which is a codec primarily used in audio-for-video applications. The RTP payload described in this document also provides a method of ensuring a continuous high quality AC-3 stream.

1.1 Overview of AC-3

AC-3 can deliver up to 5.1 channels of audio at data rates approximately equal to half of one PCM channel [2], [6], [7]. The ".1" refers to a band-limited optional low-frequency enhancement channel. AC-3 was designed for signals sampled at rates of 32, 44.1, or 48 kHz. Data rates can vary between 64 kbps and 640 kbps depending on the number of channels and the desired quality.

AC-3 exploits psychoacoustic phenomena that reveal large amounts of inaudible information contained in a typical audio signal. Substantial data reduction occurs via the removal of all inaudible information contained in an audio stream.
Source coding techniques are further used to reduce the data used to code an audio signal.

Like most perceptual coders, AC-3 operates in the frequency domain. A 512-point TDAC transform is taken with 50% overlap, providing 256 new frequency samples. Frequency samples are then converted to exponents and mantissas. Exponents are differentially encoded. Each mantissa is allocated a varying number of bits depending on the audibility of the spectral component associated with it. Audibility is determined via a masking curve. Bits for mantissas are allocated from a global bit pool.

1.2 AC-3 Bitstream

AC-3 bitstreams are organized into synchronization frames. Each AC-3 sync frame contains a Sync Information (SI) field, a Bit Stream Information (BSI) field, and 6 audio blocks (AB), each representing 256 PCM samples for each channel. The entire frame represents a time duration of 1536 PCM samples across all coded channels (32 ms at a 48 kHz sample rate) [2]. Figure 1 shows the AC-3 frame format.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |SI |BSI|  AB0  |  AB1  |  AB2  |  AB3  |  AB4  |  AB5  |AUX|CRC|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                      Figure 1. AC-3 Frame Format

The Synchronization Information field contains information needed to acquire and maintain synchronization. The Bit Stream Information field contains parameters that describe the coded audio service [2]. Each audio block also contains fields that determine the usage of block switching, dither, dynamic range control, coupling, and exponent strategy. Figure 2 shows the format of an AC-3 audio block.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Block  |Dither |Dynamic    |Coupling |Coupling    |Exponent   |
   | switch |Flags  |Range Ctrl |Strategy |Coordinates |Strategy   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | Exponents          | Bit Allocation       | Mantissas         |
   |                    | Parameters           |                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                   Figure 2. AC-3 Audio Block Format

2. RTP AC-3 Payload Format

According to [5], RTP payload formats should contain an integral number of application data units (ADUs). In the context of audio compression algorithms an ADU typically refers to a codec frame. In this case an ADU shall be equivalent to an AC-3 sync frame. Hence each RTP packet will contain an integral number of AC-3 frames unless the AC-3 frame size exceeds the maximum transfer unit (MTU) of the underlying network. If a large AC-3 frame requires fragmentation before transmission within an RTP packet, section 2.2 provides guidelines for creating partial frames.

To compensate for the possibility of lost packets, each RTP packet may contain redundant audio information in addition to, or instead of, the primary AC-3 data payload. These redundant data may exactly replace lost audio data in response to a request for retransmission. Alternatively they may represent a constant delayed secondary stream of lower-quality, lower-bandwidth audio that a receiver may use as a substitute for lost primary data. Section 2.3 describes the possible types of redundant data in detail.

2.1 RTP Header Extension

2.1.1 Main Header Extension

The following header extension shall be at the front of every AC-3 RTP payload. The primary purpose of this main header is to indicate the number of frames or fragments present in the packet. The term "Data Unit" is used to refer to AC-3 data, whether a full frame or a fragment.
Figure 3 shows the format of this header extension.

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |      NDU      |
   +-+-+-+-+-+-+-+-+

   Figure 3. AC-3 RTP Payload Header Extension

Number of data units (NDU): An 8-bit field used to indicate the number of AC-3 frames or fragments present in the RTP payload.

2.1.2 Data Unit Header Extension

The following header should be in front of each audio data unit (i.e. AC-3 frame or fragment) present in the RTP packet. The fields should aid in handling redundant data and fragmented AC-3 frames.

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |TYP|F|B| RDT |T|
   +-+-+-+-+-+-+-+-+

   Figure 4. Audio Data Unit Header

Type field (TYP): This 2-bit field specifies the type of data associated with this header, which can be AC-3 data, AC-3 data plus redundant data, or redundant data alone. Table 1 shows the settings for each type of data.

   00 - AC-3 data
   01 - AC-3 data + redundant data
   10 - redundant data
   11 - reserved

   Table 1. TYP field values

Fragment bit (F): This bit is set to 1 if the corresponding data unit is an AC-3 fragment. (Open issue: should this bit be set if TYP is 01 or 10?)

Block 0 Bit (B): This bit is set to 1 if the packet contains an AC-3 fragment consisting of the first 5/8ths of the frame, which is guaranteed to contain blocks 0 and 1. If an AC-3 fragment is received with the B bit not set and the previous fragment was lost, then the frame is useless and can be discarded. However, if the first fragment is received and the later fragment is lost, block 1 can be repeated to complete the frame.

Redundant Data field (RDT): This 3-bit field indicates the type of redundant data associated with the frames. The following table shows the various settings for each type of redundant data.

   000 - Full frame/Lower bit rate
   001 - Full frame/Lower bit rate/Fewer channels
   010 - 5/8ths fragment
   011 - 3/8ths fragment
   100 - 5/8ths fragment/Lower bit rate
   101 - 3/8ths fragment/Lower bit rate
   110 - 5/8ths fragment/Lower bit rate/Fewer channels
   111 - 3/8ths fragment/Lower bit rate/Fewer channels

   Table 2. RDT field values

Time Code Bit (T): This bit is set to 1 if the AC-3 data contains time code.

Figure 5 shows how a full AC-3 RTP payload format should appear.

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      NDU      |TYP|F|B| RDT |T|           AC-3 Frame(1)           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                 +-+-+-+-+-+-+-+-+-+
   |                                                 | Redundant data  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |TYP|F|B| RDT |T|                   AC-3 Frame(2)                   |
   +-+-+-+-+-+-+-+-+                                 +-+-+-+-+-+-+-+-+-+
   |                                                 | Redundant data  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |TYP|F|B| RDT |T|                   AC-3 Frame(N)                   |
   +-+-+-+-+-+-+-+-+                                 +-+-+-+-+-+-+-+-+-+
   |                                                 | Redundant data  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                    Figure 5. Full AC-3 RTP payload

2.2 Fragmentation of AC-3 Frames

The size of AC-3 frames remains constant throughout the encoding of a particular piece of audio, but the frame size can vary depending upon the sample rate of the uncompressed audio and the compression rate applied by the encoder. According to Table 5.13 in [2], AC-3 frame sizes range from a minimum of 128 bytes to a maximum of 3840 bytes.

AC-3 frame sizes may be large enough to require fragmentation before transmission within an RTP packet. For example, an audio file sampled at 32 kHz and compressed with a desired bit rate of 640 kbps would produce AC-3 frames of 3840 bytes each. This exceeds the standard 1500 byte MTU of an Ethernet network and the 1492 byte MTU of the PPPoE protocol.
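
As an illustration of the frame sizes quoted above, the following Python sketch derives the size of an AC-3 frame from the bit rate and sample rate, using the 1536 samples per frame given in section 1.2. It is not part of the payload format. The function names and the default 1500-byte MTU are assumptions made for this example, and the MTU comparison ignores IP, UDP, and RTP header overhead. Rates for which the result is not a whole number of bytes (for example 44.1 kHz) are rejected rather than modeled.

```python
SAMPLES_PER_FRAME = 1536  # 6 audio blocks x 256 PCM samples (section 1.2)


def frame_size_bytes(bit_rate_bps: int, sample_rate_hz: int) -> int:
    """Return the AC-3 frame size in bytes for a constant bit rate.

    Hypothetical helper: frame duration is 1536 / sample_rate seconds,
    so the frame holds bit_rate * 1536 / sample_rate bits.
    """
    total_bits = bit_rate_bps * SAMPLES_PER_FRAME
    divisor = 8 * sample_rate_hz
    if total_bits % divisor != 0:
        # Not a whole number of bytes (e.g. 44.1 kHz); out of scope here.
        raise ValueError("frame size is not a whole number of bytes")
    return total_bits // divisor


def needs_fragmentation(frame_bytes: int, mtu_bytes: int = 1500) -> bool:
    """True if the frame alone is larger than the MTU (headers ignored)."""
    return frame_bytes > mtu_bytes
```

With these definitions, frame_size_bytes(640000, 32000) gives the 3840-byte frame of the example above, which needs_fragmentation reports as exceeding a 1500-byte MTU.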

In [3] it is specified that fragmentation should not be left to the IP layer, but instead should be handled by the application itself.

AC-3 frames were designed to accommodate memory buffers smaller than an entire AC-3 frame. For this reason, each AC-3 frame contains two 16-bit CRC words. CRC1 is contained in the synchronization information (SI) header located at the beginning of each AC-3 frame. CRC1 is the second 16-bit word of the frame. Figure 6 shows the structure of the SI header.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |           SYNC WORD           |             CRC1              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |FSC| FRMSIZECD |
   +-+-+-+-+-+-+-+-+

             Figure 6. Synchronization Information header

CRC2 is the last 16-bit word of an AC-3 frame as shown in Figure 1. CRC1 applies to the first 5/8ths of the frame excluding the sync word. CRC2 covers the remaining 3/8ths of the frame as well as the entire frame (excluding the sync word). All AC-3 encoders enforce specific block size restrictions that guarantee blocks 0 and 1 are completely covered by CRC1 [2]. Ensuring that blocks 0 and 1 are in the first 5/8ths of the frame is necessary because block 0 contains information that is shared with the remaining 5 blocks. The dual CRC allows decoders to immediately begin processing block 0 when the 5/8ths point is reached.

This 5/8ths split in all AC-3 frames, which was intended for the possibility of smaller input buffers (assuming a guaranteed transport stream such as S/PDIF), provides a very logical fragmentation unit. Using the 5/8ths point provides two possible gains over arbitrary fragmentation:

1) Using 5/8ths fragmentation, if the second fragment is dropped, the first fragment can still be decoded by an AC-3 decoder.
Block 1 will be repeated in place of any missing blocks lost in the second fragment.

2) In closed networks with no QoS problems, it may be possible to use smaller buffers, as was intended in the original design of the 5/8ths split.

In [2] the 5/8ths point is defined as:

   5/8-framesize = truncate(framesize/2) + truncate(framesize/8)

According to Table 7.34 in [2], 5/8ths frame sizes can range from 80 bytes to 2400 bytes. Hence there are still instances where the 5/8ths boundary may exceed the MTU of the underlying network. In an Ethernet network this would be rare because the majority of AC-3 data publicly available is sampled at 48 kHz and is encoded at a data rate of 384 kbps or 448 kbps. This provides a 5/8ths point of 960 bytes and 1120 bytes respectively, which would be less than the MTU of a typical Ethernet network. In the rare instances where even the 5/8ths point exceeds the MTU, AC-3 frames should be arbitrarily fragmented to a length that is less than the MTU.

It should be noted that using 5/8ths fragmentation in terms of smaller buffer sizes is only useful in networks where the inter-arrival jitter is less than the time needed to decode blocks 0 and 1 of the AC-3 stream and play the uncompressed audio:

   JitterLimit = DecodeTime(Block 0 & 1) + 2 * (256/Fs)

where Fs is the sample rate of the uncompressed audio.

2.3 Data Resiliency

This section provides information on how to encapsulate redundant data into an RTP payload to improve the likelihood that all of the AC-3 data being sent is received. There are several types of redundant data that can be sent, which are defined in section 2.1.2 and specified for each data unit in the data unit header. The various types of redundant data are further discussed in the following sections. As a general rule, redundant data of any type should never repeat primary audio information from the same RTP payload.
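
The data unit header referred to above is the single octet of Figure 4. As a minimal sketch, and assuming that bit 0 in Figure 4 is the most significant bit of the octet (the usual convention in RTP diagrams, but not stated in this document), it could be packed and unpacked as follows. The DataUnitHeader type and the function names are hypothetical and are not defined by this payload format.

```python
from typing import NamedTuple


class DataUnitHeader(NamedTuple):
    """Fields of the one-octet data unit header (Figure 4)."""
    typ: int          # 2 bits, see Table 1
    fragment: bool    # F bit: data unit is an AC-3 fragment
    block0: bool      # B bit: fragment is the first 5/8ths of a frame
    rdt: int          # 3 bits, see Table 2
    time_code: bool   # T bit: AC-3 data contains time code


def pack_header(h: DataUnitHeader) -> int:
    """Pack the fields into one octet, TYP in the two high-order bits."""
    if not 0 <= h.typ <= 3 or not 0 <= h.rdt <= 7:
        raise ValueError("TYP must fit in 2 bits and RDT in 3 bits")
    return ((h.typ << 6) | (int(h.fragment) << 5) | (int(h.block0) << 4)
            | (h.rdt << 1) | int(h.time_code))


def unpack_header(octet: int) -> DataUnitHeader:
    """Split one octet into the fields of Figure 4 (assumed MSB first)."""
    if not 0 <= octet <= 0xFF:
        raise ValueError("header must be a single octet")
    return DataUnitHeader(
        typ=(octet >> 6) & 0x3,
        fragment=bool((octet >> 5) & 0x1),
        block0=bool((octet >> 4) & 0x1),
        rdt=(octet >> 1) & 0x7,
        time_code=bool(octet & 0x1),
    )
```

Under this bit ordering, a header with TYP 01, F and B set, RDT 010, and T clear packs to the octet 0x74, and unpack_header(0x74) recovers the same fields.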

2.3.1 Lower Bit Rate Data

Transmitting redundant AC-3 frames encoded at a lower data rate (and thus quality) than the primary AC-3 audio is an obvious means of enhancing data resiliency. This approach allows the AC-3 decoder to decode the lower quality frame if the packet containing the high-quality audio is dropped, lost, or arrives with errors.

However, the lower quality data may still require an undesirably large amount of bandwidth. Also, the redundant data can be of extremely low quality, especially in cases where large numbers of channels are being transmitted. Retaining the original number of channels at a low data rate may result in sound quality that is subjectively unpleasant to listen to. Rather than simply lowering the quality of all channels to achieve reduced bandwidth, a more desirable approach may be to mix the multiple channels down into stereo and encode the resulting two-channel audio into lower bit-rate, but subjectively more listenable, AC-3 data.

2.3.2 Lower Bit Rate Data with Fewer Channels

When encoding multichannel audio, a secondary two-channel version of the audio can also be encoded at a lower bit rate. Since the audio is reduced to two channels, it is still possible to maintain high quality even at a lower bit rate. The lower bit-rate two-channel version can be interleaved with the multichannel audio, and when a packet is lost or corrupted the two-channel version can be used in its place.

2.3.3 5/8ths and 3/8ths Fragment

Another method of sending redundant data might include fragmentation of packets at the 5/8ths split and interleaving fragments from previous frames. This ensures that all data is sent twice, which decreases the likelihood of lost data.

             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   RTP(X):   | AC-3 5/8ths Fragment(n)     | AC-3 3/8ths Fragment(n-1)   |
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   RTP(X+1): | AC-3 3/8ths Fragment(n)     | AC-3 5/8ths Fragment(n)     |
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                Figure 7. 5/8ths - 3/8ths Interleaving

In addition, it is possible that one may wish to send only the 5/8ths fragment as redundant data. Since the 5/8ths fragment can be decoded on its own, it would allow for redundant data at a lower overall bit rate. However, because block repeats are used when only the first 5/8ths is present, the quality would be significantly reduced if the redundant data were to be used.

             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   RTP(X):   | AC-3 Frame(n)               | AC-3 5/8ths Fragment(n-1)   |
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   RTP(X+1): | AC-3 Frame(n+1)             | AC-3 5/8ths Fragment(n)     |
             +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                    Figure 8. Redundant 5/8ths Data

2.3.4 5/8ths Fragment Lower Bit Rate

Following the methods listed in the previous section, it may also be beneficial to send the redundant fragments at a lower bit rate. Ideally a lower bit rate version of the previous frame's 5/8ths fragment could be sent along, which would provide for a very low bit rate redundant data channel.

2.3.5 5/8ths and 3/8ths Fragment Lower Bit Rate and Fewer Channels

Combining the methods from 2.3.2 and 2.3.3, a version of the 5/8ths fragment that is lower in bit rate and is composed of fewer channels may be sent as redundant data. This provides an opportunity for low bit rate redundant data that has fewer channels but less quality degradation.
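
The redundancy layout of Figure 8 can be sketched in a few lines of Python. The sketch assumes that the framesize in the 5/8ths formula of section 2.2 is counted in 16-bit words, as in [2], and that the first packet of a stream carries no redundant data; neither point is specified by this document. All function names are hypothetical, and the header extensions of section 2.1 are omitted.

```python
from typing import List, Optional, Tuple


def five_eighths_point_bytes(frame_bytes: int) -> int:
    """5/8ths point in bytes, computed on 16-bit words (assumption)."""
    words = frame_bytes // 2
    return 2 * (words // 2 + words // 8)


def split_frame(frame: bytes) -> Tuple[bytes, bytes]:
    """Split a frame into its 5/8ths and 3/8ths fragments."""
    cut = five_eighths_point_bytes(len(frame))
    return frame[:cut], frame[cut:]


def redundant_five_eighths_schedule(
    frames: List[bytes],
) -> List[Tuple[bytes, Optional[bytes]]]:
    """Pair each frame with the 5/8ths fragment of the previous frame.

    Returns one (primary, redundant) tuple per RTP packet, following
    the layout of Figure 8. The first packet has no redundant part.
    """
    packets = []
    previous_first: Optional[bytes] = None
    for frame in frames:
        packets.append((frame, previous_first))
        previous_first, _ = split_frame(frame)
    return packets
```

For a 1536-byte frame (48 kHz at 384 kbps) the split falls at byte 960, matching the value given in section 2.2. If packet X+1 is received but packet X is lost, the receiver still holds the 5/8ths fragment of frame n and can decode it with block repeats.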

3. RTP header fields

Payload Type (PT): It is expected that the RTP profile for a particular class of applications will assign a payload type for this encoding, or alternatively a payload type in the dynamic range [96,127] shall be chosen.

Marker (M) bit: The M bit is set for the last fragment of an AC-3 frame. In instances where one or more full AC-3 frames are encapsulated in an RTP packet, the M bit will be set and the full frame itself will be considered the last fragment.

Extension (X) bit: Defined by the RTP profile used.

Timestamp: A 32-bit word that corresponds to the sampling instant for the first AC-3 frame in an RTP packet. AC-3 encodes data sampled at 32 kHz, 44.1 kHz, and 48 kHz. Fragmented frames shall maintain the same time stamp until the last fragment is sent. The starting timestamp is selected at random.

4. Types and Names

4.1 MIME type registration

MIME media type name: audio

MIME subtype name: ac3

Required parameters:

   Rate: Equal to the RTP timestamp clock rate for the particular AC-3 stream of a given RTP session. In the case of a single frame per packet, with the AC-3 stream encoded at 48 kHz at a bit rate of 384 kbps, the rate parameter would equal 32 milliseconds. In the case of interleaving the 5/8ths and 3/8ths fragments, assuming the AC-3 file was again encoded at 48 kHz with a bit rate of 384 kbps, the clock rate would need to be one half of 32 milliseconds, or 16 milliseconds.

Optional parameters:

   Channels: How many channels are present in the AC-3 stream. This will be a number between 1 and 6.

   Ptime: The length of time in milliseconds represented by the AC-3 frame(s) in the packet.

   Maxptime: The maximum amount of media which can be encapsulated in each RTP packet, expressed as time in milliseconds.

Encoding considerations: The AC-3 bitstream shall be generated according to the AC-3 specification [2].
This bitstream is binary data and MUST be encoded for non-binary transport (for email or any transport that cannot accommodate binary directly, the Base64 encoding is sufficient). This type is also defined for transfer via RTP. All RTP packets MUST be packetized using the RTP payload format described in this document.

Security considerations: see section 5 of this document

Interoperability considerations: none

Published specification: see [2]

Applications: Multichannel audio compression for audio and audio for video

Additional Information: none

Magic number(s): none

File extension(s): .ac3

Macintosh File Type Code(s): none

Object Identifier(s) or OID(s): none

Person & email address to contact for further information: Todd Hager, IETF AVT working group.

Intended Usage: COMMON

Author/Change controller:

   Author: Todd Hager, Jason Flaks
   Change Controller: IETF AVT WG

4.2 SDP usage

The encoding name when using SDP [3] SHALL be "ac3" (MIME subtype). An example of the media representation in SDP is given below.

   m=audio 49000 RTP/AVP 100
   a=rtpmap:100 ac3/48000
   a=fmtp:100 number-channels=[1-6]

5. Security considerations

In order to protect copyrighted material, certain security precautions may be necessary. The payload format described in this document is subject to the security considerations defined in [4]. The security considerations discussed in [4] imply the usage of encryption to protect the confidentiality of content. Such an encryption scheme is harmless to the encoded audio data provided the data is decrypted before being sent to the decoder.

6. Normative References

[1] Bradner, S., "Key Words for use in RFCs to Indicate Requirement Levels", RFC 2119, Internet Engineering Task Force, March 1997.

[2] U.S. Advanced Television Systems Committee (ATSC), "Digital Audio Compression (AC-3) Standard," Doc A/52, December 1995.

[3] Handley, M.
and Jacobson, V., "SDP: Session Description Protocol," RFC 2327, Internet Engineering Task Force, April 1998.

[4] Schulzrinne, Casner, Frederick, and Jacobson, "RTP: A Transport Protocol for Real-Time Applications," RFC 1889, Internet Engineering Task Force, February 1996.

[5] Handley, M. and Perkins, C., "Guidelines for Writers of RTP Payload Format Specifications," RFC 2736, Internet Engineering Task Force, December 1999.

7. Informative References

[6] Todd, C. et al., "AC-3: Flexible Perceptual Coding for Audio Transmission and Storage," Preprint 3796, Presented at the 96th Convention of the Audio Engineering Society, May 1994.

[7] Fielder, L. et al., "AC-2 and AC-3: Low-Complexity Transform-Based Audio Coding," Collected Papers on Digital Audio Bit-Rate Reduction, pp. 54-72, Audio Engineering Society, September 1996.

8. Authors' Addresses

Todd Hager
Dolby Laboratories
100 Potrero Ave
San Francisco, CA 94103
Phone: +1 415 558 0136
Email: thh@dolby.com

Jason Flaks
Microsoft Corporation
1 Microsoft Way
Redmond, WA 98052
Phone: +1 425 722 2543
Email: jasonfl@microsoft.com