Internet Engineering Task Force                 Audio Visual Transport WG
Internet-Draft                        C. Guillemot, P. Christ, S. Wesner
draft-gc-avt-mpeg4visual-00.txt             INRIA / Univ. Stuttgart - RUS
                                                            March 1, 2000
Expires: September 1, 2000

          RTP payload format for MPEG-4 Visual Advanced Profiles
                      (scalable, core, main, N-bits)

STATUS OF THIS MEMO

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

This document describes a payload format for the transport of MPEG-4
Visual Elementary Streams, applicable to multimedia applications not
restricted to the simple visual profile. It is an application of the
format described in [1] to specialized cases of video which are
applicable in the H.323 context (using, for example, MPEG-4 layered
streams). It is therefore intended to support advanced MPEG-4 visual
profiles (simple scalable, core, main, N-bits profiles), by allowing
protection against loss of key segments of the elementary streams. The
simple scalable profile supports temporal and spatial scalability,
features important for rate or congestion control on the Internet,
especially in multicast. The core and the main visual profiles target
multimedia applications on the Internet and allow, in addition to
scalability, the usage of sprite objects and of arbitrary shape
objects.
Guillemot/Christ/Wesner                                          [Page 1]

Internet-Draft   Payload format for MPEG-4 visual streams   March 1, 2000

1 Introduction

An MPEG-4 scene is composed of media objects. The MPEG-4 dynamic-scene
description framework, which defines the spatio-temporal relation of
the media objects as well as their contents, is inspired by VRML. The
compressed binary representation of the scene description is called
BIFS (Binary Format for Scenes) [2]. The compressed scene description
is conveyed through one or more Elementary Streams (ES).

A compression layer produces the compressed representations of the
audio-visual objects that will be inserted into the scene. These
compressed representations are organized into Elementary Streams (ES).
Elementary Stream Descriptors provide information relative to the
stream, such as the compression scheme used. Elementary stream data is
partitioned into Access Units. The delineation of an Access Unit is
completely determined by the compression layer that generates the
elementary stream. An Access Unit is the smallest data entity to which
timing information can be attributed. Two Access Units shall never
refer to the same point in time.

Natural and animated synthetic objects may refer to an Object
Descriptor (OD), which points to one or more Elementary Streams that
carry the coded representation of the object or its animation data. An
OD serves as a grouping of one or more Elementary Stream Descriptors
that refer to a single media object. The OD also defines the
hierarchical relations and properties of the Elementary Stream
Descriptors. The Object Descriptors are conveyed through one or more
Elementary Streams. By conveying the session (or resource) description
as well as the scene description through their own Elementary Streams,
it becomes possible to change portions of scenes and/or properties of
media streams separately and dynamically at well-known instants of
time.
In order to allow effective implementations of the standard, subsets
of the MPEG-4 Systems, Visual, and Audio tool sets have been
identified that can be used for specific applications. Profiles exist
for various types of media content (audio, visual, and graphics) and
for scene descriptions. The visual part of the standard defines five
profiles for natural video: the simple profile, the simple scalable
profile, the core profile, the main profile and the N-bits profile
[3].

Considering the visual elementary streams, an important entry point
into the elementary stream data is the VideoObjectPlane(), which
starts with corresponding configuration information. Depending on the
visual profile, different sets of parameters will be present in the
header of the VideoObjectPlane(). These parameters are essential for
configuring the decoders and are not covered by the HEC-based error
resilience mechanism.

After analysing the impact of the coding options provided by the
different profiles with respect to loss resilience, this document
specifies an RTP payload format as an application of the generic
format proposed in [1], specialized to cases of video applicable, for
example, in the H.323 context (see Recommendation H.323 Annex B). The
document defines packetization rules as well as protocol support for
the protection of key segments of MPEG-4 visual streams.

The design goals of this RTP payload format are to provide the
following:

- a unified solution for all the visual profiles, with protection
  against loss of key segments of the elementary streams.

- a solution independent of the usage or the non-usage of the MPEG-4
  OD framework.

- protection against packet loss with a protocol support easily
  adaptable to varying network conditions, for both "live" and
  "pre-recorded" visual contents.
- flexible support of a range of error control mechanisms, from no
  protection to redundant data (key segments) and FEC.

The list of key segments (VisualObjectSequence header, VisualObject
header, VisualObjectLayer header, Group_of_VideoObjectPlane header,
VideoObjectPlane header) included in the payload header as redundant
data, as well as any additional protection schemes supported, will be
announced via out-of-band signaling at the beginning of the session,
using for example SDP [4]. The protection scheme used at a specific
instant during the session will be signaled via the extension type
(XT) field in the payload header.

2 MPEG-4 visual profiles

Five profiles have been defined for natural video content [3]:

- The Simple Visual Profile provides efficient, error resilient coding
  of rectangular video objects, suitable for applications on mobile
  networks, such as PCS and IMT2000.

- The Simple Scalable Visual Profile adds support for coding of
  temporal and spatial scalable objects to the Simple Visual Profile.
  It is useful for applications which provide services at more than
  one level of quality due to bit-rate or decoder resource
  limitations.

- The Core Visual Profile adds support for coding of arbitrary-shaped
  and temporally scalable objects to the Simple Visual Profile. It is
  useful for applications such as those providing relatively simple
  content-interactivity (Internet multimedia applications).

- The Main Visual Profile adds support for coding of interlaced,
  semi-transparent, and sprite objects to the Core Visual Profile. It
  is useful for interactive and entertainment-quality broadcast and
  DVD applications.

- The N-Bit Visual Profile adds support for coding video objects
  having pixel-depths ranging from 4 to 12 bits to the Core Visual
  Profile. It is suitable for use in surveillance applications.
The profiles for synthetic and synthetic/natural hybrid visual content
are:

- The Simple Facial Animation Visual Profile provides a simple means
  to animate a face model, suitable for applications such as
  audio/video presentation for the hearing impaired.

- The Scalable Texture Visual Profile provides spatially scalable
  coding of still image (texture) objects, useful for applications
  needing multiple scalability levels, such as mapping texture onto
  objects in games, and high-resolution digital still cameras.

- The Basic Animated 2-D Texture Visual Profile provides spatial
  scalability, SNR scalability, and mesh-based animation for still
  image (texture) objects, and also simple face object animation.

- The Hybrid Visual Profile combines the ability to decode
  arbitrary-shaped and temporally scalable natural video objects (as
  in the Core Visual Profile) with the ability to decode several
  synthetic and hybrid objects, including simple face and animated
  still image objects. It is suitable for various content-rich
  multimedia applications.

3 Impact of the profiles on the MPEG-4 visual syntax

A set of error resilience tools has been defined in the MPEG-4 visual
syntax in order to recover corrupted headers [3]. In particular, the
VideoObjectPlane data is structured in video packets, the entry point
being defined by the function video_packet_header() and delimited by
resync_markers. Header information is inserted at the start of a video
packet. Contained in this header is the information necessary to
restart the decoding process (provided key parameters from the VOP
header have been correctly received). Following the quant_scale is the
Header Extension Code (HEC). HEC is a bit used to indicate whether
additional information is available. If the HEC is equal to one, then
basic configuration parameters can be inserted in the packet header.
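To make this mechanism concrete, the following sketch (Python, not
bit-exact; the selection of duplicated fields is an illustrative,
non-normative reading of [3]) shows a video packet header repeating
basic VOP configuration when HEC is set:

```python
# Illustrative sketch of the HEC idea: when the Header Extension Code
# bit is 1, the video packet header repeats basic VOP configuration so
# that the packet remains decodable even if the VOP header carrying
# the original values was lost. The field selection is non-normative.

def video_packet_header(vop: dict, hec: bool) -> dict:
    header = {
        "resync_marker": True,          # delimits the video packet
        "macroblock_number": vop["mb_no"],
        "quant_scale": vop["quant"],
        "hec": hec,                     # Header Extension Code bit
    }
    if hec:
        # Duplicated configuration, recoverable without the VOP header.
        header.update({
            "modulo_time_base": vop["modulo_time_base"],
            "vop_time_increment": vop["vop_time_increment"],
            "vop_coding_type": vop["vop_coding_type"],
        })
    return header
```

Note that the parameters HEC can repeat are only the basic ones; the
profile-specific parameters analysed in the remainder of this section
are not covered by this mechanism.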
This section analyses the parameters which can potentially be
protected by the HEC mechanism, and highlights the coding options, and
therefore the visual profiles, not addressed by the above mechanism.
Depending on the visual profile, different sets of parameters will be
present in the header of the VideoObjectPlane(). In the simple
profiles, the essential VOP header parameters are: vop_coding_type,
modulo_time_base, marker_bit, vop_time_increment, the fcodes when the
vop_coding_type is P or B, and vop_reduced_resolution if
reduced_resolution_vop_enable is equal to 1.

Let us now consider the simple scalable profile, supporting temporal
and spatial scalability. Scalable or layered coding is well suited to
rate or congestion control on the Internet, especially in multicast. A
key parameter for being able to decode a VOP in an enhancement layer
is "ref_select_code", which signals the VOP that has been taken as a
reference for the prediction.

The core and the main visual profiles target multimedia applications
on the Internet and allow the usage of sprite objects and of arbitrary
shape objects, in addition to the scalable features provided by the
simple scalable profile. Sprite decoding operates in two modes: basic
and low-latency. The low-latency mode allows the sprite to be updated,
or new pieces of the sprite to be transmitted, which can then be used
as reference information for decoding subsequent S-VOPs and for the
construction of subsequent parts of sprites. It is therefore important
to be able to protect this information by allowing the repetition of
sprite data in consecutive packets. The decoding of an arbitrary-shape
VOP requires the dimensions of its bounding rectangle, its horizontal
and vertical spatial position, as well as the shape coding type.
This information is respectively provided by the parameters
"vop_width", "vop_height", "vop_horizontal_mc_spatial_ref",
"vop_vertical_mc_spatial_ref" and "vop_shape_coding_type" in the VOP
header. The parameter "change_conv_ratio_disable" is also needed to
decode the video packet properly. The "vop_constant_alpha" parameter,
as well as "vop_constant_alpha_value" (if vop_constant_alpha == 1),
the scaling factor applied to the decoded VOP before display, also
need protection.

When scalability is applied to arbitrary shape objects, extra
parameters need to be protected. For the shape decoding these
parameters are "load_backward_shape", "backward_shape_width",
"backward_shape_height", "backward_shape_horizontal_mc_spatial_ref",
"backward_shape_vertical_mc_spatial_ref", "load_forward_shape",
"forward_shape_width", "forward_shape_height",
"forward_shape_horizontal_mc_spatial_ref" and
"forward_shape_vertical_mc_spatial_ref". Another important parameter
is "background_composition", which signals the usage of background
composition in conjunction with scalability. A final parameter
important to protect is "vop_rounding_type", which signals the
rounding mechanism used in the pixel value interpolation in motion
compensation for P- and S(GMC)-VOPs.

4 Design Considerations

The syntax of the visual bitstreams defines two types of information:
the configuration information and the elementary stream data [2]. The
configuration information includes:

- the global configuration information, referring to the whole group
  of visual objects (visualobjectsequence()),

- the object configuration information, referring to a single visual
  object (visualobject()),

- and the object layer configuration information
  (visualobjectlayer()).

Two modes of transmission of the configuration and elementary stream
information are specified. The separate mode consists in transmitting
the configuration information in "containers" provided by MPEG-4
Systems (ODs).
The combined mode consists in transmitting the configuration
information together with the elementary stream data.

The solution recommended in draft-jnb-mpeg4av-rtp-01.txt, when using
the combined mode, consists in transporting this configuration
information in separate RTP packets, and in possibly repeating the
corresponding RTP packets periodically if needed for protection
purposes. However, since this vital information is restricted to a few
bytes, transporting it in separate RTP packets leads to unnecessary
overhead. More efficient transport can be achieved by grouping this
data with elementary stream data inside packets. The same remark
applies to the Group_of_VideoObjectPlane() entry point and to its
corresponding header or configuration information.

The compression layer organizes the ES data in Access Units (AU). The
AUs are the smallest entities that can be attributed individual
timestamps. The timestamps may be obtained directly, through the ESI.
If the SLConfigDescriptor indicates that timestamps are absent, the
timestamps may be obtained indirectly, for example by using the frame
rate.

The compression layer passes full or partial Access Units, together
with indications of AU boundaries, random access points and desired
timing information, directly to the network adaptation layer or
indirectly via the sync layer. It is however preferable, for
implementation efficiency, to pass the ES data directly to the network
adaptation layer, i.e. to avoid producing the full SL packets. Partial
AUs, or typed segments, are - in terms of the encoding syntax -
syntactically and semantically meaningful parts of an AU (cf. [1],
7.2.3: "Such partial AUs may have significance for improved error
resilience").

Depending on the visual profiles, different sets of key parameters
will be present in the header of the VideoObjectPlane(), as described
above.
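These profile-dependent parameter sets can be summarized in a short,
non-normative table (Python; the grouping by profile family is an
illustrative reading of Sections 2 and 3, not part of the format):

```python
# Non-normative summary of the key VOP header parameters identified in
# Sections 2 and 3, grouped by the profile family that introduces
# them. The grouping is illustrative only.

SIMPLE_PARAMS = [
    "vop_coding_type", "modulo_time_base", "marker_bit",
    "vop_time_increment",
    "fcodes",                  # only for P- and B-VOPs
    "vop_reduced_resolution",  # if reduced_resolution_vop_enable == 1
]

SCALABLE_EXTRA = [
    "ref_select_code",         # reference VOP used for prediction
]

ARBITRARY_SHAPE_EXTRA = [
    "vop_width", "vop_height",
    "vop_horizontal_mc_spatial_ref", "vop_vertical_mc_spatial_ref",
    "vop_shape_coding_type", "change_conv_ratio_disable",
    "vop_constant_alpha", "vop_constant_alpha_value",
    "vop_rounding_type",
    # plus the backward/forward shape parameters and
    # "background_composition" when combined with scalability
]

def key_parameters(profile: str) -> list:
    """Cumulative key-parameter list for a natural-video profile."""
    params = list(SIMPLE_PARAMS)
    if profile in ("simple_scalable", "core", "main", "n_bits"):
        params += SCALABLE_EXTRA
    if profile in ("core", "main", "n_bits"):
        params += ARBITRARY_SHAPE_EXTRA
    return params
```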
Key parameters for the simple scalable, main and core profiles are not
covered by the error resilience tools defined in MPEG-4. This document
advocates the need for protection support in the packetization format
that would be applicable to the different visual profiles,
independently of the usage or non-usage of the OD framework. Although
the protection support would most benefit the simple scalable, core,
main and N-bits profiles, and therefore a large range of multimedia
applications, it is also applicable to simple videotelephony and
videoconferencing applications relying on the simple profile (despite
the existence of the HEC mechanism).

Including the redundancy in the payload header, instead of inserting
it at the level of the video packet, brings more flexibility in the
insertion of redundancy (avoiding, for example, parsing the different
video packets in the RTP packets) and in the adaptation of the level
of redundancy to the network characteristics, especially in the case
of pre-encoded streams. Several video packets can be transmitted in
the same RTP packet.

The payload format also specifies a mechanism for grouping an AU or a
partial AU together with protection data (redundant data, FEC). This
mechanism makes it possible to adapt the protection of the different
partial AUs to varying network conditions during the session.
Consecutive segments (e.g. video packets [3]) of the same type will be
packed consecutively in the same RTP payload without using the
grouping mechanism. The compression layer should provide partial AUs
of a size small enough that the resulting RTP packet fits the MTU
size. RTP packets that transport fragments belonging to the same AU
will have their RTP timestamp set to the same value.

5 Payload Format specification

The packet will consist of an RTP header followed by possibly multiple
payloads.
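The fragmentation and timestamp rules above, together with the marker
bit rule of Section 5.1, can be sketched as follows (illustrative
Python; the MTU_PAYLOAD value and the triple representation are
assumptions, not part of the format):

```python
# Illustrative sketch: split one Access Unit into RTP-sized fragments.
# Every fragment of the same AU carries the same RTP timestamp, and
# the marker bit is set only on the packet carrying the last fragment
# (Section 5.1).

MTU_PAYLOAD = 1400  # assumed media bytes left after RTP/payload headers

def fragment_access_unit(au: bytes, timestamp: int):
    """Return (timestamp, marker, fragment) triples for one AU."""
    packets = []
    for offset in range(0, len(au), MTU_PAYLOAD):
        fragment = au[offset:offset + MTU_PAYLOAD]
        marker = offset + MTU_PAYLOAD >= len(au)  # last fragment?
        packets.append((timestamp, marker, fragment))
    return packets
```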
5.1 RTP Header Usage

Each RTP packet starts with a fixed RTP header. The following fields
of the fixed RTP header are used:

- Marker bit (M bit): The marker bit of the RTP header is set to 1
  when the current packet carries the end of an Access Unit (AU), or
  the last fragment of an AU.

- Payload Type (PT): The payload type shall be set to a value assigned
  to this format, or a payload type in the dynamic range should be
  chosen.

- Timestamp: The RTP timestamp is set to the composition timestamp
  (CTS), if its presence is indicated by the SLConfigDescriptor and if
  its length is not more than 32 bits. Otherwise, i.e. if the CTS is
  not present or when not using the OD framework, the RTP timestamp
  should be set to the sampling instant of the first AU contained in
  the packet. The RTP timestamp in this case encodes the presentation
  time of the first AU contained in the packet. The RTP timestamp may
  be the same on successive packets if an AU occupies more than one
  packet. If the packet contains only 'extension' data objects (see
  below), then the RTP timestamp is set to the value of the
  presentation time of the AU to which the first extension data object
  (e.g. FEC or redundant data) applies.

- SSRC: A mapping between the ES identifiers and the SSRCs should be
  provided via out-of-band signaling (e.g. SDP).

5.2 Payload Header

The payload header is always present, with a variable length, and is
defined as follows:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|G|E|    XT     |          LENGTH         |EBITS|               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
.                                                               .
.                         Extension data                        .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|G|E|0|   res   |          LENGTH         |                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                     |
.                                                               .
.                         Media Payload                         .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     Figure 1: RTP payload format.

G (Group) (1 bit): If this field is 1, it indicates that the object
associated with the current header is followed by another object.

E (Extension) (1 bit): If its value is 1, then the next object
contains extension data. If its value is 0, then the next object
contains AU data (a full AU or a partial AU).

LENGTH (13 bits): This field specifies the length in bytes of the next
object. If the object is the last object of the payload (G=0), then
this field is not present.

EBITS (3 bits): Indicates the number of bits that shall be ignored in
the last byte of the extension data. If the next object is an AU or
partial AU object (E=0), then this field is not present. If the object
is the last object of the payload (G=0), then this field is not
present.

res (Reserved) (6 bits): This field is only present if the E field is
0, resulting in always 1 byte for {G,E=1,XT} or {G,E=0,res}.

XT (Extension type) (6 bits): This field is only present if E is set
to 1. It then specifies the type of extension data. Examples of types
are the different headers (VisualObjectSequence header, VisualObject
header, VisualObjectLayer header, Group_of_VideoObjectPlane header,
VideoObjectPlane header) or possibly FEC data with the specification
of the FEC coding scheme (parity codes, block codes such as
Reed-Solomon codes, ...).

6 Multiplexing

MPEG-4 applications can involve a large number of ESs, and thus also a
large number of RTP sessions. A multiplexing scheme allowing selective
bundling of ESs may therefore be necessary for some applications. The
multiplexing problem is outside the scope of this payload format.

7 Security Considerations

RTP packets transporting information with the proposed payload format
are subject to the security considerations discussed in the RTP
specification [5].
This implies that confidentiality of the media streams is achieved by
encryption. If the entire stream (extension data and AU data) is to be
secured, and all the participants are expected to have the keys to
decode the entire stream, then the encryption is performed in the
usual manner, and there is no conflict between the two operations
(encapsulation and encryption).

The need for a portion of the stream (e.g. extension data) to be
encrypted with a different key, or not to be encrypted, would require
application-level signaling protocols to be aware of the usage of the
XT field, and to exchange keys and negotiate their usage on the media
and extension data separately.

8 Authors' Addresses

Christine Guillemot
INRIA
Campus Universitaire de Beaulieu
35042 RENNES Cedex, FRANCE
email: Christine.Guillemot@irisa.fr

Paul Christ
Computer Center - RUS
University of Stuttgart
Allmandring 30
D-70550 Stuttgart, Germany
email: Paul.Christ@rus.uni-stuttgart.de

Stefan Wesner
Computer Center - RUS
University of Stuttgart
Allmandring 30
D-70550 Stuttgart, Germany
email: Paul.Christ@rus.uni-stuttgart.de

9 References

[1] C. Guillemot, P. Christ, S. Wesner, A. Klemets, "RTP Payload
    format for MPEG-4 with scaleable and flexible error resiliency",
    draft-guillemot-avt-genrtp-02.txt, March 2000.

[2] ISO/IEC 14496-1 FDIS, "MPEG-4 Systems", November 1998.

[3] ISO/IEC 14496-2 FDIS, "MPEG-4 Visual", November 1998.

[4] M. Handley, V. Jacobson, "SDP: Session Description Protocol",
    draft-ietf-mmusic-sdp-07.txt, April 1998.

[5] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, "RTP: A
    Transport Protocol for Real-Time Applications", RFC 1889, Internet
    Engineering Task Force, January 1996.

[6] J. Rosenberg, H. Schulzrinne, "An RTP Payload Format for Generic
    Forward Error Correction", draft-ietf-avt-fec-05.txt, February
    1999.