Internet Engineering Task Force                 Audio Visual Transport WG
Internet-Draft                        C. Guillemot, P. Christ, S. Wesner
draft-gc-avt-mpeg4visual-00.txt             INRIA / Univ. Stuttgart - RUS
                                                            March 1, 2000
Expires: September 1, 2000

          RTP payload format for MPEG-4 Visual Advanced Profiles
                      (scalable, core, main, N-bits)

STATUS OF THIS MEMO

This document is an Internet-Draft and is in full conformance with all
provisions of Section 10 of RFC 2026.

Internet-Drafts are working documents of the Internet Engineering Task
Force (IETF), its areas, and its working groups. Note that other groups
may also distribute working documents as Internet-Drafts.

Internet-Drafts are draft documents valid for a maximum of six months
and may be updated, replaced, or obsoleted by other documents at any
time. It is inappropriate to use Internet-Drafts as reference material
or to cite them other than as "work in progress."

The list of current Internet-Drafts can be accessed at
http://www.ietf.org/ietf/1id-abstracts.txt

The list of Internet-Draft Shadow Directories can be accessed at
http://www.ietf.org/shadow.html.

Abstract

This document describes a payload format for the transport of MPEG-4
Visual Elementary Streams, applicable to multimedia applications not
restricted to the simple visual profile. It is an application of the
format described in [1] to specialized cases of video which are
applicable in the H.323 context (using, for example, MPEG-4 layered
streams). It is therefore intended to support advanced MPEG-4 visual
profiles (simple scalable, core, main, N-bits profiles), by allowing
protection against loss of key segments of the elementary streams. The
simple scalable profile supports temporal and spatial scalability,
features important for rate or congestion control on the Internet,
especially in multicast. The core and the main visual profiles target
multimedia applications on the Internet and allow, in addition to
scalability, the usage of sprite objects and of arbitrary shape
objects.
Guillemot/Christ/Wesner                                          [Page 1]

Internet-Draft   Payload format for MPEG-4 visual streams   March 1, 2000

1 Introduction

An MPEG-4 scene is composed of media objects. The MPEG-4 dynamic-scene
description framework, which defines the spatio-temporal relation of
the media objects as well as their contents, is inspired by VRML. The
compressed binary representation of the scene description is called
BIFS (Binary Format for Scenes) [2]. The compressed scene description
is conveyed through one or more Elementary Streams (ES).

A compression layer produces the compressed representations of the
audio-visual objects that will be inserted into the scene. These
compressed representations are organized into Elementary Streams (ES).
Elementary Stream Descriptors provide information relative to the
stream, such as the compression scheme used. Elementary stream data is
partitioned into Access Units. The delineation of an Access Unit is
completely determined by the compression layer that generates the
elementary stream. An Access Unit is the smallest data entity to which
timing information can be attributed. Two Access Units shall never
refer to the same point in time.

Natural and animated synthetic objects may refer to an Object
Descriptor (OD), which points to one or more Elementary Streams that
carry the coded representation of the object or its animation data. An
OD serves as a grouping of one or more Elementary Stream Descriptors
that refer to a single media object. The OD also defines the
hierarchical relations and properties of the Elementary Stream
Descriptors. The Object Descriptors are conveyed through one or more
Elementary Streams. By conveying the session (or resource) description
as well as the scene description through their own Elementary Streams,
it becomes possible to change portions of scenes and/or properties of
media streams separately and dynamically at well-known instants of
time.
In order to allow effective implementations of the standard, subsets
of the MPEG-4 Systems, Visual, and Audio tool sets have been
identified that can be used for specific applications. Profiles exist
for various types of media content (audio, visual, and graphics) and
for scene descriptions. The visual part of the standard defines five
profiles for natural video: the simple profile, the simple scalable
profile, the core profile, the main profile and the N-bits profile
[3].

Considering the visual elementary streams, an important entry point
into the elementary stream data is the VideoObjectPlane(), which
starts with corresponding configuration information. Depending on the
visual profile, different sets of parameters will be present in the
header of the VideoObjectPlane(). These parameters are essential for
configuring the decoders and are not covered by the HEC-based error
resilience mechanism.

After analysing the impact of the coding options provided by the
different profiles with respect to loss resilience, this document
specifies an RTP payload format as an application of the generic
format proposed in [1], specialized to cases of video applicable, for
example, in the H.323 context (see Recommendation H.323 Annex B). The
document defines packetization rules as well as protocol support for
the protection of key segments of MPEG-4 visual streams.

The design goals of this RTP payload format are to provide the
following:

- a unified solution for all the visual profiles, with protection
  against loss of key segments of the elementary streams.

- a solution independent of the usage or the non-usage of the MPEG-4
  OD framework.

- protection against packet loss with a protocol support easily
  adaptable to varying network conditions, for both "live" and
  "pre-recorded" visual contents.
- flexible support of a range of error control mechanisms, from no
  protection to redundant data (key segments) and FEC.

The list of key segments (VisualObjectSequence header, VisualObject
header, VisualObjectLayer header, Group_of_VideoObjectPlane header,
VideoObjectPlane header) included in the payload header as redundant
data, as well as any additional protection schemes supported, will be
announced via out-of-band signaling at the beginning of the session,
using for example SDP [4]. The protection scheme used at a specific
instant during the session will be signaled via the extension type
(XT) field in the payload header.

2 MPEG-4 visual profiles

Five profiles have been defined for natural video content [3]:

- The Simple Visual Profile provides efficient, error resilient coding
  of rectangular video objects, suitable for applications on mobile
  networks, such as PCS and IMT2000.

- The Simple Scalable Visual Profile adds support for coding of
  temporal and spatial scalable objects to the Simple Visual Profile.
  It is useful for applications which provide services at more than
  one level of quality due to bit-rate or decoder resource
  limitations.

- The Core Visual Profile adds support for coding of arbitrary-shaped
  and temporally scalable objects to the Simple Visual Profile. It is
  useful for applications such as those providing relatively simple
  content-interactivity (Internet multimedia applications).

- The Main Visual Profile adds support for coding of interlaced,
  semi-transparent, and sprite objects to the Core Visual Profile. It
  is useful for interactive and entertainment-quality broadcast and
  DVD applications.

- The N-Bit Visual Profile adds support for coding video objects
  having pixel-depths ranging from 4 to 12 bits to the Core Visual
  Profile. It is suitable for use in surveillance applications.
The profiles for synthetic and synthetic/natural hybrid visual content
are:

- The Simple Facial Animation Visual Profile provides a simple means
  to animate a face model, suitable for applications such as
  audio/video presentation for the hearing impaired.

- The Scalable Texture Visual Profile provides spatially scalable
  coding of still image (texture) objects, useful for applications
  needing multiple scalability levels, such as mapping texture onto
  objects in games, and high-resolution digital still cameras.

- The Basic Animated 2-D Texture Visual Profile provides spatial
  scalability, SNR scalability, and mesh-based animation for still
  image (texture) objects, and also simple face object animation.

- The Hybrid Visual Profile combines the ability to decode
  arbitrary-shaped and temporally scalable natural video objects (as
  in the Core Visual Profile) with the ability to decode several
  synthetic and hybrid objects, including simple face and animated
  still image objects. It is suitable for various content-rich
  multimedia applications.

3 Impact of the profiles on the MPEG-4 visual syntax

A set of error resilience tools has been defined in the MPEG-4 visual
syntax in order to recover corrupted headers [3]. In particular, the
VideoObjectPlane data is structured in video packets, the entry point
being defined by the function video_packet_header() and delimited by
resync_markers. Header information is inserted at the start of a video
packet. Contained in this header is the information necessary to
restart the decoding process (provided key parameters from the VOP
header have been correctly received). Following the quant_scale is the
Header Extension Code (HEC). HEC is a bit used to indicate whether
additional information is available. If the HEC is equal to one, then
basic configuration parameters can be inserted in the packet header.
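To make this mechanism concrete, the following sketch (Python, not
bit-exact; the selection of duplicated fields is an illustrative,
non-normative reading of [3]) shows a video packet header repeating
basic VOP configuration when HEC is set:

```python
# Illustrative sketch of the HEC idea: when the Header Extension Code
# bit is 1, the video packet header repeats basic VOP configuration so
# that the packet remains decodable even if the VOP header carrying
# the original values was lost. The field selection is non-normative.

def video_packet_header(vop: dict, hec: bool) -> dict:
    header = {
        "resync_marker": True,          # delimits the video packet
        "macroblock_number": vop["mb_no"],
        "quant_scale": vop["quant"],
        "hec": hec,                     # Header Extension Code bit
    }
    if hec:
        # Duplicated configuration, recoverable without the VOP header.
        header.update({
            "modulo_time_base": vop["modulo_time_base"],
            "vop_time_increment": vop["vop_time_increment"],
            "vop_coding_type": vop["vop_coding_type"],
        })
    return header
```

Note that the parameters HEC can repeat are only the basic ones; the
profile-specific parameters analysed in the remainder of this section
are not covered by this mechanism.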
This section analyses the parameters which can potentially be
protected by the HEC mechanism, and highlights the coding options, and
therefore the visual profiles, not addressed by the above mechanism.
Depending on the visual profile, different sets of parameters will be
present in the header of the VideoObjectPlane(). In the simple
profiles, the essential VOP header parameters are: vop_coding_type,
modulo_time_base, marker_bit, vop_time_increment, the fcodes when the
vop_coding_type is P or B, and vop_reduced_resolution if
reduced_resolution_vop_enable is equal to 1.

Let us now consider the simple scalable profile, supporting temporal
and spatial scalability. Scalable or layered coding is well suited to
rate or congestion control on the Internet, especially in multicast. A
key parameter for being able to decode a VOP in an enhancement layer
is "ref_select_code", which signals the VOP that has been taken as a
reference for the prediction.

The core and the main visual profiles target multimedia applications
on the Internet and allow the usage of sprite objects and of arbitrary
shape objects, in addition to the scalable features provided by the
simple scalable profile. Sprite decoding operates in two modes: basic
and low-latency. The low-latency mode allows the sprite to be updated,
or new pieces of the sprite to be transmitted, which can then be used
as reference information for decoding subsequent S-VOPs and for the
construction of subsequent parts of sprites. It is therefore important
to be able to protect this information by allowing the repetition of
sprite data in consecutive packets. The decoding of an arbitrary-shape
VOP requires the dimensions of its bounding rectangle, its horizontal
and vertical spatial position, as well as the shape coding type.
This information is respectively provided by the parameters
"vop_width", "vop_height", "vop_horizontal_mc_spatial_ref",
"vop_vertical_mc_spatial_ref" and "vop_shape_coding_type" in the VOP
header. The parameter "change_conv_ratio_disable" is also needed to
decode the video packet properly. The "vop_constant_alpha" parameter,
as well as "vop_constant_alpha_value" (if vop_constant_alpha == 1),
the scaling factor applied to the decoded VOP before display, also
need protection.

When scalability is applied to arbitrary shape objects, extra
parameters need to be protected. For the shape decoding these
parameters are "load_backward_shape", "backward_shape_width",
"backward_shape_height", "backward_shape_horizontal_mc_spatial_ref",
"backward_shape_vertical_mc_spatial_ref", "load_forward_shape",
"forward_shape_width", "forward_shape_height",
"forward_shape_horizontal_mc_spatial_ref" and
"forward_shape_vertical_mc_spatial_ref". Another important parameter
is "background_composition", which signals the usage of background
composition in conjunction with scalability. A final parameter
important to protect is "vop_rounding_type", which signals the
rounding mechanism used in the pixel value interpolation in motion
compensation for P- and S(GMC)-VOPs.

4 Design Considerations

The syntax of the visual bitstreams defines two types of information:
the configuration information and the elementary stream data [2]. The
configuration information includes:

- the global configuration information, referring to the whole group
  of visual objects (visualobjectsequence()),

- the object configuration information, referring to a single visual
  object (visualobject()),

- and the object layer configuration information
  (visualobjectlayer()).

Two modes of transmission of the configuration and elementary stream
information are specified. The separate mode consists in transmitting
the configuration information in "containers" provided by MPEG-4
Systems (ODs).
The combined mode consists in transmitting the configuration
information together with the elementary stream data.

The solution recommended in draft-jnb-mpeg4av-rtp-01.txt, when using
the combined mode, consists in transporting this configuration
information in separate RTP packets, and in possibly repeating the
corresponding RTP packets periodically if needed for protection
purposes. However, since this vital information is restricted to a few
bytes, transporting it in separate RTP packets leads to unnecessary
overhead. More efficient transport can be achieved by grouping this
data with elementary stream data inside packets. The same remark
applies to the Group_of_VideoObjectPlane() entry point and to its
corresponding header or configuration information.

The compression layer organizes the ES data in Access Units (AU). The
AUs are the smallest entities that can be attributed individual
timestamps. The timestamps may be obtained directly, through the ESI.
If the SLConfigDescriptor indicates that timestamps are absent, the
timestamps may be obtained indirectly, for example by using the frame
rate.

The compression layer passes full or partial Access Units, together
with indications of AU boundaries, random access points and desired
timing information, directly to the network adaptation layer or
indirectly via the sync layer. It is however preferable, for
implementation efficiency, to pass the ES data directly to the network
adaptation layer, i.e. to avoid producing the full SL packets. Partial
AUs, or typed segments, are - in terms of the encoding syntax -
syntactically and semantically meaningful parts of an AU (cf. [1],
7.2.3: "Such partial AUs may have significance for improved error
resilience").

Depending on the visual profiles, different sets of key parameters
will be present in the header of the VideoObjectPlane(), as described
above.
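These profile-dependent parameter sets can be summarized in a short,
non-normative table (Python; the grouping by profile family is an
illustrative reading of Sections 2 and 3, not part of the format):

```python
# Non-normative summary of the key VOP header parameters identified in
# Sections 2 and 3, grouped by the profile family that introduces
# them. The grouping is illustrative only.

SIMPLE_PARAMS = [
    "vop_coding_type", "modulo_time_base", "marker_bit",
    "vop_time_increment",
    "fcodes",                  # only for P- and B-VOPs
    "vop_reduced_resolution",  # if reduced_resolution_vop_enable == 1
]

SCALABLE_EXTRA = [
    "ref_select_code",         # reference VOP used for prediction
]

ARBITRARY_SHAPE_EXTRA = [
    "vop_width", "vop_height",
    "vop_horizontal_mc_spatial_ref", "vop_vertical_mc_spatial_ref",
    "vop_shape_coding_type", "change_conv_ratio_disable",
    "vop_constant_alpha", "vop_constant_alpha_value",
    "vop_rounding_type",
    # plus the backward/forward shape parameters and
    # "background_composition" when combined with scalability
]

def key_parameters(profile: str) -> list:
    """Cumulative key-parameter list for a natural-video profile."""
    params = list(SIMPLE_PARAMS)
    if profile in ("simple_scalable", "core", "main", "n_bits"):
        params += SCALABLE_EXTRA
    if profile in ("core", "main", "n_bits"):
        params += ARBITRARY_SHAPE_EXTRA
    return params
```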
Key parameters for the simple scalable, main and core profiles are not
covered by the error resilience tools defined in MPEG-4. This document
advocates the need for protection support in the packetization format
that would be applicable to the different visual profiles,
independently of the usage or non-usage of the OD framework. Although
the protection support would most benefit the simple scalable, core,
main and N-bits profiles, and therefore a large range of multimedia
applications, it is also applicable to simple videotelephony and
videoconferencing applications relying on the simple profile (despite
the existence of the HEC mechanism).

Including the redundancy in the payload header, instead of inserting
it at the level of the video packet, brings more flexibility in the
insertion of redundancy (avoiding, for example, parsing the different
video packets in the RTP packets) and in the adaptation of the level
of redundancy to the network characteristics, especially in the case
of pre-encoded streams. Several video packets can be transmitted in
the same RTP packet.

The payload format also specifies a mechanism for grouping an AU or a
partial AU together with protection data (redundant data, FEC). This
mechanism makes it possible to adapt the protection of the different
partial AUs to varying network conditions during the session.
Consecutive segments (e.g. video packets [3]) of the same type will be
packed consecutively in the same RTP payload without using the
grouping mechanism. The compression layer should provide partial AUs
of a size small enough that the resulting RTP packet fits the MTU
size. RTP packets that transport fragments belonging to the same AU
will have their RTP timestamp set to the same value.

5 Payload Format specification

The packet will consist of an RTP header followed by possibly multiple
payloads.
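The fragmentation and timestamp rules above, together with the marker
bit rule of Section 5.1, can be sketched as follows (illustrative
Python; the MTU_PAYLOAD value and the triple representation are
assumptions, not part of the format):

```python
# Illustrative sketch: split one Access Unit into RTP-sized fragments.
# Every fragment of the same AU carries the same RTP timestamp, and
# the marker bit is set only on the packet carrying the last fragment
# (Section 5.1).

MTU_PAYLOAD = 1400  # assumed media bytes left after RTP/payload headers

def fragment_access_unit(au: bytes, timestamp: int):
    """Return (timestamp, marker, fragment) triples for one AU."""
    packets = []
    for offset in range(0, len(au), MTU_PAYLOAD):
        fragment = au[offset:offset + MTU_PAYLOAD]
        marker = offset + MTU_PAYLOAD >= len(au)  # last fragment?
        packets.append((timestamp, marker, fragment))
    return packets
```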
5.1 RTP Header Usage

Each RTP packet starts with a fixed RTP header. The following fields
of the fixed RTP header are used:

- Marker bit (M bit): The marker bit of the RTP header is set to 1
  when the current packet carries the end of an Access Unit (AU), or
  the last fragment of an AU.

- Payload Type (PT): The payload type shall be set to a value assigned
  to this format, or a payload type in the dynamic range should be
  chosen.

- Timestamp: The RTP timestamp is set to the composition timestamp
  (CTS), if its presence is indicated by the SLConfigDescriptor and if
  its length is not more than 32 bits. Otherwise, i.e. if the CTS is
  not present or when not using the OD framework, the RTP timestamp
  should be set to the sampling instant of the first AU contained in
  the packet. The RTP timestamp in this case encodes the presentation
  time of the first AU contained in the packet. The RTP timestamp may
  be the same on successive packets if an AU occupies more than one
  packet. If the packet contains only 'extension' data objects (see
  below), then the RTP timestamp is set to the value of the
  presentation time of the AU to which the first extension data object
  (e.g. FEC or redundant data) applies.

- SSRC: A mapping between the ES identifiers and the SSRCs should be
  provided via out-of-band signaling (e.g. SDP).

5.2 Payload Header

The payload header is always present, with a variable length, and is
defined as follows:

 0                   1                   2                   3
 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|G|E|    XT     |          LENGTH         |EBITS|               |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
.                                                               .
.                         Extension data                        .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
|G|E|0|   res   |          LENGTH         |                     |
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                     |
.                                                               .
.                         Media Payload                         .
.                                                               .
+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                     Figure 1: RTP payload format.

G (Group) (1 bit): If this field is 1, it indicates that the object
associated with the current header is followed by another object.

E (Extension) (1 bit): If its value is 1, then the next object
contains extension data. If its value is 0, then the next object
contains AU data (a full AU or a partial AU).

LENGTH (13 bits): This field specifies the length in bytes of the next
object. If the object is the last object of the payload (G=0), then
this field is not present.

EBITS (3 bits): Indicates the number of bits that shall be ignored in
the last byte of the extension data. If the next object is an AU or
partial AU object (E=0), then this field is not present. If the object
is the last object of the payload (G=0), then this field is not
present.

res (Reserved) (6 bits): This field is only present if the E field is
0, resulting in always 1 byte for {G,E=1,XT} or {G,E=0,res}.

XT (Extension type) (6 bits): This field is only present if E is set
to 1. It then specifies the type of extension data. Examples of types
are the different headers (VisualObjectSequence header, VisualObject
header, VisualObjectLayer header, Group_of_VideoObjectPlane header,
VideoObjectPlane header) or possibly FEC data with the specification
of the FEC coding scheme (parity codes, block codes such as
Reed-Solomon codes, ...).

6 Multiplexing

MPEG-4 applications can involve a large number of ESs, and thus also a
large number of RTP sessions. A multiplexing scheme allowing selective
bundling of ESs may therefore be necessary for some applications. The
multiplexing problem is outside the scope of this payload format.

7 Security Considerations

RTP packets transporting information with the proposed payload format
are subject to the security considerations discussed in the RTP
specification [5].
This implies that confidentiality of the media streams is achieved by
encryption. If the entire stream (extension data and AU data) is to be
secured, and all the participants are expected to have the keys to
decode the entire stream, then the encryption is performed in the
usual manner, and there is no conflict between the two operations
(encapsulation and encryption).

The need for a portion of the stream (e.g. extension data) to be
encrypted with a different key, or not to be encrypted, would require
application-level signaling protocols to be aware of the usage of the
XT field, and to exchange keys and negotiate their usage on the media
and extension data separately.

8 Authors' Addresses

Christine Guillemot
INRIA
Campus Universitaire de Beaulieu
35042 RENNES Cedex, FRANCE
email: Christine.Guillemot@irisa.fr

Paul Christ
Computer Center - RUS
University of Stuttgart
Allmandring 30
D-70550 Stuttgart, Germany
email: Paul.Christ@rus.uni-stuttgart.de

Stefan Wesner
Computer Center - RUS
University of Stuttgart
Allmandring 30
D-70550 Stuttgart, Germany
email: Paul.Christ@rus.uni-stuttgart.de

9 References

[1] C. Guillemot, P. Christ, S. Wesner, A. Klemets, "RTP Payload
    format for MPEG-4 with scaleable and flexible error resiliency",
    draft-guillemot-avt-genrtp-02.txt, March 2000.

[2] ISO/IEC 14496-1 FDIS, "MPEG-4 Systems", November 1998.

[3] ISO/IEC 14496-2 FDIS, "MPEG-4 Visual", November 1998.

[4] M. Handley, V. Jacobson, "SDP: Session Description Protocol",
    draft-ietf-mmusic-sdp-07.txt, April 1998.

[5] H. Schulzrinne, S. Casner, R. Frederick, V. Jacobson, "RTP: A
    Transport Protocol for Real-Time Applications", RFC 1889, Internet
    Engineering Task Force, January 1996.

[6] J. Rosenberg, H. Schulzrinne, "An RTP Payload Format for Generic
    Forward Error Correction", draft-ietf-avt-fec-05.txt, February
    1999.