RFC 6190

RTP Payload Format for Scalable Video Coding

Pages: 100
Proposed Standard

RFC6190 - Page 1

Internet Engineering Task Force (IETF)                         S. Wenger
Request for Comments: 6190                                   Independent
Category: Standards Track                                     Y.-K. Wang
ISSN: 2070-1721                                      Huawei Technologies
                                                              T. Schierl
                                                          Fraunhofer HHI
                                                        A. Eleftheriadis
                                                                   Vidyo
                                                                May 2011


              RTP Payload Format for Scalable Video Coding

Abstract

   This memo describes an RTP payload format for Scalable Video Coding
   (SVC) as defined in Annex G of ITU-T Recommendation H.264, which is
   technically identical to Amendment 3 of ISO/IEC International
   Standard 14496-10.  The RTP payload format allows for packetization
   of one or more Network Abstraction Layer (NAL) units in each RTP
   packet payload, as well as fragmentation of a NAL unit in multiple
   RTP packets.  Furthermore, it supports transmission of an SVC stream
   over a single as well as multiple RTP sessions.  The payload format
   defines a new media subtype name "H264-SVC", but is still backward
   compatible with RFC 6184 since the base layer, when encapsulated in
   its own RTP stream, must use the H.264 media subtype name ("H264")
   and the packetization method specified in RFC 6184.  The payload format
   has wide applicability in videoconferencing, Internet video
   streaming, and high-bitrate entertainment-quality video, among
   others.

Status of This Memo

   This is an Internet Standards Track document.

   This document is a product of the Internet Engineering Task Force
   (IETF).  It represents the consensus of the IETF community.  It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG).  Further information on
   Internet Standards is available in Section 2 of RFC 5741.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc6190.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   This document may contain material from IETF Documents or IETF
   Contributions published or made publicly available before November
   10, 2008.  The person(s) controlling the copyright in some of this
   material may not have granted the IETF Trust the right to allow
   modifications of such material outside the IETF Standards Process.
   Without obtaining an adequate license from the person(s) controlling
   the copyright in such materials, this document may not be modified
   outside the IETF Standards Process, and derivative works of it may
   not be created outside the IETF Standards Process, except to format
   it for publication as an RFC or to translate it into languages other
   than English.

Table of Contents

   1. Introduction ....................................................5
      1.1. The SVC Codec ..............................................6
           1.1.1. Overview ............................................6
           1.1.2. Parameter Sets ......................................8
           1.1.3. NAL Unit Header .....................................9
      1.2. Overview of the Payload Format ............................12
           1.2.1. Design Principles ..................................12
           1.2.2. Transmission Modes and Packetization Modes .........13
           1.2.3. New Payload Structures .............................15
   2. Conventions ....................................................16
   3. Definitions and Abbreviations ..................................16
      3.1. Definitions ...............................................16
           3.1.1. Definitions from the SVC Specification .............16
           3.1.2. Definitions Specific to This Memo ..................18
      3.2. Abbreviations .............................................22
   4. RTP Payload Format .............................................23
      4.1. RTP Header Usage ..........................................23
      4.2. NAL Unit Extension and Header Usage .......................23
           4.2.1. NAL Unit Extension .................................23
           4.2.2. NAL Unit Header Usage ..............................24
      4.3. Payload Structures ........................................25
      4.4. Transmission Modes ........................................28
      4.5. Packetization Modes .......................................28
           4.5.1. Packetization Modes for Single-Session
                  Transmission .......................................28
           4.5.2. Packetization Modes for Multi-Session
                  Transmission .......................................29
      4.6. Single NAL Unit Packets ...................................32
      4.7. Aggregation Packets .......................................33
           4.7.1. Non-Interleaved Multi-Time Aggregation
                  Packets (NI-MTAPs) .................................33
      4.8. Fragmentation Units (FUs) .................................35
      4.9. Payload Content Scalability Information (PACSI) NAL Unit ..35
      4.10. Empty NAL unit ...........................................43
      4.11. Decoding Order Number (DON) ..............................43
           4.11.1. Cross-Session DON (CS-DON) for
                   Multi-Session Transmission ........................43
   5. Packetization Rules ............................................45
      5.1. Packetization Rules for Single-Session Transmission .......45
      5.2. Packetization Rules for Multi-Session Transmission ........46
           5.2.1. NI-T/NI-TC Packetization Rules .....................47
           5.2.2. NI-C/NI-TC Packetization Rules .....................49
           5.2.3. I-C Packetization Rules ............................50
           5.2.4. Packetization Rules for Non-VCL NAL Units ..........50
           5.2.5. Packetization Rules for Prefix NAL Units ...........51
   6. De-Packetization Process .......................................51
      6.1. De-Packetization Process for Single-Session Transmission ..51
      6.2. De-Packetization Process for Multi-Session Transmission ...51
           6.2.1. Decoding Order Recovery for the NI-T and
                  NI-TC Modes ........................................52
                  6.2.1.1. Informative Algorithm for NI-T
                           Decoding Order Recovery within
                           an Access Unit ............................55
           6.2.2. Decoding Order Recovery for the NI-C,
                  NI-TC, and I-C Modes ...............................57
   7. Payload Format Parameters ......................................59
      7.1. Media Type Registration ...................................60
      7.2. SDP Parameters ............................................75
           7.2.1. Mapping of Payload Type Parameters to SDP ..........75
           7.2.2. Usage with the SDP Offer/Answer Model ..............76
           7.2.3. Dependency Signaling in Multi-Session
                  Transmission .......................................84
           7.2.4. Usage in Declarative Session Descriptions ..........85
      7.3. Examples ..................................................86
           7.3.1. Example for Offering a Single SVC Session ..........86
           7.3.2. Example for Offering a Single SVC Session Using
                  scalable-layer-id ..................................87
           7.3.3. Example for Offering Multiple Sessions in MST ......87
           7.3.4. Example for Offering Multiple Sessions in
                  MST Including Operation with Answerer Using
                  scalable-layer-id ..................................89
           7.3.5. Example for Negotiating an SVC Stream with
                  a Constrained Base Layer in SST ....................90
      7.4. Parameter Set Considerations ..............................91
   8. Security Considerations ........................................91
   9. Congestion Control .............................................92
   10. IANA Considerations ...........................................93
   11. Informative Appendix: Application Examples ....................93
      11.1. Introduction .............................................93
      11.2. Layered Multicast ........................................93
      11.3. Streaming ................................................94
      11.4. Videoconferencing (Unicast to MANE, Unicast to
            Endpoints) ...............................................95
      11.5. Mobile TV (Multicast to MANE, Unicast to Endpoint) .......96
   12. Acknowledgements ..............................................97
   13. References ....................................................97
      13.1. Normative References .....................................97
      13.2. Informative References ...................................98

1.  Introduction

   This memo specifies an RTP [RFC3550] payload format for the Scalable
   Video Coding (SVC) extension of the H.264/AVC video coding standard.
   SVC is specified in Amendment 3 to ISO/IEC 14496 Part 10
   [ISO/IEC14496-10] and equivalently in Annex G of ITU-T Rec. H.264
   [H.264].  In this memo, unless explicitly stated otherwise,
   "H.264/AVC" refers to the specification of [H.264] excluding Annex G.

   SVC covers the entire application range of H.264/AVC, from low-
   bitrate mobile applications, to High-Definition Television (HDTV)
   broadcasting, and even Digital Cinema that requires nearly lossless
   coding and hundreds of megabits per second.  The scalability features
   that SVC adds to H.264/AVC enable several system-level
   functionalities related to the ability of a system to adapt the
   signal to different system conditions with no or minimal processing.
   The adaptation relates both to the capabilities of potentially
   heterogeneous receivers (differing in screen resolution, processing
   speed, etc.), and to differing or time-varying network conditions.
   The adaptation can be performed at the source, the destination, or in
   intermediate media-aware network elements (MANEs).  The payload
   format specified in this memo exposes these system-level
   functionalities so that system designers can take direct advantage of
   these features.

      Informative note: Since SVC streams contain, by design, a sub-
      stream that is compliant with H.264/AVC, it is trivial for a MANE
      to filter the stream so that all SVC-specific information is
      removed.  This memo, in fact, defines a media type parameter
      (sprop-avc-ready, Section 7.2) that indicates whether or not the
      stream can be converted to one compliant with [RFC6184] by
      eliminating RTP packets, and rewriting RTP Control Protocol (RTCP)
      to match the changes to the RTP packet stream as specified in
      Section 7 of [RFC3550].

   This memo defines two basic modes for transmission of SVC data,
   single-session transmission (SST) and multi-session transmission
   (MST).  In SST, a single RTP session is used for the transmission of
   all scalability layers comprising an SVC bitstream; in MST, the
   scalability layers are transported on different RTP sessions.  In
   SST, packetization is a straightforward extension of [RFC6184].  For
   MST, four different modes are defined in this memo.  They differ in
   whether or not they allow interleaving, i.e., transmitting Network
   Abstraction Layer (NAL) units in an order different from the decoding
   order, and in the technique used to effect inter-session NAL unit
   decoding order recovery.  Decoding order recovery is performed using
   either inter-session timestamp alignment [RFC3550] or cross-session
   decoding order numbers (CS-DONs).  One of the MST modes supports both
   decoding order recovery techniques, so that receivers can select
   their preferred technique.  More details can be found in Section
   1.2.2.

   This memo further defines three new NAL unit types.  The first type
   is the payload content scalability information (PACSI) NAL unit,
   which is used to provide an informative summary of the scalability
   information of the data contained in an RTP packet, as well as
   ancillary data (e.g., CS-DON values).  The second and third new NAL
   unit types are the empty NAL unit and the non-interleaved multi-time
   aggregation packet (NI-MTAP) NAL unit.  The empty NAL unit is used to
   ensure inter-session timestamp alignment required for decoding order
   recovery in MST.  The NI-MTAP is used as a new payload structure
   allowing the grouping of NAL units of different time instances in
   decoding order.  More details about the new packet structures can be
   found in Section 1.2.3.

   This memo also defines the signaling support for SVC transport over
   RTP, including a new media subtype name (H264-SVC).

   A non-normative overview of the SVC codec and the payload is given in
   the remainder of this section.

1.1.  The SVC Codec

1.1.1.  Overview

   SVC defines a coded video representation in which a given bitstream
   offers representations of the source material at different levels of
   fidelity (hence the term "scalable").  Scalable video coding
   bitstreams, or scalable bitstreams, are constructed in a pyramidal
   fashion: the coding process creates bitstream components that improve
   the fidelity of hierarchically lower components.

   The fidelity dimensions offered by SVC are spatial (picture size),
   quality (or Signal-to-Noise Ratio (SNR)), and temporal (pictures per
   second).  Bitstream components associated with a given level of
   spatial, quality, and temporal fidelity are identified using
   corresponding parameters in the bitstream: dependency_id, quality_id,
   and temporal_id (see also Section 1.1.3).  The fidelity identifiers
   have integer values, where higher values designate components that
   are higher in the hierarchy.  It is noted that SVC offers significant
   flexibility in terms of how an encoder may choose to structure the
   dependencies between the various components.  Decoding of a
   particular component requires the availability of all the components
   it depends upon, either directly, or indirectly.  An operation point
   of an SVC bitstream consists of the bitstream components required to
   be able to decode a particular dependency_id, quality_id, and
   temporal_id combination.

   The term "layer" is used in various contexts in this memo.  For
   example, in the terms "Video Coding Layer" and "Network Abstraction
   Layer" it refers to conceptual organization levels.  When referring
   to bitstream syntax elements such as block layer or macroblock layer,
   it refers to hierarchical bitstream structure levels.  When used in
   the context of bitstream scalability, e.g., "AVC base layer", it
   refers to a level of representation fidelity of the source signal
   with a specific set of NAL units included.  The correct
   interpretation is supported by providing the appropriate context.

   SVC maintains the bitstream organization introduced in H.264/AVC.
   Specifically, all bitstream components are encapsulated in Network
   Abstraction Layer (NAL) units, which are organized as Access Units
   (AUs).  An AU is associated with a single sampling instance in time.
   A subset of the NAL unit types corresponds to the Video Coding Layer
   (VCL), and contain the coded picture data associated with the source
   content.  Non-VCL NAL units carry ancillary data that may be
   necessary for decoding (e.g., parameter sets as explained below) or
   that facilitate certain system operations but are not needed by the
   decoding process itself.  Coded picture data at the various fidelity
   dimensions are organized in slices.  Within one AU, a coded picture
   of an operation point consists of all the coded slices required for
   decoding up to the particular combination of dependency_id and
   quality_id values at the time instance corresponding to the AU.

   It is noted that the concept of temporal scalability is already
   present in H.264/AVC, as profiles defined in Annex A of [H.264]
   already support it.  Specifically, in H.264/AVC, the concept of sub-
   sequences has been introduced to allow optional use of temporal
   layers through Supplemental Enhancement Information (SEI) messages.
   SVC extends this approach by exposing the temporal scalability
   information using the temporal_id parameter, alongside (and unified
   with) the dependency_id and quality_id values that are used for
   spatial and quality scalability, respectively.  For coded picture
   data defined in Annex G of [H.264], this is accomplished by using a
   new type of NAL unit, namely, coded slice in scalable extension NAL
   unit (type 20), where the fidelity parameters are part of its header.
   For coded picture data that follow H.264/AVC, and to ensure
   compatibility with existing H.264/AVC decoders, another new type of
   NAL unit, namely, prefix NAL unit (type 14), has been defined to
   carry this header information.  SVC additionally specifies a third
   new type of NAL unit, namely, subset sequence parameter set NAL unit
   (type 15), to contain sequence parameter set information for quality
   and spatial enhancement layers.  All three of these newly specified NAL
   unit types (14, 15, and 20) are among those reserved in H.264/AVC and
   are to be ignored by decoders conforming to one or more of the
   profiles specified in Annex A of [H.264].

   Within an AU, the VCL NAL units associated with a given dependency_id
   and quality_id are referred to as a "layer representation".  The
   layer representation corresponding to the lowest values of
   dependency_id and quality_id (i.e., zero for both) is compliant by
   design with H.264/AVC.  The set of VCL and associated non-VCL NAL
   units across all AUs in a bitstream associated with a particular
   combination of values of dependency_id and quality_id, regardless of
   the value of temporal_id, is conceptually a scalable layer.  For
   backward compatibility with H.264/AVC, it is important to
   differentiate, however, whether or not SVC-specific NAL units are
   present in a given bitstream.  This is particularly important for the
   lowest fidelity values in terms of dependency_id and quality_id (zero
   for both), as the corresponding VCL data are compliant with
   H.264/AVC, and may or may not be accompanied by associated prefix NAL
   units.  This memo therefore uses the term "AVC base layer" to
   designate the layer that does not contain SVC-specific NAL units, and
   "SVC base layer" to designate the same layer but with the addition of
   the associated SVC prefix NAL units.  Note that the SVC specification
   uses the term "base layer" for what in this memo will be referred to
   as "AVC base layer".  Similarly, it is also important to be able to
   differentiate, within a layer, the temporal fidelity components it
   contains.  This memo uses the term "T0" to indicate, within a
   particular layer, the subset that contains the NAL units associated
   with temporal_id equal to 0.

   SNR scalability in SVC is offered in two different ways.  In what is
   called coarse-grain scalability (CGS), scalability is provided by
   including or excluding a complete layer when decoding a particular
   bitstream.  In contrast, in medium-grain scalability (MGS),
   scalability is provided by selectively omitting the decoding of
   specific NAL units belonging to MGS layers.  The selection of the NAL
   units to omit can be based on fixed-length fields present in the NAL
   unit header (see also Sections 1.1.3 and 4.2).

1.1.2.  Parameter Sets

   SVC maintains the parameter sets concept in H.264/AVC and introduces
   a new type of sequence parameter set, referred to as the subset
   sequence parameter set [H.264].  Subset sequence parameter sets have
   NAL unit type equal to 15, which is different from the NAL unit type
   value (7) of sequence parameter sets.  VCL NAL units of NAL unit type
   1 to 5 must only (indirectly) refer to sequence parameter sets, while
   VCL NAL units of NAL unit type 20 must only (indirectly) refer to
   subset sequence parameter sets.  The references are indirect because
   VCL NAL units refer to picture parameter sets (in their slice
   header), which in turn refer to regular or subset sequence parameter
   sets.  Subset sequence parameter sets use an identifier value space
   that is separate from that of sequence parameter sets.
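
   Informative sketch: the referencing rule above can be illustrated
   with a small lookup structure.  The class and function names below
   are hypothetical and are not part of this memo or of [H.264].  The
   sketch also collapses the indirection through the picture parameter
   set, assuming the caller has already resolved the sequence parameter
   set identifier.  It shows only two points: the NAL unit type of a VCL
   NAL unit selects which kind of sequence parameter set it may refer
   to, and the two kinds use separate identifier spaces.

```python
from typing import Dict, Tuple

SPS_NAL_TYPE = 7          # sequence parameter set
SUBSET_SPS_NAL_TYPE = 15  # subset sequence parameter set


def referenced_sps_nal_type(vcl_nal_unit_type: int) -> int:
    """Return the parameter set NAL unit type a VCL NAL unit refers to."""
    if 1 <= vcl_nal_unit_type <= 5:
        return SPS_NAL_TYPE
    if vcl_nal_unit_type == 20:
        return SUBSET_SPS_NAL_TYPE
    raise ValueError("not a VCL NAL unit type covered by this sketch")


class SequenceParameterSetStore:
    """Hypothetical store keyed by (NAL unit type, identifier).

    Keying on the NAL unit type keeps the two identifier spaces apart,
    so identifier 0 may exist once per kind without colliding.
    """

    def __init__(self) -> None:
        self._sets: Dict[Tuple[int, int], bytes] = {}

    def put(self, nal_unit_type: int, sps_id: int, payload: bytes) -> None:
        if nal_unit_type not in (SPS_NAL_TYPE, SUBSET_SPS_NAL_TYPE):
            raise ValueError("not a sequence parameter set NAL unit type")
        self._sets[(nal_unit_type, sps_id)] = payload

    def lookup_for_vcl(self, vcl_nal_unit_type: int, sps_id: int) -> bytes:
        key = (referenced_sps_nal_type(vcl_nal_unit_type), sps_id)
        return self._sets[key]
```

   With one parameter set of each kind stored under identifier 0, a
   type 5 NAL unit resolves to the regular set and a type 20 NAL unit
   resolves to the subset one.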

   In SVC, coded picture data from different layers may use the same or
   different sequence and picture parameter sets.  Let the variable DQId
   be equal to dependency_id * 16 + quality_id.  At any time instant
   during the decoding process there is one active sequence parameter
   set for the layer representation with the highest value of DQId and
   one or more active layer SVC sequence parameter set(s) for layer
   representations with lower values of DQId.  The active sequence
   parameter set or an active layer SVC sequence parameter set remains
   unchanged throughout a coded video sequence in the scalable layer in
   which the active sequence parameter set or active layer SVC sequence
   parameter set is referred to.  This means that the referred sequence
   parameter set or subset sequence parameter set can only change at
   instantaneous decoding refresh (IDR) access units for any layer.  At
   any time instant during the decoding process there may be one active
   picture parameter set (for the layer representation with the highest
   value of DQId) and one or more active layer picture parameter set(s)
   (for layer representations with lower values of DQId).  The active
   picture parameter set or an active layer picture parameter set
   remains unchanged throughout a layer representation in which the
   active picture parameter set or active layer picture parameter set is
   referred to, but may change from one AU to the next.
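
   Informative sketch: the DQId computation defined above is shown
   below.  The function names are hypothetical.  The range checks assume
   the field widths given in Section 1.1.3 (3 bits for dependency_id
   and 4 bits for quality_id).

```python
from typing import Tuple


def dq_id(dependency_id: int, quality_id: int) -> int:
    """Combine dependency_id and quality_id into DQId."""
    if not 0 <= dependency_id <= 7:
        raise ValueError("dependency_id must fit in 3 bits")
    if not 0 <= quality_id <= 15:
        raise ValueError("quality_id must fit in 4 bits")
    return dependency_id * 16 + quality_id


def split_dq_id(value: int) -> Tuple[int, int]:
    """Recover (dependency_id, quality_id) from a DQId value."""
    return divmod(value, 16)
```

   Because quality_id occupies the low four bits, comparing DQId values
   orders layer representations first by dependency_id and then by
   quality_id.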

1.1.3.  NAL Unit Header

   SVC extends the one-byte H.264/AVC NAL unit header by three
   additional octets for NAL units of types 14 and 20.  The header
   indicates the type of the NAL unit, the (potential) presence of bit
   errors or syntax violations in the NAL unit payload, information
   regarding the relative importance of the NAL unit for the decoding
   process, the layer identification information, and other fields as
   discussed below.

   The syntax and semantics of the NAL unit header are specified in
   [H.264], but the essential properties of the NAL unit header are
   summarized below for convenience.

   The first byte of the NAL unit header has the following format (the
   bit fields are the same as defined for the one-byte H.264/AVC NAL
   unit header, while the semantics of some fields have changed
   slightly, in a backward-compatible way):

         +---------------+
         |0|1|2|3|4|5|6|7|
         +-+-+-+-+-+-+-+-+
         |F|NRI|  Type   |
         +---------------+

   The semantics of the components of the NAL unit type octet, as
   specified in [H.264], are described briefly below.  In addition to
   the name and size of each field, the corresponding syntax element
   name in [H.264] is also provided.

   F:    1 bit
         forbidden_zero_bit.  H.264/AVC declares a value of 1 as a
         syntax violation.

   NRI:  2 bits
         nal_ref_idc.  A value of "00" (in binary form) indicates that
         the content of the NAL unit is not used to reconstruct
         reference pictures for future prediction.  Such NAL units can
         be discarded without risking the integrity of the reference
         pictures in the same layer.  A value greater than "00"
         indicates that the decoding of the NAL unit is required to
         maintain the integrity of reference pictures in the same layer
         or that the NAL unit contains parameter sets.

   Type: 5 bits
         nal_unit_type.  This component specifies the NAL unit type as
         defined in Table 7-1 of [H.264], and later within this memo.
         For a reference of all currently defined NAL unit types and
         their semantics, please refer to Section 7.4.1 in [H.264].

         In H.264/AVC, NAL unit types 14, 15, and 20 are reserved for
         future extensions.  SVC uses these three NAL unit types as
         follows: NAL unit type 14 is used for prefix NAL unit, NAL unit
         type 15 is used for subset sequence parameter set, and NAL unit
         type 20 is used for coded slice in scalable extension (see
         Section 7.4.1 in [H.264]).  NAL unit types 14 and 20 indicate
         the presence of three additional octets in the NAL unit header,
         as shown below.

            +---------------+---------------+---------------+
            |0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|0|1|2|3|4|5|6|7|
            +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
            |R|I|   PRID    |N| DID |  QID  | TID |U|D|O| RR|
            +---------------+---------------+---------------+

   R:    1 bit
         reserved_one_bit.  Reserved bit for future extension.  R must
         be equal to 1.  The value of R must be ignored by decoders.

   I:    1 bit
         idr_flag.  This component specifies whether the layer
         representation is an instantaneous decoding refresh (IDR) layer
         representation (when equal to 1) or not (when equal to 0).

   PRID: 6 bits
         priority_id.  This component specifies a priority identifier
         for the NAL unit.  A lower value of PRID indicates a higher
         priority.

   N:    1 bit
         no_inter_layer_pred_flag.  This flag specifies, when present in
         a coded slice NAL unit, whether inter-layer prediction may be
         used for decoding the coded slice (when equal to 0) or not
         (when equal to 1).

   DID:  3 bits
         dependency_id.  This component indicates the inter-layer coding
         dependency level of a layer representation.  At any access
         unit, a layer representation with a given dependency_id may be
         used for inter-layer prediction for coding of a layer
         representation with a higher dependency_id, while a layer
         representation with a given dependency_id shall not be used for
         inter-layer prediction for coding of a layer representation
         with a lower dependency_id.

   QID:  4 bits
         quality_id.  This component indicates the quality level of an
         MGS layer representation.  At any access unit and for identical
         dependency_id values, a layer representation with quality_id
         equal to ql uses a layer representation with quality_id equal
         to ql-1 for inter-layer prediction.

   TID:  3 bits
         temporal_id.  This component indicates the temporal level of a
         layer representation.  The temporal_id is associated with the
         frame rate, with lower values of temporal_id corresponding to
         lower frame rates.  A layer representation at a given
         temporal_id typically depends on layer representations with
         lower temporal_id values, but it never depends on layer
         representations with higher temporal_id values.

   U:    1 bit
         use_ref_base_pic_flag.  A value of 1 indicates that only
         reference base pictures are used during the inter prediction
         process.  A value of 0 indicates that the reference base
         pictures are not used during the inter prediction process.

   D:    1 bit
         discardable_flag.  A value of 1 indicates that the current NAL
         unit is not used for decoding NAL units with values of
         dependency_id higher than the one of the current NAL unit, in
         the current and all subsequent access units.  Such NAL units
         can be discarded without risking the integrity of layers with
         higher dependency_id values.  discardable_flag equal to 0
         indicates that the decoding of the NAL unit is required to
         maintain the integrity of layers with higher dependency_id.

   O:    1 bit
         output_flag.  Affects the decoded picture output process as
         defined in Annex C of [H.264].

   RR:   2 bits
         reserved_three_2bits.  Reserved bits for future extension.  RR
         MUST be equal to "11" (in binary form).  The value of RR must
         be ignored by decoders.

   This memo extends the semantics of F, NRI, I, PRID, DID, QID, TID, U,
   and D per Annex G of [H.264] as described in Section 4.2.
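
   Informative sketch: the header layout summarized in this section can
   be parsed as shown below.  The type and function names are
   hypothetical.  The sketch assumes that bit 0 in the diagrams above is
   the most significant bit of each octet, and that the input is a raw
   NAL unit rather than the Annex B bytestream format.  The reserved
   fields R and RR are skipped.

```python
from typing import NamedTuple, Optional


class SvcExtension(NamedTuple):
    idr_flag: int
    priority_id: int
    no_inter_layer_pred_flag: int
    dependency_id: int
    quality_id: int
    temporal_id: int
    use_ref_base_pic_flag: int
    discardable_flag: int
    output_flag: int


class NalHeader(NamedTuple):
    forbidden_zero_bit: int
    nal_ref_idc: int
    nal_unit_type: int
    svc: Optional[SvcExtension]  # present only for types 14 and 20


def parse_nal_header(nal: bytes) -> NalHeader:
    """Parse the one-octet header and, if present, the SVC extension."""
    if len(nal) < 1:
        raise ValueError("empty NAL unit")
    b0 = nal[0]
    nal_unit_type = b0 & 0x1F
    svc = None
    if nal_unit_type in (14, 20):
        if len(nal) < 4:
            raise ValueError("truncated SVC NAL unit header")
        b1, b2, b3 = nal[1], nal[2], nal[3]
        svc = SvcExtension(
            idr_flag=(b1 >> 6) & 0x1,
            priority_id=b1 & 0x3F,
            no_inter_layer_pred_flag=b2 >> 7,
            dependency_id=(b2 >> 4) & 0x7,
            quality_id=b2 & 0xF,
            temporal_id=b3 >> 5,
            use_ref_base_pic_flag=(b3 >> 4) & 0x1,
            discardable_flag=(b3 >> 3) & 0x1,
            output_flag=(b3 >> 2) & 0x1,
        )
    return NalHeader(b0 >> 7, (b0 >> 5) & 0x3, nal_unit_type, svc)
```

   For example, the four octets 0x74 0x85 0x12 0x6F describe a type 20
   NAL unit with NRI equal to 3, PRID equal to 5, DID equal to 1, QID
   equal to 2, and TID equal to 3.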

1.2.  Overview of the Payload Format

   Similar to [RFC6184], this payload format can only be used to carry
   the raw NAL unit stream over RTP and not the bytestream format
   specified in Annex B of [H.264].

   The design principles, transmission modes, and packetization modes as
   well as new payload structures are summarized in this section.  It is
   assumed that the reader is familiar with the terminology and concepts
   defined in [RFC6184].

1.2.1.  Design Principles

   The following design principles have been observed for this payload
   format:

   o  Backward compatibility with [RFC6184] wherever possible.

   o  The SVC base layer or any H.264/AVC compatible subset of the SVC
      base layer, when transmitted in its own RTP stream, must be
      encapsulated using [RFC6184].  This ensures that such an RTP
      stream can be understood by [RFC6184] receivers.

   o  Media-aware network elements (MANEs) as defined in [RFC6184] are
      signaling-aware, rely on signaling information, and have state.

   o  MANEs can aggregate multiple RTP streams, possibly from multiple
      RTP sessions.

   o  MANEs can perform media-aware stream thinning (selective
      elimination of packets or portions thereof).  By using the payload
      header information identifying layers within an RTP session, MANEs
      are able to remove packets or portions thereof from the incoming
      RTP packet stream.  This implies rewriting the RTP headers of the
      outgoing packet stream, and rewriting of RTCP packets as specified
      in Section 7 of [RFC3550].
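
   Informative sketch: the stream thinning described in the last item
   can be illustrated with a simple threshold filter on the layer
   identifiers of Section 1.1.3.  The names below are hypothetical, and
   the threshold rule is a simplification chosen for illustration; it
   is not a procedure specified by this memo.  A real MANE would also
   have to handle non-VCL NAL units and rewrite the RTP headers and RTCP
   packets, none of which is shown here.

```python
from typing import Iterable, List, NamedTuple


class LayerIds(NamedTuple):
    dependency_id: int
    quality_id: int
    temporal_id: int


def keep_nal_unit(ids: LayerIds, target: LayerIds) -> bool:
    """Decide whether a NAL unit is kept for a target operation point."""
    if ids.temporal_id > target.temporal_id:
        return False
    if ids.dependency_id > target.dependency_id:
        return False
    # Quality refinements are dropped only within the target dependency
    # level; lower dependency levels are kept whole in this sketch.
    if (ids.dependency_id == target.dependency_id
            and ids.quality_id > target.quality_id):
        return False
    return True


def thin(units: Iterable[LayerIds], target: LayerIds) -> List[LayerIds]:
    """Return the NAL units that survive thinning, in their input order."""
    return [u for u in units if keep_nal_unit(u, target)]
```

   Keeping the input order matters because the surviving NAL units must
   still be delivered in a valid decoding order.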

1.2.2.  Transmission Modes and Packetization Modes

   This memo allows the packetization of SVC data for both single-
   session transmission (SST) and multi-session transmission (MST).  In
   the case of SST all SVC data are carried in a single RTP session.  In
   the case of MST two or more RTP sessions are used to carry the SVC
   data, in accordance with the MST-specific packetization modes defined
   in this memo, which are based on the packetization modes defined in
   [RFC6184].  In MST, each RTP session is associated with one RTP
   stream, which may carry one or more layers.

   The base layer is, by design, compatible with H.264/AVC.  During
   transmission, the associated prefix NAL units, which are introduced
   by SVC and, when present, are ignored by H.264/AVC decoders, may be
   encapsulated within the same RTP packet stream as the H.264/AVC VCL
   NAL units or in a different RTP packet stream (when MST is used).
   For convenience, the term "AVC base layer" is used to refer to the
   base layer without prefix NAL units, while the term "SVC base layer"
   is used to refer to the base layer with prefix NAL units.

   Furthermore, the base layer may have multiple temporal components
   (i.e., supporting different frame rates).  As a result, the lowest
   temporal component ("T0") of the AVC or SVC base layer is used as the
   starting point of the SVC bitstream hierarchy.

   This memo allows encapsulating in a given RTP stream any of the
   following three alternatives of layer combinations:

   1. the T0 AVC base layer or the T0 SVC base layer only;
   2. one or more enhancement layers only; or
   3. the T0 SVC base layer, and one or more enhancement layers.

   SST should be used in point-to-point unicast applications and, in
   general, whenever the potential benefit of using multiple RTP
   sessions does not justify the added complexity.  When SST is used,
   the layer combination cases 1 and 3 above can be used.  When an
   H.264/AVC compatible subset of the SVC base layer is transmitted
   using SST, the packetization of [RFC6184] must be used, thus ensuring
   compatibility with [RFC6184] receivers.  When, however, one or more
   SVC quality or spatial enhancement layers are transmitted using SST,
   the packetization defined in this memo must be used.  In SST, any of
   the three [RFC6184] packetization modes, namely, single NAL unit
   mode, non-interleaved mode, and interleaved mode, can be used.

   MST should be used in a multicast session when different receivers
   may request different layers of the scalable bitstream.  An operation
   point for an SVC bitstream, as defined in this memo, corresponds to a
   set of layers that together conform to one of the profiles defined in
   Annex A or G of [H.264] and, when decoded, offer a representation of
   the original video at a certain fidelity.  The number of streams used
   in MST should be at least equal to the number of operation points
   that may be requested by the receivers.  Depending on the
   application, this may result in each layer being carried in its own
   RTP session, or in having multiple layers encapsulated within one RTP
   session.

      Informative note: Layered multicast is a term commonly used to
      describe the application where multicast is used to transmit
      layered or scalable data that has been encapsulated into more than
      one RTP session.  This application allows different receivers in
      the multicast session to receive different operation points of the
      scalable bitstream.  Layered multicast, among other application
      examples, is discussed in more detail in Section 11.2.

   When MST is used, any of the three layer combinations above can be
   used for each of the sessions.  When an H.264/AVC compatible subset
   of the SVC base layer is transmitted in its own session in MST, the
   packetization of [RFC6184] must be used, such that [RFC6184]
   receivers can be part of the MST and receive only this session.  For
   MST, this memo defines four different MST-specific packetization
   modes, namely, non-interleaved timestamp (NI-T) based mode, non-
   interleaved CS-DON (NI-C) based mode, non-interleaved combined
   timestamp and CS-DON mode (NI-TC), and interleaved CS-DON (I-C) based
   mode (detailed in Section 4.5.2).  The modes differ depending on
   whether the SVC data are allowed to be interleaved, i.e., to be
   transmitted in an order different than the intended decoding order,
   and they also differ in the mechanisms provided in order to recover
   the correct decoding order of the NAL units across the multiple RTP
   sessions.  These four MST modes reuse the packetization modes
   introduced in [RFC6184] for the packetization of NAL units in each of
   their individual RTP sessions.

   As the names of the MST packetization modes imply, the NI-T, NI-C,
   and NI-TC modes do not allow interleaved transmission, while the I-C
   mode allows interleaved transmission.  With any of the three non-
   interleaved MST packetization modes, legacy [RFC6184] receivers with
   implementation of the non-interleaved mode specified in [RFC6184] can
   join a multi-session transmission of SVC, to receive the base RTP
   session encapsulated according to [RFC6184].
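
   As a reading aid, the properties of the four MST packetization modes
   named above can be collected in a small table.  The sketch below is
   not normative: the MstMode type, the MST_MODES table, and the helper
   function are hypothetical names, and the timestamp and CS-DON columns
   are inferred from the mode names only.  The helper assumes a legacy
   receiver that implements only the non-interleaved mode of [RFC6184].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MstMode:
    name: str
    interleaved: bool      # may transmission order differ from decoding order
    uses_timestamps: bool  # order recovery relies on timestamps (from name)
    uses_cs_don: bool      # order recovery relies on CS-DON (from name)


MST_MODES = {
    m.name: m for m in (
        MstMode("NI-T", False, True, False),
        MstMode("NI-C", False, False, True),
        MstMode("NI-TC", False, True, True),
        MstMode("I-C", True, False, True),
    )
}


def legacy_non_interleaved_receiver_can_join(mode_name: str) -> bool:
    """True for the three non-interleaved MST modes.

    Assumption: the receiver implements only the non-interleaved mode of
    RFC 6184, so it cannot consume a base session sent in I-C mode.
    """
    return not MST_MODES[mode_name].interleaved
```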

1.2.3.  New Payload Structures

   [RFC6184] specifies three basic payload structures, namely, single
   NAL unit packet, aggregation packet, and fragmentation unit.
   Depending on the basic payload structure, an RTP packet may contain a
   NAL unit not aggregating other NAL units, one or more NAL units
   aggregated in another NAL unit, or a fragment of a NAL unit not
   aggregating other NAL units.  Each NAL unit of a type specified in
   [H.264] (i.e., 1 to 23, inclusive) may be carried in its entirety in
   a single NAL unit packet, may be aggregated in an aggregation packet,
   or may be fragmented and carried in a number of fragmentation unit
   packets.  To enable aggregation or fragmentation of NAL units while
   still ensuring that the RTP packet payload is only composed of NAL
   units, [RFC6184] introduced six new NAL unit types (24-29) to be used
   as payload structures, selected from the NAL unit types left
   unspecified in [H.264].
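
   The type ranges mentioned above can be illustrated by parsing the
   one-byte NAL unit header, which carries a 1-bit F field, a 2-bit NRI
   field, and a 5-bit Type field.  The following is a minimal sketch;
   the function names and the returned category strings are
   hypothetical and chosen only for this example.

```python
from typing import Tuple


def parse_nal_header(first_byte: int) -> Tuple[int, int, int]:
    """Split the one-byte NAL unit header into (F, NRI, Type)."""
    f = (first_byte >> 7) & 0x01    # forbidden_zero_bit
    nri = (first_byte >> 5) & 0x03  # nal_ref_idc
    nal_type = first_byte & 0x1F    # nal_unit_type
    return f, nri, nal_type


def classify_nal_type(nal_type: int) -> str:
    """Map a NAL unit type to the ranges named in the text above."""
    if 1 <= nal_type <= 23:
        return "h264"                        # specified in H.264
    if 24 <= nal_type <= 29:
        return "rfc6184-payload-structure"   # aggregation/fragmentation
    return "other"                           # not covered by this sketch
```

   For example, a first byte of 0x7C parses to Type 28, which falls in
   the range of the payload structures introduced by [RFC6184].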

   This memo reuses all the payload structures used in [RFC6184].
   Furthermore, three new types of NAL units are defined: payload
   content scalability information (PACSI) NAL unit, empty NAL unit, and
   non-interleaved multi-time aggregation packet (NI-MTAP) (specified in
   Sections 4.9, 4.10, and 4.7.1, respectively).

   PACSI NAL units may be used for the following purposes:

   o  To enable MANEs to decide whether to forward, process, or discard
      aggregation packets, by checking in PACSI NAL units the
      scalability information and other characteristics of the
      aggregated NAL units, rather than looking into the aggregated NAL
      units themselves, which are defined by the video coding
      specification.

   o  To enable correct decoding order recovery in MST using the NI-C or
      NI-TC mode, with the help of the CS-DON information included in
      PACSI NAL units.

   o  To improve resilience to packet losses, e.g., by utilizing the
      following data or information included in PACSI NAL units:
      repeated Supplemental Enhancement Information (SEI) messages,
      information regarding the start and end of layer representations,
      and the indices to layer representations of the lowest temporal
      subset.

   Empty NAL units may be used to enable correct decoding order recovery
   in MST using the NI-T or NI-TC mode.  NI-MTAP NAL units may be used
   to aggregate NAL units from multiple access units but without
   interleaving.

2.  Conventions

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in BCP 14, RFC 2119
   [RFC2119].

   This specification uses the notion of setting and clearing a bit when
   bit fields are handled.  Setting a bit is the same as assigning that
   bit the value of 1 (On).  Clearing a bit is the same as assigning
   that bit the value of 0 (Off).
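
   In code, setting and clearing a bit correspond to the usual mask
   operations.  The helper names below are hypothetical, and bit
   position 0 is assumed to denote the least significant bit.

```python
def set_bit(value: int, position: int) -> int:
    """Return value with the bit at position assigned 1 (On)."""
    return value | (1 << position)


def clear_bit(value: int, position: int) -> int:
    """Return value with the bit at position assigned 0 (Off)."""
    return value & ~(1 << position)


def is_set(value: int, position: int) -> bool:
    """Report whether the bit at position is 1."""
    return (value >> position) & 1 == 1
```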

3.  Definitions and Abbreviations

3.1.  Definitions

   This document uses the terms and definitions of [H.264].  Section
   3.1.1 lists relevant definitions copied from [H.264] for convenience.

   When there is a discrepancy, the definitions in [H.264] take
   precedence.  Section 3.1.2 gives definitions specific to this memo.
   Some of the definitions in Section 3.1.2 are also present in
   [RFC6184] and copied here with slight adaptations as needed.

3.1.1.  Definitions from the SVC Specification

   access unit: A set of NAL units always containing exactly one primary
   coded picture.  In addition to the primary coded picture, an access
   unit may also contain one or more redundant coded pictures, one
   auxiliary coded picture, or other NAL units not containing slices or
   slice data partitions of a coded picture.  The decoding of an access
   unit always results in a decoded picture.

   base layer: A bitstream subset that contains all the NAL units with
   the nal_unit_type syntax element equal to 1 or 5 of the bitstream and
   does not contain any NAL unit with the nal_unit_type syntax element
   equal to 14, 15, or 20 and conforms to one or more of the profiles
   specified in Annex A of [H.264].

   base quality layer representation: The layer representation of the
   target dependency representation of an access unit that is associated
   with the quality_id syntax element equal to 0.

   coded video sequence: A sequence of access units that consists, in
   decoding order, of an IDR access unit followed by zero or more non-
   IDR access units including all subsequent access units up to but not
   including any subsequent IDR access unit.

   dependency representation: A subset of Video Coding Layer (VCL) NAL
   units within an access unit that are associated with the same value
   of the dependency_id syntax element, which is provided as part of the
   NAL unit header or by an associated prefix NAL unit.  A dependency
   representation consists of one or more layer representations.

   IDR access unit: An access unit in which the primary coded picture is
   an IDR picture.

   IDR picture: Instantaneous decoding refresh picture.  A coded picture
   in which all slices of the target dependency representation within
   the access unit are I or EI slices that causes the decoding process
   to mark all reference pictures as "unused for reference" immediately
   after decoding the IDR picture.  After the decoding of an IDR picture
   all following coded pictures in decoding order can be decoded without
   inter prediction from any picture decoded prior to the IDR picture.
   The first picture of each coded video sequence is an IDR picture.
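
   The relationship between IDR access units and coded video sequences
   can be sketched as a simple partitioning step.  The function below is
   hypothetical; it assumes that each access unit, given in decoding
   order, has already been reduced to a flag saying whether it is an IDR
   access unit.

```python
from typing import List


def split_coded_video_sequences(is_idr: List[bool]) -> List[List[int]]:
    """Group access unit indices into coded video sequences.

    Each sequence starts at an IDR access unit and runs up to, but not
    including, the next IDR access unit.
    """
    sequences: List[List[int]] = []
    for index, idr in enumerate(is_idr):
        if idr:
            sequences.append([index])       # an IDR opens a new sequence
        elif sequences:
            sequences[-1].append(index)     # non-IDR joins the current one
        # Access units before the first IDR belong to no sequence here.
    return sequences
```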

   layer representation: A subset of VCL NAL units within an access unit
   that are associated with the same values of the dependency_id and
   quality_id syntax elements, which are provided as part of the VCL NAL
   unit header or by an associated prefix NAL unit.  One or more layer
   representations represent a dependency representation.

   prefix NAL unit: A NAL unit with nal_unit_type equal to 14 that
   immediately precedes in decoding order a NAL unit with nal_unit_type
   equal to 1, 5, or 12.  The NAL unit that immediately succeeds in
   decoding order the prefix NAL unit is referred to as the associated
   NAL unit.  The prefix NAL unit contains data associated with the
   associated NAL unit, which are considered to be part of the
   associated NAL unit.

   reference base picture: A reference picture that is obtained by
   decoding a base quality layer representation with the nal_ref_idc
   syntax element not equal to 0 and the store_ref_base_pic_flag syntax
   element equal to 1 of an access unit and all layer representations of
   the access unit that are referred to by inter-layer prediction of the
   base quality layer representation.  A reference base picture is not
   an output of the decoding process, but the samples of a reference
   base picture may be used for inter prediction in the decoding process
   of subsequent pictures in decoding order.  Reference base picture is
   a collective term for a reference base field or a reference base
   frame.

   scalable bitstream: A bitstream with the property that one or more
   bitstream subsets that are not identical to the scalable bitstream
   form another bitstream that conforms to the SVC specification
   [H.264].

   target dependency representation: The dependency representation of an
   access unit that is associated with the largest value of the
   dependency_id syntax element for all dependency representations of
   the access unit.

   target layer representation: The layer representation of the target
   dependency representation of an access unit that is associated with
   the largest value of the quality_id syntax element for all layer
   representations of the target dependency representation of the access
   unit.
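
   The definitions of the target dependency representation, the target
   layer representation, and the base quality layer representation can
   be read as a selection over the (dependency_id, quality_id) pairs
   present in one access unit.  The sketch below uses hypothetical
   function names and assumes the pairs have already been extracted from
   the NAL unit headers.

```python
from typing import List, Tuple

LayerRep = Tuple[int, int]  # (dependency_id, quality_id)


def target_dependency_id(layer_reps: List[LayerRep]) -> int:
    """Largest dependency_id present in the access unit."""
    return max(did for did, _ in layer_reps)


def target_layer_representation(layer_reps: List[LayerRep]) -> LayerRep:
    """Largest quality_id within the target dependency representation."""
    did = target_dependency_id(layer_reps)
    qid = max(q for d, q in layer_reps if d == did)
    return did, qid


def base_quality_layer_representation(layer_reps: List[LayerRep]) -> LayerRep:
    """quality_id equal to 0 within the target dependency representation."""
    return target_dependency_id(layer_reps), 0
```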

3.1.2.  Definitions Specific to This Memo

   anchor layer representation: An anchor layer representation is such a
   layer representation that, if decoding of the operation point
   corresponding to the layer starts from the access unit containing
   this layer representation, all the following layer representations of
   the layer, in output order, can be correctly decoded.  The output
   order is defined in [H.264] as the order in which decoded pictures
   are output from the decoded picture buffer of the decoder.  As H.264
   does not specify the picture display process, this more general term
   is used instead of display order.  An anchor layer representation is
   a random access point to the layer to which the anchor layer
   representation belongs.  However, some layer representations,
   succeeding an anchor layer representation in decoding order but
   preceding the anchor layer representation in output order, may refer
   to earlier layer representations for inter prediction, and hence the
   decoding may be incorrect if random access is performed at the anchor
   layer representation.

   AVC base layer: The subset of the SVC base layer in which all prefix
   NAL units (type 14) are removed.  Note that this is equivalent to the
   term "base layer" as defined in Annex G of [H.264].

   base RTP session: When multi-session transmission is used, the RTP
   session that carries the RTP stream containing the T0 AVC base layer
   or the T0 SVC base layer, and zero or more enhancement layers.  This
   RTP session does not depend on any other RTP session as indicated by
   mechanisms defined in Section 7.2.3.  The base RTP session may carry
   NAL units of NAL unit type equal to 14 and 15.

   decoding order number (DON): A field in the payload structure or a
   derived variable indicating NAL unit decoding order.  Values of DON
   are in the range of 0 to 65535, inclusive.  After reaching the
   maximum value, the value of DON wraps around to 0.  Note that this
   definition also exists in [RFC6184] in exactly the same form.
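
   The wrap-around behavior of DON can be illustrated with modulo
   arithmetic.  The sketch below is in the spirit of the don_diff
   function of [RFC6184], but it is not a copy of it: the function
   names are hypothetical, and the handling of a distance of exactly
   32768 is an assumption of this example.

```python
DON_MODULUS = 65536  # DON values range from 0 to 65535, inclusive


def next_don(don: int) -> int:
    """Return the DON that follows don, wrapping from 65535 to 0."""
    return (don + 1) % DON_MODULUS


def don_distance(don_m: int, don_n: int) -> int:
    """Signed wrap-aware distance from don_m to don_n.

    A positive result means don_n follows don_m in decoding order.
    Assumption: a raw distance of exactly 32768 is reported as negative.
    """
    d = (don_n - don_m) % DON_MODULUS
    return d if d < DON_MODULUS // 2 else d - DON_MODULUS
```

   With this helper, don_distance(65535, 0) is 1, so a NAL unit with
   DON 0 is treated as following one with DON 65535.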

   Empty NAL unit: A NAL unit with NAL unit type equal to 31 and sub-
   type equal to 1.  An empty NAL unit consists of only the two-byte NAL
   unit header with an empty payload.
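
   A possible construction and detection of an empty NAL unit is
   sketched below.  It assumes the two-byte header layout in which the
   first byte carries F, NRI, and Type and the second byte carries a
   5-bit sub-type followed by three flag bits; the flag bits are set to
   zero here as an assumption of this example.  The function names are
   hypothetical.

```python
EMPTY_NAL_TYPE = 31
EMPTY_NAL_SUBTYPE = 1


def make_empty_nal_unit(nri: int = 0) -> bytes:
    """Build a two-byte empty NAL unit (header only, no payload)."""
    first = ((nri & 0x03) << 5) | EMPTY_NAL_TYPE  # F=0, NRI, Type=31
    second = EMPTY_NAL_SUBTYPE << 3               # sub-type, flag bits 0
    return bytes([first, second])


def is_empty_nal_unit(data: bytes) -> bool:
    """Check for Type 31, sub-type 1, and the absence of any payload."""
    return (len(data) == 2
            and (data[0] & 0x1F) == EMPTY_NAL_TYPE
            and (data[1] >> 3) == EMPTY_NAL_SUBTYPE)
```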

   enhancement RTP session: When multi-session transmission is used, an
   RTP session that is not the base RTP session.  An enhancement RTP
   session typically contains an RTP stream that depends on at least one
   other RTP session as indicated by mechanisms defined in Section
   7.2.3.  A lower RTP session to an enhancement RTP session is an RTP
   session on which the enhancement RTP session depends.  The lowest RTP
   session for a receiver is the RTP session that does not depend on any
   other RTP session received by the receiver.  The highest RTP session
   for a receiver is the RTP session on which no other RTP session
   received by the receiver depends.

   cross-session decoding order number (CS-DON): A derived variable
   indicating NAL unit decoding order number over all NAL units within
   all the session-multiplexed RTP sessions that carry the same SVC
   bitstream.

   default level: The level indicated by the profile-level-id parameter.
   In Session Description Protocol (SDP) Offer/Answer, the level is
   downgradable, i.e., the answer may either use the default level or a
   lower level.  Note that this definition also exists in [RFC6184] in a
   slightly different form.

   default sub-profile: The subset of coding tools, which may be all
   coding tools of one profile or the common subset of coding tools of
   more than one profile, indicated by the profile-level-id parameter.
   In SDP Offer/Answer, the default sub-profile must be used in a
   symmetric manner, i.e., the answer must either use the same sub-
   profile as the offer or reject the offer.  Note that this definition
   also exists in [RFC6184] in a slightly different form.

   enhancement layer: A layer in which at least one of the values of
   dependency_id or quality_id is higher than 0, or a layer in which
   none of the NAL units is associated with the value of temporal_id
   equal to 0.  An operation point constructed using the maximum
   temporal_id, dependency_id, and quality_id values associated with an
   enhancement layer may or may not conform to one or more of the
   profiles specified in Annex A of [H.264].

   H.264/AVC compatible: The property of a bitstream subset of
   conforming to one or more of the profiles specified in Annex A of
   [H.264].

   intra layer representation:  A layer representation that contains
   only slices that use intra prediction, and hence do not refer to any
   earlier layer representation in decoding order in the same layer.
   Note that in SVC intra prediction includes intra-layer intra
   prediction as well as inter-layer intra prediction.

   layer: A bitstream subset in which all NAL units of type 1, 5, 12,
   14, or 20 have the same values of dependency_id and quality_id,
   either directly through their NAL unit header (for NAL units of type
   14 or 20) or through association to a prefix (type 14) NAL unit (for
   NAL unit type 1, 5, or 12).  A layer may contain NAL units associated
   with more than one value of temporal_id.

   media-aware network element (MANE): A network element, such as a
   middlebox or application layer gateway, that is capable of parsing
   certain aspects of the RTP payload headers or the RTP payload and
   reacting to their contents.  Note that this definition also exists in
   [RFC6184] in exactly the same form.

      Informative note: The concept of a MANE goes beyond normal routers
      or gateways in that a MANE has to be aware of the signaling (e.g.,
      to learn about the payload type mappings of the media streams),
      and in that it has to be trusted when working with Secure Real-
      time Transport Protocol (SRTP).  The advantage of using MANEs is
      that they allow packets to be dropped according to the needs of
      the media coding.  For example, if a MANE has to drop packets due
      to congestion on a certain link, it can identify and remove those
      packets whose elimination produces the least adverse effect on the
      user experience.  After dropping packets, MANEs must rewrite RTCP
      packets to match the changes to the RTP packet stream as specified
      in Section 7 of [RFC3550].

   multi-session transmission: The transmission mode in which the SVC
   stream is transmitted over multiple RTP sessions.  Dependency between
   RTP sessions MUST be signaled according to Section 7.2.3 of this
   memo.

   NAL unit decoding order: A NAL unit order that conforms to the
   constraints on NAL unit order given in Section G.7.4.1.2 in [H.264].
   Note that this definition also exists in [RFC6184] in a slightly
   different form.

   NALU-time: The value that the RTP timestamp would have if the NAL
   unit were transported in its own RTP packet.  Note that this
   definition also exists in [RFC6184] in exactly the same form.

   operation point: An operation point is identified by a set of values
   of temporal_id, dependency_id, and quality_id.  A bitstream
   corresponding to an operation point can be constructed by removing
   all NAL units associated with a higher value of dependency_id, and
   all NAL units associated with the same value of dependency_id but
   higher values of quality_id or temporal_id.  An operation point
   bitstream conforms to at least one of the profiles defined in Annex A
   or G of [H.264], and offers a representation of the original video
   signal at a certain fidelity.

      Informative note: Additional NAL units may be removed (with lower
      dependency_id or same dependency_id but lower quality_id) if they
      are not required for decoding the bitstream at the particular
      operation point.  The resulting bitstream, however, may no longer
      conform to any of the profiles defined in Annex A or G of [H.264].
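
   The construction of an operation point bitstream can be sketched as a
   filter over NAL units.  The example follows the wording of the
   definition literally: NAL units with a lower dependency_id are kept
   regardless of their quality_id and temporal_id, and the optional
   further removal mentioned in the informative note is not performed.
   The Nal type and the function name are hypothetical.

```python
from typing import List, NamedTuple


class Nal(NamedTuple):
    # Hypothetical reduced view of a NAL unit: only its layer identifiers.
    dependency_id: int
    quality_id: int
    temporal_id: int


def extract_operation_point(nals: List[Nal], did: int, qid: int,
                            tid: int) -> List[Nal]:
    """Keep the NAL units that belong to operation point (tid, did, qid)."""
    kept: List[Nal] = []
    for n in nals:
        if n.dependency_id > did:
            continue  # higher dependency_id: removed
        if n.dependency_id == did and (n.quality_id > qid
                                       or n.temporal_id > tid):
            continue  # same dependency_id, higher quality_id or temporal_id
        kept.append(n)  # everything else is retained, in original order
    return kept
```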

   operation point representation: The set of all NAL units of an
   operation point within the same access unit.

   RTP packet stream: A sequence of RTP packets with increasing sequence
   numbers (except for wrap-around), identical payload type and
   identical SSRC (Synchronization Source), carried in one RTP session.
   Within the scope of this memo, one RTP packet stream is utilized to
   transport one or more layers.

   single-session transmission: The transmission mode in which the SVC
   bitstream is transmitted over a single RTP session.

   SVC base layer: The layer that includes all NAL units associated with
   dependency_id and quality_id values both equal to 0, including prefix
   NAL units (NAL unit type 14).

   SVC enhancement layer: A layer in which at least one of the values of
   dependency_id or quality_id is higher than 0.  An operation point
   constructed using the maximum dependency_id and quality_id values and
   any temporal_id value associated with an SVC enhancement layer does
   not conform to any of the profiles specified in Annex A of [H.264].

   SVC NAL unit: A NAL unit of NAL unit type 14, 15, or 20 as specified
   in Annex G of [H.264].

   SVC NAL unit header: A four-byte header resulting from the addition
   of a three-byte SVC-specific header extension added in NAL unit types
   14 and 20.
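
   A parser for the four-byte SVC NAL unit header might look as follows.
   The bit layout assumed here is that of the NAL unit header SVC
   extension in Annex G of [H.264]: a reserved bit, idr_flag, a 6-bit
   priority_id, no_inter_layer_pred_flag, a 3-bit dependency_id, a 4-bit
   quality_id, a 3-bit temporal_id, three one-bit flags, and two
   reserved bits.  The SvcNalHeader type and the function name are
   hypothetical.

```python
from typing import NamedTuple


class SvcNalHeader(NamedTuple):
    nal_ref_idc: int
    nal_unit_type: int
    idr_flag: int
    priority_id: int
    no_inter_layer_pred_flag: int
    dependency_id: int
    quality_id: int
    temporal_id: int
    use_ref_base_pic_flag: int
    discardable_flag: int
    output_flag: int


def parse_svc_nal_header(data: bytes) -> SvcNalHeader:
    """Parse the one-byte NAL header plus the three-byte SVC extension."""
    if len(data) < 4:
        raise ValueError("SVC NAL unit header needs four bytes")
    b0, b1, b2, b3 = data[0], data[1], data[2], data[3]
    nal_type = b0 & 0x1F
    if nal_type not in (14, 20):
        raise ValueError("only NAL unit types 14 and 20 carry the extension")
    return SvcNalHeader(
        nal_ref_idc=(b0 >> 5) & 0x03,
        nal_unit_type=nal_type,
        idr_flag=(b1 >> 6) & 0x01,        # bit 7 of b1 is reserved
        priority_id=b1 & 0x3F,
        no_inter_layer_pred_flag=(b2 >> 7) & 0x01,
        dependency_id=(b2 >> 4) & 0x07,
        quality_id=b2 & 0x0F,
        temporal_id=(b3 >> 5) & 0x07,
        use_ref_base_pic_flag=(b3 >> 4) & 0x01,
        discardable_flag=(b3 >> 3) & 0x01,
        output_flag=(b3 >> 2) & 0x01,     # low two bits of b3 are reserved
    )
```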

   SVC RTP session: Either the base RTP session or an enhancement RTP
   session.

   T0 AVC base layer: A subset of the AVC base layer constructed by
   removing all VCL NAL units associated with temporal_id values higher
   than 0 and non-VCL NAL units and SEI messages associated only with
   the VCL NAL units being removed.

   T0 SVC base layer: A subset of the SVC base layer constructed by
   removing all VCL NAL units associated with temporal_id values higher
   than 0 as well as prefix NAL units, non-VCL NAL units, and SEI
   messages associated only with the VCL NAL units being removed.

   transmission order: The order of packets in ascending RTP sequence
   number order (in modulo arithmetic).  Within an aggregation packet,
   the NAL unit transmission order is the same as the order of
   appearance of NAL units in the packet.  Note that this definition
   also exists in [RFC6184] in exactly the same form.
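
   Ordering packets by sequence number in modulo arithmetic can be
   sketched as below.  The example assumes that the sequence number of
   the earliest packet is known and that the packets span fewer than
   65536 sequence numbers; the function name and the first_seq
   parameter are hypothetical.

```python
from typing import List


def transmission_order(seqs: List[int], first_seq: int) -> List[int]:
    """Sort 16-bit RTP sequence numbers in ascending modulo order.

    Each number is ranked by its distance from first_seq, so a wrap
    from 65535 to 0 does not disturb the order.
    """
    return sorted(seqs, key=lambda s: (s - first_seq) % 65536)
```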

3.2.  Abbreviations

   In addition to the abbreviations defined in [RFC6184], the following
   abbreviations are used in this memo.

      CGS:        Coarse-Grain Scalability
      CS-DON:     Cross-Session Decoding Order Number
      MGS:        Medium-Grain Scalability
      MST:        Multi-Session Transmission
      PACSI:      Payload Content Scalability Information
      SST:        Single-Session Transmission
      SNR:        Signal-to-Noise Ratio
      SVC:        Scalable Video Coding
