RFC 5104

Codec Control Messages in the RTP Audio-Visual Profile with Feedback (AVPF)

Pages: 64
Proposed Standard
Updated by: 7728 8082

RFC5104 - Page 1

Network Working Group                                          S. Wenger
Request for Comments: 5104                                    U. Chandra
Category: Standards Track                                          Nokia
                                                           M. Westerlund
                                                               B. Burman
                                                                Ericsson
                                                           February 2008


                     Codec Control Messages in the
             RTP Audio-Visual Profile with Feedback (AVPF)

Status of This Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Abstract

   This document specifies a few extensions to the messages defined in
   the Audio-Visual Profile with Feedback (AVPF).  They are helpful
   primarily in conversational multimedia scenarios where centralized
   multipoint functionalities are in use.  However, some are also usable
   in smaller multicast environments and point-to-point calls.

   The extensions discussed are messages related to the ITU-T Rec. H.271
   Video Back Channel, Full Intra Request, Temporary Maximum Media
   Stream Bit Rate, and Temporal-Spatial Trade-off.

Table of Contents

   1. Introduction ....................................................4
   2. Definitions .....................................................5
      2.1. Glossary ...................................................5
      2.2. Terminology ................................................5
      2.3. Topologies .................................................8
   3. Motivation ......................................................8
      3.1. Use Cases ..................................................9
      3.2. Using the Media Path ......................................11
      3.3. Using AVPF ................................................11
           3.3.1. Reliability ........................................12
      3.4. Multicast .................................................12
      3.5. Feedback Messages .........................................12
           3.5.1. Full Intra Request Command .........................12
                  3.5.1.1. Reliability ...............................13
           3.5.2. Temporal-Spatial Trade-off Request and
                  Notification .......................................14
                  3.5.2.1. Point-to-Point ............................15
                  3.5.2.2. Point-to-Multipoint Using
                           Multicast or Translators ..................15
                  3.5.2.3. Point-to-Multipoint Using RTP Mixer .......15
                  3.5.2.4. Reliability ...............................16
           3.5.3. H.271 Video Back Channel Message ...................16
                  3.5.3.1. Reliability ...............................19
           3.5.4. Temporary Maximum Media Stream Bit Rate
                  Request and Notification ...........................19
                  3.5.4.1. Behavior for Media Receivers Using TMMBR ..21
                  3.5.4.2. Algorithm for Establishing Current
                           Limitations ...............................23
                  3.5.4.3. Use of TMMBR in a Mixer-Based
                           Multipoint Operation ......................29
                  3.5.4.4. Use of TMMBR in Point-to-Multipoint Using
                           Multicast or Translators ..................30
                  3.5.4.5. Use of TMMBR in Point-to-Point Operation ..31
                  3.5.4.6. Reliability ...............................31
   4. RTCP Receiver Report Extensions ................................32
      4.1. Design Principles of the Extension Mechanism ..............32
      4.2. Transport Layer Feedback Messages .........................33
           4.2.1. Temporary Maximum Media Stream Bit Rate
                  Request (TMMBR) ....................................34
                  4.2.1.1. Message Format ............................34
                  4.2.1.2. Semantics .................................35
                  4.2.1.3. Timing Rules ..............................39
                  4.2.1.4. Handling in Translators and Mixers ........39
           4.2.2. Temporary Maximum Media Stream Bit Rate
                  Notification (TMMBN) ...............................39
                  4.2.2.1. Message Format ............................39
                  4.2.2.2. Semantics .................................40
                  4.2.2.3. Timing Rules ..............................41
                  4.2.2.4. Handling by Translators and Mixers ........41
      4.3. Payload-Specific Feedback Messages ........................41
           4.3.1. Full Intra Request (FIR) ...........................42
                  4.3.1.1. Message Format ............................42
                  4.3.1.2. Semantics .................................43
                  4.3.1.3. Timing Rules ..............................44
                  4.3.1.4. Handling of FIR Message in Mixers and
                           Translators ...............................44
                  4.3.1.5. Remarks ...................................44
           4.3.2. Temporal-Spatial Trade-off Request (TSTR) ..........45
                  4.3.2.1. Message Format ............................46
                  4.3.2.2. Semantics .................................46
                  4.3.2.3. Timing Rules ..............................47
                  4.3.2.4. Handling of Message in Mixers and
                           Translators ...............................47
                  4.3.2.5. Remarks ...................................47
           4.3.3. Temporal-Spatial Trade-off Notification (TSTN) .....48
                  4.3.3.1. Message Format ............................48
                  4.3.3.2. Semantics .................................49
                  4.3.3.3. Timing Rules ..............................49
                  4.3.3.4. Handling of TSTN in Mixers and
                           Translators ...............................49
                  4.3.3.5. Remarks ...................................49
           4.3.4. H.271 Video Back Channel Message (VBCM) ............50
                  4.3.4.1. Message Format ............................50
                  4.3.4.2. Semantics .................................51
                  4.3.4.3. Timing Rules ..............................52
                  4.3.4.4. Handling of Message in Mixers or
                           Translators ...............................52
                  4.3.4.5. Remarks ...................................52
   5. Congestion Control .............................................52
   6. Security Considerations ........................................53
   7. SDP Definitions ................................................54
      7.1. Extension of the rtcp-fb Attribute ........................54
      7.2. Offer-Answer ..............................................55
      7.3. Examples ..................................................56
   8. IANA Considerations ............................................58
   9. Contributors ...................................................60
   10. Acknowledgements ..............................................60
   11. References ....................................................60
      11.1. Normative References .....................................60
      11.2. Informative References ...................................61

1.  Introduction

   When the Audio-Visual Profile with Feedback (AVPF) [RFC4585] was
   developed, the main emphasis lay in the efficient support of point-
   to-point and small multipoint scenarios without centralized
   multipoint control.  However, in practice, many small multipoint
   conferences operate utilizing devices known as Multipoint Control
   Units (MCUs).  Long-standing experience of the conversational video
   conferencing industry suggests that there is a need for a few
   additional feedback messages, to support centralized multipoint
   conferencing efficiently.  Some of the messages have applications
   beyond centralized multipoint, and this is indicated in the
   description of the message.  This is especially true for the message
   intended to carry ITU-T Rec. H.271 [H.271] bit strings for Video Back
   Channel messages.

   In Real-time Transport Protocol (RTP) [RFC3550] terminology, MCUs
   comprise mixers and translators.  Most MCUs also include signaling
   support.  During the development of this memo, it was noticed that
   there is considerable confusion in the community related to the use
   of terms such as mixer, translator, and MCU.  In response to these
   concerns, a number of topologies have been identified that are of
   practical relevance to the industry, but are not documented in
   sufficient detail in [RFC3550].  These topologies are documented in
   [RFC5117], and understanding this memo requires previous or parallel
   study of [RFC5117].

   Some of the messages defined here are forward only, in that they do
   not require an explicit notification to the message emitter that
   confirms their reception or indicates the message receiver's actions.
   Other messages require a response, leading to a two-way communication
   model that one could view as useful for control purposes.  However,
   it is not the intention of this memo to open up RTP Control Protocol
   (RTCP) to a generalized control protocol.  All mentioned messages
   have relatively strict real-time constraints, in the sense that their
   value diminishes with increased delay.  This makes the use of more
   traditional control protocol means, such as Session Initiation
   Protocol (SIP) [RFC3261], undesirable when used for the same purpose.
   That is why this solution is recommended instead of "XML Schema for
   Media Control" [XML-MC], which uses SIP Info to transfer XML messages
   with semantics similar to those defined in this memo.
   Furthermore, all messages are of a very simple format that can be
   easily processed by an RTP/RTCP sender/receiver.  Finally, and most
   importantly, all messages relate only to the RTP stream with which
   they are associated, and not to any other property of a communication
   system.  In particular, none of them relate to the properties of the
   access links traversed by the session.

2.  Definitions

2.1.  Glossary

   AIMD   - Additive Increase Multiplicative Decrease
   AVPF   - The extended RTP profile for RTCP-based feedback
   FCI    - Feedback Control Information [RFC4585]
   FEC    - Forward Error Correction
   FIR    - Full Intra Request
   MCU    - Multipoint Control Unit
   MPEG   - Moving Picture Experts Group
   PLI    - Picture Loss Indication
   PR     - Packet rate
   QP     - Quantizer Parameter
   RTT    - Round trip time
   SSRC   - Synchronization Source
   TMMBN  - Temporary Maximum Media Stream Bit Rate Notification
   TMMBR  - Temporary Maximum Media Stream Bit Rate Request
   TSTN   - Temporal-Spatial Trade-off Notification
   TSTR   - Temporal-Spatial Trade-off Request
   VBCM   - Video Back Channel Message

2.2.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

      Message:
          An RTCP feedback message [RFC4585] defined by this
          specification, of one of the following types:

          Request:
              Message that requires acknowledgement

          Command:
              Message that forces the receiver to an action

          Indication:
              Message that reports a situation

          Notification:
              Message that provides a notification that an event has
              occurred.  Notifications are commonly generated in
              response to a Request.

   Note that, with the exception of "Notification", this terminology is
   in alignment with ITU-T Rec. H.245 [H245].

   Decoder Refresh Point:
          A bit string, packetized in one or more RTP packets, that
          completely resets the decoder to a known state.

          Examples of "hard" decoder refresh points are Intra pictures
          in H.261, H.263, MPEG-1, MPEG-2, and MPEG-4 part 2, and
          Instantaneous Decoder Refresh (IDR) pictures in H.264.
          "Gradual" decoder refresh points may also be used; see for
          example [AVC].  While both "hard" and "gradual" decoder
          refresh points are acceptable in the scope of this
          specification, in most cases the user experience will benefit
          from using a "hard" decoder refresh point.

          A decoder refresh point also contains all header information
          above the picture layer (or equivalent, depending on the video
          compression standard) that is conveyed in-band.  In H.264, for
          example, a decoder refresh point contains parameter set
          Network Abstraction Layer (NAL) units that generate parameter
          sets necessary for the decoding of the following slice/data
          partition NAL units (and that are not conveyed out of band).

   Decoding:
          The operation of reconstructing the media stream.

   Rendering:
          The operation of presenting (parts of) the reconstructed media
          stream to the user.

   Stream thinning:
          The operation of removing some of the packets from a media
          stream.  Stream thinning, preferably, is media-aware, implying
          that media packets are removed in the order of increasing
          relevance to the reproductive quality.  However, even when
          employing media-aware stream thinning, most media streams
          quickly lose quality when subjected to increasing levels of
          thinning.  Media-unaware stream thinning leads to even worse
          quality degradation.  In contrast to transcoding, stream
          thinning is typically seen as a computationally lightweight
          operation.

   Media:
          Often used (sometimes in conjunction with terms like bit rate,
          stream, sender, etc.) to identify the content of the forward
          RTP packet stream (carrying the codec data), to which the
          codec control message applies.

   Media Stream:
          The stream of RTP packets labeled with a single
          Synchronization Source (SSRC) carrying the media (and also in
          some cases repair information such as retransmission or
          Forward Error Correction (FEC) information).

   Total media bit rate:
          The total bits per second transferred in a media stream,
          measured at an observer-selected protocol layer and averaged
          over a reasonable timescale, the length of which depends on
          the application.  In general, a media sender and a media
          receiver will observe different total media bit rates for the
          same stream, first because they may have selected different
          reference protocol layers, and second, because of changes in
          per-packet overhead along the transmission path.  The goal
          with bit rate averaging is to be able to ignore any burstiness
          on very short timescales (e.g., below 100 ms) introduced by
          scheduling or link layer packetization effects.
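
          As an illustration only, the averaging described above can be
          sketched as a sliding-window estimator.  The class name, the
          1-second default window, and the method names are assumptions
          of this sketch and are not defined by this specification; the
          caller decides at which protocol layer packet sizes are
          counted.

```python
from collections import deque


class WindowedBitRate:
    """Average bit rate over a sliding time window (hypothetical helper)."""

    def __init__(self, window_s=1.0):
        # The window length is application dependent; 1 s is only an example.
        self.window_s = window_s
        self.samples = deque()  # (arrival_time_s, size_bytes)
        self.total_bytes = 0

    def add_packet(self, arrival_time_s, size_bytes):
        # size_bytes must be counted at the observer's chosen protocol layer.
        self.samples.append((arrival_time_s, size_bytes))
        self.total_bytes += size_bytes
        self._expire(arrival_time_s)

    def _expire(self, now_s):
        # Drop packets that have left the averaging window.
        while self.samples and self.samples[0][0] <= now_s - self.window_s:
            _, size = self.samples.popleft()
            self.total_bytes -= size

    def bits_per_second(self, now_s):
        self._expire(now_s)
        return self.total_bytes * 8 / self.window_s
```

          Because the window is much longer than 100 ms, short bursts
          caused by scheduling or link layer packetization are smoothed
          out of the reported value.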

   Maximum total media bit rate:
          The upper limit on total media bit rate for a given media
          stream at a particular receiver and for its selected protocol
          layer.  Note that this value cannot be measured on the
          received media stream.  Instead, it needs to be calculated or
          determined through other means, such as quality of service
          (QoS) negotiations or local resource limitations.  Also note
          that this value is an average (on a timescale that is
          reasonable for the application) and that it may be different
          from the instantaneous bit rate seen by packets in the media
          stream.

   Overhead:
          All protocol header information required to convey a packet
          with media data from sender to receiver, from the application
          layer down to a pre-defined protocol level (for example, down
          to, and including, the IP header).  Overhead may include, for
          example, IP, UDP, and RTP headers, any layer 2 headers, any
          Contributing Sources (CSRCs), RTP padding, and RTP header
          extensions.  Overhead excludes any RTP payload headers and the
          payload itself.

   Net media bit rate:
          The bit rate carried by a media stream, net of overhead.  That
          is, the bits per second accounted for by encoded media, any
          applicable payload headers, and any directly associated meta
          payload information placed in the RTP packet.  A typical
          example of the latter is redundancy data provided by the use
          of RFC 2198 [RFC2198].  Note that, unlike the total media bit
          rate, the net media bit rate will have the same value at the
          media sender and at the media receiver unless any mixing or
          translating of the media has occurred.

          For a given observer, the total media bit rate for a media
          stream is equal to the sum of the net media bit rate and the
          per-packet overhead as defined above multiplied by the packet
          rate.
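
          The relationship above can be written out as a small worked
          example.  The function name and the choice of units (bits per
          second, bytes of overhead per packet, packets per second) are
          assumptions made for this sketch.

```python
def total_media_bit_rate(net_bit_rate_bps, overhead_bytes_per_packet,
                         packet_rate_pps):
    """Total bit rate = net bit rate + per-packet overhead * packet rate."""
    overhead_bits_per_packet = 8 * overhead_bytes_per_packet
    return net_bit_rate_bps + overhead_bits_per_packet * packet_rate_pps


# Example: 64 kbit/s of encoded media sent at 50 packets per second, with
# 40 bytes of IPv4 (20) + UDP (8) + RTP (12) header overhead per packet.
print(total_media_bit_rate(64_000, 40, 50))  # prints 80000
```

          An observer that counts overhead down to a different protocol
          layer would use a different per-packet overhead and therefore
          obtain a different total for the same stream.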

   Feasible region:
          The set of all combinations of packet rate and net media bit
          rate that do not exceed the restrictions in maximum media bit
          rate placed on a given media sender by the Temporary Maximum
          Media Stream Bit Rate Request (TMMBR) messages it has
          received.  The feasible region will change as new TMMBR
          messages are received.
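
          A minimal sketch of a feasible-region membership test follows.
          It assumes that each TMMBR tuple carries a maximum total media
          bit rate and a per-packet overhead, as in the TMMBR message
          format defined later in this document; the type and function
          names are hypothetical.

```python
from typing import NamedTuple


class TmmbrTuple(NamedTuple):
    max_total_bit_rate_bps: float   # limit requested by one receiver
    overhead_bytes_per_packet: int  # overhead as measured by that receiver


def in_feasible_region(net_bit_rate_bps, packet_rate_pps, tuples):
    """True if the operating point respects every received TMMBR limit."""
    return all(
        net_bit_rate_bps + 8 * t.overhead_bytes_per_packet * packet_rate_pps
        <= t.max_total_bit_rate_bps
        for t in tuples
    )
```

          With no tuples received, every operating point is feasible as
          far as this test is concerned; any session-level limits still
          apply.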

   Bounding set:
          The set of TMMBR tuples, selected from all those received at a
          given media sender, that define the feasible region for that
          media sender.  The media sender uses an algorithm such as that
          in section 3.5.4.2 to determine or iteratively approximate the
          current bounding set, and reports that set back to the media
          receivers in a Temporary Maximum Media Stream Bit Rate
          Notification (TMMBN) message.

2.3.  Topologies

   Please refer to [RFC5117] for an in-depth discussion.  The topologies
   referred to throughout this memo are labeled (consistently with
   [RFC5117]) as follows:

   Topo-Point-to-Point . . . . . Point-to-point communication
   Topo-Multicast  . . . . . . . Multicast communication
   Topo-Translator . . . . . . . Translator based
   Topo-Mixer  . . . . . . . . . Mixer based
   Topo-RTP-switch-MCU . . . . . RTP stream switching MCU
   Topo-RTCP-terminating-MCU . . Mixer but terminating RTCP

3.  Motivation

   This section discusses the motivation and usage of the different
   video and media control messages.  The video control messages have
   been under discussion for a long time, and a requirement document was
   drawn up [Basso].  That document has expired; however, we quote
   relevant sections of it to provide motivation and requirements.

3.1.  Use Cases

   There are a number of possible usages for the proposed feedback
   messages.  Let us begin by looking through the use cases Basso et al.
   [Basso] proposed.  Some of the use cases have been reformulated and
   comments have been added.

   1. An RTP video mixer composes multiple encoded video sources into a
      single encoded video stream.  Each time a video source is added,
      the RTP mixer needs to request a decoder refresh point from the
      video source, so as to start an uncorrupted prediction chain on
      the spatial area of the mixed picture occupied by the data from
      the new video source.

   2. An RTP video mixer receives multiple encoded RTP video streams
      from conference participants, and dynamically selects one of the
      streams to be included in its output RTP stream.  At the time of a
      bit stream change (determined through means such as voice
      activation or the user interface), the mixer requests a decoder
      refresh point from the remote source, in order to avoid using
      unrelated content as reference data for inter picture prediction.
      After requesting the decoder refresh point, the video mixer stops
      the delivery of the current RTP stream and monitors the RTP stream
      from the new source until it detects data belonging to the decoder
      refresh point.  At that time, the RTP mixer starts forwarding the
      newly selected stream to the receiver(s).

   3. An application needs to signal to the remote encoder that the
      desired trade-off between temporal and spatial resolution has
      changed.  For example, one user may prefer a higher frame rate and
      a lower spatial quality, and another user may prefer the opposite.
      This choice is also highly content dependent.  Many current video
      conferencing systems offer in the user interface a mechanism to
      make this selection, usually in the form of a slider.  The
      mechanism is helpful in point-to-point, centralized multipoint and
      non-centralized multipoint uses.

   4. Use case 4 of the Basso document applies only to Picture Loss
      Indication (PLI) as defined in AVPF [RFC4585] and is not
      reproduced here.

   5. Use case 5 of the Basso document relates to a mechanism known as
      "freeze picture request".  Sending freeze picture requests over a
      non-reliable forward RTCP channel has been identified as
      problematic.  Therefore, no freeze picture request has been
      included in this memo, and the use case discussion is not
      reproduced here.

   6. A video mixer dynamically selects one of the received video
      streams to be sent out to participants and tries to provide the
      highest bit rate possible to all participants, while minimizing
      stream trans-rating.  One way of achieving this is to set up
      sessions with endpoints using the maximum bit rate accepted by
      each endpoint, and accepted by the call admission method used by
      the mixer.  By means of commands that reduce the maximum media
      stream bit rate below what has been negotiated during session set
      up, the mixer can reduce the maximum bit rate sent by endpoints to
      the lowest of all the accepted bit rates.  As the lowest accepted
      bit rate changes due to endpoints joining and leaving or due to
      network congestion, the mixer can adjust the limits at which
      endpoints can send their streams to match the new value.  The
      mixer then requests a new maximum bit rate, which is equal to or
      less than the maximum bit rate negotiated at session setup for a
      specific media stream, and the remote endpoint can respond with
      the actual bit rate that it can support.
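
   The rate selection in use case 6 can be sketched as follows.  This is
   an illustration under stated assumptions, not a mixer implementation:
   the function name and the dictionary of per-receiver accepted bit
   rates are hypothetical, and how a mixer learns those rates is not
   shown.

```python
def sender_limit_bps(negotiated_max_bps, accepted_bps_by_receiver):
    """Bit rate limit a mixer might request from a sender (illustrative)."""
    if not accepted_bps_by_receiver:
        # No receivers known: fall back to the session-level maximum.
        return negotiated_max_bps
    lowest_accepted = min(accepted_bps_by_receiver.values())
    # The request never exceeds what was negotiated at session setup.
    return min(negotiated_max_bps, lowest_accepted)
```

   The mixer would recompute this value whenever an endpoint joins or
   leaves, or when an endpoint's accepted bit rate changes, and request
   the new limit from the sending endpoints.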

   The picture Basso, et al., draw up covers most applications we
   foresee.  However, we would like to extend the list with two
   additional use cases:

   7. Currently deployed congestion control algorithms (AIMD and TCP
      Friendly Rate Control (TFRC) [RFC3448]) probe for additional
      available capacity as long as there is something to send.  With
      congestion control algorithms using packet loss as the indication
      for congestion, this probing generally results in reduced media
      quality (often to a point where the distortion is large enough to
      make the media unusable), due to packet loss and increased delay.

      In a number of deployment scenarios, especially cellular ones, the
      bottleneck link is often the last hop link.  That cellular link
      also commonly has some type of QoS negotiation enabling the
      cellular device to learn the maximal bit rate available over this
      last hop.  A media receiver behind this link can, in most (if not
      all) cases, calculate at least an upper bound for the bit rate
      available for each media stream it presently receives.  How this
      is done is an implementation detail and not discussed herein.
      Indicating the maximum available bit rate to the transmitting
      party for the various media streams can be beneficial to prevent
      that party from probing for bandwidth for this stream in excess of
      a known hard limit.  For cellular or other mobile devices, the
      known available bit rate for each stream (deduced from the link
      bit rate) can change quickly, due to handover to another
      transmission technology, QoS renegotiation due to congestion, etc.
      To enable minimal disruption of service, quick convergence is
      necessary, and therefore media path signaling is desirable.

   8. The use of reference picture selection (RPS) as an error
      resilience tool was introduced in 1997 as NEWPRED [NEWPRED], and
      is now widely deployed.  When RPS is in use, simplistically put,
      the receiver can send a feedback message to the sender,
      indicating a reference picture that should be used for future
      prediction.  ([NEWPRED] mentions other forms of feedback as
      well.)  AVPF contains a mechanism for conveying such a message,
      but does not specify the codec to which the message applies or
      the syntax to which it should conform.  Recently, the ITU-T
      finalized Rec. H.271, which (among other message types) also
      includes a feedback message.  It is expected that this feedback
      message will fairly quickly enjoy wide support.  Therefore, a
      mechanism to convey feedback messages according to H.271 appears
      to be desirable.

3.2.  Using the Media Path

   There are two reasons why we use the media path for the codec control
   messages.

   First, systems employing MCUs often separate the control and media
   processing parts.  As these messages are intended for or generated by
   the media part rather than the signaling part of the MCU, having them
   on the media path avoids transmission across interfaces and
   unnecessary control traffic between signaling and processing.  If the
   MCU is physically decomposed, the use of the media path avoids the
   need for media control protocol extensions (e.g., in media gateway
   control (MEGACO) [RFC3525]).

   Second, the signaling path quite commonly contains several
   signaling entities, e.g., SIP proxies and application servers.
   Bypassing these entities avoids delay for several reasons.
   Proxies have less stringent delay requirements than media
   processing, and due to their complex and more generic nature may
   introduce significant processing delay.  The topological locations
   of the signaling entities are also commonly not optimized for
   minimal delay, but rather towards other architectural goals.  Thus,
   the signaling path can be significantly longer in terms of both
   geographical distance and delay.

3.3.  Using AVPF

   The AVPF feedback message framework [RFC4585] provides the
   appropriate framework to implement the new messages.  AVPF implements
   rules controlling the timing of feedback messages to avoid congestion
   through network flooding by RTCP traffic.  We re-use these rules by
   referencing AVPF.

   The signaling setup for AVPF allows each individual type of function
   to be configured or negotiated on an RTP session basis.

3.3.1.  Reliability

   The use of RTCP messages implies that each message transfer is
   unreliable, unless the lower layer transport provides reliability.
   The different messages proposed in this specification have different
   requirements in terms of reliability.  However, in all cases, the
   reaction to an (occasional) loss of a feedback message is specified.

3.4.  Multicast

   The codec control messages might be used with multicast.  The RTCP
   timing rules specified in [RFC3550] and [RFC4585] ensure that the
   messages do not cause overload of the RTCP connection.  The use of
   multicast may result in the reception of messages with inconsistent
   semantics.  The reaction to inconsistencies depends on the message
   type, and is discussed for each message type separately.
