Network Working Group                                          S. Wenger
Request for Comments: 5104                                    U. Chandra
Category: Standards Track                                          Nokia
                                                           M. Westerlund
                                                               B. Burman
                                                                Ericsson
                                                           February 2008

Codec Control Messages in the RTP Audio-Visual Profile with Feedback (AVPF)

Status of This Memo

This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.

Abstract
This document specifies a few extensions to the messages defined in the Audio-Visual Profile with Feedback (AVPF). They are helpful primarily in conversational multimedia scenarios where centralized multipoint functionalities are in use. However, some are also usable in smaller multicast environments and point-to-point calls. The extensions discussed are messages related to the ITU-T Rec. H.271 Video Back Channel, Full Intra Request, Temporary Maximum Media Stream Bit Rate, and Temporal-Spatial Trade-off.
Table of Contents
1. Introduction
2. Definitions
   2.1. Glossary
   2.2. Terminology
   2.3. Topologies
3. Motivation
   3.1. Use Cases
   3.2. Using the Media Path
   3.3. Using AVPF
      3.3.1. Reliability
   3.4. Multicast
   3.5. Feedback Messages
      3.5.1. Full Intra Request Command
         3.5.1.1. Reliability
      3.5.2. Temporal-Spatial Trade-off Request and Notification
         3.5.2.1. Point-to-Point
         3.5.2.2. Point-to-Multipoint Using Multicast or Translators
         3.5.2.3. Point-to-Multipoint Using RTP Mixer
         3.5.2.4. Reliability
      3.5.3. H.271 Video Back Channel Message
         3.5.3.1. Reliability
      3.5.4. Temporary Maximum Media Stream Bit Rate Request and Notification
         3.5.4.1. Behavior for Media Receivers Using TMMBR
         3.5.4.2. Algorithm for Establishing Current Limitations
         3.5.4.3. Use of TMMBR in a Mixer-Based Multipoint Operation
         3.5.4.4. Use of TMMBR in Point-to-Multipoint Using Multicast or Translators
         3.5.4.5. Use of TMMBR in Point-to-Point Operation
         3.5.4.6. Reliability
4. RTCP Receiver Report Extensions
   4.1. Design Principles of the Extension Mechanism
   4.2. Transport Layer Feedback Messages
      4.2.1. Temporary Maximum Media Stream Bit Rate Request (TMMBR)
         4.2.1.1. Message Format
         4.2.1.2. Semantics
         4.2.1.3. Timing Rules
         4.2.1.4. Handling in Translators and Mixers
      4.2.2. Temporary Maximum Media Stream Bit Rate Notification (TMMBN)
         4.2.2.1. Message Format
         4.2.2.2. Semantics
         4.2.2.3. Timing Rules
         4.2.2.4. Handling by Translators and Mixers
   4.3. Payload-Specific Feedback Messages
      4.3.1. Full Intra Request (FIR)
         4.3.1.1. Message Format
         4.3.1.2. Semantics
         4.3.1.3. Timing Rules
         4.3.1.4. Handling of FIR Message in Mixers and Translators
         4.3.1.5. Remarks
      4.3.2. Temporal-Spatial Trade-off Request (TSTR)
         4.3.2.1. Message Format
         4.3.2.2. Semantics
         4.3.2.3. Timing Rules
         4.3.2.4. Handling of Message in Mixers and Translators
         4.3.2.5. Remarks
      4.3.3. Temporal-Spatial Trade-off Notification (TSTN)
         4.3.3.1. Message Format
         4.3.3.2. Semantics
         4.3.3.3. Timing Rules
         4.3.3.4. Handling of TSTN in Mixers and Translators
         4.3.3.5. Remarks
      4.3.4. H.271 Video Back Channel Message (VBCM)
         4.3.4.1. Message Format
         4.3.4.2. Semantics
         4.3.4.3. Timing Rules
         4.3.4.4. Handling of Message in Mixers or Translators
         4.3.4.5. Remarks
5. Congestion Control
6. Security Considerations
7. SDP Definitions
   7.1. Extension of the rtcp-fb Attribute
   7.2. Offer-Answer
   7.3. Examples
8. IANA Considerations
9. Contributors
10. Acknowledgements
11. References
   11.1. Normative References
   11.2. Informative References
1. Introduction
When the Audio-Visual Profile with Feedback (AVPF) [RFC4585] was developed, the main emphasis lay in the efficient support of point-to-point and small multipoint scenarios without centralized multipoint control. However, in practice, many small multipoint conferences operate utilizing devices known as Multipoint Control Units (MCUs). Long-standing experience of the conversational video conferencing industry suggests that there is a need for a few additional feedback messages, to support centralized multipoint conferencing efficiently. Some of the messages have applications beyond centralized multipoint, and this is indicated in the description of the message. This is especially true for the message intended to carry ITU-T Rec. H.271 [H.271] bit strings for Video Back Channel messages.

In Real-time Transport Protocol (RTP) [RFC3550] terminology, MCUs comprise mixers and translators. Most MCUs also include signaling support. During the development of this memo, it was noticed that there is considerable confusion in the community related to the use of terms such as mixer, translator, and MCU. In response to these concerns, a number of topologies have been identified that are of practical relevance to the industry, but are not documented in sufficient detail in [RFC3550]. These topologies are documented in [RFC5117], and understanding this memo requires previous or parallel study of [RFC5117].

Some of the messages defined here are forward only, in that they do not require an explicit notification to the message emitter that they have been received and/or indicating the message receiver's actions. Other messages require a response, leading to a two-way communication model that one could view as useful for control purposes. However, it is not the intention of this memo to open up RTP Control Protocol (RTCP) to a generalized control protocol. All mentioned messages have relatively strict real-time constraints, in the sense that their value diminishes with increased delay. This makes the use of more traditional control protocol means, such as Session Initiation Protocol (SIP) [RFC3261], undesirable when used for the same purpose. That is why this solution is recommended instead of "XML Schema for Media Control" [XML-MC], which uses SIP Info to transfer XML messages with similar semantics to what are defined in this memo. Furthermore, all messages are of a very simple format that can be easily processed by an RTP/RTCP sender/receiver. Finally, and most importantly, all messages relate only to the RTP stream with which they are associated, and not to any other property of a communication system. In particular, none of them relate to the properties of the access links traversed by the session.
2. Definitions
2.1. Glossary
AIMD   - Additive Increase Multiplicative Decrease
AVPF   - The extended RTP profile for RTCP-based feedback
FCI    - Feedback Control Information [RFC4585]
FEC    - Forward Error Correction
FIR    - Full Intra Request
MCU    - Multipoint Control Unit
MPEG   - Moving Picture Experts Group
PLI    - Picture Loss Indication
PR     - Packet rate
QP     - Quantizer Parameter
RTT    - Round trip time
SSRC   - Synchronization Source
TMMBN  - Temporary Maximum Media Stream Bit Rate Notification
TMMBR  - Temporary Maximum Media Stream Bit Rate Request
TSTN   - Temporal-Spatial Trade-off Notification
TSTR   - Temporal-Spatial Trade-off Request
VBCM   - Video Back Channel Message

2.2. Terminology
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Message:
   An RTCP feedback message [RFC4585] defined by this specification, of one of the following types:

   Request:
      Message that requires acknowledgement

   Command:
      Message that forces the receiver to an action

   Indication:
      Message that reports a situation

   Notification:
      Message that provides a notification that an event has occurred. Notifications are commonly generated in response to a Request.

   Note that, with the exception of "Notification", this terminology is in alignment with ITU-T Rec. H.245 [H245].
Decoder Refresh Point:
   A bit string, packetized in one or more RTP packets, that completely resets the decoder to a known state.

   Examples for "hard" decoder refresh points are Intra pictures in H.261, H.263, MPEG-1, MPEG-2, and MPEG-4 part 2, and Instantaneous Decoder Refresh (IDR) pictures in H.264. "Gradual" decoder refresh points may also be used; see for example [AVC]. While both "hard" and "gradual" decoder refresh points are acceptable in the scope of this specification, in most cases the user experience will benefit from using a "hard" decoder refresh point.

   A decoder refresh point also contains all header information above the picture layer (or equivalent, depending on the video compression standard) that is conveyed in-band. In H.264, for example, a decoder refresh point contains parameter set Network Adaptation Layer (NAL) units that generate parameter sets necessary for the decoding of the following slice/data partition NAL units (and that are not conveyed out of band).

Decoding:
   The operation of reconstructing the media stream.

Rendering:
   The operation of presenting (parts of) the reconstructed media stream to the user.

Stream thinning:
   The operation of removing some of the packets from a media stream. Stream thinning, preferably, is media-aware, implying that media packets are removed in the order of increasing relevance to the reproductive quality. However, even when employing media-aware stream thinning, most media streams quickly lose quality when subjected to increasing levels of thinning. Media-unaware stream thinning leads to even worse quality degradation. In contrast to transcoding, stream thinning is typically seen as a computationally lightweight operation.

Media:
   Often used (sometimes in conjunction with terms like bit rate, stream, sender, etc.) to identify the content of the forward RTP packet stream (carrying the codec data), to which the codec control message applies.
Media Stream:
   The stream of RTP packets labeled with a single Synchronization Source (SSRC) carrying the media (and also in some cases repair information such as retransmission or Forward Error Correction (FEC) information).

Total media bit rate:
   The total bits per second transferred in a media stream, measured at an observer-selected protocol layer and averaged over a reasonable timescale, the length of which depends on the application. In general, a media sender and a media receiver will observe different total media bit rates for the same stream, first because they may have selected different reference protocol layers, and second, because of changes in per-packet overhead along the transmission path. The goal with bit rate averaging is to be able to ignore any burstiness on very short timescales (e.g., below 100 ms) introduced by scheduling or link layer packetization effects.

Maximum total media bit rate:
   The upper limit on total media bit rate for a given media stream at a particular receiver and for its selected protocol layer. Note that this value cannot be measured on the received media stream. Instead, it needs to be calculated or determined through other means, such as quality of service (QoS) negotiations or local resource limitations. Also note that this value is an average (on a timescale that is reasonable for the application) and that it may be different from the instantaneous bit rate seen by packets in the media stream.

Overhead:
   All protocol header information required to convey a packet with media data from sender to receiver, from the application layer down to a pre-defined protocol level (for example, down to, and including, the IP header). Overhead may include, for example, IP, UDP, and RTP headers, any layer 2 headers, any Contributing Sources (CSRCs), RTP padding, and RTP header extensions. Overhead excludes any RTP payload headers and the payload itself.

Net media bit rate:
   The bit rate carried by a media stream, net of overhead. That is, the bits per second accounted for by encoded media, any applicable payload headers, and any directly associated meta payload information placed in the RTP packet. A typical example of the latter is redundancy data provided by the use of RFC 2198 [RFC2198]. Note that, unlike the total media bit rate, the net media bit rate will have the same value at the media sender and at the media receiver unless any mixing or translating of the media has occurred.

   For a given observer, the total media bit rate for a media stream is equal to the sum of the net media bit rate and the per-packet overhead as defined above multiplied by the packet rate.

Feasible region:
   The set of all combinations of packet rate and net media bit rate that do not exceed the restrictions in maximum media bit rate placed on a given media sender by the Temporary Maximum Media Stream Bit Rate Request (TMMBR) messages it has received. The feasible region will change as new TMMBR messages are received.

Bounding set:
   The set of TMMBR tuples, selected from all those received at a given media sender, that define the feasible region for that media sender. The media sender uses an algorithm such as that in section 3.5.4.2 to determine or iteratively approximate the current bounding set, and reports that set back to the media receivers in a Temporary Maximum Media Stream Bit Rate Notification (TMMBN) message.

2.3. Topologies
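Before turning to the topologies, the bit rate relations defined in Section 2.2 (total media bit rate, net media bit rate, overhead, and feasible region) can be illustrated with a small, non-normative Python sketch. It assumes, for illustration only, that each restriction received via TMMBR can be reduced to a pair consisting of a maximum total media bit rate and a per-packet overhead in bytes as seen by the requester. The names Restriction, total_media_bit_rate, and in_feasible_region are hypothetical and are not defined by this memo.

```python
from dataclasses import dataclass
from typing import Iterable

# Example overhead down to and including the IP header:
# IPv4 (20) + UDP (8) + fixed RTP header (12), in bytes.
IPV4_UDP_RTP_OVERHEAD_BYTES = 20 + 8 + 12


@dataclass(frozen=True)
class Restriction:
    """Hypothetical stand-in for one received TMMBR restriction."""
    max_total_bps: float   # maximum total media bit rate, bits per second
    overhead_bytes: int    # per-packet overhead at the requester's layer


def total_media_bit_rate(net_bps: float, packet_rate: float,
                         overhead_bytes: int) -> float:
    """Total = net media bit rate + per-packet overhead * packet rate."""
    return net_bps + 8 * overhead_bytes * packet_rate


def in_feasible_region(net_bps: float, packet_rate: float,
                       restrictions: Iterable[Restriction]) -> bool:
    """True if the (packet rate, net bit rate) point violates no restriction."""
    return all(
        total_media_bit_rate(net_bps, packet_rate, r.overhead_bytes)
        <= r.max_total_bps
        for r in restrictions
    )


if __name__ == "__main__":
    limits = [Restriction(100_000, 40), Restriction(90_000, 60)]
    # 64 kbit/s of net media at 50 packets/s with 40 bytes of overhead.
    print(total_media_bit_rate(64_000, 50, IPV4_UDP_RTP_OVERHEAD_BYTES))
    print(in_feasible_region(64_000, 50, limits))
    print(in_feasible_region(64_000, 100, limits))
```

In this sketch, the same net media bit rate falls outside the feasible region once the packet rate doubles, because the overhead term grows with the packet rate while the restrictions stay fixed. This is why the feasible region is expressed over both packet rate and net media bit rate rather than over bit rate alone.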
Please refer to [RFC5117] for an in-depth discussion. The topologies referred to throughout this memo are labeled (consistently with [RFC5117]) as follows:

Topo-Point-to-Point . . . . . Point-to-point communication
Topo-Multicast  . . . . . . . Multicast communication
Topo-Translator . . . . . . . Translator based
Topo-Mixer  . . . . . . . . . Mixer based
Topo-RTP-switch-MCU . . . . . RTP stream switching MCU
Topo-RTCP-terminating-MCU . . Mixer but terminating RTCP

3. Motivation
This section discusses the motivation and usage of the different video and media control messages. The video control messages have been under discussion for a long time, and a requirement document was drawn up [Basso]. That document has expired; however, we quote relevant sections of it to provide motivation and requirements.
3.1. Use Cases
There are a number of possible usages for the proposed feedback messages. Let us begin by looking through the use cases Basso et al. [Basso] proposed. Some of the use cases have been reformulated and comments have been added.

1. An RTP video mixer composes multiple encoded video sources into a single encoded video stream. Each time a video source is added, the RTP mixer needs to request a decoder refresh point from the video source, so as to start an uncorrupted prediction chain on the spatial area of the mixed picture occupied by the data from the new video source.

2. An RTP video mixer receives multiple encoded RTP video streams from conference participants, and dynamically selects one of the streams to be included in its output RTP stream. At the time of a bit stream change (determined through means such as voice activation or the user interface), the mixer requests a decoder refresh point from the remote source, in order to avoid using unrelated content as reference data for inter picture prediction. After requesting the decoder refresh point, the video mixer stops the delivery of the current RTP stream and monitors the RTP stream from the new source until it detects data belonging to the decoder refresh point. At that time, the RTP mixer starts forwarding the newly selected stream to the receiver(s).

3. An application needs to signal to the remote encoder that the desired trade-off between temporal and spatial resolution has changed. For example, one user may prefer a higher frame rate and a lower spatial quality, and another user may prefer the opposite. This choice is also highly content dependent. Many current video conferencing systems offer in the user interface a mechanism to make this selection, usually in the form of a slider. The mechanism is helpful in point-to-point, centralized multipoint and non-centralized multipoint uses.

4. Use case 4 of the Basso document applies only to Picture Loss Indication (PLI) as defined in AVPF [RFC4585] and is not reproduced here.

5. Use case 5 of the Basso document relates to a mechanism known as "freeze picture request". Sending freeze picture requests over a non-reliable forward RTCP channel has been identified as problematic. Therefore, no freeze picture request has been included in this memo, and the use case discussion is not reproduced here.
6. A video mixer dynamically selects one of the received video streams to be sent out to participants and tries to provide the highest bit rate possible to all participants, while minimizing stream trans-rating. One way of achieving this is to set up sessions with endpoints using the maximum bit rate accepted by each endpoint, and accepted by the call admission method used by the mixer. By means of commands that reduce the maximum media stream bit rate below what has been negotiated during session set up, the mixer can reduce the maximum bit rate sent by endpoints to the lowest of all the accepted bit rates. As the lowest accepted bit rate changes due to endpoints joining and leaving or due to network congestion, the mixer can adjust the limits at which endpoints can send their streams to match the new value. The mixer then requests a new maximum bit rate, which is equal to or less than the maximum bit rate negotiated at session setup for a specific media stream, and the remote endpoint can respond with the actual bit rate that it can support.

The picture Basso, et al., draw up covers most applications we foresee. However, we would like to extend the list with two additional use cases:

7. Currently deployed congestion control algorithms (AIMD and TCP Friendly Rate Control (TFRC) [RFC3448]) probe for additional available capacity as long as there is something to send. With congestion control algorithms using packet loss as the indication for congestion, this probing generally results in reduced media quality (often to a point where the distortion is large enough to make the media unusable), due to packet loss and increased delay.

   In a number of deployment scenarios, especially cellular ones, the bottleneck link is often the last hop link. That cellular link also commonly has some type of QoS negotiation enabling the cellular device to learn the maximal bit rate available over this last hop. A media receiver behind this link can, in most (if not all) cases, calculate at least an upper bound for the bit rate available for each media stream it presently receives. How this is done is an implementation detail and not discussed herein. Indicating the maximum available bit rate to the transmitting party for the various media streams can be beneficial to prevent that party from probing for bandwidth for this stream in excess of a known hard limit. For cellular or other mobile devices, the known available bit rate for each stream (deduced from the link bit rate) can change quickly, due to handover to another transmission technology, QoS renegotiation due to congestion, etc. To enable minimal disruption of service, quick convergence is necessary, and therefore media path signaling is desirable.
8. The use of reference picture selection (RPS) as an error resilience tool was introduced in 1997 as NEWPRED [NEWPRED], and is now widely deployed. When RPS is in use, simplistically put, the receiver can send a feedback message to the sender, indicating a reference picture that should be used for future prediction. ([NEWPRED] mentions other forms of feedback as well.) AVPF contains a mechanism for conveying such a message, but does not specify the codec for which, or the syntax to which, the message should conform. Recently, the ITU-T finalized Rec. H.271, which (among other message types) also includes a feedback message. It is expected that this feedback message will fairly quickly enjoy wide support. Therefore, a mechanism to convey feedback messages according to H.271 appears to be desirable.

3.2. Using the Media Path
There are two reasons why we use the media path for the codec control messages.

First, systems employing MCUs often separate the control and media processing parts. As these messages are intended for or generated by the media part rather than the signaling part of the MCU, having them on the media path avoids transmission across interfaces and unnecessary control traffic between signaling and processing. If the MCU is physically decomposed, the use of the media path avoids the need for media control protocol extensions (e.g., in media gateway control (MEGACO) [RFC3525]).

Secondly, the signaling path quite commonly contains several signaling entities, e.g., SIP proxies and application servers. Bypassing these signaling entities avoids delay for several reasons. Proxies have less stringent delay requirements than media processing, and due to their complex and more generic nature may introduce significant processing delay. The topological locations of the signaling entities are also commonly not optimized for minimal delay, but rather towards other architectural goals. Thus, the signaling path can be significantly longer in both the geographical and the delay sense.

3.3. Using AVPF
The AVPF feedback message framework [RFC4585] provides the appropriate framework to implement the new messages. AVPF implements rules controlling the timing of feedback messages to avoid congestion through network flooding by RTCP traffic. We re-use these rules by referencing AVPF.
The signaling setup for AVPF allows each individual type of function to be configured or negotiated on an RTP session basis.

3.3.1. Reliability
The use of RTCP messages implies that each message transfer is unreliable, unless the lower layer transport provides reliability. The different messages proposed in this specification have different requirements in terms of reliability. However, in all cases, the reaction to an (occasional) loss of a feedback message is specified.

3.4. Multicast
The codec control messages might be used with multicast. The RTCP timing rules specified in [RFC3550] and [RFC4585] ensure that the messages do not cause overload of the RTCP connection. The use of multicast may result in the reception of messages with inconsistent semantics. The reaction to inconsistencies depends on the message type, and is discussed for each message type separately.