7.2. SDP Parameters
7.2.1. Mapping of Payload Type Parameters to SDP
The media type video/H264-SVC string is mapped to fields in the Session Description Protocol (SDP) as follows:

o  The media name in the "m=" line of SDP MUST be video.

o  The encoding name in the "a=rtpmap" line of SDP MUST be H264-SVC (the media subtype).

o  The clock rate in the "a=rtpmap" line MUST be 90000.

o  The OPTIONAL parameters profile-level-id, max-recv-level, max-recv-base-level, max-mbps, max-fs, max-cpb, max-dpb, max-br, redundant-pic-cap, in-band-parameter-sets, packetization-mode, sprop-interleaving-depth, deint-buf-cap, sprop-deint-buf-req, sprop-init-buf-time, sprop-max-don-diff, max-rcmd-nalu-size, mst-mode, sprop-mst-csdon-always-present, sprop-mst-remux-buf-size, sprop-remux-buf-req, remux-buf-cap, sprop-remux-init-buf-time, sprop-mst-max-don-diff, and scalable-layer-id, when present, MUST be included in the "a=fmtp" line of SDP. These parameters are expressed as a media type string, in the form of a semicolon-separated list of parameter=value pairs.

o  The OPTIONAL parameters sprop-parameter-sets, sprop-level-parameter-sets, sprop-scalability-info, sprop-operation-point-info, sprop-no-NAL-reordering-required, and sprop-avc-ready, when present, MUST be included in the "a=fmtp" line of SDP or conveyed using the "fmtp" source attribute as specified in Section 6.3 of [RFC5576]. For a particular media format (i.e., RTP payload type), a sprop-parameter-sets or sprop-level-parameter-sets MUST NOT be both included in the "a=fmtp" line of SDP and conveyed using the "fmtp" source attribute. When included in the "a=fmtp" line of SDP, these parameters are expressed as a media type string, in the form of a semicolon-separated list of parameter=value pairs. When conveyed using the "fmtp" source attribute, these parameters are only associated with the given source and payload type as parts of the "fmtp" source attribute.
Informative note: Conveyance of sprop-parameter-sets and sprop-level-parameter-sets using the "fmtp" source attribute allows for out-of-band transport of parameter sets in topologies like Topo-Video-switch-MCU [RFC5117].
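As an illustration of the semicolon-separated parameter=value form described above, the following Python sketch splits an "a=fmtp" line into a payload type and a parameter dictionary. The function name parse_fmtp is hypothetical and not part of this specification; the sketch assumes that parameter values contain no semicolons and tolerates a trailing semicolon.

```python
def parse_fmtp(line: str) -> tuple[int, dict[str, str]]:
    """Split an SDP "a=fmtp" line into (payload type, {parameter: value}).

    Hypothetical helper for illustration; it performs no validation of
    parameter names or values against the media type registration.
    """
    prefix = "a=fmtp:"
    if not line.startswith(prefix):
        raise ValueError("not an a=fmtp line")
    # The payload type is separated from the parameter list by a space.
    payload_type, _, params = line[len(prefix):].strip().partition(" ")
    parsed: dict[str, str] = {}
    for item in params.split(";"):
        item = item.strip()
        if not item:  # tolerate a trailing semicolon
            continue
        # Split on the first "=" only, so values that themselves contain
        # "=" (for example base64 padding) are kept intact.
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"malformed parameter: {item!r}")
        parsed[name.strip()] = value.strip()
    return int(payload_type), parsed


pt, params = parse_fmtp(
    "a=fmtp:97 profile-level-id=53000c; packetization-mode=1; "
    "sprop-parameter-sets={sps0},{pps0};"
)
```

Parameters conveyed with the "fmtp" source attribute of [RFC5576] carry an additional source identifier and are not handled by this sketch.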
7.2.2. Usage with the SDP Offer/Answer Model
When an SVC stream (with media subtype H264-SVC) is offered over RTP using SDP in an Offer/Answer model [RFC3264] for negotiation for unicast usage, the following limitations and rules apply:

o  The parameters identifying a media format configuration for SVC are profile-level-id, packetization-mode, and mst-mode. These media configuration parameters (except for the level part of profile-level-id) MUST be used symmetrically when the answerer does not include scalable-layer-id in the answer; i.e., the answerer MUST either maintain all configuration parameters or remove the media format (payload type) completely, if one or more of the parameter values are not supported. Note that the level part of profile-level-id includes level_idc, and, for indication of level 1b when profile_idc is equal to 66, 77, or 88, bit 4 (constraint_set3_flag) of profile-iop. The level part of profile-level-id is changeable.

   Informative note: The requirement for symmetric use does not apply for the level part of profile-level-id, and does not apply for the other stream properties and capability parameters.

   Informative note: In [H.264], all the levels except for Level 1b are equal to the value of level_idc divided by 10. Level 1b is a level higher than Level 1.0 but lower than Level 1.1, and is signaled in an ad hoc manner. For the Baseline, Main, and Extended profiles (with profile_idc equal to 66, 77, and 88, respectively), Level 1b is indicated by level_idc equal to 11 (i.e., the same as level 1.1) and constraint_set3_flag equal to 1. For other profiles, Level 1b is indicated by level_idc equal to 9 (but note that Level 1b for these profiles is still higher than Level 1, which has level_idc equal to 10, and lower than Level 1.1). In SDP Offer/Answer, an answer may indicate a level equal to or lower than the level indicated in the offer. Due to the ad hoc indication of Level 1b, offerers and answerers must check the value of bit 4 (constraint_set3_flag) of the middle octet of the parameter profile-level-id, when profile_idc is equal to 66, 77, or 88 and level_idc is equal to 11.

   To simplify handling and matching of these configurations, the same RTP payload type number used in the offer should also be used in the answer, as specified in [RFC3264]. The same RTP payload type number used in the offer MUST also be used in the answer when the answer includes scalable-layer-id. When the answer does not include scalable-layer-id, the answer MUST NOT contain a payload type number used in the offer unless the configuration is exactly the same as in the offer or the configuration in the answer only differs from that in the offer with a level lower than the default level offered.

   Informative note: When an offerer receives an answer that does not include scalable-layer-id, it has to compare payload types not declared in the offer based on the media type (i.e., video/H264-SVC) and the above media configuration parameters with any payload types it has already declared. This will enable it to determine whether the configuration in question is new or if it is equivalent to a configuration already offered, since a different payload type number may be used in the answer.

   Since an SVC stream may contain multiple operation points, a facility is provided so that an answerer can select a different operation point than the entire SVC stream. Specifically, different operation points MAY be described using the sprop-scalability-info or sprop-operation-point-info parameters. The first one carries the entire scalability information SEI message defined in Annex G of [H.264], whereas the second one may be derived, e.g., as a subset of this SEI message that only contains key information about an operation point. Operation points, in both cases, are associated with a layer identifier.

   If such information (sprop-operation-point-info or sprop-scalability-info) is provided in an offer, an answerer MAY select from the various operation points offered in the sprop-scalability-info or sprop-operation-point-info parameters by including scalable-layer-id in the answer. By this, the answerer indicates its selection of a particular operation point in the received and/or in the sent stream. When such operation point selection takes place, i.e., the answerer includes scalable-layer-id in the answer, the media configuration parameters MUST NOT be present in the answer. Rather, the media configuration that the answerer will use for receiving and/or sending is the one used for the selected operation point as indicated in the offer.

   Informative note: The ability to perform operation point selection enables a receiver to utilize the scalable nature of an SVC stream.

o  The parameter max-recv-level, when present, declares the highest level supported for receiving. In case max-recv-level is not present, the highest level supported for receiving is equal to the default level indicated by the level part of profile-level-id. max-recv-level, when present, MUST be higher than the default level.

o  The parameter max-recv-base-level, when present, declares the highest level of the base layer supported for receiving. When max-recv-base-level is not present, the highest level supported for the base layer is not constrained separately from the SVC stream containing the base layer. The endpoint at the other side MUST NOT send a scalable stream for which the base layer is of a level higher than max-recv-base-level. Parameters declaring receiver capabilities above the default level (max-mbps, max-smbps, max-fs, max-cpb, max-dpb, max-br, and max-recv-level) do not apply to the base layer when max-recv-base-level is present.

o  The parameters sprop-deint-buf-req, sprop-interleaving-depth, sprop-max-don-diff, sprop-init-buf-time, sprop-mst-csdon-always-present, sprop-remux-buf-req, sprop-mst-remux-buf-size, sprop-remux-init-buf-time, sprop-mst-max-don-diff, sprop-scalability-info, sprop-operation-point-info, sprop-no-NAL-reordering-required, and sprop-avc-ready describe the properties of the NAL unit stream that the offerer or answerer is sending for the media format configuration. This differs from the normal usage of the Offer/Answer parameters: normally such parameters declare the properties of the stream that the offerer or the answerer is able to receive. When dealing with SVC, the offerer assumes that the answerer will be able to receive media encoded using the configuration being offered.

   Informative note: The above parameters apply for any stream sent by the declaring entity with the same configuration; i.e., they are dependent on their source. Rather than being bound to the payload type, the values may have to be applied to another payload type when being sent, as they apply for the configuration.
o  The capability parameters max-mbps, max-fs, max-cpb, max-dpb, max-br, redundant-pic-cap, and max-rcmd-nalu-size MAY be used to declare further capabilities of the offerer or answerer for receiving. These parameters MUST NOT be present when the direction attribute is sendonly, and the parameters describe the limitations of what the offerer or answerer accepts for receiving streams.

o  When mst-mode is not present and packetization-mode is equal to 2, the following applies.

   o  An offerer has to include the size of the de-interleaving buffer, sprop-deint-buf-req, in the offer. To enable the offerer and answerer to inform each other about their capabilities for de-interleaving buffering, both parties are RECOMMENDED to include deint-buf-cap. It is also RECOMMENDED to consider offering multiple payload types with different buffering requirements when the capabilities of the receiver are unknown.

o  When mst-mode is present and equal to "NI-C", "NI-TC", or "I-C", the following applies.

   o  An offerer has to include sprop-remux-buf-req in the offer. To enable the offerer and answerer to inform each other about their capabilities for re-multiplexing buffering, both parties are RECOMMENDED to include remux-buf-cap. It is also RECOMMENDED to consider offering multiple payload types with different buffering requirements when the capabilities of the receiver are unknown.

o  The sprop-parameter-sets or sprop-level-parameter-sets parameter, when present (included in the "a=fmtp" line of SDP or conveyed using the "fmtp" source attribute as specified in Section 6.3 of [RFC5576]), is used for out-of-band transport of parameter sets. However, when out-of-band transport of parameter sets is used, parameter sets MAY still be additionally transported in-band. The answerer MAY use either out-of-band or in-band transport of parameter sets for the stream it is sending, regardless of whether out-of-band parameter sets transport has been used in the offerer-to-answerer direction. Parameter sets included in an answer are independent of those parameter sets included in the offer, as they are used for decoding two different video streams, one from the answerer to the offerer, and the other in the opposite direction.

   The following rules apply to transport of parameter sets in the offerer-to-answerer direction.

   o  An offer MAY include either or both of sprop-parameter-sets and sprop-level-parameter-sets. If neither sprop-parameter-sets nor sprop-level-parameter-sets is present in the offer, then only in-band transport of parameter sets is used.

   o  If the answer includes in-band-parameter-sets equal to 1, then the offerer MUST transmit parameter sets in-band. Otherwise, the following applies.

      o  If the level to use in the offerer-to-answerer direction is equal to the default level in the offer, the following applies.

         The answerer MUST be prepared to use the parameter sets included in sprop-parameter-sets, when present, for decoding the incoming NAL unit stream, and ignore sprop-level-parameter-sets, when present. When sprop-parameter-sets is not present in the offer, in-band transport of parameter sets MUST be used.

      o  Otherwise (the level to use in the offerer-to-answerer direction is not equal to the default level in the offer), the following applies.

         The answerer MUST be prepared to use the parameter sets that are included in sprop-level-parameter-sets for the accepted level (i.e., the default level in the answer, which is also the level to use in the offerer-to-answerer direction), when present, for decoding the incoming NAL unit stream, and ignore all other parameter sets included in sprop-level-parameter-sets and sprop-parameter-sets, when present. When no parameter sets for the accepted level are present in the sprop-level-parameter-sets, in-band transport of parameter sets MUST be used.

   The following rules apply to transport of parameter sets in the answerer-to-offerer direction.

   o  An answer MAY include either sprop-parameter-sets or sprop-level-parameter-sets, but MUST NOT include both of the two. If neither sprop-parameter-sets nor sprop-level-parameter-sets is present in the answer, then only in-band transport of parameter sets is used.

   o  If the offer includes in-band-parameter-sets equal to 1, then the answerer MUST NOT include sprop-parameter-sets or sprop-level-parameter-sets in the answer and MUST transmit parameter sets in-band. Otherwise, the following applies.

      o  If the level to use in the answerer-to-offerer direction is equal to the default level in the answer, the following applies.

         The offerer MUST be prepared to use the parameter sets included in sprop-parameter-sets, when present, for decoding the incoming NAL unit stream, and ignore sprop-level-parameter-sets, when present. When sprop-parameter-sets is not present in the answer, the answerer MUST transmit parameter sets in-band.

      o  Otherwise (the level to use in the answerer-to-offerer direction is not equal to the default level in the answer), the following applies.

         The offerer MUST be prepared to use the parameter sets that are included in sprop-level-parameter-sets for the level to use in the answerer-to-offerer direction, when present in the answer, for decoding the incoming NAL unit stream, and ignore all other parameter sets included in sprop-level-parameter-sets and sprop-parameter-sets, when present in the answer. When no parameter sets for the level to use in the answerer-to-offerer direction are present in sprop-level-parameter-sets in the answer, the answerer MUST transmit parameter sets in-band.

   When sprop-parameter-sets or sprop-level-parameter-sets is conveyed using the "fmtp" source attribute as specified in Section 6.3 of [RFC5576], the receiver of the parameters MUST store the parameter sets included in the sprop-parameter-sets or sprop-level-parameter-sets for the accepted level and associate them to the source given as a part of the "fmtp" source attribute. Parameter sets associated with one source MUST only be used to decode NAL units conveyed in RTP packets from the same source. When this mechanism is in use, SSRC collision detection and resolution MUST be performed as specified in [RFC5576].

   Informative note: Conveyance of sprop-parameter-sets and sprop-level-parameter-sets using the "fmtp" source attribute may be used in topologies like Topo-Video-switch-MCU [RFC5117] to enable out-of-band transport of parameter sets.
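The Level 1b handling and the parameter set rules for the offerer-to-answerer direction given in the list above can be sketched in Python as follows. This is an illustration under stated assumptions, not a normative algorithm: the names level_rank and sets_for_offerer_stream are hypothetical, the rank values (level_idc times 10, with 105 for Level 1b) are merely a convenient ordering, constraint_set3_flag is assumed to be mask 0x10 of the profile-iop octet, and sprop-level-parameter-sets is assumed to have been parsed already into a dictionary keyed by level rank.

```python
from typing import Optional

LEVEL_1B_RANK = 105  # assumed rank: above Level 1.0 (100), below Level 1.1 (110)


def level_rank(profile_level_id: str) -> int:
    """Map the level part of a six-hex-digit profile-level-id to a sortable rank."""
    raw = bytes.fromhex(profile_level_id)
    if len(raw) != 3:
        raise ValueError("profile-level-id must be six hex digits")
    profile_idc, profile_iop, level_idc = raw
    # Assumption: bit 4 of the middle octet is constraint_set3_flag.
    constraint_set3_flag = bool(profile_iop & 0x10)
    if profile_idc in (66, 77, 88):
        # Baseline, Main, Extended: Level 1b is level_idc 11 plus the flag.
        if level_idc == 11 and constraint_set3_flag:
            return LEVEL_1B_RANK
    elif level_idc == 9:
        # Other profiles: Level 1b is signaled with level_idc equal to 9.
        return LEVEL_1B_RANK
    return level_idc * 10


def sets_for_offerer_stream(
    offer_profile_level_id: str,
    level_in_use: str,
    answer_requests_in_band: bool,
    sprop_parameter_sets: Optional[list[str]] = None,
    sprop_level_parameter_sets: Optional[dict[int, list[str]]] = None,
) -> Optional[list[str]]:
    """Return the out-of-band parameter sets the answerer should be prepared
    to use for the incoming stream, or None when in-band transport is used."""
    if answer_requests_in_band:  # in-band-parameter-sets=1 in the answer
        return None
    if level_rank(level_in_use) == level_rank(offer_profile_level_id):
        # Default level: use sprop-parameter-sets, ignore the per-level sets.
        return sprop_parameter_sets or None
    # Non-default level: only the sets declared for the accepted level apply.
    return (sprop_level_parameter_sets or {}).get(level_rank(level_in_use))
```

A return value of None here means that no out-of-band sets apply, so parameter sets have to arrive in-band; even when sets are returned, additional in-band parameter sets remain possible. The answerer-to-offerer direction follows the same pattern with the roles of offer and answer exchanged.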
For streams being delivered over multicast, the following rules apply:

o  The media format configuration is identified by profile-level-id, including the level part, packetization-mode, and mst-mode. These media format configuration parameters (including the level part of profile-level-id) MUST be used symmetrically; i.e., the answerer MUST either maintain all configuration parameters or remove the media format (payload type) completely. Note that this implies that the level part of profile-level-id for Offer/Answer in multicast is not changeable.

   To simplify handling and matching of these configurations, the same RTP payload type number used in the offer should also be used in the answer, as specified in [RFC3264]. An answer MUST NOT contain a payload type number used in the offer unless the configuration is the same as in the offer.

o  Parameter sets received MUST be associated with the originating source, and MUST be only used in decoding the incoming NAL unit stream from the same source.

o  The rules for other parameters are the same as above for unicast as long as the above rules are obeyed.

Table 14 lists the interpretation of all the parameters that MUST be used for the various combinations of offer, answer, and direction attributes. Note that the two columns wherein the scalable-layer-id parameter is used only apply to answers, whereas the other columns apply to both offers and answers.

Table 14. Interpretation of parameters for various combinations of offers, answers, direction attributes, with and without scalable-layer-id. Columns that do not indicate offer or answer apply to both.
                                 sendonly --+
    answer: recvonly, scalable-layer-id --+ |
       recvonly w/o scalable-layer-id --+ | |
answer: sendrecv, scalable-layer-id --+ | | |
   sendrecv w/o scalable-layer-id --+ | | | |
                                    | | | | |
profile-level-id                    C X C X P
max-recv-level                      R R R R -
max-recv-base-level                 R R R R -
packetization-mode                  C X C X P
mst-mode                            C X C X P
sprop-avc-ready                     P P - - P
sprop-deint-buf-req                 P P - - P
sprop-init-buf-time                 P P - - P
sprop-interleaving-depth            P P - - P
sprop-max-don-diff                  P P - - P
sprop-mst-csdon-always-present      P P - - P
sprop-mst-max-don-diff              P P - - P
sprop-mst-remux-buf-size            P P - - P
sprop-no-NAL-reordering-required    P P - - P
sprop-operation-point-info          P P - - P
sprop-remux-buf-req                 P P - - P
sprop-remux-init-buf-time           P P - - P
sprop-scalability-info              P P - - P
deint-buf-cap                       R R R R -
max-br                              R R R R -
max-cpb                             R R R R -
max-dpb                             R R R R -
max-fs                              R R R R -
max-mbps                            R R R R -
max-rcmd-nalu-size                  R R R R -
redundant-pic-cap                   R R R R -
remux-buf-cap                       R R R R -
in-band-parameter-sets              R R R R -
sprop-parameter-sets                S S - - S
sprop-level-parameter-sets          S S - - S
scalable-layer-id                   X O X O -

Legend:
C: configuration for sending and receiving streams
P: properties of the stream to be sent
R: receiver capabilities
S: out-of-band parameter sets
O: operation point selection
X: MUST NOT be present
-: not usable, when present SHOULD be ignored
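
Table 14 lends itself to a simple lookup. The Python sketch below encodes the five columns of the table as five-character strings, in the order sendrecv without scalable-layer-id, answer with sendrecv and scalable-layer-id, recvonly without scalable-layer-id, answer with recvonly and scalable-layer-id, and sendonly. The names TABLE_14 and interpret are hypothetical and serve only as an illustration of how the table is read.

```python
# Each key is one row pattern of Table 14; each value lists the parameters
# that share that pattern.
_ROWS = {
    "CXCXP": ("profile-level-id", "packetization-mode", "mst-mode"),
    "RRRR-": (
        "max-recv-level", "max-recv-base-level", "deint-buf-cap", "max-br",
        "max-cpb", "max-dpb", "max-fs", "max-mbps", "max-rcmd-nalu-size",
        "redundant-pic-cap", "remux-buf-cap", "in-band-parameter-sets",
    ),
    "PP--P": (
        "sprop-avc-ready", "sprop-deint-buf-req", "sprop-init-buf-time",
        "sprop-interleaving-depth", "sprop-max-don-diff",
        "sprop-mst-csdon-always-present", "sprop-mst-max-don-diff",
        "sprop-mst-remux-buf-size", "sprop-no-NAL-reordering-required",
        "sprop-operation-point-info", "sprop-remux-buf-req",
        "sprop-remux-init-buf-time", "sprop-scalability-info",
    ),
    "SS--S": ("sprop-parameter-sets", "sprop-level-parameter-sets"),
    "XOXO-": ("scalable-layer-id",),
}

TABLE_14 = {name: codes for codes, names in _ROWS.items() for name in names}


def interpret(parameter: str, direction: str, layer_id_in_answer: bool = False) -> str:
    """Return the Table 14 legend code for a parameter.

    direction is "sendrecv", "recvonly", or "sendonly"; layer_id_in_answer
    selects the answer columns that carry scalable-layer-id.
    """
    column = {"sendrecv": 0, "recvonly": 2, "sendonly": 4}[direction]
    if layer_id_in_answer:
        if direction == "sendonly":
            raise ValueError("Table 14 has no sendonly column with scalable-layer-id")
        column += 1
    return TABLE_14[parameter][column]
```

For example, interpret("profile-level-id", "sendrecv", layer_id_in_answer=True) yields "X", reflecting that the media configuration parameters MUST NOT be present in an answer that includes scalable-layer-id.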
Parameters used for declaring receiver capabilities are in general downgradable; i.e., they express the upper limit for a sender's possible behavior. Thus, a sender MAY select to set its encoder using only lower/lesser or equal values of these parameters.

Parameters declaring a configuration point are not changeable, with the exception of the level part of the profile-level-id parameter for unicast usage. This expresses values a receiver expects to be used and must be used verbatim on the sender side. If level downgrading (for profile-level-id) is used, an answerer MUST NOT include the scalable-layer-id parameter.

When a sender's capabilities are declared, and non-downgradable parameters are used in this declaration, then these parameters express a configuration that is acceptable for the sender to receive streams.

In order to achieve high interoperability levels, it is often advisable to offer multiple alternative configurations, e.g., for the packetization mode. It is impossible to offer multiple configurations in a single payload type. Thus, when multiple configuration offers are made, each offer requires its own RTP payload type associated with the offer.

A receiver SHOULD understand all media type parameters, even if it only supports a subset of the payload format's functionality. This ensures that a receiver is capable of understanding when an offer to receive media can be downgraded to what is supported by the receiver of the offer.

An answerer MAY extend the offer with additional media format configurations. However, to enable their usage, in most cases a second offer is required from the offerer to provide the stream property parameters that the media sender will use. This also has the effect that the offerer has to be able to receive this media format configuration, not only to send it.

If an offerer wishes to have non-symmetric capabilities between sending and receiving, the offerer can allow asymmetric levels via level-asymmetry-allowed equal to 1. Alternatively, the offerer can offer different RTP sessions, i.e., different media lines declared as "recvonly" and "sendonly", respectively. This may have further implications on the system, and may require additional external semantics to associate the two media lines.

7.2.3. Dependency Signaling in Multi-Session Transmission
If MST is used, the rules on signaling media decoding dependency in SDP as defined in [RFC5583] apply. The rules on "hierarchical or layered encoding" with multicast in Section 5.7 of [RFC4566] do not apply, i.e., the notation for Connection Data "c=" SHALL NOT be used with more than one address. Additionally, the order of dependencies of the RTP sessions indicated by the "a=depend" attribute as defined in [RFC5583] MUST represent the decoding order of the VCL NAL units in an access unit, i.e., the order of session dependency is given from the base or the lowest enhancement RTP session (the most important) to the highest enhancement RTP session (the least important).

7.2.4. Usage in Declarative Session Descriptions
When SVC over RTP is offered with SDP in a declarative style, as in Real Time Streaming Protocol (RTSP) [RFC2326] or Session Announcement Protocol (SAP) [RFC2974], the following considerations are necessary.

o  All parameters capable of indicating both stream properties and receiver capabilities are used to indicate only stream properties. For example, in this case, the parameter profile-level-id declares the values used by the stream, not the capabilities for receiving streams. As a result, the following interpretation of the parameters MUST be used:

   Declaring actual configuration or stream properties:

   - profile-level-id
   - packetization-mode
   - mst-mode
   - sprop-deint-buf-req
   - sprop-interleaving-depth
   - sprop-max-don-diff
   - sprop-init-buf-time
   - sprop-mst-csdon-always-present
   - sprop-mst-remux-buf-size
   - sprop-remux-buf-req
   - sprop-remux-init-buf-time
   - sprop-mst-max-don-diff
   - sprop-scalability-info
   - sprop-operation-point-info
   - sprop-no-NAL-reordering-required
   - sprop-avc-ready

   Out-of-band transporting of parameter sets:

   - sprop-parameter-sets
   - sprop-level-parameter-sets

   Not usable (when present, they SHOULD be ignored):

   - max-mbps
   - max-fs
   - max-cpb
   - max-dpb
   - max-br
   - max-recv-level
   - max-recv-base-level
   - redundant-pic-cap
   - max-rcmd-nalu-size
   - deint-buf-cap
   - remux-buf-cap
   - scalable-layer-id

o  A receiver of the SDP is required to support all parameters and values of the parameters provided; otherwise, the receiver MUST reject (RTSP) or not participate in (SAP) the session. It falls on the creator of the session to use values that are expected to be supported by the receiving application.

7.3. Examples
In the following examples, "{data}" is used to indicate a data string encoded as base64.

7.3.1. Example for Offering a Single SVC Session
Example 1: The offerer offers one video media description including two RTP payload types. The first payload type offers H264, and the second offers H264-SVC. The two payload types have different fmtp parameters, such as profile-level-id, packetization-mode, and sprop-parameter-sets.

Offerer -> Answerer SDP message:

   m=video 20000 RTP/AVP 97 96
   a=rtpmap:96 H264/90000
   a=fmtp:96 profile-level-id=4de00a; packetization-mode=0;
      sprop-parameter-sets={sps0},{pps0};
   a=rtpmap:97 H264-SVC/90000
   a=fmtp:97 profile-level-id=53000c; packetization-mode=1;
      sprop-parameter-sets={sps0},{pps0},{sps1},{pps1};

If the answerer does not support media subtype H264-SVC, it can issue an answer accepting only the base layer offer (payload type 96). In the following example, the receiver supports H264-SVC, so it lists payload type 97 first as the preferred option.

Answerer -> Offerer SDP message:

   m=video 40000 RTP/AVP 97 96
   a=rtpmap:96 H264/90000
   a=fmtp:96 profile-level-id=4de00a; packetization-mode=0;
      sprop-parameter-sets={sps2},{pps2};
   a=rtpmap:97 H264-SVC/90000
   a=fmtp:97 profile-level-id=53000c; packetization-mode=1;
      sprop-parameter-sets={sps2},{pps2},{sps3},{pps3};

7.3.2. Example for Offering a Single SVC Session Using scalable-layer-id
Example 2: The offerer offers the same media configurations as shown in the example above for receiving and sending the stream, but using a single RTP payload type and including sprop-operation-point-info.

Offerer -> Answerer SDP message:

   m=video 20000 RTP/AVP 97
   a=rtpmap:97 H264-SVC/90000
   a=fmtp:97 profile-level-id=53000c; packetization-mode=1;
      sprop-parameter-sets={sps0},{sps1},{pps0},{pps1};
      sprop-operation-point-info=<1,0,0,0,4de00a,3200,176,144,128,
      256>,<2,1,1,0,53000c,6400,352,288,256,512>;

In this example, the receiver supports H264-SVC and chooses the lower operation point offered in the RTP payload type for sending and receiving the stream.

Answerer -> Offerer SDP message:

   m=video 40000 RTP/AVP 97
   a=rtpmap:97 H264-SVC/90000
   a=fmtp:97 sprop-parameter-sets={sps2},{sps3},{pps2},{pps3};
      scalable-layer-id=1;

In an equivalent example showing the use of sprop-scalability-info instead of sprop-operation-point-info, the sprop-operation-point-info would be replaced by the sprop-scalability-info followed by the binary (base16) representation of the Scalability Information SEI message.

7.3.3. Example for Offering Multiple Sessions in MST
Example 3: In this example, the offerer offers a multi-session transmission with up to three sessions. The base session media description includes payload types that are backward compatible with [RFC6184], and three different payload types are offered. The other two media are using payload types with media subtype H264-SVC. In each media description, different values of profile-level-id, packetization-mode, mst-mode, and sprop-parameter-sets are offered.

Offerer -> Answerer SDP message:

   a=group:DDP L1 L2 L3
   m=video 20000 RTP/AVP 96 97 98
   a=rtpmap:96 H264/90000
   a=fmtp:96 profile-level-id=4de00a; packetization-mode=0;
      mst-mode=NI-T; sprop-parameter-sets={sps0},{pps0};
   a=rtpmap:97 H264/90000
   a=fmtp:97 profile-level-id=4de00a; packetization-mode=1;
      mst-mode=NI-TC; sprop-parameter-sets={sps0},{pps0};
   a=rtpmap:98 H264/90000
   a=fmtp:98 profile-level-id=4de00a; packetization-mode=2;
      mst-mode=I-C; init-buf-time=156320;
      sprop-parameter-sets={sps0},{pps0};
   a=mid:L1
   m=video 20002 RTP/AVP 99 100
   a=rtpmap:99 H264-SVC/90000
   a=fmtp:99 profile-level-id=53000c; packetization-mode=1;
      mst-mode=NI-T; sprop-parameter-sets={sps1},{pps1};
   a=rtpmap:100 H264-SVC/90000
   a=fmtp:100 profile-level-id=53000c; packetization-mode=2;
      mst-mode=I-C; sprop-parameter-sets={sps1},{pps1};
   a=mid:L2
   a=depend:99 lay L1:96,97; 100 lay L1:98
   m=video 20004 RTP/AVP 101
   a=rtpmap:101 H264-SVC/90000
   a=fmtp:101 profile-level-id=53001F; packetization-mode=1;
      mst-mode=NI-T; sprop-parameter-sets={sps2},{pps2};
   a=mid:L3
   a=depend:101 lay L1:96,97 L2:99

It is assumed that in this example the answerer only supports the NI-T mode for multi-session transmission. For this reason, it chooses the corresponding payload type (96) for the base RTP session. For the two enhancement RTP sessions, the answerer also chooses the payload types that use the NI-T mode (99 and 101).

Answerer -> Offerer SDP message:

   a=group:DDP L1 L2 L3
   m=video 40000 RTP/AVP 96
   a=rtpmap:96 H264/90000
   a=fmtp:96 profile-level-id=4de00a; packetization-mode=0;
      mst-mode=NI-T; sprop-parameter-sets={sps3},{pps3};
   a=mid:L1
   m=video 40002 RTP/AVP 99
   a=rtpmap:99 H264-SVC/90000
   a=fmtp:99 profile-level-id=53000c; packetization-mode=1;
      mst-mode=NI-T; sprop-parameter-sets={sps4},{pps4};
   a=mid:L2
   a=depend:99 lay L1:96
   m=video 40004 RTP/AVP 101
   a=rtpmap:101 H264-SVC/90000
   a=fmtp:101 profile-level-id=53001F; packetization-mode=1;
      mst-mode=NI-T; sprop-parameter-sets={sps5},{pps5};
   a=mid:L3
   a=depend:101 lay L1:96 L2:99

7.3.4. Example for Offering Multiple Sessions in MST Including Operation with Answerer Using scalable-layer-id
Example 4: In this example, the offerer offers a multi-session transmission of three layers with up to two sessions. The base session media description has a payload type that is backward compatible with [RFC6184]. Note that no parameter sets are provided, in which case in-band transport must be used. The other media description contains two enhancement layers and uses the media subtype H264-SVC. It includes two operation point definitions.

Offerer -> Answerer SDP message:

   a=group:DDP L1 L2
   m=video 20000 RTP/AVP 96
   a=rtpmap:96 H264/90000
   a=fmtp:96 profile-level-id=4de00a; packetization-mode=0;
      mst-mode=NI-T;
   a=mid:L1
   m=video 20002 RTP/AVP 97
   a=rtpmap:97 H264-SVC/90000
   a=fmtp:97 profile-level-id=53001F; packetization-mode=1;
      mst-mode=NI-TC; sprop-operation-point-info=<2,0,1,0,53000c,
      3200,352,288,384,512>,<3,1,2,0,53001F,6400,704,576,768,1024>;
   a=mid:L2
   a=depend:97 lay L1:96

It is assumed that the answerer wants to send and receive the base layer (payload type 96), but it only wants to send and receive the lower enhancement layer, i.e., the one with layer id equal to 2. For this reason, the response will include the selection of the desired layer by setting scalable-layer-id equal to 2. Note that the answer only includes the scalable-layer-id information. The answer could include sprop-parameter-sets in the response.

Answerer -> Offerer SDP message:

   a=group:DDP L1 L2
   m=video 40000 RTP/AVP 96
   a=rtpmap:96 H264/90000
   a=fmtp:96 profile-level-id=4de00a; packetization-mode=0;
      mst-mode=NI-T;
   a=mid:L1
   m=video 40002 RTP/AVP 97
   a=rtpmap:97 H264-SVC/90000
   a=fmtp:97 scalable-layer-id=2;
   a=mid:L2
   a=depend:97 lay L1:96

7.3.5. Example for Negotiating an SVC Stream with a Constrained Base Layer in SST
Example 5: The offerer (Alice) offers one video description including two RTP payload types with differing levels and packetization modes.

Offerer -> Answerer SDP message:

   m=video 20000 RTP/AVP 97 96
   a=rtpmap:96 H264-SVC/90000
   a=fmtp:96 profile-level-id=53001e; packetization-mode=0;
   a=rtpmap:97 H264-SVC/90000
   a=fmtp:97 profile-level-id=53001f; packetization-mode=1;

The answerer (Bridge) chooses packetization mode 1, and indicates that it would receive an SVC stream with the base layer being constrained.

Answerer -> Offerer SDP message:

   m=video 40000 RTP/AVP 97
   a=rtpmap:97 H264-SVC/90000
   a=fmtp:97 profile-level-id=53001f; packetization-mode=1;
      max-recv-base-level=000d

The answering endpoint must send an SVC stream at Level 3.1. Since the offering endpoint did not declare max-recv-base-level, the base layer of the SVC stream the answering endpoint must send is not specifically constrained. The offering endpoint (Alice) must send an SVC stream at Level 3.1, for which the base layer must be of a level not higher than Level 1.3.

7.4. Parameter Set Considerations
Section 8.4 of [RFC6184] applies in this memo, with the following applying additionally for multi-session transmission (MST).

In MST, regardless of whether out-of-band or in-band transport of parameter sets is in use, parameter sets required for decoding NAL units carried in one particular RTP session SHOULD be carried in the same session, MAY be carried in a session that the particular RTP session depends on, and MUST NOT be carried in a session that the particular RTP session does not depend on.

8. Security Considerations
The security considerations of the RTP Payload Format for H.264 Video specification [RFC6184] apply. Additionally, the following applies.

Decoders MUST exercise caution with respect to the handling of reserved NAL unit types and reserved SEI messages, particularly if they contain active elements, and MUST restrict their domain of applicability to the presentation containing the stream. The safest way is to simply discard these NAL units and SEI messages.

When integrity protection is applied to a stream, care MUST be taken that the stream being transported may be scalable; hence a receiver may be able to access only part of the entire stream.

End-to-end security with either authentication, integrity, or confidentiality protection will prevent a MANE from performing media-aware operations other than discarding complete packets. And in the case of confidentiality protection it will even be prevented from performing discarding of packets in a media-aware way. To allow any MANE to perform its operations, it will be required to be a trusted entity that is included in the security context establishment. This applies both for the media path and for the RTCP path, if RTCP packets need to be rewritten.

9. Congestion Control
Within any given RTP session carrying payload according to this specification, the provisions of Section 10 of [RFC6184] apply. Reducing the session bitrate is possible by one or more of the following means:
a) Within the highest layer identified by the DID field remove any NAL units with QID higher than a certain value.
b) Remove all NAL units with TID higher than a certain value.
c) Remove all NAL units associated with a DID higher than a certain value.
   Informative note: Removal of all coded slice NAL units associated with DIDs higher than a certain value in the entire stream is required in order to preserve conformance of the resulting SVC stream.
d) Utilize the PRID field to indicate the relative importance of NAL units, and remove all NAL units associated with a PRID higher than a certain value. Note that the use of the PRID is application-specific.
e) Remove NAL units or entire packets according to application-specific rules. The result will depend on the particular coding structure used as well as any additional application-specific functionality (e.g., concealment performed at the receiving decoder). In general, this will result in the reception of a non-conforming bitstream and hence the decoder behavior is not specified by [H.264]. Significant artifacts may therefore appear in the decoded output if the particular decoder implementation does not take appropriate action in response to congestion control.
Informative note: The discussion above is centered on NAL units rather than packets, primarily because that is the level where senders can meaningfully manipulate the scalable bitstream. The mapping of NAL units to RTP packets is fairly flexible when using aggregation packets. Depending on the nature of the congestion control algorithm, the "dimension" of congestion measurement (packet count or bitrate) and reaction to it (reducing packet count or bitrate or both) can be adjusted accordingly.
All aforementioned means are available to the RTP sender, regardless of whether that sender is located in the sending endpoint or in a mixer-based MANE.
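Informative sketch: Means a) through d) above can be illustrated as a filter over per-NAL-unit layer identifiers. The following Python fragment is only a sketch under stated assumptions, not part of this specification: the NalInfo container and the thin() function are hypothetical names, and it is assumed that the sender has already extracted DID, QID, TID, and PRID for every NAL unit (for base-layer NAL units, which do not carry these fields in their own header, the values are assumed to have been derived elsewhere, e.g., from an associated prefix NAL unit). Means e) is application-specific and is not covered.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class NalInfo:
    """Layer identifiers of one NAL unit (hypothetical container)."""
    did: int   # dependency_id
    qid: int   # quality_id
    tid: int   # temporal_id
    prid: int  # priority_id (semantics are application-specific)


def thin(nalus: Iterable[NalInfo],
         max_did: Optional[int] = None,
         max_qid: Optional[int] = None,
         max_tid: Optional[int] = None,
         max_prid: Optional[int] = None) -> List[NalInfo]:
    """Return the NAL units that survive the requested thinning steps."""
    units = list(nalus)
    if max_did is not None:
        # Means c): drop every NAL unit above the chosen DID.
        units = [n for n in units if n.did <= max_did]
    if max_tid is not None:
        # Means b): drop every NAL unit above the chosen TID.
        units = [n for n in units if n.tid <= max_tid]
    if max_qid is not None and units:
        # Means a): only the highest remaining DID is thinned by QID.
        top = max(n.did for n in units)
        units = [n for n in units if n.did < top or n.qid <= max_qid]
    if max_prid is not None:
        # Means d): drop every NAL unit above the chosen PRID.
        units = [n for n in units if n.prid <= max_prid]
    return units
```

The steps are applied in a fixed order (DID, TID, QID, PRID) purely for determinism; a real sender would choose thresholds from its congestion control state and would operate on the actual NAL unit payloads rather than on identifier tuples.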
When a translator-based MANE is employed, then the MANE MAY manipulate the session only on the MANE's outgoing path, so that the sensed end-to-end congestion falls within the permissible envelope. As with all translators, in this case, the MANE needs to rewrite RTCP RRs to reflect the manipulations it has performed on the session.
Informative note: Applications MAY also implement, in addition or separately, other congestion control mechanisms, e.g., as described in [RFC5775] and [Yan].
10. IANA Considerations
A new media type, as specified in Section 7.1 of this memo, has been registered with IANA.
11. Informative Appendix: Application Examples
11.1. Introduction
Scalable video coding is a concept that has been around since at least MPEG-2 [MPEG2], which goes back as early as 1993. Nevertheless, it has never gained wide acceptance, perhaps partly because applications didn't materialize in the form envisioned during standardization. ISO/IEC MPEG and ITU-T VCEG each performed a requirements analysis for the SVC project; the resulting requirement documents are available in [JVT-N026] and [JVT-N027], respectively. The following introduces four main application scenarios that the authors consider relevant and that are implementable with this specification.
11.2. Layered Multicast
This well-understood form of the use of layered coding [McCanne] implies that all layers are individually conveyed in their own RTP packet streams, each carried in its own RTP session using the IP (multicast) address and port number as the single demultiplexing point. Receivers "tune" into the layers by subscribing to the IP multicast, normally by using IGMP [IGMP]. Depending on the application scenario, it is also possible to convey a number of layers in one RTP session, when finer operation points within the subset of layers are not needed.
Layered multicast has the great advantage of simplicity and easy implementation. However, it also has the great disadvantage of utilizing many different transport addresses. While the authors consider this not to be a major problem for a professionally maintained content server, receiving client endpoints need to open many ports to IP multicast addresses in their firewalls. This is a practical problem from a firewall and network address translation (NAT) viewpoint. Furthermore, even today IP multicast is not as widely deployed as many wish.
The authors consider layered multicast an important application scenario for the following reasons. First, it is well understood and the implementation constraints are well known. Second, there may well be large-scale IP networks outside the immediate Internet context that may wish to employ layered multicast in the future. One possible example could be a combination of content creation and core-network distribution for the various mobile TV services, e.g., those being developed by 3GPP (MBMS) [MBMS] and DVB (DVB-H) [DVB-H].
11.3. Streaming
In this scenario, a streaming server has a repository of stored SVC coded layers for a given content. At the time of streaming, and according to the capabilities, connectivity, and congestion situation of the client(s), the streaming server generates and serves a scalable stream. Both unicast and multicast serving are possible. At the same time, the streaming server may use the same repository of stored layers to compose different streams (with a different set of layers) intended for other audiences. As every endpoint receives only a single SVC RTP session, the number of firewall pinholes can be reduced to one.
The main difference between this scenario and straightforward simulcasting lies in the architecture and the requirements of the streaming server, and is therefore out of the scope of IETF standardization. However, compelling arguments can be made why such a streaming server design makes sense. One possible argument is related to storage space and channel bandwidth. Another is bandwidth adaptability without transcoding -- a considerable advantage in a congestion controlled network. When the streaming server learns about congestion, it can reduce the sending bitrate by choosing fewer layers when composing the layered stream; see Section 9. SVC is designed to gracefully support both bandwidth ramp-down and bandwidth ramp-up with a considerable dynamic range. This payload format is designed to allow for bandwidth flexibility in the mentioned sense. While, in theory, a transcoding step could achieve a similar dynamic range, the computational demands are impractically high and video quality is typically lowered -- therefore, few (if any) streaming servers implement full transcoding.
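Informative sketch: The ramp-down and ramp-up behavior described above amounts to picking the longest prefix of the stored layers (base layer first) whose cumulative bitrate fits the currently available bandwidth. The Python fragment below is a minimal illustration under assumptions: the function name layers_for_budget and the per-layer bitrates in the usage example are hypothetical, and how the server obtains its bandwidth estimate is not addressed.

```python
from typing import Sequence


def layers_for_budget(layer_kbps: Sequence[int], budget_kbps: int) -> int:
    """Return how many layers, taken in dependency order, fit the budget.

    layer_kbps[0] is the base layer; each later entry is the incremental
    bitrate of the next enhancement layer. A result of 0 means that even
    the base layer exceeds the budget.
    """
    total = 0
    count = 0
    for rate in layer_kbps:
        if total + rate > budget_kbps:
            break  # this layer and all layers depending on it are omitted
        total += rate
        count += 1
    return count
```

For example, with hypothetical incremental bitrates of 300, 500, and 1200 kbit/s, a budget of 900 kbit/s selects the first two layers; when the estimate later rises to 2500 kbit/s, the same call selects all three, without any transcoding.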
11.4. Videoconferencing (Unicast to MANE, Unicast to Endpoints)
Videoconferencing has traditionally relied on Multipoint Control Units (MCUs). These units connect endpoints in a star configuration and operate as follows. Coded video is transmitted from each endpoint to the MCU, where it is decoded, scaled, and composited to construct output frames, which are then re-encoded and transmitted to the endpoint(s). In systems supporting personalized layout (each user is allowed to select the layout of his/her screen), the compositing and encoding process is performed for each of the receiving endpoints. Even without personalized layout, rate matching still requires that the encoding process at the MCU is performed separately for each endpoint. As a result, MCUs have considerable complexity and introduce significant delay. The cascaded encodings also reduce the video quality. Particularly for multipoint connections, interactive communication is cumbersome as the end-to-end delay is very high [G.114].
A simpler architecture is the switching MCU, in which one of the incoming video streams is redirected to the receiving endpoints. Obviously, only one user at a time can be seen and rate matching cannot be performed, thus forcing all transmitting endpoints to transmit at the lowest bit rate available in the MCU-to-endpoint connections.
With scalable video coding the MCU can be replaced with an application-level router (ALR): this unit simply selects which incoming packets should be transmitted to which of the receiving endpoints [Eleft]. In such a system, each endpoint performs its own composition of the incoming video streams. Assuming, for example, a system that uses spatial scalability with two layers, personalized layout is equivalent to instructing the ALR to only send the required packets for the corresponding resolution to the particular endpoint. Similarly, rate matching at the ALR for a particular endpoint can be performed by selecting an appropriate subset of the incoming video packets to transmit to the particular endpoint.
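Informative sketch: The packet selection performed by an ALR can be pictured as a lookup in a per-endpoint table. The Python fragment below is a sketch only, with hypothetical names (Subscriptions, route) and the simplifying assumption that every incoming packet carries NAL units of a single DID, so that one DID value describes the whole packet. Each receiving endpoint registers, per source, the highest spatial layer (DID) it wants; the ALR forwards a packet to every endpoint whose entry admits it.

```python
from typing import Dict, List

# receiver -> {source -> highest DID the receiver wants from that source}
Subscriptions = Dict[str, Dict[str, int]]


def route(source: str, did: int, subs: Subscriptions) -> List[str]:
    """Return the receivers that should get a packet of layer `did`.

    A receiver with no entry for `source` gets nothing from that source,
    and a packet is never sent back to the endpoint it came from.
    """
    return sorted(
        receiver
        for receiver, wanted in subs.items()
        if receiver != source and did <= wanted.get(source, -1)
    )
```

With the hypothetical table {"alice": {"bob": 1, "carol": 0}, "bob": {"alice": 0}, "carol": {"alice": 1, "bob": 1}}, an enhancement-layer packet (DID 1) from alice is forwarded to carol only, while bob, who asked for alice's base layer alone, receives just the DID 0 packets. Changing a layout or adapting to a slower link is then a table update rather than a re-encoding step.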
Personalized layout and rate matching thus become routing decisions, and require no signal processing. Note that scalability also allows participants to enjoy the best video quality afforded by their links, i.e., users no longer have to be forced to operate at the quality supported by the weakest endpoint. Most importantly, the ALR has an insignificant contribution to the end-to-end delay, typically an order of magnitude less than an MCU. This makes it possible to have fully interactive multipoint conferences with even a very large number of participants. There are significant advantages as well in terms of error resilience and, in fact, error tolerance can be increased by nearly an order of magnitude here as well (e.g., using unequal error protection). Finally, the very low delay of an ALR allows these systems to be cascaded, with significant benefits in terms of system design and deployment. Cascading of traditional MCUs is impossible due to the very high delay that even a single MCU introduces.
Scalable video coding enables a very significant paradigm shift in videoconferencing systems, bringing the complexity of video communication systems (particularly the servers residing within the network) in line with other types of network applications.
11.5. Mobile TV (Multicast to MANE, Unicast to Endpoint)
This scenario is a bit more complex, and designed to optimize the network traffic in a core network, while still requiring only a single pinhole in the endpoint's firewall. One of its key applications is the mobile TV market. Consider a large private IP network, e.g., the core network of the Third Generation Partnership Project (3GPP). Streaming servers within this core network can be assumed to be professionally maintained. It is assumed that these servers can have many ports open to the network and that layered multicast is a real option. Therefore, the streaming server multicasts SVC scalable layers, instead of simulcasting different representations of the same content at different bitrates. Also consider many endpoints of different classes. Some of these endpoints may lack the processing power or the display size to meaningfully decode all layers; others may have these capabilities. Users of some endpoints may wish not to pay for high quality and are happy with a base service, which may be cheaper or even free. Other users are willing to pay for high quality. Finally, some connected users may have a bandwidth problem in that they can't receive the bandwidth they would want to receive -- be it through congestion, connectivity, change of service quality, or for whatever other reasons. However, all these users have in common that they don't want to be exposed too much, and therefore the number of firewall pinholes needs to be small. This situation can be handled best by introducing middleboxes close to the edge of the core network, which receive the layered multicast streams and compose the single SVC scalable bitstream according to the needs of the endpoint connected. These middleboxes are called MANEs throughout this specification. 
In practice, the authors envision the MANE to be part of (or at least physically and topologically close to) the base station of a mobile network, where all the signaling and media traffic necessarily are multiplexed on the same physical link.
MANEs necessarily need to be fairly complex devices. They certainly need to understand the signaling, for example, in order to associate the payload type octet in the RTP header with the SVC payload type. A MANE may aggregate multiple RTP streams, possibly from multiple RTP sessions, thereby reducing the number of firewall pinholes required at the endpoints, or may optimize the outgoing RTP stream to the MTU size of the outgoing path by utilizing the aggregation and fragmentation mechanisms of this memo. This type of MANE is conceptually easy to implement and can offer powerful features, primarily because it necessarily can "see" the payload (including the RTP payload headers), utilize the wealth of layering information available therein, and manipulate it.
A MANE can also perform stream thinning, in order to adhere to congestion control principles as discussed in Section 9. While the implementation of the forward (media) channel of such a MANE appears to be comparatively simple, the need to rewrite RTCP RRs makes even such a MANE a complex device. While the implementation complexity of either case of a MANE, as discussed above, is fairly high, the computational demands are comparatively low.
12. Acknowledgements
Miska Hannuksela contributed significantly to the designs of the PACSI NAL unit and the NI-C mode for decoding order recovery. Roni Even organized and coordinated the design team for the development of this memo, and provided valuable comments. Jonathan Lennox contributed to the NAL unit reordering algorithm for MST and provided input on several parts of this memo. Peter Amon, Sam Ganesan, Mike Nilsson, Colin Perkins, and Thomas Wiegand were members of the design team and provided valuable contributions. Magnus Westerlund has also made valuable comments. Charles Eckel and Stuart Taylor provided valuable comments after the first WGLC for this document. Xiaohui (Joanne) Wei helped improve Table 13 and the SDP examples. The work of Thomas Schierl has been supported by the European Commission under contract number FP7-ICT-248036, project COAST.
13. References
13.1. Normative References
[H.264] ITU-T Recommendation H.264, "Advanced video coding for generic audiovisual services", March 2010.
[RFC6184] Wang, Y.-K., Even, R., Kristensen, T., and R. Jesup, "RTP Payload Format for H.264 Video", RFC 6184, May 2011.
[ISO/IEC14496-10] ISO/IEC International Standard 14496-10:2005.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC3264] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model with Session Description Protocol (SDP)", RFC 3264, June 2002.
[RFC3550] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", STD 64, RFC 3550, July 2003.
[RFC4288] Freed, N. and J. Klensin, "Media Type Specifications and Registration Procedures", BCP 13, RFC 4288, December 2005.
[RFC4566] Handley, M., Jacobson, V., and C. Perkins, "SDP: Session Description Protocol", RFC 4566, July 2006.
[RFC4648] Josefsson, S., "The Base16, Base32, and Base64 Data Encodings", RFC 4648, October 2006.
[RFC5576] Lennox, J., Ott, J., and T. Schierl, "Source-Specific Media Attributes in the Session Description Protocol (SDP)", RFC 5576, June 2009.
[RFC5583] Schierl, T. and S. Wenger, "Signaling Media Decoding Dependency in the Session Description Protocol (SDP)", RFC 5583, July 2009.
[RFC6051] Perkins, C. and T. Schierl, "Rapid Synchronisation of RTP Flows", RFC 6051, November 2010.
13.2. Informative References
[DVB-H] DVB - Digital Video Broadcasting (DVB); DVB-H Implementation Guidelines, ETSI TR 102 377, 2005.
[Eleft] Eleftheriadis, A., R. Civanlar, and O. Shapiro, "Multipoint Videoconferencing with Scalable Video Coding", Journal of Zhejiang University SCIENCE A, Vol. 7, Nr. 5, April 2006, pp. 696-705. (Proceedings of the Packet Video 2006 Workshop.)
[G.114] ITU-T Rec. G.114, "One-way transmission time", May 2003.
[H.241] ITU-T Rec. H.241, "Extended video procedures and control signals for H.300-series terminals", May 2006.
[IGMP] Cain, B., Deering, S., Kouvelas, I., Fenner, B., and A. Thyagarajan, "Internet Group Management Protocol, Version 3", RFC 3376, October 2002.
[JVT-N026] Ohm J.-R., Koenen, R., and Chiariglione, L. (ed.), "SVC requirements specified by MPEG (ISO/IEC JTC1 SC29 WG11)", JVT-N026, available from http://ftp3.itu.ch/av-arch/jvt-site/2005_01_HongKong/JVT-N026.doc, Hong Kong, China, January 2005.
[JVT-N027] Sullivan, G. and Wiegand, T. (ed.), "SVC requirements specified by VCEG (ITU-T SG16 Q.6)", JVT-N027, available from http://ftp3.itu.int/av-arch/jvt-site/2005_01_HongKong/JVT-N027.doc, Hong Kong, China, January 2005.
[McCanne] McCanne, S., Jacobson, V., and Vetterli, M., "Receiver-driven layered multicast", in Proc. of ACM SIGCOMM'96, pages 117-130, Stanford, CA, August 1996.
[MBMS] 3GPP - Technical Specification Group Services and System Aspects; Multimedia Broadcast/Multicast Service (MBMS); Protocols and codecs (Release 6), December 2005.
[MPEG2] ISO/IEC International Standard 13818-2:1993.
[RFC2326] Schulzrinne, H., Rao, A., and R. Lanphier, "Real Time Streaming Protocol (RTSP)", RFC 2326, April 1998.
[RFC2974] Handley, M., Perkins, C., and E. Whelan, "Session Announcement Protocol", RFC 2974, October 2000.
[RFC5117] Westerlund, M. and S. Wenger, "RTP Topologies", RFC 5117, January 2008.
[RFC5775] Luby, M., Watson, M., and L. Vicisano, "Asynchronous Layered Coding (ALC) Protocol Instantiation", RFC 5775, April 2010.
[Yan] Yan, J., Katrinis, K., May, M., and Plattner, R., "Media- and TCP-friendly congestion control for scalable video streams", in IEEE Trans. Multimedia, pages 196-206, April 2006.
Authors' Addresses
Stephan Wenger
2400 Skyfarm Dr.
Hillsborough, CA 94010
USA
Phone: +1-415-713-5473
EMail: stewe@stewe.org

Ye-Kui Wang
Huawei Technologies
400 Crossing Blvd, 2nd Floor
Bridgewater, NJ 08807
USA
Phone: +1-908-541-3518
EMail: yekui.wang@huawei.com

Thomas Schierl
Fraunhofer HHI
Einsteinufer 37
D-10587 Berlin
Germany
Phone: +49-30-31002-227
EMail: ts@thomas-schierl.de

Alex Eleftheriadis
Vidyo, Inc.
433 Hackensack Ave.
Hackensack, NJ 07601
USA
Phone: +1-201-467-5135
EMail: alex@vidyo.com