6. Packetization Rules
The packetization modes are introduced in section 5.2. The packetization rules common to more than one of the packetization modes are specified in section 6.1. The packetization rules for the single NAL unit mode, the non-interleaved mode, and the interleaved mode are specified in sections 6.2, 6.3, and 6.4, respectively.
6.1. Common Packetization Rules
All senders MUST enforce the following packetization rules regardless of the packetization mode in use:
o Coded slice NAL units or coded slice data partition NAL units belonging to the same coded picture (and thus sharing the same RTP timestamp value) MAY be sent in any order permitted by the applicable profile defined in [1]; however, for delay-critical systems, they SHOULD be sent in their original coding order to minimize the delay. Note that the coding order is not necessarily the scan order, but rather the order in which the NAL packets become available to the RTP stack.
o Parameter sets are handled in accordance with the rules and recommendations given in section 8.4.
o MANEs MUST NOT duplicate any NAL unit except for sequence or picture parameter set NAL units, as neither this memo nor the H.264 specification provides means to identify duplicated NAL units. Sequence and picture parameter set NAL units MAY be duplicated to make their correct reception more probable, but any such duplication MUST NOT affect the contents of any active sequence or picture parameter set. Duplication SHOULD be performed on the application layer and not by duplicating RTP packets (with identical sequence numbers).
Senders using the non-interleaved mode and the interleaved mode MUST enforce the following packetization rule:
o MANEs MAY convert single NAL unit packets into one aggregation packet, convert an aggregation packet into several single NAL unit packets, or mix both concepts, in an RTP translator. The RTP translator SHOULD take into account at least the following parameters: path MTU size, unequal protection mechanisms (e.g., through packet-based FEC according to RFC 2733 [18], especially for sequence and picture parameter set NAL units and coded slice data partition A NAL units), bearable latency of the system, and buffering capabilities of the receiver.
Informative note: An RTP translator is required to handle RTCP as per RFC 3550.
6.2. Single NAL Unit Mode
This mode is in use when the value of the OPTIONAL packetization-mode MIME parameter is equal to 0, the packetization-mode is not present, or no other packetization mode is signaled by external means. All receivers MUST support this mode. It is primarily intended for low-delay applications that are compatible with systems using ITU-T Recommendation H.241 [15] (see section 12.1). Only single NAL unit packets MAY be used in this mode. STAPs, MTAPs, and FUs MUST NOT be used. The transmission order of single NAL unit packets MUST comply with the NAL unit decoding order.
6.3. Non-Interleaved Mode
This mode is in use when the value of the OPTIONAL packetization-mode MIME parameter is equal to 1 or the mode is turned on by external means. This mode SHOULD be supported. It is primarily intended for low-delay applications. Only single NAL unit packets, STAP-As, and FU-As MAY be used in this mode. STAP-Bs, MTAPs, and FU-Bs MUST NOT be used. The transmission order of NAL units MUST comply with the NAL unit decoding order.
6.4. Interleaved Mode
This mode is in use when the value of the OPTIONAL packetization-mode MIME parameter is equal to 2 or the mode is turned on by external means. Some receivers MAY support this mode. STAP-Bs, MTAPs, FU-As, and FU-Bs MAY be used. STAP-As and single NAL unit packets MUST NOT be used. The transmission order of packets and NAL units is constrained as specified in section 5.5.
7. De-Packetization Process (Informative)
The de-packetization process is implementation dependent. Therefore, the following description should be seen as an example of a suitable implementation. Other schemes may be used as well. Optimizations relative to the described algorithms are likely possible. Section 7.1 presents the de-packetization process for the single NAL unit and non-interleaved packetization modes, whereas section 7.2 describes the process for the interleaved mode. Section 7.3 includes additional decapsulation guidelines for intelligent receivers.
All normal RTP mechanisms related to buffer management apply. In particular, duplicated or outdated RTP packets (as indicated by the RTP sequence number and the RTP timestamp) are removed. To determine the exact time for decoding, factors such as a possible intentional delay to allow for proper inter-stream synchronization must be taken into account.
7.1. Single NAL Unit and Non-Interleaved Mode
The receiver includes a receiver buffer to compensate for transmission delay jitter. The receiver stores incoming packets in reception order into the receiver buffer. Packets are decapsulated in RTP sequence number order. If a decapsulated packet is a single NAL unit packet, the NAL unit contained in the packet is passed directly to the decoder. If a decapsulated packet is an STAP-A, the NAL units contained in the packet are passed to the decoder in the order in which they are encapsulated in the packet. If a decapsulated packet is an FU-A, all the fragments of the fragmented NAL unit are concatenated and passed to the decoder. Informative note: If the decoder supports Arbitrary Slice Order, coded slices of a picture can be passed to the decoder in any order regardless of their reception and transmission order.
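As an illustration of the process described in section 7.1, the following Python sketch turns RTP payloads, already ordered by RTP sequence number, into NAL units. It is a minimal example and not a normative algorithm. The type values 24 (STAP-A) and 28 (FU-A) and the byte layouts of the STAP-A and FU-A payloads are taken from section 5, which is not reproduced here; the function name depacketize is hypothetical. Packet loss detection and jitter buffering are deliberately left out.

```python
from typing import Iterable, Iterator, List, Optional

STAP_A = 24  # NAL unit type of STAP-A (assumed from section 5.2)
FU_A = 28    # NAL unit type of FU-A (assumed from section 5.2)


def depacketize(payloads: Iterable[bytes]) -> Iterator[bytes]:
    """Yield NAL units from RTP payloads given in RTP sequence number order."""
    fragments: Optional[List[bytes]] = None  # NAL unit currently being reassembled
    for payload in payloads:
        if not payload:
            continue
        nal_type = payload[0] & 0x1F
        if nal_type == STAP_A:
            # After the one-byte STAP-A header: repeated (16-bit size, NAL unit).
            pos = 1
            while pos + 2 <= len(payload):
                size = int.from_bytes(payload[pos:pos + 2], "big")
                pos += 2
                if size == 0 or pos + size > len(payload):
                    break  # malformed aggregation unit; stop parsing this packet
                yield payload[pos:pos + size]
                pos += size
        elif nal_type == FU_A:
            if len(payload) < 2:
                continue
            indicator, header = payload[0], payload[1]
            start, end = bool(header & 0x80), bool(header & 0x40)
            if start:
                # Rebuild the NAL unit header: F and NRI come from the FU
                # indicator, the NAL unit type from the FU header.
                fragments = [bytes([(indicator & 0xE0) | (header & 0x1F)])]
            if fragments is None:
                continue  # no start fragment seen; drop this fragment
            fragments.append(payload[2:])
            if end:
                yield b"".join(fragments)
                fragments = None
        elif 1 <= nal_type <= 23:
            # Single NAL unit packet: the payload is the NAL unit itself.
            yield payload
```

A practical receiver would additionally discard a partially reassembled NAL unit when a sequence number gap is detected between its fragments.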
7.2. Interleaved Mode
The general concept behind these de-packetization rules is to reorder NAL units from transmission order to the NAL unit decoding order. The receiver includes a receiver buffer, which is used to compensate for transmission delay jitter and to reorder packets from transmission order to the NAL unit decoding order. In this section, the receiver operation is described under the assumption that there is no transmission delay jitter. To distinguish it from a practical receiver buffer that is also used for compensation of transmission delay jitter, the receiver buffer is hereafter called the deinterleaving buffer in this section. Receivers SHOULD also prepare for transmission delay jitter; i.e., either reserve separate buffers for transmission delay jitter buffering and deinterleaving buffering or use a receiver buffer for both transmission delay jitter and deinterleaving. Moreover, receivers SHOULD take transmission delay jitter into account in the buffering operation; e.g., by additional initial buffering before starting of decoding and playback.
This section is organized as follows: subsection 7.2.1 presents how to calculate the size of the deinterleaving buffer. Subsection 7.2.2 specifies the receiver process for arranging received NAL units into the NAL unit decoding order.
7.2.1. Size of the Deinterleaving Buffer
When the SDP Offer/Answer model or any other capability exchange procedure is used in session setup, the properties of the received stream SHOULD be such that the receiver capabilities are not exceeded. In the SDP Offer/Answer model, the receiver can indicate its capabilities to allocate a deinterleaving buffer with the deint-buf-cap MIME parameter. The sender indicates the requirement for the deinterleaving buffer size with the sprop-deint-buf-req MIME parameter. It is therefore RECOMMENDED to set the deinterleaving buffer size, in terms of number of bytes, equal to or greater than the value of the sprop-deint-buf-req MIME parameter. See section 8.1 for further information on the deint-buf-cap and sprop-deint-buf-req MIME parameters and section 8.2.2 for further information on their use in the SDP Offer/Answer model.
When a declarative session description is used in session setup, the sprop-deint-buf-req MIME parameter signals the requirement for the deinterleaving buffer size. It is therefore RECOMMENDED to set the deinterleaving buffer size, in terms of number of bytes, equal to or greater than the value of the sprop-deint-buf-req MIME parameter.
7.2.2. Deinterleaving Process
There are two buffering states in the receiver: initial buffering and buffering while playing. Initial buffering occurs when the RTP session is initialized. After initial buffering, decoding and playback are started, and the buffering-while-playing mode is used.
Regardless of the buffering state, the receiver stores incoming NAL units, in reception order, in the deinterleaving buffer as follows. NAL units of aggregation packets are stored in the deinterleaving buffer individually. The value of DON is calculated and stored for all NAL units.
The receiver operation is described below with the help of the following functions and constants:
o Function AbsDON is specified in section 8.1.
o Function don_diff is specified in section 5.5.
o Constant N is the value of the OPTIONAL sprop-interleaving-depth MIME type parameter (see section 8.1) incremented by 1.
Initial buffering lasts until one of the following conditions is fulfilled:
o There are N VCL NAL units in the deinterleaving buffer.
o If sprop-max-don-diff is present, don_diff(m,n) is greater than the value of sprop-max-don-diff, in which n corresponds to the NAL unit having the greatest value of AbsDON among the received NAL units and m corresponds to the NAL unit having the smallest value of AbsDON among the received NAL units.
o Initial buffering has lasted for the duration equal to or greater than the value of the OPTIONAL sprop-init-buf-time MIME parameter.
The NAL units to be removed from the deinterleaving buffer are determined as follows:
o If the deinterleaving buffer contains at least N VCL NAL units, NAL units are removed from the deinterleaving buffer and passed to the decoder in the order specified below until the buffer contains N-1 VCL NAL units.
o If sprop-max-don-diff is present, all NAL units m for which don_diff(m,n) is greater than sprop-max-don-diff are removed from the deinterleaving buffer and passed to the decoder in the order specified below. Herein, n corresponds to the NAL unit having the greatest value of AbsDON among the received NAL units.
The order in which NAL units are passed to the decoder is specified as follows:
o Let PDON be a variable that is initialized to 0 at the beginning of an RTP session.
o For each NAL unit associated with a value of DON, a DON distance is calculated as follows. If the value of DON of the NAL unit is larger than the value of PDON, the DON distance is equal to DON - PDON. Otherwise, the DON distance is equal to 65535 - PDON + DON + 1.
o NAL units are delivered to the decoder in ascending order of DON distance. If several NAL units share the same value of DON distance, they can be passed to the decoder in any order.
o When a desired number of NAL units have been passed to the decoder, the value of PDON is set to the value of DON for the last NAL unit passed to the decoder.
7.3. Additional De-Packetization Guidelines
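As an illustration of the delivery order defined in section 7.2.2 (the DON distance relative to PDON), the following Python sketch computes the DON distance and sorts buffered NAL units accordingly. It is a minimal example under stated assumptions: the function names don_distance and delivery_order and the representation of the buffer as (DON, NAL unit) pairs are hypothetical, and the selection of which NAL units to remove from the buffer is not shown.

```python
from typing import List, Tuple


def don_distance(don: int, pdon: int) -> int:
    """DON distance of a NAL unit relative to PDON, as worded in section 7.2.2."""
    if don > pdon:
        return don - pdon
    # DON has wrapped around 65535 (or equals PDON).
    return 65535 - pdon + don + 1


def delivery_order(buffered: List[Tuple[int, bytes]], pdon: int) -> List[Tuple[int, bytes]]:
    """Return (DON, NAL unit) pairs sorted by ascending DON distance.

    sorted() is stable, so NAL units with equal DON distance keep their
    reception order, which is one of the permitted orders.
    """
    return sorted(buffered, key=lambda item: don_distance(item[0], pdon))
```

After the chosen NAL units have been passed to the decoder, the caller would set PDON to the DON of the last delivered NAL unit before the next call.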
The following additional de-packetization rules may be used to implement an operational H.264 de-packetizer:
o Intelligent RTP receivers (e.g., in gateways) may identify lost coded slice data partitions A (DPAs). If a lost DPA is found, a gateway may decide not to send the corresponding coded slice data partitions B and C, as their information is meaningless for H.264 decoders. In this way a MANE can reduce network load by discarding useless packets without parsing a complex bitstream.
o Intelligent RTP receivers (e.g., in gateways) may identify lost FUs. If a lost FU is found, a gateway may decide not to send the following FUs of the same fragmented NAL unit, as their information is meaningless for H.264 decoders. In this way a MANE can reduce network load by discarding useless packets without parsing a complex bitstream.
o Intelligent receivers having to discard packets or NALUs should first discard all packets/NALUs in which the value of the NRI field of the NAL unit type octet is equal to 0. This will minimize the impact on user experience and keep the reference pictures intact. If more packets have to be discarded, then packets with a numerically lower NRI value should be discarded before packets with a numerically higher NRI value. However, discarding any packets with an NRI greater than 0 very likely leads to decoder drift and SHOULD be avoided.
8. Payload Format Parameters
This section specifies the parameters that MAY be used to select optional features of the payload format and certain features of the bitstream. The parameters are specified here as part of the MIME subtype registration for the ITU-T H.264 | ISO/IEC 14496-10 codec. A mapping of the parameters into the Session Description Protocol (SDP) [5] is also provided for applications that use SDP. Equivalent parameters could be defined elsewhere for use with control protocols that do not use MIME or SDP.
Some parameters provide a receiver with the properties of the stream that will be sent. The name of all these parameters starts with "sprop" for stream properties. Some of these "sprop" parameters are limited by other payload or codec configuration parameters. For example, the sprop-parameter-sets parameter is constrained by the profile-level-id parameter. The media sender selects all "sprop" parameters rather than the receiver. This uncommon characteristic of the "sprop" parameters may not be compatible with some signaling protocol concepts, in which case the use of these parameters SHOULD be avoided.
8.1. MIME Registration
The MIME subtype for the ITU-T H.264 | ISO/IEC 14496-10 codec is allocated from the IETF tree. The receiver MUST ignore any unspecified parameter.
Media Type name: video
Media subtype name: H264
Required parameters: none
OPTIONAL parameters:
profile-level-id: A base16 [6] (hexadecimal) representation of the following three bytes in the sequence parameter set NAL unit specified in [1]: 1) profile_idc, 2) a byte herein referred to as profile-iop, composed of the values of constraint_set0_flag, constraint_set1_flag, constraint_set2_flag, and reserved_zero_5bits in bit-significance order, starting from the most significant bit, and 3) level_idc. Note that reserved_zero_5bits is required to be equal to 0 in [1], but other values for it may be specified in the future by ITU-T or ISO/IEC.
If the profile-level-id parameter is used to indicate properties of a NAL unit stream, it indicates the profile and level that a decoder has to support in order to comply with [1] when it decodes the stream. The profile-iop byte indicates whether the NAL unit stream also obeys all constraints of the indicated profiles as follows. If bit 7 (the most significant bit), bit 6, or bit 5 of profile-iop is equal to 1, all constraints of the Baseline profile, the Main profile, or the Extended profile, respectively, are obeyed in the NAL unit stream.
If the profile-level-id parameter is used for capability exchange or session setup procedure, it indicates the profile that the codec supports and the highest level supported for the signaled profile. The profile-iop byte indicates whether the codec has additional limitations whereby only the common subset of the algorithmic features and limitations of the profiles signaled with the profile-iop byte and of the profile indicated by profile_idc is supported by the codec.
For example, if a codec supports only the common subset of the coding tools of the Baseline profile and the Main profile at level 2.1 and below, the profile-level-id becomes 42E015, in which 42 stands for the Baseline profile, E0 indicates that only the common subset for all profiles is supported, and 15 indicates level 2.1.
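The byte layout of profile-level-id described above can be illustrated with a short Python sketch that splits the six hexadecimal digits into profile_idc, profile-iop, and level_idc, and extracts the three constraint flags from bits 7, 6, and 5 of profile-iop. The class name ProfileLevelId and the function name parse_profile_level_id are hypothetical and serve only this example.

```python
import string
from typing import NamedTuple, Tuple


class ProfileLevelId(NamedTuple):
    profile_idc: int
    profile_iop: int
    level_idc: int

    @property
    def constraint_flags(self) -> Tuple[bool, ...]:
        """constraint_set0_flag, constraint_set1_flag, constraint_set2_flag
        (bits 7, 6, and 5 of profile-iop)."""
        return tuple(bool(self.profile_iop & (0x80 >> i)) for i in range(3))


def parse_profile_level_id(value: str) -> ProfileLevelId:
    """Split a base16 profile-level-id string into its three bytes."""
    if len(value) != 6 or any(c not in string.hexdigits for c in value):
        raise ValueError("profile-level-id must be exactly six hexadecimal digits")
    raw = bytes.fromhex(value)
    return ProfileLevelId(raw[0], raw[1], raw[2])


parsed = parse_profile_level_id("42E015")
print(hex(parsed.profile_idc), hex(parsed.profile_iop), hex(parsed.level_idc))
print(parsed.constraint_flags)
```

For the example value 42E015 from the text, all three constraint flags are set and level_idc is hexadecimal 15, which the text reads as level 2.1.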
Informative note: Capability exchange and session setup procedures should provide means to list the capabilities for each supported codec profile separately. For example, the one-of-N codec selection procedure of the SDP Offer/Answer model can be used (section 10.2 of [7]).
If no profile-level-id is present, the Baseline Profile without additional constraints at Level 1 MUST be implied.
max-mbps, max-fs, max-cpb, max-dpb, and max-br: These parameters MAY be used to signal the capabilities of a receiver implementation. These parameters MUST NOT be used for any other purpose. The profile-level-id parameter MUST be present in the same receiver capability description that contains any of these parameters. The level conveyed in the value of the profile-level-id parameter MUST be one that the receiver is fully capable of supporting. max-mbps, max-fs, max-cpb, max-dpb, and max-br MAY be used to indicate capabilities of the receiver that extend the required capabilities of the signaled level, as specified below.
When more than one parameter from the set (max-mbps, max-fs, max-cpb, max-dpb, max-br) is present, the receiver MUST support all signaled capabilities simultaneously. For example, if both max-mbps and max-br are present, the signaled level with the extension of both the frame rate and bit rate is supported. That is, the receiver is able to decode NAL unit streams in which the macroblock processing rate is up to max-mbps (inclusive), the bit rate is up to max-br (inclusive), the coded picture buffer size is derived as specified in the semantics of the max-br parameter below, and other properties comply with the level specified in the value of the profile-level-id parameter.
A receiver MUST NOT signal values of max-mbps, max-fs, max-cpb, max-dpb, and max-br that meet the requirements of a higher level,
referred to as level A herein, compared to the level specified in the value of the profile-level-id parameter, if the receiver can support all the properties of level A.
Informative note: When the OPTIONAL MIME type parameters are used to signal the properties of a NAL unit stream, max-mbps, max-fs, max-cpb, max-dpb, and max-br are not present, and the value of profile-level-id must always be such that the NAL unit stream complies fully with the specified profile and level.
max-mbps: The value of max-mbps is an integer indicating the maximum macroblock processing rate in units of macroblocks per second. The max-mbps parameter signals that the receiver is capable of decoding video at a higher rate than is required by the signaled level conveyed in the value of the profile-level-id parameter. When max-mbps is signaled, the receiver MUST be able to decode NAL unit streams that conform to the signaled level, with the exception that the MaxMBPS value in Table A-1 of [1] for the signaled level is replaced with the value of max-mbps. The value of max-mbps MUST be greater than or equal to the value of MaxMBPS for the level given in Table A-1 of [1]. Senders MAY use this knowledge to send pictures of a given size at a higher picture rate than is indicated in the signaled level.
max-fs: The value of max-fs is an integer indicating the maximum frame size in units of macroblocks. The max-fs parameter signals that the receiver is capable of decoding larger picture sizes than are required by the signaled level conveyed in the value of the profile-level-id parameter. When max-fs is signaled, the receiver MUST be able to decode NAL unit streams that conform to the signaled level, with the exception that the MaxFS value in Table A-1 of [1] for the signaled level is replaced with the value of max-fs. The value of max-fs MUST be greater than or equal to the value of MaxFS for the level given in Table A-1 of [1]. Senders MAY use this knowledge to send larger pictures at a
proportionally lower frame rate than is indicated in the signaled level.
max-cpb: The value of max-cpb is an integer indicating the maximum coded picture buffer size in units of 1000 bits for the VCL HRD parameters (see A.3.1 item i of [1]) and in units of 1200 bits for the NAL HRD parameters (see A.3.1 item j of [1]). The max-cpb parameter signals that the receiver has more memory than the minimum amount of coded picture buffer memory required by the signaled level conveyed in the value of the profile-level-id parameter. When max-cpb is signaled, the receiver MUST be able to decode NAL unit streams that conform to the signaled level, with the exception that the MaxCPB value in Table A-1 of [1] for the signaled level is replaced with the value of max-cpb. The value of max-cpb MUST be greater than or equal to the value of MaxCPB for the level given in Table A-1 of [1]. Senders MAY use this knowledge to construct coded video streams with greater variation of bit rate than can be achieved with the MaxCPB value in Table A-1 of [1].
Informative note: The coded picture buffer is used in the hypothetical reference decoder (Annex C) of H.264. The use of the hypothetical reference decoder is recommended in H.264 encoders to verify that the produced bitstream conforms to the standard and to control the output bitrate. Thus, the coded picture buffer is conceptually independent of any other potential buffers in the receiver, including de-interleaving and de-jitter buffers. The coded picture buffer need not be implemented in decoders as specified in Annex C of H.264, but rather standard-compliant decoders can have any buffering arrangements provided that they can decode standard-compliant bitstreams. Thus, in practice, the input buffer for the video decoder can be integrated with the de-interleaving and de-jitter buffers of the receiver.
max-dpb: The value of max-dpb is an integer indicating the maximum decoded picture buffer size in units of 1024 bytes. The max-dpb parameter signals that the receiver has more memory than the minimum amount of decoded picture buffer memory required by the signaled level conveyed in the value of the profile-level-id parameter. When max-dpb is signaled, the receiver MUST be able to decode NAL unit streams that conform to the signaled level, with the exception that the MaxDPB value in Table A-1 of [1] for the signaled level is replaced with the value of max-dpb. Consequently, a receiver that signals max-dpb MUST be capable of storing the following number of decoded frames, complementary field pairs, and non-paired fields in its decoded picture buffer: Min(1024 * max-dpb / ( PicWidthInMbs * FrameHeightInMbs * 256 * ChromaFormatFactor ), 16) PicWidthInMbs, FrameHeightInMbs, and ChromaFormatFactor are defined in [1]. The value of max-dpb MUST be greater than or equal to the value of MaxDPB for the level given in Table A-1 of [1]. Senders MAY use this knowledge to construct coded video streams with improved compression. Informative note: This parameter was added primarily to complement a similar codepoint in the ITU-T Recommendation H.245, so as to facilitate signaling gateway designs. The decoded picture buffer stores reconstructed samples and is a property of the video decoder only. There is no relationship between the size of the decoded picture buffer and the buffers used in RTP, especially de-interleaving and de-jitter buffers. max-br: The value of max-br is an integer indicating the maximum video bit rate in units of 1000 bits per second for the VCL HRD parameters (see A.3.1 item i of [1]) and in units of 1200 bits
per second for the NAL HRD parameters (see A.3.1 item j of [1]). The max-br parameter signals that the video decoder of the receiver is capable of decoding video at a higher bit rate than is required by the signaled level conveyed in the value of the profile-level-id parameter. The value of max-br MUST be greater than or equal to the value of MaxBR for the level given in Table A-1 of [1].
When max-br is signaled, the video codec of the receiver MUST be able to decode NAL unit streams that conform to the signaled level, conveyed in the profile-level-id parameter, with the following exceptions in the limits specified by the level:
o The value of max-br replaces the MaxBR value of the signaled level (in Table A-1 of [1]).
o When the max-cpb parameter is not present, the result of the following formula replaces the value of MaxCPB in Table A-1 of [1]: (MaxCPB of the signaled level) * max-br / (MaxBR of the signaled level).
For example, if a receiver signals capability for Level 1.2 with max-br equal to 1550, this indicates a maximum video bitrate of 1550 kbits/sec for VCL HRD parameters, a maximum video bitrate of 1860 kbits/sec for NAL HRD parameters, and a CPB size of 4036458 bits (1550000 / 384000 * 1000 * 1000).
The value of max-br MUST be greater than or equal to the value MaxBR for the signaled level given in Table A-1 of [1]. Senders MAY use this knowledge to send higher bitrate video as allowed in the level definition of Annex A of H.264, to achieve improved video quality.
Informative note: This parameter was added primarily to complement a similar codepoint in the ITU-T Recommendation H.245, so as to facilitate signaling gateway designs. No assumption can be made from the value of
this parameter that the network is capable of handling such bit rates at any given time. In particular, no conclusion can be drawn that the signaled bit rate is possible under congestion control constraints.
redundant-pic-cap: This parameter signals the capabilities of a receiver implementation. When equal to 0, the parameter indicates that the receiver makes no attempt to use redundant coded pictures to correct incorrectly decoded primary coded pictures. When equal to 0, the receiver is not capable of using redundant slices; therefore, a sender SHOULD avoid sending redundant slices to save bandwidth. When equal to 1, the receiver is capable of decoding any such redundant slice that covers a corrupted area in a primary decoded picture (at least partly), and therefore a sender MAY send redundant slices. When the parameter is not present, then a value of 0 MUST be used for redundant-pic-cap. When present, the value of redundant-pic-cap MUST be either 0 or 1.
When the profile-level-id parameter is present in the same capability signaling as the redundant-pic-cap parameter, and the profile indicated in profile-level-id is such that it disallows the use of redundant coded pictures (e.g., Main Profile), the value of redundant-pic-cap MUST be equal to 0. When a receiver indicates redundant-pic-cap equal to 0, the received stream SHOULD NOT contain redundant coded pictures.
Informative note: Even if redundant-pic-cap is equal to 0, the decoder is able to ignore redundant coded pictures provided that the decoder supports such a profile (Baseline, Extended) in which redundant coded pictures are allowed.
Informative note: Even if redundant-pic-cap is equal to 1, the receiver may also choose other error concealment strategies to
replace or complement decoding of redundant slices.
sprop-parameter-sets: This parameter MAY be used to convey any sequence and picture parameter set NAL units (herein referred to as the initial parameter set NAL units) that MUST precede any other NAL units in decoding order. The parameter MUST NOT be used to indicate codec capability in any capability exchange procedure. The value of the parameter is the base64 [6] representation of the initial parameter set NAL units as specified in sections 7.3.2.1 and 7.3.2.2 of [1]. The parameter sets are conveyed in decoding order, and no framing of the parameter set NAL units takes place. A comma is used to separate any pair of parameter sets in the list. Note that the number of bytes in a parameter set NAL unit is typically less than 10, but a picture parameter set NAL unit can contain several hundreds of bytes.
Informative note: When several payload types are offered in the SDP Offer/Answer model, each with its own sprop-parameter-sets parameter, then the receiver cannot assume that those parameter sets do not use conflicting storage locations (i.e., identical values of parameter set identifiers). Therefore, a receiver should double-buffer all sprop-parameter-sets and make them available to the decoder instance that decodes a certain payload type.
parameter-add: This parameter MAY be used to signal whether the receiver of this parameter is allowed to add parameter sets in its signaling response using the sprop-parameter-sets MIME parameter. The value of this parameter is either 0 or 1. 0 is equal to false; i.e., it is not allowed to add parameter sets. 1 is equal to true; i.e., it is allowed to add parameter sets. If the parameter is not present, its value MUST be 1.
packetization-mode: This parameter signals the properties of an RTP payload type or the capabilities of a receiver implementation. Only a single configuration point can be indicated; thus, when capabilities to support more than one packetization-mode are declared, multiple configuration points (RTP payload types) must be used.
When the value of packetization-mode is equal to 0 or packetization-mode is not present, the single NAL unit mode, as defined in section 6.2 of RFC 3984, MUST be used. This mode is in use in standards using ITU-T Recommendation H.241 [15] (see section 12.1). When the value of packetization-mode is equal to 1, the non-interleaved mode, as defined in section 6.3 of RFC 3984, MUST be used. When the value of packetization-mode is equal to 2, the interleaved mode, as defined in section 6.4 of RFC 3984, MUST be used. The value of packetization-mode MUST be an integer in the range of 0 to 2, inclusive.
sprop-interleaving-depth: This parameter MUST NOT be present when packetization-mode is not present or the value of packetization-mode is equal to 0 or 1. This parameter MUST be present when the value of packetization-mode is equal to 2.
This parameter signals the properties of a NAL unit stream. It specifies the maximum number of VCL NAL units that precede any VCL NAL unit in the NAL unit stream in transmission order and follow the VCL NAL unit in decoding order. Consequently, it is guaranteed that receivers can reconstruct NAL unit decoding order when the buffer size for NAL unit decoding order recovery is at least the value of sprop-interleaving-depth + 1 in terms of VCL NAL units.
The value of sprop-interleaving-depth MUST be an integer in the range of 0 to 32767, inclusive.
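The relationship between the packetization-mode values and the packet types permitted by sections 6.2, 6.3, and 6.4 can be summarized in a small lookup table. The Python sketch below is an illustration only: the numeric NAL unit type values (1 to 23 for single NAL unit packets, 24 to 29 for the aggregation and fragmentation types) are taken from section 5.2, which is not reproduced here, and the names ALLOWED and packet_allowed are hypothetical.

```python
from typing import Dict, FrozenSet

# NAL unit type values assumed from section 5.2.
SINGLE_NAL = frozenset(range(1, 24))
STAP_A, STAP_B, MTAP16, MTAP24, FU_A, FU_B = 24, 25, 26, 27, 28, 29

ALLOWED: Dict[int, FrozenSet[int]] = {
    0: SINGLE_NAL,                                        # single NAL unit mode
    1: SINGLE_NAL | {STAP_A, FU_A},                       # non-interleaved mode
    2: frozenset({STAP_B, MTAP16, MTAP24, FU_A, FU_B}),   # interleaved mode
}


def packet_allowed(payload: bytes, packetization_mode: int = 0) -> bool:
    """Check whether an RTP payload's type is permitted in the given mode.

    The default of 0 reflects that an absent packetization-mode parameter
    selects the single NAL unit mode.
    """
    if packetization_mode not in ALLOWED:
        raise ValueError("packetization-mode must be 0, 1, or 2")
    if not payload:
        return False
    return (payload[0] & 0x1F) in ALLOWED[packetization_mode]
```

A receiver could use such a check to reject packets whose type contradicts the negotiated mode before attempting to decapsulate them.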
sprop-deint-buf-req: This parameter MUST NOT be present when packetization-mode is not present or the value of packetization-mode is equal to 0 or 1. It MUST be present when the value of packetization-mode is equal to 2. sprop-deint-buf-req signals the required size of the deinterleaving buffer for the NAL unit stream. The value of the parameter MUST be greater than or equal to the maximum buffer occupancy (in units of bytes) required in such a deinterleaving buffer that is specified in section 7.2 of RFC 3984. It is guaranteed that receivers can perform the deinterleaving of interleaved NAL units into NAL unit decoding order, when the deinterleaving buffer size is at least the value of sprop-deint-buf-req in terms of bytes. The value of sprop-deint-buf-req MUST be an integer in the range of 0 to 4294967295, inclusive. Informative note: sprop-deint-buf-req indicates the required size of the deinterleaving buffer only. When network jitter can occur, an appropriately sized jitter buffer has to be provisioned for as well. deint-buf-cap: This parameter signals the capabilities of a receiver implementation and indicates the amount of deinterleaving buffer space in units of bytes that the receiver has available for reconstructing the NAL unit decoding order. A receiver is able to handle any stream for which the value of the sprop-deint-buf-req parameter is smaller than or equal to this parameter. If the parameter is not present, then a value of 0 MUST be used for deint-buf-cap. The value of deint-buf-cap MUST be an integer in the range of 0 to 4294967295, inclusive. Informative note: deint-buf-cap indicates the maximum possible size of the deinterleaving buffer of the receiver only.
When network jitter can occur, an appropriately sized jitter buffer has to be provisioned for as well.
sprop-init-buf-time: This parameter MAY be used to signal the properties of a NAL unit stream. The parameter MUST NOT be present if the value of packetization-mode is equal to 0 or 1.
The parameter signals the initial buffering time that a receiver MUST buffer before starting decoding to recover the NAL unit decoding order from the transmission order. The parameter is the maximum value of (transmission time of a NAL unit - decoding time of the NAL unit), assuming reliable and instantaneous transmission, the same timeline for transmission and decoding, and that decoding starts when the first packet arrives.
An example of specifying the value of sprop-init-buf-time follows. A NAL unit stream is sent in the following interleaved order, in which the value corresponds to the decoding time and the transmission order is from left to right:
0 2 1 3 5 4 6 8 7 ...
Assuming a steady transmission rate of NAL units, the transmission times are:
0 1 2 3 4 5 6 7 8 ...
Subtracting the decoding time from the transmission time column-wise results in the following series:
0 -1 1 0 -1 1 0 -1 1 ...
Thus, in terms of intervals of NAL unit transmission times, the value of sprop-init-buf-time in this example is 1.
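The worked example above can be reproduced with a few lines of Python. This is only a sketch under the same assumptions as the example (a steady transmission rate of one NAL unit per time interval, starting at time 0); the function name init_buf_time is hypothetical, and the result is expressed in NAL unit transmission intervals rather than in clock ticks.

```python
from typing import Sequence


def init_buf_time(decoding_times: Sequence[int]) -> int:
    """Maximum of (transmission time - decoding time) over all NAL units.

    decoding_times lists the decoding time of each NAL unit in transmission
    order; the transmission time of the i-th NAL unit is assumed to be i.
    """
    return max(tx - dec for tx, dec in enumerate(decoding_times))


print(init_buf_time([0, 2, 1, 3, 5, 4, 6, 8, 7]))  # prints 1
```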
The parameter is coded as a non-negative base10 integer representation in clock ticks of a 90-kHz clock. If the parameter is not present, then no initial buffering time value is defined. Otherwise the value of sprop-init-buf-time MUST be an integer in the range of 0 to 4294967295, inclusive. In addition to the signaled sprop-init-buf-time, receivers SHOULD take into account the transmission delay jitter buffering, including buffering for the delay jitter caused by mixers, translators, gateways, proxies, traffic-shapers, and other network elements.

sprop-max-don-diff: This parameter MAY be used to signal the properties of a NAL unit stream. It MUST NOT be used to signal transmitter or receiver or codec capabilities. The parameter MUST NOT be present if the value of packetization-mode is equal to 0 or 1. sprop-max-don-diff is an integer in the range of 0 to 32767, inclusive. If sprop-max-don-diff is not present, the value of the parameter is unspecified. sprop-max-don-diff is calculated as follows:

   sprop-max-don-diff = max{AbsDON(i) - AbsDON(j)}, for any i and any j>i,

where i and j indicate the index of the NAL unit in the transmission order and AbsDON denotes a decoding order number of the NAL unit that does not wrap around to 0 after 65535. In other words, AbsDON is calculated as follows: Let m and n be consecutive NAL units in transmission order. For the very first NAL unit in transmission order (whose index is 0), AbsDON(0) = DON(0). For other NAL units, AbsDON is calculated as follows:

   If DON(m) == DON(n), AbsDON(n) = AbsDON(m)

   If (DON(m) < DON(n) and DON(n) - DON(m) < 32768),
   AbsDON(n) = AbsDON(m) + DON(n) - DON(m)

   If (DON(m) > DON(n) and DON(m) - DON(n) >= 32768),
   AbsDON(n) = AbsDON(m) + 65536 - DON(m) + DON(n)

   If (DON(m) < DON(n) and DON(n) - DON(m) >= 32768),
   AbsDON(n) = AbsDON(m) - (DON(m) + 65536 - DON(n))

   If (DON(m) > DON(n) and DON(m) - DON(n) < 32768),
   AbsDON(n) = AbsDON(m) - (DON(m) - DON(n))

where DON(i) is the decoding order number of the NAL unit having index i in the transmission order. The decoding order number is specified in section 5.5 of RFC 3984. Informative note: Receivers may use sprop-max-don-diff to trigger which NAL units in the receiver buffer can be passed to the decoder.

max-rcmd-nalu-size: This parameter MAY be used to signal the capabilities of a receiver. The parameter MUST NOT be used for any other purposes. The value of the parameter indicates the largest NALU size in bytes that the receiver can handle efficiently. The parameter value is a recommendation, not a strict upper boundary. The sender MAY create larger NALUs but must be aware that the handling of these may come at a higher cost than NALUs conforming to the limitation. The value of max-rcmd-nalu-size MUST be an integer in the range of 0 to 4294967295, inclusive. If this parameter is not specified, no known limitation to the NALU size exists. Senders still have to consider the MTU size available between the sender and the receiver and SHOULD run MTU discovery for this purpose. This parameter is motivated by, for example, an IP to H.223 video telephony gateway, where NALUs smaller than the H.223 transport data
unit will be more efficient. A gateway may terminate IP; thus, MTU discovery will normally not work beyond the gateway. Informative note: Setting this parameter to a lower than necessary value may have a negative impact.

Encoding considerations: This type is only defined for transfer via RTP (RFC 3550). A file format of H.264/AVC video is defined in [29]. This definition is utilized by other file formats, such as the 3GPP multimedia file format (MIME type video/3gpp) [30] or the MP4 file format (MIME type video/mp4).

Security considerations: See section 9 of RFC 3984.
Public specification: Please refer to RFC 3984 and its section 15.
Additional information: None
File extensions: none
Macintosh file type code: none
Object identifier or OID: none
Person & email address to contact for further information: stewe@stewe.org
Intended usage: COMMON
Author: stewe@stewe.org
Change controller: IETF Audio/Video Transport working group delegated from the IESG.
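The AbsDON rules given in the definition of sprop-max-don-diff translate almost line by line into code. The following Python sketch is illustrative only: the function names abs_don_sequence and sprop_max_don_diff are hypothetical, the input is assumed to be the list of 16-bit DON values in transmission order, and clamping the result at 0 is an assumption made here because the parameter is defined over the range 0 to 32767.

```python
def abs_don_sequence(dons):
    """Unwrap 16-bit DON values (in transmission order) into AbsDON values."""
    abs_dons = []
    for n, don_n in enumerate(dons):
        if n == 0:
            abs_dons.append(don_n)          # AbsDON(0) = DON(0)
            continue
        don_m, abs_m = dons[n - 1], abs_dons[-1]
        if don_m == don_n:
            abs_n = abs_m
        elif don_m < don_n and don_n - don_m < 32768:
            abs_n = abs_m + don_n - don_m
        elif don_m > don_n and don_m - don_n >= 32768:
            abs_n = abs_m + 65536 - don_m + don_n       # forward wrap-around
        elif don_m < don_n and don_n - don_m >= 32768:
            abs_n = abs_m - (don_m + 65536 - don_n)     # backward wrap-around
        else:  # don_m > don_n and don_m - don_n < 32768
            abs_n = abs_m - (don_m - don_n)
        abs_dons.append(abs_n)
    return abs_dons


def sprop_max_don_diff(dons):
    """max{AbsDON(i) - AbsDON(j)} for any i and any j > i.

    Assumption: the result is clamped at 0, since the parameter range
    starts at 0; the source text does not state this explicitly.
    """
    abs_dons = abs_don_sequence(dons)
    best = 0
    running_max = abs_dons[0]               # largest AbsDON seen before index j
    for value in abs_dons[1:]:
        best = max(best, running_max - value)
        running_max = max(running_max, value)
    return best


# DON values that wrap around 65535 while being slightly interleaved.
print(abs_don_sequence([65534, 0, 65535, 1]))
print(sprop_max_don_diff([65534, 0, 65535, 1]))
```

Tracking the running maximum keeps the computation linear in the number of NAL units instead of comparing every pair (i, j).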
8.2. SDP Parameters
8.2.1. Mapping of MIME Parameters to SDP
The MIME media type video/H264 string is mapped to fields in the Session Description Protocol (SDP) [5] as follows:

o The media name in the "m=" line of SDP MUST be video.

o The encoding name in the "a=rtpmap" line of SDP MUST be H264 (the MIME subtype).

o The clock rate in the "a=rtpmap" line MUST be 90000.

o The OPTIONAL parameters "profile-level-id", "max-mbps", "max-fs", "max-cpb", "max-dpb", "max-br", "redundant-pic-cap", "sprop-parameter-sets", "parameter-add", "packetization-mode", "sprop-interleaving-depth", "deint-buf-cap", "sprop-deint-buf-req", "sprop-init-buf-time", "sprop-max-don-diff", and "max-rcmd-nalu-size", when present, MUST be included in the "a=fmtp" line of SDP. These parameters are expressed as a MIME media type string, in the form of a semicolon separated list of parameter=value pairs.

An example of media representation in SDP is as follows (Baseline Profile, Level 3.0, some of the constraints of the Main profile may not be obeyed):

   m=video 49170 RTP/AVP 98
   a=rtpmap:98 H264/90000
   a=fmtp:98 profile-level-id=42A01E;
     sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==

8.2.2. Usage with the SDP Offer/Answer Model
When H.264 is offered over RTP using SDP in an Offer/Answer model [7] for negotiation for unicast usage, the following limitations and rules apply: o The parameters identifying a media format configuration for H.264 are "profile-level-id", "packetization-mode", and, if required by "packetization-mode", "sprop-deint-buf-req". These three parameters MUST be used symmetrically; i.e., the answerer MUST either maintain all configuration parameters or remove the media format (payload type) completely, if one or more of the parameter values are not supported.
Informative note: The requirement for symmetric use applies only for the above three parameters and not for the other stream properties and capability parameters. To simplify handling and matching of these configurations, the same RTP payload type number used in the offer SHOULD also be used in the answer, as specified in [7]. An answer MUST NOT contain a payload type number used in the offer unless the configuration ("profile-level-id", "packetization-mode", and, if present, "sprop-deint-buf-req") is the same as in the offer. Informative note: An offerer, when receiving the answer, has to compare payload types not declared in the offer based on media type (i.e., video/h264) and the above three parameters with any payload types it has already declared, in order to determine whether the configuration in question is new or equivalent to a configuration already offered. o The parameters "sprop-parameter-sets", "sprop-deint-buf-req", "sprop-interleaving-depth", "sprop-max-don-diff", and "sprop-init-buf-time" describe the properties of the NAL unit stream that the offerer or answerer is sending for this media format configuration. This differs from the normal usage of the Offer/Answer parameters: normally such parameters declare the properties of the stream that the offerer or the answerer is able to receive. When dealing with H.264, the offerer assumes that the answerer will be able to receive media encoded using the configuration being offered. Informative note: The above parameters apply for any stream sent by the declaring entity with the same configuration; i.e., they are dependent on their source. Rather than being bound to the payload type, the values may have to be applied to another payload type when being sent, as they apply for the configuration. o The capability parameters ("max-mbps", "max-fs", "max-cpb", "max-dpb", "max-br", "redundant-pic-cap", "max-rcmd-nalu-size") MAY be used to declare further capabilities. 
Their interpretation depends on the direction attribute. When the direction attribute is sendonly, then the parameters describe the limits of the RTP packets and the NAL unit stream that the sender is capable of producing. When the direction attribute is sendrecv or recvonly, then the parameters describe the limitations of what the receiver accepts.
o As specified above, an offerer has to include the size of the deinterleaving buffer in the offer for an interleaved H.264 stream. To enable the offerer and answerer to inform each other about their capabilities for deinterleaving buffering, both parties are RECOMMENDED to include "deint-buf-cap". This information MAY be used when the value for "sprop-deint-buf-req" is selected in a second round of offer and answer. For interleaved streams, it is also RECOMMENDED to consider offering multiple payload types with different buffering requirements when the capabilities of the receiver are unknown. o The "sprop-parameter-sets" parameter is used as described above. In addition, an answerer MUST maintain all parameter sets received in the offer in its answer. Depending on the value of the "parameter-add" parameter, different rules apply: If "parameter-add" is false (0), the answer MUST NOT add any additional parameter sets. If "parameter-add" is true (1), the answerer, in its answer, MAY add additional parameter sets to the "sprop-parameter-sets" parameter. The answerer MUST also, independent of the value of "parameter-add", accept to receive a video stream using the sprop-parameter-sets it declared in the answer. Informative note: care must be taken when parameter sets are added not to cause overwriting of already transmitted parameter sets by using conflicting parameter set identifiers. For streams being delivered over multicast, the following rules apply in addition: o The stream properties parameters ("sprop-parameter-sets", "sprop-deint-buf-req", "sprop-interleaving-depth", "sprop-max-don-diff", and "sprop-init-buf-time") MUST NOT be changed by the answerer. Thus, a payload type can either be accepted unaltered or removed. 
o The receiver capability parameters "max-mbps", "max-fs", "max-cpb", "max-dpb", "max-br", and "max-rcmd-nalu-size" MUST be supported by the answerer for all streams declared as sendrecv or recvonly; otherwise, one of the following actions MUST be performed: the media format is removed, or the session rejected. o The receiver capability parameter redundant-pic-cap SHOULD be supported by the answerer for all streams declared as sendrecv or recvonly as follows: The answerer SHOULD NOT include redundant coded pictures in the transmitted stream if the offerer indicated redundant-pic-cap equal to 0. Otherwise (when redundant-pic-cap is equal to 1), it is beyond the scope of this memo to recommend how the answerer should use redundant coded pictures.
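The unicast rule that a payload type number may only be reused when the configuration ("profile-level-id", "packetization-mode", and, if present, "sprop-deint-buf-req") is unchanged can be sketched as follows. This is a hypothetical helper, not part of any SDP library: parse_fmtp and same_configuration are illustrative names, the input is assumed to be the parameter list that follows the payload type number on an "a=fmtp" line, and comparing "profile-level-id" without regard to hexadecimal case is an assumption of this sketch.

```python
def parse_fmtp(params):
    """Split 'name=value; name=value' into a dict.

    Only the first '=' separates name and value, so base64 padding in
    sprop-parameter-sets is preserved.
    """
    result = {}
    for item in params.split(";"):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        result[name.strip()] = value.strip()
    return result


def configuration(fmtp):
    """Return the three values that identify a media format configuration."""
    profile = fmtp.get("profile-level-id")
    return (
        profile.upper() if profile is not None else None,  # assumption: case-insensitive hex
        fmtp.get("packetization-mode", "0"),               # absent means single NAL unit mode
        fmtp.get("sprop-deint-buf-req"),                   # only present for mode 2
    )


def same_configuration(offer_params, answer_params):
    """True if both fmtp parameter lists describe the same configuration."""
    return configuration(parse_fmtp(offer_params)) == configuration(
        parse_fmtp(answer_params)
    )


offer = ("profile-level-id=42A01E; packetization-mode=0; "
         "sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==")
answer_mode0 = "profile-level-id=42A01E; packetization-mode=0"
answer_mode1 = "profile-level-id=42A01E; packetization-mode=1"

print(same_configuration(offer, answer_mode0))
print(same_configuration(offer, answer_mode1))
```

Stream property and capability parameters are deliberately left out of the comparison, since the symmetry requirement applies only to the three configuration parameters.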
Below are the complete lists of how the different parameters shall be interpreted in the different combinations of offer or answer and direction attribute. o In offers and answers for which "a=sendrecv" or no direction attribute is used, or in offers and answers for which "a=recvonly" is used, the following interpretation of the parameters MUST be used. Declaring actual configuration or properties for receiving: - profile-level-id - packetization-mode Declaring actual properties of the stream to be sent (applicable only when "a=sendrecv" or no direction attribute is used): - sprop-deint-buf-req - sprop-interleaving-depth - sprop-parameter-sets - sprop-max-don-diff - sprop-init-buf-time Declaring receiver implementation capabilities: - max-mbps - max-fs - max-cpb - max-dpb - max-br - redundant-pic-cap - deint-buf-cap - max-rcmd-nalu-size Declaring how Offer/Answer negotiation shall be performed: - parameter-add o In an offer or answer for which the direction attribute "a=sendonly" is included for the media stream, the following interpretation of the parameters MUST be used: Declaring actual configuration and properties of stream proposed to be sent: - profile-level-id - packetization-mode - sprop-deint-buf-req
- sprop-max-don-diff - sprop-init-buf-time - sprop-parameter-sets - sprop-interleaving-depth Declaring the capabilities of the sender when it receives a stream: - max-mbps - max-fs - max-cpb - max-dpb - max-br - redundant-pic-cap - deint-buf-cap - max-rcmd-nalu-size Declaring how Offer/Answer negotiation shall be performed: - parameter-add Furthermore, the following considerations are necessary: o Parameters used for declaring receiver capabilities are in general downgradable; i.e., they express the upper limit for a sender's possible behavior. Thus a sender MAY select to set its encoder using only lower/lesser or equal values of these parameters. "sprop-parameter-sets" MUST NOT be used in a sender's declaration of its capabilities, as the limits of the values that are carried inside the parameter sets are implicit with the profile and level used. o Parameters declaring a configuration point are not downgradable, with the exception of the level part of the "profile-level-id" parameter. This expresses values a receiver expects to be used and must be used verbatim on the sender side. o When a sender's capabilities are declared, and non-downgradable parameters are used in this declaration, then these parameters express a configuration that is acceptable. In order to achieve high interoperability levels, it is often advisable to offer multiple alternative configurations; e.g., for the packetization mode. It is impossible to offer multiple configurations in a single payload type. Thus, when multiple configuration offers are made, each offer requires its own RTP payload type associated with the offer.
o A receiver SHOULD understand all MIME parameters, even if it only supports a subset of the payload format's functionality. This ensures that a receiver is capable of understanding when an offer to receive media can be downgraded to what is supported by the receiver of the offer. o An answerer MAY extend the offer with additional media format configurations. However, to enable their usage, in most cases a second offer is required from the offerer to provide the stream properties parameters that the media sender will use. This also has the effect that the offerer has to be able to receive this media format configuration, not only to send it. o If an offerer wishes to have non-symmetric capabilities between sending and receiving, the offerer has to offer different RTP sessions; i.e., different media lines declared as "recvonly" and "sendonly", respectively. This may have further implications on the system.

8.2.3. Usage in Declarative Session Descriptions
When H.264 over RTP is offered with SDP in a declarative style, as in RTSP [27] or SAP [28], the following considerations are necessary. o All parameters capable of indicating the properties of both a NAL unit stream and a receiver are used to indicate the properties of a NAL unit stream. For example, in this case, the parameter "profile-level-id" declares the values used by the stream, instead of the capabilities of the sender. This results in that the following interpretation of the parameters MUST be used: Declaring actual configuration or properties: - profile-level-id - sprop-parameter-sets - packetization-mode - sprop-interleaving-depth - sprop-deint-buf-req - sprop-max-don-diff - sprop-init-buf-time
Not usable: - max-mbps - max-fs - max-cpb - max-dpb - max-br - redundant-pic-cap - max-rcmd-nalu-size - parameter-add - deint-buf-cap o A receiver of the SDP is required to support all parameters and values of the parameters provided; otherwise, the receiver MUST reject (RTSP) or not participate in (SAP) the session. It falls on the creator of the session to use values that are expected to be supported by the receiving application.

8.3. Examples
A SIP Offer/Answer exchange wherein both parties are expected to both send and receive could look like the following. Only the media codec specific parts of the SDP are shown. Some lines are wrapped due to text constraints.

Offerer -> Answerer SDP message:

   m=video 49170 RTP/AVP 100 99 98
   a=rtpmap:98 H264/90000
   a=fmtp:98 profile-level-id=42A01E; packetization-mode=0;
     sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==
   a=rtpmap:99 H264/90000
   a=fmtp:99 profile-level-id=42A01E; packetization-mode=1;
     sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==
   a=rtpmap:100 H264/90000
   a=fmtp:100 profile-level-id=42A01E; packetization-mode=2;
     sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==;
     sprop-interleaving-depth=45; sprop-deint-buf-req=64000;
     sprop-init-buf-time=102478; deint-buf-cap=128000

The above offer presents the same codec configuration in three different packetization formats. PT 98 represents the single NAL unit mode, PT 99 the non-interleaved mode, and PT 100 the interleaved mode. In the interleaved mode case, the interleaving parameters that the offerer would use if the answer indicates support for PT 100 are also included. In all three cases the parameter "sprop-parameter-sets" conveys the initial parameter sets that are required for the answerer when receiving a stream from the offerer when this configuration
(profile-level-id and packetization mode) is accepted. Note that the value for "sprop-parameter-sets", although identical in the example above, could be different for each payload type.

Answerer -> Offerer SDP message:

   m=video 49170 RTP/AVP 100 99 97
   a=rtpmap:97 H264/90000
   a=fmtp:97 profile-level-id=42A01E; packetization-mode=0;
     sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==,As0DEWlsIOp==,
     KyzFGleR
   a=rtpmap:99 H264/90000
   a=fmtp:99 profile-level-id=42A01E; packetization-mode=1;
     sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==,As0DEWlsIOp==,
     KyzFGleR; max-rcmd-nalu-size=3980
   a=rtpmap:100 H264/90000
   a=fmtp:100 profile-level-id=42A01E; packetization-mode=2;
     sprop-parameter-sets=Z0IACpZTBYmI,aMljiA==,As0DEWlsIOp==,
     KyzFGleR; sprop-interleaving-depth=60;
     sprop-deint-buf-req=86000; sprop-init-buf-time=156320;
     deint-buf-cap=128000; max-rcmd-nalu-size=3980

As the Offer/Answer negotiation covers both sending and receiving streams, an offer indicates the exact parameters for what the offerer is willing to receive, whereas the answer indicates the same for what the answerer accepts to receive. In this case the offerer declared that it is willing to receive payload type 98. The answerer accepts this by declaring an equivalent payload type 97; i.e., it has identical values for the three parameters "profile-level-id", "packetization-mode", and "sprop-deint-buf-req". This has the following implications for both the offerer and the answerer concerning the parameters that declare properties. The offerer initially declared a certain value of the "sprop-parameter-sets" in the payload definition for PT=98. However, as the answerer accepted this as PT=97, the values of "sprop-parameter-sets" in PT=98 must now be used instead when the offerer sends PT=97. Similarly, when the answerer sends PT=98 to the offerer, it has to use the properties parameters it declared in PT=97. The answerer also accepts the reception of the two configurations that payload types 99 and 100 represent. 
It provides the initial parameter sets for the answerer-to-offerer direction, and for buffering related parameters that it will use to send the payload types. It also provides the offerer with its memory limit for deinterleaving operations by providing a "deint-buf-cap" parameter. This is only useful if the offerer decides on making a second offer, where it can take the new value into account. The "max-rcmd-nalu-size" indicates that the answerer can efficiently process NALUs up to the size of 3980 bytes. However, there is no guarantee that the network supports this size. Please note that the parameter sets in the above example do not represent a legal operation point of an H.264 codec. The base64 strings are only used for illustration.

8.4. Parameter Set Considerations
The H.264 parameter sets are a fundamental part of the video codec and vital to its operation; see section 1.2. Due to their characteristics and their importance for the decoding process, lost or erroneously transmitted parameter sets can hardly be concealed locally at the receiver. A reference to a corrupt parameter set has normally fatal results to the decoding process. Corruption could occur, for example, due to the erroneous transmission or loss of a parameter set data structure, but also due to the untimely transmission of a parameter set update. Therefore, the following recommendations are provided as a guideline for the implementer of the RTP sender. Parameter set NALUs can be transported using three different principles: A. Using a session control protocol (out-of-band) prior to the actual RTP session. B. Using a session control protocol (out-of-band) during an ongoing RTP session. C. Within the RTP stream in the payload (in-band) during an ongoing RTP session. It is necessary to implement principles A and B within a session control protocol. SIP and SDP can be used as described in the SDP Offer/Answer model and in the previous sections of this memo. This section contains guidelines on how principles A and B must be implemented within session control protocols. It is independent of the particular protocol used. Principle C is supported by the RTP payload format defined in this specification. The picture and sequence parameter set NALUs SHOULD NOT be transmitted in the RTP payload unless reliable transport is provided for RTP, as a loss of a parameter set of either type will likely prevent decoding of a considerable portion of the corresponding RTP
stream. Thus, the transmission of parameter sets using a reliable session control protocol (i.e., usage of principle A or B above) is RECOMMENDED. In the rest of the section it is assumed that out-of-band signaling provides reliable transport of parameter set NALUs and that in-band transport does not. If in-band signaling of parameter sets is used, the sender SHOULD take the error characteristics into account and use mechanisms to provide a high probability for delivering the parameter sets correctly. Mechanisms that increase the probability for a correct reception include packet repetition, FEC, and retransmission. The use of an unreliable, out-of-band control protocol has similar disadvantages as the in-band signaling (possible loss) and, in addition, may also lead to difficulties in the synchronization (see below). Therefore, it is NOT RECOMMENDED. Parameter sets MAY be added or updated during the lifetime of a session using principles B and C. It is required that parameter sets are present at the decoder prior to the NAL units that refer to them. Updating or adding of parameter sets can result in further problems, and therefore the following recommendations should be considered. - When parameter sets are added or updated, principle C is vulnerable to transmission errors as described above, and therefore principle B is RECOMMENDED. - When parameter sets are added or updated, care SHOULD be taken to ensure that any parameter set is delivered prior to its usage. It is common that no synchronization is present between out-of-band signaling and in-band traffic. If out-of-band signaling is used, it is RECOMMENDED that a sender does not start sending NALUs requiring the updated parameter sets prior to acknowledgement of delivery from the signaling protocol. - When parameter sets are updated, the following synchronization issue should be taken into account. 
When overwriting a parameter set at the receiver, the sender has to ensure that the parameter set in question is not needed by any NALU present in the network or receiver buffers. Otherwise, decoding with a wrong parameter set may occur. To lessen this problem, it is RECOMMENDED either to overwrite only those parameter sets that have not been used for a sufficiently long time (to ensure that all related NALUs have been consumed), or to add a new parameter set instead (which may have negative consequences for the efficiency of the video coding). - When new parameter sets are added, previously unused parameter set identifiers are used. This avoids the problem identified in the
previous paragraph. However, in a multiparty session, unless a synchronized control protocol is used, there is a risk that multiple entities try to add different parameter sets for the same identifier, which has to be avoided. - Adding or modifying parameter sets by using both principles B and C in the same RTP session may lead to inconsistencies of the parameter sets because of the lack of synchronization between the control and the RTP channel. Therefore, principles B and C MUST NOT both be used in the same session unless sufficient synchronization can be provided. In some scenarios (e.g., when only the subset of this payload format specification corresponding to H.241 is used), it is not possible to employ out-of-band parameter set transmission. In this case, parameter sets have to be transmitted in-band. Here, the synchronization with the non-parameter-set-data in the bitstream is implicit, but the possibility of a loss has to be taken into account. The loss probability should be reduced using the mechanisms discussed above. - When parameter sets are initially provided using principle A and then later added or updated in-band (principle C), there is a risk associated with updating the parameter sets delivered out-of-band. If receivers miss some in-band updates (for example, because of a loss or a late tune-in), those receivers attempt to decode the bitstream using outdated parameters. It is RECOMMENDED that parameter set IDs be partitioned between the out-of-band and in-band parameter sets. To allow for maximum flexibility and best performance from the H.264 coder, it is recommended, if possible, to allow any sender to add its own parameter sets to be used in a session. Setting the "parameter-add" parameter to false should only be done in cases where the session topology prevents a participant from adding its own parameter sets.