There are several reasons why an endpoint might choose to send multiple media streams. In the discussion below, please keep in mind that the reasons for having multiple RTP streams vary and include, but are not limited to, the following:
- There might be multiple media sources.
- Multiple RTP streams might be needed to represent one media source, for example:
  - To carry different layers of a scalable encoding of a media source
  - Alternative encodings during simulcast, using different codecs for the same audio stream
  - Alternative formats during simulcast, multiple resolutions of the same video stream
- A retransmission stream might repeat some parts of the content of another RTP stream.
- A Forward Error Correction (FEC) stream might provide material that can be used to repair another RTP stream.
For each of these reasons, it is necessary to decide whether each additional RTP stream is sent within the same RTP session as the other RTP streams or whether it is necessary to use additional RTP sessions to group the RTP streams. For a combination of reasons, the suitable choice for one situation might not be the suitable choice for another situation. The choice is easiest when multiplexing multiple media sources of the same media type. However, all reasons warrant discussion and clarification regarding how to deal with them. As the discussion below will show, a single solution does not suit all purposes. To utilize RTP well and as efficiently as possible, both approaches (multiple RTP streams in a single RTP session, and multiple RTP sessions) are needed. The real issue is knowing when to create multiple RTP sessions versus when to send multiple RTP streams in a single RTP session.
This section describes the multiplexing points present in RTP that can be used to distinguish RTP streams and groups of RTP streams.
Figure 1 outlines the process of demultiplexing incoming RTP streams, starting with one or more sockets representing the reception of one or more transport flows, e.g., based on the UDP destination port. It also demultiplexes RTP/RTCP from any other protocols, such as Session Traversal Utilities for NAT (STUN) [RFC 5389] and DTLS-SRTP [RFC 5764] on the same transport as described in [RFC 7983]. The Processing and Buffering (PB) step in Figure 1 terminates RTP/RTCP and prepares the RTP payload for input to the decoder.
                       |   |   |
                       |   |   | packets
            +--        v   v   v
            |        +------------+
            |        |  Socket(s) |   Transport Protocol Demultiplexing
            |        +------------+
            |            ||  ||
   RTP      |       RTP/ ||  |+-----> DTLS (SRTP keying, SCTP, etc.)
   Session  |       RTCP ||  +------> STUN (multiplexed using same port)
            +--          ||
            +--          ||
            |      ++(split by SSRC)-++---> Identify SSRC collision
            |      ||    ||    ||    ||
            | (associate with signaling by MID/RID)
            |      vv    vv    vv    vv
   RTP      |     +--+  +--+  +--+  +--+ Jitter buffer,
   Streams  |     |PB|  |PB|  |PB|  |PB| process RTCP, etc.
            |     +--+  +--+  +--+  +--+
            +--     |    |      |    |
              (select decoder based on payload type (PT))
            +--     |   /       |  /
            |       +-----+     | /
            |         /   |     |/
   Payload  |        v    v     v
   Formats  |       +---+ +---+ +---+
            |       |Dec| |Dec| |Dec| Decoders
            |       +---+ +---+ +---+
            +--
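The first demultiplexing step in the figure can be illustrated with a minimal sketch. It assumes the first-byte ranges given in RFC 7983 and the RTP/RTCP distinction on the second byte from RFC 5761; the function name `classify_packet` and the returned labels are hypothetical and not part of any specification.

```python
def classify_packet(datagram: bytes) -> str:
    """Classify a received datagram by its first byte (RFC 7983 ranges)."""
    if not datagram:
        return "empty"
    first = datagram[0]
    if first <= 3:
        return "stun"
    if 16 <= first <= 19:
        return "zrtp"
    if 20 <= first <= 63:
        return "dtls"
    if 64 <= first <= 79:
        return "turn-channel"
    if 128 <= first <= 191:
        # RTP and RTCP share this range. With RTP/RTCP multiplexing
        # (RFC 5761), a second byte in 192..223 indicates RTCP.
        if len(datagram) > 1 and 192 <= datagram[1] <= 223:
            return "rtcp"
        return "rtp"
    return "unknown"
```

A real receiver would hand each class to its own handler; only the "rtp" and "rtcp" classes continue to the SSRC-based split shown in the figure.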
An RTP session is the highest semantic layer in RTP and represents an association between a group of communicating endpoints. RTP does not contain a session identifier, yet it must be possible to identify different RTP sessions both across a set of different endpoints and from the perspective of a single endpoint.
For RTP session separation across endpoints, the set of participants that form an RTP session is defined as those that share a single SSRC space [RFC 3550]. That is, if a group of participants are each aware of the SSRC identifiers belonging to the other participants, then those participants are in a single RTP session. A participant can become aware of an SSRC identifier by receiving an RTP packet containing the identifier in the SSRC field or contributing source (CSRC) list, by receiving an RTCP packet listing it in an SSRC field, or through signaling (e.g., the Session Description Protocol (SDP) [RFC 4566] "a=ssrc:" attribute [RFC 5576]). Thus, the scope of an RTP session is determined by the participants' network interconnection topology, in combination with RTP and RTCP forwarding strategies deployed by the endpoints and any middleboxes, and by the signaling.
For RTP session separation within a single endpoint, RTP relies on the underlying transport layer and the signaling to identify RTP sessions in a manner that is meaningful to the application. A single endpoint can have one or more transport flows for the same RTP session, and a single RTP session can span multiple transport-layer flows even if all endpoints use a single transport-layer flow per endpoint for that RTP session. The signaling layer might give RTP sessions an explicit identifier, or the identification might be implicit based on the addresses and ports used. Accordingly, a single RTP session can have multiple associated identifiers, explicit and implicit, belonging to different contexts. For example, when running RTP on top of UDP/IP, an endpoint can identify and delimit an RTP session from other RTP sessions by their UDP source and destination IP addresses and their UDP port numbers. A single RTP session can be using multiple IP/UDP flows for receiving and/or sending RTP packets to other endpoints or middleboxes, even if the endpoint does not have multiple IP addresses. Using multiple IP addresses only makes it more likely that multiple IP/UDP flows will be required. Another example is SDP media descriptions (the "m=" line and the subsequent associated lines) that signal the transport flow and RTP session configuration for the endpoint's part of the RTP session. The SDP grouping framework [RFC 5888] allows labeling of the media descriptions to be used so that RTP Session Groups can be created. Through the use of [RFC 8843], multiple media descriptions become part of a common RTP session where each media description represents the RTP streams sent or received for a media source.
RTP makes no normative statements about the relationship between different RTP sessions; however, applications that use more than one RTP session need to understand how the different RTP sessions that they create relate to one another.
An SSRC identifies a source of an RTP stream, or an RTP receiver when sending RTCP. Every endpoint has at least one SSRC identifier, even if it does not send RTP packets. RTP endpoints that are only RTP receivers still send RTCP and use their SSRC identifiers in the RTCP packets they send. An endpoint can have multiple SSRC identifiers if it sends multiple RTP streams. Endpoints that function as both RTP sender and RTP receiver use the same SSRC(s) in both roles.
The SSRC is a 32-bit identifier. It is present in every RTP and RTCP packet header and in the payload of some RTCP packet types. It can also be present in SDP signaling. Unless presignaled, e.g., using the SDP "a=ssrc:" attribute [RFC 5576], the SSRC is chosen at random. It is not dependent on the network address of the endpoint and is intended to be unique within an RTP session. SSRC collisions can occur and are handled as specified in [RFC 3550] and [RFC 5576], resulting in the SSRC of the colliding RTP streams or receivers changing. An endpoint that changes its network transport address during a session has to choose a new SSRC identifier to avoid being interpreted as a looped source, unless a mechanism providing a virtual transport (such as Interactive Connectivity Establishment (ICE) [RFC 8445]) abstracts the changes.
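As a rough illustration of random SSRC selection and of changing the SSRC after a collision, consider the sketch below. The helper names `pick_ssrc` and `resolve_collision` are hypothetical, and the sketch deliberately omits the RTCP BYE and the loop-detection rules that RFC 3550 specifies.

```python
import secrets

def pick_ssrc(in_use):
    """Return a random 32-bit SSRC that is not in the set of known SSRCs."""
    while True:
        candidate = secrets.randbits(32)
        if candidate not in in_use:
            return candidate

def resolve_collision(own_ssrc, in_use):
    """Choose a replacement SSRC after a collision on own_ssrc.

    Assumption: the caller sends an RTCP BYE for own_ssrc before it
    starts using the returned value.
    """
    return pick_ssrc(set(in_use) | {own_ssrc})
```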
SSRC identifiers that belong to the same synchronization context (i.e., that represent RTP streams that can be synchronized using information in RTCP SR packets) use identical CNAME chunks in corresponding RTCP source description (SDES) packets. SDP signaling can also be used to provide explicit SSRC grouping [RFC 5576].
In some cases, the same SSRC identifier value is used to relate streams in two different RTP sessions, such as in RTP retransmission [RFC 4588]. This is to be avoided, since there is no guarantee that SSRC values are unique across RTP sessions. In the case of RTP retransmission [RFC 4588], it is recommended to use explicit binding of the source RTP stream and the redundancy stream, e.g., using the RepairedRtpStreamId RTCP SDES item [RFC 8852]. The RepairedRtpStreamId is a rather recent mechanism, so one cannot expect older applications to follow this recommendation.
Note that the RTP sequence number and RTP timestamp are scoped by the SSRC and are thus specific per RTP stream.
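To make the per-SSRC scoping concrete, the following sketch parses the fixed RTP header defined in RFC 3550 and groups sequence numbers by SSRC. The function names and the dictionary layout are illustrative assumptions; header extensions and padding are not handled.

```python
import struct

def parse_rtp_header(packet: bytes) -> dict:
    """Parse the 12-byte fixed RTP header plus the CSRC list (RFC 3550)."""
    if len(packet) < 12:
        raise ValueError("packet shorter than fixed RTP header")
    b0, b1, seq, timestamp, ssrc = struct.unpack("!BBHII", packet[:12])
    if b0 >> 6 != 2:
        raise ValueError("not RTP version 2")
    csrc_count = b0 & 0x0F
    end = 12 + 4 * csrc_count
    if len(packet) < end:
        raise ValueError("truncated CSRC list")
    csrcs = list(struct.unpack("!%dI" % csrc_count, packet[12:end]))
    return {
        "marker": bool(b1 & 0x80),
        "pt": b1 & 0x7F,
        "seq": seq,
        "timestamp": timestamp,
        "ssrc": ssrc,
        "csrcs": csrcs,
    }

def demux_by_ssrc(packets):
    """Group sequence numbers per SSRC; each SSRC has its own number space."""
    streams = {}
    for pkt in packets:
        hdr = parse_rtp_header(pkt)
        streams.setdefault(hdr["ssrc"], []).append(hdr["seq"])
    return streams
```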
Different types of entities use an SSRC to identify themselves, as follows:
- A real media source uses the SSRC to identify a "physical" media source.
- A conceptual media source uses the SSRC to identify the result of applying some filtering function in a network node -- for example, a filtering function in an RTP mixer that provides the most active speaker based on some criteria, or a mix representing a set of other sources.
- An RTP receiver uses the SSRC to identify itself as the source of its RTCP reports.
An endpoint that generates more than one media type, e.g., a conference participant sending both audio and video, need not (and, indeed, should not) use the same SSRC value across RTP sessions. Using RTCP compound packets containing the CNAME SDES item is the designated method for binding an SSRC to a CNAME, effectively cross-correlating SSRCs within and between RTP sessions as coming from the same endpoint. The main property attributed to SSRCs associated with the same CNAME is that they are from a particular synchronization context and can be synchronized at playback.
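The CNAME binding can be pictured as a simple grouping step. The sketch below assumes the application has already extracted (SSRC, CNAME) pairs from received RTCP SDES packets; the function name and the input shape are hypothetical.

```python
from collections import defaultdict

def group_by_cname(sdes_items):
    """Group SSRCs that share a CNAME into one synchronization context.

    sdes_items: iterable of (ssrc, cname) pairs, possibly gathered from
    several RTP sessions.
    """
    contexts = defaultdict(set)
    for ssrc, cname in sdes_items:
        contexts[cname].add(ssrc)
    return dict(contexts)
```

For example, an audio SSRC and a video SSRC reported with the same CNAME end up in the same set and can be synchronized at playback, even though their SSRC values differ.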
An RTP receiver receiving a previously unseen SSRC value will interpret it as a new source. It might in fact be a previously existing source that had to change its SSRC number due to an SSRC conflict. Using the media identification (MID) extension [RFC 8843] helps to identify which media source the new SSRC represents, and using the restriction identifier (RID) extension [RFC 8851] helps to identify what encoding or redundancy stream it represents, even though the SSRC changed. However, the originator of the previous SSRC ought to have ended the conflicting source by sending an RTCP BYE for it prior to starting to send with the new SSRC, making the new SSRC a new source.
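One way a receiver might keep its association with signaling stable across an SSRC change is to key its state on the MID and RID values rather than on the SSRC. The class below is a sketch under that assumption; `StreamTable` and its methods are hypothetical names, and the MID and RID values are assumed to have been extracted from header extensions or SDES items already.

```python
class StreamTable:
    """Tracks which SSRC currently carries each (MID, RID) pair."""

    def __init__(self):
        self.ssrc_by_key = {}
        self.key_by_ssrc = {}

    def on_packet(self, ssrc, mid, rid=None):
        """Record that ssrc carries (mid, rid), replacing any older SSRC."""
        key = (mid, rid)
        old = self.ssrc_by_key.get(key)
        if old is not None and old != ssrc:
            self.key_by_ssrc.pop(old, None)
        self.ssrc_by_key[key] = ssrc
        self.key_by_ssrc[ssrc] = key
        return key

    def on_bye(self, ssrc):
        """Forget an SSRC after its RTCP BYE has been received."""
        key = self.key_by_ssrc.pop(ssrc, None)
        if key is not None and self.ssrc_by_key.get(key) == ssrc:
            del self.ssrc_by_key[key]
```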
The CSRC is not a separate identifier. Rather, an SSRC identifier is listed as a CSRC in the RTP header of a packet generated by an RTP mixer or video Multipoint Control Unit (MCU) / switch, if the corresponding SSRC was in the header of one of the packets that contributed to the output.
It is not possible, in general, to extract media represented by an individual CSRC, since it is typically the result of a media merge (e.g., mix) operation on the individual media streams corresponding to the CSRC identifiers. The exception is the case where only a single CSRC is indicated, as this represents the forwarding of an RTP stream that might have been modified. The RTP header extension for mixer-to-client audio level indication [RFC 6465] expands on the receiver's information about a packet with a CSRC list. Due to these restrictions, a CSRC will not be considered a fully qualified multiplexing point and will be disregarded in the rest of this document.
Each RTP stream utilizes one or more RTP payload formats. An RTP payload format describes how the output of a particular media codec is framed and encoded into RTP packets. The payload format is identified by the payload type (PT) field in the RTP packet header. The combination of SSRC and PT therefore identifies a specific RTP stream in a specific encoding format. The format definition can be taken from [RFC 3551] for statically allocated payload types but ought to be explicitly defined in signaling, such as SDP, for both static and dynamic payload types. The term "format" here includes those aspects described by out-of-band signaling means; in SDP, the term "format" includes media type, RTP timestamp sampling rate, codec, codec configuration, payload format configurations, and various robustness mechanisms such as redundant encodings [RFC 2198].
The RTP payload type is scoped by the sending endpoint within an RTP session. PT has the same meaning across all RTP streams in an RTP session. All SSRCs sent from a single endpoint share the same payload type definitions. The RTP payload type is designed such that only a single payload type is valid at any instant in time in the RTP stream's timestamp timeline, effectively time-multiplexing different payload types if any change occurs. The payload type can change on a per-packet basis for an SSRC -- for example, a speech codec making use of generic comfort noise [RFC 3389]. If there is a true need to send multiple payload types for the same SSRC that are valid for the same instant, then redundant encodings [RFC 2198] can be used. Several additional constraints, other than those mentioned above, need to be met to enable this usage, one of which is that the combined payload sizes of the different payload types ought not exceed the transport MTU.
Other aspects of using the RTP payload format are described in [RFC 8088].
The payload type is not a multiplexing point at the RTP layer (see Appendix A for a detailed discussion of why using the payload type as an RTP multiplexing point does not work). The RTP payload type is, however, used to determine how to consume and decode an RTP stream. The RTP payload type number is sometimes used to associate an RTP stream with the signaling, which in general requires that unique RTP payload type numbers be used in each context. Using MID, e.g., when bundling "m=" sections [RFC 8843], can replace the payload type as a signaling association, and unique RTP payload types are then no longer required for that purpose.
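The role of the payload type as a decoder selector can be sketched by building a payload type table from SDP "a=rtpmap:" attributes. The function name is hypothetical, the payload type numbers in the usage note are arbitrary examples, and the sketch ignores "m=" section boundaries, so it is only suitable for a single media description.

```python
def parse_rtpmap(sdp: str) -> dict:
    """Map payload type numbers to encoding name, clock rate, and parameters."""
    table = {}
    prefix = "a=rtpmap:"
    for line in sdp.splitlines():
        line = line.strip()
        if not line.startswith(prefix):
            continue
        # Syntax: a=rtpmap:<payload type> <encoding name>/<clock rate>[/<params>]
        pt_str, _, encoding = line[len(prefix):].partition(" ")
        parts = encoding.split("/")
        table[int(pt_str)] = {
            "name": parts[0],
            "clock_rate": int(parts[1]),
            "params": parts[2] if len(parts) > 2 else None,
        }
    return table
```

A receiver could then look up the PT field of an incoming packet in this table to pick a decoder, for instance `parse_rtpmap("a=rtpmap:96 VP8/90000")[96]["name"]`.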
The impact of how RTP multiplexing is performed will in general vary with how the RTP session participants are interconnected, as described in [RFC 7667].
Even the most basic use case -- "Topo-Point-to-Point" as described in [RFC 7667] -- raises a number of considerations, which are discussed in detail in the following sections. They range over such aspects as the following:
- Does my communication peer support RTP as defined with multiple SSRCs per RTP session?
- Do I need network differentiation in the form of QoS (Section 4.2.1)?
- Can the application more easily process and handle the media streams if they are in different RTP sessions?
- Do I need to use additional RTP streams for RTP retransmission or FEC?
For some point-to-multipoint topologies (e.g., Topo-ASM and Topo-SSM [RFC 7667]), multicast is used to interconnect the session participants. Special considerations (documented in Section 4.2.3) are then needed, as multicast is a one-to-many distribution system.
Sometimes, an RTP communication session can end up in a situation where the communicating peers are not compatible, for various reasons:
- No common media codec for a media type, thus requiring transcoding.
- Different support for multiple RTP streams and RTP sessions.
- Usage of different media transport protocols (i.e., one peer uses RTP, but the other peer uses a different transport protocol).
- Usage of different transport protocols, e.g., UDP, the Datagram Congestion Control Protocol (DCCP), or TCP.
- Different security solutions (e.g., IPsec, TLS, DTLS, or the Secure Real-time Transport Protocol (SRTP)) with different keying mechanisms.
These compatibility issues can often be resolved by the inclusion of a translator between the two peers -- the Topo-PtP-Translator, as described in [RFC 7667]. The translator's main purpose is to make the peers look compatible to each other. There can also be reasons other than compatibility for inserting a translator in the form of a middlebox or gateway -- for example, a need to monitor the RTP streams. Beware that changing the stream transport characteristics in the translator can require a thorough understanding of aspects ranging from congestion control and media-level adaptations to application-layer semantics.
Within the uses enabled by the RTP standard, the point-to-point topology can contain one or more RTP sessions with one or more media sources per session, each having one or more RTP streams per media source.
Using multiple RTP streams is a well-supported feature of RTP. However, for most implementers or people writing RTP/RTCP applications or extensions attempting to apply multiple streams, it can be unclear when it is most appropriate to add an additional RTP stream in an existing RTP session and when it is better to use multiple RTP sessions. This section discusses the various considerations that need to be taken into account.
RFC 3550 contains some recommendations and a numbered list (Section 5.2 of RFC 3550) of five arguments regarding different aspects of RTP multiplexing. Please review Section 5.2 of RFC 3550. Five important aspects are quoted below.
- If, say, two audio streams shared the same RTP session and the same SSRC value, and one were to change encodings and thus acquire a different RTP payload type, there would be no general way of identifying which stream had changed encodings.
This argument advocates the use of different SSRCs for each individual RTP stream, as this is fundamental to RTP operation.
- An SSRC is defined to identify a single timing and sequence number space. Interleaving multiple payload types would require different timing spaces if the media clock rates differ and would require different sequence number spaces to tell which payload type suffered packet loss.
This argument advocates against demultiplexing RTP streams within a session based only on their RTP payload type numbers; it still stands, as can be seen by the extensive list of issues discussed in Appendix A.
- The RTCP sender and receiver reports (see Section 6.4) can only describe one timing and sequence number space per SSRC and do not carry a payload type field.
This argument is yet another argument against payload type multiplexing.
- An RTP mixer would not be able to combine interleaved streams of incompatible media into one stream.
This argument advocates against multiplexing RTP packets that require different handling into the same session. In most cases, the RTP mixer must embed application logic to handle streams; the separation of streams according to stream type is just another piece of application logic, which might or might not be appropriate for a particular application. One type of application that can mix different media sources blindly is the audio-only telephone bridge, although the ability to do that comes from the well-defined scenario that is aided by the use of a single media type, even though individual streams may use incompatible codec types; most other types of applications need application-specific logic to perform the mix correctly.
- Carrying multiple media in one RTP session precludes: the use of different network paths or network resource allocations if appropriate; reception of a subset of the media if desired, for example just audio if video would exceed the available bandwidth; and receiver implementations that use separate processes for the different media, whereas using separate RTP sessions permits either single- or multiple-process implementations.
This argument discusses network aspects that are described in Section 4.2. It also goes into aspects of implementation, like split component terminals (see Section 3.10 of RFC 7667) -- endpoints where different processes or interconnected devices handle different aspects of the whole multimedia session.
To summarize, RFC 3550's view on multiplexing is to use unique SSRCs for anything that is its own media/packet stream and use different RTP sessions for media streams that don't share a media type. This document supports the first point; it is very valid. The latter needs further discussion, as imposing a single solution on all usages of RTP is inappropriate. [RFC 8860] updates RFC 3550 to allow multiple media types in an RTP session and provides a detailed analysis of the potential benefits and issues related to having multiple media types in the same RTP session. Thus, [RFC 8860] provides a wider scope for an RTP session and considers multiple media types in one RTP session as a possible choice for the RTP application designer.
Using multiple SSRCs at one endpoint in an RTP session requires that some unclear aspects of the RTP specification be resolved. These items could potentially lead to some interoperability issues as well as some potential significant inefficiencies, as further discussed in "Sending Multiple RTP Streams in a Single RTP Session" [RFC 8108]. An RTP application designer should consider these issues and the application's possible impact caused by a lack of appropriate RTP handling or optimization in the peer endpoints.
Using multiple RTP sessions can potentially mitigate application issues caused by multiple SSRCs in an RTP session.
A common problem in a number of RTP extensions has been how to bind related RTP streams together. This issue is common to both using additional SSRCs and multiple RTP sessions.
The solutions can be divided into a few groups:
- RTP/RTCP based
- Signaling based, e.g., SDP
- Grouping related RTP sessions
- Grouping SSRCs within an RTP session
Most solutions are explicit, but some implicit methods have also been applied to the problem.
The SDP-based signaling solutions are:
- SDP media description grouping:
  The SDP grouping framework [RFC 5888] uses various semantics to group any number of media descriptions. SDP media description grouping has primarily been used to group RTP sessions, but in combination with [RFC 8843], it can also group multiple media descriptions within a single RTP session.
- SDP media multiplexing:
  [RFC 8843] uses information taken from both SDP and RTCP to associate RTP streams to SDP media descriptions. This allows both SDP and RTCP to group RTP streams belonging to an SDP media description and group multiple SDP media descriptions into a single RTP session.
- SDP SSRC grouping:
  [RFC 5576] includes a solution for grouping SSRCs in the same way that the grouping framework groups media descriptions.
The above grouping constructs support many use cases. Those solutions have shortcomings in cases where the session's dynamic properties are such that it is difficult or a drain on resources to keep the list of related SSRCs up to date.
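The SDP-based groupings listed above share a similar attribute shape, a semantics token followed by a list of identifiers. The sketch below extracts both "a=group:" (RFC 5888) and "a=ssrc-group:" (RFC 5576) lines; the function name is hypothetical, and the identifiers in the usage note are made-up example values.

```python
def parse_group_lines(sdp: str, attribute: str):
    """Return (semantics, [identifiers]) for each matching SDP attribute line.

    attribute is the full prefix, e.g. "a=group:" or "a=ssrc-group:".
    Identifiers are returned as strings exactly as they appear in the SDP.
    """
    groups = []
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith(attribute):
            tokens = line[len(attribute):].split()
            if tokens:
                groups.append((tokens[0], tokens[1:]))
    return groups
```

Given an SDP containing `a=group:BUNDLE audio video` and `a=ssrc-group:FID 1111 2222`, calling the function with each prefix would yield `[("BUNDLE", ["audio", "video"])]` and `[("FID", ["1111", "2222"])]`, respectively.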
One RTP/RTCP-based grouping solution is to use the RTCP SDES CNAME to bind related RTP streams to an endpoint or a synchronization context. For applications with a single RTP stream per type (media, source, or redundancy stream), the CNAME is sufficient for that purpose, independent of whether one or more RTP sessions are used. However, some applications choose not to use a CNAME because of perceived complexity or a desire not to implement RTCP and instead use the same SSRC value to bind related RTP streams across multiple RTP sessions. RTP retransmission [RFC 4588], when configured to use multiple RTP sessions, and generic FEC [RFC 5109] both use the SSRC method to relate the RTP streams, which may work but might have some downsides in RTP sessions with many participating SSRCs. It is not recommended to use identical SSRC values across RTP sessions to relate RTP streams; when an SSRC collision occurs, this will force a change of that SSRC in all RTP sessions and will thus resynchronize all of the streams instead of only the single media stream experiencing the collision.
Another method for implicitly binding SSRCs is used by RTP retransmission [RFC 4588] when using the same RTP session as the source RTP stream for retransmissions. A receiver that is missing a packet issues an RTP retransmission request and then awaits a new SSRC carrying the RTP retransmission payload, where that SSRC is from the same CNAME. This limits a requester to having only one outstanding retransmission request on any new SSRCs per endpoint.
[RFC 8851] provides an RTP/RTCP-based mechanism to unambiguously identify the RTP streams within an RTP session and restrict the streams' payload format parameters in a codec-agnostic way beyond what is provided with the regular payload types. The mapping is done by specifying an "a=rid" value in the SDP offer/answer signaling and having the corresponding RtpStreamId value as an SDES item and an RTP header extension [RFC 8852]. The RID solution also includes a solution for binding redundancy RTP streams to their original source RTP streams, given that those streams use RID identifiers. The redundancy stream uses the RepairedRtpStreamId SDES item and RTP header extension to declare the RtpStreamId value of the source stream to create the binding.
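The RepairedRtpStreamId binding amounts to a lookup from the declared RtpStreamId to the SSRC of the source stream. The sketch below assumes the receiver has already collected, per SSRC, either an RtpStreamId or a RepairedRtpStreamId value; the dictionary keys `rid` and `repaired_rid` and the function name are hypothetical.

```python
def bind_repair_streams(streams):
    """Map each redundancy-stream SSRC to the SSRC of the stream it repairs.

    streams: list of dicts holding "ssrc" plus either "rid" (source stream)
    or "repaired_rid" (redundancy stream). Redundancy streams whose source
    has not been seen yet are left out of the result.
    """
    source_by_rid = {s["rid"]: s["ssrc"] for s in streams if "rid" in s}
    bindings = {}
    for s in streams:
        repaired = s.get("repaired_rid")
        if repaired in source_by_rid:
            bindings[s["ssrc"]] = source_by_rid[repaired]
    return bindings
```

Because the binding is keyed on the identifier rather than on SSRC values, it remains valid if either stream changes its SSRC, once the table is rebuilt from the new packets.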
Experience has shown that an explicit binding between the RTP streams, agnostic of SSRC values, behaves well. That way, solutions using multiple RTP streams in a single RTP session and in multiple RTP sessions will use the same type of binding.
There exist a number of FEC-based schemes designed to mitigate packet loss in the original streams. Most of the FEC schemes protect a single source flow. This protection is achieved by transmitting a certain amount of redundant information that is encoded such that it can repair one or more instances of packet loss over the set of packets the redundant information protects. This sequence of redundant information needs to be transmitted as its own media stream or, in some cases, instead of the original media stream. Thus, many of these schemes create a need for binding related flows, as discussed above. Looking at the history of these schemes, there are schemes using multiple SSRCs and schemes using multiple RTP sessions, and some schemes that support both modes of operation.
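The repair principle can be shown with a deliberately simplified XOR parity sketch: one parity block protects a set of packets and can rebuild any single lost packet from that set. This is not the packet format of any particular FEC payload format; the function names are hypothetical, and the length of the lost packet is assumed to be known, whereas real schemes also protect the length.

```python
def xor_parity(packets):
    """Compute one parity block over a set of packets (shorter ones zero-padded)."""
    parity = bytearray(max(len(p) for p in packets))
    for p in packets:
        for i, b in enumerate(p):
            parity[i] ^= b
    return bytes(parity)

def recover(received, parity, lost_len):
    """Rebuild the single missing packet from the parity and the received packets."""
    repaired = bytearray(parity)
    for p in received:
        for i, b in enumerate(p):
            repaired[i] ^= b
    return bytes(repaired[:lost_len])
```

The parity block would be carried as its own RTP stream, which is why the receiver needs a binding between that stream and the source stream it protects.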
Using multiple RTP sessions supports the case where some set of receivers might not be able to utilize the FEC information. If the FEC information is placed in a separate RTP session and the RTP sessions are separated at the transport level, FEC can easily be ignored at the transport level, without considering any RTP-layer information.
In usages involving multicast, sending FEC information in a separate multicast group allows for similar flexibility. This is especially useful when receivers see heterogeneous packet loss rates. A receiver can decide, based on measurement of experienced packet loss rates, whether to join a multicast group with suitable FEC data repair capabilities.