This section describes one of the best-known ways to provide a good user experience over a given network path. Keep in mind, however, that application-level mechanisms cannot provide a better experience than the underlying network path can support.
A simple model of media playback can be described as a media stream consumer, a buffer, and a transport mechanism that fills the buffer. The consumption rate is fairly static and is represented by the content bitrate. The buffer is also commonly of a fixed size. The buffer fill process needs to be at least fast enough to ensure that the buffer is never empty; however, it can also have significant complexity when things like personalization or advertising insertion workflows are introduced.
The challenges in filling the buffer in a timely way fall into two broad categories:
- Content variation (also sometimes called a "bitrate ladder") is the set of content renditions that are available at any given selection point.
- Content selection comprises all of the steps a client uses to determine which content rendition to play.
The mechanism used to select the bitrate is part of the content selection, and the content variation is all of the different bitrate renditions.
Adaptive bitrate streaming ("ABR streaming" or simply "ABR") is a commonly used technique for dynamically adjusting the media quality of a stream to match bandwidth availability. When this goal is achieved, the media server will tend to send enough media that the media player does not "stall", without sending so much media that the media player cannot accept it.
ABR uses an application-level response strategy: the streaming client attempts to detect the available bandwidth of the network path by first observing the successful application-layer download speed. Given that estimate, the client chooses a bitrate for each of the video, audio, subtitles, and metadata (from a limited number of available options for each type of media) that fits within the available bandwidth. The client typically adjusts its choices as available bandwidth changes in the network or as capabilities change during playback (such as available memory, CPU, display size, etc.).
Media servers can provide media streams at various bitrates because the media has been encoded at various bitrates. This is a so-called "ladder" of bitrates that can be offered to media players as part of the manifest so that the media player can select among the available bitrate choices.
The media server may also choose to alter which bitrates are made available to players by adding or removing bitrate options from the ladder delivered to the player in subsequent manifests built and sent to the player. This way, both the player, through its selection of bitrate to request from the manifest, and the server, through its construction of the bitrates offered in the manifest, are able to affect network utilization.
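As a minimal sketch of the selection step described above, the following code picks the highest rendition from a bitrate ladder that fits within an estimated bandwidth. The ladder values, the function name, and the safety margin are all illustrative; real players implement considerably more sophisticated (often proprietary) selection logic.

```python
# Hypothetical bitrate ladder, in bits per second. Real ladders are
# built by the encoding pipeline and delivered to the player in the
# manifest, which the server may alter between manifest refreshes.
LADDER_BPS = [400_000, 800_000, 1_500_000, 3_000_000, 6_000_000]

def select_bitrate(estimated_bandwidth_bps, ladder=LADDER_BPS, safety=0.8):
    """Pick the highest rendition that fits within a safety margin of
    the estimated bandwidth; fall back to the lowest rendition."""
    budget = estimated_bandwidth_bps * safety
    candidates = [b for b in ladder if b <= budget]
    return max(candidates) if candidates else min(ladder)

print(select_bitrate(2_000_000))  # 1500000 with the default 0.8 margin
```

The safety margin reflects the fact that application-layer bandwidth estimates are noisy (as discussed later in this document), so selecting right at the measured rate risks a stall.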
Adaptive segmented delivery attempts to optimize its own use of the path between a media server and a media client. ABR playback is commonly implemented by streaming clients using HLS [RFC 8216] or DASH [MPEG-DASH] to perform a reliable segmented delivery of media over HTTP. Different implementations use different strategies [ABRSurvey], often relying on proprietary algorithms (called rate adaptation or bitrate selection algorithms) to perform available bandwidth estimation/prediction and the bitrate selection.
Many systems will do an initial probe or a very simple throughput speed test at the start of media playback. This is done to get a rough sense of the highest (total) media bitrate that the network between the server and player will likely be able to provide under initial network conditions. After the initial testing, clients tend to rely upon passive network observations and will make use of player-side statistics, such as buffer fill rates, to monitor and respond to changing network conditions.
The choice of bitrate occurs within the context of optimizing for one or more metrics monitored by the client, such as the highest achievable audiovisual quality or the lowest chances for a rebuffering event (playback stall).
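The passive observation described above can be sketched as a simple smoothed throughput estimator that records each completed segment download. The class name, the smoothing factor, and the use of an exponentially weighted moving average are assumptions for illustration; production players use a variety of smoothing and prediction strategies.

```python
class ThroughputEstimator:
    """Sketch of passive throughput estimation: record each segment
    download and keep an exponentially weighted moving average (EWMA)
    to smooth out per-segment variation."""

    def __init__(self, alpha=0.3):
        self.alpha = alpha          # smoothing factor (illustrative)
        self.estimate_bps = None

    def on_segment_downloaded(self, size_bytes, duration_s):
        # Per-segment sample of application-visible download rate.
        sample_bps = size_bytes * 8 / duration_s
        if self.estimate_bps is None:
            self.estimate_bps = sample_bps
        else:
            self.estimate_bps = (self.alpha * sample_bps
                                 + (1 - self.alpha) * self.estimate_bps)
        return self.estimate_bps
```

In practice, such an estimate would be combined with player-side statistics such as buffer fill level before a bitrate decision is made.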
The inclusion of advertising alongside or interspersed with streaming media content is common in today's media landscape.
Some commonly used forms of advertising can introduce potential user experience issues for a media stream. This section provides a very brief overview of a complex and rapidly evolving space.
The same techniques used to allow a media player to switch between renditions of different bitrates at segment boundaries can also be used to enable the dynamic insertion of advertisements (hereafter referred to as "ads"), but this does not mean that the insertion of ads has no effect on the user's quality of experience.
Ads may be inserted with either Client-side Ad Insertion (CSAI) or Server-side Ad Insertion (SSAI). In CSAI, the ABR manifest generally includes links to an external ad server for some segments of the media stream, while in SSAI, the server remains the same during ads but includes media segments that contain the advertising. In SSAI, the media segments may or may not be sourced from an external ad server, as with CSAI.
In general, the more targeted the ad request is, the more requests the ad service needs to be able to handle concurrently. If connectivity is poor to the ad service, this can cause rebuffering even if the underlying media assets (both content and ads) can be accessed quickly. The less targeted the ad request is, the more likely that ad requests can be consolidated and that ads can be cached similarly to the media content.
In some cases, especially with SSAI, advertising space in a stream is reserved for a specific advertiser and can be integrated with the video so that the segments share the same encoding properties, such as bitrate, dynamic range, and resolution. However, in many cases, ad servers integrate with a Supply Side Platform (SSP) that offers advertising space in real-time auctions via an Ad Exchange, with bids for the advertising space coming from Demand Side Platforms (DSPs) that collect money from advertisers for delivering the ads. Most such Ad Exchanges use application-level protocol specifications published by the Interactive Advertising Bureau [IAB-ADS], an industry trade organization.
This ecosystem balances several competing objectives, and integrating with it naively can produce surprising user experience results. For example, ad server provisioning and/or the bitrate of the ad segments might be different from that of the main content, and either of these differences can result in playback stalls. For another example, since the inserted ads are often produced independently, they might have a different base volume level than the main content, which can make for a jarring user experience.
Another major source of competing objectives comes from user privacy considerations vs. the advertiser's incentives to target ads to user segments based on behavioral data. Multiple studies, for example, [BEHAVE] and [BEHAVE2], have reported large improvements in ad effectiveness when using behaviorally targeted ads, relative to untargeted ads. This provides a strong incentive for advertisers to gain access to the data necessary to perform behavioral targeting, leading some to engage in what is indistinguishable from a pervasive monitoring attack [RFC 7258] based on user tracking in order to collect the relevant data. A more complete review of issues in this space is available in [BALANCING].
On top of these competing objectives, this market historically has had incidents of misreporting of ad delivery to end users for financial gain [ADFRAUD]. As a mitigation for concerns driven by those incidents, some SSPs have required the use of specific media players that include features like reporting of ad delivery or providing additional user information that can be used for tracking.
In general, this is a rapidly developing space with many considerations, and media streaming operators engaged in advertising may need to research these and other concerns to find solutions that meet their user experience, user privacy, and financial goals. For further reading on mitigations, [BAP] has published some standards and best practices based on user experience research.
This kind of bandwidth-measurement system can run into various problems caused by networking and transport protocol behaviors. Because adaptive application-level response strategies typically use rates as observed by the application layer, transport-level protocol behaviors can sometimes produce surprising measurement values when the application-level feedback loop interacts with a transport-level feedback loop.
A few specific examples of surprising phenomena that affect bitrate detection measurements are described in the following subsections. As these examples will demonstrate, it is common to encounter cases that deliver application-level measurements that are too low, too high, or (possibly) correct but varying more quickly than a lab-tested selection algorithm might expect.
These effects and others that cause transport behavior to diverge from lab modeling can sometimes have a significant impact on bitrate selection and on user QoE, especially where players use naive measurement strategies and selection algorithms that do not account for the likelihood of bandwidth measurements that diverge from the true path capacity.
When the bitrate selection is chosen substantially below the available capacity of the network path, the response to a segment request will typically complete in much less absolute time than the duration of the requested segment, leaving significant idle time between segment downloads. This can have a few surprising consequences:
- TCP slow-start, when restarting after idle, requires multiple RTTs to re-establish throughput at the network's available capacity. When the active transmission time for segments is substantially shorter than the time between segments, the idle gap between segments triggers a restart of TCP slow-start, and the estimate of the successful download speed coming from the application-visible receive rate on the socket can end up much lower than the actual available network capacity. This, in turn, can prevent a shift to the most appropriate bitrate. [RFC 7661] provides some mitigations for this effect at the TCP transport layer for senders who anticipate a high incidence of this problem.
- Mobile flow-bandwidth spectrum and timing mapping can be impacted by idle time in some networks. The carrier capacity assigned to a physical or virtual link can vary with activity. Depending on the idle time characteristics, this can result in a lower available bitrate than would be achievable with a steadier transmission in the same network.
Some receiver-side ABR algorithms, such as [ELASTIC], are designed to try to avoid this effect. Another way to mitigate this effect is with the help of two simultaneous TCP connections, as explained in [MMSys11] for Microsoft Smooth Streaming. In some cases, the system-level TCP slow-start restart can also be disabled, for example, as described in [OReilly-HPBN].
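The arithmetic behind the idle-gap problem above can be illustrated with a short calculation. The segment duration, bitrate, and path capacity values below are hypothetical, chosen only to show how far below real time a download can complete.

```python
def idle_gap_s(segment_duration_s, segment_bits, path_capacity_bps):
    """How long the connection sits idle between steady-state segment
    requests when the selected bitrate is far below path capacity."""
    download_time = segment_bits / path_capacity_bps
    return max(0.0, segment_duration_s - download_time)

# A 4-second segment at 1.5 Mbps over a 20 Mbps path downloads in
# 0.3 s, leaving the connection idle for 3.7 s between requests --
# typically long enough to trigger TCP's slow-start restart.
print(idle_gap_s(4.0, 4.0 * 1_500_000, 20_000_000))
```

When slow-start restarts in every such gap, each segment's download begins well below path capacity, which is why the application-visible rate systematically underestimates what the path could carry.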
In addition to smoothing over an appropriate time scale to handle network jitter (see [RFC 5481]), ABR systems relying on measurements at the application layer also have to account for noise from the in-order data transmission at the transport layer.
For instance, a single lost packet on a TCP connection with SACK support (a common case for segmented delivery in practice) can provide a confusing bandwidth signal to the receiving application. Because of TCP's sliding window, many packets may be accepted by the receiver without being available to the application until the missing packet arrives. Upon arrival of that one missing packet after retransmit, the receiver suddenly gains access to a large amount of data at once.
To a receiver measuring bytes received per unit time at the application layer and interpreting it as an estimate of the available network bandwidth, this appears as high jitter in the goodput measurement: a stall, followed by a sudden leap when the hole in the received data is filled by the later retransmission. The measured rate during that leap can far exceed the actual capacity of the transport path from the server.
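A toy numeric timeline makes the stall-then-leap pattern concrete. The rates and timings here are invented for illustration, assuming a steady 1 Mbit/s transport arrival rate with one packet lost early and retransmitted two seconds later.

```python
# Per-second application-visible goodput, in bits per second:
# second 0 delivers normally; seconds 1-2 deliver nothing because the
# lost packet blocks in-order delivery; second 3 delivers the
# retransmission plus everything buffered behind it.
steady_bps = 1_000_000
app_visible_bps = [steady_bps, 0, 0, 3 * steady_bps]

# A naive per-interval measurement reports a stall followed by a rate
# three times the true path capacity. Averaging over a window spanning
# the whole event recovers the correct figure.
window_avg = sum(app_visible_bps) / len(app_visible_bps)
print(window_avg)  # 1000000.0 -- the true steady rate
```

This is one reason ABR measurement code smooths over multiple segments rather than trusting any single interval's reading.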
As many end devices have moved to wireless connections for the final hop (such as Wi-Fi, 5G, LTE, etc.), new problems in bandwidth detection have emerged.
In most real-world operating environments, wireless links can often experience sudden changes in capacity as the end user device moves from place to place or encounters new sources of interference. Microwave ovens, for example, can cause a throughput degradation in Wi-Fi of more than a factor of 2 while active [Micro].
These swings in actual transport capacity can result in user experience issues when interacting with ABR algorithms that are not tuned to handle the capacity variation gracefully.
Media players use measurements to guide their segment-by-segment adaptive streaming requests but may also provide measurements to streaming media providers.
In turn, media providers may base analytics on these measurements to guide decisions, such as whether adaptive encoding bitrates in use are the best ones to provide to media players or whether current media content caching is providing the best experience for viewers.
To that effect, the Consumer Technology Association (CTA), which owns the Web Application Video Ecosystem (WAVE) project, has published two important specifications.
- CTA-2066: Streaming Quality of Experience Events, Properties and Metrics
[CTA-2066] specifies a set of media player events, properties, QoE metrics, and associated terminology for representing streaming media QoE across systems, media players, and analytics vendors. While all these events, properties, metrics, and associated terminology are used across a number of proprietary analytics and measurement solutions, they were used in slightly (or vastly) different ways that led to interoperability issues. CTA-2066 attempts to address this issue by defining common terminology and how each metric should be computed for consistent reporting.
- CTA-5004: Web Application Video Ecosystem - Common Media Client Data (CMCD)
Many assume that CDNs have a holistic view of the health and performance of the streaming clients. However, this is not the case. CDNs produce millions of log lines per second across hundreds of thousands of clients, and they have no concept of a "session" as a client would have, so CDNs are decoupled from the metrics the clients generate and report. A CDN cannot tell which request belongs to which playback session, the duration of any media object, the bitrate, or whether any of the clients have stalled and are rebuffering or are about to stall and will rebuffer. The consequence of this decoupling is that a CDN cannot prioritize delivery for when the client needs it most, prefetch content, or trigger alerts when the network itself may be underperforming. One approach to coupling the CDN to the playback sessions is for the clients to communicate standardized media-relevant information to the CDNs while they are fetching data. [CTA-5004] was developed exactly for this purpose.
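As a rough sketch of how a client might attach such data to a segment request, the following uses CMCD's query-argument transmission mode (the specification also allows HTTP request headers). The key names used here (br for encoded bitrate in kbps, bl for buffer length in ms, mtp for measured throughput in kbps, sid for session id) come from CTA-5004, but the helper function, URL, and values are illustrative, and real players must follow the spec's full serialization rules.

```python
from urllib.parse import urlencode

def cmcd_query(base_url, data):
    """Serialize CMCD key-value pairs (alphabetical order, comma
    separated, string values quoted) into a single 'CMCD' query
    argument appended to a segment request URL."""
    payload = ",".join(
        f'{k}="{v}"' if isinstance(v, str) else f"{k}={v}"
        for k, v in sorted(data.items())
    )
    return f"{base_url}?{urlencode({'CMCD': payload})}"

# Hypothetical session state at request time.
url = cmcd_query("https://cdn.example/seg42.m4s",
                 {"br": 3200, "bl": 21300, "mtp": 25400, "sid": "6e2fb550"})
print(url)
```

With data like this on each request, a CDN can associate requests with sessions and, for example, prioritize delivery to a client whose reported buffer length is about to run dry.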