In the following paragraph(s), Jitter Buffer Management (JBM) denotes the actual buffer as well as any control, adaptation and media processing algorithm (excluding speech decoder) used in the management of the jitter induced in the transport channel. An illustration of an exemplary structure of an MTSI speech receiver with adaptive jitter buffer is shown in Figure 8.1 to clarify the terminology and the relation between different functional components.
The blocks "network analyzer" and "adaptation control logic" together with the information on buffer status form the actual buffer control functionality, whereas "speech decoder" and "adaptation unit" provide the media processing functionality. Note that the external playback device control driving the media processing is not shown in Figure 8.1.
The grey dashed lines indicate the measurement points for the jitter buffer delay, i.e. the difference between the decoder consumption time and the arrival time of the speech frame to the JBM.
The functional processing blocks are as follows:
Buffer: The jitter buffer unpacks the incoming RTP payloads and stores the received speech frames. The buffer status may be used as input to the adaptation decision logic. Furthermore, the buffer is also linked to the speech decoder to provide frames when they are requested for decoding.
Network analyser: The network analysis functionality is used to monitor the incoming packet stream and to collect reception statistics (e.g. jitter, packet loss) that are needed for jitter buffer adaptation. Note that this block can also include e.g. the functionality needed to maintain statistics required by the RTCP if it is being used.
Adaptation control logic: The control logic adjusting playback delay and operating the adaptation functionality makes decisions on the buffering delay adjustments and required media adaptation actions based on the buffer status (e.g. average buffering delay, buffer occupancy, etc.) and input from the network analyser. Furthermore, external control input, including RTCP from the sender, can be used e.g. to enable inter-media synchronisation, to adapt the jitter buffer, or other external scaling requests. The control logic may utilize different adaptation strategies such as fixed jitter buffer (without adaptation and time scaling), simple adaptation during comfort noise periods or buffer adaptation also during active speech. The general operation is controlled with desired proportion of frames arriving late, adaptation strategy and adaptation rate.
Speech decoder: The standard AMR, AMR-WB, EVS or IVAS speech decoder. Note that the speech decoder is also assumed to include error concealment / bad frame handling functionality. Speech decoder may be used with or without the adaptation unit.
Adaptation unit: The adaptation unit shortens or extends the output signal length according to requests given by the adaptation control logic to enable buffer delay adjustment in a transparent manner. The adaptation is performed using frame-based or sample-based time scaling on the decoder output signal, either during comfort noise periods only or during both active speech and comfort noise. The buffer control logic should have a mechanism to limit the maximum scaling ratio. Providing a scaling window in which the targeted time scale modifications are performed improves the situation in certain scenarios, e.g. when reacting to clock drift or to a request for inter-media (re)synchronization, by allowing flexibility in allocating the scaling request over several frames and performing the scaling in a content-aware manner. The adaptation unit may be implemented either in a separate entity from the speech decoder or embedded within the decoder.
The functional requirements for the speech JBM guarantee appropriate management of jitter which shall be the same for all speech JBM implementations used in MTSI clients in terminals. A JBM implementation used in MTSI shall support the following requirements, but is not limited in functionality to these requirements. They are to be seen as a minimum set of functional requirements supported by every speech JBM used in MTSI.
Speech JBM used in MTSI shall:
support source-controlled rate operation as well as non-source-controlled rate operation;
be able to receive the de-packetized frames out of order and present them in order for decoder consumption;
be able to receive duplicate speech frames and only present unique speech frames for decoder consumption;
be able to handle clock drift between the encoding and decoding end-points.
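The reordering and duplicate-removal requirements above can be illustrated with a minimal sketch. The class name `FrameReorderBuffer` and its methods are hypothetical and are not taken from any JBM specified for MTSI. The sketch assumes that frames are identified by their RTP timestamp, and it ignores timestamp wrap-around, clock drift handling and playout timing.

```python
import heapq


class FrameReorderBuffer:
    """Hypothetical buffer: returns frames in timestamp order, without duplicates."""

    def __init__(self):
        self._heap = []            # min-heap of (timestamp, payload)
        self._buffered = set()     # timestamps currently held in the buffer
        self._last_played = None   # timestamp of the last frame given to the decoder

    def push(self, timestamp, payload):
        """Store a frame; return False if it is a duplicate or arrives too late."""
        if timestamp in self._buffered:
            return False           # duplicate of a frame still in the buffer
        if self._last_played is not None and timestamp <= self._last_played:
            return False           # late frame, or duplicate of a played frame
        self._buffered.add(timestamp)
        heapq.heappush(self._heap, (timestamp, payload))
        return True

    def pop(self):
        """Return the oldest buffered (timestamp, payload), or None if empty."""
        if not self._heap:
            return None
        timestamp, payload = heapq.heappop(self._heap)
        self._buffered.discard(timestamp)
        self._last_played = timestamp
        return timestamp, payload
```

In this sketch a `pop()` that returns `None` corresponds to a buffer underflow, which a real JBM would handle by requesting concealment or by extending the output signal.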
An MTSI client in terminal supporting speech shall use a JBM fulfilling the minimum performance requirements defined in this clause. The JBM specified in [128] fulfils these minimum performance requirements and should be used for EVS. The JBM specified in [188] should be used for IVAS.
The jitter buffering time is the time spent by a speech frame in the JBM. It is measured as the difference between the decoding start time and the arrival time of the speech frame to the JBM. The frames that are discarded by the JBM are not counted in the measure.
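As a small illustration of this measure, the sketch below computes the jitter buffering time per frame. The function name and the input layout (pairs of arrival time and decoding start time in milliseconds, with `None` marking a frame discarded by the JBM) are assumptions made for the example only.

```python
def jitter_buffering_times(frames):
    """Return the buffering time in ms for every frame that was decoded.

    frames: iterable of (arrival_ms, decode_start_ms) pairs, where
    decode_start_ms is None for frames discarded by the JBM (assumed layout).
    """
    return [decode - arrival for arrival, decode in frames if decode is not None]


# The third frame was discarded by the JBM and is not counted.
times = jitter_buffering_times([(0, 40), (20, 60), (45, None), (60, 85)])
```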
The minimum performance requirements consist of objective criteria for delay and jitter-induced concealment operations. In order for a JBM implementation to pass the minimum performance requirements all objective criteria shall be met.
A JBM implementation used in MTSI shall comply with the following design guidelines:
The overall design of the JBM shall be to minimize the buffering time at all times while still conforming to the minimum performance requirements of jitter induced concealment operations and the design guidelines for sample-based timescaling (as set in bullet point 3);
If the limit on jitter induced concealment operations cannot be met, it is always preferred to increase the buffering time rather than to let the jitter induced concealment operations exceed the stated limit. This guideline applies even if it means that the end-to-end delay requirement given in TS 22.105 can no longer be met;
If sample-based time scaling is used (after speech decoder), then artefacts caused by time scaling operation shall be kept to a minimum. Time scaling means the modification of the signal by stretching and/or compressing it over the time axis. The following guidelines on time scaling apply:
Use of a high-quality time scaling algorithm is recommended;
The amount of scaling should be as low as possible;
Scaling should be applied as infrequently as possible;
The objective performance requirements consist of criteria for delay, time scaling and jitter-induced concealment operations.
The objective minimum performance requirements are divided into three parts:
Limiting the jitter buffering time to provide as low end-to-end delay as possible.
Limiting the jitter induced concealment operations, i.e. setting limits on the allowed induced losses in the jitter buffer due to late losses, re-bufferings, and buffer overflows.
Limiting the use of time scaling to adapt the buffering depth in order to avoid introducing time scaling artefacts on the speech media.
In order to fulfil the objective performance requirements, the JBM under test needs to pass the respective criteria using the six channels as defined in clause 8.2.3.3. Note that in order to pass the criteria for a specific channel, all three requirements must be fulfilled.
The reference delay computation algorithm in Annex D defines the performance requirements for the set of delay and error profiles described in clause 8.2.3.3. The JBM algorithm under test shall meet these performance requirements. The performance requirements shall be a threshold for the Cumulative Distribution Function (CDF) of the speech-frame delay introduced by the reference delay computation algorithm. A CDF threshold is set by shifting the CDF of the reference delay computation algorithm by 60 ms. The speech-frame delay CDF is defined as:
P(x) = Probability (delay_compensation_by_JBM ≤ x)
The relation between the reference delay computation algorithm and the CDF threshold is outlined in Figure 8.2.
The JBM algorithm under test shall achieve a delay lower than or equal to that set by the CDF threshold for at least 90 % of the speech frames. The values for the CDF shall be collected for the full length of each delay and error profile. The delay measure in the criteria is the time each speech frame spends in the JBM, i.e. the difference between the decoder consumption time and the arrival time of the speech frame to the JBM.
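The comparison against the CDF threshold can be sketched as follows. This is one possible reading of the criterion and not the normative procedure: it assumes that the delay CDF of the JBM under test is compared with the shifted reference CDF at every whole-percent probability level up to 90 %. The function names and the list-based inputs are hypothetical, and both lists are assumed to be non-empty.

```python
SHIFT_MS = 60          # shift applied to the reference CDF (from the text)
COVERAGE_PERCENT = 90  # share of speech frames that must meet the threshold


def quantile(sorted_delays, percent):
    """Smallest delay x with P(delay <= x) >= percent / 100."""
    n = len(sorted_delays)
    rank = max((percent * n + 99) // 100, 1)  # integer ceil(percent * n / 100)
    return sorted_delays[rank - 1]


def meets_delay_criterion(test_delays_ms, reference_delays_ms):
    """True if the tested JBM stays at or below the shifted reference CDF
    for every whole-percent level up to COVERAGE_PERCENT (assumed reading)."""
    test = sorted(test_delays_ms)
    ref = sorted(reference_delays_ms)
    return all(
        quantile(test, pct) <= quantile(ref, pct) + SHIFT_MS
        for pct in range(1, COVERAGE_PERCENT + 1)
    )
```

Under this reading, the slowest 10 % of the frames of the JBM under test may exceed the threshold without failing the criterion.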
The parameter settings for the reference delay computation algorithm are:
The jitter induced concealment operations include:
JBM induced removal of a speech frame, i.e. buffer overflow or intentional frame dropping when reducing the buffer depth during adaptation.
Deletion of a speech frame because it arrived at the JBM too late.
Modification of the output timeline due to link loss.
Jitter-induced insertion of a speech frame controlled by the JBM (e.g. buffer underflow).
Link losses handled as error concealment and not changing the output timeline shall not be counted in the jitter induced concealment operations.
Jitter loss rate = JBM triggered concealed frames / Number of transmitted frames
The jitter loss rate shall be calculated for active speech frames only.
The jitter loss rate shall be below 1 % for every channel measured over the full length of the respective channel. The value of 1 % was chosen because such a loss rate will usually not significantly reduce the speech quality.
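The jitter loss rate check can be sketched as below. The function names are hypothetical, and the sketch assumes that both counts have already been restricted to active speech frames and that link losses which do not change the output timeline have been excluded from the concealed count.

```python
def jitter_loss_rate(jbm_concealed_frames, transmitted_frames):
    """JBM triggered concealed frames divided by transmitted frames."""
    if transmitted_frames <= 0:
        raise ValueError("transmitted_frames must be positive")
    return jbm_concealed_frames / transmitted_frames


def passes_jitter_loss_limit(jbm_concealed_frames, transmitted_frames, limit=0.01):
    """True if the jitter loss rate is strictly below the limit (1 % by default)."""
    return jitter_loss_rate(jbm_concealed_frames, transmitted_frames) < limit
```

For example, with 7 500 transmitted active speech frames, 74 JBM triggered concealments pass the check, whereas 75 concealments give a rate of exactly 1 % and fail it, because the rate has to be below the limit.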
Six different delay and error profiles are used to check the tested JBM for compliance with the minimum performance requirements. The profiles span a large range of operating conditions in which the JBM shall provide sufficient performance for the MTSI service. All profiles are 7 500 IP packets long.
Profile | Characteristics | Packet loss rate (%) | Filename
5 | Moderate jitter with occasional delay spikes, 2 frames/packet (7 500 IP packets, 15 000 speech frames) | 5.9 | dly_error_profile_5.dat
6 | Moderate jitter with severe delay spikes, 1 frame/packet | 0.1 | dly_error_profile_6.dat
The attached profiles in the zip-archive "delay_and_error_profiles.zip" are formatted as raw text files with one delay entry per line. The delay entries are written in milliseconds and packet losses are entered as "-1". Note that when testing for compliance, the starting point in the delay and error profile shall be randomized.
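Reading a profile in this format can be sketched as follows. The function names are hypothetical. Representing lost packets as `None` and wrapping around to the beginning of the profile after a random starting point are assumptions made for the example; the text above only requires that the starting point be randomized.

```python
import random


def parse_profile(text):
    """Parse one delay entry per line (in ms); an entry of -1 marks a lost packet."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue                      # tolerate blank lines
        value = float(line)
        entries.append(None if value == -1 else value)
    return entries


def randomized_start(entries, rng=random):
    """Rotate the profile so that playback begins at a random entry.

    Wrapping around to the beginning is an assumption of this sketch.
    """
    start = rng.randrange(len(entries))
    return entries[start:] + entries[:start]
```

A profile file would typically be read with `open(path).read()` and passed to `parse_profile`. Passing a seeded `random.Random` instance as `rng` makes a test run repeatable.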
The files described in Table 8.2 and attached to the present document in the zip-archive "JBM_evaluation_files.zip" shall be used for evaluation of a JBM against the minimum performance requirements. The data is stored as RTP packets, formatted according to "RTP dump" format [41]. The input to these files is AMR or AMR-WB encoded frames, encapsulated into RTP packets using the octet-aligned mode of the AMR RTP payload format (RFC 4867).
Video receivers should implement an adaptive video de-jitter buffer. The overall design of the buffer should aim to minimize delay, maintain synchronization with speech, and minimize dropping of late packets. The exact implementation is left to the implementer.
Conversational quality of real-time text is experienced as being good, even with up to one second end-to-end text delay. Strict jitter buffer management is therefore not needed for text. Basic jitter buffer management for text is described in Section 5 of RFC 4103, which describes a calculation of the time allowed before an excessively delayed text packet may be regarded as lost.