TS 26.256
Codec for Immersive Voice and Audio Services (IVAS)
Jitter Buffer Management


V18.1.0 (PDF) 2024/09 15 p.

Rapporteur: Mr. Doehla, Stefan
Fraunhofer IIS

Content for TS 26.256 Word version: 18.0.1


1 Scope

The present document defines the Jitter Buffer Management (JBM) solution for the Immersive Voice and Audio Services (IVAS) codec TS 26.250. Jitter Buffers are required in packet-based communications, such as 3GPP MTSI TS 26.114, to smooth the inter-arrival jitter of incoming media packets for uninterrupted playout.

The procedure of the present document is recommended for implementation in all network entities and UEs supporting the IVAS codec; procedures described in TS 26.253 and used in this document, such as multi-channel time-scale modification, metadata adaptation and rendering are mandatory for implementations in all network entities and UEs supporting the IVAS codec.

The present document does not describe the C code of this procedure. For a description of the floating-point C code implementation see TS 26.258.

In the case of discrepancy between the Jitter Buffer Management described in the present document and its C code specification contained in TS 26.258, the procedure defined by TS 26.258 prevails.

2 References

The following documents contain provisions which, through reference in this text, constitute provisions of the present document.

References are either specific (identified by date of publication, edition number, version number, etc.) or non-specific.
For a specific reference, subsequent revisions do not apply.
For a non-specific reference, the latest version applies. In the case of a reference to a 3GPP document (including a GSM document), a non-specific reference implicitly refers to the latest version of that document in the same Release as the present document.

[1]

TR 21.905: "Vocabulary for 3GPP Specifications".

[2]

TS 26.250: "Codec for Immersive Voice and Audio Services - General Overview".

[3]

TS 26.258: "Codec for Immersive Voice and Audio Services; ANSI C code (floating-point)".

[4]

TS 26.253: "Codec for Immersive Voice and Audio Services - Detailed Algorithmic Description incl. RTP payload format and SDP parameter definitions".

[5]

TS 26.441: "Codec for Enhanced Voice Services (EVS); General overview".

[6]

TS 26.171: "Speech codec speech processing functions; Adaptive Multi-Rate - Wideband (AMR-WB) speech codec; General description".

[7]

TS 26.114: "IP Multimedia Subsystem (IMS); Multimedia telephony; Media handling and interaction".

[8]

TS 26.448: "Codec for Enhanced Voice Services - Jitter Buffer Management".

[9]

TS 26.131: "Terminal acoustic characteristics for telephony; Requirements".

[10]

TS 26.261: "Terminal audio quality performance requirements for immersive audio services".

3 Definitions of terms, symbols and abbreviations

3.1 Terms

For the purposes of the present document, the terms given in TR 21.905 and the following apply. A term defined in the present document takes precedence over the definition of the same term, if any, in TR 21.905.

3.2 Symbols

Void.

3.3 Abbreviations

For the purposes of the present document, the abbreviations given in TR 21.905 and the following apply. An abbreviation defined in the present document takes precedence over the definition of the same abbreviation, if any, in TR 21.905.

AMR

Adaptive Multi-Rate (Codec)

AMR-WB

Adaptive Multi-Rate Wideband (Codec)

DTX

Discontinuous Transmission

EVS

Enhanced Voice Services (Codec)

IVAS

Immersive Voice and Audio Services

JBM

Jitter Buffer Management

RTP

Real-time Transport Protocol

TSM

Time-Scale Modification

3.4 Mathematical Expressions

Void.

4 General

4.1 Introduction

The Jitter Buffer Management solution specified in this document extends the IVAS decoder with a mechanism to cope with the effects of packet-based communication over wireless transmission channels, i.e. buffering packets that arrive with varying inter-arrival jitter and triggering adaptation mechanisms to ensure low-delay communication.

It is used in conjunction with the IVAS decoder (described in detail in TS 26.253), which can also decode EVS TS 26.441 and AMR-WB TS 26.171. The described solution is based on TS 26.448, which has been optimized for the Multimedia Telephony Service for IMS (MTSI) and fulfils the requirements for delay and jitter-induced concealment operations set in TS 26.114. The main differences from TS 26.448 are the support of immersive media formats and a corresponding time-warping scheme operating within the decoder.

4.2 Packet-based communications

In packet-based communications, packets arrive at the terminal with random jitter in their arrival times. Packets may also arrive out of order. Since the decoder expects to be fed a speech packet at regular intervals (every 20 milliseconds for 3GPP speech codecs) to output speech samples in periodic blocks, a de-jitter buffer is required to absorb the jitter in the packet arrival time. The larger the de-jitter buffer, the better its ability to absorb the jitter in the arrival time, and consequently the fewer late-arriving packets are discarded. Voice communication is also delay-critical, so it is essential to keep the end-to-end delay as low as possible so that a two-way conversation can be sustained.

The defined adaptive Jitter Buffer Management (JBM) solution reflects the above-mentioned trade-off. While attempting to minimize packet losses, the JBM algorithm in the receiver also keeps track of the delay in packet delivery that results from the buffering. The JBM solution adjusts the depth of the de-jitter buffer to balance delay against late losses.
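The trade-off between buffer depth and late losses can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the JBM algorithm: the buffer depth is fixed, a packet counts as late when its jitter exceeds that depth, and the exponential jitter model with a 30 ms mean is purely hypothetical. The function name `late_loss_rate` is introduced here for illustration.

```python
import random

FRAME_MS = 20  # packet interval of 3GPP speech codecs, as stated above


def late_loss_rate(arrival_jitter_ms, buffer_depth_ms):
    """Fraction of packets that arrive after their scheduled playout time.

    Simplified model (assumption): packet i is sent at i * FRAME_MS, arrives
    jitter[i] later, and is scheduled for playout at
    i * FRAME_MS + buffer_depth_ms. It is late when jitter[i] > depth.
    """
    late = sum(1 for j in arrival_jitter_ms if j > buffer_depth_ms)
    return late / len(arrival_jitter_ms)


rng = random.Random(0)
# Hypothetical jitter model: exponentially distributed, mean 30 ms.
jitter = [rng.expovariate(1 / 30.0) for _ in range(10_000)]

for depth in (20, 60, 120):
    # A deeper buffer adds delay but discards fewer late packets.
    print(depth, "ms depth ->", round(late_loss_rate(jitter, depth), 3))
```

A deeper buffer lowers the late-loss rate at the cost of added playout delay, which is the balance an adaptive JBM has to strike continuously.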

4.3 IVAS Receiver architecture overview

An IVAS receiver for MTSI-based communication is built on top of the IVAS Jitter Buffer Management solution. It follows the same principles as specified in clause 4.3 of TS 26.448 for the EVS Jitter Buffer Management solution. The received IVAS frames, contained in RTP packets, are depacketized and fed to the Jitter Buffer Management (JBM). The JBM smooths the inter-arrival jitter of incoming packets for uninterrupted playout of the decoded IVAS frames at the Acoustic Frontend of the terminal.

Figure 1 in TS 26.448 illustrates the architecture and data flow of the receiver side of an EVS terminal. The example architecture for EVS is also applicable to IVAS to outline the integration of the JBM in a terminal. This specification defines the JBM module and its interfaces to the RTP Depacker, the IVAS Decoder TS 26.253, and the Acoustic Frontend TS 26.131 and TS 26.261. The modules for Modem and Acoustic Frontend are outside the scope of the present document. The implementation of the RTP Depacker is outlined in TS 26.448 and also applicable for IVAS.

Real-time implementations of this architecture typically use independent processing threads for reacting to RTP packets arriving from the modem and for requesting PCM data for the Acoustic Frontend. Arriving packets are typically handled by listening for packets received on the network socket related to the RTP session. Incoming packets are pushed into the RTP Depacker module, which extracts the frames contained in an RTP packet. These frames are then pushed into the JBM, where the statistics are updated and the frames are stored for later decoding and playout. The Acoustic Frontend contains the audio interface which, concurrently with the push operation of IVAS frames, pulls PCM buffers from the JBM. The JBM is therefore required to provide PCM buffers, which are normally generated by decoding IVAS frames with the IVAS decoder, or by other means, to allow uninterrupted playout. Although the JBM is described for a multi-threaded architecture, it does not specify thread-safe data structures, because these depend on the particular implementation.

Note that the JBM does not directly forward frames from the RTP Depacker to the IVAS decoder but instead uses frame-based adaptation to smooth the network jitter. In addition, signal-based adaptation, described in detail in clause 6.2.7.3 of TS 26.253, is executed on the decoded PCM buffers before they are pulled by the Acoustic Frontend. The corresponding algorithms are described in the following clauses.
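The push/pull interaction between the network thread and the Acoustic Frontend can be sketched as follows. This is a minimal illustration, not the interface of the reference code: the class `JbmSketch`, its method names, the lock-based synchronisation and the placeholder decoder are all assumptions, and the frame-based and signal-based adaptation are deliberately left out.

```python
import threading
from collections import deque


class JbmSketch:
    """Hypothetical push/pull wrapper; adaptation logic is omitted."""

    def __init__(self, samples_per_frame=960):
        self._lock = threading.Lock()  # thread safety is implementation-specific
        self._frames = deque()         # stand-in for the de-jitter buffer
        self._samples_per_frame = samples_per_frame

    def push_frame(self, frame):
        """Called from the network thread for every depacketized frame."""
        with self._lock:
            self._frames.append(frame)

    def pull_pcm(self):
        """Called from the audio thread; always returns a full PCM buffer."""
        with self._lock:
            frame = self._frames.popleft() if self._frames else None
        if frame is None:
            # Placeholder for "other means" (e.g. concealment): silence.
            return [0.0] * self._samples_per_frame
        return self._decode(frame)

    def _decode(self, frame):
        # Placeholder decoder: a constant buffer derived from the frame value.
        return [float(frame)] * self._samples_per_frame


jbm = JbmSketch(samples_per_frame=4)
jbm.push_frame(1)
print(jbm.pull_pcm())  # buffer decoded from the pushed frame
print(jbm.pull_pcm())  # no frame available: placeholder buffer
```

The key property shown is that a pull always yields a buffer, whether or not a frame has arrived, so playout is never interrupted.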

5 Jitter Buffer Management

5.1 Overview

Jitter Buffer Management (JBM) includes the jitter estimation, control and jitter buffer adaptation algorithm to manage the inter-arrival jitter of the incoming packet stream.

The IVAS Jitter Buffer Management allows fine-grained adjustment of the playout delay by generating time-scale modified (TSM) versions of a multi-channel signal, i.e. by providing decoded frames that are longer or shorter in duration than the default frame length. The IVAS JBM splits decoding and reconstruction/rendering into the following steps: transport channel and metadata decoding; multi-channel time-scale modification of the transport channels, resulting in a time-scale modified version of the transport channels with a specific duration; adaptation of the metadata and other rendering/reconstruction parameters to the time-scale modified duration of the IVAS frame; and reconstruction and rendering adapted to the new time-scale modified frame length.

In other words, the IVAS JBM decoding process derives the processed (output) signal from the input audio signal representation (the encoded IVAS frame) in two processing stages. The first stage decodes the transport channels and metadata. The time-scale modification is performed on the resulting intermediate audio signals, i.e. the transport channels. The second stage reconstructs and renders the output signal from the time-scaled intermediate audio signals. The reconstruction and rendering, the parameters needed for the reconstruction, and the metadata needed for the reconstruction are all adapted to the time-scale modification.

The entire solution for IVAS consists of the following components, as depicted in Figure 1:

RTP Depacker (see clause 5.2) to analyse the incoming RTP packet stream and to extract the IVAS frames along with metadata to estimate the network jitter
De-jitter Buffer (clause 5.6) to store the extracted IVAS speech frames before decoding and to perform frame-based adaptation
IVAS Transport Channel and Metadata Decoding (clause 6 of TS 26.253) for decoding the received IVAS frames to PCM data
Playout Delay Estimation Module (clause 5.3.5) to provide information on the current playout delay due to JBM
Network Jitter Analysis (clause 5.3) for estimating the packet inter-arrival jitter and target playout delay
Adaptation Control Logic (clause 5.4) to decide on actions for changing the playout delay based on the target playout delay
Multi-Channel Time-Scale Modification (clause 5.4.3) and clause 5.6 of the present document to perform signal-based adaptation for changing the playout delay
Metadata Adaptation and processing parameter adaptation TS 26.253 (clause 6.2.7.2, and clauses 6.3.7, 6.4.11, 6.5.4, 6.6.7, 6.7.8, 6.8.5, 6.10, and 6.9.8) and clause 5.6 of the present document to adapt the metadata to fit the time-warped signals
Reconstruction and Rendering TS 26.253 (clauses 6 and 7) and clause 5.5 of the present document to convert transport channels and metadata to the reconstructed output channels


Figure 1: Modules of the IVAS Jitter Buffer Management Solution

5.2 Depacketization of RTP packets (informative)

The RTP Depacker module of the JBM performs the depacketization of the incoming RTP packet stream. The operation is further described in clause 5.2 of TS 26.448. Packetization rules for IVAS are defined in Annex A of TS 26.253.

5.3 Network Jitter Analysis and Delay Estimation

The network jitter analysis and delay estimation for IVAS is identical to EVS, which is specified in clause 5.3 of TS 26.448.

5.4 Adaptation Control Logic

5.4.1 Control Logic

The control logic is identical to EVS. The operation is described in clause 5.4.1 of TS 26.448.

5.4.2 Frame-based adaptation

5.4.2.1 General

Adaptation on the frame level is performed on coded speech frames, i.e. with a granularity of one speech frame of 20 ms duration. Inserting or deleting speech frames causes higher distortion than signal-based adaptation but allows faster buffer adaptation. Inserting or deleting NO_DATA frames during DTX allows fast adaptation while minimizing distortion.
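The 20 ms granularity of frame-based adaptation during DTX can be sketched as follows. This is an illustration only and not the procedure of TS 26.448: the function `adapt_dtx_frames`, the string frame labels and the policy of inserting at the front of the queue are assumptions made for the example.

```python
FRAME_MS = 20        # one speech frame, as stated above
NO_DATA = "NO_DATA"  # hypothetical label for a NO_DATA frame


def adapt_dtx_frames(frames, delay_change_ms):
    """Return a new frame queue with NO_DATA frames inserted or deleted.

    delay_change_ms > 0 increases the playout delay, < 0 reduces it.
    Only whole frames can be added or removed, so the request is
    truncated to a multiple of FRAME_MS. Frames other than NO_DATA
    are never dropped by this sketch.
    """
    n = int(abs(delay_change_ms) // FRAME_MS)
    if delay_change_ms > 0:
        # Extra NO_DATA frames let the decoder's CNG fill the added time.
        return [NO_DATA] * n + list(frames)
    result, removed = [], 0
    for frame in frames:
        if frame == NO_DATA and removed < n:
            removed += 1  # skip this comfort-noise slot
        else:
            result.append(frame)
    return result


queue = ["SID", NO_DATA, NO_DATA, "ACTIVE"]
print(adapt_dtx_frames(queue, -40))  # removes two NO_DATA frames
print(adapt_dtx_frames(queue, 45))   # adds two frames; 5 ms cannot be realised
```

The second call shows the limitation named in the text: a 45 ms request can only be met in 20 ms steps, which is why finer corrections are left to signal-based adaptation.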

5.4.2.2 Insertion of Concealed Frames

Insertion of concealed frames is identical to EVS. The operation is described in clause 5.4.2.2 of TS 26.448.

5.4.2.3 Frame Dropping

Frame dropping is identical to EVS. The operation is described in clause 5.4.2.3 of TS 26.448.

5.4.2.4 Comfort Noise Insertion in DTX

Comfort Noise insertion in DTX is identical to EVS. The operation is described in clause 5.4.2.4 of TS 26.448.

5.4.2.5 Comfort Noise Deletion in DTX

Comfort Noise deletion in DTX is identical to EVS. The operation is described in clause 5.4.2.5 of TS 26.448.

5.4.3 Signal-based adaptation

To alter the playout delay, the decoder is able to generate time-warped signals. Producing more samples than the default frame length increases the playout delay, and producing fewer samples reduces it. The time-warping is performed by the decoder and its basic operation is described in clause 5.4.3 of TS 26.448. IVAS extends this time-scale modification to work on multi-channel signals. This is specified in clause 6.2.7.3 of TS 26.253.

The meta data and processing parameters are adapted to the achieved time-scale modification according to TS 26.253 (clause 6.2.7.2, and clauses 6.3.7, 6.4.11, 6.5.4, 6.6.7, 6.7.8, 6.8.5, 6.10, and 6.9.8).
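As a worked example of how the number of output samples maps to a change in playout delay, the following sketch converts the length of a time-scale modified frame into milliseconds gained or lost relative to the default 20 ms frame. The function name and the example sample counts are assumptions chosen for illustration; the actual TSM output lengths are determined by the decoder.

```python
def playout_delay_change_ms(output_samples, sample_rate_hz, frame_ms=20):
    """Delay change caused by one time-scale modified frame.

    Positive values mean the frame was stretched (delay increases),
    negative values mean it was compressed (delay decreases).
    """
    nominal_samples = sample_rate_hz * frame_ms // 1000
    return (output_samples - nominal_samples) * 1000.0 / sample_rate_hz


# At 48 kHz a default frame holds 960 samples per channel.
print(playout_delay_change_ms(1200, 48000))  # stretched frame
print(playout_delay_change_ms(720, 48000))   # compressed frame
```

Because all channels of a multi-channel signal share one time base, the same sample count, and hence the same delay change, applies to every transport channel of the frame.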

5.5 Reconstructed signal output

5.5.1 General

The FIFO Receiver Output Buffer of clause 5.5 of TS 26.448 is replaced by the structure outlined in Figure 2. A pull from the Acoustic Frontend renders the requested number of samples if enough samples are available for rendering; otherwise it renders as many samples as are still available according to the transport channel buffer. If the rendered samples do not satisfy the pull request, a new frame is decoded, TSM and metadata adaptation are applied, and enough samples are reconstructed and rendered from this frame to satisfy the pull request. The Receiver Output Buffer may be omitted; if it is kept, it has to be large enough to store the samples for one pull call from the Acoustic Frontend.
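The pull behaviour described above can be sketched as a loop that first drains the samples still available and decodes a new frame only when the request is not yet satisfied. This is a simplified, single-channel illustration: the class `PullRendererSketch`, the callable `decode_next_frame` and the identity renderer are assumptions, and decoding, TSM and metadata adaptation are folded into that one hypothetical callable.

```python
class PullRendererSketch:
    """Hypothetical pull-driven renderer over a transport channel buffer."""

    def __init__(self, decode_next_frame):
        # decode_next_frame: assumed callable returning the samples of the
        # next frame after decoding, TSM and metadata adaptation.
        self._decode_next_frame = decode_next_frame
        self._buffer = []  # stand-in for the transport channel buffer

    def pull(self, n_requested):
        out = []
        while len(out) < n_requested:
            if not self._buffer:
                # Not enough samples left: decode and adapt a new frame.
                self._buffer = list(self._decode_next_frame())
                if not self._buffer:
                    raise RuntimeError("decoder produced no samples")
            take = n_requested - len(out)
            out.extend(self._render(self._buffer[:take]))
            self._buffer = self._buffer[take:]  # remainder stays buffered
        return out

    @staticmethod
    def _render(samples):
        return list(samples)  # placeholder: identity rendering


# Frames of different lengths mimic time-scale modified output.
frames = iter([[1, 2, 3], [4, 5], [6, 7, 8, 9]])
renderer = PullRendererSketch(lambda: next(frames))
print(renderer.pull(2))
print(renderer.pull(2))
print(renderer.pull(4))
```

Since each pull returns exactly the requested number of samples, no separate FIFO output buffer is needed in this arrangement, which matches the remark that the Receiver Output Buffer may be omitted.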


Figure 2: Reconstructed Signal Output

5.5.2 Interaction with Decoder Transport Channel Buffer

The transport channel buffer according to clause 6.2.7.2 of TS 26.253 shall keep track of the number of already rendered samples, the number of samples still available for rendering in the current IVAS frame, and the number of residual samples, i.e. transport channel samples that cannot be rendered in the current frame, and provides this information to the De-Jitter Buffer for the playout delay estimation. The variable b_k used in Eq. 11 in clause 5.3.5 of TS 26.448 is redefined accordingly: instead of the duration of the samples buffered in the Receiver Output Buffer module at playout time k, it is the combined duration, expressed in milliseconds, of the samples still available for rendering and the residual samples.
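The redefined b_k is a simple unit conversion, sketched below. The function and parameter names are assumptions for illustration; only the relationship (available plus residual samples, expressed in milliseconds) is taken from the text.

```python
def buffered_duration_ms(samples_available, samples_residual, sample_rate_hz):
    """b_k: duration in ms of the samples still available for rendering
    plus the residual samples, counted per channel."""
    return (samples_available + samples_residual) * 1000.0 / sample_rate_hz


# Example: 480 samples still to render and 48 residual samples at 48 kHz.
print(buffered_duration_ms(480, 48, 48000))
```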

5.5.3 Residual Samples Handling

The reconstruction and rendering operates at a coarser time resolution than the PCM data; that is, the smallest portion that can be output by the reconstruction and rendering contains multiple PCM samples per output channel. The signal-based adaptation results in time-scale modified frames whose length may not be a whole multiple of this resolution. Any spill, that is, any residual samples that do not fit into the time resolution, is handled by the transport channel buffer management according to clause 6.2.7.2 of TS 26.253.
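The split between renderable samples and residual samples amounts to an integer division by the renderer's time resolution. The sketch below assumes a hypothetical resolution of 60 samples per slot and a hypothetical function name; the actual resolution and the buffer management are defined in TS 26.253.

```python
def split_renderable(new_samples, carried_residual, slot_samples):
    """Split buffered samples into a renderable part and a residual.

    new_samples:      samples of the current time-scale modified frame
    carried_residual: residual samples left over from the previous frame
    slot_samples:     smallest block the renderer can output (assumption)
    Returns (renderable_samples, residual_samples).
    """
    total = carried_residual + new_samples
    n_slots, residual = divmod(total, slot_samples)
    return n_slots * slot_samples, residual


# A 1000-sample frame does not fit 60-sample slots: 40 samples spill over.
renderable, residual = split_renderable(1000, 0, 60)
print(renderable, residual)
# The spill is carried into the next frame and rendered there.
print(split_renderable(920, residual, 60))
```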

5.6 De-Jitter Buffer

The De-Jitter Buffer is identical to EVS. The operation is described in clause 5.6 of TS 26.448.

6 Decoder interaction

6.1 General

This JBM solution for the IVAS codec may also be used for EVS. The usage with EVS is described in clause 6.10 of TS 26.253.

6.2 Decoder Requirements

The defined JBM depends on the decoder processing function to create an uncompressed PCM frame from a coded frame. The JBM requires that the number of channels and sampling rate in use is known in order to initialize the signal-based adaptation. The JBM also requires the presence of PLC functionality to create a PCM frame on demand without a coded frame being available as input for the decoder for missing or lost frames.

The JBM makes use of DTX for playout adaptation during inactive periods when noise is generated by the CNG. It nevertheless integrates its own functionality for playout adaptation of active signals, for cases where the codec does not currently use or support DTX and for adaptation during a long period of active signal. To use DTX, the RTP Depacker needs to determine whether a frame is an SID frame or an active frame and provide that information to the JBM together with the respective frame. During DTX the JBM may alter the periods between SID frames, or between the last SID frame and the first active frame, to use the CNG of the decoder to create additional PCM buffers with comfort noise or to omit comfort noise frames.

The JBM expects that the decoder outputs PCM frames of arbitrary duration and a fixed audio sampling rate set at initialization. The JBM expects that the decoder has an internal transport channel buffer that handles the buffering of the time scale modified PCM frames either for direct output or for reconstruction and rendering. If the codec supports bandwidth switching, a resampling functionality is required in the decoder to provide PCM frames at the set sampling rate.
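The decoder requirements listed in this clause can be summarised as an interface sketch. The class and method names below are hypothetical and are not taken from TS 26.258; the silence-producing decoder exists only to make the example runnable.

```python
from abc import ABC, abstractmethod


class DecoderForJbm(ABC):
    """Hypothetical view of what the JBM needs from a decoder."""

    def __init__(self, num_channels, sample_rate_hz):
        # Both values must be known to initialise signal-based adaptation.
        self.num_channels = num_channels
        self.sample_rate_hz = sample_rate_hz

    @abstractmethod
    def decode(self, coded_frame):
        """Create an uncompressed PCM frame from a coded frame."""

    @abstractmethod
    def conceal(self):
        """PLC: create a PCM frame without a coded frame as input."""


class SilenceDecoder(DecoderForJbm):
    """Placeholder decoder that outputs 20 ms of silence per channel."""

    def _silence(self):
        n = self.sample_rate_hz * 20 // 1000
        return [[0.0] * n for _ in range(self.num_channels)]

    def decode(self, coded_frame):
        return self._silence()

    def conceal(self):
        return self._silence()


def next_pcm(decoder, coded_frame):
    """Use PLC when the frame is missing or lost, else decode normally."""
    if coded_frame is None:
        return decoder.conceal()
    return decoder.decode(coded_frame)


pcm = next_pcm(SilenceDecoder(2, 48000), None)
print(len(pcm), len(pcm[0]))  # channels, samples per channel
```

Arbitrary frame durations, the internal transport channel buffer and resampling after bandwidth switching are further requirements from the text that this sketch does not model.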