4.3. CELT Decoder
The CELT layer of Opus is based on the Modified Discrete Cosine Transform [MDCT] with partially overlapping windows of 5 to 22.5 ms. The main principle behind CELT is that the MDCT spectrum is divided into bands that (roughly) follow the Bark scale, i.e., the scale of the ear's critical bands [ZWICKER61]. The normal CELT layer uses 21 of those bands, though Opus Custom (see Section 6.2) may use a different number of bands. In Hybrid mode, the first 17 bands (up to 8 kHz) are not coded. A band can contain as few as one MDCT bin per channel, and as many as 176 bins per channel, as detailed in Table 55. In each band, the gain (energy) is coded separately from the shape of the spectrum. Coding the gain explicitly makes it easy to preserve the spectral envelope of the signal. The remaining unit-norm shape vector is encoded using a Pyramid Vector Quantizer (PVQ); see Section 4.3.4.

+--------+--------+------+-------+-------+-------------+------------+
| Frame  | 2.5 ms | 5 ms | 10 ms | 20 ms |       Start |       Stop |
| Size:  |        |      |       |       |   Frequency |  Frequency |
+--------+--------+------+-------+-------+-------------+------------+
| Band   | Bins:  |      |       |       |             |            |
|        |        |      |       |       |             |            |
| 0      | 1      | 2    | 4     | 8     |        0 Hz |     200 Hz |
| 1      | 1      | 2    | 4     | 8     |      200 Hz |     400 Hz |
| 2      | 1      | 2    | 4     | 8     |      400 Hz |     600 Hz |
| 3      | 1      | 2    | 4     | 8     |      600 Hz |     800 Hz |
| 4      | 1      | 2    | 4     | 8     |      800 Hz |    1000 Hz |
| 5      | 1      | 2    | 4     | 8     |     1000 Hz |    1200 Hz |
| 6      | 1      | 2    | 4     | 8     |     1200 Hz |    1400 Hz |
| 7      | 1      | 2    | 4     | 8     |     1400 Hz |    1600 Hz |
| 8      | 2      | 4    | 8     | 16    |     1600 Hz |    2000 Hz |
| 9      | 2      | 4    | 8     | 16    |     2000 Hz |    2400 Hz |
| 10     | 2      | 4    | 8     | 16    |     2400 Hz |    2800 Hz |
| 11     | 2      | 4    | 8     | 16    |     2800 Hz |    3200 Hz |
| 12     | 4      | 8    | 16    | 32    |     3200 Hz |    4000 Hz |
| 13     | 4      | 8    | 16    | 32    |     4000 Hz |    4800 Hz |
| 14     | 4      | 8    | 16    | 32    |     4800 Hz |    5600 Hz |
| 15     | 6      | 12   | 24    | 48    |     5600 Hz |    6800 Hz |
| 16     | 6      | 12   | 24    | 48    |     6800 Hz |    8000 Hz |
| 17     | 8      | 16   | 32    | 64    |     8000 Hz |    9600 Hz |
| 18     | 12     | 24   | 48    | 96    |     9600 Hz |   12000 Hz |
| 19     | 18     | 36   | 72    | 144   |    12000 Hz |   15600 Hz |
| 20     | 22     | 44   | 88    | 176   |    15600 Hz |   20000 Hz |
+--------+--------+------+-------+-------+-------------+------------+

Table 55: MDCT Bins per Channel per Band for Each Frame Size

Transients are notoriously difficult for transform codecs to code. CELT uses two different strategies for them:

1. Using multiple smaller MDCTs instead of a single large MDCT, and

2. Dynamic time-frequency resolution changes (see Section 4.3.4.5).
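The bin counts in Table 55 follow a regular pattern: each frame-size column doubles the previous one, and every bin of a 2.5 ms frame spans 200 Hz. The following is a minimal sketch that regenerates the table rows under that reading. The names BAND_EDGES, HZ_PER_BIN, and band_layout() are illustrative and are not taken from the reference code; the edge list is simply the running sum of the 2.5 ms column.

```python
# Cumulative band edges in units of 2.5 ms MDCT bins, obtained by summing
# the "2.5 ms" column of Table 55. BAND_EDGES is an illustrative name.
BAND_EDGES = [0, 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16,
              20, 24, 28, 34, 40, 48, 60, 78, 100]

HZ_PER_BIN = 200  # width of one 2.5 ms bin, read off the frequency columns


def band_layout(lm):
    """Return (bins, start_hz, stop_hz) per band for a frame of 2.5 ms << lm."""
    layout = []
    for lo, hi in zip(BAND_EDGES, BAND_EDGES[1:]):
        bins = (hi - lo) << lm  # doubling the frame size doubles the bins
        layout.append((bins, lo * HZ_PER_BIN, hi * HZ_PER_BIN))
    return layout


if __name__ == "__main__":
    for band, (bins, start, stop) in enumerate(band_layout(3)):  # 20 ms frames
        print(f"band {band:2d}: {bins:3d} bins, {start:5d}-{stop:5d} Hz")
```

Calling band_layout(0) gives the 2.5 ms column and band_layout(3) the 20 ms column.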
To improve quality on highly tonal and periodic signals, CELT includes a pre-filter/post-filter combination. The pre-filter on the encoder side attenuates the signal's harmonics. The post-filter on the decoder side restores the original gain of the harmonics, while shaping the coding noise to roughly follow the harmonics. Such noise shaping reduces the perception of the noise.

When coding a stereo signal, three coding methods are available:

o mid-side stereo: encodes the mean and the difference of the left and right channels,

o intensity stereo: only encodes the mean of the left and right channels (discards the difference),

o dual stereo: encodes the left and right channels separately.

An overview of the decoder is given in Figure 17.

                +---------+
                | Coarse  |
             +->| decoder |----+
             |  +---------+    |
             |                 |
             |  +---------+    v
             |  |  Fine   |  +---+
             +->| decoder |->| + |
             |  +---------+  +---+
             |       ^         |
 +---------+ |       |         |
 |  Range  | | +----------+    v
 | Decoder |-+ |   Bit    | +------+
 +---------+ | |Allocation| | 2**x |
             | +----------+ +------+
             |       |         |
             |       v         v               +--------+
             |  +---------+  +---+  +-------+  | pitch  |
             +->|   PVQ   |->| * |->| IMDCT |->| post-  |--->
             |  | decoder |  +---+  +-------+  | filter |
             |  +---------+                    +--------+
             |                                      ^
             +--------------------------------------+

Legend: IMDCT = Inverse MDCT

Figure 17: Structure of the CELT decoder

The decoder is based on the following symbols and sets of symbols:
+---------------+---------------------+---------------+
| Symbol(s)     | PDF                 | Condition     |
+---------------+---------------------+---------------+
| silence       | {32767, 1}/32768    |               |
| post-filter   | {1, 1}/2            |               |
| octave        | uniform (6)         | post-filter   |
| period        | raw bits (4+octave) | post-filter   |
| gain          | raw bits (3)        | post-filter   |
| tapset        | {2, 1, 1}/4         | post-filter   |
| transient     | {7, 1}/8            |               |
| intra         | {7, 1}/8            |               |
| coarse energy | Section 4.3.2       |               |
| tf_change     | Section 4.3.1       |               |
| tf_select     | {1, 1}/2            | Section 4.3.1 |
| spread        | {7, 2, 21, 2}/32    |               |
| dyn. alloc.   | Section 4.3.3       |               |
| alloc. trim   | Table 58            |               |
| skip          | {1, 1}/2            | Section 4.3.3 |
| intensity     | uniform             | Section 4.3.3 |
| dual          | {1, 1}/2            |               |
| fine energy   | Section 4.3.2       |               |
| residual      | Section 4.3.4       |               |
| anti-collapse | {1, 1}/2            | Section 4.3.5 |
| finalize      | Section 4.3.2       |               |
+---------------+---------------------+---------------+

Table 56: Order of the Symbols in the CELT Section of the Bitstream
The decoder extracts information from the range-coded bitstream in the order described in Table 56. In some circumstances, it is possible for a decoded value to be out of range due to a very small amount of redundancy in the encoding of large integers by the range coder. In that case, the decoder should assume there has been an error in the coding, decoding, or transmission and SHOULD take measures to conceal the error and/or report to the application that a problem has occurred. Such out of range errors cannot occur in the SILK layer.

4.3.1. Transient Decoding
The "transient" flag indicates whether the frame uses a single long MDCT or several short MDCTs. When it is set, the MDCT coefficients represent multiple short MDCTs in the frame. When not set, the coefficients represent a single long MDCT for the frame. The flag is encoded in the bitstream with a probability of 1/8 of being set (PDF {7, 1}/8 in Table 56).

In addition to the global transient flag, there is a per-band binary flag to change the time-frequency (tf) resolution independently in each band. The change in tf resolution is defined in tf_select_table[][] in celt.c and depends on the frame size, whether the transient flag is set, and the value of tf_select. The tf_select flag uses a 1/2 probability, but is only decoded if it can have an impact on the result knowing the value of all per-band tf_change flags.

4.3.2. Energy Envelope Decoding
It is important to quantize the energy with sufficient resolution because any energy quantization error cannot be compensated for at a later stage. Regardless of the resolution used for encoding the spectral shape of a band, it is perceptually important to preserve the energy in each band. CELT uses a three-step coarse-fine-fine strategy for encoding the energy in the base-2 log domain, as implemented in quant_bands.c.

4.3.2.1. Coarse Energy Decoding
Coarse quantization of the energy uses a fixed resolution of 6 dB (integer part of base-2 log). To minimize the bitrate, prediction is applied both in time (using the previous frame) and in frequency (using the previous bands). The part of the prediction that is based on the previous frame can be disabled, creating an "intra" frame where the energy is coded without reference to prior frames. The decoder first reads the intra flag to determine what prediction is used. The 2-D z-transform [Z-TRANSFORM] of the prediction filter is

                               -1          -1
                 (1 - alpha*z_l  )*(1 - z_b  )
   A(z_l, z_b) = -----------------------------
                                    -1
                        1 - beta*z_b

where b is the band index and l is the frame index. The prediction coefficients applied depend on the frame size in use when not using intra energy and are alpha=0, beta=4915/32768 when using intra energy. The time-domain prediction is based on the final fine quantization of the previous frame, while the frequency domain (within the current frame) prediction is based on coarse quantization only (because the fine quantization has not been computed yet). The prediction is clamped internally so that fixed-point implementations with limited dynamic range always remain in the same state as floating point implementations.

We approximate the ideal probability distribution of the prediction error using a Laplace distribution with separate parameters for each frame size in intra- and inter-frame modes. These parameters are held in the e_prob_model table in quant_bands.c. The coarse energy decoding is performed by unquant_coarse_energy() (quant_bands.c). The decoding of the Laplace-distributed values is implemented in ec_laplace_decode() (laplace.c).

4.3.2.2. Fine Energy Quantization
The number of bits assigned to fine energy quantization in each band is determined by the bit allocation computation described in Section 4.3.3. Let B_i be the number of fine energy bits for band i; the refinement is an integer f in the range [0,2**B_i-1]. The mapping between f and the correction applied to the coarse energy is equal to (f+1/2)/2**B_i - 1/2. Fine energy quantization is implemented in quant_fine_energy() (quant_bands.c). When some bits are left "unused" after all other flags have been decoded, these bits are assigned to a "final" step of fine allocation. In effect, these bits are used to add one extra fine energy bit per band per channel. The allocation process determines two "priorities" for the final fine bits. Any remaining bits are first assigned only to bands of priority 0, starting from band 0 and going up. If all bands of priority 0 have received one bit per channel, then bands of priority 1 are assigned an extra bit per channel, starting from band 0. If any bits are left after this, they are left unused. This is implemented in unquant_energy_finalise() (quant_bands.c).
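The two rules above can be illustrated with a short sketch. It covers only what the text states: the mapping from the refinement integer f to a correction, and the order in which leftover bits are handed out by priority. The function names are illustrative and are not those of the reference code, and the sketch does not model how the extra bit changes the decoded energy.

```python
def fine_energy_correction(f, bits):
    """Map a refinement integer f in [0, 2**bits - 1] to a correction.

    Implements (f + 1/2) / 2**bits - 1/2; the result is in the same
    base-2 log units as the coarse energy it refines.
    """
    if not 0 <= f < (1 << bits):
        raise ValueError("f out of range for the given number of bits")
    return (f + 0.5) / (1 << bits) - 0.5


def assign_final_bits(bits_left, priorities, channels):
    """Return (bands receiving one extra bit per channel, bits still unused).

    priorities[i] is 0 or 1 for band i. Priority-0 bands are served first,
    from band 0 upward, then priority-1 bands; any remainder stays unused.
    """
    chosen = []
    for prio in (0, 1):
        for band, p in enumerate(priorities):
            if p == prio and bits_left >= channels:
                chosen.append(band)
                bits_left -= channels
    return chosen, bits_left
```

With two fine bits, the four possible values of f map to corrections of -0.375, -0.125, 0.125, and 0.375, which are the midpoints of four equal intervals spanning one coarse step.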
4.3.3. Bit Allocation
Because the bit allocation drives the decoding of the range-coder stream, it MUST be recovered exactly so that identical coding decisions are made in the encoder and decoder. Any deviation from the reference's resulting bit allocation will result in corrupted output, though implementers are free to implement the procedure in any way that produces identical results. The per-band gain-shape structure of the CELT layer ensures that using the same number of bits for the spectral shape of a band in every frame will result in a roughly constant signal-to-noise ratio in that band. This results in coding noise that has the same spectral envelope as the signal. The masking curve produced by a standard psychoacoustic model also closely follows the spectral envelope of the signal. This structure means that the ideal allocation is more consistent from frame to frame than it is for other codecs without an equivalent structure and that a fixed allocation provides fairly consistent perceptual performance [VALIN2010]. Many codecs transmit significant amounts of side information to control the bit allocation within a frame. Often this control is only indirect, and it must be exercised carefully to achieve the desired rate constraints. The CELT layer, however, can adapt over a very wide range of rates, so it has a large number of codebook sizes to choose from for each band. Explicitly signaling the size of each of these codebooks would impose considerable overhead, even though the allocation is relatively static from frame to frame. This is because all of the information required to compute these codebook sizes must be derived from a single frame by itself, in order to retain robustness to packet loss, so the signaling cannot take advantage of knowledge of the allocation in neighboring frames. This problem is exacerbated in low-latency (small frame size) applications, which would include this overhead in every frame. 
For this reason, in the MDCT mode, Opus uses a primarily implicit bit allocation. The available bitstream capacity is known in advance to both the encoder and decoder without additional signaling, ultimately from the packet sizes expressed by a higher-level protocol. Using this information, the codec interpolates an allocation from a hard-coded table.

While the band-energy structure effectively models intra-band masking, it ignores the weaker inter-band masking, band-temporal masking, and other less significant perceptual effects. While these effects can often be ignored, they can become significant for particular samples. One mechanism available to encoders would be to simply increase the overall rate for these frames, but this is not possible in a constant rate mode and can be fairly inefficient. As a result three explicitly signaled mechanisms are provided to alter the implicit allocation:

o Band boost

o Allocation trim

o Band skipping

The first of these mechanisms, band boost, allows an encoder to boost the allocation in specific bands. The second, allocation trim, works by biasing the overall allocation towards higher or lower frequency bands. The third, band skipping, selects which low-precision high frequency bands will be allocated no shape bits at all.

In stereo mode, there are two additional parameters potentially coded as part of the allocation procedure: a parameter to allow the selective elimination of allocation for the 'side' (i.e., intensity stereo) in jointly coded bands, and a flag to deactivate joint coding (i.e., dual stereo). These values are not signaled if they would be meaningless in the overall context of the allocation.

Because every signaled adjustment increases overhead and implementation complexity, none were included speculatively: the reference encoder makes use of all of these mechanisms. While the decision logic in the reference was found to be effective enough to justify the overhead and complexity, further analysis techniques may be discovered that increase the effectiveness of these parameters. As with other signaled parameters, an encoder is free to choose the values in any manner, but, unless a technique is known to deliver superior perceptual results, the methods used by the reference implementation should be used.
The allocation process consists of the following steps: determining the per-band maximum allocation vector, decoding the boosts, decoding the tilt, determining the remaining capacity of the frame, searching the mode table for the entry nearest but not exceeding the available space (subject to the tilt, boosts, band maximums, and band minimums), linear interpolation, reallocation of unused bits with concurrent skip decoding, determination of the fine-energy vs. shape split, and final reallocation. This process results in a per-band shape allocation (in 1/8th-bit units), a per-band fine-energy allocation (in 1 bit per channel units), a set of band priorities for controlling the use of remaining bits at the end of the frame, and a remaining balance of unallocated space, which is usually zero except at very high rates.
The "static" bit allocation (in 1/8 bits) for a quality q, excluding the minimums, maximums, tilt and boosts, is equal to channels*N*alloc[band][q]<<LM>>2, where alloc[][] is given in Table 57 and LM=log2(frame_size/120). The allocation is obtained by linearly interpolating between two values of q (in steps of 1/64) to find the highest allocation that does not exceed the number of bits remaining.

Rows indicate the MDCT bands, columns are the different quality (q) parameters. The units are 1/32 bit per MDCT bin.

+---+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 0 | 1  | 2   | 3   | 4   | 5   | 6   | 7   | 8   | 9   | 10  |
+---+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+
| 0 | 90 | 110 | 118 | 126 | 134 | 144 | 152 | 162 | 172 | 200 |
| 0 | 80 | 100 | 110 | 119 | 127 | 137 | 145 | 155 | 165 | 200 |
| 0 | 75 | 90  | 103 | 112 | 120 | 130 | 138 | 148 | 158 | 200 |
| 0 | 69 | 84  | 93  | 104 | 114 | 124 | 132 | 142 | 152 | 200 |
| 0 | 63 | 78  | 86  | 95  | 103 | 113 | 123 | 133 | 143 | 200 |
| 0 | 56 | 71  | 80  | 89  | 97  | 107 | 117 | 127 | 137 | 200 |
| 0 | 49 | 65  | 75  | 83  | 91  | 101 | 111 | 121 | 131 | 200 |
| 0 | 40 | 58  | 70  | 78  | 85  | 95  | 105 | 115 | 125 | 200 |
| 0 | 34 | 51  | 65  | 72  | 78  | 88  | 98  | 108 | 118 | 198 |
| 0 | 29 | 45  | 59  | 66  | 72  | 82  | 92  | 102 | 112 | 193 |
| 0 | 20 | 39  | 53  | 60  | 66  | 76  | 86  | 96  | 106 | 188 |
| 0 | 18 | 32  | 47  | 54  | 60  | 70  | 80  | 90  | 100 | 183 |
| 0 | 10 | 26  | 40  | 47  | 54  | 64  | 74  | 84  | 94  | 178 |
| 0 | 0  | 20  | 31  | 39  | 47  | 57  | 67  | 77  | 87  | 173 |
| 0 | 0  | 12  | 23  | 32  | 41  | 51  | 61  | 71  | 81  | 168 |
| 0 | 0  | 0   | 15  | 25  | 35  | 45  | 55  | 65  | 75  | 163 |
| 0 | 0  | 0   | 4   | 17  | 29  | 39  | 49  | 59  | 69  | 158 |
| 0 | 0  | 0   | 0   | 12  | 23  | 33  | 43  | 53  | 63  | 153 |
| 0 | 0  | 0   | 0   | 1   | 16  | 26  | 36  | 46  | 56  | 148 |
| 0 | 0  | 0   | 0   | 0   | 10  | 15  | 20  | 30  | 45  | 129 |
| 0 | 0  | 0   | 0   | 0   | 1   | 1   | 1   | 1   | 20  | 104 |
+---+----+-----+-----+-----+-----+-----+-----+-----+-----+-----+

Table 57: CELT Static Allocation Table

The maximum allocation vector is an approximation of the maximum space that can be used by each band for a given mode. The value is approximate because the shape encoding is variable rate (due to entropy coding of splitting parameters). Setting the maximum too low reduces the maximum achievable quality in a band while setting it too high may result in waste: bitstream capacity available at the end of the frame that cannot be put to any use. The maximums specified by the codec reflect the average maximum. In the reference implementation, the maximums in bits/sample are precomputed in a static table (see cache_caps50[] in static_modes_float.h) for each band, for each value of LM, and for both mono and stereo. Implementations are expected to simply use the same table data, but the procedure for generating this table is included in rate.c as part of compute_pulse_cache().

To convert the values in cache.caps into the actual maximums: first, set nbBands to the maximum number of bands for this mode, and stereo to zero if stereo is not in use and one otherwise. For each band, set N to the number of MDCT bins covered by the band (for one channel), set LM to the shift value for the frame size. Then, set i to nbBands*(2*LM+stereo). Next, set the maximum for the band to the i-th index of cache.caps + 64 and multiply by the number of channels in the current frame (one or two) and by N, then divide the result by 4 using integer division. The resulting vector will be called cap[]. The elements fit in signed 16-bit integers but do not fit in 8 bits.
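The static allocation formula given before Table 57 and the cap conversion described above are both short integer computations, sketched below. Two assumptions are made and should be checked against the reference: N in the static formula is read as the band's bin count in the shortest (2.5 ms) frame, so that N<<LM is the bin count at the current frame size (this is what makes the 1/32 bit per bin units come out in 1/8 bits), and the cache.caps value passed to band_cap() is a made-up number, since the table contents are not reproduced here. Function names are illustrative.

```python
# Two rows of Table 57 (units: 1/32 bit per MDCT bin), indexed by q = 0..10.
ALLOC_BAND_0 = [0, 90, 110, 118, 126, 134, 144, 152, 162, 172, 200]
ALLOC_BAND_20 = [0, 0, 0, 0, 0, 1, 1, 1, 1, 20, 104]


def static_allocation(channels, n, alloc_entry, lm):
    """Static allocation in 1/8 bit units: channels*N*alloc[band][q]<<LM>>2.

    Assumption: n is the band's bin count in a 2.5 ms frame.
    """
    return ((channels * n * alloc_entry) << lm) >> 2


def band_cap(cache_value, channels, n):
    """Convert one cache.caps entry into cap[band].

    n is the number of MDCT bins in the band (one channel) at the current
    frame size; cache_value is the already looked-up table entry.
    """
    return (cache_value + 64) * channels * n // 4


if __name__ == "__main__":
    # Band 0, q = 1, mono, 20 ms frame (LM = 3): 180 eighth-bits = 22.5 bits.
    print(static_allocation(1, 1, ALLOC_BAND_0[1], 3))
    # Hypothetical cache value of 100 for an 8-bin band in stereo.
    print(band_cap(100, 2, 8))
```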
The cap[] computation is implemented in the reference in the function init_caps() in celt.c.

The band boosts are represented by a series of binary symbols that are entropy coded with very low probability. Each band can potentially be boosted multiple times, subject to the frame actually having enough room to obey the boost and having enough room to code the boost symbol. The default coding cost for a boost starts out at six bits (probability p=1/64), but subsequent boosts in a band cost only a single bit and every time a band is boosted the initial cost is reduced (down to a minimum of two bits, or p=1/4). Since the initial cost of coding a boost is 6 bits, the coding cost of the boost symbols when completely unused is 0.48 bits/frame for a 21 band mode (21*-log2(1-1/2**6)).

To decode the band boosts: First, set 'dynalloc_logp' to 6, the initial amount of storage required to signal a boost in bits, 'total_bits' to the size of the frame in 8th bits, 'total_boost' to zero, and 'tell' to the total number of 8th bits decoded so far.

For each band from the coding start (0 normally, but 17 in Hybrid mode) to the coding end (which changes depending on the signaled bandwidth), the boost quanta in units of 1/8 bit is calculated as quanta = min(8*N, max(48, N)). This represents a boost step size of six bits, subject to a lower limit of 1/8th bit/sample and an upper limit of 1 bit/sample. Set 'boost' to zero and 'dynalloc_loop_logp' to dynalloc_logp.

While dynalloc_loop_logp (the current worst case symbol cost) in 8th bits plus tell is less than total_bits plus total_boost and boost is less than cap[] for this band: Decode a bit from the bitstream with dynalloc_loop_logp as the cost of a one and update tell to reflect the current used capacity. If the decoded value is zero break the loop. Otherwise, add quanta to boost and total_boost, subtract quanta from total_bits, and set dynalloc_loop_logp to 1.

When the loop finishes 'boost' contains the bit allocation boost for this band. If boost is non-zero and dynalloc_logp is greater than 2, decrease dynalloc_logp. Once this process has been executed on all bands, the band boosts have been decoded. This procedure is implemented around line 2474 of celt.c.

At very low rates, it is possible that there won't be enough available space to execute the inner loop even once. In these cases, band boost is not possible, but its overhead is completely eliminated. Because of the high cost of band boost when activated, a reasonable encoder should not be using it at very low rates.
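Two of the numbers quoted above can be checked directly: the clamping behavior of the boost quanta and the 0.48 bits/frame overhead of leaving the boosts unused. The sketch below only evaluates the formulas stated in the text; it does not decode anything, and the function names are illustrative.

```python
import math


def boost_quanta(n):
    """Boost step in 1/8 bit units: six bits, clamped so that it is never
    less than 1/8 bit per sample nor more than 1 bit per sample."""
    return min(8 * n, max(48, n))


def idle_boost_cost(num_bands, logp=6):
    """Bits spent per frame when every band codes a single 'no boost' symbol
    whose complementary event has probability 2**-logp."""
    return num_bands * -math.log2(1 - 2.0 ** -logp)


if __name__ == "__main__":
    for n in (1, 6, 8, 176):
        print(n, boost_quanta(n))
    print(round(idle_boost_cost(21), 2))
```

A one-bin band is limited by the upper clamp (8 eighth-bits, i.e., one bit), an 8-bin band gets the nominal six bits, and a 176-bin band is raised by the lower clamp to 176 eighth-bits.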
The reference implements its dynalloc decision logic around line 1304 of celt.c. The allocation trim is an integer value from 0-10. The default value of 5 indicates no trim. The trim parameter is entropy coded in order to lower the coding cost of less extreme adjustments. Values lower than 5 bias the allocation towards lower frequencies and values above 5 bias it towards higher frequencies. Like other signaled parameters, signaling of the trim is gated so that it is not included if there is insufficient space available in the bitstream. To decode the trim, first set the trim value to 5, then if and only if the count of decoded 8th bits so far (ec_tell_frac) plus 48 (6 bits) is less than or equal to the total frame size in 8th bits minus total_boost (a product of the above band boost procedure), decode the trim value using the PDF in Table 58.
+--------------------------------------------+
| PDF                                        |
+--------------------------------------------+
| {2, 2, 5, 10, 22, 46, 22, 10, 5, 2, 2}/128 |
+--------------------------------------------+

Table 58: PDF for the Trim

For 10 ms and 20 ms frames that use short blocks and have at least LM+2 bits left prior to the allocation process, one anti-collapse bit is reserved in the allocation process so it can be decoded later. Following the anti-collapse reservation, one bit is reserved for skip if available.

For stereo frames, bits are reserved for intensity stereo and for dual stereo. Intensity stereo requires ilog2(end-start) bits. Those bits are reserved if there are enough bits left. Following this, one bit is reserved for dual stereo if available.

The allocation computation begins by setting up some initial conditions. 'total' is set to the remaining available 8th bits, computed by taking the size of the coded frame times 8 and subtracting ec_tell_frac(). From this value, one (8th bit) is subtracted to ensure that the resulting allocation will be conservative. 'anti_collapse_rsv' is set to 8 (8th bits) if and only if the frame is a transient, LM is greater than 1, and total is greater than or equal to (LM+2) * 8. Total is then decremented by anti_collapse_rsv and clamped to be equal to or greater than zero. 'skip_rsv' is set to 8 (8th bits) if total is greater than 8, otherwise it is zero. Total is then decremented by skip_rsv. This reserves space for the final skipping flag.

If the current frame is stereo, intensity_rsv is set to the conservative log2 in 8th bits of the number of coded bands for this frame (given by the table LOG2_FRAC_TABLE in rate.c). If intensity_rsv is greater than total, then intensity_rsv is set to zero. Otherwise, total is decremented by intensity_rsv, and if total is still greater than 8, dual_stereo_rsv is set to 8 and total is decremented by dual_stereo_rsv.
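The reservation sequence above can be followed step by step in a small sketch. The function name and argument names are illustrative. Because LOG2_FRAC_TABLE is not reproduced in this section, the intensity cost is passed in as a parameter, and the value used in the example is made up.

```python
def reserve_bits(frame_8th_bits, tell_frac, is_transient, lm, stereo,
                 intensity_cost=0):
    """Compute the reservations made before the main allocation.

    All quantities are in 1/8 bit units. intensity_cost stands in for the
    LOG2_FRAC_TABLE lookup, which is not reproduced here.
    """
    # Remaining space, minus one 8th bit to keep the allocation conservative.
    total = frame_8th_bits - tell_frac - 1

    anti_collapse_rsv = 8 if (is_transient and lm > 1
                              and total >= (lm + 2) * 8) else 0
    total = max(0, total - anti_collapse_rsv)

    skip_rsv = 8 if total > 8 else 0
    total -= skip_rsv

    intensity_rsv = 0
    dual_stereo_rsv = 0
    if stereo:
        intensity_rsv = intensity_cost
        if intensity_rsv > total:
            intensity_rsv = 0
        else:
            total -= intensity_rsv
            if total > 8:
                dual_stereo_rsv = 8
                total -= dual_stereo_rsv

    return {
        "total": total,
        "anti_collapse_rsv": anti_collapse_rsv,
        "skip_rsv": skip_rsv,
        "intensity_rsv": intensity_rsv,
        "dual_stereo_rsv": dual_stereo_rsv,
    }
```

For example, with 800 eighth-bits in the frame, 100 already consumed, a transient 20 ms stereo frame (LM = 3), and a hypothetical intensity cost of 35, every reservation succeeds and 640 eighth-bits remain for the allocation.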
The allocation process then computes a vector representing the hard minimum allocation that any band will receive for shape. This minimum is higher than the technical limit of the PVQ process, but very low rate allocations produce an excessively sparse spectrum and these bands are better served by having no allocation at all. For each coded band, set thresh[band] to 24 times the number of MDCT bins in the band and divide by 16. If 8 times the number of channels is greater, use that instead. This sets the minimum allocation to one bit per channel or 48 128th bits per MDCT bin, whichever is greater. The band-size dependent part of this value is not scaled by the channel count, because at the very low rates where this limit is applicable there will usually be no bits allocated to the side.

The previously decoded allocation trim is used to derive a vector of per-band adjustments, 'trim_offsets[]'. For each coded band take the alloc_trim and subtract 5 and LM. Then, multiply the result by the number of channels, the number of MDCT bins in the shortest frame size for this mode, the number of remaining bands, 2**LM, and 8. Next, divide this value by 64. Finally, if the number of MDCT bins in the band per channel is only one, 8 times the number of channels is subtracted in order to diminish the allocation by one bit, because width 1 bands receive greater benefit from the coarse energy coding.

4.3.4. Shape Decoding
In each band, the normalized "shape" is encoded using a Pyramid Vector Quantizer. In the simplest case, the number of bits allocated in Section 4.3.3 is converted to a number of pulses as described by Section 4.3.4.1. Knowing the number of pulses and the number of samples in the band, the decoder calculates the size of the codebook as detailed in Section 4.3.4.2. The size is used to decode an unsigned integer (uniform probability model), which is the codeword index. This index is converted into the corresponding vector as explained in Section 4.3.4.2. This vector is then scaled to unit norm.

4.3.4.1. Bits to Pulses
Although the allocation is performed in 1/8th bit units, the quantization requires an integer number of pulses K. To do this, the encoder searches for the value of K that produces the number of bits nearest to the allocated value (rounding down if exactly halfway between two values), not to exceed the total number of bits available. For efficiency reasons, the search is performed against a precomputed allocation table that only permits some K values for each N. The number of codebook entries can be computed as explained in Section 4.3.4.2. The difference between the number of bits allocated and the number of bits used is accumulated to a "balance" (initialized to zero) that helps adjust the allocation for the next bands. One third of the balance is applied to the bit allocation of each band to help achieve the target allocation. The only exceptions are the band before the last and the last band, for which half the balance and the whole balance are applied, respectively.
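The balance schedule described above (one third for most bands, one half for the band before the last, the whole balance for the last band) amounts to dividing by the smaller of 3 and the number of bands still to be coded. A minimal sketch, with an illustrative function name; rounding toward zero is an assumption made here, since the text does not state the rounding rule.

```python
def balance_share(balance, band, last_band):
    """Portion of the running balance applied to this band's allocation.

    balance is in 1/8 bit units; band and last_band are band indices.
    Assumption: the division truncates toward zero.
    """
    remaining = last_band - band + 1      # bands left, including this one
    divisor = min(3, remaining)
    share = abs(balance) // divisor
    return share if balance >= 0 else -share
```

With a balance of 24 eighth-bits and a last band of 20, band 5 receives 8, band 19 receives 12, and band 20 receives all 24.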
4.3.4.2. PVQ Decoding
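The index-to-vector conversion specified in this section (the recurrence for V(N,K) and steps 1 to 5 below) can be sketched compactly as follows. This is an illustration only: it evaluates V(N,K) with the direct recursive formulation and memoization rather than the row-by-row method of the reference, it takes the index i as an argument instead of reading it from the range decoder, and the names V(), decode_pvq(), and normalize() are illustrative.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def V(n, k):
    """Number of combinations of k pulses in n samples."""
    if k == 0:
        return 1
    if n == 0:
        return 0
    return V(n - 1, k) + V(n, k - 1) + V(n - 1, k - 1)


def decode_pvq(i, n, k_total):
    """Convert codeword index i (0 <= i < V(n, k_total)) to a pulse vector."""
    if not 0 <= i < V(n, k_total):
        raise ValueError("index out of range")
    x = [0] * n
    k = k_total
    for j in range(n):
        p = (V(n - j - 1, k) + V(n - j, k)) // 2   # step 1
        if i < p:                                   # step 2
            sgn = 1
        else:
            sgn = -1
            i -= p
        k0 = k                                      # step 3
        p -= V(n - j - 1, k)
        while p > i:                                # step 4
            k -= 1
            p -= V(n - j - 1, k)
        x[j] = sgn * (k0 - k)                       # step 5
        i -= p
    return x


def normalize(x):
    """Scale x so that its L2-norm equals one."""
    norm = sum(v * v for v in x) ** 0.5
    return [v / norm for v in x]


if __name__ == "__main__":
    for index in range(V(2, 2)):
        print(index, decode_pvq(index, 2, 2))
```

For N=2 and K=2 the codebook has V(2,2) = 8 entries, and index 0 decodes to [2, 0]. Every decoded vector has absolute values summing to K before normalization.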
Decoding of PVQ vectors is implemented in decode_pulses() (cwrs.c). The unique codeword index is decoded as a uniformly distributed integer value between 0 and V(N,K)-1, where V(N,K) is the number of possible combinations of K pulses in N samples. The index is then converted to a vector in the same way specified in [PVQ]. The indexing is based on the calculation of V(N,K) (denoted N(L,K) in [PVQ]).

The number of combinations can be computed recursively as V(N,K) = V(N-1,K) + V(N,K-1) + V(N-1,K-1), with V(N,0) = 1 and V(0,K) = 0, K != 0. There are many different ways to compute V(N,K), including precomputed tables and direct use of the recursive formulation. The reference implementation applies the recursive formulation one line (or column) at a time to save on memory use, along with an alternate, univariate recurrence to initialize an arbitrary line, and direct polynomial solutions for small N. All of these methods are equivalent, and have different trade-offs in speed, memory usage, and code size. Implementations MAY use any methods they like, as long as they are equivalent to the mathematical definition.

The decoded vector X is recovered as follows. Let i be the index decoded with the procedure in Section 4.1.5 with ft = V(N,K), so that 0 <= i < V(N,K). Let k = K. Then, for j = 0 to (N - 1), inclusive, do:

1. Let p = (V(N-j-1,k) + V(N-j,k))/2.

2. If i < p, then let sgn = 1, else let sgn = -1 and set i = i - p.

3. Let k0 = k and set p = p - V(N-j-1,k).

4. While p > i, set k = k - 1 and p = p - V(N-j-1,k).

5. Set X[j] = sgn*(k0 - k) and i = i - p.

The decoded vector X is then normalized such that its L2-norm equals one.

4.3.4.3. Spreading
The normalized vector decoded in Section 4.3.4.2 is then rotated for the purpose of avoiding tonal artifacts. The rotation gain is equal to

   g_r = N / (N + f_r*K)

where N is the number of dimensions, K is the number of pulses, and f_r depends on the value of the "spread" parameter in the bitstream.

+--------------+------------------------+
| Spread value | f_r                    |
+--------------+------------------------+
| 0            | infinite (no rotation) |
| 1            | 15                     |
| 2            | 10                     |
| 3            | 5                      |
+--------------+------------------------+

Table 59: Spreading Values

The rotation angle is then calculated as

                    2
            pi * g_r
   theta = ----------
                4

A 2-D rotation R(i,j) between points x_i and x_j is defined as:

   x_i' =  cos(theta)*x_i + sin(theta)*x_j
   x_j' = -sin(theta)*x_i + cos(theta)*x_j

An N-D rotation is then achieved by applying a series of 2-D rotations back and forth, in the following order: R(x_1, x_2), R(x_2, x_3), ..., R(x_N-2, x_N-1), R(x_N-1, x_N), R(x_N-2, x_N-1), ..., R(x_1, x_2).

If the decoded vector represents more than one time block, then this spreading process is applied separately on each time block. Also, if each block represents 8 samples or more, then another N-D rotation, by (pi/2-theta), is applied _before_ the rotation described above. This extra rotation is applied in an interleaved manner with a stride equal to round(sqrt(N/nb_blocks)), i.e., it is applied independently for each set of samples S_k = {stride*n + k}, n=0..N/stride-1.

4.3.4.4. Split Decoding
To avoid the need for multi-precision calculations when decoding PVQ codevectors, the maximum size allowed for codebooks is 32 bits. When larger codebooks are needed, the vector is instead split in two sub-vectors of size N/2. A quantized gain parameter with precision derived from the current allocation is entropy coded to represent the relative gains of each side of the split, and the entire decoding process is recursively applied. Multiple levels of splitting may be applied up to a limit of LM+1 splits. The same recursive mechanism is applied for the joint coding of stereo audio.

4.3.4.5. Time-Frequency Change
The time-frequency (TF) parameters are used to control the time-frequency resolution trade-off in each coded band. For each band, there are two possible TF choices. For the first band coded, the PDF is {3, 1}/4 for frames marked as transient and {15, 1}/16 for the other frames. For subsequent bands, the TF choice is coded relative to the previous TF choice with probability {15, 1}/16 for transient frames and {31, 1}/32 otherwise. The mapping between the decoded TF choices and the adjustment in TF resolution is shown in the tables below.

+-----------------+---+----+
| Frame size (ms) | 0 | 1  |
+-----------------+---+----+
| 2.5             | 0 | -1 |
| 5               | 0 | -1 |
| 10              | 0 | -2 |
| 20              | 0 | -2 |
+-----------------+---+----+

Table 60: TF Adjustments for Non-transient Frames and tf_select=0

+-----------------+---+----+
| Frame size (ms) | 0 | 1  |
+-----------------+---+----+
| 2.5             | 0 | -1 |
| 5               | 0 | -2 |
| 10              | 0 | -3 |
| 20              | 0 | -3 |
+-----------------+---+----+

Table 61: TF Adjustments for Non-transient Frames and tf_select=1

+-----------------+---+----+
| Frame size (ms) | 0 | 1  |
+-----------------+---+----+
| 2.5             | 0 | -1 |
| 5               | 1 | 0  |
| 10              | 2 | 0  |
| 20              | 3 | 0  |
+-----------------+---+----+

Table 62: TF Adjustments for Transient Frames and tf_select=0

+-----------------+---+----+
| Frame size (ms) | 0 | 1  |
+-----------------+---+----+
| 2.5             | 0 | -1 |
| 5               | 1 | -1 |
| 10              | 1 | -1 |
| 20              | 1 | -1 |
+-----------------+---+----+

Table 63: TF Adjustments for Transient Frames and tf_select=1

A negative TF adjustment means that the temporal resolution is increased, while a positive TF adjustment means that the frequency resolution is increased. Changes in TF resolution are implemented using the Hadamard transform [HADAMARD]. To increase the time resolution by N, N "levels" of the Hadamard transform are applied to the decoded vector for each interleaved MDCT vector. To increase the frequency resolution (assumes a transient frame), then N levels of the Hadamard transform are applied _across_ the interleaved MDCT vector. In the case of increased time resolution, the decoder uses the "sequency order" because the input vector is sorted in time.

4.3.5. Anti-collapse Processing
The anti-collapse feature is designed to avoid the situation where the use of multiple short MDCTs causes the energy in one or more of the MDCTs to be zero for some bands, causing unpleasant artifacts. When the frame has the transient bit set, an anti-collapse bit is decoded. When anti-collapse is set, the energy in each small MDCT is prevented from collapsing to zero. For each band of each MDCT where a collapse is detected, a pseudo-random signal is inserted with an
energy corresponding to the minimum energy over the two previous frames. A renormalization step is then required to ensure that the anti-collapse step did not alter the energy preservation property.

4.3.6. Denormalization
Just as each band was normalized in the encoder, the last step of the decoder before the inverse MDCT is to denormalize the bands. Each decoded normalized band is multiplied by the square root of the decoded energy. This is done by denormalise_bands() (bands.c).

4.3.7. Inverse MDCT
The inverse MDCT implementation has no special characteristics. The input is N frequency-domain samples and the output is 2*N time-domain samples, scaled by 1/2. A "low-overlap" window reduces the algorithmic delay. It is derived from a basic (full-overlap) 240-sample version of the window used by the Vorbis codec:

   W(n) = sin( (pi/2) * sin^2( (pi/2) * (n + 1/2) / L ) )

The low-overlap window is created by zero-padding the basic window and inserting ones in the middle, such that the resulting window still satisfies power complementarity [PRINCEN86]. The IMDCT and windowing are performed by mdct_backward (mdct.c).

4.3.7.1. Post-Filter
The output of the inverse MDCT (after weighted overlap-add) is sent to the post-filter. Although the post-filter is applied at the end, the post-filter parameters are encoded at the beginning, just after the silence flag. The post-filter can be switched on or off using one bit (logp=1). If the post-filter is enabled, then the octave is decoded as an integer value from 0 to 5, inclusive, with uniform probability (six possible values). Once the octave is known, the fine pitch within the octave is decoded using 4+octave raw bits. The final pitch period is equal to (16<<octave)+fine_pitch-1, so it is bounded between 15 and 1022, inclusive. Next, the gain is decoded as three raw bits and is equal to G=3*(int_gain+1)/32. The set of post-filter taps is decoded last, using a PDF equal to {2, 1, 1}/4. Tapset zero corresponds to the filter coefficients g0 = 0.3066406250, g1 = 0.2170410156, g2 = 0.1296386719. Tapset one corresponds to the filter coefficients g0 = 0.4638671875, g1 = 0.2680664062, g2 = 0, and tapset two uses filter coefficients g0 = 0.7998046875, g1 = 0.1000976562, g2 = 0.
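The mapping from the decoded symbols to the post-filter parameters can be illustrated with a minimal C sketch. It is not taken from the reference implementation: the PostFilterParams type and the postfilter_params() function are hypothetical names introduced here, and the sketch assumes the octave takes the six values 0 through 5, which is what the stated upper bound of 1022 on the pitch period implies.

```c
/* Hypothetical container for the decoded post-filter parameters. */
typedef struct {
    int    period;      /* pitch period T, in samples */
    double gain;        /* G */
    double g0, g1, g2;  /* filter taps for the selected tapset */
} PostFilterParams;

/* Tap values copied from the text above, indexed by tapset. */
static const double TAPSETS[3][3] = {
    {0.3066406250, 0.2170410156, 0.1296386719},
    {0.4638671875, 0.2680664062, 0.0},
    {0.7998046875, 0.1000976562, 0.0}
};

/*
 * Convert already-decoded symbols into filter parameters.
 * octave:     0..5 (assumed range, see lead-in)
 * fine_pitch: value of the 4+octave raw bits
 * int_gain:   value of the three raw gain bits (0..7)
 * tapset:     0..2
 * Returns 0 on success, -1 if any input is out of range.
 */
int postfilter_params(int octave, int fine_pitch, int int_gain, int tapset,
                      PostFilterParams *out)
{
    if (octave < 0 || octave > 5) return -1;
    if (fine_pitch < 0 || fine_pitch >= (1 << (4 + octave))) return -1;
    if (int_gain < 0 || int_gain > 7) return -1;
    if (tapset < 0 || tapset > 2) return -1;

    out->period = (16 << octave) + fine_pitch - 1;
    out->gain   = 3.0 * (int_gain + 1) / 32.0;
    out->g0     = TAPSETS[tapset][0];
    out->g1     = TAPSETS[tapset][1];
    out->g2     = TAPSETS[tapset][2];
    return 0;
}
```

With octave=0 and fine_pitch=0 the period is 15, and with octave=5 and fine_pitch=511 it is 1022, matching the bounds given above. The gain ranges from 3/32 (int_gain=0) to 3/4 (int_gain=7).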
The post-filter response is thus computed as:

   y(n) = x(n) + G*(g0*y(n-T) + g1*(y(n-T+1)+y(n-T-1))
                              + g2*(y(n-T+2)+y(n-T-2)))

During a transition between different gains, a smooth transition is calculated using the square of the MDCT window. It is important that values of y(n) be interpolated one at a time such that the past value of y(n) used is interpolated.

4.3.7.2. De-emphasis
After the post-filter, the signal is de-emphasized using the inverse of the pre-emphasis filter used in the encoder:

   1/A(z) = 1 / (1 - alpha_p*z^-1)

where alpha_p=0.8500061035.

4.4. Packet Loss Concealment (PLC)
Packet Loss Concealment (PLC) is an optional decoder-side feature that SHOULD be included when receiving from an unreliable channel. Because PLC is not part of the bitstream, there are many acceptable ways to implement PLC with different complexity/quality trade-offs.

The PLC in the reference implementation depends on the mode of the last packet received. In CELT mode, the PLC finds a periodicity in the decoded signal and repeats the windowed waveform using the pitch offset. The windowed waveform is overlapped in such a way as to preserve the time-domain aliasing cancellation with the previous frame and the next frame. This is implemented in celt_decode_lost() (celt.c). In SILK mode, the PLC uses LPC extrapolation from the previous frame, implemented in silk_PLC() (PLC.c).

4.4.1. Clock Drift Compensation
Clock drift refers to the gradual desynchronization of two endpoints whose sample clocks run at different frequencies while they are streaming live audio. Differences in clock frequencies are generally attributable to manufacturing variation in the endpoints' clock hardware. For long-lived streams, the time difference between sender and receiver can grow without bound.
When the sender's clock runs slower than the receiver's, the effect is similar to packet loss: too few packets are received. The receiver can distinguish between drift and loss if the transport provides packet timestamps. A receiver for live streams SHOULD conceal the effects of drift, and it MAY do so by invoking the PLC.

When the sender's clock runs faster than the receiver's, too many packets will be received. The receiver MAY respond by skipping any packet (i.e., not submitting the packet for decoding). This is likely to produce a less severe artifact than if the frame were dropped after decoding.

A decoder MAY employ a more sophisticated drift compensation method. For example, the NetEQ component [GOOGLE-NETEQ] of the Google WebRTC codebase [GOOGLE-WEBRTC] compensates for drift by adding or removing one period when the signal is highly periodic. The reference implementation of Opus allows a caller to learn whether the current frame's signal is highly periodic, and if so what the period is, using the OPUS_GET_PITCH() request.

4.5. Configuration Switching
Switching between the Opus coding modes, audio bandwidths, and channel counts requires careful consideration to avoid audible glitches. Switching between any two configurations of the CELT-only mode, any two configurations of the Hybrid mode, or from WB SILK to Hybrid mode does not require any special treatment in the decoder, as the MDCT overlap will smooth the transition. Switching from Hybrid mode to WB SILK requires adding in the final contents of the CELT overlap buffer to the first SILK-only packet. This can be done by decoding a 2.5 ms silence frame with the CELT decoder using the channel count of the SILK-only packet (and any choice of audio bandwidth), which will correctly handle the cases when the channel count changes as well. When changing the channel count for SILK-only or Hybrid packets, the encoder can avoid glitches by smoothly varying the stereo width of the input signal before or after the transition, and it SHOULD do so. However, other transitions between SILK-only packets or between NB or MB SILK and Hybrid packets may cause glitches, because neither the LSF coefficients nor the LTP, LPC, stereo unmixing, and resampler buffers are available at the new sample rate. These switches SHOULD be delayed by the encoder until quiet periods or transients, where the inevitable glitches will be less audible. Additionally, the bitstream MAY include redundant side information ("redundancy"), in the form of additional CELT frames embedded in each of the Opus frames around the transition.
The other transitions that cannot be easily handled are those where the lower frequencies switch between the SILK LP-based model and the CELT MDCT model. However, an encoder may not have an opportunity to delay such a switch to a convenient point. For example, if the content switches from speech to music, and the encoder does not have enough latency in its analysis to detect this in advance, there may be no convenient silence period during which to make the transition for quite some time.

To avoid or reduce glitches during these problematic mode transitions, and between audio bandwidth changes in the SILK-only modes, transitions MAY include redundant side information ("redundancy"), in the form of an additional CELT frame embedded in the Opus frame.

A transition between coding the lower frequencies with the LP model and the MDCT model or a transition that involves changing the SILK bandwidth is only normatively specified when it includes redundancy. For those without redundancy, it is RECOMMENDED that the decoder use a concealment technique (e.g., make use of a PLC algorithm) to "fill in" the gap or discontinuity caused by the mode transition. Therefore, PLC MUST NOT be applied during any normative transition, i.e., when

o  A packet includes redundancy for this transition (as described below),

o  The transition is between any WB SILK packet and any Hybrid packet, or vice versa,

o  The transition is between any two Hybrid mode packets, or

o  The transition is between any two CELT mode packets,

unless there is actual packet loss.

4.5.1. Transition Side Information (Redundancy)
Transitions with side information include an extra 5 ms "redundant" CELT frame within the Opus frame. This frame is designed to fill in the gap or discontinuity in the different layers without requiring the decoder to conceal it. For transitions from CELT-only to SILK-only or Hybrid, the redundant frame is inserted in the first Opus frame after the transition (i.e., the first SILK-only or Hybrid frame). For transitions from SILK-only or Hybrid to CELT-only, the redundant frame is inserted in the last Opus frame before the transition (i.e., the last SILK-only or Hybrid frame).
4.5.1.1. Redundancy Flag
The presence of redundancy is signaled in all SILK-only and Hybrid frames, not just those involved in a mode transition. This allows the frames to be decoded correctly even if an adjacent frame is lost.

For SILK-only frames, this signaling is implicit, based on the size of the Opus frame and the number of bits consumed decoding the SILK portion of it. After decoding the SILK portion of the Opus frame, the decoder uses ec_tell() (see Section 4.1.6.1) to check if there are at least 17 bits remaining. If so, then the frame contains redundancy.

For Hybrid frames, this signaling is explicit. After decoding the SILK portion of the Opus frame, the decoder uses ec_tell() (see Section 4.1.6.1) to ensure there are at least 37 bits remaining. If so, it reads a symbol with the PDF in Table 64, and if the value is 1, then the frame contains redundancy. Otherwise (if there were fewer than 37 bits left or the value was 0), the frame does not contain redundancy.

+----------------+
| PDF            |
+----------------+
| {4095, 1}/4096 |
+----------------+

Table 64: Redundancy Flag PDF

4.5.1.2. Redundancy Position Flag
Since the current frame is a SILK-only or a Hybrid frame, it must be at least 10 ms. Therefore, it needs an additional flag to indicate whether the redundant 5 ms CELT frame should be mixed into the beginning of the current frame, or the end. After determining that a frame contains redundancy, the decoder reads a 1-bit symbol with a uniform PDF (Table 65).

+----------+
| PDF      |
+----------+
| {1, 1}/2 |
+----------+

Table 65: Redundancy Position PDF
If the value is zero, this is the first frame in the transition, and the redundancy belongs at the end. If the value is one, this is the second frame in the transition, and the redundancy belongs at the beginning. There is no way to specify that an Opus frame contains separate redundant CELT frames at both the beginning and the end.

4.5.1.3. Redundancy Size
Unlike the CELT portion of a Hybrid frame, the redundant CELT frame does not use the same entropy coder state as the rest of the Opus frame, because this would break the CELT bit allocation mechanism in Hybrid frames. Thus, a redundant CELT frame always starts and ends on a byte boundary, even in SILK-only frames, where this is not strictly necessary. For SILK-only frames, the number of bytes in the redundant CELT frame is simply the number of whole bytes remaining, which must be at least 2, due to the space check in Section 4.5.1.1. For Hybrid frames, the number of bytes is equal to 2, plus a decoded unsigned integer less than 256 (see Section 4.1.5). This may be more than the number of whole bytes remaining in the Opus frame, in which case the frame is invalid. However, a decoder is not required to ignore the entire frame, as this may be the result of a bit error that desynchronized the range coder. There may still be useful data before the error, and a decoder MAY keep any audio decoded so far instead of invoking the PLC, but it is RECOMMENDED that the decoder stop decoding and discard the rest of the current Opus frame. It would have been possible to avoid these invalid states in the design of Opus by limiting the range of the explicit length decoded from Hybrid frames by the actual number of whole bytes remaining. However, this would require an encoder to determine the rate allocation for the MDCT layer up front, before it began encoding that layer. By allowing some invalid sizes, the encoder is able to defer that decision until much later. When encoding Hybrid frames that do not include redundancy, the encoder must still decide up front if it wishes to use the minimum 37 bits required to trigger encoding of the redundancy flag, but this is a much looser restriction. After determining the size of the redundant CELT frame, the decoder reduces the size of the buffer currently in use by the range coder by that amount. 
The MDCT layer reads any raw bits from the end of this reduced buffer, and all calculations of the number of bits remaining in the buffer must be done using this new, reduced size, rather than the original size of the Opus frame.
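The space check of Section 4.5.1.1 and the size rules of this section can be sketched in C as follows. This is an illustration rather than the reference decoder: the function names and the REDUNDANCY_INVALID constant are hypothetical, bits_used stands for the value returned by ec_tell() at the relevant point, and the Hybrid size symbol is assumed to have been decoded already by the caller.

```c
/* Hypothetical return code for a redundancy size that does not fit. */
enum { REDUNDANCY_INVALID = -1 };

/* Whole bytes of the Opus frame not yet touched by the range decoder. */
static int whole_bytes_remaining(int frame_bytes, int bits_used)
{
    return frame_bytes - (bits_used + 7) / 8;
}

/*
 * Space check from Section 4.5.1.1, evaluated right after the SILK
 * portion has been decoded: 17 bits are needed for SILK-only frames
 * and 37 bits for Hybrid frames. Returns 1 if the check passes.
 */
int redundancy_space_ok(int frame_bytes, int bits_used, int is_hybrid)
{
    int needed = is_hybrid ? 37 : 17;
    return 8 * frame_bytes - bits_used >= needed;
}

/*
 * Size in bytes of the redundant CELT frame.
 * bits_used:   ec_tell() after the redundancy flag(s) and, for Hybrid
 *              frames, the size symbol have been read.
 * size_symbol: decoded unsigned integer in 0..255 (Hybrid only).
 * Returns the size, or REDUNDANCY_INVALID if it exceeds the whole
 * bytes remaining (or is below the 2-byte minimum).
 */
int redundancy_size(int frame_bytes, int bits_used, int is_hybrid,
                    int size_symbol)
{
    int remaining = whole_bytes_remaining(frame_bytes, bits_used);
    int size = is_hybrid ? 2 + size_symbol : remaining;

    if (size < 2 || size > remaining)
        return REDUNDANCY_INVALID;
    return size;
}
```

For a valid result, the range coder's buffer would then be treated as frame_bytes minus the returned size, so that raw bits are read from the end of the reduced buffer as described above.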
4.5.1.4. Decoding the Redundancy
The redundant frame is decoded like any other CELT-only frame, with the exception that it does not contain a TOC byte. The frame size is fixed at 5 ms, the channel count is set to that of the current frame, and the audio bandwidth is also set to that of the current frame, with the exception that for MB SILK frames, it is set to WB.

If the redundancy belongs at the beginning (in a CELT-only to SILK-only or Hybrid transition), the final reconstructed output uses the first 2.5 ms of audio output by the decoder for the redundant frame as is, discarding the corresponding output from the SILK-only or Hybrid portion of the frame. The remaining 2.5 ms is cross-lapped with the decoded SILK/Hybrid signal using the CELT's power-complementary MDCT window to ensure a smooth transition.

If the redundancy belongs at the end (in a SILK-only or Hybrid to CELT-only transition), only the second half (2.5 ms) of the audio output by the decoder for the redundant frame is used. In that case, the second half of the redundant frame is cross-lapped with the end of the SILK/Hybrid signal, again using CELT's power-complementary MDCT window to ensure a smooth transition.

4.5.2. State Reset
When a transition occurs, the state of the SILK or the CELT decoder (or both) may need to be reset before decoding a frame in the new mode. This avoids reusing "out of date" memory, which may not have been updated in some time or may not be in a well-defined state due to, e.g., PLC. The SILK state is reset before every SILK-only or Hybrid frame where the previous frame was CELT-only. The CELT state is reset every time the operating mode changes and the new mode is either Hybrid or CELT-only, except when the transition uses redundancy as described above. When switching from SILK-only or Hybrid to CELT-only with redundancy, the CELT state is reset before decoding the redundant CELT frame embedded in the SILK-only or Hybrid frame, but it is not reset before decoding the following CELT-only frame. When switching from CELT-only mode to SILK-only or Hybrid mode with redundancy, the CELT decoder is not reset for decoding the redundant CELT frame.
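The reset rules above can be summarized as a small decision function. The following C sketch is one possible reading of the prose, not the reference decoder's logic: all names are hypothetical, the first frame of a stream is not modeled, and two interpretations are assumed. First, "the transition uses redundancy" is taken to mean that the previous frame ended with a redundant CELT frame. Second, the CELT reset before a redundant frame is applied whenever the redundancy sits at the end of the current frame. Both assumptions were chosen to agree with the rows of Figure 18.

```c
typedef enum { MODE_SILK_ONLY, MODE_HYBRID, MODE_CELT_ONLY } Mode;

/* Hypothetical summary of which decoder states to reset for one frame. */
typedef struct {
    int reset_silk;                  /* before decoding the SILK layer      */
    int reset_celt_before_frame;     /* before this frame's own CELT data   */
    int reset_celt_before_redundant; /* before the embedded redundant frame */
} ResetPlan;

/*
 * prev / cur:             modes of the previous and current Opus frames
 * prev_redundancy_at_end: previous frame ended with a redundant CELT frame
 * cur_redundancy:         current frame contains a redundant CELT frame
 * cur_redundancy_at_end:  that frame belongs at the end (position flag 0)
 */
ResetPlan plan_resets(Mode prev, int prev_redundancy_at_end,
                      Mode cur, int cur_redundancy,
                      int cur_redundancy_at_end)
{
    ResetPlan plan;

    /* SILK is reset when SILK data follows a CELT-only frame. */
    plan.reset_silk = (cur != MODE_CELT_ONLY) && (prev == MODE_CELT_ONLY);

    /* CELT is reset on a mode change into Hybrid or CELT-only, unless
     * the previous frame already carried the redundancy (assumption). */
    plan.reset_celt_before_frame = (cur != prev)
                                && (cur != MODE_SILK_ONLY)
                                && !prev_redundancy_at_end;

    /* Redundancy at the end starts from a fresh CELT state; redundancy
     * at the beginning continues from the preceding CELT frames. */
    plan.reset_celt_before_redundant = cur_redundancy
                                    && cur_redundancy_at_end;
    return plan;
}
```

For example, a Hybrid frame that follows a CELT-only frame and carries redundancy at its beginning yields a SILK reset and a CELT reset before the Hybrid data, but no reset before the redundant frame. A CELT-only frame that follows a SILK-only frame ending with redundancy yields no reset at all.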
4.5.3. Summary of Transitions
Figure 18 illustrates all of the normative transitions involving a mode change, an audio bandwidth change, or both. Each one uses an S, H, or C to represent an Opus frame in the corresponding mode. In addition, an R indicates the presence of redundancy in the Opus frame with which it is cross-lapped. Its location in the first or last 5 ms is assumed to correspond to whether it is the frame before or after the transition. Other uses of redundancy are non-normative. Finally, a c indicates the contents of the CELT overlap buffer after the previously decoded frame (i.e., as extracted by decoding a silence frame).
SILK to SILK with Redundancy:             S -> S -> S
                                                    &
                                                   !R -> R
                                                         &
                                                        ;S -> S -> S

NB or MB SILK to Hybrid with Redundancy:  S -> S -> S
                                                    &
                                                   !R ->;H -> H -> H

WB SILK to Hybrid:                        S -> S -> S ->!H -> H -> H

SILK to CELT with Redundancy:             S -> S -> S
                                                    &
                                                   !R -> C -> C -> C

Hybrid to NB or MB SILK with Redundancy:  H -> H -> H
                                                    &
                                                   !R -> R
                                                         &
                                                        ;S -> S -> S

Hybrid to WB SILK:                        H -> H -> H -> c
                                                      \  +
                                                       > S -> S -> S

Hybrid to CELT with Redundancy:           H -> H -> H
                                                    &
                                                   !R -> C -> C -> C

CELT to SILK with Redundancy:             C -> C -> C -> R
                                                         &
                                                        ;S -> S -> S

CELT to Hybrid with Redundancy:           C -> C -> C -> R
                                                         &
                                                        |H -> H -> H

Key:
S   SILK-only frame                 ;   SILK decoder reset
H   Hybrid frame                    |   CELT and SILK decoder resets
C   CELT-only frame                 !   CELT decoder reset
c   CELT overlap                    +   Direct mixing
R   Redundant CELT frame            &   Windowed cross-lap

Figure 18: Normative Transitions
The first two and the last two Opus frames in each example are illustrative, i.e., there is no requirement that a stream remain in the same configuration for three consecutive frames before or after a switch.

The behavior of transitions without redundancy where PLC is allowed is non-normative. An encoder might still wish to use these transitions if, for example, it doesn't want to add the extra bitrate required for redundancy or if it makes a decision to switch after it has already transmitted the frame that would have had to contain the redundancy. Figure 19 illustrates the recommended cross-lapping and decoder resets for these transitions.

SILK to SILK (audio bandwidth change):    S -> S -> S   ;S -> S -> S

NB or MB SILK to Hybrid:                  S -> S -> S   |H -> H -> H

SILK to CELT without Redundancy:          S -> S -> S -> P
                                                         &
                                                        !C -> C -> C

Hybrid to NB or MB SILK:                  H -> H -> H -> c
                                                         +
                                                        ;S -> S -> S

Hybrid to CELT without Redundancy:        H -> H -> H -> P
                                                         &
                                                        !C -> C -> C

CELT to SILK without Redundancy:          C -> C -> C -> P
                                                         &
                                                        ;S -> S -> S

CELT to Hybrid without Redundancy:        C -> C -> C -> P
                                                         &
                                                        |H -> H -> H

Key:
S   SILK-only frame                 ;   SILK decoder reset
H   Hybrid frame                    |   CELT and SILK decoder resets
C   CELT-only frame                 !   CELT decoder reset
c   CELT overlap                    +   Direct mixing
P   Packet Loss Concealment         &   Windowed cross-lap

Figure 19: Recommended Non-Normative Transitions

Encoders SHOULD NOT use other transitions, e.g., those that involve redundancy in ways not illustrated in Figure 18.