4.2. SILK Decoder
The decoder's LP layer uses a modified version of the SILK codec (herein simply called "SILK"), which runs a decoded excitation signal through adaptive long-term and short-term prediction synthesis filters. It runs at NB, MB, and WB sample rates internally. When used in a SWB or FB Hybrid frame, the LP layer itself still only runs in WB.

4.2.1. SILK Decoder Modules
An overview of the decoder is given in Figure 14.

   +---------+    +------------+
-->| Range   |--->| Decode     |---------------------------+
 1 | Decoder | 2  | Parameters |----------+       5        |
   +---------+    +------------+     4    |                |
                       3 |                |                |
                        \/               \/               \/
                  +------------+   +------------+   +------------+
                  | Generate   |-->| LTP        |-->| LPC        |
                  | Excitation |   | Synthesis  |   | Synthesis  |
                  +------------+   +------------+   +------------+
                                          ^                |
                                          |                |
                      +-------------------+----------------+
                      |                                      6
                      |   +------------+   +-------------+
                      +-->| Stereo     |-->| Sample Rate |-->
                          | Unmixing   | 7 | Conversion  | 8
                          +------------+   +-------------+

1: Range encoded bitstream
2: Coded parameters
3: Pulses, LSBs, and signs
4: Pitch lags, Long-Term Prediction (LTP) coefficients
5: Linear Predictive Coding (LPC) coefficients and gains
6: Decoded signal (mono or mid-side stereo)
7: Unmixed signal (mono or left-right stereo)
8: Resampled signal

Figure 14: SILK Decoder
The decoder feeds the bitstream (1) to the range decoder from Section 4.1 and then decodes the parameters in it (2) using the procedures detailed in Sections 4.2.3 through 4.2.7.8.5. These parameters (3, 4, 5) are used to generate an excitation signal (see Section 4.2.7.8.6), which is fed to an optional Long-Term Prediction (LTP) filter (voiced frames only, see Section 4.2.7.9.1) and then a short-term prediction filter (see Section 4.2.7.9.2), producing the decoded signal (6). For stereo streams, the mid-side representation is converted to separate left and right channels (7). The result is finally resampled to the desired output sample rate (e.g., 48 kHz) so that the resampled signal (8) can be mixed with the CELT layer.

4.2.2. LP Layer Organization
Internally, the LP layer of a single Opus frame is composed of either a single 10 ms regular SILK frame or between one and three 20 ms regular SILK frames. A stereo Opus frame may double the number of regular SILK frames (up to a total of six), since it includes separate frames for a mid channel and, optionally, a side channel.

Optional Low Bit-Rate Redundancy (LBRR) frames, which are reduced-bitrate encodings of previous SILK frames, may be included to aid in recovery from packet loss. If present, these appear before the regular SILK frames. They are, in most respects, identical to regular, active SILK frames, except that they are usually encoded with a lower bitrate. This document uses "SILK frame" to refer to either one and "regular SILK frame" if it needs to draw a distinction between the two.

Logically, each SILK frame is, in turn, composed of either two or four 5 ms subframes. Various parameters, such as the quantization gain of the excitation and the pitch lag and filter coefficients can vary on a subframe-by-subframe basis. Physically, the parameters for each subframe are interleaved in the bitstream, as described in the relevant sections for each parameter.

All of these frames and subframes are decoded from the same range coder, with no padding between them. Thus, packing multiple SILK frames in a single Opus frame saves, on average, half a byte per SILK frame. It also allows some parameters to be predicted from prior SILK frames in the same Opus frame, since this does not degrade packet loss robustness (beyond any penalty for merely using fewer, larger packets to store multiple frames).

Stereo support in SILK uses a variant of mid-side coding, allowing a mono decoder to simply decode the mid channel. However, the data for the two channels is interleaved, so a mono decoder must still unpack
the data for the side channel. It would be required to do so anyway for Hybrid Opus frames or to support decoding individual 20 ms frames.

Table 3 summarizes the overall grouping of the contents of the LP layer. Figures 15 and 16 illustrate the ordering of the various SILK frames for a 60 ms Opus frame, for both mono and stereo, respectively.

+-----------------------------------+---------------+---------------+
| Symbol(s)                         | PDF(s)        | Condition     |
+-----------------------------------+---------------+---------------+
| Voice Activity Detection (VAD)    | {1, 1}/2      |               |
| Flags                             |               |               |
|                                   |               |               |
| LBRR Flag                         | {1, 1}/2      |               |
|                                   |               |               |
| Per-Frame LBRR Flags              | Table 4       | Section 4.2.4 |
|                                   |               |               |
| LBRR Frame(s)                     | Section 4.2.7 | Section 4.2.4 |
|                                   |               |               |
| Regular SILK Frame(s)             | Section 4.2.7 |               |
+-----------------------------------+---------------+---------------+

Table 3: Organization of the SILK layer of an Opus Frame

+---------------------------------+
| VAD Flags                       |
+---------------------------------+
| LBRR Flag                       |
+---------------------------------+
| Per-Frame LBRR Flags (Optional) |
+---------------------------------+
| LBRR Frame 1 (Optional)         |
+---------------------------------+
| LBRR Frame 2 (Optional)         |
+---------------------------------+
| LBRR Frame 3 (Optional)         |
+---------------------------------+
| Regular SILK Frame 1            |
+---------------------------------+
| Regular SILK Frame 2            |
+---------------------------------+
| Regular SILK Frame 3            |
+---------------------------------+

Figure 15: A 60 ms Mono Frame
+---------------------------------------+
| Mid VAD Flags                         |
+---------------------------------------+
| Mid LBRR Flag                         |
+---------------------------------------+
| Side VAD Flags                        |
+---------------------------------------+
| Side LBRR Flag                        |
+---------------------------------------+
| Mid Per-Frame LBRR Flags (Optional)   |
+---------------------------------------+
| Side Per-Frame LBRR Flags (Optional)  |
+---------------------------------------+
| Mid LBRR Frame 1 (Optional)           |
+---------------------------------------+
| Side LBRR Frame 1 (Optional)          |
+---------------------------------------+
| Mid LBRR Frame 2 (Optional)           |
+---------------------------------------+
| Side LBRR Frame 2 (Optional)          |
+---------------------------------------+
| Mid LBRR Frame 3 (Optional)           |
+---------------------------------------+
| Side LBRR Frame 3 (Optional)          |
+---------------------------------------+
| Mid Regular SILK Frame 1              |
+---------------------------------------+
| Side Regular SILK Frame 1 (Optional)  |
+---------------------------------------+
| Mid Regular SILK Frame 2              |
+---------------------------------------+
| Side Regular SILK Frame 2 (Optional)  |
+---------------------------------------+
| Mid Regular SILK Frame 3              |
+---------------------------------------+
| Side Regular SILK Frame 3 (Optional)  |
+---------------------------------------+

Figure 16: A 60 ms Stereo Frame

4.2.3. Header Bits
The LP layer begins with two to eight header bits, decoded in silk_Decode() (dec_API.c). These consist of one Voice Activity Detection (VAD) bit per frame (up to 3), followed by a single flag indicating the presence of LBRR frames. For a stereo packet, these first flags correspond to the mid channel, and a second set of flags is included for the side channel.
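As an illustration of this layout, the following is a minimal sketch (not the reference implementation) that pulls the flags out of a single byte, working from the most significant bit downward. It assumes the caller already knows the number of SILK frames per channel and the channel count, and that first_byte is the first byte of the range-coded LP layer data. The names HeaderFlags and read_header_flags are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical container for the header flags of one channel. */
typedef struct {
    int vad[3];  /* one VAD flag per SILK frame (unused entries stay 0) */
    int lbrr;    /* the single LBRR flag for the channel */
} HeaderFlags;

/*
 * Reads the header flags from the first byte of the LP layer data.
 * nb_frames: SILK frames per channel (1 to 3); channels: 1 or 2.
 * Order, starting at the MSB: mid VAD flags, mid LBRR flag, then
 * (stereo only) side VAD flags and side LBRR flag.
 */
static void read_header_flags(uint8_t first_byte, int nb_frames,
                              int channels, HeaderFlags out[2])
{
    int bit = 7;

    assert(nb_frames >= 1 && nb_frames <= 3);
    assert(channels == 1 || channels == 2);

    for (int c = 0; c < 2; c++) {
        for (int f = 0; f < 3; f++)
            out[c].vad[f] = 0;
        out[c].lbrr = 0;
    }
    for (int c = 0; c < channels; c++) {
        for (int f = 0; f < nb_frames; f++)
            out[c].vad[f] = (first_byte >> bit--) & 1;
        out[c].lbrr = (first_byte >> bit--) & 1;
    }
}
```

At most 2*(3 + 1) = 8 bits are consumed, which matches the upper bound of eight header bits stated above.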
Because these are the first symbols decoded by the range coder and because they are coded as binary values with uniform probability, they can be extracted directly from the most significant bits of the first byte of compressed data. Thus, a receiver can determine if an Opus frame contains any active SILK frames without the overhead of using the range decoder.

4.2.4. Per-Frame LBRR Flags
For Opus frames longer than 20 ms, a set of LBRR flags is decoded for each channel that has its LBRR flag set. Each set contains one flag per 20 ms SILK frame. 40 ms Opus frames use the 2-frame LBRR flag PDF from Table 4, and 60 ms Opus frames use the 3-frame LBRR flag PDF. For each channel, the resulting 2- or 3-bit integer contains the corresponding LBRR flag for each frame, packed in order from the LSB to the MSB.

+------------+-------------------------------------+
| Frame Size | PDF                                 |
+------------+-------------------------------------+
| 40 ms      | {0, 53, 53, 150}/256                |
|            |                                     |
| 60 ms      | {0, 41, 20, 29, 41, 15, 28, 82}/256 |
+------------+-------------------------------------+

Table 4: LBRR Flag PDFs

A 10 or 20 ms Opus frame does not contain any per-frame LBRR flags, as there may be at most one LBRR frame per channel. The global LBRR flag in the header bits (see Section 4.2.3) is already sufficient to indicate the presence of that single LBRR frame.

4.2.5. LBRR Frames
The LBRR frames, if present, contain an encoded representation of the signal immediately prior to the current Opus frame as if it were encoded with the current mode, frame size, audio bandwidth, and channel count, even if those differ from the prior Opus frame. When one of these parameters changes from one Opus frame to the next, this implies that the LBRR frames of the current Opus frame may not be simple drop-in replacements for the contents of the previous Opus frame. For example, when switching from 20 ms to 60 ms, the 60 ms Opus frame may contain LBRR frames covering up to three prior 20 ms Opus frames, even if those frames already contained LBRR frames covering some of the same time periods. When switching from 20 ms to 10 ms, the 10 ms Opus frame can contain an LBRR frame covering at most half the prior
20 ms Opus frame, potentially leaving a hole that needs to be concealed from even a single packet loss (see Section 4.4). When switching from mono to stereo, the LBRR frames in the first stereo Opus frame MAY contain a non-trivial side channel.

In order to properly produce LBRR frames under all conditions, an encoder might need to buffer up to 60 ms of audio and re-encode it during these transitions. However, the reference implementation opts to disable LBRR frames at the transition point for simplicity. Since transitions are relatively infrequent in normal usage, this does not have a significant impact on packet loss robustness.

The LBRR frames immediately follow the LBRR flags, prior to any regular SILK frames. Section 4.2.7 describes their exact contents. LBRR frames do not include their own separate VAD flags. LBRR frames are only meant to be transmitted for active speech, thus all LBRR frames are treated as active.

In a stereo Opus frame longer than 20 ms, although the per-frame LBRR flags for the mid channel are coded as a unit before the per-frame LBRR flags for the side channel, the LBRR frames themselves are interleaved. The decoder parses an LBRR frame for the mid channel of a given 20 ms interval (if present) and then immediately parses the corresponding LBRR frame for the side channel (if present), before proceeding to the next 20 ms interval.

4.2.6. Regular SILK Frames
The regular SILK frame(s) follow the LBRR frames (if any). Section 4.2.7 describes their contents, as well. Unlike the LBRR frames, a regular SILK frame is coded for each time interval in an Opus frame, even if the corresponding VAD flags are unset.

For stereo Opus frames longer than 20 ms, the regular mid and side SILK frames for each 20 ms interval are interleaved, just as with the LBRR frames. The side frame may be skipped by coding an appropriate flag, as detailed in Section 4.2.7.2.

4.2.7. SILK Frame Contents
Each SILK frame includes a set of side information that encodes

o The frame type and quantization type (Section 4.2.7.3),

o Quantization gains (Section 4.2.7.4),

o Short-term prediction filter coefficients (Section 4.2.7.5),

o A Line Spectral Frequencies (LSFs) interpolation weight (Section 4.2.7.5.5),

o LTP filter lags and gains (Section 4.2.7.6), and

o A Linear Congruential Generator (LCG) seed (Section 4.2.7.7).

The quantized excitation signal (see Section 4.2.7.8) follows these at the end of the frame. Table 5 details the overall organization of a SILK frame.
+---------------------------+-------------------+-------------------+
| Symbol(s)                 | PDF(s)            | Condition         |
+---------------------------+-------------------+-------------------+
| Stereo Prediction Weights | Table 6           | Section 4.2.7.1   |
|                           |                   |                   |
| Mid-only Flag             | Table 8           | Section 4.2.7.2   |
|                           |                   |                   |
| Frame Type                | Section 4.2.7.3   |                   |
|                           |                   |                   |
| Subframe Gains            | Section 4.2.7.4   |                   |
|                           |                   |                   |
| Normalized LSF Stage-1    | Table 14          |                   |
| Index                     |                   |                   |
|                           |                   |                   |
| Normalized LSF Stage-2    | Section 4.2.7.5.2 |                   |
| Residual                  |                   |                   |
|                           |                   |                   |
| Normalized LSF            | Table 26          | 20 ms frame       |
| Interpolation Weight      |                   |                   |
|                           |                   |                   |
| Primary Pitch Lag         | Section 4.2.7.6.1 | Voiced frame      |
|                           |                   |                   |
| Subframe Pitch Contour    | Table 32          | Voiced frame      |
|                           |                   |                   |
| Periodicity Index         | Table 37          | Voiced frame      |
|                           |                   |                   |
| LTP Filter                | Table 38          | Voiced frame      |
|                           |                   |                   |
| LTP Scaling               | Table 42          | Section 4.2.7.6.3 |
|                           |                   |                   |
| LCG Seed                  | Table 43          |                   |
|                           |                   |                   |
| Excitation Rate Level     | Table 45          |                   |
|                           |                   |                   |
| Excitation Pulse Counts   | Table 46          |                   |
|                           |                   |                   |
| Excitation Pulse          | Section 4.2.7.8.3 | Non-zero pulse    |
| Locations                 |                   | count             |
|                           |                   |                   |
| Excitation LSBs           | Table 51          | Section 4.2.7.8.2 |
|                           |                   |                   |
| Excitation Signs          | Table 52          |                   |
+---------------------------+-------------------+-------------------+

Table 5: Order of the Symbols in an Individual SILK Frame
4.2.7.1. Stereo Prediction Weights
A SILK frame corresponding to the mid channel of a stereo Opus frame begins with a pair of side channel prediction weights, designed such that zeros indicate normal mid-side coupling. Since these weights can change on every frame, the first portion of each frame linearly interpolates between the previous weights and the current ones, using zeros for the previous weights if none are available. These prediction weights are never included in a mono Opus frame, and the previous weights are reset to zeros on any transition from mono to stereo. They are also not included in an LBRR frame for the side channel, even if the LBRR flags indicate the corresponding mid channel was not coded. In that case, the previous weights are used, again substituting in zeros if no previous weights are available since the last decoder reset (see Section 4.5.2).

To summarize, these weights are coded if and only if

o This is a stereo Opus frame (Section 3.1), and

o The current SILK frame corresponds to the mid channel.

The prediction weights are coded in three separate pieces, which are decoded by silk_stereo_decode_pred() (stereo_decode_pred.c). The first piece jointly codes the high-order part of a table index for both weights. The second piece codes the low-order part of each table index. The third piece codes an offset used to linearly interpolate between table indices. The details are as follows.

Let n be an index decoded with the 25-element stage-1 PDF in Table 6. Then, let i0 and i1 be indices decoded with the stage-2 and stage-3 PDFs in Table 6, respectively, and let i2 and i3 be two more indices decoded with the stage-2 and stage-3 PDFs, all in that order.
+-------+-----------------------------------------------------------+
| Stage | PDF                                                       |
+-------+-----------------------------------------------------------+
| Stage | {7, 2, 1, 1, 1, 10, 24, 8, 1, 1, 3, 23, 92, 23, 3, 1, 1,  |
| 1     | 8, 24, 10, 1, 1, 1, 2, 7}/256                             |
|       |                                                           |
| Stage | {85, 86, 85}/256                                          |
| 2     |                                                           |
|       |                                                           |
| Stage | {51, 51, 52, 51, 51}/256                                  |
| 3     |                                                           |
+-------+-----------------------------------------------------------+

Table 6: Stereo Weight PDFs
Then, use n, i0, and i2 to form two table indices, wi0 and wi1, according to

   wi0 = i0 + 3*(n/5)
   wi1 = i2 + 3*(n%5)

where the division is integer division. The range of these indices is 0 to 14, inclusive. Let w_Q13[i] be the i'th weight from Table 7. Then, the two prediction weights, w0_Q13 and w1_Q13, are

   w1_Q13 = w_Q13[wi1]
            + (((w_Q13[wi1+1] - w_Q13[wi1])*6554) >> 16)*(2*i3 + 1)

   w0_Q13 = w_Q13[wi0]
            + (((w_Q13[wi0+1] - w_Q13[wi0])*6554) >> 16)*(2*i1 + 1)
            - w1_Q13

N.B., w1_Q13 is computed first here, because w0_Q13 depends on it. The constant 6554 is approximately 0.1 in Q16. Although wi0 and wi1 only have 15 possible values, Table 7 contains 16 entries to allow interpolation between entry wi0 and (wi0 + 1) (and likewise for wi1).
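The arithmetic above can be sketched in C as follows. This is an illustrative sketch rather than the code of silk_stereo_decode_pred(): it assumes the five indices n, i0, i1, i2, and i3 have already been obtained from the range decoder, and the helper names interp_weight and stereo_pred_weights are hypothetical. The table values are those of Table 7.

```c
#include <assert.h>
#include <stdint.h>

/* Stereo weight table (Table 7), Q13. */
static const int16_t w_Q13[16] = {
    -13732, -10050, -8266, -7526, -6500, -5000, -2950, -820,
       820,   2950,  5000,  6500,  7526,  8266, 10050, 13732
};

/*
 * Interpolates between table entries wi and wi+1 in steps of roughly
 * one tenth of their difference (6554 is approximately 0.1 in Q16).
 * The table is strictly increasing, so the shifted value is never
 * negative.
 */
static int32_t interp_weight(int wi, int step)
{
    int32_t low = w_Q13[wi];
    int32_t tenth = ((w_Q13[wi + 1] - low) * 6554) >> 16;
    return low + tenth * (2 * step + 1);
}

/* Forms both prediction weights from already-decoded indices. */
static void stereo_pred_weights(int n, int i0, int i1, int i2, int i3,
                                int32_t *w0_Q13, int32_t *w1_Q13)
{
    int wi0, wi1;

    assert(n >= 0 && n < 25);
    assert(i0 >= 0 && i0 < 3 && i2 >= 0 && i2 < 3);
    assert(i1 >= 0 && i1 < 5 && i3 >= 0 && i3 < 5);

    wi0 = i0 + 3 * (n / 5);   /* integer division */
    wi1 = i2 + 3 * (n % 5);

    /* w1 is computed first because w0 depends on it. */
    *w1_Q13 = interp_weight(wi1, i3);
    *w0_Q13 = interp_weight(wi0, i1) - *w1_Q13;
}
```

With the centre value of every index (n = 12, i0 = i2 = 1, i1 = i3 = 2), both weights come out as zero, which is the case described earlier as normal mid-side coupling.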
+-------+--------------+
| Index | Weight (Q13) |
+-------+--------------+
|     0 |       -13732 |
|       |              |
|     1 |       -10050 |
|       |              |
|     2 |        -8266 |
|       |              |
|     3 |        -7526 |
|       |              |
|     4 |        -6500 |
|       |              |
|     5 |        -5000 |
|       |              |
|     6 |        -2950 |
|       |              |
|     7 |         -820 |
|       |              |
|     8 |          820 |
|       |              |
|     9 |         2950 |
|       |              |
|    10 |         5000 |
|       |              |
|    11 |         6500 |
|       |              |
|    12 |         7526 |
|       |              |
|    13 |         8266 |
|       |              |
|    14 |        10050 |
|       |              |
|    15 |        13732 |
+-------+--------------+

Table 7: Stereo Weight Table

4.2.7.2. Mid-Only Flag
A flag appears after the stereo prediction weights that indicates if only the mid channel is coded for this time interval. It appears only when

o This is a stereo Opus frame (see Section 3.1),

o The current SILK frame corresponds to the mid channel, and

o Either

   * This is a regular SILK frame where the VAD flags (see Section 4.2.3) indicate that the corresponding side channel is not active.

   * This is an LBRR frame where the LBRR flags (see Sections 4.2.3 and 4.2.4) indicate that the corresponding side channel is not coded.

It is omitted when there are no stereo weights, for all of the same reasons. It is also omitted for a regular SILK frame when the VAD flag of the corresponding side channel frame is set (indicating it is active). The side channel must be coded in this case, making the mid-only flag redundant. It is also omitted for an LBRR frame when the corresponding LBRR flags indicate the side channel is coded.

When the flag is present, the decoder reads a single value using the PDF in Table 8, as implemented in silk_stereo_decode_mid_only() (stereo_decode_pred.c). If the flag is set, then there is no corresponding SILK frame for the side channel, the entire decoding process for the side channel is skipped, and zeros are fed to the stereo unmixing process (see Section 4.2.8) instead. As stated above, LBRR frames still include this flag when the LBRR flag indicates that the side channel is not coded. In that case, if this flag is zero (indicating that there should be a side channel), then Packet Loss Concealment (PLC, see Section 4.4) SHOULD be invoked to recover a side channel signal. Otherwise, the stereo image will collapse.

+---------------+
| PDF           |
+---------------+
| {192, 64}/256 |
+---------------+

Table 8: Mid-only Flag PDF

4.2.7.3. Frame Type
Each SILK frame contains a single "frame type" symbol that jointly codes the signal type and quantization offset type of the corresponding frame. If the current frame is a regular SILK frame whose VAD bit was not set (an "inactive" frame), then the frame type symbol takes on a value of either 0 or 1 and is decoded using the first PDF in Table 9. If the frame is an LBRR frame or a regular SILK frame whose VAD flag was set (an "active" frame), then the value of the symbol may range from 2 to 5, inclusive, and is decoded using
the second PDF in Table 9. Table 10 translates between the value of the frame type symbol and the corresponding signal type and quantization offset type.

+----------+-----------------------------+
| VAD Flag | PDF                         |
+----------+-----------------------------+
| Inactive | {26, 230, 0, 0, 0, 0}/256   |
|          |                             |
| Active   | {0, 0, 24, 74, 148, 10}/256 |
+----------+-----------------------------+

Table 9: Frame Type PDFs

+------------+-------------+--------------------------+
| Frame Type | Signal Type | Quantization Offset Type |
+------------+-------------+--------------------------+
| 0          | Inactive    | Low                      |
|            |             |                          |
| 1          | Inactive    | High                     |
|            |             |                          |
| 2          | Unvoiced    | Low                      |
|            |             |                          |
| 3          | Unvoiced    | High                     |
|            |             |                          |
| 4          | Voiced      | Low                      |
|            |             |                          |
| 5          | Voiced      | High                     |
+------------+-------------+--------------------------+

Table 10: Signal Type and Quantization Offset Type from Frame Type

4.2.7.4. Subframe Gains
A separate quantization gain is coded for each 5 ms subframe. These gains control the step size between quantization levels of the excitation signal and, therefore, the quality of the reconstruction. They are independent of and unrelated to the pitch contours coded for voiced frames. The quantization gains are themselves uniformly quantized to 6 bits on a log scale, giving them a resolution of approximately 1.369 dB and a range of approximately 1.94 dB to 88.21 dB. The subframe gains are either coded independently, or relative to the gain from the most recent coded subframe in the same channel. Independent coding is used if and only if
o This is the first subframe in the current SILK frame, and

o Either

   * This is the first SILK frame of its type (LBRR or regular) for this channel in the current Opus frame, or

   * The previous SILK frame of the same type (LBRR or regular) for this channel in the same Opus frame was not coded.

In an independently coded subframe gain, the 3 most significant bits of the quantization gain are decoded using a PDF selected from Table 11 based on the decoded signal type (see Section 4.2.7.3).

+-------------+------------------------------------+
| Signal Type | PDF                                |
+-------------+------------------------------------+
| Inactive    | {32, 112, 68, 29, 12, 1, 1, 1}/256 |
|             |                                    |
| Unvoiced    | {2, 17, 45, 60, 62, 47, 19, 4}/256 |
|             |                                    |
| Voiced      | {1, 3, 26, 71, 94, 50, 9, 2}/256   |
+-------------+------------------------------------+

Table 11: PDFs for Independent Quantization Gain MSB Coding

The 3 least significant bits are decoded using a uniform PDF:

+--------------------------------------+
| PDF                                  |
+--------------------------------------+
| {32, 32, 32, 32, 32, 32, 32, 32}/256 |
+--------------------------------------+

Table 12: PDF for Independent Quantization Gain LSB Coding

These 6 bits are combined to form a value, gain_index, between 0 and 63. When the gain for the previous subframe is available, then the current gain is limited as follows:

   log_gain = max(gain_index, previous_log_gain - 16)

This may help some implementations limit the change in precision of their internal LTP history. The indices to which this clamp applies cannot simply be removed from the codebook, because previous_log_gain will not be available after packet loss. The clamping is skipped after a decoder reset, and in the side channel if the previous frame
in the side channel was not coded, since there is no value for previous_log_gain available. It MAY also be skipped after packet loss.

For subframes that do not have an independent gain (including the first subframe of frames not listed as using independent coding above), the quantization gain is coded relative to the gain from the previous subframe (in the same channel). The PDF in Table 13 yields a delta_gain_index value between 0 and 40, inclusive.

+-------------------------------------------------------------------+
| PDF                                                               |
+-------------------------------------------------------------------+
| {6, 5, 11, 31, 132, 21, 8, 4, 3, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, |
| 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,       |
| 1}/256                                                            |
+-------------------------------------------------------------------+

Table 13: PDF for Delta Quantization Gain Coding

The following formula translates this index into a quantization gain for the current subframe using the gain from the previous subframe:

   log_gain = clamp(0,
                    max(2*delta_gain_index - 16,
                        previous_log_gain + delta_gain_index - 4),
                    63)

silk_gains_dequant() (gain_quant.c) dequantizes log_gain for the k'th subframe and converts it into a linear Q16 scale factor via

   gain_Q16[k] = silk_log2lin((0x1D1C71*log_gain>>16) + 2090)

The function silk_log2lin() (log2lin.c) computes an approximation of 2**(inLog_Q7/128.0), where inLog_Q7 is its Q7 input. Let i = inLog_Q7>>7 be the integer part of inLog_Q7 and f = inLog_Q7&127 be the fractional part. Then,

   (1<<i) + ((-174*f*(128-f)>>16)+f)*((1<<i)>>7)

yields the approximate exponential. The final Q16 gain values lie between 81920 and 1686110208, inclusive (representing scale factors of 1.25 to 25728, respectively).
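The gain reconstruction described in this section can be sketched in C as follows. This is a minimal sketch under stated assumptions, not the code of silk_gains_dequant() or silk_log2lin(): the symbols (msb, lsb, delta_gain_index) are assumed to be already decoded, all function names are hypothetical, and a helper is used in place of ">> 16" on a negative value because C leaves that shift implementation-defined. Only the formula given above is implemented, so the sketch is intended for the input range reachable from log_gain values of 0 to 63.

```c
#include <assert.h>
#include <stdint.h>

/* Floor of x / 65536, avoiding a right shift of a negative value. */
static int32_t floor_shift16(int32_t x)
{
    return x >= 0 ? x >> 16 : -((-x + 65535) >> 16);
}

/* Approximates 2**(inLog_Q7/128.0) using the formula given above. */
static int32_t log2lin_sketch(int32_t inLog_Q7)
{
    int32_t i, f, base;

    assert(inLog_Q7 >= 0 && inLog_Q7 < 3968);  /* keeps i <= 30 */
    i = inLog_Q7 >> 7;     /* integer part */
    f = inLog_Q7 & 127;    /* fractional part */
    base = (int32_t)1 << i;
    return base + (floor_shift16(-174 * f * (128 - f)) + f) * (base >> 7);
}

/* Independently coded gain: msb from Table 11, lsb from Table 12. */
static int independent_log_gain(int msb, int lsb,
                                int have_previous, int previous_log_gain)
{
    int gain_index = (msb << 3) | lsb;

    if (have_previous && gain_index < previous_log_gain - 16)
        return previous_log_gain - 16;
    return gain_index;
}

/* Delta-coded gain: delta_gain_index from Table 13 (0 to 40). */
static int delta_log_gain(int delta_gain_index, int previous_log_gain)
{
    int a = 2 * delta_gain_index - 16;
    int b = previous_log_gain + delta_gain_index - 4;
    int v = a > b ? a : b;

    return v < 0 ? 0 : (v > 63 ? 63 : v);
}

/* Converts log_gain (0 to 63) into a linear Q16 scale factor. */
static int32_t gain_Q16_from_log_gain(int log_gain)
{
    assert(log_gain >= 0 && log_gain <= 63);
    return log2lin_sketch(((0x1D1C71 * log_gain) >> 16) + 2090);
}
```

Evaluating gain_Q16_from_log_gain() at 0 and 63 reproduces the endpoints 81920 and 1686110208 quoted above. Expressed as 20*log10 of the scale factors 1.25 and 25728, these correspond to the approximately 1.94 dB and 88.21 dB limits given at the start of this section, and dividing that span by the 63 steps gives the stated resolution of about 1.369 dB.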