4.2.7.6. Long-Term Prediction (LTP) Parameters
After the normalized LSF indices and, for 20 ms frames, the LSF interpolation index, voiced frames (see Section 4.2.7.3) include additional LTP parameters. There is one primary lag index for each SILK frame, but this is refined to produce a separate lag index per subframe using a vector quantizer. Each subframe also gets its own prediction gain coefficient.

4.2.7.6.1. Pitch Lags
The primary lag index is coded either relative to the primary lag of the prior frame in the same channel or as an absolute index. Absolute coding is used if and only if

o This is the first SILK frame of its type (LBRR or regular) for this channel in the current Opus frame,

o The previous SILK frame of the same type (LBRR or regular) for this channel in the same Opus frame was not coded, or

o That previous SILK frame was coded, but was not voiced (see Section 4.2.7.3).

With absolute coding, the primary pitch lag may range from 2 ms (inclusive) up to 18 ms (exclusive), corresponding to pitches from 500 Hz down to 55.6 Hz, respectively. It consists of a high part and a low part: the decoder first reads the high part using the 32-entry codebook in Table 29 and then the low part using the codebook corresponding to the current audio bandwidth from Table 30. The final primary pitch lag is then

   lag = lag_high*lag_scale + lag_low + lag_min

where lag_high is the high part, lag_low is the low part, and lag_scale and lag_min are the values from the "Scale" and "Minimum Lag" columns of Table 30, respectively.

+-----------------------------------------------------------------+
| PDF                                                             |
+-----------------------------------------------------------------+
| {3, 3, 6, 11, 21, 30, 32, 19, 11, 10, 12, 13, 13, 12, 11, 9, 8, |
| 7, 6, 4, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1}/256                |
+-----------------------------------------------------------------+

Table 29: PDF for High Part of Primary Pitch Lag

+------------+------------------------+-------+----------+----------+
| Audio      | PDF                    | Scale | Minimum  | Maximum  |
| Bandwidth  |                        |       | Lag      | Lag      |
+------------+------------------------+-------+----------+----------+
| NB         | {64, 64, 64, 64}/256   | 4     | 16       | 144      |
| MB         | {43, 42, 43, 43, 42,   | 6     | 24       | 216      |
|            | 43}/256                |       |          |          |
| WB         | {32, 32, 32, 32, 32,   | 8     | 32       | 288      |
|            | 32, 32, 32}/256        |       |          |          |
+------------+------------------------+-------+----------+----------+

Table 30: PDF for Low Part of Primary Pitch Lag

All frames that do not use absolute coding for the primary lag index use relative coding instead. The decoder reads a single delta value using the 21-entry PDF in Table 31. If the resulting value is zero, it falls back to the absolute coding procedure described above. Otherwise, the final primary pitch lag is then

   lag = previous_lag + (delta_lag_index - 9)

where previous_lag is the primary pitch lag from the most recent frame in the same channel and delta_lag_index is the value just decoded. This allows a per-frame change in the pitch lag of -8 to +11 samples. The decoder does no clamping at this point, so this value can fall outside the range of 2 ms to 18 ms, and the decoder must use this unclamped value when using relative coding in the next SILK frame (if any). However, because an Opus frame can use relative coding for at most two consecutive SILK frames, integer overflow should not be an issue.

+-----------------------------------------------------------------+
| PDF                                                             |
+-----------------------------------------------------------------+
| {46, 2, 2, 3, 4, 6, 10, 15, 26, 38, 30, 22, 15, 10, 7, 6, 4, 4, |
| 2, 2, 2}/256                                                    |
+-----------------------------------------------------------------+

Table 31: PDF for Primary Pitch Lag Change

After the primary pitch lag, a "pitch contour", stored as a single entry from one of four small VQ codebooks, gives lag offsets for each subframe in the current SILK frame. The codebook index is decoded using one of the PDFs in Table 32 depending on the current frame size and audio bandwidth. Tables 33 through 36 give the corresponding offsets to apply to the primary pitch lag for each subframe given the decoded codebook index.
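The primary-lag arithmetic described above (before any pitch contour is applied) can be sketched in C as follows. This is a minimal illustration rather than the reference implementation: the names LagParams, LAG_PARAMS, absolute_lag(), and relative_lag() are hypothetical, and the sketch assumes the symbols lag_high, lag_low, and delta_lag_index have already been read from the range decoder.

```c
#include <assert.h>

/* Hypothetical bandwidth indices used only by this sketch. */
enum { BW_NB = 0, BW_MB = 1, BW_WB = 2 };

/* "Scale" and "Minimum Lag" columns of Table 30. */
typedef struct { int lag_scale; int lag_min; } LagParams;

static const LagParams LAG_PARAMS[3] = {
    { 4, 16 },  /* NB */
    { 6, 24 },  /* MB */
    { 8, 32 },  /* WB */
};

/* Absolute coding: lag = lag_high*lag_scale + lag_low + lag_min. */
static int absolute_lag(int bandwidth, int lag_high, int lag_low)
{
    assert(bandwidth >= BW_NB && bandwidth <= BW_WB);
    assert(lag_high >= 0 && lag_high < 32);   /* 32-entry PDF, Table 29 */
    assert(lag_low >= 0 && lag_low < LAG_PARAMS[bandwidth].lag_scale);
    return lag_high * LAG_PARAMS[bandwidth].lag_scale
         + lag_low + LAG_PARAMS[bandwidth].lag_min;
}

/* Relative coding: only valid for delta_lag_index in 1..20.  A decoded
 * value of 0 means "fall back to absolute coding" and must be handled
 * by the caller.  No clamping is applied here, as the text requires. */
static int relative_lag(int previous_lag, int delta_lag_index)
{
    assert(delta_lag_index >= 1 && delta_lag_index <= 20);
    return previous_lag + (delta_lag_index - 9);
}
```

For example, the largest NB absolute lag is absolute_lag(BW_NB, 31, 3) = 143, one below the "Maximum Lag" of 144 in Table 30, which matches the exclusive 18 ms upper bound at 8 kHz.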
+-----------+--------+----------+-----------------------------------+
| Audio     | SILK   | Codebook | PDF                               |
| Bandwidth | Frame  | Size     |                                   |
|           | Size   |          |                                   |
+-----------+--------+----------+-----------------------------------+
| NB        | 10 ms  | 3        | {143, 50, 63}/256                 |
| NB        | 20 ms  | 11       | {68, 12, 21, 17, 19, 22, 30, 24,  |
|           |        |          | 17, 16, 10}/256                   |
| MB or WB  | 10 ms  | 12       | {91, 46, 39, 19, 14, 12, 8, 7, 6, |
|           |        |          | 5, 5, 4}/256                      |
| MB or WB  | 20 ms  | 34       | {33, 22, 18, 16, 15, 14, 14, 13,  |
|           |        |          | 13, 10, 9, 9, 8, 6, 6, 6, 5, 4,   |
|           |        |          | 4, 4, 3, 3, 3, 2, 2, 2, 2, 2, 2,  |
|           |        |          | 2, 1, 1, 1, 1}/256                |
+-----------+--------+----------+-----------------------------------+

Table 32: PDFs for Subframe Pitch Contour
+-------+------------------+
| Index | Subframe Offsets |
+-------+------------------+
| 0     |  0  0            |
| 1     |  1  0            |
| 2     |  0  1            |
+-------+------------------+

Table 33: Codebook Vectors for Subframe Pitch Contour: NB, 10 ms Frames

+-------+------------------+
| Index | Subframe Offsets |
+-------+------------------+
| 0     |  0  0  0  0      |
| 1     |  2  1  0 -1      |
| 2     | -1  0  1  2      |
| 3     | -1  0  0  1      |
| 4     | -1  0  0  0      |
| 5     |  0  0  0  1      |
| 6     |  0  0  1  1      |
| 7     |  1  1  0  0      |
| 8     |  1  0  0  0      |
| 9     |  0  0  0 -1      |
| 10    |  1  0  0 -1      |
+-------+------------------+

Table 34: Codebook Vectors for Subframe Pitch Contour: NB, 20 ms Frames
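As an illustration of how a contour entry is applied, the following C sketch combines Table 34 (NB, 20 ms frames) with the clamp to the Table 30 lag limits that this section specifies for the final per-subframe lag. It is a sketch under stated assumptions, not the reference code: the names NB_20MS_CONTOUR, clamp_int(), and nb_20ms_pitch_lags() are hypothetical, and only the NB, 20 ms case is covered.

```c
#include <assert.h>

/* Table 34: contour offsets for NB, 20 ms frames (4 subframes each). */
static const int NB_20MS_CONTOUR[11][4] = {
    {  0,  0,  0,  0 }, {  2,  1,  0, -1 }, { -1,  0,  1,  2 },
    { -1,  0,  0,  1 }, { -1,  0,  0,  0 }, {  0,  0,  0,  1 },
    {  0,  0,  1,  1 }, {  1,  1,  0,  0 }, {  1,  0,  0,  0 },
    {  0,  0,  0, -1 }, {  1,  0,  0, -1 },
};

/* NB limits from the "Minimum Lag" and "Maximum Lag" columns of Table 30. */
enum { NB_LAG_MIN = 16, NB_LAG_MAX = 144 };

/* clamp(lo, x, hi) with the argument order used in this document. */
static int clamp_int(int lo, int x, int hi)
{
    return x < lo ? lo : (x > hi ? hi : x);
}

/* Apply the selected contour to the primary lag, then clamp each
 * subframe lag to the NB range. */
static void nb_20ms_pitch_lags(int lag, int contour_index, int pitch_lags[4])
{
    assert(contour_index >= 0 && contour_index < 11);
    for (int k = 0; k < 4; k++) {
        pitch_lags[k] = clamp_int(NB_LAG_MIN,
                                  lag + NB_20MS_CONTOUR[contour_index][k],
                                  NB_LAG_MAX);
    }
}
```

With a primary lag of 80 and contour index 1, this yields subframe lags of 82, 81, 80, and 79. The clamp only matters near the ends of the range, including the case where an unclamped relative-coded primary lag has drifted outside it.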
+-------+------------------+
| Index | Subframe Offsets |
+-------+------------------+
| 0     |  0  0            |
| 1     |  0  1            |
| 2     |  1  0            |
| 3     | -1  1            |
| 4     |  1 -1            |
| 5     | -1  2            |
| 6     |  2 -1            |
| 7     | -2  2            |
| 8     |  2 -2            |
| 9     | -2  3            |
| 10    |  3 -2            |
| 11    | -3  3            |
+-------+------------------+

Table 35: Codebook Vectors for Subframe Pitch Contour: MB or WB, 10 ms Frames

+-------+------------------+
| Index | Subframe Offsets |
+-------+------------------+
| 0     |  0  0  0  0      |
| 1     |  0  0  1  1      |
| 2     |  1  1  0  0      |
| 3     | -1  0  0  0      |
| 4     |  0  0  0  1      |
| 5     |  1  0  0  0      |
| 6     | -1  0  0  1      |
| 7     |  0  0  0 -1      |
| 8     | -1  0  1  2      |
| 9     |  1  0  0 -1      |
| 10    | -2 -1  1  2      |
| 11    |  2  1  0 -1      |
| 12    | -2  0  0  2      |
| 13    | -2  0  1  3      |
| 14    |  2  1 -1 -2      |
| 15    | -3 -1  1  3      |
| 16    |  2  0  0 -2      |
| 17    |  3  1  0 -2      |
| 18    | -3 -1  2  4      |
| 19    | -4 -1  1  4      |
| 20    |  3  1 -1 -3      |
| 21    | -4 -1  2  5      |
| 22    |  4  2 -1 -3      |
| 23    |  4  1 -1 -4      |
| 24    | -5 -1  2  6      |
| 25    |  5  2 -1 -4      |
| 26    | -6 -2  2  6      |
| 27    | -5 -2  2  5      |
| 28    |  6  2 -1 -5      |
| 29    | -7 -2  3  8      |
| 30    |  6  2 -2 -6      |
| 31    |  5  2 -2 -5      |
| 32    |  8  3 -2 -7      |
| 33    | -9 -3  3  9      |
+-------+------------------+

Table 36: Codebook Vectors for Subframe Pitch Contour: MB or WB, 20 ms Frames

The final pitch lag for each subframe is assembled in silk_decode_pitch() (decode_pitch.c). Let lag be the primary pitch lag for the current SILK frame, contour_index be the index of the VQ codebook entry, and lag_cb[contour_index][k] be the corresponding entry of the codebook from the appropriate table given above for the k'th subframe. Then the final pitch lag for that subframe is

   pitch_lags[k] = clamp(lag_min, lag + lag_cb[contour_index][k], lag_max)

where lag_min and lag_max are the values from the "Minimum Lag" and "Maximum Lag" columns of Table 30, respectively.

4.2.7.6.2. LTP Filter Coefficients
SILK uses a separate 5-tap pitch filter for each subframe, selected from one of three codebooks. The three codebooks each represent different rate-distortion trade-offs, with average rates of 1.61 bits/subframe, 3.68 bits/subframe, and 4.85 bits/subframe, respectively. The importance of the filter coefficients generally depends on two factors: the periodicity of the signal and relative energy between the current subframe and the signal from one period earlier. Greater periodicity and decaying energy both lead to more important filter coefficients. Thus, they should be coded with lower distortion and higher rate. These properties are relatively stable over the duration of a single SILK frame. Hence, all of the subframes in a SILK frame choose their filter from the same codebook. This is signaled with an explicitly-coded "periodicity index". This immediately follows the subframe pitch lags, and is coded using the 3-entry PDF from Table 37.
+------------------+
| PDF              |
+------------------+
| {77, 80, 99}/256 |
+------------------+

Table 37: Periodicity Index PDF

The indices of the filters for each subframe follow. They are all coded using the PDF from Table 38 corresponding to the periodicity index. Tables 39 through 41 contain the corresponding filter taps as signed Q7 integers.

+-------------+----------+------------------------------------------+
| Periodicity | Codebook | PDF                                      |
| Index       | Size     |                                          |
+-------------+----------+------------------------------------------+
| 0           | 8        | {185, 15, 13, 13, 9, 9, 6, 6}/256        |
| 1           | 16       | {57, 34, 21, 20, 15, 13, 12, 13, 10, 10, |
|             |          | 9, 10, 9, 8, 7, 8}/256                   |
| 2           | 32       | {15, 16, 14, 12, 12, 12, 11, 11, 11, 10, |
|             |          | 9, 9, 9, 9, 8, 8, 8, 8, 7, 7, 6, 6, 5,   |
|             |          | 4, 5, 4, 4, 4, 3, 4, 3, 2}/256           |
+-------------+----------+------------------------------------------+

Table 38: LTP Filter PDFs
+-------+---------------------+
| Index | Filter Taps (Q7)    |
+-------+---------------------+
| 0     |   4   6  24   7   5 |
| 1     |   0   0   2   0   0 |
| 2     |  12  28  41  13  -4 |
| 3     |  -9  15  42  25  14 |
| 4     |   1  -2  62  41  -9 |
| 5     | -10  37  65  -4   3 |
| 6     |  -6   4  66   7  -8 |
| 7     |  16  14  38  -3  33 |
+-------+---------------------+

Table 39: Codebook Vectors for LTP Filter, Periodicity Index 0
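Because the taps are stored as signed Q7 integers, each value represents the tap divided by 128. The following C sketch shows that conversion for the periodicity index 0 codebook of Table 39. The names LTP_FILTER_CB0_Q7 and ltp_taps_to_float() are hypothetical and used only for illustration; a fixed-point decoder would normally keep the Q7 values as they are.

```c
#include <assert.h>

/* Table 39: 5-tap LTP filters for periodicity index 0, in Q7. */
static const signed char LTP_FILTER_CB0_Q7[8][5] = {
    {   4,   6,  24,   7,   5 },
    {   0,   0,   2,   0,   0 },
    {  12,  28,  41,  13,  -4 },
    {  -9,  15,  42,  25,  14 },
    {   1,  -2,  62,  41,  -9 },
    { -10,  37,  65,  -4,   3 },
    {  -6,   4,  66,   7,  -8 },
    {  16,  14,  38,  -3,  33 },
};

/* Convert one Q7 filter (7 fractional bits) to floating point. */
static void ltp_taps_to_float(const signed char taps_Q7[5], float taps[5])
{
    for (int i = 0; i < 5; i++) {
        taps[i] = (float)taps_Q7[i] / 128.0f;
    }
}
```

For instance, the center tap of filter index 4 is 62 in Q7, or 62/128 = 0.484375.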
+-------+---------------------+
| Index | Filter Taps (Q7)    |
+-------+---------------------+
| 0     |  13  22  39  23  12 |
| 1     |  -1  36  64  27  -6 |
| 2     |  -7  10  55  43  17 |
| 3     |   1   1   8   1   1 |
| 4     |   6 -11  74  53  -9 |
| 5     | -12  55  76 -12   8 |
| 6     |  -3   3  93  27  -4 |
| 7     |  26  39  59   3  -8 |
| 8     |   2   0  77  11   9 |
| 9     |  -8  22  44  -6   7 |
| 10    |  40   9  26   3   9 |
| 11    |  -7  20 101  -7   4 |
| 12    |   3  -8  42  26   0 |
| 13    | -15  33  68   2  23 |
| 14    |  -2  55  46  -2  15 |
| 15    |   3  -1  21  16  41 |
+-------+---------------------+

Table 40: Codebook Vectors for LTP Filter, Periodicity Index 1

+-------+---------------------+
| Index | Filter Taps (Q7)    |
+-------+---------------------+
| 0     |  -6  27  61  39   5 |
| 1     | -11  42  88   4   1 |
| 2     |  -2  60  65   6  -4 |
| 3     |  -1  -5  73  56   1 |
| 4     |  -9  19  94  29  -9 |
| 5     |   0  12  99   6   4 |
| 6     |   8 -19 102  46 -13 |
| 7     |   3   2  13   3   2 |
| 8     |   9 -21  84  72 -18 |
| 9     | -11  46 104 -22   8 |
| 10    |  18  38  48  23   0 |
| 11    | -16  70  83 -21  11 |
| 12    |   5 -11 117  22  -8 |
| 13    |  -6  23 117 -12   3 |
| 14    |   3  -8  95  28   4 |
| 15    | -10  15  77  60 -15 |
| 16    |  -1   4 124   2  -4 |
| 17    |   3  38  84  24 -25 |
| 18    |   2  13  42  13  31 |
| 19    |  21  -4  56  46  -1 |
| 20    |  -1  35  79 -13  19 |
| 21    |  -7  65  88  -9 -14 |
| 22    |  20   4  81  49 -29 |
| 23    |  20   0  75   3 -17 |
| 24    |   5  -9  44  92  -8 |
| 25    |   1  -3  22  69  31 |
| 26    |  -6  95  41 -12   5 |
| 27    |  39  67  16  -4   1 |
| 28    |   0  -6 120  55 -36 |
| 29    | -13  44 122   4 -24 |
| 30    |  81   5  11   3   7 |
| 31    |   2   0   9  10  88 |
+-------+---------------------+

Table 41: Codebook Vectors for LTP Filter, Periodicity Index 2

4.2.7.6.3. LTP Scaling Parameter
An LTP scaling parameter appears after the LTP filter coefficients if and only if

o This is a voiced frame (see Section 4.2.7.3), and

o Either

   * This SILK frame corresponds to the first time interval of the current Opus frame for its type (LBRR or regular), or

   * This is an LBRR frame where the LBRR flags (see Section 4.2.4) indicate the previous LBRR frame in the same channel is not coded.

This allows the encoder to trade off the prediction gain between packets against the recovery time after packet loss. Unlike absolute-coding for pitch lags, regular SILK frames that are not at the start of an Opus frame (i.e., that do not correspond to the first 20 ms time interval in Opus frames of 40 or 60 ms) do not include this field, even if the prior frame was not voiced, or (in the case of the side channel) not even coded. After an uncoded frame in the side channel, the LTP buffer (see Section 4.2.7.9.1) is cleared to zero, and is thus in a known state. In contrast, LBRR frames do include this field when the prior frame was not coded, since the LTP buffer contains the output of the PLC, which is non-normative.

If present, the decoder reads a value using the 3-entry PDF in Table 42. The three possible values represent Q14 scale factors of 15565, 12288, and 8192, respectively (corresponding to approximately 0.95, 0.75, and 0.5). Frames that do not code the scaling parameter use the default factor of 15565 (approximately 0.95).
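The presence rule and the index-to-factor mapping above can be sketched in C as follows. This is an illustrative sketch only: the function and parameter names are hypothetical, and the caller is assumed to have already derived the four boolean conditions from the frame header and LBRR flags.

```c
#include <assert.h>
#include <stdbool.h>

/* Q14 scale factors selected by the decoded 3-entry symbol. */
static const int LTP_SCALE_Q14[3] = { 15565, 12288, 8192 };

/* True if the bitstream carries an LTP scaling parameter for this frame.
 * first_interval_for_type: frame covers the first time interval of the
 *   Opus frame for its type (LBRR or regular).
 * prev_lbrr_coded: only meaningful when is_lbrr is true. */
static bool ltp_scaling_is_coded(bool voiced, bool first_interval_for_type,
                                 bool is_lbrr, bool prev_lbrr_coded)
{
    return voiced &&
           (first_interval_for_type || (is_lbrr && !prev_lbrr_coded));
}

/* Scale factor to use: the decoded entry if coded, else the default. */
static int ltp_scale_Q14(bool coded, int index)
{
    if (!coded) {
        return LTP_SCALE_Q14[0];  /* default, approximately 0.95 */
    }
    assert(index >= 0 && index < 3);
    return LTP_SCALE_Q14[index];
}
```

Note that a voiced regular frame that is not first in its Opus frame yields false from ltp_scaling_is_coded() even when the prior frame was not coded, whereas the corresponding LBRR frame yields true.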
+-------------------+
| PDF               |
+-------------------+
| {128, 64, 64}/256 |
+-------------------+

Table 42: PDF for LTP Scaling Parameter

4.2.7.7. Linear Congruential Generator (LCG) Seed
As described in Section 4.2.7.8.6, SILK uses a Linear Congruential Generator (LCG) to inject pseudorandom noise into the quantized excitation. To ensure synchronization of this process between the encoder and decoder, each SILK frame stores a 2-bit seed after the LTP parameters (if any). The encoder may consider the choice of seed during quantization, and the flexibility of this choice lets it reduce distortion, helping to pay for the bit cost required to signal it. The decoder reads the seed using the uniform 4-entry PDF in Table 43, yielding a value between 0 and 3, inclusive.

+----------------------+
| PDF                  |
+----------------------+
| {64, 64, 64, 64}/256 |
+----------------------+

Table 43: PDF for LCG Seed

4.2.7.8. Excitation
SILK codes the excitation using a modified version of the Pyramid Vector Quantizer (PVQ) codebook [PVQ]. The PVQ codebook is designed for Laplace-distributed values and consists of all sums of K signed, unit pulses in a vector of dimension N, where two pulses at the same position are required to have the same sign. Thus, the codebook includes all integer codevectors y of dimension N that satisfy

   N-1
   __
   \  abs(y[j]) = K
   /_
   j=0

Unlike regular PVQ, SILK uses a variable-length, rather than fixed-length, encoding. This encoding is better suited to the more Gaussian-like distribution of the coefficient magnitudes and the non-uniform distribution of their signs (caused by the quantization offset described below). SILK also handles large codebooks by coding the least significant bits (LSBs) of each coefficient directly. This adds a small coding efficiency loss, but greatly reduces the computation time and ROM size required for decoding, as implemented in silk_decode_pulses() (decode_pulses.c).

SILK fixes the dimension of the codebook to N = 16. The excitation is made up of a number of "shell blocks", each 16 samples in size. Table 44 lists the number of shell blocks required for a SILK frame for each possible audio bandwidth and frame size. 10 ms MB frames nominally contain 120 samples (10 ms at 12 kHz), which is not a multiple of 16. This is handled by coding 8 shell blocks (128 samples) and discarding the final 8 samples of the last block. The decoder contains no special case that prevents an encoder from placing pulses in these samples, and they must be correctly parsed from the bitstream if present, but they are otherwise ignored.

+-----------------+------------+------------------------+
| Audio Bandwidth | Frame Size | Number of Shell Blocks |
+-----------------+------------+------------------------+
| NB              | 10 ms      | 5                      |
| MB              | 10 ms      | 8                      |
| WB              | 10 ms      | 10                     |
| NB              | 20 ms      | 10                     |
| MB              | 20 ms      | 15                     |
| WB              | 20 ms      | 20                     |
+-----------------+------------+------------------------+

Table 44: Number of Shell Blocks Per SILK Frame

4.2.7.8.1. Rate Level
The first symbol in the excitation is a "rate level", which is an index from 0 to 8, inclusive, coded using the PDF in Table 45 corresponding to the signal type of the current frame (from Section 4.2.7.3). The rate level selects the PDF used to decode the number of pulses in the individual shell blocks. It does not directly convey any information about the bitrate or the number of pulses itself, but merely changes the probability of the symbols in Section 4.2.7.8.2. Level 0 provides a more efficient encoding at low rates generally, and level 8 provides a more efficient encoding at high rates generally, though the most efficient level for a particular SILK frame may depend on the exact distribution of the coded symbols. An encoder should, but is not required to, use the most efficient rate level.

+----------------------+------------------------------------------+
| Signal Type          | PDF                                      |
+----------------------+------------------------------------------+
| Inactive or Unvoiced | {15, 51, 12, 46, 45, 13, 33, 27, 14}/256 |
| Voiced               | {33, 30, 36, 17, 34, 49, 18, 21, 18}/256 |
+----------------------+------------------------------------------+

Table 45: PDFs for the Rate Level

4.2.7.8.2. Pulses per Shell Block
The total number of pulses in each of the shell blocks follows the rate level. The pulse counts for all of the shell blocks are coded consecutively, before the content of any of the blocks. Each block may have anywhere from 0 to 16 pulses, inclusive, coded using the 18-entry PDF in Table 46 corresponding to the rate level from Section 4.2.7.8.1. The special value 17 indicates that this block has one or more additional LSBs to decode for each coefficient. If the decoder encounters this value, it decodes another value for the actual pulse count of the block, but uses the PDF corresponding to the special rate level 9 instead of the normal rate level. This process repeats until the decoder reads a value less than 17, and it then sets the number of extra LSBs used to the number of 17's decoded for that block. If it reads the value 17 ten times, then the next iteration uses the special rate level 10 instead of 9. The probability of decoding a 17 when using the PDF for rate level 10 is zero, ensuring that the number of LSBs for a block will not exceed 10. The cumulative distribution for rate level 10 is just a shifted version of that for 9 and thus does not require any additional storage.
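The escape mechanism described above can be sketched in C as follows. This is a minimal sketch, not the reference decoder: SymbolSource and read_pulse_symbol() are hypothetical stand-ins that replay pre-decoded symbols, whereas a real decoder would read each symbol from the range decoder using the Table 46 PDF for the requested rate level.

```c
#include <assert.h>

/* Hypothetical stand-in for the range decoder: replays symbols and
 * records which rate level was requested for the last one. */
typedef struct {
    const int *symbols;
    int pos;
    int last_level;
} SymbolSource;

static int read_pulse_symbol(SymbolSource *src, int rate_level)
{
    src->last_level = rate_level;  /* a real decoder selects the PDF here */
    return src->symbols[src->pos++];
}

/* Decode the pulse count for one shell block.  Returns the count
 * (0..16) and stores the number of extra LSBs (0..10) in *lsb_count. */
static int decode_pulse_count(SymbolSource *src, int rate_level,
                              int *lsb_count)
{
    int lsbs = 0;
    int value = read_pulse_symbol(src, rate_level);
    while (value == 17) {
        lsbs++;
        /* After ten 17's, rate level 10 is used; it cannot produce 17. */
        value = read_pulse_symbol(src, lsbs == 10 ? 10 : 9);
    }
    assert(value >= 0 && value <= 16);
    *lsb_count = lsbs;
    return value;
}
```

For the symbol sequence 17, 17, 5, the sketch returns a pulse count of 5 with 2 extra LSBs; the first symbol is read with the frame's rate level and the next two with rate level 9.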
+----------+--------------------------------------------------------+
| Rate     | PDF                                                    |
| Level    |                                                        |
+----------+--------------------------------------------------------+
| 0        | {131, 74, 25, 8, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,   |
|          | 1, 1}/256                                              |
| 1        | {58, 93, 60, 23, 7, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,   |
|          | 1, 1}/256                                              |
| 2        | {43, 51, 46, 33, 24, 16, 11, 8, 6, 3, 3, 3, 2, 1, 1,   |
|          | 2, 1, 2}/256                                           |
| 3        | {17, 52, 71, 57, 31, 12, 5, 1, 1, 1, 1, 1, 1, 1, 1, 1, |
|          | 1, 1}/256                                              |
| 4        | {6, 21, 41, 53, 49, 35, 21, 11, 6, 3, 2, 2, 1, 1, 1,   |
|          | 1, 1, 1}/256                                           |
| 5        | {7, 14, 22, 28, 29, 28, 25, 20, 17, 13, 11, 9, 7, 5,   |
|          | 4, 4, 3, 10}/256                                       |
| 6        | {2, 5, 14, 29, 42, 46, 41, 31, 19, 11, 6, 3, 2, 1, 1,  |
|          | 1, 1, 1}/256                                           |
| 7        | {1, 2, 4, 10, 19, 29, 35, 37, 34, 28, 20, 14, 8, 5, 4, |
|          | 2, 2, 2}/256                                           |
| 8        | {1, 2, 2, 5, 9, 14, 20, 24, 27, 28, 26, 23, 20, 15,    |
|          | 11, 8, 6, 15}/256                                      |
| 9        | {1, 1, 1, 6, 27, 58, 56, 39, 25, 14, 10, 6, 3, 3, 2,   |
|          | 1, 1, 2}/256                                           |
| 10       | {2, 1, 6, 27, 58, 56, 39, 25, 14, 10, 6, 3, 3, 2, 1,   |
|          | 1, 2, 0}/256                                           |
+----------+--------------------------------------------------------+

Table 46: PDFs for the Pulse Count

4.2.7.8.3. Pulse Location Decoding
The locations of the pulses in each shell block follow the pulse counts, as decoded by silk_shell_decoder() (shell_coder.c). As with the pulse counts, these locations are coded for all the shell blocks before any of the remaining information for each block. Unlike many other codecs, SILK places no restriction on the distribution of
pulses within a shell block. All of the pulses may be placed in a single location, or each one in a unique location, or anything in between. The location of pulses is coded by recursively partitioning each block into halves, and coding how many pulses fall on the left side of the split. All remaining pulses must fall on the right side of the split. The process then recurses into the left half, and after that returns, the right half (preorder traversal). The PDF to use is chosen by the size of the current partition (16, 8, 4, or 2) and the number of pulses in the partition (1 to 16, inclusive). Tables 47 through 50 list the PDFs used for each partition size and pulse count. This process skips partitions without any pulses, i.e., where the initial pulse count from Section 4.2.7.8.2 was zero, or where the split in the prior level indicated that all of the pulses fell on the other side. These partitions have nothing to code, so they require no PDF.
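The recursive partitioning described above can be sketched in C as follows. This is an illustration under stated assumptions rather than the code in shell_coder.c: SymbolSource and read_split_symbol() are hypothetical stand-ins that replay pre-decoded "pulses in the left half" symbols, where a real decoder would pick the PDF from Tables 47 through 50 using the partition size and pulse count.

```c
#include <assert.h>

/* Hypothetical stand-in for the range decoder: replays split symbols. */
typedef struct {
    const int *symbols;
    int pos;
} SymbolSource;

static int read_split_symbol(SymbolSource *src, int size, int count)
{
    (void)size;   /* a real decoder selects the PDF table by size ...  */
    (void)count;  /* ... and the PDF within the table by pulse count   */
    return src->symbols[src->pos++];
}

/* Distribute count pulses over pulses[0..size-1], size in {16,8,4,2,1}.
 * The split for a partition is read first, then the left half is
 * decoded, then the right half (preorder traversal). */
static void decode_split(SymbolSource *src, int *pulses, int size, int count)
{
    if (count == 0) {
        /* Empty partitions code nothing; all of their samples are zero. */
        for (int i = 0; i < size; i++) {
            pulses[i] = 0;
        }
        return;
    }
    if (size == 1) {
        pulses[0] = count;  /* all remaining pulses land on this sample */
        return;
    }
    int left = read_split_symbol(src, size, count);
    assert(left >= 0 && left <= count);
    decode_split(src, pulses, size / 2, left);
    decode_split(src, pulses + size / 2, size / 2, count - left);
}
```

A 16-sample block is decoded with decode_split(&src, pulses, 16, count), where count is the pulse count from Section 4.2.7.8.2. For a block with 3 pulses and the split symbols 2, 0, 1, 1, 0, 1, 0, 1, the sketch consumes exactly eight symbols and places single pulses at positions 4, 7, and 10.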
+------------+------------------------------------------------------+
| Pulse      | PDF                                                  |
| Count      |                                                      |
+------------+------------------------------------------------------+
| 1          | {126, 130}/256                                       |
| 2          | {56, 142, 58}/256                                    |
| 3          | {25, 101, 104, 26}/256                               |
| 4          | {12, 60, 108, 64, 12}/256                            |
| 5          | {7, 35, 84, 87, 37, 6}/256                           |
| 6          | {4, 20, 59, 86, 63, 21, 3}/256                       |
| 7          | {3, 12, 38, 72, 75, 42, 12, 2}/256                   |
| 8          | {2, 8, 25, 54, 73, 59, 27, 7, 1}/256                 |
| 9          | {2, 5, 17, 39, 63, 65, 42, 18, 4, 1}/256             |
| 10         | {1, 4, 12, 28, 49, 63, 54, 30, 11, 3, 1}/256         |
| 11         | {1, 4, 8, 20, 37, 55, 57, 41, 22, 8, 2, 1}/256       |
| 12         | {1, 3, 7, 15, 28, 44, 53, 48, 33, 16, 6, 1, 1}/256   |
| 13         | {1, 2, 6, 12, 21, 35, 47, 48, 40, 25, 12, 5, 1,      |
|            | 1}/256                                               |
| 14         | {1, 1, 4, 10, 17, 27, 37, 47, 43, 33, 21, 9, 4, 1,   |
|            | 1}/256                                               |
| 15         | {1, 1, 1, 8, 14, 22, 33, 40, 43, 38, 28, 16, 8, 1,   |
|            | 1, 1}/256                                            |
| 16         | {1, 1, 1, 1, 13, 18, 27, 36, 41, 41, 34, 24, 14, 1,  |
|            | 1, 1, 1}/256                                         |
+------------+------------------------------------------------------+

Table 47: PDFs for Pulse Count Split, 16 Sample Partitions

+------------+------------------------------------------------------+
| Pulse      | PDF                                                  |
| Count      |                                                      |
+------------+------------------------------------------------------+
| 1          | {127, 129}/256                                       |
| 2          | {53, 149, 54}/256                                    |
| 3          | {22, 105, 106, 23}/256                               |
| 4          | {11, 61, 111, 63, 10}/256                            |
| 5          | {6, 35, 86, 88, 36, 5}/256                           |
| 6          | {4, 20, 59, 87, 62, 21, 3}/256                       |
| 7          | {3, 13, 40, 71, 73, 41, 13, 2}/256                   |
| 8          | {3, 9, 27, 53, 70, 56, 28, 9, 1}/256                 |
| 9          | {3, 8, 19, 37, 57, 61, 44, 20, 6, 1}/256             |
| 10         | {3, 7, 15, 28, 44, 54, 49, 33, 17, 5, 1}/256         |
| 11         | {1, 7, 13, 22, 34, 46, 48, 38, 28, 14, 4, 1}/256     |
| 12         | {1, 1, 11, 22, 27, 35, 42, 47, 33, 25, 10, 1, 1}/256 |
| 13         | {1, 1, 6, 14, 26, 37, 43, 43, 37, 26, 14, 6, 1,      |
|            | 1}/256                                               |
| 14         | {1, 1, 4, 10, 20, 31, 40, 42, 40, 31, 20, 10, 4, 1,  |
|            | 1}/256                                               |
| 15         | {1, 1, 3, 8, 16, 26, 35, 38, 38, 35, 26, 16, 8, 3,   |
|            | 1, 1}/256                                            |
| 16         | {1, 1, 2, 6, 12, 21, 30, 36, 38, 36, 30, 21, 12, 6,  |
|            | 2, 1, 1}/256                                         |
+------------+------------------------------------------------------+

Table 48: PDFs for Pulse Count Split, 8 Sample Partitions

+------------+------------------------------------------------------+
| Pulse      | PDF                                                  |
| Count      |                                                      |
+------------+------------------------------------------------------+
| 1          | {127, 129}/256                                       |
| 2          | {49, 157, 50}/256                                    |
| 3          | {20, 107, 109, 20}/256                               |
| 4          | {11, 60, 113, 62, 10}/256                            |
| 5          | {7, 36, 84, 87, 36, 6}/256                           |
| 6          | {6, 24, 57, 82, 60, 23, 4}/256                       |
| 7          | {5, 18, 39, 64, 68, 42, 16, 4}/256                   |
| 8          | {6, 14, 29, 47, 61, 52, 30, 14, 3}/256               |
| 9          | {1, 15, 23, 35, 51, 50, 40, 30, 10, 1}/256           |
| 10         | {1, 1, 21, 32, 42, 52, 46, 41, 18, 1, 1}/256         |
| 11         | {1, 6, 16, 27, 36, 42, 42, 36, 27, 16, 6, 1}/256     |
| 12         | {1, 5, 12, 21, 31, 38, 40, 38, 31, 21, 12, 5, 1}/256 |
| 13         | {1, 3, 9, 17, 26, 34, 38, 38, 34, 26, 17, 9, 3,      |
|            | 1}/256                                               |
| 14         | {1, 3, 7, 14, 22, 29, 34, 36, 34, 29, 22, 14, 7, 3,  |
|            | 1}/256                                               |
| 15         | {1, 2, 5, 11, 18, 25, 31, 35, 35, 31, 25, 18, 11, 5, |
|            | 2, 1}/256                                            |
| 16         | {1, 1, 4, 9, 15, 21, 28, 32, 34, 32, 28, 21, 15, 9,  |
|            | 4, 1, 1}/256                                         |
+------------+------------------------------------------------------+

Table 49: PDFs for Pulse Count Split, 4 Sample Partitions

+------------+------------------------------------------------------+
| Pulse      | PDF                                                  |
| Count      |                                                      |
+------------+------------------------------------------------------+
| 1          | {128, 128}/256                                       |
| 2          | {42, 172, 42}/256                                    |
| 3          | {21, 107, 107, 21}/256                               |
| 4          | {12, 60, 112, 61, 11}/256                            |
| 5          | {8, 34, 86, 86, 35, 7}/256                           |
| 6          | {8, 23, 55, 90, 55, 20, 5}/256                       |
| 7          | {5, 15, 38, 72, 72, 36, 15, 3}/256                   |
| 8          | {6, 12, 27, 52, 77, 47, 20, 10, 5}/256               |
| 9          | {6, 19, 28, 35, 40, 40, 35, 28, 19, 6}/256           |
| 10         | {4, 14, 22, 31, 37, 40, 37, 31, 22, 14, 4}/256       |
| 11         | {3, 10, 18, 26, 33, 38, 38, 33, 26, 18, 10, 3}/256   |
| 12         | {2, 8, 13, 21, 29, 36, 38, 36, 29, 21, 13, 8, 2}/256 |
| 13         | {1, 5, 10, 17, 25, 32, 38, 38, 32, 25, 17, 10, 5,    |
|            | 1}/256                                               |
| 14         | {1, 4, 7, 13, 21, 29, 35, 36, 35, 29, 21, 13, 7, 4,  |
|            | 1}/256                                               |
| 15         | {1, 2, 5, 10, 17, 25, 32, 36, 36, 32, 25, 17, 10, 5, |
|            | 2, 1}/256                                            |
| 16         | {1, 2, 4, 7, 13, 21, 28, 34, 36, 34, 28, 21, 13, 7,  |
|            | 4, 2, 1}/256                                         |
+------------+------------------------------------------------------+

Table 50: PDFs for Pulse Count Split, 2 Sample Partitions

4.2.7.8.4. LSB Decoding
After the decoder reads the pulse locations for all blocks, it reads the LSBs (if any) for each block in turn. Inside each block, it reads all the LSBs for each coefficient in turn, even those where no pulses were allocated, before proceeding to the next one. For 10 ms MB frames, it reads LSBs even for the extra 8 samples in the last block. The LSBs are coded from most significant to least significant, and they all use the PDF in Table 51.

+----------------+
| PDF            |
+----------------+
| {136, 120}/256 |
+----------------+

Table 51: PDF for Excitation LSBs

The number of LSBs read for each coefficient in a block is determined in Section 4.2.7.8.2. The magnitude of the coefficient is initially equal to the number of pulses placed at that location in Section 4.2.7.8.3. As each LSB is decoded, the magnitude is doubled, and then the value of the LSB added to it, to obtain an updated magnitude.

4.2.7.8.5. Sign Decoding
After decoding the pulse locations and the LSBs, the decoder knows the magnitude of each coefficient in the excitation. It then decodes a sign for all coefficients with a non-zero magnitude, using one of the PDFs from Table 52. If the value decoded is 0, then the coefficient magnitude is negated. Otherwise, it remains positive.

The decoder chooses the PDF for the sign based on the signal type and quantization offset type (from Section 4.2.7.3) and the number of pulses in the block (from Section 4.2.7.8.2). The number of pulses in the block does not take into account any LSBs. Most PDFs are skewed towards negative signs because of the quantization offset, but the PDFs for zero pulses are highly skewed towards positive signs. If a block contains many positive coefficients, it is sometimes beneficial to code it solely using LSBs (i.e., with zero pulses), since the encoder may be able to save enough bits on the signs to justify the less efficient coefficient magnitude encoding.

+-------------+-----------------------+-------------+----------------+
| Signal Type | Quantization Offset   | Pulse Count | PDF            |
|             | Type                  |             |                |
+-------------+-----------------------+-------------+----------------+
| Inactive    | Low                   | 0           | {2, 254}/256   |
| Inactive    | Low                   | 1           | {207, 49}/256  |
| Inactive    | Low                   | 2           | {189, 67}/256  |
| Inactive    | Low                   | 3           | {179, 77}/256  |
| Inactive    | Low                   | 4           | {174, 82}/256  |
| Inactive    | Low                   | 5           | {163, 93}/256  |
| Inactive    | Low                   | 6 or more   | {157, 99}/256  |
| Inactive    | High                  | 0           | {58, 198}/256  |
| Inactive    | High                  | 1           | {245, 11}/256  |
| Inactive    | High                  | 2           | {238, 18}/256  |
| Inactive    | High                  | 3           | {232, 24}/256  |
| Inactive    | High                  | 4           | {225, 31}/256  |
| Inactive    | High                  | 5           | {220, 36}/256  |
| Inactive    | High                  | 6 or more   | {211, 45}/256  |
| Unvoiced    | Low                   | 0           | {1, 255}/256   |
| Unvoiced    | Low                   | 1           | {210, 46}/256  |
| Unvoiced    | Low                   | 2           | {190, 66}/256  |
| Unvoiced    | Low                   | 3           | {178, 78}/256  |
| Unvoiced    | Low                   | 4           | {169, 87}/256  |
| Unvoiced    | Low                   | 5           | {162, 94}/256  |
| Unvoiced    | Low                   | 6 or more   | {152, 104}/256 |
| Unvoiced    | High                  | 0           | {48, 208}/256  |
| Unvoiced    | High                  | 1           | {242, 14}/256  |
| Unvoiced    | High                  | 2           | {235, 21}/256  |
| Unvoiced    | High                  | 3           | {224, 32}/256  |
| Unvoiced    | High                  | 4           | {214, 42}/256  |
| Unvoiced    | High                  | 5           | {205, 51}/256  |
| Unvoiced    | High                  | 6 or more   | {190, 66}/256  |
| Voiced      | Low                   | 0           | {1, 255}/256   |
| Voiced      | Low                   | 1           | {162, 94}/256  |
| Voiced      | Low                   | 2           | {152, 104}/256 |
| Voiced      | Low                   | 3           | {147, 109}/256 |
| Voiced      | Low                   | 4           | {144, 112}/256 |
| Voiced      | Low                   | 5           | {141, 115}/256 |
| Voiced      | Low                   | 6 or more   | {138, 118}/256 |
| Voiced      | High                  | 0           | {8, 248}/256   |
| Voiced      | High                  | 1           | {203, 53}/256  |
| Voiced      | High                  | 2           | {187, 69}/256  |
| Voiced      | High                  | 3           | {176, 80}/256  |
| Voiced      | High                  | 4           | {168, 88}/256  |
| Voiced      | High                  | 5           | {161, 95}/256  |
| Voiced      | High                  | 6 or more   | {154, 102}/256 |
+-------------+-----------------------+-------------+----------------+

Table 52: PDFs for Excitation Signs

4.2.7.8.6. Reconstructing the Excitation
After the signs have been read, there is enough information to reconstruct the complete excitation signal. This requires adding a constant quantization offset to each non-zero sample and then pseudorandomly inverting and offsetting every sample. The constant quantization offset varies depending on the signal type and quantization offset type (see Section 4.2.7.3).
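The reconstruction procedure that this section specifies (the per-sample update built on the Table 53 offsets) can be sketched in C as follows. This is a minimal illustration, not the reference implementation: reconstruct_excitation() and sign_of() are hypothetical names, the caller is assumed to supply offset_Q23 from Table 53 and the initial seed from Section 4.2.7.7, and a multiplication by 256 stands in for the left shift by 8 because left-shifting a negative value is undefined in C.

```c
#include <assert.h>
#include <stdint.h>

/* sign(x): -1, 0, or +1, matching the definition in Section 1.1.4. */
static int sign_of(int32_t x)
{
    return (x > 0) - (x < 0);
}

/* Reconstruct n excitation samples.  Returns the updated seed so the
 * caller can observe the final LCG state. */
static uint32_t reconstruct_excitation(const int32_t *e_raw, int32_t *e_Q23,
                                       int n, int32_t offset_Q23,
                                       uint32_t seed)
{
    for (int i = 0; i < n; i++) {
        /* Scale, pull non-zero samples toward zero by 20, add the offset. */
        int32_t e = e_raw[i] * 256 - sign_of(e_raw[i]) * 20 + offset_Q23;
        /* LCG step; uint32_t arithmetic wraps modulo 2^32. */
        seed = (uint32_t)(196314165u * seed + 907633515u);
        /* Pseudorandom inversion driven by the top bit of the seed. */
        if (seed & 0x80000000u) {
            e = -e;
        }
        /* Fold the raw value back into the seed (modulo 2^32). */
        seed += (uint32_t)e_raw[i];
        e_Q23[i] = e;
    }
    return seed;
}
```

For a single zero sample with offset_Q23 = 25 and seed = 0, the sketch produces e_Q23[0] = 25 and leaves the seed at 907633515, since the top bit of that value is clear and no inversion occurs.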
+-------------+--------------------------+--------------------------+
| Signal Type | Quantization Offset Type | Quantization Offset      |
|             |                          | (Q23)                    |
+-------------+--------------------------+--------------------------+
| Inactive    | Low                      | 25                       |
| Inactive    | High                     | 60                       |
| Unvoiced    | Low                      | 25                       |
| Unvoiced    | High                     | 60                       |
| Voiced      | Low                      | 8                        |
| Voiced      | High                     | 25                       |
+-------------+--------------------------+--------------------------+

Table 53: Excitation Quantization Offsets

Let e_raw[i] be the raw excitation value at position i, with a magnitude composed of the pulses at that location (see Section 4.2.7.8.3) combined with any additional LSBs (see Section 4.2.7.8.4), and with the corresponding sign decoded in Section 4.2.7.8.5. Additionally, let seed be the current pseudorandom seed, which is initialized to the value decoded from Section 4.2.7.7 for the first sample in the current SILK frame, and updated for each subsequent sample according to the procedure below. Finally, let offset_Q23 be the quantization offset from Table 53. Then the following procedure produces the final reconstructed excitation value, e_Q23[i]:

   e_Q23[i] = (e_raw[i] << 8) - sign(e_raw[i])*20 + offset_Q23;
       seed = (196314165*seed + 907633515) & 0xFFFFFFFF;
   e_Q23[i] = (seed & 0x80000000) ? -e_Q23[i] : e_Q23[i];
       seed = (seed + e_raw[i]) & 0xFFFFFFFF;

When e_raw[i] is zero, sign() returns 0 by the definition in Section 1.1.4, so the factor of 20 does not get added. The final e_Q23[i] value may require more than 16 bits per sample, but it will not require more than 23, including the sign.

4.2.7.9. SILK Frame Reconstruction
The remainder of the reconstruction process for the frame does not need to be bit-exact, as small errors should only introduce proportionally small distortions. Although the reference implementation only includes a fixed-point version of the remaining steps, this section describes them in terms of a floating-point version for simplicity. This produces a signal with a nominal range of -1.0 to 1.0.

silk_decode_core() (decode_core.c) contains the code for the main reconstruction process. It proceeds subframe-by-subframe, since quantization gains, LTP parameters, and (in 20 ms SILK frames) LPC coefficients can vary from one to the next.

Let a_Q12[k] be the LPC coefficients for the current subframe. If this is the first or second subframe of a 20 ms SILK frame and the LSF interpolation factor, w_Q2 (see Section 4.2.7.5.5), is less than 4, then these correspond to the final LPC coefficients produced by Section 4.2.7.5.8 from the interpolated LSF coefficients, n1_Q15[k] (computed in Section 4.2.7.5.5). Otherwise, they correspond to the final LPC coefficients produced from the uninterpolated LSF coefficients for the current frame, n2_Q15[k].

Also, let n be the number of samples in a subframe (40 for NB, 60 for MB, and 80 for WB), s be the index of the current subframe in this SILK frame (0 or 1 for 10 ms frames, or 0 to 3 for 20 ms frames), and j be the index of the first sample in the residual corresponding to the current subframe.

4.2.7.9.1. LTP Synthesis
For unvoiced frames (see Section 4.2.7.3), the LPC residual for i such that j <= i < (j + n) is simply a normalized copy of the excitation signal, i.e.,

            e_Q23[i]
   res[i] = ---------
             2.0**23

Voiced SILK frames, on the other hand, pass the excitation through an LTP filter using the parameters decoded in Section 4.2.7.6 to produce an LPC residual. The LTP filter requires LPC residual values from before the current subframe as input. However, since the LPC coefficients may have changed, it obtains this residual by "rewhitening" the corresponding output signal using the LPC coefficients from the current subframe. Let out[i] for i such that (j - pitch_lags[s] - d_LPC - 2) <= i < j be the fully reconstructed output signal from the last (pitch_lags[s] + d_LPC + 2) samples of previous subframes (see Section 4.2.7.9.2), where pitch_lags[s] is the pitch lag for the current subframe from Section 4.2.7.6.1. Additionally, let lpc[i] for i such that (j - s*n - d_LPC) <= i < j be the fully reconstructed output signal from the last (s*n + d_LPC) samples of previous subframes before clamping (see Section 4.2.7.9.2). During reconstruction of the first subframe for this channel after either

o  An uncoded regular SILK frame (if this is the side channel), or

o  A decoder reset (see Section 4.5.2),

out[i] and lpc[i] are initially cleared to all zeros. If this is the third or fourth subframe of a 20 ms SILK frame and the LSF interpolation factor, w_Q2 (see Section 4.2.7.5.5), is less than 4, then let out_end be set to (j - (s-2)*n) and let LTP_scale_Q14 be set to 16384. Otherwise, set out_end to (j - s*n) and set LTP_scale_Q14 to the Q14 LTP scaling value from Section 4.2.7.6.3. Then, for i such that (j - pitch_lags[s] - 2) <= i < out_end, out[i] is rewhitened into an LPC residual, res[i], via

            4.0*LTP_scale_Q14
   res[i] = ----------------- * clamp(-1.0,
               gain_Q16[s]
                                   d_LPC-1
                                      __              a_Q12[k]
                             out[i] - \  out[i-k-1] * --------, 1.0)
                                      /_               4096.0
                                      k=0

This requires storage to buffer up to 306 values of out[i] from previous subframes. This corresponds to WB with a maximum pitch lag of 18 ms * 16 kHz samples, plus 16 samples for d_LPC, plus 2 samples for the width of the LTP filter. Then, for i such that out_end <= i < j, lpc[i] is rewhitened into an LPC residual, res[i], via

                                 d_LPC-1
              65536.0               __              a_Q12[k]
   res[i] = ----------- * (lpc[i] - \  lpc[i-k-1] * --------)
            gain_Q16[s]             /_               4096.0
                                    k=0

This requires storage to buffer up to 256 values of lpc[i] from previous subframes (240 from the current SILK frame and 16 from the previous SILK frame). This corresponds to WB with up to three previous subframes in the current SILK frame, plus 16 samples for d_LPC. The astute reader will notice that, given the definition of lpc[i] in Section 4.2.7.9.2, the output of this latter equation is merely a scaled version of the values of res[i] from previous subframes.
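The two rewhitening equations above can be sketched in floating-point C as follows. This is a non-normative illustration, not the fixed-point code of the reference implementation. The function names are hypothetical, and the sketch assumes that out[], lpc[], and res[] share a single index space in which at least d_LPC samples of history are available before index start, where start stands for (j - pitch_lags[s] - 2). It also assumes gain_Q16 is non-zero.

```c
#include <stdint.h>

/* Clamps x to the nominal output range of -1.0 to 1.0. */
static double clamp_unit(double x)
{
    return x < -1.0 ? -1.0 : (x > 1.0 ? 1.0 : x);
}

/* LPC prediction error of sig[] at index i: the sample itself minus
   its prediction from the d_LPC samples that precede it. */
static double lpc_error(const double *sig, int i,
                        const int16_t *a_Q12, int d_LPC)
{
    double acc = sig[i];
    int k;

    for (k = 0; k < d_LPC; k++) {
        acc -= sig[i - k - 1] * (a_Q12[k] / 4096.0);
    }
    return acc;
}

/* Fills res[i] for start <= i < j.  Samples before out_end are taken
   from the clamped output, out[]; samples from out_end up to j are
   taken from the unclamped LPC synthesis output, lpc[]. */
void rewhiten_history(const double *out, const double *lpc, double *res,
                      int start, int out_end, int j,
                      const int16_t *a_Q12, int d_LPC,
                      int32_t gain_Q16, int32_t LTP_scale_Q14)
{
    const double out_scale = 4.0 * LTP_scale_Q14 / gain_Q16;
    const double lpc_scale = 65536.0 / gain_Q16;
    int i;

    for (i = start; i < out_end; i++) {
        res[i] = out_scale * clamp_unit(lpc_error(out, i, a_Q12, d_LPC));
    }
    for (i = out_end; i < j; i++) {
        res[i] = lpc_scale * lpc_error(lpc, i, a_Q12, d_LPC);
    }
}
```

With LTP_scale_Q14 equal to 16384 and gain_Q16 equal to 65536, both scale factors are exactly 1.0, which makes the sketch easy to check by hand on short inputs.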
Let e_Q23[i] for j <= i < (j + n) be the excitation for the current subframe, and b_Q7[k] for 0 <= k < 5 be the coefficients of the LTP filter taken from the codebook entry in one of Tables 39 through 41 corresponding to the index decoded for the current subframe in Section 4.2.7.6.2. Then for i such that j <= i < (j + n), the LPC residual is

                         4
            e_Q23[i]    __                                  b_Q7[k]
   res[i] = --------- + \  res[i - pitch_lags[s] + 2 - k] * -------
             2.0**23    /_                                   128.0
                        k=0

4.2.7.9.2. LPC Synthesis
LPC synthesis uses the short-term LPC filter to predict the next output sample. For i such that (j - d_LPC) <= i < j, let lpc[i] be the result of LPC synthesis from the last d_LPC samples of the previous subframe or zeros in the first subframe for this channel after either

o  An uncoded regular SILK frame (if this is the side channel), or

o  A decoder reset (see Section 4.5.2).

Then, for i such that j <= i < (j + n), the result of LPC synthesis for the current subframe is

                                 d_LPC-1
            gain_Q16[s]             __              a_Q12[k]
   lpc[i] = ----------- * res[i] + \  lpc[i-k-1] * --------
              65536.0               /_               4096.0
                                    k=0

The decoder saves the final d_LPC values, i.e., lpc[i] such that (j + n - d_LPC) <= i < (j + n), to feed into the LPC synthesis of the next subframe. This requires storage for up to 16 values of lpc[i] (for WB frames).

Then, the signal is clamped into the final nominal range:

   out[i] = clamp(-1.0, lpc[i], 1.0)

This clamping occurs entirely after the LPC synthesis filter has run. The decoder saves the unclamped values, lpc[i], to feed into the LPC filter for the next subframe, but saves the clamped values, out[i], for rewhitening in voiced frames.
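The two synthesis stages for a single subframe (the LTP filter of Section 4.2.7.9.1 followed by the LPC filter and clamping described here) can be sketched in floating-point C as follows. This is a non-normative illustration. The function name and argument layout are hypothetical, the arrays are assumed to share one index space with enough history before index j (d_LPC samples of lpc[] and, for voiced frames, pitch_lag + 2 samples of res[]), and pitch_lag stands for pitch_lags[s].

```c
#include <stdint.h>

/* Clamps x to the nominal output range of -1.0 to 1.0. */
static double clamp_unit(double x)
{
    return x < -1.0 ? -1.0 : (x > 1.0 ? 1.0 : x);
}

/* Synthesizes samples j .. j+n-1 of one subframe.  b_Q7 points to the
   five LTP filter taps and is only read when voiced is non-zero. */
void synthesize_subframe(const int32_t *e_Q23, double *res, double *lpc,
                         double *out, int j, int n,
                         int voiced, int pitch_lag, const int8_t *b_Q7,
                         const int16_t *a_Q12, int d_LPC, int32_t gain_Q16)
{
    int i, k;

    /* LTP synthesis: normalized excitation plus, for voiced frames,
       the 5-tap long-term prediction from earlier residual values. */
    for (i = j; i < j + n; i++) {
        double r = e_Q23[i] / 8388608.0;   /* 8388608.0 == 2.0**23 */

        if (voiced) {
            for (k = 0; k < 5; k++) {
                r += res[i - pitch_lag + 2 - k] * (b_Q7[k] / 128.0);
            }
        }
        res[i] = r;
    }

    /* LPC synthesis: scaled residual plus short-term prediction from
       the unclamped history, followed by clamping of the output. */
    for (i = j; i < j + n; i++) {
        double y = (gain_Q16 / 65536.0) * res[i];

        for (k = 0; k < d_LPC; k++) {
            y += lpc[i - k - 1] * (a_Q12[k] / 4096.0);
        }
        lpc[i] = y;              /* unclamped, feeds later prediction */
        out[i] = clamp_unit(y);  /* clamped, used for rewhitening     */
    }
}
```

Keeping lpc[] and out[] as separate buffers mirrors the requirement that the unclamped values feed the LPC filter while the clamped values are used for rewhitening.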
4.2.8. Stereo Unmixing
For stereo streams, after decoding a frame from each channel, the decoder must convert the mid-side (MS) representation into a left-right (LR) representation. The function silk_stereo_MS_to_LR (stereo_MS_to_LR.c) implements this process. In it, the decoder predicts the side channel using a) a simple low-passed version of the mid channel, and b) the unfiltered mid channel, using the prediction weights decoded in Section 4.2.7.1. This simple low-pass filter imposes a one-sample delay, and the unfiltered mid channel is also delayed by one sample. In order to allow seamless switching between stereo and mono, mono streams must also impose the same one-sample delay. The encoder requires an additional one-sample delay for both mono and stereo streams, though an encoder may omit the delay for mono if it knows it will never switch to stereo.

The unmixing process operates in two phases. The first phase lasts for 8 ms, during which it interpolates the prediction weights from the previous frame, prev_w0_Q13 and prev_w1_Q13, to the values for the current frame, w0_Q13 and w1_Q13. The second phase simply uses these weights for the remainder of the frame.

Let mid[i] and side[i] be the contents of out[i] (from Section 4.2.7.9.2) for the current mid and side channels, respectively, and let left[i] and right[i] be the corresponding stereo output channels. If the side channel is not coded (see Section 4.2.7.2), then side[i] is set to zero. Also, let j be defined as in Section 4.2.7.9, n1 be the number of samples in phase 1 (64 for NB, 96 for MB, and 128 for WB), and n2 be the total number of samples in the frame.
Then, for i such that j <= i < (j + n2), the left and right channel output is

         prev_w0_Q13                  (w0_Q13 - prev_w0_Q13)
   w0 =  ----------- + min(i - j, n1)*----------------------
           8192.0                           8192.0*n1

         prev_w1_Q13                  (w1_Q13 - prev_w1_Q13)
   w1 =  ----------- + min(i - j, n1)*----------------------
           8192.0                           8192.0*n1

         mid[i-2] + 2*mid[i-1] + mid[i]
   p0 =  ------------------------------
                      4.0

    left[i] = clamp(-1.0, (1 + w1)*mid[i-1] + side[i-1] + w0*p0, 1.0)

   right[i] = clamp(-1.0, (1 - w1)*mid[i-1] - side[i-1] - w0*p0, 1.0)
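The unmixing equations above can be sketched in floating-point C as follows. This is a non-normative illustration rather than the code of silk_stereo_MS_to_LR. The function name is hypothetical, and the sketch assumes that mid[] and side[] hold at least two samples of history before index j, so that mid[i-2] and side[i-1] are valid for every i in the frame.

```c
/* Clamps x to the nominal output range of -1.0 to 1.0. */
static double clamp_unit(double x)
{
    return x < -1.0 ? -1.0 : (x > 1.0 ? 1.0 : x);
}

/* Converts n2 mid/side samples starting at index j into left/right
   samples.  The prediction weights are interpolated linearly over the
   first n1 samples and held constant for the rest of the frame. */
void stereo_unmix(const double *mid, const double *side,
                  double *left, double *right,
                  int j, int n1, int n2,
                  int prev_w0_Q13, int prev_w1_Q13,
                  int w0_Q13, int w1_Q13)
{
    int i;

    for (i = j; i < j + n2; i++) {
        /* min(i - j, n1): position within the interpolation phase. */
        const int t = (i - j < n1) ? (i - j) : n1;
        const double w0 = prev_w0_Q13 / 8192.0
                        + t * ((w0_Q13 - prev_w0_Q13) / (8192.0 * n1));
        const double w1 = prev_w1_Q13 / 8192.0
                        + t * ((w1_Q13 - prev_w1_Q13) / (8192.0 * n1));
        /* Three-tap low-pass of the mid channel, centered on i-1. */
        const double p0 = (mid[i - 2] + 2.0 * mid[i - 1] + mid[i]) / 4.0;

        left[i]  = clamp_unit((1.0 + w1) * mid[i - 1] + side[i - 1]
                              + w0 * p0);
        right[i] = clamp_unit((1.0 - w1) * mid[i - 1] - side[i - 1]
                              - w0 * p0);
    }
}
```

With all four weights set to zero, the sketch reduces to left[i] = mid[i-1] + side[i-1] and right[i] = mid[i-1] - side[i-1], which makes the one-sample delay of both channels easy to see.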
These formulas require two samples prior to index j, the start of the frame, for the mid channel, and one prior sample for the side channel. For the first frame after a decoder reset, zeros are used instead.

4.2.9. Resampling
After stereo unmixing (if any), the decoder applies resampling to convert the decoded SILK output to the sample rate desired by the application. This is necessary when decoding a Hybrid frame at SWB or FB sample rates, or whenever the decoder wants the output at a different sample rate than the internal SILK sampling rate (e.g., to allow a constant sample rate when the audio bandwidth changes, or to allow mixing with audio from other applications). The resampler itself is non-normative, and a decoder can use any method it wants to perform the resampling.

However, a minimum amount of delay is imposed to allow the resampler to operate, and this delay is normative, so that the corresponding delay can be applied to the MDCT layer in the encoder. A decoder is always free to use a resampler that requires more delay than allowed for here (e.g., to improve quality), but it must then delay the output of the MDCT layer by this extra amount. Keeping as much delay as possible on the encoder side allows an encoder that knows it will never use any of the SILK or Hybrid modes to skip this delay. By contrast, if it were all applied by the decoder, then a decoder that processes audio in fixed-size blocks would be forced to delay the output of CELT frames just in case of a later switch to a SILK or Hybrid mode.

Table 54 gives the maximum resampler delay, in milliseconds, for each SILK audio bandwidth. Because the actual output rate may not be 48 kHz, it may not be possible to achieve exactly these delays while using a whole number of input or output samples. The reference implementation is able to resample to any of the supported output sampling rates (8, 12, 16, 24, or 48 kHz) within or near this delay constraint. Some resampling filters (including those used by the reference implementation) may add a delay that is not an exact integer, or is not linear-phase, and so cannot be represented by a single delay at all frequencies. However, such deviations are unlikely to be perceptible, and the comparison tool described in Section 6 is designed to be relatively insensitive to them. The delays listed here are the ones that should be targeted by the encoder.
+-----------------+-----------------------+
| Audio Bandwidth | Delay in Milliseconds |
+-----------------+-----------------------+
| NB              | 0.538                 |
| MB              | 0.692                 |
| WB              | 0.706                 |
+-----------------+-----------------------+

Table 54: SILK Resampler Delay Allocations

NB is given a smaller decoder delay allocation than MB and WB to allow a higher-order filter when resampling to 8 kHz in both the encoder and decoder. This implies that the audio content of two SILK frames operating at different bandwidths is not perfectly aligned in time. This is not an issue for any transitions described in Section 4.5, because they all involve a SILK decoder reset. When the decoder is reset, any samples remaining in the resampling buffer are discarded, and the resampler is re-initialized with silence.
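As an illustration of why these delays cannot always be met with a whole number of samples, the allocations of Table 54 can be converted to sample counts at a given rate. The sketch below is non-normative. The function name and the bandwidth index order (0 = NB, 1 = MB, 2 = WB) are assumptions made for this example.

```c
/* Delay allocations from Table 54, in milliseconds.  The index order
   (0 = NB, 1 = MB, 2 = WB) is an assumption of this sketch. */
static const double silk_resampler_delay_ms[3] = { 0.538, 0.692, 0.706 };

/* Returns the delay allocation for the given SILK audio bandwidth,
   expressed in (possibly fractional) samples at rate_hz. */
double resampler_delay_samples(int bandwidth, int rate_hz)
{
    return silk_resampler_delay_ms[bandwidth] * rate_hz / 1000.0;
}
```

For instance, the NB allocation of 0.538 ms corresponds to 25.824 samples at 48 kHz, so an implementation working in whole samples can only approximate it.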