3.6. Encoding the Remaining Samples
A dynamic codebook is used to encode 1) the 23/22 remaining samples
in the two sub-blocks containing the start state; 2) the sub-blocks
after the start state in time; and 3) the sub-blocks before the start
state in time. Thus, the encoding target can be either the 23/22
samples remaining of the 2 sub-blocks containing the start state, or
a 40-sample sub-block. This target can consist of samples that are
indexed forward in time or backward in time, depending on the
location of the start state. The length of the target is denoted by
lTarget.

The coding is based on an adaptive codebook that is built from a
codebook memory that contains decoded LPC excitation samples from the
already encoded part of the block. These samples are indexed in the
same time direction as is the target vector and end at the sample
instant prior to the first sample instant represented in the target
vector. The codebook memory has length lMem, which is equal to
CB_MEML=147 for the two/four 40-sample sub-blocks and 85 for the
23/22-sample sub-block.

The following figure shows an overview of the encoding procedure.

      +------------+    +---------------+    +-------------+
   -> | 1. Decode  | -> | 2. Mem setup  | -> | 3. Perc. W. | ->
      +------------+    +---------------+    +-------------+

      +------------+    +-----------------+
   -> | 4. Search  | -> | 5. Upd. Target  | ------------------>
   |  +------------+    +-----------------+                   |
    ----<-------------<-----------<---------- stage=0..2

      +----------------+
   -> | 6. Recalc G[0] | ---------------> gains and CB indices
      +----------------+

   Figure 3.4. Flow chart of the codebook search in the iLBC encoder.
1. Decode the part of the residual that has been encoded so far,
   using the codebook without perceptual weighting.

2. Set up the memory by taking data from the decoded residual. This
   memory is used to construct codebooks. For blocks preceding the
   start state, both the decoded residual and the target are time
   reversed (section 3.6.1).

3. Filter the memory + target with the perceptual weighting filter
   (section 3.6.2).

4. Search for the best match between the target and the codebook
   vector. Compute the optimal gain for this match and quantize that
   gain (section 3.6.4).

5. Update the perceptually weighted target by subtracting the
   contribution from the selected codebook vector from the
   perceptually weighted memory (quantized gain times selected
   vector). Repeat 4 and 5 for the two additional stages.

6. Calculate the energy loss due to encoding of the residual. If
   needed, compensate for this loss by an upscaling and
   requantization of the gain for the first stage (section 3.7).

The following sections provide an in-depth description of the
different blocks of Figure 3.4.

3.6.1. Codebook Memory
The codebook memory is based on the already encoded sub-blocks, so
the available data for encoding increases for each new sub-block that
has been encoded. Until enough sub-blocks have been encoded to fill
the codebook memory with data, it is padded with zeros. The
following figure shows an example of the order in which the
sub-blocks are encoded for the 30 ms frame size if the start state is
located in the last 58 samples of sub-blocks 2 and 3.

   +-----------------------------------------------------+
   |  5  |  1  |///|////////|    2    |    3    |    4    |
   +-----------------------------------------------------+

   Figure 3.5. The order from 1 to 5 in which the sub-blocks are
   encoded. The slashed area is the start state.
The first target sub-block to be encoded is number 1, and the
corresponding codebook memory is shown in the following figure. As
the target vector comes before the start state in time, the codebook
memory and target vector are time reversed; thus, after the block has
been time reversed the search algorithm can be reused. As only the
start state has been encoded so far, the last samples of the codebook
memory are padded with zeros.

   +-------------------------
   |zeros|\\\\\\\\|\\\\|  1  |
   +-------------------------

   Figure 3.6. The codebook memory, length lMem=85 samples, and the
   target vector 1, length 22 samples.

The next step is to encode sub-block 2 by using the memory that now
has increased since sub-block 1 has been encoded. The following
figure shows the codebook memory for encoding of sub-block 2.

   +-----------------------------------
   |  zeros  |  1  |///|////////|  2  |
   +-----------------------------------

   Figure 3.7. The codebook memory, length lMem=147 samples, and the
   target vector 2, length 40 samples.

The next step is to encode sub-block 3 by using the memory that has
increased yet again, since sub-blocks 1 and 2 have been encoded, but
the memory still has to be padded with a few zeros. The following
figure shows the codebook memory for encoding of sub-block 3.

   +------------------------------------------
   |zeros|  1  |///|////////|  2  |  3  |
   +------------------------------------------

   Figure 3.8. The codebook memory, length lMem=147 samples, and the
   target vector 3, length 40 samples.

The next step is to encode sub-block 4 by using the memory that has
increased yet again, since sub-blocks 1, 2, and 3 have been encoded.
This time, the memory does not have to be padded with zeros. The
following figure shows the codebook memory for encoding of
sub-block 4.
   +------------------------------------------
   |  1  |///|////////|  2  |  3  |  4  |
   +------------------------------------------

   Figure 3.9. The codebook memory, length lMem=147 samples, and the
   target vector 4, length 40 samples.

The final target sub-block to be encoded is number 5, and the
following figure shows the corresponding codebook memory. As the
target vector comes before the start state in time, the codebook
memory and target vector are time reversed.

   +-------------------------------------------
   |  3  |  2  |\\\\\\\\|\\\\|  1  |  5  |
   +-------------------------------------------

   Figure 3.10. The codebook memory, length lMem=147 samples, and
   the target vector 5, length 40 samples.

For the case of 20 ms frames, the encoding procedure looks almost
exactly the same. The only difference is that the size of the start
state is 57 samples and that there are only three sub-blocks to be
encoded. The encoding order is the same as above, starting with the
23-sample target and then encoding the two remaining 40-sample
sub-blocks, first going forward in time and then going backward in
time relative to the start state.

3.6.2. Perceptual Weighting of Codebook Memory and Target
To provide a perceptual weighting of the coding error, a
concatenation of the codebook memory and the target to be coded is
all-pole filtered with the perceptual weighting filter specified in
section 3.4. The filter state of the weighting filter is set to
zero.

   in(0..(lMem-1)) = unweighted codebook memory
   in(lMem..(lMem+lTarget-1)) = unweighted target signal

   in -> Wk(z) -> filtered,
   where Wk(z) is taken from the sub-block of the target

   weighted codebook memory = filtered(0..(lMem-1))
   weighted target signal = filtered(lMem..(lMem+lTarget-1))

The codebook search is done with the weighted codebook memory and the
weighted target, whereas the decoding and the codebook memory update
use the unweighted codebook memory.
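As an illustration, this weighting step can be sketched in C as
below. This is a minimal sketch under the stated zero-state
assumption; the function name and buffer layout are ours, not taken
from the reference implementation, and wDenum is assumed to hold the
LPC_FILTERORDER+1 denominator coefficients of Wk(z), with
wDenum[0]=1.0.

   #include <string.h>

   #define LPC_FILTERORDER 10

   /* All-pole filter the concatenation (memory + target) with the
      perceptual weighting filter, starting from a zero state. */
   static void weight_mem_and_target(const float *mem, int lMem,
                                     const float *target, int lTarget,
                                     const float *wDenum,
                                     float *wMem, float *wTarget)
   {
       float in[147 + 40], out[147 + 40];
       int n, k, len = lMem + lTarget;

       memcpy(in, mem, lMem * sizeof(float));
       memcpy(in + lMem, target, lTarget * sizeof(float));

       for (n = 0; n < len; n++) {
           float acc = in[n];
           /* samples before n=0 are treated as zero (zero state) */
           for (k = 1; k <= LPC_FILTERORDER && k <= n; k++)
               acc -= wDenum[k] * out[n - k];
           out[n] = acc;
       }
       memcpy(wMem, out, lMem * sizeof(float));
       memcpy(wTarget, out + lMem, lTarget * sizeof(float));
   }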
3.6.3. Codebook Creation
The codebook for the search is created from the perceptually weighted
codebook memory. It consists of two sections, where the first is
referred to as the base codebook and the second as the expanded
codebook, as it is created by linear combinations of the first. Each
of these two sections also has a subsection referred to as the
augmented codebook. The augmented codebook is only created and used
for the coding of the 40-sample sub-blocks and not for the
23/22-sample sub-block case. The codebook sizes used for the
different sub-blocks and different stages are summarized in the table
below.

                           Stage
                     1                2 & 3
   --------------------------------------------------
           22      128 (64+0)*2      128 (64+0)*2
   Sub-    1:st 40 256 (108+20)*2    128 (44+20)*2
   Blocks  2:nd 40 256 (108+20)*2    256 (108+20)*2
           3:rd 40 256 (108+20)*2    256 (108+20)*2
           4:th 40 256 (108+20)*2    256 (108+20)*2

   Table 3.1. Codebook sizes for the 30 ms mode.

Table 3.1 shows the codebook size for the different sub-blocks and
stages for 30 ms frames. Inside the parentheses it shows how the
number of codebook vectors is distributed, within the two sections,
between the base/expanded codebook and the augmented base/expanded
codebook. It should be interpreted in the following way:
(base/expanded cb + augmented base/expanded cb). The total number of
codebook vectors for a specific sub-block and stage is given by the
following formula:

   Tot. cb vectors = base cb + aug. base cb + exp. cb + aug. exp. cb

The corresponding values to Table 3.1 for 20 ms frames are only
slightly modified. The short sub-block is 23 instead of 22 samples,
and the 3:rd and 4:th sub-blocks are not present.

3.6.3.1. Creation of a Base Codebook
The base codebook is given by the perceptually weighted codebook
memory described in section 3.6.2. The different codebook vectors
are given by sliding a window of length 23/22 or 40, given by the
variable lTarget, over the lMem-long perceptually weighted codebook
memory. The codebook vector containing samples (lMem-lTarget-n) to
(lMem-n-1) of the codebook memory vector has index n, where
n=0..lMem-lTarget. Thus the total number of base codebook vectors is
lMem-lTarget+1, and the indices are ordered from sample delay lTarget
(23/22 or 40) to lMem (85 or 147).
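For illustration only, extracting base codebook vector n can be
written as follows; the function name is ours, not the reference
implementation's.

   /* Extract base codebook vector n: samples (lMem-lTarget-n) to
      (lMem-n-1) of the (weighted) codebook memory, i.e., sample
      delay lTarget+n. */
   static void base_cb_vector(const float *cbMem, int lMem,
                              int lTarget, int n, float *cbvec)
   {
       int i;
       const float *start = cbMem + lMem - lTarget - n;
       for (i = 0; i < lTarget; i++)
           cbvec[i] = start[i];
   }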
3.6.3.2. Codebook Expansion

The base codebook is expanded by a factor of 2, creating an
additional section in the codebook. This new section is obtained by
filtering the base codebook, base_cb, with an FIR filter of length
CB_FILTERLEN=8. The construction of the expanded codebook
compensates for the delay of four samples introduced by the FIR
filter.

   cbfiltersTbl[CB_FILTERLEN]={-0.033691,  0.083740, -0.144043,
                                0.713379,  0.806152, -0.184326,
                                0.108887, -0.034180};

                ___
                \
   exp_cb(k) =   >   cbfiltersTbl(i)*x(k-i+4)
                /__
             i=0...(CB_FILTERLEN-1)

   where x(j) = base_cb(j) for j=0..lMem-1 and 0 otherwise

The individual codebook vectors of the new filtered codebook, exp_cb,
and their indices are obtained in the same fashion as described above
for the base codebook.
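A minimal sketch of this expansion filtering, with x(j) treated as
zero outside the memory, could look as follows (illustrative, not
the reference code):

   #define CB_FILTERLEN 8

   static const float cbfiltersTbl[CB_FILTERLEN] = {
       -0.033691f,  0.083740f, -0.144043f,  0.713379f,
        0.806152f, -0.184326f,  0.108887f, -0.034180f
   };

   /* exp_cb(k) = sum_i cbfiltersTbl(i)*x(k-i+4), compensating for
      the 4-sample FIR delay. */
   static void expand_cb_memory(const float *base_cb, int lMem,
                                float *exp_cb)
   {
       int k, i;
       for (k = 0; k < lMem; k++) {
           float acc = 0.0f;
           for (i = 0; i < CB_FILTERLEN; i++) {
               int j = k - i + 4;
               if (j >= 0 && j < lMem)
                   acc += cbfiltersTbl[i] * base_cb[j];
           }
           exp_cb[k] = acc;
       }
   }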
3.6.3.3. Codebook Augmentation

For the cases where entire sub-blocks are encoded, i.e., cbveclen=40,
the base and expanded codebooks are augmented to increase codebook
richness. The codebooks are augmented by vectors produced by
interpolation of segments. The base and expanded codebooks,
constructed above, consist of vectors corresponding to sample delays
in the range from cbveclen to lMem. The codebook augmentation
attempts to augment these codebooks with vectors corresponding to
sample delays from 20 to 39. However, not all of these samples are
present in the base codebook and expanded codebook, respectively.
Therefore, the augmentation vectors are constructed as linear
combinations between samples corresponding to sample delays in the
range 20 to 39. The general idea of this procedure is presented in
the following figures and text. The procedure is performed for both
the base codebook and the expanded codebook.
   - - ------------------------|
          codebook memory      |
   - - ------------------------|
             |-5-|---15---|-5-|
             pi  pp       po
                 |
                 |                 Codebook vector
                 |---15---|-5-|-----20-----|  <- corresponding to
                     i     ii      iii        sample delay 20

   Figure 3.11. Generation of the first augmented codebook.

Figure 3.11 shows the codebook memory with pointers pi, pp, and po,
where pi points to sample 25, pp to sample 20, and po to sample 5.
Below the codebook memory, the augmented codebook vector
corresponding to sample delay 20 is drawn. Segment i consists of
fifteen samples from pointer pp and forward in time. Segment ii
consists of five interpolated samples from pi and forward and from po
and forward. The samples are linearly interpolated with weights
[0.0, 0.2, 0.4, 0.6, 0.8] for pi and weights [1.0, 0.8, 0.6, 0.4,
0.2] for po. Segment iii consists of twenty samples from pp and
forward. The augmented codebook vector corresponding to sample delay
21 is produced by moving pointers pp and pi one sample backward in
time. This gives us the following figure.

   - - ------------------------|
          codebook memory      |
   - - ------------------------|
             |-5-|---16---|-5-|
             pi  pp       po
                 |
                 |                 Codebook vector
                 |---16---|-5-|-----19-----|  <- corresponding to
                     i     ii      iii        sample delay 21

   Figure 3.12. Generation of the second augmented codebook.

Figure 3.12 shows the codebook memory with pointers pi, pp, and po,
where pi points to sample 26, pp to sample 21, and po to sample 5.
Below the codebook memory, the augmented codebook vector
corresponding to sample delay 21 is drawn. Segment i now consists of
sixteen samples from pp and forward. Segment ii consists of five
interpolated samples from pi and forward and from po and forward, and
the interpolation weights are the same throughout the procedure.
Segment iii consists of nineteen samples from pp and forward. The
same procedure of moving the two pointers is continued until the last
augmented vector corresponding to sample delay 39 has been created.
This gives a total of twenty new codebook vectors to each of the two
sections. Thus the total number of codebook vectors for each of the
two sections, when including the augmented codebook, becomes
lMem-SUBL+1+SUBL/2. This is provided that augmentation is invoked,
i.e., that lTarget=SUBL.
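The construction of a single augmentation vector can be sketched as
below; the pointer arithmetic follows Figures 3.11 and 3.12, while
the function name and argument layout are ours.

   /* Build the augmented codebook vector for sample delay d (20..39)
      from the codebook memory mem of length lMem. */
   static void augment_cb_vector(const float *mem, int lMem, int d,
                                 float *cbvec /* 40 samples */)
   {
       /* interpolation weights for pi; po gets (1 - weight) */
       static const float wi[5] = {0.0f, 0.2f, 0.4f, 0.6f, 0.8f};
       const float *pp  = mem + lMem - d;      /* lag d     */
       const float *ppi = mem + lMem - d - 5;  /* lag d + 5 */
       const float *po  = mem + lMem - 5;      /* lag 5     */
       int j;

       for (j = 0; j < d - 5; j++)             /* segment i   */
           cbvec[j] = pp[j];
       for (j = 0; j < 5; j++)                 /* segment ii  */
           cbvec[d - 5 + j] = wi[j] * ppi[j]
                            + (1.0f - wi[j]) * po[j];
       for (j = 0; j < 40 - d; j++)            /* segment iii */
           cbvec[d + j] = pp[j];
   }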
3.6.4. Codebook Search

The codebook search uses the codebooks described in the sections
above to find the best match of the perceptually weighted target (see
section 3.6.2). The search method is a multi-stage gain-shape
matching performed as follows. At each stage the best shape vector
is identified, then the gain is calculated and quantized, and finally
the target is updated in preparation for the next codebook search
stage. The number of stages is CB_NSTAGES=3.

If the target is the 23/22-sample vector, the codebooks are indexed
so that the base codebook is followed by the expanded codebook. If
the target is 40 samples, the order is as follows: base codebook,
augmented base codebook, expanded codebook, and augmented expanded
codebook. The size of each codebook section and its corresponding
augmented section is given by Table 3.1 in section 3.6.3.

For example, when the second 40-sample sub-block is coded, indices
0 - 107 correspond to the base codebook, 108 - 127 correspond to the
augmented base codebook, 128 - 235 correspond to the expanded
codebook, and indices 236 - 255 correspond to the augmented expanded
codebook. The indices are divided in the same fashion for all stages
in the example. Only in the case of coding the first 40-sample
sub-block is there a difference between stages (see Table 3.1).

3.6.4.1. Codebook Search at Each Stage
The codebooks are searched to find the best match to the target at each stage. When the best match is found, the target is updated and the next-stage search is started. The three chosen codebook vectors and their corresponding gains constitute the encoded sub-block. The best match is decided by the following three criteria: 1. Compute the measure (target*cbvec)^2 / ||cbvec||^2 for all codebook vectors, cbvec, and choose the codebook vector maximizing the measure. The expression (target*cbvec) is the dot product between the target vector to be coded and the codebook vector for which we compute the measure. The norm, ||x||, is defined as the square root of (x*x).
2. The absolute value of the gain, corresponding to the chosen
   codebook vector, cbvec, must be smaller than a fixed limit,
   CB_MAXGAIN=1.3:

      |gain| < CB_MAXGAIN

   where the gain is computed in the following way:

      gain = (target*cbvec) / ||cbvec||^2

3. For the first stage, the dot product of the chosen codebook
   vector and target must be positive:

      target*cbvec > 0

In practice the above criteria are used in a sequential search
through all codebook vectors. The best match is found by registering
a new max measure and index whenever the previously registered max
measure is surpassed and all other criteria are fulfilled. If none
of the codebook vectors fulfill (2) and (3), the first codebook
vector is selected.
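The sequential search at one stage might be sketched as follows.
get_cb_vector() is a hypothetical accessor over the concatenated
codebook sections, standing in for the indexing described above; the
rest follows the three criteria directly.

   #include <math.h>

   #define CB_MAXGAIN 1.3f

   /* Returns the index of the best codebook vector at one stage;
      falls back to index 0 if no vector fulfills the criteria. */
   static int search_stage(const float *target, int lTarget,
                           int nCbVecs, int stage,
                           void (*get_cb_vector)(int idx, float *vec))
   {
       int idx, bestIdx = 0;
       float bestMeasure = -1.0f;
       float vec[40];

       for (idx = 0; idx < nCbVecs; idx++) {
           float cross = 0.0f, energy = 0.0f, gain, measure;
           int i;
           get_cb_vector(idx, vec);
           for (i = 0; i < lTarget; i++) {
               cross += target[i] * vec[i];
               energy += vec[i] * vec[i];
           }
           if (energy <= 0.0f)
               continue;
           gain = cross / energy;
           if (fabsf(gain) >= CB_MAXGAIN)       /* criterion 2 */
               continue;
           if (stage == 0 && cross <= 0.0f)     /* criterion 3 */
               continue;
           measure = cross * cross / energy;    /* criterion 1 */
           if (measure > bestMeasure) {
               bestMeasure = measure;
               bestIdx = idx;
           }
       }
       return bestIdx;
   }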
3.6.4.2. Gain Quantization at Each Stage

The gain follows as a result of the computation

   gain = (target*cbvec) / ||cbvec||^2

for the optimal codebook vector found by the procedure in section
3.6.4.1.

The three stages quantize the gain, using 5, 4, and 3 bits,
respectively. In the first stage, the gain is limited to positive
values. This gain is quantized by finding the nearest value in the
quantization table gain_sq5Tbl.

   gain_sq5Tbl[32]={0.037476, 0.075012, 0.112488, 0.150024,
                    0.187500, 0.224976, 0.262512, 0.299988,
                    0.337524, 0.375000, 0.412476, 0.450012,
                    0.487488, 0.525024, 0.562500, 0.599976,
                    0.637512, 0.674988, 0.712524, 0.750000,
                    0.787476, 0.825012, 0.862488, 0.900024,
                    0.937500, 0.974976, 1.012512, 1.049988,
                    1.087524, 1.125000, 1.162476, 1.200012}

The gains of the subsequent two stages can be either positive or
negative. The gains are quantized by using a quantization table
times a scale factor. The second stage uses the table gain_sq4Tbl,
and the third stage uses gain_sq3Tbl. The scale factor is either 0.1
or the absolute value of the quantized gain obtained in the previous
stage, whichever is larger. Again, the resulting gain index is the
index to the nearest value of the quantization table times the scale
factor.

   gainQ = scaleFact * gain_sqXTbl[index]

   gain_sq4Tbl[16]={-1.049988, -0.900024, -0.750000, -0.599976,
                    -0.450012, -0.299988, -0.150024,  0.000000,
                     0.150024,  0.299988,  0.450012,  0.599976,
                     0.750000,  0.900024,  1.049988,  1.200012}

   gain_sq3Tbl[8]={-1.000000, -0.659973, -0.330017,  0.000000,
                    0.250000,  0.500000,  0.750000,  1.000000}
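A sketch of this nearest-value quantization (function name ours):

   #include <math.h>

   /* Quantize a gain against a table scaled by scaleFact; returns
      the table index and writes the quantized gain to *gainQ. */
   static int quantize_gain(float gain, const float *tbl, int tblSize,
                            float scaleFact, float *gainQ)
   {
       int i, bestIdx = 0;
       float bestErr = fabsf(gain - scaleFact * tbl[0]);
       for (i = 1; i < tblSize; i++) {
           float err = fabsf(gain - scaleFact * tbl[i]);
           if (err < bestErr) {
               bestErr = err;
               bestIdx = i;
           }
       }
       *gainQ = scaleFact * tbl[bestIdx];
       return bestIdx;
   }

Stage 1 would call this with gain_sq5Tbl and scaleFact = 1.0; stages
2 and 3 would call it with gain_sq4Tbl and gain_sq3Tbl, respectively,
and scaleFact = max(0.1, |gainQ of the previous stage|).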
3.6.4.3. Preparation of Target for Next Stage

Before performing the search for the next stage, the perceptually
weighted target vector is updated by subtracting from it the selected
codebook vector (from the perceptually weighted codebook) times the
corresponding quantized gain:

   target[i] = target[i] - gainQ * selected_vec[i];

A reference implementation of the codebook encoding is found in
Appendix A.34.
3.7. Gain Correction Encoding

The start state is quantized in a relatively model-independent
manner, using 3 bits per sample. In contrast, the remaining parts of
the block are encoded by using an adaptive codebook. This codebook
will produce high matching accuracy whenever there is a high
correlation between the target and the best codebook vector. For
unvoiced speech segments and background noises, this is not
necessarily so, which, due to the nature of the squared error
criterion, results in a coded signal with less power than the target
signal. As the coded start state has good power matching to the
target, the result is a power fluctuation within the encoded frame.
Perceptually, the main problem with this is that the time envelope of
the signal energy becomes unsteady.

To overcome this problem, the gains for the codebooks are re-scaled
after the codebook encoding by searching for a new gain factor for
the first stage codebook that provides better power matching.

First, the energy for the target signal, tene, is computed, along
with the energy for the coded signal, cene, given by the sum of the
three gain-scaled codebook vectors. Because the gains of the second
and third stage scale with the gain of the first stage, when the
first stage gain is changed from gain[0] to gain_sq5Tbl[i] the energy
of the coded signal changes from cene to

   cene * (gain_sq5Tbl[i]*gain_sq5Tbl[i]) / (gain[0]*gain[0])

where gain[0] is the gain for the first stage found in the original
codebook search. A refined search is performed by testing the gain
indices i=0 to 31, and as long as the new codebook energy as given
above is less than tene, the gain index for stage 1 is increased. A
restriction is applied so that the new gain value for stage 1 cannot
be more than two times higher than the original value found in the
codebook search.

Note that by using this method we do not change the shape of the
encoded vector, only the gain or amplitude.
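The refined gain search might be sketched as below; the loop
structure is one possible reading of the search described above, and
the function name is ours.

   /* Increase the stage 1 gain index as long as the rescaled coded
      energy stays below the target energy and the new gain is at
      most twice the original gain gain0 (assumed positive). */
   static int rescale_gain_index(float tene, float cene, float gain0,
                                 int idx0, const float *gain_sq5Tbl)
   {
       int i, bestIdx = idx0;
       for (i = idx0 + 1; i < 32; i++) {
           float g = gain_sq5Tbl[i];
           float newEne = cene * (g * g) / (gain0 * gain0);
           if (newEne < tene && g <= 2.0f * gain0)
               bestIdx = i;   /* table values increase with i */
           else
               break;
       }
       return bestIdx;
   }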
3.8. Bitstream Definition

The total number of bits used to describe one frame of 20 ms speech
is 304, which fits in 38 bytes and results in a bit rate of 15.20
kbit/s. For the case of a frame length of 30 ms speech, the total
number of bits used is 400, which fits in 50 bytes and results in a
bit rate of 13.33 kbit/s.

In the bitstream definition, the bits are distributed into three
classes according to their bit error or loss sensitivity. The most
sensitive bits (class 1) are placed first in the bitstream for each
frame. The less sensitive bits (class 2) are placed after the class
1 bits. The least sensitive bits (class 3) are placed at the end of
the bitstream for each frame.

In the 20/30 ms frame length cases for each class, the following hold
true: The class 1 bits occupy a total of 6/8 bytes (48/64 bits), the
class 2 bits occupy 8/12 bytes (64/96 bits), and the class 3 bits
occupy 24/30 bytes (192/240 bits). This distribution of the bits
enables the use of uneven level protection (ULP) as is exploited in
the payload format definition for iLBC [1]. The detailed bit
allocation is shown in the table below. When a quantization index is
distributed between multiple classes, the more significant bits
belong to the lowest class.
Bitstream structure: ------------------------------------------------------------------+ Parameter | Bits Class <1,2,3> | | 20 ms frame | 30 ms frame | ----------------------------------+---------------+---------------+ Split 1 | 6 <6,0,0> | 6 <6,0,0> | LSF 1 Split 2 | 7 <7,0,0> | 7 <7,0,0> | LSF Split 3 | 7 <7,0,0> | 7 <7,0,0> | ------------------+---------------+---------------+ Split 1 | NA (Not Appl.)| 6 <6,0,0> | LSF 2 Split 2 | NA | 7 <7,0,0> | Split 3 | NA | 7 <7,0,0> | ------------------+---------------+---------------+ Sum | 20 <20,0,0> | 40 <40,0,0> | ----------------------------------+---------------+---------------+ Block Class | 2 <2,0,0> | 3 <3,0,0> | ----------------------------------+---------------+---------------+ Position 22 sample segment | 1 <1,0,0> | 1 <1,0,0> | ----------------------------------+---------------+---------------+ Scale Factor State Coder | 6 <6,0,0> | 6 <6,0,0> | ----------------------------------+---------------+---------------+ Sample 0 | 3 <0,1,2> | 3 <0,1,2> | Quantized Sample 1 | 3 <0,1,2> | 3 <0,1,2> | Residual : | : : | : : | State : | : : | : : | Samples : | : : | : : | Sample 56 | 3 <0,1,2> | 3 <0,1,2> | Sample 57 | NA | 3 <0,1,2> | ------------------+---------------+---------------+ Sum | 171 <0,57,114>| 174 <0,58,116>| ----------------------------------+---------------+---------------+ Stage 1 | 7 <6,0,1> | 7 <4,2,1> | CB for 22/23 Stage 2 | 7 <0,0,7> | 7 <0,0,7> | sample block Stage 3 | 7 <0,0,7> | 7 <0,0,7> | ------------------+---------------+---------------+ Sum | 21 <6,0,15> | 21 <4,2,15> | ----------------------------------+---------------+---------------+ Stage 1 | 5 <2,0,3> | 5 <1,1,3> | Gain for 22/23 Stage 2 | 4 <1,1,2> | 4 <1,1,2> | sample block Stage 3 | 3 <0,0,3> | 3 <0,0,3> | ------------------+---------------+---------------+ Sum | 12 <3,1,8> | 12 <2,2,8> | ----------------------------------+---------------+---------------+ Stage 1 | 8 <7,0,1> | 8 <6,1,1> | sub-block 1 Stage 2 | 7 <0,0,7> | 7 <0,0,7> | Stage 3 | 7 <0,0,7> | 7 <0,0,7> | ------------------+---------------+---------------+
Stage 1 | 8 <0,0,8> | 8 <0,7,1> | sub-block 2 Stage 2 | 8 <0,0,8> | 8 <0,0,8> | Indices Stage 3 | 8 <0,0,8> | 8 <0,0,8> | for CB ------------------+---------------+---------------+ sub-blocks Stage 1 | NA | 8 <0,7,1> | sub-block 3 Stage 2 | NA | 8 <0,0,8> | Stage 3 | NA | 8 <0,0,8> | ------------------+---------------+---------------+ Stage 1 | NA | 8 <0,7,1> | sub-block 4 Stage 2 | NA | 8 <0,0,8> | Stage 3 | NA | 8 <0,0,8> | ------------------+---------------+---------------+ Sum | 46 <7,0,39> | 94 <6,22,66> | ----------------------------------+---------------+---------------+ Stage 1 | 5 <1,2,2> | 5 <1,2,2> | sub-block 1 Stage 2 | 4 <1,1,2> | 4 <1,2,1> | Stage 3 | 3 <0,0,3> | 3 <0,0,3> | ------------------+---------------+---------------+ Stage 1 | 5 <1,1,3> | 5 <0,2,3> | sub-block 2 Stage 2 | 4 <0,2,2> | 4 <0,2,2> | Stage 3 | 3 <0,0,3> | 3 <0,0,3> | Gains for ------------------+---------------+---------------+ sub-blocks Stage 1 | NA | 5 <0,1,4> | sub-block 3 Stage 2 | NA | 4 <0,1,3> | Stage 3 | NA | 3 <0,0,3> | ------------------+---------------+---------------+ Stage 1 | NA | 5 <0,1,4> | sub-block 4 Stage 2 | NA | 4 <0,1,3> | Stage 3 | NA | 3 <0,0,3> | ------------------+---------------+---------------+ Sum | 24 <3,6,15> | 48 <2,12,34> | ----------------------------------+---------------+---------------+ Empty frame indicator | 1 <0,0,1> | 1 <0,0,1> | ------------------------------------------------------------------- SUM 304 <48,64,192> 400 <64,96,240> Table 3.2. The bitstream definition for iLBC for both the 20 ms frame size mode and the 30 ms frame size mode. When packetized into the payload, the bits MUST be sorted as follows: All the class 1 bits in the order (from top to bottom) as specified in the table, all the class 2 bits (from top to bottom), and all the class 3 bits in the same sequential order. The last bit, the empty frame indicator, SHOULD be set to zero by the encoder. If this bit is set to 1 the decoder SHOULD treat the data as a lost frame. For example, this bit can be set to 1 to indicate lost frame for file storage format, as in [1].
4. Decoder Principles
This section describes the principles of each component of the decoder algorithm. +-------------+ +--------+ +---------------+ payload -> | 1. Get para | -> | 2. LPC | -> | 3. Sc Dequant | -> +-------------+ +--------+ +---------------+ +-------------+ +------------------+ -> | 4. Mem setup| -> | 5. Construct res |-------> | +-------------+ +------------------- | ---------<-----------<-----------<------------ Sub-frame 0...2/4 (20 ms/30 ms) +----------------+ +----------+ -> | 6. Enhance res | -> | 7. Synth | ------------> +----------------+ +----------+ +-----------------+ -> | 8. Post Process | ----------------> decoded speech +-----------------+ Figure 4.1. Flow chart of the iLBC decoder. If a frame was lost, steps 1 to 5 SHOULD be replaced by a PLC algorithm. 1. Extract the parameters from the bitstream. 2. Decode the LPC and interpolate (section 4.1). 3. Construct the 57/58-sample start state (section 4.2). 4. Set up the memory by using data from the decoded residual. This memory is used for codebook construction. For blocks preceding the start state, both the decoded residual and the target are time reversed. Sub-frames are decoded in the same order as they were encoded. 5. Construct the residuals of this sub-frame (gain[0]*cbvec[0] + gain[1]*cbvec[1] + gain[2]*cbvec[2]). Repeat 4 and 5 until the residual of all sub-blocks has been constructed. 6. Enhance the residual with the post filter (section 4.6). 7. Synthesis of the residual (section 4.7). 8. Post process with HP filter, if desired (section 4.8).
4.1. LPC Filter Reconstruction
The decoding of the LP filter parameters is very straightforward.
For a set of three/six indices, the corresponding LSF vector(s) are
found by simple table lookup. For each of the LSF vectors, the three
split vectors are concatenated to obtain qlsf1 and qlsf2,
respectively (in the 20 ms mode only one LSF vector, qlsf, is
constructed). The next step is the stability check described in
section 3.2.5, followed by the interpolation scheme described in
section 3.2.6 (3.2.7 for 20 ms frames). The only difference is that
only the quantized LSFs are known at the decoder, and hence the
unquantized LSFs are not processed.

A reference implementation of the LPC filter reconstruction is given
in Appendix A.36.

4.2. Start State Reconstruction
The scalar encoded STATE_SHORT_LEN=58 (STATE_SHORT_LEN=57 in the 20
ms mode) state samples are reconstructed by

1) forming a set of samples (by table lookup) from the index stream
   idxVec[n],

2) multiplying the set with 1/scal=(10^qmax)/4.5,

3) time reversing the 57/58 samples,

4) filtering the time reversed block with the dispersion (all-pass)
   filter used in the encoder (as described in section 3.5.2); this
   compensates for the phase distortion of the earlier filter
   operation, and

5) time reversing the 57/58 samples from the previous step.

   in(0..(STATE_SHORT_LEN-1)) = time reversed samples from table
                          look-up, idxVecDec((STATE_SHORT_LEN-1)..0)

   in(STATE_SHORT_LEN..(2*STATE_SHORT_LEN-1)) = 0

   Pk(z) = A~rk(z)/A~k(z), where
                                    ___
                                    \
   A~rk(z) = z^(-LPC_FILTERORDER) +  >  a~ki*z^(i-(LPC_FILTERORDER-1))
                                    /__
                                 i=0...(LPC_FILTERORDER-1)

   and A~k(z) is taken from the block where the start state begins

   in -> Pk(z) -> filtered

   out(k) = filtered(STATE_SHORT_LEN-1-k) +
            filtered(2*STATE_SHORT_LEN-1-k),
                                          k=0..(STATE_SHORT_LEN-1)
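Steps 1 through 5 and the out(k) expression can be summarized in C as
below. This is a sketch: allpass_filter() is a hypothetical stand-in
for Pk(z), qmax denotes the dequantized log-maximum from section
3.5.1, and the function name is ours.

   #include <math.h>
   #include <string.h>

   #define STATE_SHORT_LEN 58  /* 57 in the 20 ms mode */

   static void decode_start_state(const float *lookup, float qmax,
                                  float *out,
                                  void (*allpass_filter)(
                                      const float *in, int len,
                                      float *fout))
   {
       float in[2 * STATE_SHORT_LEN], filtered[2 * STATE_SHORT_LEN];
       float invscal = powf(10.0f, qmax) / 4.5f;  /* = 1/scal */
       int k;

       /* steps 2 and 3: rescale and time reverse, then zero-pad */
       for (k = 0; k < STATE_SHORT_LEN; k++)
           in[k] = lookup[STATE_SHORT_LEN - 1 - k] * invscal;
       memset(in + STATE_SHORT_LEN, 0,
              STATE_SHORT_LEN * sizeof(float));

       /* step 4: dispersion (all-pass) filtering with Pk(z) */
       allpass_filter(in, 2 * STATE_SHORT_LEN, filtered);

       /* step 5 and the out(k) expression above */
       for (k = 0; k < STATE_SHORT_LEN; k++)
           out[k] = filtered[STATE_SHORT_LEN - 1 - k]
                  + filtered[2 * STATE_SHORT_LEN - 1 - k];
   }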
The remaining 23/22 samples in the state are reconstructed by the
same adaptive codebook technique described in section 4.3. The
location bit determines whether these are the first or the last 23/22
samples of the 80-sample state vector. If the remaining 23/22
samples are the first samples, then the scalar encoded
STATE_SHORT_LEN state samples are time reversed before initialization
of the adaptive codebook memory vector.

A reference implementation of the start state reconstruction is given
in Appendix A.44.

4.3. Excitation Decoding Loop
The decoding of the LPC excitation vector proceeds in the same order
in which the residual was encoded at the encoder. That is, after the
decoding of the entire 80-sample state vector, the forward sub-blocks
(corresponding to samples occurring after the state vector samples)
are decoded, and then the backward sub-blocks (corresponding to
samples occurring before the state vector) are decoded, resulting in
a fully decoded block of excitation signal samples.

In particular, each sub-block is decoded by using the multistage
adaptive codebook decoding module described in section 4.4. This
module relies upon an adaptive codebook memory constructed before
each run of the adaptive codebook decoding. The construction of the
adaptive codebook memory in the decoder is identical to the method
outlined in section 3.6.3, except that it is done on the codebook
memory without perceptual weighting.

For the initial forward sub-block, the last STATE_LEN=80 samples of
the length CB_MEML=147 adaptive codebook memory are filled with the
samples of the state vector. For subsequent forward sub-blocks, the
first SUBL=40 samples of the adaptive codebook memory are discarded,
the remaining samples are shifted by SUBL samples toward the
beginning of the vector, and the newly decoded SUBL=40 samples are
placed at the end of the adaptive codebook memory. For backward
sub-blocks, the construction is similar, except that every vector of
samples involved is first time reversed.

A reference implementation of the excitation decoding loop is found
in Appendix A.5.
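For a subsequent forward sub-block, the memory update amounts to a
shift and append, as in the following sketch (function name ours):

   #include <string.h>

   #define CB_MEML 147
   #define SUBL    40

   /* Discard the oldest SUBL samples, shift the rest toward the
      beginning, and append the newly decoded sub-block at the end. */
   static void update_cb_memory(float *cbMem /* CB_MEML samples */,
                                const float *decoded /* SUBL */)
   {
       memmove(cbMem, cbMem + SUBL, (CB_MEML - SUBL) * sizeof(float));
       memcpy(cbMem + CB_MEML - SUBL, decoded, SUBL * sizeof(float));
   }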
4.4. Multistage Adaptive Codebook Decoding
The Multistage Adaptive Codebook Decoding module is used at both the
sender (encoder) and the receiver (decoder) ends to produce a
synthetic signal in the residual domain that is eventually used to
produce synthetic speech. The module takes the index values used to
construct vectors that are scaled and summed together to produce a
synthetic signal that is the output of the module.

4.4.1. Construction of the Decoded Excitation Signal
The unpacked index values provided at the input to the module are
references to extended codebooks, which are constructed as described
in section 3.6.3, except that they are based on the codebook memory
without the perceptual weighting. The unpacked three indices are
used to look up three codebook vectors. The unpacked three gain
indices are used to decode the corresponding 3 gains. In this
decoding, the successive rescaling, as described in section 3.6.4.2,
is applied.

A reference implementation of the adaptive codebook decoding is
listed in Appendix A.32.
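The resulting excitation for one sub-block is simply the
gain-weighted sum of the three vectors, as in this sketch:

   /* excitation = gains[0]*cbvec0 + gains[1]*cbvec1
                 + gains[2]*cbvec2 */
   static void construct_excitation(const float *cbvec0,
                                    const float *cbvec1,
                                    const float *cbvec2,
                                    const float gains[3],
                                    int len, float *excitation)
   {
       int i;
       for (i = 0; i < len; i++)
           excitation[i] = gains[0] * cbvec0[i]
                         + gains[1] * cbvec1[i]
                         + gains[2] * cbvec2[i];
   }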
4.5. Packet Loss Concealment

If packet loss occurs, the decoder receives a signal saying that
information regarding a block is lost. For such blocks it is
RECOMMENDED to use a Packet Loss Concealment (PLC) unit to create a
decoded signal that masks the effect of that packet loss. In the
following we will describe an example of a PLC unit that can be used
with the iLBC codec. As the PLC unit is used only at the decoder,
the PLC unit does not affect interoperability between
implementations. Other PLC implementations MAY therefore be used.

The PLC described operates on the LP filters and the excitation
signals and is based on the following principles:

4.5.1. Block Received Correctly and Previous Block Also Received
If the block is received correctly, the PLC only records state information of the current block that can be used in case the next block is lost. The LP filter coefficients for each sub-block and the entire decoded excitation signal are all saved in the decoder state structure. All of this information will be needed if the following block is lost.
4.5.2. Block Not Received
If the block is not received, the block substitution is based on a
pitch-synchronous repetition of the excitation signal, which is
filtered by the last LP filter of the previous block. The previous
block's information is stored in the decoder state structure.

A correlation analysis is performed on the previous block's
excitation signal in order to detect the amount of pitch periodicity
and a pitch value. The correlation measure is also used to decide on
the voicing level (the degree to which the previous block's
excitation was a voiced or roughly periodic signal). The excitation
in the previous block is used to create an excitation for the block
to be substituted, such that the pitch of the previous block is
maintained. Therefore, the new excitation is constructed in a
pitch-synchronous manner. In order to avoid a buzzy-sounding
substituted block, a random excitation is mixed with the new pitch
periodic excitation, and the relative use of the two components is
computed from the correlation measure (voicing level).

For the block to be substituted, the newly constructed excitation
signal is then passed through the LP filter to produce the speech
that will be substituted for the lost block.

For several consecutive lost blocks, the packet loss concealment
continues in a similar manner. The correlation measure of the last
block received is still used, along with the same pitch value. The
LP filters of the last block received are also used again. The
energy of the substituted excitation for consecutive lost blocks is
decreased, leading to a dampened excitation, and therefore to
dampened speech.
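As a non-normative sketch (the PLC itself is only an example), the
mixing and damping could look as below. The attenuation factor of
0.9 per additional lost block is our assumption for illustration, not
a value taken from this specification.

   /* Mix a pitch-synchronous repetition with random excitation
      according to the voicing level, and dampen the result for
      consecutive lost blocks. */
   static void plc_mix(const float *periodic, const float *noise,
                       float voicing,  /* 0..1, from correlation */
                       int nLost,      /* consecutive lost blocks */
                       int len, float *out)
   {
       float damp = 1.0f, mix = voicing;
       int i, k;
       for (k = 1; k < nLost; k++)
           damp *= 0.9f;               /* assumed attenuation */
       for (i = 0; i < len; i++)
           out[i] = damp * (mix * periodic[i]
                          + (1.0f - mix) * noise[i]);
   }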
4.5.3. Block Received Correctly When Previous Block Not Received

For the case in which a block is received correctly when the previous
block was not, the correctly received block's directly decoded speech
(based solely on the received block) is not used as the actual
output. The reason for this is that the directly decoded speech does
not necessarily merge smoothly into the synthetic speech generated
for the previous lost block; if the two signals are not smoothly
merged, an audible discontinuity results.

Therefore, a correlation analysis between the two blocks of
excitation signal (the excitation of the previous concealed block and
that of the current received block) is performed to find the best
phase match. Then a simple overlap-add procedure is performed to
merge the previous excitation smoothly into the current block's
excitation.
The exact implementation of the packet loss concealment does not
influence interoperability of the codec. A reference implementation
of the packet loss concealment is suggested in Appendix A.14. Exact
compliance with this suggested algorithm is not needed for a
reference implementation to be fully compatible with the overall
codec specification.

4.6. Enhancement
The decoder contains an enhancement unit that operates on the
reconstructed excitation signal. The enhancement unit increases the
perceptual quality of the reconstructed signal by reducing the
speech-correlated noise in the voiced speech segments. Compared to
traditional postfilters, the enhancer has the advantage that it can
modify the excitation signal only slightly. This means that there is
no risk of over-enhancement.

The enhancer works very similarly for both the 20 ms frame size mode
and the 30 ms frame size mode.

For the mode with 20 ms frame size, the enhancer uses a memory of six
80-sample excitation blocks prior in time plus the two new 80-sample
excitation blocks. For each block of 160 new unenhanced excitation
samples, 160 enhanced excitation samples are produced. The enhanced
excitation is delayed by 40 samples compared to the unenhanced
excitation, as the enhancer algorithm uses lookahead.

For the mode with 30 ms frame size, the enhancer uses a memory of
five 80-sample excitation blocks prior in time plus the three new
80-sample excitation blocks. For each block of 240 new unenhanced
excitation samples, 240 enhanced excitation samples are produced.
The enhanced excitation is delayed by 80 samples compared to the
unenhanced excitation, as the enhancer algorithm uses lookahead.

Outline of Enhancer

The speech enhancement unit operates on sub-blocks of 80 samples,
which means that there are two/three 80-sample sub-blocks per frame.
Each of these two/three sub-blocks is enhanced separately, but in an
analogous manner.
unenhanced residual | | +---------------+ +--------------+ +-> | 1. Pitch Est | -> | 2. Find PSSQ | --------> +---------------+ | +--------------+ +-----<-------<------<--+ +------------+ enh block 0..1/2 | -> | 3. Smooth | | +------------+ | \ | /\ | / \ Already | / 4. \----------->----------->-----------+ | \Crit/ Fulfilled | | \? / v | \/ | | \ +-----------------+ +---------+ | | Not +->| 5. Use Constr. | -> | 6. Mix | -----> Fulfilled +-----------------+ +---------+ ---------------> enhanced residual Figure 4.2. Flow chart of the enhancer. 1. Pitch estimation of each of the two/three new 80-sample blocks. 2. Find the pitch-period-synchronous sequence n (for block k) by a search around the estimated pitch value. Do this for n=1,2,3, -1,-2,-3. 3. Calculate the smoothed residual generated by the six pitch- period-synchronous sequences from prior step. 4. Check if the smoothed residual satisfies the criterion (section 4.6.4). 5. Use constraint to calculate mixing factor (section 4.6.5). 6. Mix smoothed signal with unenhanced residual (pssq(n) n=0). The main idea of the enhancer is to find three 80 sample blocks before and three 80-sample blocks after the analyzed unenhanced sub- block and to use these to improve the quality of the excitation in that sub-block. The six blocks are chosen so that they have the highest possible correlation with the unenhanced sub-block that is being enhanced. In other words, the six blocks are pitch-period- synchronous sequences to the unenhanced sub-block.
A linear combination of the six pitch-period-synchronous sequences is
calculated that approximates the sub-block. If the squared error
between the approximation and the unenhanced sub-block is small
enough, the enhanced residual is set equal to this approximation.
For the cases when the squared error criterion is not fulfilled, a
linear combination of the approximation and the unenhanced residual
forms the enhanced residual.

4.6.1. Estimating the Pitch
Pitch estimates are needed to determine the locations of the
pitch-period-synchronous sequences in a complexity-efficient way.
For each of the new two/three sub-blocks, a pitch estimate is
calculated by finding the maximum correlation in the range from lag
20 to lag 120. These pitch estimates are used to narrow down the
search for the best possible pitch-period-synchronous sequences.
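A sketch of this lag search follows. Whether the correlation is
normalized is left open by the text above; the raw correlation is
shown here as an assumption, and the function name is ours.

   /* Pick the lag in 20..120 maximizing the correlation of the new
      80-sample block starting at buf[pos] with its past; buf must
      hold at least 120 samples before pos. */
   static int estimate_pitch(const float *buf, int pos)
   {
       int lag, bestLag = 20, i;
       float bestCorr = -1.0e30f;

       for (lag = 20; lag <= 120; lag++) {
           float corr = 0.0f;
           for (i = 0; i < 80; i++)
               corr += buf[pos + i] * buf[pos + i - lag];
           if (corr > bestCorr) {
               bestCorr = corr;
               bestLag = lag;
           }
       }
       return bestLag;
   }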
4.6.2. Determination of the Pitch-Synchronous Sequences

Upon receiving the pitch estimates from the prior step, the enhancer
analyzes and enhances one 80-sample sub-block at a time. The
pitch-period-synchronous sequences pssq(n) can be viewed as vectors
of length 80 samples, each shifted n*lag samples from the current
sub-block. The six pitch-period-synchronous sequences, pssq(-3) to
pssq(-1) and pssq(1) to pssq(3), are found one at a time by the steps
below:

1) Calculate the estimate of the position of the pssq(n). For
   pssq(n) in front of pssq(0) (n > 0), the location of the pssq(n)
   is estimated by moving one pitch estimate forward in time from the
   exact location of pssq(n-1). Similarly, pssq(n) behind pssq(0)
   (n < 0) is estimated by moving one pitch estimate backward in time
   from the exact location of pssq(n+1). If the estimated pssq(n)
   vector location is totally within the enhancer memory (Figure
   4.3), steps 2, 3, and 4 are performed; otherwise, the pssq(n) is
   set to zeros.

2) Compute the correlation between the unenhanced excitation and
   vectors around the estimated location interval of pssq(n). The
   correlation is calculated in the interval estimated location +/- 2
   samples. This results in five correlation values.

3) The five correlation values are upsampled by a factor of 4, by
   using four simple upsampling filters (MA filters with coefficients
   upsFilter1..upsFilter4). Within these the maximum value is found,
   which specifies the best pitch period with a resolution of a
   quarter of a sample.
upsFilter1[7]={0.000000 0.000000 0.000000 1.000000 0.000000 0.000000 0.000000} upsFilter2[7]={0.015625 -0.076904 0.288330 0.862061 -0.106445 0.018799 -0.015625} upsFilter3[7]={0.023682 -0.124268 0.601563 0.601563 -0.124268 0.023682 -0.023682} upsFilter4[7]={0.018799 -0.106445 0.862061 0.288330 -0.076904 0.015625 -0.018799} 4) Generate the pssq(n) vector by upsampling of the excitation memory and extracting the sequence that corresponds to the lag delay that was calculated in prior step. With the steps above, all the pssq(n) can be found in an iterative manner, first moving backward in time from pssq(0) and then forward in time from pssq(0). 0 159 319 479 639 +---------------------------------------------------------------+ | -5 | -4 | -3 | -2 | -1 | 0 | 1 | 2 | +---------------------------------------------------------------+ |pssq 0 | |pssq -1| |pssq 1 | |pssq -2| |pssq 2 | |pssq -3| |pssq 3 | Figure 4.3. Enhancement for 20 ms frame size. Figure 4.3 depicts pitch-period-synchronous sequences in the enhancement of the first 80 sample block in the 20 ms frame size mode. The unenhanced signal input is stored in the last two sub- blocks (1 - 2), and the six other sub-blocks contain unenhanced residual prior-in-time. We perform the enhancement algorithm on two blocks of 80 samples, where the first of the two blocks consists of the last 40 samples of sub-block 0 and the first 40 samples of sub- block 1. The second 80-sample block consists of the last 40 samples of sub-block 1 and the first 40 samples of sub-block 2.
   0       159     319     479     639
   +---------------------------------------------------------------+
   |  -4   |  -3   |  -2   |  -1   |   0   |   1   |   2   |   3   |
   +---------------------------------------------------------------+
                                   |pssq 0 |
                           |pssq -1|       |pssq 1 |
                   |pssq -2|               |pssq 2 |
           |pssq -3|                       |pssq 3 |

   Figure 4.4. Enhancement for 30 ms frame size.

Figure 4.4 depicts pitch-period-synchronous sequences in the
enhancement of the first 80-sample block in the 30 ms frame size
mode. The unenhanced signal input is stored in the last three
sub-blocks (1 - 3). The five other sub-blocks contain unenhanced
residual prior-in-time. The enhancement algorithm is performed on
the three 80-sample sub-blocks 0, 1, and 2.

4.6.3. Calculation of the Smoothed Excitation
A linear combination of the six pssq(n) (n!=0) forms a smoothed
approximation, z, of pssq(0). Most of the weight is put on the
sequences that are close to pssq(0), as these are likely to be most
similar to pssq(0). The smoothed vector is also rescaled so that the
energy of z is the same as the energy of pssq(0).

        ___
        \
   y =   >   pssq(i) * pssq_weight(i)
        /__
     i=-3,-2,-1,1,2,3

   pssq_weight(i) = 0.5*(1-cos(2*pi*(i+4)/(2*3+2)))

   z = C * y, where C = ||pssq(0)||/||y||
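In C, the smoothing and rescaling could be sketched as follows, with
pssq stored as pssq[n+3][80] for n = -3..3 (this layout and the
function name are ours):

   #include <math.h>

   /* Weighted sum of the six pssq(n), n != 0, rescaled to the
      energy of pssq(0). */
   static void smooth_excitation(float pssq[7][80], float *z)
   {
       const double twopi = 6.283185307179586;
       float y[80] = {0.0f};
       double ey = 0.0, e0 = 0.0;
       float C;
       int n, i;

       for (n = -3; n <= 3; n++) {
           float w;
           if (n == 0)
               continue;
           w = (float)(0.5 * (1.0 - cos(twopi * (n + 4)
                                        / (2 * 3 + 2))));
           for (i = 0; i < 80; i++)
               y[i] += w * pssq[n + 3][i];
       }
       for (i = 0; i < 80; i++) {
           ey += y[i] * y[i];
           e0 += pssq[3][i] * pssq[3][i];
       }
       C = (ey > 0.0) ? (float)sqrt(e0 / ey) : 0.0f;
       for (i = 0; i < 80; i++)
           z[i] = C * y[i];
   }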
4.6.4. Enhancer Criterion

The criterion of the enhancer is that the enhanced excitation is not
allowed to differ much from the unenhanced excitation. This
criterion is checked for each 80-sample sub-block.

   e < (b * ||pssq(0)||^2), where b=0.05 and     (Constraint 1)

   e = (pssq(0)-z)*(pssq(0)-z),

and "*" means the dot product.
4.6.5. Enhancing the Excitation
From the criterion in the previous section, it is clear that the excitation is not allowed to change much. The purpose of this constraint is to prevent the creation of an enhanced signal significantly different from the original signal. This also means that the constraint limits the numerical size of the errors that the enhancement procedure can make. That is especially important in unvoiced segments and background noise segments for which increased periodicity could lead to lower perceived quality. When the constraint in the prior section is not met, the enhanced residual is instead calculated through a constrained optimization by using the Lagrange multiplier technique. The new constraint is that e = (b * ||pssq(0)||^2) (Constraint 2) We distinguish two solution regions for the optimization: 1) the region where the first constraint is fulfilled and 2) the region where the first constraint is not fulfilled and the second constraint must be used. In the first case, where the second constraint is not needed, the optimized re-estimated vector is simply z, the energy-scaled version of y. In the second case, where the second constraint is activated and becomes an equality constraint, we have z= A*y + B*pssq(0) where A = sqrt((b-b^2/4)*(w00*w00)/ (w11*w00 + w10*w10)) and w11 = pssq(0)*pssq(0) w00 = y*y w10 = y*pssq(0) (* symbolizes the dot product) and B = 1 - b/2 - A * w10/w00 Appendix A.16 contains a listing of a reference implementation for the enhancement method.
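A direct transcription of this constrained solution is sketched
below (w00 is assumed nonzero; the function name is ours):

   #include <math.h>

   /* z = A*y + B*pssq(0) with A and B as given above; b = 0.05. */
   static void constrained_mix(const float *y, const float *pssq0,
                               int len, float *z)
   {
       const float b = 0.05f;
       float w11 = 0.0f, w00 = 0.0f, w10 = 0.0f, A, B;
       int i;

       for (i = 0; i < len; i++) {
           w11 += pssq0[i] * pssq0[i];
           w00 += y[i] * y[i];
           w10 += y[i] * pssq0[i];
       }
       A = sqrtf((b - b * b / 4.0f) * (w00 * w00) /
                 (w11 * w00 + w10 * w10));
       B = 1.0f - b / 2.0f - A * w10 / w00;
       for (i = 0; i < len; i++)
           z[i] = A * y[i] + B * pssq0[i];
   }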
4.7. Synthesis Filtering
Upon decoding or PLC of the LP excitation block, the decoded speech
block is obtained by running the decoded LP synthesis filter,
1/A~k(z), over the block. The synthesis filters have to be shifted
to compensate for the delay in the enhancer. For 20 ms frame size
mode, they SHOULD be shifted one 40-sample sub-block, and for 30 ms
frame size mode, they SHOULD be shifted two 40-sample sub-blocks.
The LP coefficients SHOULD be changed at the first sample of every
sub-block while keeping the filter state. For PLC blocks, one
solution is to apply the last LP coefficients of the last decoded
speech block for all sub-blocks.

The reference implementation for the synthesis filtering can be found
in Appendix A.48.
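One way to realize the synthesis filtering for a sub-block while
keeping the filter state is sketched below (names ours; a[0] is
assumed to be 1.0 and len >= LPC_FILTERORDER):

   #define LPC_FILTERORDER 10

   /* Run 1/A~k(z) over one sub-block; state holds the last
      LPC_FILTERORDER output samples of the previous sub-block,
      with state[LPC_FILTERORDER-1] the most recent. */
   static void synthesis_filter(const float *excitation, int len,
                                const float *a, float *state,
                                float *speech)
   {
       int n, k;

       for (n = 0; n < len; n++) {
           float acc = excitation[n];
           for (k = 1; k <= LPC_FILTERORDER; k++) {
               float past = (n - k >= 0)
                          ? speech[n - k]
                          : state[LPC_FILTERORDER + n - k];
               acc -= a[k] * past;
           }
           speech[n] = acc;
       }
       /* save the tail as the state for the next sub-block */
       for (k = 0; k < LPC_FILTERORDER; k++)
           state[k] = speech[len - LPC_FILTERORDER + k];
   }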
4.8. Post Filtering

If desired, the decoded block can be filtered by a high-pass filter.
This removes the low frequencies of the decoded signal. A reference
implementation of this, with a cutoff at 65 Hz, is shown in Appendix
A.30.

5. Security Considerations
This algorithm for the coding of speech signals is not subject to any
known security consideration; however, its RTP payload format [1] is
subject to several considerations, which are addressed there.
Confidentiality of the media streams is achieved by encryption;
therefore, external mechanisms, such as SRTP [5], MAY be used for
that purpose.

6. Evaluation of the iLBC Implementations
It is possible, and suggested, to evaluate a given iLBC
implementation by utilizing the methodology and tools available at
http://www.ilbcfreeware.org/evaluation.html

7. References
7.1. Normative References
[1] Duric, A. and S. Andersen, "Real-time Transport Protocol (RTP) Payload Format for internet Low Bit Rate Codec (iLBC) Speech", RFC 3952, December 2004. [2] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[3] PacketCable(TM) Audio/Video Codecs Specification, Cable
    Television Laboratories, Inc.

7.2. Informative References
[4] ITU-T Recommendation G.711, available online from the ITU
    bookstore at http://www.itu.int.

[5] Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K.
    Norman, "The Secure Real-time Transport Protocol (SRTP)", RFC
    3711, March 2004.

8. Acknowledgements
In addition to the listed authors, this extensive work has the
following authors, who could not be listed among the "official"
authors (due to IESG restrictions on the number of authors who can be
listed): Manohar N. Murthi (Department of Electrical and Computer
Engineering, University of Miami), Fredrik Galschiodt, Julian
Spittka, and Jan Skoglund (Global IP Sound).

The authors are deeply indebted to the following people and thank
them sincerely: Henry Sinnreich, Patrik Faltstrom, Alan Johnston, and
Jean-Francois Mule for great support of the iLBC initiative and for
valuable feedback and comments; Peter Vary, Frank Mertz, and
Christoph Erdmann (RWTH Aachen), Vladimir Cuperman (Niftybox LLC),
Thomas Eriksson (Chalmers Univ of Tech), and Gernot Kubin (TU Graz),
for their thorough review of the iLBC document and their valuable
feedback and remarks.