3. Internal Framing
The Opus encoder produces "packets", which are each a contiguous set of bytes meant to be transmitted as a single unit. The packets described here do not include such things as IP, UDP, or RTP headers, which are normally found in a transport-layer packet. A single packet may contain multiple audio frames, so long as they share a common set of parameters, including the operating mode, audio bandwidth, frame size, and channel count (mono vs. stereo). This section describes the possible combinations of these parameters and the internal framing used to pack multiple frames into a single packet. This framing is not self-delimiting. Instead, it assumes that a lower layer (such as UDP or RTP [RFC3550] or Ogg [RFC3533] or Matroska [MATROSKA-WEBSITE]) will communicate the length, in bytes, of the packet, and it uses this information to reduce the framing overhead in the packet itself. A decoder implementation MUST support the framing described in this section. An alternative, self- delimiting variant of the framing is described in Appendix B. Support for that variant is OPTIONAL. All bit diagrams in this document number the bits so that bit 0 is the most significant bit of the first byte, and bit 7 is the least significant. Bit 8 is thus the most significant bit of the second byte, etc. Well-formed Opus packets obey certain requirements, marked [R1] through [R7] below. These are summarized in Section 3.4 along with appropriate means of handling malformed packets.3.1. The TOC Byte
A well-formed Opus packet MUST contain at least one byte [R1]. This byte forms a table-of-contents (TOC) header that signals which of the various modes and configurations a given packet uses. It is composed of a configuration number, "config", a stereo flag, "s", and a frame count code, "c", arranged as illustrated in Figure 1. A description of each of these fields follows.
0 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+ | config |s| c | +-+-+-+-+-+-+-+-+ Figure 1: The TOC Byte The top five bits of the TOC byte, labeled "config", encode one of 32 possible configurations of operating mode, audio bandwidth, and frame size. As described, the LP (SILK) layer and MDCT (CELT) layer can be combined in three possible operating modes: 1. A SILK-only mode for use in low bitrate connections with an audio bandwidth of WB or less, 2. A Hybrid (SILK+CELT) mode for SWB or FB speech at medium bitrates, and 3. A CELT-only mode for very low delay speech transmission as well as music transmission (NB to FB). The 32 possible configurations each identify which one of these operating modes the packet uses, as well as the audio bandwidth and the frame size. Table 2 lists the parameters for each configuration.
+-----------------------+-----------+-----------+-------------------+ | Configuration | Mode | Bandwidth | Frame Sizes | | Number(s) | | | | +-----------------------+-----------+-----------+-------------------+ | 0...3 | SILK-only | NB | 10, 20, 40, 60 ms | | | | | | | 4...7 | SILK-only | MB | 10, 20, 40, 60 ms | | | | | | | 8...11 | SILK-only | WB | 10, 20, 40, 60 ms | | | | | | | 12...13 | Hybrid | SWB | 10, 20 ms | | | | | | | 14...15 | Hybrid | FB | 10, 20 ms | | | | | | | 16...19 | CELT-only | NB | 2.5, 5, 10, 20 ms | | | | | | | 20...23 | CELT-only | WB | 2.5, 5, 10, 20 ms | | | | | | | 24...27 | CELT-only | SWB | 2.5, 5, 10, 20 ms | | | | | | | 28...31 | CELT-only | FB | 2.5, 5, 10, 20 ms | +-----------------------+-----------+-----------+-------------------+ Table 2: TOC Byte Configuration Parameters The configuration numbers in each range (e.g., 0...3 for NB SILK- only) correspond to the various choices of frame size, in the same order. For example, configuration 0 has a 10 ms frame size and configuration 3 has a 60 ms frame size. One additional bit, labeled "s", signals mono vs. stereo, with 0 indicating mono and 1 indicating stereo. The remaining two bits of the TOC byte, labeled "c", code the number of frames per packet (codes 0 to 3) as follows: o 0: 1 frame in the packet o 1: 2 frames in the packet, each with equal compressed size o 2: 2 frames in the packet, with different compressed sizes o 3: an arbitrary number of frames in the packet This document refers to a packet as a code 0 packet, code 1 packet, etc., based on the value of "c".
3.2. Frame Packing
This section describes how frames are packed according to each possible value of "c" in the TOC byte.3.2.1. Frame Length Coding
When a packet contains multiple VBR frames (i.e., code 2 or 3), the compressed length of one or more of these frames is indicated with a one- or two-byte sequence, with the meaning of the first byte as follows: o 0: No frame (Discontinuous Transmission (DTX) or lost packet) o 1...251: Length of the frame in bytes o 252...255: A second byte is needed. The total length is (second_byte*4)+first_byte The special length 0 indicates that no frame is available, either because it was dropped during transmission by some intermediary or because the encoder chose not to transmit it. Any Opus frame in any mode MAY have a length of 0. The maximum representable length is 255*4+255=1275 bytes. For 20 ms frames, this represents a bitrate of 510 kbit/s, which is approximately the highest useful rate for lossily compressed fullband stereo music. Beyond this point, lossless codecs are more appropriate. It is also roughly the maximum useful rate of the MDCT layer as, shortly thereafter, quality no longer improves with additional bits due to limitations on the codebook sizes. No length is transmitted for the last frame in a VBR packet, or for any of the frames in a CBR packet, as it can be inferred from the total size of the packet and the size of all other data in the packet. However, the length of any individual frame MUST NOT exceed 1275 bytes [R2] to allow for repacketization by gateways, conference bridges, or other software.3.2.2. Code 0: One Frame in the Packet
For code 0 packets, the TOC byte is immediately followed by N-1 bytes of compressed data for a single frame (where N is the size of the packet), as illustrated in Figure 2.
0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | config |s|0|0| | +-+-+-+-+-+-+-+-+ | | Compressed frame 1 (N-1 bytes)... : : | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 2: A Code 0 Packet3.2.3. Code 1: Two Frames in the Packet, Each with Equal Compressed Size
For code 1 packets, the TOC byte is immediately followed by the (N-1)/2 bytes of compressed data for the first frame, followed by (N-1)/2 bytes of compressed data for the second frame, as illustrated in Figure 3. The number of payload bytes available for compressed data, N-1, MUST be even for all code 1 packets [R3]. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | config |s|0|1| | +-+-+-+-+-+-+-+-+ : | Compressed frame 1 ((N-1)/2 bytes)... | : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : | Compressed frame 2 ((N-1)/2 bytes)... | : +-+-+-+-+-+-+-+-+ | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 3: A Code 1 Packet3.2.4. Code 2: Two Frames in the Packet, with Different Compressed Sizes
For code 2 packets, the TOC byte is followed by a one- or two-byte sequence indicating the length of the first frame (marked N1 in Figure 4), followed by N1 bytes of compressed data for the first frame. The remaining N-N1-2 or N-N1-3 bytes are the compressed data for the second frame. This is illustrated in Figure 4. A code 2 packet MUST contain enough bytes to represent a valid length. For example, a 1-byte code 2 packet is always invalid, and a 2-byte code 2 packet whose second byte is in the range 252...255 is also invalid.
The length of the first frame, N1, MUST also be no larger than the size of the payload remaining after decoding that length for all code 2 packets [R4]. This makes, for example, a 2-byte code 2 packet with a second byte in the range 1...251 invalid as well (the only valid 2-byte code 2 packet is one where the length of both frames is zero). 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | config |s|1|0| N1 (1-2 bytes): | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : | Compressed frame 1 (N1 bytes)... | : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | Compressed frame 2... : : | | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 4: A Code 2 Packet3.2.5. Code 3: A Signaled Number of Frames in the Packet
Code 3 packets signal the number of frames, as well as additional padding, called "Opus padding" to indicate that this padding is added at the Opus layer rather than at the transport layer. Code 3 packets MUST have at least 2 bytes [R6,R7]. The TOC byte is followed by a byte encoding the number of frames in the packet in bits 2 to 7 (marked "M" in Figure 5), with bit 1 indicating whether or not Opus padding is inserted (marked "p" in Figure 5), and bit 0 indicating VBR (marked "v" in Figure 5). M MUST NOT be zero, and the audio duration contained within a packet MUST NOT exceed 120 ms [R5]. This limits the maximum frame count for any frame size to 48 (for 2.5 ms frames), with lower limits for longer frame sizes. Figure 5 illustrates the layout of the frame count byte. 0 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+ |v|p| M | +-+-+-+-+-+-+-+-+ Figure 5: The frame count byte When Opus padding is used, the number of bytes of padding is encoded in the bytes following the frame count byte. Values from 0...254 indicate that 0...254 bytes of padding are included, in addition to
the byte(s) used to indicate the size of the padding. If the value is 255, then the size of the additional padding is 254 bytes, plus the padding value encoded in the next byte. There MUST be at least one more byte in the packet in this case [R6,R7]. The additional padding bytes appear at the end of the packet and MUST be set to zero by the encoder to avoid creating a covert channel. The decoder MUST accept any value for the padding bytes, however. Although this encoding provides multiple ways to indicate a given number of padding bytes, each uses a different number of bytes to indicate the padding size and thus will increase the total packet size by a different amount. For example, to add 255 bytes to a packet, set the padding bit, p, to 1, insert a single byte after the frame count byte with a value of 254, and append 254 padding bytes with the value zero to the end of the packet. To add 256 bytes to a packet, set the padding bit to 1, insert two bytes after the frame count byte with the values 255 and 0, respectively, and append 254 padding bytes with the value zero to the end of the packet. By using the value 255 multiple times, it is possible to create a packet of any specific, desired size. Let P be the number of header bytes used to indicate the padding size plus the number of padding bytes themselves (i.e., P is the total number of bytes added to the packet). Then, P MUST be no more than N-2 [R6,R7]. In the CBR case, let R=N-2-P be the number of bytes remaining in the packet after subtracting the (optional) padding. Then, the compressed length of each frame in bytes is equal to R/M. The value R MUST be a non-negative integer multiple of M [R6]. The compressed data for all M frames follows, each of size R/M bytes, as illustrated in Figure 6.
0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | config |s|1|1|0|p| M | Padding length (Optional) : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | : Compressed frame 1 (R/M bytes)... : | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | : Compressed frame 2 (R/M bytes)... : | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | : ... : | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | : Compressed frame M (R/M bytes)... : | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : Opus Padding (Optional)... | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 6: A CBR Code 3 Packet In the VBR case, the (optional) padding length is followed by M-1 frame lengths (indicated by "N1" to "N[M-1]" in Figure 7), each encoded in a one- or two-byte sequence as described above. The packet MUST contain enough data for the M-1 lengths after removing the (optional) padding, and the sum of these lengths MUST be no larger than the number of bytes remaining in the packet after decoding them [R7]. The compressed data for all M frames follows, each frame consisting of the indicated number of bytes, with the final frame consuming any remaining bytes before the final padding, as illustrated in Figure 6. The number of header bytes (TOC byte, frame count byte, padding length bytes, and frame length bytes), plus the signaled length of the first M-1 frames themselves, plus the signaled length of the padding MUST be no larger than N, the total size of the packet.
0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | config |s|1|1|1|p| M | Padding length (Optional) : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : N1 (1-2 bytes): N2 (1-2 bytes): ... : N[M-1] | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | : Compressed frame 1 (N1 bytes)... : | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | : Compressed frame 2 (N2 bytes)... : | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | : ... : | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | : Compressed frame M... : | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : Opus Padding (Optional)... | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 7: A VBR Code 3 Packet3.3. Examples
Simplest case, one NB mono 20 ms SILK frame: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 1 |0|0|0| compressed data... : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 8
Two FB mono 5 ms CELT frames of the same compressed size: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 29 |0|0|1| compressed data... : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 9 Two FB mono 20 ms Hybrid frames of different compressed size: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 15 |0|1|1|1|0| 2 | N1 | | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | | compressed data... : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 10 Four FB stereo 20 ms CELT frames of the same compressed size: 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | 31 |1|1|1|0|0| 4 | compressed data... : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Figure 113.4. Receiving Malformed Packets
A receiver MUST NOT process packets that violate any of the rules above as normal Opus packets. They are reserved for future applications, such as in-band headers (containing metadata, etc.). Packets that violate these constraints may cause implementations of _this_ specification to treat them as malformed and discard them. These constraints are summarized here for reference: [R1] Packets are at least one byte. [R2] No implicit frame length is larger than 1275 bytes. [R3] Code 1 packets have an odd total length, N, so that (N-1)/2 is an integer.
[R4] Code 2 packets have enough bytes after the TOC for a valid frame length, and that length is no larger than the number of bytes remaining in the packet. [R5] Code 3 packets contain at least one frame, but no more than 120 ms of audio total. [R6] The length of a CBR code 3 packet, N, is at least two bytes, the number of bytes added to indicate the padding size plus the trailing padding bytes themselves, P, is no more than N-2, and the frame count, M, satisfies the constraint that (N-2-P) is a non-negative integer multiple of M. [R7] VBR code 3 packets are large enough to contain all the header bytes (TOC byte, frame count byte, any padding length bytes, and any frame length bytes), plus the length of the first M-1 frames, plus any trailing padding bytes.4. Opus Decoder
The Opus decoder consists of two main blocks: the SILK decoder and the CELT decoder. At any given time, one or both of the SILK and CELT decoders may be active. The output of the Opus decode is the sum of the outputs from the SILK and CELT decoders with proper sample rate conversion and delay compensation on the SILK side, and optional decimation (when decoding to sample rates less than 48 kHz) on the CELT side, as illustrated in the block diagram below. +---------+ +------------+ | SILK | | Sample | +->| Decoder |--->| Rate |----+ Bit- +---------+ | | | | Conversion | v stream | Range |---+ +---------+ +------------+ /---\ Audio ------->| Decoder | | + |------> | |---+ +---------+ +------------+ \---/ +---------+ | | CELT | | Decimation | ^ +->| Decoder |--->| (Optional) |----+ | | | | +---------+ +------------+4.1. Range Decoder
Opus uses an entropy coder based on range coding [RANGE-CODING] [MARTIN79], which is itself a rediscovery of the FIFO arithmetic code introduced by [CODING-THESIS]. It is very similar to arithmetic encoding, except that encoding is done with digits in any base
instead of with bits, so it is faster when using larger bases (i.e., a byte). All of the calculations in the range coder must use bit- exact integer arithmetic. Symbols may also be coded as "raw bits" packed directly into the bitstream, bypassing the range coder. These are packed backwards starting at the end of the frame, as illustrated in Figure 12. This reduces complexity and makes the stream more resilient to bit errors, as corruption in the raw bits will not desynchronize the decoding process, unlike corruption in the input to the range decoder. Raw bits are only used in the CELT layer. 0 1 2 3 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ | Range coder data (packed MSB to LSB) -> : + + : : + +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : | <- Boundary occurs at an arbitrary bit position : +-+-+-+ + : <- Raw bits data (packed LSB to MSB) | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ Legend: LSB = Least Significant Bit MSB = Most Significant Bit Figure 12: Illustrative Example of Packing Range Coder and Raw Bits Data Each symbol coded by the range coder is drawn from a finite alphabet and coded in a separate "context", which describes the size of the alphabet and the relative frequency of each symbol in that alphabet. Suppose there is a context with n symbols, identified with an index that ranges from 0 to n-1. The parameters needed to encode or decode symbol k in this context are represented by a three-tuple (fl[k], fh[k], ft), all 16-bit unsigned integers, with 0 <= fl[k] < fh[k] <= ft <= 65535. The values of this tuple are derived from the probability model for the symbol, represented by traditional "frequency counts". Because Opus uses static contexts, those are not updated as symbols are decoded. Let f[i] be the frequency of symbol i. Then, the three-tuple corresponding to symbol k is given by the following:
k-1 n-1 __ __ fl[k] = \ f[i], fh[k] = fl[k] + f[k], ft = \ f[i] /_ /_ i=0 i=0 The range decoder extracts the symbols and integers encoded using the range encoder in Section 5.1. The range decoder maintains an internal state vector composed of the two-tuple (val, rng), where val represents the difference between the high end of the current range and the actual coded value, minus one, and rng represents the size of the current range. Both val and rng are 32-bit unsigned integer values.4.1.1. Range Decoder Initialization
Let b0 be an 8-bit unsigned integer containing first input byte (or containing zero if there are no bytes in this Opus frame). The decoder initializes rng to 128 and initializes val to (127 - (b0>>1)), where (b0>>1) is the top 7 bits of the first input byte. It saves the remaining bit, (b0&1), for use in the renormalization procedure described in Section 4.1.2.1, which the decoder invokes immediately after initialization to read additional bits and establish the invariant that rng > 2**23.4.1.2. Decoding Symbols
Decoding a symbol is a two-step process. The first step determines a 16-bit unsigned value fs, which lies within the range of some symbol in the current context. The second step updates the range decoder state with the three-tuple (fl[k], fh[k], ft) corresponding to that symbol. The first step is implemented by ec_decode() (entdec.c), which computes val fs = ft - min(------ + 1, ft) rng/ft The divisions here are integer division. The decoder then identifies the symbol in the current context corresponding to fs; i.e., the value of k whose three-tuple (fl[k], fh[k], ft) satisfies fl[k] <= fs < fh[k]. It uses this tuple to update val according to
rng val = val - --- * (ft - fh[k]) ft If fl[k] is greater than zero, then the decoder updates rng using rng rng = --- * (fh[k] - fl[k]) ft Otherwise, it updates rng using rng rng = rng - --- * (ft - fh[k]) ft Using a special case for the first symbol (rather than the last symbol, as is commonly done in other arithmetic coders) ensures that all the truncation error from the finite precision arithmetic accumulates in symbol 0. This makes the cost of coding a 0 slightly smaller, on average, than its estimated probability indicates and makes the cost of coding any other symbol slightly larger. When contexts are designed so that 0 is the most probable symbol, which is often the case, this strategy minimizes the inefficiency introduced by the finite precision. It also makes some of the special-case decoding routines in Section 4.1.3 particularly simple. After the updates, implemented by ec_dec_update() (entdec.c), the decoder normalizes the range using the procedure in the next section, and returns the index k.4.1.2.1. Renormalization
To normalize the range, the decoder repeats the following process, implemented by ec_dec_normalize() (entdec.c), until rng > 2**23. If rng is already greater than 2**23, the entire process is skipped. First, it sets rng to (rng<<8). Then, it reads the next byte of the Opus frame and forms an 8-bit value sym, using the leftover bit buffered from the previous byte as the high bit and the top 7 bits of the byte just read as the other 7 bits of sym. The remaining bit in the byte just read is buffered for use in the next iteration. If no more input bytes remain, it uses zero bits instead. See Section 4.1.1 for the initialization used to process the first byte. Then, it sets val = ((val<<8) + (255-sym)) & 0x7FFFFFFF
It is normal and expected that the range decoder will read several bytes into the data of the raw bits (if any) at the end of the frame by the time the frame is completely decoded, as illustrated in Figure 13. This same data MUST also be returned as raw bits when requested. The encoder is expected to terminate the stream in such a way that the range decoder will decode the intended values regardless of the data contained in the raw bits. Section 5.1.5 describes a procedure for doing this. If the range decoder consumes all of the bytes belonging to the current frame, it MUST continue to use zero when any further input bytes are required, even if there is additional data in the current packet from padding or other frames. n n+1 n+2 n+3 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ : | <----------- Overlap region ------------> | : +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+ ^ ^ | End of data buffered by the range coder | ...-----------------------------------------------+ | | End of data consumed by raw bits +-------------------------------------------------------... Figure 13: Illustrative Example of Raw Bits Overlapping Range Coder Data4.1.3. Alternate Decoding Methods
The reference implementation uses three additional decoding methods that are exactly equivalent to the above but make assumptions and simplifications that allow for a more efficient implementation.4.1.3.1. ec_decode_bin()
The first is ec_decode_bin() (entdec.c), defined using the parameter ftb instead of ft. It is mathematically equivalent to calling ec_decode() with ft = (1<<ftb), but it avoids one of the divisions.4.1.3.2. ec_dec_bit_logp()
The next is ec_dec_bit_logp() (entdec.c), which decodes a single binary symbol, replacing both the ec_decode() and ec_dec_update() steps. The context is described by a single parameter, logp, which is the absolute value of the base-2 logarithm of the probability of a "1". It is mathematically equivalent to calling ec_decode() with ft = (1<<logp), followed by ec_dec_update() with the 3-tuple (fl[k] = 0, fh[k] = (1<<logp) - 1, ft = (1<<logp)) if the returned
value of fs is less than (1<<logp) - 1 (a "0" was decoded), and with (fl[k] = (1<<logp) - 1, fh[k] = ft = (1<<logp)) otherwise (a "1" was decoded). The implementation requires no multiplications or divisions.4.1.3.3. ec_dec_icdf()
The last is ec_dec_icdf() (entdec.c), which decodes a single symbol with a table-based context of up to 8 bits, also replacing both the ec_decode() and ec_dec_update() steps, as well as the search for the decoded symbol in between. The context is described by two parameters, an icdf ("inverse" cumulative distribution function) table and ftb. As with ec_decode_bin(), (1<<ftb) is equivalent to ft. idcf[k], on the other hand, stores (1<<ftb)-fh[k], which is equal to (1<<ftb) - fl[k+1]. fl[0] is assumed to be 0, and the table is terminated by a value of 0 (where fh[k] == ft). The function is mathematically equivalent to calling ec_decode() with ft = (1<<ftb), using the returned value fs to search the table for the first entry where fs < (1<<ftb)-icdf[k], and calling ec_dec_update() with fl[k] = (1<<ftb) - icdf[k-1] (or 0 if k == 0), fh[k] = (1<<ftb) - idcf[k], and ft = (1<<ftb). Combining the search with the update allows the division to be replaced by a series of multiplications (which are usually much cheaper), and using an inverse CDF allows the use of an ftb as large as 8 in an 8-bit table without any special cases. This is the primary interface with the range decoder in the SILK layer, though it is used in a few places in the CELT layer as well. Although icdf[k] is more convenient for the code, the frequency counts, f[k], are a more natural representation of the probability distribution function (PDF) for a given symbol. Therefore, this document lists the latter, not the former, when describing the context in which a symbol is coded as a list, e.g., {4, 4, 4, 4}/16 for a uniform context with four possible values and ft = 16. The value of ft after the slash is always the sum of the entries in the PDF, but is included for convenience. Contexts with identical probabilities, f[k]/ft, but different values of ft (or equivalently, ftb) are not the same, and cannot, in general, be used in place of one another. An icdf table is also not capable of representing a PDF where the first symbol has 0 probability. In such contexts, ec_dec_icdf() can decode the symbol by using a table that drops the entries for any initial zero-probability values and by adding the constant offset of the first value with a non-zero probability to its return value.
4.1.4. Decoding Raw Bits
The raw bits used by the CELT layer are packed at the end of the frame, with the least significant bit of the first value packed in the least significant bit of the last byte, filling up to the most significant bit in the last byte, continuing on to the least significant bit of the penultimate byte, and so on. The reference implementation reads them using ec_dec_bits() (entdec.c). Because the range decoder must read several bytes ahead in the stream, as described in Section 4.1.2.1, the input consumed by the raw bits may overlap with the input consumed by the range coder, and a decoder MUST allow this. The format should render it impossible to attempt to read more raw bits than there are actual bits in the frame, though a decoder may wish to check for this and report an error.4.1.5. Decoding Uniformly Distributed Integers
The function ec_dec_uint() (entdec.c) decodes one of ft equiprobable values in the range 0 to (ft - 1), inclusive, each with a frequency of 1, where ft may be as large as (2**32 - 1). Because ec_decode() is limited to a total frequency of (2**16 - 1), it splits up the value into a range coded symbol representing up to 8 of the high bits, and, if necessary, raw bits representing the remainder of the value. The limit of 8 bits in the range coded symbol is a trade-off between implementation complexity, modeling error (since the symbols no longer truly have equal coding cost), and rounding error introduced by the range coder itself (which gets larger as more bits are included). Using raw bits reduces the maximum number of divisions required in the worst case, but means that it may be possible to decode a value outside the range 0 to (ft - 1), inclusive. ec_dec_uint() takes a single, positive parameter, ft, which is not necessarily a power of two, and returns an integer, t, whose value lies between 0 and (ft - 1), inclusive. Let ftb = ilog(ft - 1), i.e., the number of bits required to store (ft - 1) in two's complement notation. If ftb is 8 or less, then t is decoded with t = ec_decode(ft), and the range coder state is updated using the three-tuple (t, t + 1, ft). If ftb is greater than 8, then the top 8 bits of t are decoded using t = ec_decode(((ft - 1) >> (ftb - 8)) + 1) the decoder state is updated using the three-tuple (t, t + 1, ((ft - 1) >> (ftb - 8)) + 1), and the remaining bits are decoded as raw bits, setting
t = (t << (ftb - 8)) | ec_dec_bits(ftb - 8) If, at this point, t >= ft, then the current frame is corrupt. In that case, the decoder should assume there has been an error in the coding, decoding, or transmission and SHOULD take measures to conceal the error (e.g., saturate to ft-1 or use the Packet Loss Concealment (PLC)) and/or report to the application that the error has occurred.4.1.6. Current Bit Usage
The bit allocation routines in the CELT decoder need a conservative upper bound on the number of bits that have been used from the current frame thus far, including both range coder bits and raw bits. This drives allocation decisions that must match those made in the encoder. The upper bound is computed in the reference implementation to whole-bit precision by the function ec_tell() (entcode.h) and to fractional 1/8th bit precision by the function ec_tell_frac() (entcode.c). Like all operations in the range coder, it must be implemented in a bit-exact manner, and it must produce exactly the same value returned by the same functions in the encoder after encoding the same symbols. ec_tell() is guaranteed to return ceil(ec_tell_frac()/8.0). In various places, the codec will check to ensure there is enough room to contain a symbol before attempting to decode it. In practice, although the number of bits used so far is an upper bound, decoding a symbol whose probability model suggests it has a worst-case cost of p 1/8th bits may actually advance the return value of ec_tell_frac() by p-1, p, or p+1 1/8th bits, due to approximation error in that upper bound, truncation error in the range coder, and for large values of ft, modeling error in ec_dec_uint(). However, this error is bounded, and periodic calls to ec_tell() or ec_tell_frac() at precisely defined points in the decoding process prevent it from accumulating. For a range coder symbol that requires a whole number of bits (i.e., for which ft/(fh[k] - fl[k]) is a power of two), where there are at least p 1/8th bits available, decoding the symbol will never cause ec_tell() or ec_tell_frac() to exceed the size of the frame ("bust the budget"). In this case, the return value of ec_tell_frac() will only advance by more than p 1/8th bits if there were an additional, fractional number of bits remaining, and it will never advance beyond the next whole-bit boundary, which is safe, since frames always contain a whole number of bits. However, when p is not a whole number of bits, an extra 1/8th bit is required to ensure that decoding the symbol will not bust the budget.
The reference implementation keeps track of the total number of whole bits that have been processed by the decoder so far in the variable nbits_total, including the (possibly fractional) number of bits that are currently buffered, but not consumed, inside the range coder. nbits_total is initialized to 9 just before the initial range renormalization process completes (or equivalently, it can be initialized to 33 after the first renormalization). The extra two bits over the actual amount buffered by the range coder guarantees that it is an upper bound and that there is enough room for the encoder to terminate the stream. Each iteration through the range coder's renormalization loop increases nbits_total by 8. Reading raw bits increases nbits_total by the number of raw bits read.4.1.6.1. ec_tell()
The whole number of bits buffered in rng may be estimated via lg = ilog(rng). ec_tell() then becomes a simple matter of removing these bits from the total. It returns (nbits_total - lg). In a newly initialized decoder, before any symbols have been read, this reports that 1 bit has been used. This is the bit reserved for termination of the encoder.4.1.6.2. ec_tell_frac()
ec_tell_frac() estimates the number of bits buffered in rng to fractional precision. Since rng must be greater than 2**23 after renormalization, lg must be at least 24. Let r_Q15 = rng >> (lg-16) so that 32768 <= r_Q15 < 65536, an unsigned Q15 value representing the fractional part of rng. Then, the following procedure can be used to add one bit of precision to lg. First, update r_Q15 = (r_Q15*r_Q15) >> 15 Then, add the 16th bit of r_Q15 to lg via lg = 2*lg + (r_Q15 >> 16) Finally, if this bit was a 1, reduce r_Q15 by a factor of two via r_Q15 = r_Q15 >> 1 so that it once again lies in the range 32768 <= r_Q15 < 65536. This procedure is repeated three times to extend lg to 1/8th bit precision. ec_tell_frac() then returns (nbits_total*8 - lg).