RFC 6716

Definition of the Opus Audio Codec

Pages: 326
Proposed Standard
→ Errata
Updated by: 8251

Part 2 of 14 – Pages 13 to 31

RFC6716 - Page 13 prevText

3.  Internal Framing

   The Opus encoder produces "packets", which are each a contiguous set
   of bytes meant to be transmitted as a single unit.  The packets
   described here do not include such things as IP, UDP, or RTP headers,
   which are normally found in a transport-layer packet.  A single
   packet may contain multiple audio frames, so long as they share a
   common set of parameters, including the operating mode, audio
   bandwidth, frame size, and channel count (mono vs. stereo).  This
   section describes the possible combinations of these parameters and
   the internal framing used to pack multiple frames into a single
   packet.  This framing is not self-delimiting.  Instead, it assumes
   that a lower layer (such as UDP or RTP [RFC3550] or Ogg [RFC3533] or
   Matroska [MATROSKA-WEBSITE]) will communicate the length, in bytes,
   of the packet, and it uses this information to reduce the framing
   overhead in the packet itself.  A decoder implementation MUST support
   the framing described in this section.  An alternative, self-
   delimiting variant of the framing is described in Appendix B.
   Support for that variant is OPTIONAL.

   All bit diagrams in this document number the bits so that bit 0 is
   the most significant bit of the first byte, and bit 7 is the least
   significant.  Bit 8 is thus the most significant bit of the second
   byte, etc.  Well-formed Opus packets obey certain requirements,
   marked [R1] through [R7] below.  These are summarized in Section 3.4
   along with appropriate means of handling malformed packets.

3.1.  The TOC Byte

   A well-formed Opus packet MUST contain at least one byte [R1].  This
   byte forms a table-of-contents (TOC) header that signals which of the
   various modes and configurations a given packet uses.  It is composed
   of a configuration number, "config", a stereo flag, "s", and a frame
   count code, "c", arranged as illustrated in Figure 1.  A description
   of each of these fields follows.

RFC6716 - Page 14

                              0
                              0 1 2 3 4 5 6 7
                             +-+-+-+-+-+-+-+-+
                             | config  |s| c |
                             +-+-+-+-+-+-+-+-+

                          Figure 1: The TOC Byte

   The top five bits of the TOC byte, labeled "config", encode one of 32
   possible configurations of operating mode, audio bandwidth, and frame
   size.  As described, the LP (SILK) layer and MDCT (CELT) layer can be
   combined in three possible operating modes:

   1.  A SILK-only mode for use in low bitrate connections with an audio
       bandwidth of WB or less,

   2.  A Hybrid (SILK+CELT) mode for SWB or FB speech at medium
       bitrates, and

   3.  A CELT-only mode for very low delay speech transmission as well
       as music transmission (NB to FB).

   The 32 possible configurations each identify which one of these
   operating modes the packet uses, as well as the audio bandwidth and
   the frame size.  Table 2 lists the parameters for each configuration.

RFC6716 - Page 15

   +-----------------------+-----------+-----------+-------------------+
   | Configuration         | Mode      | Bandwidth | Frame Sizes       |
   | Number(s)             |           |           |                   |
   +-----------------------+-----------+-----------+-------------------+
   | 0...3                 | SILK-only | NB        | 10, 20, 40, 60 ms |
   |                       |           |           |                   |
   | 4...7                 | SILK-only | MB        | 10, 20, 40, 60 ms |
   |                       |           |           |                   |
   | 8...11                | SILK-only | WB        | 10, 20, 40, 60 ms |
   |                       |           |           |                   |
   | 12...13               | Hybrid    | SWB       | 10, 20 ms         |
   |                       |           |           |                   |
   | 14...15               | Hybrid    | FB        | 10, 20 ms         |
   |                       |           |           |                   |
   | 16...19               | CELT-only | NB        | 2.5, 5, 10, 20 ms |
   |                       |           |           |                   |
   | 20...23               | CELT-only | WB        | 2.5, 5, 10, 20 ms |
   |                       |           |           |                   |
   | 24...27               | CELT-only | SWB       | 2.5, 5, 10, 20 ms |
   |                       |           |           |                   |
   | 28...31               | CELT-only | FB        | 2.5, 5, 10, 20 ms |
   +-----------------------+-----------+-----------+-------------------+

                Table 2: TOC Byte Configuration Parameters

   The configuration numbers in each range (e.g., 0...3 for NB SILK-
   only) correspond to the various choices of frame size, in the same
   order.  For example, configuration 0 has a 10 ms frame size and
   configuration 3 has a 60 ms frame size.

   One additional bit, labeled "s", signals mono vs. stereo, with 0
   indicating mono and 1 indicating stereo.

   The remaining two bits of the TOC byte, labeled "c", code the number
   of frames per packet (codes 0 to 3) as follows:

   o  0: 1 frame in the packet

   o  1: 2 frames in the packet, each with equal compressed size

   o  2: 2 frames in the packet, with different compressed sizes

   o  3: an arbitrary number of frames in the packet

   This document refers to a packet as a code 0 packet, code 1 packet,
   etc., based on the value of "c".

RFC6716 - Page 16

3.2.  Frame Packing

   This section describes how frames are packed according to each
   possible value of "c" in the TOC byte.

3.2.1.  Frame Length Coding

   When a packet contains multiple VBR frames (i.e., code 2 or 3), the
   compressed length of one or more of these frames is indicated with a
   one- or two-byte sequence, with the meaning of the first byte as
   follows:

   o  0: No frame (Discontinuous Transmission (DTX) or lost packet)

   o  1...251: Length of the frame in bytes

   o  252...255: A second byte is needed.  The total length is
      (second_byte*4)+first_byte

   The special length 0 indicates that no frame is available, either
   because it was dropped during transmission by some intermediary or
   because the encoder chose not to transmit it.  Any Opus frame in any
   mode MAY have a length of 0.

   The maximum representable length is 255*4+255=1275 bytes.  For 20 ms
   frames, this represents a bitrate of 510 kbit/s, which is
   approximately the highest useful rate for lossily compressed fullband
   stereo music.  Beyond this point, lossless codecs are more
   appropriate.  It is also roughly the maximum useful rate of the MDCT
   layer as, shortly thereafter, quality no longer improves with
   additional bits due to limitations on the codebook sizes.

   No length is transmitted for the last frame in a VBR packet, or for
   any of the frames in a CBR packet, as it can be inferred from the
   total size of the packet and the size of all other data in the
   packet.  However, the length of any individual frame MUST NOT exceed
   1275 bytes [R2] to allow for repacketization by gateways, conference
   bridges, or other software.

3.2.2.  Code 0: One Frame in the Packet

   For code 0 packets, the TOC byte is immediately followed by N-1 bytes
   of compressed data for a single frame (where N is the size of the
   packet), as illustrated in Figure 2.

RFC6716 - Page 17

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | config  |s|0|0|                                               |
     +-+-+-+-+-+-+-+-+                                               |
     |                    Compressed frame 1 (N-1 bytes)...          :
     :                                                               |
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 2: A Code 0 Packet

3.2.3.  Code 1: Two Frames in the Packet, Each with Equal Compressed
        Size

   For code 1 packets, the TOC byte is immediately followed by the
   (N-1)/2 bytes of compressed data for the first frame, followed by
   (N-1)/2 bytes of compressed data for the second frame, as illustrated
   in Figure 3.  The number of payload bytes available for compressed
   data, N-1, MUST be even for all code 1 packets [R3].

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | config  |s|0|1|                                               |
     +-+-+-+-+-+-+-+-+                                               :
     |             Compressed frame 1 ((N-1)/2 bytes)...             |
     :                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                               |                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               :
     |             Compressed frame 2 ((N-1)/2 bytes)...             |
     :                                               +-+-+-+-+-+-+-+-+
     |                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 3: A Code 1 Packet

3.2.4.  Code 2: Two Frames in the Packet, with Different Compressed
        Sizes

   For code 2 packets, the TOC byte is followed by a one- or two-byte
   sequence indicating the length of the first frame (marked N1 in
   Figure 4), followed by N1 bytes of compressed data for the first
   frame.  The remaining N-N1-2 or N-N1-3 bytes are the compressed data
   for the second frame.  This is illustrated in Figure 4.  A code 2
   packet MUST contain enough bytes to represent a valid length.  For
   example, a 1-byte code 2 packet is always invalid, and a 2-byte code
   2 packet whose second byte is in the range 252...255 is also invalid.

RFC6716 - Page 18

   The length of the first frame, N1, MUST also be no larger than the
   size of the payload remaining after decoding that length for all code
   2 packets [R4].  This makes, for example, a 2-byte code 2 packet with
   a second byte in the range 1...251 invalid as well (the only valid
   2-byte code 2 packet is one where the length of both frames is zero).

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | config  |s|1|0| N1 (1-2 bytes):                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               :
     |               Compressed frame 1 (N1 bytes)...                |
     :                               +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                               |                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
     |                     Compressed frame 2...                     :
     :                                                               |
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                         Figure 4: A Code 2 Packet

3.2.5.  Code 3: A Signaled Number of Frames in the Packet

   Code 3 packets signal the number of frames, as well as additional
   padding, called "Opus padding" to indicate that this padding is added
   at the Opus layer rather than at the transport layer.  Code 3 packets
   MUST have at least 2 bytes [R6,R7].  The TOC byte is followed by a
   byte encoding the number of frames in the packet in bits 2 to 7
   (marked "M" in Figure 5), with bit 1 indicating whether or not Opus
   padding is inserted (marked "p" in Figure 5), and bit 0 indicating
   VBR (marked "v" in Figure 5).  M MUST NOT be zero, and the audio
   duration contained within a packet MUST NOT exceed 120 ms [R5].  This
   limits the maximum frame count for any frame size to 48 (for 2.5 ms
   frames), with lower limits for longer frame sizes.  Figure 5
   illustrates the layout of the frame count byte.

                              0
                              0 1 2 3 4 5 6 7
                             +-+-+-+-+-+-+-+-+
                             |v|p|     M     |
                             +-+-+-+-+-+-+-+-+

                      Figure 5: The frame count byte

   When Opus padding is used, the number of bytes of padding is encoded
   in the bytes following the frame count byte.  Values from 0...254
   indicate that 0...254 bytes of padding are included, in addition to

RFC6716 - Page 19

   the byte(s) used to indicate the size of the padding.  If the value
   is 255, then the size of the additional padding is 254 bytes, plus
   the padding value encoded in the next byte.  There MUST be at least
   one more byte in the packet in this case [R6,R7].  The additional
   padding bytes appear at the end of the packet and MUST be set to zero
   by the encoder to avoid creating a covert channel.  The decoder MUST
   accept any value for the padding bytes, however.

   Although this encoding provides multiple ways to indicate a given
   number of padding bytes, each uses a different number of bytes to
   indicate the padding size and thus will increase the total packet
   size by a different amount.  For example, to add 255 bytes to a
   packet, set the padding bit, p, to 1, insert a single byte after the
   frame count byte with a value of 254, and append 254 padding bytes
   with the value zero to the end of the packet.  To add 256 bytes to a
   packet, set the padding bit to 1, insert two bytes after the frame
   count byte with the values 255 and 0, respectively, and append 254
   padding bytes with the value zero to the end of the packet.  By using
   the value 255 multiple times, it is possible to create a packet of
   any specific, desired size.  Let P be the number of header bytes used
   to indicate the padding size plus the number of padding bytes
   themselves (i.e., P is the total number of bytes added to the
   packet).  Then, P MUST be no more than N-2 [R6,R7].

   In the CBR case, let R=N-2-P be the number of bytes remaining in the
   packet after subtracting the (optional) padding.  Then, the
   compressed length of each frame in bytes is equal to R/M.  The value
   R MUST be a non-negative integer multiple of M [R6].  The compressed
   data for all M frames follows, each of size R/M bytes, as illustrated
   in Figure 6.

RFC6716 - Page 20

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | config  |s|1|1|0|p|     M     |  Padding length (Optional)    :
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :               Compressed frame 1 (R/M bytes)...               :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :               Compressed frame 2 (R/M bytes)...               :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :                              ...                              :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :               Compressed frame M (R/M bytes)...               :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     :                  Opus Padding (Optional)...                   |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                       Figure 6: A CBR Code 3 Packet

   In the VBR case, the (optional) padding length is followed by M-1
   frame lengths (indicated by "N1" to "N[M-1]" in Figure 7), each
   encoded in a one- or two-byte sequence as described above.  The
   packet MUST contain enough data for the M-1 lengths after removing
   the (optional) padding, and the sum of these lengths MUST be no
   larger than the number of bytes remaining in the packet after
   decoding them [R7].  The compressed data for all M frames follows,
   each frame consisting of the indicated number of bytes, with the
   final frame consuming any remaining bytes before the final padding,
   as illustrated in Figure 6.  The number of header bytes (TOC byte,
   frame count byte, padding length bytes, and frame length bytes), plus
   the signaled length of the first M-1 frames themselves, plus the
   signaled length of the padding MUST be no larger than N, the total
   size of the packet.

RFC6716 - Page 21

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | config  |s|1|1|1|p|     M     | Padding length (Optional)     :
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     : N1 (1-2 bytes): N2 (1-2 bytes):     ...       :     N[M-1]    |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :               Compressed frame 1 (N1 bytes)...                :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :               Compressed frame 2 (N2 bytes)...                :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :                              ...                              :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     |                                                               |
     :                     Compressed frame M...                     :
     |                                                               |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     :                  Opus Padding (Optional)...                   |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                       Figure 7: A VBR Code 3 Packet

3.3.  Examples

   Simplest case, one NB mono 20 ms SILK frame:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |    1    |0|0|0|               compressed data...              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                                 Figure 8

RFC6716 - Page 22

   Two FB mono 5 ms CELT frames of the same compressed size:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   29    |0|0|1|               compressed data...              :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                                 Figure 9

   Two FB mono 20 ms Hybrid frames of different compressed size:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   15    |0|1|1|1|0|     2     |      N1       |               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
   |                       compressed data...                      :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                                 Figure 10

   Four FB stereo 20 ms CELT frames of the same compressed size:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   31    |1|1|1|0|0|     4     |      compressed data...       :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                                 Figure 11

3.4.  Receiving Malformed Packets

   A receiver MUST NOT process packets that violate any of the rules
   above as normal Opus packets.  They are reserved for future
   applications, such as in-band headers (containing metadata, etc.).
   Packets that violate these constraints may cause implementations of
   _this_ specification to treat them as malformed and discard them.

   These constraints are summarized here for reference:

   [R1]  Packets are at least one byte.

   [R2]  No implicit frame length is larger than 1275 bytes.

   [R3]  Code 1 packets have an odd total length, N, so that (N-1)/2 is
         an integer.

RFC6716 - Page 23

   [R4]  Code 2 packets have enough bytes after the TOC for a valid
         frame length, and that length is no larger than the number of
         bytes remaining in the packet.

   [R5]  Code 3 packets contain at least one frame, but no more than
         120 ms of audio total.

   [R6]  The length of a CBR code 3 packet, N, is at least two bytes,
         the number of bytes added to indicate the padding size plus the
         trailing padding bytes themselves, P, is no more than N-2, and
         the frame count, M, satisfies the constraint that (N-2-P) is a
         non-negative integer multiple of M.

   [R7]  VBR code 3 packets are large enough to contain all the header
         bytes (TOC byte, frame count byte, any padding length bytes,
         and any frame length bytes), plus the length of the first M-1
         frames, plus any trailing padding bytes.

4.  Opus Decoder

   The Opus decoder consists of two main blocks: the SILK decoder and
   the CELT decoder.  At any given time, one or both of the SILK and
   CELT decoders may be active.  The output of the Opus decode is the
   sum of the outputs from the SILK and CELT decoders with proper sample
   rate conversion and delay compensation on the SILK side, and optional
   decimation (when decoding to sample rates less than 48 kHz) on the
   CELT side, as illustrated in the block diagram below.


                            +---------+    +------------+
                            |  SILK   |    |   Sample   |
                         +->| Decoder |--->|    Rate    |----+
   Bit-    +---------+   |  |         |    | Conversion |    v
   stream  |  Range  |---+  +---------+    +------------+  /---\  Audio
   ------->| Decoder |                                     | + |------>
           |         |---+  +---------+    +------------+  \---/
           +---------+   |  |  CELT   |    | Decimation |    ^
                         +->| Decoder |--->| (Optional) |----+
                            |         |    |            |
                            +---------+    +------------+


4.1.  Range Decoder

   Opus uses an entropy coder based on range coding [RANGE-CODING]
   [MARTIN79], which is itself a rediscovery of the FIFO arithmetic code
   introduced by [CODING-THESIS].  It is very similar to arithmetic
   encoding, except that encoding is done with digits in any base

RFC6716 - Page 24

   instead of with bits, so it is faster when using larger bases (i.e.,
   a byte).  All of the calculations in the range coder must use bit-
   exact integer arithmetic.

   Symbols may also be coded as "raw bits" packed directly into the
   bitstream, bypassing the range coder.  These are packed backwards
   starting at the end of the frame, as illustrated in Figure 12.  This
   reduces complexity and makes the stream more resilient to bit errors,
   as corruption in the raw bits will not desynchronize the decoding
   process, unlike corruption in the input to the range decoder.  Raw
   bits are only used in the CELT layer.

      0                   1                   2                   3
      0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     | Range coder data (packed MSB to LSB) ->                       :
     +                                                               +
     :                                                               :
     +     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     :     | <- Boundary occurs at an arbitrary bit position         :
     +-+-+-+                                                         +
     :                          <- Raw bits data (packed LSB to MSB) |
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     Legend:

     LSB = Least Significant Bit
     MSB = Most Significant Bit

          Figure 12: Illustrative Example of Packing Range Coder
                             and Raw Bits Data

   Each symbol coded by the range coder is drawn from a finite alphabet
   and coded in a separate "context", which describes the size of the
   alphabet and the relative frequency of each symbol in that alphabet.

   Suppose there is a context with n symbols, identified with an index
   that ranges from 0 to n-1.  The parameters needed to encode or decode
   symbol k in this context are represented by a three-tuple
   (fl[k], fh[k], ft), all 16-bit unsigned integers, with
   0 <= fl[k] < fh[k] <= ft <= 65535.  The values of this tuple are
   derived from the probability model for the symbol, represented by
   traditional "frequency counts".  Because Opus uses static contexts,
   those are not updated as symbols are decoded.  Let f[i] be the
   frequency of symbol i.  Then, the three-tuple corresponding to symbol
   k is given by the following:

RFC6716 - Page 25

                   k-1                                   n-1
                   __                                    __
           fl[k] = \  f[i],  fh[k] = fl[k] + f[k],  ft = \  f[i]
                   /_                                    /_
                   i=0                                   i=0

   The range decoder extracts the symbols and integers encoded using the
   range encoder in Section 5.1.  The range decoder maintains an
   internal state vector composed of the two-tuple (val, rng), where val
   represents the difference between the high end of the current range
   and the actual coded value, minus one, and rng represents the size of
   the current range.  Both val and rng are 32-bit unsigned integer
   values.

4.1.1.  Range Decoder Initialization

   Let b0 be an 8-bit unsigned integer containing first input byte (or
   containing zero if there are no bytes in this Opus frame).  The
   decoder initializes rng to 128 and initializes val to (127 -
    (b0>>1)), where (b0>>1) is the top 7 bits of the first input byte.
   It saves the remaining bit, (b0&1), for use in the renormalization
   procedure described in Section 4.1.2.1, which the decoder invokes
   immediately after initialization to read additional bits and
   establish the invariant that rng > 2**23.

4.1.2.  Decoding Symbols

   Decoding a symbol is a two-step process.  The first step determines a
   16-bit unsigned value fs, which lies within the range of some symbol
   in the current context.  The second step updates the range decoder
   state with the three-tuple (fl[k], fh[k], ft) corresponding to that
   symbol.

   The first step is implemented by ec_decode() (entdec.c), which
   computes

                                      val
                       fs = ft - min(------ + 1, ft)
                                     rng/ft

   The divisions here are integer division.

   The decoder then identifies the symbol in the current context
   corresponding to fs; i.e., the value of k whose three-tuple
   (fl[k], fh[k], ft) satisfies fl[k] <= fs < fh[k].  It uses this tuple
   to update val according to

RFC6716 - Page 26

                                   rng
                       val = val - --- * (ft - fh[k])
                                   ft

   If fl[k] is greater than zero, then the decoder updates rng using

                              rng
                        rng = --- * (fh[k] - fl[k])
                              ft

   Otherwise, it updates rng using

                                   rng
                       rng = rng - --- * (ft - fh[k])
                                   ft

   Using a special case for the first symbol (rather than the last
   symbol, as is commonly done in other arithmetic coders) ensures that
   all the truncation error from the finite precision arithmetic
   accumulates in symbol 0.  This makes the cost of coding a 0 slightly
   smaller, on average, than its estimated probability indicates and
   makes the cost of coding any other symbol slightly larger.  When
   contexts are designed so that 0 is the most probable symbol, which is
   often the case, this strategy minimizes the inefficiency introduced
   by the finite precision.  It also makes some of the special-case
   decoding routines in Section 4.1.3 particularly simple.

   After the updates, implemented by ec_dec_update() (entdec.c), the
   decoder normalizes the range using the procedure in the next section,
   and returns the index k.

4.1.2.1.  Renormalization

   To normalize the range, the decoder repeats the following process,
   implemented by ec_dec_normalize() (entdec.c), until rng > 2**23.  If
   rng is already greater than 2**23, the entire process is skipped.
   First, it sets rng to (rng<<8).  Then, it reads the next byte of the
   Opus frame and forms an 8-bit value sym, using the leftover bit
   buffered from the previous byte as the high bit and the top 7 bits of
   the byte just read as the other 7 bits of sym.  The remaining bit in
   the byte just read is buffered for use in the next iteration.  If no
   more input bytes remain, it uses zero bits instead.  See
   Section 4.1.1 for the initialization used to process the first byte.
   Then, it sets

                 val = ((val<<8) + (255-sym)) & 0x7FFFFFFF

RFC6716 - Page 27

   It is normal and expected that the range decoder will read several
   bytes into the data of the raw bits (if any) at the end of the frame
   by the time the frame is completely decoded, as illustrated in
   Figure 13.  This same data MUST also be returned as raw bits when
   requested.  The encoder is expected to terminate the stream in such a
   way that the range decoder will decode the intended values regardless
   of the data contained in the raw bits.  Section 5.1.5 describes a
   procedure for doing this.  If the range decoder consumes all of the
   bytes belonging to the current frame, it MUST continue to use zero
   when any further input bytes are required, even if there is
   additional data in the current packet from padding or other frames.

      n              n+1             n+2             n+3
      0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7 0 1 2 3 4 5 6 7
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
     :     | <----------- Overlap region ------------> |             :
     +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           ^                                           ^
           |   End of data buffered by the range coder |
     ...-----------------------------------------------+
           |
           | End of data consumed by raw bits
           +-------------------------------------------------------...

          Figure 13: Illustrative Example of Raw Bits Overlapping
                             Range Coder Data

4.1.3.  Alternate Decoding Methods

   The reference implementation uses three additional decoding methods
   that are exactly equivalent to the above but make assumptions and
   simplifications that allow for a more efficient implementation.

4.1.3.1.  ec_decode_bin()

   The first is ec_decode_bin() (entdec.c), defined using the parameter
   ftb instead of ft.  It is mathematically equivalent to calling
   ec_decode() with ft = (1<<ftb), but it avoids one of the divisions.

4.1.3.2.  ec_dec_bit_logp()

   The next is ec_dec_bit_logp() (entdec.c), which decodes a single
   binary symbol, replacing both the ec_decode() and ec_dec_update()
   steps.  The context is described by a single parameter, logp, which
   is the absolute value of the base-2 logarithm of the probability of a
   "1".  It is mathematically equivalent to calling ec_decode() with
   ft = (1<<logp), followed by ec_dec_update() with the 3-tuple
   (fl[k] = 0, fh[k] = (1<<logp) - 1, ft = (1<<logp)) if the returned

RFC6716 - Page 28

   value of fs is less than (1<<logp) - 1 (a "0" was decoded), and with
   (fl[k] = (1<<logp) - 1, fh[k] = ft = (1<<logp)) otherwise (a "1" was
   decoded).  The implementation requires no multiplications or
   divisions.

4.1.3.3.  ec_dec_icdf()

   The last is ec_dec_icdf() (entdec.c), which decodes a single symbol
   with a table-based context of up to 8 bits, also replacing both the
   ec_decode() and ec_dec_update() steps, as well as the search for the
   decoded symbol in between.  The context is described by two
   parameters, an icdf ("inverse" cumulative distribution function)
   table and ftb.  As with ec_decode_bin(), (1<<ftb) is equivalent to
   ft. idcf[k], on the other hand, stores (1<<ftb)-fh[k], which is equal
   to (1<<ftb) - fl[k+1]. fl[0] is assumed to be 0, and the table is
   terminated by a value of 0 (where fh[k] == ft).

   The function is mathematically equivalent to calling ec_decode() with
   ft = (1<<ftb), using the returned value fs to search the table for
   the first entry where fs < (1<<ftb)-icdf[k], and calling
   ec_dec_update() with fl[k] = (1<<ftb) - icdf[k-1] (or 0 if k == 0),
   fh[k] = (1<<ftb) - idcf[k], and ft = (1<<ftb).  Combining the search
   with the update allows the division to be replaced by a series of
   multiplications (which are usually much cheaper), and using an
   inverse CDF allows the use of an ftb as large as 8 in an 8-bit table
   without any special cases.  This is the primary interface with the
   range decoder in the SILK layer, though it is used in a few places in
   the CELT layer as well.

   Although icdf[k] is more convenient for the code, the frequency
   counts, f[k], are a more natural representation of the probability
   distribution function (PDF) for a given symbol.  Therefore, this
   document lists the latter, not the former, when describing the
   context in which a symbol is coded as a list, e.g., {4, 4, 4, 4}/16
   for a uniform context with four possible values and ft = 16.  The
   value of ft after the slash is always the sum of the entries in the
   PDF, but is included for convenience.  Contexts with identical
   probabilities, f[k]/ft, but different values of ft (or equivalently,
   ftb) are not the same, and cannot, in general, be used in place of
   one another.  An icdf table is also not capable of representing a PDF
   where the first symbol has 0 probability.  In such contexts,
   ec_dec_icdf() can decode the symbol by using a table that drops the
   entries for any initial zero-probability values and by adding the
   constant offset of the first value with a non-zero probability to its
   return value.

RFC6716 - Page 29

4.1.4.  Decoding Raw Bits

   The raw bits used by the CELT layer are packed at the end of the
   frame, with the least significant bit of the first value packed in
   the least significant bit of the last byte, filling up to the most
   significant bit in the last byte, continuing on to the least
   significant bit of the penultimate byte, and so on.  The reference
   implementation reads them using ec_dec_bits() (entdec.c).  Because
   the range decoder must read several bytes ahead in the stream, as
   described in Section 4.1.2.1, the input consumed by the raw bits may
   overlap with the input consumed by the range coder, and a decoder
   MUST allow this.  The format should render it impossible to attempt
   to read more raw bits than there are actual bits in the frame, though
   a decoder may wish to check for this and report an error.

4.1.5.  Decoding Uniformly Distributed Integers

   The function ec_dec_uint() (entdec.c) decodes one of ft equiprobable
   values in the range 0 to (ft - 1), inclusive, each with a frequency
   of 1, where ft may be as large as (2**32 - 1).  Because ec_decode()
   is limited to a total frequency of (2**16 - 1), it splits up the
   value into a range coded symbol representing up to 8 of the high
   bits, and, if necessary, raw bits representing the remainder of the
   value.  The limit of 8 bits in the range coded symbol is a trade-off
   between implementation complexity, modeling error (since the symbols
   no longer truly have equal coding cost), and rounding error
   introduced by the range coder itself (which gets larger as more bits
   are included).  Using raw bits reduces the maximum number of
   divisions required in the worst case, but means that it may be
   possible to decode a value outside the range 0 to (ft - 1),
   inclusive.

   ec_dec_uint() takes a single, positive parameter, ft, which is not
   necessarily a power of two, and returns an integer, t, whose value
   lies between 0 and (ft - 1), inclusive.  Let ftb = ilog(ft - 1),
   i.e., the number of bits required to store (ft - 1) in two's
   complement notation.  If ftb is 8 or less, then t is decoded with
   t = ec_decode(ft), and the range coder state is updated using the
   three-tuple (t, t + 1, ft).

   If ftb is greater than 8, then the top 8 bits of t are decoded using

                 t = ec_decode(((ft - 1) >> (ftb - 8)) + 1)

   the decoder state is updated using the three-tuple (t, t + 1, ((ft -
    1) >> (ftb - 8)) + 1), and the remaining bits are decoded as raw
   bits, setting

RFC6716 - Page 30

                t = (t << (ftb - 8)) | ec_dec_bits(ftb - 8)

   If, at this point, t >= ft, then the current frame is corrupt.  In
   that case, the decoder should assume there has been an error in the
   coding, decoding, or transmission and SHOULD take measures to conceal
   the error (e.g., saturate to ft-1 or use the Packet Loss Concealment
   (PLC)) and/or report to the application that the error has occurred.

4.1.6.  Current Bit Usage

   The bit allocation routines in the CELT decoder need a conservative
   upper bound on the number of bits that have been used from the
   current frame thus far, including both range coder bits and raw bits.
   This drives allocation decisions that must match those made in the
   encoder.  The upper bound is computed in the reference implementation
   to whole-bit precision by the function ec_tell() (entcode.h) and to
   fractional 1/8th bit precision by the function ec_tell_frac()
   (entcode.c).  Like all operations in the range coder, it must be
   implemented in a bit-exact manner, and it must produce exactly the
   same value returned by the same functions in the encoder after
   encoding the same symbols.

   ec_tell() is guaranteed to return ceil(ec_tell_frac()/8.0).  In
   various places, the codec will check to ensure there is enough room
   to contain a symbol before attempting to decode it.  In practice,
   although the number of bits used so far is an upper bound, decoding a
   symbol whose probability model suggests it has a worst-case cost of p
   1/8th bits may actually advance the return value of ec_tell_frac() by
   p-1, p, or p+1 1/8th bits, due to approximation error in that upper
   bound, truncation error in the range coder, and for large values of
   ft, modeling error in ec_dec_uint().

   However, this error is bounded, and periodic calls to ec_tell() or
   ec_tell_frac() at precisely defined points in the decoding process
   prevent it from accumulating.  For a range coder symbol that requires
   a whole number of bits (i.e., for which ft/(fh[k] - fl[k]) is a power
   of two), where there are at least p 1/8th bits available, decoding
   the symbol will never cause ec_tell() or ec_tell_frac() to exceed the
   size of the frame ("bust the budget").  In this case, the return
   value of ec_tell_frac() will only advance by more than p 1/8th bits
   if there were an additional, fractional number of bits remaining, and
   it will never advance beyond the next whole-bit boundary, which is
   safe, since frames always contain a whole number of bits.  However,
   when p is not a whole number of bits, an extra 1/8th bit is required
   to ensure that decoding the symbol will not bust the budget.

RFC6716 - Page 31

   The reference implementation keeps track of the total number of whole
   bits that have been processed by the decoder so far in the variable
   nbits_total, including the (possibly fractional) number of bits that
   are currently buffered, but not consumed, inside the range coder.
   nbits_total is initialized to 9 just before the initial range
   renormalization process completes (or equivalently, it can be
   initialized to 33 after the first renormalization).  The extra two
   bits over the actual amount buffered by the range coder guarantees
   that it is an upper bound and that there is enough room for the
   encoder to terminate the stream.  Each iteration through the range
   coder's renormalization loop increases nbits_total by 8.  Reading raw
   bits increases nbits_total by the number of raw bits read.

4.1.6.1.  ec_tell()

   The whole number of bits buffered in rng may be estimated via
   lg = ilog(rng). ec_tell() then becomes a simple matter of removing
   these bits from the total.  It returns (nbits_total - lg).

   In a newly initialized decoder, before any symbols have been read,
   this reports that 1 bit has been used.  This is the bit reserved for
   termination of the encoder.

4.1.6.2.  ec_tell_frac()

   ec_tell_frac() estimates the number of bits buffered in rng to
   fractional precision.  Since rng must be greater than 2**23 after
   renormalization, lg must be at least 24.  Let

                           r_Q15 = rng >> (lg-16)

   so that 32768 <= r_Q15 < 65536, an unsigned Q15 value representing
   the fractional part of rng.  Then, the following procedure can be
   used to add one bit of precision to lg.  First, update

                        r_Q15 = (r_Q15*r_Q15) >> 15

   Then, add the 16th bit of r_Q15 to lg via

                         lg = 2*lg + (r_Q15 >> 16)

   Finally, if this bit was a 1, reduce r_Q15 by a factor of two via

                             r_Q15 = r_Q15 >> 1

   so that it once again lies in the range 32768 <= r_Q15 < 65536.  This
   procedure is repeated three times to extend lg to 1/8th bit
   precision. ec_tell_frac() then returns (nbits_total*8 - lg).

(next page on part 3)