
RFC 6716

Definition of the Opus Audio Codec

Pages: 326
Proposed Standard
Updated by:  8251

Top   ToC   RFC6716 - Page 1
Internet Engineering Task Force (IETF)                         JM. Valin
Request for Comments: 6716                           Mozilla Corporation
Category: Standards Track                                         K. Vos
ISSN: 2070-1721                                  Skype Technologies S.A.
                                                           T. Terriberry
                                                     Mozilla Corporation
                                                          September 2012


                   Definition of the Opus Audio Codec

Abstract

This document defines the Opus interactive speech and audio codec. Opus is designed to handle a wide range of interactive audio applications, including Voice over IP, videoconferencing, in-game chat, and even live, distributed music performances. It scales from low bitrate narrowband speech at 6 kbit/s to very high quality stereo music at 510 kbit/s. Opus uses both Linear Prediction (LP) and the Modified Discrete Cosine Transform (MDCT) to achieve good compression of both speech and music.

Status of This Memo

This is an Internet Standards Track document.

This document is a product of the Internet Engineering Task Force (IETF). It represents the consensus of the IETF community. It has received public review and has been approved for publication by the Internet Engineering Steering Group (IESG). Further information on Internet Standards is available in Section 2 of RFC 5741.

Information about the current status of this document, any errata, and how to provide feedback on it may be obtained at http://www.rfc-editor.org/info/rfc6716.

Copyright Notice

   Copyright (c) 2012 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

   The licenses granted by the IETF Trust to this RFC under Section 3.c
   of the Trust Legal Provisions shall also include the right to extract
   text from Sections 1 through 8 and Appendix A and Appendix B of this
   RFC and create derivative works from these extracts, and to copy,
   publish, display and distribute such derivative works in any medium
   and for any purpose, provided that no such derivative work shall be
   presented, displayed or published in a manner that states or implies
   that it is part of this RFC or any other IETF Document.

Table of Contents

   1. Introduction ....................................................5
      1.1. Notation and Conventions ...................................6
   2. Opus Codec Overview .............................................8
      2.1. Control Parameters ........................................10
           2.1.1. Bitrate ............................................10
           2.1.2. Number of Channels (Mono/Stereo) ...................11
           2.1.3. Audio Bandwidth ....................................11
           2.1.4. Frame Duration .....................................11
           2.1.5. Complexity .........................................11
           2.1.6. Packet Loss Resilience .............................12
           2.1.7. Forward Error Correction (FEC) .....................12
           2.1.8. Constant/Variable Bitrate ..........................12
           2.1.9. Discontinuous Transmission (DTX) ...................13
   3. Internal Framing ...............................................13
      3.1. The TOC Byte ..............................................13
      3.2. Frame Packing .............................................16
           3.2.1. Frame Length Coding ................................16
           3.2.2. Code 0: One Frame in the Packet ....................16
           3.2.3. Code 1: Two Frames in the Packet, Each with
                  Equal Compressed Size ..............................17
           3.2.4. Code 2: Two Frames in the Packet, with
                  Different Compressed Sizes .........................17
           3.2.5. Code 3: A Signaled Number of Frames in the Packet ..18
      3.3. Examples ..................................................21
      3.4. Receiving Malformed Packets ...............................22
   4. Opus Decoder ...................................................23
      4.1. Range Decoder .............................................23
           4.1.1. Range Decoder Initialization .......................25
           4.1.2. Decoding Symbols ...................................25
           4.1.3. Alternate Decoding Methods .........................27
           4.1.4. Decoding Raw Bits ..................................29
           4.1.5. Decoding Uniformly Distributed Integers ............29
           4.1.6. Current Bit Usage ..................................30
      4.2. SILK Decoder ..............................................32
           4.2.1. SILK Decoder Modules ...............................32
           4.2.2. LP Layer Organization ..............................33
           4.2.3. Header Bits ........................................35
           4.2.4. Per-Frame LBRR Flags ...............................36
           4.2.5. LBRR Frames ........................................36
           4.2.6. Regular SILK Frames ................................37
           4.2.7. SILK Frame Contents ................................37
                  4.2.7.1. Stereo Prediction Weights .................40
                  4.2.7.2. Mid-Only Flag .............................42
                  4.2.7.3. Frame Type ................................43
                  4.2.7.4. Subframe Gains ............................44
                  4.2.7.5. Normalized Line Spectral Frequency
                           (LSF) and Linear Predictive Coding (LPC)
                           Coefficients ..............................46
                  4.2.7.6. Long-Term Prediction (LTP) Parameters .....74
                  4.2.7.7. Linear Congruential Generator (LCG) Seed ..86
                  4.2.7.8. Excitation ................................86
                  4.2.7.9. SILK Frame Reconstruction .................98
           4.2.8. Stereo Unmixing ...................................102
           4.2.9. Resampling ........................................103
      4.3. CELT Decoder .............................................104
           4.3.1. Transient Decoding ................................108
           4.3.2. Energy Envelope Decoding ..........................108
           4.3.3. Bit Allocation ....................................110
           4.3.4. Shape Decoding ....................................116
           4.3.5. Anti-collapse Processing ..........................120
           4.3.6. Denormalization ...................................121
           4.3.7. Inverse MDCT ......................................121
      4.4. Packet Loss Concealment (PLC) ............................122
           4.4.1. Clock Drift Compensation ..........................122
      4.5. Configuration Switching ..................................123
           4.5.1. Transition Side Information (Redundancy) ..........124
           4.5.2. State Reset .......................................127
           4.5.3. Summary of Transitions ............................128
   5. Opus Encoder ..................................................131
      5.1. Range Encoder ............................................132
           5.1.1. Encoding Symbols ..................................133
           5.1.2. Alternate Encoding Methods ........................134
           5.1.3. Encoding Raw Bits .................................135
           5.1.4. Encoding Uniformly Distributed Integers ...........135
           5.1.5. Finalizing the Stream .............................135
           5.1.6. Current Bit Usage .................................136
      5.2. SILK Encoder .............................................136
           5.2.1. Sample Rate Conversion ............................137
           5.2.2. Stereo Mixing .....................................137
           5.2.3. SILK Core Encoder .................................138
      5.3. CELT Encoder .............................................150
           5.3.1. Pitch Pre-filter ..................................150
           5.3.2. Bands and Normalization ...........................151
           5.3.3. Energy Envelope Quantization ......................151
           5.3.4. Bit Allocation ....................................151
           5.3.5. Stereo Decisions ..................................152
           5.3.6. Time-Frequency Decision ...........................153
           5.3.7. Spreading Values Decision .........................153
           5.3.8. Spherical Vector Quantization .....................154
   6. Conformance ...................................................155
      6.1. Testing ..................................................155
      6.2. Opus Custom ..............................................156
   7. Security Considerations .......................................157
   8. Acknowledgements ..............................................158
   9. References ....................................................159
      9.1. Normative References .....................................159
      9.2. Informative References ...................................159
   Appendix A. Reference Implementation .............................163
      A.1. Extracting the Source ....................................164
      A.2. Up-to-Date Implementation ................................164
      A.3. Base64-Encoded Source Code ...............................164
      A.4. Test Vectors .............................................321
   Appendix B. Self-Delimiting Framing ..............................321

1. Introduction

The Opus codec is a real-time interactive audio codec designed to meet the requirements described in [REQUIREMENTS]. It is composed of a layer based on Linear Prediction (LP) [LPC] and a layer based on the Modified Discrete Cosine Transform (MDCT) [MDCT]. The main idea behind using two layers is as follows: in speech, linear prediction techniques (such as Code-Excited Linear Prediction, or CELP) code low frequencies more efficiently than transform (e.g., MDCT) domain techniques, while the situation is reversed for music and higher speech frequencies. Thus, a codec with both layers available can operate over a wider range than either one alone and can achieve better quality by combining them than by using either one individually.

The primary normative part of this specification is provided by the source code in Appendix A. Only the decoder portion of this software is normative, though a significant amount of code is shared by both the encoder and decoder. Section 6 provides a decoder conformance test.

The decoder contains a great deal of integer and fixed-point arithmetic that needs to be performed exactly, including all rounding considerations, so any useful specification requires domain-specific symbolic language to adequately define these operations. Additionally, any conflict between the symbolic representation and the included reference implementation must be resolved. For the practical reasons of compatibility and testability, it would be advantageous to give the reference implementation priority in any disagreement. The C language is also one of the most widely understood, human-readable symbolic representations for machine behavior. For these reasons, this RFC uses the reference implementation as the sole symbolic representation of the codec.

While the symbolic representation is unambiguous and complete, it is not always the easiest way to understand the codec's operation. For this reason, this document also describes significant parts of the codec in prose and takes the opportunity to explain the rationale behind many of the more surprising elements of the design. These descriptions are intended to be accurate and informative, but the limitations of common English sometimes result in ambiguity, so it is expected that the reader will always read them alongside the symbolic representation. Numerous references to the implementation are provided for this purpose. The descriptions sometimes differ from the reference in ordering or through mathematical simplification wherever such deviation makes an explanation easier to understand. For example, the right shift and left shift operations in the reference implementation are often described using division and multiplication in the text. In general, the text is focused on the "what" and "why" while the symbolic representation most clearly provides the "how".

1.1. Notation and Conventions

The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in RFC 2119 [RFC2119].

Various operations in the codec require bit-exact fixed-point behavior, even when writing a floating point implementation. The notation "Q<n>", where n is an integer, denotes the number of binary digits to the right of the decimal point in a fixed-point number. For example, a signed Q14 value in a 16-bit word can represent values from -2.0 to 1.99993896484375, inclusive. This notation is for informational purposes only. Arithmetic, when described, always operates on the underlying integer. For example, the text will explicitly indicate any shifts required after a multiplication.

Expressions, where included in the text, follow C operator rules and precedence, with the exception that the syntax "x**y" indicates x raised to the power y.

The text also makes use of the following functions.
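As an illustration of the Q notation, the following C sketch treats a 16-bit word as a Q14 value and shows the explicit shift that follows a multiplication. The names Q14_ONE, q14_to_double, and q14_mul are hypothetical and do not come from the reference implementation. The sketch also assumes that right-shifting a negative integer is an arithmetic shift, which the C language leaves implementation-defined.

```c
#include <stdint.h>

/* Hypothetical helpers for illustration; not part of the reference code. */
#define Q14_ONE (1 << 14)   /* the integer that represents 1.0 in Q14 */

/* Interpret a Q14 integer as a real number (for display only). */
static double q14_to_double(int16_t v)
{
    return (double)v / Q14_ONE;
}

/* Multiply two Q14 values.  The 32-bit product has 28 fractional bits,
 * so it is shifted right by 14 to return to Q14.  The result is kept in
 * 32 bits because it may not fit a 16-bit word (e.g., -2.0 * -2.0).
 * Assumption: >> on a negative value is an arithmetic shift. */
static int32_t q14_mul(int16_t a, int16_t b)
{
    int32_t product_q28 = (int32_t)a * (int32_t)b;
    return product_q28 >> 14;
}
```

With these helpers, q14_to_double(INT16_MIN) yields -2.0 and q14_to_double(INT16_MAX) yields 1.99993896484375, which are the two limits quoted above.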

1.1.1. min(x,y)

The smallest of two values x and y.

1.1.2. max(x,y)

The largest of two values x and y.

1.1.3. clamp(lo,x,hi)

clamp(lo,x,hi) = max(lo,min(x,hi))

With this definition, if lo > hi, then lo is returned.

1.1.4. sign(x)

The sign of x, i.e.,

             ( -1,  x < 0
   sign(x) = <  0,  x == 0
             (  1,  x > 0

1.1.5. abs(x)

The absolute value of x, i.e.,

   abs(x) = sign(x)*x

1.1.6. floor(f)

The largest integer z such that z <= f.

1.1.7. ceil(f)

The smallest integer z such that z >= f.

1.1.8. round(f)

The integer z nearest to f, with ties rounded towards negative infinity, i.e.,

   round(f) = ceil(f - 0.5)

1.1.9. log2(f)

The base-two logarithm of f.

1.1.10. ilog(n)

The minimum number of bits required to store a positive integer n in binary, or 0 for a non-positive integer n.

             ( 0,                 n <= 0
   ilog(n) = <
             ( floor(log2(n))+1,  n > 0

Examples:

   o  ilog(-1) = 0

   o  ilog(0) = 0

   o  ilog(1) = 1

   o  ilog(2) = 2

   o  ilog(3) = 2

   o  ilog(4) = 3

   o  ilog(7) = 3
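The functions defined in this section can be sketched in C as follows. This is only an illustration of the definitions given in the text, restricted to int and double arguments. The names carry a suffix (clamp_i, sign_i, round_ties_down, and so on) to avoid clashing with the standard library; they are hypothetical and are not the names used by the reference implementation.

```c
#include <stdint.h>

static int min_i(int x, int y) { return x < y ? x : y; }
static int max_i(int x, int y) { return x > y ? x : y; }

/* clamp(lo,x,hi) = max(lo,min(x,hi)); returns lo when lo > hi. */
static int clamp_i(int lo, int x, int hi) { return max_i(lo, min_i(x, hi)); }

/* -1, 0, or 1 according to the sign of x. */
static int sign_i(int x) { return (x > 0) - (x < 0); }

/* abs(x) = sign(x)*x (overflows for INT_MIN, like the standard abs). */
static int abs_i(int x) { return sign_i(x) * x; }

/* Smallest integer z with z >= f, for f within the range of long. */
static long ceil_l(double f)
{
    long z = (long)f;                 /* conversion truncates towards zero */
    return ((double)z < f) ? z + 1 : z;
}

/* round(f) = ceil(f - 0.5): ties go towards negative infinity. */
static long round_ties_down(double f) { return ceil_l(f - 0.5); }

/* Number of bits needed to store n, or 0 when n <= 0. */
static int ilog(int32_t n)
{
    int bits = 0;
    while (n > 0) {
        bits++;
        n >>= 1;
    }
    return bits;
}
```

Note that round_ties_down(0.5) is 0 and round_ties_down(-0.5) is -1, which differs from the C library's round(), where ties are rounded away from zero.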

2. Opus Codec Overview

The Opus codec scales from 6 kbit/s narrowband mono speech to 510 kbit/s fullband stereo music, with algorithmic delays ranging from 5 ms to 65.2 ms. At any given time, either the LP layer, the MDCT layer, or both, may be active. It can seamlessly switch between all of its various operating modes, giving it a great deal of flexibility to adapt to varying content and network conditions without renegotiating the current session. The codec allows input and output of various audio bandwidths, defined as follows:

   +----------------------+-----------------+-------------------------+
   | Abbreviation         | Audio Bandwidth | Sample Rate (Effective) |
   +----------------------+-----------------+-------------------------+
   | NB (narrowband)      |           4 kHz |                   8 kHz |
   | MB (medium-band)     |           6 kHz |                  12 kHz |
   | WB (wideband)        |           8 kHz |                  16 kHz |
   | SWB (super-wideband) |          12 kHz |                  24 kHz |
   | FB (fullband)        |      20 kHz (*) |                  48 kHz |
   +----------------------+-----------------+-------------------------+

                                  Table 1

(*) Although the sampling theorem allows a bandwidth as large as half the sampling rate, Opus never codes audio above 20 kHz, as that is the generally accepted upper limit of human hearing.

Opus defines super-wideband (SWB) with an effective sample rate of 24 kHz, unlike some other audio coding standards that use 32 kHz. This was chosen for a number of reasons. The band layout in the MDCT layer naturally allows skipping coefficients for frequencies over 12 kHz, but does not allow cleanly dropping just those frequencies over 16 kHz. A sample rate of 24 kHz also makes resampling in the MDCT layer easier, as 24 evenly divides 48, and when 24 kHz is sufficient, it can save computation in other processing, such as Acoustic Echo Cancellation (AEC). Experimental changes to the band layout to allow a 16 kHz cutoff (32 kHz effective sample rate) showed potential quality degradations at other sample rates, and, at typical bitrates, the number of bits saved by using such a cutoff instead of coding in fullband (FB) mode is very small. Therefore, if an application wishes to process a signal sampled at 32 kHz, it should just use FB.

   The LP layer is based on the SILK codec [SILK].  It supports NB, MB,
   or WB audio and frame sizes from 10 ms to 60 ms, and requires an
   additional 5 ms look-ahead for noise shaping estimation.  A small
   additional delay (up to 1.5 ms) may be required for sampling rate
   conversion.  Like Vorbis [VORBIS-WEBSITE] and many other modern
   codecs, SILK is inherently designed for variable bitrate (VBR)
   coding, though the encoder can also produce constant bitrate (CBR)
   streams.  The version of SILK used in Opus is substantially modified
   from, and not compatible with, the stand-alone SILK codec previously
   deployed by Skype.  This document does not serve to define that
   format, but those interested in the original SILK codec should see
   [SILK] instead.

   The MDCT layer is based on the Constrained-Energy Lapped Transform
   (CELT) codec [CELT].  It supports NB, WB, SWB, or FB audio and frame
   sizes from 2.5 ms to 20 ms, and requires an additional 2.5 ms look-
   ahead due to the overlapping MDCT windows.  The CELT codec is
   inherently designed for CBR coding, but unlike many CBR codecs, it is
   not limited to a set of predetermined rates.  It internally allocates
   bits to exactly fill any given target budget, and an encoder can
   produce a VBR stream by varying the target on a per-frame basis.  The
   MDCT layer is not used for speech when the audio bandwidth is WB or
   less, as it is not useful there.  On the other hand, non-speech
   signals are not always adequately coded using linear prediction.
   Therefore, the MDCT layer should be used for music signals.

   A "Hybrid" mode allows the use of both layers simultaneously with a
   frame size of 10 or 20 ms and an SWB or FB audio bandwidth.  The LP
   layer codes the low frequencies by resampling the signal down to WB.
   The MDCT layer follows, coding the high frequency portion of the
   signal.  The cutoff between the two lies at 8 kHz, the maximum WB
   audio bandwidth.  In the MDCT layer, all bands below 8 kHz are
   discarded, so there is no coding redundancy between the two layers.

   The sample rate (in contrast to the actual audio bandwidth) can be
   chosen independently on the encoder and decoder side, e.g., a
   fullband signal can be decoded as wideband, or vice versa.  This
   approach ensures a sender and receiver can always interoperate,
   regardless of the capabilities of their actual audio hardware.
   Internally, the LP layer always operates at a sample rate of twice
   the audio bandwidth, up to a maximum of 16 kHz, which it continues to
   use for SWB and FB.  The decoder simply resamples its output to
   support different sample rates.  The MDCT layer always operates
   internally at a sample rate of 48 kHz.  Since all the supported
   sample rates evenly divide this rate, and since the decoder may
   easily zero out the high frequency portion of the spectrum in the
   frequency domain, it can simply decimate the MDCT layer output to
   achieve the other supported sample rates very cheaply.
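The relationship between audio bandwidth and internal sample rate described above can be sketched in C as follows. The enum and function names are hypothetical and are chosen only for this illustration; the rates are those of Table 1, and the sketch does not reflect how the reference implementation organizes this logic.

```c
/* Hypothetical names for the bandwidths of Table 1. */
typedef enum { BW_NB, BW_MB, BW_WB, BW_SWB, BW_FB } bandwidth_t;

/* Effective sample rate from Table 1, in Hz. */
static int effective_rate_hz(bandwidth_t bw)
{
    static const int rates[] = { 8000, 12000, 16000, 24000, 48000 };
    return rates[bw];
}

/* The LP layer runs at twice the audio bandwidth, capped at 16 kHz,
 * which it keeps using for SWB and FB. */
static int lp_internal_rate_hz(bandwidth_t bw)
{
    int rate = effective_rate_hz(bw);
    return rate < 16000 ? rate : 16000;
}

/* The MDCT layer always runs at 48 kHz.  Returns the integer decimation
 * factor needed to reach output_rate_hz, or 0 if that rate does not
 * evenly divide 48 kHz. */
static int mdct_decimation_factor(int output_rate_hz)
{
    if (output_rate_hz <= 0 || 48000 % output_rate_hz != 0)
        return 0;
    return 48000 / output_rate_hz;
}
```

For the sample rates of Table 1, the decimation factors are 6, 4, 3, 2, and 1, respectively.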
   After conversion to the common, desired output sample rate, the
   decoder simply adds the output from the two layers together.  To
   compensate for the different look-ahead required by each layer, the
   CELT encoder input is delayed by an additional 2.7 ms.  This ensures
   that low frequencies and high frequencies arrive at the same time.
   This extra delay may be reduced by an encoder by using less look-
   ahead for noise shaping or using a simpler resampler in the LP layer,
   but this will reduce quality.  However, the base 2.5 ms look-ahead in
   the CELT layer cannot be reduced in the encoder because it is needed
   for the MDCT overlap, whose size is fixed by the decoder.
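One way to relate these look-ahead figures to the 5 ms to 65.2 ms delay range quoted at the start of this section is sketched below. This is an interpretation and not a statement from the specification: it assumes that the algorithmic delay is the frame duration plus the 2.5 ms MDCT look-ahead, plus the additional 2.7 ms whenever the LP layer is in use. The function name and constants are hypothetical. Durations are expressed in tenths of a millisecond so that the arithmetic stays in integers.

```c
/* Assumed model, for illustration only (units: tenths of a millisecond). */
enum {
    MDCT_LOOKAHEAD_TENTHS  = 25,  /* 2.5 ms MDCT overlap look-ahead */
    LP_EXTRA_DELAY_TENTHS  = 27   /* 2.7 ms extra delay on the CELT input */
};

/* Frame duration plus look-ahead under the assumption stated above. */
static int algorithmic_delay_tenths(int frame_tenths, int uses_lp_layer)
{
    int delay = frame_tenths + MDCT_LOOKAHEAD_TENTHS;
    if (uses_lp_layer)
        delay += LP_EXTRA_DELAY_TENTHS;
    return delay;
}
```

Under this assumption, a 2.5 ms MDCT-only frame gives 5 ms and a 60 ms frame using the LP layer gives 65.2 ms, which agrees with the stated range.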

   Both layers use the same entropy coder, avoiding any waste from
   "padding bits" between them.  The hybrid approach makes it easy to
   support both CBR and VBR coding.  Although the LP layer is VBR, the
   bit allocation of the MDCT layer can produce a final stream that is
   CBR by using all the bits left unused by the LP layer.

2.1. Control Parameters

The Opus codec includes a number of control parameters that can be changed dynamically during regular operation of the codec, without interrupting the audio stream from the encoder to the decoder. These parameters only affect the encoder since any impact they have on the bitstream is signaled in-band such that a decoder can decode any Opus stream without any out-of-band signaling. Any Opus implementation can add or modify these control parameters without affecting interoperability. The most important encoder control parameters in the reference encoder are listed below.

2.1.1. Bitrate

Opus supports all bitrates from 6 kbit/s to 510 kbit/s. All other parameters being equal, higher bitrate results in higher quality. For a frame size of 20 ms, these are the bitrate "sweet spots" for Opus in various configurations:

   o  8-12 kbit/s for NB speech,

   o  16-20 kbit/s for WB speech,

   o  28-40 kbit/s for FB speech,

   o  48-64 kbit/s for FB mono music, and

   o  64-128 kbit/s for FB stereo music.

2.1.2. Number of Channels (Mono/Stereo)

Opus can transmit either mono or stereo frames within a single stream. When decoding a mono frame in a stereo decoder, the left and right channels are identical, and when decoding a stereo frame in a mono decoder, the mono output is the average of the left and right channels. In some cases, it is desirable to encode a stereo input stream in mono (e.g., because the bitrate is too low to encode stereo with sufficient quality). The number of channels encoded can be selected in real-time, but by default the reference encoder attempts to make the best decision possible given the current bitrate.
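The channel conversions described above can be pictured with the following C sketch, which operates on already-decoded 16-bit PCM. It is only a conceptual illustration: the function names are hypothetical, the interleaved left/right layout is an assumption, and the exact arithmetic (including rounding) used by the decoder is defined by the reference implementation in Appendix A, not by this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Mono output as the average of left and right.  Input is interleaved
 * L,R,L,R,...; "frames" is the number of sample pairs.  The division
 * truncates towards zero, which is a simplification. */
static void stereo_to_mono(const int16_t *stereo, int16_t *mono, size_t frames)
{
    for (size_t i = 0; i < frames; i++) {
        int32_t sum = (int32_t)stereo[2 * i] + (int32_t)stereo[2 * i + 1];
        mono[i] = (int16_t)(sum / 2);
    }
}

/* Stereo output from a mono frame: both channels are identical. */
static void mono_to_stereo(const int16_t *mono, int16_t *stereo, size_t frames)
{
    for (size_t i = 0; i < frames; i++) {
        stereo[2 * i]     = mono[i];
        stereo[2 * i + 1] = mono[i];
    }
}
```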

2.1.3. Audio Bandwidth

The audio bandwidths supported by Opus are listed in Table 1. Just like for the number of channels, any decoder can decode audio that is encoded at any bandwidth. For example, any Opus decoder operating at 8 kHz can decode an FB Opus frame, and any Opus decoder operating at 48 kHz can decode an NB frame. Similarly, the reference encoder can take a 48 kHz input signal and encode it as NB. The higher the audio bandwidth, the higher the required bitrate to achieve acceptable quality. The audio bandwidth can be explicitly specified in real-time, but, by default, the reference encoder attempts to make the best bandwidth decision possible given the current bitrate.

2.1.4. Frame Duration

Opus can encode frames of 2.5, 5, 10, 20, 40, or 60 ms. It can also combine multiple frames into packets of up to 120 ms. For real-time applications, sending fewer packets per second reduces the bitrate, since it reduces the overhead from IP, UDP, and RTP headers. However, it increases latency and sensitivity to packet losses, as losing one packet constitutes a loss of a bigger chunk of audio. Increasing the frame duration also slightly improves coding efficiency, but the gain becomes small for frame sizes above 20 ms. For this reason, 20 ms frames are a good choice for most applications.
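The effect of frame duration on header overhead can be estimated with a short calculation, sketched here in C. The function names are hypothetical. The 40-byte figure assumes one packet per frame duration carried over IPv4 (20 bytes), UDP (8 bytes), and RTP (12 bytes) with no header options or extensions; other transports will give different numbers. Durations are in tenths of a millisecond so that 2.5 ms can be expressed as an integer.

```c
/* Assumed per-packet header size: IPv4 (20) + UDP (8) + RTP (12) bytes. */
enum { HEADER_BYTES = 20 + 8 + 12 };

/* Samples per channel in one frame at the given sample rate. */
static long frame_samples(long rate_hz, long frame_tenths_ms)
{
    return rate_hz * frame_tenths_ms / 10000;
}

/* Bits per second spent on headers when one packet is sent per
 * frame_tenths_ms (integer division, so the result is rounded down). */
static long header_overhead_bps(long frame_tenths_ms)
{
    return (long)HEADER_BYTES * 8 * 10000 / frame_tenths_ms;
}
```

Under these assumptions, 20 ms packets cost 16 kbit/s of header overhead, which is already comparable to the payload bitrate of WB speech, while 2.5 ms packets cost 128 kbit/s.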

2.1.5. Complexity

There are various aspects of the Opus encoding process where trade-offs can be made between CPU complexity and quality/bitrate. In the reference encoder, the complexity is selected using an integer from 0 to 10, where 0 is the lowest complexity and 10 is the highest. Examples of computations for which such trade-offs may occur are:

   o  The order of the pitch analysis whitening filter [WHITENING],

   o  The order of the short-term noise shaping filter,

   o  The number of states in delayed decision quantization of the
      residual signal, and

   o  The use of certain bitstream features such as variable time-
      frequency resolution and the pitch post-filter.

2.1.6. Packet Loss Resilience

Audio codecs often exploit inter-frame correlations to reduce the bitrate at a cost in error propagation: after losing one packet, several packets need to be received before the decoder is able to accurately reconstruct the speech signal. The extent to which Opus exploits inter-frame dependencies can be adjusted on the fly to choose a trade-off between bitrate and amount of error propagation.

2.1.7. Forward Error Correction (FEC)

Another mechanism providing robustness against packet loss is the in-band Forward Error Correction (FEC). Packets that are determined to contain perceptually important speech information, such as onsets or transients, are encoded again at a lower bitrate and this re-encoded information is added to a subsequent packet.

2.1.8. Constant/Variable Bitrate

Opus is more efficient when operating with variable bitrate (VBR), which is the default. When low-latency transmission is required over a relatively slow connection, then constrained VBR can also be used. This uses VBR in a way that simulates a "bit reservoir" and is equivalent to what MP3 (MPEG 1, Layer 3) and AAC (Advanced Audio Coding) call CBR (i.e., not true CBR due to the bit reservoir). In some (rare) applications, constant bitrate (CBR) is required. There are two main reasons to operate in CBR mode:

   o  When the transport only supports a fixed size for each compressed frame, or

   o  When encryption is used for an audio stream that is either highly constrained (e.g., yes/no, recorded prompts) or highly sensitive [SRTP-VBR].

Bitrate may still be allowed to vary, even with sensitive data, as long as the variation is not driven by the input signal (for example, to match changing network conditions). To achieve this, an application should still run Opus in CBR mode, but change the target rate before each packet.
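In CBR mode, the size of each compressed frame follows directly from the target bitrate and the frame duration, so a per-packet change of target rate amounts to choosing a new fixed size. The arithmetic can be sketched in C as follows. The function name is hypothetical and the sketch is not the rate control of the reference encoder; it only shows the relationship between rate, duration, and size. Durations are in tenths of a millisecond.

```c
/* Bytes available for one frame at a constant bitrate (rounded down).
 * bitrate_bps is in bits per second; frame_tenths_ms is the frame
 * duration in tenths of a millisecond (e.g., 200 for 20 ms). */
static long cbr_frame_bytes(long bitrate_bps, long frame_tenths_ms)
{
    long long bits = (long long)bitrate_bps * frame_tenths_ms / 10000;
    return (long)(bits / 8);
}
```

For example, a 20 ms frame occupies 160 bytes at 64 kbit/s, 15 bytes at 6 kbit/s, and 1275 bytes at 510 kbit/s. An application that varies the rate to follow network conditions would pick the target for each packet from information unrelated to the audio content and derive the frame size in this way.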

2.1.9. Discontinuous Transmission (DTX)

Discontinuous Transmission (DTX) reduces the bitrate during silence or background noise. When DTX is enabled, only one frame is encoded every 400 milliseconds.

