
RFC 6386

VP8 Data Format and Decoding Guide


Independent Submission                                       J. Bankoski
Request for Comments: 6386                                   J. Koleszar
Category: Informational                                       L. Quillio
ISSN: 2070-1721                                               J. Salonen
                                                              P. Wilkins
                                                                   Y. Xu
                                                             Google Inc.
                                                           November 2011


                   VP8 Data Format and Decoding Guide

Abstract

   This document describes the VP8 compressed video data format,
   together with a discussion of the decoding procedure for the format.

Status of This Memo

   This document is not an Internet Standards Track specification; it is
   published for informational purposes.

   This is a contribution to the RFC Series, independently of any other
   RFC stream.  The RFC Editor has chosen to publish this document at
   its discretion and makes no statement about its value for
   implementation or deployment.  Documents approved for publication by
   the RFC Editor are not a candidate for any level of Internet
   Standard; see Section 2 of RFC 5741.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc6386.

Copyright Notice

   Copyright (c) 2011 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.

Table of Contents

   1. Introduction
   2. Format Overview
   3. Compressed Frame Types
   4. Overview of Compressed Data Format
   5. Overview of the Decoding Process
   6. Description of Algorithms
   7. Boolean Entropy Decoder
      7.1. Underlying Theory of Coding
      7.2. Practical Algorithm Description
      7.3. Actual Implementation
   8. Compressed Data Components
      8.1. Tree Coding Implementation
      8.2. Tree Coding Example
   9. Frame Header
      9.1. Uncompressed Data Chunk
      9.2. Color Space and Pixel Type (Key Frames Only)
      9.3. Segment-Based Adjustments
      9.4. Loop Filter Type and Levels
      9.5. Token Partition and Partition Data Offsets
      9.6. Dequantization Indices
      9.7. Refresh Golden Frame and Altref Frame
      9.8. Refresh Last Frame Buffer
      9.9. DCT Coefficient Probability Update
      9.10. Remaining Frame Header Data (Non-Key Frame)
      9.11. Remaining Frame Header Data (Key Frame)
   10. Segment-Based Feature Adjustments
   11. Key Frame Macroblock Prediction Records
      11.1. mb_skip_coeff
      11.2. Luma Modes
      11.3. Subblock Mode Contexts
      11.4. Chroma Modes
      11.5. Subblock Mode Probability Table
   12. Intraframe Prediction
      12.1. mb_skip_coeff
      12.2. Chroma Prediction
      12.3. Luma Prediction
   13. DCT Coefficient Decoding
      13.1. Macroblock without Non-Zero Coefficient Values
      13.2. Coding of Individual Coefficient Values
      13.3. Token Probabilities
      13.4. Token Probability Updates
      13.5. Default Token Probability Table
   14. DCT and WHT Inversion and Macroblock Reconstruction
      14.1. Dequantization
      14.2. Inverse Transforms
      14.3. Implementation of the WHT Inversion
      14.4. Implementation of the DCT Inversion
      14.5. Summation of Predictor and Residue
   15. Loop Filter
      15.1. Filter Geometry and Overall Procedure
      15.2. Simple Filter
      15.3. Normal Filter
      15.4. Calculation of Control Parameters
   16. Interframe Macroblock Prediction Records
      16.1. Intra-Predicted Macroblocks
      16.2. Inter-Predicted Macroblocks
      16.3. Mode and Motion Vector Contexts
      16.4. Split Prediction
   17. Motion Vector Decoding
      17.1. Coding of Each Component
      17.2. Probability Updates
   18. Interframe Prediction
      18.1. Bounds on, and Adjustment of, Motion Vectors
      18.2. Prediction Subblocks
      18.3. Sub-Pixel Interpolation
      18.4. Filter Properties
   19. Annex A: Bitstream Syntax
      19.1. Uncompressed Data Chunk
      19.2. Frame Header
      19.3. Macroblock Data
   20. Attachment One: Reference Decoder Source Code
      20.1. bit_ops.h
      20.2. bool_decoder.h
      20.3. dequant_data.h
      20.4. dixie.c
      20.5. dixie.h
      20.6. dixie_loopfilter.c
      20.7. dixie_loopfilter.h
      20.8. idct_add.c
      20.9. idct_add.h
      20.10. mem.h
      20.11. modemv.c
      20.12. modemv.h
      20.13. modemv_data.h
      20.14. predict.c
      20.15. predict.h
      20.16. tokens.c
      20.17. tokens.h
      20.18. vp8_prob_data.h
      20.19. vpx_codec_internal.h
      20.20. vpx_decoder.h
      20.21. vpx_decoder_compat.h
      20.22. vpx_image.c
      20.23. vpx_image.h
      20.24. vpx_integer.h
      20.25. AUTHORS File
      20.26. LICENSE
      20.27. PATENTS
   21. Security Considerations
   22. References
      22.1. Normative Reference
      22.2. Informative References

1. Introduction

   This document describes the VP8 compressed video data format,
   together with a discussion of the decoding procedure for the format.
   It is intended to be used in conjunction with, and as a guide to, the
   reference decoder source code provided in Attachment One
   (Section 20).  If there are any conflicts between this narrative and
   the reference source code, the reference source code should be
   considered correct.  The bitstream is defined by the reference source
   code and not this narrative.

   Like many modern video compression schemes, VP8 is based on
   decomposition of frames into square subblocks of pixels, prediction
   of such subblocks using previously constructed blocks, and adjustment
   of such predictions (as well as synthesis of unpredicted blocks)
   using a discrete cosine transform (hereafter abbreviated as DCT).  In
   one special case, however, VP8 uses a Walsh-Hadamard transform
   (hereafter abbreviated as WHT) instead of a DCT.

   Roughly speaking, such systems reduce datarate by exploiting the
   temporal and spatial coherence of most video signals.  It is more
   efficient to specify the location of a visually similar portion of a
   prior frame than it is to specify pixel values.  The frequency
   segregation provided by the DCT and WHT facilitates the exploitation
   of both spatial coherence in the original signal and the tolerance of
   the human visual system to moderate losses of fidelity in the
   reconstituted signal.

   VP8 augments these basic concepts with, among other things,
   sophisticated usage of contextual probabilities.  The result is a
   significant reduction in datarate at a given quality.

   Unlike some similar schemes (the older MPEG formats, for example),
   VP8 specifies exact values for reconstructed pixels.  Specifically,
   the specification for the DCT and WHT portions of the reconstruction
   does not allow for any "drift" caused by truncation of fractions.
   Rather, the algorithm is specified using fixed-precision integer
   operations exclusively.  This greatly facilitates the verification of
   the correctness of a decoder implementation and also avoids
   difficult-to-predict visual incongruities between such
   implementations.

   It should be remarked that, in a complete video playback system, the
   displayed frames may or may not be identical to the reconstructed
   frames.  Many systems apply a final level of filtering (commonly
   referred to as postprocessing) to the reconstructed frames prior to
   viewing.  Such postprocessing has no effect on the decoding and
   reconstruction of subsequent frames (which are predicted using the
   completely specified reconstructed frames) and is beyond the scope of
   this document.  In practice, the nature and extent of this sort of
   postprocessing depend on both the taste of the user and the
   computational facilities of the playback environment.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2. Format Overview

   VP8 works exclusively with an 8-bit YUV 4:2:0 image format.  In this
   format, each 8-bit pixel in the two chroma planes (U and V)
   corresponds positionally to a 2x2 block of 8-bit luma pixels in the Y
   plane; coordinates of the upper left corner of the Y block are of
   course exactly twice the coordinates of the corresponding chroma
   pixels.  When we refer to pixels or pixel distances without
   specifying a plane, we are implicitly referring to the Y plane or to
   the complete image, both of which have the same (full) resolution.

   As is usually the case, the pixels are simply a large array of bytes
   stored in rows from top to bottom, each row being stored from left to
   right.  This "left to right" then "top to bottom" raster-scan order
   is reflected in the layout of the compressed data as well.

   Provision has been made in the VP8 bitstream header for the support
   of a secondary YUV color format, in the form of a reserved bit.

   Occasionally, at very low datarates, a compression system may decide
   to reduce the resolution of the input signal to facilitate efficient
   compression.  The VP8 data format supports this via optional
   upscaling of its internal reconstruction buffer prior to output (this
   is completely distinct from the optional postprocessing discussed
   earlier, which has nothing to do with decoding per se).  This
   upsampling restores the video frames to their original resolution.
   In other words, the compression/decompression system can be viewed as
   a "black box", where the input and output are always at a given
   resolution.  The compressor might decide to "cheat" and process the
   signal at a lower resolution.  In that case, the decompressor needs
   the ability to restore the signal to its original resolution.

   Internally, VP8 decomposes each output frame into an array of
   macroblocks.  A macroblock is a square array of pixels whose Y
   dimensions are 16x16 and whose U and V dimensions are 8x8.
   Macroblock-level data in a compressed frame occurs (and must be
   processed) in a raster order similar to that of the pixels comprising
   the frame.

   Macroblocks are further decomposed into 4x4 subblocks.  Every
   macroblock has 16 Y subblocks, 4 U subblocks, and 4 V subblocks.  Any
   subblock-level data (and processing of such data) again occurs in
   raster order, this time in raster order within the containing
   macroblock.
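
   As an illustration of the layout just described, the following
   sketch (not part of the reference decoder; all function names are
   hypothetical) computes how many macroblocks cover a frame dimension
   and where a subblock with a given raster index lies inside its
   macroblock.  Rounding partial macroblocks up is an assumption made
   here for dimensions that are not multiples of 16.

```c
#include <assert.h>

/* Number of 16-pixel macroblocks needed to cover `pixels` luma pixels.
 * Rounding up is an assumption for sizes that are not multiples of 16. */
int mb_count(int pixels)
{
    assert(pixels > 0);
    return (pixels + 15) / 16;
}

/* Offset of Y subblock `i` (0..15, raster order within the macroblock)
 * relative to the upper left corner of the 16x16 luma area. */
void y_subblock_origin(int i, int *x, int *y)
{
    assert(i >= 0 && i < 16);
    *x = (i % 4) * 4;   /* four subblocks per row */
    *y = (i / 4) * 4;
}

/* Offset of U or V subblock `i` (0..3, raster order) relative to the
 * upper left corner of the 8x8 chroma area of the macroblock. */
void uv_subblock_origin(int i, int *x, int *y)
{
    assert(i >= 0 && i < 4);
    *x = (i % 2) * 4;   /* two subblocks per row */
    *y = (i / 2) * 4;
}
```

   For example, a 176x144 frame would be covered by 11 columns and 9
   rows of macroblocks under this rounding assumption.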

   As discussed in further detail below, data can be specified at the
   levels of both macroblocks and their subblocks.

   Pixels are always treated, at a minimum, at the level of subblocks,
   which may be thought of as the "atoms" of the VP8 algorithm.  In
   particular, the 2x2 chroma blocks corresponding to 4x4 Y subblocks
   are never treated explicitly in the data format or in the algorithm
   specification.

   The DCT and WHT always operate at a 4x4 resolution.  The DCT is used
   for the 16Y, 4U, and 4V subblocks.  The WHT is used (with some but
   not all prediction modes) to encode a 4x4 array comprising the
   average intensities of the 16 Y subblocks of a macroblock.  These
   average intensities are, up to a constant normalization factor,
   nothing more than the 0th DCT coefficients of the Y subblocks.  This
   "higher-level" WHT is a substitute for the explicit specification of
   those coefficients, in exactly the same way as the DCT of a subblock
   substitutes for the specification of the pixel values comprising the
   subblock.  We consider this 4x4 array as a second-order subblock
   called Y2, and think of a macroblock as containing 24 "real"
   subblocks and, sometimes, a 25th "virtual" subblock.  This is dealt
   with further in Section 13.

   The frame layout used by the reference decoder may be found in the
   file vpx_image.h (Section 20.23).

3. Compressed Frame Types

   There are only two types of frames in VP8.

   Intraframes (also called key frames and, in MPEG terminology,
   I-frames) are decoded without reference to any other frame in a
   sequence; that is, the decompressor reconstructs such frames
   beginning from its "default" state.  Key frames provide random access
   (or seeking) points in a video stream.

   Interframes (also called prediction frames and, in MPEG terminology,
   P-frames) are encoded with reference to prior frames, specifically
   all prior frames up to and including the most recent key frame.
   Generally speaking, the correct decoding of an interframe depends on
   the correct decoding of the most recent key frame and all ensuing
   frames.  Consequently, the decoding algorithm is not tolerant of
   dropped frames: In an environment in which frames may be dropped or
   corrupted, correct decoding will not be possible until a key frame is
   correctly received.

   In contrast to MPEG, there is no use of bidirectional prediction.  No
   frame is predicted using frames temporally subsequent to it; there is
   no analog to an MPEG B-frame.

   Secondly, VP8 augments these notions with that of alternate
   prediction frames, called golden frames and altref frames
   (alternative reference frames).  Blocks in an interframe may be
   predicted using blocks in the immediately previous frame as well as
   the most recent golden frame or altref frame.  Every key frame is
   automatically golden and altref, and any interframe may optionally
   replace the most recent golden or altref frame.

   Golden frames and altref frames may also be used to partially
   overcome the intolerance to dropped frames discussed above: If a
   compressor is configured to code golden frames only with reference to
   the prior golden frame (and key frame), then the "substream" of key
   and golden frames may be decoded regardless of loss of other
   interframes.  Roughly speaking, the implementation requires (on the
   compressor side) that golden frames subsume and recode any context
   updates effected by the intervening interframes.  A typical
   application of this approach is video conferencing, in which
   retransmission of a prior golden frame and/or a delay in playback
   until receipt of the next golden frame is preferable to a larger
   retransmit and/or delay until the next key frame.

4. Overview of Compressed Data Format

   The input to a VP8 decoder is a sequence of compressed frames whose
   order matches their order in time.  Issues such as the duration of
   frames, the corresponding audio, and synchronization are generally
   provided by the playback environment and are irrelevant to the
   decoding process itself; however, to aid in fast seeking, a start
   code is included in the header of each key frame.

   The decoder is simply presented with a sequence of compressed frames
   and produces a sequence of decompressed (reconstructed) YUV frames
   corresponding to the input sequence.  As stated in the Introduction,
   the exact pixel values in the reconstructed frame are part of VP8's
   specification.  This document specifies the layout of the compressed
   frames and gives unambiguous algorithms for the correct production of
   reconstructed frames.

   The first frame presented to the decompressor is of course a key
   frame.  This may be followed by any number of interframes; the
   correct reconstruction of each frame depends on all prior frames up
   to the key frame.  The next key frame restarts this process: The
   decompressor resets to its default initial condition upon reception
   of a key frame, and the decoding of a key frame (and its ensuing
   interframes) is completely independent of any prior decoding.

   At the highest level, every compressed frame has three or more
   pieces.  It begins with an uncompressed data chunk comprising
   10 bytes in the case of key frames and 3 bytes for interframes.  This
   is followed by two or more blocks of compressed data (called
   partitions).  These compressed data partitions begin and end on byte
   boundaries.

   The first compressed partition has two subsections:

   1.  Header information that applies to the frame as a whole.

   2.  Per-macroblock information specifying how each macroblock is
       predicted from the already-reconstructed data that is available
       to the decompressor.

   As stated above, the macroblock-level information occurs in
   raster-scan order.

   The rest of the partitions contain, for each block, the DCT/WHT
   coefficients (quantized and logically compressed) of the residue
   signal to be added to the predicted block values.  It typically
   accounts for roughly 70% of the overall datarate.  VP8 supports
   packing the compressed DCT/WHT coefficients' data from macroblock
   rows into separate partitions.  If there is more than one partition
   for these coefficients, the sizes of the partitions -- except the
   last partition -- in bytes are also present in the bitstream right
   after the above first partition.  Each of the sizes is a 3-byte data
   item written in little endian format.  These sizes provide the
   decoder direct access to all DCT/WHT coefficient partitions, which
   enables parallel processing of the coefficients in a decoder.
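
   The size table described above might be read as in the following
   sketch.  It is an illustration only, not the reference decoder's
   code: the function names are hypothetical, the number of coefficient
   partitions is assumed to be already known from the frame header, and
   the size of the last partition is assumed to be whatever remains of
   the frame data.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Read one 3-byte little-endian value. */
uint32_t read_le24(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) | ((uint32_t)p[2] << 16);
}

/*
 * `data` points just past the first partition and `avail` is the number
 * of bytes left in the compressed frame from that point.  On success,
 * sizes[0..num_partitions-1] hold the partition sizes in bytes; the last
 * one is inferred from the remaining data (an assumption of this sketch).
 * Returns 0 on success, -1 if the data is too short for the stated sizes.
 */
int read_partition_sizes(const uint8_t *data, size_t avail,
                         int num_partitions, size_t *sizes)
{
    size_t used;
    int i;

    assert(data != NULL && sizes != NULL && num_partitions >= 1);

    /* The table holds one 3-byte entry per partition except the last. */
    used = 3 * (size_t)(num_partitions - 1);
    if (avail < used)
        return -1;

    for (i = 0; i < num_partitions - 1; i++) {
        sizes[i] = read_le24(data + 3 * i);
        if (sizes[i] > avail - used)
            return -1;
        used += sizes[i];
    }
    sizes[num_partitions - 1] = avail - used;
    return 0;
}
```

   With the sizes known, the start of every coefficient partition can be
   computed up front, which is what makes the parallel processing
   mentioned above possible.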

   The separate partitioning of the prediction data and coefficient data
   also allows flexibility in the implementation of a decompressor: An
   implementation may decode and store the prediction information for
   the whole frame and then decode, transform, and add the residue
   signal to the entire frame, or it may simultaneously decode both
   partitions, calculating prediction information and adding in the
   residue signal for each block in order.  The length field in the
   frame tag, which allows decoding of the second partition to begin
   before the first partition has been completely decoded, is necessary
   for the second "block-at-a-time" decoder implementation.
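
   A minimal sketch of extracting that length field follows.  The bit
   layout of the 3-byte frame tag and the key frame start code bytes are
   taken from Section 9.1 rather than from this overview; the structure
   and function names are hypothetical and do not appear in the
   reference decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the fields of the 3-byte frame tag. */
typedef struct {
    int      is_key_frame;
    int      version;
    int      show_frame;
    uint32_t first_part_size;   /* size of the first partition in bytes */
} frame_tag;

/* Returns 0 on success, -1 if the chunk is too short or malformed. */
int parse_frame_tag(const uint8_t *data, size_t size, frame_tag *tag)
{
    uint32_t raw;

    assert(data != NULL && tag != NULL);
    if (size < 3)
        return -1;

    raw = (uint32_t)data[0] | ((uint32_t)data[1] << 8)
        | ((uint32_t)data[2] << 16);

    tag->is_key_frame    = !(raw & 1);          /* bit 0: 0 = key frame */
    tag->version         = (int)((raw >> 1) & 0x7);
    tag->show_frame      = (int)((raw >> 4) & 0x1);
    tag->first_part_size = (raw >> 5) & 0x7FFFF; /* 19-bit length field */

    if (tag->is_key_frame) {
        /* Key frames carry 7 more uncompressed bytes, beginning with
         * the start code (values per Section 9.1). */
        if (size < 10)
            return -1;
        if (data[3] != 0x9d || data[4] != 0x01 || data[5] != 0x2a)
            return -1;
    }
    return 0;
}
```

   Given first_part_size, a decoder can locate the end of the first
   partition without decoding it, which is the property the
   "block-at-a-time" approach relies on.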

   All partitions are decoded using separate instances of the boolean
   entropy decoder described in Section 7.  Although some of the data
   represented within the partitions is conceptually "flat" (a bit is
   just a bit with no probabilistic expectation one way or the other),
   because of the way such coders work, there is never a direct
   correspondence between a "conceptual bit" and an actual physical bit
   in the compressed data partitions.  Only in the 3- or 10-byte
   uncompressed chunk described above is there such a physical
   correspondence.

   A related matter is that seeking within a partition is not supported.
   The data must be decompressed and processed (or at least stored) in
   the order in which it occurs in the partition.

   While this document specifies the ordering of the partition data
   correctly, the details and semantics of this data are discussed in a
   more logical fashion to facilitate comprehension.  For example, the
   frame header contains updates to many probability tables used in
   decoding per-macroblock data.  The per-macroblock data is often
   described before the layouts of the probabilities and their updates,
   even though this is the opposite of their order in the bitstream.

5. Overview of the Decoding Process

   A VP8 decoder needs to maintain four YUV frame buffers whose
   resolutions are at least equal to that of the encoded image.  These
   buffers hold the current frame being reconstructed, the immediately
   previous reconstructed frame, the most recent golden frame, and the
   most recent altref frame.

   Most implementations will wish to "pad" these buffers with
   "invisible" pixels that extend a moderate number of pixels beyond all
   four edges of the visible image.  This simplifies interframe
   prediction by allowing all (or most) prediction blocks -- which are
   not guaranteed to lie within the visible area of a prior frame -- to
   address usable image data.

   Regardless of the amount of padding chosen, the invisible rows above
   (or below) the image are filled with copies of the top (or bottom)
   row of the image; the invisible columns to the left (or right) of the
   image are filled with copies of the leftmost (or rightmost) visible
   column; and the four invisible corners are filled with copies of the
   corresponding visible corner pixels.  The use of these prediction
   buffers (and suggested sizes for the halo) will be elaborated on in
   the discussion of motion vectors, interframe prediction, and
   sub-pixel interpolation later in this document.
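
   The halo fill just described can be sketched for a single plane as
   follows.  This is an illustration under stated assumptions, not the
   reference decoder's routine: the function name is hypothetical, and
   the plane is assumed to have been allocated with `border` extra
   pixels on every side so that the writes outside the visible area are
   valid.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/*
 * `img` points at the upper left visible pixel of a w x h plane whose
 * rows are `stride` bytes apart.  The allocation is assumed to include
 * `border` pixels of padding on all four sides.
 */
void extend_plane_borders(uint8_t *img, int w, int h, int stride,
                          int border)
{
    int y;

    assert(img != NULL && w > 0 && h > 0 && border >= 0);
    assert(stride >= w + 2 * border);

    /* Left and right: replicate the first and last pixel of each row. */
    for (y = 0; y < h; y++) {
        uint8_t *row = img + (ptrdiff_t)y * stride;
        memset(row - border, row[0], (size_t)border);
        memset(row + w, row[w - 1], (size_t)border);
    }

    /* Top and bottom: copy the already-extended first and last rows.
     * Because those rows include their side padding, this also fills
     * the four corners with the corner pixel values. */
    for (y = 1; y <= border; y++) {
        memcpy(img - border - (ptrdiff_t)y * stride,
               img - border, (size_t)(w + 2 * border));
        memcpy(img - border + (ptrdiff_t)(h - 1 + y) * stride,
               img - border + (ptrdiff_t)(h - 1) * stride,
               (size_t)(w + 2 * border));
    }
}
```

   The same routine would be applied to the Y, U, and V planes of a
   buffer, with the chroma planes using their own dimensions and border
   width.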

   As will be seen in the description of the frame header, the image
   dimensions are specified (and can change) with every key frame.
   These buffers (and any other data structures whose size depends on
   the size of the image) should be allocated (or re-allocated)
   immediately after the dimensions are decoded.

   Leaving most of the details for later elaboration, the following is
   an outline of the decoding process.

   First, the frame header (the beginning of the first data partition)
   is decoded.  Altering or augmenting the maintained state of the
   decoder, this provides the context in which the per-macroblock data
   can be interpreted.

   The macroblock data occurs (and must be processed) in raster-scan
   order.  This data comes in two or more parts.  The first (prediction
   or mode) part comes in the remainder of the first data partition.
   The other parts comprise the data partition(s) for the DCT/WHT
   coefficients of the residue signal.  For each macroblock, the
   prediction data must be processed before the residue.

   Each macroblock is predicted using one (and only one) of four
   possible frames.  All macroblocks in a key frame, and all intra-coded
   macroblocks in an interframe, are predicted using the already-decoded
   macroblocks in the current frame.  Macroblocks in an interframe may
   also be predicted using the previous frame, the golden frame, or the
   altref frame.  Such macroblocks are said to be inter-coded.

   The purpose of prediction is to use already-constructed image data to
   approximate the portion of the original image being reconstructed.
   The effect of any of the prediction modes is then to write a
   macroblock-sized prediction buffer containing this approximation.

   Regardless of the prediction method, the residue DCT signal is
   decoded, dequantized, reverse-transformed, and added to the
   prediction buffer to produce the (almost final) reconstruction value
   of the macroblock, which is stored in the correct position of the
   current frame buffer.

   The residue signal consists of 24 (sixteen Y, four U, and four V) 4x4
   quantized and losslessly compressed DCT transforms approximating the
   difference between the original macroblock in the uncompressed source
   and the prediction buffer.  For most prediction modes, the 0th
   coefficients of the sixteen Y subblocks are expressed via a 25th WHT
   of the second-order virtual Y2 subblock discussed above.

   Intra-prediction exploits the spatial coherence of frames.  The 16x16
   luma (Y) and 8x8 chroma (UV) components are predicted independently
   of each other using one of four simple means of pixel propagation,
   starting from the already-reconstructed (16-pixel-long luma, 8-pixel-
   long chroma) row above, and column to the left of, the current
   macroblock.  The four methods are:

   1.  Copying the row from above throughout the prediction buffer.

   2.  Copying the column from the left throughout the prediction
       buffer.

   3.  Copying the average value of the row and column throughout the
       prediction buffer.

   4.  Extrapolation from the row and column using the (fixed) second
       difference (horizontal and vertical) from the upper left corner.
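
   The four methods can be sketched for an n x n block (n being 16 for
   luma or 8 for chroma) as follows.  This is a simplified illustration:
   the enum and function names are hypothetical, both the row above and
   the column to the left are assumed to be available, and the special
   handling of macroblocks on the frame edges (Section 12) is omitted.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the four methods, in the order listed above. */
typedef enum {
    PRED_FROM_ABOVE,
    PRED_FROM_LEFT,
    PRED_AVERAGE,
    PRED_EXTRAPOLATE
} pred_method;

static uint8_t clamp_pixel(int v)
{
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/*
 * above[0..n-1]: reconstructed row above the block.
 * left[0..n-1]:  reconstructed column to the left of the block.
 * corner:        the pixel above and to the left of the block.
 */
void predict_square(uint8_t *dst, int stride, int n,
                    const uint8_t *above, const uint8_t *left,
                    uint8_t corner, pred_method method)
{
    int r, c, sum = 0, avg;

    assert(n == 4 || n == 8 || n == 16);

    /* Rounded mean of the 2n edge pixels, used by PRED_AVERAGE. */
    for (c = 0; c < n; c++)
        sum += above[c] + left[c];
    avg = (sum + n) / (2 * n);

    for (r = 0; r < n; r++) {
        for (c = 0; c < n; c++) {
            int v;
            switch (method) {
            case PRED_FROM_ABOVE: v = above[c]; break;
            case PRED_FROM_LEFT:  v = left[r];  break;
            case PRED_AVERAGE:    v = avg;      break;
            default:              v = left[r] + above[c] - corner; break;
            }
            dst[r * stride + c] = clamp_pixel(v);
        }
    }
}
```

   Only the extrapolating method can leave the 8-bit range, which is why
   its result is clamped; the other three merely copy or average
   existing pixel values.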

   Additionally, the sixteen Y subblocks may be predicted independently
   of each other using one of ten different modes, four of which are 4x4
   analogs of those described above, augmented with six "diagonal"
   prediction methods.  There are two types of predictions, one intra
   and one inter (among all the modes), for which the residue
   signal does not use the Y2 block to encode the DC portion of the
   sixteen 4x4 Y subblock DCTs.  This "independent Y subblock" mode has
   no effect on the 8x8 chroma prediction.

   Inter-prediction exploits the temporal coherence between nearby
   frames.  Except for the choice of the prediction frame itself, there
   is no difference between inter-prediction based on the previous frame
   and that based on the golden frame or altref frame.

   Inter-prediction is conceptually very simple.  While, for reasons of
   efficiency, there are several methods of encoding the relationship
   between the current macroblock and corresponding sections of the
   prediction frame, ultimately each of the sixteen Y subblocks is
   related to a 4x4 subblock of the prediction frame, whose position in
   that frame differs from the current subblock position by a (usually
   small) displacement.  These two-dimensional displacements are called
   motion vectors.

   The motion vectors used by VP8 have quarter-pixel precision.
   Prediction of a subblock using a motion vector that happens to have
   integer (whole number) components is very easy: The 4x4 block of
   pixels from the displaced block in the previous, golden, or altref
   frame is simply copied into the correct position of the current
   macroblock's prediction buffer.
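
   The whole-pixel case can be sketched as follows.  This is an
   illustration only: the function name is hypothetical, the motion
   vector components are assumed to be expressed in quarter-pixel units
   (so that integer displacements are multiples of 4), and the reference
   frame is assumed to be padded so that the displaced block always
   addresses valid memory.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * ref:        upper left visible luma pixel of the reference frame.
 * (x, y):     position of the current subblock in the frame.
 * mv_x, mv_y: displacement in quarter-pixel units (an assumption of
 *             this sketch); both must be whole-pixel values here.
 */
void copy_subblock_integer_mv(uint8_t *pred, int pred_stride,
                              const uint8_t *ref, int ref_stride,
                              int x, int y, int mv_x, int mv_y)
{
    const uint8_t *src;
    int r;

    assert(pred != NULL && ref != NULL);
    assert(mv_x % 4 == 0 && mv_y % 4 == 0);   /* no fractional part */

    /* Division is exact because both components are multiples of 4. */
    src = ref + (y + mv_y / 4) * ref_stride + (x + mv_x / 4);
    for (r = 0; r < 4; r++)
        memcpy(pred + r * pred_stride, src + r * ref_stride, 4);
}
```

   Vectors with a fractional part cannot be handled by a plain copy and
   require the interpolation discussed next.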

   Fractional displacements are conceptually and implementationally more
   complex.  They require the inference (or synthesis) of sample values
   that, strictly speaking, do not exist.  This is one of the most basic
   problems in signal processing, and readers conversant with that
   subject will see that the approach taken by VP8 provides a good
   balance of robustness, accuracy, and efficiency.

   Leaving the details for the implementation discussion below, the
   pixel interpolation is calculated by applying a kernel filter (using
   reasonable-precision integer math) three pixels on either side, both
   horizontally and vertically, of the pixel to be synthesized.  The
   resulting 4x4 block of synthetic pixels is then copied into position
   exactly as in the case of integer displacements.

   Each of the eight chroma subblocks is handled similarly.  Their
   motion vectors are never specified explicitly; instead, the motion
   vector for each chroma subblock is calculated by averaging the
   vectors of the four Y subblocks that occupy the same area of the
   frame.  Since chroma pixels have twice the diameter (and four times
   the area) of luma pixels, the calculated chroma motion vectors have
   1/8-pixel resolution, but the procedure for copying or generating
   pixels for each subblock is essentially identical to that done in the
   luma plane.

   After all the macroblocks have been generated (predicted and
   corrected with the DCT/WHT residue), a filtering step (the loop
   filter) is applied to the entire frame.  The purpose of the loop
   filter is to reduce blocking artifacts at the boundaries between
   macroblocks and between subblocks of the macroblocks.  The term "loop
   filter" is used because this filter is part of the "coding loop";
   that is, it affects the reconstructed frame buffers that are used to
   predict ensuing frames.  This is distinguished from the
   postprocessing filters discussed earlier, which affect only the
   viewed video and do not "feed into" subsequent frames.

   Next, if signaled in the data, the current frame may replace the
   golden frame prediction buffer and/or the altref frame buffer.

   The halos of the frame buffers are next filled as specified above.
   Finally, at least as far as decoding is concerned, (references to)
   the "current" and "last" frame buffers should be exchanged in
   preparation for the next frame.
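
   Because only references are exchanged, no pixel data needs to be
   copied at this step.  A minimal sketch, assuming the decoder keeps
   pointers to its frame buffers in a small structure (the FrameBuffer
   and DecoderRefs types and the function name are hypothetical and do
   not come from the reference decoder):

```c
/* Hypothetical placeholder for a decoded frame and its halo. */
typedef struct { int id; } FrameBuffer;

/* Hypothetical holder for the two references that are exchanged. */
typedef struct {
    FrameBuffer *current;   /* frame just reconstructed */
    FrameBuffer *last;      /* frame reconstructed before it */
} DecoderRefs;

/* After the swap, "last" refers to the frame just decoded, and
   "current" refers to a buffer that may be overwritten by the next
   frame. */
static void swap_current_and_last(DecoderRefs *refs)
{
    FrameBuffer *tmp = refs->current;
    refs->current = refs->last;
    refs->last = tmp;
}
```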

   Various processes may be required (or desired) before viewing the
   generated frame.  As discussed in the frame dimension information
   below, truncation and/or upscaling of the frame may be required.
   Some playback systems may require a different frame format (RGB,
   YUY2, etc.).  Finally, as mentioned in the Introduction, further
   postprocessing or filtering of the image prior to viewing may be
   desired.  Since the primary purpose of this document is a decoding
   specification, the postprocessing is not specified in this document.

   While the basic ideas of prediction and correction used by VP8 are
   straightforward, many of the details are quite complex.  The
   management of probabilities is particularly elaborate.  Not only do
   the various modes of intra-prediction and motion vector specification
   have associated probabilities, but they, together with the coding of
   DCT coefficients and motion vectors, often base these probabilities
   on a variety of contextual information (calculated from what has been
   decoded so far), as well as on explicit modification via the frame
   header.

   The "top-level" of decoding and frame reconstruction is implemented
   in the reference decoder file dixie.c (Section 20.4).

   This concludes our summary of decoding and reconstruction; we
   continue by discussing the individual aspects in more depth.

   A reasonable "divide and conquer" approach to implementation of a
   decoder is to begin by decoding streams composed exclusively of key
   frames.  After that works reliably, interframe handling can be added
   more easily than if complete functionality were attempted
   immediately.  In accordance with this, we first discuss components
   needed to decode key frames (most of which are also used in the
   decoding of interframes) and conclude with topics exclusive to
   interframes.

6. Description of Algorithms

   As the intent of this document, together with the reference decoder
   source code, is to specify a platform-independent procedure for the
   decoding and reconstruction of a VP8 video stream, many (small)
   algorithms must be described exactly.

   Due to its near-universality, terseness, ability to easily describe
   calculation at specific precisions, and the fact that On2's reference
   VP8 decoder is written in C, these algorithm fragments are written
   using the C programming language, augmented with a few simple
   definitions below.

   The standard (and best) reference for C is [Kernighan].

   Many code fragments will be presented in this document.  Some will be
   nearly identical to corresponding sections of the reference decoder;
   others will differ.  Roughly speaking, there are three reasons for
   such differences:

   1.  For reasons of efficiency, the reference decoder version may be
       less obvious.

   2.  The reference decoder often uses large data structures to
       maintain context that need not be described or used here.

   3.  The authors of this document felt that a different expression of
       the same algorithm might facilitate exposition.

   Regardless of the chosen presentation, the calculation effected by
   any of the algorithms described here is identical to that effected by
   the corresponding portion of the reference decoder.

   All VP8 decoding algorithms use integer math.  To facilitate
   specification of arithmetic precision, we define the following types.

   ---- Begin code block --------------------------------------

   typedef signed char int8; /* signed int exactly 8 bits wide */
   typedef unsigned char uint8; /* unsigned "" */

   typedef short int16; /* signed int exactly 16 bits wide */
   typedef unsigned short uint16; /* unsigned "" */
   /* int32 is a signed integer type at least 32 bits wide.
      The two definitions below are alternatives; use exactly one. */

   typedef long int32; /* guaranteed to work on all systems */
   /* typedef int int32; */ /* will be more efficient on some systems */

   typedef unsigned long uint32; /* unsigned counterpart of int32 */

   /* unsigned integer type, at least 16 bits wide, whose exact size
      is most convenient to whatever processor we are using */

   typedef unsigned int uint;

   /* While pixels themselves are 8-bit unsigned integers,
      pixel arithmetic often occurs at 16- or 32-bit precision and
      the results need to be "saturated" or clamped to an 8-bit
      range. */

   typedef uint8 Pixel;

   Pixel clamp255(int32 v) { return v < 0? 0 : (v < 255? v : 255);}

   /*  As is elaborated in the discussion of the bool_decoder below,
       VP8 represents probabilities as unsigned 8-bit numbers. */

   typedef uint8 Prob;

   ---- End code block ----------------------------------------
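
   On a compiler that provides the C99 header <stdint.h>, the same type
   names could instead be defined in terms of the standard fixed-width
   and least-width types.  The following is a sketch of that
   alternative, not part of the reference decoder; add_residue is a
   hypothetical helper included only to show clamp255 in use.

```c
#include <stdint.h>

typedef int8_t   int8;    /* signed, exactly 8 bits wide */
typedef uint8_t  uint8;   /* unsigned, exactly 8 bits wide */
typedef int16_t  int16;   /* signed, exactly 16 bits wide */
typedef uint16_t uint16;  /* unsigned, exactly 16 bits wide */

typedef int_least32_t  int32;   /* signed, at least 32 bits wide */
typedef uint_least32_t uint32;  /* unsigned, at least 32 bits wide */

typedef unsigned int uint;
typedef uint8 Pixel;

/* Saturate a wider intermediate value to the 8-bit pixel range. */
static Pixel clamp255(int32 v)
{
    return v < 0 ? 0 : (v < 255 ? (Pixel)v : 255);
}

/* Hypothetical helper: add a signed residue to a predicted pixel,
   computing at 32-bit precision and clamping the result. */
static Pixel add_residue(Pixel predicted, int32 residue)
{
    return clamp255((int32)predicted + residue);
}
```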

   We occasionally need to discuss mathematical functions involving
   honest-to-goodness "infinite precision" real numbers.  The DCT is
   first described via the cosine function cos; the ratio of the lengths
   of the circumference and diameter of a circle is denoted pi; at one
   point, we take a (base 1/2) logarithm, denoted log; and pow(x, y)
   denotes x raised to the power y.  If x = 2 and y is a small
   non-negative integer, pow(2, y) may be expressed in C as 1 << y.

   Finally, we sometimes need to divide signed integers by powers of
   two; that is, we occasionally right-shift signed numbers.  The
   behavior of such shifts (i.e., the propagation of the sign bit) is,
   perhaps surprisingly, not defined by the C language itself and is
   left up to individual compilers.  Because of the utility of this
   frequently needed operation, it is at least arguable that it should
   be defined by the language (to naturally propagate the sign bit) and,
   at a minimum, should be correctly implemented by any reasonable
   compiler.  In the interest of strict portability, we attempt to call
   attention to these shifts when they arise.
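
   Where strict portability matters, a sign-propagating shift can be
   written so that only non-negative values are ever shifted.  The
   helper below is a sketch under the assumption of a two's-complement
   representation; the name asr is hypothetical and does not appear in
   the reference decoder.  It computes the floor of v divided by
   pow(2, n), which differs from C's "/" operator, since integer
   division in C99 truncates toward zero.

```c
typedef long int32;   /* signed, at least 32 bits wide */

/* Arithmetic (sign-propagating) right shift that does not rely on the
   compiler's treatment of negative operands.  For negative v, ~v is
   non-negative (assuming two's complement), so shifting it is fully
   defined; complementing again restores the sign. */
static int32 asr(int32 v, unsigned n)
{
    return v < 0 ? ~(~v >> n) : v >> n;
}
```

   For example, asr(-5, 1) yields -3, whereas -5 / 2 yields -2.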

