Tech-invite3GPPspaceIETFspace
96959493929190898887868584838281807978777675747372717069686766656463626160595857565554535251504948474645444342414039383736353433323130292827262524232221201918171615141312111009080706050403020100
in Index   Prev   Next

RFC 7932

Brotli Compressed Data Format

Pages: 128
Informational
Errata
Part 2 of 6 – Pages 22 to 43
First   Prev   Next

Top   ToC   RFC7932 - Page 22   prevText

6. Encoding of Block-Switch Commands

As described in Section 2, a block-switch command is a pair <block type, block count>. These are encoded in the compressed data part of the meta-block, right before the start of each new block of a particular block category. Each block type in the compressed data is represented with a block type code, encoded using a prefix code over the block type code alphabet. A block type symbol 0 means that the new block type is the same as the type of the previous block from the same block category, i.e., the block type that preceded the current type, while a block type symbol 1 means that the new block type equals the current block type plus one. If the current block type is the maximal possible, then a block type symbol of 1 results in wrapping to a new block type of 0. Block type symbols 2..257 represent block types 0..255, respectively. The previous and current block types are initialized to 1 and 0, respectively, at the end of the meta-block header. Since the first block type of each block category is 0, the block type of the first block-switch command is not encoded in the compressed data. If a block category has only one block type, the block count of the first block-switch command is also omitted from the compressed data; otherwise, it is encoded in the meta-block header. Since the end of the meta-block is detected by the number of uncompressed bytes produced, the block counts for any of the three categories need not count down to exactly zero at the end of the meta-block. The number of different block types in each block category, denoted by NBLTYPESL, NBLTYPESI, and NBLTYPESD for literals, insert-and-copy lengths, and distances, respectively, is encoded in the meta-block header, and it must equal to the largest block type plus one in that block category. In other words, the set of literal, insert-and-copy length, and distance block types must be [0..NBLTYPESL-1], [0..NBLTYPESI-1], and [0..NBLTYPESD-1], respectively. From this it follows that the alphabet size of literal, insert-and-copy length, and distance block type codes is NBLTYPESL + 2, NBLTYPESI + 2, and NBLTYPESD + 2, respectively. Each block count in the compressed data is represented with a pair <block count code, extra bits>. The block count code and the extra bits are encoded back-to-back, the block count code is encoded using a prefix code over the block count code alphabet, while the extra bits value is encoded as a fixed-width integer value. The number of extra bits can be 0..24, and it is dependent on the block count code.
Top   ToC   RFC7932 - Page 23
   The symbols of the block count code alphabet along with the number of
   extra bits and the range of block counts are as follows:

           Extra              Extra               Extra
      Code Bits Lengths  Code Bits Lengths   Code Bits Lengths
      ---- ---- -------  ---- ---- -------   ---- ---- -------
       0    2    1..4     9    4   65..80    18    7   369..496
       1    2    5..8    10    4   81..96    19    8   497..752
       2    2    9..12   11    4   97..112   20    9   753..1264
       3    2   13..16   12    5  113..144   21   10   1265..2288
       4    3   17..24   13    5  145..176   22   11   2289..4336
       5    3   25..32   14    5  177..208   23   12   4337..8432
       6    3   33..40   15    5  209..240   24   13   8433..16624
       7    3   41..48   16    6  241..304   25   24   16625..16793840
       8    4   49..64   17    6  305..368

   The first block-switch command of each block category is special in
   the sense that it is encoded in the meta-block header, and as
   described earlier, the block type code is omitted since it is an
   implicit zero.

7. Context Modeling

As described in Section 2, the prefix tree used to encode a literal byte or a distance code depends on the block type and the context ID. This section specifies how to compute the context ID for a particular literal and distance code and how to encode the context map that maps a <block type, context ID> pair to the index of a prefix code in the array of literal and distance prefix codes.

7.1. Context Modes and Context ID Lookup for Literals

The context for encoding the next literal is defined by the last two bytes in the stream (p1, p2, where p1 is the most recent byte), regardless of whether these bytes are produced by uncompressed meta- blocks, backward references, static dictionary references, or by literal insertions. At the start of the stream, p1 and p2 are initialized to zero. There are four methods, called context modes, to compute the Context ID: * LSB6, where the Context ID is the value of six least significant bits of p1, * MSB6, where the Context ID is the value of six most significant bits of p1,
Top   ToC   RFC7932 - Page 24
      *  UTF8, where the Context ID is a complex function of p1, p2,
         optimized for text compression, and

      *  Signed, where Context ID is a complex function of p1, p2,
         optimized for compressing sequences of signed integers.

   The Context ID for the UTF8 and Signed context modes is computed
   using the following lookup tables Lut0, Lut1, and Lut2.

      Lut0 :=
         0,  0,  0,  0,  0,  0,  0,  0,  0,  4,  4,  0,  0,  4,  0,  0,
         0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,
         8, 12, 16, 12, 12, 20, 12, 16, 24, 28, 12, 12, 32, 12, 36, 12,
        44, 44, 44, 44, 44, 44, 44, 44, 44, 44, 32, 32, 24, 40, 28, 12,
        12, 48, 52, 52, 52, 48, 52, 52, 52, 48, 52, 52, 52, 52, 52, 48,
        52, 52, 52, 52, 52, 48, 52, 52, 52, 52, 52, 24, 12, 28, 12, 12,
        12, 56, 60, 60, 60, 56, 60, 60, 60, 56, 60, 60, 60, 60, 60, 56,
        60, 60, 60, 60, 60, 56, 60, 60, 60, 60, 60, 24, 12, 28, 12,  0,
         0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
         0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
         0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
         0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1,
         2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
         2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
         2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3,
         2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3, 2, 3

      Lut1 :=
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1, 1,
         1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 1, 1, 1, 1, 1,
         1, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 1, 1, 1, 1, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2
Top   ToC   RFC7932 - Page 25
      Lut2 :=
         0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5, 5,
         6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 6, 7

   The lengths and the CRC-32 check values (see Appendix C) of each of
   these tables as a sequence of bytes are as follows:

      Table    Length    CRC-32
      -----    ------    ------
      Lut0     256       0x8e91efb7
      Lut1     256       0xd01a32f4
      Lut2     256       0x0dd7a0d6

   Given p1 is the last uncompressed byte and p2 is the second-to-last
   uncompressed byte, the context IDs can be computed as follows:

      For LSB6:    Context ID = p1 & 0x3f
      For MSB6:    Context ID = p1 >> 2
      For UTF8:    Context ID = Lut0[p1] | Lut1[p2]
      For Signed:  Context ID = (Lut2[p1] << 3) | Lut2[p2]

   From the lookup tables defined above and the operations to compute
   the context IDs, we can see that context IDs for literals are in the
   range of 0..63.

   The context modes LSB6, MSB6, UTF8, and Signed are denoted by
   integers 0, 1, 2, 3.

   A context mode is defined for each literal block type and they are
   stored in a consecutive array of bits in the meta-block header,
   always two bits per block type.
Top   ToC   RFC7932 - Page 26

7.2. Context ID for Distances

The context for encoding a distance code is defined by the copy length corresponding to the distance. The context IDs are 0, 1, 2, and 3 for copy lengths 2, 3, 4, and more than 4, respectively.

7.3. Encoding of the Context Map

There are two context maps, one for literals and one for distances. The size of the context map is 64 * NBLTYPESL for literals, and 4 * NBLTYPESD for distances. Each value in the context map is an integer between 0 and 255, indicating the index of the prefix code to be used when encoding the next literal or distance. The context maps are two-dimensional matrices, encoded as one- dimensional arrays: CMAPL[0..(64 * NBLTYPESL - 1)] CMAPD[0..(4 * NBLTYPESD - 1)] The index of the prefix code for encoding a literal or distance code with block type, BTYPE_x, and context ID, CIDx, is: index of literal prefix code = CMAPL[64 * BTYPE_L + CIDL] index of distance prefix code = CMAPD[4 * BTYPE_D + CIDD] The values of the context map are encoded with the combination of run length encoding for zero values and prefix coding. Let RLEMAX denote the number of run length codes and NTREES denote the maximum value in the context map plus one. NTREES must equal the number of different values in the context map; in other words, the different values in the context map must be the [0..NTREES-1] interval. The alphabet of the prefix code has the following RLEMAX + NTREES symbols: 0: value zero 1: repeat a zero 2 to 3 times, read 1 bit for repeat length 2: repeat a zero 4 to 7 times, read 2 bits for repeat length ... RLEMAX: repeat a zero (1 << RLEMAX) to (1 << (RLEMAX+1))-1 times, read RLEMAX bits for repeat length RLEMAX + 1: value 1 ... RLEMAX + NTREES - 1: value NTREES - 1
Top   ToC   RFC7932 - Page 27
   If RLEMAX = 0, the run length coding is not used and the symbols of
   the alphabet are directly the values in the context map.  We can now
   define the format of the context map (the same format is used for
   literal and distance context maps):

      1..5 bits: RLEMAX, 0 is encoded with one 0 bit, and values 1..16
                 are encoded with bit pattern xxxx1 (so 01001 is 5)

      Prefix code with alphabet size NTREES + RLEMAX

      Context map size values encoded with the above prefix code and run
         length coding for zero values.  If a run length would result in
         more lengths in total than the size of the context map, then
         the stream should be rejected as invalid.

      1 bit:  IMTF bit, if set, we do an inverse move-to-front transform
              on the values in the context map to get the prefix code
              indexes.

   Note that RLEMAX may be larger than the value necessary to represent
   the longest sequence of zero values.  Also, the NTREES value is
   encoded right before the context map as described in Section 9.2.

   We define the inverse move-to-front transform used in this
   specification by the following C language function:

      void InverseMoveToFrontTransform(uint8_t* v, int v_len) {
         uint8_t mtf[256];
         int i;
         for (i = 0; i < 256; ++i) {
            mtf[i] = (uint8_t)i;
         }
         for (i = 0; i < v_len; ++i) {
            uint8_t index = v[i];
            uint8_t value = mtf[index];
            v[i] = value;
            for (; index; --index) {
               mtf[index] = mtf[index - 1];
            }
            mtf[0] = value;
         }
      }

   Note that the inverse move-to-front transform will not produce values
   outside the [0..NTREES-1] interval.
Top   ToC   RFC7932 - Page 28

8. Static Dictionary

At any given point during decoding the compressed data, a reference to a duplicated string in the uncompressed data produced so far has a maximum backward distance value, which is the minimum of the window size and the number of uncompressed bytes produced. However, decoding a distance from the compressed stream, as described in Section 4, can produce distances that are greater than this maximum allowed value. In this case, the distance is treated as a reference to a word in the static dictionary given in Appendix A. The copy length for a static dictionary reference must be between 4 and 24. The static dictionary has three parts: * DICT[0..DICTSIZE], an array of bytes * DOFFSET[0..24], an array of byte-offset values for each length * NDBITS[0..24], an array of bit-depth values for each length The number of static dictionary words for a given length is: NWORDS[length] = 0 (if length < 4) NWORDS[length] = (1 << NDBITS[length]) (if length >= 4) DOFFSET and DICTSIZE are defined by the following recursion: DOFFSET[0] = 0 DOFFSET[length + 1] = DOFFSET[length] + length * NWORDS[length] DICTSIZE = DOFFSET[24] + 24 * NWORDS[24] The offset of a word within the DICT array for a given length and index is: offset(length, index) = DOFFSET[length] + index * length Each static dictionary word has 121 different forms, given by applying a word transformation to a base word in the DICT array. The list of word transformations is given in Appendix B. The static dictionary word for a <length, distance> pair can be reconstructed as follows: word_id = distance - (max allowed distance + 1) index = word_id % NWORDS[length] base_word = DICT[offset(length, index)..offset(length, index+1)-1] transform_id = word_id >> NDBITS[length] The string copied to the uncompressed stream is computed by applying the transformation to the base dictionary word. If transform_id is greater than 120, or the length is smaller than 4 or greater than 24, then the compressed stream should be rejected as invalid.
Top   ToC   RFC7932 - Page 29
   Each word transformation has the following form:

      transform_i(word) = prefix_i + T_i(word) + suffix_i

   where the _i subscript denotes the transform_id above.  Each T_i is
   one of the following 21 elementary transforms:

      Identity, FermentFirst, FermentAll,
      OmitFirst1, ..., OmitFirst9, OmitLast1, ..., OmitLast9

   The form of these elementary transforms is as follows:

      Identity(word) = word

      FermentFirst(word) = see below

      FermentAll(word) = see below

      OmitFirstk(word) = the last (length(word) - k) bytes of word, or
                         empty string if length(word) < k

      OmitLastk(word) = the first (length(word) - k) bytes of word, or
                        empty string if length(word) < k
Top   ToC   RFC7932 - Page 30
   We define the FermentFirst and FermentAll transforms used in this
   specification by the following C language functions:

      int Ferment(uint8_t* word, int word_len, int pos) {
         if (word[pos] < 192) {
            if (word[pos] >= 97 and word[pos] <= 122) {
               word[pos] = word[pos] ^ 32;
            }
            return 1;
         } else if (word[pos] < 224) {
            if (pos + 1 < word_len) {
               word[pos + 1] = word[pos + 1] ^ 32;
            }
            return 2;
         } else {
            if (pos + 2 < word_len) {
               word[pos + 2] = word[pos + 2] ^ 5;
            }
            return 3;
         }
      }

      void FermentFirst(uint8_t* word, int word_len) {
         if (word_len > 0) {
            Ferment(word, word_len, 0);
         }
      }

      void FermentAll(uint8_t* word, int word_len) {
         int i = 0;
         while (i < word_len) {
            i += Ferment(word, word_len, i);
         }
      }

   Appendix B contains the list of transformations by specifying the
   prefix, elementary transform and suffix components of each of them.
   Note that the OmitFirst8 elementary transform is not used in the list
   of transformations.  The strings in Appendix B are in C-string format
   with respect to escape (backslash) characters.

   The maximum number of additional bytes that a transform may add to a
   base word is 13.  Since the largest base word is 24 bytes long, a
   buffer of 38 bytes is sufficient to store any transformed words
   (counting a terminating zero byte).
Top   ToC   RFC7932 - Page 31

9. Compressed Data Format

In this section, we describe the format of the compressed data set in terms of the format of the individual data items described in the previous sections.

9.1. Format of the Stream Header

The stream header has only the following one field: 1..7 bits: WBITS, a value in the range 10..24, encoded with the following variable-length code (as it appears in the compressed data, where the bits are parsed from right to left): Value Bit Pattern ----- ----------- 10 0100001 11 0110001 12 1000001 13 1010001 14 1100001 15 1110001 16 0 17 0000001 18 0011 19 0101 20 0111 21 1001 22 1011 23 1101 24 1111 Note that bit pattern 0010001 is invalid and must not be used. The size of the sliding window, which is the maximum value of any non-dictionary reference backward distance, is given by the following formula: window size = (1 << WBITS) - 16
Top   ToC   RFC7932 - Page 32

9.2. Format of the Meta-Block Header

A compliant compressed data set has at least one meta-block. Each meta-block contains a header with information about the uncompressed length of the meta-block, and a bit signaling if the meta-block is the last one. The format of the meta-block header is the following: 1 bit: ISLAST, set to 1 if this is the last meta-block 1 bit: ISLASTEMPTY, if set to 1, the meta-block is empty; this field is only present if ISLAST bit is set -- if it is 1, then the meta-block and the brotli stream ends at that bit, with any remaining bits in the last byte of the compressed stream filled with zeros (if the fill bits are not zero, then the stream should be rejected as invalid) 2 bits: MNIBBLES, number of nibbles to represent the uncompressed length, encoded with the following fixed-length code: Value Bit Pattern ----- ----------- 0 11 4 00 5 01 6 10 If MNIBBLES is 0, the meta-block is empty, i.e., it does not generate any uncompressed data. In this case, the rest of the meta-block has the following format: 1 bit: reserved, must be zero 2 bits: MSKIPBYTES, number of bytes to represent metadata length MSKIPBYTES * 8 bits: MSKIPLEN - 1, where MSKIPLEN is the number of metadata bytes; this field is only present if MSKIPBYTES is positive; otherwise, MSKIPLEN is 0 (if MSKIPBYTES is greater than 1, and the last byte is all zeros, then the stream should be rejected as invalid) 0..7 bits: fill bits until the next byte boundary, must be all zeros MSKIPLEN bytes of metadata, not part of the uncompressed data or the sliding window
Top   ToC   RFC7932 - Page 33
      MNIBBLES * 4 bits: MLEN - 1, where MLEN is the length of the meta-
              block uncompressed data in bytes (if MNIBBLES is greater
              than 4, and the last nibble is all zeros, then the stream
              should be rejected as invalid)

      1 bit:  ISUNCOMPRESSED, if set to 1, any bits of compressed data
              up to the next byte boundary are ignored, and the rest of
              the meta-block contains MLEN bytes of literal data; this
              field is only present if the ISLAST bit is not set (if the
              ignored bits are not all zeros, the stream should be
              rejected as invalid)

      1..11 bits: NBLTYPESL, number of literal block types, encoded with
              the following variable-length code (as it appears in the
              compressed data, where the bits are parsed from right to
              left, so 0110111 has the value 12):

                       Value    Bit Pattern
                       -----    -----------
                         1                0
                         2             0001
                       3..4           x0011
                       5..8          xx0101
                       9..16        xxx0111
                      17..32       xxxx1001
                      33..64      xxxxx1011
                      65..128    xxxxxx1101
                     129..256   xxxxxxx1111

         Prefix code over the block type code alphabet for literal block
            types, appears only if NBLTYPESL >= 2

         Prefix code over the block count code alphabet for literal
            block counts, appears only if NBLTYPESL >= 2

         Block count code + extra bits for first literal block count,
            appears only if NBLTYPESL >= 2

      1..11 bits: NBLTYPESI, number of insert-and-copy block types,
                  encoded with the same variable-length code as above

         Prefix code over the block type code alphabet for insert-and-
            copy block types, appears only if NBLTYPESI >= 2

         Prefix code over the block count code alphabet for insert-and-
            copy block counts, appears only if NBLTYPESI >= 2
Top   ToC   RFC7932 - Page 34
         Block count code + extra bits for first insert-and-copy block
            count, appears only if NBLTYPESI >= 2

      1..11 bits: NBLTYPESD, number of distance block types, encoded
                  with the same variable-length code as above

         Prefix code over the block type code alphabet for distance
            block types, appears only if NBLTYPESD >= 2

         Prefix code over the block count code alphabet for distance
            block counts, appears only if NBLTYPESD >= 2

         Block count code + extra bits for first distance block count,
            appears only if NBLTYPESD >= 2

      2 bits: NPOSTFIX, parameter used in the distance coding

      4 bits: four most significant bits of NDIRECT, to get the actual
              value of the parameter NDIRECT, left-shift this four-bit
              number by NPOSTFIX bits

      NBLTYPESL * 2 bits: context mode for each literal block type

      1..11 bits: NTREESL, number of literal prefix trees, encoded with
                  the same variable-length code as NBLTYPESL

         Literal context map, encoded as described in Section 7.3,
            appears only if NTREESL >= 2; otherwise, the context map has
            only zero values

      1..11 bits: NTREESD, number of distance prefix trees, encoded with
                  the same variable-length code as NBLTYPESD

         Distance context map, encoded as described in Section 7.3,
            appears only if NTREESD >= 2; otherwise, the context map has
            only zero values

      NTREESL prefix codes for literals

      NBLTYPESI prefix codes for insert-and-copy lengths

      NTREESD prefix codes for distances
Top   ToC   RFC7932 - Page 35

9.3. Format of the Meta-Block Data

The compressed data part of a meta-block consists of a series of commands. Each command has the following format: Block type code for next insert-and-copy block type, appears only if NBLTYPESI >= 2 and the previous insert-and-copy block count is zero Block count code + extra bits for next insert-and-copy block count, appears only if NBLTYPESI >= 2 and the previous insert- and-copy block count is zero Insert-and-copy length, encoded as in Section 5, using the insert- and-copy length prefix code with the current insert-and-copy block type index Insert length number of literals, with the following format: Block type code for next literal block type, appears only if NBLTYPESL >= 2 and the previous literal block count is zero Block count code + extra bits for next literal block count, appears only if NBLTYPESL >= 2 and the previous literal block count is zero Next byte of the uncompressed data, encoded with the literal prefix code with the index determined by the previous two bytes of the uncompressed data, the current literal block type, and the context map, as described in Section 7.3 Block type code for next distance block type, appears only if NBLTYPESD >= 2 and the previous distance block count is zero Block count code + extra bits for next distance block count, appears only if NBLTYPESD >= 2 and the previous distance block count is zero Distance code, encoded as in Section 4, using the distance prefix code with the index determined by the copy length, the current distance block type, and the distance context map, as described in Section 7.3, appears only if the distance code is not an implicit 0, as indicated by the insert-and-copy length code
Top   ToC   RFC7932 - Page 36
   The number of commands in the meta-block is such that the sum of the
   uncompressed bytes produced (i.e., the number of literals inserted
   plus the number of bytes copied from past data or generated from the
   static dictionary) over all the commands gives the uncompressed
   length, MLEN encoded in the meta-block header.

   If the total number of uncompressed bytes produced after the insert
   part of the last command equals MLEN, then the copy length of the
   last command is ignored and will not produce any uncompressed output.
   In this case, the copy length of the last command can have any value.
   In any other case, if the number of literals to insert, the copy
   length, or the resulting dictionary word length would cause MLEN to
   be exceeded, then the stream should be rejected as invalid.

   If the last command of the last non-empty meta-block does not end on
   a byte boundary, the unused bits in the last byte must be zeros.

10. Decoding Algorithm

The decoding algorithm that produces the uncompressed data is as follows: read window size do read ISLAST bit if ISLAST read ISLASTEMPTY bit if ISLASTEMPTY break from loop read MNIBBLES if MNIBBLES is zero verify reserved bit is zero read MSKIPLEN skip any bits up to the next byte boundary skip MSKIPLEN bytes continue to the next meta-block else read MLEN if not ISLAST read ISUNCOMPRESSED bit if ISUNCOMPRESSED skip any bits up to the next byte boundary copy MLEN bytes of compressed data as literals continue to the next meta-block
Top   ToC   RFC7932 - Page 37
         loop for each three block categories (i = L, I, D)
            read NBLTYPESi
            if NBLTYPESi >= 2
               read prefix code for block types, HTREE_BTYPE_i
               read prefix code for block counts, HTREE_BLEN_i
               read block count, BLEN_i
               set block type, BTYPE_i to 0
               initialize second-to-last and last block types to 0 and 1
            else
               set block type, BTYPE_i to 0
               set block count, BLEN_i to 16777216
         read NPOSTFIX and NDIRECT
         read array of literal context modes, CMODE[]
         read NTREESL
         if NTREESL >= 2
            read literal context map, CMAPL[]
         else
            fill CMAPL[] with zeros
         read NTREESD
         if NTREESD >= 2
            read distance context map, CMAPD[]
         else
            fill CMAPD[] with zeros
         read array of literal prefix codes, HTREEL[]
         read array of insert-and-copy length prefix codes, HTREEI[]
         read array of distance prefix codes, HTREED[]
         do
            if BLEN_I is zero
               read block type using HTREE_BTYPE_I and set BTYPE_I
                  save previous block type
               read block count using HTREE_BLEN_I and set BLEN_I
            decrement BLEN_I
            read insert-and-copy length symbol using HTREEI[BTYPE_I]
            compute insert length, ILEN, and copy length, CLEN
            loop for ILEN
               if BLEN_L is zero
                  read block type using HTREE_BTYPE_L and set BTYPE_L
                     save previous block type
                  read block count using HTREE_BLEN_L and set BLEN_L
               decrement BLEN_L
               look up context mode CMODE[BTYPE_L]
               compute context ID, CIDL from last two uncompressed bytes
               read literal using HTREEL[CMAPL[64*BTYPE_L + CIDL]]
               write literal to uncompressed stream
            if number of uncompressed bytes produced in the loop for
               this meta-block is MLEN, then break from loop (in this
               case the copy length is ignored and can have any value)
Top   ToC   RFC7932 - Page 38
            if distance code is implicit zero from insert-and-copy code
               set backward distance to the last distance
            else
               if BLEN_D is zero
                  read block type using HTREE_BTYPE_D and set BTYPE_D
                     save previous block type
                  read block count using HTREE_BLEN_D and set BLEN_D
               decrement BLEN_D
               compute context ID, CIDD from CLEN
               read distance code using HTREED[CMAPD[4*BTYPE_D + CIDD]]
               compute distance by distance short code substitution
               if distance code is not zero,
                  and distance is not a static dictionary reference,
                  push distance to the ring buffer of last distances
            if distance is less than the max allowed distance plus one
               move backwards distance bytes in the uncompressed data,
               and copy CLEN bytes from this position to
               the uncompressed stream
            else
               look up the static dictionary word, transform the word as
               directed, and copy the result to the uncompressed stream
         while number of uncompressed bytes for this meta-block < MLEN
      while not ISLAST

   If the stream ends before the completion of the last meta-block, then
   the stream should be rejected as invalid.

   Note that a duplicated string reference may refer to a string in a
   previous meta-block, i.e., the backward distance may cross one or
   more meta-block boundaries.  However, a backward copy distance will
   not refer past the beginning of the uncompressed stream or the window
   size; any such distance is interpreted as a reference to a static
   dictionary word.  Also, note that the referenced string may overlap
   the current position, for example, if the last 2 bytes decoded have
   values X and Y, a string reference with <length = 5, distance = 2>
   adds X,Y,X,Y,X to the uncompressed stream.

11. Considerations for Compressor Implementations

Since the intent of this document is to define the brotli compressed data format without reference to any particular compression algorithm, the material in this section is not part of the definition of the format, and a compressor need not follow it in order to be compliant.
Top   ToC   RFC7932 - Page 39

11.1. Trivial Compressor

In this section, we present a very simple algorithm that produces a valid brotli stream representing an arbitrary sequence of uncompressed bytes in the form of the following C++ language function. string BrotliCompressTrivial(const string& u) { if (u.empty()) { return string(1, 6); } int i; string c; c.append(1, 12); for (i = 0; i + 65535 < u.size(); i += 65536) { c.append(1, 248); c.append(1, 255); c.append(1, 15); c.append(&u[i], 65536); } if (i < u.size()) { int r = u.size() - i - 1; c.append(1, (r & 31) << 3); c.append(1, r >> 5); c.append(1, 8 + (r >> 13)); c.append(&u[i], r + 1); } c.append(1, 3); return c; } Note that this simple algorithm does not actually compress data, that is, the brotli representation will always be bigger than the original, but it shows that every sequence of N uncompressed bytes can be represented with a valid brotli stream that is not longer than N + (3 * (N >> 16) + 5) bytes.

11.2. Aligning Compressed Meta-Blocks to Byte Boundaries

As described in Section 9, only those meta-blocks that immediately follow an uncompressed meta-block or a metadata meta-block are guaranteed to start on a byte boundary. In some applications, it might be required that every non-metadata meta-block starts on a byte boundary. This can be achieved by appending an empty metadata meta- block after every non-metadata meta-block that does not end on a byte boundary.
Top   ToC   RFC7932 - Page 40

11.3. Creating Self-Contained Parts within the Compressed Data

In some encoder implementations, it might be required to make a sequence of bytes within a brotli stream self-contained, that is, such that they can be decompressed independently from previous parts of the compressed data. This is a useful feature for three reasons. First, if a large compressed file is damaged, it is possible to recover some of the file after the damage. Second, it is useful when doing differential transfer of compressed data. If a sequence of uncompressed bytes is unchanged and compressed independently from previous data, then the compressed representation may also be unchanged and can therefore be transferred very cheaply. Third, if sequences of uncompressed bytes are compressed independently, it allows for parallel compression of these byte sequences within the same file, in addition to parallel compression of multiple files. Given two sequences of uncompressed bytes, U0 and U1, we will now describe how to create two sequences of compressed bytes, C0 and C1, such that the concatenation of C0 and C1 is a valid brotli stream, and that C0 and C1 (together with the first byte of C0 that contains the window size) can be decompressed independently from each other to U0 and U1. When compressing the byte sequence U0 to produce C0, we can use any compressor that works on the complete set of uncompressed bytes U0, with the following two changes. First, the ISLAST bit of the last meta-block of C0 must not be set. Second, C0 must end at a byte- boundary, which can be ensured by appending an empty metadata meta- block to it, as in Section 11.2. When compressing the byte sequence U1 to produce C1, we can use any compressor that starts a new meta-block at the beginning of U1 within the U0+U1 input stream, with the following two changes. First, backward distances in C1 must not refer to static dictionary words or uncompressed bytes in U0. Even if a sequence of bytes in U1 would match a static dictionary word, or a sequence of bytes that overlaps U0, the compressor must represent this sequence of bytes with a combination of literal insertions and backward references to bytes in U1 instead. Second, the ring buffer of last four distances must be replenished first with distances in C1 before using it to encode other distances in C1. Note that both compressors producing C0 and C1 have to use the same window size, but the stream header is emitted only by the compressor that produces C0. Note that this method can be easily generalized to more than two sequences of uncompressed bytes.
Top   ToC   RFC7932 - Page 41

12. Security Considerations

As with any compressed file formats, decompressor implementations should handle all compressed data byte sequences, not only those that conform to this specification, where non-conformant compressed data sequences should be rejected as invalid. A possible attack against a system containing a decompressor implementation (e.g., a web browser) is to exploit a buffer overflow triggered by invalid compressed data. Therefore, decompressor implementations should perform bounds-checking for each memory access that result from values decoded from the compressed stream and derivatives thereof. Another possible attack against a system containing a decompressor implementation is to provide it (either valid or invalid) compressed data that can make the decompressor system's resource consumption (CPU, memory, or storage) to be disproportionately large compared to the size of the compressed data. In addition to the size of the compressed data, the amount of CPU, memory, and storage required to decompress a single compressed meta-block within a brotli stream is controlled by the following two parameters: the size of the uncompressed meta-block, which is encoded at the start of the compressed meta-block, and the size of the sliding window, which is encoded at the start of the brotli stream. Decompressor implementations in systems where memory or storage is constrained should perform a sanity-check on these two parameters. The uncompressed meta-block size that was decoded from the compressed stream should be compared against either a hard limit, given by the system's constraints or some expectation about the uncompressed data, or against a certain multiple of the size of the compressed data. If the uncompressed meta-block size is determined to be too high, the compressed data should be rejected. Likewise, when the complete uncompressed stream is kept in the system containing the decompressor implementation, the total uncompressed size of the stream should be checked before decompressing each additional meta-block. If the size of the sliding window that was decoded from the start of the compressed stream is greater than a certain soft limit, then the decompressor implementation should, at first, allocate a smaller sliding window that fits the first uncompressed meta-block, and afterwards, before decompressing each additional meta-block, it should increase the size of the sliding window until the sliding window size specified in the compressed data is reached.
Top   ToC   RFC7932 - Page 42
   Correspondingly, possible attacks against a system containing a
   compressor implementation (e.g., a web server) are to exploit a
   buffer overflow or cause disproportionately large resource
   consumption by providing, e.g., uncompressible data.  As described in
   Section 11.1, an output buffer of

            S(N) = N + (3 * (N >> 16) + 5)

   bytes is sufficient to hold a valid compressed brotli stream
   representing an arbitrary sequence of N uncompressed bytes.
   Therefore, compressor implementations should allocate at least S(N)
   bytes of output buffer before compressing N bytes of data with
   unknown compressibility and should perform bounds-checking for each
   write into this output buffer.  If their output buffer is full,
   compressor implementations should revert to the trivial compression
   algorithm described in Section 11.1.  The resource consumption of a
   compressor implementation for a particular input data depends mostly
   on the algorithm used to find backward matches and on the algorithm
   used to construct context maps and prefix codes and only to a lesser
   extent on the input data itself.  If the system containing a
   compressor implementation is overloaded, a possible way to reduce
   resource usage is to switch to more simple algorithms for backward
   reference search and prefix code construction, or to fall back to the
   trivial compression algorithm described in Section 11.1.

   A possible attack against a system that sends compressed data over an
   encrypted channel is the following.  An attacker who can repeatedly
   mix arbitrary (attacker-supplied) data with secret data (passwords,
   cookies) and observe the length of the ciphertext can potentially
   reconstruct the secret data.  To protect against this kind of attack,
   applications should not mix sensitive data with non-sensitive,
   potentially attacker-supplied data in the same compressed stream.

13. IANA Considerations

The "HTTP Content Coding Registry" has been updated with the registration below: +-------+-------------------------------------+------------+ | Name | Description | Reference | +-------+-------------------------------------+------------+ | br | Brotli Compressed Data Format | RFC 7932 | +-------+-------------------------------------+------------+
Top   ToC   RFC7932 - Page 43

14. Informative References

[HUFFMAN] Huffman, D. A., "A Method for the Construction of Minimum Redundancy Codes", Proceedings of the Institute of Radio Engineers, September 1952, Vol. 40, No. 9, pp. 1098-1101. [LZ77] Ziv, J. and A. Lempel, "A Universal Algorithm for Sequential Data Compression", IEEE Transactions on Information Theory, Vol. 23, No. 3, pp. 337-343, DOI 10.1109/TIT.1977.1055714, May 1977, <https://www.cs.duke.edu/courses/spring03/cps296.5/papers/ ziv_lempel_1977_universal_algorithm.pdf>. [RFC1951] Deutsch, P., "DEFLATE Compressed Data Format Specification version 1.3", RFC 1951, DOI 10.17487/RFC1951, May 1996, <http://www.rfc-editor.org/info/rfc1951>. [WOFF2] Levantovsky, V., Ed., and R. Levien, Ed., "WOFF File Format 2.0", W3C Candidate Recommendation, March 2016, <http://www.w3.org/TR/WOFF2/>.


(next page on part 3)

Next Section