A CBOR data item (Section 2) is encoded to or decoded from a byte string carrying a well-formed encoded data item as described in this section. The encoding is summarized in Table 7 in Appendix B, indexed by the initial byte. An encoder MUST produce only well-formed encoded data items. A decoder MUST NOT return a decoded data item when it encounters input that is not a well-formed encoded CBOR data item (this does not detract from the usefulness of diagnostic and recovery tools that might make available some information from a damaged encoded CBOR data item).
The initial byte of each encoded data item contains both information about the major type (the high-order 3 bits, described in Section 3.1) and additional information (the low-order 5 bits). With a few exceptions, the additional information's value describes how to load an unsigned integer "argument":
- Less than 24: The argument's value is the value of the additional information.
- 24, 25, 26, or 27: The argument's value is held in the following 1, 2, 4, or 8 bytes, respectively, in network byte order. For major type 7 and additional information value 25, 26, 27, these bytes are not used as an integer argument, but as a floating-point value (see Section 3.3).
- 28, 29, 30: These values are reserved for future additions to the CBOR format. In the present version of CBOR, the encoded item is not well-formed.
- 31: No argument value is derived. If the major type is 0, 1, or 6, the encoded item is not well-formed. For major types 2 to 5, the item's length is indefinite, and for major type 7, the byte does not constitute a data item at all but terminates an indefinite-length item; all are described in Section 3.2.
The initial byte and any additional bytes consumed to construct the argument are collectively referred to as the head of the data item.
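The rules above for loading the argument can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation; the function name `decode_head` and its return shape are assumptions made for this example.

```python
def decode_head(data: bytes, offset: int = 0):
    """Return (major_type, additional_info, argument, head_length).

    The argument is None for additional information 31, where no argument
    value is derived. For major type 7 with additional information 25..27,
    the returned integer holds the raw bits of a floating-point value.
    """
    if offset >= len(data):
        raise ValueError("input ends before the initial byte")
    initial = data[offset]
    major_type = initial >> 5   # high-order 3 bits
    info = initial & 0x1F       # low-order 5 bits
    if info < 24:
        # The argument is the additional information itself.
        return major_type, info, info, 1
    if info <= 27:
        size = 1 << (info - 24)  # 1, 2, 4, or 8 following bytes
        raw = data[offset + 1 : offset + 1 + size]
        if len(raw) != size:
            raise ValueError("input ends inside the head")
        # Network byte order is big-endian.
        return major_type, info, int.from_bytes(raw, "big"), 1 + size
    if info < 31:
        raise ValueError("additional information 28..30 is reserved")
    if major_type in (0, 1, 6):
        raise ValueError("additional information 31 is not well-formed here")
    return major_type, info, None, 1
```

For instance, `decode_head(bytes.fromhex("1901f4"))` yields major type 0, additional information 25, argument 500, and a head length of 3 bytes.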
The meaning of this argument depends on the major type. For example, in major type 0, the argument is the value of the data item itself (and in major type 1, the value of the data item is computed from the argument); in major type 2 and 3, it gives the length of the string data in bytes that follow; and in major types 4 and 5, it is used to determine the number of data items enclosed.
If the encoded sequence of bytes ends before the end of a data item, that item is not well-formed. If the encoded sequence of bytes still has bytes remaining after the outermost encoded item is decoded, that encoding is not a single well-formed CBOR item. Depending on the application, the decoder may either treat the encoding as not well-formed or just identify the start of the remaining bytes to the application.
A CBOR decoder implementation can be based on a jump table with all 256 defined values for the initial byte (Table 7). A decoder in a constrained implementation can instead use the structure of the initial byte and following bytes for more compact code (see Appendix C for a rough impression of how this could look).
The following lists the major types and the additional information and other bytes associated with the type.
- Major type 0: An unsigned integer in the range 0..2^64-1 inclusive. The value of the encoded item is the argument itself. For example, the integer 10 is denoted as the one byte 0b000_01010 (major type 0, additional information 10). The integer 500 would be 0b000_11001 (major type 0, additional information 25) followed by the two bytes 0x01f4, which is 500 in decimal.
- Major type 1: A negative integer in the range -2^64..-1 inclusive. The value of the item is -1 minus the argument. For example, the integer -500 would be 0b001_11001 (major type 1, additional information 25) followed by the two bytes 0x01f3, which is 499 in decimal.
- Major type 2: A byte string. The number of bytes in the string is equal to the argument. For example, a byte string whose length is 5 would have an initial byte of 0b010_00101 (major type 2, additional information 5 for the length), followed by 5 bytes of binary content. A byte string whose length is 500 would have 3 initial bytes of 0b010_11001 (major type 2, additional information 25 to indicate a two-byte length) followed by the two bytes 0x01f4 for a length of 500, followed by 500 bytes of binary content.
- Major type 3: A text string (Section 2) encoded as UTF-8 [RFC 3629]. The number of bytes in the string is equal to the argument. A string containing an invalid UTF-8 sequence is well-formed but invalid (Section 1.2). This type is provided for systems that need to interpret or display human-readable text, and allows the differentiation between unstructured bytes and text that has a specified repertoire (that of Unicode) and encoding (UTF-8). In contrast to formats such as JSON, the Unicode characters in this type are never escaped. Thus, a newline character (U+000A) is always represented in a string as the byte 0x0a, and never as the bytes 0x5c6e (the characters "\" and "n") nor as 0x5c7530303061 (the characters "\", "u", "0", "0", "0", and "a").
- Major type 4: An array of data items. In other formats, arrays are also called lists, sequences, or tuples (a "CBOR sequence" is something slightly different, though [RFC 8742]). The argument is the number of data items in the array. Items in an array do not need to all be of the same type. For example, an array that contains 10 items of any type would have an initial byte of 0b100_01010 (major type 4, additional information 10 for the length) followed by the 10 remaining items.
- Major type 5: A map of pairs of data items. Maps are also called tables, dictionaries, hashes, or objects (in JSON). A map is comprised of pairs of data items, each pair consisting of a key that is immediately followed by a value. The argument is the number of pairs of data items in the map. For example, a map that contains 9 pairs would have an initial byte of 0b101_01001 (major type 5, additional information 9 for the number of pairs) followed by the 18 remaining items. The first item is the first key, the second item is the first value, the third item is the second key, and so on. Because items in a map come in pairs, their total number is always even: a map that contains an odd number of items (no value data present after the last key data item) is not well-formed. A map that has duplicate keys may be well-formed, but it is not valid, and thus it causes indeterminate decoding; see also Section 5.6.
- Major type 6: A tagged data item ("tag") whose tag number, an integer in the range 0..2^64-1 inclusive, is the argument and whose enclosed data item (tag content) is the single encoded data item that follows the head. See Section 3.4.
- Major type 7: Floating-point numbers and simple values, as well as the "break" stop code. See Section 3.3.
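The worked byte values in the list above can be reproduced with a small encoder sketch. The helper names (`encode_head`, `encode_int`, `encode_bytes`, `encode_text`) are hypothetical, and the choice of always emitting the shortest head is an assumption of this example rather than a requirement for well-formedness.

```python
def encode_head(major_type: int, argument: int) -> bytes:
    """Encode a head for major types 0..6 using the shortest form."""
    if not 0 <= major_type <= 6:
        raise ValueError("this sketch covers major types 0..6 only")
    if not 0 <= argument < 2**64:
        raise ValueError("argument must be in the range 0..2^64-1")
    first = major_type << 5
    if argument < 24:
        return bytes([first | argument])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if argument < 1 << (8 * size):
            return bytes([first | info]) + argument.to_bytes(size, "big")
    raise AssertionError("unreachable: argument was range-checked above")


def encode_int(value: int) -> bytes:
    # Major type 0 carries the value itself; major type 1 carries -1 - value.
    if value >= 0:
        return encode_head(0, value)
    return encode_head(1, -1 - value)


def encode_bytes(value: bytes) -> bytes:
    return encode_head(2, len(value)) + value


def encode_text(value: str) -> bytes:
    utf8 = value.encode("utf-8")  # the length counts bytes, not characters
    return encode_head(3, len(utf8)) + utf8
```

With these helpers, `encode_int(500)` produces the bytes 0x1901f4 and `encode_int(-500)` produces 0x3901f3, matching the examples for major types 0 and 1.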
These eight major types lead to a simple table showing which of the 256 possible values for the initial byte of a data item are used (Table 7).
In major types 6 and 7, many of the possible values are reserved for future specification. See Section 9 for more information on these values.
Table 1 summarizes the major types defined by CBOR, ignoring Section 3.2 for now. The number N in this table stands for the argument.
Major Type | Meaning | Content
0 | unsigned integer N | -
1 | negative integer -1-N | -
2 | byte string | N bytes
3 | text string | N bytes (UTF-8 text)
4 | array | N data items (elements)
5 | map | 2N data items (key/value pairs)
6 | tag of number N | 1 data item
7 | simple/float | -
Table 1: Overview over the Definite-Length Use of CBOR Major Types (N = Argument)
Four CBOR items (arrays, maps, byte strings, and text strings) can be encoded with an indefinite length using additional information value 31. This is useful if the encoding of the item needs to begin before the number of items inside the array or map, or the total length of the string, is known. (The ability to start sending a data item before all of it is known is often referred to as "streaming" within that data item.)
Indefinite-length arrays and maps are dealt with differently than indefinite-length strings (byte strings and text strings).
The "break" stop code is encoded with major type 7 and additional information value 31 (0b111_11111). It is not itself a data item: it is just a syntactic feature to close an indefinite-length item.
If the "break" stop code appears where a data item is expected, other than directly inside an indefinite-length string, array, or map -- for example, directly inside a definite-length array or map -- the enclosing item is not well-formed.
Indefinite-length arrays and maps are represented using their major type with the additional information value of 31, followed by an arbitrary-length sequence of zero or more items for an array or key/value pairs for a map, followed by the "break" stop code (Section 3.2.1). In other words, indefinite-length arrays and maps look identical to other arrays and maps except for beginning with the additional information value of 31 and ending with the "break" stop code.
If the "break" stop code appears after a key in a map, in place of that key's value, the map is not well-formed.
There is no restriction against nesting indefinite-length array or map items. A "break" only terminates a single item, so nested indefinite-length items need exactly as many "break" stop codes as there are type bytes starting an indefinite-length item.
For example, assume an encoder wants to represent the abstract array [1, [2, 3], [4, 5]]. The definite-length encoding would be 0x8301820203820405:
83 -- Array of length 3
01 -- 1
82 -- Array of length 2
02 -- 2
03 -- 3
82 -- Array of length 2
04 -- 4
05 -- 5
Indefinite-length encoding could be applied independently to each of the three arrays encoded in this data item, as required, leading to representations such as:
0x9f018202039f0405ffff
9F -- Start indefinite-length array
01 -- 1
82 -- Array of length 2
02 -- 2
03 -- 3
9F -- Start indefinite-length array
04 -- 4
05 -- 5
FF -- "break" (inner array)
FF -- "break" (outer array)
0x9f01820203820405ff
9F -- Start indefinite-length array
01 -- 1
82 -- Array of length 2
02 -- 2
03 -- 3
82 -- Array of length 2
04 -- 4
05 -- 5
FF -- "break"
0x83018202039f0405ff
83 -- Array of length 3
01 -- 1
82 -- Array of length 2
02 -- 2
03 -- 3
9F -- Start indefinite-length array
04 -- 4
05 -- 5
FF -- "break"
0x83019f0203ff820405
83 -- Array of length 3
01 -- 1
9F -- Start indefinite-length array
02 -- 2
03 -- 3
FF -- "break"
82 -- Array of length 2
04 -- 4
05 -- 5
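As a rough check that the five encodings above represent the same abstract array, the following sketch decodes integers and arrays only, in both definite-length and indefinite-length form. The names `decode`, `decode_item`, and `BREAK` are assumptions for this illustration; other major types are deliberately rejected to keep the example short.

```python
BREAK = object()  # sentinel for the "break" stop code, which is not a data item


def decode_item(data: bytes, pos: int = 0):
    """Decode one item (integers and arrays only); return (value, next_pos)."""
    if pos >= len(data):
        raise ValueError("input ends before the end of a data item")
    initial = data[pos]
    pos += 1
    if initial == 0xFF:
        return BREAK, pos
    major, info = initial >> 5, initial & 0x1F
    if major not in (0, 1, 4):
        raise ValueError("major type not covered by this sketch")
    if info == 31:
        if major != 4:
            raise ValueError("additional information 31 is not well-formed for integers")
        items = []
        while True:  # read items until the matching "break"
            item, pos = decode_item(data, pos)
            if item is BREAK:
                return items, pos
            items.append(item)
    if info < 24:
        argument = info
    elif info <= 27:
        size = 1 << (info - 24)
        if pos + size > len(data):
            raise ValueError("input ends inside the head")
        argument = int.from_bytes(data[pos:pos + size], "big")
        pos += size
    else:
        raise ValueError("additional information 28..30 is reserved")
    if major == 0:
        return argument, pos
    if major == 1:
        return -1 - argument, pos
    items = []
    for _ in range(argument):
        item, pos = decode_item(data, pos)
        if item is BREAK:
            raise ValueError('"break" directly inside a definite-length array')
        items.append(item)
    return items, pos


def decode(data: bytes):
    value, pos = decode_item(data)
    if value is BREAK:
        raise ValueError('"break" where a data item is expected')
    if pos != len(data):
        raise ValueError("bytes remain after the outermost item")
    return value
```

Each nested call consumes exactly one "break", which mirrors the rule that a "break" terminates only a single indefinite-length item.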
An example of an indefinite-length map (that happens to have two key/value pairs) might be:
0xbf6346756ef563416d7421ff
BF -- Start indefinite-length map
63 -- First key, UTF-8 string length 3
46756e -- "Fun"
F5 -- First value, true
63 -- Second key, UTF-8 string length 3
416d74 -- "Amt"
21 -- Second value, -2
FF -- "break"
Indefinite-length strings are represented by a byte containing the major type for byte string or text string with an additional information value of 31, followed by a series of zero or more strings of the specified type ("chunks") that have definite lengths, and finished by the "break" stop code (Section 3.2.1). The data item represented by the indefinite-length string is the concatenation of the chunks. If no chunks are present, the data item is an empty string of the specified type. Zero-length chunks, while not particularly useful, are permitted.
If any item between the indefinite-length string indicator (0b010_11111 or 0b011_11111) and the "break" stop code is not a definite-length string item of the same major type, the string is not well-formed.
The design does not allow nesting indefinite-length strings as chunks into indefinite-length strings. If it were allowed, it would require decoder implementations to keep a stack, or at least a count, of nesting levels. It is unnecessary on the encoder side because the inner indefinite-length string would consist of chunks, and these could instead be put directly into the outer indefinite-length string.
If any definite-length text string inside an indefinite-length text string is invalid, the indefinite-length text string is invalid. Note that this implies that the UTF-8 bytes of a single Unicode code point (scalar value) cannot be spread between chunks: a new chunk of a text string can only be started at a code point boundary.
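The code point boundary constraint can be illustrated with a short sketch that checks each chunk on its own, assuming Python's strict UTF-8 decoder as the validity check. The function name `chunks_are_valid_text` is hypothetical.

```python
def chunks_are_valid_text(chunks) -> bool:
    """Return True if every chunk is valid UTF-8 when decoded by itself."""
    for chunk in chunks:
        try:
            chunk.decode("utf-8")  # strict decoding rejects partial sequences
        except UnicodeDecodeError:
            return False
    return True
```

The character U+00E9 is encoded in UTF-8 as the two bytes 0xc3a9. Keeping both bytes in one chunk passes the check, while splitting them across two chunks makes each chunk invalid, even though the concatenation would be valid UTF-8.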
For example, assume an encoded data item consisting of the bytes:
0b010_11111 0b010_00100 0xaabbccdd 0b010_00011 0xeeff99 0b111_11111
5F -- Start indefinite-length byte string
44 -- Byte string of length 4
aabbccdd -- Bytes content
43 -- Byte string of length 3
eeff99 -- Bytes content
FF -- "break"
After decoding, this results in a single byte string with seven bytes: 0xaabbccddeeff99.
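The concatenation of chunks can be sketched as follows for byte strings. This is an illustration under stated assumptions: the function name `decode_indefinite_bytes` is hypothetical, the input is expected to start at the 0x5f byte, and bytes after the "break" are ignored.

```python
def decode_indefinite_bytes(data: bytes) -> bytes:
    """Concatenate the chunks of an indefinite-length byte string."""
    if not data or data[0] != 0x5F:
        raise ValueError("not an indefinite-length byte string")
    pos = 1
    chunks = []
    while True:
        if pos >= len(data):
            raise ValueError('input ends before the "break" stop code')
        initial = data[pos]
        pos += 1
        if initial == 0xFF:
            return b"".join(chunks)
        major, info = initial >> 5, initial & 0x1F
        # Chunks must be definite-length strings of the same major type,
        # so a nested indefinite-length string (info 31) is rejected too.
        if major != 2 or info > 27:
            raise ValueError("chunk is not a definite-length byte string")
        if info < 24:
            length = info
        else:
            size = 1 << (info - 24)
            if pos + size > len(data):
                raise ValueError("input ends inside a chunk head")
            length = int.from_bytes(data[pos:pos + size], "big")
            pos += size
        if pos + length > len(data):
            raise ValueError("input ends inside a chunk")
        chunks.append(data[pos:pos + length])
        pos += length
```

Applied to the bytes 0x5f44aabbccdd43eeff99ff, the sketch returns the seven bytes 0xaabbccddeeff99.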
Table 2 summarizes the major types defined by CBOR as used for indefinite-length encoding (with additional information set to 31).
Major Type | Meaning | Enclosed up to "break" Stop Code
0 | (not well-formed) | -
1 | (not well-formed) | -
2 | byte string | definite-length byte strings
3 | text string | definite-length text strings
4 | array | data items (elements)
5 | map | data items (key/value pairs)
6 | (not well-formed) | -
7 | "break" stop code | -
Table 2: Overview of the Indefinite-Length Use of CBOR Major Types (Additional Information = 31)
Major type 7 is for two types of data: floating-point numbers and "simple values" that do not need any content. Each value of the 5-bit additional information in the initial byte has its own separate meaning, as defined in Table 3. Like the major types for integers, items of this major type do not carry content data; all the information is in the initial bytes (the head).
5-Bit Value | Semantics
0..23 | Simple value (value 0..23)
24 | Simple value (value 32..255 in following byte)
25 | IEEE 754 Half-Precision Float (16 bits follow)
26 | IEEE 754 Single-Precision Float (32 bits follow)
27 | IEEE 754 Double-Precision Float (64 bits follow)
28-30 | Reserved, not well-formed in the present document
31 | "break" stop code for indefinite-length items (Section 3.2.1)
Table 3: Values for Additional Information in Major Type 7
As with all other major types, the 5-bit value 24 signifies a single-byte extension: it is followed by an additional byte to represent the simple value. (To minimize confusion, only the values 32 to 255 are used.) This maintains the structure of the initial bytes: as for the other major types, the length of these always depends on the additional information in the first byte.
Table 4 lists the numeric values assigned and available for simple values.
Value | Semantics
0..19 | (unassigned)
20 | false
21 | true
22 | null
23 | undefined
24..31 | (reserved)
32..255 | (unassigned)
Table 4: Simple Values
An encoder MUST NOT issue two-byte sequences that start with 0xf8 (major type 7, additional information 24) and continue with a byte less than 0x20 (32 decimal). Such sequences are not well-formed. (This implies that an encoder cannot encode false, true, null, or undefined in two-byte sequences and that only the one-byte variants of these are well-formed; more generally speaking, each simple value only has a single representation variant).
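The single-representation rule for simple values can be sketched as an encoder that refuses the values it must not emit. The function name `encode_simple` is an assumption for this example.

```python
def encode_simple(value: int) -> bytes:
    """Encode a simple value in its only well-formed representation."""
    if 0 <= value <= 23:
        # One byte: major type 7 (0b111_00000) with the value as
        # additional information.
        return bytes([0xE0 | value])
    if 32 <= value <= 255:
        # Two bytes: 0xf8 followed by the value.
        return bytes([0xF8, value])
    # Values 24..31 would need 0xf8 followed by a byte below 0x20,
    # which is not well-formed.
    raise ValueError("simple value must be in 0..23 or 32..255")
```

Under this sketch, false, true, null, and undefined (values 20 to 23) always come out as the single bytes 0xf4, 0xf5, 0xf6, and 0xf7.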
The 5-bit values of 25, 26, and 27 are for 16-bit, 32-bit, and 64-bit IEEE 754 binary floating-point values [IEEE754]. These floating-point values are encoded in the additional bytes of the appropriate size. (See Appendix D for some information about 16-bit floating-point numbers.)
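A brief sketch of reading these three floating-point encodings uses Python's standard `struct` module, whose `e`, `f`, and `d` format characters correspond to binary16, binary32, and binary64. The function names `decode_float` and `encode_double` are assumptions of this example.

```python
import struct

# Additional information value -> big-endian struct format.
_FLOAT_FORMATS = {25: ">e", 26: ">f", 27: ">d"}


def decode_float(data: bytes) -> float:
    """Decode a major type 7 floating-point item at the start of data."""
    if not data or data[0] >> 5 != 7:
        raise ValueError("not a major type 7 item")
    fmt = _FLOAT_FORMATS.get(data[0] & 0x1F)
    if fmt is None:
        raise ValueError("not a floating-point encoding")
    size = struct.calcsize(fmt)
    if len(data) < 1 + size:
        raise ValueError("input ends inside the head")
    return struct.unpack(fmt, data[1:1 + size])[0]


def encode_double(value: float) -> bytes:
    """Encode a float as binary64 (0xfb followed by 8 bytes)."""
    return b"\xfb" + struct.pack(">d", value)
```

This encoder always emits the 64-bit form; choosing the shortest form that preserves the value is a separate concern covered by preferred serialization (Section 4.1).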
In CBOR, a data item can be enclosed by a tag to give it some additional semantics, as uniquely identified by a tag number. The tag is major type 6, its argument (Section 3) indicates the tag number, and it contains a single enclosed data item, the tag content. (If a tag requires further structure to its content, this structure is provided by the enclosed data item.) We use the term tag for the entire data item consisting of both a tag number and the tag content: the tag content is the data item that is being tagged.
For example, assume that a byte string of length 12 is marked with a tag of number 2 to indicate it is an unsigned bignum (Section 3.4.3). The encoded data item would start with a byte 0b110_00010 (major type 6, additional information 2 for the tag number) followed by the encoded tag content: 0b010_01100 (major type 2, additional information 12 for the length) followed by the 12 bytes of the bignum.
In the extended generic data model, a tag number's definition describes the additional semantics conveyed with the tag number. These semantics may include equivalence of some tagged data items with other data items, including some that can be represented in the basic generic data model. For instance, 0xc24101, a bignum the tag content of which is the byte string with the single byte 0x01, is equivalent to an integer 1, which could also be encoded as 0x01, 0x1801, or 0x190001. The tag definition may specify a preferred serialization (Section 4.1) that is recommended for generic encoders; this may prefer basic generic data model representations over ones that employ a tag.
The tag definition usually defines which nested data items are valid for such tags. Tag definitions may restrict their content to a very specific syntactic structure, as the tags defined in this document do, or they may define their content more semantically. An example for the latter is how tags 40 and 1040 accept multiple ways to represent arrays [RFC 8746].
As a matter of convention, many tags do not accept null or undefined values as tag content; instead, the expectation is that a null or undefined value can be used in place of the entire tag; Section 3.4.2 provides some further considerations for one specific tag about the handling of this convention in application protocols and in mapping to platform types.
Decoders do not need to understand tags of every tag number, and tags may be of little value in applications where the implementation creating a particular CBOR data item and the implementation decoding that stream know the semantic meaning of each item in the data flow. The primary purpose of tags in this specification is to define common data types such as dates. A secondary purpose is to provide conversion hints when it is foreseen that the CBOR data item needs to be translated into a different format, requiring hints about the content of items. Understanding the semantics of tags is optional for a decoder; it can simply present both the tag number and the tag content to the application, without interpreting the additional semantics of the tag.
A tag applies semantics to the data item it encloses. Tags can nest: if tag A encloses tag B, which encloses data item C, tag A applies to the result of applying tag B on data item C.
IANA maintains a registry of tag numbers as described in Section 9.2.
Table 5 provides a list of tag numbers that were defined in [RFC 7049] with definitions in the rest of this section. (Tag number 35 was also defined in [RFC 7049]; a discussion of this tag number follows in Section 3.4.5.3.) Note that many other tag numbers have been defined since the publication of [RFC 7049]; see the registry described at Section 9.2 for the complete list.
Table 5: Tag Numbers Defined in RFC 7049
Conceptually, tags are interpreted in the generic data model, not at (de-)serialization time. A small number of tags (at this time, tag number 25 and tag number 29 [IANA.cbor-tags]) have been registered with semantics that may require processing at (de-)serialization time: the decoder needs to be aware of, and the encoder needs to be in control of, the exact sequence in which data items are encoded into the CBOR data item. This means these tags cannot be implemented on top of an arbitrary generic CBOR encoder/decoder (which might not reflect the serialization order for entries in a map at the data model level and vice versa); their implementation therefore typically needs to be integrated into the generic encoder/decoder. The definition of new tags with this property is NOT RECOMMENDED.
IANA allocated tag numbers 65535, 4294967295, and 18446744073709551615 (binary all-ones in 16-bit, 32-bit, and 64-bit). These can be used as a convenience for implementers who want a single-integer data structure to indicate either the presence of a specific tag or absence of a tag. That allocation is described in [CBOR-TAGS]. These tags are not intended to occur in actual CBOR data items; implementations MAY flag such an occurrence as an error.
Protocols can extend the generic data model (Section 2) with data items representing points in time by using tag numbers 0 and 1, with arbitrarily sized integers by using tag numbers 2 and 3, and with floating-point values of arbitrary size and precision by using tag numbers 4 and 5.
Tag number 0 contains a text string in the standard format described by the date-time production in [RFC 3339], as refined by Section 3.3 of RFC 4287, representing the point in time described there. A nested item of another type or a text string that doesn't match the format described in [RFC 4287] is invalid.
Tag number 1 contains a numerical value counting the number of seconds from 1970-01-01T00:00Z in UTC time to the represented point in civil time.
The tag content MUST be an unsigned or negative integer (major types 0 and 1) or a floating-point number (major type 7 with additional information 25, 26, or 27). Other contained types are invalid.
Nonnegative values (major type 0 and nonnegative floating-point numbers) stand for time values on or after 1970-01-01T00:00Z UTC and are interpreted according to POSIX [TIME_T]. (POSIX time is also known as "UNIX Epoch time".) Leap seconds are handled specially by POSIX time, and this results in a 1-second discontinuity several times per decade. Note that applications that require the expression of times beyond early 2106 cannot leave out support of 64-bit integers for the tag content.
Negative values (major type 1 and negative floating-point numbers) are interpreted as determined by the application requirements as there is no universal standard for UTC count-of-seconds time before 1970-01-01T00:00Z (this is particularly true for points in time that precede discontinuities in national calendars). The same applies to non-finite values.
To indicate fractional seconds, floating-point values can be used within tag number 1 instead of integer values. Note that this generally requires binary64 support, as binary16 and binary32 provide nonzero fractions of seconds only for a short period of time around early 1970. An application that requires tag number 1 support may restrict the tag content to be an integer (or a floating-point value) only.
Note that platform types for date/time may include null or undefined values, which may also be desirable at an application protocol level. While emitting tag number 1 values with non-finite tag content values (e.g., with NaN for undefined date/time values or with Infinity for an expiry date that is not set) may seem an obvious way to handle this, using untagged null or undefined avoids the use of non-finites and results in a shorter encoding. Application protocol designers are encouraged to consider these cases and include clear guidelines for handling them.
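One possible mapping of tag number 1 content to a platform date/time type is sketched below. It assumes an application that accepts only nonnegative, finite content and rejects everything else, which is one policy among several that an application could choose. The names `EPOCH` and `epoch_to_datetime` are hypothetical.

```python
import math
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def epoch_to_datetime(content):
    """Map already-decoded tag number 1 content to an aware datetime."""
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if isinstance(content, bool) or not isinstance(content, (int, float)):
        raise ValueError("tag content must be an integer or a float")
    if isinstance(content, float) and not math.isfinite(content):
        raise ValueError("non-finite values are left to the application")
    if content < 0:
        raise ValueError("negative values are left to the application")
    return EPOCH + timedelta(seconds=content)
```

Adding 2^32 seconds to the epoch lands in February 2106, which is why times beyond that point need tag content wider than a 32-bit unsigned integer.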
Protocols using tag numbers 2 and 3 extend the generic data model (Section 2) with "bignums" representing arbitrarily sized integers. In the basic generic data model, bignum values are not equal to integers from the same model, but the extended generic data model created by this tag definition defines equivalence based on numeric value, and preferred serialization (Section 4.1) never makes use of bignums that also can be expressed as basic integers (see below).
Bignums are encoded as a byte string data item, which is interpreted as an unsigned integer n in network byte order. Contained items of other types are invalid. For tag number 2, the value of the bignum is n. For tag number 3, the value of the bignum is -1 - n. The preferred serialization of the byte string is to leave out any leading zeroes (note that this means the preferred serialization for n = 0 is the empty byte string, but see below). Decoders that understand these tags MUST be able to decode bignums that do have leading zeroes. The preferred serialization of an integer that can be represented using major type 0 or 1 is to encode it this way instead of as a bignum (which means that the empty string never occurs in a bignum when using preferred serialization). Note that this means the non-preferred choice of a bignum representation instead of a basic integer for encoding a number is not intended to have application semantics (just as the choice of a longer basic integer representation than needed, such as 0x1800 for 0x00, does not).
For example, the number 18446744073709551616 (2^64) is represented as 0b110_00010 (major type 6, tag number 2), followed by 0b010_01001 (major type 2, length 9), followed by 0x010000000000000000 (one byte 0x01 and eight bytes 0x00). In hexadecimal:
C2 -- Tag 2
49 -- Byte string of length 9
010000000000000000 -- Bytes content
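The preferred serialization described above (basic integers where possible, bignums without leading zeroes otherwise) can be sketched as follows. The helper names `_head`, `encode_integer`, and `decode_bignum` are assumptions made for this illustration.

```python
def _head(major_type: int, argument: int) -> bytes:
    """Shortest-form head for an argument in 0..2^64-1."""
    first = major_type << 5
    if argument < 24:
        return bytes([first | argument])
    for info, size in ((24, 1), (25, 2), (26, 4), (27, 8)):
        if argument < 1 << (8 * size):
            return bytes([first | info]) + argument.to_bytes(size, "big")
    raise ValueError("argument does not fit in 64 bits")


def encode_integer(value: int) -> bytes:
    """Encode any integer, using a bignum only when it is needed."""
    negative = value < 0
    n = -1 - value if negative else value
    if n < 2**64:
        return _head(1 if negative else 0, n)
    # to_bytes with the minimal length leaves out leading zero bytes.
    content = n.to_bytes((n.bit_length() + 7) // 8, "big")
    return _head(6, 3 if negative else 2) + _head(2, len(content)) + content


def decode_bignum(tag_number: int, content: bytes) -> int:
    """Interpret bignum tag content; leading zero bytes are accepted."""
    n = int.from_bytes(content, "big")
    if tag_number == 2:
        return n
    if tag_number == 3:
        return -1 - n
    raise ValueError("not a bignum tag number")
```

With this sketch, `encode_integer(2**64)` yields the bytes 0xc249010000000000000000 shown above, while `encode_integer(2**64 - 1)` still fits major type 0.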
Protocols using tag number 4 extend the generic data model with data items representing arbitrary-length decimal fractions of the form m*(10^e). Protocols using tag number 5 extend the generic data model with data items representing arbitrary-length binary fractions of the form m*(2^e). As with bignums, values of different types are not equal in the generic data model.
Decimal fractions combine an integer mantissa with a base-10 scaling factor. They are most useful if an application needs the exact representation of a decimal fraction such as 1.1 because there is no exact representation for many decimal fractions in binary floating-point representations.
"Bigfloats" combine an integer mantissa with a base-2 scaling factor. They are binary floating-point values that can exceed the range or the precision of the three IEEE 754 formats supported by CBOR (Section 3.3). Bigfloats may also be used by constrained applications that need some basic binary floating-point capability without the need for supporting IEEE 754.
A decimal fraction or a bigfloat is represented as a tagged array that contains exactly two integer numbers: an exponent e and a mantissa m. Decimal fractions (tag number 4) use base-10 exponents; the value of a decimal fraction data item is m*(10^e). Bigfloats (tag number 5) use base-2 exponents; the value of a bigfloat data item is m*(2^e). The exponent e MUST be represented in an integer of major type 0 or 1, while the mantissa can also be a bignum (Section 3.4.3). Contained items with other structures are invalid.
An example of a decimal fraction is the representation of the number 273.15 as 0b110_00100 (major type 6 for tag, additional information 4 for the tag number), followed by 0b100_00010 (major type 4 for the array, additional information 2 for the length of the array), followed by 0b001_00001 (major type 1 for the first integer, additional information 1 for the value of -2), followed by 0b000_11001 (major type 0 for the second integer, additional information 25 for a two-byte value), followed by 0b0110101010110011 (27315 in two bytes). In hexadecimal:
C4 -- Tag 4
82 -- Array of length 2
21 -- -2
19 6ab3 -- 27315
An example of a bigfloat is the representation of the number 1.5 as 0b110_00101 (major type 6 for tag, additional information 5 for the tag number), followed by 0b100_00010 (major type 4 for the array, additional information 2 for the length of the array), followed by 0b001_00000 (major type 1 for the first integer, additional information 0 for the value of -1), followed by 0b000_00011 (major type 0 for the second integer, additional information 3 for the value of 3). In hexadecimal:
C5 -- Tag 5
82 -- Array of length 2
20 -- -1
03 -- 3
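The values of the two examples can be checked exactly with Python's `fractions.Fraction`, which avoids the binary floating-point rounding that decimal fractions are meant to sidestep. The function name `fraction_value` is an assumption, and the exponent and mantissa are taken as already-decoded Python integers.

```python
from fractions import Fraction


def fraction_value(tag_number: int, exponent: int, mantissa: int) -> Fraction:
    """Compute m*(10^e) for tag number 4 or m*(2^e) for tag number 5."""
    if tag_number == 4:
        base = 10
    elif tag_number == 5:
        base = 2
    else:
        raise ValueError("not a decimal fraction or bigfloat tag number")
    # Fraction supports negative integer exponents exactly.
    return mantissa * Fraction(base) ** exponent
```

For the array [-2, 27315] under tag number 4 this gives exactly 273.15, and for [-1, 3] under tag number 5 it gives exactly 3/2.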
Decimal fractions and bigfloats provide no representation of Infinity, -Infinity, or NaN; if these are needed in place of a decimal fraction or bigfloat, the IEEE 754 half-precision representations from Section 3.3 can be used.
The tags in this section are for content hints that might be used by generic CBOR processors. These content hints do not extend the generic data model.
Sometimes it is beneficial to carry an embedded CBOR data item that is not meant to be decoded immediately at the time the enclosing data item is being decoded. Tag number 24 (CBOR data item) can be used to tag the embedded byte string as a single data item encoded in CBOR format. Contained items that aren't byte strings are invalid. A contained byte string is valid if it encodes a well-formed CBOR data item; validity checking of the decoded CBOR item is not required for tag validity (but could be offered by a generic decoder as a special option).
Tag numbers 21 to 23 indicate that a byte string might require a specific encoding when interoperating with a text-based representation. These tags are useful when an encoder knows that the byte string data it is writing is likely to be later converted to a particular JSON-based usage. That usage specifies that some strings are encoded as base64, base64url, and so on. The encoder uses byte strings instead of doing the encoding itself to reduce the message size, to reduce the code size of the encoder, or both. The encoder does not know whether or not the converter will be generic, and therefore wants to say what it believes is the proper way to convert binary strings to JSON.
The data item tagged can be a byte string or any other data item. In the latter case, the tag applies to all of the byte string data items contained in the data item, except for those contained in a nested data item tagged with an expected conversion.
These three tag numbers suggest conversions to three of the base data encodings defined in [RFC 4648]. Tag number 21 suggests conversion to base64url encoding (Section 5 of RFC 4648) where padding is not used (see Section 3.2 of RFC 4648); that is, all trailing equals signs ("=") are removed from the encoded string. Tag number 22 suggests conversion to classical base64 encoding (Section 4 of RFC 4648) with padding as defined in RFC 4648. For both base64url and base64, padding bits are set to zero (see Section 3.5 of RFC 4648), and the conversion to alternate encoding is performed on the contents of the byte string (that is, without adding any line breaks, whitespace, or other additional characters). Tag number 23 suggests conversion to base16 (hex) encoding with uppercase alphabetics (see Section 8 of RFC 4648). Note that, for all three tag numbers, the encoding of the empty byte string is the empty text string.
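A converter that honors these hints might apply them roughly as in the following sketch, which relies on Python's standard `base64` module. The function name `expected_conversion` is hypothetical, and the sketch handles a single byte string rather than walking a nested data item.

```python
import base64


def expected_conversion(tag_number: int, data: bytes) -> str:
    """Convert a byte string to text as suggested by tag numbers 21..23."""
    if tag_number == 21:
        # base64url with all trailing "=" padding characters removed.
        return base64.urlsafe_b64encode(data).decode("ascii").rstrip("=")
    if tag_number == 22:
        # Classical base64 with padding.
        return base64.b64encode(data).decode("ascii")
    if tag_number == 23:
        # base16 with uppercase alphabetics.
        return base64.b16encode(data).decode("ascii")
    raise ValueError("not an expected-conversion tag number")
```

For the two bytes 0xfbff, the three tag numbers give "-_8", "+/8=", and "FBFF", respectively.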
Some text strings hold data that have formats widely used on the Internet, and sometimes those formats can be validated and presented to the application in appropriate form by the decoder. There are tags for some of these formats.
- Tag number 32 is for URIs, as defined in [RFC 3986]. If the text string doesn't match the URI-reference production, the string is invalid.
- Tag numbers 33 and 34 are for base64url- and base64-encoded text strings, respectively, as defined in [RFC 4648]. If any of the following apply:
  - the encoded text string contains non-alphabet characters or only 1 alphabet character in the last block of 4 (where alphabet is defined by Section 5 of RFC 4648 for tag number 33 and Section 4 of RFC 4648 for tag number 34), or
  - the padding bits in a 2- or 3-character block are not 0, or
  - the base64 encoding has the wrong number of padding characters, or
  - the base64url encoding has padding characters,
  the string is invalid.
- Tag number 36 is for MIME messages (including all headers), as defined in [RFC 2045]. A text string that isn't a valid MIME message is invalid. (For this tag, validity checking may be particularly onerous for a generic decoder and might therefore not be offered. Note that many MIME messages are general binary data and therefore cannot be represented in a text string; [IANA.cbor-tags] lists a registration for tag number 257 that is similar to tag number 36 but uses a byte string as its tag content.)
Note that tag numbers 33 and 34 differ from 21 and 22 in that the data is transported in base-encoded form for the former and in raw byte string form for the latter.
[RFC 7049] also defined a tag number 35 for regular expressions that are in Perl Compatible Regular Expressions (PCRE/PCRE2) form [PCRE] or in JavaScript regular expression syntax [ECMA262]. The state of the art in these regular expression specifications has since advanced and is continually advancing, so this specification does not attempt to update the references. Instead, this tag remains available (as registered in [RFC 7049]) for applications that specify the particular regular expression variant they use out-of-band (possibly by limiting the usage to a defined common subset of both PCRE and ECMA262). As this specification clarifies tag validity beyond [RFC 7049], we note that due to the open way the tag was defined in [RFC 7049], any contained string value needs to be valid at the CBOR tag level (but then may not be "expected" at the application level).
In many applications, it will be clear from the context that CBOR is being employed for encoding a data item. For instance, a specific protocol might specify the use of CBOR, or a media type is indicated that specifies its use. However, there may be applications where such context information is not available, such as when CBOR data is stored in a file that does not have disambiguating metadata. Here, it may help to have some distinguishing characteristics for the data itself.
Tag number 55799 is defined for this purpose, specifically for use at the start of a stored encoded CBOR data item as specified by an application. It does not impart any special semantics on the data item that it encloses; that is, the semantics of the tag content enclosed in tag number 55799 is exactly identical to the semantics of the tag content itself.
The serialization of this tag's head is 0xd9d9f7, which does not appear to be in use as a distinguishing mark for any frequently used file types. In particular, 0xd9d9f7 is not a valid start of a Unicode text in any Unicode encoding if it is followed by a valid CBOR data item.
For instance, a decoder might be able to decode both CBOR and JSON. Such a decoder would need to mechanically distinguish the two formats. An easy way for an encoder to help the decoder would be to tag the entire CBOR item with tag number 55799, the serialization of which will never be found at the beginning of a JSON text.
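A minimal sketch of this kind of format sniffing follows. The constant and function names are assumptions for illustration; a real decoder would go on to decode the tag content after recognizing the prefix.

```python
# Head of tag number 55799: 0xd9 (major type 6, additional information 25)
# followed by the two bytes 0xd9f7.
SELF_DESCRIBED_CBOR = bytes.fromhex("d9d9f7")


def add_self_described_tag(encoded_item: bytes) -> bytes:
    """Wrap an already-encoded CBOR data item in tag number 55799."""
    return SELF_DESCRIBED_CBOR + encoded_item


def looks_like_self_described_cbor(data: bytes) -> bool:
    """Check for the three-byte prefix; the tag content is not validated."""
    return data.startswith(SELF_DESCRIBED_CBOR)
```

Because the tag adds no semantics, stripping the three-byte head and decoding the remainder gives the same data item as decoding the untagged encoding.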