RFC 4646

Tags for Identifying Languages

Pages: 59

Obsoletes: 3066
Obsoleted by: 5646

Part 3 of 3 – Pages 38 to 59

4.  Formation and Processing of Language Tags

   This section addresses how to use the information in the registry
   with the tag syntax to choose, form, and process language tags.

4.1.  Choice of Language Tag

   One is sometimes faced with the choice between several possible tags
   for the same body of text.

   Interoperability is best served when all users use the same language
   tag in order to represent the same language.  If an application has
   requirements that make the rules here inapplicable, then that
   application risks damaging interoperability.  It is strongly
   RECOMMENDED that users not define their own rules for language tag
   choice.

   Subtags SHOULD only be used where they add useful distinguishing
   information; extraneous subtags interfere with the meaning,
   understanding, and processing of language tags.  In particular, users
   and implementations SHOULD follow the 'Prefix' and 'Suppress-Script'
   fields in the registry (defined in Section 3.1): these fields provide
   guidance on when specific additional subtags SHOULD (and SHOULD NOT)
   be used in a language tag.

   Of particular note, many applications can benefit from the use of
   script subtags in language tags, as long as the use is consistent for
   a given context.  Script subtags were not formally defined in RFC
   3066 and their use can affect matching and subtag identification by
   implementations of RFC 3066, as these subtags appear between the
   primary language and region subtags.  For example, if a user requests
   content in an implementation of Section 2.5 of [RFC3066] using the
   language range "en-US", content labeled "en-Latn-US" will not match
   the request.  Therefore, it is important to know when script subtags
   will customarily be used and when they ought not be used.  In the
   registry, the Suppress-Script field helps ensure greater
   compatibility between the language tags generated according to the
   rules in this document and language tags and tag processors or
   consumers based on RFC 3066 by defining when users SHOULD NOT include
   a script subtag with a particular primary language subtag.
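
   The mismatch described above can be reproduced with a minimal sketch
   of prefix matching in the style of Section 2.5 of [RFC3066].  The
   function name `rfc3066_matches` is hypothetical and chosen only for
   illustration; it is not part of any specification or library.

```python
def rfc3066_matches(language_range: str, tag: str) -> bool:
    """Return True if `tag` matches `language_range` by simple prefix matching.

    A range matches a tag when the two are equal, or when the range is a
    prefix of the tag and the next character in the tag is "-".
    Comparison is case-insensitive.
    """
    rng = language_range.lower()
    candidate = tag.lower()
    if rng == "*":
        # The wildcard range matches any tag.
        return True
    return candidate == rng or candidate.startswith(rng + "-")


print(rfc3066_matches("en-US", "en-US"))       # True
print(rfc3066_matches("en-US", "en-Latn-US"))  # False: 'Latn' breaks the prefix
print(rfc3066_matches("en", "en-Latn-US"))     # True
```

   Because the script subtag sits between the language and region
   subtags, the range "en-US" is no longer a prefix of "en-Latn-US",
   which is why suppressing an uninformative script subtag helps older
   matchers.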

   Extended language subtags (type 'extlang' in the registry; see
   Section 3.1) also appear between the primary language and region
   subtags and are reserved for future standardization.  Applications
   might benefit from their judicious use in forming language tags in
   the future.  Similar recommendations are expected to apply to their
   use as apply to script subtags.

   Standards, protocols, and applications that reference this document
   normatively but apply different rules to the ones given in this
   section MUST specify how the procedure varies from the one given
   here.

   The choice of subtags used to form a language tag SHOULD be guided by
   the following rules:

   1.  Use as precise a tag as possible, but no more specific than is
       justified.  Avoid using subtags that are not important for
       distinguishing content in an application.

       *  For example, 'de' might suffice for tagging an email written
          in German, while "de-CH-1996" is probably unnecessarily
          precise for such a task.

   2.  The script subtag SHOULD NOT be used to form language tags unless
       the script adds some distinguishing information to the tag.  The
       field 'Suppress-Script' in the primary language record in the
       registry indicates which script subtags do not add distinguishing
       information for most applications.

       *  For example, the subtag 'Latn' should not be used with the
          primary language 'en' because nearly all English documents are
          written in the Latin script and it adds no distinguishing
          information.  However, if a document were written in English
          mixing Latin script with another script such as Braille
          ('Brai'), then it might be appropriate to choose to indicate
          both scripts to aid in content selection, such as the
          application of a style sheet.

   3.  If a tag or subtag has a 'Preferred-Value' field in its registry
       entry, then the value of that field SHOULD be used to form the
       language tag in preference to the tag or subtag in which the
       preferred value appears.

       *  For example, use 'he' for Hebrew in preference to 'iw'.

   4.  The 'und' (Undetermined) primary language subtag SHOULD NOT be
       used to label content, even if the language is unknown.  Omitting
       the language tag altogether is preferred to using a tag with a
       primary language subtag of 'und'.  The 'und' subtag MAY be useful
       for protocols that require a language tag to be provided.  The
       'und' subtag MAY also be useful when matching language tags in
       certain situations.

   5.  The 'mul' (Multiple) primary language subtag SHOULD NOT be used
       whenever the protocol allows separate tags for multiple
       languages, as is the case for the Content-Language header in
       HTTP.  The 'mul' subtag conveys little useful information:
       content in multiple languages SHOULD individually tag the
       languages where they appear or otherwise indicate the actual
       language in preference to the 'mul' subtag.

   6.  The same variant subtag SHOULD NOT be used more than once within
       a language tag.

       *  For example, do not use "de-DE-1901-1901".
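
   Rules 2, 3, and 6 lend themselves to a small illustration.  The
   sketch below is not a registry-driven implementation: the two
   dictionaries are a hand-copied, hypothetical excerpt holding only the
   'iw'/'he' and 'en'/'Latn' entries mentioned above, the function name
   `tidy_tag` is invented, and the input is assumed to be a simple tag
   of the form language[-script][-region][-variant...] with no extended
   language, extension, or private use subtags.

```python
# Hypothetical excerpt of registry data, for illustration only.
PREFERRED_VALUE = {"iw": "he"}    # deprecated language -> preferred value
SUPPRESS_SCRIPT = {"en": "latn"}  # language -> script that adds no information


def tidy_tag(tag: str) -> str:
    """Apply rules 2, 3, and 6 to a simple language tag."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    # Rule 3: prefer the registry's Preferred-Value for the language.
    language = PREFERRED_VALUE.get(language, language)

    result = [language]
    seen = set()
    for subtag in subtags[1:]:
        key = subtag.lower()
        # In a simple tag, a four-letter alphabetic subtag is a script.
        is_script = len(key) == 4 and key.isalpha()
        if is_script and SUPPRESS_SCRIPT.get(language) == key:
            continue  # Rule 2: drop a script that does not distinguish.
        if key in seen:
            continue  # Rule 6: do not repeat the same variant.
        seen.add(key)
        result.append(subtag)
    return "-".join(result)


print(tidy_tag("en-Latn-US"))       # en-US
print(tidy_tag("iw"))               # he
print(tidy_tag("de-DE-1901-1901"))  # de-DE-1901
```

   A real implementation would read these fields from the registry
   rather than hard-coding them, and would leave room for the cases
   (such as mixed-script content) where a normally suppressed script
   is still worth stating.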

   To ensure consistent backward compatibility, this document contains
   several provisions to account for potential instability in the
   standards used to define the subtags that make up language tags.
   These provisions mean that no language tag created under the rules in
   this document will become obsolete.

4.2.  Meaning of the Language Tag

   The relationship between the tag and the information it relates to is
   defined by the context in which the tag appears.  Accordingly, this
   section gives only possible examples of its usage.

   o  For a single information object, the associated language tags
      might be interpreted as the set of languages that is necessary for
      a complete comprehension of the complete object.  Example: Plain
      text documents.

   o  For an aggregation of information objects, the associated language
      tags could be taken as the set of languages used inside components
      of that aggregation.  Examples: Document stores and libraries.

   o  For information objects whose purpose is to provide alternatives,
      the associated language tags could be regarded as a hint that the
      content is provided in several languages and that one has to
      inspect each of the alternatives in order to find its language or
      languages.  In this case, the presence of multiple tags might not
      mean that one needs to be multi-lingual to get complete
      understanding of the document.  Example: MIME multipart/
      alternative.

   o  In markup languages, such as HTML and XML, language information
      can be added to each part of the document identified by the markup
      structure (including the whole document itself).  For example, one
      could write <span lang="fr">C'est la vie.</span> inside a
      Norwegian document; the Norwegian-speaking user could then access
      a French-Norwegian dictionary to find out what the marked section
      meant.  If the user were listening to that document through a
      speech synthesis interface, this information could be used to signal
      the synthesizer to appropriately apply French text-to-speech
      pronunciation rules to that span of text, instead of applying the
      inappropriate Norwegian rules.

   Language tags are related when they contain a similar sequence of
   subtags.  For example, if a language tag B contains language tag A as
   a prefix, then B is typically "narrower" or "more specific" than A.
   Thus, "zh-Hant-TW" is more specific than "zh-Hant".

   This relationship is not guaranteed in all cases: specifically,
   languages that begin with the same sequence of subtags are NOT
   guaranteed to be mutually intelligible, although they might be.  For
   example, the tag "az" shares a prefix with both "az-Latn"
   (Azerbaijani written using the Latin script) and "az-Cyrl"
   (Azerbaijani written using the Cyrillic script).  A person fluent in
   one script might not be able to read the other, even though the text
   might be identical.  Content tagged as "az" most probably is written
   in just one script and thus might not be intelligible to a reader
   familiar with the other script.

4.3.  Length Considerations

   [RFC3066] did not provide an upper limit on the size of language
   tags.  While RFC 3066 did define the semantics of particular subtags
   in such a way that most language tags consisted of language and
   region subtags with a combined total length of up to six characters,
   larger registered tags were not only possible but were actually
   registered.

   Neither the language tag syntax nor other requirements in this
   document impose a fixed upper limit on the number of subtags in a
   language tag (and thus an upper bound on the size of a tag).  The
   language tag syntax suggests that, depending on the specific
   language, more subtags (and thus a longer tag) are sometimes
   necessary to completely identify the language for certain
   applications; thus, it is possible to envision long or complex subtag
   sequences.

4.3.1.  Working with Limited Buffer Sizes

   Some applications and protocols are forced to allocate fixed buffer
   sizes or otherwise limit the length of a language tag.  A conformant
   implementation or specification MAY refuse to support the storage of
   language tags that exceed a specified length.  Any such limitation
   SHOULD be clearly documented, and such documentation SHOULD include
   what happens to longer tags (for example, whether an error value is
   generated or the language tag is truncated).  A protocol that allows
   tags to be truncated at an arbitrary limit, without giving any
   indication of what that limit is, has the potential for causing harm
   by changing the meaning of tags in substantial ways.

   In practice, most language tags do not require more than a few
   subtags and will not approach reasonably sized buffer limitations;
   see Section 4.1.

   Some specifications or protocols have limits on tag length but do not
   have a fixed length limitation.  For example, [RFC2231] has no
   explicit length limitation: the length available for the language tag
   is constrained by the length of other header components (such as the
   charset's name) coupled with the 76-character limit in [RFC2047].
   Thus, the "limit" might be 50 or more characters, but it could
   potentially be quite small.

   The considerations for assigning a buffer limit are:

      Implementations SHOULD NOT truncate language tags unless the
      meaning of the tag is purposefully being changed, or unless the
      tag does not fit into a limited buffer size specified by a
      protocol for storage or transmission.

      Implementations SHOULD warn the user when a tag is truncated since
      truncation changes the semantic meaning of the tag.

      Implementations of protocols or specifications that are space
      constrained but do not have a fixed limit SHOULD use the longest
      possible tag in preference to truncation.

      Protocols or specifications that specify limited buffer sizes for
      language tags MUST allow for language tags of up to 33 characters.

      Protocols or specifications that specify limited buffer sizes for
      language tags SHOULD allow for language tags of at least 42
      characters.

   The following illustration shows how the 42-character recommendation
   was derived.  The combination of language and extended language
   subtags was chosen for future compatibility.  At up to 15 characters,
   this combination is longer than the longest possible primary language
   subtag (8 characters):

   language      =  3 (ISO 639-2; ISO 639-1 requires 2)
   extlang1      =  4 (each subsequent subtag includes '-')
   extlang2      =  4 (unlikely: needs prefix="language-extlang1")
   extlang3      =  4 (extremely unlikely)
   script        =  5 (if not suppressed: see Section 4.1)
   region        =  4 (UN M.49; ISO 3166 requires 3)
   variant1      =  9 (MUST have language as a prefix)
   variant2      =  9 (MUST have language-variant1 as a prefix)

   total         = 42 characters

              Figure 7: Derivation of the Limit on Tag Length
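
   The arithmetic in Figure 7 can be checked directly.  The snippet
   below simply re-adds the figure's values; the dictionary and
   variable names are illustrative.

```python
# Character counts from Figure 7; every subtag after the first
# includes its leading "-".
COMPONENTS = {
    "language": 3,
    "extlang1": 4,
    "extlang2": 4,
    "extlang3": 4,
    "script": 5,
    "region": 4,
    "variant1": 9,
    "variant2": 9,
}

total = sum(COMPONENTS.values())
lang_plus_extlang = sum(
    COMPONENTS[name]
    for name in ("language", "extlang1", "extlang2", "extlang3")
)

print(total)              # 42
print(lang_plus_extlang)  # 15
```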

4.3.2.  Truncation of Language Tags

   Truncation of a language tag alters the meaning of the tag, and thus
   SHOULD be avoided.  However, truncation of language tags is sometimes
   necessary due to limited buffer sizes.  Such truncation MUST NOT
   permit a subtag to be chopped off in the middle or the formation of
   invalid tags (for example, one ending with the "-" character).

   This means that applications or protocols that truncate tags MUST do
   so by progressively removing subtags along with their preceding "-"
   from the right side of the language tag until the tag is short enough
   for the given buffer.  If the resulting tag ends with a single-
   character subtag, that subtag and its preceding "-" MUST also be
   removed.  For example:

   Tag to truncate: zh-Latn-CN-variant1-a-extend1-x-wadegile-private1
   1. zh-Latn-CN-variant1-a-extend1-x-wadegile
   2. zh-Latn-CN-variant1-a-extend1
   3. zh-Latn-CN-variant1
   4. zh-Latn-CN
   5. zh-Latn
   6. zh

                    Figure 8: Example of Tag Truncation
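
   The procedure illustrated in Figure 8 can be sketched as follows.
   This is a minimal illustration rather than a reference
   implementation: the function name `truncate_tag` is hypothetical,
   the choice to raise an error when even the first subtag does not fit
   is an assumption, and tags that begin with 'x' or are grandfathered
   receive no special treatment.

```python
def truncate_tag(tag: str, max_length: int) -> str:
    """Shorten `tag` to at most `max_length` characters by removing
    whole subtags from the right."""
    subtags = tag.split("-")
    while len("-".join(subtags)) > max_length and len(subtags) > 1:
        subtags.pop()
        # A tag must not end with a single-character subtag, so remove
        # any singleton that the previous step left exposed.
        while len(subtags) > 1 and len(subtags[-1]) == 1:
            subtags.pop()
    result = "-".join(subtags)
    if len(result) > max_length:
        # Assumption: report failure instead of cutting inside a subtag.
        raise ValueError(f"cannot truncate {tag!r} to {max_length} characters")
    return result


tag = "zh-Latn-CN-variant1-a-extend1-x-wadegile-private1"
print(truncate_tag(tag, 42))  # zh-Latn-CN-variant1-a-extend1-x-wadegile
print(truncate_tag(tag, 39))  # zh-Latn-CN-variant1-a-extend1
print(truncate_tag(tag, 20))  # zh-Latn-CN-variant1
print(truncate_tag(tag, 8))   # zh-Latn
```

   Removing whole subtags guarantees that no subtag is cut in the
   middle and that the result never ends in "-" or in a dangling
   singleton.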

4.4.  Canonicalization of Language Tags

   Since a particular language tag is sometimes used by many processes,
   language tags SHOULD always be created or generated in a canonical
   form.

   A language tag is in canonical form when:

   1.  The tag is well-formed according to the rules in Section 2.1 and
       Section 2.2.

   2.  Subtags of type 'Region' that have a Preferred-Value mapping in
       the IANA registry (see Section 3.1) SHOULD be replaced with their
       mapped value.  Note: In rare cases, the mapped value will also
       have a Preferred-Value.

   3.  Redundant or grandfathered tags that have a Preferred-Value
       mapping in the IANA registry (see Section 3.1) MUST be replaced
       with their mapped value.  These items either are deprecated
       mappings created before the adoption of this document (such as
       the mapping of "no-nyn" to "nn" or "i-klingon" to "tlh") or are
       the result of later registrations or additions to this document
       (for example, "zh-guoyu" might be mapped to a language-extlang
       combination such as "zh-cmn" by some future update of this
       document).

   4.  Other subtags that have a Preferred-Value mapping in the IANA
       registry (see Section 3.1) MUST be replaced with their mapped
       value.  These items consist entirely of clerical corrections to
       ISO 639-1 in which the deprecated subtags have been maintained
       for compatibility purposes.

   5.  If more than one extension subtag sequence exists, the extension
       sequences are ordered into case-insensitive ASCII order by
       singleton subtag.

   Example: The language tag "en-A-aaa-B-ccc-bbb-x-xyz" is in canonical
   form, while "en-B-ccc-bbb-A-aaa-X-xyz" is well-formed but not in
   canonical form.

   Example: The language tag "en-BU" (English as used in Burma) is not
   canonical because the 'BU' subtag has a canonical mapping to 'MM'
   (Myanmar), although the tag "en-BU" maintains its validity.
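
   The ordering of extension sequences (item 5 above) can be sketched
   as shown below.  The function name `order_extensions` is
   hypothetical, the input is assumed to be well-formed, and the sketch
   only reorders extension sequences; it does not apply Preferred-Value
   mappings or change case.

```python
def order_extensions(tag: str) -> str:
    """Sort extension sequences by singleton, case-insensitively."""
    subtags = tag.split("-")
    if subtags[0].lower() == "x":
        return tag  # A wholly private use tag has nothing to reorder.

    # Everything before the first single-character subtag stays in place.
    i = 1
    while i < len(subtags) and len(subtags[i]) != 1:
        i += 1
    head = subtags[:i]

    extensions = []
    private_use = []
    while i < len(subtags):
        if subtags[i].lower() == "x":
            private_use = subtags[i:]  # Private use always stays last.
            break
        sequence = [subtags[i]]
        i += 1
        while i < len(subtags) and len(subtags[i]) != 1:
            sequence.append(subtags[i])
            i += 1
        extensions.append(sequence)

    # Subtags inside each sequence keep their original order.
    extensions.sort(key=lambda sequence: sequence[0].lower())
    ordered = head + [s for sequence in extensions for s in sequence]
    return "-".join(ordered + private_use)


print(order_extensions("en-B-ccc-bbb-A-aaa-X-xyz"))
# en-A-aaa-B-ccc-bbb-X-xyz
```

   The result differs from the canonical example only in the case of
   the 'X' singleton, which does not matter because comparisons are
   case-insensitive.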

   Canonicalization of language tags does not imply anything about the
   use of upper or lowercase letters when processing or comparing
   subtags (as described in Section 2.1).  All comparisons MUST be
   performed in a case-insensitive manner.

   When performing canonicalization of language tags, processors MAY
   regularize the case of the subtags (that is, this process is
   OPTIONAL), following the case used in the registry.  Note that this
   corresponds to the following casing rules: uppercase all non-initial
   two-letter subtags; titlecase all non-initial four-letter subtags;
   lowercase everything else.

   Note: Case folding of ASCII letters in certain locales, unless
   carefully handled, sometimes produces non-ASCII character values.
   The Unicode Character Database file "SpecialCasing.txt" defines the
   specific cases that are known to cause problems with this.  In
   particular, the letter 'i' (U+0069) in Turkish and Azerbaijani is
   uppercased to U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE).
   Implementers SHOULD specify a locale-neutral casing operation to
   ensure that case folding of subtags does not produce this value,
   which is illegal in language tags.  For example, if one were to
   uppercase the region subtag 'in' using Turkish locale rules, the
   sequence U+0130 U+004E would result instead of the expected 'IN'.
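
   The casing rules and the locale caveat can be illustrated together.
   The sketch below assumes the tag is already well-formed ASCII; the
   function name `regularize_case` is hypothetical.  Python's built-in
   string case methods use Unicode default case mappings and do not
   consult the process locale, so they behave as a locale-neutral
   casing operation here.

```python
def regularize_case(tag: str) -> str:
    """Apply the optional casing rules to a well-formed ASCII tag."""
    subtags = tag.split("-")
    result = [subtags[0].lower()]  # The initial subtag is lowercased.
    for subtag in subtags[1:]:
        if len(subtag) == 2:
            result.append(subtag.upper())       # e.g. 'us' -> 'US'
        elif len(subtag) == 4:
            result.append(subtag.capitalize())  # e.g. 'latn' -> 'Latn'
        else:
            result.append(subtag.lower())
    return "-".join(result)


print(regularize_case("EN-latn-us"))  # en-Latn-US
print(regularize_case("hi-in"))       # hi-IN
```

   In environments whose casing functions are locale-sensitive, an
   explicit locale-neutral (or root locale) variant would be needed to
   obtain the same result.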

   Note: if the field 'Deprecated' appears in a registry record without
   an accompanying 'Preferred-Value' field, then that tag or subtag is
   deprecated without a replacement.  Validating processors SHOULD NOT
   generate tags that include these values, although the values are
   canonical when they appear in a language tag.

   An extension MUST define any relationships that exist between the
   various subtags in the extension and thus MAY define an alternate
   canonicalization scheme for the extension's subtags.  Extensions MAY
   define how the order of the extension's subtags is interpreted.  For
   example, an extension could define that its subtags are in canonical
   order when the subtags are placed into ASCII order: that is,
   "en-a-aaa-bbb-ccc" instead of "en-a-ccc-bbb-aaa".  Another extension
   might define that the order of the subtags influences their semantic
   meaning (so that "en-b-ccc-bbb-aaa" has a different value from
   "en-b-aaa-bbb-ccc").  However, extension specifications SHOULD be
   designed so that they are tolerant of the typical processes described
   in Section 3.7.

4.5.  Considerations for Private Use Subtags

   Private use subtags, like all other subtags, MUST conform to the
   format and content constraints in the ABNF.  Private use subtags have
   no meaning outside the private agreement between the parties that
   intend to use or exchange language tags that employ them.  The same
   subtags MAY be used with a different meaning under a separate private
   agreement.  They SHOULD NOT be used where alternatives exist and
   SHOULD NOT be used in content or protocols intended for general use.

   Private use subtags are simply useless for information exchange
   without prior arrangement.  The value and semantic meaning of private
   use tags and of the subtags used within such a language tag are not
   defined by this document.

   Subtags defined in the IANA registry as having a specific private use
   meaning convey more information than a purely private use tag
   prefixed by the singleton subtag 'x'.  For applications, this
   additional information MAY be useful.

   For example, the region subtags 'AA', 'ZZ', and in the ranges
   'QM'-'QZ' and 'XA'-'XZ' (derived from ISO 3166 private use codes) MAY
   be used to form a language tag.  A tag such as "zh-Hans-XQ" conveys a
   great deal of public, interchangeable information about the language
   material (that it is Chinese in the simplified Chinese script and is
   suitable for some geographic region 'XQ').  While the precise
   geographic region is not known outside of private agreement, the tag
   conveys far more information than an opaque tag such as "x-someLang",
   which contains no information about the language subtag or script
   subtag outside of the private agreement.
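
   An application that wants to treat such regions specially could
   recognize them with a check like the one below.  This is a minimal
   sketch covering only the region codes listed above; the function
   name `is_private_use_region` is hypothetical.

```python
def is_private_use_region(region: str) -> bool:
    """Return True for the private use region subtags AA, ZZ,
    QM-QZ, and XA-XZ (case-insensitive)."""
    code = region.upper()
    if len(code) != 2 or not code.isascii() or not code.isalpha():
        return False
    return code in ("AA", "ZZ") or "QM" <= code <= "QZ" or "XA" <= code <= "XZ"


print(is_private_use_region("XQ"))  # True
print(is_private_use_region("US"))  # False
```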

   However, in some cases content tagged with private use subtags MAY
   interact with other systems in a different and possibly unsuitable
   manner compared to tags that use opaque, privately defined subtags,
   so the choice of the best approach sometimes depends on the
   particular domain in question.

5.  IANA Considerations

   This section deals with the processes and requirements necessary for
   IANA to undertake to maintain the subtag and extension registries as
   defined by this document and in accordance with the requirements of
   [RFC2434].

   The impact on the IANA maintainers of the two registries defined by
   this document will be a small increase in the frequency of new
   entries or updates.

5.1.  Language Subtag Registry

   Upon adoption of this document, the registry will be initialized by a
   companion document: [RFC4645].  The criteria and process for
   selecting the initial set of records are described in that document.
   The initial set of records represents no impact on IANA, since the
   work to create it will be performed externally.

   The new registry MUST be listed under "Language Tags" at
   <http://www.iana.org/numbers.html>, replacing the existing
   registrations defined by [RFC3066].  The existing set of registration
   forms and RFC 3066 registrations MUST be relabeled as "Language Tags
   (Obsolete)" and maintained (but not added to or modified).

   Future work on the Language Subtag Registry SHALL be limited to
   inserting or replacing whole records preformatted for IANA by the
   Language Subtag Reviewer as described in Section 3.3 of this document
   and archiving the forwarded registration form.

   Each record MUST be sent to iana@iana.org with a subject line
   indicating whether the enclosed record is an insertion of a new
   record (indicated by the word "INSERT" in the subject line) or a
   replacement of an existing record (indicated by the word "MODIFY" in
   the subject line).  Records MUST NOT be deleted from the registry.
   IANA MUST place any inserted or modified records into the appropriate
   section of the language subtag registry, grouping the records by
   their 'Type' field.  Inserted records MAY be placed anywhere in the
   appropriate section; there is no guarantee of the order of the
   records beyond grouping them together by 'Type'.  Modified records
   MUST overwrite the record they replace.

   Included in any request to insert or modify records MUST be a new
   File-Date record.  This record MUST be placed first in the registry.
   In the event that the File-Date record present in the registry has a
   later date than the record being inserted or modified, the existing
   record MUST be preserved.

5.2.  Extensions Registry

   The Language Tag Extensions Registry will also be generated and sent
   to IANA as described in Section 3.7.  This registry can contain at
   most 35 records, and thus changes to this registry are expected to be
   very infrequent.
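
   The figure of 35 follows from the tag syntax: each extension is
   identified by a single-character singleton, which can be any ASCII
   letter or digit except 'x' (reserved for private use), compared
   without regard to case.  A short check, with illustrative variable
   names:

```python
import string

# One possible singleton per letter or digit, excluding 'x'.
singletons = [c for c in string.ascii_lowercase + string.digits if c != "x"]

print(len(singletons))  # 35
```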

   Future work by IANA on the Language Tag Extensions Registry is
   limited to two cases.  First, the IESG MAY request that new records
   be inserted into this registry from time to time.  These requests
   MUST include the record to insert in the exact format described in
   Section 3.7.  In addition, there MAY be occasional requests from the
   maintaining authority for a specific extension to update the contact
   information or URLs in the record.  These requests MUST include the
   complete, updated record.  IANA is not responsible for validating the
   information provided, only for checking that it is properly
   formatted.  The request should reasonably be seen to come from the
   maintaining authority named in the record present in the registry.

6.  Security Considerations

   Language tags used in content negotiation, like any other information
   exchanged on the Internet, might be a source of concern because they
   might be used to infer the nationality of the sender, and thus
   identify potential targets for surveillance.

   This is a special case of the general problem that anything sent is
   visible to the receiving party and possibly to third parties as well.
   It is useful to be aware that such concerns can exist in some cases.

   The evaluation of the exact magnitude of the threat, and any possible
   countermeasures, is left to each application protocol (see BCP 72
   [RFC3552] for best current practice guidance on security threats and
   defenses).

   The language tag associated with a particular information item is of
   no consequence whatsoever in determining whether that content might
   contain possible homographs.  The fact that a text is tagged as being
   in one language or using a particular script subtag provides no
   assurance whatsoever that it does not contain characters from scripts
   other than the one(s) associated with or specified by that language
   tag.

   Since there is no limit to the number of variant, private use, and
   extension subtags, and consequently no limit on the possible length
   of a tag, implementations need to guard against buffer overflow
   attacks.  See Section 4.3 for details on language tag truncation,
   which can occur as a consequence of defenses against buffer overflow.

   Although the specification of valid subtags for an extension (see
   Section 3.7) MUST be available over the Internet, implementations
   SHOULD NOT mechanically depend on it being always accessible, to
   prevent denial-of-service attacks.

7.  Character Set Considerations

   The syntax in this document requires that language tags use only the
   characters A-Z, a-z, 0-9, and HYPHEN-MINUS, which are present in most
   character sets, so the composition of language tags should not have
   any character set issues.
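
   A quick repertoire check along these lines is sketched below.  It
   tests only the characters used, not whether the tag is well-formed
   or valid; the function name `uses_only_tag_characters` is
   hypothetical.

```python
import re

# A-Z, a-z, 0-9, and HYPHEN-MINUS only.
_ALLOWED = re.compile(r"[A-Za-z0-9-]+")


def uses_only_tag_characters(tag: str) -> bool:
    """Return True if every character of `tag` is permitted in a tag."""
    return _ALLOWED.fullmatch(tag) is not None


print(uses_only_tag_characters("zh-Hant-TW"))  # True
print(uses_only_tag_characters("fr_FR"))       # False: '_' is not allowed
```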

   Rendering of characters based on the content of a language tag is not
   addressed in this memo.  Historically, some languages have relied on
   the use of specific character sets or other information in order to
   infer how a specific character should be rendered (notably this
   applies to language- and culture-specific variations of Han
   ideographs as used in Japanese, Chinese, and Korean).  When language
   tags are applied to spans of text, rendering engines sometimes use
   that information in deciding which font to use in the absence of
   other information, particularly where languages with distinct writing
   traditions use the same characters.

8.  Changes from RFC 3066

   The main goals for this revision of language tags were the following:

   *Compatibility.* All RFC 3066 language tags (including those in the
   IANA registry) remain valid in this specification.  The changes in
   this document represent additional constraints on language tags.
   That is, in no case is the syntax more permissive, and processors
   based on the ABNF and other provisions of RFC 3066 (such as those
   described in [XMLSchema]) will be able to process the tags described
   by this document.  In addition, this document defines language tags
   in such a way as to ensure future compatibility.

   *Stability.* Because of changes in the past in the underlying ISO
   standards, a valid RFC 3066 language tag could become invalid or have
   its meaning change.  This has the potential of invalidating content
   that may have an extensive shelf-life.  In this specification, once a
   language tag is valid, it remains valid forever.

   *Validity.* The structure of language tags defined by this document
   makes it possible to determine if a particular tag is well-formed
   without regard for the actual content or "meaning" of the tag as a
   whole.  This is important because the registry grows and underlying
   standards change over time.  In addition, it must be possible to
   determine if a tag is valid (or not) for a given point in time in
   order to provide reproducible, testable results.  This process must
   not be error-prone; otherwise implementations might give different
   results.  By having an authoritative registry with specific
   versioning information, the validity of language tags at any point in
   time can be precisely determined (instead of interpolating values
   from many separate sources).

   *Utility.* It is sometimes important to be able to differentiate
   between written forms of a language -- for many implementations this
   is more important than distinguishing between the spoken variants of
   a language.  Languages are written in a wide variety of different
   scripts, so this document provides for the generative use of ISO
   15924 script codes.  Like the generative use of ISO language and
   country codes in RFC 3066, this allows combinations to be produced
   without resorting to the registration process.  The addition of UN
   M.49 codes provides for the generation of language tags with regional
   scope, which is also required by some applications.

   The recast of the registry from containing whole language tags to
   subtags is a key part of this.  An important feature of RFC 3066 was
   that it allowed generative use of subtags.  This allows people to
   meaningfully use generated tags, without the delays in registering
   whole tags or the need to register all of the combinations that might
   be useful.

   The choice of placing the extended language and script subtags
   between the primary language and region subtags was widely debated.
   This design was chosen because the prevalent matching and content
   negotiation schemes rely on the subtags being arranged in order of
   increasing specificity.  That is, the subtags that mark a greater
   barrier to mutual intelligibility appear left-most in a tag.  For
   example, when selecting content written in Azerbaijani, the script
   (Arabic, Cyrillic, or Latin) represents a greater barrier to
   understanding than any regional variations (those associated with
   Azerbaijan or Iran, for example).  Individuals who prefer documents
   in a particular script, but can deal with the minor regional
   differences, can therefore select appropriate content.  Applications
   that do not deal with written content will continue to omit these
   subtags.

   *Extensibility.* Because of the widespread use of language tags, it
   is disruptive to have periodic revisions of the core specification,
   even in the face of demonstrated need.  The extension mechanism
   provides for a way for independent RFCs to define extensions to
   language tags.  These extensions have a very constrained, well-
   defined structure that prevents extensions from interfering with
   implementations of language tags defined in this document.

   The document also anticipates features of ISO 639-3 with the addition
   of the extended language subtags, as well as the possibility of other
   ISO 639 parts becoming useful for the formation of language tags in
   the future.

   The use and definition of private use tags have also been modified,
   to allow people to use private use subtags to extend or modify
   defined tags and to move as much information as possible out of
   private use and into the regular structure.

   The goal for each of these modifications is to reduce or eliminate
   the need for future revisions of this document.

   The specific changes in this document to meet these goals are:

   o  Defines the ABNF and rules for subtags so that the category of all
      subtags can be determined without reference to the registry.

   o  Adds the concept of well-formed vs. validating processors,
      defining the rules by which an implementation can claim to be one
      or the other.

   o  Replaces the IANA language tag registry with a language subtag
      registry that provides a complete list of valid subtags in the
      IANA registry.  This allows for robust implementation and ease of
      maintenance.  The language subtag registry becomes the canonical
      source for forming language tags.

   o  Provides a process that guarantees stability of language tags, by
      handling reuse of values by ISO 639, ISO 15924, and ISO 3166 in
      the event that they register a previously used value for a new
      purpose.

   o  Allows ISO 15924 script code subtags and allows them to be used
      generatively.  Defines a method for indicating in the registry
      when script subtags are necessary for a given language tag.

   o  Adds the concept of a variant subtag and allows variants to be
      used generatively.

   o  Adds the ability to use a class of UN M.49 codes for supra-national
      regions and to resolve conflicts in the assignment of ISO 3166
      codes.

   o  Defines the private use codes in ISO 639, ISO 15924, and ISO 3166
      as the mechanism for creating private use language, script, and
      region subtags, respectively.

   o  Adds a well-defined extension mechanism.

   o  Defines an extended language subtag, possibly for use with certain
      anticipated features of ISO 639-3.
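
   As an informal illustration of the first item above, the following
   Python sketch labels each subtag of a tag using only its length,
   content, and position, with no registry lookup.  The function name
   'classify' and the (category, text) pairs it returns are assumptions
   made for this example and are not defined by this document.  The
   sketch covers only tags shaped like the 'langtag' production of the
   ABNF; grandfathered tags are out of scope because recognizing them
   requires the registry.

```python
import string

ALPHA = set(string.ascii_letters)
DIGIT = set(string.digits)
ALNUM = ALPHA | DIGIT


def _all_in(s, chars):
    """True when s is non-empty and made only of the given characters."""
    return bool(s) and all(c in chars for c in s)


def classify(tag):
    """Label the subtags of a langtag-shaped tag (hypothetical helper).

    Returns a list of (category, text) pairs.  Raises ValueError when a
    subtag fits no category at its position.  Grandfathered tags are not
    handled: recognizing them needs the registry.
    """
    parts = tag.split("-")
    out = []
    # 0 language, 1 extlang, 2 script, 3 region, 4 variant, 5 extension
    stage = 0
    extlangs = 0
    i = 0
    while i < len(parts):
        s = parts[i]
        n = len(s)
        if s.lower() == "x":
            # Private use: 'x' plus all remaining subtags (1-8 alphanumerics).
            rest = parts[i + 1:]
            if not rest or not all(
                _all_in(p, ALNUM) and len(p) <= 8 for p in rest
            ):
                raise ValueError("bad private use sequence in %r" % tag)
            out.append(("privateuse", "-".join(parts[i:])))
            break
        if i == 0:
            if not (_all_in(s, ALPHA) and 2 <= n <= 8):
                raise ValueError("bad primary language subtag in %r" % tag)
            out.append(("language", s))
            # Extended language subtags may only follow a 2-3 letter language.
            stage = 1 if n <= 3 else 2
        elif n == 1 and _all_in(s, ALNUM):
            # Extension: a singleton followed by one or more 2-8 char subtags.
            j = i + 1
            while (j < len(parts) and 2 <= len(parts[j]) <= 8
                   and _all_in(parts[j], ALNUM)):
                j += 1
            if j == i + 1:
                raise ValueError("empty extension in %r" % tag)
            out.append(("extension", "-".join(parts[i:j])))
            stage = 5
            i = j
            continue
        elif stage <= 1 and n == 3 and _all_in(s, ALPHA) and extlangs < 3:
            out.append(("extlang", s))
            stage, extlangs = 1, extlangs + 1
        elif stage <= 2 and n == 4 and _all_in(s, ALPHA):
            out.append(("script", s))
            stage = 3
        elif stage <= 3 and ((n == 2 and _all_in(s, ALPHA))
                             or (n == 3 and _all_in(s, DIGIT))):
            out.append(("region", s))
            stage = 4
        elif stage <= 4 and _all_in(s, ALNUM) and (
            5 <= n <= 8 or (n == 4 and s[0] in DIGIT)
        ):
            out.append(("variant", s))
            stage = 4
        else:
            raise ValueError("subtag %r fits no category in %r" % (s, tag))
        i += 1
    return out
```

   Under these assumptions, classify("sl-Latn-IT-nedis") labels 'sl' as
   the language, 'Latn' as the script, 'IT' as the region, and 'nedis'
   as a variant, while classify("de-419-DE") raises ValueError because
   'DE' fits no category after a region subtag.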

9.  References

9.1.  Normative References

   [ISO10646]     International Organization for Standardization,
                  "ISO/IEC 10646:2003. Information technology --
                  Universal Multiple-Octet Coded Character Set (UCS)",
                  2003.

   [ISO15924]     International Organization for Standardization, "ISO
                  15924:2004. Information and documentation -- Codes for
                  the representation of names of scripts", January 2004.

   [ISO3166-1]    International Organization for Standardization, "ISO
                  3166-1:1997. Codes for the representation of names of
                  countries and their subdivisions -- Part 1: Country
                  codes", 1997.

   [ISO639-1]     International Organization for Standardization, "ISO
                  639-1:2002. Codes for the representation of names of
                  languages -- Part 1: Alpha-2 code", 2002.

   [ISO639-2]     International Organization for Standardization, "ISO
                  639-2:1998. Codes for the representation of names of
                  languages -- Part 2: Alpha-3 code, first edition",
                  1998.

   [ISO646]       International Organization for Standardization,
                  "ISO/IEC 646:1991, Information technology -- ISO 7-bit
                  coded character set for information interchange.",
                  1991.

   [RFC2026]      Bradner, S., "The Internet Standards Process --
                  Revision 3", BCP 9, RFC 2026, October 1996.

   [RFC2028]      Hovey, R. and S. Bradner, "The Organizations Involved
                  in the IETF Standards Process", BCP 11, RFC 2028,
                  October 1996.

   [RFC2119]      Bradner, S., "Key words for use in RFCs to Indicate
                  Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2434]      Narten, T. and H. Alvestrand, "Guidelines for Writing
                  an IANA Considerations Section in RFCs", BCP 26,
                  RFC 2434, October 1998.

   [RFC2860]      Carpenter, B., Baker, F., and M. Roberts, "Memorandum
                  of Understanding Concerning the Technical Work of the
                  Internet Assigned Numbers Authority", RFC 2860,
                  June 2000.

   [RFC3339]      Klyne, G., Ed. and C. Newman, "Date and Time on the
                  Internet: Timestamps", RFC 3339, July 2002.

   [RFC4234]      Crocker, D., Ed. and P. Overell, "Augmented BNF for
                  Syntax Specifications: ABNF", RFC 4234, October 2005.

   [UN_M.49]      Statistics Division, United Nations, "Standard Country
                  or Area Codes for Statistical Use", UN Standard
                  Country or Area Codes for Statistical Use, Revision 4
                  (United Nations publication, Sales No. 98.XVII.9),
                  June 1999.

9.2.  Informative References

   [RFC1766]      Alvestrand, H., "Tags for the Identification of
                  Languages", RFC 1766, March 1995.

   [RFC2047]      Moore, K., "MIME (Multipurpose Internet Mail
                  Extensions) Part Three: Message Header Extensions for
                  Non-ASCII Text", RFC 2047, November 1996.

   [RFC2231]      Freed, N. and K. Moore, "MIME Parameter Value and
                  Encoded Word Extensions: Character Sets, Languages,
                  and Continuations", RFC 2231, November 1997.

   [RFC2781]      Hoffman, P. and F. Yergeau, "UTF-16, an encoding of
                  ISO 10646", RFC 2781, February 2000.

   [RFC3066]      Alvestrand, H., "Tags for the Identification of
                  Languages", BCP 47, RFC 3066, January 2001.

   [RFC3552]      Rescorla, E. and B. Korver, "Guidelines for Writing
                  RFC Text on Security Considerations", BCP 72,
                  RFC 3552, July 2003.

   [RFC4645]      Ewell, D., Ed., "Initial Language Subtag Registry",
                  RFC 4645, September 2006.

   [RFC4647]      Phillips, A., Ed. and M. Davis, Ed., "Matching of
                  Language Tags", BCP 47, RFC 4647, September 2006.

   [Unicode]      Unicode Consortium, "The Unicode Standard, Version
                  5.0", Boston, MA, Addison-Wesley, 2007.
                  ISBN 0-321-48091-0.

   [XML10]        Bray, T., et al., "Extensible Markup Language (XML)
                  1.0", February 2004.

   [XMLSchema]    Biron, P., Ed. and A. Malhotra, Ed., "XML Schema Part
                  2: Datatypes Second Edition", October 2004,
                  <http://www.w3.org/TR/xmlschema-2/>.

   [iso639.prin]  ISO 639 Joint Advisory Committee, "ISO 639 Joint
                  Advisory Committee:  Working principles for ISO 639
                  maintenance", March 2000,
                  <http://www.loc.gov/standards/iso639-2/iso639jac_n3r.html>.

   [record-jar]   Raymond, E., "The Art of Unix Programming", 2003,
                  <urn:isbn:0-13-142901-9>.

Appendix A.  Acknowledgements

   Any list of contributors is bound to be incomplete; please regard the
   following as only a selection from the group of people who have
   contributed to make this document what it is today.

   The contributors to RFC 3066 and RFC 1766, the precursors of this
   document, made enormous contributions directly or indirectly to this
   document and are generally responsible for the success of language
   tags.

   The following people (in alphabetical order) contributed to this
   document or to RFCs 1766 and 3066:

   Glenn Adams, Harald Tveit Alvestrand, Tim Berners-Lee, Marc Blanchet,
   Nathaniel Borenstein, Karen Broome, Eric Brunner, Sean M. Burke, M.T.
   Carrasco Benitez, Jeremy Carroll, John Clews, Jim Conklin, Peter
   Constable, John Cowan, Mark Crispin, Dave Crocker, Elwyn Davies,
   Martin Duerst, Frank Ellerman, Michael Everson, Doug Ewell, Ned
   Freed, Tim Goodwin, Dirk-Willem van Gulik, Marion Gunn, Joel Halpern,
   Elliotte Rusty Harold, Paul Hoffman, Scott Hollenbeck, Richard
   Ishida, Olle Jarnefors, Kent Karlsson, John Klensin, Erkki
   Kolehmainen, Alain LaBonte, Eric Mader, Ira McDonald, Keith Moore,
   Chris Newman, Masataka Ohta, Dylan Pierce, Randy Presuhn, George
   Rhoten, Felix Sasaki, Markus Scherer, Keld Jorn Simonsen, Thierry
   Sourbier, Otto Stolz, Tex Texin, Andrea Vine, Rhys Weatherley, Misha
   Wolf, Francois Yergeau and many, many others.

   Very special thanks must go to Harald Tveit Alvestrand, who
   originated RFCs 1766 and 3066, and without whom this document would
   not have been possible.  Special thanks must go to Michael Everson,
   who has served as Language Tag Reviewer for almost the complete
   period since the publication of RFC 1766.  Special thanks to Doug
   Ewell, for his production of the first complete subtag registry, and
   his work in producing a test parser for verifying language tags.

Appendix B.  Examples of Language Tags (Informative)

   Simple language subtag:

      de (German)

      fr (French)

      ja (Japanese)

      i-enochian (example of a grandfathered tag)

   Language subtag plus Script subtag:

      zh-Hant (Chinese written using the Traditional Chinese script)

      zh-Hans (Chinese written using the Simplified Chinese script)

      sr-Cyrl (Serbian written using the Cyrillic script)

      sr-Latn (Serbian written using the Latin script)

   Language-Script-Region:

      zh-Hans-CN (Chinese written using the Simplified script as used in
      mainland China)

      sr-Latn-CS (Serbian written using the Latin script as used in
      Serbia and Montenegro)

   Language-Variant:

      sl-rozaj (Resian dialect of Slovenian)

      sl-nedis (Nadiza dialect of Slovenian)

   Language-Region-Variant:

      de-CH-1901 (German as used in Switzerland using the 1901 variant
      [orthography])

      sl-IT-nedis (Slovenian as used in Italy, Nadiza dialect)

   Language-Script-Region-Variant:

      sl-Latn-IT-nedis (Nadiza dialect of Slovenian written using the
      Latin script as used in Italy.  Note that this tag is NOT
      RECOMMENDED because subtag 'sl' has a Suppress-Script value of
      'Latn')
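
   The Suppress-Script guidance in the example above can be sketched as
   follows.  The table 'SUPPRESS_SCRIPT' and the function
   'drop_suppressed_script' are hypothetical names introduced for this
   illustration; only the pairing of 'sl' with 'Latn' is taken from the
   text, and a real implementation would presumably read the values
   from the registry.

```python
# Hypothetical registry excerpt.  Only the 'sl' -> 'Latn' pairing comes
# from the example above; other entries would come from the registry.
SUPPRESS_SCRIPT = {"sl": "Latn"}


def drop_suppressed_script(tag, table=SUPPRESS_SCRIPT):
    """Remove the script subtag when it equals the language's Suppress-Script.

    Assumes the script, if present, is the second subtag, so tags that
    carry extended language subtags are returned unchanged.
    """
    subtags = tag.split("-")
    if len(subtags) >= 2:
        suppressed = table.get(subtags[0].lower())
        if suppressed and subtags[1].lower() == suppressed.lower():
            del subtags[1]
    return "-".join(subtags)
```

   With this table, "sl-Latn-IT-nedis" becomes "sl-IT-nedis", which is
   the form listed under Language-Region-Variant above.  Tags whose
   language has no entry in the table, such as "sr-Latn", are left as
   they are.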

   Language-Region:

      de-DE (German for Germany)

      en-US (English as used in the United States)

      es-419 (Spanish appropriate for the Latin America and Caribbean
      region using the UN region code)

   Private use subtags:

      de-CH-x-phonebk

      az-Arab-x-AZE-derbend

   Extended language subtags (examples ONLY: extended languages MUST be
   defined by revision or update to this document):

      zh-min

      zh-min-nan-Hant-CN

   Private use registry values:

      x-whatever (private use using the singleton 'x')

      qaa-Qaaa-QM-x-southern (all private tags)

      de-Qaaa (German, with a private script)

      sr-Latn-QM (Serbian, Latin-script, private region)

      sr-Qaaa-CS (Serbian, private script, for Serbia and Montenegro)

   Tags that use extensions (examples ONLY: extensions MUST be defined
   by revision or update to this document or by RFC):

      en-US-u-islamCal

      zh-CN-a-myExt-x-private

      en-a-myExt-b-another

   Some Invalid Tags:

      de-419-DE (two region tags)

      a-DE (use of a single-character subtag in primary position; note
      that there are a few grandfathered tags that start with "i-" that
      are valid)

      ar-a-aaa-b-bbb-a-ccc (two extensions with same single-letter
      prefix)
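
   The three invalid tags above can be detected mechanically.  The
   following Python sketch is one possible approach, not a normative
   algorithm: the regular expression is an informal transcription of
   the 'langtag' production of the ABNF, and the names 'LANGTAG' and
   'check_langtag' are assumptions made for this example.  Grandfathered
   tags and tags consisting only of private use subtags are deliberately
   not covered, and no registry lookup is performed.

```python
import re

# Informal transcription of the 'langtag' production (an assumption of
# this sketch; the ABNF in this document remains authoritative).
LANGTAG = re.compile(
    r"""
    (?:[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4}|[a-z]{5,8})  # language, extlang
    (?:-[a-z]{4})?                                        # script
    (?:-(?:[a-z]{2}|[0-9]{3}))?                           # region
    (?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*              # variants
    (?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*                  # extensions
    (?:-x(?:-[a-z0-9]{1,8})+)?                            # private use
    """,
    re.IGNORECASE | re.VERBOSE,
)


def check_langtag(tag):
    """Return None if no problem is found, else a short description."""
    if not LANGTAG.fullmatch(tag):
        return "does not match the langtag production"
    seen = set()
    for subtag in tag.lower().split("-"):
        if subtag == "x":
            break  # everything after 'x' is private use
        if len(subtag) == 1:
            if subtag in seen:
                return "repeated singleton '%s'" % subtag
            seen.add(subtag)
    return None
```

   Under these assumptions, "de-419-DE" and "a-DE" are rejected because
   they do not match the pattern, and "ar-a-aaa-b-bbb-a-ccc" is reported
   for repeating the singleton 'a'.  A tag such as
   "en-a-myExt-b-another" passes both checks.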

Authors' Addresses

   Addison Phillips (Editor)
   Yahoo! Inc.

   EMail: addison@inter-locale.com


   Mark Davis (Editor)
   Google

   EMail: mark.davis@macchiato.com or mark.davis@google.com

Full Copyright Statement

   Copyright (C) The Internet Society (2006).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgement

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).