
RFC 4646

Tags for Identifying Languages

Pages: 59

Obsoletes:  3066
Obsoleted by:  5646
Part 3 of 3 – Pages 38 to 59

ToP   noToC   RFC4646 - Page 38   prevText

4. Formation and Processing of Language Tags

This section addresses how to use the information in the registry with the tag syntax to choose, form, and process language tags.

4.1. Choice of Language Tag

One is sometimes faced with the choice between several possible tags for the same body of text.

Interoperability is best served when all users use the same language tag in order to represent the same language. If an application has requirements that make the rules here inapplicable, then that application risks damaging interoperability. It is strongly RECOMMENDED that users not define their own rules for language tag choice.

Subtags SHOULD only be used where they add useful distinguishing information; extraneous subtags interfere with the meaning, understanding, and processing of language tags. In particular, users and implementations SHOULD follow the 'Prefix' and 'Suppress-Script' fields in the registry (defined in Section 3.1): these fields provide guidance on when specific additional subtags SHOULD (and SHOULD NOT) be used in a language tag.

Of particular note, many applications can benefit from the use of script subtags in language tags, as long as the use is consistent for a given context. Script subtags were not formally defined in RFC 3066 and their use can affect matching and subtag identification by implementations of RFC 3066, as these subtags appear between the primary language and region subtags. For example, if a user requests content in an implementation of Section 2.5 of [RFC3066] using the language range "en-US", content labeled "en-Latn-US" will not match the request. Therefore, it is important to know when script subtags will customarily be used and when they ought not be used. In the registry, the Suppress-Script field helps ensure greater compatibility between the language tags generated according to the rules in this document and language tags and tag processors or consumers based on RFC 3066 by defining when users SHOULD NOT include a script subtag with a particular primary language subtag.
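The "en-US" versus "en-Latn-US" mismatch can be reproduced with a minimal sketch of the prefix matching in Section 2.5 of [RFC3066]. The function name `rfc3066_matches` is hypothetical and is not defined by either specification; the sketch only illustrates the rule that a range matches a tag when it equals the tag or a prefix of it that ends at a "-" boundary.

```python
def rfc3066_matches(language_range: str, tag: str) -> bool:
    """Return True if `tag` matches `language_range` by prefix matching."""
    if language_range == "*":
        # The special range "*" matches any tag.
        return True
    rng = language_range.lower()
    candidate = tag.lower()
    # Match on equality, or on a prefix that ends at a subtag boundary.
    return candidate == rng or candidate.startswith(rng + "-")


if __name__ == "__main__":
    print(rfc3066_matches("en-US", "en-US"))       # True
    print(rfc3066_matches("en-US", "en-Latn-US"))  # False: 'Latn' intervenes
    print(rfc3066_matches("en", "en-Latn-US"))     # True
```

Because the script subtag sits between the language and region subtags, the range "en-US" is no longer a prefix of "en-Latn-US", which is the situation the Suppress-Script field is meant to avoid.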
   Extended language subtags (type 'extlang' in the registry; see
   Section 3.1) also appear between the primary language and region
   subtags and are reserved for future standardization.  Applications
   might benefit from their judicious use in forming language tags in
   the future.  Similar recommendations are expected to apply to their
   use as apply to script subtags.

   Standards, protocols, and applications that reference this document
   normatively but apply different rules to the ones given in this
   section MUST specify how the procedure varies from the one given
   here.

   The choice of subtags used to form a language tag SHOULD be guided by
   the following rules:

   1.  Use as precise a tag as possible, but no more specific than is
       justified.  Avoid using subtags that are not important for
       distinguishing content in an application.

       *  For example, 'de' might suffice for tagging an email written
          in German, while "de-CH-1996" is probably unnecessarily
          precise for such a task.

   2.  The script subtag SHOULD NOT be used to form language tags unless
       the script adds some distinguishing information to the tag.  The
       field 'Suppress-Script' in the primary language record in the
       registry indicates which script subtags do not add distinguishing
       information for most applications.

       *  For example, the subtag 'Latn' should not be used with the
          primary language 'en' because nearly all English documents are
          written in the Latin script and it adds no distinguishing
          information.  However, if a document were written in English
          mixing Latin script with another script such as Braille
          ('Brai'), then it might be appropriate to choose to indicate
          both scripts to aid in content selection, such as the
          application of a style sheet.

   3.  If a tag or subtag has a 'Preferred-Value' field in its registry
       entry, then the value of that field SHOULD be used to form the
       language tag in preference to the tag or subtag in which the
       preferred value appears.

       *  For example, use 'he' for Hebrew in preference to 'iw'.
   4.  The 'und' (Undetermined) primary language subtag SHOULD NOT be
       used to label content, even if the language is unknown.  Omitting
       the language tag altogether is preferred to using a tag with a
       primary language subtag of 'und'.  The 'und' subtag MAY be useful
       for protocols that require a language tag to be provided.  The
       'und' subtag MAY also be useful when matching language tags in
       certain situations.

   5.  The 'mul' (Multiple) primary language subtag SHOULD NOT be used
       whenever the protocol allows separate tags for multiple
       languages, as is the case for the Content-Language header in
       HTTP.  The 'mul' subtag conveys little useful information:
       content in multiple languages SHOULD individually tag the
       languages where they appear or otherwise indicate the actual
       language in preference to the 'mul' subtag.

   6.  The same variant subtag SHOULD NOT be used more than once within
       a language tag.

       *  For example, do not use "de-DE-1901-1901".
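
Rules 2, 3, and 6 above can be sketched for simple tags of the form language[-script][-region][-variant...]. This is an illustration under stated assumptions, not a registry-driven implementation: the two dictionaries are hand-written stand-ins that hold only the 'en' and 'iw' cases mentioned in this section, the function name `tidy_tag` is hypothetical, and extended language, extension, and private use subtags are deliberately not handled.

```python
# Hypothetical stand-ins for registry data; a real processor would load
# these fields from the IANA Language Subtag Registry.
SUPPRESS_SCRIPT = {"en": "Latn"}
PREFERRED_VALUE = {"iw": "he"}


def tidy_tag(tag: str) -> str:
    """Apply rules 2, 3, and 6 to a simple language tag."""
    subtags = tag.split("-")
    language = subtags[0].lower()
    # Rule 3: prefer the Preferred-Value of a deprecated language subtag.
    language = PREFERRED_VALUE.get(language, language)
    result = [language]
    seen_variants = set()
    for subtag in subtags[1:]:
        is_script = len(subtag) == 4 and subtag.isalpha()
        is_variant = len(subtag) >= 5 or (len(subtag) == 4 and subtag[0].isdigit())
        # Rule 2: drop a script that the registry marks as suppressed.
        if is_script and SUPPRESS_SCRIPT.get(language, "").lower() == subtag.lower():
            continue
        # Rule 6: do not repeat the same variant subtag.
        if is_variant:
            if subtag.lower() in seen_variants:
                continue
            seen_variants.add(subtag.lower())
        result.append(subtag)
    return "-".join(result)


if __name__ == "__main__":
    print(tidy_tag("en-Latn-US"))       # en-US
    print(tidy_tag("iw"))               # he
    print(tidy_tag("de-DE-1901-1901"))  # de-DE-1901
```

Rule 1 (choosing how precise a tag needs to be) depends on the application and is not something a mechanical filter like this can decide.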

   To ensure consistent backward compatibility, this document contains
   several provisions to account for potential instability in the
   standards used to define the subtags that make up language tags.
   These provisions mean that no language tag created under the rules in
   this document will become obsolete.

4.2. Meaning of the Language Tag

The relationship between the tag and the information it relates to is defined by the context in which the tag appears. Accordingly, this section gives only possible examples of its usage.

   o  For a single information object, the associated language tags
      might be interpreted as the set of languages that is necessary for
      a complete comprehension of the complete object.  Example: Plain
      text documents.

   o  For an aggregation of information objects, the associated language
      tags could be taken as the set of languages used inside components
      of that aggregation.  Examples: Document stores and libraries.

   o  For information objects whose purpose is to provide alternatives,
      the associated language tags could be regarded as a hint that the
      content is provided in several languages and that one has to
      inspect each of the alternatives in order to find its language or
      languages.  In this case, the presence of multiple tags might not
      mean that one needs to be multi-lingual to get complete
      understanding of the document.  Example: MIME
      multipart/alternative.

   o  In markup languages, such as HTML and XML, language information
      can be added to each part of the document identified by the markup
      structure (including the whole document itself).  For example, one
      could write <span lang="fr">C'est la vie.</span> inside a
      Norwegian document; the Norwegian-speaking user could then access
      a French-Norwegian dictionary to find out what the marked section
      meant.  If the user were listening to that document through a
      speech synthesis interface, this information could be used to
      signal the synthesizer to appropriately apply French
      text-to-speech pronunciation rules to that span of text, instead
      of applying the inappropriate Norwegian rules.

   Language tags are related when they contain a similar sequence of
   subtags.  For example, if a language tag B contains language tag A as
   a prefix, then B is typically "narrower" or "more specific" than A.
   Thus, "zh-Hant-TW" is more specific than "zh-Hant".

   This relationship is not guaranteed in all cases: specifically,
   languages that begin with the same sequence of subtags are NOT
   guaranteed to be mutually intelligible, although they might be.  For
   example, the tag "az" shares a prefix with both "az-Latn"
   (Azerbaijani written using the Latin script) and "az-Cyrl"
   (Azerbaijani written using the Cyrillic script).  A person fluent in
   one script might not be able to read the other, even though the text
   might be identical.  Content tagged as "az" most probably is written
   in just one script and thus might not be intelligible to a reader
   familiar with the other script.
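
The prefix relationship described above can be tested on whole subtags, as in the following sketch. The function name `is_narrower` is hypothetical. A True result says only that one tag extends the other; as the Azerbaijani example shows, it says nothing about mutual intelligibility.

```python
def is_narrower(tag_b: str, tag_a: str) -> bool:
    """Return True if tag_b contains tag_a as a proper subtag prefix."""
    a = tag_a.lower().split("-")
    b = tag_b.lower().split("-")
    # Compare whole subtags so that "zh-Han" is not a prefix of "zh-Hant".
    return len(b) > len(a) and b[:len(a)] == a


if __name__ == "__main__":
    print(is_narrower("zh-Hant-TW", "zh-Hant"))  # True
    print(is_narrower("az-Latn", "az"))          # True
    print(is_narrower("az-Cyrl", "az-Latn"))     # False
```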

4.3. Length Considerations

[RFC3066] did not provide an upper limit on the size of language tags. While RFC 3066 did define the semantics of particular subtags in such a way that most language tags consisted of language and region subtags with a combined total length of up to six characters, larger registered tags were not only possible but were actually registered.

Neither the language tag syntax nor other requirements in this document impose a fixed upper limit on the number of subtags in a language tag (and thus an upper bound on the size of a tag).

The language tag syntax suggests that, depending on the specific language, more subtags (and thus a longer tag) are sometimes necessary to completely identify the language for certain applications; thus, it is possible to envision long or complex subtag sequences.

4.3.1. Working with Limited Buffer Sizes

Some applications and protocols are forced to allocate fixed buffer sizes or otherwise limit the length of a language tag. A conformant implementation or specification MAY refuse to support the storage of language tags that exceed a specified length. Any such limitation SHOULD be clearly documented, and such documentation SHOULD include what happens to longer tags (for example, whether an error value is generated or the language tag is truncated). A protocol that allows tags to be truncated at an arbitrary limit, without giving any indication of what that limit is, has the potential for causing harm by changing the meaning of tags in substantial ways.

In practice, most language tags do not require more than a few subtags and will not approach reasonably sized buffer limitations; see Section 4.1.

Some specifications or protocols have limits on tag length but do not have a fixed length limitation. For example, [RFC2231] has no explicit length limitation: the length available for the language tag is constrained by the length of other header components (such as the charset's name) coupled with the 76-character limit in [RFC2047]. Thus, the "limit" might be 50 or more characters, but it could potentially be quite small.

The considerations for assigning a buffer limit are:

   o  Implementations SHOULD NOT truncate language tags unless the
      meaning of the tag is purposefully being changed, or unless the
      tag does not fit into a limited buffer size specified by a
      protocol for storage or transmission.

   o  Implementations SHOULD warn the user when a tag is truncated since
      truncation changes the semantic meaning of the tag.

   o  Implementations of protocols or specifications that are space
      constrained but do not have a fixed limit SHOULD use the longest
      possible tag in preference to truncation.

   o  Protocols or specifications that specify limited buffer sizes for
      language tags MUST allow for language tags of up to 33 characters.

   o  Protocols or specifications that specify limited buffer sizes for
      language tags SHOULD allow for language tags of at least 42
      characters.
   The following illustration shows how the 42-character recommendation
   was derived.  The combination of language and extended language
   subtags was chosen for future compatibility.  At up to 15 characters,
   this combination is longer than the longest possible primary language
   subtag (8 characters):

   language      =  3 (ISO 639-2; ISO 639-1 requires 2)
   extlang1      =  4 (each subsequent subtag includes '-')
   extlang2      =  4 (unlikely: needs prefix="language-extlang1")
   extlang3      =  4 (extremely unlikely)
   script        =  5 (if not suppressed: see Section 4.1)
   region        =  4 (UN M.49; ISO 3166 requires 3)
   variant1      =  9 (MUST have language as a prefix)
   variant2      =  9 (MUST have language-variant1 as a prefix)

   total         = 42 characters

              Figure 7: Derivation of the Limit on Tag Length
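
The arithmetic in Figure 7 can be checked directly. In the sketch below the dictionary simply restates the per-subtag lengths from the figure (each length after the first includes the preceding '-'); the variable names are illustrative only.

```python
# Per-subtag lengths copied from Figure 7.
FIGURE_7_LENGTHS = {
    "language": 3,
    "extlang1": 4,
    "extlang2": 4,
    "extlang3": 4,
    "script": 5,
    "region": 4,
    "variant1": 9,
    "variant2": 9,
}

total = sum(FIGURE_7_LENGTHS.values())

# Language plus the three extended language subtags.
language_and_extlang = sum(
    FIGURE_7_LENGTHS[name]
    for name in ("language", "extlang1", "extlang2", "extlang3")
)

if __name__ == "__main__":
    print(total)                 # 42
    print(language_and_extlang)  # 15
```

The second sum reproduces the "up to 15 characters" figure quoted for the combination of language and extended language subtags.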

4.3.2. Truncation of Language Tags

Truncation of a language tag alters the meaning of the tag, and thus SHOULD be avoided. However, truncation of language tags is sometimes necessary due to limited buffer sizes. Such truncation MUST NOT permit a subtag to be chopped off in the middle or the formation of invalid tags (for example, one ending with the "-" character).

This means that applications or protocols that truncate tags MUST do so by progressively removing subtags along with their preceding "-" from the right side of the language tag until the tag is short enough for the given buffer. If the resulting tag ends with a single-character subtag, that subtag and its preceding "-" MUST also be removed. For example:

   Tag to truncate: zh-Latn-CN-variant1-a-extend1-x-wadegile-private1
   1. zh-Latn-CN-variant1-a-extend1-x-wadegile
   2. zh-Latn-CN-variant1-a-extend1
   3. zh-Latn-CN-variant1
   4. zh-Latn-CN
   5. zh-Latn
   6. zh

                  Figure 8: Example of Tag Truncation
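
A minimal sketch of this truncation procedure follows. The function name `truncate_tag` is hypothetical, the tag is assumed to begin with a primary language subtag, and raising `ValueError` when even that subtag does not fit is an assumption of the sketch; this section does not say what to do in that case.

```python
def truncate_tag(tag: str, max_length: int) -> str:
    """Remove subtags from the right until the tag fits in max_length."""
    subtags = tag.split("-")
    while len("-".join(subtags)) > max_length:
        if len(subtags) == 1:
            # Assumption of this sketch: report failure rather than
            # chop the primary language subtag in the middle.
            raise ValueError("primary language subtag exceeds the limit")
        subtags.pop()
        # A trailing single-character subtag must be removed as well.
        while len(subtags) > 1 and len(subtags[-1]) == 1:
            subtags.pop()
    return "-".join(subtags)


if __name__ == "__main__":
    tag = "zh-Latn-CN-variant1-a-extend1-x-wadegile-private1"
    for limit in (42, 35, 20, 10, 7, 2):
        print(limit, truncate_tag(tag, limit))
```

With the limits shown, the loop prints the six steps of Figure 8 in order, from "zh-Latn-CN-variant1-a-extend1-x-wadegile" down to "zh".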

4.4. Canonicalization of Language Tags

Since a particular language tag is sometimes used by many processes, language tags SHOULD always be created or generated in a canonical form.

A language tag is in canonical form when:

   1.  The tag is well-formed according to the rules in Section 2.1 and
       Section 2.2.

   2.  Subtags of type 'Region' that have a Preferred-Value mapping in
       the IANA registry (see Section 3.1) SHOULD be replaced with their
       mapped value.  Note: In rare cases, the mapped value will also
       have a Preferred-Value.

   3.  Redundant or grandfathered tags that have a Preferred-Value
       mapping in the IANA registry (see Section 3.1) MUST be replaced
       with their mapped value.  These items either are deprecated
       mappings created before the adoption of this document (such as
       the mapping of "no-nyn" to "nn" or "i-klingon" to "tlh") or are
       the result of later registrations or additions to this document
       (for example, "zh-guoyu" might be mapped to a language-extlang
       combination such as "zh-cmn" by some future update of this
       document).

   4.  Other subtags that have a Preferred-Value mapping in the IANA
       registry (see Section 3.1) MUST be replaced with their mapped
       value.  These items consist entirely of clerical corrections to
       ISO 639-1 in which the deprecated subtags have been maintained
       for compatibility purposes.

   5.  If more than one extension subtag sequence exists, the extension
       sequences are ordered into case-insensitive ASCII order by
       singleton subtag.

Example: The language tag "en-A-aaa-B-ccc-bbb-x-xyz" is in canonical form, while "en-B-ccc-bbb-A-aaa-X-xyz" is well-formed but not in canonical form.

Example: The language tag "en-BU" (English as used in Burma) is not canonical because the 'BU' subtag has a canonical mapping to 'MM' (Myanmar), although the tag "en-BU" maintains its validity.

Canonicalization of language tags does not imply anything about the use of upper or lowercase letters when processing or comparing subtags (as described in Section 2.1). All comparisons MUST be performed in a case-insensitive manner.
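
The ordering of extension sequences by singleton can be sketched as follows. The function name `order_extensions` is hypothetical, and the sketch covers only that one condition: it assumes a well-formed tag, leaves any private use sequence (introduced by 'x') at the end, and does not apply Preferred-Value mappings or change case.

```python
def order_extensions(tag: str) -> str:
    """Sort extension sequences by singleton, case-insensitively."""
    subtags = tag.split("-")
    head = []         # language, script, region, and variant subtags
    extensions = []   # one list per extension sequence
    private_use = []  # the 'x' singleton and everything after it
    current = head
    for index, subtag in enumerate(subtags):
        if private_use:
            private_use.append(subtag)
        elif index > 0 and len(subtag) == 1:
            if subtag.lower() == "x":
                private_use.append(subtag)
            else:
                # A new singleton starts a new extension sequence.
                current = [subtag]
                extensions.append(current)
        else:
            current.append(subtag)
    extensions.sort(key=lambda sequence: sequence[0].lower())
    ordered = [subtag for sequence in extensions for subtag in sequence]
    return "-".join(head + ordered + private_use)


if __name__ == "__main__":
    print(order_extensions("en-B-ccc-bbb-A-aaa-X-xyz"))
    # en-A-aaa-B-ccc-bbb-X-xyz
```

The subtags inside each extension sequence keep their original order, since their interpretation is left to the extension itself.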
   When performing canonicalization of language tags, processors MAY
   regularize the case of the subtags (that is, this process is
   OPTIONAL), following the case used in the registry.  Note that this
   corresponds to the following casing rules: uppercase all non-initial
   two-letter subtags; titlecase all non-initial four-letter subtags;
   lowercase everything else.

   Note: Case folding of ASCII letters in certain locales, unless
   carefully handled, sometimes produces non-ASCII character values.
   The Unicode Character Database file "SpecialCasing.txt" defines the
   specific cases that are known to cause problems with this.  In
   particular, the letter 'i' (U+0069) in Turkish and Azerbaijani is
   uppercased to U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE).
   Implementers SHOULD specify a locale-neutral casing operation to
   ensure that case folding of subtags does not produce this value,
   which is illegal in language tags.  For example, if one were to
   uppercase the region subtag 'in' using Turkish locale rules, the
   sequence U+0130 U+004E would result instead of the expected 'IN'.
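
The casing rules above can be sketched in a few lines. The function name `regularize_case` is hypothetical, and the rules are applied literally to every non-initial subtag, exactly as the paragraph states them. Python's `str.upper()`, `str.lower()`, and `str.capitalize()` do not consult the process locale, so for the ASCII letters allowed in language tags the Turkish dotted-I result described above does not arise here.

```python
def regularize_case(tag: str) -> str:
    """Apply the registry casing conventions to a language tag."""
    subtags = tag.split("-")
    result = [subtags[0].lower()]
    for subtag in subtags[1:]:
        if len(subtag) == 2 and subtag.isalpha():
            # Non-initial two-letter subtags are uppercased.
            result.append(subtag.upper())
        elif len(subtag) == 4 and subtag.isalpha():
            # Non-initial four-letter subtags are titlecased.
            result.append(subtag.capitalize())
        else:
            # Everything else is lowercased.
            result.append(subtag.lower())
    return "-".join(result)


if __name__ == "__main__":
    print(regularize_case("ZH-hant-tw"))  # zh-Hant-TW
    print(regularize_case("EN-in"))       # en-IN
    print(regularize_case("de-ch-1996"))  # de-CH-1996
```

In environments whose casing functions are locale-sensitive, the same rules would need an explicitly locale-neutral operation, as the note recommends.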

   Note: if the field 'Deprecated' appears in a registry record without
   an accompanying 'Preferred-Value' field, then that tag or subtag is
   deprecated without a replacement.  Validating processors SHOULD NOT
   generate tags that include these values, although the values are
   canonical when they appear in a language tag.

   An extension MUST define any relationships that exist between the
   various subtags in the extension and thus MAY define an alternate
   canonicalization scheme for the extension's subtags.  Extensions MAY
   define how the order of the extension's subtags is interpreted.  For
   example, an extension could define that its subtags are in canonical
   order when the subtags are placed into ASCII order: that is,
   "en-a-aaa-bbb-ccc" instead of "en-a-ccc-bbb-aaa".  Another extension
   might define that the order of the subtags influences their semantic
   meaning (so that "en-b-ccc-bbb-aaa" has a different value from
   "en-b-aaa-bbb-ccc").  However, extension specifications SHOULD be
   designed so that they are tolerant of the typical processes described
   in Section 3.7.

4.5. Considerations for Private Use Subtags

Private use subtags, like all other subtags, MUST conform to the format and content constraints in the ABNF. Private use subtags have no meaning outside the private agreement between the parties that intend to use or exchange language tags that employ them. The same subtags MAY be used with a different meaning under a separate private agreement. They SHOULD NOT be used where alternatives exist and SHOULD NOT be used in content or protocols intended for general use.
   Private use subtags are simply useless for information exchange
   without prior arrangement.  The value and semantic meaning of private
   use tags and of the subtags used within such a language tag are not
   defined by this document.

   Subtags defined in the IANA registry as having a specific private use
   meaning convey more information than a purely private use tag
   prefixed by the singleton subtag 'x'.  For applications, this
   additional information MAY be useful.

   For example, the region subtags 'AA', 'ZZ', and in the ranges
   'QM'-'QZ' and 'XA'-'XZ' (derived from ISO 3166 private use codes) MAY
   be used to form a language tag.  A tag such as "zh-Hans-XQ" conveys a
   great deal of public, interchangeable information about the language
   material (that it is Chinese in the simplified Chinese script and is
   suitable for some geographic region 'XQ').  While the precise
   geographic region is not known outside of private agreement, the tag
   conveys far more information than an opaque tag such as "x-someLang",
   which contains no information about the language subtag or script
   subtag outside of the private agreement.

   However, in some cases content tagged with private use subtags MAY
   interact with other systems in a different and possibly unsuitable
   manner compared to tags that use opaque, privately defined subtags,
   so the choice of the best approach sometimes depends on the
   particular domain in question.

5. IANA Considerations

This section deals with the processes and requirements necessary for IANA to undertake to maintain the subtag and extension registries as defined by this document and in accordance with the requirements of [RFC2434]. The impact on the IANA maintainers of the two registries defined by this document will be a small increase in the frequency of new entries or updates.

5.1. Language Subtag Registry

Upon adoption of this document, the registry will be initialized by a companion document: [RFC4645]. The criteria and process for selecting the initial set of records are described in that document. The initial set of records represents no impact on IANA, since the work to create it will be performed externally.
   The new registry MUST be listed under "Language Tags" at
   <http://www.iana.org/numbers.html>, replacing the existing
   registrations defined by [RFC3066].  The existing set of registration
   forms and RFC 3066 registrations MUST be relabeled as "Language Tags
   (Obsolete)" and maintained (but not added to or modified).

   Future work on the Language Subtag Registry SHALL be limited to
   inserting or replacing whole records preformatted for IANA by the
   Language Subtag Reviewer as described in Section 3.3 of this document
   and archiving the forwarded registration form.

   Each record MUST be sent to iana@iana.org with a subject line
   indicating whether the enclosed record is an insertion of a new
   record (indicated by the word "INSERT" in the subject line) or a
   replacement of an existing record (indicated by the word "MODIFY" in
   the subject line).  Records MUST NOT be deleted from the registry.
   IANA MUST place any inserted or modified records into the appropriate
   section of the language subtag registry, grouping the records by
   their 'Type' field.  Inserted records MAY be placed anywhere in the
   appropriate section; there is no guarantee of the order of the
   records beyond grouping them together by 'Type'.  Modified records
   MUST overwrite the record they replace.

   Included in any request to insert or modify records MUST be a new
   File-Date record.  This record MUST be placed first in the registry.
   In the event that the File-Date record present in the registry has a
   later date than the record being inserted or modified, the existing
   record MUST be preserved.

5.2. Extensions Registry

The Language Tag Extensions Registry will also be generated and sent to IANA as described in Section 3.7. This registry can contain at most 35 records, and thus changes to this registry are expected to be very infrequent.

Future work by IANA on the Language Tag Extensions Registry is limited to two cases. First, the IESG MAY request that new records be inserted into this registry from time to time. These requests MUST include the record to insert in the exact format described in Section 3.7. In addition, there MAY be occasional requests from the maintaining authority for a specific extension to update the contact information or URLs in the record. These requests MUST include the complete, updated record. IANA is not responsible for validating the information provided, only that it is properly formatted. It should reasonably be seen to come from the maintaining authority named in the record present in the registry.
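
The figure of 35 records appears to follow from the singleton syntax: an extension is identified by a single letter or digit, compared case-insensitively, and the singleton 'x' is reserved for private use. A short check of that count, with illustrative variable names:

```python
import string

# Candidate singletons: ASCII digits and letters, compared without
# regard to case, minus 'x', which is reserved for private use.
singletons = [
    character
    for character in string.digits + string.ascii_lowercase
    if character != "x"
]

if __name__ == "__main__":
    print(len(singletons))  # 35
```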

6. Security Considerations

Language tags used in content negotiation, like any other information exchanged on the Internet, might be a source of concern because they might be used to infer the nationality of the sender, and thus identify potential targets for surveillance.

This is a special case of the general problem that anything sent is visible to the receiving party and possibly to third parties as well. It is useful to be aware that such concerns can exist in some cases.

The evaluation of the exact magnitude of the threat, and any possible countermeasures, is left to each application protocol (see BCP 72 [RFC3552] for best current practice guidance on security threats and defenses).

The language tag associated with a particular information item is of no consequence whatsoever in determining whether that content might contain possible homographs. The fact that a text is tagged as being in one language or using a particular script subtag provides no assurance whatsoever that it does not contain characters from scripts other than the one(s) associated with or specified by that language tag.

Since there is no limit to the number of variant, private use, and extension subtags, and consequently no limit on the possible length of a tag, implementations need to guard against buffer overflow attacks. See Section 4.3 for details on language tag truncation, which can occur as a consequence of defenses against buffer overflow.

Although the specification of valid subtags for an extension (see Section 3.7) MUST be available over the Internet, implementations SHOULD NOT mechanically depend on it being always accessible, to prevent denial-of-service attacks.

7. Character Set Considerations

The syntax in this document requires that language tags use only the characters A-Z, a-z, 0-9, and HYPHEN-MINUS, which are present in most character sets, so the composition of language tags should not have any character set issues.

Rendering of characters based on the content of a language tag is not addressed in this memo. Historically, some languages have relied on the use of specific character sets or other information in order to infer how a specific character should be rendered (notably this applies to language- and culture-specific variations of Han ideographs as used in Japanese, Chinese, and Korean). When language tags are applied to spans of text, rendering engines sometimes use that information in deciding which font to use in the absence of other information, particularly where languages with distinct writing traditions use the same characters.
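
The character repertoire stated above can be checked with a one-line pattern. This sketch tests only which characters appear; it does not test whether a tag is well-formed or valid, and the function name `uses_only_tag_characters` is hypothetical.

```python
import re

# A-Z, a-z, 0-9, and HYPHEN-MINUS, as listed in this section.
ALLOWED_CHARACTERS = re.compile(r"[A-Za-z0-9-]+")


def uses_only_tag_characters(text: str) -> bool:
    """Return True if text contains only characters allowed in tags."""
    return ALLOWED_CHARACTERS.fullmatch(text) is not None


if __name__ == "__main__":
    print(uses_only_tag_characters("zh-Hant-TW"))  # True
    print(uses_only_tag_characters("en_US"))       # False: underscore
    print(uses_only_tag_characters("en-\u0130N"))  # False: U+0130
```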

8. Changes from RFC 3066

The main goals for this revision of language tags were the following:

*Compatibility.* All RFC 3066 language tags (including those in the IANA registry) remain valid in this specification. The changes in this document represent additional constraints on language tags. That is, in no case is the syntax more permissive, and processors based on the ABNF and other provisions of RFC 3066 (such as those described in [XMLSchema]) will be able to process the tags described by this document. In addition, this document defines language tags in such a way as to ensure future compatibility.

*Stability.* Because of changes in the past in the underlying ISO standards, a valid RFC 3066 language tag could become invalid or have its meaning change. This has the potential of invalidating content that may have an extensive shelf-life. In this specification, once a language tag is valid, it remains valid forever.

*Validity.* The structure of language tags defined by this document makes it possible to determine if a particular tag is well-formed without regard for the actual content or "meaning" of the tag as a whole. This is important because the registry grows and underlying standards change over time. In addition, it must be possible to determine if a tag is valid (or not) for a given point in time in order to provide reproducible, testable results. This process must not be error-prone; otherwise implementations might give different results. By having an authoritative registry with specific versioning information, the validity of language tags at any point in time can be precisely determined (instead of interpolating values from many separate sources).

*Utility.* It is sometimes important to be able to differentiate between written forms of a language -- for many implementations this is more important than distinguishing between the spoken variants of a language. Languages are written in a wide variety of different scripts, so this document provides for the generative use of ISO 15924 script codes. Like the generative use of ISO language and country codes in RFC 3066, this allows combinations to be produced without resorting to the registration process. The addition of UN M.49 codes provides for the generation of language tags with regional scope, which is also required by some applications.
   The recast of the registry from containing whole language tags to
   subtags is a key part of this.  An important feature of RFC 3066 was
   that it allowed generative use of subtags.  This allows people to
   meaningfully use generated tags, without the delays in registering
   whole tags or the need to register all of the combinations that might
   be useful.

   The choice of placing the extended language and script subtags
   between the primary language and region subtags was widely debated.
   This design was chosen because the prevalent matching and content
   negotiation schemes rely on the subtags being arranged in order of
   increasing specificity.  That is, the subtags that mark a greater
   barrier to mutual intelligibility appear left-most in a tag.  For
   example, when selecting content written in Azerbaijani, the script
   (Arabic, Cyrillic, or Latin) represents a greater barrier to
   understanding than any regional variations (those associated with
   Azerbaijan or Iran, for example).  Individuals who prefer documents
   in a particular script, but can deal with the minor regional
   differences, can therefore select appropriate content.  Applications
   that do not deal with written content will continue to omit these
   subtags.

   *Extensibility.* Because of the widespread use of language tags, it
   is disruptive to have periodic revisions of the core specification,
   even in the face of demonstrated need.  The extension mechanism
   provides for a way for independent RFCs to define extensions to
   language tags.  These extensions have a very constrained, well-
   defined structure that prevents extensions from interfering with
   implementations of language tags defined in this document.

   The document also anticipates features of ISO 639-3 with the addition
   of the extended language subtags, as well as the possibility of other
   ISO 639 parts becoming useful for the formation of language tags in
   the future.

   The use and definition of private use tags have also been modified,
   to allow people to use private use subtags to extend or modify
   defined tags and to move as much information as possible out of
   private use and into the regular structure.

   The goal for each of these modifications is to reduce or eliminate
   the need for future revisions of this document.
   The specific changes in this document to meet these goals are:

   o  Defines the ABNF and rules for subtags so that the category of all
      subtags can be determined without reference to the registry.

   o  Adds the concept of well-formed vs. validating processors,
      defining the rules by which an implementation can claim to be one
      or the other.

   o  Replaces the IANA language tag registry with a language subtag
      registry that provides a complete list of valid subtags in the
      IANA registry.  This allows for robust implementation and ease of
      maintenance.  The language subtag registry becomes the canonical
      source for forming language tags.

   o  Provides a process that guarantees stability of language tags, by
      handling reuse of values by ISO 639, ISO 15924, and ISO 3166 in
      the event that they register a previously used value for a new
      purpose.

   o  Allows ISO 15924 script code subtags and allows them to be used
      generatively.  Defines a method for indicating in the registry
      when script subtags are necessary for a given language tag.

   o  Adds the concept of a variant subtag and allows variants to be
      used generatively.

   o  Adds the ability to use a class of UN M.49 tags for supra-national
      regions and to resolve conflicts in the assignment of ISO 3166
      codes.

   o  Defines the private use tags in ISO 639, ISO 15924, and ISO 3166
      as the mechanism for creating private use language, script, and
      region subtags, respectively.

   o  Adds a well-defined extension mechanism.

   o  Defines an extended language subtag, possibly for use with certain
      anticipated features of ISO 639-3.
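
   The first item in the list above, determining the category of each
   subtag without consulting the registry, can be illustrated with a
   short sketch.  The Python below is only an approximation written for
   this purpose: the function name classify() and the category labels
   are assumptions, grandfathered tags such as "i-enochian" are not
   handled, and the length and position rules are a simplified reading
   of the ABNF in Section 2.1 rather than a substitute for it.

```python
def _alpha(s, lo, hi):
    """True if s is lo..hi ASCII letters."""
    return lo <= len(s) <= hi and s.isascii() and s.isalpha()


def _alnum(s, lo, hi):
    """True if s is lo..hi ASCII letters or digits."""
    return lo <= len(s) <= hi and s.isascii() and s.isalnum()


def _digits(s, n):
    """True if s is exactly n ASCII digits."""
    return len(s) == n and s.isascii() and s.isdigit()


def classify(tag):
    """Label each subtag by shape and position only (no registry lookup).

    Returns a list of (subtag, category) pairs.  Raises ValueError when
    the tag does not fit this simplified grammar.  Grandfathered tags
    are deliberately out of scope.
    """
    subtags = tag.split("-")
    n = len(subtags)
    out = []
    i = 0

    def take(category):
        nonlocal i
        out.append((subtags[i], category))
        i += 1

    if subtags[0].lower() != "x":  # a leading 'x' means all private use
        if not _alpha(subtags[0], 2, 8):
            raise ValueError(f"bad primary language subtag in {tag!r}")
        take("language")
        if len(subtags[0]) <= 3:  # extlang only follows a 2-3 letter language
            extlangs = 0
            while i < n and extlangs < 3 and _alpha(subtags[i], 3, 3):
                take("extlang")
                extlangs += 1
        if i < n and _alpha(subtags[i], 4, 4):
            take("script")
        if i < n and (_alpha(subtags[i], 2, 2) or _digits(subtags[i], 3)):
            take("region")
        while i < n and (
            _alnum(subtags[i], 5, 8)
            or (_alnum(subtags[i], 4, 4) and subtags[i][0].isdigit())
        ):
            take("variant")
        while i < n and _alnum(subtags[i], 1, 1) and subtags[i].lower() != "x":
            take("singleton")
            start = i
            while i < n and _alnum(subtags[i], 2, 8):
                take("extension")
            if i == start:
                raise ValueError(f"empty extension in {tag!r}")
    if i < n and subtags[i].lower() == "x":
        take("singleton")
        start = i
        while i < n and _alnum(subtags[i], 1, 8):
            take("privateuse")
        if i == start:
            raise ValueError(f"empty private use sequence in {tag!r}")
    if i != n:
        raise ValueError(f"unexpected subtag {subtags[i]!r} in {tag!r}")
    return out
```

   Under these assumptions, classify("de-CH-1901") labels "de" as a
   language, "CH" as a region, and "1901" as a variant purely from
   length, content, and position.  Whether each subtag is actually
   registered remains a separate question that only a validating
   processor answers.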

9. References

9.1. Normative References

   [ISO10646]     International Organization for Standardization,
                  "ISO/IEC 10646:2003. Information technology --
                  Universal Multiple-Octet Coded Character Set (UCS)",
                  2003.

   [ISO15924]     International Organization for Standardization,
                  "ISO 15924:2004. Information and documentation --
                  Codes for the representation of names of scripts",
                  January 2004.

   [ISO3166-1]    International Organization for Standardization,
                  "ISO 3166-1:1997. Codes for the representation of
                  names of countries and their subdivisions -- Part 1:
                  Country codes", 1997.

   [ISO639-1]     International Organization for Standardization,
                  "ISO 639-1:2002. Codes for the representation of
                  names of languages -- Part 1: Alpha-2 code", 2002.

   [ISO639-2]     International Organization for Standardization,
                  "ISO 639-2:1998. Codes for the representation of
                  names of languages -- Part 2: Alpha-3 code, first
                  edition", 1998.

   [ISO646]       International Organization for Standardization,
                  "ISO/IEC 646:1991, Information technology -- ISO
                  7-bit coded character set for information
                  interchange.", 1991.

   [RFC2026]      Bradner, S., "The Internet Standards Process --
                  Revision 3", BCP 9, RFC 2026, October 1996.

   [RFC2028]      Hovey, R. and S. Bradner, "The Organizations Involved
                  in the IETF Standards Process", BCP 11, RFC 2028,
                  October 1996.

   [RFC2119]      Bradner, S., "Key words for use in RFCs to Indicate
                  Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2434]      Narten, T. and H. Alvestrand, "Guidelines for Writing
                  an IANA Considerations Section in RFCs", BCP 26,
                  RFC 2434, October 1998.

   [RFC2860]      Carpenter, B., Baker, F., and M. Roberts, "Memorandum
                  of Understanding Concerning the Technical Work of the
                  Internet Assigned Numbers Authority", RFC 2860,
                  June 2000.

   [RFC3339]      Klyne, G., Ed. and C. Newman, "Date and Time on the
                  Internet: Timestamps", RFC 3339, July 2002.

   [RFC4234]      Crocker, D., Ed. and P. Overell, "Augmented BNF for
                  Syntax Specifications: ABNF", RFC 4234, October 2005.

   [UN_M.49]      Statistics Division, United Nations, "Standard Country
                  or Area Codes for Statistical Use", UN Standard
                  Country or Area Codes for Statistical Use, Revision 4
                  (United Nations publication, Sales No. 98.XVII.9),
                  June 1999.

9.2. Informative References

   [RFC1766]      Alvestrand, H., "Tags for the Identification of
                  Languages", RFC 1766, March 1995.

   [RFC2047]      Moore, K., "MIME (Multipurpose Internet Mail
                  Extensions) Part Three: Message Header Extensions for
                  Non-ASCII Text", RFC 2047, November 1996.

   [RFC2231]      Freed, N. and K. Moore, "MIME Parameter Value and
                  Encoded Word Extensions: Character Sets, Languages,
                  and Continuations", RFC 2231, November 1997.

   [RFC2781]      Hoffman, P. and F. Yergeau, "UTF-16, an encoding of
                  ISO 10646", RFC 2781, February 2000.

   [RFC3066]      Alvestrand, H., "Tags for the Identification of
                  Languages", BCP 47, RFC 3066, January 2001.

   [RFC3552]      Rescorla, E. and B. Korver, "Guidelines for Writing
                  RFC Text on Security Considerations", BCP 72,
                  RFC 3552, July 2003.

   [RFC4645]      Ewell, D., Ed., "Initial Language Subtag Registry",
                  RFC 4645, September 2006.

   [RFC4647]      Phillips, A., Ed. and M. Davis, Ed., "Matching of
                  Language Tags", BCP 47, RFC 4647, September 2006.

   [Unicode]      Unicode Consortium, "The Unicode Standard, Version
                  5.0", Boston, MA, Addison-Wesley, 2007. ISBN
                  0-321-48091-0.

   [XML10]        Bray, T., et al., "Extensible Markup Language (XML)
                  1.0", February 2004.

   [XMLSchema]    Biron, P., Ed. and A. Malhotra, Ed., "XML Schema Part
                  2: Datatypes Second Edition", October 2004,
                  <http://www.w3.org/TR/xmlschema-2/>.

   [iso639.prin]  ISO 639 Joint Advisory Committee, "ISO 639 Joint
                  Advisory Committee:  Working principles for ISO 639
                  maintenance", March 2000, <http://www.loc.gov/
                  standards/iso639-2/iso639jac_n3r.html>.

   [record-jar]   Raymond, E., "The Art of Unix Programming", 2003,
                  <urn:isbn:0-13-142901-9>.

Appendix A. Acknowledgements

   Any list of contributors is bound to be incomplete; please regard the
   following as only a selection from the group of people who have
   contributed to make this document what it is today.

   The contributors to RFC 3066 and RFC 1766, the precursors of this
   document, made enormous contributions directly or indirectly to this
   document and are generally responsible for the success of language
   tags.

   The following people (in alphabetical order) contributed to this
   document or to RFCs 1766 and 3066:

   Glenn Adams, Harald Tveit Alvestrand, Tim Berners-Lee, Marc Blanchet,
   Nathaniel Borenstein, Karen Broome, Eric Brunner, Sean M. Burke, M.T.
   Carrasco Benitez, Jeremy Carroll, John Clews, Jim Conklin, Peter
   Constable, John Cowan, Mark Crispin, Dave Crocker, Elwyn Davies,
   Martin Duerst, Frank Ellerman, Michael Everson, Doug Ewell, Ned
   Freed, Tim Goodwin, Dirk-Willem van Gulik, Marion Gunn, Joel Halpern,
   Elliotte Rusty Harold, Paul Hoffman, Scott Hollenbeck, Richard
   Ishida, Olle Jarnefors, Kent Karlsson, John Klensin, Erkki
   Kolehmainen, Alain LaBonte, Eric Mader, Ira McDonald, Keith Moore,
   Chris Newman, Masataka Ohta, Dylan Pierce, Randy Presuhn, George
   Rhoten, Felix Sasaki, Markus Scherer, Keld Jorn Simonsen, Thierry
   Sourbier, Otto Stolz, Tex Texin, Andrea Vine, Rhys Weatherley, Misha
   Wolf, Francois Yergeau and many, many others.

   Very special thanks must go to Harald Tveit Alvestrand, who
   originated RFCs 1766 and 3066, and without whom this document would
   not have been possible.

   Special thanks must go to Michael Everson, who has served as Language
   Tag Reviewer for almost the complete period since the publication of
   RFC 1766.

   Special thanks to Doug Ewell, for his production of the first
   complete subtag registry, and his work in producing a test parser for
   verifying language tags.

Appendix B. Examples of Language Tags (Informative)

   Simple language subtag:

      de (German)

      fr (French)

      ja (Japanese)

      i-enochian (example of a grandfathered tag)

   Language subtag plus Script subtag:

      zh-Hant (Chinese written using the Traditional Chinese script)

      zh-Hans (Chinese written using the Simplified Chinese script)

      sr-Cyrl (Serbian written using the Cyrillic script)

      sr-Latn (Serbian written using the Latin script)

   Language-Script-Region:

      zh-Hans-CN (Chinese written using the Simplified script as used in
      mainland China)

      sr-Latn-CS (Serbian written using the Latin script as used in
      Serbia and Montenegro)

   Language-Variant:

      sl-rozaj (Resian dialect of Slovenian)

      sl-nedis (Nadiza dialect of Slovenian)

   Language-Region-Variant:

      de-CH-1901 (German as used in Switzerland using the 1901 variant
      [orthography])

      sl-IT-nedis (Slovenian as used in Italy, Nadiza dialect)

   Language-Script-Region-Variant:

      sl-Latn-IT-nedis (Nadiza dialect of Slovenian written using the
      Latin script as used in Italy.  Note that this tag is NOT
      RECOMMENDED because subtag 'sl' has a Suppress-Script value of
      'Latn')
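
   The Suppress-Script note on the example above can be sketched in
   code.  The Python below is illustrative only: the function name
   drop_suppressed_script() and the one-entry table are assumptions
   (the table holds just the 'sl'/'Latn' pair stated in this appendix,
   not the registry's full contents), and the sketch assumes the script
   subtag, if present, directly follows the primary language subtag
   with no extended language subtags in between.

```python
# Hypothetical excerpt: only the pair mentioned in the text above.
# A real implementation would load Suppress-Script values from the registry.
SUPPRESS_SCRIPT = {"sl": "Latn"}


def drop_suppressed_script(tag, table=SUPPRESS_SCRIPT):
    """Remove the script subtag when it equals the language's Suppress-Script.

    Assumes the script subtag, if any, is the second subtag.
    """
    subtags = tag.split("-")
    if len(subtags) >= 2:
        suppressed = table.get(subtags[0].lower())
        if suppressed is not None and subtags[1].lower() == suppressed.lower():
            del subtags[1]
    return "-".join(subtags)
```

   With this table, "sl-Latn-IT-nedis" becomes "sl-IT-nedis", while a
   tag such as "sr-Latn-CS" is returned unchanged because the table has
   no entry for 'sr'.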

   Language-Region:

      de-DE (German for Germany)

      en-US (English as used in the United States)

      es-419 (Spanish appropriate for the Latin America and Caribbean
      region using the UN region code)

   Private use subtags:

      de-CH-x-phonebk

      az-Arab-x-AZE-derbend

   Extended language subtags (examples ONLY: extended languages MUST be
   defined by revision or update to this document):

      zh-min

      zh-min-nan-Hant-CN

   Private use registry values:

      x-whatever (private use using the singleton 'x')

      qaa-Qaaa-QM-x-southern (all private tags)

      de-Qaaa (German, with a private script)

      sr-Latn-QM (Serbian, Latin-script, private region)

      sr-Qaaa-CS (Serbian, private script, for Serbia and Montenegro)

   Tags that use extensions (examples ONLY: extensions MUST be defined
   by revision or update to this document or by RFC):

      en-US-u-islamCal

      zh-CN-a-myExt-x-private

      en-a-myExt-b-another

   Some Invalid Tags:

      de-419-DE (two region tags)

      a-DE (use of a single-character subtag in primary position; note
      that there are a few grandfathered tags that start with "i-" that
      are valid)

      ar-a-aaa-b-bbb-a-ccc (two extensions with same single-letter
      prefix)
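
   The last invalid example, a repeated extension singleton, is
   straightforward to detect mechanically.  The following Python is a
   minimal sketch under stated assumptions: the function name
   duplicate_singletons() is hypothetical, comparison is
   case-insensitive, and scanning stops at the private use singleton
   'x' so that single-character private use subtags are not counted.

```python
def duplicate_singletons(tag):
    """Return the extension singletons that appear more than once in tag.

    The first subtag is skipped (it is the primary position), and
    scanning stops at 'x' because everything after it is private use.
    """
    seen = set()
    dupes = []
    for subtag in tag.lower().split("-")[1:]:
        if subtag == "x":
            break
        if len(subtag) == 1:
            if subtag in seen and subtag not in dupes:
                dupes.append(subtag)
            seen.add(subtag)
    return dupes
```

   For "ar-a-aaa-b-bbb-a-ccc" the function reports ['a'], whereas
   "en-a-myExt-b-another" yields an empty list because each singleton
   is used once.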

Authors' Addresses

   Addison Phillips (Editor)
   Yahoo! Inc.

   EMail: addison@inter-locale.com


   Mark Davis (Editor)
   Google

   EMail: mark.davis@macchiato.com or mark.davis@google.com

Full Copyright Statement

   Copyright (C) The Internet Society (2006).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the procedures with respect to rights in RFC documents can be
   found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.

Acknowledgement

   Funding for the RFC Editor function is provided by the IETF
   Administrative Support Activity (IASA).