RFC 5646

Tags for Identifying Languages

Pages: 84
Best Current Practice: 47
BCP 47 is also:  4647
Obsoletes:  4646
Part 3 of 4 – Pages 46 to 71

Top   ToC   RFC5646 - Page 46   prevText

3.6. Possibilities for Registration

   Possibilities for registration of subtags or information about
   subtags include:

   o  Primary language subtags for languages not listed in ISO 639 that
      are not variants of any listed or registered language MAY be
      registered.  At the time this document was created, there were no
      examples of this form of subtag.  Before attempting to register a
      language subtag, there MUST be an attempt to register the language
      with ISO 639.  Subtags MUST NOT be registered for languages
      defined by codes that exist in ISO 639-1, ISO 639-2, or ISO 639-3;
      that are under consideration by the ISO 639 registration
      authorities; or that have never been attempted for registration
      with those authorities.  If ISO 639 has previously rejected a
      language for registration, it is reasonable to assume that there
      must be additional, very compelling evidence of need before it
      will be registered as a primary language subtag in the IANA
      registry (to the extent that it is very unlikely that any subtags
      will be registered of this type).

   o  Dialect or other divisions or variations within a language, its
      orthography, writing system, regional or historical usage,
      transliteration or other transformation, or distinguishing
      variation MAY be registered as variant subtags.  An example is the
      'rozaj' subtag (the Resian dialect of Slovenian).

   o  The addition or maintenance of fields (generally of an
      informational nature) in tag or subtag records as described in
      Section 3.1 is allowed.  Such changes are subject to the stability
      provisions in Section 3.4.  This includes 'Description',
      'Comments', 'Deprecated', and 'Preferred-Value' fields for
      obsolete or withdrawn codes, or the addition of 'Suppress-Script'
      or 'Macrolanguage' fields to primary language subtags, as well as
      other changes permitted by this document, such as the addition of
      an appropriate 'Prefix' field to a variant subtag.

   o  The addition of records and related field value changes necessary
      to reflect assignments made by ISO 639, ISO 15924, ISO 3166-1, and
      UN M.49 as described in Section 3.4 is allowed.

   Subtags proposed for registration that would cause all or part of a
   grandfathered tag to become redundant but whose meaning conflicts
   with or alters the meaning of the grandfathered tag MUST be rejected.

   This document leaves the decision on what subtags or changes to
   subtags are appropriate (or not) to the registration process
   described in Section 3.5.

   Note: Four-character primary language subtags are reserved to allow
   for the possibility of alpha4 codes in some future addition to the
   ISO 639 family of standards.

   ISO 639 defines a registration authority for additions to and changes
   in the list of languages in ISO 639.  This agency is:

   International Information Centre for Terminology (Infoterm)
   Aichholzgasse 6/12, AT-1120
   Wien, Austria
   Phone: +43 1 26 75 35 Ext. 312 Fax: +43 1 216 32 72

   ISO 639-2 defines a registration authority for additions to and
   changes in the list of languages in ISO 639-2.  This agency is:

   Library of Congress
   Network Development and MARC Standards Office
   Washington, DC 20540, USA
   Phone: +1 202 707 6237 Fax: +1 202 707 0115
   URL: http://www.loc.gov/standards/iso639-2

   ISO 639-3 defines a registration authority for additions to and
   changes in the list of languages in ISO 639-3.  This agency is:

   SIL International
   ISO 639-3 Registrar
   7500 W. Camp Wisdom Rd.
   Dallas, TX 75236, USA
   Phone: +1 972 708 7400, ext. 2293
   Fax: +1 972 708 7546
   Email: iso639-3@sil.org
   URL: http://www.sil.org/iso639-3

   ISO 639-5 defines a registration authority for additions to and
   changes in the list of languages in ISO 639-5.  This agency is the
   same as for ISO 639-2 and is:

   Library of Congress
   Network Development and MARC Standards Office
   Washington, DC 20540, USA
   Phone: +1 202 707 6237
   Fax: +1 202 707 0115
   URL: http://www.loc.gov/standards/iso639-5

   The maintenance agency for ISO 3166-1 (country codes) is:

   ISO 3166 Maintenance Agency
   c/o International Organization for Standardization
   Case postale 56
   CH-1211 Geneva 20, Switzerland
   Phone: +41 22 749 72 33 Fax: +41 22 749 73 49
   URL: http://www.iso.org/iso/en/prods-services/iso3166ma/index.html

   The registration authority for ISO 15924 (script codes) is:

   Unicode Consortium
   Box 391476
   Mountain View, CA 94039-1476, USA
   URL: http://www.unicode.org/iso15924

   The Statistics Division of the United Nations Secretariat maintains
   the Standard Country or Area Codes for Statistical Use and can be
   reached at:

   Statistical Services Branch
   Statistics Division
   United Nations, Room DC2-1620
   New York, NY 10017, USA
   Fax: +1-212-963-0623
   Email: statistics@un.org
   URL: http://unstats.un.org/unsd/methods/m49/m49alpha.htm

3.7. Extensions and the Extensions Registry

   Extension subtags are those introduced by single-character subtags
   ("singletons") other than 'x'.  They are reserved for the generation
   of identifiers that contain a language component and are compatible
   with applications that understand language tags.

   The structure and form of extensions are defined by this document so
   that implementations can be created that are forward compatible with
   applications that might be created using singletons in the future.
   In addition, defining a mechanism for maintaining singletons will
   lend stability to this document by reducing the likely need for
   future revisions or updates.

   Single-character subtags are assigned by IANA using the "IETF Review"
   policy defined by [RFC5226].  This policy requires the development of
   an RFC, which SHALL define the name, purpose, processes, and
   procedures for maintaining the subtags.

   The maintaining or registering authority, including name, contact
   email, discussion list email, and URL location of the registry, MUST
   be indicated clearly in the RFC.  The RFC MUST specify or include
   each of the following:

   o  The specification MUST reference the specific version or revision
      of this document that governs its creation and MUST reference this
      section of this document.

   o  The specification and all subtags defined by the specification
      MUST follow the ABNF and other rules for the formation of tags and
      subtags as defined in this document.  In particular, it MUST
      specify that case is not significant and that subtags MUST NOT
      exceed eight characters in length.

   o  The specification MUST specify a canonical representation.

   o  The specification of valid subtags MUST be available over the
      Internet and at no cost.

   o  The specification MUST be in the public domain or available via a
      royalty-free license acceptable to the IETF and specified in the
      RFC.

   o  The specification MUST be versioned, and each version of the
      specification MUST be numbered, dated, and stable.

   o  The specification MUST be stable.  That is, extension subtags,
      once defined by a specification, MUST NOT be retracted or change
      in meaning in any substantial way.

   o  The specification MUST include, in a separate section, the
      registration form reproduced in this section (below) to be used in
      registering the extension upon publication as an RFC.

   o  IANA MUST be informed of changes to the contact information and
      URL for the specification.
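
   The syntactic part of the requirements above (case insensitivity and
   the eight-character limit) can be checked mechanically.  The
   following Python sketch is illustrative only and is not part of this
   specification.  It assumes the extension grammar defined by the ABNF
   elsewhere in this document: a singleton other than 'x', followed by
   one or more subtags of two to eight ASCII letters or digits.  The
   function name is hypothetical.

```python
import re

# Singleton: any ASCII letter or digit except 'x' (reserved for private use).
# Each following subtag: two to eight ASCII letters or digits.
# re.ASCII keeps IGNORECASE from matching non-ASCII look-alike characters.
_EXTENSION = re.compile(
    r"[0-9a-wy-z](?:-[0-9a-z]{2,8})+",
    re.IGNORECASE | re.ASCII,
)


def is_wellformed_extension(sequence: str) -> bool:
    """Return True if `sequence` is one singleton plus its subtags."""
    return _EXTENSION.fullmatch(sequence) is not None


print(is_wellformed_extension("a-myext"))          # case does not matter
print(is_wellformed_extension("x-private"))        # 'x' is not an extension
print(is_wellformed_extension("a-toolongsubtag"))  # subtag exceeds 8 chars
```

   A check of this kind covers only well-formedness.  Whether a given
   singleton has actually been allocated still has to be looked up in
   the extensions registry.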

   IANA will maintain a registry of allocated single-character
   (singleton) subtags.  This registry MUST use the record-jar format
   described by the ABNF in Section 3.1.1.  Upon publication of an
   extension as an RFC, the maintaining authority defined in the RFC
   MUST forward this registration form to <iesg@ietf.org>, who MUST
   forward the request to <iana@iana.org>.  The maintaining authority of
   the extension MUST maintain the accuracy of the record by sending an
   updated full copy of the record to <iana@iana.org> with the subject
   line "LANGUAGE TAG EXTENSION UPDATE" whenever content changes.  Only
   the 'Comments', 'Contact_Email', 'Mailing_List', and 'URL' fields MAY
   be modified in these updates.

   Failure to maintain this record, maintain the corresponding registry,
   or meet other conditions imposed by this section of this document MAY
   be appealed to the IESG [RFC2028] under the same rules as other IETF
   decisions (see [RFC2026]) and MAY result in the authority to maintain
   the extension being withdrawn or reassigned by the IESG.

   %%
   Identifier:
   Description:
   Comments:
   Added:
   RFC:
   Authority:
   Contact_Email:
   Mailing_List:
   URL:
   %%

    Figure 6: Format of Records in the Language Tag Extensions Registry

   'Identifier' contains the single-character subtag (singleton)
   assigned to the extension.  The Internet-Draft submitted to define
   the extension SHOULD specify which letter or digit to use, although
   the IESG MAY change the assignment when approving the RFC.

   'Description' contains the name and description of the extension.

   'Comments' is an OPTIONAL field and MAY contain a broader description
   of the extension.

   'Added' contains the date the extension's RFC was published in the
   "full-date" format specified in [RFC3339].  For example: 2004-06-28
   represents June 28, 2004, in the Gregorian calendar.

   'RFC' contains the RFC number assigned to the extension.

   'Authority' contains the name of the maintaining authority for the
   extension.

   'Contact_Email' contains the email address used to contact the
   maintaining authority.

   'Mailing_List' contains the URL or subscription email address of the
   mailing list used by the maintaining authority.

   'URL' contains the URL of the registry for this extension.
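
   As an illustration of the record layout in Figure 6, the following
   Python sketch parses a single record and applies a few of the checks
   described above.  It is a simplified reading of the record-jar
   format, not a complete implementation.  It assumes that every field
   except 'Comments' is required, that field names start in the first
   column, and that a line beginning with whitespace continues the
   previous field.  The function name and all sample values (including
   the example.org addresses and the "XXXX" RFC placeholder) are
   hypothetical.

```python
import re
from datetime import date

# Assumption: every field except the OPTIONAL 'Comments' is required.
REQUIRED_FIELDS = {
    "Identifier", "Description", "Added", "RFC",
    "Authority", "Contact_Email", "Mailing_List", "URL",
}


def parse_extension_record(text: str) -> dict:
    """Parse one '%%'-delimited record into a field-name -> value dict."""
    record = {}
    last_field = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "%%":
            continue  # skip blank lines and record separators
        if raw[0] in " \t" and last_field is not None:
            # A line starting with whitespace continues the previous field.
            record[last_field] += " " + line
            continue
        name, sep, body = line.partition(":")
        if not sep:
            raise ValueError(f"not a 'Field: value' line: {raw!r}")
        last_field = name.strip()
        record[last_field] = body.strip()
    missing = REQUIRED_FIELDS - record.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if len(record["Identifier"]) != 1:
        raise ValueError("Identifier must be a single character")
    if not re.fullmatch(r"\d{4}-\d{2}-\d{2}", record["Added"]):
        raise ValueError("Added must use the YYYY-MM-DD full-date format")
    date.fromisoformat(record["Added"])  # rejects impossible calendar dates
    return record


sample = """%%
Identifier: r
Description: Example extension
Comments: Hypothetical values,
  for illustration only
Added: 2004-06-28
RFC: XXXX
Authority: Example Authority
Contact_Email: ext@example.org
Mailing_List: ext-discuss@example.org
URL: http://example.org/registry
%%"""

record = parse_extension_record(sample)
print(record["Identifier"], record["Added"])
```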

   The determination of whether an Internet-Draft meets the above
   conditions and the decision to grant or withhold such authority rests
   solely with the IESG and is subject to the normal review and appeals
   process associated with the RFC process.

   Extension authors are strongly cautioned that many (including most
   well-formed) processors will be unaware of any special relationships
   or meaning inherent in the order of extension subtags.  Extension
   authors SHOULD avoid subtag relationships or canonicalization
   mechanisms that interfere with matching or with length restrictions
   that sometimes exist in common protocols where the extension is used.
   In particular, applications MAY truncate the subtags in doing
   matching or in fitting into limited lengths, so it is RECOMMENDED
   that the most significant information be in the most significant
   (left-most) subtags and that the specification gracefully handle
   truncated subtags.

   When a language tag is to be used in a specific, known protocol, it
   is RECOMMENDED that the language tag not contain extensions not
   supported by that protocol.  In addition, note that some protocols
   MAY impose upper limits on the length of the strings used to store or
   transport the language tag.
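
   To make the truncation concern concrete, the following Python sketch
   shows one way an application might shorten a tag to fit a length
   limit.  It is an assumption about typical behavior, not a procedure
   mandated by this section: it drops whole subtags from the right and
   never leaves a single-character subtag (a singleton or 'x') dangling
   at the end.  The function name is hypothetical.

```python
def truncate_tag(tag: str, max_length: int) -> str:
    """Drop trailing subtags until the tag fits within max_length."""
    subtags = tag.split("-")
    while subtags and len("-".join(subtags)) > max_length:
        subtags.pop()
        # A trailing single-character subtag introduces nothing once its
        # following subtags are gone, so remove it as well.
        while len(subtags) > 1 and len(subtags[-1]) == 1:
            subtags.pop()
    # Returns an empty string if even the first subtag does not fit.
    return "-".join(subtags)


print(truncate_tag("en-a-myext-b-another", 12))  # keeps "en-a-myext"
print(truncate_tag("zh-Hant-CN-x-private", 10))  # keeps "zh-Hant-CN"
```

   Because the right-most subtags are lost first, an extension whose
   most significant information sits in its left-most subtags degrades
   more gracefully under this kind of processing.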

3.8. Update of the Language Subtag Registry

   After the adoption of this document, the IANA Language Subtag
   Registry needed an update so that it would contain the complete set
   of subtags valid in a language tag.  [RFC5645] describes the process
   used to create this update.

   Registrations that are in process under the rules defined in
   [RFC4646] when this document is adopted MUST be completed under the
   rules contained in this document.

3.9. Applicability of the Subtag Registry

   The Language Subtag Registry is the source of data elements used to
   construct language tags, following the rules described in this
   document.  Language tags are designed for indicating linguistic
   attributes of various content, including not only text but also most
   media formats, such as video or audio.  They also form the basis for
   language and locale negotiation in various protocols and APIs.

   The registry is therefore applicable to many applications that need
   some form of language identification, with these limitations:

   o  It is not designed to be the sole data source in the creation of a
      language-selection user interface.  For example, the registry does
      not contain translations for subtag descriptions or for tags
      composed from the subtags.  Sources for localized data based on
      the registry are generally available, notably [CLDR].  Nor does
      the registry indicate which subtag combinations are particularly
      useful or relevant.

   o  It does not provide information indicating relationships between
      different languages, such as might be used in a user interface to
      select language tags hierarchically, regionally, or on some other
      organizational model.

   o  It does not supply information about potential overlap between
      different language tags, as the notion of what constitutes a
      language is not precise: several different language tags might be
      reasonable choices for the same given piece of content.

   o  It does not contain information about appropriate fallback choices
      when performing language negotiation.  A good fallback language
      might be linguistically unrelated to the specified language.  The
      fact that one language is often used as a fallback language for
      another is usually a result of outside factors, such as geography,
      history, or culture -- factors that might not apply in all cases.
      For example, most people who use Breton (a Celtic language used in
      the Northwest of France) would probably prefer to be served French
      (a Romance language) if Breton isn't available.

4. Formation and Processing of Language Tags

This section addresses how to use the information in the registry with the tag syntax to choose, form, and process language tags.

4.1. Choice of Language Tag

   The guiding principle in forming language tags is to "tag content
   wisely."  Sometimes there is a choice between several possible tags
   for the same content.  The choice of which tag to use depends on the
   content and application in question, and some amount of judgment
   might be necessary when selecting a tag.

   Interoperability is best served when the same language tag is used
   consistently to represent the same language.  If an application has
   requirements that make the rules here inapplicable, then that
   application risks damaging interoperability.  It is strongly
   RECOMMENDED that users not define their own rules for language tag
   choice.

   Standards, protocols, and applications that reference this document
   normatively but apply different rules to the ones given in this
   section MUST specify how language tag selection varies from the
   guidelines given here.

   To ensure consistent backward compatibility, this document contains
   several provisions to account for potential instability in the
   standards used to define the subtags that make up language tags.
   These provisions mean that no valid language tag can become invalid,
   nor will a language tag have a narrower scope in the future (it may
   have a broader scope).  The most appropriate language tag for a given
   application or content item might evolve over time, but once applied,
   the tag itself cannot become invalid or have its meaning wholly
   change.

   A subtag SHOULD only be used when it adds useful distinguishing
   information to the tag.  Extraneous subtags interfere with the
   meaning, understanding, and processing of language tags.  In
   particular, users and implementations SHOULD follow the 'Prefix' and
   'Suppress-Script' fields in the registry (defined in Section 3.1):
   these fields provide guidance on when specific additional subtags
   SHOULD be used or avoided in a language tag.

   The choice of subtags used to form a language tag SHOULD follow these
   guidelines:

   1.  Use as precise a tag as possible, but no more specific than is
       justified.  Avoid using subtags that are not important for
       distinguishing content in an application.

       *  For example, 'de' might suffice for tagging an email written
          in German, while "de-CH-1996" is probably unnecessarily
          precise for such a task.

       *  Note that some subtag sequences might not represent the
          language a casual user might expect.  For example, the Swiss
          German (Schweizerdeutsch) language is represented by "gsw-CH"
          and not by "de-CH".  This latter tag represents German ('de')
          as used in Switzerland ('CH'), also known as Swiss High German
          (Schweizer Hochdeutsch).  Both are real languages, and
          distinguishing between them could be important to an
          application.

   2.  The script subtag SHOULD NOT be used to form language tags unless
       the script adds some distinguishing information to the tag.
       Script subtags were first formally defined in [RFC4646].  Their
       use can affect matching and subtag identification for
       implementations of [RFC1766] or [RFC3066] (which are obsoleted by
       this document), as these subtags appear between the primary
       language and region subtags.  Some applications can benefit from
       the use of script subtags in language tags, as long as the use is
       consistent for a given context.  Script subtags are never
       appropriate for unwritten content (such as audio recordings).
       The field 'Suppress-Script' in the primary or extended language
       record in the registry indicates script subtags that do not add
       distinguishing information for most applications; this field
       defines when users SHOULD NOT include a script subtag with a
       particular primary language subtag.

       For example, if an implementation selects content using Basic
       Filtering [RFC4647] (originally described in Section 14.4 of
       [RFC2616]) and the user requested the language range "en-US",
       content labeled "en-Latn-US" will not match the request and thus
       not be selected.  Therefore, it is important to know when script
       subtags will customarily be used and when they ought not be used.

       For example:

       *  The subtag 'Latn' should not be used with the primary language
          'en' because nearly all English documents are written in the
          Latin script and it adds no distinguishing information.
          However, if a document were written in English mixing Latin
          script with another script such as Braille ('Brai'), then it
          might be appropriate to choose to indicate both scripts to aid
          in content selection, such as the application of a style
          sheet.

       *  When labeling content that is unwritten (such as a recording
          of human speech), the script subtag should not be used, even
          if the language is customarily written in several scripts.
          Thus, the subtitles to a movie might use the tag "uz-Arab"
          (Uzbek, Arabic script), but the audio track for the same
          language would be tagged simply "uz".  (The tag "uz-Zxxx"
          could also be used where content is not written, as the subtag
          'Zxxx' represents the "Code for unwritten documents".)

   3.  If a tag or subtag has a 'Preferred-Value' field in its registry
       entry, then the value of that field SHOULD be used to form the
       language tag in preference to the tag or subtag in which the
       preferred value appears.

       *  For example, use 'jbo' for Lojban in preference to the
          grandfathered tag "art-lojban".

   4.  Use subtags or sequences of subtags for individual languages in
       preference to subtags for language collections.  A "language
       collection" is a group of languages that are descended from a
       common ancestor, are spoken in the same geographical area, or are
       otherwise related.  Certain language collections are assigned
       codes by [ISO639-5] (and some of these [ISO639-5] codes are also
       defined as collections in [ISO639-2]).  These codes are included
       as primary language subtags in the registry.  Subtags for a
       language collection in the registry have a 'Scope' field with a
       value of 'collection'.  A subtag for a language collection is
       always preferred to less specific alternatives such as 'mul' and
       'und' (see below), and a subtag representing a language
       collection MAY be used when more specific language information is
       not available.  However, most users and implementations do not
       know there is a relationship between the collection and its
       individual languages.  In addition, the relationship between the
       individual languages in the collection is not well defined; in
       particular, the languages are usually not mutually intelligible.
       Since the subtags are different, a request for the collection
       will typically only produce items tagged with the collection's
       subtag, not items tagged with subtags for the individual
       languages contained in the collection.

       *  For example, collections are interpreted inclusively, so the
          subtag 'gem' (Germanic languages) could, but SHOULD NOT, be
          used with content that would be better tagged with "en"
          (English), "de" (German), or "gsw" (Swiss German, Alemannic).
          While 'gem' collects all of these (and other) languages, most
          implementations will not match 'gem' to the individual
          languages; thus, using the subtag will not produce the desired
          result.

   5.  [ISO639-2] has defined several codes included in the subtag
       registry that require additional care when choosing language
       tags.  In most of these cases, where omitting the language tag is
       permitted, such omission is preferable to using these codes.
       Language tags SHOULD NOT incorporate these subtags as a prefix,
       unless the additional information conveys some value to the
       application.

       *  The 'mul' (Multiple) primary language subtag identifies
          content in multiple languages.  This subtag SHOULD NOT be used
          when a list of languages or individual tags for each content
          element can be used instead.  For example, the 'Content-
          Language' header [RFC3282] allows a list of languages to be
          used, not just a single language tag.

       *  The 'und' (Undetermined) primary language subtag identifies
          linguistic content whose language is not determined.  This
          subtag SHOULD NOT be used unless a language tag is required
          and language information is not available or cannot be
          determined.  Omitting the language tag (where permitted) is
          preferred.  The 'und' subtag might be useful for protocols
          that require a language tag to be provided or where a primary
          language subtag is required (such as in "und-Latn").  The
          'und' subtag MAY also be useful when matching language tags in
          certain situations.

       *  The 'zxx' (Non-Linguistic, Not Applicable) primary language
          subtag identifies content for which a language classification
          is inappropriate or does not apply.  Some examples might
          include instrumental or electronic music; sound recordings
          consisting of nonverbal sounds; audiovisual materials with no
          narration, dialog, printed titles, or subtitles; machine-
          readable data files consisting of machine languages or
          character codes; or programming source code.

       *  The 'mis' (Uncoded) primary language subtag identifies content
          whose language is known but that does not currently have a
          corresponding subtag.  This subtag SHOULD NOT be used.
          Because the addition of other codes in the future can render
          its application invalid, it is inherently unstable and hence
          incompatible with the stability goals of BCP 47.  It is always
          preferable to use other subtags: either 'und' or (with prior
          agreement) private use subtags.

   6.  Use variant subtags sparingly and in the correct order.  Most
       variant subtags have one or more 'Prefix' fields in the registry
       that express the list of subtags with which they are appropriate.
       Variants SHOULD only be used with subtags that appear in one of
       these 'Prefix' fields.  If a variant lists a second variant in
       one of its 'Prefix' fields, the first variant SHOULD appear
       directly after the second variant in any language tag where both
       occur.  General purpose variants (those with no 'Prefix' fields
       at all) SHOULD appear after any other variant subtags.  Order any
       remaining variants by placing the most significant subtag first.
       If none of the subtags is more significant or no relationship can
       be determined, alphabetize the subtags.  Because variants are
       very specialized, using many of them together generally makes the
       tag so narrow as to override the additional precision gained.
       Putting the subtags into another order interferes with
       interoperability, as well as the overall interpretation of the
       tag.

       For example:

       *  The tag "en-scotland-fonipa" (English, Scottish dialect, IPA
          phonetic transcription) is correctly ordered because
          'scotland' has a 'Prefix' of "en", while 'fonipa' has no
          'Prefix' field.

       *  The tag "sl-IT-rozaj-biske-1994" is correctly ordered: 'rozaj'
          lists "sl" as its sole 'Prefix'; 'biske' lists "sl-rozaj" as
          its sole 'Prefix'.  The subtag '1994' has several prefixes,
          including "sl-rozaj".  However, it follows both 'rozaj' and
          'biske' because one of its 'Prefix' fields is "sl-rozaj-
          biske".

   7.  The grandfathered tag "i-default" (Default Language) was
       originally registered according to [RFC1766] to meet the needs of
       [RFC2277].  It is not used to indicate a specific language, but
       rather to identify the condition or content used where the
       language preferences of the user cannot be established.  It
       SHOULD NOT be used except as a means of labeling the default
       content for applications or protocols that require default
       language content to be labeled with that specific tag.  It MAY
       also be used by an application or protocol to identify when the
       default language content is being returned.
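
   Several of the guidelines above (notably 1, 2, and 4) are motivated
   by how simple matching schemes behave.  The following Python sketch
   is a minimal rendering of Basic Filtering as described in [RFC4647]:
   a language range matches a tag if the two are equal, ignoring case,
   or if the range is a prefix of the tag and is immediately followed by
   a hyphen.  The function name is hypothetical, and the sketch omits
   everything else a real matcher would need.

```python
def basic_filter_matches(language_range: str, tag: str) -> bool:
    """Return True if `tag` matches `language_range` by prefix comparison."""
    rng = language_range.lower()
    candidate = tag.lower()
    # "*" is the wildcard range and matches any tag.
    return rng == "*" or candidate == rng or candidate.startswith(rng + "-")


# An unnecessary script subtag hides content from a common request.
print(basic_filter_matches("en-US", "en-Latn-US"))
print(basic_filter_matches("en-US", "en-US"))

# A collection subtag does not match its individual languages.
print(basic_filter_matches("gem", "de"))
```

   The first call reports no match and the second reports a match, which
   is the "en-Latn-US" problem described in guideline 2.  The third call
   reports no match, which is why tagging German content as 'gem', or
   requesting 'gem' to find it, does not produce the desired result.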

4.1.1. Tagging Encompassed Languages

   Some primary language records in the registry have a 'Macrolanguage'
   field (Section 3.1.10) that contains a mapping from each "encompassed
   language" to its macrolanguage.  The 'Macrolanguage' mapping doesn't
   define what the relationship between the encompassed language and its
   macrolanguage is, nor does it define how languages encompassed by the
   same macrolanguage are related to each other.  Two different
   languages encompassed by the same macrolanguage may differ from one
   another more than, say, French and Spanish do.

   A few specific macrolanguages, such as Chinese ('zh') and Arabic
   ('ar'), are handled differently.  See Section 4.1.2.

   The more specific encompassed language subtag SHOULD be used to form
   the language tag, although either the macrolanguage's primary
   language subtag or the encompassed language's subtag MAY be used.
   This means, for example, tagging Plains Cree with 'crk' rather than
   'cr' (Cree), and so forth.

   Each macrolanguage subtag's scope, by definition, includes all of its
   encompassed languages.  Since the relationship between encompassed
   languages varies, users cannot assume that the macrolanguage subtag
   means any particular encompassed language, nor that any given pair of
   encompassed languages are mutually intelligible or otherwise
   interchangeable.

   Applications MAY use macrolanguage information to improve matching or
   language negotiation.  For example, the information that 'sr'
   (Serbian) and 'hr' (Croatian) share a macrolanguage expresses a
   closer relation between those languages than between, say, 'sr'
   (Serbian) and 'mk' (Macedonian).  However, this relationship is not
   guaranteed nor is it exclusive.  For example, Romanian ('ro') and
   Moldavian ('mo') do not share a macrolanguage, but are far more
   closely related to each other than Cantonese ('yue') and Wu ('wuu'),
   which do share a macrolanguage.
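
   The following Python sketch shows how an application might consult
   'Macrolanguage' information as one matching hint among several.  The
   table is a small hand-copied excerpt used for illustration only; a
   real implementation would load the 'Macrolanguage' fields from the
   registry.  The names MACROLANGUAGE and share_macrolanguage are
   hypothetical.

```python
# Encompassed language subtag -> macrolanguage subtag (illustrative excerpt).
MACROLANGUAGE = {
    "sr": "sh",   # Serbian    -> Serbo-Croatian
    "hr": "sh",   # Croatian   -> Serbo-Croatian
    "cmn": "zh",  # Mandarin   -> Chinese
    "yue": "zh",  # Cantonese  -> Chinese
    "wuu": "zh",  # Wu         -> Chinese
    "crk": "cr",  # Plains Cree -> Cree
}


def share_macrolanguage(subtag_a: str, subtag_b: str) -> bool:
    """Return True if both subtags are encompassed by one macrolanguage."""
    macro_a = MACROLANGUAGE.get(subtag_a.lower())
    macro_b = MACROLANGUAGE.get(subtag_b.lower())
    return macro_a is not None and macro_a == macro_b


print(share_macrolanguage("sr", "hr"))    # share a macrolanguage
print(share_macrolanguage("yue", "wuu"))  # share one, yet differ widely
print(share_macrolanguage("ro", "mo"))    # closely related, but no shared entry
```

   As the examples in this section show, a positive result is only a
   hint of relatedness and a negative result does not rule it out, so
   such a check is best treated as a weak signal rather than a rule.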

4.1.2. Using Extended Language Subtags

   To accommodate language tag forms used prior to the adoption of this
   document, language tags provide a special compatibility mechanism:
   the extended language subtag.  Selected languages have been provided
   with both primary and extended language subtags.  These include
   macrolanguages, such as Malay ('ms') and Uzbek ('uz'), that have a
   specific dominant variety that is generally synonymous with the
   macrolanguage.  Other languages, such as the Chinese ('zh') and
   Arabic ('ar') macrolanguages and the various sign languages ('sgn'),
   have traditionally used their primary language subtag, possibly
   coupled with various region subtags or as part of a registered
   grandfathered tag, to indicate the language.

   With the adoption of this document, specific ISO 639-3 subtags became
   available to identify the languages contained within these diverse
   language families or groupings.  This presents a choice of language
   tags where previously none existed:

   o  Each encompassed language's subtag SHOULD be used as the primary
      language subtag.  For example, a document in Mandarin Chinese
      would be tagged "cmn" (the subtag for Mandarin Chinese) in
      preference to "zh" (Chinese).

   o  If compatibility is desired or needed, the encompassed subtag MAY
      be used as an extended language subtag.  For example, a document
      in Mandarin Chinese could be tagged "zh-cmn" instead of either
      "cmn" or "zh".

   o  The macrolanguage or prefixing subtag MAY still be used to form
      the tag instead of the more specific encompassed language subtag.
      That is, tags such as "zh-HK" or "sgn-RU" are still valid.

   Chinese ('zh') provides a useful illustration of this.  In the past,
   various content has used tags beginning with the 'zh' subtag, with
   application-specific meaning being associated with region codes,
   private use sequences, or grandfathered registered values.  This is
   because historically only the macrolanguage subtag 'zh' was available
   for forming language tags.  However, the languages encompassed by the
   Chinese subtag 'zh' are, in the main, not mutually intelligible when
   spoken, and the written forms of these languages also show wide
   variation in form and usage.

   To provide compatibility, Chinese languages encompassed by the 'zh'
   subtag are in the registry both as primary language subtags and as
   extended language subtags.  For example, the ISO 639-3 code for
   Cantonese is 'yue'.  Content in Cantonese might historically have
   used a tag such as "zh-HK" (since Cantonese is commonly spoken in
   Hong Kong), although that tag actually means any type of Chinese as
   used in Hong Kong.  With the availability of ISO 639-3 codes in the
   registry, content in Cantonese can be directly tagged using the 'yue'
   subtag.  The content can use it as a primary language subtag, as in
   the tag "yue-HK" (Cantonese, Hong Kong).  Or it can use an extended
   language subtag with 'zh', as in the tag "zh-yue-Hant" (Chinese,
   Cantonese, Traditional script).

   As noted above, applications can choose to use the macrolanguage
   subtag to form the tag instead of using the more specific encompassed
   language subtag.  For example, an application with large quantities
   of data already using tags with the 'zh' (Chinese) subtag might
   continue to use this more general subtag even for new data, even
   though the content could be more precisely tagged with 'cmn'
   (Mandarin), 'yue' (Cantonese), 'wuu' (Wu), and so on.  Similarly, an
   application already using tags that start with the 'ar' (Arabic)
   subtag might continue to use this more general subtag even for new
   data, which could be more precisely tagged with 'arb' (Standard
   Arabic).

   In some cases, the encompassed languages had tags registered for them
   during the RFC 3066 era.  Those grandfathered tags not already
   deprecated or rendered redundant were deprecated in the registry upon
   adoption of this document.  As grandfathered values, they remain
   valid for use, and some content or applications might use them.  As
   with other grandfathered tags, since implementations might not be
   able to associate the grandfathered tags with the encompassed
   language subtag equivalents that are recommended by this document,
   implementations are encouraged to canonicalize tags for comparison
   purposes.  Some examples of this include the tags "zh-hakka" (Hakka)
   and "zh-guoyu" (Mandarin or Standard Chinese).

   Sign languages share a mode of communication rather than a linguistic
   heritage.  There are many sign languages that have developed
   independently, and the subtag 'sgn' indicates only the presence of a
   sign language.  A number of sign languages also had grandfathered
   tags registered for them during the RFC 3066 era.  For example, the
   grandfathered tag "sgn-US" was registered to represent 'American Sign
   Language' specifically, without reference to the United States.  This
   is still valid, but deprecated: a document in American Sign Language
   can be labeled either "ase" or "sgn-ase" (the 'ase' subtag is for the
   language called 'American Sign Language').

4.2. Meaning of the Language Tag

   The meaning of a language tag is related to the meaning of the
   subtags that it contains.  Each subtag, in turn, implies a certain
   range of expectations one might have for related content, although it
   is not a guarantee.  For example, the use of a script subtag such as
   'Arab' (Arabic script) does not mean that the content contains only
   Arabic characters.  It does mean that the language involved is
   predominantly in the Arabic script.  Thus, a language tag and its
   subtags can encompass a very wide range of variation and yet remain
   appropriate in each particular instance.

   Validity of a tag is not the only factor determining its usefulness.
   While every valid tag has a meaning, it might not represent any
   real-world language usage.  This is unavoidable in a system in which
   subtags can be combined freely.  For example, tags such as
   "ar-Cyrl-CO" (Arabic, Cyrillic script, as used in Colombia) or
   "tlh-Kore-AQ-fonipa" (Klingon, Korean script, as used in Antarctica,
   IPA phonetic transcription) are both valid and unlikely to represent
   a useful combination of language attributes.

   The meaning of a given tag doesn't depend on the context in which it
   appears.  The relationship between a tag's meaning and the
   information objects to which that tag is applied, however, can vary.

   o  For a single information object, the associated language tags
      might be interpreted as the set of languages that is necessary for
      a complete comprehension of the complete object.  Example: Plain
      text documents.

   o  For an aggregation of information objects, the associated language
      tags could be taken as the set of languages used inside components
      of that aggregation.  Examples: Document stores and libraries.

   o  For information objects whose purpose is to provide alternatives,
      the associated language tags could be regarded as a hint that the
      content is provided in several languages and that one has to
      inspect each of the alternatives in order to find its language or
      languages.  In this case, the presence of multiple tags might not
      mean that one needs to be multilingual to get complete
      understanding of the document.  Example: MIME
      multipart/alternative [RFC2046].

   o  For markup languages, such as HTML and XML, language information
      can be added to each part of the document identified by the markup
      structure (including the whole document itself).  For example, one
      could write <span lang="fr">C'est la vie.</span> inside a German
      document; the German-speaking user could then access a French-
      German dictionary to find out what the marked section meant.  If
      the user were listening to that document through a speech
      synthesis interface, this information could be used to signal the
      synthesizer to appropriately apply French text-to-speech
      pronunciation rules to that span of text, instead of applying the
      inappropriate German rules.

   o  For markup languages and document formats that allow the audience
      to be identified, a language tag could indicate the audience(s)
      appropriate for that document.  For example, the same HTML
      document described in the preceding bullet might have an HTTP
      header "Content-Language: de" to indicate that the intended
      audience for the file is German (even though three words appear
      and are identified as being in French within it).

   o  For systems and APIs, language tags form the basis for most
      implementations of locale identifiers.  For example, see Unicode's
      CLDR (Common Locale Data Repository) project (see UTS #35
      [UTS35]).

   Language tags are related when they contain a similar sequence of
   subtags.  For example, if a language tag B contains language tag A as
   a prefix, then B is typically "narrower" or "more specific" than A.
   Thus, "zh-Hant-TW" is more specific than "zh-Hant".
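
   The prefix relationship described above can be sketched in Python
   as follows.  The function name is_prefix_tag is hypothetical and is
   not defined by this document.  The comparison is made subtag by
   subtag and ignores case, so that "zh-Han" is not mistaken for a
   prefix of "zh-Hant".

```python
def is_prefix_tag(a: str, b: str) -> bool:
    """Return True if tag `a` is a subtag-wise prefix of tag `b`.

    Comparison is case-insensitive and works on whole subtags, not on
    raw characters.
    """
    a_parts = a.lower().split("-")
    b_parts = b.lower().split("-")
    return len(a_parts) <= len(b_parts) and b_parts[:len(a_parts)] == a_parts


print(is_prefix_tag("zh-Hant", "zh-Hant-TW"))  # True
print(is_prefix_tag("zh-Han", "zh-Hant-TW"))   # False
```

   A True result only says that the first tag is syntactically broader
   than the second.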

   This relationship is not guaranteed in all cases: specifically,
   languages whose tags begin with the same sequence of subtags are NOT
   guaranteed to be mutually intelligible, although they might be.  For
   example, the tag "az" shares a prefix with both "az-Latn"
   (Azerbaijani written using the Latin script) and "az-Cyrl"
   (Azerbaijani written using the Cyrillic script).  A person fluent in
   one script might not be able to read the other, even though the
   linguistic content (e.g., what would be heard if both texts were read
   aloud) might be identical.  Content tagged as "az" most probably is
   written in just one script and thus might not be intelligible to a
   reader familiar with the other script.

   Similarly, not all subtags specify an actual distinction in language.
   For example, the tags "en-US" and "en-CA" mean, roughly, English with
   features generally thought to be characteristic of the United States
   and Canada, respectively.  They do not imply that a significant
   dialectal boundary exists between any arbitrarily selected point in
   the United States and any arbitrarily selected point in Canada.
   Neither does a particular region subtag imply that linguistic
   distinctions do not exist within that region.

4.3. Lists of Languages

   In some applications, a single content item might best be associated
   with more than one language tag.  Examples of such a usage include:

   o  Content items that contain multiple, distinct varieties.  Often
      this is used to indicate an appropriate audience for a given
      content item when multiple choices might be appropriate.  Examples
      of this could include:

      *  Metadata about the appropriate audience for a movie title.  For
         example, a DVD might label its individual audio tracks 'de'
         (German), 'fr' (French), and 'es' (Spanish), but the overall
         title would list "de, fr, es" as its overall audience.

      *  A French/English, English/French dictionary tagged as both "en"
         and "fr" to specify that it applies equally to French and
         English.

      *  A side-by-side or interlinear translation of a document, as is
         commonly done with classical works in Latin or Greek.

   o  Content items that contain a single language but that require
      multiple levels of specificity.  For example, a library might wish
      to classify a particular work as both Norwegian ('no') and as
      Nynorsk ('nn') for audiences capable of appreciating the
      distinction or needing to select content more narrowly.

4.4. Length Considerations

   There is no defined upper limit on the size of language tags.  While
   historically most language tags have consisted of language and region
   subtags with a combined total length of up to six characters, longer
   tags have always been possible and have actually appeared in use.

   Neither the language tag syntax nor other requirements in this
   document imposes a fixed upper limit on the number of subtags in a
   language tag (and thus an upper bound on the size of a tag).  The
   language tag syntax suggests that, depending on the specific
   language, more subtags (and thus a longer tag) are sometimes
   necessary to completely identify the language for certain
   applications; thus, it is possible to envision long or complex subtag
   sequences.

4.4.1. Working with Limited Buffer Sizes

   Some applications and protocols are forced to allocate fixed buffer
   sizes or otherwise limit the length of a language tag.  A conformant
   implementation or specification MAY refuse to support the storage of
   language tags that exceed a specified length.  Any such limitation
   SHOULD be clearly documented, and such documentation SHOULD include
   what happens to longer tags (for example, whether an error value is
   generated or the language tag is truncated).  A protocol that allows
   tags to be truncated at an arbitrary limit, without giving any
   indication of what that limit is, has the potential to cause harm by
   changing the meaning of tags in substantial ways.

   In practice, most language tags do not require more than a few
   subtags and will not approach reasonably sized buffer limitations;
   see Section 4.1.

   Some specifications or protocols have limits on tag length but do not
   have a fixed length limitation.  For example, [RFC2231] has no
   explicit length limitation: the length available for the language tag
   is constrained by the length of other header components (such as the
   charset's name) coupled with the 76-character limit in [RFC2047].
   Thus, the "limit" might be 50 or more characters, but it could
   potentially be quite small.

   The considerations for assigning a buffer limit are:

      Implementations SHOULD NOT truncate language tags unless the
      meaning of the tag is purposefully being changed, or unless the
      tag does not fit into a limited buffer size specified by a
      protocol for storage or transmission.

      Implementations SHOULD warn the user when a tag is truncated since
      truncation changes the semantic meaning of the tag.

      Implementations of protocols or specifications that are space
      constrained but do not have a fixed limit SHOULD use the longest
      possible tag in preference to truncation.

      Protocols or specifications that specify limited buffer sizes for
      language tags MUST allow for language tags of at least 35
      characters.  Note that [RFC4646] recommended a minimum field size
      of 42 characters because it included all three elements of the
      'extlang' production.  Two of these are now permanently reserved,
      so a registered primary language subtag of the maximum length of 8
      characters is now longer than the longest language-extlang
      combination.  Protocols or specifications that commonly use
      extensions or private use subtags might wish to reserve or
      recommend a longer "minimum buffer" size.

   The following illustration shows how the 35-character recommendation
   was derived:

   language      =  8 ; longest allowed registered value
                      ;   longer than primary+extlang
                      ;   which requires 7 characters
   script        =  5 ; if not suppressed: see Section 4.1
   region        =  4 ; UN M.49 numeric region code
                      ;   ISO 3166-1 codes require 3
   variant1      =  9 ; needs 'language' as a prefix
   variant2      =  9 ; very rare, as it needs
                      ;   'language-variant1' as a prefix

   total         = 35 characters

              Figure 7: Derivation of the Limit on Tag Length
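
   The arithmetic in Figure 7 can be checked with a short Python
   sketch.  Each entry after the first counts one extra character for
   the "-" that precedes the subtag.  The names PARTS and
   minimum_buffer_size are hypothetical, and the placeholder tag is a
   purely syntactic stand-in, not a registered value.

```python
# (name, longest subtag length, needs a preceding "-")
PARTS = [
    ("language", 8, False),  # first subtag, so no leading hyphen
    ("script", 4, True),
    ("region", 3, True),     # UN M.49 numeric region code
    ("variant1", 8, True),
    ("variant2", 8, True),
]


def minimum_buffer_size(parts=PARTS) -> int:
    """Sum the subtag lengths plus one hyphen per non-initial subtag."""
    return sum(length + (1 if hyphen else 0) for _, length, hyphen in parts)


# A syntactic placeholder with the same shape as the figure.
placeholder = "-".join(["a" * 8, "b" * 4, "123", "c" * 8, "d" * 8])

print(minimum_buffer_size())  # 35
print(len(placeholder))       # 35
```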

4.4.2. Truncation of Language Tags

   Truncation of a language tag alters the meaning of the tag, and thus
   SHOULD be avoided.  However, truncation of language tags is sometimes
   necessary due to limited buffer sizes.  Such truncation MUST NOT
   permit a subtag to be chopped off in the middle or the formation of
   invalid tags (for example, one ending with the "-" character).

   This means that applications or protocols that truncate tags MUST do
   so by progressively removing subtags along with their preceding "-"
   from the right side of the language tag until the tag is short enough
   for the given buffer.  If the resulting tag ends with a
   single-character subtag, that subtag and its preceding "-" MUST also
   be removed.  For example:

   Tag to truncate: zh-Latn-CN-variant1-a-extend1-x-wadegile-private1
   1. zh-Latn-CN-variant1-a-extend1-x-wadegile
   2. zh-Latn-CN-variant1-a-extend1
   3. zh-Latn-CN-variant1
   4. zh-Latn-CN
   5. zh-Latn
   6. zh

                    Figure 8: Example of Tag Truncation
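
   The truncation rule described above can be sketched in Python as
   follows.  This is a minimal illustration, not a normative
   algorithm: the function name truncate_tag is hypothetical, and
   raising ValueError when nothing usable fits in the buffer is an
   assumption of this sketch.

```python
def truncate_tag(tag: str, max_len: int) -> str:
    """Shorten `tag` to at most `max_len` characters by dropping whole
    subtags from the right, never leaving a trailing singleton."""
    subtags = tag.split("-")
    while len("-".join(subtags)) > max_len and len(subtags) > 1:
        subtags.pop()  # remove the rightmost subtag and its "-"
        # A single-character subtag left at the end must go as well.
        while len(subtags) > 1 and len(subtags[-1]) == 1:
            subtags.pop()
    result = "-".join(subtags)
    # Assumption of this sketch: fail loudly if even the first subtag
    # does not fit, or if only a bare singleton would remain.
    if len(result) > max_len or len(result) == 1:
        raise ValueError(f"cannot truncate {tag!r} to {max_len} characters")
    return result


tag = "zh-Latn-CN-variant1-a-extend1-x-wadegile-private1"
print(truncate_tag(tag, 40))  # zh-Latn-CN-variant1-a-extend1-x-wadegile
print(truncate_tag(tag, 35))  # zh-Latn-CN-variant1-a-extend1
print(truncate_tag(tag, 8))   # zh-Latn
```

   A caller that truncates this way would still be expected to warn
   the user, since the shortened tag carries a different meaning.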

4.5. Canonicalization of Language Tags

   Since a particular language tag can be used by many processes,
   language tags SHOULD always be created or generated in canonical
   form.

   A language tag is in 'canonical form' when the tag is well-formed
   according to the rules in Sections 2.1 and 2.2 and it has been
   canonicalized by applying each of the following steps in order, using
   data from the IANA registry (see Section 3.1):

   1.  Extension sequences are ordered into case-insensitive ASCII order
       by singleton subtag.

       *  For example, the subtag sequence '-a-babble' comes before
          '-b-warble'.

   2.  Redundant or grandfathered tags are replaced by their
       'Preferred-Value', if there is one.

       *  The field-body of the 'Preferred-Value' for grandfathered and
          redundant tags is an "extended language range" [RFC4647] and
          might consist of more than one subtag.

       *  'Preferred-Value' fields in the registry provide mappings from
          deprecated tags to modern equivalents.  Many of these were
          created before the adoption of this document (such as the
          mapping of "no-nyn" to "nn" or "i-klingon" to "tlh").  Others
          are the result of later registrations or additions to the
          registry as permitted or required by this document (for
          example, "zh-hakka" was deprecated in favor of the ISO 639-3
          code 'hak' when this document was adopted).

   3.  Subtags are replaced by their 'Preferred-Value', if there is one.
       For extlangs, the original primary language subtag is also
       replaced if there is a primary language subtag in the
       'Preferred-Value'.

       *  The field-body of the 'Preferred-Value' for extlangs is an
          "extended language range" and typically maps to a primary
          language subtag.  For example, the subtag sequence "zh-hak"
          (Chinese, Hakka) is replaced with the subtag 'hak' (Hakka).

       *  Most of the non-extlang subtags are either Region subtags
          where the country name or designation has changed or clerical
          corrections to ISO 639-1.

   The canonical form contains no 'extlang' subtags.  There is an
   alternate 'extlang form' that maintains or reinstates extlang
   subtags.  This form can be useful in environments where the presence
   of the 'Prefix' subtag is considered beneficial in matching or
   selection (see Section 4.1.2).

   A language tag is in 'extlang form' when the tag is well-formed
   according to the rules in Sections 2.1 and 2.2 and it has been
   processed by applying each of the following two steps in order, using
   data from the IANA registry:

   1.  The language tag is first transformed into canonical form, as
       described above.

   2.  If the language tag starts with a primary language subtag that is
       also an extlang subtag, then the language tag is prepended with
       the extlang's 'Prefix'.

       *  For example, "hak-CN" (Hakka, China) has the primary language
          subtag 'hak', which in turn has an 'extlang' record with a
          'Prefix' 'zh' (Chinese).  The extlang form is "zh-hak-CN"
          (Chinese, Hakka, China).

       *  Note that Step 2 (prepending a prefix) can restore a subtag
          that was removed by Step 1 (canonicalizing).
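
   Step 2 of the extlang form can be sketched in Python as follows.
   The EXTLANG_PREFIX table is a hand-written excerpt limited to
   subtags mentioned in this document; a real implementation would
   load the 'Prefix' fields from the IANA registry.  The function name
   to_extlang_form is hypothetical, and its input is assumed to be in
   canonical form already.

```python
# Hypothetical excerpt of extlang records: subtag -> 'Prefix'.
EXTLANG_PREFIX = {"hak": "zh", "yue": "zh", "cmn": "zh", "ase": "sgn"}


def to_extlang_form(canonical_tag: str) -> str:
    """Prepend the extlang 'Prefix' when the primary language subtag
    also has an extlang record; otherwise return the tag unchanged."""
    primary = canonical_tag.split("-", 1)[0].lower()
    prefix = EXTLANG_PREFIX.get(primary)
    return f"{prefix}-{canonical_tag}" if prefix else canonical_tag


print(to_extlang_form("hak-CN"))  # zh-hak-CN
print(to_extlang_form("en-US"))   # en-US
```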

   Example: The language tag "en-a-aaa-b-ccc-bbb-x-xyz" is in canonical
   form, while "en-b-ccc-bbb-a-aaa-X-xyz" is well-formed and potentially
   valid (extensions 'a' and 'b' are not defined as of the publication
   of this document) but not in canonical form (the extensions are not
   in alphabetical order).
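
   The ordering of extension sequences in the example above can be
   sketched in Python as follows.  The function name order_extensions
   is hypothetical.  The sketch assumes a well-formed tag, keeps the
   private use sequence (introduced by 'x') at the end, and leaves the
   case of every subtag untouched.

```python
def order_extensions(tag: str) -> str:
    """Sort extension sequences by singleton, case-insensitively."""
    subtags = tag.split("-")
    head = []         # language, script, region, and variant subtags
    extensions = []   # each entry is [singleton, subtag, ...]
    private_use = []  # the 'x' singleton and everything after it
    i = 0
    # Everything before the first singleton belongs to the head.
    while i < len(subtags) and len(subtags[i]) != 1:
        head.append(subtags[i])
        i += 1
    while i < len(subtags):
        if subtags[i].lower() == "x":
            private_use = subtags[i:]
            break
        if len(subtags[i]) == 1:
            extensions.append([subtags[i]])      # start a new sequence
        else:
            extensions[-1].append(subtags[i])    # extend the current one
        i += 1
    extensions.sort(key=lambda seq: seq[0].lower())
    ordered = head + [s for seq in extensions for s in seq] + private_use
    return "-".join(ordered)


print(order_extensions("en-b-ccc-bbb-a-aaa-X-xyz"))
# en-a-aaa-b-ccc-bbb-X-xyz
```

   The order of subtags inside each extension is deliberately left
   alone, since only the extension itself can define what that order
   means.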

   Example: Although the tag "en-BU" (English as used in Burma)
   maintains its validity, the language tag "en-BU" is not in canonical
   form because the 'BU' subtag has a canonical mapping to 'MM'
   (Myanmar).
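
   The 'Preferred-Value' replacements discussed in this section can be
   sketched in Python as follows.  Both tables are hand-written
   excerpts containing only mappings named in this document; a real
   implementation would read them from the IANA registry.  The
   function name apply_preferred_values is hypothetical, and the
   sketch handles only whole-tag and region replacements.

```python
# Hypothetical excerpts of registry 'Preferred-Value' data.
TAG_PREFERRED = {"zh-hakka": "hak", "i-klingon": "tlh", "no-nyn": "nn"}
REGION_PREFERRED = {"bu": "MM"}


def apply_preferred_values(tag: str) -> str:
    """Replace a deprecated whole tag, or a deprecated region subtag,
    with its preferred value (lookups are case-insensitive)."""
    whole = TAG_PREFERRED.get(tag.lower())
    if whole is not None:
        return whole
    subtags = tag.split("-")
    for i in range(1, len(subtags)):
        subtag = subtags[i]
        if len(subtag) == 1:
            break  # extensions and private use subtags are not touched
        # A non-initial two-letter subtag before any singleton is a region.
        if len(subtag) == 2 and subtag.isalpha():
            subtags[i] = REGION_PREFERRED.get(subtag.lower(), subtag)
    return "-".join(subtags)


print(apply_preferred_values("en-BU"))      # en-MM
print(apply_preferred_values("i-klingon"))  # tlh
```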

   Canonicalization of language tags does not imply anything about the
   use of upper- or lowercase letters when processing or comparing
   subtags (see Section 2.1).  All comparisons MUST be performed in a
   case-insensitive manner.

   When performing canonicalization of language tags, processors MAY
   regularize the case of the subtags (that is, this process is
   OPTIONAL), following the case used in the registry (see
   Section 2.1.1).
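
   Case regularization can be sketched in Python as follows, based on
   the formatting conventions of Section 2.1.1: subtags are lowercase,
   except that two-letter subtags are uppercase and four-letter
   subtags are titlecase when they neither start the tag nor follow a
   singleton.  The function name regularize_case is hypothetical, and
   treating every subtag after the first singleton as lowercase is an
   assumption of this sketch.

```python
def regularize_case(tag: str) -> str:
    """Apply the registry's conventional casing without a registry lookup."""
    out = []
    seen_singleton = False
    for i, subtag in enumerate(tag.split("-")):
        if len(subtag) == 1:
            seen_singleton = True
        if i == 0 or seen_singleton:
            out.append(subtag.lower())
        elif len(subtag) == 2:
            out.append(subtag.upper())       # region, as in 'CA'
        elif len(subtag) == 4:
            out.append(subtag.capitalize())  # script, as in 'Latn'
        else:
            out.append(subtag.lower())
    return "-".join(out)


print(regularize_case("EN-ca-X-CA"))      # en-CA-x-ca
print(regularize_case("AZ-LATN-x-LATN"))  # az-Latn-x-latn
```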

   If more than one variant appears within a tag, processors MAY reorder
   the variants to obtain better matching behavior or more consistent
   presentation.  Reordering of the variants SHOULD follow the
   recommendations for variant ordering in Section 4.1.

   If the field 'Deprecated' appears in a registry record without an
   accompanying 'Preferred-Value' field, then that tag or subtag is
   deprecated without a replacement.  These values are canonical when
   they appear in a language tag.  However, tags that include these
   values SHOULD NOT be selected by users or generated by
   implementations.

   An extension MUST define any relationships that exist between the
   various subtags in the extension and thus MAY define an alternate
   canonicalization scheme for the extension's subtags.  Extensions MAY
   define how the order of the extension's subtags is interpreted.  For
   example, an extension could define that its subtags are in canonical
   order when the subtags are placed into ASCII order: that is, "en-a-
   aaa-bbb-ccc" instead of "en-a-ccc-bbb-aaa".  Another extension might
   define that the order of the subtags influences their semantic
   meaning (so that "en-b-ccc-bbb-aaa" has a different value from "en-b-
   aaa-bbb-ccc").  However, extension specifications SHOULD be designed
   so that they are tolerant of the typical processes described in
   Section 3.7.

4.6. Considerations for Private Use Subtags

   Private use subtags, like all other subtags, MUST conform to the
   format and content constraints in the ABNF.  Private use subtags have
   no meaning outside the private agreement between the parties that
   intend to use or exchange language tags that employ them.  The same
   subtags MAY be used with a different meaning under a separate private
   agreement.  They SHOULD NOT be used where alternatives exist and
   SHOULD NOT be used in content or protocols intended for general use.

   Private use subtags are simply useless for information exchange
   without prior arrangement.  The value and semantic meaning of private
   use tags and of the subtags used within such a language tag are not
   defined by this document.

   Private use sequences introduced by the 'x' singleton are completely
   opaque to users or implementations outside of the private use
   agreement.  So, in addition to private use subtag sequences
   introduced by the singleton subtag 'x', the Language Subtag Registry
   provides private use language, script, and region subtags derived
   from the private use codes assigned by the underlying standards.
   These subtags are valid for use in forming language tags; they are
   RECOMMENDED over the 'x' singleton private use subtag sequences
   because they convey more information via their linkage to the
   language tag's inherent structure.

   For example, the region subtags 'AA', 'ZZ', and those in the ranges
   'QM'-'QZ' and 'XA'-'XZ' (derived from the ISO 3166-1 private use
   codes) can be used to form a language tag.  A tag such as
   "zh-Hans-XQ" conveys a great deal of public, interchangeable
   information about the language material (that it is Chinese in the
   simplified Chinese script and is suitable for some geographic region
   'XQ').  While the precise geographic region is not known outside of
   private agreement, the tag conveys far more information than an
   opaque tag such as "x-somelang" or even "zh-Hans-x-xq" (where the
   'xq' subtag's meaning is entirely opaque).
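
   The private use region subtags listed above can be recognized with
   a short Python check.  The function name is_private_use_region is
   hypothetical, and the ranges are taken directly from the preceding
   paragraph.

```python
def is_private_use_region(region: str) -> bool:
    """Return True for 'AA', 'ZZ', 'QM'-'QZ', and 'XA'-'XZ'
    (case-insensitive)."""
    r = region.upper()
    if len(r) != 2 or not r.isascii() or not r.isalpha():
        return False
    return r in ("AA", "ZZ") or "QM" <= r <= "QZ" or "XA" <= r <= "XZ"


print(is_private_use_region("XQ"))  # True
print(is_private_use_region("US"))  # False
```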

   However, in some cases content tagged with private use subtags can
   interact with other systems in a different and possibly unsuitable
   manner compared to tags that use opaque, privately defined subtags,
   so the choice of the best approach sometimes depends on the
   particular domain in question.

5. IANA Considerations

This section deals with the processes and requirements necessary for IANA to maintain the subtag and extension registries as defined by this document and in accordance with the requirements of [RFC5226]. The impact on the IANA maintainers of the two registries defined by this document will be a small increase in the frequency of new entries or updates. IANA also is required to create a new mailing list (described below in Section 5.1) to announce registry changes and updates.

5.1. Language Subtag Registry

   IANA updated the registry using instructions and content provided in
   a companion document [RFC5645].  The criteria and process for
   selecting the updated set of records are described in that document.
   The updated set of records represents no impact on IANA, since the
   work to create it will be performed externally.

   Future work on the Language Subtag Registry includes the following
   activities:

   o  Inserting or replacing whole records.  These records are
      preformatted for IANA by the Language Subtag Reviewer, as
      described in Section 3.3.

   o  Archiving and making publicly available the registration forms.

   o  Announcing each updated version of the registry on the
      "ietf-languages-announcements@iana.org" mailing list.

   Each registration form sent to IANA contains a single record for
   incorporation into the registry.  The form will be sent to
   <iana@iana.org> by the Language Subtag Reviewer.  It will have a
   subject line indicating whether the enclosed form represents an
   insertion of a new record (indicated by the word "INSERT" in the
   subject line) or a replacement of an existing record (indicated by
   the word "MODIFY" in the subject line).  At no time can a record be
   deleted from the registry.

   IANA will extract the record from the form and place the inserted or
   modified record into the appropriate section of the Language Subtag
   Registry, grouping the records by their 'Type' field.  Inserted
   records can be placed anywhere within the appropriate section; there
   is no guarantee that the registry's records will be placed in any
   particular order except that they will always be grouped by 'Type'.
   Modified records overwrite the record they replace.

   Whenever an entry is created or modified in the registry, the 'File-
   Date' record at the start of the registry is updated to reflect the
   most recent modification date.  The date format SHALL be the "full-
   date" format of [RFC3339].  The date SHALL be the date on which that
   version of the registry was first published by IANA.  There SHALL be
   at most one version of the registry published in a day.  A 'File-
   Date' record is also included in each request to IANA to insert or
   modify records, indicating the acceptance date of the records in the
   request.
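
   A 'File-Date' record can be checked against the "full-date" format
   of [RFC3339] with a sketch such as the following.  The function
   name parse_file_date and the sample date are hypothetical; the
   "File-Date: YYYY-MM-DD" line layout is assumed from the registry's
   field-name and field-body record format.

```python
import re
from datetime import date

# full-date: four-digit year, two-digit month, two-digit day
FULL_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def parse_file_date(line: str) -> date:
    """Parse a 'File-Date: YYYY-MM-DD' line into a date object."""
    name, _, value = line.partition(":")
    value = value.strip()
    if name.strip() != "File-Date" or not FULL_DATE.fullmatch(value):
        raise ValueError(f"not a File-Date record: {line!r}")
    year, month, day = map(int, value.split("-"))
    return date(year, month, day)  # raises ValueError for impossible dates


print(parse_file_date("File-Date: 2009-07-29"))  # 2009-07-29
```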

   The updated registry file MUST use the UTF-8 character encoding, and
   IANA MUST check the registry file for proper encoding.  Non-ASCII
   characters can be sent to IANA by attaching the registration form to
   the email message or by using various encodings in the mail message
   body (UTF-8 is recommended).  IANA will verify any unclear or
   corrupted characters with the Language Subtag Reviewer prior to
   posting the updated registry.

   IANA will also archive and make publicly available from
   http://www.iana.org each registration form.  Note that multiple
   registrations can pertain to the same record in the registry.

   Developers who are dependent upon the Language Subtag Registry
   sometimes would like to be informed of changes in the registry so
   that they can update their implementations.  When any change is made
   to the Language Subtag Registry, IANA will send an announcement
   message to <ietf-languages-announcements@iana.org> (a self-
   subscribing list to which only IANA can post).

5.2. Extensions Registry

The Language Tag Extensions Registry can contain at most 35 records, and thus changes to this registry are expected to be very infrequent. Future work by IANA on the Language Tag Extensions Registry is limited to two cases. First, the IESG MAY request that new records be inserted into this registry from time to time. These requests MUST include the record to insert in the exact format described in Section 3.7. In addition, there MAY be occasional requests from the maintaining authority for a specific extension to update the contact information or URLs in the record. These requests MUST include the complete, updated record. IANA is not responsible for validating the information provided, only that it is properly formatted. IANA SHOULD take reasonable steps to ascertain that the request comes from the maintaining authority named in the record present in the registry.
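
   The figure of 35 records follows from the singleton syntax: each
   extension is identified by one case-insensitive letter or digit,
   and 'x' is reserved for private use.  The short Python sketch below
   counts the possibilities, assuming the singleton production of the
   ABNF in Section 2.1.

```python
import string

# One digit or letter, compared case-insensitively, excluding 'x'.
singletons = [c for c in string.digits + string.ascii_lowercase if c != "x"]

print(len(singletons))  # 35
```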

