RFC 5646

Tags for Identifying Languages

Pages: 84
Best Current Practice: 47
BCP 47 is also: 4647
Obsoletes: 4646

Part 4 of 4 – Pages 71 to 84

RFC5646 - Page 71

6.  Security Considerations

   Language tags used in content negotiation, like any other information
   exchanged on the Internet, might be a source of concern because they
   might be used to infer the nationality of the sender, and thus
   identify potential targets for surveillance.

   This is a special case of the general problem that anything sent is
   visible to the receiving party and possibly to third parties as well.
   It is useful to be aware that such concerns can exist in some cases.

   The evaluation of the exact magnitude of the threat, and any possible
   countermeasures, is left to each application protocol (see BCP 72
   [RFC3552] for best current practice guidance on security threats and
   defenses).

   The language tag associated with a particular information item is of
   no consequence whatsoever in determining whether that content might
   contain possible homographs.  The fact that a text is tagged as being
   in one language or using a particular script subtag provides no
   assurance whatsoever that it does not contain characters from scripts
   other than the one(s) associated with or specified by that language
   tag.

   Since there is no limit to the number of variant, private use, and
   extension subtags, and consequently no limit on the possible length
   of a tag, implementations need to guard against buffer overflow
   attacks.  See Section 4.4 for details on language tag truncation,
   which can occur as a consequence of defenses against buffer overflow.
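
   As an illustration only, the following Python sketch shows one way
   to bound tag length by removing whole subtags from the end, in the
   spirit of the truncation rules in Section 4.4.  The function name
   'truncate_tag' and the 'max_len' parameter are assumptions made for
   this example; they are not defined by this document.

```python
def truncate_tag(tag, max_len):
    """Shorten `tag` to at most `max_len` characters.

    Whole subtags are dropped from the end; a subtag is never cut in the
    middle, and the result never ends with a single-character subtag.
    """
    if len(tag) <= max_len:
        return tag
    subtags = tag.split("-")
    # Keep popping while the tag is too long or ends in a singleton.
    while subtags and (len("-".join(subtags)) > max_len
                       or len(subtags[-1]) == 1):
        subtags.pop()
    # May be the empty string if even the first subtag does not fit.
    return "-".join(subtags)
```

   With a 35-character limit, the tag
   "zh-Latn-CN-variant1-a-extend1-x-wadegile-private1" is shortened to
   "zh-Latn-CN-variant1-a-extend1": the dangling 'x' is removed along
   with the private use subtags that followed it.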

   To prevent denial-of-service attacks, applications SHOULD NOT depend
   on either the Language Subtag Registry or the Language Tag Extensions
   Registry being always accessible.  Additionally, although the
   specification of valid subtags for an extension (see Section 3.7)
   MUST be available over the Internet, implementations SHOULD NOT
   mechanically depend on those sources being always accessible.

   The registries specified in this document are not suitable for
   frequent or real-time access to, or retrieval of, the full registry
   contents.  Most applications do not need registry data at all.  For
   others, being able to validate or canonicalize language tags as of a
   particular registry date will be sufficient, as the registry contents
   change only occasionally.  Changes are announced to
   <ietf-languages-announcements@iana.org>.  This mailing list is
   intended for interested organizations and individuals, not for bulk
   subscription to trigger automatic software updates.  The size of the
   registry makes it unsuitable for automatic software updates.
   Implementers considering integrating the Language Subtag Registry in
   an automatic updating scheme are strongly advised to distribute only
   suitably encoded differences, and only via their own infrastructure
   -- not directly from IANA.

   Changes, or the absence thereof, can also easily be detected by
   looking at the 'File-Date' record at the start of the registry, or by
   using features of the protocol used for downloading, without having
   to download the full registry.  At the time of publication of this
   document, IANA is making the Language Subtag Registry available over
   HTTP 1.1.  The proper way to update a local copy of the Language
   Subtag Registry using HTTP 1.1 is to use a conditional GET [RFC2616].
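
   The two checks described above can be sketched in Python roughly as
   follows.  This is a minimal illustration, not a reference client:
   the registry URL shown is an assumption about where IANA currently
   publishes the file, and the helper names ('file_date',
   'build_conditional_request', 'fetch_if_changed') are hypothetical.

```python
import urllib.error
import urllib.request
from datetime import date, datetime, timezone
from email.utils import format_datetime

# Assumed location of the registry; verify before relying on it.
REGISTRY_URL = ("https://www.iana.org/assignments/"
                "language-subtag-registry/language-subtag-registry")


def file_date(registry_text):
    """Return the date from the 'File-Date' record that starts the registry."""
    first_line = registry_text.splitlines()[0]
    name, _, value = first_line.partition(":")
    if name.strip() != "File-Date":
        raise ValueError("registry does not start with a File-Date record")
    return date.fromisoformat(value.strip())


def build_conditional_request(last_fetch, url=REGISTRY_URL):
    """Build a GET request carrying If-Modified-Since.

    `last_fetch` must be a timezone-aware datetime in UTC.
    """
    request = urllib.request.Request(url)
    request.add_header("If-Modified-Since",
                       format_datetime(last_fetch, usegmt=True))
    return request


def fetch_if_changed(last_fetch, url=REGISTRY_URL):
    """Return the registry text, or None if the server answers 304."""
    request = build_conditional_request(last_fetch, url)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.read().decode("utf-8")
    except urllib.error.HTTPError as err:
        if err.code == 304:  # Not Modified: the local copy is current
            return None
        raise
```

   A caller would typically keep the time of its last successful fetch,
   call 'fetch_if_changed' on its own schedule and infrastructure, and
   compare 'file_date' of the new text with that of the stored copy.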

7.  Character Set Considerations

   The syntax in this document requires that language tags use only the
   characters A-Z, a-z, 0-9, and HYPHEN-MINUS, which are present in most
   character sets, so the composition of language tags shouldn't have
   any character set issues.
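
   A character-level check of this restriction might look like the
   following Python sketch.  The function name is an assumption made
   for illustration, and the check covers only the permitted
   characters, not the structure of the tag.

```python
import re

# Only ASCII letters, ASCII digits, and HYPHEN-MINUS are permitted.
_TAG_CHARACTERS = re.compile(r"[A-Za-z0-9-]+")


def uses_only_tag_characters(tag):
    """Return True if every character of `tag` is A-Z, a-z, 0-9, or '-'."""
    return _TAG_CHARACTERS.fullmatch(tag) is not None
```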

   The rendering of text based on the language tag is not addressed
   here.  Historically, some processes have relied on the use of
   character set/encoding information (or other external information) in
   order to infer how a specific string of characters should be
   rendered.  Notably, this applies to language- and culture-specific
   variations of Han ideographs as used in Japanese, Chinese, and
   Korean, where use of, for example, a Japanese character encoding such
   as EUC-JP implies that the text itself is in Japanese.  When language
   tags are applied to spans of text, rendering engines might be able to
   use that information to better select fonts or make other rendering
   choices, particularly where languages with distinct writing
   traditions use the same characters.

8.  Changes from RFC 4646

   The main goal for this revision of RFC 4646 was to incorporate two
   new parts of ISO 639 (ISO 639-3 and ISO 639-5) and their attendant
   sets of language codes into the IANA Language Subtag Registry.  This
   permits the identification of many more languages and language
   collections than previously supported.

   The specific changes in this document to meet these goals are:

   o  Defined the incorporation of ISO 639-3 and ISO 639-5 codes for use
      as primary and extended language subtags.  Also permanently
      reserved and disallowed the use of additional 'extlang' subtags.
      The changes necessary to achieve this were:

      *  Modified the ABNF comments.

      *  Updated various registration and stability requirements
         sections to reference ISO 639-3 and ISO 639-5 in addition to
         ISO 639-1 and ISO 639-2.

      *  Edited the text to eliminate references to extended language
         subtags where they are no longer used.

      *  Explained the change in the section on extended language
         subtags.

   o  Changed the ABNF related to grandfathered tags.  The irregular
      tags are now listed.  Well-formed grandfathered tags are now
      described by the 'langtag' production, and the 'grandfathered'
      production was removed as a result.  Also: added description of
      both types of grandfathered tags to Section 2.2.8.

   o  Added the paragraph on "collections" to Section 4.1.

   o  Changed the capitalization rules for 'Tag' fields in Section 3.1.

   o  Split Section 3.1 up into subsections.

   o  Modified Section 3.5 to allow 'Suppress-Script' fields to be
      added, modified, or removed via the registration process.  This
      was an erratum from RFC 4646.

   o  Modified examples that used region code 'CS' (formerly Serbia and
      Montenegro) to use 'RS' (Serbia) instead.

   o  Modified the rules for creating and maintaining record
      'Description' fields to prevent duplicates, including inverted
      duplicates.

   o  Removed the lengthy description of why RFC 4646 was created from
      this section, which also caused the removal of the reference to
      XML Schema.

   o  Modified the text in Section 2.1 to place more emphasis on the
      fact that language tags are not case sensitive.

   o  Replaced the example "fr-Latn-CA" in Section 2.1 with "sr-Latn-RS"
      and "az-Arab-IR" because "fr-Latn-CA" doesn't respect the
      'Suppress-Script' on 'Latn' with 'fr'.

   o  Changed the requirements for well-formedness to make singleton
      repetition checking optional (it is required for validity
      checking) in Section 2.2.9.

   o  Changed the text in Section 2.2.9 referring to grandfathered
      checking to note that the list is now included in the ABNF.

   o  Modified and added text to Section 3.2.  The job description was
      placed first.  A note was added making clear that the Language
      Subtag Reviewer may delegate various non-critical duties,
      including list moderation.  Finally, additional text was added to
      make the appointment process clear and to clarify that decisions
      and performance of the reviewer are appealable.

   o  Added text to Section 3.5 clarifying that the
      ietf-languages@iana.org list is operated by whomever the IESG
      appoints.

   o  Added text to Section 3.1.5 clarifying that the first Description
      in a 'language' record matches the corresponding Reference Name
      for the language in ISO 639-3.

   o  Modified Section 2.2.9 to define classes of conformance related to
      specific tags (formerly 'well-formed' and 'valid' referred to
      implementations).  Notes were added about the removal of 'extlang'
      from the ABNF provided in RFC 4646, allowing for well-formedness
      using this older definition.  A reference to RFC 3066
      well-formedness was also added.

   o  Added text to the end of Section 3.1.2 noting that future versions
      of this document might add new field types to the registry format
      and recommending that implementations ignore any unrecognized
      fields.

   o  Added text about what the lack of a 'Suppress-Script' field means
      in a record to Section 3.1.9.

   o  Added text allowing the correction of misspellings and typographic
      errors to Section 3.1.5.

   o  Added text to Section 3.1.8 disallowing 'Prefix' field conflicts
      (such as circular prefix references).

   o  Modified text in Section 3.5 to require the subtag reviewer to
      announce his/her decision (or extension) following the two-week
      period.  Also clarified that any decision or failure to decide can
      be appealed.

   o  Modified text in Section 4.1 to include the (heretofore anecdotal)
      guiding principle of tag choice, and clarifying the non-use of
      script subtags in non-written applications.

   o  Prohibited multiple use of the same variant in a tag (e.g.,
      "de-1901-1901").  Previously, this was only a recommendation
      ("SHOULD").

   o  Removed inappropriate [RFC2119] language from the illustration in
      Section 4.4.1.

   o  Replaced the example of deprecating "zh-guoyu" with
      "zh-hakka"->"hak" in Section 4.5, noting that it was this document
      that caused the change.

   o  Replaced the text in Section 4.1 dealing with "mul"/"und" to
      include the subtags 'zxx' and 'mis', as well as the tag
      "i-default".  A normative reference to RFC 2277 was added.

   o  Added text to Section 3.5 clarifying that any modifications of a
      registration request must be sent to the <ietf-languages@iana.org>
      list before submission to IANA.

   o  Changed the ABNF for the record-jar format from using the LWSP
      production to use a folding whitespace production similar to
      obs-FWS in [RFC5234].  This effectively prevents unintentional
      blank lines inside a field.

   o  Revised text in Sections 3.3, 3.5, and 5.1 to clarify that the
      Language Subtag Reviewer sends the complete registration forms to
      IANA, that IANA extracts the record from the form, and that the
      forms must also be archived separately from the registry.

   o  Added text to Section 5 requiring IANA to send an announcement to
      an ietf-languages-announcements list whenever the registry is
      updated.

   o  Modified the registry to use UTF-8 as its character encoding.
      This also entails additional instructions to IANA and the
      Language Subtag Reviewer in the registration process.

   o  Modified the rules in Section 2.2.4 so that "exceptionally
      reserved" ISO 3166-1 codes other than 'UK' are included in the
      registry.  In particular, this allows the code 'EU' (European
      Union) to be used to form language tags or (more commonly) for
      applications that use the registry for region codes to reference
      this subtag.

   o  Modified the IANA considerations section (Section 5) to remove
      unnecessary normative [RFC2119] language.

9.  References

9.1.  Normative References

   [ISO15924]       International Organization for Standardization, "ISO
                    15924:2004.  Information and documentation -- Codes
                    for the representation of names of scripts",
                    January 2004.

   [ISO3166-1]      International Organization for Standardization, "ISO
                    3166-1:2006.  Codes for the representation of names
                    of countries and their subdivisions -- Part 1:
                    Country codes", November 2006.

   [ISO639-1]       International Organization for Standardization, "ISO
                    639-1:2002.  Codes for the representation of names
                    of languages -- Part 1: Alpha-2 code", July 2002.

   [ISO639-2]       International Organization for Standardization, "ISO
                    639-2:1998.  Codes for the representation of names
                    of languages -- Part 2: Alpha-3 code", October 1998.

   [ISO639-3]       International Organization for Standardization, "ISO
                    639-3:2007.  Codes for the representation of names
                    of languages - Part 3: Alpha-3 code for
                    comprehensive coverage of languages", February 2007.

   [ISO639-5]       International Organization for Standardization, "ISO
                    639-5:2008. Codes for the representation of names of
                    languages -- Part 5: Alpha-3 code for language
                    families and groups", May 2008.

   [ISO646]         International Organization for Standardization,
                    "ISO/IEC 646:1991, Information technology -- ISO
                    7-bit coded character set for information
                    interchange.", 1991.

   [RFC2026]        Bradner, S., "The Internet Standards Process --
                    Revision 3", BCP 9, RFC 2026, October 1996.

   [RFC2119]        Bradner, S., "Key words for use in RFCs to Indicate
                    Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2277]        Alvestrand, H., "IETF Policy on Character Sets and
                    Languages", BCP 18, RFC 2277, January 1998.

   [RFC3339]        Klyne, G., Ed. and C. Newman, "Date and Time on the
                    Internet: Timestamps", RFC 3339, July 2002.

   [RFC4647]        Phillips, A. and M. Davis, "Matching of Language
                    Tags", BCP 47, RFC 4647, September 2006.

   [RFC5226]        Narten, T. and H. Alvestrand, "Guidelines for
                    Writing an IANA Considerations Section in RFCs",
                    BCP 26, RFC 5226, May 2008.

   [RFC5234]        Crocker, D. and P. Overell, "Augmented BNF for
                    Syntax Specifications: ABNF", STD 68, RFC 5234,
                    January 2008.

   [SpecialCasing]  The Unicode Consortium, "Unicode Character Database,
                    Special Casing Properties", March 2008,
                    <http://unicode.org/Public/UNIDATA/SpecialCasing.txt>.

   [UAX14]          Freitag, A., "Unicode Standard Annex #14: Line
                    Breaking Properties", August 2006,
                    <http://www.unicode.org/reports/tr14/>.

   [UN_M.49]        Statistics Division, United Nations, "Standard
                    Country or Area Codes for Statistical Use", Revision
                    4 (United Nations publication, Sales No. 98.XVII.9),
                    June 1999.

   [Unicode]        Unicode Consortium, "The Unicode Consortium. The
                    Unicode Standard, Version 5.0, (Boston, MA, Addison-
                    Wesley, 2003. ISBN 0-321-49081-0)", January 2007.

9.2.  Informative References

   [CLDR]           "The Common Locale Data Repository Project",
                    <http://cldr.unicode.org>.

   [RFC1766]        Alvestrand, H., "Tags for the Identification of
                    Languages", RFC 1766, March 1995.

   [RFC2028]        Hovey, R. and S. Bradner, "The Organizations
                    Involved in the IETF Standards Process", BCP 11,
                    RFC 2028, October 1996.

   [RFC2046]        Freed, N. and N. Borenstein, "Multipurpose Internet
                    Mail Extensions (MIME) Part Two: Media Types",
                    RFC 2046, November 1996.

   [RFC2047]        Moore, K., "MIME (Multipurpose Internet Mail
                    Extensions) Part Three: Message Header Extensions
                    for Non-ASCII Text", RFC 2047, November 1996.

   [RFC2231]        Freed, N. and K. Moore, "MIME Parameter Value and
                    Encoded Word Extensions: Character Sets, Languages,
                    and Continuations", RFC 2231, November 1997.

   [RFC2616]        Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
                    Masinter, L., Leach, P., and T. Berners-Lee,
                    "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616,
                    June 1999.

   [RFC2781]        Hoffman, P. and F. Yergeau, "UTF-16, an encoding of
                    ISO 10646", RFC 2781, February 2000.

   [RFC3066]        Alvestrand, H., "Tags for the Identification of
                    Languages", RFC 3066, January 2001.

   [RFC3282]        Alvestrand, H., "Content Language Headers",
                    RFC 3282, May 2002.

   [RFC3552]        Rescorla, E. and B. Korver, "Guidelines for Writing
                    RFC Text on Security Considerations", BCP 72,
                    RFC 3552, July 2003.

   [RFC3629]        Yergeau, F., "UTF-8, a transformation format of ISO
                    10646", STD 63, RFC 3629, November 2003.

   [RFC4645]        Ewell, D., "Initial Language Subtag Registry",
                    RFC 4645, September 2006.

   [RFC4646]        Phillips, A. and M. Davis, "Tags for Identifying
                    Languages", BCP 47, RFC 4646, September 2006.

   [RFC5645]        Ewell, D., Ed., "Update to the Language Subtag
                    Registry", RFC 5645, September 2009.

   [UTS35]          Davis, M., "Unicode Technical Standard #35: Locale
                    Data Markup Language (LDML)", December 2007,
                    <http://www.unicode.org/reports/tr35/>.

   [iso639.prin]    ISO 639 Joint Advisory Committee, "ISO 639 Joint
                    Advisory Committee:  Working principles for ISO 639
                    maintenance", March 2000,
                    <http://www.loc.gov/standards/iso639-2/iso639jac_n3r.html>.

   [record-jar]     Raymond, E., "The Art of Unix Programming", 2003,
                    <urn:isbn:0-13-142901-9>.

Appendix A.  Examples of Language Tags (Informative)

   Simple language subtag:

      de (German)

      fr (French)

      ja (Japanese)

      i-enochian (example of a grandfathered tag)

   Language subtag plus Script subtag:

      zh-Hant (Chinese written using the Traditional Chinese script)

      zh-Hans (Chinese written using the Simplified Chinese script)

      sr-Cyrl (Serbian written using the Cyrillic script)

      sr-Latn (Serbian written using the Latin script)

   Extended language subtags and their primary language subtag
   counterparts:

      zh-cmn-Hans-CN (Chinese, Mandarin, Simplified script, as used in
      China)

      cmn-Hans-CN (Mandarin Chinese, Simplified script, as used in
      China)

      zh-yue-HK (Chinese, Cantonese, as used in Hong Kong SAR)

      yue-HK (Cantonese Chinese, as used in Hong Kong SAR)

   Language-Script-Region:

      zh-Hans-CN (Chinese written using the Simplified script as used in
      mainland China)

      sr-Latn-RS (Serbian written using the Latin script as used in
      Serbia)

   Language-Variant:

      sl-rozaj (Resian dialect of Slovenian)

      sl-rozaj-biske (San Giorgio dialect of Resian dialect of
      Slovenian)

      sl-nedis (Nadiza dialect of Slovenian)

   Language-Region-Variant:

      de-CH-1901 (German as used in Switzerland using the 1901 variant
      [orthography])

      sl-IT-nedis (Slovenian as used in Italy, Nadiza dialect)

   Language-Script-Region-Variant:

      hy-Latn-IT-arevela (Eastern Armenian written in Latin script, as
      used in Italy)

   Language-Region:

      de-DE (German for Germany)

      en-US (English as used in the United States)

      es-419 (Spanish appropriate for the Latin America and Caribbean
      region using the UN region code)

   Private use subtags:

      de-CH-x-phonebk

      az-Arab-x-AZE-derbend

   Private use registry values:

      x-whatever (private use using the singleton 'x')

      qaa-Qaaa-QM-x-southern (all private subtags)

      de-Qaaa (German, with a private script)

      sr-Latn-QM (Serbian, Latin script, private region)

      sr-Qaaa-RS (Serbian, private script, for Serbia)

   Tags that use extensions (examples ONLY -- extensions MUST be defined
   by revision or update to this document, or by RFC):

      en-US-u-islamcal

      zh-CN-a-myext-x-private

      en-a-myext-b-another

   Some Invalid Tags:

      de-419-DE (two region subtags)

      a-DE (use of a single-character subtag in primary position; note
      that there are a few grandfathered tags that start with "i-" that
      are valid)

      ar-a-aaa-b-bbb-a-ccc (two extensions with same single-letter
      prefix)
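
   The examples in this appendix can be exercised with a small parser
   such as the Python sketch below.  It is a simplified rendering of
   the 'langtag' and 'privateuse' productions, written for illustration
   under stated assumptions: grandfathered tags such as "i-enochian"
   are not handled, no registry lookup is performed (so it cannot
   establish validity), and the names 'LANGTAG', 'PRIVATEUSE', and
   'parse_langtag' are hypothetical.  It also rejects repeated variants
   and repeated extension singletons.

```python
import re

# Simplified 'langtag' production; grandfathered tags are left out.
LANGTAG = re.compile(
    r"""
    (?P<language>[A-Za-z]{2,3}(?:-[A-Za-z]{3}){0,3}|[A-Za-z]{4,8})
    (?:-(?P<script>[A-Za-z]{4}))?
    (?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[A-Za-z0-9]{5,8}|[0-9][A-Za-z0-9]{3}))*)
    (?P<extensions>(?:-[0-9A-WY-Za-wy-z](?:-[A-Za-z0-9]{2,8})+)*)
    (?:-(?P<privateuse>[xX](?:-[A-Za-z0-9]{1,8})+))?
    """,
    re.VERBOSE,
)
PRIVATEUSE = re.compile(r"[xX](?:-[A-Za-z0-9]{1,8})+")


def parse_langtag(tag):
    """Split `tag` into its components, or return None if it is rejected."""
    empty = {"language": None, "script": None, "region": None,
             "variants": [], "extensions": "", "privateuse": None}
    if PRIVATEUSE.fullmatch(tag):
        return dict(empty, privateuse=tag)
    m = LANGTAG.fullmatch(tag)
    if m is None:
        return None
    variants = m["variants"].split("-")[1:]
    # Extension subtags have 2 to 8 characters, so any one-character
    # piece in the extensions part is a singleton.
    singletons = [s for s in m["extensions"].lower().split("-")
                  if len(s) == 1]
    lowered = [v.lower() for v in variants]
    if len(set(lowered)) != len(lowered):
        return None  # repeated variant, e.g. "de-1901-1901"
    if len(set(singletons)) != len(singletons):
        return None  # repeated extension singleton
    return dict(empty, language=m["language"], script=m["script"],
                region=m["region"], variants=variants,
                extensions=m["extensions"].lstrip("-"),
                privateuse=m["privateuse"])
```

   Under these assumptions, "zh-cmn-Hans-CN" yields the language
   "zh-cmn", the script "Hans", and the region "CN", while each of the
   three invalid tags listed above is rejected.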

Appendix B.  Examples of Registration Forms

   LANGUAGE SUBTAG REGISTRATION FORM

   1. Name of requester: Han Steenwijk
   2. E-mail address of requester: han.steenwijk @ unipd.it
   3. Record Requested:

   Type:        variant
   Subtag:      biske
   Description: The San Giorgio dialect of Resian
   Description: The Bila dialect of Resian
   Prefix:      sl-rozaj
   Comments:    The dialect of San Giorgio/Bila is one of the
      four major local dialects of Resian

   4. Intended meaning of the subtag:

   The local variety of Resian as spoken in San Giorgio/Bila

   5. Reference to published description of the language (book or
   article):

    -- Jan I.N. Baudouin de Courtenay - Opyt fonetiki rez'janskich
   govorov, Varsava - Peterburg: Vende - Kozancikov, 1875.

   LANGUAGE SUBTAG REGISTRATION FORM

   1. Name of requester: Jaska Zedlik
   2. E-mail address of requester: jz53 @ zedlik.com
   3. Record Requested:

   Type:   variant
   Subtag: tarask
   Description: Belarusian in Taraskievica orthography
   Prefix: be
   Comments: The subtag represents Branislau Taraskievic's Belarusian
     orthography as published in "Bielaruski klasycny pravapis" by
     Juras Buslakou, Vincuk Viacorka, Zmicier Sanko, and Zmicier Sauka
     (Vilnia-Miensk 2005).

   4. Intended meaning of the subtag:

   The subtag is intended to represent the Belarusian orthography as
   published in "Bielaruski klasycny pravapis" by Juras Buslakou, Vincuk
   Viacorka, Zmicier Sanko, and Zmicier Sauka (Vilnia-Miensk 2005).

   5. Reference to published description of the language (book or
   article):

   Taraskievic, Branislau. Bielaruskaja gramatyka dla skol. Vilnia: Vyd.
   "Bielaruskaha kamitetu", 1929, 5th edition.

   Buslakou, Juras; Viacorka, Vincuk; Sanko, Zmicier; Sauka, Zmicier.
   Bielaruski klasycny pravapis. Vilnia-Miensk, 2005.

   6. Any other relevant information:

   Belarusian in Taraskievica orthography became widely used, especially
   in Belarusian-speaking Internet segment, but besides this some books
   and newspapers are also printed using this orthography of Belarusian.
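
   The "Record Requested" portion of each form uses the field-name and
   field-body layout of the registry, with long bodies folded onto
   indented continuation lines.  The Python sketch below shows one
   possible way to read such a record; the function name
   'parse_record' and the choice to return a list of pairs (so that
   repeated fields such as 'Description' are preserved in order) are
   assumptions made for this example.

```python
def parse_record(text):
    """Parse one record into an ordered list of (field-name, field-body)."""
    fields = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines around the record
        if line[0] in " \t" and fields:
            # Continuation line: unfold it onto the previous field body.
            name, body = fields[-1]
            fields[-1] = (name, (body + " " + line.strip()).strip())
        else:
            name, _, body = line.partition(":")
            fields.append((name.strip(), body.strip()))
    return fields


record = (
    "Type:        variant\n"
    "Subtag:      biske\n"
    "Description: The San Giorgio dialect of Resian\n"
    "Description: The Bila dialect of Resian\n"
    "Prefix:      sl-rozaj\n"
    "Comments:    The dialect of San Giorgio/Bila is one of the\n"
    "   four major local dialects of Resian\n"
)
fields = parse_record(record)
```

   Here 'fields' holds six pairs, including both 'Description' entries
   and a single unfolded 'Comments' body.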

Appendix C.  Acknowledgements

   Any list of contributors is bound to be incomplete; please regard the
   following as only a selection from the group of people who have
   contributed to make this document what it is today.

   The contributors to RFC 4646, RFC 4647, RFC 3066, and RFC 1766, the
   precursors of this document, made enormous contributions directly or
   indirectly to this document and are generally responsible for the
   success of language tags.

   The following people contributed to this document:

   Stephane Bortzmeyer, Karen Broome, Peter Constable, John Cowan,
   Martin Duerst, Frank Ellerman, Doug Ewell, Deborah Garside, Marion
   Gunn, Alfred Hoenes, Kent Karlsson, Chris Newman, Randy Presuhn,
   Stephen Silver, Shawn Steele, and many, many others.

   Very special thanks must go to Harald Tveit Alvestrand, who
   originated RFCs 1766 and 3066, and without whom this document would
   not have been possible.

   Special thanks go to Michael Everson, who served as the Language Tag
   Reviewer for almost the entire RFC 1766/RFC 3066 period, as well as
   the Language Subtag Reviewer since the adoption of RFC 4646.

   Special thanks also go to Doug Ewell, for his production of the first
   complete subtag registry, his work to support and maintain new
   registrations, and his careful editorship of both RFC 4645 and
   [RFC5645].

Authors' Addresses

   Addison Phillips (editor)
   Lab126

   EMail: addison@inter-locale.com
   URI:   http://www.inter-locale.com


   Mark Davis (editor)
   Google

   EMail: markdavis@google.com