RFC 5646

Tags for Identifying Languages

Pages: 84
Best Current Practice: 47
BCP 47 also includes:  RFC 4647
Obsoletes:  RFC 4646
Part 4 of 4 – Pages 71 to 84

Top   ToC   RFC5646 - Page 71   prevText

6. Security Considerations

   Language tags used in content negotiation, like any other information
   exchanged on the Internet, might be a source of concern because they
   might be used to infer the nationality of the sender, and thus
   identify potential targets for surveillance.

   This is a special case of the general problem that anything sent is
   visible to the receiving party and possibly to third parties as well.
   It is useful to be aware that such concerns can exist in some cases.

   The evaluation of the exact magnitude of the threat, and any possible
   countermeasures, is left to each application protocol (see BCP 72
   [RFC3552] for best current practice guidance on security threats and
   defenses).

   The language tag associated with a particular information item is of
   no consequence whatsoever in determining whether that content might
   contain possible homographs.  The fact that a text is tagged as being
   in one language or using a particular script subtag provides no
   assurance whatsoever that it does not contain characters from scripts
   other than the one(s) associated with or specified by that language
   tag.

   Since there is no limit to the number of variant, private use, and
   extension subtags, and consequently no limit on the possible length
   of a tag, implementations need to guard against buffer overflow
   attacks.  See Section 4.4 for details on language tag truncation,
   which can occur as a consequence of defenses against buffer overflow.

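
   As an illustration of the length guard described above, the following
   Python sketch bounds a tag to a fixed size by dropping trailing
   subtags.  It is a non-normative example: the function name
   truncate_tag and the parameter max_len are hypothetical, and the rule
   of never leaving a single-character subtag at the end reflects one
   reading of the truncation guidance in Section 4.4.  Tags that consist
   only of private use subtags, and grandfathered tags, would need
   separate handling.

```python
def truncate_tag(tag: str, max_len: int) -> str:
    """Shorten `tag` to at most `max_len` characters by dropping trailing subtags.

    Illustrative only: subtags are removed from the right, and a
    single-character subtag (a singleton such as "a" or "x") is never
    left dangling at the end of the result.
    """
    if max_len < 1:
        raise ValueError("max_len must be positive")
    subtags = tag.split("-")
    while len(subtags) > 1 and len("-".join(subtags)) > max_len:
        subtags.pop()
        # A singleton only introduces what follows it, so drop it as well.
        while len(subtags) > 1 and len(subtags[-1]) == 1:
            subtags.pop()
    result = "-".join(subtags)
    if len(result) > max_len:
        raise ValueError("primary subtag alone exceeds max_len")
    return result


if __name__ == "__main__":
    print(truncate_tag("de-CH-x-phonebk", 8))
```

   Rejecting or truncating over-long input before it is copied into a
   fixed-size buffer is the point of the exercise; the exact limit is an
   application choice.
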
   To prevent denial-of-service attacks, applications SHOULD NOT depend
   on either the Language Subtag Registry or the Language Tag Extensions
   Registry being always accessible.  Additionally, although the
   specification of valid subtags for an extension (see Section 3.7)
   MUST be available over the Internet, implementations SHOULD NOT
   mechanically depend on those sources being always accessible.

   The registries specified in this document are not suitable for
   frequent or real-time access to, or retrieval of, the full registry
   contents.  Most applications do not need registry data at all.  For
   others, being able to validate or canonicalize language tags as of a
   particular registry date will be sufficient, as the registry contents
   change only occasionally.  Changes are announced to
   <ietf-languages-announcements@iana.org>.  This mailing list is
   intended for interested organizations and individuals, not for bulk
   subscription to trigger automatic software updates.  The size of the
   registry makes it unsuitable for automatic software updates.
   Implementers considering integrating the Language Subtag Registry in
   an automatic updating scheme are strongly advised to distribute only
   suitably encoded differences, and only via their own infrastructure
   -- not directly from IANA.

   Changes, or the absence thereof, can also easily be detected by
   looking at the 'File-Date' record at the start of the registry, or by
   using features of the protocol used for downloading, without having
   to download the full registry.  At the time of publication of this
   document, IANA is making the Language Subtag Registry available over
   HTTP 1.1.  The proper way to update a local copy of the Language
   Subtag Registry using HTTP 1.1 is to use a conditional GET [RFC2616].
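
   The conditional GET and the 'File-Date' check described above might
   be sketched as follows in Python.  This is a non-normative example
   under stated assumptions: REGISTRY_URL is an assumed location that
   should be confirmed with IANA, the function names are hypothetical,
   and the validators (Last-Modified and ETag values) are assumed to
   have been saved from an earlier successful download.

```python
import urllib.error
import urllib.request
from typing import Optional

# Assumed location of the registry; confirm with IANA before relying on it.
REGISTRY_URL = (
    "https://www.iana.org/assignments/"
    "language-subtag-registry/language-subtag-registry"
)


def build_conditional_request(
    url: str,
    last_modified: Optional[str] = None,
    etag: Optional[str] = None,
) -> urllib.request.Request:
    """Build a GET request carrying validators saved from an earlier download."""
    request = urllib.request.Request(url, method="GET")
    if last_modified:
        request.add_header("If-Modified-Since", last_modified)
    if etag:
        request.add_header("If-None-Match", etag)
    return request


def fetch_if_changed(url, last_modified=None, etag=None):
    """Return (body, last_modified, etag); body is None when nothing changed."""
    request = build_conditional_request(url, last_modified, etag)
    try:
        with urllib.request.urlopen(request, timeout=30) as response:
            headers = response.headers
            return response.read(), headers.get("Last-Modified"), headers.get("ETag")
    except urllib.error.HTTPError as error:
        if error.code == 304:  # Not Modified: keep the local copy
            return None, last_modified, etag
        raise


def read_file_date(registry_text: str) -> Optional[str]:
    """Extract the 'File-Date' value from the first record of the registry."""
    for line in registry_text.splitlines():
        if line.strip() == "%%":  # record separator: the first record has ended
            break
        if line.startswith("File-Date:"):
            return line.split(":", 1)[1].strip()
    return None
```

   A caller would store the returned validators next to its local copy
   and pass them back on the next check, for example
   fetch_if_changed(REGISTRY_URL, saved_last_modified, saved_etag), on a
   schedule that respects the advice above about infrequent access.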

7. Character Set Considerations

   The syntax in this document requires that language tags use only the
   characters A-Z, a-z, 0-9, and HYPHEN-MINUS, which are present in most
   character sets, so the composition of language tags shouldn't have
   any character set issues.

   The rendering of text based on the language tag is not addressed
   here.  Historically, some processes have relied on the use of
   character set/encoding information (or other external information) in
   order to infer how a specific string of characters should be
   rendered.  Notably, this applies to language- and culture-specific
   variations of Han ideographs as used in Japanese, Chinese, and
   Korean, where use of, for example, a Japanese character encoding such
   as EUC-JP implies that the text itself is in Japanese.  When language
   tags are applied to spans of text, rendering engines might be able to
   use that information to better select fonts or make other rendering
   choices, particularly where languages with distinct writing
   traditions use the same characters.
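
   The character restriction stated at the start of this section can be
   checked with a few lines of Python.  This is a non-normative sketch;
   the name uses_only_tag_characters is hypothetical.  An explicit
   character set is used because str.isalnum() would also accept
   letters and digits outside the ASCII range.

```python
import string

# A-Z, a-z, 0-9, and HYPHEN-MINUS (U+002D), as required by the tag syntax.
ALLOWED_TAG_CHARACTERS = frozenset(string.ascii_letters + string.digits + "-")


def uses_only_tag_characters(tag: str) -> bool:
    """Return True if `tag` is non-empty and drawn only from the allowed set."""
    return bool(tag) and all(ch in ALLOWED_TAG_CHARACTERS for ch in tag)


if __name__ == "__main__":
    for candidate in ("sr-Latn-RS", "de_DE", "fr\u2013CA"):
        print(repr(candidate), uses_only_tag_characters(candidate))
```

   Passing this check says nothing about whether a tag is well-formed or
   valid; it only rules out characters that can never appear in a tag.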

8. Changes from RFC 4646

   The main goal for this revision of RFC 4646 was to incorporate two
   new parts of ISO 639 (ISO 639-3 and ISO 639-5) and their attendant
   sets of language codes into the IANA Language Subtag Registry.  This
   permits the identification of many more languages and language
   collections than previously supported.

   The specific changes in this document to meet these goals are:

   o  Defined the incorporation of ISO 639-3 and ISO 639-5 codes for use
      as primary and extended language subtags.  It also permanently
      reserves and disallows the use of additional 'extlang' subtags.
      The changes necessary to achieve this were:

      *  Modified the ABNF comments.

      *  Updated various registration and stability requirements
         sections to reference ISO 639-3 and ISO 639-5 in addition to
         ISO 639-1 and ISO 639-2.

      *  Edited the text to eliminate references to extended language
         subtags where they are no longer used.

      *  Explained the change in the section on extended language
         subtags.

   o  Changed the ABNF related to grandfathered tags.  The irregular
      tags are now listed.  Well-formed grandfathered tags are now
      described by the 'langtag' production, and the 'grandfathered'
      production was removed as a result.  Also: added description of
      both types of grandfathered tags to Section 2.2.8.

   o  Added the paragraph on "collections" to Section 4.1.

   o  Changed the capitalization rules for 'Tag' fields in Section 3.1.

   o  Split Section 3.1 up into subsections.

   o  Modified Section 3.5 to allow 'Suppress-Script' fields to be
      added, modified, or removed via the registration process.  This
      was an erratum from RFC 4646.

   o  Modified examples that used region code 'CS' (formerly Serbia and
      Montenegro) to use 'RS' (Serbia) instead.

   o  Modified the rules for creating and maintaining record
      'Description' fields to prevent duplicates, including inverted
      duplicates.

   o  Removed the lengthy description of why RFC 4646 was created from
      this section, which also caused the removal of the reference to
      XML Schema.

   o  Modified the text in Section 2.1 to place more emphasis on the
      fact that language tags are not case sensitive.

   o  Replaced the example "fr-Latn-CA" in Section 2.1 with "sr-Latn-RS"
      and "az-Arab-IR" because "fr-Latn-CA" doesn't respect the
      'Suppress-Script' on 'Latn' with 'fr'.

   o  Changed the requirements for well-formedness to make singleton
      repetition checking optional (it is required for validity
      checking) in Section 2.2.9.

   o  Changed the text in Section 2.2.9 referring to grandfathered
      checking to note that the list is now included in the ABNF.

   o  Modified and added text to Section 3.2.  The job description was
      placed first.  A note was added making clear that the Language
      Subtag Reviewer may delegate various non-critical duties,
      including list moderation.  Finally, additional text was added to
      make the appointment process clear and to clarify that decisions
      and performance of the reviewer are appealable.

   o  Added text to Section 3.5 clarifying that the
      ietf-languages@iana.org list is operated by whomever the IESG
      appoints.

   o  Added text to Section 3.1.5 clarifying that the first Description
      in a 'language' record matches the corresponding Reference Name
      for the language in ISO 639-3.

   o  Modified Section 2.2.9 to define classes of conformance related to
      specific tags (formerly 'well-formed' and 'valid' referred to
      implementations).  Notes were added about the removal of 'extlang'
      from the ABNF provided in RFC 4646, allowing for well-formedness
      using this older definition.  Reference to RFC 3066
      well-formedness was also added.

   o  Added text to the end of Section 3.1.2 noting that future versions
      of this document might add new field types to the registry format
      and recommending that implementations ignore any unrecognized
      fields.

   o  Added text about what the lack of a 'Suppress-Script' field means
      in a record to Section 3.1.9.

   o  Added text allowing the correction of misspellings and typographic
      errors to Section 3.1.5.

   o  Added text to Section 3.1.8 disallowing 'Prefix' field conflicts
      (such as circular prefix references).

   o  Modified text in Section 3.5 to require the subtag reviewer to
      announce his/her decision (or extension) following the two-week
      period.  Also clarified that any decision or failure to decide can
      be appealed.

   o  Modified text in Section 4.1 to include the (heretofore anecdotal)
      guiding principle of tag choice, and clarifying the non-use of
      script subtags in non-written applications.

   o  Prohibited multiple use of the same variant in a tag (e.g.,
      "de-1901-1901").  Previously, this was only a recommendation
      ("SHOULD").

   o  Removed inappropriate [RFC2119] language from the illustration in
      Section 4.4.1.

   o  Replaced the example of deprecating "zh-guoyu" with
      "zh-hakka"->"hak" in Section 4.5, noting that it was this document
      that caused the change.

   o  Replaced the section in Section 4.1 dealing with "mul"/"und" to
      include the subtags 'zxx' and 'mis', as well as the tag
      "i-default".  A normative reference to RFC 2277 was added.

   o  Added text to Section 3.5 clarifying that any modifications of a
      registration request must be sent to the <ietf-languages@iana.org>
      list before submission to IANA.

   o  Changed the ABNF for the record-jar format from using the LWSP
      production to use a folding whitespace production similar to
      obs-FWS in [RFC5234].  This effectively prevents unintentional
      blank lines inside a field.

   o  Revised text in Sections 3.3, 3.5, and 5.1 to clarify that the
      Language Subtag Reviewer sends the complete registration forms to
      IANA, that IANA extracts the record from the
      form, and that the forms must also be archived separately from the
      registry.

   o  Added text to Section 5 requiring IANA to send an announcement to
      an ietf-languages-announcements list whenever the registry is
      updated.

   o  Modified the registry to use UTF-8 as its character encoding.
      This also entails additional instructions to IANA and
      the Language Subtag Reviewer in the registration process.

   o  Modified the rules in Section 2.2.4 so that "exceptionally
      reserved" ISO 3166-1 codes other than 'UK' were included into the
      registry.  In particular, this allows the code 'EU' (European
      Union) to be used to form language tags or (more commonly) for
      applications that use the registry for region codes to reference
      this subtag.

   o  Modified the IANA considerations section (Section 5) to remove
      unnecessary normative [RFC2119] language.

9. References

9.1. Normative References

   [ISO15924]       International Organization for Standardization, "ISO
                    15924:2004. Information and documentation -- Codes
                    for the representation of names of scripts",
                    January 2004.

   [ISO3166-1]      International Organization for Standardization, "ISO
                    3166-1:2006. Codes for the representation of names
                    of countries and their subdivisions -- Part 1:
                    Country codes", November 2006.

   [ISO639-1]       International Organization for Standardization, "ISO
                    639-1:2002. Codes for the representation of names of
                    languages -- Part 1: Alpha-2 code", July 2002.

   [ISO639-2]       International Organization for Standardization, "ISO
                    639-2:1998. Codes for the representation of names of
                    languages -- Part 2: Alpha-3 code", October 1998.

   [ISO639-3]       International Organization for Standardization, "ISO
                    639-3:2007. Codes for the representation of names of
                    languages -- Part 3: Alpha-3 code for comprehensive
                    coverage of languages", February 2007.

   [ISO639-5]       International Organization for Standardization, "ISO
                    639-5:2008. Codes for the representation of names of
                    languages -- Part 5: Alpha-3 code for language
                    families and groups", May 2008.

   [ISO646]         International Organization for Standardization,
                    "ISO/IEC 646:1991, Information technology -- ISO
                    7-bit coded character set for information
                    interchange.", 1991.

   [RFC2026]        Bradner, S., "The Internet Standards Process --
                    Revision 3", BCP 9, RFC 2026, October 1996.

   [RFC2119]        Bradner, S., "Key words for use in RFCs to Indicate
                    Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2277]        Alvestrand, H., "IETF Policy on Character Sets and
                    Languages", BCP 18, RFC 2277, January 1998.

   [RFC3339]        Klyne, G., Ed. and C. Newman, "Date and Time on the
                    Internet: Timestamps", RFC 3339, July 2002.

   [RFC4647]        Phillips, A. and M. Davis, "Matching of Language
                    Tags", BCP 47, RFC 4647, September 2006.

   [RFC5226]        Narten, T. and H. Alvestrand, "Guidelines for
                    Writing an IANA Considerations Section in RFCs",
                    BCP 26, RFC 5226, May 2008.

   [RFC5234]        Crocker, D. and P. Overell, "Augmented BNF for
                    Syntax Specifications: ABNF", STD 68, RFC 5234,
                    January 2008.

   [SpecialCasing]  The Unicode Consortium, "Unicode Character Database,
                    Special Casing Properties", March 2008, <http://
                    unicode.org/Public/UNIDATA/SpecialCasing.txt>.

   [UAX14]          Freitag, A., "Unicode Standard Annex #14: Line
                    Breaking Properties", August 2006,
                    <http://www.unicode.org/reports/tr14/>.

   [UN_M.49]        Statistics Division, United Nations, "Standard
                    Country or Area Codes for Statistical Use", Revision
                    4 (United Nations publication, Sales No. 98.XVII.9),
                    June 1999.

   [Unicode]        Unicode Consortium, "The Unicode Consortium. The
                    Unicode Standard, Version 5.0, (Boston, MA, Addison-
                    Wesley, 2003. ISBN 0-321-49081-0)", January 2007.

9.2. Informative References

   [CLDR]           "The Common Locale Data Repository Project",
                    <http://cldr.unicode.org>.

   [RFC1766]        Alvestrand, H., "Tags for the Identification of
                    Languages", RFC 1766, March 1995.

   [RFC2028]        Hovey, R. and S. Bradner, "The Organizations
                    Involved in the IETF Standards Process", BCP 11,
                    RFC 2028, October 1996.

   [RFC2046]        Freed, N. and N. Borenstein, "Multipurpose Internet
                    Mail Extensions (MIME) Part Two: Media Types",
                    RFC 2046, November 1996.

   [RFC2047]        Moore, K., "MIME (Multipurpose Internet Mail
                    Extensions) Part Three: Message Header Extensions
                    for Non-ASCII Text", RFC 2047, November 1996.

   [RFC2231]        Freed, N. and K. Moore, "MIME Parameter Value and
                    Encoded Word Extensions: Character Sets, Languages,
                    and Continuations", RFC 2231, November 1997.

   [RFC2616]        Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
                    Masinter, L., Leach, P., and T. Berners-Lee,
                    "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616,
                    June 1999.

   [RFC2781]        Hoffman, P. and F. Yergeau, "UTF-16, an encoding of
                    ISO 10646", RFC 2781, February 2000.

   [RFC3066]        Alvestrand, H., "Tags for the Identification of
                    Languages", RFC 3066, January 2001.

   [RFC3282]        Alvestrand, H., "Content Language Headers",
                    RFC 3282, May 2002.

   [RFC3552]        Rescorla, E. and B. Korver, "Guidelines for Writing
                    RFC Text on Security Considerations", BCP 72,
                    RFC 3552, July 2003.

   [RFC3629]        Yergeau, F., "UTF-8, a transformation format of ISO
                    10646", STD 63, RFC 3629, November 2003.

   [RFC4645]        Ewell, D., "Initial Language Subtag Registry",
                    RFC 4645, September 2006.

   [RFC4646]        Phillips, A. and M. Davis, "Tags for Identifying
                    Languages", BCP 47, RFC 4646, September 2006.

   [RFC5645]        Ewell, D., Ed., "Update to the Language Subtag
                    Registry", RFC 5645, September 2009.

   [UTS35]          Davis, M., "Unicode Technical Standard #35: Locale
                    Data Markup Language (LDML)", December 2007,
                    <http://www.unicode.org/reports/tr35/>.

   [iso639.prin]    ISO 639 Joint Advisory Committee, "ISO 639 Joint
                    Advisory Committee:  Working principles for ISO 639
                    maintenance", March 2000, <http://www.loc.gov/
                    standards/iso639-2/iso639jac_n3r.html>.

   [record-jar]     Raymond, E., "The Art of Unix Programming", 2003,
                    <urn:isbn:0-13-142901-9>.

Appendix A. Examples of Language Tags (Informative)

   Simple language subtag:

      de (German)

      fr (French)

      ja (Japanese)

      i-enochian (example of a grandfathered tag)

   Language subtag plus Script subtag:

      zh-Hant (Chinese written using the Traditional Chinese script)

      zh-Hans (Chinese written using the Simplified Chinese script)

      sr-Cyrl (Serbian written using the Cyrillic script)

      sr-Latn (Serbian written using the Latin script)

   Extended language subtags and their primary language subtag
   counterparts:

      zh-cmn-Hans-CN (Chinese, Mandarin, Simplified script, as used in
      China)

      cmn-Hans-CN (Mandarin Chinese, Simplified script, as used in
      China)

      zh-yue-HK (Chinese, Cantonese, as used in Hong Kong SAR)

      yue-HK (Cantonese Chinese, as used in Hong Kong SAR)

   Language-Script-Region:

      zh-Hans-CN (Chinese written using the Simplified script as used in
      mainland China)

      sr-Latn-RS (Serbian written using the Latin script as used in
      Serbia)

   Language-Variant:

      sl-rozaj (Resian dialect of Slovenian)

      sl-rozaj-biske (San Giorgio dialect of Resian dialect of
      Slovenian)

      sl-nedis (Nadiza dialect of Slovenian)

   Language-Region-Variant:

      de-CH-1901 (German as used in Switzerland using the 1901 variant
      [orthography])

      sl-IT-nedis (Slovenian as used in Italy, Nadiza dialect)

   Language-Script-Region-Variant:

      hy-Latn-IT-arevela (Eastern Armenian written in Latin script, as
      used in Italy)

   Language-Region:

      de-DE (German for Germany)

      en-US (English as used in the United States)

      es-419 (Spanish appropriate for the Latin America and Caribbean
      region using the UN region code)

   Private use subtags:

      de-CH-x-phonebk

      az-Arab-x-AZE-derbend

   Private use registry values:

      x-whatever (private use using the singleton 'x')

      qaa-Qaaa-QM-x-southern (all private tags)

      de-Qaaa (German, with a private script)

      sr-Latn-QM (Serbian, Latin script, private region)

      sr-Qaaa-RS (Serbian, private script, for Serbia)

   Tags that use extensions (examples ONLY -- extensions MUST be defined
   by revision or update to this document, or by RFC):

      en-US-u-islamcal

      zh-CN-a-myext-x-private

      en-a-myext-b-another

   Some Invalid Tags:

      de-419-DE (two region tags)

      a-DE (use of a single-character subtag in primary position; note
      that there are a few grandfathered tags that start with "i-" that
      are valid)

      ar-a-aaa-b-bbb-a-ccc (two extensions with same single-letter
      prefix)
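
   The invalid examples above can be caught mechanically.  The following
   Python sketch is a simplified, non-normative reading of the 'langtag'
   and 'privateuse' productions, plus the two repetition checks
   (variants and singletons).  The names LANGTAG, PRIVATEUSE, and
   check_tag are hypothetical.  The sketch does not handle grandfathered
   tags (so "i-enochian" is reported as not well-formed) and does not
   consult the registry, so an empty result means only that these
   particular checks found nothing wrong.

```python
import re
from typing import List

# Simplified reading of the 'langtag' production (grandfathered tags omitted).
LANGTAG = re.compile(
    r"""
    ^(?P<language>[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})
    (?:-(?P<script>[a-z]{4}))?
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)
    (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)
    (?:-x(?:-[a-z0-9]{1,8})+)?$
    """,
    re.VERBOSE | re.IGNORECASE,
)

# A tag made up entirely of private use subtags, such as "x-whatever".
PRIVATEUSE = re.compile(r"^x(?:-[a-z0-9]{1,8})+$", re.IGNORECASE)


def check_tag(tag: str) -> List[str]:
    """Return the problems found by these checks; an empty list means none."""
    if PRIVATEUSE.match(tag):
        return []
    match = LANGTAG.match(tag)
    if match is None:
        return ["not well-formed"]
    problems = []
    # Tags are not case sensitive, so compare in lowercase.
    variants = [v for v in match.group("variants").lower().split("-") if v]
    if len(variants) != len(set(variants)):
        problems.append("repeated variant")
    # Extension subtags are 2 to 8 characters, so 1-character items are singletons.
    singletons = [
        s for s in match.group("extensions").lower().split("-") if len(s) == 1
    ]
    if len(singletons) != len(set(singletons)):
        problems.append("repeated singleton")
    return problems


if __name__ == "__main__":
    for example in ("de-419-DE", "a-DE", "ar-a-aaa-b-bbb-a-ccc", "sl-rozaj-biske"):
        print(example, check_tag(example))
```

   Full validity checking additionally requires that every subtag appear
   in the registry, which this sketch deliberately leaves out.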

Appendix B. Examples of Registration Forms

   LANGUAGE SUBTAG REGISTRATION FORM

   1. Name of requester: Han Steenwijk
   2. E-mail address of requester: han.steenwijk @ unipd.it
   3. Record Requested:

   Type: variant
   Subtag: biske
   Description: The San Giorgio dialect of Resian
   Description: The Bila dialect of Resian
   Prefix: sl-rozaj
   Comments: The dialect of San Giorgio/Bila is one of the
     four major local dialects of Resian

   4. Intended meaning of the subtag:

   The local variety of Resian as spoken in San Giorgio/Bila

   5. Reference to published description of the language (book or
   article):

   -- Jan I.N. Baudouin de Courtenay - Opyt fonetiki rez'janskich
   govorov, Varsava - Peterburg: Vende - Kozancikov, 1875.

   LANGUAGE SUBTAG REGISTRATION FORM

   1. Name of requester: Jaska Zedlik
   2. E-mail address of requester: jz53 @ zedlik.com
   3. Record Requested:

   Type:   variant
   Subtag: tarask
   Description: Belarusian in Taraskievica orthography
   Prefix: be
   Comments: The subtag represents Branislau Taraskievic's Belarusian
     orthography as published in "Bielaruski klasycny pravapis" by
     Juras Buslakou, Vincuk Viacorka, Zmicier Sanko, and Zmicier Sauka
     (Vilnia-Miensk 2005).

   4. Intended meaning of the subtag:

   The subtag is intended to represent the Belarusian orthography as
   published in "Bielaruski klasycny pravapis" by Juras Buslakou, Vincuk
   Viacorka, Zmicier Sanko, and Zmicier Sauka (Vilnia-Miensk 2005).

   5. Reference to published description of the language (book or
   article):

   Taraskievic, Branislau. Bielaruskaja gramatyka dla skol. Vilnia: Vyd.
   "Bielaruskaha kamitetu", 1929, 5th edition.

   Buslakou, Juras; Viacorka, Vincuk; Sanko, Zmicier; Sauka, Zmicier.
   Bielaruski klasycny pravapis. Vilnia-Miensk, 2005.

   6. Any other relevant information:

   Belarusian in Taraskievica orthography became widely used, especially
   in Belarusian-speaking Internet segment, but besides this some books
   and newspapers are also printed using this orthography of Belarusian.
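
   The "Record Requested" portion of the forms above uses the registry's
   field-and-value layout, in which a field body may continue on
   following lines that begin with whitespace.  The Python sketch below
   shows one way to read such a record.  It is a non-normative example:
   parse_record is a hypothetical name, the record is assumed to have
   had the form's left margin removed, folded lines are joined with a
   single space, and blank lines are skipped leniently rather than
   rejected.

```python
from typing import List, Tuple


def parse_record(text: str) -> List[Tuple[str, str]]:
    """Parse 'Field: value' lines into (name, value) pairs, unfolding continuations.

    A list of pairs is returned rather than a dict because a field such as
    'Description' may appear more than once in a record.
    """
    fields: List[Tuple[str, str]] = []
    for raw in text.splitlines():
        if not raw.strip():
            continue  # lenient: this sketch ignores blank lines
        if raw[0] in " \t":
            # Continuation line: append to the body of the previous field.
            if not fields:
                raise ValueError("continuation line before any field")
            name, value = fields[-1]
            fields[-1] = (name, value + " " + raw.strip())
        else:
            name, separator, value = raw.partition(":")
            if not separator:
                raise ValueError(f"missing ':' in line: {raw!r}")
            fields.append((name.strip(), value.strip()))
    return fields


if __name__ == "__main__":
    record = (
        "Type: variant\n"
        "Subtag: biske\n"
        "Description: The San Giorgio dialect of Resian\n"
        "Description: The Bila dialect of Resian\n"
        "Prefix: sl-rozaj\n"
        "Comments: The dialect of San Giorgio/Bila is one of the\n"
        "  four major local dialects of Resian\n"
    )
    for name, value in parse_record(record):
        print(f"{name}: {value}")
```

   Implementations that follow the advice to ignore unrecognized fields
   can simply skip any (name, value) pair whose name they do not know.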

Appendix C. Acknowledgements

   Any list of contributors is bound to be incomplete; please regard the
   following as only a selection from the group of people who have
   contributed to make this document what it is today.

   The contributors to RFC 4646, RFC 4647, RFC 3066, and RFC 1766, the
   precursors of this document, made enormous contributions directly or
   indirectly to this document and are generally responsible for the
   success of language tags.

   The following people contributed to this document:

   Stephane Bortzmeyer, Karen Broome, Peter Constable, John Cowan,
   Martin Duerst, Frank Ellerman, Doug Ewell, Deborah Garside, Marion
   Gunn, Alfred Hoenes, Kent Karlsson, Chris Newman, Randy Presuhn,
   Stephen Silver, Shawn Steele, and many, many others.

   Very special thanks must go to Harald Tveit Alvestrand, who
   originated RFCs 1766 and 3066, and without whom this document would
   not have been possible.

   Special thanks go to Michael Everson, who served as the Language Tag
   Reviewer for almost the entire RFC 1766/RFC 3066 period, as well as
   the Language Subtag Reviewer since the adoption of RFC 4646.

   Special thanks also go to Doug Ewell, for his production of the first
   complete subtag registry, his work to support and maintain new
   registrations, and his careful editorship of both RFC 4645 and
   [RFC5645].

Authors' Addresses

   Addison Phillips (editor)
   Lab126

   EMail: addison@inter-locale.com
   URI:   http://www.inter-locale.com


   Mark Davis (editor)
   Google

   EMail: markdavis@google.com