6. Security Considerations
Language tags used in content negotiation, like any other information exchanged on the Internet, might be a source of concern because they might be used to infer the nationality of the sender, and thus identify potential targets for surveillance. This is a special case of the general problem that anything sent is visible to the receiving party and possibly to third parties as well. It is useful to be aware that such concerns can exist in some cases. The evaluation of the exact magnitude of the threat, and any possible countermeasures, is left to each application protocol (see BCP 72 [RFC3552] for best current practice guidance on security threats and defenses).
The language tag associated with a particular information item is of no consequence whatsoever in determining whether that content might contain possible homographs. The fact that a text is tagged as being in one language or using a particular script subtag provides no assurance whatsoever that it does not contain characters from scripts other than the one(s) associated with or specified by that language tag.
Since there is no limit to the number of variant, private use, and extension subtags, and consequently no limit on the possible length of a tag, implementations need to guard against buffer overflow attacks. See Section 4.4 for details on language tag truncation, which can occur as a consequence of defenses against buffer overflow.
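As an illustration of the length guard described above, the following Python sketch bounds a tag to a fixed size by removing whole subtags from the right. The function name `truncate_tag` and the parameter `max_len` are hypothetical, and the rule it applies (drop trailing subtags, then drop any single-character subtag left dangling at the end) is a summary of Section 4.4 as understood here, not a normative restatement.

```python
def truncate_tag(tag: str, max_len: int) -> str:
    """Shorten `tag` to at most `max_len` characters without splitting a subtag.

    Hypothetical helper: assumes truncation works by removing subtags from
    the right, as Section 4.4 is understood to require.
    """
    subtags = tag.split("-")
    while subtags and len("-".join(subtags)) > max_len:
        subtags.pop()  # drop the rightmost subtag together with its hyphen
        # A trailing one-character subtag (an extension singleton or 'x')
        # no longer introduces anything, so it is dropped as well.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    return "-".join(subtags)
```

For instance, `truncate_tag("de-CH-x-phonebk", 10)` yields "de-CH": removing "phonebk" leaves a dangling "x", which is removed too. A tag that already fits is returned unchanged.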
To prevent denial-of-service attacks, applications SHOULD NOT depend on either the Language Subtag Registry or the Language Tag Extensions Registry being always accessible. Additionally, although the specification of valid subtags for an extension (see Section 3.7) MUST be available over the Internet, implementations SHOULD NOT mechanically depend on those sources being always accessible.
The registries specified in this document are not suitable for frequent or real-time access to, or retrieval of, the full registry contents. Most applications do not need registry data at all. For others, being able to validate or canonicalize language tags as of a particular registry date will be sufficient, as the registry contents change only occasionally.
Changes are announced to <ietf-languages-announcements@iana.org>. This mailing list is intended for interested organizations and individuals, not for bulk subscription to trigger automatic software updates. The size of the registry makes it unsuitable for automatic software updates. Implementers considering integrating the Language Subtag Registry in an automatic updating scheme are strongly advised to distribute only suitably encoded differences, and only via their own infrastructure -- not directly from IANA.
Changes, or the absence thereof, can also easily be detected by looking at the 'File-Date' record at the start of the registry, or by using features of the protocol used for downloading, without having to download the full registry. At the time of publication of this document, IANA is making the Language Subtag Registry available over HTTP 1.1. The proper way to update a local copy of the Language Subtag Registry using HTTP 1.1 is to use a conditional GET [RFC2616].
7.  Character Set Considerations
The syntax in this document requires that language tags use only the characters A-Z, a-z, 0-9, and HYPHEN-MINUS, which are present in most character sets, so the composition of language tags shouldn't have any character set issues.
The rendering of text based on the language tag is not addressed here. Historically, some processes have relied on the use of character set/encoding information (or other external information) in order to infer how a specific string of characters should be rendered. Notably, this applies to language- and culture-specific variations of Han ideographs as used in Japanese, Chinese, and Korean, where use of, for example, a Japanese character encoding such as EUC-JP implies that the text itself is in Japanese.
When language tags are applied to spans of text, rendering engines might be able to use that information to better select fonts or make other rendering choices, particularly where languages with distinct writing traditions use the same characters.
8.  Changes from RFC 4646
The main goal for this revision of RFC 4646 was to incorporate two new parts of ISO 639 (ISO 639-3 and ISO 639-5) and their attendant sets of language codes into the IANA Language Subtag Registry. This permits the identification of many more languages and language collections than previously supported.
The specific changes in this document to meet these goals are:
o Defined the incorporation of ISO 639-3 and ISO 639-5 codes for use as primary and extended language subtags. It also permanently reserves and disallows the use of additional 'extlang' subtags. The changes necessary to achieve this were:
   * Modified the ABNF comments.
   * Updated various registration and stability requirements sections to reference ISO 639-3 and ISO 639-5 in addition to ISO 639-1 and ISO 639-2.
   * Edited the text to eliminate references to extended language subtags where they are no longer used.
   * Explained the change in the section on extended language subtags.
o Changed the ABNF related to grandfathered tags. The irregular tags are now listed. Well-formed grandfathered tags are now described by the 'langtag' production, and the 'grandfathered' production was removed as a result. Also: added description of both types of grandfathered tags to Section 2.2.8.
o Added the paragraph on "collections" to Section 4.1.
o Changed the capitalization rules for 'Tag' fields in Section 3.1.
o Split Section 3.1 up into subsections.
o Modified Section 3.5 to allow 'Suppress-Script' fields to be added, modified, or removed via the registration process. This was an erratum from RFC 4646.
o Modified examples that used region code 'CS' (formerly Serbia and Montenegro) to use 'RS' (Serbia) instead.
o Modified the rules for creating and maintaining record 'Description' fields to prevent duplicates, including inverted duplicates.
o Removed the lengthy description of why RFC 4646 was created from this section, which also caused the removal of the reference to XML Schema.
o Modified the text in Section 2.1 to place more emphasis on the fact that language tags are not case sensitive.
o Replaced the example "fr-Latn-CA" in Section 2.1 with "sr-Latn-RS" and "az-Arab-IR" because "fr-Latn-CA" doesn't respect the 'Suppress-Script' on 'Latn' with 'fr'.
o Changed the requirements for well-formedness to make singleton repetition checking optional (it is required for validity checking) in Section 2.2.9.
o Changed the text in Section 2.2.9 referring to grandfathered checking to note that the list is now included in the ABNF.
o Modified and added text to Section 3.2. The job description was placed first. A note was added making clear that the Language Subtag Reviewer may delegate various non-critical duties, including list moderation. Finally, additional text was added to make the appointment process clear and to clarify that decisions and performance of the reviewer are appealable.
o Added text to Section 3.5 clarifying that the ietf-languages@iana.org list is operated by whomever the IESG appoints.
o Added text to Section 3.1.5 clarifying that the first Description in a 'language' record matches the corresponding Reference Name for the language in ISO 639-3.
o Modified Section 2.2.9 to define classes of conformance related to specific tags (formerly 'well-formed' and 'valid' referred to implementations). Notes were added about the removal of 'extlang' from the ABNF provided in RFC 4646, allowing for well-formedness using this older definition. Reference to RFC 3066 well-formedness was also added.
o Added text to the end of Section 3.1.2 noting that future versions of this document might add new field types to the registry format and recommending that implementations ignore any unrecognized fields.
o Added text about what the lack of a 'Suppress-Script' field means in a record to Section 3.1.9.
o Added text allowing the correction of misspellings and typographic errors to Section 3.1.5.
o Added text to Section 3.1.8 disallowing 'Prefix' field conflicts (such as circular prefix references).
o Modified text in Section 3.5 to require the subtag reviewer to announce his/her decision (or extension) following the two-week period. Also clarified that any decision or failure to decide can be appealed.
o Modified text in Section 4.1 to include the (heretofore anecdotal) guiding principle of tag choice, and to clarify the non-use of script subtags in non-written applications.
o Prohibited multiple use of the same variant in a tag (e.g., "de-1901-1901"). Previously, this was only a recommendation ("SHOULD").
o Removed inappropriate [RFC2119] language from the illustration in Section 4.4.1.
o Replaced the example of deprecating "zh-guoyu" with "zh-hakka"->"hak" in Section 4.5, noting that it was this document that caused the change.
o Replaced the text in Section 4.1 dealing with "mul"/"und" to include the subtags 'zxx' and 'mis', as well as the tag "i-default". A normative reference to RFC 2277 was added.
o Added text to Section 3.5 clarifying that any modifications of a registration request must be sent to the <ietf-languages@iana.org> list before submission to IANA.
o Changed the ABNF for the record-jar format from using the LWSP production to use a folding whitespace production similar to obs-FWS in [RFC5234]. This effectively prevents unintentional blank lines inside a field.
o Revised text in Sections 3.3, 3.5, and 5.1 to clarify that the Language Subtag Reviewer sends the complete registration forms to IANA, that IANA extracts the record from the form, and that the forms must also be archived separately from the registry.
o Added text to Section 5 requiring IANA to send an announcement to an ietf-languages-announcements list whenever the registry is updated.
o Modified the registry to use UTF-8 as its character encoding. This also entails additional instructions to IANA and the Language Subtag Reviewer in the registration process.
o Modified the rules in Section 2.2.4 so that "exceptionally reserved" ISO 3166-1 codes other than 'UK' were included in the registry. In particular, this allows the code 'EU' (European Union) to be used to form language tags or (more commonly) for applications that use the registry for region codes to reference this subtag.
o Modified the IANA considerations section (Section 5) to remove unnecessary normative [RFC2119] language.
9.  References
9.1. Normative References
[ISO15924] International Organization for Standardization, "ISO 15924:2004. Information and documentation -- Codes for the representation of names of scripts", January 2004.
[ISO3166-1] International Organization for Standardization, "ISO 3166-1:2006. Codes for the representation of names of countries and their subdivisions -- Part 1: Country codes", November 2006.
[ISO639-1] International Organization for Standardization, "ISO 639-1:2002. Codes for the representation of names of languages -- Part 1: Alpha-2 code", July 2002.
[ISO639-2] International Organization for Standardization, "ISO 639-2:1998. Codes for the representation of names of languages -- Part 2: Alpha-3 code", October 1998.
[ISO639-3] International Organization for Standardization, "ISO 639-3:2007. Codes for the representation of names of languages -- Part 3: Alpha-3 code for comprehensive coverage of languages", February 2007.
[ISO639-5] International Organization for Standardization, "ISO 639-5:2008. Codes for the representation of names of languages -- Part 5: Alpha-3 code for language families and groups", May 2008.
[ISO646] International Organization for Standardization, "ISO/IEC 646:1991, Information technology -- ISO 7-bit coded character set for information interchange.", 1991.
[RFC2026] Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9, RFC 2026, October 1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2277] Alvestrand, H., "IETF Policy on Character Sets and Languages", BCP 18, RFC 2277, January 1998.
[RFC3339] Klyne, G., Ed. and C. Newman, "Date and Time on the Internet: Timestamps", RFC 3339, July 2002.
[RFC4647] Phillips, A. and M. Davis, "Matching of Language Tags", BCP 47, RFC 4647, September 2006.
[RFC5226] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA Considerations Section in RFCs", BCP 26, RFC 5226, May 2008.
[RFC5234] Crocker, D. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", STD 68, RFC 5234, January 2008.
[SpecialCasing] The Unicode Consortium, "Unicode Character Database, Special Casing Properties", March 2008, <http://unicode.org/Public/UNIDATA/SpecialCasing.txt>.
[UAX14] Freitag, A., "Unicode Standard Annex #14: Line Breaking Properties", August 2006, <http://www.unicode.org/reports/tr14/>.
[UN_M.49] Statistics Division, United Nations, "Standard Country or Area Codes for Statistical Use", Revision 4 (United Nations publication, Sales No. 98.XVII.9), June 1999.
[Unicode] Unicode Consortium, "The Unicode Consortium. The Unicode Standard, Version 5.0, (Boston, MA, Addison-Wesley, 2003. ISBN 0-321-49081-0)", January 2007.
9.2.  Informative References
[CLDR] "The Common Locale Data Repository Project", <http://cldr.unicode.org>.
[RFC1766] Alvestrand, H., "Tags for the Identification of Languages", RFC 1766, March 1995.
[RFC2028] Hovey, R. and S. Bradner, "The Organizations Involved in the IETF Standards Process", BCP 11, RFC 2028, October 1996.
[RFC2046] Freed, N. and N. Borenstein, "Multipurpose Internet Mail Extensions (MIME) Part Two: Media Types", RFC 2046, November 1996.
[RFC2047] Moore, K., "MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text", RFC 2047, November 1996.
[RFC2231] Freed, N. and K. Moore, "MIME Parameter Value and Encoded Word Extensions: Character Sets, Languages, and Continuations", RFC 2231, November 1997.
[RFC2616] Fielding, R., Gettys, J., Mogul, J., Frystyk, H., Masinter, L., Leach, P., and T. Berners-Lee, "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616, June 1999.
[RFC2781] Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO 10646", RFC 2781, February 2000.
[RFC3066] Alvestrand, H., "Tags for the Identification of Languages", RFC 3066, January 2001.
[RFC3282] Alvestrand, H., "Content Language Headers", RFC 3282, May 2002.
[RFC3552] Rescorla, E. and B. Korver, "Guidelines for Writing RFC Text on Security Considerations", BCP 72, RFC 3552, July 2003.
[RFC3629] Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD 63, RFC 3629, November 2003.
[RFC4645] Ewell, D., "Initial Language Subtag Registry", RFC 4645, September 2006.
[RFC4646] Phillips, A. and M. Davis, "Tags for Identifying Languages", BCP 47, RFC 4646, September 2006.
[RFC5645] Ewell, D., Ed., "Update to the Language Subtag Registry", RFC 5645, September 2009.
[UTS35] Davis, M., "Unicode Technical Standard #35: Locale Data Markup Language (LDML)", December 2007, <http://www.unicode.org/reports/tr35/>.
[iso639.prin] ISO 639 Joint Advisory Committee, "ISO 639 Joint Advisory Committee: Working principles for ISO 639 maintenance", March 2000, <http://www.loc.gov/standards/iso639-2/iso639jac_n3r.html>.
[record-jar] Raymond, E., "The Art of Unix Programming", 2003, <urn:isbn:0-13-142901-9>.
Appendix A. Examples of Language Tags (Informative)
Simple language subtag:
   de (German)
   fr (French)
   ja (Japanese)
   i-enochian (example of a grandfathered tag)
Language subtag plus Script subtag:
   zh-Hant (Chinese written using the Traditional Chinese script)
   zh-Hans (Chinese written using the Simplified Chinese script)
   sr-Cyrl (Serbian written using the Cyrillic script)
   sr-Latn (Serbian written using the Latin script)
Extended language subtags and their primary language subtag counterparts:
   zh-cmn-Hans-CN (Chinese, Mandarin, Simplified script, as used in China)
   cmn-Hans-CN (Mandarin Chinese, Simplified script, as used in China)
   zh-yue-HK (Chinese, Cantonese, as used in Hong Kong SAR)
   yue-HK (Cantonese Chinese, as used in Hong Kong SAR)
Language-Script-Region:
   zh-Hans-CN (Chinese written using the Simplified script as used in mainland China)
   sr-Latn-RS (Serbian written using the Latin script as used in Serbia)
Language-Variant:
   sl-rozaj (Resian dialect of Slovenian)
   sl-rozaj-biske (San Giorgio dialect of Resian dialect of Slovenian)
   sl-nedis (Nadiza dialect of Slovenian)
Language-Region-Variant:
   de-CH-1901 (German as used in Switzerland using the 1901 variant [orthography])
   sl-IT-nedis (Slovenian as used in Italy, Nadiza dialect)
Language-Script-Region-Variant:
   hy-Latn-IT-arevela (Eastern Armenian written in Latin script, as used in Italy)
Language-Region:
   de-DE (German for Germany)
   en-US (English as used in the United States)
   es-419 (Spanish appropriate for the Latin America and Caribbean region using the UN region code)
Private use subtags:
   de-CH-x-phonebk
   az-Arab-x-AZE-derbend
Private use registry values:
   x-whatever (private use using the singleton 'x')
   qaa-Qaaa-QM-x-southern (all private tags)
   de-Qaaa (German, with a private script)
   sr-Latn-QM (Serbian, Latin script, private region)
   sr-Qaaa-RS (Serbian, private script, for Serbia)
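The example groups so far are organized by which subtag positions a tag fills. The Python sketch below shows one way to recover those positions from a tag's shape alone. It is a hedged illustration: the regular expression is transcribed from the 'langtag' production as read in Section 2.1 and should be checked against the ABNF, the name `split_tag` is hypothetical, grandfathered tags such as "i-enochian" are deliberately not handled, and no registry lookup or duplicate checking is performed, so a match says nothing about validity.

```python
import re
from typing import Optional

# Assumption: this mirrors the 'langtag' production of Section 2.1 as I read it.
_LANGTAG = re.compile(
    r"""
    (?P<language>[a-z]{2,3}(?:-[a-z]{3}){0,3}|[a-z]{4,8})   # language [-extlang]
    (?:-(?P<script>[a-z]{4}))?                              # script
    (?:-(?P<region>[a-z]{2}|[0-9]{3}))?                     # region
    (?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)  # variants
    (?P<extensions>(?:-[0-9a-wy-z](?:-[a-z0-9]{2,8})+)*)    # extensions
    (?:-(?P<privateuse>x(?:-[a-z0-9]{1,8})+))?              # private use
    """,
    re.IGNORECASE | re.VERBOSE,
)
# A tag may also consist of a private use sequence only, e.g. "x-whatever".
_PRIVATE_ONLY = re.compile(r"x(?:-[a-z0-9]{1,8})+", re.IGNORECASE)


def split_tag(tag: str) -> Optional[dict]:
    """Return the components of `tag` by shape, or None if the shape does not match."""
    if _PRIVATE_ONLY.fullmatch(tag):
        return {"privateuse": tag}
    m = _LANGTAG.fullmatch(tag)
    if m is None:
        return None
    # Keep only the components that are actually present.
    parts = {name: value for name, value in m.groupdict().items() if value}
    # Repeated groups are captured with a leading hyphen; tidy them up.
    if "variants" in parts:
        parts["variants"] = parts["variants"].lstrip("-").split("-")
    if "extensions" in parts:
        parts["extensions"] = parts["extensions"].lstrip("-")
    return parts
```

With this sketch, `split_tag("zh-cmn-Hans-CN")` reports language "zh-cmn", script "Hans", and region "CN", while `split_tag("sl-rozaj-biske")` reports language "sl" with the variants "rozaj" and "biske". The extended language subtag is kept inside the language component rather than split out, which is a simplification made here for brevity.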
Tags that use extensions (examples ONLY -- extensions MUST be defined by revision or update to this document, or by RFC):
   en-US-u-islamcal
   zh-CN-a-myext-x-private
   en-a-myext-b-another
Some Invalid Tags:
   de-419-DE (two region tags)
   a-DE (use of a single-character subtag in primary position; note that there are a few grandfathered tags that start with "i-" that are valid)
   ar-a-aaa-b-bbb-a-ccc (two extensions with same single-letter prefix)
Appendix B.  Examples of Registration Forms
LANGUAGE SUBTAG REGISTRATION FORM
1. Name of requester: Han Steenwijk
2. E-mail address of requester: han.steenwijk @ unipd.it
3. Record Requested:
   Type: variant
   Subtag: biske
   Description: The San Giorgio dialect of Resian
   Description: The Bila dialect of Resian
   Prefix: sl-rozaj
   Comments: The dialect of San Giorgio/Bila is one of the four major local dialects of Resian
4. Intended meaning of the subtag:
   The local variety of Resian as spoken in San Giorgio/Bila
5. Reference to published description of the language (book or article):
   -- Jan I.N. Baudouin de Courtenay - Opyt fonetiki rez'janskich govorov, Varsava - Peterburg: Vende - Kozancikov, 1875.
LANGUAGE SUBTAG REGISTRATION FORM
1. Name of requester: Jaska Zedlik
2. E-mail address of requester: jz53 @ zedlik.com
3. Record Requested:
   Type: variant
   Subtag: tarask
   Description: Belarusian in Taraskievica orthography
   Prefix: be
   Comments: The subtag represents Branislau Taraskievic's Belarusian orthography as published in "Bielaruski klasycny pravapis" by Juras Buslakou, Vincuk Viacorka, Zmicier Sanko, and Zmicier Sauka (Vilnia-Miensk 2005).
4. Intended meaning of the subtag:
   The subtag is intended to represent the Belarusian orthography as published in "Bielaruski klasycny pravapis" by Juras Buslakou, Vincuk Viacorka, Zmicier Sanko, and Zmicier Sauka (Vilnia-Miensk 2005).
5. Reference to published description of the language (book or article):
   Taraskievic, Branislau. Bielaruskaja gramatyka dla skol. Vilnia: Vyd. "Bielaruskaha kamitetu", 1929, 5th edition.
   Buslakou, Juras; Viacorka, Vincuk; Sanko, Zmicier; Sauka, Zmicier. Bielaruski klasycny pravapis. Vilnia-Miensk, 2005.
6. Any other relevant information:
   Belarusian in Taraskievica orthography became widely used, especially in Belarusian-speaking Internet segment, but besides this some books and newspapers are also printed using this orthography of Belarusian.
Appendix C.  Acknowledgements
Any list of contributors is bound to be incomplete; please regard the following as only a selection from the group of people who have contributed to make this document what it is today. The contributors to RFC 4646, RFC 4647, RFC 3066, and RFC 1766, the precursors of this document, made enormous contributions directly or indirectly to this document and are generally responsible for the success of language tags.
The following people contributed to this document: Stephane Bortzmeyer, Karen Broome, Peter Constable, John Cowan, Martin Duerst, Frank Ellerman, Doug Ewell, Deborah Garside, Marion Gunn, Alfred Hoenes, Kent Karlsson, Chris Newman, Randy Presuhn, Stephen Silver, Shawn Steele, and many, many others.
Very special thanks must go to Harald Tveit Alvestrand, who originated RFCs 1766 and 3066, and without whom this document would not have been possible.
Special thanks go to Michael Everson, who served as the Language Tag Reviewer for almost the entire RFC 1766/RFC 3066 period, as well as the Language Subtag Reviewer since the adoption of RFC 4646.
Special thanks also go to Doug Ewell, for his production of the first complete subtag registry, his work to support and maintain new registrations, and his careful editorship of both RFC 4645 and [RFC5645].
Authors' Addresses
Addison Phillips (editor)
Lab126
EMail: addison@inter-locale.com
URI: http://www.inter-locale.com

Mark Davis (editor)
Google
EMail: markdavis@google.com