4. Formation and Processing of Language Tags
This section addresses how to use the information in the registry with the tag syntax to choose, form, and process language tags.
4.1. Choice of Language Tag
One is sometimes faced with the choice between several possible tags for the same body of text.
Interoperability is best served when all users use the same language tag in order to represent the same language. If an application has requirements that make the rules here inapplicable, then that application risks damaging interoperability. It is strongly RECOMMENDED that users not define their own rules for language tag choice.
Subtags SHOULD only be used where they add useful distinguishing information; extraneous subtags interfere with the meaning, understanding, and processing of language tags. In particular, users and implementations SHOULD follow the 'Prefix' and 'Suppress-Script' fields in the registry (defined in Section 3.1): these fields provide guidance on when specific additional subtags SHOULD (and SHOULD NOT) be used in a language tag.
Of particular note, many applications can benefit from the use of script subtags in language tags, as long as the use is consistent for a given context. Script subtags were not formally defined in RFC 3066 and their use can affect matching and subtag identification by implementations of RFC 3066, as these subtags appear between the primary language and region subtags. For example, if a user requests content in an implementation of Section 2.5 of [RFC3066] using the language range "en-US", content labeled "en-Latn-US" will not match the request. Therefore, it is important to know when script subtags will customarily be used and when they ought not be used. In the registry, the Suppress-Script field helps ensure greater compatibility between the language tags generated according to the rules in this document and language tags and tag processors or consumers based on RFC 3066 by defining when users SHOULD NOT include a script subtag with a particular primary language subtag.
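The matching failure in the "en-US" example can be reproduced with a minimal sketch of the prefix rule from Section 2.5 of [RFC3066], under which a language range matches a tag only if it equals the tag or equals a prefix of the tag that is immediately followed by "-". The function name rfc3066_matches is hypothetical and is not defined by either document.

```python
def rfc3066_matches(language_range: str, tag: str) -> bool:
    """Sketch of RFC 3066 Section 2.5 matching (hypothetical helper)."""
    wanted = language_range.lower()   # comparisons are case-insensitive
    candidate = tag.lower()
    if wanted == "*":
        # The wildcard range matches every tag.
        return True
    # Match on equality, or on a prefix that ends at a subtag boundary.
    return candidate == wanted or candidate.startswith(wanted + "-")


# The script subtag sits between language and region, so the range
# "en-US" is no longer a prefix of the tag.
print(rfc3066_matches("en-US", "en-Latn-US"))
print(rfc3066_matches("en", "en-Latn-US"))
```

A consumer limited to this rule would still find "en-Latn-US" through the shorter range "en", which is one reason consistent use of script subtags within a given context matters.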
Extended language subtags (type 'extlang' in the registry; see Section 3.1) also appear between the primary language and region subtags and are reserved for future standardization. Applications might benefit from their judicious use in forming language tags in the future. Similar recommendations are expected to apply to their use as apply to script subtags.
Standards, protocols, and applications that reference this document normatively but apply different rules to the ones given in this section MUST specify how the procedure varies from the one given here.
The choice of subtags used to form a language tag SHOULD be guided by the following rules:
   1. Use as precise a tag as possible, but no more specific than is justified. Avoid using subtags that are not important for distinguishing content in an application.
      * For example, 'de' might suffice for tagging an email written in German, while "de-CH-1996" is probably unnecessarily precise for such a task.
   2. The script subtag SHOULD NOT be used to form language tags unless the script adds some distinguishing information to the tag. The field 'Suppress-Script' in the primary language record in the registry indicates which script subtags do not add distinguishing information for most applications.
      * For example, the subtag 'Latn' should not be used with the primary language 'en' because nearly all English documents are written in the Latin script and it adds no distinguishing information. However, if a document were written in English mixing Latin script with another script such as Braille ('Brai'), then it might be appropriate to choose to indicate both scripts to aid in content selection, such as the application of a style sheet.
   3. If a tag or subtag has a 'Preferred-Value' field in its registry entry, then the value of that field SHOULD be used to form the language tag in preference to the tag or subtag in which the preferred value appears.
      * For example, use 'he' for Hebrew in preference to 'iw'.
   4. The 'und' (Undetermined) primary language subtag SHOULD NOT be used to label content, even if the language is unknown. Omitting the language tag altogether is preferred to using a tag with a primary language subtag of 'und'. The 'und' subtag MAY be useful for protocols that require a language tag to be provided. The 'und' subtag MAY also be useful when matching language tags in certain situations.
   5. The 'mul' (Multiple) primary language subtag SHOULD NOT be used whenever the protocol allows separate tags for multiple languages, as is the case for the Content-Language header in HTTP. The 'mul' subtag conveys little useful information: content in multiple languages SHOULD individually tag the languages where they appear or otherwise indicate the actual language in preference to the 'mul' subtag.
   6. The same variant subtag SHOULD NOT be used more than once within a language tag.
      * For example, do not use "de-DE-1901-1901".
To ensure consistent backward compatibility, this document contains several provisions to account for potential instability in the standards used to define the subtags that make up language tags. These provisions mean that no language tag created under the rules in this document will become obsolete.
4.2. Meaning of the Language Tag
The relationship between the tag and the information it relates to is defined by the context in which the tag appears. Accordingly, this section gives only possible examples of its usage.
   o For a single information object, the associated language tags might be interpreted as the set of languages that is necessary for a complete comprehension of the complete object. Example: Plain text documents.
   o For an aggregation of information objects, the associated language tags could be taken as the set of languages used inside components of that aggregation. Examples: Document stores and libraries.
   o For information objects whose purpose is to provide alternatives, the associated language tags could be regarded as a hint that the content is provided in several languages and that one has to inspect each of the alternatives in order to find its language or languages. In this case, the presence of multiple tags might not mean that one needs to be multi-lingual to get complete understanding of the document. Example: MIME multipart/alternative.
   o In markup languages, such as HTML and XML, language information can be added to each part of the document identified by the markup structure (including the whole document itself). For example, one could write <span lang="fr">C'est la vie.</span> inside a Norwegian document; the Norwegian-speaking user could then access a French-Norwegian dictionary to find out what the marked section meant. If the user were listening to that document through a speech synthesis interface, this information could be used to signal the synthesizer to appropriately apply French text-to-speech pronunciation rules to that span of text, instead of applying the inappropriate Norwegian rules.
Language tags are related when they contain a similar sequence of subtags. For example, if a language tag B contains language tag A as a prefix, then B is typically "narrower" or "more specific" than A. Thus, "zh-Hant-TW" is more specific than "zh-Hant".
This relationship is not guaranteed in all cases: specifically, languages that begin with the same sequence of subtags are NOT guaranteed to be mutually intelligible, although they might be. For example, the tag "az" shares a prefix with both "az-Latn" (Azerbaijani written using the Latin script) and "az-Cyrl" (Azerbaijani written using the Cyrillic script). A person fluent in one script might not be able to read the other, even though the text might be identical. Content tagged as "az" most probably is written in just one script and thus might not be intelligible to a reader familiar with the other script.
4.3. Length Considerations
[RFC3066] did not provide an upper limit on the size of language tags. While RFC 3066 did define the semantics of particular subtags in such a way that most language tags consisted of language and region subtags with a combined total length of up to six characters, larger registered tags were not only possible but were actually registered. Neither the language tag syntax nor other requirements in this document impose a fixed upper limit on the number of subtags in a language tag (and thus an upper bound on the size of a tag). The language tag syntax suggests that, depending on the specific language, more subtags (and thus a longer tag) are sometimes necessary to completely identify the language for certain applications; thus, it is possible to envision long or complex subtag sequences.
4.3.1. Working with Limited Buffer Sizes
Some applications and protocols are forced to allocate fixed buffer sizes or otherwise limit the length of a language tag. A conformant implementation or specification MAY refuse to support the storage of language tags that exceed a specified length. Any such limitation SHOULD be clearly documented, and such documentation SHOULD include what happens to longer tags (for example, whether an error value is generated or the language tag is truncated). A protocol that allows tags to be truncated at an arbitrary limit, without giving any indication of what that limit is, has the potential for causing harm by changing the meaning of tags in substantial ways.
In practice, most language tags do not require more than a few subtags and will not approach reasonably sized buffer limitations; see Section 4.1.
Some specifications or protocols have limits on tag length but do not have a fixed length limitation. For example, [RFC2231] has no explicit length limitation: the length available for the language tag is constrained by the length of other header components (such as the charset's name) coupled with the 76-character limit in [RFC2047]. Thus, the "limit" might be 50 or more characters, but it could potentially be quite small.
The considerations for assigning a buffer limit are:
   o Implementations SHOULD NOT truncate language tags unless the meaning of the tag is purposefully being changed, or unless the tag does not fit into a limited buffer size specified by a protocol for storage or transmission.
   o Implementations SHOULD warn the user when a tag is truncated since truncation changes the semantic meaning of the tag.
   o Implementations of protocols or specifications that are space constrained but do not have a fixed limit SHOULD use the longest possible tag in preference to truncation.
   o Protocols or specifications that specify limited buffer sizes for language tags MUST allow for language tags of up to 33 characters.
   o Protocols or specifications that specify limited buffer sizes for language tags SHOULD allow for language tags of at least 42 characters.
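The considerations above might be applied as in the following minimal sketch, which assumes a protocol that documents its limit and reports an error rather than truncating silently. The names check_limit, store_tag, and TagTooLongError are hypothetical; only the figures 33 and 42 come from this section.

```python
MUST_SUPPORT = 33    # a limited buffer MUST accommodate tags of this length
SHOULD_SUPPORT = 42  # a limited buffer SHOULD accommodate tags of this length


class TagTooLongError(ValueError):
    """Raised instead of silently truncating an over-long tag."""


def check_limit(limit: int) -> str:
    """Classify a proposed buffer limit against the two thresholds."""
    if limit < MUST_SUPPORT:
        return "too small"
    if limit < SHOULD_SUPPORT:
        return "meets MUST only"
    return "meets SHOULD"


def store_tag(tag: str, limit: int) -> str:
    """Accept the tag unchanged, or fail loudly so the caller can decide."""
    if len(tag) > limit:
        raise TagTooLongError(f"tag is {len(tag)} characters; limit is {limit}")
    return tag
```

Raising an error is only one of the documented behaviors this section permits; a protocol could instead truncate, provided it says so and warns the user.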
The following illustration shows how the 42-character recommendation was derived. The combination of language and extended language subtags was chosen for future compatibility. At up to 15 characters, this combination is longer than the longest possible primary language subtag (8 characters):
   language  =  3  (ISO 639-2; ISO 639-1 requires 2)
   extlang1  =  4  (each subsequent subtag includes '-')
   extlang2  =  4  (unlikely: needs prefix="language-extlang1")
   extlang3  =  4  (extremely unlikely)
   script    =  5  (if not suppressed: see Section 4.1)
   region    =  4  (UN M.49; ISO 3166 requires 3)
   variant1  =  9  (MUST have language as a prefix)
   variant2  =  9  (MUST have language-variant1 as a prefix)
   total     = 42 characters
   Figure 7: Derivation of the Limit on Tag Length
4.3.2. Truncation of Language Tags
Truncation of a language tag alters the meaning of the tag, and thus SHOULD be avoided. However, truncation of language tags is sometimes necessary due to limited buffer sizes. Such truncation MUST NOT permit a subtag to be chopped off in the middle or the formation of invalid tags (for example, one ending with the "-" character).
This means that applications or protocols that truncate tags MUST do so by progressively removing subtags along with their preceding "-" from the right side of the language tag until the tag is short enough for the given buffer. If the resulting tag ends with a single-character subtag, that subtag and its preceding "-" MUST also be removed. For example:
   Tag to truncate: zh-Latn-CN-variant1-a-extend1-x-wadegile-private1
   1. zh-Latn-CN-variant1-a-extend1-x-wadegile
   2. zh-Latn-CN-variant1-a-extend1
   3. zh-Latn-CN-variant1
   4. zh-Latn-CN
   5. zh-Latn
   6. zh
   Figure 8: Example of Tag Truncation
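The truncation procedure can be sketched as follows. This is an illustrative reading of the rule above, not a normative algorithm; the function name truncate_tag and the choice to raise ValueError when nothing fits are assumptions.

```python
def truncate_tag(tag: str, max_length: int) -> str:
    """Drop whole subtags from the right until the tag fits (sketch)."""
    subtags = tag.split("-")
    while subtags and len("-".join(subtags)) > max_length:
        subtags.pop()  # remove the rightmost subtag and its "-"
        # A trailing single-character subtag (singleton) must go as well.
        while subtags and len(subtags[-1]) == 1:
            subtags.pop()
    if not subtags:
        # Assumption: signal failure rather than return an empty tag.
        raise ValueError("no well-formed truncation fits the buffer")
    return "-".join(subtags)


tag = "zh-Latn-CN-variant1-a-extend1-x-wadegile-private1"
for limit in (42, 39, 28, 18, 9, 6):
    print(limit, truncate_tag(tag, limit))
```

With the limits shown, the loop walks through the six steps of Figure 8 in order. Note that removing "wadegile" also removes the singleton "x", so step 2 is reached in a single pass.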
4.4. Canonicalization of Language Tags
Since a particular language tag is sometimes used by many processes, language tags SHOULD always be created or generated in a canonical form.
A language tag is in canonical form when:
   1. The tag is well-formed according to the rules in Section 2.1 and Section 2.2.
   2. Subtags of type 'Region' that have a Preferred-Value mapping in the IANA registry (see Section 3.1) SHOULD be replaced with their mapped value. Note: In rare cases, the mapped value will also have a Preferred-Value.
   3. Redundant or grandfathered tags that have a Preferred-Value mapping in the IANA registry (see Section 3.1) MUST be replaced with their mapped value. These items either are deprecated mappings created before the adoption of this document (such as the mapping of "no-nyn" to "nn" or "i-klingon" to "tlh") or are the result of later registrations or additions to this document (for example, "zh-guoyu" might be mapped to a language-extlang combination such as "zh-cmn" by some future update of this document).
   4. Other subtags that have a Preferred-Value mapping in the IANA registry (see Section 3.1) MUST be replaced with their mapped value. These items consist entirely of clerical corrections to ISO 639-1 in which the deprecated subtags have been maintained for compatibility purposes.
   5. If more than one extension subtag sequence exists, the extension sequences are ordered into case-insensitive ASCII order by singleton subtag.
Example: The language tag "en-A-aaa-B-ccc-bbb-x-xyz" is in canonical form, while "en-B-ccc-bbb-A-aaa-X-xyz" is well-formed but not in canonical form.
Example: The language tag "en-BU" (English as used in Burma) is not canonical because the 'BU' subtag has a canonical mapping to 'MM' (Myanmar), although the tag "en-BU" maintains its validity.
Canonicalization of language tags does not imply anything about the use of upper or lowercase letters when processing or comparing subtags (as described in Section 2.1). All comparisons MUST be performed in a case-insensitive manner.
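The canonical form rules can be illustrated with a minimal sketch that assumes its input is already well-formed. The three mapping tables are tiny hypothetical excerpts built only from examples given in this document; a real implementation would load every Preferred-Value mapping from the IANA registry. The function name canonicalize is likewise an assumption, chained mappings are not followed, and letter case is left untouched.

```python
# Illustrative excerpts only, taken from examples in this document.
TAG_PREFERRED = {"i-klingon": "tlh", "no-nyn": "nn"}
LANGUAGE_PREFERRED = {"iw": "he"}
REGION_PREFERRED = {"BU": "MM"}


def canonicalize(tag: str) -> str:
    """Apply Preferred-Value mappings and order extension sequences."""
    whole = TAG_PREFERRED.get(tag.lower())
    if whole is not None:
        return whole  # grandfathered or redundant tag with a mapping

    subtags = tag.split("-")
    if len(subtags[0]) == 1:
        # Private-use-only ("x-...") or unmapped grandfathered tag.
        return tag

    head, extensions, private_use = [subtags[0]], [], []
    for index in range(1, len(subtags)):
        subtag = subtags[index]
        if len(subtag) == 1:
            if subtag.lower() == "x":
                private_use = subtags[index:]  # always stays last
                break
            extensions.append([subtag])        # a new extension sequence
        elif extensions:
            extensions[-1].append(subtag)
        else:
            head.append(subtag)

    head[0] = LANGUAGE_PREFERRED.get(head[0].lower(), head[0])
    for position in range(1, len(head)):
        # Before any singleton, only region subtags are two characters long.
        if len(head[position]) == 2:
            region = head[position]
            head[position] = REGION_PREFERRED.get(region.upper(), region)

    # Rule 5: order sequences by singleton, ignoring case.
    extensions.sort(key=lambda sequence: sequence[0].lower())
    ordered = [subtag for sequence in extensions for subtag in sequence]
    return "-".join(head + ordered + private_use)
```

For instance, canonicalize("en-BU") yields "en-MM", and the two extension sequences in "en-B-ccc-bbb-A-aaa-X-xyz" are swapped while the private use sequence remains at the end.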
When performing canonicalization of language tags, processors MAY regularize the case of the subtags (that is, this process is OPTIONAL), following the case used in the registry. Note that this corresponds to the following casing rules: uppercase all non-initial two-letter subtags; titlecase all non-initial four-letter subtags; lowercase everything else.
Note: Case folding of ASCII letters in certain locales, unless carefully handled, sometimes produces non-ASCII character values. The Unicode Character Database file "SpecialCasing.txt" defines the specific cases that are known to cause problems with this. In particular, the letter 'i' (U+0069) in Turkish and Azerbaijani is uppercased to U+0130 (LATIN CAPITAL LETTER I WITH DOT ABOVE). Implementers SHOULD specify a locale-neutral casing operation to ensure that case folding of subtags does not produce this value, which is illegal in language tags. For example, if one were to uppercase the region subtag 'in' using Turkish locale rules, the sequence U+0130 U+004E would result instead of the expected 'IN'.
Note: if the field 'Deprecated' appears in a registry record without an accompanying 'Preferred-Value' field, then that tag or subtag is deprecated without a replacement. Validating processors SHOULD NOT generate tags that include these values, although the values are canonical when they appear in a language tag.
An extension MUST define any relationships that exist between the various subtags in the extension and thus MAY define an alternate canonicalization scheme for the extension's subtags. Extensions MAY define how the order of the extension's subtags is interpreted. For example, an extension could define that its subtags are in canonical order when the subtags are placed into ASCII order: that is, "en-a-aaa-bbb-ccc" instead of "en-a-ccc-bbb-aaa". Another extension might define that the order of the subtags influences their semantic meaning (so that "en-b-ccc-bbb-aaa" has a different value from "en-b-aaa-bbb-ccc"). However, extension specifications SHOULD be designed so that they are tolerant of the typical processes described in Section 3.7.
4.5. Considerations for Private Use Subtags
Private use subtags, like all other subtags, MUST conform to the format and content constraints in the ABNF. Private use subtags have no meaning outside the private agreement between the parties that intend to use or exchange language tags that employ them. The same subtags MAY be used with a different meaning under a separate private agreement. They SHOULD NOT be used where alternatives exist and SHOULD NOT be used in content or protocols intended for general use.
Private use subtags are simply useless for information exchange without prior arrangement. The value and semantic meaning of private use tags and of the subtags used within such a language tag are not defined by this document.
Subtags defined in the IANA registry as having a specific private use meaning convey more information than a purely private use tag prefixed by the singleton subtag 'x'. For applications, this additional information MAY be useful.
For example, the region subtags 'AA', 'ZZ', and in the ranges 'QM'-'QZ' and 'XA'-'XZ' (derived from ISO 3166 private use codes) MAY be used to form a language tag. A tag such as "zh-Hans-XQ" conveys a great deal of public, interchangeable information about the language material (that it is Chinese in the simplified Chinese script and is suitable for some geographic region 'XQ'). While the precise geographic region is not known outside of private agreement, the tag conveys far more information than an opaque tag such as "x-someLang", which contains no information about the language subtag or script subtag outside of the private agreement.
However, in some cases content tagged with private use subtags MAY interact with other systems in a different and possibly unsuitable manner compared to tags that use opaque, privately defined subtags, so the choice of the best approach sometimes depends on the particular domain in question.
5. IANA Considerations
This section deals with the processes and requirements necessary for IANA to undertake to maintain the subtag and extension registries as defined by this document and in accordance with the requirements of [RFC2434].
The impact on the IANA maintainers of the two registries defined by this document will be a small increase in the frequency of new entries or updates.
5.1. Language Subtag Registry
Upon adoption of this document, the registry will be initialized by a companion document: [RFC4645]. The criteria and process for selecting the initial set of records are described in that document. The initial set of records represents no impact on IANA, since the work to create it will be performed externally.
The new registry MUST be listed under "Language Tags" at <http://www.iana.org/numbers.html>, replacing the existing registrations defined by [RFC3066]. The existing set of registration forms and RFC 3066 registrations MUST be relabeled as "Language Tags (Obsolete)" and maintained (but not added to or modified).
Future work on the Language Subtag Registry SHALL be limited to inserting or replacing whole records preformatted for IANA by the Language Subtag Reviewer as described in Section 3.3 of this document and archiving the forwarded registration form.
Each record MUST be sent to iana@iana.org with a subject line indicating whether the enclosed record is an insertion of a new record (indicated by the word "INSERT" in the subject line) or a replacement of an existing record (indicated by the word "MODIFY" in the subject line). Records MUST NOT be deleted from the registry.
IANA MUST place any inserted or modified records into the appropriate section of the language subtag registry, grouping the records by their 'Type' field. Inserted records MAY be placed anywhere in the appropriate section; there is no guarantee of the order of the records beyond grouping them together by 'Type'. Modified records MUST overwrite the record they replace.
Included in any request to insert or modify records MUST be a new File-Date record. This record MUST be placed first in the registry. In the event that the File-Date record present in the registry has a later date than the record being inserted or modified, the existing record MUST be preserved.
5.2. Extensions Registry
The Language Tag Extensions Registry will also be generated and sent to IANA as described in Section 3.7. This registry can contain at most 35 records, and thus changes to this registry are expected to be very infrequent.
Future work by IANA on the Language Tag Extensions Registry is limited to two cases. First, the IESG MAY request that new records be inserted into this registry from time to time. These requests MUST include the record to insert in the exact format described in Section 3.7. In addition, there MAY be occasional requests from the maintaining authority for a specific extension to update the contact information or URLs in the record. These requests MUST include the complete, updated record. IANA is not responsible for validating the information provided, only that it is properly formatted. Each such request should reasonably be seen to come from the maintaining authority named in the record present in the registry.
6. Security Considerations
Language tags used in content negotiation, like any other information exchanged on the Internet, might be a source of concern because they might be used to infer the nationality of the sender, and thus identify potential targets for surveillance.
This is a special case of the general problem that anything sent is visible to the receiving party and possibly to third parties as well. It is useful to be aware that such concerns can exist in some cases.
The evaluation of the exact magnitude of the threat, and any possible countermeasures, is left to each application protocol (see BCP 72 [RFC3552] for best current practice guidance on security threats and defenses).
The language tag associated with a particular information item is of no consequence whatsoever in determining whether that content might contain possible homographs. The fact that a text is tagged as being in one language or using a particular script subtag provides no assurance whatsoever that it does not contain characters from scripts other than the one(s) associated with or specified by that language tag.
Since there is no limit to the number of variant, private use, and extension subtags, and consequently no limit on the possible length of a tag, implementations need to guard against buffer overflow attacks. See Section 4.3 for details on language tag truncation, which can occur as a consequence of defenses against buffer overflow.
Although the specification of valid subtags for an extension (see Section 3.7) MUST be available over the Internet, implementations SHOULD NOT mechanically depend on it being always accessible, to prevent denial-of-service attacks.
7. Character Set Considerations
The syntax in this document requires that language tags use only the characters A-Z, a-z, 0-9, and HYPHEN-MINUS, which are present in most character sets, so the composition of language tags should not have any character set issues.
Rendering of characters based on the content of a language tag is not addressed in this memo. Historically, some languages have relied on the use of specific character sets or other information in order to infer how a specific character should be rendered (notably this applies to language- and culture-specific variations of Han ideographs as used in Japanese, Chinese, and Korean). When language tags are applied to spans of text, rendering engines sometimes use that information in deciding which font to use in the absence of other information, particularly where languages with distinct writing traditions use the same characters.
8. Changes from RFC 3066
The main goals for this revision of language tags were the following:
*Compatibility.* All RFC 3066 language tags (including those in the IANA registry) remain valid in this specification. The changes in this document represent additional constraints on language tags. That is, in no case is the syntax more permissive and processors based on the ABNF and other provisions of RFC 3066 (such as those described in [XMLSchema]) will be able to process the tags described by this document. In addition, this document defines language tags in such a way as to ensure future compatibility.
*Stability.* Because of changes in the past in the underlying ISO standards, a valid RFC 3066 language tag could become invalid or have its meaning change. This has the potential of invalidating content that may have an extensive shelf-life. In this specification, once a language tag is valid, it remains valid forever.
*Validity.* The structure of language tags defined by this document makes it possible to determine if a particular tag is well-formed without regard for the actual content or "meaning" of the tag as a whole. This is important because the registry grows and underlying standards change over time. In addition, it must be possible to determine if a tag is valid (or not) for a given point in time in order to provide reproducible, testable results. This process must not be error-prone; otherwise implementations might give different results. By having an authoritative registry with specific versioning information, the validity of language tags at any point in time can be precisely determined (instead of interpolating values from many separate sources).
*Utility.* It is sometimes important to be able to differentiate between written forms of a language -- for many implementations this is more important than distinguishing between the spoken variants of a language.
Languages are written in a wide variety of different scripts, so this document provides for the generative use of ISO 15924 script codes. Like the generative use of ISO language and country codes in RFC 3066, this allows combinations to be produced without resorting to the registration process. The addition of UN M.49 codes provides for the generation of language tags with regional scope, which is also required by some applications.
The recast of the registry from containing whole language tags to subtags is a key part of this. An important feature of RFC 3066 was that it allowed generative use of subtags. This allows people to meaningfully use generated tags, without the delays in registering whole tags or the need to register all of the combinations that might be useful.
The choice of placing the extended language and script subtags between the primary language and region subtags was widely debated. This design was chosen because the prevalent matching and content negotiation schemes rely on the subtags being arranged in order of increasing specificity. That is, the subtags that mark a greater barrier to mutual intelligibility appear left-most in a tag. For example, when selecting content written in Azerbaijani, the script (Arabic, Cyrillic, or Latin) represents a greater barrier to understanding than any regional variations (those associated with Azerbaijan or Iran, for example). Individuals who prefer documents in a particular script, but can deal with the minor regional differences, can therefore select appropriate content. Applications that do not deal with written content will continue to omit these subtags.
*Extensibility.* Because of the widespread use of language tags, it is disruptive to have periodic revisions of the core specification, even in the face of demonstrated need. The extension mechanism provides for a way for independent RFCs to define extensions to language tags. These extensions have a very constrained, well-defined structure that prevents extensions from interfering with implementations of language tags defined in this document. The document also anticipates features of ISO 639-3 with the addition of the extended language subtags, as well as the possibility of other ISO 639 parts becoming useful for the formation of language tags in the future.
The use and definition of private use tags have also been modified, to allow people to use private use subtags to extend or modify defined tags and to move as much information as possible out of private use and into the regular structure. The goal for each of these modifications is to reduce or eliminate the need for future revisions of this document.
The specific changes in this document to meet these goals are:
   o Defines the ABNF and rules for subtags so that the category of all subtags can be determined without reference to the registry.
   o Adds the concept of well-formed vs. validating processors, defining the rules by which an implementation can claim to be one or the other.
   o Replaces the IANA language tag registry with a language subtag registry that provides a complete list of valid subtags in the IANA registry. This allows for robust implementation and ease of maintenance. The language subtag registry becomes the canonical source for forming language tags.
   o Provides a process that guarantees stability of language tags, by handling reuse of values by ISO 639, ISO 15924, and ISO 3166 in the event that they register a previously used value for a new purpose.
   o Allows ISO 15924 script code subtags and allows them to be used generatively. Defines a method for indicating in the registry when script subtags are necessary for a given language tag.
   o Adds the concept of a variant subtag and allows variants to be used generatively.
   o Adds the ability to use a class of UN M.49 tags for supra-national regions and to resolve conflicts in the assignment of ISO 3166 codes.
   o Defines the private use tags in ISO 639, ISO 15924, and ISO 3166 as the mechanism for creating private use language, script, and region subtags, respectively.
   o Adds a well-defined extension mechanism.
   o Defines an extended language subtag, possibly for use with certain anticipated features of ISO 639-3.
9. References
9.1. Normative References
[ISO10646] International Organization for Standardization, "ISO/IEC 10646:2003. Information technology -- Universal Multiple-Octet Coded Character Set (UCS)", 2003.
[ISO15924] International Organization for Standardization, "ISO 15924:2004. Information and documentation -- Codes for the representation of names of scripts", January 2004.
[ISO3166-1] International Organization for Standardization, "ISO 3166-1:1997. Codes for the representation of names of countries and their subdivisions -- Part 1: Country codes", 1997.
[ISO639-1] International Organization for Standardization, "ISO 639-1:2002. Codes for the representation of names of languages -- Part 1: Alpha-2 code", 2002.
[ISO639-2] International Organization for Standardization, "ISO 639-2:1998. Codes for the representation of names of languages -- Part 2: Alpha-3 code, first edition", 1998.
[ISO646] International Organization for Standardization, "ISO/IEC 646:1991, Information technology -- ISO 7-bit coded character set for information interchange.", 1991.
[RFC2026] Bradner, S., "The Internet Standards Process -- Revision 3", BCP 9, RFC 2026, October 1996.
[RFC2028] Hovey, R. and S. Bradner, "The Organizations Involved in the IETF Standards Process", BCP 11, RFC 2028, October 1996.
[RFC2119] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997.
[RFC2434] Narten, T. and H. Alvestrand, "Guidelines for Writing an IANA Considerations Section in RFCs", BCP 26, RFC 2434, October 1998.
[RFC2860] Carpenter, B., Baker, F., and M. Roberts, "Memorandum of Understanding Concerning the Technical Work of the Internet Assigned Numbers Authority", RFC 2860, June 2000.
[RFC3339] Klyne, G., Ed. and C. Newman, "Date and Time on the Internet: Timestamps", RFC 3339, July 2002.
[RFC4234] Crocker, D., Ed. and P. Overell, "Augmented BNF for Syntax Specifications: ABNF", RFC 4234, October 2005.
[UN_M.49] Statistics Division, United Nations, "Standard Country or Area Codes for Statistical Use", UN Standard Country or Area Codes for Statistical Use, Revision 4 (United Nations publication, Sales No. 98.XVII.9), June 1999.
9.2. Informative References
[RFC1766]   Alvestrand, H., "Tags for the Identification of Languages", RFC 1766, March 1995.

[RFC2047]   Moore, K., "MIME (Multipurpose Internet Mail Extensions) Part Three: Message Header Extensions for Non-ASCII Text", RFC 2047, November 1996.

[RFC2231]   Freed, N. and K. Moore, "MIME Parameter Value and Encoded Word Extensions: Character Sets, Languages, and Continuations", RFC 2231, November 1997.

[RFC2781]   Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO 10646", RFC 2781, February 2000.

[RFC3066]   Alvestrand, H., "Tags for the Identification of Languages", BCP 47, RFC 3066, January 2001.

[RFC3552]   Rescorla, E. and B. Korver, "Guidelines for Writing RFC Text on Security Considerations", BCP 72, RFC 3552, July 2003.

[RFC4645]   Ewell, D., Ed., "Initial Language Subtag Registry", RFC 4645, September 2006.

[RFC4647]   Phillips, A., Ed. and M. Davis, Ed., "Matching of Language Tags", BCP 47, RFC 4647, September 2006.

[Unicode]   Unicode Consortium, "The Unicode Standard, Version 5.0", Boston, MA, Addison-Wesley, 2007. ISBN 0-321-48091-0.

[XML10]     Bray (et al), T., "Extensible Markup Language (XML) 1.0", February 2004.

[XMLSchema] Biron, P., Ed. and A. Malhotra, Ed., "XML Schema Part 2: Datatypes Second Edition", October 2004, <http://www.w3.org/TR/xmlschema-2/>.

[iso639.prin] ISO 639 Joint Advisory Committee, "ISO 639 Joint Advisory Committee: Working principles for ISO 639 maintenance", March 2000, <http://www.loc.gov/standards/iso639-2/iso639jac_n3r.html>.

[record-jar] Raymond, E., "The Art of Unix Programming", 2003, <urn:isbn:0-13-142901-9>.
Appendix A. Acknowledgements
Any list of contributors is bound to be incomplete; please regard the following as only a selection from the group of people who have contributed to make this document what it is today.

The contributors to RFC 3066 and RFC 1766, the precursors of this document, made enormous contributions directly or indirectly to this document and are generally responsible for the success of language tags.

The following people (in alphabetical order) contributed to this document or to RFCs 1766 and 3066:

Glenn Adams, Harald Tveit Alvestrand, Tim Berners-Lee, Marc Blanchet, Nathaniel Borenstein, Karen Broome, Eric Brunner, Sean M. Burke, M.T. Carrasco Benitez, Jeremy Carroll, John Clews, Jim Conklin, Peter Constable, John Cowan, Mark Crispin, Dave Crocker, Elwyn Davies, Martin Duerst, Frank Ellerman, Michael Everson, Doug Ewell, Ned Freed, Tim Goodwin, Dirk-Willem van Gulik, Marion Gunn, Joel Halpren, Elliotte Rusty Harold, Paul Hoffman, Scott Hollenbeck, Richard Ishida, Olle Jarnefors, Kent Karlsson, John Klensin, Erkki Kolehmainen, Alain LaBonte, Eric Mader, Ira McDonald, Keith Moore, Chris Newman, Masataka Ohta, Dylan Pierce, Randy Presuhn, George Rhoten, Felix Sasaki, Markus Scherer, Keld Jorn Simonsen, Thierry Sourbier, Otto Stolz, Tex Texin, Andrea Vine, Rhys Weatherley, Misha Wolf, Francois Yergeau and many, many others.

Very special thanks must go to Harald Tveit Alvestrand, who originated RFCs 1766 and 3066, and without whom this document would not have been possible.

Special thanks must go to Michael Everson, who has served as Language Tag Reviewer for almost the complete period since the publication of RFC 1766.

Special thanks to Doug Ewell, for his production of the first complete subtag registry, and his work in producing a test parser for verifying language tags.
Appendix B. Examples of Language Tags (Informative)
Simple language subtag:

   de (German)
   fr (French)
   ja (Japanese)
   i-enochian (example of a grandfathered tag)

Language subtag plus Script subtag:

   zh-Hant (Chinese written using the Traditional Chinese script)
   zh-Hans (Chinese written using the Simplified Chinese script)
   sr-Cyrl (Serbian written using the Cyrillic script)
   sr-Latn (Serbian written using the Latin script)

Language-Script-Region:

   zh-Hans-CN (Chinese written using the Simplified script as used in mainland China)
   sr-Latn-CS (Serbian written using the Latin script as used in Serbia and Montenegro)

Language-Variant:

   sl-rozaj (Resian dialect of Slovenian)
   sl-nedis (Nadiza dialect of Slovenian)

Language-Region-Variant:

   de-CH-1901 (German as used in Switzerland using the 1901 variant [orthography])
   sl-IT-nedis (Slovenian as used in Italy, Nadiza dialect)

Language-Script-Region-Variant:

   sl-Latn-IT-nedis (Nadiza dialect of Slovenian written using the Latin script as used in Italy. Note that this tag is NOT RECOMMENDED because subtag 'sl' has a Suppress-Script value of 'Latn')

Language-Region:

   de-DE (German for Germany)
   en-US (English as used in the United States)
   es-419 (Spanish appropriate for the Latin America and Caribbean region using the UN region code)

Private use subtags:

   de-CH-x-phonebk
   az-Arab-x-AZE-derbend

Extended language subtags (examples ONLY: extended languages MUST be defined by revision or update to this document):

   zh-min
   zh-min-nan-Hant-CN

Private use registry values:

   x-whatever (private use using the singleton 'x')
   qaa-Qaaa-QM-x-southern (all private tags)
   de-Qaaa (German, with a private script)
   sr-Latn-QM (Serbian, Latin-script, private region)
   sr-Qaaa-CS (Serbian, private script, for Serbia and Montenegro)

Tags that use extensions (examples ONLY: extensions MUST be defined by revision or update to this document or by RFC):

   en-US-u-islamCal
   zh-CN-a-myExt-x-private
   en-a-myExt-b-another

Some Invalid Tags:

   de-419-DE (two region tags)
   a-DE (use of a single-character subtag in primary position; note that there are a few grandfathered tags that start with "i-" that are valid)
   ar-a-aaa-b-bbb-a-ccc (two extensions with same single-letter prefix)

Authors' Addresses
Addison Phillips (Editor)
Yahoo! Inc.
EMail: addison@inter-locale.com

Mark Davis (Editor)
Google
EMail: mark.davis@macchiato.com or mark.davis@google.com
Full Copyright Statement

Copyright (C) The Internet Society (2006).

This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights.

This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79.

Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr.

The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org.

Acknowledgement

Funding for the RFC Editor function is provided by the IETF Administrative Support Activity (IASA).