3.6. Possibilities for Registration
Possibilities for registration of subtags or information about subtags include:

o Primary language subtags for languages not listed in ISO 639 that are not variants of any listed or registered language MAY be registered. At the time this document was created, there were no examples of this form of subtag. Before attempting to register a language subtag, there MUST be an attempt to register the language with ISO 639. Subtags MUST NOT be registered for languages defined by codes that exist in ISO 639-1, ISO 639-2, or ISO 639-3; that are under consideration by the ISO 639 registration authorities; or that have never been attempted for registration with those authorities. If ISO 639 has previously rejected a language for registration, it is reasonable to assume that there must be additional, very compelling evidence of need before it will be registered as a primary language subtag in the IANA registry (to the extent that it is very unlikely that any subtags will be registered of this type).

o Dialect or other divisions or variations within a language, its orthography, writing system, regional or historical usage, transliteration or other transformation, or distinguishing variation MAY be registered as variant subtags. An example is the 'rozaj' subtag (the Resian dialect of Slovenian).

o The addition or maintenance of fields (generally of an informational nature) in tag or subtag records as described in Section 3.1 is allowed. Such changes are subject to the stability provisions in Section 3.4. This includes 'Description', 'Comments', 'Deprecated', and 'Preferred-Value' fields for obsolete or withdrawn codes, or the addition of 'Suppress-Script' or 'Macrolanguage' fields to primary language subtags, as well as other changes permitted by this document, such as the addition of an appropriate 'Prefix' field to a variant subtag.

o The addition of records and related field value changes necessary to reflect assignments made by ISO 639, ISO 15924, ISO 3166-1, and UN M.49 as described in Section 3.4 is allowed.

Subtags proposed for registration that would cause all or part of a grandfathered tag to become redundant but whose meaning conflicts with or alters the meaning of the grandfathered tag MUST be rejected.

This document leaves the decision on what subtags or changes to subtags are appropriate (or not) to the registration process described in Section 3.5.
Note: Four-character primary language subtags are reserved to allow for the possibility of alpha4 codes in some future addition to the ISO 639 family of standards.

ISO 639 defines a registration authority for additions to and changes in the list of languages in ISO 639. This agency is:

   International Information Centre for Terminology (Infoterm)
   Aichholzgasse 6/12, AT-1120 Wien, Austria
   Phone: +43 1 26 75 35 Ext. 312
   Fax: +43 1 216 32 72

ISO 639-2 defines a registration authority for additions to and changes in the list of languages in ISO 639-2. This agency is:

   Library of Congress
   Network Development and MARC Standards Office
   Washington, DC 20540, USA
   Phone: +1 202 707 6237
   Fax: +1 202 707 0115
   URL: http://www.loc.gov/standards/iso639-2

ISO 639-3 defines a registration authority for additions to and changes in the list of languages in ISO 639-3. This agency is:

   SIL International
   ISO 639-3 Registrar
   7500 W. Camp Wisdom Rd.
   Dallas, TX 75236, USA
   Phone: +1 972 708 7400, ext. 2293
   Fax: +1 972 708 7546
   Email: iso639-3@sil.org
   URL: http://www.sil.org/iso639-3

ISO 639-5 defines a registration authority for additions to and changes in the list of languages in ISO 639-5. This agency is the same as for ISO 639-2 and is:

   Library of Congress
   Network Development and MARC Standards Office
   Washington, DC 20540, USA
   Phone: +1 202 707 6237
   Fax: +1 202 707 0115
   URL: http://www.loc.gov/standards/iso639-5

The maintenance agency for ISO 3166-1 (country codes) is:

   ISO 3166 Maintenance Agency
   c/o International Organization for Standardization
   Case postale 56
   CH-1211 Geneva 20, Switzerland
   Phone: +41 22 749 72 33
   Fax: +41 22 749 73 49
   URL: http://www.iso.org/iso/en/prods-services/iso3166ma/index.html

The registration authority for ISO 15924 (script codes) is:

   Unicode Consortium
   Box 391476
   Mountain View, CA 94039-1476, USA
   URL: http://www.unicode.org/iso15924

The Statistics Division of the United Nations Secretariat maintains the Standard Country or Area Codes for Statistical Use and can be reached at:

   Statistical Services Branch
   Statistics Division
   United Nations, Room DC2-1620
   New York, NY 10017, USA
   Fax: +1-212-963-0623
   Email: statistics@un.org
   URL: http://unstats.un.org/unsd/methods/m49/m49alpha.htm

3.7. Extensions and the Extensions Registry
Extension subtags are those introduced by single-character subtags ("singletons") other than 'x'. They are reserved for the generation of identifiers that contain a language component and are compatible with applications that understand language tags.

The structure and form of extensions are defined by this document so that implementations can be created that are forward compatible with applications that might be created using singletons in the future. In addition, defining a mechanism for maintaining singletons will lend stability to this document by reducing the likely need for future revisions or updates.

Single-character subtags are assigned by IANA using the "IETF Review" policy defined by [RFC5226]. This policy requires the development of an RFC, which SHALL define the name, purpose, processes, and procedures for maintaining the subtags. The maintaining or registering authority, including name, contact email, discussion list email, and URL location of the registry, MUST be indicated clearly in the RFC. The RFC MUST specify or include each of the following:

o The specification MUST reference the specific version or revision of this document that governs its creation and MUST reference this section of this document.

o The specification and all subtags defined by the specification MUST follow the ABNF and other rules for the formation of tags and subtags as defined in this document. In particular, it MUST specify that case is not significant and that subtags MUST NOT exceed eight characters in length.

o The specification MUST specify a canonical representation.

o The specification of valid subtags MUST be available over the Internet and at no cost.

o The specification MUST be in the public domain or available via a royalty-free license acceptable to the IETF and specified in the RFC.

o The specification MUST be versioned, and each version of the specification MUST be numbered, dated, and stable.

o The specification MUST be stable. That is, extension subtags, once defined by a specification, MUST NOT be retracted or change in meaning in any substantial way.

o The specification MUST include, in a separate section, the registration form reproduced in this section (below) to be used in registering the extension upon publication as an RFC.

o IANA MUST be informed of changes to the contact information and URL for the specification.

IANA will maintain a registry of allocated single-character (singleton) subtags. This registry MUST use the record-jar format described by the ABNF in Section 3.1.1.

Upon publication of an extension as an RFC, the maintaining authority defined in the RFC MUST forward this registration form to <iesg@ietf.org>, who MUST forward the request to <iana@iana.org>.

The maintaining authority of the extension MUST maintain the accuracy of the record by sending an updated full copy of the record to <iana@iana.org> with the subject line "LANGUAGE TAG EXTENSION UPDATE" whenever content changes. Only the 'Comments', 'Contact_Email', 'Mailing_List', and 'URL' fields MAY be modified in these updates.

Failure to maintain this record, maintain the corresponding registry, or meet other conditions imposed by this section of this document MAY be appealed to the IESG [RFC2028] under the same rules as other IETF decisions (see [RFC2026]) and MAY result in the authority to maintain the extension being withdrawn or reassigned by the IESG.
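As an illustration of the syntactic requirements on extensions stated in this section (a singleton other than 'x' introduces the extension, case is not significant, and no subtag exceeds eight characters), the following Python sketch pulls the extension sequences out of a tag. It is not a full tag parser. The function name split_extensions is hypothetical. The sketch also assumes two rules from the tag ABNF that are not restated in this section: each extension subtag has at least two characters, and a singleton may appear only once in a tag.

```python
import re

# Assumption: each subtag after a singleton is 2 to 8 letters or digits.
_EXT_SUBTAG = re.compile(r"^[A-Za-z0-9]{2,8}$")


def split_extensions(tag: str) -> dict[str, list[str]]:
    """Return {singleton: [subtags, ...]} for the extensions in `tag`.

    Hypothetical helper: the language, script, region, and variant
    subtags that precede the first singleton are not validated.
    """
    subtags = tag.lower().split("-")  # case is not significant
    extensions: dict[str, list[str]] = {}
    current = None
    for index, subtag in enumerate(subtags):
        if index == 0:
            continue  # the first subtag is never treated as a singleton
        if len(subtag) == 1:
            if subtag == "x":
                break  # everything after 'x' is private use
            if subtag in extensions:
                raise ValueError(f"duplicate singleton {subtag!r}")
            extensions[subtag] = []
            current = subtag
        elif current is not None:
            if not _EXT_SUBTAG.match(subtag):
                raise ValueError(f"malformed extension subtag {subtag!r}")
            extensions[current].append(subtag)
    for singleton, parts in extensions.items():
        if not parts:
            raise ValueError(f"singleton {singleton!r} has no subtags")
    return extensions
```

For instance, split_extensions("de-a-value-b-other") yields one entry for 'a' and one for 'b', while a tag such as "en-US" yields an empty mapping.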
   %%
   Identifier:
   Description:
   Comments:
   Added:
   RFC:
   Authority:
   Contact_Email:
   Mailing_List:
   URL:
   %%

Figure 6: Format of Records in the Language Tag Extensions Registry

'Identifier' contains the single-character subtag (singleton) assigned to the extension. The Internet-Draft submitted to define the extension SHOULD specify which letter or digit to use, although the IESG MAY change the assignment when approving the RFC.

'Description' contains the name and description of the extension.

'Comments' is an OPTIONAL field and MAY contain a broader description of the extension.

'Added' contains the date the extension's RFC was published in the "full-date" format specified in [RFC3339]. For example: 2004-06-28 represents June 28, 2004, in the Gregorian calendar.

'RFC' contains the RFC number assigned to the extension.

'Authority' contains the name of the maintaining authority for the extension.

'Contact_Email' contains the email address used to contact the maintaining authority.

'Mailing_List' contains the URL or subscription email address of the mailing list used by the maintaining authority.

'URL' contains the URL of the registry for this extension.

The determination of whether an Internet-Draft meets the above conditions and the decision to grant or withhold such authority rests solely with the IESG and is subject to the normal review and appeals process associated with the RFC process.

Extension authors are strongly cautioned that many (including most well-formed) processors will be unaware of any special relationships or meaning inherent in the order of extension subtags. Extension authors SHOULD avoid subtag relationships or canonicalization mechanisms that interfere with matching or with length restrictions that sometimes exist in common protocols where the extension is used. In particular, applications MAY truncate the subtags in doing matching or in fitting into limited lengths, so it is RECOMMENDED that the most significant information be in the most significant (left-most) subtags and that the specification gracefully handle truncated subtags.

When a language tag is to be used in a specific, known protocol, it is RECOMMENDED that the language tag not contain extensions not supported by that protocol. In addition, note that some protocols MAY impose upper limits on the length of the strings used to store or transport the language tag.

3.8. Update of the Language Subtag Registry
After the adoption of this document, the IANA Language Subtag Registry needed an update so that it would contain the complete set of subtags valid in a language tag. [RFC5645] describes the process used to create this update.

Registrations that are in process under the rules defined in [RFC4646] when this document is adopted MUST be completed under the rules contained in this document.

3.9. Applicability of the Subtag Registry
The Language Subtag Registry is the source of data elements used to construct language tags, following the rules described in this document. Language tags are designed for indicating linguistic attributes of various content, including not only text but also most media formats, such as video or audio. They also form the basis for language and locale negotiation in various protocols and APIs.

The registry is therefore applicable to many applications that need some form of language identification, with these limitations:

o It is not designed to be the sole data source in the creation of a language-selection user interface. For example, the registry does not contain translations for subtag descriptions or for tags composed from the subtags. Sources for localized data based on the registry are generally available, notably [CLDR]. Nor does the registry indicate which subtag combinations are particularly useful or relevant.

o It does not provide information indicating relationships between different languages, such as might be used in a user interface to select language tags hierarchically, regionally, or on some other organizational model.

o It does not supply information about potential overlap between different language tags, as the notion of what constitutes a language is not precise: several different language tags might be reasonable choices for the same given piece of content.

o It does not contain information about appropriate fallback choices when performing language negotiation. A good fallback language might be linguistically unrelated to the specified language. The fact that one language is often used as a fallback language for another is usually a result of outside factors, such as geography, history, or culture -- factors that might not apply in all cases. For example, most people who use Breton (a Celtic language used in the Northwest of France) would probably prefer to be served French (a Romance language) if Breton isn't available.

4. Formation and Processing of Language Tags
This section addresses how to use the information in the registry with the tag syntax to choose, form, and process language tags.

4.1. Choice of Language Tag
The guiding principle in forming language tags is to "tag content wisely." Sometimes there is a choice between several possible tags for the same content. The choice of which tag to use depends on the content and application in question, and some amount of judgment might be necessary when selecting a tag. Interoperability is best served when the same language tag is used consistently to represent the same language. If an application has requirements that make the rules here inapplicable, then that application risks damaging interoperability. It is strongly RECOMMENDED that users not define their own rules for language tag choice. Standards, protocols, and applications that reference this document normatively but apply different rules to the ones given in this section MUST specify how language tag selection varies from the guidelines given here. To ensure consistent backward compatibility, this document contains several provisions to account for potential instability in the standards used to define the subtags that make up language tags.
These provisions mean that no valid language tag can become invalid, nor will a language tag have a narrower scope in the future (it may have a broader scope). The most appropriate language tag for a given application or content item might evolve over time, but once applied, the tag itself cannot become invalid or have its meaning wholly change.

A subtag SHOULD only be used when it adds useful distinguishing information to the tag. Extraneous subtags interfere with the meaning, understanding, and processing of language tags. In particular, users and implementations SHOULD follow the 'Prefix' and 'Suppress-Script' fields in the registry (defined in Section 3.1): these fields provide guidance on when specific additional subtags SHOULD be used or avoided in a language tag.

The choice of subtags used to form a language tag SHOULD follow these guidelines:

1. Use as precise a tag as possible, but no more specific than is justified. Avoid using subtags that are not important for distinguishing content in an application.

   * For example, 'de' might suffice for tagging an email written in German, while "de-CH-1996" is probably unnecessarily precise for such a task.

   * Note that some subtag sequences might not represent the language a casual user might expect. For example, the Swiss German (Schweizerdeutsch) language is represented by "gsw-CH" and not by "de-CH". This latter tag represents German ('de') as used in Switzerland ('CH'), also known as Swiss High German (Schweizer Hochdeutsch). Both are real languages, and distinguishing between them could be important to an application.

2. The script subtag SHOULD NOT be used to form language tags unless the script adds some distinguishing information to the tag. Script subtags were first formally defined in [RFC4646]. Their use can affect matching and subtag identification for implementations of [RFC1766] or [RFC3066] (which are obsoleted by this document), as these subtags appear between the primary language and region subtags.
   Some applications can benefit from the use of script subtags in language tags, as long as the use is consistent for a given context. Script subtags are never appropriate for unwritten content (such as audio recordings).

   The field 'Suppress-Script' in the primary or extended language record in the registry indicates script subtags that do not add distinguishing information for most applications; this field defines when users SHOULD NOT include a script subtag with a particular primary language subtag.

   For example, if an implementation selects content using Basic Filtering [RFC4647] (originally described in Section 14.4 of [RFC2616]) and the user requested the language range "en-US", content labeled "en-Latn-US" will not match the request and thus not be selected. Therefore, it is important to know when script subtags will customarily be used and when they ought not be used. For example:

   * The subtag 'Latn' should not be used with the primary language 'en' because nearly all English documents are written in the Latin script and it adds no distinguishing information. However, if a document were written in English mixing Latin script with another script such as Braille ('Brai'), then it might be appropriate to choose to indicate both scripts to aid in content selection, such as the application of a style sheet.

   * When labeling content that is unwritten (such as a recording of human speech), the script subtag should not be used, even if the language is customarily written in several scripts. Thus, the subtitles to a movie might use the tag "uz-Arab" (Uzbek, Arabic script), but the audio track for the same language would be tagged simply "uz". (The tag "uz-Zxxx" could also be used where content is not written, as the subtag 'Zxxx' represents the "Code for unwritten documents".)

3. If a tag or subtag has a 'Preferred-Value' field in its registry entry, then the value of that field SHOULD be used to form the language tag in preference to the tag or subtag in which the preferred value appears.

   * For example, use 'jbo' for Lojban in preference to the grandfathered tag "art-lojban".

4. Use subtags or sequences of subtags for individual languages in preference to subtags for language collections. A "language collection" is a group of languages that are descended from a common ancestor, are spoken in the same geographical area, or are otherwise related.
   Certain language collections are assigned codes by [ISO639-5] (and some of these [ISO639-5] codes are also defined as collections in [ISO639-2]). These codes are included as primary language subtags in the registry.

   Subtags for a language collection in the registry have a 'Scope' field with a value of 'collection'. A subtag for a language collection is always preferred to less specific alternatives such as 'mul' and 'und' (see below), and a subtag representing a language collection MAY be used when more specific language information is not available. However, most users and implementations do not know there is a relationship between the collection and its individual languages. In addition, the relationship between the individual languages in the collection is not well defined; in particular, the languages are usually not mutually intelligible. Since the subtags are different, a request for the collection will typically only produce items tagged with the collection's subtag, not items tagged with subtags for the individual languages contained in the collection.

   * For example, collections are interpreted inclusively, so the subtag 'gem' (Germanic languages) could, but SHOULD NOT, be used with content that would be better tagged with "en" (English), "de" (German), or "gsw" (Swiss German, Alemannic). While 'gem' collects all of these (and other) languages, most implementations will not match 'gem' to the individual languages; thus, using the subtag will not produce the desired result.

5. [ISO639-2] has defined several codes included in the subtag registry that require additional care when choosing language tags. In most of these cases, where omitting the language tag is permitted, such omission is preferable to using these codes. Language tags SHOULD NOT incorporate these subtags as a prefix, unless the additional information conveys some value to the application.

   * The 'mul' (Multiple) primary language subtag identifies content in multiple languages. This subtag SHOULD NOT be used when a list of languages or individual tags for each content element can be used instead. For example, the 'Content-Language' header [RFC3282] allows a list of languages to be used, not just a single language tag.

   * The 'und' (Undetermined) primary language subtag identifies linguistic content whose language is not determined. This subtag SHOULD NOT be used unless a language tag is required and language information is not available or cannot be determined. Omitting the language tag (where permitted) is preferred. The 'und' subtag might be useful for protocols that require a language tag to be provided or where a primary language subtag is required (such as in "und-Latn"). The 'und' subtag MAY also be useful when matching language tags in certain situations.

   * The 'zxx' (Non-Linguistic, Not Applicable) primary language subtag identifies content for which a language classification is inappropriate or does not apply. Some examples might include instrumental or electronic music; sound recordings consisting of nonverbal sounds; audiovisual materials with no narration, dialog, printed titles, or subtitles; machine-readable data files consisting of machine languages or character codes; or programming source code.

   * The 'mis' (Uncoded) primary language subtag identifies content whose language is known but that does not currently have a corresponding subtag. This subtag SHOULD NOT be used. Because the addition of other codes in the future can render its application invalid, it is inherently unstable and hence incompatible with the stability goals of BCP 47. It is always preferable to use other subtags: either 'und' or (with prior agreement) private use subtags.

6. Use variant subtags sparingly and in the correct order. Most variant subtags have one or more 'Prefix' fields in the registry that express the list of subtags with which they are appropriate. Variants SHOULD only be used with subtags that appear in one of these 'Prefix' fields. If a variant lists a second variant in one of its 'Prefix' fields, the first variant SHOULD appear directly after the second variant in any language tag where both occur. General purpose variants (those with no 'Prefix' fields at all) SHOULD appear after any other variant subtags. Order any remaining variants by placing the most significant subtag first. If none of the subtags is more significant or no relationship can be determined, alphabetize the subtags. Because variants are very specialized, using many of them together generally makes the tag so narrow as to override the additional precision gained. Putting the subtags into another order interferes with interoperability, as well as the overall interpretation of the tag.
   For example:

   * The tag "en-scotland-fonipa" (English, Scottish dialect, IPA phonetic transcription) is correctly ordered because 'scotland' has a 'Prefix' of "en", while 'fonipa' has no 'Prefix' field.

   * The tag "sl-IT-rozaj-biske-1994" is correctly ordered: 'rozaj' lists "sl" as its sole 'Prefix'; 'biske' lists "sl-rozaj" as its sole 'Prefix'. The subtag '1994' has several prefixes, including "sl-rozaj". However, it follows both 'rozaj' and 'biske' because one of its 'Prefix' fields is "sl-rozaj-biske".

7. The grandfathered tag "i-default" (Default Language) was originally registered according to [RFC1766] to meet the needs of [RFC2277]. It is not used to indicate a specific language, but rather to identify the condition or content used where the language preferences of the user cannot be established. It SHOULD NOT be used except as a means of labeling the default content for applications or protocols that require default language content to be labeled with that specific tag. It MAY also be used by an application or protocol to identify when the default language content is being returned.

4.1.1. Tagging Encompassed Languages
Some primary language records in the registry have a 'Macrolanguage' field (Section 3.1.10) that contains a mapping from each "encompassed language" to its macrolanguage. The 'Macrolanguage' mapping doesn't define what the relationship between the encompassed language and its macrolanguage is, nor does it define how languages encompassed by the same macrolanguage are related to each other. Two different languages encompassed by the same macrolanguage may differ from one another more than, say, French and Spanish do.

A few specific macrolanguages, such as Chinese ('zh') and Arabic ('ar'), are handled differently. See Section 4.1.2.

The more specific encompassed language subtag SHOULD be used to form the language tag, although either the macrolanguage's primary language subtag or the encompassed language's subtag MAY be used. This means, for example, tagging Plains Cree with 'crk' rather than 'cr' (Cree), and so forth.

Each macrolanguage subtag's scope, by definition, includes all of its encompassed languages. Since the relationship between encompassed languages varies, users cannot assume that the macrolanguage subtag means any particular encompassed language, nor that any given pair of encompassed languages are mutually intelligible or otherwise interchangeable.

Applications MAY use macrolanguage information to improve matching or language negotiation. For example, the information that 'sr' (Serbian) and 'hr' (Croatian) share a macrolanguage expresses a closer relation between those languages than between, say, 'sr' (Serbian) and 'mk' (Macedonian). However, this relationship is not guaranteed nor is it exclusive. For example, Romanian ('ro') and Moldavian ('mo') do not share a macrolanguage, but are far more closely related to each other than Cantonese ('yue') and Wu ('wuu'), which do share a macrolanguage.

4.1.2. Using Extended Language Subtags
To accommodate language tag forms used prior to the adoption of this document, language tags provide a special compatibility mechanism: the extended language subtag. Selected languages have been provided with both primary and extended language subtags. These include macrolanguages, such as Malay ('ms') and Uzbek ('uz'), that have a specific dominant variety that is generally synonymous with the macrolanguage. Other languages, such as the Chinese ('zh') and Arabic ('ar') macrolanguages and the various sign languages ('sgn'), have traditionally used their primary language subtag, possibly coupled with various region subtags or as part of a registered grandfathered tag, to indicate the language.

With the adoption of this document, specific ISO 639-3 subtags became available to identify the languages contained within these diverse language families or groupings. This presents a choice of language tags where previously none existed:

o Each encompassed language's subtag SHOULD be used as the primary language subtag. For example, a document in Mandarin Chinese would be tagged "cmn" (the subtag for Mandarin Chinese) in preference to "zh" (Chinese).

o If compatibility is desired or needed, the encompassed subtag MAY be used as an extended language subtag. For example, a document in Mandarin Chinese could be tagged "zh-cmn" instead of either "cmn" or "zh".

o The macrolanguage or prefixing subtag MAY still be used to form the tag instead of the more specific encompassed language subtag. That is, tags such as "zh-HK" or "sgn-RU" are still valid.

Chinese ('zh') provides a useful illustration of this. In the past, various content has used tags beginning with the 'zh' subtag, with application-specific meaning being associated with region codes, private use sequences, or grandfathered registered values. This is because historically only the macrolanguage subtag 'zh' was available for forming language tags.
However, the languages encompassed by the Chinese subtag 'zh' are, in the main, not mutually intelligible when spoken, and the written forms of these languages also show wide variation in form and usage.
To provide compatibility, Chinese languages encompassed by the 'zh' subtag are in the registry both as primary language subtags and as extended language subtags. For example, the ISO 639-3 code for Cantonese is 'yue'. Content in Cantonese might historically have used a tag such as "zh-HK" (since Cantonese is commonly spoken in Hong Kong), although that tag actually means any type of Chinese as used in Hong Kong. With the availability of ISO 639-3 codes in the registry, content in Cantonese can be directly tagged using the 'yue' subtag. The content can use it as a primary language subtag, as in the tag "yue-HK" (Cantonese, Hong Kong). Or it can use an extended language subtag with 'zh', as in the tag "zh-yue-Hant" (Chinese, Cantonese, Traditional script). As noted above, applications can choose to use the macrolanguage subtag to form the tag instead of using the more specific encompassed language subtag. For example, an application with large quantities of data already using tags with the 'zh' (Chinese) subtag might continue to use this more general subtag even for new data, even though the content could be more precisely tagged with 'cmn' (Mandarin), 'yue' (Cantonese), 'wuu' (Wu), and so on. Similarly, an application already using tags that start with the 'ar' (Arabic) subtag might continue to use this more general subtag even for new data, which could be more precisely tagged with 'arb' (Standard Arabic). In some cases, the encompassed languages had tags registered for them during the RFC 3066 era. Those grandfathered tags not already deprecated or rendered redundant were deprecated in the registry upon adoption of this document. As grandfathered values, they remain valid for use, and some content or applications might use them. 
As with other grandfathered tags, since implementations might not be able to associate the grandfathered tags with the encompassed language subtag equivalents that are recommended by this document, implementations are encouraged to canonicalize tags for comparison purposes. Some examples of this include the tags "zh-hakka" (Hakka) and "zh-guoyu" (Mandarin or Standard Chinese). Sign languages share a mode of communication rather than a linguistic heritage. There are many sign languages that have developed independently, and the subtag 'sgn' indicates only the presence of a sign language. A number of sign languages also had grandfathered tags registered for them during the RFC 3066 era. For example, the grandfathered tag "sgn-US" was registered to represent 'American Sign Language' specifically, without reference to the United States. This is still valid, but deprecated: a document in American Sign Language can be labeled either "ase" or "sgn-ase" (the 'ase' subtag is for the language called 'American Sign Language').
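The encouragement in this section to canonicalize tags for comparison can be sketched in Python as follows. This is a minimal illustration, not a complete canonicalization procedure. The two tables are small hand-copied excerpts (an assumption; a real implementation would load the 'Preferred-Value' and 'Prefix' data from the Language Subtag Registry), and the function name comparison_key is hypothetical.

```python
# Assumption: excerpt of extended language subtags and the prefix
# (macrolanguage or 'sgn') each one is registered with.
EXTLANG_PREFIX = {"cmn": "zh", "yue": "zh", "wuu": "zh", "arb": "ar", "ase": "sgn"}

# Assumption: excerpt of deprecated whole tags and their preferred values.
DEPRECATED_TAGS = {"zh-hakka": "hak", "zh-guoyu": "cmn", "sgn-us": "ase"}


def comparison_key(tag: str) -> str:
    """Return a lowercase key under which equivalent tags compare equal."""
    key = tag.lower()  # case is not significant
    # Step 1: replace a deprecated whole tag with its preferred value.
    key = DEPRECATED_TAGS.get(key, key)
    # Step 2: drop a prefix that stands in front of its extended language
    # subtag, so that "zh-yue-Hant" and "yue-Hant" share one key.
    subtags = key.split("-")
    if len(subtags) >= 2 and EXTLANG_PREFIX.get(subtags[1]) == subtags[0]:
        subtags = subtags[1:]
    return "-".join(subtags)
```

Under these assumptions, "sgn-US", "sgn-ase", and "ase" all map to the same key, while a macrolanguage tag such as "zh-HK" is left as it is, because it does not identify any particular encompassed language.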
4.2. Meaning of the Language Tag
The meaning of a language tag is related to the meaning of the subtags that it contains. Each subtag, in turn, implies a certain range of expectations one might have for related content, although it is not a guarantee. For example, the use of a script subtag such as 'Arab' (Arabic script) does not mean that the content contains only Arabic characters. It does mean that the language involved is predominantly in the Arabic script. Thus, a language tag and its subtags can encompass a very wide range of variation and yet remain appropriate in each particular instance.

Validity of a tag is not the only factor determining its usefulness. While every valid tag has a meaning, it might not represent any real-world language usage. This is unavoidable in a system in which subtags can be combined freely. For example, tags such as "ar-Cyrl-CO" (Arabic, Cyrillic script, as used in Colombia) or "tlh-Kore-AQ-fonipa" (Klingon, Korean script, as used in Antarctica, IPA phonetic transcription) are both valid and unlikely to represent a useful combination of language attributes.

The meaning of a given tag doesn't depend on the context in which it appears. The relationship between a tag's meaning and the information objects to which that tag is applied, however, can vary.

o  For a single information object, the associated language tags might be interpreted as the set of languages that is necessary for a complete comprehension of the complete object. Example: Plain text documents.

o  For an aggregation of information objects, the associated language tags could be taken as the set of languages used inside components of that aggregation. Examples: Document stores and libraries.

o  For information objects whose purpose is to provide alternatives, the associated language tags could be regarded as a hint that the content is provided in several languages and that one has to inspect each of the alternatives in order to find its language or languages. In this case, the presence of multiple tags might not mean that one needs to be multilingual to get complete understanding of the document. Example: MIME multipart/alternative [RFC2046].

o  For markup languages, such as HTML and XML, language information can be added to each part of the document identified by the markup structure (including the whole document itself). For example, one could write <span lang="fr">C'est la vie.</span> inside a German document; the German-speaking user could then access a French-German dictionary to find out what the marked section meant. If the user were listening to that document through a speech synthesis interface, this markup could be used to signal the synthesizer to appropriately apply French text-to-speech pronunciation rules to that span of text, instead of applying the inappropriate German rules.

o  For markup languages and document formats that allow the audience to be identified, a language tag could indicate the audience(s) appropriate for that document. For example, the same HTML document described in the preceding bullet might have an HTTP header "Content-Language: de" to indicate that the intended audience for the file is German (even though three words appear and are identified as being in French within it).

o  For systems and APIs, language tags form the basis for most implementations of locale identifiers. For example, see Unicode's CLDR (Common Locale Data Repository) (see UTS #35 [UTS35]) project.

Language tags are related when they contain a similar sequence of subtags. For example, if a language tag B contains language tag A as a prefix, then B is typically "narrower" or "more specific" than A. Thus, "zh-Hant-TW" is more specific than "zh-Hant".

This relationship is not guaranteed in all cases: specifically, languages whose tags begin with the same sequence of subtags are NOT guaranteed to be mutually intelligible, although they might be. For example, the tag "az" shares a prefix with both "az-Latn" (Azerbaijani written using the Latin script) and "az-Cyrl" (Azerbaijani written using the Cyrillic script). A person fluent in one script might not be able to read the other, even though the linguistic content (e.g., what would be heard if both texts were read aloud) might be identical. Content tagged as "az" most probably is written in just one script and thus might not be intelligible to a reader familiar with the other script.

Similarly, not all subtags specify an actual distinction in language. For example, the tags "en-US" and "en-CA" mean, roughly, English with features generally thought to be characteristic of the United States and Canada, respectively. They do not imply that a significant dialectal boundary exists between any arbitrarily selected point in the United States and any arbitrarily selected point in Canada. Neither does a particular region subtag imply that linguistic distinctions do not exist within that region.
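As a non-normative illustration, the prefix relationship between tags described in this section can be sketched in Python. The function names 'split_subtags' and 'is_more_specific' are hypothetical and are not defined by this document. The sketch compares whole subtags case-insensitively and assumes that both inputs are well-formed tags; it says nothing about mutual intelligibility.

```python
def split_subtags(tag: str) -> list[str]:
    """Split a tag into lowercase subtags so comparison ignores case."""
    return tag.lower().split("-")


def is_more_specific(candidate: str, base: str) -> bool:
    """Return True if `base` is a proper prefix of `candidate`.

    The comparison is made subtag by subtag, not character by
    character, so "zha" is not treated as narrower than "zh".
    """
    cand = split_subtags(candidate)
    pref = split_subtags(base)
    return len(cand) > len(pref) and cand[: len(pref)] == pref
```

Under these assumptions, is_more_specific("zh-Hant-TW", "zh-Hant") holds, while "az-Latn" and "az-Cyrl" are unrelated in either direction even though both are narrower than "az".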
4.3. Lists of Languages
In some applications, a single content item might best be associated with more than one language tag. Examples of such a usage include:

o  Content items that contain multiple, distinct varieties. Often this is used to indicate an appropriate audience for a given content item when multiple choices might be appropriate. Examples of this could include:

   *  Metadata about the appropriate audience for a movie title. For example, a DVD might label its individual audio tracks 'de' (German), 'fr' (French), and 'es' (Spanish), but the overall title would list "de, fr, es" as its overall audience.

   *  A French/English, English/French dictionary tagged as both "en" and "fr" to specify that it applies equally to French and English.

   *  A side-by-side or interlinear translation of a document, as is commonly done with classical works in Latin or Greek.

o  Content items that contain a single language but that require multiple levels of specificity. For example, a library might wish to classify a particular work as both Norwegian ('no') and as Nynorsk ('nn') for audiences capable of appreciating the distinction or needing to select content more narrowly.

4.4. Length Considerations
There is no defined upper limit on the size of language tags. While historically most language tags have consisted of language and region subtags with a combined total length of up to six characters, larger tags have always been both possible and have actually appeared in use. Neither the language tag syntax nor other requirements in this document impose a fixed upper limit on the number of subtags in a language tag (and thus an upper bound on the size of a tag). The language tag syntax suggests that, depending on the specific language, more subtags (and thus a longer tag) are sometimes necessary to completely identify the language for certain applications; thus, it is possible to envision long or complex subtag sequences.
4.4.1. Working with Limited Buffer Sizes
Some applications and protocols are forced to allocate fixed buffer sizes or otherwise limit the length of a language tag. A conformant implementation or specification MAY refuse to support the storage of language tags that exceed a specified length. Any such limitation SHOULD be clearly documented, and such documentation SHOULD include what happens to longer tags (for example, whether an error value is generated or the language tag is truncated). A protocol that allows tags to be truncated at an arbitrary limit, without giving any indication of what that limit is, has the potential to cause harm by changing the meaning of tags in substantial ways.

In practice, most language tags do not require more than a few subtags and will not approach reasonably sized buffer limitations; see Section 4.1.

Some specifications or protocols have limits on tag length but do not have a fixed length limitation. For example, [RFC2231] has no explicit length limitation: the length available for the language tag is constrained by the length of other header components (such as the charset's name) coupled with the 76-character limit in [RFC2047]. Thus, the "limit" might be 50 or more characters, but it could potentially be quite small.

The considerations for assigning a buffer limit are:

   Implementations SHOULD NOT truncate language tags unless the meaning of the tag is purposefully being changed, or unless the tag does not fit into a limited buffer size specified by a protocol for storage or transmission.

   Implementations SHOULD warn the user when a tag is truncated since truncation changes the semantic meaning of the tag.

   Implementations of protocols or specifications that are space constrained but do not have a fixed limit SHOULD use the longest possible tag in preference to truncation.

   Protocols or specifications that specify limited buffer sizes for language tags MUST allow for language tags of at least 35 characters. Note that [RFC4646] recommended a minimum field size of 42 characters because it included all three elements of the 'extlang' production. Two of these are now permanently reserved, so a registered primary language subtag of the maximum length of 8 characters is now longer than the longest language-extlang combination. Protocols or specifications that commonly use extensions or private use subtags might wish to reserve or recommend a longer "minimum buffer" size.

The following illustration shows how the 35-character recommendation was derived:

   language      =  8 ; longest allowed registered value
                      ;   longer than primary+extlang
                      ;   which requires 7 characters
   script        =  5 ; if not suppressed: see Section 4.1
   region        =  4 ; UN M.49 numeric region code
                      ;   ISO 3166-1 codes require 3
   variant1      =  9 ; needs 'language' as a prefix
   variant2      =  9 ; very rare, as it needs
                      ;   'language-variant1' as a prefix

   total         = 35 characters

   Figure 7: Derivation of the Limit on Tag Length

4.4.2. Truncation of Language Tags
Truncation of a language tag alters the meaning of the tag, and thus SHOULD be avoided. However, truncation of language tags is sometimes necessary due to limited buffer sizes. Such truncation MUST NOT permit a subtag to be chopped off in the middle or the formation of invalid tags (for example, one ending with the "-" character).

This means that applications or protocols that truncate tags MUST do so by progressively removing subtags along with their preceding "-" from the right side of the language tag until the tag is short enough for the given buffer. If the resulting tag ends with a single-character subtag, that subtag and its preceding "-" MUST also be removed. For example:

   Tag to truncate: zh-Latn-CN-variant1-a-extend1-x-wadegile-private1
   1. zh-Latn-CN-variant1-a-extend1-x-wadegile
   2. zh-Latn-CN-variant1-a-extend1
   3. zh-Latn-CN-variant1
   4. zh-Latn-CN
   5. zh-Latn
   6. zh

   Figure 8: Example of Tag Truncation
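As a non-normative illustration, the truncation rule in this section can be sketched in Python. The function name 'truncate_tag', the 'max_len' parameter, and the choice to raise 'ValueError' when no valid truncation fits are assumptions of this sketch rather than requirements of this document. The sketch also assumes that the input is a well-formed tag.

```python
def truncate_tag(tag: str, max_len: int) -> str:
    """Shorten `tag` to at most `max_len` characters.

    Whole subtags are removed from the right, together with their
    preceding "-". A trailing single-character subtag is removed as
    well, so the result never ends with a singleton.
    """
    subtags = tag.split("-")
    while len("-".join(subtags)) > max_len and len(subtags) > 1:
        subtags.pop()
        # Drop a singleton left dangling at the end of the tag.
        while len(subtags) > 1 and len(subtags[-1]) == 1:
            subtags.pop()
    result = "-".join(subtags)
    # Assumption of this sketch: report failure instead of cutting
    # into a subtag or returning a lone singleton such as "x".
    if len(result) > max_len or len(result) == 1:
        raise ValueError(f"cannot truncate {tag!r} to {max_len} characters")
    return result
```

Applied to the tag in Figure 8, successively smaller values of 'max_len' (40, 29, 19, 10, 7, and 2) yield the six truncations listed in that figure. Because truncation changes the meaning of the tag, a caller would normally also warn the user, as recommended in Section 4.4.1.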
4.5. Canonicalization of Language Tags
Since a particular language tag can be used by many processes, language tags SHOULD always be created or generated in canonical form.

A language tag is in 'canonical form' when the tag is well-formed according to the rules in Sections 2.1 and 2.2 and it has been canonicalized by applying each of the following steps in order, using data from the IANA registry (see Section 3.1):

1.  Extension sequences are ordered into case-insensitive ASCII order by singleton subtag.

    *  For example, the subtag sequence '-a-babble' comes before '-b-warble'.

2.  Redundant or grandfathered tags are replaced by their 'Preferred-Value', if there is one.

    *  The field-body of the 'Preferred-Value' for grandfathered and redundant tags is an "extended language range" [RFC4647] and might consist of more than one subtag.

    *  'Preferred-Value' fields in the registry provide mappings from deprecated tags to modern equivalents. Many of these were created before the adoption of this document (such as the mapping of "no-nyn" to "nn" or "i-klingon" to "tlh"). Others are the result of later registrations or additions to the registry as permitted or required by this document (for example, "zh-hakka" was deprecated in favor of the ISO 639-3 code 'hak' when this document was adopted).

3.  Subtags are replaced by their 'Preferred-Value', if there is one. For extlangs, the original primary language subtag is also replaced if there is a primary language subtag in the 'Preferred-Value'.

    *  The field-body of the 'Preferred-Value' for extlangs is an "extended language range" and typically maps to a primary language subtag. For example, the subtag sequence "zh-hak" (Chinese, Hakka) is replaced with the subtag 'hak' (Hakka).

    *  Most of the non-extlang subtags are either Region subtags where the country name or designation has changed or clerical corrections to ISO 639-1.
The canonical form contains no 'extlang' subtags. There is an alternate 'extlang form' that maintains or reinstates extlang subtags. This form can be useful in environments where the presence of the 'Prefix' subtag is considered beneficial in matching or selection (see Section 4.1.2).

A language tag is in 'extlang form' when the tag is well-formed according to the rules in Sections 2.1 and 2.2 and it has been processed by applying each of the following two steps in order, using data from the IANA registry:

1.  The language tag is first transformed into canonical form, as described above.

2.  If the language tag starts with a primary language subtag that is also an extlang subtag, then the language tag is prepended with the extlang's 'Prefix'.

    *  For example, "hak-CN" (Hakka, China) has the primary language subtag 'hak', which in turn has an 'extlang' record with a 'Prefix' 'zh' (Chinese). The extlang form is "zh-hak-CN" (Chinese, Hakka, China).

    *  Note that Step 2 (prepending a prefix) can restore a subtag that was removed by Step 1 (canonicalizing).

Example: The language tag "en-a-aaa-b-ccc-bbb-x-xyz" is in canonical form, while "en-b-ccc-bbb-a-aaa-X-xyz" is well-formed and potentially valid (extensions 'a' and 'b' are not defined as of the publication of this document) but not in canonical form (the extensions are not in alphabetical order).

Example: Although the tag "en-BU" (English as used in Burma) maintains its validity, the language tag "en-BU" is not in canonical form because the 'BU' subtag has a canonical mapping to 'MM' (Myanmar).

Canonicalization of language tags does not imply anything about the use of upper- or lowercase letters when processing or comparing subtags (as described in Section 2.1). All comparisons MUST be performed in a case-insensitive manner.
When performing canonicalization of language tags, processors MAY regularize the case of the subtags (that is, this process is OPTIONAL), following the case used in the registry (see Section 2.1.1).
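As a non-normative illustration, the two parts of canonicalization that need no registry data can be sketched in Python: ordering extension sequences by singleton (Step 1 of the canonical form) and the OPTIONAL regularization of case. The function names 'order_extensions' and 'regularize_case' are hypothetical. The sketch assumes a well-formed, non-grandfathered tag, and it assumes the case conventions of Section 2.1.1 are: lowercase everywhere, except that two-letter subtags are uppercase and four-letter subtags are titlecase when they are neither at the start of the tag nor after a singleton. Treating "after a singleton" as "anywhere after the first singleton" is a further assumption. Replacement by 'Preferred-Value' (Steps 2 and 3) requires the IANA registry and is not shown.

```python
def order_extensions(tag: str) -> str:
    """Order extension sequences by singleton, case-insensitively."""
    parts = tag.split("-")
    if parts[0].lower() == "x":
        return tag  # wholly private use: nothing to reorder

    # Subtags before the first singleton (language, script, region,
    # variants) keep their positions.
    i = 1
    while i < len(parts) and len(parts[i]) != 1:
        i += 1
    head = parts[:i]

    # Collect each extension: a singleton other than 'x' followed by
    # the subtags that belong to it.
    extensions = []
    while i < len(parts) and parts[i].lower() != "x":
        start = i
        i += 1
        while i < len(parts) and len(parts[i]) != 1:
            i += 1
        extensions.append(parts[start:i])

    # Whatever remains is the private use sequence; it stays last.
    private_use = parts[i:]

    extensions.sort(key=lambda seq: seq[0].lower())
    ordered = [subtag for seq in extensions for subtag in seq]
    return "-".join(head + ordered + private_use)


def regularize_case(tag: str) -> str:
    """Apply the assumed Section 2.1.1 case conventions to a tag."""
    result = []
    seen_singleton = False
    for position, subtag in enumerate(tag.split("-")):
        subtag = subtag.lower()
        if position > 0 and not seen_singleton:
            if len(subtag) == 2:
                subtag = subtag.upper()        # region, e.g. 'TW'
            elif len(subtag) == 4:
                subtag = subtag.capitalize()   # script, e.g. 'Hant'
        if len(subtag) == 1:
            seen_singleton = True
        result.append(subtag)
    return "-".join(result)
```

Applying both functions to "en-b-ccc-bbb-a-aaa-X-xyz" produces "en-a-aaa-b-ccc-bbb-x-xyz", the canonical form given in the example above. Since case is not significant, a processor that skips 'regularize_case' still produces a tag in canonical form.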
If more than one variant appears within a tag, processors MAY reorder the variants to obtain better matching behavior or more consistent presentation. Reordering of the variants SHOULD follow the recommendations for variant ordering in Section 4.1.

If the field 'Deprecated' appears in a registry record without an accompanying 'Preferred-Value' field, then that tag or subtag is deprecated without a replacement. These values are canonical when they appear in a language tag. However, tags that include these values SHOULD NOT be selected by users or generated by implementations.

An extension MUST define any relationships that exist between the various subtags in the extension and thus MAY define an alternate canonicalization scheme for the extension's subtags. Extensions MAY define how the order of the extension's subtags is interpreted. For example, an extension could define that its subtags are in canonical order when the subtags are placed into ASCII order: that is, "en-a-aaa-bbb-ccc" instead of "en-a-ccc-bbb-aaa". Another extension might define that the order of the subtags influences their semantic meaning (so that "en-b-ccc-bbb-aaa" has a different value from "en-b-aaa-bbb-ccc"). However, extension specifications SHOULD be designed so that they are tolerant of the typical processes described in Section 3.7.

4.6. Considerations for Private Use Subtags
Private use subtags, like all other subtags, MUST conform to the format and content constraints in the ABNF. Private use subtags have no meaning outside the private agreement between the parties that intend to use or exchange language tags that employ them. The same subtags MAY be used with a different meaning under a separate private agreement. They SHOULD NOT be used where alternatives exist and SHOULD NOT be used in content or protocols intended for general use. Private use subtags are simply useless for information exchange without prior arrangement. The value and semantic meaning of private use tags and of the subtags used within such a language tag are not defined by this document. Private use sequences introduced by the 'x' singleton are completely opaque to users or implementations outside of the private use agreement. So, in addition to private use subtag sequences introduced by the singleton subtag 'x', the Language Subtag Registry provides private use language, script, and region subtags derived from the private use codes assigned by the underlying standards. These subtags are valid for use in forming language tags; they are RECOMMENDED over the 'x' singleton private use subtag sequences
because they convey more information via their linkage to the language tag's inherent structure.

For example, the region subtags 'AA', 'ZZ', and those in the ranges 'QM'-'QZ' and 'XA'-'XZ' (derived from the ISO 3166-1 private use codes) can be used to form a language tag. A tag such as "zh-Hans-XQ" conveys a great deal of public, interchangeable information about the language material (that it is Chinese in the simplified Chinese script and is suitable for some geographic region 'XQ'). While the precise geographic region is not known outside of private agreement, the tag conveys far more information than an opaque tag such as "x-somelang" or even "zh-Hans-x-xq" (where the 'xq' subtag's meaning is entirely opaque). However, in some cases content tagged with private use subtags can interact with other systems in a different and possibly unsuitable manner compared to tags that use opaque, privately defined subtags, so the choice of the best approach sometimes depends on the particular domain in question.

5. IANA Considerations
This section deals with the processes and requirements necessary for IANA to maintain the subtag and extension registries as defined by this document and in accordance with the requirements of [RFC5226].

The impact on the IANA maintainers of the two registries defined by this document will be a small increase in the frequency of new entries or updates. IANA also is required to create a new mailing list (described below in Section 5.1) to announce registry changes and updates.

5.1. Language Subtag Registry
IANA updated the registry using instructions and content provided in a companion document [RFC5645]. The criteria and process for selecting the updated set of records are described in that document. The updated set of records represents no impact on IANA, since the work to create it will be performed externally.

Future work on the Language Subtag Registry includes the following activities:

o  Inserting or replacing whole records. These records are preformatted for IANA by the Language Subtag Reviewer, as described in Section 3.3.

o  Archiving and making publicly available the registration forms.

o  Announcing each updated version of the registry on the "ietf-languages-announcements@iana.org" mailing list.

Each registration form sent to IANA contains a single record for incorporation into the registry. The form will be sent to <iana@iana.org> by the Language Subtag Reviewer. It will have a subject line indicating whether the enclosed form represents an insertion of a new record (indicated by the word "INSERT" in the subject line) or a replacement of an existing record (indicated by the word "MODIFY" in the subject line). At no time can a record be deleted from the registry.

IANA will extract the record from the form and place the inserted or modified record into the appropriate section of the Language Subtag Registry, grouping the records by their 'Type' field. Inserted records can be placed anywhere within the appropriate section; there is no guarantee that the registry's records will be placed in any particular order except that they will always be grouped by 'Type'. Modified records overwrite the record they replace.

Whenever an entry is created or modified in the registry, the 'File-Date' record at the start of the registry is updated to reflect the most recent modification date. The date format SHALL be the "full-date" format of [RFC3339]. The date SHALL be the date on which that version of the registry was first published by IANA. There SHALL be at most one version of the registry published in a day. A 'File-Date' record is also included in each request to IANA to insert or modify records, indicating the acceptance date of the records in the request.

The updated registry file MUST use the UTF-8 character encoding, and IANA MUST check the registry file for proper encoding. Non-ASCII characters can be sent to IANA by attaching the registration form to the email message or by using various encodings in the mail message body (UTF-8 is recommended). IANA will verify any unclear or corrupted characters with the Language Subtag Reviewer prior to posting the updated registry.

IANA will also archive and make publicly available from http://www.iana.org each registration form. Note that multiple registrations can pertain to the same record in the registry.

Developers who are dependent upon the Language Subtag Registry sometimes would like to be informed of changes in the registry so that they can update their implementations. When any change is made to the Language Subtag Registry, IANA will send an announcement message to <ietf-languages-announcements@iana.org> (a self-subscribing list to which only IANA can post).
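As a non-normative illustration, a consumer of the registry might read the 'File-Date' record as follows. The function name 'parse_file_date' is hypothetical, and the sketch assumes that the record appears as a single "File-Date: YYYY-MM-DD" line, where the value is in the "full-date" format of [RFC3339].

```python
import re
from datetime import date

# RFC 3339 "full-date": four-digit year, two-digit month, two-digit day.
_FULL_DATE = re.compile(r"([0-9]{4})-([0-9]{2})-([0-9]{2})")


def parse_file_date(line: str) -> date:
    """Parse a 'File-Date' record line and return its date.

    Raises ValueError if the line is not a File-Date record, if the
    value is not in full-date form, or if it is not a real calendar
    date.
    """
    name, separator, value = line.partition(":")
    if not separator or name.strip() != "File-Date":
        raise ValueError(f"not a File-Date record: {line!r}")
    match = _FULL_DATE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"not an RFC 3339 full-date: {value.strip()!r}")
    year, month, day = (int(group) for group in match.groups())
    return date(year, month, day)  # rejects dates such as 2009-02-30
```

Because at most one version of the registry is published in a day, comparing the parsed date with the date of a locally cached copy is sufficient to detect that a newer version exists.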
5.2. Extensions Registry
The Language Tag Extensions Registry can contain at most 35 records, and thus changes to this registry are expected to be very infrequent. Future work by IANA on the Language Tag Extensions Registry is limited to two cases. First, the IESG MAY request that new records be inserted into this registry from time to time. These requests MUST include the record to insert in the exact format described in Section 3.7. In addition, there MAY be occasional requests from the maintaining authority for a specific extension to update the contact information or URLs in the record. These requests MUST include the complete, updated record. IANA is not responsible for validating the information provided, only that it is properly formatted. IANA SHOULD take reasonable steps to ascertain that the request comes from the maintaining authority named in the record present in the registry.