RFC 5892

The Unicode Code Points and Internationalized Domain Names for Applications (IDNA)

Pages: 70
Proposed Standard
Updated by: 8753

Internet Engineering Task Force (IETF)                 P. Faltstrom, Ed.
Request for Comments: 5892                                         Cisco
Category: Standards Track                                    August 2010
ISSN: 2070-1721


                      The Unicode Code Points and
         Internationalized Domain Names for Applications (IDNA)

Abstract

   This document specifies rules for deciding whether a code point,
   considered in isolation or in context, is a candidate for inclusion
   in an Internationalized Domain Name (IDN).

   It is part of the specification of Internationalizing Domain Names in
   Applications 2008 (IDNA2008).

Status of This Memo

   This is an Internet Standards Track document.

   This document is a product of the Internet Engineering Task Force
   (IETF).  It represents the consensus of the IETF community.  It has
   received public review and has been approved for publication by the
   Internet Engineering Steering Group (IESG).  Further information on
   Internet Standards is available in Section 2 of RFC 5741.

   Information about the current status of this document, any errata,
   and how to provide feedback on it may be obtained at
   http://www.rfc-editor.org/info/rfc5892.

Copyright Notice

   Copyright (c) 2010 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with respect
   to this document.  Code Components extracted from this document must
   include Simplified BSD License text as described in Section 4.e of
   the Trust Legal Provisions and are provided without warranty as
   described in the Simplified BSD License.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  3
   2.  Category Definitions Used to Calculate Derived Property
       Value  . . . . . . . . . . . . . . . . . . . . . . . . . . . .  5
     2.1.  LetterDigits (A) . . . . . . . . . . . . . . . . . . . . .  5
     2.2.  Unstable (B) . . . . . . . . . . . . . . . . . . . . . . .  6
     2.3.  IgnorableProperties (C)  . . . . . . . . . . . . . . . . .  6
     2.4.  IgnorableBlocks (D)  . . . . . . . . . . . . . . . . . . .  7
     2.5.  LDH (E)  . . . . . . . . . . . . . . . . . . . . . . . . .  7
     2.6.  Exceptions (F) . . . . . . . . . . . . . . . . . . . . . .  7
     2.7.  BackwardCompatible (G) . . . . . . . . . . . . . . . . . .  9
     2.8.  JoinControl (H)  . . . . . . . . . . . . . . . . . . . . .  9
     2.9.  OldHangulJamo (I)  . . . . . . . . . . . . . . . . . . . .  9
     2.10. Unassigned (J) . . . . . . . . . . . . . . . . . . . . . .  9
   3.  Calculation of the Derived Property  . . . . . . . . . . . . . 10
   4.  Code Points  . . . . . . . . . . . . . . . . . . . . . . . . . 10
   5.  IANA Considerations  . . . . . . . . . . . . . . . . . . . . . 11
     5.1.  IDNA-Derived Property Value Registry . . . . . . . . . . . 11
     5.2.  IDNA Context Registry  . . . . . . . . . . . . . . . . . . 11
       5.2.1.  Template for Context Registry  . . . . . . . . . . . . 11
   6.  Security Considerations  . . . . . . . . . . . . . . . . . . . 12
   7.  Acknowledgements . . . . . . . . . . . . . . . . . . . . . . . 12
   Appendix A.   Contextual Rules Registry  . . . . . . . . . . . . . 13
   Appendix A.1. ZERO WIDTH NON-JOINER  . . . . . . . . . . . . . . . 15
   Appendix A.2. ZERO WIDTH JOINER  . . . . . . . . . . . . . . . . . 16
   Appendix A.3. MIDDLE DOT . . . . . . . . . . . . . . . . . . . . . 16
   Appendix A.4. GREEK LOWER NUMERAL SIGN (KERAIA)  . . . . . . . . . 17
   Appendix A.5. HEBREW PUNCTUATION GERESH  . . . . . . . . . . . . . 17
   Appendix A.6. HEBREW PUNCTUATION GERSHAYIM . . . . . . . . . . . . 18
   Appendix A.7. KATAKANA MIDDLE DOT  . . . . . . . . . . . . . . . . 18
   Appendix A.8. ARABIC-INDIC DIGITS  . . . . . . . . . . . . . . . . 19
   Appendix A.9. EXTENDED ARABIC-INDIC DIGITS . . . . . . . . . . . . 19
   Appendix B.   Code Points 0x0000 - 0x10FFFF  . . . . . . . . . . . 20
   Appendix B.1. Code Points in Unicode Character Database (UCD)
                 Format . . . . . . . . . . . . . . . . . . . . . . . 20
   8.  References . . . . . . . . . . . . . . . . . . . . . . . . . . 69
     8.1.  Normative References . . . . . . . . . . . . . . . . . . . 69
     8.2.  Informative References . . . . . . . . . . . . . . . . . . 69

1.  Introduction

   RFC 4690 [RFC4690] suggests an inclusion-based approach for selecting
   the code points from The Unicode Standard [Unicode52] that should be
   included in the list of code points that may be used in
   Internationalized Domain Names.

   Specifically, RFC 4690 [RFC4690] says the following:

      The IAB has concluded that there is a consensus within the broader
      community that lists of code points should be specified by the use
      of an inclusion-based mechanism (i.e., identifying the characters
      that are permitted), rather than by excluding a small number of
      characters from the total Unicode set as Stringprep [RFC3454] and
      Nameprep [RFC3491] do today.  That conclusion should be reviewed
      by the IETF community and action taken as appropriate.

   This document reviews and classifies the collections of code points
   in the Unicode character set by examining various properties of the
   code points.  It then defines an algorithm for determining a derived
   property value.  It specifies a procedure, and not a table, of code
   points so that the algorithm can be used to determine code point sets
   independent of the version of Unicode that is in use.

   This document is not intended to specify precisely how these property
   values are to be applied in IDN labels.  That information appears in
   the Protocol document [RFC5891], but it is important to understand
   that the assignment of a value of this property to a particular
   character is not sufficient to determine whether it can be used in a
   given label.  In particular, some combinations of allowed code points
   are not advisable for use in IDNs due to rules specific to a script
   or class of characters.  The requirement for such rules is linked to
   the operations in the Protocol document and especially to the
   characters designated as requiring contextual rules.

   The value of the property is to be interpreted as follows.

   o  PROTOCOL VALID: Those that are allowed to be used in IDNs.  Code
      points with this property value are permitted for general use in
      IDNs.  However, the fact that a label consists only of code
      points with this property value does not imply that the label can
      be used in DNS.  See the Protocol document for algorithms to make
      decisions about labels in domain names.  The abbreviated term
      PVALID is used to refer to this value in the rest of this
      document.

   o  CONTEXTUAL RULE REQUIRED: Some characteristics of the character,
      such as it being invisible in certain contexts or problematic in
      others, require that it not be used in labels unless specific
      other characters or properties are present.  The abbreviated term
      CONTEXT is used to refer to this value in the rest of this
      document.  There are two subdivisions of CONTEXTUAL RULE REQUIRED:
      one for Join_Control characters (called CONTEXTJ) and one for
      other characters (called CONTEXTO).  These are discussed in more
      detail below and in the Protocol document.

   o  DISALLOWED: Those that should clearly not be included in IDNs.
      Code points with this property value are not permitted in IDNs.

   o  UNASSIGNED: Those code points that are not designated (i.e., are
      unassigned) in the Unicode Standard.

   The mechanisms described here allow determination of the value of the
   property for future versions of Unicode (including characters added
   after Unicode 5.2).  Changes in Unicode properties that do not affect
   the outcome of this process do not affect IDN.  For example, a
   character can have its Unicode General_Category value (see
   [Unicode52]) change from So to Sm or from Lo to Ll, without affecting
   the algorithm results.  Moreover, even if such a change did alter
   the results, the BackwardCompatible list (Section 2.7) can be
   adjusted to ensure the stability of the results.

   Some code points need to be allowed in exceptional circumstances but
   should be excluded in all other cases; these rules are also described
   in other documents.  The most notable of these are the Join Control
   characters, U+200D ZERO WIDTH JOINER and U+200C ZERO WIDTH
   NON-JOINER.  Both of them have the derived property value CONTEXTJ.
   A character with the derived property value CONTEXTJ or CONTEXTO
   (CONTEXTUAL RULE REQUIRED) is not to be used unless an appropriate
   rule has been established and the context of the character is
   consistent with that rule.  It is invalid to either register a string
   containing these characters or even to look one up unless such a
   contextual rule is found and satisfied.  Please see Appendix A, "The
   Contextual Rules Registry", for more information.

   This document is part of a series that, together, constitute a
   proposal for updating the IDNA standards to resolve issues uncovered
   in recent years, cover a broader range of scripts, and provide for
   migration to newer versions of Unicode.  See the Rationale document
   [RFC5894] for a broader discussion.

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [RFC2119].

2.  Category Definitions Used to Calculate Derived Property Value

   The derived property obtains its value through a two-step procedure.
   First, characters are placed in one or more character categories,
   either on the basis of core properties defined by the Unicode
   Standard or by treating the code point as an exception and
   addressing it by its code point value.  These categories are not
   mutually exclusive.

   In the second step, set operations are used with these categories to
   determine the values for an IDN-specific property.  Those operations
   are specified in Section 3.

   Unicode property names and property value names may have short
   abbreviations, such as gc for the General_Category property, and Ll
   for the Lowercase_Letter property value of the gc property.

   In the following specification of categories, the operation that
   returns the value of a particular Unicode character property for a
   code point is designated by using the formal name of that property
   (from PropertyAliases.txt) followed by '(cp)'.  For example, the
   value of the General_Category property for a code point is indicated
   by General_Category(cp).

2.1.  LetterDigits (A)

   A: General_Category(cp) is in {Ll, Lu, Lo, Nd, Lm, Mn, Mc}

   These rules identify characters commonly used in mnemonics and often
   informally described as "language characters".  In general, only code
   points assigned to this category are suitable for use in IDN.

   For more information, see Section 4.5 of The Unicode Standard
   [Unicode].

   The categories used in this rule are:

   o  Ll - Lowercase_Letter

   o  Lu - Uppercase_Letter

   o  Lo - Other_Letter

   o  Nd - Decimal_Number

   o  Lm - Modifier_Letter

   o  Mn - Nonspacing_Mark

   o  Mc - Spacing_Mark
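
   As an illustration only, membership in category A might be tested as
   in the following Python sketch.  It assumes the standard unicodedata
   module, whose data reflects the Unicode version bundled with the
   interpreter (not necessarily 5.2).  The names
   LETTER_DIGIT_CATEGORIES and is_letter_digit are hypothetical and do
   not appear in this specification.

```python
import unicodedata

# General_Category values that make up category A (Section 2.1).
LETTER_DIGIT_CATEGORIES = frozenset({"Ll", "Lu", "Lo", "Nd", "Lm", "Mn", "Mc"})


def is_letter_digit(cp: int) -> bool:
    """Return True if the code point falls in LetterDigits (A).

    Assumption: unicodedata.category() stands in for General_Category(cp),
    using whatever Unicode version this Python build ships with.
    """
    return unicodedata.category(chr(cp)) in LETTER_DIGIT_CATEGORIES
```

   Note that membership in A alone does not make a code point PVALID;
   U+0041 (Lu) is in A but is excluded later by the Unstable rule.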

2.2.  Unstable (B)

   B: toNFKC(toCaseFold(toNFKC(cp))) != cp

   This category is used to group the characters that are not stable
   under Normalization Form KC (NFKC) and case folding.  In general,
   these code points are not suitable for use for IDN.

   The toCaseFold() operation is defined in Section 3.13 of The Unicode
   Standard [Unicode].

   The toNFKC() operation returns the code point in normalization form
   KC.  For more information, see Section 5 of Unicode Standard Annex
   #15 [TR15].

   Note that this category is defined in terms of NFKC, even though the
   "IDNA Protocol" document [RFC5891] uses Normalization Form C (NFC).
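
   The stability test in rule B can be sketched in Python as follows.
   This is a minimal illustration, assuming that str.casefold() is an
   acceptable stand-in for toCaseFold() and that
   unicodedata.normalize("NFKC", ...) stands in for toNFKC(); both use
   the Unicode version bundled with the interpreter.  The function name
   is_unstable is hypothetical.

```python
import unicodedata


def is_unstable(cp: int) -> bool:
    """Return True if toNFKC(toCaseFold(toNFKC(cp))) != cp (category B).

    Folding and normalization can expand one code point into several,
    so the comparison is done on strings rather than on integers.
    """
    original = chr(cp)
    step1 = unicodedata.normalize("NFKC", original)  # inner toNFKC
    step2 = step1.casefold()                         # toCaseFold (assumed equivalent)
    step3 = unicodedata.normalize("NFKC", step2)     # outer toNFKC
    return step3 != original
```

   Under these assumptions, uppercase letters, compatibility ligatures,
   and U+00DF and U+03C2 all test as unstable; the last two are
   nevertheless PVALID because the Exceptions category (Section 2.6) is
   consulted first.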

2.3.  IgnorableProperties (C)

   C: Default_Ignorable_Code_Point(cp) = True or
      White_Space(cp) = True or
      Noncharacter_Code_Point(cp) = True

   This category is used to group code points that are not recommended
   for use in identifiers.  In general, these code points are not
   suitable for use in an IDN.

   The definition for Default_Ignorable_Code_Point can be found in
   DerivedCoreProperties.txt [DerivedCoreProperties] and is at the time
   of Unicode 5.2:

   Other_Default_Ignorable_Code_Point + Cf (Format characters)
   + Variation_Selector - White_Space - FFF9..FFFB (Annotation
   Characters) - 0600..0603, 06DD, 070F (exceptional Cf characters
   that should be visible)

2.4.  IgnorableBlocks (D)

   D: Block(cp) is in {Combining Diacritical Marks for Symbols,
                       Musical Symbols, Ancient Greek Musical Notation}

   This category is used to identify code points that are not useful in
   mnemonics or that are otherwise impractical for IDN use.  In general,
   these code points are not suitable for use for IDN.

   The definition of blocks can be found in Blocks.txt [BlockNames].
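
   Python's standard library does not expose the Block property, so a
   sketch of rule D has to carry the block ranges itself.  The ranges
   below are an assumption transcribed from Blocks.txt and should be
   checked against the Unicode version in use; the names
   IGNORABLE_BLOCKS and in_ignorable_block are hypothetical.

```python
# (first, last) code points of the three blocks named in rule D.
# Assumption: ranges as listed in Blocks.txt; verify before relying on them.
IGNORABLE_BLOCKS = (
    (0x20D0, 0x20FF),    # Combining Diacritical Marks for Symbols
    (0x1D100, 0x1D1FF),  # Musical Symbols
    (0x1D200, 0x1D24F),  # Ancient Greek Musical Notation
)


def in_ignorable_block(cp: int) -> bool:
    """Return True if the code point lies in one of the IgnorableBlocks (D)."""
    return any(first <= cp <= last for first, last in IGNORABLE_BLOCKS)
```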

2.5.  LDH (E)

   E: cp is in {002D, 0030..0039, 0061..007A}

   This category is used in the second step to preserve the traditional
   "hostname" (LDH -- as described in the Definitions document
   [RFC5890]) characters ('-', 0-9, and a-z).  In general, these code
   points are suitable for use for IDN.  Note that there are other rules
   regarding the code point U+002D HYPHEN-MINUS that are specified in
   the IDNA Protocol Specification [RFC5891].
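
   Category E is small enough to enumerate directly.  The following
   Python sketch builds the set from the ranges in rule E; the names
   LDH and is_ldh are illustrative only.

```python
# Code points listed in rule E: 002D, 0030..0039, 0061..007A.
LDH = frozenset(
    {0x002D}
    | set(range(0x0030, 0x0039 + 1))   # digits 0-9
    | set(range(0x0061, 0x007A + 1))   # lowercase a-z
)


def is_ldh(cp: int) -> bool:
    """Return True if the code point is in the LDH category (E)."""
    return cp in LDH
```

   Uppercase A-Z are deliberately absent from this set.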

2.6.  Exceptions (F)

   F: cp is in {00B7, 00DF, 0375, 03C2, 05F3, 05F4, 0640, 0660,
                0661, 0662, 0663, 0664, 0665, 0666, 0667, 0668,
                0669, 06F0, 06F1, 06F2, 06F3, 06F4, 06F5, 06F6,
                06F7, 06F8, 06F9, 06FD, 06FE, 07FA, 0F0B, 3007,
                302E, 302F, 3031, 3032, 3033, 3034, 3035, 303B,
                30FB}

   This category explicitly lists code points for which the category
   cannot be assigned using only the core property values that exist in
   the Unicode standard.  The values are according to the table below:

 PVALID -- Would otherwise have been DISALLOWED

 00DF; PVALID     # LATIN SMALL LETTER SHARP S
 03C2; PVALID     # GREEK SMALL LETTER FINAL SIGMA
 06FD; PVALID     # ARABIC SIGN SINDHI AMPERSAND
 06FE; PVALID     # ARABIC SIGN SINDHI POSTPOSITION MEN
 0F0B; PVALID     # TIBETAN MARK INTERSYLLABIC TSHEG
 3007; PVALID     # IDEOGRAPHIC NUMBER ZERO

 CONTEXTO -- Would otherwise have been DISALLOWED

 00B7; CONTEXTO   # MIDDLE DOT
 0375; CONTEXTO   # GREEK LOWER NUMERAL SIGN (KERAIA)
 05F3; CONTEXTO   # HEBREW PUNCTUATION GERESH
 05F4; CONTEXTO   # HEBREW PUNCTUATION GERSHAYIM
 30FB; CONTEXTO   # KATAKANA MIDDLE DOT

 CONTEXTO -- Would otherwise have been PVALID

 0660; CONTEXTO   # ARABIC-INDIC DIGIT ZERO
 0661; CONTEXTO   # ARABIC-INDIC DIGIT ONE
 0662; CONTEXTO   # ARABIC-INDIC DIGIT TWO
 0663; CONTEXTO   # ARABIC-INDIC DIGIT THREE
 0664; CONTEXTO   # ARABIC-INDIC DIGIT FOUR
 0665; CONTEXTO   # ARABIC-INDIC DIGIT FIVE
 0666; CONTEXTO   # ARABIC-INDIC DIGIT SIX
 0667; CONTEXTO   # ARABIC-INDIC DIGIT SEVEN
 0668; CONTEXTO   # ARABIC-INDIC DIGIT EIGHT
 0669; CONTEXTO   # ARABIC-INDIC DIGIT NINE
 06F0; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT ZERO
 06F1; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT ONE
 06F2; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT TWO
 06F3; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT THREE
 06F4; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT FOUR
 06F5; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT FIVE
 06F6; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT SIX
 06F7; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT SEVEN
 06F8; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT EIGHT
 06F9; CONTEXTO   # EXTENDED ARABIC-INDIC DIGIT NINE

 DISALLOWED -- Would otherwise have been PVALID

 0640; DISALLOWED # ARABIC TATWEEL
 07FA; DISALLOWED # NKO LAJANYALAN
 302E; DISALLOWED # HANGUL SINGLE DOT TONE MARK
 302F; DISALLOWED # HANGUL DOUBLE DOT TONE MARK
 3031; DISALLOWED # VERTICAL KANA REPEAT MARK
 3032; DISALLOWED # VERTICAL KANA REPEAT WITH VOICED SOUND MARK
 3033; DISALLOWED # VERTICAL KANA REPEAT MARK UPPER HALF
 3034; DISALLOWED # VERTICAL KANA REPEAT WITH VOICED SOUND MARK UPPER HA
 3035; DISALLOWED # VERTICAL KANA REPEAT MARK LOWER HALF
 303B; DISALLOWED # VERTICAL IDEOGRAPHIC ITERATION MARK
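
   One possible in-memory form of the table above is a mapping from
   code point to derived property value, as in this Python sketch.  The
   name EXCEPTIONS is hypothetical; the code points and values are
   copied from the table in this section.

```python
EXCEPTIONS: dict[int, str] = {}

# PVALID -- would otherwise have been DISALLOWED
for cp in (0x00DF, 0x03C2, 0x06FD, 0x06FE, 0x0F0B, 0x3007):
    EXCEPTIONS[cp] = "PVALID"

# CONTEXTO -- would otherwise have been DISALLOWED
for cp in (0x00B7, 0x0375, 0x05F3, 0x05F4, 0x30FB):
    EXCEPTIONS[cp] = "CONTEXTO"

# CONTEXTO -- would otherwise have been PVALID
# (ARABIC-INDIC and EXTENDED ARABIC-INDIC digits)
for cp in (*range(0x0660, 0x0669 + 1), *range(0x06F0, 0x06F9 + 1)):
    EXCEPTIONS[cp] = "CONTEXTO"

# DISALLOWED -- would otherwise have been PVALID
for cp in (0x0640, 0x07FA, 0x302E, 0x302F, 0x3031,
           0x3032, 0x3033, 0x3034, 0x3035, 0x303B):
    EXCEPTIONS[cp] = "DISALLOWED"
```

   The key set of this mapping corresponds to category F, and a lookup
   such as EXCEPTIONS[cp] corresponds to the value the table assigns.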

2.7.  BackwardCompatible (G)

   G: cp is in {}

   This category includes the code points whose property values have
   changed in versions of Unicode after 5.2 in such a way that the
   derived property value would no longer be PVALID or DISALLOWED.  If
   future versions of Unicode change properties so that the derived
   value of a code point would move away from PVALID or DISALLOWED, this
   table can be updated with special exception values so that the
   derived property values for those code points stay stable.

2.8.  JoinControl (H)

   H: Join_Control(cp) = True

   This category consists of Join Control characters (i.e., they are not
   in LetterDigits (Section 2.1) but are still required in IDN labels
   under some circumstances).

2.9.  OldHangulJamo (I)

   I: Hangul_Syllable_Type(cp) is in {L, V, T}

   This category consists of all conjoining Hangul Jamo (Leading Jamo,
   Vowel Jamo, and Trailing Jamo).

   Elimination of conjoining Hangul Jamo from the set of PVALID
   characters results in restricting the set of Korean PVALID characters
   just to preformed, modern Hangul syllable characters.  Old Hangul
   syllables, which must be spelled with sequences of conjoining Hangul
   Jamo, are not PVALID for IDNs.

2.10.  Unassigned (J)

   J: General_Category(cp) is in {Cn} and
      Noncharacter_Code_Point(cp) = False

   This category consists of code points in the Unicode character set
   that are not (yet) assigned.  It should be noted that Unicode
   distinguishes between "unassigned code points" and "unassigned
   characters".  The unassigned code points are those in (Cn -
   Noncharacters), while the unassigned *characters* are those in (Cn +
   Cs).
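
   Rule J can be sketched in Python as shown below.  The sketch assumes
   that unicodedata.category() stands in for General_Category(cp) for
   the Unicode version bundled with the interpreter, and it computes
   Noncharacter_Code_Point arithmetically (U+FDD0..U+FDEF plus the last
   two code points of every plane).  The function names are
   hypothetical.

```python
import unicodedata


def is_noncharacter(cp: int) -> bool:
    """Noncharacter_Code_Point: U+FDD0..U+FDEF and U+nFFFE/U+nFFFF per plane."""
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def is_unassigned(cp: int) -> bool:
    """Category J: General_Category is Cn and the code point is not a noncharacter."""
    return unicodedata.category(chr(cp)) == "Cn" and not is_noncharacter(cp)
```

   Because the result depends on the Unicode version in the library, a
   code point that is unassigned under one version may be assigned
   under a later one.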

3.  Calculation of the Derived Property

   As described above (Section 1) and in more detail in the IDNA
   Protocol document [RFC5891], possible values of the IDN property are:

   o  PVALID

   o  CONTEXTJ

   o  CONTEXTO

   o  DISALLOWED

   o  UNASSIGNED

   The algorithm to calculate the value of the derived property is as
   follows.  If the name of a rule (such as Exceptions) is used, that
   implies the set of code points that the rule defines, while the same
   name as a function call (such as Exceptions(cp)) implies the value cp
   has in the Exceptions table.

   If .cp. .in.  Exceptions Then Exceptions(cp);
   Else If .cp. .in.  BackwardCompatible Then BackwardCompatible(cp);
   Else If .cp. .in.  Unassigned Then UNASSIGNED;
   Else If .cp. .in.  LDH Then PVALID;
   Else If .cp. .in.  JoinControl Then CONTEXTJ;
   Else If .cp. .in.  Unstable Then DISALLOWED;
   Else If .cp. .in.  IgnorableProperties Then DISALLOWED;
   Else If .cp. .in.  IgnorableBlocks Then DISALLOWED;
   Else If .cp. .in.  OldHangulJamo Then DISALLOWED;
   Else If .cp. .in.  LetterDigits Then PVALID;
   Else DISALLOWED;
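
   The ordering of the rules above matters, since the categories
   overlap.  The following Python sketch shows one way to express that
   ordering.  It is not a complete implementation: the category tests
   are supplied by the caller as predicates, and the names RULE_ORDER
   and derived_property are hypothetical.

```python
from typing import Callable, Mapping

# Set-membership tests in the order given in Section 3, after the
# Exceptions and BackwardCompatible lookups: (category name, result).
RULE_ORDER = (
    ("Unassigned", "UNASSIGNED"),
    ("LDH", "PVALID"),
    ("JoinControl", "CONTEXTJ"),
    ("Unstable", "DISALLOWED"),
    ("IgnorableProperties", "DISALLOWED"),
    ("IgnorableBlocks", "DISALLOWED"),
    ("OldHangulJamo", "DISALLOWED"),
    ("LetterDigits", "PVALID"),
)


def derived_property(
    cp: int,
    categories: Mapping[str, Callable[[int], bool]],
    exceptions: Mapping[int, str],
    backward_compatible: Mapping[int, str],
) -> str:
    """Return the derived property value for cp.

    categories maps each name in RULE_ORDER to a caller-supplied
    predicate; exceptions and backward_compatible map code points to
    the value assigned by their tables (Sections 2.6 and 2.7).
    """
    if cp in exceptions:
        return exceptions[cp]
    if cp in backward_compatible:
        return backward_compatible[cp]
    for name, value in RULE_ORDER:
        if categories[name](cp):
            return value  # first matching rule wins
    return "DISALLOWED"
```

   Because the first matching rule wins, a code point that is in both
   Unstable and LetterDigits (an uppercase letter, for example) comes
   out as DISALLOWED, while an entry in the Exceptions table overrides
   every later test.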

4.  Code Points

   The categories and rules defined in Sections 2 and 3 apply to all
   Unicode code points.  The table in Appendix B shows, for illustrative
   purposes, the consequences of the categories and classification
   rules, and the resulting property values.

   The list of code points that can be found in Appendix B is
   non-normative.  Sections 2 and 3 are normative.

5.  IANA Considerations

5.1.  IDNA-Derived Property Value Registry

   IANA has created a registry with the derived properties for the
   versions of Unicode released after (and including) version 5.2.  The
   derived property value is to be calculated in cooperation with a
   designated expert [RFC5226] according to the specifications in
   Sections 2 and 3 and not by copying the non-normative table found in
   Appendix B.

   If non-backward-compatible changes or other problems arise during the
   creation or designated expert review of the table of derived property
   values, they should be flagged for the IESG.  Changes to the rules
   (as specified in Sections 2 and 3), including BackwardCompatible
   (Section 2.7) (a set that is empty at the release of this document),
   require IETF Review, as described in RFC 5226 [RFC5226].

5.2.  IDNA Context Registry

   For characters that are defined in the IDNA derived property value
   registry (Section 5.1) as CONTEXTO or CONTEXTJ and that therefore
   require a contextual rule, IANA has created and now maintains a list
   of approved contextual rules.  Additions or changes to these rules
   require IETF Review, as described in [RFC5226].

   Appendix A contains further discussion and a table from which that
   registry can be initialized.

5.2.1.  Template for Context Registry

   The following information is to be given when a new rule is created.

      Name: Unique name of the rule

      Code point: Rule that should be applied when this code point
      exists in the label

      Overview: Description in plain English of what the rule verifies

      Lookup: Should the rule be applied at time of lookup?

      Rule Set: The set of rules, with a reference to the defining
      document.

6.  Security Considerations

   Security Considerations for this version of IDNA, except for the
   special issues associated with right-to-left scripts and characters,
   are described in the Definitions document [RFC5890].  Specific issues
   for labels containing characters associated with scripts written
   right to left appear in the Bidi document [RFC5893].

7.  Acknowledgements

   This document would not have been possible to produce without input
   from many people.  The main contributors are (in alphabetical order)
   Harald Alvestrand, Vint Cerf, Tina Dam, Mark Davis, Gihan Dias,
   Mouhammet Diop, Michael Everson, Asmus Freytag, Debbie Garside, Paul
   Hoffman, Kent Karlsson, Cary Karp, Jaeyoun Kim, John Klensin, Olaf
   Kolkman, Gervase Markham, Ram Mohan, Lisa Moore, Yngve Pettersen,
   Erik van der Poel, Hualin Qian, Rick Reed, Pete Resnick, Lakmal
   Silva, Michel Suignard, Andrew Sullivan, Wil Tan, Kenneth Whistler,
   Chris Wright, and Yoshiro Yoneya.
