Appendix B. How to Translate Tables Based on RFC 3743 into the XML Format
As background, the rules specified in [RFC3743] work as follows:

1. The original (requested) label is checked to make sure that all
   the code points are a subset of the repertoire.

2. If it passes the check, the original label is allocatable.

3. Generate the all-simplified and all-traditional variant labels
   (union of all the labels generated using all the simplified
   variants of the code points) for allocation.

To illustrate by example, here is one of the more complicated sets of
variants:

U+4E7E
U+4E81
U+5E72
U+5E79
U+69A6
U+6F27

The following shows the relevant section of the Chinese language
table published by the .ASIA registry [ASIA-TABLE]. Its entries
read:

<codepoint>;<simpl-variant(s)>;<trad-variant(s)>;<other-variant(s)>

These are the lines corresponding to the set of variants listed
above:

U+4E7E;U+4E7E,U+5E72;U+4E7E;U+4E81,U+5E72,U+6F27,U+5E79,U+69A6
U+4E81;U+5E72;U+4E7E;U+5E72,U+6F27,U+5E79,U+69A6
U+5E72;U+5E72;U+5E72,U+4E7E,U+5E79;U+4E7E,U+4E81,U+69A6,U+6F27
U+5E79;U+5E72;U+5E79;U+69A6,U+4E7E,U+4E81,U+6F27
U+69A6;U+5E72;U+69A6;U+5E79,U+4E7E,U+4E81,U+6F27
U+6F27;U+4E7E;U+6F27;U+4E81,U+5E72,U+5E79,U+69A6
The corresponding "data" section XML format would look like this:

<data>
  <char cp="4E7E">
    <var cp="4E7E" type="both" comment="identity" />
    <var cp="4E81" type="blocked" />
    <var cp="5E72" type="simp" />
    <var cp="5E79" type="blocked" />
    <var cp="69A6" type="blocked" />
    <var cp="6F27" type="blocked" />
  </char>
  <char cp="4E81">
    <var cp="4E7E" type="trad" />
    <var cp="5E72" type="simp" />
    <var cp="5E79" type="blocked" />
    <var cp="69A6" type="blocked" />
    <var cp="6F27" type="blocked" />
  </char>
  <char cp="5E72">
    <var cp="4E7E" type="trad"/>
    <var cp="4E81" type="blocked"/>
    <var cp="5E72" type="both" comment="identity"/>
    <var cp="5E79" type="trad"/>
    <var cp="69A6" type="blocked"/>
    <var cp="6F27" type="blocked"/>
  </char>
  <char cp="5E79">
    <var cp="4E7E" type="blocked"/>
    <var cp="4E81" type="blocked"/>
    <var cp="5E72" type="simp"/>
    <var cp="5E79" type="trad" comment="identity"/>
    <var cp="69A6" type="blocked"/>
    <var cp="6F27" type="blocked"/>
  </char>
  <char cp="69A6">
    <var cp="4E7E" type="blocked"/>
    <var cp="4E81" type="blocked"/>
    <var cp="5E72" type="simp"/>
    <var cp="5E79" type="blocked"/>
    <var cp="69A6" type="trad" comment="identity"/>
    <var cp="6F27" type="blocked"/>
  </char>
  <char cp="6F27">
    <var cp="4E7E" type="simp"/>
    <var cp="4E81" type="blocked"/>
    <var cp="5E72" type="blocked"/>
    <var cp="5E79" type="blocked"/>
    <var cp="69A6" type="blocked"/>
    <var cp="6F27" type="trad" comment="identity"/>
  </char>
</data>

Here, the simplified variants have been given a type of "simp" and
the traditional variants one of "trad", and all other ones are given
"blocked".

Because some variant mappings show in more than one column, while the
XML format allows only a single type value, they have been given the
type of "both".

Note that some variant mappings map to themselves (identity); that
is, the mapping is reflexive (see Section 5.3.4). In creating the
permutation of all variant labels, these mappings have no effect,
other than adding a value to the variant type list for the variant
label containing them.

In the example so far, all of the entries with type="both" are also
mappings where source and target are identical. That is, they are
reflexive mappings as defined in Section 5.3.4.

Given a label "U+4E7E U+4E81", the following labels would be ruled
allocatable per [RFC3743], based on how that standard is commonly
implemented in domain registries:

Original label:     U+4E7E U+4E81
Simplified label 1: U+4E7E U+5E72
Simplified label 2: U+5E72 U+5E72
Traditional label:  U+4E7E U+4E7E

However, if allocatable labels were generated simply by a straight
permutation of all variants with type other than type="blocked" and
without regard to the simplified and traditional variants, we would
end up with an extra allocatable label of "U+5E72 U+4E7E". This
label is composed of both a Simplified Chinese code point and a
Traditional Chinese code point and therefore shouldn't be
allocatable.
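The problem with a straight permutation can be reproduced with a short
sketch. It is illustrative only and is not part of the format: the
names NON_BLOCKED, permute, and is_mixed are hypothetical, and the
sketch assumes that every position of the label is replaced by one of
its non-blocked variants, with the original label handled separately.

```python
from itertools import product

# Non-blocked variant mappings for the two code points of the example
# label, copied from the XML above: source -> {target: variant type}.
NON_BLOCKED = {
    "4E7E": {"4E7E": "both", "5E72": "simp"},
    "4E81": {"4E7E": "trad", "5E72": "simp"},
}

def permute(label):
    """Map each variant label to the tuple of variant types used to build it.

    Assumption: every position is replaced by one of its non-blocked
    variants; the original label itself is not generated here.
    """
    per_position = [NON_BLOCKED[cp].items() for cp in label]
    return {
        " ".join(target for target, _ in combo): tuple(vtype for _, vtype in combo)
        for combo in product(*per_position)
    }

def is_mixed(types):
    """True if the types are neither all simp/both nor all trad/both."""
    kinds = set(types)
    return not (kinds <= {"simp", "both"} or kinds <= {"trad", "both"})

labels = permute(["4E7E", "4E81"])
mixed = sorted(label for label, types in labels.items() if is_mixed(types))
print(mixed)  # ['5E72 4E7E']
```

Of the four generated labels, three coincide with the simplified and
traditional labels listed above; the remaining one is the mixed label
that should not be allocatable.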
To more fully resolve the dispositions requires several actions to be
defined, as described in Section 7.2.2, that will override the
default actions from Section 7.6. After blocking all labels that
contain a variant with type "blocked", these actions will set labels
to "allocatable" based on the following variant types: "simp",
"trad", and "both". Note that these variant types do not directly
relate to dispositions for the variant label, but that the actions
will resolve them to the Standard Dispositions on labels, i.e.,
"blocked" and "allocatable".

To resolve label dispositions requires five actions to be defined (in
the "rules" section of the XML document in question); these actions
apply in order, and the first one triggered defines the disposition
for the label. The actions are as follows:

1. Block all variant labels containing at least one blocked variant.

2. Allocate all labels that consist entirely of variants that are
   "simp" or "both".

3. Also allocate all labels that are entirely "trad" or "both".

4. Block all surviving labels containing any one of the variant types
   "simp" or "trad" or "both", because they are now known to be part
   of an undesirable mixed simplified/traditional label.

5. Allocate any remaining label; the original label would be such a
   label.

The rules declarations would be represented as:

<rules>
  <!--"action" elements - order defines precedence-->
  <action disp="blocked" any-variant="blocked" />
  <action disp="allocatable" only-variants="simp both" />
  <action disp="allocatable" only-variants="trad both" />
  <action disp="blocked" any-variant="simp trad" />
  <action disp="allocatable" comment="catch-all" />
</rules>

Up to now, variants with type "both" have occurred only in
association with reflexive variant mappings. The "action" elements
defined above rely on the assumption that this is always the case.
However, consider the following set of variants:

U+62E0;U+636E;U+636E;U+64DA
U+636E;U+636E;U+64DA;U+62E0
U+64DA;U+636E;U+64DA;U+62E0
The corresponding XML would be:
<char cp="62E0">
  <var cp="636E" type="both" comment="both, but not reflexive" />
  <var cp="64DA" type="blocked" />
</char>
<char cp="636E">
  <var cp="636E" type="simp" comment="reflexive, but not both" />
  <var cp="64DA" type="trad" />
  <var cp="62E0" type="blocked" />
</char>
<char cp="64DA">
  <var cp="636E" type="simp" />
  <var cp="64DA" type="trad" comment="reflexive" />
  <var cp="62E0" type="blocked" />
</char>
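The mapping from table rows to "var" elements used in these examples
can be sketched as follows. This is a hedged illustration, not a
normative conversion: the function name convert_row is hypothetical,
and the precedence it applies (a target listed as simplified or
traditional keeps that type even if it also appears in the "other"
column) is an assumption inferred from the example XML in this
appendix.

```python
def convert_row(row):
    """Turn one 'cp;simp;trad;other' row into (target, type, reflexive) triples."""
    def code_points(field):
        return [cp.strip().replace("U+", "") for cp in field.split(",") if cp.strip()]

    source_field, simp_field, trad_field, other_field = row.split(";")
    source = code_points(source_field)[0]
    simp = code_points(simp_field)
    trad = code_points(trad_field)
    other = code_points(other_field)

    triples = []
    for target in sorted(set(simp) | set(trad) | set(other)):
        if target in simp and target in trad:
            vtype = "both"       # listed in both preferred columns
        elif target in simp:
            vtype = "simp"
        elif target in trad:
            vtype = "trad"
        else:
            vtype = "blocked"    # listed only as an "other" variant
        # The third field records whether the mapping is reflexive.
        triples.append((target, vtype, target == source))
    return triples

for target, vtype, reflexive in convert_row("U+62E0;U+636E;U+636E;U+64DA"):
    print(f'<var cp="{target}" type="{vtype}" />', "(reflexive)" if reflexive else "")
```

For the first row of the set above, the sketch reports U+636E as
"both" without the mapping being reflexive, which is the case the
earlier "action" elements did not anticipate.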
To make such variant sets work requires a way to selectively trigger
an action based on whether a variant type is associated with an
identity or reflexive mapping, or is associated with an ordinary
variant mapping. This can be done by adding a prefix "r-" to the
"type" attribute on reflexive variant mappings. For example, the
"trad" for code point U+64DA in the preceding figure would become
"r-trad".
With the dispositions prepared in this way, only a slight
modification to the actions is needed to yield the correct set of
allocatable labels:
<action disp="blocked" any-variant="blocked" />
<action disp="allocatable" only-variants="simp r-simp both r-both" />
<action disp="allocatable" only-variants="trad r-trad both r-both" />
<action disp="blocked" all-variants="simp trad both" />
<action disp="allocatable" />
The first three actions get triggered by the same labels as before.
The fourth action blocks any label that combines an original code
point with any mix of ordinary variant mappings; however, no labels
that are a combination of only original code points (code points
having either no variant mappings or a reflexive mapping) would be
affected. These are the original labels, and they are allocated in
the last action.
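The effect of the revised actions can be checked with a small sketch.
It is an illustration under stated assumptions rather than a
reference implementation: the names ACTIONS, triggers, and disposition
are hypothetical, a label is represented only by the list of variant
types used at each position (None for an original code point with no
mapping applied), and the trigger semantics coded here ("all-variants"
and "only-variants" require at least one variant; "only-variants"
additionally requires every position to be a variant) are a reading of
the attribute definitions, not a quotation of them.

```python
# Each action: (disposition, trigger attribute, variant types).
# Order defines precedence; the first action that triggers wins.
ACTIONS = [
    ("blocked", "any-variant", {"blocked"}),
    ("allocatable", "only-variants", {"simp", "r-simp", "both", "r-both"}),
    ("allocatable", "only-variants", {"trad", "r-trad", "both", "r-both"}),
    ("blocked", "all-variants", {"simp", "trad", "both"}),
    ("allocatable", None, set()),  # catch-all
]

def triggers(kind, wanted, types):
    """Decide whether one action triggers for the given per-position types."""
    used = [t for t in types if t is not None]
    if kind == "any-variant":
        return any(t in wanted for t in used)
    covered = bool(used) and all(t in wanted for t in used)
    if kind == "all-variants":
        return covered
    if kind == "only-variants":
        return covered and len(used) == len(types)
    return True  # an action without a trigger attribute always applies

def disposition(types):
    """Return the disposition assigned by the first triggered action."""
    for disp, kind, wanted in ACTIONS:
        if triggers(kind, wanted, types):
            return disp

# Variant labels of the original label "U+62E0 U+64DA":
print(disposition([None, "r-trad"]))    # U+62E0 U+64DA (original): allocatable
print(disposition(["both", "simp"]))    # U+636E U+636E: allocatable
print(disposition(["both", "r-trad"]))  # U+636E U+64DA: allocatable
print(disposition([None, "simp"]))      # U+62E0 U+636E: blocked
```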
Using this scheme of assigning types to ordinary and reflexive
variants, all tables in the style of RFC 3743 can be converted to
XML. By defining a set of actions as outlined above, the LGR will
yield the correct set of allocatable variants: all variants
consisting completely of variant code points preferred for simplified
or traditional, respectively, will be allocated, as will be the
original label. All other variant labels will be blocked.

Appendix C. Indic Syllable Structure Example
In LGRs for Indic scripts, it may be desirable to restrict valid
labels to sequences of valid Indic syllables, or aksharas. This
appendix gives a sample set of rules designed to enforce this
restriction.

Below is an example of BNF for an akshara, which has been published
in "Devanagari Script Behaviour for Hindi" [TDIL-HINDI]. The rules
for other languages and scripts used in India are expected to be
generally similar.

For Hindi, the BNF has the form:

V[m]|{C[N]H}C[N](H|[v][m])

Where:

V    (uppercase) is any independent vowel
m    is any vowel modifier (Devanagari Anusvara, Visarga, and
     Candrabindu)
C    is any consonant (with inherent vowel)
N    is Nukta
H    is a halant (or virama)
v    (lowercase) is any dependent vowel sign (matra)
{}   encloses items that may be repeated one or more times
[ ]  encloses items that may or may not be present
|    separates items, out of which only one can be present
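As a quick way to experiment with the BNF, it can be transcribed into
a regular expression. The sketch below is an approximation only: the
code point ranges are simplified Devanagari stand-ins chosen for
illustration (not the full character classification), the names
AKSHARA, LABEL, and is_akshara_sequence are hypothetical, and the
braces are treated as "zero or more" repetitions so that a single
consonant still forms an akshara.

```python
import re

# Assumed Devanagari stand-ins for the BNF classes (a simplification).
V = "[\u0905-\u0914]"   # independent vowels
C = "[\u0915-\u0939]"   # consonants
N = "\u093C"            # nukta
H = "\u094D"            # halant (virama)
v = "[\u093E-\u094C]"   # dependent vowel signs (matras)
m = "[\u0901-\u0903]"   # candrabindu, anusvara, visarga

# V[m] | {C[N]H} C[N] (H | [v][m])
AKSHARA = f"(?:{V}{m}?|(?:{C}{N}?{H})*{C}{N}?(?:{H}|{v}?{m}?))"
LABEL = re.compile(f"(?:{AKSHARA})+")

def is_akshara_sequence(label):
    """True if the whole label is a sequence of one or more aksharas."""
    return LABEL.fullmatch(label) is not None

print(is_akshara_sequence("\u0939\u093F\u0928\u094D\u0926\u0940"))  # True
print(is_akshara_sequence("\u093F\u0939"))                          # False
```

The first label is made of two aksharas (consonant plus matra, then a
consonant cluster plus matra); the second starts with a dependent
vowel sign and is therefore rejected.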
By using the Unicode character property "InSC" or
"Indic_Syllabic_Category", which corresponds rather directly to the
classification of characters in the BNF above, we can translate the
BNF into a set of WLE rules matching the definition of an akshara.

<rules>
  <!--Character class definitions go here-->
  <class name="halant" property="InSC:Virama" />
  <union name="vowel-modifier">
    <class property="InSC:Visarga" />
    <class property="InSC:Bindu" comment="includes anusvara" />
  </union>
  <!--Whole label evaluation and context rules go here-->
  <rule name="consonant-with-optional-nukta">
    <class property="InSC:Consonant" />
    <class property="InSC:Nukta" count="0:1"/>
  </rule>
  <rule name="independent-vowel-with-optional-modifier">
    <class property="InSC:Vowel_Independent" />
    <class by-ref="vowel-modifier" count="0:1" />
  </rule>
  <rule name="optional-dependent-vowel-with-opt-modifier" >
    <class property="InSC:Vowel_Dependent" count="0:1" />
    <class by-ref="vowel-modifier" count="0:1" />
  </rule>
  <rule name="consonant-cluster">
    <rule count="0+">
      <rule by-ref="consonant-with-optional-nukta" />
      <class by-ref="halant" />
    </rule>
    <rule by-ref="consonant-with-optional-nukta" />
    <choice>
      <class by-ref="halant" />
      <rule by-ref="optional-dependent-vowel-with-opt-modifier" />
    </choice>
  </rule>
  <rule name="akshara">
    <choice>
      <rule by-ref="independent-vowel-with-optional-modifier" />
      <rule by-ref="consonant-cluster" />
    </choice>
  </rule>
  <rule name="WLE-akshara-or-other"
        comment="series of one or more aksharas, possibly alternating with other types of code points such as digits">
    <start />
    <choice count="1+">
      <class property="InSC:other" />
      <rule by-ref="akshara" />
    </choice>
    <end />
  </rule>
  <!--"action" elements go here - order defines precedence-->
  <action disp="invalid" not-match="WLE-akshara-or-other" />
</rules>

With the rules and classes as defined above, the final action assigns
a disposition of "invalid" to all labels that are not composed of a
sequence of well-formed aksharas, optionally interspersed with other
characters, such as digits.

The relevant Unicode character property could be replicated by
tagging repertoire values directly in the LGR; this would remove the
dependency on any specific version of the Unicode Standard.

Generally, dependent vowels may only follow consonant expressions;
however, for some scripts, like Bengali, the Unicode Standard
supports sequences of dependent vowels or their application on
independent vowels. This makes the definition of akshara less
restrictive.

C.1. Reducing Complexity
As presented in this example, the rules are rather complex -- although useful in demonstrating the features of the XML format, such complexity would be an undesirable feature in an actual LGR. It is possible to reduce the complexity of the rules in this example by defining alternate rules that simply define the permissible pair-wise context of adjacent code points by character class, such as a rule that a halant can only follow a (nuktated) consonant. Such pair-wise contexts are easier to understand, implement, and verify, and have the additional benefit of allowing tools to better pinpoint why a label failed to validate. They also tend to correspond more directly to the kind of well-formedness requirements that are most relevant to DNS security, like the requirement to limit the application of a combining mark (such as a vowel modifier) to only selected base characters (in this case, vowels). (See the example and discussion in [WLE-RULES].)
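A pair-wise context check of the kind described here can be sketched
in a few lines. The sketch is illustrative and makes assumptions: it
uses simplified Devanagari code points as stand-ins for the character
classes, and the names is_consonant and halant_context_errors are
hypothetical. It implements only the single rule quoted above, that a
halant can only follow a (nuktated) consonant.

```python
HALANT = "\u094D"  # Devanagari virama
NUKTA = "\u093C"   # Devanagari nukta

def is_consonant(ch):
    """Assumed consonant test: the Devanagari range U+0915..U+0939."""
    return "\u0915" <= ch <= "\u0939"

def halant_context_errors(label):
    """Return the index of every halant not preceded by a (nuktated) consonant."""
    errors = []
    for i, ch in enumerate(label):
        if ch != HALANT:
            continue
        prev = label[i - 1] if i > 0 else ""
        if prev == NUKTA:
            # Look one further back: the nukta must itself follow a consonant.
            prev = label[i - 2] if i > 1 else ""
        if not is_consonant(prev):
            errors.append(i)
    return errors

print(halant_context_errors("\u0915\u094D"))  # []
print(halant_context_errors("\u0905\u094D"))  # [1]
```

Because the check reports the offending positions, a tool built this
way can say exactly which code point caused a label to be rejected.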
Appendix D. RELAX NG Compact Schema
This schema is provided in RELAX NG Compact format [RELAX-NG].

<CODE BEGINS>
#
# LGR XML Schema 1.0
#

default namespace = "urn:ietf:params:xml:ns:lgr-1.0"

#
# SIMPLE TYPES
#

# RFC 5646 language tag (e.g., "de", "und-Latn")
language-tag = xsd:token

# The scope to which the LGR applies. For the "domain" scope type,
# it should be a fully qualified domain name.
scope-value = xsd:token {
    minLength = "1"
}

## a single code point
code-point = xsd:token {
    pattern = "[0-9A-F]{4,6}"
}

## a space-separated sequence of code points
code-point-sequence = xsd:token {
    pattern = "[0-9A-F]{4,6}( [0-9A-F]{4,6})+"
}

## single code point, or a sequence of code points, or empty string
code-point-literal = code-point | code-point-sequence | ""

## code point or sequence only
non-empty-code-point-literal = code-point | code-point-sequence

## code point set represented in short form
code-point-set-shorthand = xsd:token {
    pattern = "([0-9A-F]{4,6}|[0-9A-F]{4,6}-[0-9A-F]{4,6})" ~
              "( ([0-9A-F]{4,6}|[0-9A-F]{4,6}-[0-9A-F]{4,6}))*"
}

## dates are used in information fields in the meta
## section ("YYYY-MM-DD")
date-pattern = xsd:token {
    pattern = "\d{4}-\d\d-\d\d"
}

## variant type
## the variant type MUST be non-empty and MUST NOT
## start with a "_"; using xsd:NMTOKEN here because
## we need space-separated lists of them
variant-type = xsd:NMTOKEN

## variant type list for action triggers
## the list MUST NOT be empty, and entries MUST NOT
## start with a "_"
variant-type-list = xsd:NMTOKENS

## reference to a rule name (used in "when" and "not-when"
## attributes, as well as the "by-ref" attribute of the "rule"
## element).
rule-ref = xsd:IDREF

## a space-separated list of tags. Tags should generally follow
## xsd:Name syntax. However, we are using the xsd:NMTOKENS here
## because there is no native XSD datatype for space-separated
## xsd:Name
tags = xsd:NMTOKENS

## The value space of a "from-tag" attribute. Although it is closer
## to xsd:IDREF lexically and semantically, tags are not unique in
## the document. As such, we are unable to take advantage of
## facilities provided by a validator. xsd:NMTOKEN is used instead
## of the stricter xsd:Names here so as to be consistent with
## the above.
tag-ref = xsd:NMTOKEN

## an identifier type (used by "name" attributes).
identifier = xsd:ID

## used in the class "by-ref" attribute to reference another class of
## the same "name" attribute value.
class-ref = xsd:IDREF

## "count" attribute pattern ("n", "n+", or "n:m")
count-pattern = xsd:token {
    pattern = "\d+(\+|:\d+)?"
}

## "ref" attribute pattern
## space-separated list of "id" attribute values for
## "reference" elements. These reference ids
## must be declared in a "reference" element
## before they can be used in a "ref" attribute
ref-pattern = xsd:token {
    pattern = "[\-_.:0-9A-Z]+( [\-_.:0-9A-Z]+)*"
}

#
# STRUCTURES
#

## Representation of a single code point or a sequence of code
## points
char = element char {
    attribute cp { code-point-literal },
    attribute comment { text }?,
    attribute when { rule-ref }?,
    attribute not-when { rule-ref }?,
    attribute tag { tags }?,
    attribute ref { ref-pattern }?,
    variant*
}

## Representation of a range of code points
range = element range {
    attribute first-cp { code-point },
    attribute last-cp { code-point },
    attribute comment { text }?,
    attribute when { rule-ref }?,
    attribute not-when { rule-ref }?,
    attribute tag { tags }?,
    attribute ref { ref-pattern }?
}

## Representation of a variant code point or sequence
variant = element var {
    attribute cp { code-point-literal },
    attribute type { xsd:NMTOKEN }?,
    attribute when { rule-ref }?,
    attribute not-when { rule-ref }?,
    attribute comment { text }?,
    attribute ref { ref-pattern }?
}

#
# Classes
#

## a "class" element that references the name of another "class"
## (or set-operator like "union") defined elsewhere.
## If used as a matcher (appearing under a "rule" element),
## the "count" attribute may be present.
class-invocation = element class { class-invocation-content }

class-invocation-content =
    attribute by-ref { class-ref },
    attribute count { count-pattern }?,
    attribute comment { text }?

## defines a new class (set of code points) using Unicode property
## or code points of the same tag value or code point literals
class-declaration = element class { class-declaration-content }

class-declaration-content =

    # "name" attribute MUST be present if this is a "top-level"
    # class declaration, i.e., appearing directly under the "rules"
    # element. Otherwise, it MUST be absent.
    attribute name { identifier }?,

    # If used as a matcher (appearing in a "rule" element, but not
    # when nested inside a set-operator or class), the "count"
    # attribute may be present. Otherwise, it MUST be absent.
    attribute count { count-pattern }?,

    attribute comment { text }?,
    attribute ref { ref-pattern }?,

    (
      # define the class by property (e.g., property="sc:Latn"), OR
      attribute property { xsd:NMTOKEN }
      # define the class by tagged code points, OR
      | attribute from-tag { tag-ref }
      # text node to allow for shorthand notation
      # e.g., "0061 0062-0063"
      | code-point-set-shorthand
    )

class-invocation-or-declaration = element class {
    class-invocation-content | class-declaration-content
}

class-or-set-operator-nested =
    class-invocation-or-declaration | set-operator

class-or-set-operator-declaration =
    # a "class" element or set-operator (effectively defining a class)
    # directly in the "rules" element.
    class-declaration | set-operator

#
# set-operators
#

complement-operator = element complement {
    attribute name { identifier }?,
    attribute comment { text }?,
    attribute ref { ref-pattern }?,
    # "count" attribute MUST only be used when this set-operator is
    # used as a matcher (i.e., nested in a "rule" element but not
    # inside a set-operator or class)
    attribute count { count-pattern }?,
    class-or-set-operator-nested
}

union-operator = element union {
    attribute name { identifier }?,
    attribute comment { text }?,
    attribute ref { ref-pattern }?,
    # "count" attribute MUST only be used when this set-operator is
    # used as a matcher (i.e., nested in a "rule" element but not
    # inside a set-operator or class)
    attribute count { count-pattern }?,
    class-or-set-operator-nested,
    # needs two or more child elements
    class-or-set-operator-nested+
}

intersection-operator = element intersection {
    attribute name { identifier }?,
    attribute comment { text }?,
    attribute ref { ref-pattern }?,
    # "count" attribute MUST only be used when this set-operator is
    # used as a matcher (i.e., nested in a "rule" element but not
    # inside a set-operator or class)
    attribute count { count-pattern }?,
    class-or-set-operator-nested,
    class-or-set-operator-nested
}

difference-operator = element difference {
    attribute name { identifier }?,
    attribute comment { text }?,
    attribute ref { ref-pattern }?,
    # "count" attribute MUST only be used when this set-operator is
    # used as a matcher (i.e., nested in a "rule" element but not
    # inside a set-operator or class)
    attribute count { count-pattern }?,
    class-or-set-operator-nested,
    class-or-set-operator-nested
}

symmetric-difference-operator = element symmetric-difference {
    attribute name { identifier }?,
    attribute comment { text }?,
    attribute ref { ref-pattern }?,
    # "count" attribute MUST only be used when this set-operator is
    # used as a matcher (i.e., nested in a "rule" element but not
    # inside a set-operator or class)
    attribute count { count-pattern }?,
    class-or-set-operator-nested,
    class-or-set-operator-nested
}

## operators that transform class(es) into a new class.
set-operator = complement-operator
               | union-operator
               | intersection-operator
               | difference-operator
               | symmetric-difference-operator

#
# Match operators (matchers)
#

any-matcher = element any {
    attribute count { count-pattern }?,
    attribute comment { text }?
}

choice-matcher = element choice {
    ## "count" attribute MUST only be used when the choice-matcher
    ## contains no nested "start", "end", "anchor", "look-behind",
    ## or "look-ahead" operators and no nested rule-matchers
    ## containing any of these elements
    attribute count { count-pattern }?,
    attribute comment { text }?,
    # two or more match operators
    match-operator-choice,
    match-operator-choice+
}

char-matcher =
    # for use as a matcher - like "char" but without a "tag" attribute
    element char {
        attribute cp { non-empty-code-point-literal },
        # If used as a matcher (appearing in a "rule" element), the
        # "count" attribute may be present. Otherwise, it MUST be
        # absent.
        attribute count { count-pattern }?,
        attribute comment { text }?,
        attribute ref { ref-pattern }?
    }

start-matcher = element start {
    attribute comment { text }?
}

end-matcher = element end {
    attribute comment { text }?
}

anchor-matcher = element anchor {
    attribute comment { text }?
}

look-ahead-matcher = element look-ahead {
    attribute comment { text }?,
    match-operators-non-pos
}

look-behind-matcher = element look-behind {
    attribute comment { text }?,
    match-operators-non-pos
}

## non-positional match operator that can be used as a direct child
## element of the choice-matcher.
match-operator-choice = (
    any-matcher | choice-matcher | start-matcher | end-matcher
    | char-matcher | class-or-set-operator-nested | rule-matcher
)

## non-positional match operators do not contain any "anchor",
## "look-behind", or "look-ahead" elements.
match-operators-non-pos = (
    start-matcher?,
    (any-matcher | choice-matcher | char-matcher
     | class-or-set-operator-nested | rule-matcher)*,
    end-matcher?
)

## positional match operators have an "anchor" element, which may be
## preceded by a "look-behind" element, or followed by a "look-ahead"
## element, or both.
match-operators-pos =
    look-behind-matcher?, anchor-matcher, look-ahead-matcher?

match-operators = match-operators-non-pos | match-operators-pos

#
# Rules
#

# top-level rule must have "name" attribute
rule-declaration-top = element rule {
    attribute name { identifier },
    attribute comment { text }?,
    attribute ref { ref-pattern }?,
    match-operators
}

## "rule" element used as a matcher (either "by-ref" or contains
## other match operators itself)
rule-matcher = element rule {
    ## "count" attribute MUST only be used when the rule-matcher
    ## contains no nested "start", "end", "anchor", "look-behind",
    ## or "look-ahead" operators and no nested rule-matchers
    ## containing any of these elements
    attribute count { count-pattern }?,
    attribute comment { text }?,
    attribute ref { ref-pattern }?,
    (attribute by-ref { rule-ref } | match-operators)
}

#
# Actions
#

action-declaration = element action {
    attribute comment { text }?,
    attribute ref { ref-pattern }?,
    # dispositions are often named after variant types or vice versa
    attribute disp { variant-type },
    ( attribute match { rule-ref }
      | attribute not-match { rule-ref } )?,
    ( attribute any-variant { variant-type-list }
      | attribute all-variants { variant-type-list }
      | attribute only-variants { variant-type-list } )?
}

# DOCUMENT STRUCTURE

start = lgr

lgr = element lgr {
    meta-section?,
    data-section,
    rules-section?
}

## Meta section - information recorded with an LGR that generally
## does not affect machine processing (except for "unicode-version").
## However, if any "class-declaration" uses the "property" attribute,
## a "unicode-version" element MUST be present.
meta-section = element meta {
    element version {
        attribute comment { text }?,
        text
    }?
    & element date { date-pattern }?
    & element language { language-tag }*
    & element scope {
        # type may be "domain" or an application-defined value
        attribute type { xsd:NCName },
        scope-value
    }*
    & element validity-start { date-pattern }?
    & element validity-end { date-pattern }?
    & element unicode-version {
        xsd:token {
            pattern = "\d+\.\d+\.\d+"
        }
    }?
    & element description {
        # this SHOULD be a valid MIME type
        attribute type { text }?,
        text
    }?
    & element references {
        element reference {
            attribute id {
                xsd:token {
                    # limit "id" attribute to uppercase letters,
                    # digits, and a few punctuation marks; use of
                    # integers is RECOMMENDED
                    pattern = "[\-_.:0-9A-Z]*"
                    minLength = "1"
                }
            },
            attribute comment { text }?,
            text
        }*
    }?
}

data-section = element data { (char | range)+ }

## Note that action declarations are strictly order dependent.
## class-or-set-operator-declaration and rule-declaration-top
## are weakly order dependent; they must precede first use of the
## identifier via "by-ref".
rules-section = element rules {
    ( class-or-set-operator-declaration
      | rule-declaration-top
      | action-declaration)*
}
<CODE ENDS>
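The lexical patterns of the simple types can be tried out directly.
The sketch below is an informal aid, not a substitute for a RELAX NG
validator: the patterns are copied from the schema, the names PATTERNS
and conforms are hypothetical, and the whitespace normalization that
xsd:token performs before matching is deliberately ignored. XSD
patterns match the whole value, which is why re.fullmatch is used.

```python
import re

# Patterns copied from the simple types in the schema above.
PATTERNS = {
    "code-point": r"[0-9A-F]{4,6}",
    "code-point-sequence": r"[0-9A-F]{4,6}( [0-9A-F]{4,6})+",
    "count-pattern": r"\d+(\+|:\d+)?",
    "date-pattern": r"\d{4}-\d\d-\d\d",
    "ref-pattern": r"[\-_.:0-9A-Z]+( [\-_.:0-9A-Z]+)*",
}

def conforms(type_name, value):
    """True if the value matches the named pattern in its entirety."""
    return re.fullmatch(PATTERNS[type_name], value) is not None

print(conforms("code-point", "4E7E"))     # True
print(conforms("code-point", "4e7e"))     # False: lowercase hex digits
print(conforms("count-pattern", "0:1"))   # True
print(conforms("count-pattern", "1-2"))   # False
```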
Acknowledgements
This format builds upon the work on documenting IDN tables by many
different registry operators. Notably, a comprehensive language
table for Chinese, Japanese, and Korean was developed by the "Joint
Engineering Team" [RFC3743]; this table is the basis of many registry
policies. Also, a set of guidelines for Arabic script registrations
[RFC5564] was published by the Arabic-language community.

Contributions that have shaped this document have been provided by
Francisco Arias, Julien Bernard, Mark Davis, Martin Duerst, Paul
Hoffman, Sarmad Hussain, Barry Leiba, Alexander Mayrhofer, Alexey
Melnikov, Nicholas Ostler, Thomas Roessler, Audric Schiltknecht,
Steve Sheng, Michel Suignard, Andrew Sullivan, Wil Tan, and John
Yunker.

Authors' Addresses
Kim Davies
Internet Corporation for Assigned Names and Numbers
12025 Waterfront Drive
Los Angeles, CA 90094
United States of America

Phone: +1 310 301 5800
Email: kim.davies@icann.org
URI:   http://www.icann.org/


Asmus Freytag
ASMUS, Inc.

Email: asmus@unicode.org