6. Whole Label and Context Evaluation
6.1. Basic Concepts
The "rules" element contains the specification of both context-based and whole label rules. Collectively, these are known as Whole Label Evaluation (WLE) rules (Section 6.3). The "rules" element also contains the character classes (Section 6.2) that they depend on, and any actions (Section 7) that assign dispositions to labels based on rules or variant mappings.
A whole label rule is applied to the whole label. It is used to validate both original labels and any variant labels computed from them.

A rule implementing a conditional context as discussed in Section 5.2 does not necessarily apply to the whole label but may be specific to the context around a single code point or code point sequence. Certain code points in a label sometimes need to satisfy context-based rules -- for example, for the label to be considered valid, or to satisfy the context for a variant mapping (see the description of the "when" attribute in Section 6.4).

For example, if a rule is referenced in the "when" attribute of a variant mapping, it is used to describe the conditional context under which the particular variant mapping is defined to exist.

Each rule is defined in a "rule" element. A rule may contain the following as child elements:

o  literal code points or code point sequences

o  character classes, which define sets of code points to be used for context comparisons

o  context operators, which define when character classes and literals may appear

o  nested rules, whether defined in place or invoked by reference

Collectively, these are called "match operators" and are listed in Section 6.3.2. An LGR containing rules or match operators that

1.  are incorrectly defined or nested,

2.  have invalid attributes, or

3.  have invalid or undefined attribute values

MUST be rejected. Note that not all of the constraints defined here are validated by the schema.
6.2. Character Classes
Character classes are sets of characters that often share a particular property. While they function like sets in every way, even supporting the usual set operators, they are called "character classes" here in a nod to the use of that term in regular expression syntax. (This also avoids confusion with the term "character set" in the sense of character encoding.)

Character classes can be specified in several ways:

o  by defining the class via matching a tag in the code point data. All characters with the same "tag" attribute are part of the same class;

o  by referencing a value of one of the Unicode character properties defined in the Unicode Character Database;

o  by explicitly listing all the code points in the class; or

o  by defining the class as a set combination of any number of other classes.

6.2.1. Declaring and Invoking Named Classes
A character class has an OPTIONAL "name" attribute consisting of a single identifier not containing spaces. All names for classes must be unique. If the "name" attribute is omitted, the class is anonymous and exists only inside the rule or combined class where it is defined. A named character class is defined independently and can be referenced by name from within any rules or as part of other character class definitions.

    <class name="example" comment="an example class definition">
        0061 4E00
    </class>
    ...
    <rule>
        <class by-ref="example" />
    </rule>

An empty "class" element with a "by-ref" attribute is a reference to an existing named class. The "by-ref" attribute MUST NOT be used in the same "class" element with any of these attributes: "name", "from-tag", "property", or "ref". The "name" attribute MUST be present if and only if the class is a direct child element of the "rules" element. It is an error to reference a named class for which the definition has not been seen.
6.2.2. Tag-Based Classes
The "char" or "range" elements that are child elements of the "data" element MAY contain a "tag" attribute that consists of one or more space-separated tag values; for example:

    <char cp="0061" tag="letter lower"/>
    <char cp="4E00" tag="letter"/>

This defines two tags for use with code point U+0061, the tag "letter" and the tag "lower". Use

    <class name="letter" from-tag="letter" />
    <class name="lower" from-tag="lower" />

to define two named character classes, "letter" and "lower", containing all code points with the respective tags, the first with 0061 and 4E00 as elements, and the latter with 0061 but not 4E00 as an element. The "name" attribute may be omitted for an anonymous in-place definition of a nested, tag-based class.

Tag values are typically identifiers, with the addition of a few punctuation symbols, such as a colon. Formally, they MUST correspond to the XML 1.0 Nmtoken production. While a "tag" attribute may contain a list of tag values, the "from-tag" attribute MUST always contain a single tag value.

If the document contains no "char" or "range" elements with a corresponding tag, the character class represents the empty set. This is valid, to allow a common "rules" element to be shared across files. However, it is RECOMMENDED that implementations allow for a warning to ensure that referring to an undefined tag in this way is intentional.

6.2.3. Unicode Property-Based Classes
A class is defined in terms of Unicode properties by giving the Unicode property alias and the property value or property value alias, separated by a colon.

    <class name="virama" property="ccc:9" />

The example above selects all code points for which the Unicode Canonical Combining Class (ccc) value is 9. This value of the ccc is assigned to all code points that encode viramas.
Unicode property values MUST be designated via a composite of the attribute name and value as defined for the property value in [UAX42], separated by a colon. Loose matching of property values and names as described in [UAX44] is not appropriate for an XML schema and is not supported; it is likewise not supported in the XML representation [UAX42] of the Unicode Character Database itself.

A property-based class MAY be anonymous, or, when defined as an immediate child of the "rules" element, it MAY be named to relate a formal property definition to its usage, such as the use of the value 9 for ccc to designate a virama (or halant) in various scripts.

Unicode properties may, in principle, change between versions of the Unicode Standard. However, the values assigned for a given version are fixed. If Unicode properties are used, a Unicode version MUST be declared in the "unicode-version" element in the header. (Note: Some Unicode properties are by definition stable across versions and do not change once assigned; see [Unicode-Stability].)

All implementations processing LGR files SHOULD provide support for the following minimal set of Unicode properties:

o  General Category (gc)

o  Script (sc)

o  Canonical Combining Class (ccc)

o  Bidi Class (bc)

o  Arabic Joining Type (jt)

o  Indic Syllabic Category (InSC)

o  Deprecated (Dep)

The short name for each property is given in parentheses.

If a program that is using an LGR to determine the validity of a label encounters a property that it does not support, it MUST abort with an error.
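As a rough illustration of how a processor might resolve a property-based class, the following Python sketch expands an "alias:value" specification into a set of code points. The helper name property_class is hypothetical, and only the gc and ccc properties are handled. Note that Python's unicodedata module does not expose Script, Joining Type, or Indic Syllabic Category, and it reflects the Unicode version bundled with the interpreter rather than the version declared in the LGR header, so this is a sketch under those assumptions and not a conforming implementation.

```python
import unicodedata


def property_class(spec: str) -> set[int]:
    """Expand an 'alias:value' spec into a set of code points.

    Hypothetical helper: only 'gc' and 'ccc' are supported here.
    """
    alias, sep, value = spec.partition(":")
    if not sep or not value:
        raise ValueError(f"malformed property spec: {spec!r}")
    if alias == "ccc":
        wanted = int(value)

        def test(ch: str) -> bool:
            return unicodedata.combining(ch) == wanted
    elif alias == "gc":
        # Assumption: a one-letter value such as 'L' selects the whole
        # group (Lu, Ll, Lt, Lm, Lo); aliases such as 'LC' are not handled.
        def test(ch: str) -> bool:
            return unicodedata.category(ch).startswith(value)
    else:
        # Mirrors the requirement to abort on an unsupported property.
        raise NotImplementedError(f"unsupported property: {alias}")
    return {cp for cp in range(0x110000) if test(chr(cp))}


virama = property_class("ccc:9")
print(0x094D in virama)  # U+094D DEVANAGARI SIGN VIRAMA
```

Scanning the full code point range on every call is slow; an implementation would more plausibly precompute such sets from the Unicode Character Database files for the declared version.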
6.2.4. Explicitly Declared Classes
A class of code points may also be declared by listing all code points that are members of the class. This is useful when tagging cannot be used because code points are not listed individually as part of the eligible set of code points for the given LGR -- for example, because they only occur in code point sequences.

To define a class in terms of an explicit list of code points, use a space-separated list of hexadecimal code point values:

    <class name="abcd">0061 0062 0063 0064</class>

This defines a class named "abcd" containing the code points for characters "a", "b", "c", and "d". The ordering of the code points is not material, but it is RECOMMENDED to list them in ascending order; not doing so makes it unnecessarily difficult for users to detect errors such as duplicates or to compare and review these classes against other specifications.

In a class definition, ranges of code points are represented by a hexadecimal start and end value separated by a hyphen. The following declaration is equivalent to the preceding:

    <class name="abcd">0061-0064</class>

Range and code point declarations can be freely intermixed:

    <class name="abcd">0061 0062-0063 0064</class>

The contents of a class differ from a repertoire in that the latter MAY contain sequences as elements, while the former MUST NOT. Instead, they closely resemble character classes as found in regular expressions.
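The three equivalent declarations above can be checked with a small parser. This is a minimal Python sketch; the function name parse_class_body is hypothetical, and rejecting a descending range is an assumption of the sketch rather than something stated in the text.

```python
def parse_class_body(text: str) -> set[int]:
    """Parse the content of an explicit "class" element into code points."""
    members: set[int] = set()
    for token in text.split():
        first, sep, last = token.partition("-")
        start = int(first, 16)
        end = int(last, 16) if sep else start
        if end < start:
            # Assumption: a descending range is treated as an error.
            raise ValueError(f"descending range: {token}")
        members.update(range(start, end + 1))
    return members


listed = parse_class_body("0061 0062 0063 0064")
ranged = parse_class_body("0061-0064")
mixed = parse_class_body("0061 0062-0063 0064")
print(listed == ranged == mixed)
```

Because the result is a set, the order of the tokens and any duplicates do not affect the class, which matches the statement that ordering is not material.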
6.2.5. Combined Classes
Classes may be combined using operators for set complement, union, intersection, difference (elements of the first class that are not in the second), and symmetric difference (elements in either class but not both). Because classes fundamentally function like sets, the union of several character classes is itself a class, for example.

+-------------------+----------------------------------------------+
| Logical Operation | Example                                      |
+-------------------+----------------------------------------------+
| Complement        | <complement>                                 |
|                   |   <class by-ref="xxx"/>                      |
|                   | </complement>                                |
+-------------------+----------------------------------------------+
| Union             | <union>                                      |
|                   |   <class by-ref="class-1"/>                  |
|                   |   <class by-ref="class-2"/>                  |
|                   |   <class by-ref="class-3"/>                  |
|                   | </union>                                     |
+-------------------+----------------------------------------------+
| Intersection      | <intersection>                               |
|                   |   <class by-ref="class-1"/>                  |
|                   |   <class by-ref="class-2"/>                  |
|                   | </intersection>                              |
+-------------------+----------------------------------------------+
| Difference        | <difference>                                 |
|                   |   <class by-ref="class-1"/>                  |
|                   |   <class by-ref="class-2"/>                  |
|                   | </difference>                                |
+-------------------+----------------------------------------------+
| Symmetric         | <symmetric-difference>                       |
| Difference        |   <class by-ref="class-1"/>                  |
|                   |   <class by-ref="class-2"/>                  |
|                   | </symmetric-difference>                      |
+-------------------+----------------------------------------------+

                          Set Operators

The elements from this table may be arbitrarily nested inside each other, subject to the following restriction: a "complement" element MUST contain precisely one "class" or one of the operator elements, while an "intersection", "symmetric-difference", or "difference" element MUST contain precisely two, and a "union" element MUST contain two or more of these elements.
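The set operators and their operand-count restrictions map directly onto ordinary set arithmetic. The following Python sketch shows one way to evaluate a single operator element once its operands have been resolved to sets. The names ARITY and combine are hypothetical, and the "universe" against which a complement is taken is an assumption of the sketch, since this section does not define it.

```python
# Allowed operand counts per operator: (minimum, maximum or None).
ARITY = {
    "complement": (1, 1),
    "union": (2, None),
    "intersection": (2, 2),
    "difference": (2, 2),
    "symmetric-difference": (2, 2),
}


def combine(op: str, operands: list[set[int]], universe: set[int]) -> set[int]:
    """Evaluate one set operator over already-resolved operand classes."""
    low, high = ARITY[op]
    count = len(operands)
    if count < low or (high is not None and count > high):
        raise ValueError(f"{op}: invalid number of operands ({count})")
    if op == "complement":
        # Assumption: the complement is taken relative to `universe`.
        return universe - operands[0]
    if op == "union":
        return set().union(*operands)
    first, second = operands
    if op == "intersection":
        return first & second
    if op == "difference":
        return first - second
    return first ^ second  # symmetric-difference


a, b = {1, 2, 3}, {3, 4}
everything = set(range(6))
print(sorted(combine("symmetric-difference", [a, b], everything)))
```

Nesting falls out naturally: the result of one call is itself a set and can be passed as an operand to another call.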
An anonymous combined class can be defined directly inside a rule or any of the match operator elements that allow child elements (see Section 6.3.2) by using the set combination as the outer element.

    <rule>
        <union>
            <class by-ref="xxx"/>
            <class by-ref="yyy"/>
        </union>
    </rule>

The example shows the definition of an anonymous combined class that represents the union of classes "xxx" and "yyy". There is no need to wrap this union inside another "class" element, and, in fact, set combination elements MUST NOT be nested inside a "class" element.

Lastly, to create a named combined class that can be referenced in other classes or in rules as <class by-ref="xxxyyy"/>, add a "name" attribute to the set combination element -- for example, <union name="xxxyyy" /> -- and place it at the top level immediately below the "rules" element (see Section 6.2.1).

    <rules>
        <union name="xxxyyy">
            <class by-ref="xxx"/>
            <class by-ref="yyy"/>
        </union>
        ...
    </rules>

Because (as for ordinary sets) a combination of classes is itself a class, no matter by what combinations of set operators a combined class is created, a reference to it always uses the "class" element as described in Section 6.2.1. That is, a named class is always referenced via an empty "class" element using the "by-ref" attribute containing the name of the class to be referenced.

6.3. Whole Label and Context Rules
Each rule comprises a series of matching operators that must be satisfied in order to determine whether a label meets a given condition. Rules may reference other rules or character classes defined elsewhere in the table.
6.3.1. The "rule" Element
A matching rule is defined by a "rule" element, the child elements of which are one of the match operators from Section 6.3.2. In evaluating a rule, each child element is matched in order. "rule" elements MAY be nested inside each other and inside certain match operators.

A simple rule to match a label where all characters are members of some class called "preferred-codepoint":

    <rule name="preferred-label">
        <start />
        <class by-ref="preferred-codepoint" count="1+"/>
        <end />
    </rule>

Rules are paired with explicit and implied actions, triggering these actions when a rule matches a label. For example, a simple explicit action for the rule shown above would be:

    <action disp="allocatable" match="preferred-label" />

The rule in this example would have the effect of setting the policy disposition for a label made up entirely of preferred code points to "allocatable". Explicit actions are further discussed in Section 7 and implicit actions in Section 7.5. Another use of rules is in defining conditional contexts for code points and variants as discussed in Sections 5.2 and 5.3.5.

A rule that is an immediate child element of the "rules" element MUST be named using a "name" attribute containing a single identifier string with no spaces. A named rule may be incorporated into another rule by reference and may also be referenced by an "action" element, "when" attribute, or "not-when" attribute. If the "name" attribute is omitted, the rule is anonymous and MUST be nested inside another rule or match operator.
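The pairing of the "preferred-label" rule with its action can be sketched in a few lines of Python. The function name and the use of None to mean "this action was not triggered" are assumptions of the sketch; in a full processor, evaluation would continue with the remaining actions described in Section 7.

```python
from typing import Optional


def preferred_label_disposition(label: str, preferred: set[int]) -> Optional[str]:
    """Apply the "preferred-label" rule and its explicit action to a label."""
    # <start/> <class ... count="1+"/> <end/>: one or more code points,
    # all of them members of the class, spanning the whole label.
    matches = len(label) >= 1 and all(ord(ch) in preferred for ch in label)
    # Assumption: None stands for "action not triggered".
    return "allocatable" if matches else None


preferred_codepoints = {0x61, 0x62}
print(preferred_label_disposition("abba", preferred_codepoints))
```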
6.3.2. The Match Operators
The child elements of a rule are a series of match operators, which are listed here by type and name and with a basic example or two.

+------------+-------------+------------------------------------+
| Type       | Operator    | Examples                           |
+------------+-------------+------------------------------------+
| logical    | any         | <any />                            |
|            +-------------+------------------------------------+
|            | choice      | <choice>                           |
|            |             |  <rule by-ref="alternative1"/>     |
|            |             |  <rule by-ref="alternative2"/>     |
|            |             | </choice>                          |
+------------+-------------+------------------------------------+
| positional | start       | <start />                          |
|            +-------------+------------------------------------+
|            | end         | <end />                            |
+------------+-------------+------------------------------------+
| literal    | char        | <char cp="0061 0062 0063" />       |
+------------+-------------+------------------------------------+
| set        | class       | <class by-ref="class1" />          |
|            |             | <class>0061 0064-0065</class>      |
+------------+-------------+------------------------------------+
| group      | rule        | <rule by-ref="rule1" />            |
|            |             | <rule><any /></rule>               |
+------------+-------------+------------------------------------+
| contextual | anchor      | <anchor />                         |
|            +-------------+------------------------------------+
|            | look-ahead  | <look-ahead><any /></look-ahead>   |
|            +-------------+------------------------------------+
|            | look-behind | <look-behind><any /></look-behind> |
+------------+-------------+------------------------------------+

                         Match Operators

Any element defining an anonymous class can be used as a match operator, including any of the set combination operators (see Section 6.2.5) as well as references to named classes.

All match operators shown as empty elements in the Examples column of the table above do not support child elements of their own; otherwise, match operators MAY be nested. In particular, anonymous "rule" elements can be used for grouping.
6.3.3. The "count" Attribute
The OPTIONAL "count" attribute, when present, specifies the minimally required or maximal permitted number of times a match operator is used to match input. If the "count" attribute is

n     the match operator matches the input exactly n times, where n is 1 or greater.

n+    the match operator matches the input at least n times, where n is 0 or greater.

n:m   the match operator matches the input at least n times, where n is 0 or greater, but matches the input up to m times in total, where m > n. If m = n and n > 0, the match operator matches the input exactly n times.

If there is no "count" attribute, the match operator matches the input exactly once.

In matching, greedy evaluation is used in the sense defined for regular expressions: beyond the required number of times, the input is matched as many times as possible, but not so often as to prevent a match of the remainder of the rule.

A "count" attribute MUST NOT be applied to any element that contains a "name" attribute but MAY be applied to operators such as "class" that declare anonymous classes (including combined classes) or invoke any predefined classes by reference. The "count" attribute MUST NOT be applied to any "class" element, or element defining a combined class, when it is nested inside a combined class.

A "count" attribute MUST NOT be applied to match operators of type "start", "end", "anchor", "look-ahead", or "look-behind" or to any operators, such as "rule" or "choice", that contain a nested instance of them. This limitation applies recursively and irrespective of whether a "rule" element containing these nested instances is declared in place or used by reference.

However, the "count" attribute MAY be applied to any other instances of either an anonymous "rule" element or a "choice" element, including those instances nested inside other match operators. It MAY also be applied to the elements "any" and "char", when used as match operators.
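The three forms of the "count" attribute correspond closely to bounded regular expression quantifiers. As a minimal sketch, the following Python code parses a "count" value into a (minimum, maximum) pair and renders it as a quantifier for Python's re module. The function names are hypothetical, and the mapping onto re is only an illustration of greedy matching, not a statement of how an LGR processor must be built.

```python
import re
from typing import Optional, Tuple

Bounds = Tuple[int, Optional[int]]  # maximum of None means "unbounded"


def _number(text: str) -> int:
    if not (text.isascii() and text.isdigit()):
        raise ValueError(f"not a non-negative integer: {text!r}")
    return int(text)


def parse_count(value: Optional[str] = None) -> Bounds:
    """Turn a "count" attribute value into (minimum, maximum)."""
    if value is None:
        return (1, 1)  # no attribute: match exactly once
    if value.endswith("+"):
        return (_number(value[:-1]), None)
    if ":" in value:
        low_text, high_text = value.split(":", 1)
        low, high = _number(low_text), _number(high_text)
        if high < low or (high == low and low == 0):
            raise ValueError(f"invalid count range: {value!r}")
        return (low, high)
    exact = _number(value)
    if exact < 1:
        raise ValueError("an exact count must be 1 or greater")
    return (exact, exact)


def to_regex_quantifier(bounds: Bounds) -> str:
    low, high = bounds
    return f"{{{low},}}" if high is None else f"{{{low},{high}}}"


# Greedy, but not so greedy as to prevent the remainder from matching:
pattern = "a" + to_regex_quantifier(parse_count("0+")) + "ab"
print(re.fullmatch(pattern, "aaab") is not None)
```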
6.3.4. The "name" and "by-ref" Attributes
Like classes (see Section 6.2.1), rules declared as immediate child elements of the "rules" element MUST be named using a unique "name" attribute, and all other instances MUST NOT be named. Anonymous rules and classes or references to named rules and classes can be nested inside other match operators by reference.

To reference a named rule or class inside a rule or match operator, use a "rule" or "class" element with an OPTIONAL "by-ref" attribute containing the name of the referenced element. It is an error to reference a rule or class for which the complete definition has not been seen. In other words, it is explicitly not possible to define recursive rules or class definitions. The "by-ref" attribute MUST NOT appear in the same element as the "name" attribute or in an element that has any child elements.

The example shows several named classes and a named rule referencing some of them by name.

    <class name="letter" property="gc:L"/>
    <class name="combining-mark" property="gc:M"/>
    <class name="digit" property="gc:Nd" />

    <rule name="letter-grapheme">
        <class by-ref="letter" count="1+"/>
        <class by-ref="combining-mark" count="0+"/>
    </rule>

6.3.5. The "choice" Element
The "choice" element is used to represent a list of two or more alternatives:

    <rule name="ldh">
        <choice count="1+">
            <class by-ref="letter"/>
            <class by-ref="digit"/>
            <char cp="002D" comment="literal HYPHEN"/>
        </choice>
    </rule>

Each child element of a "choice" element represents one alternative. The first matching alternative determines the match for the "choice" element. To express a choice where an alternative itself consists of a sequence of elements, the sequence must be wrapped in an anonymous rule.
6.3.6. Literal Code Point Sequences
A literal code point sequence matches a single code point or a sequence. It is defined by a "char" element, with the code point or sequence to be matched given by the "cp" attribute. When used as a literal, a "char" element MAY contain a "count" attribute in addition to the "cp" attribute and OPTIONAL "comment" or "ref" attributes. No other attributes or child elements are permitted.

6.3.7. The "any" Element
The "any" element is an empty element that matches any single code point. It MAY have a "count" attribute. For an example, see Section 6.3.9.

Unlike a literal, the "any" element MUST NOT have a "ref" attribute.

6.3.8. The "start" and "end" Elements
To match the beginning or end of a label, use the "start" or "end" element. An empty label would match this rule:

    <rule name="empty-label">
        <start/>
        <end/>
    </rule>

Conceptually, whole label rules evaluate the label as a whole, but in practice, many rules do not actually need to be specified to match the entire label. For example, to express a requirement of not starting a label with a digit, a rule needs to describe only the initial part of a label.

This example uses the previously defined rules, together with "start" and "end" elements, to define a rule that requires that an entire label be well-formed. For this example, that means that it must start with a letter and that it contains no leading digits or combining marks nor combining marks placed on digits.

    <rule name="leading-letter" >
        <start />
        <rule by-ref="letter-grapheme" count="1"/>
        <choice count="0+">
            <rule by-ref="letter-grapheme" count="0+"/>
            <class by-ref="digit" count="0+"/>
        </choice>
        <end />
    </rule>
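To make the effect of the "leading-letter" rule concrete, here is a hand-derived Python equivalent. It relies on the observation that the rule accepts exactly those labels that consist only of letters, combining marks, and decimal digits, begin with a letter, and never place a combining mark directly after a digit. The function names are hypothetical, and the General Category values come from Python's bundled Unicode data rather than from a version declared in an LGR.

```python
import unicodedata


def symbol(ch: str) -> str:
    """Classify a character as letter, mark, digit, or other."""
    category = unicodedata.category(ch)
    if category.startswith("L"):
        return "L"  # gc:L
    if category.startswith("M"):
        return "M"  # gc:M
    if category == "Nd":
        return "D"  # gc:Nd
    return "X"      # anything else cannot appear in a matching label


def is_leading_letter(label: str) -> bool:
    """Hand-derived equivalent of the "leading-letter" rule."""
    symbols = [symbol(ch) for ch in label]
    if not symbols or symbols[0] != "L":
        return False  # must start with a letter
    previous = "L"
    for current in symbols[1:]:
        if current == "X":
            return False
        if current == "M" and previous == "D":
            return False  # combining mark placed on a digit
        previous = current
    return True


for candidate in ("a1", "1a", "a\u0301", "a1\u0301"):
    print(repr(candidate), is_leading_letter(candidate))
```

A single left-to-right pass is used instead of a direct translation into nested regular expression quantifiers, because nested repetition of this shape can backtrack heavily on non-matching input.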
Each "start" or "end" element occurs at most once in a rule, except if nested inside a "choice" element in such a way that in matching each alternative at most one occurrence of each is encountered. Otherwise, the result is an error, as is any case where a "start" or "end" element is not encountered as the first or last element to be matched, respectively, in matching a rule.

"start" and "end" elements are empty elements that do not have a "count" attribute or any other attribute other than "comment". It is an error for any match operator enclosing a nested "start" or "end" element to have a "count" attribute.

6.3.9. Example Context Rule from IDNA Specification
This is an example of the WLE rule from [RFC5892] forbidding the mixture of the Arabic-Indic and extended Arabic-Indic digits in the same label. It is implemented as a whole label rule associated with the code point ranges using the "not-when" attribute, which defines an impermissible context. The example also demonstrates several instances of the use of anonymous rules for grouping.

    <data>
        <range first-cp="0660" last-cp="0669"
               not-when="mixed-digits"
               tag="arabic-indic-digits" />
        <range first-cp="06F0" last-cp="06F9"
               not-when="mixed-digits"
               tag="extended-arabic-indic-digits" />
    </data>
    <rules>
        <rule name="mixed-digits">
            <choice>
                <rule>
                    <class from-tag="arabic-indic-digits"/>
                    <any count="0+"/>
                    <class from-tag="extended-arabic-indic-digits"/>
                </rule>
                <rule>
                    <class from-tag="extended-arabic-indic-digits"/>
                    <any count="0+"/>
                    <class from-tag="arabic-indic-digits"/>
                </rule>
            </choice>
        </rule>
    </rules>

As specified in the example, a label containing a code point from either of the two digit ranges is invalid if the label matches the "mixed-digits" rule, that is, any time that a code point from the other range is also present. Note that invalidating the label is not the same as invalidating the definition of the "range" elements; in particular, the definition of the tag values does not depend on the "when" attribute.

6.4. Parameterized Context or When Rules
To recap: When a rule is intended to provide a context for evaluating the validity of a code point or variant mapping, it is invoked by the "when" or "not-when" attributes described in Section 5.2. For "char" and "range" elements, an action implied by a context rule always has a disposition of "invalid" whenever the rule given by the "when" attribute is not matched (see Section 7.5). Conversely, a "not-when" attribute results in a disposition of "invalid" whenever the rule is matched. When a rule is used in this way, it is called a context or "when" rule.

The example in the previous section shows a whole label rule used as a context rule, essentially making the whole label the context. The next sections describe several match operators that can be used to provide a more specific specification of a context, allowing a parameterized context rule. See Section 7 for an alternative method of defining an invalid disposition for a label not matching a whole label rule.

6.4.1. The "anchor" Element
Such parameterized context rules are rules that contain a special placeholder represented by an "anchor" element. As each When Rule is evaluated, if an "anchor" element is present, it is replaced by a literal corresponding to the "cp" attribute of the element containing the "when" (or "not-when") attribute. The match to the "anchor" element must be at the same position in the label as the code point or variant mapping triggering the When Rule.

For example, the Greek lower numeral sign is invalid if not immediately preceding a character in the Greek script. This is most naturally addressed with a parameterized When Rule using "look-ahead":

    <char cp="0375" when="preceding-greek"/>
    ...
    <class name="greek-script" property="sc:Grek"/>
    <rule name="preceding-greek">
        <anchor/>
        <look-ahead>
            <class by-ref="greek-script"/>
        </look-ahead>
    </rule>
In evaluating this rule, the "anchor" element is treated as if it were replaced by a literal

    <char cp="0375"/>

but only the instance of U+0375 at the given position is evaluated. If a label had two instances of U+0375 with the first one matching the rule and the second not, then evaluating the When Rule MUST succeed for the first instance and fail for the second.

Unlike other rules, rules containing an "anchor" element MUST only be invoked via the "when" or "not-when" attributes on code points or variants; otherwise, their "anchor" elements cannot be evaluated. However, it is possible to invoke rules not containing an "anchor" element from a "when" or "not-when" attribute. (See Section 6.4.3.)

The "anchor" element is an empty element, with no attributes permitted except "comment".

6.4.2. The "look-behind" and "look-ahead" Elements
Context rules use the "look-behind" and "look-ahead" elements to define context before and after the code point sequence matched by the "anchor" element. If the "anchor" element is omitted, neither the "look-behind" nor the "look-ahead" element may be present in a rule.
Here is an example of a rule that defines an "initial" context for an Arabic code point:

    <class name="transparent" property="jt:T"/>
    <class name="right-joining" property="jt:R"/>
    <class name="left-joining" property="jt:L"/>
    <class name="dual-joining" property="jt:D"/>
    <class name="non-joining" property="jt:U"/>

    <rule name="Arabic-initial">
        <look-behind>
            <choice>
                <start/>
                <rule>
                    <class by-ref="transparent" count="0+"/>
                    <class by-ref="non-joining"/>
                </rule>
            </choice>
        </look-behind>
        <anchor/>
        <look-ahead>
            <class by-ref="transparent" count="0+" />
            <choice>
                <class by-ref="right-joining" />
                <class by-ref="dual-joining" />
            </choice>
        </look-ahead>
    </rule>

A "when" rule (or context rule) is a named rule that contains any combination of "look-behind", "anchor", and "look-ahead" elements, in that order. Each of these elements occurs at most once, except if nested inside a "choice" element in such a way that in matching each alternative at most one occurrence of each is encountered. Otherwise, the result is undefined. None of these elements takes a "count" attribute, nor does any enclosing match operator; otherwise, the result is undefined.

If a context rule contains a "look-ahead" or "look-behind" element, it MUST contain an "anchor" element. If, because of a "choice" element, a required anchor is not actually encountered, the results are undefined.
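Position-dependent evaluation of an anchored rule can be illustrated with the simpler "preceding-greek" rule from Section 6.4.1. In the Python sketch below, the rule is evaluated once per occurrence of U+0375, with the anchor pinned to that occurrence. The function names are hypothetical. Python's standard library does not expose the Script property, so the sketch substitutes a check on the character name as a loudly approximate stand-in for "sc:Grek"; a few characters whose names begin with GREEK actually have Script=Common.

```python
import unicodedata


def is_greek(ch: str) -> bool:
    # Assumption: name prefix used as a rough stand-in for sc:Grek.
    return unicodedata.name(ch, "").startswith("GREEK")


def preceding_greek(label: str, position: int) -> bool:
    """Evaluate the rule with <anchor/> fixed at `position`."""
    # <look-ahead>: exactly the code point right after the anchor.
    following = label[position + 1 : position + 2]
    return bool(following) and is_greek(following)


def lower_numeral_sign_valid(label: str) -> bool:
    """Every U+0375 must satisfy its "when" rule at its own position."""
    return all(
        preceding_greek(label, index)
        for index, ch in enumerate(label)
        if ch == "\u0375"
    )


label = "\u0375\u03B1\u0375"  # U+0375, GREEK SMALL LETTER ALPHA, U+0375
print(preceding_greek(label, 0), preceding_greek(label, 2))
```

The label in the example has two instances of U+0375: the rule succeeds for the first, which is followed by a Greek letter, and fails for the second, which is followed by nothing, so the label as a whole is not valid.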
6.4.3. Omitting the "anchor" Element
If the "anchor" element is omitted, the evaluation of the context rule is not tied to the position of the code point or sequence associated with the "when" attribute.

According to [RFC5892], the Katakana middle dot is invalid in any label not containing at least one Japanese character anywhere in the label. Because this requirement is independent of the position of the middle dot, the rule does not require an "anchor" element.

    <char cp="30FB" when="japanese-in-label"/>

    <rule name="japanese-in-label">
        <union>
            <class property="sc:Hani"/>
            <class property="sc:Kata"/>
            <class property="sc:Hira"/>
        </union>
    </rule>

The Katakana middle dot is used only with Han, Katakana, or Hiragana. The corresponding When Rule requires that at least one code point in the label be in one of these scripts, but the position of that code point is independent of the location of the middle dot; therefore, no anchor is required. (Note that the Katakana middle dot itself is of script Common, that is, "sc:Zyyy".)
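Without an anchor, the rule reduces to a test over the label as a whole. The Python sketch below shows this for the "japanese-in-label" rule. The function names are hypothetical, and, because Python's standard library lacks the Script property, character name prefixes are used as an approximate stand-in for "sc:Hani", "sc:Kata", and "sc:Hira". The prefixes are chosen so that U+30FB itself (KATAKANA MIDDLE DOT, Script=Common) is not counted; they do not cover every code point of those scripts.

```python
import unicodedata

# Assumption: name prefixes approximating sc:Hani, sc:Kata, sc:Hira.
JAPANESE_NAME_PREFIXES = (
    "CJK UNIFIED IDEOGRAPH",
    "KATAKANA LETTER",
    "HIRAGANA LETTER",
)


def japanese_in_label(label: str) -> bool:
    """No anchor: true if any code point anywhere is in the union."""
    return any(
        unicodedata.name(ch, "").startswith(JAPANESE_NAME_PREFIXES)
        for ch in label
    )


def middle_dot_valid(label: str) -> bool:
    """U+30FB is valid only when its "when" rule matches the label."""
    return "\u30FB" not in label or japanese_in_label(label)


print(middle_dot_valid("\u65E5\u30FB\u672C"), middle_dot_valid("a\u30FBb"))
```

Unlike an anchored rule, the result is the same for every occurrence of the middle dot in a given label, so it needs to be computed only once per label.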