
RFC 7940

Representing Label Generation Rulesets Using XML

Pages: 82
Proposed Standard

6. Whole Label and Context Evaluation

6.1. Basic Concepts

   The "rules" element contains the specification of both context-based
   and whole label rules.  Collectively, these are known as Whole Label
   Evaluation (WLE) rules (Section 6.3).  The "rules" element also
   contains the character classes (Section 6.2) that they depend on, and
   any actions (Section 7) that assign dispositions to labels based on
   rules or variant mappings.

   A whole label rule is applied to the whole label.  It is used to
   validate both original labels and any variant labels computed
   from them.

   A rule implementing a conditional context as discussed in Section 5.2
   does not necessarily apply to the whole label but may be specific to
   the context around a single code point or code point sequence.
   Certain code points in a label sometimes need to satisfy
   context-based rules -- for example, for the label to be considered
   valid, or to satisfy the context for a variant mapping (see the
   description of the "when" attribute in Section 6.4).

   For example, if a rule is referenced in the "when" attribute of a
   variant mapping, it is used to describe the conditional context under
   which the particular variant mapping is defined to exist.

   Each rule is defined in a "rule" element.  A rule may contain the
   following as child elements:

   o  literal code points or code point sequences

   o  character classes, which define sets of code points to be used for
      context comparisons

   o  context operators, which define when character classes and
      literals may appear

   o  nested rules, whether defined in place or invoked by reference

   Collectively, these are called "match operators" and are listed in
   Section 6.3.2.  An LGR containing rules or match operators that

   1.  are incorrectly defined or nested,

   2.  have invalid attributes, or

   3.  have invalid or undefined attribute values

   MUST be rejected.  Note that not all of the constraints defined here
   are validated by the schema.

6.2. Character Classes

   Character classes are sets of characters that often share a
   particular property.  While they function like sets in every way,
   even supporting the usual set operators, they are called "character
   classes" here in a nod to the use of that term in regular expression
   syntax.  (This also avoids confusion with the term "character set" in
   the sense of character encoding.)

   Character classes can be specified in several ways:

   o  by defining the class via matching a tag in the code point data.
      All characters with the same "tag" attribute are part of the same
      class;

   o  by referencing a value of one of the Unicode character properties
      defined in the Unicode Character Database;

   o  by explicitly listing all the code points in the class; or

   o  by defining the class as a set combination of any number of other
      classes.

6.2.1. Declaring and Invoking Named Classes

   A character class has an OPTIONAL "name" attribute consisting of a
   single identifier not containing spaces.  All names for classes must
   be unique.  If the "name" attribute is omitted, the class is
   anonymous and exists only inside the rule or combined class where it
   is defined.  A named character class is defined independently and can
   be referenced by name from within any rules or as part of other
   character class definitions.

       <class name="example" comment="an example class definition">
           0061 4E00
       </class>
       ...
       <rule>
           <class by-ref="example" />
       </rule>

   An empty "class" element with a "by-ref" attribute is a reference to
   an existing named class.  The "by-ref" attribute MUST NOT be used in
   the same "class" element with any of these attributes: "name",
   "from-tag", "property", or "ref".  The "name" attribute MUST be
   present if and only if the class is a direct child element of the
   "rules" element.  It is an error to reference a named class for which
   the definition has not been seen.

6.2.2. Tag-Based Classes

   The "char" or "range" elements that are child elements of the "data"
   element MAY contain a "tag" attribute that consists of one or more
   space-separated tag values; for example:

       <char cp="0061" tag="letter lower"/>
       <char cp="4E00" tag="letter"/>

   This defines two tags for use with code point U+0061, the tag
   "letter" and the tag "lower".  Use

       <class name="letter" from-tag="letter" />
       <class name="lower" from-tag="lower" />

   to define two named character classes, "letter" and "lower",
   containing all code points with the respective tags, the first with
   0061 and 4E00 as elements, and the latter with 0061 but not 4E00 as
   an element.  The "name" attribute may be omitted for an anonymous
   in-place definition of a nested, tag-based class.

   Tag values are typically identifiers, with the addition of a few
   punctuation symbols, such as a colon.  Formally, they MUST correspond
   to the XML 1.0 Nmtoken production.  While a "tag" attribute may
   contain a list of tag values, the "from-tag" attribute MUST always
   contain a single tag value.

   If the document contains no "char" or "range" elements with a
   corresponding tag, the character class represents the empty set.
   This is valid, to allow a common "rules" element to be shared across
   files.  However, it is RECOMMENDED that implementations allow for a
   warning to ensure that referring to an undefined tag in this way is
   intentional.
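   The tag mechanism can be illustrated with a short Python sketch.  It
   is not part of the specification: the "entries" list is a
   hypothetical, already-parsed form of the "char" elements, and the
   function names are invented for illustration.

```python
from collections import defaultdict


def classes_from_tags(entries):
    """Map each tag value to the set of code points that carry it.

    `entries` is a hypothetical pre-parsed form of the "char" elements:
    (code point, "tag" attribute string) pairs.
    """
    by_tag = defaultdict(set)
    for cp, tag_attr in entries:
        # A "tag" attribute holds one or more space-separated tag values.
        for tag in tag_attr.split():
            by_tag[tag].add(cp)
    return by_tag


def from_tag(by_tag, tag):
    """Resolve a "from-tag" attribute, which must hold a single tag value."""
    if len(tag.split()) != 1:
        raise ValueError("from-tag must contain exactly one tag value")
    # A tag that no code point carries yields the empty set, which is valid.
    return set(by_tag.get(tag, ()))


# The two "char" elements from the example above.
entries = [(0x0061, "letter lower"), (0x4E00, "letter")]
by_tag = classes_from_tags(entries)
letter = from_tag(by_tag, "letter")
lower = from_tag(by_tag, "lower")
```

   A real implementation would probably also emit the recommended
   warning when "from-tag" names a tag that no code point carries.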

6.2.3. Unicode Property-Based Classes

   A class is defined in terms of Unicode properties by giving the
   Unicode property alias and the property value or property value
   alias, separated by a colon.

       <class name="virama" property="ccc:9" />

   The example above selects all code points for which the Unicode
   Canonical Combining Class (ccc) value is 9.  This value of the ccc is
   assigned to all code points that encode viramas.

   Unicode property values MUST be designated via a composite of the
   attribute name and value as defined for the property value in
   [UAX42], separated by a colon.  Loose matching of property values and
   names as described in [UAX44] is not appropriate for an XML schema
   and is not supported; it is likewise not supported in the XML
   representation [UAX42] of the Unicode Character Database itself.

   A property-based class MAY be anonymous, or, when defined as an
   immediate child of the "rules" element, it MAY be named to relate a
   formal property definition to its usage, such as the use of the value
   9 for ccc to designate a virama (or halant) in various scripts.

   Unicode properties may, in principle, change between versions of the
   Unicode Standard.  However, the values assigned for a given version
   are fixed.  If Unicode properties are used, a Unicode version MUST be
   declared in the "unicode-version" element in the header.  (Note: Some
   Unicode properties are by definition stable across versions and do
   not change once assigned; see [Unicode-Stability].)

   All implementations processing LGR files SHOULD provide support for
   the following minimal set of Unicode properties:

   o  General Category (gc)

   o  Script (sc)

   o  Canonical Combining Class (ccc)

   o  Bidi Class (bc)

   o  Arabic Joining Type (jt)

   o  Indic Syllabic Category (InSC)

   o  Deprecated (Dep)

   The short name for each property is given in parentheses.

   If a program that is using an LGR to determine the validity of a
   label encounters a property that it does not support, it MUST abort
   with an error.
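
   As a rough illustration of property-based membership tests, the
   Python sketch below covers only the three properties that the
   standard "unicodedata" module exposes directly (gc, ccc, and bc).
   The function name is hypothetical.  Two simplifications are assumed:
   values are compared exactly, so a grouping value such as "gc:L" is
   not handled, and the Unicode version is whatever the Python runtime
   ships rather than the one declared in the LGR header.

```python
import unicodedata

# Property aliases this sketch can evaluate; everything else is rejected.
_GETTERS = {
    "gc": unicodedata.category,                          # General Category
    "ccc": lambda ch: str(unicodedata.combining(ch)),    # Canonical Combining Class
    "bc": unicodedata.bidirectional,                     # Bidi Class
}


def in_property_class(cp, prop_spec):
    """Test whether code point `cp` belongs to a class such as "ccc:9"."""
    alias, sep, value = prop_spec.partition(":")
    if not sep or not alias or not value:
        raise ValueError(f"expected 'alias:value', got {prop_spec!r}")
    if alias not in _GETTERS:
        # Unsupported property: abort with an error, as the text requires.
        raise NotImplementedError(f"unsupported property: {alias}")
    # Exact comparison only; loose matching is not supported.
    return _GETTERS[alias](chr(cp)) == value


is_virama = in_property_class(0x094D, "ccc:9")   # DEVANAGARI SIGN VIRAMA
```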

6.2.4. Explicitly Declared Classes

   A class of code points may also be declared by listing all code
   points that are members of the class.  This is useful when tagging
   cannot be used because code points are not listed individually as
   part of the eligible set of code points for the given LGR -- for
   example, because they only occur in code point sequences.

   To define a class in terms of an explicit list of code points, use a
   space-separated list of hexadecimal code point values:

       <class name="abcd">0061 0062 0063 0064</class>

   This defines a class named "abcd" containing the code points for
   characters "a", "b", "c", and "d".  The ordering of the code points
   is not material, but it is RECOMMENDED to list them in ascending
   order; not doing so makes it unnecessarily difficult for users to
   detect errors such as duplicates or to compare and review these
   classes against other specifications.

   In a class definition, ranges of code points are represented by a
   hexadecimal start and end value separated by a hyphen.  The following
   declaration is equivalent to the preceding:

       <class name="abcd">0061-0064</class>

   Range and code point declarations can be freely intermixed:

       <class name="abcd">0061 0062-0063 0064</class>

   The contents of a class differ from a repertoire in that the latter
   MAY contain sequences as elements, while the former MUST NOT.
   Instead, they closely resemble character classes as found in regular
   expressions.
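
   The equivalence of the three declarations above can be checked with
   a small parser for the text content of a "class" element.  This is a
   minimal sketch; the function name is hypothetical, and rejecting a
   range whose end precedes its start is an assumption of this sketch.

```python
def parse_class_body(text):
    """Parse the text content of a "class" element into a set of code points.

    Accepts space-separated hexadecimal code points and hyphenated ranges.
    """
    members = set()
    for token in text.split():
        first, sep, last = token.partition("-")
        start = int(first, 16)
        end = int(last, 16) if sep else start
        if end < start:
            # Assumption of this sketch: a reversed range is an error.
            raise ValueError(f"range end precedes start: {token}")
        members.update(range(start, end + 1))
    return members


listed = parse_class_body("0061 0062 0063 0064")
ranged = parse_class_body("0061-0064")
mixed = parse_class_body("0061 0062-0063 0064")
```

   Because the result is a set, neither the ordering of the tokens nor
   a duplicated code point changes the class.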

6.2.5. Combined Classes

   Classes may be combined using operators for set complement, union,
   intersection, difference (elements of the first class that are not in
   the second), and symmetric difference (elements in either class but
   not both).  Because classes fundamentally function like sets, the
   union of several character classes is itself a class, for example.

   +-------------------+----------------------------------------------+
   | Logical Operation | Example                                      |
   +-------------------+----------------------------------------------+
   | Complement        | <complement>                                 |
   |                   |   <class by-ref="xxx"/>                      |
   |                   | </complement>                                |
   +-------------------+----------------------------------------------+
   | Union             | <union>                                      |
   |                   |   <class by-ref="class-1"/>                  |
   |                   |   <class by-ref="class-2"/>                  |
   |                   |   <class by-ref="class-3"/>                  |
   |                   | </union>                                     |
   +-------------------+----------------------------------------------+
   | Intersection      | <intersection>                               |
   |                   |   <class by-ref="class-1"/>                  |
   |                   |   <class by-ref="class-2"/>                  |
   |                   | </intersection>                              |
   +-------------------+----------------------------------------------+
   | Difference        | <difference>                                 |
   |                   |   <class by-ref="class-1"/>                  |
   |                   |   <class by-ref="class-2"/>                  |
   |                   | </difference>                                |
   +-------------------+----------------------------------------------+
   | Symmetric         | <symmetric-difference>                       |
   | Difference        |   <class by-ref="class-1"/>                  |
   |                   |   <class by-ref="class-2"/>                  |
   |                   | </symmetric-difference>                      |
   +-------------------+----------------------------------------------+

                              Set Operators

   The elements from this table may be arbitrarily nested inside each
   other, subject to the following restriction: a "complement" element
   MUST contain precisely one "class" or one of the operator elements,
   while an "intersection", "symmetric-difference", or "difference"
   element MUST contain precisely two, and a "union" element MUST
   contain two or more of these elements.

   An anonymous combined class can be defined directly inside a rule or
   any of the match operator elements that allow child elements (see
   Section 6.3.2) by using the set combination as the outer element.

       <rule>
           <union>
               <class by-ref="xxx"/>
               <class by-ref="yyy"/>
           </union>
       </rule>

   The example shows the definition of an anonymous combined class that
   represents the union of classes "xxx" and "yyy".  There is no need to
   wrap this union inside another "class" element, and, in fact, set
   combination elements MUST NOT be nested inside a "class" element.

   Lastly, to create a named combined class that can be referenced in
   other classes or in rules as <class by-ref="xxxyyy"/>, add a "name"
   attribute to the set combination element -- for example,
   <union name="xxxyyy" /> -- and place it at the top level immediately
   below the "rules" element (see Section 6.2.1).

       <rules>
          <union name="xxxyyy">
              <class by-ref="xxx"/>
              <class by-ref="yyy"/>
          </union>
            ...
       </rules>

   Because (as for ordinary sets) a combination of classes is itself a
   class, no matter by what combinations of set operators a combined
   class is created, a reference to it always uses the "class" element
   as described in Section 6.2.1.  That is, a named class is always
   referenced via an empty "class" element using the "by-ref" attribute
   containing the name of the class to be referenced.
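
   The set operators and their operand-count restrictions map directly
   onto ordinary set arithmetic, as the following Python sketch
   suggests.  The names are hypothetical.  The "universe" parameter is
   an assumption of this sketch: this section does not state which set
   a complement is taken against, so the caller has to supply it.

```python
# Allowed number of operands per operator: (minimum, maximum or None).
ARITY = {
    "complement": (1, 1),
    "union": (2, None),
    "intersection": (2, 2),
    "difference": (2, 2),
    "symmetric-difference": (2, 2),
}


def combine(op, operands, universe=None):
    """Evaluate one set combination element over already-resolved classes."""
    lo, hi = ARITY[op]
    n = len(operands)
    if n < lo or (hi is not None and n > hi):
        raise ValueError(f"{op} does not accept {n} operand(s)")
    if op == "complement":
        if universe is None:
            raise ValueError("complement needs a universe in this sketch")
        return set(universe) - operands[0]
    if op == "union":
        return set().union(*operands)
    a, b = operands
    if op == "intersection":
        return a & b
    if op == "difference":
        return a - b          # elements of the first class not in the second
    return a ^ b              # symmetric difference


xxx = {0x61, 0x62}
yyy = {0x62, 0x63}
xxxyyy = combine("union", [xxx, yyy])
```

   Nested combinations are evaluated by resolving the inner elements
   first and passing the resulting sets as operands.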

6.3. Whole Label and Context Rules

Each rule comprises a series of matching operators that must be satisfied in order to determine whether a label meets a given condition. Rules may reference other rules or character classes defined elsewhere in the table.

6.3.1. The "rule" Element

   A matching rule is defined by a "rule" element, the child elements of
   which are one of the match operators from Section 6.3.2.  In
   evaluating a rule, each child element is matched in order.  "rule"
   elements MAY be nested inside each other and inside certain match
   operators.

   A simple rule to match a label where all characters are members of
   some class called "preferred-codepoint":

       <rule name="preferred-label">
           <start />
           <class by-ref="preferred-codepoint" count="1+"/>
           <end />
       </rule>

   Rules are paired with explicit and implied actions, triggering these
   actions when a rule matches a label.  For example, a simple explicit
   action for the rule shown above would be:

       <action disp="allocatable" match="preferred-label" />

   The rule in this example would have the effect of setting the policy
   disposition for a label made up entirely of preferred code points to
   "allocatable".  Explicit actions are further discussed in Section 7
   and implicit actions in Section 7.5.  Another use of rules is in
   defining conditional contexts for code points and variants as
   discussed in Sections 5.2 and 5.3.5.

   A rule that is an immediate child element of the "rules" element MUST
   be named using a "name" attribute containing a single identifier
   string with no spaces.  A named rule may be incorporated into another
   rule by reference and may also be referenced by an "action" element,
   "when" attribute, or "not-when" attribute.  If the "name" attribute
   is omitted, the rule is anonymous and MUST be nested inside another
   rule or match operator.

6.3.2. The Match Operators

   The child elements of a rule are a series of match operators, which
   are listed here by type and name and with a basic example or two.

   +------------+-------------+------------------------------------+
   | Type       | Operator    | Examples                           |
   +------------+-------------+------------------------------------+
   | logical    | any         | <any />                            |
   |            +-------------+------------------------------------+
   |            | choice      | <choice>                           |
   |            |             |  <rule by-ref="alternative1"/>     |
   |            |             |  <rule by-ref="alternative2"/>     |
   |            |             | </choice>                          |
   +--------------------------+------------------------------------+
   | positional | start       | <start />                          |
   |            +-------------+------------------------------------+
   |            | end         | <end />                            |
   +--------------------------+------------------------------------+
   | literal    | char        | <char cp="0061 0062 0063" />       |
   +--------------------------+------------------------------------+
   | set        | class       | <class by-ref="class1" />          |
   |            |             | <class>0061 0064-0065</class>      |
   +--------------------------+------------------------------------+
   | group      | rule        | <rule by-ref="rule1" />            |
   |            |             | <rule><any /></rule>               |
   +--------------------------+------------------------------------+
   | contextual | anchor      | <anchor />                         |
   |            +-------------+------------------------------------+
   |            | look-ahead  | <look-ahead><any /></look-ahead>   |
   |            +-------------+------------------------------------+
   |            | look-behind | <look-behind><any /></look-behind> |
   +--------------------------+------------------------------------+

                             Match Operators

   Any element defining an anonymous class can be used as a match
   operator, including any of the set combination operators (see
   Section 6.2.5) as well as references to named classes.

   All match operators shown as empty elements in the Examples column of
   the table above do not support child elements of their own;
   otherwise, match operators MAY be nested.  In particular, anonymous
   "rule" elements can be used for grouping.

6.3.3. The "count" Attribute

   The OPTIONAL "count" attribute, when present, specifies the minimally
   required or maximally permitted number of times a match operator is
   used to match input.  If the "count" attribute is

   n    the match operator matches the input exactly n times, where n is
        1 or greater.

   n+   the match operator matches the input at least n times, where n
        is 0 or greater.

   n:m  the match operator matches the input at least n times, where n
        is 0 or greater, but matches the input up to m times in total,
        where m > n.  If m = n and n > 0, the match operator matches the
        input exactly n times.

   If there is no "count" attribute, the match operator matches the
   input exactly once.

   In matching, greedy evaluation is used in the sense defined for
   regular expressions: beyond the required number of times, the input
   is matched as many times as possible, but not so often as to prevent
   a match of the remainder of the rule.

   A "count" attribute MUST NOT be applied to any element that contains
   a "name" attribute but MAY be applied to operators such as "class"
   that declare anonymous classes (including combined classes) or invoke
   any predefined classes by reference.  The "count" attribute MUST NOT
   be applied to any "class" element, or element defining a combined
   class, when it is nested inside a combined class.

   A "count" attribute MUST NOT be applied to match operators of type
   "start", "end", "anchor", "look-ahead", or "look-behind" or to any
   operators, such as "rule" or "choice", that contain a nested instance
   of them.  This limitation applies recursively and irrespective of
   whether a "rule" element containing these nested instances is
   declared in place or used by reference.

   However, the "count" attribute MAY be applied to any other instances
   of either an anonymous "rule" element or a "choice" element,
   including those instances nested inside other match operators.  It
   MAY also be applied to the elements "any" and "char", when used as
   match operators.
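
   The three forms of the "count" attribute can be normalized to a
   (minimum, maximum) pair, as in the Python sketch below.  The function
   name is hypothetical.  Rejecting "n:m" when m is smaller than n, and
   rejecting "0:0", reflects this sketch's reading of the constraints
   above rather than wording taken from the text.

```python
import re

# "n", "n+", or "n:m" with plain ASCII decimal digits.
_COUNT_RE = re.compile(r"([0-9]+)(?:(\+)|:([0-9]+))?")


def parse_count(value=None):
    """Return (minimum, maximum) for a "count" attribute value.

    A maximum of None means "no upper bound".  A missing attribute means
    the match operator matches exactly once.
    """
    if value is None:
        return (1, 1)
    m = _COUNT_RE.fullmatch(value)
    if m is None:
        raise ValueError(f"malformed count: {value!r}")
    n = int(m.group(1))
    if m.group(2):                       # "n+": at least n, n may be 0
        return (n, None)
    if m.group(3) is not None:           # "n:m": at least n, at most m
        upper = int(m.group(3))
        if upper < n or (upper == n and n == 0):
            raise ValueError(f"invalid count bounds: {value!r}")
        return (n, upper)
    if n < 1:                            # "n": exactly n, n is 1 or greater
        raise ValueError("an exact count must be 1 or greater")
    return (n, n)


at_least_one = parse_count("1+")
bounded = parse_count("2:4")
```

   The resulting pair plays the same role as a regular expression
   quantifier such as {2,4}, evaluated greedily.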

6.3.4. The "name" and "by-ref" Attributes

   Like classes (see Section 6.2.1), rules declared as immediate child
   elements of the "rules" element MUST be named using a unique "name"
   attribute, and all other instances MUST NOT be named.  Anonymous
   rules and classes or references to named rules and classes can be
   nested inside other match operators by reference.

   To reference a named rule or class inside a rule or match operator,
   use a "rule" or "class" element with an OPTIONAL "by-ref" attribute
   containing the name of the referenced element.  It is an error to
   reference a rule or class for which the complete definition has not
   been seen.  In other words, it is explicitly not possible to define
   recursive rules or class definitions.  The "by-ref" attribute MUST
   NOT appear in the same element as the "name" attribute or in an
   element that has any child elements.

   The example shows several named classes and a named rule referencing
   some of them by name.

       <class name="letter" property="gc:L"/>
       <class name="combining-mark" property="gc:M"/>
       <class name="digit" property="gc:Nd" />

       <rule name="letter-grapheme">
          <class by-ref="letter" count="1+"/>
          <class by-ref="combining-mark" count="0+"/>
       </rule>
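
   The "definition must already have been seen" requirement can be
   checked in a single pass over the children of the "rules" element.
   The Python sketch below is illustrative only: the function name is
   hypothetical, XML namespaces are ignored, and treating rule names and
   class names as two separate name spaces is an assumption of this
   sketch.

```python
import xml.etree.ElementTree as ET


def check_references(rules_xml):
    """Raise ValueError if any "by-ref" names a definition not yet complete."""
    seen = {"class": set(), "rule": set()}
    for top in ET.fromstring(rules_xml):
        # Check every reference inside this top-level element first ...
        for el in top.iter():
            ref = el.get("by-ref")
            if ref is None:
                continue
            kind = "rule" if el.tag == "rule" else "class"
            if ref not in seen[kind]:
                raise ValueError(f"{kind} {ref!r} referenced before definition")
        # ... and only then record its name, so self-reference is rejected.
        name = top.get("name")
        if name is not None:
            kind = "rule" if top.tag == "rule" else "class"
            seen[kind].add(name)
    return seen


EXAMPLE = """
<rules>
  <class name="letter" property="gc:L"/>
  <class name="combining-mark" property="gc:M"/>
  <class name="digit" property="gc:Nd"/>
  <rule name="letter-grapheme">
    <class by-ref="letter" count="1+"/>
    <class by-ref="combining-mark" count="0+"/>
  </rule>
</rules>
"""

names = check_references(EXAMPLE)
```

   Recording a name only after its element has been fully checked is
   what rules out recursive definitions in this sketch.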

6.3.5. The "choice" Element

   The "choice" element is used to represent a list of two or more
   alternatives:

       <rule name="ldh">
          <choice count="1+">
              <class by-ref="letter"/>
              <class by-ref="digit"/>
              <char cp="002D" comment="literal HYPHEN"/>
          </choice>
       </rule>

   Each child element of a "choice" element represents one alternative.
   The first matching alternative determines the match for the "choice"
   element.  To express a choice where an alternative itself consists of
   a sequence of elements, the sequence must be wrapped in an anonymous
   rule.

6.3.6. Literal Code Point Sequences

A literal code point sequence matches a single code point or a sequence. It is defined by a "char" element, with the code point or sequence to be matched given by the "cp" attribute. When used as a literal, a "char" element MAY contain a "count" attribute in addition to the "cp" attribute and OPTIONAL "comment" or "ref" attributes. No other attributes or child elements are permitted.

6.3.7. The "any" Element

The "any" element is an empty element that matches any single code point. It MAY have a "count" attribute. For an example, see Section 6.3.9. Unlike a literal, the "any" element MUST NOT have a "ref" attribute.

6.3.8. The "start" and "end" Elements

   To match the beginning or end of a label, use the "start" or "end"
   element.  An empty label would match this rule:

       <rule name="empty-label">
           <start/>
           <end/>
       </rule>

   Conceptually, whole label rules evaluate the label as a whole, but in
   practice, many rules do not actually need to be specified to match
   the entire label.  For example, to express a requirement of not
   starting a label with a digit, a rule needs to describe only the
   initial part of a label.

   This example uses the previously defined rules, together with "start"
   and "end" elements, to define a rule that requires that an entire
   label be well-formed.  For this example, that means that it must
   start with a letter and that it contains no leading digits or
   combining marks nor combining marks placed on digits.

       <rule name="leading-letter" >
           <start />
           <rule by-ref="letter-grapheme" count="1"/>
           <choice count="0+">
               <rule by-ref="letter-grapheme" count="0+"/>
               <class by-ref="digit" count="0+"/>
           </choice>
           <end />
       </rule>

   Each "start" or "end" element occurs at most once in a rule, except
   if nested inside a "choice" element in such a way that in matching
   each alternative at most one occurrence of each is encountered.
   Otherwise, the result is an error, as is any case where a "start" or
   "end" element is not encountered as the first or last element to be
   matched, respectively, in matching a rule.  "start" and "end"
   elements are empty elements that do not have a "count" attribute or
   any other attribute other than "comment".  It is an error for any
   match operator enclosing a nested "start" or "end" element to have a
   "count" attribute.

6.3.9. Example Context Rule from IDNA Specification

   This is an example of the WLE rule from [RFC5892] forbidding the
   mixture of the Arabic-Indic and extended Arabic-Indic digits in the
   same label.  It is implemented as a whole label rule associated with
   the code point ranges using the "not-when" attribute, which defines
   an impermissible context.  The example also demonstrates several
   instances of the use of anonymous rules for grouping.

       <data>
          <range first-cp="0660" last-cp="0669" not-when="mixed-digits"
                 tag="arabic-indic-digits" />
          <range first-cp="06F0" last-cp="06F9" not-when="mixed-digits"
                 tag="extended-arabic-indic-digits" />
       </data>
       <rules>
          <rule name="mixed-digits">
             <choice>
                <rule>
                   <class from-tag="arabic-indic-digits"/>
                   <any count="0+"/>
                   <class from-tag="extended-arabic-indic-digits"/>
                </rule>
                <rule>
                   <class from-tag="extended-arabic-indic-digits"/>
                   <any count="0+"/>
                   <class from-tag="arabic-indic-digits"/>
                </rule>
             </choice>
          </rule>
       </rules>

   As specified in the example, a label containing a code point from
   either of the two digit ranges is invalid if the label matches the
   "mixed-digits" rule, that is, any time that a code point from the
   other range is also present.  Note that invalidating the label is not
   the same as invalidating the definition of the "range" elements; in
   particular, the definition of the tag values does not depend on the
   "when" attribute.

6.4. Parameterized Context or When Rules

   To recap: When a rule is intended to provide a context for evaluating
   the validity of a code point or variant mapping, it is invoked by the
   "when" or "not-when" attributes described in Section 5.2.  For "char"
   and "range" elements, an action implied by a context rule always has
   a disposition of "invalid" whenever the rule given by the "when"
   attribute is not matched (see Section 7.5).  Conversely, a "not-when"
   attribute results in a disposition of "invalid" whenever the rule is
   matched.  When a rule is used in this way, it is called a context or
   "when" rule.

   The example in the previous section shows a whole label rule used as
   a context rule, essentially making the whole label the context.  The
   next sections describe several match operators that can be used to
   provide a more specific specification of a context, allowing a
   parameterized context rule.  See Section 7 for an alternative method
   of defining an invalid disposition for a label not matching a whole
   label rule.
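
   The difference between "when" and "not-when" can be made concrete
   with the mixed-digits rule of Section 6.3.9.  The Python sketch below
   is an illustration under stated assumptions, not a reference
   implementation: the function names are hypothetical, and the rule is
   reduced to "digits from both ranges occur somewhere in the label",
   which is what its two alternatives amount to.

```python
ARABIC_INDIC = range(0x0660, 0x066A)            # U+0660..U+0669
EXTENDED_ARABIC_INDIC = range(0x06F0, 0x06FA)   # U+06F0..U+06F9


def mixed_digits(label):
    """True if the label contains digits from both ranges, in any order."""
    cps = [ord(ch) for ch in label]
    return (any(cp in ARABIC_INDIC for cp in cps)
            and any(cp in EXTENDED_ARABIC_INDIC for cp in cps))


def implied_invalid(rule_matched, attribute):
    """Disposition implied for a code point carrying a context attribute."""
    if attribute == "when":
        return not rule_matched      # invalid whenever the rule is NOT matched
    if attribute == "not-when":
        return rule_matched          # invalid whenever the rule IS matched
    raise ValueError(f"unknown context attribute: {attribute!r}")


same_range = "\u0661\u0662"      # two Arabic-Indic digits
mixed = "\u0661\u06F2"           # one digit from each range
same_range_invalid = implied_invalid(mixed_digits(same_range), "not-when")
mixed_invalid = implied_invalid(mixed_digits(mixed), "not-when")
```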

6.4.1. The "anchor" Element

   Such parameterized context rules are rules that contain a special
   placeholder represented by an "anchor" element.  As each When Rule is
   evaluated, if an "anchor" element is present, it is replaced by a
   literal corresponding to the "cp" attribute of the element containing
   the "when" (or "not-when") attribute.  The match to the "anchor"
   element must be at the same position in the label as the code point
   or variant mapping triggering the When Rule.

   For example, the Greek lower numeral sign is invalid if not
   immediately preceding a character in the Greek script.  This is most
   naturally addressed with a parameterized When Rule using
   "look-ahead":

       <char cp="0375" when="preceding-greek"/>
       ...
       <class name="greek-script" property="sc:Grek"/>
       <rule name="preceding-greek">
           <anchor/>
           <look-ahead>
               <class by-ref="greek-script"/>
           </look-ahead>
       </rule>

   In evaluating this rule, the "anchor" element is treated as if it was
   replaced by a literal

       <char cp="0375"/>

   but only the instance of U+0375 at the given position is evaluated.
   If a label had two instances of U+0375 with the first one matching
   the rule and the second not, then evaluating the When Rule MUST
   succeed for the first instance and fail for the second.

   Unlike other rules, rules containing an "anchor" element MUST only be
   invoked via the "when" or "not-when" attributes on code points or
   variants; otherwise, their "anchor" elements cannot be evaluated.
   However, it is possible to invoke rules not containing an "anchor"
   element from a "when" or "not-when" attribute.  (See Section 6.4.3.)

   The "anchor" element is an empty element, with no attributes
   permitted except "comment".
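
   Position-specific evaluation of an anchored rule might look like the
   following Python sketch.  The names are hypothetical.  Because the
   Python standard library does not expose the Script property, the
   helper "is_greek_letter" stands in for the "sc:Grek" class with a
   small illustrative subset (the basic Greek capital and small
   letters); a real implementation would use full Script data.

```python
KERAIA = "\u0375"   # GREEK LOWER NUMERAL SIGN


def is_greek_letter(ch):
    """Stand-in for sc:Grek, limited to basic Greek letters (assumption)."""
    cp = ord(ch)
    return 0x0391 <= cp <= 0x03A9 or 0x03B1 <= cp <= 0x03C9


def preceding_greek(label, pos):
    """Evaluate the When Rule with the anchor bound to position `pos`."""
    if label[pos] != KERAIA:
        raise ValueError("anchor position does not hold U+0375")
    nxt = pos + 1
    # look-ahead: the code point right after the anchor must be Greek.
    return nxt < len(label) and is_greek_letter(label[nxt])


def evaluate_all(label):
    """Evaluate the rule separately for every instance of U+0375."""
    return [(i, preceding_greek(label, i))
            for i, ch in enumerate(label) if ch == KERAIA]


# U+0375, GREEK SMALL LETTER ALPHA, U+0375
results = evaluate_all("\u0375\u03B1\u0375")
```

   Each instance is judged on its own: the first U+0375 is followed by a
   Greek letter and satisfies the rule, while the second one, at the end
   of the label, does not.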

6.4.2. The "look-behind" and "look-ahead" Elements

   Context rules use the "look-behind" and "look-ahead" elements to
   define context before and after the code point sequence matched by
   the "anchor" element.  If the "anchor" element is omitted, neither
   the "look-behind" nor the "look-ahead" element may be present in a
   rule.

   Here is an example of a rule that defines an "initial" context for an
   Arabic code point:

       <class name="transparent" property="jt:T"/>
       <class name="right-joining" property="jt:R"/>
       <class name="left-joining" property="jt:L"/>
       <class name="dual-joining" property="jt:D"/>
       <class name="non-joining" property="jt:U"/>
       <rule name="Arabic-initial">
         <look-behind>
           <choice>
             <start/>
             <rule>
               <class by-ref="transparent" count="0+"/>
               <class by-ref="non-joining"/>
             </rule>
           </choice>
         </look-behind>
         <anchor/>
         <look-ahead>
           <class by-ref="transparent" count="0+" />
           <choice>
             <class by-ref="right-joining" />
             <class by-ref="dual-joining" />
           </choice>
         </look-ahead>
       </rule>

   A "when" rule (or context rule) is a named rule that contains any
   combination of "look-behind", "anchor", and "look-ahead" elements, in
   that order.  Each of these elements occurs at most once, except if
   nested inside a "choice" element in such a way that in matching each
   alternative at most one occurrence of each is encountered.
   Otherwise, the result is undefined.  None of these elements takes a
   "count" attribute, nor does any enclosing match operator; otherwise,
   the result is undefined.  If a context rule contains a "look-ahead"
   or "look-behind" element, it MUST contain an "anchor" element.  If,
   because of a "choice" element, a required anchor is not actually
   encountered, the results are undefined.

6.4.3. Omitting the "anchor" Element

   If the "anchor" element is omitted, the evaluation of the context
   rule is not tied to the position of the code point or sequence
   associated with the "when" attribute.

   According to [RFC5892], the Katakana middle dot is invalid in any
   label not containing at least one Japanese character anywhere in the
   label.  Because this requirement is independent of the position of
   the middle dot, the rule does not require an "anchor" element.

       <char cp="30FB" when="japanese-in-label"/>
       <rule name="japanese-in-label">
           <union>
               <class property="sc:Hani"/>
               <class property="sc:Kata"/>
               <class property="sc:Hira"/>
           </union>
       </rule>

   The Katakana middle dot is used only with Han, Katakana, or Hiragana.
   The corresponding When Rule requires that at least one code point in
   the label be in one of these scripts, but the position of that code
   point is independent of the location of the middle dot; therefore, no
   anchor is required.  (Note that the Katakana middle dot itself is of
   script Common, that is, "sc:Zyyy".)
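
   An anchorless context rule reduces to a test on the label as a whole,
   as in the Python sketch below.  The names are hypothetical, and the
   three ranges are an assumed, partial stand-in for the "sc:Hani",
   "sc:Kata", and "sc:Hira" classes (the main Hiragana and Katakana
   letters and the basic CJK Unified Ideographs block); they are not
   complete Script data.

```python
MIDDLE_DOT = "\u30FB"   # KATAKANA MIDDLE DOT, script Common

# Partial stand-ins for the three script classes (assumption).
_JAPANESE_RANGES = (
    range(0x3041, 0x3097),   # Hiragana letters U+3041..U+3096
    range(0x30A1, 0x30FB),   # Katakana letters U+30A1..U+30FA
    range(0x4E00, 0xA000),   # CJK Unified Ideographs U+4E00..U+9FFF
)


def japanese_in_label(label):
    """True if any code point of the label falls in one of the ranges."""
    return any(ord(ch) in r for ch in label for r in _JAPANESE_RANGES)


def middle_dot_context_ok(label):
    """Apply when="japanese-in-label" to every U+30FB in the label."""
    if MIDDLE_DOT not in label:
        return True               # the context rule is never invoked
    # No anchor: the result is the same for every position of U+30FB.
    return japanese_in_label(label)


with_kanji = middle_dot_context_ok("\u65E5\u30FB\u672C")
latin_only = middle_dot_context_ok("a\u30FBb")
```

   The middle dot itself lies outside all three ranges, consistent with
   its script being Common, so a label consisting only of U+30FB fails
   the rule.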

