RFC 1345

Character Mnemonics and Character Sets

Pages: 103
Informational

Network Working Group                                        K. Simonsen
Request for Comments: 1345                   Rationel Almen Planlaegning
                                                               June 1992


                  Character Mnemonics & Character Sets

Status of the Memo

   This memo provides information for the Internet community.  It does
   not specify an Internet standard.  Distribution of this memo is
   unlimited.

Summary

   This memo lists a selection of characters and their presence in some
   coded character sets. To facilitate the coded character set
   tabulations an unambiguous mnemonic for each character is used, and a
   format for tabulating the coded character sets is defined. The coded
   character sets are given names for easy reference. A family of coded
   character sets called the mnemonic character sets, and conversion
   between these coded character sets without information loss, are
   defined.

   The character set names are registered with the Internet Assigned
   Numbers Authority (IANA).  Additional character sets not described in
   this memo should be registered with the IANA. This memo may be
   updated periodically, or additional specifications may be published,
   to reflect other coded character sets.

   Please send any comments including comments about the accuracy of the
   tables to the author, Keld Simonsen <Keld.Simonsen@dkuug.dk>.

1.  INTRODUCTION

   With the growing internationalization of the Internet, support for
   many coded character sets is required. It is the intention of this
   memo to document precisely the mapping between all characters and
   their corresponding coded representations in various coded character
   sets, and give names to these coded character sets, so they can be
   referenced unambiguously in Internet standards.

   This memo does not indicate anything about the validity of using
   these specifications in any Internet standard, so you should consult
   each individual Internet standard to see which coded character sets
   and names are allowed there.

   Unambiguous character mnemonics are specified, which provide a
   practical way of identifying a character, without reference to a
   coded character set and its code in this coded character set.  The
   mnemonics are written in a minimal set of characters, namely the
   invariant 83 graphical characters of ISO 646, which is a kind of
   greatest common subset to be found between the majority of coded
   character sets, including ASCII, national variants of the ISO 646
   7-bit character set and various EBCDICs.  In addition, the numeric
   values of the coded representations of all these characters are the
   same in all coded character sets compatible with ISO standards.  All
   of them except two, EXCLAMATION MARK and QUOTATION MARK, have the
   same coded representation in all variants of EBCDIC.  This minimal
   set of characters is called the reference character set in this memo.
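
   The invariance claim above can be spot-checked with a short sketch.
   It assumes that the reference character set is the characters listed
   in section 2.1 plus "&" and SPACE (83 in total), and it uses Python's
   built-in codecs as stand-ins for the coded character sets; the codec
   names and the helper differing_characters are illustrative and not
   part of this memo.

```python
import string

# Assumed reconstruction of the reference character set: the characters
# listed in section 2.1, plus "&" and SPACE, giving 83 characters.
REFERENCE_SET = (
    " " + "!\"%&'()*+,-./" + string.digits + ":;<=>?"
    + string.ascii_uppercase + "_" + string.ascii_lowercase
)


def differing_characters(codec_a, codec_b):
    """Return the reference characters whose coded value differs
    between the two named Python codecs."""
    return {
        ch for ch in REFERENCE_SET
        if ch.encode(codec_a) != ch.encode(codec_b)
    }


if __name__ == "__main__":
    # Two ISO-compatible sets, then two EBCDIC code pages.
    print(differing_characters("ascii", "latin_1"))
    print(differing_characters("cp037", "cp500"))
```

   Only two EBCDIC code pages are compared here, so the sketch
   illustrates the claim rather than verifying it for all variants.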

   The mnemonics can be used in Internet standards for easy and
   unambiguous reference, and they can also serve as a fallback
   representation in various Internet specifications.

   The coded character sets covered include all parts of ISO 8859, ISO
   6937-2 and all ISO 646 conforming coded character sets in the ISO
   character set registry managed by ECMA according to ISO 2375.  Almost
   all graphic coded character sets in the ECMA registry (1) are
   covered.  The graphic coded character sets not included are registry
   numbers 31, 38, 39, 53, 59, 68, 71, 72, 129 and 137.  In addition
   many vendor defined character sets are covered, including PC
   codepages (4), (7), (8), many EBCDIC character sets (4), (5), (6) and
   HP, DEC and Apple character sets (8), (9), (10), (13), (14).  The
   East-Asian 16-bit character sets from the ECMA registry are also
   included in this memo.

2.  CHARACTER MNEMONICS

2.1  General Syntax

   The character mnemonics are taken from the ISO committee draft (CD)
   of the POSIX.2 standard (3).  They are classified into two groups:


   1. A group with two-character mnemonics
      - Primarily intended for alphabetic scripts like Latin, Greek,
        Cyrillic, Hebrew and Arabic, and special characters.
   2. A group with variable-length mnemonics
      - Primarily intended for non-alphabetic scripts like Japanese and
        Chinese, but also used for some accented letters and special
        characters.

   In the two-character mnemonics, all invariant graphic characters in
   the ISO 646 character codes except "&" are used, i.e. the following
   characters:

           ! "     %   ' ( ) * + , - . / 0 1 2 3 4 5 6 7 8 9 : ; < = > ?
             A B C D E F G H I J K L M N O P Q R S T U V W X Y Z       _
             a b c d e f g h i j k l m n o p q r s t u v w x y z

   The character "_" is not used as the first character.

   In the variable-length mnemonics, the character "_" is not used as
   the first character. If it is used in a name, each occurrence is
   doubled.
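
   The syntax rules above can be sketched as follows. This is a minimal
   illustration, not a normative definition: the function names are
   hypothetical, and the doubling rule is read here as "each '_' inside
   a variable-length name is written twice".

```python
import string

# Characters permitted in two-character mnemonics, as listed above:
# the invariant graphic characters of ISO 646 except "&".
TWO_CHAR_ALPHABET = set(
    "!\"%'()*+,-./" + string.digits + ":;<=>?"
    + string.ascii_uppercase + "_" + string.ascii_lowercase
)


def is_valid_two_char_mnemonic(mnemonic):
    """Check length, alphabet, and the rule that "_" is never first."""
    return (
        len(mnemonic) == 2
        and all(ch in TWO_CHAR_ALPHABET for ch in mnemonic)
        and mnemonic[0] != "_"
    )


def double_underscores(name):
    """Double every "_" in a variable-length name (assumed reading)."""
    return name.replace("_", "__")
```

   For instance, is_valid_two_char_mnemonic("a!") holds, while "_a" and
   "a&" are rejected.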

   The mnemonics can be used in several different ways for different
   purposes.  One of these is description of coded character sets, which
   is detailed in section 3.  Another is for extending a given coded
   character set to a mnemonic character set.  This is described in
   section 4.  The restrictions on the use of the characters "&" and "_"
   are due to demands of the compositional methods of these techniques.

2.2  ISO Official Long Descriptive Character Name

   For each mnemonic, the character for which it stands is indicated in
   the following table by a long descriptive name.  This name is
   identical to the ISO name of the character as given in reference (2).
   For a few characters that are not included there, descriptive names
   of the same kind are introduced in this memo.  The source of each
   character is stated in the table after the name and should be
   consulted for a reliable identification of the character.

   These long descriptive names consist only of the capital Latin
   letters of the invariant part of ISO 646, the digits, "-", and SPACE.
   Digits are only used in names of ideographic and Hangul characters
   and never as the first character.
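
   The lexical constraints on long descriptive names can be expressed as
   a small check. This sketch covers only the character repertoire and
   the rule that a digit is never the first character; it does not
   check that digits appear only in ideographic and Hangul names. The
   function name is hypothetical.

```python
import re

# Capital Latin letters, digits, "-", and SPACE only.
_NAME_CHARS = re.compile(r"[A-Z0-9 -]+")


def is_valid_long_name(name):
    """Return True if the name uses only the permitted characters
    and does not begin with a digit."""
    if not _NAME_CHARS.fullmatch(name):
        return False
    return not name[0].isdigit()
```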

2.3  The 2-character Mnemonics

   The two-character mnemonics include various accented Latin letters,
   Greek, Cyrillic, Hebrew, Arabic, Hiragana and Katakana.  Also a fair
   number of special characters are included.  Almost all ISO or ISO
   registered 7- and 8-bit graphical coded character sets are covered
   with these two-character mnemonics.

   The two characters are chosen so the graphical appearance in the
   reference set resembles as much as possible (within the possibilities
   available) the graphical appearance of the character. The basic
   character set of ISO 646 is used as the reference set, as mentioned
   above.

   The characters in the reference character set are chosen to represent
   themselves.

   For control characters from ISO 646 the two-character acronyms of ISO
   2047 are used as mnemonics.  For the other control characters of ISO
   6429, two-character mnemonics have been selected based on the
   variable-length acronyms used in that standard.

   Letters, including Greek, Cyrillic, Arabic and Hebrew, are
   represented with the base letter as the first letter, and the second
   letter represents an accent or relation to a non-Latin script.  Non-
   Latin letters are transliterated to Latin letters, following
   transliteration standards as closely as possible.  This is also done
   with Latin letters such as ETH and THORN, and the
   Danish/Norwegian/Swedish letter A WITH RING ABOVE is transliterated
   into "aa".

   After a letter, the second character signifies the following:

     Exclamation mark           ! Grave
     Apostrophe                 ' Acute accent
     Greater-Than sign          > Circumflex accent
     Question Mark              ? Tilde
     Hyphen-Minus               - Macron
     Left parenthesis           ( Breve
     Full Stop                  . Dot Above
     Colon                      : Diaeresis
     Comma                      , Cedilla
     Underline                  _ Underline
     Solidus                    / Stroke
     Quotation mark             " Double acute accent
     Semicolon                  ; Ogonek
     Less-Than sign             < Caron
     Zero                       0 Ring above
     Two                        2 Hook
     Nine                       9 Horn

     Equals                     = Cyrillic
     Asterisk                   * Greek
     Percent sign               % Greek/Cyrillic special
     Plus                       + smalls: Arabic, capitals: Hebrew
     Three                      3 some Latin/Greek/Cyrillic letters
     Four                       4 Bopomofo
     Five                       5 Hiragana
     Six                        6 Katakana
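
   The table above can be held as a simple lookup. The sketch below is
   illustrative only: the names SECOND_CHAR_MEANING and
   second_char_meaning are hypothetical, and the lookup reports what
   the table says about the second character; it does not confirm that
   a given pair is actually assigned as a mnemonic.

```python
import string

# Meaning of the second character after a letter, per the table above.
SECOND_CHAR_MEANING = {
    "!": "Grave", "'": "Acute accent", ">": "Circumflex accent",
    "?": "Tilde", "-": "Macron", "(": "Breve", ".": "Dot Above",
    ":": "Diaeresis", ",": "Cedilla", "_": "Underline", "/": "Stroke",
    '"': "Double acute accent", ";": "Ogonek", "<": "Caron",
    "0": "Ring above", "2": "Hook", "9": "Horn",
    "=": "Cyrillic", "*": "Greek", "%": "Greek/Cyrillic special",
    "3": "some Latin/Greek/Cyrillic letters", "4": "Bopomofo",
    "5": "Hiragana", "6": "Katakana",
}


def second_char_meaning(mnemonic):
    """Return the table entry for the second character of a
    letter-initial two-character mnemonic, or None if not listed."""
    if len(mnemonic) != 2 or mnemonic[0] not in string.ascii_letters:
        return None
    first, second = mnemonic
    if second == "+":
        # "+" marks Arabic after a small letter, Hebrew after a capital.
        return "Arabic" if first.islower() else "Hebrew"
    return SECOND_CHAR_MEANING.get(second)
```

   For example, second_char_meaning("a!") gives "Grave", and "A+" gives
   "Hebrew".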

   In designing the mnemonics the following special characters were
   reserved: The ampersand is reserved as an intro character, indicating
   that the following string is in the mnemonic character set.  The
   underline character is reserved for the variable-length mnemonics.
   This use does not eliminate usage as an accent or language
   identifier.

   Special characters are encoded with some mnemonic value.  These are
   not systematic throughout, but most mnemonics start with a related
   special character of the reference set.

2.4  The Variable-length Character Mnemonics

   The Variable-length Character Mnemonics are primarily meant for the
   ideographic characters in larger Asian character sets, but are also
   used for accented characters with several accents and some special
   characters. To keep the mnemonics as short as possible, which both
   saves storage and eases input, a quite short name is preferred. In
   the Chinese standard GB 2312-1980, the Japanese standards JIS X0208
   and JIS X0212, and the Korean standard KS C 5601, characters are all
   identified by row and column numbers between 1 and 94. So two
   positions each for row and column and a character set identifier of
   one character would be almost as short as possible.
   The following character set identifiers are defined:

            c   GB 2312-1980
            j   JIS X0208-1990
            J   JIS X0212-1990
            k   KS C 5601-1987

   This system for the representation of ideographic characters and
   Hangul characters is not truly mnemonic, but it provides short
   representations that are easy to connect to the corresponding
   character by means of the code table of an official character set
   standard. Alternative methods based on the graphic appearance or the
   pronunciation of the characters are thought to be unfeasible.

   One prominent character in the reference character set is reserved
   for identifying variable-length mnemonics, namely the underline
   character "_". This character is intended as a delimiter both at the
   front and at the end of the mnemonic. An example of its use, with
   "&" as the intro character, would be:

             &_j3210_ &_j4436_&_j6530_
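
   A mnemonic of this form can be split into its character set
   identifier, row and column. The following is a minimal sketch under
   the assumption that each mnemonic has the shape "&_" + identifier +
   two-digit row + two-digit column + "_"; the function name is
   hypothetical and other variable-length mnemonics are ignored.

```python
import re

# Character set identifiers defined above.
CHARSET_IDS = {
    "c": "GB 2312-1980",
    "j": "JIS X0208-1990",
    "J": "JIS X0212-1990",
    "k": "KS C 5601-1987",
}

# Intro "&", delimiter "_", identifier, row, column, delimiter "_".
_MNEMONIC = re.compile(r"&_([cjJk])([0-9]{2})([0-9]{2})_")


def parse_row_column_mnemonics(text):
    """Return (character set, row, column) for each mnemonic in text
    whose row and column both lie between 1 and 94."""
    results = []
    for match in _MNEMONIC.finditer(text):
        row, column = int(match.group(2)), int(match.group(3))
        if 1 <= row <= 94 and 1 <= column <= 94:
            results.append((CHARSET_IDS[match.group(1)], row, column))
    return results
```

   Applied to the example above, the first mnemonic yields the tuple
   ("JIS X0208-1990", 32, 10).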
