RFC 3987

Internationalized Resource Identifiers (IRIs)

Pages: 46
Proposed Standard

Part 3 of 3 – Pages 29 to 46

RFC3987 - Page 29

6.  Use of IRIs

6.1.  Limitations on UCS Characters Allowed in IRIs

   This section discusses limitations on characters and character
   sequences usable for IRIs beyond those given in section 2.2 and
   section 4.1.  The considerations in this section are relevant when
   IRIs are created and when URIs are converted to IRIs.

   a.  The repertoire of characters allowed in each IRI component is
       limited by the definition of that component.  For example, the
       definition of the scheme component does not allow characters
       beyond US-ASCII.

       (Note: In accordance with URI practice, generic IRI software
       cannot and should not check for such limitations.)

   b.  The UCS contains many areas of characters for which there are
       strong visual look-alikes.  Because of the likelihood of
       transcription errors, these also should be avoided.  This
       includes the full-width equivalents of Latin characters,
       half-width Katakana characters for Japanese, and many others.  It
       also includes many look-alikes of "space", "delims", and
       "unwise", characters excluded in [RFC3491].

   Additional information is available from [UNIXML].  [UNIXML] is
   written in the context of running text rather than in that of
   identifiers.  Nevertheless, it discusses many of the categories of
   characters not appropriate for IRIs.

6.2.  Software Interfaces and Protocols

   Although an IRI is defined as a sequence of characters, software
   interfaces for URIs typically function on sequences of octets or
   other kinds of code units.  Thus, software interfaces and protocols
   MUST define which character encoding is used.

   Intermediate software interfaces between IRI-capable components and
   URI-only components MUST map the IRIs per section 3.1, when
   transferring from IRI-capable to URI-only components.  This mapping
   SHOULD be applied as late as possible.  It SHOULD NOT be applied
   between components that are known to be able to handle IRIs.
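
   As an illustration only, the mapping step applied at such a boundary
   can be sketched in Python.  The function name iri_to_uri is
   hypothetical.  The sketch assumes the IRI is already available as a
   sequence of Unicode characters, covers only the UTF-8
   percent-encoding step of section 3.1, and does not apply the
   optional ToASCII handling of the ireg-name part.

```python
def iri_to_uri(iri):
    """Percent-encode every non-ASCII character of an IRI as UTF-8 octets.

    Hypothetical helper: existing "%" signs and all US-ASCII characters
    are copied through unchanged.
    """
    out = []
    for ch in iri:
        if ord(ch) < 0x80:
            out.append(ch)
        else:
            # One "%HH" triplet per octet of the UTF-8 encoding.
            out.extend("%%%02X" % octet for octet in ch.encode("utf-8"))
    return "".join(out)


print(iri_to_uri("http://www.example.org/r\u00e9sum\u00e9.html"))
```

   Calling such a function only at the last IRI-capable component keeps
   the mapping "as late as possible", as recommended above.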

6.3.  Format of URIs and IRIs in Documents and Protocols

   Document formats that transport URIs may have to be upgraded to allow
   the transport of IRIs.  In cases where the document as a whole has a
   native character encoding, IRIs MUST also be encoded in this
   character encoding and converted accordingly by a parser or
   interpreter.  IRI characters not expressible in the native character
   encoding SHOULD be escaped by using the escaping conventions of the
   document format if such conventions are available.  Alternatively,
   they MAY be percent-encoded according to section 3.1.  For example, in
   HTML or XML, numeric character references SHOULD be used.  If a
   document as a whole has a native character encoding and that
   character encoding is not UTF-8, then IRIs MUST NOT be placed into
   the document in the UTF-8 character encoding.
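
   A minimal sketch of this rule for HTML or XML, assuming Python and a
   hypothetical helper named embed_iri: characters that the document's
   native encoding can express are written in that encoding, and all
   others become numeric character references.  Markup-level escaping
   of characters such as "&" is outside the scope of the sketch.

```python
def embed_iri(iri, document_encoding):
    """Return the octets to place in a document with the given encoding.

    Hypothetical helper: characters the encoding cannot express are
    replaced by decimal numeric character references.
    """
    return iri.encode(document_encoding, errors="xmlcharrefreplace")


iri = "http://www.example.org/r\u00e9sum\u00e9/\u7d0d\u8c46"
print(embed_iri(iri, "iso-8859-1"))
```

   In this example, the e-acute is written as the single iso-8859-1
   octet 0xE9, while the two characters outside that encoding are
   written as numeric character references.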

   Note: Some formats already accommodate IRIs, although they use
   different terminology.  HTML 4.0 [HTML4] defines the conversion from
   IRIs to URIs as error-avoiding behavior.  XML 1.0 [XML1], XLink
   [XLink], XML Schema [XMLSchema], and specifications based upon them
   allow IRIs.  Also, it is expected that all relevant new W3C formats
   and protocols will be required to handle IRIs [CharMod].

6.4.  Use of UTF-8 for Encoding Original Characters

   This section discusses details and gives examples for point c) in
   section 1.2.  To be able to use IRIs, the URI corresponding to the
   IRI in question has to encode original characters into octets by
   using UTF-8.  This can be specified for all URIs of a URI scheme or
   can apply to individual URIs for schemes that do not specify how to
   encode original characters.  It can apply to the whole URI, or only
   to some part.  For background information on encoding characters into
   URIs, see also section 2.5 of [RFC3986].

   For new URI schemes, using UTF-8 is recommended in [RFC2718].
   Examples where UTF-8 is already used are the URN syntax [RFC2141],
   IMAP URLs [RFC2192], and POP URLs [RFC2384].  On the other hand,
   because the HTTP URL scheme does not specify how to encode original
   characters, only some HTTP URLs can have corresponding but different
   IRIs.

   For example, for a document with a URI of
   "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to
   construct a corresponding IRI (in XML notation, see section 1.4):
   "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for
   the e-acute character, and "%C3%A9" is the UTF-8 encoded and
   percent-encoded representation of that character).  On the other
   hand, for a document with a URI of
   "http://www.example.org/r%E9sum%E9.html", the percent-encoding octets
   cannot be converted to actual characters in an IRI, as the
   percent-encoding is not based on UTF-8.
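
   The distinction between the two URIs above can be sketched in
   Python.  The function name uri_to_iri is hypothetical.  The sketch
   converts only percent-encoded octets of 0x80 and above that form
   valid UTF-8 and leaves every other octet percent-encoded.  It does
   not implement the further exclusions of section 3.2, such as
   characters that are not allowed in IRIs.

```python
import re

# A run of percent-encoded octets in the range 0x80..0xFF.
_HIGH_OCTET_RUN = re.compile(r"(?:%[89A-Fa-f][0-9A-Fa-f])+")


def _decode_run(match):
    raw = bytes.fromhex(match.group(0).replace("%", ""))
    # "surrogateescape" maps each octet that is not part of valid UTF-8
    # to a code point in U+DC80..U+DCFF, so it can be re-encoded below.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            out.append("%%%02X" % (cp - 0xDC00))  # keep percent-encoded
        else:
            out.append(ch)
    return "".join(out)


def uri_to_iri(uri):
    """Hypothetical helper: decode only UTF-8 based percent-encodings."""
    return _HIGH_OCTET_RUN.sub(_decode_run, uri)


print(uri_to_iri("http://www.example.org/r%C3%A9sum%C3%A9.html"))
print(uri_to_iri("http://www.example.org/r%E9sum%E9.html"))
```

   The first call yields the IRI with two e-acute characters; the
   second returns its input unchanged, because "%E9" alone is not a
   valid UTF-8 sequence.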

   This means that for most URI schemes, there is no need to upgrade
   their scheme definition in order for them to work with IRIs.  The
   main case where upgrading makes sense is when a scheme definition, or
   a particular component of a scheme, is strictly limited to the use of
   US-ASCII characters with no provision to include non-ASCII
   characters/octets via percent-encoding, or if a scheme definition
   currently uses highly scheme-specific provisions for the encoding of
   non-ASCII characters.  An example of this is the mailto: scheme
   [RFC2368].

   This specification does not upgrade any scheme specifications in any
   way; this has to be done separately.  Also, note that there is no
   such thing as an "IRI scheme"; all IRIs use URI schemes, and all URI
   schemes can be used with IRIs, even though in some cases only by
   using URIs directly as IRIs, without any conversion.

   URI schemes can impose restrictions on the syntax of scheme-specific
   URIs; i.e., URIs that are admissible under the generic URI syntax
   [RFC3986] may not be admissible due to narrower syntactic constraints
   imposed by a URI scheme specification.  URI scheme definitions cannot
   broaden the syntactic restrictions of the generic URI syntax;
   otherwise, it would be possible to generate URIs that satisfied the
   scheme-specific syntactic constraints without satisfying the
   syntactic constraints of the generic URI syntax.  However, additional
   syntactic constraints imposed by URI scheme specifications are
   applicable to IRIs, as the corresponding URI resulting from the
   mapping defined in section 3.1 MUST be a valid URI under the
   syntactic restrictions of generic URI syntax and any narrower
   restrictions imposed by the corresponding URI scheme specification.

   The requirement for the use of UTF-8 applies to all parts of a URI
   (with the potential exception of the ireg-name part; see section
   3.1).  However, it is possible that the capability of IRIs to
   represent a wide range of characters directly is used just in some
   parts of the IRI (or IRI reference).  The other parts of the IRI may
   only contain US-ASCII characters, or they may not be based on UTF-8.
   They may be based on another character encoding, or they may directly
   encode raw binary data (see also [RFC2397]).

   For example, it is possible to have a URI reference of
   "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the
   document name is encoded in iso-8859-1 based on server settings, but
   where the fragment identifier is encoded in UTF-8 according to
   [XPointer].  The IRI corresponding to the above URI would be (in XML
   notation)
   "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".

   Similar considerations apply to query parts.  The functionality of
   IRIs (namely, to be able to include non-ASCII characters) can only be
   used if the query part is encoded in UTF-8.
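
   One way to see which parts of a URI reference can take advantage of
   IRIs is to test each part separately.  The following Python sketch
   uses the standard urllib.parse module; the function name
   utf8_based_components is hypothetical, and the test (strict UTF-8
   decoding of the percent-decoded octets) is a heuristic rather than
   proof of the encoding that was actually used.

```python
from urllib.parse import urlsplit, unquote_to_bytes


def utf8_based_components(uri_reference):
    """Report, per non-empty component, whether its octets are valid UTF-8."""
    parts = urlsplit(uri_reference)
    report = {}
    for name in ("path", "query", "fragment"):
        value = getattr(parts, name)
        if not value:
            continue
        try:
            unquote_to_bytes(value).decode("utf-8")
            report[name] = True
        except UnicodeDecodeError:
            report[name] = False
    return report


print(utf8_based_components(
    "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9"))
```

   For the URI reference discussed above, the path is reported as not
   UTF-8 based and the fragment as UTF-8 based.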

6.5.  Relative IRI References

   Processing of relative IRI references against a base is handled
   straightforwardly; the algorithms of [RFC3986] can be applied
   directly, treating the characters additionally allowed in IRI
   references in the same way that unreserved characters are in URI
   references.
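
   As a brief illustration, Python's urllib.parse.urljoin, which
   resolves references along the lines of [RFC3986], can be applied to
   character strings that contain non-ASCII characters.  The wrapper
   name resolve_iri_reference and the example IRIs are assumptions made
   for this sketch.

```python
from urllib.parse import urljoin


def resolve_iri_reference(base_iri, reference):
    """Resolve a relative IRI reference against a base IRI."""
    # Non-ASCII characters pass through like unreserved characters.
    return urljoin(base_iri, reference)


base = "http://www.example.org/caf\u00e9/menu.html"
print(resolve_iri_reference(base, "../r\u00e9sum\u00e9.html"))
print(resolve_iri_reference(base, "#r\u00e9sum\u00e9"))
```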

7.  URI/IRI Processing Guidelines (Informative)

   This informative section provides guidelines for supporting IRIs in
   the same software components and operations that currently process
   URIs: Software interfaces that handle URIs, software that allows
   users to enter URIs, software that creates or generates URIs,
   software that displays URIs, formats and protocols that transport
   URIs, and software that interprets URIs.  These may all require
   modification before functioning properly with IRIs.  The
   considerations in this section also apply to URI references and IRI
   references.

7.1.  URI/IRI Software Interfaces

   Software interfaces that handle URIs, such as URI-handling APIs and
   protocols transferring URIs, need interfaces and protocol elements
   that are designed to carry IRIs.

   In case the current handling in an API or protocol is based on
   US-ASCII, UTF-8 is recommended as the character encoding for IRIs, as
   it is compatible with US-ASCII, is in accordance with the
   recommendations of [RFC2277], and makes converting to URIs easy.  In
   any case, the API or protocol definition must clearly define the
   character encoding to be used.

   The transfer from URI-only to IRI-capable components requires no
   mapping, although the conversion described in section 3.2 above may
   be performed.  It is preferable not to perform this inverse
   conversion when there is a chance that this cannot be done correctly.

7.2.  URI/IRI Entry

   Some components allow users to enter URIs into the system by typing
   or dictation, for example.  This software must be updated to allow
   for IRI entry.

   A person viewing a visual representation of an IRI (as a sequence of
   glyphs, in some order, in some visual display) or hearing an IRI will
   use an entry method for characters in the user's language to input
   the IRI.  Depending on the script and the input method used, this may
   be a more or less complicated process.

   The process of IRI entry must ensure, as much as possible, that the
   restrictions defined in section 2.2 are met.  This may be done by
   choosing appropriate input methods or variants/settings thereof, by
   appropriately converting the characters being input, by eliminating
   characters that cannot be converted, and/or by issuing a warning or
   error message to the user.

   As an example of variant settings, input method editors for East
   Asian Languages usually allow the input of Latin letters and related
   characters in full-width or half-width versions.  For IRI input, the
   input method editor should be set so that it produces half-width
   Latin letters and punctuation and full-width Katakana.
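
   Where the input method cannot be configured, an entry component
   might convert the typed characters instead.  The Python sketch below
   uses NFKC normalization from the standard unicodedata module as one
   possible conversion; the function name fold_widths is hypothetical.
   NFKC changes more than character width, and it can turn full-width
   punctuation into characters that act as delimiters, so whether it is
   appropriate for a given entry field is an assumption to verify.

```python
import unicodedata


def fold_widths(typed_text):
    """Map full-width Latin to half-width and half-width Katakana to
    full-width by applying NFKC (which also folds other variants)."""
    return unicodedata.normalize("NFKC", typed_text)


# Full-width Latin "example" becomes plain US-ASCII letters.
print(fold_widths("\uff45\uff58\uff41\uff4d\uff50\uff4c\uff45"))
# Half-width Katakana KA and NA become their full-width forms.
print(fold_widths("\uff76\uff85"))
```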

   An input field primarily or solely used for the input of URIs/IRIs
   may allow the user to view an IRI as it is mapped to a URI.  Places
   where the input of IRIs is frequent may provide the possibility for
   viewing an IRI as mapped to a URI.  This will help users when some of
   the software they use does not yet accept IRIs.

   An IRI input component interfacing to components that handle URIs,
   but not IRIs, must map the IRI to a URI before passing it to these
   components.

   For the input of IRIs with right-to-left characters, please see
   section 4.3.

7.3.  URI/IRI Transfer between Applications

   Many applications, particularly mail user agents, try to detect URIs
   appearing in plain text.  For this, they use some heuristics based on
   URI syntax.  They then allow the user to click on such URIs and
   retrieve the corresponding resource in an appropriate (usually
   scheme-dependent) application.

   Such applications have to be upgraded to use the IRI syntax as a base
   for heuristics.  In particular, a non-ASCII character should not be
   taken as the indication of the end of an IRI.  Such applications also
   have to make sure that they correctly convert the detected IRI from
   the character encoding of the document or application where the IRI
   appears to the character encoding used by the system-wide IRI
   invocation mechanism, or to a URI (according to section 3.1) if the
   system-wide invocation mechanism only accepts URIs.
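
   The effect on detection heuristics can be sketched with two Python
   regular expressions.  Both patterns are simplified assumptions made
   for illustration (they only look for "http" and "https" and do not
   handle trailing punctuation); neither is taken from any particular
   mail user agent.

```python
import re

# Legacy heuristic: stops at the first character outside US-ASCII.
ASCII_ONLY = re.compile(r"https?://[\x21-\x7E]+")

# IRI-aware heuristic: stops only at whitespace and a few delimiters.
IRI_AWARE = re.compile(r"https?://[^\s<>\"]+")


def detect(pattern, text):
    """Return the first match of the pattern in the text, or None."""
    match = pattern.search(text)
    return match.group(0) if match else None


text = "See http://www.example.org/r\u00e9sum\u00e9.html for details"
print(detect(ASCII_ONLY, text))  # truncated at the first e-acute
print(detect(IRI_AWARE, text))   # the complete IRI
```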

   The clipboard is another frequently used way to transfer URIs and
   IRIs from one application to another.  On most platforms, the
   clipboard is able to store and transfer text in many languages and
   scripts.  Correctly used, the clipboard transfers characters, not
   bytes, which will do the right thing with IRIs.

7.4.  URI/IRI Generation

   Systems that offer resources through the Internet, where those
   resources have logical names, sometimes automatically generate URIs
   for the resources they offer.  For example, some HTTP servers can
   generate a directory listing for a file directory and then respond to
   the generated URIs with the files.

   Many legacy character encodings are in use in various file systems.
   Many currently deployed systems do not transform the local character
   representation of the underlying system before generating URIs.

   For maximum interoperability, systems that generate resource
   identifiers should make the appropriate transformations.  For
   example, if a file system contains a file named
   "r&#xE9;sum&#xE9;.html", a server should expose this as
   "r%C3%A9sum%C3%A9.html" in a URI, which allows use of
   "r&#xE9;sum&#xE9;.html" in an IRI, even if locally the file name is
   kept in a character encoding other than UTF-8.
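
   A minimal Python sketch of this transformation follows.  The
   function name uri_segment_for_file is hypothetical, and the sketch
   assumes that the server knows the character encoding in which the
   file system stores names.

```python
import unicodedata
from urllib.parse import quote


def uri_segment_for_file(raw_name, filesystem_encoding):
    """Turn a stored file name into a UTF-8 based URI path segment."""
    # Interpret the stored octets as characters, then normalize to NFC.
    name = unicodedata.normalize("NFC", raw_name.decode(filesystem_encoding))
    # quote() percent-encodes the UTF-8 octets of non-ASCII characters.
    return quote(name, safe="")


# File name kept locally in iso-8859-1.
print(uri_segment_for_file(b"r\xe9sum\xe9.html", "iso-8859-1"))
```

   The result is "r%C3%A9sum%C3%A9.html" regardless of the local
   encoding, provided that encoding is declared correctly.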

   This recommendation particularly applies to HTTP servers.  For FTP
   servers, similar considerations apply; see [RFC2640].

7.5.  URI/IRI Selection

   In some cases, resource owners and publishers have control over the
   IRIs used to identify their resources.  This control is mostly
   executed by controlling the resource names, such as file names,
   directly.

   In these cases, it is recommended to avoid choosing IRIs that are
   easily confused.  For example, for US-ASCII, the lower-case ell ("l")
   is easily confused with the digit one ("1"), and the upper-case oh
   ("O") is easily confused with the digit zero ("0").  Publishers
   should avoid confusing users with "br0ken" or "1ame" identifiers.

   Outside the US-ASCII repertoire, there are many more opportunities
   for confusion; a complete set of guidelines is too lengthy to include
   here.  As long as names are limited to characters from a single
   script, native writers of a given script or language will know best
   when ambiguities can appear, and how they can be avoided.  What may
   look ambiguous to a stranger may be completely obvious to the average
   native user.  On the other hand, in some cases, the UCS contains
   variants for compatibility reasons; for example, for typographic
   purposes.  These should be avoided wherever possible.  Although there
   may be exceptions, newly created resource names should generally be
   in NFKC [UTR15] (which means that they are also in NFC).

   As an example, the UCS contains the "fi" ligature at U+FB01 for
   compatibility reasons.  Wherever possible, IRIs should use the two
   letters "f" and "i" rather than the "fi" ligature.  An example where
   the latter may be used is in the query part of an IRI for an explicit
   search for a word written containing the "fi" ligature.
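
   A publisher could screen new resource names for such compatibility
   variants before creating them.  The Python sketch below is one
   possible check, with the hypothetical function name
   compatibility_variants; it examines characters one at a time and
   therefore does not detect every way in which a name can fail to be
   in NFKC.

```python
import unicodedata


def compatibility_variants(name):
    """List (character, NFKC replacement) pairs for characters that
    NFKC would change."""
    found = []
    for ch in name:
        folded = unicodedata.normalize("NFKC", ch)
        if folded != ch:
            found.append((ch, folded))
    return found


# U+FB01 is the "fi" ligature mentioned above.
print(compatibility_variants("\ufb01le.html"))
print(compatibility_variants("file.html"))
```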

   In certain cases, there is a chance that characters from different
   scripts look the same.  The best known example is the similarity of
   the Latin "A", the Greek "Alpha", and the Cyrillic "A".  To avoid
   such cases, IRIs should be created only where all the characters in a
   single component are used together in a given language.  This usually
   means that all of these characters will be from the same script, but
   there are languages that mix characters from different scripts (such
   as Japanese).  This is similar to the heuristics used to distinguish
   between letters and numbers in the examples above.  Also, for Latin,
   Greek, and Cyrillic, using lowercase letters results in fewer
   ambiguities than using uppercase letters would.
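
   A rough screening aid for this guideline can be sketched in Python.
   The standard unicodedata module does not expose the Unicode script
   property, so the sketch approximates a script by the first word of
   each letter's character name.  This approximation and the function
   name scripts_in are assumptions of the sketch, and a mixed result is
   only a hint, since some languages legitimately mix scripts.

```python
import unicodedata


def scripts_in(component):
    """Return the approximate set of scripts used by the letters of one
    IRI component, based on the first word of each character name."""
    scripts = set()
    for ch in component:
        if ch.isalpha():
            name = unicodedata.name(ch, "")
            scripts.add(name.split(" ")[0] if name else "UNKNOWN")
    return scripts


# Cyrillic "A", Greek "Alpha", and Latin "A" look alike but differ.
print(scripts_in("\u0410\u0391A"))
print(scripts_in("resume"))
```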

7.6.  Display of URIs/IRIs

   In situations where the rendering software is not expected to display
   non-ASCII parts of the IRI correctly using the available layout and
   font resources, these parts should be percent-encoded before being
   displayed.

   For display of Bidi IRIs, please see section 4.1.

7.7.  Interpretation of URIs and IRIs

   Software that interprets IRIs as the names of local resources should
   accept IRIs in multiple forms and convert and match them with the
   appropriate local resource names.

   First, multiple representations include both IRIs in the native
   character encoding of the protocol and also their URI counterparts.

   Second, it may include URIs constructed based on character encodings
   other than UTF-8.  These URIs may be produced by user agents that do
   not conform to this specification and that use legacy character
   encodings to convert non-ASCII characters to URIs.  Whether this is
   necessary, and what character encodings to cover, depends on a number
   of factors, such as the legacy character encodings used locally and
   the distribution of various versions of user agents.  For example,
   software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in
   addition to UTF-8.
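
   The second point can be sketched in Python as follows.  The function
   name candidate_names and the default list of legacy encodings are
   assumptions made for illustration; which encodings to try, if any,
   depends on the factors described above.

```python
from urllib.parse import unquote_to_bytes


def candidate_names(uri_path, legacy_encodings=("shift_jis", "euc_jp")):
    """Return possible resource names for a received URI path, trying
    UTF-8 first and then each legacy encoding."""
    raw = unquote_to_bytes(uri_path)
    candidates = []
    for encoding in ("utf-8",) + tuple(legacy_encodings):
        try:
            name = raw.decode(encoding)
        except UnicodeDecodeError:
            continue  # the octets are not valid in this encoding
        if name not in candidates:
            candidates.append(name)
    return candidates


# The same two-character Japanese name, sent as UTF-8 and as Shift_JIS.
print(candidate_names("%E6%97%A5%E6%9C%AC"))
print(candidate_names("%93%FA%96%7B"))
```

   The server would then match the candidates, in order, against its
   local resource names.  Several candidates may be returned for one
   input, which illustrates why too many mappings make unambiguous
   identification harder.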

   Third, it may include additional mappings to be more user-friendly
   and robust against transmission errors.  These would be similar to
   how some servers currently treat URIs as case insensitive or perform
   additional matching to account for spelling errors.  For characters
   beyond the US-ASCII repertoire, this may, for example, include
   ignoring the accents on received IRIs or resource names.  Please note
   that such mappings, including case mappings, are language dependent.
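
   As one example of such an additional mapping, accents can be
   ignored by decomposing a name and dropping its combining marks.  The
   Python sketch below uses the hypothetical function name
   accent_insensitive_key; as noted above, whether this mapping is
   acceptable depends on the language.

```python
import unicodedata


def accent_insensitive_key(name):
    """Return a matching key with combining marks removed."""
    decomposed = unicodedata.normalize("NFD", name)
    return "".join(ch for ch in decomposed
                   if not unicodedata.combining(ch))


print(accent_insensitive_key("r\u00e9sum\u00e9.html"))
```

   A server could compare the key of a received name with the keys of
   its local resource names when no exact match is found.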

   It can be difficult to identify a resource unambiguously if too many
   mappings are taken into consideration.  However, percent-encoded and
   not percent-encoded parts of IRIs can always be clearly
   distinguished.  Also, the regularity of UTF-8 (see [Duerst97]) makes
   the potential for collisions lower than it may seem at first.

7.8.  Upgrading Strategy

   Where this recommendation places further constraints on software for
   which many instances are already deployed, it is important to
   introduce upgrades carefully and to be aware of the various
   interdependencies.

   If IRIs cannot be interpreted correctly, they should not be created,
   generated, or transported.  This suggests that upgrading URI
   interpreting software to accept IRIs should have highest priority.

   On the other hand, a single IRI is interpreted only by a single or
   very few interpreters that are known in advance, although it may be
   entered and transported very widely.

   Therefore, IRIs benefit most from a broad upgrade of software to be
   able to enter and transport IRIs.  However, before an individual IRI
   is published, care should be taken to upgrade the corresponding
   interpreting software in order to cover the forms expected to be
   received by various versions of entry and transport software.

   The upgrade of generating software to generate IRIs instead of using
   a local character encoding should happen only after the service is
   upgraded to accept IRIs.  Similarly, IRIs should only be generated
   when the service accepts IRIs and the intervening infrastructure and
   protocol is known to transport them safely.

   Software converting from URIs to IRIs for display should be upgraded
   only after upgraded entry software has been widely deployed to the
   population that will see the displayed result.

   Where there is a free choice of character encodings, it is often
   possible to reduce the effort and dependencies for upgrading to IRIs
   by using UTF-8 rather than another encoding.  For example, when a new
   file-based Web server is set up, using UTF-8 as the character
   encoding for file names will make the transition to IRIs easier.
   Likewise, when a new Web form is set up using UTF-8 as the character
   encoding of the form page, the returned query URIs will use UTF-8 as
   the character encoding (unless the user, for whatever reason, changes
   the character encoding) and will therefore be compatible with IRIs.

   These recommendations, when taken together, will allow for the
   extension from URIs to IRIs in order to handle characters other than
   US-ASCII while minimizing interoperability problems.  For
   considerations regarding the upgrade of URI scheme definitions, see
   section 6.4.

8.  Security Considerations

   The security considerations discussed in [RFC3986] also apply to
   IRIs.  In addition, the following issues require particular care for
   IRIs.

   Incorrect encoding or decoding can lead to security problems.  In
   particular, some UTF-8 decoders do not check against overlong byte
   sequences.  As an example, a "/" is encoded with the byte 0x2F both
   in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly
   interpret the sequence 0xC0 0xAF as a "/".  A sequence such as
   "%C0%AF.." may pass some security tests and then be interpreted as
   "/.." in a path if UTF-8 decoders are fault-tolerant, if conversion
   and checking are not done in the right order, and/or if reserved
   characters and unreserved characters are not clearly distinguished.
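
   The ordering issue can be illustrated with a Python sketch.  The
   function name decode_path_strictly is hypothetical.  The sketch
   first removes the percent-encoding, then decodes with a strict UTF-8
   decoder, and only then checks for ".." segments.  It deliberately
   treats a percent-encoded "/" like a literal one, which is a
   conservative assumption and not a general rule for reserved
   characters.

```python
from urllib.parse import unquote_to_bytes


def decode_path_strictly(percent_encoded_path):
    """Return the decoded path, or None if it must be rejected."""
    raw = unquote_to_bytes(percent_encoded_path)
    try:
        # Python's UTF-8 decoder rejects overlong sequences such as
        # 0xC0 0xAF instead of interpreting them as "/".
        path = raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return None
    # The traversal check runs on the decoded characters.
    if ".." in path.split("/"):
        return None
    return path


print(decode_path_strictly("%C0%AF.."))               # rejected
print(decode_path_strictly("%2F.."))                  # rejected
print(decode_path_strictly("r%C3%A9sum%C3%A9.html"))  # accepted
```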

   There are various ways in which "spoofing" can occur with IRIs.
   "Spoofing" means that somebody may add a resource name that looks the
   same or similar to the user, but that points to a different resource.
   The added resource may pretend to be the real resource by looking
   very similar but may contain all kinds of changes that may be
   difficult to spot and that can cause all kinds of problems.  Most
   spoofing possibilities for IRIs are extensions of those for URIs.

   Spoofing can occur for various reasons.  First, a user's
   normalization expectations or actual normalization when entering an
   IRI or transcoding an IRI from a legacy character encoding do not
   match the normalization used on the server side.  Conceptually, this
   is no different from the problems surrounding the use of
   case-insensitive web servers.  For example, a popular web page with a
   mixed-case name ("http://big.example.com/PopularPage.html") might be
   "spoofed" by someone who is able to create
   "http://big.example.com/popularpage.html".  However, the use of
   unnormalized character sequences, and of additional mappings for user
   convenience, may increase the chance for spoofing.  Protocols and
   servers that allow the creation of resources with names that are not
   normalized are particularly vulnerable to such attacks.  This is an
   inherent security problem of the relevant protocol, server, or
   resource and is not specific to IRIs, but it is mentioned here for
   completeness.
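
   A server that lets users create resources could refuse names that
   collide with existing ones after normalization.  The Python sketch
   below is one possible check; the function names spoof_key and
   may_create are hypothetical, and the choice of NFC followed by case
   folding as the comparison key is an assumption, not a requirement of
   this specification.

```python
import unicodedata


def spoof_key(name):
    """Comparison key: NFC normalization followed by case folding."""
    return unicodedata.normalize("NFC", name).casefold()


def may_create(new_name, existing_names):
    """Allow creation only if no existing name has the same key."""
    taken = {spoof_key(name) for name in existing_names}
    return spoof_key(new_name) not in taken


existing = ["PopularPage.html", "r\u00e9sum\u00e9.html"]
print(may_create("popularpage.html", existing))          # False
print(may_create("re\u0301sume\u0301.html", existing))   # False
print(may_create("other.html", existing))                # True
```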

   Spoofing can occur in various IRI components, such as the domain name
   part or a path part.  For considerations specific to the domain name
   part, see [RFC3491].  For the path part, administrators of sites that
   allow independent users to create resources in the same sub area may
   have to be careful to check for spoofing.

   Spoofing can occur because in the UCS many characters look very
   similar.  Details are discussed in Section 7.5.  Again, this is very
   similar to spoofing possibilities on US-ASCII, e.g., using "br0ken"
   or "1ame" URIs.

   Spoofing can occur when URIs with percent-encodings based on various
   character encodings are accepted to deal with older user agents.  In
   some cases, particularly for Latin-based resource names, this is
   usually easy to detect because UTF-8-encoded names, when interpreted
   and viewed as legacy character encodings, produce mostly garbage.
   When concurrently used character encodings have a similar structure
   but there are no characters that have exactly the same encoding,
   detection is more difficult.

   Spoofing can occur with bidirectional IRIs, if the restrictions in
   section 4.2 are not followed.  The same visual representation may be
   interpreted as different logical representations, and vice versa.  It
   is also very important that a correct Unicode bidirectional
   implementation be used.

9.  Acknowledgements

   We would like to thank Larry Masinter for his work as coauthor of
   many earlier versions of this document (draft-masinter-url-i18n-xx).

   The discussion on the issue addressed here started a long time ago.
   There was a thread in the HTML working group in August 1995 (under
   the topic of "Globalizing URIs") and in the www-international mailing
   list in July 1996 (under the topic of "Internationalization and
   URLs"), and there were ad-hoc meetings at the Unicode conferences in
   September 1995 and September 1997.

   Many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding,
   Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim
   Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie
   Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman, Russ Housley,
   Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex Texin, Graham Klyne,
   Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam Costello, Dan
   Oscarson, Elliotte Rusty Harold, Mike J. Brown, Roy Badami, Jonathan
   Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio, Chris
   Haynes, Walter Underwood, and many others for help with understanding
   the issues and possible solutions, and with getting the details
   right.

   This document is a product of the Internationalization Working Group
   (I18N WG) of the World Wide Web Consortium (W3C).  Thanks to the
   members of the W3C I18N Working Group and Interest Group for their
   contributions and their work on [CharMod].  Thanks also go to the
   members of many other W3C Working Groups for adopting IRIs, and to
   the members of the Montreal IAB Workshop on Internationalization and
   Localization for their review.

10.  References

10.1.  Normative References

   [ASCII]        American National Standards Institute, "Coded
                  Character Set -- 7-bit American Standard Code for
                  Information Interchange", ANSI X3.4, 1986.

   [ISO10646]     International Organization for Standardization,
                  "ISO/IEC 10646:2003: Information Technology -
                  Universal Multiple-Octet Coded Character Set (UCS)",
                  ISO Standard 10646, December 2003.

   [RFC2119]      Bradner, S., "Key words for use in RFCs to Indicate
                  Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2234]      Crocker, D. and P. Overell, "Augmented BNF for Syntax
                  Specifications: ABNF", RFC 2234, November 1997.

   [RFC3490]      Faltstrom, P., Hoffman, P., and A. Costello,
                  "Internationalizing Domain Names in Applications
                  (IDNA)", RFC 3490, March 2003.

   [RFC3491]      Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
                  Profile for Internationalized Domain Names (IDN)", RFC
                  3491, March 2003.

   [RFC3629]      Yergeau, F., "UTF-8, a transformation format of ISO
                  10646", STD 63, RFC 3629, November 2003.

   [RFC3986]      Berners-Lee, T., Fielding, R., and L. Masinter,
                  "Uniform Resource Identifier (URI): Generic Syntax",
                  STD 66, RFC 3986, January 2005.

   [UNI9]         Davis, M., "The Bidirectional Algorithm", Unicode
                  Standard Annex #9, March 2004,
                  <http://www.unicode.org/reports/tr9/tr9-13.html>.

   [UNIV4]        The Unicode Consortium, "The Unicode Standard, Version
                  4.0.1, defined by: The Unicode Standard, Version 4.0
                  (Reading, MA, Addison-Wesley, 2003. ISBN
                  0-321-18578-1), as amended by Unicode 4.0.1
                  (http://www.unicode.org/versions/Unicode4.0.1/)",
                  March 2004.

   [UTR15]        Davis, M. and M. Duerst, "Unicode Normalization
                  Forms", Unicode Standard Annex #15, April 2003,
                  <http://www.unicode.org/unicode/reports/
                  tr15/tr15-23.html>.

10.2.  Informative References

   [BidiEx]       "Examples of bidirectional IRIs",
                  <http://www.w3.org/International/iri-edit/
                  BidiExamples>.

   [CharMod]      Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T.
                  Texin, "Character Model for the World Wide Web:
                  Resource Identifiers", World Wide Web Consortium
                  Candidate Recommendation, November 2004,
                  <http://www.w3.org/TR/charmod-resid>.

   [Duerst97]     Duerst, M., "The Properties and Promises of UTF-8",
                  Proc. 11th International Unicode Conference, San Jose,
                  September 1997,
                  <http://www.ifi.unizh.ch/mml/mduerst/papers/
                  PDF/IUC11-UTF-8.pdf>.

   [Gettys]       Gettys, J., "URI Model Consequences",
                  <http://www.w3.org/DesignIssues/ModelConsequences>.

   [HTML4]        Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
                  Specification", World Wide Web Consortium
                  Recommendation, December 1999,
                  <http://www.w3.org/TR/html401/appendix/
                  notes.html#h-B.2>.

   [RFC2045]      Freed, N. and N. Borenstein, "Multipurpose Internet
                  Mail Extensions (MIME) Part One: Format of Internet
                  Message Bodies", RFC 2045, November 1996.

   [RFC2130]      Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
                  Atkinson, R., Crispin, M., and P. Svanberg, "The
                  Report of the IAB Character Set Workshop held 29
                  February - 1 March, 1996", RFC 2130, April 1997.

   [RFC2141]      Moats, R., "URN Syntax", RFC 2141, May 1997.

   [RFC2192]      Newman, C., "IMAP URL Scheme", RFC 2192, September
                  1997.

   [RFC2277]      Alvestrand, H., "IETF Policy on Character Sets and
                  Languages", BCP 18, RFC 2277, January 1998.

   [RFC2368]      Hoffman, P., Masinter, L., and J. Zawinski, "The
                  mailto URL scheme", RFC 2368, July 1998.

   [RFC2384]      Gellens, R., "POP URL Scheme", RFC 2384, August 1998.

   [RFC2396]      Berners-Lee, T., Fielding, R., and L. Masinter,
                  "Uniform Resource Identifiers (URI): Generic Syntax",
                  RFC 2396, August 1998.

   [RFC2397]      Masinter, L., "The "data" URL scheme", RFC 2397,
                  August 1998.

   [RFC2616]      Fielding, R., Gettys, J., Mogul, J., Frystyk, H.,
                  Masinter, L., Leach, P., and T. Berners-Lee,
                  "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616,
                  June 1999.

   [RFC2640]      Curtin, B., "Internationalization of the File Transfer
                  Protocol", RFC 2640, July 1999.

   [RFC2718]      Masinter, L., Alvestrand, H., Zigmond, D., and R.
                  Petke, "Guidelines for new URL Schemes", RFC 2718,
                  November 1999.

   [UNIXML]       Duerst, M. and A. Freytag, "Unicode in XML and other
                  Markup Languages", Unicode Technical Report #20, World
                  Wide Web Consortium Note, June 2003,
                  <http://www.w3.org/TR/unicode-xml/>.

   [XLink]        DeRose, S., Maler, E., and D. Orchard, "XML Linking
                  Language (XLink) Version 1.0", World Wide Web
                  Consortium Recommendation, June 2001,
                  <http://www.w3.org/TR/xlink/#link-locators>.

   [XML1]         Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E.,
                  and F. Yergeau, "Extensible Markup Language (XML) 1.0
                  (Third Edition)", World Wide Web Consortium
                  Recommendation, February 2004,
                  <http://www.w3.org/TR/REC-xml#sec-external-ent>.

   [XMLNamespace] Bray, T., Hollander, D., and A. Layman, "Namespaces in
                  XML", World Wide Web Consortium Recommendation,
                  January 1999, <http://www.w3.org/TR/REC-xml-names>.

   [XMLSchema]    Biron, P. and A. Malhotra, "XML Schema Part 2:
                  Datatypes", World Wide Web Consortium Recommendation,
                  May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>.

   [XPointer]     Grosso, P., Maler, E., Marsh, J., and N. Walsh,
                  "XPointer Framework", World Wide Web Consortium
                  Recommendation, March 2003,
                  <http://www.w3.org/TR/xptr-framework/#escaping>.

Appendix A.  Design Alternatives

   This section briefly summarizes the major design alternatives and
   the reasons why they were not chosen.

Appendix A.1.  New Scheme(s)

   Introducing new schemes (for example, httpi:, ftpi:,...) or a new
   metascheme (e.g., i:, leading to URI/IRI prefixes such as i:http:,
   i:ftp:,...) was proposed to make IRI-to-URI conversion scheme
   dependent or to distinguish between percent-encodings resulting from
   IRI-to-URI conversion and percent-encodings from legacy character
   encodings.

   New schemes are not needed to distinguish URIs from true IRIs (i.e.,
   IRIs that contain non-ASCII characters).  The benefit of being able
   to detect the origin of percent-encodings is marginal, as UTF-8 can
   be detected with very high reliability.  Deploying new schemes is
   extremely hard, so not requiring new schemes for IRIs makes
   deployment of IRIs vastly easier.  Making conversion scheme dependent
   is highly inadvisable and would be encouraged by separate schemes for
   IRIs.  Using a uniform convention for conversion from IRIs to URIs
   makes IRI implementation orthogonal to the introduction of actual new
   schemes.
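
   The claim that UTF-8 can be detected reliably rests on the fact that
   octet sequences produced by legacy encodings are rarely well-formed
   UTF-8.  A minimal sketch in Python follows; the function name
   percent_octets_are_utf8 is illustrative only and is not defined by
   this document.  A URI containing only US-ASCII octets passes
   trivially, so the check is informative only when non-ASCII octets
   are present.

```python
from urllib.parse import unquote_to_bytes


def percent_octets_are_utf8(uri: str) -> bool:
    """Return True if the percent-decoded octets of `uri` are valid UTF-8.

    Illustrative helper (not part of any specification).
    """
    octets = unquote_to_bytes(uri)  # undo percent-encoding, keep raw octets
    try:
        octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# "%C3%BC" is u-umlaut in UTF-8; "%FC" is u-umlaut in iso-8859-1.
print(percent_octets_are_utf8("http://www.w3.org/People/D%C3%BCrst/"))
print(percent_octets_are_utf8("http://www.example.org/D%FCrst"))
```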

Appendix A.2.  Character Encodings Other Than UTF-8

   At an early stage, UTF-7 was considered as an alternative to UTF-8
   when IRIs are converted to URIs.  UTF-7 would not have needed
   percent-encoding and in most cases would have been shorter than
   percent-encoded UTF-8.

   Using UTF-8 avoids a double layering and overloading of the use of
   the "+" character.  UTF-8 is fully compatible with US-ASCII and has
   therefore been recommended by the IETF, and is being used widely.

   UTF-7 has never been used much and is now clearly being discouraged.
   Requiring implementations to convert from UTF-8 to UTF-7 and back
   would be an additional implementation burden.
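
   The trade-off can be made concrete with a small comparison.  The
   following Python sketch encodes one sample string both ways; the
   sample string and variable names are assumptions chosen for
   illustration.

```python
from urllib.parse import quote

name = "D\u00fcrst"  # "Duerst" written with u-umlaut (U+00FC)

# Rejected alternative: UTF-7, which shifts into base64 using "+".
utf7_form = name.encode("utf-7").decode("ascii")

# Chosen approach: UTF-8, with every non-ASCII octet percent-encoded.
utf8_form = quote(name, safe="")

print(utf7_form, len(utf7_form))
print(utf8_form, len(utf8_form))
```

   For this sample the UTF-7 form is shorter, but it depends on "+",
   a character that some URI consumers already treat specially (for
   example, as a space in form-encoded query strings).  That is one
   likely instance of the overloading mentioned above.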

Appendix A.3.  New Encoding Convention

   Instead of using the existing percent-encoding convention of URIs,
   which is based on octets, the idea was to create a new encoding
   convention; for example, to use "%u" to introduce UCS code points.

   Using the existing octet-based percent-encoding mechanism does not
   need an upgrade of the URI syntax and does not need corresponding
   server upgrades.
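
   To illustrate the difference, the sketch below renders the same
   string in both styles.  The exact shape of the "%u" notation was
   never fixed by this document; the four-hex-digit form used by the
   hypothetical function percent_u_form is an assumption made only for
   this example.

```python
from urllib.parse import quote


def percent_u_form(text: str) -> str:
    """Hypothetical "%u" notation: one escape per non-ASCII code point.

    Assumes "%u" followed by at least four uppercase hex digits.
    """
    return "".join(
        ch if ord(ch) < 0x80 else f"%u{ord(ch):04X}" for ch in text
    )


def octet_form(text: str) -> str:
    """Existing convention: one "%HH" escape per UTF-8 octet."""
    return quote(text, safe="")


sample = "D\u00fcrst"
print(percent_u_form(sample))  # code-point based, not valid URI syntax
print(octet_form(sample))      # octet based, valid under existing syntax
```

   Because "u" is not a hexadecimal digit, the first form does not
   match the percent-encoding rule of the URI syntax ("%" followed by
   two hexadecimal digits), which is why it would have required a
   syntax upgrade.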

Appendix A.4.  Indicating Character Encodings in the URI/IRI

   Some proposals suggested indicating the character encodings used in
   a URI or IRI with some new syntactic convention in the URI itself,
   similar to the "charset" parameter for e-mails and Web pages.  As an
   example, the label in square brackets in
   "http://www.example.org/ros[iso-8859-1]&#xE9;" indicated that the
   following "&#xE9;" had to be interpreted as iso-8859-1.

   If UTF-8 is used exclusively, an upgrade to the URI syntax is not
   needed.  Exclusive use of UTF-8 also avoids labels, potentially
   several per identifier, that would have to be copied correctly in
   all cases, even on the side of a bus or on a napkin; such labels
   would cause usability problems (and be prohibitively annoying).
   Finally, using UTF-8 exclusively reduces transcoding errors and
   confusion.
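
   The confusion that arises without a single agreed encoding can be
   sketched as follows.  The same percent-encoded octet is decoded
   under three encodings; the choice of iso-8859-7 as the second legacy
   encoding is an assumption made only for illustration.

```python
from urllib.parse import unquote

escaped = "ros%E9"  # a single octet 0xE9, with no encoding label attached

# Without a label, the receiver has to guess the encoding.
as_latin1 = unquote(escaped, encoding="iso-8859-1")
as_greek = unquote(escaped, encoding="iso-8859-7")

# A lone 0xE9 is not well-formed UTF-8; it becomes U+FFFD here.
as_utf8 = unquote(escaped, encoding="utf-8", errors="replace")

for label, text in [("iso-8859-1", as_latin1),
                    ("iso-8859-7", as_greek),
                    ("utf-8", as_utf8)]:
    print(label, ascii(text))
```

   Each guess yields a different string, so every copy of the
   identifier would have to carry its label.  With UTF-8 as the only
   encoding, no label is needed.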

Authors' Addresses

   Martin Duerst  (Note: Please write "Duerst" with u-umlaut wherever
                  possible, for example as "D&#252;rst" in XML and
                  HTML.)
   World Wide Web Consortium
   5322 Endo
   Fujisawa, Kanagawa  252-8520
   Japan

   Phone: +81 466 49 1170
   Fax:   +81 466 49 1171
   EMail: duerst@w3.org
   URI:   http://www.w3.org/People/D%C3%BCrst/
   (Note: This is the percent-encoded form of an IRI.)


   Michel Suignard
   Microsoft Corporation
   One Microsoft Way
   Redmond, WA  98052
   U.S.A.

   Phone: +1 425 882-8080
   EMail: michelsu@microsoft.com
   URI:   http://www.suignard.com

Full Copyright Statement

   Copyright (C) The Internet Society (2005).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the IETF's procedures with respect to rights in IETF Documents can
   be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.


Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.