
RFC 3987

Internationalized Resource Identifiers (IRIs)

Pages: 46
Proposed Standard
Errata
Part 3 of 3 – Pages 29 to 46

Top   ToC   RFC3987 - Page 29   prevText

6. Use of IRIs

6.1. Limitations on UCS Characters Allowed in IRIs

This section discusses limitations on characters and character sequences usable for IRIs beyond those given in section 2.2 and section 4.1. The considerations in this section are relevant when IRIs are created and when URIs are converted to IRIs.

a. The repertoire of characters allowed in each IRI component is limited by the definition of that component. For example, the definition of the scheme component does not allow characters beyond US-ASCII. (Note: In accordance with URI practice, generic IRI software cannot and should not check for such limitations.)

b. The UCS contains many areas of characters for which there are strong visual look-alikes. Because of the likelihood of transcription errors, these also should be avoided. This includes the full-width equivalents of Latin characters, half-width Katakana characters for Japanese, and many others. It also includes many look-alikes of "space", "delims", and "unwise", characters excluded in [RFC3491].

Additional information is available from [UNIXML]. [UNIXML] is written in the context of running text rather than in that of identifiers. Nevertheless, it discusses many of the categories of characters not appropriate for IRIs.

6.2. Software Interfaces and Protocols

Although an IRI is defined as a sequence of characters, software interfaces for URIs typically function on sequences of octets or other kinds of code units. Thus, software interfaces and protocols MUST define which character encoding is used. Intermediate software interfaces between IRI-capable components and URI-only components MUST map the IRIs per section 3.1, when transferring from IRI-capable to URI-only components. This mapping SHOULD be applied as late as possible. It SHOULD NOT be applied between components that are known to be able to handle IRIs.
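
As a rough illustration of the mapping an intermediate interface would apply when handing an IRI to a URI-only component, the following Python sketch percent-encodes the UTF-8 octets of every non-ASCII character. The function name iri_to_uri is hypothetical, and the sketch assumes the IRI is already available as a Unicode string. It does not cover the other parts of section 3.1, such as any special treatment of the ireg-name component.

```python
def iri_to_uri(iri: str) -> str:
    """Percent-encode each non-ASCII character as its UTF-8 octets.

    Simplified sketch: ASCII characters, including percent-encodings
    that are already present, are passed through unchanged.
    """
    pieces = []
    for ch in iri:
        if ord(ch) < 0x80:
            pieces.append(ch)
        else:
            # One "%XX" triplet per UTF-8 octet of the character.
            pieces.extend(f"%{octet:02X}" for octet in ch.encode("utf-8"))
    return "".join(pieces)


# "\u00e9" is the e-acute character.
print(iri_to_uri("http://www.example.org/r\u00e9sum\u00e9.html"))
# prints http://www.example.org/r%C3%A9sum%C3%A9.html
```

Applying the function only at the boundary to a URI-only component, and not between IRI-capable components, matches the "as late as possible" guidance above.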

6.3. Format of URIs and IRIs in Documents and Protocols

Document formats that transport URIs may have to be upgraded to allow the transport of IRIs. In cases where the document as a whole has a native character encoding, IRIs MUST also be encoded in this character encoding and converted accordingly by a parser or interpreter. IRI characters not expressible in the native character encoding SHOULD be escaped by using the escaping conventions of the document format if such conventions are available. Alternatively, they MAY be percent-encoded according to section 3.1. For example, in HTML or XML, numeric character references SHOULD be used. If a document as a whole has a native character encoding and that character encoding is not UTF-8, then IRIs MUST NOT be placed into the document in the UTF-8 character encoding. Note: Some formats already accommodate IRIs, although they use different terminology. HTML 4.0 [HTML4] defines the conversion from IRIs to URIs as error-avoiding behavior. XML 1.0 [XML1], XLink [XLink], XML Schema [XMLSchema], and specifications based upon them allow IRIs. Also, it is expected that all relevant new W3C formats and protocols will be required to handle IRIs [CharMod].
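
The escaping rule for HTML or XML can be sketched in Python as follows. The function name embed_iri is hypothetical. The sketch relies on the standard "xmlcharrefreplace" codec error handler, which emits decimal numeric character references for characters the target encoding cannot express. It is an illustration only and does not perform the other escaping a document format requires (for example, of "&" in attribute values).

```python
def embed_iri(iri: str, document_encoding: str) -> bytes:
    """Encode an IRI in the document's native character encoding.

    Characters the encoding cannot express become numeric character
    references (&#NNNN;) rather than UTF-8 octets.
    """
    return iri.encode(document_encoding, errors="xmlcharrefreplace")


# e-acute exists in iso-8859-1 and is stored as the single octet 0xE9;
# the two CJK characters do not exist there and are escaped.
print(embed_iri("http://www.example.org/r\u00e9sum\u00e9.html", "iso-8859-1"))
print(embed_iri("http://example.org/\u65e5\u672c", "iso-8859-1"))
```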

6.4. Use of UTF-8 for Encoding Original Characters

This section discusses details and gives examples for point c) in section 1.2. To be able to use IRIs, the URI corresponding to the IRI in question has to encode original characters into octets by using UTF-8. This can be specified for all URIs of a URI scheme or can apply to individual URIs for schemes that do not specify how to encode original characters. It can apply to the whole URI, or only to some part. For background information on encoding characters into URIs, see also section 2.5 of [RFC3986].

For new URI schemes, using UTF-8 is recommended in [RFC2718]. Examples where UTF-8 is already used are the URN syntax [RFC2141], IMAP URLs [RFC2192], and POP URLs [RFC2384]. On the other hand, because the HTTP URL scheme does not specify how to encode original characters, only some HTTP URLs can have corresponding but different IRIs.

For example, for a document with a URI of "http://www.example.org/r%C3%A9sum%C3%A9.html", it is possible to construct a corresponding IRI (in XML notation, see section 1.4): "http://www.example.org/r&#xE9;sum&#xE9;.html" ("&#xE9;" stands for the e-acute character, and "%C3%A9" is the UTF-8 encoded and percent-encoded representation of that character). On the other hand, for a document with a URI of
   "http://www.example.org/r%E9sum%E9.html", the percent-encoding octets
   cannot be converted to actual characters in an IRI, as the
   percent-encoding is not based on UTF-8.
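
   The difference between the two URIs can be made concrete with a
   simplified Python sketch of URI-to-IRI conversion.  The names
   uri_to_iri, _sequence_length, and _convert_run are hypothetical.  The
   sketch converts only percent-encoded octets that form valid
   non-ASCII UTF-8 sequences and leaves everything else untouched; it
   omits the further checks of section 3.2 (for example, characters
   that are not appropriate in an IRI).

```python
import re

# A run of one or more consecutive percent-encoded octets.
_PCT_RUN = re.compile(r"(?:%[0-9A-Fa-f]{2})+")


def _sequence_length(lead: int) -> int:
    """Length of the UTF-8 sequence a lead octet announces (0 = keep encoded)."""
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # ASCII, continuation octet, or invalid lead octet


def _convert_run(match):
    text = match.group(0)
    octets = bytes.fromhex(text.replace("%", ""))
    out, i = [], 0
    while i < len(octets):
        n = _sequence_length(octets[i])
        chunk = octets[i:i + n]
        try:
            if n == 0 or len(chunk) < n:
                raise ValueError("not a complete multi-octet sequence")
            out.append(chunk.decode("utf-8"))  # strict: rejects overlong forms
            i += n
        except ValueError:  # UnicodeDecodeError is a subclass of ValueError
            out.append(text[3 * i:3 * i + 3])  # keep the original "%XX"
            i += 1
    return "".join(out)


def uri_to_iri(uri: str) -> str:
    """Replace valid percent-encoded UTF-8 sequences by the characters they encode."""
    return _PCT_RUN.sub(_convert_run, uri)


print(uri_to_iri("http://www.example.org/r%C3%A9sum%C3%A9.html"))  # converted
print(uri_to_iri("http://www.example.org/r%E9sum%E9.html"))        # unchanged
```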

   This means that for most URI schemes, there is no need to upgrade
   their scheme definition in order for them to work with IRIs.  The
   main case where upgrading makes sense is when a scheme definition, or
   a particular component of a scheme, is strictly limited to the use of
   US-ASCII characters with no provision to include non-ASCII
   characters/octets via percent-encoding, or if a scheme definition
   currently uses highly scheme-specific provisions for the encoding of
   non-ASCII characters.  An example of this is the mailto: scheme
   [RFC2368].

   This specification does not upgrade any scheme specifications in any
   way; this has to be done separately.  Also, note that there is no
   such thing as an "IRI scheme"; all IRIs use URI schemes, and all URI
   schemes can be used with IRIs, even though in some cases only by
   using URIs directly as IRIs, without any conversion.

   URI schemes can impose restrictions on the syntax of scheme-specific
   URIs; i.e., URIs that are admissible under the generic URI syntax
   [RFC3986] may not be admissible due to narrower syntactic constraints
   imposed by a URI scheme specification.  URI scheme definitions cannot
   broaden the syntactic restrictions of the generic URI syntax;
   otherwise, it would be possible to generate URIs that satisfied the
   scheme-specific syntactic constraints without satisfying the
   syntactic constraints of the generic URI syntax.  However, additional
   syntactic constraints imposed by URI scheme specifications are
   applicable to IRI, as the corresponding URI resulting from the
   mapping defined in section 3.1 MUST be a valid URI under the
   syntactic restrictions of generic URI syntax and any narrower
   restrictions imposed by the corresponding URI scheme specification.

   The requirement for the use of UTF-8 applies to all parts of a URI
   (with the potential exception of the ireg-name part; see section
   3.1).  However, it is possible that the capability of IRIs to
   represent a wide range of characters directly is used just in some
   parts of the IRI (or IRI reference).  The other parts of the IRI may
   only contain US-ASCII characters, or they may not be based on UTF-8.
   They may be based on another character encoding, or they may directly
   encode raw binary data (see also [RFC2397]).

   For example, it is possible to have a URI reference of
   "http://www.example.org/r%E9sum%E9.xml#r%C3%A9sum%C3%A9", where the
   document name is encoded in iso-8859-1 based on server settings, but
   where the fragment identifier is encoded in UTF-8 according to
   [XPointer]. The IRI corresponding to the above URI would be (in XML
   notation)
   "http://www.example.org/r%E9sum%E9.xml#r&#xE9;sum&#xE9;".

   Similar considerations apply to query parts.  The functionality of
   IRIs (namely, to be able to include non-ASCII characters) can only be
   used if the query part is encoded in UTF-8.

6.5. Relative IRI References

Processing of relative IRI references against a base is handled straightforwardly; the algorithms of [RFC3986] can be applied directly, treating the characters additionally allowed in IRI references in the same way that unreserved characters are in URI references.
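
As a small illustration, Python's urllib.parse.urljoin, which implements reference resolution for URLs, can be applied to strings containing non-ASCII characters. The wrapper name resolve_iri_reference is hypothetical, and the sketch assumes that urljoin passes non-ASCII path and fragment characters through unchanged, treating them like any other non-delimiter character.

```python
from urllib.parse import urljoin


def resolve_iri_reference(base: str, reference: str) -> str:
    """Resolve a relative IRI reference against a base IRI."""
    return urljoin(base, reference)


base = "http://www.example.org/docs/r\u00e9sum\u00e9.html"

# A relative path containing non-ASCII characters.
print(resolve_iri_reference(base, "../\u65e5\u672c\u8a9e.html"))

# A fragment-only reference keeps the base path.
print(resolve_iri_reference(base, "#r\u00e9sum\u00e9"))
```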

7. URI/IRI Processing Guidelines (Informative)

This informative section provides guidelines for supporting IRIs in the same software components and operations that currently process URIs: Software interfaces that handle URIs, software that allows users to enter URIs, software that creates or generates URIs, software that displays URIs, formats and protocols that transport URIs, and software that interprets URIs. These may all require modification before functioning properly with IRIs. The considerations in this section also apply to URI references and IRI references.

7.1. URI/IRI Software Interfaces

Software interfaces that handle URIs, such as URI-handling APIs and protocols transferring URIs, need interfaces and protocol elements that are designed to carry IRIs. In case the current handling in an API or protocol is based on US-ASCII, UTF-8 is recommended as the character encoding for IRIs, as it is compatible with US-ASCII, is in accordance with the recommendations of [RFC2277], and makes converting to URIs easy. In any case, the API or protocol definition must clearly define the character encoding to be used. The transfer from URI-only to IRI-capable components requires no mapping, although the conversion described in section 3.2 above may be performed. It is preferable not to perform this inverse conversion when there is a chance that this cannot be done correctly.

7.2. URI/IRI Entry

Some components allow users to enter URIs into the system by typing or dictation, for example. This software must be updated to allow for IRI entry. A person viewing a visual representation of an IRI (as a sequence of glyphs, in some order, in some visual display) or hearing an IRI will use an entry method for characters in the user's language to input the IRI. Depending on the script and the input method used, this may be a more or less complicated process. The process of IRI entry must ensure, as much as possible, that the restrictions defined in section 2.2 are met. This may be done by choosing appropriate input methods or variants/settings thereof, by appropriately converting the characters being input, by eliminating characters that cannot be converted, and/or by issuing a warning or error message to the user. As an example of variant settings, input method editors for East Asian Languages usually allow the input of Latin letters and related characters in full-width or half-width versions. For IRI input, the input method editor should be set so that it produces half-width Latin letters and punctuation and full-width Katakana. An input field primarily or solely used for the input of URIs/IRIs may allow the user to view an IRI as it is mapped to a URI. Places where the input of IRIs is frequent may provide the possibility for viewing an IRI as mapped to a URI. This will help users when some of the software they use does not yet accept IRIs. An IRI input component interfacing to components that handle URIs, but not IRIs, must map the IRI to a URI before passing it to these components. For the input of IRIs with right-to-left characters, please see section 4.3.
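
One possible way to convert characters as they are input is sketched below in Python, using compatibility normalization (NFKC) from the standard unicodedata module. The function name normalize_typed_iri is hypothetical. NFKC is used here as an assumption about how an entry component might fold width variants; it also folds other compatibility characters, so a real input component might prefer to warn the user rather than convert silently.

```python
import unicodedata


def normalize_typed_iri(typed: str) -> str:
    """Fold full-width Latin to half-width and half-width Katakana to full-width."""
    return unicodedata.normalize("NFKC", typed)


# U+FF45 and U+FF58 are the full-width forms of "e" and "x".
print(normalize_typed_iri("\uff45\uff58ample"))   # prints example

# U+FF76 is half-width Katakana KA; the result is full-width U+30AB.
print(normalize_typed_iri("\uff76") == "\u30ab")  # prints True
```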

7.3. URI/IRI Transfer between Applications

Many applications, particularly mail user agents, try to detect URIs appearing in plain text. For this, they use some heuristics based on URI syntax. They then allow the user to click on such URIs and retrieve the corresponding resource in an appropriate (usually scheme-dependent) application.
   Such applications have to be upgraded to use the IRI syntax as a base
   for heuristics.  In particular, a non-ASCII character should not be
   taken as the indication of the end of an IRI.  Such applications also
   have to make sure that they correctly convert the detected IRI from
   the character encoding of the document or application where the IRI
   appears to the character encoding used by the system-wide IRI
   invocation mechanism, or to a URI (according to section 3.1) if the
   system-wide invocation mechanism only accepts URIs.
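
   The effect of the heuristic on detection can be sketched in Python.
   Both regular expressions are deliberately crude and hypothetical;
   they are not the IRI grammar, and they only handle http and https.
   The point is that the ASCII-only pattern stops at the first non-ASCII
   character, whereas the IRI-aware pattern stops only at whitespace
   and a few delimiters.

```python
import re

# Legacy heuristic: a URI consists of printable US-ASCII only.
ASCII_ONLY = re.compile(r"https?://[\x21-\x7E]+")

# IRI-aware heuristic: anything up to whitespace or a common delimiter.
IRI_AWARE = re.compile(r'https?://[^\s<>"]+')


def find_iris(text: str) -> list:
    """Return candidate IRIs found in plain text."""
    return IRI_AWARE.findall(text)


text = "See http://www.example.org/r\u00e9sum\u00e9.html for details"
print(ASCII_ONLY.findall(text))  # truncated after "/r"
print(find_iris(text))           # the complete IRI
```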

   The clipboard is another frequently used way to transfer URIs and
   IRIs from one application to another.  On most platforms, the
   clipboard is able to store and transfer text in many languages and
   scripts.  Correctly used, the clipboard transfers characters, not
   bytes, which will do the right thing with IRIs.

7.4. URI/IRI Generation

Systems that offer resources through the Internet, where those resources have logical names, sometimes automatically generate URIs for the resources they offer. For example, some HTTP servers can generate a directory listing for a file directory and then respond to the generated URIs with the files. Many legacy character encodings are in use in various file systems. Many currently deployed systems do not transform the local character representation of the underlying system before generating URIs. For maximum interoperability, systems that generate resource identifiers should make the appropriate transformations. For example, if a file system contains a file named "résumé.html", a server should expose this as "r%C3%A9sum%C3%A9.html" in a URI, which allows use of "résumé.html" in an IRI, even if locally the file name is kept in a character encoding other than UTF-8. This recommendation particularly applies to HTTP servers. For FTP servers, similar considerations apply; see [RFC2640].
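
The transformation recommended above can be sketched in Python as follows. The function name file_name_to_uri_segment is hypothetical, and the sketch assumes the server knows which character encoding the file system uses for the name.

```python
from urllib.parse import quote


def file_name_to_uri_segment(raw_name: bytes, local_encoding: str) -> str:
    """Turn a locally encoded file name into a UTF-8-based URI path segment."""
    name = raw_name.decode(local_encoding)  # local octets -> characters
    return quote(name, safe="")             # characters -> percent-encoded UTF-8


# The file name is stored in iso-8859-1, where e-acute is the octet 0xE9.
print(file_name_to_uri_segment(b"r\xe9sum\xe9.html", "iso-8859-1"))
# prints r%C3%A9sum%C3%A9.html
```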

7.5. URI/IRI Selection

In some cases, resource owners and publishers have control over the IRIs used to identify their resources. This control is mostly exercised by controlling the resource names, such as file names, directly.
   In these cases, it is recommended to avoid choosing IRIs that are
   easily confused.  For example, for US-ASCII, the lower-case ell ("l")
   is easily confused with the digit one ("1"), and the upper-case oh
   ("O") is easily confused with the digit zero ("0").  Publishers
   should avoid confusing users with "br0ken" or "1ame" identifiers.

   Outside the US-ASCII repertoire, there are many more opportunities
   for confusion; a complete set of guidelines is too lengthy to include
   here.  As long as names are limited to characters from a single
   script, native writers of a given script or language will know best
   when ambiguities can appear, and how they can be avoided.  What may
   look ambiguous to a stranger may be completely obvious to the average
   native user.  On the other hand, in some cases, the UCS contains
   variants for compatibility reasons; for example, for typographic
   purposes.  These should be avoided wherever possible.  Although there
   may be exceptions, newly created resource names should generally be
   in NFKC [UTR15] (which means that they are also in NFC).

   As an example, the UCS contains the "fi" ligature at U+FB01 for
   compatibility reasons.  Wherever possible, IRIs should use the two
   letters "f" and "i" rather than the "fi" ligature.  An example where
   the latter may be used is in the query part of an IRI for an explicit
   search for a word written containing the "fi" ligature.
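
   A publisher could check candidate resource names with a sketch such
   as the following Python fragment, which uses the standard
   unicodedata module.  The function name is_nfkc is hypothetical.

```python
import unicodedata


def is_nfkc(name: str) -> bool:
    """True if the name is unchanged by NFKC normalization."""
    return unicodedata.normalize("NFKC", name) == name


ligature_name = "\ufb01le.html"  # starts with the "fi" ligature, U+FB01

print(is_nfkc("file.html"))                          # prints True
print(is_nfkc(ligature_name))                        # prints False
print(unicodedata.normalize("NFKC", ligature_name))  # prints file.html
```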

   In certain cases, there is a chance that characters from different
   scripts look the same.  The best known example is the similarity of
   the Latin "A", the Greek "Alpha", and the Cyrillic "A".  To avoid
   such cases, IRIs should be created only where all the characters in a
   single component are used together in a given language.  This usually
   means that all of these characters will be from the same script, but
   there are languages that mix characters from different scripts (such
   as Japanese).  This is similar to the heuristics used to distinguish
   between letters and numbers in the examples above.  Also, for Latin,
   Greek, and Cyrillic, using lowercase letters results in fewer
   ambiguities than using uppercase letters would.
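
   A very rough check for mixed scripts within one component is
   sketched below in Python.  It is an assumption-laden heuristic, not
   a real script lookup: it takes the first word of each letter's
   Unicode character name as a stand-in for its script, and the
   function name scripts_in is hypothetical.  As noted above, a mix is
   not necessarily wrong (for example, in Japanese), so the result is
   only a hint for closer inspection.

```python
import unicodedata


def scripts_in(component: str) -> set:
    """Approximate the set of scripts used by the letters of a component."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in component
        if ch.isalpha()
    }


# Latin "A", Greek capital Alpha, Cyrillic capital "A": visually alike.
print(scripts_in("A\u0391\u0410"))
print(scripts_in("example"))
```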

7.6. Display of URIs/IRIs

In situations where the rendering software is not expected to display non-ASCII parts of the IRI correctly using the available layout and font resources, these parts should be percent-encoded before being displayed. For display of Bidi IRIs, please see section 4.1.

7.7. Interpretation of URIs and IRIs

Software that interprets IRIs as the names of local resources should accept IRIs in multiple forms and convert and match them with the appropriate local resource names. First, multiple representations include both IRIs in the native character encoding of the protocol and also their URI counterparts. Second, it may include URIs constructed based on character encodings other than UTF-8. These URIs may be produced by user agents that do not conform to this specification and that use legacy character encodings to convert non-ASCII characters to URIs. Whether this is necessary, and what character encodings to cover, depends on a number of factors, such as the legacy character encodings used locally and the distribution of various versions of user agents. For example, software for Japanese may accept URIs in Shift_JIS and/or EUC-JP in addition to UTF-8. Third, it may include additional mappings to be more user-friendly and robust against transmission errors. These would be similar to how some servers currently treat URIs as case insensitive or perform additional matching to account for spelling errors. For characters beyond the US-ASCII repertoire, this may, for example, include ignoring the accents on received IRIs or resource names. Please note that such mappings, including case mappings, are language dependent. It can be difficult to identify a resource unambiguously if too many mappings are taken into consideration. However, percent-encoded and not percent-encoded parts of IRIs can always be clearly distinguished. Also, the regularity of UTF-8 (see [Duerst97]) makes the potential for collisions lower than it may seem at first.
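
The second kind of matching, accepting URIs based on legacy character encodings, might be sketched in Python as follows. The function name candidate_names and the default list of encodings are assumptions for illustration; which encodings to try depends on local circumstances, as discussed above.

```python
from urllib.parse import unquote_to_bytes


def candidate_names(segment: str, encodings=("utf-8", "shift_jis", "euc_jp")) -> list:
    """Return the resource names a percent-encoded segment could denote.

    The octets are decoded under each encoding in turn; encodings under
    which the octets are not valid are skipped.
    """
    octets = unquote_to_bytes(segment)
    candidates = []
    for encoding in encodings:
        try:
            name = octets.decode(encoding)
        except UnicodeDecodeError:
            continue
        if name not in candidates:
            candidates.append(name)
    return candidates


# Not valid UTF-8, so only the iso-8859-1 reading survives.
print(candidate_names("r%E9sum%E9.html", ("utf-8", "iso-8859-1")))
```

Trying UTF-8 first reflects the observation above that the regularity of UTF-8 keeps the potential for collisions low. Octets that are valid UTF-8 may still decode under a legacy encoding, and the candidate list then contains both readings.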

7.8. Upgrading Strategy

Where this recommendation places further constraints on software for which many instances are already deployed, it is important to introduce upgrades carefully and to be aware of the various interdependencies. If IRIs cannot be interpreted correctly, they should not be created, generated, or transported. This suggests that upgrading URI interpreting software to accept IRIs should have highest priority. On the other hand, a single IRI is interpreted only by a single or very few interpreters that are known in advance, although it may be entered and transported very widely.
   Therefore, IRIs benefit most from a broad upgrade of software to be
   able to enter and transport IRIs.  However, before an individual IRI
   is published, care should be taken to upgrade the corresponding
   interpreting software in order to cover the forms expected to be
   received by various versions of entry and transport software.

   The upgrade of generating software to generate IRIs instead of using
   a local character encoding should happen only after the service is
   upgraded to accept IRIs.  Similarly, IRIs should only be generated
   when the service accepts IRIs and the intervening infrastructure and
   protocol are known to transport them safely.

   Software converting from URIs to IRIs for display should be upgraded
   only after upgraded entry software has been widely deployed to the
   population that will see the displayed result.

   Where there is a free choice of character encodings, it is often
   possible to reduce the effort and dependencies for upgrading to IRIs
   by using UTF-8 rather than another encoding.  For example, when a new
   file-based Web server is set up, using UTF-8 as the character
   encoding for file names will make the transition to IRIs easier.
   Likewise, when a new Web form is set up using UTF-8 as the character
   encoding of the form page, the returned query URIs will use UTF-8 as
   the character encoding (unless the user, for whatever reason, changes
   the character encoding) and will therefore be compatible with IRIs.
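
   The effect of the form page's character encoding on the returned
   query can be illustrated with Python's urllib.parse.urlencode, used
   here as a stand-in for what a browser does when submitting a form.
   The function name form_query and the field name "q" are
   hypothetical.

```python
from urllib.parse import urlencode


def form_query(fields: dict, page_encoding: str) -> str:
    """Build the query part as a form submission in the given encoding would."""
    return urlencode(fields, encoding=page_encoding)


fields = {"q": "r\u00e9sum\u00e9"}
print(form_query(fields, "utf-8"))       # prints q=r%C3%A9sum%C3%A9
print(form_query(fields, "iso-8859-1"))  # prints q=r%E9sum%E9
```

   Only the first query corresponds to an IRI in which the e-acute
   characters appear directly.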

   These recommendations, when taken together, will allow for the
   extension from URIs to IRIs in order to handle characters other than
   US-ASCII while minimizing interoperability problems.  For
   considerations regarding the upgrade of URI scheme definitions, see
   section 6.4.

8. Security Considerations

The security considerations discussed in [RFC3986] also apply to IRIs. In addition, the following issues require particular care for IRIs. Incorrect encoding or decoding can lead to security problems. In particular, some UTF-8 decoders do not check against overlong byte sequences. As an example, a "/" is encoded with the byte 0x2F both in UTF-8 and in US-ASCII, but some UTF-8 decoders also wrongly interpret the sequence 0xC0 0xAF as a "/". A sequence such as
   "%C0%AF.." may pass some security tests and then be interpreted as
   "/.." in a path if UTF-8 decoders are fault-tolerant, if conversion
   and checking are not done in the right order, and/or if reserved
   characters and unreserved characters are not clearly distinguished.
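
   One way to avoid this problem is to decode strictly first and to
   check the decoded result afterwards, as in the following Python
   sketch.  The function name is_safe_path is hypothetical, and the
   check for ".." segments is only a placeholder for whatever security
   tests an application actually performs.

```python
from urllib.parse import unquote_to_bytes


def is_safe_path(raw_path: str) -> bool:
    """Decode first, then check: reject invalid UTF-8 and ".." segments."""
    try:
        # Python's UTF-8 decoder is strict and rejects overlong
        # sequences such as 0xC0 0xAF.
        decoded = unquote_to_bytes(raw_path).decode("utf-8")
    except UnicodeDecodeError:
        return False
    return ".." not in decoded.split("/")


print(is_safe_path("%C0%AF.."))              # prints False (overlong "/")
print(is_safe_path("a/%2E%2E/b"))            # prints False (encoded "..")
print(is_safe_path("r%C3%A9sum%C3%A9.html")) # prints True
```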

   There are various ways in which "spoofing" can occur with IRIs.
   "Spoofing" means that somebody may add a resource name that looks the
   same or similar to the user, but that points to a different resource.
   The added resource may pretend to be the real resource by looking
   very similar but may contain all kinds of changes that may be
   difficult to spot and that can cause all kinds of problems.  Most
   spoofing possibilities for IRIs are extensions of those for URIs.

   Spoofing can occur for various reasons.  First, a user's
   normalization expectations or actual normalization when entering an
   IRI or transcoding an IRI from a legacy character encoding do not
   match the normalization used on the server side.  Conceptually, this
   is no different from the problems surrounding the use of
   case-insensitive web servers.  For example, a popular web page with a
   mixed-case name ("http://big.example.com/PopularPage.html") might be
   "spoofed" by someone who is able to create
   "http://big.example.com/popularpage.html".  However, the use of
   unnormalized character sequences, and of additional mappings for user
   convenience, may increase the chance for spoofing.  Protocols and
   servers that allow the creation of resources with names that are not
   normalized are particularly vulnerable to such attacks.  This is an
   inherent security problem of the relevant protocol, server, or
   resource and is not specific to IRIs, but it is mentioned here for
   completeness.
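
   The normalization mismatch can be illustrated with a short Python
   sketch based on the standard unicodedata module.  The function name
   same_after_nfc is hypothetical, and NFC is chosen here only as an
   example of a normalization a server might apply.

```python
import unicodedata


def same_after_nfc(a: str, b: str) -> bool:
    """True if two names become identical under NFC normalization."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


precomposed = "r\u00e9sum\u00e9.html"   # e-acute as one code point
decomposed = "re\u0301sume\u0301.html"  # "e" followed by a combining acute

print(precomposed == decomposed)                # prints False
print(same_after_nfc(precomposed, decomposed))  # prints True
```

   A server that stores both names as distinct resources gives an
   attacker two identifiers that are displayed identically.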

   Spoofing can occur in various IRI components, such as the domain name
   part or a path part.  For considerations specific to the domain name
   part, see [RFC3491].  For the path part, administrators of sites that
   allow independent users to create resources in the same sub area may
   have to be careful to check for spoofing.

   Spoofing can occur because in the UCS many characters look very
   similar.  Details are discussed in Section 7.5.  Again, this is very
   similar to spoofing possibilities on US-ASCII, e.g., using "br0ken"
   or "1ame" URIs.

   Spoofing can occur when URIs with percent-encodings based on various
   character encodings are accepted to deal with older user agents.  In
   some cases, particularly for Latin-based resource names, this is
   usually easy to detect because UTF-8-encoded names, when interpreted
   and viewed as legacy character encodings, produce mostly garbage.
   When concurrently used character encodings have a similar structure
   but there are no characters that have exactly the same encoding,
   detection is more difficult.

   Spoofing can occur with bidirectional IRIs, if the restrictions in
   section 4.2 are not followed.  The same visual representation may be
   interpreted as different logical representations, and vice versa.  It
   is also very important that a correct Unicode bidirectional
   implementation be used.

9. Acknowledgements

We would like to thank Larry Masinter for his work as coauthor of many earlier versions of this document (draft-masinter-url-i18n-xx). The discussion on the issue addressed here started a long time ago. There was a thread in the HTML working group in August 1995 (under the topic of "Globalizing URIs") and in the www-international mailing list in July 1996 (under the topic of "Internationalization and URLs"), and there were ad-hoc meetings at the Unicode conferences in September 1995 and September 1997. Many thanks go to Francois Yergeau, Matitiahu Allouche, Roy Fielding, Tim Berners-Lee, Mark Davis, M.T. Carrasco Benitez, James Clark, Tim Bray, Chris Wendt, Yaron Goland, Andrea Vine, Misha Wolf, Leslie Daigle, Ted Hardie, Bill Fenner, Margaret Wasserman, Russ Housley, Makoto MURATA, Steven Atkin, Ryan Stansifer, Tex Texin, Graham Klyne, Bjoern Hoehrmann, Chris Lilley, Ian Jacobs, Adam Costello, Dan Oscarson, Elliotte Rusty Harold, Mike J. Brown, Roy Badami, Jonathan Rosenne, Asmus Freytag, Simon Josefsson, Carlos Viegas Damasio, Chris Haynes, Walter Underwood, and many others for help with understanding the issues and possible solutions, and with getting the details right. This document is a product of the Internationalization Working Group (I18N WG) of the World Wide Web Consortium (W3C). Thanks to the members of the W3C I18N Working Group and Interest Group for their contributions and their work on [CharMod]. Thanks also go to the members of many other W3C Working Groups for adopting IRIs, and to the members of the Montreal IAB Workshop on Internationalization and Localization for their review.

10. References

10.1. Normative References

   [ASCII]        American National Standards Institute, "Coded
                  Character Set -- 7-bit American Standard Code for
                  Information Interchange", ANSI X3.4, 1986.

   [ISO10646]     International Organization for Standardization,
                  "ISO/IEC 10646:2003: Information Technology -
                  Universal Multiple-Octet Coded Character Set (UCS)",
                  ISO Standard 10646, December 2003.

   [RFC2119]      Bradner, S., "Key words for use in RFCs to Indicate
                  Requirement Levels", BCP 14, RFC 2119, March 1997.

   [RFC2234]      Crocker, D. and P. Overell, "Augmented BNF for Syntax
                  Specifications: ABNF", RFC 2234, November 1997.

   [RFC3490]      Faltstrom, P., Hoffman, P., and A. Costello,
                  "Internationalizing Domain Names in Applications
                  (IDNA)", RFC 3490, March 2003.

   [RFC3491]      Hoffman, P. and M. Blanchet, "Nameprep: A Stringprep
                  Profile for Internationalized Domain Names (IDN)",
                  RFC 3491, March 2003.

   [RFC3629]      Yergeau, F., "UTF-8, a transformation format of ISO
                  10646", STD 63, RFC 3629, November 2003.

   [RFC3986]      Berners-Lee, T., Fielding, R., and L. Masinter,
                  "Uniform Resource Identifier (URI): Generic Syntax",
                  STD 66, RFC 3986, January 2005.

   [UNI9]         Davis, M., "The Bidirectional Algorithm", Unicode
                  Standard Annex #9, March 2004,
                  <http://www.unicode.org/reports/tr9/tr9-13.html>.

   [UNIV4]        The Unicode Consortium, "The Unicode Standard,
                  Version 4.0.1, defined by: The Unicode Standard,
                  Version 4.0 (Reading, MA, Addison-Wesley, 2003. ISBN
                  0-321-18578-1), as amended by Unicode 4.0.1
                  (http://www.unicode.org/versions/Unicode4.0.1/)",
                  March 2004.

   [UTR15]        Davis, M. and M. Duerst, "Unicode Normalization
                  Forms", Unicode Standard Annex #15, April 2003,
                  <http://www.unicode.org/unicode/reports/
                  tr15/tr15-23.html>.

10.2. Informative References

   [BidiEx]       "Examples of bidirectional IRIs",
                  <http://www.w3.org/International/iri-edit/BidiExamples>.

   [CharMod]      Duerst, M., Yergeau, F., Ishida, R., Wolf, M., and T.
                  Texin, "Character Model for the World Wide Web:
                  Resource Identifiers", World Wide Web Consortium
                  Candidate Recommendation, November 2004,
                  <http://www.w3.org/TR/charmod-resid>.

   [Duerst97]     Duerst, M., "The Properties and Promises of UTF-8",
                  Proc. 11th International Unicode Conference, San
                  Jose, September 1997,
                  <http://www.ifi.unizh.ch/mml/mduerst/papers/PDF/IUC11-UTF-8.pdf>.

   [Gettys]       Gettys, J., "URI Model Consequences",
                  <http://www.w3.org/DesignIssues/ModelConsequences>.

   [HTML4]        Raggett, D., Le Hors, A., and I. Jacobs, "HTML 4.01
                  Specification", World Wide Web Consortium
                  Recommendation, December 1999,
                  <http://www.w3.org/TR/html401/appendix/notes.html#h-B.2>.

   [RFC2045]      Freed, N. and N. Borenstein, "Multipurpose Internet
                  Mail Extensions (MIME) Part One: Format of Internet
                  Message Bodies", RFC 2045, November 1996.

   [RFC2130]      Weider, C., Preston, C., Simonsen, K., Alvestrand, H.,
                  Atkinson, R., Crispin, M., and P. Svanberg, "The
                  Report of the IAB Character Set Workshop held 29
                  February - 1 March, 1996", RFC 2130, April 1997.

   [RFC2141]      Moats, R., "URN Syntax", RFC 2141, May 1997.

   [RFC2192]      Newman, C., "IMAP URL Scheme", RFC 2192, September
                  1997.

   [RFC2277]      Alvestrand, H., "IETF Policy on Character Sets and
                  Languages", BCP 18, RFC 2277, January 1998.

   [RFC2368]      Hoffman, P., Masinter, L., and J. Zawinski, "The
                  mailto URL scheme", RFC 2368, July 1998.

   [RFC2384]      Gellens, R., "POP URL Scheme", RFC 2384, August 1998.

   [RFC2396]      Berners-Lee, T., Fielding, R., and L. Masinter,
                  "Uniform Resource Identifiers (URI): Generic Syntax",
                  RFC 2396, August 1998.

   [RFC2397]      Masinter, L., "The "data" URL scheme", RFC 2397,
                  August 1998.

   [RFC2616]      Fielding,  R., Gettys, J., Mogul, J., Frystyk, H.,
                  Masinter, L., Leach, P., and T. Berners-Lee,
                  "Hypertext Transfer Protocol -- HTTP/1.1", RFC 2616,
                  June 1999.

   [RFC2640]      Curtin, B., "Internationalization of the File Transfer
                  Protocol", RFC 2640, July 1999.

   [RFC2718]      Masinter, L., Alvestrand, H., Zigmond, D., and R.
                  Petke, "Guidelines for new URL Schemes", RFC 2718,
                  November 1999.

   [UNIXML]       Duerst, M. and A. Freytag, "Unicode in XML and other
                  Markup Languages", Unicode Technical Report #20, World
                  Wide Web Consortium Note, June 2003,
                  <http://www.w3.org/TR/unicode-xml/>.

   [XLink]        DeRose, S., Maler, E., and D. Orchard, "XML Linking
                  Language (XLink) Version 1.0", World Wide Web
                  Consortium Recommendation, June 2001,
                  <http://www.w3.org/TR/xlink/#link-locators>.

   [XML1]         Bray, T., Paoli, J., Sperberg-McQueen, C., Maler, E.,
                  and F. Yergeau, "Extensible Markup Language (XML) 1.0
                  (Third Edition)", World Wide Web Consortium
                  Recommendation, February 2004,
                  <http://www.w3.org/TR/REC-xml#sec-external-ent>.

   [XMLNamespace] Bray, T., Hollander, D., and A. Layman, "Namespaces in
                  XML", World Wide Web Consortium Recommendation,
                  January 1999, <http://www.w3.org/TR/REC-xml-names>.

   [XMLSchema]    Biron, P. and A. Malhotra, "XML Schema Part 2:
                  Datatypes", World Wide Web Consortium Recommendation,
                  May 2001, <http://www.w3.org/TR/xmlschema-2/#anyURI>.
   [XPointer]     Grosso, P., Maler, E., Marsh, J. and N. Walsh,
                  "XPointer Framework", World Wide Web Consortium
                  Recommendation, March 2003,
                  <http://www.w3.org/TR/xptr-framework/#escaping>.

Appendix A. Design Alternatives

This section briefly summarizes major design alternatives and the reasons why they were not chosen.

Appendix A.1. New Scheme(s)

Introducing new schemes (for example, httpi:, ftpi:,...) or a new metascheme (e.g., i:, leading to URI/IRI prefixes such as i:http:, i:ftp:,...) was proposed to make IRI-to-URI conversion scheme dependent or to distinguish between percent-encodings resulting from IRI-to-URI conversion and percent-encodings from legacy character encodings. New schemes are not needed to distinguish URIs from true IRIs (i.e., IRIs that contain non-ASCII characters). The benefit of being able to detect the origin of percent-encodings is marginal, as UTF-8 can be detected with very high reliability. Deploying new schemes is extremely hard, so not requiring new schemes for IRIs makes deployment of IRIs vastly easier. Making conversion scheme dependent is highly inadvisable, and separate schemes for IRIs would encourage it. Using a uniform convention for conversion from IRIs to URIs makes IRI implementation orthogonal to the introduction of actual new schemes.
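
The claim that UTF-8 can be detected reliably can be illustrated with a minimal Python sketch. The function name percent_octets_are_utf8 and the sample strings are assumptions made for this illustration and are not part of the specification. The check only decodes the percent-encoded octets and tests whether they form a well-formed UTF-8 sequence; note that a pure US-ASCII string also passes, because US-ASCII is a subset of UTF-8.

```python
from urllib.parse import unquote_to_bytes


def percent_octets_are_utf8(uri_component: str) -> bool:
    """Return True if the percent-decoded octets form well-formed UTF-8.

    Hypothetical helper for illustration only.
    """
    octets = unquote_to_bytes(uri_component)
    try:
        octets.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# "D%C3%BCrst" carries the UTF-8 octets for u-umlaut (C3 BC).
print(percent_octets_are_utf8("D%C3%BCrst"))
# "D%FCrst" carries the ISO-8859-1 octet for u-umlaut (FC),
# which is not a valid UTF-8 sequence.
print(percent_octets_are_utf8("D%FCrst"))
```

Octet sequences produced by legacy single-byte encodings rarely happen to be well-formed UTF-8, which is why a check of this kind is usually sufficient in practice.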

Appendix A.2. Character Encodings Other Than UTF-8

At an early stage, UTF-7 was considered as an alternative to UTF-8 when IRIs are converted to URIs. UTF-7 would not have needed percent-encoding and in most cases would have been shorter than percent-encoded UTF-8. Using UTF-8 avoids a double layering and overloading of the use of the "+" character. UTF-8 is fully compatible with US-ASCII and has therefore been recommended by the IETF, and is being used widely. UTF-7 has never been used much and is now clearly being discouraged. Requiring implementations to convert from UTF-8 to UTF-7 and back would be an additional implementation burden.
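
As a rough illustration of the trade-off, the following Python sketch encodes the same string once as UTF-7 and once as percent-encoded UTF-8. The sample string is an assumption chosen for this example, and the "utf-7" codec is the one shipped with the Python standard library. UTF-7 uses "+" as its shift character, which is the overloading mentioned above.

```python
from urllib.parse import quote

text = "D\u00fcrst"  # "Duerst" written with u-umlaut

# UTF-7: non-ASCII characters are wrapped in a "+...-" shifted sequence.
utf7_form = text.encode("utf-7").decode("ascii")

# Percent-encoded UTF-8: each non-ASCII octet becomes "%XX".
percent_form = quote(text.encode("utf-8"))

print(utf7_form, len(utf7_form))
print(percent_form, len(percent_form))

# Both forms are reversible; only the percent-encoded form avoids "+".
assert utf7_form.encode("ascii").decode("utf-7") == text
```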

Appendix A.3. New Encoding Convention

Instead of using the existing percent-encoding convention of URIs, which is based on octets, the idea was to create a new encoding convention; for example, to use "%u" to introduce UCS code points. Using the existing octet-based percent-encoding mechanism does not need an upgrade of the URI syntax and does not need corresponding server upgrades.
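
The difference between the two conventions can be sketched in Python as follows. The function percent_u_encode and the exact "%uXXXX" layout are hypothetical, since the proposal was never specified in this document; the octet-based form uses urllib.parse from the standard library. The last line shows that an existing percent-decoder leaves the "%u" form untouched, which hints at why deployed software would have needed upgrades.

```python
from urllib.parse import quote, unquote


def percent_u_encode(text: str) -> str:
    """Hypothetical "%uXXXX" convention: one escape per non-ASCII code point.

    Illustration only; assumes code points in the Basic Multilingual Plane.
    """
    return "".join(
        ch if ord(ch) < 0x80 else f"%u{ord(ch):04X}" for ch in text
    )


name = "D\u00fcrst"  # "Duerst" written with u-umlaut

octet_based = quote(name)                  # percent-encoded UTF-8 octets
code_point_based = percent_u_encode(name)  # hypothetical code point escapes

print(octet_based)
print(code_point_based)

# An existing decoder understands the octet-based form...
print(unquote(octet_based) == name)
# ...but passes the "%u" form through unchanged.
print(unquote(code_point_based) == code_point_based)
```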

Appendix A.4. Indicating Character Encodings in the URI/IRI

Some proposals suggested indicating the character encodings used in a URI or IRI with some new syntactic convention in the URI itself, similar to the "charset" parameter for e-mails and Web pages. As an example, the label in square brackets in "http://www.example.org/ros[iso-8859-1]&#xE9;" indicated that the following "&#xE9;" had to be interpreted as iso-8859-1. If UTF-8 is used exclusively, an upgrade to the URI syntax is not needed. Exclusive use of UTF-8 avoids potentially multiple labels that have to be copied correctly in all cases, even on the side of a bus or on a napkin, leading to usability problems (and being prohibitively annoying). Exclusively using UTF-8 also reduces transcoding errors and confusion.
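
The confusion that an unlabeled legacy octet can cause may be seen in a small Python sketch. The helper name interpret and the component "ros%E9" are assumptions for this illustration; the example writes the octet in percent-encoded form rather than with the bracket label of the proposal described above.

```python
from urllib.parse import unquote


def interpret(component: str, charset: str) -> str:
    """Decode a percent-encoded component under the given character encoding.

    Hypothetical helper for illustration only.
    """
    return unquote(component, encoding=charset, errors="strict")


legacy = "ros%E9"  # octet 0xE9 with no indication of its encoding

# The same octet yields different characters under different encodings,
# which is why a label would have been needed.
print(interpret(legacy, "iso-8859-1"))  # Latin small letter e with acute
print(interpret(legacy, "iso-8859-5"))  # Cyrillic small letter shcha

# With UTF-8 as the only convention, no label is needed.
print(interpret("ros%C3%A9", "utf-8"))
```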

Authors' Addresses

   Martin Duerst (Note: Please write "Duerst" with u-umlaut wherever
                  possible, for example as "D&#252;rst" in XML and HTML.)
   World Wide Web Consortium
   5322 Endo
   Fujisawa, Kanagawa 252-8520
   Japan

   Phone: +81 466 49 1170
   Fax:   +81 466 49 1171
   EMail: duerst@w3.org
   URI:   http://www.w3.org/People/D%C3%BCrst/
   (Note: This is the percent-encoded form of an IRI.)

   Michel Suignard
   Microsoft Corporation
   One Microsoft Way
   Redmond, WA 98052
   U.S.A.

   Phone: +1 425 882-8080
   EMail: michelsu@microsoft.com
   URI:   http://www.suignard.com

Full Copyright Statement

   Copyright (C) The Internet Society (2005).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the IETF's procedures with respect to rights in IETF Documents can
   be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at
   ietf-ipr@ietf.org.


Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.