RFC 3986

Uniform Resource Identifier (URI): Generic Syntax

Pages: 61
Internet Standard: 66
→ Errata
Obsoletes: 2732 2396 1808
Updates: 1738
Updated by: 6874 7320 8820

Part 3 of 3 – Pages 38 to 61

RFC3986 - Page 38 prevText

6.  Normalization and Comparison

   One of the most common operations on URIs is simple comparison:
   determining whether two URIs are equivalent without using the URIs to
   access their respective resource(s).  A comparison is performed every
   time a response cache is accessed, a browser checks its history to
   color a link, or an XML parser processes tags within a namespace.
   Extensive normalization prior to comparison of URIs is often used by
   spiders and indexing engines to prune a search space or to reduce
   duplication of request actions and response storage.

   URI comparison is performed for some particular purpose.  Protocols
   or implementations that compare URIs for different purposes will
   often be subject to differing design trade-offs in regards to how
   much effort should be spent in reducing aliased identifiers.  This
   section describes various methods that may be used to compare URIs,
   the trade-offs between them, and the types of applications that might
   use them.

6.1.  Equivalence

   Because URIs exist to identify resources, presumably they should be
   considered equivalent when they identify the same resource.  However,
   this definition of equivalence is not of much practical use, as there
   is no way for an implementation to compare two resources unless it
   has full knowledge or control of them.  For this reason,
   determination of equivalence or difference of URIs is based on string
   comparison, perhaps augmented by reference to additional rules
   provided by URI scheme definitions.  We use the terms "different" and
   "equivalent" to describe the possible outcomes of such comparisons,
   but there are many application-dependent versions of equivalence.

   Even though it is possible to determine that two URIs are equivalent,
   URI comparison is not sufficient to determine whether two URIs
   identify different resources.  For example, an owner of two different
   domain names could decide to serve the same resource from both,
   resulting in two different URIs.  Therefore, comparison methods are
   designed to minimize false negatives while strictly avoiding false
   positives.

   In testing for equivalence, applications should not directly compare
   relative references; the references should be converted to their
   respective target URIs before comparison.  When URIs are compared to
   select (or avoid) a network action, such as retrieval of a
   representation, fragment components (if any) should be excluded from
   the comparison.

RFC3986 - Page 39

6.2.  Comparison Ladder

   A variety of methods are used in practice to test URI equivalence.
   These methods fall into a range, distinguished by the amount of
   processing required and the degree to which the probability of false
   negatives is reduced.  As noted above, false negatives cannot be
   eliminated.  In practice, their probability can be reduced, but this
   reduction requires more processing and is not cost-effective for all
   applications.

   If this range of comparison practices is considered as a ladder, the
   following discussion will climb the ladder, starting with practices
   that are cheap but have a relatively higher chance of producing false
   negatives, and proceeding to those that have higher computational
   cost and lower risk of false negatives.

6.2.1.  Simple String Comparison

   If two URIs, when considered as character strings, are identical,
   then it is safe to conclude that they are equivalent.  This type of
   equivalence test has very low computational cost and is in wide use
   in a variety of applications, particularly in the domain of parsing.

   Testing strings for equivalence requires some basic precautions.
   This procedure is often referred to as "bit-for-bit" or
   "byte-for-byte" comparison, which is potentially misleading.  Testing
   strings for equality is normally based on pair comparison of the
   characters that make up the strings, starting from the first and
   proceeding until both strings are exhausted and all characters are
   found to be equal, until a pair of characters compares unequal, or
   until one of the strings is exhausted before the other.

   This character comparison requires that each pair of characters be
   put in comparable form.  For example, should one URI be stored in a
   byte array in EBCDIC encoding and the second in a Java String object
   (UTF-16), bit-for-bit comparisons applied naively will produce
   errors.  It is better to speak of equality on a character-for-
   character basis rather than on a byte-for-byte or bit-for-bit basis.
   In practical terms, character-by-character comparisons should be done
   codepoint-by-codepoint after conversion to a common character
   encoding.

   False negatives are caused by the production and use of URI aliases.
   Unnecessary aliases can be reduced, regardless of the comparison
   method, by consistently providing URI references in an already-
   normalized form (i.e., a form identical to what would be produced
   after normalization is applied, as described below).

RFC3986 - Page 40

   Protocols and data formats often limit some URI comparisons to simple
   string comparison, based on the theory that people and
   implementations will, in their own best interest, be consistent in
   providing URI references, or at least consistent enough to negate any
   efficiency that might be obtained from further normalization.

6.2.2.  Syntax-Based Normalization

   Implementations may use logic based on the definitions provided by
   this specification to reduce the probability of false negatives.
   This processing is moderately higher in cost than character-for-
   character string comparison.  For example, an application using this
   approach could reasonably consider the following two URIs equivalent:

      example://a/b/c/%7Bfoo%7D
      eXAMPLE://a/./b/../b/%63/%7bfoo%7d

   Web user agents, such as browsers, typically apply this type of URI
   normalization when determining whether a cached response is
   available.  Syntax-based normalization includes such techniques as
   case normalization, percent-encoding normalization, and removal of
   dot-segments.

6.2.2.1.  Case Normalization

   For all URIs, the hexadecimal digits within a percent-encoding
   triplet (e.g., "%3a" versus "%3A") are case-insensitive and therefore
   should be normalized to use uppercase letters for the digits A-F.

   When a URI uses components of the generic syntax, the component
   syntax equivalence rules always apply; namely, that the scheme and
   host are case-insensitive and therefore should be normalized to
   lowercase.  For example, the URI <HTTP://www.EXAMPLE.com/> is
   equivalent to <http://www.example.com/>.  The other generic syntax
   components are assumed to be case-sensitive unless specifically
   defined otherwise by the scheme (see Section 6.2.3).

6.2.2.2.  Percent-Encoding Normalization

   The percent-encoding mechanism (Section 2.1) is a frequent source of
   variance among otherwise identical URIs.  In addition to the case
   normalization issue noted above, some URI producers percent-encode
   octets that do not require percent-encoding, resulting in URIs that
   are equivalent to their non-encoded counterparts.  These URIs should
   be normalized by decoding any percent-encoded octet that corresponds
   to an unreserved character, as described in Section 2.3.

RFC3986 - Page 41

6.2.2.3.  Path Segment Normalization

   The complete path segments "." and ".." are intended only for use
   within relative references (Section 4.1) and are removed as part of
   the reference resolution process (Section 5.2).  However, some
   deployed implementations incorrectly assume that reference resolution
   is not necessary when the reference is already a URI and thus fail to
   remove dot-segments when they occur in non-relative paths.  URI
   normalizers should remove dot-segments by applying the
   remove_dot_segments algorithm to the path, as described in
   Section 5.2.4.

6.2.3.  Scheme-Based Normalization

   The syntax and semantics of URIs vary from scheme to scheme, as
   described by the defining specification for each scheme.
   Implementations may use scheme-specific rules, at further processing
   cost, to reduce the probability of false negatives.  For example,
   because the "http" scheme makes use of an authority component, has a
   default port of "80", and defines an empty path to be equivalent to
   "/", the following four URIs are equivalent:

      http://example.com
      http://example.com/
      http://example.com:/
      http://example.com:80/

   In general, a URI that uses the generic syntax for authority with an
   empty path should be normalized to a path of "/".  Likewise, an
   explicit ":port", for which the port is empty or the default for the
   scheme, is equivalent to one where the port and its ":" delimiter are
   elided and thus should be removed by scheme-based normalization.  For
   example, the second URI above is the normal form for the "http"
   scheme.

   Another case where normalization varies by scheme is in the handling
   of an empty authority component or empty host subcomponent.  For many
   scheme specifications, an empty authority or host is considered an
   error; for others, it is considered equivalent to "localhost" or the
   end-user's host.  When a scheme defines a default for authority and a
   URI reference to that default is desired, the reference should be
   normalized to an empty authority for the sake of uniformity, brevity,
   and internationalization.  If, however, either the userinfo or port
   subcomponents are non-empty, then the host should be given explicitly
   even if it matches the default.

   Normalization should not remove delimiters when their associated
   component is empty unless licensed to do so by the scheme

RFC3986 - Page 42

   specification.  For example, the URI "http://example.com/?" cannot be
   assumed to be equivalent to any of the examples above.  Likewise, the
   presence or absence of delimiters within a userinfo subcomponent is
   usually significant to its interpretation.  The fragment component is
   not subject to any scheme-based normalization; thus, two URIs that
   differ only by the suffix "#" are considered different regardless of
   the scheme.

   Some schemes define additional subcomponents that consist of case-
   insensitive data, giving an implicit license to normalizers to
   convert this data to a common case (e.g., all lowercase).  For
   example, URI schemes that define a subcomponent of path to contain an
   Internet hostname, such as the "mailto" URI scheme, cause that
   subcomponent to be case-insensitive and thus subject to case
   normalization (e.g., "mailto:Joe@Example.COM" is equivalent to
   "mailto:Joe@example.com", even though the generic syntax considers
   the path component to be case-sensitive).

   Other scheme-specific normalizations are possible.

6.2.4.  Protocol-Based Normalization

   Substantial effort to reduce the incidence of false negatives is
   often cost-effective for web spiders.  Therefore, they implement even
   more aggressive techniques in URI comparison.  For example, if they
   observe that a URI such as

      http://example.com/data

   redirects to a URI differing only in the trailing slash

      http://example.com/data/

   they will likely regard the two as equivalent in the future.  This
   kind of technique is only appropriate when equivalence is clearly
   indicated by both the result of accessing the resources and the
   common conventions of their scheme's dereference algorithm (in this
   case, use of redirection by HTTP origin servers to avoid problems
   with relative references).

RFC3986 - Page 43

7.  Security Considerations

   A URI does not in itself pose a security threat.  However, as URIs
   are often used to provide a compact set of instructions for access to
   network resources, care must be taken to properly interpret the data
   within a URI, to prevent that data from causing unintended access,
   and to avoid including data that should not be revealed in plain
   text.

7.1.  Reliability and Consistency

   There is no guarantee that once a URI has been used to retrieve
   information, the same information will be retrievable by that URI in
   the future.  Nor is there any guarantee that the information
   retrievable via that URI in the future will be observably similar to
   that retrieved in the past.  The URI syntax does not constrain how a
   given scheme or authority apportions its namespace or maintains it
   over time.  Such guarantees can only be obtained from the person(s)
   controlling that namespace and the resource in question.  A specific
   URI scheme may define additional semantics, such as name persistence,
   if those semantics are required of all naming authorities for that
   scheme.

7.2.  Malicious Construction

   It is sometimes possible to construct a URI so that an attempt to
   perform a seemingly harmless, idempotent operation, such as the
   retrieval of a representation, will in fact cause a possibly damaging
   remote operation.  The unsafe URI is typically constructed by
   specifying a port number other than that reserved for the network
   protocol in question.  The client unwittingly contacts a site running
   a different protocol service, and data within the URI contains
   instructions that, when interpreted according to this other protocol,
   cause an unexpected operation.  A frequent example of such abuse has
   been the use of a protocol-based scheme with a port component of
   "25", thereby fooling user agent software into sending an unintended
   or impersonating message via an SMTP server.

   Applications should prevent dereference of a URI that specifies a TCP
   port number within the "well-known port" range (0 - 1023) unless the
   protocol being used to dereference that URI is compatible with the
   protocol expected on that well-known port.  Although IANA maintains a
   registry of well-known ports, applications should make such
   restrictions user-configurable to avoid preventing the deployment of
   new services.

RFC3986 - Page 44

   When a URI contains percent-encoded octets that match the delimiters
   for a given resolution or dereference protocol (for example, CR and
   LF characters for the TELNET protocol), these percent-encodings must
   not be decoded before transmission across that protocol.  Transfer of
   the percent-encoding, which might violate the protocol, is less
   harmful than allowing decoded octets to be interpreted as additional
   operations or parameters, perhaps triggering an unexpected and
   possibly harmful remote operation.

7.3.  Back-End Transcoding

   When a URI is dereferenced, the data within it is often parsed by
   both the user agent and one or more servers.  In HTTP, for example, a
   typical user agent will parse a URI into its five major components,
   access the authority's server, and send it the data within the
   authority, path, and query components.  A typical server will take
   that information, parse the path into segments and the query into
   key/value pairs, and then invoke implementation-specific handlers to
   respond to the request.  As a result, a common security concern for
   server implementations that handle a URI, either as a whole or split
   into separate components, is proper interpretation of the octet data
   represented by the characters and percent-encodings within that URI.

   Percent-encoded octets must be decoded at some point during the
   dereference process.  Applications must split the URI into its
   components and subcomponents prior to decoding the octets, as
   otherwise the decoded octets might be mistaken for delimiters.
   Security checks of the data within a URI should be applied after
   decoding the octets.  Note, however, that the "%00" percent-encoding
   (NUL) may require special handling and should be rejected if the
   application is not expecting to receive raw data within a component.

   Special care should be taken when the URI path interpretation process
   involves the use of a back-end file system or related system
   functions.  File systems typically assign an operational meaning to
   special characters, such as the "/", "\", ":", "[", and "]"
   characters, and to special device names like ".", "..", "...", "aux",
   "lpt", etc.  In some cases, merely testing for the existence of such
   a name will cause the operating system to pause or invoke unrelated
   system calls, leading to significant security concerns regarding
   denial of service and unintended data transfer.  It would be
   impossible for this specification to list all such significant
   characters and device names.  Implementers should research the
   reserved names and characters for the types of storage device that
   may be attached to their applications and restrict the use of data
   obtained from URI components accordingly.

RFC3986 - Page 45

7.4.  Rare IP Address Formats

   Although the URI syntax for IPv4address only allows the common
   dotted-decimal form of IPv4 address literal, many implementations
   that process URIs make use of platform-dependent system routines,
   such as gethostbyname() and inet_aton(), to translate the string
   literal to an actual IP address.  Unfortunately, such system routines
   often allow and process a much larger set of formats than those
   described in Section 3.2.2.

   For example, many implementations allow dotted forms of three
   numbers, wherein the last part is interpreted as a 16-bit quantity
   and placed in the right-most two bytes of the network address (e.g.,
   a Class B network).  Likewise, a dotted form of two numbers means
   that the last part is interpreted as a 24-bit quantity and placed in
   the right-most three bytes of the network address (Class A), and a
   single number (without dots) is interpreted as a 32-bit quantity and
   stored directly in the network address.  Adding further to the
   confusion, some implementations allow each dotted part to be
   interpreted as decimal, octal, or hexadecimal, as specified in the C
   language (i.e., a leading 0x or 0X implies hexadecimal; a leading 0
   implies octal; otherwise, the number is interpreted as decimal).

   These additional IP address formats are not allowed in the URI syntax
   due to differences between platform implementations.  However, they
   can become a security concern if an application attempts to filter
   access to resources based on the IP address in string literal format.
   If this filtering is performed, literals should be converted to
   numeric form and filtered based on the numeric value, and not on a
   prefix or suffix of the string form.

7.5.  Sensitive Information

   URI producers should not provide a URI that contains a username or
   password that is intended to be secret.  URIs are frequently
   displayed by browsers, stored in clear text bookmarks, and logged by
   user agent history and intermediary applications (proxies).  A
   password appearing within the userinfo component is deprecated and
   should be considered an error (or simply ignored) except in those
   rare cases where the 'password' parameter is intended to be public.

7.6.  Semantic Attacks

   Because the userinfo subcomponent is rarely used and appears before
   the host in the authority component, it can be used to construct a
   URI intended to mislead a human user by appearing to identify one
   (trusted) naming authority while actually identifying a different
   authority hidden behind the noise.  For example

RFC3986 - Page 46

      ftp://cnn.example.com&story=breaking_news@10.0.0.1/top_story.htm

   might lead a human user to assume that the host is 'cnn.example.com',
   whereas it is actually '10.0.0.1'.  Note that a misleading userinfo
   subcomponent could be much longer than the example above.

   A misleading URI, such as that above, is an attack on the user's
   preconceived notions about the meaning of a URI rather than an attack
   on the software itself.  User agents may be able to reduce the impact
   of such attacks by distinguishing the various components of the URI
   when they are rendered, such as by using a different color or tone to
   render userinfo if any is present, though there is no panacea.  More
   information on URI-based semantic attacks can be found in [Siedzik].

8.  IANA Considerations

   URI scheme names, as defined by <scheme> in Section 3.1, form a
   registered namespace that is managed by IANA according to the
   procedures defined in [BCP35].  No IANA actions are required by this
   document.

9.  Acknowledgements

   This specification is derived from RFC 2396 [RFC2396], RFC 1808
   [RFC1808], and RFC 1738 [RFC1738]; the acknowledgements in those
   documents still apply.  It also incorporates the update (with
   corrections) for IPv6 literals in the host syntax, as defined by
   Robert M. Hinden, Brian E. Carpenter, and Larry Masinter in
   [RFC2732].  In addition, contributions by Gisle Aas, Reese Anschultz,
   Daniel Barclay, Tim Bray, Mike Brown, Rob Cameron, Jeremy Carroll,
   Dan Connolly, Adam M. Costello, John Cowan, Jason Diamond, Martin
   Duerst, Stefan Eissing, Clive D.W. Feather, Al Gilman, Tony Hammond,
   Elliotte Harold, Pat Hayes, Henry Holtzman, Ian B. Jacobs, Michael
   Kay, John C. Klensin, Graham Klyne, Dan Kohn, Bruce Lilly, Andrew
   Main, Dave McAlpin, Ira McDonald, Michael Mealling, Ray Merkert,
   Stephen Pollei, Julian Reschke, Tomas Rokicki, Miles Sabin, Kai
   Schaetzl, Mark Thomson, Ronald Tschalaer, Norm Walsh, Marc Warne,
   Stuart Williams, and Henry Zongaro are gratefully acknowledged.

10.  References

10.1.  Normative References

   [ASCII]    American National Standards Institute, "Coded Character
              Set -- 7-bit American Standard Code for Information
              Interchange", ANSI X3.4, 1986.

RFC3986 - Page 47

   [RFC2234]  Crocker, D. and P. Overell, "Augmented BNF for Syntax
              Specifications: ABNF", RFC 2234, November 1997.

   [STD63]    Yergeau, F., "UTF-8, a transformation format of
              ISO 10646", STD 63, RFC 3629, November 2003.

   [UCS]      International Organization for Standardization,
              "Information Technology - Universal Multiple-Octet Coded
              Character Set (UCS)", ISO/IEC 10646:2003, December 2003.

10.2.  Informative References

   [BCP19]    Freed, N. and J. Postel, "IANA Charset Registration
              Procedures", BCP 19, RFC 2978, October 2000.

   [BCP35]    Petke, R. and I. King, "Registration Procedures for URL
              Scheme Names", BCP 35, RFC 2717, November 1999.

   [RFC0952]  Harrenstien, K., Stahl, M., and E. Feinler, "DoD Internet
              host table specification", RFC 952, October 1985.

   [RFC1034]  Mockapetris, P., "Domain names - concepts and facilities",
              STD 13, RFC 1034, November 1987.

   [RFC1123]  Braden, R., "Requirements for Internet Hosts - Application
              and Support", STD 3, RFC 1123, October 1989.

   [RFC1535]  Gavron, E., "A Security Problem and Proposed Correction
              With Widely Deployed DNS Software", RFC 1535,
              October 1993.

   [RFC1630]  Berners-Lee, T., "Universal Resource Identifiers in WWW: A
              Unifying Syntax for the Expression of Names and Addresses
              of Objects on the Network as used in the World-Wide Web",
              RFC 1630, June 1994.

   [RFC1736]  Kunze, J., "Functional Recommendations for Internet
              Resource Locators", RFC 1736, February 1995.

   [RFC1737]  Sollins, K. and L. Masinter, "Functional Requirements for
              Uniform Resource Names", RFC 1737, December 1994.

   [RFC1738]  Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
              Resource Locators (URL)", RFC 1738, December 1994.

   [RFC1808]  Fielding, R., "Relative Uniform Resource Locators",
              RFC 1808, June 1995.

RFC3986 - Page 48

   [RFC2046]  Freed, N. and N. Borenstein, "Multipurpose Internet Mail
              Extensions (MIME) Part Two: Media Types", RFC 2046,
              November 1996.

   [RFC2141]  Moats, R., "URN Syntax", RFC 2141, May 1997.

   [RFC2396]  Berners-Lee, T., Fielding, R., and L. Masinter, "Uniform
              Resource Identifiers (URI): Generic Syntax", RFC 2396,
              August 1998.

   [RFC2518]  Goland, Y., Whitehead, E., Faizi, A., Carter, S., and D.
              Jensen, "HTTP Extensions for Distributed Authoring --
              WEBDAV", RFC 2518, February 1999.

   [RFC2557]  Palme, J., Hopmann, A., and N. Shelness, "MIME
              Encapsulation of Aggregate Documents, such as HTML
              (MHTML)", RFC 2557, March 1999.

   [RFC2718]  Masinter, L., Alvestrand, H., Zigmond, D., and R. Petke,
              "Guidelines for new URL Schemes", RFC 2718, November 1999.

   [RFC2732]  Hinden, R., Carpenter, B., and L. Masinter, "Format for
              Literal IPv6 Addresses in URL's", RFC 2732, December 1999.

   [RFC3305]  Mealling, M. and R. Denenberg, "Report from the Joint
              W3C/IETF URI Planning Interest Group: Uniform Resource
              Identifiers (URIs), URLs, and Uniform Resource Names
              (URNs): Clarifications and Recommendations", RFC 3305,
              August 2002.

   [RFC3490]  Faltstrom, P., Hoffman, P., and A. Costello,
              "Internationalizing Domain Names in Applications (IDNA)",
              RFC 3490, March 2003.

   [RFC3513]  Hinden, R. and S. Deering, "Internet Protocol Version 6
              (IPv6) Addressing Architecture", RFC 3513, April 2003.

   [Siedzik]  Siedzik, R., "Semantic Attacks: What's in a URL?",
              April 2001, <http://www.giac.org/practical/gsec/
              Richard_Siedzik_GSEC.pdf>.

RFC3986 - Page 49

Appendix A.  Collected ABNF for URI

   URI           = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

   hier-part     = "//" authority path-abempty
                 / path-absolute
                 / path-rootless
                 / path-empty

   URI-reference = URI / relative-ref

   absolute-URI  = scheme ":" hier-part [ "?" query ]

   relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

   relative-part = "//" authority path-abempty
                 / path-absolute
                 / path-noscheme
                 / path-empty

   scheme        = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

   authority     = [ userinfo "@" ] host [ ":" port ]
   userinfo      = *( unreserved / pct-encoded / sub-delims / ":" )
   host          = IP-literal / IPv4address / reg-name
   port          = *DIGIT

   IP-literal    = "[" ( IPv6address / IPvFuture  ) "]"

   IPvFuture     = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

   IPv6address   =                            6( h16 ":" ) ls32
                 /                       "::" 5( h16 ":" ) ls32
                 / [               h16 ] "::" 4( h16 ":" ) ls32
                 / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                 / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                 / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                 / [ *4( h16 ":" ) h16 ] "::"              ls32
                 / [ *5( h16 ":" ) h16 ] "::"              h16
                 / [ *6( h16 ":" ) h16 ] "::"

   h16           = 1*4HEXDIG
   ls32          = ( h16 ":" h16 ) / IPv4address
   IPv4address   = dec-octet "." dec-octet "." dec-octet "." dec-octet

RFC3986 - Page 50

   dec-octet     = DIGIT                 ; 0-9
                 / %x31-39 DIGIT         ; 10-99
                 / "1" 2DIGIT            ; 100-199
                 / "2" %x30-34 DIGIT     ; 200-249
                 / "25" %x30-35          ; 250-255

   reg-name      = *( unreserved / pct-encoded / sub-delims )

   path          = path-abempty    ; begins with "/" or is empty
                 / path-absolute   ; begins with "/" but not "//"
                 / path-noscheme   ; begins with a non-colon segment
                 / path-rootless   ; begins with a segment
                 / path-empty      ; zero characters

   path-abempty  = *( "/" segment )
   path-absolute = "/" [ segment-nz *( "/" segment ) ]
   path-noscheme = segment-nz-nc *( "/" segment )
   path-rootless = segment-nz *( "/" segment )
   path-empty    = 0<pchar>

   segment       = *pchar
   segment-nz    = 1*pchar
   segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                 ; non-zero-length segment without any colon ":"

   pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

   query         = *( pchar / "/" / "?" )

   fragment      = *( pchar / "/" / "?" )

   pct-encoded   = "%" HEXDIG HEXDIG

   unreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~"
   reserved      = gen-delims / sub-delims
   gen-delims    = ":" / "/" / "?" / "#" / "[" / "]" / "@"
   sub-delims    = "!" / "$" / "&" / "'" / "(" / ")"
                 / "*" / "+" / "," / ";" / "="

Appendix B.  Parsing a URI Reference with a Regular Expression

   As the "first-match-wins" algorithm is identical to the "greedy"
   disambiguation method used by POSIX regular expressions, it is
   natural and commonplace to use a regular expression for parsing the
   potential five components of a URI reference.

   The following line is the regular expression for breaking-down a
   well-formed URI reference into its components.

RFC3986 - Page 51

      ^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?
       12            3  4          5       6  7        8 9

   The numbers in the second line above are only to assist readability;
   they indicate the reference points for each subexpression (i.e., each
   paired parenthesis).  We refer to the value matched for subexpression
   <n> as $<n>.  For example, matching the above expression to

      http://www.ics.uci.edu/pub/ietf/uri/#Related

   results in the following subexpression matches:

      $1 = http:
      $2 = http
      $3 = //www.ics.uci.edu
      $4 = www.ics.uci.edu
      $5 = /pub/ietf/uri/
      $6 = <undefined>
      $7 = <undefined>
      $8 = #Related
      $9 = Related

   where <undefined> indicates that the component is not present, as is
   the case for the query component in the above example.  Therefore, we
   can determine the value of the five components as

      scheme    = $2
      authority = $4
      path      = $5
      query     = $7
      fragment  = $9

   Going in the opposite direction, we can recreate a URI reference from
   its components by using the algorithm of Section 5.3.

Appendix C.  Delimiting a URI in Context

   URIs are often transmitted through formats that do not provide a
   clear context for their interpretation.  For example, there are many
   occasions when a URI is included in plain text; examples include text
   sent in email, USENET news, and on printed paper.  In such cases, it
   is important to be able to delimit the URI from the rest of the text,
   and in particular from punctuation marks that might be mistaken for
   part of the URI.

   In practice, URIs are delimited in a variety of ways, but usually
   within double-quotes "http://example.com/", angle brackets
   <http://example.com/>, or just by using whitespace:

RFC3986 - Page 52

      http://example.com/

   These wrappers do not form part of the URI.

   In some cases, extra whitespace (spaces, line-breaks, tabs, etc.) may
   have to be added to break a long URI across lines.  The whitespace
   should be ignored when the URI is extracted.

   No whitespace should be introduced after a hyphen ("-") character.
   Because some typesetters and printers may (erroneously) introduce a
   hyphen at the end of line when breaking it, the interpreter of a URI
   containing a line break immediately after a hyphen should ignore all
   whitespace around the line break and should be aware that the hyphen
   may or may not actually be part of the URI.

   Using <> angle brackets around each URI is especially recommended as
   a delimiting style for a reference that contains embedded whitespace.

   The prefix "URL:" (with or without a trailing space) was formerly
   recommended as a way to help distinguish a URI from other bracketed
   designators, though it is not commonly used in practice and is no
   longer recommended.

   For robustness, software that accepts user-typed URI should attempt
   to recognize and strip both delimiters and embedded whitespace.

   For example, the text

      Yes, Jim, I found it under "http://www.w3.org/Addressing/",
      but you can probably pick it up from <ftp://foo.example.
      com/rfc/>.  Note the warning in <http://www.ics.uci.edu/pub/
      ietf/uri/historical.html#WARNING>.

   contains the URI references

      http://www.w3.org/Addressing/
      ftp://foo.example.com/rfc/
      http://www.ics.uci.edu/pub/ietf/uri/historical.html#WARNING

RFC3986 - Page 53

Appendix D.  Changes from RFC 2396

D.1.  Additions

   An ABNF rule for URI has been introduced to correspond to one common
   usage of the term: an absolute URI with optional fragment.

   IPv6 (and later) literals have been added to the list of possible
   identifiers for the host portion of an authority component, as
   described by [RFC2732], with the addition of "[" and "]" to the
   reserved set and a version flag to anticipate future versions of IP
   literals.  Square brackets are now specified as reserved within the
   authority component and are not allowed outside their use as
   delimiters for an IP literal within host.  In order to make this
   change without changing the technical definition of the path, query,
   and fragment components, those rules were redefined to directly
   specify the characters allowed.

   As [RFC2732] defers to [RFC3513] for definition of an IPv6 literal
   address, which, unfortunately, lacks an ABNF description of
   IPv6address, we created a new ABNF rule for IPv6address that matches
   the text representations defined by Section 2.2 of [RFC3513].
   Likewise, the definition of IPv4address has been improved in order to
   limit each decimal octet to the range 0-255.

   Section 6, on URI normalization and comparison, has been completely
   rewritten and extended by using input from Tim Bray and discussion
   within the W3C Technical Architecture Group.

D.2.  Modifications

   The ad-hoc BNF syntax of RFC 2396 has been replaced with the ABNF of
   [RFC2234].  This change required all rule names that formerly
   included underscore characters to be renamed with a dash instead.  In
   addition, a number of syntax rules have been eliminated or simplified
   to make the overall grammar more comprehensible.  Specifications that
   refer to the obsolete grammar rules may be understood by replacing
   those rules according to the following table:

RFC3986 - Page 54

   +----------------+--------------------------------------------------+
   | obsolete rule  | translation                                      |
   +----------------+--------------------------------------------------+
   | absoluteURI    | absolute-URI                                     |
   | relativeURI    | relative-part [ "?" query ]                      |
   | hier_part      | ( "//" authority path-abempty /                  |
   |                | path-absolute ) [ "?" query ]                    |
   |                |                                                  |
   | opaque_part    | path-rootless [ "?" query ]                      |
   | net_path       | "//" authority path-abempty                      |
   | abs_path       | path-absolute                                    |
   | rel_path       | path-rootless                                    |
   | rel_segment    | segment-nz-nc                                    |
   | reg_name       | reg-name                                         |
   | server         | authority                                        |
   | hostport       | host [ ":" port ]                                |
   | hostname       | reg-name                                         |
   | path_segments  | path-abempty                                     |
   | param          | *<pchar excluding ";">                           |
   |                |                                                  |
   | uric           | unreserved / pct-encoded / ";" / "?" / ":"       |
   |                |  / "@" / "&" / "=" / "+" / "$" / "," / "/"       |
   |                |                                                  |
   | uric_no_slash  | unreserved / pct-encoded / ";" / "?" / ":"       |
   |                |  / "@" / "&" / "=" / "+" / "$" / ","             |
   |                |                                                  |
   | mark           | "-" / "_" / "." / "!" / "~" / "*" / "'"          |
   |                |  / "(" / ")"                                     |
   |                |                                                  |
   | escaped        | pct-encoded                                      |
   | hex            | HEXDIG                                           |
   | alphanum       | ALPHA / DIGIT                                    |
   +----------------+--------------------------------------------------+

   Use of the above obsolete rules for the definition of scheme-specific
   syntax is deprecated.

   Section 2, on characters, has been rewritten to explain what
   characters are reserved, when they are reserved, and why they are
   reserved, even when they are not used as delimiters by the generic
   syntax.  The mark characters that are typically unsafe to decode,
   including the exclamation mark ("!"), asterisk ("*"), single-quote
   ("'"), and open and close parentheses ("(" and ")"), have been moved
   to the reserved set in order to clarify the distinction between
   reserved and unreserved and, hopefully, to answer the most common
   question of scheme designers.  Likewise, the section on
   percent-encoded characters has been rewritten, and URI normalizers
   are now given license to decode any percent-encoded octets

RFC3986 - Page 55

   corresponding to unreserved characters.  In general, the terms
   "escaped" and "unescaped" have been replaced with "percent-encoded"
   and "decoded", respectively, to reduce confusion with other forms of
   escape mechanisms.

   The ABNF for URI and URI-reference has been redesigned to make them
   more friendly to LALR parsers and to reduce complexity.  As a result,
   the layout form of syntax description has been removed, along with
   the uric, uric_no_slash, opaque_part, net_path, abs_path, rel_path,
   path_segments, rel_segment, and mark rules.  All references to
   "opaque" URIs have been replaced with a better description of how the
   path component may be opaque to hierarchy.  The relativeURI rule has
   been replaced with relative-ref to avoid unnecessary confusion over
   whether they are a subset of URI.  The ambiguity regarding the
   parsing of URI-reference as a URI or a relative-ref with a colon in
   the first segment has been eliminated through the use of five
   separate path matching rules.

   The fragment identifier has been moved back into the section on
   generic syntax components and within the URI and relative-ref rules,
   though it remains excluded from absolute-URI.  The number sign ("#")
   character has been moved back to the reserved set as a result of
   reintegrating the fragment syntax.

   The ABNF has been corrected to allow the path component to be empty.
   This also allows an absolute-URI to consist of nothing after the
   "scheme:", as is present in practice with the "dav:" namespace
   [RFC2518] and with the "about:" scheme used internally by many WWW
   browser implementations.  The ambiguity regarding the boundary
   between authority and path has been eliminated through the use of
   five separate path matching rules.

   Registry-based naming authorities that use the generic syntax are now
   defined within the host rule.  This change allows current
   implementations, where whatever name provided is simply fed to the
   local name resolution mechanism, to be consistent with the
   specification.  It also removes the need to re-specify DNS name
   formats here.  Furthermore, it allows the host component to contain
   percent-encoded octets, which is necessary to enable
   internationalized domain names to be provided in URIs, processed in
   their native character encodings at the application layers above URI
   processing, and passed to an IDNA library as a registered name in the
   UTF-8 character encoding.  The server, hostport, hostname,
   domainlabel, toplabel, and alphanum rules have been removed.

   The resolving relative references algorithm of [RFC2396] has been
   rewritten with pseudocode for this revision to improve clarity and
   fix the following issues:

RFC3986 - Page 56

   o  [RFC2396] section 5.2, step 6a, failed to account for a base URI
      with no path.

   o  Restored the behavior of [RFC1808] where, if the reference
      contains an empty path and a defined query component, the target
      URI inherits the base URI's path component.

   o  The determination of whether a URI reference is a same-document
      reference has been decoupled from the URI parser, simplifying the
      URI processing interface within applications in a way consistent
      with the internal architecture of deployed URI processing
      implementations.  The determination is now based on comparison to
      the base URI after transforming a reference to absolute form,
      rather than on the format of the reference itself.  This change
      may result in more references being considered "same-document"
      under this specification than there would be under the rules given
      in RFC 2396, especially when normalization is used to reduce
      aliases.  However, it does not change the status of existing
      same-document references.

   o  Separated the path merge routine into two routines: merge, for
      describing combination of the base URI path with a relative-path
      reference, and remove_dot_segments, for describing how to remove
      the special "." and ".." segments from a composed path.  The
      remove_dot_segments algorithm is now applied to all URI reference
      paths in order to match common implementations and to improve the
      normalization of URIs in practice.  This change only impacts the
      parsing of abnormal references and same-scheme references wherein
      the base URI has a non-hierarchical path.

Index

   A
      ABNF  11
      absolute  27
      absolute-path  26
      absolute-URI  27
      access  9
      authority  17, 18

   B
      base URI  28

   C
      character encoding  4
      character  4
      characters  8, 11
      coded character set  4

RFC3986 - Page 57

   D
      dec-octet  20
      dereference  9
      dot-segments  23

   F
      fragment  16, 24

   G
      gen-delims  13
      generic syntax  6

   H
      h16  20
      hier-part  16
      hierarchical  10
      host  18

   I
      identifier  5
      IP-literal  19
      IPv4  20
      IPv4address  19, 20
      IPv6  19
      IPv6address  19, 20
      IPvFuture  19

   L
      locator  7
      ls32  20

   M
      merge  32

   N
      name  7
      network-path  26

   P
      path  16, 22, 26
         path-abempty  22
         path-absolute  22
         path-empty  22
         path-noscheme  22
         path-rootless  22
      path-abempty  16, 22, 26
      path-absolute  16, 22, 26
      path-empty  16, 22, 26

RFC3986 - Page 58

      path-rootless  16, 22
      pchar  23
      pct-encoded  12
      percent-encoding  12
      port  22

   Q
      query  16, 23

   R
      reg-name  21
      registered name  20
      relative  10, 28
      relative-path  26
      relative-ref  26
      remove_dot_segments  33
      representation  9
      reserved  12
      resolution  9, 28
      resource  5
      retrieval  9

   S
      same-document  27
      sameness  9
      scheme  16, 17
      segment  22, 23
         segment-nz  23
         segment-nz-nc  23
      sub-delims  13
      suffix  27

   T
      transcription  8

   U
      uniform  4
      unreserved  13
      URI grammar
         absolute-URI  27
         ALPHA  11
         authority  18
         CR  11
         dec-octet  20
         DIGIT  11
         DQUOTE  11
         fragment  24
         gen-delims  13

RFC3986 - Page 59

         h16  20
         HEXDIG  11
         hier-part  16
         host  19
         IP-literal  19
         IPv4address  20
         IPv6address  20
         IPvFuture  19
         LF  11
         ls32  20
         OCTET  11
         path  22
         path-abempty  22
         path-absolute  22
         path-empty  22
         path-noscheme  22
         path-rootless  22
         pchar  23
         pct-encoded  12
         port  22
         query  24
         reg-name  21
         relative-ref  26
         reserved  13
         scheme  17
         segment  23
         segment-nz  23
         segment-nz-nc  23
         SP  11
         sub-delims  13
         unreserved  13
         URI  16
         URI-reference  25
         userinfo  18
      URI  16
      URI-reference  25
      URL  7
      URN  7
      userinfo  18

RFC3986 - Page 60

Authors' Addresses

   Tim Berners-Lee
   World Wide Web Consortium
   Massachusetts Institute of Technology
   77 Massachusetts Avenue
   Cambridge, MA  02139
   USA

   Phone: +1-617-253-5702
   Fax:   +1-617-258-5999
   EMail: timbl@w3.org
   URI:   http://www.w3.org/People/Berners-Lee/


   Roy T. Fielding
   Day Software
   5251 California Ave., Suite 110
   Irvine, CA  92617
   USA

   Phone: +1-949-679-2960
   Fax:   +1-949-679-2972
   EMail: fielding@gbiv.com
   URI:   http://roy.gbiv.com/


   Larry Masinter
   Adobe Systems Incorporated
   345 Park Ave
   San Jose, CA  95110
   USA

   Phone: +1-408-536-3024
   EMail: LMM@acm.org
   URI:   http://larry.masinter.net/

RFC3986 - Page 61

Full Copyright Statement

   Copyright (C) The Internet Society (2005).

   This document is subject to the rights, licenses and restrictions
   contained in BCP 78, and except as set forth therein, the authors
   retain all their rights.

   This document and the information contained herein are provided on an
   "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS
   OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET
   ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED,
   INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE
   INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED
   WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Intellectual Property

   The IETF takes no position regarding the validity or scope of any
   Intellectual Property Rights or other rights that might be claimed to
   pertain to the implementation or use of the technology described in
   this document or the extent to which any license under such rights
   might or might not be available; nor does it represent that it has
   made any independent effort to identify any such rights.  Information
   on the IETF's procedures with respect to rights in IETF Documents can
   be found in BCP 78 and BCP 79.

   Copies of IPR disclosures made to the IETF Secretariat and any
   assurances of licenses to be made available, or the result of an
   attempt made to obtain a general license or permission for the use of
   such proprietary rights by implementers or users of this
   specification can be obtained from the IETF on-line IPR repository at
   http://www.ietf.org/ipr.

   The IETF invites any interested party to bring to its attention any
   copyrights, patents or patent applications, or other proprietary
   rights that may cover technology that may be required to implement
   this standard.  Please address the information to the IETF at ietf-
   ipr@ietf.org.


Acknowledgement

   Funding for the RFC Editor function is currently provided by the
   Internet Society.