RFC 3986

Uniform Resource Identifier (URI): Generic Syntax

Pages: 61
Internet Standard: 66
→ Errata
Obsoletes: 2732 2396 1808
Updates: 1738
Updated by: 6874 7320 8820

Part 2 of 3 – Pages 16 to 37

RFC3986 - Page 16 prevText

3.  Syntax Components

   The generic URI syntax consists of a hierarchical sequence of
   components referred to as the scheme, authority, path, query, and
   fragment.

      URI         = scheme ":" hier-part [ "?" query ] [ "#" fragment ]

      hier-part   = "//" authority path-abempty
                  / path-absolute
                  / path-rootless
                  / path-empty

   The scheme and path components are required, though the path may be
   empty (no characters).  When authority is present, the path must
   either be empty or begin with a slash ("/") character.  When
   authority is not present, the path cannot begin with two slash
   characters ("//").  These restrictions result in five different ABNF
   rules for a path (Section 3.3), only one of which will match any
   given URI reference.

   The following are two example URIs and their component parts:

         foo://example.com:8042/over/there?name=ferret#nose
         \_/   \______________/\_________/ \_________/ \__/
          |           |            |            |        |
       scheme     authority       path        query   fragment
          |   _____________________|__
         / \ /                        \
         urn:example:animal:ferret:nose

RFC3986 - Page 17

3.1.  Scheme

   Each URI begins with a scheme name that refers to a specification for
   assigning identifiers within that scheme.  As such, the URI syntax is
   a federated and extensible naming system wherein each scheme's
   specification may further restrict the syntax and semantics of
   identifiers using that scheme.

   Scheme names consist of a sequence of characters beginning with a
   letter and followed by any combination of letters, digits, plus
   ("+"), period ("."), or hyphen ("-").  Although schemes are case-
   insensitive, the canonical form is lowercase and documents that
   specify schemes must do so with lowercase letters.  An implementation
   should accept uppercase letters as equivalent to lowercase in scheme
   names (e.g., allow "HTTP" as well as "http") for the sake of
   robustness but should only produce lowercase scheme names for
   consistency.

      scheme      = ALPHA *( ALPHA / DIGIT / "+" / "-" / "." )

   Individual schemes are not specified by this document.  The process
   for registration of new URI schemes is defined separately by [BCP35].
   The scheme registry maintains the mapping between scheme names and
   their specifications.  Advice for designers of new URI schemes can be
   found in [RFC2718].  URI scheme specifications must define their own
   syntax so that all strings matching their scheme-specific syntax will
   also match the <absolute-URI> grammar, as described in Section 4.3.

   When presented with a URI that violates one or more scheme-specific
   restrictions, the scheme-specific resolution process should flag the
   reference as an error rather than ignore the unused parts; doing so
   reduces the number of equivalent URIs and helps detect abuses of the
   generic syntax, which might indicate that the URI has been
   constructed to mislead the user (Section 7.6).

3.2.  Authority

   Many URI schemes include a hierarchical element for a naming
   authority so that governance of the name space defined by the
   remainder of the URI is delegated to that authority (which may, in
   turn, delegate it further).  The generic syntax provides a common
   means for distinguishing an authority based on a registered name or
   server address, along with optional port and user information.

   The authority component is preceded by a double slash ("//") and is
   terminated by the next slash ("/"), question mark ("?"), or number
   sign ("#") character, or by the end of the URI.

RFC3986 - Page 18

      authority   = [ userinfo "@" ] host [ ":" port ]

   URI producers and normalizers should omit the ":" delimiter that
   separates host from port if the port component is empty.  Some
   schemes do not allow the userinfo and/or port subcomponents.

   If a URI contains an authority component, then the path component
   must either be empty or begin with a slash ("/") character.  Non-
   validating parsers (those that merely separate a URI reference into
   its major components) will often ignore the subcomponent structure of
   authority, treating it as an opaque string from the double-slash to
   the first terminating delimiter, until such time as the URI is
   dereferenced.

3.2.1.  User Information

   The userinfo subcomponent may consist of a user name and, optionally,
   scheme-specific information about how to gain authorization to access
   the resource.  The user information, if present, is followed by a
   commercial at-sign ("@") that delimits it from the host.

      userinfo    = *( unreserved / pct-encoded / sub-delims / ":" )

   Use of the format "user:password" in the userinfo field is
   deprecated.  Applications should not render as clear text any data
   after the first colon (":") character found within a userinfo
   subcomponent unless the data after the colon is the empty string
   (indicating no password).  Applications may choose to ignore or
   reject such data when it is received as part of a reference and
   should reject the storage of such data in unencrypted form.  The
   passing of authentication information in clear text has proven to be
   a security risk in almost every case where it has been used.

   Applications that render a URI for the sake of user feedback, such as
   in graphical hypertext browsing, should render userinfo in a way that
   is distinguished from the rest of a URI, when feasible.  Such
   rendering will assist the user in cases where the userinfo has been
   misleadingly crafted to look like a trusted domain name
   (Section 7.6).

3.2.2.  Host

   The host subcomponent of authority is identified by an IP literal
   encapsulated within square brackets, an IPv4 address in dotted-
   decimal form, or a registered name.  The host subcomponent is case-
   insensitive.  The presence of a host subcomponent within a URI does
   not imply that the scheme requires access to the given host on the
   Internet.  In many cases, the host syntax is used only for the sake

RFC3986 - Page 19

   of reusing the existing registration process created and deployed for
   DNS, thus obtaining a globally unique name without the cost of
   deploying another registry.  However, such use comes with its own
   costs: domain name ownership may change over time for reasons not
   anticipated by the URI producer.  In other cases, the data within the
   host component identifies a registered name that has nothing to do
   with an Internet host.  We use the name "host" for the ABNF rule
   because that is its most common purpose, not its only purpose.

      host        = IP-literal / IPv4address / reg-name

   The syntax rule for host is ambiguous because it does not completely
   distinguish between an IPv4address and a reg-name.  In order to
   disambiguate the syntax, we apply the "first-match-wins" algorithm:
   If host matches the rule for IPv4address, then it should be
   considered an IPv4 address literal and not a reg-name.  Although host
   is case-insensitive, producers and normalizers should use lowercase
   for registered names and hexadecimal addresses for the sake of
   uniformity, while only using uppercase letters for percent-encodings.

   A host identified by an Internet Protocol literal address, version 6
   [RFC3513] or later, is distinguished by enclosing the IP literal
   within square brackets ("[" and "]").  This is the only place where
   square bracket characters are allowed in the URI syntax.  In
   anticipation of future, as-yet-undefined IP literal address formats,
   an implementation may use an optional version flag to indicate such a
   format explicitly rather than rely on heuristic determination.

      IP-literal = "[" ( IPv6address / IPvFuture  ) "]"

      IPvFuture  = "v" 1*HEXDIG "." 1*( unreserved / sub-delims / ":" )

   The version flag does not indicate the IP version; rather, it
   indicates future versions of the literal format.  As such,
   implementations must not provide the version flag for the existing
   IPv4 and IPv6 literal address forms described below.  If a URI
   containing an IP-literal that starts with "v" (case-insensitive),
   indicating that the version flag is present, is dereferenced by an
   application that does not know the meaning of that version flag, then
   the application should return an appropriate error for "address
   mechanism not supported".

   A host identified by an IPv6 literal address is represented inside
   the square brackets without a preceding version flag.  The ABNF
   provided here is a translation of the text definition of an IPv6
   literal address provided in [RFC3513].  This syntax does not support
   IPv6 scoped addressing zone identifiers.

RFC3986 - Page 20

   A 128-bit IPv6 address is divided into eight 16-bit pieces.  Each
   piece is represented numerically in case-insensitive hexadecimal,
   using one to four hexadecimal digits (leading zeroes are permitted).
   The eight encoded pieces are given most-significant first, separated
   by colon characters.  Optionally, the least-significant two pieces
   may instead be represented in IPv4 address textual format.  A
   sequence of one or more consecutive zero-valued 16-bit pieces within
   the address may be elided, omitting all their digits and leaving
   exactly two consecutive colons in their place to mark the elision.

      IPv6address =                            6( h16 ":" ) ls32
                  /                       "::" 5( h16 ":" ) ls32
                  / [               h16 ] "::" 4( h16 ":" ) ls32
                  / [ *1( h16 ":" ) h16 ] "::" 3( h16 ":" ) ls32
                  / [ *2( h16 ":" ) h16 ] "::" 2( h16 ":" ) ls32
                  / [ *3( h16 ":" ) h16 ] "::"    h16 ":"   ls32
                  / [ *4( h16 ":" ) h16 ] "::"              ls32
                  / [ *5( h16 ":" ) h16 ] "::"              h16
                  / [ *6( h16 ":" ) h16 ] "::"

      ls32        = ( h16 ":" h16 ) / IPv4address
                  ; least-significant 32 bits of address

      h16         = 1*4HEXDIG
                  ; 16 bits of address represented in hexadecimal

   A host identified by an IPv4 literal address is represented in
   dotted-decimal notation (a sequence of four decimal numbers in the
   range 0 to 255, separated by "."), as described in [RFC1123] by
   reference to [RFC0952].  Note that other forms of dotted notation may
   be interpreted on some platforms, as described in Section 7.4, but
   only the dotted-decimal form of four octets is allowed by this
   grammar.

      IPv4address = dec-octet "." dec-octet "." dec-octet "." dec-octet

      dec-octet   = DIGIT                 ; 0-9
                  / %x31-39 DIGIT         ; 10-99
                  / "1" 2DIGIT            ; 100-199
                  / "2" %x30-34 DIGIT     ; 200-249
                  / "25" %x30-35          ; 250-255

   A host identified by a registered name is a sequence of characters
   usually intended for lookup within a locally defined host or service
   name registry, though the URI's scheme-specific semantics may require
   that a specific registry (or fixed name table) be used instead.  The
   most common name registry mechanism is the Domain Name System (DNS).
   A registered name intended for lookup in the DNS uses the syntax

RFC3986 - Page 21

   defined in Section 3.5 of [RFC1034] and Section 2.1 of [RFC1123].
   Such a name consists of a sequence of domain labels separated by ".",
   each domain label starting and ending with an alphanumeric character
   and possibly also containing "-" characters.  The rightmost domain
   label of a fully qualified domain name in DNS may be followed by a
   single "." and should be if it is necessary to distinguish between
   the complete domain name and some local domain.

      reg-name    = *( unreserved / pct-encoded / sub-delims )

   If the URI scheme defines a default for host, then that default
   applies when the host subcomponent is undefined or when the
   registered name is empty (zero length).  For example, the "file" URI
   scheme is defined so that no authority, an empty host, and
   "localhost" all mean the end-user's machine, whereas the "http"
   scheme considers a missing authority or empty host invalid.

   This specification does not mandate a particular registered name
   lookup technology and therefore does not restrict the syntax of reg-
   name beyond what is necessary for interoperability.  Instead, it
   delegates the issue of registered name syntax conformance to the
   operating system of each application performing URI resolution, and
   that operating system decides what it will allow for the purpose of
   host identification.  A URI resolution implementation might use DNS,
   host tables, yellow pages, NetInfo, WINS, or any other system for
   lookup of registered names.  However, a globally scoped naming
   system, such as DNS fully qualified domain names, is necessary for
   URIs intended to have global scope.  URI producers should use names
   that conform to the DNS syntax, even when use of DNS is not
   immediately apparent, and should limit these names to no more than
   255 characters in length.

   The reg-name syntax allows percent-encoded octets in order to
   represent non-ASCII registered names in a uniform way that is
   independent of the underlying name resolution technology.  Non-ASCII
   characters must first be encoded according to UTF-8 [STD63], and then
   each octet of the corresponding UTF-8 sequence must be percent-
   encoded to be represented as URI characters.  URI producing
   applications must not use percent-encoding in host unless it is used
   to represent a UTF-8 character sequence.  When a non-ASCII registered
   name represents an internationalized domain name intended for
   resolution via the DNS, the name must be transformed to the IDNA
   encoding [RFC3490] prior to name lookup.  URI producers should
   provide these registered names in the IDNA encoding, rather than a
   percent-encoding, if they wish to maximize interoperability with
   legacy URI resolvers.

RFC3986 - Page 22

3.2.3.  Port

   The port subcomponent of authority is designated by an optional port
   number in decimal following the host and delimited from it by a
   single colon (":") character.

      port        = *DIGIT

   A scheme may define a default port.  For example, the "http" scheme
   defines a default port of "80", corresponding to its reserved TCP
   port number.  The type of port designated by the port number (e.g.,
   TCP, UDP, SCTP) is defined by the URI scheme.  URI producers and
   normalizers should omit the port component and its ":" delimiter if
   port is empty or if its value would be the same as that of the
   scheme's default.

3.3.  Path

   The path component contains data, usually organized in hierarchical
   form, that, along with data in the non-hierarchical query component
   (Section 3.4), serves to identify a resource within the scope of the
   URI's scheme and naming authority (if any).  The path is terminated
   by the first question mark ("?") or number sign ("#") character, or
   by the end of the URI.

   If a URI contains an authority component, then the path component
   must either be empty or begin with a slash ("/") character.  If a URI
   does not contain an authority component, then the path cannot begin
   with two slash characters ("//").  In addition, a URI reference
   (Section 4.1) may be a relative-path reference, in which case the
   first path segment cannot contain a colon (":") character.  The ABNF
   requires five separate rules to disambiguate these cases, only one of
   which will match the path substring within a given URI reference.  We
   use the generic term "path component" to describe the URI substring
   matched by the parser to one of these rules.

      path          = path-abempty    ; begins with "/" or is empty
                    / path-absolute   ; begins with "/" but not "//"
                    / path-noscheme   ; begins with a non-colon segment
                    / path-rootless   ; begins with a segment
                    / path-empty      ; zero characters

      path-abempty  = *( "/" segment )
      path-absolute = "/" [ segment-nz *( "/" segment ) ]
      path-noscheme = segment-nz-nc *( "/" segment )
      path-rootless = segment-nz *( "/" segment )
      path-empty    = 0<pchar>

RFC3986 - Page 23

      segment       = *pchar
      segment-nz    = 1*pchar
      segment-nz-nc = 1*( unreserved / pct-encoded / sub-delims / "@" )
                    ; non-zero-length segment without any colon ":"

      pchar         = unreserved / pct-encoded / sub-delims / ":" / "@"

   A path consists of a sequence of path segments separated by a slash
   ("/") character.  A path is always defined for a URI, though the
   defined path may be empty (zero length).  Use of the slash character
   to indicate hierarchy is only required when a URI will be used as the
   context for relative references.  For example, the URI
   <mailto:fred@example.com> has a path of "fred@example.com", whereas
   the URI <foo://info.example.com?fred> has an empty path.

   The path segments "." and "..", also known as dot-segments, are
   defined for relative reference within the path name hierarchy.  They
   are intended for use at the beginning of a relative-path reference
   (Section 4.2) to indicate relative position within the hierarchical
   tree of names.  This is similar to their role within some operating
   systems' file directory structures to indicate the current directory
   and parent directory, respectively.  However, unlike in a file
   system, these dot-segments are only interpreted within the URI path
   hierarchy and are removed as part of the resolution process (Section
   5.2).

   Aside from dot-segments in hierarchical paths, a path segment is
   considered opaque by the generic syntax.  URI producing applications
   often use the reserved characters allowed in a segment to delimit
   scheme-specific or dereference-handler-specific subcomponents.  For
   example, the semicolon (";") and equals ("=") reserved characters are
   often used to delimit parameters and parameter values applicable to
   that segment.  The comma (",") reserved character is often used for
   similar purposes.  For example, one URI producer might use a segment
   such as "name;v=1.1" to indicate a reference to version 1.1 of
   "name", whereas another might use a segment such as "name,1.1" to
   indicate the same.  Parameter types may be defined by scheme-specific
   semantics, but in most cases the syntax of a parameter is specific to
   the implementation of the URI's dereferencing algorithm.

3.4.  Query

   The query component contains non-hierarchical data that, along with
   data in the path component (Section 3.3), serves to identify a
   resource within the scope of the URI's scheme and naming authority
   (if any).  The query component is indicated by the first question
   mark ("?") character and terminated by a number sign ("#") character
   or by the end of the URI.

RFC3986 - Page 24

      query       = *( pchar / "/" / "?" )

   The characters slash ("/") and question mark ("?") may represent data
   within the query component.  Beware that some older, erroneous
   implementations may not handle such data correctly when it is used as
   the base URI for relative references (Section 5.1), apparently
   because they fail to distinguish query data from path data when
   looking for hierarchical separators.  However, as query components
   are often used to carry identifying information in the form of
   "key=value" pairs and one frequently used value is a reference to
   another URI, it is sometimes better for usability to avoid percent-
   encoding those characters.

3.5.  Fragment

   The fragment identifier component of a URI allows indirect
   identification of a secondary resource by reference to a primary
   resource and additional identifying information.  The identified
   secondary resource may be some portion or subset of the primary
   resource, some view on representations of the primary resource, or
   some other resource defined or described by those representations.  A
   fragment identifier component is indicated by the presence of a
   number sign ("#") character and terminated by the end of the URI.

      fragment    = *( pchar / "/" / "?" )

   The semantics of a fragment identifier are defined by the set of
   representations that might result from a retrieval action on the
   primary resource.  The fragment's format and resolution is therefore
   dependent on the media type [RFC2046] of a potentially retrieved
   representation, even though such a retrieval is only performed if the
   URI is dereferenced.  If no such representation exists, then the
   semantics of the fragment are considered unknown and are effectively
   unconstrained.  Fragment identifier semantics are independent of the
   URI scheme and thus cannot be redefined by scheme specifications.

   Individual media types may define their own restrictions on or
   structures within the fragment identifier syntax for specifying
   different types of subsets, views, or external references that are
   identifiable as secondary resources by that media type.  If the
   primary resource has multiple representations, as is often the case
   for resources whose representation is selected based on attributes of
   the retrieval request (a.k.a., content negotiation), then whatever is
   identified by the fragment should be consistent across all of those
   representations.  Each representation should either define the
   fragment so that it corresponds to the same secondary resource,
   regardless of how it is represented, or should leave the fragment
   undefined (i.e., not found).

RFC3986 - Page 25

   As with any URI, use of a fragment identifier component does not
   imply that a retrieval action will take place.  A URI with a fragment
   identifier may be used to refer to the secondary resource without any
   implication that the primary resource is accessible or will ever be
   accessed.

   Fragment identifiers have a special role in information retrieval
   systems as the primary form of client-side indirect referencing,
   allowing an author to specifically identify aspects of an existing
   resource that are only indirectly provided by the resource owner.  As
   such, the fragment identifier is not used in the scheme-specific
   processing of a URI; instead, the fragment identifier is separated
   from the rest of the URI prior to a dereference, and thus the
   identifying information within the fragment itself is dereferenced
   solely by the user agent, regardless of the URI scheme.  Although
   this separate handling is often perceived to be a loss of
   information, particularly for accurate redirection of references as
   resources move over time, it also serves to prevent information
   providers from denying reference authors the right to refer to
   information within a resource selectively.  Indirect referencing also
   provides additional flexibility and extensibility to systems that use
   URIs, as new media types are easier to define and deploy than new
   schemes of identification.

   The characters slash ("/") and question mark ("?") are allowed to
   represent data within the fragment identifier.  Beware that some
   older, erroneous implementations may not handle this data correctly
   when it is used as the base URI for relative references (Section
   5.1).

4.  Usage

   When applications make reference to a URI, they do not always use the
   full form of reference defined by the "URI" syntax rule.  To save
   space and take advantage of hierarchical locality, many Internet
   protocol elements and media type formats allow an abbreviation of a
   URI, whereas others restrict the syntax to a particular form of URI.
   We define the most common forms of reference syntax in this
   specification because they impact and depend upon the design of the
   generic syntax, requiring a uniform parsing algorithm in order to be
   interpreted consistently.

4.1.  URI Reference

   URI-reference is used to denote the most common usage of a resource
   identifier.

      URI-reference = URI / relative-ref

RFC3986 - Page 26

   A URI-reference is either a URI or a relative reference.  If the
   URI-reference's prefix does not match the syntax of a scheme followed
   by its colon separator, then the URI-reference is a relative
   reference.

   A URI-reference is typically parsed first into the five URI
   components, in order to determine what components are present and
   whether the reference is relative.  Then, each component is parsed
   for its subparts and their validation.  The ABNF of URI-reference,
   along with the "first-match-wins" disambiguation rule, is sufficient
   to define a validating parser for the generic syntax.  Readers
   familiar with regular expressions should see Appendix B for an
   example of a non-validating URI-reference parser that will take any
   given string and extract the URI components.

4.2.  Relative Reference

   A relative reference takes advantage of the hierarchical syntax
   (Section 1.2.3) to express a URI reference relative to the name space
   of another hierarchical URI.

      relative-ref  = relative-part [ "?" query ] [ "#" fragment ]

      relative-part = "//" authority path-abempty
                    / path-absolute
                    / path-noscheme
                    / path-empty

   The URI referred to by a relative reference, also known as the target
   URI, is obtained by applying the reference resolution algorithm of
   Section 5.

   A relative reference that begins with two slash characters is termed
   a network-path reference; such references are rarely used.  A
   relative reference that begins with a single slash character is
   termed an absolute-path reference.  A relative reference that does
   not begin with a slash character is termed a relative-path reference.

   A path segment that contains a colon character (e.g., "this:that")
   cannot be used as the first segment of a relative-path reference, as
   it would be mistaken for a scheme name.  Such a segment must be
   preceded by a dot-segment (e.g., "./this:that") to make a relative-
   path reference.

RFC3986 - Page 27

4.3.  Absolute URI

   Some protocol elements allow only the absolute form of a URI without
   a fragment identifier.  For example, defining a base URI for later
   use by relative references calls for an absolute-URI syntax rule that
   does not allow a fragment.

      absolute-URI  = scheme ":" hier-part [ "?" query ]

   URI scheme specifications must define their own syntax so that all
   strings matching their scheme-specific syntax will also match the
   <absolute-URI> grammar.  Scheme specifications will not define
   fragment identifier syntax or usage, regardless of its applicability
   to resources identifiable via that scheme, as fragment identification
   is orthogonal to scheme definition.  However, scheme specifications
   are encouraged to include a wide range of examples, including
   examples that show use of the scheme's URIs with fragment identifiers
   when such usage is appropriate.

4.4.  Same-Document Reference

   When a URI reference refers to a URI that is, aside from its fragment
   component (if any), identical to the base URI (Section 5.1), that
   reference is called a "same-document" reference.  The most frequent
   examples of same-document references are relative references that are
   empty or include only the number sign ("#") separator followed by a
   fragment identifier.

   When a same-document reference is dereferenced for a retrieval
   action, the target of that reference is defined to be within the same
   entity (representation, document, or message) as the reference;
   therefore, a dereference should not result in a new retrieval action.

   Normalization of the base and target URIs prior to their comparison,
   as described in Sections 6.2.2 and 6.2.3, is allowed but rarely
   performed in practice.  Normalization may increase the set of same-
   document references, which may be of benefit to some caching
   applications.  As such, reference authors should not assume that a
   slightly different, though equivalent, reference URI will (or will
   not) be interpreted as a same-document reference by any given
   application.

4.5.  Suffix Reference

   The URI syntax is designed for unambiguous reference to resources and
   extensibility via the URI scheme.  However, as URI identification and
   usage have become commonplace, traditional media (television, radio,
   newspapers, billboards, etc.) have increasingly used a suffix of the

RFC3986 - Page 28

   URI as a reference, consisting of only the authority and path
   portions of the URI, such as

      www.w3.org/Addressing/

   or simply a DNS registered name on its own.  Such references are
   primarily intended for human interpretation rather than for machines,
   with the assumption that context-based heuristics are sufficient to
   complete the URI (e.g., most registered names beginning with "www"
   are likely to have a URI prefix of "http://").  Although there is no
   standard set of heuristics for disambiguating a URI suffix, many
   client implementations allow them to be entered by the user and
   heuristically resolved.

   Although this practice of using suffix references is common, it
   should be avoided whenever possible and should never be used in
   situations where long-term references are expected.  The heuristics
   noted above will change over time, particularly when a new URI scheme
   becomes popular, and are often incorrect when used out of context.
   Furthermore, they can lead to security issues along the lines of
   those described in [RFC1535].

   As a URI suffix has the same syntax as a relative-path reference, a
   suffix reference cannot be used in contexts where a relative
   reference is expected.  As a result, suffix references are limited to
   places where there is no defined base URI, such as dialog boxes and
   off-line advertisements.

5.  Reference Resolution

   This section defines the process of resolving a URI reference within
   a context that allows relative references so that the result is a
   string matching the <URI> syntax rule of Section 3.

5.1.  Establishing a Base URI

   The term "relative" implies that a "base URI" exists against which
   the relative reference is applied.  Aside from fragment-only
   references (Section 4.4), relative references are only usable when a
   base URI is known.  A base URI must be established by the parser
   prior to parsing URI references that might be relative.  A base URI
   must conform to the <absolute-URI> syntax rule (Section 4.3).  If the
   base URI is obtained from a URI reference, then that reference must
   be converted to absolute form and stripped of any fragment component
   prior to its use as a base URI.

RFC3986 - Page 29

   The base URI of a reference can be established in one of four ways,
   discussed below in order of precedence.  The order of precedence can
   be thought of in terms of layers, where the innermost defined base
   URI has the highest precedence.  This can be visualized graphically
   as follows:

         .----------------------------------------------------------.
         |  .----------------------------------------------------.  |
         |  |  .----------------------------------------------.  |  |
         |  |  |  .----------------------------------------.  |  |  |
         |  |  |  |  .----------------------------------.  |  |  |  |
         |  |  |  |  |       <relative-reference>       |  |  |  |  |
         |  |  |  |  `----------------------------------'  |  |  |  |
         |  |  |  | (5.1.1) Base URI embedded in content   |  |  |  |
         |  |  |  `----------------------------------------'  |  |  |
         |  |  | (5.1.2) Base URI of the encapsulating entity |  |  |
         |  |  |         (message, representation, or none)   |  |  |
         |  |  `----------------------------------------------'  |  |
         |  | (5.1.3) URI used to retrieve the entity            |  |
         |  `----------------------------------------------------'  |
         | (5.1.4) Default Base URI (application-dependent)         |
         `----------------------------------------------------------'

5.1.1.  Base URI Embedded in Content

   Within certain media types, a base URI for relative references can be
   embedded within the content itself so that it can be readily obtained
   by a parser.  This can be useful for descriptive documents, such as
   tables of contents, which may be transmitted to others through
   protocols other than their usual retrieval context (e.g., email or
   USENET news).

   It is beyond the scope of this specification to specify how, for each
   media type, a base URI can be embedded.  The appropriate syntax, when
   available, is described by the data format specification associated
   with each media type.

5.1.2.  Base URI from the Encapsulating Entity

   If no base URI is embedded, the base URI is defined by the
   representation's retrieval context.  For a document that is enclosed
   within another entity, such as a message or archive, the retrieval
   context is that entity.  Thus, the default base URI of a
   representation is the base URI of the entity in which the
   representation is encapsulated.

RFC3986 - Page 30

   A mechanism for embedding a base URI within MIME container types
   (e.g., the message and multipart types) is defined by MHTML
   [RFC2557].  Protocols that do not use the MIME message header syntax,
   but that do allow some form of tagged metadata to be included within
   messages, may define their own syntax for defining a base URI as part
   of a message.

5.1.3.  Base URI from the Retrieval URI

   If no base URI is embedded and the representation is not encapsulated
   within some other entity, then, if a URI was used to retrieve the
   representation, that URI shall be considered the base URI.  Note that
   if the retrieval was the result of a redirected request, the last URI
   used (i.e., the URI that resulted in the actual retrieval of the
   representation) is the base URI.

5.1.4.  Default Base URI

   If none of the conditions described above apply, then the base URI is
   defined by the context of the application.  As this definition is
   necessarily application-dependent, failing to define a base URI by
   using one of the other methods may result in the same content being
   interpreted differently by different types of applications.

   A sender of a representation containing relative references is
   responsible for ensuring that a base URI for those references can be
   established.  Aside from fragment-only references, relative
   references can only be used reliably in situations where the base URI
   is well defined.

5.2.  Relative Resolution

   This section describes an algorithm for converting a URI reference
   that might be relative to a given base URI into the parsed components
   of the reference's target.  The components can then be recomposed, as
   described in Section 5.3, to form the target URI.  This algorithm
   provides definitive results that can be used to test the output of
   other implementations.  Applications may implement relative reference
   resolution by using some other algorithm, provided that the results
   match what would be given by this one.

RFC3986 - Page 31

5.2.1.  Pre-parse the Base URI

   The base URI (Base) is established according to the procedure of
   Section 5.1 and parsed into the five main components described in
   Section 3.  Note that only the scheme component is required to be
   present in a base URI; the other components may be empty or
   undefined.  A component is undefined if its associated delimiter does
   not appear in the URI reference; the path component is never
   undefined, though it may be empty.

   Normalization of the base URI, as described in Sections 6.2.2 and
   6.2.3, is optional.  A URI reference must be transformed to its
   target URI before it can be normalized.

5.2.2.  Transform References

   For each URI reference (R), the following pseudocode describes an
   algorithm for transforming R into its target URI (T):

      -- The URI reference is parsed into the five URI components
      --
      (R.scheme, R.authority, R.path, R.query, R.fragment) = parse(R);

      -- A non-strict parser may ignore a scheme in the reference
      -- if it is identical to the base URI's scheme.
      --
      if ((not strict) and (R.scheme == Base.scheme)) then
         undefine(R.scheme);
      endif;

RFC3986 - Page 32

      if defined(R.scheme) then
         T.scheme    = R.scheme;
         T.authority = R.authority;
         T.path      = remove_dot_segments(R.path);
         T.query     = R.query;
      else
         if defined(R.authority) then
            T.authority = R.authority;
            T.path      = remove_dot_segments(R.path);
            T.query     = R.query;
         else
            if (R.path == "") then
               T.path = Base.path;
               if defined(R.query) then
                  T.query = R.query;
               else
                  T.query = Base.query;
               endif;
            else
               if (R.path starts-with "/") then
                  T.path = remove_dot_segments(R.path);
               else
                  T.path = merge(Base.path, R.path);
                  T.path = remove_dot_segments(T.path);
               endif;
               T.query = R.query;
            endif;
            T.authority = Base.authority;
         endif;
         T.scheme = Base.scheme;
      endif;

      T.fragment = R.fragment;

5.2.3.  Merge Paths

   The pseudocode above refers to a "merge" routine for merging a
   relative-path reference with the path of the base URI.  This is
   accomplished as follows:

   o  If the base URI has a defined authority component and an empty
      path, then return a string consisting of "/" concatenated with the
      reference's path; otherwise,

RFC3986 - Page 33

   o  return a string consisting of the reference's path component
      appended to all but the last segment of the base URI's path (i.e.,
      excluding any characters after the right-most "/" in the base URI
      path, or excluding the entire base URI path if it does not contain
      any "/" characters).

5.2.4.  Remove Dot Segments

   The pseudocode also refers to a "remove_dot_segments" routine for
   interpreting and removing the special "." and ".." complete path
   segments from a referenced path.  This is done after the path is
   extracted from a reference, whether or not the path was relative, in
   order to remove any invalid or extraneous dot-segments prior to
   forming the target URI.  Although there are many ways to accomplish
   this removal process, we describe a simple method using two string
   buffers.

   1.  The input buffer is initialized with the now-appended path
       components and the output buffer is initialized to the empty
       string.

   2.  While the input buffer is not empty, loop as follows:

       A.  If the input buffer begins with a prefix of "../" or "./",
           then remove that prefix from the input buffer; otherwise,

       B.  if the input buffer begins with a prefix of "/./" or "/.",
           where "." is a complete path segment, then replace that
           prefix with "/" in the input buffer; otherwise,

       C.  if the input buffer begins with a prefix of "/../" or "/..",
           where ".." is a complete path segment, then replace that
           prefix with "/" in the input buffer and remove the last
           segment and its preceding "/" (if any) from the output
           buffer; otherwise,

       D.  if the input buffer consists only of "." or "..", then remove
           that from the input buffer; otherwise,

       E.  move the first path segment in the input buffer to the end of
           the output buffer, including the initial "/" character (if
           any) and any subsequent characters up to, but not including,
           the next "/" character or the end of the input buffer.

   3.  Finally, the output buffer is returned as the result of
       remove_dot_segments.

RFC3986 - Page 34

   Note that dot-segments are intended for use in URI references to
   express an identifier relative to the hierarchy of names in the base
   URI.  The remove_dot_segments algorithm respects that hierarchy by
   removing extra dot-segments rather than treat them as an error or
   leaving them to be misinterpreted by dereference implementations.

   The following illustrates how the above steps are applied for two
   examples of merged paths, showing the state of the two buffers after
   each step.

      STEP   OUTPUT BUFFER         INPUT BUFFER

       1 :                         /a/b/c/./../../g
       2E:   /a                    /b/c/./../../g
       2E:   /a/b                  /c/./../../g
       2E:   /a/b/c                /./../../g
       2B:   /a/b/c                /../../g
       2C:   /a/b                  /../g
       2C:   /a                    /g
       2E:   /a/g

      STEP   OUTPUT BUFFER         INPUT BUFFER

       1 :                         mid/content=5/../6
       2E:   mid                   /content=5/../6
       2E:   mid/content=5         /../6
       2C:   mid                   /6
       2E:   mid/6

   Some applications may find it more efficient to implement the
   remove_dot_segments algorithm by using two segment stacks rather than
   strings.

      Note: Beware that some older, erroneous implementations will fail
      to separate a reference's query component from its path component
      prior to merging the base and reference paths, resulting in an
      interoperability failure if the query component contains the
      strings "/../" or "/./".

RFC3986 - Page 35

5.3.  Component Recomposition

   Parsed URI components can be recomposed to obtain the corresponding
   URI reference string.  Using pseudocode, this would be:

      result = ""

      if defined(scheme) then
         append scheme to result;
         append ":" to result;
      endif;

      if defined(authority) then
         append "//" to result;
         append authority to result;
      endif;

      append path to result;

      if defined(query) then
         append "?" to result;
         append query to result;
      endif;

      if defined(fragment) then
         append "#" to result;
         append fragment to result;
      endif;

      return result;

   Note that we are careful to preserve the distinction between a
   component that is undefined, meaning that its separator was not
   present in the reference, and a component that is empty, meaning that
   the separator was present and was immediately followed by the next
   component separator or the end of the reference.

5.4.  Reference Resolution Examples

   Within a representation with a well defined base URI of

      http://a/b/c/d;p?q

   a relative reference is transformed to its target URI as follows.

RFC3986 - Page 36

5.4.1.  Normal Examples

      "g:h"           =  "g:h"
      "g"             =  "http://a/b/c/g"
      "./g"           =  "http://a/b/c/g"
      "g/"            =  "http://a/b/c/g/"
      "/g"            =  "http://a/g"
      "//g"           =  "http://g"
      "?y"            =  "http://a/b/c/d;p?y"
      "g?y"           =  "http://a/b/c/g?y"
      "#s"            =  "http://a/b/c/d;p?q#s"
      "g#s"           =  "http://a/b/c/g#s"
      "g?y#s"         =  "http://a/b/c/g?y#s"
      ";x"            =  "http://a/b/c/;x"
      "g;x"           =  "http://a/b/c/g;x"
      "g;x?y#s"       =  "http://a/b/c/g;x?y#s"
      ""              =  "http://a/b/c/d;p?q"
      "."             =  "http://a/b/c/"
      "./"            =  "http://a/b/c/"
      ".."            =  "http://a/b/"
      "../"           =  "http://a/b/"
      "../g"          =  "http://a/b/g"
      "../.."         =  "http://a/"
      "../../"        =  "http://a/"
      "../../g"       =  "http://a/g"

5.4.2.  Abnormal Examples

   Although the following abnormal examples are unlikely to occur in
   normal practice, all URI parsers should be capable of resolving them
   consistently.  Each example uses the same base as that above.

   Parsers must be careful in handling cases where there are more ".."
   segments in a relative-path reference than there are hierarchical
   levels in the base URI's path.  Note that the ".." syntax cannot be
   used to change the authority component of a URI.

      "../../../g"    =  "http://a/g"
      "../../../../g" =  "http://a/g"

RFC3986 - Page 37

   Similarly, parsers must remove the dot-segments "." and ".." when
   they are complete components of a path, but not when they are only
   part of a segment.

      "/./g"          =  "http://a/g"
      "/../g"         =  "http://a/g"
      "g."            =  "http://a/b/c/g."
      ".g"            =  "http://a/b/c/.g"
      "g.."           =  "http://a/b/c/g.."
      "..g"           =  "http://a/b/c/..g"

   Less likely are cases where the relative reference uses unnecessary
   or nonsensical forms of the "." and ".." complete path segments.

      "./../g"        =  "http://a/b/g"
      "./g/."         =  "http://a/b/c/g/"
      "g/./h"         =  "http://a/b/c/g/h"
      "g/../h"        =  "http://a/b/c/h"
      "g;x=1/./y"     =  "http://a/b/c/g;x=1/y"
      "g;x=1/../y"    =  "http://a/b/c/y"

   Some applications fail to separate the reference's query and/or
   fragment components from the path component before merging it with
   the base path and removing dot-segments.  This error is rarely
   noticed, as typical usage of a fragment never includes the hierarchy
   ("/") character and the query component is not normally used within
   relative references.

      "g?y/./x"       =  "http://a/b/c/g?y/./x"
      "g?y/../x"      =  "http://a/b/c/g?y/../x"
      "g#s/./x"       =  "http://a/b/c/g#s/./x"
      "g#s/../x"      =  "http://a/b/c/g#s/../x"

   Some parsers allow the scheme name to be present in a relative
   reference if it is the same as the base URI scheme.  This is
   considered to be a loophole in prior specifications of partial URI
   [RFC1630].  Its use should be avoided but is allowed for backward
   compatibility.

      "http:g"        =  "http:g"         ; for strict parsers
                      /  "http://a/b/c/g" ; for backward compatibility