6.5. Forwarding and Link Management Layer
Each node maintains connections to a set of other nodes defined by the Topology Plug-in. This section defines the methods RELOAD uses to form and maintain connections between nodes in the overlay. Three methods are defined:

Attach
   Used to form RELOAD connections between nodes using ICE for NAT traversal. When node A wants to connect to node B, it sends an Attach message to node B through the overlay. The Attach contains A's ICE parameters. B responds with its ICE parameters, and the two nodes perform ICE to form a connection. Attach also allows two nodes to connect via No-ICE instead of full ICE.

AppAttach
   Used to form application-layer connections between nodes.

Ping
   A simple request/response which is used to verify connectivity of the target peer.

6.5.1. Attach
A node sends an Attach request when it wishes to establish a direct Overlay Link connection to another node for the purpose of sending RELOAD messages. A client that can establish a connection directly need not send an Attach, as described in the second bullet of Section 4.2.1.

As described in Section 6.1, an Attach may be routed to either a Node-ID or a Resource-ID. An Attach routed to a specific Node-ID will fail if that node is not reached. An Attach routed to a Resource-ID will establish a connection with the peer currently responsible for that Resource-ID, which may be useful in establishing a direct connection to the responsible peer for use with frequent or large resource updates.

An Attach, in and of itself, does not result in updating the Routing Table of either node. That function is performed by Updates. If node A has Attached to node B, but has not received any Updates from B, it MAY route messages which are directly addressed to B through that channel, but it MUST NOT route messages through B to other peers via that channel. The process of Attaching is separate from the process of becoming a peer (using Join and Update), to prevent half-open states where a node has started to form connections but is not really ready to act as a peer. Thus, clients (unlike peers) can simply Attach without sending Join or Update.
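The routing restriction above can be illustrated with the following non-normative Python sketch; the Link type and its fields are purely illustrative and not part of this specification.

   # Non-normative sketch: deciding whether a message may use a link
   # that was Attached but for which no Update has been received.
   class Link:
       def __init__(self, remote_node_id, update_received):
           self.remote_node_id = remote_node_id    # Node-ID of the Attached peer
           self.update_received = update_received  # True once an Update arrives

   def may_send_on_link(link, destination_node_id):
       """Return True if a message for destination_node_id may use this link."""
       if link.update_received:
           # The link is part of the Routing Table; normal routing applies.
           return True
       # Before any Update: only messages addressed directly to the
       # Attached peer may use this link (it MUST NOT be used to route
       # messages through that peer to others).
       return destination_node_id == link.remote_node_id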
6.5.1.1. Request Definition
An Attach request message contains the requesting node's ICE connection parameters formatted into a binary structure.

   enum { invalidOverlayLinkType(0), DTLS-UDP-SR(1),
          DTLS-UDP-SR-NO-ICE(3), TLS-TCP-FH-NO-ICE(4), (255) }
     OverlayLinkType;

   enum { invalidCandType(0), host(1), srflx(2), /* RESERVED(3), */
          relay(4), (255) } CandType;

   struct {
     opaque             name<0..2^16-1>;
     opaque             value<0..2^16-1>;
   } IceExtension;

   struct {
     IpAddressPort      addr_port;
     OverlayLinkType    overlay_link;
     opaque             foundation<0..255>;
     uint32             priority;
     CandType           type;
     select (type) {
       case host:
         ;              /* Empty */
       case srflx:
       case relay:
         IpAddressPort  rel_addr_port;
     };
     IceExtension       extensions<0..2^16-1>;
   } IceCandidate;

   struct {
     opaque             ufrag<0..2^8-1>;
     opaque             password<0..2^8-1>;
     opaque             role<0..2^8-1>;
     IceCandidate       candidates<0..2^16-1>;
     Boolean            send_update;
   } AttachReqAns;

The values contained in AttachReqAns are:

ufrag
   The username fragment (from ICE).
password
   The ICE password.

role
   An active/passive/actpass attribute from RFC 4145 [RFC4145]. This value MUST be "passive" for the offerer (the peer sending the Attach request) and "active" for the answerer (the peer sending the Attach response).

candidates
   One or more ICE candidate values, as described below.

send_update
   Has the same meaning as the send_update field in RouteQueryReq.

Each ICE candidate is represented as an IceCandidate structure, which is a direct translation of the information from the ICE string structures, with the exception of the component ID. Since there is only one component, it is always 1 and is thus left out of the structure. The remaining values are specified as follows:

addr_port
   Corresponds to the ICE connection-address and port productions.

overlay_link
   Corresponds to the ICE transport production. Overlay Link protocols used with No-ICE MUST specify "No-ICE" in their description. Future overlay link values can be added by defining new OverlayLinkType values in the IANA registry as described in Section 14.10. Future extensions to the encapsulation or framing that provide for backward compatibility with the previously specified encapsulation or framing values MUST use the same OverlayLinkType value that was previously defined. OverlayLinkType protocols are defined in Section 6.6. A single AttachReqAns MUST NOT include both candidates whose OverlayLinkType protocols use ICE (the default) and candidates that specify "No-ICE".

foundation
   Corresponds to the ICE foundation production.

priority
   Corresponds to the ICE priority production.

type
   Corresponds to the ICE cand-type production.
rel_addr_port
   Corresponds to the ICE rel-addr and rel-port productions. It is present only for types "relay", "prflx", and "srflx".

extensions
   ICE extensions. The name and value fields correspond to binary translations of the equivalent fields in the ICE extensions.

These values should be generated using the procedures described in Section 6.5.1.3.
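As a non-normative illustration of the presentation-language encoding above, the following Python sketch serializes an IceCandidate. Variable-length vectors carry a length prefix sized by their ceiling (e.g., <0..255> uses 1 byte and <0..2^16-1> uses 2 bytes), and enums bounded by (255) occupy a single byte. The encoding of IpAddressPort is defined elsewhere in this document and is treated here as an already-encoded byte string; the helper names are illustrative only.

   # Non-normative sketch of encoding an IceCandidate.
   import struct

   def opaque8(data: bytes) -> bytes:
       return struct.pack("!B", len(data)) + data      # <0..2^8-1> vector

   def opaque16(data: bytes) -> bytes:
       return struct.pack("!H", len(data)) + data      # <0..2^16-1> vector

   def encode_ice_candidate(addr_port: bytes, overlay_link: int,
                            foundation: bytes, priority: int,
                            cand_type: int, rel_addr_port: bytes = b"",
                            extensions: bytes = b"") -> bytes:
       # Numeric values follow the OverlayLinkType and CandType enums above.
       out = addr_port                                 # IpAddressPort (pre-encoded)
       out += struct.pack("!B", overlay_link)          # OverlayLinkType
       out += opaque8(foundation)                      # foundation<0..255>
       out += struct.pack("!I", priority)              # uint32 priority
       out += struct.pack("!B", cand_type)             # CandType
       if cand_type in (2, 4):                         # srflx(2) or relay(4)
           out += rel_addr_port                        # IpAddressPort rel_addr_port
       out += opaque16(extensions)                     # serialized IceExtension list
       return out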
6.5.1.2. Response Definition

If a peer receives an Attach request, it MUST determine how to process the request as follows:

o If the peer has not initiated an Attach request to the originating peer of this Attach request, it MUST process this request and SHOULD generate its own response with an AttachReqAns. It should then begin ICE checks.

o If the peer has already sent an Attach request to and received the response from the originating peer of this Attach request and, as a result, an ICE check and TLS connection are in progress, then it SHOULD generate an Error_In_Progress error instead of an AttachReqAns.

o If the peer has already sent an Attach request to but not yet received the response from the originating peer of this Attach request, it SHOULD apply the following tie-breaker heuristic to determine how to handle this Attach request and the incomplete Attach request it has sent out:

   * If the peer's own Node-ID is smaller when compared as big-endian unsigned integers, it MUST cancel retransmission of its own incomplete Attach request. It MUST then process this Attach request, generate an AttachReqAns response, and proceed with the corresponding ICE check.

   * If the peer's own Node-ID is larger when compared as big-endian unsigned integers, it MUST generate an Error_In_Progress error to this Attach request, and then proceed to wait for and complete the Attach and the corresponding ICE check it has originated.

o If the peer is overloaded or detects some other kind of error, it MAY generate an error instead of an AttachReqAns.
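The tie-breaker above can be expressed as the following non-normative Python sketch; Node-IDs are compared as big-endian unsigned integers, and the return strings are illustrative labels only.

   # Non-normative sketch of the tie-breaker for crossed Attach requests.
   def attach_tiebreak(own_node_id: bytes, peer_node_id: bytes) -> str:
       own = int.from_bytes(own_node_id, "big")
       peer = int.from_bytes(peer_node_id, "big")
       if own < peer:
           # Smaller Node-ID: cancel retransmission of our own incomplete
           # Attach and answer the incoming request with an AttachReqAns.
           return "answer_incoming_attach"
       # Larger Node-ID: reject the incoming request with Error_In_Progress
       # and wait for our own Attach (and its ICE check) to complete.
       return "send_error_in_progress"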
When a peer receives an Attach response, it SHOULD parse the response and begin its own ICE checks.

6.5.1.3. Using ICE with RELOAD
This section describes the profile of ICE that is used with RELOAD. RELOAD implementations MUST implement full ICE.

In ICE, as defined by [RFC5245], the Session Description Protocol (SDP) is used to carry the ICE parameters. In RELOAD, this function is performed by a binary encoding in the Attach method. This encoding is more restricted than the SDP encoding because the RELOAD environment is simpler:

o Only a single media stream is supported.

o In this case, the "stream" refers not to RTP or other types of media, but rather to a connection for RELOAD itself or other application-layer protocols, such as SIP.

o RELOAD allows only for a single offer/answer exchange. Unlike the usage of ICE within SIP, there is never a need to send a subsequent offer to update the default candidates to match the ones selected by ICE.

An agent follows the ICE specification as described in [RFC5245] with the changes and additional procedures described in the subsections below.

6.5.1.4. Collecting STUN Servers
ICE relies on the node having one or more Session Traversal Utilities for NAT (STUN) servers to use. In conventional ICE, it is assumed that nodes are configured with one or more STUN servers through some out-of-band mechanism. This is still possible in RELOAD, but RELOAD also learns STUN servers as it connects to other peers. A peer on a well-provisioned wide-area overlay will be configured with one or more bootstrap nodes. These nodes provide an initial list of STUN servers. However, as the peer forms connections with additional peers, it acquires more peers that it can use as STUN servers.

Because complicated NAT topologies are possible, a peer may need more than one STUN server. Specifically, a peer that is behind a single NAT will typically observe only two IP addresses in its STUN checks: its local address and its server reflexive address from a STUN server outside its NAT. However, if more NATs are involved, a peer may
learn additional server reflexive addresses (which vary based on where in the topology the STUN server is). To maximize the chance of achieving a direct connection, a peer SHOULD group other peers by the peer-reflexive addresses it discovers through them. It SHOULD then select one peer from each group to use as a STUN server for future connections. Only peers to which the peer currently has connections may be used. If the connection to that host is lost, it MUST be removed from the list of STUN servers, and a new server from the same group MUST be selected unless there are no other servers in the group, in which case some other peer MAY be used.
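The grouping rule above is illustrated by the following non-normative Python sketch; the data structures and the choice of the first peer in each group are illustrative only.

   # Non-normative sketch: group connected peers by the reflexive address
   # observed through each of them, then pick one STUN server per group.
   from collections import defaultdict

   def select_stun_servers(reflexive_by_peer):
       """reflexive_by_peer maps a connected peer to the reflexive address
       this node observed when using that peer as a STUN server."""
       groups = defaultdict(list)
       for peer, reflexive_addr in reflexive_by_peer.items():
           groups[reflexive_addr].append(peer)
       # One representative STUN server per observed reflexive address.
       return [peers[0] for peers in groups.values()]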
6.5.1.5. Gathering Candidates

When a node wishes to establish a connection for the purposes of RELOAD signaling or application signaling, it follows the process of gathering candidates as described in Section 4 of ICE [RFC5245]. RELOAD utilizes a single component. Consequently, gathering for these "streams" requires a single component. In the case where a node has not yet found a TURN server, the agent would not include a relayed candidate.

The ICE specification assumes that an ICE agent is configured with, or somehow knows of, TURN and STUN servers. RELOAD provides a way for an agent to learn these by querying the overlay, as described in Sections 6.5.1.4 and 9.

The default candidate selection described in Section 4.1.4 of ICE is ignored; defaults are not signaled or utilized by RELOAD.

An alternative to using the full ICE supported by the Attach request is to use the No-ICE mechanism by providing candidates with "No-ICE" Overlay Link protocols. Configuration for the overlay indicates whether or not these Overlay Link protocols can be used. An overlay MUST be either all ICE or all No-ICE. No-ICE will not work in all the scenarios where ICE would work, but in some cases, particularly those with no NATs or firewalls, it will work.

6.5.1.6. Prioritizing Candidates
Standardization of additional protocols for use with ICE is expected, including TCP [RFC6544] and protocols such as the Stream Control Transmission Protocol (SCTP) [RFC4960] and Datagram Congestion Control Protocol (DCCP) [RFC4340]. UDP encapsulations for SCTP and DCCP would expand the Overlay Link protocols available for RELOAD.
When additional protocols are available, the following prioritization is RECOMMENDED:

o Highest priority is assigned to protocols that offer well-understood congestion and flow control without head-of-line blocking, for example, SCTP without message ordering, DCCP, and those protocols encapsulated using UDP.

o Second highest priority is assigned to protocols that offer well-understood congestion and flow control, but that have head-of-line blocking, such as TCP.

o Lowest priority is assigned to protocols encapsulated over UDP that do not implement well-established congestion control algorithms. The DTLS/UDP with Simple Reliability (SR) overlay link protocol is an example of such a protocol.

Head-of-line blocking is undesirable in an Overlay Link protocol, because the messages carried on a RELOAD link are independent, rather than stream-oriented. Therefore, if message N on a link is lost, delaying message N+1 on that same link until N is successfully retransmitted does nothing other than increase the latency for the transaction of message N+1, as they are unrelated to each other. Therefore, while the high quality, performance, and availability of modern TCP implementations make them very attractive, their performance as Overlay Link protocols is not optimal.

Note that none of the protocols defined in this document meets these conditions, but it is expected that new Overlay Link protocols defined in the future will fill this gap.
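One way an implementation might apply the recommended ordering is as a type-preference value fed into the ICE priority calculation. The following non-normative Python sketch shows such a mapping; the protocol names other than DTLS-UDP-SR and the numeric tiers are illustrative assumptions, not values defined by this document.

   # Non-normative sketch: candidate type preference by Overlay Link protocol.
   PREFERENCE = {
       "SCTP-over-UDP": 126,   # no head-of-line blocking, mature congestion control
       "DCCP-over-UDP": 126,
       "TCP":            90,   # congestion control, but head-of-line blocking
       "DTLS-UDP-SR":    50,   # simple reliability, limited congestion control
   }

   def type_preference(overlay_link_protocol: str) -> int:
       return PREFERENCE.get(overlay_link_protocol, 0)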
6.5.1.7. Encoding the Attach Message

Section 4.3 of ICE describes procedures for encoding the SDP for conveying RELOAD candidates. Instead of actually encoding an SDP message, the candidate information (IP address and port and transport protocol, priority, foundation, type, and related address) is carried within the attributes of the Attach request or its response. Similarly, the username fragment and password are carried in the Attach message or its response. Section 6.5.1 describes the detailed attribute encoding for Attach. The Attach request and its response do not contain any default candidates or the ice-lite attribute, as these features of ICE are not used by RELOAD.

Since the Attach request contains the candidate information and short-term credentials, it is considered as an offer for a single media stream that happens to be encoded in a format different than SDP, but is otherwise considered a valid offer for the purposes of following
the ICE specification. Similarly, the Attach response is considered a valid answer for the purposes of following the ICE specification.

6.5.1.8. Verifying ICE Support
An agent MUST skip the verification procedures in Sections 5.1 and 6.1 of ICE. Since RELOAD requires full ICE from all agents, this check is not required.

6.5.1.9. Role Determination
The roles of controlling and controlled, as described in Section 5.2 of ICE, are still utilized with RELOAD. However, the offerer (the entity sending the Attach request) will always be controlling, and the answerer (the entity sending the Attach response) will always be controlled. The connectivity checks MUST still contain the ICE-CONTROLLED and ICE-CONTROLLING attributes, however, even though the role reversal capability for which they are defined will never be needed with RELOAD. This is to allow for a common codebase between ICE for RELOAD and ICE for SDP.

6.5.1.10. Full ICE
When the overlay uses ICE, connectivity checks and nominations are used as in regular ICE.

6.5.1.10.1. Connectivity Checks
The processes of forming check lists in Section 5.7 of ICE, scheduling checks in Section 5.8, and performing connectivity checks in Section 7 are used with RELOAD without change.

6.5.1.10.2. Concluding ICE
The procedures in Section 8 of ICE are followed to conclude ICE, with the following exceptions:

o The controlling agent MUST NOT attempt to send an updated offer once the state of its single media stream reaches Completed.

o Once the state of ICE reaches Completed, the agent can immediately free all unused candidates. This is because RELOAD does not have the concept of forking, and thus the three-second delay in Section 8.3 of ICE does not apply.
6.5.1.10.3. Media Keepalives
STUN MUST be utilized for the keepalives described in Section 10 of ICE.

6.5.1.11. No-ICE
No-ICE is selected when either side has provided "No-ICE" Overlay Link candidates. STUN is not used for connectivity checks when doing No-ICE; instead, the DTLS or TLS handshake (or similar security layer of future overlay link protocols) forms the connectivity check. The certificate exchanged during the TLS or DTLS handshake MUST match the node which sent the AttachReqAns, and if it does not, the connection MUST be closed.

6.5.1.12. Subsequent Offers and Answers
An agent MUST NOT send a subsequent offer or answer. Thus, the procedures in Section 9 of ICE MUST be ignored.

6.5.1.13. Sending Media
The procedures of Section 11 of ICE apply to RELOAD as well. However, in this case, the "media" takes the form of application-layer protocols (e.g., RELOAD) over TLS or DTLS. Consequently, once ICE processing completes, the agent will begin TLS or DTLS procedures to establish a secure connection. The node that sent the Attach request MUST be the TLS server. The other node MUST be the TLS client. The server MUST request TLS client authentication. The nodes MUST verify that the certificate presented in the handshake matches the identity of the other peer as found in the Attach message. Once the TLS or DTLS signaling is complete, the application protocol is free to use the connection. The concept of a previously selected pair for a component does not apply to RELOAD, since ICE restarts are not possible with RELOAD.

6.5.1.14. Receiving Media
An agent MUST be prepared to receive packets for the application protocol (TLS or DTLS carrying RELOAD) at any time. The jitter and RTP considerations in Section 11 of ICE do not apply to RELOAD.

6.5.2. AppAttach
A node sends an AppAttach request when it wishes to establish a direct connection to another node for the purposes of sending application-layer messages. AppAttach is nearly identical to Attach,
except for the purpose of the connection: it is used to transport non-RELOAD "media". A separate request is used to avoid implementer confusion between the two methods (this was found to be a real problem with initial implementations). The AppAttach request and its response contain an application attribute, which indicates what protocol is to be run over the connection.

6.5.2.1. Request Definition
An AppAttachReq message contains the requesting node's ICE connection parameters formatted into a binary structure.

   struct {
     opaque             ufrag<0..2^8-1>;
     opaque             password<0..2^8-1>;
     uint16             application;
     opaque             role<0..2^8-1>;
     IceCandidate       candidates<0..2^16-1>;
   } AppAttachReq;

The values contained in AppAttachReq and AppAttachAns are:

ufrag
   The username fragment (from ICE).

password
   The ICE password.

application
   A 16-bit Application-ID, as defined in Section 14.5. This number represents the IANA-registered application that is going to send data on this connection.

role
   An active/passive/actpass attribute from RFC 4145 [RFC4145].

candidates
   One or more ICE candidate values.

The application using the connection that is set up with this request is responsible for providing traffic of sufficient frequency to keep the NAT and firewall bindings alive. Applications will often send traffic every 25 seconds to ensure this.
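As a non-normative illustration of the keepalive responsibility above, the following Python sketch sends application traffic at a fixed 25-second interval; the send_keepalive callback is a hypothetical stand-in for whatever traffic the application chooses to send.

   # Non-normative sketch: periodic application-layer keepalive traffic.
   import threading

   def start_keepalive(send_keepalive, interval=25.0):
       """Call send_keepalive() now and then every `interval` seconds."""
       send_keepalive()
       timer = threading.Timer(interval, start_keepalive,
                               args=(send_keepalive, interval))
       timer.daemon = True
       timer.start()
       return timer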
6.5.2.2. Response Definition
If a peer receives an AppAttach request, it SHOULD process the request and generate its own response with an AppAttachAns. It should then begin ICE checks. When a peer receives an AppAttach response, it SHOULD parse the response and begin its own ICE checks. If the Application-ID is not supported, the peer MUST reply with an Error_Not_Found error.

   struct {
     opaque             ufrag<0..2^8-1>;
     opaque             password<0..2^8-1>;
     uint16             application;
     opaque             role<0..2^8-1>;
     IceCandidate       candidates<0..2^16-1>;
   } AppAttachAns;

The meaning of the fields is the same as in the AppAttachReq.

6.5.3. Ping
Ping is used to test connectivity along a path. A Ping can be addressed to a specific Node-ID, to the peer controlling a given location (by using a Resource-ID), or to the wildcard Node-ID.

6.5.3.1. Request Definition
The PingReq structure is used to make a Ping request.

   struct {
     opaque<0..2^16-1>  padding;
   } PingReq;

The Ping request is empty of meaningful contents. However, it may contain up to 65535 bytes of padding to facilitate the discovery of overlay maximum packet sizes.
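The following non-normative Python sketch builds PingReq bodies of increasing size for such probing. Only the method body is shown; wrapping it in a RELOAD message with a forwarding header is assumed to be handled elsewhere, and the probe sizes are arbitrary examples.

   # Non-normative sketch: padded PingReq bodies for probing the largest
   # message the overlay will carry.
   import struct

   def ping_req_body(padding_len: int) -> bytes:
       """PingReq { opaque<0..2^16-1> padding; } with padding_len zero bytes."""
       if padding_len > 0xFFFF:
           raise ValueError("padding is limited to 65535 bytes")
       return struct.pack("!H", padding_len) + bytes(padding_len)

   # Example probe sizes for a simple search over candidate message sizes.
   probes = [ping_req_body(n) for n in (512, 1024, 2048, 4096, 8192)]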
6.5.3.2. Response Definition

A successful PingAns response contains the information elements requested by the peer.

   struct {
     uint64             response_id;
     uint64             time;
   } PingAns;
A PingAns message contains the following elements:

response_id
   A randomly generated 64-bit response ID. This is used to distinguish Ping responses.

time
   The time when the Ping response was created, represented in the same way as storage_time, defined in Section 7.

6.5.4. ConfigUpdate
The ConfigUpdate method is used to push updated configuration data across the overlay. Whenever a node detects that another node has old configuration data, it MUST generate a ConfigUpdate request. The ConfigUpdate request allows updating of two kinds of data: the configuration data (Section 6.3.2.1) and the Kind information (Section 7.4.1.1).

6.5.4.1. Request Definition
The ConfigUpdateReq structure is used to provide updated configuration information.

   enum { invalidConfigUpdateType(0), config(1), kind(2), (255) }
     ConfigUpdateType;

   typedef uint32        KindId;
   typedef opaque        KindDescription<0..2^16-1>;

   struct {
     ConfigUpdateType    type;
     uint32              length;

     select (type) {
       case config:
         opaque          config_data<0..2^24-1>;

       case kind:
         KindDescription kinds<0..2^24-1>;

       /* This structure may be extended with new types */
     };
   } ConfigUpdateReq;
The ConfigUpdateReq message contains the following elements:

type
   The type of the contents of the message. This structure allows for unknown content types.

length
   The length of the remainder of the message. This is included to preserve backward compatibility and is 32 bits instead of 24 to facilitate easy conversion between network and host byte order.

config_data (type==config)
   The contents of the Configuration Document.

kinds (type==kind)
   One or more XML kind-block productions (see Section 11.1). These MUST be encoded with UTF-8 and assume a default namespace of "urn:ietf:params:xml:ns:p2p:config-base".

6.5.4.2. Response Definition
The ConfigUpdateAns structure is used to respond to a ConfigUpdateReq request.

   struct {
   } ConfigUpdateAns;

If the ConfigUpdateReq is of type "config", it MUST be processed only if all of the following are true:

o The sequence number in the document is greater than the current configuration sequence number.

o The Configuration Document is correctly digitally signed (see Section 11 for details on signatures).

Otherwise, appropriate errors MUST be generated.

If the ConfigUpdateReq is of type "kind", it MUST be processed only if it is correctly digitally signed by an acceptable Kind signer (i.e., one listed in the current configuration file). Details on the kind-signer field in the configuration file are described in Section 11.1. In addition, if the Kind update conflicts with an existing known Kind (i.e., it is signed by a different signer), then it should be rejected with an Error_Forbidden error. This should not happen in correctly functioning overlays.
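The acceptance checks for a ConfigUpdateReq of type "config" can be summarized by the following non-normative Python sketch; signature_is_valid is a hypothetical stand-in for the signature verification defined in Section 11, and error generation is left abstract.

   # Non-normative sketch: deciding whether to apply a new Configuration
   # Document carried in a ConfigUpdateReq of type "config".
   def accept_config_update(new_sequence, new_document,
                            current_sequence, signature_is_valid):
       """Return True if the new Configuration Document should be applied."""
       if new_sequence <= current_sequence:
           return False    # reject: sequence number is not greater
       if not signature_is_valid(new_document):
           return False    # reject: document is not correctly signed
       return True         # apply the new configuration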
If the update is acceptable, then the node MUST reconfigure itself to match the new information. This may include adding permissions for new Kinds, deleting old Kinds, or even, in extreme circumstances, exiting and re-entering the overlay, if, for instance, the DHT algorithm has changed.

If an implementation misses enough ConfigUpdates that include key changes, it is possible that it will no longer be able to verify new valid ConfigUpdates. In this case, the only available recovery mechanism is to attempt to retrieve a new Configuration Document, typically by the mechanisms used for initial bootstrapping. It is up to implementers whether or how to decide to employ this sort of recovery mechanism.

The response for ConfigUpdate is empty.

6.6. Overlay Link Layer
RELOAD can use multiple Overlay Link protocols to send its messages. Because ICE is used to establish connections (see Section 6.5.1.3), RELOAD nodes are able to detect which Overlay Link protocols are offered by other nodes and establish connections between them. Any link protocol needs to be able to establish a secure, authenticated connection and to provide data origin authentication and message integrity for individual data elements. RELOAD currently supports three Overlay Link protocols:

o DTLS [RFC6347] over UDP with Simple Reliability (SR) (OverlayLinkType=DTLS-UDP-SR)

o TLS [RFC5246] over TCP with Framing Header, No-ICE (OverlayLinkType=TLS-TCP-FH-NO-ICE)

o DTLS [RFC6347] over UDP with SR, No-ICE (OverlayLinkType=DTLS-UDP-SR-NO-ICE)

Note that although UDP does not properly have "connections", both TLS and DTLS have a handshake that establishes a similar, stateful association. We refer to these as "connections" for the purposes of this document.

If a peer receives a message that is larger than the value of max-message-size defined in the overlay configuration, the peer SHOULD send an Error_Message_Too_Large error and then close the TLS or DTLS session from which the message was received. Note that this error can be sent and the session closed before the peer receives the complete message. If the forwarding header is larger than the
max-message-size, the receiver SHOULD close the TLS or DTLS session without sending an error.

The RELOAD mechanism requires that failed links be quickly removed from the Routing Table so end-to-end retransmission can handle lost messages. Overlay Link protocols MUST be designed with a mechanism that quickly signals a likely failure, and implementations SHOULD quickly act to remove a failed link from the Routing Table when receiving this signal. The entry can be restored if it proves to resume functioning, or it can be replaced at some point in the future if necessary. Section 10.7.2 contains more details specific to the CHORD-RELOAD Topology Plug-in.

The Framing Header (FH) is used to frame messages and provide timing when used on a reliable stream-based transport protocol. Simple Reliability (SR) uses the FH to provide congestion control and partial reliability when using unreliable message-oriented transport protocols. We will first define each of these algorithms in Sections 6.6.2 and 6.6.3, and then define Overlay Link protocols that use them in Sections 6.6.4, 6.6.5, and 6.6.6.

Note: We expect future Overlay Link protocols to define replacements for all components of these protocols, including the Framing Header. The three protocols that we will discuss have been chosen for simplicity of implementation and reasonable performance.

6.6.1. Future Overlay Link Protocols
It is possible to define new link-layer protocols and apply them to a new overlay using the "overlay-link-protocol" configuration directive (see Section 11.1). However, any new protocols MUST meet the following requirements:

Endpoint authentication:
   When a node forms an association with another endpoint, it MUST be possible to cryptographically verify that the endpoint has a given Node-ID.

Traffic origin authentication and integrity:
   When a node receives traffic from another endpoint, it MUST be possible to cryptographically verify that the traffic came from a given association and that it has not been modified in transit from the other endpoint in the association. The overlay link protocol MUST also provide replay prevention/detection.

Traffic confidentiality:
   When a node sends traffic to another endpoint, it MUST NOT be possible for a third party that is not involved in the association to determine the contents of that traffic.
Any new overlay protocol MUST be defined via Standards Action [RFC5226]. See Section 14.11.

6.6.1.1. HIP
In a Host Identity Protocol Based Overlay Networking Environment (HIP BONE) [RFC6079], HIP [RFC5201] provides connection management (e.g., NAT traversal and mobility) and security for the overlay network. The P2PSIP Working Group has expressed interest in supporting a HIP-based link protocol. Such support would require specifying such details as:

o How to issue certificates which provide identities meaningful to the HIP base exchange. We anticipate that this would require a mapping between Overlay Routable Cryptographic Hash Identifiers (ORCHIDs) and Node-IDs.

o How to carry the HIP I1 and I2 messages.

o How to carry RELOAD messages over HIP.

[HIP-RELOAD] documents work in progress on using RELOAD with the HIP BONE.

6.6.1.2. ICE-TCP
The ICE-TCP RFC [RFC6544] allows TCP to be supported as an Overlay Link protocol that can be added using ICE.

6.6.1.3. Message-Oriented Transports
Modern message-oriented transports offer high performance and good congestion control, and they avoid head-of-line blocking in case of lost data. These characteristics make them preferable as underlying transport protocols for RELOAD links. SCTP without message ordering and DCCP are two examples of such protocols. However, currently they are not well-supported by commonly available NATs, and specifications for ICE session establishment are not available.

6.6.1.4. Tunneled Transports
As of the time of this writing, there is significant interest in the IETF community in tunneling other transports over UDP, which is motivated by the situation that UDP is well-supported by modern NAT hardware and by the fact that performance similar to a native implementation can be achieved. Currently, SCTP, DCCP, and a generic tunneling extension are being proposed for message-oriented protocols. Once ICE traversal has been specified for these tunneled
protocols, they should be straightforward to support as overlay link protocols.

6.6.2. Framing Header
In order to support unreliable links and to allow for quick detection of link failures when using reliable end-to-end transports, each message is wrapped in a very simple framing layer (FramedMessage), which is used only for each hop. This layer contains a sequence number which can then be used for ACKs. The same header is used for both reliable and unreliable transports for simplicity of implementation.

The definition of FramedMessage is:

   enum { data(128), ack(129), (255) } FramedMessageType;

   struct {
     FramedMessageType  type;

     select (type) {
       case data:
         uint32         sequence;
         opaque         message<0..2^24-1>;

       case ack:
         uint32         ack_sequence;
         uint32         received;
     };
   } FramedMessage;

The type field of the PDU is set to indicate whether the message is data or an acknowledgement.

If the message is of type "data", then the remainder of the PDU is as follows:

sequence
   The sequence number. This increments by one for each framed message sent over this transport session.

message
   The message that is being transmitted.

Each connection has its own sequence number space. Initially, the value is zero, and it increments by exactly one for each message sent over that connection.
When the receiver receives a message, it SHOULD immediately send an ACK message. The receiver MUST keep track of the 32 most recent sequence numbers received on this association in order to generate the appropriate ACK.

If the PDU is of type "ack", the contents are as follows:

ack_sequence
   The sequence number of the message being acknowledged.

received
   A bitmask indicating whether each of the previous 32 sequence numbers before this packet has been among the 32 packets most recently received on this connection. When a packet with sequence number N is received, the receiver examines the sequence numbers of the 32 packets it received most recently on this connection. For each such previously received packet with sequence number M, if M is less than N but greater than N-32, the N-M bit of the received bitmask is set to one; otherwise, it is set to zero. Note that a bit being set to one indicates positively that a particular packet was received, but a bit being set to zero means only that it is unknown whether or not the packet has been received, because it might have been received before the 32 most recently received packets.

The received field bits in the ACK provide a high degree of redundancy so that the sender can figure out which packets the receiver has received and can then estimate packet loss rates. If the sender also keeps track of the time at which recent sequence numbers have been sent, the RTT (round-trip time) can be estimated. Note that because retransmissions receive new sequence numbers, multiple ACKs may be received for the same message. This approach provides more information than traditional TCP sequence numbers, but care must be taken when applying algorithms designed based on TCP's stream-oriented sequence numbers.
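The construction of the received bitmask can be illustrated by the following non-normative Python sketch. Here bit k is taken to be the k-th least significant bit of the 32-bit field; that choice is illustrative, as this section does not pin down the bit ordering.

   # Non-normative sketch: build the "received" bitmask for an ACK of the
   # packet with sequence number n, given the 32 most recently received
   # sequence numbers.
   def received_bitmask(n, recent_sequences):
       mask = 0
       for m in recent_sequences:
           if n - 32 < m < n:            # m is among the 32 numbers before n
               mask |= 1 << (n - m)      # set bit n-m: packet m was received
       return mask & 0xFFFFFFFF

   # Example: packets 7, 9, and 10 were received before packet 11 arrives;
   # received_bitmask(11, {7, 9, 10}) sets bits 1, 2, and 4.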
6.6.3. Simple Reliability

When RELOAD is carried over DTLS or another unreliable link protocol, it needs to be used with a reliability and congestion control mechanism, which is provided on a hop-by-hop basis. The basic principle is that each message, regardless of whether it carries a request or a response, will get an ACK and be reliably retransmitted. The receiver's job is very simple and is limited to just sending ACKs. All the complexity is at the sender side. This allows the sending implementation to trade off performance versus implementation complexity without affecting the wire protocol.
Because the receiver's role is limited to providing packet acknowledgements, a wide variety of congestion control algorithms can be implemented on the sender side while using the same basic wire protocol. The sender algorithm used MUST meet the requirements of [RFC5405].

6.6.3.1. Stop and Wait Sender Algorithm
This section describes one possible implementation of a sender algorithm for Simple Reliability. It is adequate for overlays running on underlying networks with low latency and loss (LANs) or low-traffic overlays on the Internet.

A node MUST NOT have more than one unacknowledged message on the DTLS connection at a time. Note that because retransmissions of the same message are given new sequence numbers, there may be multiple unacknowledged sequence numbers in use.

The RTO (Retransmission TimeOut) is based on an estimate of the RTT. The value for RTO is calculated separately for each DTLS session. Implementations can use a static value for RTO or a dynamic estimate, which will result in better performance. For implementations that use a static value, the default value for RTO is 500 ms. Nodes MAY use smaller values of RTO if it is known that all nodes are within the local network. The default RTO MAY be set to a larger value, which is RECOMMENDED if it is known in advance (such as on high-latency access links) that the RTT is larger.

Implementations that use a dynamic estimate to compute the RTO MUST use the algorithm described in RFC 6298 [RFC6298], with the exception that the value of RTO SHOULD NOT be rounded up to the nearest second, but instead rounded up to the nearest millisecond. The RTT of a successful STUN transaction from the ICE stage is used as the initial measurement for formula 2.2 of RFC 6298. The sender keeps track of the time each message was sent for all recently sent messages. Any time an ACK is received, the sender can compute the RTT for that message by looking at the time the ACK was received and the time when the message was sent. This is used as a subsequent RTT measurement for formula 2.3 of RFC 6298 to update the RTO estimate. (Note that because retransmissions receive new sequence numbers, all received ACKs are used.)

An initiating node SHOULD retransmit a message if it has not received an ACK after an interval of RTO (transit nodes do not retransmit at this layer). The node MUST double the time to wait after each retransmission. For each retransmission, the sequence number MUST be incremented.
Retransmissions continue until a response is received, until a total of 5 requests have been sent, until there has been a hard ICMP error [RFC1122], or until a TLS alert indicating the end of the connection has been sent or received. The sender knows a response was received when it receives an ACK with a sequence number that indicates it is a response to one of the transmissions of this message. For example, assuming an RTO of 500 ms, requests would be sent at times 0 ms, 500 ms, 1500 ms, 3500 ms, and 7500 ms. If all retransmissions for a message fail, then the sending node SHOULD close the connection routing the message.

To determine when a link might be failing without waiting for the final timeout, observe when no ACKs have been received for an entire RTO interval, and then wait for three retransmissions to occur beyond that point. If no ACKs have been received by the time the third retransmission occurs, it is RECOMMENDED that the link be removed from the Routing Table. The link MAY be restored to the Routing Table if ACKs resume before the connection is closed, as described above.

A sender MUST wait 10 ms between receipt of an ACK and transmission of the next message.
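The dynamic RTO estimate described above is illustrated by the following non-normative Python sketch, which applies the RFC 6298 formulas but rounds the result up to the nearest millisecond rather than to the nearest second. The class name and the omission of a clock-granularity term are simplifications for illustration.

   # Non-normative sketch: RTO estimation per RFC 6298, rounded up to
   # the nearest millisecond.  Times are in seconds.
   import math

   class RtoEstimator:
       K, ALPHA, BETA = 4, 1/8, 1/4

       def __init__(self, initial_rtt):
           # Initial measurement, e.g. the RTT of the ICE STUN transaction.
           self.srtt = initial_rtt
           self.rttvar = initial_rtt / 2

       def update(self, rtt):
           # Subsequent measurements, e.g. ACK receive time minus send time.
           self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
           self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt

       def rto(self):
           raw = self.srtt + self.K * self.rttvar
           return math.ceil(raw * 1000) / 1000   # round up to the nearest ms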
6.6.4. DTLS/UDP with SR

This overlay link protocol consists of DTLS over UDP while implementing the SR protocol. STUN connectivity checks and keepalives are used. Any compliant sender algorithm may be used.

6.6.5. TLS/TCP with FH, No-ICE
This overlay link protocol consists of TLS over TCP with the framing header. Because ICE is not used, STUN connectivity checks are not used upon establishing the TCP connection, nor are they used for keepalives. Because the TCP layer's application-level timeout is too slow to be useful for overlay routing, the Overlay Link implementation MUST use the framing header to measure the RTT of the connection and calculate an RTO as specified in Section 2 of [RFC6298]. The resulting RTO is not used for retransmissions, but rather as a timeout to indicate when the link SHOULD be removed from the Routing Table. It is RECOMMENDED that such a connection be retained for 30 seconds to determine if the failure was transient before concluding the link has failed permanently. When sending candidates for TLS/TCP with FH, No-ICE, a passive candidate MUST be provided.
6.6.6. DTLS/UDP with SR, No-ICE
This overlay link protocol consists of DTLS over UDP while implementing the Simple Reliability protocol. Because ICE is not used, no STUN connectivity checks or keepalives are used.

6.7. Fragmentation and Reassembly
In order to allow transmission over datagram protocols such as DTLS, RELOAD messages may be fragmented. Any node along the path can fragment the message, but only the final destination reassembles the fragments. When a node takes a packet and fragments it, each fragment has a full copy of the forwarding header, but the data after the forwarding header is broken up into appropriately sized chunks. The size of the payload chunks needs to take into account space to allow the Via and Destination Lists to grow. Each fragment MUST contain a full copy of the Via List, Destination List, and ForwardingOptions and MUST contain at least 256 bytes of the message body. If these elements cannot fit within the MTU of the underlying datagram protocol, RELOAD fragmentation is not performed, and IP-layer fragmentation is allowed to occur. The length field MUST contain the size of the message after fragmentation. When a message MUST be fragmented, it SHOULD be split into equal-sized fragments that are no larger than the Path MTU (PMTU) of the next overlay link minus 32 bytes. This is to allow the Via List to grow before further fragmentation is required.

Note that this fragmentation is not optimal for the end-to-end path -- a message may be refragmented multiple times as it traverses the overlay, but it is assembled only at the final destination. This option has been chosen as it is far easier to implement than end-to-end (e2e) PMTU discovery across an ever-changing overlay, and it effectively addresses the reliability issues of relying on IP-layer fragmentation. However, Ping can be used to allow e2e PMTU discovery to be implemented if desired.

Upon receipt of a fragmented message by the intended peer, the peer holds the fragments in a holding buffer until the entire message has been received. The message is then reassembled into a single message and processed. In order to mitigate denial-of-service (DoS) attacks, receivers SHOULD time out incomplete fragments after the maximum request lifetime (15 seconds). This time was derived from looking at the end-to-end retransmission time and saving fragments long enough for the full end-to-end retransmissions to take place. Ideally, the receiver would have enough buffer space to deal with as many fragments as can arrive in the maximum request lifetime. However, if
the receiver runs out of buffer space to reassemble a message, it MUST drop the message.

The fragment field of the forwarding header is used to encode fragmentation information. The offset is the number of bytes between the end of the forwarding header and the start of the data. The first fragment therefore has an offset of 0. The last-fragment indicator MUST be appropriately set. If the message is not fragmented, it is simply treated as if it is the only fragment: the last-fragment bit is set and the offset is 0, resulting in a fragment value of 0xC0000000.

Note: The reason for this definition of the fragment field is that originally, the high bit was defined in part of the specification as "is fragmented", so there was some specification ambiguity about how to encode messages with only one fragment. This ambiguity was resolved in favor of always encoding as the "last" fragment with offset 0, thus simplifying the receiver code path, but resulting in the high bit being redundant. Because messages MUST be sent with the high bit set to 1, implementations SHOULD discard any message with it set to 0. Implementations (presumably legacy ones) which choose to accept such messages MUST either ignore the remaining bits or ensure that they are 0. They MUST NOT interpret messages with the high bit set to 0 as fragmented.
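The fragment field encoding described above can be illustrated by the following non-normative Python sketch: the high bit is always 1, the next bit marks the last fragment, and the remaining 30 bits carry the offset.

   # Non-normative sketch: encoding and decoding the fragment field.
   def encode_fragment_field(offset: int, last_fragment: bool) -> int:
       value = 0x80000000 | (offset & 0x3FFFFFFF)   # high bit always set
       if last_fragment:
           value |= 0x40000000                      # last-fragment bit
       return value

   def decode_fragment_field(value: int):
       return {"offset": value & 0x3FFFFFFF,
               "last_fragment": bool(value & 0x40000000)}

   # An unfragmented message is its own (and last) fragment at offset 0:
   assert encode_fragment_field(0, True) == 0xC0000000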
7. Data Storage Protocol

RELOAD provides a set of generic mechanisms for storing and retrieving data in the Overlay Instance. These mechanisms can be used for new applications simply by defining new code points and a small set of rules. No new protocol mechanisms are required.

The basic unit of stored data is a single StoredData structure:

   struct {
     uint32             length;
     uint64             storage_time;
     uint32             lifetime;
     StoredDataValue    value;
     Signature          signature;
   } StoredData;

The contents of this structure are as follows:

length
   The size of the StoredData structure, in bytes, excluding the size of length itself.
storage_time
   The time when the data was stored, represented as the number of milliseconds elapsed since midnight Jan 1, 1970 UTC, not counting leap seconds. This will have the same values for seconds as standard UNIX or POSIX time. More information can be found at [UnixTime]. Any attempt to store a data value with a storage time before that of a value already stored at this location MUST generate an Error_Data_Too_Old error. This prevents rollback attacks. The node SHOULD make a best-effort attempt to use a correct clock to determine this number. However, the protocol does not require synchronized clocks: the receiving peer uses the storage time in the previous store, not its own clock. Clock values are used so that when clocks are generally synchronized, data may be stored in a single transaction, rather than querying for the value of a counter before the actual store.

   If a node attempting to store new data in response to a user request (rather than as an overlay maintenance operation such as occurs when healing the overlay from a partition) is rejected with an Error_Data_Too_Old error, the node MAY elect to perform its store using a storage_time that increments the value used with the previous store (this may be obtained by doing a Fetch). This situation may occur when the clocks of nodes storing to this location are not properly synchronized.

lifetime
   The validity period for the data, in seconds, starting from the time the peer receives the StoreReq.

value
   The data value itself, as described in Section 7.2.

signature
   A signature, as defined in Section 7.1.

Each Resource-ID specifies a single location in the Overlay Instance. However, each location may contain multiple StoredData values, distinguished by Kind-ID. The definition of a Kind describes both the data values which may be stored and the data model of the data. Some data models allow multiple values to be stored under the same Kind-ID. Section 7.2 describes the available data models. Thus, for instance, a given Resource-ID might contain a single-value element stored under Kind-ID X and an array containing multiple values stored under Kind-ID Y.
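The storage_time handling described above, including the retry after an Error_Data_Too_Old rejection, is illustrated by the following non-normative Python sketch. The store and fetch_storage_time callables are hypothetical stand-ins for the actual Store and Fetch transactions.

   # Non-normative sketch: user-requested store with rollback protection.
   import time

   def store_with_rollback_protection(store, fetch_storage_time, value):
       storage_time = int(time.time() * 1000)   # ms since 1970-01-01 UTC
       result = store(value, storage_time)
       if result == "Error_Data_Too_Old":
           # Our clock is behind the previously stored value; retry with a
           # storage_time just past the one already at this location.
           storage_time = fetch_storage_time() + 1
           result = store(value, storage_time)
       return result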
7.1. Data Signature Computation
Each StoredData element is individually signed. However, the signature also must be self-contained and must cover the Kind-ID and Resource-ID, even though they are not present in the StoredData structure. The input to the signature algorithm is:

   resource_id || kind || storage_time || StoredDataValue ||
     SignerIdentity

where || indicates concatenation and where these values are:

resource_id
   The Resource-ID where this data is stored.

kind
   The Kind-ID for this data.

storage_time
   The contents of the storage_time data value.

StoredDataValue
   The contents of the stored data value, as described in the previous sections.

SignerIdentity
   The signer identity, as defined in Section 6.3.4.

Once the signature has been computed, the signature is represented using a signature element, as described in Section 6.3.4.

Note that there is no necessary relationship between the validity window of a certificate and the expiry of the data it is authenticating. When signatures are verified, the current time MUST be compared to the certificate validity period. Stored data MAY be set to expire after the signing certificate's validity period. Such signatures are not considered valid after the signing certificate expires. Implementations may "garbage collect" such data at their convenience, either by purging it automatically (perhaps by setting the upper bound on data storage to the lifetime of the signing certificate) or by simply leaving it in place until it expires naturally and relying on users of that data to notice the expired signing certificate.
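The concatenation above can be illustrated by the following non-normative Python sketch. The Resource-ID, the encoded StoredDataValue, and the encoded SignerIdentity are treated as byte strings whose encodings are defined elsewhere in this document; the Kind-ID and storage_time widths follow the KindId typedef and the StoredData structure.

   # Non-normative sketch: assembling the input to the signature algorithm.
   import struct

   def signature_input(resource_id: bytes, kind_id: int, storage_time: int,
                       stored_data_value: bytes, signer_identity: bytes) -> bytes:
       return (resource_id
               + struct.pack("!I", kind_id)        # Kind-ID (uint32)
               + struct.pack("!Q", storage_time)   # storage_time (uint64, ms)
               + stored_data_value                 # encoded StoredDataValue
               + signer_identity)                  # encoded SignerIdentity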
7.2. Data Models
The protocol currently defines the following data models:

o single value

o array

o dictionary

These are represented with the StoredDataValue structure. The actual data model is known from the Kind being stored.

   struct {
     Boolean            exists;
     opaque             value<0..2^32-1>;
   } DataValue;

   struct {
     select (DataModel) {
       case single_value:
         DataValue      single_value_entry;

       case array:
         ArrayEntry     array_entry;

       case dictionary:
         DictionaryEntry dictionary_entry;

       /* This structure may be extended */
     };
   } StoredDataValue;

The following sections discuss the properties of each data model.

7.2.1. Single Value
A single-value element is a simple sequence of bytes. There may be only one single-value element for each Resource-ID, Kind-ID pair.

A single-value element is represented as a DataValue, which contains the following two elements:

exists
   This value indicates whether the value exists at all. If it is set to False, it means that no value is present. If it is True, this means that a value is present. This gives the protocol a mechanism for indicating nonexistence as opposed to emptiness.
value
   The stored data.

7.2.2. Array
An array is a set of opaque values addressed by an integer index. Arrays are zero based. Note that arrays can be sparse. For instance, a Store of "X" at index 2 in an empty array produces an array with the values [ NA, NA, "X" ]. Future attempts to fetch elements at index 0 or 1 will return values with "exists" set to False.

An array element is represented as an ArrayEntry:

   struct {
     uint32             index;
     DataValue          value;
   } ArrayEntry;

The contents of this structure are:

index
   The index of the data element in the array.

value
   The stored data.
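The sparse-array semantics above are illustrated by the following non-normative Python sketch; the SparseArray class is purely illustrative and models only the exists/value behavior described in this section.

   # Non-normative sketch: sparse-array semantics of the array data model.
   class SparseArray:
       def __init__(self):
           self.entries = {}                 # index -> value

       def store(self, index, value):
           self.entries[index] = value

       def fetch(self, index):
           if index in self.entries:
               return {"exists": True, "value": self.entries[index]}
           return {"exists": False, "value": b""}

   a = SparseArray()
   a.store(2, b"X")
   assert a.fetch(0)["exists"] is False      # NA: never stored
   assert a.fetch(2) == {"exists": True, "value": b"X"}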
7.2.3. Dictionary

A dictionary is a set of opaque values indexed by an opaque key, with one value for each key. A single dictionary entry is represented as a DictionaryEntry:

   typedef opaque       DictionaryKey<0..2^16-1>;

   struct {
     DictionaryKey      key;
     DataValue          value;
   } DictionaryEntry;

The contents of this structure are:

key
   The dictionary key for this value.

value
   The stored data.