6.5. Forwarding and Link Management Layer
Each node maintains connections to a set of other nodes defined by the Topology Plug-in. This section defines the methods RELOAD uses to form and maintain connections between nodes in the overlay. Three methods are defined:

Attach
   Used to form RELOAD connections between nodes using ICE for NAT traversal. When node A wants to connect to node B, it sends an Attach message to node B through the overlay. The Attach contains A's ICE parameters. B responds with its ICE parameters, and the two nodes perform ICE to form a connection. Attach also allows two nodes to connect via No-ICE instead of full ICE.

AppAttach
   Used to form application-layer connections between nodes.

Ping
   A simple request/response which is used to verify connectivity of the target peer.

6.5.1. Attach
A node sends an Attach request when it wishes to establish a direct Overlay Link connection to another node for the purpose of sending RELOAD messages. A client that can establish a connection directly need not send an Attach, as described in the second bullet of Section 4.2.1.

As described in Section 6.1, an Attach may be routed to either a Node-ID or a Resource-ID. An Attach routed to a specific Node-ID will fail if that node is not reached. An Attach routed to a Resource-ID will establish a connection with the peer currently responsible for that Resource-ID, which may be useful in establishing a direct connection to the responsible peer for use with frequent or large resource updates.

An Attach, in and of itself, does not result in updating the Routing Table of either node. That function is performed by Updates. If node A has Attached to node B, but has not received any Updates from B, it MAY route messages which are directly addressed to B through that channel, but it MUST NOT route messages through B to other peers via that channel. The process of Attaching is separate from the process of becoming a peer (using Join and Update), to prevent half-open states where a node has started to form connections but is not really ready to act as a peer. Thus, clients (unlike peers) can simply Attach without sending Join or Update.
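The routing restriction above can be illustrated with the following non-normative Python sketch; the Link type and its fields are purely illustrative and not part of this specification.

   # Non-normative sketch: deciding whether a message may use a link
   # that was Attached but for which no Update has been received.
   class Link:
       def __init__(self, remote_node_id, update_received):
           self.remote_node_id = remote_node_id    # Node-ID of the Attached peer
           self.update_received = update_received  # True once an Update arrives

   def may_send_on_link(link, destination_node_id):
       """Return True if a message for destination_node_id may use this link."""
       if link.update_received:
           # The link is part of the Routing Table; normal routing applies.
           return True
       # Before any Update: only messages addressed directly to the
       # Attached peer may use this link (it MUST NOT be used to route
       # messages through that peer to others).
       return destination_node_id == link.remote_node_id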
6.5.1.1. Request Definition
An Attach request message contains the requesting node's ICE connection parameters formatted into a binary structure.

   enum { invalidOverlayLinkType(0), DTLS-UDP-SR(1),
          DTLS-UDP-SR-NO-ICE(3), TLS-TCP-FH-NO-ICE(4), (255) }
     OverlayLinkType;

   enum { invalidCandType(0), host(1), srflx(2), /* RESERVED(3), */
          relay(4), (255) } CandType;

   struct {
     opaque             name<0..2^16-1>;
     opaque             value<0..2^16-1>;
   } IceExtension;

   struct {
     IpAddressPort      addr_port;
     OverlayLinkType    overlay_link;
     opaque             foundation<0..255>;
     uint32             priority;
     CandType           type;
     select (type) {
       case host:
         ;              /* Empty */
       case srflx:
       case relay:
         IpAddressPort  rel_addr_port;
     };
     IceExtension       extensions<0..2^16-1>;
   } IceCandidate;

   struct {
     opaque             ufrag<0..2^8-1>;
     opaque             password<0..2^8-1>;
     opaque             role<0..2^8-1>;
     IceCandidate       candidates<0..2^16-1>;
     Boolean            send_update;
   } AttachReqAns;

The values contained in AttachReqAns are:

ufrag
   The username fragment (from ICE).
password
   The ICE password.

role
   An active/passive/actpass attribute from RFC 4145 [RFC4145]. This value MUST be "passive" for the offerer (the peer sending the Attach request) and "active" for the answerer (the peer sending the Attach response).

candidates
   One or more ICE candidate values, as described below.

send_update
   Has the same meaning as the send_update field in RouteQueryReq.

Each ICE candidate is represented as an IceCandidate structure, which is a direct translation of the information from the ICE string structures, with the exception of the component ID. Since there is only one component, it is always 1 and is thus left out of the structure. The remaining values are specified as follows:

addr_port
   Corresponds to the ICE connection-address and port productions.

overlay_link
   Corresponds to the ICE transport production. Overlay Link protocols used with No-ICE MUST specify "No-ICE" in their description. Future overlay link values can be added by defining new OverlayLinkType values in the IANA registry as described in Section 14.10. Future extensions to the encapsulation or framing that provide for backward compatibility with the previously specified encapsulation or framing values MUST use the same OverlayLinkType value that was previously defined. OverlayLinkType protocols are defined in Section 6.6. A single AttachReqAns MUST NOT include both candidates whose OverlayLinkType protocols use ICE (the default) and candidates that specify "No-ICE".

foundation
   Corresponds to the ICE foundation production.

priority
   Corresponds to the ICE priority production.

type
   Corresponds to the ICE cand-type production.
rel_addr_port
   Corresponds to the ICE rel-addr and rel-port productions. It is present only for types "relay", "prflx", and "srflx".

extensions
   ICE extensions. The name and value fields correspond to binary translations of the equivalent fields in the ICE extensions.

These values should be generated using the procedures described in Section 6.5.1.3.
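As a non-normative illustration of the presentation-language encoding above, the following Python sketch serializes an IceCandidate. Variable-length vectors carry a length prefix sized by their ceiling (e.g., <0..255> uses 1 byte and <0..2^16-1> uses 2 bytes), and enums bounded by (255) occupy a single byte. The encoding of IpAddressPort is defined elsewhere in this document and is treated here as an already-encoded byte string; the helper names are illustrative only.

   # Non-normative sketch of encoding an IceCandidate.
   import struct

   def opaque8(data: bytes) -> bytes:
       return struct.pack("!B", len(data)) + data      # <0..2^8-1> vector

   def opaque16(data: bytes) -> bytes:
       return struct.pack("!H", len(data)) + data      # <0..2^16-1> vector

   def encode_ice_candidate(addr_port: bytes, overlay_link: int,
                            foundation: bytes, priority: int,
                            cand_type: int, rel_addr_port: bytes = b"",
                            extensions: bytes = b"") -> bytes:
       # Numeric values follow the OverlayLinkType and CandType enums above.
       out = addr_port                                 # IpAddressPort (pre-encoded)
       out += struct.pack("!B", overlay_link)          # OverlayLinkType
       out += opaque8(foundation)                      # foundation<0..255>
       out += struct.pack("!I", priority)              # uint32 priority
       out += struct.pack("!B", cand_type)             # CandType
       if cand_type in (2, 4):                         # srflx(2) or relay(4)
           out += rel_addr_port                        # IpAddressPort rel_addr_port
       out += opaque16(extensions)                     # serialized IceExtension list
       return out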
6.5.1.2. Response Definition

If a peer receives an Attach request, it MUST determine how to process the request as follows:

o If the peer has not initiated an Attach request to the originating peer of this Attach request, it MUST process this request and SHOULD generate its own response with an AttachReqAns. It should then begin ICE checks.

o If the peer has already sent an Attach request to and received the response from the originating peer of this Attach request and, as a result, an ICE check and TLS connection are in progress, then it SHOULD generate an Error_In_Progress error instead of an AttachReqAns.

o If the peer has already sent an Attach request to but not yet received the response from the originating peer of this Attach request, it SHOULD apply the following tie-breaker heuristic to determine how to handle this Attach request and the incomplete Attach request it has sent out:

   * If the peer's own Node-ID is smaller when compared as big-endian unsigned integers, it MUST cancel retransmission of its own incomplete Attach request. It MUST then process this Attach request, generate an AttachReqAns response, and proceed with the corresponding ICE check.

   * If the peer's own Node-ID is larger when compared as big-endian unsigned integers, it MUST generate an Error_In_Progress error to this Attach request, and then proceed to wait for and complete the Attach and the corresponding ICE check it has originated.

o If the peer is overloaded or detects some other kind of error, it MAY generate an error instead of an AttachReqAns.
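The tie-breaker above can be expressed as the following non-normative Python sketch; Node-IDs are compared as big-endian unsigned integers, and the return strings are illustrative labels only.

   # Non-normative sketch of the tie-breaker for crossed Attach requests.
   def attach_tiebreak(own_node_id: bytes, peer_node_id: bytes) -> str:
       own = int.from_bytes(own_node_id, "big")
       peer = int.from_bytes(peer_node_id, "big")
       if own < peer:
           # Smaller Node-ID: cancel retransmission of our own incomplete
           # Attach and answer the incoming request with an AttachReqAns.
           return "answer_incoming_attach"
       # Larger Node-ID: reject the incoming request with Error_In_Progress
       # and wait for our own Attach (and its ICE check) to complete.
       return "send_error_in_progress"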
When a peer receives an Attach response, it SHOULD parse the response and begin its own ICE checks.

6.5.1.3. Using ICE with RELOAD
This section describes the profile of ICE that is used with RELOAD. RELOAD implementations MUST implement full ICE.

In ICE, as defined by [RFC5245], the Session Description Protocol (SDP) is used to carry the ICE parameters. In RELOAD, this function is performed by a binary encoding in the Attach method. This encoding is more restricted than the SDP encoding because the RELOAD environment is simpler:

o Only a single media stream is supported.

o In this case, the "stream" refers not to RTP or other types of media, but rather to a connection for RELOAD itself or other application-layer protocols, such as SIP.

o RELOAD allows only for a single offer/answer exchange. Unlike the usage of ICE within SIP, there is never a need to send a subsequent offer to update the default candidates to match the ones selected by ICE.

An agent follows the ICE specification as described in [RFC5245] with the changes and additional procedures described in the subsections below.

6.5.1.4. Collecting STUN Servers
ICE relies on the node having one or more Session Traversal Utilities for NAT (STUN) servers to use. In conventional ICE, it is assumed that nodes are configured with one or more STUN servers through some out-of-band mechanism. This is still possible in RELOAD, but RELOAD also learns STUN servers as it connects to other peers. A peer on a well-provisioned wide-area overlay will be configured with one or more bootstrap nodes. These nodes provide an initial list of STUN servers. However, as the peer forms connections with additional peers, it acquires more peers that it can use as STUN servers.

Because complicated NAT topologies are possible, a peer may need more than one STUN server. Specifically, a peer that is behind a single NAT will typically observe only two IP addresses in its STUN checks: its local address and its server reflexive address from a STUN server outside its NAT. However, if more NATs are involved, a peer may
learn additional server reflexive addresses (which vary based on where in the topology the STUN server is). To maximize the chance of achieving a direct connection, a peer SHOULD group other peers by the peer-reflexive addresses it discovers through them. It SHOULD then select one peer from each group to use as a STUN server for future connections. Only peers to which the peer currently has connections may be used. If the connection to that host is lost, it MUST be removed from the list of STUN servers, and a new server from the same group MUST be selected unless there are no other servers in the group, in which case some other peer MAY be used.
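The grouping rule above is illustrated by the following non-normative Python sketch; the data structures and the choice of the first peer in each group are illustrative only.

   # Non-normative sketch: group connected peers by the reflexive address
   # observed through each of them, then pick one STUN server per group.
   from collections import defaultdict

   def select_stun_servers(reflexive_by_peer):
       """reflexive_by_peer maps a connected peer to the reflexive address
       this node observed when using that peer as a STUN server."""
       groups = defaultdict(list)
       for peer, reflexive_addr in reflexive_by_peer.items():
           groups[reflexive_addr].append(peer)
       # One representative STUN server per observed reflexive address.
       return [peers[0] for peers in groups.values()]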
6.5.1.5. Gathering Candidates

When a node wishes to establish a connection for the purposes of RELOAD signaling or application signaling, it follows the process of gathering candidates as described in Section 4 of ICE [RFC5245]. RELOAD utilizes a single component. Consequently, gathering for these "streams" requires a single component. In the case where a node has not yet found a TURN server, the agent would not include a relayed candidate.

The ICE specification assumes that an ICE agent is configured with, or somehow knows of, TURN and STUN servers. RELOAD provides a way for an agent to learn these by querying the overlay, as described in Sections 6.5.1.4 and 9.

The default candidate selection described in Section 4.1.4 of ICE is ignored; defaults are not signaled or utilized by RELOAD.

An alternative to using the full ICE supported by the Attach request is to use the No-ICE mechanism by providing candidates with "No-ICE" Overlay Link protocols. Configuration for the overlay indicates whether or not these Overlay Link protocols can be used. An overlay MUST be either all ICE or all No-ICE. No-ICE will not work in all the scenarios where ICE would work, but in some cases, particularly those with no NATs or firewalls, it will work.

6.5.1.6. Prioritizing Candidates
Standardization of additional protocols for use with ICE is expected, including TCP [RFC6544] and protocols such as the Stream Control Transmission Protocol (SCTP) [RFC4960] and Datagram Congestion Control Protocol (DCCP) [RFC4340]. UDP encapsulations for SCTP and DCCP would expand the Overlay Link protocols available for RELOAD.
When additional protocols are available, the following prioritization is RECOMMENDED:

o Highest priority is assigned to protocols that offer well-understood congestion and flow control without head-of-line blocking, for example, SCTP without message ordering, DCCP, and those protocols encapsulated using UDP.

o Second highest priority is assigned to protocols that offer well-understood congestion and flow control, but that have head-of-line blocking, such as TCP.

o Lowest priority is assigned to protocols encapsulated over UDP that do not implement well-established congestion control algorithms. The DTLS/UDP with Simple Reliability (SR) overlay link protocol is an example of such a protocol.

Head-of-line blocking is undesirable in an Overlay Link protocol, because the messages carried on a RELOAD link are independent, rather than stream-oriented. Therefore, if message N on a link is lost, delaying message N+1 on that same link until N is successfully retransmitted does nothing other than increase the latency for the transaction of message N+1, as they are unrelated to each other. Therefore, while the high quality, performance, and availability of modern TCP implementations make them very attractive, their performance as Overlay Link protocols is not optimal.

Note that none of the protocols defined in this document meets these conditions, but it is expected that new Overlay Link protocols defined in the future will fill this gap.
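One way an implementation might apply the recommended ordering is as a type-preference value fed into the ICE priority calculation. The following non-normative Python sketch shows such a mapping; the protocol names other than DTLS-UDP-SR and the numeric tiers are illustrative assumptions, not values defined by this document.

   # Non-normative sketch: candidate type preference by Overlay Link protocol.
   PREFERENCE = {
       "SCTP-over-UDP": 126,   # no head-of-line blocking, mature congestion control
       "DCCP-over-UDP": 126,
       "TCP":            90,   # congestion control, but head-of-line blocking
       "DTLS-UDP-SR":    50,   # simple reliability, limited congestion control
   }

   def type_preference(overlay_link_protocol: str) -> int:
       return PREFERENCE.get(overlay_link_protocol, 0)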
6.5.1.7. Encoding the Attach Message

Section 4.3 of ICE describes procedures for encoding the SDP for conveying RELOAD candidates. Instead of actually encoding an SDP message, the candidate information (IP address and port and transport protocol, priority, foundation, type, and related address) is carried within the attributes of the Attach request or its response. Similarly, the username fragment and password are carried in the Attach message or its response. Section 6.5.1 describes the detailed attribute encoding for Attach. The Attach request and its response do not contain any default candidates or the ice-lite attribute, as these features of ICE are not used by RELOAD.

Since the Attach request contains the candidate information and short-term credentials, it is considered as an offer for a single media stream that happens to be encoded in a format different than SDP, but is otherwise considered a valid offer for the purposes of following
the ICE specification. Similarly, the Attach response is considered a valid answer for the purposes of following the ICE specification.

6.5.1.8. Verifying ICE Support
An agent MUST skip the verification procedures in Sections 5.1 and 6.1 of ICE. Since RELOAD requires full ICE from all agents, this check is not required.

6.5.1.9. Role Determination
The roles of controlling and controlled, as described in Section 5.2 of ICE, are still utilized with RELOAD. However, the offerer (the entity sending the Attach request) will always be controlling, and the answerer (the entity sending the Attach response) will always be controlled. The connectivity checks MUST still contain the ICE-CONTROLLED and ICE-CONTROLLING attributes, however, even though the role reversal capability for which they are defined will never be needed with RELOAD. This is to allow for a common codebase between ICE for RELOAD and ICE for SDP.

6.5.1.10. Full ICE
When the overlay uses ICE, connectivity checks and nominations are used as in regular ICE.

6.5.1.10.1. Connectivity Checks
The processes of forming check lists in Section 5.7 of ICE, scheduling checks in Section 5.8, and performing connectivity checks in Section 7 are used with RELOAD without change.

6.5.1.10.2. Concluding ICE
The procedures in Section 8 of ICE are followed to conclude ICE, with the following exceptions:

o The controlling agent MUST NOT attempt to send an updated offer once the state of its single media stream reaches Completed.

o Once the state of ICE reaches Completed, the agent can immediately free all unused candidates. This is because RELOAD does not have the concept of forking, and thus the three-second delay in Section 8.3 of ICE does not apply.
6.5.1.10.3. Media Keepalives
STUN MUST be utilized for the keepalives described in Section 10 of ICE.

6.5.1.11. No-ICE
No-ICE is selected when either side has provided "No-ICE" Overlay Link candidates. STUN is not used for connectivity checks when doing No-ICE; instead, the DTLS or TLS handshake (or similar security layer of future overlay link protocols) forms the connectivity check. The certificate exchanged during the TLS or DTLS handshake MUST match the node which sent the AttachReqAns, and if it does not, the connection MUST be closed.

6.5.1.12. Subsequent Offers and Answers
An agent MUST NOT send a subsequent offer or answer. Thus, the procedures in Section 9 of ICE MUST be ignored.

6.5.1.13. Sending Media
The procedures of Section 11 of ICE apply to RELOAD as well. However, in this case, the "media" takes the form of application-layer protocols (e.g., RELOAD) over TLS or DTLS. Consequently, once ICE processing completes, the agent will begin TLS or DTLS procedures to establish a secure connection. The node that sent the Attach request MUST be the TLS server. The other node MUST be the TLS client. The server MUST request TLS client authentication. The nodes MUST verify that the certificate presented in the handshake matches the identity of the other peer as found in the Attach message. Once the TLS or DTLS signaling is complete, the application protocol is free to use the connection. The concept of a previously selected pair for a component does not apply to RELOAD, since ICE restarts are not possible with RELOAD.

6.5.1.14. Receiving Media
An agent MUST be prepared to receive packets for the application protocol (TLS or DTLS carrying RELOAD) at any time. The jitter and RTP considerations in Section 11 of ICE do not apply to RELOAD.

6.5.2. AppAttach
A node sends an AppAttach request when it wishes to establish a direct connection to another node for the purposes of sending application-layer messages. AppAttach is nearly identical to Attach,
except for the purpose of the connection: it is used to transport non-RELOAD "media". A separate request is used to avoid implementer confusion between the two methods (this was found to be a real problem with initial implementations). The AppAttach request and its response contain an application attribute, which indicates what protocol is to be run over the connection.

6.5.2.1. Request Definition
An AppAttachReq message contains the requesting node's ICE connection parameters formatted into a binary structure.

   struct {
     opaque             ufrag<0..2^8-1>;
     opaque             password<0..2^8-1>;
     uint16             application;
     opaque             role<0..2^8-1>;
     IceCandidate       candidates<0..2^16-1>;
   } AppAttachReq;

The values contained in AppAttachReq and AppAttachAns are:

ufrag
   The username fragment (from ICE).

password
   The ICE password.

application
   A 16-bit Application-ID, as defined in Section 14.5. This number represents the IANA-registered application that is going to send data on this connection.

role
   An active/passive/actpass attribute from RFC 4145 [RFC4145].

candidates
   One or more ICE candidate values.

The application using the connection that is set up with this request is responsible for providing traffic of sufficient frequency to keep the NAT and firewall bindings alive. Applications will often send traffic every 25 seconds to ensure this.
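As a non-normative illustration of the keepalive responsibility above, the following Python sketch sends application traffic at a fixed 25-second interval; the send_keepalive callback is a hypothetical stand-in for whatever traffic the application chooses to send.

   # Non-normative sketch: periodic application-layer keepalive traffic.
   import threading

   def start_keepalive(send_keepalive, interval=25.0):
       """Call send_keepalive() now and then every `interval` seconds."""
       send_keepalive()
       timer = threading.Timer(interval, start_keepalive,
                               args=(send_keepalive, interval))
       timer.daemon = True
       timer.start()
       return timer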
6.5.2.2. Response Definition
If a peer receives an AppAttach request, it SHOULD process the request and generate its own response with an AppAttachAns. It should then begin ICE checks. When a peer receives an AppAttach response, it SHOULD parse the response and begin its own ICE checks. If the Application-ID is not supported, the peer MUST reply with an Error_Not_Found error.

   struct {
     opaque             ufrag<0..2^8-1>;
     opaque             password<0..2^8-1>;
     uint16             application;
     opaque             role<0..2^8-1>;
     IceCandidate       candidates<0..2^16-1>;
   } AppAttachAns;

The meaning of the fields is the same as in the AppAttachReq.

6.5.3. Ping
Ping is used to test connectivity along a path. A Ping can be addressed to a specific Node-ID, to the peer controlling a given location (by using a Resource-ID), or to the wildcard Node-ID.

6.5.3.1. Request Definition
The PingReq structure is used to make a Ping request.

   struct {
     opaque<0..2^16-1>  padding;
   } PingReq;

The Ping request is empty of meaningful contents. However, it may contain up to 65535 bytes of padding to facilitate the discovery of overlay maximum packet sizes.
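The following non-normative Python sketch builds PingReq bodies of increasing size for such probing. Only the method body is shown; wrapping it in a RELOAD message with a forwarding header is assumed to be handled elsewhere, and the probe sizes are arbitrary examples.

   # Non-normative sketch: padded PingReq bodies for probing the largest
   # message the overlay will carry.
   import struct

   def ping_req_body(padding_len: int) -> bytes:
       """PingReq { opaque<0..2^16-1> padding; } with padding_len zero bytes."""
       if padding_len > 0xFFFF:
           raise ValueError("padding is limited to 65535 bytes")
       return struct.pack("!H", padding_len) + bytes(padding_len)

   # Example probe sizes for a simple search over candidate message sizes.
   probes = [ping_req_body(n) for n in (512, 1024, 2048, 4096, 8192)]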
6.5.3.2. Response Definition

A successful PingAns response contains the information elements requested by the peer.

   struct {
     uint64             response_id;
     uint64             time;
   } PingAns;
A PingAns message contains the following elements:

response_id
   A randomly generated 64-bit response ID. This is used to distinguish Ping responses.

time
   The time when the Ping response was created, represented in the same way as storage_time, defined in Section 7.

6.5.4. ConfigUpdate
The ConfigUpdate method is used to push updated configuration data across the overlay. Whenever a node detects that another node has old configuration data, it MUST generate a ConfigUpdate request. The ConfigUpdate request allows updating of two kinds of data: the configuration data (Section 6.3.2.1) and the Kind information (Section 7.4.1.1).

6.5.4.1. Request Definition
The ConfigUpdateReq structure is used to provide updated configuration information.

   enum { invalidConfigUpdateType(0), config(1), kind(2), (255) }
     ConfigUpdateType;

   typedef uint32        KindId;
   typedef opaque        KindDescription<0..2^16-1>;

   struct {
     ConfigUpdateType    type;
     uint32              length;

     select (type) {
       case config:
         opaque          config_data<0..2^24-1>;

       case kind:
         KindDescription kinds<0..2^24-1>;

       /* This structure may be extended with new types */
     };
   } ConfigUpdateReq;
The ConfigUpdateReq message contains the following elements:

type
   The type of the contents of the message. This structure allows for unknown content types.

length
   The length of the remainder of the message. This is included to preserve backward compatibility and is 32 bits instead of 24 to facilitate easy conversion between network and host byte order.

config_data (type==config)
   The contents of the Configuration Document.

kinds (type==kind)
   One or more XML kind-block productions (see Section 11.1). These MUST be encoded with UTF-8 and assume a default namespace of "urn:ietf:params:xml:ns:p2p:config-base".

6.5.4.2. Response Definition
The ConfigUpdateAns structure is used to respond to a ConfigUpdateReq request.

   struct {
   } ConfigUpdateAns;

If the ConfigUpdateReq is of type "config", it MUST be processed only if all of the following are true:

o The sequence number in the document is greater than the current configuration sequence number.

o The Configuration Document is correctly digitally signed (see Section 11 for details on signatures).

Otherwise, appropriate errors MUST be generated.

If the ConfigUpdateReq is of type "kind", it MUST be processed only if it is correctly digitally signed by an acceptable Kind signer (i.e., one listed in the current configuration file). Details on the kind-signer field in the configuration file are described in Section 11.1. In addition, if the Kind update conflicts with an existing known Kind (i.e., it is signed by a different signer), then it should be rejected with an Error_Forbidden error. This should not happen in correctly functioning overlays.
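The acceptance checks for a ConfigUpdateReq of type "config" can be summarized by the following non-normative Python sketch; signature_is_valid is a hypothetical stand-in for the signature verification defined in Section 11, and error generation is left abstract.

   # Non-normative sketch: deciding whether to apply a new Configuration
   # Document carried in a ConfigUpdateReq of type "config".
   def accept_config_update(new_sequence, new_document,
                            current_sequence, signature_is_valid):
       """Return True if the new Configuration Document should be applied."""
       if new_sequence <= current_sequence:
           return False    # reject: sequence number is not greater
       if not signature_is_valid(new_document):
           return False    # reject: document is not correctly signed
       return True         # apply the new configuration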
If the update is acceptable, then the node MUST reconfigure itself to match the new information. This may include adding permissions for new Kinds, deleting old Kinds, or even, in extreme circumstances, exiting and re-entering the overlay, if, for instance, the DHT algorithm has changed.

If an implementation misses enough ConfigUpdates that include key changes, it is possible that it will no longer be able to verify new valid ConfigUpdates. In this case, the only available recovery mechanism is to attempt to retrieve a new Configuration Document, typically by the mechanisms used for initial bootstrapping. It is up to implementers whether or how to decide to employ this sort of recovery mechanism.

The response for ConfigUpdate is empty.

6.6. Overlay Link Layer
RELOAD can use multiple Overlay Link protocols to send its messages. Because ICE is used to establish connections (see Section 6.5.1.3), RELOAD nodes are able to detect which Overlay Link protocols are offered by other nodes and establish connections between them. Any link protocol needs to be able to establish a secure, authenticated connection and to provide data origin authentication and message integrity for individual data elements. RELOAD currently supports three Overlay Link protocols:

o DTLS [RFC6347] over UDP with Simple Reliability (SR) (OverlayLinkType=DTLS-UDP-SR)

o TLS [RFC5246] over TCP with Framing Header, No-ICE (OverlayLinkType=TLS-TCP-FH-NO-ICE)

o DTLS [RFC6347] over UDP with SR, No-ICE (OverlayLinkType=DTLS-UDP-SR-NO-ICE)

Note that although UDP does not properly have "connections", both TLS and DTLS have a handshake that establishes a similar, stateful association. We refer to these as "connections" for the purposes of this document.

If a peer receives a message that is larger than the value of max-message-size defined in the overlay configuration, the peer SHOULD send an Error_Message_Too_Large error and then close the TLS or DTLS session from which the message was received. Note that this error can be sent and the session closed before the peer receives the complete message. If the forwarding header is larger than the
max-message-size, the receiver SHOULD close the TLS or DTLS session without sending an error.

The RELOAD mechanism requires that failed links be quickly removed from the Routing Table so end-to-end retransmission can handle lost messages. Overlay Link protocols MUST be designed with a mechanism that quickly signals a likely failure, and implementations SHOULD quickly act to remove a failed link from the Routing Table when receiving this signal. The entry can be restored if it proves to resume functioning, or it can be replaced at some point in the future if necessary. Section 10.7.2 contains more details specific to the CHORD-RELOAD Topology Plug-in.

The Framing Header (FH) is used to frame messages and provide timing when used on a reliable stream-based transport protocol. Simple Reliability (SR) uses the FH to provide congestion control and partial reliability when using unreliable message-oriented transport protocols. We will first define each of these algorithms in Sections 6.6.2 and 6.6.3, and then define Overlay Link protocols that use them in Sections 6.6.4, 6.6.5, and 6.6.6.

Note: We expect future Overlay Link protocols to define replacements for all components of these protocols, including the Framing Header. The three protocols that we will discuss have been chosen for simplicity of implementation and reasonable performance.

6.6.1. Future Overlay Link Protocols
It is possible to define new link-layer protocols and apply them to a new overlay using the "overlay-link-protocol" configuration directive (see Section 11.1). However, any new protocols MUST meet the following requirements:

Endpoint authentication:
   When a node forms an association with another endpoint, it MUST be possible to cryptographically verify that the endpoint has a given Node-ID.

Traffic origin authentication and integrity:
   When a node receives traffic from another endpoint, it MUST be possible to cryptographically verify that the traffic came from a given association and that it has not been modified in transit from the other endpoint in the association. The overlay link protocol MUST also provide replay prevention/detection.

Traffic confidentiality:
   When a node sends traffic to another endpoint, it MUST NOT be possible for a third party that is not involved in the association to determine the contents of that traffic.
Any new overlay protocol MUST be defined via Standards Action [RFC5226]. See Section 14.11.

6.6.1.1. HIP
In a Host Identity Protocol Based Overlay Networking Environment (HIP BONE) [RFC6079], HIP [RFC5201] provides connection management (e.g., NAT traversal and mobility) and security for the overlay network. The P2PSIP Working Group has expressed interest in supporting a HIP-based link protocol. Such support would require specifying such details as:

o How to issue certificates which provide identities meaningful to the HIP base exchange. We anticipate that this would require a mapping between Overlay Routable Cryptographic Hash Identifiers (ORCHIDs) and Node-IDs.

o How to carry the HIP I1 and I2 messages.

o How to carry RELOAD messages over HIP.

[HIP-RELOAD] documents work in progress on using RELOAD with the HIP BONE.

6.6.1.2. ICE-TCP
The ICE-TCP RFC [RFC6544] allows TCP to be supported as an Overlay Link protocol that can be added using ICE.

6.6.1.3. Message-Oriented Transports
Modern message-oriented transports offer high performance and good congestion control, and they avoid head-of-line blocking in case of lost data. These characteristics make them preferable as underlying transport protocols for RELOAD links. SCTP without message ordering and DCCP are two examples of such protocols. However, currently they are not well-supported by commonly available NATs, and specifications for ICE session establishment are not available.

6.6.1.4. Tunneled Transports
As of the time of this writing, there is significant interest in the IETF community in tunneling other transports over UDP, which is motivated by the situation that UDP is well-supported by modern NAT hardware and by the fact that performance similar to a native implementation can be achieved. Currently, SCTP, DCCP, and a generic tunneling extension are being proposed for message-oriented protocols. Once ICE traversal has been specified for these tunneled
protocols, they should be straightforward to support as overlay link protocols.

6.6.2. Framing Header
In order to support unreliable links and to allow for quick detection of link failures when using reliable end-to-end transports, each message is wrapped in a very simple framing layer (FramedMessage), which is used only for each hop. This layer contains a sequence number which can then be used for ACKs. The same header is used for both reliable and unreliable transports for simplicity of implementation.

The definition of FramedMessage is:

   enum { data(128), ack(129), (255) } FramedMessageType;

   struct {
     FramedMessageType  type;

     select (type) {
       case data:
         uint32         sequence;
         opaque         message<0..2^24-1>;

       case ack:
         uint32         ack_sequence;
         uint32         received;
     };
   } FramedMessage;

The type field of the PDU is set to indicate whether the message is data or an acknowledgement.

If the message is of type "data", then the remainder of the PDU is as follows:

sequence
   The sequence number. This increments by one for each framed message sent over this transport session.

message
   The message that is being transmitted.

Each connection has its own sequence number space. Initially, the value is zero, and it increments by exactly one for each message sent over that connection.
When the receiver receives a message, it SHOULD immediately send an ACK message. The receiver MUST keep track of the 32 most recent sequence numbers received on this association in order to generate the appropriate ACK.

If the PDU is of type "ack", the contents are as follows:

ack_sequence
   The sequence number of the message being acknowledged.

received
   A bitmask indicating whether each of the previous 32 sequence numbers before this packet has been among the 32 packets most recently received on this connection. When a packet with sequence number N is received, the receiver examines the sequence numbers of the 32 packets it received most recently on this connection. For each such previously received packet with sequence number M, if M is less than N but greater than N-32, the N-M bit of the received bitmask is set to one; otherwise, it is set to zero. Note that a bit being set to one indicates positively that a particular packet was received, but a bit being set to zero means only that it is unknown whether or not the packet has been received, because it might have been received before the 32 most recently received packets.

The received field bits in the ACK provide a high degree of redundancy so that the sender can figure out which packets the receiver has received and can then estimate packet loss rates. If the sender also keeps track of the time at which recent sequence numbers have been sent, the RTT (round-trip time) can be estimated. Note that because retransmissions receive new sequence numbers, multiple ACKs may be received for the same message. This approach provides more information than traditional TCP sequence numbers, but care must be taken when applying algorithms designed based on TCP's stream-oriented sequence numbers.
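The construction of the received bitmask can be illustrated by the following non-normative Python sketch. Here bit k is taken to be the k-th least significant bit of the 32-bit field; that choice is illustrative, as this section does not pin down the bit ordering.

   # Non-normative sketch: build the "received" bitmask for an ACK of the
   # packet with sequence number n, given the 32 most recently received
   # sequence numbers.
   def received_bitmask(n, recent_sequences):
       mask = 0
       for m in recent_sequences:
           if n - 32 < m < n:            # m is among the 32 numbers before n
               mask |= 1 << (n - m)      # set bit n-m: packet m was received
       return mask & 0xFFFFFFFF

   # Example: packets 7, 9, and 10 were received before packet 11 arrives;
   # received_bitmask(11, {7, 9, 10}) sets bits 1, 2, and 4.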
6.6.3. Simple Reliability

When RELOAD is carried over DTLS or another unreliable link protocol, it needs to be used with a reliability and congestion control mechanism, which is provided on a hop-by-hop basis. The basic principle is that each message, regardless of whether it carries a request or a response, will get an ACK and be reliably retransmitted. The receiver's job is very simple and is limited to just sending ACKs. All the complexity is at the sender side. This allows the sending implementation to trade off performance versus implementation complexity without affecting the wire protocol.
Because the receiver's role is limited to providing packet acknowledgements, a wide variety of congestion control algorithms can be implemented on the sender side while using the same basic wire protocol. The sender algorithm used MUST meet the requirements of [RFC5405].

6.6.3.1. Stop and Wait Sender Algorithm
This section describes one possible implementation of a sender algorithm for Simple Reliability. It is adequate for overlays running on underlying networks with low latency and loss (LANs) or low-traffic overlays on the Internet.

A node MUST NOT have more than one unacknowledged message on the DTLS connection at a time. Note that because retransmissions of the same message are given new sequence numbers, there may be multiple unacknowledged sequence numbers in use.

The RTO (Retransmission TimeOut) is based on an estimate of the RTT. The value for RTO is calculated separately for each DTLS session. Implementations can use a static value for RTO or a dynamic estimate, which will result in better performance. For implementations that use a static value, the default value for RTO is 500 ms. Nodes MAY use smaller values of RTO if it is known that all nodes are within the local network. The default RTO MAY be set to a larger value, which is RECOMMENDED if it is known in advance (such as on high-latency access links) that the RTT is larger.

Implementations that use a dynamic estimate to compute the RTO MUST use the algorithm described in RFC 6298 [RFC6298], with the exception that the value of RTO SHOULD NOT be rounded up to the nearest second, but instead rounded up to the nearest millisecond. The RTT of a successful STUN transaction from the ICE stage is used as the initial measurement for formula 2.2 of RFC 6298. The sender keeps track of the time each message was sent for all recently sent messages. Any time an ACK is received, the sender can compute the RTT for that message by looking at the time the ACK was received and the time when the message was sent. This is used as a subsequent RTT measurement for formula 2.3 of RFC 6298 to update the RTO estimate. (Note that because retransmissions receive new sequence numbers, all received ACKs are used.)

An initiating node SHOULD retransmit a message if it has not received an ACK after an interval of RTO (transit nodes do not retransmit at this layer). The node MUST double the time to wait after each retransmission. For each retransmission, the sequence number MUST be incremented.
Retransmissions continue until a response is received, until a total of 5 requests have been sent, until there has been a hard ICMP error [RFC1122], or until a TLS alert indicating the end of the connection has been sent or received. The sender knows a response was received when it receives an ACK with a sequence number that indicates it is a response to one of the transmissions of this message. For example, assuming an RTO of 500 ms, requests would be sent at times 0 ms, 500 ms, 1500 ms, 3500 ms, and 7500 ms. If all retransmissions for a message fail, then the sending node SHOULD close the connection routing the message.

To determine when a link might be failing without waiting for the final timeout, observe when no ACKs have been received for an entire RTO interval, and then wait for three retransmissions to occur beyond that point. If no ACKs have been received by the time the third retransmission occurs, it is RECOMMENDED that the link be removed from the Routing Table. The link MAY be restored to the Routing Table if ACKs resume before the connection is closed, as described above.

A sender MUST wait 10 ms between receipt of an ACK and transmission of the next message.
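The dynamic RTO estimate described above is illustrated by the following non-normative Python sketch, which applies the RFC 6298 formulas but rounds the result up to the nearest millisecond rather than to the nearest second. The class name and the omission of a clock-granularity term are simplifications for illustration.

   # Non-normative sketch: RTO estimation per RFC 6298, rounded up to
   # the nearest millisecond.  Times are in seconds.
   import math

   class RtoEstimator:
       K, ALPHA, BETA = 4, 1/8, 1/4

       def __init__(self, initial_rtt):
           # Initial measurement, e.g. the RTT of the ICE STUN transaction.
           self.srtt = initial_rtt
           self.rttvar = initial_rtt / 2

       def update(self, rtt):
           # Subsequent measurements, e.g. ACK receive time minus send time.
           self.rttvar = (1 - self.BETA) * self.rttvar + self.BETA * abs(self.srtt - rtt)
           self.srtt = (1 - self.ALPHA) * self.srtt + self.ALPHA * rtt

       def rto(self):
           raw = self.srtt + self.K * self.rttvar
           return math.ceil(raw * 1000) / 1000   # round up to the nearest ms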
6.6.4. DTLS/UDP with SR

This overlay link protocol consists of DTLS over UDP while implementing the SR protocol. STUN connectivity checks and keepalives are used. Any compliant sender algorithm may be used.

6.6.5. TLS/TCP with FH, No-ICE
This overlay link protocol consists of TLS over TCP with the framing header. Because ICE is not used, STUN connectivity checks are not used upon establishing the TCP connection, nor are they used for keepalives. Because the TCP layer's application-level timeout is too slow to be useful for overlay routing, the Overlay Link implementation MUST use the framing header to measure the RTT of the connection and calculate an RTO as specified in Section 2 of [RFC6298]. The resulting RTO is not used for retransmissions, but rather as a timeout to indicate when the link SHOULD be removed from the Routing Table. It is RECOMMENDED that such a connection be retained for 30 seconds to determine if the failure was transient before concluding the link has failed permanently. When sending candidates for TLS/TCP with FH, No-ICE, a passive candidate MUST be provided.
6.6.6. DTLS/UDP with SR, No-ICE
This overlay link protocol consists of DTLS over UDP while implementing the Simple Reliability protocol. Because ICE is not used, no STUN connectivity checks or keepalives are used.

6.7. Fragmentation and Reassembly
In order to allow transmission over datagram protocols such as DTLS, RELOAD messages may be fragmented. Any node along the path can fragment the message, but only the final destination reassembles the fragments. When a node takes a packet and fragments it, each fragment has a full copy of the forwarding header, but the data after the forwarding header is broken up into appropriately sized chunks. The size of the payload chunks needs to take into account space to allow the Via and Destination Lists to grow. Each fragment MUST contain a full copy of the Via List, Destination List, and ForwardingOptions and MUST contain at least 256 bytes of the message body. If these elements cannot fit within the MTU of the underlying datagram protocol, RELOAD fragmentation is not performed, and IP-layer fragmentation is allowed to occur. The length field MUST contain the size of the message after fragmentation. When a message MUST be fragmented, it SHOULD be split into equal-sized fragments that are no larger than the Path MTU (PMTU) of the next overlay link minus 32 bytes. This is to allow the Via List to grow before further fragmentation is required.

Note that this fragmentation is not optimal for the end-to-end path -- a message may be refragmented multiple times as it traverses the overlay, but it is assembled only at the final destination. This option has been chosen as it is far easier to implement than end-to-end (e2e) PMTU discovery across an ever-changing overlay, and it effectively addresses the reliability issues of relying on IP-layer fragmentation. However, Ping can be used to allow e2e PMTU discovery to be implemented if desired.

Upon receipt of a fragmented message by the intended peer, the peer holds the fragments in a holding buffer until the entire message has been received. The message is then reassembled into a single message and processed. In order to mitigate denial-of-service (DoS) attacks, receivers SHOULD time out incomplete fragments after the maximum request lifetime (15 seconds). This time was derived from looking at the end-to-end retransmission time and saving fragments long enough for the full end-to-end retransmissions to take place. Ideally, the receiver would have enough buffer space to deal with as many fragments as can arrive in the maximum request lifetime. However, if
the receiver runs out of buffer space to reassemble a message, it MUST drop the message.

The fragment field of the forwarding header is used to encode fragmentation information. The offset is the number of bytes between the end of the forwarding header and the start of the data. The first fragment therefore has an offset of 0. The last-fragment indicator MUST be appropriately set. If the message is not fragmented, it is simply treated as if it is the only fragment: the last-fragment bit is set and the offset is 0, resulting in a fragment value of 0xC0000000.

Note: The reason for this definition of the fragment field is that originally, the high bit was defined in part of the specification as "is fragmented", so there was some specification ambiguity about how to encode messages with only one fragment. This ambiguity was resolved in favor of always encoding as the "last" fragment with offset 0, thus simplifying the receiver code path, but resulting in the high bit being redundant. Because messages MUST be sent with the high bit set to 1, implementations SHOULD discard any message with it set to 0. Implementations (presumably legacy ones) which choose to accept such messages MUST either ignore the remaining bits or ensure that they are 0. They MUST NOT interpret messages with the high bit set to 0 as fragmented.
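The fragment field encoding described above can be illustrated by the following non-normative Python sketch: the high bit is always 1, the next bit marks the last fragment, and the remaining 30 bits carry the offset.

   # Non-normative sketch: encoding and decoding the fragment field.
   def encode_fragment_field(offset: int, last_fragment: bool) -> int:
       value = 0x80000000 | (offset & 0x3FFFFFFF)   # high bit always set
       if last_fragment:
           value |= 0x40000000                      # last-fragment bit
       return value

   def decode_fragment_field(value: int):
       return {"offset": value & 0x3FFFFFFF,
               "last_fragment": bool(value & 0x40000000)}

   # An unfragmented message is its own (and last) fragment at offset 0:
   assert encode_fragment_field(0, True) == 0xC0000000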
7. Data Storage Protocol

RELOAD provides a set of generic mechanisms for storing and retrieving data in the Overlay Instance. These mechanisms can be used for new applications simply by defining new code points and a small set of rules. No new protocol mechanisms are required.

The basic unit of stored data is a single StoredData structure:

   struct {
     uint32             length;
     uint64             storage_time;
     uint32             lifetime;
     StoredDataValue    value;
     Signature          signature;
   } StoredData;

The contents of this structure are as follows:

length
   The size of the StoredData structure, in bytes, excluding the size of length itself.
storage_time
   The time when the data was stored, represented as the number of milliseconds elapsed since midnight Jan 1, 1970 UTC, not counting leap seconds. This will have the same values for seconds as standard UNIX or POSIX time. More information can be found at [UnixTime]. Any attempt to store a data value with a storage time before that of a value already stored at this location MUST generate an Error_Data_Too_Old error. This prevents rollback attacks. The node SHOULD make a best-effort attempt to use a correct clock to determine this number. However, the protocol does not require synchronized clocks: the receiving peer uses the storage time in the previous store, not its own clock. Clock values are used so that when clocks are generally synchronized, data may be stored in a single transaction, rather than querying for the value of a counter before the actual store.

   If a node attempting to store new data in response to a user request (rather than as an overlay maintenance operation such as occurs when healing the overlay from a partition) is rejected with an Error_Data_Too_Old error, the node MAY elect to perform its store using a storage_time that increments the value used with the previous store (this may be obtained by doing a Fetch). This situation may occur when the clocks of nodes storing to this location are not properly synchronized.

lifetime
   The validity period for the data, in seconds, starting from the time the peer receives the StoreReq.

value
   The data value itself, as described in Section 7.2.

signature
   A signature, as defined in Section 7.1.

Each Resource-ID specifies a single location in the Overlay Instance. However, each location may contain multiple StoredData values, distinguished by Kind-ID. The definition of a Kind describes both the data values which may be stored and the data model of the data. Some data models allow multiple values to be stored under the same Kind-ID. Section 7.2 describes the available data models. Thus, for instance, a given Resource-ID might contain a single-value element stored under Kind-ID X and an array containing multiple values stored under Kind-ID Y.
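The storage_time handling described above, including the retry after an Error_Data_Too_Old rejection, is illustrated by the following non-normative Python sketch. The store and fetch_storage_time callables are hypothetical stand-ins for the actual Store and Fetch transactions.

   # Non-normative sketch: user-requested store with rollback protection.
   import time

   def store_with_rollback_protection(store, fetch_storage_time, value):
       storage_time = int(time.time() * 1000)   # ms since 1970-01-01 UTC
       result = store(value, storage_time)
       if result == "Error_Data_Too_Old":
           # Our clock is behind the previously stored value; retry with a
           # storage_time just past the one already at this location.
           storage_time = fetch_storage_time() + 1
           result = store(value, storage_time)
       return result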
7.1. Data Signature Computation
Each StoredData element is individually signed. However, the signature also must be self-contained and must cover the Kind-ID and Resource-ID, even though they are not present in the StoredData structure. The input to the signature algorithm is:

   resource_id || kind || storage_time || StoredDataValue ||
     SignerIdentity

where || indicates concatenation and where these values are:

resource_id
   The Resource-ID where this data is stored.

kind
   The Kind-ID for this data.

storage_time
   The contents of the storage_time data value.

StoredDataValue
   The contents of the stored data value, as described in the previous sections.

SignerIdentity
   The signer identity, as defined in Section 6.3.4.

Once the signature has been computed, the signature is represented using a signature element, as described in Section 6.3.4.

Note that there is no necessary relationship between the validity window of a certificate and the expiry of the data it is authenticating. When signatures are verified, the current time MUST be compared to the certificate validity period. Stored data MAY be set to expire after the signing certificate's validity period. Such signatures are not considered valid after the signing certificate expires. Implementations may "garbage collect" such data at their convenience, either by purging it automatically (perhaps by setting the upper bound on data storage to the lifetime of the signing certificate) or by simply leaving it in place until it expires naturally and relying on users of that data to notice the expired signing certificate.
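The concatenation above can be illustrated by the following non-normative Python sketch. The Resource-ID, the encoded StoredDataValue, and the encoded SignerIdentity are treated as byte strings whose encodings are defined elsewhere in this document; the Kind-ID and storage_time widths follow the KindId typedef and the StoredData structure.

   # Non-normative sketch: assembling the input to the signature algorithm.
   import struct

   def signature_input(resource_id: bytes, kind_id: int, storage_time: int,
                       stored_data_value: bytes, signer_identity: bytes) -> bytes:
       return (resource_id
               + struct.pack("!I", kind_id)        # Kind-ID (uint32)
               + struct.pack("!Q", storage_time)   # storage_time (uint64, ms)
               + stored_data_value                 # encoded StoredDataValue
               + signer_identity)                  # encoded SignerIdentity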
7.2. Data Models
The protocol currently defines the following data models:

o single value

o array

o dictionary

These are represented with the StoredDataValue structure. The actual data model is known from the Kind being stored.

   struct {
     Boolean            exists;
     opaque             value<0..2^32-1>;
   } DataValue;

   struct {
     select (DataModel) {
       case single_value:
         DataValue      single_value_entry;

       case array:
         ArrayEntry     array_entry;

       case dictionary:
         DictionaryEntry dictionary_entry;

       /* This structure may be extended */
     };
   } StoredDataValue;

The following sections discuss the properties of each data model.

7.2.1. Single Value
A single-value element is a simple sequence of bytes. There may be only one single-value element for each Resource-ID, Kind-ID pair.

A single-value element is represented as a DataValue, which contains the following two elements:

exists
   This value indicates whether the value exists at all. If it is set to False, it means that no value is present. If it is True, this means that a value is present. This gives the protocol a mechanism for indicating nonexistence as opposed to emptiness.
value
   The stored data.

7.2.2. Array
An array is a set of opaque values addressed by an integer index. Arrays are zero based. Note that arrays can be sparse. For instance, a Store of "X" at index 2 in an empty array produces an array with the values [ NA, NA, "X" ]. Future attempts to fetch elements at index 0 or 1 will return values with "exists" set to False.

An array element is represented as an ArrayEntry:

   struct {
     uint32             index;
     DataValue          value;
   } ArrayEntry;

The contents of this structure are:

index
   The index of the data element in the array.

value
   The stored data.
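The sparse-array semantics above are illustrated by the following non-normative Python sketch; the SparseArray class is purely illustrative and models only the exists/value behavior described in this section.

   # Non-normative sketch: sparse-array semantics of the array data model.
   class SparseArray:
       def __init__(self):
           self.entries = {}                 # index -> value

       def store(self, index, value):
           self.entries[index] = value

       def fetch(self, index):
           if index in self.entries:
               return {"exists": True, "value": self.entries[index]}
           return {"exists": False, "value": b""}

   a = SparseArray()
   a.store(2, b"X")
   assert a.fetch(0)["exists"] is False      # NA: never stored
   assert a.fetch(2) == {"exists": True, "value": b"X"}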
7.2.3. Dictionary

A dictionary is a set of opaque values indexed by an opaque key, with one value for each key. A single dictionary entry is represented as a DictionaryEntry:

   typedef opaque       DictionaryKey<0..2^16-1>;

   struct {
     DictionaryKey      key;
     DataValue          value;
   } DictionaryEntry;

The contents of this structure are:

key
   The dictionary key for this value.

value
   The stored data.