3. Operation
3.1. Overview
    +--------+                             +--------+
    | Peer A |        S E S S I O N        | Peer B |
    |        /=============================\        |
    |       ||            Flows            ||       |
    |       ||---------------------------->||       |
    |       ||---------------------------->||       |
    |       ||<----------------------------||       |
    |       ||<----------------------------||       |
    |       ||<----------------------------||       |
    |        \=============================/        |
    |        |                             |        |
    |        |                             +--------+
    |        |
    |        |                             +--------+
    |        |        S E S S I O N        | Peer C |
    |        /=============================\        |
    |       ||            Flows            ||       |
    |       ||---------------------------->||       |
    |       ||<----------------------------||       |
    |       ||<----------------------------||       |
    |        \=============================/        |
    |        |                             |        |
    +--------+                             +--------+

Figure 7: Sessions between Pairs of Communicating Endpoints

Between any pair of communicating endpoints is a single, bidirectional, secured, congestion controlled session. Unidirectional flows convey messages from one end to the other within the session.

An endpoint initiates a session to a far end when communication is desired. An initiator begins with one or more candidate destination socket addresses, and it may learn and try more candidate addresses during startup handshaking. Eventually, a first suitable response is received, and that endpoint is selected. Startup proceeds to the selected endpoint. In the case of session startup glare, one endpoint is the prevailing initiator and the other assumes the role of responder. Encryption keys and session identifiers are negotiated between the endpoints, and the session is established.

Each endpoint may begin sending message flows to the other end. For each flow, the far end may accept it and deliver its messages to the user, or it may reject the flow and transmit an exception to the sender. The flow receiver may close and reject a flow at a later time, after first accepting it. The flow receiver acknowledges all data sent to it, regardless of whether the flow was accepted. Acknowledgements drive a congestion control mechanism.

An endpoint may have concurrent sessions with other far endpoints. The multiple sessions are distinguished by a session identifier rather than by socket address. This allows an endpoint's address to change mid-session without having to tear down and re-establish a session. The existing cryptographic state for a session can be used to verify a change of address while protecting against session hijacking or denial of service.

A sender may indicate to a receiver that some user messages are of a time critical or real-time nature. A receiver may indicate to senders on concurrent sessions that it is receiving time critical messages from another endpoint. The other senders SHOULD modify their congestion control parameters to yield capacity to the session carrying time critical messages.

A sender may close a flow. The flow is completed when the receiver has no outstanding gaps before the final fragment of the flow. The sender and receiver reserve a completed flow's identifier for a time to allow in-flight messages to drain from the network.

Eventually, neither end will have any flows open to the other. The session will be idle and quiescent. Either end may reliably close the session to recover its resources.

In certain circumstances, an endpoint may be ceasing operation and not have time to wait for acknowledgement of a reliable session close. In this case, the halting endpoint may send an abrupt session close to advise the far end that it is halting immediately.

3.2. Endpoint Identity
Each RTMFP endpoint has an identity. The identity is encoded in a certificate. This specification doesn't mandate any particular certificate format, cryptographic algorithms, or cryptographic properties for certificates. An endpoint is named by an Endpoint Discriminator. This specification doesn't mandate any particular format for Endpoint Discriminators. An Endpoint Discriminator MAY select more than one identity and MAY match more than one distinct certificate.
Multiple distinct Endpoint Discriminators MAY match one certificate.

It is RECOMMENDED that multiple endpoints not have the same identity. Entities with the same identity are indistinguishable during session startup; this situation could be undesirable in some applications.

An endpoint MAY have more than one address.

The Cryptography Profile implements the following functions for identities, certificates, and Endpoint Discriminators, whose operation MUST be deterministic:

   o  Test whether a given certificate is authentic. Authenticity can comprise verifying an issuer signature chain in a public key infrastructure.

   o  Test whether a given Endpoint Discriminator selects a given certificate.

   o  Test whether a given Endpoint Discriminator selects the local endpoint.

   o  Generate a Canonical Endpoint Discriminator for a given certificate. Canonical Endpoint Discriminators for distinct identities SHOULD be distinct. If two distinct identities have the same Canonical Endpoint Discriminator, an initiator might abort a new opening session to the second identity (Section 3.5.1.1.1); this behavior might not be desirable.

   o  Given a certificate, a message, and a digital signature over the message, test whether the signature is valid and generated by the owner of the certificate.

   o  Generate a digital signature for a given message corresponding to the near identity.

   o  Given the near identity and a far certificate, determine which one shall prevail as Initiator and which shall assume the Responder role in the case of startup glare. The far end MUST arrive at the same conclusion. A comparison function can comprise performing a lexicographic ordering of the binary certificates, declaring the far identity the prevailing endpoint if the far certificate is ordered before the near certificate, and otherwise declaring the near identity to be the prevailing endpoint.

   o  Given a first certificate and a second certificate, test whether a new incoming session from the second shall override an existing session with the first. It is RECOMMENDED that the test comprise testing whether the certificates are bitwise identical.

All other semantics for certificates and Endpoint Discriminators are determined by the Cryptography Profile and the application.

3.3. Packet Multiplex
An RTMFP typically has one or more interfaces through which it communicates with other RTMFP endpoints. RTMFP can communicate with multiple distinct other RTMFP endpoints through each local interface. Session multiplexing over a shared interface can facilitate peer-to-peer communications through a NAT, by enabling third-party endpoints such as Forwarders (Section 3.5.1.5) and Redirectors (Section 3.5.1.4) to observe the translated public address and inform peers of the translation.

An interface is typically a UDP socket (Section 2.2.1) but MAY be any suitable datagram transport service where endpoints can be addressed by IPv4 or IPv6 socket addresses.

RTMFP uses a session ID to multiplex and demultiplex communications with distinct endpoints (Section 2.2.2), in addition to the endpoint socket address. This allows an RTMFP to detect a far-end address change (as might happen, for example, in mobile and wireless scenarios) and allows communication sessions to survive address changes. This also allows an RTMFP to act as a Forwarder or Redirector for an endpoint with which it has an active session, by distinguishing startup packets from those of the active session.

On receiving a packet, an RTMFP decodes the session ID to look up the corresponding session information context and decryption key. Session ID 0 is reserved for session startup and MUST NOT be used for an active session. A packet for Session ID 0 uses the Default Session Key as defined by the Cryptography Profile.

3.4. Packet Fragmentation
When an RTMFP packet (Section 2.2.4) is unavoidably larger than the path MTU (such as a startup packet containing an RHello (Section 2.3.4) or IIKeying (Section 2.3.7) chunk with a large certificate), it can be fragmented into segments that do not exceed the path MTU by using the Packet Fragment chunk (Section 2.3.1).
The packet fragmentation mechanism SHOULD be used only to segment unavoidably large packets. Accordingly, this mechanism SHOULD be employed only during session startup with Session ID 0. This mechanism MUST NOT be used instead of the natural fragmentation mechanism of the User Data (Section 2.3.11) and Next User Data (Section 2.3.12) chunks for dividing the messages of the user's data flows into segments that do not exceed the path MTU.

A fragmented plain RTMFP packet is reassembled by concatenating the packetFragment fields of the fragments for the packet in contiguous ascending order, starting from index 0 through and including the final fragment.

When reassembling packets for Session ID 0, a receiver SHOULD identify the packets by the socket address from which the packet containing the fragment was received, as well as the indicated packetID.

A receiver SHOULD allow up to 60 seconds to completely receive a fragmented packet for which progress is being made. A packet is progressing if at least one new fragment for it was received in the last second.

A receiver MUST discard a Packet Fragment chunk having an empty packetFragment field.

The mode of each packet containing Packet Fragments for the same fragmented packet MUST match the mode of the fragmented packet. A receiver MUST discard any new Packet Fragment chunk received in a packet with a mode different from the mode of the packet containing the first received fragment. A receiver MUST discard any reassembled packet with a mode different than the packets containing its fragments.

In order to avoid jamming the network, the sender MUST rate limit packet transmission. In the absence of specific path capacity information (for instance, during session startup), a sender SHOULD NOT send more than 4380 bytes nor more than four packets per distinct endpoint every 200 ms.

To avoid resource exhaustion, a receiver SHOULD limit the number of concurrent packet reassembly buffers and the size of each buffer.
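
The receiver-side reassembly rules above can be illustrated with a minimal Python sketch. It is not part of the protocol definition: the class and parameter names (Reassembler, add_fragment, more_fragments, max_buffers) are hypothetical, the wire layout of the Packet Fragment chunk (Section 2.3.1) is not parsed here, and dropping a buffer as soon as it has gone one second without a new fragment is only one possible reading of the timing guidance. The check that a reassembled packet's own mode matches its fragments is left out because it requires parsing the reassembled packet.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_BUFFER_BYTES = 65536        # per-buffer limit suggested in the text
MAX_REASSEMBLY_SECONDS = 60.0   # total time allowed while progress is made
PROGRESS_WINDOW_SECONDS = 1.0   # "progressing" = new fragment in the last second


@dataclass
class _Buffer:
    mode: int                    # mode of the packet carrying the first fragment
    started: float
    last_new: float
    fragments: dict = field(default_factory=dict)   # fragment index -> bytes
    final_index: Optional[int] = None


class Reassembler:
    """Hypothetical reassembly table keyed by (source address, packetID)."""

    def __init__(self, max_buffers=8):
        self.max_buffers = max_buffers   # assumed limit for a small client
        self.buffers = {}

    def add_fragment(self, src_addr, packet_id, index, more_fragments,
                     mode, data, now):
        """Return the reassembled packet bytes once complete, else None."""
        if not data:
            return None                  # empty packetFragment: discard
        key = (src_addr, packet_id)
        buf = self.buffers.get(key)
        if buf is None:
            if len(self.buffers) >= self.max_buffers:
                return None              # too many concurrent buffers
            buf = self.buffers[key] = _Buffer(mode, now, now)
        elif mode != buf.mode:
            return None                  # mode differs from first fragment
        if index in buf.fragments:
            return None                  # duplicate; does not count as progress
        size = sum(len(f) for f in buf.fragments.values()) + len(data)
        if size > MAX_BUFFER_BYTES:
            del self.buffers[key]        # over the per-buffer size limit
            return None
        buf.fragments[index] = data
        buf.last_new = now
        if not more_fragments:
            buf.final_index = index
        last = buf.final_index
        if last is not None and all(i in buf.fragments for i in range(last + 1)):
            del self.buffers[key]
            # Concatenate in contiguous ascending order from index 0.
            return b"".join(buf.fragments[i] for i in range(last + 1))
        return None

    def expire(self, now):
        """Drop buffers that have stalled or exceeded the total allowance."""
        for key, buf in list(self.buffers.items()):
            stalled = now - buf.last_new > PROGRESS_WINDOW_SECONDS
            too_old = now - buf.started > MAX_REASSEMBLY_SECONDS
            if stalled or too_old:
                del self.buffers[key]
```

A caller would feed each received Packet Fragment chunk to add_fragment() and call expire() periodically; a busy server would likely choose a much larger max_buffers than the small value assumed here.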
Limits can depend, for example, on the expected size of reassembled packets, on the rate at which fragmented packets are expected to be received, on the expected degree of interleaving, and on the expected function of the receiver. Limits can depend on the available resources of the receiver. There can be different limits for packets with Session ID 0 and packets for established sessions. For example, a busy server might need to allow for several hundred concurrent packet reassembly buffers to accommodate hundreds of connection requests per second with potentially interleaved fragments, but a client device with constrained resources could allow just a few reassembly buffers. In the absence of specific information regarding the expected size of reassembled packets, a receiver should set the limit for each packet reassembly buffer to 65536 bytes.

3.5. Sessions
A session is the protocol relationship between a pair of communicating endpoints, comprising the shared and endpoint-specific information context necessary to carry out the communication. The session context at each end includes at least:

   o  TS_RX: the last timestamp received from the far end;

   o  TS_RX_TIME: the time at which TS_RX was first observed to be different than its previous value;

   o  TS_ECHO_TX: the last timestamp echo sent to the far end;

   o  MRTO: the measured retransmission timeout;

   o  ERTO: the effective retransmission timeout;

   o  Cryptographic keys for encrypting and decrypting packets, and for verifying the validity of packets, according to the Cryptography Profile;

   o  Cryptographic near and far nonces according to the Cryptography Profile, where the near nonce is the far end's far nonce, and vice versa;

   o  The certificate of the far end;

   o  The receive session identifier, used by the far end when sending packets to this end;

   o  The send session identifier to use when sending packets to the far end;

   o  DESTADDR: the destination socket address to use when sending packets to the far end;

   o  The set of all sending flow contexts (Section 3.6.2);

   o  The set of all receiving flow contexts (Section 3.6.3);

   o  The transmission budget, which controls the rate at which data is sent into the network (for example, a congestion window);

   o  S_OUTSTANDING_BYTES: the total amount of user message data outstanding, or in flight, in the network -- that is, the sum of the F_OUTSTANDING_BYTES of each sending flow in the session;

   o  RX_DATA_PACKETS: a count of the number of received packets containing at least one User Data chunk since the last acknowledgement was sent, initially 0;

   o  ACK_NOW: a boolean flag indicating whether an acknowledgement should be sent immediately, initially false;

   o  DELACK_ALARM: an alarm to trigger an acknowledgement after a delay, initially unset;

   o  The state, at any time being one of the following values: the opening states S_IHELLO_SENT and S_KEYING_SENT, the open state S_OPEN, the closing states S_NEARCLOSE and S_FARCLOSE_LINGER, and the closed states S_CLOSED and S_OPEN_FAILED; and

   o  The role -- either Initiator or Responder -- of this end of the session.
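
As an illustration only, the session context listed above could be held in a structure like the following Python sketch. The class and field names are hypothetical mappings of the listed items; the cryptographic keys, nonces, and transmission budget are omitted because their form depends on the Cryptography Profile and the congestion control algorithm, and values the list does not give an initial state for are simply left unset (None).

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Optional


class State(Enum):
    S_IHELLO_SENT = "S_IHELLO_SENT"
    S_KEYING_SENT = "S_KEYING_SENT"
    S_OPEN = "S_OPEN"
    S_NEARCLOSE = "S_NEARCLOSE"
    S_FARCLOSE_LINGER = "S_FARCLOSE_LINGER"
    S_CLOSED = "S_CLOSED"
    S_OPEN_FAILED = "S_OPEN_FAILED"


class Role(Enum):
    INITIATOR = "Initiator"
    RESPONDER = "Responder"


@dataclass
class SendFlow:
    """Placeholder for a sending flow context (Section 3.6.2)."""
    f_outstanding_bytes: int = 0


@dataclass
class Session:
    role: Role
    state: State
    dest_addr: tuple                       # DESTADDR as (host, port)
    rx_session_id: int                     # far end uses this to reach us
    tx_session_id: Optional[int] = None    # learned during the handshake
    far_certificate: Optional[bytes] = None
    ts_rx: Optional[int] = None            # TS_RX
    ts_rx_time: Optional[float] = None     # TS_RX_TIME
    ts_echo_tx: Optional[int] = None       # TS_ECHO_TX
    mrto: Optional[float] = None           # MRTO
    erto: Optional[float] = None           # ERTO
    send_flows: list = field(default_factory=list)
    recv_flows: list = field(default_factory=list)
    rx_data_packets: int = 0               # RX_DATA_PACKETS, initially 0
    ack_now: bool = False                  # ACK_NOW, initially false
    delack_alarm: Optional[float] = None   # DELACK_ALARM, initially unset

    @property
    def s_outstanding_bytes(self) -> int:
        """S_OUTSTANDING_BYTES: sum of F_OUTSTANDING_BYTES over sending flows."""
        return sum(f.f_outstanding_bytes for f in self.send_flows)
```

Deriving S_OUTSTANDING_BYTES as a property keeps it consistent with the per-flow counters by construction; an implementation could equally maintain it as a running total.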
Note: The following diagram is only a summary of state transitions and their causing events, and is not a complete operational specification.

    rcv IIKeying Glare
      far prevails +-------------+ ultimate open timeout
    +--------------|S_IHELLO_SENT|-------------+
    |              +-------------+             |
    |                     |rcv RHello          |
    |                     |                    v
    |                     v             +-------------+
    |<-----------(duplicate session?)   |S_OPEN_FAILED|
    |       yes           |no           +-------------+
    |                     |                    ^
    | rcv IIKeying Glare  v                    |
    | far prevails +-------------+             |
    |<-------------|S_KEYING_SENT|-------------+
    |              +-------------+ ultimate open timeout
    |                     |rcv RIKeying
    |                     |
    |       rcv           v
    |   +-+ IIKeying  +--------+ rcv Close Request
    |   |X|---------->| S_OPEN |--------------------+
    |   +-+           +--------+                    |
    |                   |    |ABRUPT CLOSE          |
    |      ORDERLY CLOSE|    |or rcv Close Ack      |
    |                   |    |or rcv IIKeying       |
    |                   |    |   session override   |
    |                   |    +-------+              |
    |                   v            |              v
    |             +-----------+      |     +-----------------+
    |             |S_NEARCLOSE|      |     |S_FARCLOSE_LINGER|
    |             +-----------+      |     +-----------------+
    |      rcv Close Ack|            |              |rcv Close Ack
    |      or 90 seconds|            v              |or 19 seconds
    |                   |       +--------+          |
    |                   +------>|S_CLOSED|<---------+
    +-------------------------->|        |
                                +--------+

Figure 8: Session State Diagram
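
The transitions summarized in Figure 8 can also be written as a lookup table. The Python sketch below is one reading of the diagram, not an operational specification: the event names are hypothetical labels for the causes shown in the figure, "X" stands for "no session yet" as in the figure, and leaving the state unchanged for unlisted (state, event) pairs is an assumption.

```python
# (current state, event) -> next state, as read from Figure 8.
TRANSITIONS = {
    ("S_IHELLO_SENT", "rcv_rhello"): "S_KEYING_SENT",
    ("S_IHELLO_SENT", "rcv_rhello_duplicate_session"): "S_CLOSED",
    ("S_IHELLO_SENT", "rcv_iikeying_glare_far_prevails"): "S_CLOSED",
    ("S_IHELLO_SENT", "ultimate_open_timeout"): "S_OPEN_FAILED",
    ("S_KEYING_SENT", "rcv_rikeying"): "S_OPEN",
    ("S_KEYING_SENT", "rcv_iikeying_glare_far_prevails"): "S_CLOSED",
    ("S_KEYING_SENT", "ultimate_open_timeout"): "S_OPEN_FAILED",
    ("X", "rcv_iikeying"): "S_OPEN",
    ("S_OPEN", "orderly_close"): "S_NEARCLOSE",
    ("S_OPEN", "rcv_close_request"): "S_FARCLOSE_LINGER",
    ("S_OPEN", "abrupt_close"): "S_CLOSED",
    ("S_OPEN", "rcv_close_ack"): "S_CLOSED",
    ("S_OPEN", "rcv_iikeying_session_override"): "S_CLOSED",
    ("S_NEARCLOSE", "rcv_close_ack"): "S_CLOSED",
    ("S_NEARCLOSE", "timeout_90s"): "S_CLOSED",
    ("S_FARCLOSE_LINGER", "rcv_close_ack"): "S_CLOSED",
    ("S_FARCLOSE_LINGER", "timeout_19s"): "S_CLOSED",
}


def next_state(state: str, event: str) -> str:
    """Return the next state; unlisted pairs leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def run(state: str, events) -> str:
    """Apply a sequence of events starting from the given state."""
    for event in events:
        state = next_state(state, event)
    return state
```

For example, run("S_IHELLO_SENT", ["rcv_rhello", "rcv_rikeying"]) walks the initiator's side of a normal handshake to S_OPEN.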
3.5.1. Startup
3.5.1.1. Normal Handshake
RTMFP sessions are established with a 4-way handshake in two round trips. The initiator begins by sending an IHello to one or more candidate addresses for the desired destination endpoint. A responder statelessly sends an RHello in response. The first correct RHello received at the initiator is selected; all others are ignored. The initiator computes its half of the session keying and sends an IIKeying. The responder receives the IIKeying and, if it is acceptable, computes its half of the session keying, at which point it can also compute the shared session keying and session nonces. The responder creates a new S_OPEN session with the initiator and sends an RIKeying. The initiator receives the RIKeying and, if it is acceptable, computes the shared session keying and session nonces. The initiator's session is now S_OPEN.

              Initiator                        Responder
                  |             IHello             |
                  |(EPD,Tag)                       |
  S_IHELLO_SENT   |(SID=0)                         |
                  |------------------------------->|
                  |                                |
                  |             RHello             |
                  |              (Tag,Cookie,RCert)|
                  |                         (SID=0)|
                  |<-------------------------------|
  S_KEYING_SENT   |                                |
                  |            IIKeying            |
                  |(ISID,Cookie,ICert,SKIC,ISig)   |
                  |(SID=0)                         |
                  |------------------------------->|
                  |                                |
                  |            RIKeying            |
                  |                (RSID,SKRC,RSig)|
                  |          (SID=ISID,Key=Default)|
      S_OPEN      |<-------------------------------|   S_OPEN
                  |                                |
                  |         S E S S I O N          |
                  |<-------------------(SID=ISID)--|
                  |--(SID=RSID)------------------->|

Figure 9: Normal Handshake

In the following sections, the handshake is detailed from the perspectives of the initiator and responder.
3.5.1.1.1. Initiator
The initiator determines that a session is needed for an Endpoint Discriminator. The initiator creates state for a new opening session and begins with a candidate endpoint address set containing at least one address. The new session is placed in the S_IHELLO_SENT state.

If the session does not move to the S_OPEN state before an ultimate open timeout, the session has failed and moves to the S_OPEN_FAILED state. The RECOMMENDED ultimate open timeout is 95 seconds.

The initiator chooses a new, unique tag not used by any currently opening session. It is RECOMMENDED that the tag be cryptographically pseudorandom and be at least 8 bytes in length, so that it is hard to guess. The initiator constructs an IHello chunk (Section 2.3.2) with the Endpoint Discriminator and the tag.

While the initiator is in the S_IHELLO_SENT state, it sends the IHello to each candidate endpoint address in the set, on a backoff schedule. The backoff SHOULD NOT be less than multiplicative, with not less than 1.5 seconds added to the interval between each attempt. The backoff SHOULD be scheduled separately for each candidate address, since new candidates can be added over time.

If the initiator receives a Redirect chunk (Section 2.3.5) with a tag echo matching this session, AND this session is in the S_IHELLO_SENT state, then for each redirect destination indicated in the Redirect: if the candidate endpoint address set contains fewer than REDIRECT_THRESHOLD addresses, add the indicated redirect destination to the candidate endpoint address set. REDIRECT_THRESHOLD SHOULD NOT be more than 24.

If the initiator receives an RHello chunk (Section 2.3.4) with a tag echo matching this session, AND this session is in the S_IHELLO_SENT state, AND the responder certificate matches the desired Endpoint Discriminator, AND the certificate is authentic according to the Cryptography Profile, then:

   1. If the Canonical Endpoint Discriminator for the responder certificate matches the Canonical Endpoint Discriminator of another existing session in the S_KEYING_SENT or S_OPEN states, AND the certificate of the other opening session matches the desired Endpoint Discriminator, then this session is a duplicate and SHOULD be aborted in favor of the other existing session; otherwise,

   2. Move to the S_KEYING_SENT state. Set DESTADDR, the far-end address for the session, to the address from which this RHello was received. The initiator chooses a new, unique receive session ID, not used by any other session, for the responder to use when sending packets to the initiator. It computes a Session Key Initiator Component appropriate to the responder's certificate according to the Cryptography Profile. Using this data and the cookie from the RHello, the initiator constructs and signs an IIKeying chunk (Section 2.3.7).

While the initiator is in the S_KEYING_SENT state, it sends the IIKeying to DESTADDR on a backoff schedule. The backoff SHOULD NOT be less than multiplicative, with not less than 1.5 seconds added to the interval between each attempt.

If the initiator receives an RIKeying chunk (Section 2.3.8) in a packet with this session's receive session identifier, AND this session is in the S_KEYING_SENT state, AND the signature in the chunk is authentic according to the far end's certificate (from the RHello), AND the Session Key Responder Component successfully combines with the Session Key Initiator Component and the near and far certificates to form the shared session keys and nonces according to the Cryptography Profile, then the session has opened successfully. The session moves to the S_OPEN state. The send session identifier is set from the RIKeying. Packet encryption, decryption, and verification now use the newly computed shared session keys, and the session nonces are available for application-layer cryptographic challenges.

3.5.1.1.2. Responder
On receipt of an IHello chunk (Section 2.3.2) with an Endpoint Discriminator that selects its identity, an endpoint SHOULD construct an RHello chunk (Section 2.3.4) and send it to the address from which the IHello was received. To avoid a potential resource exhaustion denial of service, the endpoint SHOULD NOT create any persistent state associated with the IHello. The endpoint MUST generate the cookie for the RHello in such a way that it can be recognized as authentic and valid when echoed in an IIKeying. The endpoint SHOULD use the address from which the IHello was received as part of the cookie generation formula. Cookies SHOULD be valid only for a limited time; that lifetime SHOULD NOT be less than 95 seconds (the recommended ultimate session open timeout).
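
One way to meet these cookie requirements without keeping per-IHello state is to derive the cookie from a local secret, a timestamp, and the sender's address. The Python sketch below is a hypothetical construction, not a format defined by this specification: the function names, the two-part layout, the truncated HMAC-SHA-256 tags, and the string status values are all assumptions. The first part is independent of the address and the second depends on it, so that a cookie the endpoint did generate can be told apart from a forged one even when it is echoed from a different address.

```python
import hashlib
import hmac
import os
import struct

COOKIE_LIFETIME = 95.0       # seconds; the text says not less than 95
TAG_LEN = 16                 # assumed truncation length for each part
SECRET = os.urandom(32)      # local secret, never sent on the wire


def _mac(*parts: bytes) -> bytes:
    """Truncated HMAC-SHA-256 over the joined parts."""
    return hmac.new(SECRET, b"|".join(parts), hashlib.sha256).digest()[:TAG_LEN]


def _addr_bytes(addr) -> bytes:
    return f"{addr[0]}:{addr[1]}".encode()


def make_cookie(addr, now: float) -> bytes:
    """Build timestamp || address-independent part || address-dependent part."""
    ts = struct.pack("!Q", int(now))
    return ts + _mac(b"ts", ts) + _mac(b"addr", ts, _addr_bytes(addr))


def check_cookie(cookie: bytes, addr, now: float) -> str:
    """Return "ok", "address_mismatch", or "invalid" for an echoed cookie."""
    if len(cookie) != 8 + 2 * TAG_LEN:
        return "invalid"
    ts = cookie[:8]
    part1 = cookie[8:8 + TAG_LEN]
    part2 = cookie[8 + TAG_LEN:]
    if not hmac.compare_digest(part1, _mac(b"ts", ts)):
        return "invalid"                 # not generated by this endpoint
    age = now - struct.unpack("!Q", ts)[0]
    if age < 0 or age > COOKIE_LIFETIME:
        return "invalid"                 # expired (or timestamp in the future)
    if not hmac.compare_digest(part2, _mac(b"addr", ts, _addr_bytes(addr))):
        return "address_mismatch"        # authentic, but for another address
    return "ok"
```

Because the cookie carries its own timestamp and is verified by recomputing the tags, the responder needs only the secret to validate it when it comes back in an IIKeying.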
On receipt of an FIHello chunk (Section 2.3.3) from a Forwarder (Section 3.5.1.5) where the Endpoint Discriminator selects its identity, an endpoint SHOULD do one of the following:

   1. Compute, construct, and send an RHello as though the FIHello was an IHello received from the indicated reply address; or

   2. Construct and send an Implied Redirect (Section 2.3.5) to the FIHello's reply address; or

   3. Ignore this FIHello.

On receipt of an IIKeying chunk (Section 2.3.7), if the cookie is not authentic or if it has expired, ignore this IIKeying; otherwise,

On receipt of an IIKeying chunk, if the cookie appears authentic but does not match the address from which the IIKeying's packet was received, perform the special processing at Cookie Change (Section 3.5.1.2); otherwise,

On receipt of an IIKeying with an authentic and valid cookie, if the certificate is authentic according to the Cryptography Profile, AND the signature in the chunk is authentic according to the far end's certificate and the Cryptography Profile, AND the Session Key Initiator Component is acceptable, then:

   1. If the address from which this IIKeying was received corresponds to an opening session in the S_IHELLO_SENT or S_KEYING_SENT state, perform the special processing at Glare (Section 3.5.1.3); otherwise,

   2. If the address from which this IIKeying was received corresponds to a session in the S_OPEN state, then:

      1. If the receiver was the Responder for the S_OPEN session and the session identifier, certificate, and Session Key Initiator Component are identical to those of the S_OPEN session, this IIKeying is a retransmission, so resend the S_OPEN session's RIKeying using the Default Session Key as specified below; otherwise,

      2. If the certificate from this IIKeying does not override the certificate of the S_OPEN session, ignore this IIKeying; otherwise,

      3. The certificate from this IIKeying overrides the certificate of the S_OPEN session; this is a new opening session from the same identity, and the existing S_OPEN session is stale. Move the existing S_OPEN session to S_CLOSED and abort all of its flows (signaling exceptions to the user), then continue processing this IIKeying.

      Otherwise,

   3. Compute a Session Key Responder Component and choose a new, unique receive session ID not used by any other session for the initiator to use when sending packets to the responder. Using this data, construct and, with the Session Key Initiator Component, sign an RIKeying chunk (Section 2.3.8). Using the Session Key Initiator and Responder Components and the near and far certificates, the responder combines and computes the shared session keys and nonces according to the Cryptography Profile. The responder creates a new session in the S_OPEN state, with the far-endpoint address DESTADDR taken from the source address of the packet containing the IIKeying and the send session identifier taken from the IIKeying. The responder sends the RIKeying to the initiator using the Default Session Key and the requested send session identifier. Packet encryption, decryption, and verification of all future packets for this session use the newly computed keys, and the session nonces are available for application-layer cryptographic challenges.

3.5.1.2. Cookie Change
In some circumstances, the responder may generate an RHello cookie for an initiator's address that isn't the address the initiator would use when sending packets directly to the responder. This can happen, for example, when the initiator has multiple local addresses and uses one address to reach a Forwarder (Section 3.5.1.5) but another to reach the responder.
Consider the following example:

Initiator                        Forwarder                         Responder
    |             IHello             |                                 |
    |(Src=Ix)                        |                                 |
    |------------------------------->|                                 |
    |                                |             FIHello             |
    |                                |(RA=Ix)                          |
    |                                |-------------------------------->|
    |                                |                                 |
    |                              RHello                              |
    |                                                       (Cookie:Ix)|
    |<-----------------------------------------------------------------|
    |                                                                  |
    |                             IIKeying                             |
    |(Cookie:Ix,Src=Iy)                                                |
    |----------------------------------------------------------------->|
    |                                                                  |
    |                       RHello Cookie Change                       |
    |                                             (Cookie:Ix,Cookie:Iy)|
    |<-----------------------------------------------------------------|
    |                                                                  |
    |                             IIKeying                             |
    |(Cookie:Iy)                                                       |
    |----------------------------------------------------------------->|
    |                                                                  |
    |                             RIKeying                             |
    |<-----------------------------------------------------------------|
    |                                                                  |
    |<======================== S E S S I O N =========================>|

Figure 10: Handshake with Cookie Change

The initiator has two network interfaces: a first preferred interface with address Ix = 192.0.2.100:50000, and a second with address Iy = 198.51.100.101:50001. The responder has one interface with address Ry = 198.51.100.200:51000, on the same network as the initiator's second interface.

The initiator uses its first interface to reach a Forwarder. The Forwarder observes the initiator's address of Ix and sends a Forwarded IHello (Section 2.3.3) to the responder. The responder treats this as if it were an IHello from Ix, calculates a corresponding cookie, and sends an RHello to Ix.

The initiator receives this RHello from Ry and selects that address as the destination for the session. It then sends an IIKeying, copying the cookie from the RHello. However, since the source of the RHello is Ry, on a network to which the initiator is directly connected, the initiator uses its second interface Iy to send the IIKeying. The responder, on receiving the IIKeying, will compare the cookie to the expected value based on the source address of the packet, and since the IIKeying source doesn't match the IHello source used to generate the cookie, the responder will reject the IIKeying.

If the responder determines that it generated the cookie in the IIKeying but the cookie doesn't match the sender's address (for example, if the cookie is in two parts, with a first part generated independently of the initiator's address and a second part dependent on the address), the responder SHOULD generate a new cookie based on the address from which the IIKeying was received and send an RHello Cookie Change chunk (Section 2.3.6) to the source of the IIKeying, using the session ID from the IIKeying and the Default Session Key.

If the initiator receives an RHello Cookie Change chunk for a session in the S_KEYING_SENT state, AND the old cookie matches the one originally sent to the responder, then the initiator adopts the new cookie, constructs and signs a new IIKeying chunk, and sends the new IIKeying to the responder. The initiator SHOULD NOT change the cookie for a session more than once.

3.5.1.3. Glare
Glare occurs when two endpoints attempt to initiate sessions to each other concurrently. Glare is detected by receipt of a valid and authentic IIKeying from an endpoint address that is a destination for an opening session. Only one session is allowed between a pair of endpoints.

Glare is resolved by comparing the certificate in the received IIKeying with the near end's certificate. The Cryptography Profile defines a certificate comparison function to determine the prevailing endpoint when there is glare.

If the near end prevails, discard and ignore the received IIKeying. The far end will abort its opening session on receipt of IIKeying from the near end.

Otherwise, the far end prevails:

   1. If the certificate in the IIKeying overrides the certificate associated with the near opening session according to the Cryptography Profile, then abort and destroy the near opening session. Then,

   2. Continue with normal Responder IIKeying processing (Section 3.5.1.1.2).
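
The glare decision can be sketched in Python under the example choices mentioned in Section 3.2: lexicographic ordering of the binary certificates to pick the prevailing endpoint, and bitwise identity as the override test. An actual Cryptography Profile may define both functions differently. The function names and the returned action strings are hypothetical, and treating a near opening session with no certificate yet (None) as "not overridden" is an assumption.

```python
from typing import List, Optional


def far_prevails(near_cert: bytes, far_cert: bytes) -> bool:
    """Far end prevails if its certificate orders before the near one.

    Python compares bytes objects lexicographically by unsigned byte value,
    so both ends reach the same conclusion for distinct certificates.
    """
    return far_cert < near_cert


def overrides(existing_cert: bytes, incoming_cert: bytes) -> bool:
    """Example override test: the certificates are bitwise identical."""
    return existing_cert == incoming_cert


def resolve_glare(near_cert: bytes,
                  iikeying_cert: bytes,
                  opening_far_cert: Optional[bytes]) -> List[str]:
    """Return the actions to take on a glare-triggering IIKeying."""
    if not far_prevails(near_cert, iikeying_cert):
        # Near end prevails: discard and ignore the received IIKeying.
        return ["ignore_iikeying"]
    actions = []
    if opening_far_cert is not None and overrides(opening_far_cert, iikeying_cert):
        actions.append("abort_near_opening_session")
    # Then continue with normal Responder IIKeying processing.
    actions.append("process_as_responder")
    return actions
```

With certificates a and b where a orders before b, the endpoint holding b sees the far end prevail and proceeds as responder, while the endpoint holding a ignores the IIKeying it receives; the two ends therefore agree on which session survives.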
3.5.1.4. Redirector
    +-----------+           +------------+          +-----------+
    | Initiator |---------->| Redirector |          | Responder |
    |           |<----------|            |          |           |
    |           |           +------------+          |           |
    |           |<=================================>|           |
    +-----------+                                   +-----------+

Figure 11: Redirector

A Redirector acts like a name server for Endpoint Discriminators. An initiator MAY use a Redirector to discover additional candidate endpoint addresses for a desired endpoint.

On receipt of an IHello chunk with an Endpoint Discriminator that does not select the Redirector's identity, the Redirector constructs and sends back to the initiator a Responder Redirect chunk (Section 2.3.5) containing one or more additional candidate addresses for the indicated endpoint.

Initiator                        Redirector                        Responder
    |             IHello             |                                 |
    |------------------------------->|                                 |
    |                                |                                 |
    |            Redirect            |                                 |
    |<-------------------------------|                                 |
    |                                |                                 |
    |                              IHello                              |
    |----------------------------------------------------------------->|
    |                                                                  |
    |                              RHello                              |
    |<-----------------------------------------------------------------|
    |                                                                  |
    |                             IIKeying                             |
    |----------------------------------------------------------------->|
    |                                                                  |
    |                             RIKeying                             |
    |<-----------------------------------------------------------------|
    |                                                                  |
    |<======================== S E S S I O N =========================>|

Figure 12: Handshake Using a Redirector
Deployment Design Note: Redirectors SHOULD NOT initiate new sessions to endpoints that might use the Redirector's address as a candidate for another endpoint, since the far end might interpret the Redirector's IIKeying as glare for the far end's initiation to the other endpoint.

3.5.1.5. Forwarder
    +-----------+     +-----------+     +---+     +-----------+
    | Initiator |---->| Forwarder |<===>| N |<===>| Responder |
    |           |     +-----------+     | A |     |           |
    |           |<=====================>| T |<===>|           |
    +-----------+                       +---+     +-----------+

Figure 13: Forwarder

A responder might be behind a NAT or firewall that doesn't allow inbound packets to reach the endpoint until it first sends an outbound packet for a particular far-endpoint address.

A Forwarder's endpoint address MAY be a candidate address for another endpoint. A responder MAY use a Forwarder to receive FIHello chunks sent on behalf of an initiator.

On receipt of an IHello chunk with an Endpoint Discriminator that does not select the Forwarder's identity, if the Forwarder has an S_OPEN session with an endpoint whose certificate matches the desired Endpoint Discriminator, the Forwarder constructs and sends an FIHello chunk (Section 2.3.3) to the selected endpoint over the S_OPEN session, using the tag and Endpoint Discriminator from the IHello chunk and the source address of the packet containing the IHello for the corresponding fields of the FIHello.
On receipt of an FIHello chunk, a responder might send an RHello or Implied Redirect to the original source of the IHello (Section 3.5.1.1.2), potentially allowing future packets to flow directly between the initiator and responder through the NAT or firewall.

Initiator                        Forwarder           NAT           Responder
    |             IHello             |                |                |
    |------------------------------->|                |                |
    |                                |    FIHello     |                |
    |                                |--------------->|--------------->|
    |                                |                |                |
    |                                |                |     RHello     |
    |                                |                :<---------------|
    |<------------------------------------------------:                |
    |                                                 :                |
    |                    IIKeying                     :                |
    |-------------------------------------------------:--------------->|
    |                                                 :                |
    |                                                 :    RIKeying    |
    |                                                 :<---------------|
    |<------------------------------------------------:                |
    |                                                 :                |
    |<======================== S E S S I O N ========>:<==============>|

Figure 14: Forwarder Handshake where Responder Sends an RHello
Initiator                        Forwarder           NAT           Responder
    |             IHello             |                |                |
    |------------------------------->|                |                |
    |                                |    FIHello     |                |
    |                                |--------------->|--------------->|
    |                                |                |                |
    |                                |                |    Redirect    |
    |                                |                | (Implied,RD={})|
    |                                |                :<---------------|
    |<------------------------------------------------:                |
    |                                                 :                |
    |                     IHello                      :                |
    |------------------------------------------------>:--------------->|
    |                                                 :                |
    |                                                 :     RHello     |
    |                                                 :<---------------|
    |<------------------------------------------------:                |
    |                                                 :                |
    |                    IIKeying                     :                |
    |------------------------------------------------>:--------------->|
    |                                                 :                |
    |                                                 :    RIKeying    |
    |                                                 :<---------------|
    |<------------------------------------------------:                |
    |                                                 :                |
    |<======================== S E S S I O N ========>:<==============>|

Figure 15: Forwarder Handshake where Responder Sends an Implied Redirect

3.5.1.6. Redirector and Forwarder with NAT
+---+       +---+       +---+      +---+      +---+
| I |       | N |       | I |      | N |      | R |
| n |------>| A |------>| n |      | A |      | e |
| i |       | T |       | t |<====>| T |<====>| s |
| t |<------|   |<------| r |      |   |      | p |
| i |       |   |       | o |      |   |      | o |
| a |       |   |       +---+      |   |      | n |
| t |       |   |                  |   |      | d |
| o |<=====>|   |<================>|   |<====>| e |
| r |       |   |                  |   |      | r |
+---+       +---+                  +---+      +---+

Figure 16: Introduction Service for Initiator and Responder behind NATs
An initiator and responder might each be behind distinct NATs or firewalls that don't allow inbound packets to reach the respective endpoints until each first sends an outbound packet for a particular far-endpoint address. An introduction service comprising Redirector and Forwarder functions may facilitate direct communication between endpoints each behind a NAT. The responder is registered with the introduction service via an S_OPEN session to it. The service observes and records the responder's public NAT address as the DESTADDR of the S_OPEN session. The service MAY record other addresses for the responder, for example addresses that the responder self-reports as being directly attached. The initiator begins with an address of the introduction service as an initial candidate. The Redirector portion of the service sends to the initiator a Responder Redirect containing at least the responder's public NAT address as previously recorded. The Forwarder portion of the service sends to the responder a Forwarded IHello containing the initiator's public NAT address as observed to be the source of the IHello. The responder sends an RHello to the initiator's public NAT address in response to the FIHello. This will allow inbound packets to the responder through its NAT from the initiator's public NAT address. The initiator sends an IHello to the responder's public NAT address in response to the Responder Redirect. This will allow inbound packets to the initiator through its NAT from the responder's public NAT address. With transit paths created in both NATs, normal session startup can proceed.
Initiator     NAT-I    Redirector+Forwarder     NAT-R      Responder
|               |                |                |                |
|  IHello       |                |                |                |
|(Dst=Intro)    |                |                |                |
|-------------->|                |                |                |
|               |--------------->|                |                |
|               |                | FIHello        |                |
|               |                |(RA=NAT-I-Pub)  |                |
|               |                |--------------->|--------------->|
|               |  Redirect      |                |                |
|               | (RD={NAT-R-Pub,|                |                |
|               |           ...})|                |                |
|<--------------|<---------------|                |                |
|               |                |                |       RHello   |
|               |                |                | (Dst=NAT-I-Pub)|
|               |                |                :<---------------|
|               |  (*) <--------------------------:                |
|  IHello       |                                 :                |
|(Dst=NAT-R-Pub)|                                 :                |
|-------------->:                                 :                |
|               :-------------------------------->:--------------->|
|               :                                 :                |
|               :                                 :       RHello   |
|               :                                 :<---------------|
|<--------------:<--------------------------------:                |
|               :                                 :                |
|  IIKeying     :                                 :                |
|-------------->:                                 :                |
|               :-------------------------------->:--------------->|
|               :                                 :                |
|               :                                 :      RIKeying  |
|               :                                 :<---------------|
|<--------------:<--------------------------------:                |
|               :                                 :                |
|<=============>:<======== S E S S I O N ========>:<==============>|

Figure 17: Handshake with Redirector and Forwarder

At the point in Figure 17 marked (*), the responder's RHello from the FIHello might arrive at the initiator's NAT before or after the initiator's IHello is sent outbound to the responder's public NAT address. If it arrives before, it may be dropped by the NAT. If it arrives after, it will transit the NAT and trigger keying without waiting for another round-trip time. The timing of this race depends, among other factors, on the relative distances of the initiator and responder from each other and from the introduction service.
3.5.1.7. Load Distribution and Fault Tolerance
+---+    IHello/RHello    +-------------+
| I |<------------------->| Responder 1 |
| n |                     +-------------+
| i |  SESSION  +-------------+
| t |<=========>| Responder 2 |
| i |           +-------------+
| a |   IHello...                 +----------------+
| t |-------------------------> X | Dead Responder |
| o |                             +----------------+
| r |  IHello/RHello   +-------------+
|   |<---------------->| Responder N |
+---+                  +-------------+

Figure 18: Parallel Open to Multiple Endpoints

As specified in Section 3.2, more than one endpoint is allowed to be selected by one Endpoint Discriminator. This will typically be the case for a set of servers, any of which could accommodate a connecting client.

As specified in Section 3.5.1.1.1, an initiator is allowed to use multiple candidate endpoint addresses when starting a session, and the sender of the first acceptable RHello chunk to be received is selected to complete the session, with later responses ignored. An initiator can start with the multiple candidate endpoint addresses, or it may learn them during startup from one or more Redirectors (Section 3.5.1.4).

Parallel open to multiple endpoints for the same Endpoint Discriminator, combined with selection by earliest RHello, can be used for load distribution and fault tolerance. The cost at each endpoint that is not selected is limited to receiving and processing an IHello, and generating and sending an RHello.

In one circumstance, multiple servers of similar processing and networking capacity may be located in near proximity to each other, such as in a data center. In this circumstance, a less heavily loaded server can respond to an IHello more quickly than more heavily loaded servers and will tend to be selected by a client.

In another circumstance, multiple servers may be located in different physical locations, such as different data centers. In this circumstance, a server that is located nearer (in terms of network distance) to the client can respond earlier than more distant servers and will tend to be selected by the client.
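As an informal illustration of selection by earliest RHello, the following Python sketch picks a responder from a set of candidate addresses. The function name `select_responder` and the representation of arrival times as a mapping are assumptions made for this example only; a real initiator reacts to RHello chunks as they arrive rather than comparing recorded times afterwards.

```python
from typing import Dict, Optional


def select_responder(rhello_arrival: Dict[str, Optional[float]]) -> Optional[str]:
    """Return the candidate whose acceptable RHello arrived first.

    rhello_arrival maps a candidate address to the arrival time (seconds)
    of its first acceptable RHello, or None if it never answered
    (for example, a dead responder).
    """
    answered = {addr: t for addr, t in rhello_arrival.items() if t is not None}
    if not answered:
        return None  # no candidate responded; session startup fails
    # Later responses are ignored: only the earliest one is selected.
    return min(answered, key=answered.get)
```

A lightly loaded or nearby server tends to produce the smallest arrival time and is therefore selected, while a dead responder simply never appears among the answers.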
Multiple servers, in proximity or distant from one another, can form a redundant pool of servers. A client can perform a parallel open to the multiple servers. In normal operation, the multiple servers will all respond, and the client will select one of them as described above. If one of the multiple servers fails, other servers in the pool can still respond to the client, allowing the client to succeed to an S_OPEN session with one of them.

3.5.2. Congestion Control
An RTMFP MUST implement congestion control and avoidance algorithms that are "TCP compatible", in accordance with Internet best current practice [RFC2914]. The algorithms SHOULD NOT be more aggressive in sending data than those described in "TCP Congestion Control" [RFC5681] and MUST NOT be more aggressive in sending data than the "slow start algorithm" described in Section 3.1 of RFC 5681.

An endpoint maintains a transmission budget in the session information context of each S_OPEN session (Section 3.5), controlling the rate at which the endpoint sends data into the network.

For window-based congestion control and avoidance algorithms, the transmission budget is the congestion window, which is the amount of user data that is allowed to be outstanding, or in flight, in the network. Transmission is allowed when S_OUTSTANDING_BYTES (Section 3.5) is less than the congestion window (Section 3.6.2.3). See Appendix A for an experimental window-based congestion control algorithm for real-time and bulk data.

An endpoint avoids sending large bursts of data or packets into the network (Section 3.5.2.3).

A sending endpoint increases and decreases its transmission budget in response to acknowledgements (Section 3.6.2.4) and loss according to the congestion control and avoidance algorithms. Loss is detected by negative acknowledgement (Section 3.6.2.5) and timeout (Section 3.6.2.6).

Timeout is determined by the Effective Retransmission Timeout (ERTO) (Section 3.5.2.2). The ERTO is measured using the Timestamp and Timestamp Echo packet header fields (Section 2.2.4).

A receiving endpoint acknowledges all received data (Section 3.6.3.4) to enable the sender to measure receipt of data, or lack thereof.

A receiving endpoint may be receiving time critical (or real-time) data from a first sender while receiving data from other senders. The receiving endpoint can signal its other senders (Section 2.2.4) to cause them to decrease the aggressiveness of their congestion control and avoidance algorithms, in order to yield network capacity to the time critical data (Section 3.5.2.1).

3.5.2.1. Time Critical Reverse Notification
A sender can increase its transmission budget at a rate compatible with (but not exceeding) the "slow start algorithm" specified in RFC 5681 (with which the transmission rate is doubled every round trip when beginning or restarting transmission, until loss is detected). However, a sender MUST behave as though the slow start threshold SSTHRESH is clamped to 0 (disabling the slow start algorithm's exponential increase behavior) on a session where a Time Critical Reverse Notification (Section 2.2.4) indication has been received from the far end within the last 800 milliseconds, unless the sender is itself currently sending time critical data to the far end.

During each round trip, a sender SHOULD NOT increase the transmission budget by more than 0.5% or by 384 bytes per round trip (whichever is greater) on a session where a Time Critical Reverse Notification indication has been received from the far end within the last 800 milliseconds, unless the sender is itself currently sending time critical data to the far end.

3.5.2.2. Retransmission Timeout
RTMFP uses the ERTO to detect when a user data fragment has been lost in the network. The ERTO is typically calculated in a manner similar to that specified in "Requirements for Internet Hosts - Communication Layers" [RFC1122] and is a function of round-trip time measurements and persistent timeout behavior.

The ERTO SHOULD be at least 250 milliseconds and SHOULD allow for the receiver to delay sending an acknowledgement for up to 200 milliseconds (Section 3.6.3.4.4). The ERTO MUST NOT be less than the round-trip time.

To facilitate round-trip time measurement, an endpoint MUST implement the Timestamp Echo facility:

   o  On a session entering the S_OPEN state, initialize TS_RX_TIME to negative infinity, and initialize TS_RX and TS_ECHO_TX to have no value.

   o  On receipt of a packet in an S_OPEN session with the timestampPresent (Section 2.2.4) flag set, if the timestamp field in the packet is different than TS_RX, set TS_RX to the value of the timestamp field in the packet, and set TS_RX_TIME to the current time.

   o  When sending a packet to the far end in an S_OPEN session:

      1.  Calculate TS_RX_ELAPSED = current time - TS_RX_TIME. If TS_RX_ELAPSED is more than 128 seconds, then set TS_RX and TS_ECHO_TX to have no value, and do not include a timestamp echo; otherwise,

      2.  Calculate TS_RX_ELAPSED_TICKS to be the number of whole 4-millisecond periods in TS_RX_ELAPSED; then

      3.  Calculate TS_ECHO = (TS_RX + TS_RX_ELAPSED_TICKS) MODULO 65536; then

      4.  If TS_ECHO is not equal to TS_ECHO_TX, then set TS_ECHO_TX to TS_ECHO, set the timestampEchoPresent flag, and set the timestampEcho field to TS_ECHO_TX.

The remainder of this section describes an OPTIONAL method for calculating the ERTO. Real-time applications and P2P mesh applications often require knowing the round-trip time and RTT variance. This section additionally describes a method for measuring the round-trip time and RTT variance, and calculating a smoothed round-trip time.

Let the session information context contain additional variables:

   o  TS_TX: the last timestamp sent to the far end, initialized to have no value;

   o  TS_ECHO_RX: the last timestamp echo received from the far end, initialized to have no value;

   o  SRTT: the smoothed round-trip time, initialized to have no value;

   o  RTTVAR: the round-trip time variance, initialized to 0.

Initialize MRTO to 250 milliseconds.

Initialize ERTO to 3 seconds.
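Before the OPTIONAL method continues, the mandatory Timestamp Echo facility specified earlier in this section can be sketched informally in Python as follows. The class name `TimestampEcho`, its method names, and the use of integer milliseconds for the local clock are assumptions of this example and are not part of the specification.

```python
import math

TICK_MS = 4                 # one timestamp tick is 4 milliseconds
ECHO_EXPIRY_MS = 128_000    # stop echoing after 128 seconds


class TimestampEcho:
    """Per-session Timestamp Echo state (hypothetical helper class)."""

    def __init__(self):
        # State on entering S_OPEN.
        self.ts_rx = None              # TS_RX: no value
        self.ts_rx_time = -math.inf    # TS_RX_TIME: negative infinity
        self.ts_echo_tx = None         # TS_ECHO_TX: no value

    def on_receive(self, timestamp, now_ms):
        """Call for a received packet that has timestampPresent set."""
        if timestamp != self.ts_rx:
            self.ts_rx = timestamp
            self.ts_rx_time = now_ms

    def echo_for_send(self, now_ms):
        """Return the timestampEcho to include, or None to omit it."""
        elapsed = now_ms - self.ts_rx_time
        if elapsed > ECHO_EXPIRY_MS:
            self.ts_rx = None
            self.ts_echo_tx = None
            return None
        ticks = int(elapsed // TICK_MS)        # whole 4 ms periods
        echo = (self.ts_rx + ticks) % 65536
        if echo != self.ts_echo_tx:
            self.ts_echo_tx = echo
            return echo                        # set timestampEchoPresent
        return None                            # unchanged: nothing to echo
```

Adding the elapsed ticks to the received timestamp means the far end measures only network round-trip time and not the time the packet's timestamp was held locally before being echoed.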
On sending a packet to the far end of an S_OPEN session, if the current send timestamp is not equal to TS_TX, then set TS_TX to the current send timestamp, set the timestampPresent flag in the packet header, and set the timestamp field to TS_TX.

On receipt of a packet from the far end of an S_OPEN session, if the timestampEchoPresent flag is set in the packet header, AND the timestampEcho field is not equal to TS_ECHO_RX, then:

   1.  Set TS_ECHO_RX to timestampEcho;

   2.  Calculate RTT_TICKS = (current send timestamp - timestampEcho) MODULO 65536;

   3.  If RTT_TICKS is greater than 32767, the measurement is invalid, so discard this measurement; otherwise,

   4.  Calculate RTT = RTT_TICKS * 4 milliseconds;

   5.  If SRTT has a value, then calculate new values of RTTVAR and SRTT:

       1.  RTT_DELTA = | SRTT - RTT |;

       2.  RTTVAR = ((3 * RTTVAR) + RTT_DELTA) / 4;

       3.  SRTT = ((7 * SRTT) + RTT) / 8.

   6.  If SRTT has no value, then set SRTT = RTT and RTTVAR = RTT / 2;

   7.  Set MRTO = SRTT + 4 * RTTVAR + 200 milliseconds;

   8.  Set ERTO to MRTO or 250 milliseconds, whichever is greater.

A retransmission timeout occurs when the most recently transmitted user data fragment has remained outstanding in the network for ERTO. When this timeout occurs, increase ERTO on an exponential backoff with an ultimate backoff cap of 10 seconds:

   1.  Calculate ERTO_BACKOFF = ERTO * 1.4142;

   2.  Calculate ERTO_CAPPED to be ERTO_BACKOFF or 10 seconds, whichever is less;

   3.  Set ERTO to ERTO_CAPPED or MRTO, whichever is greater.
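The OPTIONAL round-trip time measurement and ERTO calculation described in this section can be sketched informally in Python as follows. The class name `RttEstimator` and its method names are assumptions of this example; all times are held in milliseconds, and timestamps are the 16-bit, 4-millisecond tick values carried in the packet header.

```python
class RttEstimator:
    """Per-session SRTT, RTTVAR, MRTO and ERTO (hypothetical helper class)."""

    def __init__(self):
        self.ts_echo_rx = None   # TS_ECHO_RX: no value
        self.srtt = None         # SRTT: no value
        self.rttvar = 0.0        # RTTVAR
        self.mrto = 250.0        # MRTO, milliseconds
        self.erto = 3000.0       # ERTO, milliseconds

    def on_echo(self, timestamp_echo, current_send_timestamp):
        """Call for a received packet that has timestampEchoPresent set."""
        if timestamp_echo == self.ts_echo_rx:
            return
        self.ts_echo_rx = timestamp_echo
        rtt_ticks = (current_send_timestamp - timestamp_echo) % 65536
        if rtt_ticks > 32767:
            return                        # invalid measurement: discard
        rtt = rtt_ticks * 4.0
        if self.srtt is not None:
            rtt_delta = abs(self.srtt - rtt)
            self.rttvar = (3 * self.rttvar + rtt_delta) / 4
            self.srtt = (7 * self.srtt + rtt) / 8
        else:
            self.srtt = rtt
            self.rttvar = rtt / 2
        self.mrto = self.srtt + 4 * self.rttvar + 200
        self.erto = max(self.mrto, 250.0)

    def on_timeout(self):
        """Back off ERTO after a retransmission timeout."""
        erto_backoff = self.erto * 1.4142
        erto_capped = min(erto_backoff, 10000.0)   # 10 second cap
        self.erto = max(erto_capped, self.mrto)
```

For example, a first measurement of 25 ticks (100 milliseconds) gives SRTT = 100, RTTVAR = 50, and MRTO = ERTO = 500 milliseconds. The factor 1.4142 approximates the square root of 2, so ERTO doubles every two consecutive timeouts until it reaches the cap.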
3.5.2.3. Burst Avoidance
An application's sending patterns may cause the transmission budget to grow to a large value, but at times its sending patterns will result in a comparatively small amount of data outstanding in the network. In this circumstance, especially with a window-based congestion avoidance algorithm, if the application then has a large amount of new data to send (for example, a new bulk data transfer), it could send data into the network all at once to fill the window. This kind of transmission burst is undesirable, however, because it can jam interfaces, links, and buffers.

Accordingly, in any session, an endpoint SHOULD NOT send more than six packets containing user data between receiving any acknowledgements or retransmission timeouts.

The following describes an OPTIONAL method to avoid bursting large numbers of packets into the network:

Let the session information context contain an additional variable DATA_PACKET_COUNT, initialized to 0.

Transmission of a user data fragment on this session is not allowed if DATA_PACKET_COUNT is greater than or equal to 6, regardless of any other allowance of the congestion control algorithm.

On transmission of a packet containing at least one User Data chunk (Section 2.3.11), set DATA_PACKET_COUNT = DATA_PACKET_COUNT + 1.

On receipt of an acknowledgement chunk (Sections 2.3.13 and 2.3.14), set DATA_PACKET_COUNT to 0.

On a retransmission timeout, set DATA_PACKET_COUNT to 0.

3.5.3. Address Mobility
Sessions are demultiplexed with a 32-bit session ID, rather than by endpoint address. This allows an endpoint's address to change during an S_OPEN session. This can happen, for example, when switching from a wireless to a wired network, or when moving from one wireless base station to another, or when a NAT restarts. If the near end receives a valid packet for an S_OPEN session from a source address that doesn't match DESTADDR, the far end might have changed addresses. The near end SHOULD verify that the far end is definitively at the new address before changing DESTADDR. A suggested verification method is described in Section 3.5.4.2.
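As an informal illustration, the verification method suggested in Section 3.5.4.2 might be sketched in Python as follows. This is a sketch under stated assumptions: the class name `MobilityChecker`, the message layout (one marking byte, an 8-byte timestamp, then the hash), and the choice of HMAC-SHA-256 keyed with MOB_SECRET as the cryptographic hash are all choices of this example; the specification does not name a hash or a message format. The check that the session is still in the S_OPEN state is not modeled.

```python
import hashlib
import hmac
import secrets
import struct

MARK = b"M"          # marking for a mobility check (example choice)
HEAD_LEN = 1 + 8     # marking byte plus 64-bit timestamp


class MobilityChecker:
    """Far-end address change verification (hypothetical helper class)."""

    def __init__(self, dest_addr):
        self.dest_addr = dest_addr                 # DESTADDR
        self.mob_tx_ts = float("-inf")             # MOB_TX_TS
        self.mob_rx_ts = float("-inf")             # MOB_RX_TS
        self.mob_secret = secrets.token_bytes(16)  # MOB_SECRET, 128 bits

    def _hash(self, head, addr):
        # Assumption: HMAC-SHA-256 over marking, timestamp, and address.
        data = head + repr(addr).encode()
        return hmac.new(self.mob_secret, data, hashlib.sha256).digest()

    def on_packet(self, src_addr, now):
        """Return a Ping message to send to src_addr, or None."""
        if src_addr == self.dest_addr or now - self.mob_tx_ts < 1.0:
            return None
        self.mob_tx_ts = now
        head = MARK + struct.pack("!d", now)
        return head + self._hash(head, src_addr)

    def on_ping_reply(self, message, src_addr, now):
        """Verify a Ping Reply; update DESTADDR and return True if valid."""
        if len(message) != HEAD_LEN + 32 or message[:1] != MARK:
            return False
        head, digest = message[:HEAD_LEN], message[HEAD_LEN:]
        (ts,) = struct.unpack("!d", head[1:])
        if now - ts > 120 or ts <= self.mob_rx_ts:
            return False
        if not hmac.compare_digest(digest, self._hash(head, src_addr)):
            return False
        self.mob_rx_ts = ts
        self.dest_addr = src_addr
        return True
```

Because the hash covers the address the Ping was sent to, a Ping Reply is accepted only if it returns from that same address, and the increasing MOB_RX_TS check rejects replays of an earlier reply.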
3.5.4. Ping
If an endpoint receives a Ping chunk (Section 2.3.9) in a session in the S_OPEN state, it SHOULD construct and send a Ping Reply chunk (Section 2.3.10) in response if possible, copying the message unaltered. The Ping Reply SHOULD be sent as quickly as possible following receipt of a Ping. The semantics of a Ping's message are reserved for the sender; a receiver SHOULD NOT interpret the Ping's message.

Endpoints can use the mechanism of the Ping chunk and the expected Ping Reply for any purpose. This specification doesn't mandate any specific constraints on the format or semantics of a Ping message. A Ping Reply MUST be sent only as a response to a Ping.

Receipt of a Ping Reply implies live bidirectional connectivity. This specification doesn't mandate any other semantics for a Ping Reply.

3.5.4.1. Keepalive
An endpoint can use a Ping to test for live bidirectional connectivity, to test that the far end of a session is still in the S_OPEN state, to keep NAT translations alive, and to keep firewall holes open. An endpoint can use a Ping to hasten detection of a near-end address change by the far end. An endpoint may declare a session to be defunct and dead after a persistent failure by the far end to return Ping Replies in response to Pings. If used for these purposes, a Keepalive Ping SHOULD have an empty message. A Keepalive Ping SHOULD NOT be sent more often than once per ERTO. If a corresponding Ping Reply is not received within ERTO of sending the Ping, ERTO SHOULD be increased according to Section 3.5.2 ("Congestion Control").
3.5.4.2. Address Mobility
This section describes an OPTIONAL but suggested method for processing and verifying a far-end address change.

Let the session context contain additional variables MOB_TX_TS, MOB_RX_TS, and MOB_SECRET. MOB_TX_TS and MOB_RX_TS have initial values of negative infinity. MOB_SECRET should be a cryptographically pseudorandom value not less than 128 bits in length and known only to this end.

On receipt of a packet for an S_OPEN session, after processing all chunks in the packet: if the session is still in the S_OPEN state, AND the source address of the packet does not match DESTADDR, AND MOB_TX_TS is at least one second in the past, then:

   1.  Set MOB_TX_TS to the current time;

   2.  Construct a Ping message comprising the following: a marking to indicate (to this end when returned in a Ping Reply) that it is a mobility check (for example, the first byte being ASCII 'M' for "Mobility"), a timestamp set to MOB_TX_TS, and a cryptographic hash over the following: the preceding items, the address from which the packet was received, and MOB_SECRET; and

   3.  Send this Ping to the address from which the packet was received, instead of DESTADDR.

On receipt of a Ping Reply in an S_OPEN session, if the Ping Reply's message satisfies all of these conditions:

   o  it has this end's expected marking to indicate that it is a mobility check, and

   o  the timestamp in the message is not more than 120 seconds in the past, and

   o  the timestamp in the message is greater than MOB_RX_TS, and

   o  the cryptographic hash matches the expected value according to the contents of the message plus the source address of the packet containing this Ping Reply and MOB_SECRET,

then:

   1.  Set MOB_RX_TS to the timestamp in the message; and

   2.  Set DESTADDR to the source address of the packet containing this Ping Reply.

3.5.4.3. Path MTU Discovery

"Packetization Layer Path MTU Discovery" [RFC4821] describes a method for measuring the path MTU between communicating endpoints.

An RTMFP SHOULD perform path MTU discovery.

The method described in RFC 4821 can be adapted for use in RTMFP by sending a probe packet comprising one of the Padding chunk types (type 0x00 or 0xff) and a Ping. The Ping chunk SHOULD come after the Padding chunk, to guard against a false positive response in case the probe packet is truncated.

3.5.5. Close
An endpoint may close a session at any time. Typically, an endpoint will close a session when there have been no open flows in either direction for a time. In another circumstance, an endpoint may be ceasing operation and will close all of its sessions even if they have open flows. To close an S_OPEN session in a reliable and orderly fashion, an endpoint moves the session to the S_NEARCLOSE state. On a session transitioning from S_OPEN to S_NEARCLOSE and every 5 seconds thereafter while still in the S_NEARCLOSE state, send a Session Close Request chunk (Section 2.3.17). A session that has been in the S_NEARCLOSE state for at least 90 seconds (allowing time to retransmit the Session Close Request multiple times) SHOULD move to the S_CLOSED state. On a session transitioning from S_OPEN to the S_NEARCLOSE, S_FARCLOSE_LINGER or S_CLOSED state, immediately abort and terminate all open or closing flows. Flows only exist in S_OPEN sessions. To close an S_OPEN session abruptly, send a Session Close Acknowledgement chunk (Section 2.3.18), then move to the S_CLOSED state.
On receipt of a Session Close Request chunk for a session in the S_OPEN, S_NEARCLOSE, or S_FARCLOSE_LINGER states, send a Session Close Acknowledgement chunk; then, if the session is in the S_OPEN state, move to the S_FARCLOSE_LINGER state. A session that has been in the S_FARCLOSE_LINGER state for at least 19 seconds (allowing time to answer 3 retransmissions of a Session Close Request) SHOULD move to the S_CLOSED state. On receipt of a Session Close Acknowledgement chunk for a session in the S_OPEN, S_NEARCLOSE, or S_FARCLOSE_LINGER states, move to the S_CLOSED state.
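The close procedure above can be sketched informally in Python as a small state machine. The class name `SessionCloser`, the chunk name strings, and the `send` callback are assumptions of this example; only the S_OPEN, S_NEARCLOSE, S_FARCLOSE_LINGER, and S_CLOSED states are modeled, times are in seconds, and aborting open flows on leaving S_OPEN is indicated only by a comment.

```python
S_OPEN = "S_OPEN"
S_NEARCLOSE = "S_NEARCLOSE"
S_FARCLOSE_LINGER = "S_FARCLOSE_LINGER"
S_CLOSED = "S_CLOSED"

CLOSE_REQUEST = "SessionCloseRequest"          # Section 2.3.17
CLOSE_ACK = "SessionCloseAcknowledgement"      # Section 2.3.18


class SessionCloser:
    """Session close handling (hypothetical helper class)."""

    def __init__(self, send):
        self.state = S_OPEN
        self.send = send          # callable that transmits one chunk
        self.entered = None       # time the current state was entered
        self.last_request = None  # time of the last Close Request sent

    def _enter(self, state, now):
        # On leaving S_OPEN, all open or closing flows would be aborted here.
        self.state = state
        self.entered = now

    def close(self, now):
        """Begin a reliable, orderly close."""
        if self.state == S_OPEN:
            self._enter(S_NEARCLOSE, now)
            self.send(CLOSE_REQUEST)
            self.last_request = now

    def close_abruptly(self, now):
        if self.state == S_OPEN:
            self.send(CLOSE_ACK)
            self._enter(S_CLOSED, now)

    def on_close_request(self, now):
        if self.state in (S_OPEN, S_NEARCLOSE, S_FARCLOSE_LINGER):
            self.send(CLOSE_ACK)
            if self.state == S_OPEN:
                self._enter(S_FARCLOSE_LINGER, now)

    def on_close_ack(self, now):
        if self.state in (S_OPEN, S_NEARCLOSE, S_FARCLOSE_LINGER):
            self._enter(S_CLOSED, now)

    def on_timer(self, now):
        """Call periodically to drive retransmission and expiry."""
        if self.state == S_NEARCLOSE:
            if now - self.entered >= 90:
                self._enter(S_CLOSED, now)
            elif now - self.last_request >= 5:
                self.send(CLOSE_REQUEST)
                self.last_request = now
        elif self.state == S_FARCLOSE_LINGER and now - self.entered >= 19:
            self._enter(S_CLOSED, now)
```

The lingering state lets the receiver of a Session Close Request keep answering retransmitted requests in case its earlier Session Close Acknowledgement was lost.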