2.7. Minor Versioning
To address the requirement of an NFS protocol that can evolve as the need arises, the NFSv4.1 protocol contains the rules and framework to allow for future minor changes or versioning. The base assumption with respect to minor versioning is that any future accepted minor version will be documented in one or more Standards Track RFCs. Minor version 0 of the NFSv4 protocol is represented by [30], and minor version 1 is represented by this RFC. The COMPOUND and CB_COMPOUND procedures support the encoding of the minor version being requested by the client. The following items represent the basic rules for the development of minor versions. Note that a future minor version may modify or add to the following rules as part of the minor version definition.
1. Procedures are not added or deleted. To maintain the general RPC model, NFSv4 minor versions will not add to or delete procedures from the NFS program. 2. Minor versions may add operations to the COMPOUND and CB_COMPOUND procedures. The addition of operations to the COMPOUND and CB_COMPOUND procedures does not affect the RPC model. * Minor versions may append attributes to the bitmap4 that represents sets of attributes and to the fattr4 that represents sets of attribute values. This allows for the expansion of the attribute model to allow for future growth or adaptation. * Minor version X must append any new attributes after the last documented attribute. Since attribute results are specified as an opaque array of per-attribute, XDR-encoded results, the complexity of adding new attributes in the midst of the current definitions would be too burdensome. 3. Minor versions must not modify the structure of an existing operation's arguments or results. Again, the complexity of handling multiple structure definitions for a single operation is too burdensome. New operations should be added instead of modifying existing structures for a minor version. This rule does not preclude the following adaptations in a minor version: * adding bits to flag fields, such as new attributes to GETATTR's bitmap4 data type, and providing corresponding variants of opaque arrays, such as a notify4 used together with such bitmaps * adding bits to existing attributes like ACLs that have flag words * extending enumerated types (including NFS4ERR_*) with new values
* adding cases to a switched union 4. Minor versions must not modify the structure of existing attributes. 5. Minor versions must not delete operations. This prevents the potential reuse of a particular operation "slot" in a future minor version. 6. Minor versions must not delete attributes. 7. Minor versions must not delete flag bits or enumeration values. 8. Minor versions may declare an operation MUST NOT be implemented. Specifying that an operation MUST NOT be implemented is equivalent to obsoleting an operation. For the client, it means that the operation MUST NOT be sent to the server. For the server, an NFS error can be returned as opposed to "dropping" the request as an XDR decode error. This approach allows for the obsolescence of an operation while maintaining its structure so that a future minor version can reintroduce the operation. 1. Minor versions may declare that an attribute MUST NOT be implemented. 2. Minor versions may declare that a flag bit or enumeration value MUST NOT be implemented. 9. Minor versions may downgrade features from REQUIRED to RECOMMENDED, or RECOMMENDED to OPTIONAL. 10. Minor versions may upgrade features from OPTIONAL to RECOMMENDED, or RECOMMENDED to REQUIRED. 11. A client and server that support minor version X SHOULD support minor versions zero through X-1 as well. 12. Except for infrastructural changes, a minor version must not introduce REQUIRED new features. This rule allows for the introduction of new functionality and forces the use of implementation experience before designating a feature as REQUIRED. On the other hand, some classes of features are infrastructural and have broad effects. Allowing infrastructural features to be RECOMMENDED or OPTIONAL complicates implementation of the minor version.
13. A client MUST NOT attempt to use a stateid, filehandle, or similar returned object from the COMPOUND procedure with minor version X for another COMPOUND procedure with minor version Y, where X != Y.2.8. Non-RPC-Based Security Services
As described in Section 2.2.1.1.1.1, NFSv4.1 relies on RPC for identification, authentication, integrity, and privacy. NFSv4.1 itself provides or enables additional security services as described in the next several subsections.2.8.1. Authorization
Authorization to access a file object via an NFSv4.1 operation is ultimately determined by the NFSv4.1 server. A client can predetermine its access to a file object via the OPEN (Section 18.16) and the ACCESS (Section 18.1) operations. Principals with appropriate access rights can modify the authorization on a file object via the SETATTR (Section 18.30) operation. Attributes that affect access rights include mode, owner, owner_group, acl, dacl, and sacl. See Section 5.2.8.2. Auditing
NFSv4.1 provides auditing on a per-file object basis, via the acl and sacl attributes as described in Section 6. It is outside the scope of this specification to specify audit log formats or management policies.2.8.3. Intrusion Detection
NFSv4.1 provides alarm control on a per-file object basis, via the acl and sacl attributes as described in Section 6. Alarms may serve as the basis for intrusion detection. It is outside the scope of this specification to specify heuristics for detecting intrusion via alarms.2.9. Transport Layers
2.9.1. REQUIRED and RECOMMENDED Properties of Transports
NFSv4.1 works over Remote Direct Memory Access (RDMA) and non-RDMA- based transports with the following attributes:
o The transport supports reliable delivery of data, which NFSv4.1 requires but neither NFSv4.1 nor RPC has facilities for ensuring [34]. o The transport delivers data in the order it was sent. Ordered delivery simplifies detection of transmit errors, and simplifies the sending of arbitrary sized requests and responses via the record marking protocol [3]. Where an NFSv4.1 implementation supports operation over the IP network protocol, any transport used between NFS and IP MUST be among the IETF-approved congestion control transport protocols. At the time this document was written, the only two transports that had the above attributes were TCP and the Stream Control Transmission Protocol (SCTP). To enhance the possibilities for interoperability, an NFSv4.1 implementation MUST support operation over the TCP transport protocol. Even if NFSv4.1 is used over a non-IP network protocol, it is RECOMMENDED that the transport support congestion control. It is permissible for a connectionless transport to be used under NFSv4.1; however, reliable and in-order delivery of data combined with congestion control by the connectionless transport is REQUIRED. As a consequence, UDP by itself MUST NOT be used as an NFSv4.1 transport. NFSv4.1 assumes that a client transport address and server transport address used to send data over a transport together constitute a connection, even if the underlying transport eschews the concept of a connection.2.9.2. Client and Server Transport Behavior
If a connection-oriented transport (e.g., TCP) is used, the client and server SHOULD use long-lived connections for at least three reasons: 1. This will prevent the weakening of the transport's congestion control mechanisms via short-lived connections. 2. This will improve performance for the WAN environment by eliminating the need for connection setup handshakes. 3. The NFSv4.1 callback model differs from NFSv4.0, and requires the client and server to maintain a client-created backchannel (see Section 2.10.3.1) for the server to use. In order to reduce congestion, if a connection-oriented transport is used, and the request is not the NULL procedure:
o A requester MUST NOT retry a request unless the connection the request was sent over was lost before the reply was received. o A replier MUST NOT silently drop a request, even if the request is a retry. (The silent drop behavior of RPCSEC_GSS [4] does not apply because this behavior happens at the RPCSEC_GSS layer, a lower layer in the request processing.) Instead, the replier SHOULD return an appropriate error (see Section 2.10.6.1), or it MAY disconnect the connection. When sending a reply, the replier MUST send the reply to the same full network address (e.g., if using an IP-based transport, the source port of the requester is part of the full network address) from which the requester sent the request. If using a connection- oriented transport, replies MUST be sent on the same connection from which the request was received. If a connection is dropped after the replier receives the request but before the replier sends the reply, the replier might have a pending reply. If a connection is established with the same source and destination full network address as the dropped connection, then the replier MUST NOT send the reply until the requester retries the request. The reason for this prohibition is that the requester MAY retry a request over a different connection (provided that connection is associated with the original request's session). When using RDMA transports, there are other reasons for not tolerating retries over the same connection: o RDMA transports use "credits" to enforce flow control, where a credit is a right to a peer to transmit a message. If one peer were to retransmit a request (or reply), it would consume an additional credit. If the replier retransmitted a reply, it would certainly result in an RDMA connection loss, since the requester would typically only post a single receive buffer for each request. If the requester retransmitted a request, the additional credit consumed on the server might lead to RDMA connection failure unless the client accounted for it and decreased its available credit, leading to wasted resources. o RDMA credits present a new issue to the reply cache in NFSv4.1. The reply cache may be used when a connection within a session is lost, such as after the client reconnects. Credit information is a dynamic property of the RDMA connection, and stale values must not be replayed from the cache. This implies that the reply cache contents must not be blindly used when replies are sent from it, and credit information appropriate to the channel must be refreshed by the RPC layer.
In addition, as described in Section 2.10.6.2, while a session is active, the NFSv4.1 requester MUST NOT stop waiting for a reply.2.9.3. Ports
Historically, NFSv3 servers have listened over TCP port 2049. The registered port 2049 [35] for the NFS protocol should be the default configuration. NFSv4.1 clients SHOULD NOT use the RPC binding protocols as described in [36].2.10. Session
NFSv4.1 clients and servers MUST support and MUST use the session feature as described in this section.2.10.1. Motivation and Overview
Previous versions and minor versions of NFS have suffered from the following: o Lack of support for Exactly Once Semantics (EOS). This includes lack of support for EOS through server failure and recovery. o Limited callback support, including no support for sending callbacks through firewalls, and races between replies to normal requests and callbacks. o Limited trunking over multiple network paths. o Requiring machine credentials for fully secure operation. Through the introduction of a session, NFSv4.1 addresses the above shortfalls with practical solutions: o EOS is enabled by a reply cache with a bounded size, making it feasible to keep the cache in persistent storage and enable EOS through server failure and recovery. One reason that previous revisions of NFS did not support EOS was because some EOS approaches often limited parallelism. As will be explained in Section 2.10.6, NFSv4.1 supports both EOS and unlimited parallelism. o The NFSv4.1 client (defined in Section 1.6, Paragraph 2) creates transport connections and provides them to the server to use for sending callback requests, thus solving the firewall issue (Section 18.34). Races between responses from client requests and
callbacks caused by the requests are detected via the session's sequencing properties that are a consequence of EOS (Section 2.10.6.3). o The NFSv4.1 client can associate an arbitrary number of connections with the session, and thus provide trunking (Section 2.10.5). o The NFSv4.1 client and server produces a session key independent of client and server machine credentials which can be used to compute a digest for protecting critical session management operations (Section 2.10.8.3). o The NFSv4.1 client can also create secure RPCSEC_GSS contexts for use by the session's backchannel that do not require the server to authenticate to a client machine principal (Section 2.10.8.2). A session is a dynamically created, long-lived server object created by a client and used over time from one or more transport connections. Its function is to maintain the server's state relative to the connection(s) belonging to a client instance. This state is entirely independent of the connection itself, and indeed the state exists whether or not the connection exists. A client may have one or more sessions associated with it so that client-associated state may be accessed using any of the sessions associated with that client's client ID, when connections are associated with those sessions. When no connections are associated with any of a client ID's sessions for an extended time, such objects as locks, opens, delegations, layouts, etc. are subject to expiration. The session serves as an object representing a means of access by a client to the associated client state on the server, independent of the physical means of access to that state. A single client may create multiple sessions. A single session MUST NOT serve multiple clients.2.10.2. NFSv4 Integration
Sessions are part of NFSv4.1 and not NFSv4.0. Normally, a major infrastructure change such as sessions would require a new major version number to an Open Network Computing (ONC) RPC program like NFS. However, because NFSv4 encapsulates its functionality in a single procedure, COMPOUND, and because COMPOUND can support an arbitrary number of operations, sessions have been added to NFSv4.1 with little difficulty. COMPOUND includes a minor version number field, and for NFSv4.1 this minor version is set to 1. When the NFSv4 server processes a COMPOUND with the minor version set to 1, it expects a different set of operations than it does for NFSv4.0.
NFSv4.1 defines the SEQUENCE operation, which is required for every COMPOUND that operates over an established session, with the exception of some session administration operations, such as DESTROY_SESSION (Section 18.37).2.10.2.1. SEQUENCE and CB_SEQUENCE
In NFSv4.1, when the SEQUENCE operation is present, it MUST be the first operation in the COMPOUND procedure. The primary purpose of SEQUENCE is to carry the session identifier. The session identifier associates all other operations in the COMPOUND procedure with a particular session. SEQUENCE also contains required information for maintaining EOS (see Section 2.10.6). Session-enabled NFSv4.1 COMPOUND requests thus have the form: +-----+--------------+-----------+------------+-----------+---- | tag | minorversion | numops |SEQUENCE op | op + args | ... | | (== 1) | (limited) | + args | | +-----+--------------+-----------+------------+-----------+---- and the replies have the form: +------------+-----+--------+-------------------------------+--// |last status | tag | numres |status + SEQUENCE op + results | // +------------+-----+--------+-------------------------------+--// //-----------------------+---- // status + op + results | ... //-----------------------+---- A CB_COMPOUND procedure request and reply has a similar form to COMPOUND, but instead of a SEQUENCE operation, there is a CB_SEQUENCE operation. CB_COMPOUND also has an additional field called "callback_ident", which is superfluous in NFSv4.1 and MUST be ignored by the client. CB_SEQUENCE has the same information as SEQUENCE, and also includes other information needed to resolve callback races (Section 2.10.6.3).2.10.2.2. Client ID and Session Association
Each client ID (Section 2.4) can have zero or more active sessions. A client ID and associated session are required to perform file access in NFSv4.1. Each time a session is used (whether by a client sending a request to the server or the client replying to a callback request from the server), the state leased to its associated client ID is automatically renewed.
State (which can consist of share reservations, locks, delegations, and layouts (Section 1.7.4)) is tied to the client ID. Client state is not tied to any individual session. Successive state changing operations from a given state owner MAY go over different sessions, provided the session is associated with the same client ID. A callback MAY arrive over a different session than that of the request that originally acquired the state pertaining to the callback. For example, if session A is used to acquire a delegation, a request to recall the delegation MAY arrive over session B if both sessions are associated with the same client ID. Sections 2.10.8.1 and 2.10.8.2 discuss the security considerations around callbacks.2.10.3. Channels
A channel is not a connection. A channel represents the direction ONC RPC requests are sent. Each session has one or two channels: the fore channel and the backchannel. Because there are at most two channels per session, and because each channel has a distinct purpose, channels are not assigned identifiers. The fore channel is used for ordinary requests from the client to the server, and carries COMPOUND requests and responses. A session always has a fore channel. The backchannel is used for callback requests from server to client, and carries CB_COMPOUND requests and responses. Whether or not there is a backchannel is a decision made by the client; however, many features of NFSv4.1 require a backchannel. NFSv4.1 servers MUST support backchannels. Each session has resources for each channel, including separate reply caches (see Section 2.10.6.1). Note that even the backchannel requires a reply cache (or, at least, a slot table in order to detect retries) because some callback operations are nonidempotent.2.10.3.1. Association of Connections, Channels, and Sessions
Each channel is associated with zero or more transport connections (whether of the same transport protocol or different transport protocols). A connection can be associated with one channel or both channels of a session; the client and server negotiate whether a connection will carry traffic for one channel or both channels via the CREATE_SESSION (Section 18.36) and the BIND_CONN_TO_SESSION (Section 18.34) operations. When a session is created via CREATE_SESSION, the connection that transported the CREATE_SESSION request is automatically associated with the fore channel, and
optionally the backchannel. If the client specifies no state protection (Section 18.35) when the session is created, then when SEQUENCE is transmitted on a different connection, the connection is automatically associated with the fore channel of the session specified in the SEQUENCE operation. A connection's association with a session is not exclusive. A connection associated with the channel(s) of one session may be simultaneously associated with the channel(s) of other sessions including sessions associated with other client IDs. It is permissible for connections of multiple transport types to be associated with the same channel. For example, both TCP and RDMA connections can be associated with the fore channel. In the event an RDMA and non-RDMA connection are associated with the same channel, the maximum number of slots SHOULD be at least one more than the total number of RDMA credits (Section 2.10.6.1). This way, if all RDMA credits are used, the non-RDMA connection can have at least one outstanding request. If a server supports multiple transport types, it MUST allow a client to associate connections from each transport to a channel. It is permissible for a connection of one type of transport to be associated with the fore channel, and a connection of a different type to be associated with the backchannel.2.10.4. Server Scope
Servers each specify a server scope value in the form of an opaque string eir_server_scope returned as part of the results of an EXCHANGE_ID operation. The purpose of the server scope is to allow a group of servers to indicate to clients that a set of servers sharing the same server scope value has arranged to use compatible values of otherwise opaque identifiers. Thus, the identifiers generated by one server of that set may be presented to another of that same scope. The use of such compatible values does not imply that a value generated by one server will always be accepted by another. In most cases, it will not. However, a server will not accept a value generated by another inadvertently. When it does accept it, it will be because it is recognized as valid and carrying the same meaning as on another server of the same scope. When servers are of the same server scope, this compatibility of values applies to the follow identifiers:
o Filehandle values. A filehandle value accepted by two servers of the same server scope denotes the same object. A WRITE operation sent to one server is reflected immediately in a READ sent to the other, and locks obtained on one server conflict with those requested on the other. o Session ID values. A session ID value accepted by two servers of the same server scope denotes the same session. o Client ID values. A client ID value accepted as valid by two servers of the same server scope is associated with two clients with the same client owner and verifier. o State ID values. A state ID value is recognized as valid when the corresponding client ID is recognized as valid. If the same stateid value is accepted as valid on two servers of the same scope and the client IDs on the two servers represent the same client owner and verifier, then the two stateid values designate the same set of locks and are for the same file. o Server owner values. When the server scope values are the same, server owner value may be validly compared. In cases where the server scope values are different, server owner values are treated as different even if they contain all identical bytes. The coordination among servers required to provide such compatibility can be quite minimal, and limited to a simple partition of the ID space. The recognition of common values requires additional implementation, but this can be tailored to the specific situations in which that recognition is desired. Clients will have occasion to compare the server scope values of multiple servers under a number of circumstances, each of which will be discussed under the appropriate functional section: o When server owner values received in response to EXCHANGE_ID operations sent to multiple network addresses are compared for the purpose of determining the validity of various forms of trunking, as described in Section 2.10.5. o When network or server reconfiguration causes the same network address to possibly be directed to different servers, with the necessity for the client to determine when lock reclaim should be attempted, as described in Section 8.4.2.1. o When file system migration causes the transfer of responsibility for a file system between servers and the client needs to determine whether state has been transferred with the file system
(as described in Section 11.7.7) or whether the client needs to reclaim state on a similar basis as in the case of server restart, as described in Section 8.4.2. When two replies from EXCHANGE_ID, each from two different server network addresses, have the same server scope, there are a number of ways a client can validate that the common server scope is due to two servers cooperating in a group. o If both EXCHANGE_ID requests were sent with RPCSEC_GSS authentication and the server principal is the same for both targets, the equality of server scope is validated. It is RECOMMENDED that two servers intending to share the same server scope also share the same principal name. o The client may accept the appearance of the second server in the fs_locations or fs_locations_info attribute for a relevant file system. For example, if there is a migration event for a particular file system or there are locks to be reclaimed on a particular file system, the attributes for that particular file system may be used. The client sends the GETATTR request to the first server for the fs_locations or fs_locations_info attribute with RPCSEC_GSS authentication. It may need to do this in advance of the need to verify the common server scope. If the client successfully authenticates the reply to GETATTR, and the GETATTR request and reply containing the fs_locations or fs_locations_info attribute refers to the second server, then the equality of server scope is supported. A client may choose to limit the use of this form of support to information relevant to the specific file system involved (e.g. a file system being migrated).2.10.5. Trunking
Trunking is the use of multiple connections between a client and server in order to increase the speed of data transfer. NFSv4.1 supports two types of trunking: session trunking and client ID trunking. NFSv4.1 servers MUST support both forms of trunking within the context of a single server network address and MUST support both forms within the context of the set of network addresses used to access a single server. NFSv4.1 servers in a clustered configuration MAY allow network addresses for different servers to use client ID trunking. Clients may use either form of trunking as long as they do not, when trunking between different server network addresses, violate the servers' mandates as to the kinds of trunking to be allowed (see
below). With regard to callback channels, the client MUST allow the server to choose among all callback channels valid for a given client ID and MUST support trunking when the connections supporting the backchannel allow session or client ID trunking to be used for callbacks. Session trunking is essentially the association of multiple connections, each with potentially different target and/or source network addresses, to the same session. When the target network addresses (server addresses) of the two connections are the same, the server MUST support such session trunking. When the target network addresses are different, the server MAY indicate such support using the data returned by the EXCHANGE_ID operation (see below). Client ID trunking is the association of multiple sessions to the same client ID. Servers MUST support client ID trunking for two target network addresses whenever they allow session trunking for those same two network addresses. In addition, a server MAY, by presenting the same major server owner ID (Section 2.5) and server scope (Section 2.10.4), allow an additional case of client ID trunking. When two servers return the same major server owner and server scope, it means that the two servers are cooperating on locking state management, which is a prerequisite for client ID trunking. Distinguishing when the client is allowed to use session and client ID trunking requires understanding how the results of the EXCHANGE_ID (Section 18.35) operation identify a server. Suppose a client sends EXCHANGE_IDs over two different connections, each with a possibly different target network address, but each EXCHANGE_ID operation has the same value in the eia_clientowner field. If the same NFSv4.1 server is listening over each connection, then each EXCHANGE_ID result MUST return the same values of eir_clientid, eir_server_owner.so_major_id, and eir_server_scope. The client can then treat each connection as referring to the same server (subject to verification; see Section 2.10.5.1 later in this section), and it can use each connection to trunk requests and replies. The client's choice is whether session trunking or client ID trunking applies. Session Trunking. If the eia_clientowner argument is the same in two different EXCHANGE_ID requests, and the eir_clientid, eir_server_owner.so_major_id, eir_server_owner.so_minor_id, and eir_server_scope results match in both EXCHANGE_ID results, then the client is permitted to perform session trunking. If the client has no session mapping to the tuple of eir_clientid, eir_server_owner.so_major_id, eir_server_scope, and eir_server_owner.so_minor_id, then it creates the session via a CREATE_SESSION operation over one of the connections, which
associates the connection to the session. If there is a session for the tuple, the client can send BIND_CONN_TO_SESSION to associate the connection to the session. Of course, if the client does not desire to use session trunking, it is not required to do so. It can invoke CREATE_SESSION on the connection. This will result in client ID trunking as described below. It can also decide to drop the connection if it does not choose to use trunking. Client ID Trunking. If the eia_clientowner argument is the same in two different EXCHANGE_ID requests, and the eir_clientid, eir_server_owner.so_major_id, and eir_server_scope results match in both EXCHANGE_ID results, then the client is permitted to perform client ID trunking (regardless of whether the eir_server_owner.so_minor_id results match). The client can associate each connection with different sessions, where each session is associated with the same server. The client completes the act of client ID trunking by invoking CREATE_SESSION on each connection, using the same client ID that was returned in eir_clientid. These invocations create two sessions and also associate each connection with its respective session. The client is free to decline to use client ID trunking by simply dropping the connection at this point. When doing client ID trunking, locking state is shared across sessions associated with that same client ID. This requires the server to coordinate state across sessions. The client should be prepared for the possibility that eir_server_owner values may be different on subsequent EXCHANGE_ID requests made to the same network address, as a result of various sorts of reconfiguration events. When this happens and the changes result in the invalidation of previously valid forms of trunking, the client should cease to use those forms, either by dropping connections or by adding sessions. For a discussion of lock reclaim as it relates to such reconfiguration events, see Section 8.4.2.1.2.10.5.1. Verifying Claims of Matching Server Identity
When two servers over two connections claim matching or partially matching eir_server_owner, eir_server_scope, and eir_clientid values, the client does not have to trust the servers' claims. The client may verify these claims before trunking traffic in the following ways:
o For session trunking, clients SHOULD reliably verify if connections between different network paths are in fact associated with the same NFSv4.1 server and usable on the same session, and servers MUST allow clients to perform reliable verification. When a client ID is created, the client SHOULD specify that BIND_CONN_TO_SESSION is to be verified according to the SP4_SSV or SP4_MACH_CRED (Section 18.35) state protection options. For SP4_SSV, reliable verification depends on a shared secret (the SSV) that is established via the SET_SSV (Section 18.47) operation. When a new connection is associated with the session (via the BIND_CONN_TO_SESSION operation, see Section 18.34), if the client specified SP4_SSV state protection for the BIND_CONN_TO_SESSION operation, the client MUST send the BIND_CONN_TO_SESSION with RPCSEC_GSS protection, using integrity or privacy, and an RPCSEC_GSS handle created with the GSS SSV mechanism (Section 2.10.9). If the client mistakenly tries to associate a connection to a session of a wrong server, the server will either reject the attempt because it is not aware of the session identifier of the BIND_CONN_TO_SESSION arguments, or it will reject the attempt because the RPCSEC_GSS authentication fails. Even if the server mistakenly or maliciously accepts the connection association attempt, the RPCSEC_GSS verifier it computes in the response will not be verified by the client, so the client will know it cannot use the connection for trunking the specified session. If the client specified SP4_MACH_CRED state protection, the BIND_CONN_TO_SESSION operation will use RPCSEC_GSS integrity or privacy, using the same credential that was used when the client ID was created. Mutual authentication via RPCSEC_GSS assures the client that the connection is associated with the correct session of the correct server. o For client ID trunking, the client has at least two options for verifying that the same client ID obtained from two different EXCHANGE_ID operations came from the same server. The first option is to use RPCSEC_GSS authentication when sending each EXCHANGE_ID operation. Each time an EXCHANGE_ID is sent with RPCSEC_GSS authentication, the client notes the principal name of the GSS target. If the EXCHANGE_ID results indicate that client ID trunking is possible, and the GSS targets' principal names are the same, the servers are the same and client ID trunking is allowed.
The second option for verification is to use SP4_SSV protection. When the client sends EXCHANGE_ID, it specifies SP4_SSV protection. The first EXCHANGE_ID the client sends always has to be confirmed by a CREATE_SESSION call. The client then sends SET_SSV. Later, the client sends EXCHANGE_ID to a second destination network address different from the one the first EXCHANGE_ID was sent to. The client checks that each EXCHANGE_ID reply has the same eir_clientid, eir_server_owner.so_major_id, and eir_server_scope. If so, the client verifies the claim by sending a CREATE_SESSION operation to the second destination address, protected with RPCSEC_GSS integrity using an RPCSEC_GSS handle returned by the second EXCHANGE_ID. If the server accepts the CREATE_SESSION request, and if the client verifies the RPCSEC_GSS verifier and integrity codes, then the client has proof the second server knows the SSV, and thus the two servers are cooperating for the purposes of specifying server scope and client ID trunking.2.10.6. Exactly Once Semantics
Via the session, NFSv4.1 offers exactly once semantics (EOS) for requests sent over a channel. EOS is supported on both the fore channel and backchannel. Each COMPOUND or CB_COMPOUND request that is sent with a leading SEQUENCE or CB_SEQUENCE operation MUST be executed by the receiver exactly once. This requirement holds regardless of whether the request is sent with reply caching specified (see Section 2.10.6.1.3). The requirement holds even if the requester is sending the request over a session created between a pNFS data client and pNFS data server. To understand the rationale for this requirement, divide the requests into three classifications: o Non-idempotent requests. o Idempotent modifying requests. o Idempotent non-modifying requests. An example of a non-idempotent request is RENAME. Obviously, if a replier executes the same RENAME request twice, and the first execution succeeds, the re-execution will fail. If the replier returns the result from the re-execution, this result is incorrect. Therefore, EOS is required for non-idempotent requests.
An example of an idempotent modifying request is a COMPOUND request containing a WRITE operation. Repeated execution of the same WRITE has the same effect as execution of that WRITE a single time. Nevertheless, enforcing EOS for WRITEs and other idempotent modifying requests is necessary to avoid data corruption. Suppose a client sends WRITE A to a noncompliant server that does not enforce EOS, and receives no response, perhaps due to a network partition. The client reconnects to the server and re-sends WRITE A. Now, the server has outstanding two instances of A. The server can be in a situation in which it executes and replies to the retry of A, while the first A is still waiting in the server's internal I/O system for some resource. Upon receiving the reply to the second attempt of WRITE A, the client believes its WRITE is done so it is free to send WRITE B, which overlaps the byte-range of A. When the original A is dispatched from the server's I/O system and executed (thus the second time A will have been written), then what has been written by B can be overwritten and thus corrupted. An example of an idempotent non-modifying request is a COMPOUND containing SEQUENCE, PUTFH, READLINK, and nothing else. The re- execution of such a request will not cause data corruption or produce an incorrect result. Nonetheless, to keep the implementation simple, the replier MUST enforce EOS for all requests, whether or not idempotent and non-modifying. Note that true and complete EOS is not possible unless the server persists the reply cache in stable storage, and unless the server is somehow implemented to never require a restart (indeed, if such a server exists, the distinction between a reply cache kept in stable storage versus one that is not is one without meaning). See Section 2.10.6.5 for a discussion of persistence in the reply cache. Regardless, even if the server does not persist the reply cache, EOS improves robustness and correctness over previous versions of NFS because the legacy duplicate request/reply caches were based on the ONC RPC transaction identifier (XID). Section 2.10.6.1 explains the shortcomings of the XID as a basis for a reply cache and describes how NFSv4.1 sessions improve upon the XID.2.10.6.1. Slot Identifiers and Reply Cache
The RPC layer provides a transaction ID (XID), which, while required to be unique, is not convenient for tracking requests for two reasons. First, the XID is only meaningful to the requester; it cannot be interpreted by the replier except to test for equality with previously sent requests. When consulting an RPC-based duplicate request cache, the opaqueness of the XID requires a computationally expensive look up (often via a hash that includes XID and source
address). NFSv4.1 requests use a non-opaque slot ID, which is an index into a slot table, which is far more efficient. Second, because RPC requests can be executed by the replier in any order, there is no bound on the number of requests that may be outstanding at any time. To achieve perfect EOS, using ONC RPC would require storing all replies in the reply cache. XIDs are 32 bits; storing over four billion (2^32) replies in the reply cache is not practical. In practice, previous versions of NFS have chosen to store a fixed number of replies in the cache, and to use a least recently used (LRU) approach to replacing cache entries with new entries when the cache is full. In NFSv4.1, the number of outstanding requests is bounded by the size of the slot table, and a sequence ID per slot is used to tell the replier when it is safe to delete a cached reply. In the NFSv4.1 reply cache, when the requester sends a new request, it selects a slot ID in the range 0..N, where N is the replier's current maximum slot ID granted to the requester on the session over which the request is to be sent. The value of N starts out as equal to ca_maxrequests - 1 (Section 18.36), but can be adjusted by the response to SEQUENCE or CB_SEQUENCE as described later in this section. The slot ID must be unused by any of the requests that the requester has already active on the session. "Unused" here means the requester has no outstanding request for that slot ID. A slot contains a sequence ID and the cached reply corresponding to the request sent with that sequence ID. The sequence ID is a 32-bit unsigned value, and is therefore in the range 0..0xFFFFFFFF (2^32 - 1). The first time a slot is used, the requester MUST specify a sequence ID of one (Section 18.36). Each time a slot is reused, the request MUST specify a sequence ID that is one greater than that of the previous request on the slot. If the previous sequence ID was 0xFFFFFFFF, then the next request for the slot MUST have the sequence ID set to zero (i.e., (2^32 - 1) + 1 mod 2^32). The sequence ID accompanies the slot ID in each request. It is for the critical check at the replier: it used to efficiently determine whether a request using a certain slot ID is a retransmit or a new, never-before-seen request. It is not feasible for the requester to assert that it is retransmitting to implement this, because for any given request the requester cannot know whether the replier has seen it unless the replier actually replies. Of course, if the requester has seen the reply, the requester would not retransmit. The replier compares each received request's sequence ID with the last one previously received for that slot ID, to see if the new request is:
o A new request, in which the sequence ID is one greater than that previously seen in the slot (accounting for sequence wraparound). The replier proceeds to execute the new request, and the replier MUST increase the slot's sequence ID by one. o A retransmitted request, in which the sequence ID is equal to that currently recorded in the slot. If the original request has executed to completion, the replier returns the cached reply. See Section 2.10.6.2 for direction on how the replier deals with retries of requests that are still in progress. o A misordered retry, in which the sequence ID is less than (accounting for sequence wraparound) that previously seen in the slot. The replier MUST return NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE). o A misordered new request, in which the sequence ID is two or more than (accounting for sequence wraparound) that previously seen in the slot. Note that because the sequence ID MUST wrap around to zero once it reaches 0xFFFFFFFF, a misordered new request and a misordered retry cannot be distinguished. Thus, the replier MUST return NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE). Unlike the XID, the slot ID is always within a specific range; this has two implications. The first implication is that for a given session, the replier need only cache the results of a limited number of COMPOUND requests. The second implication derives from the first, which is that unlike XID-indexed reply caches (also known as duplicate request caches - DRCs), the slot ID-based reply cache cannot be overflowed. Through use of the sequence ID to identify retransmitted requests, the replier does not need to actually cache the request itself, reducing the storage requirements of the reply cache further. These facilities make it practical to maintain all the required entries for an effective reply cache. The slot ID, sequence ID, and session ID therefore take over the traditional role of the XID and source network address in the replier's reply cache implementation. This approach is considerably more portable and completely robust -- it is not subject to the reassignment of ports as clients reconnect over IP networks. In addition, the RPC XID is not used in the reply cache, enhancing robustness of the cache in the face of any rapid reuse of XIDs by the requester. While the replier does not care about the XID for the purposes of reply cache management (but the replier MUST return the same XID that was in the request), nonetheless there are considerations for the XID in NFSv4.1 that are the same as all other
previous versions of NFS. The RPC XID remains in each message and needs to be formulated in NFSv4.1 requests as in any other ONC RPC request. The reasons include: o The RPC layer retains its existing semantics and implementation. o The requester and replier must be able to interoperate at the RPC layer, prior to the NFSv4.1 decoding of the SEQUENCE or CB_SEQUENCE operation. o If an operation is being used that does not start with SEQUENCE or CB_SEQUENCE (e.g., BIND_CONN_TO_SESSION), then the RPC XID is needed for correct operation to match the reply to the request. o The SEQUENCE or CB_SEQUENCE operation may generate an error. If so, the embedded slot ID, sequence ID, and session ID (if present) in the request will not be in the reply, and the requester has only the XID to match the reply to the request. Given that well-formulated XIDs continue to be required, this begs the question: why do SEQUENCE and CB_SEQUENCE replies have a session ID, slot ID, and sequence ID? Having the session ID in the reply means that the requester does not have to use the XID to look up the session ID, which would be necessary if the connection were associated with multiple sessions. Having the slot ID and sequence ID in the reply means that the requester does not have to use the XID to look up the slot ID and sequence ID. Furthermore, since the XID is only 32 bits, it is too small to guarantee the re-association of a reply with its request [37]; having session ID, slot ID, and sequence ID in the reply allows the client to validate that the reply in fact belongs to the matched request. The SEQUENCE (and CB_SEQUENCE) operation also carries a "highest_slotid" value, which carries additional requester slot usage information. The requester MUST always indicate the slot ID representing the outstanding request with the highest-numbered slot value. The requester should in all cases provide the most conservative value possible, although it can be increased somewhat above the actual instantaneous usage to maintain some minimum or optimal level. This provides a way for the requester to yield unused request slots back to the replier, which in turn can use the information to reallocate resources. The replier responds with both a new target highest_slotid and an enforced highest_slotid, described as follows:
o The target highest_slotid is an indication to the requester of the highest_slotid the replier wishes the requester to be using. This permits the replier to withdraw (or add) resources from a requester that has been found to not be using them, in order to more fairly share resources among a varying level of demand from other requesters. The requester must always comply with the replier's value updates, since they indicate newly established hard limits on the requester's access to session resources. However, because of request pipelining, the requester may have active requests in flight reflecting prior values; therefore, the replier must not immediately require the requester to comply. o The enforced highest_slotid indicates the highest slot ID the requester is permitted to use on a subsequent SEQUENCE or CB_SEQUENCE operation. The replier's enforced highest_slotid SHOULD be no less than the highest_slotid the requester indicated in the SEQUENCE or CB_SEQUENCE arguments. A requester can be intransigent with respect to lowering its highest_slotid argument to a Sequence operation, i.e. the requester continues to ignore the target highest_slotid in the response to a Sequence operation, and continues to set its highest_slotid argument to be higher than the target highest_slotid. This can be considered particularly egregious behavior when the replier knows there are no outstanding requests with slot IDs higher than its target highest_slotid. When faced with such intransigence, the replier is free to take more forceful action, and MAY reply with a new enforced highest_slotid that is less than its previous enforced highest_slotid. Thereafter, if the requester continues to send requests with a highest_slotid that is greater than the replier's new enforced highest_slotid, the server MAY return NFS4ERR_BAD_HIGH_SLOT, unless the slot ID in the request is greater than the new enforced highest_slotid and the request is a retry. The replier SHOULD retain the slots it wants to retire until the requester sends a request with a highest_slotid less than or equal to the replier's new enforced highest_slotid. The requester can also be intransigent with respect to sending non-retry requests that have a slot ID that exceeds the replier's highest_slotid. Once the replier has forcibly lowered the enforced highest_slotid, the requester is only allowed to send retries on slots that exceed the replier's highest_slotid. If a request is received with a slot ID that is higher than the new enforced highest_slotid, and the sequence ID is one higher than what is in the slot's reply cache, then the server can both retire the slot and return NFS4ERR_BADSLOT (however, the server MUST NOT
do one and not the other). The reason it is safe to retire the slot is because by using the next sequence ID, the requester is indicating it has received the previous reply for the slot. o The requester SHOULD use the lowest available slot when sending a new request. This way, the replier may be able to retire slot entries faster. However, where the replier is actively adjusting its granted highest_slotid, it will not be able to use only the receipt of the slot ID and highest_slotid in the request. Neither the slot ID nor the highest_slotid used in a request may reflect the replier's current idea of the requester's session limit, because the request may have been sent from the requester before the update was received. Therefore, in the downward adjustment case, the replier may have to retain a number of reply cache entries at least as large as the old value of maximum requests outstanding, until it can infer that the requester has seen a reply containing the new granted highest_slotid. The replier can infer that the requester has seen such a reply when it receives a new request with the same slot ID as the request replied to and the next higher sequence ID.2.10.6.1.1. Caching of SEQUENCE and CB_SEQUENCE Replies
When a SEQUENCE or CB_SEQUENCE operation is successfully executed, its reply MUST always be cached. Specifically, session ID, sequence ID, and slot ID MUST be cached in the reply cache. The reply from SEQUENCE also includes the highest slot ID, target highest slot ID, and status flags. Instead of caching these values, the server MAY re-compute the values from the current state of the fore channel, session, and/or client ID as appropriate. Similarly, the reply from CB_SEQUENCE includes a highest slot ID and target highest slot ID. The client MAY re-compute the values from the current state of the session as appropriate. Regardless of whether or not a replier is re-computing highest slot ID, target slot ID, and status on replies to retries, the requester MUST NOT assume that the values are being re-computed whenever it receives a reply after a retry is sent, since it has no way of knowing whether the reply it has received was sent by the replier in response to the retry or is a delayed response to the original request. Therefore, it may be the case that highest slot ID, target slot ID, or status bits may reflect the state of affairs when the request was first executed. Although acting based on such delayed information is valid, it may cause the receiver of the reply to do unneeded work. Requesters MAY choose to send additional requests to get the current state of affairs or use the state of affairs reported by subsequent requests, in preference to acting immediately on data that might be out of date.
2.10.6.1.2. Errors from SEQUENCE and CB_SEQUENCE
Any time SEQUENCE or CB_SEQUENCE returns an error, the sequence ID of the slot MUST NOT change. The replier MUST NOT modify the reply cache entry for the slot whenever an error is returned from SEQUENCE or CB_SEQUENCE.2.10.6.1.3. Optional Reply Caching
On a per-request basis, the requester can choose to direct the replier to cache the reply to all operations after the first operation (SEQUENCE or CB_SEQUENCE) via the sa_cachethis or csa_cachethis fields of the arguments to SEQUENCE or CB_SEQUENCE. The reason it would not direct the replier to cache the entire reply is that the request is composed of all idempotent operations [34]. Caching the reply may offer little benefit. If the reply is too large (see Section 2.10.6.4), it may not be cacheable anyway. Even if the reply to idempotent request is small enough to cache, unnecessarily caching the reply slows down the server and increases RPC latency. Whether or not the requester requests the reply to be cached has no effect on the slot processing. If the results of SEQUENCE or CB_SEQUENCE are NFS4_OK, then the slot's sequence ID MUST be incremented by one. If a requester does not direct the replier to cache the reply, the replier MUST do one of following: o The replier can cache the entire original reply. Even though sa_cachethis or csa_cachethis is FALSE, the replier is always free to cache. It may choose this approach in order to simplify implementation. o The replier enters into its reply cache a reply consisting of the original results to the SEQUENCE or CB_SEQUENCE operation, and with the next operation in COMPOUND or CB_COMPOUND having the error NFS4ERR_RETRY_UNCACHED_REP. Thus, if the requester later retries the request, it will get NFS4ERR_RETRY_UNCACHED_REP. If a replier receives a retried Sequence operation where the reply to the COMPOUND or CB_COMPOUND was not cached, then the replier, * MAY return NFS4ERR_RETRY_UNCACHED_REP in reply to a Sequence operation if the Sequence operation is not the first operation (granted, a requester that does so is in violation of the NFSv4.1 protocol). * MUST NOT return NFS4ERR_RETRY_UNCACHED_REP in reply to a Sequence operation if the Sequence operation is the first operation.
o If the second operation is an illegal operation, or an operation that was legal in a previous minor version of NFSv4 and MUST NOT be supported in the current minor version (e.g., SETCLIENTID), the replier MUST NOT ever return NFS4ERR_RETRY_UNCACHED_REP. Instead the replier MUST return NFS4ERR_OP_ILLEGAL or NFS4ERR_BADXDR or NFS4ERR_NOTSUPP as appropriate. o If the second operation can result in another error status, the replier MAY return a status other than NFS4ERR_RETRY_UNCACHED_REP, provided the operation is not executed in such a way that the state of the replier is changed. Examples of such an error status include: NFS4ERR_NOTSUPP returned for an operation that is legal but not REQUIRED in the current minor versions, and thus not supported by the replier; NFS4ERR_SEQUENCE_POS; and NFS4ERR_REQ_TOO_BIG. The discussion above assumes that the retried request matches the original one. Section 2.10.6.1.3.1 discusses what the replier might do, and MUST do when original and retried requests do not match. Since the replier may only cache a small amount of the information that would be required to determine whether this is a case of a false retry, the replier may send to the client any of the following responses: o The cached reply to the original request (if the replier has cached it in its entirety and the users of the original request and retry match). o A reply that consists only of the Sequence operation with the error NFS4ERR_FALSE_RETRY. o A reply consisting of the response to Sequence with the status NFS4_OK, together with the second operation as it appeared in the retried request with an error of NFS4ERR_RETRY_UNCACHED_REP or other error as described above. o A reply that consists of the response to Sequence with the status NFS4_OK, together with the second operation as it appeared in the original request with an error of NFS4ERR_RETRY_UNCACHED_REP or other error as described above.2.10.6.1.3.1. False Retry If a requester sent a Sequence operation with a slot ID and sequence ID that are in the reply cache but the replier detected that the retried request is not the same as the original request, including a retry that has different operations or different arguments in the operations from the original and a retry that uses a different
principal in the RPC request's credential field that translates to a different user, then this is a false retry. When the replier detects a false retry, it is permitted (but not always obligated) to return NFS4ERR_FALSE_RETRY in response to the Sequence operation when it detects a false retry. Translations of particularly privileged user values to other users due to the lack of appropriately secure credentials, as configured on the replier, should be applied before determining whether the users are the same or different. If the replier determines the users are different between the original request and a retry, then the replier MUST return NFS4ERR_FALSE_RETRY. If an operation of the retry is an illegal operation, or an operation that was legal in a previous minor version of NFSv4 and MUST NOT be supported in the current minor version (e.g., SETCLIENTID), the replier MAY return NFS4ERR_FALSE_RETRY (and MUST do so if the users of the original request and retry differ). Otherwise, the replier MAY return NFS4ERR_OP_ILLEGAL or NFS4ERR_BADXDR or NFS4ERR_NOTSUPP as appropriate. Note that the handling is in contrast for how the replier deals with retries requests with no cached reply. The difference is due to NFS4ERR_FALSE_RETRY being a valid error for only Sequence operations, whereas NFS4ERR_RETRY_UNCACHED_REP is a valid error for all operations except illegal operations and operations that MUST NOT be supported in the current minor version of NFSv4.2.10.6.2. Retry and Replay of Reply
A requester MUST NOT retry a request, unless the connection it used to send the request disconnects. The requester can then reconnect and re-send the request, or it can re-send the request over a different connection that is associated with the same session. If the requester is a server wanting to re-send a callback operation over the backchannel of a session, the requester of course cannot reconnect because only the client can associate connections with the backchannel. The server can re-send the request over another connection that is bound to the same session's backchannel. If there is no such connection, the server MUST indicate that the session has no backchannel by setting the SEQ4_STATUS_CB_PATH_DOWN_SESSION flag bit in the response to the next SEQUENCE operation from the client. The client MUST then associate a connection with the session (or destroy the session). Note that it is not fatal for a requester to retry without a disconnect between the request and retry. However, the retry does consume resources, especially with RDMA, where each request, retry or not, consumes a credit. Retries for no reason, especially retries
sent shortly after the previous attempt, are a poor use of network bandwidth and defeat the purpose of a transport's inherent congestion control system. A requester MUST wait for a reply to a request before using the slot for another request. If it does not wait for a reply, then the requester does not know what sequence ID to use for the slot on its next request. For example, suppose a requester sends a request with sequence ID 1, and does not wait for the response. The next time it uses the slot, it sends the new request with sequence ID 2. If the replier has not seen the request with sequence ID 1, then the replier is not expecting sequence ID 2, and rejects the requester's new request with NFS4ERR_SEQ_MISORDERED (as the result from SEQUENCE or CB_SEQUENCE). RDMA fabrics do not guarantee that the memory handles (Steering Tags) within each RPC/RDMA "chunk" [8] are valid on a scope outside that of a single connection. Therefore, handles used by the direct operations become invalid after connection loss. The server must ensure that any RDMA operations that must be replayed from the reply cache use the newly provided handle(s) from the most recent request. A retry might be sent while the original request is still in progress on the replier. The replier SHOULD deal with the issue by returning NFS4ERR_DELAY as the reply to SEQUENCE or CB_SEQUENCE operation, but implementations MAY return NFS4ERR_MISORDERED. Since errors from SEQUENCE and CB_SEQUENCE are never recorded in the reply cache, this approach allows the results of the execution of the original request to be properly recorded in the reply cache (assuming that the requester specified the reply to be cached).2.10.6.3. Resolving Server Callback Races
It is possible for server callbacks to arrive at the client before the reply from related fore channel operations. For example, a client may have been granted a delegation to a file it has opened, but the reply to the OPEN (informing the client of the granting of the delegation) may be delayed in the network. If a conflicting operation arrives at the server, it will recall the delegation using the backchannel, which may be on a different transport connection, perhaps even a different network, or even a different session associated with the same client ID. The presence of a session between the client and server alleviates this issue. When a session is in place, each client request is uniquely identified by its { session ID, slot ID, sequence ID } triple. By the rules under which slot entries (reply cache entries) are retired, the server has knowledge whether the client has "seen"
each of the server's replies. The server can therefore provide sufficient information to the client to allow it to disambiguate between an erroneous or conflicting callback race condition. For each client operation that might result in some sort of server callback, the server SHOULD "remember" the { session ID, slot ID, sequence ID } triple of the client request until the slot ID retirement rules allow the server to determine that the client has, in fact, seen the server's reply. Until the time the { session ID, slot ID, sequence ID } request triple can be retired, any recalls of the associated object MUST carry an array of these referring identifiers (in the CB_SEQUENCE operation's arguments), for the benefit of the client. After this time, it is not necessary for the server to provide this information in related callbacks, since it is certain that a race condition can no longer occur. The CB_SEQUENCE operation that begins each server callback carries a list of "referring" { session ID, slot ID, sequence ID } triples. If the client finds the request corresponding to the referring session ID, slot ID, and sequence ID to be currently outstanding (i.e., the server's reply has not been seen by the client), it can determine that the callback has raced the reply, and act accordingly. If the client does not find the request corresponding to the referring triple to be outstanding (including the case of a session ID referring to a destroyed session), then there is no race with respect to this triple. The server SHOULD limit the referring triples to requests that refer to just those that apply to the objects referred to in the CB_COMPOUND procedure. The client must not simply wait forever for the expected server reply to arrive before responding to the CB_COMPOUND that won the race, because it is possible that it will be delayed indefinitely. The client should assume the likely case that the reply will arrive within the average round-trip time for COMPOUND requests to the server, and wait that period of time. If that period of time expires, it can respond to the CB_COMPOUND with NFS4ERR_DELAY. There are other scenarios under which callbacks may race replies. Among them are pNFS layout recalls as described in Section 12.5.5.2.2.10.6.4. COMPOUND and CB_COMPOUND Construction Issues
Very large requests and replies may pose both buffer management issues (especially with RDMA) and reply cache issues. When the session is created (Section 18.36), for each channel (fore and back), the client and server negotiate the maximum-sized request they will send or process (ca_maxrequestsize), the maximum-sized reply they will return or process (ca_maxresponsesize), and the maximum-sized reply they will store in the reply cache (ca_maxresponsesize_cached).
If a request exceeds ca_maxrequestsize, the reply will have the status NFS4ERR_REQ_TOO_BIG. A replier MAY return NFS4ERR_REQ_TOO_BIG as the status for the first operation (SEQUENCE or CB_SEQUENCE) in the request (which means that no operations in the request executed and that the state of the slot in the reply cache is unchanged), or it MAY opt to return it on a subsequent operation in the same COMPOUND or CB_COMPOUND request (which means that at least one operation did execute and that the state of the slot in the reply cache does change). The replier SHOULD set NFS4ERR_REQ_TOO_BIG on the operation that exceeds ca_maxrequestsize. If a reply exceeds ca_maxresponsesize, the reply will have the status NFS4ERR_REP_TOO_BIG. A replier MAY return NFS4ERR_REP_TOO_BIG as the status for the first operation (SEQUENCE or CB_SEQUENCE) in the request, or it MAY opt to return it on a subsequent operation (in the same COMPOUND or CB_COMPOUND reply). A replier MAY return NFS4ERR_REP_TOO_BIG in the reply to SEQUENCE or CB_SEQUENCE, even if the response would still exceed ca_maxresponsesize. If sa_cachethis or csa_cachethis is TRUE, then the replier MUST cache a reply except if an error is returned by the SEQUENCE or CB_SEQUENCE operation (see Section 2.10.6.1.2). If the reply exceeds ca_maxresponsesize_cached (and sa_cachethis or csa_cachethis is TRUE), then the server MUST return NFS4ERR_REP_TOO_BIG_TO_CACHE. Even if NFS4ERR_REP_TOO_BIG_TO_CACHE (or any other error for that matter) is returned on an operation other than the first operation (SEQUENCE or CB_SEQUENCE), then the reply MUST be cached if sa_cachethis or csa_cachethis is TRUE. For example, if a COMPOUND has eleven operations, including SEQUENCE, the fifth operation is a RENAME, and the tenth operation is a READ for one million bytes, the server may return NFS4ERR_REP_TOO_BIG_TO_CACHE on the tenth operation. Since the server executed several operations, especially the non-idempotent RENAME, the client's request to cache the reply needs to be honored in order for the correct operation of exactly once semantics. If the client retries the request, the server will have cached a reply that contains results for ten of the eleven requested operations, with the tenth operation having a status of NFS4ERR_REP_TOO_BIG_TO_CACHE. A client needs to take care that when sending operations that change the current filehandle (except for PUTFH, PUTPUBFH, PUTROOTFH, and RESTOREFH), it not exceed the maximum reply buffer before the GETFH operation. Otherwise, the client will have to retry the operation that changed the current filehandle, in order to obtain the desired filehandle. For the OPEN operation (see Section 18.16), retry is not always available as an option. The following guidelines for the handling of filehandle-changing operations are advised:
o Within the same COMPOUND procedure, a client SHOULD send GETFH immediately after a current filehandle-changing operation. A client MUST send GETFH after a current filehandle-changing operation that is also non-idempotent (e.g., the OPEN operation), unless the operation is RESTOREFH. RESTOREFH is an exception, because even though it is non-idempotent, the filehandle RESTOREFH produced originated from an operation that is either idempotent (e.g., PUTFH, LOOKUP), or non-idempotent (e.g., OPEN, CREATE). If the origin is non-idempotent, then because the client MUST send GETFH after the origin operation, the client can recover if RESTOREFH returns an error. o A server MAY return NFS4ERR_REP_TOO_BIG or NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a filehandle-changing operation if the reply would be too large on the next operation. o A server SHOULD return NFS4ERR_REP_TOO_BIG or NFS4ERR_REP_TOO_BIG_TO_CACHE (if sa_cachethis is TRUE) on a filehandle-changing, non-idempotent operation if the reply would be too large on the next operation, especially if the operation is OPEN. o A server MAY return NFS4ERR_UNSAFE_COMPOUND to a non-idempotent current filehandle-changing operation, if it looks at the next operation (in the same COMPOUND procedure) and finds it is not GETFH. The server SHOULD do this if it is unable to determine in advance whether the total response size would exceed ca_maxresponsesize_cached or ca_maxresponsesize.2.10.6.5. Persistence
Since the reply cache is bounded, it is practical for the reply cache to persist across server restarts. The replier MUST persist the following information if it agreed to persist the session (when the session was created; see Section 18.36): o The session ID. o The slot table including the sequence ID and cached reply for each slot. The above are sufficient for a replier to provide EOS semantics for any requests that were sent and executed before the server restarted. If the replier is a client, then there is no need for it to persist any more information, unless the client will be persisting all other state across client restart, in which case, the server will never see any NFSv4.1-level protocol manifestation of a client restart. If the
replier is a server, with just the slot table and session ID persisting, any requests the client retries after the server restart will return the results that are cached in the reply cache, and any new requests (i.e., the sequence ID is one greater than the slot's sequence ID) MUST be rejected with NFS4ERR_DEADSESSION (returned by SEQUENCE). Such a session is considered dead. A server MAY re- animate a session after a server restart so that the session will accept new requests as well as retries. To re-animate a session, the server needs to persist additional information through server restart: o The client ID. This is a prerequisite to let the client create more sessions associated with the same client ID as the re- animated session. o The client ID's sequence ID that is used for creating sessions (see Sections 18.35 and 18.36). This is a prerequisite to let the client create more sessions. o The principal that created the client ID. This allows the server to authenticate the client when it sends EXCHANGE_ID. o The SSV, if SP4_SSV state protection was specified when the client ID was created (see Section 18.35). This lets the client create new sessions, and associate connections with the new and existing sessions. o The properties of the client ID as defined in Section 18.35. A persistent reply cache places certain demands on the server. The execution of the sequence of operations (starting with SEQUENCE) and placement of its results in the persistent cache MUST be atomic. If a client retries a sequence of operations that was previously executed on the server, the only acceptable outcomes are either the original cached reply or an indication that the client ID or session has been lost (indicating a catastrophic loss of the reply cache or a session that has been deleted because the client failed to use the session for an extended period of time). A server could fail and restart in the middle of a COMPOUND procedure that contains one or more non-idempotent or idempotent-but-modifying operations. This creates an even higher challenge for atomic execution and placement of results in the reply cache. One way to view the problem is as a single transaction consisting of each operation in the COMPOUND followed by storing the result in persistent storage, then finally a transaction commit. If there is a failure before the transaction is committed, then the server rolls
back the transaction. If the server itself fails, then when it restarts, its recovery logic could roll back the transaction before starting the NFSv4.1 server. While the description of the implementation for atomic execution of the request and caching of the reply is beyond the scope of this document, an example implementation for NFSv2 [38] is described in [39].