2.10.7. RDMA Considerations
A complete discussion of the operation of RPC-based protocols over RDMA transports is in [8]. A discussion of the operation of NFSv4, including NFSv4.1, over RDMA is in [9]. Where RDMA is considered, this specification assumes the use of such a layering; it addresses only the upper-layer issues relevant to making best use of RPC/RDMA.2.10.7.1. RDMA Connection Resources
RDMA requires its consumers to register memory and post buffers of a specific size and number for receive operations. Registration of memory can be a relatively high-overhead operation, since it requires pinning of buffers, assignment of attributes (e.g., readable/writable), and initialization of hardware translation. Preregistration is desirable to reduce overhead. These registrations are specific to hardware interfaces and even to RDMA connection endpoints; therefore, negotiation of their limits is desirable to manage resources effectively. Following basic registration, these buffers must be posted by the RPC layer to handle receives. These buffers remain in use by the RPC/ NFSv4.1 implementation; the size and number of them must be known to the remote peer in order to avoid RDMA errors that would cause a fatal error on the RDMA connection. NFSv4.1 manages slots as resources on a per-session basis (see Section 2.10), while RDMA connections manage credits on a per- connection basis. This means that in order for a peer to send data over RDMA to a remote buffer, it has to have both an NFSv4.1 slot and an RDMA credit. If multiple RDMA connections are associated with a session, then if the total number of credits across all RDMA connections associated with the session is X, and the number of slots in the session is Y, then the maximum number of outstanding requests is the lesser of X and Y.
2.10.7.2. Flow Control
Previous versions of NFS do not provide flow control; instead, they rely on the windowing provided by transports like TCP to throttle requests. This does not work with RDMA, which provides no operation flow control and will terminate a connection in error when limits are exceeded. Limits such as maximum number of requests outstanding are therefore negotiated when a session is created (see the ca_maxrequests field in Section 18.36). These limits then provide the maxima within which each connection associated with the session's channel(s) must remain. RDMA connections are managed within these limits as described in Section 3.3 of [8]; if there are multiple RDMA connections, then the maximum number of requests for a channel will be divided among the RDMA connections. Put a different way, the onus is on the replier to ensure that the total number of RDMA credits across all connections associated with the replier's channel does exceed the channel's maximum number of outstanding requests. The limits may also be modified dynamically at the replier's choosing by manipulating certain parameters present in each NFSv4.1 reply. In addition, the CB_RECALL_SLOT callback operation (see Section 20.8) can be sent by a server to a client to return RDMA credits to the server, thereby lowering the maximum number of requests a client can have outstanding to the server.2.10.7.3. Padding
Header padding is requested by each peer at session initiation (see the ca_headerpadsize argument to CREATE_SESSION in Section 18.36), and subsequently used by the RPC RDMA layer, as described in [8]. Zero padding is permitted. Padding leverages the useful property that RDMA preserve alignment of data, even when they are placed into anonymous (untagged) buffers. If requested, client inline writes will insert appropriate pad bytes within the request header to align the data payload on the specified boundary. The client is encouraged to add sufficient padding (up to the negotiated size) so that the "data" field of the WRITE operation is aligned. Most servers can make good use of such padding, which allows them to chain receive buffers in such a way that any data carried by client requests will be placed into appropriate buffers at the server, ready for file system processing. The receiver's RPC layer encounters no overhead from skipping over pad bytes, and the RDMA layer's high performance makes the insertion and transmission of padding on the sender a significant optimization. In this way, the need for servers to perform RDMA Read to satisfy all but the largest
client writes is obviated. An added benefit is the reduction of message round trips on the network -- a potentially good trade, where latency is present. The value to choose for padding is subject to a number of criteria. A primary source of variable-length data in the RPC header is the authentication information, the form of which is client-determined, possibly in response to server specification. The contents of COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all go into the determination of a maximal NFSv4.1 request size and therefore minimal buffer size. The client must select its offered value carefully, so as to avoid overburdening the server, and vice versa. The benefit of an appropriate padding value is higher performance. Sender gather: |RPC Request|Pad bytes|Length| -> |User data...| \------+----------------------/ \ \ \ \ Receiver scatter: \-----------+- ... /-----+----------------\ \ \ |RPC Request|Pad|Length| -> |FS buffer|->|FS buffer|->... In the above case, the server may recycle unused buffers to the next posted receive if unused by the actual received request, or may pass the now-complete buffers by reference for normal write processing. For a server that can make use of it, this removes any need for data copies of incoming data, without resorting to complicated end-to-end buffer advertisement and management. This includes most kernel-based and integrated server designs, among many others. The client may perform similar optimizations, if desired.2.10.7.4. Dual RDMA and Non-RDMA Transports
Some RDMA transports (e.g., RFC 5040 [10]) permit a "streaming" (non- RDMA) phase, where ordinary traffic might flow before "stepping up" to RDMA mode, commencing RDMA traffic. Some RDMA transports start connections always in RDMA mode. NFSv4.1 allows, but does not assume, a streaming phase before RDMA mode. When a connection is associated with a session, the client and server negotiate whether the connection is used in RDMA or non-RDMA mode (see Sections 18.36 and 18.34).
2.10.8. Session Security
2.10.8.1. Session Callback Security
Via session/connection association, NFSv4.1 improves security over that provided by NFSv4.0 for the backchannel. The connection is client-initiated (see Section 18.34) and subject to the same firewall and routing checks as the fore channel. At the client's option (see Section 18.35), connection association is fully authenticated before being activated (see Section 18.34). Traffic from the server over the backchannel is authenticated exactly as the client specifies (see Section 2.10.8.2).2.10.8.2. Backchannel RPC Security
When the NFSv4.1 client establishes the backchannel, it informs the server of the security flavors and principals to use when sending requests. If the security flavor is RPCSEC_GSS, the client expresses the principal in the form of an established RPCSEC_GSS context. The server is free to use any of the flavor/principal combinations the client offers, but it MUST NOT use unoffered combinations. This way, the client need not provide a target GSS principal for the backchannel as it did with NFSv4.0, nor does the server have to implement an RPCSEC_GSS initiator as it did with NFSv4.0 [30]. The CREATE_SESSION (Section 18.36) and BACKCHANNEL_CTL (Section 18.33) operations allow the client to specify flavor/ principal combinations. Also note that the SP4_SSV state protection mode (see Sections 18.35 and 2.10.8.3) has the side benefit of providing SSV-derived RPCSEC_GSS contexts (Section 2.10.9).2.10.8.3. Protection from Unauthorized State Changes
As described to this point in the specification, the state model of NFSv4.1 is vulnerable to an attacker that sends a SEQUENCE operation with a forged session ID and with a slot ID that it expects the legitimate client to use next. When the legitimate client uses the slot ID with the same sequence number, the server returns the attacker's result from the reply cache, which disrupts the legitimate client and thus denies service to it. Similarly, an attacker could send a CREATE_SESSION with a forged client ID to create a new session associated with the client ID. The attacker could send requests using the new session that change locking state, such as LOCKU operations to release locks the legitimate client has acquired. Setting a security policy on the file that requires RPCSEC_GSS credentials when manipulating the file's state is one potential work
around, but has the disadvantage of preventing a legitimate client from releasing state when RPCSEC_GSS is required to do so, but a GSS context cannot be obtained (possibly because the user has logged off the client). NFSv4.1 provides three options to a client for state protection, which are specified when a client creates a client ID via EXCHANGE_ID (Section 18.35). The first (SP4_NONE) is to simply waive state protection. The other two options (SP4_MACH_CRED and SP4_SSV) share several traits: o An RPCSEC_GSS-based credential is used to authenticate client ID and session maintenance operations, including creating and destroying a session, associating a connection with the session, and destroying the client ID. o Because RPCSEC_GSS is used to authenticate client ID and session maintenance, the attacker cannot associate a rogue connection with a legitimate session, or associate a rogue session with a legitimate client ID in order to maliciously alter the client ID's lock state via CLOSE, LOCKU, DELEGRETURN, LAYOUTRETURN, etc. o In cases where the server's security policies on a portion of its namespace require RPCSEC_GSS authentication, a client may have to use an RPCSEC_GSS credential to remove per-file state (e.g., LOCKU, CLOSE, etc.). The server may require that the principal that removes the state match certain criteria (e.g., the principal might have to be the same as the one that acquired the state). However, the client might not have an RPCSEC_GSS context for such a principal, and might not be able to create such a context (perhaps because the user has logged off). When the client establishes SP4_MACH_CRED or SP4_SSV protection, it can specify a list of operations that the server MUST allow using the machine credential (if SP4_MACH_CRED is used) or the SSV credential (if SP4_SSV is used). The SP4_MACH_CRED state protection option uses a machine credential where the principal that creates the client ID MUST also be the principal that performs client ID and session maintenance operations. The security of the machine credential state protection approach depends entirely on safe guarding the per-machine credential. Assuming a proper safeguard using the per-machine credential for operations like CREATE_SESSION, BIND_CONN_TO_SESSION,
DESTROY_SESSION, and DESTROY_CLIENTID will prevent an attacker from associating a rogue connection with a session, or associating a rogue session with a client ID. There are at least three scenarios for the SP4_MACH_CRED option: 1. The system administrator configures a unique, permanent per- machine credential for one of the mandated GSS mechanisms (e.g., if Kerberos V5 is used, a "keytab" containing a principal derived from a client host name could be used). 2. The client is used by a single user, and so the client ID and its sessions are used by just that user. If the user's credential expires, then session and client ID maintenance cannot occur, but since the client has a single user, only that user is inconvenienced. 3. The physical client has multiple users, but the client implementation has a unique client ID for each user. This is effectively the same as the second scenario, but a disadvantage is that each user needs to be allocated at least one session each, so the approach suffers from lack of economy. The SP4_SSV protection option uses the SSV (Section 1.6), via RPCSEC_GSS and the SSV GSS mechanism (Section 2.10.9), to protect state from attack. The SP4_SSV protection option is intended for the situation comprised of a client that has multiple active users and a system administrator who wants to avoid the burden of installing a permanent machine credential on each client. The SSV is established and updated on the server via SET_SSV (see Section 18.47). To prevent eavesdropping, a client SHOULD send SET_SSV via RPCSEC_GSS with the privacy service. Several aspects of the SSV make it intractable for an attacker to guess the SSV, and thus associate rogue connections with a session, and rogue sessions with a client ID: o The arguments to and results of SET_SSV include digests of the old and new SSV, respectively. o Because the initial value of the SSV is zero, therefore known, the client that opts for SP4_SSV protection and opts to apply SP4_SSV protection to BIND_CONN_TO_SESSION and CREATE_SESSION MUST send at least one SET_SSV operation before the first BIND_CONN_TO_SESSION operation or before the second CREATE_SESSION operation on a client ID. If it does not, the SSV mechanism will not generate tokens (Section 2.10.9). A client SHOULD send SET_SSV as soon as a session is created.
o A SET_SSV request does not replace the SSV with the argument to SET_SSV. Instead, the current SSV on the server is logically exclusive ORed (XORed) with the argument to SET_SSV. Each time a new principal uses a client ID for the first time, the client SHOULD send a SET_SSV with that principal's RPCSEC_GSS credentials, with RPCSEC_GSS service set to RPC_GSS_SVC_PRIVACY. Here are the types of attacks that can be attempted by an attacker named Eve on a victim named Bob, and how SP4_SSV protection foils each attack: o Suppose Eve is the first user to log into a legitimate client. Eve's use of an NFSv4.1 file system will cause the legitimate client to create a client ID with SP4_SSV protection, specifying that the BIND_CONN_TO_SESSION operation MUST use the SSV credential. Eve's use of the file system also causes an SSV to be created. The SET_SSV operation that creates the SSV will be protected by the RPCSEC_GSS context created by the legitimate client, which uses Eve's GSS principal and credentials. Eve can eavesdrop on the network while her RPCSEC_GSS context is created and the SET_SSV using her context is sent. Even if the legitimate client sends the SET_SSV with RPC_GSS_SVC_PRIVACY, because Eve knows her own credentials, she can decrypt the SSV. Eve can compute an RPCSEC_GSS credential that BIND_CONN_TO_SESSION will accept, and so associate a new connection with the legitimate session. Eve can change the slot ID and sequence state of a legitimate session, and/or the SSV state, in such a way that when Bob accesses the server via the same legitimate client, the legitimate client will be unable to use the session. The client's only recourse is to create a new client ID for Bob to use, and establish a new SSV for the client ID. The client will be unable to delete the old client ID, and will let the lease on the old client ID expire. Once the legitimate client establishes an SSV over the new session using Bob's RPCSEC_GSS context, Eve can use the new session via the legitimate client, but she cannot disrupt Bob. Moreover, because the client SHOULD have modified the SSV due to Eve using the new session, Bob cannot get revenge on Eve by associating a rogue connection with the session. The question is how did the legitimate client detect that Eve has hijacked the old session? When the client detects that a new principal, Bob, wants to use the session, it SHOULD have sent a SET_SSV, which leads to the following sub-scenarios:
* Let us suppose that from the rogue connection, Eve sent a SET_SSV with the same slot ID and sequence ID that the legitimate client later uses. The server will assume the SET_SSV sent with Bob's credentials is a retry, and return to the legitimate client the reply it sent Eve. However, unless Eve can correctly guess the SSV the legitimate client will use, the digest verification checks in the SET_SSV response will fail. That is an indication to the client that the session has apparently been hijacked. * Alternatively, Eve sent a SET_SSV with a different slot ID than the legitimate client uses for its SET_SSV. Then the digest verification of the SET_SSV sent with Bob's credentials fails on the server, and the error returned to the client makes it apparent that the session has been hijacked. * Alternatively, Eve sent an operation other than SET_SSV, but with the same slot ID and sequence that the legitimate client uses for its SET_SSV. The server returns to the legitimate client the response it sent Eve. The client sees that the response is not at all what it expects. The client assumes either session hijacking or a server bug, and either way destroys the old session. o Eve associates a rogue connection with the session as above, and then destroys the session. Again, Bob goes to use the server from the legitimate client, which sends a SET_SSV using Bob's credentials. The client receives an error that indicates that the session does not exist. When the client tries to create a new session, this will fail because the SSV it has does not match that which the server has, and now the client knows the session was hijacked. The legitimate client establishes a new client ID. o If Eve creates a connection before the legitimate client establishes an SSV, because the initial value of the SSV is zero and therefore known, Eve can send a SET_SSV that will pass the digest verification check. However, because the new connection has not been associated with the session, the SET_SSV is rejected for that reason. In summary, an attacker's disruption of state when SP4_SSV protection is in use is limited to the formative period of a client ID, its first session, and the establishment of the SSV. Once a non- malicious user uses the client ID, the client quickly detects any hijack and rectifies the situation. Once a non-malicious user successfully modifies the SSV, the attacker cannot use NFSv4.1 operations to disrupt the non-malicious user.
Note that neither the SP4_MACH_CRED nor SP4_SSV protection approaches prevent hijacking of a transport connection that has previously been associated with a session. If the goal of a counter-threat strategy is to prevent connection hijacking, the use of IPsec is RECOMMENDED. If a connection hijack occurs, the hijacker could in theory change locking state and negatively impact the service to legitimate clients. However, if the server is configured to require the use of RPCSEC_GSS with integrity or privacy on the affected file objects, and if EXCHGID4_FLAG_BIND_PRINC_STATEID capability (Section 18.35) is in force, this will thwart unauthorized attempts to change locking state.2.10.9. The Secret State Verifier (SSV) GSS Mechanism
The SSV provides the secret key for a GSS mechanism internal to NFSv4.1 that NFSv4.1 uses for state protection. Contexts for this mechanism are not established via the RPCSEC_GSS protocol. Instead, the contexts are automatically created when EXCHANGE_ID specifies SP4_SSV protection. The only tokens defined are the PerMsgToken (emitted by GSS_GetMIC) and the SealedMessage token (emitted by GSS_Wrap). The mechanism OID for the SSV mechanism is iso.org.dod.internet.private.enterprise.Michael Eisler.nfs.ssv_mech (1.3.6.1.4.1.28882.1.1). While the SSV mechanism does not define any initial context tokens, the OID can be used to let servers indicate that the SSV mechanism is acceptable whenever the client sends a SECINFO or SECINFO_NO_NAME operation (see Section 2.6). The SSV mechanism defines four subkeys derived from the SSV value. Each time SET_SSV is invoked, the subkeys are recalculated by the client and server. The calculation of each of the four subkeys depends on each of the four respective ssv_subkey4 enumerated values. The calculation uses the HMAC [11] algorithm, using the current SSV as the key, the one-way hash algorithm as negotiated by EXCHANGE_ID, and the input text as represented by the XDR encoded enumeration value for that subkey of data type ssv_subkey4. If the length of the output of the HMAC algorithm exceeds the length of key of the encryption algorithm (which is also negotiated by EXCHANGE_ID), then the subkey MUST be truncated from the HMAC output, i.e., if the subkey is of N bytes long, then the first N bytes of the HMAC output MUST be used for the subkey. The specification of EXCHANGE_ID states that the length of the output of the HMAC algorithm MUST NOT be less than the length of subkey needed for the encryption algorithm (see Section 18.35).
/* Input for computing subkeys */ enum ssv_subkey4 { SSV4_SUBKEY_MIC_I2T = 1, SSV4_SUBKEY_MIC_T2I = 2, SSV4_SUBKEY_SEAL_I2T = 3, SSV4_SUBKEY_SEAL_T2I = 4 }; The subkey derived from SSV4_SUBKEY_MIC_I2T is used for calculating message integrity codes (MICs) that originate from the NFSv4.1 client, whether as part of a request over the fore channel or a response over the backchannel. The subkey derived from SSV4_SUBKEY_MIC_T2I is used for MICs originating from the NFSv4.1 server. The subkey derived from SSV4_SUBKEY_SEAL_I2T is used for encryption text originating from the NFSv4.1 client, and the subkey derived from SSV4_SUBKEY_SEAL_T2I is used for encryption text originating from the NFSv4.1 server. The PerMsgToken description is based on an XDR definition: /* Input for computing smt_hmac */ struct ssv_mic_plain_tkn4 { uint32_t smpt_ssv_seq; opaque smpt_orig_plain<>; }; /* SSV GSS PerMsgToken token */ struct ssv_mic_tkn4 { uint32_t smt_ssv_seq; opaque smt_hmac<>; }; The field smt_hmac is an HMAC calculated by using the subkey derived from SSV4_SUBKEY_MIC_I2T or SSV4_SUBKEY_MIC_T2I as the key, the one- way hash algorithm as negotiated by EXCHANGE_ID, and the input text as represented by data of type ssv_mic_plain_tkn4. The field smpt_ssv_seq is the same as smt_ssv_seq. The field smpt_orig_plain is the "message" input passed to GSS_GetMIC() (see Section 2.3.1 of [7]). The caller of GSS_GetMIC() provides a pointer to a buffer containing the plain text. The SSV mechanism's entry point for GSS_GetMIC() encodes this into an opaque array, and the encoding will include an initial four-byte length, plus any necessary padding. Prepended to this will be the XDR encoded value of smpt_ssv_seq, thus making up an XDR encoding of a value of data type ssv_mic_plain_tkn4, which in turn is the input into the HMAC.
The token emitted by GSS_GetMIC() is XDR encoded and of XDR data type ssv_mic_tkn4. The field smt_ssv_seq comes from the SSV sequence number, which is equal to one after SET_SSV (Section 18.47) is called the first time on a client ID. Thereafter, the SSV sequence number is incremented on each SET_SSV. Thus, smt_ssv_seq represents the version of the SSV at the time GSS_GetMIC() was called. As noted in Section 18.35, the client and server can maintain multiple concurrent versions of the SSV. This allows the SSV to be changed without serializing all RPC calls that use the SSV mechanism with SET_SSV operations. Once the HMAC is calculated, it is XDR encoded into smt_hmac, which will include an initial four-byte length, and any necessary padding. Prepended to this will be the XDR encoded value of smt_ssv_seq. The SealedMessage description is based on an XDR definition: /* Input for computing ssct_encr_data and ssct_hmac */ struct ssv_seal_plain_tkn4 { opaque sspt_confounder<>; uint32_t sspt_ssv_seq; opaque sspt_orig_plain<>; opaque sspt_pad<>; }; /* SSV GSS SealedMessage token */ struct ssv_seal_cipher_tkn4 { uint32_t ssct_ssv_seq; opaque ssct_iv<>; opaque ssct_encr_data<>; opaque ssct_hmac<>; }; The token emitted by GSS_Wrap() is XDR encoded and of XDR data type ssv_seal_cipher_tkn4. The ssct_ssv_seq field has the same meaning as smt_ssv_seq. The ssct_encr_data field is the result of encrypting a value of the XDR encoded data type ssv_seal_plain_tkn4. The encryption key is the subkey derived from SSV4_SUBKEY_SEAL_I2T or SSV4_SUBKEY_SEAL_T2I, and the encryption algorithm is that negotiated by EXCHANGE_ID. The ssct_iv field is the initialization vector (IV) for the encryption algorithm (if applicable) and is sent in clear text. The content and size of the IV MUST comply with the specification of the encryption algorithm. For example, the id-aes256-CBC algorithm MUST
use a 16-byte initialization vector (IV), which MUST be unpredictable for each instance of a value of data type ssv_seal_plain_tkn4 that is encrypted with a particular SSV key. The ssct_hmac field is the result of computing an HMAC using the value of the XDR encoded data type ssv_seal_plain_tkn4 as the input text. The key is the subkey derived from SSV4_SUBKEY_MIC_I2T or SSV4_SUBKEY_MIC_T2I, and the one-way hash algorithm is that negotiated by EXCHANGE_ID. The sspt_confounder field is a random value. The sspt_ssv_seq field is the same as ssvt_ssv_seq. The field sspt_orig_plain field is the original plaintext and is the "input_message" input passed to GSS_Wrap() (see Section 2.3.3 of [7]). As with the handling of the plaintext by the SSV mechanism's GSS_GetMIC() entry point, the entry point for GSS_Wrap() expects a pointer to the plaintext, and will XDR encode an opaque array into sspt_orig_plain representing the plain text, along with the other fields of an instance of data type ssv_seal_plain_tkn4. The sspt_pad field is present to support encryption algorithms that require inputs to be in fixed-sized blocks. The content of sspt_pad is zero filled except for the length. Beware that the XDR encoding of ssv_seal_plain_tkn4 contains three variable-length arrays, and so each array consumes four bytes for an array length, and each array that follows the length is always padded to a multiple of four bytes per the XDR standard. For example, suppose the encryption algorithm uses 16-byte blocks, and the sspt_confounder is three bytes long, and the sspt_orig_plain field is 15 bytes long. The XDR encoding of sspt_confounder uses eight bytes (4 + 3 + 1 byte pad), the XDR encoding of sspt_ssv_seq uses four bytes, the XDR encoding of sspt_orig_plain uses 20 bytes (4 + 15 + 1 byte pad), and the smallest XDR encoding of the sspt_pad field is four bytes. This totals 36 bytes. The next multiple of 16 is 48; thus, the length field of sspt_pad needs to be set to 12 bytes, or a total encoding of 16 bytes. The total number of XDR encoded bytes is thus 8 + 4 + 20 + 16 = 48. GSS_Wrap() emits a token that is an XDR encoding of a value of data type ssv_seal_cipher_tkn4. Note that regardless of whether or not the caller of GSS_Wrap() requests confidentiality, the token always has confidentiality. This is because the SSV mechanism is for RPCSEC_GSS, and RPCSEC_GSS never produces GSS_wrap() tokens without confidentiality.
There is one SSV per client ID. There is a single GSS context for a client ID / SSV pair. All SSV mechanism RPCSEC_GSS handles of a client ID / SSV pair share the same GSS context. SSV GSS contexts do not expire except when the SSV is destroyed (causes would include the client ID being destroyed or a server restart). Since one purpose of context expiration is to replace keys that have been in use for "too long", hence vulnerable to compromise by brute force or accident, the client can replace the SSV key by sending periodic SET_SSV operations, which is done by cycling through different users' RPCSEC_GSS credentials. This way, the SSV is replaced without destroying the SSV's GSS contexts. SSV RPCSEC_GSS handles can be expired or deleted by the server at any time, and the EXCHANGE_ID operation can be used to create more SSV RPCSEC_GSS handles. Expiration of SSV RPCSEC_GSS handles does not imply that the SSV or its GSS context has expired. The client MUST establish an SSV via SET_SSV before the SSV GSS context can be used to emit tokens from GSS_Wrap() and GSS_GetMIC(). If SET_SSV has not been successfully called, attempts to emit tokens MUST fail. The SSV mechanism does not support replay detection and sequencing in its tokens because RPCSEC_GSS does not use those features (See Section 5.2.2, "Context Creation Requests", in [4]). However, Section 2.10.10 discusses special considerations for the SSV mechanism when used with RPCSEC_GSS.2.10.10. Security Considerations for RPCSEC_GSS When Using the SSV Mechanism
When a client ID is created with SP4_SSV state protection (see Section 18.35), the client is permitted to associate multiple RPCSEC_GSS handles with the single SSV GSS context (see Section 2.10.9). Because of the way RPCSEC_GSS (both version 1 and version 2, see [4] and [12]) calculate the verifier of the reply, special care must be taken by the implementation of the NFSv4.1 client to prevent attacks by a man-in-the-middle. The verifier of an RPCSEC_GSS reply is the output of GSS_GetMIC() applied to the input value of the seq_num field of the RPCSEC_GSS credential (data type rpc_gss_cred_ver_1_t) (see Section 5.3.3.2 of [4]). If multiple RPCSEC_GSS handles share the same GSS context, then if one handle is used to send a request with the same seq_num value as another handle, an attacker could block the reply, and replace it with the verifier used for the other handle. There are multiple ways to prevent the attack on the SSV RPCSEC_GSS verifier in the reply. The simplest is believed to be as follows.
o Each time one or more new SSV RPCSEC_GSS handles are created via EXCHANGE_ID, the client SHOULD send a SET_SSV operation to modify the SSV. By changing the SSV, the new handles will not result in the re-use of an SSV RPCSEC_GSS verifier in a reply. o When a requester decides to use N SSV RPCSEC_GSS handles, it SHOULD assign a unique and non-overlapping range of seq_nums to each SSV RPCSEC_GSS handle. The size of each range SHOULD be equal to MAXSEQ / N (see Section 5 of [4] for the definition of MAXSEQ). When an SSV RPCSEC_GSS handle reaches its maximum, it SHOULD force the replier to destroy the handle by sending a NULL RPC request with seq_num set to MAXSEQ + 1 (see Section 5.3.3.3 of [4]). o When the requester wants to increase or decrease N, it SHOULD force the replier to destroy all N handles by sending a NULL RPC request on each handle with seq_num set to MAXSEQ + 1. If the requester is the client, it SHOULD send a SET_SSV operation before using new handles. If the requester is the server, then the client SHOULD send a SET_SSV operation when it detects that the server has forced it to destroy a backchannel's SSV RPCSEC_GSS handle. By sending a SET_SSV operation, the SSV will change, and so the attacker will be unavailable to successfully replay a previous verifier in a reply to the requester. Note that if the replier carefully creates the SSV RPCSEC_GSS handles, the related risk of a man-in-the-middle splicing a forged SSV RPCSEC_GSS credential with a verifier for another handle does not exist. This is because the verifier in an RPCSEC_GSS request is computed from input that includes both the RPCSEC_GSS handle and seq_num (see Section 5.3.1 of [4]). Provided the replier takes care to avoid re-using the value of an RPCSEC_GSS handle that it creates, such as by including a generation number in the handle, the man-in- the-middle will not be able to successfully replay a previous verifier in the request to a replier.2.10.11. Session Mechanics - Steady State
2.10.11.1. Obligations of the Server
The server has the primary obligation to monitor the state of backchannel resources that the client has created for the server (RPCSEC_GSS contexts and backchannel connections). If these resources vanish, the server takes action as specified in Section 2.10.13.2.
2.10.11.2. Obligations of the Client
The client SHOULD honor the following obligations in order to utilize the session: o Keep a necessary session from going idle on the server. A client that requires a session but nonetheless is not sending operations risks having the session be destroyed by the server. This is because sessions consume resources, and resource limitations may force the server to cull an inactive session. A server MAY consider a session to be inactive if the client has not used the session before the session inactivity timer (Section 2.10.12) has expired. o Destroy the session when not needed. If a client has multiple sessions, one of which has no requests waiting for replies, and has been idle for some period of time, it SHOULD destroy the session. o Maintain GSS contexts and RPCSEC_GSS handles for the backchannel. If the client requires the server to use the RPCSEC_GSS security flavor for callbacks, then it needs to be sure the RPCSEC_GSS handles and/or their GSS contexts that are handed to the server via BACKCHANNEL_CTL or CREATE_SESSION are unexpired. o Preserve a connection for a backchannel. The server requires a backchannel in order to gracefully recall recallable state or notify the client of certain events. Note that if the connection is not being used for the fore channel, there is no way for the client to tell if the connection is still alive (e.g., the server restarted without sending a disconnect). The onus is on the server, not the client, to determine if the backchannel's connection is alive, and to indicate in the response to a SEQUENCE operation when the last connection associated with a session's backchannel has disconnected.2.10.11.3. Steps the Client Takes to Establish a Session
If the client does not have a client ID, the client sends EXCHANGE_ID to establish a client ID. If it opts for SP4_MACH_CRED or SP4_SSV protection, in the spo_must_enforce list of operations, it SHOULD at minimum specify CREATE_SESSION, DESTROY_SESSION, BIND_CONN_TO_SESSION, BACKCHANNEL_CTL, and DESTROY_CLIENTID. If it opts for SP4_SSV protection, the client needs to ask for SSV-based RPCSEC_GSS handles.
The client uses the client ID to send a CREATE_SESSION on a connection to the server. The results of CREATE_SESSION indicate whether or not the server will persist the session reply cache through a server that has restarted, and the client notes this for future reference. If the client specified SP4_SSV state protection when the client ID was created, then it SHOULD send SET_SSV in the first COMPOUND after the session is created. Each time a new principal goes to use the client ID, it SHOULD send a SET_SSV again. If the client wants to use delegations, layouts, directory notifications, or any other state that requires a backchannel, then it needs to add a connection to the backchannel if CREATE_SESSION did not already do so. The client creates a connection, and calls BIND_CONN_TO_SESSION to associate the connection with the session and the session's backchannel. If CREATE_SESSION did not already do so, the client MUST tell the server what security is required in order for the client to accept callbacks. The client does this via BACKCHANNEL_CTL. If the client selected SP4_MACH_CRED or SP4_SSV protection when it called EXCHANGE_ID, then the client SHOULD specify that the backchannel use RPCSEC_GSS contexts for security. If the client wants to use additional connections for the backchannel, then it needs to call BIND_CONN_TO_SESSION on each connection it wants to use with the session. If the client wants to use additional connections for the fore channel, then it needs to call BIND_CONN_TO_SESSION if it specified SP4_SSV or SP4_MACH_CRED state protection when the client ID was created. At this point, the session has reached steady state.2.10.12. Session Inactivity Timer
The server MAY maintain a session inactivity timer for each session. If the session inactivity timer expires, then the server MAY destroy the session. To avoid losing a session due to inactivity, the client MUST renew the session inactivity timer. The length of session inactivity timer MUST NOT be less than the lease_time attribute (Section 5.8.1.11). As with lease renewal (Section 8.3), when the server receives a SEQUENCE operation, it resets the session inactivity timer, and MUST NOT allow the timer to expire while the rest of the operations in the COMPOUND procedure's request are still executing. Once the last operation has finished, the server MUST set the session inactivity timer to expire no sooner than the sum of the current time and the value of the lease_time attribute.
2.10.13. Session Mechanics - Recovery
2.10.13.1. Events Requiring Client Action
The following events require client action to recover.2.10.13.1.1. RPCSEC_GSS Context Loss by Callback Path
If all RPCSEC_GSS handles granted by the client to the server for callback use have expired, the client MUST establish a new handle via BACKCHANNEL_CTL. The sr_status_flags field of the SEQUENCE results indicates when callback handles are nearly expired, or fully expired (see Section 18.46.3).2.10.13.1.2. Connection Loss
If the client loses the last connection of the session and wants to retain the session, then it needs to create a new connection, and if, when the client ID was created, BIND_CONN_TO_SESSION was specified in the spo_must_enforce list, the client MUST use BIND_CONN_TO_SESSION to associate the connection with the session. If there was a request outstanding at the time of connection loss, then if the client wants to continue to use the session, it MUST retry the request, as described in Section 2.10.6.2. Note that it is not necessary to retry requests over a connection with the same source network address or the same destination network address as the lost connection. As long as the session ID, slot ID, and sequence ID in the retry match that of the original request, the server will recognize the request as a retry if it executed the request prior to disconnect. If the connection that was lost was the last one associated with the backchannel, and the client wants to retain the backchannel and/or prevent revocation of recallable state, the client needs to reconnect, and if it does, it MUST associate the connection to the session and backchannel via BIND_CONN_TO_SESSION. The server SHOULD indicate when it has no callback connection via the sr_status_flags result from SEQUENCE.2.10.13.1.3. Backchannel GSS Context Loss
Via the sr_status_flags result of the SEQUENCE operation or other means, the client will learn if some or all of the RPCSEC_GSS contexts it assigned to the backchannel have been lost. If the client wants to retain the backchannel and/or not put recallable state subject to revocation, the client needs to use BACKCHANNEL_CTL to assign new contexts.
2.10.13.1.4. Loss of Session
The replier might lose a record of the session. Causes include: o Replier failure and restart. o A catastrophe that causes the reply cache to be corrupted or lost on the media on which it was stored. This applies even if the replier indicated in the CREATE_SESSION results that it would persist the cache. o The server purges the session of a client that has been inactive for a very extended period of time. o As a result of configuration changes among a set of clustered servers, a network address previously connected to one server becomes connected to a different server that has no knowledge of the session in question. Such a configuration change will generally only happen when the original server ceases to function for a time. Loss of reply cache is equivalent to loss of session. The replier indicates loss of session to the requester by returning NFS4ERR_BADSESSION on the next operation that uses the session ID that refers to the lost session. After an event like a server restart, the client may have lost its connections. The client assumes for the moment that the session has not been lost. It reconnects, and if it specified connection association enforcement when the session was created, it invokes BIND_CONN_TO_SESSION using the session ID. Otherwise, it invokes SEQUENCE. If BIND_CONN_TO_SESSION or SEQUENCE returns NFS4ERR_BADSESSION, the client knows the session is not available to it when communicating with that network address. If the connection survives session loss, then the next SEQUENCE operation the client sends over the connection will get back NFS4ERR_BADSESSION. The client again knows the session was lost. Here is one suggested algorithm for the client when it gets NFS4ERR_BADSESSION. It is not obligatory in that, if a client does not want to take advantage of such features as trunking, it may omit parts of it. However, it is a useful example that draws attention to various possible recovery issues: 1. If the client has other connections to other server network addresses associated with the same session, attempt a COMPOUND with a single operation, SEQUENCE, on each of the other connections.
2. If the attempts succeed, the session is still alive, and this is a strong indicator that the server's network address has moved. The client might send an EXCHANGE_ID on the connection that returned NFS4ERR_BADSESSION to see if there are opportunities for client ID trunking (i.e., the same client ID and so_major are returned). The client might use DNS to see if the moved network address was replaced with another, so that the performance and availability benefits of session trunking can continue. 3. If the SEQUENCE requests fail with NFS4ERR_BADSESSION, then the session no longer exists on any of the server network addresses for which the client has connections associated with that session ID. It is possible the session is still alive and available on other network addresses. The client sends an EXCHANGE_ID on all the connections to see if the server owner is still listening on those network addresses. If the same server owner is returned but a new client ID is returned, this is a strong indicator of a server restart. If both the same server owner and same client ID are returned, then this is a strong indication that the server did delete the session, and the client will need to send a CREATE_SESSION if it has no other sessions for that client ID. If a different server owner is returned, the client can use DNS to find other network addresses. If it does not, or if DNS does not find any other addresses for the server, then the client will be unable to provide NFSv4.1 service, and fatal errors should be returned to processes that were using the server. If the client is using a "mount" paradigm, unmounting the server is advised. 4. If the client knows of no other connections associated with the session ID and server network addresses that are, or have been, associated with the session ID, then the client can use DNS to find other network addresses. If it does not, or if DNS does not find any other addresses for the server, then the client will be unable to provide NFSv4.1 service, and fatal errors should be returned to processes that were using the server. If the client is using a "mount" paradigm, unmounting the server is advised. If there is a reconfiguration event that results in the same network address being assigned to servers where the eir_server_scope value is different, it cannot be guaranteed that a session ID generated by the first will be recognized as invalid by the first. Therefore, in managing server reconfigurations among servers with different server scope values, it is necessary to make sure that all clients have disconnected from the first server before effecting the reconfiguration. Nonetheless, clients should not assume that servers will always adhere to this requirement; clients MUST be prepared to deal with unexpected effects of server reconfigurations. Even where a session ID is inappropriately recognized as valid, it is likely
either that the connection will not be recognized as valid or that a sequence value for a slot will not be correct. Therefore, when a client receives results indicating such unexpected errors, the use of EXCHANGE_ID to determine the current server configuration is RECOMMENDED. A variation on the above is that after a server's network address moves, there is no NFSv4.1 server listening, e.g., no listener on port 2049. In this example, one of the following occur: the NFSv4 server returns NFS4ERR_MINOR_VERS_MISMATCH, the NFS server returns a PROG_MISMATCH error, the RPC listener on 2049 returns PROG_UNVAIL, or attempts to reconnect to the network address timeout. These SHOULD be treated as equivalent to SEQUENCE returning NFS4ERR_BADSESSION for these purposes. When the client detects session loss, it needs to call CREATE_SESSION to recover. Any non-idempotent operations that were in progress might have been performed on the server at the time of session loss. The client has no general way to recover from this. Note that loss of session does not imply loss of byte-range lock, open, delegation, or layout state because locks, opens, delegations, and layouts are tied to the client ID and depend on the client ID, not the session. Nor does loss of byte-range lock, open, delegation, or layout state imply loss of session state, because the session depends on the client ID; loss of client ID however does imply loss of session, byte-range lock, open, delegation, and layout state. See Section 8.4.2. A session can survive a server restart, but lock recovery may still be needed. It is possible that CREATE_SESSION will fail with NFS4ERR_STALE_CLIENTID (e.g., the server restarts and does not preserve client ID state). If so, the client needs to call EXCHANGE_ID, followed by CREATE_SESSION.2.10.13.2. Events Requiring Server Action
The following events require server action to recover.2.10.13.2.1. Client Crash and Restart
As described in Section 18.35, a restarted client sends EXCHANGE_ID in such a way that it causes the server to delete any sessions it had.
2.10.13.2.2. Client Crash with No Restart
If a client crashes and never comes back, it will never send EXCHANGE_ID with its old client owner. Thus, the server has session state that will never be used again. After an extended period of time, and if the server has resource constraints, it MAY destroy the old session as well as locking state.2.10.13.2.3. Extended Network Partition
To the server, the extended network partition may be no different from a client crash with no restart (see Section 2.10.13.2.2). Unless the server can discern that there is a network partition, it is free to treat the situation as if the client has crashed permanently.2.10.13.2.4. Backchannel Connection Loss
If there were callback requests outstanding at the time of a connection loss, then the server MUST retry the requests, as described in Section 2.10.6.2. Note that it is not necessary to retry requests over a connection with the same source network address or the same destination network address as the lost connection. As long as the session ID, slot ID, and sequence ID in the retry match that of the original request, the callback target will recognize the request as a retry even if it did see the request prior to disconnect. If the connection lost is the last one associated with the backchannel, then the server MUST indicate that in the sr_status_flags field of every SEQUENCE reply until the backchannel is re-established. There are two situations, each of which uses different status flags: no connectivity for the session's backchannel and no connectivity for any session backchannel of the client. See Section 18.46 for a description of the appropriate flags in sr_status_flags.2.10.13.2.5. GSS Context Loss
The server SHOULD monitor when the number of RPCSEC_GSS handles assigned to the backchannel reaches one, and when that one handle is near expiry (i.e., between one and two periods of lease time), and indicate so in the sr_status_flags field of all SEQUENCE replies. The server MUST indicate when all of the backchannel's assigned RPCSEC_GSS handles have expired via the sr_status_flags field of all SEQUENCE replies.
2.10.14. Parallel NFS and Sessions
A client and server can potentially be a non-pNFS implementation, a metadata server implementation, a data server implementation, or two or three types of implementations. The EXCHGID4_FLAG_USE_NON_PNFS, EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not mutually exclusive) are passed in the EXCHANGE_ID arguments and results to allow the client to indicate how it wants to use sessions created under the client ID, and to allow the server to indicate how it will allow the sessions to be used. See Section 13.1 for pNFS sessions considerations.3. Protocol Constants and Data Types
The syntax and semantics to describe the data types of the NFSv4.1 protocol are defined in the XDR RFC 4506 [2] and RPC RFC 5531 [3] documents. The next sections build upon the XDR data types to define constants, types, and structures specific to this protocol. The full list of XDR data types is in [13].3.1. Basic Constants
const NFS4_FHSIZE = 128; const NFS4_VERIFIER_SIZE = 8; const NFS4_OPAQUE_LIMIT = 1024; const NFS4_SESSIONID_SIZE = 16; const NFS4_INT64_MAX = 0x7fffffffffffffff; const NFS4_UINT64_MAX = 0xffffffffffffffff; const NFS4_INT32_MAX = 0x7fffffff; const NFS4_UINT32_MAX = 0xffffffff; const NFS4_MAXFILELEN = 0xffffffffffffffff; const NFS4_MAXFILEOFF = 0xfffffffffffffffe; Except where noted, all these constants are defined in bytes. o NFS4_FHSIZE is the maximum size of a filehandle. o NFS4_VERIFIER_SIZE is the fixed size of a verifier. o NFS4_OPAQUE_LIMIT is the maximum size of certain opaque information. o NFS4_SESSIONID_SIZE is the fixed size of a session identifier. o NFS4_INT64_MAX is the maximum value of a signed 64-bit integer.
o NFS4_UINT64_MAX is the maximum value of an unsigned 64-bit integer. o NFS4_INT32_MAX is the maximum value of a signed 32-bit integer. o NFS4_UINT32_MAX is the maximum value of an unsigned 32-bit integer. o NFS4_MAXFILELEN is the maximum length of a regular file. o NFS4_MAXFILEOFF is the maximum offset into a regular file.3.2. Basic Data Types
These are the base NFSv4.1 data types. +---------------+---------------------------------------------------+ | Data Type | Definition | +---------------+---------------------------------------------------+ | int32_t | typedef int int32_t; | | uint32_t | typedef unsigned int uint32_t; | | int64_t | typedef hyper int64_t; | | uint64_t | typedef unsigned hyper uint64_t; | | attrlist4 | typedef opaque attrlist4<>; | | | Used for file/directory attributes. | | bitmap4 | typedef uint32_t bitmap4<>; | | | Used in attribute array encoding. | | changeid4 | typedef uint64_t changeid4; | | | Used in the definition of change_info4. | | clientid4 | typedef uint64_t clientid4; | | | Shorthand reference to client identification. | | count4 | typedef uint32_t count4; | | | Various count parameters (READ, WRITE, COMMIT). | | length4 | typedef uint64_t length4; | | | The length of a byte-range within a file. | | mode4 | typedef uint32_t mode4; | | | Mode attribute data type. | | nfs_cookie4 | typedef uint64_t nfs_cookie4; | | | Opaque cookie value for READDIR. | | nfs_fh4 | typedef opaque nfs_fh4<NFS4_FHSIZE>; | | | Filehandle definition. | | nfs_ftype4 | enum nfs_ftype4; | | | Various defined file types. | | nfsstat4 | enum nfsstat4; | | | Return value for operations. | | offset4 | typedef uint64_t offset4; | | | Various offset designations (READ, WRITE, LOCK, | | | COMMIT). |
| qop4 | typedef uint32_t qop4; | | | Quality of protection designation in SECINFO. | | sec_oid4 | typedef opaque sec_oid4<>; | | | Security Object Identifier. The sec_oid4 data | | | type is not really opaque. Instead, it contains | | | an ASN.1 OBJECT IDENTIFIER as used by GSS-API in | | | the mech_type argument to GSS_Init_sec_context. | | | See [7] for details. | | sequenceid4 | typedef uint32_t sequenceid4; | | | Sequence number used for various session | | | operations (EXCHANGE_ID, CREATE_SESSION, | | | SEQUENCE, CB_SEQUENCE). | | seqid4 | typedef uint32_t seqid4; | | | Sequence identifier used for locking. | | sessionid4 | typedef opaque sessionid4[NFS4_SESSIONID_SIZE]; | | | Session identifier. | | slotid4 | typedef uint32_t slotid4; | | | Sequencing artifact for various session | | | operations (SEQUENCE, CB_SEQUENCE). | | utf8string | typedef opaque utf8string<>; | | | UTF-8 encoding for strings. | | utf8str_cis | typedef utf8string utf8str_cis; | | | Case-insensitive UTF-8 string. | | utf8str_cs | typedef utf8string utf8str_cs; | | | Case-sensitive UTF-8 string. | | utf8str_mixed | typedef utf8string utf8str_mixed; | | | UTF-8 strings with a case-sensitive prefix and a | | | case-insensitive suffix. | | component4 | typedef utf8str_cs component4; | | | Represents pathname components. | | linktext4 | typedef utf8str_cs linktext4; | | | Symbolic link contents ("symbolic link" is | | | defined in an Open Group [14] standard). | | pathname4 | typedef component4 pathname4<>; | | | Represents pathname for fs_locations. | | verifier4 | typedef opaque verifier4[NFS4_VERIFIER_SIZE]; | | | Verifier used for various operations (COMMIT, | | | CREATE, EXCHANGE_ID, OPEN, READDIR, WRITE) | | | NFS4_VERIFIER_SIZE is defined as 8. | +---------------+---------------------------------------------------+ End of Base Data Types Table 1
3.3. Structured Data Types
3.3.1. nfstime4
struct nfstime4 { int64_t seconds; uint32_t nseconds; }; The nfstime4 data type gives the number of seconds and nanoseconds since midnight or zero hour January 1, 1970 Coordinated Universal Time (UTC). Values greater than zero for the seconds field denote dates after the zero hour January 1, 1970. Values less than zero for the seconds field denote dates before the zero hour January 1, 1970. In both cases, the nseconds field is to be added to the seconds field for the final time representation. For example, if the time to be represented is one-half second before zero hour January 1, 1970, the seconds field would have a value of negative one (-1) and the nseconds field would have a value of one-half second (500000000). Values greater than 999,999,999 for nseconds are invalid. This data type is used to pass time and date information. A server converts to and from its local representation of time when processing time values, preserving as much accuracy as possible. If the precision of timestamps stored for a file system object is less than defined, loss of precision can occur. An adjunct time maintenance protocol is RECOMMENDED to reduce client and server time skew.3.3.2. time_how4
enum time_how4 { SET_TO_SERVER_TIME4 = 0, SET_TO_CLIENT_TIME4 = 1 };3.3.3. settime4
union settime4 switch (time_how4 set_it) { case SET_TO_CLIENT_TIME4: nfstime4 time; default: void; }; The time_how4 and settime4 data types are used for setting timestamps in file object attributes. If set_it is SET_TO_SERVER_TIME4, then the server uses its local representation of time for the time value.
3.3.4. specdata4
struct specdata4 { uint32_t specdata1; /* major device number */ uint32_t specdata2; /* minor device number */ }; This data type represents the device numbers for the device file types NF4CHR and NF4BLK.3.3.5. fsid4
struct fsid4 { uint64_t major; uint64_t minor; };3.3.6. change_policy4
struct change_policy4 { uint64_t cp_major; uint64_t cp_minor; }; The change_policy4 data type is used for the change_policy RECOMMENDED attribute. It provides change sequencing indication analogous to the change attribute. To enable the server to present a value valid across server re-initialization without requiring persistent storage, two 64-bit quantities are used, allowing one to be a server instance ID and the second to be incremented non- persistently, within a given server instance.3.3.7. fattr4
struct fattr4 { bitmap4 attrmask; attrlist4 attr_vals; }; The fattr4 data type is used to represent file and directory attributes. The bitmap is a counted array of 32-bit integers used to contain bit values. The position of the integer in the array that contains bit n can be computed from the expression (n / 32), and its bit within that integer is (n mod 32).
0 1 +-----------+-----------+-----------+-- | count | 31 .. 0 | 63 .. 32 | +-----------+-----------+-----------+--3.3.8. change_info4
struct change_info4 { bool atomic; changeid4 before; changeid4 after; }; This data type is used with the CREATE, LINK, OPEN, REMOVE, and RENAME operations to let the client know the value of the change attribute for the directory in which the target file system object resides.3.3.9. netaddr4
struct netaddr4 { /* see struct rpcb in RFC 1833 */ string na_r_netid<>; /* network id */ string na_r_addr<>; /* universal address */ }; The netaddr4 data type is used to identify network transport endpoints. The r_netid and r_addr fields respectively contain a netid and uaddr. The netid and uaddr concepts are defined in [15]. The netid and uaddr formats for TCP over IPv4 and TCP over IPv6 are defined in [15], specifically Tables 2 and 3 and Sections 5.2.3.3 and 5.2.3.4.3.3.10. state_owner4
struct state_owner4 { clientid4 clientid; opaque owner<NFS4_OPAQUE_LIMIT>; }; typedef state_owner4 open_owner4; typedef state_owner4 lock_owner4; The state_owner4 data type is the base type for the open_owner4 (Section 3.3.10.1) and lock_owner4 (Section 3.3.10.2).
3.3.10.1. open_owner4
This data type is used to identify the owner of OPEN state.3.3.10.2. lock_owner4
This structure is used to identify the owner of byte-range locking state.3.3.11. open_to_lock_owner4
struct open_to_lock_owner4 { seqid4 open_seqid; stateid4 open_stateid; seqid4 lock_seqid; lock_owner4 lock_owner; }; This data type is used for the first LOCK operation done for an open_owner4. It provides both the open_stateid and lock_owner, such that the transition is made from a valid open_stateid sequence to that of the new lock_stateid sequence. Using this mechanism avoids the confirmation of the lock_owner/lock_seqid pair since it is tied to established state in the form of the open_stateid/open_seqid.3.3.12. stateid4
struct stateid4 { uint32_t seqid; opaque other[12]; }; This data type is used for the various state sharing mechanisms between the client and server. The client never modifies a value of data type stateid. The starting value of the "seqid" field is undefined. The server is required to increment the "seqid" field by one at each transition of the stateid. This is important since the client will inspect the seqid in OPEN stateids to determine the order of OPEN processing done by the server.3.3.13. layouttype4
enum layouttype4 { LAYOUT4_NFSV4_1_FILES = 0x1, LAYOUT4_OSD2_OBJECTS = 0x2, LAYOUT4_BLOCK_VOLUME = 0x3 };
This data type indicates what type of layout is being used. The file server advertises the layout types it supports through the fs_layout_type file system attribute (Section 5.12.1). A client asks for layouts of a particular type in LAYOUTGET, and processes those layouts in its layout-type-specific logic. The layouttype4 data type is 32 bits in length. The range represented by the layout type is split into three parts. Type 0x0 is reserved. Types within the range 0x00000001-0x7FFFFFFF are globally unique and are assigned according to the description in Section 22.4; they are maintained by IANA. Types within the range 0x80000000-0xFFFFFFFF are site specific and for private use only. The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file layout type, as defined in Section 13, is to be used. The LAYOUT4_OSD2_OBJECTS enumeration specifies that the object layout, as defined in [40], is to be used. Similarly, the LAYOUT4_BLOCK_VOLUME enumeration specifies that the block/volume layout, as defined in [41], is to be used.3.3.14. deviceid4
const NFS4_DEVICEID4_SIZE = 16; typedef opaque deviceid4[NFS4_DEVICEID4_SIZE]; Layout information includes device IDs that specify a storage device through a compact handle. Addressing and type information is obtained with the GETDEVICEINFO operation. Device IDs are not guaranteed to be valid across metadata server restarts. A device ID is unique per client ID and layout type. See Section 12.2.10 for more details.3.3.15. device_addr4
struct device_addr4 { layouttype4 da_layout_type; opaque da_addr_body<>; }; The device address is used to set up a communication channel with the storage device. Different layout types will require different data types to define how they communicate with storage devices. The opaque da_addr_body field is interpreted based on the specified da_layout_type field.
This document defines the device address for the NFSv4.1 file layout (see Section 13.3), which identifies a storage device by network IP address and port number. This is sufficient for the clients to communicate with the NFSv4.1 storage devices, and may be sufficient for other layout types as well. Device types for object-based storage devices and block storage devices (e.g., Small Computer System Interface (SCSI) volume labels) are defined by their respective layout specifications.3.3.16. layout_content4
struct layout_content4 { layouttype4 loc_type; opaque loc_body<>; }; The loc_body field is interpreted based on the layout type (loc_type). This document defines the loc_body for the NFSv4.1 file layout type; see Section 13.3 for its definition.3.3.17. layout4
struct layout4 { offset4 lo_offset; length4 lo_length; layoutiomode4 lo_iomode; layout_content4 lo_content; }; The layout4 data type defines a layout for a file. The layout type specific data is opaque within lo_content. Since layouts are sub- dividable, the offset and length together with the file's filehandle, the client ID, iomode, and layout type identify the layout.3.3.18. layoutupdate4
struct layoutupdate4 { layouttype4 lou_type; opaque lou_body<>; }; The layoutupdate4 data type is used by the client to return updated layout information to the metadata server via the LAYOUTCOMMIT (Section 18.42) operation. This data type provides a channel to pass layout type specific information (in field lou_body) back to the metadata server. For example, for the block/volume layout type, this could include the list of reserved blocks that were written. The contents of the opaque lou_body argument are determined by the layout
type. The NFSv4.1 file-based layout does not use this data type; if lou_type is LAYOUT4_NFSV4_1_FILES, the lou_body field MUST have a zero length.3.3.19. layouthint4
struct layouthint4 { layouttype4 loh_type; opaque loh_body<>; }; The layouthint4 data type is used by the client to pass in a hint about the type of layout it would like created for a particular file. It is the data type specified by the layout_hint attribute described in Section 5.12.4. The metadata server may ignore the hint or may selectively ignore fields within the hint. This hint should be provided at create time as part of the initial attributes within OPEN. The loh_body field is specific to the type of layout (loh_type). The NFSv4.1 file-based layout uses the nfsv4_1_file_layouthint4 data type as defined in Section 13.3.3.3.20. layoutiomode4
enum layoutiomode4 { LAYOUTIOMODE4_READ = 1, LAYOUTIOMODE4_RW = 2, LAYOUTIOMODE4_ANY = 3 }; The iomode specifies whether the client intends to just read or both read and write the data represented by the layout. While the LAYOUTIOMODE4_ANY iomode MUST NOT be used in the arguments to the LAYOUTGET operation, it MAY be used in the arguments to the LAYOUTRETURN and CB_LAYOUTRECALL operations. The LAYOUTIOMODE4_ANY iomode specifies that layouts pertaining to both LAYOUTIOMODE4_READ and LAYOUTIOMODE4_RW iomodes are being returned or recalled, respectively. The metadata server's use of the iomode may depend on the layout type being used. The storage devices MAY validate I/O accesses against the iomode and reject invalid accesses.3.3.21. nfs_impl_id4
struct nfs_impl_id4 { utf8str_cis nii_domain; utf8str_cs nii_name; nfstime4 nii_date; };
This data type is used to identify client and server implementation details. The nii_domain field is the DNS domain name with which the implementor is associated. The nii_name field is the product name of the implementation and is completely free form. It is RECOMMENDED that the nii_name be used to distinguish machine architecture, machine platforms, revisions, versions, and patch levels. The nii_date field is the timestamp of when the software instance was published or built.3.3.22. threshold_item4
struct threshold_item4 { layouttype4 thi_layout_type; bitmap4 thi_hintset; opaque thi_hintlist<>; }; This data type contains a list of hints specific to a layout type for helping the client determine when it should send I/O directly through the metadata server versus the storage devices. The data type consists of the layout type (thi_layout_type), a bitmap (thi_hintset) describing the set of hints supported by the server (they may differ based on the layout type), and a list of hints (thi_hintlist) whose content is determined by the hintset bitmap. See the mdsthreshold attribute for more details. The thi_hintset field is a bitmap of the following values:
+-------------------------+---+---------+---------------------------+ | name | # | Data | Description | | | | Type | | +-------------------------+---+---------+---------------------------+ | threshold4_read_size | 0 | length4 | If a file's length is | | | | | less than the value of | | | | | threshold4_read_size, | | | | | then it is RECOMMENDED | | | | | that the client read from | | | | | the file via the MDS and | | | | | not a storage device. | | threshold4_write_size | 1 | length4 | If a file's length is | | | | | less than the value of | | | | | threshold4_write_size, | | | | | then it is RECOMMENDED | | | | | that the client write to | | | | | the file via the MDS and | | | | | not a storage device. | | threshold4_read_iosize | 2 | length4 | For read I/O sizes below | | | | | this threshold, it is | | | | | RECOMMENDED to read data | | | | | through the MDS. | | threshold4_write_iosize | 3 | length4 | For write I/O sizes below | | | | | this threshold, it is | | | | | RECOMMENDED to write data | | | | | through the MDS. | +-------------------------+---+---------+---------------------------+3.3.23. mdsthreshold4
struct mdsthreshold4 { threshold_item4 mth_hints<>; }; This data type holds an array of elements of data type threshold_item4, each of which is valid for a particular layout type. An array is necessary because a server can support multiple layout types for a single file.