RFC 5661

Network File System (NFS) Version 4 Minor Version 1 Protocol

Pages: 617
Obsoleted by:  8881
Updated by:  8178, 8434
Part 4 of 20 – Pages 65 to 97

Top   ToC   RFC5661 - Page 65   prevText

2.10.7. RDMA Considerations

A complete discussion of the operation of RPC-based protocols over RDMA transports is in [8]. A discussion of the operation of NFSv4, including NFSv4.1, over RDMA is in [9]. Where RDMA is considered, this specification assumes the use of such a layering; it addresses only the upper-layer issues relevant to making best use of RPC/RDMA.
2.10.7.1. RDMA Connection Resources
   RDMA requires its consumers to register memory and post buffers of a
   specific size and number for receive operations.

   Registration of memory can be a relatively high-overhead operation,
   since it requires pinning of buffers, assignment of attributes (e.g.,
   readable/writable), and initialization of hardware translation.
   Preregistration is desirable to reduce overhead.  These registrations
   are specific to hardware interfaces and even to RDMA connection
   endpoints; therefore, negotiation of their limits is desirable to
   manage resources effectively.

   Following basic registration, these buffers must be posted by the RPC
   layer to handle receives.  These buffers remain in use by the
   RPC/NFSv4.1 implementation; the size and number of them must be known
   to the remote peer in order to avoid RDMA errors that would cause a
   fatal error on the RDMA connection.

   NFSv4.1 manages slots as resources on a per-session basis (see
   Section 2.10), while RDMA connections manage credits on a
   per-connection basis.  This means that in order for a peer to send
   data over RDMA to a remote buffer, it has to have both an NFSv4.1
   slot and an RDMA credit.  If multiple RDMA connections are associated
   with a session, the total number of credits across all of those
   connections is X, and the number of slots in the session is Y, then
   the maximum number of outstanding requests is the lesser of X and Y.
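
   As an informal illustration of the slot/credit rule (not part of the
   protocol), the limit on outstanding requests could be computed as
   sketched below.  The function and parameter names are hypothetical.

```python
def max_outstanding_requests(credits_per_connection, session_slots):
    """Return the lesser of X (total RDMA credits) and Y (session slots).

    credits_per_connection: credits granted on each RDMA connection
    associated with the session (hypothetical input).
    session_slots: number of slots in the session.
    """
    total_credits = sum(credits_per_connection)  # X
    return min(total_credits, session_slots)     # lesser of X and Y
```

   For instance, two connections with 4 credits each and a 24-slot
   session allow 8 outstanding requests, because credits are the
   scarcer resource.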
2.10.7.2. Flow Control
   Previous versions of NFS do not provide flow control; instead, they
   rely on the windowing provided by transports like TCP to throttle
   requests.  This does not work with RDMA, which provides no operation
   flow control and will terminate a connection in error when limits are
   exceeded.  Limits such as maximum number of requests outstanding are
   therefore negotiated when a session is created (see the
   ca_maxrequests field in Section 18.36).  These limits then provide
   the maxima within which each connection associated with the session's
   channel(s) must remain.  RDMA connections are managed within these
   limits as described in Section 3.3 of [8]; if there are multiple RDMA
   connections, then the maximum number of requests for a channel will
   be divided among the RDMA connections.  Put a different way, the onus
   is on the replier to ensure that the total number of RDMA credits
   across all connections associated with the replier's channel does not
   exceed the channel's maximum number of outstanding requests.

   The limits may also be modified dynamically at the replier's choosing
   by manipulating certain parameters present in each NFSv4.1 reply.  In
   addition, the CB_RECALL_SLOT callback operation (see Section 20.8)
   can be sent by a server to a client to return RDMA credits to the
   server, thereby lowering the maximum number of requests a client can
   have outstanding to the server.
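
   One way a replier might divide a channel's request limit among its
   RDMA connections is sketched below.  The even split is only an
   assumption made for illustration; this specification does not
   mandate any particular division, and the function name is
   hypothetical.

```python
def divide_credits(ca_maxrequests, num_connections):
    """Split a channel's request limit into per-connection credit grants.

    The grants always sum to ca_maxrequests, so the total number of
    credits stays within the channel's maximum.  An even split is an
    assumed policy, not a protocol requirement.
    """
    if num_connections <= 0:
        raise ValueError("at least one RDMA connection is required")
    base, extra = divmod(ca_maxrequests, num_connections)
    # The first `extra` connections receive one additional credit.
    return [base + (1 if i < extra else 0) for i in range(num_connections)]
```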
2.10.7.3. Padding
   Header padding is requested by each peer at session initiation (see
   the ca_headerpadsize argument to CREATE_SESSION in Section 18.36),
   and subsequently used by the RPC RDMA layer, as described in [8].
   Zero padding is permitted.

   Padding leverages the useful property that RDMA preserves the
   alignment of data, even when the data are placed into anonymous
   (untagged) buffers.  If requested, client inline writes will insert
   appropriate pad bytes within the request header to align the data
   payload on the specified boundary.  The client is encouraged to add
   sufficient padding (up to the negotiated size) so that the "data"
   field of the WRITE operation is aligned.  Most servers can make good
   use of such padding, which allows them to chain receive buffers in
   such a way that any data carried by client requests will be placed
   into appropriate buffers at the server, ready for file system
   processing.  The receiver's RPC layer encounters no overhead from
   skipping over pad bytes, and the RDMA layer's high performance makes
   the insertion and transmission of padding on the sender a significant
   optimization.  In this way, the need for servers to perform RDMA Read
   to satisfy all but the largest client writes is obviated.  An added
   benefit is the reduction of message round trips on the network -- a
   potentially good trade, where latency is present.

   The value to choose for padding is subject to a number of criteria.
   A primary source of variable-length data in the RPC header is the
   authentication information, the form of which is client-determined,
   possibly in response to server specification.  The contents of
   COMPOUNDs, sizes of strings such as those passed to RENAME, etc. all
   go into the determination of a maximal NFSv4.1 request size and
   therefore minimal buffer size.  The client must select its offered
   value carefully, so as to avoid overburdening the server, and vice
   versa.  The benefit of an appropriate padding value is higher
   performance.

                    Sender gather:
        |RPC Request|Pad  bytes|Length| -> |User data...|
        \------+----------------------/      \
                \                             \
                 \    Receiver scatter:        \-----------+- ...
            /-----+----------------\            \           \
            |RPC Request|Pad|Length|   ->  |FS buffer|->|FS buffer|->...

   In the above case, the server may recycle any buffers left unused by
   the received request to the next posted receive, or may pass the
   now-complete buffers by reference for normal write processing.
   For a server that can make use of it, this removes any need for data
   copies of incoming data, without resorting to complicated end-to-end
   buffer advertisement and management.  This includes most kernel-based
   and integrated server designs, among many others.  The client may
   perform similar optimizations, if desired.
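
   The pad computation a sender performs can be sketched as follows.
   This is an illustrative sketch only: the function name, its
   parameters, and the choice to send no padding when alignment cannot
   be reached within the negotiated size are assumptions, not
   requirements of this specification.

```python
def header_pad_bytes(header_len, alignment, negotiated_max):
    """Pad bytes to insert so the payload after the header is aligned.

    header_len: bytes preceding the data payload (hypothetical input).
    alignment: boundary on which the payload should start.
    negotiated_max: padding limit (cf. ca_headerpadsize).
    Returns 0 when the boundary cannot be reached within the limit
    (an assumed fallback policy).
    """
    if alignment <= 0:
        raise ValueError("alignment must be positive")
    pad = (-header_len) % alignment  # distance to the next boundary
    return pad if pad <= negotiated_max else 0
```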

2.10.7.4. Dual RDMA and Non-RDMA Transports
   Some RDMA transports (e.g., RFC 5040 [10]) permit a "streaming"
   (non-RDMA) phase, where ordinary traffic might flow before "stepping
   up" to RDMA mode, commencing RDMA traffic.  Some RDMA transports
   start connections always in RDMA mode.  NFSv4.1 allows, but does not
   assume, a streaming phase before RDMA mode.  When a connection is
   associated with a session, the client and server negotiate whether
   the connection is used in RDMA or non-RDMA mode (see Sections 18.36
   and 18.34).

2.10.8. Session Security

2.10.8.1. Session Callback Security
Via session/connection association, NFSv4.1 improves security over that provided by NFSv4.0 for the backchannel. The connection is client-initiated (see Section 18.34) and subject to the same firewall and routing checks as the fore channel. At the client's option (see Section 18.35), connection association is fully authenticated before being activated (see Section 18.34). Traffic from the server over the backchannel is authenticated exactly as the client specifies (see Section 2.10.8.2).
2.10.8.2. Backchannel RPC Security
   When the NFSv4.1 client establishes the backchannel, it informs the
   server of the security flavors and principals to use when sending
   requests.  If the security flavor is RPCSEC_GSS, the client expresses
   the principal in the form of an established RPCSEC_GSS context.  The
   server is free to use any of the flavor/principal combinations the
   client offers, but it MUST NOT use unoffered combinations.  This way,
   the client need not provide a target GSS principal for the
   backchannel as it did with NFSv4.0, nor does the server have to
   implement an RPCSEC_GSS initiator as it did with NFSv4.0 [30].

   The CREATE_SESSION (Section 18.36) and BACKCHANNEL_CTL
   (Section 18.33) operations allow the client to specify
   flavor/principal combinations.

   Also note that the SP4_SSV state protection mode (see Sections 18.35
   and 2.10.8.3) has the side benefit of providing SSV-derived
   RPCSEC_GSS contexts (Section 2.10.9).
2.10.8.3. Protection from Unauthorized State Changes
   As described to this point in the specification, the state model of
   NFSv4.1 is vulnerable to an attacker that sends a SEQUENCE operation
   with a forged session ID and with a slot ID that it expects the
   legitimate client to use next.  When the legitimate client uses the
   slot ID with the same sequence number, the server returns the
   attacker's result from the reply cache, which disrupts the legitimate
   client and thus denies service to it.  Similarly, an attacker could
   send a CREATE_SESSION with a forged client ID to create a new session
   associated with the client ID.  The attacker could send requests
   using the new session that change locking state, such as LOCKU
   operations to release locks the legitimate client has acquired.
   Setting a security policy on the file that requires RPCSEC_GSS
   credentials when manipulating the file's state is one potential
   workaround, but has the disadvantage of preventing a legitimate
   client from releasing state when RPCSEC_GSS is required to do so, but
   a GSS context cannot be obtained (possibly because the user has
   logged off the client).

   NFSv4.1 provides three options to a client for state protection,
   which are specified when a client creates a client ID via EXCHANGE_ID
   (Section 18.35).

   The first (SP4_NONE) is to simply waive state protection.

   The other two options (SP4_MACH_CRED and SP4_SSV) share several
   traits:

   o  An RPCSEC_GSS-based credential is used to authenticate client ID
      and session maintenance operations, including creating and
      destroying a session, associating a connection with the session,
      and destroying the client ID.

   o  Because RPCSEC_GSS is used to authenticate client ID and session
      maintenance, the attacker cannot associate a rogue connection with
      a legitimate session, or associate a rogue session with a
      legitimate client ID in order to maliciously alter the client ID's
      lock state via CLOSE, LOCKU, DELEGRETURN, LAYOUTRETURN, etc.

   o  In cases where the server's security policies on a portion of its
      namespace require RPCSEC_GSS authentication, a client may have to
      use an RPCSEC_GSS credential to remove per-file state (e.g.,
      LOCKU, CLOSE, etc.).  The server may require that the principal
      that removes the state match certain criteria (e.g., the principal
      might have to be the same as the one that acquired the state).
      However, the client might not have an RPCSEC_GSS context for such
      a principal, and might not be able to create such a context
      (perhaps because the user has logged off).  When the client
      establishes SP4_MACH_CRED or SP4_SSV protection, it can specify a
      list of operations that the server MUST allow using the machine
      credential (if SP4_MACH_CRED is used) or the SSV credential (if
      SP4_SSV is used).

   The SP4_MACH_CRED state protection option uses a machine credential
   where the principal that creates the client ID MUST also be the
   principal that performs client ID and session maintenance operations.
   The security of the machine credential state protection approach
   depends entirely on safeguarding the per-machine credential.
   Assuming a proper safeguard, using the per-machine credential for
   operations like CREATE_SESSION, BIND_CONN_TO_SESSION,
   DESTROY_SESSION, and DESTROY_CLIENTID will prevent an attacker from
   associating a rogue connection with a session, or associating a rogue
   session with a client ID.

   There are at least three scenarios for the SP4_MACH_CRED option:

   1.  The system administrator configures a unique, permanent per-
       machine credential for one of the mandated GSS mechanisms (e.g.,
       if Kerberos V5 is used, a "keytab" containing a principal derived
       from a client host name could be used).

   2.  The client is used by a single user, and so the client ID and its
       sessions are used by just that user.  If the user's credential
       expires, then session and client ID maintenance cannot occur, but
       since the client has a single user, only that user is
       inconvenienced.

   3.  The physical client has multiple users, but the client
       implementation has a unique client ID for each user.  This is
       effectively the same as the second scenario, but a disadvantage
       is that each user needs to be allocated at least one session
       each, so the approach suffers from lack of economy.

   The SP4_SSV protection option uses the SSV (Section 1.6), via
   RPCSEC_GSS and the SSV GSS mechanism (Section 2.10.9), to protect
   state from attack.  The SP4_SSV protection option is intended for the
   situation comprising a client that has multiple active users and a
   system administrator who wants to avoid the burden of installing a
   permanent machine credential on each client.  The SSV is established
   and updated on the server via SET_SSV (see Section 18.47).  To
   prevent eavesdropping, a client SHOULD send SET_SSV via RPCSEC_GSS
   with the privacy service.  Several aspects of the SSV make it
   intractable for an attacker to guess the SSV, and thus associate
   rogue connections with a session, and rogue sessions with a client
   ID:

   o  The arguments to and results of SET_SSV include digests of the old
      and new SSV, respectively.

   o  Because the initial value of the SSV is zero, and therefore known, the
      client that opts for SP4_SSV protection and opts to apply SP4_SSV
      protection to BIND_CONN_TO_SESSION and CREATE_SESSION MUST send at
      least one SET_SSV operation before the first BIND_CONN_TO_SESSION
      operation or before the second CREATE_SESSION operation on a
      client ID.  If it does not, the SSV mechanism will not generate
      tokens (Section 2.10.9).  A client SHOULD send SET_SSV as soon as
      a session is created.

   o  A SET_SSV request does not replace the SSV with the argument to
      SET_SSV.  Instead, the current SSV on the server is logically
      exclusive ORed (XORed) with the argument to SET_SSV.  Each time a
      new principal uses a client ID for the first time, the client
      SHOULD send a SET_SSV with that principal's RPCSEC_GSS
      credentials, with RPCSEC_GSS service set to RPC_GSS_SVC_PRIVACY.
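
   The XOR update described in the last item above can be sketched as
   follows.  This is a minimal sketch with a hypothetical function
   name; it assumes the SSV and the SET_SSV argument are byte strings
   of equal length.

```python
def apply_set_ssv(current_ssv: bytes, set_ssv_arg: bytes) -> bytes:
    """Return the new SSV: the current SSV XORed with the SET_SSV argument.

    Equal lengths are an assumption of this sketch.
    """
    if len(current_ssv) != len(set_ssv_arg):
        raise ValueError("SSV and SET_SSV argument must be the same length")
    return bytes(a ^ b for a, b in zip(current_ssv, set_ssv_arg))
```

   Because the result depends on the previous SSV as well as on the
   argument, a principal that knows only its own SET_SSV argument
   cannot determine the resulting SSV unless it also knows the prior
   value.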

   Here are the types of attacks that can be attempted by an attacker
   named Eve on a victim named Bob, and how SP4_SSV protection foils
   each attack:

   o  Suppose Eve is the first user to log into a legitimate client.
      Eve's use of an NFSv4.1 file system will cause the legitimate
      client to create a client ID with SP4_SSV protection, specifying
      that the BIND_CONN_TO_SESSION operation MUST use the SSV
      credential.  Eve's use of the file system also causes an SSV to be
      created.  The SET_SSV operation that creates the SSV will be
      protected by the RPCSEC_GSS context created by the legitimate
      client, which uses Eve's GSS principal and credentials.  Eve can
      eavesdrop on the network while her RPCSEC_GSS context is created
      and the SET_SSV using her context is sent.  Even if the legitimate
      client sends the SET_SSV with RPC_GSS_SVC_PRIVACY, because Eve
      knows her own credentials, she can decrypt the SSV.  Eve can
      compute an RPCSEC_GSS credential that BIND_CONN_TO_SESSION will
      accept, and so associate a new connection with the legitimate
      session.  Eve can change the slot ID and sequence state of a
      legitimate session, and/or the SSV state, in such a way that when
      Bob accesses the server via the same legitimate client, the
      legitimate client will be unable to use the session.

      The client's only recourse is to create a new client ID for Bob to
      use, and establish a new SSV for the client ID.  The client will
      be unable to delete the old client ID, and will let the lease on
      the old client ID expire.

      Once the legitimate client establishes an SSV over the new session
      using Bob's RPCSEC_GSS context, Eve can use the new session via
      the legitimate client, but she cannot disrupt Bob.  Moreover,
      because the client SHOULD have modified the SSV due to Eve using
      the new session, Bob cannot get revenge on Eve by associating a
      rogue connection with the session.

      The question is how did the legitimate client detect that Eve has
      hijacked the old session?  When the client detects that a new
      principal, Bob, wants to use the session, it SHOULD have sent a
      SET_SSV, which leads to the following sub-scenarios:

      *  Let us suppose that from the rogue connection, Eve sent a
         SET_SSV with the same slot ID and sequence ID that the
         legitimate client later uses.  The server will assume the
         SET_SSV sent with Bob's credentials is a retry, and return to
         the legitimate client the reply it sent Eve.  However, unless
         Eve can correctly guess the SSV the legitimate client will use,
         the digest verification checks in the SET_SSV response will
         fail.  That is an indication to the client that the session has
         apparently been hijacked.

      *  Alternatively, Eve sent a SET_SSV with a different slot ID than
         the legitimate client uses for its SET_SSV.  Then the digest
         verification of the SET_SSV sent with Bob's credentials fails
         on the server, and the error returned to the client makes it
         apparent that the session has been hijacked.

      *  Alternatively, Eve sent an operation other than SET_SSV, but
         with the same slot ID and sequence that the legitimate client
         uses for its SET_SSV.  The server returns to the legitimate
         client the response it sent Eve.  The client sees that the
         response is not at all what it expects.  The client assumes
         either session hijacking or a server bug, and either way
         destroys the old session.

   o  Eve associates a rogue connection with the session as above, and
      then destroys the session.  Again, Bob goes to use the server from
      the legitimate client, which sends a SET_SSV using Bob's
      credentials.  The client receives an error that indicates that the
      session does not exist.  When the client tries to create a new
      session, this will fail because the SSV it has does not match that
      which the server has, and now the client knows the session was
      hijacked.  The legitimate client establishes a new client ID.

   o  If Eve creates a connection before the legitimate client
      establishes an SSV, because the initial value of the SSV is zero
      and therefore known, Eve can send a SET_SSV that will pass the
      digest verification check.  However, because the new connection
      has not been associated with the session, the SET_SSV is rejected
      for that reason.

   In summary, an attacker's disruption of state when SP4_SSV protection
   is in use is limited to the formative period of a client ID, its
   first session, and the establishment of the SSV.  Once a non-
   malicious user uses the client ID, the client quickly detects any
   hijack and rectifies the situation.  Once a non-malicious user
   successfully modifies the SSV, the attacker cannot use NFSv4.1
   operations to disrupt the non-malicious user.

   Note that neither the SP4_MACH_CRED nor SP4_SSV protection approaches
   prevent hijacking of a transport connection that has previously been
   associated with a session.  If the goal of a counter-threat strategy
   is to prevent connection hijacking, the use of IPsec is RECOMMENDED.

   If a connection hijack occurs, the hijacker could in theory change
   locking state and negatively impact the service to legitimate
   clients.  However, if the server is configured to require the use of
   RPCSEC_GSS with integrity or privacy on the affected file objects,
   and if EXCHGID4_FLAG_BIND_PRINC_STATEID capability (Section 18.35) is
   in force, this will thwart unauthorized attempts to change locking
   state.

2.10.9. The Secret State Verifier (SSV) GSS Mechanism

   The SSV provides the secret key for a GSS mechanism internal to
   NFSv4.1 that NFSv4.1 uses for state protection.  Contexts for this
   mechanism are not established via the RPCSEC_GSS protocol.  Instead,
   the contexts are automatically created when EXCHANGE_ID specifies
   SP4_SSV protection.  The only tokens defined are the PerMsgToken
   (emitted by GSS_GetMIC) and the SealedMessage token (emitted by
   GSS_Wrap).

   The mechanism OID for the SSV mechanism is
   iso.org.dod.internet.private.enterprise.Michael Eisler.nfs.ssv_mech
   (1.3.6.1.4.1.28882.1.1).  While the SSV mechanism does not define any
   initial context tokens, the OID can be used to let servers indicate
   that the SSV mechanism is acceptable whenever the client sends a
   SECINFO or SECINFO_NO_NAME operation (see Section 2.6).

   The SSV mechanism defines four subkeys derived from the SSV value.
   Each time SET_SSV is invoked, the subkeys are recalculated by the
   client and server.  The calculation of each of the four subkeys
   depends on each of the four respective ssv_subkey4 enumerated values.
   The calculation uses the HMAC [11] algorithm, using the current SSV
   as the key, the one-way hash algorithm as negotiated by EXCHANGE_ID,
   and the input text as represented by the XDR encoded enumeration
   value for that subkey of data type ssv_subkey4.  If the length of the
   output of the HMAC algorithm exceeds the key length of the encryption
   algorithm (which is also negotiated by EXCHANGE_ID), then the subkey
   MUST be truncated from the HMAC output, i.e., if the subkey is N
   bytes long, then the first N bytes of the HMAC output MUST be used
   for the subkey.  The specification of EXCHANGE_ID states that the
   length of the output of the HMAC algorithm MUST NOT be less than the
   length of subkey needed for the encryption algorithm (see
   Section 18.35).

   /* Input for computing subkeys */
   enum ssv_subkey4 {
           SSV4_SUBKEY_MIC_I2T     = 1,
           SSV4_SUBKEY_MIC_T2I     = 2,
           SSV4_SUBKEY_SEAL_I2T    = 3,
           SSV4_SUBKEY_SEAL_T2I    = 4
   };

   The subkey derived from SSV4_SUBKEY_MIC_I2T is used for calculating
   message integrity codes (MICs) that originate from the NFSv4.1
   client, whether as part of a request over the fore channel or a
   response over the backchannel.  The subkey derived from
   SSV4_SUBKEY_MIC_T2I is used for MICs originating from the NFSv4.1
   server.  The subkey derived from SSV4_SUBKEY_SEAL_I2T is used for
   encrypting text originating from the NFSv4.1 client, and the subkey
   derived from SSV4_SUBKEY_SEAL_T2I is used for encrypting text
   originating from the NFSv4.1 server.
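
   The subkey derivation can be sketched as follows.  This is an
   illustrative sketch, not a normative implementation: the function
   name is hypothetical, and the hash name and key length passed by the
   caller stand in for the values negotiated by EXCHANGE_ID.

```python
import hmac
import struct

SSV4_SUBKEY_MIC_I2T = 1
SSV4_SUBKEY_MIC_T2I = 2
SSV4_SUBKEY_SEAL_I2T = 3
SSV4_SUBKEY_SEAL_T2I = 4

def derive_subkey(ssv: bytes, subkey_id: int, hash_name: str,
                  key_len: int) -> bytes:
    """Derive one SSV subkey: HMAC keyed by the SSV over the XDR enum.

    hash_name and key_len are caller-supplied stand-ins for the
    negotiated hash algorithm and required subkey length.
    """
    text = struct.pack(">i", subkey_id)  # XDR enum: 4-byte big-endian int
    digest = hmac.new(ssv, text, hash_name).digest()
    if len(digest) < key_len:
        raise ValueError("HMAC output is shorter than the required subkey")
    return digest[:key_len]  # the first N bytes of the HMAC output
```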

   The PerMsgToken description is based on an XDR definition:

   /* Input for computing smt_hmac */
   struct ssv_mic_plain_tkn4 {
     uint32_t        smpt_ssv_seq;
     opaque          smpt_orig_plain<>;
   };


   /* SSV GSS PerMsgToken token */
   struct ssv_mic_tkn4 {
     uint32_t        smt_ssv_seq;
     opaque          smt_hmac<>;
   };

   The field smt_hmac is an HMAC calculated by using the subkey derived
   from SSV4_SUBKEY_MIC_I2T or SSV4_SUBKEY_MIC_T2I as the key, the one-
   way hash algorithm as negotiated by EXCHANGE_ID, and the input text
   as represented by data of type ssv_mic_plain_tkn4.  The field
   smpt_ssv_seq is the same as smt_ssv_seq.  The field smpt_orig_plain
   is the "message" input passed to GSS_GetMIC() (see Section 2.3.1 of
   [7]).  The caller of GSS_GetMIC() provides a pointer to a buffer
   containing the plain text.  The SSV mechanism's entry point for
   GSS_GetMIC() encodes this into an opaque array, and the encoding will
   include an initial four-byte length, plus any necessary padding.
   Prepended to this will be the XDR encoded value of smpt_ssv_seq, thus
   making up an XDR encoding of a value of data type ssv_mic_plain_tkn4,
   which in turn is the input into the HMAC.

   The token emitted by GSS_GetMIC() is XDR encoded and of XDR data type
   ssv_mic_tkn4.  The field smt_ssv_seq comes from the SSV sequence
   number, which is equal to one after SET_SSV (Section 18.47) is called
   the first time on a client ID.  Thereafter, the SSV sequence number
   is incremented on each SET_SSV.  Thus, smt_ssv_seq represents the
   version of the SSV at the time GSS_GetMIC() was called.  As noted in
   Section 18.35, the client and server can maintain multiple concurrent
   versions of the SSV.  This allows the SSV to be changed without
   serializing all RPC calls that use the SSV mechanism with SET_SSV
   operations.  Once the HMAC is calculated, it is XDR encoded into
   smt_hmac, which will include an initial four-byte length, and any
   necessary padding.  Prepended to this will be the XDR encoded value
   of smt_ssv_seq.
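
   The construction of a PerMsgToken can be sketched as follows.  This
   is a minimal sketch under stated assumptions: the helper and
   function names are hypothetical, the MIC subkey is assumed to have
   been derived already, and "sha256" stands in for the hash algorithm
   negotiated by EXCHANGE_ID.

```python
import hmac
import struct

def xdr_opaque(data: bytes) -> bytes:
    """XDR variable-length opaque: 4-byte length, data, zero pad to 4."""
    pad = (-len(data)) % 4
    return struct.pack(">I", len(data)) + data + b"\x00" * pad

def ssv_get_mic(mic_subkey: bytes, ssv_seq: int, message: bytes,
                hash_name: str = "sha256") -> bytes:
    """Return an XDR-encoded ssv_mic_tkn4 for the given message."""
    # ssv_mic_plain_tkn4: smpt_ssv_seq followed by smpt_orig_plain
    plain = struct.pack(">I", ssv_seq) + xdr_opaque(message)
    mac = hmac.new(mic_subkey, plain, hash_name).digest()
    # ssv_mic_tkn4: smt_ssv_seq followed by smt_hmac
    return struct.pack(">I", ssv_seq) + xdr_opaque(mac)
```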

   The SealedMessage description is based on an XDR definition:

   /* Input for computing ssct_encr_data and ssct_hmac */
   struct ssv_seal_plain_tkn4 {
     opaque          sspt_confounder<>;
     uint32_t        sspt_ssv_seq;
     opaque          sspt_orig_plain<>;
     opaque          sspt_pad<>;
   };


   /* SSV GSS SealedMessage token */
   struct ssv_seal_cipher_tkn4 {
     uint32_t      ssct_ssv_seq;
     opaque        ssct_iv<>;
     opaque        ssct_encr_data<>;
     opaque        ssct_hmac<>;
   };

   The token emitted by GSS_Wrap() is XDR encoded and of XDR data type
   ssv_seal_cipher_tkn4.

   The ssct_ssv_seq field has the same meaning as smt_ssv_seq.

   The ssct_encr_data field is the result of encrypting a value of the
   XDR encoded data type ssv_seal_plain_tkn4.  The encryption key is the
   subkey derived from SSV4_SUBKEY_SEAL_I2T or SSV4_SUBKEY_SEAL_T2I, and
   the encryption algorithm is that negotiated by EXCHANGE_ID.

   The ssct_iv field is the initialization vector (IV) for the
   encryption algorithm (if applicable) and is sent in clear text.  The
   content and size of the IV MUST comply with the specification of the
   encryption algorithm.  For example, the id-aes256-CBC algorithm MUST
   use a 16-byte initialization vector (IV), which MUST be unpredictable
   for each instance of a value of data type ssv_seal_plain_tkn4 that is
   encrypted with a particular SSV key.

   The ssct_hmac field is the result of computing an HMAC using the
   value of the XDR encoded data type ssv_seal_plain_tkn4 as the input
   text.  The key is the subkey derived from SSV4_SUBKEY_MIC_I2T or
   SSV4_SUBKEY_MIC_T2I, and the one-way hash algorithm is that
   negotiated by EXCHANGE_ID.

   The sspt_confounder field is a random value.

   The sspt_ssv_seq field is the same as ssct_ssv_seq.

   The sspt_orig_plain field is the original plaintext and is the
   "input_message" input passed to GSS_Wrap() (see Section 2.3.3 of
   [7]).  As with the handling of the plaintext by the SSV mechanism's
   GSS_GetMIC() entry point, the entry point for GSS_Wrap() expects a
   pointer to the plaintext, and will XDR encode an opaque array into
   sspt_orig_plain representing the plain text, along with the other
   fields of an instance of data type ssv_seal_plain_tkn4.

   The sspt_pad field is present to support encryption algorithms that
   require inputs to be in fixed-sized blocks.  The content of sspt_pad
   is zero filled except for the length.  Beware that the XDR encoding
   of ssv_seal_plain_tkn4 contains three variable-length arrays, and so
   each array consumes four bytes for an array length, and each array
   that follows the length is always padded to a multiple of four bytes
   per the XDR standard.

   For example, suppose the encryption algorithm uses 16-byte blocks,
   and the sspt_confounder is three bytes long, and the sspt_orig_plain
   field is 15 bytes long.  The XDR encoding of sspt_confounder uses
   eight bytes (4 + 3 + 1 byte pad), the XDR encoding of sspt_ssv_seq
   uses four bytes, the XDR encoding of sspt_orig_plain uses 20 bytes (4
   + 15 + 1 byte pad), and the smallest XDR encoding of the sspt_pad
   field is four bytes.  This totals 36 bytes.  The next multiple of 16
   is 48; thus, the length field of sspt_pad needs to be set to 12
   bytes, or a total encoding of 16 bytes.  The total number of XDR
   encoded bytes is thus 8 + 4 + 20 + 16 = 48.
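
   The arithmetic in this example can be checked with a short sketch.
   The function names are hypothetical; the sketch only reproduces the
   length calculation and does not perform any encryption.

```python
def xdr_opaque_len(n: int) -> int:
    """Encoded size of an XDR variable-length opaque holding n bytes."""
    return 4 + n + (-n) % 4  # length word + data + pad to multiple of 4

def sspt_pad_length(confounder_len: int, plain_len: int,
                    block_size: int) -> int:
    """Smallest sspt_pad length making the encoding a block multiple."""
    # sspt_confounder + sspt_ssv_seq + sspt_orig_plain + sspt_pad length word
    fixed = xdr_opaque_len(confounder_len) + 4 + xdr_opaque_len(plain_len) + 4
    pad_len = 0
    while (fixed + pad_len) % block_size != 0:
        pad_len += 4  # XDR pads opaque data to a multiple of four bytes
    return pad_len

def seal_plain_encoded_len(confounder_len: int, plain_len: int,
                           block_size: int) -> int:
    """Total XDR-encoded size of ssv_seal_plain_tkn4 after padding."""
    pad_len = sspt_pad_length(confounder_len, plain_len, block_size)
    return (xdr_opaque_len(confounder_len) + 4
            + xdr_opaque_len(plain_len) + xdr_opaque_len(pad_len))
```

   With a 3-byte confounder, 15 bytes of plaintext, and 16-byte blocks,
   sspt_pad_length(3, 15, 16) yields 12 and the total encoding is 48
   bytes, matching the example.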

   GSS_Wrap() emits a token that is an XDR encoding of a value of data
   type ssv_seal_cipher_tkn4.  Note that regardless of whether or not
   the caller of GSS_Wrap() requests confidentiality, the token always
   has confidentiality.  This is because the SSV mechanism is for
   RPCSEC_GSS, and RPCSEC_GSS never produces GSS_Wrap() tokens without
   confidentiality.

   There is one SSV per client ID.  There is a single GSS context for a
   client ID / SSV pair.  All SSV mechanism RPCSEC_GSS handles of a
   client ID / SSV pair share the same GSS context.  SSV GSS contexts do
   not expire except when the SSV is destroyed (causes would include the
   client ID being destroyed or a server restart).  Since one purpose of
   context expiration is to replace keys that have been in use for "too
   long", hence vulnerable to compromise by brute force or accident, the
   client can replace the SSV key by sending periodic SET_SSV
   operations, which is done by cycling through different users'
   RPCSEC_GSS credentials.  This way, the SSV is replaced without
   destroying the SSV's GSS contexts.

   SSV RPCSEC_GSS handles can be expired or deleted by the server at any
   time, and the EXCHANGE_ID operation can be used to create more SSV
   RPCSEC_GSS handles.  Expiration of SSV RPCSEC_GSS handles does not
   imply that the SSV or its GSS context has expired.

   The client MUST establish an SSV via SET_SSV before the SSV GSS
   context can be used to emit tokens from GSS_Wrap() and GSS_GetMIC().
   If SET_SSV has not been successfully called, attempts to emit tokens
   MUST fail.

   The SSV mechanism does not support replay detection and sequencing in
   its tokens because RPCSEC_GSS does not use those features (see
   Section 5.2.2, "Context Creation Requests", in [4]).  However,
   Section 2.10.10 discusses special considerations for the SSV
   mechanism when used with RPCSEC_GSS.

2.10.10. Security Considerations for RPCSEC_GSS When Using the SSV Mechanism

   When a client ID is created with SP4_SSV state protection (see
   Section 18.35), the client is permitted to associate multiple
   RPCSEC_GSS handles with the single SSV GSS context (see
   Section 2.10.9).  Because of the way RPCSEC_GSS (both version 1 and
   version 2, see [4] and [12]) calculates the verifier of the reply,
   special care must be taken by the implementation of the NFSv4.1
   client to prevent attacks by a man-in-the-middle.

   The verifier of an RPCSEC_GSS reply is the output of GSS_GetMIC()
   applied to the input value of the seq_num field of the RPCSEC_GSS
   credential (data type rpc_gss_cred_ver_1_t) (see Section 5.3.3.2 of
   [4]).  If multiple RPCSEC_GSS handles share the same GSS context,
   then if one handle is used to send a request with the same seq_num
   value as another handle, an attacker could block the reply, and
   replace it with the verifier used for the other handle.

   There are multiple ways to prevent the attack on the SSV RPCSEC_GSS
   verifier in the reply.  The simplest is believed to be as follows.

   o  Each time one or more new SSV RPCSEC_GSS handles are created via
      EXCHANGE_ID, the client SHOULD send a SET_SSV operation to modify
      the SSV.  By changing the SSV, the new handles will not result in
      the re-use of an SSV RPCSEC_GSS verifier in a reply.

   o  When a requester decides to use N SSV RPCSEC_GSS handles, it
      SHOULD assign a unique and non-overlapping range of seq_nums to
      each SSV RPCSEC_GSS handle.  The size of each range SHOULD be
      equal to MAXSEQ / N (see Section 5 of [4] for the definition of
      MAXSEQ).  When an SSV RPCSEC_GSS handle reaches its maximum, it
      SHOULD force the replier to destroy the handle by sending a NULL
      RPC request with seq_num set to MAXSEQ + 1 (see Section 5.3.3.3 of
      [4]).

   o  When the requester wants to increase or decrease N, it SHOULD
      force the replier to destroy all N handles by sending a NULL RPC
      request on each handle with seq_num set to MAXSEQ + 1.  If the
      requester is the client, it SHOULD send a SET_SSV operation before
      using new handles.  If the requester is the server, then the
      client SHOULD send a SET_SSV operation when it detects that the
      server has forced it to destroy a backchannel's SSV RPCSEC_GSS
      handle.  By sending a SET_SSV operation, the SSV will change, and
      so the attacker will be unable to successfully replay a
      previous verifier in a reply to the requester.
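
As an illustration of the second item above, the following minimal sketch divides the seq_num space into N non-overlapping ranges of size MAXSEQ / N.  The function name seq_num_ranges is hypothetical, and the MAXSEQ value shown is recalled from [4] rather than taken from this section, so it should be checked against that document.  Starting the first range at zero is also an assumption of the sketch.

```python
# Sketch only: MAXSEQ is defined in [4]; the value below is an assumption
# recalled from that document and should be verified before use.
MAXSEQ = 0x80000000


def seq_num_ranges(n):
    """Split [0, MAXSEQ) into n non-overlapping ranges of MAXSEQ // n values."""
    if n <= 0:
        raise ValueError("n must be a positive number of handles")
    size = MAXSEQ // n
    # Handle i owns seq_nums i*size .. (i+1)*size - 1, so no two handles
    # sharing the SSV GSS context ever send the same seq_num.
    return [range(i * size, (i + 1) * size) for i in range(n)]


if __name__ == "__main__":
    for index, seqs in enumerate(seq_num_ranges(3)):
        # seqs.stop - 1 is the last seq_num this handle uses before it is
        # retired (the text retires it with a NULL RPC, seq_num MAXSEQ + 1).
        print(index, hex(seqs.start), hex(seqs.stop - 1))
```

Because the ranges are disjoint, a verifier computed for one handle's seq_num cannot match a seq_num in use on another handle until the SSV itself is changed.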

   Note that if the replier carefully creates the SSV RPCSEC_GSS
   handles, the related risk of a man-in-the-middle splicing a forged
   SSV RPCSEC_GSS credential with a verifier for another handle does not
   exist.  This is because the verifier in an RPCSEC_GSS request is
   computed from input that includes both the RPCSEC_GSS handle and
   seq_num (see Section 5.3.1 of [4]).  Provided the replier takes care
   to avoid re-using the value of an RPCSEC_GSS handle that it creates,
   such as by including a generation number in the handle, the man-in-
   the-middle will not be able to successfully replay a previous
   verifier in the request to a replier.

2.10.11. Session Mechanics - Steady State

2.10.11.1. Obligations of the Server
The server has the primary obligation to monitor the state of backchannel resources that the client has created for the server (RPCSEC_GSS contexts and backchannel connections). If these resources vanish, the server takes action as specified in Section 2.10.13.2.
2.10.11.2. Obligations of the Client
The client SHOULD honor the following obligations in order to utilize the session:

   o  Keep a necessary session from going idle on the server.  A client
      that requires a session but nonetheless is not sending operations
      risks having the session be destroyed by the server.  This is
      because sessions consume resources, and resource limitations may
      force the server to cull an inactive session.  A server MAY
      consider a session to be inactive if the client has not used the
      session before the session inactivity timer (Section 2.10.12) has
      expired.

   o  Destroy the session when not needed.  If a client has multiple
      sessions, one of which has no requests waiting for replies, and
      has been idle for some period of time, it SHOULD destroy the
      session.

   o  Maintain GSS contexts and RPCSEC_GSS handles for the backchannel.
      If the client requires the server to use the RPCSEC_GSS security
      flavor for callbacks, then it needs to be sure the RPCSEC_GSS
      handles and/or their GSS contexts that are handed to the server
      via BACKCHANNEL_CTL or CREATE_SESSION are unexpired.

   o  Preserve a connection for a backchannel.  The server requires a
      backchannel in order to gracefully recall recallable state or
      notify the client of certain events.  Note that if the connection
      is not being used for the fore channel, there is no way for the
      client to tell if the connection is still alive (e.g., the server
      restarted without sending a disconnect).  The onus is on the
      server, not the client, to determine if the backchannel's
      connection is alive, and to indicate in the response to a SEQUENCE
      operation when the last connection associated with a session's
      backchannel has disconnected.
2.10.11.3. Steps the Client Takes to Establish a Session
If the client does not have a client ID, the client sends EXCHANGE_ID to establish a client ID. If it opts for SP4_MACH_CRED or SP4_SSV protection, in the spo_must_enforce list of operations, it SHOULD at minimum specify CREATE_SESSION, DESTROY_SESSION, BIND_CONN_TO_SESSION, BACKCHANNEL_CTL, and DESTROY_CLIENTID. If it opts for SP4_SSV protection, the client needs to ask for SSV-based RPCSEC_GSS handles.
   The client uses the client ID to send a CREATE_SESSION on a
   connection to the server.  The results of CREATE_SESSION indicate
   whether or not the server will persist the session reply cache
   through a server that has restarted, and the client notes this for
   future reference.

   If the client specified SP4_SSV state protection when the client ID
   was created, then it SHOULD send SET_SSV in the first COMPOUND after
   the session is created.  Each time a new principal goes to use the
   client ID, it SHOULD send a SET_SSV again.

   If the client wants to use delegations, layouts, directory
   notifications, or any other state that requires a backchannel, then
   it needs to add a connection to the backchannel if CREATE_SESSION did
   not already do so.  The client creates a connection, and calls
   BIND_CONN_TO_SESSION to associate the connection with the session and
   the session's backchannel.  If CREATE_SESSION did not already do so,
   the client MUST tell the server what security is required in order
   for the client to accept callbacks.  The client does this via
   BACKCHANNEL_CTL.  If the client selected SP4_MACH_CRED or SP4_SSV
   protection when it called EXCHANGE_ID, then the client SHOULD specify
   that the backchannel use RPCSEC_GSS contexts for security.

   If the client wants to use additional connections for the
   backchannel, then it needs to call BIND_CONN_TO_SESSION on each
   connection it wants to use with the session.  If the client wants to
   use additional connections for the fore channel, then it needs to
   call BIND_CONN_TO_SESSION if it specified SP4_SSV or SP4_MACH_CRED
   state protection when the client ID was created.

   At this point, the session has reached steady state.
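
The ordering of the steps above can be summarized with a small sketch.  It is an illustration only: the function session_setup_plan and its parameters are hypothetical, and it models just the first connection (additional fore channel or backchannel connections would each add a BIND_CONN_TO_SESSION as described above).

```python
def session_setup_plan(has_client_id, state_protect, wants_backchannel,
                       create_session_bound_backchannel=True,
                       create_session_set_cb_security=True):
    """Return, in order, the operations a client sends to reach steady state.

    state_protect is one of "SP4_NONE", "SP4_MACH_CRED", or "SP4_SSV".
    The two create_session_* flags say whether CREATE_SESSION already
    attached a backchannel connection and conveyed callback security.
    """
    ops = []
    if not has_client_id:
        ops.append("EXCHANGE_ID")
    ops.append("CREATE_SESSION")
    if state_protect == "SP4_SSV":
        # SHOULD be sent in the first COMPOUND after session creation.
        ops.append("SET_SSV")
    if wants_backchannel:
        if not create_session_bound_backchannel:
            ops.append("BIND_CONN_TO_SESSION")
        if not create_session_set_cb_security:
            ops.append("BACKCHANNEL_CTL")
    return ops


if __name__ == "__main__":
    print(session_setup_plan(False, "SP4_SSV", True, False, False))
```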

2.10.12. Session Inactivity Timer

The server MAY maintain a session inactivity timer for each session. If the session inactivity timer expires, then the server MAY destroy the session. To avoid losing a session due to inactivity, the client MUST renew the session inactivity timer. The length of the session inactivity timer MUST NOT be less than the lease_time attribute (Section 5.8.1.11). As with lease renewal (Section 8.3), when the server receives a SEQUENCE operation, it resets the session inactivity timer, and MUST NOT allow the timer to expire while the rest of the operations in the COMPOUND procedure's request are still executing. Once the last operation has finished, the server MUST set the session inactivity timer to expire no sooner than the sum of the current time and the value of the lease_time attribute.
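
The timer rules above can be modeled as follows.  This is a minimal sketch, not a server implementation: the class SessionInactivityTimer and its method names are hypothetical, time is passed in explicitly as a number of seconds, and the timer length is set to exactly lease_time, which is the smallest value the text permits.

```python
class SessionInactivityTimer:
    """Toy model of a per-session inactivity timer (times in seconds)."""

    def __init__(self, lease_time, now):
        self.lease_time = lease_time
        self.in_progress = 0                  # COMPOUNDs still executing
        self.expires_at = now + lease_time

    def on_sequence(self, now):
        # Receiving SEQUENCE resets the timer; while the COMPOUND is
        # executing the timer is not allowed to expire.
        self.in_progress += 1
        self.expires_at = now + self.lease_time

    def on_compound_done(self, now):
        # After the last operation: expire no sooner than now + lease_time.
        self.in_progress -= 1
        self.expires_at = now + self.lease_time

    def expired(self, now):
        return self.in_progress == 0 and now >= self.expires_at


if __name__ == "__main__":
    timer = SessionInactivityTimer(lease_time=90, now=0)
    timer.on_sequence(now=10)
    print(timer.expired(now=500))   # COMPOUND still running
    timer.on_compound_done(now=200)
    print(timer.expired(now=289), timer.expired(now=290))
```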

2.10.13. Session Mechanics - Recovery

2.10.13.1. Events Requiring Client Action
The following events require client action to recover.
2.10.13.1.1. RPCSEC_GSS Context Loss by Callback Path
If all RPCSEC_GSS handles granted by the client to the server for callback use have expired, the client MUST establish a new handle via BACKCHANNEL_CTL. The sr_status_flags field of the SEQUENCE results indicates when callback handles are nearly expired, or fully expired (see Section 18.46.3).
2.10.13.1.2. Connection Loss
If the client loses the last connection of the session and wants to retain the session, then it needs to create a new connection, and if, when the client ID was created, BIND_CONN_TO_SESSION was specified in the spo_must_enforce list, the client MUST use BIND_CONN_TO_SESSION to associate the connection with the session. If there was a request outstanding at the time of connection loss, then if the client wants to continue to use the session, it MUST retry the request, as described in Section 2.10.6.2. Note that it is not necessary to retry requests over a connection with the same source network address or the same destination network address as the lost connection. As long as the session ID, slot ID, and sequence ID in the retry match that of the original request, the server will recognize the request as a retry if it executed the request prior to disconnect. If the connection that was lost was the last one associated with the backchannel, and the client wants to retain the backchannel and/or prevent revocation of recallable state, the client needs to reconnect, and if it does, it MUST associate the connection to the session and backchannel via BIND_CONN_TO_SESSION. The server SHOULD indicate when it has no callback connection via the sr_status_flags result from SEQUENCE.
2.10.13.1.3. Backchannel GSS Context Loss
Via the sr_status_flags result of the SEQUENCE operation or other means, the client will learn if some or all of the RPCSEC_GSS contexts it assigned to the backchannel have been lost. If the client wants to retain the backchannel and/or not put recallable state subject to revocation, the client needs to use BACKCHANNEL_CTL to assign new contexts.
2.10.13.1.4. Loss of Session
The replier might lose a record of the session. Causes include:

   o  Replier failure and restart.

   o  A catastrophe that causes the reply cache to be corrupted or lost
      on the media on which it was stored.  This applies even if the
      replier indicated in the CREATE_SESSION results that it would
      persist the cache.

   o  The server purges the session of a client that has been inactive
      for a very extended period of time.

   o  As a result of configuration changes among a set of clustered
      servers, a network address previously connected to one server
      becomes connected to a different server that has no knowledge of
      the session in question.  Such a configuration change will
      generally only happen when the original server ceases to function
      for a time.

Loss of reply cache is equivalent to loss of session. The replier indicates loss of session to the requester by returning NFS4ERR_BADSESSION on the next operation that uses the session ID that refers to the lost session.

After an event like a server restart, the client may have lost its connections. The client assumes for the moment that the session has not been lost. It reconnects, and if it specified connection association enforcement when the session was created, it invokes BIND_CONN_TO_SESSION using the session ID. Otherwise, it invokes SEQUENCE. If BIND_CONN_TO_SESSION or SEQUENCE returns NFS4ERR_BADSESSION, the client knows the session is not available to it when communicating with that network address. If the connection survives session loss, then the next SEQUENCE operation the client sends over the connection will get back NFS4ERR_BADSESSION. The client again knows the session was lost.

Here is one suggested algorithm for the client when it gets NFS4ERR_BADSESSION. It is not obligatory in that, if a client does not want to take advantage of such features as trunking, it may omit parts of it. However, it is a useful example that draws attention to various possible recovery issues:

   1.  If the client has other connections to other server network
       addresses associated with the same session, attempt a COMPOUND
       with a single operation, SEQUENCE, on each of the other
       connections.

   2.  If the attempts succeed, the session is still alive, and this is
       a strong indicator that the server's network address has moved.
       The client might send an EXCHANGE_ID on the connection that
       returned NFS4ERR_BADSESSION to see if there are opportunities for
       client ID trunking (i.e., the same client ID and so_major are
       returned).  The client might use DNS to see if the moved network
       address was replaced with another, so that the performance and
       availability benefits of session trunking can continue.

   3.  If the SEQUENCE requests fail with NFS4ERR_BADSESSION, then the
       session no longer exists on any of the server network addresses
       for which the client has connections associated with that session
       ID.  It is possible the session is still alive and available on
       other network addresses.  The client sends an EXCHANGE_ID on all
       the connections to see if the server owner is still listening on
       those network addresses.  If the same server owner is returned
       but a new client ID is returned, this is a strong indicator of a
       server restart.  If both the same server owner and same client ID
       are returned, then this is a strong indication that the server
       did delete the session, and the client will need to send a
       CREATE_SESSION if it has no other sessions for that client ID.
       If a different server owner is returned, the client can use DNS
       to find other network addresses.  If it does not, or if DNS does
       not find any other addresses for the server, then the client will
       be unable to provide NFSv4.1 service, and fatal errors should be
       returned to processes that were using the server.  If the client
       is using a "mount" paradigm, unmounting the server is advised.

   4.  If the client knows of no other connections associated with the
       session ID and server network addresses that are, or have been,
       associated with the session ID, then the client can use DNS to
       find other network addresses.  If it does not, or if DNS does not
       find any other addresses for the server, then the client will be
       unable to provide NFSv4.1 service, and fatal errors should be
       returned to processes that were using the server.  If the client
       is using a "mount" paradigm, unmounting the server is advised.
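
The interpretation of the EXCHANGE_ID results in step 3 can be written as a small decision function.  This is a sketch under stated assumptions: the names Diagnosis and diagnose are hypothetical, and server owners and client IDs are treated as plain comparable values.

```python
from enum import Enum


class Diagnosis(Enum):
    SERVER_RESTARTED = "same server owner, new client ID"
    SESSION_DELETED = "same server owner, same client ID"
    DIFFERENT_SERVER = "different server owner"


def diagnose(old_owner, old_client_id, new_owner, new_client_id):
    """Classify an EXCHANGE_ID result obtained after NFS4ERR_BADSESSION."""
    if new_owner != old_owner:
        # The client can use DNS to look for other network addresses.
        return Diagnosis.DIFFERENT_SERVER
    if new_client_id != old_client_id:
        # Strong indicator of a server restart.
        return Diagnosis.SERVER_RESTARTED
    # The server deleted the session; CREATE_SESSION is needed if the
    # client has no other sessions for this client ID.
    return Diagnosis.SESSION_DELETED


if __name__ == "__main__":
    print(diagnose("owner-a", 1, "owner-a", 2))
```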

   If there is a reconfiguration event that results in the same network
   address being assigned to servers where the eir_server_scope value is
   different, it cannot be guaranteed that a session ID generated by the
   first will be recognized as invalid by the second.  Therefore, in
   managing server reconfigurations among servers with different server
   scope values, it is necessary to make sure that all clients have
   disconnected from the first server before effecting the
   reconfiguration.  Nonetheless, clients should not assume that servers
   will always adhere to this requirement; clients MUST be prepared to
   deal with unexpected effects of server reconfigurations.  Even where
   a session ID is inappropriately recognized as valid, it is likely
   either that the connection will not be recognized as valid or that a
   sequence value for a slot will not be correct.  Therefore, when a
   client receives results indicating such unexpected errors, the use of
   EXCHANGE_ID to determine the current server configuration is
   RECOMMENDED.

   A variation on the above is that after a server's network address
   moves, there is no NFSv4.1 server listening, e.g., no listener on
   port 2049.  In this example, one of the following occurs: the NFSv4
   server returns NFS4ERR_MINOR_VERS_MISMATCH, the NFS server returns a
   PROG_MISMATCH error, the RPC listener on 2049 returns PROG_UNAVAIL, or
   attempts to reconnect to the network address time out.  These SHOULD
   be treated as equivalent to SEQUENCE returning NFS4ERR_BADSESSION for
   these purposes.

   When the client detects session loss, it needs to call CREATE_SESSION
   to recover.  Any non-idempotent operations that were in progress
   might have been performed on the server at the time of session loss.
   The client has no general way to recover from this.

   Note that loss of session does not imply loss of byte-range lock,
   open, delegation, or layout state because locks, opens, delegations,
   and layouts are tied to the client ID and depend on the client ID,
   not the session.  Nor does loss of byte-range lock, open, delegation,
   or layout state imply loss of session state, because the session
   depends on the client ID; loss of client ID however does imply loss
   of session, byte-range lock, open, delegation, and layout state.  See
   Section 8.4.2.  A session can survive a server restart, but lock
   recovery may still be needed.

   It is possible that CREATE_SESSION will fail with
   NFS4ERR_STALE_CLIENTID (e.g., the server restarts and does not
   preserve client ID state).  If so, the client needs to call
   EXCHANGE_ID, followed by CREATE_SESSION.
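
The fallback from CREATE_SESSION to EXCHANGE_ID followed by CREATE_SESSION can be sketched as below.  Everything here is illustrative: StaleClientId stands in for an NFS4ERR_STALE_CLIENTID result, FakeServer is a stand-in used only to exercise the logic, and neither is part of the protocol.  Lock recovery, which may still be needed, is out of scope for the sketch.

```python
class StaleClientId(Exception):
    """Stands in for an NFS4ERR_STALE_CLIENTID result (hypothetical)."""


def recover_session(client_id, create_session, exchange_id):
    """Create a new session, re-establishing the client ID if it is stale."""
    try:
        return client_id, create_session(client_id)
    except StaleClientId:
        client_id = exchange_id()
        return client_id, create_session(client_id)


class FakeServer:
    """Toy server that only remembers which client IDs it knows."""

    def __init__(self, known_client_ids):
        self.known = set(known_client_ids)
        self.next_id = 100

    def exchange_id(self):
        self.next_id += 1
        self.known.add(self.next_id)
        return self.next_id

    def create_session(self, client_id):
        if client_id not in self.known:
            raise StaleClientId(client_id)
        return f"session-for-{client_id}"


if __name__ == "__main__":
    restarted = FakeServer(known_client_ids=[])
    print(recover_session(7, restarted.create_session, restarted.exchange_id))
```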

2.10.13.2. Events Requiring Server Action
The following events require server action to recover.
2.10.13.2.1. Client Crash and Restart
As described in Section 18.35, a restarted client sends EXCHANGE_ID in such a way that it causes the server to delete any sessions it had.
2.10.13.2.2. Client Crash with No Restart
If a client crashes and never comes back, it will never send EXCHANGE_ID with its old client owner. Thus, the server has session state that will never be used again. After an extended period of time, and if the server has resource constraints, it MAY destroy the old session as well as locking state.
2.10.13.2.3. Extended Network Partition
To the server, the extended network partition may be no different from a client crash with no restart (see Section 2.10.13.2.2). Unless the server can discern that there is a network partition, it is free to treat the situation as if the client has crashed permanently.
2.10.13.2.4. Backchannel Connection Loss
If there were callback requests outstanding at the time of a connection loss, then the server MUST retry the requests, as described in Section 2.10.6.2. Note that it is not necessary to retry requests over a connection with the same source network address or the same destination network address as the lost connection. As long as the session ID, slot ID, and sequence ID in the retry match that of the original request, the callback target will recognize the request as a retry even if it did see the request prior to disconnect. If the connection lost is the last one associated with the backchannel, then the server MUST indicate that in the sr_status_flags field of every SEQUENCE reply until the backchannel is re-established. There are two situations, each of which uses different status flags: no connectivity for the session's backchannel and no connectivity for any session backchannel of the client. See Section 18.46 for a description of the appropriate flags in sr_status_flags.
2.10.13.2.5. GSS Context Loss
The server SHOULD monitor when the number of RPCSEC_GSS handles assigned to the backchannel reaches one, and when that one handle is near expiry (i.e., between one and two periods of lease time), and indicate so in the sr_status_flags field of all SEQUENCE replies. The server MUST indicate when all of the backchannel's assigned RPCSEC_GSS handles have expired via the sr_status_flags field of all SEQUENCE replies.

2.10.14. Parallel NFS and Sessions

A client and server can potentially be a non-pNFS implementation, a metadata server implementation, a data server implementation, or two or three types of implementations. The EXCHGID4_FLAG_USE_NON_PNFS, EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS flags (not mutually exclusive) are passed in the EXCHANGE_ID arguments and results to allow the client to indicate how it wants to use sessions created under the client ID, and to allow the server to indicate how it will allow the sessions to be used. See Section 13.1 for pNFS sessions considerations.

3. Protocol Constants and Data Types

The syntax and semantics to describe the data types of the NFSv4.1 protocol are defined in the XDR RFC 4506 [2] and RPC RFC 5531 [3] documents. The next sections build upon the XDR data types to define constants, types, and structures specific to this protocol. The full list of XDR data types is in [13].

3.1. Basic Constants

   const NFS4_FHSIZE               = 128;
   const NFS4_VERIFIER_SIZE        = 8;
   const NFS4_OPAQUE_LIMIT         = 1024;
   const NFS4_SESSIONID_SIZE       = 16;

   const NFS4_INT64_MAX            = 0x7fffffffffffffff;
   const NFS4_UINT64_MAX           = 0xffffffffffffffff;
   const NFS4_INT32_MAX            = 0x7fffffff;
   const NFS4_UINT32_MAX           = 0xffffffff;

   const NFS4_MAXFILELEN           = 0xffffffffffffffff;
   const NFS4_MAXFILEOFF           = 0xfffffffffffffffe;

Except where noted, all these constants are defined in bytes.

   o  NFS4_FHSIZE is the maximum size of a filehandle.

   o  NFS4_VERIFIER_SIZE is the fixed size of a verifier.

   o  NFS4_OPAQUE_LIMIT is the maximum size of certain opaque
      information.

   o  NFS4_SESSIONID_SIZE is the fixed size of a session identifier.

   o  NFS4_INT64_MAX is the maximum value of a signed 64-bit integer.

   o  NFS4_UINT64_MAX is the maximum value of an unsigned 64-bit
      integer.

   o  NFS4_INT32_MAX is the maximum value of a signed 32-bit integer.

   o  NFS4_UINT32_MAX is the maximum value of an unsigned 32-bit
      integer.

   o  NFS4_MAXFILELEN is the maximum length of a regular file.

   o  NFS4_MAXFILEOFF is the maximum offset into a regular file.

3.2. Basic Data Types

These are the base NFSv4.1 data types.

   +---------------+---------------------------------------------------+
   | Data Type     | Definition                                        |
   +---------------+---------------------------------------------------+
   | int32_t       | typedef int int32_t;                              |
   | uint32_t      | typedef unsigned int uint32_t;                    |
   | int64_t       | typedef hyper int64_t;                            |
   | uint64_t      | typedef unsigned hyper uint64_t;                  |
   | attrlist4     | typedef opaque attrlist4<>;                       |
   |               | Used for file/directory attributes.               |
   | bitmap4       | typedef uint32_t bitmap4<>;                       |
   |               | Used in attribute array encoding.                 |
   | changeid4     | typedef uint64_t changeid4;                       |
   |               | Used in the definition of change_info4.           |
   | clientid4     | typedef uint64_t clientid4;                       |
   |               | Shorthand reference to client identification.     |
   | count4        | typedef uint32_t count4;                          |
   |               | Various count parameters (READ, WRITE, COMMIT).  |
   | length4       | typedef uint64_t length4;                         |
   |               | The length of a byte-range within a file.         |
   | mode4         | typedef uint32_t mode4;                           |
   |               | Mode attribute data type.                         |
   | nfs_cookie4   | typedef uint64_t nfs_cookie4;                     |
   |               | Opaque cookie value for READDIR.                  |
   | nfs_fh4       | typedef opaque nfs_fh4<NFS4_FHSIZE>;              |
   |               | Filehandle definition.                            |
   | nfs_ftype4    | enum nfs_ftype4;                                  |
   |               | Various defined file types.                       |
   | nfsstat4      | enum nfsstat4;                                    |
   |               | Return value for operations.                      |
   | offset4       | typedef uint64_t offset4;                         |
   |               | Various offset designations (READ, WRITE, LOCK,   |
   |               | COMMIT).                                          |
   | qop4          | typedef uint32_t qop4;                            |
   |               | Quality of protection designation in SECINFO.     |
   | sec_oid4      | typedef opaque sec_oid4<>;                        |
   |               | Security Object Identifier.  The sec_oid4 data    |
   |               | type is not really opaque.  Instead, it contains  |
   |               | an ASN.1 OBJECT IDENTIFIER as used by GSS-API in  |
   |               | the mech_type argument to GSS_Init_sec_context.   |
   |               | See [7] for details.                              |
   | sequenceid4   | typedef uint32_t sequenceid4;                     |
   |               | Sequence number used for various session          |
   |               | operations (EXCHANGE_ID, CREATE_SESSION,          |
   |               | SEQUENCE, CB_SEQUENCE).                           |
   | seqid4        | typedef uint32_t seqid4;                          |
   |               | Sequence identifier used for locking.             |
   | sessionid4    | typedef opaque sessionid4[NFS4_SESSIONID_SIZE];   |
   |               | Session identifier.                               |
   | slotid4       | typedef uint32_t slotid4;                         |
   |               | Sequencing artifact for various session           |
   |               | operations (SEQUENCE, CB_SEQUENCE).               |
   | utf8string    | typedef opaque utf8string<>;                      |
   |               | UTF-8 encoding for strings.                       |
   | utf8str_cis   | typedef utf8string utf8str_cis;                   |
   |               | Case-insensitive UTF-8 string.                    |
   | utf8str_cs    | typedef utf8string utf8str_cs;                    |
   |               | Case-sensitive UTF-8 string.                      |
   | utf8str_mixed | typedef utf8string utf8str_mixed;                 |
   |               | UTF-8 strings with a case-sensitive prefix and a  |
   |               | case-insensitive suffix.                          |
   | component4    | typedef utf8str_cs component4;                    |
   |               | Represents pathname components.                   |
   | linktext4     | typedef utf8str_cs linktext4;                     |
   |               | Symbolic link contents ("symbolic link" is        |
   |               | defined in an Open Group [14] standard).          |
   | pathname4     | typedef component4 pathname4<>;                   |
   |               | Represents pathname for fs_locations.             |
   | verifier4     | typedef opaque verifier4[NFS4_VERIFIER_SIZE];     |
   |               | Verifier used for various operations (COMMIT,     |
   |               | CREATE, EXCHANGE_ID, OPEN, READDIR, WRITE)        |
   |               | NFS4_VERIFIER_SIZE is defined as 8.               |
   +---------------+---------------------------------------------------+

                          End of Base Data Types

                                  Table 1

3.3. Structured Data Types

3.3.1. nfstime4

   struct nfstime4 {
           int64_t         seconds;
           uint32_t        nseconds;
   };

The nfstime4 data type gives the number of seconds and nanoseconds since midnight or zero hour January 1, 1970 Coordinated Universal Time (UTC). Values greater than zero for the seconds field denote dates after the zero hour January 1, 1970. Values less than zero for the seconds field denote dates before the zero hour January 1, 1970. In both cases, the nseconds field is to be added to the seconds field for the final time representation. For example, if the time to be represented is one-half second before zero hour January 1, 1970, the seconds field would have a value of negative one (-1) and the nseconds field would have a value of one-half second (500000000). Values greater than 999,999,999 for nseconds are invalid.

This data type is used to pass time and date information. A server converts to and from its local representation of time when processing time values, preserving as much accuracy as possible. If the precision of timestamps stored for a file system object is less than defined, loss of precision can occur. An adjunct time maintenance protocol is RECOMMENDED to reduce client and server time skew.
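
The rule that nseconds is always added to seconds, including for times before the epoch, can be checked with a short sketch.  The helper names to_nfstime4 and from_nfstime4 are hypothetical; times are represented as a signed count of nanoseconds relative to zero hour January 1, 1970 UTC.

```python
NSEC_PER_SEC = 1_000_000_000


def to_nfstime4(total_ns):
    """Split signed nanoseconds since the epoch into (seconds, nseconds)."""
    # Floor division keeps nseconds in 0..999,999,999 even when total_ns
    # is negative, which matches the "added to seconds" rule.
    seconds, nseconds = divmod(total_ns, NSEC_PER_SEC)
    return seconds, nseconds


def from_nfstime4(seconds, nseconds):
    """Recombine an nfstime4 pair, rejecting invalid nseconds values."""
    if not 0 <= nseconds < NSEC_PER_SEC:
        raise ValueError("nseconds greater than 999,999,999 is invalid")
    return seconds * NSEC_PER_SEC + nseconds


if __name__ == "__main__":
    # One-half second before the epoch.
    print(to_nfstime4(-500_000_000))
```

For the one-half second example in the text, the split yields seconds = -1 and nseconds = 500000000.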

3.3.2. time_how4

   enum time_how4 {
           SET_TO_SERVER_TIME4 = 0,
           SET_TO_CLIENT_TIME4 = 1
   };

3.3.3. settime4

   union settime4 switch (time_how4 set_it) {
    case SET_TO_CLIENT_TIME4:
           nfstime4       time;
    default:
           void;
   };

The time_how4 and settime4 data types are used for setting timestamps in file object attributes. If set_it is SET_TO_SERVER_TIME4, then the server uses its local representation of time for the time value.
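
To show how the discriminated union appears on the wire, the sketch below encodes a settime4 value, assuming the standard XDR rules of [2] (big-endian, a 4-byte discriminant, an 8-byte hyper, and a 4-byte unsigned integer).  The function name encode_settime4 is hypothetical.

```python
import struct

SET_TO_SERVER_TIME4 = 0
SET_TO_CLIENT_TIME4 = 1


def encode_settime4(set_it, seconds=0, nseconds=0):
    """Return the XDR bytes for a settime4 union value."""
    out = struct.pack(">i", set_it)              # discriminant (time_how4)
    if set_it == SET_TO_CLIENT_TIME4:
        # The nfstime4 arm: int64_t seconds, then uint32_t nseconds.
        out += struct.pack(">qI", seconds, nseconds)
    # default arm is void: nothing follows the discriminant.
    return out


if __name__ == "__main__":
    print(encode_settime4(SET_TO_SERVER_TIME4).hex())
    print(encode_settime4(SET_TO_CLIENT_TIME4, -1, 500_000_000).hex())
```

The server-time case occupies only the 4-byte discriminant, while the client-time case carries an additional 12 bytes for the nfstime4 value.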

3.3.4. specdata4

   struct specdata4 {
    uint32_t specdata1; /* major device number */
    uint32_t specdata2; /* minor device number */
   };

This data type represents the device numbers for the device file types NF4CHR and NF4BLK.

3.3.5. fsid4

   struct fsid4 {
           uint64_t        major;
           uint64_t        minor;
   };

3.3.6. change_policy4

   struct change_policy4 {
           uint64_t        cp_major;
           uint64_t        cp_minor;
   };

The change_policy4 data type is used for the change_policy RECOMMENDED attribute. It provides change sequencing indication analogous to the change attribute. To enable the server to present a value valid across server re-initialization without requiring persistent storage, two 64-bit quantities are used, allowing one to be a server instance ID and the second to be incremented non-persistently, within a given server instance.

3.3.7. fattr4

   struct fattr4 {
           bitmap4         attrmask;
           attrlist4       attr_vals;
   };

The fattr4 data type is used to represent file and directory attributes.

The bitmap is a counted array of 32-bit integers used to contain bit values. The position of the integer in the array that contains bit n can be computed from the expression (n / 32), and its bit within that integer is (n mod 32).

   0            1
   +-----------+-----------+-----------+--
   |  count    | 31  ..  0 | 63  .. 32 |
   +-----------+-----------+-----------+--
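
The (n / 32) and (n mod 32) expressions can be exercised with a short sketch that treats a bitmap4 as a list of 32-bit words.  The helper names set_bit and bit_is_set are hypothetical.

```python
def set_bit(bitmap, n):
    """Set bit n in a bitmap4 held as a list of 32-bit words."""
    word, bit = divmod(n, 32)          # word index n / 32, bit n mod 32
    while len(bitmap) <= word:
        bitmap.append(0)               # grow the counted array as needed
    bitmap[word] |= 1 << bit


def bit_is_set(bitmap, n):
    """Report whether bit n is set; bits past the array end are clear."""
    word, bit = divmod(n, 32)
    return word < len(bitmap) and bool((bitmap[word] >> bit) & 1)


if __name__ == "__main__":
    mask = []
    set_bit(mask, 0)
    set_bit(mask, 33)
    print(mask, bit_is_set(mask, 33), bit_is_set(mask, 64))
```

Bit 33 lands in the second word (index 1) at bit position 1, matching the layout in the figure above.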

3.3.8. change_info4

   struct change_info4 {
           bool            atomic;
           changeid4       before;
           changeid4       after;
   };

This data type is used with the CREATE, LINK, OPEN, REMOVE, and RENAME operations to let the client know the value of the change attribute for the directory in which the target file system object resides.

3.3.9. netaddr4

   struct netaddr4 {
           /* see struct rpcb in RFC 1833 */
           string na_r_netid<>;    /* network id */
           string na_r_addr<>;     /* universal address */
   };

The netaddr4 data type is used to identify network transport endpoints. The na_r_netid and na_r_addr fields respectively contain a netid and uaddr. The netid and uaddr concepts are defined in [15]. The netid and uaddr formats for TCP over IPv4 and TCP over IPv6 are defined in [15], specifically Tables 2 and 3 and Sections 5.2.3.3 and 5.2.3.4.
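
As an illustration of the uaddr form, the sketch below converts between a (host, port) pair and a universal address.  It assumes the format described in [15], in which the textual address is followed by two decimal fields carrying the high and low octets of the port; the helper names parse_uaddr and make_uaddr are hypothetical, and no validation of the host portion is attempted.

```python
def parse_uaddr(uaddr):
    """Split a universal address into (host, port)."""
    host, high, low = uaddr.rsplit(".", 2)
    high, low = int(high), int(low)
    if not (0 <= high <= 255 and 0 <= low <= 255):
        raise ValueError("port octets must be in the range 0..255")
    return host, high * 256 + low


def make_uaddr(host, port):
    """Build a universal address from a host string and a port number."""
    if not 0 <= port <= 0xFFFF:
        raise ValueError("port must fit in 16 bits")
    return f"{host}.{port >> 8}.{port & 0xFF}"


if __name__ == "__main__":
    # Port 2049 is 8 * 256 + 1.
    print(make_uaddr("192.0.2.1", 2049))
    print(parse_uaddr("2001:db8::1.8.1"))
```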

3.3.10. state_owner4

   struct state_owner4 {
           clientid4       clientid;
           opaque          owner<NFS4_OPAQUE_LIMIT>;
   };

   typedef state_owner4 open_owner4;
   typedef state_owner4 lock_owner4;

The state_owner4 data type is the base type for the open_owner4 (Section 3.3.10.1) and lock_owner4 (Section 3.3.10.2).
3.3.10.1. open_owner4
This data type is used to identify the owner of OPEN state.
3.3.10.2. lock_owner4
This structure is used to identify the owner of byte-range locking state.

3.3.11. open_to_lock_owner4

   struct open_to_lock_owner4 {
           seqid4          open_seqid;
           stateid4        open_stateid;
           seqid4          lock_seqid;
           lock_owner4     lock_owner;
   };

This data type is used for the first LOCK operation done for an open_owner4. It provides both the open_stateid and lock_owner, such that the transition is made from a valid open_stateid sequence to that of the new lock_stateid sequence. Using this mechanism avoids the confirmation of the lock_owner/lock_seqid pair since it is tied to established state in the form of the open_stateid/open_seqid.

3.3.12. stateid4

   struct stateid4 {
           uint32_t        seqid;
           opaque          other[12];
   };

This data type is used for the various state sharing mechanisms between the client and server. The client never modifies a value of data type stateid. The starting value of the "seqid" field is undefined. The server is required to increment the "seqid" field by one at each transition of the stateid. This is important since the client will inspect the seqid in OPEN stateids to determine the order of OPEN processing done by the server.

3.3.13. layouttype4

   enum layouttype4 {
           LAYOUT4_NFSV4_1_FILES   = 0x1,
           LAYOUT4_OSD2_OBJECTS    = 0x2,
           LAYOUT4_BLOCK_VOLUME    = 0x3
   };

   This data type indicates what type of layout is being used.  The file
   server advertises the layout types it supports through the
   fs_layout_type file system attribute (Section 5.12.1).  A client asks
   for layouts of a particular type in LAYOUTGET, and processes those
   layouts in its layout-type-specific logic.

   The layouttype4 data type is 32 bits in length.  The range
   represented by the layout type is split into three parts.  Type 0x0
   is reserved.  Types within the range 0x00000001-0x7FFFFFFF are
   globally unique and are assigned according to the description in
   Section 22.4; they are maintained by IANA.  Types within the range
   0x80000000-0xFFFFFFFF are site specific and for private use only.
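
   The three-way split of the value space can be sketched as a small
   Python helper.  The function name and the returned labels are
   hypothetical, and the value is treated as an unsigned 32-bit quantity
   to match the ranges given above.

```python
def classify_layout_type(value: int) -> str:
    """Return the range a 32-bit layouttype4 value falls in."""
    if not 0 <= value <= 0xFFFFFFFF:
        raise ValueError("layouttype4 is 32 bits in length")
    if value == 0x0:
        return "reserved"
    if value <= 0x7FFFFFFF:
        return "iana"     # globally unique, maintained by IANA
    return "private"      # site specific, private use only
```

   For example, LAYOUT4_NFSV4_1_FILES (0x1) falls in the IANA-maintained
   range, while 0x80000001 would be a site-specific type.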

   The LAYOUT4_NFSV4_1_FILES enumeration specifies that the NFSv4.1 file
   layout type, as defined in Section 13, is to be used.  The
   LAYOUT4_OSD2_OBJECTS enumeration specifies that the object layout, as
   defined in [40], is to be used.  Similarly, the LAYOUT4_BLOCK_VOLUME
   enumeration specifies that the block/volume layout, as defined in
   [41], is to be used.

3.3.14. deviceid4

   const NFS4_DEVICEID4_SIZE = 16;

   typedef opaque  deviceid4[NFS4_DEVICEID4_SIZE];

   Layout information includes device IDs that specify a storage device
   through a compact handle.  Addressing and type information is
   obtained with the GETDEVICEINFO operation.  Device IDs are not
   guaranteed to be valid across metadata server restarts.  A device ID
   is unique per client ID and layout type.  See Section 12.2.10 for
   more details.

3.3.15. device_addr4

   struct device_addr4 {
           layouttype4             da_layout_type;
           opaque                  da_addr_body<>;
   };

   The device address is used to set up a communication channel with the
   storage device.  Different layout types will require different data
   types to define how they communicate with storage devices.  The
   opaque da_addr_body field is interpreted based on the specified
   da_layout_type field.

   This document defines the device address for the NFSv4.1 file layout
   (see Section 13.3), which identifies a storage device by network IP
   address and port number.  This is sufficient for the clients to
   communicate with the NFSv4.1 storage devices, and may be sufficient
   for other layout types as well.  Device types for object-based
   storage devices and block storage devices (e.g., Small Computer
   System Interface (SCSI) volume labels) are defined by their
   respective layout specifications.
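
   The pattern of a layout type followed by a variable-length opaque
   body recurs in device_addr4, layout_content4, layoutupdate4, and
   layouthint4.  A minimal sketch of its XDR wire form, using
   device_addr4 as the example, might look as follows.  The function
   names are hypothetical, the layout type is treated as an unsigned
   32-bit value, and the body is left uninterpreted.

```python
import struct
from typing import Tuple


def encode_device_addr4(layout_type: int, body: bytes) -> bytes:
    """XDR-encode da_layout_type followed by opaque da_addr_body<>."""
    # Variable-length opaque: 4-byte length, the data, then zero
    # padding up to the next 4-byte boundary.
    pad = (-len(body)) % 4
    return struct.pack(">II", layout_type, len(body)) + body + b"\x00" * pad


def decode_device_addr4(data: bytes) -> Tuple[int, bytes]:
    """Return (da_layout_type, da_addr_body) from an encoded buffer."""
    layout_type, length = struct.unpack_from(">II", data, 0)
    body = data[8:8 + length]
    if len(body) != length:
        raise ValueError("truncated da_addr_body")
    # The caller dispatches on layout_type to interpret the body.
    return layout_type, body
```

   The decoder stops at the opaque boundary on purpose: only the code
   specific to a given layout type knows how to parse the body.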

3.3.16. layout_content4

   struct layout_content4 {
           layouttype4 loc_type;
           opaque      loc_body<>;
   };

   The loc_body field is interpreted based on the layout type
   (loc_type).  This document defines the loc_body for the NFSv4.1 file
   layout type; see Section 13.3 for its definition.

3.3.17. layout4

   struct layout4 {
           offset4                 lo_offset;
           length4                 lo_length;
           layoutiomode4           lo_iomode;
           layout_content4         lo_content;
   };

   The layout4 data type defines a layout for a file.  The layout type
   specific data is opaque within lo_content.  Since layouts are
   sub-dividable, the offset and length together with the file's
   filehandle, the client ID, iomode, and layout type identify the
   layout.
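
   One way to picture this identity is as a composite key.  The Python
   sketch below is an assumption about how an implementation might track
   layouts, not something this document prescribes; LayoutKey and the
   sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LayoutKey:
    """The fields that together identify one layout."""
    filehandle: bytes
    client_id: int
    layout_type: int
    iomode: int
    offset: int
    length: int


# A frozen dataclass is hashable, so it can index a table of layouts.
held = {}
key = LayoutKey(b"\x01\x02", 0x1234, 0x1, 2, 0, 4096)
held[key] = b"opaque lo_content body"
```

   Two byte ranges of the same file therefore yield distinct keys even
   when every other field matches.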

3.3.18. layoutupdate4

   struct layoutupdate4 {
           layouttype4 lou_type;
           opaque      lou_body<>;
   };

   The layoutupdate4 data type is used by the client to return updated
   layout information to the metadata server via the LAYOUTCOMMIT
   (Section 18.42) operation.  This data type provides a channel to pass
   layout type specific information (in field lou_body) back to the
   metadata server.  For example, for the block/volume layout type, this
   could include the list of reserved blocks that were written.  The
   contents of the opaque lou_body argument are determined by the layout
   type.  The NFSv4.1 file-based layout does not use this data type; if
   lou_type is LAYOUT4_NFSV4_1_FILES, the lou_body field MUST have a
   zero length.
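
   The zero-length requirement lends itself to a simple check before a
   layoutupdate4 is sent or accepted.  The following Python sketch shows
   one possible form; the function name and the choice of exception are
   assumptions made for illustration.

```python
LAYOUT4_NFSV4_1_FILES = 0x1


def check_layoutupdate4(lou_type: int, lou_body: bytes) -> None:
    """Raise ValueError if lou_body breaks the file-layout rule."""
    if lou_type == LAYOUT4_NFSV4_1_FILES and len(lou_body) != 0:
        raise ValueError(
            "lou_body MUST have a zero length for LAYOUT4_NFSV4_1_FILES"
        )
    # Bodies for other layout types are defined by their own
    # specifications and are not examined here.
```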

3.3.19. layouthint4

   struct layouthint4 {
           layouttype4 loh_type;
           opaque      loh_body<>;
   };

   The layouthint4 data type is used by the client to pass in a hint
   about the type of layout it would like created for a particular file.
   It is the data type specified by the layout_hint attribute described
   in Section 5.12.4.  The metadata server may ignore the hint or may
   selectively ignore fields within the hint.  This hint should be
   provided at create time as part of the initial attributes within
   OPEN.  The loh_body field is specific to the type of layout
   (loh_type).  The NFSv4.1 file-based layout uses the
   nfsv4_1_file_layouthint4 data type as defined in Section 13.3.

3.3.20. layoutiomode4

   enum layoutiomode4 {
           LAYOUTIOMODE4_READ      = 1,
           LAYOUTIOMODE4_RW        = 2,
           LAYOUTIOMODE4_ANY       = 3
   };

   The iomode specifies whether the client intends to just read or both
   read and write the data represented by the layout.  While the
   LAYOUTIOMODE4_ANY iomode MUST NOT be used in the arguments to the
   LAYOUTGET operation, it MAY be used in the arguments to the
   LAYOUTRETURN and CB_LAYOUTRECALL operations.  The LAYOUTIOMODE4_ANY
   iomode specifies that layouts pertaining to both LAYOUTIOMODE4_READ
   and LAYOUTIOMODE4_RW iomodes are being returned or recalled,
   respectively.  The metadata server's use of the iomode may depend on
   the layout type being used.  The storage devices MAY validate I/O
   accesses against the iomode and reject invalid accesses.
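
   The iomode rules above can be sketched in Python as follows.  The
   class and function names are hypothetical; the sketch only captures
   which values are acceptable for LAYOUTGET and which held layouts a
   returned or recalled iomode covers.

```python
from enum import IntEnum


class LayoutIoMode4(IntEnum):
    READ = 1  # LAYOUTIOMODE4_READ
    RW = 2    # LAYOUTIOMODE4_RW
    ANY = 3   # LAYOUTIOMODE4_ANY


def valid_for_layoutget(iomode: LayoutIoMode4) -> bool:
    """LAYOUTIOMODE4_ANY MUST NOT appear in LAYOUTGET arguments."""
    return iomode in (LayoutIoMode4.READ, LayoutIoMode4.RW)


def covers(requested: LayoutIoMode4, held: LayoutIoMode4) -> bool:
    """Whether a LAYOUTRETURN or CB_LAYOUTRECALL iomode applies to a
    layout held with iomode `held` (either READ or RW)."""
    return requested == LayoutIoMode4.ANY or requested == held
```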

3.3.21. nfs_impl_id4

   struct nfs_impl_id4 {
           utf8str_cis   nii_domain;
           utf8str_cs    nii_name;
           nfstime4      nii_date;
   };

   This data type is used to identify client and server implementation
   details.  The nii_domain field is the DNS domain name with which the
   implementor is associated.  The nii_name field is the product name of
   the implementation and is completely free form.  It is RECOMMENDED
   that the nii_name be used to distinguish machine architecture,
   machine platforms, revisions, versions, and patch levels.  The
   nii_date field is the timestamp of when the software instance was
   published or built.

3.3.22. threshold_item4

   struct threshold_item4 {
           layouttype4     thi_layout_type;
           bitmap4         thi_hintset;
           opaque          thi_hintlist<>;
   };

   This data type contains a list of hints specific to a layout type for
   helping the client determine when it should send I/O directly through
   the metadata server versus the storage devices.  The data type
   consists of the layout type (thi_layout_type), a bitmap (thi_hintset)
   describing the set of hints supported by the server (they may differ
   based on the layout type), and a list of hints (thi_hintlist) whose
   content is determined by the hintset bitmap.  See the mdsthreshold
   attribute for more details.

   The thi_hintset field is a bitmap of the following values:

   +-------------------------+---+---------+---------------------------+
   | name                    | # | Data    | Description               |
   |                         |   | Type    |                           |
   +-------------------------+---+---------+---------------------------+
   | threshold4_read_size    | 0 | length4 | If a file's length is     |
   |                         |   |         | less than the value of    |
   |                         |   |         | threshold4_read_size,     |
   |                         |   |         | then it is RECOMMENDED    |
   |                         |   |         | that the client read from |
   |                         |   |         | the file via the MDS and  |
   |                         |   |         | not a storage device.     |
   | threshold4_write_size   | 1 | length4 | If a file's length is     |
   |                         |   |         | less than the value of    |
   |                         |   |         | threshold4_write_size,    |
   |                         |   |         | then it is RECOMMENDED    |
   |                         |   |         | that the client write to  |
   |                         |   |         | the file via the MDS and  |
   |                         |   |         | not a storage device.     |
   | threshold4_read_iosize  | 2 | length4 | For read I/O sizes below  |
   |                         |   |         | this threshold, it is     |
   |                         |   |         | RECOMMENDED to read data  |
   |                         |   |         | through the MDS.          |
   | threshold4_write_iosize | 3 | length4 | For write I/O sizes below |
   |                         |   |         | this threshold, it is     |
   |                         |   |         | RECOMMENDED to write data |
   |                         |   |         | through the MDS.          |
   +-------------------------+---+---------+---------------------------+
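
   As a hedged sketch of how a client might consume these hints, the
   Python below decodes thi_hintlist under the assumption that the hints
   present in thi_hintset are packed in ascending bit order as 8-byte
   big-endian length4 values, by analogy with how attribute values
   follow an attribute bitmap.  The function names are hypothetical, and
   combining the two read hints with a logical OR is an assumption
   rather than a rule stated in this document.

```python
import struct
from typing import Dict, List

HINT_NAMES = {
    0: "threshold4_read_size",
    1: "threshold4_write_size",
    2: "threshold4_read_iosize",
    3: "threshold4_write_iosize",
}


def decode_hints(hintset: List[int], hintlist: bytes) -> Dict[str, int]:
    """Map each hint flagged in the bitmap4 words to its length4 value."""
    hints = {}
    pos = 0
    for bit in sorted(HINT_NAMES):
        word, shift = divmod(bit, 32)  # bitmap4 is an array of uint32 words
        if word < len(hintset) and (hintset[word] >> shift) & 1:
            (value,) = struct.unpack_from(">Q", hintlist, pos)
            hints[HINT_NAMES[bit]] = value
            pos += 8
    return hints


def prefer_mds_for_read(hints: Dict[str, int], file_size: int,
                        io_size: int) -> bool:
    """True if either read hint recommends reading through the MDS."""
    # A missing hint defaults to 0, which no size can be less than.
    if file_size < hints.get("threshold4_read_size", 0):
        return True
    return io_size < hints.get("threshold4_read_iosize", 0)
```

   A write-side helper would be symmetric, using the two write hints.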

3.3.23. mdsthreshold4

   struct mdsthreshold4 {
           threshold_item4 mth_hints<>;
   };

   This data type holds an array of elements of data type
   threshold_item4, each of which is valid for a particular layout type.
   An array is necessary because a server can support multiple layout
   types for a single file.

