const ACCESS4_READ    = 0x00000001;
const ACCESS4_LOOKUP  = 0x00000002;
const ACCESS4_MODIFY  = 0x00000004;
const ACCESS4_EXTEND  = 0x00000008;
const ACCESS4_DELETE  = 0x00000010;
const ACCESS4_EXECUTE = 0x00000020;

struct ACCESS4args {
   /* CURRENT_FH: object */
   uint32_t access;
};

struct ACCESS4resok {
   uint32_t supported;
   uint32_t access;
};

union ACCESS4res switch (nfsstat4 status) {
   case NFS4_OK:
      ACCESS4resok resok4;
   default:
      void;
};
ACCESS determines the access rights that a user, as identified by the credentials in the RPC request, has with respect to the file system object specified by the current filehandle. The client encodes the set of access rights that are to be checked in the bit mask "access". The server checks the permissions encoded in the bit mask. If a status of NFS4_OK is returned, two bit masks are included in the response. The first, "supported", represents the access rights that the server can reliably verify. The second, "access", represents the access rights available to the user for the filehandle provided. On success, the current filehandle retains its value.
Note that the reply's supported and access fields MUST NOT contain more values than originally set in the request's access field. For example, if the client sends an ACCESS operation with just the ACCESS4_READ value set and the server supports this value, the server MUST NOT set more than ACCESS4_READ in the supported field even if it could have reliably checked other values.
The reply's access field MUST NOT contain more values than the supported field.
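The two subset rules above can be checked mechanically on the client side. The following is a minimal sketch in Python; the helper name access_reply_is_consistent is hypothetical and is not defined by the protocol.

```python
# ACCESS4_* bit values as given in the XDR definition of this operation.
ACCESS4_READ = 0x00000001
ACCESS4_LOOKUP = 0x00000002
ACCESS4_MODIFY = 0x00000004
ACCESS4_EXTEND = 0x00000008
ACCESS4_DELETE = 0x00000010
ACCESS4_EXECUTE = 0x00000020


def access_reply_is_consistent(requested: int, supported: int, access: int) -> bool:
    """Return True if an ACCESS reply obeys both subset rules.

    Hypothetical helper: 'requested' is the access mask sent in the
    request; 'supported' and 'access' are the masks from the reply.
    """
    # Rule 1: supported must not contain bits that were not requested.
    supported_within_request = (supported & ~requested) == 0
    # Rule 2: access must not contain bits that are not in supported.
    access_within_supported = (access & ~supported) == 0
    return supported_within_request and access_within_supported
```

Because access must be a subset of supported, and supported a subset of the request, a reply that passes both checks also has no access bit that was not requested.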
The results of this operation are necessarily advisory in nature. A return status of NFS4_OK and the appropriate bit set in the bit mask do not imply that such access will be allowed to the file system object in the future. This is because access rights can be revoked by the server at any time.
The following access permissions may be requested:
- ACCESS4_READ: Read data from file or read a directory.
- ACCESS4_LOOKUP: Look up a name in a directory (no meaning for non-directory objects).
- ACCESS4_MODIFY: Rewrite existing file data or modify existing directory entries.
- ACCESS4_EXTEND: Write new data or add directory entries.
- ACCESS4_DELETE: Delete an existing directory entry.
- ACCESS4_EXECUTE: Execute a regular file (no meaning for a directory).
ACCESS4_EXECUTE is a challenging semantic to implement because NFS provides remote file access, not remote execution. This leads to the following:
- Whether or not a regular file is executable ought to be the responsibility of the NFS client and not the server. And yet the ACCESS operation is specified to seemingly require a server to own that responsibility.
- When a client executes a regular file, it has to read the file from the server. Strictly speaking, the server should not allow the client to read a file being executed unless the user has read permissions on the file. Requiring explicit read permissions on executable files in order to access them over NFS is not going to be acceptable to some users and storage administrators. Historically, NFS servers have allowed a user to READ a file if the user has execute access to the file.
As a practical example, the UNIX specification [60] states that an implementation claiming conformance to UNIX may indicate in the access() programming interface's result that a privileged user has execute rights, even if no execute permission bits are set on the regular file's attributes. It is possible to claim conformance to the UNIX specification and instead not indicate execute rights in that situation, which is true for some operating environments. Suppose the operating environments of the client and server are implementing the access() semantics for privileged users differently, and the ACCESS operation implementations of the client and server follow their respective access() semantics. This can cause undesired behavior:
- Suppose the client's access() interface returns X_OK if the user is privileged and no execute permission bits are set on the regular file's attribute, and the server's access() interface does not return X_OK in that situation. Then the client will be unable to execute files stored on the NFS server that could be executed if stored on a non-NFS file system.
- Suppose the client's access() interface does not return X_OK if the user is privileged, and no execute permission bits are set on the regular file's attribute, and the server's access() interface does return X_OK in that situation. Then:
   - The client will be able to execute files stored on the NFS server that could be executed if stored on a non-NFS file system, unless the client's execution subsystem also checks for execute permission bits.
   - Even if the execution subsystem is checking for execute permission bits, there are more potential issues. For example, suppose the client is invoking access() to build a "path search table" of all executable files in the user's "search path", where the path is a list of directories each containing executable files. Suppose there are two files each in separate directories of the search path, such that files have the same component name. In the first directory the file has no execute permission bits set, and in the second directory the file has execute bits set. The path search table will indicate that the first directory has the executable file, but the execute subsystem will fail to execute it. The command shell might fail to try the second file in the second directory. And even if it did, this is a potential performance issue. Clearly, the desired outcome for the client is for the path search table to not contain the first file.
To deal with the problems described above, the "smart client, stupid server" principle is used. The client owns overall responsibility for determining execute access and relies on the server to parse the execution permissions within the file's mode, acl, and dacl attributes. The rules for the client and server follow:
- If the client is sending ACCESS in order to determine if the user can read the file, the client SHOULD set ACCESS4_READ in the request's access field.
- If the client's operating environment only grants execution to the user if the user has execute access according to the execute permissions in the mode, acl, and dacl attributes, then if the client wants to determine execute access, the client SHOULD send an ACCESS request with ACCESS4_EXECUTE bit set in the request's access field.
- If the client's operating environment grants execution to the user even if the user does not have execute access according to the execute permissions in the mode, acl, and dacl attributes, then if the client wants to determine execute access, it SHOULD send an ACCESS request with both the ACCESS4_EXECUTE and ACCESS4_READ bits set in the request's access field. This way, if any read or execute permission grants the user read or execute access (or if the server interprets the user as privileged), as indicated by the presence of ACCESS4_EXECUTE and/or ACCESS4_READ in the reply's access field, the client will be able to grant the user execute access to the file.
- If the server supports execute permission bits, or some other method for denoting executability (e.g., the suffix of the name of the file might indicate execute), it MUST check only execute permissions, not read permissions, when determining whether or not the reply will have ACCESS4_EXECUTE set in the access field. The server MUST NOT also examine read permission bits when determining whether or not the reply will have ACCESS4_EXECUTE set in the access field. Even if the server's operating environment would grant execute access to the user (e.g., the user is privileged), the server MUST NOT reply with ACCESS4_EXECUTE set in reply's access field unless there is at least one execute permission bit set in the mode, acl, or dacl attributes. In the case of acl and dacl, the "one execute permission bit" MUST be an ACE4_EXECUTE bit set in an ALLOW ACE.
- If the server does not support execute permission bits or some other method for denoting executability, it MUST NOT set ACCESS4_EXECUTE in the reply's supported and access fields. If the client set ACCESS4_EXECUTE in the ACCESS request's access field, and ACCESS4_EXECUTE is not set in the reply's supported field, then the client will have to send an ACCESS request with the ACCESS4_READ bit set in the request's access field.
- If the server supports read permission bits, it MUST only check for read permissions in the mode, acl, and dacl attributes when it receives an ACCESS request with ACCESS4_READ set in the access field. The server MUST NOT also examine execute permission bits when determining whether the reply will have ACCESS4_READ set in the access field or not.
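The client-side rules above can be sketched as two small functions: one that chooses the request mask and one that interprets the reply. This is an illustrative Python sketch only. The function names and the boolean parameter env_requires_execute_bits (true when the client's operating environment grants execution only on the basis of execute permissions) are hypothetical.

```python
ACCESS4_READ = 0x00000001
ACCESS4_EXECUTE = 0x00000020


def execute_request_mask(env_requires_execute_bits: bool) -> int:
    """Choose the access mask to send when probing for execute access."""
    if env_requires_execute_bits:
        # Environment grants execution only via execute permissions.
        return ACCESS4_EXECUTE
    # Environment may grant execution without execute permissions
    # (e.g., to a privileged user), so ask about READ as well.
    return ACCESS4_EXECUTE | ACCESS4_READ


def client_grants_execute(env_requires_execute_bits: bool, reply_access: int) -> bool:
    """Decide execute access from the reply's access field."""
    if env_requires_execute_bits:
        return bool(reply_access & ACCESS4_EXECUTE)
    # Either bit in the reply is enough for this kind of environment.
    return bool(reply_access & (ACCESS4_EXECUTE | ACCESS4_READ))
```

The server side stays simple under this split: it reports execute and read permissions independently, and the client combines them according to its own access() semantics.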
Note that if the ACCESS reply has ACCESS4_READ or ACCESS4_EXECUTE set, then the user also has permissions to OPEN (Section 18.16) or READ (Section 18.22) the file. In other words, if the client sends an ACCESS request with the ACCESS4_READ and ACCESS4_EXECUTE set in the access field (or two separate requests, one with ACCESS4_READ set and the other with ACCESS4_EXECUTE set), and the reply has just ACCESS4_EXECUTE set in the access field (or just one reply has ACCESS4_EXECUTE set), then the user has authorization to OPEN or READ the file.
In general, it is not sufficient for the client to attempt to deduce access permissions by inspecting the uid, gid, and mode fields in the file attributes or by attempting to interpret the contents of the ACL attribute. This is because the server may perform uid or gid mapping or enforce additional access-control restrictions. It is also possible that the server may not be in the same ID space as the client. In these cases (and perhaps others), the client cannot reliably perform an access check with only current file attributes.
In the NFSv2 protocol, the only reliable way to determine whether an operation was allowed was to try it and see if it succeeded or failed. Using the ACCESS operation in the NFSv4.1 protocol, the client can ask the server to indicate whether or not one or more classes of operations are permitted. The ACCESS operation is provided to allow clients to check before doing a series of operations that will result in an access failure. The OPEN operation provides a point where the server can verify access to the file object and a method to return that information to the client. The ACCESS operation is still useful for directory operations or for use in the case that the UNIX interface access() is used on the client.
The information returned by the server in response to an ACCESS call is not permanent. It was correct at the exact time that the server performed the checks, but not necessarily afterwards. The server can revoke access permission at any time.
The client should use the effective credentials of the user to build the authentication information in the ACCESS request used to determine access rights. It is the effective user and group credentials that are used in subsequent READ and WRITE operations.
Many implementations do not directly support the ACCESS4_DELETE permission. Operating systems like UNIX will ignore the ACCESS4_DELETE bit if set on an access request on a non-directory object. In these systems, delete permission on a file is determined by the access permissions on the directory in which the file resides, instead of being determined by the permissions of the file itself. Therefore, the mask returned enumerating which access rights can be determined will have the ACCESS4_DELETE value set to 0. This indicates to the client that the server was unable to check that particular access right. The ACCESS4_DELETE bit in the access mask returned will then be ignored by the client.
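A client therefore has three possible outcomes for each requested bit, not two. The sketch below shows one way to express that in Python; the function name classify_right and the result strings are hypothetical.

```python
ACCESS4_READ = 0x00000001
ACCESS4_DELETE = 0x00000010


def classify_right(bit: int, supported: int, access: int) -> str:
    """Interpret one requested access bit from an ACCESS reply."""
    if not (supported & bit):
        # The server could not check this right; the bit in the
        # access mask carries no information and is ignored.
        return "unknown"
    return "granted" if (access & bit) else "denied"
```

For example, a server that cannot evaluate ACCESS4_DELETE on a regular file leaves that bit clear in supported, and the client falls back to attempting the operation.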
struct CLOSE4args {
   /* CURRENT_FH: object */
   seqid4   seqid;
   stateid4 open_stateid;
};

union CLOSE4res switch (nfsstat4 status) {
   case NFS4_OK:
      stateid4 open_stateid;
   default:
      void;
};
The CLOSE operation releases share reservations for the regular or named attribute file as specified by the current filehandle. The share reservations and other state information released at the server as a result of this CLOSE are only those associated with the supplied stateid. State associated with other OPENs is not affected.
If byte-range locks are held, the client SHOULD release all locks before sending a CLOSE. The server MAY free all outstanding locks on CLOSE, but some servers may not support the CLOSE of a file that still has byte-range locks held. The server MUST return failure if any locks would exist after the CLOSE.
The argument seqid MAY have any value, and the server MUST ignore seqid.
On success, the current filehandle retains its value.
The server MAY require that the combination of principal, security flavor, and, if applicable, GSS mechanism that sent the OPEN request also be the one to CLOSE the file. This might not be possible if credentials for the principal are no longer available. The server MAY allow the machine credential or SSV credential (see Section 18.35) to send CLOSE.
Even though CLOSE returns a stateid, this stateid is not useful to the client and should be treated as deprecated. CLOSE "shuts down" the state associated with all OPENs for the file by a single open-owner. As noted above, CLOSE will either release all file-locking state or return an error. Therefore, the stateid returned by CLOSE is not useful for operations that follow. To help find any uses of this stateid by clients, the server SHOULD return the invalid special stateid (the "other" value is zero and the "seqid" field is NFS4_UINT32_MAX, see Section 8.2.3).
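A client that wants to detect accidental reuse of the stateid returned by CLOSE could test for the invalid special stateid. The following Python sketch assumes a stateid4 made of a 32-bit seqid and a 12-byte "other" field; the 12-byte size, the Stateid4 class, and the helper name are assumptions for illustration and are not defined in this section.

```python
from dataclasses import dataclass

NFS4_UINT32_MAX = 0xFFFFFFFF
NFS4_OTHER_SIZE = 12  # assumption: size in bytes of the "other" field


@dataclass(frozen=True)
class Stateid4:
    """Hypothetical in-memory form of a stateid4."""
    seqid: int
    other: bytes


# The invalid special stateid: "other" is all zeros, seqid is NFS4_UINT32_MAX.
INVALID_STATEID = Stateid4(seqid=NFS4_UINT32_MAX, other=bytes(NFS4_OTHER_SIZE))


def is_invalid_special_stateid(sid: Stateid4) -> bool:
    """Return True if sid is the invalid special stateid."""
    return sid.seqid == NFS4_UINT32_MAX and not any(sid.other)
```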
A CLOSE operation may make delegations grantable where they were not previously. Servers may choose to respond immediately if there are pending delegation want requests or may respond to the situation at a later time.
struct COMMIT4args {
   /* CURRENT_FH: file */
   offset4 offset;
   count4  count;
};

struct COMMIT4resok {
   verifier4 writeverf;
};

union COMMIT4res switch (nfsstat4 status) {
   case NFS4_OK:
      COMMIT4resok resok4;
   default:
      void;
};
The COMMIT operation forces or flushes uncommitted, modified data to stable storage for the file specified by the current filehandle. The flushed data is that which was previously written with one or more WRITE operations that had the "committed" field of their results field set to UNSTABLE4.
The offset specifies the position within the file where the flush is to begin. An offset value of zero means to flush data starting at the beginning of the file. The count specifies the number of bytes of data to flush. If the count is zero, a flush from the offset to the end of the file is done.
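As an illustration of the offset and count semantics, the sketch below computes the half-open byte range a server might flush. It is a Python sketch under assumptions: the function name commit_range is hypothetical, and clamping the range to the current file size is an implementation choice made here, not a requirement stated in the text.

```python
from typing import Tuple


def commit_range(offset: int, count: int, file_size: int) -> Tuple[int, int]:
    """Return the half-open range [start, end) of bytes to flush."""
    start = min(offset, file_size)
    if count == 0:
        # A count of zero means "from offset to the end of the file".
        end = file_size
    else:
        end = min(offset + count, file_size)
    # Never return a range that ends before it starts.
    return start, max(start, end)
```

With this helper, commit_range(0, 0, size) covers the whole file, which corresponds to the full-file COMMIT case.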
The server returns a write verifier upon successful completion of the COMMIT. The write verifier is used by the client to determine if the server has restarted between the initial WRITE operations and the COMMIT. The client does this by comparing the write verifier returned from the initial WRITE operations and the verifier returned by the COMMIT operation. The server must vary the value of the write verifier at each server event or instantiation that may lead to a loss of uncommitted data. Most commonly this occurs when the server is restarted; however, other events at the server may result in uncommitted data loss as well.
On success, the current filehandle retains its value.
The COMMIT operation is similar in operation and semantics to the POSIX fsync() [22] system interface that synchronizes a file's state with the disk (file data and metadata are flushed to disk or stable storage). COMMIT performs the same operation for a client, flushing any unsynchronized data and metadata on the server to the server's disk or stable storage for the specified file. Like fsync(), it may be that there is some modified data or no modified data to synchronize. The data may have been synchronized by the server's normal periodic buffer synchronization activity. COMMIT should return NFS4_OK, unless there has been an unexpected error.
COMMIT differs from fsync() in that it is possible for the client to flush a range of the file (most likely triggered by a buffer-reclamation scheme on the client before the file has been completely written).
The server implementation of COMMIT is reasonably simple. If the server receives a full file COMMIT request, that is, starting at offset zero and count zero, it should do the equivalent of applying fsync() to the entire file. Otherwise, it should arrange to have the modified data in the range specified by offset and count to be flushed to stable storage. In both cases, any metadata associated with the file must be flushed to stable storage before returning. It is not an error for there to be nothing to flush on the server. This means that the data and metadata that needed to be flushed have already been flushed or lost during the last server failure.
The client implementation of COMMIT is a little more complex. There are two reasons for wanting to commit a client buffer to stable storage. The first is that the client wants to reuse a buffer. In this case, the offset and count of the buffer are sent to the server in the COMMIT request. The server then flushes any modified data based on the offset and count, and flushes any modified metadata associated with the file. It then returns the status of the flush and the write verifier. The second reason for the client to generate a COMMIT is for a full file flush, such as may be done at close. In this case, the client would gather all of the buffers for this file that contain uncommitted data, do the COMMIT operation with an offset of zero and count of zero, and then free all of those buffers. Any other dirty buffers would be sent to the server in the normal fashion.
After a buffer is written (via the WRITE operation) by the client with the "committed" field in the result of WRITE set to UNSTABLE4, the buffer must be considered as modified by the client until the buffer has either been flushed via a COMMIT operation or written via a WRITE operation with the "committed" field in the result set to FILE_SYNC4 or DATA_SYNC4. This is done to prevent the buffer from being freed and reused before the data can be flushed to stable storage on the server.
When a response is returned from either a WRITE or a COMMIT operation and it contains a write verifier that differs from that previously returned by the server, the client will need to retransmit all of the buffers containing uncommitted data to the server. How this is to be done is up to the implementor. If there is only one buffer of interest, then it should be sent in a WRITE request with the FILE_SYNC4 stable parameter. If there is more than one buffer, it might be worthwhile retransmitting all of the buffers in WRITE operations with the stable parameter set to UNSTABLE4 and then retransmitting the COMMIT operation to flush all of the data on the server to stable storage. However, if the server repeatedly returns from COMMIT a verifier that differs from that returned by WRITE, the only way to ensure progress is to retransmit all of the buffers with WRITE requests with the FILE_SYNC4 stable parameter.
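The verifier comparison can be sketched as a small client-side tracker. This Python sketch is illustrative only: the class UncommittedWrites and its methods are hypothetical, buffers are keyed by file offset, and every COMMIT is assumed to be a full-file COMMIT so that a matching verifier releases all held buffers.

```python
from typing import Dict, List, Optional, Tuple


class UncommittedWrites:
    """Track buffers written with UNSTABLE4 until they are committed."""

    def __init__(self) -> None:
        self.verifier: Optional[bytes] = None
        self.buffers: Dict[int, bytes] = {}  # offset -> data awaiting COMMIT

    def _stale_buffers(self, writeverf: bytes) -> List[Tuple[int, bytes]]:
        # A changed verifier means the server may have lost uncommitted
        # data, so every held buffer has to be retransmitted.
        changed = self.verifier is not None and writeverf != self.verifier
        self.verifier = writeverf
        return sorted(self.buffers.items()) if changed else []

    def on_unstable_write(self, offset: int, data: bytes,
                          writeverf: bytes) -> List[Tuple[int, bytes]]:
        """Record an UNSTABLE4 WRITE result; return buffers to resend."""
        stale = self._stale_buffers(writeverf)
        self.buffers[offset] = data
        return stale

    def on_commit(self, writeverf: bytes) -> List[Tuple[int, bytes]]:
        """Record a COMMIT result; return buffers to resend, if any."""
        stale = self._stale_buffers(writeverf)
        if not stale:
            # Verifier unchanged: the data is on stable storage and the
            # buffers can be freed and reused.
            self.buffers.clear()
        return stale
```

A caller would resend whatever either method returns, then issue COMMIT again; the buffers are released only once a COMMIT returns the verifier the tracker already holds.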
The above description applies to page-cache-based systems as well as buffer-cache-based systems. In the former systems, the virtual memory system will need to be modified instead of the buffer cache.
union createtype4 switch (nfs_ftype4 type) {
   case NF4LNK:
      linktext4 linkdata;
   case NF4BLK:
   case NF4CHR:
      specdata4 devdata;
   case NF4SOCK:
   case NF4FIFO:
   case NF4DIR:
      void;
   default:
      void;  /* server should return NFS4ERR_BADTYPE */
};

struct CREATE4args {
   /* CURRENT_FH: directory for creation */
   createtype4 objtype;
   component4  objname;
   fattr4      createattrs;
};

struct CREATE4resok {
   change_info4 cinfo;
   bitmap4      attrset;  /* attributes set */
};

union CREATE4res switch (nfsstat4 status) {
   case NFS4_OK:
      /* new CURRENTFH: created object */
      CREATE4resok resok4;
   default:
      void;
};
The CREATE operation creates a file object other than an ordinary file in a directory with a given name. The OPEN operation MUST be used to create a regular file or a named attribute.
The current filehandle must be a directory: an object of type NF4DIR. If the current filehandle is an attribute directory (type NF4ATTRDIR), the error NFS4ERR_WRONG_TYPE is returned. If the current filehandle designates any other type of object, the error NFS4ERR_NOTDIR results.
The objname specifies the name for the new object. The objtype determines the type of object to be created: directory, symlink, etc. If the object type specified is that of an ordinary file, a named attribute, or a named attribute directory, the error NFS4ERR_BADTYPE results.
If an object of the same name already exists in the directory, the server will return the error NFS4ERR_EXIST.
For the directory where the new file object was created, the server returns change_info4 information in cinfo. With the atomic field of the change_info4 data type, the server will indicate if the before and after change attributes were obtained atomically with respect to the file object creation.
If the objname has a length of zero, or if objname does not obey the UTF-8 definition, the error NFS4ERR_INVAL will be returned.
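The error conditions described so far can be collected into a single precheck. The Python sketch below is illustrative: the function name create_precheck is hypothetical, object types are represented as strings, and the order in which the checks run is an assumption, since the text does not state which error takes precedence when several apply. NF4REG and NF4NAMEDATTR are taken to be the nfs_ftype4 names for an ordinary file and a named attribute.

```python
from typing import Set


def create_precheck(dir_type: str, objtype: str, objname: bytes,
                    existing_names: Set[str]) -> str:
    """Return the status a server might give before creating the object."""
    # The current filehandle must designate an ordinary directory.
    if dir_type == "NF4ATTRDIR":
        return "NFS4ERR_WRONG_TYPE"
    if dir_type != "NF4DIR":
        return "NFS4ERR_NOTDIR"
    # Ordinary files, named attributes, and named attribute directories
    # cannot be created with CREATE.
    if objtype in ("NF4REG", "NF4NAMEDATTR", "NF4ATTRDIR"):
        return "NFS4ERR_BADTYPE"
    # The name must be non-empty and valid UTF-8.
    if len(objname) == 0:
        return "NFS4ERR_INVAL"
    try:
        name = objname.decode("utf-8")
    except UnicodeDecodeError:
        return "NFS4ERR_INVAL"
    if name in existing_names:
        return "NFS4ERR_EXIST"
    return "NFS4_OK"
```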
The current filehandle is replaced by that of the new object.
The createattrs specifies the initial set of attributes for the object. The set of attributes may include any writable attribute valid for the object type. When the operation is successful, the server will return to the client an attribute mask signifying which attributes were successfully set for the object.
If createattrs includes neither the owner attribute nor an ACL with an ACE for the owner, and if the server's file system both supports and requires an owner attribute (or an owner ACE), then the server MUST derive the owner (or the owner ACE). This would typically be from the principal indicated in the RPC credentials of the call, but the server's operating environment or file system semantics may dictate other methods of derivation. Similarly, if createattrs includes neither the group attribute nor a group ACE, and if the server's file system both supports and requires the notion of a group attribute (or group ACE), the server MUST derive the group attribute (or the corresponding owner ACE) for the file. This could be from the RPC call's credentials, such as the group principal if the credentials include it (such as with AUTH_SYS), from the group identifier associated with the principal in the credentials (e.g., POSIX systems have a user database [23] that has a group identifier for every user identifier), inherited from the directory in which the object is created, or whatever else the server's operating environment or file system semantics dictate. This applies to the OPEN operation too.
Conversely, it is possible that the client will specify in createattrs an owner attribute, group attribute, or ACL that the principal indicated in the RPC call's credentials does not have permissions to create files for. The error to be returned in this instance is NFS4ERR_PERM. This applies to the OPEN operation too.
If the current filehandle designates a directory for which another client holds a directory delegation, then, unless the delegation is such that the situation can be resolved by sending a notification, the delegation MUST be recalled, and the CREATE operation MUST NOT proceed until the delegation is returned or revoked. Except where this happens very quickly, one or more NFS4ERR_DELAY errors will be returned to requests made while delegation remains outstanding.
When the current filehandle designates a directory for which one or more directory delegations exist, then, when those delegations request such notifications, NOTIFY4_ADD_ENTRY will be generated as a result of this operation.
If the capability FSCHARSET_CAP4_ALLOWS_ONLY_UTF8 is set (Section 14.4), and a symbolic link is being created, then the content of the symbolic link MUST be in UTF-8 encoding.
If the client desires to set attribute values after the create, a SETATTR operation can be added to the COMPOUND request so that the appropriate attributes will be set.
struct DELEGPURGE4args {
   clientid4 clientid;
};

struct DELEGPURGE4res {
   nfsstat4 status;
};
This operation purges all of the delegations awaiting recovery for a given client. This is useful for clients that do not commit delegation information to stable storage to indicate that conflicting requests need not be delayed by the server awaiting recovery of delegation information.
The client is NOT specified by the clientid field of the request. The client SHOULD set the clientid field to zero, and the server MUST ignore the clientid field. Instead, the server MUST derive the client ID from the value of the session ID in the arguments of the SEQUENCE operation that precedes DELEGPURGE in the COMPOUND request.
The DELEGPURGE operation should be used by clients that record delegation information on stable storage on the client. In this case, after the client recovers all delegations it knows of, it should immediately send a DELEGPURGE operation. Doing so will notify the server that no additional delegations for the client will be recovered allowing it to free resources, and avoid delaying other clients which make requests that conflict with the unrecovered delegations. The set of delegations known to the server and the client might be different. The reason for this is that after sending a request that resulted in a delegation, the client might experience a failure before it both received the delegation and committed the delegation to the client's stable storage.
The server MAY support DELEGPURGE, but if it does not, it MUST NOT support CLAIM_DELEGATE_PREV and MUST NOT support CLAIM_DELEG_PREV_FH.
struct DELEGRETURN4args {
   /* CURRENT_FH: delegated object */
   stateid4 deleg_stateid;
};

struct DELEGRETURN4res {
   nfsstat4 status;
};
The DELEGRETURN operation returns the delegation represented by the current filehandle and stateid.
Delegations may be returned voluntarily (i.e., before the server has recalled them) or when recalled. In either case, the client must properly propagate state changed under the context of the delegation to the server before returning the delegation.
The server MAY require that the principal, security flavor, and if applicable, the GSS mechanism, combination that acquired the delegation also be the one to send DELEGRETURN on the file. This might not be possible if credentials for the principal are no longer available. The server MAY allow the machine credential or SSV credential (see Section 18.35) to send DELEGRETURN.
struct GETATTR4args {
   /* CURRENT_FH: object */
   bitmap4 attr_request;
};

struct GETATTR4resok {
   fattr4 obj_attributes;
};

union GETATTR4res switch (nfsstat4 status) {
   case NFS4_OK:
      GETATTR4resok resok4;
   default:
      void;
};
The GETATTR operation will obtain attributes for the file system object specified by the current filehandle. The client sets a bit in the bitmap argument for each attribute value that it would like the server to return. The server returns an attribute bitmap that indicates the attribute values that it was able to return, which will include all attributes requested by the client that are attributes supported by the server for the target file system. This bitmap is followed by the attribute values ordered lowest attribute number first.
The server MUST return a value for each attribute that the client requests if the attribute is supported by the server for the target file system. If the server does not support a particular attribute on the target file system, then it MUST NOT return the attribute value and MUST NOT set the attribute bit in the result bitmap. The server MUST return an error if it supports an attribute on the target but cannot obtain its value. In that case, no attribute values will be returned.
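The relationship between the requested bitmap, the supported set, and the returned values can be sketched as follows. This is a Python illustration under assumptions: a bitmap4 is modeled as a list of 32-bit words in which attribute number n is bit (n mod 32) of word (n div 32), attribute values are supplied in a plain dictionary keyed by attribute number, and the function names are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def bitmap_to_attrs(bitmap: List[int]) -> List[int]:
    """List the attribute numbers set in a bitmap, lowest first."""
    return [word_index * 32 + bit
            for word_index, word in enumerate(bitmap)
            for bit in range(32)
            if (word >> bit) & 1]


def getattr_result(requested: List[int], supported: List[int],
                   values: Dict[int, Any]) -> Tuple[List[int], List[Any]]:
    """Build the result bitmap and the ordered attribute values."""
    nwords = min(len(requested), len(supported))
    # Only attributes that are both requested and supported are returned.
    result = [requested[i] & supported[i] for i in range(nwords)]
    # Values follow the bitmap, ordered lowest attribute number first.
    ordered = [values[attr] for attr in bitmap_to_attrs(result)]
    return result, ordered
```

In this sketch a supported attribute whose value is missing from the dictionary raises KeyError, which stands in for the case where the server must fail the whole operation rather than return a partial set.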
File systems that are absent should be treated as having support for a very small set of attributes as described in Section 11.4.1, even if previously, when the file system was present, more attributes were supported.
All servers MUST support the REQUIRED attributes as specified in Section 5.6, for all file systems, with the exception of absent file systems.
On success, the current filehandle retains its value.
Suppose there is an OPEN_DELEGATE_WRITE delegation held by another client for the file in question and size and/or change are among the set of attributes being interrogated. The server has two choices. First, the server can obtain the actual current value of these attributes from the client holding the delegation by using the CB_GETATTR callback. Second, the server, particularly when the delegated client is unresponsive, can recall the delegation in question. The GETATTR MUST NOT proceed until one of the following occurs:
- The requested attribute values are returned in the response to CB_GETATTR.
- The OPEN_DELEGATE_WRITE delegation is returned.
- The OPEN_DELEGATE_WRITE delegation is revoked.
Unless one of the above happens very quickly, one or more NFS4ERR_DELAY errors will be returned while a delegation is outstanding.
struct GETFH4resok {
   nfs_fh4 object;
};

union GETFH4res switch (nfsstat4 status) {
   case NFS4_OK:
      GETFH4resok resok4;
   default:
      void;
};
This operation returns the current filehandle value.
On success, the current filehandle retains its value.
As described in Section 2.10.6.4, GETFH is REQUIRED or RECOMMENDED to immediately follow certain operations, and servers are free to reject such operations if the client fails to insert GETFH in the request as REQUIRED or RECOMMENDED. Section 18.16.4.1 provides additional justification for why GETFH MUST follow OPEN.
Operations that change the current filehandle like LOOKUP or CREATE do not automatically return the new filehandle as a result. For instance, if a client needs to look up a directory entry and obtain its filehandle, then the following request is needed.
- PUTFH (directory filehandle)
- LOOKUP (entry name)
- GETFH
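To make the role of the current filehandle concrete, the toy evaluator below walks such a request. It is a Python sketch under assumptions, not a model of a real server: the function run_compound and the namespace dictionary are hypothetical, only three operations are handled, processing stops at the first failing operation, and the error names NFS4ERR_NOENT and NFS4ERR_NOFILEHANDLE are used for a missing entry and a missing current filehandle.

```python
from typing import Any, Dict, List, Optional, Tuple


def run_compound(ops: List[Tuple[str, Any]],
                 namespace: Dict[Tuple[bytes, str], bytes]
                 ) -> List[Tuple[str, str, Optional[bytes]]]:
    """Evaluate PUTFH/LOOKUP/GETFH in order, tracking the current filehandle.

    namespace maps (directory filehandle, entry name) to the entry's
    filehandle. Each result is (operation, status, returned filehandle).
    """
    current_fh: Optional[bytes] = None
    results: List[Tuple[str, str, Optional[bytes]]] = []
    for op, arg in ops:
        if op == "PUTFH":
            current_fh = arg
            results.append((op, "NFS4_OK", None))
        elif op == "LOOKUP":
            child = namespace.get((current_fh, arg))
            if child is None:
                results.append((op, "NFS4ERR_NOENT", None))
                break  # stop at the first failure
            # LOOKUP changes the current filehandle but does not return it.
            current_fh = child
            results.append((op, "NFS4_OK", None))
        elif op == "GETFH":
            if current_fh is None:
                results.append((op, "NFS4ERR_NOFILEHANDLE", None))
                break
            # GETFH is the step that hands the filehandle to the client.
            results.append((op, "NFS4_OK", current_fh))
        else:
            raise ValueError("operation not modeled in this sketch: " + op)
    return results
```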
struct LINK4args {
   /* SAVED_FH: source object */
   /* CURRENT_FH: target directory */
   component4 newname;
};

struct LINK4resok {
   change_info4 cinfo;
};

union LINK4res switch (nfsstat4 status) {
   case NFS4_OK:
      LINK4resok resok4;
   default:
      void;
};
The LINK operation creates an additional newname for the file represented by the saved filehandle, as set by the SAVEFH operation, in the directory represented by the current filehandle. The existing file and the target directory must reside within the same file system on the server. On success, the current filehandle will continue to be the target directory. If an object exists in the target directory with the same name as newname, the server must return NFS4ERR_EXIST.
For the target directory, the server returns change_info4 information in cinfo. With the atomic field of the change_info4 data type, the server will indicate if the before and after change attributes were obtained atomically with respect to the link creation.
If the newname has a length of zero, or if newname does not obey the UTF-8 definition, the error NFS4ERR_INVAL will be returned.
The server MAY impose restrictions on the LINK operation such that LINK may not be done when the file is open or when that open is done by particular protocols, or with particular options or access modes. When LINK is rejected because of such restrictions, the error NFS4ERR_FILE_OPEN is returned.
If a server does implement such restrictions and those restrictions include cases of NFSv4 opens preventing successful execution of a link, the server needs to recall any delegations that could hide the existence of opens relevant to that decision. The reason is that when a client holds a delegation, the server might not have an accurate account of the opens for that client, since the client may execute OPENs and CLOSEs locally. The LINK operation must be delayed only until a definitive result can be obtained. For example, suppose there are multiple delegations and one of them establishes an open whose presence would prevent the link. Given the server's semantics, NFS4ERR_FILE_OPEN may be returned to the caller as soon as that delegation is returned without waiting for other delegations to be returned. Similarly, if such opens are not associated with delegations, NFS4ERR_FILE_OPEN can be returned immediately with no delegation recall being done.
If the current filehandle designates a directory for which another client holds a directory delegation, then, unless the delegation is such that the situation can be resolved by sending a notification, the delegation MUST be recalled, and the operation cannot be performed successfully until the delegation is returned or revoked. Except where this happens very quickly, one or more NFS4ERR_DELAY errors will be returned to requests made while the delegation remains outstanding.
When the current filehandle designates a directory for which one or more directory delegations exist, then, when those delegations request such notifications, instead of a recall, NOTIFY4_ADD_ENTRY will be generated as a result of the LINK operation.
If the current file system supports the numlinks attribute, and other clients have delegations to the file being linked, then those delegations MUST be recalled and the LINK operation MUST NOT proceed until all delegations are returned or revoked. Except where this happens very quickly, one or more NFS4ERR_DELAY errors will be returned to requests made while any such delegation remains outstanding.
Changes to any property of the "hard" linked files are reflected in all of the linked files. When a link is made to a file, the attributes for the file should have a value for numlinks that is one greater than the value before the LINK operation.
The statement "file and the target directory must reside within the same file system on the server" means that the fsid fields in the attributes for the objects are the same. If they reside on different file systems, the error NFS4ERR_XDEV is returned. This error may be returned by some servers when there is an internal partitioning of a file system that the LINK operation would violate.
On some servers, "." and ".." are illegal values for newname and the error NFS4ERR_BADNAME will be returned if they are specified.
When the current filehandle designates a named attribute directory and the object to be linked (the saved filehandle) is not a named attribute for the same object, the error NFS4ERR_XDEV MUST be returned. When the saved filehandle designates a named attribute and the current filehandle is not the appropriate named attribute directory, the error NFS4ERR_XDEV MUST also be returned.
When the current filehandle designates a named attribute directory and the object to be linked (the saved filehandle) is a named attribute within that directory, the server may return the error NFS4ERR_NOTSUPP.
In the case that newname is already linked to the file represented by the saved filehandle, the server will return NFS4ERR_EXIST.
Note that symbolic links are created with the CREATE operation.
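The argument checks described above for LINK can be summarized in a small sketch. This is only an illustration under stated assumptions: the names DirState and link_status are hypothetical, the directory is modeled as an in-memory set of names, and the order in which the checks run is an assumption, since the text does not prescribe one. Delegation recall and the named attribute cases are not modeled.

```python
# Illustrative sketch only: DirState and link_status are hypothetical names,
# and the ordering of the checks is an assumption, not mandated by the text.
from dataclasses import dataclass, field


@dataclass
class DirState:
    fsid: tuple                      # (major, minor) of the target directory
    entries: set = field(default_factory=set)


def link_status(newname: bytes, file_fsid: tuple, target: DirState) -> str:
    """Return the status a server might give for LINK of newname into target."""
    if len(newname) == 0:
        return "NFS4ERR_INVAL"       # zero-length newname
    try:
        name = newname.decode("utf-8")
    except UnicodeDecodeError:
        return "NFS4ERR_INVAL"       # newname does not obey the UTF-8 definition
    if file_fsid != target.fsid:
        return "NFS4ERR_XDEV"        # file and directory on different file systems
    if name in target.entries:
        return "NFS4ERR_EXIST"       # an object with that name already exists
    target.entries.add(name)
    return "NFS4_OK"
```

For example, linking a file whose fsid is (1, 0) into a directory whose fsid is (2, 0) yields NFS4ERR_XDEV in this model, regardless of the name chosen.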
/*
* For LOCK, transition from open_stateid and lock_owner
* to a lock stateid.
*/
struct open_to_lock_owner4 {
seqid4 open_seqid;
stateid4 open_stateid;
seqid4 lock_seqid;
lock_owner4 lock_owner;
};
/*
* For LOCK, existing lock stateid continues to request new
* file lock for the same lock_owner and open_stateid.
*/
struct exist_lock_owner4 {
stateid4 lock_stateid;
seqid4 lock_seqid;
};
union locker4 switch (bool new_lock_owner) {
case TRUE:
open_to_lock_owner4 open_owner;
case FALSE:
exist_lock_owner4 lock_owner;
};
/*
* LOCK/LOCKT/LOCKU: Record lock management
*/
struct LOCK4args {
/* CURRENT_FH: file */
nfs_lock_type4 locktype;
bool reclaim;
offset4 offset;
length4 length;
locker4 locker;
};
struct LOCK4denied {
offset4 offset;
length4 length;
nfs_lock_type4 locktype;
lock_owner4 owner;
};
struct LOCK4resok {
stateid4 lock_stateid;
};
union LOCK4res switch (nfsstat4 status) {
case NFS4_OK:
LOCK4resok resok4;
case NFS4ERR_DENIED:
LOCK4denied denied;
default:
void;
};
The LOCK operation requests a byte-range lock for the byte-range specified by the offset and length parameters, and lock type specified in the locktype parameter. If this is a reclaim request, the reclaim parameter will be TRUE.
Bytes in a file may be locked even if those bytes are not currently allocated to the file. To lock the file from a specific offset through the end-of-file (no matter how long the file actually is) use a length field equal to NFS4_UINT64_MAX. The server MUST return NFS4ERR_INVAL under the following combinations of length and offset:
- Length is equal to zero.
- Length is not equal to NFS4_UINT64_MAX, and the sum of length and offset exceeds NFS4_UINT64_MAX.
32-bit servers are servers that support locking for byte offsets that fit within 32 bits (i.e., less than or equal to NFS4_UINT32_MAX). If the client specifies a range that overlaps one or more bytes beyond offset NFS4_UINT32_MAX but does not end at offset NFS4_UINT64_MAX, then such a 32-bit server MUST return the error NFS4ERR_BAD_RANGE.
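The range rules above can be expressed as a short validation routine. The following is a minimal sketch, not a normative algorithm: the function name check_lock_range and the server_is_32bit parameter are hypothetical, and it assumes that a range "ends at offset NFS4_UINT64_MAX" exactly when length equals NFS4_UINT64_MAX (the lock-to-end-of-file form).

```python
# Minimal sketch of the offset/length checks; names are hypothetical.
NFS4_UINT64_MAX = 0xFFFFFFFFFFFFFFFF
NFS4_UINT32_MAX = 0xFFFFFFFF


def check_lock_range(offset: int, length: int, server_is_32bit: bool = False) -> str:
    """Classify an (offset, length) pair the way the text above describes."""
    if length == 0:
        return "NFS4ERR_INVAL"
    to_eof = (length == NFS4_UINT64_MAX)
    if not to_eof and offset + length > NFS4_UINT64_MAX:
        return "NFS4ERR_INVAL"
    if server_is_32bit and not to_eof:
        # Assumption: only the to-end-of-file form counts as "ending at
        # offset NFS4_UINT64_MAX"; any other range must fit within 32 bits.
        last_byte = offset + length - 1
        if last_byte > NFS4_UINT32_MAX:
            return "NFS4ERR_BAD_RANGE"
    return "NFS4_OK"
```

Python integers do not overflow, so the sum of offset and length can be compared directly; an implementation using fixed-width 64-bit arithmetic would need to test for wraparound instead.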
If the server returns NFS4ERR_DENIED, the owner, offset, and length of a conflicting lock are returned.
The locker argument specifies the lock-owner that is associated with the LOCK operation. The locker4 structure is a switched union that indicates whether the client has already created byte-range locking state associated with the current open file and lock-owner. In the case in which it has, the argument is just a stateid representing the set of locks associated with that open file and lock-owner, together with a lock_seqid value that MAY be any value and MUST be ignored by the server. In the case where no byte-range locking state has been established, or the client does not have the stateid available, the argument contains the stateid of the open file with which this lock is to be associated, together with the lock-owner with which the lock is to be associated. The open_to_lock_owner case covers the very first lock done by a lock-owner for a given open file and offers a method to use the established state of the open_stateid to transition to the use of a lock stateid.
The following fields of the locker parameter MAY be set to any value by the client and MUST be ignored by the server:
- The clientid field of the lock_owner field of the open_owner field (locker.open_owner.lock_owner.clientid). The reason the server MUST ignore the clientid field is that the server MUST derive the client ID from the session ID from the SEQUENCE operation of the COMPOUND request.
- The open_seqid and lock_seqid fields of the open_owner field (locker.open_owner.open_seqid and locker.open_owner.lock_seqid).
- The lock_seqid field of the lock_owner field (locker.lock_owner.lock_seqid).
Note that the client ID appearing in a LOCK4denied structure is the actual client associated with the conflicting lock, whether this is the client ID associated with the current session or a different one. Thus, if the server returns NFS4ERR_DENIED, it MUST set the clientid field of the owner field of the denied field.
If the current filehandle is not an ordinary file, an error will be returned to the client. In the case that the current filehandle represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is returned. In all other cases, NFS4ERR_WRONG_TYPE is returned.
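The file type check above reduces to a small mapping. The sketch below is illustrative only: the function name is hypothetical, and the file types are passed as the strings "NF4REG", "NF4DIR", and "NF4LNK" rather than as nfs_ftype4 enum values.

```python
# Illustrative mapping from file type to the status described above.
def lock_filetype_status(ftype: str) -> str:
    """Return the status for a locking operation on an object of type ftype."""
    if ftype == "NF4REG":
        return "NFS4_OK"             # ordinary file: the operation may proceed
    if ftype == "NF4DIR":
        return "NFS4ERR_ISDIR"
    if ftype == "NF4LNK":
        return "NFS4ERR_SYMLINK"
    return "NFS4ERR_WRONG_TYPE"      # all other object types
```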
On success, the current filehandle retains its value.
If the server is unable to determine the exact offset and length of the conflicting byte-range lock, the same offset and length that were provided in the arguments should be returned in the denied results.
LOCK operations are subject to permission checks and to checks against the access type of the associated file. However, the specific rights and modes required for various types of locks reflect the semantics of the server-exported file system, and are not specified by the protocol. For example, Windows 2000 allows a write lock of a file open for read access, while a POSIX-compliant system does not.
When the client sends a LOCK operation that corresponds to a range that the lock-owner has locked already (with the same or different lock type), or to a sub-range of such a range, or to a byte-range that includes multiple locks already granted to that lock-owner, in whole or in part, and the server does not support such locking operations (i.e., does not support POSIX locking semantics), the server will return the error NFS4ERR_LOCK_RANGE. In that case, the client may return an error, or it may emulate the required operations, using only LOCK for ranges that do not include any bytes already locked by that lock-owner and LOCKU of locks held by that lock-owner (specifying an exactly matching range and type). Similarly, when the client sends a LOCK operation that amounts to upgrading (changing from a READ_LT lock to a WRITE_LT lock) or downgrading (changing from WRITE_LT lock to a READ_LT lock) an existing byte-range lock, and the server does not support such a lock, the server will return NFS4ERR_LOCK_NOTSUPP. Such operations may not perfectly reflect the required semantics in the face of conflicting LOCK operations from other clients.
When a client holds an OPEN_DELEGATE_WRITE delegation, the client holding that delegation is assured that there are no opens by other clients. Thus, there can be no conflicting LOCK operations from such clients. Therefore, the client may be handling locking requests locally, without doing LOCK operations on the server. If it does that, it must be prepared to update the lock status on the server, by sending appropriate LOCK and LOCKU operations before returning the delegation.
When one or more clients hold OPEN_DELEGATE_READ delegations, any LOCK operation where the server is implementing mandatory locking semantics MUST result in the recall of all such delegations. The LOCK operation may not be granted until all such delegations are returned or revoked. Except where this happens very quickly, one or more NFS4ERR_DELAY errors will be returned to requests made while the delegation remains outstanding.
struct LOCKT4args {
/* CURRENT_FH: file */
nfs_lock_type4 locktype;
offset4 offset;
length4 length;
lock_owner4 owner;
};
union LOCKT4res switch (nfsstat4 status) {
case NFS4ERR_DENIED:
LOCK4denied denied;
case NFS4_OK:
void;
default:
void;
};
The LOCKT operation tests the lock as specified in the arguments. If a conflicting lock exists, the owner, offset, length, and type of the conflicting lock are returned. The owner field in the results includes the client ID of the owner of the conflicting lock, whether this is the client ID associated with the current session or a different client ID. If no lock is held, nothing other than NFS4_OK is returned. Lock types READ_LT and READW_LT are processed in the same way in that a conflicting lock test is done without regard to blocking or non-blocking. The same is true for WRITE_LT and WRITEW_LT.
The ranges are specified as for LOCK. The NFS4ERR_INVAL and NFS4ERR_BAD_RANGE errors are returned under the same circumstances as for LOCK.
The clientid field of the owner MAY be set to any value by the client and MUST be ignored by the server. The reason the server MUST ignore the clientid field is that the server MUST derive the client ID from the session ID from the SEQUENCE operation of the COMPOUND request.
If the current filehandle is not an ordinary file, an error will be returned to the client. In the case that the current filehandle represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is returned. In all other cases, NFS4ERR_WRONG_TYPE is returned.
On success, the current filehandle retains its value.
If the server is unable to determine the exact offset and length of the conflicting lock, the same offset and length that were provided in the arguments should be returned in the denied results.
LOCKT uses a lock_owner4 rather than a stateid4, as is used in LOCK, to identify the owner. This is because the client does not have to open the file to test for the existence of a lock, so a stateid might not be available.
As noted in Section 18.10.4, some servers may return NFS4ERR_LOCK_RANGE to certain (otherwise non-conflicting) LOCK operations that overlap ranges already granted to the current lock-owner.
The LOCKT operation's test for conflicting locks SHOULD exclude locks for the current lock-owner, and thus should return NFS4_OK in such cases. Note that this means that a server might return NFS4_OK to a LOCKT request even though a LOCK operation for the same range and lock-owner would fail with NFS4ERR_LOCK_RANGE.
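The conflict test described for LOCKT can be sketched as follows. This is a simplified model under stated assumptions: HeldLock and lockt are hypothetical names, lock state is a plain list, and the usual reader/writer rule is assumed (two overlapping locks conflict unless both are read locks). Locks belonging to the requesting lock-owner are skipped, and the READW_LT and WRITEW_LT types are treated exactly like their non-blocking counterparts.

```python
# Simplified LOCKT conflict test; HeldLock and lockt are hypothetical names.
from dataclasses import dataclass

NFS4_UINT64_MAX = 0xFFFFFFFFFFFFFFFF
READ_LT, WRITE_LT, READW_LT, WRITEW_LT = 1, 2, 3, 4


@dataclass(frozen=True)
class HeldLock:
    owner: tuple        # (clientid, owner bytes) identifying the lock-owner
    offset: int
    length: int
    write: bool         # True for a write lock, False for a read lock


def _end(offset: int, length: int) -> int:
    """Exclusive end of a range; a to-end-of-file lock extends to 2**64."""
    return 2**64 if length == NFS4_UINT64_MAX else offset + length


def lockt(held, owner, locktype, offset, length):
    """Return (status, conflicting_lock_or_None) for a LOCKT-style test."""
    wants_write = locktype in (WRITE_LT, WRITEW_LT)
    for lk in held:
        if lk.owner == owner:
            continue    # locks of the current lock-owner are excluded
        overlaps = lk.offset < _end(offset, length) and offset < _end(lk.offset, lk.length)
        if overlaps and (wants_write or lk.write):
            return "NFS4ERR_DENIED", lk
    return "NFS4_OK", None
```

Because the requesting lock-owner's own locks are skipped, this test can report NFS4_OK for a range on which a server without POSIX locking semantics would still reject a LOCK with NFS4ERR_LOCK_RANGE.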
When a client holds an OPEN_DELEGATE_WRITE delegation, it may choose (see Section 18.10.4) to handle LOCK requests locally. In such a case, LOCKT requests will similarly be handled locally.
struct LOCKU4args {
/* CURRENT_FH: file */
nfs_lock_type4 locktype;
seqid4 seqid;
stateid4 lock_stateid;
offset4 offset;
length4 length;
};
union LOCKU4res switch (nfsstat4 status) {
case NFS4_OK:
stateid4 lock_stateid;
default:
void;
};
The LOCKU operation unlocks the byte-range lock specified by the parameters. The client may set the locktype field to any value that is legal for the nfs_lock_type4 enumerated type, and the server MUST accept any legal value for locktype. Any legal value for locktype has no effect on the success or failure of the LOCKU operation.
The ranges are specified as for LOCK. The NFS4ERR_INVAL and NFS4ERR_BAD_RANGE errors are returned under the same circumstances as for LOCK.
The seqid parameter MAY be any value and the server MUST ignore it.
If the current filehandle is not an ordinary file, an error will be returned to the client. In the case that the current filehandle represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is returned. In all other cases, NFS4ERR_WRONG_TYPE is returned.
On success, the current filehandle retains its value.
The server MAY require that the combination of principal, security flavor, and, if applicable, GSS mechanism that sent a LOCK operation also be the one to send LOCKU on the file. This might not be possible if credentials for the principal are no longer available. The server MAY allow the machine credential or SSV credential (see Section 18.35) to send LOCKU.
If the area to be unlocked does not correspond exactly to a lock actually held by the lock-owner, the server may return the error NFS4ERR_LOCK_RANGE. This includes the case in which the area is not locked, where the area is a sub-range of the area locked, where it overlaps the area locked without matching exactly, or the area specified includes multiple locks held by the lock-owner. In all of these cases, allowed by POSIX locking [21] semantics, a client receiving this error should, if it desires support for such operations, simulate the operation using LOCKU on ranges corresponding to locks it actually holds, possibly followed by LOCK operations for the sub-ranges not being unlocked.
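The client-side simulation mentioned above can be sketched for the simplest case, unlocking a sub-range of a single held lock. The function name emulate_unlock and the tuple representation of operations are hypothetical, and only finite lengths are handled. The sequence is not atomic: between the LOCKU and the follow-up LOCK operations another client could acquire a conflicting lock, so the re-locks may be denied.

```python
# Hypothetical helper: plan the operations that emulate unlocking a sub-range
# of one held lock on a server that only accepts exactly matching LOCKU ranges.
def emulate_unlock(held, unlock):
    """held and unlock are (offset, length) pairs with finite lengths.

    Returns a list of ("LOCKU" | "LOCK", offset, length) tuples.
    """
    h_off, h_len = held
    u_off, u_len = unlock
    h_end = h_off + h_len
    u_end = u_off + u_len
    if u_len <= 0 or u_off < h_off or u_end > h_end:
        raise ValueError("unlock range must be a non-empty sub-range of the held lock")
    ops = [("LOCKU", h_off, h_len)]              # release the lock actually held
    if u_off > h_off:
        ops.append(("LOCK", h_off, u_off - h_off))   # re-lock the leading part
    if u_end < h_end:
        ops.append(("LOCK", u_end, h_end - u_end))   # re-lock the trailing part
    return ops
```

For a held lock covering bytes 0 through 99 and a request to unlock bytes 40 through 59, the plan is a LOCKU of (0, 100) followed by LOCK operations for (0, 40) and (60, 40).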
When a client holds an OPEN_DELEGATE_WRITE delegation, it may choose (see Section 18.10.4) to handle LOCK requests locally. In such a case, LOCKU operations will similarly be handled locally.
struct LOOKUP4args {
/* CURRENT_FH: directory */
component4 objname;
};
struct LOOKUP4res {
/* New CURRENT_FH: object */
nfsstat4 status;
};
The LOOKUP operation looks up or finds a file system object using the directory specified by the current filehandle. LOOKUP evaluates the component and if the object exists, the current filehandle is replaced with the component's filehandle.
If the component cannot be evaluated either because it does not exist or because the client does not have permission to evaluate the component, then an error will be returned and the current filehandle will be unchanged.
If the component is a zero-length string or if any component does not obey the UTF-8 definition, the error NFS4ERR_INVAL will be returned.
If the client wants to achieve the effect of a multi-component lookup, it may construct a COMPOUND request such as the following, which also obtains the filehandle of each component:
PUTFH (directory filehandle)
LOOKUP "pub"
GETFH
LOOKUP "foo"
GETFH
LOOKUP "bar"
GETFH
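A client could generate such an operation list from a path string. The sketch below is illustrative only: lookup_compound is a hypothetical name, operations are modeled as plain tuples rather than XDR-encoded arguments, and the list mirrors the example above (session handling such as a leading SEQUENCE operation is left to the caller). Empty components are skipped during client-side parsing, since a zero-length component would be rejected with NFS4ERR_INVAL.

```python
# Hypothetical helper that expands a slash-separated path into the
# PUTFH / LOOKUP / GETFH sequence shown above.
def lookup_compound(dir_fh, path: str):
    """Return a list of operation tuples for a multi-component lookup."""
    ops = [("PUTFH", dir_fh)]
    for component in path.split("/"):
        if component == "":
            continue                 # never send a zero-length component
        ops.append(("LOOKUP", component))
        ops.append(("GETFH",))       # capture the filehandle of this component
    return ops
```

Processing of a COMPOUND stops at the first operation that fails, so the number of GETFH results returned tells the client how far the lookup progressed.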
Unlike NFSv3, NFSv4.1 allows LOOKUP requests to cross mountpoints on the server. The client can detect a mountpoint crossing by comparing the fsid attribute of the directory with the fsid attribute of the directory looked up. If the fsids are different, then the new directory is a server mountpoint. UNIX clients that detect a mountpoint crossing will need to mount the server's file system. This needs to be done to maintain the file object identity checking mechanisms common to UNIX clients.
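The mountpoint check described above amounts to comparing two fsid values. In this minimal sketch the fsid is modeled as a (major, minor) tuple and the function name is hypothetical; how the client obtains the two attribute values (for example, with GETATTR) is outside the sketch.

```python
# Minimal sketch: detect a server mountpoint crossing by comparing fsids.
def crossed_mountpoint(parent_fsid: tuple, child_fsid: tuple) -> bool:
    """Return True if the looked-up directory is on a different file system."""
    return parent_fsid != child_fsid
```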
Servers that limit NFS access to "shared" or "exported" file systems should provide a pseudo file system into which the exported file systems can be integrated, so that clients can browse the server's namespace. The client's view of a pseudo file system will be limited to paths that lead to exported file systems.
Note: previous versions of the protocol assigned special semantics to the names "." and "..". NFSv4.1 assigns no special semantics to these names. The LOOKUPP operator must be used to look up a parent directory.
Note that this operation does not follow symbolic links. The client is responsible for all parsing of filenames, including filenames that are modified by symbolic links encountered during the lookup process.
If the current filehandle supplied is not a directory but a symbolic link, the error NFS4ERR_SYMLINK is returned. For all other non-directory file types, the error NFS4ERR_NOTDIR is returned.
/* CURRENT_FH: object */
void;
struct LOOKUPP4res {
/* new CURRENT_FH: parent directory */
nfsstat4 status;
};
The current filehandle is assumed to refer to a regular directory or a named attribute directory. LOOKUPP assigns the filehandle for its parent directory to be the current filehandle. If there is no parent directory, an NFS4ERR_NOENT error must be returned. Therefore, NFS4ERR_NOENT will be returned by the server when the current filehandle is at the root or top of the server's file tree.
As is the case with LOOKUP, LOOKUPP will also cross mountpoints.
If the current filehandle is not a directory or named attribute directory, the error NFS4ERR_NOTDIR is returned.
If the requester's security flavor does not match that configured for the parent directory, then the server SHOULD return NFS4ERR_WRONGSEC (a future minor revision of NFSv4 may upgrade this to MUST) in the LOOKUPP response. However, if the server does so, it MUST support the SECINFO_NO_NAME operation (Section 18.45), so that the client can gracefully determine the correct security flavor.
If the current filehandle is a named attribute directory that is associated with a file system object via OPENATTR (i.e., not a sub-directory of a named attribute directory), LOOKUPP SHOULD return the filehandle of the associated file system object.
An issue to note is upward navigation from named attribute directories. The named attribute directories are essentially detached from the namespace, and this property should be safely represented in the client operating environment. LOOKUPP on a named attribute directory may return the filehandle of the associated file, and conveying this to applications might be unsafe as many applications expect the parent of an object to always be a directory. Therefore, the client may want to hide the parent of named attribute directories (represented as ".." in UNIX) or represent the named attribute directory as its own parent (as is typically done for the file system root directory in UNIX).
struct NVERIFY4args {
/* CURRENT_FH: object */
fattr4 obj_attributes;
};
struct NVERIFY4res {
nfsstat4 status;
};
This operation is used to prefix a sequence of operations to be performed if one or more attributes have changed on some file system object. If all the attributes match, then the error NFS4ERR_SAME MUST be returned.
On success, the current filehandle retains its value.
This operation is useful as a cache validation operator. If the object to which the attributes belong has changed, then the following operations may obtain new data associated with that object, for instance, to check if a file has been changed and obtain new data if it has:
SEQUENCE
PUTFH fh
NVERIFY attrbits attrs
READ 0 32767
Contrast this with NFSv3, which would first send a GETATTR in one request/reply round trip and then, if the attributes indicated that the client's cache was stale, send a READ in another request/reply round trip.
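The cache validation pattern can be modeled in a few lines. This is a toy simulation, not a client implementation: nverify and revalidate are hypothetical names, attributes are plain dictionaries, and read_data is a caller-supplied callable standing in for the READ operation. It relies on the COMPOUND rule that processing stops at the first operation returning an error.

```python
# Toy model of the PUTFH / NVERIFY / READ compound shown above.
def nverify(current_attrs: dict, cached_attrs: dict) -> str:
    """NFS4ERR_SAME if every cached attribute matches; NFS4_OK otherwise."""
    same = all(current_attrs.get(name) == value
               for name, value in cached_attrs.items())
    return "NFS4ERR_SAME" if same else "NFS4_OK"


def revalidate(current_attrs: dict, cached_attrs: dict, read_data):
    """Run NVERIFY, then READ only if NVERIFY returned NFS4_OK."""
    status = nverify(current_attrs, cached_attrs)
    if status != "NFS4_OK":
        return status, None          # compound stops; cache is still valid
    return "NFS4_OK", read_data()    # attributes changed; fetch fresh data
```

When the cached change attribute still matches, the client receives NFS4ERR_SAME and no data is transferred; otherwise the fresh data arrives in the same round trip.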
In the case that a RECOMMENDED attribute is specified in the NVERIFY operation and the server does not support that attribute for the file system object, the error NFS4ERR_ATTRNOTSUPP is returned to the client.
When the attribute rdattr_error or any set-only attribute (e.g., time_modify_set) is specified, the error NFS4ERR_INVAL is returned to the client.
/*
* Various definitions for OPEN
*/
enum createmode4 {
UNCHECKED4 = 0,
GUARDED4 = 1,
/* Deprecated in NFSv4.1. */
EXCLUSIVE4 = 2,
/*
* New to NFSv4.1. If session is persistent,
* GUARDED4 MUST be used. Otherwise, use
* EXCLUSIVE4_1 instead of EXCLUSIVE4.
*/
EXCLUSIVE4_1 = 3
};
struct creatverfattr {
verifier4 cva_verf;
fattr4 cva_attrs;
};
union createhow4 switch (createmode4 mode) {
case UNCHECKED4:
case GUARDED4:
fattr4 createattrs;
case EXCLUSIVE4:
verifier4 createverf;
case EXCLUSIVE4_1:
creatverfattr ch_createboth;
};
enum opentype4 {
OPEN4_NOCREATE = 0,
OPEN4_CREATE = 1
};
union openflag4 switch (opentype4 opentype) {
case OPEN4_CREATE:
createhow4 how;
default:
void;
};
/* Next definitions used for OPEN delegation */
enum limit_by4 {
NFS_LIMIT_SIZE = 1,
NFS_LIMIT_BLOCKS = 2
/* others as needed */
};
struct nfs_modified_limit4 {
uint32_t num_blocks;
uint32_t bytes_per_block;
};
union nfs_space_limit4 switch (limit_by4 limitby) {
/* limit specified as file size */
case NFS_LIMIT_SIZE:
uint64_t filesize;
/* limit specified by number of blocks */
case NFS_LIMIT_BLOCKS:
nfs_modified_limit4 mod_blocks;
} ;
/*
* Share Access and Deny constants for open argument
*/
const OPEN4_SHARE_ACCESS_READ = 0x00000001;
const OPEN4_SHARE_ACCESS_WRITE = 0x00000002;
const OPEN4_SHARE_ACCESS_BOTH = 0x00000003;
const OPEN4_SHARE_DENY_NONE = 0x00000000;
const OPEN4_SHARE_DENY_READ = 0x00000001;
const OPEN4_SHARE_DENY_WRITE = 0x00000002;
const OPEN4_SHARE_DENY_BOTH = 0x00000003;
/* new flags for share_access field of OPEN4args */
const OPEN4_SHARE_ACCESS_WANT_DELEG_MASK = 0xFF00;
const OPEN4_SHARE_ACCESS_WANT_NO_PREFERENCE = 0x0000;
const OPEN4_SHARE_ACCESS_WANT_READ_DELEG = 0x0100;
const OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG = 0x0200;
const OPEN4_SHARE_ACCESS_WANT_ANY_DELEG = 0x0300;
const OPEN4_SHARE_ACCESS_WANT_NO_DELEG = 0x0400;
const OPEN4_SHARE_ACCESS_WANT_CANCEL = 0x0500;
const
OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL
= 0x10000;
const
OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED
= 0x20000;
enum open_delegation_type4 {
OPEN_DELEGATE_NONE = 0,
OPEN_DELEGATE_READ = 1,
OPEN_DELEGATE_WRITE = 2,
OPEN_DELEGATE_NONE_EXT = 3 /* new to v4.1 */
};
enum open_claim_type4 {
/*
* Not a reclaim.
*/
CLAIM_NULL = 0,
CLAIM_PREVIOUS = 1,
CLAIM_DELEGATE_CUR = 2,
CLAIM_DELEGATE_PREV = 3,
/*
* Not a reclaim.
*
* Like CLAIM_NULL, but object identified
* by the current filehandle.
*/
CLAIM_FH = 4, /* new to v4.1 */
/*
* Like CLAIM_DELEGATE_CUR, but object identified
* by current filehandle.
*/
CLAIM_DELEG_CUR_FH = 5, /* new to v4.1 */
/*
* Like CLAIM_DELEGATE_PREV, but object identified
* by current filehandle.
*/
CLAIM_DELEG_PREV_FH = 6 /* new to v4.1 */
};
struct open_claim_delegate_cur4 {
stateid4 delegate_stateid;
component4 file;
};
union open_claim4 switch (open_claim_type4 claim) {
/*
* No special rights to file.
* Ordinary OPEN of the specified file.
*/
case CLAIM_NULL:
/* CURRENT_FH: directory */
component4 file;
/*
* Right to the file established by an
* open previous to server reboot. File
* identified by filehandle obtained at
* that time rather than by name.
*/
case CLAIM_PREVIOUS:
/* CURRENT_FH: file being reclaimed */
open_delegation_type4 delegate_type;
/*
* Right to file based on a delegation
* granted by the server. File is
* specified by name.
*/
case CLAIM_DELEGATE_CUR:
/* CURRENT_FH: directory */
open_claim_delegate_cur4 delegate_cur_info;
/*
* Right to file based on a delegation
* granted to a previous boot instance
* of the client. File is specified by name.
*/
case CLAIM_DELEGATE_PREV:
/* CURRENT_FH: directory */
component4 file_delegate_prev;
/*
* Like CLAIM_NULL. No special rights
* to file. Ordinary OPEN of the
* specified file by current filehandle.
*/
case CLAIM_FH: /* new to v4.1 */
/* CURRENT_FH: regular file to open */
void;
/*
* Like CLAIM_DELEGATE_PREV. Right to file based on a
* delegation granted to a previous boot
* instance of the client. File is identified
* by filehandle.
*/
case CLAIM_DELEG_PREV_FH: /* new to v4.1 */
/* CURRENT_FH: file being opened */
void;
/*
* Like CLAIM_DELEGATE_CUR. Right to file based on
* a delegation granted by the server.
* File is identified by filehandle.
*/
case CLAIM_DELEG_CUR_FH: /* new to v4.1 */
/* CURRENT_FH: file being opened */
stateid4 oc_delegate_stateid;
};
/*
* OPEN: Open a file, potentially receiving an OPEN delegation
*/
struct OPEN4args {
seqid4 seqid;
uint32_t share_access;
uint32_t share_deny;
open_owner4 owner;
openflag4 openhow;
open_claim4 claim;
};
struct open_read_delegation4 {
stateid4 stateid; /* Stateid for delegation*/
bool recall; /* Pre-recalled flag for
delegations obtained
by reclaim (CLAIM_PREVIOUS) */
nfsace4 permissions; /* Defines users who don't
need an ACCESS call to
open for read */
};
struct open_write_delegation4 {
stateid4 stateid; /* Stateid for delegation */
bool recall; /* Pre-recalled flag for
delegations obtained
by reclaim
(CLAIM_PREVIOUS) */
nfs_space_limit4
space_limit; /* Defines condition that
the client must check to
determine whether the
file needs to be flushed
to the server on close. */
nfsace4 permissions; /* Defines users who don't
need an ACCESS call as
part of a delegated
open. */
};
enum why_no_delegation4 { /* new to v4.1 */
WND4_NOT_WANTED = 0,
WND4_CONTENTION = 1,
WND4_RESOURCE = 2,
WND4_NOT_SUPP_FTYPE = 3,
WND4_WRITE_DELEG_NOT_SUPP_FTYPE = 4,
WND4_NOT_SUPP_UPGRADE = 5,
WND4_NOT_SUPP_DOWNGRADE = 6,
WND4_CANCELLED = 7,
WND4_IS_DIR = 8
};
union open_none_delegation4 /* new to v4.1 */
switch (why_no_delegation4 ond_why) {
case WND4_CONTENTION:
bool ond_server_will_push_deleg;
case WND4_RESOURCE:
bool ond_server_will_signal_avail;
default:
void;
};
union open_delegation4
switch (open_delegation_type4 delegation_type) {
case OPEN_DELEGATE_NONE:
void;
case OPEN_DELEGATE_READ:
open_read_delegation4 read;
case OPEN_DELEGATE_WRITE:
open_write_delegation4 write;
case OPEN_DELEGATE_NONE_EXT: /* new to v4.1 */
open_none_delegation4 od_whynone;
};
/*
* Result flags
*/
/* Client must confirm open */
const OPEN4_RESULT_CONFIRM = 0x00000002;
/* Type of file locking behavior at the server */
const OPEN4_RESULT_LOCKTYPE_POSIX = 0x00000004;
/* Server will preserve file if removed while open */
const OPEN4_RESULT_PRESERVE_UNLINKED = 0x00000008;
/*
* Server may use CB_NOTIFY_LOCK on locks
* derived from this open
*/
const OPEN4_RESULT_MAY_NOTIFY_LOCK = 0x00000020;
struct OPEN4resok {
stateid4 stateid; /* Stateid for open */
change_info4 cinfo; /* Directory Change Info */
uint32_t rflags; /* Result flags */
bitmap4 attrset; /* attribute set for create*/
open_delegation4 delegation; /* Info on any open
delegation */
};
union OPEN4res switch (nfsstat4 status) {
case NFS4_OK:
/* New CURRENT_FH: opened file */
OPEN4resok resok4;
default:
void;
};
The OPEN operation opens a regular file in a directory with the provided name or filehandle. OPEN can also create a file if a name is provided, and the client specifies it wants to create a file. Whether a file is to be created, and the method of creation, is specified via the openhow parameter. The openhow parameter consists of a switched union (data type openflag4), which switches on the value of opentype (OPEN4_NOCREATE or OPEN4_CREATE). If OPEN4_CREATE is specified, this leads to another switched union (data type createhow4) that supports four cases of creation methods: UNCHECKED4, GUARDED4, EXCLUSIVE4, or EXCLUSIVE4_1. If opentype is OPEN4_CREATE, then the claim discriminant of the claim field (claim.claim) MUST be one of CLAIM_NULL, CLAIM_DELEGATE_CUR, or CLAIM_DELEGATE_PREV, because these claim methods include a component of a file name.
Upon success (which might entail creation of a new file), the current filehandle is replaced by that of the created or existing object.
If the current filehandle is a named attribute directory, OPEN will then create or open a named attribute file. Note that exclusive create of a named attribute is not supported. If the createmode is EXCLUSIVE4 or EXCLUSIVE4_1 and the current filehandle is a named attribute directory, the server will return NFS4ERR_INVAL.
UNCHECKED4 means that the file should be created if a file of that name does not exist and encountering an existing regular file of that name is not an error. For this type of create, createattrs specifies the initial set of attributes for the file. The set of attributes may include any writable attribute valid for regular files. When an UNCHECKED4 create encounters an existing file, the attributes specified by createattrs are not used, except that when createattrs specifies the size attribute with a size of zero, the existing file is truncated.
If GUARDED4 is specified, the server checks for the presence of a duplicate object by name before performing the create. If a duplicate exists, NFS4ERR_EXIST is returned. If the object does not exist, the request is performed as described for UNCHECKED4.
For the UNCHECKED4 and GUARDED4 cases, where the operation is successful, the server will return to the client an attribute mask signifying which attributes were successfully set for the object.
EXCLUSIVE4_1 and EXCLUSIVE4 specify that the server is to follow exclusive creation semantics, using the verifier to ensure exclusive creation of the target. The server should check for the presence of a duplicate object by name. If the object does not exist, the server creates the object and stores the verifier with the object. If the object does exist and the stored verifier matches the client provided verifier, the server uses the existing object as the newly created object. If the stored verifier does not match, then an error of NFS4ERR_EXIST is returned.
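The verifier check described above can be illustrated with a toy server model. This is a sketch under stated assumptions: exclusive_create is a hypothetical name, the directory is a dictionary mapping each name to its stored verifier, and objects created by other means are represented with a stored verifier of None.

```python
# Toy model of exclusive create: directory maps name -> stored verifier.
def exclusive_create(directory: dict, name: str, verifier: bytes) -> str:
    """Apply the exclusive create rules to an in-memory directory."""
    if name not in directory:
        directory[name] = verifier   # create the object and store the verifier
        return "NFS4_OK"
    if directory[name] == verifier:
        return "NFS4_OK"             # same verifier: treat as newly created
    return "NFS4ERR_EXIST"           # different (or no) stored verifier
```

The matching-verifier case is what makes a retransmitted create request succeed instead of failing against the object that the first transmission created.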
If using EXCLUSIVE4, and if the server uses attributes to store the exclusive create verifier, the server will signify which attributes it used by setting the appropriate bits in the attribute mask that is returned in the results. Unlike UNCHECKED4, GUARDED4, and EXCLUSIVE4_1, EXCLUSIVE4 does not support the setting of attributes at file creation, and after a successful OPEN via EXCLUSIVE4, the client MUST send a SETATTR to set attributes to a known state.
In NFSv4.1, EXCLUSIVE4 has been deprecated in favor of EXCLUSIVE4_1. Unlike EXCLUSIVE4, attributes may be provided in the EXCLUSIVE4_1 case, but because the server may use attributes of the target object to store the verifier, the set of allowable attributes may be fewer than the set of attributes SETATTR allows. The allowable attributes for EXCLUSIVE4_1 are indicated in the suppattr_exclcreat (Section 5.8.1.14) attribute. If the client attempts to set in cva_attrs an attribute that is not in suppattr_exclcreat, the server MUST return NFS4ERR_INVAL. The response field, attrset, indicates both which attributes the server set from cva_attrs and which attributes the server used to store the verifier. As described in Section 18.16.4, the client can compare cva_attrs.attrmask with attrset to determine which attributes were used to store the verifier.
With the addition of persistent sessions and pNFS, under some conditions EXCLUSIVE4 MUST NOT be used by the client or supported by the server. The following table summarizes the appropriate and mandated exclusive create methods for implementations of NFSv4.1:
Persistent Reply Cache Enabled | Server Supports pNFS | Server REQUIRED | Client Allowed
no | no | EXCLUSIVE4_1 and EXCLUSIVE4 | EXCLUSIVE4_1 (SHOULD) or EXCLUSIVE4 (SHOULD NOT)
no | yes | EXCLUSIVE4_1 | EXCLUSIVE4_1
yes | no | GUARDED4 | GUARDED4
yes | yes | GUARDED4 | GUARDED4
Table 18: Required Methods for Exclusive Create
If CREATE_SESSION4_FLAG_PERSIST is set in the results of CREATE_SESSION, the reply cache is persistent (see Section 18.36). If the EXCHGID4_FLAG_USE_PNFS_MDS flag is set in the results from EXCHANGE_ID, the server is a pNFS server (see Section 18.35). If the client attempts to use EXCLUSIVE4 on a persistent session, or a session derived from an EXCHGID4_FLAG_USE_PNFS_MDS client ID, the server MUST return NFS4ERR_INVAL.
With persistent sessions, exclusive create semantics are fully achievable via GUARDED4, and so EXCLUSIVE4 or EXCLUSIVE4_1 MUST NOT be used. When pNFS is being used, the layout_hint attribute might not be supported after the file is created. Only the EXCLUSIVE4_1 and GUARDED4 methods of exclusive file creation allow the atomic setting of attributes.
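The "Client Allowed" column of Table 18 can be read as a small decision function. The following is a minimal sketch, not a normative algorithm: the function name is hypothetical, the parameter names csr_flags and eir_flags stand for the flag words returned by CREATE_SESSION and EXCHANGE_ID, and the numeric flag values are taken from the protocol's XDR description rather than from this section.

```python
# Flag values assumed from the NFSv4.1 XDR description (not defined in this section).
CREATE_SESSION4_FLAG_PERSIST = 0x00000001
EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000


def client_exclusive_methods(csr_flags, eir_flags):
    """Return the create methods a client may use, most preferred first.

    csr_flags: flags from the CREATE_SESSION results.
    eir_flags: flags from the EXCHANGE_ID results.
    """
    persistent = bool(csr_flags & CREATE_SESSION4_FLAG_PERSIST)
    pnfs = bool(eir_flags & EXCHGID4_FLAG_USE_PNFS_MDS)
    if persistent:
        # Persistent reply cache: GUARDED4 already gives exclusive semantics.
        return ["GUARDED4"]
    if pnfs:
        # pNFS server without a persistent reply cache: EXCLUSIVE4 is not allowed.
        return ["EXCLUSIVE4_1"]
    # Neither: EXCLUSIVE4_1 (SHOULD) or EXCLUSIVE4 (SHOULD NOT).
    return ["EXCLUSIVE4_1", "EXCLUSIVE4"]
```

A client would evaluate this once per session and use the first entry of the returned list when building the OPEN arguments.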
For the target directory, the server returns change_info4 information in cinfo. With the atomic field of the change_info4 data type, the server will indicate if the before and after change attributes were obtained atomically with respect to the link creation.
The OPEN operation provides for Windows share reservation capability with the use of the share_access and share_deny fields of the OPEN arguments. The client specifies at OPEN the required share_access and share_deny modes. For clients that do not directly support SHAREs (i.e., UNIX), the expected deny value is OPEN4_SHARE_DENY_NONE. In the case that there is an existing SHARE reservation that conflicts with the OPEN request, the server returns the error NFS4ERR_SHARE_DENIED. For additional discussion of SHARE semantics, see Section 9.7.
For each OPEN, the client provides a value for the owner field of the OPEN argument. The owner field is of data type open_owner4, and contains a field called clientid and a field called owner. The client can set the clientid field to any value and the server MUST ignore it. Instead, the server MUST derive the client ID from the session ID of the SEQUENCE operation of the COMPOUND request.
The "seqid" field of the request is not used in NFSv4.1; it MAY be any value, and the server MUST ignore it.
In the case that the client is recovering state from a server failure, the claim field of the OPEN argument is used to signify that the request is meant to reclaim state previously held.
The "claim" field of the OPEN argument is used to specify the file to be opened and the state information that the client claims to possess. There are seven claim types as follows:
| open type | description |
|---|---|
| CLAIM_NULL, CLAIM_FH | For the client, this is a new OPEN request and there is no previous state associated with the file for the client. With CLAIM_NULL, the file is identified by the current filehandle and the specified component name. With CLAIM_FH (new to NFSv4.1), the file is identified by just the current filehandle. |
| CLAIM_PREVIOUS | The client is claiming basic OPEN state for a file that was held previous to a server restart. Generally used when a server is returning persistent filehandles; the client may not have the file name to reclaim the OPEN. |
| CLAIM_DELEGATE_CUR, CLAIM_DELEG_CUR_FH | The client is claiming a delegation for OPEN as granted by the server. Generally, this is done as part of recalling a delegation. With CLAIM_DELEGATE_CUR, the file is identified by the current filehandle and the specified component name. With CLAIM_DELEG_CUR_FH (new to NFSv4.1), the file is identified by just the current filehandle. |
| CLAIM_DELEGATE_PREV, CLAIM_DELEG_PREV_FH | The client is claiming a delegation granted to a previous client instance; used after the client restarts. The server MAY support CLAIM_DELEGATE_PREV and/or CLAIM_DELEG_PREV_FH (new to NFSv4.1). If it does support either claim type, CREATE_SESSION MUST NOT remove the client's delegation state, and the server MUST support the DELEGPURGE operation. |
Table 19
For OPEN requests that reach the server during the grace period, the server returns an error of NFS4ERR_GRACE. The following claim types are exceptions:
-
OPEN requests specifying the claim type CLAIM_PREVIOUS are devoted to reclaiming opens after a server restart and are typically only valid during the grace period.
-
OPEN requests specifying the claim types CLAIM_DELEGATE_CUR and CLAIM_DELEG_CUR_FH are valid both during and after the grace period. Since the granting of the delegation that they are subordinate to assures that there is no conflict with locks to be reclaimed by other clients, the server need not return NFS4ERR_GRACE when these are received during the grace period.
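The grace-period rule and its two exceptions can be sketched as a single predicate. This is an illustrative model only: the function name is hypothetical, claim types are represented as strings, and the behavior of CLAIM_PREVIOUS outside the grace period is deliberately not modeled.

```python
# Claim types that the text lists as exceptions to NFS4ERR_GRACE.
GRACE_EXCEPTIONS = {
    "CLAIM_PREVIOUS",       # reclaim of opens after a server restart
    "CLAIM_DELEGATE_CUR",   # subordinate to an already granted delegation
    "CLAIM_DELEG_CUR_FH",
}


def rejected_during_grace(claim):
    """Return True if an OPEN with this claim type, arriving during the
    grace period, is answered with NFS4ERR_GRACE."""
    return claim not in GRACE_EXCEPTIONS
```

For example, a CLAIM_NULL open during the grace period is rejected, while a CLAIM_DELEG_CUR_FH open need not be.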
For any OPEN request, the server may return an OPEN delegation, which allows further opens and closes to be handled locally on the client as described in Section 10.4. Note that delegation is up to the server to decide. The client should never assume that delegation will or will not be granted in a particular instance. It should always be prepared for either case. A partial exception is the reclaim (CLAIM_PREVIOUS) case, in which a delegation type is claimed. In this case, delegation will always be granted, although the server may specify an immediate recall in the delegation structure.
The rflags returned by a successful OPEN allow the server to return information governing how the open file is to be handled.
-
OPEN4_RESULT_CONFIRM is deprecated and MUST NOT be returned by an NFSv4.1 server.
-
OPEN4_RESULT_LOCKTYPE_POSIX indicates that the server's byte-range locking behavior supports the complete set of POSIX locking techniques [21]. From this, the client can choose to manage byte-range locking state in a way to handle a mismatch of byte-range locking management.
-
OPEN4_RESULT_PRESERVE_UNLINKED indicates that the server will preserve the open file if the client (or any other client) removes the file as long as it is open. Furthermore, the server promises to preserve the file through the grace period after server restart, thereby giving the client the opportunity to reclaim its open.
-
OPEN4_RESULT_MAY_NOTIFY_LOCK indicates that the server may attempt CB_NOTIFY_LOCK callbacks for locks on this file. This flag is a hint only, and may be safely ignored by the client.
If the component is of zero length, NFS4ERR_INVAL will be returned. The component is also subject to the normal UTF-8, character support, and name checks. See Section 14.5 for further discussion.
When an OPEN is done and the specified open-owner already has the resulting filehandle open, the result is to "OR" together the new share and deny status with the existing status. In this case, only a single CLOSE need be done, even though multiple OPENs were completed. When such an OPEN is done, checking of share reservations for the new OPEN proceeds normally, with no exception for the existing OPEN held by the same open-owner. In this case, the stateid returned has an "other" field that matches that of the previous open, while the "seqid" field is incremented to reflect the changed status due to the new open.
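The effect of a repeated OPEN by the same open-owner can be sketched as a bitwise merge of the share state. This is a minimal model under stated assumptions: the OpenState class and merge_open function are hypothetical names, share-reservation conflict checking is omitted, and seqid wraparound is not handled.

```python
from dataclasses import dataclass


@dataclass
class OpenState:
    other: bytes        # the "other" field of the stateid; unchanged by re-opens
    seqid: int          # incremented on every change to the open state
    share_access: int   # accumulated access bits
    share_deny: int     # accumulated deny bits


def merge_open(state, share_access, share_deny):
    """Fold a further OPEN by the same open-owner into the existing state."""
    state.share_access |= share_access   # OR in the new access bits
    state.share_deny |= share_deny       # OR in the new deny bits
    state.seqid += 1                     # same "other", higher "seqid"
    return state
```

After a read-only open followed by a write open that denies reads, the state carries both access bits and the deny-read bit, and a single CLOSE releases all of it.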
If the underlying file system at the server is only accessible in a read-only mode and the OPEN request has specified OPEN4_SHARE_ACCESS_WRITE or OPEN4_SHARE_ACCESS_BOTH, the server will return NFS4ERR_ROFS to indicate a read-only file system.
As with the CREATE operation, the server MUST derive the owner, owner ACE, group, or group ACE if any of the four attributes are required and supported by the server's file system. For an OPEN with the EXCLUSIVE4 createmode, the server has no choice, since such OPEN calls do not include the createattrs field. Conversely, if createattrs (UNCHECKED4 or GUARDED4) or cva_attrs (EXCLUSIVE4_1) is specified, and includes an owner, owner_group, or ACE that the principal in the RPC call's credentials does not have authorization to create files for, then the server may return NFS4ERR_PERM.
In the case of an OPEN that specifies a size of zero (e.g., truncation) and the file has named attributes, the named attributes are left as is and are not removed.
NFSv4.1 gives more precise control to clients over acquisition of delegations via the following new flags for the share_access field of OPEN4args:
OPEN4_SHARE_ACCESS_WANT_READ_DELEG
OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG
OPEN4_SHARE_ACCESS_WANT_ANY_DELEG
OPEN4_SHARE_ACCESS_WANT_NO_DELEG
OPEN4_SHARE_ACCESS_WANT_CANCEL
OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL
OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED
If (share_access & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) is not zero, then the client will have specified one and only one of:
OPEN4_SHARE_ACCESS_WANT_READ_DELEG
OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG
OPEN4_SHARE_ACCESS_WANT_ANY_DELEG
OPEN4_SHARE_ACCESS_WANT_NO_DELEG
OPEN4_SHARE_ACCESS_WANT_CANCEL
Otherwise, the client is indicating neither a desire nor a non-desire for a delegation, and the server MAY or MAY NOT return a delegation in the OPEN response.
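The "one and only one" rule for the want flags can be sketched by masking share_access and looking the result up in a table. This is an illustration under assumptions: the numeric constants are taken from the protocol's XDR description rather than from this section, and requested_want is a hypothetical helper name.

```python
# Values assumed from the NFSv4.1 XDR description (not defined in this section).
OPEN4_SHARE_ACCESS_WANT_DELEG_MASK = 0xFF00
WANT_TYPES = {
    0x0100: "OPEN4_SHARE_ACCESS_WANT_READ_DELEG",
    0x0200: "OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG",
    0x0300: "OPEN4_SHARE_ACCESS_WANT_ANY_DELEG",
    0x0400: "OPEN4_SHARE_ACCESS_WANT_NO_DELEG",
    0x0500: "OPEN4_SHARE_ACCESS_WANT_CANCEL",
}


def requested_want(share_access):
    """Return the name of the single want type encoded in share_access,
    or None if the client expressed no preference."""
    want = share_access & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK
    if want == 0:
        return None  # neither a desire nor a non-desire for a delegation
    if want not in WANT_TYPES:
        raise ValueError("unrecognized want value 0x%04x" % want)
    return WANT_TYPES[want]
```

Because the want types are encoded as distinct values under the mask rather than as independent bits, a well-formed request can carry at most one of them.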
If the server supports the new _WANT_ flags and the client sends one or more of the new flags, then in the event the server does not return a delegation, it MUST return a delegation type of OPEN_DELEGATE_NONE_EXT. The field ond_why in the reply indicates why no delegation was returned and will be one of:
-
WND4_NOT_WANTED
-
The client specified OPEN4_SHARE_ACCESS_WANT_NO_DELEG.
-
WND4_CONTENTION
-
There is a conflicting delegation or open on the file.
-
WND4_RESOURCE
-
Resource limitations prevent the server from granting a delegation.
-
WND4_NOT_SUPP_FTYPE
-
The server does not support delegations on this file type.
-
WND4_WRITE_DELEG_NOT_SUPP_FTYPE
-
The server does not support OPEN_DELEGATE_WRITE delegations on this file type.
-
WND4_NOT_SUPP_UPGRADE
-
The server does not support atomic upgrade of an OPEN_DELEGATE_READ delegation to an OPEN_DELEGATE_WRITE delegation.
-
WND4_NOT_SUPP_DOWNGRADE
-
The server does not support atomic downgrade of an OPEN_DELEGATE_WRITE delegation to an OPEN_DELEGATE_READ delegation.
-
WND4_CANCELED
-
The client specified OPEN4_SHARE_ACCESS_WANT_CANCEL and now any "want" for this file object is cancelled.
-
WND4_IS_DIR
-
The specified file object is a directory, and the operation is OPEN or WANT_DELEGATION, which do not support delegations on directories.
OPEN4_SHARE_ACCESS_WANT_READ_DELEG, OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG, or OPEN4_SHARE_ACCESS_WANT_ANY_DELEG mean, respectively, the client wants an OPEN_DELEGATE_READ, OPEN_DELEGATE_WRITE, or any delegation regardless of which of OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH is set. If the client has an OPEN_DELEGATE_READ delegation on a file and requests an OPEN_DELEGATE_WRITE delegation, then the client is requesting atomic upgrade of its OPEN_DELEGATE_READ delegation to an OPEN_DELEGATE_WRITE delegation. If the client has an OPEN_DELEGATE_WRITE delegation on a file and requests an OPEN_DELEGATE_READ delegation, then the client is requesting atomic downgrade to an OPEN_DELEGATE_READ delegation. A server MAY support atomic upgrade or downgrade. If it does, then a returned delegation_type of OPEN_DELEGATE_READ or OPEN_DELEGATE_WRITE that is different from the delegation type the client currently has indicates successful upgrade or downgrade. If the server does not support atomic delegation upgrade or downgrade, then ond_why will be set to WND4_NOT_SUPP_UPGRADE or WND4_NOT_SUPP_DOWNGRADE.
OPEN4_SHARE_ACCESS_WANT_NO_DELEG means that the client wants no delegation.
OPEN4_SHARE_ACCESS_WANT_CANCEL means that the client wants no delegation and wants to cancel any previously registered "want" for a delegation.
The client may set one or both of OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL and OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED. However, they will have no effect unless one of the following is set:
-
OPEN4_SHARE_ACCESS_WANT_READ_DELEG
-
OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG
-
OPEN4_SHARE_ACCESS_WANT_ANY_DELEG
If the client specifies OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL, then it wishes to register a "want" for a delegation, in the event the OPEN results do not include a delegation. If so and the server denies the delegation due to insufficient resources, the server MAY later inform the client, via the CB_RECALLABLE_OBJ_AVAIL operation, that the resource limitation condition has eased. The server will tell the client that it intends to send a future CB_RECALLABLE_OBJ_AVAIL operation by setting delegation_type in the results to OPEN_DELEGATE_NONE_EXT, ond_why to WND4_RESOURCE, and ond_server_will_signal_avail to TRUE. If ond_server_will_signal_avail is set to TRUE, the server MUST later send a CB_RECALLABLE_OBJ_AVAIL operation.
If the client specifies OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED, then it wishes to register a "want" for a delegation, in the event the OPEN results do not include a delegation. If so and the server denies the delegation due to contention, the server MAY later inform the client, via the CB_PUSH_DELEG operation, that the contention condition has eased. The server will tell the client that it intends to send a future CB_PUSH_DELEG operation by setting delegation_type in the results to OPEN_DELEGATE_NONE_EXT, ond_why to WND4_CONTENTION, and ond_server_will_push_deleg to TRUE. If ond_server_will_push_deleg is TRUE, the server MUST later send a CB_PUSH_DELEG operation.
If the client has previously registered a want for a delegation on a file, and then sends a request to register a want for a delegation on the same file, the server MUST return a new error: NFS4ERR_DELEG_ALREADY_WANTED. If the client wishes to register a different type of delegation want for the same file, it MUST cancel the existing delegation WANT.
In the absence of a persistent session, the client invokes exclusive create by setting the how parameter to EXCLUSIVE4 or EXCLUSIVE4_1. In these cases, the client provides a verifier that can reasonably be expected to be unique. A combination of a client identifier, perhaps the client network address, and a unique number generated by the client, perhaps the RPC transaction identifier, may be appropriate.
If the object does not exist, the server creates the object and stores the verifier in stable storage. For file systems that do not provide a mechanism for the storage of arbitrary file attributes, the server may use one or more elements of the object's metadata to store the verifier. The verifier MUST be stored in stable storage to prevent erroneous failure on retransmission of the request. It is assumed that an exclusive create is being performed because exclusive semantics are critical to the application. Because of the expected usage, exclusive CREATE does not rely solely on the server's reply cache for storage of the verifier. A nonpersistent reply cache does not survive a crash and the session and reply cache may be deleted after a network partition that exceeds the lease time, thus opening failure windows.
An NFSv4.1 server SHOULD NOT store the verifier in any of the file's RECOMMENDED or REQUIRED attributes. If it does, the server SHOULD use time_modify_set or time_access_set to store the verifier. The server SHOULD NOT store the verifier in the following attributes:
-
acl (it is desirable for access control to be established at creation),
-
dacl (ditto),
-
mode (ditto),
-
owner (ditto),
-
owner_group (ditto),
-
retentevt_set (it may be desired to establish retention at creation),
-
retention_hold (ditto),
-
retention_set (ditto),
-
sacl (it is desirable for auditing control to be established at creation),
-
size (on some servers, size may have a limited range of values),
-
mode_set_masked (as with mode),
-
time_creation (a meaningful file creation time should be set when the file is created).
Another alternative for the server is to use a named attribute to store the verifier.
Because the EXCLUSIVE4 create method does not specify initial attributes when processing an EXCLUSIVE4 create, the server
-
SHOULD set the owner of the file to that corresponding to the credential of the request's RPC header.
-
SHOULD NOT leave the file's access control to anyone but the owner of the file.
If the server cannot support exclusive create semantics, possibly because of the requirement to commit the verifier to stable storage, it should fail the OPEN request with the error NFS4ERR_NOTSUPP.
During an exclusive CREATE request, if the object already exists, the server reconstructs the object's verifier and compares it with the verifier in the request. If they match, the server treats the request as a success. The request is presumed to be a duplicate of an earlier, successful request for which the reply was lost and that the server duplicate request cache mechanism did not detect. If the verifiers do not match, the request is rejected with the status NFS4ERR_EXIST.
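The verifier comparison described above can be sketched with an in-memory model. This is a toy illustration, not a server implementation: the dictionary stands in for stable storage, exclusive_create is a hypothetical name, and objects created by other means are simply absent from the model.

```python
def exclusive_create(store, name, verifier):
    """Model exclusive create against `store`, a dict mapping object
    names to the verifier recorded when each object was created."""
    if name not in store:
        # Object does not exist: create it and record the verifier.
        store[name] = verifier
        return "NFS4_OK"
    if store[name] == verifier:
        # Presumed retransmission of an earlier, successful request.
        return "NFS4_OK"
    # A different verifier means some other create produced this object.
    return "NFS4ERR_EXIST"
```

Calling the function twice with the same name and verifier succeeds both times, which is the property that makes a lost reply harmless; a second caller with a different verifier is rejected.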
After the client has performed a successful exclusive create, the attrset response indicates which attributes were used to store the verifier. If EXCLUSIVE4 was used, the attributes set in attrset were used for the verifier. If EXCLUSIVE4_1 was used, the client determines the attributes used for the verifier by comparing attrset with cva_attrs.attrmask; any bits set in the former but not the latter identify the attributes used to store the verifier. The client MUST immediately send a SETATTR to set the attributes used to store the verifier. Until it does so, the attributes used to store the verifier cannot be relied upon. The subsequent SETATTR MUST NOT occur in the same COMPOUND request as the OPEN.
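For EXCLUSIVE4_1, the comparison of attrset with cva_attrs.attrmask is a per-word bit difference. The following sketch assumes that a bitmap is represented as a list of 32-bit words, with attribute number n stored as bit n mod 32 of word n div 32; verifier_attrs is a hypothetical helper name.

```python
def verifier_attrs(attrset, attrmask):
    """Return the attribute numbers set in attrset but not in attrmask.

    Both arguments are bitmaps given as lists of 32-bit words; the
    shorter list is treated as if padded with zero words.
    """
    n = max(len(attrset), len(attrmask))
    a = list(attrset) + [0] * (n - len(attrset))
    m = list(attrmask) + [0] * (n - len(attrmask))
    result = []
    for i in range(n):
        diff = a[i] & ~m[i] & 0xFFFFFFFF   # bits in attrset only
        for bit in range(32):
            if (diff >> bit) & 1:
                result.append(32 * i + bit)
    return result
```

The returned attribute numbers are the ones the client needs to reset with a SETATTR sent in a later COMPOUND.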
Unless a persistent session is used, use of the GUARDED4 createmode does not provide exactly-once semantics. In particular, if a reply is lost and the server does not detect the retransmission of the request, the operation can fail with NFS4ERR_EXIST, even though the create was performed successfully. The client would use this behavior in the case that the application has not requested an exclusive create but has asked to have the file truncated when the file is opened. In the case of the client timing out and retransmitting the create request, the client can use GUARDED4 to prevent a sequence like create, write, create (retransmitted) from occurring.
For SHARE reservations, the value of the expression (share_access & ~OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) MUST be one of OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH. If not, the server MUST return NFS4ERR_INVAL. The value of share_deny MUST be one of OPEN4_SHARE_DENY_NONE, OPEN4_SHARE_DENY_READ, OPEN4_SHARE_DENY_WRITE, or OPEN4_SHARE_DENY_BOTH. If not, the server MUST return NFS4ERR_INVAL.
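The two validity checks can be sketched as follows. This applies the expression exactly as written in the text; the numeric constants are assumed from the protocol's XDR description, validate_share is a hypothetical name, and the interaction with the two "WHEN" flags, which lie outside the want mask, is not modeled.

```python
# Values assumed from the NFSv4.1 XDR description (not defined in this section).
OPEN4_SHARE_ACCESS_READ = 0x1
OPEN4_SHARE_ACCESS_WRITE = 0x2
OPEN4_SHARE_ACCESS_BOTH = 0x3
OPEN4_SHARE_DENY_NONE = 0x0
OPEN4_SHARE_DENY_READ = 0x1
OPEN4_SHARE_DENY_WRITE = 0x2
OPEN4_SHARE_DENY_BOTH = 0x3
OPEN4_SHARE_ACCESS_WANT_DELEG_MASK = 0xFF00


def validate_share(share_access, share_deny):
    """Return the status a server gives for these share values."""
    access = share_access & ~OPEN4_SHARE_ACCESS_WANT_DELEG_MASK
    if access not in (OPEN4_SHARE_ACCESS_READ,
                      OPEN4_SHARE_ACCESS_WRITE,
                      OPEN4_SHARE_ACCESS_BOTH):
        return "NFS4ERR_INVAL"
    if share_deny not in (OPEN4_SHARE_DENY_NONE,
                          OPEN4_SHARE_DENY_READ,
                          OPEN4_SHARE_DENY_WRITE,
                          OPEN4_SHARE_DENY_BOTH):
        return "NFS4ERR_INVAL"
    return "NFS4_OK"
```

Note that a share_access carrying only want bits, with no access bits, is rejected, since the masked value is then zero.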
Based on the share_access value (OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH), the client should check that the requester has the proper access rights to perform the specified operation. This would generally be the results of applying the ACL access rules to the file for the current requester. However, just as with the ACCESS operation, the client should not attempt to second-guess the server's decisions, as access rights may change and may be subject to server administrative controls outside the ACL framework. If the requester's READ or WRITE operation is not authorized (depending on the share_access value), the server MUST return NFS4ERR_ACCESS.
Note that if the client ID was not created with the EXCHGID4_FLAG_BIND_PRINC_STATEID capability set in the reply to EXCHANGE_ID, then the server MUST NOT impose any requirement that READs and WRITEs sent for an open file have the same credentials as the OPEN itself, and the server is REQUIRED to perform access checking on the READs and WRITEs themselves. Otherwise, if the reply to EXCHANGE_ID did have EXCHGID4_FLAG_BIND_PRINC_STATEID set, then with one exception, the credentials used in the OPEN request MUST match those used in the READs and WRITEs, and the stateids in the READs and WRITEs MUST match, or be derived from, the stateid from the reply to OPEN. The exception is if SP4_SSV or SP4_MACH_CRED state protection is used, and the spo_must_allow result of EXCHANGE_ID includes the READ and/or WRITE operations. In that case, the machine or SSV credential will be allowed to send READ and/or WRITE. See Section 18.35.
If the component provided to OPEN is a symbolic link, the error NFS4ERR_SYMLINK will be returned to the client, while if it is a directory the error NFS4ERR_ISDIR will be returned. If the component is neither of those but not an ordinary file, the error NFS4ERR_WRONG_TYPE is returned. If the current filehandle is not a directory, the error NFS4ERR_NOTDIR will be returned.
The use of the OPEN4_RESULT_PRESERVE_UNLINKED result flag allows a client to avoid the common implementation practice of renaming an open file to ".nfs<unique value>" after it removes the file. After the server returns OPEN4_RESULT_PRESERVE_UNLINKED, if a client sends a REMOVE operation that would reduce the file's link count to zero, the server SHOULD report a value of zero for the numlinks attribute on the file.
If another client has a delegation of the file being opened that conflicts with the open being done (sometimes depending on the share_access or share_deny value specified), the delegation(s) MUST be recalled, and the operation cannot proceed until each such delegation is returned or revoked. Except where this happens very quickly, one or more NFS4ERR_DELAY errors will be returned to requests made while the delegation remains outstanding. In the case of an OPEN_DELEGATE_WRITE delegation, any open by a different client will conflict, while for an OPEN_DELEGATE_READ delegation, only opens with one of the following characteristics will be considered conflicting:
-
The value of share_access includes the bit OPEN4_SHARE_ACCESS_WRITE.
-
The value of share_deny specifies OPEN4_SHARE_DENY_READ or OPEN4_SHARE_DENY_BOTH.
-
OPEN4_CREATE is specified together with UNCHECKED4, the size attribute is specified as zero (for truncation), and an existing file is truncated.
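The conflict rules above can be sketched as a predicate over an open by a different client. This is a minimal sketch: delegation types are represented as strings, the constants are assumed from the protocol's XDR description, and the truncates_existing parameter is a hypothetical stand-in for the OPEN4_CREATE/UNCHECKED4/size-zero case.

```python
# Values assumed from the NFSv4.1 XDR description (not defined in this section).
OPEN4_SHARE_ACCESS_WRITE = 0x2
OPEN4_SHARE_DENY_READ = 0x1
OPEN4_SHARE_DENY_BOTH = 0x3


def open_conflicts_with_delegation(deleg_type, share_access, share_deny,
                                   truncates_existing=False):
    """Return True if an OPEN from a different client conflicts with a
    delegation of the given type and therefore forces a recall."""
    if deleg_type == "OPEN_DELEGATE_WRITE":
        return True  # any open by a different client conflicts
    if deleg_type == "OPEN_DELEGATE_READ":
        if share_access & OPEN4_SHARE_ACCESS_WRITE:
            return True
        if share_deny in (OPEN4_SHARE_DENY_READ, OPEN4_SHARE_DENY_BOTH):
            return True
        return truncates_existing
    return False  # no delegation outstanding
```

A plain read-only open with no deny bits coexists with a read delegation, so no recall and no NFS4ERR_DELAY is needed in that case.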
If OPEN4_CREATE is specified and the file does not exist and the current filehandle designates a directory for which another client holds a directory delegation, then, unless the delegation is such that the situation can be resolved by sending a notification, the delegation MUST be recalled, and the operation cannot proceed until the delegation is returned or revoked. Except where this happens very quickly, one or more NFS4ERR_DELAY errors will be returned to requests made while the delegation remains outstanding.
If OPEN4_CREATE is specified and the file does not exist and the current filehandle designates a directory for which one or more directory delegations exist, then, when those delegations request such notifications, NOTIFY4_ADD_ENTRY will be generated as a result of this operation.
OPEN resembles LOOKUP in that it generates a filehandle for the client to use. Unlike LOOKUP though, OPEN creates server state on the filehandle. In normal circumstances, the client can only release this state with a CLOSE operation. CLOSE uses the current filehandle to determine which file to close. Therefore, the client MUST follow every OPEN operation with a GETFH operation in the same COMPOUND procedure. This will supply the client with the filehandle such that CLOSE can be used appropriately.
Simply waiting for the lease on the file to expire is insufficient because the server may maintain the state indefinitely as long as another client does not attempt to make a conflicting access to the same file.
See also Section 2.10.6.4.
struct OPENATTR4args {
    /* CURRENT_FH: object */
    bool createdir;
};
struct OPENATTR4res {
    /*
     * If status is NFS4_OK,
     * new CURRENT_FH: named attribute
     * directory
     */
    nfsstat4 status;
};
The OPENATTR operation is used to obtain the filehandle of the named attribute directory associated with the current filehandle. The result of the OPENATTR will be a filehandle to an object of type NF4ATTRDIR. From this filehandle, READDIR and LOOKUP operations can be used to obtain filehandles for the various named attributes associated with the original file system object. Filehandles returned within the named attribute directory will designate objects of type NF4NAMEDATTR.
The createdir argument allows the client to signify if a named attribute directory should be created as a result of the OPENATTR operation. Some clients may use the OPENATTR operation with a value of FALSE for createdir to determine if any named attributes exist for the object. If none exist, then NFS4ERR_NOENT will be returned. If createdir has a value of TRUE and no named attribute directory exists, one is created and its filehandle becomes the current filehandle. On the other hand, if createdir has a value of TRUE and the named attribute directory already exists, no error results and the filehandle of the existing directory becomes the current filehandle. The creation of a named attribute directory assumes that the server has implemented named attribute support in this fashion and is not required to do so by this definition.
If the current filehandle designates an object of type NF4NAMEDATTR (a named attribute) or NF4ATTRDIR (a named attribute directory), an error of NFS4ERR_WRONG_TYPE is returned to the client. Named attributes or a named attribute directory MUST NOT have their own named attributes.
If the server does not support named attributes for the current filehandle, an error of NFS4ERR_NOTSUPP will be returned to the client.
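The outcomes of OPENATTR described above can be summarized in a small decision function. This is an illustrative sketch: the function and parameter names are hypothetical, object types are represented as strings, and the order in which the error conditions are tested is an assumption.

```python
def openattr(obj_type, has_attrdir, createdir, supports_named_attrs=True):
    """Return (status, has_attrdir_afterwards) for an OPENATTR on an
    object of the given type."""
    if obj_type in ("NF4NAMEDATTR", "NF4ATTRDIR"):
        # Named attributes cannot have named attributes of their own.
        return ("NFS4ERR_WRONG_TYPE", has_attrdir)
    if not supports_named_attrs:
        return ("NFS4ERR_NOTSUPP", has_attrdir)
    if has_attrdir:
        # Existing directory becomes the current filehandle, with or
        # without createdir.
        return ("NFS4_OK", True)
    if createdir:
        # No directory yet: create one and make it the current filehandle.
        return ("NFS4_OK", True)
    return ("NFS4ERR_NOENT", False)
```

A client probing for named attributes would pass createdir as FALSE and treat NFS4ERR_NOENT as "none exist".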
struct OPEN_DOWNGRADE4args {
    /* CURRENT_FH: opened file */
    stateid4 open_stateid;
    seqid4 seqid;
    uint32_t share_access;
    uint32_t share_deny;
};
struct OPEN_DOWNGRADE4resok {
    stateid4 open_stateid;
};
union OPEN_DOWNGRADE4res switch(nfsstat4 status) {
case NFS4_OK:
    OPEN_DOWNGRADE4resok resok4;
default:
    void;
};
This operation is used to adjust the access and deny states for a given open. This is necessary when a given open-owner opens the same file multiple times with different access and deny values. In this situation, a close of one of the opens may change the appropriate share_access and share_deny flags to remove bits associated with opens no longer in effect.
Valid values for the expression (share_access & ~OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) are OPEN4_SHARE_ACCESS_READ, OPEN4_SHARE_ACCESS_WRITE, or OPEN4_SHARE_ACCESS_BOTH. If the client specifies other values, the server MUST reply with NFS4ERR_INVAL.
Valid values for the share_deny field are OPEN4_SHARE_DENY_NONE, OPEN4_SHARE_DENY_READ, OPEN4_SHARE_DENY_WRITE, or OPEN4_SHARE_DENY_BOTH. If the client specifies other values, the server MUST reply with NFS4ERR_INVAL.
After checking for valid values of share_access and share_deny, the server replaces the current access and deny modes on the file with share_access and share_deny subject to the following constraints:
-
The bits in share_access SHOULD equal the union of the share_access bits (not including OPEN4_SHARE_ACCESS_WANT_* bits) specified for some subset of the OPENs in effect for the current open-owner on the current file.
-
The bits in share_deny SHOULD equal the union of the share_deny bits specified for some subset of the OPENs in effect for the current open-owner on the current file.
If the above constraints are not respected, the server SHOULD return the error NFS4ERR_INVAL. Since share_access and share_deny bits should be subsets of those already granted, short of a defect in the client or server implementation, it is not possible for the OPEN_DOWNGRADE request to be denied because of conflicting share reservations.
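The "union of some subset" constraints can be checked by brute force over the OPENs in effect. The sketch below is illustrative only: the function names are hypothetical, each OPEN is modeled as a (share_access, share_deny) pair with want bits already removed, the two constraints are evaluated independently as the text states them, and only non-empty subsets are considered, which is an assumption.

```python
from itertools import combinations


def union_of_some_subset(target, values):
    """Return True if target equals the OR of some non-empty subset of values."""
    for r in range(1, len(values) + 1):
        for combo in combinations(values, r):
            union = 0
            for v in combo:
                union |= v
            if union == target:
                return True
    return False


def downgrade_allowed(opens, share_access, share_deny):
    """opens: list of (share_access, share_deny) pairs in effect for the
    open-owner on the file."""
    accesses = [a for a, _ in opens]
    denies = [d for _, d in opens]
    return (union_of_some_subset(share_access, accesses)
            and union_of_some_subset(share_deny, denies))
```

Under this model, a file opened once with OPEN4_SHARE_ACCESS_BOTH cannot be downgraded to read-only, because READ alone is not the union of any subset of the opens in effect; an earlier separate read-only open makes the downgrade valid.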
The seqid argument is not used in NFSv4.1, MAY be any value, and MUST be ignored by the server.
On success, the current filehandle retains its value.
An OPEN_DOWNGRADE operation may make OPEN_DELEGATE_READ delegations grantable where they were not previously. Servers may choose to respond immediately if there are pending delegation want requests or may respond to the situation at a later time.
struct PUTFH4args {
    nfs_fh4 object;
};
struct PUTFH4res {
    /*
     * If status is NFS4_OK,
     * new CURRENT_FH: argument to PUTFH
     */
    nfsstat4 status;
};
This operation replaces the current filehandle with the filehandle provided as an argument. It clears the current stateid.
If the security mechanism used by the requester does not meet the requirements of the filehandle provided to this operation, the server MUST return NFS4ERR_WRONGSEC.
See Section 16.2.3.1.1 for more details on the current filehandle.
See Section 16.2.3.1.2 for more details on the current stateid.
This operation is used in an NFS request to set the context for file accessing operations that follow in the same COMPOUND request.
struct PUTPUBFH4res {
    /*
     * If status is NFS4_OK,
     * new CURRENT_FH: public fh
     */
    nfsstat4 status;
};
This operation replaces the current filehandle with the filehandle that represents the public filehandle of the server's namespace. This filehandle may be different from the "root" filehandle that may be associated with some other directory on the server.
PUTPUBFH also clears the current stateid.
The public filehandle represents the concepts embodied in [49], [50], and [61]. The intent for NFSv4.1 is that the public filehandle (represented by the PUTPUBFH operation) be used as a method of providing WebNFS server compatibility with NFSv3.
The public filehandle and the root filehandle (represented by the PUTROOTFH operation) SHOULD be equivalent. If the public and root filehandles are not equivalent, then the directory corresponding to the public filehandle MUST be a descendant of the directory corresponding to the root filehandle.
See Section 16.2.3.1.1 for more details on the current filehandle.
See Section 16.2.3.1.2 for more details on the current stateid.
This operation is used in an NFS request to set the context for file accessing operations that follow in the same COMPOUND request.
With the NFSv3 public filehandle, the client is able to specify whether the pathname provided in the LOOKUP should be evaluated as either an absolute path relative to the server's root or relative to the public filehandle. [61] contains further discussion of the functionality. With NFSv4.1, that type of specification is not directly available in the LOOKUP operation. This is because the component separators needed to specify absolute vs. relative are not allowed in NFSv4. Therefore, the client is responsible for constructing its request such that the use of either PUTROOTFH or PUTPUBFH signifies absolute or relative evaluation of an NFS URL, respectively.
Note that there are warnings mentioned in [61] with respect to the use of absolute evaluation and the restrictions the server may place on that evaluation with respect to how much of its namespace has been made available. These same warnings apply to NFSv4.1. It is likely, therefore, that because of server implementation details, an NFSv3 absolute public filehandle look up may behave differently than an NFSv4.1 absolute resolution.
There is a form of security negotiation as described in [62] that uses the public filehandle and an overloading of the pathname. This method is not available with NFSv4.1 as filehandles are not overloaded with special meaning and therefore do not provide the same framework as NFSv3. Clients should therefore use the security negotiation mechanisms described in Section 2.6.
struct PUTROOTFH4res {
    /*
     * If status is NFS4_OK,
     * new CURRENT_FH: root fh
     */
    nfsstat4 status;
};
This operation replaces the current filehandle with the filehandle that represents the root of the server's namespace. From this filehandle, a LOOKUP operation can locate any other filehandle on the server. This filehandle may be different from the "public" filehandle that may be associated with some other directory on the server.
PUTROOTFH also clears the current stateid.
See Section 16.2.3.1.1 for more details on the current filehandle.
See Section 16.2.3.1.2 for more details on the current stateid.
This operation is used in an NFS request to set the context for file accessing operations that follow in the same COMPOUND request.
struct READ4args {
/* CURRENT_FH: file */
stateid4 stateid;
offset4 offset;
count4 count;
};
struct READ4resok {
bool eof;
opaque data<>;
};
union READ4res switch (nfsstat4 status) {
case NFS4_OK:
READ4resok resok4;
default:
void;
};
The READ operation reads data from the regular file identified by the current filehandle.
The client provides an offset of where the READ is to start and a count of how many bytes are to be read. An offset of zero means to read data starting at the beginning of the file. If offset is greater than or equal to the size of the file, the status NFS4_OK is returned with a data length set to zero and eof is set to TRUE. The READ is subject to access permissions checking.
If the client specifies a count value of zero, the READ succeeds and returns zero bytes of data, again subject to access permissions checking. The server may choose to return fewer bytes than specified by the client. The client needs to check for this condition and handle it appropriately.
Except when special stateids are used, the stateid value for a READ request represents a value returned from a previous byte-range lock or share reservation request or the stateid associated with a delegation. The stateid identifies the associated owners if any and is used by the server to verify that the associated locks are still valid (e.g., have not been revoked).
If the read ended at the end-of-file (formally, in a correctly formed READ operation, if offset + count is equal to the size of the file), or the READ operation extends beyond the size of the file (if offset + count is greater than the size of the file), eof is returned as TRUE; otherwise, it is FALSE. A successful READ of an empty file will always return eof as TRUE.
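The offset, count, and eof rules described above can be illustrated with a small server-side model. This is only a sketch: read_result is a hypothetical helper that operates on an in-memory byte string and ignores stateids, locking, and access checking.

```python
from typing import Tuple

def read_result(file_data: bytes, offset: int, count: int) -> Tuple[bool, bytes]:
    """Model READ4resok as (eof, data) for an in-memory file (hypothetical helper)."""
    size = len(file_data)
    if offset >= size:
        # At or past end-of-file: NFS4_OK with zero bytes of data and eof TRUE.
        return True, b""
    data = file_data[offset:offset + count]
    # eof is TRUE when offset + count reaches or extends beyond the file size.
    eof = offset + count >= size
    return eof, data
```

With a five-byte file, a READ of offset 0 and count 3 returns eof FALSE, while a READ of offset 3 and count 10 returns the last two bytes with eof TRUE.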
If the current filehandle is not an ordinary file, an error will be returned to the client. In the case that the current filehandle represents an object of type NF4DIR, NFS4ERR_ISDIR is returned. If the current filehandle designates a symbolic link, NFS4ERR_SYMLINK is returned. In all other cases, NFS4ERR_WRONG_TYPE is returned.
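The type check above amounts to a three-way mapping. The sketch below uses the nfs_ftype4 names as plain strings; the function name is an illustrative assumption.

```python
from typing import Optional

def read_type_status(obj_type: str) -> Optional[str]:
    """Return the error READ reports for a non-regular object, or None for NF4REG."""
    if obj_type == "NF4REG":
        return None  # ordinary file: the type check passes
    if obj_type == "NF4DIR":
        return "NFS4ERR_ISDIR"
    if obj_type == "NF4LNK":
        return "NFS4ERR_SYMLINK"
    # Every other object type falls through to the generic error.
    return "NFS4ERR_WRONG_TYPE"
```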
For a READ with a stateid value of all bits equal to zero, the server MAY allow the READ to be serviced subject to mandatory byte-range locks or the current share deny modes for the file. For a READ with a stateid value of all bits equal to one, the server MAY allow READ operations to bypass locking checks at the server.
On success, the current filehandle retains its value.
If the server returns a "short read" (i.e., less data than requested and eof is set to FALSE), the client should send another READ to get the remaining data. A server may return less data than requested under several circumstances. The file may have been truncated by another client or perhaps on the server itself, changing the file size from what the requesting client believes to be the case. This would reduce the actual amount of data available to the client. It is also possible that the server reduces the transfer size and so returns a short read result. Server resource exhaustion may also result in a short read.
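A client-side loop for handling short reads might look like the following sketch. The send_read callable is a hypothetical stand-in for issuing one READ and returning (eof, data); a real client would also handle errors and stateids.

```python
from typing import Callable, List, Tuple

def read_fully(send_read: Callable[[int, int], Tuple[bool, bytes]],
               offset: int, count: int) -> bytes:
    """Collect up to `count` bytes, re-issuing READ after each short read."""
    chunks: List[bytes] = []
    remaining = count
    while remaining > 0:
        eof, data = send_read(offset, remaining)
        chunks.append(data)
        offset += len(data)
        remaining -= len(data)
        if eof or not data:
            # Stop at end-of-file; also stop on an empty reply to avoid spinning.
            break
    return b"".join(chunks)
```

Stopping on an empty reply with eof FALSE is a defensive choice made for this sketch, not a protocol requirement.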
If mandatory byte-range locking is in effect for the file, and if the byte-range corresponding to the data to be read from the file is WRITE_LT locked by an owner not associated with the stateid, the server will return the NFS4ERR_LOCKED error. The client should try to get the appropriate READ_LT via the LOCK operation before re-attempting the READ. When the READ completes, the client should release the byte-range lock via LOCKU.
If another client has an OPEN_DELEGATE_WRITE delegation for the file being read, the delegation must be recalled, and the operation cannot proceed until that delegation is returned or revoked. Except where this happens very quickly, one or more NFS4ERR_DELAY errors will be returned to requests made while the delegation remains outstanding. Normally, delegations will not be recalled as a result of a READ operation since the recall will occur as a result of an earlier OPEN. However, since it is possible for a READ to be done with a special stateid, the server needs to check for this case even though the client should have done an OPEN previously.
struct READDIR4args {
/* CURRENT_FH: directory */
nfs_cookie4 cookie;
verifier4 cookieverf;
count4 dircount;
count4 maxcount;
bitmap4 attr_request;
};
struct entry4 {
nfs_cookie4 cookie;
component4 name;
fattr4 attrs;
entry4 *nextentry;
};
struct dirlist4 {
entry4 *entries;
bool eof;
};
struct READDIR4resok {
verifier4 cookieverf;
dirlist4 reply;
};
union READDIR4res switch (nfsstat4 status) {
case NFS4_OK:
READDIR4resok resok4;
default:
void;
};
The READDIR operation retrieves a variable number of entries from a file system directory and returns client-requested attributes for each entry along with information to allow the client to request additional directory entries in a subsequent READDIR.
The arguments contain a cookie value that represents where the READDIR should start within the directory. A value of zero for the cookie is used to start reading at the beginning of the directory. For subsequent READDIR requests, the client specifies a cookie value that is provided by the server on a previous READDIR request.
The request's cookieverf field should be set to zero when the request's cookie field is zero (first read of the directory). On subsequent requests, the cookieverf field must match the cookieverf returned by the READDIR in which the cookie was acquired. If the server determines that the cookieverf is no longer valid for the directory, the error NFS4ERR_NOT_SAME must be returned.
The dircount field of the request is a hint of the maximum number of bytes of directory information that should be returned. This value represents the total length of the names of the directory entries and the cookie value for these entries. This length represents the XDR encoding of the data (names and cookies) and not the length in the native format of the server.
The maxcount field of the request represents the maximum total size of all of the data being returned within the READDIR4resok structure and includes the XDR overhead. The server MAY return less data. If the server is unable to return a single directory entry within the maxcount limit, the error NFS4ERR_TOOSMALL MUST be returned to the client.
Finally, the request's attr_request field represents the list of attributes to be returned for each directory entry supplied by the server.
A successful reply consists of a list of directory entries. Each of these entries contains the name of the directory entry, a cookie value for that entry, and the associated attributes as requested. The "eof" flag has a value of TRUE if there are no more entries in the directory.
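The cookie and cookieverf handling described above can be sketched as a client loop. The send_readdir callable and the simplified reply shape (cookieverf, list of (cookie, name) pairs, eof) are assumptions made for illustration; attributes, dircount, and maxcount are omitted.

```python
from typing import Callable, List, Tuple

ZERO_VERF = bytes(8)  # verifier4 is an 8-byte opaque value; all zero on the first request

# Hypothetical simplified reply: (cookieverf, [(cookie, name), ...], eof)
Reply = Tuple[bytes, List[Tuple[int, str]], bool]

def list_directory(send_readdir: Callable[[int, bytes], Reply]) -> List[str]:
    """Read a whole directory by chaining READDIR requests."""
    names: List[str] = []
    cookie, verf = 0, ZERO_VERF  # cookie zero starts at the beginning of the directory
    while True:
        # Echo back the cookieverf from the reply in which the cookie was acquired.
        verf, entries, eof = send_readdir(cookie, verf)
        for entry_cookie, name in entries:
            names.append(name)
            cookie = entry_cookie  # opaque cursor; never interpreted as an offset
        if eof or not entries:
            return names
```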
The cookie value is only meaningful to the server and is used as a cursor for the directory entry. As mentioned, this cookie is used by the client for subsequent READDIR operations so that it may continue reading a directory. The cookie is similar in concept to a READ offset but
MUST NOT be interpreted as such by the client. Ideally, the cookie value
SHOULD NOT change if the directory is modified since the client may be caching these values.
In some cases, the server may encounter an error while obtaining the attributes for a directory entry. Instead of returning an error for the entire READDIR operation, the server can instead return the attribute rdattr_error (
Section 5.8.1.12). With this, the server is able to communicate the failure to the client and not fail the entire operation in the instance of what might be a transient failure. Obviously, the client must request the fattr4_rdattr_error attribute for this method to work properly. If the client does not request the attribute, the server has no choice but to return failure for the entire READDIR operation.
For some file system environments, the directory entries "." and ".." have special meaning, and in other environments, they do not. If the server supports these special entries within a directory, they
SHOULD NOT be returned to the client as part of the READDIR response. To enable some client environments, the cookie values of zero, one, and two are to be considered reserved. Note that the UNIX client will use these values when combining the server's response and local representations to enable a fully formed UNIX directory presentation to the application.
For READDIR arguments, cookie values of one and two
SHOULD NOT be used, and for READDIR results, cookie values of zero, one, and two
SHOULD NOT be returned.
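The reserved cookie values can be expressed as two simple checks. The names below are illustrative assumptions.

```python
# Cookie values a client SHOULD NOT send in READDIR arguments.
RESERVED_IN_ARGS = frozenset({1, 2})
# Cookie values a server SHOULD NOT return in READDIR results.
RESERVED_IN_RESULTS = frozenset({0, 1, 2})

def request_cookie_ok(cookie: int) -> bool:
    """Zero is allowed in a request because it means 'start of directory'."""
    return cookie not in RESERVED_IN_ARGS

def result_cookie_ok(cookie: int) -> bool:
    """Entries returned by the server should carry cookies of three or greater."""
    return cookie not in RESERVED_IN_RESULTS
```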
On success, the current filehandle retains its value.
The server's file system directory representations can differ greatly. A client's programming interfaces may also be bound to the local operating environment in a way that does not translate well into the NFS protocol. Therefore, the dircount and maxcount fields are provided to enable the client to provide hints to the server. If the client is aggressive about attribute collection during a READDIR, the server has an idea of how to limit the encoded response.
If dircount is zero, the server bounds the reply's size based on the request's maxcount field.
The cookieverf may be used by the server to help manage cookie values that may become stale. It should be a rare occurrence that a server is unable to continue properly reading a directory with the provided cookie/cookieverf pair. The server
SHOULD make every effort to avoid this condition since the application at the client might be unable to properly handle this type of failure.
The use of the cookieverf will also protect the client from using READDIR cookie values that might be stale. For example, if the file system has been migrated, the server might or might not be able to use the same cookie values to service READDIR as the previous server used. With the client providing the cookieverf, the server is able to provide the appropriate response to the client. This prevents the case where the server accepts a cookie value but the underlying directory has changed and the response is invalid from the client's context of its previous READDIR.
Since some servers will not be returning "." and ".." entries as has been done with previous versions of the NFS protocol, the client that requires these entries be present in READDIR responses must fabricate them.
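A client that needs "." and ".." in its directory listing could fabricate them along these lines. This is a minimal sketch that handles names only; the function name is an assumption, and a real client would also supply attributes and locally chosen cookie values.

```python
from typing import Iterable, List

def with_dot_entries(server_names: Iterable[str]) -> List[str]:
    """Prepend "." and ".." unless the server already returned them."""
    names = list(server_names)
    missing = [dot for dot in (".", "..") if dot not in names]
    return missing + names
```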
/* CURRENT_FH: symlink */
void;
struct READLINK4resok {
linktext4 link;
};
union READLINK4res switch (nfsstat4 status) {
case NFS4_OK:
READLINK4resok resok4;
default:
void;
};
READLINK reads the data associated with a symbolic link. Depending on the value of the UTF-8 capability attribute (
Section 14.4), the data is encoded in UTF-8. Whether created by an NFS client or created locally on the server, the data in a symbolic link is not interpreted (except possibly to check for proper UTF-8 encoding) when created, but is simply stored.
On success, the current filehandle retains its value.
A symbolic link is nominally a pointer to another file. The data is not necessarily interpreted by the server, just stored in the file. It is possible for a client implementation to store a pathname that is not meaningful to the server operating system in a symbolic link. A READLINK operation returns the data to the client for interpretation. If different implementations want to share access to symbolic links, then they must agree on the interpretation of the data in the symbolic link.
The READLINK operation is only allowed on objects of type NF4LNK. The server should return the error NFS4ERR_WRONG_TYPE if the object is not of type NF4LNK.
struct REMOVE4args {
/* CURRENT_FH: directory */
component4 target;
};
struct REMOVE4resok {
change_info4 cinfo;
};
union REMOVE4res switch (nfsstat4 status) {
case NFS4_OK:
REMOVE4resok resok4;
default:
void;
};
The REMOVE operation removes (deletes) a directory entry named by filename from the directory corresponding to the current filehandle. If the entry in the directory was the last reference to the corresponding file system object, the object may be destroyed. The directory may be either of type NF4DIR or NF4ATTRDIR.
For the directory where the filename was removed, the server returns change_info4 information in cinfo. With the atomic field of the change_info4 data type, the server will indicate if the before and after change attributes were obtained atomically with respect to the removal.
If the target has a length of zero, or if the target does not obey the UTF-8 definition (and the server is enforcing UTF-8 encoding; see
Section 14.4), the error NFS4ERR_INVAL will be returned.
On success, the current filehandle retains its value.
NFSv3 required a different operator, RMDIR, for directory removal and REMOVE for non-directory removal. This allowed clients to skip checking the file type when being passed a non-directory delete system call (e.g., unlink() [
24] in POSIX) to remove a directory, as well as the converse (e.g., a rmdir() on a non-directory) because they knew the server would check the file type. NFSv4.1 REMOVE can be used to delete any directory entry independent of its file type. The implementor of an NFSv4.1 client's entry points from the unlink() and rmdir() system calls should first check the file type against the types the system call is allowed to remove before sending a REMOVE operation. Alternatively, the implementor can produce a COMPOUND call that includes a LOOKUP/VERIFY sequence of operations to verify the file type before a REMOVE operation in the same COMPOUND call.
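The client-side type check recommended above can be sketched as follows. The function name is hypothetical, and the sketch assumes a client whose unlink() never removes directories and whose rmdir() only removes directories.

```python
def remove_permitted(syscall: str, obj_type: str) -> bool:
    """Decide locally whether a REMOVE may be sent for this system call."""
    is_directory = obj_type == "NF4DIR"
    if syscall == "unlink":
        return not is_directory  # unlink() must not remove a directory
    if syscall == "rmdir":
        return is_directory      # rmdir() must only remove a directory
    raise ValueError("unsupported system call: " + syscall)
```

Only when this check passes would the client send REMOVE; the object type would come from cached attributes or a preceding GETATTR.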
The concept of last reference is server specific. However, if the numlinks field in the previous attributes of the object had the value 1, the client should not rely on referring to the object via a filehandle. Likewise, the client should not rely on the resources (disk space, directory entry, and so on) formerly associated with the object becoming immediately available. Thus, if a client needs to be able to continue to access a file after using REMOVE to remove it, the client should take steps to make sure that the file will still be accessible. While the traditional mechanism used is to RENAME the file from its old name to a new hidden name, the NFSv4.1 OPEN operation
MAY return a result flag, OPEN4_RESULT_PRESERVE_UNLINKED, which indicates to the client that the file will be preserved if the file has an outstanding open (see
Section 18.16).
If the server finds that the file is still open when the REMOVE arrives:
-
The server SHOULD NOT delete the file's directory entry if the file was opened with OPEN4_SHARE_DENY_WRITE or OPEN4_SHARE_DENY_BOTH.
-
If the file was not opened with OPEN4_SHARE_DENY_WRITE or OPEN4_SHARE_DENY_BOTH, the server SHOULD delete the file's directory entry. However, until the last CLOSE of the file, the server MAY continue to allow access to the file via its filehandle.
-
The server MUST NOT delete the directory entry if the reply from OPEN had the flag OPEN4_RESULT_PRESERVE_UNLINKED set.
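The three rules above can be summarized as a small decision function. This is a sketch: the function and parameter names are assumptions, and checking the MUST NOT case first is an ordering choice made here so that the strongest requirement wins.

```python
from typing import Tuple

def open_file_remove_policy(denies_write: bool,
                            preserve_unlinked: bool) -> Tuple[bool, str]:
    """Return (delete_directory_entry, requirement_level) for a REMOVE of an open file.

    denies_write: opened with OPEN4_SHARE_DENY_WRITE or OPEN4_SHARE_DENY_BOTH.
    preserve_unlinked: the OPEN reply had OPEN4_RESULT_PRESERVE_UNLINKED set.
    """
    if preserve_unlinked:
        return False, "MUST NOT"
    if denies_write:
        return False, "SHOULD NOT"
    # Entry is deleted; access via the filehandle MAY continue until the last CLOSE.
    return True, "SHOULD"
```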
The server
MAY implement its own restrictions on removal of a file while it is open. The server might disallow such a REMOVE (or a removal that occurs as part of RENAME). The conditions that influence the restrictions on removal of a file while it is still open include:
-
Whether certain access protocols (i.e., not just NFS) are holding the file open.
-
Whether particular options, access modes, or policies on the server are enabled.
If a file has an outstanding OPEN and this prevents the removal of the file's directory entry, the error NFS4ERR_FILE_OPEN is returned.
Where the determination above cannot be made definitively because delegations are being held, they
MUST be recalled to allow processing of the REMOVE to continue. When a delegation is held, the server has no reliable knowledge of the status of OPENs for that client, so unless there are files opened with the particular deny modes by clients without delegations, the determination cannot be made until delegations are recalled, and the operation cannot proceed until each sufficient delegation has been returned or revoked to allow the server to make a correct determination.
In all cases in which delegations are recalled, the server is likely to return one or more NFS4ERR_DELAY errors while delegations remain outstanding.
If the current filehandle designates a directory for which another client holds a directory delegation, then, unless the situation can be resolved by sending a notification, the directory delegation MUST be recalled, and the operation MUST NOT proceed until the delegation is returned or revoked. Except where this happens very quickly, one or more NFS4ERR_DELAY errors will be returned to requests made while the delegation remains outstanding.
When the current filehandle designates a directory for which one or more directory delegations exist, then, when those delegations request such notifications, NOTIFY4_REMOVE_ENTRY will be generated as a result of this operation.
Note that when a remove occurs as a result of a RENAME, NOTIFY4_REMOVE_ENTRY will only be generated if the removal happens as a separate operation. In the case in which the removal is integrated and atomic with RENAME, the notification of the removal is integrated with notification for the RENAME. See the discussion of the NOTIFY4_RENAME_ENTRY notification in
Section 20.4.
struct RENAME4args {
/* SAVED_FH: source directory */
component4 oldname;
/* CURRENT_FH: target directory */
component4 newname;
};
struct RENAME4resok {
change_info4 source_cinfo;
change_info4 target_cinfo;
};
union RENAME4res switch (nfsstat4 status) {
case NFS4_OK:
RENAME4resok resok4;
default:
void;
};
The RENAME operation renames the object identified by oldname in the source directory corresponding to the saved filehandle, as set by the SAVEFH operation, to newname in the target directory corresponding to the current filehandle. The operation is required to be atomic to the client. Source and target directories
MUST reside on the same file system on the server. On success, the current filehandle will continue to be the target directory.
If the target directory already contains an entry with the name newname, the source object
MUST be compatible with the target: either both are non-directories or both are directories and the target
MUST be empty. If compatible, the existing target is removed before the rename occurs or, preferably, the target is removed atomically as part of the rename. See
Section 18.25.4 for client and server actions whenever a target is removed. Note however that when the removal is performed atomically with the rename, certain parts of the removal described there are integrated with the rename. For example, notification of the removal will not be via a NOTIFY4_REMOVE_ENTRY but will be indicated as part of the NOTIFY4_ADD_ENTRY or NOTIFY4_RENAME_ENTRY generated by the rename.
If the source object and the target are not compatible or if the target is a directory but not empty, the server will return the error NFS4ERR_EXIST.
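The compatibility rule for an existing target can be sketched as below. The function and its parameters are illustrative assumptions; it covers only the type and emptiness checks, not open-file restrictions or delegations.

```python
def rename_target_status(source_is_dir: bool, target_exists: bool,
                         target_is_dir: bool = False,
                         target_is_empty: bool = True) -> str:
    """Return the status implied by the source/target compatibility rule."""
    if not target_exists:
        return "NFS4_OK"
    if source_is_dir != target_is_dir:
        return "NFS4ERR_EXIST"  # directory vs. non-directory mismatch
    if target_is_dir and not target_is_empty:
        return "NFS4ERR_EXIST"  # a directory target must be empty
    return "NFS4_OK"            # compatible: the existing target is replaced
```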
If oldname and newname both refer to the same file (e.g., they might be hard links of each other), then unless the file is open (see
Section 18.26.4), RENAME
MUST perform no action and return NFS4_OK.
For both directories involved in the RENAME, the server returns change_info4 information. With the atomic field of the change_info4 data type, the server will indicate if the before and after change attributes were obtained atomically with respect to the rename.
If oldname refers to a named attribute and the saved and current filehandles refer to different file system objects, the server will return NFS4ERR_XDEV just as if the saved and current filehandles represented directories on different file systems.
If oldname or newname has a length of zero, or if oldname or newname does not obey the UTF-8 definition, the error NFS4ERR_INVAL will be returned.
The server
MAY impose restrictions on the RENAME operation such that RENAME may not be done when the file being renamed is open or when that open is done by particular protocols, or with particular options or access modes. Similar restrictions may be applied when a file exists with the target name and is open. When RENAME is rejected because of such restrictions, the error NFS4ERR_FILE_OPEN is returned.
When oldname and newname refer to the same file and that file is open in a fashion such that RENAME would normally be rejected with NFS4ERR_FILE_OPEN if oldname and newname were different files, then RENAME SHOULD be rejected with NFS4ERR_FILE_OPEN.
If a server does implement such restrictions and those restrictions include cases of NFSv4 opens preventing successful execution of a rename, the server needs to recall any delegations that could hide the existence of opens relevant to that decision. This is because when a client holds a delegation, the server might not have an accurate account of the opens for that client, since the client may execute OPENs and CLOSEs locally. The RENAME operation need only be delayed until a definitive result can be obtained. For example, if there are multiple delegations and one of them establishes an open whose presence would prevent the rename, given the server's semantics, NFS4ERR_FILE_OPEN may be returned to the caller as soon as that delegation is returned without waiting for other delegations to be returned. Similarly, if such opens are not associated with delegations, NFS4ERR_FILE_OPEN can be returned immediately with no delegation recall being done.
If the current filehandle or the saved filehandle designates a directory for which another client holds a directory delegation, then, unless the situation can be resolved by sending a notification, the delegation MUST be recalled, and the operation cannot proceed until the delegation is returned or revoked. Except where this happens very quickly, one or more NFS4ERR_DELAY errors will be returned to requests made while the delegation remains outstanding.
When the current and saved filehandles are the same and they designate a directory for which one or more directory delegations exist, then, when those delegations request such notifications, a notification of type NOTIFY4_RENAME_ENTRY will be generated as a result of this operation. When oldname and newname refer to the same file, no notification is generated (because, as Section 18.26.3 states, the server MUST take no action). When a file is removed because it has the same name as the target, if that removal is done atomically with the rename, a NOTIFY4_REMOVE_ENTRY notification will not be generated. Instead, the deletion of the file will be reported as part of the NOTIFY4_RENAME_ENTRY notification.
When the current and saved filehandles are not the same:
-
If the current filehandle designates a directory for which one or more directory delegations exist, then, when those delegations request such notifications, NOTIFY4_ADD_ENTRY will be generated as a result of this operation. When a file is removed because it has the same name as the target, if that removal is done atomically with the rename, a NOTIFY4_REMOVE_ENTRY notification will not be generated. Instead, the deletion of the file will be reported as part of the NOTIFY4_ADD_ENTRY notification.
-
If the saved filehandle designates a directory for which one or more directory delegations exist, then, when those delegations request such notifications, NOTIFY4_REMOVE_ENTRY will be generated as a result of this operation.
If the object being renamed has file delegations held by clients other than the one doing the RENAME, the delegations MUST be recalled, and the operation cannot proceed until each such delegation is returned or revoked. Note that in the case of multiply linked files, the delegation recall requirement applies even if the delegation was obtained through a different name than the one being renamed. In all cases in which delegations are recalled, the server is likely to return one or more NFS4ERR_DELAY errors while the delegations remain outstanding, although it might not do that if the delegations are returned quickly.
The RENAME operation must be atomic to the client. The statement "source and target directories
MUST reside on the same file system on the server" means that the fsid fields in the attributes for the directories are the same. If they reside on different file systems, the error NFS4ERR_XDEV is returned.
Based on the value of the fh_expire_type attribute for the object, the filehandle may or may not expire on a RENAME. However, server implementors are strongly encouraged to attempt to keep filehandles from expiring in this fashion.
On some servers, the file names "." and ".." are illegal as either oldname or newname, and will result in the error NFS4ERR_BADNAME. In addition, on many servers the case of oldname or newname being an alias for the source directory will be checked for. Such servers will return the error NFS4ERR_INVAL in these cases.
If either the source or the target filehandle is not a directory, the server will return NFS4ERR_NOTDIR.
struct RESTOREFH4res {
/*
* If status is NFS4_OK,
* new CURRENT_FH: value of saved fh
*/
nfsstat4 status;
};
The RESTOREFH operation sets the current filehandle and stateid to the values in the saved filehandle and stateid. If there is no saved filehandle, then the server will return the error NFS4ERR_NOFILEHANDLE.
See Section 16.2.3.1.1 for more details on the current filehandle.
See Section 16.2.3.1.2 for more details on the current stateid.
Operations like OPEN and LOOKUP use the current filehandle to represent a directory and replace it with a new filehandle. Assuming that the previous filehandle was saved with a SAVEFH operator, the previous filehandle can be restored as the current filehandle. This is commonly used to obtain post-operation attributes for the directory, e.g.,
PUTFH (directory filehandle)
SAVEFH
GETATTR attrbits (pre-op dir attrs)
CREATE optbits "foo" attrs
GETATTR attrbits (file attributes)
RESTOREFH
GETATTR attrbits (post-op dir attrs)
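The effect of SAVEFH and RESTOREFH on the filehandles in the sequence above can be modeled with a small state object. This is a sketch under stated assumptions: the class and method names are hypothetical, filehandles are plain strings, the saved and current stateids are not modeled, and returning NFS4ERR_NOFILEHANDLE from SAVEFH when no current filehandle exists is assumed here rather than stated in this section.

```python
from typing import Optional

class CompoundState:
    """Track the current and saved filehandles across a COMPOUND (illustrative)."""

    def __init__(self) -> None:
        self.current_fh: Optional[str] = None
        self.saved_fh: Optional[str] = None

    def putfh(self, fh: str) -> str:
        self.current_fh = fh
        return "NFS4_OK"

    def replace_current(self, fh: str) -> str:
        # Stands in for operations such as CREATE, LOOKUP, or OPEN that
        # replace the current filehandle with a new one.
        self.current_fh = fh
        return "NFS4_OK"

    def savefh(self) -> str:
        if self.current_fh is None:
            return "NFS4ERR_NOFILEHANDLE"  # assumption, see lead-in
        self.saved_fh = self.current_fh    # any previously saved value is lost
        return "NFS4_OK"

    def restorefh(self) -> str:
        if self.saved_fh is None:
            return "NFS4ERR_NOFILEHANDLE"
        self.current_fh = self.saved_fh
        return "NFS4_OK"
```

Replaying the example: putfh("dir"), savefh(), replace_current("foo"), then restorefh() leaves the current filehandle at "dir" again, so the final GETATTR reports the directory's post-operation attributes.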
struct SAVEFH4res {
/*
* If status is NFS4_OK,
* new SAVED_FH: value of current fh
*/
nfsstat4 status;
};
The SAVEFH operation saves the current filehandle and stateid. If a previous filehandle was saved, then it is no longer accessible. The saved filehandle can be restored as the current filehandle with the RESTOREFH operator.
On success, the current filehandle retains its value.
See Section 16.2.3.1.1 for more details on the current filehandle.
See Section 16.2.3.1.2 for more details on the current stateid.
struct SECINFO4args {
/* CURRENT_FH: directory */
component4 name;
};
/*
* From RFC 2203
*/
enum rpc_gss_svc_t {
RPC_GSS_SVC_NONE = 1,
RPC_GSS_SVC_INTEGRITY = 2,
RPC_GSS_SVC_PRIVACY = 3
};
struct rpcsec_gss_info {
sec_oid4 oid;
qop4 qop;
rpc_gss_svc_t service;
};
/* RPCSEC_GSS has a value of '6' - See RFC 2203 */
union secinfo4 switch (uint32_t flavor) {
case RPCSEC_GSS:
rpcsec_gss_info flavor_info;
default:
void;
};
typedef secinfo4 SECINFO4resok<>;
union SECINFO4res switch (nfsstat4 status) {
case NFS4_OK:
/* CURRENTFH: consumed */
SECINFO4resok resok4;
default:
void;
};
The SECINFO operation is used by the client to obtain a list of valid RPC authentication flavors for a specific directory filehandle, file name pair. SECINFO should apply the same access methodology used for LOOKUP when evaluating the name. Therefore, if the requester does not have the appropriate access to LOOKUP the name, then SECINFO
MUST behave the same way and return NFS4ERR_ACCESS.
The result will contain an array that represents the security mechanisms available, with an order corresponding to the server's preferences, the most preferred being first in the array. The client is free to pick whatever security mechanism it both desires and supports, or to pick in the server's preference order the first one it supports. The array entries are represented by the secinfo4 structure. The field 'flavor' will contain a value of AUTH_NONE, AUTH_SYS (as defined in [
3]), or RPCSEC_GSS (as defined in [
4]). The field flavor can also be any other security flavor registered with IANA.
For the flavors AUTH_NONE and AUTH_SYS, no additional security information is returned. The same is true of many (if not most) other security flavors, including AUTH_DH. For a return value of RPCSEC_GSS, a security triple is returned that contains the mechanism object identifier (OID, as defined in [
7]), the quality of protection (as defined in [
7]), and the service type (as defined in [
4]). It is possible for SECINFO to return multiple entries with flavor equal to RPCSEC_GSS with different security triple values.
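One of the selection strategies described above, picking the first entry in the server's preference order that the client supports, can be sketched as follows. Entries are modeled as tuples: a one-element tuple for flavors with no extra information, and (flavor, oid, qop, service) for RPCSEC_GSS. The function name and the placeholder OID strings are assumptions for illustration only.

```python
from typing import Optional, Sequence, Set, Tuple

SecEntry = Tuple  # ("AUTH_SYS",) or ("RPCSEC_GSS", oid, qop, service)

def choose_security(server_list: Sequence[SecEntry],
                    client_supported: Set[SecEntry]) -> Optional[SecEntry]:
    """Return the first server-preferred entry the client supports, if any."""
    for entry in server_list:  # the server lists its most preferred entry first
        if entry in client_supported:
            return entry
    return None
```

For RPCSEC_GSS the whole triple must match, which is why two entries with the same flavor but different service types are treated as distinct choices.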
On success, the current filehandle is consumed (see
Section 2.6.3.1.1.8), and if the next operation after SECINFO tries to use the current filehandle, that operation will fail with the status NFS4ERR_NOFILEHANDLE.
If the name has a length of zero, or if the name does not obey the UTF-8 definition (assuming UTF-8 capabilities are enabled; see
Section 14.4), the error NFS4ERR_INVAL will be returned.
See Section 2.6 for additional information on the use of SECINFO.
The SECINFO operation is expected to be used by the NFS client when the error value of NFS4ERR_WRONGSEC is returned from another NFS operation. This signifies to the client that the server's security policy is different from what the client is currently using. At this point, the client is expected to obtain a list of possible security flavors and choose what best suits its policies.
As mentioned, the server's security policies will determine when a client request receives NFS4ERR_WRONGSEC. See
Table 14 for a list of operations that can return NFS4ERR_WRONGSEC. In addition, when READDIR returns attributes, the rdattr_error (
Section 5.8.1.12) can contain NFS4ERR_WRONGSEC. Note that CREATE and REMOVE
MUST NOT return NFS4ERR_WRONGSEC. The rationale for CREATE is that unless the target name exists, it cannot have a separate security policy from the parent directory, and the security policy of the parent was checked when its filehandle was injected into the COMPOUND request's operations stream (for similar reasons, an OPEN operation that creates the target
MUST NOT return NFS4ERR_WRONGSEC). If the target name exists, while it might have a separate security policy, that is irrelevant because CREATE
MUST return NFS4ERR_EXIST. The rationale for REMOVE is that while that target might have a separate security policy, the target is going to be removed, and so the security policy of the parent trumps that of the object being removed. RENAME and LINK
MAY return NFS4ERR_WRONGSEC, but the NFS4ERR_WRONGSEC error applies only to the saved filehandle (see
Section 2.6.3.1.2). Any NFS4ERR_WRONGSEC error on the current filehandle used by LINK and RENAME
MUST be returned by the PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH operation that injected the current filehandle.
With the exception of LINK and RENAME, the set of operations that can return NFS4ERR_WRONGSEC represents the point at which the client can inject a filehandle into the "current filehandle" at the server. The filehandle is either provided by the client (PUTFH, PUTPUBFH, PUTROOTFH), generated as a result of a name-to-filehandle translation (LOOKUP and OPEN), or generated from the saved filehandle via RESTOREFH. As
Section 2.6.3.1.1.1 states, a put filehandle operation followed by SAVEFH
MUST NOT return NFS4ERR_WRONGSEC. Thus, the RESTOREFH operation, under certain conditions (see
Section 2.6.3.1.1), is permitted to return NFS4ERR_WRONGSEC so that security policies can be honored.
The READDIR operation will not directly return the NFS4ERR_WRONGSEC error. However, if the READDIR request included a request for attributes, it is possible that the READDIR request's security triple did not match that of a directory entry. If this is the case and the client has requested the rdattr_error attribute, the server will return the NFS4ERR_WRONGSEC error in rdattr_error for the entry.
To resolve an error return of NFS4ERR_WRONGSEC, the client does the following:
-
For LOOKUP and OPEN, the client will use SECINFO with the same current filehandle and name as provided in the original LOOKUP or OPEN to enumerate the available security triples.
-
For the rdattr_error, the client will use SECINFO with the same current filehandle as provided in the original READDIR. The name passed to SECINFO will be that of the directory entry (as returned from READDIR) that had the NFS4ERR_WRONGSEC error in the rdattr_error attribute.
-
For PUTFH, PUTROOTFH, PUTPUBFH, RESTOREFH, LINK, and RENAME, the client will use SECINFO_NO_NAME { style = SECINFO_STYLE4_CURRENT_FH }. The client will prefix the SECINFO_NO_NAME operation with the appropriate PUTFH, PUTPUBFH, or PUTROOTFH operation that provides the filehandle originally provided by the PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH operation.
NOTE: In NFSv4.0, the client was required to use SECINFO, and had to reconstruct the parent of the original filehandle and the component name of the original filehandle. The introduction in NFSv4.1 of SECINFO_NO_NAME obviates the need for reconstruction.
-
For LOOKUPP, the client will use SECINFO_NO_NAME { style = SECINFO_STYLE4_PARENT } and provide the filehandle that equals the filehandle originally provided to LOOKUPP.
See Section 21 for a discussion on the recommendations for the security flavor used by SECINFO and SECINFO_NO_NAME.
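The recovery rules above amount to a lookup from the operation that reported NFS4ERR_WRONGSEC to the SECINFO variant the client uses next. The following Python sketch is illustrative only: the Recovery enum, the table, and the function wrongsec_recovery are hypothetical names, and the key "READDIR" stands for the case in which the error arrived in rdattr_error.

```python
from enum import Enum


class Recovery(Enum):
    """How the client enumerates security triples after NFS4ERR_WRONGSEC."""
    SECINFO = "SECINFO with the original current filehandle and name"
    SECINFO_NO_NAME_CURRENT_FH = "SECINFO_NO_NAME, style SECINFO_STYLE4_CURRENT_FH"
    SECINFO_NO_NAME_PARENT = "SECINFO_NO_NAME, style SECINFO_STYLE4_PARENT"


# Operation that reported the error -> recovery step described in the text.
_RECOVERY_BY_OP = {
    "LOOKUP": Recovery.SECINFO,
    "OPEN": Recovery.SECINFO,
    "READDIR": Recovery.SECINFO,  # error reported in rdattr_error; name is the entry name
    "PUTFH": Recovery.SECINFO_NO_NAME_CURRENT_FH,
    "PUTROOTFH": Recovery.SECINFO_NO_NAME_CURRENT_FH,
    "PUTPUBFH": Recovery.SECINFO_NO_NAME_CURRENT_FH,
    "RESTOREFH": Recovery.SECINFO_NO_NAME_CURRENT_FH,
    "LINK": Recovery.SECINFO_NO_NAME_CURRENT_FH,
    "RENAME": Recovery.SECINFO_NO_NAME_CURRENT_FH,
    "LOOKUPP": Recovery.SECINFO_NO_NAME_PARENT,
}


def wrongsec_recovery(op: str) -> Recovery:
    """Return the recovery step for an operation that returned NFS4ERR_WRONGSEC."""
    try:
        return _RECOVERY_BY_OP[op]
    except KeyError:
        raise ValueError(f"{op} is not covered by the recovery rules") from None
```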
struct SETATTR4args {
/* CURRENT_FH: target object */
stateid4 stateid;
fattr4 obj_attributes;
};
struct SETATTR4res {
nfsstat4 status;
bitmap4 attrsset;
};
The SETATTR operation changes one or more of the attributes of a file system object. The new attributes are specified with a bitmap and the attributes that follow the bitmap in bit order.
The stateid argument for SETATTR is used to provide byte-range locking context that is necessary for SETATTR requests that set the size attribute. Since setting the size attribute modifies the file's data, it has the same locking requirements as a corresponding WRITE. Any SETATTR that sets the size attribute is incompatible with a share reservation that specifies OPEN4_SHARE_DENY_WRITE. The area between the old end-of-file and the new end-of-file is considered to be modified just as would have been the case had the area in question been specified as the target of WRITE, for the purpose of checking conflicts with byte-range locks, for those cases in which a server is implementing mandatory byte-range locking behavior. A valid stateid
SHOULD always be specified. When the file size attribute is not set, the special stateid consisting of all bits equal to zero
MAY be passed.
On either success or failure of the operation, the server will return the attrsset bitmask to represent what (if any) attributes were successfully set. The attrsset in the response is a subset of the attrmask field of the obj_attributes field in the argument.
On success, the current filehandle retains its value.
If the request specifies the owner attribute to be set, the server
SHOULD allow the operation to succeed if the current owner of the object matches the value specified in the request. Some servers may be implemented in a way as to prohibit the setting of the owner attribute unless the requester has privilege to do so. If the server is lenient in this one case of matching owner values, the client implementation may be simplified in cases of creation of an object (e.g., an exclusive create via OPEN) followed by a SETATTR.
The file size attribute is used to request changes to the size of a file. A value of zero causes the file to be truncated, a value less than the current size of the file causes data from the new size to the end of the file to be discarded, and a size greater than the current size of the file causes logically zeroed data bytes to be added to the end of the file. Servers are free to implement this using unallocated bytes (holes) or allocated data bytes set to zero. Clients should not make any assumptions regarding a server's implementation of this feature, beyond that the bytes in the affected byte-range returned by READ will be zeroed. Servers MUST support extending the file size via SETATTR.
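The effect of setting the size attribute, as seen by a later READ, can be modeled on an in-memory byte string. This is a minimal sketch, not a server implementation; the function name apply_size is hypothetical, and the model says nothing about whether a real server uses holes or allocated zero bytes.

```python
def apply_size(data: bytes, new_size: int) -> bytes:
    """Model what READ would return after SETATTR sets size to new_size."""
    if new_size < 0:
        raise ValueError("size cannot be negative")
    if new_size <= len(data):
        # Zero truncates the file; a smaller size discards the tail.
        return data[:new_size]
    # A larger size appends logically zeroed bytes.
    return data + bytes(new_size - len(data))
```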
SETATTR is not guaranteed to be atomic. A failed SETATTR may partially change a file's attributes, which is why the reply always includes the status and the list of attributes that were set.
If the object whose attributes are being changed has a file delegation that is held by a client other than the one doing the SETATTR, the delegation(s) must be recalled, and the operation cannot proceed to actually change an attribute until each such delegation is returned or revoked. In all cases in which delegations are recalled, the server is likely to return one or more NFS4ERR_DELAY errors while the delegation(s) remains outstanding, although it might not do that if the delegations are returned quickly.
If the object whose attributes are being set is a directory and another client holds a directory delegation for that directory, then if enabled, asynchronous notifications will be generated when the set of attributes changed has a non-null intersection with the set of attributes for which notification is requested. Notifications of type NOTIFY4_CHANGE_DIR_ATTRS will be sent to the appropriate client(s), but the SETATTR is not delayed by waiting for these notifications to be sent.
If the object whose attributes are being set is a member of the directory for which another client holds a directory delegation, then asynchronous notifications will be generated when the set of attributes changed has a non-null intersection with the set of attributes for which notification is requested. Notifications of type NOTIFY4_CHANGE_CHILD_ATTRS will be sent to the appropriate clients, but the SETATTR is not delayed by waiting for these notifications to be sent.
Changing the size of a file with SETATTR indirectly changes the time_modify and change attributes. A client must account for this as size changes can result in data deletion.
The attributes time_access_set and time_modify_set are write-only attributes constructed as a switched union so the client can direct the server in setting the time values. If the switched union specifies SET_TO_CLIENT_TIME4, the client has provided an nfstime4 to be used for the operation. If the switched union does not specify SET_TO_CLIENT_TIME4, the server is to use its current time for the SETATTR operation.
If server and client times differ, programs that compare client time to file times can break. A time synchronization protocol should be used to limit client/server time skew.
Use of a COMPOUND containing a VERIFY operation specifying only the change attribute, immediately followed by a SETATTR, provides a means whereby a client may specify a request that emulates the functionality of the SETATTR guard mechanism of NFSv3. Since the function of the guard mechanism is to avoid changes to the file attributes based on stale information, delays between checking of the guard condition and the setting of the attributes have the potential to compromise this function, as would the corresponding delay in the NFSv4 emulation. Therefore, NFSv4.1 servers
SHOULD take care to avoid such delays, to the degree possible, when executing such a request.
If the server does not support an attribute as requested by the client, the server SHOULD return NFS4ERR_ATTRNOTSUPP.
A mask of the attributes actually set is returned by SETATTR in all cases. That mask MUST NOT include attribute bits not requested to be set by the client. If the attribute masks in the request and reply are equal, the status field in the reply MUST be NFS4_OK.
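The two mask rules can be checked mechanically by a client or a test harness. The sketch below assumes, for simplicity, that each bitmap4 has been flattened into a single non-negative Python integer; the function name check_setattr_reply and the boolean status_ok are hypothetical.

```python
from typing import List


def check_setattr_reply(requested: int, attrsset: int, status_ok: bool) -> List[str]:
    """Return the list of rule violations found in a SETATTR reply."""
    problems = []
    # The reply mask must be a subset of the requested mask.
    if attrsset & ~requested:
        problems.append("attrsset includes attribute bits that were not requested")
    # Equal masks mean every requested attribute was set, so status must be NFS4_OK.
    if attrsset == requested and not status_ok:
        problems.append("masks are equal, so status must be NFS4_OK")
    return problems
```

An empty result means the reply is consistent with both rules; a partial failure, in which attrsset is a proper subset of the request and the status is an error, passes the check.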
struct VERIFY4args {
/* CURRENT_FH: object */
fattr4 obj_attributes;
};
struct VERIFY4res {
nfsstat4 status;
};
The VERIFY operation is used to verify that attributes have the value assumed by the client before proceeding with the following operations in the COMPOUND request. If any of the attributes do not match, then the error NFS4ERR_NOT_SAME must be returned. The current filehandle retains its value after successful completion of the operation.
One possible use of the VERIFY operation is the following series of operations. With this, the client is attempting to verify that the file being removed will match what the client expects to be removed. This series can help prevent the unintended deletion of a file.
PUTFH (directory filehandle)
LOOKUP (file name)
VERIFY (filehandle == fh)
PUTFH (directory filehandle)
REMOVE (file name)
This series does not prevent a second client from removing the file and creating a new one with the same name in the middle of this sequence, but it does help avoid the unintended result.
In the case that a RECOMMENDED attribute is specified in the VERIFY operation and the server does not support that attribute for the file system object, the error NFS4ERR_ATTRNOTSUPP is returned to the client.
When the attribute rdattr_error or any set-only attribute (e.g., time_modify_set) is specified, the error NFS4ERR_INVAL is returned to the client.
enum stable_how4 {
UNSTABLE4 = 0,
DATA_SYNC4 = 1,
FILE_SYNC4 = 2
};
struct WRITE4args {
/* CURRENT_FH: file */
stateid4 stateid;
offset4 offset;
stable_how4 stable;
opaque data<>;
};
struct WRITE4resok {
count4 count;
stable_how4 committed;
verifier4 writeverf;
};
union WRITE4res switch (nfsstat4 status) {
case NFS4_OK:
WRITE4resok resok4;
default:
void;
};
The WRITE operation is used to write data to a regular file. The target file is specified by the current filehandle. The offset specifies the offset where the data should be written. An offset of zero specifies that the write should start at the beginning of the file. The count, as encoded as part of the opaque data parameter, represents the number of bytes of data that are to be written. If the count is zero, the WRITE will succeed and return a count of zero subject to permissions checking. The server
MAY write fewer bytes than requested by the client.
The client specifies with the stable parameter the method of how the data is to be processed by the server. If stable is FILE_SYNC4, the server
MUST commit the data written plus all file system metadata to stable storage before returning results. This corresponds to the NFSv2 protocol semantics. Any other behavior constitutes a protocol violation. If stable is DATA_SYNC4, then the server
MUST commit all of the data to stable storage and enough of the metadata to retrieve the data before returning. The server implementor is free to implement DATA_SYNC4 in the same fashion as FILE_SYNC4, but with a possible performance drop. If stable is UNSTABLE4, the server is free to commit any part of the data and the metadata to stable storage, including all or none, before returning a reply to the client. There is no guarantee whether or when any uncommitted data will subsequently be committed to stable storage. The only guarantees made by the server are that it will not destroy any data without changing the value of writeverf and that it will not commit the data and metadata at a level less than that requested by the client.
Except when special stateids are used, the stateid value for a WRITE request represents a value returned from a previous byte-range LOCK or OPEN request or the stateid associated with a delegation. The stateid identifies the associated owners if any and is used by the server to verify that the associated locks are still valid (e.g., have not been revoked).
Upon successful completion, the following results are returned. The count result is the number of bytes of data written to the file. The server may write fewer bytes than requested. If so, the actual number of bytes written, starting at the requested offset, is returned.
The server also returns an indication of the level of commitment of the data and metadata via committed. Per
Table 20,
-
The server MAY commit the data at a stronger level than requested.
-
The server MUST commit the data at a level at least as high as that committed.
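+============+===================================+
| stable     | committed                         |
+============+===================================+
| UNSTABLE4  | FILE_SYNC4, DATA_SYNC4, UNSTABLE4 |
+------------+-----------------------------------+
| DATA_SYNC4 | FILE_SYNC4, DATA_SYNC4            |
+------------+-----------------------------------+
| FILE_SYNC4 | FILE_SYNC4                        |
+------------+-----------------------------------+
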
Table 20: Valid Combinations of the Fields Stable in the Request and Committed in the Reply
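Because the stable_how4 values are ordered from weakest to strongest, the combinations in Table 20 reduce to a single comparison. The sketch below mirrors the enum values from the XDR definition; the class name StableHow and the function committed_is_valid are hypothetical.

```python
from enum import IntEnum


class StableHow(IntEnum):
    """Mirror of stable_how4; larger values are stronger commitments."""
    UNSTABLE4 = 0
    DATA_SYNC4 = 1
    FILE_SYNC4 = 2


def committed_is_valid(stable: StableHow, committed: StableHow) -> bool:
    """True if the reply's committed level is allowed for the requested stable level."""
    return committed >= stable
```

A client could use such a check to reject a reply that claims a weaker commitment than the one requested.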
The final portion of the result is the field writeverf. This field is the write verifier and is a cookie that the client can use to determine whether a server has changed instance state (e.g., server restart) between a call to WRITE and a subsequent call to either WRITE or COMMIT. This cookie
MUST be unchanged during a single instance of the NFSv4.1 server and
MUST be unique between instances of the NFSv4.1 server. If the cookie changes, then the client
MUST assume that any data written with an UNSTABLE4 value for committed and an old writeverf in the reply has been lost and will need to be recovered.
If a client writes data to the server with the stable argument set to UNSTABLE4 and the reply yields a committed response of DATA_SYNC4 or UNSTABLE4, the client will follow up some time in the future with a COMMIT operation to synchronize outstanding asynchronous data and metadata with the server's stable storage, barring client error. It is possible that, due to a client crash or other error, a subsequent COMMIT will not be received by the server.
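One way for a client to act on the write verifier is to remember every write whose reply came back with committed set to UNSTABLE4 and to compare the verifier in each later WRITE or COMMIT reply against the last one seen. The following sketch is illustrative only: the class UncommittedTracker and its methods are hypothetical, and it assumes for simplicity that each COMMIT covers the whole file.

```python
from typing import List, Optional, Tuple

UNSTABLE4, DATA_SYNC4, FILE_SYNC4 = 0, 1, 2

Pending = Tuple[int, bytes]  # (offset, data) of a write not yet known to be stable


class UncommittedTracker:
    def __init__(self) -> None:
        self._verifier: Optional[bytes] = None
        self._pending: List[Pending] = []

    def _note_verifier(self, writeverf: bytes) -> List[Pending]:
        """Record the verifier; return writes presumed lost if it changed."""
        lost: List[Pending] = []
        if self._verifier is not None and writeverf != self._verifier:
            lost, self._pending = self._pending, []
        self._verifier = writeverf
        return lost

    def on_write_reply(self, offset: int, data: bytes,
                       committed: int, writeverf: bytes) -> List[Pending]:
        lost = self._note_verifier(writeverf)
        if committed == UNSTABLE4:
            self._pending.append((offset, data))
        return lost  # the caller must retransmit these writes

    def on_commit_reply(self, writeverf: bytes) -> List[Pending]:
        lost = self._note_verifier(writeverf)
        # With an unchanged verifier, everything pending is now stable.
        self._pending = []
        return lost
```

Any non-empty list returned by either method identifies data the client needs to write again.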
For a WRITE with a stateid value of all bits equal to zero, the server
MAY allow the WRITE to be serviced subject to mandatory byte-range locks or the current share deny modes for the file. For a WRITE with a stateid value of all bits equal to 1, the server
MUST NOT allow the WRITE operation to bypass locking checks at the server and otherwise is treated as if a stateid of all bits equal to zero were used.
On success, the current filehandle retains its value.
It is possible for the server to write fewer bytes of data than requested by the client. In this case, the server SHOULD NOT return an error unless no data was written at all. If the server writes less than the number of bytes specified, the client will need to send another WRITE to write the remaining data.
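A client can handle short writes with a simple loop that advances the offset by the returned count. In this sketch, send_write is a hypothetical callable that issues one WRITE and returns the count from the reply; nothing here corresponds to a real client library.

```python
from typing import Callable


def write_all(send_write: Callable[[int, bytes], int], offset: int, data: bytes) -> int:
    """Send WRITEs until all of data has been accepted; return total bytes written."""
    written = 0
    while written < len(data):
        count = send_write(offset + written, data[written:])
        if count <= 0:
            # No progress at all is treated as a failure in this sketch.
            raise IOError("WRITE made no progress")
        written += count
    return written
```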
It is assumed that the act of writing data to a file will cause the time_modify and change attributes of the file to be updated. However, these attributes SHOULD NOT be changed unless the contents of the file are changed. Thus, a WRITE request with count set to zero SHOULD NOT cause the time_modify and change attributes of the file to be updated.
Stable storage is persistent storage that survives:
-
Repeated power failures.
-
Hardware failures (of any board, power supply, etc.).
-
Repeated software crashes and restarts.
This definition does not address failure of the stable storage module itself.
The verifier is defined to allow a client to detect different instances of an NFSv4.1 protocol server over which cached, uncommitted data may be lost. In the most likely case, the verifier allows the client to detect server restarts. This information is required so that the client can safely determine whether the server could have lost cached data. If the server fails unexpectedly and the client has uncommitted data from previous WRITE requests (done with the stable argument set to UNSTABLE4 and in which the result committed was returned as UNSTABLE4 as well), the server might not have flushed cached data to stable storage. The burden of recovery is on the client, and the client will need to retransmit the data to the server.
A suggested verifier would be to use the time that the server was last started (if restarting the server results in lost buffers).
The reply's committed field allows the client to do more effective caching. If the server is committing all WRITE requests to stable storage, then it
SHOULD return with committed set to FILE_SYNC4, regardless of the value of the stable field in the arguments. A server that uses an NVRAM accelerator may choose to implement this policy. The client can use this to increase the effectiveness of the cache by discarding cached data that has already been committed on the server.
Some implementations may return NFS4ERR_NOSPC instead of NFS4ERR_DQUOT when a user's quota is exceeded.
In the case that the current filehandle is of type NF4DIR, the server will return NFS4ERR_ISDIR. If the current file is a symbolic link, the error NFS4ERR_SYMLINK will be returned. Otherwise, if the current filehandle does not designate an ordinary file, the server will return NFS4ERR_WRONG_TYPE.
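The type check above can be summarized as a small decision function. The sketch uses the nfs_ftype4 names NF4REG, NF4DIR, and NF4LNK as plain strings; the function name write_type_error is hypothetical.

```python
from typing import Optional


def write_type_error(ftype: str) -> Optional[str]:
    """Return the error WRITE yields for this object type, or None for a regular file."""
    if ftype == "NF4REG":
        return None
    if ftype == "NF4DIR":
        return "NFS4ERR_ISDIR"
    if ftype == "NF4LNK":
        return "NFS4ERR_SYMLINK"
    # Any other non-regular object (device, socket, FIFO, ...).
    return "NFS4ERR_WRONG_TYPE"
```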
If mandatory byte-range locking is in effect for the file, and the corresponding byte-range of the data to be written to the file is READ_LT or WRITE_LT locked by an owner that is not associated with the stateid, the server
MUST return NFS4ERR_LOCKED. If so, the client
MUST check if the owner corresponding to the stateid used with the WRITE operation has a conflicting READ_LT lock that overlaps with the byte-range that was to be written. If the stateid's owner has no conflicting READ_LT lock, then the client
SHOULD try to get the appropriate write byte-range lock via the LOCK operation before re-attempting the WRITE. When the WRITE completes, the client
SHOULD release the byte-range lock via LOCKU.
If the stateid's owner had a conflicting READ_LT lock, then the client has no choice but to return an error to the application that attempted the WRITE. The reason is that since the stateid's owner had a READ_LT lock, either the server attempted, in effect, to temporarily upgrade this READ_LT lock to a WRITE_LT lock, or the server has no upgrade capability. If the server attempted to upgrade the READ_LT lock and failed, it is pointless for the client to re-attempt the upgrade via the LOCK operation, because there might be another client also trying to upgrade. If two clients are blocked trying to upgrade the same lock, the clients deadlock. If the server has no upgrade capability, then it is pointless to try a LOCK operation to upgrade.
If one or more other clients have delegations for the file being written, those delegations
MUST be recalled, and the operation cannot proceed until those delegations are returned or revoked. Except where this happens very quickly, one or more NFS4ERR_DELAY errors will be returned to requests made while the delegation remains outstanding. Normally, delegations will not be recalled as a result of a WRITE operation since the recall will occur as a result of an earlier OPEN. However, since it is possible for a WRITE to be done with a special stateid, the server needs to check for this case even though the client should have done an OPEN previously.
typedef opaque gsshandle4_t<>;
struct gss_cb_handles4 {
rpc_gss_svc_t gcbp_service; /* RFC 2203 */
gsshandle4_t gcbp_handle_from_server;
gsshandle4_t gcbp_handle_from_client;
};
union callback_sec_parms4 switch (uint32_t cb_secflavor) {
case AUTH_NONE:
void;
case AUTH_SYS:
authsys_parms cbsp_sys_cred; /* RFC 5531 */
case RPCSEC_GSS:
gss_cb_handles4 cbsp_gss_handles;
};
struct BACKCHANNEL_CTL4args {
uint32_t bca_cb_program;
callback_sec_parms4 bca_sec_parms<>;
};
struct BACKCHANNEL_CTL4res {
nfsstat4 bcr_status;
};
The BACKCHANNEL_CTL operation replaces the backchannel's callback program number and adds (not replaces) RPCSEC_GSS handles for use by the backchannel.
The arguments of the BACKCHANNEL_CTL call are a subset of the CREATE_SESSION parameters. In the arguments of BACKCHANNEL_CTL, the bca_cb_program field and bca_sec_parms fields correspond respectively to the csa_cb_program and csa_sec_parms fields of the arguments of CREATE_SESSION (Section 18.36).
BACKCHANNEL_CTL MUST appear in a COMPOUND that starts with SEQUENCE.
If the RPCSEC_GSS handle identified by gcbp_handle_from_server does not exist on the server, the server MUST return NFS4ERR_NOENT.
If an RPCSEC_GSS handle is using the SSV context (see Section 2.10.9), then because each SSV RPCSEC_GSS handle shares a common SSV GSS context, there are security considerations specific to this situation discussed in Section 2.10.10.
enum channel_dir_from_client4 {
CDFC4_FORE = 0x1,
CDFC4_BACK = 0x2,
CDFC4_FORE_OR_BOTH = 0x3,
CDFC4_BACK_OR_BOTH = 0x7
};
struct BIND_CONN_TO_SESSION4args {
sessionid4 bctsa_sessid;
channel_dir_from_client4
bctsa_dir;
bool bctsa_use_conn_in_rdma_mode;
};
enum channel_dir_from_server4 {
CDFS4_FORE = 0x1,
CDFS4_BACK = 0x2,
CDFS4_BOTH = 0x3
};
struct BIND_CONN_TO_SESSION4resok {
sessionid4 bctsr_sessid;
channel_dir_from_server4
bctsr_dir;
bool bctsr_use_conn_in_rdma_mode;
};
union BIND_CONN_TO_SESSION4res
switch (nfsstat4 bctsr_status) {
case NFS4_OK:
BIND_CONN_TO_SESSION4resok
bctsr_resok4;
default: void;
};
BIND_CONN_TO_SESSION is used to associate additional connections with a session. It
MUST be used on the connection being associated with the session. It
MUST be the only operation in the COMPOUND procedure. If SP4_NONE (
Section 18.35) state protection is used, any principal, security flavor, or RPCSEC_GSS context
MAY be used to invoke the operation. If SP4_MACH_CRED is used, RPCSEC_GSS
MUST be used with the integrity or privacy services, using the principal that created the client ID. If SP4_SSV is used, RPCSEC_GSS with the SSV GSS mechanism (
Section 2.10.9) and integrity or privacy
MUST be used.
If, when the client ID was created, the client opted for SP4_NONE state protection, the client is not required to use BIND_CONN_TO_SESSION to associate the connection with the session, unless the client wishes to associate the connection with the backchannel. When SP4_NONE protection is used, simply sending a COMPOUND request with a SEQUENCE operation is sufficient to associate the connection with the session specified in SEQUENCE.
The field bctsa_dir indicates whether the client wants to associate the connection with the fore channel or the backchannel or both channels. The value CDFC4_FORE_OR_BOTH indicates that the client wants to associate the connection with both the fore channel and backchannel, but will accept the connection being associated to just the fore channel. The value CDFC4_BACK_OR_BOTH indicates that the client wants to associate with both the fore channel and backchannel, but will accept the connection being associated with just the backchannel. The server replies in bctsr_dir which channel(s) the connection is associated with. If the client specified CDFC4_FORE, the server
MUST return CDFS4_FORE. If the client specified CDFC4_BACK, the server
MUST return CDFS4_BACK. If the client specified CDFC4_FORE_OR_BOTH, the server
MUST return CDFS4_FORE or CDFS4_BOTH. If the client specified CDFC4_BACK_OR_BOTH, the server
MUST return CDFS4_BACK or CDFS4_BOTH.
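The permitted replies for each requested direction can be written as a table. The constants below copy the values from the XDR definitions; the table name ALLOWED_REPLY and the function reply_dir_is_valid are hypothetical.

```python
CDFC4_FORE, CDFC4_BACK = 0x1, 0x2
CDFC4_FORE_OR_BOTH, CDFC4_BACK_OR_BOTH = 0x3, 0x7
CDFS4_FORE, CDFS4_BACK, CDFS4_BOTH = 0x1, 0x2, 0x3

# Requested bctsa_dir -> set of bctsr_dir values the server may return.
ALLOWED_REPLY = {
    CDFC4_FORE: frozenset({CDFS4_FORE}),
    CDFC4_BACK: frozenset({CDFS4_BACK}),
    CDFC4_FORE_OR_BOTH: frozenset({CDFS4_FORE, CDFS4_BOTH}),
    CDFC4_BACK_OR_BOTH: frozenset({CDFS4_BACK, CDFS4_BOTH}),
}


def reply_dir_is_valid(bctsa_dir: int, bctsr_dir: int) -> bool:
    """True if the server's bctsr_dir is permitted for the client's bctsa_dir."""
    return bctsr_dir in ALLOWED_REPLY.get(bctsa_dir, frozenset())
```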
See the CREATE_SESSION operation (
Section 18.36), and the description of the argument csa_use_conn_in_rdma_mode to understand bctsa_use_conn_in_rdma_mode, and the description of csr_use_conn_in_rdma_mode to understand bctsr_use_conn_in_rdma_mode.
Invoking BIND_CONN_TO_SESSION on a connection already associated with the specified session has no effect, and the server
MUST respond with NFS4_OK, unless the client is demanding changes to the set of channels the connection is associated with. If so, the server
MUST return NFS4ERR_INVAL.
If a session's channel loses all connections, depending on the client ID's state protection and type of channel, the client might need to use BIND_CONN_TO_SESSION to associate a new connection. If the server restarted and does not keep the reply cache in stable storage, the server will not recognize the session ID. The client will ultimately have to invoke EXCHANGE_ID to create a new client ID and session.
Suppose SP4_SSV state protection is being used, and BIND_CONN_TO_SESSION is among the operations included in the spo_must_enforce set when the client ID was created (
Section 18.35). If so, there is an issue if SET_SSV is sent, no response is returned, and the last connection associated with the client ID drops. The client, per the sessions model,
MUST retry the SET_SSV. But it needs a new connection to do so, and
MUST associate that connection with the session via a BIND_CONN_TO_SESSION authenticated with the SSV GSS mechanism. The problem is that the RPCSEC_GSS message integrity codes use a subkey derived from the SSV as the key and the SSV may have changed. While there are multiple recovery strategies, a single, general strategy is described here.
-
The client reconnects.
-
The client assumes that the SET_SSV was executed, and so sends BIND_CONN_TO_SESSION with the subkey (derived from the new SSV, i.e., what SET_SSV would have set the SSV to) used as the key for the RPCSEC_GSS credential message integrity codes.
-
If the request succeeds, this means that the original attempted SET_SSV did execute successfully. The client re-sends the original SET_SSV, which the server will reply to via the reply cache.
-
If the server returns an RPC authentication error, this means that the server's current SSV was not changed (and the SET_SSV was likely not executed). The client then tries BIND_CONN_TO_SESSION with the subkey derived from the old SSV as the key for the RPCSEC_GSS message integrity codes.
-
The attempted BIND_CONN_TO_SESSION with the old SSV should succeed. If so, the client re-sends the original SET_SSV. If the original SET_SSV was not executed, then the server executes it. If the original SET_SSV was executed but failed, the server will return the SET_SSV from the reply cache.
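The recovery strategy above can be sketched as follows. Both callables are hypothetical stand-ins: bind_conn(subkey) sends BIND_CONN_TO_SESSION on the new connection using the given subkey and returns False on an RPC authentication error, and resend_set_ssv() re-sends the original SET_SSV. Reconnection itself is assumed to have happened before the function is called.

```python
from typing import Callable


def recover_after_lost_set_ssv(bind_conn: Callable[[bytes], bool],
                               resend_set_ssv: Callable[[], None],
                               old_subkey: bytes,
                               new_subkey: bytes) -> str:
    """Rebind the connection after a SET_SSV whose reply was lost.

    Returns "new" if the server had already applied the SET_SSV,
    or "old" if it had not.
    """
    # First assume the SET_SSV was executed and try the new subkey.
    if bind_conn(new_subkey):
        resend_set_ssv()  # answered from the server's reply cache
        return "new"
    # Authentication failed, so the server's SSV is likely unchanged.
    if bind_conn(old_subkey):
        resend_set_ssv()  # executed now, or answered from the reply cache
        return "old"
    raise RuntimeError("BIND_CONN_TO_SESSION failed with both subkeys")
```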
The EXCHANGE_ID operation exchanges long-hand client and server identifiers (owners) and provides access to a client ID, creating one if necessary. This client ID becomes associated with the connection on which the operation is done, so that it is available when a CREATE_SESSION is done or when the connection is used to issue a request on an existing session associated with the current client.
const EXCHGID4_FLAG_SUPP_MOVED_REFER = 0x00000001;
const EXCHGID4_FLAG_SUPP_MOVED_MIGR = 0x00000002;
const EXCHGID4_FLAG_BIND_PRINC_STATEID = 0x00000100;
const EXCHGID4_FLAG_USE_NON_PNFS = 0x00010000;
const EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000;
const EXCHGID4_FLAG_USE_PNFS_DS = 0x00040000;
const EXCHGID4_FLAG_MASK_PNFS = 0x00070000;
const EXCHGID4_FLAG_UPD_CONFIRMED_REC_A = 0x40000000;
const EXCHGID4_FLAG_CONFIRMED_R = 0x80000000;
struct state_protect_ops4 {
bitmap4 spo_must_enforce;
bitmap4 spo_must_allow;
};
struct ssv_sp_parms4 {
state_protect_ops4 ssp_ops;
sec_oid4 ssp_hash_algs<>;
sec_oid4 ssp_encr_algs<>;
uint32_t ssp_window;
uint32_t ssp_num_gss_handles;
};
enum state_protect_how4 {
SP4_NONE = 0,
SP4_MACH_CRED = 1,
SP4_SSV = 2
};
union state_protect4_a switch(state_protect_how4 spa_how) {
case SP4_NONE:
void;
case SP4_MACH_CRED:
state_protect_ops4 spa_mach_ops;
case SP4_SSV:
ssv_sp_parms4 spa_ssv_parms;
};
struct EXCHANGE_ID4args {
client_owner4 eia_clientowner;
uint32_t eia_flags;
state_protect4_a eia_state_protect;
nfs_impl_id4 eia_client_impl_id<1>;
};
struct ssv_prot_info4 {
state_protect_ops4 spi_ops;
uint32_t spi_hash_alg;
uint32_t spi_encr_alg;
uint32_t spi_ssv_len;
uint32_t spi_window;
gsshandle4_t spi_handles<>;
};
union state_protect4_r switch(state_protect_how4 spr_how) {
case SP4_NONE:
void;
case SP4_MACH_CRED:
state_protect_ops4 spr_mach_ops;
case SP4_SSV:
ssv_prot_info4 spr_ssv_info;
};
struct EXCHANGE_ID4resok {
clientid4 eir_clientid;
sequenceid4 eir_sequenceid;
uint32_t eir_flags;
state_protect4_r eir_state_protect;
server_owner4 eir_server_owner;
opaque eir_server_scope<NFS4_OPAQUE_LIMIT>;
nfs_impl_id4 eir_server_impl_id<1>;
};
union EXCHANGE_ID4res switch (nfsstat4 eir_status) {
case NFS4_OK:
EXCHANGE_ID4resok eir_resok4;
default:
void;
};
The client uses the EXCHANGE_ID operation to register a particular instance of that client with the server, as represented by a client_owner4. However, when the client_owner4 has already been registered by other means (e.g., Transparent State Migration), the client may still use EXCHANGE_ID to obtain the client ID assigned previously.
The client ID returned from this operation will be associated with the connection on which the EXCHANGE_ID is received and will serve as a parent object for sessions created by the client on this connection or to which the connection is bound. As a result of using those sessions to make requests involving the creation of state, that state will become associated with the client ID returned.
In situations in which the registration of the client_owner has not occurred previously, the client ID must first be used, along with the returned eir_sequenceid, in creating an associated session using CREATE_SESSION.
If the flag EXCHGID4_FLAG_CONFIRMED_R is set in the result, eir_flags, then it is an indication that the registration of the client_owner has already occurred and that a further CREATE_SESSION is not needed to confirm it. Of course, subsequent CREATE_SESSION operations may be needed for other reasons.
The value eir_sequenceid is used to establish an initial sequence value associated with the client ID returned. In cases in which a CREATE_SESSION has already been done, sequencing of such requests has already been established, so the client has no need for this value and will ignore it.
EXCHANGE_ID
MAY be sent in a COMPOUND procedure that starts with SEQUENCE. However, when a client communicates with a server for the first time, it will not have a session, so using SEQUENCE will not be possible. If EXCHANGE_ID is sent without a preceding SEQUENCE, then it
MUST be the only operation in the COMPOUND procedure's request. If it is not, the server
MUST return NFS4ERR_NOT_ONLY_OP.
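The placement rule for EXCHANGE_ID within a COMPOUND can be expressed as a short check over the list of operation names. This is a sketch of the rule as stated above, not of any server's dispatch code; the function name exchange_id_placement_error is hypothetical.

```python
from typing import Optional, Sequence


def exchange_id_placement_error(ops: Sequence[str]) -> Optional[str]:
    """Return the error for a misplaced EXCHANGE_ID, or None if placement is fine."""
    if "EXCHANGE_ID" not in ops:
        return None  # rule does not apply
    if ops[0] == "SEQUENCE":
        return None  # allowed inside a COMPOUND that starts with SEQUENCE
    if len(ops) == 1:
        return None  # allowed as the only operation
    return "NFS4ERR_NOT_ONLY_OP"
```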
The eia_clientowner field is composed of a co_verifier field and a co_ownerid string. As noted in Section 2.4, the co_ownerid identifies the client, and the co_verifier specifies a particular incarnation of that client. An EXCHANGE_ID sent with a new incarnation of the client will lead to the server removing lock state of the old incarnation. On the other hand, when an EXCHANGE_ID sent with the current incarnation and co_ownerid does not result in an unrelated error, it will potentially update an existing client ID's properties or simply return information about the existing client ID. The latter would happen when this operation is done to the same server using different network addresses as part of creating trunked connections.
A server MUST NOT provide the same client ID to two different incarnations of an eia_clientowner.
In addition to the client ID and sequence ID, the server returns a server owner (eir_server_owner) and server scope (eir_server_scope). The former field is used in connection with network trunking as described in
Section 2.10.5. The latter field is used to allow clients to determine when client IDs sent by one server may be recognized by another in the event of file system migration (see
Section 11.11.9 of the current document).
The client ID returned by EXCHANGE_ID is only unique relative to the combination of eir_server_owner.so_major_id and eir_server_scope. Thus, if two servers return the same client ID, the onus is on the client to distinguish the client IDs on the basis of eir_server_owner.so_major_id and eir_server_scope. In the event two different servers claim matching eir_server_owner.so_major_id and eir_server_scope, the client can use the verification techniques discussed in Section 2.10.5.1 to determine if the servers are distinct. If they are distinct, then the client will need to note the destination network addresses of the connections used with each server and use the network address as the final discriminator.
The server, as defined by the unique identity expressed in the so_major_id of the server owner and the server scope, needs to track several properties of each client ID it hands out. The properties apply to the client ID and all sessions associated with the client ID. The properties are derived from the arguments and results of EXCHANGE_ID. The client ID properties include:
-
The capabilities expressed by the following bits, which come from the results of EXCHANGE_ID:
-
EXCHGID4_FLAG_SUPP_MOVED_REFER
-
EXCHGID4_FLAG_SUPP_MOVED_MIGR
-
EXCHGID4_FLAG_BIND_PRINC_STATEID
-
EXCHGID4_FLAG_USE_NON_PNFS
-
EXCHGID4_FLAG_USE_PNFS_MDS
-
EXCHGID4_FLAG_USE_PNFS_DS
These properties may be updated by subsequent EXCHANGE_ID operations on confirmed client IDs though the server MAY refuse to change them.
-
The state protection method used, one of SP4_NONE, SP4_MACH_CRED, or SP4_SSV, as set by the spa_how field of the arguments to EXCHANGE_ID. Once the client ID is confirmed, this property cannot be updated by subsequent EXCHANGE_ID operations.
-
For SP4_MACH_CRED or SP4_SSV state protection:
-
The list of operations (spo_must_enforce) that MUST use the specified state protection. This list comes from the results of EXCHANGE_ID.
-
The list of operations (spo_must_allow) that MAY use the specified state protection. This list comes from the results of EXCHANGE_ID.
Once the client ID is confirmed, these properties cannot be updated by subsequent EXCHANGE_ID requests.
-
For SP4_SSV protection:
-
The OID of the hash algorithm. This property is represented by one of the algorithms in the ssp_hash_algs field of the EXCHANGE_ID arguments. Once the client ID is confirmed, this property cannot be updated by subsequent EXCHANGE_ID requests.
-
The OID of the encryption algorithm. This property is represented by one of the algorithms in the ssp_encr_algs field of the EXCHANGE_ID arguments. Once the client ID is confirmed, this property cannot be updated by subsequent EXCHANGE_ID requests.
-
The length of the SSV. This property is represented by the spi_ssv_len field in the EXCHANGE_ID results. Once the client ID is confirmed, this property cannot be updated by subsequent EXCHANGE_ID operations.
There are REQUIRED and RECOMMENDED relationships among the length of the key of the encryption algorithm ("key length"), the length of the output of hash algorithm ("hash length"), and the length of the SSV ("SSV length").
-
key length MUST be <= hash length. This is because the keys used for the encryption algorithm are actually subkeys derived from the SSV, and the derivation is via the hash algorithm. The selection of an encryption algorithm with a key length that exceeded the length of the output of the hash algorithm would require padding, and thus weaken the use of the encryption algorithm.
-
hash length SHOULD be <= SSV length. This is because the SSV is a key used to derive subkeys via an HMAC, and it is recommended that the key used as input to an HMAC be at least as long as the length of the HMAC's hash algorithm's output (see Section 3 of [52]).
-
key length SHOULD be <= SSV length. This is a transitive result of the above two invariants.
-
key length SHOULD be >= hash length / 2. This is because the subkey derivation is via an HMAC and it is recommended that if the HMAC has to be truncated, it should not be truncated to less than half the hash length (see RFC 2104 [52]).
-
Number of concurrent versions of the SSV the client and server will support (see Section 2.10.9). This property is represented by spi_window in the EXCHANGE_ID results. The property may be updated by subsequent EXCHANGE_ID operations.
-
The client's implementation ID as represented by the eia_client_impl_id field of the arguments. The property may be updated by subsequent EXCHANGE_ID requests.
-
The server's implementation ID as represented by the eir_server_impl_id field of the reply. The property may be updated by replies to subsequent EXCHANGE_ID requests.
The eia_flags passed as part of the arguments and the eir_flags results allow the client and server to inform each other of their capabilities as well as indicate how the client ID will be used. Whether a bit is set or cleared on the arguments' flags does not force the server to set or clear the same bit on the results' side. Bits not defined above cannot be set in the eia_flags field. If they are, the server
MUST reject the operation with NFS4ERR_INVAL.
The EXCHGID4_FLAG_UPD_CONFIRMED_REC_A bit can only be set in eia_flags; it is always off in eir_flags. The EXCHGID4_FLAG_CONFIRMED_R bit can only be set in eir_flags; it is always off in eia_flags. If the server recognizes the co_ownerid and co_verifier as mapping to a confirmed client ID, it sets EXCHGID4_FLAG_CONFIRMED_R in eir_flags. The EXCHGID4_FLAG_CONFIRMED_R flag allows a client to tell if the client ID it is trying to create already exists and is confirmed.
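The flag rules above can be sketched as two small helpers. The numeric flag values are taken from the protocol's XDR description rather than from this section, so treat them as assumptions to be checked against your copy. Rejecting EXCHGID4_FLAG_CONFIRMED_R in eia_flags is also this sketch's choice; the text says only that the bit is always off there. Status codes are represented by name, and the function names are hypothetical.

```python
# Flag values as given in the NFSv4.1 XDR description (assumed here; they are
# not listed in this section).
EXCHGID4_FLAG_SUPP_MOVED_REFER = 0x00000001
EXCHGID4_FLAG_SUPP_MOVED_MIGR = 0x00000002
EXCHGID4_FLAG_BIND_PRINC_STATEID = 0x00000100
EXCHGID4_FLAG_USE_NON_PNFS = 0x00010000
EXCHGID4_FLAG_USE_PNFS_MDS = 0x00020000
EXCHGID4_FLAG_USE_PNFS_DS = 0x00040000
EXCHGID4_FLAG_UPD_CONFIRMED_REC_A = 0x40000000
EXCHGID4_FLAG_CONFIRMED_R = 0x80000000

# Capability bits that may appear in both eia_flags and eir_flags.
CAPABILITY_BITS = (
    EXCHGID4_FLAG_SUPP_MOVED_REFER
    | EXCHGID4_FLAG_SUPP_MOVED_MIGR
    | EXCHGID4_FLAG_BIND_PRINC_STATEID
    | EXCHGID4_FLAG_USE_NON_PNFS
    | EXCHGID4_FLAG_USE_PNFS_MDS
    | EXCHGID4_FLAG_USE_PNFS_DS
)

# Bits a client may set in eia_flags in this sketch.
EIA_ALLOWED = CAPABILITY_BITS | EXCHGID4_FLAG_UPD_CONFIRMED_REC_A


def check_eia_flags(eia_flags: int) -> str:
    """Return a status name for the request flags."""
    if eia_flags & ~EIA_ALLOWED:
        return "NFS4ERR_INVAL"
    return "NFS4_OK"


def make_eir_flags(capabilities: int, record_confirmed: bool) -> int:
    """Build eir_flags; UPD_CONFIRMED_REC_A is never echoed back."""
    flags = capabilities & CAPABILITY_BITS
    if record_confirmed:
        flags |= EXCHGID4_FLAG_CONFIRMED_R
    return flags
```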
If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set in eia_flags, this means that the client is attempting to update properties of an existing confirmed client ID (if the client wants to update properties of an unconfirmed client ID, it
MUST NOT set EXCHGID4_FLAG_UPD_CONFIRMED_REC_A). If so, it is
RECOMMENDED that the client send the update EXCHANGE_ID operation in the same COMPOUND as a SEQUENCE so that the EXCHANGE_ID is executed exactly once. Whether the client can update the properties of the client ID depends on the state protection it selected when the client ID was created, and on the principal and security flavor it used when sending the EXCHANGE_ID operation. The situations described in items 6, 7, 8, or 9 of the second numbered list of Section 18.35.4 below will apply. Note that if the operation succeeds and returns a client ID that is already confirmed, the server
MUST set the EXCHGID4_FLAG_CONFIRMED_R bit in eir_flags.
If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set in eia_flags, this means that the client is trying to establish a new client ID; it is attempting to trunk data communication to the server (see Section 2.10.5); or it is attempting to update properties of an unconfirmed client ID. The situations described in items 1, 2, 3, 4, or 5 of the second numbered list of Section 18.35.4 below will apply. Note that if the operation succeeds and returns a client ID that was previously confirmed, the server
MUST set the EXCHGID4_FLAG_CONFIRMED_R bit in eir_flags.
When the EXCHGID4_FLAG_SUPP_MOVED_REFER flag bit is set, the client indicates that it is capable of dealing with an NFS4ERR_MOVED error as part of a referral sequence. When this bit is not set, it is still legal for the server to perform a referral sequence. However, a server may take into account the fact that the client is incapable of correctly responding to a referral and avoid referrals for that particular client. It may, for instance, act as a proxy for that particular file system, at some cost in performance, although it is not obligated to do so. If the server will potentially perform a referral, it
MUST set EXCHGID4_FLAG_SUPP_MOVED_REFER in eir_flags.
When the EXCHGID4_FLAG_SUPP_MOVED_MIGR is set, the client indicates that it is capable of dealing with an NFS4ERR_MOVED error as part of a file system migration sequence. When this bit is not set, it is still legal for the server to indicate that a file system has moved, when this in fact happens. However, a server may use the fact that the client is incapable of correctly responding to a migration in its scheduling of file systems to migrate so as to avoid migration of file systems being actively used. It may also hide actual migrations from clients unable to deal with them by acting as a proxy for a migrated file system for particular clients, at some cost in performance, although it is not obligated to do so. If the server will potentially perform a migration, it
MUST set EXCHGID4_FLAG_SUPP_MOVED_MIGR in eir_flags.
When EXCHGID4_FLAG_BIND_PRINC_STATEID is set, the client indicates that it wants the server to bind the stateid to the principal. This means that when a principal creates a stateid, it has to be the one to use the stateid. If the server will perform binding, it will return EXCHGID4_FLAG_BIND_PRINC_STATEID. The server
MAY return EXCHGID4_FLAG_BIND_PRINC_STATEID even if the client does not request it. If an update to the client ID changes the value of EXCHGID4_FLAG_BIND_PRINC_STATEID's client ID property, the effect applies only to new stateids. Existing stateids (and all stateids with the same "other" field) that were created with stateid to principal binding in force will continue to have binding in force. Existing stateids (and all stateids with the same "other" field) that were created with stateid to principal binding not in force will continue to have binding not in force.
The EXCHGID4_FLAG_USE_NON_PNFS, EXCHGID4_FLAG_USE_PNFS_MDS, and EXCHGID4_FLAG_USE_PNFS_DS bits are described in
Section 13.1 and convey roles the client ID is to be used for in a pNFS environment. The server
MUST set one of the acceptable combinations of these bits (roles) in eir_flags, as specified in that section. Note that the same client owner/server owner pair can have multiple roles. Multiple roles can be associated with the same client ID or with different client IDs. Thus, if a client sends EXCHANGE_ID from the same client owner to the same server owner multiple times, but specifies different pNFS roles each time, the server might return different client IDs. Given that different pNFS roles might have different client IDs, the client may ask for different properties for each role/client ID.
The spa_how field of the eia_state_protect field specifies how the client wants to protect its client, locking, and session states from unauthorized changes (
Section 2.10.8.3):
-
SP4_NONE. The client does not request the NFSv4.1 server to enforce state protection. The NFSv4.1 server MUST NOT enforce state protection for the returned client ID.
-
SP4_MACH_CRED. If spa_how is SP4_MACH_CRED, then the client MUST send the EXCHANGE_ID operation with RPCSEC_GSS as the security flavor, and with a service of RPC_GSS_SVC_INTEGRITY or RPC_GSS_SVC_PRIVACY. If SP4_MACH_CRED is specified, then the client wants to use an RPCSEC_GSS-based machine credential to protect its state. The server MUST note the principal the EXCHANGE_ID operation was sent with, and the GSS mechanism used. These notes collectively comprise the machine credential.
After the client ID is confirmed, as long as the lease associated with the client ID is unexpired, a subsequent EXCHANGE_ID operation that uses the same eia_clientowner.co_owner as the first EXCHANGE_ID MUST also use the same machine credential as the first EXCHANGE_ID. The server returns the same client ID for the subsequent EXCHANGE_ID as that returned from the first EXCHANGE_ID.
-
SP4_SSV. If spa_how is SP4_SSV, then the client MUST send the EXCHANGE_ID operation with RPCSEC_GSS as the security flavor, and with a service of RPC_GSS_SVC_INTEGRITY or RPC_GSS_SVC_PRIVACY. If SP4_SSV is specified, then the client wants to use the SSV to protect its state. The server records the credential used in the request as the machine credential (as defined above) for the eia_clientowner.co_owner. The CREATE_SESSION operation that confirms the client ID MUST use the same machine credential.
When a client specifies SP4_MACH_CRED or SP4_SSV, it also provides two lists of operations (each expressed as a bitmap). The first list is spo_must_enforce and consists of those operations the client
MUST send (subject to the server confirming the list of operations in the result of EXCHANGE_ID) with the machine credential (if SP4_MACH_CRED protection is specified) or the SSV-based credential (if SP4_SSV protection is used). The client
MUST send the operations with RPCSEC_GSS credentials that specify the RPC_GSS_SVC_INTEGRITY or RPC_GSS_SVC_PRIVACY security service. Typically, the first list of operations includes EXCHANGE_ID, CREATE_SESSION, DELEGPURGE, DESTROY_SESSION, BIND_CONN_TO_SESSION, and DESTROY_CLIENTID. The client
SHOULD NOT specify in this list any operations that require a filehandle because the server's access policies
MAY conflict with the client's choice, and thus the client would then be unable to access a subset of the server's namespace.
Note that if SP4_SSV protection is specified, and the client indicates that CREATE_SESSION must be protected with SP4_SSV, because the SSV cannot exist without a confirmed client ID, the first CREATE_SESSION
MUST instead be sent using the machine credential, and the server
MUST accept the machine credential.
There is a corresponding result, also called spo_must_enforce, of the operations for which the server will require SP4_MACH_CRED or SP4_SSV protection. Normally, the server's result equals the client's argument, but the result
MAY be different. If the client requests one or more operations in the set { EXCHANGE_ID, CREATE_SESSION, DELEGPURGE, DESTROY_SESSION, BIND_CONN_TO_SESSION, DESTROY_CLIENTID }, then the result spo_must_enforce
MUST include the operations the client requested from that set.
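The constraint on the server's spo_must_enforce result can be sketched as a set computation. This is an illustration under stated assumptions: operations are modeled as a set of names rather than the on-the-wire bitmap, and negotiate_must_enforce and its server_choice parameter (whatever the server's own policy would enforce) are hypothetical.

```python
# Operations the text singles out: any of these that the client requests in
# spo_must_enforce must also appear in the server's result.
MANDATORY_IF_REQUESTED = frozenset({
    "EXCHANGE_ID",
    "CREATE_SESSION",
    "DELEGPURGE",
    "DESTROY_SESSION",
    "BIND_CONN_TO_SESSION",
    "DESTROY_CLIENTID",
})


def negotiate_must_enforce(requested, server_choice):
    """Return the result spo_must_enforce as a frozenset of operation names.

    The server may otherwise differ from the client's request, so everything
    outside the mandatory set comes from server_choice alone.
    """
    requested = frozenset(requested)
    return frozenset(server_choice) | (requested & MANDATORY_IF_REQUESTED)
```

For example, if the client requests EXCHANGE_ID and CLOSE while the server's policy would enforce nothing, the result contains EXCHANGE_ID only.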
If spo_must_enforce in the results has BIND_CONN_TO_SESSION set, then connection binding enforcement is enabled, and the client
MUST use the machine (if SP4_MACH_CRED protection is used) or SSV (if SP4_SSV protection is used) credential on calls to BIND_CONN_TO_SESSION.
The second list is spo_must_allow and consists of those operations the client wants to have the option of sending with the machine credential or the SSV-based credential, even if the object the operations are performed on is not owned by the machine or SSV credential.
The corresponding result, also called spo_must_allow, consists of the operations the server will allow the client to use SP4_SSV or SP4_MACH_CRED credentials with. Normally, the server's result equals the client's argument, but the result
MAY be different.
The purpose of spo_must_allow is to allow clients to solve the following conundrum. Suppose the client ID is confirmed with EXCHGID4_FLAG_BIND_PRINC_STATEID, and it calls OPEN with the RPCSEC_GSS credentials of a normal user. Now suppose the user's credentials expire, and cannot be renewed (e.g., a Kerberos ticket granting ticket expires, and the user has logged off and will not be acquiring a new ticket granting ticket). The client will be unable to send CLOSE without the user's credentials, so it has to either leave the state on the server or re-send EXCHANGE_ID with a new verifier to clear all state, unless it included CLOSE in the list of operations in spo_must_allow and the server agreed.
The SP4_SSV protection parameters also have:
-
ssp_hash_algs:
-
This is the set of algorithms the client supports for the purpose of computing the digests needed for the internal SSV GSS mechanism and for the SET_SSV operation. Each algorithm is specified as an object identifier (OID). The REQUIRED algorithms for a server are id-sha1, id-sha224, id-sha256, id-sha384, and id-sha512 [25].
Due to known weaknesses in id-sha1, it is RECOMMENDED that the client specify at least one algorithm within ssp_hash_algs other than id-sha1.
The algorithm the server selects among the set is indicated in spi_hash_alg, a field of spr_ssv_prot_info. The field spi_hash_alg is an index into the array ssp_hash_algs. Because of the known weaknesses in id-sha1, it is RECOMMENDED that it not be selected by the server as long as ssp_hash_algs contains any other supported algorithm.
If the server does not support any of the offered algorithms, it returns NFS4ERR_HASH_ALG_UNSUPP. If ssp_hash_algs is empty, the server MUST return NFS4ERR_INVAL.
-
ssp_encr_algs:
-
This is the set of algorithms the client supports for the purpose of providing privacy protection for the internal SSV GSS mechanism. Each algorithm is specified as an OID. The REQUIRED algorithm for a server is id-aes256-CBC. The RECOMMENDED algorithms are id-aes192-CBC and id-aes128-CBC [26]. The selected algorithm is returned in spi_encr_alg, an index into ssp_encr_algs. If the server does not support any of the offered algorithms, it returns NFS4ERR_ENCR_ALG_UNSUPP. If ssp_encr_algs is empty, the server MUST return NFS4ERR_INVAL. Note that due to previously stated requirements and recommendations on the relationships between key length and hash length, some combinations of RECOMMENDED and REQUIRED encryption algorithm and hash algorithm either SHOULD NOT or MUST NOT be used. Table 21 summarizes the illegal and discouraged combinations.
-
ssp_window:
-
This is the number of SSV versions the client wants the server to maintain (i.e., each successful call to SET_SSV produces a new version of the SSV). If ssp_window is zero, the server MUST return NFS4ERR_INVAL. The server responds with spi_window, which MUST NOT exceed ssp_window and MUST be at least one. Any requests on the backchannel or fore channel that are using a version of the SSV that is outside the window will fail with an ONC RPC authentication error, and the requester will have to retry them with the same slot ID and sequence ID.
-
ssp_num_gss_handles:
-
This is the number of RPCSEC_GSS handles the server should create that are based on the GSS SSV mechanism (see Section 2.10.9). It is not the total number of RPCSEC_GSS handles for the client ID. Indeed, subsequent calls to EXCHANGE_ID will add RPCSEC_GSS handles. The server responds with a list of handles in spi_handles. If the client asks for at least one handle and the server cannot create it, the server MUST return an error. The handles in spi_handles are not available for use until the client ID is confirmed, which could be immediately if EXCHANGE_ID returns EXCHGID4_FLAG_CONFIRMED_R, or upon successful confirmation from CREATE_SESSION.
While a client ID can span all the connections that are connected to a server sharing the same eir_server_owner.so_major_id, the RPCSEC_GSS handles returned in spi_handles can only be used on connections connected to a server that returns the same eir_server_owner.so_major_id and eir_server_owner.so_minor_id on each connection. It is permissible for the client to set ssp_num_gss_handles to zero; the client can create more handles with another EXCHANGE_ID call.
Because each SSV RPCSEC_GSS handle shares a common SSV GSS context, there are security considerations specific to this situation discussed in Section 2.10.10.
The seq_window (see RFC 2203 [4]) of each RPCSEC_GSS handle in spi_handles MUST be the same as the seq_window of the RPCSEC_GSS handle used for the credential of the RPC request of which the EXCHANGE_ID operation was sent as a part.
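The server-side handling of ssp_hash_algs and ssp_window described above might look roughly like the following. This is a sketch under stated assumptions: algorithms are represented by their names rather than by encoded OIDs, status codes are carried by name in a hypothetical Nfs4Error exception, and server_max (the largest window the server is willing to keep) is an invented parameter.

```python
class Nfs4Error(Exception):
    """Carries an NFSv4.1 status name; numeric codes are omitted on purpose."""


def select_hash_alg(ssp_hash_algs, supported):
    """Return spi_hash_alg, an index into the client's ssp_hash_algs list."""
    if not ssp_hash_algs:
        raise Nfs4Error("NFS4ERR_INVAL")
    usable = [i for i, alg in enumerate(ssp_hash_algs) if alg in supported]
    if not usable:
        raise Nfs4Error("NFS4ERR_HASH_ALG_UNSUPP")
    # Follow the recommendation: pick id-sha1 only when nothing else is usable.
    for i in usable:
        if ssp_hash_algs[i] != "id-sha1":
            return i
    return usable[0]


def choose_spi_window(ssp_window, server_max):
    """Return spi_window: at least one, never more than ssp_window."""
    if ssp_window == 0:
        raise Nfs4Error("NFS4ERR_INVAL")
    return max(1, min(ssp_window, server_max))
```

The same selection shape would apply to ssp_encr_algs, with NFS4ERR_ENCR_ALG_UNSUPP as the failure status.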
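+----------------------+---------------------------+-----------------------------+
| Encryption Algorithm | MUST NOT be combined with | SHOULD NOT be combined with |
+----------------------+---------------------------+-----------------------------+
| id-aes128-CBC        |                           | id-sha384, id-sha512        |
| id-aes192-CBC        | id-sha1                   | id-sha512                   |
| id-aes256-CBC        | id-sha1, id-sha224        |                             |
+----------------------+---------------------------+-----------------------------+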
Table 21
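The entries of Table 21 follow from two of the length relationships stated earlier: key length MUST be <= hash length, and key length SHOULD be >= hash length / 2. The sketch below re-derives the table from those rules. The key and digest sizes in bits are the standard sizes of the named AES and SHA variants; they are not stated in this section, and the function names are hypothetical.

```python
# Standard key sizes (bits) for the AES-CBC variants named in the text.
KEY_BITS = {"id-aes128-CBC": 128, "id-aes192-CBC": 192, "id-aes256-CBC": 256}

# Standard digest sizes (bits) for the hash algorithms named in the text.
HASH_BITS = {
    "id-sha1": 160,
    "id-sha224": 224,
    "id-sha256": 256,
    "id-sha384": 384,
    "id-sha512": 512,
}


def classify(encr_alg: str, hash_alg: str) -> str:
    """Classify one combination as 'MUST NOT', 'SHOULD NOT', or 'ok'."""
    key, digest = KEY_BITS[encr_alg], HASH_BITS[hash_alg]
    if key > digest:        # violates: key length MUST be <= hash length
        return "MUST NOT"
    if 2 * key < digest:    # violates: key length SHOULD be >= hash length / 2
        return "SHOULD NOT"
    return "ok"


def table21() -> dict:
    """Map each encryption algorithm to its (MUST NOT, SHOULD NOT) hash lists."""
    return {
        encr: (
            [h for h in HASH_BITS if classify(encr, h) == "MUST NOT"],
            [h for h in HASH_BITS if classify(encr, h) == "SHOULD NOT"],
        )
        for encr in KEY_BITS
    }
```

The SSV length is left out because it is negotiated per client ID rather than fixed by the choice of algorithms.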
The arguments include an array of up to one element in length called eia_client_impl_id. If eia_client_impl_id is present, it contains the information identifying the implementation of the client. Similarly, the results include an array of up to one element in length called eir_server_impl_id that identifies the implementation of the server. Servers
MUST accept a zero-length eia_client_impl_id array, and clients
MUST accept a zero-length eir_server_impl_id array.
A possible use for implementation identifiers would be in diagnostic software that extracts this information in an attempt to identify interoperability problems, performance workload behaviors, or general usage statistics. Since the intent of having access to this information is for planning or general diagnosis only, the client and server
MUST NOT interpret this implementation identity information in a way that affects how the implementation interacts with its peer. The client and server are not allowed to depend on the peer's manifesting a particular allowed behavior based on an implementation identifier but are required to interoperate as specified elsewhere in the protocol specification.
Because it is possible that some implementations might violate the protocol specification and interpret the identity information, implementations
MUST provide facilities to allow the NFSv4 client and server to be configured to set the contents of the nfs_impl_id structures sent to any specified value.
A server's client record is a 5-tuple:
-
co_ownerid:
The client identifier string, from the eia_clientowner structure of the EXCHANGE_ID4args structure.
-
co_verifier:
A client-specific value used to indicate incarnations (where a client restart represents a new incarnation), from the eia_clientowner structure of the EXCHANGE_ID4args structure.
-
principal:
The principal that was defined in the RPC header's credential and/or verifier at the time the client record was established.
-
client ID:
The shorthand client identifier, generated by the server and returned via the eir_clientid field in the EXCHANGE_ID4resok structure.
-
confirmed:
A private field on the server indicating whether or not a client record has been confirmed. A client record is confirmed if there has been a successful CREATE_SESSION operation to confirm it. Otherwise, it is unconfirmed. An unconfirmed record is established by an EXCHANGE_ID call. Any unconfirmed record that is not confirmed within a lease period SHOULD be removed.
The following identifiers represent special values for the fields in the records.
-
ownerid_arg:
-
The value of the eia_clientowner.co_ownerid subfield of the EXCHANGE_ID4args structure of the current request.
-
verifier_arg:
-
The value of the eia_clientowner.co_verifier subfield of the EXCHANGE_ID4args structure of the current request.
-
old_verifier_arg:
-
A value of the eia_clientowner.co_verifier field of a client record received in a previous request; this is distinct from verifier_arg.
-
principal_arg:
-
The value of the RPCSEC_GSS principal for the current request.
-
old_principal_arg:
-
A value of the principal of a client record as defined by the RPC header's credential or verifier of a previous request. This is distinct from principal_arg.
-
clientid_ret:
-
The value of the eir_clientid field the server will return in the EXCHANGE_ID4resok structure for the current request.
-
old_clientid_ret:
-
The value of the eir_clientid field the server returned in the EXCHANGE_ID4resok structure for a previous request. This is distinct from clientid_ret.
-
confirmed:
-
The client ID has been confirmed.
-
unconfirmed:
-
The client ID has not been confirmed.
Since EXCHANGE_ID is a non-idempotent operation, we must consider the possibility that retries occur as a result of a client restart, network partition, malfunctioning router, etc. Retries are identified by the value of the eia_clientowner field of EXCHANGE_ID4args, and the method for dealing with them is outlined in the scenarios below.
The scenarios are described in terms of the client record(s) a server has for a given co_ownerid. Note that if the client ID was created specifying SP4_SSV state protection and EXCHANGE_ID as one of the operations in spo_must_allow, then the server
MUST authorize EXCHANGE_IDs with the SSV principal in addition to the principal that created the client ID.
-
New Owner ID
If the server has no client records with eia_clientowner.co_ownerid matching ownerid_arg, and EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set in the EXCHANGE_ID, then a new shorthand client ID (let us call it clientid_ret) is generated, and the following unconfirmed record is added to the server's state.
{ ownerid_arg, verifier_arg, principal_arg, clientid_ret, unconfirmed }
Subsequently, the server returns clientid_ret.
-
Non-Update on Existing Client ID
If the server has the following confirmed record, and the request does not have EXCHGID4_FLAG_UPD_CONFIRMED_REC_A set, then the request is the result of a retried request due to a faulty router or lost connection, or the client is trying to determine if it can perform trunking.
{ ownerid_arg, verifier_arg, principal_arg, clientid_ret, confirmed }
Since the record has been confirmed, the client must have received the server's reply from the initial EXCHANGE_ID request. Since the server has a confirmed record, and since EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, with the possible exception of eir_server_owner.so_minor_id, the server returns the same result it did when the client ID's properties were last updated (or if never updated, the result when the client ID was created). The confirmed record is unchanged.
-
Client Collision
If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, and if the server has the following confirmed record, then this request is likely the result of a chance collision between the values of the eia_clientowner.co_ownerid subfield of EXCHANGE_ID4args for two different clients.
{ ownerid_arg, *, old_principal_arg, old_clientid_ret, confirmed }
If there is currently no state associated with old_clientid_ret, or if there is state but the lease has expired, then this case is effectively equivalent to the New Owner ID case of Section 18.35.4, Paragraph 7, Item 1. The confirmed record is deleted, the old_clientid_ret and its lock state are deleted, a new shorthand client ID is generated, and the following unconfirmed record is added to the server's state.
{ ownerid_arg, verifier_arg, principal_arg, clientid_ret, unconfirmed }
Subsequently, the server returns clientid_ret.
If old_clientid_ret has an unexpired lease with state, then no state of old_clientid_ret is changed or deleted. The server returns NFS4ERR_CLID_INUSE to indicate that the client should retry with a different value for the eia_clientowner.co_ownerid subfield of EXCHANGE_ID4args. The client record is not changed.
-
Replacement of Unconfirmed Record
If the EXCHGID4_FLAG_UPD_CONFIRMED_REC_A flag is not set, and the server has the following unconfirmed record, then the client is attempting EXCHANGE_ID again on an unconfirmed client ID, perhaps due to a retry, a client restart before client ID confirmation (i.e., before CREATE_SESSION was called), or some other reason.
{ ownerid_arg, *, *, old_clientid_ret, unconfirmed }
It is possible that the properties of old_clientid_ret are different than those specified in the current EXCHANGE_ID. Whether or not the properties are being updated, to eliminate ambiguity, the server deletes the unconfirmed record, generates a new client ID (clientid_ret), and establishes the following unconfirmed record:
{ ownerid_arg, verifier_arg, principal_arg, clientid_ret, unconfirmed }
-
Client Restart
If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is not set, and if the server has the following confirmed client record, then this request is likely from a previously confirmed client that has restarted.
{ ownerid_arg, old_verifier_arg, principal_arg, old_clientid_ret, confirmed }
Since the previous incarnation of the same client will no longer be making requests, once the new client ID is confirmed by CREATE_SESSION, byte-range locks and share reservations should be released immediately rather than forcing the new incarnation to wait for the lease time on the previous incarnation to expire. Furthermore, session state should be removed since if the client had maintained that information across restart, this request would not have been sent. If the server supports neither the CLAIM_DELEGATE_PREV nor CLAIM_DELEG_PREV_FH claim types, associated delegations should be purged as well; otherwise, delegations are retained and recovery proceeds according to Section 10.2.1.
After processing, clientid_ret is returned to the client and this client record is added:
{ ownerid_arg, verifier_arg, principal_arg, clientid_ret, unconfirmed }
The previously described confirmed record continues to exist, and thus the same ownerid_arg exists in both a confirmed and unconfirmed state at the same time. The number of states can collapse to one once the server receives an applicable CREATE_SESSION or EXCHANGE_ID.
-
If the server subsequently receives a successful CREATE_SESSION that confirms clientid_ret, then the server atomically destroys the confirmed record and makes the unconfirmed record confirmed as described in Section 18.36.3.
-
If the server instead subsequently receives an EXCHANGE_ID with the client owner equal to ownerid_arg, one strategy is to simply delete the unconfirmed record, and process the EXCHANGE_ID as described in the entirety of Section 18.35.4.
-
Update
If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has the following confirmed record, then this request is an attempt at an update.
{ ownerid_arg, verifier_arg, principal_arg, clientid_ret, confirmed }
Since the record has been confirmed, the client must have received the server's reply from the initial EXCHANGE_ID request. The server allows the update, and the client record is left intact.
-
Update but No Confirmed Record
If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has no confirmed record corresponding to ownerid_arg, then the server returns NFS4ERR_NOENT and leaves any unconfirmed record intact.
-
Update but Wrong Verifier
If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has the following confirmed record, then this request is an illegal attempt at an update, perhaps because of a retry from a previous client incarnation.
{ ownerid_arg, old_verifier_arg, *, clientid_ret, confirmed }
The server returns NFS4ERR_NOT_SAME and leaves the client record intact.
-
Update but Wrong Principal
If EXCHGID4_FLAG_UPD_CONFIRMED_REC_A is set, and the server has the following confirmed record, then this request is an illegal attempt at an update by an unauthorized principal.
{ ownerid_arg, verifier_arg, old_principal_arg, clientid_ret, confirmed }
The server returns NFS4ERR_PERM and leaves the client record intact.
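The nine scenarios above can be condensed into a single decision procedure. The following is a simplified sketch, not a definitive implementation: it keeps at most one confirmed and one unconfirmed record per co_ownerid, ignores the SSV principal exception and all lock, session, and delegation cleanup, and represents status codes by name. The names Server, ClientRecord, has_live_state (a probe for "unexpired lease with state"), and confirm (standing in for a successful CREATE_SESSION) are hypothetical.

```python
from dataclasses import dataclass
from itertools import count
from typing import Callable, Dict, Optional, Tuple


@dataclass
class ClientRecord:
    """Client record from the text; 'confirmed' is implied by which dict holds it."""
    ownerid: bytes
    verifier: bytes
    principal: str
    clientid: int


class Server:
    def __init__(self, has_live_state: Callable[[int], bool] = lambda cid: False):
        self.confirmed: Dict[bytes, ClientRecord] = {}
        self.unconfirmed: Dict[bytes, ClientRecord] = {}
        self.has_live_state = has_live_state  # hypothetical lease/state probe
        self._ids = count(1)

    def _new_unconfirmed(self, ownerid, verifier, principal) -> Tuple[str, int]:
        rec = ClientRecord(ownerid, verifier, principal, next(self._ids))
        self.unconfirmed[ownerid] = rec  # replaces any older unconfirmed record
        return "NFS4_OK", rec.clientid

    def exchange_id(self, ownerid, verifier, principal,
                    update: bool = False) -> Tuple[str, Optional[int]]:
        conf = self.confirmed.get(ownerid)
        if update:  # EXCHGID4_FLAG_UPD_CONFIRMED_REC_A set: scenarios 6 to 9
            if conf is None:
                return "NFS4ERR_NOENT", None       # 7: no confirmed record
            if conf.verifier != verifier:
                return "NFS4ERR_NOT_SAME", None    # 8: wrong verifier
            if conf.principal != principal:
                return "NFS4ERR_PERM", None        # 9: wrong principal
            return "NFS4_OK", conf.clientid        # 6: update allowed
        if conf is None:
            # 1 (new owner ID) or 4 (replacement of unconfirmed record)
            return self._new_unconfirmed(ownerid, verifier, principal)
        if conf.principal != principal:            # 3: client collision
            if self.has_live_state(conf.clientid):
                return "NFS4ERR_CLID_INUSE", None
            del self.confirmed[ownerid]
            return self._new_unconfirmed(ownerid, verifier, principal)
        if conf.verifier == verifier:
            return "NFS4_OK", conf.clientid        # 2: retry or trunking probe
        # 5: client restart; the confirmed record stays until CREATE_SESSION
        return self._new_unconfirmed(ownerid, verifier, principal)

    def confirm(self, ownerid) -> None:
        """Effect of a successful CREATE_SESSION on the unconfirmed record."""
        self.confirmed[ownerid] = self.unconfirmed.pop(ownerid)
```

The update branch checks the verifier before the principal, so a request with both a stale verifier and a different principal gets NFS4ERR_NOT_SAME, matching the wildcard principal in the Update but Wrong Verifier record.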
struct channel_attrs4 {
count4 ca_headerpadsize;
count4 ca_maxrequestsize;
count4 ca_maxresponsesize;
count4 ca_maxresponsesize_cached;
count4 ca_maxoperations;
count4 ca_maxrequests;
uint32_t ca_rdma_ird<1>;
};
const CREATE_SESSION4_FLAG_PERSIST = 0x00000001;
const CREATE_SESSION4_FLAG_CONN_BACK_CHAN = 0x00000002;
const CREATE_SESSION4_FLAG_CONN_RDMA = 0x00000004;
struct CREATE_SESSION4args {
clientid4 csa_clientid;
sequenceid4 csa_sequence;
uint32_t csa_flags;
channel_attrs4 csa_fore_chan_attrs;
channel_attrs4 csa_back_chan_attrs;
uint32_t csa_cb_program;
callback_sec_parms4 csa_sec_parms<>;
};
struct CREATE_SESSION4resok {
sessionid4 csr_sessionid;
sequenceid4 csr_sequence;
uint32_t csr_flags;
channel_attrs4 csr_fore_chan_attrs;
channel_attrs4 csr_back_chan_attrs;
};
union CREATE_SESSION4res switch (nfsstat4 csr_status) {
case NFS4_OK:
CREATE_SESSION4resok csr_resok4;
default:
void;
};
This operation is used by the client to create new session objects on the server.
CREATE_SESSION can be sent with or without a preceding SEQUENCE operation in the same COMPOUND procedure. If CREATE_SESSION is sent with a preceding SEQUENCE operation, any session created by CREATE_SESSION has no direct relation to the session specified in the SEQUENCE operation, although the two sessions might be associated with the same client ID. If CREATE_SESSION is sent without a preceding SEQUENCE, then it
MUST be the only operation in the COMPOUND procedure's request. If it is not, the server
MUST return NFS4ERR_NOT_ONLY_OP.
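The placement rule above can be expressed as a small check on the list of operations in a COMPOUND request. This is only a sketch: a request is modeled as a list of operation names, "preceding SEQUENCE" is read as SEQUENCE being the first operation of the COMPOUND, and the function name is hypothetical.

```python
def check_create_session_placement(ops):
    """Return a status name for where CREATE_SESSION sits in a COMPOUND.

    ops is the ordered list of operation names in the request.
    """
    if "CREATE_SESSION" not in ops:
        return "NFS4_OK"              # the rule does not apply
    if ops[0] == "SEQUENCE":
        return "NFS4_OK"              # sent with a preceding SEQUENCE
    if len(ops) != 1:
        return "NFS4ERR_NOT_ONLY_OP"  # no SEQUENCE, so it must stand alone
    return "NFS4_OK"
```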
In addition to creating a session, CREATE_SESSION has the following effects:
-
The first session created with a new client ID serves to confirm the creation of that client's state on the server. The server returns the parameter values for the new session.
-
The connection that CREATE_SESSION is sent over is associated with the session's fore channel.
The arguments and results of CREATE_SESSION are described as follows:
-
csa_clientid:
-
This is the client ID with which the new session will be associated. The corresponding result is csr_sessionid, the session ID of the new session.
-
csa_sequence:
-
Each client ID serializes CREATE_SESSION via a per-client ID sequence number (see Section 18.36.4). The corresponding result is csr_sequence, which MUST be equal to csa_sequence.
In the next three arguments, the client offers a value that is to be a property of the session. Except where stated otherwise, it is
RECOMMENDED that the server accept the value. If it is not acceptable, the server
MAY use a different value. Regardless, the server
MUST return the value the session will use (which will be either what the client offered, or what the server is insisting on) to the client.
-
csa_flags:
-
The csa_flags field contains a list of the following flag bits:
-
CREATE_SESSION4_FLAG_PERSIST:
-
If CREATE_SESSION4_FLAG_PERSIST is set, the client wants the server to provide a persistent reply cache. For sessions in which only idempotent operations will be used (e.g., a read-only session), clients SHOULD NOT set CREATE_SESSION4_FLAG_PERSIST. If the server does not or cannot provide a persistent reply cache, the server MUST NOT set CREATE_SESSION4_FLAG_PERSIST in the field csr_flags.
If the server is a pNFS metadata server, for reasons described in Section 12.5.2 it SHOULD support CREATE_SESSION4_FLAG_PERSIST if it supports the layout_hint (Section 5.12.4) attribute.
-
CREATE_SESSION4_FLAG_CONN_BACK_CHAN:
-
If CREATE_SESSION4_FLAG_CONN_BACK_CHAN is set in csa_flags, the client is requesting that the connection over which the CREATE_SESSION operation arrived be associated with the session's backchannel in addition to its fore channel. If the server agrees, it sets CREATE_SESSION4_FLAG_CONN_BACK_CHAN in the result field csr_flags. If CREATE_SESSION4_FLAG_CONN_BACK_CHAN is not set in csa_flags, then CREATE_SESSION4_FLAG_CONN_BACK_CHAN MUST NOT be set in csr_flags.
-
CREATE_SESSION4_FLAG_CONN_RDMA:
-
If CREATE_SESSION4_FLAG_CONN_RDMA is set in csa_flags, and if the connection over which the CREATE_SESSION operation arrived is currently in non-RDMA mode but has the capability to operate in RDMA mode, then the client is requesting that the server "step up" to RDMA mode on the connection. If the server agrees, it sets CREATE_SESSION4_FLAG_CONN_RDMA in the result field csr_flags. If CREATE_SESSION4_FLAG_CONN_RDMA is not set in csa_flags, then CREATE_SESSION4_FLAG_CONN_RDMA MUST NOT be set in csr_flags. Note that once the server agrees to step up, it and the client MUST exchange all future traffic on the connection with RPC RDMA framing and not Record Marking ([32]).
-
csa_fore_chan_attrs, csa_back_chan_attrs:
-
The csa_fore_chan_attrs and csa_back_chan_attrs fields apply to attributes of the fore channel (which conveys requests originating from the client to the server), and the backchannel (the channel that conveys callback requests originating from the server to the client), respectively. The results are in corresponding structures called csr_fore_chan_attrs and csr_back_chan_attrs. The results establish attributes for each channel, and on all subsequent use of each channel of the session. Each structure has the following fields:
-
ca_headerpadsize:
-
The maximum amount of padding the requester is willing to apply to ensure that write payloads are aligned on some boundary at the replier. For each channel, the server
-
will reply in ca_headerpadsize with its preferred value, or zero if padding is not in use, and
-
MAY decrease this value but MUST NOT increase it.
-
ca_maxrequestsize:
-
The maximum size of a COMPOUND or CB_COMPOUND request that will be sent. This size represents the XDR encoded size of the request, including the RPC headers (including security flavor credentials and verifiers) but excludes any RPC transport framing headers. Imagine a request coming over a non-RDMA TCP/IP connection, and that it has a single Record Marking header preceding it. The maximum allowable count encoded in the header will be ca_maxrequestsize. If a requester sends a request that exceeds ca_maxrequestsize, the error NFS4ERR_REQ_TOO_BIG will be returned per the description in Section 2.10.6.4. For each channel, the server MAY decrease this value but MUST NOT increase it.
-
ca_maxresponsesize:
-
The maximum size of a COMPOUND or CB_COMPOUND reply that the requester will accept from the replier including RPC headers (see the ca_maxrequestsize definition). For each channel, the server MAY decrease this value, but MUST NOT increase it. However, if the client selects a value for ca_maxresponsesize such that a replier on a channel could never send a response, the server SHOULD return NFS4ERR_TOOSMALL in the CREATE_SESSION reply. After the session is created, if a requester sends a request for which the size of the reply would exceed this value, the replier will return NFS4ERR_REP_TOO_BIG, per the description in Section 2.10.6.4.
-
ca_maxresponsesize_cached:
-
Like ca_maxresponsesize, but the maximum size of a reply that will be stored in the reply cache (Section 2.10.6.1). For each channel, the server MAY decrease this value, but MUST NOT increase it. If, in the reply to CREATE_SESSION, the value of ca_maxresponsesize_cached of a channel is less than the value of ca_maxresponsesize of the same channel, then this is an indication to the requester that it needs to be selective about which replies it directs the replier to cache; for example, large replies from idempotent operations (e.g., COMPOUND requests with a READ operation) should not be cached. The requester decides which replies to cache via an argument to the SEQUENCE (the sa_cachethis field, see Section 18.46) or CB_SEQUENCE (the csa_cachethis field, see Section 20.9) operations. After the session is created, if a requester sends a request for which the size of the reply would exceed ca_maxresponsesize_cached, the replier will return NFS4ERR_REP_TOO_BIG_TO_CACHE, per the description in Section 2.10.6.4.
-
ca_maxoperations:
-
The maximum number of operations the replier will accept in a COMPOUND or CB_COMPOUND. For the backchannel, the server MUST NOT change the value the client offers. For the fore channel, the server MAY change the requested value. After the session is created, if a requester sends a COMPOUND or CB_COMPOUND with more operations than ca_maxoperations, the replier MUST return NFS4ERR_TOO_MANY_OPS.
-
ca_maxrequests:
-
The maximum number of concurrent COMPOUND or CB_COMPOUND requests the requester will send on the session. Subsequent requests will each be assigned a slot identifier by the requester within the range zero to ca_maxrequests - 1 inclusive. For the backchannel, the server MUST NOT change the value the client offers. For the fore channel, the server MAY change the requested value.
-
ca_rdma_ird:
-
This array has a maximum of one element. If this array has one element, then the element contains the inbound RDMA read queue depth (IRD). For each channel, the server MAY decrease this value, but MUST NOT increase it.
-
csa_cb_program:
-
This is the ONC RPC program number the server MUST use in any callbacks sent through the backchannel to the client. The server MUST specify an ONC RPC program number equal to csa_cb_program and an ONC RPC version number equal to 4 in callbacks sent to the client. If a CB_COMPOUND is sent to the client, the server MUST use a minor version number of 1. There is no corresponding result.
-
csa_sec_parms:
-
The field csa_sec_parms is an array of acceptable security credentials the server can use on the session's backchannel. Three security flavors are supported: AUTH_NONE, AUTH_SYS, and RPCSEC_GSS. If AUTH_NONE is specified for a credential, then the client is authorizing the server to use AUTH_NONE on all callbacks for the session. If AUTH_SYS is specified, then the client is authorizing the server to use AUTH_SYS on all callbacks, using the credential specified in cbsp_sys_cred. If RPCSEC_GSS is specified, then the server is allowed to use the RPCSEC_GSS context specified in cbsp_gss_parms as the RPCSEC_GSS context in the credential of the RPC header of callbacks to the client. There is no corresponding result.
The RPCSEC_GSS context for the backchannel is specified via a pair of values of data type gsshandle4_t. The data type gsshandle4_t represents an RPCSEC_GSS handle, and is precisely the same as the data type of the "handle" field of the rpc_gss_init_res data type defined in "Context Creation Response - Successful Acceptance", Section 5.2.3.1 of [4].
The first RPCSEC_GSS handle, gcbp_handle_from_server, is the fore handle the server returned to the client (either in the handle field of data type rpc_gss_init_res or as one of the elements of the spi_handles field returned in the reply to EXCHANGE_ID) when the RPCSEC_GSS context was created on the server. The second handle, gcbp_handle_from_client, is the back handle to which the client will map the RPCSEC_GSS context. The server can immediately use the value of gcbp_handle_from_client in the RPCSEC_GSS credential in callback RPCs. That is, the value in gcbp_handle_from_client can be used as the value of the field "handle" in data type rpc_gss_cred_t (see "Elements of the RPCSEC_GSS Security Protocol", Section 5 of [4]) in callback RPCs. The server MUST use the RPCSEC_GSS security service specified in gcbp_service, i.e., it MUST set the "service" field of the rpc_gss_cred_t data type in RPCSEC_GSS credential to the value of gcbp_service (see "RPC Request Header", Section 5.3.1 of [4]).
If the RPCSEC_GSS handle identified by gcbp_handle_from_server does not exist on the server, the server will return NFS4ERR_NOENT.
Within each element of csa_sec_parms, the fore and back RPCSEC_GSS contexts MUST share the same GSS context and MUST have the same seq_window (see Section 5.2.3.1 of RFC 2203 [4]). The fore and back RPCSEC_GSS context state are independent of each other as far as the RPCSEC_GSS sequence number is concerned (see the seq_num field in the rpc_gss_cred_t data type of Sections 5 and 5.3.1 of [4]).
If an RPCSEC_GSS handle is using the SSV context (see Section 2.10.9), then because each SSV RPCSEC_GSS handle shares a common SSV GSS context, there are security considerations specific to this situation discussed in Section 2.10.10.
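The negotiation rules for csa_flags and the channel attributes described above can be illustrated with a minimal sketch. It is not a definitive implementation: the ChannelAttrs class, the function names, and the idea of a fixed table of server limits are assumptions made for illustration, and ca_headerpadsize and ca_rdma_ird are omitted. The flag values are those of the protocol's XDR definition.

```python
from dataclasses import dataclass

# Flag bits carried in csa_flags and csr_flags.
CREATE_SESSION4_FLAG_PERSIST = 0x00000001
CREATE_SESSION4_FLAG_CONN_BACK_CHAN = 0x00000002
CREATE_SESSION4_FLAG_CONN_RDMA = 0x00000004


@dataclass(frozen=True)
class ChannelAttrs:
    """Hypothetical holder for a subset of the per-channel attributes."""
    ca_maxrequestsize: int
    ca_maxresponsesize: int
    ca_maxresponsesize_cached: int
    ca_maxoperations: int
    ca_maxrequests: int


def negotiate_flags(csa_flags, server_accepts):
    """Return csr_flags: a bit is set only if the client requested it
    and the (assumed) server policy mask accepts it."""
    return csa_flags & server_accepts


def _clamp_sizes(offered, limits):
    # Size attributes may be decreased by the server but never increased.
    return (
        min(offered.ca_maxrequestsize, limits.ca_maxrequestsize),
        min(offered.ca_maxresponsesize, limits.ca_maxresponsesize),
        min(offered.ca_maxresponsesize_cached, limits.ca_maxresponsesize_cached),
    )


def negotiate_fore_channel(offered, limits):
    """Fore channel: the server MAY change ca_maxoperations and
    ca_maxrequests; this sketch chooses to lower them to its limits."""
    req, resp, cached = _clamp_sizes(offered, limits)
    return ChannelAttrs(
        req, resp, cached,
        min(offered.ca_maxoperations, limits.ca_maxoperations),
        min(offered.ca_maxrequests, limits.ca_maxrequests),
    )


def negotiate_back_channel(offered, limits):
    """Backchannel: ca_maxoperations and ca_maxrequests are returned
    exactly as the client offered them."""
    req, resp, cached = _clamp_sizes(offered, limits)
    return ChannelAttrs(
        req, resp, cached,
        offered.ca_maxoperations, offered.ca_maxrequests,
    )
```

Because csr_flags is computed as a bitwise AND with csa_flags, a flag the client did not request can never appear in the result, which matches the MUST NOT rules stated for CREATE_SESSION4_FLAG_CONN_BACK_CHAN and CREATE_SESSION4_FLAG_CONN_RDMA.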
Once the session is created, the first SEQUENCE or CB_SEQUENCE received on a slot
MUST have a sequence ID equal to 1; if not, the replier
MUST return NFS4ERR_SEQ_MISORDERED.
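One way a replier might enforce this rule is to initialize every slot with a sequence ID of zero and a cached NFS4ERR_SEQ_MISORDERED reply, so that a first request carrying sequence ID zero looks like a retry of that contrived entry. The sketch below assumes this approach; the Slot class and function names are hypothetical, status codes are shown as strings rather than protocol values, and slot range checking is omitted.

```python
from dataclasses import dataclass

NFS4_OK = "NFS4_OK"
NFS4ERR_SEQ_MISORDERED = "NFS4ERR_SEQ_MISORDERED"
SEQ_MOD = 2 ** 32  # sequence IDs are 32-bit and wrap around


@dataclass
class Slot:
    seqid: int = 0
    cached_reply: str = NFS4ERR_SEQ_MISORDERED  # contrived initial entry


def new_slot_table(ca_maxrequests):
    """Create one slot per concurrent request the session allows."""
    return [Slot() for _ in range(ca_maxrequests)]


def handle_sequence(slots, slot_id, seqid):
    """Process the sequence ID of a SEQUENCE or CB_SEQUENCE on one slot."""
    slot = slots[slot_id]
    if seqid == slot.seqid:
        # Retry of the last request, or a match on the contrived entry.
        return slot.cached_reply
    if seqid == (slot.seqid + 1) % SEQ_MOD:
        # Next request in order: advance the slot and cache the reply.
        slot.seqid = seqid
        slot.cached_reply = NFS4_OK  # stand-in for the real COMPOUND reply
        return NFS4_OK
    return NFS4ERR_SEQ_MISORDERED
```

With this initialization no special case is needed for a new session: only sequence ID 1 is accepted as the first new request on a slot.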
To describe a possible implementation, the same notation for client records introduced in the description of EXCHANGE_ID is used with the following addition:
-
clientid_arg: The value of the csa_clientid field of the CREATE_SESSION4args structure of the current request.
Since CREATE_SESSION is a non-idempotent operation, we need to consider the possibility that retries may occur as a result of a client restart, network partition, malfunctioning router, etc. For each client ID created by EXCHANGE_ID, the server maintains a separate reply cache (called the CREATE_SESSION reply cache) similar to the session reply cache used for SEQUENCE operations, with two distinctions.
-
First, this is a reply cache just for detecting and processing CREATE_SESSION requests for a given client ID.
-
Second, the size of the client ID reply cache is one slot (and as a result, the CREATE_SESSION request does not carry a slot number). This means that at most one CREATE_SESSION request for a given client ID can be outstanding.
As previously stated, CREATE_SESSION can be sent with or without a preceding SEQUENCE operation. Even if a SEQUENCE precedes CREATE_SESSION, the server
MUST maintain the CREATE_SESSION reply cache, which is separate from the reply cache for the session associated with a SEQUENCE. If CREATE_SESSION was originally sent by itself, the client
MAY send a retry of the CREATE_SESSION operation within a COMPOUND preceded by a SEQUENCE. If CREATE_SESSION was originally sent in a COMPOUND that started with a SEQUENCE, then the client
SHOULD send a retry in a COMPOUND that starts with a SEQUENCE that has the same session ID as the SEQUENCE of the original request. However, the client
MAY send a retry in a COMPOUND that either has no preceding SEQUENCE, or has a preceding SEQUENCE that refers to a different session than the original CREATE_SESSION. This might be necessary if the client sends a CREATE_SESSION in a COMPOUND preceded by a SEQUENCE with session ID X, and session X no longer exists. Regardless, any retry of CREATE_SESSION, with or without a preceding SEQUENCE,
MUST use the same value of csa_sequence as the original.
After the client receives a reply to an EXCHANGE_ID operation that contains a new, unconfirmed client ID, the server expects the client to follow with a CREATE_SESSION operation to confirm the client ID. The server expects the value of csa_sequence in the arguments to that CREATE_SESSION to equal the value of the field eir_sequenceid that was returned in the results of the EXCHANGE_ID that returned the unconfirmed client ID. Before the server replies to that EXCHANGE_ID operation, it initializes the client ID slot to be equal to eir_sequenceid - 1 (accounting for underflow), and records a contrived CREATE_SESSION result with a "cached" result of NFS4ERR_SEQ_MISORDERED. With the client ID slot thus initialized, the processing of the CREATE_SESSION operation is divided into four phases:
-
Client record look up. The server looks up the client ID in its client record table. If the server contains no records with client ID equal to clientid_arg, then most likely the client's state has been purged during a period of inactivity, possibly due to a loss of connectivity. NFS4ERR_STALE_CLIENTID is returned, and no changes are made to any client records on the server. Otherwise, the server goes to phase 2.
-
Sequence ID processing. If csa_sequence is equal to the sequence ID in the client ID's slot, then this is a replay of the previous CREATE_SESSION request, and the server returns the cached result. If csa_sequence is not equal to the sequence ID in the slot, and is more than one greater (accounting for wraparound), then the server returns the error NFS4ERR_SEQ_MISORDERED, and does not change the slot. If csa_sequence is equal to the slot's sequence ID + 1 (accounting for wraparound), then the slot's sequence ID is set to csa_sequence, and the CREATE_SESSION processing goes to the next phase. A subsequent new CREATE_SESSION call over the same client ID MUST use a csa_sequence that is one greater than the sequence ID in the slot.
-
Client ID confirmation. If this would be the first session for the client ID, the CREATE_SESSION operation serves to confirm the client ID. Otherwise, the client ID confirmation phase is skipped and only the session creation phase occurs. Any case in which there is more than one record with identical values for client ID represents a server implementation error. Operation in the potential valid cases is summarized as follows.
-
Successful Confirmation
-
If the server has the following unconfirmed record, then this is the expected confirmation of an unconfirmed record.
-
{ ownerid, verifier, principal_arg, clientid_arg, unconfirmed }
-
As noted in Section 18.35.4, the server might also have the following confirmed record.
-
{ ownerid, old_verifier, principal_arg, old_clientid, confirmed }
-
The server schedules the replacement of both records with:
-
{ ownerid, verifier, principal_arg, clientid_arg, confirmed }
-
The processing of CREATE_SESSION continues on to session creation. Once the session is successfully created, the scheduled client record replacement is committed. If the session is not successfully created, then no changes are made to any client records on the server.
-
Unsuccessful Confirmation
-
If the server has the following record, then the client has changed principals after the previous EXCHANGE_ID request, or there has been a chance collision between shorthand client identifiers.
-
{ *, *, old_principal_arg, clientid_arg, * }
-
Neither of these cases is permissible. Processing stops and NFS4ERR_CLID_INUSE is returned to the client. No changes are made to any client records on the server.
-
Session creation. The server confirmed the client ID, either in this CREATE_SESSION operation, or a previous CREATE_SESSION operation. The server examines the remaining fields of the arguments.
The server creates the session by recording the parameter values used (including whether the CREATE_SESSION4_FLAG_PERSIST flag is set and has been accepted by the server) and allocating space for the session reply cache (if there is not enough space, the server returns NFS4ERR_NOSPC). For each slot in the reply cache, the server sets the sequence ID to zero, and records an entry containing a COMPOUND reply with zero operations and the error NFS4ERR_SEQ_MISORDERED. This way, if the first SEQUENCE request sent has a sequence ID equal to zero, the server can simply return what is in the reply cache: NFS4ERR_SEQ_MISORDERED. The client initializes its reply cache for receiving callbacks in the same way, and similarly, the first CB_SEQUENCE operation on a slot after session creation MUST have a sequence ID of one.
If the session state is created successfully, the server associates the session with the client ID provided by the client.
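The client record look up and sequence ID processing phases described above can be sketched as follows. This is a simplified illustration under stated assumptions: the ClientIdSlot class, the records dictionary, and the do_create callback (which stands in for the client ID confirmation and session creation phases) are hypothetical names, and status codes are shown as strings.

```python
NFS4ERR_SEQ_MISORDERED = "NFS4ERR_SEQ_MISORDERED"
NFS4ERR_STALE_CLIENTID = "NFS4ERR_STALE_CLIENTID"
SEQ_MOD = 2 ** 32  # sequence IDs are 32-bit and wrap around


class ClientIdSlot:
    """Single-slot CREATE_SESSION reply cache kept per client ID."""

    def __init__(self, eir_sequenceid):
        # Initialized before the EXCHANGE_ID reply is sent:
        # eir_sequenceid - 1, accounting for underflow.
        self.seqid = (eir_sequenceid - 1) % SEQ_MOD
        self.cached = NFS4ERR_SEQ_MISORDERED  # contrived cached result


def create_session(records, clientid_arg, csa_sequence, do_create):
    """Run phases 1 and 2, then delegate phases 3 and 4 to do_create."""
    slot = records.get(clientid_arg)
    if slot is None:
        # Phase 1: no record for this client ID.
        return NFS4ERR_STALE_CLIENTID
    if csa_sequence == slot.seqid:
        # Phase 2: replay of the previous request, return cached result.
        return slot.cached
    if csa_sequence != (slot.seqid + 1) % SEQ_MOD:
        # Phase 2: out of order, the slot is left unchanged.
        return NFS4ERR_SEQ_MISORDERED
    # Phase 2: new request, advance the slot and cache the outcome.
    slot.seqid = csa_sequence
    slot.cached = do_create()
    return slot.cached
```

A retry that reuses the original csa_sequence receives the cached result and does not invoke do_create a second time, which is what makes the non-idempotent CREATE_SESSION safe to retry.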
When a request that had CREATE_SESSION4_FLAG_CONN_RDMA set needs to be retried, the retry MUST be done on a new connection that is in non-RDMA mode. If properties of the new connection are different enough that the arguments to CREATE_SESSION need to change, then a non-retry MUST be sent. The server will eventually dispose of any session that was created on the original connection.
On the backchannel, the client and server might wish to have many slots, in some cases perhaps more than the fore channel, in order to deal with situations where the network link has high latency and is the primary bottleneck for response to recalls. If so, and if the client provides too few slots to the backchannel, the server might limit the number of recallable objects it gives to the client.
Implementing RPCSEC_GSS callback support requires changes to both the client and server implementations of RPCSEC_GSS. One possible set of changes includes:
-
Adding a data structure that wraps the GSS-API context with a reference count.
-
Adding new functions to increment and decrement the reference count. If the reference count is decremented to zero, the wrapper data structure and the GSS-API context it refers to would be freed.
-
Changing RPCSEC_GSS to create the wrapper data structure upon receiving a GSS-API context from gss_accept_sec_context() and gss_init_sec_context(). The reference count would be initialized to 1.
-
Adding a function to map an existing RPCSEC_GSS handle to a pointer to the wrapper data structure. The reference count would be incremented.
-
Adding a function to create a new RPCSEC_GSS handle from a pointer to the wrapper data structure. The reference count would be incremented.
-
Replacing calls from RPCSEC_GSS that free GSS-API contexts, with calls to decrement the reference count on the wrapper data structure.
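The reference-counting scheme outlined in the list above might look roughly like the following sketch. It models only the bookkeeping: GssContextWrapper, HandleTable, and their methods are hypothetical names, the GSS-API context is represented by an opaque placeholder object, and no real GSS-API calls are made.

```python
import itertools


class GssContextWrapper:
    """Wraps a GSS-API context with a reference count."""

    def __init__(self, gss_context):
        self.gss_context = gss_context
        self.refcount = 1  # reference held by the creator
        self.freed = False

    def get(self):
        """Increment the reference count and return the wrapper."""
        self.refcount += 1
        return self

    def put(self):
        """Decrement the reference count, freeing the context at zero."""
        self.refcount -= 1
        if self.refcount == 0:
            self.gss_context = None  # stand-in for freeing the context
            self.freed = True


class HandleTable:
    """Maps RPCSEC_GSS handles to wrappers."""

    def __init__(self):
        self._handles = {}
        self._next = itertools.count(1)

    def new_handle(self, wrapper):
        """Create a new handle for a wrapper, taking a reference."""
        handle = next(self._next).to_bytes(4, "big")
        self._handles[handle] = wrapper.get()
        return handle

    def lookup(self, handle):
        """Map an existing handle to its wrapper, taking a reference."""
        return self._handles[handle].get()

    def destroy_handle(self, handle):
        """Drop a handle; this replaces a direct free of the context."""
        self._handles.pop(handle).put()
```

In this model, a back handle can be created by looking up the fore handle, calling new_handle on the returned wrapper, and then releasing the temporary reference with put. The shared context is freed only after both handles have been destroyed.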
struct DESTROY_SESSION4args {
sessionid4 dsa_sessionid;
};
struct DESTROY_SESSION4res {
nfsstat4 dsr_status;
};
The DESTROY_SESSION operation closes the session and discards the session's reply cache, if any. Any remaining connections associated with the session are immediately disassociated. If the connection has no remaining associated sessions, the connection
MAY be closed by the server. Locks, delegations, layouts, wants, and the lease, which are all tied to the client ID, are not affected by DESTROY_SESSION.
DESTROY_SESSION MUST be invoked on a connection that is associated with the session being destroyed. In addition, if SP4_MACH_CRED state protection was specified when the client ID was created, the RPCSEC_GSS principal that created the session MUST be the one that destroys the session, using RPCSEC_GSS privacy or integrity. If SP4_SSV state protection was specified when the client ID was created, RPCSEC_GSS using the SSV mechanism (Section 2.10.9) MUST be used, with integrity or privacy.
If the COMPOUND request starts with SEQUENCE, and if the sessionids specified in SEQUENCE and DESTROY_SESSION are the same, then
-
DESTROY_SESSION MUST be the final operation in the COMPOUND request.
-
It is advisable to avoid placing DESTROY_SESSION in a COMPOUND request with other state-modifying operations, because the DESTROY_SESSION will destroy the reply cache.
-
Because the session and its reply cache are destroyed, a client that retries the request may receive an error in reply to the retry, even though the original request was successful.
If the COMPOUND request starts with SEQUENCE, and if the sessionids specified in SEQUENCE and DESTROY_SESSION are different, then DESTROY_SESSION can appear in any position of the COMPOUND request (except for the first position). The two sessionids can belong to different client IDs.
If the COMPOUND request does not start with SEQUENCE, and if DESTROY_SESSION is not the sole operation, then the server
MUST return NFS4ERR_NOT_ONLY_OP.
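The placement rules for DESTROY_SESSION within a COMPOUND can be summarized in a small check. The sketch below is illustrative only: the representation of a COMPOUND as a list of (operation name, session ID) pairs is an assumption, and PLACEMENT_VIOLATION is a hypothetical label, since the text above does not name the error for a DESTROY_SESSION that is not the final operation.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_NOT_ONLY_OP = "NFS4ERR_NOT_ONLY_OP"
PLACEMENT_VIOLATION = "PLACEMENT_VIOLATION"  # hypothetical label, not a protocol error


def check_destroy_session_placement(ops):
    """ops is a list of (opname, sessionid) pairs; sessionid is None
    for operations that do not carry a session ID."""
    positions = [i for i, (name, _) in enumerate(ops)
                 if name == "DESTROY_SESSION"]
    if not positions:
        return NFS4_OK
    if ops[0][0] != "SEQUENCE":
        # Without a leading SEQUENCE, DESTROY_SESSION must be the sole op.
        return NFS4_OK if len(ops) == 1 else NFS4ERR_NOT_ONLY_OP
    seq_session = ops[0][1]
    for i in positions:
        # Destroying the session named in SEQUENCE is only allowed
        # as the final operation of the COMPOUND.
        if ops[i][1] == seq_session and i != len(ops) - 1:
            return PLACEMENT_VIOLATION
    # A different session may be destroyed at any non-first position.
    return NFS4_OK
```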
If there is a backchannel on the session and the server has outstanding CB_COMPOUND operations for the session which have not been replied to, then the server
MAY refuse to destroy the session and return an error. If so, then in the event the backchannel is down, the server
SHOULD return NFS4ERR_CB_PATH_DOWN to inform the client that the backchannel needs to be repaired before the server will allow the session to be destroyed. Otherwise, the error NFS4ERR_BACK_CHAN_BUSY
SHOULD be returned to indicate that there are CB_COMPOUNDs that need to be replied to. The client
SHOULD reply to all outstanding CB_COMPOUNDs before re-sending DESTROY_SESSION.
struct FREE_STATEID4args {
stateid4 fsa_stateid;
};
struct FREE_STATEID4res {
nfsstat4 fsr_status;
};
The FREE_STATEID operation is used to free a stateid that no longer has any associated locks (including opens, byte-range locks, delegations, and layouts). This may be because of client LOCKU operations or because of server revocation. If there are valid locks (of any kind) associated with the stateid in question, the error NFS4ERR_LOCKS_HELD will be returned, and the associated stateid will not be freed.
When a stateid that had been associated with revoked locks is freed by sending the FREE_STATEID operation, the client acknowledges the loss of those locks. Once all such revoked state is acknowledged, the server can again allow that client to reclaim locks without encountering the edge conditions discussed in Section 8.4.2.
Once a successful FREE_STATEID is done for a given stateid, any subsequent use of that stateid will result in an NFS4ERR_BAD_STATEID error.
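The FREE_STATEID behavior described above can be modeled with a small table that tracks how many valid locks each stateid still has. This is a minimal sketch under simplifying assumptions: StateidTable and its methods are hypothetical, status codes are shown as strings, and the use method models only whether the stateid still exists, not the other errors a server might return for revoked state.

```python
NFS4_OK = "NFS4_OK"
NFS4ERR_LOCKS_HELD = "NFS4ERR_LOCKS_HELD"
NFS4ERR_BAD_STATEID = "NFS4ERR_BAD_STATEID"


class StateidTable:
    def __init__(self):
        # stateid -> number of valid locks still associated with it
        self._valid_locks = {}

    def add(self, stateid, valid_locks):
        self._valid_locks[stateid] = valid_locks

    def drop_locks(self, stateid):
        """Model the client's LOCKU operations or a server revocation."""
        self._valid_locks[stateid] = 0

    def free_stateid(self, stateid):
        if stateid not in self._valid_locks:
            return NFS4ERR_BAD_STATEID
        if self._valid_locks[stateid] > 0:
            # Valid locks remain, so the stateid is not freed.
            return NFS4ERR_LOCKS_HELD
        del self._valid_locks[stateid]
        return NFS4_OK

    def use(self, stateid):
        """Any use of a freed stateid yields NFS4ERR_BAD_STATEID."""
        if stateid in self._valid_locks:
            return NFS4_OK
        return NFS4ERR_BAD_STATEID
```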
typedef nfstime4 attr_notice4;
struct GET_DIR_DELEGATION4args {
/* CURRENT_FH: delegated directory */
bool gdda_signal_deleg_avail;
bitmap4 gdda_notification_types;
attr_notice4 gdda_child_attr_delay;
attr_notice4 gdda_dir_attr_delay;
bitmap4 gdda_child_attributes;
bitmap4 gdda_dir_attributes;
};
struct GET_DIR_DELEGATION4resok {
verifier4 gddr_cookieverf;
/* Stateid for get_dir_delegation */
stateid4 gddr_stateid;
/* Which notifications can the server support */
bitmap4 gddr_notification;
bitmap4 gddr_child_attributes;
bitmap4 gddr_dir_attributes;
};
enum gddrnf4_status {
GDD4_OK = 0,
GDD4_UNAVAIL = 1
};
union GET_DIR_DELEGATION4res_non_fatal
switch (gddrnf4_status gddrnf_status) {
case GDD4_OK:
GET_DIR_DELEGATION4resok gddrnf_resok4;
case GDD4_UNAVAIL:
bool gddrnf_will_signal_deleg_avail;
};
union GET_DIR_DELEGATION4res
switch (nfsstat4 gddr_status) {
case NFS4_OK:
GET_DIR_DELEGATION4res_non_fatal gddr_res_non_fatal4;
default:
void;
};
The GET_DIR_DELEGATION operation is used by a client to request a directory delegation. The directory is represented by the current filehandle. The client also specifies whether it wants the server to notify it when the directory changes in certain ways by setting one or more bits in a bitmap. The server may refuse to grant the delegation. In that case, the server will return NFS4ERR_DIRDELEG_UNAVAIL. If the server decides to hand out the delegation, it will return a cookie verifier for that directory. If the cookie verifier changes when the client is holding the delegation, the delegation will be recalled unless the client has asked for notification for this event.
The server will also return a directory delegation stateid, gddr_stateid, as a result of the GET_DIR_DELEGATION operation. This stateid will appear in callback messages related to the delegation, such as notifications and delegation recalls. The client will use this stateid to return the delegation voluntarily or upon recall. A delegation is returned by calling the DELEGRETURN operation.
The server might not be able to support notifications of certain events. If the client asks for such notifications, the server
MUST inform the client of its inability to do so as part of the GET_DIR_DELEGATION reply by not setting the appropriate bits in the supported notifications bitmask, gddr_notification, contained in the reply. The server
MUST NOT add bits to gddr_notification that the client did not request.
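The rule for gddr_notification amounts to intersecting the requested notification types with those the server can deliver. The sketch below illustrates this with an IntFlag; the Notify class is hypothetical, its member names follow the notification types named in this section, and the bit positions assigned by auto() are illustrative rather than the protocol's actual bit numbers.

```python
from enum import IntFlag, auto


class Notify(IntFlag):
    """Illustrative notification types; bit values are not the wire values."""
    ADD_ENTRY = auto()
    REMOVE_ENTRY = auto()
    RENAME_ENTRY = auto()
    CHANGE_DIR_ATTRIBUTES = auto()
    CHANGE_CHILD_ATTRIBUTES = auto()
    CHANGE_COOKIE_VERIFIER = auto()


def grant_notifications(requested, server_supported):
    """Return the gddr_notification mask: unsupported bits are cleared,
    and no bit the client did not request is ever added."""
    return requested & server_supported
```

For example, a client requesting ADD_ENTRY and RENAME_ENTRY from a server that supports only ADD_ENTRY and REMOVE_ENTRY would be granted ADD_ENTRY alone.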
The GET_DIR_DELEGATION operation can be used for both normal and named attribute directories.
If the client sets gdda_signal_deleg_avail to TRUE, then it is registering with the server a "want" for a directory delegation. If the delegation is not available, and the server supports and will honor the "want", the results will have gddrnf_will_signal_deleg_avail set to TRUE and no error will be indicated on return. If so, the client should expect a future CB_RECALLABLE_OBJ_AVAIL operation to indicate that a directory delegation is available. If the server does not wish to honor the "want" or is not able to do so, it returns the error NFS4ERR_DIRDELEG_UNAVAIL. If the delegation is immediately available, the server
SHOULD return it with the response to the operation, rather than via a callback.
When a client makes a request for a directory delegation while it already holds a directory delegation for that directory (including the case where it has been recalled but not yet returned by the client or revoked by the server), the server
MUST reply with the value of gddr_status set to NFS4_OK, the value of gddrnf_status set to GDD4_UNAVAIL, and the value of gddrnf_will_signal_deleg_avail set to FALSE. The delegation the client held before the request remains intact, and its state is unchanged. The current stateid is not changed (see
Section 16.2.3.1.2 for a description of the current stateid).
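The possible outcomes of GET_DIR_DELEGATION described above can be collected into one decision function. This is a sketch rather than a definitive implementation: the function name and its boolean inputs are assumptions, and the result is shown as a tuple of (gddr_status, gddrnf_status, gddrnf_will_signal_deleg_avail), with None marking fields that are absent from that arm of the result.

```python
def get_dir_delegation(already_held, available,
                       signal_deleg_avail, server_honors_want):
    """Return (gddr_status, gddrnf_status, gddrnf_will_signal_deleg_avail)."""
    if already_held:
        # The client already holds a delegation for this directory;
        # the existing delegation is left unchanged.
        return ("NFS4_OK", "GDD4_UNAVAIL", False)
    if available:
        # Delegation granted in the response rather than via a callback.
        return ("NFS4_OK", "GDD4_OK", None)
    if signal_deleg_avail and server_honors_want:
        # The "want" is registered; a callback will signal availability.
        return ("NFS4_OK", "GDD4_UNAVAIL", True)
    # The server refuses, or cannot or will not honor the "want".
    return ("NFS4ERR_DIRDELEG_UNAVAIL", None, None)
```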
Directory delegations provide the benefit of improving cache consistency of namespace information. This is done through synchronous callbacks. A server must support synchronous callbacks in order to support directory delegations. In addition to that, asynchronous notifications provide a way to reduce network traffic as well as improve client performance in certain conditions.
Notifications are specified in terms of potential changes to the directory. A client can ask to be notified of events by setting one or more bits in gdda_notification_types. The client can ask for notifications on addition of entries to a directory (NOTIFY4_ADD_ENTRY), entry removal (NOTIFY4_REMOVE_ENTRY), renames (NOTIFY4_RENAME_ENTRY), directory attribute changes (NOTIFY4_CHANGE_DIR_ATTRIBUTES), and cookie verifier changes (NOTIFY4_CHANGE_COOKIE_VERIFIER) by setting the corresponding bits in the gdda_notification_types field.
The client can also ask for notifications of changes to attributes of directory entries (NOTIFY4_CHANGE_CHILD_ATTRIBUTES) in order to keep its attribute cache up to date. However, any changes made to child attributes do not cause the delegation to be recalled. If a client is interested in directory entry caching or negative name caching, it can set the gdda_notification_types appropriately to its particular need and the server will notify it of all changes that would otherwise invalidate its name cache. The kind of notification a client asks for may depend on the directory size, its rate of change, and the applications being used to access that directory. The enumeration of the conditions under which a client might ask for a notification is out of the scope of this specification.
For attribute notifications, the client will set bits in the gdda_dir_attributes bitmap to indicate which attributes it wants to be notified of. If the server does not support notifications for changes to a certain attribute, it
SHOULD NOT set that attribute in the supported attribute bitmap specified in the reply (gddr_dir_attributes). The client will also set in the gdda_child_attributes bitmap the attributes of directory entries it wants to be notified of, and the server will indicate in gddr_child_attributes which attributes of directory entries it will notify the client of.
The client will also let the server know if it wants to get the notification as soon as the attribute change occurs or after a certain delay by setting a delay factor; gdda_child_attr_delay is for attribute changes to directory entries and gdda_dir_attr_delay is for attribute changes to the directory. If this delay factor is set to zero, that indicates to the server that the client wants to be notified of any attribute changes as soon as they occur. If the delay factor is set to N seconds, the server will make a best-effort guarantee that attribute updates are synchronized within N seconds. If the client asks for a delay factor that the server does not support or that may cause significant resource consumption on the server by causing the server to send a lot of notifications, the server should not commit to sending out notifications for attributes and therefore must not set the appropriate bit in the gddr_child_attributes and gddr_dir_attributes bitmaps in the response.
The client MUST use a security tuple (Section 2.6.1) that the directory or its applicable ancestor (Section 2.6) is exported with. If not, the server MUST return NFS4ERR_WRONGSEC to the operation that both precedes GET_DIR_DELEGATION and sets the current filehandle (see Section 2.6.3.1).
The directory delegation covers all the entries in the directory except the parent entry. That means if a directory and its parent both hold directory delegations, any changes to the parent will not cause a notification to be sent for the child even though the child's parent entry points to the parent directory.
struct GETDEVICEINFO4args {
deviceid4 gdia_device_id;
layouttype4 gdia_layout_type;
count4 gdia_maxcount;
bitmap4 gdia_notify_types;
};
struct GETDEVICEINFO4resok {
device_addr4 gdir_device_addr;
bitmap4 gdir_notification;
};
union GETDEVICEINFO4res switch (nfsstat4 gdir_status) {
case NFS4_OK:
GETDEVICEINFO4resok gdir_resok4;
case NFS4ERR_TOOSMALL:
count4 gdir_mincount;
default:
void;
};
The GETDEVICEINFO operation returns pNFS storage device address information for the specified device ID. The client identifies the device information to be returned by providing the gdia_device_id and gdia_layout_type that uniquely identify the device. The client provides gdia_maxcount to limit the number of bytes for the result. This maximum size represents all of the data being returned within the GETDEVICEINFO4resok structure and includes the XDR overhead. The server may return less data. If the server is unable to return any information within the gdia_maxcount limit, the error NFS4ERR_TOOSMALL will be returned. However, if gdia_maxcount is zero, NFS4ERR_TOOSMALL
MUST NOT be returned.
The da_layout_type field of the gdir_device_addr returned by the server
MUST be equal to the gdia_layout_type specified by the client. If it is not equal, the client
SHOULD ignore the response as invalid and behave as if the server returned an error, even if the client does have support for the layout type returned.
The client also provides a notification bitmap, gdia_notify_types, indicating the device ID mapping notifications it is interested in receiving; the server must support device ID notifications for the notification request to have effect. The notification mask is composed in the same manner as the bitmap for file attributes (Section 3.3.7). The numbers of bit positions are listed in the notify_device_type4 enumeration type (Section 20.12). Only two enumerated values of notify_device_type4 currently apply to GETDEVICEINFO: NOTIFY_DEVICEID4_CHANGE and NOTIFY_DEVICEID4_DELETE (see Section 20.12).
The notification bitmap applies only to the specified device ID. If a client sends a GETDEVICEINFO operation on a device ID multiple times, the last notification bitmap is used by the server for subsequent notifications. If the bitmap is zero or empty, then the device ID's notifications are turned off.
If the client wants to just update or turn off notifications, it
MAY send a GETDEVICEINFO operation with gdia_maxcount set to zero. In that event, if the device ID is valid, the reply's da_addr_body field of the gdir_device_addr field will be of zero length.
If an unknown device ID is given in gdia_device_id, the server returns NFS4ERR_NOENT. Otherwise, the device address information is returned in gdir_device_addr. Finally, if the server supports notifications for device ID mappings, the gdir_notification result will contain a bitmap of which notifications it will actually send to the client (via CB_NOTIFY_DEVICEID, see
Section 20.12).
If NFS4ERR_TOOSMALL is returned, the results also contain gdir_mincount. The value of gdir_mincount represents the minimum size necessary to obtain the device information.
Aside from updating or turning off notifications, another use case for gdia_maxcount being set to zero is to validate a device ID.
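The interaction between gdia_maxcount, NFS4ERR_TOOSMALL, and NFS4ERR_NOENT can be sketched as a small decision function. This is only an illustration of the rules stated above, not a server implementation; the function name, the dictionary result, and the resok_size parameter (the encoded size of GETDEVICEINFO4resok including XDR overhead) are assumptions made for the example.

```python
def getdeviceinfo_outcome(device_known, resok_size, gdia_maxcount):
    """Return the status a server might choose for one GETDEVICEINFO call."""
    if not device_known:
        # An unknown device ID is rejected even when gdia_maxcount is zero,
        # which is what makes the zero-count probe usable for validation.
        return {"status": "NFS4ERR_NOENT"}
    if gdia_maxcount == 0:
        # NFS4ERR_TOOSMALL MUST NOT be returned; da_addr_body is empty.
        return {"status": "NFS4_OK", "body_empty": True}
    if resok_size > gdia_maxcount:
        # gdir_mincount tells the client the size it needs to ask for.
        return {"status": "NFS4ERR_TOOSMALL", "gdir_mincount": resok_size}
    return {"status": "NFS4_OK", "body_empty": False}
```

A client that receives NFS4ERR_TOOSMALL can retry with gdia_maxcount set to at least the returned gdir_mincount.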
The client
SHOULD request a notification for changes or deletion of a device ID to device address mapping so that the server can allow the client to gracefully use a new mapping, without having pending I/O fail abruptly, or force layouts using the device ID to be recalled or revoked.
It is possible that GETDEVICEINFO (and GETDEVICELIST) will race with CB_NOTIFY_DEVICEID, i.e., CB_NOTIFY_DEVICEID arrives before the client gets and processes the response to GETDEVICEINFO or GETDEVICELIST. The analysis of the race leverages the fact that the server
MUST NOT delete a device ID that is referred to by a layout the client has.
-
CB_NOTIFY_DEVICEID deletes a device ID. If the client believes it has layouts that refer to the device ID, then it is possible that layouts referring to the deleted device ID have been revoked. The client should send a TEST_STATEID request using the stateid for each layout that might have been revoked. If TEST_STATEID indicates that any layouts have been revoked, the client must recover from layout revocation as described in Section 12.5.6. If TEST_STATEID indicates that at least one layout has not been revoked, the client should send a GETDEVICEINFO operation on the supposedly deleted device ID to verify that the device ID has been deleted.
If GETDEVICEINFO indicates that the device ID does not exist, then the client assumes the server is faulty and recovers by sending an EXCHANGE_ID operation. If GETDEVICEINFO indicates that the device ID does exist, then while the server is faulty for sending an erroneous device ID deletion notification, the degree to which it is faulty does not require the client to create a new client ID.
If the client does not have layouts that refer to the device ID, no harm is done. The client should mark the device ID as deleted, and when GETDEVICEINFO or GETDEVICELIST results are received that indicate that the device ID has been in fact deleted, the device ID should be removed from the client's cache.
-
CB_NOTIFY_DEVICEID indicates that a device ID's device addressing mappings have changed. The client should assume that the results from the in-progress GETDEVICEINFO will be stale for the device ID once received, and so it should send another GETDEVICEINFO on the device ID.
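The client-side handling of a device ID deletion notification described above can be summarized in a short sketch. The function and the action strings are hypothetical; is_revoked stands in for a TEST_STATEID exchange and device_exists for a GETDEVICEINFO probe, and both are supplied by the caller.

```python
def handle_deviceid_delete(layout_stateids, is_revoked, device_exists):
    """Return the recovery actions for a CB_NOTIFY_DEVICEID deletion.

    layout_stateids: stateids of layouts the client believes refer to the
                     deleted device ID.
    is_revoked:      callable(stateid) -> bool, modelling TEST_STATEID.
    device_exists:   callable() -> bool, modelling GETDEVICEINFO.
    """
    if not layout_stateids:
        # No layouts refer to the device ID, so no harm is done.
        return ["mark_device_deleted"]

    actions = []
    revoked = [s for s in layout_stateids if is_revoked(s)]
    if revoked:
        actions.append("recover_layout_revocation")
    if len(revoked) < len(layout_stateids):
        # At least one layout survived, so verify the deletion itself.
        if device_exists():
            # The notification was erroneous, but no new client ID is needed.
            actions.append("keep_client_id")
        else:
            # The server deleted a device ID that a layout still uses.
            actions.append("send_EXCHANGE_ID")
    return actions
```

For the change notification, the corresponding action is simply to send another GETDEVICEINFO for the device ID once the in-progress one completes.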
struct GETDEVICELIST4args {
/* CURRENT_FH: object belonging to the file system */
layouttype4 gdla_layout_type;
/* number of deviceIDs to return */
count4 gdla_maxdevices;
nfs_cookie4 gdla_cookie;
verifier4 gdla_cookieverf;
};
struct GETDEVICELIST4resok {
nfs_cookie4 gdlr_cookie;
verifier4 gdlr_cookieverf;
deviceid4 gdlr_deviceid_list<>;
bool gdlr_eof;
};
union GETDEVICELIST4res switch (nfsstat4 gdlr_status) {
case NFS4_OK:
GETDEVICELIST4resok gdlr_resok4;
default:
void;
};
This operation is used by the client to enumerate all of the device IDs that a server's file system uses.
The client provides a current filehandle of a file object that belongs to the file system (i.e., all file objects sharing the same fsid as that of the current filehandle) and the layout type in gdla_layout_type. Since this operation might require multiple calls to enumerate all the device IDs (and is thus similar to the READDIR operation,
Section 18.23), the client also provides gdla_cookie and gdla_cookieverf to specify the current cursor position in the list. When the client wants to read from the beginning of the file system's device mappings, it sets gdla_cookie to zero. The field gdla_cookieverf
MUST be ignored by the server when gdla_cookie is zero. The client provides gdla_maxdevices to limit the number of device IDs in the result. If gdla_maxdevices is zero, the server
MUST return NFS4ERR_INVAL. The server
MAY return fewer device IDs.
The successful response to the operation will contain the cookie, gdlr_cookie, and the cookie verifier, gdlr_cookieverf, to be used on the subsequent GETDEVICELIST. A gdlr_eof value of TRUE signifies that there are no remaining entries in the server's device list. Each element of gdlr_deviceid_list contains a device ID.
An example of the use of this operation is for pNFS clients and servers that use LAYOUT4_BLOCK_VOLUME layouts. In these environments it may be helpful for a client to determine device accessibility upon first file system access.
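The cursor-style enumeration can be sketched as a loop that feeds each reply's cookie and verifier into the next call until gdlr_eof is TRUE. The transport is abstracted away: send_getdevicelist is a hypothetical callable standing in for a COMPOUND containing GETDEVICELIST, and fake_server is a toy stand-in used only to make the sketch runnable. Replies are modelled as dictionaries keyed by the result field names.

```python
def enumerate_device_ids(send_getdevicelist, layout_type, max_devices=64):
    """Collect every device ID by following the GETDEVICELIST cursor."""
    if max_devices == 0:
        raise ValueError("gdla_maxdevices of zero yields NFS4ERR_INVAL")
    # A zero cookie starts from the beginning; the verifier is then ignored.
    cookie, verifier = 0, b"\x00" * 8
    device_ids = []
    while True:
        res = send_getdevicelist(layout_type, max_devices, cookie, verifier)
        device_ids.extend(res["gdlr_deviceid_list"])
        if res["gdlr_eof"]:
            return device_ids
        cookie, verifier = res["gdlr_cookie"], res["gdlr_cookieverf"]


def fake_server(all_ids):
    """Toy server whose cookie is simply an index into all_ids."""
    def call(layout_type, max_devices, cookie, verifier):
        chunk = all_ids[cookie:cookie + max_devices]
        next_cookie = cookie + len(chunk)
        return {
            "gdlr_cookie": next_cookie,
            "gdlr_cookieverf": b"v" * 8,
            "gdlr_deviceid_list": chunk,
            "gdlr_eof": next_cookie >= len(all_ids),
        }
    return call
```

A real server may return fewer device IDs than requested on any call, which the loop tolerates because it relies only on gdlr_eof to terminate.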
union newtime4 switch (bool nt_timechanged) {
case TRUE:
nfstime4 nt_time;
case FALSE:
void;
};
union newoffset4 switch (bool no_newoffset) {
case TRUE:
offset4 no_offset;
case FALSE:
void;
};
struct LAYOUTCOMMIT4args {
/* CURRENT_FH: file */
offset4 loca_offset;
length4 loca_length;
bool loca_reclaim;
stateid4 loca_stateid;
newoffset4 loca_last_write_offset;
newtime4 loca_time_modify;
layoutupdate4 loca_layoutupdate;
};
union newsize4 switch (bool ns_sizechanged) {
case TRUE:
length4 ns_size;
case FALSE:
void;
};
struct LAYOUTCOMMIT4resok {
newsize4 locr_newsize;
};
union LAYOUTCOMMIT4res switch (nfsstat4 locr_status) {
case NFS4_OK:
LAYOUTCOMMIT4resok locr_resok4;
default:
void;
};
The LAYOUTCOMMIT operation commits changes in the layout represented by the current filehandle, client ID (derived from the session ID in the preceding SEQUENCE operation), byte-range, and stateid. Since layouts are sub-dividable, a smaller portion of a layout, retrieved via LAYOUTGET, can be committed. The byte-range being committed is specified by loca_offset and loca_length. This byte-range
MUST overlap with one or more existing layouts previously granted via LAYOUTGET (
Section 18.43), each with an iomode of LAYOUTIOMODE4_RW. In the case where the iomode of any held layout segment is not LAYOUTIOMODE4_RW, the server should return the error NFS4ERR_BADIOMODE. For the case where the client does not hold matching layout segment(s) for the defined byte-range, the server should return the error NFS4ERR_BADLAYOUT.
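One way a metadata server might apply the overlap and iomode checks is sketched below. The representation of held layout segments as (offset, length, iomode) tuples, the function names, and the treatment of a length of NFS4_UINT64_MAX as "through end-of-file" are assumptions for illustration.

```python
NFS4_UINT64_MAX = 2**64 - 1
RW = "LAYOUTIOMODE4_RW"
READ = "LAYOUTIOMODE4_READ"


def _end(offset, length):
    """Exclusive end of a range, or None when the range is open-ended."""
    return None if length == NFS4_UINT64_MAX else offset + length


def _overlaps(off1, len1, off2, len2):
    end1, end2 = _end(off1, len1), _end(off2, len2)
    return (end2 is None or off1 < end2) and (end1 is None or off2 < end1)


def layoutcommit_range_status(held_segments, loca_offset, loca_length):
    """Check the committed byte-range against the client's held layouts."""
    matching = [seg for seg in held_segments
                if _overlaps(seg[0], seg[1], loca_offset, loca_length)]
    if not matching:
        return "NFS4ERR_BADLAYOUT"
    if any(iomode != RW for _, _, iomode in matching):
        return "NFS4ERR_BADIOMODE"
    return "NFS4_OK"
```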
The LAYOUTCOMMIT operation indicates that the client has completed writes using a layout obtained by a previous LAYOUTGET. The client may have only written a subset of the data range it previously requested. LAYOUTCOMMIT allows it to commit or discard provisionally allocated space and to update the server with a new end-of-file. The layout referenced by LAYOUTCOMMIT is still valid after the operation completes and can continue to be referenced by the client ID, filehandle, byte-range, layout type, and stateid.
If the loca_reclaim field is set to TRUE, this indicates that the client is attempting to commit changes to a layout after the restart of the metadata server during the metadata server's recovery grace period (see
Section 12.7.4). This type of request may be necessary when the client has uncommitted writes to provisionally allocated byte-ranges of a file that were sent to the storage devices before the restart of the metadata server. In this case, the layout provided by the client
MUST be a subset of a writable layout that the client held immediately before the restart of the metadata server. The value of the field loca_stateid
MUST be a value that the metadata server returned before it restarted. The metadata server is free to accept or reject this request based on its own internal metadata consistency checks. If the metadata server finds that the layout provided by the client does not pass its consistency checks, it
MUST reject the request with the status NFS4ERR_RECLAIM_BAD. The successful completion of the LAYOUTCOMMIT request with loca_reclaim set to TRUE does NOT provide the client with a layout for the file. It simply commits the changes to the layout specified in the loca_layoutupdate field. To obtain a layout for the file, the client must send a LAYOUTGET request to the server after the server's grace period has expired. If the metadata server receives a LAYOUTCOMMIT request with loca_reclaim set to TRUE when the metadata server is not in its recovery grace period, it
MUST reject the request with the status NFS4ERR_NO_GRACE.
Setting the loca_reclaim field to TRUE is required if and only if the committed layout was acquired before the metadata server restart. If the client is committing a layout that was acquired during the metadata server's grace period, it
MUST set the loca_reclaim field to FALSE.
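The reclaim-specific error handling can be condensed into a small sketch. The function name and its boolean inputs are hypothetical; in particular, passes_consistency_checks stands for whatever internal metadata checks a given server performs.

```python
def reclaim_commit_error(loca_reclaim, in_grace_period,
                         passes_consistency_checks):
    """Return the reclaim-specific error for a LAYOUTCOMMIT, or None."""
    if not loca_reclaim:
        # Ordinary commits are not subject to the reclaim rules.
        return None
    if not in_grace_period:
        return "NFS4ERR_NO_GRACE"
    if not passes_consistency_checks:
        return "NFS4ERR_RECLAIM_BAD"
    return None
```

On the client side, the corresponding rule is that loca_reclaim is TRUE exactly when the layout being committed was acquired before the metadata server restarted.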
The loca_stateid is a layout stateid value as returned by previously successful layout operations (see
Section 12.5.3).
The loca_last_write_offset field specifies the offset of the last byte written by the client prior to the LAYOUTCOMMIT. Note that this value is never equal to the file's size (at most it is one byte less than the file's size) and
MUST be less than or equal to NFS4_MAXFILEOFF. Also, loca_last_write_offset
MUST overlap the range described by loca_offset and loca_length. The metadata server may use this information to determine whether the file's size needs to be updated. If the metadata server updates the file's size as the result of the LAYOUTCOMMIT operation, it must return the new size (locr_newsize.ns_size) as part of the results.
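The constraints on loca_last_write_offset, and one possible size-update policy, are sketched below. The value of NFS4_MAXFILEOFF is taken to be the maximum unsigned 64-bit value, a loca_length of NFS4_UINT64_MAX is assumed to mean an open-ended range, and the policy of growing the file to last write offset plus one is only one choice a metadata server might make; all helper names are hypothetical.

```python
NFS4_UINT64_MAX = 0xFFFFFFFFFFFFFFFF
NFS4_MAXFILEOFF = 0xFFFFFFFFFFFFFFFF  # assumed value


def last_write_in_range(loca_offset, loca_length, last_write_offset):
    """True if the last written byte lies inside the committed range."""
    if last_write_offset > NFS4_MAXFILEOFF:
        return False
    if last_write_offset < loca_offset:
        return False
    if loca_length == NFS4_UINT64_MAX:
        return True
    return last_write_offset < loca_offset + loca_length


def new_size_result(current_size, last_write_offset):
    """Return (ns_sizechanged, ns_size) for the locr_newsize result."""
    # The last written byte is at most one byte below the file's size,
    # so the smallest size that contains it is last_write_offset + 1.
    candidate = last_write_offset + 1
    if candidate > current_size:
        return True, candidate
    return False, None
```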
The loca_time_modify field allows the client to suggest a modification time it would like the metadata server to set. The metadata server may use the suggestion or it may use the time of the LAYOUTCOMMIT operation to set the modification time. If the metadata server uses the client-provided modification time, it should ensure that time does not flow backwards. If the client wants to force the metadata server to set an exact time, the client should use a SETATTR operation in a COMPOUND right after LAYOUTCOMMIT. See
Section 12.5.4 for more details. If the client desires the resultant modification time, it should construct the COMPOUND so that a GETATTR follows the LAYOUTCOMMIT.
The loca_layoutupdate argument to LAYOUTCOMMIT provides a mechanism for a client to provide layout-specific updates to the metadata server. For example, the layout update can describe what byte-ranges of the original layout have been used and what byte-ranges can be deallocated. There is no NFSv4.1 file layout-specific layoutupdate4 structure.
The layout information is more verbose for block devices than for objects and files because the latter two hide the details of block allocation behind their storage protocols. At the minimum, the client needs to communicate changes to the end-of-file location back to the server, and, if desired, its view of the file's modification time. For block/volume layouts, it needs to specify precisely which blocks have been used.
If the layout identified in the arguments does not exist, the error NFS4ERR_BADLAYOUT is returned. The layout being committed may also be rejected if it does not correspond to an existing layout with an iomode of LAYOUTIOMODE4_RW.
On success, the current filehandle retains its value and the current stateid retains its value.
The client
MAY also use LAYOUTCOMMIT with the loca_reclaim field set to TRUE to convey hints about modified file attributes or to report layout-type specific information such as I/O errors for object-based storage layouts, as it would during normal operation. Doing so may help the metadata server to recover files more efficiently after restart. For example, some file system implementations may require expansive recovery of file system objects if the metadata server does not get a positive indication from all clients holding a LAYOUTIOMODE4_RW layout that they have successfully completed all their writes. Sending a LAYOUTCOMMIT (if required) and then following with LAYOUTRETURN can provide such an indication and allow for graceful and efficient recovery.
If loca_reclaim is TRUE, the metadata server is free to either examine or ignore the value in the field loca_stateid. The metadata server implementation might or might not encode in its layout stateid information that allows the metadata server to perform a consistency check on the LAYOUTCOMMIT request.
struct LAYOUTGET4args {
/* CURRENT_FH: file */
bool loga_signal_layout_avail;
layouttype4 loga_layout_type;
layoutiomode4 loga_iomode;
offset4 loga_offset;
length4 loga_length;
length4 loga_minlength;
stateid4 loga_stateid;
count4 loga_maxcount;
};
struct LAYOUTGET4resok {
bool logr_return_on_close;
stateid4 logr_stateid;
layout4 logr_layout<>;
};
union LAYOUTGET4res switch (nfsstat4 logr_status) {
case NFS4_OK:
LAYOUTGET4resok logr_resok4;
case NFS4ERR_LAYOUTTRYLATER:
bool logr_will_signal_layout_avail;
default:
void;
};
The LAYOUTGET operation requests a layout from the metadata server for reading or writing the file given by the filehandle at the byte-range specified by offset and length. Layouts are identified by the client ID (derived from the session ID in the preceding SEQUENCE operation), current filehandle, layout type (loga_layout_type), and the layout stateid (loga_stateid). The use of the loga_iomode field depends upon the layout type, but should reflect the client's data access intent.
If the metadata server is in a grace period, and does not persist layouts and device ID to device address mappings, then it
MUST return NFS4ERR_GRACE (see
Section 8.4.2.1).
The LAYOUTGET operation returns layout information for the specified byte-range: a layout. The client actually specifies two ranges, both starting at the offset in the loga_offset field. The first range is between loga_offset and loga_offset + loga_length - 1 inclusive. This range indicates the desired range the client wants the layout to cover. The second range is between loga_offset and loga_offset + loga_minlength - 1 inclusive. This range indicates the required range the client needs the layout to cover. Thus, loga_minlength
MUST be less than or equal to loga_length.
When a length field is set to NFS4_UINT64_MAX, this indicates a desire (when loga_length is NFS4_UINT64_MAX) or requirement (when loga_minlength is NFS4_UINT64_MAX) to get a layout from loga_offset through the end-of-file, regardless of the file's length.
The following rules govern the relationships among, and the minima of, loga_length, loga_minlength, and loga_offset.
-
If loga_length is less than loga_minlength, the metadata server MUST return NFS4ERR_INVAL.
-
If loga_minlength is zero, this is an indication to the metadata server that the client desires any layout at offset loga_offset or less that the metadata server has "readily available". Readily is subjective, and depends on the layout type and the pNFS server implementation. For example, some metadata servers might have to pre-allocate stable storage when they receive a request for a range of a file that goes beyond the file's current length. If loga_minlength is zero and loga_length is greater than zero, this tells the metadata server what range of the layout the client would prefer to have. If loga_length and loga_minlength are both zero, then the client is indicating that it desires a layout of any length with the ending offset of the range no less than the value specified by loga_offset, and the starting offset at or below loga_offset. If the metadata server does not have a layout that is readily available, then it MUST return NFS4ERR_LAYOUTTRYLATER.
-
If the sum of loga_offset and loga_minlength exceeds NFS4_UINT64_MAX, and loga_minlength is not NFS4_UINT64_MAX, the error NFS4ERR_INVAL MUST result.
-
If the sum of loga_offset and loga_length exceeds NFS4_UINT64_MAX, and loga_length is not NFS4_UINT64_MAX, the error NFS4ERR_INVAL MUST result.
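The NFS4ERR_INVAL rules in the list above translate directly into argument checks. The following sketch uses a hypothetical function name and relies on Python integers not wrapping on overflow, so the sums can be compared against NFS4_UINT64_MAX directly.

```python
NFS4_UINT64_MAX = 2**64 - 1


def check_layoutget_ranges(loga_offset, loga_length, loga_minlength):
    """Apply the range sanity rules to LAYOUTGET arguments."""
    if loga_length < loga_minlength:
        return "NFS4ERR_INVAL"
    for length in (loga_minlength, loga_length):
        # NFS4_UINT64_MAX means "through end-of-file" and is exempt.
        if length != NFS4_UINT64_MAX and loga_offset + length > NFS4_UINT64_MAX:
            return "NFS4ERR_INVAL"
    return "NFS4_OK"
```

The readily-available case (loga_minlength of zero) is not an argument error and so passes these checks; whether NFS4ERR_LAYOUTTRYLATER results depends on server state rather than on the arguments.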
After the metadata server has performed the above checks on loga_offset, loga_minlength, and loga_length, the metadata server
MUST return a layout according to the rules in
Table 22.
Acceptable layouts based on loga_minlength. Note: u64m = NFS4_UINT64_MAX; a_off = loga_offset; a_minlen = loga_minlength.
| Layout iomode of request | Layout a_minlen of request | Layout iomode of reply | Layout offset of reply | Layout length of reply |
|---|---|---|---|---|
| _READ | u64m | MAY be _READ | MUST be <= a_off | MUST be >= file length - layout offset |
| _READ | u64m | MAY be _RW | MUST be <= a_off | MUST be u64m |
| _READ | > 0 and < u64m | MAY be _READ | MUST be <= a_off | MUST be >= MIN(file length, a_minlen + a_off) - layout offset |
| _READ | > 0 and < u64m | MAY be _RW | MUST be <= a_off | MUST be >= a_off - layout offset + a_minlen |
| _READ | 0 | MAY be _READ | MUST be <= a_off | MUST be > 0 |
| _READ | 0 | MAY be _RW | MUST be <= a_off | MUST be > 0 |
| _RW | u64m | MUST be _RW | MUST be <= a_off | MUST be u64m |
| _RW | > 0 and < u64m | MUST be _RW | MUST be <= a_off | MUST be >= a_off - layout offset + a_minlen |
| _RW | 0 | MUST be _RW | MUST be <= a_off | MUST be > 0 |
Table 22
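The rules of Table 22 can be expressed as a predicate over a request and a candidate reply. The sketch below is one reading of the table, using the table's own abbreviations; the function name and the file_length parameter (the file's current length as known to the metadata server) are assumptions.

```python
U64M = 2**64 - 1
READ = "LAYOUTIOMODE4_READ"
RW = "LAYOUTIOMODE4_RW"


def reply_satisfies_table22(req_iomode, a_off, a_minlen, file_length,
                            rep_iomode, rep_offset, rep_length):
    """True if the reply's iomode, offset, and length obey Table 22."""
    if rep_iomode not in (READ, RW):
        return False
    if req_iomode == RW and rep_iomode != RW:
        return False  # a _RW request MUST get a _RW layout
    if rep_offset > a_off:
        return False  # layout offset MUST be <= a_off
    if a_minlen == 0:
        return rep_length > 0
    if rep_iomode == RW:
        if a_minlen == U64M:
            return rep_length == U64M
        return rep_length >= a_off - rep_offset + a_minlen
    # _READ reply: the layout need not extend past end-of-file.
    if a_minlen == U64M:
        return rep_length >= file_length - rep_offset
    return rep_length >= min(file_length, a_minlen + a_off) - rep_offset
```

Note that the length rule depends only on the iomode of the reply and on a_minlen, because a _RW request can only be answered with a _RW layout.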
If loga_minlength is not zero and the metadata server cannot return a layout according to the rules in
Table 22, then the metadata server
MUST return the error NFS4ERR_BADLAYOUT. If loga_minlength is zero and the metadata server cannot or will not return a layout according to the rules in
Table 22, then the metadata server
MUST return the error NFS4ERR_LAYOUTTRYLATER. Assuming that loga_length is greater than loga_minlength or equal to zero, the metadata server
SHOULD return a layout according to the rules in
Table 23.
Desired layouts based on loga_length. The rules of
Table 22 MUST be applied first. Note: u64m = NFS4_UINT64_MAX; a_off = loga_offset; a_len = loga_length.
| Layout iomode of request | Layout a_len of request | Layout iomode of reply | Layout offset of reply | Layout length of reply |
|---|---|---|---|---|
| _READ | u64m | MAY be _READ | MUST be <= a_off | SHOULD be u64m |
| _READ | u64m | MAY be _RW | MUST be <= a_off | SHOULD be u64m |
| _READ | > 0 and < u64m | MAY be _READ | MUST be <= a_off | SHOULD be >= a_off - layout offset + a_len |
| _READ | > 0 and < u64m | MAY be _RW | MUST be <= a_off | SHOULD be >= a_off - layout offset + a_len |
| _READ | 0 | MAY be _READ | MUST be <= a_off | SHOULD be > a_off - layout offset |
| _READ | 0 | MAY be _RW | MUST be <= a_off | SHOULD be > a_off - layout offset |
| _RW | u64m | MUST be _RW | MUST be <= a_off | SHOULD be u64m |
| _RW | > 0 and < u64m | MUST be _RW | MUST be <= a_off | SHOULD be >= a_off - layout offset + a_len |
| _RW | 0 | MUST be _RW | MUST be <= a_off | SHOULD be > a_off - layout offset |
Table 23
The loga_stateid field specifies a valid stateid. If a layout is not currently held by the client, the loga_stateid field represents a stateid reflecting the correspondingly valid open, byte-range lock, or delegation stateid. Once a layout is held on the file by the client, the loga_stateid field
MUST be a stateid as returned from a previous LAYOUTGET or LAYOUTRETURN operation or provided by a CB_LAYOUTRECALL operation (see
Section 12.5.3).
The loga_maxcount field specifies the maximum layout size (in bytes) that the client can handle. If the size of the layout structure exceeds the size specified by loga_maxcount, the metadata server will return the NFS4ERR_TOOSMALL error.
The returned layout is expressed as an array, logr_layout, with each element of type layout4. If a file has a single striping pattern, then logr_layout
SHOULD contain just one entry. Otherwise, if the requested range overlaps more than one striping pattern, logr_layout will contain the required number of entries. The elements of logr_layout
MUST be sorted in ascending order of the value of the lo_offset field of each element. There
MUST be no gaps or overlaps in the range between two successive elements of logr_layout. The lo_iomode field in each element of logr_layout
MUST be the same.
Table 22 and
Table 23 both refer to a returned layout iomode, offset, and length. Because the returned layout is encoded in the logr_layout array, more description is required.
-
iomode
-
The value of the returned layout iomode listed in Table 22 and Table 23 is equal to the value of the lo_iomode field in each element of logr_layout. As shown in Table 22 and Table 23, the metadata server MAY return a layout with an lo_iomode different from the requested iomode (field loga_iomode of the request). If it does so, it MUST ensure that the lo_iomode is more permissive than the loga_iomode requested. For example, this behavior allows an implementation to upgrade LAYOUTIOMODE4_READ requests to LAYOUTIOMODE4_RW requests at its discretion, within the limits of the layout type specific protocol. A lo_iomode of either LAYOUTIOMODE4_READ or LAYOUTIOMODE4_RW MUST be returned.
-
offset
-
The value of the returned layout offset listed in Table 22 and Table 23 is always equal to the lo_offset field of the first element of logr_layout.
-
length
-
When setting the value of the returned layout length, the situation is complicated by the possibility that the special layout length value NFS4_UINT64_MAX is involved. For a logr_layout array of N elements, the lo_length field in the first N-1 elements MUST NOT be NFS4_UINT64_MAX. The lo_length field of the last element of logr_layout can be NFS4_UINT64_MAX under some conditions as described in the following list.
-
If an applicable rule of Table 22 states that the metadata server MUST return a layout of length NFS4_UINT64_MAX, then the lo_length field of the last element of logr_layout MUST be NFS4_UINT64_MAX.
-
If an applicable rule of Table 22 states that the metadata server MUST NOT return a layout of length NFS4_UINT64_MAX, then the lo_length field of the last element of logr_layout MUST NOT be NFS4_UINT64_MAX.
-
If an applicable rule of Table 23 states that the metadata server SHOULD return a layout of length NFS4_UINT64_MAX, then the lo_length field of the last element of logr_layout SHOULD be NFS4_UINT64_MAX.
-
When the value of the returned layout length of Table 22 and Table 23 is not NFS4_UINT64_MAX, then the returned layout length is equal to the sum of the lo_length fields of each element of logr_layout.
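The mapping from the logr_layout array to the single iomode, offset, and length used by the tables can be sketched as follows. Elements are modelled as (lo_offset, lo_length, lo_iomode) tuples; the function name is hypothetical, and rejecting an empty array is an assumption made only to keep the example simple.

```python
U64M = 2**64 - 1


def summarize_logr_layout(segments):
    """Return (offset, length, iomode) of the layout as the tables see it."""
    if not segments:
        raise ValueError("sketch assumes at least one element")
    iomode = segments[0][2]
    for i, (offset, length, seg_iomode) in enumerate(segments):
        if seg_iomode != iomode:
            raise ValueError("lo_iomode differs between elements")
        if i < len(segments) - 1:
            if length == U64M:
                raise ValueError("only the last element may be open-ended")
            if offset + length != segments[i + 1][0]:
                raise ValueError("gap or overlap between elements")
    first_offset = segments[0][0]
    if segments[-1][1] == U64M:
        return first_offset, U64M, iomode
    return first_offset, sum(length for _, length, _ in segments), iomode
```

Requiring each element to end exactly where the next begins enforces both the ascending lo_offset ordering and the absence of gaps or overlaps.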
The logr_return_on_close result field is a directive to return the layout before closing the file. When the metadata server sets this return value to TRUE, it
MUST be prepared to recall the layout in the case in which the client fails to return the layout before close. For the metadata server that knows a layout must be returned before a close of the file, this return value can be used to communicate the desired behavior to the client and thus remove one extra step from the client's and metadata server's interaction.
The logr_stateid stateid is returned to the client for use in subsequent layout related operations. See Sections 8.2, 12.5.3, and 12.5.5.2 for a further discussion and requirements.
The format of the returned layout (lo_content) is specific to the layout type. The value of the layout type (lo_content.loc_type) for each of the elements of the array of layouts returned by the metadata server (logr_layout)
MUST be equal to the loga_layout_type specified by the client. If it is not equal, the client
SHOULD ignore the response as invalid and behave as if the metadata server returned an error, even if the client does have support for the layout type returned.
If neither the requested file nor its containing file system supports layouts, the metadata server
MUST return NFS4ERR_LAYOUTUNAVAILABLE. If the layout type is not supported, the metadata server
MUST return NFS4ERR_UNKNOWN_LAYOUTTYPE. If layouts are supported but no layout matches the client provided layout identification, the metadata server
MUST return NFS4ERR_BADLAYOUT. If an invalid loga_iomode is specified, or a loga_iomode of LAYOUTIOMODE4_ANY is specified, the metadata server
MUST return NFS4ERR_BADIOMODE.
If the layout for the file is unavailable due to transient conditions, e.g., file sharing prohibits layouts, the metadata server
MUST return NFS4ERR_LAYOUTTRYLATER.
If the layout request is rejected due to an overlapping layout recall, the metadata server
MUST return NFS4ERR_RECALLCONFLICT. See
Section 12.5.5.2 for details.
If the layout conflicts with a mandatory byte-range lock held on the file, and if the storage devices have no method of enforcing mandatory locks, other than through the restriction of layouts, the metadata server
SHOULD return NFS4ERR_LOCKED.
If the client sets loga_signal_layout_avail to TRUE, then it is registering with the metadata server a "want" for a layout in the event the layout cannot be obtained due to resource exhaustion. If the metadata server supports and will honor the "want", the results will have logr_will_signal_layout_avail set to TRUE. If so, the client should expect a CB_RECALLABLE_OBJ_AVAIL operation to indicate that a layout is available.
On success, the current filehandle retains its value and the current stateid is updated to match the value as returned in the results.
Typically, LAYOUTGET will be called as part of a COMPOUND request after an OPEN operation and results in the client having location information for the file. This requires that loga_stateid be set to the special stateid that tells the metadata server to use the current stateid, which is set by OPEN (see
Section 16.2.3.1.2). A client may also hold a layout across multiple OPENs. The client specifies a layout type that limits what kind of layout the metadata server will return. This prevents metadata servers from granting layouts that are unusable by the client.
As indicated by
Table 22 and
Table 23, the specification of LAYOUTGET allows a pNFS client and server considerable flexibility. A pNFS client can take several strategies for sending LAYOUTGET. Some examples are as follows.
-
If LAYOUTGET is preceded by OPEN in the same COMPOUND request and the OPEN requests OPEN4_SHARE_ACCESS_READ access, the client might opt to request a _READ layout with loga_offset set to zero, loga_minlength set to zero, and loga_length set to NFS4_UINT64_MAX. If the file has space allocated to it, that space is striped over one or more storage devices, and there is either no conflicting layout or the concept of a conflicting layout does not apply to the pNFS server's layout type or implementation, then the metadata server might return a layout with a starting offset of zero, and a length equal to the length of the file, if not NFS4_UINT64_MAX. If the length of the file is not a multiple of the pNFS server's stripe width (see Section 13.2 for a formal definition), the metadata server might round up the returned layout's length.
-
If LAYOUTGET is preceded by OPEN in the same COMPOUND request, and the OPEN requests OPEN4_SHARE_ACCESS_WRITE access and does not truncate the file, the client might opt to request a _RW layout with loga_offset set to zero, loga_minlength set to zero, and loga_length set to the file's current length (if known), or NFS4_UINT64_MAX. As with the previous case, under some conditions the metadata server might return a layout that covers the entire length of the file or beyond.
-
This strategy is as above, but the OPEN truncates the file. In this case, the client might anticipate it will be writing to the file from offset zero, and so loga_offset and loga_minlength are set to zero, and loga_length is set to the value of threshold4_write_iosize. The metadata server might return a layout from offset zero with a length at least as long as threshold4_write_iosize.
-
A process on the client invokes a request to read from offset 10000 for length 50000. The client is using buffered I/O, and has buffer sizes of 4096 bytes. The client intends to map the request of the process into a series of READ requests starting at offset 8192. The end offset needs to be higher than 10000 + 50000 = 60000, and the next offset that is a multiple of 4096 is 61440. The difference between 61440 and the starting offset of the layout is 53248 (which is the product of 4096 and 13). The value of threshold4_read_iosize is less than 53248, so the client sends a LAYOUTGET request with loga_offset set to 8192, loga_minlength set to 53248, and loga_length set to the file's length (if known) minus 8192 or NFS4_UINT64_MAX (if the file's length is not known). Since this LAYOUTGET request exceeds the metadata server's threshold, it grants the layout, possibly with an initial offset of zero, with an end offset of at least 8192 + 53248 - 1 = 61439, but preferably a layout with an offset aligned on the stripe width and a length that is a multiple of the stripe width.
-
This strategy is as above, but the client is not using buffered I/O, and instead all internal I/O requests are sent directly to the server. The LAYOUTGET request has loga_offset equal to 10000 and loga_minlength set to 50000. The value of loga_length is set to the length of the file. The metadata server is free to return a layout that fully overlaps the requested range, with a starting offset and length aligned on the stripe width.
-
Again, a process on the client invokes a request to read from offset 10000 for length 50000 (i.e., a range with a starting offset of 10000 and an ending offset of 59999), and buffered I/O is in use. The client is expecting that the server might not be able to return the layout for the full I/O range. The client intends to map the request of the process into a series of thirteen READ requests starting at offset 8192, each with length 4096, with a total length of 53248 (which equals 13 * 4096), which fully contains the range that the client's process wants to read. Because the value of threshold4_read_iosize is equal to 4096, it is practical and reasonable for the client to use several LAYOUTGET operations to complete the series of READs. The client sends a LAYOUTGET request with loga_offset set to 8192, loga_minlength set to 4096, and loga_length set to 53248 or higher. The server will grant a layout possibly with an initial offset of zero, with an end offset of at least 8192 + 4096 - 1 = 12287, but preferably a layout with an offset aligned on the stripe width and a length that is a multiple of the stripe width. This will allow the client to make forward progress, possibly sending more LAYOUTGET operations for the remainder of the range.
-
An NFS client detects a sequential read pattern, and so sends a LAYOUTGET operation that goes well beyond any current or pending read requests to the server. The server might likewise detect this pattern, and grant the LAYOUTGET request. Once the client reads from an offset of the file that represents 50% of the way through the range of the last layout it received, in order to avoid stalling I/O that would wait for a layout, the client sends more LAYOUTGET operations from an offset of the file that represents 50% of the way through the last layout it received. The client continues to request layouts with byte-ranges that are well in advance of the byte-ranges of recent and/or pending read requests of processes running on the client.
-
This strategy is as above, except that the client fails to detect the pattern while the server does detect it. The next time the metadata server gets a LAYOUTGET, it returns a layout with a length that is well beyond loga_minlength.
-
A client is using buffered I/O, and has a long queue of write-behinds to process and also detects a sequential write pattern. It sends a LAYOUTGET for a layout that spans the range of the queued write-behinds and well beyond, including ranges beyond the file's current length. The client continues to send LAYOUTGET operations once the write-behind queue reaches 50% of the maximum queue length.
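The arithmetic in the buffered I/O strategies above (read from offset 10000 for length 50000 with 4096-byte buffers) can be reproduced with a short helper. The function name is hypothetical; it rounds the start of the application's read down, and the end up, to buffer boundaries.

```python
def buffered_layoutget_range(read_offset, read_length, buffer_size):
    """Return (loga_offset, covering_length, buffer_count) for a read."""
    start = (read_offset // buffer_size) * buffer_size
    # Exclusive end of the read, rounded up to the next buffer boundary.
    end = -(-(read_offset + read_length) // buffer_size) * buffer_size
    covering_length = end - start
    return start, covering_length, covering_length // buffer_size


# Offset 10000, length 50000, 4096-byte buffers, as in the text.
loga_offset, covering_length, buffers = buffered_layoutget_range(
    10000, 50000, 4096)
```

For these inputs the helper yields a loga_offset of 8192 and a covering length of 53248, that is, thirteen buffers whose last byte is at offset 61439. A client following the fourth strategy would send that covering length as loga_minlength, while one following the sixth would send loga_minlength of 4096 and use the covering length as a lower bound for loga_length.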
Once the client has obtained a layout referring to a particular device ID, the metadata server
MUST NOT delete the device ID until the layout is returned or revoked.
CB_NOTIFY_DEVICEID can race with LAYOUTGET. One race scenario is that LAYOUTGET returns a device ID for which the client does not have device address mappings, and the metadata server sends a CB_NOTIFY_DEVICEID to add the device ID to the client's awareness and meanwhile the client sends GETDEVICEINFO on the device ID. This scenario is discussed in
Section 18.40.4. Another scenario is that the CB_NOTIFY_DEVICEID is processed by the client before it processes the results from LAYOUTGET. The client will send a GETDEVICEINFO on the device ID. If the results from GETDEVICEINFO are received before the client gets results from LAYOUTGET, then there is no longer a race. If the results from LAYOUTGET are received before the results from GETDEVICEINFO, the client can either wait for results of GETDEVICEINFO or send another one to get possibly more up-to-date device address mappings for the device ID.
/* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */
const LAYOUT4_RET_REC_FILE = 1;
const LAYOUT4_RET_REC_FSID = 2;
const LAYOUT4_RET_REC_ALL = 3;
enum layoutreturn_type4 {
LAYOUTRETURN4_FILE = LAYOUT4_RET_REC_FILE,
LAYOUTRETURN4_FSID = LAYOUT4_RET_REC_FSID,
LAYOUTRETURN4_ALL = LAYOUT4_RET_REC_ALL
};
struct layoutreturn_file4 {
offset4 lrf_offset;
length4 lrf_length;
stateid4 lrf_stateid;
/* layouttype4 specific data */
opaque lrf_body<>;
};
union layoutreturn4 switch(layoutreturn_type4 lr_returntype) {
case LAYOUTRETURN4_FILE:
layoutreturn_file4 lr_layout;
default:
void;
};
struct LAYOUTRETURN4args {
/* CURRENT_FH: file */
bool lora_reclaim;
layouttype4 lora_layout_type;
layoutiomode4 lora_iomode;
layoutreturn4 lora_layoutreturn;
};
union layoutreturn_stateid switch (bool lrs_present) {
case TRUE:
stateid4 lrs_stateid;
case FALSE:
void;
};
union LAYOUTRETURN4res switch (nfsstat4 lorr_status) {
case NFS4_OK:
layoutreturn_stateid lorr_stateid;
default:
void;
};
This operation returns from the client to the server one or more layouts represented by the client ID (derived from the session ID in the preceding SEQUENCE operation), lora_layout_type, and lora_iomode. When lr_returntype is LAYOUTRETURN4_FILE, the returned layout is further identified by the current filehandle, lrf_offset, lrf_length, and lrf_stateid. If the lrf_length field is NFS4_UINT64_MAX, all bytes of the layout, starting at lrf_offset, are returned. When lr_returntype is LAYOUTRETURN4_FSID, the current filehandle is used to identify the file system and all layouts matching the client ID, the fsid of the file system, lora_layout_type, and lora_iomode are returned. When lr_returntype is LAYOUTRETURN4_ALL, all layouts matching the client ID, lora_layout_type, and lora_iomode are returned and the current filehandle is not used. After this call, the client
MUST NOT use the returned layout(s) and the associated storage protocol to access the file data.
If the set of layouts designated in the case of LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL is empty, then no error results. In the case of LAYOUTRETURN4_FILE, the byte-range specified is returned even if it is a subdivision of a layout previously obtained with LAYOUTGET, a combination of multiple layouts previously obtained with LAYOUTGET, or a combination including some layouts previously obtained with LAYOUTGET, and one or more subdivisions of such layouts. When the byte-range does not designate any bytes for which a layout is held for the specified file, client ID, layout type and mode, no error results. See
Section 12.5.5.2.1.5 for considerations with "bulk" return of layouts.
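As a rough illustration of the LAYOUTRETURN4_FILE byte-range rules, the sketch below subtracts a returned range from a set of held layout segments. The segment representation as `(offset, length)` tuples and the function names are assumptions made for this example; actual bookkeeping is layout-type specific.

```python
NFS4_UINT64_MAX = 0xFFFFFFFFFFFFFFFF
_END_OF_SPACE = 1 << 64  # exclusive end used when length is NFS4_UINT64_MAX


def _range_end(offset, length):
    """Exclusive end of a range; NFS4_UINT64_MAX means 'all bytes from offset'."""
    return _END_OF_SPACE if length == NFS4_UINT64_MAX else offset + length


def _segment(start, end):
    """Rebuild an (offset, length) tuple, restoring the NFS4_UINT64_MAX sentinel."""
    return (start, NFS4_UINT64_MAX if end == _END_OF_SPACE else end - start)


def return_range(held, lrf_offset, lrf_length):
    """Return the segments still held after returning [lrf_offset, lrf_length).

    The returned range may split a segment, span several segments, or
    match nothing at all; none of these cases is an error.
    """
    ret_start = lrf_offset
    ret_end = _range_end(lrf_offset, lrf_length)
    remaining = []
    for off, length in held:
        end = _range_end(off, length)
        if off < ret_start:  # part before the returned range survives
            remaining.append(_segment(off, min(end, ret_start)))
        if end > ret_end:    # part after the returned range survives
            remaining.append(_segment(max(off, ret_end), end))
    return remaining
```

For example, returning 100 bytes at offset 200 from a single held segment `(0, 1000)` leaves `(0, 200)` and `(300, 700)` in this model.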
The layout being returned may be a subset or superset of a layout specified by CB_LAYOUTRECALL. However, if it is a subset, the recall is not complete until the full recalled scope has been returned. Recalled scope refers to the byte-range in the case of LAYOUTRETURN4_FILE, the use of LAYOUTRETURN4_FSID, or the use of LAYOUTRETURN4_ALL. There must be a LAYOUTRETURN with a matching scope to complete the return even if all current layout ranges have been previously individually returned.
For all lr_returntype values, an iomode of LAYOUTIOMODE4_ANY specifies that all layouts that match the other arguments to LAYOUTRETURN (i.e., client ID, lora_layout_type, and one of current filehandle and range; fsid derived from current filehandle; or LAYOUTRETURN4_ALL) are being returned.
In the case that lr_returntype is LAYOUTRETURN4_FILE, the lrf_stateid provided by the client is a layout stateid as returned from previous layout operations. Note that the "seqid" field of lrf_stateid
MUST NOT be zero. See Sections [
8.2], [
12.5.3], and [
12.5.5.2] for a further discussion and requirements.
Return of a layout or all layouts does not invalidate the mapping of storage device ID to a storage device address. The mapping remains in effect until specifically changed or deleted via device ID notification callbacks. Of course if there are no remaining layouts that refer to a previously used device ID, the server is free to delete a device ID without a notification callback, which will be the case when notifications are not in effect.
If the lora_reclaim field is set to TRUE, the client is attempting to return a layout that was acquired before the restart of the metadata server during the metadata server's grace period. When returning layouts that were acquired during the metadata server's grace period, the client
MUST set the lora_reclaim field to FALSE. The lora_reclaim field
MUST also be set to FALSE when lr_returntype is LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL. See
Section 18.42 for more details.
Layouts may be returned when recalled or voluntarily (i.e., before the server has recalled them). In either case, the client must properly propagate state changed under the context of the layout to the storage device(s) or to the metadata server before returning the layout.
If the client returns the layout in response to a CB_LAYOUTRECALL where the lor_recalltype field of the clora_recall field was LAYOUTRECALL4_FILE, the client should use the lor_stateid value from CB_LAYOUTRECALL as the value for lrf_stateid. Otherwise, it should use logr_stateid (from a previous LAYOUTGET result) or lorr_stateid (from a previous LAYOUTRETURN result). This is done to indicate the point in time (in terms of layout stateid transitions) when the recall was sent. The client uses the precise lora_recallstateid value and
MUST NOT set the stateid's seqid to zero; otherwise, NFS4ERR_BAD_STATEID
MUST be returned. NFS4ERR_OLD_STATEID can be returned if the client is using an old seqid, and the server knows the client should not be using the old seqid. For example, the client uses the seqid on slot 1 of the session, receives the response with the new seqid, and uses the slot to send another request with the old seqid.
If a client fails to return a layout in a timely manner, then the metadata server
SHOULD use its control protocol with the storage devices to fence the client from accessing the data referenced by the layout. See
Section 12.5.5 for more details.
If the LAYOUTRETURN request sets the lora_reclaim field to TRUE after the metadata server's grace period, NFS4ERR_NO_GRACE is returned.
If the LAYOUTRETURN request sets the lora_reclaim field to TRUE and lr_returntype is set to LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL, NFS4ERR_INVAL is returned.
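The two lora_reclaim error conditions above might be checked on the server roughly as follows. The function name, the `in_grace` parameter, and the use of strings for status codes are hypothetical; the order of the two checks is also an assumption, since the text does not say which error takes precedence when both conditions hold.

```python
LAYOUTRETURN4_FILE = 1
LAYOUTRETURN4_FSID = 2
LAYOUTRETURN4_ALL = 3


def check_reclaim(lora_reclaim: bool, lr_returntype: int, in_grace: bool) -> str:
    """Validate lora_reclaim against the return type and the grace period."""
    if lora_reclaim:
        # Reclaim is only meaningful for a per-file return.
        if lr_returntype in (LAYOUTRETURN4_FSID, LAYOUTRETURN4_ALL):
            return "NFS4ERR_INVAL"
        # Reclaim is only accepted during the metadata server's grace period.
        if not in_grace:
            return "NFS4ERR_NO_GRACE"
    return "NFS4_OK"
```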
If the client sets the lr_returntype field to LAYOUTRETURN4_FILE, then the lrs_stateid field will represent the layout stateid as updated for this operation's processing; the current stateid will also be updated to match the returned value. If the last byte of any layout for the current file, client ID, and layout type is being returned and there are no remaining pending CB_LAYOUTRECALL operations for which a LAYOUTRETURN operation must be done, lrs_present
MUST be FALSE, and no stateid will be returned. In addition, the COMPOUND request's current stateid will be set to the all-zeroes special stateid (see
Section 16.2.3.1.2). The server
MUST reject with NFS4ERR_BAD_STATEID any further use of the current stateid in that COMPOUND until the current stateid is re-established by a later stateid-returning operation.
On success, the current filehandle retains its value.
If the EXCHGID4_FLAG_BIND_PRINC_STATEID capability is set on the client ID (see
Section 18.35), the server will require that the principal, security flavor, and if applicable, the GSS mechanism, combination that acquired the layout also be the one to send LAYOUTRETURN. This might not be possible if credentials for the principal are no longer available. The server will allow the machine credential or SSV credential (see
Section 18.35) to send LAYOUTRETURN if LAYOUTRETURN's operation code was set in the spo_must_allow result of EXCHANGE_ID.
The final LAYOUTRETURN operation in response to a CB_LAYOUTRECALL callback
MUST be serialized with any outstanding, intersecting LAYOUTRETURN operations. Note that it is possible that while a client is returning the layout for some recalled range, the server may recall a superset of that range (e.g., LAYOUTRECALL4_ALL); the final return operation for the latter must block until the former layout recall is done.
Returning all layouts in a file system using LAYOUTRETURN4_FSID is typically done in response to a CB_LAYOUTRECALL for that file system as the final return operation. Similarly, LAYOUTRETURN4_ALL is used in response to a recall callback for all layouts. It is possible that the client already returned some outstanding layouts via individual LAYOUTRETURN calls and the call for LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL marks the end of the LAYOUTRETURN sequence. See
Section 12.5.5.1 for more details.
Once the client has returned all layouts referring to a particular device ID, the server
MAY delete the device ID.
enum secinfo_style4 {
SECINFO_STYLE4_CURRENT_FH = 0,
SECINFO_STYLE4_PARENT = 1
};
/* CURRENT_FH: object or child directory */
typedef secinfo_style4 SECINFO_NO_NAME4args;
/* CURRENT_FH: consumed if status is NFS4_OK */
typedef SECINFO4res SECINFO_NO_NAME4res;
Like the SECINFO operation, SECINFO_NO_NAME is used by the client to obtain a list of valid RPC authentication flavors for a specific file object. Unlike SECINFO, SECINFO_NO_NAME only works with objects that are accessed by filehandle.
There are two styles of SECINFO_NO_NAME, as determined by the value of the secinfo_style4 enumeration. If SECINFO_STYLE4_CURRENT_FH is passed, then SECINFO_NO_NAME is querying for the required security for the current filehandle. If SECINFO_STYLE4_PARENT is passed, then SECINFO_NO_NAME is querying for the required security of the current filehandle's parent. If the style selected is SECINFO_STYLE4_PARENT, then SECINFO_NO_NAME should apply the same access methodology used for LOOKUPP when evaluating the traversal to the parent directory. Therefore, if the requester does not have the appropriate access to LOOKUPP the parent, then SECINFO_NO_NAME must behave the same way and return NFS4ERR_ACCESS.
If PUTFH, PUTPUBFH, PUTROOTFH, or RESTOREFH returns NFS4ERR_WRONGSEC, then the client resolves the situation by sending a COMPOUND request that consists of PUTFH, PUTPUBFH, or PUTROOTFH immediately followed by SECINFO_NO_NAME, style SECINFO_STYLE4_CURRENT_FH. See
Section 2.6 for instructions on dealing with NFS4ERR_WRONGSEC error returns from PUTFH, PUTROOTFH, PUTPUBFH, or RESTOREFH.
If SECINFO_STYLE4_PARENT is specified and there is no parent directory, SECINFO_NO_NAME
MUST return NFS4ERR_NOENT.
On success, the current filehandle is consumed (see
Section 2.6.3.1.1.8), and if the next operation after SECINFO_NO_NAME tries to use the current filehandle, that operation will fail with the status NFS4ERR_NOFILEHANDLE.
Everything else about SECINFO_NO_NAME is the same as SECINFO. See the discussion on SECINFO (
Section 18.29.3).
struct SEQUENCE4args {
sessionid4 sa_sessionid;
sequenceid4 sa_sequenceid;
slotid4 sa_slotid;
slotid4 sa_highest_slotid;
bool sa_cachethis;
};
const SEQ4_STATUS_CB_PATH_DOWN = 0x00000001;
const SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING = 0x00000002;
const SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED = 0x00000004;
const SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED = 0x00000008;
const SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED = 0x00000010;
const SEQ4_STATUS_ADMIN_STATE_REVOKED = 0x00000020;
const SEQ4_STATUS_RECALLABLE_STATE_REVOKED = 0x00000040;
const SEQ4_STATUS_LEASE_MOVED = 0x00000080;
const SEQ4_STATUS_RESTART_RECLAIM_NEEDED = 0x00000100;
const SEQ4_STATUS_CB_PATH_DOWN_SESSION = 0x00000200;
const SEQ4_STATUS_BACKCHANNEL_FAULT = 0x00000400;
const SEQ4_STATUS_DEVID_CHANGED = 0x00000800;
const SEQ4_STATUS_DEVID_DELETED = 0x00001000;
struct SEQUENCE4resok {
sessionid4 sr_sessionid;
sequenceid4 sr_sequenceid;
slotid4 sr_slotid;
slotid4 sr_highest_slotid;
slotid4 sr_target_highest_slotid;
uint32_t sr_status_flags;
};
union SEQUENCE4res switch (nfsstat4 sr_status) {
case NFS4_OK:
SEQUENCE4resok sr_resok4;
default:
void;
};
The SEQUENCE operation is used by the server to implement session request control and the reply cache semantics.
SEQUENCE
MUST appear as the first operation of any COMPOUND in which it appears. The error NFS4ERR_SEQUENCE_POS will be returned when it is found in any position in a COMPOUND beyond the first. Operations other than SEQUENCE, BIND_CONN_TO_SESSION, EXCHANGE_ID, CREATE_SESSION, and DESTROY_SESSION,
MUST NOT appear as the first operation in a COMPOUND. Such operations
MUST yield the error NFS4ERR_OP_NOT_IN_SESSION if they do appear at the start of a COMPOUND.
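The ordering rules for SEQUENCE within a COMPOUND can be illustrated with the sketch below. It models a COMPOUND as a plain list of operation names and reports only the first ordering violation; the function name and return shape are hypothetical, and a real server would apply these checks while evaluating operations one at a time.

```python
# Operations permitted as the first operation of a COMPOUND, per the text.
ALLOWED_FIRST_OPS = {
    "SEQUENCE",
    "BIND_CONN_TO_SESSION",
    "EXCHANGE_ID",
    "CREATE_SESSION",
    "DESTROY_SESSION",
}


def check_compound_order(ops):
    """Return (index, error) for the first ordering violation, or None."""
    if not ops:
        return None
    if ops[0] not in ALLOWED_FIRST_OPS:
        return (0, "NFS4ERR_OP_NOT_IN_SESSION")
    for index, op in enumerate(ops[1:], start=1):
        if op == "SEQUENCE":  # SEQUENCE found beyond the first position
            return (index, "NFS4ERR_SEQUENCE_POS")
    return None
```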
If SEQUENCE is received on a connection not associated with the session via CREATE_SESSION or BIND_CONN_TO_SESSION, and connection association enforcement is enabled (see
Section 18.35), then the server returns NFS4ERR_CONN_NOT_BOUND_TO_SESSION.
The sa_sessionid argument identifies the session to which this request applies. The sr_sessionid result
MUST equal sa_sessionid.
The sa_slotid argument is the index in the reply cache for the request. The sa_sequenceid field is the sequence number of the request for the reply cache entry (slot). The sr_slotid result
MUST equal sa_slotid. The sr_sequenceid result
MUST equal sa_sequenceid.
The sa_highest_slotid argument is the highest slot ID for which the client has a request outstanding; it could be equal to sa_slotid. The server returns two "highest_slotid" values: sr_highest_slotid and sr_target_highest_slotid. The former is the highest slot ID the server will accept in a future SEQUENCE operation, and
SHOULD NOT be less than the value of sa_highest_slotid (but see
Section 2.10.6.1 for an exception). The latter is the highest slot ID the server would prefer the client use on a future SEQUENCE operation.
If sa_cachethis is TRUE, then the client is requesting that the server cache the entire reply in the server's reply cache; therefore, the server
MUST cache the reply (see
Section 2.10.6.1.3). The server
MAY cache the reply if sa_cachethis is FALSE. If the server does not cache the entire reply, it
MUST still record that it executed the request at the specified slot and sequence ID.
The response to the SEQUENCE operation contains a word of status flags (sr_status_flags) that can provide to the client information related to the status of the client's lock state and communications paths. Note that any status bits relating to lock state
MAY be reset when lock state is lost due to a server restart (even if the session is persistent across restarts; session persistence does not imply lock state persistence) or the establishment of a new client instance.
-
SEQ4_STATUS_CB_PATH_DOWN
-
When set, indicates that the client has no operational backchannel path for any session associated with the client ID, making it necessary for the client to re-establish one. This bit remains set on all SEQUENCE responses on all sessions associated with the client ID until at least one backchannel is available on any session associated with the client ID. If the client fails to re-establish a backchannel for the client ID, it is subject to having recallable state revoked.
-
SEQ4_STATUS_CB_PATH_DOWN_SESSION
-
When set, indicates that the session has no operational backchannel. There are two reasons why SEQ4_STATUS_CB_PATH_DOWN_SESSION may be set and not SEQ4_STATUS_CB_PATH_DOWN. First is that a callback operation that applies specifically to the session (e.g., CB_RECALL_SLOT, see Section 20.8) needs to be sent. Second is that the server did send a callback operation, but the connection was lost before the reply. The server cannot be sure whether or not the client received the callback operation, and so, per rules on request retry, the server MUST retry the callback operation over the same session. The SEQ4_STATUS_CB_PATH_DOWN_SESSION bit is the indication to the client that it needs to associate a connection to the session's backchannel. This bit remains set on all SEQUENCE responses of the session until a connection is associated with the session's backchannel. If the client fails to re-establish a backchannel for the session, it is subject to having recallable state revoked.
-
SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING
-
When set, indicates that all GSS contexts or RPCSEC_GSS handles assigned to the session's backchannel will expire within a period equal to the lease time. This bit remains set on all SEQUENCE replies until at least one of the following is true:
-
All SSV RPCSEC_GSS handles on the session's backchannel have been destroyed and all non-SSV GSS contexts have expired.
-
At least one more SSV RPCSEC_GSS handle has been added to the backchannel.
-
The expiration time of at least one non-SSV GSS context of an RPCSEC_GSS handle is beyond the lease period from the current time (relative to the time of when a SEQUENCE response was sent).
-
SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED
-
When set, indicates all non-SSV GSS contexts and all SSV RPCSEC_GSS handles assigned to the session's backchannel have expired or have been destroyed. This bit remains set on all SEQUENCE replies until at least one non-expired non-SSV GSS context for the session's backchannel has been established or at least one SSV RPCSEC_GSS handle has been assigned to the backchannel.
-
SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED
-
When set, indicates that the lease has expired and as a result the server released all of the client's locking state. This status bit remains set on all SEQUENCE replies until the loss of all such locks has been acknowledged by use of FREE_STATEID (see Section 18.38), or by establishing a new client instance by destroying all sessions (via DESTROY_SESSION), the client ID (via DESTROY_CLIENTID), and then invoking EXCHANGE_ID and CREATE_SESSION to establish a new client ID.
-
SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED
-
When set, indicates that some subset of the client's locks have been revoked due to expiration of the lease period followed by another client's conflicting LOCK operation. This status bit remains set on all SEQUENCE replies until the loss of all such locks has been acknowledged by use of FREE_STATEID.
-
SEQ4_STATUS_ADMIN_STATE_REVOKED
-
When set, indicates that one or more locks have been revoked without expiration of the lease period, due to administrative action. This status bit remains set on all SEQUENCE replies until the loss of all such locks has been acknowledged by use of FREE_STATEID.
-
SEQ4_STATUS_RECALLABLE_STATE_REVOKED
-
When set, indicates that one or more recallable objects have been revoked without expiration of the lease period, due to the client's failure to return them when recalled, which may be a consequence of there being no working backchannel and the client failing to re-establish a backchannel per the SEQ4_STATUS_CB_PATH_DOWN, SEQ4_STATUS_CB_PATH_DOWN_SESSION, or SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED status flags. This status bit remains set on all SEQUENCE replies until the loss of all such locks has been acknowledged by use of FREE_STATEID.
-
SEQ4_STATUS_LEASE_MOVED
-
When set, indicates that responsibility for lease renewal has been transferred to one or more new servers. This condition will continue until the client receives an NFS4ERR_MOVED error and the server receives the subsequent GETATTR for the fs_locations or fs_locations_info attribute for an access to each file system for which a lease has been moved to a new server. See Section 11.11.9.2.
-
SEQ4_STATUS_RESTART_RECLAIM_NEEDED
-
When set, indicates that due to server restart, the client must reclaim locking state. Until the client sends a global RECLAIM_COMPLETE (Section 18.51), every SEQUENCE operation will return SEQ4_STATUS_RESTART_RECLAIM_NEEDED.
-
SEQ4_STATUS_BACKCHANNEL_FAULT
-
The server has encountered an unrecoverable fault with the backchannel (e.g., it has lost track of the sequence ID for a slot in the backchannel). The client MUST stop sending more requests on the session's fore channel, wait for all outstanding requests to complete on the fore and back channel, and then destroy the session.
-
SEQ4_STATUS_DEVID_CHANGED
-
The client is using device ID notifications and the server has changed a device ID mapping held by the client. This flag will stay present until the client has obtained the new mapping with GETDEVICEINFO.
-
SEQ4_STATUS_DEVID_DELETED
-
The client is using device ID notifications and the server has deleted a device ID mapping held by the client. This flag will stay in effect until the client sends a GETDEVICEINFO on the device ID with a null value in the argument gdia_notify_types.
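A client might decode the sr_status_flags word into the flag names listed above along the following lines. The constant values are those given in the XDR definitions for SEQUENCE; the table and function names are hypothetical.

```python
# Bit values as defined in the XDR for SEQUENCE4resok.sr_status_flags.
SEQ4_STATUS_FLAGS = {
    0x00000001: "SEQ4_STATUS_CB_PATH_DOWN",
    0x00000002: "SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRING",
    0x00000004: "SEQ4_STATUS_CB_GSS_CONTEXTS_EXPIRED",
    0x00000008: "SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED",
    0x00000010: "SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED",
    0x00000020: "SEQ4_STATUS_ADMIN_STATE_REVOKED",
    0x00000040: "SEQ4_STATUS_RECALLABLE_STATE_REVOKED",
    0x00000080: "SEQ4_STATUS_LEASE_MOVED",
    0x00000100: "SEQ4_STATUS_RESTART_RECLAIM_NEEDED",
    0x00000200: "SEQ4_STATUS_CB_PATH_DOWN_SESSION",
    0x00000400: "SEQ4_STATUS_BACKCHANNEL_FAULT",
    0x00000800: "SEQ4_STATUS_DEVID_CHANGED",
    0x00001000: "SEQ4_STATUS_DEVID_DELETED",
}


def decode_status_flags(sr_status_flags: int) -> list:
    """Return the names of all known flags set in sr_status_flags,
    ordered from the lowest bit to the highest."""
    return [name for bit, name in SEQ4_STATUS_FLAGS.items()
            if sr_status_flags & bit]
```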
The value of the sa_sequenceid argument relative to the cached sequence ID on the slot falls into one of three cases.
-
If the difference between sa_sequenceid and the server's cached sequence ID at the slot ID is two (2) or more, or if sa_sequenceid is less than the cached sequence ID (accounting for wraparound of the unsigned sequence ID value), then the server MUST return NFS4ERR_SEQ_MISORDERED.
-
If sa_sequenceid and the cached sequence ID are the same, this is a retry, and the server replies with what is recorded in the reply cache. The lease is possibly renewed as described below.
-
If sa_sequenceid is one greater (accounting for wraparound) than the cached sequence ID, then this is a new request, and the slot's sequence ID is incremented. The operations subsequent to SEQUENCE, if any, are processed. If there are no other operations, the only other effects are to cache the SEQUENCE reply in the slot, maintain the session's activity, and possibly renew the lease.
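The three cases above reduce to a single modular subtraction when the sequence ID is treated as an unsigned 32-bit value. The sketch below shows one way to classify a request; the function name and the string labels for the retry and new-request cases are hypothetical.

```python
UINT32_MASK = 0xFFFFFFFF


def classify_sequence(sa_sequenceid: int, cached_seqid: int) -> str:
    """Classify sa_sequenceid relative to the slot's cached sequence ID.

    With unsigned 32-bit wraparound, a difference of 0 is a retry, a
    difference of 1 is a new request, and anything else (two or more
    ahead, or behind the cached value) is misordered.
    """
    delta = (sa_sequenceid - cached_seqid) & UINT32_MASK
    if delta == 0:
        return "retry"
    if delta == 1:
        return "new"
    return "NFS4ERR_SEQ_MISORDERED"
```

Note that a cached sequence ID of 0xFFFFFFFF followed by a request with sequence ID 0 is classified as a new request, which is the wraparound case the text calls out.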
If the client reuses a slot ID and sequence ID for a completely different request, the server
MAY treat the request as if it is a retry of what it has already executed. The server
MAY however detect the client's illegal reuse and return NFS4ERR_SEQ_FALSE_RETRY.
If SEQUENCE returns an error, then the state of the slot (sequence ID, cached reply)
MUST NOT change, and the associated lease
MUST NOT be renewed.
If SEQUENCE returns NFS4_OK, then the associated lease
MUST be renewed (see
Section 8.3), except if SEQ4_STATUS_EXPIRED_ALL_STATE_REVOKED is returned in sr_status_flags.
The server
MUST maintain a mapping of session ID to client ID in order to validate any operations that follow SEQUENCE that take a stateid as an argument and/or result.
If the client establishes a persistent session, then a SEQUENCE received after a server restart might encounter requests performed and recorded in a persistent reply cache before the server restart. In this case, SEQUENCE will be processed successfully, while requests that were not previously performed and recorded are rejected with NFS4ERR_DEADSESSION.
Depending on which of the operations within the COMPOUND were successfully performed before the server restart, these operations will also have replies sent from the server reply cache. Note that when these operations establish locking state, it is locking state that applies to the previous server instance and to the previous client ID, even though the server restart, which logically happened after these operations, eliminated that state. In the case of a partially executed COMPOUND, processing may reach an operation not processed during the earlier server instance, making this operation a new one and not performable on the existing session. In this case, NFS4ERR_DEADSESSION will be returned from that operation.
struct ssa_digest_input4 {
SEQUENCE4args sdi_seqargs;
};
struct SET_SSV4args {
opaque ssa_ssv<>;
opaque ssa_digest<>;
};
struct ssr_digest_input4 {
SEQUENCE4res sdi_seqres;
};
struct SET_SSV4resok {
opaque ssr_digest<>;
};
union SET_SSV4res switch (nfsstat4 ssr_status) {
case NFS4_OK:
SET_SSV4resok ssr_resok4;
default:
void;
};
This operation is used to update the SSV for a client ID. Before SET_SSV is called the first time on a client ID, the SSV is zero. The SSV is the key used for the SSV GSS mechanism (
Section 2.10.9).
SET_SSV
MUST be preceded by a SEQUENCE operation in the same COMPOUND. It
MUST NOT be used if the client did not opt for SP4_SSV state protection when the client ID was created (see
Section 18.35); the server returns NFS4ERR_INVAL in that case.
The field ssa_digest is computed as the output of the HMAC ([
52]) using the subkey derived from the SSV4_SUBKEY_MIC_I2T and current SSV as the key (see
Section 2.10.9 for a description of subkeys), and an XDR encoded value of data type ssa_digest_input4. The field sdi_seqargs is equal to the arguments of the SEQUENCE operation for the COMPOUND procedure that SET_SSV is within.
The argument ssa_ssv is XORed with the current SSV to produce the new SSV. The argument ssa_ssv
SHOULD be generated randomly.
In the response, ssr_digest is the output of the HMAC using the subkey derived from SSV4_SUBKEY_MIC_T2I and new SSV as the key, and an XDR encoded value of data type ssr_digest_input4. The field sdi_seqres is equal to the results of the SEQUENCE operation for the COMPOUND procedure that SET_SSV is within.
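The SSV update and the two digests can be sketched as below. This is a simplified illustration under several assumptions: the subkeys are taken as already derived (the derivation is described in Section 2.10.9 and is not reproduced here), the XDR-encoded digest input is passed in as bytes, SHA-256 is used only as a placeholder for whatever hash algorithm was negotiated, and the equal-length check on the XOR is this sketch's own precondition. All function names are hypothetical.

```python
import hashlib
import hmac


def xor_ssv(current_ssv: bytes, ssa_ssv: bytes) -> bytes:
    """Produce the new SSV by XORing ssa_ssv into the current SSV."""
    if len(current_ssv) != len(ssa_ssv):
        raise ValueError("ssa_ssv must have the same length as the SSV")
    return bytes(a ^ b for a, b in zip(current_ssv, ssa_ssv))


def compute_digest(subkey: bytes, xdr_encoded_input: bytes,
                   hash_name: str = "sha256") -> bytes:
    """HMAC over an XDR-encoded ssa_digest_input4 or ssr_digest_input4.

    For ssa_digest the subkey is derived from SSV4_SUBKEY_MIC_I2T and the
    current SSV; for ssr_digest it is derived from SSV4_SUBKEY_MIC_T2I and
    the new SSV. The derivation itself is outside this sketch.
    """
    return hmac.new(subkey, xdr_encoded_input, hash_name).digest()


def verify_digest(subkey: bytes, xdr_encoded_input: bytes,
                  received: bytes, hash_name: str = "sha256") -> bool:
    """Recompute the digest and compare it in constant time."""
    expected = compute_digest(subkey, xdr_encoded_input, hash_name)
    return hmac.compare_digest(expected, received)
```

Because XOR is its own inverse, sending the same ssa_ssv twice would restore the earlier SSV in this model, which is one reason a fresh random value per SET_SSV is preferable.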
As noted in
Section 18.35, the client and server can maintain multiple concurrent versions of the SSV. The client and server each
MUST maintain an internal SSV version number, which is set to one the first time SET_SSV executes on the server and the client receives the first SET_SSV reply. Each subsequent SET_SSV increases the internal SSV version number by one. The value of this version number corresponds to the smpt_ssv_seq, smt_ssv_seq, sspt_ssv_seq, and ssct_ssv_seq fields of the SSV GSS mechanism tokens (see
Section 2.10.9).
When the server receives ssa_digest, it
MUST verify the digest by computing the digest the same way the client did and comparing it with ssa_digest. If the server gets a different result, this is an error, NFS4ERR_BAD_SESSION_DIGEST. This error might be the result of another SET_SSV from the same client ID changing the SSV. If so, the client recovers by sending a SET_SSV operation again with a recomputed digest based on the subkey of the new SSV. If the transport connection is dropped after the SET_SSV request is sent, but before the SET_SSV reply is received, then there are special considerations for recovery if the client has no more connections associated with sessions associated with the client ID of the SSV. See
Section 18.34.4.
Clients
SHOULD NOT send an ssa_ssv that is equal to a previous ssa_ssv, nor equal to a previous or current SSV (including an ssa_ssv equal to zero since the SSV is initialized to zero when the client ID is created).
Clients
SHOULD send SET_SSV with RPCSEC_GSS privacy. Servers
MUST support RPCSEC_GSS with privacy for any COMPOUND that has { SEQUENCE, SET_SSV }.
A client
SHOULD NOT send SET_SSV with the SSV GSS mechanism's credential because the purpose of SET_SSV is to seed the SSV from non-SSV credentials. Instead, SET_SSV
SHOULD be sent with the credential of a user that is accessing the client ID for the first time (
Section 2.10.8.3). However, if the client does send SET_SSV with SSV credentials, the digest protecting the arguments uses the value of the SSV before ssa_ssv is XORed in, and the digest protecting the results uses the value of the SSV after the ssa_ssv is XORed in.
struct TEST_STATEID4args {
stateid4 ts_stateids<>;
};
struct TEST_STATEID4resok {
nfsstat4 tsr_status_codes<>;
};
union TEST_STATEID4res switch (nfsstat4 tsr_status) {
case NFS4_OK:
TEST_STATEID4resok tsr_resok4;
default:
void;
};
The TEST_STATEID operation is used to check the validity of a set of stateids. It can be used at any time, but the client should definitely use it when it receives an indication that one or more of its stateids have been invalidated due to lock revocation. This occurs when the SEQUENCE operation returns with one of the following sr_status_flags set:
-
SEQ4_STATUS_EXPIRED_SOME_STATE_REVOKED
-
SEQ4_STATUS_ADMIN_STATE_REVOKED
-
SEQ4_STATUS_RECALLABLE_STATE_REVOKED
The client can use TEST_STATEID one or more times to test the validity of its stateids. Each use of TEST_STATEID allows a large set of such stateids to be tested and avoids problems with earlier stateids in a COMPOUND request from interfering with the checking of subsequent stateids, as would happen if individual stateids were tested by a series of corresponding operations in a COMPOUND request.
For each stateid, the server returns the status code that would be returned if that stateid were to be used in normal operation. Returning such a status indication is not an error and does not cause COMPOUND processing to terminate. Checks for the validity of the stateid proceed as they would for normal operations with a number of exceptions:
-
There is no check for the type of stateid object, as would be the case for normal use of a stateid.
-
There is no reference to the current filehandle.
-
Special stateids are always considered invalid (they result in the error code NFS4ERR_BAD_STATEID).
All stateids are interpreted as being associated with the client for the current session. Any possible association with a previous instance of the client (as stale stateids) is not considered.
The valid status values in the returned tsr_status_codes array are NFS4_OK, NFS4ERR_BAD_STATEID, NFS4ERR_OLD_STATEID, NFS4ERR_EXPIRED, NFS4ERR_ADMIN_REVOKED, and NFS4ERR_DELEG_REVOKED.
See Sections [
8.2.2] and [
8.2.4] for a discussion of stateid structure, lifetime, and validation.
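Since TEST_STATEID returns one status code per stateid, a client can pair the two arrays positionally to find the stateids that were revoked. The sketch below is one possible client-side policy, not a required one: the function name is hypothetical, stateids are treated as opaque values, status codes are represented as strings, and the choice to treat only the three revocation statuses as needing acknowledgment is an assumption.

```python
# Statuses taken here to mean that the server revoked or expired the state.
REVOCATION_STATUSES = {
    "NFS4ERR_EXPIRED",
    "NFS4ERR_ADMIN_REVOKED",
    "NFS4ERR_DELEG_REVOKED",
}


def revoked_stateids(ts_stateids, tsr_status_codes):
    """Return the stateids whose per-stateid status indicates revocation.

    Entries in tsr_status_codes correspond positionally to ts_stateids.
    """
    if len(ts_stateids) != len(tsr_status_codes):
        raise ValueError("expected one status code per stateid")
    return [stateid
            for stateid, status in zip(ts_stateids, tsr_status_codes)
            if status in REVOCATION_STATUSES]
```

A client following this policy could then acknowledge each returned stateid with FREE_STATEID, which is what clears the corresponding SEQUENCE status flags.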
union deleg_claim4 switch (open_claim_type4 dc_claim) {
/*
* No special rights to object. Ordinary delegation
* request of the specified object. Object identified
* by filehandle.
*/
case CLAIM_FH: /* new to v4.1 */
/* CURRENT_FH: object being delegated */
void;
/*
* Right to file based on a delegation granted
* to a previous boot instance of the client.
* File is specified by filehandle.
*/
case CLAIM_DELEG_PREV_FH: /* new to v4.1 */
/* CURRENT_FH: object being delegated */
void;
/*
* Right to the file established by an open previous
* to server reboot. File identified by filehandle.
* Used during server reclaim grace period.
*/
case CLAIM_PREVIOUS:
/* CURRENT_FH: object being reclaimed */
open_delegation_type4 dc_delegate_type;
};
struct WANT_DELEGATION4args {
uint32_t wda_want;
deleg_claim4 wda_claim;
};
union WANT_DELEGATION4res switch (nfsstat4 wdr_status) {
case NFS4_OK:
open_delegation4 wdr_resok4;
default:
void;
};
Where this description mandates the return of a specific error code for a specific condition, and where multiple conditions apply, the server
MAY return any of the mandated error codes.
This operation allows a client to:
-
Get a delegation on all types of files except directories.
-
Register a "want" for a delegation for the specified file object, and be notified via a callback when the delegation is available. The server MAY support notifications of availability via callbacks. If the server does not support registration of wants, it MUST NOT return an error to indicate that, and instead MUST return with ond_why set to WND4_CONTENTION or WND4_RESOURCE and ond_server_will_push_deleg or ond_server_will_signal_avail set to FALSE. When the server indicates that it will notify the client by means of a callback, it will either provide the delegation using a CB_PUSH_DELEG operation or cancel its promise by sending a CB_WANTS_CANCELLED operation.
-
Cancel a want for a delegation.
The client
SHOULD NOT set OPEN4_SHARE_ACCESS_READ and
SHOULD NOT set OPEN4_SHARE_ACCESS_WRITE in wda_want. If it does, the server
MUST ignore them.
The meanings of the following flags in wda_want are the same as they are in OPEN, except as noted below.
-
OPEN4_SHARE_ACCESS_WANT_READ_DELEG
-
OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG
-
OPEN4_SHARE_ACCESS_WANT_ANY_DELEG
-
OPEN4_SHARE_ACCESS_WANT_NO_DELEG. Unlike the OPEN operation, this flag SHOULD NOT be set by the client in the arguments to WANT_DELEGATION, and MUST be ignored by the server.
-
OPEN4_SHARE_ACCESS_WANT_CANCEL
-
OPEN4_SHARE_ACCESS_WANT_SIGNAL_DELEG_WHEN_RESRC_AVAIL
-
OPEN4_SHARE_ACCESS_WANT_PUSH_DELEG_WHEN_UNCONTENDED
The handling of the above flags in WANT_DELEGATION is the same as in OPEN. Information about the delegation and/or the promises the server is making regarding future callbacks are the same as those described in the open_delegation4 structure.
The successful results of WANT_DELEGATION are of data type open_delegation4, which is the same data type as the "delegation" field in the results of the OPEN operation (see Section 18.16.3). The server constructs wdr_resok4 the same way it constructs OPEN's "delegation" with one difference: WANT_DELEGATION MUST NOT return a delegation type of OPEN_DELEGATE_NONE.
If ((wda_want & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK) & ~OPEN4_SHARE_ACCESS_WANT_NO_DELEG) is zero, then the client is indicating no explicit desire or non-desire for a delegation and the server MUST return NFS4ERR_INVAL.
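As an illustration of the bit test above, the following minimal sketch evaluates the expression for a given wda_want. The numeric constants are recalled from the protocol's XDR description and should be checked against it; the function name expresses_no_want is hypothetical.

```python
# Constant values recalled from the NFSv4.1 XDR; verify before relying on them.
OPEN4_SHARE_ACCESS_WANT_DELEG_MASK = 0xFF00
OPEN4_SHARE_ACCESS_WANT_READ_DELEG = 0x0100
OPEN4_SHARE_ACCESS_WANT_NO_DELEG = 0x0400
OPEN4_SHARE_ACCESS_WANT_CANCEL = 0x0500


def expresses_no_want(wda_want: int) -> bool:
    """Return True when the test in the text yields zero.

    A server following the text would answer NFS4ERR_INVAL in that case.
    (Hypothetical helper name; not part of the protocol.)
    """
    want = wda_want & OPEN4_SHARE_ACCESS_WANT_DELEG_MASK
    return (want & ~OPEN4_SHARE_ACCESS_WANT_NO_DELEG) == 0
```

Under these assumed values, a wda_want of zero or of OPEN4_SHARE_ACCESS_WANT_NO_DELEG alone satisfies the test, while OPEN4_SHARE_ACCESS_WANT_READ_DELEG and OPEN4_SHARE_ACCESS_WANT_CANCEL do not.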
The client uses the OPEN4_SHARE_ACCESS_WANT_CANCEL flag in the WANT_DELEGATION operation to cancel a previously requested want for a delegation. Note that if the server is in the process of sending the delegation (via CB_PUSH_DELEG) at the time the client sends a cancellation of the want, the delegation might still be pushed to the client.
If WANT_DELEGATION fails to return a delegation, and the server returns NFS4_OK, the server MUST set the delegation type to OPEN4_DELEGATE_NONE_EXT, and set od_whynone, as described in Section 18.16. Write delegations are not available for file types that are not writable. This includes file objects of types NF4BLK, NF4CHR, NF4LNK, NF4SOCK, and NF4FIFO. If the client requests OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG without OPEN4_SHARE_ACCESS_WANT_READ_DELEG on an object with one of the aforementioned file types, the server must set wdr_resok4.od_whynone.ond_why to WND4_WRITE_DELEG_NOT_SUPP_FTYPE.
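The file-type rule for write delegations might be sketched as follows. This is an illustration under stated assumptions: the enum members carry arbitrary values rather than wire values, the want constants are recalled from the protocol's XDR, the helper name is hypothetical, and "write without read" is interpreted here as the masked want field being exactly OPEN4_SHARE_ACCESS_WANT_WRITE_DELEG.

```python
from enum import Enum, auto
from typing import Optional


class FileType(Enum):
    # Names come from the text; the numeric values are NOT the wire values.
    NF4REG = auto()
    NF4BLK = auto()
    NF4CHR = auto()
    NF4LNK = auto()
    NF4SOCK = auto()
    NF4FIFO = auto()


# File types the text lists as not writable.
NON_WRITABLE_TYPES = {
    FileType.NF4BLK,
    FileType.NF4CHR,
    FileType.NF4LNK,
    FileType.NF4SOCK,
    FileType.NF4FIFO,
}

# Values recalled from the NFSv4.1 XDR; verify before relying on them.
WANT_DELEG_MASK = 0xFF00
WANT_WRITE_DELEG = 0x0200


def why_none_for_ftype(wda_want: int, ftype: FileType) -> Optional[str]:
    """Return the ond_why name this rule mandates, or None if it does not apply."""
    want = wda_want & WANT_DELEG_MASK
    if want == WANT_WRITE_DELEG and ftype in NON_WRITABLE_TYPES:
        return "WND4_WRITE_DELEG_NOT_SUPP_FTYPE"
    return None
```

A return value of None only means that this particular rule is silent; other reasons for declining a delegation are outside the sketch.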
A request for a conflicting delegation is not normally intended to trigger the recall of the existing delegation. Servers may choose to treat some clients as having higher priority such that their wants will trigger recall of an existing delegation, although that is expected to be an unusual situation.
Servers will generally recall delegations assigned by WANT_DELEGATION on the same basis as those assigned by OPEN. CB_RECALL will generally be done only when other clients perform operations inconsistent with the delegation. The normal response to aging of delegations is to use CB_RECALL_ANY, in order to give the client the opportunity to keep the delegations most useful from its point of view.
struct DESTROY_CLIENTID4args {
clientid4 dca_clientid;
};
struct DESTROY_CLIENTID4res {
nfsstat4 dcr_status;
};
The DESTROY_CLIENTID operation destroys the client ID. If there are sessions (both idle and non-idle), opens, locks, delegations, layouts, and/or wants (Section 18.49) associated with the unexpired lease of the client ID, the server MUST return NFS4ERR_CLIENTID_BUSY. DESTROY_CLIENTID MAY be preceded with a SEQUENCE operation as long as the client ID derived from the session ID of SEQUENCE is not the same as the client ID to be destroyed. If the client IDs are the same, then the server MUST return NFS4ERR_CLIENTID_BUSY.
If DESTROY_CLIENTID is not prefixed by SEQUENCE, it MUST be the only operation in the COMPOUND request (otherwise, the server MUST return NFS4ERR_NOT_ONLY_OP). If the operation is sent without a SEQUENCE preceding it, a client that retransmits the request may receive an error in response, because the original request might have been successfully executed.
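The DESTROY_CLIENTID checks described above can be summarized in a small decision function. This is a sketch only: the function name, its parameters, and the order in which the checks are applied are assumptions made for illustration, and status codes are represented by their names rather than wire values.

```python
from typing import Optional


def destroy_clientid_status(
    target_clientid: int,
    sequence_clientid: Optional[int],
    ops_in_compound: int,
    has_associated_state: bool,
) -> str:
    """Pick a status name for DESTROY_CLIENTID (hypothetical helper).

    sequence_clientid is the client ID derived from a preceding SEQUENCE,
    or None when no SEQUENCE precedes the operation.
    """
    if sequence_clientid is None:
        # Without SEQUENCE, the operation must stand alone in the COMPOUND.
        if ops_in_compound != 1:
            return "NFS4ERR_NOT_ONLY_OP"
    elif sequence_clientid == target_clientid:
        # The SEQUENCE's session belongs to the client ID being destroyed.
        return "NFS4ERR_CLIENTID_BUSY"
    if has_associated_state:
        # Sessions, opens, locks, delegations, layouts, or wants remain.
        return "NFS4ERR_CLIENTID_BUSY"
    return "NFS4_OK"
```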
DESTROY_CLIENTID allows a server to immediately reclaim the resources consumed by an unused client ID, and also to forget that it ever generated the client ID. By forgetting that it ever generated the client ID, the server can safely reuse the client ID on a future EXCHANGE_ID operation.
struct RECLAIM_COMPLETE4args {
/*
* If rca_one_fs TRUE,
*
* CURRENT_FH: object in
* file system reclaim is
* complete for.
*/
bool rca_one_fs;
};
struct RECLAIM_COMPLETE4res {
nfsstat4 rcr_status;
};
A RECLAIM_COMPLETE operation is used to indicate that the client has reclaimed all of the locking state that it will recover using reclaim, when it is recovering state due to either a server restart or the migration of a file system to another server. There are two types of RECLAIM_COMPLETE operations:
-
When rca_one_fs is FALSE, a global RECLAIM_COMPLETE is being done. This indicates that recovery of all locks that the client held on the previous server instance has been completed. The current filehandle need not be set in this case.
-
When rca_one_fs is TRUE, a file system-specific RECLAIM_COMPLETE is being done. This indicates that recovery of locks for a single fs (the one designated by the current filehandle) due to the migration of the file system has been completed. Presence of a current filehandle is required when rca_one_fs is set to TRUE. When the current filehandle designates a filehandle in a file system not in the process of migration, the operation returns NFS4_OK and is otherwise ignored.
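The two forms of RECLAIM_COMPLETE can be distinguished as in the sketch below. The function name, the string results, and the representation of file systems as strings are assumptions for illustration; the text here does not name the error returned when the required current filehandle is missing, so the sketch leaves it unnamed.

```python
from typing import Optional, Set


def reclaim_complete_scope(
    rca_one_fs: bool,
    current_fh_fs: Optional[str],
    migrating_fs: Set[str],
) -> str:
    """Classify a RECLAIM_COMPLETE request (hypothetical helper).

    current_fh_fs identifies the file system of the current filehandle,
    or is None when no current filehandle is set.
    """
    if not rca_one_fs:
        # Global form: the current filehandle need not be set.
        return "global"
    if current_fh_fs is None:
        # Per-fs form requires a current filehandle.
        return "error: current filehandle required"
    if current_fh_fs not in migrating_fs:
        # Not in the process of migration: NFS4_OK, otherwise ignored.
        return "ignored"
    return "per-fs:" + current_fh_fs
```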
Once a RECLAIM_COMPLETE is done, there can be no further reclaim operations for locks whose scope is defined as having completed recovery. Once the client sends RECLAIM_COMPLETE, the server will not allow the client to do subsequent reclaims of locking state for that scope and, if these are attempted, will return NFS4ERR_NO_GRACE.
Whenever a client establishes a new client ID and before it does the first non-reclaim operation that obtains a lock, it MUST send a RECLAIM_COMPLETE with rca_one_fs set to FALSE, even if there are no locks to reclaim. If non-reclaim locking operations are done before the RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned.
Similarly, when the client accesses a migrated file system on a new server, before it sends the first non-reclaim operation that obtains a lock on this new server, it MUST send a RECLAIM_COMPLETE with rca_one_fs set to TRUE and current filehandle within that file system, even if there are no locks to reclaim. If non-reclaim locking operations are done on that file system before the RECLAIM_COMPLETE, an NFS4ERR_GRACE error will be returned.
It should be noted that there are situations in which a client needs to issue both forms of RECLAIM_COMPLETE. An example is an instance of file system migration in which the file system is migrated to a server for which the client has no client ID. As a result, the client needs to obtain a client ID from the server (incurring the responsibility to do RECLAIM_COMPLETE with rca_one_fs set to FALSE) as well as RECLAIM_COMPLETE with rca_one_fs set to TRUE to complete the per-fs grace period associated with the file system migration. These two may be done in any order as long as all necessary lock reclaims have been done before issuing either of them.
Any locks not reclaimed at the point at which RECLAIM_COMPLETE is done become non-reclaimable. The client MUST NOT attempt to reclaim them, either during the current server instance or in any subsequent server instance, or on another server to which responsibility for that file system is transferred. If the client were to do so, it would be violating the protocol by representing itself as owning locks that it does not own, and so has no right to reclaim. See Section 8.4.3 of [66] for a discussion of edge conditions related to lock reclaim.
By sending a RECLAIM_COMPLETE, the client indicates readiness to proceed to do normal non-reclaim locking operations. The client should be aware that such operations may temporarily result in NFS4ERR_GRACE errors until the server is ready to terminate its grace period.
Servers will typically use the information as to when reclaim activity is complete to reduce the length of the grace period. When the server maintains in persistent storage a list of clients that might have had locks, it is able to use the fact that all such clients have done a RECLAIM_COMPLETE to terminate the grace period and begin normal operations (i.e., grant requests for new locks) sooner than it might otherwise.
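One way a server could use this information is sketched below: it tracks the clients recorded in persistent storage as possibly holding locks and treats the grace period as over once each has sent RECLAIM_COMPLETE. The class and method names are hypothetical, and the sketch deliberately omits the grace period timer that would still bound the wait.

```python
from typing import Iterable, Set


class GraceTracker:
    """Tracks which recorded clients have yet to send RECLAIM_COMPLETE."""

    def __init__(self, clients_with_possible_locks: Iterable[str]) -> None:
        # Assumed to be loaded from the server's persistent storage.
        self.pending: Set[str] = set(clients_with_possible_locks)

    def reclaim_complete(self, client: str) -> None:
        # A client that was never recorded simply has no effect here.
        self.pending.discard(client)

    def in_grace(self) -> bool:
        # Grace can end early once no recorded client is still pending.
        return bool(self.pending)
```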
Latency can be minimized by doing a RECLAIM_COMPLETE as part of the COMPOUND request in which the last lock-reclaiming operation is done. When there are no reclaims to be done, RECLAIM_COMPLETE should be done immediately in order to allow the grace period to end as soon as possible.
RECLAIM_COMPLETE should only be done once for each server instance or occasion of the transition of a file system. If it is done a second time, the error NFS4ERR_COMPLETE_ALREADY will result. Note that because of the session feature's retry protection, retries of COMPOUND requests containing a RECLAIM_COMPLETE operation will not result in this error.
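The error behavior described in this section for the global form can be modeled per client as in the following sketch. The class is hypothetical, covers only a single scope, represents status codes by name, and assumes that a reclaim attempted before RECLAIM_COMPLETE is for a lock the client really held and arrives during the grace period.

```python
class ClientReclaimState:
    """Per-client view of one reclaim scope (illustrative model only)."""

    def __init__(self) -> None:
        self.complete = False

    def reclaim_lock(self) -> str:
        # Reclaims are refused once RECLAIM_COMPLETE has been processed.
        return "NFS4ERR_NO_GRACE" if self.complete else "NFS4_OK"

    def new_lock(self, server_in_grace: bool = False) -> str:
        # Non-reclaim locking needs RECLAIM_COMPLETE first, and may still
        # be deferred while the server remains in its grace period.
        if not self.complete or server_in_grace:
            return "NFS4ERR_GRACE"
        return "NFS4_OK"

    def reclaim_complete(self) -> str:
        if self.complete:
            return "NFS4ERR_COMPLETE_ALREADY"
        self.complete = True
        return "NFS4_OK"
```

Retries that are answered from the session's reply cache never reach this logic, which is why they do not produce NFS4ERR_COMPLETE_ALREADY.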
When a RECLAIM_COMPLETE is sent, the client effectively acknowledges any locks not yet reclaimed as lost. This allows the server to re-enable the client to recover locks if the occurrence of edge conditions, as described in Section 8.4.3, had caused the server to disable the client's ability to recover locks.
Because previous descriptions of RECLAIM_COMPLETE were not sufficiently explicit about the circumstances in which use of RECLAIM_COMPLETE with rca_one_fs set to TRUE was appropriate, there have been cases in which it has been misused by clients that have issued RECLAIM_COMPLETE with rca_one_fs set to TRUE when they should not have. There have also been cases in which servers have, in various ways, not responded to such misuse as described above, either ignoring the rca_one_fs setting (treating the operation as a global RECLAIM_COMPLETE) or ignoring the entire operation.
While clients SHOULD NOT misuse this feature, and servers SHOULD respond to such misuse as described above, implementors need to be aware of the following considerations as they make necessary trade-offs between interoperability with existing implementations and proper support for facilities to allow lock recovery in the event of file system migration.
-
When servers have no support for becoming the destination server of a file system subject to migration, there is no possibility of a per-fs RECLAIM_COMPLETE being done legitimately, and occurrences of it SHOULD be ignored. However, the negative consequences of accepting such mistaken use are quite limited as long as the client does not issue it before all necessary reclaims are done.
-
When a server might become the destination for a file system being migrated, inappropriate use of per-fs RECLAIM_COMPLETE is more concerning. In the case in which the file system designated is not within a per-fs grace period, the per-fs RECLAIM_COMPLETE SHOULD be ignored, with the negative consequences of accepting it being limited, as in the case in which migration is not supported. However, if the server encounters a file system undergoing migration, the operation cannot be accepted as if it were a global RECLAIM_COMPLETE without invalidating its intended use.
struct ILLEGAL4res {
nfsstat4 status;
};
This operation is a placeholder for encoding a result to handle the case of the client sending an operation code within COMPOUND that is not supported. See the COMPOUND procedure description for more details.
The status field of ILLEGAL4res MUST be set to NFS4ERR_OP_ILLEGAL.
A client will probably not send an operation with code OP_ILLEGAL, but if it does, the response will be ILLEGAL4res just as it would be with any other invalid operation code. Note that if the server gets an illegal operation code that is not OP_ILLEGAL, and if the server checks for legal operation codes during the XDR decode phase, then the ILLEGAL4res would not be returned.
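A server that validates operation codes while executing the COMPOUND, rather than during XDR decoding, might map unrecognized codes as in the sketch below. The numeric values are recalled from the protocol's XDR description and should be verified; RECOGNIZED_OPS is a deliberately tiny, hypothetical subset, and the function name is an assumption.

```python
from typing import Optional, Tuple

# Values recalled from the NFSv4.1 XDR; verify before relying on them.
OP_ILLEGAL = 10044
NFS4ERR_OP_ILLEGAL = 10044

# Hypothetical subset of recognized opcodes (ACCESS and SEQUENCE as recalled).
RECOGNIZED_OPS = {3, 53}


def result_for_opcode(opcode: int) -> Tuple[int, Optional[int]]:
    """Return (result opcode, forced status) for a requested opcode.

    A forced status of None means the operation is dispatched normally.
    """
    if opcode not in RECOGNIZED_OPS:
        # Includes the case in which the client sent OP_ILLEGAL itself.
        return (OP_ILLEGAL, NFS4ERR_OP_ILLEGAL)
    return (opcode, None)
```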