8. File Locking and Share Reservations
Integrating locking into the NFS protocol necessarily causes it to be state-full. With the inclusion of "share" file locks the protocol becomes substantially more dependent on state than the traditional combination of NFS and NLM [XNFS]. There are three components to making this state manageable: o Clear division between client and server o Ability to reliably detect inconsistency in state between client and server o Simple and robust recovery mechanisms
In this model, the server owns the state information. The client communicates its view of this state to the server as needed. The client is also able to detect inconsistent state before modifying a file. To support Win32 "share" locks it is necessary to atomically OPEN or CREATE files. Having a separate share/unshare operation would not allow correct implementation of the Win32 OpenFile API. In order to correctly implement share semantics, the previous NFS protocol mechanisms used when a file is opened or created (LOOKUP, CREATE, ACCESS) need to be replaced. The NFS version 4 protocol has an OPEN operation that subsumes the functionality of LOOKUP, CREATE, and ACCESS. However, because many operations require a filehandle, the traditional LOOKUP is preserved to map a file name to filehandle without establishing state on the server. The policy of granting access or modifying files is managed by the server based on the client's state. These mechanisms can implement policy ranging from advisory only locking to full mandatory locking.8.1. Locking
It is assumed that manipulating a lock is rare when compared to READ and WRITE operations. It is also assumed that crashes and network partitions are relatively rare. Therefore it is important that the READ and WRITE operations have a lightweight mechanism to indicate if they possess a held lock. A lock request contains the heavyweight information required to establish a lock and uniquely define the lock owner. The following sections describe the transition from the heavy weight information to the eventual stateid used for most client and server locking and lease interactions.8.1.1. Client ID
For each LOCK request, the client must identify itself to the server. This is done in such a way as to allow for correct lock identification and crash recovery. Client identification is accomplished with two values. o A verifier that is used to detect client reboots. o A variable length opaque array to uniquely define a client. For an operating system this may be a fully qualified host name or IP address. For a user level NFS client it may additionally contain a process id or other unique sequence.
The data structure for the Client ID would then appear as: struct nfs_client_id { opaque verifier[4]; opaque id<>; } It is possible through the mis-configuration of a client or the existence of a rogue client that two clients end up using the same nfs_client_id. This situation is avoided by "negotiating" the nfs_client_id between client and server with the use of the SETCLIENTID and SETCLIENTID_CONFIRM operations. The following describes the two scenarios of negotiation. 1 Client has never connected to the server In this case the client generates an nfs_client_id and unless another client has the same nfs_client_id.id field, the server accepts the request. The server also records the principal (or principal to uid mapping) from the credential in the RPC request that contains the nfs_client_id negotiation request (SETCLIENTID operation). Two clients might still use the same nfs_client_id.id due to perhaps configuration error. For example, a High Availability configuration where the nfs_client_id.id is derived from the ethernet controller address and both systems have the same address. In this case, the result is a switched union that returns, in addition to NFS4ERR_CLID_INUSE, the network address (the rpcbind netid and universal address) of the client that is using the id. 2 Client is re-connecting to the server after a client reboot In this case, the client still generates an nfs_client_id but the nfs_client_id.id field will be the same as the nfs_client_id.id generated prior to reboot. If the server finds that the principal/uid is equal to the previously "registered" nfs_client_id.id, then locks associated with the old nfs_client_id are immediately released. If the principal/uid is not equal, then this is a rogue client and the request is returned in error. For more discussion of crash recovery semantics, see the section on "Crash Recovery". It is possible for a retransmission of request to be received by the server after the server has acted upon and responded to the original client request. Therefore to mitigate effects of the retransmission of the SETCLIENTID operation, the client and server
use a confirmation step. The server returns a confirmation verifier that the client then sends to the server in the SETCLIENTID_CONFIRM operation. Once the server receives the confirmation from the client, the locking state for the client is released. In both cases, upon success, NFS4_OK is returned. To help reduce the amount of data transferred on OPEN and LOCK, the server will also return a unique 64-bit clientid value that is a shorthand reference to the nfs_client_id values presented by the client. From this point forward, the client will use the clientid to refer to itself. The clientid assigned by the server should be chosen so that it will not conflict with a clientid previously assigned by the server. This applies across server restarts or reboots. When a clientid is presented to a server and that clientid is not recognized, as would happen after a server reboot, the server will reject the request with the error NFS4ERR_STALE_CLIENTID. When this happens, the client must obtain a new clientid by use of the SETCLIENTID operation and then proceed to any other necessary recovery for the server reboot case (See the section "Server Failure and Recovery"). The client must also employ the SETCLIENTID operation when it receives a NFS4ERR_STALE_STATEID error using a stateid derived from its current clientid, since this also indicates a server reboot which has invalidated the existing clientid (see the next section "nfs_lockowner and stateid Definition" for details).8.1.2. Server Release of Clientid
If the server determines that the client holds no associated state for its clientid, the server may choose to release the clientid. The server may make this choice for an inactive client so that resources are not consumed by those intermittently active clients. If the client contacts the server after this release, the server must ensure the client receives the appropriate error so that it will use the SETCLIENTID/SETCLIENTID_CONFIRM sequence to establish a new identity. It should be clear that the server must be very hesitant to release a clientid since the resulting work on the client to recover from such an event will be the same burden as if the server had failed and restarted. Typically a server would not release a clientid unless there had been no activity from that client for many minutes.
8.1.3. nfs_lockowner and stateid Definition
When requesting a lock, the client must present to the server the clientid and an identifier for the owner of the requested lock. These two fields are referred to as the nfs_lockowner and the definition of those fields are: o A clientid returned by the server as part of the client's use of the SETCLIENTID operation. o A variable length opaque array used to uniquely define the owner of a lock managed by the client. This may be a thread id, process id, or other unique value. When the server grants the lock, it responds with a unique 64-bit stateid. The stateid is used as a shorthand reference to the nfs_lockowner, since the server will be maintaining the correspondence between them. The server is free to form the stateid in any manner that it chooses as long as it is able to recognize invalid and out-of-date stateids. This requirement includes those stateids generated by earlier instances of the server. From this, the client can be properly notified of a server restart. This notification will occur when the client presents a stateid to the server from a previous instantiation. The server must be able to distinguish the following situations and return the error as specified: o The stateid was generated by an earlier server instance (i.e. before a server reboot). The error NFS4ERR_STALE_STATEID should be returned. o The stateid was generated by the current server instance but the stateid no longer designates the current locking state for the lockowner-file pair in question (i.e. one or more locking operations has occurred). The error NFS4ERR_OLD_STATEID should be returned. This error condition will only occur when the client issues a locking request which changes a stateid while an I/O request that uses that stateid is outstanding.
o The stateid was generated by the current server instance but the stateid does not designate a locking state for any active lockowner-file pair. The error NFS4ERR_BAD_STATEID should be returned. This error condition will occur when there has been a logic error on the part of the client or server. This should not happen. One mechanism that may be used to satisfy these requirements is for the server to divide stateids into three fields: o A server verifier which uniquely designates a particular server instantiation. o An index into a table of locking-state structures. o A sequence value which is incremented for each stateid that is associated with the same index into the locking-state table. By matching the incoming stateid and its field values with the state held at the server, the server is able to easily determine if a stateid is valid for its current instantiation and state. If the stateid is not valid, the appropriate error can be supplied to the client.8.1.4. Use of the stateid
All READ and WRITE operations contain a stateid. If the nfs_lockowner performs a READ or WRITE on a range of bytes within a locked range, the stateid (previously returned by the server) must be used to indicate that the appropriate lock (record or share) is held. If no state is established by the client, either record lock or share lock, a stateid of all bits 0 is used. If no conflicting locks are held on the file, the server may service the READ or WRITE operation. If a conflict with an explicit lock occurs, an error is returned for the operation (NFS4ERR_LOCKED). This allows "mandatory locking" to be implemented. A stateid of all bits 1 (one) allows READ operations to bypass record locking checks at the server. However, WRITE operations with stateid with bits all 1 (one) do not bypass record locking checks. File locking checks are handled by the OPEN operation (see the section "OPEN/CLOSE Operations"). An explicit lock may not be granted while a READ or WRITE operation with conflicting implicit locking is being performed.
8.1.5. Sequencing of Lock Requests
Locking is different than most NFS operations as it requires "at- most-one" semantics that are not provided by ONCRPC. ONCRPC over a reliable transport is not sufficient because a sequence of locking requests may span multiple TCP connections. In the face of retransmission or reordering, lock or unlock requests must have a well defined and consistent behavior. To accomplish this, each lock request contains a sequence number that is a consecutively increasing integer. Different nfs_lockowners have different sequences. The server maintains the last sequence number (L) received and the response that was returned. Note that for requests that contain a sequence number, for each nfs_lockowner, there should be no more than one outstanding request. If a request with a previous sequence number (r < L) is received, it is rejected with the return of error NFS4ERR_BAD_SEQID. Given a properly-functioning client, the response to (r) must have been received before the last request (L) was sent. If a duplicate of last request (r == L) is received, the stored response is returned. If a request beyond the next sequence (r == L + 2) is received, it is rejected with the return of error NFS4ERR_BAD_SEQID. Sequence history is reinitialized whenever the client verifier changes. Since the sequence number is represented with an unsigned 32-bit integer, the arithmetic involved with the sequence number is mod 2^32. It is critical the server maintain the last response sent to the client to provide a more reliable cache of duplicate non-idempotent requests than that of the traditional cache described in [Juszczak]. The traditional duplicate request cache uses a least recently used algorithm for removing unneeded requests. However, the last lock request and response on a given nfs_lockowner must be cached as long as the lock state exists on the server.8.1.6. Recovery from Replayed Requests
As described above, the sequence number is per nfs_lockowner. As long as the server maintains the last sequence number received and follows the methods described above, there are no risks of a Byzantine router re-sending old requests. The server need only maintain the nfs_lockowner, sequence number state as long as there are open files or closed files with locks outstanding.
LOCK, LOCKU, OPEN, OPEN_DOWNGRADE, and CLOSE each contain a sequence number and therefore the risk of the replay of these operations resulting in undesired effects is non-existent while the server maintains the nfs_lockowner state.8.1.7. Releasing nfs_lockowner State
When a particular nfs_lockowner no longer holds open or file locking state at the server, the server may choose to release the sequence number state associated with the nfs_lockowner. The server may make this choice based on lease expiration, for the reclamation of server memory, or other implementation specific details. In any event, the server is able to do this safely only when the nfs_lockowner no longer is being utilized by the client. The server may choose to hold the nfs_lockowner state in the event that retransmitted requests are received. However, the period to hold this state is implementation specific. In the case that a LOCK, LOCKU, OPEN_DOWNGRADE, or CLOSE is retransmitted after the server has previously released the nfs_lockowner state, the server will find that the nfs_lockowner has no files open and an error will be returned to the client. If the nfs_lockowner does have a file open, the stateid will not match and again an error is returned to the client. In the case that an OPEN is retransmitted and the nfs_lockowner is being used for the first time or the nfs_lockowner state has been previously released by the server, the use of the OPEN_CONFIRM operation will prevent incorrect behavior. When the server observes the use of the nfs_lockowner for the first time, it will direct the client to perform the OPEN_CONFIRM for the corresponding OPEN. This sequence establishes the use of an nfs_lockowner and associated sequence number. See the section "OPEN_CONFIRM - Confirm Open" for further details.8.2. Lock Ranges
The protocol allows a lock owner to request a lock with one byte range and then either upgrade or unlock a sub-range of the initial lock. It is expected that this will be an uncommon type of request. In any case, servers or server file systems may not be able to support sub-range lock semantics. In the event that a server receives a locking request that represents a sub-range of current locking state for the lock owner, the server is allowed to return the error NFS4ERR_LOCK_RANGE to signify that it does not support sub- range lock operations. Therefore, the client should be prepared to receive this error and, if appropriate, report the error to the requesting application.
The client is discouraged from combining multiple independent locking ranges that happen to be adjacent into a single request since the server may not support sub-range requests and for reasons related to the recovery of file locking state in the event of server failure. As discussed in the section "Server Failure and Recovery" below, the server may employ certain optimizations during recovery that work effectively only when the client's behavior during lock recovery is similar to the client's locking behavior prior to server failure.8.3. Blocking Locks
Some clients require the support of blocking locks. The NFS version 4 protocol must not rely on a callback mechanism and therefore is unable to notify a client when a previously denied lock has been granted. Clients have no choice but to continually poll for the lock. This presents a fairness problem. Two new lock types are added, READW and WRITEW, and are used to indicate to the server that the client is requesting a blocking lock. The server should maintain an ordered list of pending blocking locks. When the conflicting lock is released, the server may wait the lease period for the first waiting client to re-request the lock. After the lease period expires the next waiting client request is allowed the lock. Clients are required to poll at an interval sufficiently small that it is likely to acquire the lock in a timely manner. The server is not required to maintain a list of pending blocked locks as it is used to increase fairness and not correct operation. Because of the unordered nature of crash recovery, storing of lock state to stable storage would be required to guarantee ordered granting of blocking locks. Servers may also note the lock types and delay returning denial of the request to allow extra time for a conflicting lock to be released, allowing a successful return. In this way, clients can avoid the burden of needlessly frequent polling for blocking locks. The server should take care in the length of delay in the event the client retransmits the request.8.4. Lease Renewal
The purpose of a lease is to allow a server to remove stale locks that are held by a client that has crashed or is otherwise unreachable. It is not a mechanism for cache consistency and lease renewals may not be denied if the lease interval has not expired. The following events cause implicit renewal of all of the leases for a given client (i.e. all those sharing a given clientid). Each of these is a positive indication that the client is still active and
that the associated state held at the server, for the client, is still valid. o An OPEN with a valid clientid. o Any operation made with a valid stateid (CLOSE, DELEGRETURN, LOCK, LOCKU, OPEN, OPEN_CONFIRM, READ, RENEW, SETATTR, WRITE). This does not include the special stateids of all bits 0 or all bits 1. Note that if the client had restarted or rebooted, the client would not be making these requests without issuing the SETCLIENTID operation. The use of the SETCLIENTID operation (possibly with the addition of the optional SETCLIENTID_CONFIRM operation) notifies the server to drop the locking state associated with the client. If the server has rebooted, the stateids (NFS4ERR_STALE_STATEID error) or the clientid (NFS4ERR_STALE_CLIENTID error) will not be valid hence preventing spurious renewals. This approach allows for low overhead lease renewal which scales well. In the typical case no extra RPC calls are required for lease renewal and in the worst case one RPC is required every lease period (i.e. a RENEW operation). The number of locks held by the client is not a factor since all state for the client is involved with the lease renewal action. Since all operations that create a new lease also renew existing leases, the server must maintain a common lease expiration time for all valid leases for a given client. This lease time can then be easily updated upon implicit lease renewal actions.8.5. Crash Recovery
The important requirement in crash recovery is that both the client and the server know when the other has failed. Additionally, it is required that a client sees a consistent view of data across server restarts or reboots. All READ and WRITE operations that may have been queued within the client or network buffers must wait until the client has successfully recovered the locks protecting the READ and WRITE operations.8.5.1. Client Failure and Recovery
In the event that a client fails, the server may recover the client's locks when the associated leases have expired. Conflicting locks from another client may only be granted after this lease expiration. If the client is able to restart or reinitialize within the lease
period the client may be forced to wait the remainder of the lease period before obtaining new locks. To minimize client delay upon restart, lock requests are associated with an instance of the client by a client supplied verifier. This verifier is part of the initial SETCLIENTID call made by the client. The server returns a clientid as a result of the SETCLIENTID operation. The client then confirms the use of the verifier with SETCLIENTID_CONFIRM. The clientid in combination with an opaque owner field is then used by the client to identify the lock owner for OPEN. This chain of associations is then used to identify all locks for a particular client. Since the verifier will be changed by the client upon each initialization, the server can compare a new verifier to the verifier associated with currently held locks and determine that they do not match. This signifies the client's new instantiation and subsequent loss of locking state. As a result, the server is free to release all locks held which are associated with the old clientid which was derived from the old verifier. For secure environments, a change in the verifier must only cause the release of locks associated with the authenticated requester. This is required to prevent a rogue entity from freeing otherwise valid locks. Note that the verifier must have the same uniqueness properties of the verifier for the COMMIT operation.8.5.2. Server Failure and Recovery
If the server loses locking state (usually as a result of a restart or reboot), it must allow clients time to discover this fact and re- establish the lost locking state. The client must be able to re- establish the locking state without having the server deny valid requests because the server has granted conflicting access to another client. Likewise, if there is the possibility that clients have not yet re-established their locking state for a file, the server must disallow READ and WRITE operations for that file. The duration of this recovery period is equal to the duration of the lease period. A client can determine that server failure (and thus loss of locking state) has occurred, when it receives one of two errors. The NFS4ERR_STALE_STATEID error indicates a stateid invalidated by a reboot or restart. The NFS4ERR_STALE_CLIENTID error indicates a clientid invalidated by reboot or restart. When either of these are received, the client must establish a new clientid (See the section "Client ID") and re-establish the locking state as discussed below.
The period of special handling of locking and READs and WRITEs, equal in duration to the lease period, is referred to as the "grace period". During the grace period, clients recover locks and the associated state by reclaim-type locking requests (i.e. LOCK requests with reclaim set to true and OPEN operations with a claim type of CLAIM_PREVIOUS). During the grace period, the server must reject READ and WRITE operations and non-reclaim locking requests (i.e. other LOCK and OPEN operations) with an error of NFS4ERR_GRACE. If the server can reliably determine that granting a non-reclaim request will not conflict with reclamation of locks by other clients, the NFS4ERR_GRACE error does not have to be returned and the non- reclaim client request can be serviced. For the server to be able to service READ and WRITE operations during the grace period, it must again be able to guarantee that no possible conflict could arise between an impending reclaim locking request and the READ or WRITE operation. If the server is unable to offer that guarantee, the NFS4ERR_GRACE error must be returned to the client. For a server to provide simple, valid handling during the grace period, the easiest method is to simply reject all non-reclaim locking requests and READ and WRITE operations by returning the NFS4ERR_GRACE error. However, a server may keep information about granted locks in stable storage. With this information, the server could determine if a regular lock or READ or WRITE operation can be safely processed. For example, if a count of locks on a given file is available in stable storage, the server can track reclaimed locks for the file and when all reclaims have been processed, non-reclaim locking requests may be processed. This way the server can ensure that non-reclaim locking requests will not conflict with potential reclaim requests. With respect to I/O requests, if the server is able to determine that there are no outstanding reclaim requests for a file by information from stable storage or another similar mechanism, the processing of I/O requests could proceed normally for the file. To reiterate, for a server that allows non-reclaim lock and I/O requests to be processed during the grace period, it MUST determine that no lock subsequently reclaimed will be rejected and that no lock subsequently reclaimed would have prevented any I/O operation processed during the grace period. Clients should be prepared for the return of NFS4ERR_GRACE errors for non-reclaim lock and I/O requests. In this case the client should employ a retry mechanism for the request. A delay (on the order of several seconds) between retries should be used to avoid overwhelming the server. Further discussion of the general is included in
[Floyd]. The client must account for the server that is able to perform I/O and non-reclaim locking requests within the grace period as well as those that can not do so. A reclaim-type locking request outside the server's grace period can only succeed if the server can guarantee that no conflicting lock or I/O request has been granted since reboot or restart.8.5.3. Network Partitions and Recovery
If the duration of a network partition is greater than the lease period provided by the server, the server will have not received a lease renewal from the client. If this occurs, the server may free all locks held for the client. As a result, all stateids held by the client will become invalid or stale. Once the client is able to reach the server after such a network partition, all I/O submitted by the client with the now invalid stateids will fail with the server returning the error NFS4ERR_EXPIRED. Once this error is received, the client will suitably notify the application that held the lock. As a courtesy to the client or as an optimization, the server may continue to hold locks on behalf of a client for which recent communication has extended beyond the lease period. If the server receives a lock or I/O request that conflicts with one of these courtesy locks, the server must free the courtesy lock and grant the new request. If the server continues to hold locks beyond the expiration of a client's lease, the server MUST employ a method of recording this fact in its stable storage. Conflicting locks requests from another client may be serviced after the lease expiration. There are various scenarios involving server failure after such an event that require the storage of these lease expirations or network partitions. One scenario is as follows: A client holds a lock at the server and encounters a network partition and is unable to renew the associated lease. A second client obtains a conflicting lock and then frees the lock. After the unlock request by the second client, the server reboots or reinitializes. Once the server recovers, the network partition heals and the original client attempts to reclaim the original lock. In this scenario and without any state information, the server will allow the reclaim and the client will be in an inconsistent state because the server or the client has no knowledge of the conflicting lock.
The server may choose to store this lease expiration or network partitioning state in a way that will only identify the client as a whole. Note that this may potentially lead to lock reclaims being denied unnecessarily because of a mix of conflicting and non- conflicting locks. The server may also choose to store information about each lock that has an expired lease with an associated conflicting lock. The choice of the amount and type of state information that is stored is left to the implementor. In any case, the server must have enough state information to enable correct recovery from multiple partitions and multiple server failures.8.6. Recovery from a Lock Request Timeout or Abort
In the event a lock request times out, a client may decide to not retry the request. The client may also abort the request when the process for which it was issued is terminated (e.g. in UNIX due to a signal. It is possible though that the server received the request and acted upon it. This would change the state on the server without the client being aware of the change. It is paramount that the client re-synchronize state with server before it attempts any other operation that takes a seqid and/or a stateid with the same nfs_lockowner. This is straightforward to do without a special re- synchronize operation. Since the server maintains the last lock request and response received on the nfs_lockowner, for each nfs_lockowner, the client should cache the last lock request it sent such that the lock request did not receive a response. From this, the next time the client does a lock operation for the nfs_lockowner, it can send the cached request, if there is one, and if the request was one that established state (e.g. a LOCK or OPEN operation) the client can follow up with a request to remove the state (e.g. a LOCKU or CLOSE operation). With this approach, the sequencing and stateid information on the client and server for the given nfs_lockowner will re-synchronize and in turn the lock state will re-synchronize.8.7. Server Revocation of Locks
At any point, the server can revoke locks held by a client and the client must be prepared for this event. When the client detects that its locks have been or may have been revoked, the client is responsible for validating the state information between itself and the server. Validating locking state for the client means that it must verify or reclaim state for each lock currently held.
The first instance of lock revocation is upon server reboot or re- initialization. In this instance the client will receive an error (NFS4ERR_STALE_STATEID or NFS4ERR_STALE_CLIENTID) and the client will proceed with normal crash recovery as described in the previous section. The second lock revocation event is the inability to renew the lease period. While this is considered a rare or unusual event, the client must be prepared to recover. Both the server and client will be able to detect the failure to renew the lease and are capable of recovering without data corruption. For the server, it tracks the last renewal event serviced for the client and knows when the lease will expire. Similarly, the client must track operations which will renew the lease period. Using the time that each such request was sent and the time that the corresponding reply was received, the client should bound the time that the corresponding renewal could have occurred on the server and thus determine if it is possible that a lease period expiration could have occurred. The third lock revocation event can occur as a result of administrative intervention within the lease period. While this is considered a rare event, it is possible that the server's administrator has decided to release or revoke a particular lock held by the client. As a result of revocation, the client will receive an error of NFS4ERR_EXPIRED and the error is received within the lease period for the lock. In this instance the client may assume that only the nfs_lockowner's locks have been lost. The client notifies the lock holder appropriately. The client may not assume the lease period has been renewed as a result of failed operation. When the client determines the lease period may have expired, the client must mark all locks held for the associated lease as "unvalidated". This means the client has been unable to re-establish or confirm the appropriate lock state with the server. As described in the previous section on crash recovery, there are scenarios in which the server may grant conflicting locks after the lease period has expired for a client. When it is possible that the lease period has expired, the client must validate each lock currently held to ensure that a conflicting lock has not been granted. The client may accomplish this task by issuing an I/O request, either a pending I/O or a zero-length read, specifying the stateid associated with the lock in question. If the response to the request is success, the client has validated all of the locks governed by that stateid and re-established the appropriate state between itself and the server. If the I/O request is not successful, then one or more of the locks associated with the stateid was revoked by the server and the client must notify the owner.
8.8. Share Reservations
A share reservation is a mechanism to control access to a file. It is a separate and independent mechanism from record locking. When a client opens a file, it issues an OPEN operation to the server specifying the type of access required (READ, WRITE, or BOTH) and the type of access to deny others (deny NONE, READ, WRITE, or BOTH). If the OPEN fails the client will fail the application's open request. Pseudo-code definition of the semantics: if ((request.access & file_state.deny)) || (request.deny & file_state.access)) return (NFS4ERR_DENIED) The constants used for the OPEN and OPEN_DOWNGRADE operations for the access and deny fields are as follows: const OPEN4_SHARE_ACCESS_READ = 0x00000001; const OPEN4_SHARE_ACCESS_WRITE = 0x00000002; const OPEN4_SHARE_ACCESS_BOTH = 0x00000003; const OPEN4_SHARE_DENY_NONE = 0x00000000; const OPEN4_SHARE_DENY_READ = 0x00000001; const OPEN4_SHARE_DENY_WRITE = 0x00000002; const OPEN4_SHARE_DENY_BOTH = 0x00000003;8.9. OPEN/CLOSE Operations
To provide correct share semantics, a client MUST use the OPEN operation to obtain the initial filehandle and indicate the desired access and what if any access to deny. Even if the client intends to use a stateid of all 0's or all 1's, it must still obtain the filehandle for the regular file with the OPEN operation so the appropriate share semantics can be applied. For clients that do not have a deny mode built into their open programming interfaces, deny equal to NONE should be used. The OPEN operation with the CREATE flag, also subsumes the CREATE operation for regular files as used in previous versions of the NFS protocol. This allows a create with a share to be done atomically. The CLOSE operation removes all share locks held by the nfs_lockowner on that file. If record locks are held, the client SHOULD release all locks before issuing a CLOSE. The server MAY free all outstanding locks on CLOSE but some servers may not support the CLOSE of a file that still has record locks held. The server MUST return failure if any locks would exist after the CLOSE.
The LOOKUP operation will return a filehandle without establishing any lock state on the server. Without a valid stateid, the server will assume the client has the least access. For example, a file opened with deny READ/WRITE cannot be accessed using a filehandle obtained through LOOKUP because it would not have a valid stateid (i.e. using a stateid of all bits 0 or all bits 1).8.10. Open Upgrade and Downgrade
When an OPEN is done for a file and the lockowner for which the open is being done already has the file open, the result is to upgrade the open file status maintained on the server to include the access and deny bits specified by the new OPEN as well as those for the existing OPEN. The result is that there is one open file, as far as the protocol is concerned, and it includes the union of the access and deny bits for all of the OPEN requests completed. Only a single CLOSE will be done to reset the effects of both OPEN's. Note that the client, when issuing the OPEN, may not know that the same file is in fact being opened. The above only applies if both OPEN's result in the OPEN'ed object being designated by the same filehandle. When the server chooses to export multiple filehandles corresponding to the same file object and returns different filehandles on two different OPEN's of the same file object, the server MUST NOT "OR" together the access and deny bits and coalesce the two open files. Instead the server must maintain separate OPEN's with separate stateid's and will require separate CLOSE's to free them. When multiple open files on the client are merged into a single open file object on the server, the close of one of the open files (on the client) may necessitate change of the access and deny status of the open file on the server. This is because the union of the access and deny bits for the remaining open's may be smaller (i.e. a proper subset) than previously. The OPEN_DOWNGRADE operation is used to make the necessary change and the client should use it to update the server so that share reservation requests by other clients are handled properly.8.11. Short and Long Leases
When determining the time period for the server lease, the usual lease tradeoffs apply. Short leases are good for fast server recovery at a cost of increased RENEW or READ (with zero length) requests. Longer leases are certainly kinder and gentler to large internet servers trying to handle very large numbers of clients. The number of RENEW requests drop in proportion to the lease time. The disadvantages of long leases are slower recovery after server failure (server must wait for leases to expire and grace period before
granting new lock requests) and increased file contention (if client fails to transmit an unlock request then server must wait for lease expiration before granting new locks). Long leases are usable if the server is able to store lease state in non-volatile memory. Upon recovery, the server can reconstruct the lease state from its non-volatile memory and continue operation with its clients and therefore long leases are not an issue.8.12. Clocks and Calculating Lease Expiration
To avoid the need for synchronized clocks, lease times are granted by the server as a time delta. However, there is a requirement that the client and server clocks do not drift excessively over the duration of the lock. There is also the issue of propagation delay across the network which could easily be several hundred milliseconds as well as the possibility that requests will be lost and need to be retransmitted. To take propagation delay into account, the client should subtract it from lease times (e.g. if the client estimates the one-way propagation delay as 200 msec, then it can assume that the lease is already 200 msec old when it gets it). In addition, it will take another 200 msec to get a response back to the server. So the client must send a lock renewal or write data back to the server 400 msec before the lease would expire.8.13. Migration, Replication and State
When responsibility for handling a given file system is transferred to a new server (migration) or the client chooses to use an alternate server (e.g. in response to server unresponsiveness) in the context of file system replication, the appropriate handling of state shared between the client and server (i.e. locks, leases, stateid's, and clientid's) is as described below. The handling differs between migration and replication. For related discussion of file server state and recover of such see the sections under "File Locking and Share Reservations"8.13.1. Migration and State
In the case of migration, the servers involved in the migration of a file system SHOULD transfer all server state from the original to the new server. This must be done in a way that is transparent to the client. This state transfer will ease the client's transition when a file system migration occurs. If the servers are successful in transferring all state, the client will continue to use stateid's assigned by the original server. Therefore the new server must
recognize these stateid's as valid. This holds true for the clientid as well. Since responsibility for an entire file system is transferred with a migration event, there is no possibility that conflicts will arise on the new server as a result of the transfer of locks. As part of the transfer of information between servers, leases would be transferred as well. The leases being transferred to the new server will typically have a different expiration time from those for the same client, previously on the new server. To maintain the property that all leases on a given server for a given client expire at the same time, the server should advance the expiration time to the later of the leases being transferred or the leases already present. This allows the client to maintain lease renewal of both classes without special effort. The servers may choose not to transfer the state information upon migration. However, this choice is discouraged. In this case, when the client presents state information from the original server, the client must be prepared to receive either NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID from the new server. The client should then recover its state information as it normally would in response to a server failure. The new server must take care to allow for the recovery of state information as it would in the event of server restart.8.13.2. Replication and State
Since client switch-over in the case of replication is not under server control, the handling of state is different. In this case, leases, stateid's and clientid's do not have validity across a transition from one server to another. The client must re-establish its locks on the new server. This can be compared to the re- establishment of locks by means of reclaim-type requests after a server reboot. The difference is that the server has no provision to distinguish requests reclaiming locks from those obtaining new locks or to defer the latter. Thus, a client re-establishing a lock on the new server (by means of a LOCK or OPEN request), may have the requests denied due to a conflicting lock. Since replication is intended for read-only use of filesystems, such denial of locks should not pose large difficulties in practice. When an attempt to re-establish a lock on a new server is denied, the client should treat the situation as if his original lock had been revoked.
8.13.3. Notification of Migrated Lease
In the case of lease renewal, the client may not be submitting requests for a file system that has been migrated to another server. This can occur because of the implicit lease renewal mechanism. The client renews leases for all file systems when submitting a request to any one file system at the server. In order for the client to schedule renewal of leases that may have been relocated to the new server, the client must find out about lease relocation before those leases expire. To accomplish this, all operations which implicitly renew leases for a client (i.e. OPEN, CLOSE, READ, WRITE, RENEW, LOCK, LOCKT, LOCKU), will return the error NFS4ERR_LEASE_MOVED if responsibility for any of the leases to be renewed has been transferred to a new server. This condition will continue until the client receives an NFS4ERR_MOVED error and the server receives the subsequent GETATTR(fs_locations) for an access to each file system for which a lease has been moved to a new server. When a client receives an NFS4ERR_LEASE_MOVED error, it should perform some operation, such as a RENEW, on each file system associated with the server in question. When the client receives an NFS4ERR_MOVED error, the client can follow the normal process to obtain the new server information (through the fs_locations attribute) and perform renewal of those leases on the new server. If the server has not had state transferred to it transparently, it will receive either NFS4ERR_STALE_CLIENTID or NFS4ERR_STALE_STATEID from the new server, as described above, and can then recover state information as it does in the event of server failure.9. Client-Side Caching
Client-side caching of data, of file attributes, and of file names is essential to providing good performance with the NFS protocol. Providing distributed cache coherence is a difficult problem and previous versions of the NFS protocol have not attempted it. Instead, several NFS client implementation techniques have been used to reduce the problems that a lack of coherence poses for users. These techniques have not been clearly defined by earlier protocol specifications and it is often unclear what is valid or invalid client behavior. The NFS version 4 protocol uses many techniques similar to those that have been used in previous protocol versions. The NFS version 4 protocol does not provide distributed cache coherence. However, it defines a more limited set of caching guarantees to allow locks and share reservations to be used without destructive interference from client side caching.
In addition, the NFS version 4 protocol introduces a delegation mechanism which allows many decisions normally made by the server to be made locally by clients. This mechanism provides efficient support of the common cases where sharing is infrequent or where sharing is read-only.9.1. Performance Challenges for Client-Side Caching
Caching techniques used in previous versions of the NFS protocol have been successful in providing good performance. However, several scalability challenges can arise when those techniques are used with very large numbers of clients. This is particularly true when clients are geographically distributed which classically increases the latency for cache revalidation requests. The previous versions of the NFS protocol repeat their file data cache validation requests at the time the file is opened. This behavior can have serious performance drawbacks. A common case is one in which a file is only accessed by a single client. Therefore, sharing is infrequent. In this case, repeated reference to the server to find that no conflicts exist is expensive. A better option with regards to performance is to allow a client that repeatedly opens a file to do so without reference to the server. This is done until potentially conflicting operations from another client actually occur. A similar situation arises in connection with file locking. Sending file lock and unlock requests to the server as well as the read and write requests necessary to make data caching consistent with the locking semantics (see the section "Data Caching and File Locking") can severely limit performance. When locking is used to provide protection against infrequent conflicts, a large penalty is incurred. This penalty may discourage the use of file locking by applications. The NFS version 4 protocol provides more aggressive caching strategies with the following design goals: o Compatibility with a large range of server semantics. o Provide the same caching benefits as previous versions of the NFS protocol when unable to provide the more aggressive model. o Requirements for aggressive caching are organized so that a large portion of the benefit can be obtained even when not all of the requirements can be met.
The appropriate requirements for the server are discussed in later sections in which specific forms of caching are covered. (see the section "Open Delegation").9.2. Delegation and Callbacks
Recallable delegation of server responsibilities for a file to a client improves performance by avoiding repeated requests to the server in the absence of inter-client conflict. With the use of a "callback" RPC from server to client, a server recalls delegated responsibilities when another client engages in sharing of a delegated file. A delegation is passed from the server to the client, specifying the object of the delegation and the type of delegation. There are different types of delegations but each type contains a stateid to be used to represent the delegation when performing operations that depend on the delegation. This stateid is similar to those associated with locks and share reservations but differs in that the stateid for a delegation is associated with a clientid and may be used on behalf of all the nfs_lockowners for the given client. A delegation is made to the client as a whole and not to any specific process or thread of control within it. Because callback RPCs may not work in all environments (due to firewalls, for example), correct protocol operation does not depend on them. Preliminary testing of callback functionality by means of a CB_NULL procedure determines whether callbacks can be supported. The CB_NULL procedure checks the continuity of the callback path. A server makes a preliminary assessment of callback availability to a given client and avoids delegating responsibilities until it has determined that callbacks are supported. Because the granting of a delegation is always conditional upon the absence of conflicting access, clients must not assume that a delegation will be granted and they must always be prepared for OPENs to be processed without any delegations being granted. Once granted, a delegation behaves in most ways like a lock. There is an associated lease that is subject to renewal together with all of the other leases held by that client. Unlike locks, an operation by a second client to a delegated file will cause the server to recall a delegation through a callback. On recall, the client holding the delegation must flush modified state (such as modified data) to the server and return the delegation. The conflicting request will not receive a response until the recall is complete. The recall is considered complete when
the client returns the delegation or the server times out on the recall and revokes the delegation as a result of the timeout. Following the resolution of the recall, the server has the information necessary to grant or deny the second client's request. At the time the client receives a delegation recall, it may have substantial state that needs to be flushed to the server. Therefore, the server should allow sufficient time for the delegation to be returned since it may involve numerous RPCs to the server. If the server is able to determine that the client is diligently flushing state to the server as a result of the recall, the server may extend the usual time allowed for a recall. However, the time allowed for recall completion should not be unbounded. An example of this is when responsibility to mediate opens on a given file is delegated to a client (see the section "Open Delegation"). The server will not know what opens are in effect on the client. Without this knowledge the server will be unable to determine if the access and deny state for the file allows any particular open until the delegation for the file has been returned. A client failure or a network partition can result in failure to respond to a recall callback. In this case, the server will revoke the delegation which in turn will render useless any modified state still on the client.9.2.1. Delegation Recovery
There are three situations that delegation recovery must deal with: o Client reboot or restart o Server reboot or restart o Network partition (full or callback-only) In the event the client reboots or restarts, the failure to renew leases will result in the revocation of record locks and share reservations. Delegations, however, may be treated a bit differently. There will be situations in which delegations will need to be reestablished after a client reboots or restarts. The reason for this is the client may have file data stored locally and this data was associated with the previously held delegations. The client will need to reestablish the appropriate file state on the server.
To allow for this type of client recovery, the server may extend the period for delegation recovery beyond the typical lease expiration period. This implies that requests from other clients that conflict with these delegations will need to wait. Because the normal recall process may require significant time for the client to flush changed state to the server, other clients need be prepared for delays that occur because of a conflicting delegation. This longer interval would increase the window for clients to reboot and consult stable storage so that the delegations can be reclaimed. For open delegations, such delegations are reclaimed using OPEN with a claim type of CLAIM_DELEGATE_PREV. (see the sections on "Data Caching and Revocation" and "Operation 18: OPEN" for discussion of open delegation and the details of OPEN respectively). When the server reboots or restarts, delegations are reclaimed (using the OPEN operation with CLAIM_DELEGATE_PREV) in a similar fashion to record locks and share reservations. However, there is a slight semantic difference. In the normal case if the server decides that a delegation should not be granted, it performs the requested action (e.g. OPEN) without granting any delegation. For reclaim, the server grants the delegation but a special designation is applied so that the client treats the delegation as having been granted but recalled by the server. Because of this, the client has the duty to write all modified state to the server and then return the delegation. This process of handling delegation reclaim reconciles three principles of the NFS Version 4 protocol: o Upon reclaim, a client reporting resources assigned to it by an earlier server instance must be granted those resources. o The server has unquestionable authority to determine whether delegations are to be granted and, once granted, whether they are to be continued. o The use of callbacks is not to be depended upon until the client has proven its ability to receive them. When a network partition occurs, delegations are subject to freeing by the server when the lease renewal period expires. This is similar to the behavior for locks and share reservations. For delegations, however, the server may extend the period in which conflicting requests are held off. Eventually the occurrence of a conflicting request from another client will cause revocation of the delegation. A loss of the callback path (e.g. by later network configuration change) will have the same effect. A recall request will fail and revocation of the delegation will result.
A client normally finds out about revocation of a delegation when it uses a stateid associated with a delegation and receives the error NFS4ERR_EXPIRED. It also may find out about delegation revocation after a client reboot when it attempts to reclaim a delegation and receives that same error. Note that in the case of a revoked write open delegation, there are issues because data may have been modified by the client whose delegation is revoked and separately by other clients. See the section "Revocation Recovery for Write Open Delegation" for a discussion of such issues. Note also that when delegations are revoked, information about the revoked delegation will be written by the server to stable storage (as described in the section "Crash Recovery"). This is done to deal with the case in which a server reboots after revoking a delegation but before the client holding the revoked delegation is notified about the revocation.9.3. Data Caching
When applications share access to a set of files, they need to be implemented so as to take account of the possibility of conflicting access by another application. This is true whether the applications in question execute on different clients or reside on the same client. Share reservations and record locks are the facilities the NFS version 4 protocol provides to allow applications to coordinate access by providing mutual exclusion facilities. The NFS version 4 protocol's data caching must be implemented such that it does not invalidate the assumptions that those using these facilities depend upon.9.3.1. Data Caching and OPENs
In order to avoid invalidating the sharing assumptions that applications rely on, NFS version 4 clients should not provide cached data to applications or modify it on behalf of an application when it would not be valid to obtain or modify that same data via a READ or WRITE operation. Furthermore, in the absence of open delegation (see the section "Open Delegation") two additional rules apply. Note that these rules are obeyed in practice by many NFS version 2 and version 3 clients. o First, cached data present on a client must be revalidated after doing an OPEN. This is to ensure that the data for the OPENed file is still correctly reflected in the client's cache. This validation must be done at least when the client's OPEN operation includes DENY=WRITE or BOTH thus terminating a period in which
other clients may have had the opportunity to open the file with WRITE access. Clients may choose to do the revalidation more often (i.e. at OPENs specifying DENY=NONE) to parallel the NFS version 3 protocol's practice for the benefit of users assuming this degree of cache revalidation. o Second, modified data must be flushed to the server before closing a file OPENed for write. This is complementary to the first rule. If the data is not flushed at CLOSE, the revalidation done after client OPENs as file is unable to achieve its purpose. The other aspect to flushing the data before close is that the data must be committed to stable storage, at the server, before the CLOSE operation is requested by the client. In the case of a server reboot or restart and a CLOSEd file, it may not be possible to retransmit the data to be written to the file. Hence, this requirement.9.3.2. Data Caching and File Locking
For those applications that choose to use file locking instead of share reservations to exclude inconsistent file access, there is an analogous set of constraints that apply to client side data caching. These rules are effective only if the file locking is used in a way that matches in an equivalent way the actual READ and WRITE operations executed. This is as opposed to file locking that is based on pure convention. For example, it is possible to manipulate a two-megabyte file by dividing the file into two one-megabyte regions and protecting access to the two regions by file locks on bytes zero and one. A lock for write on byte zero of the file would represent the right to do READ and WRITE operations on the first region. A lock for write on byte one of the file would represent the right to do READ and WRITE operations on the second region. As long as all applications manipulating the file obey this convention, they will work on a local file system. However, they may not work with the NFS version 4 protocol unless clients refrain from data caching. The rules for data caching in the file locking environment are: o First, when a client obtains a file lock for a particular region, the data cache corresponding to that region (if any cache data exists) must be revalidated. If the change attribute indicates that the file may have been updated since the cached data was obtained, the client must flush or invalidate the cached data for the newly locked region. A client might choose to invalidate all of non-modified cached data that it has for the file but the only requirement for correct operation is to invalidate all of the data in the newly locked region.
o Second, before releasing a write lock for a region, all modified data for that region must be flushed to the server. The modified data must also be written to stable storage. Note that flushing data to the server and the invalidation of cached data must reflect the actual byte ranges locked or unlocked. Rounding these up or down to reflect client cache block boundaries will cause problems if not carefully done. For example, writing a modified block when only half of that block is within an area being unlocked may cause invalid modification to the region outside the unlocked area. This, in turn, may be part of a region locked by another client. Clients can avoid this situation by synchronously performing portions of write operations that overlap that portion (initial or final) that is not a full block. Similarly, invalidating a locked area which is not an integral number of full buffer blocks would require the client to read one or two partial blocks from the server if the revalidation procedure shows that the data which the client possesses may not be valid. The data that is written to the server as a pre-requisite to the unlocking of a region must be written, at the server, to stable storage. The client may accomplish this either with synchronous writes or by following asynchronous writes with a COMMIT operation. This is required because retransmission of the modified data after a server reboot might conflict with a lock held by another client. A client implementation may choose to accommodate applications which use record locking in non-standard ways (e.g. using a record lock as a global semaphore) by flushing to the server more data upon an LOCKU than is covered by the locked range. This may include modified data within files other than the one for which the unlocks are being done. In such cases, the client must not interfere with applications whose READs and WRITEs are being done only within the bounds of record locks which the application holds. For example, an application locks a single byte of a file and proceeds to write that single byte. A client that chose to handle a LOCKU by flushing all modified data to the server could validly write that single byte in response to an unrelated unlock. However, it would not be valid to write the entire block in which that single written byte was located since it includes an area that is not locked and might be locked by another client. Client implementations can avoid this problem by dividing files with modified data into those for which all modifications are done to areas covered by an appropriate record lock and those for which there are modifications not covered by a record lock. Any writes done for the former class of files must not include areas not locked and thus not modified on the client.
9.3.3. Data Caching and Mandatory File Locking
Client side data caching needs to respect mandatory file locking when it is in effect. The presence of mandatory file locking for a given file is indicated in the result flags for an OPEN. When mandatory locking is in effect for a file, the client must check for an appropriate file lock for data being read or written. If a lock exists for the range being read or written, the client may satisfy the request using the client's validated cache. If an appropriate file lock is not held for the range of the read or write, the read or write request must not be satisfied by the client's cache and the request must be sent to the server for processing. When a read or write request partially overlaps a locked region, the request should be subdivided into multiple pieces with each region (locked or not) treated appropriately.9.3.4. Data Caching and File Identity
When clients cache data, the file data needs to organized according to the file system object to which the data belongs. For NFS version 3 clients, the typical practice has been to assume for the purpose of caching that distinct filehandles represent distinct file system objects. The client then has the choice to organize and maintain the data cache on this basis. In the NFS version 4 protocol, there is now the possibility to have significant deviations from a "one filehandle per object" model because a filehandle may be constructed on the basis of the object's pathname. Therefore, clients need a reliable method to determine if two filehandles designate the same file system object. If clients were simply to assume that all distinct filehandles denote distinct objects and proceed to do data caching on this basis, caching inconsistencies would arise between the distinct client side objects which mapped to the same server side object. By providing a method to differentiate filehandles, the NFS version 4 protocol alleviates a potential functional regression in comparison with the NFS version 3 protocol. Without this method, caching inconsistencies within the same client could occur and this has not been present in previous versions of the NFS protocol. Note that it is possible to have such inconsistencies with applications executing on multiple clients but that is not the issue being addressed here. For the purposes of data caching, the following steps allow an NFS version 4 client to determine whether two distinct filehandles denote the same server side object:
o If GETATTR directed to two filehandles have different values of the fsid attribute, then the filehandles represent distinct objects. o If GETATTR for any file with an fsid that matches the fsid of the two filehandles in question returns a unique_handles attribute with a value of TRUE, then the two objects are distinct. o If GETATTR directed to the two filehandles does not return the fileid attribute for one or both of the handles, then the it cannot be determined whether the two objects are the same. Therefore, operations which depend on that knowledge (e.g. client side data caching) cannot be done reliably. o If GETATTR directed to the two filehandles returns different values for the fileid attribute, then they are distinct objects. o Otherwise they are the same object.