9. File Locking and Share Reservations
Integrating locking into the NFS protocol necessarily causes it to be stateful. With the inclusion of share reservations, the protocol becomes substantially more dependent on state than the traditional combination of NFS and NLM (Network Lock Manager) [xnfs]. There are three components to making this state manageable: o clear division between client and server o ability to reliably detect inconsistency in state between client and server o simple and robust recovery mechanisms In this model, the server owns the state information. The client requests changes in locks, and the server responds with the changes made. Non-client-initiated changes in locking state are infrequent. The client receives prompt notification of such changes and can adjust its view of the locking state to reflect the server's changes. Individual pieces of state created by the server and passed to the client at its request are represented by 128-bit stateids. These stateids may represent a particular open file, a set of byte-range locks held by a particular owner, or a recallable delegation of privileges to access a file in particular ways or at a particular location. In all cases, there is a transition from the most general information that represents a client as a whole to the eventual lightweight stateid used for most client and server locking interactions. The details of this transition will vary with the type of object, but it always starts with a client ID. To support Win32 share reservations, it is necessary to atomically OPEN or CREATE files and apply the appropriate locks in the same operation. Having a separate share/unshare operation would not allow correct implementation of the Win32 OpenFile API. In order to correctly implement share semantics, the previous NFS protocol mechanisms used when a file is opened or created (LOOKUP, CREATE, ACCESS) need to be replaced. The NFSv4 protocol has an OPEN operation that subsumes the NFSv3 methodology of LOOKUP, CREATE, and ACCESS. However, because many operations require a filehandle, the traditional LOOKUP is preserved to map a filename to a filehandle without establishing state on the server. The policy of granting access or modifying files is managed by the server based on the client's state. These mechanisms can implement policy ranging from advisory only locking to full mandatory locking.
9.1. Opens and Byte-Range Locks
It is assumed that manipulating a byte-range lock is rare when compared to READ and WRITE operations. It is also assumed that server restarts and network partitions are relatively rare. Therefore, it is important that the READ and WRITE operations have a lightweight mechanism to indicate if they possess a held lock. A byte-range lock request contains the heavyweight information required to establish a lock and uniquely define the owner of the lock. The following sections describe the transition from the heavyweight information to the eventual stateid used for most client and server locking and lease interactions.9.1.1. Client ID
For each LOCK request, the client must identify itself to the server. This is done in such a way as to allow for correct lock identification and crash recovery. A sequence of a SETCLIENTID operation followed by a SETCLIENTID_CONFIRM operation is required to establish the identification onto the server. Establishment of identification by a new incarnation of the client also has the effect of immediately breaking any leased state that a previous incarnation of the client might have had on the server, as opposed to forcing the new client incarnation to wait for the leases to expire. Breaking the lease state amounts to the server removing all lock, share reservation, and, where the server is not supporting the CLAIM_DELEGATE_PREV claim type, all delegation state associated with the same client with the same identity. For a discussion of delegation state recovery, see Section 10.2.1. Owners of opens and owners of byte-range locks are separate entities and remain separate even if the same opaque arrays are used to designate owners of each. The protocol distinguishes between open-owners (represented by open_owner4 structures) and lock-owners (represented by lock_owner4 structures). Both sorts of owners consist of a clientid and an opaque owner string. For each client, the set of distinct owner values used with that client constitutes the set of owners of that type, for the given client. Each open is associated with a specific open-owner, while each byte-range lock is associated with a lock-owner and an open-owner, the latter being the open-owner associated with the open file under which the LOCK operation was done.
Client identification is encapsulated in the following structure: struct nfs_client_id4 { verifier4 verifier; opaque id<NFS4_OPAQUE_LIMIT>; }; The first field, verifier, is a client incarnation verifier that is used to detect client reboots. Only if the verifier is different from that which the server has previously recorded for the client (as identified by the second field of the structure, id) does the server start the process of canceling the client's leased state. The second field, id, is a variable-length string that uniquely defines the client. There are several considerations for how the client generates the id string: o The string should be unique so that multiple clients do not present the same string. The consequences of two clients presenting the same string range from one client getting an error to one client having its leased state abruptly and unexpectedly canceled. o The string should be selected so the subsequent incarnations (e.g., reboots) of the same client cause the client to present the same string. The implementer is cautioned against an approach that requires the string to be recorded in a local file because this precludes the use of the implementation in an environment where there is no local disk and all file access is from an NFSv4 server. o The string should be different for each server network address that the client accesses, rather than common to all server network addresses. The reason is that it may not be possible for the client to tell if the same server is listening on multiple network addresses. If the client issues SETCLIENTID with the same id string to each network address of such a server, the server will think it is the same client, and each successive SETCLIENTID will cause the server to begin the process of removing the client's previous leased state. o The algorithm for generating the string should not assume that the client's network address won't change. This includes changes between client incarnations and even changes while the client is still running in its current incarnation. This means that if the client includes just the client's and server's network address in
the id string, there is a real risk, after the client gives up the network address, that another client, using a similar algorithm for generating the id string, will generate a conflicting id string. Given the above considerations, an example of a well-generated id string is one that includes: o The server's network address. o The client's network address. o For a user-level NFSv4 client, it should contain additional information to distinguish the client from other user-level clients running on the same host, such as a universally unique identifier (UUID). o Additional information that tends to be unique, such as one or more of: * The client machine's serial number (for privacy reasons, it is best to perform some one-way function on the serial number). * A MAC address (for privacy reasons, it is best to perform some one-way function on the MAC address). * The timestamp of when the NFSv4 software was first installed on the client (though this is subject to the previously mentioned caution about using information that is stored in a file, because the file might only be accessible over NFSv4). * A true random number. However, since this number ought to be the same between client incarnations, this shares the same problem as that of using the timestamp of the software installation. As a security measure, the server MUST NOT cancel a client's leased state if the principal that established the state for a given id string is not the same as the principal issuing the SETCLIENTID. Note that SETCLIENTID (Section 16.33) and SETCLIENTID_CONFIRM (Section 16.34) have a secondary purpose of establishing the information the server needs to make callbacks to the client for the purpose of supporting delegations. It is permitted to change this information via SETCLIENTID and SETCLIENTID_CONFIRM within the same incarnation of the client without removing the client's leased state.
Once a SETCLIENTID and SETCLIENTID_CONFIRM sequence has successfully completed, the client uses the shorthand client identifier, of type clientid4, instead of the longer and less compact nfs_client_id4 structure. This shorthand client identifier (a client ID) is assigned by the server and should be chosen so that it will not conflict with a client ID previously assigned by the server. This applies across server restarts or reboots. When a client ID is presented to a server and that client ID is not recognized, as would happen after a server reboot, the server will reject the request with the error NFS4ERR_STALE_CLIENTID. When this happens, the client must obtain a new client ID by use of the SETCLIENTID operation and then proceed to any other necessary recovery for the server reboot case (see Section 9.6.2). The client must also employ the SETCLIENTID operation when it receives an NFS4ERR_STALE_STATEID error using a stateid derived from its current client ID, since this also indicates a server reboot, which has invalidated the existing client ID (see Section 9.6.2 for details). See the detailed descriptions of SETCLIENTID (Section 16.33.4) and SETCLIENTID_CONFIRM (Section 16.34.4) for a complete specification of the operations.9.1.2. Server Release of Client ID
If the server determines that the client holds no associated state for its client ID, the server may choose to release the client ID. The server may make this choice for an inactive client so that resources are not consumed by those intermittently active clients. If the client contacts the server after this release, the server must ensure that the client receives the appropriate error so that it will use the SETCLIENTID/SETCLIENTID_CONFIRM sequence to establish a new identity. It should be clear that the server must be very hesitant to release a client ID since the resulting work on the client to recover from such an event will be the same burden as if the server had failed and restarted. Typically, a server would not release a client ID unless there had been no activity from that client for many minutes. Note that if the id string in a SETCLIENTID request is properly constructed, and if the client takes care to use the same principal for each successive use of SETCLIENTID, then, barring an active denial-of-service attack, NFS4ERR_CLID_INUSE should never be returned.
However, client bugs, server bugs, or perhaps a deliberate change of the principal owner of the id string (such as the case of a client that changes security flavors, and under the new flavor there is no mapping to the previous owner) will in rare cases result in NFS4ERR_CLID_INUSE. In that event, when the server gets a SETCLIENTID for a client ID that currently has no state, or it has state but the lease has expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST allow the SETCLIENTID and confirm the new client ID if followed by the appropriate SETCLIENTID_CONFIRM.9.1.3. Use of Seqids
In several contexts, 32-bit sequence values called "seqids" are used as part of managing locking state. Such values are used: o To provide an ordering of locking-related operations associated with a particular lock-owner or open-owner. See Section 9.1.7 for a detailed explanation. o To define an ordered set of instances of a set of locks sharing a particular set of ownership characteristics. See Section 9.1.4.2 for a detailed explanation. Successive seqid values for the same object are normally arrived at by incrementing the current value by one. This pattern continues until the seqid is incremented past NFS4_UINT32_MAX, in which case one (rather than zero) is to be the next seqid value. When two seqid values are to be compared to determine which of the two is later, the possibility of wraparound needs to be considered. In many cases, the values are such that simple numeric comparisons can be used. For example, if the seqid values to be compared are both less than one million, the higher value can be considered the later. On the other hand, if one of the values is at or near NFS_UINT32_MAX and the other is less than one million, then implementations can reasonably decide that the lower value has had one more wraparound and is thus, while numerically lower, actually later. Implementations can compare seqids in the presence of potential wraparound by adopting the reasonable assumption that the chain of increments from one to the other is shorter than 2**31. So, if the difference between the two seqids is less than 2**31, then the lower seqid is to be treated as earlier. If, however, the difference
between the two seqids is greater than or equal to 2**31, then it can be assumed that the lower seqid has encountered one more wraparound and can be treated as later.9.1.4. Stateid Definition
When the server grants a lock of any type (including opens, byte-range locks, and delegations), it responds with a unique stateid that represents a set of locks (often a single lock) for the same file, of the same type, and sharing the same ownership characteristics. Thus, opens of the same file by different open-owners each have an identifying stateid. Similarly, each set of byte-range locks on a file owned by a specific lock-owner has its own identifying stateid. Delegations also have associated stateids by which they may be referenced. The stateid is used as a shorthand reference to a lock or set of locks, and given a stateid, the server can determine the associated state-owner or state-owners (in the case of an open-owner/lock-owner pair) and the associated filehandle. When stateids are used, the current filehandle must be the one associated with that stateid. All stateids associated with a given client ID are associated with a common lease that represents the claim of those stateids and the objects they represent to be maintained by the server. See Section 9.5 for a discussion of the lease. Each stateid must be unique to the server. Many operations take a stateid as an argument but not a clientid, so the server must be able to infer the client from the stateid.9.1.4.1. Stateid Types
With the exception of special stateids (see Section 9.1.4.3), each stateid represents locking objects of one of a set of types defined by the NFSv4 protocol. Note that in all these cases, where we speak of a guarantee, it is understood there are situations such as a client restart, or lock revocation, that allow the guarantee to be voided. o Stateids may represent opens of files. Each stateid in this case represents the OPEN state for a given client ID/open-owner/filehandle triple. Such stateids are subject to change (with consequent incrementing of the stateid's seqid) in response to OPENs that result in upgrade and OPEN_DOWNGRADE operations.
o Stateids may represent sets of byte-range locks. All locks held on a particular file by a particular owner and all gotten under the aegis of a particular open file are associated with a single stateid, with the seqid being incremented whenever LOCK and LOCKU operations affect that set of locks. o Stateids may represent file delegations, which are recallable guarantees by the server to the client that other clients will not reference, or will not modify, a particular file until the delegation is returned. A stateid represents a single delegation held by a client for a particular filehandle.9.1.4.2. Stateid Structure
Stateids are divided into two fields: a 96-bit "other" field identifying the specific set of locks and a 32-bit "seqid" sequence value. Except in the case of special stateids (see Section 9.1.4.3), a particular value of the "other" field denotes a set of locks of the same type (for example, byte-range locks, opens, or delegations), for a specific file or directory, and sharing the same ownership characteristics. The seqid designates a specific instance of such a set of locks, and is incremented to indicate changes in such a set of locks, by either the addition or deletion of locks from the set, a change in the byte-range they apply to, or an upgrade or downgrade in the type of one or more locks. When such a set of locks is first created, the server returns a stateid with a seqid value of one. On subsequent operations that modify the set of locks, the server is required to advance the seqid field by one whenever it returns a stateid for the same state-owner/file/type combination and the operation is one that might make some change in the set of locks actually designated. In this case, the server will return a stateid with an "other" field the same as previously used for that state-owner/file/type combination, with an incremented seqid field. Seqids will be compared, by both the client and the server. The client uses such comparisons to determine the order of operations, while the server uses them to determine whether the NFS4ERR_OLD_STATEID error is to be returned. In all cases, the possibility of seqid wraparound needs to be taken into account, as discussed in Section 9.1.3.
9.1.4.3. Special Stateids
Stateid values whose "other" field is either all zeros or all ones are reserved. They MUST NOT be assigned by the server but have special meanings defined by the protocol. The particular meaning depends on whether the "other" field is all zeros or all ones and the specific value of the seqid field. The following combinations of "other" and seqid are defined in NFSv4: Anonymous Stateid: When "other" and seqid are both zero, the stateid is treated as a special anonymous stateid, which can be used in READ, WRITE, and SETATTR requests to indicate the absence of any open state associated with the request. When an anonymous stateid value is used, and an existing open denies the form of access requested, then access will be denied to the request. READ Bypass Stateid: When "other" and seqid are both all ones, the stateid is a special READ bypass stateid. When this value is used in WRITE or SETATTR, it is treated like the anonymous value. When used in READ, the server MAY grant access, even if access would normally be denied to READ requests. If a stateid value is used that has all zeros or all ones in the "other" field but does not match one of the cases above, the server MUST return the error NFS4ERR_BAD_STATEID. Special stateids, unlike other stateids, are not associated with individual client IDs or filehandles and can be used with all valid client IDs and filehandles.9.1.4.4. Stateid Lifetime and Validation
Stateids must remain valid until either a client restart or a server restart, or until the client returns all of the locks associated with the stateid by means of an operation such as CLOSE or DELEGRETURN. If the locks are lost due to revocation, as long as the client ID is valid, the stateid remains a valid designation of that revoked state. Stateids associated with byte-range locks are an exception. They remain valid even if a LOCKU frees all remaining locks, so long as the open file with which they are associated remains open. It should be noted that there are situations in which the client's locks become invalid, without the client requesting they be returned. These include lease expiration and a number of forms of lock revocation within the lease period. It is important to note that in these situations, the stateid remains valid and the client can use it to determine the disposition of the associated lost locks.
An "other" value must never be reused for a different purpose (i.e., different filehandle, owner, or type of locks) within the context of a single client ID. A server may retain the "other" value for the same purpose beyond the point where it may otherwise be freed, but if it does so, it must maintain seqid continuity with previous values. One mechanism that may be used to satisfy the requirement that the server recognize invalid and out-of-date stateids is for the server to divide the "other" field of the stateid into two fields: o An index into a table of locking-state structures. o A generation number that is incremented on each allocation of a table entry for a particular use. And then store the following in each table entry: o The client ID with which the stateid is associated. o The current generation number for the (at most one) valid stateid sharing this index value. o The filehandle of the file on which the locks are taken. o An indication of the type of stateid (open, byte-range lock, file delegation). o The last seqid value returned corresponding to the current "other" value. o An indication of the current status of the locks associated with this stateid -- in particular, whether these have been revoked and, if so, for what reason. With this information, an incoming stateid can be validated and the appropriate error returned when necessary. Special and non-special stateids are handled separately. (See Section 9.1.4.3 for a discussion of special stateids.) When a stateid is being tested, and the "other" field is all zeros or all ones, a check that the "other" and seqid fields match a defined combination for a special stateid is done and the results determined as follows: o If the "other" and seqid fields do not match a defined combination associated with a special stateid, the error NFS4ERR_BAD_STATEID is returned.
o If the combination is valid in general but is not appropriate to the context in which the stateid is used (e.g., an all-zero stateid is used when an open stateid is required in a LOCK operation), the error NFS4ERR_BAD_STATEID is also returned. o Otherwise, the check is completed and the special stateid is accepted as valid. When a stateid is being tested, and the "other" field is neither all zeros nor all ones, the following procedure could be used to validate an incoming stateid and return an appropriate error, when necessary, assuming that the "other" field would be divided into a table index and an entry generation. Note that the terms "earlier" and "later" used in connection with seqid comparison are to be understood as explained in Section 9.1.3. o If the table index field is outside the range of the associated table, return NFS4ERR_BAD_STATEID. o If the selected table entry is of a different generation than that specified in the incoming stateid, return NFS4ERR_BAD_STATEID. o If the selected table entry does not match the current filehandle, return NFS4ERR_BAD_STATEID. o If the stateid represents revoked state or state lost as a result of lease expiration, then return NFS4ERR_EXPIRED, NFS4ERR_BAD_STATEID, or NFS4ERR_ADMIN_REVOKED, as appropriate. o If the stateid type is not valid for the context in which the stateid appears, return NFS4ERR_BAD_STATEID. Note that a stateid may be valid in general but invalid for a particular operation, as, for example, when a stateid that doesn't represent byte-range locks is passed to the non-from_open case of LOCK or to LOCKU, or when a stateid that does not represent an open is passed to CLOSE or OPEN_DOWNGRADE. In such cases, the server MUST return NFS4ERR_BAD_STATEID. o If the seqid field is not zero and it is later than the current sequence value corresponding to the current "other" field, return NFS4ERR_BAD_STATEID. o If the seqid field is earlier than the current sequence value corresponding to the current "other" field, return NFS4ERR_OLD_STATEID.
o Otherwise, the stateid is valid, and the table entry should contain any additional information about the type of stateid and information associated with that particular type of stateid, such as the associated set of locks (e.g., open-owner and lock-owner information), as well as information on the specific locks themselves, such as open modes and byte ranges.9.1.4.5. Stateid Use for I/O Operations
Clients performing Input/Output (I/O) operations need to select an appropriate stateid based on the locks (including opens and delegations) held by the client and the various types of state-owners sending the I/O requests. SETATTR operations that change the file size are treated like I/O operations in this regard. The following rules, applied in order of decreasing priority, govern the selection of the appropriate stateid. In following these rules, the client will only consider locks of which it has actually received notification by an appropriate operation response or callback. o If the client holds a delegation for the file in question, the delegation stateid SHOULD be used. o Otherwise, if the entity corresponding to the lock-owner (e.g., a process) sending the I/O has a byte-range lock stateid for the associated open file, then the byte-range lock stateid for that lock-owner and open file SHOULD be used. o If there is no byte-range lock stateid, then the OPEN stateid for the current open-owner, i.e., the OPEN stateid for the open file in question, SHOULD be used. o Finally, if none of the above apply, then a special stateid SHOULD be used. Ignoring these rules may result in situations in which the server does not have information necessary to properly process the request. For example, when mandatory byte-range locks are in effect, if the stateid does not indicate the proper lock-owner, via a lock stateid, a request might be avoidably rejected. The server, however, should not try to enforce these ordering rules and should use whatever information is available to properly process I/O requests. In particular, when a client has a delegation for a given file, it SHOULD take note of this fact in processing a request, even if it is sent with a special stateid.
9.1.4.6. Stateid Use for SETATTR Operations
In the case of SETATTR operations, a stateid is present. In cases other than those that set the file size, the client may send either a special stateid or, when a delegation is held for the file in question, a delegation stateid. While the server SHOULD validate the stateid and may use the stateid to optimize the determination as to whether a delegation is held, it SHOULD note the presence of a delegation even when a special stateid is sent, and MUST accept a valid delegation stateid when sent.9.1.5. Lock-Owner
When requesting a lock, the client must present to the server the client ID and an identifier for the owner of the requested lock. These two fields comprise the lock-owner and are defined as follows: o A client ID returned by the server as part of the client's use of the SETCLIENTID operation. o A variable-length opaque array used to uniquely define the owner of a lock managed by the client. This may be a thread id, process id, or other unique value. When the server grants the lock, it responds with a unique stateid. The stateid is used as a shorthand reference to the lock-owner, since the server will be maintaining the correspondence between them.9.1.6. Use of the Stateid and Locking
All READ, WRITE, and SETATTR operations contain a stateid. For the purposes of this section, SETATTR operations that change the size attribute of a file are treated as if they are writing the area between the old and new size (i.e., the range truncated or added to the file by means of the SETATTR), even where SETATTR is not explicitly mentioned in the text. The stateid passed to one of these operations must be one that represents an OPEN (e.g., via the open-owner), a set of byte-range locks, or a delegation, or it may be a special stateid representing anonymous access or the READ bypass stateid. If the state-owner performs a READ or WRITE in a situation in which it has established a lock or share reservation on the server (any OPEN constitutes a share reservation), the stateid (previously returned by the server) must be used to indicate what locks, including both byte-range locks and share reservations, are held by the state-owner. If no state is established by the client -- either
byte-range lock or share reservation -- the anonymous stateid is used. Regardless of whether an anonymous stateid or a stateid returned by the server is used, if there is a conflicting share reservation or mandatory byte-range lock held on the file, the server MUST refuse to service the READ or WRITE operation. Share reservations are established by OPEN operations and by their nature are mandatory in that when the OPEN denies READ or WRITE operations, that denial results in such operations being rejected with error NFS4ERR_LOCKED. Byte-range locks may be implemented by the server as either mandatory or advisory, or the choice of mandatory or advisory behavior may be determined by the server on the basis of the file being accessed (for example, some UNIX-based servers support a "mandatory lock bit" on the mode attribute such that if set, byte-range locks are required on the file before I/O is possible). When byte-range locks are advisory, they only prevent the granting of conflicting lock requests and have no effect on READs or WRITEs. Mandatory byte-range locks, however, prevent conflicting I/O operations. When they are attempted, they are rejected with NFS4ERR_LOCKED. When the client gets NFS4ERR_LOCKED on a file it knows it has the proper share reservation for, it will need to issue a LOCK request on the region of the file that includes the region the I/O was to be performed on, with an appropriate locktype (i.e., READ*_LT for a READ operation, WRITE*_LT for a WRITE operation). With NFSv3, there was no notion of a stateid, so there was no way to tell if the application process of the client sending the READ or WRITE operation had also acquired the appropriate byte-range lock on the file. Thus, there was no way to implement mandatory locking. With the stateid construct, this barrier has been removed. Note that for UNIX environments that support mandatory file locking, the distinction between advisory and mandatory locking is subtle. In fact, advisory and mandatory byte-range locks are exactly the same insofar as the APIs and requirements on implementation are concerned. If the mandatory lock attribute is set on the file, the server checks to see if the lock-owner has an appropriate shared (read) or exclusive (write) byte-range lock on the region it wishes to read or write to. If there is no appropriate lock, the server checks if there is a conflicting lock (which can be done by attempting to acquire the conflicting lock on behalf of the lock-owner and, if successful, release the lock after the READ or WRITE is done), and if there is, the server returns NFS4ERR_LOCKED. For Windows environments, there are no advisory byte-range locks, so the server always checks for byte-range locks during I/O requests.
Thus, the NFSv4 LOCK operation does not need to distinguish between advisory and mandatory byte-range locks. It is the NFSv4 server's processing of the READ and WRITE operations that introduces the distinction. Every stateid other than the special stateid values noted in this section, whether returned by an OPEN-type operation (i.e., OPEN, OPEN_DOWNGRADE) or by a LOCK-type operation (i.e., LOCK or LOCKU), defines an access mode for the file (i.e., READ, WRITE, or READ-WRITE) as established by the original OPEN that began the stateid sequence, and as modified by subsequent OPENs and OPEN_DOWNGRADEs within that stateid sequence. When a READ, WRITE, or SETATTR that specifies the size attribute is done, the operation is subject to checking against the access mode to verify that the operation is appropriate given the OPEN with which the operation is associated. In the case of WRITE-type operations (i.e., WRITEs and SETATTRs that set size), the server must verify that the access mode allows writing and return an NFS4ERR_OPENMODE error if it does not. In the case of READ, the server may perform the corresponding check on the access mode, or it may choose to allow READ on opens for WRITE only, to accommodate clients whose write implementation may unavoidably do reads (e.g., due to buffer cache constraints). However, even if READs are allowed in these circumstances, the server MUST still check for locks that conflict with the READ (e.g., another open specifying denial of READs). Note that a server that does enforce the access mode check on READs need not explicitly check for conflicting share reservations since the existence of OPEN for read access guarantees that no conflicting share reservation can exist. A READ bypass stateid MAY allow READ operations to bypass locking checks at the server. However, WRITE operations with a READ bypass stateid MUST NOT bypass locking checks and are treated exactly the same as if an anonymous stateid were used. A lock may not be granted while a READ or WRITE operation using one of the special stateids is being performed and the range of the lock request conflicts with the range of the READ or WRITE operation. For the purposes of this paragraph, a conflict occurs when a shared lock is requested and a WRITE operation is being performed, or an exclusive lock is requested and either a READ or a WRITE operation is being performed. A SETATTR that sets size is treated similarly to a WRITE as discussed above.
9.1.7. Sequencing of Lock Requests
Locking is different than most NFS operations as it requires "at-most-one" semantics that are not provided by ONC RPC. ONC RPC over a reliable transport is not sufficient because a sequence of locking requests may span multiple TCP connections. In the face of retransmission or reordering, lock or unlock requests must have a well-defined and consistent behavior. To accomplish this, each lock request contains a sequence number that is a consecutively increasing integer. Different state-owners have different sequences. The server maintains the last sequence number (L) received and the response that was returned. The server SHOULD assign a seqid value of one for the first request issued for any given state-owner. Subsequent values are arrived at by incrementing the seqid value, subject to wraparound as described in Section 9.1.3. Note that for requests that contain a sequence number, for each state-owner, there should be no more than one outstanding request. When a request is received, its sequence number (r) is compared to that of the last one received (L). Only if it has the correct next sequence, normally L + 1, is the request processed beyond the point of seqid checking. Given a properly functioning client, the response to (r) must have been received before the last request (L) was sent. If a duplicate of last request (r == L) is received, the stored response is returned. If the sequence value received is any other value, it is rejected with the return of error NFS4ERR_BAD_SEQID. Sequence history is reinitialized whenever the SETCLIENTID/ SETCLIENTID_CONFIRM sequence changes the client verifier. It is critical that the server maintain the last response sent to the client to provide a more reliable cache of duplicate non-idempotent requests than that of the traditional cache described in [Chet]. The traditional duplicate request cache uses a least recently used algorithm for removing unneeded requests. However, the last lock request and response on a given state-owner must be cached as long as the lock state exists on the server. The client MUST advance the sequence number for the CLOSE, LOCK, LOCKU, OPEN, OPEN_CONFIRM, and OPEN_DOWNGRADE operations. This is true even in the event that the previous operation that used the sequence number received an error. The only exception to this rule is if the previous operation received one of the following errors: NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID, NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR, NFS4ERR_RESOURCE, NFS4ERR_NOFILEHANDLE, or NFS4ERR_MOVED.
9.1.8. Recovery from Replayed Requests
As described above, the sequence number is per state-owner. As long as the server maintains the last sequence number received and follows the methods described above, there are no risks of a Byzantine router re-sending old requests. The server need only maintain the (state-owner, sequence number) state as long as there are open files or closed files with locks outstanding. LOCK, LOCKU, OPEN, OPEN_DOWNGRADE, and CLOSE each contain a sequence number, and therefore the risk of the replay of these operations resulting in undesired effects is non-existent while the server maintains the state-owner state.9.1.9. Interactions of Multiple Sequence Values
Some operations may have multiple sources of data for request sequence checking and retransmission determination. Some operations have multiple sequence values associated with multiple types of state-owners. In addition, such operations may also have a stateid with its own seqid value, that will be checked for validity. As noted above, there may be multiple sequence values to check. The following rules should be followed by the server in processing these multiple sequence values within a single operation. o When a sequence value associated with a state-owner is unavailable for checking because the state-owner is unknown to the server, it takes no part in the comparison. o When any of the state-owner sequence values are invalid, NFS4ERR_BAD_SEQID is returned. When a stateid sequence is checked, NFS4ERR_BAD_STATEID or NFS4ERR_OLD_STATEID is returned as appropriate, but NFS4ERR_BAD_SEQID has priority. o When any one of the sequence values matches a previous request, for a state-owner, it is treated as a retransmission and not re-executed. When the type of the operation does not match that originally used, NFS4ERR_BAD_SEQID is returned. When the server can determine that the request differs from the original, it may return NFS4ERR_BAD_SEQID. o When multiple sequence values match previous operations but the operations are not the same, NFS4ERR_BAD_SEQID is returned.
o When there are no sequence values available for comparison and the operation is an OPEN, the server indicates to the client that an OPEN_CONFIRM is required, unless it can conclusively determine that confirmation is not required (e.g., by knowing that no open-owner state has ever been released for the current clientid).9.1.10. Releasing State-Owner State
When a particular state-owner no longer holds open or file locking state at the server, the server may choose to release the sequence number state associated with the state-owner. The server may make this choice based on lease expiration, the reclamation of server memory, or other implementation-specific details. Note that when this is done, a retransmitted request, normally identified by a matching state-owner sequence, may not be correctly recognized, so that the client will not receive the original response that it would have if the state-owner state was not released. If the server were able to be sure that a given state-owner would never again be used by a client, such an issue could not arise. Even when the state-owner state is released and the client subsequently uses that state-owner, retransmitted requests will be detected as invalid and the request not executed, although the client may have a recovery path that is more complicated than simply getting the original response back transparently. In any event, the server is able to safely release state-owner state (in the sense that retransmitted requests will not be erroneously acted upon) when the state-owner is not currently being utilized by the client (i.e., there are no open files associated with an open-owner and no lock stateids associated with a lock-owner). The server may choose to hold the state-owner state in order to simplify the recovery path, in the case in which retransmissions of currently active requests are received. However, the period for which it chooses to hold this state is implementation specific. In the case that a LOCK, LOCKU, OPEN_DOWNGRADE, or CLOSE is retransmitted after the server has previously released the state-owner state, the server will find that the state-owner has no files open and an error will be returned to the client. If the state-owner does have a file open, the stateid will not match and again an error is returned to the client.
9.1.11. Use of Open Confirmation
In the case that an OPEN is retransmitted and the open-owner is being used for the first time or the open-owner state has been previously released by the server, the use of the OPEN_CONFIRM operation will prevent incorrect behavior. When the server observes the use of the open-owner for the first time, it will direct the client to perform the OPEN_CONFIRM for the corresponding OPEN. This sequence establishes the use of an open-owner and associated sequence number. Since the OPEN_CONFIRM sequence connects a new open-owner on the server with an existing open-owner on a client, the sequence number may have any valid (i.e., non-zero) value. The OPEN_CONFIRM step assures the server that the value received is the correct one. (See Section 16.18 for further details.) There are a number of situations in which the requirement to confirm an OPEN would pose difficulties for the client and server, in that they would be prevented from acting in a timely fashion on information received, because that information would be provisional, subject to deletion upon non-confirmation. Fortunately, these are situations in which the server can avoid the need for confirmation when responding to open requests. The two constraints are: o The server must not bestow a delegation for any open that would require confirmation. o The server MUST NOT require confirmation on a reclaim-type open (i.e., one specifying claim type CLAIM_PREVIOUS or CLAIM_DELEGATE_PREV). These constraints are related in that reclaim-type opens are the only ones in which the server may be required to send a delegation. For CLAIM_NULL, sending the delegation is optional, while for CLAIM_DELEGATE_CUR, no delegation is sent. Delegations being sent with an open requiring confirmation are troublesome because recovering from non-confirmation adds undue complexity to the protocol, while requiring confirmation on reclaim- type opens poses difficulties in that the inability to resolve the status of the reclaim until lease expiration may make it difficult to have timely determination of the set of locks being reclaimed (since the grace period may expire). Requiring open confirmation on reclaim-type opens is avoidable because of the nature of the environments in which such opens are done. For CLAIM_PREVIOUS opens, this is immediately after server reboot, so there should be no time for open-owners to be created, found to be unused, and recycled. For CLAIM_DELEGATE_PREV opens,
we are dealing with either a client reboot situation or a network partition resulting in deletion of lease state (and returning NFS4ERR_EXPIRED). A server that supports delegations can be sure that no open-owners for that client have been recycled since client initialization or deletion of lease state and thus can be confident that confirmation will not be required.9.2. Lock Ranges
The protocol allows a lock-owner to request a lock with a byte range and then either upgrade or unlock a sub-range of the initial lock. It is expected that this will be an uncommon type of request. In any case, servers or server file systems may not be able to support sub-range lock semantics. In the event that a server receives a locking request that represents a sub-range of current locking state for the lock-owner, the server is allowed to return the error NFS4ERR_LOCK_RANGE to signify that it does not support sub-range lock operations. Therefore, the client should be prepared to receive this error and, if appropriate, report the error to the requesting application. The client is discouraged from combining multiple independent locking ranges that happen to be adjacent into a single request, since the server may not support sub-range requests, and for reasons related to the recovery of file locking state in the event of server failure. As discussed in Section 9.6.2 below, the server may employ certain optimizations during recovery that work effectively only when the client's behavior during lock recovery is similar to the client's locking behavior prior to server failure.9.3. Upgrading and Downgrading Locks
If a client has a write lock on a record, it can request an atomic downgrade of the lock to a read lock via the LOCK request, by setting the type to READ_LT. If the server supports atomic downgrade, the request will succeed. If not, it will return NFS4ERR_LOCK_NOTSUPP. The client should be prepared to receive this error and, if appropriate, report the error to the requesting application. If a client has a read lock on a record, it can request an atomic upgrade of the lock to a write lock via the LOCK request by setting the type to WRITE_LT or WRITEW_LT. If the server does not support atomic upgrade, it will return NFS4ERR_LOCK_NOTSUPP. If the upgrade can be achieved without an existing conflict, the request will succeed. Otherwise, the server will return either NFS4ERR_DENIED or NFS4ERR_DEADLOCK. The error NFS4ERR_DEADLOCK is returned if the client issued the LOCK request with the type set to WRITEW_LT and the
server has detected a deadlock. The client should be prepared to receive such errors and, if appropriate, report them to the requesting application.9.4. Blocking Locks
Some clients require the support of blocking locks. The NFSv4 protocol must not rely on a callback mechanism and therefore is unable to notify a client when a previously denied lock has been granted. Clients have no choice but to continually poll for the lock. This presents a fairness problem. Two new lock types are added, READW and WRITEW, and are used to indicate to the server that the client is requesting a blocking lock. The server should maintain an ordered list of pending blocking locks. When the conflicting lock is released, the server may wait the lease period for the first waiting client to re-request the lock. After the lease period expires, the next waiting client request is allowed the lock. Clients are required to poll at an interval sufficiently small that it is likely to acquire the lock in a timely manner. The server is not required to maintain a list of pending blocked locks, as it is not used to provide correct operation but only to increase fairness. Because of the unordered nature of crash recovery, storing of lock state to stable storage would be required to guarantee ordered granting of blocking locks. Servers may also note the lock types and delay returning denial of the request to allow extra time for a conflicting lock to be released, allowing a successful return. In this way, clients can avoid the burden of needlessly frequent polling for blocking locks. The server should take care with the length of delay in the event that the client retransmits the request. If a server receives a blocking lock request, denies it, and then later receives a non-blocking request for the same lock, which is also denied, then it should remove the lock in question from its list of pending blocking locks. Clients should use such a non-blocking request to indicate to the server that this is the last time they intend to poll for the lock, as may happen when the process requesting the lock is interrupted. This is a courtesy to the server, to prevent it from unnecessarily waiting a lease period before granting other lock requests. However, clients are not required to perform this courtesy, and servers must not depend on them doing so. Also, clients must be prepared for the possibility that this final locking request will be accepted.