5.8. Interpreting owner and owner_group
The recommended attributes "owner" and "owner_group" (and also users and groups within the "acl" attribute) are represented in terms of a UTF-8 string. To avoid a representation that is tied to a particular underlying implementation at the client or server, the use of the UTF-8 string has been chosen. Note that section 6.1 of [RFC2624] provides additional rationale. It is expected that the client and server will have their own local representation of owner and owner_group that is used for local storage or presentation to the end user. Therefore, it is expected that when these attributes are transferred between the client and server that the local representation is translated to a syntax of the form "user@dns_domain". This will allow for a client and server that do not use the same local representation the ability to translate to a common syntax that can be interpreted by both. Similarly, security principals may be represented in different ways by different security mechanisms. Servers normally translate these representations into a common format, generally that used by local storage, to serve as a means of identifying the users corresponding to these security principals. When these local identifiers are translated to the form of the owner attribute, associated with files created by such principals they identify, in a common format, the users associated with each corresponding set of security principals. The translation used to interpret owner and group strings is not specified as part of the protocol. This allows various solutions to be employed. For example, a local translation table may be consulted that maps between a numeric id to the user@dns_domain syntax. A name service may also be used to accomplish the translation. A server may provide a more general service, not limited by any particular translation (which would only translate a limited set of possible strings) by storing the owner and owner_group attributes in local storage without any translation or it may augment a translation method by storing the entire string for attributes for which no translation is available while using the local representation for those cases in which a translation is available. Servers that do not provide support for all possible values of the owner and owner_group attributes, should return an error (NFS4ERR_BADOWNER) when a string is presented that has no translation, as the value to be set for a SETATTR of the owner, owner_group, or acl attributes. When a server does accept an owner or owner_group value as valid on a SETATTR (and similarly for the owner and group strings in an acl), it is promising to return that same string when a corresponding GETATTR is done. Configuration changes and ill-constructed name translations (those that contain
aliasing) may make that promise impossible to honor. Servers should make appropriate efforts to avoid a situation in which these attributes have their values changed when no real change to ownership has occurred. The "dns_domain" portion of the owner string is meant to be a DNS domain name. For example, user@ietf.org. Servers should accept as valid a set of users for at least one domain. A server may treat other domains as having no valid translations. A more general service is provided when a server is capable of accepting users for multiple domains, or for all domains, subject to security constraints. In the case where there is no translation available to the client or server, the attribute value must be constructed without the "@". Therefore, the absence of the @ from the owner or owner_group attribute signifies that no translation was available at the sender and that the receiver of the attribute should not use that string as a basis for translation into its own internal format. Even though the attribute value can not be translated, it may still be useful. In the case of a client, the attribute string may be used for local display of ownership. To provide a greater degree of compatibility with previous versions of NFS (i.e., v2 and v3), which identified users and groups by 32-bit unsigned uid's and gid's, owner and group strings that consist of decimal numeric values with no leading zeros can be given a special interpretation by clients and servers which choose to provide such support. The receiver may treat such a user or group string as representing the same user as would be represented by a v2/v3 uid or gid having the corresponding numeric value. A server is not obligated to accept such a string, but may return an NFS4ERR_BADOWNER instead. To avoid this mechanism being used to subvert user and group translation, so that a client might pass all of the owners and groups in numeric form, a server SHOULD return an NFS4ERR_BADOWNER error when there is a valid translation for the user or owner designated in this way. In that case, the client must use the appropriate name@domain string and not the special form for compatibility. The owner string "nobody" may be used to designate an anonymous user, which will be associated with a file created by a security principal that cannot be mapped through normal means to the owner attribute.
5.9. Character Case Attributes
With respect to the case_insensitive and case_preserving attributes, each UCS-4 character (which UTF-8 encodes) has a "long descriptive name" [RFC1345] which may or may not included the word "CAPITAL" or "SMALL". The presence of SMALL or CAPITAL allows an NFS server to implement unambiguous and efficient table driven mappings for case insensitive comparisons, and non-case-preserving storage. For general character handling and internationalization issues, see the section "Internationalization".5.10. Quota Attributes
For the attributes related to filesystem quotas, the following definitions apply: quota_avail_soft The value in bytes which represents the amount of additional disk space that can be allocated to this file or directory before the user may reasonably be warned. It is understood that this space may be consumed by allocations to other files or directories though there is a rule as to which other files or directories. quota_avail_hard The value in bytes which represent the amount of additional disk space beyond the current allocation that can be allocated to this file or directory before further allocations will be refused. It is understood that this space may be consumed by allocations to other files or directories. quota_used The value in bytes which represent the amount of disc space used by this file or directory and possibly a number of other similar files or directories, where the set of "similar" meets at least the criterion that allocating space to any file or directory in the set will reduce the "quota_avail_hard" of every other file or directory in the set. Note that there may be a number of distinct but overlapping sets of files or directories for which a quota_used value is maintained (e.g., "all files with a given owner", "all files with a given group owner", etc.). The server is at liberty to choose any of those sets but should do so in a repeatable way. The rule may be configured per- filesystem or may be "choose the set with the smallest quota".
5.11. Access Control Lists
The NFS version 4 ACL attribute is an array of access control entries (ACE). Although, the client can read and write the ACL attribute, the NFSv4 model is the server does all access control based on the server's interpretation of the ACL. If at any point the client wants to check access without issuing an operation that modifies or reads data or metadata, the client can use the OPEN and ACCESS operations to do so. There are various access control entry types, as defined in the Section "ACE type". The server is able to communicate which ACE types are supported by returning the appropriate value within the aclsupport attribute. Each ACE covers one or more operations on a file or directory as described in the Section "ACE Access Mask". It may also contain one or more flags that modify the semantics of the ACE as defined in the Section "ACE flag". The NFS ACE attribute is defined as follows: typedef uint32_t acetype4; typedef uint32_t aceflag4; typedef uint32_t acemask4; struct nfsace4 { acetype4 type; aceflag4 flag; acemask4 access_mask; utf8str_mixed who; }; To determine if a request succeeds, each nfsace4 entry is processed in order by the server. Only ACEs which have a "who" that matches the requester are considered. Each ACE is processed until all of the bits of the requester's access have been ALLOWED. Once a bit (see below) has been ALLOWED by an ACCESS_ALLOWED_ACE, it is no longer considered in the processing of later ACEs. If an ACCESS_DENIED_ACE is encountered where the requester's access still has unALLOWED bits in common with the "access_mask" of the ACE, the request is denied. However, unlike the ALLOWED and DENIED ACE types, the ALARM and AUDIT ACE types do not affect a requester's access, and instead are for triggering events as a result of a requester's access attempt. Therefore, all AUDIT and ALARM ACEs are processed until end of the ACL. When the ACL is fully processed, if there are bits in requester's mask that have not been considered whether the server allows or denies the access is undefined. If there is a mode attribute on the file, then this cannot happen, since the mode's MODE4_*OTH bits will map to EVERYONE@ ACEs that unambiguously specify the requester's access.
The NFS version 4 ACL model is quite rich. Some server platforms may provide access control functionality that goes beyond the UNIX-style mode attribute, but which is not as rich as the NFS ACL model. So that users can take advantage of this more limited functionality, the server may indicate that it supports ACLs as long as it follows the guidelines for mapping between its ACL model and the NFS version 4 ACL model. The situation is complicated by the fact that a server may have multiple modules that enforce ACLs. For example, the enforcement for NFS version 4 access may be different from the enforcement for local access, and both may be different from the enforcement for access through other protocols such as SMB. So it may be useful for a server to accept an ACL even if not all of its modules are able to support it. The guiding principle in all cases is that the server must not accept ACLs that appear to make the file more secure than it really is.5.11.1. ACE type
Type Description _____________________________________________________ ALLOW Explicitly grants the access defined in acemask4 to the file or directory. DENY Explicitly denies the access defined in acemask4 to the file or directory. AUDIT LOG (system dependent) any access attempt to a file or directory which uses any of the access methods specified in acemask4. ALARM Generate a system ALARM (system dependent) when any access attempt is made to a file or directory for the access methods specified in acemask4. A server need not support all of the above ACE types. The bitmask constants used to represent the above definitions within the aclsupport attribute are as follows: const ACL4_SUPPORT_ALLOW_ACL = 0x00000001; const ACL4_SUPPORT_DENY_ACL = 0x00000002; const ACL4_SUPPORT_AUDIT_ACL = 0x00000004; const ACL4_SUPPORT_ALARM_ACL = 0x00000008;
The semantics of the "type" field follow the descriptions provided above. The constants used for the type field (acetype4) are as follows: const ACE4_ACCESS_ALLOWED_ACE_TYPE = 0x00000000; const ACE4_ACCESS_DENIED_ACE_TYPE = 0x00000001; const ACE4_SYSTEM_AUDIT_ACE_TYPE = 0x00000002; const ACE4_SYSTEM_ALARM_ACE_TYPE = 0x00000003; Clients should not attempt to set an ACE unless the server claims support for that ACE type. If the server receives a request to set an ACE that it cannot store, it MUST reject the request with NFS4ERR_ATTRNOTSUPP. If the server receives a request to set an ACE that it can store but cannot enforce, the server SHOULD reject the request with NFS4ERR_ATTRNOTSUPP. Example: suppose a server can enforce NFS ACLs for NFS access but cannot enforce ACLs for local access. If arbitrary processes can run on the server, then the server SHOULD NOT indicate ACL support. On the other hand, if only trusted administrative programs run locally, then the server may indicate ACL support.5.11.2. ACE Access Mask
The access_mask field contains values based on the following: Access Description _______________________________________________________________ READ_DATA Permission to read the data of the file LIST_DIRECTORY Permission to list the contents of a directory WRITE_DATA Permission to modify the file's data ADD_FILE Permission to add a new file to a directory APPEND_DATA Permission to append data to a file ADD_SUBDIRECTORY Permission to create a subdirectory to a directory READ_NAMED_ATTRS Permission to read the named attributes of a file WRITE_NAMED_ATTRS Permission to write the named attributes of a file EXECUTE Permission to execute a file DELETE_CHILD Permission to delete a file or directory within a directory READ_ATTRIBUTES The ability to read basic attributes (non-acls) of a file WRITE_ATTRIBUTES Permission to change basic attributes
(non-acls) of a file DELETE Permission to Delete the file READ_ACL Permission to Read the ACL WRITE_ACL Permission to Write the ACL WRITE_OWNER Permission to change the owner SYNCHRONIZE Permission to access file locally at the server with synchronous reads and writes The bitmask constants used for the access mask field are as follows: const ACE4_READ_DATA = 0x00000001; const ACE4_LIST_DIRECTORY = 0x00000001; const ACE4_WRITE_DATA = 0x00000002; const ACE4_ADD_FILE = 0x00000002; const ACE4_APPEND_DATA = 0x00000004; const ACE4_ADD_SUBDIRECTORY = 0x00000004; const ACE4_READ_NAMED_ATTRS = 0x00000008; const ACE4_WRITE_NAMED_ATTRS = 0x00000010; const ACE4_EXECUTE = 0x00000020; const ACE4_DELETE_CHILD = 0x00000040; const ACE4_READ_ATTRIBUTES = 0x00000080; const ACE4_WRITE_ATTRIBUTES = 0x00000100; const ACE4_DELETE = 0x00010000; const ACE4_READ_ACL = 0x00020000; const ACE4_WRITE_ACL = 0x00040000; const ACE4_WRITE_OWNER = 0x00080000; const ACE4_SYNCHRONIZE = 0x00100000; Server implementations need not provide the granularity of control that is implied by this list of masks. For example, POSIX-based systems might not distinguish APPEND_DATA (the ability to append to a file) from WRITE_DATA (the ability to modify existing contents); both masks would be tied to a single "write" permission. When such a server returns attributes to the client, it would show both APPEND_DATA and WRITE_DATA if and only if the write permission is enabled. If a server receives a SETATTR request that it cannot accurately implement, it should error in the direction of more restricted access. For example, suppose a server cannot distinguish overwriting data from appending new data, as described in the previous paragraph. If a client submits an ACE where APPEND_DATA is set but WRITE_DATA is not (or vice versa), the server should reject the request with NFS4ERR_ATTRNOTSUPP. Nonetheless, if the ACE has type DENY, the server may silently turn on the other bit, so that both APPEND_DATA and WRITE_DATA are denied.
5.11.3. ACE flag
The "flag" field contains values based on the following descriptions. ACE4_FILE_INHERIT_ACE Can be placed on a directory and indicates that this ACE should be added to each new non-directory file created. ACE4_DIRECTORY_INHERIT_ACE Can be placed on a directory and indicates that this ACE should be added to each new directory created. ACE4_INHERIT_ONLY_ACE Can be placed on a directory but does not apply to the directory, only to newly created files/directories as specified by the above two flags. ACE4_NO_PROPAGATE_INHERIT_ACE Can be placed on a directory. Normally when a new directory is created and an ACE exists on the parent directory which is marked ACL4_DIRECTORY_INHERIT_ACE, two ACEs are placed on the new directory. One for the directory itself and one which is an inheritable ACE for newly created directories. This flag tells the server to not place an ACE on the newly created directory which is inheritable by subdirectories of the created directory. ACE4_SUCCESSFUL_ACCESS_ACE_FLAG ACL4_FAILED_ACCESS_ACE_FLAG The ACE4_SUCCESSFUL_ACCESS_ACE_FLAG (SUCCESS) and ACE4_FAILED_ACCESS_ACE_FLAG (FAILED) flag bits relate only to ACE4_SYSTEM_AUDIT_ACE_TYPE (AUDIT) and ACE4_SYSTEM_ALARM_ACE_TYPE (ALARM) ACE types. If during the processing of the file's ACL, the server encounters an AUDIT or ALARM ACE that matches the principal attempting the OPEN, the server notes that fact, and the presence, if any, of the SUCCESS and FAILED flags encountered in the AUDIT or ALARM ACE. Once the server completes the ACL processing, and the share reservation processing, and the OPEN call, it then notes if the OPEN succeeded or failed. If the OPEN succeeded, and if the SUCCESS flag was set for a matching AUDIT or ALARM, then the appropriate AUDIT or ALARM event occurs. If the OPEN failed, and if the FAILED flag was set for the matching AUDIT or ALARM, then the appropriate AUDIT or ALARM event occurs. Clearly either or both of the SUCCESS or FAILED can be set, but if neither is set, the AUDIT or ALARM ACE is not useful.
The previously described processing applies to that of the ACCESS operation as well. The difference being that "success" or "failure" does not mean whether ACCESS returns NFS4_OK or not. Success means whether ACCESS returns all requested and supported bits. Failure means whether ACCESS failed to return a bit that was requested and supported. ACE4_IDENTIFIER_GROUP Indicates that the "who" refers to a GROUP as defined under UNIX. The bitmask constants used for the flag field are as follows: const ACE4_FILE_INHERIT_ACE = 0x00000001; const ACE4_DIRECTORY_INHERIT_ACE = 0x00000002; const ACE4_NO_PROPAGATE_INHERIT_ACE = 0x00000004; const ACE4_INHERIT_ONLY_ACE = 0x00000008; const ACE4_SUCCESSFUL_ACCESS_ACE_FLAG = 0x00000010; const ACE4_FAILED_ACCESS_ACE_FLAG = 0x00000020; const ACE4_IDENTIFIER_GROUP = 0x00000040; A server need not support any of these flags. If the server supports flags that are similar to, but not exactly the same as, these flags, the implementation may define a mapping between the protocol-defined flags and the implementation-defined flags. Again, the guiding principle is that the file not appear to be more secure than it really is. For example, suppose a client tries to set an ACE with ACE4_FILE_INHERIT_ACE set but not ACE4_DIRECTORY_INHERIT_ACE. If the server does not support any form of ACL inheritance, the server should reject the request with NFS4ERR_ATTRNOTSUPP. If the server supports a single "inherit ACE" flag that applies to both files and directories, the server may reject the request (i.e., requiring the client to set both the file and directory inheritance flags). The server may also accept the request and silently turn on the ACE4_DIRECTORY_INHERIT_ACE flag.5.11.4. ACE who
There are several special identifiers ("who") which need to be understood universally, rather than in the context of a particular DNS domain. Some of these identifiers cannot be understood when an NFS client accesses the server, but have meaning when a local process accesses the file. The ability to display and modify these permissions is permitted over NFS, even if none of the access methods on the server understands the identifiers.
Who Description _______________________________________________________________ "OWNER" The owner of the file. "GROUP" The group associated with the file. "EVERYONE" The world. "INTERACTIVE" Accessed from an interactive terminal. "NETWORK" Accessed via the network. "DIALUP" Accessed as a dialup user to the server. "BATCH" Accessed from a batch job. "ANONYMOUS" Accessed without any authentication. "AUTHENTICATED" Any authenticated user (opposite of ANONYMOUS) "SERVICE" Access from a system service. To avoid conflict, these special identifiers are distinguish by an appended "@" and should appear in the form "xxxx@" (note: no domain name after the "@"). For example: ANONYMOUS@.5.11.5. Mode Attribute
The NFS version 4 mode attribute is based on the UNIX mode bits. The following bits are defined: const MODE4_SUID = 0x800; /* set user id on execution */ const MODE4_SGID = 0x400; /* set group id on execution */ const MODE4_SVTX = 0x200; /* save text even after use */ const MODE4_RUSR = 0x100; /* read permission: owner */ const MODE4_WUSR = 0x080; /* write permission: owner */ const MODE4_XUSR = 0x040; /* execute permission: owner */ const MODE4_RGRP = 0x020; /* read permission: group */ const MODE4_WGRP = 0x010; /* write permission: group */ const MODE4_XGRP = 0x008; /* execute permission: group */ const MODE4_ROTH = 0x004; /* read permission: other */ const MODE4_WOTH = 0x002; /* write permission: other */ const MODE4_XOTH = 0x001; /* execute permission: other */ Bits MODE4_RUSR, MODE4_WUSR, and MODE4_XUSR apply to the principal identified in the owner attribute. Bits MODE4_RGRP, MODE4_WGRP, and MODE4_XGRP apply to the principals identified in the owner_group attribute. Bits MODE4_ROTH, MODE4_WOTH, MODE4_XOTH apply to any principal that does not match that in the owner group, and does not have a group matching that of the owner_group attribute. The remaining bits are not defined by this protocol and MUST NOT be used. The minor version mechanism must be used to define further bit usage.
Note that in UNIX, if a file has the MODE4_SGID bit set and no MODE4_XGRP bit set, then READ and WRITE must use mandatory file locking.5.11.6. Mode and ACL Attribute
The server that supports both mode and ACL must take care to synchronize the MODE4_*USR, MODE4_*GRP, and MODE4_*OTH bits with the ACEs which have respective who fields of "OWNER@", "GROUP@", and "EVERYONE@" so that the client can see semantically equivalent access permissions exist whether the client asks for owner, owner_group and mode attributes, or for just the ACL. Because the mode attribute includes bits (e.g., MODE4_SVTX) that have nothing to do with ACL semantics, it is permitted for clients to specify both the ACL attribute and mode in the same SETATTR operation. However, because there is no prescribed order for processing the attributes in a SETATTR, the client must ensure that ACL attribute, if specified without mode, would produce the desired mode bits, and conversely, the mode attribute if specified without ACL, would produce the desired "OWNER@", "GROUP@", and "EVERYONE@" ACEs.5.11.7. mounted_on_fileid
UNIX-based operating environments connect a filesystem into the namespace by connecting (mounting) the filesystem onto the existing file object (the mount point, usually a directory) of an existing filesystem. When the mount point's parent directory is read via an API like readdir(), the return results are directory entries, each with a component name and a fileid. The fileid of the mount point's directory entry will be different from the fileid that the stat() system call returns. The stat() system call is returning the fileid of the root of the mounted filesystem, whereas readdir() is returning the fileid stat() would have returned before any filesystems were mounted on the mount point. Unlike NFS version 3, NFS version 4 allows a client's LOOKUP request to cross other filesystems. The client detects the filesystem crossing whenever the filehandle argument of LOOKUP has an fsid attribute different from that of the filehandle returned by LOOKUP. A UNIX-based client will consider this a "mount point crossing". UNIX has a legacy scheme for allowing a process to determine its current working directory. This relies on readdir() of a mount point's parent and stat() of the mount point returning fileids as previously described. The mounted_on_fileid attribute corresponds to the fileid that readdir() would have returned as described previously.
While the NFS version 4 client could simply fabricate a fileid corresponding to what mounted_on_fileid provides (and if the server does not support mounted_on_fileid, the client has no choice), there is a risk that the client will generate a fileid that conflicts with one that is already assigned to another object in the filesystem. Instead, if the server can provide the mounted_on_fileid, the potential for client operational problems in this area is eliminated. If the server detects that there is no mounted point at the target file object, then the value for mounted_on_fileid that it returns is the same as that of the fileid attribute. The mounted_on_fileid attribute is RECOMMENDED, so the server SHOULD provide it if possible, and for a UNIX-based server, this is straightforward. Usually, mounted_on_fileid will be requested during a READDIR operation, in which case it is trivial (at least for UNIX- based servers) to return mounted_on_fileid since it is equal to the fileid of a directory entry returned by readdir(). If mounted_on_fileid is requested in a GETATTR operation, the server should obey an invariant that has it returning a value that is equal to the file object's entry in the object's parent directory, i.e., what readdir() would have returned. Some operating environments allow a series of two or more filesystems to be mounted onto a single mount point. In this case, for the server to obey the aforementioned invariant, it will need to find the base mount point, and not the intermediate mount points.6. Filesystem Migration and Replication
With the use of the recommended attribute "fs_locations", the NFS version 4 server has a method of providing filesystem migration or replication services. For the purposes of migration and replication, a filesystem will be defined as all files that share a given fsid (both major and minor values are the same). The fs_locations attribute provides a list of filesystem locations. These locations are specified by providing the server name (either DNS domain or IP address) and the path name representing the root of the filesystem. Depending on the type of service being provided, the list will provide a new location or a set of alternate locations for the filesystem. The client will use this information to redirect its requests to the new server.6.1. Replication
It is expected that filesystem replication will be used in the case of read-only data. Typically, the filesystem will be replicated on two or more servers. The fs_locations attribute will provide the
list of these locations to the client. On first access of the filesystem, the client should obtain the value of the fs_locations attribute. If, in the future, the client finds the server unresponsive, the client may attempt to use another server specified by fs_locations. If applicable, the client must take the appropriate steps to recover valid filehandles from the new server. This is described in more detail in the following sections.6.2. Migration
Filesystem migration is used to move a filesystem from one server to another. Migration is typically used for a filesystem that is writable and has a single copy. The expected use of migration is for load balancing or general resource reallocation. The protocol does not specify how the filesystem will be moved between servers. This server-to-server transfer mechanism is left to the server implementor. However, the method used to communicate the migration event between client and server is specified here. Once the servers participating in the migration have completed the move of the filesystem, the error NFS4ERR_MOVED will be returned for subsequent requests received by the original server. The NFS4ERR_MOVED error is returned for all operations except PUTFH and GETATTR. Upon receiving the NFS4ERR_MOVED error, the client will obtain the value of the fs_locations attribute. The client will then use the contents of the attribute to redirect its requests to the specified server. To facilitate the use of GETATTR, operations such as PUTFH must also be accepted by the server for the migrated file system's filehandles. Note that if the server returns NFS4ERR_MOVED, the server MUST support the fs_locations attribute. If the client requests more attributes than just fs_locations, the server may return fs_locations only. This is to be expected since the server has migrated the filesystem and may not have a method of obtaining additional attribute data. The server implementor needs to be careful in developing a migration solution. The server must consider all of the state information clients may have outstanding at the server. This includes but is not limited to locking/share state, delegation state, and asynchronous file writes which are represented by WRITE and COMMIT verifiers. The server should strive to minimize the impact on its clients during and after the migration process.
6.3. Interpretation of the fs_locations Attribute
The fs_location attribute is structured in the following way: struct fs_location { utf8str_cis server<>; pathname4 rootpath; }; struct fs_locations { pathname4 fs_root; fs_location locations<>; }; The fs_location struct is used to represent the location of a filesystem by providing a server name and the path to the root of the filesystem. For a multi-homed server or a set of servers that use the same rootpath, an array of server names may be provided. An entry in the server array is an UTF8 string and represents one of a traditional DNS host name, IPv4 address, or IPv6 address. It is not a requirement that all servers that share the same rootpath be listed in one fs_location struct. The array of server names is provided for convenience. Servers that share the same rootpath may also be listed in separate fs_location entries in the fs_locations attribute. The fs_locations struct and attribute then contains an array of locations. Since the name space of each server may be constructed differently, the "fs_root" field is provided. The path represented by fs_root represents the location of the filesystem in the server's name space. Therefore, the fs_root path is only associated with the server from which the fs_locations attribute was obtained. The fs_root path is meant to aid the client in locating the filesystem at the various servers listed. As an example, there is a replicated filesystem located at two servers (servA and servB). At servA the filesystem is located at path "/a/b/c". At servB the filesystem is located at path "/x/y/z". In this example the client accesses the filesystem first at servA with a multi-component lookup path of "/a/b/c/d". Since the client used a multi-component lookup to obtain the filehandle at "/a/b/c/d", it is unaware that the filesystem's root is located in servA's name space at "/a/b/c". When the client switches to servB, it will need to determine that the directory it first referenced at servA is now represented by the path "/x/y/z/d" on servB. To facilitate this, the fs_locations attribute provided by servA would have a fs_root value of "/a/b/c" and two entries in fs_location. One entry in fs_location will be for itself (servA) and the other will be for servB with a
path of "/x/y/z". With this information, the client is able to substitute "/x/y/z" for the "/a/b/c" at the beginning of its access path and construct "/x/y/z/d" to use for the new server. See the section "Security Considerations" for a discussion on the recommendations for the security flavor to be used by any GETATTR operation that requests the "fs_locations" attribute.6.4. Filehandle Recovery for Migration or Replication
Filehandles for filesystems that are replicated or migrated generally have the same semantics as for filesystems that are not replicated or migrated. For example, if a filesystem has persistent filehandles and it is migrated to another server, the filehandle values for the filesystem will be valid at the new server. For volatile filehandles, the servers involved likely do not have a mechanism to transfer filehandle format and content between themselves. Therefore, a server may have difficulty in determining if a volatile filehandle from an old server should return an error of NFS4ERR_FHEXPIRED. Therefore, the client is informed, with the use of the fh_expire_type attribute, whether volatile filehandles will expire at the migration or replication event. If the bit FH4_VOL_MIGRATION is set in the fh_expire_type attribute, the client must treat the volatile filehandle as if the server had returned the NFS4ERR_FHEXPIRED error. At the migration or replication event in the presence of the FH4_VOL_MIGRATION bit, the client will not present the original or old volatile filehandle to the new server. The client will start its communication with the new server by recovering its filehandles using the saved file names.7. NFS Server Name Space
7.1. Server Exports
On a UNIX server the name space describes all the files reachable by pathnames under the root directory or "/". On a Windows NT server the name space constitutes all the files on disks named by mapped disk letters. NFS server administrators rarely make the entire server's filesystem name space available to NFS clients. More often portions of the name space are made available via an "export" feature. In previous versions of the NFS protocol, the root filehandle for each export is obtained through the MOUNT protocol; the client sends a string that identifies the export of name space and the server returns the root filehandle for it. The MOUNT protocol supports an EXPORTS procedure that will enumerate the server's exports.
7.2. Browsing Exports
The NFS version 4 protocol provides a root filehandle that clients can use to obtain filehandles for these exports via a multi-component LOOKUP. A common user experience is to use a graphical user interface (perhaps a file "Open" dialog window) to find a file via progressive browsing through a directory tree. The client must be able to move from one export to another export via single-component, progressive LOOKUP operations. This style of browsing is not well supported by the NFS version 2 and 3 protocols. The client expects all LOOKUP operations to remain within a single server filesystem. For example, the device attribute will not change. This prevents a client from taking name space paths that span exports. An automounter on the client can obtain a snapshot of the server's name space using the EXPORTS procedure of the MOUNT protocol. If it understands the server's pathname syntax, it can create an image of the server's name space on the client. The parts of the name space that are not exported by the server are filled in with a "pseudo filesystem" that allows the user to browse from one mounted filesystem to another. There is a drawback to this representation of the server's name space on the client: it is static. If the server administrator adds a new export the client will be unaware of it.7.3. Server Pseudo Filesystem
NFS version 4 servers avoid this name space inconsistency by presenting all the exports within the framework of a single server name space. An NFS version 4 client uses LOOKUP and READDIR operations to browse seamlessly from one export to another. Portions of the server name space that are not exported are bridged via a "pseudo filesystem" that provides a view of exported directories only. A pseudo filesystem has a unique fsid and behaves like a normal, read only filesystem. Based on the construction of the server's name space, it is possible that multiple pseudo filesystems may exist. For example, /a pseudo filesystem /a/b real filesystem /a/b/c pseudo filesystem /a/b/c/d real filesystem Each of the pseudo filesystems are considered separate entities and therefore will have a unique fsid.
7.4. Multiple Roots
The DOS and Windows operating environments are sometimes described as having "multiple roots". Filesystems are commonly represented as disk letters. MacOS represents filesystems as top level names. NFS version 4 servers for these platforms can construct a pseudo file system above these root names so that disk letters or volume names are simply directory names in the pseudo root.7.5. Filehandle Volatility
The nature of the server's pseudo filesystem is that it is a logical representation of filesystem(s) available from the server. Therefore, the pseudo filesystem is most likely constructed dynamically when the server is first instantiated. It is expected that the pseudo filesystem may not have an on disk counterpart from which persistent filehandles could be constructed. Even though it is preferable that the server provide persistent filehandles for the pseudo filesystem, the NFS client should expect that pseudo file system filehandles are volatile. This can be confirmed by checking the associated "fh_expire_type" attribute for those filehandles in question. If the filehandles are volatile, the NFS client must be prepared to recover a filehandle value (e.g., with a multi-component LOOKUP) when receiving an error of NFS4ERR_FHEXPIRED.7.6. Exported Root
If the server's root filesystem is exported, one might conclude that a pseudo-filesystem is not needed. This would be wrong. Assume the following filesystems on a server: / disk1 (exported) /a disk2 (not exported) /a/b disk3 (exported) Because disk2 is not exported, disk3 cannot be reached with simple LOOKUPs. The server must bridge the gap with a pseudo-filesystem.7.7. Mount Point Crossing
The server filesystem environment may be constructed in such a way that one filesystem contains a directory which is 'covered' or mounted upon by a second filesystem. For example: /a/b (filesystem 1) /a/b/c/d (filesystem 2)
The pseudo filesystem for this server may be constructed to look like: / (place holder/not exported) /a/b (filesystem 1) /a/b/c/d (filesystem 2) It is the server's responsibility to present the pseudo filesystem that is complete to the client. If the client sends a lookup request for the path "/a/b/c/d", the server's response is the filehandle of the filesystem "/a/b/c/d". In previous versions of the NFS protocol, the server would respond with the filehandle of directory "/a/b/c/d" within the filesystem "/a/b". The NFS client will be able to determine if it crosses a server mount point by a change in the value of the "fsid" attribute.7.8. Security Policy and Name Space Presentation
The application of the server's security policy needs to be carefully considered by the implementor. One may choose to limit the viewability of portions of the pseudo filesystem based on the server's perception of the client's ability to authenticate itself properly. However, with the support of multiple security mechanisms and the ability to negotiate the appropriate use of these mechanisms, the server is unable to properly determine if a client will be able to authenticate itself. If, based on its policies, the server chooses to limit the contents of the pseudo filesystem, the server may effectively hide filesystems from a client that may otherwise have legitimate access. As suggested practice, the server should apply the security policy of a shared resource in the server's namespace to the components of the resource's ancestors. For example: / /a/b /a/b/c The /a/b/c directory is a real filesystem and is the shared resource. The security policy for /a/b/c is Kerberos with integrity. The server should apply the same security policy to /, /a, and /a/b. This allows for the extension of the protection of the server's namespace to the ancestors of the real shared resource.
For the case of the use of multiple, disjoint security mechanisms in the server's resources, the security for a particular object in the server's namespace should be the union of all security mechanisms of all direct descendants.8. File Locking and Share Reservations
Integrating locking into the NFS protocol necessarily causes it to be stateful. With the inclusion of share reservations the protocol becomes substantially more dependent on state than the traditional combination of NFS and NLM [XNFS]. There are three components to making this state manageable: o Clear division between client and server o Ability to reliably detect inconsistency in state between client and server o Simple and robust recovery mechanisms In this model, the server owns the state information. The client communicates its view of this state to the server as needed. The client is also able to detect inconsistent state before modifying a file. To support Win32 share reservations it is necessary to atomically OPEN or CREATE files. Having a separate share/unshare operation would not allow correct implementation of the Win32 OpenFile API. In order to correctly implement share semantics, the previous NFS protocol mechanisms used when a file is opened or created (LOOKUP, CREATE, ACCESS) need to be replaced. The NFS version 4 protocol has an OPEN operation that subsumes the NFS version 3 methodology of LOOKUP, CREATE, and ACCESS. However, because many operations require a filehandle, the traditional LOOKUP is preserved to map a file name to filehandle without establishing state on the server. The policy of granting access or modifying files is managed by the server based on the client's state. These mechanisms can implement policy ranging from advisory only locking to full mandatory locking.8.1. Locking
It is assumed that manipulating a lock is rare when compared to READ and WRITE operations. It is also assumed that crashes and network partitions are relatively rare. Therefore it is important that the READ and WRITE operations have a lightweight mechanism to indicate if they possess a held lock. A lock request contains the heavyweight information required to establish a lock and uniquely define the lock owner.
The following sections describe the transition from the heavy weight information to the eventual stateid used for most client and server locking and lease interactions.8.1.1. Client ID
For each LOCK request, the client must identify itself to the server. This is done in such a way as to allow for correct lock identification and crash recovery. A sequence of a SETCLIENTID operation followed by a SETCLIENTID_CONFIRM operation is required to establish the identification onto the server. Establishment of identification by a new incarnation of the client also has the effect of immediately breaking any leased state that a previous incarnation of the client might have had on the server, as opposed to forcing the new client incarnation to wait for the leases to expire. Breaking the lease state amounts to the server removing all lock, share reservation, and, where the server is not supporting the CLAIM_DELEGATE_PREV claim type, all delegation state associated with same client with the same identity. For discussion of delegation state recovery, see the section "Delegation Recovery". Client identification is encapsulated in the following structure: struct nfs_client_id4 { verifier4 verifier; opaque id<NFS4_OPAQUE_LIMIT>; }; The first field, verifier is a client incarnation verifier that is used to detect client reboots. Only if the verifier is different from that which the server has previously recorded the client (as identified by the second field of the structure, id) does the server start the process of canceling the client's leased state. The second field, id is a variable length string that uniquely defines the client. There are several considerations for how the client generates the id string: o The string should be unique so that multiple clients do not present the same string. The consequences of two clients presenting the same string range from one client getting an error to one client having its leased state abruptly and unexpectedly canceled.
o The string should be selected so the subsequent incarnations (e.g., reboots) of the same client cause the client to present the same string. The implementor is cautioned against an approach that requires the string to be recorded in a local file because this precludes the use of the implementation in an environment where there is no local disk and all file access is from an NFS version 4 server. o The string should be different for each server network address that the client accesses, rather than common to all server network addresses. The reason is that it may not be possible for the client to tell if the same server is listening on multiple network addresses. If the client issues SETCLIENTID with the same id string to each network address of such a server, the server will think it is the same client, and each successive SETCLIENTID will cause the server to begin the process of removing the client's previous leased state. o The algorithm for generating the string should not assume that the client's network address won't change. This includes changes between client incarnations and even changes while the client is stilling running in its current incarnation. This means that if the client includes just the client's and server's network address in the id string, there is a real risk, after the client gives up the network address, that another client, using a similar algorithm for generating the id string, will generate a conflicting id string. Given the above considerations, an example of a well generated id string is one that includes: o The server's network address. o The client's network address. o For a user level NFS version 4 client, it should contain additional information to distinguish the client from other user level clients running on the same host, such as a process id or other unique sequence. o Additional information that tends to be unique, such as one or more of: - The client machine's serial number (for privacy reasons, it is best to perform some one way function on the serial number). - A MAC address.
- The timestamp of when the NFS version 4 software was first installed on the client (though this is subject to the previously mentioned caution about using information that is stored in a file, because the file might only be accessible over NFS version 4). - A true random number. However since this number ought to be the same between client incarnations, this shares the same problem as that of the using the timestamp of the software installation. As a security measure, the server MUST NOT cancel a client's leased state if the principal established the state for a given id string is not the same as the principal issuing the SETCLIENTID. Note that SETCLIENTID and SETCLIENTID_CONFIRM has a secondary purpose of establishing the information the server needs to make callbacks to the client for purpose of supporting delegations. It is permitted to change this information via SETCLIENTID and SETCLIENTID_CONFIRM within the same incarnation of the client without removing the client's leased state. Once a SETCLIENTID and SETCLIENTID_CONFIRM sequence has successfully completed, the client uses the shorthand client identifier, of type clientid4, instead of the longer and less compact nfs_client_id4 structure. This shorthand client identifier (a clientid) is assigned by the server and should be chosen so that it will not conflict with a clientid previously assigned by the server. This applies across server restarts or reboots. When a clientid is presented to a server and that clientid is not recognized, as would happen after a server reboot, the server will reject the request with the error NFS4ERR_STALE_CLIENTID. When this happens, the client must obtain a new clientid by use of the SETCLIENTID operation and then proceed to any other necessary recovery for the server reboot case (See the section "Server Failure and Recovery"). The client must also employ the SETCLIENTID operation when it receives a NFS4ERR_STALE_STATEID error using a stateid derived from its current clientid, since this also indicates a server reboot which has invalidated the existing clientid (see the next section "lock_owner and stateid Definition" for details). See the detailed descriptions of SETCLIENTID and SETCLIENTID_CONFIRM for a complete specification of the operations.
8.1.2. Server Release of Clientid
If the server determines that the client holds no associated state for its clientid, the server may choose to release the clientid. The server may make this choice for an inactive client so that resources are not consumed by those intermittently active clients. If the client contacts the server after this release, the server must ensure the client receives the appropriate error so that it will use the SETCLIENTID/SETCLIENTID_CONFIRM sequence to establish a new identity. It should be clear that the server must be very hesitant to release a clientid since the resulting work on the client to recover from such an event will be the same burden as if the server had failed and restarted. Typically a server would not release a clientid unless there had been no activity from that client for many minutes. Note that if the id string in a SETCLIENTID request is properly constructed, and if the client takes care to use the same principal for each successive use of SETCLIENTID, then, barring an active denial of service attack, NFS4ERR_CLID_INUSE should never be returned. However, client bugs, server bugs, or perhaps a deliberate change of the principal owner of the id string (such as the case of a client that changes security flavors, and under the new flavor, there is no mapping to the previous owner) will in rare cases result in NFS4ERR_CLID_INUSE. In that event, when the server gets a SETCLIENTID for a client id that currently has no state, or it has state, but the lease has expired, rather than returning NFS4ERR_CLID_INUSE, the server MUST allow the SETCLIENTID, and confirm the new clientid if followed by the appropriate SETCLIENTID_CONFIRM.8.1.3. lock_owner and stateid Definition
When requesting a lock, the client must present to the server the clientid and an identifier for the owner of the requested lock. These two fields are referred to as the lock_owner and the definition of those fields are: o A clientid returned by the server as part of the client's use of the SETCLIENTID operation. o A variable length opaque array used to uniquely define the owner of a lock managed by the client. This may be a thread id, process id, or other unique value.
When the server grants the lock, it responds with a unique stateid. The stateid is used as a shorthand reference to the lock_owner, since the server will be maintaining the correspondence between them. The server is free to form the stateid in any manner that it chooses as long as it is able to recognize invalid and out-of-date stateids. This requirement includes those stateids generated by earlier instances of the server. From this, the client can be properly notified of a server restart. This notification will occur when the client presents a stateid to the server from a previous instantiation. The server must be able to distinguish the following situations and return the error as specified: o The stateid was generated by an earlier server instance (i.e., before a server reboot). The error NFS4ERR_STALE_STATEID should be returned. o The stateid was generated by the current server instance but the stateid no longer designates the current locking state for the lockowner-file pair in question (i.e., one or more locking operations has occurred). The error NFS4ERR_OLD_STATEID should be returned. This error condition will only occur when the client issues a locking request which changes a stateid while an I/O request that uses that stateid is outstanding. o The stateid was generated by the current server instance but the stateid does not designate a locking state for any active lockowner-file pair. The error NFS4ERR_BAD_STATEID should be returned. This error condition will occur when there has been a logic error on the part of the client or server. This should not happen. One mechanism that may be used to satisfy these requirements is for the server to, o divide the "other" field of each stateid into two fields: - A server verifier which uniquely designates a particular server instantiation. - An index into a table of locking-state structures.
o utilize the "seqid" field of each stateid, such that seqid is monotonically incremented for each stateid that is associated with the same index into the locking-state table. By matching the incoming stateid and its field values with the state held at the server, the server is able to easily determine if a stateid is valid for its current instantiation and state. If the stateid is not valid, the appropriate error can be supplied to the client.8.1.4. Use of the stateid and Locking
All READ, WRITE and SETATTR operations contain a stateid. For the purposes of this section, SETATTR operations which change the size attribute of a file are treated as if they are writing the area between the old and new size (i.e., the range truncated or added to the file by means of the SETATTR), even where SETATTR is not explicitly mentioned in the text. If the lock_owner performs a READ or WRITE in a situation in which it has established a lock or share reservation on the server (any OPEN constitutes a share reservation) the stateid (previously returned by the server) must be used to indicate what locks, including both record locks and share reservations, are held by the lockowner. If no state is established by the client, either record lock or share reservation, a stateid of all bits 0 is used. Regardless whether a stateid of all bits 0, or a stateid returned by the server is used, if there is a conflicting share reservation or mandatory record lock held on the file, the server MUST refuse to service the READ or WRITE operation. Share reservations are established by OPEN operations and by their nature are mandatory in that when the OPEN denies READ or WRITE operations, that denial results in such operations being rejected with error NFS4ERR_LOCKED. Record locks may be implemented by the server as either mandatory or advisory, or the choice of mandatory or advisory behavior may be determined by the server on the basis of the file being accessed (for example, some UNIX-based servers support a "mandatory lock bit" on the mode attribute such that if set, record locks are required on the file before I/O is possible). When record locks are advisory, they only prevent the granting of conflicting lock requests and have no effect on READs or WRITEs. Mandatory record locks, however, prevent conflicting I/O operations. When they are attempted, they are rejected with NFS4ERR_LOCKED. When the client gets NFS4ERR_LOCKED on a file it knows it has the proper share reservation for, it will need to issue a LOCK request on the region
of the file that includes the region the I/O was to be performed on, with an appropriate locktype (i.e., READ*_LT for a READ operation, WRITE*_LT for a WRITE operation). With NFS version 3, there was no notion of a stateid so there was no way to tell if the application process of the client sending the READ or WRITE operation had also acquired the appropriate record lock on the file. Thus there was no way to implement mandatory locking. With the stateid construct, this barrier has been removed. Note that for UNIX environments that support mandatory file locking, the distinction between advisory and mandatory locking is subtle. In fact, advisory and mandatory record locks are exactly the same in so far as the APIs and requirements on implementation. If the mandatory lock attribute is set on the file, the server checks to see if the lockowner has an appropriate shared (read) or exclusive (write) record lock on the region it wishes to read or write to. If there is no appropriate lock, the server checks if there is a conflicting lock (which can be done by attempting to acquire the conflicting lock on the behalf of the lockowner, and if successful, release the lock after the READ or WRITE is done), and if there is, the server returns NFS4ERR_LOCKED. For Windows environments, there are no advisory record locks, so the server always checks for record locks during I/O requests. Thus, the NFS version 4 LOCK operation does not need to distinguish between advisory and mandatory record locks. It is the NFS version 4 server's processing of the READ and WRITE operations that introduces the distinction. Every stateid other than the special stateid values noted in this section, whether returned by an OPEN-type operation (i.e., OPEN, OPEN_DOWNGRADE), or by a LOCK-type operation (i.e., LOCK or LOCKU), defines an access mode for the file (i.e., READ, WRITE, or READ- WRITE) as established by the original OPEN which began the stateid sequence, and as modified by subsequent OPENs and OPEN_DOWNGRADEs within that stateid sequence. When a READ, WRITE, or SETATTR which specifies the size attribute, is done, the operation is subject to checking against the access mode to verify that the operation is appropriate given the OPEN with which the operation is associated. In the case of WRITE-type operations (i.e., WRITEs and SETATTRs which set size), the server must verify that the access mode allows writing and return an NFS4ERR_OPENMODE error if it does not. In the case, of READ, the server may perform the corresponding check on the access mode, or it may choose to allow READ on opens for WRITE only, to accommodate clients whose write implementation may unavoidably do
reads (e.g., due to buffer cache constraints). However, even if READs are allowed in these circumstances, the server MUST still check for locks that conflict with the READ (e.g., another open specify denial of READs). Note that a server which does enforce the access mode check on READs need not explicitly check for conflicting share reservations since the existence of OPEN for read access guarantees that no conflicting share reservation can exist. A stateid of all bits 1 (one) MAY allow READ operations to bypass locking checks at the server. However, WRITE operations with a stateid with bits all 1 (one) MUST NOT bypass locking checks and are treated exactly the same as if a stateid of all bits 0 were used. A lock may not be granted while a READ or WRITE operation using one of the special stateids is being performed and the range of the lock request conflicts with the range of the READ or WRITE operation. For the purposes of this paragraph, a conflict occurs when a shared lock is requested and a WRITE operation is being performed, or an exclusive lock is requested and either a READ or a WRITE operation is being performed. A SETATTR that sets size is treated similarly to a WRITE as discussed above.8.1.5. Sequencing of Lock Requests
Locking is different than most NFS operations as it requires "at- most-one" semantics that are not provided by ONCRPC. ONCRPC over a reliable transport is not sufficient because a sequence of locking requests may span multiple TCP connections. In the face of retransmission or reordering, lock or unlock requests must have a well defined and consistent behavior. To accomplish this, each lock request contains a sequence number that is a consecutively increasing integer. Different lock_owners have different sequences. The server maintains the last sequence number (L) received and the response that was returned. The first request issued for any given lock_owner is issued with a sequence number of zero. Note that for requests that contain a sequence number, for each lock_owner, there should be no more than one outstanding request. If a request (r) with a previous sequence number (r < L) is received, it is rejected with the return of error NFS4ERR_BAD_SEQID. Given a properly-functioning client, the response to (r) must have been received before the last request (L) was sent. If a duplicate of last request (r == L) is received, the stored response is returned. If a request beyond the next sequence (r == L + 2) is received, it is rejected with the return of error NFS4ERR_BAD_SEQID. Sequence history is reinitialized whenever the SETCLIENTID/SETCLIENTID_CONFIRM sequence changes the client verifier.
Since the sequence number is represented with an unsigned 32-bit integer, the arithmetic involved with the sequence number is mod 2^32. For an example of modulo arithmetic involving sequence numbers see [RFC793]. It is critical the server maintain the last response sent to the client to provide a more reliable cache of duplicate non-idempotent requests than that of the traditional cache described in [Juszczak]. The traditional duplicate request cache uses a least recently used algorithm for removing unneeded requests. However, the last lock request and response on a given lock_owner must be cached as long as the lock state exists on the server. The client MUST monotonically increment the sequence number for the CLOSE, LOCK, LOCKU, OPEN, OPEN_CONFIRM, and OPEN_DOWNGRADE operations. This is true even in the event that the previous operation that used the sequence number received an error. The only exception to this rule is if the previous operation received one of the following errors: NFS4ERR_STALE_CLIENTID, NFS4ERR_STALE_STATEID, NFS4ERR_BAD_STATEID, NFS4ERR_BAD_SEQID, NFS4ERR_BADXDR, NFS4ERR_RESOURCE, NFS4ERR_NOFILEHANDLE.8.1.6. Recovery from Replayed Requests
As described above, the sequence number is per lock_owner. As long as the server maintains the last sequence number received and follows the methods described above, there are no risks of a Byzantine router re-sending old requests. The server need only maintain the (lock_owner, sequence number) state as long as there are open files or closed files with locks outstanding. LOCK, LOCKU, OPEN, OPEN_DOWNGRADE, and CLOSE each contain a sequence number and therefore the risk of the replay of these operations resulting in undesired effects is non-existent while the server maintains the lock_owner state.8.1.7. Releasing lock_owner State
When a particular lock_owner no longer holds open or file locking state at the server, the server may choose to release the sequence number state associated with the lock_owner. The server may make this choice based on lease expiration, for the reclamation of server memory, or other implementation specific details. In any event, the server is able to do this safely only when the lock_owner no longer is being utilized by the client. The server may choose to hold the lock_owner state in the event that retransmitted requests are received. However, the period to hold this state is implementation specific.
In the case that a LOCK, LOCKU, OPEN_DOWNGRADE, or CLOSE is retransmitted after the server has previously released the lock_owner state, the server will find that the lock_owner has no files open and an error will be returned to the client. If the lock_owner does have a file open, the stateid will not match and again an error is returned to the client.8.1.8. Use of Open Confirmation
In the case that an OPEN is retransmitted and the lock_owner is being used for the first time or the lock_owner state has been previously released by the server, the use of the OPEN_CONFIRM operation will prevent incorrect behavior. When the server observes the use of the lock_owner for the first time, it will direct the client to perform the OPEN_CONFIRM for the corresponding OPEN. This sequence establishes the use of an lock_owner and associated sequence number. Since the OPEN_CONFIRM sequence connects a new open_owner on the server with an existing open_owner on a client, the sequence number may have any value. The OPEN_CONFIRM step assures the server that the value received is the correct one. See the section "OPEN_CONFIRM - Confirm Open" for further details. There are a number of situations in which the requirement to confirm an OPEN would pose difficulties for the client and server, in that they would be prevented from acting in a timely fashion on information received, because that information would be provisional, subject to deletion upon non-confirmation. Fortunately, these are situations in which the server can avoid the need for confirmation when responding to open requests. The two constraints are: o The server must not bestow a delegation for any open which would require confirmation. o The server MUST NOT require confirmation on a reclaim-type open (i.e., one specifying claim type CLAIM_PREVIOUS or CLAIM_DELEGATE_PREV). These constraints are related in that reclaim-type opens are the only ones in which the server may be required to send a delegation. For CLAIM_NULL, sending the delegation is optional while for CLAIM_DELEGATE_CUR, no delegation is sent. Delegations being sent with an open requiring confirmation are troublesome because recovering from non-confirmation adds undue complexity to the protocol while requiring confirmation on reclaim- type opens poses difficulties in that the inability to resolve
the status of the reclaim until lease expiration may make it difficult to have timely determination of the set of locks being reclaimed (since the grace period may expire). Requiring open confirmation on reclaim-type opens is avoidable because of the nature of the environments in which such opens are done. For CLAIM_PREVIOUS opens, this is immediately after server reboot, so there should be no time for lockowners to be created, found to be unused, and recycled. For CLAIM_DELEGATE_PREV opens, we are dealing with a client reboot situation. A server which supports delegation can be sure that no lockowners for that client have been recycled since client initialization and thus can ensure that confirmation will not be required.