4. Implementation issues

The NFS version 3 protocol was designed to allow different operating systems to share files. However, since it was designed in a UNIX environment, many operations have semantics similar to the operations of the UNIX file system. This section discusses some of the general implementation-specific details and semantic issues. Procedure descriptions have implementation comments specific to that procedure.

A number of papers have been written describing issues encountered when constructing an NFS version 2 protocol implementation. The best overview paper is still [Sandberg]. [Israel], [Macklem], and [Pawlowski] describe other implementations. [X/OpenNFS] provides a complete description of the NFS version 2 protocol and supporting protocols, as well as a discussion on implementation issues and procedure and error semantics. Many of the issues encountered when constructing an NFS version 2 protocol implementation will also be encountered when constructing an NFS version 3 protocol implementation.

4.1 Multiple version support

The RPC protocol provides explicit support for versioning of a service. Client and server implementations of the NFS version 3 protocol should support both versions, for full backwards compatibility, when possible. The default behavior of the RPC binding protocol is that the client and server bind using the highest version number that they both support. Client or server implementations that cannot easily support both versions (for example, because of memory restrictions) will have to choose which version to support. The NFS version 2 protocol would be a safe choice since fully capable clients and servers should support both versions. However, this choice would need to be made keeping all requirements in mind.
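The following is a minimal sketch, not part of the protocol, of one way a client might implement the highest-version-first binding convention using the ONC RPC library. The program and version numbers are those of the NFS service; the function name and the choice of UDP are illustrative only.

      /* Sketch: bind to NFS version 3 if the server offers it, else
       * fall back to version 2.  Assumes the ONC RPC library
       * (<rpc/rpc.h>); error handling is abbreviated. */
      #include <rpc/rpc.h>

      #define NFS_PROGRAM 100003
      #define NFS_V3      3
      #define NFS_V2      2

      CLIENT *
      bind_nfs(const char *host)
      {
          CLIENT *clnt;

          /* Try the highest version first. */
          clnt = clnt_create(host, NFS_PROGRAM, NFS_V3, "udp");
          if (clnt != NULL)
              return clnt;

          /* The server may only implement the NFS version 2 protocol. */
          clnt = clnt_create(host, NFS_PROGRAM, NFS_V2, "udp");
          if (clnt == NULL)
              clnt_pcreateerror(host);
          return clnt;
      }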
4.2 Server/client relationship

The NFS version 3 protocol is designed to allow servers to be as simple and general as possible. Sometimes the simplicity of the server can be a problem if the client implements complicated file system semantics. For example, some operating systems allow removal of open files. A process can open a file and, while it is open, remove it from the directory. The file can be read and written as long as the process keeps it open, even though the file has no name in the file system. It is impossible for a stateless server to implement these semantics. The client can do some tricks, such as renaming the file on remove (to a hidden name) and only physically deleting it on close. The NFS version 3 protocol provides sufficient functionality to implement most file system semantics on a client.

Every NFS version 3 protocol client can also potentially be a server, and remote and local mounted file systems can be freely mixed. This leads to some problems when a client travels down the directory tree of a remote file system and reaches the mount point on the server for another remote file system. Allowing the server to follow the second remote mount would require loop detection, server lookup, and user revalidation. Instead, both NFS version 2 protocol and NFS version 3 protocol implementations do not typically let clients cross a server's mount point. When a client does a LOOKUP on a directory on which the server has mounted a file system, the client sees the underlying directory instead of the mounted directory.

For example, if a server has a file system called /usr and mounts another file system on /usr/src, then a client that mounts /usr does not see the mounted version of /usr/src. A client could do remote mounts that match the server's mount points to maintain the server's view. In this example, the client would also have to mount /usr/src in addition to /usr, even if they are from the same server.

4.3 Path name interpretation

There are a few complications to the rule that path names are always parsed on the client. For example, symbolic links could have different interpretations on different clients. There is no answer to this problem in this specification.

Another common problem for non-UNIX implementations is the special interpretation of the pathname, "..", to mean the parent of a given directory. A future revision of the protocol may use an explicit flag to indicate the parent instead. However, this is not a serious problem, as many working non-UNIX implementations exist.
4.4 Permission issues

The NFS version 3 protocol, strictly speaking, does not define the permission checking used by servers. However, it is expected that a server will do normal operating system permission checking using AUTH_UNIX style authentication as the basis of its protection mechanism, or another stronger form of authentication such as AUTH_DES or AUTH_KERB. With AUTH_UNIX authentication, the server gets the client's effective uid, effective gid, and groups on each call and uses them to check permission. These are the so-called UNIX credentials. AUTH_DES and AUTH_KERB use a network name, or netname, as the basis for identification (from which a UNIX server derives the necessary standard UNIX credentials). There are problems with this method that have been solved.

Using uid and gid implies that the client and server share the same uid list. Every server and client pair must have the same mapping from user to uid and from group to gid. Since every client can also be a server, this tends to imply that the whole network shares the same uid/gid space. If this is not the case, then it usually falls upon the server to perform some custom mapping of credentials from one authentication domain into another. A discussion of techniques for managing a shared user space or for providing mechanisms for user ID mapping is beyond the scope of this specification.

Another problem arises due to the usually stateful open operation. Most operating systems check permission at open time, and then check that the file is open on each read and write request. With stateless servers, the server cannot detect that the file is open and must do permission checking on each read and write call. UNIX client semantics of access permission checking on open can be provided with the ACCESS procedure call in this revision, which allows a client to explicitly check access permissions without resorting to trying the operation.

On a local file system, a user can open a file and then change the permissions so that no one is allowed to touch it, but will still be able to write to the file because it is open. On a remote file system, by contrast, the write would fail. To get around this problem, the server's permission checking algorithm should allow the owner of a file to access it regardless of the permission setting. This is needed in a practical NFS version 3 protocol server implementation, but it does depart from correct local file system semantics. This should not affect the return result of access permissions as returned by the ACCESS procedure, however.
A similar problem has to do with paging in an executable program over the network. The operating system usually checks for execute permission before opening a file for demand paging, and then reads blocks from the open file. In a local UNIX file system, an executable file does not need read permission to execute (pagein). An NFS version 3 protocol server can not tell the difference between a normal file read (where the read permission bit is meaningful) and a demand pagein read (where the server should allow access to the executable file if the execute bit is set for that user or group or public). To make this work, the server allows reading of files if the uid given in the call has either execute or read permission on the file, through ownership, group membership or public access. Again, this departs from correct local file system semantics.

In most operating systems, a particular user (on UNIX, the uid 0) has access to all files, no matter what permission and ownership they have. This superuser permission may not be allowed on the server, since anyone who can become superuser on their client could gain access to all remote files. A UNIX server by default maps uid 0 to a distinguished value (UID_NOBODY), as well as mapping the groups list, before doing its access checking. A server implementation may provide a mechanism to change this mapping. This works except for NFS version 3 protocol root file systems (required for diskless NFS version 3 protocol client support), where superuser access cannot be avoided. Export options are used, on the server, to restrict the set of clients allowed superuser access.
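The following sketch, under the assumption of a single gid rather than a full groups list and with illustrative types, shows how the relaxed checks described above might be combined in a server's read-permission routine: map uid 0 to UID_NOBODY, always allow the owner, and accept either read or execute permission so that demand paging works. It is not part of the protocol.

      /* Sketch of a relaxed server-side read permission check.
       * The fattr structure and mode bits are illustrative. */
      #include <sys/types.h>
      #include <sys/stat.h>
      #include <stdbool.h>

      #define UID_NOBODY 65534

      struct fattr { mode_t mode; uid_t uid; gid_t gid; };

      static bool
      nfs_read_permitted(const struct fattr *fa, uid_t uid, gid_t gid)
      {
          mode_t read_bit, exec_bit;

          if (uid == 0)
              uid = UID_NOBODY;       /* do not honor client superuser */

          if (uid == fa->uid)
              return true;            /* owner may always access the file */

          if (gid == fa->gid) {
              read_bit = S_IRGRP;
              exec_bit = S_IXGRP;
          } else {
              read_bit = S_IROTH;
              exec_bit = S_IXOTH;
          }

          /* Allow the read if either read or execute permission is
           * set, so that demand paging of executables works. */
          return (fa->mode & (read_bit | exec_bit)) != 0;
      }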
4.5 Duplicate request cache

The typical NFS version 3 protocol failure recovery model uses client time-out and retry to handle server crashes, network partitions, and lost server replies. A retried request is called a duplicate of the original.

When used in a file server context, the term idempotent can be used to distinguish between operation types. An idempotent request is one that a server can perform more than once with equivalent results (though it may in fact change, as a side effect, the access time on a file, say for READ). Some NFS operations are obviously non-idempotent. They cannot be reprocessed without special attention simply because they may fail if tried a second time. The CREATE request, for example, can be used to create a file for which the owner does not have write permission. A duplicate of this request cannot succeed if the original succeeded. Likewise, a file can be removed only once.

The side effects caused by performing a duplicate non-idempotent request can be destructive (for example, a truncate operation causing lost writes). The combination of a stateless design with the common choice of an unreliable network transport (UDP) implies the possibility of destructive replays of non-idempotent requests. Though to be more accurate, it is the inherent stateless design of the NFS version 3 protocol on top of an unreliable RPC mechanism that yields the possibility of destructive replays of non-idempotent requests, since even in an implementation of the NFS version 3 protocol over a reliable connection-oriented transport, a connection break with automatic reestablishment requires duplicate request processing (the client will retransmit the request, and the server needs to deal with a potential duplicate non-idempotent request).

Most NFS version 3 protocol server implementations use a cache of recent requests (called the duplicate request cache) for the processing of duplicate non-idempotent requests. The duplicate request cache provides a short-term memory mechanism in which the original completion status of a request is remembered and the operation attempted only once. If a duplicate copy of this request is received, then the original completion status is returned.

The duplicate-request cache mechanism has been useful in reducing destructive side effects caused by duplicate NFS version 3 protocol requests. This mechanism, however, does not guarantee against these destructive side effects in all failure modes. Most servers store the duplicate request cache in RAM, so the contents are lost if the server crashes. The exception to this may possibly occur in a redundant server approach to high availability, where the file system itself may be used to share the duplicate request cache state. Even if the cache survives server reboots (or failovers in the high availability case), its effectiveness is a function of its size. A network partition can cause a cache entry to be reused before a client receives a reply for the corresponding request. If this happens, the duplicate request will be processed as a new one, possibly with destructive side effects.

A good description of the implementation and use of a duplicate request cache can be found in [Juszczak].
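As an illustration only, a minimal duplicate request cache might be a fixed-size table keyed by the RPC transaction id, client address, and procedure number. A real server would also cache the full reply and age entries; the structure, sizes, and hash below are illustrative assumptions, not part of the protocol.

      /* Sketch of a duplicate request cache (DRC). */
      #include <stdint.h>
      #include <stdbool.h>

      #define DRC_SIZE 128              /* illustrative size only */

      struct drc_entry {
          bool     valid;
          uint32_t xid;                 /* RPC transaction id */
          uint32_t client_addr;         /* client IP address, host order */
          uint32_t proc;                /* procedure number */
          int      status;              /* cached completion status */
      };

      static struct drc_entry drc[DRC_SIZE];

      static unsigned
      drc_hash(uint32_t xid, uint32_t addr)
      {
          return (xid ^ addr) % DRC_SIZE;
      }

      /* Returns true and fills *status if the request is a duplicate. */
      bool
      drc_lookup(uint32_t xid, uint32_t addr, uint32_t proc, int *status)
      {
          struct drc_entry *e = &drc[drc_hash(xid, addr)];

          if (e->valid && e->xid == xid && e->client_addr == addr &&
              e->proc == proc) {
              *status = e->status;      /* replay original completion status */
              return true;
          }
          return false;
      }

      /* Record the result of a request that has just been executed.
       * An entry may be overwritten at any time, which is why the
       * cache cannot guarantee against destructive replays. */
      void
      drc_insert(uint32_t xid, uint32_t addr, uint32_t proc, int status)
      {
          struct drc_entry *e = &drc[drc_hash(xid, addr)];

          e->valid = true;
          e->xid = xid;
          e->client_addr = addr;
          e->proc = proc;
          e->status = status;
      }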
4.6 File name component handling

Server implementations of the NFS version 3 protocol will frequently impose restrictions on the names which can be created. Many servers will also forbid the use of names that contain certain characters, such as the path component separator used by the server operating system. For example, the UFS file system will reject a name which contains "/", while "." and ".." are distinguished in UFS and may not be specified as the name when creating a file system object. The exact error status values returned for these errors are specified in the description of each procedure argument. The values (which conform to NFS version 2 protocol server practice) are not necessarily obvious, nor are they consistent from one procedure to the next.

4.7 Synchronous modifying operations

Data-modifying operations in the NFS version 3 protocol are synchronous. When a procedure returns to the client, the client can assume that the operation has completed and any data associated with the request is now on stable storage.

4.8 Stable storage

NFS version 3 protocol servers must be able to recover without data loss from multiple power failures (including cascading power failures, that is, several power failures in quick succession), operating system failures, and hardware failure of components other than the storage medium itself (for example, disk, nonvolatile RAM).

Some examples of stable storage that are allowable for an NFS server include:

1. Media commit of data, that is, the modified data has been successfully written to the disk media, for example, the disk platter.

2. An immediate reply disk drive with battery-backed on-drive intermediate storage or uninterruptible power system (UPS).

3. Server commit of data with battery-backed intermediate storage and recovery software.
4. Cache commit with uninterruptible power system (UPS) and recovery software.

Conversely, the following are not examples of stable storage:

1. An immediate reply disk drive without battery-backed on-drive intermediate storage or uninterruptible power system (UPS).

2. Cache commit without both uninterruptible power system (UPS) and recovery software.

The only exception to this (introduced in this protocol revision) is as described under the WRITE procedure on the handling of the stable bit, and the use of the COMMIT procedure. It is the use of the synchronous COMMIT procedure that provides the necessary semantic support in the NFS version 3 protocol.

4.9 Lookups and name resolution

A common objection to the NFS version 3 protocol is the philosophy of component-by-component LOOKUP by the client in resolving a name. The objection is that this is inefficient, as latencies for component-by-component LOOKUP would be unbearable.

Implementation practice solves this issue. A name cache, providing component to file-handle mapping, is kept on the client to short circuit actual LOOKUP invocations over the wire. The cache is subject to cache timeout parameters that bound attributes.

4.10 Adaptive retransmission

Most client implementations use either an exponential back-off strategy to some maximum retransmission value, or a more adaptive strategy that attempts congestion avoidance. Congestion avoidance schemes in NFS request retransmission are modelled on the work presented in [Jacobson]. [Nowicki] and [Macklem] describe congestion avoidance schemes to be applied to the NFS protocol over UDP.
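A sketch of the simpler of the two strategies, exponential back-off bounded by a maximum, follows. The initial and maximum timeout values are illustrative assumptions; an adaptive implementation would instead estimate the timeout from observed round-trip times in the style of [Jacobson].

      /* Sketch: double the retransmission timeout on each timeout,
       * up to an illustrative upper bound. */
      #include <stdint.h>

      #define RETRANS_TIMEO_INIT_MS  1100    /* illustrative initial value */
      #define RETRANS_TIMEO_MAX_MS  60000    /* illustrative upper bound  */

      uint32_t
      next_timeout(uint32_t current_ms)
      {
          uint32_t next = current_ms * 2;    /* exponential back-off */

          if (next > RETRANS_TIMEO_MAX_MS)
              next = RETRANS_TIMEO_MAX_MS;
          return next;
      }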
4.11 Caching policies

The NFS version 3 protocol does not define a policy for caching on the client or server. In particular, there is no support for strict cache consistency between a client and server, nor between different clients. See [Kazar] for a discussion of the issues of cache synchronization and mechanisms in several distributed file systems.

4.12 Stable versus unstable writes

The setting of the stable field in the WRITE arguments, that is, whether or not to do asynchronous WRITE requests, is straightforward on a UNIX client. If the NFS version 3 protocol client receives a write request that is not marked as being asynchronous, it should generate the RPC with stable set to TRUE. If the request is marked as being asynchronous, the RPC should be generated with stable set to FALSE.

If the response comes back with the committed field set to TRUE, the client should just mark the write request as done and no further action is required. If committed is set to FALSE, indicating that the buffer was not synchronized with the server's disk, the client will need to mark the buffer in some way which indicates that a copy of the buffer lives on the server and that a new copy does not need to be sent to the server, but that a commit is required.

Note that this algorithm introduces a new state for buffers, thus there are now three states for buffers. The three states are dirty, done but needs to be committed, and done. This extra state on the client will likely require modifications to the system outside of the NFS version 3 protocol client.
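The three buffer states and the transitions driven by WRITE and COMMIT replies might be represented as in the sketch below. For illustration the committed field and the verifier comparison are simplified to booleans, matching the description above; real clients keep this state in their buffer or page cache structures.

      /* Sketch of the three client buffer states for unstable writes. */
      #include <stdbool.h>

      enum buf_state {
          BUF_DIRTY,          /* modified locally, not yet written        */
          BUF_NEEDS_COMMIT,   /* on the server, but not yet stable        */
          BUF_DONE            /* known to be on the server's stable store */
      };

      /* Called when the reply to a WRITE of this buffer arrives;
       * 'committed' reflects the committed field of the reply. */
      enum buf_state
      write_reply(bool committed)
      {
          if (committed)
              return BUF_DONE;      /* data reached stable storage */

          /* The data lives on the server but could be lost in a crash;
           * keep the buffer so it can be resent if necessary, and
           * remember that a COMMIT is still required. */
          return BUF_NEEDS_COMMIT;
      }

      /* Called when a COMMIT reply covering this buffer arrives. */
      enum buf_state
      commit_reply(bool verifier_unchanged)
      {
          if (verifier_unchanged)
              return BUF_DONE;

          /* The server may have rebooted and lost the uncommitted
           * data; the buffer must be written again. */
          return BUF_DIRTY;
      }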
One proposal that was rejected was the addition of a boolean commit argument to the WRITE operation. It would be used to indicate whether the server should do a full file commit after doing the write. This seems as if it could be useful if the client knew that it was doing the last write on the file. It is difficult to see how this could be used, given existing client architectures, though.

The asynchronous write opens up the window of problems associated with write sharing. For example: client A writes some data asynchronously. Client A is still holding the buffers cached, waiting to commit them later. Client B reads the modified data and writes it back to the server. The server then crashes. When it comes back up, client A issues a COMMIT operation which returns with a different cookie as well as changed attributes. In this case, the correct action may or may not be to retransmit the cached buffers. Unfortunately, client A can't tell for sure, so it will need to retransmit the buffers, thus overwriting the changes from client B. Fortunately, write sharing is rare and the solution matches the current write sharing situation. Without using locking for synchronization, the behaviour will be indeterminate.

In a high availability (redundant system) server implementation, two cases exist which relate to the verf changing. If the high availability server implementation does not use a shared-memory scheme, then the verf should change on failover, since the unsynchronized data is not available to the second processor and there is no guarantee that the system which had the data cached was able to flush it to stable storage before going down. The client will need to retransmit the data to be safe. In a shared-memory high availability server implementation, the verf would not need to change because the server would still have the cached data available to it to be flushed. The exact policy regarding the verf in a shared memory high availability implementation, however, is up to the server implementor.

4.13 32 bit clients/servers and 64 bit clients/servers

The 64 bit nature of the NFS version 3 protocol introduces several compatibility problems. The two most notable are mismatched clients and servers, that is, a 32 bit client and a 64 bit server or a 64 bit client and a 32 bit server.

The problems of a 64 bit client and a 32 bit server are easy to handle. The client will never encounter a file that it can not handle. If it sends a request to the server that the server can not handle, the server should reject the request with an appropriate error.

The problems of a 32 bit client and a 64 bit server are much harder to handle. In this situation, the server does not have a problem because it can handle anything that the client can generate. However, the client may encounter a file that it can not handle. The client will not be able to handle a file whose size can not be expressed in 32 bits. Thus, the client will not be able to properly decode the size of the file into its local attributes structure. Also, a file can grow beyond the limit of the client while the client is accessing the file.

The solutions to these problems are left up to the individual implementor. However, there are two common approaches used to resolve this situation. The implementor can choose between them or even invent a new solution altogether.
The most common solution is for the client to deny access to any file whose size can not be expressed in 32 bits. This is probably the safest, but does introduce some strange semantics when the file grows beyond the limit of the client while it is being accessed by that client. The file becomes inaccessible even while it is being accessed.

The second solution is for the client to map any size greater than it can handle to the maximum size that it can handle. Effectively, it is lying to the application program. This allows the application to access as much of the file as possible given the 32 bit offset restriction. This eliminates the strange semantic of the file effectively disappearing after it has been accessed, but does introduce other problems. The client will not be able to access the entire file.

Currently, the first solution is the recommended solution. However, client implementors are encouraged to do the best that they can to reduce the effects of this situation.
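The two approaches reduce to a simple choice when a 32 bit client decodes a 64 bit file size, as in the sketch below. The error value and the 32 bit limit used here are illustrative assumptions, not requirements of this specification.

      /* Sketch: handling a 64-bit server file size on a 32-bit client. */
      #include <stdint.h>
      #include <errno.h>

      #define CLIENT_MAX_FILESIZE 0x7fffffffUL   /* largest offset the
                                                    client can represent */

      /* Approach 1: deny access to files the client cannot represent. */
      int
      check_size(uint64_t server_size)
      {
          if (server_size > CLIENT_MAX_FILESIZE)
              return EFBIG;        /* refuse to use this file */
          return 0;
      }

      /* Approach 2: clamp the size reported to the application. */
      uint32_t
      clamp_size(uint64_t server_size)
      {
          if (server_size > CLIENT_MAX_FILESIZE)
              return (uint32_t)CLIENT_MAX_FILESIZE;
          return (uint32_t)server_size;
      }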
5.0 Appendix I: Mount protocol

The changes from the NFS version 2 protocol to the NFS version 3 protocol have required some changes to be made in the MOUNT protocol. To meet the needs of the NFS version 3 protocol, a new version of the MOUNT protocol has been defined. This new protocol satisfies the requirements of the NFS version 3 protocol and addresses several other current market requirements.

5.1 RPC Information

5.1.1 Authentication

The MOUNT service uses AUTH_NONE in the NULL procedure. AUTH_UNIX, AUTH_SHORT, AUTH_DES, or AUTH_KERB are used for all other procedures. Other authentication types may be supported in the future.

5.1.2 Constants

These are the RPC constants needed to call the MOUNT service. They are given in decimal.

      PROGRAM  100005
      VERSION  3

5.1.3 Transport address

The MOUNT service is normally supported over the TCP and UDP protocols. The rpcbind daemon should be queried for the correct transport address.

5.1.4 Sizes

      const MNTPATHLEN = 1024;  /* Maximum bytes in a path name */
      const MNTNAMLEN  = 255;   /* Maximum bytes in a name */
      const FHSIZE3    = 64;    /* Maximum bytes in a V3 file handle */

5.1.5 Basic Data Types

      typedef opaque fhandle3<FHSIZE3>;
      typedef string dirpath<MNTPATHLEN>;
      typedef string name<MNTNAMLEN>;
      enum mountstat3 {
         MNT3_OK = 0,                 /* no error */
         MNT3ERR_PERM = 1,            /* Not owner */
         MNT3ERR_NOENT = 2,           /* No such file or directory */
         MNT3ERR_IO = 5,              /* I/O error */
         MNT3ERR_ACCES = 13,          /* Permission denied */
         MNT3ERR_NOTDIR = 20,         /* Not a directory */
         MNT3ERR_INVAL = 22,          /* Invalid argument */
         MNT3ERR_NAMETOOLONG = 63,    /* Filename too long */
         MNT3ERR_NOTSUPP = 10004,     /* Operation not supported */
         MNT3ERR_SERVERFAULT = 10006  /* A failure on the server */
      };

5.2 Server Procedures

The following sections define the RPC procedures supplied by a MOUNT version 3 protocol server. The RPC procedure number is given in the heading of each procedure description, along with the name and version. The SYNOPSIS provides the name of the procedure, the list of the names of the arguments, and the list of the names of the results, followed by the XDR argument declarations and results declarations. The information in the SYNOPSIS is specified in RPC Data Description Language as defined in [RFC1014]. The DESCRIPTION section tells what the procedure is expected to do and how its arguments and results are used. The ERRORS section lists the errors returned for specific types of failures. The IMPLEMENTATION field describes how the procedure is expected to work and how it should be used by clients.

      program MOUNT_PROGRAM {
         version MOUNT_V3 {
            void      MOUNTPROC3_NULL(void)    = 0;
            mountres3 MOUNTPROC3_MNT(dirpath)  = 1;
            mountlist MOUNTPROC3_DUMP(void)    = 2;
            void      MOUNTPROC3_UMNT(dirpath) = 3;
            void      MOUNTPROC3_UMNTALL(void) = 4;
            exports   MOUNTPROC3_EXPORT(void)  = 5;
         } = 3;
      } = 100005;
5.2.0 Procedure 0: NULL - Do nothing

SYNOPSIS

      void MOUNTPROC3_NULL(void) = 0;

DESCRIPTION

Procedure NULL does not do any work. It is made available to allow server response testing and timing.

IMPLEMENTATION

It is important that this procedure do no work at all so that it can be used to measure the overhead of processing a service request. By convention, the NULL procedure should never require any authentication. A server may choose to ignore this convention, in a more secure implementation, where responding to the NULL procedure call acknowledges the existence of a resource to an unauthenticated client.

ERRORS

Since the NULL procedure takes no MOUNT protocol arguments and returns no MOUNT protocol response, it can not return a MOUNT protocol error. However, it is possible that some server implementations may return RPC errors based on security and authentication requirements.
5.2.1 Procedure 1: MNT - Add mount entry

SYNOPSIS

      mountres3 MOUNTPROC3_MNT(dirpath) = 1;

      struct mountres3_ok {
         fhandle3   fhandle;
         int        auth_flavors<>;
      };

      union mountres3 switch (mountstat3 fhs_status) {
      case MNT3_OK:
         mountres3_ok  mountinfo;
      default:
         void;
      };

DESCRIPTION

Procedure MNT maps a pathname on the server to a file handle. The pathname is an ASCII string that describes a directory on the server. If the call is successful (MNT3_OK), the server returns an NFS version 3 protocol file handle and a vector of RPC authentication flavors that are supported with the client's use of the file handle (or any file handles derived from it). The authentication flavors are defined in Section 7.2 and section 9 of [RFC1057].

IMPLEMENTATION

If mountres3.fhs_status is MNT3_OK, then mountres3.mountinfo contains the file handle for the directory and a list of acceptable authentication flavors. This file handle may only be used in the NFS version 3 protocol. This procedure also results in the server adding a new entry to its mount list recording that this client has mounted the directory. AUTH_UNIX authentication or better is required.

ERRORS

      MNT3ERR_NOENT
      MNT3ERR_IO
      MNT3ERR_ACCES
      MNT3ERR_NOTDIR
      MNT3ERR_NAMETOOLONG
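As a non-normative sketch, a client might issue the MNT call with the ONC RPC library as shown below. The mountres3 type, the MNT3_OK constant, and the xdr_dirpath and xdr_mountres3 routines are assumed to have been generated by rpcgen from the definitions above; those generated names are assumptions of this sketch, not part of the protocol.

      /* Sketch: obtain the initial file handle for an export via MNT. */
      #include <rpc/rpc.h>
      #include <string.h>

      #define MOUNT_PROGRAM   100005
      #define MOUNT_V3        3
      #define MOUNTPROC3_MNT  1

      int
      mount_get_fh(const char *host, char *path, mountres3 *res)
      {
          struct timeval tv = { 25, 0 };
          CLIENT *clnt;
          enum clnt_stat stat;

          clnt = clnt_create(host, MOUNT_PROGRAM, MOUNT_V3, "udp");
          if (clnt == NULL)
              return -1;

          /* AUTH_UNIX or better is required by the MNT procedure. */
          clnt->cl_auth = authunix_create_default();

          memset(res, 0, sizeof(*res));
          stat = clnt_call(clnt, MOUNTPROC3_MNT,
                           (xdrproc_t)xdr_dirpath, (caddr_t)&path,
                           (xdrproc_t)xdr_mountres3, (caddr_t)res, tv);

          auth_destroy(clnt->cl_auth);
          clnt_destroy(clnt);

          if (stat != RPC_SUCCESS || res->fhs_status != MNT3_OK)
              return -1;
          return 0;
      }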
5.2.2 Procedure 2: DUMP - Return mount entries

SYNOPSIS

      mountlist MOUNTPROC3_DUMP(void) = 2;

      typedef struct mountbody *mountlist;

      struct mountbody {
         name       ml_hostname;
         dirpath    ml_directory;
         mountlist  ml_next;
      };

DESCRIPTION

Procedure DUMP returns the list of remotely mounted file systems. The mountlist contains one entry for each client host name and directory pair.

IMPLEMENTATION

This list is derived from a list maintained on the server of clients that have requested file handles with the MNT procedure. Entries are removed from this list only when a client calls the UMNT or UMNTALL procedure. Entries may become stale if a client crashes and does not issue either UMNT calls for all of the file systems that it had previously mounted or a UMNTALL to remove all entries that existed for it on the server.

ERRORS

There are no MOUNT protocol errors which can be returned from this procedure. However, RPC errors may be returned for authentication or other RPC failures.
5.2.3 Procedure 3: UMNT - Remove mount entry

SYNOPSIS

      void MOUNTPROC3_UMNT(dirpath) = 3;

DESCRIPTION

Procedure UMNT removes the mount list entry for the directory that was previously the subject of a MNT call from this client. AUTH_UNIX authentication or better is required.

IMPLEMENTATION

Typically, server implementations have maintained a list of clients which have file systems mounted. In the past, this list has been used to inform clients that the server was going to be shut down.

ERRORS

There are no MOUNT protocol errors which can be returned from this procedure. However, RPC errors may be returned for authentication or other RPC failures.
5.2.4 Procedure 4: UMNTALL - Remove all mount entries

SYNOPSIS

      void MOUNTPROC3_UMNTALL(void) = 4;

DESCRIPTION

Procedure UMNTALL removes all of the mount entries for this client previously recorded by calls to MNT. AUTH_UNIX authentication or better is required.

IMPLEMENTATION

This procedure should be used by clients when they are recovering after a system shutdown. If the client could not successfully unmount all of its file systems before being shut down, or the client crashed because of a software or hardware problem, there may be servers which still have mount entries for this client. This is an easy way for the client to inform all servers at once that it does not have any mounted file systems. However, since this procedure is generally implemented using broadcast RPC, it is only of limited usefulness.

ERRORS

There are no MOUNT protocol errors which can be returned from this procedure. However, RPC errors may be returned for authentication or other RPC failures.
5.2.5 Procedure 5: EXPORT - Return export list

SYNOPSIS

      exports MOUNTPROC3_EXPORT(void) = 5;

      typedef struct groupnode *groups;

      struct groupnode {
         name     gr_name;
         groups   gr_next;
      };

      typedef struct exportnode *exports;

      struct exportnode {
         dirpath  ex_dir;
         groups   ex_groups;
         exports  ex_next;
      };

DESCRIPTION

Procedure EXPORT returns a list of all the exported file systems and which clients are allowed to mount each one. The names in the group list are implementation-specific and cannot be directly interpreted by clients. These names can represent hosts or groups of hosts.

IMPLEMENTATION

This procedure generally returns the contents of a list of shared or exported file systems. These are the file systems which are made available to NFS version 3 protocol clients.

ERRORS

There are no MOUNT protocol errors which can be returned from this procedure. However, RPC errors may be returned for authentication or other RPC failures.
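The result of EXPORT decodes into linked lists of exportnode and groupnode structures. The sketch below, using the C mappings that rpcgen would conventionally produce for the definitions above, walks such a list and prints each exported directory with the groups allowed to mount it; the sample data in main() is purely illustrative.

      /* Sketch: walking an EXPORT result. */
      #include <stdio.h>
      #include <stddef.h>

      typedef char *dirpath;
      typedef char *name;

      typedef struct groupnode *groups;
      struct groupnode {
          name   gr_name;
          groups gr_next;
      };

      typedef struct exportnode *exports;
      struct exportnode {
          dirpath ex_dir;
          groups  ex_groups;
          exports ex_next;
      };

      static void
      print_exports(exports ex)
      {
          for (; ex != NULL; ex = ex->ex_next) {
              printf("%s:", ex->ex_dir);
              if (ex->ex_groups == NULL)
                  printf(" (everyone)");
              for (groups g = ex->ex_groups; g != NULL; g = g->gr_next)
                  printf(" %s", g->gr_name);
              printf("\n");
          }
      }

      int
      main(void)
      {
          /* Two illustrative entries, as a server might return them. */
          struct groupnode eng = { "engineering", NULL };
          struct exportnode e2 = { "/export/home", &eng, NULL };
          struct exportnode e1 = { "/export/tools", NULL, &e2 };

          print_exports(&e1);
          return 0;
      }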
6.0 Appendix II: Lock manager protocol

Because the NFS version 2 protocol and the NFS version 3 protocol are stateless, an additional Network Lock Manager (NLM) protocol is required to support locking of NFS-mounted files. The NLM version 3 protocol, which is used with the NFS version 2 protocol, is documented in [X/OpenNFS].

Some of the changes in the NFS version 3 protocol require a new version of the NLM protocol. This new protocol is the NLM version 4 protocol. The following table summarizes the correspondence between versions of the NFS protocol and the NLM protocol.

      NFS and NLM protocol compatibility

      +---------+---------+
      |   NFS   |   NLM   |
      | Version | Version |
      +===================+
      |    2    |   1,3   |
      +---------+---------+
      |    3    |    4    |
      +---------+---------+

This appendix only discusses the differences between the NLM version 3 protocol and the NLM version 4 protocol. As in the NFS version 3 protocol, almost all the names in the NLM version 4 protocol have been changed to include a version number. This appendix does not discuss changes that consist solely of a name change.

6.1 RPC Information

6.1.1 Authentication

The NLM service uses AUTH_NONE in the NULL procedure. AUTH_UNIX, AUTH_SHORT, AUTH_DES, and AUTH_KERB are used for all other procedures. Other authentication types may be supported in the future.

6.1.2 Constants

These are the RPC constants needed to call the NLM service. They are given in decimal.

      PROGRAM  100021
      VERSION  4
6.1.3 Transport Address

The NLM service is normally supported over the TCP and UDP protocols. The rpcbind daemon should be queried for the correct transport address.

6.1.4 Basic Data Types

      uint64
         typedef unsigned hyper uint64;

      int64
         typedef hyper int64;

      uint32
         typedef unsigned long uint32;

      int32
         typedef long int32;

These types are new for the NLM version 4 protocol. They are the same as in the NFS version 3 protocol.

      nlm4_stats

      enum nlm4_stats {
         NLM4_GRANTED = 0,
         NLM4_DENIED = 1,
         NLM4_DENIED_NOLOCKS = 2,
         NLM4_BLOCKED = 3,
         NLM4_DENIED_GRACE_PERIOD = 4,
         NLM4_DEADLCK = 5,
         NLM4_ROFS = 6,
         NLM4_STALE_FH = 7,
         NLM4_FBIG = 8,
         NLM4_FAILED = 9
      };

Nlm4_stats indicates the success or failure of a call. This version contains several new error codes, so that clients can provide more precise failure information to applications.

      NLM4_GRANTED
         The call completed successfully.

      NLM4_DENIED
         The call failed. For attempts to set a lock, this status implies that if the client retries the call later, it may succeed.
      NLM4_DENIED_NOLOCKS
         The call failed because the server could not allocate the necessary resources.

      NLM4_BLOCKED
         Indicates that a blocking request cannot be granted immediately. The server will issue an NLMPROC4_GRANTED callback to the client when the lock is granted.

      NLM4_DENIED_GRACE_PERIOD
         The call failed because the server is reestablishing old locks after a reboot and is not yet ready to resume normal service.

      NLM4_DEADLCK
         The request could not be granted and blocking would cause a deadlock.

      NLM4_ROFS
         The call failed because the remote file system is read-only. For example, some server implementations might not support exclusive locks on read-only file systems.

      NLM4_STALE_FH
         The call failed because it uses an invalid file handle. This can happen if the file has been removed or if access to the file has been revoked on the server.

      NLM4_FBIG
         The call failed because it specified a length or offset that exceeds the range supported by the server.

      NLM4_FAILED
         The call failed for some reason not already listed. The client should take this status as a strong hint not to retry the request.

      nlm4_holder

      struct nlm4_holder {
         bool     exclusive;
         int32    svid;
         netobj   oh;
         uint64   l_offset;
         uint64   l_len;
      };
This structure indicates the holder of a lock. The exclusive field tells whether the holder has an exclusive lock or a shared lock. The svid field identifies the process that is holding the lock. The oh field is an opaque object that identifies the host or process that is holding the lock. The l_len and l_offset fields identify the region that is locked. The only difference between the NLM version 3 protocol and the NLM version 4 protocol is that in the NLM version 3 protocol, the l_len and l_offset fields are 32 bits wide, while they are 64 bits wide in the NLM version 4 protocol.

      nlm4_lock

      struct nlm4_lock {
         string   caller_name<LM_MAXSTRLEN>;
         netobj   fh;
         netobj   oh;
         int32    svid;
         uint64   l_offset;
         uint64   l_len;
      };

This structure describes a lock request. The caller_name field identifies the host that is making the request. The fh field identifies the file to lock. The oh field is an opaque object that identifies the host or process that is making the request, and the svid field identifies the process that is making the request. The l_offset and l_len fields identify the region of the file that the lock controls. An l_len of 0 means "to end of file".

There are two differences between the NLM version 3 protocol and the NLM version 4 protocol versions of this structure. First, in the NLM version 3 protocol, the length and offset are 32 bits wide, while they are 64 bits wide in the NLM version 4 protocol. Second, in the NLM version 3 protocol, the file handle is a fixed-length NFS version 2 protocol file handle, which is encoded as a byte count followed by a byte array. In the NFS version 3 protocol, the file handle is already variable-length, so it is copied directly into the fh field. That is, the first four bytes of the fh field are the same as the byte count in an NFS version 3 protocol nfs_fh3. The rest of the fh field contains the byte array from the NFS version 3 protocol nfs_fh3.
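For illustration, the sketch below fills the fh field of an NLM version 4 request from an NFS version 3 protocol file handle. The C structure layouts shown (a length plus a byte pointer, in the style rpcgen conventionally produces) are assumptions of this sketch; the point is that both types are XDR variable-length opaques, so both encode on the wire as a four-byte count followed by the handle bytes, and the handle is therefore copied without re-encoding.

      /* Sketch: copying an NFS version 3 file handle into the NLM fh. */
      #include <stdint.h>

      #define NFS3_FHSIZE 64

      struct nfs_fh3 {                 /* illustrative C mapping */
          uint32_t  len;               /* number of significant bytes */
          char      data[NFS3_FHSIZE];
      };

      struct netobj {                  /* illustrative C mapping */
          uint32_t  n_len;
          char     *n_bytes;
      };

      void
      set_nlm4_fh(struct netobj *fh, struct nfs_fh3 *nfsfh)
      {
          /* Both encode as <count><bytes>, so copy directly. */
          fh->n_len   = nfsfh->len;
          fh->n_bytes = nfsfh->data;
      }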
      nlm4_share

      struct nlm4_share {
         string      caller_name<LM_MAXSTRLEN>;
         netobj      fh;
         netobj      oh;
         fsh4_mode   mode;
         fsh4_access access;
      };

This structure is used to support DOS file sharing. The caller_name field identifies the host making the request. The fh field identifies the file to be operated on. The oh field is an opaque object that identifies the host or process that is making the request. The mode and access fields specify the file-sharing and access modes. The encoding of fh is a byte count, followed by the file handle byte array. See the description of nlm4_lock for more details.

6.2 NLM Procedures

The procedures in the NLM version 4 protocol are semantically the same as those in the NLM version 3 protocol. The only semantic difference is the addition of a NULL procedure that can be used to test for server responsiveness. The procedure names with _MSG and _RES suffixes denote asynchronous messages; for these, the void response implies no reply. A syntactic change is that the procedures were renamed to avoid name conflicts with the values of nlm4_stats. Thus the procedure definition is as follows.

      version NLM4_VERS {
         void          NLMPROC4_NULL(void) = 0;
         nlm4_testres  NLMPROC4_TEST(nlm4_testargs) = 1;
         nlm4_res      NLMPROC4_LOCK(nlm4_lockargs) = 2;
         nlm4_res      NLMPROC4_CANCEL(nlm4_cancargs) = 3;
         nlm4_res      NLMPROC4_UNLOCK(nlm4_unlockargs) = 4;
         nlm4_res      NLMPROC4_GRANTED(nlm4_testargs) = 5;
         void          NLMPROC4_TEST_MSG(nlm4_testargs) = 6;
         void          NLMPROC4_LOCK_MSG(nlm4_lockargs) = 7;
         void          NLMPROC4_CANCEL_MSG(nlm4_cancargs) = 8;
         void          NLMPROC4_UNLOCK_MSG(nlm4_unlockargs) = 9;
         void          NLMPROC4_GRANTED_MSG(nlm4_testargs) = 10;
         void          NLMPROC4_TEST_RES(nlm4_testres) = 11;
         void          NLMPROC4_LOCK_RES(nlm4_res) = 12;
         void          NLMPROC4_CANCEL_RES(nlm4_res) = 13;
         void          NLMPROC4_UNLOCK_RES(nlm4_res) = 14;
         void          NLMPROC4_GRANTED_RES(nlm4_res) = 15;
         nlm4_shareres NLMPROC4_SHARE(nlm4_shareargs) = 20;
         nlm4_shareres NLMPROC4_UNSHARE(nlm4_shareargs) = 21;
         nlm4_res      NLMPROC4_NM_LOCK(nlm4_lockargs) = 22;
         void          NLMPROC4_FREE_ALL(nlm4_notify) = 23;
      } = 4;
6.2.0 Procedure 0: NULL - Do nothing

SYNOPSIS

      void NLMPROC4_NULL(void) = 0;

DESCRIPTION

The NULL procedure does no work. It is made available in all RPC services to allow server response testing and timing.

IMPLEMENTATION

It is important that this procedure do no work at all so that it can be used to measure the overhead of processing a service request. By convention, the NULL procedure should never require any authentication.

ERRORS

It is possible that some server implementations may return RPC errors based on security and authentication requirements.

6.3 Implementation issues

6.3.1 64-bit offsets and lengths

Some NFS version 3 protocol servers can only support requests where the file offset or length fits in 32 or fewer bits. For these servers, the lock manager will have the same restriction. If such a lock manager receives a request that it cannot handle (because the offset or length uses more than 32 bits), it should return the error, NLM4_FBIG.
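A sketch of such a range check follows. It is illustrative only; it treats an l_len of 0 as "to end of file", as described for nlm4_lock above, and returns 0 when the request is within the 32 bit range.

      /* Sketch: reject lock ranges that do not fit in 32 bits. */
      #include <stdint.h>

      #define NLM4_FBIG     8
      #define OFFSET32_MAX  0xffffffffULL

      int
      check_lock_range(uint64_t l_offset, uint64_t l_len)
      {
          if (l_offset > OFFSET32_MAX || l_len > OFFSET32_MAX)
              return NLM4_FBIG;

          /* l_len of 0 means "to end of file"; nothing more to check. */
          if (l_len != 0 && l_offset + l_len - 1 > OFFSET32_MAX)
              return NLM4_FBIG;

          return 0;    /* within the supported range */
      }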
6.3.2 File handles

The change in the file handle format from the NFS version 2 protocol to the NFS version 3 protocol complicates the lock manager. First, the lock manager needs some way to tell when an NFS version 2 protocol file handle refers to the same file as an NFS version 3 protocol file handle. (This is assuming that the lock manager supports both NLM version 3 protocol clients and NLM version 4 protocol clients.) Second, if the lock manager runs the file handle through a hashing function, the hashing function may need to be retuned to work with NFS version 3 protocol file handles as well as NFS version 2 protocol file handles.
7.0 Appendix III: Bibliography

[Corbin]
   Corbin, John, "The Art of Distributed Programming - Programming Techniques for Remote Procedure Calls," Springer-Verlag, New York, New York, 1991.
   Basic description of RPC and XDR and how to program distributed applications using them.

[Glover]
   Glover, Fred, "TNFS Protocol Specification," Trusted System Interest Group, Work in Progress.

[Israel]
   Israel, Robert K., Sandra Jett, James Pownell, George M. Ericson, "Eliminating Data Copies in UNIX-based NFS Servers," Uniforum Conference Proceedings, San Francisco, CA, February 27 - March 2, 1989.
   Describes two methods for reducing data copies in NFS server code.

[Jacobson]
   Jacobson, V., "Congestion Control and Avoidance," Proc. ACM SIGCOMM '88, Stanford, CA, August 1988.
   The paper describing improvements to TCP to allow use over Wide Area Networks and through gateways connecting networks of varying capacity. This work was a starting point for the NFS Dynamic Retransmission work.

[Juszczak]
   Juszczak, Chet, "Improving the Performance and Correctness of an NFS Server," USENIX Conference Proceedings, USENIX Association, Berkeley, CA, June 1990, pages 53-63.
   Describes a reply cache implementation that avoids work in the server by handling duplicate requests. More important, though listed as a side effect, the reply cache aids in the avoidance of destructive re-application of non-idempotent operations, improving correctness.

[Kazar]
   Kazar, Michael Leon, "Synchronization and Caching Issues in the Andrew File System," USENIX Conference Proceedings, USENIX Association, Berkeley, CA, Dallas Winter 1988, pages 27-36.
   A description of the cache consistency scheme in AFS. Contrasted with other distributed file systems.
[Macklem]
   Macklem, Rick, "Lessons Learned Tuning the 4.3BSD Reno Implementation of the NFS Protocol," Winter USENIX Conference Proceedings, USENIX Association, Berkeley, CA, January 1991.
   Describes performance work in tuning the 4.3BSD Reno NFS implementation. Describes performance improvement (reduced CPU loading) through elimination of data copies.

[Mogul]
   Mogul, Jeffrey C., "A Recovery Protocol for Spritely NFS," USENIX File System Workshop Proceedings, Ann Arbor, MI, USENIX Association, Berkeley, CA, May 1992.
   Second paper on Spritely NFS; proposes a lease-based scheme for recovering the state of the consistency protocol.

[Nowicki]
   Nowicki, Bill, "Transport Issues in the Network File System," ACM SIGCOMM newsletter Computer Communication Review, April 1989.
   A brief description of the basis for the dynamic retransmission work.

[Pawlowski]
   Pawlowski, Brian, Ron Hixon, Mark Stein, Joseph Tumminaro, "Network Computing in the UNIX and IBM Mainframe Environment," Uniforum '89 Conference Proceedings, 1989.
   Description of an NFS server implementation for IBM's MVS operating system.

[RFC1014]
   Sun Microsystems, Inc., "XDR: External Data Representation Standard", RFC 1014, Sun Microsystems, Inc., June 1987.
   Specification for the canonical format for data exchange, used with RPC.

[RFC1057]
   Sun Microsystems, Inc., "RPC: Remote Procedure Call Protocol Specification", RFC 1057, Sun Microsystems, Inc., June 1988.
   Remote procedure call protocol specification.

[RFC1094]
   Sun Microsystems, Inc., "Network Filesystem Specification", RFC 1094, Sun Microsystems, Inc., March 1989.
   NFS version 2 protocol specification.
[Sandberg]
   Sandberg, R., D. Goldberg, S. Kleiman, D. Walsh, B. Lyon, "Design and Implementation of the Sun Network Filesystem," USENIX Conference Proceedings, USENIX Association, Berkeley, CA, Summer 1985.
   The basic paper describing the SunOS implementation of the NFS version 2 protocol. It discusses the goals, protocol specification and trade-offs.

[Srinivasan]
   Srinivasan, V., Jeffrey C. Mogul, "Spritely NFS: Implementation and Performance of Cache Consistency Protocols", WRL Research Report 89/5, Digital Equipment Corporation Western Research Laboratory, 100 Hamilton Ave., Palo Alto, CA, 94301, May 1989.
   This paper analyzes the effect of applying a Sprite-like consistency protocol to standard NFS. The issues of recovery in a stateful environment are covered in [Mogul].

[X/OpenNFS]
   X/Open Company, Ltd., X/Open CAE Specification: Protocols for X/Open Internetworking: XNFS, X/Open Company, Ltd., Apex Plaza, Forbury Road, Reading Berkshire, RG1 1AX, United Kingdom, 1991.
   This is an indispensable reference for the NFS version 2 protocol and accompanying protocols, including the Lock Manager and the Portmapper.

[X/OpenPCNFS]
   X/Open Company, Ltd., X/Open CAE Specification: Protocols for X/Open Internetworking: (PC)NFS, Developer's Specification, X/Open Company, Ltd., Apex Plaza, Forbury Road, Reading Berkshire, RG1 1AX, United Kingdom, 1991.
   This is an indispensable reference for the NFS version 2 protocol and accompanying protocols, including the Lock Manager and the Portmapper.
8. Security Considerations

Since sensitive file data may be transmitted or received from a server by the NFS protocol, authentication, privacy, and data integrity issues should be addressed by implementations of this protocol. As with the previous protocol revision (version 2), NFS version 3 defers to the authentication provisions of the supporting RPC protocol [RFC1057], and assumes that data privacy and integrity are provided by underlying transport layers as available in each implementation of the protocol. See section 4.4 for a discussion relating to file access permissions.

9. Acknowledgements

This description of the protocol is derived from an original document written by Brian Pawlowski and revised by Peter Staubach. This protocol is the result of a co-operative effort that comprises the contributions of Geoff Arnold, Brent Callaghan, John Corbin, Fred Glover, Chet Juszczak, Mike Eisler, John Gillono, Dave Hitz, Mike Kupfer, Rick Macklem, Ron Minnich, Brian Pawlowski, David Robinson, Rusty Sandberg, Craig Schamp, Spencer Shepler, Carl Smith, Mark Stein, Peter Staubach, Tom Talpey, Rob Thurlow, and Mark Wittle.
10. Authors' Addresses

Address comments related to this protocol to:

      nfs3@eng.sun.com

      Brent Callaghan
      Sun Microsystems, Inc.
      2550 Garcia Avenue
      Mailstop UMTV05-44
      Mountain View, CA 94043-1100

      Phone: 1-415-336-1051
      Fax:   1-415-336-6015
      EMail: brent.callaghan@eng.sun.com

      Brian Pawlowski
      Network Appliance Corp.
      319 North Bernardo Ave.
      Mountain View, CA 94043

      Phone: 1-415-428-5136
      Fax:   1-415-428-5151
      EMail: beepy@netapp.com

      Peter Staubach
      Sun Microsystems, Inc.
      2550 Garcia Avenue
      Mailstop UMTV05-44
      Mountain View, CA 94043-1100

      Phone: 1-415-336-5615
      Fax:   1-415-336-6015
      EMail: peter.staubach@eng.sun.com