18.39. Operation 46: GET_DIR_DELEGATION - Get a Directory Delegation
18.39.1. ARGUMENT
typedef nfstime4 attr_notice4; struct GET_DIR_DELEGATION4args { /* CURRENT_FH: delegated directory */ bool gdda_signal_deleg_avail; bitmap4 gdda_notification_types; attr_notice4 gdda_child_attr_delay; attr_notice4 gdda_dir_attr_delay; bitmap4 gdda_child_attributes; bitmap4 gdda_dir_attributes; };18.39.2. RESULT
struct GET_DIR_DELEGATION4resok { verifier4 gddr_cookieverf; /* Stateid for get_dir_delegation */ stateid4 gddr_stateid; /* Which notifications can the server support */ bitmap4 gddr_notification; bitmap4 gddr_child_attributes; bitmap4 gddr_dir_attributes; }; enum gddrnf4_status { GDD4_OK = 0, GDD4_UNAVAIL = 1 }; union GET_DIR_DELEGATION4res_non_fatal switch (gddrnf4_status gddrnf_status) { case GDD4_OK: GET_DIR_DELEGATION4resok gddrnf_resok4; case GDD4_UNAVAIL: bool gddrnf_will_signal_deleg_avail; };
union GET_DIR_DELEGATION4res switch (nfsstat4 gddr_status) { case NFS4_OK: GET_DIR_DELEGATION4res_non_fatal gddr_res_non_fatal4; default: void; };18.39.3. DESCRIPTION
The GET_DIR_DELEGATION operation is used by a client to request a directory delegation. The directory is represented by the current filehandle. The client also specifies whether it wants the server to notify it when the directory changes in certain ways by setting one or more bits in a bitmap. The server may refuse to grant the delegation. In that case, the server will return NFS4ERR_DIRDELEG_UNAVAIL. If the server decides to hand out the delegation, it will return a cookie verifier for that directory. If the cookie verifier changes when the client is holding the delegation, the delegation will be recalled unless the client has asked for notification for this event. The server will also return a directory delegation stateid, gddr_stateid, as a result of the GET_DIR_DELEGATION operation. This stateid will appear in callback messages related to the delegation, such as notifications and delegation recalls. The client will use this stateid to return the delegation voluntarily or upon recall. A delegation is returned by calling the DELEGRETURN operation. The server might not be able to support notifications of certain events. If the client asks for such notifications, the server MUST inform the client of its inability to do so as part of the GET_DIR_DELEGATION reply by not setting the appropriate bits in the supported notifications bitmask, gddr_notification, contained in the reply. The server MUST NOT add bits to gddr_notification that the client did not request. The GET_DIR_DELEGATION operation can be used for both normal and named attribute directories. If client sets gdda_signal_deleg_avail to TRUE, then it is registering with the client a "want" for a directory delegation. If the delegation is not available, and the server supports and will honor the "want", the results will have gddrnf_will_signal_deleg_avail set to TRUE and no error will be indicated on return. If so, the client should expect a future CB_RECALLABLE_OBJ_AVAIL operation to indicate that a directory delegation is available. If the server does not wish to honor the
"want" or is not able to do so, it returns the error NFS4ERR_DIRDELEG_UNAVAIL. If the delegation is immediately available, the server SHOULD return it with the response to the operation, rather than via a callback. When a client makes a request for a directory delegation while it already holds a directory delegation for that directory (including the case where it has been recalled but not yet returned by the client or revoked by the server), the server MUST reply with the value of gddr_status set to NFS4_OK, the value of gddrnf_status set to GDD4_UNAVAIL, and the value of gddrnf_will_signal_deleg_avail set to FALSE. The delegation the client held before the request remains intact, and its state is unchanged. The current stateid is not changed (see Section 16.2.3.1.2 for a description of the current stateid).18.39.4. IMPLEMENTATION
Directory delegations provide the benefit of improving cache consistency of namespace information. This is done through synchronous callbacks. A server must support synchronous callbacks in order to support directory delegations. In addition to that, asynchronous notifications provide a way to reduce network traffic as well as improve client performance in certain conditions. Notifications are specified in terms of potential changes to the directory. A client can ask to be notified of events by setting one or more bits in gdda_notification_types. The client can ask for notifications on addition of entries to a directory (by setting the NOTIFY4_ADD_ENTRY in gdda_notification_types), notifications on entry removal (NOTIFY4_REMOVE_ENTRY), renames (NOTIFY4_RENAME_ENTRY), directory attribute changes (NOTIFY4_CHANGE_DIR_ATTRIBUTES), and cookie verifier changes (NOTIFY4_CHANGE_COOKIE_VERIFIER) by setting one or more corresponding bits in the gdda_notification_types field. The client can also ask for notifications of changes to attributes of directory entries (NOTIFY4_CHANGE_CHILD_ATTRIBUTES) in order to keep its attribute cache up to date. However, any changes made to child attributes do not cause the delegation to be recalled. If a client is interested in directory entry caching or negative name caching, it can set the gdda_notification_types appropriately to its particular need and the server will notify it of all changes that would otherwise invalidate its name cache. The kind of notification a client asks for may depend on the directory size, its rate of change, and the applications being used to access that directory. The enumeration of the conditions under which a client might ask for a notification is out of the scope of this specification.
For attribute notifications, the client will set bits in the gdda_dir_attributes bitmap to indicate which attributes it wants to be notified of. If the server does not support notifications for changes to a certain attribute, it SHOULD NOT set that attribute in the supported attribute bitmap specified in the reply (gddr_dir_attributes). The client will also set in the gdda_child_attributes bitmap the attributes of directory entries it wants to be notified of, and the server will indicate in gddr_child_attributes which attributes of directory entries it will notify the client of. The client will also let the server know if it wants to get the notification as soon as the attribute change occurs or after a certain delay by setting a delay factor; gdda_child_attr_delay is for attribute changes to directory entries and gdda_dir_attr_delay is for attribute changes to the directory. If this delay factor is set to zero, that indicates to the server that the client wants to be notified of any attribute changes as soon as they occur. If the delay factor is set to N seconds, the server will make a best-effort guarantee that attribute updates are synchronized within N seconds. If the client asks for a delay factor that the server does not support or that may cause significant resource consumption on the server by causing the server to send a lot of notifications, the server should not commit to sending out notifications for attributes and therefore must not set the appropriate bit in the gddr_child_attributes and gddr_dir_attributes bitmaps in the response. The client MUST use a security tuple (Section 2.6.1) that the directory or its applicable ancestor (Section 2.6) is exported with. If not, the server MUST return NFS4ERR_WRONGSEC to the operation that both precedes GET_DIR_DELEGATION and sets the current filehandle (see Section 2.6.3.1). The directory delegation covers all the entries in the directory except the parent entry. That means if a directory and its parent both hold directory delegations, any changes to the parent will not cause a notification to be sent for the child even though the child's parent entry points to the parent directory.
18.40. Operation 47: GETDEVICEINFO - Get Device Information
18.40.1. ARGUMENT
struct GETDEVICEINFO4args { deviceid4 gdia_device_id; layouttype4 gdia_layout_type; count4 gdia_maxcount; bitmap4 gdia_notify_types; };18.40.2. RESULT
struct GETDEVICEINFO4resok { device_addr4 gdir_device_addr; bitmap4 gdir_notification; }; union GETDEVICEINFO4res switch (nfsstat4 gdir_status) { case NFS4_OK: GETDEVICEINFO4resok gdir_resok4; case NFS4ERR_TOOSMALL: count4 gdir_mincount; default: void; };18.40.3. DESCRIPTION
The GETDEVICEINFO operation returns pNFS storage device address information for the specified device ID. The client identifies the device information to be returned by providing the gdia_device_id and gdia_layout_type that uniquely identify the device. The client provides gdia_maxcount to limit the number of bytes for the result. This maximum size represents all of the data being returned within the GETDEVICEINFO4resok structure and includes the XDR overhead. The server may return less data. If the server is unable to return any information within the gdia_maxcount limit, the error NFS4ERR_TOOSMALL will be returned. However, if gdia_maxcount is zero, NFS4ERR_TOOSMALL MUST NOT be returned. The da_layout_type field of the gdir_device_addr returned by the server MUST be equal to the gdia_layout_type specified by the client. If it is not equal, the client SHOULD ignore the response as invalid and behave as if the server returned an error, even if the client does have support for the layout type returned.
The client also provides a notification bitmap, gdia_notify_types, for the device ID mapping notification for which it is interested in receiving; the server must support device ID notifications for the notification request to have affect. The notification mask is composed in the same manner as the bitmap for file attributes (Section 3.3.7). The numbers of bit positions are listed in the notify_device_type4 enumeration type (Section 20.12). Only two enumerated values of notify_device_type4 currently apply to GETDEVICEINFO: NOTIFY_DEVICEID4_CHANGE and NOTIFY_DEVICEID4_DELETE (see Section 20.12). The notification bitmap applies only to the specified device ID. If a client sends a GETDEVICEINFO operation on a deviceID multiple times, the last notification bitmap is used by the server for subsequent notifications. If the bitmap is zero or empty, then the device ID's notifications are turned off. If the client wants to just update or turn off notifications, it MAY send a GETDEVICEINFO operation with gdia_maxcount set to zero. In that event, if the device ID is valid, the reply's da_addr_body field of the gdir_device_addr field will be of zero length. If an unknown device ID is given in gdia_device_id, the server returns NFS4ERR_NOENT. Otherwise, the device address information is returned in gdir_device_addr. Finally, if the server supports notifications for device ID mappings, the gdir_notification result will contain a bitmap of which notifications it will actually send to the client (via CB_NOTIFY_DEVICEID, see Section 20.12). If NFS4ERR_TOOSMALL is returned, the results also contain gdir_mincount. The value of gdir_mincount represents the minimum size necessary to obtain the device information.18.40.4. IMPLEMENTATION
Aside from updating or turning off notifications, another use case for gdia_maxcount being set to zero is to validate a device ID. The client SHOULD request a notification for changes or deletion of a device ID to device address mapping so that the server can allow the client gracefully use a new mapping, without having pending I/O fail abruptly, or force layouts using the device ID to be recalled or revoked. It is possible that GETDEVICEINFO (and GETDEVICELIST) will race with CB_NOTIFY_DEVICEID, i.e., CB_NOTIFY_DEVICEID arrives before the client gets and processes the response to GETDEVICEINFO or
GETDEVICELIST. The analysis of the race leverages the fact that the server MUST NOT delete a device ID that is referred to by a layout the client has. o CB_NOTIFY_DEVICEID deletes a device ID. If the client believes it has layouts that refer to the device ID, then it is possible that layouts referring to the deleted device ID have been revoked. The client should send a TEST_STATEID request using the stateid for each layout that might have been revoked. If TEST_STATEID indicates that any layouts have been revoked, the client must recover from layout revocation as described in Section 12.5.6. If TEST_STATEID indicates that at least one layout has not been revoked, the client should send a GETDEVICEINFO operation on the supposedly deleted device ID to verify that the device ID has been deleted. If GETDEVICEINFO indicates that the device ID does not exist, then the client assumes the server is faulty and recovers by sending an EXCHANGE_ID operation. If GETDEVICEINFO indicates that the device ID does exist, then while the server is faulty for sending an erroneous device ID deletion notification, the degree to which it is faulty does not require the client to create a new client ID. If the client does not have layouts that refer to the device ID, no harm is done. The client should mark the device ID as deleted, and when GETDEVICEINFO or GETDEVICELIST results are received that indicate that the device ID has been in fact deleted, the device ID should be removed from the client's cache. o CB_NOTIFY_DEVICEID indicates that a device ID's device addressing mappings have changed. The client should assume that the results from the in-progress GETDEVICEINFO will be stale for the device ID once received, and so it should send another GETDEVICEINFO on the device ID.
18.41. Operation 48: GETDEVICELIST - Get All Device Mappings for a File System
18.41.1. ARGUMENT
struct GETDEVICELIST4args { /* CURRENT_FH: object belonging to the file system */ layouttype4 gdla_layout_type; /* number of deviceIDs to return */ count4 gdla_maxdevices; nfs_cookie4 gdla_cookie; verifier4 gdla_cookieverf; };18.41.2. RESULT
struct GETDEVICELIST4resok { nfs_cookie4 gdlr_cookie; verifier4 gdlr_cookieverf; deviceid4 gdlr_deviceid_list<>; bool gdlr_eof; }; union GETDEVICELIST4res switch (nfsstat4 gdlr_status) { case NFS4_OK: GETDEVICELIST4resok gdlr_resok4; default: void; };18.41.3. DESCRIPTION
This operation is used by the client to enumerate all of the device IDs that a server's file system uses. The client provides a current filehandle of a file object that belongs to the file system (i.e., all file objects sharing the same fsid as that of the current filehandle) and the layout type in gdia_layout_type. Since this operation might require multiple calls to enumerate all the device IDs (and is thus similar to the READDIR (Section 18.23) operation), the client also provides gdia_cookie and gdia_cookieverf to specify the current cursor position in the list. When the client wants to read from the beginning of the file system's device mappings, it sets gdla_cookie to zero. The field gdla_cookieverf MUST be ignored by the server when gdla_cookie is
zero. The client provides gdla_maxdevices to limit the number of device IDs in the result. If gdla_maxdevices is zero, the server MUST return NFS4ERR_INVAL. The server MAY return fewer device IDs. The successful response to the operation will contain the cookie, gdlr_cookie, and the cookie verifier, gdlr_cookieverf, to be used on the subsequent GETDEVICELIST. A gdlr_eof value of TRUE signifies that there are no remaining entries in the server's device list. Each element of gdlr_deviceid_list contains a device ID.18.41.4. IMPLEMENTATION
An example of the use of this operation is for pNFS clients and servers that use LAYOUT4_BLOCK_VOLUME layouts. In these environments it may be helpful for a client to determine device accessibility upon first file system access.18.42. Operation 49: LAYOUTCOMMIT - Commit Writes Made Using a Layout
18.42.1. ARGUMENT
union newtime4 switch (bool nt_timechanged) { case TRUE: nfstime4 nt_time; case FALSE: void; }; union newoffset4 switch (bool no_newoffset) { case TRUE: offset4 no_offset; case FALSE: void; }; struct LAYOUTCOMMIT4args { /* CURRENT_FH: file */ offset4 loca_offset; length4 loca_length; bool loca_reclaim; stateid4 loca_stateid; newoffset4 loca_last_write_offset; newtime4 loca_time_modify; layoutupdate4 loca_layoutupdate; };
18.42.2. RESULT
union newsize4 switch (bool ns_sizechanged) { case TRUE: length4 ns_size; case FALSE: void; }; struct LAYOUTCOMMIT4resok { newsize4 locr_newsize; }; union LAYOUTCOMMIT4res switch (nfsstat4 locr_status) { case NFS4_OK: LAYOUTCOMMIT4resok locr_resok4; default: void; };18.42.3. DESCRIPTION
The LAYOUTCOMMIT operation commits changes in the layout represented by the current filehandle, client ID (derived from the session ID in the preceding SEQUENCE operation), byte-range, and stateid. Since layouts are sub-dividable, a smaller portion of a layout, retrieved via LAYOUTGET, can be committed. The byte-range being committed is specified through the byte-range (loca_offset and loca_length). This byte-range MUST overlap with one or more existing layouts previously granted via LAYOUTGET (Section 18.43), each with an iomode of LAYOUTIOMODE4_RW. In the case where the iomode of any held layout segment is not LAYOUTIOMODE4_RW, the server should return the error NFS4ERR_BAD_IOMODE. For the case where the client does not hold matching layout segment(s) for the defined byte-range, the server should return the error NFS4ERR_BAD_LAYOUT. The LAYOUTCOMMIT operation indicates that the client has completed writes using a layout obtained by a previous LAYOUTGET. The client may have only written a subset of the data range it previously requested. LAYOUTCOMMIT allows it to commit or discard provisionally allocated space and to update the server with a new end-of-file. The layout referenced by LAYOUTCOMMIT is still valid after the operation completes and can be continued to be referenced by the client ID, filehandle, byte-range, layout type, and stateid. If the loca_reclaim field is set to TRUE, this indicates that the client is attempting to commit changes to a layout after the restart of the metadata server during the metadata server's recovery grace
period (see Section 12.7.4). This type of request may be necessary when the client has uncommitted writes to provisionally allocated byte-ranges of a file that were sent to the storage devices before the restart of the metadata server. In this case, the layout provided by the client MUST be a subset of a writable layout that the client held immediately before the restart of the metadata server. The value of the field loca_stateid MUST be a value that the metadata server returned before it restarted. The metadata server is free to accept or reject this request based on its own internal metadata consistency checks. If the metadata server finds that the layout provided by the client does not pass its consistency checks, it MUST reject the request with the status NFS4ERR_RECLAIM_BAD. The successful completion of the LAYOUTCOMMIT request with loca_reclaim set to TRUE does NOT provide the client with a layout for the file. It simply commits the changes to the layout specified in the loca_layoutupdate field. To obtain a layout for the file, the client must send a LAYOUTGET request to the server after the server's grace period has expired. If the metadata server receives a LAYOUTCOMMIT request with loca_reclaim set to TRUE when the metadata server is not in its recovery grace period, it MUST reject the request with the status NFS4ERR_NO_GRACE. Setting the loca_reclaim field to TRUE is required if and only if the committed layout was acquired before the metadata server restart. If the client is committing a layout that was acquired during the metadata server's grace period, it MUST set the "reclaim" field to FALSE. The loca_stateid is a layout stateid value as returned by previously successful layout operations (see Section 12.5.3). The loca_last_write_offset field specifies the offset of the last byte written by the client previous to the LAYOUTCOMMIT. Note that this value is never equal to the file's size (at most it is one byte less than the file's size) and MUST be less than or equal to NFS4_MAXFILEOFF. Also, loca_last_write_offset MUST overlap the range described by loca_offset and loca_length. The metadata server may use this information to determine whether the file's size needs to be updated. If the metadata server updates the file's size as the result of the LAYOUTCOMMIT operation, it must return the new size (locr_newsize.ns_size) as part of the results. The loca_time_modify field allows the client to suggest a modification time it would like the metadata server to set. The metadata server may use the suggestion or it may use the time of the LAYOUTCOMMIT operation to set the modification time. If the metadata server uses the client-provided modification time, it should ensure that time does not flow backwards. If the client wants to force the
metadata server to set an exact time, the client should use a SETATTR operation in a COMPOUND right after LAYOUTCOMMIT. See Section 12.5.4 for more details. If the client desires the resultant modification time, it should construct the COMPOUND so that a GETATTR follows the LAYOUTCOMMIT. The loca_layoutupdate argument to LAYOUTCOMMIT provides a mechanism for a client to provide layout-specific updates to the metadata server. For example, the layout update can describe what byte-ranges of the original layout have been used and what byte-ranges can be deallocated. There is no NFSv4.1 file layout-specific layoutupdate4 structure. The layout information is more verbose for block devices than for objects and files because the latter two hide the details of block allocation behind their storage protocols. At the minimum, the client needs to communicate changes to the end-of-file location back to the server, and, if desired, its view of the file's modification time. For block/volume layouts, it needs to specify precisely which blocks have been used. If the layout identified in the arguments does not exist, the error NFS4ERR_BADLAYOUT is returned. The layout being committed may also be rejected if it does not correspond to an existing layout with an iomode of LAYOUTIOMODE4_RW. On success, the current filehandle retains its value and the current stateid retains its value.18.42.4. IMPLEMENTATION
The client MAY also use LAYOUTCOMMIT with the loca_reclaim field set to TRUE to convey hints to modified file attributes or to report layout-type specific information such as I/O errors for object-based storage layouts, as normally done during normal operation. Doing so may help the metadata server to recover files more efficiently after restart. For example, some file system implementations may require expansive recovery of file system objects if the metadata server does not get a positive indication from all clients holding a LAYOUTIOMODE4_RW layout that they have successfully completed all their writes. Sending a LAYOUTCOMMIT (if required) and then following with LAYOUTRETURN can provide such an indication and allow for graceful and efficient recovery.
If loca_reclaim is TRUE, the metadata server is free to either examine or ignore the value in the field loca_stateid. The metadata server implementation might or might not encode in its layout stateid information that allows the metadate server to perform a consistency check on the LAYOUTCOMMIT request.18.43. Operation 50: LAYOUTGET - Get Layout Information
18.43.1. ARGUMENT
struct LAYOUTGET4args { /* CURRENT_FH: file */ bool loga_signal_layout_avail; layouttype4 loga_layout_type; layoutiomode4 loga_iomode; offset4 loga_offset; length4 loga_length; length4 loga_minlength; stateid4 loga_stateid; count4 loga_maxcount; };18.43.2. RESULT
struct LAYOUTGET4resok { bool logr_return_on_close; stateid4 logr_stateid; layout4 logr_layout<>; }; union LAYOUTGET4res switch (nfsstat4 logr_status) { case NFS4_OK: LAYOUTGET4resok logr_resok4; case NFS4ERR_LAYOUTTRYLATER: bool logr_will_signal_layout_avail; default: void; };18.43.3. DESCRIPTION
The LAYOUTGET operation requests a layout from the metadata server for reading or writing the file given by the filehandle at the byte- range specified by offset and length. Layouts are identified by the client ID (derived from the session ID in the preceding SEQUENCE operation), current filehandle, layout type (loga_layout_type), and
the layout stateid (loga_stateid). The use of the loga_iomode field depends upon the layout type, but should reflect the client's data access intent. If the metadata server is in a grace period, and does not persist layouts and device ID to device address mappings, then it MUST return NFS4ERR_GRACE (see Section 8.4.2.1). The LAYOUTGET operation returns layout information for the specified byte-range: a layout. The client actually specifies two ranges, both starting at the offset in the loga_offset field. The first range is between loga_offset and loga_offset + loga_length - 1 inclusive. This range indicates the desired range the client wants the layout to cover. The second range is between loga_offset and loga_offset + loga_minlength - 1 inclusive. This range indicates the required range the client needs the layout to cover. Thus, loga_minlength MUST be less than or equal to loga_length. When a length field is set to NFS4_UINT64_MAX, this indicates a desire (when loga_length is NFS4_UINT64_MAX) or requirement (when loga_minlength is NFS4_UINT64_MAX) to get a layout from loga_offset through the end-of-file, regardless of the file's length. The following rules govern the relationships among, and the minima of, loga_length, loga_minlength, and loga_offset. o If loga_length is less than loga_minlength, the metadata server MUST return NFS4ERR_INVAL. o If loga_minlength is zero, this is an indication to the metadata server that the client desires any layout at offset loga_offset or less that the metadata server has "readily available". Readily is subjective, and depends on the layout type and the pNFS server implementation. For example, some metadata servers might have to pre-allocate stable storage when they receive a request for a range of a file that goes beyond the file's current length. If loga_minlength is zero and loga_length is greater than zero, this tells the metadata server what range of the layout the client would prefer to have. If loga_length and loga_minlength are both zero, then the client is indicating that it desires a layout of any length with the ending offset of the range no less than the value specified loga_offset, and the starting offset at or below loga_offset. If the metadata server does not have a layout that is readily available, then it MUST return NFS4ERR_LAYOUTTRYLATER. o If the sum of loga_offset and loga_minlength exceeds NFS4_UINT64_MAX, and loga_minlength is not NFS4_UINT64_MAX, the error NFS4ERR_INVAL MUST result.
o If the sum of loga_offset and loga_length exceeds NFS4_UINT64_MAX, and loga_length is not NFS4_UINT64_MAX, the error NFS4ERR_INVAL MUST result. After the metadata server has performed the above checks on loga_offset, loga_minlength, and loga_offset, the metadata server MUST return a layout according to the rules in Table 13. Acceptable layouts based on loga_minlength. Note: u64m = NFS4_UINT64_MAX; a_off = loga_offset; a_minlen = loga_minlength. +-----------+-----------+----------+----------+---------------------+ | Layout | Layout | Layout | Layout | Layout length of | | iomode of | a_minlen | iomode | offset | reply | | request | of | of reply | of reply | | | | request | | | | +-----------+-----------+----------+----------+---------------------+ | _READ | u64m | MAY be | MUST be | MUST be >= file | | | | _READ | <= a_off | length - layout | | | | | | offset | | _READ | u64m | MAY be | MUST be | MUST be u64m | | | | _RW | <= a_off | | | _READ | > 0 and < | MAY be | MUST be | MUST be >= MIN(file | | | u64m | _READ | <= a_off | length, a_minlen + | | | | | | a_off) - layout | | | | | | offset | | _READ | > 0 and < | MAY be | MUST be | MUST be >= a_off - | | | u64m | _RW | <= a_off | layout offset + | | | | | | a_minlen | | _READ | 0 | MAY be | MUST be | MUST be > 0 | | | | _READ | <= a_off | | | _READ | 0 | MAY be | MUST be | MUST be > 0 | | | | _RW | <= a_off | | | _RW | u64m | MUST be | MUST be | MUST be u64m | | | | _RW | <= a_off | | | _RW | > 0 and < | MUST be | MUST be | MUST be >= a_off - | | | u64m | _RW | <= a_off | layout offset + | | | | | | a_minlen | | _RW | 0 | MUST be | MUST be | MUST be > 0 | | | | _RW | <= a_off | | +-----------+-----------+----------+----------+---------------------+ Table 13 If loga_minlength is not zero and the metadata server cannot return a layout according to the rules in Table 13, then the metadata server MUST return the error NFS4ERR_BADLAYOUT. If loga_minlength is zero and the metadata server cannot or will not return a layout according
to the rules in Table 13, then the metadata server MUST return the error NFS4ERR_LAYOUTTRYLATER. Assuming that loga_length is greater than loga_minlength or equal to zero, the metadata server SHOULD return a layout according to the rules in Table 14. Desired layouts based on loga_length. The rules of Table 13 MUST be applied first. Note: u64m = NFS4_UINT64_MAX; a_off = loga_offset; a_len = loga_length. +------------+------------+-----------+-----------+-----------------+ | Layout | Layout | Layout | Layout | Layout length | | iomode of | a_len of | iomode of | offset of | of reply | | request | request | reply | reply | | +------------+------------+-----------+-----------+-----------------+ | _READ | u64m | MAY be | MUST be | SHOULD be u64m | | | | _READ | <= a_off | | | _READ | u64m | MAY be | MUST be | SHOULD be u64m | | | | _RW | <= a_off | | | _READ | > 0 and < | MAY be | MUST be | SHOULD be >= | | | u64m | _READ | <= a_off | a_off - layout | | | | | | offset + a_len | | _READ | > 0 and < | MAY be | MUST be | SHOULD be >= | | | u64m | _RW | <= a_off | a_off - layout | | | | | | offset + a_len | | _READ | 0 | MAY be | MUST be | SHOULD be > | | | | _READ | <= a_off | a_off - layout | | | | | | offset | | _READ | 0 | MAY be | MUST be | SHOULD be > | | | | _READ | <= a_off | a_off - layout | | | | | | offset | | _RW | u64m | MUST be | MUST be | SHOULD be u64m | | | | _RW | <= a_off | | | _RW | > 0 and < | MUST be | MUST be | SHOULD be >= | | | u64m | _RW | <= a_off | a_off - layout | | | | | | offset + a_len | | _RW | 0 | MUST be | MUST be | SHOULD be > | | | | _RW | <= a_off | a_off - layout | | | | | | offset | +------------+------------+-----------+-----------+-----------------+ Table 14 The loga_stateid field specifies a valid stateid. If a layout is not currently held by the client, the loga_stateid field represents a stateid reflecting the correspondingly valid open, byte-range lock, or delegation stateid. Once a layout is held on the file by the
client, the loga_stateid field MUST be a stateid as returned from a previous LAYOUTGET or LAYOUTRETURN operation or provided by a CB_LAYOUTRECALL operation (see Section 12.5.3). The loga_maxcount field specifies the maximum layout size (in bytes) that the client can handle. If the size of the layout structure exceeds the size specified by maxcount, the metadata server will return the NFS4ERR_TOOSMALL error. The returned layout is expressed as an array, logr_layout, with each element of type layout4. If a file has a single striping pattern, then logr_layout SHOULD contain just one entry. Otherwise, if the requested range overlaps more than one striping pattern, logr_layout will contain the required number of entries. The elements of logr_layout MUST be sorted in ascending order of the value of the lo_offset field of each element. There MUST be no gaps or overlaps in the range between two successive elements of logr_layout. The lo_iomode field in each element of logr_layout MUST be the same. Table 13 and Table 14 both refer to a returned layout iomode, offset, and length. Because the returned layout is encoded in the logr_layout array, more description is required. iomode The value of the returned layout iomode listed in Table 13 and Table 14 is equal to the value of the lo_iomode field in each element of logr_layout. As shown in Table 13 and Table 14, the metadata server MAY return a layout with an lo_iomode different from the requested iomode (field loga_iomode of the request). If it does so, it MUST ensure that the lo_iomode is more permissive than the loga_iomode requested. For example, this behavior allows an implementation to upgrade LAYOUTIOMODE4_READ requests to LAYOUTIOMODE4_RW requests at its discretion, within the limits of the layout type specific protocol. A lo_iomode of either LAYOUTIOMODE4_READ or LAYOUTIOMODE4_RW MUST be returned. offset The value of the returned layout offset listed in Table 13 and Table 14 is always equal to the lo_offset field of the first element logr_layout. length When setting the value of the returned layout length, the situation is complicated by the possibility that the special layout length value NFS4_UINT64_MAX is involved. For a
logr_layout array of N elements, the lo_length field in the first N-1 elements MUST NOT be NFS4_UINT64_MAX. The lo_length field of the last element of logr_layout can be NFS4_UINT64_MAX under some conditions as described in the following list. * If an applicable rule of Table 13 states that the metadata server MUST return a layout of length NFS4_UINT64_MAX, then the lo_length field of the last element of logr_layout MUST be NFS4_UINT64_MAX. * If an applicable rule of Table 13 states that the metadata server MUST NOT return a layout of length NFS4_UINT64_MAX, then the lo_length field of the last element of logr_layout MUST NOT be NFS4_UINT64_MAX. * If an applicable rule of Table 14 states that the metadata server SHOULD return a layout of length NFS4_UINT64_MAX, then the lo_length field of the last element of logr_layout SHOULD be NFS4_UINT64_MAX. * When the value of the returned layout length of Table 13 and Table 14 is not NFS4_UINT64_MAX, then the returned layout length is equal to the sum of the lo_length fields of each element of logr_layout. The logr_return_on_close result field is a directive to return the layout before closing the file. When the metadata server sets this return value to TRUE, it MUST be prepared to recall the layout in the case in which the client fails to return the layout before close. For the metadata server that knows a layout must be returned before a close of the file, this return value can be used to communicate the desired behavior to the client and thus remove one extra step from the client's and metadata server's interaction. The logr_stateid stateid is returned to the client for use in subsequent layout related operations. See Sections 8.2, 12.5.3, and 12.5.5.2 for a further discussion and requirements. The format of the returned layout (lo_content) is specific to the layout type. The value of the layout type (lo_content.loc_type) for each of the elements of the array of layouts returned by the metadata server (logr_layout) MUST be equal to the loga_layout_type specified by the client. If it is not equal, the client SHOULD ignore the response as invalid and behave as if the metadata server returned an error, even if the client does have support for the layout type returned.
If neither the requested file nor its containing file system support layouts, the metadata server MUST return NFS4ERR_LAYOUTUNAVAILABLE. If the layout type is not supported, the metadata server MUST return NFS4ERR_UNKNOWN_LAYOUTTYPE. If layouts are supported but no layout matches the client provided layout identification, the metadata server MUST return NFS4ERR_BADLAYOUT. If an invalid loga_iomode is specified, or a loga_iomode of LAYOUTIOMODE4_ANY is specified, the metadata server MUST return NFS4ERR_BADIOMODE. If the layout for the file is unavailable due to transient conditions, e.g., file sharing prohibits layouts, the metadata server MUST return NFS4ERR_LAYOUTTRYLATER. If the layout request is rejected due to an overlapping layout recall, the metadata server MUST return NFS4ERR_RECALLCONFLICT. See Section 12.5.5.2 for details. If the layout conflicts with a mandatory byte-range lock held on the file, and if the storage devices have no method of enforcing mandatory locks, other than through the restriction of layouts, the metadata server SHOULD return NFS4ERR_LOCKED. If client sets loga_signal_layout_avail to TRUE, then it is registering with the client a "want" for a layout in the event the layout cannot be obtained due to resource exhaustion. If the metadata server supports and will honor the "want", the results will have logr_will_signal_layout_avail set to TRUE. If so, the client should expect a CB_RECALLABLE_OBJ_AVAIL operation to indicate that a layout is available. On success, the current filehandle retains its value and the current stateid is updated to match the value as returned in the results.18.43.4. IMPLEMENTATION
Typically, LAYOUTGET will be called as part of a COMPOUND request after an OPEN operation and results in the client having location information for the file. This requires that loga_stateid be set to the special stateid that tells the metadata server to use the current stateid, which is set by OPEN (see Section 16.2.3.1.2). A client may also hold a layout across multiple OPENs. The client specifies a layout type that limits what kind of layout the metadata server will return. This prevents metadata servers from granting layouts that are unusable by the client.
As indicated by Table 13 and Table 14, the specification of LAYOUTGET allows a pNFS client and server considerable flexibility. A pNFS client can take several strategies for sending LAYOUTGET. Some examples are as follows. o If LAYOUTGET is preceded by OPEN in the same COMPOUND request and the OPEN requests OPEN4_SHARE_ACCESS_READ access, the client might opt to request a _READ layout with loga_offset set to zero, loga_minlength set to zero, and loga_length set to NFS4_UINT64_MAX. If the file has space allocated to it, that space is striped over one or more storage devices, and there is either no conflicting layout or the concept of a conflicting layout does not apply to the pNFS server's layout type or implementation, then the metadata server might return a layout with a starting offset of zero, and a length equal to the length of the file, if not NFS4_UINT64_MAX. If the length of the file is not a multiple of the pNFS server's stripe width (see Section 13.2 for a formal definition), the metadata server might round up the returned layout's length. o If LAYOUTGET is preceded by OPEN in the same COMPOUND request, and the OPEN requests OPEN4_SHARE_ACCESS_WRITE access and does not truncate the file, the client might opt to request a _RW layout with loga_offset set to zero, loga_minlength set to zero, and loga_length set to the file's current length (if known), or NFS4_UINT64_MAX. As with the previous case, under some conditions the metadata server might return a layout that covers the entire length of the file or beyond. o This strategy is as above, but the OPEN truncates the file. In this case, the client might anticipate it will be writing to the file from offset zero, and so loga_offset and loga_minlength are set to zero, and loga_length is set to the value of threshold4_write_iosize. The metadata server might return a layout from offset zero with a length at least as long as threshold4_write_iosize. o A process on the client invokes a request to read from offset 10000 for length 50000. The client is using buffered I/O, and has buffer sizes of 4096 bytes. The client intends to map the request of the process into a series of READ requests starting at offset 8192. The end offset needs to be higher than 10000 + 50000 = 60000, and the next offset that is a multiple of 4096 is 61440. The difference between 61440 and that starting offset of the layout is 53248 (which is the product of 4096 and 15). The value of threshold4_read_iosize is less than 53248, so the client sends a LAYOUTGET request with loga_offset set to 8192, loga_minlength set to 53248, and loga_length set to the file's length (if known)
minus 8192 or NFS4_UINT64_MAX (if the file's length is not known). Since this LAYOUTGET request exceeds the metadata server's threshold, it grants the layout, possibly with an initial offset of zero, with an end offset of at least 8192 + 53248 - 1 = 61439, but preferably a layout with an offset aligned on the stripe width and a length that is a multiple of the stripe width. o This strategy is as above, but the client is not using buffered I/O, and instead all internal I/O requests are sent directly to the server. The LAYOUTGET request has loga_offset equal to 10000 and loga_minlength set to 50000. The value of loga_length is set to the length of the file. The metadata server is free to return a layout that fully overlaps the requested range, with a starting offset and length aligned on the stripe width. o Again, a process on the client invokes a request to read from offset 10000 for length 50000 (i.e. a range with a starting offset of 10000 and an ending offset of 69999), and buffered I/O is in use. The client is expecting that the server might not be able to return the layout for the full I/O range. The client intends to map the request of the process into a series of thirteen READ requests starting at offset 8192, each with length 4096, with a total length of 53248 (which equals 13 * 4096), which fully contains the range that client's process wants to read. Because the value of threshold4_read_iosize is equal to 4096, it is practical and reasonable for the client to use several LAYOUTGET operations to complete the series of READs. The client sends a LAYOUTGET request with loga_offset set to 8192, loga_minlength set to 4096, and loga_length set to 53248 or higher. The server will grant a layout possibly with an initial offset of zero, with an end offset of at least 8192 + 4096 - 1 = 12287, but preferably a layout with an offset aligned on the stripe width and a length that is a multiple of the stripe width. This will allow the client to make forward progress, possibly sending more LAYOUTGET operations for the remainder of the range. o An NFS client detects a sequential read pattern, and so sends a LAYOUTGET operation that goes well beyond any current or pending read requests to the server. The server might likewise detect this pattern, and grant the LAYOUTGET request. Once the client reads from an offset of the file that represents 50% of the way through the range of the last layout it received, in order to avoid stalling I/O that would wait for a layout, the client sends more operations from an offset of the file that represents 50% of the way through the last layout it received. The client continues to request layouts with byte-ranges that are well in advance of the byte-ranges of recent and/or read requests of processes running on the client.
o This strategy is as above, but the client fails to detect the pattern, but the server does. The next time the metadata server gets a LAYOUTGET, it returns a layout with a length that is well beyond loga_minlength. o A client is using buffered I/O, and has a long queue of write- behinds to process and also detects a sequential write pattern. It sends a LAYOUTGET for a layout that spans the range of the queued write-behinds and well beyond, including ranges beyond the filer's current length. The client continues to send LAYOUTGET operations once the write-behind queue reaches 50% of the maximum queue length. Once the client has obtained a layout referring to a particular device ID, the metadata server MUST NOT delete the device ID until the layout is returned or revoked. CB_NOTIFY_DEVICEID can race with LAYOUTGET. One race scenario is that LAYOUTGET returns a device ID for which the client does not have device address mappings, and the metadata server sends a CB_NOTIFY_DEVICEID to add the device ID to the client's awareness and meanwhile the client sends GETDEVICEINFO on the device ID. This scenario is discussed in Section 18.40.4. Another scenario is that the CB_NOTIFY_DEVICEID is processed by the client before it processes the results from LAYOUTGET. The client will send a GETDEVICEINFO on the device ID. If the results from GETDEVICEINFO are received before the client gets results from LAYOUTGET, then there is no longer a race. If the results from LAYOUTGET are received before the results from GETDEVICEINFO, the client can either wait for results of GETDEVICEINFO or send another one to get possibly more up-to-date device address mappings for the device ID.18.44. Operation 51: LAYOUTRETURN - Release Layout Information
18.44.1. ARGUMENT
/* Constants used for LAYOUTRETURN and CB_LAYOUTRECALL */ const LAYOUT4_RET_REC_FILE = 1; const LAYOUT4_RET_REC_FSID = 2; const LAYOUT4_RET_REC_ALL = 3; enum layoutreturn_type4 { LAYOUTRETURN4_FILE = LAYOUT4_RET_REC_FILE, LAYOUTRETURN4_FSID = LAYOUT4_RET_REC_FSID, LAYOUTRETURN4_ALL = LAYOUT4_RET_REC_ALL };
struct layoutreturn_file4 { offset4 lrf_offset; length4 lrf_length; stateid4 lrf_stateid; /* layouttype4 specific data */ opaque lrf_body<>; }; union layoutreturn4 switch(layoutreturn_type4 lr_returntype) { case LAYOUTRETURN4_FILE: layoutreturn_file4 lr_layout; default: void; }; struct LAYOUTRETURN4args { /* CURRENT_FH: file */ bool lora_reclaim; layouttype4 lora_layout_type; layoutiomode4 lora_iomode; layoutreturn4 lora_layoutreturn; };18.44.2. RESULT
union layoutreturn_stateid switch (bool lrs_present) { case TRUE: stateid4 lrs_stateid; case FALSE: void; }; union LAYOUTRETURN4res switch (nfsstat4 lorr_status) { case NFS4_OK: layoutreturn_stateid lorr_stateid; default: void; };18.44.3. DESCRIPTION
This operation returns from the client to the server one or more layouts represented by the client ID (derived from the session ID in the preceding SEQUENCE operation), lora_layout_type, and lora_iomode. When lr_returntype is LAYOUTRETURN4_FILE, the returned layout is further identified by the current filehandle, lrf_offset, lrf_length, and lrf_stateid. If the lrf_length field is NFS4_UINT64_MAX, all
bytes of the layout, starting at lrf_offset, are returned. When lr_returntype is LAYOUTRETURN4_FSID, the current filehandle is used to identify the file system and all layouts matching the client ID, the fsid of the file system, lora_layout_type, and lora_iomode are returned. When lr_returntype is LAYOUTRETURN4_ALL, all layouts matching the client ID, lora_layout_type, and lora_iomode are returned and the current filehandle is not used. After this call, the client MUST NOT use the returned layout(s) and the associated storage protocol to access the file data. If the set of layouts designated in the case of LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL is empty, then no error results. In the case of LAYOUTRETURN4_FILE, the byte-range specified is returned even if it is a subdivision of a layout previously obtained with LAYOUTGET, a combination of multiple layouts previously obtained with LAYOUTGET, or a combination including some layouts previously obtained with LAYOUTGET, and one or more subdivisions of such layouts. When the byte-range does not designate any bytes for which a layout is held for the specified file, client ID, layout type and mode, no error results. See Section 12.5.5.2.1.5 for considerations with "bulk" return of layouts. The layout being returned may be a subset or superset of a layout specified by CB_LAYOUTRECALL. However, if it is a subset, the recall is not complete until the full recalled scope has been returned. Recalled scope refers to the byte-range in the case of LAYOUTRETURN4_FILE, the use of LAYOUTRETURN4_FSID, or the use of LAYOUTRETURN4_ALL. There must be a LAYOUTRETURN with a matching scope to complete the return even if all current layout ranges have been previously individually returned. For all lr_returntype values, an iomode of LAYOUTIOMODE4_ANY specifies that all layouts that match the other arguments to LAYOUTRETURN (i.e., client ID, lora_layout_type, and one of current filehandle and range; fsid derived from current filehandle; or LAYOUTRETURN4_ALL) are being returned. In the case that lr_returntype is LAYOUTRETURN4_FILE, the lrf_stateid provided by the client is a layout stateid as returned from previous layout operations. Note that the "seqid" field of lrf_stateid MUST NOT be zero. See Sections 8.2, 12.5.3, and 12.5.5.2 for a further discussion and requirements. Return of a layout or all layouts does not invalidate the mapping of storage device ID to a storage device address. The mapping remains in effect until specifically changed or deleted via device ID notification callbacks. Of course if there are no remaining layouts
that refer to a previously used device ID, the server is free to delete a device ID without a notification callback, which will be the case when notifications are not in effect. If the lora_reclaim field is set to TRUE, the client is attempting to return a layout that was acquired before the restart of the metadata server during the metadata server's grace period. When returning layouts that were acquired during the metadata server's grace period, the client MUST set the lora_reclaim field to FALSE. The lora_reclaim field MUST be set to FALSE also when lr_layoutreturn is LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL. See LAYOUTCOMMIT (Section 18.42) for more details. Layouts may be returned when recalled or voluntarily (i.e., before the server has recalled them). In either case, the client must properly propagate state changed under the context of the layout to the storage device(s) or to the metadata server before returning the layout. If the client returns the layout in response to a CB_LAYOUTRECALL where the lor_recalltype field of the clora_recall field was LAYOUTRECALL4_FILE, the client should use the lor_stateid value from CB_LAYOUTRECALL as the value for lrf_stateid. Otherwise, it should use logr_stateid (from a previous LAYOUTGET result) or lorr_stateid (from a previous LAYRETURN result). This is done to indicate the point in time (in terms of layout stateid transitions) when the recall was sent. The client uses the precise lora_recallstateid value and MUST NOT set the stateid's seqid to zero; otherwise, NFS4ERR_BAD_STATEID MUST be returned. NFS4ERR_OLD_STATEID can be returned if the client is using an old seqid, and the server knows the client should not be using the old seqid. For example, the client uses the seqid on slot 1 of the session, receives the response with the new seqid, and uses the slot to send another request with the old seqid. If a client fails to return a layout in a timely manner, then the metadata server SHOULD use its control protocol with the storage devices to fence the client from accessing the data referenced by the layout. See Section 12.5.5 for more details. If the LAYOUTRETURN request sets the lora_reclaim field to TRUE after the metadata server's grace period, NFS4ERR_NO_GRACE is returned. If the LAYOUTRETURN request sets the lora_reclaim field to TRUE and lr_returntype is set to LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL, NFS4ERR_INVAL is returned.
If the client sets the lr_returntype field to LAYOUTRETURN4_FILE, then the lrs_stateid field will represent the layout stateid as updated for this operation's processing; the current stateid will also be updated to match the returned value. If the last byte of any layout for the current file, client ID, and layout type is being returned and there are no remaining pending CB_LAYOUTRECALL operations for which a LAYOUTRETURN operation must be done, lrs_present MUST be FALSE, and no stateid will be returned. In addition, the COMPOUND request's current stateid will be set to the all-zeroes special stateid (see Section 16.2.3.1.2). The server MUST reject with NFS4ERR_BAD_STATEID any further use of the current stateid in that COMPOUND until the current stateid is re-established by a later stateid-returning operation. On success, the current filehandle retains its value. If the EXCHGID4_FLAG_BIND_PRINC_STATEID capability is set on the client ID (see Section 18.35), the server will require that the principal, security flavor, and if applicable, the GSS mechanism, combination that acquired the layout also be the one to send LAYOUTRETURN. This might not be possible if credentials for the principal are no longer available. The server will allow the machine credential or SSV credential (see Section 18.35) to send LAYOUTRETURN if LAYOUTRETURN's operation code was set in the spo_must_allow result of EXCHANGE_ID.18.44.4. IMPLEMENTATION
The final LAYOUTRETURN operation in response to a CB_LAYOUTRECALL callback MUST be serialized with any outstanding, intersecting LAYOUTRETURN operations. Note that it is possible that while a client is returning the layout for some recalled range, the server may recall a superset of that range (e.g., LAYOUTRECALL4_ALL); the final return operation for the latter must block until the former layout recall is done. Returning all layouts in a file system using LAYOUTRETURN4_FSID is typically done in response to a CB_LAYOUTRECALL for that file system as the final return operation. Similarly, LAYOUTRETURN4_ALL is used in response to a recall callback for all layouts. It is possible that the client already returned some outstanding layouts via individual LAYOUTRETURN calls and the call for LAYOUTRETURN4_FSID or LAYOUTRETURN4_ALL marks the end of the LAYOUTRETURN sequence. See Section 12.5.5.1 for more details. Once the client has returned all layouts referring to a particular device ID, the server MAY delete the device ID.