6. Concluding Remarks

This document represents a description of the current state of the VMTP design. We are currently engaged in several experimental implementations to explore and refine all aspects of the protocol. Preliminary implementations are running in the UNIX 4.3BSD kernel and in the V kernel.

Several issues are still being discussed and explored with this protocol. First, the size of the checksum field and the algorithm to use for its calculation are undergoing some discussion. The author believes that the conventional 16-bit checksum used with TCP and IP is too weak for future high-speed networks, arguing for at least a 32-bit checksum. Unfortunately, there appears to be limited theory covering checksum algorithms that are suitable for calculation in software.

Implementation of the streaming facilities of VMTP is still in progress. This facility is expected to be important for wide-area, long delay communication.
I. Standard VMTP Response Codes

The following are the numeric values of the response codes used in VMTP.

      0     OK
      1     RETRY
      2     RETRY_ALL
      3     BUSY
      4     NONEXISTENT_ENTITY
      5     ENTITY_MIGRATED
      6     NO_PERMISSION
      7     NOT_AWAITING_MSG
      8     VMTP_ERROR
      9     MSGTRANS_OVERFLOW
     10     BAD_TRANSACTION_ID
     11     STREAMING_NOT_SUPPORTED
     12     NO_RUN_RECORD
     13     RETRANS_TIMEOUT
     14     USER_TIMEOUT
     15     RESPONSE_DISCARDED
     16     SECURITY_NOT_SUPPORTED
     17     BAD_REPLY_SEGMENT
     18     SECURITY_REQUIRED
     19     STREAMED_RESPONSE
     20     TOO_MANY_RETRIES
     21     NO_PRINCIPAL
     22     NO_KEY
     23     ENCRYPTION_NOT_SUPPORTED
     24     NO_AUTHENTICATOR
     25-63  Reserved for future VMTP assignment.

Other values of the codes are available for use by higher level protocols. Separate protocol documents will specify further standard values.

Applications are free to use values starting at 0x00800000 (hex) for application-specific return values.
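As an illustration only, the code assignments above could be carried in a program roughly as follows. This is a sketch in Python, not part of the protocol definition; the names classify, VMTP_RESERVED_LAST and APPLICATION_FIRST are hypothetical, and the sketch treats a response code as a plain non-negative integer.

```python
from enum import IntEnum


class ResponseCode(IntEnum):
    """Standard VMTP response codes with the values listed above."""
    OK = 0
    RETRY = 1
    RETRY_ALL = 2
    BUSY = 3
    NONEXISTENT_ENTITY = 4
    ENTITY_MIGRATED = 5
    NO_PERMISSION = 6
    NOT_AWAITING_MSG = 7
    VMTP_ERROR = 8
    MSGTRANS_OVERFLOW = 9
    BAD_TRANSACTION_ID = 10
    STREAMING_NOT_SUPPORTED = 11
    NO_RUN_RECORD = 12
    RETRANS_TIMEOUT = 13
    USER_TIMEOUT = 14
    RESPONSE_DISCARDED = 15
    SECURITY_NOT_SUPPORTED = 16
    BAD_REPLY_SEGMENT = 17
    SECURITY_REQUIRED = 18
    STREAMED_RESPONSE = 19
    TOO_MANY_RETRIES = 20
    NO_PRINCIPAL = 21
    NO_KEY = 22
    ENCRYPTION_NOT_SUPPORTED = 23
    NO_AUTHENTICATOR = 24


VMTP_RESERVED_LAST = 63         # 25-63 are reserved for future VMTP assignment
APPLICATION_FIRST = 0x00800000  # applications may use values from here upward


def classify(code: int) -> str:
    """Say which of the ranges described above a response code falls in."""
    if code < 0:
        raise ValueError("response codes are non-negative")
    if code <= ResponseCode.NO_AUTHENTICATOR:
        return "standard"
    if code <= VMTP_RESERVED_LAST:
        return "reserved"
    if code >= APPLICATION_FIRST:
        return "application"
    return "higher-level protocol"
```

For example, classify(ResponseCode.BUSY) yields "standard", while a value of 0x00800000 or above is reported as "application".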
II. VMTP RPC Presentation Protocol

For complete generality, the mapping of the procedures and the parameters onto VMTP messages should be defined by an RPC presentation protocol. In the absence of an accepted standard protocol, we define an RPC presentation protocol for VMTP as follows.

Each procedure is assigned an identifying Request Code. The Request Code serves effectively the same role as the tag field of a variant record, identifying the format of the Request and associated Response as a variant of the possible message formats.

The format of the Request for a procedure is its Request Code followed by its parameters sequentially in the message control block until it is full. The remaining parameters are sent as part of the message segment data formatted according to the XDR protocol (RFC ??). In this case, the size of the segment is specified in the SegmentSize field.

The Response for a procedure consists of a ResponseCode field followed by the return parameters sequentially in the message control block. If a returned parameter must be transmitted as segment data, its size is specified in the SegmentSize field and the parameter is stored in the SegmentData field.

Attributes associated with procedure definitions should indicate the Flags to be used in the RequestCode.

Request Codes are assigned as described below.

II.1. Request Code Management

Request codes are divided into Public Interface Codes and application-specific codes, according to whether the PIC bit is set. An interface is a set of request codes representing one service or module function. A public interface is one that is to be used in multiple independently developed modules.
In VMTP, public interface codes are allocated in units of 256 structured as

+-------------+----------------+-------------------+
| ControlFlags|   Interface    | Version/Procedure |
+-------------+----------------+-------------------+
    8 bits         16 bits            8 bits

An interface is free to allocate the 8 bits for version and procedure as desired. For example, all 8 bits can be used for procedures. A module requiring more than 256 Version/Procedure values can be allocated multiple Interface values. They need not be consecutive Interface values.
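The field layout above can be sketched in Python as follows. The sketch assumes that the fields are listed from the most significant to the least significant bit of a 32-bit Request Code, which is consistent with values such as 0x05000101 used in Appendix III; the function names are hypothetical.

```python
def pack_request_code(control_flags: int, interface: int, procedure: int) -> int:
    """Build a 32-bit Request Code from its three fields."""
    if not 0 <= control_flags <= 0xFF:
        raise ValueError("ControlFlags is an 8-bit field")
    if not 0 <= interface <= 0xFFFF:
        raise ValueError("Interface is a 16-bit field")
    if not 0 <= procedure <= 0xFF:
        raise ValueError("Version/Procedure is an 8-bit field")
    return (control_flags << 24) | (interface << 8) | procedure


def unpack_request_code(code: int) -> tuple:
    """Split a 32-bit Request Code into (ControlFlags, Interface, Version/Procedure)."""
    return (code >> 24) & 0xFF, (code >> 8) & 0xFFFF, code & 0xFF
```

Under this assumption, 0x05000101 splits into ControlFlags 0x05, Interface 0x0001 and Version/Procedure 0x01.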
III. VMTP Management Procedures

Standard procedures are defined for VMTP management, including creation, deletion and query of entities and entity groups, probing to get information about entities, and updating message transaction information at the client or the server.

The procedures are implemented by the VMTP manager that constitutes a portion of every complete VMTP module. Each procedure is invoked by sending a Request to the VMTP manager that handles the entity specified in the operation or the local manager. The Request is sent using the normal Send operation with the Server specified as the well-known entity group VMTP_MANAGER_GROUP, using the CoResident Entity mechanism to direct the request to the specific manager that should handle the Request. (The ProbeEntity operation is multicast to the VMTP_MANAGER_GROUP if the host address for the entity is not known locally and the host address is determined as the host address of the responder. For all other operations, a ProbeEntity operation is used to determine the host address if it is not known.) Specifying co-resident entity 0 is interpreted as co-resident with the invoking process. The co-resident entity identifier may also specify a group, in which case the Request is sent to all managers with members in this group.

The standard procedures with their RequestCode and parameters are listed below with their semantics. (The RequestCode range 0xVV000100 to 0xVV0001FF is reserved for use by the VMTP management routines, where VV is any choice of control flags with the PIC bit set. The flags are set below as required for each procedure.)

0x05000101 - ProbeEntity(CREntity, entityId, authDomain) -> (code, <staterec>)
     Request and return information on the specified entity in the specified authDomain, sending the Request to the VMTP management module coresident with CREntity. An error return is given if the requested information cannot be provided in the specified authDomain.
     The <staterec> returned is structured as the following fields.

     Transaction identifier
               The current or next transaction identifier being used by the probed entity.

     ProcessId: 64 bits
               Identifier for client process. The meaning of this is specified as part of the Domain definition.

     PrincipalId
               The identifier for the principal or account associated with the process specified by ProcessId. The meaning of this field is specified as part of the Domain definition.

     EffectivePrincipalId
               The identifier for the principal or account associated with the Client port, which may be different from the PrincipalId, especially if this is a nested call. The meaning of this field is specified as part of the Domain definition.

     The code field indicates whether this is an error response or not. The codes and their interpretation are:

     OK
               No error. Probe was completed OK.

     NONEXISTENT_ENTITY
               Specified entity does not exist.

     ENTITY_MIGRATED
               The entity has migrated and is no longer at the host to which the request was sent.

     NO_PERMISSION
               Entity has refused to provide ProbeResponse.

     VMTP_ERROR
               The Request packet group was in error relative to the VMTP protocol specification.

     "default"
               Some type of error - discard ProbeResponse.

0x0D000102 - AuthProbeEntity(CREntity,entityId,authDomain,randomId) -> (code,ProbeAuthenticator,EncryptType,EntityAuthenticator)
     Request authentication of the entity specified by entityId from the VMTP manager coresident with CREntity in authDomain authentication domain, returning the
     information contained in the return parameters. The fields are set the same as that specified for the basic ProbeResponse except as noted below.

     ProbeAuthenticator
               20 bytes consisting of the EntityId, the randomId and the probed Entity's current Transaction value plus a 32-bit checksum for these two fields (checksummed using the standard packet Checksum algorithm), all encrypted with the Key supplied in the Authenticator.

     EncryptType
               An identifier that identifies the variant of encryption method being used by the probed Entity for packets it transmits and packets it is able to receive. (See Appendix V.) The high-order 8 bits of the EncryptType contain the XOR of the 8 octets of the PrincipalId associated with the private key used to encrypt the EntityAuthenticator. This value is used by the requestor or Client as an aid in locating the key to decrypt the authenticator.

     EntityAuthenticator
               (returned as segment data) The ProcessId, PrincipalId, EffectivePrincipal associated with the ProbedEntity plus the private encryption/decryption key and its lifetime limit to be used for communication with the Entity. The authenticator is encrypted with a private key associated with the Client entity such that it can be neither read nor forged by a party not trusted by the Client Entity. The format of the Authenticator in the message segment is shown in detail in Figure III-1.

     +-----------------------------------------------+
     |             ProcessId (8 octets)              |
     +-----------------------------------------------+
     |            PrincipalId (8 octets)             |
     +-----------------------------------------------+
     |        EffectivePrincipalId (8 octets)        |
     +-----------------------------------------------+
     |                Key (8 octets)                 |
     +-----------------------------------------------+
     |                 KeyTimeLimit                  |
     +-----------------------------------------------+
     |                  AuthDomain                   |
     +-----------------------------------------------+
     |                 AuthChecksum                  |
     +-----------------------------------------------+

               Figure III-1: Authenticator Format

     Key: 64 bits
               Encryption key to be used for encrypting and decrypting packets sent to and received from the probed Entity. This is the "working" key for packet transmissions. VMTP only uses private key encryption for data transmission.

     KeyTimeLimit: 32 bits
               The time in seconds since Dec. 31st, 1969 GMT at which one should cease to use the Key.

     AuthDomain: 32 bits
               The authentication domain in which to interpret the principal identifiers. This may be different from the authDomain specified in the call if the Server cannot provide the authentication information in the request domain.

     AuthChecksum: 32 bits
               Contains the checksum (using the same Checksum algorithm as for packet) of KeyTimeLimit, Key, PrincipalId and EffectivePrincipalId.

     Notes:

     1. An authentication Probe Request and Response are sent unencrypted in general because they are used prior to there being a secure channel. Therefore, specific fields or groups of fields are checksummed and encrypted to prevent unauthorized modification or forgery. In
        particular, the ProbeAuthenticator is checksummed and encrypted with the Key.

     2. The ProbeAuthenticator authenticates the Response as responding to the Request when its EntityId, randomId and Transaction values match those in the Probe request. The ProbeAuthenticator is bound to the EntityAuthenticator by being encrypted by the private Key contained in that authenticator.

     3. The authenticator is encrypted such that it can be decrypted by a private key, known to the Client. This authenticator is presumably obtained from a key distribution center that the Client trusts. The AuthChecksum prevents undetected modifications to the authenticator.

0x05000103 - ProbeEntityBlock( entityId ) -> ( code, entityId )
     Check whether the block of 256 entity identifiers associated with this entityId is in use. The entityId returned should match that being queried or else the return value should be ignored and the operation redone.

0x05000104 - QueryVMTPNode( entityId ) -> (code, MTU, flags, authdomain, domains, authdomains, domainlist)
     Query the VMTP management module for entityId to get various module- or node-wide parameters, including: (1) MTU - Maximum transmission unit or packet size handled by this node. (2) flags - zero or more of the following bit fields:

     1    Handles streamed Requests.
     2    Can issue streamed message transactions for clients.
     4    Handles secure Requests.
     8    Can issue secure message transactions.

     The authdomain indicates the primary authentication domain supported. The domains and authdomains parameters indicate the number of entity domains and authentication domains supported by this node, which are listed in the data segment parameter domainlist if
     either parameter is non-zero. (All the entity domains precede the authentication domains in the data segment.)

0x05000105 - GetRequestForwarder( CREntity, entityId1 ) -> (code, entityId2, principal, authDomain)
     Return the forwarding server's entity identifier and principal for the forwarder of entityId1. CREntity should be zero to get the local VMTP management module.

0x05000106 - CreateEntity( entityId1 ) -> ( code, entityId2 )
     Create a new entity and return its entity identifier in entityId2. The entity is created local to the entity specified in entityId1 and local to the requestor if entityId1 is 0.

0x05000107 - DeleteEntity( entityId ) -> ( code )
     Delete the entity specified by entityId, which may be a group. If a group, the deletion is only on a best efforts basis. The client must take additional measures to ensure complete deletion if required.

0x0D000108 - QueryEntity( entityId ) -> ( code, descriptor )
     Return a descriptor of entityId in arg of a maximum of segmentSize bytes.

0x05000109 - SignalEntity( entityId, arg )->( code )
     Send the signal specified by arg to the entity specified by entityId. (arg is 32 bits.)

0x0500010A - CreateGroup(CREntity,entityGroupId,entityId,perms)->(code)
     Request that the VMTP manager local to CREntity create a new entity group, using the specified entityGroupId with entityId as the first member and permissions "perms", a 32-bit field described later. The invoker is registered as a manager of the new group, giving it the permissions to add or remove members. (Normally CREntity is 0, indicating the VMTP manager local to the requestor.)

0x0500010B - AddToGroup(CREntity, entityGroupId, entityId, perms)->(code)
     Request that the VMTP manager local to CREntity add the specified entityId to the entityGroupId with the specified permissions. If entityGroupId specifies a restricted group, the invoker must have permission to add members to the group, either because the invoker is
     a manager of the group or because it was added to the group with the required permissions. If CREntity is 0, then the local VMTP manager checks permissions and forwards the request with CREntity set to entityId and the entityId field set to a digital signature (see below) of the Request by the VMTP manager, certifying that the Client has the permissions required by the Request. (If entityGroupId specifies an unrestricted group, the Request can be sent directly to the handling VMTP manager by setting CREntity to entityId.)

0x0500010C - RemoveFromGroup(CREntity, entityGroupId, entityId)->(code)
     Request that the VMTP manager local to CREntity remove the specified entityId from the group specified by entityGroupId. Normally CREntity is 0, indicating the VMTP manager local to the requestor. If CREntity is 0, then the local VMTP manager checks permissions and forwards the request with CREntity set to entityId and the entityId field set to a digital signature of the Request by the VMTP manager, certifying that the Client has the permissions required by the Request.

0x0500010D - QueryGroup( entityId )->( code, record )...
     Return information on the specified entity. The Response from each responding VMTP manager is (code, record). The format of the record is (memberCount, member1, member2, ...). The Responses are returned on a best efforts basis; there is no guarantee that responses from all managers with members in the specified group will be received.

0x0500010E - ModifyService(entityId,flags,count,pc,threadlist)->(code, count)
     Modify the service associated with the entity specified by entityId. The flags may indicate a message service model, in which case the call "count" parameter indicates the maximum number of queued messages desired; the return "count" parameter indicates the number of queued messages allowed.
     Alternatively, the "flags" parameter may indicate the RPC thread service model, in which case "count" threads are requested, each with an initial program counter as specified and stack, priority and message receive area indicated by the threadlist. In particular, "threadlist" consists of "count" records of the form (priority,stack,stacksize,segment,segmentsize), each one assigned to one of the threads. Flags defined for the "flags" parameter are:

     1    THREAD_SERVICE - otherwise the message model.
     2    AUTHENTICATION_REQUIRED - Send a Probe request to determine the principal associated with the Client, if not known.
     4    SECURITY_REQUIRED - Request must be encrypted or else it is rejected.
     8    INCREMENTAL - treat the count value as an increment (or decrement) relative to the current value rather than an absolute value for the maximum number of queued messages or threads.

     In the thread model, the count must be a positive increment or else 0, which disables the service. Only a count of 0 terminates currently queued requests or in-progress request handling.

0x4500010F - NotifyVmtpClient(client,ctrl,recSeq,transact,delivery,code)->()
     Update the state associated with the transaction specified by client and transact, an entity identifier and transaction identifier, respectively. This operation is normally used only by another VMTP management module. (Note that it is a datagram operation.) The other parameters are as follows:

     ctrl
               A 32-bit value corresponding to the 4th 32-bit word of the VMTP header of a Response packet that would be sent in response to the Request that this is responding to. That is, the control flags, ForwardCount, RetransmitCount and Priority fields match those of the Request. (The NRS flag is set if the receiveSeqNumber field is used.) The PGCount subfield indicates the number of previous Request packet groups being acknowledged by this Notify operation. (The bit fields that are reserved in
               this word in the header are also reserved here and must be zero.)

     recSeq
               Sequence number of reception at the Server if the NRS flag is set in the ctrl parameter, otherwise reserved and zero. (This is used for sender-based logging of message activity for replay in case of failure - an optional facility.)

     delivery
               Indicates the segment blocks of the packet group that have been received at the Server.

     code
               Indicates the action the client should take, as described below.

     The VMTP management module should take action on this operation according to the code, as specified below.

     OK
               Do nothing at this time, continue waiting for the response with a reset timer.

     RETRY
               Retransmit the request packet group immediately with at least the segment blocks that the Server failed to receive, that is, the complement of those indicated by the delivery parameter.

     RETRY_ALL
               Retransmit the request packet group immediately with at least the segment blocks that the Server failed to receive, as indicated by the delivery field, plus all subsequently transmitted packets that are part of this packet run. (The latter is applicable only for streamed message transactions.)

     BUSY
               The server was unable to accept the Request at this time. Retry later if desired to continue with the message transaction.

     NONEXISTENT_ENTITY
               Specified Server entity does not exist.
     ENTITY_MIGRATED
               The server entity has migrated and is no longer at the host to which the request was sent. The Client should attempt to determine the new host address of the Server using the VMTP management ProbeEntity operation (described earlier).

     NO_PERMISSION
               Server has not authorized reception of messages from this client.

     NOT_AWAITING_MSG
               The conditional message delivery bit was set for the Request packet group and the Server was not waiting for it, so the Request packet group was discarded.

     VMTP_ERROR
               The Request packet group was in error relative to the VMTP protocol specification.

     BAD_TRANSACTION_ID
               Transaction identifier is old relative to the transaction identifier held for the Client by the Server.

     STREAMING_NOT_SUPPORTED
               Server does not support multiple outstanding message transactions from the same Client, i.e. streamed message transactions.

     SECURITY_NOT_SUPPORTED
               The Request was secure and this Server does not support security.

     SECURITY_REQUIRED
               The Server is refusing the Request because it was not encrypted.

     NO_RUN_RECORD
               Server has no record of previous packets in this run of packet groups. This can occur if the first packet group is lost or if the current packet group is sent significantly later than the last one and the Server has discarded its client state record.
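As a rough illustration of the RETRY handling for NotifyVmtpClient, the segment blocks to retransmit are those that were sent but are not marked as received in the delivery parameter. The sketch below assumes, which this appendix does not state, that delivery is a 32-bit mask in which bit i (counting from the least significant bit) stands for segment block i; the function name and the sent_mask argument are hypothetical.

```python
from typing import List


def blocks_to_retransmit(sent_mask: int, delivery_mask: int, width: int = 32) -> List[int]:
    """Return the indices of segment blocks that were sent but not reported as received.

    Assumption: bit i of each mask corresponds to segment block i.
    """
    all_bits = (1 << width) - 1
    # Complement of the delivery mask, restricted to the blocks actually sent.
    missing = sent_mask & ~delivery_mask & all_bits
    return [i for i in range(width) if (missing >> i) & 1]
```

With four blocks sent (mask 0b1111) and blocks 0 and 2 reported as received (mask 0b0101), blocks 1 and 3 would be retransmitted.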
0x45000110 - NotifyVmtpServer(server,client,transact,delivery,code)->()
     Update the server state associated with the transaction specified by client and transact, an entity identifier and transaction identifier, respectively. This operation is normally used only by another VMTP management module. (Note that it is a datagram operation.) The other parameters are as follows:

     delivery
               Indicates the segment blocks of the Response packet group that have been received at the Client.

     code
               Indicates the action the Server should take, as listed below.

     The VMTP management module should take action on this operation according to the code, as specified below.

     OK
               Client is satisfied with Response data. The Server can discard the response data, if any.

     RETRY
               Retransmit the Response packet group immediately with at least the segment blocks that the Client failed to receive, as indicated by the delivery parameter. (The delivery parameter indicates those segment blocks received by the Client.)

     RETRY_ALL
               Retransmit the Response packet group immediately with at least the segment blocks that the Client failed to receive, as indicated by (the complement of) the delivery parameter. Also, retransmit all Response packet groups sent subsequent to the specified packet group.

     NONEXISTENT_ENTITY
               Specified Client entity does not exist.

     ENTITY_MIGRATED
               The Client entity has migrated and is no longer at the host to which the response was sent.

     RESPONSE_DISCARDED
               The Response was discarded and is no longer of interest to the Client. This may occur if the conditional message delivery bit was set for the Response packet group and the Client was not waiting for it, so the Response packet group was discarded.

     VMTP_ERROR
               The Response packet group was in error relative to the VMTP protocol specification.

0x41000111 - NotifyRemoteVmtpClient(client,ctrl,recSeq,transact,delivery,code)->()
     The same as NotifyVmtpClient except the co-resident addressing is not used. This operation is used to update client state that is remote when a Request is forwarded.

Note the use of the CRE bit in the RequestCodes to route the request to the correct VMTP management module(s) to handle the request.

III.1. Entity Group Management

An entity in a group has a set of permissions associated with its membership, controlling whether it can add or remove others, whether it can remove itself, and whether others can remove it from the group. The permissions for entity groups are as follows:

     VMTP_GRP_MANAGER   0x00000001 { Manager of group. }
     VMTP_REM_BY_SELF   0x00000002 { Can be removed by self. }
     VMTP_REM_BY_PRIN   0x00000004 { Can be rem'ed by same principal. }
     VMTP_REM_BY_OTHE   0x00000008 { Can be removed by any others. }
     VMTP_ADD_PRIN      0x00000010 { Can add by same principal. }
     VMTP_ADD_OTHE      0x00000020 { Can add any others. }
     VMTP_REM_PRIN      0x00000040 { Can remove same principal. }
     VMTP_REM_OTHE      0x00000080 { Can remove any others. }

To remove an entity from a restricted group, the invoker must have permission to remove that entity and the entity must have permissions that allow it to be removed by that invoker. With an unrestricted group, only the latter condition applies.

With a restricted group, a member can only be added by another entity with the permissions to add other entities. The creator of a group is given full permissions on a group. An entity adding another entity to a
group can only give the entity it adds a subset of its permissions.

With unrestricted groups, any entity can add itself to the group. It can also add other entities to the group provided the entity is not marked as immune to such requests. (This is an implementation restriction that individual entities can impose.)

III.2. VMTP Management Digital Signatures

As mentioned above, the entityId field of the AddToGroup and RemoveFromGroup operations is used to transmit a digital signature indicating that the permission for the operation has been checked by the sending kernel. The digital signature procedures have not yet been defined. This field should be set to 0 for now to indicate no signature after the CREntity parameter is set to the entity on which the operation is to be performed.
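The permission rules of Section III.1 can be sketched in Python as follows. This is only one possible reading and makes several assumptions that the text does not settle: a group manager is taken to satisfy the invoker-side condition, an entity removing itself needs only VMTP_REM_BY_SELF on its own membership, and a permission covering "any others" is taken to cover entities of the same principal as well. The relation constants and function names are hypothetical.

```python
from enum import IntFlag


class GroupPerm(IntFlag):
    """Entity group permission bits as listed in Section III.1."""
    VMTP_GRP_MANAGER = 0x00000001
    VMTP_REM_BY_SELF = 0x00000002
    VMTP_REM_BY_PRIN = 0x00000004
    VMTP_REM_BY_OTHE = 0x00000008
    VMTP_ADD_PRIN = 0x00000010
    VMTP_ADD_OTHE = 0x00000020
    VMTP_REM_PRIN = 0x00000040
    VMTP_REM_OTHE = 0x00000080


# How the invoker relates to the entity being removed (hypothetical labels).
SELF, SAME_PRINCIPAL, OTHER = "self", "same-principal", "other"


def target_allows_removal(target: GroupPerm, relation: str) -> bool:
    """Does the member's own permission set allow removal by this kind of invoker?"""
    if relation == SELF:
        return bool(target & GroupPerm.VMTP_REM_BY_SELF)
    if relation == SAME_PRINCIPAL:
        return bool(target & (GroupPerm.VMTP_REM_BY_PRIN | GroupPerm.VMTP_REM_BY_OTHE))
    return bool(target & GroupPerm.VMTP_REM_BY_OTHE)


def invoker_may_remove(invoker: GroupPerm, relation: str) -> bool:
    """Invoker-side check; it applies to restricted groups only."""
    if relation == SELF or invoker & GroupPerm.VMTP_GRP_MANAGER:
        return True  # assumption: self-removal and managers pass this side
    if relation == SAME_PRINCIPAL:
        return bool(invoker & (GroupPerm.VMTP_REM_PRIN | GroupPerm.VMTP_REM_OTHE))
    return bool(invoker & GroupPerm.VMTP_REM_OTHE)


def can_remove(invoker: GroupPerm, target: GroupPerm, relation: str, restricted: bool) -> bool:
    """Both conditions for a restricted group, only the target's for an unrestricted one."""
    if not target_allows_removal(target, relation):
        return False
    return invoker_may_remove(invoker, relation) if restricted else True


def is_subset_grant(adder: int, granted: int) -> bool:
    """An adding entity can only pass on a subset of its own permissions."""
    return (int(granted) & ~int(adder)) == 0
```

The is_subset_grant check corresponds to the rule that an entity adding another entity to a group can only give it a subset of its own permissions.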
IV. VMTP Entity Identifier Domains

VMTP allows for several disjoint naming domains for its endpoints. The 64-bit entity identifier is only unique and meaningful within its domain. Each domain can define its own algorithm or mechanism for assignment of entity identifiers, although each domain mechanism must ensure uniqueness, stability of identifiers and host independence.

IV.1. Domain 1

For initial use of VMTP, we define the domain with Domain identifier 1 as follows:

+-----------+----------------+------------------------+
| TypeFlags | Discriminator  |    Internet Address    |
+-----------+----------------+------------------------+
   4 bits        28 bits              32 bits

The Internet address is the Internet address of the host on which this entity-id is originally allocated. The Discriminator is an arbitrary value that is unique relative to this Internet host address. In addition, the host must guarantee that this identifier does not get reused for a long period of time after it becomes invalid. ("Invalid" means that no VMTP module considers it bound to an entity.) One technique is to use the lower order bits of a 1 second clock. The clock need not represent real-time but must never be set back after a crash. In a simple implementation, using the low order bits of a clock as the time stamp, the generation of unique identifiers is overall limited to no more than 1 per second on average.

The type flags were described in Section 3.1.

An entity may migrate between hosts. Thus, an implementation can heuristically use the embedded Internet address to locate an entity but should be prepared to maintain a cache of redirects for migrated entities, plus accept Notify operations indicating that migration has occurred.

Entity group identifiers in Domain 1 are structured in one of two forms, depending on whether they are well-known or dynamically allocated identifiers.
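The basic Domain 1 field layout can be sketched in Python as follows. The sketch assumes that the fields in the diagram are listed from the most significant to the least significant bit of the 64-bit identifier, so that the Internet address occupies the low-order 32 bits; the function names are illustrative and are not part of VMTP.

```python
import ipaddress


def pack_domain1(type_flags: int, discriminator: int, address: str) -> int:
    """Build a 64-bit Domain 1 entity identifier from its three fields."""
    if not 0 <= type_flags <= 0xF:
        raise ValueError("TypeFlags is a 4-bit field")
    if not 0 <= discriminator < (1 << 28):
        raise ValueError("Discriminator is a 28-bit field")
    addr = int(ipaddress.IPv4Address(address))  # 32-bit Internet address
    return (type_flags << 60) | (discriminator << 32) | addr


def unpack_domain1(entity_id: int) -> tuple:
    """Split an identifier into (TypeFlags, Discriminator, dotted-decimal address)."""
    type_flags = (entity_id >> 60) & 0xF
    discriminator = (entity_id >> 32) & 0x0FFFFFFF
    address = str(ipaddress.IPv4Address(entity_id & 0xFFFFFFFF))
    return type_flags, discriminator, address
```

Packing and then unpacking returns the original three fields, which gives a simple consistency check on the shifts and masks.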
A well-known entity identifier is structured as:

+-----------+----------------+------------------------+
| TypeFlags | Discriminator  |Internet Host Group Addr|
+-----------+----------------+------------------------+
   4 bits        28 bits              32 bits

with the second high-order bit (GRP) set to 1. This form of entity identifier is mapped to the Internet host group address specified in the low-order 32 bits. The Discriminator distinguishes group identifiers using the same Internet host group. Well-known entity group identifiers should be allocated to correspond to the basic services provided by hosts that are members of the group, not specifically because that service is provided by VMTP. For example, the well-known entity group identifier for the domain name service should contain as its embedded Internet host group address the host group for Domain Name servers.

A dynamically allocated entity identifier is structured as:

+-----------+----------------+------------------------+
| TypeFlags | Discriminator  |   Internet Host Addr   |
+-----------+----------------+------------------------+
   4 bits        28 bits              32 bits

with the second high-order bit (GRP) set to 1. The Internet address in the low-order 32 bits is an Internet address assigned to the host that dynamically allocates this entity group identifier. A dynamically allocated entity group identifier is mapped to Internet host group address 232.X.X.X where X.X.X are the low-order 24 bits of the Discriminator subfield of the entity group identifier.

We use the following notation for Domain 1 entity identifiers <10> and propose its use as a standard convention.

     <flags>-<discriminator>-<Internet address>

where <flags> are [X]{BE,LE,RG,UG}[A]

     X  = reserved
     BE = big-endian entity
     LE = little-endian entity
     RG = restricted group
     UG = unrestricted group
     A  = alias

and <discriminator> is a decimal integer and <Internet address> is in standard dotted decimal IP address notation.

Examples:

     BE-25593-36.8.0.49
               is big-endian entity #25593 created on host 36.8.0.49.

     RG-1-224.0.1.0
               is the well-known restricted VMTP managers group.

     UG-565338-36.8.0.77
               is unrestricted entity group #565338 created on host 36.8.0.77.

     LEA-7823-36.8.0.77
               is a little-endian alias entity #7823 created on host 36.8.0.77.

_______________
<10> This notation was developed by Steve Deering.

This notation makes it easy to communicate and understand entity identifiers for Domain 1.

The well-known entity identifiers specified to date are:

     VMTP_MANAGER_GROUP      RG-1-224.0.1.0
               Managers for VMTP operations.

     VMTP_DEFAULT_BECLIENT   BE-1-224.0.1.0
               Client entity identifier to use when a (big-endian) host has not determined or been allocated any client entity identifiers.

     VMTP_DEFAULT_LECLIENT   LE-1-224.0.1.0
               Client entity identifier to use when a (little-endian) host has not determined or been allocated any client entity identifiers.

Note that 224.0.1.0 is the host group address assigned to VMTP and to which all VMTP hosts belong.

Other well-known entity group identifiers will be specified in subsequent extensions to VMTP and in higher-level protocols that use VMTP.

IV.2. Domain 3

Domain 3 is reserved for embedded systems that are restricted to a single network and are independent of IP. Entity identifiers are allocated using the decentralized approach described below. The mapping of entity group identifiers is specific to the type of network being used and not defined here. In general, there should be a simple algorithmic mapping from entity group identifier to multicast address, similar to that described for Domain 1. Similarly, the values for default client identifier are specific to the type of network and not
defined here. IV.3. Other Domains Definition of additional VMTP domains is planned for the future. Requests for allocation of VMTP Domains should be addressed to the Internet protocol administrator. IV.4. Decentralized Entity Identifier Allocation The ProbeEntityBlock operation may be used to determine whether a block of entity identifiers is in use. ("In use" means valid or reserved by a host for allocation.) This mechanism is used to detect collisions in allocation of blocks of entity identifiers as part of the implementation of decentralized allocation of entity identifiers. (Decentralized allocation is used in local domain use of VMTP such as in embedded systems- see Domain 3.) Basically, a group of hosts can form a Domain or sub-Domain, a group of hosts managing their own entity identifier space or subspace, respectively. As an example of a sub-Domain, a group of hosts in Domain 1 all identified with a particular host group address can manage the sub-Domain corresponding to all entity identifiers that contain that host group address. The ProbeEntityBlock operation is used to allocate the random bits of these identifiers as follows. When a host requires a new block of entity identifiers, it selects a new block (randomly or by some choice algorithm) and then multicasts a ProbeEntityBlock request to the members of the (sub-)Domain some R times. If no response is received after R (re)transmissions, the host concludes that it is free to use this block of identifiers. Otherwise, it picks another block and tries again. Notes: 1. A block of 256 identifiers is specified by an entity identifier with the low-order 8 bits all zero. 2. When a host allocates an initial block of entity identifiers (and therefore does not yet have a specified entity identifier to use) it uses VMTP_DEFAULT_BECLIENT (if big-endian, else VMTP_DEFAULT_LECLIENT if little-endian) as its client identifier in the ProbeEntityBlock Request and a transaction identifier of 0. 
As soon as it has allocated a block of entity identifiers, it should use these identifiers
for all subsequent communication. The default client identifier values are defined for each Domain.

3. The set of hosts using this decentralized allocation must not be subject to network partitioning. That is, the R transmissions must be sufficient to ensure that every host sees the ProbeEntityBlock request and (reliably) sends a response. (A host that detects a collision can retransmit the response, up to a maximum number of times, until it sees a new ProbeEntityBlock operation from the same host/Client.) For instance, a set of machines connected by a single local network may be able to use this type of allocation.

4. To guarantee T-stability, a host must prevent reuse of a block of identifiers if any of the identifiers in the block are currently valid or have been valid less than T seconds previously. To this end, a host must remember recently used identifiers and object to their reuse in response to a ProbeEntityBlock operation.

5. Care is required in a VMTP implementation to ensure that Probe operations are not discarded due to lack of buffer space, and are not queued or delayed to the point that a response is not generated quickly. This is required not only to detect collisions but also to provide accurate roundtrip estimates as part of ProbeEntity operations.
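The allocation procedure above can be sketched in C as follows. This is an illustration only, not part of the protocol definition. The names allocate_entity_block, pick_block_fn and probe_block_fn are hypothetical, the entity identifier is treated as an opaque 64-bit value, and the probe callback stands in for one multicast ProbeEntityBlock transmission together with the wait for an objecting response. The demo_ helpers exist only to exercise the loop.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Low-order 8 bits zero: the identifier names a block of 256 (Note 1). */
#define ENTITY_BLOCK_MASK (~(uint64_t)0xFF)

/* Hypothetical hooks supplied by the surrounding VMTP module. */
typedef uint64_t (*pick_block_fn)(void *ctx);              /* choose a candidate  */
typedef bool (*probe_block_fn)(uint64_t block, void *ctx); /* true = host objected */

/*
 * Try up to max_blocks candidate blocks. Each candidate is probed up to R
 * times; silence after all R transmissions means the block is free to use.
 * Returns true and stores the block base in *out on success.
 */
bool allocate_entity_block(pick_block_fn pick, probe_block_fn probe, void *ctx,
                           int R, int max_blocks, uint64_t *out)
{
    assert(R > 0 && out != NULL);
    for (int attempt = 0; attempt < max_blocks; attempt++) {
        uint64_t block = pick(ctx) & ENTITY_BLOCK_MASK;
        bool in_use = false;
        for (int i = 0; i < R && !in_use; i++)
            in_use = probe(block, ctx);
        if (!in_use) {
            *out = block;
            return true;
        }
    }
    return false; /* every candidate collided */
}

/* Toy stand-ins: a "network" in which exactly one block is already taken. */
struct demo_net {
    uint64_t next_candidate;
    uint64_t taken_block;
    int probes_sent;
};

uint64_t demo_pick(void *ctx)
{
    struct demo_net *net = ctx;
    uint64_t candidate = net->next_candidate;
    net->next_candidate += 0x100; /* the next call proposes the following block */
    return candidate;
}

bool demo_probe(uint64_t block, void *ctx)
{
    struct demo_net *net = ctx;
    net->probes_sent++;
    return block == net->taken_block;
}
```

A real implementation would also have to answer probes from other hosts for blocks it holds or has held within the last T seconds (Note 4); that side is omitted here.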
V. Authentication Domains

A VMTP authentication domain defines the format and interpretation for principal identifiers and encryption keys. In particular, an authentication domain must specify a means by which principal identifiers are allocated and guaranteed unique and stable.

Ideally, all entities within one entity domain are also associated with one authentication domain. However, authentication domains are orthogonal to entity domains. Entities within one domain may have different authentication domains. (In this case, it is generally necessary to have some correspondence between principals in the different domains.) Also, one entity identifier may be associated with multiple authentication domains. Finally, one authentication domain may be used across multiple entity domains.

The currently defined authentication domains are as follows (0 is reserved).

V.1. Authentication Domain 1

A principal identifier is structured as follows.

+---------------------------+------------------------+
|     Internet Address      | Local User Identifier  |
+---------------------------+------------------------+
           32 bits                    32 bits

The Internet Address may specify an individual host (such as a UNIX machine) or may specify a host group address corresponding to a cluster of machines operating under a single administration. In both cases, there is assumed to be an administration associated with the embedded Internet address that guarantees the uniqueness and stability of the User Identifier relative to the Internet address. In particular, that administration is the only one authorized to allocate principal identifiers with that Internet address prefix, and it may allocate any of these identifiers.

In authentication domain 1, the standard EncryptionQualifiers are:

0   Clear text - no encryption.

1   Use 64-bit CBC DES for encryption and decryption.

V.2. Other Authentication Domains

Other authentication domains will be defined in the future as needed.
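As an illustration of the Domain 1 principal identifier layout, the following C sketch packs and unpacks the two 32-bit fields. The type and function names are hypothetical, and placing the Internet Address in the high-order 32 bits of a 64-bit value is an assumption read off the left-to-right order of the diagram; the text above does not fix a byte order.

```c
#include <assert.h>
#include <stdint.h>

/* EncryptionQualifier values defined for authentication domain 1. */
enum {
    VMTP_ENCRYPT_CLEAR   = 0, /* clear text, no encryption */
    VMTP_ENCRYPT_DES_CBC = 1  /* 64-bit CBC DES */
};

typedef struct {
    uint32_t internet_addr; /* individual host or host group address */
    uint32_t local_user;    /* unique and stable within that administration */
} principal_t;

static_assert(sizeof(principal_t) == 8, "principal identifier is 64 bits");

/* Assumption: the Internet Address occupies the high-order 32 bits. */
uint64_t principal_pack(principal_t p)
{
    return ((uint64_t)p.internet_addr << 32) | p.local_user;
}

principal_t principal_unpack(uint64_t id)
{
    principal_t p = { (uint32_t)(id >> 32), (uint32_t)(id & 0xFFFFFFFFu) };
    return p;
}

/* Identifiers sharing an address prefix are allocated by one administration. */
int principal_same_administration(uint64_t a, uint64_t b)
{
    return (a >> 32) == (b >> 32);
}
```

For example, local user 42 under the administration of host 36.8.0.49 (0x24080031) would pack to 0x240800310000002A under these assumptions.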
VI. IP Implementation

VMTP is designed to be implemented on the DoD IP Internet Datagram Protocol (although it may also be implemented as a local network protocol directly in "raw" network packets). VMTP is assigned the protocol number 81.

With a 20 octet IP header and one segment block, a VMTP packet is 600 octets. By convention, any host implementing VMTP implicitly agrees to accept VMTP/IP packets of at least 600 octets.

VMTP multicast facilities are designed to work with, and have been implemented using, the multicast extensions to the Internet [8] described in RFC 966 and 988. The wide-scale use of full VMTP/IP depends on the availability of IP multicast in this form.
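The 600 octet figure can be checked with a little arithmetic, sketched here in C. Only the 20 octet IP header, the protocol number 81 and the 600 octet total come from this appendix. The 64 octet VMTP header, the 512 octet segment block and the 4 octet trailing checksum are assumptions taken from the packet format defined elsewhere in the specification, and all identifier names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

enum {
    VMTP_IP_PROTOCOL_NUMBER = 81,  /* IP protocol number assigned to VMTP */
    IP_HEADER_OCTETS        = 20,  /* IP header without options */
    VMTP_HEADER_OCTETS      = 64,  /* assumption: fixed VMTP header */
    VMTP_SEGMENT_BLOCK      = 512, /* assumption: one segment block */
    VMTP_CHECKSUM_OCTETS    = 4,   /* assumption: trailing 32-bit checksum */
    VMTP_MIN_ACCEPT_OCTETS  = 600  /* every VMTP/IP host accepts this much */
};

/* 20 + 64 + 512 + 4 = 600: one segment block fits the agreed minimum. */
static_assert(IP_HEADER_OCTETS + VMTP_HEADER_OCTETS + VMTP_SEGMENT_BLOCK +
              VMTP_CHECKSUM_OCTETS == VMTP_MIN_ACCEPT_OCTETS,
              "one-block VMTP/IP packet should be 600 octets");

/* Octets in a VMTP/IP packet carrying n_blocks full segment blocks. */
size_t vmtp_ip_packet_octets(unsigned n_blocks)
{
    return (size_t)IP_HEADER_OCTETS + VMTP_HEADER_OCTETS +
           (size_t)n_blocks * VMTP_SEGMENT_BLOCK + VMTP_CHECKSUM_OCTETS;
}
```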
VII. Implementation Notes

The performance and reliability of a protocol in operation is highly dependent on the quality of its implementation, in addition to the "intrinsic" quality of the protocol design. One of the design goals of the VMTP effort was to produce an efficiently implementable protocol. The following notes and suggestions are based on experience with implementing VMTP in the V distributed system and the UNIX 4.3 BSD kernel.

The following is described for a client and server handling only one domain. A multi-domain client or server would replicate these structures for each domain, although buffer space may be shared.

VII.1. Mapping Data Structures

The ClientMap procedure is implemented using a hash table that maps an entity identifier to its Client State Record (CSR), whether the entity is local or remote, as shown in Figure VII-1.

            +---+---+--------------------------+
  ClientMap |   | x |                          |
            +---+-|-+--------------------------+
                  |   +--------------+    +--------------+
                  +-->| LocalClient  |--->| LocalClient  |
                      +--------------+    +--------------+
                      | RemoteClient |    | RemoteClient |-> ...
                      +--------------+    +--------------+
                      |              |    |              |
                      |              |    |              |
                      +--------------+    +--------------+

Figure VII-1: Mapping Client Identifier to CSR

Local clients are linked through the LocalClientLink, and remote clients similarly through the RemoteClientLink. Once a CSR with the specified Entity Id is found, some field or flag indicates whether it is identifying a local or remote Entity. Hash collisions are handled with the overflow pointers LocalClientLink and RemoteClientLink (not shown) in the CSR for the LocalClient and RemoteClient fields, respectively. Note that a CSR representing an RPC request has both a local and remote entity identifier mapping to the same CSR.

The Server specified in a Request is mapped to a server descriptor using the ServerMap (with collisions handled by the overflow pointer). The server descriptor is the root of a queue of CSR's for handling requests plus flags that modify the handling of the Request. Flags include:
            +-------+---+-------------------------+
  ServerMap |       | x |                         |
            +-------+-|-+-------------------------+
                      |   +--------------+
                      |   | OverflowLink |
                      |   +--------------+
                      +-->| Server       |
                          +--------------+
                          | Flags | Lock |
                          +--------------+
                          | Head Pointer |
                          +--------------+
                          | Tail Pointer |
                          +--------------+

Figure VII-2: Mapping Server Identifiers

THREAD_QUEUE
   Request is to be invoked directly as a remote procedure invocation, rather than by a server process in the message model.

AUTHENTICATION_REQUIRED
   Send a Probe request to determine the principal associated with the Client, if not known.

SECURITY_REQUIRED
   Request must be encrypted; otherwise it is rejected.

REQUESTS_QUEUED
   Queue contains waiting requests, rather than free CSR's. Queue this request as well.

SERVER_WAITING
   The server is waiting and available to handle an incoming Request immediately, as required by CMD.

Alternatively, the Server identifiers can be mapped to a CSR using the MapToClient mechanism with a pointer in the CSR referring to the server descriptor, if any. This scheme is attractive if there are client CSR's associated with a service to allow it to communicate as a client using VMTP with other services.

Finally, a similar structure is used to expand entity group identifiers to the local membership, as shown in Figure VII-3. A group identifier is hashed to an index in the GroupMap. The list of group descriptors rooted at that index in the GroupMap contains a group descriptor for each local member of the group. The flags are the group permissions defined in Appendix III.
            +-------+---+----------------------------------+
   GroupMap |       | x |                                  |
            +-------+-|-+----------------------------------+
                      |   +--------------+
                      |   | OverflowLink |
                      |   +--------------+
                      +-->|EntityGroupId |
                          +--------------+
                          | Flags        |
                          +--------------+
                          | Member Entity|
                          +--------------+

Figure VII-3: Mapping Group Identifiers

Note that the same pool of descriptors could be used for the server and group descriptors given that they are similar in size.

VII.2. Client Data Structures

Each client entity is represented as a client state record. The CSR contains a VMTP header as well as other bookkeeping fields, including timeout count and retransmission count, as described in Section 4.1. In addition, there is a timeout queue, transmission queue and reception queue. Finally, there is a ServerHost cache that maps from server entity-id records to host address, estimated round trip time, interpacket gap, MTU size and (optionally) estimated processing time for this server entity.

VII.3. Server Data Structures

The server maintains a heap of client state records (CSR), one for each (Client, Transaction). (If streams are not supported, there is, at worst, a CSR per Client with which the server has communicated recently.) The CSR contains a VMTP header as well as various bookkeeping fields including timeout count and retransmission count. The server maintains a hash table mapping of Client to CSR as well as the transmission, timeout and reception queues. In a VMTP module implementing both the client and server functions, the same timeout queue and transmission queue are used for both.
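The ClientMap of Section VII.1, which the client and server structures above both rely on, might be sketched in C as below. This is one possible reading rather than the layout used by the existing implementations: the structure and function names, the bucket count, the hash function, the use of 0 to mean "no identifier", and the choice of keeping separate local and remote chain heads per bucket are all assumptions. Only the LocalClient and RemoteClient fields and their two overflow links come from the text.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define CLIENT_MAP_BUCKETS 64 /* illustrative size */

typedef struct csr {
    uint64_t local_client;          /* 0 = no local entity (assumption) */
    uint64_t remote_client;         /* 0 = no remote entity (assumption) */
    struct csr *local_client_link;  /* overflow chain for local_client */
    struct csr *remote_client_link; /* overflow chain for remote_client */
} csr_t;

typedef struct {
    csr_t *local[CLIENT_MAP_BUCKETS];
    csr_t *remote[CLIENT_MAP_BUCKETS];
} client_map_t;

static size_t client_hash(uint64_t id)
{
    return (size_t)((id ^ (id >> 32)) % CLIENT_MAP_BUCKETS);
}

/* Enter a CSR under its local and/or remote entity identifier. */
void client_map_insert(client_map_t *map, csr_t *csr)
{
    assert(map != NULL && csr != NULL);
    if (csr->local_client != 0) {
        size_t h = client_hash(csr->local_client);
        csr->local_client_link = map->local[h];
        map->local[h] = csr;
    }
    if (csr->remote_client != 0) {
        size_t h = client_hash(csr->remote_client);
        csr->remote_client_link = map->remote[h];
        map->remote[h] = csr;
    }
}

/* Find the CSR for an entity identifier; *is_local says which field matched. */
csr_t *client_map_lookup(const client_map_t *map, uint64_t id, bool *is_local)
{
    size_t h = client_hash(id);
    for (csr_t *c = map->local[h]; c != NULL; c = c->local_client_link)
        if (c->local_client == id) { *is_local = true; return c; }
    for (csr_t *c = map->remote[h]; c != NULL; c = c->remote_client_link)
        if (c->remote_client == id) { *is_local = false; return c; }
    return NULL;
}
```

A CSR representing an RPC request is entered under both identifiers, so a lookup by either the local or the remote entity identifier reaches the same record.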
VII.4. Packet Group Transmission

The procedure SendPacketGroup( csr ) transmits the packet group specified by the record csr. It performs the following steps:

1. Fragments the segment data, if any, into packets. (Note that the presence of segment data is flagged by the SDA bit.)

2. Modifies the VMTP header for each packet as required, e.g. changing the delivery mask as appropriate.

3. Computes the VMTP checksum.

4. Encrypts the appropriate portion of the packet, if required.

5. Prepends and appends the network-level header and trailer using the network address from the ServerHost cache, or from the responding CSR.

6. Transmits the packet with the interpacket gap specified in the cache. This may involve round-robin scheduling between hosts as well as delaying transmissions slightly.

7. Invokes the finish-up procedure specified by the CSR record, completing the processing. Generally, this finish-up procedure adds the record to the timeout queue with the appropriate timeout value.

The CSR includes a 32-bit transmission mask that indicates the portions of the segment to transmit. The SendPacketGroup procedure is assumed to handle queuing at the network transmission queue, queuing in priority order according to the priority field specified in the CSR record. (This priority may be reflected in network transmission behavior for networks that support priority.)

The SendPacketGroup procedure only looks at the following fields of a CSR

- Transmission mask
- FuncCode
- SDA
- Client
- Server
- CoResidentEntity
- Key

It modifies the following fields

- Length
- Delivery
- Checksum

In the case of encrypted transmission, it encrypts the entire packet except for the Client field and the 32 bits that follow it.

If the packet group is a Response (i.e. the low-order bit of the function code is 1), the destination network address is determined from the Client, otherwise from the Server. The HostAddr field is set either from the ServerHost cache (if a Request) or from the original Request (if a Response), before SendPacketGroup is called.

The CSR includes timeout and TTL fields indicating the maximum time to complete the processing and the time-to-live for the packets to be transmitted.

SendPacketGroup is viewed as the right functionality to implement for transmission in an "intelligent" network interface.

Finally, it appears preferable to be able to assume that all portions of the segment remain memory-resident (no page faults) during transmission. In demand-paged systems, some form of locking is required to keep the segment data in memory.

VII.5. VMTP Management Module

The implementation should implement the management operations as a separate module that is invoked from within the VMTP module. When a Request is received, either from the local user level or the network, for the VMTP management module, the management module is invoked as a remote or local procedure call to handle this request and return a response (if not a datagram request). By registering as a local server, the management module should minimize the special-case code required for its invocation. The management module is basically a case statement that selects the operation based on the RequestCode and then invokes the specified management operation. The procedures implementing the management operations, especially operations like NotifyVmtpClient and
NotifyVmtpServer, are logically part of the VMTP module because they require full access to the basic data structures of the VMTP implementation.

The management module should be implemented so that it can respond quickly to all requests, particularly since the timing of management interactions is used to estimate round trip time. To date, all implementations of the management module have been done at the kernel level, along with VMTP proper.

VII.6. Timeout Handling

The timeout queue is a queue of CSR records, ordered by timeout count, as specified in the CSR record. On entry into the timeout queue, the CSR record has the timeout field set to the time (preferably in milliseconds or a similar unit) to remain in the queue, plus the finishup field set to the procedure to execute when the record is removed from the queue on timeout. The timeout field for a CSR in the queue is the time relative to the record preceding it in the queue (if any) at which it is to be removed. Some system-specific mechanism decrements the time for the record at the front of the queue, invoking the finishup procedure when the count goes to zero.

Using this scheme, a special CSR is used to time out periodically and scan for CSR's that have not been pinged recently. That is, this CSR times out and invokes a finishup procedure that scans for CSR's in the "AwaitingResponse" state that have not been pinged recently; for each one, it signals the request processing entity and deletes the CSR. The special CSR then returns to the timeout queue.

The timeout mechanism tends to be specific to an operating system. The scheme described may have to be adapted to the operating system in which VMTP is to be implemented.

This mechanism handles client request timeout and client response timeout. It is not intended to handle interpacket gaps given that these times are expected to be under 1 millisecond in general and possibly only a few microseconds.

VII.7. Timeout Values

Roundtrip timeout values are estimated by matching Responses or NotifyVmtpClient Requests to Request transmission, relying on the retransmitCount to identify the particular transmission of the Request that generated the response. A similar technique can be used with Responses and NotifyVmtpServer Requests. The retransmitCount is
incremented each time the Response is sent, whether the retransmission was caused by timeout or retransmission of the Request.

The ProbeEntity request is recommended as a basic way of getting up-to-date information about a Client as well as predictable host machine turnaround in processing a request. (VMTP assumes and requires an efficient, bounded response time implementation of the ProbeEntity operation.)

Using this mechanism for measuring RTT, it is recommended that the various estimation and smoothing techniques developed for TCP RTT estimation be adapted and used.

VII.8. Packet Reception

Logically a network packet containing a VMTP packet consists of 5 portions:

- network header, possibly including lower-level headers
- VMTP header
- data segment
- VMTP checksum
- network trailer, etc.

It may be advantageous to receive a packet fragmented into these portions, if supported by the network module. In this case, ideally the VMTP header may be received directly into a CSR, the data segment into a page that can be mapped, rather than copied, to its final destination, with VMTP checksum and network header in a separate area (used to extract the network address corresponding to the sender).

Packet reception is described in detail by the pseudo-code in Section 4.7.

With a response, normally the CSR has an associated segment area immediately available so delivery of segment data is immediate. Similarly, server entities should be "armed" with CSR's with segment areas that provide for immediate delivery of requests. It is reasonable to discard segment data that cannot be immediately delivered in this way, providing that clients and servers are able to preallocate CSR's with segment areas for requests and responses. In particular, a client should be able to provide some number of additional CSR's for receiving multiple responses to a multicast request.
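The idea of "arming" a receiver with CSR's that carry preallocated segment areas can be sketched in C as follows. The structure and function names are hypothetical, only the segment-related fields of a CSR are modelled, and treating an undersized segment area as a reason to discard is an assumption made for brevity; the text above only says that data which cannot be delivered immediately may be discarded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct armed_csr {
    unsigned char *segment_area; /* preallocated by the client or server */
    size_t segment_capacity;     /* size of segment_area in octets */
    size_t segment_len;          /* filled in on delivery */
    struct armed_csr *next;      /* list of CSR's still armed */
} armed_csr_t;

/* Arm the receiver with one more CSR, e.g. for an extra multicast response. */
void arm_csr(armed_csr_t **pool, armed_csr_t *csr)
{
    assert(pool != NULL && csr != NULL);
    csr->next = *pool;
    *pool = csr;
}

/*
 * Deliver segment data into the first armed CSR. Returns the CSR used, or
 * NULL if the data had to be discarded because nothing suitable was armed.
 */
armed_csr_t *deliver_segment(armed_csr_t **pool, const unsigned char *data,
                             size_t len)
{
    armed_csr_t *csr = *pool;
    if (csr == NULL || csr->segment_capacity < len)
        return NULL; /* no immediate delivery possible: discard */
    *pool = csr->next;
    csr->next = NULL;
    memcpy(csr->segment_area, data, len);
    csr->segment_len = len;
    return csr;
}
```

A client expecting several responses to a multicast request would call arm_csr once per additional response it is prepared to accept.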
The CSR data structure is intended to be the interface data structure for an intelligent network interface. For reception, the interface is "armed" with CSR's that may point to segment areas in main memory, into which it can deliver a packet group. Ideally, the interface handles all the processing of all packets, interacting with the host after receiving a complete Request or Response packet group. An implementation should use an interface based on SendPacketGroup(csr) and ReceivePacketGroup(csr) to facilitate the introduction of an intelligent network interface.

ReceivePacketGroup(csr) provides the interface with a CSR descriptor and zero or more bytes of main memory to receive segment data. The CSR describes whether it is to receive responses (and if so, for which client) or requests (and if so, for which server).

The procedure ReclaimCSR(csr) reclaims the specified record from the interface before the interface would otherwise return it on receiving the specified packet group.

A finishup procedure is set in the CSR to be invoked when the CSR is returned to the host by the normal processing sequence in the interface. Similarly, the timeout parameter is set to indicate the maximum time the host is providing for the routine to perform the specified function. The CSR and associated segment memory are returned to the host after the timeout period, with an indication of the progress made. They are not returned earlier.

VII.9. Streaming

The implementation of streaming is optional in both VMTP clients and servers. Ideally, all performance-critical servers should implement streaming. In addition, clients that have high context switch overhead, network access overhead or expect to be communicating over long delay links should also implement streaming.

A client stream is implemented by allocating a CSR for each outstanding message transaction.
A stream of transactions is handled similarly to multiple outstanding transactions from separate clients except for the interaction between consecutively numbered transactions in a stream.

For the server VMTP module, streamed message transactions to a server are queued (if accepted) subordinate to the first unprocessed CSR corresponding to this Client. Thus, streamed transactions from a given Client are always performed in the order specified by the transaction identifiers.
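The server-side ordering rule can be sketched in C as a per-Client chain kept in transaction order. The names are hypothetical and only the transaction identifier of each CSR is modelled. Comparing 32-bit identifiers with wraparound (serial-number style) is an assumption; the text above only requires that transactions be performed in the order given by their identifiers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct stream_csr {
    uint32_t transaction;    /* transaction identifier within the stream */
    struct stream_csr *next; /* next queued transaction from the same Client */
} stream_csr_t;

/* Assumption: identifiers are compared modulo 2^32, so the counter may wrap. */
static int txn_before(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) < 0;
}

/* Queue a streamed transaction behind earlier ones from the same Client. */
void stream_enqueue(stream_csr_t **head, stream_csr_t *csr)
{
    assert(head != NULL && csr != NULL);
    while (*head != NULL && !txn_before(csr->transaction, (*head)->transaction))
        head = &(*head)->next;
    csr->next = *head;
    *head = csr;
}

/* Remove and return the next transaction to perform, or NULL if none. */
stream_csr_t *stream_next(stream_csr_t **head)
{
    stream_csr_t *csr = *head;
    if (csr != NULL) {
        *head = csr->next;
        csr->next = NULL;
    }
    return csr;
}
```

With this arrangement, Requests that arrive out of order (for example 7, 5, 6) are still handed to the server as 5, 6, 7.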
If a server does not implement streaming, it must refuse streamed message transactions using the NotifyVmtpClient operation. Also, all client VMTP's that support streaming must support the streamed interface to a server that does not support streaming. That is, they must perform the message transactions one at a time. Consequently, a program that uses the streaming interface to a non-streaming server experiences degraded performance, but not failure.

VII.10. Implementation Experience

The implementation experience to date includes a partial implementation (minus the streaming and full security) in the V kernel plus a similar preliminary implementation in the 4.3 BSD UNIX kernel. In the V kernel implementation, the CSR's are part of the (lightweight) process descriptor.

The V kernel implementation is able to perform a VMTP message transaction with no data segment between two Sun-3/75's connected by 10 Mb Ethernet in 2.25 milliseconds. It is also able to transfer data at 4.7 megabits per second using 16 kilobyte Requests (but with null checksums). The UNIX kernel implementation running on Microvax II's achieves a basic message transaction time of 9 milliseconds and a data rate of 1.9 megabits per second using 16 kilobyte Responses. This implementation uses the standard VMTP checksum.

We hope to report more extensive implementation experience in future revisions of this document.
VIII. UNIX 4.3 BSD Kernel Interface for VMTP

UNIX 4.3 BSD includes a socket-based design for program interfaces to a variety of protocol families and types of protocols (streams, datagrams). In this appendix, we sketch an extension to this design to support a transaction-style protocol. (Some familiarity with UNIX 4.2/3 IPC is assumed.) Several extensions are required to the system interface, rather than just adding a protocol, because no provision was made for supporting transaction protocols in the original design. These extensions include a new "transaction" type of socket plus new system calls invoke, getreply, probeentity, recvreq, sendreply and forward.

A socket of type transaction bound to the VMTP protocol type IPPROTO_VMTP is created by the call

      s = socket(AF_INET, SOCK_TRANSACT, IPPROTO_VMTP);

This socket is bound to an entity identifier by

      bind(s, &entityid, sizeof(entityid));

The first address/port bound to a socket is considered its primary name and is the one used on packet transmission.

A message transaction is invoked between the socket named by s and the Server specified by mcb by

      invoke(s, mcb, segptr, seglen, timeout );

The mcb is a message control block whose format was described in Section 2.4. The message control block specifies the request to send plus the destination Server. The response message control block returned by the server is stored in mcb when invoke returns. The invoking process is blocked until a response is received or the message transaction times out, unless the request is a datagram request. (Non-blocking versions with signals on completion could also be provided, especially with a streaming implementation.)

For multicast message transactions (sent to an entity group), the next response to the current message transaction (if it arrives in less than timeout milliseconds) is returned by

      getreply( s, mcb, segptr, maxseglen, timeout );

The invoke operation sent to an entity group completes as soon as the first response is received.
A request is retransmitted until the first reply is received (assuming the request is not a datagram). Thus, the system does not retransmit while getreply is timing out even if no replies are available.
The state of an entity associated with entityId is probed using

      probeentity( entityId, state );

A UNIX process acting as a VMTP server accepts a Request by the operation

      recvreq(s, mcb, segptr, maxseglen );

The request message for the next queued transaction request is returned in mcb, plus the segment data of maximum length maxseglen, starting at segptr in the address space. On return, the message control block contains the values as set in invoke except:

(1) the Client field indicates the Client that sent the received Request message.

(2) the Code field indicates the type of request.

(3) the MsgDelivery field indicates the portions of the segment actually received within the specified segment size, if MDM is 1 in the Code field. A segment block is marked as missing (i.e. the corresponding bit in the MsgDelivery field is 0) unless it is received in its entirety or it holds all of the data in the last segment block contained in the segment.

To complete a transaction, the reply specified by mcb is sent to the client specified by the MCB using

      sendreply(s, mcb, segptr );

The Client field of the MCB indicates the client to respond to.

Finally, a message transaction specified by mcb is forwarded to newserver as though it were sent there by its original invoker using

      forward(s, mcb, segptr, timeout );
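The MsgDelivery rule described for recvreq can be illustrated with a small C function. This is a sketch under stated assumptions, not kernel code: the function name is hypothetical, the 512 octet segment block size is inferred from a 16 kilobyte maximum segment covered by a 32-bit mask, and numbering blocks from the least significant bit is an assumption about bit order.

```c
#include <assert.h>
#include <stdint.h>

#define SEGMENT_BLOCK_OCTETS 512u /* assumption: 16 KB segment / 32 mask bits */
#define MAX_SEGMENT_BLOCKS   32u

/*
 * received[i] is the number of octets of segment block i that arrived.
 * Bit i of the result (LSB = block 0, an assumption) is set only if block i
 * arrived in its entirety; the final block may be shorter than 512 octets,
 * in which case "entire" means all of the data it contains.
 */
uint32_t msg_delivery_mask(uint32_t segment_size,
                           const uint16_t received[MAX_SEGMENT_BLOCKS])
{
    uint32_t mask = 0;
    assert(segment_size <= SEGMENT_BLOCK_OCTETS * MAX_SEGMENT_BLOCKS);
    for (uint32_t i = 0; i * SEGMENT_BLOCK_OCTETS < segment_size; i++) {
        uint32_t remaining = segment_size - i * SEGMENT_BLOCK_OCTETS;
        uint32_t expected = remaining < SEGMENT_BLOCK_OCTETS
                                ? remaining : SEGMENT_BLOCK_OCTETS;
        if (received[i] == expected)
            mask |= (uint32_t)1 << i;
    }
    return mask;
}
```

For a 1300 octet segment (blocks of 512, 512 and 276 octets) in which only 100 octets of the second block arrived, the function marks blocks 0 and 2 as delivered and block 1 as missing.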