4. Ancillary Data 4.2BSD allowed file descriptors to be transferred between separate processes across a UNIX domain socket using the sendmsg() and recvmsg() functions. Two members of the msghdr structure, msg_accrights and msg_accrightslen, were used to send and receive the descriptors. When the OSI protocols were added to 4.3BSD Reno in 1990 the names of these two fields in the msghdr structure were changed to msg_control and msg_controllen, because they were used by the OSI protocols for "control information", although the comments in the source code call this "ancillary data". Other than the OSI protocols, the use of ancillary data has been rare. In 4.4BSD, for example, the only use of ancillary data with IPv4 is to return the destination address of a received UDP datagram if the IP_RECVDSTADDR socket option is set. With Unix domain sockets ancillary data is still used to send and receive descriptors. Nevertheless the ancillary data fields of the msghdr structure provide a clean way to pass information in addition to the data that is being read or written. The inclusion of the msg_control and msg_controllen members of the msghdr structure along with the cmsghdr structure that is pointed to by the msg_control member is required by the Posix.1g sockets API standard (which should be completed during 1997). In this document ancillary data is used to exchange the following optional information between the application and the kernel: 1. the send/receive interface and source/destination address, 2. the hop limit, 3. next hop address, 4. Hop-by-Hop options, 5. Destination options, and 6. Routing header. Before describing these uses in detail, we review the definition of the msghdr structure itself, the cmsghdr structure that defines an ancillary data object, and some functions that operate on the ancillary data objects.
4.1. The msghdr Structure The msghdr structure is used by the recvmsg() and sendmsg() functions. Its Posix.1g definition is: struct msghdr { void *msg_name; /* ptr to socket address structure */ socklen_t msg_namelen; /* size of socket address structure */ struct iovec *msg_iov; /* scatter/gather array */ size_t msg_iovlen; /* # elements in msg_iov */ void *msg_control; /* ancillary data */ socklen_t msg_controllen; /* ancillary data buffer length */ int msg_flags; /* flags on received message */ }; The structure is declared as a result of including <sys/socket.h>. (Note: Before Posix.1g the two "void *" pointers were typically "char *", and the two socklen_t members and the size_t member were typically integers. Earlier drafts of Posix.1g had the two socklen_t members as size_t, but Draft 6.6 of Posix.1g, apparently the final draft, changed these to socklen_t to simplify binary portability for 64-bit implementations and to align Posix.1g with X/Open's Networking Services, Issue 5. The change in msg_control to a "void *" pointer affects any code that increments this pointer.) Most Berkeley-derived implementations limit the amount of ancillary data in a call to sendmsg() to no more than 108 bytes (an mbuf). This API requires a minimum of 10240 bytes of ancillary data, but it is recommended that the amount be limited only by the buffer space reserved by the socket (which can be modified by the SO_SNDBUF socket option). (Note: This magic number 10240 was picked as a value that should always be large enough. 108 bytes is clearly too small as the maximum size of a Type 0 Routing header is 376 bytes.) 4.2. The cmsghdr Structure The cmsghdr structure describes ancillary data objects transferred by recvmsg() and sendmsg(). 
Its Posix.1g definition is:

    struct cmsghdr {
      socklen_t  cmsg_len;   /* #bytes, including this header */
      int        cmsg_level; /* originating protocol */
      int        cmsg_type;  /* protocol-specific type */
                 /* followed by unsigned char cmsg_data[]; */
    };

This structure is declared as a result of including <sys/socket.h>.

As shown in this definition, normally there is no member with the name cmsg_data[]. Instead, the data portion is accessed using the CMSG_xxx() macros, as described shortly. Nevertheless, it is common to refer to the cmsg_data[] member.

(Note: Before Posix.1g the cmsg_len member was an integer, and not a socklen_t. See the Note in the previous section for why socklen_t is used here.)

When ancillary data is sent or received, any number of ancillary data objects can be specified by the msg_control and msg_controllen members of the msghdr structure, because each object is preceded by a cmsghdr structure defining the object's length (the cmsg_len member). Historically Berkeley-derived implementations have passed only one object at a time, but this API allows multiple objects to be passed in a single call to sendmsg() or recvmsg(). The following example shows two ancillary data objects in a control buffer.

  |<--------------------------- msg_controllen -------------------------->|
  |                                                                       |
  |<----- ancillary data object ----->|<----- ancillary data object ----->|
  |<---------- CMSG_SPACE() --------->|<---------- CMSG_SPACE() --------->|
  |                                   |                                   |
  |<---------- cmsg_len ---------->|  |<--------- cmsg_len ----------->|  |
  |<--------- CMSG_LEN() --------->|  |<-------- CMSG_LEN() ---------->|  |
  |                                |  |                                |  |
  +-----+-----+-----+--+-----------+--+-----+-----+-----+--+-----------+--+
  |cmsg_|cmsg_|cmsg_|XX|           |XX|cmsg_|cmsg_|cmsg_|XX|           |XX|
  |len  |level|type |XX|cmsg_data[]|XX|len  |level|type |XX|cmsg_data[]|XX|
  +-----+-----+-----+--+-----------+--+-----+-----+-----+--+-----------+--+
   ^
   |
   msg_control
   points here

The fields shown as "XX" are possible padding, between the cmsghdr structure and the data, and between the data and the next cmsghdr structure, if required by the implementation.

4.3. Ancillary Data Object Macros

To aid in the manipulation of ancillary data objects, three macros from 4.4BSD are defined by Posix.1g: CMSG_DATA(), CMSG_NXTHDR(), and CMSG_FIRSTHDR().
Before describing these macros, we show the following example of how they might be used with a call to recvmsg().

    struct msghdr   msg;
    struct cmsghdr  *cmsgptr;

    /* fill in msg */

    /* call recvmsg() */

    for (cmsgptr = CMSG_FIRSTHDR(&msg); cmsgptr != NULL;
         cmsgptr = CMSG_NXTHDR(&msg, cmsgptr)) {
        if (cmsgptr->cmsg_level == ... && cmsgptr->cmsg_type == ... ) {
            u_char  *ptr;

            ptr = CMSG_DATA(cmsgptr);
            /* process data pointed to by ptr */
        }
    }

We now describe the three Posix.1g macros, followed by two more that are new with this API: CMSG_SPACE() and CMSG_LEN(). All these macros are defined as a result of including <sys/socket.h>.

4.3.1. CMSG_FIRSTHDR

    struct cmsghdr *CMSG_FIRSTHDR(const struct msghdr *mhdr);

CMSG_FIRSTHDR() returns a pointer to the first cmsghdr structure in the msghdr structure pointed to by mhdr. The macro returns NULL if there is no ancillary data pointed to by the msghdr structure (that is, if either msg_control is NULL or if msg_controllen is less than the size of a cmsghdr structure).

One possible implementation could be:

    #define CMSG_FIRSTHDR(mhdr) \
        ( (mhdr)->msg_controllen >= sizeof(struct cmsghdr) ? \
          (struct cmsghdr *)(mhdr)->msg_control : \
          (struct cmsghdr *)NULL )

(Note: Most existing implementations do not test the value of msg_controllen, and just return the value of msg_control. The value of msg_controllen must be tested, because if the application asks recvmsg() to return ancillary data, by setting msg_control to point to the application's buffer and setting msg_controllen to the length of this buffer, the kernel indicates that no ancillary data is available by setting msg_controllen to 0 on return. It is also easier to put this test into this macro than to make the application perform the test.)

4.3.2. CMSG_NXTHDR

    struct cmsghdr *CMSG_NXTHDR(const struct msghdr *mhdr,
                                const struct cmsghdr *cmsg);

CMSG_NXTHDR() returns a pointer to the cmsghdr structure describing the next ancillary data object. mhdr is a pointer to a msghdr structure and cmsg is a pointer to a cmsghdr structure. If there is not another ancillary data object, the return value is NULL.

The following behavior of this macro is new to this API: if the value of the cmsg pointer is NULL, a pointer to the cmsghdr structure describing the first ancillary data object is returned. That is, CMSG_NXTHDR(mhdr, NULL) is equivalent to CMSG_FIRSTHDR(mhdr). If there are no ancillary data objects, the return value is NULL. This provides an alternative way of coding the processing loop shown earlier:

    struct msghdr  msg;
    struct cmsghdr  *cmsgptr = NULL;

    /* fill in msg */

    /* call recvmsg() */

    while ((cmsgptr = CMSG_NXTHDR(&msg, cmsgptr)) != NULL) {
        if (cmsgptr->cmsg_level == ... && cmsgptr->cmsg_type == ... ) {
            u_char  *ptr;

            ptr = CMSG_DATA(cmsgptr);
            /* process data pointed to by ptr */
        }
    }

One possible implementation could be:

    #define CMSG_NXTHDR(mhdr, cmsg) \
        ( ((cmsg) == NULL) ? CMSG_FIRSTHDR(mhdr) : \
          (((u_char *)(cmsg) + ALIGN((cmsg)->cmsg_len) \
                             + ALIGN(sizeof(struct cmsghdr)) > \
            (u_char *)((mhdr)->msg_control) + (mhdr)->msg_controllen) ? \
           (struct cmsghdr *)NULL : \
           (struct cmsghdr *)((u_char *)(cmsg) + ALIGN((cmsg)->cmsg_len))) )

The macro ALIGN(), which is implementation dependent, rounds its argument up to the next multiple of whatever alignment is required (probably a multiple of 4 or 8 bytes).

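The processing loop can be exercised end to end with the descriptor-passing use of ancillary data mentioned in Section 4. The following is a minimal sketch, not part of this API: the function name pass_fd_roundtrip() is hypothetical, and the sketch assumes a system that provides socketpair() and the SCM_RIGHTS ancillary data type at level SOL_SOCKET. The control buffer is zeroed before use, since some CMSG_NXTHDR() implementations inspect the bytes that follow the current object.

```c
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hypothetical helper: send descriptor fd across a UNIX domain socket
 * pair as ancillary data and return the descriptor that arrives on the
 * other end, or -1 on failure. */
int pass_fd_roundtrip(int fd)
{
    int sv[2];
    int received = -1;
    char byte = 'x';                      /* one byte of ordinary data */
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union {                               /* union aligns buf for a cmsghdr */
        char buf[CMSG_SPACE(sizeof(int))];
        struct cmsghdr align;
    } ctl;
    struct msghdr msg;
    struct cmsghdr *cmsgptr;

    if (socketpair(AF_UNIX, SOCK_DGRAM, 0, sv) < 0)
        return -1;

    /* Sender: one ancillary data object carrying one descriptor. */
    memset(&msg, 0, sizeof(msg));
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);
    cmsgptr = CMSG_FIRSTHDR(&msg);
    cmsgptr->cmsg_level = SOL_SOCKET;
    cmsgptr->cmsg_type = SCM_RIGHTS;
    cmsgptr->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsgptr), &fd, sizeof(int));
    if (sendmsg(sv[0], &msg, 0) < 0)
        goto out;

    /* Receiver: reset the buffer, then walk whatever came back. */
    memset(&ctl, 0, sizeof(ctl));
    msg.msg_control = ctl.buf;
    msg.msg_controllen = sizeof(ctl.buf);
    if (recvmsg(sv[1], &msg, 0) < 0)
        goto out;
    for (cmsgptr = CMSG_FIRSTHDR(&msg); cmsgptr != NULL;
         cmsgptr = CMSG_NXTHDR(&msg, cmsgptr)) {
        if (cmsgptr->cmsg_level == SOL_SOCKET &&
            cmsgptr->cmsg_type == SCM_RIGHTS)
            memcpy(&received, CMSG_DATA(cmsgptr), sizeof(int));
    }
out:
    close(sv[0]);
    close(sv[1]);
    return received;
}
```

The returned descriptor is a new descriptor number that refers to the same open file as fd, so the caller is responsible for closing both.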
4.3.3. CMSG_DATA

    unsigned char *CMSG_DATA(const struct cmsghdr *cmsg);

CMSG_DATA() returns a pointer to the data (what is called the cmsg_data[] member, even though such a member is not defined in the structure) following a cmsghdr structure.

One possible implementation could be:

    #define CMSG_DATA(cmsg) ( (u_char *)(cmsg) + \
                              ALIGN(sizeof(struct cmsghdr)) )

4.3.4. CMSG_SPACE

    unsigned int CMSG_SPACE(unsigned int length);

This macro is new with this API. Given the length of an ancillary data object, CMSG_SPACE() returns the space required by the object and its cmsghdr structure, including any padding needed to satisfy alignment requirements. This macro can be used, for example, to allocate space dynamically for the ancillary data. This macro should not be used to initialize the cmsg_len member of a cmsghdr structure; instead use the CMSG_LEN() macro.

One possible implementation could be:

    #define CMSG_SPACE(length) ( ALIGN(sizeof(struct cmsghdr)) + \
                                 ALIGN(length) )

4.3.5. CMSG_LEN

    unsigned int CMSG_LEN(unsigned int length);

This macro is new with this API. Given the length of an ancillary data object, CMSG_LEN() returns the value to store in the cmsg_len member of the cmsghdr structure, taking into account any padding needed to satisfy alignment requirements.

One possible implementation could be:

    #define CMSG_LEN(length) ( ALIGN(sizeof(struct cmsghdr)) + length )

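As a rough illustration of how the two new macros divide the work, the sketch below sizes a control buffer with CMSG_SPACE() and measures the trailing padding as the gap between CMSG_SPACE() and CMSG_LEN(). The helper names control_buffer_size() and trailing_padding() are hypothetical and are not defined by this API; the sketch assumes only that <sys/socket.h> supplies both macros.

```c
#include <stddef.h>
#include <sys/socket.h>

/* Hypothetical helper: total control buffer size needed for n ancillary
 * data objects whose payload lengths are given in lens[].  Each object
 * consumes CMSG_SPACE() bytes, which includes any trailing padding. */
size_t control_buffer_size(const size_t *lens, size_t n)
{
    size_t total = 0;

    for (size_t i = 0; i < n; i++)
        total += CMSG_SPACE(lens[i]);
    return total;
}

/* Hypothetical helper: padding that follows an object with the given
 * payload length, i.e. the bytes counted by CMSG_SPACE() but not
 * recorded in cmsg_len. */
size_t trailing_padding(size_t len)
{
    return CMSG_SPACE(len) - CMSG_LEN(len);
}
```

A caller would pass the result of control_buffer_size() as msg_controllen and store CMSG_LEN() of each payload length in the corresponding cmsg_len member.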
Note the difference between CMSG_SPACE() and CMSG_LEN(), shown also in the figure in Section 4.2: the former accounts for any required padding at the end of the ancillary data object and the latter is the actual length to store in the cmsg_len member of the ancillary data object.

4.4. Summary of Options Described Using Ancillary Data

There are six types of optional information described in this document that are passed between the application and the kernel using ancillary data:

    1.  the send/receive interface and source/destination address,
    2.  the hop limit,
    3.  next hop address,
    4.  Hop-by-Hop options,
    5.  Destination options, and
    6.  Routing header.

First, to receive any of this optional information (other than the next hop address, which can only be set), the application must call setsockopt() to turn on the corresponding flag:

    int  on = 1;

    setsockopt(fd, IPPROTO_IPV6, IPV6_PKTINFO,  &on, sizeof(on));
    setsockopt(fd, IPPROTO_IPV6, IPV6_HOPLIMIT, &on, sizeof(on));
    setsockopt(fd, IPPROTO_IPV6, IPV6_HOPOPTS,  &on, sizeof(on));
    setsockopt(fd, IPPROTO_IPV6, IPV6_DSTOPTS,  &on, sizeof(on));
    setsockopt(fd, IPPROTO_IPV6, IPV6_RTHDR,    &on, sizeof(on));

When any of these options are enabled, the corresponding data is returned as control information by recvmsg(), as one or more ancillary data objects.

Nothing special need be done to send any of this optional information; the application just calls sendmsg() and specifies one or more ancillary data objects as control information.

We also summarize the three cmsghdr fields that describe the ancillary data objects:

    cmsg_level    cmsg_type      cmsg_data[]               #times
    ------------  -------------  ------------------------  ------
    IPPROTO_IPV6  IPV6_PKTINFO   in6_pktinfo structure     once
    IPPROTO_IPV6  IPV6_HOPLIMIT  int                       once
    IPPROTO_IPV6  IPV6_NEXTHOP   socket address structure  once
    IPPROTO_IPV6  IPV6_HOPOPTS   implementation dependent  mult.
    IPPROTO_IPV6  IPV6_DSTOPTS   implementation dependent  mult.
    IPPROTO_IPV6  IPV6_RTHDR     implementation dependent  once

The final column indicates how many times an ancillary data object of that type can appear as control information. The Hop-by-Hop and Destination options can appear multiple times, while all the others can appear only one time.

All these options are described in detail in following sections. All the constants beginning with IPV6_ are defined as a result of including the <netinet/in.h> header.

(Note: We intentionally use the same constant for the cmsg_level member as is used as the second argument to getsockopt() and setsockopt() (what is called the "level"), and the same constant for the cmsg_type member as is used as the third argument to getsockopt() and setsockopt() (what is called the "option name"). This is consistent with the existing use of ancillary data in 4.4BSD: returning the destination address of an IPv4 datagram.)

(Note: It is up to the implementation what it passes as ancillary data for the Hop-by-Hop option, Destination option, and Routing header option, since the API to these features is through a set of inet6_option_XXX() and inet6_rthdr_XXX() functions that we define later. These functions serve two purposes: to simplify the interface to these features (instead of requiring the application to know the intimate details of the extension header formats), and to hide the actual implementation from the application. Nevertheless, we show some examples of these features that store the actual extension header as the ancillary data. Implementations need not use this technique.)

4.5. IPV6_PKTOPTIONS Socket Option

The summary in the previous section assumes a UDP socket. Sending and receiving ancillary data is easy with UDP: the application calls sendmsg() and recvmsg() instead of sendto() and recvfrom(). But there might be cases where a TCP application wants to send or receive this optional information.
For example, a TCP client might want to specify a Routing header and this needs to be done before calling connect(). Similarly a TCP server might want to know the received interface after accept() returns along with any Destination options.
A new socket option is defined that provides access to the optional information described in the previous section, but without using recvmsg() and sendmsg(). Setting the socket option specifies any of the optional output fields: setsockopt(fd, IPPROTO_IPV6, IPV6_PKTOPTIONS, &buf, len); The fourth argument points to a buffer containing one or more ancillary data objects, and the fifth argument is the total length of all these objects. The application fills in this buffer exactly as if the buffer were being passed to sendmsg() as control information. The options set by calling setsockopt() for IPV6_PKTOPTIONS are called "sticky" options because once set they apply to all packets sent on that socket. The application can call setsockopt() again to change all the sticky options, or it can call setsockopt() with a length of 0 to remove all the sticky options for the socket. The corresponding receive option getsockopt(fd, IPPROTO_IPV6, IPV6_PKTOPTIONS, &buf, &len); returns a buffer with one or more ancillary data objects for all the optional receive information that the application has previously specified that it wants to receive. The fourth argument points to the buffer that is filled in by the call. The fifth argument is a pointer to a value-result integer: when the function is called the integer specifies the size of the buffer pointed to by the fourth argument, and on return this integer contains the actual number of bytes that were returned. The application processes this buffer exactly as if the buffer were returned by recvmsg() as control information. To simplify this document, in the remaining sections when we say "can be specified as ancillary data to sendmsg()" we mean "can be specified as ancillary data to sendmsg() or specified as a sticky option using setsockopt() and the IPV6_PKTOPTIONS socket option". 
Similarly when we say "can be returned as ancillary data by recvmsg()" we mean "can be returned as ancillary data by recvmsg() or returned by getsockopt() with the IPV6_PKTOPTIONS socket option". 4.5.1. TCP Sticky Options When using getsockopt() with the IPV6_PKTOPTIONS option and a TCP socket, only the options from the most recently received segment are retained and returned to the caller, and only after the socket option has been set. That is, TCP need not start saving a copy of the options until the application says to do so.
The application is not allowed to specify ancillary data in a call to sendmsg() on a TCP socket, and none of the ancillary data that we describe in this document is ever returned as control information by recvmsg() on a TCP socket.

4.5.2. UDP and Raw Socket Sticky Options

The IPV6_PKTOPTIONS socket option can also be used with a UDP socket or with a raw IPv6 socket, normally to set some of the options once, instead of with each call to sendmsg().

Unlike the TCP case, the sticky options can be overridden on a per-packet basis with ancillary data specified in a call to sendmsg() on a UDP or raw IPv6 socket. If any ancillary data is specified in a call to sendmsg(), none of the sticky options are sent with that datagram.

5. Packet Information

There are four pieces of information that an application can specify for an outgoing packet using ancillary data:

    1.  the source IPv6 address,
    2.  the outgoing interface index,
    3.  the outgoing hop limit, and
    4.  the next hop address.

Three similar pieces of information can be returned for a received packet as ancillary data:

    1.  the destination IPv6 address,
    2.  the arriving interface index, and
    3.  the arriving hop limit.

The first two pieces of information are contained in an in6_pktinfo structure that is sent as ancillary data with sendmsg() and received as ancillary data with recvmsg(). This structure is defined as a result of including the <netinet/in.h> header.

    struct in6_pktinfo {
      struct in6_addr ipi6_addr;    /* src/dst IPv6 address */
      unsigned int    ipi6_ifindex; /* send/recv interface index */
    };

In the cmsghdr structure containing this ancillary data, the cmsg_level member will be IPPROTO_IPV6, the cmsg_type member will be IPV6_PKTINFO, and the first byte of cmsg_data[] will be the first byte of the in6_pktinfo structure.

This information is returned as ancillary data by recvmsg() only if the application has enabled the IPV6_PKTINFO socket option: int on = 1; setsockopt(fd, IPPROTO_IPV6, IPV6_PKTINFO, &on, sizeof(on)); Nothing special need be done to send this information: just specify the control information as ancillary data for sendmsg(). (Note: The hop limit is not contained in the in6_pktinfo structure for the following reason. Some UDP servers want to respond to client requests by sending their reply out the same interface on which the request was received and with the source IPv6 address of the reply equal to the destination IPv6 address of the request. To do this the application can enable just the IPV6_PKTINFO socket option and then use the received control information from recvmsg() as the outgoing control information for sendmsg(). The application need not examine or modify the in6_pktinfo structure at all. But if the hop limit were contained in this structure, the application would have to parse the received control information and change the hop limit member, since the received hop limit is not the desired value for an outgoing packet.) 5.1. Specifying/Receiving the Interface Interfaces on an IPv6 node are identified by a small positive integer, as described in Section 4 of [RFC-2133]. That document also describes a function to map an interface name to its interface index, a function to map an interface index to its interface name, and a function to return all the interface names and indexes. Notice from this document that no interface is ever assigned an index of 0. When specifying the outgoing interface, if the ipi6_ifindex value is 0, the kernel will choose the outgoing interface. If the application specifies an outgoing interface for a multicast packet, the interface specified by the ancillary data overrides any interface specified by the IPV6_MULTICAST_IF socket option (described in [RFC-2133]), for that call to sendmsg() only. 
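The following sketch shows one way an application might build the IPV6_PKTINFO ancillary data object that selects an outgoing interface while leaving the source address unspecified. It is an illustration under stated assumptions, not a definitive implementation: build_pktinfo() is a hypothetical helper, and pktinfo_sketch is a local stand-in that mirrors the in6_pktinfo layout shown in Section 5, used here only so the sketch compiles regardless of which feature-test macros a given system requires before <netinet/in.h> exposes the real structure. Real code would use struct in6_pktinfo directly.

```c
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Local stand-in (an assumption of this sketch) with the same members
 * as the in6_pktinfo structure defined in Section 5. */
struct pktinfo_sketch {
    struct in6_addr ipi6_addr;    /* src/dst IPv6 address */
    unsigned int    ipi6_ifindex; /* send/recv interface index */
};

/* Hypothetical helper: fill buf (suitably aligned for a cmsghdr) with
 * one IPV6_PKTINFO object naming the outgoing interface.  Returns the
 * value to use as msg_controllen, or 0 if buf is too small. */
size_t build_pktinfo(void *buf, size_t buflen, unsigned int ifindex)
{
    struct pktinfo_sketch info;
    struct msghdr msg;
    struct cmsghdr *cmsg;

    if (buflen < CMSG_SPACE(sizeof(info)))
        return 0;

    memset(&info, 0, sizeof(info));
    info.ipi6_addr = in6addr_any;   /* unspecified: source not forced */
    info.ipi6_ifindex = ifindex;    /* 0 would let the kernel choose  */

    memset(buf, 0, CMSG_SPACE(sizeof(info)));
    memset(&msg, 0, sizeof(msg));
    msg.msg_control = buf;
    msg.msg_controllen = CMSG_SPACE(sizeof(info));

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = IPPROTO_IPV6;
    cmsg->cmsg_type = IPV6_PKTINFO;
    cmsg->cmsg_len = CMSG_LEN(sizeof(info));
    memcpy(CMSG_DATA(cmsg), &info, sizeof(info));

    return msg.msg_controllen;
}
```

The caller would then point msg_control at buf, set msg_controllen to the returned value, and call sendmsg().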
When the IPV6_PKTINFO socket option is enabled, the received interface index is always returned as the ipi6_ifindex member of the in6_pktinfo structure. 5.2. Specifying/Receiving Source/Destination Address The source IPv6 address can be specified by calling bind() before each output operation, but supplying the source address together with the data requires less overhead (i.e., fewer system calls) and
requires less state to be stored and protected in a multithreaded application. When specifying the source IPv6 address as ancillary data, if the ipi6_addr member of the in6_pktinfo structure is the unspecified address (IN6ADDR_ANY_INIT), then (a) if an address is currently bound to the socket, it is used as the source address, or (b) if no address is currently bound to the socket, the kernel will choose the source address. If the ipi6_addr member is not the unspecified address, but the socket has already bound a source address, then the ipi6_addr value overrides the already-bound source address for this output operation only. The kernel must verify that the requested source address is indeed a unicast address assigned to the node. When the in6_pktinfo structure is returned as ancillary data by recvmsg(), the ipi6_addr member contains the destination IPv6 address from the received packet. 5.3. Specifying/Receiving the Hop Limit The outgoing hop limit is normally specified with either the IPV6_UNICAST_HOPS socket option or the IPV6_MULTICAST_HOPS socket option, both of which are described in [RFC-2133]. Specifying the hop limit as ancillary data lets the application override either the kernel's default or a previously specified value, for either a unicast destination or a multicast destination, for a single output operation. Returning the received hop limit is useful for programs such as Traceroute and for IPv6 applications that need to verify that the received hop limit is 255 (e.g., that the packet has not been forwarded). The received hop limit is returned as ancillary data by recvmsg() only if the application has enabled the IPV6_HOPLIMIT socket option: int on = 1; setsockopt(fd, IPPROTO_IPV6, IPV6_HOPLIMIT, &on, sizeof(on)); In the cmsghdr structure containing this ancillary data, the cmsg_level member will be IPPROTO_IPV6, the cmsg_type member will be IPV6_HOPLIMIT, and the first byte of cmsg_data[] will be the first byte of the integer hop limit. 
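A hedged sketch of both directions follows: one hypothetical helper, build_hoplimit(), constructs the IPV6_HOPLIMIT object for sendmsg(), and another, find_hoplimit(), pulls the received value out of the control information returned by recvmsg(). Neither name is defined by this API. The range check in build_hoplimit() is an application-side convenience that follows the [RFC-2133] interpretation of the hop limit value (-1 selects the kernel default, 0 through 255 are used as given, and anything else is an EINVAL error); the kernel performs its own check in any case.

```c
#include <errno.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical helper: fill buf (suitably aligned for a cmsghdr) with
 * one IPV6_HOPLIMIT object.  Returns the value to use as
 * msg_controllen, or -1 with errno set. */
int build_hoplimit(void *buf, size_t buflen, int hoplimit)
{
    struct msghdr msg;
    struct cmsghdr *cmsg;

    if (hoplimit < -1 || hoplimit > 255) {
        errno = EINVAL;             /* outside the range the API accepts */
        return -1;
    }
    if (buflen < CMSG_SPACE(sizeof(int))) {
        errno = ENOBUFS;
        return -1;
    }

    memset(buf, 0, CMSG_SPACE(sizeof(int)));
    memset(&msg, 0, sizeof(msg));
    msg.msg_control = buf;
    msg.msg_controllen = CMSG_SPACE(sizeof(int));

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = IPPROTO_IPV6;
    cmsg->cmsg_type = IPV6_HOPLIMIT;
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &hoplimit, sizeof(int));

    return (int)msg.msg_controllen;
}

/* Hypothetical helper: scan control information for an IPV6_HOPLIMIT
 * object.  Returns the hop limit, or -1 if none is present. */
int find_hoplimit(struct msghdr *msg)
{
    struct cmsghdr *cmsg;
    int hoplimit;

    for (cmsg = CMSG_FIRSTHDR(msg); cmsg != NULL;
         cmsg = CMSG_NXTHDR(msg, cmsg)) {
        if (cmsg->cmsg_level == IPPROTO_IPV6 &&
            cmsg->cmsg_type == IPV6_HOPLIMIT &&
            cmsg->cmsg_len >= CMSG_LEN(sizeof(int))) {
            memcpy(&hoplimit, CMSG_DATA(cmsg), sizeof(int));
            return hoplimit;
        }
    }
    return -1;
}
```

memcpy() is used instead of a pointer cast so the sketch does not depend on the alignment of cmsg_data[].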
Nothing special need be done to specify the outgoing hop limit: just specify the control information as ancillary data for sendmsg(). As specified in [RFC-2133], the interpretation of the integer hop limit value is

    x < -1:        return an error of EINVAL
    x == -1:       use kernel default
    0 <= x <= 255: use x
    x >= 256:      return an error of EINVAL

5.4. Specifying the Next Hop Address

The IPV6_NEXTHOP ancillary data object specifies the next hop for the datagram as a socket address structure. In the cmsghdr structure containing this ancillary data, the cmsg_level member will be IPPROTO_IPV6, the cmsg_type member will be IPV6_NEXTHOP, and the first byte of cmsg_data[] will be the first byte of the socket address structure.

This is a privileged option. (Note: It is implementation defined and beyond the scope of this document to define what "privileged" means. Unix systems use this term to mean the process must have an effective user ID of 0.)

If the socket address structure contains an IPv6 address (i.e., the sin6_family member is AF_INET6), then the node identified by that address must be a neighbor of the sending host. If that address equals the destination IPv6 address of the datagram, then this is equivalent to the existing SO_DONTROUTE socket option.

5.5. Additional Errors with sendmsg()

With the IPV6_PKTINFO socket option there are no additional errors possible with the call to recvmsg(). But when specifying the outgoing interface or the source address, additional errors are possible from sendmsg(). The following are examples, but some of these may not be provided by some implementations, and some implementations may define additional errors:

    ENXIO          The interface specified by ipi6_ifindex does not exist.

    ENETDOWN       The interface specified by ipi6_ifindex is not enabled for IPv6 use.

    EADDRNOTAVAIL  ipi6_ifindex specifies an interface but the address ipi6_addr is not available for use on that interface.

    EHOSTUNREACH   No route to the destination exists over the interface specified by ipi6_ifindex.

6. Hop-By-Hop Options

A variable number of Hop-by-Hop options can appear in a single Hop-by-Hop options header. Each option in the header is TLV-encoded with a type, length, and value.

Today only three Hop-by-Hop options are defined for IPv6 [RFC-1883]: Jumbo Payload, Pad1, and PadN, although a proposal exists for a router-alert Hop-by-Hop option. The Jumbo Payload option should not be passed back to an application and an application should receive an error if it attempts to set it. This option is processed entirely by the kernel. It is indirectly specified by datagram-based applications as the size of the datagram to send and indirectly passed back to these applications as the length of the received datagram. The two pad options are for alignment purposes and are automatically inserted by a sending kernel when needed and ignored by the receiving kernel. This section of the API is therefore defined for future Hop-by-Hop options that an application may need to specify and receive.

Individual Hop-by-Hop options (and Destination options, which are described shortly, and which are similar to the Hop-by-Hop options) may have specific alignment requirements. For example, the 4-byte Jumbo Payload length should appear on a 4-byte boundary, and IPv6 addresses are normally aligned on an 8-byte boundary. These requirements and the terminology used with these options are discussed in Section 4.2 and Appendix A of [RFC-1883]. The alignment of each option is specified by two values, called x and y, written as "xn + y". This states that the option must appear at an integer multiple of x bytes from the beginning of the options header (x can have the values 1, 2, 4, or 8), plus y bytes (y can have a value between 0 and 7, inclusive). The Pad1 and PadN options are inserted as needed to maintain the required alignment. Whatever code builds either a Hop-by-Hop options header or a Destination options header must know the values of x and y for each option.

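The "xn + y" rule reduces to a small piece of modular arithmetic. The sketch below, offered only as an illustration, computes how many pad bytes must precede an option so that its type byte lands on a conforming offset; the function name option_pad_bytes() is hypothetical and is not one of the functions defined by this API. Offsets are counted from the beginning of the options header, so the first option normally starts at offset 2, after the Next Header and Hdr Ext Len bytes.

```c
/* Hypothetical helper: number of pad bytes to insert at byte offset
 * `offset` (measured from the start of the options header) so that an
 * option with alignment requirement "xn + y" starts at a conforming
 * offset.  Returns -1 if x or y is outside the permitted values. */
int option_pad_bytes(unsigned int offset, int x, int y)
{
    unsigned int ux, want, have;

    if (!(x == 1 || x == 2 || x == 4 || x == 8) || y < 0 || y > 7)
        return -1;

    ux = (unsigned int)x;
    want = (unsigned int)y % ux;   /* required remainder modulo x */
    have = offset % ux;            /* current remainder modulo x  */

    return (int)((want + ux - have) % ux);
}
```

For example, an option with a 4n + 2 requirement placed first in the header (offset 2) needs no padding, while an option with an 8n + 0 requirement at the same offset needs 6 pad bytes. A result of 1 calls for a Pad1 option, and a larger result for a single PadN option of that total length.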
Multiple Hop-by-Hop options can be specified by the application. Normally one ancillary data object describes all the Hop-by-Hop options (since each option is itself TLV-encoded) but the application can specify multiple ancillary data objects for the Hop-by-Hop options, each object specifying one or more options. Care must be taken designing the API for these options since

    1.  it may be possible for some future Hop-by-Hop options to be generated by the application and processed entirely by the application (e.g., the kernel may not know the alignment restrictions for the option),

    2.  it must be possible for the kernel to insert its own Hop-by-Hop options in an outgoing packet (e.g., the Jumbo Payload option),

    3.  the application can place one or more Hop-by-Hop options into a single ancillary data object,

    4.  if the application specifies multiple ancillary data objects, each containing one or more Hop-by-Hop options, the kernel must combine these into a single Hop-by-Hop options header, and

    5.  it must be possible for the kernel to remove some Hop-by-Hop options from a received packet before returning the remaining Hop-by-Hop options to the application. (This removal might consist of the kernel converting the option into a pad option of the same length.)

Finally, we note that access to some Hop-by-Hop options or to some Destination options might require special privilege. That is, normal applications (without special privilege) might be forbidden from setting certain options in outgoing packets, and might never see certain options in received packets.

6.1. Receiving Hop-by-Hop Options

To receive Hop-by-Hop options the application must enable the IPV6_HOPOPTS socket option:

    int  on = 1;
    setsockopt(fd, IPPROTO_IPV6, IPV6_HOPOPTS, &on, sizeof(on));

All the Hop-by-Hop options are returned as one ancillary data object described by a cmsghdr structure. The cmsg_level member will be IPPROTO_IPV6 and the cmsg_type member will be IPV6_HOPOPTS. These options are then processed by calling the inet6_option_next() and inet6_option_find() functions, described shortly.

6.2. Sending Hop-by-Hop Options

To send one or more Hop-by-Hop options, the application just specifies them as ancillary data in a call to sendmsg(). No socket option need be set.

Normally all the Hop-by-Hop options are specified by a single ancillary data object. Multiple ancillary data objects, each containing one or more Hop-by-Hop options, can also be specified, in which case the kernel will combine all the Hop-by-Hop options into a single Hop-by-Hop extension header.
But it should be more efficient to use a single ancillary data object to describe all the Hop-by-Hop
options. The cmsg_level member is set to IPPROTO_IPV6 and the cmsg_type member is set to IPV6_HOPOPTS. The option is normally constructed using the inet6_option_init(), inet6_option_append(), and inet6_option_alloc() functions, described shortly. Additional errors may be possible from sendmsg() if the specified option is in error. 6.3. Hop-by-Hop and Destination Options Processing Building and parsing the Hop-by-Hop and Destination options is complicated for the reasons given earlier. We therefore define a set of functions to help the application. The function prototypes for these functions are all in the <netinet/in.h> header. 6.3.1. inet6_option_space int inet6_option_space(int nbytes); This function returns the number of bytes required to hold an option when it is stored as ancillary data, including the cmsghdr structure at the beginning, and any padding at the end (to make its size a multiple of 8 bytes). The argument is the size of the structure defining the option, which must include any pad bytes at the beginning (the value y in the alignment term "xn + y"), the type byte, the length byte, and the option data. (Note: If multiple options are stored in a single ancillary data object, which is the recommended technique, this function overestimates the amount of space required by the size of N-1 cmsghdr structures, where N is the number of options to be stored in the object. This is of little consequence, since it is assumed that most Hop-by-Hop option headers and Destination option headers carry only one option (p. 33 of [RFC-1883]).) 6.3.2. inet6_option_init int inet6_option_init(void *bp, struct cmsghdr **cmsgp, int type); This function is called once per ancillary data object that will contain either Hop-by-Hop or Destination options. It returns 0 on success or -1 on an error. bp is a pointer to previously allocated space that will contain the ancillary data object. 
It must be large enough to contain all the individual options to be added by later calls to inet6_option_append() and inet6_option_alloc().
cmsgp is a pointer to a pointer to a cmsghdr structure. *cmsgp is initialized by this function to point to the cmsghdr structure constructed by this function in the buffer pointed to by bp. type is either IPV6_HOPOPTS or IPV6_DSTOPTS. This type is stored in the cmsg_type member of the cmsghdr structure pointed to by *cmsgp. 6.3.3. inet6_option_append int inet6_option_append(struct cmsghdr *cmsg, const uint8_t *typep, int multx, int plusy); This function appends a Hop-by-Hop option or a Destination option into an ancillary data object that has been initialized by inet6_option_init(). This function returns 0 if it succeeds or -1 on an error. cmsg is a pointer to the cmsghdr structure that must have been initialized by inet6_option_init(). typep is a pointer to the 8-bit option type. It is assumed that this field is immediately followed by the 8-bit option data length field, which is then followed immediately by the option data. The caller initializes these three fields (the type-length-value, or TLV) before calling this function. The option type must have a value from 2 to 255, inclusive. (0 and 1 are reserved for the Pad1 and PadN options, respectively.) The option data length must have a value between 0 and 255, inclusive, and is the length of the option data that follows. multx is the value x in the alignment term "xn + y" described earlier. It must have a value of 1, 2, 4, or 8. plusy is the value y in the alignment term "xn + y" described earlier. It must have a value between 0 and 7, inclusive. 6.3.4. inet6_option_alloc uint8_t *inet6_option_alloc(struct cmsghdr *cmsg, int datalen, int multx, int plusy);
This function appends a Hop-by-Hop option or a Destination option into an ancillary data object that has been initialized by inet6_option_init(). This function returns a pointer to the 8-bit option type field that starts the option on success, or NULL on an error.

The difference between this function and inet6_option_append() is that the latter copies the contents of a previously built option into the ancillary data object, while the current function returns a pointer to the space in the data object where the option's TLV must then be built by the caller.

cmsg is a pointer to the cmsghdr structure that must have been initialized by inet6_option_init().

datalen is the value of the option data length byte for this option. This value is required as an argument to allow the function to determine if padding must be appended at the end of the option. (The inet6_option_append() function does not need a data length argument since the option data length must already be stored by the caller.)

multx is the value x in the alignment term "xn + y" described earlier. It must have a value of 1, 2, 4, or 8.

plusy is the value y in the alignment term "xn + y" described earlier. It must have a value between 0 and 7, inclusive.

6.3.5. inet6_option_next

    int inet6_option_next(const struct cmsghdr *cmsg, uint8_t **tptrp);

This function processes the next Hop-by-Hop option or Destination option in an ancillary data object. If another option remains to be processed, the return value of the function is 0 and *tptrp points to the 8-bit option type field (which is followed by the 8-bit option data length, followed by the option data). If no more options remain to be processed, the return value is -1 and *tptrp is NULL. If an error occurs, the return value is -1 and *tptrp is not NULL.

cmsg is a pointer to a cmsghdr structure whose cmsg_level member equals IPPROTO_IPV6 and whose cmsg_type member equals either IPV6_HOPOPTS or IPV6_DSTOPTS.

tptrp is a pointer to a pointer to an 8-bit byte and *tptrp is used by the function to remember its place in the ancillary data object each time the function is called. The first time this function is called for a given ancillary data object, *tptrp must be set to NULL.
Each time this function returns success, *tptrp points to the 8-bit option type field for the next option to be processed.

6.3.6. inet6_option_find

    int inet6_option_find(const struct cmsghdr *cmsg, uint8_t **tptrp,
                          int type);

This function is similar to the previously described inet6_option_next() function, except this function lets the caller specify the option type to be searched for, instead of always returning the next option in the ancillary data object.

cmsg is a pointer to a cmsghdr structure whose cmsg_level member equals IPPROTO_IPV6 and whose cmsg_type member equals either IPV6_HOPOPTS or IPV6_DSTOPTS.

tptrp is a pointer to a pointer to an 8-bit byte and *tptrp is used by the function to remember its place in the ancillary data object each time the function is called. The first time this function is called for a given ancillary data object, *tptrp must be set to NULL.

This function starts searching for an option of the specified type beginning after the value of *tptrp. If an option of the specified type is located, this function returns 0 and *tptrp points to the 8-bit option type field for the option of the specified type. If an option of the specified type is not located, the return value is -1 and *tptrp is NULL. If an error occurs, the return value is -1 and *tptrp is not NULL.

6.3.7. Options Examples

We now provide an example that builds two Hop-by-Hop options. First we define two options, called X and Y, taken from the example in Appendix A of [RFC-1883]. We assume that all options will have structure definitions similar to what is shown below.

    /* option X and option Y are defined in [RFC-1883], pp. 33-34 */
    #define IP6_X_OPT_TYPE     X   /* replace X with assigned value */
    #define IP6_X_OPT_LEN      12
    #define IP6_X_OPT_MULTX    8   /* 8n + 2 alignment */
    #define IP6_X_OPT_OFFSETY  2

    struct ip6_X_opt {
      uint8_t   ip6_X_opt_pad[IP6_X_OPT_OFFSETY];
      uint8_t   ip6_X_opt_type;
      uint8_t   ip6_X_opt_len;
      uint32_t  ip6_X_opt_val1;
      uint64_t  ip6_X_opt_val2;
    };

    #define IP6_Y_OPT_TYPE     Y   /* replace Y with assigned value */
    #define IP6_Y_OPT_LEN      7
    #define IP6_Y_OPT_MULTX    4   /* 4n + 3 alignment */
    #define IP6_Y_OPT_OFFSETY  3

    struct ip6_Y_opt {
      uint8_t   ip6_Y_opt_pad[IP6_Y_OPT_OFFSETY];
      uint8_t   ip6_Y_opt_type;
      uint8_t   ip6_Y_opt_len;
      uint8_t   ip6_Y_opt_val1;
      uint16_t  ip6_Y_opt_val2;
      uint32_t  ip6_Y_opt_val3;
    };

We now show the code fragment to build one ancillary data object containing both options.

    struct msghdr  msg;
    struct cmsghdr  *cmsgptr;
    struct ip6_X_opt  optX;
    struct ip6_Y_opt  optY;

    msg.msg_control = malloc(inet6_option_space(sizeof(optX) +
                                                sizeof(optY)));

    inet6_option_init(msg.msg_control, &cmsgptr, IPV6_HOPOPTS);

    optX.ip6_X_opt_type = IP6_X_OPT_TYPE;
    optX.ip6_X_opt_len  = IP6_X_OPT_LEN;
    optX.ip6_X_opt_val1 = <32-bit value>;
    optX.ip6_X_opt_val2 = <64-bit value>;
    inet6_option_append(cmsgptr, &optX.ip6_X_opt_type,
                        IP6_X_OPT_MULTX, IP6_X_OPT_OFFSETY);

    optY.ip6_Y_opt_type = IP6_Y_OPT_TYPE;
    optY.ip6_Y_opt_len  = IP6_Y_OPT_LEN;
    optY.ip6_Y_opt_val1 = <8-bit value>;
    optY.ip6_Y_opt_val2 = <16-bit value>;
    optY.ip6_Y_opt_val3 = <32-bit value>;
    inet6_option_append(cmsgptr, &optY.ip6_Y_opt_type,
                        IP6_Y_OPT_MULTX, IP6_Y_OPT_OFFSETY);

    msg.msg_controllen = cmsgptr->cmsg_len;

The call to inet6_option_init() builds the cmsghdr structure in the control buffer.

    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                  cmsg_len = CMSG_LEN(0) = 12                  |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                   cmsg_level = IPPROTO_IPV6                   |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                   cmsg_type = IPV6_HOPOPTS                    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Here we assume a 32-bit architecture where sizeof(struct cmsghdr) equals 12, with a desired alignment of 4-byte boundaries (that is, the ALIGN() macro shown in the sample implementations of the CMSG_xxx() macros rounds up to a multiple of 4).

The first call to inet6_option_append() appends the X option. Since this is the first option in the ancillary data object, 2 bytes are allocated for the Next Header byte and for the Hdr Ext Len byte. The former will be set by the kernel, depending on the type of header that follows this header, and the latter byte is set to 1. These 2 bytes form the 2 bytes of padding (IP6_X_OPT_OFFSETY) required at the beginning of this option.

    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                         cmsg_len = 28                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                   cmsg_level = IPPROTO_IPV6                   |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                   cmsg_type = IPV6_HOPOPTS                    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |  Next Header  | Hdr Ext Len=1 | Option Type=X |Opt Data Len=12|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                         4-octet field                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                                                               |
    +                         8-octet field                         +
    |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

The cmsg_len member of the cmsghdr structure is incremented by 16, the size of the option.

The next call to inet6_option_append() appends the Y option to the ancillary data object.

    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                         cmsg_len = 44                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                   cmsg_level = IPPROTO_IPV6                   |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                   cmsg_type = IPV6_HOPOPTS                    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |  Next Header  | Hdr Ext Len=3 | Option Type=X |Opt Data Len=12|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                         4-octet field                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                                                               |
    +                         8-octet field                         +
    |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | PadN Option=1 |Opt Data Len=1 |       0       | Option Type=Y |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |Opt Data Len=7 | 1-octet field |         2-octet field         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                         4-octet field                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | PadN Option=1 |Opt Data Len=2 |       0       |       0       |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

16 bytes are appended by this function, so cmsg_len becomes 44. The inet6_option_append() function notices that the appended data requires 4 bytes of padding at the end, to make the size of the ancillary data object a multiple of 8, and appends the PadN option before returning. The Hdr Ext Len byte is incremented by 2 to become 3.

Alternately, the application could build two ancillary data objects, one per option, although this will probably be less efficient than combining the two options into a single ancillary data object (as just shown). The kernel must combine these into a single Hop-by-Hop extension header in the final IPv6 packet.
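
The padding seen in the single-object layout above follows from the "xn + y" alignment rule and from the requirement that the options header be a multiple of 8 bytes long. The arithmetic can be sketched as below. The helper names opt_pad_before(), opt_pad_after(), and hdr_ext_len() are hypothetical and are not part of this API; offsets are counted from the Next Header byte, and the sketch says nothing about how a real inet6_option_append() is implemented.

```c
#include <stddef.h>

/*
 * Hypothetical helper: number of pad bytes needed before an option so
 * that its type byte lands at an offset of the form multx*n + plusy.
 * 'off' is the current offset from the start of the options header.
 */
static inline size_t opt_pad_before(size_t off, size_t multx, size_t plusy)
{
    size_t want = plusy % multx;   /* required remainder */
    size_t have = off % multx;     /* current remainder  */

    return (want + multx - have) % multx;
}

/*
 * Hypothetical helper: number of trailing pad bytes needed to round the
 * header up to a multiple of 8 bytes.
 */
static inline size_t opt_pad_after(size_t off)
{
    return (8 - off % 8) % 8;
}

/*
 * Hypothetical helper: Hdr Ext Len value for a header of 'total' bytes,
 * i.e. the length in 8-byte units not counting the first 8 bytes.
 * 'total' is assumed to be a non-zero multiple of 8.
 */
static inline size_t hdr_ext_len(size_t total)
{
    return total / 8 - 1;
}
```

Applied to the single-object example: option X (8n + 2) starts at offset 2 with no padding and occupies 14 bytes, ending at offset 16; option Y (4n + 3) then needs 3 pad bytes and occupies 9 bytes, ending at offset 28; 4 trailing pad bytes bring the header to 32 bytes, so Hdr Ext Len is 3 and, with the 12-byte cmsghdr assumed here, cmsg_len is 44. The fragment that follows returns to the two-object alternative.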

    struct msghdr  msg;
    struct cmsghdr  *cmsgptr;
    struct ip6_X_opt  optX;
    struct ip6_Y_opt  optY;

    msg.msg_control = malloc(inet6_option_space(sizeof(optX)) +
                             inet6_option_space(sizeof(optY)));

    inet6_option_init(msg.msg_control, &cmsgptr, IPV6_HOPOPTS);

    optX.ip6_X_opt_type = IP6_X_OPT_TYPE;
    optX.ip6_X_opt_len  = IP6_X_OPT_LEN;
    optX.ip6_X_opt_val1 = <32-bit value>;
    optX.ip6_X_opt_val2 = <64-bit value>;
    inet6_option_append(cmsgptr, &optX.ip6_X_opt_type,
                        IP6_X_OPT_MULTX, IP6_X_OPT_OFFSETY);
    msg.msg_controllen = CMSG_SPACE(sizeof(optX));

    /* the second ancillary data object starts after the first */
    inet6_option_init((u_char *)msg.msg_control + msg.msg_controllen,
                      &cmsgptr, IPV6_HOPOPTS);

    optY.ip6_Y_opt_type = IP6_Y_OPT_TYPE;
    optY.ip6_Y_opt_len  = IP6_Y_OPT_LEN;
    optY.ip6_Y_opt_val1 = <8-bit value>;
    optY.ip6_Y_opt_val2 = <16-bit value>;
    optY.ip6_Y_opt_val3 = <32-bit value>;
    inet6_option_append(cmsgptr, &optY.ip6_Y_opt_type,
                        IP6_Y_OPT_MULTX, IP6_Y_OPT_OFFSETY);
    msg.msg_controllen += cmsgptr->cmsg_len;

Each call to inet6_option_init() builds a new cmsghdr structure, and the final result looks like the following:

    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                         cmsg_len = 28                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                   cmsg_level = IPPROTO_IPV6                   |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                   cmsg_type = IPV6_HOPOPTS                    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |  Next Header  | Hdr Ext Len=1 | Option Type=X |Opt Data Len=12|
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                         4-octet field                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                                                               |
    +                         8-octet field                         +
    |                                                               |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                         cmsg_len = 28                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                   cmsg_level = IPPROTO_IPV6                   |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                   cmsg_type = IPV6_HOPOPTS                    |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |  Next Header  | Hdr Ext Len=1 | Pad1 Option=0 | Option Type=Y |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |Opt Data Len=7 | 1-octet field |         2-octet field         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    |                         4-octet field                         |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | PadN Option=1 |Opt Data Len=2 |       0       |       0       |
    +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

When the kernel combines these two options into a single Hop-by-Hop extension header, the first 3 bytes of the second ancillary data object (the Next Header byte, the Hdr Ext Len byte, and the Pad1 option) will be combined into a PadN option occupying 3 bytes.

The following code fragment is a redo of the first example shown (building two options in a single ancillary data object) but this time we use inet6_option_alloc().

    uint8_t  *typep;
    struct msghdr  msg;
    struct cmsghdr  *cmsgptr;
    struct ip6_X_opt  *optXp;   /* now a pointer, not a struct */
    struct ip6_Y_opt  *optYp;   /* now a pointer, not a struct */

    msg.msg_control = malloc(inet6_option_space(sizeof(*optXp) +
                                                sizeof(*optYp)));

    inet6_option_init(msg.msg_control, &cmsgptr, IPV6_HOPOPTS);

    typep = inet6_option_alloc(cmsgptr, IP6_X_OPT_LEN,
                               IP6_X_OPT_MULTX, IP6_X_OPT_OFFSETY);
    optXp = (struct ip6_X_opt *) (typep - IP6_X_OPT_OFFSETY);
    optXp->ip6_X_opt_type = IP6_X_OPT_TYPE;
    optXp->ip6_X_opt_len  = IP6_X_OPT_LEN;
    optXp->ip6_X_opt_val1 = <32-bit value>;
    optXp->ip6_X_opt_val2 = <64-bit value>;

    typep = inet6_option_alloc(cmsgptr, IP6_Y_OPT_LEN,
                               IP6_Y_OPT_MULTX, IP6_Y_OPT_OFFSETY);
    optYp = (struct ip6_Y_opt *) (typep - IP6_Y_OPT_OFFSETY);
    optYp->ip6_Y_opt_type = IP6_Y_OPT_TYPE;
    optYp->ip6_Y_opt_len  = IP6_Y_OPT_LEN;
    optYp->ip6_Y_opt_val1 = <8-bit value>;
    optYp->ip6_Y_opt_val2 = <16-bit value>;
    optYp->ip6_Y_opt_val3 = <32-bit value>;

    msg.msg_controllen = cmsgptr->cmsg_len;

Notice that inet6_option_alloc() returns a pointer to the 8-bit option type field. If the program wants a pointer to an option structure that includes the padding at the front (as shown in our definitions of the ip6_X_opt and ip6_Y_opt structures), the y-offset at the beginning of the structure must be subtracted from the returned pointer.

The following code fragment shows the processing of Hop-by-Hop options using the inet6_option_next() function.

    struct msghdr  msg;
    struct cmsghdr  *cmsgptr;

    /* fill in msg */

    /* call recvmsg() */

    for (cmsgptr = CMSG_FIRSTHDR(&msg); cmsgptr != NULL;
         cmsgptr = CMSG_NXTHDR(&msg, cmsgptr)) {
        if (cmsgptr->cmsg_level == IPPROTO_IPV6 &&
            cmsgptr->cmsg_type == IPV6_HOPOPTS) {

            uint8_t  *tptr = NULL;

            while (inet6_option_next(cmsgptr, &tptr) == 0) {
                if (*tptr == IP6_X_OPT_TYPE) {
                    struct ip6_X_opt  *optXp;

                    optXp = (struct ip6_X_opt *)
                                (tptr - IP6_X_OPT_OFFSETY);
                    <do whatever with> optXp->ip6_X_opt_val1;
                    <do whatever with> optXp->ip6_X_opt_val2;

                } else if (*tptr == IP6_Y_OPT_TYPE) {
                    struct ip6_Y_opt  *optYp;

                    optYp = (struct ip6_Y_opt *)
                                (tptr - IP6_Y_OPT_OFFSETY);
                    <do whatever with> optYp->ip6_Y_opt_val1;
                    <do whatever with> optYp->ip6_Y_opt_val2;
                    <do whatever with> optYp->ip6_Y_opt_val3;
                }
            }
            if (tptr != NULL)
                <error encountered by inet6_option_next()>;
        }
    }
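
To make the return-value convention used in the loop above concrete, the following is a minimal sketch of a walker over the type-length-value options in an options header. It is not the inet6_option_next() function: the name opt_walk_next() and its buffer-and-length arguments are hypothetical, it operates on a raw header rather than on a cmsghdr, and it chooses to skip Pad1 and PadN options, which is an assumption of this sketch rather than something the text specifies.

```c
#include <stddef.h>
#include <stdint.h>

#define OPT_PAD1 0   /* Pad1: a single byte, no length field */
#define OPT_PADN 1   /* PadN: type, length, then 'length' bytes */

/*
 * Hypothetical walker (an illustration, not inet6_option_next()).
 * hdr/hdrlen describe a complete options header beginning at the
 * Next Header byte.  *tptrp must be NULL on the first call.
 *
 * Returns  0 and sets *tptrp to the type byte of the next non-pad option;
 * returns -1 with *tptrp == NULL when no options remain;
 * returns -1 with *tptrp != NULL when an option runs past the header.
 */
static inline int opt_walk_next(const uint8_t *hdr, size_t hdrlen,
                                const uint8_t **tptrp)
{
    const uint8_t *end = hdr + hdrlen;
    const uint8_t *p;

    if (hdrlen < 2) {               /* too short to hold any option */
        *tptrp = NULL;
        return -1;
    }
    if (*tptrp == NULL)
        p = hdr + 2;                /* skip Next Header, Hdr Ext Len */
    else
        p = *tptrp + 2 + (*tptrp)[1];   /* step over the previous option */

    while (p < end) {
        if (*p == OPT_PAD1) {       /* one-byte pad */
            p++;
            continue;
        }
        if (end - p < 2 || (size_t)(end - p) < 2u + p[1]) {
            *tptrp = p;             /* malformed: report where */
            return -1;
        }
        if (*p != OPT_PADN) {       /* a real option */
            *tptrp = p;
            return 0;
        }
        p += 2 + p[1];              /* skip PadN */
    }
    *tptrp = NULL;                  /* clean end of header */
    return -1;
}
```

A caller would drive this the same way as the loop above: set the pointer to NULL, call repeatedly while the return value is 0, and afterwards treat a non-NULL pointer as an error indication.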