8. Traffic Conditioning Blocks (TCBs)
The Classifier, Meter, Action, Algorithmic Dropper, Queue and Scheduler functional datapath elements described above can be combined into Traffic Conditioning Blocks (TCBs). A TCB is an abstraction of a set of functional datapath elements that may be used to facilitate the definition of specific traffic conditioning functionality (e.g., it might be likened to a template which can be replicated multiple times for different traffic streams or different customers). It has no likely physical representation in the implementation of the data path: it is invented purely as an abstraction for use by management tools.

This model describes the configuration and management of a Diffserv interface in terms of a TCB that contains, by definition, zero or more Classifier, Meter, Action, Algorithmic Dropper, Queue and Scheduler elements. These elements are arranged according to the policy being expressed, but always in the order given here: traffic may be classified; classified traffic may be metered; each stream of traffic identified by a combination of classifiers and meters may have some set of actions performed on it, followed by drop algorithms; packets of the traffic stream may ultimately be stored into a queue and then be scheduled out to the next TCB or physical interface. It is permissible to omit elements of any type, to include null elements, or to concatenate multiple functional datapath elements of the same type.

When the Diffserv treatment for a given packet needs to have such building blocks repeated, this is performed by cascading multiple TCBs: an output of one TCB may drive the input of a succeeding one.

For example, consider the case where traffic of a set of classes is shaped to a set of rates, but the total output rate of the group of classes must also be limited to a rate. One might imagine a set of network news feeds, each with a certain maximum rate, and a policy that their aggregate may not exceed some figure. This may be simply accomplished by cascading two TCBs. The first classifies the traffic into its separate feeds and queues each feed separately. The feeds (or a subset of them) are now fed into a second TCB, which places all input (these news feeds) into a single queue with a certain maximum rate. In implementation, one could imagine this as several literal queues, a CBQ or WFQ system with an appropriate (and complex) weighting scheme, or a number of other approaches. But they would have the same externally measurable effect on the traffic as if they had been literally implemented with separate TCBs.

8.1. TCB
A generalized TCB might consist of the following stages:

   -  Classification stage

   -  Metering stage

   -  Action stage (involving Markers, Absolute Droppers, Counters,
      and Multiplexors)

   -  Queuing stage (involving Algorithmic Droppers, Queues, and
      Schedulers)

where each stage may consist of a set of parallel datapaths consisting of pipelined elements. A Classifier or a Meter is typically a 1:N element, an Action, Algorithmic Dropper, or Queue is typically a 1:1 element and a Scheduler is an N:1 element. A complete TCB should, however, result in a 1:1 or 1:N abstract element. Note that the fan-in or fan-out of an element is not an important defining characteristic of this taxonomy.
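Purely as an informal illustration (none of the class or element names below are defined by this model; they are invented for this sketch), the stage structure of a generalized TCB, with each element carrying its typical fan-in and fan-out, might be captured by a management tool roughly as follows:

   # Hypothetical sketch: a TCB as an ordered set of stages of
   # functional datapath elements.  Names are illustrative only.
   from dataclasses import dataclass, field
   from typing import List

   @dataclass
   class Element:
       name: str
       inputs: int        # fan-in  (N:1 for a Scheduler)
       outputs: int       # fan-out (1:N for a Classifier or Meter)

   @dataclass
   class TCB:
       classification: List[Element] = field(default_factory=list)
       metering:       List[Element] = field(default_factory=list)
       action:         List[Element] = field(default_factory=list)
       queuing:        List[Element] = field(default_factory=list)

   tcb = TCB(
       classification=[Element("Classifier1", 1, 4)],                   # 1:N
       metering=[Element("Meter1", 1, 2), Element("Meter2", 1, 2)],     # 1:N
       action=[Element("Marker1", 1, 1), Element("Counter1", 1, 1)],    # 1:1
       queuing=[Element("Queue1", 1, 1), Element("Scheduler1", 4, 1)],  # N:1
   )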
8.1.1. Building blocks for Queuing
Some particular rules are applied to the ordering of elements within a Queuing stage within a TCB: elements of the same type may appear more than once, either in parallel or in series. Typically, a queuing stage will have relatively many elements in parallel and few in series. Iteration and recursion are not supported constructs (the elements are arranged in an acyclic graph). The following interconnections of elements are allowed:

   -  The input of a Queue may be the input of the queuing block, or
      it may be connected to the output of an Algorithmic Dropper, or
      to an output of a Scheduler.

   -  Each input of a Scheduler may be connected to the output of a
      Queue, to the output of an Algorithmic Dropper, or to the output
      of another Scheduler.

   -  The input of an Algorithmic Dropper may be the first element of
      the queuing stage, the output of another Algorithmic Dropper, or
      it may be connected to the output of a Queue (to indicate
      head-dropping).

   -  The output of the queuing block may be the output of a Queue, an
      Algorithmic Dropper, or a Scheduler.

Note, in particular, that Schedulers may operate in series so that a packet at the head of a Queue feeding the concatenated Schedulers is serviced only after all of the scheduling criteria are met. For example, a Queue which carries EF traffic streams may be served first by a non-work-conserving Scheduler to shape the stream to a maximum rate, then by a work-conserving Scheduler to mix EF traffic streams with other traffic streams. Alternatively, there might be a Queue and/or a dropper between the two Schedulers.

Note also that some nonsensical scenarios (e.g., a Queue preceding an Algorithmic Dropper and directly feeding into another Queue) are prohibited.
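Purely as an illustration (the element kinds, function name, and graph encoding below are invented for this sketch and are not part of the model), the interconnection rules above could be encoded as a validity check over a queuing-block graph:

   # Hypothetical sketch: checking the allowed interconnections within
   # a queuing block.  "BlockInput" stands for the input of the block.
   ALLOWED_PREDECESSORS = {
       # element kind       : kinds whose output may feed it
       "Queue":              {"BlockInput", "AlgorithmicDropper", "Scheduler"},
       "Scheduler":          {"Queue", "AlgorithmicDropper", "Scheduler"},
       "AlgorithmicDropper": {"BlockInput", "AlgorithmicDropper", "Queue"},
   }
   ALLOWED_BLOCK_OUTPUTS = {"Queue", "AlgorithmicDropper", "Scheduler"}

   def check_queuing_block(edges, output_kind):
       """edges: iterable of (upstream_kind, downstream_kind) pairs."""
       for upstream, downstream in edges:
           if upstream not in ALLOWED_PREDECESSORS.get(downstream, set()):
               raise ValueError(f"{upstream} may not feed {downstream}")
       if output_kind not in ALLOWED_BLOCK_OUTPUTS:
           raise ValueError(f"{output_kind} may not be the block output")

   # Example: dropper -> queue -> scheduler, with the scheduler as the
   # output of the queuing block.
   check_queuing_block(
       [("BlockInput", "AlgorithmicDropper"),
        ("AlgorithmicDropper", "Queue"),
        ("Queue", "Scheduler")],
       output_kind="Scheduler")

This sketch does not check for cycles; as noted above, the elements must in any case form an acyclic graph.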
8.2. An Example TCB

An SLS is presumed to have been negotiated between the customer and the provider which specifies the handling of the customer's traffic (as defined by a TCS) by the provider's network. The agreement might be of the following form:
      DSCP    PHB    Profile    Treatment
      ----    ---    -------    ----------------------
      001001  EF     Profile4   Discard non-conforming.
      001100  AF11   Profile5   Shape to profile, tail-drop when full.
      001101  AF21   Profile3   Re-mark non-conforming to DSCP 001000,
                                tail-drop when full.
      other   BE     none       Apply RED-like dropping.

This SLS specifies that the customer may submit packets marked for DSCP 001001, which will get EF treatment so long as they remain conforming to Profile4; packets exceeding this profile will be discarded. The discarded packets are counted in this example, perhaps for use by the provider's sales department in convincing the customer to buy a larger SLS. Packets marked for DSCP 001100 will be shaped to Profile5 before forwarding. Packets marked for DSCP 001101 will be metered to Profile3, with non-conforming packets "downgraded" by being re-marked with a DSCP of 001000. It is implicit in this agreement that conforming packets are given the PHB originally indicated by the packets' DSCP field.

Figures 6 and 7 illustrate a TCB that might be used to handle this SLS at an ingress interface at the customer/provider boundary.

The Classification stage of this example consists of a single BA classifier. The BA classifier is used to separate traffic based on the Diffserv service level requested by the customer (as indicated by the DSCP in each submitted packet's IP header). We illustrate three DSCP filter values: A, B, and C. The 'X' in the BA classifier is a wildcard filter that matches every packet not otherwise matched.

The path for DSCP 001100 proceeds directly to AlgDropper1, whilst the paths for DSCP 001001 and 001101 include a metering stage. All other traffic is passed directly on to AlgDropper3. There is a separate meter for each set of packets corresponding to classifier outputs A and C. Each meter uses a specific profile, as specified in the TCS, for the corresponding Diffserv service level. The meters in this example each indicate one of two conformance levels: conforming or non-conforming.

Following the Metering stage is an Action stage in some of the branches. Packets submitted for DSCP 001001 (Classifier output A) that are deemed non-conforming by Meter1 are counted and discarded while packets that are conforming are passed on to Queue1. Packets submitted for DSCP 001101 (Classifier output C) that are deemed non-conforming by Meter2 are re-marked and then both conforming and non-conforming packets are multiplexed together before being passed on to AlgDropper2/Queue3.
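As an aside, a management tool might capture this TCS in a simple table-like structure. The following sketch is illustrative only: the field layout, dictionary keys, and treatment strings are invented here and are not defined by this model or by the SLS above.

   # Hypothetical sketch of the example TCS as configuration data.
   example_tcs = {
       # DSCP:    (PHB,    profile,    treatment of non-conforming traffic)
       "001001": ("EF",   "Profile4", "discard"),
       "001100": ("AF11", "Profile5", "shape, tail-drop when full"),
       "001101": ("AF21", "Profile3", "re-mark to DSCP 001000, tail-drop when full"),
       "other":  ("BE",   None,       "RED-like dropping"),
   }

   def treatment_for(dscp: str) -> tuple:
       """Look up the agreed treatment for a submitted packet's DSCP."""
       return example_tcs.get(dscp, example_tcs["other"])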
The Algorithmic Dropping, Queuing and Scheduling stages are realized as follows, illustrated in figure 7. Note that the figure does not show any of the implicit control linkages between elements that allow e.g., an Algorithmic Dropper to sense the current state of a succeeding Queue.

   [ASCII-art diagram not reproduced here: Classifier1 (BA) splits
    submitted traffic into outputs A, B, C and wildcard X; output A
    feeds Meter1, whose non-conforming branch passes through Counter1
    to AbsoluteDropper1 and whose conforming branch goes to Queue1;
    output B goes to AlgDropper1; output C feeds Meter2, whose
    non-conforming branch passes through Marker1, both branches then
    merging in Mux1 towards AlgDropper2; output X goes to AlgDropper3.]

        Figure 6: An Example Traffic Conditioning Block (Part 1)

Conforming DSCP 001001 packets from Meter1 are passed directly to Queue1: there is no way, with configuration of the following Scheduler to match the metering, for these packets to overflow the depth of Queue1, so there is no requirement for dropping at this point. Packets marked for DSCP 001100 must be passed through a tail-dropper, AlgDropper1, which serves to limit the depth of the following queue, Queue2: packets that arrive to a full queue will be discarded. This is likely to be an error case: the customer is obviously not sticking to its agreed profile. Similarly, all packets from the original DSCP 001101 stream (some may have been re-marked by this stage) are passed to AlgDropper2 and Queue3. Packets marked for all other DSCPs are passed to AlgDropper3 which is a RED-like Algorithmic Dropper: based on feedback of the current depth of Queue4, this dropper is supposed to discard enough packets from its input stream to keep the queue depth under control.
These four Queue elements are then serviced by a Scheduler element, Scheduler1: this must be configured to give each of its inputs an appropriate priority and/or bandwidth share. Inputs A and C are given guarantees of bandwidth, as appropriate for the contracted profiles. Input B is given a limit on the bandwidth it can use (i.e., a non-work-conserving discipline) in order to achieve the desired shaping of this stream. Input D is given no limits or guarantees but a lower priority than the other queues, appropriate for its best-effort status. Traffic then exits the Scheduler in a single orderly stream.

The interconnections of the TCB elements illustrated in Figures 6 and 7 can be represented textually as follows:

   TCB1:

   Classifier1:
      FilterA:             Meter1
      FilterB:             AlgDropper1
      FilterC:             Meter2
      Default:             AlgDropper3

   [ASCII-art diagram not reproduced here: the stream from Meter1
    enters Queue1; the streams from Classifier1 (DSCP 001100), Mux1
    and Classifier1 (default) pass through AlgDropper1/Queue2,
    AlgDropper2/Queue3 and AlgDropper3/Queue4 respectively; Queues 1-4
    feed Scheduler1 inputs A-D, whose single output carries the
    exiting traffic.]

        Figure 7: An Example Traffic Conditioning Block (Part 2)
   Meter1:
      Type:                AverageRate
      Profile:             Profile4
      ConformingOutput:    Queue1
      NonConformingOutput: Counter1

   Counter1:
      Output:              AbsoluteDropper1

   Meter2:
      Type:                AverageRate
      Profile:             Profile3
      ConformingOutput:    Mux1.InputA
      NonConformingOutput: Marker1

   Marker1:
      Type:                DSCPMarker
      Mark:                001000
      Output:              Mux1.InputB

   Mux1:
      Output:              AlgDropper2

   AlgDropper1:
      Type:                AlgorithmicDropper
      Discipline:          Drop-on-threshold
      Trigger:             Queue2.Depth > 10kbyte
      Output:              Queue2

   AlgDropper2:
      Type:                AlgorithmicDropper
      Discipline:          Drop-on-threshold
      Trigger:             Queue3.Depth > 20kbyte
      Output:              Queue3

   AlgDropper3:
      Type:                AlgorithmicDropper
      Discipline:          RED93
      Trigger:             Internal
      Output:              Queue4
      MinThresh:           Queue4.Depth > 20 kbyte
      MaxThresh:           Queue4.Depth > 40 kbyte
      <other RED parms too>
   Queue1:
      Type:                FIFO
      Output:              Scheduler1.InputA

   Queue2:
      Type:                FIFO
      Output:              Scheduler1.InputB

   Queue3:
      Type:                FIFO
      Output:              Scheduler1.InputC

   Queue4:
      Type:                FIFO
      Output:              Scheduler1.InputD

   Scheduler1:
      Type:                Scheduler4Input
      InputA:
         MaxRateProfile:   none
         MinRateProfile:   Profile4
         Priority:         20
      InputB:
         MaxRateProfile:   Profile5
         MinRateProfile:   none
         Priority:         40
      InputC:
         MaxRateProfile:   none
         MinRateProfile:   Profile3
         Priority:         20
      InputD:
         MaxRateProfile:   none
         MinRateProfile:   none
         Priority:         10
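As an illustration only, the per-input scheduling parameters of Scheduler1 might be represented to a management tool roughly as sketched below. The class and field names are invented here, and the sketch assumes that a larger Priority value means higher priority (consistent with InputD, Priority 10, being described above as the lowest-priority, best-effort input); the model itself does not define these details.

   # Hypothetical sketch of Scheduler1's per-input configuration.
   # Profile names refer to the TCS profiles used earlier in this example.
   from dataclasses import dataclass
   from typing import Optional

   @dataclass
   class SchedulerInput:
       queue: str
       max_rate_profile: Optional[str]  # non-work-conserving limit (shaping)
       min_rate_profile: Optional[str]  # bandwidth guarantee
       priority: int                    # assumed: larger value = higher priority

   scheduler1 = [
       SchedulerInput("Queue1", None,       "Profile4", 20),  # InputA: EF guarantee
       SchedulerInput("Queue2", "Profile5", None,       40),  # InputB: shaped stream
       SchedulerInput("Queue3", None,       "Profile3", 20),  # InputC: guarantee
       SchedulerInput("Queue4", None,       None,       10),  # InputD: best effort
   ]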
8.3. An Example TCB to Support Multiple Customers

The TCB described above can be installed on an ingress interface to implement a provider/customer TCS if the interface is dedicated to the customer. However, if a single interface is shared between multiple customers, then the TCB above will not suffice, since it does not differentiate among traffic from different customers: its classification stage uses only BA classifiers. The configuration is readily modified to support the case of multiple customers per interface, as follows.

First, a TCB is defined for each customer to reflect the TCS with that customer: TCB1, defined above, is the TCB for customer 1. Similar elements are created for
TCB2 and for TCB3 which reflect the agreements with customers 2 and 3 respectively. These 3 TCBs may or may not contain similar elements and parameters.

Finally, a classifier is added to the front end to separate the traffic from the three different customers. This forms a new TCB, TCB4, which is illustrated in Figure 8.

A representation of this multi-customer TCB might be:

   TCB4:

   Classifier4:
      Filter1:             to TCB1
      Filter2:             to TCB2
      Filter3:             to TCB3
      No Match:            AbsoluteDropper4

   AbsoluteDropper4:
      Type:                AbsoluteDropper

   TCB1:
      (as defined above)

   TCB2:
      (similar to TCB1, perhaps with different elements or numeric
      parameters)

   TCB3:
      (similar to TCB1, perhaps with different elements or numeric
      parameters)

and the filters, based on each customer's source MAC address, could be defined as follows:

   [ASCII-art diagram not reproduced here: Classifier4 directs
    submitted traffic via outputs A, B and C to TCB1, TCB2 and TCB3
    respectively; the wildcard output X feeds AbsoluteDropper4.]

        Figure 8: An Example of a Multi-Customer TCB

   Filter1:
      Type:             MacAddress
      SrcValue:         01-02-03-04-05-06 (source MAC address of customer 1)
      SrcMask:          FF-FF-FF-FF-FF-FF
      DestValue:        00-00-00-00-00-00
      DestMask:         00-00-00-00-00-00

   Filter2:
      (similar to Filter1 but with customer 2's source MAC address as
      SrcValue)

   Filter3:
      (similar to Filter1 but with customer 3's source MAC address as
      SrcValue)

In this example, Classifier4 separates traffic submitted from different customers based on the source MAC address in submitted packets. Those packets with recognized source MAC addresses are passed to the TCB implementing the TCS with the corresponding customer. Those packets with unrecognized source MAC addresses are passed to a dropper. TCB4 thus has a Classifier stage and an Action element stage which drops all unmatched traffic.
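For illustration only (the function name and argument layout are invented here), matching a source MAC address against such a value/mask filter can be sketched as:

   # Hypothetical sketch of value/mask matching for the MAC filters above.
   def mac_matches(addr: str, value: str, mask: str) -> bool:
       """Compare addr to value under mask, all given as 'xx-xx-...' strings.
       An all-zero mask (as used for DestValue/DestMask above) matches
       any address."""
       a = bytes(int(x, 16) for x in addr.split("-"))
       v = bytes(int(x, 16) for x in value.split("-"))
       m = bytes(int(x, 16) for x in mask.split("-"))
       return all((ab & mb) == (vb & mb) for ab, vb, mb in zip(a, v, m))

   # Filter1: traffic from customer 1's source MAC address matches.
   assert mac_matches("01-02-03-04-05-06",
                      "01-02-03-04-05-06", "FF-FF-FF-FF-FF-FF")
   # The all-zero destination value/mask matches any destination address.
   assert mac_matches("AA-BB-CC-DD-EE-FF",
                      "00-00-00-00-00-00", "00-00-00-00-00-00")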
8.4. TCBs Supporting Microflow-based Services

The TCB illustrated above describes a configuration that might be suitable for enforcing an SLS at a router's ingress. It assumes that the customer marks its own traffic for the appropriate service level. It then limits the rate of aggregate traffic submitted at each service level, thereby protecting the resources of the Diffserv network. It does not provide any isolation between the customer's individual microflows.

A more complex example might be a TCB configuration that offers additional functionality to the customer. It recognizes individual customer microflows and marks each one independently. It also isolates the customer's individual microflows from each other in order to prevent a single microflow from seizing an unfair share of the resources available to the customer at a certain service level. This is illustrated in Figure 9.

Suppose that the customer has an SLS which specifies 2 service levels, to be identified to the provider by DSCP A and DSCP B. Traffic is first directed to an MF classifier which classifies traffic based on miscellaneous classification criteria, to a granularity sufficient to identify individual customer microflows. Each microflow can then be marked for a specific DSCP. The metering
elements limit the contribution of each of the customer's microflows to the service level for which it was marked. Packets exceeding the allowable limit for the microflow are dropped.

   [ASCII-art diagram not reproduced here: an MF classifier,
    Classifier1, splits traffic into individual microflows A, B, C,
    etc.; each microflow passes through its own Marker and Meter, with
    non-conforming packets sent to an Absolute Dropper and conforming
    packets merged in Mux1 and passed on to TCB2.]

     Figure 9: An Example of a Marking and Traffic Isolation TCB

This TCB could be formally specified as follows:

   TCB1:

   Classifier1:            (MF)
      FilterA:             Marker1
      FilterB:             Marker2
      FilterC:             Marker3
      etc.

   Marker1:
      Output:              Meter1

   Marker2:
      Output:              Meter2

   Marker3:
      Output:              Meter3
   Meter1:
      ConformingOutput:    Mux1.InputA
      NonConformingOutput: AbsoluteDropper1

   Meter2:
      ConformingOutput:    Mux1.InputB
      NonConformingOutput: AbsoluteDropper2

   Meter3:
      ConformingOutput:    Mux1.InputC
      NonConformingOutput: AbsoluteDropper3

   etc.

   Mux1:
      Output:              to TCB2

Note that the detailed traffic element declarations are not shown here. Traffic is either dropped by TCB1 or emerges marked for one of two DSCPs. This traffic is then passed to TCB2 which is illustrated in Figure 10. TCB2 could then be specified as follows:

   Classifier2:            (BA)
      FilterA:             Meter5
      FilterB:             Meter6

   [ASCII-art diagram not reproduced here: a BA classifier,
    Classifier2, directs traffic marked with DSCP A to Meter5 and
    traffic marked with DSCP B to Meter6; conforming packets proceed
    to Queue1 and Queue2 respectively, while non-conforming packets
    are sent to AbsoluteDropper4 and AbsoluteDropper5.]

              Figure 10: Additional Example: TCB2

   Meter5:
      ConformingOutput:    Queue1
      NonConformingOutput: AbsoluteDropper4
   Meter6:
      ConformingOutput:    Queue2
      NonConformingOutput: AbsoluteDropper5
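Purely as a sketch (the function and variable names are invented here, and the conformance tests are stand-ins rather than real Meter profiles), the per-microflow isolation performed by TCB1 above, before the conforming remainder is multiplexed on to TCB2, could be outlined as:

   # Hypothetical sketch of TCB1's per-microflow marking and isolation.
   # flow_dscp maps a microflow id to the DSCP it is marked with
   # (Marker1..MarkerN); flow_meter maps a microflow id to its own
   # conformance test (Meter1..MeterN).
   def tcb1_handle(packet, flow_id, flow_dscp, flow_meter):
       packet["dscp"] = flow_dscp[flow_id]      # per-microflow Marker
       if flow_meter[flow_id](packet):          # per-microflow Meter
           return packet                        # Mux1: pass on to TCB2
       return None                              # AbsoluteDropper: discard

   # Two microflows of the same service level: one conforming, one not.
   flow_dscp = {"flow1": "A", "flow2": "A"}
   flow_meter = {"flow1": lambda p: True, "flow2": lambda p: False}
   print(tcb1_handle({"len": 100}, "flow1", flow_dscp, flow_meter))  # marked packet
   print(tcb1_handle({"len": 100}, "flow2", flow_dscp, flow_meter))  # None (dropped)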
8.5. Cascaded TCBs

Nothing in this model prevents more complex scenarios in which one microflow TCB precedes another (e.g., for TCBs implementing separate TCSs for the source and for a set of destinations).

9. Security Considerations
Security vulnerabilities of Diffserv network operation are discussed in [DSARCH]. This document describes an abstract functional model of Diffserv router elements. Certain denial-of-service attacks such as those resulting from resource starvation may be mitigated by appropriate configuration of these router elements; for example, by rate limiting certain traffic streams or by authenticating traffic marked for higher quality-of-service. There may be theft-of-service scenarios where a malicious host can exploit a loose token bucket policer to obtain slightly better QoS than that committed in the TCS.

10. Acknowledgments
Concepts, terminology, and text have been borrowed liberally from [POLTERM], as well as from other IETF work on MIBs and policy management. We wish to thank the authors of some of those documents: Fred Baker, Michael Fine, Keith McCloghrie, John Seligson, Kwok Chan, Scott Hahn, and Andrea Westerinen for their contributions.

This document has benefited from the comments and suggestions of several participants of the Diffserv working group, particularly Shahram Davari, John Strassner, and Walter Weiss. This document could never have reached this level of rough consensus without the relentless pressure of the co-chairs Brian Carpenter and Kathie Nichols, for which the authors are grateful.

11. References
   [AF-PHB]     Heinanen, J., Baker, F., Weiss, W. and J. Wroclawski,
                "Assured Forwarding PHB Group", RFC 2597, June 1999.

   [DSARCH]     Carlson, M., Weiss, W., Blake, S., Wang, Z., Black, D.
                and E. Davies, "An Architecture for Differentiated
                Services", RFC 2475, December 1998.

   [DSFIELD]    Nichols, K., Blake, S., Baker, F. and D. Black,
                "Definition of the Differentiated Services Field (DS
                Field) in the IPv4 and IPv6 Headers", RFC 2474,
                December 1998.

   [DSMIB]      Baker, F., Smith, A. and K. Chan, "Management
                Information Base for the Differentiated Services
                Architecture", RFC 3289, May 2002.

   [E2E]        Bernet, Y., Yavatkar, R., Ford, P., Baker, F., Zhang,
                L., Speer, M., Nichols, K., Braden, R., Davie, B.,
                Wroclawski, J. and E. Felstaine, "A Framework for
                Integrated Services Operation over Diffserv Networks",
                RFC 2998, November 2000.

   [EF-PHB]     Davie, B., Charny, A., Bennett, J.C.R., Benson, K., Le
                Boudec, J.Y., Courtney, W., Davari, S., Firoiu, V. and
                D. Stiliadis, "An Expedited Forwarding PHB (Per-Hop
                Behavior)", RFC 3246, March 2002.

   [FJ95]       Floyd, S. and V. Jacobson, "Link Sharing and Resource
                Management Models for Packet Networks", IEEE/ACM
                Transactions on Networking, Vol. 3 No. 4, August 1995.

   [INTSERV]    Braden, R., Clark, D. and S. Shenker, "Integrated
                Services in the Internet Architecture: an Overview",
                RFC 1633, June 1994.

   [NEWTERMS]   Grossman, D., "New Terminology and Clarifications for
                Diffserv", RFC 3260, April 2002.

   [PDBDEF]     Nichols, K. and B. Carpenter, "Definition of
                Differentiated Services Per Domain Behaviors and Rules
                for Their Specification", RFC 3086, April 2001.

   [POLTERM]    Westerinen, A., Schnizlein, J., Strassner, J.,
                Scherling, M., Quinn, B., Herzog, S., Huynh, A.,
                Carlson, M., Perry, J. and S. Waldbusser, "Policy
                Terminology", RFC 3198, November 2001.

   [QOSDEVMOD]  Strassner, J., Westerinen, A. and B. Moore,
                "Information Model for Describing Network Device QoS
                Mechanisms", Work in Progress.

   [QUEUEMGMT]  Braden, R., Clark, D., Crowcroft, J., Davie, B.,
                Deering, S., Estrin, D., Floyd, S., Jacobson, V.,
                Minshall, G., Partridge, C., Peterson, L.,
                Ramakrishnan, K., Shenker, S., Wroclawski, J. and L.
                Zhang, "Recommendations on Queue Management and
                Congestion Avoidance in the Internet", RFC 2309,
                April 1998.

   [SRTCM]      Heinanen, J. and R. Guerin, "A Single Rate Three Color
                Marker", RFC 2697, September 1999.

   [TRTCM]      Heinanen, J. and R. Guerin, "A Two Rate Three Color
                Marker", RFC 2698, September 1999.

   [VIC]        McCanne, S. and V. Jacobson, "vic: A Flexible Framework
                for Packet Video", ACM Multimedia '95, November 1995,
                San Francisco, CA, pp. 511-522.
                <ftp://ftp.ee.lbl.gov/papers/vic-mm95.ps.Z>

   [802.1D]     "Information technology - Telecommunications and
                information exchange between systems - Local and
                metropolitan area networks - Common specifications -
                Part 3: Media Access Control (MAC) Bridges: Revision.
                This is a revision of ISO/IEC 10038: 1993, 802.1j-1992
                and 802.6k-1992. It incorporates P802.11c, P802.1p and
                P802.12e.", ISO/IEC 15802-3: 1998.
Appendix A. Discussion of Token Buckets and Leaky Buckets

"Leaky bucket" and/or "Token Bucket" models are used to describe rate control in several architectures, including Frame Relay, ATM, Integrated Services and Differentiated Services. Both of these models are, by definition, theoretical relationships between some defined burst size, B, a rate, R, and a time interval, t:

      R = B/t

Thus, a token bucket or leaky bucket might specify an information rate of 1.2 Mbps with a burst size of 1500 bytes. In this case, the token rate is 1,200,000 bits per second, the token burst is 12,000 bits and the token interval is 10 milliseconds. The specification says that conforming traffic will, in the worst case, come in 100 bursts per second of 1500 bytes each and at an average rate not exceeding 1.2 Mbps.

A.1 Leaky Buckets
A leaky bucket algorithm is primarily used for shaping traffic as it leaves an interface onto the network (handled under Queues and Schedulers in this model). Traffic theoretically departs from an interface at a rate of one bit every so many time units (in the example, one bit every 0.83 microseconds) but, in fact, departs in multi-bit units (packets) at a rate approximating the theoretical, as measured over a longer interval. In the example, it might send one 1500 byte packet every 10 ms or perhaps one 500 byte packet every 3.3 ms. It is also possible to build multi-rate leaky buckets in which traffic departs from the interface at varying rates depending on recent activity or inactivity.

Implementations generally seek as constant a transmission rate as achievable. In theory, a 10 Mbps shaped transmission stream from an algorithmic implementation and a stream which is running at 10 Mbps because its bottleneck link has been a 10 Mbps Ethernet link should be indistinguishable. Depending on configuration, the approximation to theoretical smoothness may vary by moving as much as an MTU from one token interval to another. Traffic may also be jostled by other traffic competing for the same transmission resources.
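As an illustration only (the class and method names are invented here; this is a sketch of one possible single-rate shaper, not a definition from this model), a leaky bucket shaper can be viewed as a calculation of the earliest departure time of each packet, so that the long-term output rate never exceeds R:

   # Hypothetical sketch of a single-rate leaky bucket shaper.
   # Each packet of L bits may depart no earlier than L/R seconds after
   # the previous departure began, approximating a constant bit rate.
   class LeakyBucketShaper:
       def __init__(self, rate_bps: float):
           self.rate = rate_bps
           self.next_free = 0.0          # time the "wire" next becomes free

       def departure_time(self, arrival: float, length_bits: int) -> float:
           start = max(arrival, self.next_free)
           self.next_free = start + length_bits / self.rate
           return start

   # Example: a 1.2 Mbps shaper sending back-to-back 1500-byte packets;
   # departures are spaced 10 ms apart, as in the example above.
   shaper = LeakyBucketShaper(1_200_000)
   print([round(shaper.departure_time(0.0, 12_000), 3) for _ in range(3)])
   # -> [0.0, 0.01, 0.02]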
A.2 Token Buckets

A token bucket, on the other hand, measures the arrival rate of traffic from another device. This traffic may originally have been shaped using a leaky bucket shaper or its equivalent. The token bucket determines whether the traffic (still) conforms to the specification. Multi-rate token buckets (e.g., token buckets with
both a peak rate and a mean rate, and sometimes more) are commonly used, such as those described in [SRTCM] and [TRTCM]. In this case, absolute smoothness is not expected, but conformance to one or more of the specified rates is.

Simplistically, a data stream is said to conform to a simple token bucket parameterized by {R, B} if the system receives, in any time interval t, an amount of data not exceeding (R * t) + B.

For a multi-rate token bucket case, the data stream is said to conform if, for each of the rates, the stream conforms to the token bucket profile appropriate for traffic of that class. For example, received traffic that arrives pre-classified as one of the "excess" rates (e.g., AF12 or AF13 traffic for a device implementing the AF1x PHB) is only compared to the relevant "excess" token bucket profile.
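Illustratively (the function name and argument names are invented here), this simple conformance statement is just a direct check over an observation interval:

   # Hypothetical sketch of the simple {R, B} conformance statement above:
   # over any interval of length t, the amount of received data may not
   # exceed R*t + B.
   def interval_conforms(data_received: float, t: float, R: float, B: float) -> bool:
       return data_received <= R * t + B

   # Example with the numbers from the start of this appendix:
   # R = 1,200,000 bits/s, B = 12,000 bits, over a 10 ms interval.
   print(interval_conforms(24_000, 0.01, 1_200_000, 12_000))   # True
   print(interval_conforms(36_000, 0.01, 1_200_000, 12_000))   # False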
A.3 Some Consequences

The fact that Internet Protocol data is organized into variable length packets introduces some uncertainty in the conformance decision made by any downstream Meter that is attempting to determine conformance to a traffic profile that is theoretically designed for fixed-length units of data.

When used as a leaky bucket shaper, the above definition interacts with clock granularity in ways one might not expect. A leaky bucket releases a packet only when all of its bits would have been allowed: it does not borrow from future capacity. If the clock is very fine grain, on the order of the bit rate or faster, this is not an issue. But if the clock is relatively slow (and millisecond or multi-millisecond clocks are not unusual in networking equipment), this can introduce jitter to the shaped stream.

This leaves an implementor of a token bucket Meter with a dilemma. When the number of bandwidth tokens, b, left in the token bucket is positive but less than the size of the packet being operated on, L, one of three actions can be performed:

   (1) The whole size of the packet can be subtracted from the bucket,
       leaving it negative, remembering that, when new tokens are next
       added to the bucket, the new token allocation, B, must be added
       to b rather than simply setting the bucket to "full".  This
       option potentially puts more than the desired burst size of
       data into this token bucket interval and correspondingly less
       into the next.  It does, however, keep the average amount
       accepted per token bucket interval equal to the token burst.
       This approach accepts traffic if any one bit in the packet
       would have been accepted and borrows up to one MTU of capacity
       from one or more subsequent intervals when necessary.  Such a
       token bucket meter implementation is said to offer "loose"
       conformance to the token bucket.

   (2) Alternatively, the packet can be rejected and the amount of
       tokens in the bucket left unchanged (and maybe an attempt could
       be made to accept the packet under another threshold in another
       bucket), remembering that, when new tokens are next added to
       the bucket, the new token allocation, B, must be added to b
       rather than simply setting the bucket to "full".  This
       potentially puts less than the permissible burst size of data
       into this token bucket interval and correspondingly more into
       the next.  Like the first option, it keeps the average amount
       accepted per token bucket interval equal to the token burst.
       This approach accepts traffic only if every bit in the packet
       would have been accepted and borrows up to one MTU of capacity
       from one or more previous intervals when necessary.  Such a
       token bucket meter implementation is said to offer "strict" (or
       perhaps "stricter") conformance to the token bucket.  This
       option is consistent with [SRTCM] and [TRTCM] and is often used
       in ATM and frame-relay implementations.

   (3) The TB variable can be set to zero to account for the first
       part of the packet and the remainder of the packet size can be
       taken out of the next-colored bucket.  This, of course, has
       another bug: the same packet cannot have both conforming and
       non-conforming components in the Diffserv architecture, so this
       option is not really appropriate here and we do not discuss it
       further.

Unfortunately, the one thing that cannot be done is to fit the token burst specification exactly with random sized packets: token buckets in a variable length packet environment therefore always have some variance from theoretical reality. This has also been observed in the ATM Guaranteed Frame Rate (GFR) service category specification and in Frame Relay.

A number of observations may be made:

   o  Operationally, a token bucket meter is reasonable for traffic
      which has been shaped by a leaky bucket shaper or a serial line.
      However, traffic in the Internet is rarely shaped in that way:
      TCP applies no shaping to its traffic, but rather depends on
      longer-range ACK-clocking behavior to help it approximate a
      certain rate and explicitly sends traffic bursts during slow
      start, retransmission, and fast recovery.  Video-on-IP
      implementations such as [VIC] may have a leaky bucket shaper
      available to them,
      but often do not, and simply enqueue the output of their codec
      for transmission on the appropriate interface.  As a result, in
      each of these cases, a token bucket meter may reject traffic in
      the short term (over a single token interval) which it would
      have accepted if it had a longer time in view and which it needs
      to accept for the application to work properly.  To work around
      this, the token interval, B/R, must approximate or exceed the
      RTT of the session(s) in question and the burst size, B, must
      accommodate the largest burst that the originator might send.

   o  The behavior of a loose token bucket is significantly different
      from the token bucket description for ATM and for Frame Relay.

   o  A loose token bucket does not accept packets while the token
      count is negative.  This means that, when a large packet has
      just borrowed tokens from the future, even a small incoming
      packet (e.g., a 40-byte TCP ACK/SYN) will not be accepted.
      Therefore, if such a loose token bucket is configured with a
      burst size close to the MTU, some discrimination against smaller
      packets can take place: use of a larger burst size avoids this
      problem.

   o  The converse of the above is that a strict token bucket
      sometimes does not accept large packets when a loose one would
      do so.  Therefore, if such a strict token bucket is configured
      with a burst size close to the MTU, some discrimination against
      larger packets can take place: use of a larger burst size avoids
      this problem.

   o  In real-world deployments, MTUs are often larger than the burst
      size offered by a link-layer network service provider.  If so
      then it is possible that a strict token bucket meter would find
      that traffic never matches the specified profile: this may be
      avoided by not allowing such a specification to be used.  This
      situation cannot arise with a loose token bucket since the
      smallest burst size that can be configured is 1 bit, by
      definition limiting a loose token bucket to having a burst size
      of greater than one MTU.

   o  Both strict token bucket specifications, as specified in [SRTCM]
      and [TRTCM], and loose ones, are subject to a persistent
      under-run.  These accumulate burst capacity over time, up to the
      maximum burst size.  Suppose that the maximum burst size is
      exactly the size of the packets being sent - which one might
      call the "strictest" token bucket implementation.  In such a
      case, when one packet has been accepted, the token depth becomes
      zero and starts to accumulate again.  If the next packet is
      received any time earlier than a token interval later, it will
      not be accepted.  If the next packet arrives exactly on time, it
      will be accepted and the token depth again set to zero.  If it
      arrives later, however, accumulation of tokens will have stopped
      because it is capped by the maximum burst size: during the
      interval between the bucket becoming full and the actual arrival
      of the packet, no new tokens are added.  As a result, jitter
      that accumulates across multiple hops in the network conspires
      against the algorithm to reduce the actual acceptance rate.
      Thus it usually makes sense to set the maximum token bucket size
      somewhat greater than the MTU in order to absorb some of the
      jitter and allow a practical acceptance rate more in line with
      the desired theoretical rate.

A.4 Mathematical Definition of Strict Token Bucket Conformance
The strict token bucket conformance behavior defined in [SRTCM] and [TRTCM] is not mandatory for compliance with any current Diffserv standards, but we give here a mathematical definition of two-parameter token bucket operation which is consistent with those documents and which can also be used to define a shaping profile.

Define a token bucket with bucket size B, token accumulation rate R and instantaneous token occupancy b(t). Assume that b(0) = B. Then after an arbitrary interval with no packet arrivals, b(t) will not change since the bucket is already full of tokens. Assume a packet of size L bytes arrives at time t'. The bucket occupancy is still B. Then, as long as L <= B, the packet conforms to the meter, and afterwards

      b(t') = B - L.

Assume now an interval delta_t = t - t' elapses before the next packet arrives, of size L' <= B. Just before this, at time t-, the bucket has accumulated delta_t*R tokens over the interval, up to a maximum of B tokens, so that:

      b(t-) = min{ B, b(t') + delta_t*R }

For a strict token bucket, the conformance test is as follows:

      if (b(t-) - L' >= 0) {
         /* the packet conforms */
         b(t) = b(t-) - L';
      } else {
         /* the packet does not conform */
         b(t) = b(t-);
      }
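A small executable sketch of this definition, together with the "loose" variant described in A.3 for contrast, might look as follows. The class, method and parameter names are invented here and nothing in this sketch is mandated by this model or by [SRTCM]/[TRTCM].

   # Hypothetical sketch of two-parameter token bucket meters.
   # 'strict' follows the conformance test above; 'loose' accepts a
   # packet whenever the token count is positive, letting the count go
   # negative (borrowing from subsequent intervals), as in A.3 option (1).
   class TokenBucketMeter:
       def __init__(self, rate, burst, strict=True):
           self.R = float(rate)     # token accumulation rate (units/sec)
           self.B = float(burst)    # bucket size
           self.b = float(burst)    # current occupancy, b(0) = B
           self.strict = strict
           self.t_last = 0.0

       def conforms(self, t, size):
           # accumulate tokens since the last arrival, capped at B
           self.b = min(self.B, self.b + (t - self.t_last) * self.R)
           self.t_last = t
           if self.strict:
               ok = self.b - size >= 0
           else:
               ok = self.b > 0          # loose: any positive credit will do
           if ok:
               self.b -= size           # may go negative in the loose case
           return ok

   # Example with the numbers from the start of this appendix:
   # R = 1,200,000 bits/s, B = 12,000 bits, 12,000-bit (1500-byte) packets.
   m = TokenBucketMeter(rate=1_200_000, burst=12_000, strict=True)
   print(m.conforms(0.0, 12_000),    # True:  bucket starts full
         m.conforms(0.0, 12_000),    # False: no tokens have accumulated yet
         m.conforms(0.01, 12_000))   # True:  one token interval has elapsed
   # -> True False True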
This function can also be used to define a shaping profile. If a packet of size L arrives at time t, it will be eligible for transmission at time te given as follows (we still assume L <= B):

      te = max{ t, t" }

where

      t" = ( L - b(t') + t'*R ) / R

is the time at which b(t") = L, i.e., the time when L credits have accumulated in the bucket and when the packet would conform if the token bucket were a meter. te != t" only if t > t".

A mathematical definition along these lines for loose token bucket conformance is left as an exercise for the reader.
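Continuing the sketch above (again purely illustrative, with invented names), the shaping use of the same bucket state is just the computation of the eligibility time te:

   # Hypothetical sketch: eligibility time for a shaper using the same
   # two-parameter token bucket state as in the meter sketch above.
   def eligibility_time(t, size, b_prev, t_prev, rate):
       """t: arrival time of the packet of the given size;
       b_prev, t_prev: bucket occupancy and time at the previous update;
       rate: token accumulation rate R.  Assumes size <= B."""
       t_fill = (size - b_prev + t_prev * rate) / rate   # time when b = size
       return max(t, t_fill)                             # te = max{ t, t" }

   # A packet needing 12,000 tokens, with an empty bucket at time 0 and
   # R = 1,200,000, becomes eligible 10 ms later even if it arrived at 0.
   print(eligibility_time(t=0.0, size=12_000, b_prev=0.0, t_prev=0.0,
                          rate=1_200_000))   # -> 0.01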
Authors' Addresses

   Yoram Bernet
   Microsoft
   One Microsoft Way
   Redmond, WA 98052

   Phone: +1 425 936 9568
   EMail: ybernet@msn.com


   Steven Blake
   Ericsson
   920 Main Campus Drive, Suite 500
   Raleigh, NC 27606

   Phone: +1 919 472 9913
   EMail: steven.blake@ericsson.com


   Daniel Grossman
   Motorola Inc.
   20 Cabot Blvd.
   Mansfield, MA 02048

   Phone: +1 508 261 5312
   EMail: dan@dma.isg.mot.com


   Andrew Smith (editor)
   Harbour Networks
   Jiuling Building
   21 North Xisanhuan Ave.
   Beijing, 100089
   PRC

   Fax: +1 415 345 1827
   EMail: ah_smith@acm.org
Full Copyright Statement

   Copyright (C) The Internet Society (2002). All Rights Reserved.

   This document and translations of it may be copied and furnished to others, and derivative works that comment on or otherwise explain it or assist in its implementation may be prepared, copied, published and distributed, in whole or in part, without restriction of any kind, provided that the above copyright notice and this paragraph are included on all such copies and derivative works. However, this document itself may not be modified in any way, such as by removing the copyright notice or references to the Internet Society or other Internet organizations, except as needed for the purpose of developing Internet standards in which case the procedures for copyrights defined in the Internet Standards process must be followed, or as required to translate it into languages other than English.

   The limited permissions granted above are perpetual and will not be revoked by the Internet Society or its successors or assigns.

   This document and the information contained herein is provided on an "AS IS" basis and THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIMS ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE.

Acknowledgement

   Funding for the RFC Editor function is currently provided by the Internet Society.