Appendix A. Examples
A.1. Example of a Re-Marking Operation during Severe Congestion in the Interior Nodes
This appendix describes an example of a re-marking operation during severe congestion in the Interior nodes.  Per supported PHB, the Interior node can support the operation states depicted in Figure 26 when the per-flow congestion notification based on probing signaling scheme is used in combination with this severe congestion type.  Figure 27 depicts the same functionality when the per-flow congestion notification based on probing scheme is not used in combination with the severe congestion scheme.

The description given in this and the following appendices focuses on the situation where: (1) the "notified DSCP" marking is used in the congestion notification state, and (2) the "encoded DSCP" and "affected DSCP" markings are used in the severe congestion state.  In this case, the "notified DSCP" marking is used during the congestion notification state to mark all packets passing through an Interior node that operates in the congestion notification state.  In this way, and in combination with probing, a flow-based ECMP solution can be provided for the congestion notification state.  The "encoded DSCP" marking is used to encode and signal the excess rate, measured at Interior nodes, to the Egress nodes.  The "affected DSCP" marking is used to mark all packets that pass through a severely congested node and are not "encoded DSCP" marked.

Another possible situation could be derived in which both the congestion notification and severe congestion states use the "encoded DSCP" marking, without using the "notified DSCP" marking.  The "affected DSCP" marking is then used to mark all packets that pass through an Interior node that is in the severe congestion state and are not "encoded DSCP" marked.  In addition, the probe packet that is carried by an intra-domain RESERVE message and passes through Interior nodes SHOULD be "encoded DSCP" marked if the Interior node is in the congestion notification or severe congestion state.  Otherwise, the probe packet will remain unmarked.  In this way, an ECMP solution can be provided for both the congestion notification and severe congestion states.  The "encoded DSCP" packets signal an excess rate that is associated not only with Interior nodes that are in the severe congestion state, but also with Interior nodes that are in the congestion notification state.  The algorithm at the Interior node is similar to the algorithm described in the following appendix sections.  However, this method is not described in detail in this example.
    -------------------------------------------------
    |                    event B                    |
    |                                               V
  ----------          ---------------          ------------
  | Normal | event A  | Congestion  | event B  | Severe   |
  | state  |--------->| notification|--------->|congestion|
  |        |          | state       |          | state    |
  ----------          ---------------          ------------
     ^  ^                    |                      |
     |  |      event C       |                      |
     |  ----------------------                      |
     |               event D                        |
     ------------------------------------------------

      Figure 26: States of operation, severe congestion combined with
                 congestion notification based on probing

  ----------                --------------
  | Normal |    event B     | Severe     |
  | state  |--------------->| congestion |
  |        |                | state      |
  ----------                --------------
      ^                          |
      |          event E         |
      ------------------------------

      Figure 27: States of operation, severe congestion without
                 congestion notification based on probing

The terms used in Figures 26 and 27 are:

Normal state: represents the normal operation conditions of the node, i.e., no congestion.

Severe congestion state: represents the state in which the Interior node is severely congested with respect to a certain PHB.  It is important to emphasize that one of the targets of the severe congestion solution is to change from the severe congestion state directly back to the normal state.

Congestion notification state: the state in which the load is relatively high, close to the level at which congestion can occur.

event A: this event occurs when the incoming PHB rate is higher than the "congestion notification detection" threshold and lower than the "severe congestion detection" threshold.  The former threshold is used by the congestion notification based on probing scheme; see Sections 4.6.1.7 and 4.6.2.6.
event B: this event occurs when the incoming PHB rate is higher than the "severe congestion detection" threshold.

event C: this event occurs when the incoming PHB rate is lower than or equal to the "congestion notification detection" threshold.

event D: this event occurs when the incoming PHB rate is lower than or equal to the "severe congestion restoration" threshold.  It is important to emphasize that this event supports one of the targets of the severe congestion solution, namely to change from the severe congestion state directly back to the normal state.

event E: this event occurs when the incoming PHB rate is lower than or equal to the "severe congestion restoration" threshold.

Note that the "severe congestion detection", "severe congestion restoration", and admission thresholds SHOULD be higher than the "congestion notification detection" threshold, i.e.:

   "severe congestion detection" > "congestion notification detection"
   "severe congestion restoration" > "congestion notification detection"

Furthermore, the "severe congestion detection" threshold SHOULD be higher than or equal to the admission threshold that is used by the reservation-based and NSIS measurement-based signaling schemes:

   "severe congestion detection" >= admission threshold

Moreover, the "severe congestion restoration" threshold SHOULD be lower than or equal to the "severe congestion detection" threshold that is used by the reservation-based and NSIS measurement-based signaling schemes, that is:

   "severe congestion restoration" <= "severe congestion detection"

During severe congestion, the Interior node calculates, per traffic class (PHB), the incoming rate that is above the "severe congestion restoration" threshold, denoted as signaled_overload_rate, in the following way:

* A severely congested Interior node SHOULD take into account that packets might be dropped.  Therefore, before queuing and eventually dropping packets, the Interior node SHOULD count the total number of unmarked and re-marked bytes received by the severely congested node; denote this number as total_received_bytes.  Note that there are situations in which more than one Interior node in the same path becomes severely congested.  Therefore, any Interior node located behind a severely congested node MAY receive marked bytes.
When the "severe congestion detection" threshold per PHB is set equal to the maximum capacity allocated to one PHB used by the RMD-QOSM, it means that if the maximum capacity associated to a PHB is fully utilized and a packet belonging to this PHB arrives, then it is assumed that the Interior node will not forward this packet downstream. In other words, this packet will either be dropped or set to another PHB. Furthermore, this also means that after the severe congestion situation is solved, then the ongoing flows will be able to send their associated packets up to a total rate equal to the maximum capacity associated with the PHB. Therefore, when more than one Interior node located on the same path will be severely congested and when the Interior node receives "encoded DSCP" marked packets, it means that an Interior node located upstream is also severely congested. When the "severe congestion detection" threshold per PHB is set equal to the maximum capacity allocated to one PHB, then this Interior node MUST forward the "encoded DSCP" marked packets and it SHOULD NOT consider these packets during its local re-marking process. In other words, the Egress should see the excess rates encoded by the different severely congested Interior nodes as independent, and therefore, these independent excess rates will be added. When the "severe congestion detection" threshold per PHB is not set equal to the maximum capacity allocated to one PHB, this means that after the severe congestion situation is solved, the ongoing flows will not be able to send their associated packets up to a total rate equal to the maximum capacity associated with the PHB, but only up to the "severe_congestion_threshold". When more than one Interior node located on the same communication path is severely congested and when one of these Interior node receives "encoded_DSCP" marked packets, this Interior node SHOULD NOT mark unmarked, i.e., either "original DSCP" or "affected DSCP" or "notified DSCP" encoded packets, up to a rate equal to the difference between the maximum PHB capacity and the "severe congestion threshold", when the incoming "encoded DSCP" marked packets are already able to signal this difference. In this case, the "severe congestion threshold" SHOULD be configured in all Interior nodes, which are located in the RMD domain, and equal to: "severe_congestion_threshold" = Maximum PHB capacity - threshold_offset_rate The threshold_offset_rate represents rate and SHOULD have the same value in all Interior nodes.
* Before queuing and eventually dropping packets, at the end of each measurement interval of T seconds, calculate the current estimated overload rate, say measured_overload_rate, by using the following equation:

   measured_overload_rate =
      (total_received_bytes / T) - severe_congestion_restoration

To provide a reliable estimation of the encoded information, several techniques can be used; see [AtLi01], [AdCa03], [ThCo04], and [AnHa06].

Note that since marking is done in Interior nodes, the decisions are made at Egress nodes, and the termination of flows is performed by Ingress nodes, there is a significant delay until the overload information is learned by the Ingress nodes (see Section 6 of [CsTa05]).  The delay consists of the trip time of data packets from the severely congested Interior node to the Egress, the measurement interval, i.e., T, and the trip time of the notification signaling messages from Egress to Ingress.  Moreover, until the overload decreases at the severely congested Interior node, an additional trip time from the Ingress node to the severely congested Interior node MUST expire.  This is because, immediately before receiving the congestion notification, the Ingress MAY have sent out packets in the flows that were selected for termination.  That is, a terminated flow MAY contribute to congestion for a time longer than the time it takes packets to travel from the Ingress to the Interior node.

Without considering the above, Interior nodes would continue marking the packets until the measured utilization falls below the severe congestion restoration threshold.  In this way, in the end, more flows will be terminated than necessary, i.e., an overreaction takes place.  [CsTa05] provides a solution to this problem, where the Interior nodes use a sliding window memory to keep track of the signaled overload in a number of previous measurement intervals.  At the end of a measurement interval, T, before encoding and signaling the overload rate as "encoded DSCP" packets, the actual overload is decreased by the sum of the already signaled overload stored in the sliding window memory, since that overload is already being handled in the severe congestion handling control loop.  The sliding window memory consists of an integer number of cells, i.e., n = maximum number of cells.  Guidelines for configuring the sliding window parameters are given in [CsTa05].  At the end of each measurement interval, the newest calculated overload is pushed into the memory, and the oldest cell is dropped.  If Mi is the overload rate stored in the ith memory cell (i = 1..n), then at the end of every measurement interval, the overload rate that is signaled to the Egress node, i.e., signaled_overload_rate, is calculated as follows:
   Sum_Mi = 0
   For i = 1 to n
   {
      Sum_Mi = Sum_Mi + Mi
   }

   signaled_overload_rate = measured_overload_rate - Sum_Mi,

where Sum_Mi is calculated as above.  Next, the sliding memory is updated as follows:

   for i = 1..(n-1): Mi <- Mi+1
   Mn <- signaled_overload_rate

The number of bytes that have to be re-marked to satisfy the signaled overload rate, signaled_remarked_bytes, is calculated using the following pseudocode:

   IF severe_congestion_threshold <> Maximum PHB capacity THEN
   {
      IF (incoming_encoded-DSCP_rate <> 0) AND
         (incoming_encoded-DSCP_rate =< termination_offset_rate) THEN
      {
         signaled_remarked_bytes =
            ((signaled_overload_rate - incoming_encoded-DSCP_rate)*T)/N
      }
      ELSE IF (incoming_encoded-DSCP_rate > termination_offset_rate) THEN
         signaled_remarked_bytes =
            ((signaled_overload_rate - termination_offset_rate)*T)/N
      ELSE IF (incoming_encoded-DSCP_rate = 0) THEN
         signaled_remarked_bytes = signaled_overload_rate*T/N
   }
   ELSE
      signaled_remarked_bytes = signaled_overload_rate*T/N

where the incoming "encoded DSCP" rate is calculated as follows:

   incoming_encoded-DSCP_rate =
      ((number of "encoded DSCP" marked bytes received during T) * N)/T

The signaled_remarked_bytes value also represents the number of outgoing bytes (after the dropping stage) that MUST be re-marked, during each measurement interval T, by a node when it operates in severe congestion mode.
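The interaction of these formulas can be illustrated with a short sketch (illustrative Python, not normative; the class and parameter names are chosen for this example, and a real implementation would feed it the per-PHB byte counters collected before the dropping stage):

   class SevereCongestionMarker:
       # Illustrative sketch of the per-PHB calculation above:
       # sliding-window overload signaling and the number of bytes to
       # re-mark per measurement interval.  Names follow the pseudocode
       # in this appendix.

       def __init__(self, n_cells, scaling_n, interval_t, restoration_rate,
                    severe_congestion_threshold, max_phb_capacity,
                    termination_offset_rate):
           self.memory = [0.0] * n_cells      # sliding window cells M1..Mn
           self.N = scaling_n                 # domain-wide scaling parameter
           self.T = interval_t                # measurement interval (seconds)
           self.restoration = restoration_rate
           self.threshold = severe_congestion_threshold
           self.capacity = max_phb_capacity
           self.offset = termination_offset_rate

       def end_of_interval(self, total_received_bytes, encoded_dscp_bytes):
           # Overload above the restoration threshold in this interval.
           measured_overload_rate = (total_received_bytes / self.T
                                     - self.restoration)
           # Subtract overload already signaled in earlier intervals.
           signaled_overload_rate = measured_overload_rate - sum(self.memory)
           # Slide the window: drop the oldest cell, push the newest value.
           self.memory = self.memory[1:] + [signaled_overload_rate]
           # Rate already signaled by upstream severely congested nodes.
           incoming_encoded_rate = encoded_dscp_bytes * self.N / self.T
           # Bytes to re-mark in the coming interval (pseudocode above).
           if self.threshold != self.capacity:
               if 0 < incoming_encoded_rate <= self.offset:
                   remark = ((signaled_overload_rate - incoming_encoded_rate)
                             * self.T / self.N)
               elif incoming_encoded_rate > self.offset:
                   remark = ((signaled_overload_rate - self.offset)
                             * self.T / self.N)
               else:
                   remark = signaled_overload_rate * self.T / self.N
           else:
               remark = signaled_overload_rate * self.T / self.N
           # A non-positive result means no re-marking is needed.
           return max(remark, 0.0)

In this sketch, end_of_interval() would be invoked once per measurement interval T, per PHB.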
Note that, in order to process an overload situation higher than 100% of the maintained severe congestion threshold, all the nodes within the domain MUST be configured with, and maintain, a scaling parameter, e.g., the N used in the above equations; in combination with the marked bytes, e.g., signaled_remarked_bytes, such a high overload situation can then be calculated and represented.  N can be equal to or higher than 1.

Note that when incoming re-marked bytes are dropped, the operation of the severe congestion algorithm MAY be affected, e.g., the algorithm MAY become, in certain situations, slower.  An implementation of the algorithm MAY ensure, as much as possible, that the incoming marked bytes are not dropped.  This could, for example, be accomplished by using different dropping rate thresholds for marked and unmarked bytes.

Note that when the "affected DSCP" marking is used by a node that is congested due to a severe congestion situation, then all the outgoing packets that are not marked (i.e., by using the "encoded DSCP") have to be re-marked using the "affected DSCP" marking.  The "encoded DSCP" and the "affected DSCP" marked packets (when applied in the whole RMD domain) are propagated to the QNE Edge nodes.

Furthermore, note that when the congestion notification based on probing is used in combination with severe congestion, then in addition to the possible "encoded DSCP" and "affected DSCP", another DSCP for the re-marking of the same PHB is used (see Section 4.6.1.7).  This additional DSCP is denoted in this document as "notified DSCP".  When an Interior node operates in the severe congestion state (see Figure 27) and receives "notified DSCP" packets, these packets are considered to be unmarked packets (but not "affected DSCP" packets).  This means that during severe congestion, the "notified DSCP" packets can also be re-marked and encoded as either "encoded DSCP" or "affected DSCP" packets.

A.2. Example of a Detailed Severe Congestion Operation in the Egress Nodes
This appendix describes an example of a detailed severe congestion operation in the Egress nodes.  The states of operation in Egress nodes are similar to the ones described in Appendix A.1.  The definition of the events (see below) is, however, different from the definition of the events given in Figures 26 and 27:
* event A: when the Egress receives a predefined rate of "notified DSCP" marked bytes/packets, event A is activated (see Sections 4.6.1.7 and A.4).  The predefined rate of "notified DSCP" marked bytes is denoted as the congestion notification detection threshold.  Note that this congestion notification detection threshold can also be zero, meaning that event A is activated when the Egress node, during an interval T, receives at least one "notified DSCP" packet.

* event B: this event occurs when the Egress receives packets marked as either "encoded DSCP" or "affected DSCP" (when "affected DSCP" is applied in the whole RMD domain).

* event C: this event occurs when the rate of incoming "notified DSCP" packets decreases below the congestion notification detection threshold.  In the situation that the congestion notification detection threshold is zero, this means that event C is activated when the Egress node, during an interval T, does not receive any "notified DSCP" marked packets.

* event D: this event occurs when the Egress, during an interval T, does not receive packets marked as either "encoded DSCP" or "affected DSCP" (when "affected DSCP" is applied in the whole RMD domain).  Note that when "notified DSCP" is applied in the whole RMD domain for the support of congestion notification, this event could cause the following change in operation state.  When the Egress, during an interval T, does not receive (1) packets marked as either "encoded DSCP" or "affected DSCP" (when "affected DSCP" is applied in the whole RMD domain) and (2) "notified DSCP" marked packets, the operation state changes from the severe congestion state to the normal state.  When the Egress, during an interval T, does not receive (1) packets marked as either "encoded DSCP" or "affected DSCP" (when "affected DSCP" is applied in the whole RMD domain) but (2) does receive "notified DSCP" marked packets, the operation state changes from the severe congestion state to the congestion notification state.

* event E: this event occurs when the Egress, during an interval T, does not receive packets marked as either "encoded DSCP" or "affected DSCP" (when "affected DSCP" is applied in the whole RMD domain).
An example of the algorithm for calculating the number of flows, associated with each priority class, that have to be terminated is explained by the pseudocode below.

The Edge nodes are able to support severe congestion handling by: (1) identifying which flows were affected by the severe congestion and (2) selecting and terminating some of these flows such that the quality of service of the remaining flows is recovered.  The "encoded DSCP" and the "affected DSCP" marked packets (when applied in the whole RMD domain) are received by the QNE Edge node.  The QNE Edge nodes keep per-flow state, and therefore they can translate the calculated bandwidth to be terminated into a number of flows.  The QNE Egress node records the excess rate and the identity of all the flows arriving at the QNE Egress node with "encoded DSCP" and with "affected DSCP" (when applied in the whole RMD domain); only these flows, which are the ones passing through the severely congested Interior node(s), are candidates for termination.  The excess rate is calculated by measuring the rate of all the "encoded DSCP" data packets that arrive at the QNE Egress node.  The measured excess rate is converted by the Egress node by multiplying it by the factor N, which was used by the QNE Interior node(s) to encode the overload level.

When different priority flows are supported, all the low priority flows that arrived at the Egress node are terminated first.  Next, all the medium priority flows are stopped and finally, if necessary, even high priority flows are chosen.  Within a priority class, both "encoded DSCP" and "affected DSCP" flows are considered before the mechanism moves to a higher priority class.  Finally, for each flow that has to be terminated, the Egress node sends a NOTIFY message to the Ingress node, which stops the flow.  Below, this algorithm is described in detail.

First, when the Egress operates in the severe congestion state, the total amount of re-marked bandwidth associated with the PHB traffic class, say total_congested_bandwidth, is calculated.  Note that when the node maintains information about each Ingress/Egress pair aggregate, then the total_congested_bandwidth MUST be calculated per Ingress/Egress pair reservation aggregate.  This bandwidth represents the severely congested bandwidth that SHOULD be terminated.  The total_congested_bandwidth can be calculated as follows:

   total_congested_bandwidth = N*input_remarked_bytes/T
where input_remarked_bytes represents the number of "encoded DSCP" marked bytes that arrive at the Egress during one measurement interval T, and N is defined as in Sections 4.6.1.6.2.1 and A.1.

The term denoted as terminated_bandwidth is a temporary variable representing the total bandwidth, belonging to the same PHB traffic class, that has been selected for termination so far.  The terminate_bandwidth(priority_class) is the total bandwidth associated with flows of priority class equal to priority_class.  The parameter priority_class is an integer fulfilling: 0 =< priority_class =< Maximum_priority.

The QNE Egress node records the identity of the QNE Ingress node that forwarded each flow, the total_congested_bandwidth, and the identity of all the flows arriving at the QNE Egress node with "encoded DSCP" and "affected DSCP" (when applied in the whole RMD domain).  This ensures that only these flows, which are the ones passing through the severely overloaded QNE Interior node(s), are candidates for termination.

The selection of the flows to be terminated is described in the pseudocode given below, which is realized by the function denoted as calculate_terminate_flows().  The calculate_terminate_flows() function uses the terminate_bandwidth(priority_class) value and translates this bandwidth value into the number of flows that have to be terminated.  Only the "encoded DSCP" flows and "affected DSCP" flows (when applied in the whole RMD domain), which are the ones passing through the severely overloaded Interior node(s), are candidates for termination.  After the flows to be terminated are selected, the sum_bandwidth_terminate(priority_class) value is calculated, which is the sum of the bandwidth associated with the flows, belonging to a certain priority class, that will certainly be terminated.  The constraint on finding the total number of flows that have to be terminated is that sum_bandwidth_terminate(priority_class) SHOULD be smaller than or approximately equal to terminate_bandwidth(priority_class).
   terminated_bandwidth = 0;
   priority_class = 0;
   while terminated_bandwidth < total_congested_bandwidth
   {
      terminate_bandwidth(priority_class) =
         total_congested_bandwidth - terminated_bandwidth;
      calculate_terminate_flows(priority_class);
      terminated_bandwidth =
         sum_bandwidth_terminate(priority_class) + terminated_bandwidth;
      priority_class = priority_class + 1;
   }

If the Egress node maintains Ingress/Egress pair reservation aggregates, then the above algorithm is performed for each Ingress/Egress pair reservation aggregate.  Finally, for each flow that has to be terminated, the QNE Egress node sends a NOTIFY message to the QNE Ingress node to terminate the flow.
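The pseudocode above leaves calculate_terminate_flows() abstract.  The following sketch (illustrative Python; the flow-table layout and all names are assumptions made for this example) shows one possible greedy realization of the complete loop: it converts the marked bytes into total_congested_bandwidth and then selects candidate flows per priority class until the congested bandwidth is approximately covered, possibly overshooting by at most one flow:

   def select_flows_to_terminate(input_remarked_bytes, T, N,
                                 candidate_flows, max_priority):
       # candidate_flows: list of (flow_id, priority_class, bandwidth) for
       # flows seen with "encoded DSCP" or "affected DSCP" marking.
       total_congested_bandwidth = N * input_remarked_bytes / T
       selected = []
       terminated_bandwidth = 0.0
       priority_class = 0
       while (terminated_bandwidth < total_congested_bandwidth
              and priority_class <= max_priority):
           remaining = total_congested_bandwidth - terminated_bandwidth
           # calculate_terminate_flows(): greedily pick flows of this
           # priority class until 'remaining' is (approximately) covered.
           for flow_id, prio, bw in candidate_flows:
               if prio != priority_class or remaining <= 0:
                   continue
               selected.append(flow_id)
               terminated_bandwidth += bw
               remaining -= bw
           priority_class += 1
       return selected

For each selected flow, the QNE Egress would then send a NOTIFY message to the corresponding QNE Ingress, as described above.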
A.3. Example of a Detailed Re-Marking Admission Control (Congestion Notification) Operation in Interior Nodes

This appendix describes an example of a detailed re-marking admission control (congestion notification) operation in Interior nodes.  The predefined congestion notification threshold (see Appendix A.1) is set according to, and usually less than, an engineered bandwidth limitation, i.e., the admission threshold, e.g., based on a Service Level Agreement or a capacity limitation of specific links.  The difference between the congestion notification threshold and the engineered bandwidth limitation, i.e., the admission threshold, provides an interval where the signaling information on resource limitation is already sent by a node but the actual resource limitation is not reached.  This is due to the fact that data packets associated with an admitted session have not yet arrived, which allows the admission control process available at the Egress to interpret the signaling information and reject new calls before congestion is reached.  Note that in the situation when the data rate is higher than the preconfigured congestion notification rate, data packets are also re-marked (see Section 4.6.1.6.2.1).

To distinguish between congestion notification and severe congestion, two methods MAY be used (see Appendix A.1):

* Using different <DSCP> values (re-marked <DSCP> values).  The re-marked DSCP that is used for this purpose is denoted as "notified DSCP" in this document.  When this method is used and the Interior node is in the "congestion notification" state (see Appendix A.1), then the node SHOULD re-mark all the unmarked bytes passing through the node using the "notified DSCP".  Note that this method can only be applied if all nodes in the RMD domain use the "notified DSCP" marking.  In this way, probe packets that pass through an Interior node that operates in the congestion notification state are also encoded using the "notified DSCP" marking.  (A short sketch of this method is given after this list.)

* Using the "encoded DSCP" marking for both congestion notification and severe congestion.  This method is not described in detail in this example appendix.
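As a summary of the first method, the per-PHB state decision at an Interior node could be sketched as follows (illustrative Python; the function and threshold names are chosen for this example and follow Appendix A.1):

   def interior_marking_state(incoming_phb_rate,
                              congestion_notification_detection,
                              severe_congestion_detection):
       # Thresholds follow Appendix A.1:
       #   congestion_notification_detection < severe_congestion_detection
       if incoming_phb_rate > severe_congestion_detection:
           # Severe congestion: excess traffic is "encoded DSCP" marked and
           # the remaining traffic is "affected DSCP" marked (Appendix A.1).
           return "severe_congestion"
       if incoming_phb_rate > congestion_notification_detection:
           # Congestion notification: all unmarked bytes passing through
           # the node are re-marked with the "notified DSCP".
           return "congestion_notification"
       return "normal"

A.4. Example of a Detailed Admission Control (Congestion Notification) Operation in Egress Nodes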
This appendix describes an example of a detailed admission control (congestion notification) operation in Egress nodes.  The admission control congestion notification procedure can be applied only if the Egress maintains the Ingress/Egress pair aggregate.  When the operation state of the Ingress/Egress pair aggregate is "congestion notification" (see Appendix A.2), then the implementation of the algorithm depends on how the congestion notification situation is notified to the Egress.  As mentioned in Appendix A.3, two methods are used:

* Using the "notified DSCP".  During a measurement interval T, the Egress counts the number of "notified DSCP" marked bytes that belong to the same PHB and are associated with the same Ingress/Egress pair aggregate, say input_notified_bytes.  We denote the corresponding rate as incoming_notified_rate.

* Using the "encoded DSCP".  In this case, during a measurement interval T, the Egress measures the input_notified_bytes by counting the "encoded DSCP" bytes.

Below, only a detailed description of the first method is given.  The incoming_congestion_rate can then be calculated as follows:

   incoming_congestion_rate = input_notified_bytes/T

If the incoming_congestion_rate is higher than a preconfigured congestion notification threshold, then the communication path between Ingress and Egress is considered to be congested.  Note that the preconfigured congestion notification threshold can be set to "0".  In this case, the Egress node will operate in the congestion notification state from the moment that it receives at least one "notified DSCP" encoded packet.

When the Egress node operates in the "congestion notification" state and an end-to-end RESERVE (probe) arrives at the Egress, then this request SHOULD be rejected.  Note that this happens only when the probe packet is either "notified DSCP" or "encoded DSCP" marked.  In this way, it is ensured that the end-to-end RESERVE (probe) packet passed through the node that is congested.  This feature is very useful when ECMP-based routing is used, because only flows that actually pass through the congested router are detected.

If such an Ingress/Egress pair aggregated state is not available when the (probe) RESERVE message arrives at the Egress, then this request is accepted if the DSCP of the packet carrying the RESERVE message is unmarked.  Otherwise (if the packet is either "notified DSCP" or "encoded DSCP" marked), it is rejected.
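The Egress-side behavior described in this appendix could be sketched as follows (illustrative Python; the function name, parameters, and return convention are assumptions made for this example).  It computes the incoming congestion rate for one Ingress/Egress aggregate and applies the probe-rejection rule, so that only probes that actually passed through the congested router are rejected:

   def admit_probe(input_notified_bytes, T, notification_threshold,
                   probe_marked, aggregate_state_exists=True):
       # probe_marked: True if the packet carrying the end-to-end RESERVE
       # (probe) arrived "notified DSCP" or "encoded DSCP" marked.
       if not aggregate_state_exists:
           # Without an Ingress/Egress aggregate state, admit only
           # unmarked probes.
           return not probe_marked
       incoming_congestion_rate = input_notified_bytes / T
       congestion_notification = (incoming_congestion_rate
                                  > notification_threshold)
       # With ECMP, an unmarked probe took a non-congested path and can
       # still be admitted.
       if congestion_notification and probe_marked:
           return False
       return True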
A.5. Example of Selecting Bidirectional Flows for Termination during Severe Congestion

This appendix describes an example of selecting bidirectional flows for termination during severe congestion.  When severe congestion occurs, e.g., in the forward path, and the algorithm terminates flows to solve the severe congestion in the forward path, then the reserved bandwidth associated with the terminated bidirectional flows is also released.  Therefore, a careful selection of the flows that have to be terminated SHOULD take place.  A possible method of selecting the flows belonging to the same priority type passing through the severe congestion point on a unidirectional path can be the following:

* The Egress node SHOULD select, if possible, unidirectional flows first, instead of bidirectional flows.

* The Egress node SHOULD select, if possible, bidirectional flows that reserved a relatively small amount of resources on the path reversed to the path of congestion.

A.6. Example of a Severe Congestion Solution for Bidirectional Flows Congested Simultaneously on Forward and Reverse Paths
This appendix describes an example of a severe congestion solution for bidirectional flows congested simultaneously on forward and reverse paths.
This scenario describes a solution using the combination of the severe congestion solutions described in Section 4.6.2.5.2.  It is considered that the severe congestion occurs simultaneously in the forward and reverse directions, which MAY affect the same bidirectional flows.

When the QNE Edges maintain per-flow intra-domain QoS-NSLP operational states, the steps can be the following (see Figure 28).  Consider that the Egress node selects a number of bidirectional flows to be terminated.  In this case, the Egress will send, for each bidirectional flow, a NOTIFY message to the Ingress.  If the Ingress receives these NOTIFY messages and its operational state (associated with the reverse path) is the severe congestion state (see Figures 26 and 27), then the Ingress operates in the following way:

* For each NOTIFY message, the Ingress SHOULD identify the bidirectional flows that have to be terminated.

* The Ingress then calculates the total bandwidth that SHOULD be released in the reverse direction (thus not in the forward direction) if the bidirectional flows are terminated (preempted), say "notify_reverse_bandwidth".  This bandwidth can be calculated as the sum of the bandwidth values associated with all the end-to-end sessions that received a (severe congestion) NOTIFY message.

* Furthermore, using the received marked packets (from the reverse path), the Ingress will calculate, using the algorithm used by an Egress and described in Appendix A.2, the total bandwidth that has to be terminated in order to solve the congestion in the reverse path direction, say "marked_reverse_bandwidth".

* The Ingress then calculates the bandwidth of the additional flows that have to be terminated, say "additional_reverse_bandwidth", in order to solve the severe congestion in the reverse direction, by taking into account:

   ** the bandwidth in the reverse direction of the bidirectional flows that were appointed by the Egress (the ones that received a NOTIFY message) to be preempted, i.e., "notify_reverse_bandwidth";

   ** the total amount of bandwidth in the reverse direction that has been calculated by using the received marked packets, i.e., "marked_reverse_bandwidth".
   Figure 28 (message sequence chart between QNE(Ingress), the NTLP
   stateless/stateful Interior NEs, and QNE(Egress)): Intra-domain RMD
   severe congestion handling for bidirectional reservation (congestion
   in both forward and reverse directions).  The chart shows unmarked
   and marked user data bytes arriving at the QNE Egress and QNE
   Ingress, the (severe congestion) NOTIFY message sent from the Egress
   to the Ingress, and the "forward - T tear" and "reverse - T tear"
   RESERVE (RMD-QSPEC) messages used to tear down the selected flows.

This additional bandwidth can be calculated using the following algorithm:

   IF ("marked_reverse_bandwidth" > "notify_reverse_bandwidth") THEN
      "additional_reverse_bandwidth" =
         "marked_reverse_bandwidth" - "notify_reverse_bandwidth";
   ELSE
      "additional_reverse_bandwidth" = 0

* The Ingress terminates the flows that experienced severe congestion in the forward path and received a (severe congestion) NOTIFY message.
* If possible, the Ingress SHOULD terminate unidirectional flows that use the same Egress-Ingress reverse-direction communication path, to satisfy the release of a total bandwidth up to the "additional_reverse_bandwidth"; see Appendix A.5.

* If the required number of unidirectional flows (to satisfy the above) is not available, then a number of bidirectional flows that use the same Egress-Ingress reverse-direction communication path MAY be selected for preemption, in order to satisfy the release of a total bandwidth up to the "additional_reverse_bandwidth".  Note that, following the guidelines given in Appendix A.5, the bidirectional flows that reserved a relatively small amount of resources on the path reversed to the path of congestion SHOULD be selected for termination first.

(A sketch of this Ingress-side calculation and selection is given at the end of this appendix.)

When the QNE Edges maintain aggregated intra-domain QoS-NSLP operational states, the steps can be the following:

* The Egress calculates the bandwidth to be terminated using the same method as described in Section 4.6.1.6.2.2.  The Egress includes this bandwidth value in a <PDR Bandwidth> parameter within a "PDR_Congestion_Report" container that is carried by the end-to-end NOTIFY message.

* The Ingress receives the NOTIFY message and reads the <PDR Bandwidth> value included in the "PDR_Congestion_Report" container.  Note that this value is denoted as "notify_reverse_bandwidth" in the situation that the QNE Edges maintain per-flow intra-domain QoS-NSLP operational states, but it is calculated differently.  The variables "marked_reverse_bandwidth" and "additional_reverse_bandwidth" are calculated using the same steps as explained for the situation that the QNE Edges maintain per-flow intra-domain QoS-NSLP states.

* Regarding the termination of flows that use the same Egress-Ingress reverse-direction communication path, the Ingress can follow the same procedures as in the situation that the QNE Edges maintain per-flow intra-domain QoS-NSLP operational states.

The RMD-aggregated (reduced-state) reservations maintained by the Interior nodes can be reduced in the "forward" and "reverse" directions by using the procedure described in Section 4.6.2.3: the value equal to <notify_reverse_bandwidth> is included in the <Peak Data Rate-1 (p)> value of the local RMD-QSPEC <TMOD-1> parameter of the RMD-QOSM <QoS Desired> field carried by the forward intra-domain RESERVE message, and the <additional_reverse_bandwidth> value is included in the <PDR Bandwidth> parameter within the "PDR_Release_Request" container that is carried by the same intra-domain RESERVE message.
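The Ingress-side procedure for the per-flow case could be sketched as follows (illustrative Python; the flow-table layout and all names are assumptions made for this example).  It derives the "additional_reverse_bandwidth" and then applies the Appendix A.5 preference for unidirectional flows and small reverse-direction reservations:

   def select_additional_reverse_flows(notified_reverse_bandwidths,
                                       marked_reverse_bandwidth,
                                       reverse_path_flows):
       # notified_reverse_bandwidths: reverse-direction bandwidths of the
       #   bidirectional flows for which a severe congestion NOTIFY arrived.
       # reverse_path_flows: list of (flow_id, is_bidirectional,
       #   reverse_bandwidth) candidates on the same Egress-Ingress reverse
       #   path, excluding the flows already being terminated.
       notify_reverse_bandwidth = sum(notified_reverse_bandwidths)
       additional = max(marked_reverse_bandwidth - notify_reverse_bandwidth,
                        0.0)
       # Prefer unidirectional flows, then bidirectional flows with small
       # reverse-direction reservations (Appendix A.5 guidelines).
       ordering = sorted(reverse_path_flows, key=lambda f: (f[1], f[2]))
       selected, released = [], 0.0
       for flow_id, _is_bidir, bw in ordering:
           if released >= additional:
               break
           selected.append(flow_id)
           released += bw
       return selected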
A.7. Example of Preemption Handling during Admission Control

This appendix describes an example of how preemption handling can be supported during admission control.  It describes the mechanism that can be supported by the QNE Ingress, QNE Interior, and QNE Egress nodes to satisfy preemption during the admission control process.  This mechanism uses the preemption building blocks specified in [RFC5974].

A.7.1. Preemption Handling in QNE Ingress Nodes
If a QNE Ingress receives a RESERVE for a session that causes other session(s) to be preempted, then, for each of these to-be-preempted sessions, the QNE Ingress follows the steps below:

Step_1: The QNE Ingress MUST send a tearing RESERVE downstream and add a BOUND-SESSION-ID, with a <Binding_Code> value equal to "Indicated session caused preemption", that indicates the SESSION-ID of the session that caused the preemption.  Furthermore, an <INFO-SPEC> object with an error code value equal to "Reservation preempted" has to be included in each of these tearing RESERVE messages.

The selection of which flows have to be preempted can be based on predefined policies.  For example, this selection process can be based on the MRI associated with the high and low priority sessions.  In particular, the QNE Ingress can select low(er) priority session(s) whose MRI is "close" (especially the target IP) to the one associated with the higher priority session.  This means that typically the high priority session and the to-be-preempted lower priority sessions follow the same communication path and pass through the same QNE Egress node.  Furthermore, the number of lower priority sessions that have to be preempted per each high priority session has to be such that the resources requested by the higher priority session SHOULD be lower than or equal to the sum of the reserved resources associated with the lower priority sessions that have to be preempted.  (A selection sketch is given after Step_3 below.)
Step_2: For each of the sent tearing RESERVE(s), the QNE Ingress will send a NOTIFY message, with an <INFO-SPEC> object with an error code value equal to "Reservation preempted", towards the QNI.

Step_3: After sending the (tearing) RESERVE(s) for the preempted sessions, the QNE Ingress will send the (reserving) RESERVE that caused the preemption downstream towards the QNE Egress.
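The Step_1 selection policy could be sketched as follows (illustrative Python; this is only one possible policy, and all names are chosen for this example).  It picks lower priority sessions that share the path with the new high priority session until their reserved bandwidth covers the requested bandwidth:

   def select_sessions_to_preempt(requested_bandwidth, candidate_sessions):
       # candidate_sessions: list of (session_id, reserved_bandwidth),
       # already filtered to lower-priority sessions whose MRI is "close"
       # to the new session's MRI (same path, same QNE Egress).
       selected, covered = [], 0.0
       # Largest reservations first, to preempt as few sessions as possible.
       for session_id, bw in sorted(candidate_sessions,
                                    key=lambda s: s[1], reverse=True):
           if covered >= requested_bandwidth:
               break
           selected.append(session_id)
           covered += bw
       # The request SHOULD NOT exceed the sum of the preempted
       # reservations; otherwise, no suitable set was found.
       return selected if covered >= requested_bandwidth else []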
A.7.2. Preemption Handling in QNE Interior Nodes

Upon receiving the first (tearing) RESERVE that carries the <BOUND-SESSION-ID> object with a <Binding_Code> value equal to "Indicated session caused preemption" and an <INFO-SPEC> object with an error code value equal to "Reservation preempted", the QNE Interior considers that this session has to be preempted.

In this case, the QNE Interior creates a so-called "preemption state", which is identified by the SESSION-ID carried in the preemption-related <BOUND-SESSION-ID> object.  Furthermore, this "preemption state" will include the SESSION-ID of the session associated with the (tearing) RESERVE.  Subsequently, if additional tearing RESERVE(s) arrive that include the same values of the BOUND-SESSION-ID and <INFO-SPEC> objects, then the associated SESSION-IDs of these (tearing) RESERVE messages will be included in the already created "preemption state".  The QNE will then set a timer, with a value that is high enough to ensure that it will not expire before the (reserving) RESERVE arrives.  Note that when the "preemption state" timer expires, the bandwidth associated with the preempted session(s) will have to be released, following the normal RMD-QOSM bandwidth release procedure.

If the QNE Interior node does not receive all the to-be-preempted (tearing) RESERVE messages sent by the QNE Ingress before their associated (reserving) RESERVE message arrives, then the (reserving) RESERVE message will not reserve any resources and this message will be "M" marked (see Section 4.6.1.2).  Note that this is not a typical situation.  Typically, this situation can only occur when at least one of the (tearing) RESERVE messages is dropped due to an error condition.
Otherwise, if the QNE Interior receives all the to-be-preempted (tearing) RESERVE messages sent by the QNE Ingress, then the QNE Interior will remove the pending resources and make the new reservation using the normal RMD-QOSM bandwidth release and reservation procedures.
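The "preemption state" kept by QNE Interior nodes (and, as described next, by QNE Egress nodes) could be represented as follows (illustrative Python; the structure and names are assumptions made for this example):

   import time

   class PreemptionState:
       # Keyed by the SESSION-ID carried in the preemption-related
       # <BOUND-SESSION-ID> object of the tearing RESERVE messages.
       def __init__(self, bound_session_id, timeout_seconds):
           self.bound_session_id = bound_session_id  # session causing preemption
           self.preempted_session_ids = set()        # sessions to be preempted
           # The timer must be large enough to outlive the arrival of the
           # (reserving) RESERVE that caused the preemption.
           self.expiry = time.monotonic() + timeout_seconds

       def add_tearing_reserve(self, session_id):
           self.preempted_session_ids.add(session_id)

       def expired(self):
           # On expiry, the bandwidth of the preempted sessions is released
           # using the normal RMD-QOSM release procedure.
           return time.monotonic() >= self.expiry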
A.7.3. Preemption Handling in QNE Egress Nodes

Similar to the QNE Interior operation, the QNE Egress, upon receiving the first (tearing) RESERVE that carries the <BOUND-SESSION-ID> object with a <Binding_Code> value equal to "Indicated session caused preemption" and an <INFO-SPEC> object with an error code value equal to "Reservation preempted", considers that this session has to be preempted.

Similar to the QNE Interior operation, the QNE Egress creates a so-called "preemption state", which is identified by the SESSION-ID carried in the preemption-related <BOUND-SESSION-ID> object.  This "preemption state" will store the same type of information and use the same timer value as specified in Appendix A.7.2.  Subsequently, if additional tearing RESERVE(s) arrive that include the same values of the BOUND-SESSION-ID and <INFO-SPEC> objects, then the associated SESSION-IDs of these (tearing) RESERVE messages will be included in the already created "preemption state".

If the (reserving) RESERVE message sent by the QNE Ingress node has arrived and is not "M" marked, and if all the to-be-preempted (tearing) RESERVE messages have arrived, then the QNE Egress will remove the pending resources and make the new reservation using the normal RMD-QOSM procedures.  If the QNE Egress receives an "M" marked RESERVE message, then the QNE Egress will use the normal partial RMD-QOSM procedure to release the partially reserved resources associated with the "M" marked RESERVE (see Section 4.6.1.2).

If the QNE Egress does not receive all the to-be-preempted (tearing) RESERVE messages sent by the QNE Ingress before their associated and not "M" marked (reserving) RESERVE message arrives, then the following steps can be followed:

* If the QNE Egress uses an end-to-end QOSM that supports preemption handling, then the QNE Egress has to calculate and select new lower priority sessions that have to be terminated.  How the preempted sessions are selected and signaled to the downstream QNEs is similar to the operation specified in Appendix A.7.1.
* If the QNE Egress does not use an end-to-end QOSM that supports preemption handling, then the QNE Egress has to reject the requesting (reserving) RESERVE message associated with the high priority session (see Section 4.6.1.2).

Note that, typically, the situation in which the QNE Egress does not receive all the to-be-preempted (tearing) RESERVE messages sent by the QNE Ingress can only occur when at least one of the (tearing) RESERVE messages is dropped due to an error condition.

A.8. Example of a Retransmission Procedure within the RMD Domain
This appendix describes an example of a retransmission procedure that can be used in the RMD domain.  If the retransmission of intra-domain RESERVE messages within the RMD domain is not disallowed, then all the QNE Interior nodes SHOULD use the functionality described in this section.

In this situation, QNE Interior nodes maintain a replay cache in which each entry contains the <RSN>, the <SESSION-ID> (available via GIST), the <REFRESH-PERIOD> (available via the QoS-NSLP [RFC5974]), and the last received "PHR Container" <Parameter ID> carried by the RMD-QSPEC for each session [RFC5975].  Thus, this solution uses information carried by QoS-NSLP objects [RFC5974] and parameters carried by the RMD-QSPEC "PHR Container".  The following phases can be distinguished:

Phase 1: Create Replay Cache Entry

When an Interior node receives an intra-domain RESERVE message and its cache is empty or there is no matching entry, it reads the <Parameter ID> field of the "PHR Container" of the received message.  If the <Parameter ID> is a PHR_RESOURCE_REQUEST, which indicates that the intra-domain RESERVE message is a reservation request, then the QNE Interior node creates a new entry in the cache, copies the <RSN>, <SESSION-ID>, and <Parameter ID> to the entry, and sets the <REFRESH-PERIOD>.

By using the information stored in the list, the Interior node verifies whether or not the received intra-domain RESERVE message was sent by an adversary.  For example, if the <SESSION-ID> and <RSN> of a received intra-domain RESERVE message match the values stored in the list, then the Interior node checks the <Parameter ID> part.
If the <Parameter ID> is different, then:

   Situation D1: the <Parameter ID> in its own list is
   PHR_RESOURCE_REQUEST, and the <Parameter ID> in the message is
   PHR_REFRESH_UPDATE;

   Situation D2: the <Parameter ID> in its own list is
   PHR_RESOURCE_REQUEST or PHR_REFRESH_UPDATE, and the <Parameter ID>
   in the message is PHR_RELEASE_REQUEST;

   Situation D3: the <Parameter ID> in its own list is
   PHR_REFRESH_UPDATE, and the <Parameter ID> in the message is
   PHR_RESOURCE_REQUEST.

For Situation D1, the QNE Interior node processes this message by RMD-QOSM default operation, reserves bandwidth, updates the entry, and passes the message to downstream nodes.  For Situation D2, the QNE Interior node processes this message by RMD-QOSM default operation, releases bandwidth, deletes all entries associated with the session, and passes the message to downstream nodes.  For Situation D3, the QNE Interior node does not use/process the local RMD-QSPEC <TMOD-1> parameter carried by the received intra-domain RESERVE message.  Furthermore, the <K> flag in the "PHR Container" has to be set such that the local RMD-QSPEC <TMOD-1> parameter carried by the intra-domain RESERVE message is not processed/used by a QNE Interior node.

If the <Parameter ID> is the same, then:

   Situation S1: the <Parameter ID> is equal to PHR_RESOURCE_REQUEST;

   Situation S2: the <Parameter ID> is equal to PHR_REFRESH_UPDATE.

For Situation S1, the QNE Interior node does not process the intra-domain RESERVE message, but just passes it to downstream nodes, because it might have been retransmitted by the QNE Ingress node.  For Situation S2, the QNE Interior node processes the first incoming intra-domain (refresh) RESERVE message within a refresh period, updates the entry, and forwards it to the downstream nodes.

If only the <SESSION-ID> matches an entry in the list, then the QNE Interior node checks the <RSN>.  Here also, two situations can be distinguished.  If a rerouting takes place (see Section 5.2.5.2 in [RFC5974]), the <RSN> in the message will be equal to either <RSN + 2> of the stored list, if it is not a tearing RESERVE, or <RSN - 1> of the stored list, if it is a tearing RESERVE:
The QNE Interior node will check the <Parameter ID> part:

   If the <RSN> in the message is equal to <RSN + 2> in the stored
   list and the <Parameter ID> is a PHR_RESOURCE_REQUEST or
   PHR_REFRESH_UPDATE, then the received intra-domain RESERVE message
   has to be interpreted and processed as a typical (non-tearing)
   RESERVE message, which is caused by rerouting; see Section 5.2.5.2
   in [RFC5974].

   If the <RSN> in the message is equal to <RSN - 1> in the stored
   list and the <Parameter ID> is a PHR_RELEASE_REQUEST, then the
   received intra-domain RESERVE message has to be interpreted and
   processed as a typical (tearing) RESERVE message, which is caused
   by rerouting (see Section 5.2.5.2 in [RFC5974]).

If situations other than the ones described above occur, then the QNE Interior node does not use/process the local RMD-QSPEC <TMOD-1> parameter carried by the received intra-domain RESERVE message.  Furthermore, the <K> flag has to be set; see above.

Phase 2: Update Replay Cache Entry

When a QNE Interior node receives an intra-domain RESERVE message, it retrieves the corresponding entry from the cache and compares the values.  If the message is valid, the Interior node will update the <Parameter ID> and <REFRESH-PERIOD> in the list entry.

Phase 3: Delete Replay Cache Entry

When a QNE Interior node receives an intra-domain (tear) RESERVE message and an entry in the replay cache can be found, then the QNE Interior node will delete this entry after processing the message.  Furthermore, the Interior node will delete cache entries if it did not receive, during the <REFRESH-PERIOD> period, an intra-domain (refresh) RESERVE message with a <Parameter ID> value equal to PHR_REFRESH_UPDATE.
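Phases 1-3 could be sketched as follows (illustrative Python; the cache layout and return values are assumptions made for this example, only the decision logic for situations D1-D3, S1, S2, and the rerouting cases is shown, and the <REFRESH-PERIOD> timer handling is omitted):

   PHR_RESOURCE_REQUEST = "PHR_RESOURCE_REQUEST"
   PHR_REFRESH_UPDATE = "PHR_REFRESH_UPDATE"
   PHR_RELEASE_REQUEST = "PHR_RELEASE_REQUEST"

   replay_cache = {}   # SESSION-ID -> {"rsn": ..., "param_id": ...}

   def check_intra_domain_reserve(session_id, rsn, param_id):
       entry = replay_cache.get(session_id)
       if entry is None:
           if param_id == PHR_RESOURCE_REQUEST:       # Phase 1: new entry
               replay_cache[session_id] = {"rsn": rsn, "param_id": param_id}
               return "process"
           return "ignore_tmod"   # set <K>; do not use the local <TMOD-1>
       if rsn == entry["rsn"]:
           if param_id != entry["param_id"]:
               if param_id == PHR_RELEASE_REQUEST:    # situation D2
                   del replay_cache[session_id]       # Phase 3
                   return "process"
               if (entry["param_id"] == PHR_RESOURCE_REQUEST
                       and param_id == PHR_REFRESH_UPDATE):  # situation D1
                   entry["param_id"] = param_id       # Phase 2
                   return "process"
               return "ignore_tmod"                   # situation D3
           if param_id == PHR_RESOURCE_REQUEST:       # situation S1
               return "forward_only"                  # likely a retransmission
           return "process_first_refresh_only"        # situation S2
       # Rerouting cases (Section 5.2.5.2 of [RFC5974]).
       if rsn == entry["rsn"] + 2 and param_id in (PHR_RESOURCE_REQUEST,
                                                   PHR_REFRESH_UPDATE):
           return "process"                           # non-tearing RESERVE
       if rsn == entry["rsn"] - 1 and param_id == PHR_RELEASE_REQUEST:
           return "process"                           # tearing RESERVE
       return "ignore_tmod"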
A.9. Example on Matching the Initiator QSPEC to the Local RMD-QSPEC

Section 3.4 of [RFC5975] describes an example of how the QSPEC can be used within the QoS-NSLP.  Figure 29 illustrates a situation where a QNI and a QNR are using an end-to-end QOSM, denoted in this context as Z-e2e.  It is considered that the QNI access network side is a wireless access network built on a generation "X" technology with QoS support as defined by generation "X", while the QNR access network is a wired/fixed access network with its own defined QoS support.
Furthermore, it is considered that the shown QNE Edges are located at the boundary of an RMD domain and that the shown QNE Interior nodes are located inside the RMD domain.  The QNE Edges are able to run both the Z-e2e QOSM and the RMD-QOSM, while the QNE Interior nodes can only run the RMD-QOSM.  The QNI is considered to be a wireless laptop, for example, while the QNR is considered to be a PC.

  |------|   |------|                             |------|   |------|
  |Z-e2e |<->|Z-e2e |<--------------------------->|Z-e2e |<->|Z-e2e |
  | QOSM |   | QOSM |                             | QOSM |   | QOSM |
  |      |   |------|   |-------|   |-------|     |------|   |      |
  | NSLP |   | NSLP |<->| NSLP  |<->| NSLP  |<--->| NSLP |   | NSLP |
  |Z-e2e |   | RMD  |   | RMD   |   | RMD   |     | RMD  |   | Z-e2e|
  | QOSM |   | QOSM |   | QOSM  |   | QOSM  |     | QOSM |   | QOSM |
  |------|   |------|   |-------|   |-------|     |------|   |------|
  -----------------------------------------------------------------
  |------|   |------|   |-------|   |-------|     |------|   |------|
  | NTLP |<->| NTLP |<->| NTLP  |<->| NTLP  |<--->| NTLP |<->| NTLP |
  |------|   |------|   |-------|   |-------|     |------|   |------|
    QNI        QNE         QNE         QNE          QNE        QNR
   (End)  (Ingress Edge)(Interior) (Interior)  (Egress Edge)  (End)

      Figure 29: Example of initiator and local domain QOSM operation

The QNI sets the <QoS Desired> and <QoS Available> QSPEC objects in the initiator QSPEC, and initializes <QoS Available> to <QoS Desired>.  In this example, the <Minimum QoS> object is not populated.  The QNI populates QSPEC parameters to ensure correct treatment of its traffic in domains down the path.  Additionally, to ensure correct treatment further down the path, the QNI includes <PHB Class> in <QoS Desired>.  The QNI therefore includes the following in the QSPEC:

   <QoS Desired> = <TMOD-1> <PHB Class>
   <QoS Available> = <TMOD-1> <Path Latency>

In this example, it is assumed that the <TMOD-1> parameter is used to encode the traffic parameters of a VoIP application that uses RTP and the G.711 Codec; see Appendix B in [RFC5975].  The text below is copied from [RFC5975].

In the simplest case, the Minimum Policed Unit m is the sum of the IP, UDP, and RTP headers plus the payload.  The IP header in the IPv4 case has a size of 20 octets (40 octets if IPv6 is used).  The UDP header has a size of 8 octets, and RTP uses a 12-octet header.  The G.711 Codec specifies a bandwidth of 64 kbit/s (8000 octets/s).  Assuming RTP transmits voice datagrams every 20 ms, the payload for one datagram is 8000 octets/s * 0.02 s = 160 octets.

   IPv4+UDP+RTP+payload: m = 20+8+12+160 octets = 200 octets
   IPv6+UDP+RTP+payload: m = 40+8+12+160 octets = 220 octets

The Rate r specifies the number of octets per second.  50 datagrams are sent per second:

   IPv4: r = 50 1/s * m = 10,000 octets/s
   IPv6: r = 50 1/s * m = 11,000 octets/s

The bucket size b specifies the maximum burst.  In this example, a burst of 10 packets is used:

   IPv4: b = 10 * m = 2000 octets
   IPv6: b = 10 * m = 2200 octets

In our example, we will assume that IPv4 is used, and therefore the <TMOD-1> values will be set as follows:

   m = 200 octets
   r = 10000 octets/s
   b = 2000 octets

The <Peak Data Rate-1 (p)> and MPS are not specified above, but in our example we will assume:

   p = r = 10000 octets/s
   MPS = 220 octets

The <PHB Class> is set in such a way that the Expedited Forwarding (EF) PHB is used.  Since <Path Latency> and <QoS Class> are not vital parameters from the QNI's perspective, it does not raise their <M> flags.  Each QNE that supports the Z-e2e QOSM on the path reads and interprets those parameters in the initiator QSPEC.

When an end-to-end RESERVE message is received at a QNE Ingress node at the RMD domain border, the QNE Ingress can "hide" the initiator end-to-end RESERVE message so that only the QNE Edges process the initiator (end-to-end) RESERVE message, which then bypasses intermediate nodes between the Edges of the domain, and issues its own local RESERVE message (see Section 6).  For this new local RESERVE message, the QNE Ingress node generates the local RMD-QSPEC.
The RMD-QSPEC corresponding to the RMD-QOSM is generated based on the original initiator QSPEC according to the procedures described in Section 4.5 of [RFC5974] and in Section 6 of this document.  The RMD QNE Ingress maps the <TMOD-1> parameters contained in the original initiator QSPEC into the equivalent <TMOD-1> parameter representing only the peak bandwidth in the local RMD-QSPEC.

In this example, the initial <TMOD-1> parameters are mapped into the RMD-QSPEC <TMOD-1> parameters as follows.  As specified, the RMD-QOSM bandwidth-equivalent <TMOD-1> parameter of the RMD-QSPEC should have:

   r = p of the initial e2e <TMOD-1> parameter
   m = large
   b = large

For the RMD-QSPEC <TMOD-1> parameter, the following values are calculated:

   r = p of the initial e2e <TMOD-1> parameter = 10000 octets/s

m is set in this example to a large value as follows:

   m = MPS of the initial e2e <TMOD-1> parameter = 220 octets

The maximum allowed value of b is 250 gigabytes, which is much larger than needed in our example.  The b parameter specifies the extent to which the data rate can exceed the sustainable level for short periods of time.  In order to get a large b, in this example we consider that, for a certain period of time, the data rate can exceed the sustainable level, which in our example is the peak rate (p).  Thus, in our example, we calculate b as:

   b = p * "period of time"

For this VoIP example, we can assume that this period of time is 1.5 seconds:

   b = 10000 octets/s * 1.5 seconds = 15000 octets

Thus, the local RMD-QSPEC <TMOD-1> values are:

   r = 10000 octets/s
   p = 10000 octets/s
   m = 220 octets
   b = 15000 octets
   MPS = 220 octets
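The arithmetic above can be reproduced with a few lines (illustrative Python; it only recomputes the example numbers and does not specify any RMD-QOSM behavior):

   # G.711 over IPv4 example and its mapping to the local RMD-QSPEC
   # <TMOD-1>; the 1.5-second burst period is the assumption made in
   # the text above.

   ip_hdr, udp_hdr, rtp_hdr = 20, 8, 12       # octets (IPv4)
   payload = int(8000 * 0.020)                # 64 kbit/s, 20 ms datagrams -> 160 octets

   m = ip_hdr + udp_hdr + rtp_hdr + payload   # 200 octets
   r = 50 * m                                 # 50 datagrams/s -> 10000 octets/s
   b = 10 * m                                 # 10-packet burst -> 2000 octets
   p = r                                      # peak rate assumed equal to r
   MPS = 220                                  # maximum packet size used in the example

   # Local RMD-QSPEC <TMOD-1> (peak-bandwidth-only equivalent).
   rmd_r = p                                  # 10000 octets/s
   rmd_m = MPS                                # "large" m -> 220 octets
   rmd_b = p * 1.5                            # "large" b -> 15000 octets

   print(m, r, b, rmd_r, rmd_m, rmd_b)        # 200 10000 2000 10000 220 15000.0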
The bit-level format of the RMD-QSPEC is given in Section 4.1.  In particular, the Initiator/Local QSPEC bit, i.e., <I>, is set to "Local" (i.e., "1") and the <Qspec Proc> is set as follows:

* Message Sequence = 0: Sender initiated

* Object combination = 0: <QoS Desired> for RESERVE and <QoS Reserved> for RESPONSE

The <QSPEC Version> used by the RMD-QOSM is the default version, i.e., "0"; see [RFC5975].  The <QSPEC Type> value used by the RMD-QOSM is specified in [RFC5975] and is equal to "2".

The <Traffic Handling Directives> contains the following fields:

   <Traffic Handling Directives> = <PHR container> <PDR container>

The Per-Hop Reservation container (PHR container) and the Per-Domain Reservation container (PDR container) are specified in Sections 4.1.2 and 4.1.3, respectively.  The <PHR container> contains the traffic handling directives for intra-domain communication and reservation.  The <PDR container> contains additional traffic handling directives that are needed for edge-to-edge communication.

The RMD-QOSM <QoS Desired> and <QoS Reserved> objects are specified in Section 4.1.1.  In the RMD-QOSM, the <QoS Desired> and <QoS Reserved> objects contain the following parameters:

   <QoS Desired> = <TMOD-1> <PHB Class> <Admission Priority>
   <QoS Reserved> = <TMOD-1> <PHB Class> <Admission Priority>

The bit format of the <PHB Class> (see [RFC5975] and Figures 4 and 5) and <Admission Priority> complies with the bit format specified in [RFC5975].  In this example, the RMD-QSPEC <TMOD-1> values are the ones that were calculated and given above.  Furthermore, the <PHB Class> represents the EF PHB class.  Moreover, in this example the RMD reservation is established without an <Admission Priority> parameter, which is equivalent to a reservation established with an <Admission Priority> whose value is 1.

The RMD QNE Egress node updates <QoS Available> on behalf of the entire RMD domain if it can.  If it cannot (since the <M> flag is not set for <Path Latency>), it raises the parameter-specific "not-supported" flag, warning the QNR that the final latency value in <QoS Available> is imprecise.
In the "Y" access domain, the initiator QSPEC is processed by the QNR in the similar was as it was processed in the "X" wireless access domain, by the QNI. If the reservation was successful, eventually the RESERVE request arrives at the QNR (otherwise, the QNE at which the reservation failed would have aborted the RESERVE and sent an error RESPONSE back to the QNI). If the <RII> was included in the QoS-NSLP message, the QNR generates a positive RESPONSE with QSPEC objects <QoS Reserved> and <QoS Available>. The parameters appearing in <QoS Reserved> are the same as in <QoS Desired>, with values copied from <QoS Available>. Hence, the QNR includes the following QSPEC objects in the RESPONSE message: <QoS Reserved> = <TMOD-1> <PHB Class> <QoS Available> = <TMOD-1> <Path Latency>Contributors
Attila Takacs Ericsson Research Ericsson Hungary Ltd. Laborc 1, Budapest, Hungary, H-1037 EMail: Attila.Takacs@ericsson.com Andras Csaszar Ericsson Research Ericsson Hungary Ltd. Laborc 1, Budapest, Hungary, H-1037 EMail: Andras.Csaszar@ericsson.com
Authors' Addresses
Attila Bader Ericsson Research Ericsson Hungary Ltd. Laborc 1, Budapest, Hungary, H-1037 EMail: Attila.Bader@ericsson.com Lars Westberg Ericsson Research Torshamnsgatan 23 SE-164 80 Stockholm, Sweden EMail: Lars.Westberg@ericsson.com Georgios Karagiannis University of Twente P.O. Box 217 7500 AE Enschede, The Netherlands EMail: g.karagiannis@ewi.utwente.nl Cornelia Kappler ck technology concepts Berlin, Germany EMail: cornelia.kappler@cktecc.de Hannes Tschofenig Nokia Siemens Networks Linnoitustie 6 Espoo 02600 Finland EMail: Hannes.Tschofenig@nsn.com URI: http://www.tschofenig.priv.at Tom Phelan Sonus Networks 250 Apollo Dr. Chelmsford, MA 01824 USA EMail: tphelan@sonusnet.com