9. UPDATE Message Handling An UPDATE message may be received only in the Established state. When an UPDATE message is received, each field is checked for validity as specified in Section 6.3. If an optional non-transitive attribute is unrecognized, it is quietly ignored. If an optional transitive attribute is unrecognized, the Partial bit (the third high-order bit) in the attribute flags octet is set to 1, and the attribute is retained for propagation to other BGP speakers. If an optional attribute is recognized, and has a valid value, then, depending on the type of the optional attribute, it is processed locally, retained, and updated, if necessary, for possible propagation to other BGP speakers. If the UPDATE message contains a non-empty WITHDRAWN ROUTES field, the previously advertised routes whose destinations (expressed as IP prefixes) contained in this field shall be removed from the Adj-RIB- In. This BGP speaker shall run its Decision Process since the previously advertised route is not longer available for use. If the UPDATE message contains a feasible route, it shall be placed in the appropriate Adj-RIB-In, and the following additional actions shall be taken: i) If its Network Layer Reachability Information (NLRI) is identical to the one of a route currently stored in the Adj-RIB-In, then the new route shall replace the older route in the Adj-RIB-In, thus implicitly withdrawing the older route from service. The BGP speaker shall run its Decision Process since the older route is no longer available for use. ii) If the new route is an overlapping route that is included (see 9.1.4) in an earlier route contained in the Adj-RIB-In, the BGP speaker shall run its Decision Process since the more specific route has implicitly made a portion of the less specific route unavailable for use. iii) If the new route has identical path attributes to an earlier route contained in the Adj-RIB-In, and is more specific (see 9.1.4) than the earlier route, no further actions are necessary. iv) If the new route has NLRI that is not present in any of the routes currently stored in the Adj-RIB-In, then the new route shall
be placed in the Adj-RIB-In. The BGP speaker shall run its Decision Process. v) If the new route is an overlapping route that is less specific (see 9.1.4) than an earlier route contained in the Adj-RIB-In, the BGP speaker shall run its Decision Process on the set of destinations described only by the less specific route. 9.1 Decision Process The Decision Process selects routes for subsequent advertisement by applying the policies in the local Policy Information Base (PIB) to the routes stored in its Adj-RIB-In. The output of the Decision Process is the set of routes that will be advertised to all peers; the selected routes will be stored in the local speaker's Adj-RIB- Out. The selection process is formalized by defining a function that takes the attribute of a given route as an argument and returns a non- negative integer denoting the degree of preference for the route. The function that calculates the degree of preference for a given route shall not use as its inputs any of the following: the existence of other routes, the non-existence of other routes, or the path attributes of other routes. Route selection then consists of individual application of the degree of preference function to each feasible route, followed by the choice of the one with the highest degree of preference. The Decision Process operates on routes contained in each Adj-RIB-In, and is responsible for: - selection of routes to be advertised to BGP speakers located in the local speaker's autonomous system - selection of routes to be advertised to BGP speakers located in neighboring autonomous systems - route aggregation and route information reduction The Decision Process takes place in three distinct phases, each triggered by a different event: a) Phase 1 is responsible for calculating the degree of preference for each route received from a BGP speaker located in a neighboring autonomous system, and for advertising to the other BGP speakers in the local autonomous system the routes that have the highest degree of preference for each distinct destination.
b) Phase 2 is invoked on completion of phase 1. It is responsible for choosing the best route out of all those available for each distinct destination, and for installing each chosen route into the appropriate Loc-RIB. c) Phase 3 is invoked after the Loc-RIB has been modified. It is responsible for disseminating routes in the Loc-RIB to each peer located in a neighboring autonomous system, according to the policies contained in the PIB. Route aggregation and information reduction can optionally be performed within this phase. 9.1.1 Phase 1: Calculation of Degree of Preference The Phase 1 decision function shall be invoked whenever the local BGP speaker receives an UPDATE message from a peer located in a neighboring autonomous system that advertises a new route, a replacement route, or a withdrawn route. The Phase 1 decision function is a separate process which completes when it has no further work to do. The Phase 1 decision function shall lock an Adj-RIB-In prior to operating on any route contained within it, and shall unlock it after operating on all new or unfeasible routes contained within it. For each newly received or replacement feasible route, the local BGP speaker shall determine a degree of preference. If the route is learned from a BGP speaker in the local autonomous system, either the value of the LOCAL_PREF attribute shall be taken as the degree of preference, or the local system shall compute the degree of preference of the route based on preconfigured policy information. If the route is learned from a BGP speaker in a neighboring autonomous system, then the degree of preference shall be computed based on preconfigured policy information. The exact nature of this policy information and the computation involved is a local matter. The local speaker shall then run the internal update process of 9.2.1 to select and advertise the most preferable route. 9.1.2 Phase 2: Route Selection The Phase 2 decision function shall be invoked on completion of Phase 1. The Phase 2 function is a separate process which completes when it has no further work to do. The Phase 2 process shall consider all routes that are present in the Adj-RIBs-In, including those received from BGP speakers located in its own autonomous system and those received from BGP speakers located in neighboring autonomous systems.
The Phase 2 decision function shall be blocked from running while the Phase 3 decision function is in process. The Phase 2 function shall lock all Adj-RIBs-In prior to commencing its function, and shall unlock them on completion. If the NEXT_HOP attribute of a BGP route depicts an address to which the local BGP speaker doesn't have a route in its Loc-RIB, the BGP route SHOULD be excluded from the Phase 2 decision function. For each set of destinations for which a feasible route exists in the Adj-RIBs-In, the local BGP speaker shall identify the route that has: a) the highest degree of preference of any route to the same set of destinations, or b) is the only route to that destination, or c) is selected as a result of the Phase 2 tie breaking rules specified in 9.1.2.1. The local speaker SHALL then install that route in the Loc-RIB, replacing any route to the same destination that is currently being held in the Loc-RIB. The local speaker MUST determine the immediate next hop to the address depicted by the NEXT_HOP attribute of the selected route by performing a lookup in the IGP and selecting one of the possible paths in the IGP. This immediate next hop MUST be used when installing the selected route in the Loc-RIB. If the route to the address depicted by the NEXT_HOP attribute changes such that the immediate next hop changes, route selection should be recalculated as specified above. Unfeasible routes shall be removed from the Loc-RIB, and corresponding unfeasible routes shall then be removed from the Adj- RIBs-In. 9.1.2.1 Breaking Ties (Phase 2) In its Adj-RIBs-In a BGP speaker may have several routes to the same destination that have the same degree of preference. The local speaker can select only one of these routes for inclusion in the associated Loc-RIB. The local speaker considers all equally preferable routes, both those received from BGP speakers located in neighboring autonomous systems, and those received from other BGP speakers located in the local speaker's autonomous system. The following tie-breaking procedure assumes that for each candidate route all the BGP speakers within an autonomous system can ascertain the cost of a path (interior distance) to the address depicted by the
NEXT_HOP attribute of the route. Ties shall be broken according to the following algorithm: a) If the local system is configured to take into account MULTI_EXIT_DISC, and the candidate routes differ in their MULTI_EXIT_DISC attribute, select the route that has the lowest value of the MULTI_EXIT_DISC attribute. b) Otherwise, select the route that has the lowest cost (interior distance) to the entity depicted by the NEXT_HOP attribute of the route. If there are several routes with the same cost, then the tie-breaking shall be broken as follows: - if at least one of the candidate routes was advertised by the BGP speaker in a neighboring autonomous system, select the route that was advertised by the BGP speaker in a neighboring autonomous system whose BGP Identifier has the lowest value among all other BGP speakers in neighboring autonomous systems; - otherwise, select the route that was advertised by the BGP speaker whose BGP Identifier has the lowest value. 9.1.3 Phase 3: Route Dissemination The Phase 3 decision function shall be invoked on completion of Phase 2, or when any of the following events occur: a) when routes in a Loc-RIB to local destinations have changed b) when locally generated routes learned by means outside of BGP have changed c) when a new BGP speaker - BGP speaker connection has been established The Phase 3 function is a separate process which completes when it has no further work to do. The Phase 3 Routing Decision function shall be blocked from running while the Phase 2 decision function is in process. All routes in the Loc-RIB shall be processed into a corresponding entry in the associated Adj-RIBs-Out. Route aggregation and information reduction techniques (see 9.2.4.1) may optionally be applied. For the benefit of future support of inter-AS multicast capabilities, a BGP speaker that participates in inter-AS multicast routing shall advertise a route it receives from one of its external peers and if
it installs it in its Loc-RIB, it shall advertise it back to the peer from which the route was received. For a BGP speaker that does not participate in inter-AS multicast routing such an advertisement is optional. When doing such an advertisement, the NEXT_HOP attribute should be set to the address of the peer. An implementation may also optimize such an advertisement by truncating information in the AS_PATH attribute to include only its own AS number and that of the peer that advertised the route (such truncation requires the ORIGIN attribute to be set to INCOMPLETE). In addition an implementation is not required to pass optional or discretionary path attributes with such an advertisement. When the updating of the Adj-RIBs-Out and the Forwarding Information Base (FIB) is complete, the local BGP speaker shall run the external update process of 9.2.2. 9.1.4 Overlapping Routes A BGP speaker may transmit routes with overlapping Network Layer Reachability Information (NLRI) to another BGP speaker. NLRI overlap occurs when a set of destinations are identified in non-matching multiple routes. Since BGP encodes NLRI using IP prefixes, overlap will always exhibit subset relationships. A route describing a smaller set of destinations (a longer prefix) is said to be more specific than a route describing a larger set of destinations (a shorted prefix); similarly, a route describing a larger set of destinations (a shorter prefix) is said to be less specific than a route describing a smaller set of destinations (a longer prefix). The precedence relationship effectively decomposes less specific routes into two parts: - a set of destinations described only by the less specific route, and - a set of destinations described by the overlap of the less specific and the more specific routes When overlapping routes are present in the same Adj-RIB-In, the more specific route shall take precedence, in order from more specific to least specific. The set of destinations described by the overlap represents a portion of the less specific route that is feasible, but is not currently in use. If a more specific route is later withdrawn, the set of destinations described by the overlap will still be reachable using the less specific route.
If a BGP speaker receives overlapping routes, the Decision Process shall take into account the semantics of the overlapping routes. In particular, if a BGP speaker accepts the less specific route while rejecting the more specific route from the same peer, then the destinations represented by the overlap may not forward along the ASs listed in the AS_PATH attribute of that route. Therefore, a BGP speaker has the following choices: a) Install both the less and the more specific routes b) Install the more specific route only c) Install the non-overlapping part of the less specific route only (that implies de-aggregation) d) Aggregate the two routes and install the aggregated route e) Install the less specific route only f) Install neither route If a BGP speaker chooses e), then it should add ATOMIC_AGGREGATE attribute to the route. A route that carries ATOMIC_AGGREGATE attribute can not be de-aggregated. That is, the NLRI of this route can not be made more specific. Forwarding along such a route does not guarantee that IP packets will actually traverse only ASs listed in the AS_PATH attribute of the route. If a BGP speaker chooses a), it must not advertise the more general route without the more specific route. 9.2 Update-Send Process The Update-Send process is responsible for advertising UPDATE messages to all peers. For example, it distributes the routes chosen by the Decision Process to other BGP speakers which may be located in either the same autonomous system or a neighboring autonomous system. rules for information exchange between BGP speakers located in different autonomous systems are given in 9.2.2; rules for information exchange between BGP speakers located in the same autonomous system are given in 9.2.1. Distribution of routing information between a set of BGP speakers, all of which are located in the same autonomous system, is referred to as internal distribution.
9.2.1 Internal Updates The Internal update process is concerned with the distribution of routing information to BGP speakers located in the local speaker's autonomous system. When a BGP speaker receives an UPDATE message from another BGP speaker located in its own autonomous system, the receiving BGP speaker shall not re-distribute the routing information contained in that UPDATE message to other BGP speakers located in its own autonomous system. When a BGP speaker receives a new route from a BGP speaker in a neighboring autonomous system, it shall advertise that route to all other BGP speakers in its autonomous system by means of an UPDATE message if any of the following conditions occur: 1) the degree of preference assigned to the newly received route by the local BGP speaker is higher than the degree of preference that the local speaker has assigned to other routes that have been received from BGP speakers in neighboring autonomous systems, or 2) there are no other routes that have been received from BGP speakers in neighboring autonomous systems, or 3) the newly received route is selected as a result of breaking a tie between several routes which have the highest degree of preference, and the same destination (the tie-breaking procedure is specified in 9.2.1.1). When a BGP speaker receives an UPDATE message with a non-empty WITHDRAWN ROUTES field, it shall remove from its Adj-RIB-In all routes whose destinations was carried in this field (as IP prefixes). The speaker shall take the following additional steps: 1) if the corresponding feasible route had not been previously advertised, then no further action is necessary 2) if the corresponding feasible route had been previously advertised, then: i) if a new route is selected for advertisement that has the same Network Layer Reachability Information as the unfeasible routes, then the local BGP speaker shall advertise the replacement route ii) if a replacement route is not available for advertisement, then the BGP speaker shall include the destinations of the
unfeasible route (in form of IP prefixes) in the WITHDRAWN ROUTES field of an UPDATE message, and shall send this message to each peer to whom it had previously advertised the corresponding feasible route. All feasible routes which are advertised shall be placed in the appropriate Adj-RIBs-Out, and all unfeasible routes which are advertised shall be removed from the Adj-RIBs-Out. 9.2.1.1 Breaking Ties (Internal Updates) If a local BGP speaker has connections to several BGP speakers in neighboring autonomous systems, there will be multiple Adj-RIBs-In associated with these peers. These Adj-RIBs-In might contain several equally preferable routes to the same destination, all of which were advertised by BGP speakers located in neighboring autonomous systems. The local BGP speaker shall select one of these routes according to the following rules: a) If the candidate route differ only in their NEXT_HOP and MULTI_EXIT_DISC attributes, and the local system is configured to take into account MULTI_EXIT_DISC attribute, select the routes that has the lowest value of the MULTI_EXIT_DISC attribute. b) If the local system can ascertain the cost of a path to the entity depicted by the NEXT_HOP attribute of the candidate route, select the route with the lowest cost. c) In all other cases, select the route that was advertised by the BGP speaker whose BGP Identifier has the lowest value. 9.2.2 External Updates The external update process is concerned with the distribution of routing information to BGP speakers located in neighboring autonomous systems. As part of Phase 3 route selection process, the BGP speaker has updated its Adj-RIBs-Out and its Forwarding Table. All newly installed routes and all newly unfeasible routes for which there is no replacement route shall be advertised to BGP speakers located in neighboring autonomous systems by means of UPDATE message. Any routes in the Loc-RIB marked as unfeasible shall be removed. Changes to the reachable destinations within its own autonomous system shall also be advertised in an UPDATE message.
9.2.3 Controlling Routing Traffic Overhead The BGP protocol constrains the amount of routing traffic (that is, UPDATE messages) in order to limit both the link bandwidth needed to advertise UPDATE messages and the processing power needed by the Decision Process to digest the information contained in the UPDATE messages. 9.2.3.1 Frequency of Route Advertisement The parameter MinRouteAdvertisementInterval determines the minimum amount of time that must elapse between advertisement of routes to a particular destination from a single BGP speaker. This rate limiting procedure applies on a per-destination basis, although the value of MinRouteAdvertisementInterval is set on a per BGP peer basis. Two UPDATE messages sent from a single BGP speaker that advertise feasible routes to some common set of destinations received from BGP speakers in neighboring autonomous systems must be separated by at least MinRouteAdvertisementInterval. Clearly, this can only be achieved precisely by keeping a separate timer for each common set of destinations. This would be unwarranted overhead. Any technique which ensures that the interval between two UPDATE messages sent from a single BGP speaker that advertise feasible routes to some common set of destinations received from BGP speakers in neighboring autonomous systems will be at least MinRouteAdvertisementInterval, and will also ensure a constant upper bound on the interval is acceptable. Since fast convergence is needed within an autonomous system, this procedure does not apply for routes receives from other BGP speakers in the same autonomous system. To avoid long-lived black holes, the procedure does not apply to the explicit withdrawal of unfeasible routes (that is, routes whose destinations (expressed as IP prefixes) are listed in the WITHDRAWN ROUTES field of an UPDATE message). This procedure does not limit the rate of route selection, but only the rate of route advertisement. If new routes are selected multiple times while awaiting the expiration of MinRouteAdvertisementInterval, the last route selected shall be advertised at the end of MinRouteAdvertisementInterval. 9.2.3.2 Frequency of Route Origination The parameter MinASOriginationInterval determines the minimum amount of time that must elapse between successive advertisements of UPDATE messages that report changes within the advertising BGP speaker's own autonomous systems.
9.2.3.3 Jitter To minimize the likelihood that the distribution of BGP messages by a given BGP speaker will contain peaks, jitter should be applied to the timers associated with MinASOriginationInterval, Keepalive, and MinRouteAdvertisementInterval. A given BGP speaker shall apply the same jitter to each of these quantities regardless of the destinations to which the updates are being sent; that is, jitter will not be applied on a "per peer" basis. The amount of jitter to be introduced shall be determined by multiplying the base value of the appropriate timer by a random factor which is uniformly distributed in the range from 0.75 to 1.0. 9.2.4 Efficient Organization of Routing Information Having selected the routing information which it will advertise, a BGP speaker may avail itself of several methods to organize this information in an efficient manner. 9.2.4.1 Information Reduction Information reduction may imply a reduction in granularity of policy control - after information is collapsed, the same policies will apply to all destinations and paths in the equivalence class. The Decision Process may optionally reduce the amount of information that it will place in the Adj-RIBs-Out by any of the following methods: a) Network Layer Reachability Information (NLRI): Destination IP addresses can be represented as IP address prefixes. In cases where there is a correspondence between the address structure and the systems under control of an autonomous system administrator, it will be possible to reduce the size of the NLRI carried in the UPDATE messages. b) AS_PATHs: AS path information can be represented as ordered AS_SEQUENCEs or unordered AS_SETs. AS_SETs are used in the route aggregation algorithm described in 9.2.4.2. They reduce the size of the AS_PATH information by listing each AS number only once, regardless of how many times it may have appeared in multiple AS_PATHs that were aggregated.
An AS_SET implies that the destinations listed in the NLRI can be reached through paths that traverse at least some of the constituent autonomous systems. AS_SETs provide sufficient information to avoid routing information looping; however their use may prune potentially feasible paths, since such paths are no longer listed individually as in the form of AS_SEQUENCEs. In practice this is not likely to be a problem, since once an IP packet arrives at the edge of a group of autonomous systems, the BGP speaker at that point is likely to have more detailed path information and can distinguish individual paths to destinations. 9.2.4.2 Aggregating Routing Information Aggregation is the process of combining the characteristics of several different routes in such a way that a single route can be advertised. Aggregation can occur as part of the decision process to reduce the amount of routing information that will be placed in the Adj-RIBs-Out. Aggregation reduces the amount of information that a BGP speaker must store and exchange with other BGP speakers. Routes can be aggregated by applying the following procedure separately to path attributes of like type and to the Network Layer Reachability Information. Routes that have the following attributes shall not be aggregated unless the corresponding attributes of each route are identical: MULTI_EXIT_DISC, NEXT_HOP. Path attributes that have different type codes can not be aggregated together. Path of the same type code may be aggregated, according to the following rules: ORIGIN attribute: If at least one route among routes that are aggregated has ORIGIN with the value INCOMPLETE, then the aggregated route must have the ORIGIN attribute with the value INCOMPLETE. Otherwise, if at least one route among routes that are aggregated has ORIGIN with the value EGP, then the aggregated route must have the origin attribute with the value EGP. In all other case the value of the ORIGIN attribute of the aggregated route is INTERNAL. AS_PATH attribute: If routes to be aggregated have identical AS_PATH attributes, then the aggregated route has the same AS_PATH attribute as each individual route. For the purpose of aggregating AS_PATH attributes we model each AS within the AS_PATH attribute as a tuple <type, value>, where "type" identifies a type of the path segment the AS belongs to
(e.g. AS_SEQUENCE, AS_SET), and "value" is the AS number. If the routes to be aggregated have different AS_PATH attributes, then the aggregated AS_PATH attribute shall satisfy all of the following conditions: - all tuples of the type AS_SEQUENCE in the aggregated AS_PATH shall appear in all of the AS_PATH in the initial set of routes to be aggregated. - all tuples of the type AS_SET in the aggregated AS_PATH shall appear in at least one of the AS_PATH in the initial set (they may appear as either AS_SET or AS_SEQUENCE types). - for any tuple X of the type AS_SEQUENCE in the aggregated AS_PATH which precedes tuple Y in the aggregated AS_PATH, X precedes Y in each AS_PATH in the initial set which contains Y, regardless of the type of Y. - No tuple with the same value shall appear more than once in the aggregated AS_PATH, regardless of the tuple's type. An implementation may choose any algorithm which conforms to these rules. At a minimum a conformant implementation shall be able to perform the following algorithm that meets all of the above conditions: - determine the longest leading sequence of tuples (as defined above) common to all the AS_PATH attributes of the routes to be aggregated. Make this sequence the leading sequence of the aggregated AS_PATH attribute. - set the type of the rest of the tuples from the AS_PATH attributes of the routes to be aggregated to AS_SET, and append them to the aggregated AS_PATH attribute. - if the aggregated AS_PATH has more than one tuple with the same value (regardless of tuple's type), eliminate all, but one such tuple by deleting tuples of the type AS_SET from the aggregated AS_PATH attribute. Appendix 6, section 6.8 presents another algorithm that satisfies the conditions and allows for more complex policy configurations. ATOMIC_AGGREGATE: If at least one of the routes to be aggregated has ATOMIC_AGGREGATE path attribute, then the aggregated route shall have this attribute as well.
AGGREGATOR: All AGGREGATOR attributes of all routes to be aggregated should be ignored. 9.3 Route Selection Criteria Generally speaking, additional rules for comparing routes among several alternatives are outside the scope of this document. There are two exceptions: - If the local AS appears in the AS path of the new route being considered, then that new route cannot be viewed as better than any other route. If such a route were ever used, a routing loop would result. - In order to achieve successful distributed operation, only routes with a likelihood of stability can be chosen. Thus, an AS must avoid using unstable routes, and it must not make rapid spontaneous changes to its choice of route. Quantifying the terms "unstable" and "rapid" in the previous sentence will require experience, but the principle is clear. 9.4 Originating BGP routes A BGP speaker may originate BGP routes by injecting routing information acquired by some other means (e.g. via an IGP) into BGP. A BGP speaker that originates BGP routes shall assign the degree of preference to these routes by passing them through the Decision Process (see Section 9.1). These routes may also be distributed to other BGP speakers within the local AS as part of the Internal update process (see Section 9.2.1). The decision whether to distribute non- BGP acquired routes within an AS via BGP or not depends on the environment within the AS (e.g. type of IGP) and should be controlled via configuration.
Appendix 1. BGP FSM State Transitions and Actions. This Appendix discusses the transitions between states in the BGP FSM in response to BGP events. The following is the list of these states and events when the negotiated Hold Time value is non-zero. BGP States: 1 - Idle 2 - Connect 3 - Active 4 - OpenSent 5 - OpenConfirm 6 - Established BGP Events: 1 - BGP Start 2 - BGP Stop 3 - BGP Transport connection open 4 - BGP Transport connection closed 5 - BGP Transport connection open failed 6 - BGP Transport fatal error 7 - ConnectRetry timer expired 8 - Hold Timer expired 9 - KeepAlive timer expired 10 - Receive OPEN message 11 - Receive KEEPALIVE message 12 - Receive UPDATE messages 13 - Receive NOTIFICATION message
The following table describes the state transitions of the BGP FSM and the actions triggered by these transitions. Event Actions Message Sent Next State -------------------------------------------------------------------- Idle (1) 1 Initialize resources none 2 Start ConnectRetry timer Initiate a transport connection others none none 1 Connect(2) 1 none none 2 3 Complete initialization OPEN 4 Clear ConnectRetry timer 5 Restart ConnectRetry timer none 3 7 Restart ConnectRetry timer none 2 Initiate a transport connection others Release resources none 1 Active (3) 1 none none 3 3 Complete initialization OPEN 4 Clear ConnectRetry timer 5 Close connection 3 Restart ConnectRetry timer 7 Restart ConnectRetry timer none 2 Initiate a transport connection others Release resources none 1 OpenSent(4) 1 none none 4 4 Close transport connection none 3 Restart ConnectRetry timer 6 Release resources none 1 10 Process OPEN is OK KEEPALIVE 5 Process OPEN failed NOTIFICATION 1 others Close transport connection NOTIFICATION 1 Release resources
OpenConfirm (5) 1 none none 5 4 Release resources none 1 6 Release resources none 1 9 Restart KeepAlive timer KEEPALIVE 5 11 Complete initialization none 6 Restart Hold Timer 13 Close transport connection 1 Release resources others Close transport connection NOTIFICATION 1 Release resources Established (6) 1 none none 6 4 Release resources none 1 6 Release resources none 1 9 Restart KeepAlive timer KEEPALIVE 6 11 Restart Hold Timer KEEPALIVE 6 12 Process UPDATE is OK UPDATE 6 Process UPDATE failed NOTIFICATION 1 13 Close transport connection 1 Release resources others Close transport connection NOTIFICATION 1 Release resources ---------------------------------------------------------------------
The following is a condensed version of the above state transition table. Events| Idle | Connect | Active | OpenSent | OpenConfirm | Estab | (1) | (2) | (3) | (4) | (5) | (6) |-------------------------------------------------------------- 1 | 2 | 2 | 3 | 4 | 5 | 6 | | | | | | 2 | 1 | 1 | 1 | 1 | 1 | 1 | | | | | | 3 | 1 | 4 | 4 | 1 | 1 | 1 | | | | | | 4 | 1 | 1 | 1 | 3 | 1 | 1 | | | | | | 5 | 1 | 3 | 3 | 1 | 1 | 1 | | | | | | 6 | 1 | 1 | 1 | 1 | 1 | 1 | | | | | | 7 | 1 | 2 | 2 | 1 | 1 | 1 | | | | | | 8 | 1 | 1 | 1 | 1 | 1 | 1 | | | | | | 9 | 1 | 1 | 1 | 1 | 5 | 6 | | | | | | 10 | 1 | 1 | 1 | 1 or 5 | 1 | 1 | | | | | | 11 | 1 | 1 | 1 | 1 | 6 | 6 | | | | | | 12 | 1 | 1 | 1 | 1 | 1 | 1 or 6 | | | | | | 13 | 1 | 1 | 1 | 1 | 1 | 1 | | | | | | --------------------------------------------------------------- Appendix 2. Comparison with RFC1267 BGP-4 is capable of operating in an environment where a set of reachable destinations may be expressed via a single IP prefix. The concept of network classes, or subnetting is foreign to BGP-4. To accommodate these capabilities BGP-4 changes semantics and encoding associated with the AS_PATH attribute. New text has been added to define semantics associated with IP prefixes. These abilities allow BGP-4 to support the proposed supernetting scheme [9]. To simplify configuration this version introduces a new attribute, LOCAL_PREF, that facilitates route selection procedures.
The INTER_AS_METRIC attribute has been renamed to be MULTI_EXIT_DISC. A new attribute, ATOMIC_AGGREGATE, has been introduced to insure that certain aggregates are not de-aggregated. Another new attribute, AGGREGATOR, can be added to aggregate routes in order to advertise which AS and which BGP speaker within that AS caused the aggregation. To insure that Hold Timers are symmetric, the Hold Time is now negotiated on a per-connection basis. Hold Times of zero are now supported. Appendix 3. Comparison with RFC 1163 All of the changes listed in Appendix 2, plus the following. To detect and recover from BGP connection collision, a new field (BGP Identifier) has been added to the OPEN message. New text (Section 6.8) has been added to specify the procedure for detecting and recovering from collision. The new document no longer restricts the border router that is passed in the NEXT_HOP path attribute to be part of the same Autonomous System as the BGP Speaker. New document optimizes and simplifies the exchange of the information about previously reachable routes. Appendix 4. Comparison with RFC 1105 All of the changes listed in Appendices 2 and 3, plus the following. Minor changes to the RFC1105 Finite State Machine were necessary to accommodate the TCP user interface provided by 4.3 BSD. The notion of Up/Down/Horizontal relations present in RFC1105 has been removed from the protocol. The changes in the message format from RFC1105 are as follows: 1. The Hold Time field has been removed from the BGP header and added to the OPEN message. 2. The version field has been removed from the BGP header and added to the OPEN message. 3. The Link Type field has been removed from the OPEN message. 4. The OPEN CONFIRM message has been eliminated and replaced with implicit confirmation provided by the KEEPALIVE message.
5. The format of the UPDATE message has been changed significantly. New fields were added to the UPDATE message to support multiple path attributes. 6. The Marker field has been expanded and its role broadened to support authentication. Note that quite often BGP, as specified in RFC 1105, is referred to as BGP-1, BGP, as specified in RFC 1163, is referred to as BGP-2, BGP, as specified in RFC1267 is referred to as BGP-3, and BGP, as specified in this document is referred to as BGP-4. Appendix 5. TCP options that may be used with BGP If a local system TCP user interface supports TCP PUSH function, then each BGP message should be transmitted with PUSH flag set. Setting PUSH flag forces BGP messages to be transmitted promptly to the receiver. If a local system TCP user interface supports setting precedence for TCP connection, then the BGP transport connection should be opened with precedence set to Internetwork Control (110) value (see also [6]). Appendix 6. Implementation Recommendations This section presents some implementation recommendations. 6.1 Multiple Networks Per Message The BGP protocol allows for multiple address prefixes with the same AS path and next-hop gateway to be specified in one message. Making use of this capability is highly recommended. With one address prefix per message there is a substantial increase in overhead in the receiver. Not only does the system overhead increase due to the reception of multiple messages, but the overhead of scanning the routing table for updates to BGP peers and other routing protocols (and sending the associated messages) is incurred multiple times as well. One method of building messages containing many address prefixes per AS path and gateway from a routing table that is not organized per AS path is to build many messages as the routing table is scanned. As each address prefix is processed, a message for the associated AS path and gateway is allocated, if it does not exist, and the new address prefix is added to it. If such a message exists, the new address prefix is just appended to it. If the message lacks the space to hold the new address prefix, it is transmitted, a new message is allocated, and the new address prefix is inserted into the new message. When the entire routing table has been scanned, all
allocated messages are sent and their resources released. Maximum compression is achieved when all the destinations covered by the address prefixes share a gateway and common path attributes, making it possible to send many address prefixes in one 4096-byte message. When peering with a BGP implementation that does not compress multiple address prefixes into one message, it may be necessary to take steps to reduce the overhead from the flood of data received when a peer is acquired or a significant network topology change occurs. One method of doing this is to limit the rate of updates. This will eliminate the redundant scanning of the routing table to provide flash updates for BGP peers and other routing protocols. A disadvantage of this approach is that it increases the propagation latency of routing information. By choosing a minimum flash update interval that is not much greater than the time it takes to process the multiple messages this latency should be minimized. A better method would be to read all received messages before sending updates. 6.2 Processing Messages on a Stream Protocol BGP uses TCP as a transport mechanism. Due to the stream nature of TCP, all the data for received messages does not necessarily arrive at the same time. This can make it difficult to process the data as messages, especially on systems such as BSD Unix where it is not possible to determine how much data has been received but not yet processed. One method that can be used in this situation is to first try to read just the message header. For the KEEPALIVE message type, this is a complete message; for other message types, the header should first be verified, in particular the total length. If all checks are successful, the specified length, minus the size of the message header is the amount of data left to read. An implementation that would "hang" the routing information process while trying to read from a peer could set up a message buffer (4096 bytes) per peer and fill it with data as available until a complete message has been received. 6.3 Reducing route flapping To avoid excessive route flapping a BGP speaker which needs to withdraw a destination and send an update about a more specific or less specific route shall combine them into the same UPDATE message.
6.4 BGP Timers BGP employs five timers: ConnectRetry, Hold Time, KeepAlive, MinASOriginationInterval, and MinRouteAdvertisementInterval The suggested value for the ConnectRetry timer is 120 seconds. The suggested value for the Hold Time is 90 seconds. The suggested value for the KeepAlive timer is 30 seconds. The suggested value for the MinASOriginationInterval is 15 seconds. The suggested value for the MinRouteAdvertisementInterval is 30 seconds. An implementation of BGP MUST allow these timers to be configurable. 6.5 Path attribute ordering Implementations which combine update messages as described above in 6.1 may prefer to see all path attributes presented in a known order. This permits them to quickly identify sets of attributes from different update messages which are semantically identical. To facilitate this, it is a useful optimization to order the path attributes according to type code. This optimization is entirely optional. 6.6 AS_SET sorting Another useful optimization that can be done to simplify this situation is to sort the AS numbers found in an AS_SET. This optimization is entirely optional. 6.7 Control over version negotiation Since BGP-4 is capable of carrying aggregated routes which cannot be properly represented in BGP-3, an implementation which supports BGP-4 and another BGP version should provide the capability to only speak BGP-4 on a per-peer basis. 6.8 Complex AS_PATH aggregation An implementation which chooses to provide a path aggregation algorithm which retains significant amounts of path information may wish to use the following procedure: For the purpose of aggregating AS_PATH attributes of two routes, we model each AS as a tuple <type, value>, where "type" identifies a type of the path segment the AS belongs to (e.g. AS_SEQUENCE, AS_SET), and "value" is the AS number. Two ASs are said to be the same if their corresponding <type, value> tuples are the same.
The algorithm to aggregate two AS_PATH attributes works as follows: a) Identify the same ASs (as defined above) within each AS_PATH attribute that are in the same relative order within both AS_PATH attributes. Two ASs, X and Y, are said to be in the same order if either: - X precedes Y in both AS_PATH attributes, or - Y precedes X in both AS_PATH attributes. b) The aggregated AS_PATH attribute consists of ASs identified in (a) in exactly the same order as they appear in the AS_PATH attributes to be aggregated. If two consecutive ASs identified in (a) do not immediately follow each other in both of the AS_PATH attributes to be aggregated, then the intervening ASs (ASs that are between the two consecutive ASs that are the same) in both attributes are combined into an AS_SET path segment that consists of the intervening ASs from both AS_PATH attributes; this segment is then placed in between the two consecutive ASs identified in (a) of the aggregated attribute. If two consecutive ASs identified in (a) immediately follow each other in one attribute, but do not follow in another, then the intervening ASs of the latter are combined into an AS_SET path segment; this segment is then placed in between the two consecutive ASs identified in (a) of the aggregated attribute. If as a result of the above procedure a given AS number appears more than once within the aggregated AS_PATH attribute, all, but the last instance (rightmost occurrence) of that AS number should be removed from the aggregated AS_PATH attribute. References [1] Mills, D., "Exterior Gateway Protocol Formal Specification", RFC 904, BBN, April 1984. [2] Rekhter, Y., "EGP and Policy Based Routing in the New NSFNET Backbone", RFC 1092, T.J. Watson Research Center, February 1989. [3] Braun, H-W., "The NSFNET Routing Architecture", RFC 1093, MERIT/NSFNET Project, February 1989. [4] Postel, J., "Transmission Control Protocol - DARPA Internet Program Protocol Specification", STD 7, RFC 793, DARPA, September 1981.
[5] Rekhter, Y., and P. Gross, "Application of the Border Gateway Protocol in the Internet", RFC 1772, T.J. Watson Research Center, IBM Corp., MCI, March 1995. [6] Postel, J., "Internet Protocol - DARPA Internet Program Protocol Specification", STD 5, RFC 791, DARPA, September 1981. [7] "Information Processing Systems - Telecommunications and Information Exchange between Systems - Protocol for Exchange of Inter-domain Routeing Information among Intermediate Systems to Support Forwarding of ISO 8473 PDUs", ISO/IEC IS10747, 1993 [8] Fuller, V., Li, T., Yu, J., and K. Varadhan, "Classless Inter- Domain Routing (CIDR): an Address Assignment and Aggregation Strategy", RFC 1519, BARRNet, cisco, MERIT, OARnet, September 1993 [9] Rekhter, Y., Li, T., "An Architecture for IP Address Allocation with CIDR", RFC 1518, T.J. Watson Research Center, cisco, September 1993 Security Considerations Security issues are not discussed in this document. Editors' Addresses Yakov Rekhter T.J. Watson Research Center IBM Corporation P.O. Box 704, Office H3-D40 Yorktown Heights, NY 10598 Phone: +1 914 784 7361 EMail: yakov@watson.ibm.com Tony Li cisco Systems, Inc. 170 W. Tasman Dr. San Jose, CA 95134 EMail: tli@cisco.com