4.4 Race Conditions
MGCP deals with race conditions through the notion of a "quarantine list" and through explicit detection of desynchronization, e.g., for mismatched hook state due to glare for an endpoint.
MGCP does not assume that the transport mechanism will maintain the order of commands and responses. This may cause race conditions, that may be obviated through a proper behavior of the Call Agent. (Note that some race conditions are inherent to distributed systems; they would still occur, even if the commands were transmitted in strict order.) In some cases, many gateways may decide to restart operation at the same time. This may occur, for example, if an area loses power or transmission capability during an earthquake or an ice storm. When power and transmission are reestablished, many gateways may decide to send "RestartInProgress" commands simultaneously, leading to very unstable operation.4.4.1 Quarantine List
MGCP controlled gateways will receive "notification requests" that ask them to watch for a list of "events". The protocol elements that determine the handling of these events are the "Requested Events" list, the "Digit Map", the "Quarantine Handling", and the "Detect Events" list. When the endpoint is initialized, the requested events list only consists of persistent events for the endpoint, and the digit map is assumed empty. At this point, the endpoint MAY use an implicit NotificationRequest with the reserved RequestIdentifier zero ("0") to detect and report a persistent event, e.g., off-hook. A pre-existing off-hook condition MUST here result in the off-hook event being generated as well. The endpoint awaits the reception of a NotificationRequest command, after which the gateway starts observing the endpoint for occurrences of the events mentioned in the list, including persistent events. The events are examined as they occur. The action that follows is determined by the "action" parameter associated with the event in the list of requested events, and also by the digit map. The events that are defined as "accumulate" or "accumulate according to digit map" are accumulated in a list of events, the events that are marked as "accumulate according to the digit map" will additionally be accumulated in the "current dial string". This will go on until one event is encountered that triggers a notification which will be sent to the current "notified entity". The gateway, at this point, will transmit the Notify command and will place the endpoint in a "notification" state. As long as the endpoint is in this notification state, the events that are to be detected on the endpoint are stored in a "quarantine" buffer (FIFO)
for later processing. The events are, in a sense, "quarantined". All events that are specified by the union of the RequestedEvents parameter and the most recently received DetectEvents parameter or, in the absence of the latter, all events that are referred to in the RequestedEvents, SHALL be detected and quarantined, regardless of the action associated with the event. Persistent events are here viewed as implicitly included in RequestedEvents. If the quarantine buffer reaches the capacity of the endpoint, a Quarantine Buffer Overflow event (see Appendix B) SHOULD be generated (when this event is supported, the endpoint MUST ensure it has capacity to include the event in the quarantine buffer). Excess events will now be discarded. The endpoint exits the "notification state" when the response (whether success or failure) to the Notify command is received. The Notify command may be retransmitted in the "notification state", as specified in Section 3.5 and 4. If the endpoint is or becomes disconnected (see Section 4.3) during this, a response to the Notify command will never be received. The Notify command is then lost and hence no longer considered pending, yet the endpoint is still in the "notification state". Should that occur, completion of the disconnected procedure specified in Section 4.4.7 SHALL then lead the endpoint to exit the "notification state". When the endpoint exits the "notification state" it resets the list of observed events and the "current dial string" of the endpoint to a null value. Following that point, the behavior of the gateway depends on the value of the QuarantineHandling parameter in the triggering NotificationRequest command: If the Call Agent had specified, that it expected at most one notification in response to the notification request command, then the gateway SHALL simply keep on accumulating events in the quarantine buffer until it receives the next notification request command. If, however, the gateway is authorized to send multiple successive Notify commands, it will proceed as follows. When the gateway exits the "notification state", it resets the list of observed events and the "current dial string" of the endpoint to a null value and starts processing the list of quarantined events, using the already received list of requested events and digit map. When processing these events, the gateway may encounter an event which triggers a Notify command to be sent. If that is the case, the gateway can adopt one of the two following behaviors:
* it can immediately transmit a Notify command that will report all events that were accumulated in the list of observed events until the triggering event, included, leaving the unprocessed events in the quarantine buffer, * or it can attempt to empty the quarantine buffer and transmit a single Notify command reporting several sets of events (in a single list of observed events) and possibly several dial strings. The "current dial string" is reset to a null value after each triggering event. The events that follow the last triggering event are left in the quarantine buffer. If the gateway transmits a Notify command, the endpoint will reenter and remain in the "notification state" until the acknowledgement is received (as described above). If the gateway does not find a quarantined event that triggers a Notify command, it places the endpoint in a normal state. Events are then processed as they come, in exactly the same way as if a Notification Request command had just been received. A gateway may receive at any time a new Notification Request command for the endpoint, including the case where the endpoint is disconnected. Activating an embedded Notification Request is here viewed as receiving a new Notification Request as well, except that the current list of ObservedEvents remains unmodified rather than being processed again. When a new notification request is received in the notification state, the gateway SHALL ensure that the pending Notify is received by the Call Agent prior to a new Notify (note that a Notify that was lost due to being disconnected, is no longer considered pending). It does so by using the "piggybacking" functionality of the protocol. The messages will then be sent in a single packet to the current "notified entity". The steps involved are the following: a) the gateway sends a response to the new notification request. b) the endpoint is then taken out of the "notification state" without waiting for the acknowledgement of the pending Notify command. c) a copy of the unacknowledged Notify command is kept until an acknowledgement is received. If a timer elapses, the Notify will be retransmitted. d) If the gateway has to transmit a new Notify before the previous Notify(s) is acknowledged, it constructs a packet that piggybacks a repetition of the old Notify(s) and the new Notify (ordered by age with the oldest first). This datagram will be sent to the current "notified entity".
f) Gateways that cannot piggyback several messages in the same datagram and hence guarantee in-order delivery of two (or more) Notify's SHALL leave the endpoint in the "notification" state as long as the last Notify is not acknowledged.
The procedure is illustrated by the following diagram: +-------------------+ | Processing Events |<--------------------------------------+ +-------------------+ | | | Need to send NTFY | | | v | +-------------------+ | | Outstanding NTFY |---- No -------+ | | | | | +-------------------+ v | | +-----------+ | Yes | Send NTFY | | | +-----------+ | v | | +--------------------+ v | | Piggyback new NTFY | +--------------------+ | | w. old outstanding |---->| Notification State | | | NTFY(s) | +--------------------+ | +--------------------+ | | | new RQNT NTFY response | received received | | | | | v | | +-------------+ | | | Step mode ? |- No ->+ | +-------------+ ^ | | | | Yes | | | | | v | | +---------------+ | | | Wait for RQNT | | | +---------------+ | | | | | RQNT received | | | | | v | | +---------------+ | +------>| Apply RQNT and|----->+ | send response | +---------------+
Gateways may also attempt to deliver the pending Notify prior to a successful response to the new NotificationRequest by using the "piggybacking" functionality of the protocol. This was in fact required behavior in RFC 2705, however there are several complications in doing this, and the benefits are questionable. In particular, the RFC 2705 mechanism did not guarantee in-order delivery of Notify's and responses to NotificationRequests in general, and hence Call Agents had to handle out-of-order delivery of these messages anyway. The change to optional status is thus backwards compatible while greatly reducing complexity. After receiving the Notification Request command, the requested events list and digit map (if a new one was provided) are replaced by the newly received parameters, and the current dial string is reset to a null value. Furthermore, when the Notification Request was received in the "notification state", the list of observed events is reset to a null value. The subsequent behavior is conditioned by the value of the QuarantineHandling parameter. The parameter may specify that quarantined events (and observed events which in this case is now an empty list), should be discarded, in which case they will be. If the parameter specifies that the quarantined (and observed) events are to be processed, the gateway will start processing the list of quarantined (and observed) events, using the newly received list of requested events and digit map (if provided). When processing these events, the gateway may encounter an event which requires a Notify command to be sent. If that is the case, the gateway will immediately transmit a Notify command that will report all events that were accumulated in the list of observed events until the triggering event, included leaving the unprocessed events in the quarantine buffer, and will enter the "notification state". A new notification request may be received while the gateway has accumulated events according to the previous notification request, but has not yet detected a notification-triggering events, i.e., the endpoint is not in the "notification state". The handling of not- yet-notified events is determined, as with the quarantined events, by the quarantine handling parameter: * If the quarantine-handling parameter specifies that quarantined events shall be ignored, the observed events list is simply reset. * If the quarantine-handling parameter specifies that quarantined events shall be processed, the observed event list is transferred to the quarantined event list. The observed event list is then reset, and the quarantined event list is processed.
Call Agents controlling endpoints in lockstep mode SHOULD provide the response to a successful Notify message and the new NotificationRequest in the same datagram using the piggybacking mechanism.4.4.2 Explicit Detection
A key element of the state of several endpoints is the position of the hook. A race condition may occur when the user decides to go off-hook before the Call Agent has the time to ask the gateway to notify an off-hook event (the "glare" condition well known in telephony), or if the user goes on-hook before the Call Agent has the time to request the event's notification. To avoid this race condition, the gateway MUST check the condition of the endpoint before acknowledging a NotificationRequest. It MUST return an error: 1. If the gateway is requested to notify an "off-hook" transition while the phone is already off-hook, (error code 401 - phone off hook) 2. If the gateway is requested to notify an "on-hook" or "flash hook" condition while the phone is already on-hook (error code 402 - phone on hook). Additionally, individual signal definitions can specify that a signal will only operate under certain conditions, e.g., ringing may only be possible if the phone is already off-hook. If such prerequisites exist for a given signal, the gateway MUST return the error specified in the signal definition if the prerequisite is not met. It should be noted, that the condition check is performed at the time the notification request is received, whereas the actual event that caused the current condition may have either been reported, or ignored earlier, or it may currently be quarantined. The other state variables of the gateway, such as the list of RequestedEvents or list of requested signals, are entirely replaced after each successful NotificationRequest, which prevents any long term discrepancy between the Call Agent and the gateway. When a NotificationRequest is unsuccessful, whether it is included in a connection-handling command or not, the gateway MUST simply continue as if the command had never been received. As all other transactions, the NotificationRequest MUST operate as an atomic transaction, thus any changes initiated as a result of the command MUST be reverted.
Another race condition may occur when a Notify is issued shortly before the reception by the gateway of a NotificationRequest. The RequestIdentifier is used to correlate Notify commands with NotificationRequest commands thereby enabling the Call Agent to determine if the Notify command was generated before or after the gateway received the new NotificationRequest. This is especially important to avoid deadlocks in "step" mode.4.4.3 Transactional Semantics
As the potential transaction completion times increase, e.g., due to external resource reservations, a careful definition of the transactional semantics becomes increasingly important. In particular the issue of race conditions, e.g., as it relates to hook-state, must be defined carefully. An important point to consider is, that the status of a pre-condition (e.g., hook-state) may in fact change between the time a transaction starts and the time it either completes successfully (transaction commit) or fails. In general, we can say that the successful execution of a transaction depends on one or more pre-conditions where the status of one or more of the pre-conditions may change dynamically between the transaction start and transaction commit. The simplest semantics for this is simply to require that all pre- conditions be met from the time the transaction is initiated until the transaction commits. If any pre-condition is not met before the completion of the transaction, the transaction will also fail. As an example, consider a transaction that includes a request for the "off-hook" event. When the transaction is initiated the phone is "on-hook" and this pre-condition is therefore met. If the hook-state changes to "off-hook" before the transaction completes, the pre- condition is no longer met, and the transaction therefore immediately fails. Finally, we need to consider the point in time when a new transaction takes effect and endpoint processing according to an old transaction stops. For example, assume that transaction T1 has been executed successfully and event processing is currently being done according to transaction T1. Now we receive a new transaction T2 specifying new event processing (for example a CreateConnection with an encapsulated NotificationRequest). Since we don't know whether T2 will complete successfully or not, we cannot start processing events according to T2 until the outcome of T2 is known. While we could suspend all event processing until the outcome of T2 is known, this would make for a less responsive system and hence SHOULD NOT be done. Instead, when a new transaction Ty is received and Ty modifies
processing according to an old transaction Tx, processing according to Tx SHOULD remain active for as long as possible, until a successful outcome of Ty is known to occur. If Ty fails, then processing according to Tx will of course continue as usual. Any changes incurred by Ty logically takes effect when Ty commits. Thus, if the endpoint was in the notification state when Ty commits, and Ty contained a NotificationRequest, the endpoint will be taken out of the notification state when Ty commits. Note that this is independent of whether the endpoint was in the notification state when Ty was initiated. For example, a Notify could be generated due to processing according to Tx between the start and commit of Ty. If the commit of Ty leads to the endpoint entering the notification state, a new NotificationRequest (Tz) is needed to exit the notification state. This follows from the fact that transaction execution respects causal order. Another related issue is the use of wildcards, especially the "all of" wildcard, which may match more than one endpoint. When a command is requested, and the endpoint identifier matches more than one endpoint, transactional semantics still apply. Thus, the command MUST either succeed for all the endpoints, or it MUST fail for all of them. A single response is consequently always issued.4.4.4 Ordering of Commands, and Treatment of Misorder
MGCP does not mandate that the underlying transport protocol guarantees in-order delivery of commands to a gateway or an endpoint. This property tends to maximize the timeliness of actions, but it has a few drawbacks. For example: * Notify commands may be delayed and arrive at the Call Agent after the transmission of a new Notification Request command, * If a new NotificationRequest is transmitted before a previous one is acknowledged, there is no guarantee that the previous one will not be received and executed after the new one. Call Agents that want to guarantee consistent operation of the endpoints can use the following rules: 1) When a gateway handles several endpoints, commands pertaining to the different endpoints can be sent in parallel, for example following a model where each endpoint is controlled by its own process or its own thread. 2) When several connections are created on the same endpoint, commands pertaining to different connections can be sent in parallel.
3) On a given connection, there should normally be only one outstanding command (create or modify). However, a DeleteConnection command can be issued at any time. In consequence, a gateway may sometimes receive a ModifyConnection command that applies to a previously deleted connection. Such commands will fail, and an error code MUST be returned (error code 515 - incorrect connection-id, is RECOMMENDED). 4) On a given endpoint, there should normally be only one outstanding NotificationRequest command at any time. The RequestId parameter MUST be used to correlate Notify commands with the triggering notification request. 5) In some cases, an implicitly or explicitly wildcarded DeleteConnection command that applies to a group of endpoints can step in front of a pending CreateConnection command. The Call Agent should individually delete all connections whose completion was pending at the time of the global DeleteConnection command. Also, new CreateConnection commands for endpoints named by the wild-carding SHOULD NOT be sent until the wild-carded DeleteConnection command is acknowledged. 6) When commands are embedded within each other, sequencing requirements for all commands must be adhered to. For example a Create Connection command with a Notification Request in it must adhere to the sequencing requirements associated with both CreateConnection and NotificationRequest at the same time. 7) AuditEndpoint and AuditConnection are not subject to any sequencing requirements. 8) RestartInProgress MUST always be the first command sent by an endpoint as defined by the restart procedure. Any other command or non-restart response (see Section 4.4.6), except for responses to auditing, MUST be delivered after this RestartInProgress command (piggybacking allowed). 9) When multiple messages are piggybacked in a single packet, the messages are always processed in order. 10) On a given endpoint, there should normally be only one outstanding EndpointConfiguration command at any time. Gateways MUST NOT make any assumptions as to whether Call Agents follow these rules or not. Consequently gateways MUST always respond to commands, regardless of whether they adhere to the above rules or not. To ensure consistent operation, gateways SHOULD behave as specified below when one or more of the above rules are not followed:
* Where a single outstanding command is expected (ModifyConnection, NotificationRequest, and EndpointConfiguration), but the same command is received in a new transaction before the old finishes executing, the gateway SHOULD fail the previous command. This includes the case where one or more of the commands were encapsulated. The use of error code 407 (transaction aborted) is RECOMMENDED. * If a ModifyConnection command is received for a pending CreateConnection command, the ModifyConnection command SHOULD simply be rejected. The use of error code 400 (transient error) is RECOMMENDED. Note that this situation constitutes a Call Agent programming error. * If a DeleteConnection command is received for a pending CreateConnection or ModifyConnection command, the pending command MUST be aborted. The use of error code 407 (transaction aborted) is RECOMMENDED. Note, that where reception of a new command leads to aborting an old command, the old command SHOULD be aborted regardless of whether the new command succeeds or not. For example, if a ModifyConnection command is aborted by a DeleteConnection command which itself fails due to an encapsulated NotificationRequest, the ModifyConnection command is still aborted.4.4.5 Endpoint Service States
As described earlier, endpoints configured for operation may be either in-service or out-of-service. The actual service-state of the endpoint is reflected by the combination of the RestartMethod and RestartDelay parameters, which are sent with RestartInProgress commands (Section 2.3.12) and furthermore may be audited in AuditEndpoint commands (Section 2.3.10). The service-state of an endpoint affects how it processes a command. An endpoint in-service MUST process any command received, whereas an endpoint that is out-of-service MUST reject non-auditing commands, but SHOULD process auditing commands if possible. For backwards compatibility, auditing commands for an out-of-service endpoint may alternatively be rejected as well. Any command rejected due to an endpoint being out-of-service SHOULD generate error code 501 (endpoint not ready/out-of-service). Note that (per Section 2.1.2), unless otherwise specified for a command, endpoint names containing the "any of" wildcard only refer to endpoints in-service, whereas endpoint names containing the "all of" wildcard refer to all endpoints, regardless of service state.
The above relationships are illustrated in the table below which shows the current service-states and gateway processing of commands as a function of the RestartInProgress command sent and the response (if any) received to it. The last column also lists (in parentheses) the RestartMethod to be returned if audited:
------------------------------------------------------------------ | Restart- | Restart- | 2xx | Service- | Response to | | Method | Delay | received ?| State | new command | |------------------------------------------------------------------| | graceful | zero | Yes/No | In | non-audit: 2xx | | | | | | audit: 2xx | | | | | | (graceful) | |-----------+----------+-----------+----------+--------------------| | graceful | non-zero | Yes/No | In* | non-audit: 2xx | | | | | | audit: 2xx | | | | | | (graceful) | |-----------+----------+-----------+----------+--------------------| | forced | N/A | Yes/No | Out | non-audit: 501 | | | | | | audit: 2xx | | | | | | (forced) | |-----------+----------+-----------+----------+--------------------| | restart | zero | No | In | non-audit: 2xx,405*| | | | | | audit: 2xx | | | | | | (restart) | |-----------+----------+-----------+----------+--------------------| | restart | zero | Yes | In | non-audit: 2xx | | | | | | audit: 2xx | | | | | | (restart) | |-----------+----------+-----------+----------+--------------------| | restart | non-zero | No | Out* | non-audit: 501* | | | | | | audit: 2xx | | | | | | (restart) | |-----------+----------+-----------+----------+--------------------| | restart | non-zero | Yes | Out* | non-audit: 501* | | | | | | audit: 2xx | | | | | | (restart) | |-----------+----------+-----------+----------+--------------------| | discon- | zero/ | No | In | non-audit: 2xx, | | nected | non-zero | | | audit: 2xx | | | | | | (disconnected)| |-----------+----------+-----------+----------+--------------------| | discon- | zero/ | Yes | In | non-audit: 2xx | | nected | non-zero | | | audit: 2xx | | | | | | (restart) | |-----------+----------+-----------+----------+--------------------| | cancel- | N/A | Yes/No | In | non-audit: 2xx | | graceful | | | | audit: 2xx | | | | | | (restart) | ------------------------------------------------------------------
Notes (*): * The three service-states marked with "*" will change after the expiration of the RestartDelay at which time an updated RestartInProgress command SHOULD be sent. * If the endpoint returns 2xx when the restart procedure has not yet completed, then in-order delivery MUST still be satisfied, i.e., piggy-backing is to be used. If instead, the command is not processed, 405 SHOULD be returned. * Following a "restart" RestartInProgress with a non-zero RestartDelay, error code 501 is only returned until the endpoint goes in-service, i.e., until the expiration of the RestartDelay.4.4.6 Fighting the Restart Avalanche
Let's suppose that a large number of gateways are powered on simultaneously. If they were to all initiate a RestartInProgress transaction, the Call Agent would very likely be swamped, leading to message losses and network congestion during the critical period of service restoration. In order to prevent such avalanches, the following behavior is REQUIRED: 1) When a gateway is powered on, it MUST initiate a restart timer to a random value, uniformly distributed between 0 and a maximum waiting delay (MWD). Care should be taken to avoid synchronicity of the random number generation between multiple gateways that would use the same algorithm. 2) The gateway MUST then wait for either the end of this timer, the reception of a command from the Call Agent, or the detection of a local user activity, such as for example an off-hook transition on a residential gateway. 3) When the timer elapses, when a command is received, or when an activity is detected, the gateway MUST initiate the restart procedure. The restart procedure simply requires the endpoint to guarantee that the first * non-audit command, or * non-restart response (i.e., error codes other than 405, 501, and 520) to a non-audit command
that the Call Agent sees from this endpoint is a "restart" RestartInProgress command. The endpoint is free to take full advantage of piggybacking to achieve this. Endpoints that are considered in-service will have a RestartMethod of "restart", whereas endpoints considered out-of-service will have a RestartMethod of "forced" (also see Section 4.4.5). Commands rejected due to an endpoint not yet having completed the restart procedure SHOULD use error code 405 (endpoint "restarting"). The restart procedure is complete once a success response has been received. If an error response is received, the subsequent behavior depends on the error code in question: * If the error code indicates a transient error (4xx), then the restart procedure MUST be initiated again (as a new transaction). * If the error code is 521, then the endpoint is redirected, and the restart procedure MUST be initiated again (as a new transaction). The 521 response MUST have included a NotifiedEntity which then is the "notified entity" towards which the restart is initiated. If it did not include a NotifiedEntity, the response is treated as any other permanent error (see below). * If the error is any other permanent error (5xx), and the endpoint is not able to rectify the error, then the endpoint no longer initiates the restart procedure on its own (until rebooted/restarted) unless otherwise specified. If a command is received for the endpoint, the endpoint MUST initiate the restart procedure again. Note that if the RestartInProgress is piggybacked with the response (R) to a command received while restarting, then retransmission of the RestartInProgress does not require piggybacking of the response R. However, while the endpoint is restarting, a resend of the response R does require the RestartInProgress to be piggybacked to ensure in-order delivery of the two. Should the gateway enter the "disconnected" state while carrying out the restart procedure, the disconnected procedure specified in Section 4.4.7 MUST be carried out, except that a "restart" rather than "disconnected" message is sent during the procedure. Each endpoint in a gateway will have a provisionable Call Agent, i.e., "notified entity", to direct the initial restart message towards. When the collection of endpoints in a gateway is managed by more than one Call Agent, the above procedure MUST be performed for each collection of endpoints managed by a given Call Agent. The gateway MUST take full advantage of wild-carding to minimize the
number of RestartInProgress messages generated when multiple endpoints in a gateway restart and the endpoints are managed by the same Call Agent. Note that during startup, it is possible for endpoints to start out as being out-of-service, and then become in- service as part of the gateway initialization procedure. A gateway may thus choose to send first a "forced" RestartInProgress for all its endpoints, and subsequently a "restart" RestartInProgress for the endpoints that come in-service. Alternatively, the gateway may simply send "restart" RestartInProgress for only those endpoints that are in-service, and "forced" RestartInProgress for the specific endpoints that are out-of-service. Wild-carding MUST still be used to minimize the number of messages sent though. The value of MWD is a configuration parameter that depends on the type of the gateway. The following reasoning can be used to determine the value of this delay on residential gateways. Call agents are typically dimensioned to handle the peak hour traffic load, during which, in average, 10% of the lines will be busy, placing calls whose average duration is typically 3 minutes. The processing of a call typically involves 5 to 6 MGCP transactions between each endpoint and the Call Agent. This simple calculation shows that the Call Agent is expected to handle 5 to 6 transactions for each endpoint, every 30 minutes on average, or, to put it otherwise, about one transaction per endpoint every 5 to 6 minutes on average. This suggest that a reasonable value of MWD for a residential gateway would be 10 to 12 minutes. In the absence of explicit configuration, residential gateways should adopt a value of 600 seconds for MWD. The same reasoning suggests that the value of MWD should be much shorter for trunking gateways or for business gateways, because they handle a large number of endpoints, and also because the usage rate of these endpoints is much higher than 10% during the peak busy hour, a typical value being 60%. These endpoints, during the peak hour, are thus expected to contribute about one transaction per minute to the Call Agent load. A reasonable algorithm is to make the value of MWD per "trunk" endpoint six times shorter than the MWD per residential gateway, and also inversely proportional to the number of endpoints that are being restarted. For example MWD should be set to 2.5 seconds for a gateway that handles a T1 line, or to 60 milliseconds for a gateway that handles a T3 line.
4.4.7 Disconnected Endpoints
In addition to the restart procedure, gateways also have a "disconnected" procedure, which MUST be initiated when an endpoint becomes "disconnected" as described in Section 4.3. It should here be noted, that endpoints can only become disconnected when they attempt to communicate with the Call Agent. The following steps MUST be followed by an endpoint that becomes "disconnected": 1. A "disconnected" timer is initialized to a random value, uniformly distributed between 1 and a provisionable "disconnected" initial waiting delay (Tdinit), e.g., 15 seconds. Care MUST be taken to avoid synchronicity of the random number generation between multiple gateways and endpoints that would use the same algorithm. 2. The gateway then waits for either the end of this timer, the reception of a command for the endpoint from the Call Agent, or the detection of a local user activity for the endpoint, such as for example an off-hook transition. 3. When the "disconnected" timer elapses for the endpoint, when a command is received for the endpoint, or when local user activity is detected for the endpoint, the gateway initiates the "disconnected" procedure for the endpoint - if a disconnected procedure was already in progress for the endpoint, it is simply replaced by the new one. Furthermore, in the case of local user activity, a provisionable "disconnected" minimum waiting delay (Tdmin) MUST have elapsed since the endpoint became disconnected or the last time it ended the "disconnected" procedure in order to limit the rate at which the procedure is performed. If Tdmin has not passed, the endpoint simply proceeds to step 2 again, without affecting any disconnected procedure already in progress. 4. If the "disconnected" procedure still left the endpoint disconnected, the "disconnected" timer is then doubled, subject to a provisionable "disconnected" maximum waiting delay (Tdmax), e.g., 600 seconds, and the gateway proceeds with step 2 again (using a new transaction-id). The "disconnected" procedure is similar to the restart procedure in that it simply states that the endpoint MUST send a RestartInProgress command to the Call Agent informing it that the endpoint was disconnected. Furthermore, the endpoint MUST guarantee that the first non-audit message (non-audit command or response to non-audit command) that the Call Agent sees from this endpoint MUST inform the Call Agent that the endpoint is disconnected (unless the endpoint goes out-of-service). When a command (C) is received, this is achieved by sending a piggy-backed datagram with a "disconnected"
RestartInProgress command and the response to command C to the source address of command C as opposed to the current "notified entity". This piggy-backed RestartInProgress is not automatically retransmitted by the endpoint but simply relies on fate-sharing with the piggy-backed response to guarantee the in-order delivery requirement. The Call Agent still sends a response to the piggy- backed RestartInProgress, however, as usual, the response may be lost. In addition to the piggy-backed RestartInProgress command, a new "disconnected" procedure is triggered by the command received. This will lead to a non piggy-backed copy (i.e., same transaction) of the "disconnected" RestartInProgress command being sent reliably to the current "notified entity". When the Call Agent learns that the endpoint is disconnected, the Call Agent may then for instance decide to audit the endpoint, or simply clear all connections for the endpoint. Note that each such "disconnected" procedure will result in a new RestartInProgress command, which will be subject to the normal retransmission procedures specified in Section 4.3. At the end of the procedure, the endpoint may thus still be "disconnected". Should the endpoint go out-of-service while being disconnected, it SHOULD send a "forced" RestartInProgress message as described in Section 2.3.12. The disconnected procedure is complete once a success response has been received. Error responses are handled similarly to the restart procedure (Section 4.4.6). If the "disconnected" procedure is to be initiated again following an error response, the rate-limiting timer considerations specified above still apply. Note, that if the RestartInProgress is piggybacked with the response (R) to a command received while being disconnected, then retransmission of this particular RestartInProgress does not require piggybacking of the response R. However, while the endpoint is disconnected, resending the response R does require the RestartInProgress to be piggybacked with the response to ensure the in-order delivery of the two. If a set of disconnected endpoints have the same "notified entity", and the set of endpoints can be named with a wildcard, the gateway MAY replace the individual disconnected procedures with a suitably wildcarded disconnected procedure instead. In that case, the Restart Delay for the wildcarded "disconnected" RestartInProgress command SHALL be the Restart Delay corresponding to the oldest disconnected procedure replaced. Note that if only a subset of these endpoints subsequently have their "notified entity" changed and/or are no longer disconnected, then that wildcarded disconnected procedure can no longer be used. The remaining individual disconnected procedures MUST then be resumed again.
A disconnected endpoint may wish to send a command (besides RestartInProgress) while it is disconnected. Doing so will only succeed once the Call Agent is reachable again, which raises the question of what to do with such a command meanwhile. At one extreme, the endpoint could drop the command right away, however that would not work very well when the Call Agent was in fact available, but the endpoint had not yet completed the "disconnected" procedure (consider for example the case where a NotificationRequest was just received which immediately resulted in a Notify being generated). To prevent such scenarios, disconnected endpoints SHALL NOT blindly drop new commands to be sent for a period of T-MAX seconds after they receive a non-audit command. One way of satisfying this requirement is to employ a temporary buffering of commands to be sent, however in doing so, the endpoint MUST ensure, that it: * does not build up a long queue of commands to be sent, * does not swamp the Call Agent by rapidly sending too many commands once it is connected again. Buffering commands for T-MAX seconds and, once the endpoint is connected again, limiting the rate at which buffered commands are sent to one outstanding command per endpoint is considered acceptable (see also Section 4.4.8, especially if using wildcards). If the endpoint is not connected within T-MAX seconds, but a "disconnected" procedure is initiated within T-MAX seconds, the endpoint MAY piggyback the buffered command(s) with that RestartInProgress. Note, that once a command has been sent, regardless of whether it was buffered initially, or piggybacked earlier, retransmission of that command MUST cease T-MAX seconds after the initial send as described in Section 4.3. This specification purposely does not specify any additional behavior for a disconnected endpoint. Vendors MAY for instance choose to provide silence, play reorder tone, or even enable a downloaded wav file to be played. The default value for Tdinit is 15 seconds, the default value for Tdmin, is 15 seconds, and the default value for Tdmax is 600 seconds.
4.4.8 Load Control in General
The previous sections have described several MGCP mechanisms to deal with congestion and overload, namely: * the UDP retransmission strategy which adapts to network and call agent congestion on a per endpoint basis, * the guidelines on the ordering of commands which limit the number of commands issued in parallel, * the restart procedure which prevents flooding in case of a restart avalanche, and * the disconnected procedure which prevents flooding in case of a large number of disconnected endpoints. It is however still possible for a given set of endpoints, either on the same or different gateways, to issue one or more commands at a given point in time. Although it can be argued, that Call Agents should be sized to handle one message per served endpoint at any given point in time, this may not always be the case in practice. Similarly, gateways may not be able to handle a message for all of its endpoints at any given point in time. In general, such issues can be dealt with through the use of a credit-based mechanism, or by monitoring and automatically adapting to the observed behavior. We opt for the latter approach as follows. Conceptually, we assume that Call Agents and gateways maintain a queue of incoming transactions to be executed. Associated with this transaction queue is a high-water and a low-water mark. Once the queue length reaches the high-water mark, the entity SHOULD start issuing 101 provisional responses (transaction queued) until the queue length drops to the low-water mark. This applies to new transactions as well as to retransmissions. If the entity is unable to process any new transactions at this time, it SHOULD return error code 409 (processing overload). Furthermore, gateways SHOULD adjust the sending rate of new commands to a given Call Agent by monitoring the observed response times from that Call Agent to a *set* of endpoints. If the observed smoothed average response time suddenly rises significantly over some threshold, or the gateway receives a 101 (transaction queued) or 409 (overload) response, the gateway SHOULD adjust the sending rate of new commands to that Call Agent accordingly. The details of the smoothing average algorithm, the rate adjustments, and the thresholds involved are for further study, however they MUST be configurable.
Similarly, Call Agents SHOULD adjust the sending rate of new transactions to a given gateway by monitoring the observed response times from that gateway for a *set* of endpoints. If the observed smoothed average response time suddenly rises significantly over some threshold, or the Call Agent receives a 101 (transaction queued) or 409 (overloaded), the Call Agent SHOULD adjust the sending rate of new commands to that gateway accordingly. The details of the smoothing average algorithm, the rate adjustments, and the thresholds involved are for further study, however they MUST be configurable.5. Security Requirements
Any entity can send a command to an MGCP endpoint. If unauthorized entities could use the MGCP, they would be able to set-up unauthorized calls, or to interfere with authorized calls. We expect that MGCP messages will always be carried over secure Internet connections, as defined in the IP security architecture as defined in RFC 2401, using either the IP Authentication Header, defined in RFC 2402, or the IP Encapsulating Security Payload, defined in RFC 2406. The complete MGCP protocol stack would thus include the following layers: ------------------------------- | MGCP | |-------------------------------| | UDP | |-------------------------------| | IP security | | (authentication or encryption)| |-------------------------------| | IP | |-------------------------------| | transmission media | ------------------------------- Adequate protection of the connections will be achieved if the gateways and the Call Agents only accept messages for which IP security provided an authentication service. An encryption service will provide additional protection against eavesdropping, thus preventing third parties from monitoring the connections set up by a given endpoint. The encryption service will also be requested if the session descriptions are used to carry session keys, as defined in SDP.
These procedures do not necessarily protect against denial of service attacks by misbehaving gateways or misbehaving Call Agents. However, they will provide an identification of these misbehaving entities, which should then be deprived of their authorization through maintenance procedures.5.1 Protection of Media Connections
MGCP allows Call Agent to provide gateways with "session keys" that can be used to encrypt the audio messages, protecting against eavesdropping. A specific problem of packet networks is "uncontrolled barge-in". This attack can be performed by directing media packets to the IP address and UDP port used by a connection. If no protection is implemented, the packets will be decoded and the signals will be played on the "line side". A basic protection against this attack is to only accept packets from known sources, however this tends to conflict with RTP principles. This also has two inconveniences: it slows down connection establishment and it can be fooled by source spoofing: * To enable the address-based protection, the Call Agent must obtain the source address of the egress gateway and pass it to the ingress gateway. This requires at least one network round trip, and leaves us with a dilemma: either allow the call to proceed without waiting for the round trip to complete, and risk for example "clipping" a remote announcement, or wait for the full round trip and settle for slower call-set-up procedures. * Source spoofing is only effective if the attacker can obtain valid pairs of source and destination addresses and ports, for example by listening to a fraction of the traffic. To fight source spoofing, one could try to control all access points to the network. But this is in practice very hard to achieve. An alternative to checking the source address is to encrypt and authenticate the packets, using a secret key that is conveyed during the call set-up procedure. This will not slow down the call set-up, and provides strong protection against address spoofing.6. Packages
As described in Section 2.1.6, packages are the preferred way of extending MGCP. In this section we describe the requirements associated with defining a package.
A package MUST have a unique package name defined. The package name MUST be registered with the IANA, unless it starts with the characters "x-" or "x+" which are reserved for experimental packages. Please refer to Appendix C for IANA considerations. A package MUST also have a version defined which is simply a non- negative integer. The default and initial version of a package is zero, the next version is one, etc. New package versions MUST be completely backwards compatible, i.e., a new version of a package MUST NOT redefine or remove any of the extensions provided in an earlier version of the package. If such a need arises, a new package name MUST be used instead. Packages containing signals of type time-out MAY indicate if the "to" parameter is supported for all the time-out signals in the package as well as the default rounding rules associated with these (see Section 3.2.2.4). If no such definition is provided, each time-out signal SHOULD provide these definitions. A package defines one or more of the following extensions: * Actions * BearerInformation * ConnectionModes * ConnectionParameters * DigitMapLetters * Events and Signals * ExtensionParameters * LocalConnectionOptions * ReasonCodes * RestartMethods * Return codes For each of the above types of extensions supported by the package, the package definition MUST contain a description of the extension as defined in the following sections. Please note, that package extensions, just like any other extension, MUST adhere to the MGCP grammar.
6.1 Actions
Extension Actions SHALL include: * The name and encoding of the extension action. * If the extension action takes any action parameters, then the name, encoding, and possible values of those parameters. * A description of the operation of the extension action. * A listing of the actions in this specification the extension can be combined with. If such a listing is not provided, it is assumed that the extension action cannot be combined with any other action in this specification. * If more than one extension action is defined in the package, then a listing of the actions in the package the extension can be combined with. If such a listing is not provided, it is assumed that the extension action cannot be combined with any other action in the package. Extension actions defined in two or more different packages SHOULD NOT be used simultaneously, unless very careful consideration to their potential interaction and side-effects has been given.6.2 BearerInformation
BearerInformation extensions SHALL include: * The name and encoding of the BearerInformation extension. * The possible values and encoding of those values that can be assigned to the BearerInformation extension. * A description of the operation of the BearerInformation extension. As part of this description the default value (if any) if the extension is omitted in an EndpointConfiguration command MUST be defined. It may be necessary to make a distinction between the default value before and after the initial application of the parameter, for example if the parameter retains its previous value once specified, until explicitly altered. If default values are not described, then the extension parameter simply defaults to empty in all EndpointConfiguration commands. Note that the extension SHALL be included in the result for an AuditEndpoint command auditing the BearerInformation.
6.3 ConnectionModes
Extension Connection Modes SHALL include: * The name and encoding of the extension connection mode. * A description of the operation of the extension connection mode. * A description of the interaction a connection in the extension connection mode will have with other connections in each of the modes defined in this specification. If such a description is not provided, the extension connection mode MUST NOT have any interaction with other connections on the endpoint. Extension connection modes SHALL NOT be included in the list of modes in a response to an AuditEndpoint for Capabilities, since the package will be reported in the list of packages.6.4 ConnectionParameters
Extension Connection Parameters SHALL include: * The name and encoding of the connection parameter extension. * The possible values and encoding of those values that can be assigned to the connection parameter extension. * A description of how those values are derived. Note that the extension connection parameter MUST be included in the result for an AuditConnection command auditing the connection parameters.6.5 DigitMapLetters
Extension Digit Map Letters SHALL include: * The name and encoding of the extension digit map letter(s). * A description of the meaning of the extension digit map letter(s). Note that extension DigitMapLetters in a digit map do not follow the normal naming conventions for extensions defined in packages. More specifically the package name and slash ("/") will not be part of the extension name, thereby forming a flat and limited name space with potential name clashing.
Therefore, a package SHALL NOT define a digit map letter extension whose encoding has already been used in another package. If two packages have used the same encoding for a digit map letter extension, and those two packages are supported by the same endpoint, the result of using that digit map letter extension is undefined. Note that although an extension DigitMapLetter does not include the package name prefix and slash ("/") as part of the extension name within a digit map, the package name prefix and slash are included when the event code for the event that matched the DigitMapLetter is reported as an observed event. In other words, the digit map just define the matching rule(s), but the event is still reported like any other event.6.6 Events and Signals
The event/signal definition SHALL include the precise name of the event/signal (i.e., the code used in MGCP), a plain text definition of the event/signal, and, when appropriate, the precise definition of the corresponding events/signals, for example the exact frequencies of audio signals such as dial tones or DTMF tones. The package description MUST provide, for each event/signal, the following information: * The description of the event/signal and its purpose, which SHOULD include the actual signal that is generated by the client (e.g., xx ms FSK tone) as well as the resulting user observed result (e.g., Message Waiting light on/off). The event code used for the event/signal. * The detailed characteristics of the event/signal, such as for example frequencies and amplitude of audio signals, modulations and repetitions. Such details may be country specific. * The typical and maximum duration of the event/signal if applicable. * If the signal or event can be applied to a connection (across a media stream), it MUST be indicated explicitly. If no such indication is provided, it is assumed that the signal or event cannot be applied to a connection. For events, the following MUST be provided as well: * An indication if the event is persistent. By default, events are not persistent - defining events as being persistent is discouraged (see Appendix B for a preferred alternative). Note that persistent
events will automatically trigger a Notify when they occur, unless the Call Agent explicitly instructed the endpoint otherwise. This not only violates the normal MGCP model, but also assumes the Call Agent supports the package in question. Such an assumption is unlikely to hold in general. * An indication if there is an auditable event-state associated with the event. By default, events do not have auditable event-states. * If event parameters are supported, it MUST be stated explicitly. The precise syntax and semantics of these MUST then be provided (subject to the grammar provided in Appendix A). It SHOULD also be specified whether these parameters apply to RequestedEvents, ObservedEvents, DetectEvents and EventStates. If not specified otherwise, it is assumed that: * they do not apply to RequestedEvents, * they do apply to ObservedEvents, * they apply in the same way to DetectEvents as they do to RequestedEvents for a given event parameter, * they apply in the same way to EventStates as they do to ObservedEvents for a given event parameter. * If the event is expected to be used in digit map matching, it SHOULD explicitly state so. Note that only events with single letter or digit parameter codes can do this. See Section 2.1.5 for further details. For signals, the following MUST be provided as well: * The type of signal (OO, TO, BR). * Time-Out signals SHOULD have an indication of the default time-out value. In some cases, time-out values may be variable (if dependent on some action to complete such as out-pulsing digits). * If signal parameters are supported, it MUST be stated explicitly. The precise syntax and semantics of these MUST then be provided (subject to the grammar provided in Appendix A). * Time-Out signals may also indicate whether the "to" parameter is supported or not as well as what the rounding rules associated with them are. If omitted from the signal definition, the package-wide definition is assumed (see Section 6). If the package definition did not specify this, rounding rules default to the nearest non-
zero second, whereas support for the "to" parameter defaults to "no" for package version zero, and "yes" for package versions one and higher. The following format is RECOMMENDED for defining events and signals in conformance with the above: ------------------------------------------------------------------ | Symbol | Definition | R | S Duration | |---------|----------------------------|-----|---------------------| | | | | | | | | | | ------------------------------------------------------------------ where: * Symbol indicates the event code used for the event/signal, e.g., "hd". * Definition gives a brief definition of the event/signal * R contains an "x" if the event can be detected or one or more of the following symbols: - "P" if the event is persistent. - "S" if the events is an event-state that may be audited. - "C" if the event can be detected on a connection. * S contains one of the following if it is a signal: - "OO" if the signal is On/Off signal. - "TO" if the signal is a Time-Out signal. - "BR" if the signal is a Brief signal. * S also contains: - "C" if the signal can be applied on a connection. The table SHOULD then be followed by a more comprehensive description of each event/signal defined.
6.6.1 Default and Reserved Events
All packages that contain Time-Out type signals contain the operation failure ("of") and operation complete ("oc") events, irrespective of whether they are provided as part of the package description or not. These events are needed to support Time-Out signals and cannot be overridden in packages with Time-Out signals. They MAY be extended if necessary, however such practice is discouraged. If a package without Time-Out signals does contain definitions for the "oc" and "of" events, the event definitions provided in the package MAY over-ride those indicated here. Such practice is however discouraged and is purely allowed to avoid potential backwards compatibility problems. It is considered good practice to explicitly mention that the two events are supported in accordance with their default definitions, which are as follows: ------------------------------------------------------------------ | Symbol | Definition | R | S Duration | |---------|----------------------------|-----|---------------------| | oc | Operation Complete | x | | | of | Operation Failure | x | | ------------------------------------------------------------------ Operation complete (oc): The operation complete event is generated when the gateway was asked to apply one or several signals of type TO on the endpoint or connection, and one or more of those signals completed without being stopped by the detection of a requested event such as off-hook transition or dialed digit. The completion report should carry as a parameter the name of the signal that came to the end of its live time, as in: O: G/oc(G/rt) In this case, the observed event occurred because the "rt" signal in the "G" package timed out. If the reported signal was applied on a connection, the parameter supplied will include the name of the connection as well, as in: O: G/oc(G/rt@0A3F58) When the operation complete event is requested, it cannot be parameterized with any event parameters. When the package name is omitted (which is discouraged) as part of the signal name, the default package is assumed.
Operation failure (of): The operation failure event is generated when the endpoint was asked to apply one or several signals of type TO on the endpoint or connection, and one or more of those signals failed prior to timing out. The completion report should carry as a parameter the name of the signal that failed, as in: O: G/of(G/rt) In this case a failure occurred in producing the "rt" signal in the "G" package. When the reported signal was applied on a connection, the parameter supplied will include the name of the connection as well, as in: O: G/of(G/rt@0A3F58) When the operation failure event is requested, event parameters can not be specified. When the package name is omitted (which is discouraged), the default package name is assumed.6.7 ExtensionParameters
Extension parameter extensions SHALL include: * The name and encoding of the extension parameter. * The possible values and encoding of those values that can be assigned to the extension parameter. * For each of the commands defined in this specification, whether the extension parameter is Mandatory, Optional, or Forbidden in requests as well as responses. Note that extension parameters SHOULD NOT normally be mandatory. * A description of the operation of the extension parameter. As part of this description the default value (if any) if the extension is omitted in a command MUST be defined. It may be necessary to make a distinction between the default value before and after the initial application of the parameter, for example if the parameter retains its previous value once specified, until explicitly altered. If default values are not described, then the extension parameter simply defaults to empty in all commands. * Whether the extension can be audited in AuditEndpoint and/or AuditConnection as well as the values returned. If nothing is specified, then auditing of the extension parameter can only be done for AuditEndpoint, and the value returned SHALL be the current value for the extension. Note that this may be empty.