9. MSML Dialog Packages
9.1. Overview
MSML Dialog Packages define an XML [n2] language for composing complex media objects from a vocabulary of simple media resource objects called primitives. It is primarily a descriptive or declarative language to describe media processing objects. MSML dialogs operate on a single or multiple streams that are identified by the MSML document outside the scope of the MSML Dialog Package. MSML dialogs are intended to be used in different environments. As such, the language itself does not define how an MSML dialog is used. Each environment in which an MSML dialog is used must define how it is used, the set of services provided, and the mechanism for passing information between the environment and MSML dialog. The specific mechanisms used to realize the interface between MSML dialog and its environment are platform specific. MSML Dialog Packages provide two models for access to media resources and service creation building blocks. Both models MAY be used in conjunction with each other in a complementary manner. The first model (referred to as "Media Primitives and Composites", part of the mandatory MSML Dialog Base Package) contains media primitives (such as digit collection and announcements) and composite functions (such as play and collect combined as a single operation). The second model (referred to as "Media Groups", part of the optional MSML Dialog Group Package) allows the ability to define complex customized interactions, via event passing mechanisms, between media primitives, if required.
MSML Dialog Core Package

   Defines the core framework over which all MSML Dialog Packages operate.

MSML Dialog Base Package

   Media Primitives

      <dtmf> or <collect>   DTMF digit collection
      <play>                Playing of announcements
      <dtmfgen>             Generation of DTMF digits
      <tonegen>             Tone generation
      <record>              Media recording

   Media Composites

      <collect>   Supports play and collect operation. Composite function with inclusion of play.
      <record>    Supports play and record operation. Composite function with inclusion of play.

MSML Dialog Group Package

   <group>   Allows grouping of media primitives for parallel execution, with an event exchange mechanism between the media primitives to achieve customized media operations. All the above media primitive elements are accepted within the group.

The following operations MUST be supported using the elements described above, with either the MSML Dialog Base Package or the MSML Dialog Group Package.

   Announcement only    <play>
   Collection only      <dtmf> or <collect>
   Recording only       <record>
   Play and Collect     <collect> <play/> </collect>
   Play and Record      <record> <play/> </record>

Additional MSML Dialog Packages are:

   o MSML Dialog Transform Package
   o MSML Dialog Speech Package
   o MSML Fax Detection Package
   o MSML Fax Send/Receive Package

MSML dialogs MAY be used to simply expose primitive media resource objects but will be used more often to describe dialog operations and media transformation objects that can be controlled via user interaction. MSML dialogs do not contain any computation or flow control constructs. There are no results automatically generated when media operations complete. Results MUST be explicitly requested using a <send> or <exit> element within the definition of the MSML dialog.

9.2. Primitives
Primitives perform a single function on a media stream or multiple streams such as generating audio/video, recognizing speech or DTMF, or adjusting the gain. They may be composed so that primitives execute concurrently. Primitives not composed for concurrent execution MUST simply execute sequentially in the order they occur in an MSML document. All concurrently executing primitives in the same MSML object (defined in one MSML document) MAY interact with each other through events (see MSML Dialog Group Package).

Primitives are categorized into one of the following descriptive categories.

o Recognizers have a media input but no output. They allow different things within a media stream to be recognized or detected and for events to be generated based upon received media.

o Transformers have one media input and output and may send and receive events.

o Sources and sinks generate or consume media. They have either a media input or a media output but not both. They may receive and generate events.

o Composites combine underlying primitives to provide higher-level user interaction, without the need for specific event-based exchange between the primitives. The composite elements provide a simpler mechanism for more commonly used services, such as play and collect or play and record.

Primitives may define different media processing behavior (states) based upon the events that they receive. Primitives that support different processing states must define their default starting state and should support the "initial" attribute to allow that state to be specified when the primitive is instantiated. All primitives must support the "terminate" event class.

The following types of primitives are defined within this specification:

   Recognizers    Transformers   Source/Sink    Composites
   ------------------------------------------------------
   dtmf/collect   agc            play           dtmf/collect
   faxdetect      clamp          record         record
   speech         gain           dtmfgen
   vad            gate           tonegen
                  relay          faxsend
                                 faxrcv

Primitives have shadow variables, similar to those within VoiceXML [n5], which are automatically assigned values when the primitives are used. Upon initialization of an MSML dialog context, all shadow variables have the string value "undefined". Each primitive has its own instance of shadow variables that are global in scope to the entire MSML dialog context. Names SHOULD be assigned to individual primitives when more than one primitive of the same type is used within one MSML document. Shadow variables are overwritten if the primitive has not been named and is instantiated a second time. Shadow variables cannot be modified under user control. They may be returned from the MSML dialog context using the <send> element.
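The shadow variable rules above can be illustrated with a short sketch. It is not part of this specification: the class ShadowVariables, its methods, and the way a primitive name is folded into the lookup key are assumptions made only for illustration. The "undefined" initial value and the overwrite behavior for unnamed primitives come from the text; "play.end" is a shadow variable of <play> (Section 9.7.1).

```python
UNDEFINED = "undefined"  # value every shadow variable holds at context start


class ShadowVariables:
    """Hypothetical per-dialog-context store for primitive shadow variables."""

    def __init__(self):
        self._values = {}

    @staticmethod
    def _key(ptype, variable, name=None):
        # Assumption for this sketch: a named primitive gets its own slot.
        return f"{ptype}.{name}.{variable}" if name else f"{ptype}.{variable}"

    def assign(self, ptype, variable, value, name=None):
        """Called by the primitive itself, never by the user."""
        self._values[self._key(ptype, variable, name)] = str(value)

    def read(self, ptype, variable, name=None):
        """Unassigned variables read as the string "undefined"."""
        return self._values.get(self._key(ptype, variable, name), UNDEFINED)


ctx = ShadowVariables()
before = ctx.read("play", "end")

# Two unnamed <play> primitives share one slot, so the second overwrites the first.
ctx.assign("play", "end", "play.complete")
ctx.assign("play", "end", "terminate")

# A named primitive keeps its own instance of the variable.
ctx.assign("play", "end", "play.complete", name="greeting")
```

In this model, naming the second <play> is what preserves the result of the first, which is the reason the text recommends names when a primitive type is used more than once.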
9.3. Events
Events provide the mechanism for primitives to interact with each other and for an MSML context to interact with its external environment. The external environment is defined by the way in which an MSML context has been invoked. This will often be through MSML, but other languages and protocols such as SIP may also be used. Every primitive and group conceptually implements their own event queue. Events sent to them get placed into their associated queue. Events are removed from their queues and processed in order. Primitives within a group conceptually have their own thread of execution. Due to the asynchronous nature of servicing events from multiple queues, it cannot be assumed that several events sent in sequence to different queues will be processed in the order in which they were sent. For example, if recognition of something led to sending events to both a <play> and a <record> in that order, it is possible that the <record> may process its event before the <play>. Primitives each define the set of events that they support and the behavior associated with their handling of each event. This allows many types of behaviors to be defined. For example, VCR type controls can be constructed by defining primitives that support events corresponding to each control. Media recognition/detection can be used to cause those events to be generated. Alternatively, events can be originated elsewhere, such as from a control agent, and simply received by the primitive implementing the control. Examples of the use of events include adjusting volume (gain) and pause and resume of both announcement playout and record creation. Primitives act on events based upon the longest match of an event name. Event names are a period '.' delimited sequence of tokens. The first token, or the root of the name, can be considered an event class. Matching allows a standard meaning to be defined and then extended based upon what triggers an event's generation. 
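The longest-match rule can be sketched in a few lines. This is an illustrative model only: the function longest_match and the handler table are hypothetical, and the event name "terminate.cancelled" is used here purely as a sample extension of the "terminate" class.

```python
def longest_match(event_name, handlers):
    """Return the handler key sharing the most leading tokens with event_name.

    Event names are period-delimited; the first token is the event class.
    Returns None when not even the event class is handled.
    """
    tokens = event_name.split(".")
    # Try the full name first, then drop trailing tokens one at a time.
    for end in range(len(tokens), 0, -1):
        candidate = ".".join(tokens[:end])
        if candidate in handlers:
            return candidate
    return None


# Hypothetical handler table for a primitive that refines the "terminate" class.
record_handlers = {
    "terminate": "stop and keep the recording",
    "terminate.cancelled": "stop and discard the recording",
}
```

With this table, an event whose suffix the primitive does not know still resolves to the generic "terminate" handler, while an event of an unhandled class resolves to nothing.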
For example, a record primitive has different behavior depending upon whether it completed because a user stopped speaking or because it was cancelled. The recording is retained in the first case but not the second.

Longest match allows new recognizers to be created and used without changing how existing primitives are defined. For example, a face recognition capability could be created that generates a terminate.frowning event when a user looks puzzled. Although no primitive directly defines this event, it will still effect a generic terminate action. Primitives that require specialized behavior based upon frowning may be extended to support this. As well, the event can still be exported from the MSML context without requiring that primitives receiving the event understand facial expressions.

9.4. MSML Dialog Usage with SIP
MSML dialogs MAY be used directly with SIP for dialog interactions (e.g., IVR or fax). They can be initially invoked as part of the "Prompt and Collect" service described in "Basic Network Media Services with SIP" [n7]. That document defines service indicators for a small number of well-defined services using the user part of the SIP Request-URI (R-URI). The prompt and collect service uses "dialog" as the service indicator. URI parameters further refine the specific IVR request. This document defines an additional parameter "moml-param" for the dialog service indicator as follows:

   dialog-parameters = ";" ( dialog-param [ vxml-parameters ] )
                       | moml-param
   dialog-param      = "voicexml=" dialog-url
   moml-param        = "moml=" moml-url

There are no additional URI parameters when MSML is used as the dialog language.

MSML dialogs define discrete IVR dialog commands. These commands MAY be included directly in the body of the INVITE to the "dialog" service indicator by using the "cid" [n8] URL scheme. This scheme identifies a message body part that in this case would contain the MSML dialog request. Note that a multipart message body, containing a single part, MUST be present even if the INVITE does not contain an SDP offer. Subsequent MSML dialog requests are sent in the body of SIP INFO messages, as are all messages from a media server.

An example of a SIP URI as described above is:

   sip:dialog@mediaserver.example.net;\
       moml=cid:14864099865376@appserver.example.net

The body part that contained the MSML dialog referenced by the URL would have a Content-Id header of:

   Content-Id: <14864099865376@appserver.example.net>
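As an illustration of how the "moml" parameter and the Content-Id header relate, the following sketch extracts the parameter from the example Request-URI (written on one line, without the continuation backslash) and derives the matching header. The helper names are hypothetical, the parsing is deliberately simplified and is not a full SIP URI parser, and URL escaping inside the "cid" URL is ignored.

```python
def split_request_uri(uri):
    """Split a SIP URI into (user, host, params) using simple string rules."""
    scheme, _, rest = uri.partition(":")
    if scheme.lower() != "sip":
        raise ValueError("expected a sip: URI")
    # Everything before the first ';' is user@host; the rest are URI parameters.
    addr, *raw_params = rest.split(";")
    user, _, host = addr.partition("@")
    params = {}
    for item in raw_params:
        name, _, value = item.partition("=")
        params[name] = value
    return user, host, params


def content_id_for(cid_url):
    """Map a cid: URL to the Content-Id value of the body part it names."""
    prefix = "cid:"
    if not cid_url.startswith(prefix):
        raise ValueError("expected a cid: URL")
    return "<" + cid_url[len(prefix):] + ">"


uri = ("sip:dialog@mediaserver.example.net;"
       "moml=cid:14864099865376@appserver.example.net")
user, host, params = split_request_uri(uri)
header = "Content-Id: " + content_id_for(params["moml"])
```

A receiver following this sketch would look for the body part whose Content-Id equals the derived value and treat its content as the MSML dialog request.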
The results of executing an <exit> or <disconnect>, or of executing a <send> that has a "target" attribute value equal to "source", are notified in SIP INFO messages using the <event> element from MSML Core package. No messages are sent if execution completes normally without executing one of these elements. If there is an error during validation or execution, then a media server MUST notify the error as described above and must include the namelist items "moml.error.status" and "moml.error.description". The values for these items are defined in section 11. A restricted subset of MSML dialogs can also be used with the "Announcement" service defined in [n7]. This service uses "annc" as the service indicator and defines parameters that describe an announcement. The "play=" parameter identifies the URL of a prompt or a provisioned announcement sequence. The value of the "play=" parameter can refer to an MSML dialog body part using a "cid" URL as described above. That body part must only contain the <play> primitive. Using MSML dialogs enhances the announcement service by allowing the client to specify a sequence of audio segments rather than requiring each sequence to be provisioned as well as support for video. Moreover, MSML dialogs define a standard set of variables in contrast to [n7] which defines a parameterization mechanism but does not formally specify any semantics. If a media server does not understand the "cid" scheme or does not understand MSML dialogs, it must respond with the SIP response code "488 - not acceptable here". If the MSML dialog body contains elements other than the <play> primitive, or there are errors during validation, a media server must respond with a SIP response code "400 - bad request". Finally, if there is a discrepancy between parameters specified in the Request-URI and corresponding attributes defined in the MSML dialog body, the Request-URI parameters must be silently ignored. 
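The error rules for the announcement service can be summarized as a small decision function. This is a sketch under stated assumptions, not a definitive implementation: the function name and its flags are hypothetical, a body that is not well-formed XML is treated as a validation error, and the body part is assumed to be either a bare <play> element or a wrapper whose direct children must all be <play>.

```python
import xml.etree.ElementTree as ET


def annc_error_status(body_xml, understands_cid=True, understands_msml=True):
    """Return the SIP error status called for by the text, or None if none applies."""
    if not (understands_cid and understands_msml):
        return 488  # "not acceptable here"
    try:
        root = ET.fromstring(body_xml)
    except ET.ParseError:
        return 400  # assumption: malformed XML counts as a validation error
    # Assumption: accept a bare <play>, or a wrapper containing only <play>.
    elements = [root] if root.tag == "play" else list(root)
    if not elements or any(el.tag != "play" for el in elements):
        return 400  # "bad request"
    return None
```

The function returns None rather than a success code because the text above specifies only the error responses.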
MSML dialogs MUST NOT change the operation of the announcement service from that defined in [n7]. When the announcement completes, a media server issues a SIP BYE request. The INFO method MUST NOT be used with the announcement service.

9.5. MSML Dialog Structure and Modularity
MSML is structured as a set of packages. Only the core and base packages are required. The Dialog Core Package defines the framework for MSML requests to a media server, without specific functionality. It consists of the "primitive" abstraction, an abstract element for control flow, the sequential execution model, and the <send> element. That is, the MSML Dialog Core Package allows for the execution of a sequence of one or more media processing primitives with the ability to notify events to the invocation environment.

Primitives are contained within the MSML Dialog Base Package, which defines the basic <play>, <record>, <dtmf>, <dtmfgen>, <tonegen>, and <collect> elements. Another package, the MSML Dialog Transform Package, defines the simple half-duplex filters. More advanced primitives are defined in the speech and fax packages. The MSML speech package depends on the MSML Dialog Base Package as it extends the capability of <play> by adding synthesized speech.

Finally, the group execution model, which is currently the only element that changes the flow of control, is defined in a separate MSML Dialog Group Package. All of these packages are optional with the exception that MSML Dialog Core and MSML Dialog Base Packages MUST be implemented to provide the minimal functionality.

9.6. MSML Dialog Core Package
The MSML Dialog Core Package defines the structural framework and abstractions for MSML dialogs (via its schema). It also defines the basic elements that are not part of the core primitive or control abstractions. This package is dependent on the MSML Core Package. Events generated by MSML dialogs, such as prompt completion, digits collected, or dialog termination, are communicated by the media server via the MSML Core Package (see MSML Core Package <event>).

MSML dialogs are executed independently from the MSML core context. When an MSML dialog is started, MSML allocates the dialog control resources, and if successful, starts those resources executing. MSML core execution then continues without waiting for the MSML dialog to complete. This forking of MSML dialog invocation from the MSML core context is done via the <dialogstart> element.

Media streams are created between the MSML dialog target and other internal media server resources as part of dialog execution. Stream creation is subject to the requirements defined in the MSML Core Package and media streams as defined by the MSML Conference Core Package.

9.6.1. <dialogstart>
The <dialogstart> element is used to instantiate an MSML media dialog on connections or conferences. The dialog is specified either inline or by a URI [n6]. Inline dialogs MUST be composed of any of the MSML Dialog Packages. MSML dialogs MAY be defined externally as VoiceXML [n5]. The MSML dialog description MUST NOT be inline if the src attribute, containing a URI, is present.
The originator of the MSML dialog is notified using a "msml.dialog.exit" event when the dialog completes. Any results returned by the dialog when it exits are sent as a namelist to the event. The "msml.dialog.exit" event is also used when dialogs fail due to errors encountered fetching external documents or errors that occur within the dialog execution thread. In this case, a namelist containing the items "dialog.exit.status" and "dialog.exit.description" is returned with the event to inform the client of the failure and the failure reason. The values of these items are defined within this package and the MSML Core Package. Information from the failed dialog may be returned as additional namelist items.

Attributes:

   target: an identifier of a connection or a conference that will interact with the dialog. The identifier must not contain wildcards. Mandatory.

   src: the URL of the dialog description. MUST NOT be used if the MSML dialog description is inline. Otherwise, an error (422) will result and MSML document execution will stop.

   type: a MIME type that identifies the type of language used to describe the dialog. application/moml+xml and application/vxml+xml are used to identify MSML dialogs and VoiceXML [n5] respectively. Mandatory.

   name: an instance name for the dialog. If the attribute is not present, the media server will assign an identifier to the dialog. If the attribute is present but the name is already associated with the target, an error (431) will result and MSML document execution will stop. Any results that a dialog generates will be correlated to its identifier.

   mark: a token that can be used to identify execution progress in the case of errors. The value of the mark attribute from the last successfully executed MSML element is returned in an error response. Therefore, the value of all "mark" attributes within an MSML document should be unique.
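The two error conditions named in the attribute descriptions can be expressed as a short check. The sketch below is illustrative only: the function check_dialogstart and its names_in_use argument are hypothetical, and it covers just the (422) and (431) cases stated above, not schema validation of the mandatory attributes.

```python
import xml.etree.ElementTree as ET


def check_dialogstart(element, names_in_use=()):
    """Return 422 or 431 when an attribute rule is violated, else None.

    names_in_use holds the dialog names already associated with the target.
    """
    has_inline_dialog = len(element) > 0  # any child element counts as inline
    if "src" in element.attrib and has_inline_dialog:
        return 422  # src MUST NOT be used with an inline description
    name = element.get("name")
    if name is not None and name in names_in_use:
        return 431  # name already associated with the target
    return None


external = ET.fromstring(
    '<dialogstart target="conn:abcd1234" type="application/moml+xml" '
    'name="sample" src="http://server.example.com/scripts/foo.moml"/>'
)
conflicting = ET.fromstring(
    '<dialogstart target="conn:abcd1234" type="application/moml+xml" '
    'src="http://server.example.com/scripts/foo.moml"><play/></dialogstart>'
)
```

In either error case the text requires that MSML document execution stop, so a caller of such a check would not continue to later elements.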
The following sections show examples of initiating an external MSML dialog, an inline embedded MSML dialog, and an MSML-initiated VoiceXML dialog. The following example starts an MSML dialog on a connection.
   <?xml version="1.0" encoding="UTF-8"?>
   <msml version="1.1">
      <dialogstart target="conn:abcd1234"
                   type="application/moml+xml"
                   name="sample"
                   src="http://server.example.com/scripts/foo.moml"/>
   </msml>

The following example starts an inline embedded MSML dialog on a connection.

   <?xml version="1.0" encoding="UTF-8"?>
   <msml version="1.1">
      <dialogstart target="conn:abcd1234"
                   type="application/moml+xml"
                   name="sample">
         <play>
            <audio uri="file://clip1.wav"/>
            <audio uri="http://host1/clip2.wav"/>
            <tts uri="http://host2/text.ssml"/>
            <var type="date" subtype="mdy" value="20030601"/>
         </play>
         <send target="source" event="done"
               namelist="play.amt play.end"/>
      </dialogstart>
   </msml>

The following example starts a VoiceXML dialog on a connection.

   <?xml version="1.0" encoding="UTF-8"?>
   <msml version="1.1">
      <dialogstart target="conn:abcd1234"
                   type="application/vxml+xml"
                   name="sample"
                   src="http://server.example.com/scripts/foo.vxml"/>
   </msml>

If this dialog fails after its execution thread has begun (for example, because the fetch of the VoiceXML document failed), the event returned would look like this:

   <?xml version="1.0" encoding="UTF-8"?>
   <event name="msml.dialog.exit"
          id="conn:abcd1234/dialog:sample">
      <name>dialog.exit.status</name>
      <value>423</value>
      <name>dialog.exit.description</name>
      <value>External document fetch error</value>
   </event>
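A client receiving such an event needs to recover the dialog identity and the namelist. The following sketch shows one way to do that with a standard XML parser. The function name is hypothetical, and splitting the id on "/dialog:" is an assumption based only on the form of the identifier in the example above. The sample omits the XML declaration for brevity.

```python
import xml.etree.ElementTree as ET


def parse_dialog_event(event_xml):
    """Return (event name, target, dialog name, namelist dict) for an <event>."""
    root = ET.fromstring(event_xml.strip())
    names = [el.text for el in root.findall("name")]
    values = [el.text for el in root.findall("value")]
    # <name> and <value> children alternate, so pairing by position suffices.
    namelist = dict(zip(names, values))
    # Assumption: the id has the form "<target>/dialog:<name>".
    target, _, dialog = root.get("id", "").partition("/dialog:")
    return root.get("name"), target, dialog, namelist


SAMPLE_EVENT = """
<event name="msml.dialog.exit" id="conn:abcd1234/dialog:sample">
  <name>dialog.exit.status</name>
  <value>423</value>
  <name>dialog.exit.description</name>
  <value>External document fetch error</value>
</event>
"""

event, target, dialog, namelist = parse_dialog_event(SAMPLE_EVENT)
```

The presence of "dialog.exit.status" in the returned namelist is what distinguishes a failed dialog from one that exited normally with results.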
9.6.2. <dialogend>
Dialog end is used to terminate an MSML dialog created through <dialogstart> before it completes of its own accord. The operation of <dialogend> depends on the dialog language being used by the executing context. When that context is VoiceXML, a "connection.disconnected" event will be thrown to the VoiceXML application. When that context is MSML dialog, a "terminate" event will be sent to the MSML core context.

<dialogend> allows the executing dialog the opportunity to gracefully complete before generating a "msml.dialog.exit" event. Dialog results may be returned and will be contained as a namelist to that event.

Attributes:

   id: the identifier of a dialog. Mandatory.

   mark: a token that can be used to identify execution progress in the case of errors. The value of the mark attribute from the last successfully executed MSML dialog element is returned in an error response. Therefore, the value of all "mark" attributes within an MSML document should be unique.

For example, if the dialog from the previous example was still executing, the following would terminate the dialog and generate an "msml.dialog.exit" event.

   <?xml version="1.0" encoding="UTF-8"?>
   <msml version="1.1">
      <dialogend id="conn:abcd1234/dialog:sample"/>
   </msml>

9.6.3. <send>
The <send> element sends an event and optional namelist to the recipient identified by the target attribute. Event names are defined by the recipient. In the case where the recipient is an MSML dialog group or primitive, the events are defined within this document. Other recipients MAY use names that are suitable for their environment.

The "target" attribute specifies the recipient of the event. Recipients MAY be other MSML dialog primitives or groups executing within the object, the object itself, or the environment that invoked the MSML dialog. Sending events to media primitives or groups is supported by the MSML Dialog Group Package. Any target that is unknown within the object is assumed to be destined to the external environment. By convention, the string "source" SHOULD be used to address that environment, but any target name distinct from the MSML dialog namespace MAY be used.

Attributes:

   event: the name of an event. Mandatory.

   target: the recipient of the event. The recipient MUST be an MSML dialog primitive, the currently executing group, or the MSML dialog environment. A primitive is specified by a primitive type, optionally appended by a period '.' followed by the identifier of a primitive. Identifiers are only needed when more than one primitive of the same type exists in the object. The executing group is specified using the token "group". The environment is specified using the token "source", optionally appended by a period '.' followed by any environment specific target. Mandatory.

   namelist: a list of zero or more shadow variables that are included with the event.

9.6.4. <exit>
The <exit> element causes execution of the MSML dialog to terminate.

Attributes:

   namelist: a list of one or more shadow variables that MAY optionally be sent to the context that invoked the MSML Dialog object.

9.6.5. <disconnect>
The <disconnect> element is similar to <exit> but has the additional semantics of indicating to the context that invoked the MSML dialog that it should disconnect the media stream associated with the object from the media server. The method of disconnection depends upon how the media stream was initially established. If SIP was used, a <disconnect> would cause a media server to issue a BYE request. The request would be sent for the SIP dialog associated with the media session on which the MSML dialog was operating.

Attributes:

   namelist: a list of one or more shadow variables that MAY optionally be sent to the context that invoked the MSML dialog object.

9.7. MSML Dialog Base Package
The MSML Dialog Base Package defines a required set of base functionality for the media server. It supports individual media primitives, such as playing an announcement or collecting digits, as well as composite operations such as play and collect. When this package is used in conjunction with the MSML Dialog Group Package, the event-based mechanism is used to control primitives. This package may also be used in conjunction with the MSML Speech Package to extend the functionality of prompts to include TTS and user input collection to include ASR.

In the following sections, subsections of a primitive define child elements of that primitive and are not themselves considered primitives. They do not receive events or populate shadow variables.

9.7.1. <play>
Play is used to generate an audio or video stream. It MUST play in sequence the media created by the child media elements <audio>, <video>, <media>, <tts>, and <var>. When the play stops, either because the terminate event is received or all media generation has completed, the <playexit> element, if present, is executed. At least one media generation element must be present.

Play supports two states: generate and suspend. Media generation occurs in the generate state and is suspended in the suspend state. Once in the suspend state, media generation continues upon receiving the generate event. The default initial state is generate.

Audio MAY be generated in different languages by specifying the xml:lang attribute for <play> and/or the child elements of <play>. The language is inherited by the child elements, but each child MAY specify its own language. Except for physical audio clips, it is an error if a language is specified but the media server cannot render the audio in the requested language.

Attributes:

   id: an optional identifier that may be referenced elsewhere for sending events to the play primitive.

   interval: specifies the delay between stopping one iteration and beginning another. The attribute has no effect if iterate is not also specified. Default is no interval.

   iterate: specifies the number of times the media specified by the child media elements should be played. Each iteration is a complete play of each of the child media elements in document order. Defaults to once '1'.

   initial: defines the initial state for the play element. Default is "generate".

   maxtime: defines the maximum allowed time for the <play> to complete.

   barge: defines whether or not audio announcements may be interrupted by DTMF detection during play-out. The DTMF digit barging the announcement is stored in the digit buffer. Valid values for barge are "true" or "false", and the attribute is mandatory. When barge is applied to a conference target, DTMF digit detected from any conference participant MUST terminate the announcement.

   cleardb: defines whether or not the digit buffer is cleared, prior to starting the announcement. Valid values for cleardb are "true" or "false", and the attribute is mandatory.

   offset: defines an offset, measured in units of time, where the <play> is to begin media generation. Offset is only valid when all child media elements are <audio>.

   skip: an amount, expressed in time, that will be used to skip through the media when "forward" and "backward" events are received. Default is 3 s (three seconds).

   xml:lang: specifies the language to use for content that can be rendered in different languages.

Events:

The following describes input events to the media primitive object. The MSML Dialog Group Package allows an event exchange mechanism between primitives.

   pause: causes the play to enter the suspend state.

   resume: causes play to enter the generate state.

   forward: skips forward through the media. Only has effect when all child media elements are <audio>.

   backward: skips backward through the media. Only has effect when all child media elements are <audio>.

   restart: skips to the beginning of the media. Only has effect when all child media elements are <audio>.

   toggle-state: causes the suspend / generate state to toggle.

   terminate: terminates the play and assigns values to the shadow variables.

Shadow Variables:

   play.amt: identifies the length of time for which media was generated before the play was stopped. This does not include time that may have elapsed while the play was in the suspend state.

   play.end: contains the event that caused the play to stop. When the play stops because all media generation has completed, end is assigned the value "play.complete".

Note: Attributes barge and cleardb provide a simplified mechanism for controlling play operations with implicit DTMF without the use of <group> and event exchange mechanism. When using the <play> element within the group framework and barge is specified, detection of barge condition generates an implicit terminate event to the play primitive.

The following sections describe the child elements of <play>.

9.7.1.1. <audio>
The <audio> element identifies prerecorded audio to play. Local URI references may resolve to a single physical audio clip, a logical clip, or a provisioned sequence of clips (physical or logical). A logical clip is one that can be rendered differently based on the language attribute. Logical clips are provisioned for each of the languages that a media server supports. Remote URI references are resolved according to the capabilities of the remote server.

Attributes:

   uri: identifies the location of the audio to be played. The file and http schemes are supported. Mandatory.

   format: defines the encoding and file type of the audio resource. The format attribute is defined as a string type of form "audio/<filetype>;codecs=<codec>". The keyword 'audio' identifies an audio content. The codecs field identifies the audio file's codec to be used for decoding the audio content. If format attribute is not specified, the filetype MUST be determined from the URI and the codec information MUST be determined from the media resource.

   audiosamplerate: identifies audio sample rate in kHz. If not specified, the sample rate SHOULD be determined from the media resource.

   audiosamplesize: identifies audio sample size in bits. If not specified, the sample size SHOULD be determined from the media resource.

   iterate: specifies the number of times the audio is to be played. Defaults to once '1'.

   xml:lang: specifies the language to use when the URI identifies a logical clip, either directly, or as part of a sequence.

9.7.1.2. <video>
The <video> element identifies prerecorded multimedia to play. Contents identified by the URI attribute may contain audio only, video only, or both audio and video. The media server SHOULD attempt to play both audio and video from the identified URI, if both are available in the content.

Attributes:

   uri: identifies the location of the video or multimedia to be played. The file and http schemes are supported. Mandatory.

   format: defines the encoding and file type of the video or multimedia resource. The format attribute is defined as a string type of form "video/<filetype>;codecs=<codecx>,<codecy>". The keyword 'video' identifies video-only media or media containing audio and video. The "codecs" field identifies the audio and/or video codecs to be used for decoding the file content, where the order of the codec values is not significant. In the event of audio and video content, using 'video' keyword, the codecs=<codecx>,<codecy> field MAY be used to identify the audio codec and the video codec. If not specified, the codec information SHOULD be determined from the media file.

   audiosamplerate: identifies audio sample rate in kHz. If not specified, the sample rate SHOULD be determined from the media file.

   audiosamplesize: identifies audio sample size in bits. If not specified, the sample size SHOULD be determined from the media file.

   codecconfig: identifies an optional special instruction string for codec configuration. Default is to send no special configuration string to the codec.

   profile: identifies a video profile name specific to the codec. If not specified, default video profile of the codec SHOULD be selected.

   level: identifies a video profile level to the codec. Default is to send no profile information to the codec and allow the codec to select an internal default.

   imagewidth: identifies the width of video image in pixels. Default is to use image width information from media file.

   imageheight: identifies the height of video image in pixels. Default is to use image height information from media file.

   maxbitrate: identifies the bitrate of the video signal in kbps. Default is to use maximum bitrate information from the media file.

   framerate: identifies the video frame rate in frames per second. Default is to use frame rate information from the media file.

   iterate: specifies the number of times the media content is to be played. Defaults to once '1'.

9.7.1.3. <media>
The <media> element identifies multimedia content for play. All content of the <media> element MUST start to play concurrently. This element may be used to generate a multimedia stream from two independent media resources, one identifying audio and the other identifying video.

The <media> element MUST contain at least one child element. Valid child elements of <media> are <audio> and <video>, as described earlier. The <media> element MUST contain at most one <audio> element and at most one <video> element.

9.7.1.4. <var>
The <var> element specifies the generation of audio from a variable using prerecorded audio segments. A variable represents a semantic concept (such as date or number) and dynamically produces the appropriate speech. Prerecorded audio allows an application vendor or service provider to choose the exact voice for their audio and therefore completely control the "sound and feel" of the service provided to end users. It provides very high audio quality and allows the variables to blend seamlessly into the surrounding audio segments. Text to speech (TTS) using Speech Synthesis Markup Language (SSML) [n11] may also be used to render variables, but may not provide as good quality, or allow as complete control of the "sound and feel" or user experience. TTS is normally used for reading text such as emails and for very large vocabularies such as stock names. TTS results in a very clear difference between the variables and the surrounding audio segments. (See MSML Dialog Speech Package.) Attributes: type: specifies the type of variable. Mandatory. Variable type must be one of "date", "digits", "duration", "month", "money", "number", "silence", "time", or "weekday".
subtype: specifies an optional clarification of type. Specific values depend upon the type. value: text that should be rendered appropriate to the type and subtype attributes. Mandatory. xml:lang: specifies the language to use when rendering the variable.
9.7.1.5. <playexit>
The <playexit> element MUST be invoked when generation of all content of the <play> has come to completion. The contents of this element MAY be used to send events. Attributes: none
9.7.2. <dtmfgen>
DTMF generator originates one or more DTMF digits in sequence. Attributes: id: an optional identifier that may be referenced elsewhere for sending events to the dtmfgen primitive. digits: a string of characters from the alphabet "0-9a-d#*" that correspond to a sequence of DTMF tones. Mandatory. level: used to define the power level for which the tones will be generated. Expressed in dBm0 in a range of 0 to -96 dBm0. Larger negative values express lower power levels. Note that values lower than -55 dBm0 will be rejected by most receivers (TR-TSY-000181, ITU-T Q.24A). Default is -6 dBm0. dur: the duration in milliseconds for which each tone should be generated. Implementations may round the value if they only support discrete durations. Default is 100 ms. interval: the duration in milliseconds of a silence interval following each generated tone. Implementations may round the value if they only support discrete durations. Default is 100 ms. Events: terminate: terminates DTMF generation and assigns values to the shadow variables. Shadow Variables: dtmfgen.end: contains the event that caused DTMF generation to stop. The following sections describe the child elements of <dtmfgen>.
9.7.2.1. <dtmfgenexit>
The <dtmfgenexit> element MUST be invoked when the DTMF generation operation completes or is terminated as a result of receiving the terminate event. The <dtmfgenexit> element MAY be used to send events when the DTMF generation has completed. Attributes: none
9.7.3. <tonegen>
Tone generator allows customized tone generation. A sequence of varying tones with optional silence intervals can be composed using the <tonegen> element. Child elements of <tonegen>, namely <tone> and <silence>, specify a single tone or sequence of tones. Attributes: id: an optional identifier that may be referenced elsewhere for sending events to the tonegen primitive. iterate: A numeric value specifying the total number of iterations. A value of 'forever' represents infinite repetitions. Optional. Default is 1. Events: terminate: terminates tone generation and assigns values to the shadow variables. Shadow Variables: tonegen.end: contains the event that caused tone generation to stop. The following sections describe the child elements of <tonegen>.
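The iterate attribute described above accepts either a numeric count or the keyword 'forever', and defaults to 1 when omitted. A minimal sketch of how a client or server might interpret it follows. The function name parse_iterate is hypothetical, and rejecting counts below 1 is an assumption, since this specification only says the value is numeric.

```python
import math


def parse_iterate(value=None):
    """Interpret an iterate attribute value (hypothetical helper).

    None means the attribute was omitted, so the default of 1 applies.
    'forever' maps to math.inf to represent infinite repetitions.
    Any other value is expected to be an integer count.
    """
    if value is None:
        return 1
    text = value.strip()
    if text == "forever":
        return math.inf
    count = int(text)  # raises ValueError for non-numeric input
    if count < 1:
        # Assumption: counts below 1 are treated as invalid.
        raise ValueError("iterate must be a positive integer or 'forever'")
    return count
```

A caller could loop while a counter is below the returned value; comparing against math.inf keeps the 'forever' case free of special handling.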
9.7.3.1. <tone>
The <tone> element specifies a single tone with an optional silence interval. The tone specification consists of two tone frequencies, their attenuation values, a duration of the tone, and the number of times to repeat the tone. Attributes: duration: time duration or length of the individual tone, specified in "ms" or "s" in increments of 10 ms. A value of 0 represents an infinite duration. Mandatory. iterate: specifies the number of times to execute the contents of <tone> element. A value of 'forever' represents infinite repetitions. Optional. Default is 1. Events: none Child Elements: The child elements of <tone> element specify a single tone and an optional silence interval to be inserted at the end of tone generation. A tone is defined by <tone1> and <tone2> elements. Each <tone> element MUST contain at least one of <tone1> or <tone2>, or MAY contain <tone1> and <tone2> exactly once. <tone1> Attributes: freq: specifies the frequency of the first tone in "Hz", ranging from 0 to 3999 Hz. Mandatory. atten: specifies the attenuation level expressed in dBm0, ranging from 0 to -96 dBm0. Mandatory. <tone2> Attributes: freq: specifies the frequency of the second tone in "Hz", ranging from 0 to 3999 Hz. Mandatory. atten: specifies the attenuation level expressed in dBm0, ranging from 0 to -96 dBm0. Mandatory.
<silence> - Refer to the silence element definition below.
9.7.3.2. <silence>
The <silence> element inserts a silence interval as optional content of <tonegen> or <tone> elements. Attributes: duration: specifies the amount of silence interval in "ms" or "s", in increments of 10 ms. Mandatory. Events: none
9.7.3.3. <tonegenexit>
The <tonegenexit> element MUST be invoked when the tone generation operation completes or is terminated as a result of receiving the terminate event. The <tonegenexit> element MAY be used to send events when the tone generation has completed. Attributes: none
9.7.4. <record>
Record creates a recording. Similar to play, <record> supports two states: create and suspend. Received media becomes part of the recording when <record> is in the create state and is discarded when it is in the suspend state. Recording MUST be terminated when a terminate event is received or when a nospeech event is received and no audio has yet been recorded. <record> differentiates different types of terminate events. An optional <play> element MAY be specified as a child element of <record>. This mechanism provides a complete play-record operation, where the prompts specified within the <play> element are played in advance of start of recording. Note: Attributes prespeech, postspeech, and termkey provide a simplified mechanism for controlling record operations using implicit DTMF and VAD, without the use of <group> and event exchange mechanism.
Attributes: id: an optional identifier that may be referenced elsewhere for sending events to the record primitive. append: a boolean that defines whether the recording is allowed to be appended to an existing file if dest already exists. Default is "false". The attribute is ignored if the scheme is http. dest: the destination for the recording, which will contain either audio only, video only, or both audio and video depending on the stream(s) being recorded. Recording MAY be either local or external based upon the attribute value. File and http schemes are supported. audiodest: the destination for the audio-only recording. Recording MAY be either local or external based upon the attribute value. All combinations of dest, audiodest, and videodest are valid. File and http schemes are supported. videodest: the destination for the video-only recording. Recording MAY be either local or external based upon the attribute value. All combinations of dest, audiodest, and videodest are valid. File and http schemes are supported. format: defines the encoding and file type of the recording. The format attribute is defined as a string type of form "audio|video/filetype;codecs=x,y". The keyword 'audio' identifies an audio only recording, while the keyword 'video' identifies video-only recording or an audio plus video recording. The codecs field identifies the audio and/or video codecs to be used for the recording, where the order of the codec values is not significant. In the event of audio and video recording, using 'video' keyword, the codecs=x,y field MAY be used to identify the audio codec and the video codec. Mandatory. codecconfig: identifies an optional special instruction string for codec configuration. Default is to send no special configuration string to the codec. audiosamplerate: identifies audio sample rate in kHz. If not specified, the sample rate SHOULD be determined from the media source. audiosamplesize: identifies audio sample size in bits. 
If not specified, the sample size SHOULD be determined from the media source.
profile: identifies a video profile name specific to the codec. If not specified, default video profile of the codec SHOULD be selected for the recording. level: identifies a video profile level to the codec. Default is to send no profile information to the codec and allow the codec to select an internal default. imagewidth: identifies the width of video image in pixels. Default is to use image width information from the media source. imageheight: identifies the height of video image in pixels. Default is to use image height information from the media source. maxbitrate: identifies the bitrate of the video signal in kbps. Default is to use maximum bitrate information from the media source. framerate: identifies the video frame rate in frames per second. Default is to use frame rate information from the media source. initial: defines the initial state for the record element. Default is "create", which starts the recording as soon as the <record> element is executed. The "initial" attribute is applicable only when <record> is used within the <group> structure. maxtime: defines the maximum length of the recording in units of time. Mandatory. prespeech: defines a timer value, in seconds, for detection of absence of audio energy at the start of the record operation. If no audio energy is detected for the amount of time specified by prespeech, the recording is terminated. Default is 0 s, which does not activate the prespeech timer. postspeech: defines a timer value, in seconds, for detection of absence of audio energy while the recording is in progress. During an in-progress recording, if absence of audio energy is detected as specified by the postspeech timer, the recording is terminated. Default is 0 s, which disables the ability to terminate a recording due to postspeech silence. termkey: defines a single DTMF key that, when detected, terminates the recording. Absence of this attribute prevents the recording from being terminated due to detection of DTMF digits. 
When termkey is specified, the detected DTMF digit terminates the recording and the DTMF digit is not entered in the digit buffer.
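The interaction of maxtime, prespeech, postspeech, and termkey can be sketched as a single decision function. This is an illustration only. The function name, its parameters, and the order in which the conditions are checked are assumptions; the returned strings are the record.end values listed under Shadow Variables in this section. All times are taken to be in seconds.

```python
from typing import Optional


def record_end(elapsed: float, heard_audio: bool, silence: float,
               key: Optional[str], *, maxtime: float,
               prespeech: float = 0.0, postspeech: float = 0.0,
               termkey: Optional[str] = None) -> Optional[str]:
    """Return the record.end value if recording should stop, else None.

    elapsed     -- seconds recorded so far
    heard_audio -- True once any audio energy has been detected
    silence     -- seconds of continuous silence up to now
    key         -- DTMF key just detected, if any
    """
    # A termkey match stops the recording; without termkey, DTMF is ignored.
    if termkey is not None and key == termkey:
        return "record.complete.termkey"
    # The prespeech timer applies only before any audio has been heard.
    # A value of 0 leaves the timer inactive.
    if prespeech > 0 and not heard_audio and silence >= prespeech:
        return "record.failed.prespeech"
    # The postspeech timer applies only once recording is in progress.
    if postspeech > 0 and heard_audio and silence >= postspeech:
        return "record.complete.postspeech"
    # Simplification: reaching maxtime is treated as exceeding it.
    if elapsed >= maxtime:
        return "record.complete.maxlength"
    return None
```

With prespeech=3 and no audio for three seconds, the function reports "record.failed.prespeech"; with both timers left at their default of 0, only termkey or maxtime can end the recording.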
Events: The following describes input events to the media primitive object. The MSML Dialog Group Package allows an event exchange mechanism between primitives. pause: causes the record to enter the suspend state. Received media is discarded. resume: causes the record to resume if it was suspended. It has no effect otherwise. toggle-state: causes the suspend / create state to toggle. terminate: terminates the recording and assigns values to the shadow variables. terminate.cancelled: terminates the recording and assigns values to the shadow variables. If the dest attribute used the file scheme, the local recording is deleted. Applications are responsible for removing external files created using the http scheme. terminate.finalsilence: terminates the recording and assigns values to the shadow variables. If the dest attribute used the file scheme, the final silence is removed from the recording. nospeech: terminates the recording and assigns values to the shadow variables if it is received and no recording has yet been created. The "nospeech" event is ignored if audio has already been recorded. Shadow Variables: record.len: the actual length of the recording measured in units of time. This does not include time that may have elapsed while the record was in the suspend state. record.end: contains the event that caused the record to terminate. When the record terminates because maxtime is exceeded, end is assigned the value "record.complete.maxlength". record.recordid: contains the value of the "dest" attribute, if supplied, otherwise contains a media server assigned record identifier. Record termination due to prespeech silence results in record.end being assigned the value "record.failed.prespeech". Record termination due to postspeech silence results in record.end being assigned the value "record.complete.postspeech". Record termination due to DTMF detection results in record.end being assigned the value "record.complete.termkey". The following sections describe the child elements of <record>.
9.7.4.1. <play>
The optional <play> element as a child element of <record> allows a prompt to be played prior to start of recording. The record operation starts at the end of the play sequence or if the play is barged by DTMF, assuming that barge=true is specified for <play>. For a complete description, refer to <play> element.
9.7.4.2. <tonegen>
The optional <tonegen> element as a child element of <record> allows a tone or sequence of tones to be played prior to start of recording. The record operation starts at the end of the tone generation. For a complete description, refer to <tonegen> element.
9.7.4.3. <recordexit>
The <recordexit> element MUST be invoked when the record operation completes or when the recording is terminated as a result of receiving the terminate event. The <recordexit> element MAY be used to send events when the recording has completed. Attributes: none
9.7.5. <dtmf> or <collect>
DTMF input fulfills several roles within MSML dialogs. It is used to trigger events that will affect the media processing operation of other primitives. It is also used to collect DTMF digits from a media stream that are to be reported back to the user of MSML dialog. Often DTMF detection is used for both purposes. Barge is the most common example, where a prompt is stopped based upon DTMF input but more digits may remain to be collected. DTMF detection supports multiple simultaneous recognition patterns. Different patterns can be used to trigger sending different events in order to implement DTMF controls. Alternatively, one pattern may be
used to represent a collection and another pattern, a substring of the first, used as a barge indication. An optional <play> element MAY be specified as a child element of <dtmf> or <collect>. This mechanism provides a complete play-collect operation, where the prompt(s) specified within the <play> element are played in advance of DTMF digit collection. Note that all patterns share the same digit collection buffer, inter-digit timing, a single <nomatch> element, and a single <noinput> element. As such, multiple patterns may not be suitable to support simultaneous collections for different purposes. When this is required, separate <dtmf> elements should be used instead. <dtmf> terminates if any of the <pattern>, <noinput>, or <nomatch> elements are matched the maximum number of times that they are allowed. The number of times they may match may be specified as an attribute of <dtmf> or of the individual child elements. Element identifier <dtmf> is equivalent to <collect>. However, <collect> is the preferred name. MSML clients SHOULD use <collect>, while MSML servers SHOULD support both. Attributes: id: an optional identifier that may be referenced elsewhere for sending events to this primitive. cleardb: a boolean indication of whether the buffer for digit collection should be cleared of any collected digits when the element is instantiated. If set to false, any digits currently in the buffer MUST be immediately compared against the pattern elements. fdt: defines the first-digit timer value. The first-digit timer is started when DTMF detection is initially invoked. If no DTMF digits are detected during this initial interval, the <noinput> element MUST be invoked. Optional, default is 0 s (wait forever for the first digit). idt: defines the inter-digit timer to be used when digits are being collected. When specified, the timer is started when the first digit is detected and restarted on each subsequent digit. Timer expiration is applied to all patterns. 
After that, if any patterns remain active and a nomatch element is specified, the nomatch is executed and DTMF input MUST terminate. The idt attribute should only be used when digit collection is being performed. Optional, default is 4 s.
edt: defines the extra-digit timer value. Specifies the length of time the media server MUST wait after a match to detect a termination key, if one is specified by the <pattern> element. Optional, default is 4 s. starttimer: boolean value that defines whether the first digit timer (fdt) is started initially. When set to false, the starttimer event must be received for it to start. Default is "false". iterate: specifies the number of times the <pattern>, <noinput>, and <nomatch> elements may be executed unless those elements specify differently. The value "forever" MAY be used to indicate that these may be executed any number of times. Default is once '1'. ldd: defines the minimum duration for a digit to be held in order for it to be detected as a long DTMF digit. A long DTMF digit event MUST be treated as a single DTMF event, and MUST contain an extra character 'L' at the end to be distinguished from the other regular digit events. For example, "#L" and "#" are different DTMF events. Optional, default of 0 s. A value of 0 s disables long DTMF digit detection and reporting. Attribute value is an integer with a valid range from 100 ms to 100 s (units MUST be supplied). Events: The following describes input events to the media primitive object. The MSML Dialog Group Package allows an event exchange mechanism between primitives. starttimer: starts the first digit timer (fdt) if it has not already been started. Has no effect otherwise. terminate: terminates the DTMF input and assigns values to the shadow variables. Shadow Variables: dtmf.digits: the string of DTMF digits that have been received (the contents of the digit buffer). dtmf.len: the number of digits in the digit buffer. dtmf.last: the last digit in the digit buffer.
dtmf.end: contains the event that caused the <dtmf> to terminate or is assigned one of "dtmf.match", "dtmf.noinput", or "dtmf.nomatch" depending upon which of the corresponding elements reached its maximum. The following sections describe the child elements of <dtmf> or <collect>.
9.7.5.1. <play>
The optional <play> element as a child element of <dtmf> or <collect> allows a prompt to be played prior to DTMF digit collection. DTMF digit collection starts at the end of the play sequence or if the play is barged by DTMF, assuming that barge=true is specified for <play>. For a complete description, refer to <play> element.
9.7.5.2. <pattern>
The <pattern> element describes one or more DTMF digits that are to be recognized. When the pattern is matched, the child elements MUST be executed. Attributes: digits: the digit pattern that should be matched. Mandatory. format: an enumerated value that defines the format used to express the digit pattern. The format may be "mgcp" or "megaco" for patterns expressed as a digit map from those specifications, or as one of the simple built-in formats defined within this specification. Currently, a single built-in format "moml+digits" is defined that allows a match based on either one or more specific digits, or based upon a specific length specification with an optional return key. "moml+digits" is the default. iterate: specifies the number of times the <pattern> may be matched. The value "forever" may be used to indicate that <pattern> may be matched any number of times. This value overrides any specified in <dtmf>. Default is once '1'.
9.7.5.3. <detect>
The contents of the <detect> element MUST be executed whenever any DTMF is first detected. It MUST be matched at most once. Attributes: none
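The at-most-once behaviour of <detect> can be sketched alongside the digit buffer that backs the dtmf.digits, dtmf.len, and dtmf.last shadow variables. The class and method names are hypothetical, and long-digit handling and timers are left out of this illustration.

```python
from typing import Callable, Dict, List, Union


class DetectOnce:
    """Collect DTMF digits and run a detect handler on the first one only."""

    def __init__(self, on_detect: Callable[[str], None]) -> None:
        self._on_detect = on_detect
        self._fired = False          # True once the detect handler has run
        self.buffer: List[str] = []  # stands in for the digit buffer

    def digit(self, d: str) -> None:
        """Record one detected digit; fire the handler if it is the first."""
        self.buffer.append(d)
        if not self._fired:
            self._fired = True
            self._on_detect(d)

    def shadow(self) -> Dict[str, Union[str, int]]:
        """Return the shadow variables derived from the digit buffer."""
        return {
            "dtmf.digits": "".join(self.buffer),
            "dtmf.len": len(self.buffer),
            "dtmf.last": self.buffer[-1] if self.buffer else "",
        }
```

Feeding the digits "1", "2", and "#" runs the handler once, for "1", while the buffer still accumulates all three digits for later pattern matching.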
9.7.5.4. <noinput>
The <noinput> element is used when DTMF is being collected. Children of the <noinput> element MUST be executed when DTMF has not been detected and the first digit timeout occurs. Attributes: iterate: specifies the number of times the <noinput> may be triggered. The value "forever" may be used to indicate that <noinput> may be triggered any number of times. This value overrides any specified in <dtmf>. Default is once '1'.
9.7.5.5. <nomatch>
The <nomatch> element is used when DTMF is being collected. Children of the <nomatch> element MUST be executed when it is determined that none of the individual patterns can be matched. Attributes: iterate: specifies the number of times the <nomatch> may be triggered. The value "forever" may be used to indicate that <nomatch> may be triggered any number of times. This value overrides any specified in <dtmf>. Default is once '1'.
9.7.5.6. <dtmfexit>
The <dtmfexit> element MUST be invoked when the DTMF input completes because one of <pattern>, <noinput>, or <nomatch> has occurred its maximum number of times. Attributes: none
9.7.6. <moml>
The root element <moml> MUST be used when the document is a stand-alone MSML dialog, where the invoking application media type indicates 'application/moml+xml'. Additionally, for backwards compatibility, the <moml> element MUST be used within <dialogstart>, which contains an inline embedded MSML dialog. Valid contents of <moml> are all elements described within this MSML Dialog Base Package.
Attributes: version: "1.0" Mandatory. id: an identifier unique to this object. Events returned from MSML dialog (the "target" attribute of a <send> is equal to "source") will be correlated with this identifier. Mandatory. Events: terminate: terminates the MOML context. A terminate event gets sent to the currently executing <group> or primitive.
9.8. MSML Dialog Group Package
The group package defines a single control flow construct that specifies concurrent execution. Primitives are composed for concurrent execution by placing them within a <group> element. Groups define how media flows between multiple concurrently executing primitives. They have one or more inputs and one or more outputs. A <group> represents the declaration of a complex media processing operation. The event interaction between primitives (see the following subsection) is defined within the context of one or more groups. However, groups themselves do not scope events; they simply define that primitives are concurrently executing, and a primitive must be executing in order to receive an event. Placing primitives within a group structure is an optional feature of this specification. It allows for complex services to be created using the event exchange mechanism between the primitives. For simpler services, such as play/collect or play/record, the use of the group mechanism is not necessary. The MSML Dialog Group Package is dependent on the MSML Dialog Base Package. Groups may also be used to describe media objects that transform a media stream while optionally allowing application or user control of the transformation. For example, a gain control could be defined that responds to user speech or DTMF input. In this case, a recognition primitive would send events to a gain control primitive. Groups have one attribute that defines the media flow within them. They also have a dimension that defines how many media inputs and outputs they have. Currently, dimensions of 1 and 2 are supported based upon the group topology. These correspond to a group with one input and one output and a group with two inputs and two outputs.
Media flow to and from the primitives within the group is based upon a topology attribute of the <group> element. The topology attribute defines a topology schema and implies the group dimension. There are several common ways in which primitives are often connected together. A schema provides a convenient template that can be applied to multiple primitives without having to define all of the individual media relationships. The following two schemas are initially defined for one-dimensional groups:
   o parallel: specifies that media sent to the group is sent to every primitive that has an input. The group bridges the output from every primitive that has an output into a single common group output.
   o serial: specifies that the first primitive listed in the group receives the media sent to the group. Its output is to be connected to the input of the next primitive defined within the group and so on until the last primitive within the group becomes the group output.
Groups with these topologies are shown in the two diagrams below. The group on the left has a parallel topology and that on the right has a serial topology.

          /-> P1 --\
         /          \
   G(in) +---> P2 ----> G(out)     G(in) --> P1 --> P2 --> P3 --> G(out)
         \          /
          \-> P3 --/

More complex media flows MAY be created by nesting groups of serial and parallel topologies within each other. For example, the diagram below has a group with a serial topology nested within a parallel topology.

            /-----> P1 ------------------------\
           /                                    \
   Gs(in) +-> Gp(in) --> P2 --> P3 --> Gp(out) -+> Gs(out)

This combination could be used to create a record operation where DTMF was to be clamped from the recording itself, but a DTMF key press is still used to stop the recording. In this case, P1 would be a DTMF recognizer, P2 would be a clamp primitive, and P3 a recorder as shown by the following example. This example omits child elements and attributes not concerned with the core concept. 
The following section discusses sending events, and the details of each of the primitives are found in section 4.

   <group topology="parallel">
      <dtmf/>
      <group topology="serial">
         <clamp/>
         <record/>
      </group>
   </group>

A single schema, "fullduplex", is defined for a two-dimensional group. A full-duplex two-dimensional group has exactly two immediate children. Those children may be primitives or other one-dimensional groups. A "fullduplex" group must only be used as the top-most group and must not be nested. Each primitive (P1) and group (G2) becomes half of the full-duplex group as shown in the diagram below.

   G-A(in1)  +-> G2 --> G-B(out1)
   G-A(out2) <-- P1 <-+ G-B(in2)

Full-duplex groups are symmetrical when both halves are the same. They are asymmetrical when they differ. Asymmetric groups need to have a name associated with each side. The left side is defined as the input of the first child of the full-duplex group combined with the output of the second child. The right side is reverse. These sides were labeled A and B respectively in the preceding diagram. An example of a full-duplex group is the user operated gain control mentioned at the beginning of this subsection. The gain should operate on the audio that a user hears, but the gain is controlled by recognizing things such as DTMF or spoken commands in media that the user originates. The following shows the XML tag grouping that would accomplish this and corresponds to the media flow shown in the diagram above. If the user's audio is not required for anything other than control of the gain, then the <relay> is not required and the internal group could be omitted. A complete XML description for this is included in the examples section.

   <group topology="fullduplex">
      <group topology="parallel">
         <dtmf/>
         <relay/>
      </group>
      <gain/>
   </group>

Primitives within a group MUST begin concurrently but MAY finish asynchronously based upon events that they receive or their task completes. A group MUST terminate when all of the primitives within
it have completed. If the group contains a <groupexit> element, then the contents of that element MUST be executed as part of group termination. A group itself MAY receive a terminate event requesting termination. A terminate event sent to the group causes a terminate event to be sent to each of its currently active primitives. The <groupexit> element is not executed until all primitives have processed their respective terminate events.
9.8.1. <group>
The <group> element allows the contained primitives to be executed concurrently. Attributes: topology: specifies a schema that defines the flow of media within the group. Three schemas are initially defined. "fullduplex" is specified for use with two-dimensional groups. "parallel" and "serial" are for use with one-dimensional groups. The definitions of these topologies are in section 9.8. Mandatory. id: identifies the name of the group. Mandatory when groups are nested. Events: terminate: causes a terminate event to be sent to each element contained within the group.
9.8.2. <groupexit>
The <groupexit> element allows events to be sent when group processing completes. Group processing completes when all contained primitives terminate. Attributes: none Events: none
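The termination rules for groups can be sketched as a small state tracker: a terminate event is forwarded to the primitives that are still active, and the <groupexit> contents run only after every primitive has completed. The class name, method names, and callback are hypothetical and serve only to illustrate the ordering described in section 9.8.

```python
from typing import Callable, Dict, List


class GroupSketch:
    """Track primitive completion and run a groupexit handler exactly once."""

    def __init__(self, primitive_ids: List[str],
                 on_groupexit: Callable[[], None]) -> None:
        # Every primitive starts active, since primitives begin concurrently.
        self._active: Dict[str, bool] = {pid: True for pid in primitive_ids}
        self._on_groupexit = on_groupexit
        self.terminated = False

    def terminate(self) -> List[str]:
        """Return the ids of the primitives that must receive terminate."""
        return [pid for pid, active in self._active.items() if active]

    def primitive_done(self, pid: str) -> None:
        """Mark one primitive as completed; finish the group if it is last."""
        if self.terminated or not self._active.get(pid, False):
            return
        self._active[pid] = False
        if not any(self._active.values()):
            self.terminated = True
            self._on_groupexit()  # runs only after all primitives are done
```

Calling terminate() does not itself end the group; the handler runs once each primitive has reported completion through primitive_done(), which mirrors the rule that <groupexit> waits for all terminate events to be processed.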