7. Resource Discovery
Server resources may be discovered and their capabilities learned by clients through standard SIP machinery. The client MAY issue a SIP OPTIONS transaction to a server, which has the effect of requesting the capabilities of the server. The server MUST respond to such a request with an SDP-encoded description of its capabilities according to RFC 3264 [RFC3264]. The MRCPv2 capabilities are described by a single "m=" line containing the media type "application" and transport type "TCP/TLS/MRCPv2" or "TCP/MRCPv2". There MUST be one "resource" attribute for each media resource that the server supports, and it has the resource type identifier as its value.

The SDP description MUST also contain "m=" lines describing the audio capabilities and the coders the server supports.

In this example, the client uses the SIP OPTIONS method to query the capabilities of the MRCPv2 server.

   C->S:
        OPTIONS sip:mrcp@server.example.com SIP/2.0
        Via:SIP/2.0/TCP client.atlanta.example.com:5060;
         branch=z9hG4bK74bf7
        Max-Forwards:6
        To:<sip:mrcp@example.com>
        From:Sarvi <sip:sarvi@example.com>;tag=1928301774
        Call-ID:a84b4c76e66710
        CSeq:63104 OPTIONS
        Contact:<sip:sarvi@client.example.com>
        Accept:application/sdp
        Content-Length:0

   S->C:
        SIP/2.0 200 OK
        Via:SIP/2.0/TCP client.atlanta.example.com:5060;
         branch=z9hG4bK74bf7;received=192.0.32.10
        To:<sip:mrcp@example.com>;tag=62784
        From:Sarvi <sip:sarvi@example.com>;tag=1928301774
        Call-ID:a84b4c76e66710
        CSeq:63104 OPTIONS
        Contact:<sip:mrcp@server.example.com>
        Allow:INVITE, ACK, CANCEL, OPTIONS, BYE
        Accept:application/sdp
        Accept-Encoding:gzip
        Accept-Language:en
        Supported:foo
        Content-Type:application/sdp
        Content-Length:...

        v=0
        o=sarvi 2890844536 2890842811 IN IP4 192.0.2.12
        s=-
        i=MRCPv2 server capabilities
        c=IN IP4 192.0.2.12/127
        t=0 0
        m=application 0 TCP/TLS/MRCPv2 1
        a=resource:speechsynth
        a=resource:speechrecog
        a=resource:speakverify
        m=audio 0 RTP/AVP 0 3
        a=rtpmap:0 PCMU/8000
        a=rtpmap:3 GSM/8000

        Using SIP OPTIONS for MRCPv2 Server Capability Discovery

8. Speech Synthesizer Resource
This resource processes text markup provided by the client and generates a stream of synthesized speech in real time. Depending upon the server implementation and capability of this resource, the client can also dictate parameters of the synthesized speech such as voice characteristics, speaker speed, etc.

The synthesizer resource is controlled by MRCPv2 requests from the client. Similarly, the resource can respond to these requests or generate asynchronous events to the client to indicate conditions of interest to the client during the generation of the synthesized speech stream.

This section applies for the following resource types:

   o  speechsynth

   o  basicsynth

The capabilities of these resources are defined in Section 3.1.
8.1. Synthesizer State Machine
The synthesizer maintains a state machine to process MRCPv2 requests from the client. The state transitions shown below describe the states of the synthesizer and reflect the state of the request at the head of the synthesizer resource queue. A SPEAK request in the PENDING state can be deleted or stopped by a STOP request without affecting the state of the resource.

   Idle                    Speaking                  Paused
   State                   State                     State
     |                        |                          |
     |----------SPEAK-------->|                 |--------|
     |<------STOP-------------|             CONTROL      |
     |<----SPEAK-COMPLETE-----|                 |------->|
     |<----BARGE-IN-OCCURRED--|                          |
     |              |---------|                          |
     |          CONTROL       |-----------PAUSE--------->|
     |              |-------->|<----------RESUME---------|
     |                        |               |----------|
     |----------|             |              PAUSE       |
     |    BARGE-IN-OCCURRED   |               |--------->|
     |<---------|             |----------|               |
     |                        |      SPEECH-MARKER       |
     |                        |<---------|               |
     |----------|             |----------|               |
     |         STOP           |       RESUME             |
     |          |             |<---------|               |
     |<---------|             |                          |
     |<---------------------STOP-------------------------|
     |----------|             |                          |
     |     DEFINE-LEXICON     |                          |
     |          |             |                          |
     |<---------|             |                          |
     |<---------------BARGE-IN-OCCURRED------------------|

                     Synthesizer State Machine

8.2. Synthesizer Methods
The synthesizer supports the following methods.
   synthesizer-method   =  "SPEAK"
                        /  "STOP"
                        /  "PAUSE"
                        /  "RESUME"
                        /  "BARGE-IN-OCCURRED"
                        /  "CONTROL"
                        /  "DEFINE-LEXICON"

8.3. Synthesizer Events

The synthesizer can generate the following events.

   synthesizer-event    =  "SPEECH-MARKER"
                        /  "SPEAK-COMPLETE"

8.4. Synthesizer Header Fields
A synthesizer method can contain header fields containing request options and information to augment the Request, Response, or Event it is associated with.

   synthesizer-header  =  jump-size
                       /  kill-on-barge-in
                       /  speaker-profile
                       /  completion-cause
                       /  completion-reason
                       /  voice-parameter
                       /  prosody-parameter
                       /  speech-marker
                       /  speech-language
                       /  fetch-hint
                       /  audio-fetch-hint
                       /  failed-uri
                       /  failed-uri-cause
                       /  speak-restart
                       /  speak-length
                       /  load-lexicon
                       /  lexicon-search-order

8.4.1. Jump-Size
This header field MAY be specified in a CONTROL method and controls the amount to jump forward or backward in an active SPEAK request. A '+' or '-' indicates a relative value to what is being currently played. This header field MAY also be specified in a SPEAK request as a desired offset into the synthesized speech. In this case, the synthesizer MUST begin speaking from this amount of time into the speech markup. Note that an offset that extends beyond the end of the produced speech will result in audio of length zero. The different speech length units supported are dependent on the synthesizer implementation. If the synthesizer resource does not support a unit for the operation, the resource MUST respond with a status-code of 409 "Unsupported Header Field Value".

   jump-size              =  "Jump-Size" ":" speech-length-value CRLF

   speech-length-value    =  numeric-speech-length
                          /  text-speech-length

   text-speech-length     =  1*UTFCHAR SP "Tag"

   numeric-speech-length  =  ("+" / "-") positive-speech-length

   positive-speech-length =  1*19DIGIT SP numeric-speech-unit

   numeric-speech-unit    =  "Second"
                          /  "Word"
                          /  "Sentence"
                          /  "Paragraph"

8.4.2. Kill-On-Barge-In
This header field MAY be sent as part of the SPEAK method to enable "kill-on-barge-in" support. If enabled, the SPEAK method is interrupted by DTMF input detected by a signal detector resource or by the start of speech sensed or recognized by the speech recognizer resource.

   kill-on-barge-in  =  "Kill-On-Barge-In" ":" BOOLEAN CRLF

The client MUST send a BARGE-IN-OCCURRED method to the synthesizer resource when it receives a barge-in-able event from any source. This source could be a recognizer resource or signal detector resource and MAY be either local or distributed. If this header field is not specified in a SPEAK request or explicitly set by a SET-PARAMS, the default value for this header field is "true".

If the recognizer or signal detector resource is on the same server as the synthesizer and both are part of the same session, the server MAY work with both to provide internal notification to the synthesizer so that audio may be stopped without having to wait for the client's BARGE-IN-OCCURRED request.

It is generally RECOMMENDED when playing a prompt to the user with Kill-On-Barge-In and asking for input, that the client issue the RECOGNIZE request ahead of the SPEAK request for optimum performance and user experience. This way, it is guaranteed that the recognizer is online before the prompt starts playing and the user's speech will not be truncated at the beginning (especially for power users).

8.4.3. Speaker-Profile
This header field MAY be part of the SET-PARAMS/GET-PARAMS or SPEAK request from the client to the server and specifies a URI that references the profile of the speaker. Speaker profiles are collections of voice parameters like gender, accent, etc.

   speaker-profile  =  "Speaker-Profile" ":" uri CRLF

8.4.4. Completion-Cause
This header field MUST be specified in a SPEAK-COMPLETE event coming from the synthesizer resource to the client. This indicates the reason the SPEAK request completed.

   completion-cause  =  "Completion-Cause" ":" 3DIGIT SP 1*VCHAR CRLF

   +------------+-----------------------+------------------------------+
   | Cause-Code | Cause-Name            | Description                  |
   +------------+-----------------------+------------------------------+
   | 000        | normal                | SPEAK completed normally.    |
   | 001        | barge-in              | SPEAK request was terminated |
   |            |                       | because of barge-in.         |
   | 002        | parse-failure         | SPEAK request terminated     |
   |            |                       | because of a failure to      |
   |            |                       | parse the speech markup      |
   |            |                       | text.                        |
   | 003        | uri-failure           | SPEAK request terminated     |
   |            |                       | because access to one of the |
   |            |                       | URIs failed.                 |
   | 004        | error                 | SPEAK request terminated     |
   |            |                       | prematurely due to           |
   |            |                       | synthesizer error.           |
   | 005        | language-unsupported  | Language not supported.      |
   | 006        | lexicon-load-failure  | Lexicon loading failed.      |
   | 007        | cancelled             | A prior SPEAK request failed |
   |            |                       | while this one was still in  |
   |            |                       | the queue.                   |
   +------------+-----------------------+------------------------------+

              Synthesizer Resource Completion Cause Codes
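As an illustration of the grammar and table above, the following is a minimal sketch of how a client might split a Completion-Cause value into its code and name. The function and type names are hypothetical and not part of MRCPv2; the sketch assumes the caller has already isolated the header field value (the text after the colon) and removed the trailing CRLF.

```python
import re
from typing import NamedTuple

# Cause codes and names copied from the Completion-Cause table.
COMPLETION_CAUSES = {
    0: "normal",
    1: "barge-in",
    2: "parse-failure",
    3: "uri-failure",
    4: "error",
    5: "language-unsupported",
    6: "lexicon-load-failure",
    7: "cancelled",
}

# Value part of: "Completion-Cause" ":" 3DIGIT SP 1*VCHAR CRLF
# VCHAR is the visible US-ASCII range %x21-7E.
_CAUSE_RE = re.compile(r"([0-9]{3}) ([\x21-\x7e]+)")


class CompletionCause(NamedTuple):
    code: int      # numeric cause code, e.g. 7
    name: str      # cause name as sent by the server
    known: bool    # True if (code, name) matches the table


def parse_completion_cause(value: str) -> CompletionCause:
    """Parse a Completion-Cause value such as '000 normal' (hypothetical helper)."""
    match = _CAUSE_RE.fullmatch(value)
    if match is None:
        raise ValueError(f"malformed Completion-Cause value: {value!r}")
    code = int(match.group(1))
    name = match.group(2)
    return CompletionCause(code, name, COMPLETION_CAUSES.get(code) == name)
```

A client would typically branch on the numeric code and keep the name for logging, for example `parse_completion_cause("007 cancelled").code`.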
8.4.5. Completion-Reason
This header field MAY be specified in a SPEAK-COMPLETE event coming from the synthesizer resource to the client. This contains the reason text behind the SPEAK request completion. This header field communicates text describing the reason for the failure, such as an error in parsing the speech markup text.

   completion-reason  =  "Completion-Reason" ":" quoted-string CRLF

The completion reason text is provided for client use in logs and for debugging and instrumentation purposes. Clients MUST NOT interpret the completion reason text.

8.4.6. Voice-Parameter
This set of header fields defines the voice of the speaker.

   voice-parameter     =  voice-gender
                       /  voice-age
                       /  voice-variant
                       /  voice-name

   voice-gender        =  "Voice-Gender:" voice-gender-value CRLF

   voice-gender-value  =  "male"
                       /  "female"
                       /  "neutral"

   voice-age           =  "Voice-Age:" 1*3DIGIT CRLF

   voice-variant       =  "Voice-Variant:" 1*19DIGIT CRLF

   voice-name          =  "Voice-Name:"
                          1*UTFCHAR *(1*WSP 1*UTFCHAR) CRLF

The "Voice-" parameters are derived from the similarly named attributes of the voice element specified in W3C's Speech Synthesis Markup Language Specification (SSML) [W3C.REC-speech-synthesis-20040907]. Legal values for these parameters are as defined in that specification.

These header fields MAY be sent in SET-PARAMS or GET-PARAMS requests to define or get default values for the entire session or MAY be sent in the SPEAK request to define default values for that SPEAK request. Note that SSML content can itself set these values internal to the SSML document, of course.

Voice parameter header fields MAY also be sent in a CONTROL method to affect a SPEAK request in progress and change its behavior on the fly. If the synthesizer resource does not support this operation, it MUST reject the request with a status-code of 403 "Unsupported Header Field".

8.4.7. Prosody-Parameters
This set of header fields defines the prosody of the speech.

   prosody-parameter    =  "Prosody-" prosody-param-name ":"
                           prosody-param-value CRLF

   prosody-param-name   =  1*VCHAR

   prosody-param-value  =  1*VCHAR

prosody-param-name is any one of the attribute names under the prosody element specified in W3C's Speech Synthesis Markup Language Specification [W3C.REC-speech-synthesis-20040907]. The prosody-param-value is any one of the value choices of the corresponding prosody element attribute from that specification.

These header fields MAY be sent in SET-PARAMS or GET-PARAMS requests to define or get default values for the entire session or MAY be sent in the SPEAK request to define default values for that SPEAK request. Furthermore, these attributes can be part of the speech text marked up in SSML.

The prosody parameter header fields in the SET-PARAMS or SPEAK request only apply if the speech data is of type 'text/plain' and does not use a speech markup format.

These prosody parameter header fields MAY also be sent in a CONTROL method to affect a SPEAK request in progress and change its behavior on the fly. If the synthesizer resource does not support this operation, it MUST respond back to the client with a status-code of 403 "Unsupported Header Field".

8.4.8. Speech-Marker
This header field contains timestamp information in a "timestamp" field. This is a Network Time Protocol (NTP) [RFC5905] timestamp, a 64-bit number in decimal form. It MUST be synced with the Real-time Transport Protocol (RTP) [RFC3550] timestamp of the media stream through the RTP Control Protocol (RTCP) [RFC3550].
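The sketch below shows one way a client might decode a Speech-Marker value, following the speech-marker grammar given at the end of this section. The helper names are hypothetical. The split into seconds and fraction assumes the decimal number is the standard 64-bit NTP timestamp layout from RFC 5905 (32 bits of seconds followed by 32 bits of fraction); this section does not spell that layout out.

```python
from typing import NamedTuple, Optional, Tuple


class SpeechMarker(NamedTuple):
    timestamp: int          # 64-bit NTP timestamp as an integer
    tag: Optional[str]      # last marker tag, or None if absent


def parse_speech_marker(value: str) -> SpeechMarker:
    """Parse e.g. 'timestamp=857206027059;marker-1' (hypothetical helper)."""
    head, semicolon, tag = value.partition(";")
    name, equals, digits = head.partition("=")
    valid_digits = digits.isascii() and digits.isdigit() and len(digits) <= 20
    if name != "timestamp" or not equals or not valid_digits:
        raise ValueError(f"malformed Speech-Marker value: {value!r}")
    if semicolon and not tag:
        # The grammar requires at least one character after ";".
        raise ValueError("empty marker tag")
    return SpeechMarker(int(digits), tag if semicolon else None)


def split_ntp(timestamp: int) -> Tuple[int, float]:
    """Split a 64-bit NTP timestamp into (seconds, fraction of a second).

    Assumes the RFC 5905 layout: upper 32 bits are seconds, lower 32 bits
    are a binary fraction.
    """
    seconds = timestamp >> 32
    fraction = (timestamp & 0xFFFFFFFF) / 2**32
    return seconds, fraction
```

The difference between two such timestamps, for example those in a SPEAK response and the matching SPEAK-COMPLETE event, gives the elapsed media time between the two messages.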
Markers are bookmarks that are defined within the markup. Most speech markup formats provide mechanisms to embed marker fields within speech texts. The synthesizer generates SPEECH-MARKER events when it reaches these marker fields. This header field MUST be part of the SPEECH-MARKER event and contain the marker tag value after the timestamp, separated by a semicolon. In these events, the timestamp marks the time the text corresponding to the marker was emitted as speech by the synthesizer.

This header field MUST also be returned in responses to STOP, CONTROL, and BARGE-IN-OCCURRED methods, in the SPEAK-COMPLETE event, and in an IN-PROGRESS SPEAK response. In these messages, if any markers have been encountered for the current SPEAK, the marker tag value MUST be the last embedded marker encountered. If no markers have yet been encountered for the current SPEAK, only the timestamp is REQUIRED. Note that in these events, the purpose of this header field is to provide timestamp information associated with important events within the lifecycle of a request (start of SPEAK processing, end of SPEAK processing, receipt of CONTROL/STOP/BARGE-IN-OCCURRED).

   timestamp         =  "timestamp" "=" time-stamp-value

   time-stamp-value  =  1*20DIGIT

   speech-marker     =  "Speech-Marker" ":"
                        timestamp
                        [";" 1*(UTFCHAR / %x20)] CRLF

8.4.9. Speech-Language
This header field specifies the default language of the speech data if the language is not specified in the markup. The value of this header field MUST follow RFC 5646 [RFC5646]. The header field MAY occur in SPEAK, SET-PARAMS, or GET-PARAMS requests.

   speech-language  =  "Speech-Language" ":" 1*VCHAR CRLF

8.4.10. Fetch-Hint
When the synthesizer needs to fetch documents or other resources like speech markup or audio files, this header field controls the corresponding URI access properties. This provides client policy on when the synthesizer should retrieve content from the server. A value of "prefetch" indicates the content MAY be downloaded when the request is received, whereas "safe" indicates that content MUST NOT be downloaded until actually referenced. The default value is "prefetch". This header field MAY occur in SPEAK, SET-PARAMS, or GET-PARAMS requests.

   fetch-hint  =  "Fetch-Hint" ":" ("prefetch" / "safe") CRLF

8.4.11. Audio-Fetch-Hint
When the synthesizer needs to fetch documents or other resources like speech audio files, this header field controls the corresponding URI access properties. This provides client policy whether or not the synthesizer is permitted to attempt to optimize speech by pre-fetching audio. The value is either "safe" to say that audio is only fetched when it is referenced, never before; "prefetch" to permit, but not require the implementation to pre-fetch the audio; or "stream" to allow it to stream the audio fetches. The default value is "prefetch". This header field MAY occur in SPEAK, SET-PARAMS, or GET-PARAMS requests.

   audio-fetch-hint  =  "Audio-Fetch-Hint" ":"
                        ("prefetch" / "safe" / "stream") CRLF

8.4.12. Failed-URI
When a synthesizer method needs a synthesizer to fetch or access a URI and the access fails, the server SHOULD provide the failed URI in this header field in the method response, unless there are multiple URI failures, in which case the server MUST provide one of the failed URIs in this header field in the method response.

   failed-uri  =  "Failed-URI" ":" absoluteURI CRLF

8.4.13. Failed-URI-Cause
When a synthesizer method needs a synthesizer to fetch or access a URI and the access fails, the server MUST provide the URI-specific or protocol-specific response code for the URI in the Failed-URI header field in the method response through this header field. The value encoding is UTF-8 (RFC 3629 [RFC3629]) to accommodate any access protocol -- some access protocols might have a response string instead of a numeric response code.

   failed-uri-cause  =  "Failed-URI-Cause" ":" 1*UTFCHAR CRLF
8.4.14. Speak-Restart
When a client issues a CONTROL request to a currently speaking synthesizer resource to jump backward, and the target jump point is before the start of the current SPEAK request, the current SPEAK request MUST restart from the beginning of its speech data and the server's response to the CONTROL request MUST contain this header field with a value of "true" indicating a restart.

   speak-restart  =  "Speak-Restart" ":" BOOLEAN CRLF

8.4.15. Speak-Length
This header field MAY be specified in a CONTROL method to control the maximum length of speech to speak, relative to the current speaking point in the currently active SPEAK request. If numeric, the value MUST be a positive integer. If a header field with a Tag unit is specified, then the speech output continues until the tag is reached or the SPEAK request is completed, whichever comes first. This header field MAY be specified in a SPEAK request to indicate the length to speak from the speech data and is relative to the point in speech that the SPEAK request starts. The different speech length units supported are synthesizer implementation dependent. If a server does not support the specified unit, the server MUST respond with a status-code of 409 "Unsupported Header Field Value".

   speak-length           =  "Speak-Length" ":" positive-length-value
                             CRLF

   positive-length-value  =  positive-speech-length
                          /  text-speech-length

   text-speech-length     =  1*UTFCHAR SP "Tag"

   positive-speech-length =  1*19DIGIT SP numeric-speech-unit

   numeric-speech-unit    =  "Second"
                          /  "Word"
                          /  "Sentence"
                          /  "Paragraph"
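The Jump-Size (Section 8.4.1) and Speak-Length grammars share the same units and differ mainly in whether a leading sign is required. The following sketch, with hypothetical names, parses either value. It assumes that UTFCHAR excludes whitespace, matches the unit keywords case-insensitively (ABNF string literals are case-insensitive), and treats a Speak-Length of zero as invalid because the text above requires a positive integer.

```python
import re
from typing import NamedTuple, Union


class NumericLength(NamedTuple):
    amount: int   # signed for Jump-Size, positive for Speak-Length
    unit: str     # "Second", "Word", "Sentence", or "Paragraph"


class TagLength(NamedTuple):
    tag: str      # marker tag name to jump to or speak until


_NUMERIC_RE = re.compile(
    r"([+-]?)([0-9]{1,19}) (Second|Word|Sentence|Paragraph)", re.IGNORECASE
)
_TAG_RE = re.compile(r"(\S+) Tag", re.IGNORECASE)


def parse_length(value: str, signed: bool) -> Union[NumericLength, TagLength]:
    """Parse a Jump-Size (signed=True) or Speak-Length (signed=False) value."""
    match = _NUMERIC_RE.fullmatch(value)
    if match:
        sign, digits, unit = match.groups()
        if signed != bool(sign):
            raise ValueError("sign is required for Jump-Size only")
        amount = int(digits)
        if not signed and amount == 0:
            raise ValueError("Speak-Length must be a positive integer")
        return NumericLength(-amount if sign == "-" else amount,
                             unit.capitalize())
    match = _TAG_RE.fullmatch(value)
    if match:
        return TagLength(match.group(1))
    raise ValueError(f"malformed speech length: {value!r}")
```

A server built along these lines would still need to check whether it supports the parsed unit and answer 409 "Unsupported Header Field Value" when it does not.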
8.4.16. Load-Lexicon
This header field is used to indicate whether a lexicon has to be loaded or unloaded. The value "true" means to load the lexicon if not already loaded, and the value "false" means to unload the lexicon if it is loaded. The default value for this header field is "true". This header field MAY be specified in a DEFINE-LEXICON method.

   load-lexicon  =  "Load-Lexicon" ":" BOOLEAN CRLF

8.4.17. Lexicon-Search-Order
This header field is used to specify a list of active pronunciation lexicon URIs and the search order among the active lexicons. Lexicons specified within the SSML document take precedence over the lexicons specified in this header field. This header field MAY be specified in the SPEAK, SET-PARAMS, and GET-PARAMS methods.

   lexicon-search-order  =  "Lexicon-Search-Order" ":"
                            "<" absoluteURI ">" *(" " "<" absoluteURI ">")
                            CRLF

8.5. Synthesizer Message Body
A synthesizer message can contain additional information associated with the Request, Response, or Event in its message body.

8.5.1. Synthesizer Speech Data
Marked-up text for the synthesizer to speak is specified as a typed media entity in the message body. The speech data to be spoken by the synthesizer can be specified inline by embedding the data in the message body or by reference by providing a URI for accessing the data. In either case, the data and the format used to mark up the speech need to be of a content type supported by the server.

All MRCPv2 servers containing synthesizer resources MUST support both plain text speech data and W3C's Speech Synthesis Markup Language [W3C.REC-speech-synthesis-20040907] and hence MUST support the media types 'text/plain' and 'application/ssml+xml'. Other formats MAY be supported.

If the speech data is to be fetched by URI reference, the media type 'text/uri-list' (see RFC 2483 [RFC2483]) is used to indicate one or more URIs that, when dereferenced, will contain the content to be spoken. If a list of speech URIs is specified, the resource MUST speak the speech data provided by each URI in the order in which the URIs are specified in the content.
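As a rough illustration of the ordering rule, the sketch below extracts the URIs from a 'text/uri-list' body in the order they appear. The function name is hypothetical. It relies on the RFC 2483 convention that each URI sits on its own line and that lines beginning with "#" are comments; fetching and speaking the content is left out.

```python
from typing import List


def parse_uri_list(body: str) -> List[str]:
    """Return the URIs of a text/uri-list body, preserving their order."""
    uris = []
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            # Skip blank lines and RFC 2483 comment lines.
            continue
        uris.append(line)
    return uris


example_body = (
    "# introduction first, then the document body\r\n"
    "http://www.example.com/ASR-Introduction.ssml\r\n"
    "http://www.example.com/ASR-Document-Part1.ssml\r\n"
)
speak_order = parse_uri_list(example_body)
```

Because the result is a list rather than a set, the synthesizer side can iterate over it directly and keep the order in which the client listed the URIs.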
MRCPv2 clients and servers MUST support the 'multipart/mixed' media type. This is the appropriate media type to use when providing a mix of URI and inline speech data. Embedded within the multipart content block, there MAY be content for the 'text/uri-list', 'application/ssml+xml', and/or 'text/plain' media types. The character set and encoding used in the speech data is specified according to standard media type definitions. The multipart content MAY also contain actual audio data. Clients may have recorded audio clips stored in memory or on a local device and wish to play it as part of the SPEAK request. The audio portions MAY be sent by the client as part of the multipart content block. This audio is referenced in the speech markup data that is another part in the multipart content block according to the 'multipart/mixed' media type specification.

   Content-Type:text/uri-list
   Content-Length:...

   http://www.example.com/ASR-Introduction.ssml
   http://www.example.com/ASR-Document-Part1.ssml
   http://www.example.com/ASR-Document-Part2.ssml
   http://www.example.com/ASR-Conclusion.ssml

                           URI List Example

   Content-Type:application/ssml+xml
   Content-Length:...

   <?xml version="1.0"?>
   <speak version="1.0"
          xmlns="http://www.w3.org/2001/10/synthesis"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
             http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
          xml:lang="en-US">
     <p>
       <s>You have 4 new messages.</s>
       <s>The first is from Aldine Turnbet
       and arrived at <break/>
       <say-as interpret-as="vxml:time">0345p</say-as>.</s>
       <s>The subject is <prosody
          rate="-20%">ski trip</prosody></s>
     </p>
   </speak>

                             SSML Example
   Content-Type:multipart/mixed; boundary="break"

   --break
   Content-Type:text/uri-list
   Content-Length:...

   http://www.example.com/ASR-Introduction.ssml
   http://www.example.com/ASR-Document-Part1.ssml
   http://www.example.com/ASR-Document-Part2.ssml
   http://www.example.com/ASR-Conclusion.ssml

   --break
   Content-Type:application/ssml+xml
   Content-Length:...

   <?xml version="1.0"?>
   <speak version="1.0"
          xmlns="http://www.w3.org/2001/10/synthesis"
          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
             http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
          xml:lang="en-US">
     <p>
       <s>You have 4 new messages.</s>
       <s>The first is from Stephanie Williams
       and arrived at <break/>
       <say-as interpret-as="vxml:time">0342p</say-as>.</s>
       <s>The subject is <prosody
          rate="-20%">ski trip</prosody></s>
     </p>
   </speak>
   --break--

                           Multipart Example

8.5.2. Lexicon Data
Synthesizer lexicon data from the client to the server can be provided inline or by reference. Either way, they are carried as typed media in the message body of the MRCPv2 request message (see Section 8.14).

When a lexicon is specified inline in the message, the client MUST provide a Content-ID for that lexicon as part of the content header fields. The server MUST store the lexicon associated with that Content-ID for the duration of the session. A stored lexicon can be overwritten by defining a new lexicon with the same Content-ID.

Lexicons that have been associated with a Content-ID can be referenced through the 'session' URI scheme (see Section 13.6).

If lexicon data is specified by external URI reference, the media type 'text/uri-list' (see RFC 2483 [RFC2483]) is used to list the one or more URIs that may be dereferenced to obtain the lexicon data. All MRCPv2 servers MUST support the "http" and "https" URI access mechanisms, and MAY support other mechanisms.

If the data in the message body consists of a mix of URI and inline lexicon data, the 'multipart/mixed' media type is used. The character set and encoding used in the lexicon data may be specified according to standard media type definitions.

8.6. SPEAK Method
The SPEAK request provides the synthesizer resource with the speech text and initiates speech synthesis and streaming. The SPEAK method MAY carry voice and prosody header fields that alter the behavior of the voice being synthesized, as well as a typed media message body containing the actual marked-up text to be spoken.

The SPEAK method implementation MUST do a fetch of all external URIs that are part of that operation. If caching is implemented, this URI fetching MUST conform to the cache-control hints and parameter header fields associated with the method in deciding whether it is to be fetched from cache or from the external server. If these hints/parameters are not specified in the method, the values set for the session using SET-PARAMS/GET-PARAMS apply. If they were not set for the session, their default values apply.

When applying voice parameters, there are three levels of precedence. The highest precedence goes to those specified within the speech markup text, followed by those specified in the header fields of the SPEAK request and hence that apply for that SPEAK request only, followed by the session default values that can be set using the SET-PARAMS request and apply for subsequent methods invoked during the session.

If the resource was idle at the time the SPEAK request arrived at the server and the SPEAK method is being actively processed, the resource responds immediately with a success status code and a request-state of IN-PROGRESS.

If the resource is in the speaking or paused state when the SPEAK method arrives at the server, i.e., it is in the middle of processing a previous SPEAK request, the status returns success with a request-state of PENDING. The server places the SPEAK request in the synthesizer resource request queue. The request queue operates strictly FIFO: requests are processed serially in order of receipt. If the current SPEAK fails, all SPEAK methods in the pending queue are cancelled and each generates a SPEAK-COMPLETE event with a Completion-Cause of "cancelled".

For the synthesizer resource, SPEAK is the only method that can return a request-state of IN-PROGRESS or PENDING. When the text has been synthesized and played into the media stream, the resource issues a SPEAK-COMPLETE event with the request-id of the SPEAK request and a request-state of COMPLETE.

   C->S: MRCP/2.0 ... SPEAK 543257
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-Age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
         <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
           <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
               <break/>
               <say-as interpret-as="vxml:time">0342p</say-as>.
             </s>
             <s>The subject is
               <prosody rate="-20%">ski trip</prosody>
             </s>
           </p>
         </speak>

   S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   S->C: MRCP/2.0 ... SPEAK-COMPLETE 543257 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Completion-Cause:000 normal
         Speech-Marker:timestamp=857206027059

                             SPEAK Example
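The queueing rules above can be sketched as a small model. This is an illustrative simplification with hypothetical class and method names, not a server implementation: it tracks only request-ids, ignores the paused state, STOP, and barge-in, and records the SPEAK-COMPLETE events it would send instead of writing MRCPv2 messages.

```python
from collections import deque
from typing import Deque, List, Optional, Tuple


class SynthesizerQueue:
    """Toy model of the SPEAK request queue of one synthesizer resource."""

    def __init__(self) -> None:
        self.active: Optional[int] = None       # request-id being spoken
        self.pending: Deque[int] = deque()      # FIFO of queued request-ids
        # (event-name, request-id, Completion-Cause) tuples, in send order.
        self.events: List[Tuple[str, int, str]] = []

    def speak(self, request_id: int) -> str:
        """Accept a SPEAK and return the request-state of the 200 response."""
        if self.active is None:
            self.active = request_id
            return "IN-PROGRESS"
        self.pending.append(request_id)
        return "PENDING"

    def complete(self) -> None:
        """The active SPEAK finished normally; start the next one, if any."""
        if self.active is None:
            raise RuntimeError("no active SPEAK request")
        self.events.append(("SPEAK-COMPLETE", self.active, "000 normal"))
        self.active = self.pending.popleft() if self.pending else None

    def fail(self, cause: str = "004 error") -> None:
        """The active SPEAK failed; cancel everything queued behind it."""
        if self.active is None:
            raise RuntimeError("no active SPEAK request")
        self.events.append(("SPEAK-COMPLETE", self.active, cause))
        while self.pending:
            self.events.append(
                ("SPEAK-COMPLETE", self.pending.popleft(), "007 cancelled")
            )
        self.active = None
```

In this model, three SPEAK requests sent back to back receive IN-PROGRESS, PENDING, and PENDING; if the second one later fails, the third is reported as cancelled without ever being spoken.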
8.7. STOP
The STOP method from the client to the server tells the synthesizer resource to stop speaking if it is speaking something.

The STOP request can be sent with an Active-Request-Id-List header field to stop the zero or more specific SPEAK requests that may be in queue and return a response status-code of 200 "Success". If no Active-Request-Id-List header field is sent in the STOP request, the server terminates all outstanding SPEAK requests.

If a STOP request successfully terminated one or more PENDING or IN-PROGRESS SPEAK requests, then the response MUST contain an Active-Request-Id-List header field enumerating the SPEAK request-ids that were terminated. Otherwise, there is no Active-Request-Id-List header field in the response. No SPEAK-COMPLETE events are sent for such terminated requests.

If a SPEAK request that was IN-PROGRESS and speaking was stopped, the next pending SPEAK request, if any, becomes IN-PROGRESS at the resource and enters the speaking state.

If a SPEAK request that was IN-PROGRESS and paused was stopped, the next pending SPEAK request, if any, becomes IN-PROGRESS and enters the paused state.

   C->S: MRCP/2.0 ... SPEAK 543258
         Channel-Identifier:32AECB23433802@speechsynth
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
         <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
           <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
               <break/>
               <say-as interpret-as="vxml:time">0342p</say-as>.</s>
             <s>The subject is
               <prosody rate="-20%">ski trip</prosody></s>
           </p>
         </speak>

   S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   C->S: MRCP/2.0 ... STOP 543259
         Channel-Identifier:32AECB23433802@speechsynth

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258
         Speech-Marker:timestamp=857206039059

                              STOP Example

8.8. BARGE-IN-OCCURRED
The BARGE-IN-OCCURRED method, when used with the synthesizer resource, provides a client that has detected a barge-in-able event a means to communicate the occurrence of the event to the synthesizer resource.

This method is useful in two scenarios:

   1.  The client has detected DTMF digits in the input media or some other barge-in-able event and wants to communicate that to the synthesizer resource.

   2.  The recognizer resource and the synthesizer resource are in different servers. In this case, the client acts as an intermediary for the two servers. It receives an event from the recognition resource and sends a BARGE-IN-OCCURRED request to the synthesizer. In such cases, the BARGE-IN-OCCURRED method would also have a Proxy-Sync-Id header field received from the resource generating the original event.

If a SPEAK request is active with kill-on-barge-in enabled (see Section 8.4.2), and the BARGE-IN-OCCURRED event is received, the synthesizer MUST immediately stop streaming out audio. It MUST also terminate any speech requests queued behind the current active one, irrespective of whether or not they have barge-in enabled. If a barge-in-able SPEAK request was playing and it was terminated, the response MUST contain an Active-Request-Id-List header field listing the request-ids of all SPEAK requests that were terminated. The server generates no SPEAK-COMPLETE events for these requests.

If there were no SPEAK requests terminated by the synthesizer resource as a result of the BARGE-IN-OCCURRED method, the server MUST respond to the BARGE-IN-OCCURRED with a status-code of 200 "Success", and the response MUST NOT contain an Active-Request-Id-List header field.

If the synthesizer and recognizer resources are part of the same MRCPv2 session, they can be optimized for a quicker kill-on-barge-in response if the recognizer and synthesizer interact directly. In these cases, the client MUST still react to a START-OF-INPUT event from the recognizer by invoking the BARGE-IN-OCCURRED method to the synthesizer. The client MUST invoke the BARGE-IN-OCCURRED if it has any outstanding requests to the synthesizer resource in either the PENDING or IN-PROGRESS state.

   C->S: MRCP/2.0 ... SPEAK 543258
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-Age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
         <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
           <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
               <break/>
               <say-as interpret-as="vxml:time">0342p</say-as>.</s>
             <s>The subject is
               <prosody rate="-20%">ski trip</prosody></s>
           </p>
         </speak>

   S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   C->S: MRCP/2.0 ... BARGE-IN-OCCURRED 543259
         Channel-Identifier:32AECB23433802@speechsynth
         Proxy-Sync-Id:987654321

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258
         Speech-Marker:timestamp=857206039059

                       BARGE-IN-OCCURRED Example

8.9. PAUSE
The PAUSE method from the client to the server tells the synthesizer resource to pause speech output if it is speaking something. If a PAUSE method is issued on a session when a SPEAK is not active, the server MUST respond with a status-code of 402 "Method not valid in this state". If a PAUSE method is issued on a session when a SPEAK is active and paused, the server MUST respond with a status-code of 200 "Success". If a SPEAK request was active, the server MUST return an Active-Request-Id-List header field whose value contains the request-id of the SPEAK request that was paused.

   C->S: MRCP/2.0 ... SPEAK 543258
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-Age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
         <speak version="1.0"
             xmlns="http://www.w3.org/2001/10/synthesis"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
             http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
             xml:lang="en-US">
           <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>
             <s>The subject is
                <prosody rate="-20%">ski trip</prosody></s>
           </p>
         </speak>

   S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   C->S: MRCP/2.0 ... PAUSE 543259
         Channel-Identifier:32AECB23433802@speechsynth

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258

                           PAUSE Example

8.10. RESUME
The RESUME method from the client to the server tells a paused synthesizer resource to resume speaking. If a RESUME request is issued on a session with no active SPEAK request, the server MUST respond with a status-code of 402 "Method not valid in this state". If a RESUME request is issued on a session with an active SPEAK request that is speaking (i.e., not paused), the server MUST respond with a status-code of 200 "Success". If a SPEAK request was paused, the server MUST return an Active-Request-Id-List header field whose value contains the request-id of the SPEAK request that was resumed.

   C->S: MRCP/2.0 ... SPEAK 543258
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
         <speak version="1.0"
             xmlns="http://www.w3.org/2001/10/synthesis"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
             http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
             xml:lang="en-US">
           <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>
             <s>The subject is
                <prosody rate="-20%">ski trip</prosody></s>
           </p>
         </speak>

   S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   C->S: MRCP/2.0 ... PAUSE 543259
         Channel-Identifier:32AECB23433802@speechsynth

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258

   C->S: MRCP/2.0 ... RESUME 543260
         Channel-Identifier:32AECB23433802@speechsynth

   S->C: MRCP/2.0 ... 543260 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258

                           RESUME Example

8.11. CONTROL
The CONTROL method from the client to the server tells a synthesizer that is speaking to modify what it is speaking on the fly. This method is used to request the synthesizer to jump forward or backward in what it is speaking, change speaker rate, speaker parameters, etc. It affects only the currently IN-PROGRESS SPEAK request. Depending on the implementation and capability of the synthesizer resource, it may or may not support the various modifications indicated by header fields in the CONTROL request.

When a client invokes a CONTROL method to jump forward and the operation goes beyond the end of the active SPEAK method's text, the CONTROL request still succeeds. The active SPEAK request completes and returns a SPEAK-COMPLETE event following the response to the CONTROL method. If there are more SPEAK requests in the queue, the synthesizer resource starts at the beginning of the next SPEAK request in the queue.

When a client invokes a CONTROL method to jump backward and the operation jumps to the beginning or beyond the beginning of the speech data of the active SPEAK method, the CONTROL request still succeeds. The response to the CONTROL request contains the Speak-Restart header field, and the active SPEAK request restarts from the beginning of its speech data.
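The two boundary cases above can be illustrated with a small sketch. It is a simplified model under stated assumptions: positions and jump sizes are counted in words, each SPEAK request is represented only by its word count, and the names JumpResult and apply_jump are hypothetical rather than anything defined by the protocol.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JumpResult:
    queue_index: Optional[int]  # active SPEAK after the jump (None if queue is empty)
    position: int               # word offset within that request
    speak_complete: bool        # SPEAK-COMPLETE follows the CONTROL response
    speak_restart: bool         # CONTROL response carries Speak-Restart


def apply_jump(lengths: List[int], queue_index: int,
               position: int, jump: int) -> JumpResult:
    """Apply a relative jump (in words) to the active SPEAK request.

    lengths[i] is the word count of the i-th SPEAK request in the queue.
    """
    target = position + jump
    if jump > 0 and target >= lengths[queue_index]:
        # Jumped beyond the end: the active SPEAK completes, and the next
        # queued SPEAK (if any) starts from its beginning.
        nxt = queue_index + 1
        return JumpResult(nxt if nxt < len(lengths) else None, 0, True, False)
    if jump < 0 and target <= 0:
        # Jumped to or beyond the beginning: restart the active SPEAK.
        return JumpResult(queue_index, 0, False, True)
    # Ordinary jump inside the active SPEAK request.
    return JumpResult(queue_index, target, False, False)
```

In both boundary cases the CONTROL request itself still succeeds; only the follow-up behavior differs.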
The jump-forward and jump-backward behaviors described above can be used to rewind or fast-forward across multiple speech requests, if the client wants to break up a speech markup text into multiple SPEAK requests.

If a SPEAK request was active when the CONTROL method was received, the server MUST return an Active-Request-Id-List header field containing the request-id of the SPEAK request that was active.

   C->S: MRCP/2.0 ... SPEAK 543258
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
         <speak version="1.0"
             xmlns="http://www.w3.org/2001/10/synthesis"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
             http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
             xml:lang="en-US">
           <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>
             <s>The subject is
                <prosody rate="-20%">ski trip</prosody></s>
           </p>
         </speak>

   S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857205016059

   C->S: MRCP/2.0 ... CONTROL 543259
         Channel-Identifier:32AECB23433802@speechsynth
         Prosody-rate:fast

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258
         Speech-Marker:timestamp=857206027059

   C->S: MRCP/2.0 ... CONTROL 543260
         Channel-Identifier:32AECB23433802@speechsynth
         Jump-Size:-15 Words

   S->C: MRCP/2.0 ... 543260 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258
         Speech-Marker:timestamp=857206039059

                          CONTROL Example

8.12. SPEAK-COMPLETE
This is an Event message from the synthesizer resource to the client that indicates the corresponding SPEAK request was completed. The request-id field matches the request-id of the SPEAK request that initiated the speech that just completed. The request-state field is set to COMPLETE by the server, indicating that this is the last event with the corresponding request-id. The Completion-Cause header field specifies the cause code pertaining to the status and reason of request completion, such as the SPEAK completed normally or because of an error, kill-on-barge-in, etc.

   C->S: MRCP/2.0 ... SPEAK 543260
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
         <speak version="1.0"
             xmlns="http://www.w3.org/2001/10/synthesis"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
             http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
             xml:lang="en-US">
           <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>
             <s>The subject is
                <prosody rate="-20%">ski trip</prosody></s>
           </p>
         </speak>

   S->C: MRCP/2.0 ... 543260 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   S->C: MRCP/2.0 ... SPEAK-COMPLETE 543260 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Completion-Cause:000 normal
         Speech-Marker:timestamp=857206039059

                       SPEAK-COMPLETE Example

8.13. SPEECH-MARKER
This is an event generated by the synthesizer resource to the client when the synthesizer encounters a marker tag in the speech markup it is currently processing. The value of the request-id field MUST match that of the corresponding SPEAK request. The request-state field MUST have the value "IN-PROGRESS" as the speech is still not complete. The value of the speech marker tag hit, describing where the synthesizer is in the speech markup, MUST be returned in the Speech-Marker header field, along with an NTP timestamp indicating the instant in the output speech stream that the marker was encountered. The SPEECH-MARKER event MUST also be generated with a null marker value and output NTP timestamp when a SPEAK request in Pending-State (i.e., in the queue) changes state to IN-PROGRESS and starts speaking. The NTP timestamp MUST be synchronized with the RTP timestamp used to generate the speech stream through standard RTCP machinery.

   C->S: MRCP/2.0 ... SPEAK 543261
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
         <speak version="1.0"
             xmlns="http://www.w3.org/2001/10/synthesis"
             xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
             xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
             http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
             xml:lang="en-US">
           <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>
             <mark name="here"/>
             <s>The subject is
                <prosody rate="-20%">ski trip</prosody>
             </s>
             <mark name="ANSWER"/>
           </p>
         </speak>

   S->C: MRCP/2.0 ... 543261 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857205015059

   S->C: MRCP/2.0 ... SPEECH-MARKER 543261 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059;here

   S->C: MRCP/2.0 ... SPEECH-MARKER 543261 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206039059;ANSWER

   S->C: MRCP/2.0 ... SPEAK-COMPLETE 543261 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Completion-Cause:000 normal
         Speech-Marker:timestamp=857207689259;ANSWER

                       SPEECH-MARKER Example

8.14. DEFINE-LEXICON
The DEFINE-LEXICON method, from the client to the server, provides a lexicon and tells the server to load or unload the lexicon (see Section 8.4.16). The media type of the lexicon is provided in the Content-Type header (see Section 8.5.2). One such media type is "application/pls+xml" for the Pronunciation Lexicon Specification (PLS) [W3C.REC-pronunciation-lexicon-20081014] [RFC4267].

If the server resource is in the speaking or paused state, the server MUST respond with a failure status-code of 402 "Method not valid in this state".

If the resource is in the idle state and is able to successfully load/unload the lexicon, the server MUST return a 200 "Success" status-code and the request-state MUST be COMPLETE.
If the synthesizer could not define the lexicon for some reason, for example, because the download failed or the lexicon was in an unsupported form, the server MUST respond with a failure status-code of 407 and a Completion-Cause header field describing the failure reason.
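The DEFINE-LEXICON status-code rules above can be summarized in a short sketch. The function name define_lexicon_status, the string state labels, and the apply_lexicon callback are all hypothetical; the failure text returned alongside 407 only stands in for whatever the server would report in the Completion-Cause header field, whose actual values are not modeled here.

```python
from typing import Callable, Optional, Tuple


def define_lexicon_status(
    resource_state: str,
    apply_lexicon: Callable[[], None],
) -> Tuple[int, Optional[str]]:
    """Pick the DEFINE-LEXICON response status-code.

    resource_state is one of "idle", "speaking", or "paused" (labels assumed
    for this sketch). apply_lexicon is a hypothetical callback that loads or
    unloads the lexicon and raises an exception on failure.
    Returns (status_code, failure_reason).
    """
    if resource_state in ("speaking", "paused"):
        # 402 "Method not valid in this state"
        return 402, None
    try:
        apply_lexicon()
    except Exception as exc:
        # Failure to define the lexicon: 407, with the reason to be conveyed
        # through a Completion-Cause header field.
        return 407, str(exc)
    # 200 "Success"; the request-state of this response is COMPLETE.
    return 200, None
```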