
RFC 6787

Media Resource Control Protocol Version 2 (MRCPv2)

Pages: 224
Proposed Standard
Errata
Part 3 of 8 – Pages 46 to 72

Top   ToC   RFC6787 - Page 46   prevText

7. Resource Discovery

Server resources may be discovered and their capabilities learned by clients through standard SIP machinery. The client MAY issue a SIP OPTIONS transaction to a server, which has the effect of requesting the capabilities of the server. The server MUST respond to such a request with an SDP-encoded description of its capabilities according to RFC 3264 [RFC3264]. The MRCPv2 capabilities are described by a single "m=" line containing the media type "application" and transport type "TCP/TLS/MRCPv2" or "TCP/MRCPv2". There MUST be one "resource" attribute for each media resource that the server supports, and it has the resource type identifier as its value.

The SDP description MUST also contain "m=" lines describing the audio capabilities and the coders the server supports.

In this example, the client uses the SIP OPTIONS method to query the capabilities of the MRCPv2 server.

   C->S:
        OPTIONS sip:mrcp@server.example.com SIP/2.0
        Via:SIP/2.0/TCP client.atlanta.example.com:5060;
         branch=z9hG4bK74bf7
        Max-Forwards:6
        To:<sip:mrcp@example.com>
        From:Sarvi <sip:sarvi@example.com>;tag=1928301774
        Call-ID:a84b4c76e66710
        CSeq:63104 OPTIONS
        Contact:<sip:sarvi@client.example.com>
        Accept:application/sdp
        Content-Length:0

   S->C:
        SIP/2.0 200 OK
        Via:SIP/2.0/TCP client.atlanta.example.com:5060;
         branch=z9hG4bK74bf7;received=192.0.32.10
        To:<sip:mrcp@example.com>;tag=62784
        From:Sarvi <sip:sarvi@example.com>;tag=1928301774
        Call-ID:a84b4c76e66710
        CSeq:63104 OPTIONS
        Contact:<sip:mrcp@server.example.com>
        Allow:INVITE, ACK, CANCEL, OPTIONS, BYE
        Accept:application/sdp
        Accept-Encoding:gzip
        Accept-Language:en
        Supported:foo
        Content-Type:application/sdp
        Content-Length:...

        v=0
        o=sarvi 2890844536 2890842811 IN IP4 192.0.2.12
        s=-
        i=MRCPv2 server capabilities
        c=IN IP4 192.0.2.12/127
        t=0 0
        m=application 0 TCP/TLS/MRCPv2 1
        a=resource:speechsynth
        a=resource:speechrecog
        a=resource:speakverify
        m=audio 0 RTP/AVP 0 3
        a=rtpmap:0 PCMU/8000
        a=rtpmap:3 GSM/8000

         Using SIP OPTIONS for MRCPv2 Server Capability Discovery
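As an illustration of how a client might read the capability description above, the following Python sketch collects the "resource" attributes that appear under the MRCPv2 "m=application" line. The function name mrcp_resources and the constant EXAMPLE_SDP are hypothetical names introduced for this sketch; it is a simplified reading of the SDP body, not a complete SDP parser.

```python
def mrcp_resources(sdp: str) -> list[str]:
    """Return resource types listed under an MRCPv2 "m=application" line."""
    resources = []
    in_mrcp_media = False
    for raw in sdp.splitlines():
        line = raw.strip()
        if line.startswith("m="):
            # m=<media> <port> <transport> <format list>
            fields = line[2:].split()
            in_mrcp_media = (
                len(fields) >= 3
                and fields[0] == "application"
                and fields[2] in ("TCP/MRCPv2", "TCP/TLS/MRCPv2")
            )
        elif in_mrcp_media and line.startswith("a=resource:"):
            resources.append(line[len("a=resource:"):])
    return resources


# SDP body copied from the 200 OK response in the example above.
EXAMPLE_SDP = """\
v=0
o=sarvi 2890844536 2890842811 IN IP4 192.0.2.12
s=-
i=MRCPv2 server capabilities
c=IN IP4 192.0.2.12/127
t=0 0
m=application 0 TCP/TLS/MRCPv2 1
a=resource:speechsynth
a=resource:speechrecog
a=resource:speakverify
m=audio 0 RTP/AVP 0 3
a=rtpmap:0 PCMU/8000
a=rtpmap:3 GSM/8000
"""

print(mrcp_resources(EXAMPLE_SDP))
```

Attributes that follow the "m=audio" line are ignored because the sketch only tracks attributes belonging to the most recent "m=" line.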

8. Speech Synthesizer Resource

This resource processes text markup provided by the client and generates a stream of synthesized speech in real time. Depending upon the server implementation and capability of this resource, the client can also dictate parameters of the synthesized speech such as voice characteristics, speaker speed, etc.

The synthesizer resource is controlled by MRCPv2 requests from the client. Similarly, the resource can respond to these requests or generate asynchronous events to the client to indicate conditions of interest to the client during the generation of the synthesized speech stream.

This section applies for the following resource types:

   o  speechsynth

   o  basicsynth

The capabilities of these resources are defined in Section 3.1.

8.1. Synthesizer State Machine

The synthesizer maintains a state machine to process MRCPv2 requests from the client. The state transitions shown below describe the states of the synthesizer and reflect the state of the request at the head of the synthesizer resource queue. A SPEAK request in the PENDING state can be deleted or stopped by a STOP request without affecting the state of the resource.

   Idle                    Speaking                  Paused
   State                   State                     State
     |                        |                          |
     |----------SPEAK-------->|                 |--------|
     |<------STOP-------------|             CONTROL      |
     |<----SPEAK-COMPLETE-----|                 |------->|
     |<----BARGE-IN-OCCURRED--|                          |
     |              |---------|                          |
     |          CONTROL       |-----------PAUSE--------->|
     |              |-------->|<----------RESUME---------|
     |                        |               |----------|
     |----------|             |              PAUSE       |
     |    BARGE-IN-OCCURRED   |               |--------->|
     |<---------|             |----------|               |
     |                        |      SPEECH-MARKER       |
     |                        |<---------|               |
     |----------|             |----------|               |
     |         STOP           |       RESUME             |
     |          |             |<---------|               |
     |<---------|             |                          |
     |<---------------------STOP-------------------------|
     |----------|             |                          |
     |     DEFINE-LEXICON     |                          |
     |          |             |                          |
     |<---------|             |                          |
     |<---------------BARGE-IN-OCCURRED------------------|

                         Synthesizer State Machine
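The transitions in the figure can be sketched as a lookup table. This is a literal reading of the diagram only: the state names, the TRANSITIONS table, and the function next_state are hypothetical names for this sketch, and queue effects (for example, a pending SPEAK becoming active after a STOP) are not modeled.

```python
# (current state, method or event) -> next state, read off the figure.
TRANSITIONS = {
    ("idle", "SPEAK"): "speaking",
    ("idle", "STOP"): "idle",
    ("idle", "BARGE-IN-OCCURRED"): "idle",
    ("idle", "DEFINE-LEXICON"): "idle",
    ("speaking", "STOP"): "idle",
    ("speaking", "SPEAK-COMPLETE"): "idle",
    ("speaking", "BARGE-IN-OCCURRED"): "idle",
    ("speaking", "CONTROL"): "speaking",
    ("speaking", "SPEECH-MARKER"): "speaking",
    ("speaking", "RESUME"): "speaking",
    ("speaking", "PAUSE"): "paused",
    ("paused", "CONTROL"): "paused",
    ("paused", "PAUSE"): "paused",
    ("paused", "RESUME"): "speaking",
    ("paused", "STOP"): "idle",
    ("paused", "BARGE-IN-OCCURRED"): "idle",
}


def next_state(state: str, trigger: str) -> str:
    """Return the state reached from `state` on `trigger`."""
    try:
        return TRANSITIONS[(state, trigger)]
    except KeyError:
        # The figure shows no such arrow; how a server answers is out of scope here.
        raise ValueError(f"no transition for {trigger} in state {state}") from None


print(next_state("idle", "SPEAK"))      # speaking
print(next_state("speaking", "PAUSE"))  # paused
```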

8.2. Synthesizer Methods

The synthesizer supports the following methods.
   synthesizer-method   =  "SPEAK"
                        /  "STOP"
                        /  "PAUSE"
                        /  "RESUME"
                        /  "BARGE-IN-OCCURRED"
                        /  "CONTROL"
                        /  "DEFINE-LEXICON"

8.3. Synthesizer Events

The synthesizer can generate the following events.

   synthesizer-event    =  "SPEECH-MARKER"
                        /  "SPEAK-COMPLETE"

8.4. Synthesizer Header Fields

A synthesizer method can contain header fields containing request options and information to augment the Request, Response, or Event it is associated with.

   synthesizer-header  =  jump-size
                       /  kill-on-barge-in
                       /  speaker-profile
                       /  completion-cause
                       /  completion-reason
                       /  voice-parameter
                       /  prosody-parameter
                       /  speech-marker
                       /  speech-language
                       /  fetch-hint
                       /  audio-fetch-hint
                       /  failed-uri
                       /  failed-uri-cause
                       /  speak-restart
                       /  speak-length
                       /  load-lexicon
                       /  lexicon-search-order

8.4.1. Jump-Size

This header field MAY be specified in a CONTROL method and controls the amount to jump forward or backward in an active SPEAK request. A '+' or '-' indicates a relative value to what is being currently played. This header field MAY also be specified in a SPEAK request as a desired offset into the synthesized speech. In this case, the synthesizer MUST begin speaking from this amount of time into the speech markup. Note that an offset that extends beyond the end of
   the produced speech will result in audio of length zero.  The
   different speech length units supported are dependent on the
   synthesizer implementation.  If the synthesizer resource does not
   support a unit for the operation, the resource MUST respond with a
   status-code of 409 "Unsupported Header Field Value".

   jump-size             =   "Jump-Size" ":" speech-length-value CRLF

   speech-length-value   =   numeric-speech-length
                         /   text-speech-length

   text-speech-length    =   1*UTFCHAR SP "Tag"

   numeric-speech-length =    ("+" / "-") positive-speech-length

   positive-speech-length =   1*19DIGIT SP numeric-speech-unit

   numeric-speech-unit   =   "Second"
                         /   "Word"
                         /   "Sentence"
                         /   "Paragraph"

8.4.2. Kill-On-Barge-In

This header field MAY be sent as part of the SPEAK method to enable "kill-on-barge-in" support. If enabled, the SPEAK method is interrupted by DTMF input detected by a signal detector resource or by the start of speech sensed or recognized by the speech recognizer resource.

   kill-on-barge-in      =   "Kill-On-Barge-In" ":" BOOLEAN CRLF

The client MUST send a BARGE-IN-OCCURRED method to the synthesizer resource when it receives a barge-in-able event from any source. This source could be a synthesizer resource or signal detector resource and MAY be either local or distributed. If this header field is not specified in a SPEAK request or explicitly set by a SET-PARAMS, the default value for this header field is "true".

If the recognizer or signal detector resource is on the same server as the synthesizer and both are part of the same session, the server MAY work with both to provide internal notification to the synthesizer so that audio may be stopped without having to wait for the client's BARGE-IN-OCCURRED event.

It is generally RECOMMENDED when playing a prompt to the user with Kill-On-Barge-In and asking for input, that the client issue the RECOGNIZE request ahead of the SPEAK request for optimum performance
   and user experience.  This way, it is guaranteed that the recognizer
   is online before the prompt starts playing and the user's speech will
   not be truncated at the beginning (especially for power users).
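The defaulting rule for Kill-On-Barge-In (a value in the SPEAK request, else a value set for the session with SET-PARAMS, else the default "true") can be sketched as a simple lookup chain. The names DEFAULTS and effective_header are hypothetical, and header names are compared exactly here for brevity.

```python
# Default stated in this section for Kill-On-Barge-In.
DEFAULTS = {"Kill-On-Barge-In": "true"}


def effective_header(name: str, speak_headers: dict, session_params: dict) -> str:
    """Resolve a header value: SPEAK request, then session, then default."""
    for source in (speak_headers, session_params, DEFAULTS):
        if name in source:
            return source[name]
    raise KeyError(name)


# No value anywhere: the default applies.
print(effective_header("Kill-On-Barge-In", {}, {}))
# A session value set through SET-PARAMS overrides the default.
print(effective_header("Kill-On-Barge-In", {}, {"Kill-On-Barge-In": "false"}))
```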

8.4.3. Speaker-Profile

This header field MAY be part of the SET-PARAMS/GET-PARAMS or SPEAK request from the client to the server and specifies a URI that references the profile of the speaker. Speaker profiles are collections of voice parameters like gender, accent, etc.

   speaker-profile       =   "Speaker-Profile" ":" uri CRLF

8.4.4. Completion-Cause

This header field MUST be specified in a SPEAK-COMPLETE event coming from the synthesizer resource to the client. This indicates the reason the SPEAK request completed.

   completion-cause      =   "Completion-Cause" ":" 3DIGIT SP
                             1*VCHAR CRLF

   +------------+-----------------------+------------------------------+
   | Cause-Code | Cause-Name            | Description                  |
   +------------+-----------------------+------------------------------+
   | 000        | normal                | SPEAK completed normally.    |
   | 001        | barge-in              | SPEAK request was terminated |
   |            |                       | because of barge-in.         |
   | 002        | parse-failure         | SPEAK request terminated     |
   |            |                       | because of a failure to      |
   |            |                       | parse the speech markup      |
   |            |                       | text.                        |
   | 003        | uri-failure           | SPEAK request terminated     |
   |            |                       | because access to one of the |
   |            |                       | URIs failed.                 |
   | 004        | error                 | SPEAK request terminated     |
   |            |                       | prematurely due to           |
   |            |                       | synthesizer error.           |
   | 005        | language-unsupported  | Language not supported.      |
   | 006        | lexicon-load-failure  | Lexicon loading failed.      |
   | 007        | cancelled             | A prior SPEAK request failed |
   |            |                       | while this one was still in  |
   |            |                       | the queue.                   |
   +------------+-----------------------+------------------------------+

                Synthesizer Resource Completion Cause Codes
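A client that receives a SPEAK-COMPLETE event might split the Completion-Cause value into its numeric code and name as sketched below. The mapping COMPLETION_CAUSES mirrors the table above; the function name parse_completion_cause is a hypothetical name for this sketch.

```python
# Cause codes and names from the completion cause table.
COMPLETION_CAUSES = {
    0: "normal",
    1: "barge-in",
    2: "parse-failure",
    3: "uri-failure",
    4: "error",
    5: "language-unsupported",
    6: "lexicon-load-failure",
    7: "cancelled",
}


def parse_completion_cause(value: str) -> tuple[int, str]:
    """Split a Completion-Cause value ("3DIGIT SP 1*VCHAR") into (code, name)."""
    code_text, sep, name = value.partition(" ")
    if (
        not sep
        or len(code_text) != 3
        or not (code_text.isascii() and code_text.isdigit())
        or not name
    ):
        raise ValueError(f"malformed Completion-Cause value: {value!r}")
    return int(code_text), name


code, name = parse_completion_cause("000 normal")
print(code, name, COMPLETION_CAUSES[code] == name)
```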

8.4.5. Completion-Reason

This header field MAY be specified in a SPEAK-COMPLETE event coming from the synthesizer resource to the client. This contains the reason text behind the SPEAK request completion. This header field communicates text describing the reason for the failure, such as an error in parsing the speech markup text.

   completion-reason   =   "Completion-Reason" ":"
                           quoted-string CRLF

The completion reason text is provided for client use in logs and for debugging and instrumentation purposes. Clients MUST NOT interpret the completion reason text.

8.4.6. Voice-Parameter

This set of header fields defines the voice of the speaker.

   voice-parameter    =   voice-gender
                      /   voice-age
                      /   voice-variant
                      /   voice-name

   voice-gender       =   "Voice-Gender:" voice-gender-value CRLF

   voice-gender-value =   "male"
                      /   "female"
                      /   "neutral"

   voice-age          =   "Voice-Age:" 1*3DIGIT CRLF

   voice-variant      =   "Voice-Variant:" 1*19DIGIT CRLF

   voice-name         =   "Voice-Name:"
                          1*UTFCHAR *(1*WSP 1*UTFCHAR) CRLF

The "Voice-" parameters are derived from the similarly named attributes of the voice element specified in W3C's Speech Synthesis Markup Language Specification (SSML) [W3C.REC-speech-synthesis-20040907]. Legal values for these parameters are as defined in that specification.

These header fields MAY be sent in SET-PARAMS or GET-PARAMS requests to define or get default values for the entire session or MAY be sent in the SPEAK request to define default values for that SPEAK request. Note that SSML content can itself set these values internal to the SSML document, of course.
   Voice parameter header fields MAY also be sent in a CONTROL method to
   affect a SPEAK request in progress and change its behavior on the
   fly.  If the synthesizer resource does not support this operation, it
   MUST reject the request with a status-code of 403 "Unsupported Header
   Field".

8.4.7. Prosody-Parameters

This set of header fields defines the prosody of the speech.

   prosody-parameter   =   "Prosody-" prosody-param-name ":"
                           prosody-param-value CRLF

   prosody-param-name  =   1*VCHAR

   prosody-param-value =   1*VCHAR

prosody-param-name is any one of the attribute names under the prosody element specified in W3C's Speech Synthesis Markup Language Specification [W3C.REC-speech-synthesis-20040907]. The prosody-param-value is any one of the value choices of the corresponding prosody element attribute from that specification.

These header fields MAY be sent in SET-PARAMS or GET-PARAMS requests to define or get default values for the entire session or MAY be sent in the SPEAK request to define default values for that SPEAK request. Furthermore, these attributes can be part of the speech text marked up in SSML.

The prosody parameter header fields in the SET-PARAMS or SPEAK request only apply if the speech data is of type 'text/plain' and does not use a speech markup format.

These prosody parameter header fields MAY also be sent in a CONTROL method to affect a SPEAK request in progress and change its behavior on the fly. If the synthesizer resource does not support this operation, it MUST respond back to the client with a status-code of 403 "Unsupported Header Field".

8.4.8. Speech-Marker

This header field contains timestamp information in a "timestamp" field. This is a Network Time Protocol (NTP) [RFC5905] timestamp, a 64-bit number in decimal form. It MUST be synced with the Real-Time Protocol (RTP) [RFC3550] timestamp of the media stream through the Real-Time Control Protocol (RTCP) [RFC3550].
   Markers are bookmarks that are defined within the markup.  Most
   speech markup formats provide mechanisms to embed marker fields
   within speech texts.  The synthesizer generates SPEECH-MARKER events
   when it reaches these marker fields.  This header field MUST be part
   of the SPEECH-MARKER event and contain the marker tag value after the
   timestamp, separated by a semicolon.  In these events, the timestamp
   marks the time the text corresponding to the marker was emitted as
   speech by the synthesizer.

   This header field MUST also be returned in responses to STOP,
   CONTROL, and BARGE-IN-OCCURRED methods, in the SPEAK-COMPLETE event,
   and in an IN-PROGRESS SPEAK response.  In these messages, if any
   markers have been encountered for the current SPEAK, the marker tag
   value MUST be the last embedded marker encountered.  If no markers
   have yet been encountered for the current SPEAK, only the timestamp
   is REQUIRED.  Note that in these events, the purpose of this header
   field is to provide timestamp information associated with important
   events within the lifecycle of a request (start of SPEAK processing,
   end of SPEAK processing, receipt of CONTROL/STOP/BARGE-IN-OCCURRED).

   timestamp           =   "timestamp" "=" time-stamp-value

   time-stamp-value    =   1*20DIGIT

   speech-marker       =   "Speech-Marker" ":"
                           timestamp
                           [";" 1*(UTFCHAR / %x20)] CRLF

8.4.9. Speech-Language

This header field specifies the default language of the speech data if the language is not specified in the markup. The value of this header field MUST follow RFC 5646 [RFC5646] for its values. The header field MAY occur in SPEAK, SET-PARAMS, or GET-PARAMS requests.

   speech-language     =   "Speech-Language" ":" 1*VCHAR CRLF

8.4.10. Fetch-Hint

When the synthesizer needs to fetch documents or other resources like speech markup or audio files, this header field controls the corresponding URI access properties. This provides client policy on when the synthesizer should retrieve content from the server. A value of "prefetch" indicates the content MAY be downloaded when the request is received, whereas "safe" indicates that content MUST NOT
   be downloaded until actually referenced.  The default value is
   "prefetch".  This header field MAY occur in SPEAK, SET-PARAMS, or
   GET-PARAMS requests.

   fetch-hint          =   "Fetch-Hint" ":" ("prefetch" / "safe") CRLF

8.4.11. Audio-Fetch-Hint

When the synthesizer needs to fetch documents or other resources like speech audio files, this header field controls the corresponding URI access properties. This provides client policy whether or not the synthesizer is permitted to attempt to optimize speech by pre-fetching audio. The value is either "safe" to say that audio is only fetched when it is referenced, never before; "prefetch" to permit, but not require the implementation to pre-fetch the audio; or "stream" to allow it to stream the audio fetches. The default value is "prefetch". This header field MAY occur in SPEAK, SET-PARAMS, or GET-PARAMS requests.

   audio-fetch-hint    =   "Audio-Fetch-Hint" ":"
                           ("prefetch" / "safe" / "stream") CRLF

8.4.12. Failed-URI

When a synthesizer method needs a synthesizer to fetch or access a URI and the access fails, the server SHOULD provide the failed URI in this header field in the method response, unless there are multiple URI failures, in which case the server MUST provide one of the failed URIs in this header field in the method response.

   failed-uri          =   "Failed-URI" ":" absoluteURI CRLF

8.4.13. Failed-URI-Cause

When a synthesizer method needs a synthesizer to fetch or access a URI and the access fails, the server MUST provide the URI-specific or protocol-specific response code for the URI in the Failed-URI header field in the method response through this header field. The value encoding is UTF-8 (RFC 3629 [RFC3629]) to accommodate any access protocol -- some access protocols might have a response string instead of a numeric response code.

   failed-uri-cause    =   "Failed-URI-Cause" ":" 1*UTFCHAR CRLF

8.4.14. Speak-Restart

When a client issues a CONTROL request to a currently speaking synthesizer resource to jump backward, and the target jump point is before the start of the current SPEAK request, the current SPEAK request MUST restart from the beginning of its speech data and the server's response to the CONTROL request MUST contain this header field with a value of "true" indicating a restart.

   speak-restart       =   "Speak-Restart" ":" BOOLEAN CRLF

8.4.15. Speak-Length

This header field MAY be specified in a CONTROL method to control the maximum length of speech to speak, relative to the current speaking point in the currently active SPEAK request. If numeric, the value MUST be a positive integer. If a header field with a Tag unit is specified, then the speech output continues until the tag is reached or the SPEAK request is completed, whichever comes first. This header field MAY be specified in a SPEAK request to indicate the length to speak from the speech data and is relative to the point in speech that the SPEAK request starts. The different speech length units supported are synthesizer implementation dependent. If a server does not support the specified unit, the server MUST respond with a status-code of 409 "Unsupported Header Field Value".

   speak-length          =   "Speak-Length" ":" positive-length-value
                             CRLF

   positive-length-value =   positive-speech-length
                         /   text-speech-length

   text-speech-length    =   1*UTFCHAR SP "Tag"

   positive-speech-length =  1*19DIGIT SP numeric-speech-unit

   numeric-speech-unit   =   "Second"
                         /   "Word"
                         /   "Sentence"
                         /   "Paragraph"

8.4.16. Load-Lexicon

This header field is used to indicate whether a lexicon has to be loaded or unloaded. The value "true" means to load the lexicon if not already loaded, and the value "false" means to unload the lexicon if it is loaded. The default value for this header field is "true". This header field MAY be specified in a DEFINE-LEXICON method.

   load-lexicon       =   "Load-Lexicon" ":" BOOLEAN CRLF

8.4.17. Lexicon-Search-Order

This header field is used to specify a list of active pronunciation lexicon URIs and the search order among the active lexicons. Lexicons specified within the SSML document take precedence over the lexicons specified in this header field. This header field MAY be specified in the SPEAK, SET-PARAMS, and GET-PARAMS methods.

   lexicon-search-order =   "Lexicon-Search-Order" ":"
             "<" absoluteURI ">" *(" " "<" absoluteURI ">") CRLF
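The Lexicon-Search-Order value is a space-separated list of URIs in angle brackets, so building and parsing it is straightforward. The sketch below uses hypothetical function names and invented example URIs; it does not validate that each entry is a well-formed absolute URI.

```python
def format_lexicon_search_order(uris: list[str]) -> str:
    """Build the header value; list order is the search order."""
    return " ".join(f"<{uri}>" for uri in uris)


def parse_lexicon_search_order(value: str) -> list[str]:
    """Return the URIs of a Lexicon-Search-Order value in search order."""
    uris = []
    for token in value.split(" "):
        if len(token) < 3 or not (token.startswith("<") and token.endswith(">")):
            raise ValueError(f"malformed lexicon entry: {token!r}")
        uris.append(token[1:-1])
    return uris


# Example URIs are placeholders, not taken from the specification.
order = ["http://www.example.com/names.pls", "http://www.example.com/places.pls"]
header_value = format_lexicon_search_order(order)
print(header_value)
print(parse_lexicon_search_order(header_value) == order)
```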

8.5. Synthesizer Message Body

A synthesizer message can contain additional information associated with the Request, Response, or Event in its message body.

8.5.1. Synthesizer Speech Data

Marked-up text for the synthesizer to speak is specified as a typed media entity in the message body. The speech data to be spoken by the synthesizer can be specified inline by embedding the data in the message body or by reference by providing a URI for accessing the data. In either case, the data and the format used to markup the speech needs to be of a content type supported by the server.

All MRCPv2 servers containing synthesizer resources MUST support both plain text speech data and W3C's Speech Synthesis Markup Language [W3C.REC-speech-synthesis-20040907] and hence MUST support the media types 'text/plain' and 'application/ssml+xml'. Other formats MAY be supported.

If the speech data is to be fetched by URI reference, the media type 'text/uri-list' (see RFC 2483 [RFC2483]) is used to indicate one or more URIs that, when dereferenced, will contain the content to be spoken. If a list of speech URIs is specified, the resource MUST speak the speech data provided by each URI in the order in which the URIs are specified in the content.
   MRCPv2 clients and servers MUST support the 'multipart/mixed' media
   type.  This is the appropriate media type to use when providing a mix
   of URI and inline speech data.  Embedded within the multipart content
   block, there MAY be content for the 'text/uri-list', 'application/
   ssml+xml', and/or 'text/plain' media types.  The character set and
   encoding used in the speech data is specified according to standard
   media type definitions.  The multipart content MAY also contain
   actual audio data.  Clients may have recorded audio clips stored in
   memory or on a local device and wish to play it as part of the SPEAK
   request.  The audio portions MAY be sent by the client as part of the
   multipart content block.  This audio is referenced in the speech
   markup data that is another part in the multipart content block
   according to the 'multipart/mixed' media type specification.

   Content-Type:text/uri-list
   Content-Length:...

   http://www.example.com/ASR-Introduction.ssml
   http://www.example.com/ASR-Document-Part1.ssml
   http://www.example.com/ASR-Document-Part2.ssml
   http://www.example.com/ASR-Conclusion.ssml

                             URI List Example


   Content-Type:application/ssml+xml
   Content-Length:...

   <?xml version="1.0"?>
        <speak version="1.0"
               xmlns="http://www.w3.org/2001/10/synthesis"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
               xml:lang="en-US">
          <p>
            <s>You have 4 new messages.</s>
            <s>The first is from Aldine Turnbet
            and arrived at <break/>
            <say-as interpret-as="vxml:time">0345p</say-as>.</s>

            <s>The subject is <prosody
            rate="-20%">ski trip</prosody></s>
         </p>
        </speak>

                               SSML Example
   Content-Type:multipart/mixed; boundary="break"

   --break
   Content-Type:text/uri-list
   Content-Length:...

   http://www.example.com/ASR-Introduction.ssml
   http://www.example.com/ASR-Document-Part1.ssml
   http://www.example.com/ASR-Document-Part2.ssml
   http://www.example.com/ASR-Conclusion.ssml

   --break
   Content-Type:application/ssml+xml
   Content-Length:...

   <?xml version="1.0"?>
       <speak version="1.0"
              xmlns="http://www.w3.org/2001/10/synthesis"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
              xml:lang="en-US">
          <p>
            <s>You have 4 new messages.</s>
            <s>The first is from Stephanie Williams
            and arrived at <break/>
            <say-as interpret-as="vxml:time">0342p</say-as>.</s>

            <s>The subject is <prosody
            rate="-20%">ski trip</prosody></s>
          </p>
       </speak>
   --break--

                             Multipart Example

8.5.2. Lexicon Data

Synthesizer lexicon data from the client to the server can be provided inline or by reference. Either way, they are carried as typed media in the message body of the MRCPv2 request message (see Section 8.14).

When a lexicon is specified inline in the message, the client MUST provide a Content-ID for that lexicon as part of the content header fields. The server MUST store the lexicon associated with that Content-ID for the duration of the session. A stored lexicon can be overwritten by defining a new lexicon with the same Content-ID.
   Lexicons that have been associated with a Content-ID can be
   referenced through the 'session' URI scheme (see Section 13.6).

   If lexicon data is specified by external URI reference, the media
   type 'text/uri-list' (see RFC 2483 [RFC2483] ) is used to list the
   one or more URIs that may be dereferenced to obtain the lexicon data.
   All MRCPv2 servers MUST support the "http" and "https" URI access
   mechanisms, and MAY support other mechanisms.

   If the data in the message body consists of a mix of URI and inline
   lexicon data, the 'multipart/mixed' media type is used.  The
   character set and encoding used in the lexicon data may be specified
   according to standard media type definitions.

8.6. SPEAK Method

The SPEAK request provides the synthesizer resource with the speech text and initiates speech synthesis and streaming. The SPEAK method MAY carry voice and prosody header fields that alter the behavior of the voice being synthesized, as well as a typed media message body containing the actual marked-up text to be spoken.

The SPEAK method implementation MUST do a fetch of all external URIs that are part of that operation. If caching is implemented, this URI fetching MUST conform to the cache-control hints and parameter header fields associated with the method in deciding whether it is to be fetched from cache or from the external server. If these hints/parameters are not specified in the method, the values set for the session using SET-PARAMS/GET-PARAMS apply. If it was not set for the session, their default values apply.

When applying voice parameters, there are three levels of precedence. The highest precedence are those specified within the speech markup text, followed by those specified in the header fields of the SPEAK request and hence that apply for that SPEAK request only, followed by the session default values that can be set using the SET-PARAMS request and apply for subsequent methods invoked during the session.

If the resource was idle at the time the SPEAK request arrived at the server and the SPEAK method is being actively processed, the resource responds immediately with a success status code and a request-state of IN-PROGRESS.

If the resource is in the speaking or paused state when the SPEAK method arrives at the server, i.e., it is in the middle of processing a previous SPEAK request, the status returns success with a request-state of PENDING. The server places the SPEAK request in the synthesizer resource request queue. The request queue operates
   strictly FIFO: requests are processed serially in order of receipt.
   If the current SPEAK fails, all SPEAK methods in the pending queue
   are cancelled and each generates a SPEAK-COMPLETE event with a
   Completion-Cause of "cancelled".

   For the synthesizer resource, SPEAK is the only method that can
   return a request-state of IN-PROGRESS or PENDING.  When the text has
   been synthesized and played into the media stream, the resource
   issues a SPEAK-COMPLETE event with the request-id of the SPEAK
   request and a request-state of COMPLETE.

   C->S: MRCP/2.0 ... SPEAK 543257
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-Age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
            <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.
                </s>
             <s>The subject is
                    <prosody rate="-20%">ski trip</prosody>
             </s>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   S->C: MRCP/2.0 ... SPEAK-COMPLETE 543257 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Completion-Cause:000 normal
         Speech-Marker:timestamp=857206027059

                               SPEAK Example
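The queueing rules for SPEAK (IN-PROGRESS when the resource is idle, PENDING otherwise, strict FIFO, and cancellation of the pending queue when the current SPEAK fails) can be modeled with a small in-memory queue. The class SpeakQueue and its method names are hypothetical; the sketch models request bookkeeping only, not audio or message framing.

```python
from collections import deque


class SpeakQueue:
    """Toy model of the synthesizer resource request queue."""

    def __init__(self):
        self.active = None      # request-id at the head of the queue
        self.pending = deque()  # request-ids waiting in FIFO order

    def speak(self, request_id):
        """Accept a SPEAK and return the request-state of its response."""
        if self.active is None:
            self.active = request_id
            return "IN-PROGRESS"
        self.pending.append(request_id)
        return "PENDING"

    def complete(self):
        """Active SPEAK finished normally; the next pending one becomes active."""
        if self.active is None:
            raise RuntimeError("no active SPEAK request")
        events = [(self.active, "000 normal")]
        self.active = self.pending.popleft() if self.pending else None
        return events

    def fail(self, cause="004 error"):
        """Active SPEAK failed; every pending SPEAK completes as cancelled."""
        if self.active is None:
            raise RuntimeError("no active SPEAK request")
        events = [(self.active, cause)]
        events += [(rid, "007 cancelled") for rid in self.pending]
        self.active = None
        self.pending.clear()
        return events


q = SpeakQueue()
print(q.speak(543257), q.speak(543258), q.speak(543259))
print(q.complete())  # SPEAK-COMPLETE for 543257
print(q.fail())      # 543258 fails, 543259 is cancelled
```

Each tuple returned by complete() or fail() stands for one SPEAK-COMPLETE event and carries the request-id with its Completion-Cause value.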

8.7. STOP

The STOP method from the client to the server tells the synthesizer resource to stop speaking if it is speaking something.

The STOP request can be sent with an Active-Request-Id-List header field to stop the zero or more specific SPEAK requests that may be in queue and return a response status-code of 200 "Success". If no Active-Request-Id-List header field is sent in the STOP request, the server terminates all outstanding SPEAK requests.

If a STOP request successfully terminated one or more PENDING or IN-PROGRESS SPEAK requests, then the response MUST contain an Active-Request-Id-List header field enumerating the SPEAK request-ids that were terminated. Otherwise, there is no Active-Request-Id-List header field in the response. No SPEAK-COMPLETE events are sent for such terminated requests.

If a SPEAK request that was IN-PROGRESS and speaking was stopped, the next pending SPEAK request, if any, becomes IN-PROGRESS at the resource and enters the speaking state.

If a SPEAK request that was IN-PROGRESS and paused was stopped, the next pending SPEAK request, if any, becomes IN-PROGRESS and enters the paused state.

   C->S: MRCP/2.0 ... SPEAK 543258
         Channel-Identifier:32AECB23433802@speechsynth
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
           <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>
             <s>The subject is
                <prosody rate="-20%">ski trip</prosody></s>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   C->S: MRCP/2.0 ... STOP 543259
         Channel-Identifier:32AECB23433802@speechsynth

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258
         Speech-Marker:timestamp=857206039059

                               STOP Example
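The STOP rules can be sketched as a function that decides which requests are terminated and what the response carries. The function names are hypothetical, and the comma-separated form of Active-Request-Id-List is an assumption of this sketch, since the example above shows only a single request-id.

```python
def handle_stop(active, pending, target_ids=None):
    """Apply a STOP to a queue.

    active: request-id at the head of the queue, or None.
    pending: request-ids queued behind it, in FIFO order.
    target_ids: ids from the STOP's Active-Request-Id-List, or None
                when the header is absent (stop everything).
    Returns (terminated, new_active, new_pending).
    """
    queue = ([active] if active is not None else []) + list(pending)
    if target_ids is None:
        terminated = list(queue)
    else:
        wanted = set(target_ids)
        terminated = [rid for rid in queue if rid in wanted]
    gone = set(terminated)
    remaining = [rid for rid in queue if rid not in gone]
    new_active = remaining[0] if remaining else None
    return terminated, new_active, remaining[1:]


def stop_response_headers(terminated):
    """Active-Request-Id-List appears only if something was terminated."""
    if not terminated:
        return {}
    return {"Active-Request-Id-List": ",".join(str(rid) for rid in terminated)}


terminated, new_active, new_pending = handle_stop(543258, [])
print(terminated, new_active, new_pending)
print(stop_response_headers(terminated))
```

No SPEAK-COMPLETE events are produced for the terminated requests, which is why the sketch returns only the list for the response header.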

8.8. BARGE-IN-OCCURRED

The BARGE-IN-OCCURRED method, when used with the synthesizer resource, provides a client that has detected a barge-in-able event a means to communicate the occurrence of the event to the synthesizer resource.

This method is useful in two scenarios:

   1.  The client has detected DTMF digits in the input media or some other barge-in-able event and wants to communicate that to the synthesizer resource.

   2.  The recognizer resource and the synthesizer resource are in different servers. In this case, the client acts as an intermediary for the two servers. It receives an event from the recognition resource and sends a BARGE-IN-OCCURRED request to the synthesizer. In such cases, the BARGE-IN-OCCURRED method would also have a Proxy-Sync-Id header field received from the resource generating the original event.

If a SPEAK request is active with kill-on-barge-in enabled (see Section 8.4.2), and the BARGE-IN-OCCURRED event is received, the synthesizer MUST immediately stop streaming out audio. It MUST also terminate any speech requests queued behind the current active one, irrespective of whether or not they have barge-in enabled. If a barge-in-able SPEAK request was playing and it was terminated, the response MUST contain an Active-Request-Id-List header field listing the request-ids of all SPEAK requests that were terminated. The server generates no SPEAK-COMPLETE events for these requests.
Top   ToC   RFC6787 - Page 64
   If there were no SPEAK requests terminated by the synthesizer
   resource as a result of the BARGE-IN-OCCURRED method, the server MUST
   respond to the BARGE-IN-OCCURRED with a status-code of 200 "Success",
   and the response MUST NOT contain an Active-Request-Id-List header
   field.

   If the synthesizer and recognizer resources are part of the same
   MRCPv2 session, they can be optimized for a quicker kill-on-barge-in
   response if the recognizer and synthesizer interact directly.  In
   these cases, the client MUST still react to a START-OF-INPUT event
   from the recognizer by invoking the BARGE-IN-OCCURRED method to the
   synthesizer.  The client MUST invoke the BARGE-IN-OCCURRED if it has
   any outstanding requests to the synthesizer resource in either the
   PENDING or IN-PROGRESS state.

   C->S: MRCP/2.0 ... SPEAK 543258
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-Age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
           <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>
             <s>The subject is
                <prosody rate="-20%">ski trip</prosody></s>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   C->S: MRCP/2.0 ... BARGE-IN-OCCURRED 543259
         Channel-Identifier:32AECB23433802@speechsynth
         Proxy-Sync-Id:987654321

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258
         Speech-Marker:timestamp=857206039059

                         BARGE-IN-OCCURRED Example
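
   As a rough illustration of the server-side rule, the following
   sketch decides which SPEAK requests a BARGE-IN-OCCURRED request
   terminates.  The function name and the tuple layout are hypothetical
   and chosen only for this example.

```python
def barge_in_occurred(queue):
    """Model the synthesizer's reaction to BARGE-IN-OCCURRED.

    queue is a list of (request_id, kill_on_barge_in) tuples; the first
    entry is the active SPEAK request and the rest are queued behind it.
    Returns (status_code, terminated_ids, remaining_queue).
    """
    if queue and queue[0][1]:
        # The active SPEAK is barge-in-able: stop it and everything queued
        # behind it, whether or not those requests have barge-in enabled.
        return 200, [request_id for request_id, _ in queue], []
    # Nothing was terminated: 200 "Success" and no Active-Request-Id-List.
    return 200, [], list(queue)
```

   When terminated_ids is non-empty, its contents are what the response
   would list in the Active-Request-Id-List header field.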

8.9. PAUSE

   The PAUSE method from the client to the server tells the synthesizer
   resource to pause speech output if it is speaking something.  If a
   PAUSE method is issued on a session when a SPEAK is not active, the
   server MUST respond with a status-code of 402 "Method not valid in
   this state".  If a PAUSE method is issued on a session when a SPEAK
   is active and paused, the server MUST respond with a status-code of
   200 "Success".  If a SPEAK request was active, the server MUST
   return an Active-Request-Id-List header field whose value contains
   the request-id of the SPEAK request that was paused.

   C->S: MRCP/2.0 ... SPEAK 543258
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-Age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
           <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>
             <s>The subject is
                <prosody rate="-20%">ski trip</prosody></s>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   C->S: MRCP/2.0 ... PAUSE 543259
         Channel-Identifier:32AECB23433802@speechsynth

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258

                               PAUSE Example

8.10. RESUME

   The RESUME method from the client to the server tells a paused
   synthesizer resource to resume speaking.  If a RESUME request is
   issued on a session with no active SPEAK request, the server MUST
   respond with a status-code of 402 "Method not valid in this state".
   If a RESUME request is issued on a session with an active SPEAK
   request that is speaking (i.e., not paused), the server MUST respond
   with a status-code of 200 "Success".  If a SPEAK request was paused,
   the server MUST return an Active-Request-Id-List header field whose
   value contains the request-id of the SPEAK request that was resumed.

   C->S: MRCP/2.0 ... SPEAK 543258
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
           <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>
             <s>The subject is
                <prosody rate="-20%">ski trip</prosody></s>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   C->S: MRCP/2.0 ... PAUSE 543259
         Channel-Identifier:32AECB23433802@speechsynth

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258

   C->S: MRCP/2.0 ... RESUME 543260
         Channel-Identifier:32AECB23433802@speechsynth

   S->C: MRCP/2.0 ... 543260 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258

                              RESUME Example
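
   The status-code rules for PAUSE (Section 8.9) and RESUME can be
   summarized in a few lines.  The sketch below is illustrative only;
   the function name and return shape are hypothetical.  Omitting the
   Active-Request-Id-List when RESUME finds the SPEAK already speaking
   is a choice of this sketch, since the text only mandates the header
   field when a paused SPEAK was resumed.

```python
def pause_or_resume(method, active_id, paused):
    """Return (status_code, active_request_id_list, paused_after).

    active_id is the request-id of the active SPEAK, or None when no
    SPEAK is active; paused says whether that SPEAK is currently paused.
    """
    if method not in ("PAUSE", "RESUME"):
        raise ValueError(f"unsupported method: {method}")
    if active_id is None:
        # 402 "Method not valid in this state"
        return 402, [], False
    if method == "PAUSE":
        # Succeeds whether or not the SPEAK was already paused.
        return 200, [active_id], True
    # RESUME succeeds whether or not the SPEAK was paused; the header
    # field is only required when a paused SPEAK was actually resumed.
    return 200, ([active_id] if paused else []), False
```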

8.11. CONTROL

   The CONTROL method from the client to the server tells a synthesizer
   that is speaking to modify what it is speaking on the fly.  This
   method is used to request the synthesizer to jump forward or
   backward in what it is speaking, change speaker rate, speaker
   parameters, etc.  It affects only the currently IN-PROGRESS SPEAK
   request.  Depending on the implementation and capability of the
   synthesizer resource, it may or may not support the various
   modifications indicated by header fields in the CONTROL request.

   When a client invokes a CONTROL method to jump forward and the
   operation goes beyond the end of the active SPEAK method's text, the
   CONTROL request still succeeds.  The active SPEAK request completes
   and returns a SPEAK-COMPLETE event following the response to the
   CONTROL method.  If there are more SPEAK requests in the queue, the
   synthesizer resource starts at the beginning of the next SPEAK
   request in the queue.

   When a client invokes a CONTROL method to jump backward and the
   operation jumps to the beginning or beyond the beginning of the
   speech data of the active SPEAK method, the CONTROL request still
   succeeds.  The response to the CONTROL request contains the
   Speak-Restart header field, and the active SPEAK request restarts
   from the beginning of its speech data.

   These two behaviors can be used to rewind or fast-forward across
   multiple speech requests, if the client wants to break up a speech
   markup text into multiple SPEAK requests.

   If a SPEAK request was active when the CONTROL method was received,
   the server MUST return an Active-Request-Id-List header field
   containing the request-id of the SPEAK request that was active.

   C->S: MRCP/2.0 ... SPEAK 543258
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
           <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams
                and arrived at <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>

             <s>The subject is <prosody
                rate="-20%">ski trip</prosody></s>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543258 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857205016059

   C->S: MRCP/2.0 ... CONTROL 543259
         Channel-Identifier:32AECB23433802@speechsynth
         Prosody-rate:fast

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258
         Speech-Marker:timestamp=857206027059

   C->S: MRCP/2.0 ... CONTROL 543260
         Channel-Identifier:32AECB23433802@speechsynth
         Jump-Size:-15 Words

   S->C: MRCP/2.0 ... 543260 200 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Active-Request-Id-List:543258
         Speech-Marker:timestamp=857206039059

                              CONTROL Example
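
   The two boundary cases for CONTROL jumps can be modeled as clamping
   a position within the active SPEAK request.  This is a simplified
   sketch that counts in words only; the function name is hypothetical,
   and treating a jump that lands exactly on the end as completing the
   SPEAK is an assumption of the sketch.

```python
def apply_jump(position, length, jump_words):
    """Model a CONTROL jump inside a SPEAK body of `length` words.

    Returns (new_position, speak_restart, speak_completes).  The CONTROL
    request itself succeeds in every case.
    """
    target = position + jump_words
    if jump_words < 0 and target <= 0:
        # Jumped to or before the beginning: restart from the start and
        # report it with the Speak-Restart header field.
        return 0, True, False
    if jump_words > 0 and target >= length:
        # Jumped past the end: the SPEAK completes and a SPEAK-COMPLETE
        # event follows the CONTROL response.
        return length, False, True
    return target, False, False
```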

8.12. SPEAK-COMPLETE

   This is an Event message from the synthesizer resource to the client
   that indicates the corresponding SPEAK request was completed.  The
   request-id field matches the request-id of the SPEAK request that
   initiated the speech that just completed.  The request-state field
   is set to COMPLETE by the server, indicating that this is the last
   event with the corresponding request-id.  The Completion-Cause
   header field specifies the cause code pertaining to the status and
   reason of request completion, such as whether the SPEAK completed
   normally or because of an error, kill-on-barge-in, etc.

   C->S: MRCP/2.0 ... SPEAK 543260
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
           <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>
             <s>The subject is
                <prosody rate="-20%">ski trip</prosody></s>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543260 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059

   S->C: MRCP/2.0 ... SPEAK-COMPLETE 543260 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Completion-Cause:000 normal
         Speech-Marker:timestamp=857206039059

                          SPEAK-COMPLETE Example
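
   A client has to tell a response such as "543260 200 IN-PROGRESS"
   apart from an event such as "SPEAK-COMPLETE 543260 COMPLETE".  The
   sketch below classifies start lines by their shape as seen in the
   examples of this section; the function name and the returned
   dictionary keys are hypothetical, and no validation of the version
   or message-length fields is attempted.

```python
def classify_start_line(line):
    """Classify an MRCPv2 start line as a request, response, or event."""
    parts = line.split()
    if len(parts) == 4:
        # version, message-length, method-name, request-id
        return {"kind": "request", "name": parts[2], "request_id": parts[3]}
    if len(parts) == 5 and parts[2].isdigit():
        # version, message-length, request-id, status-code, request-state
        return {"kind": "response", "request_id": parts[2],
                "status_code": int(parts[3]), "request_state": parts[4]}
    if len(parts) == 5:
        # version, message-length, event-name, request-id, request-state
        return {"kind": "event", "name": parts[2],
                "request_id": parts[3], "request_state": parts[4]}
    raise ValueError(f"not an MRCPv2 start line: {line!r}")
```

   A SPEAK-COMPLETE event whose request_state is COMPLETE tells the
   client that no further events will arrive for that request-id.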

8.13. SPEECH-MARKER

   This is an event generated by the synthesizer resource to the client
   when the synthesizer encounters a marker tag in the speech markup it
   is currently processing.  The value of the request-id field MUST
   match that of the corresponding SPEAK request.  The request-state
   field MUST have the value "IN-PROGRESS" as the speech is still not
   complete.  The value of the speech marker tag hit, describing where
   the synthesizer is in the speech markup, MUST be returned in the
   Speech-Marker header field, along with an NTP timestamp indicating
   the instant in the output speech stream that the marker was
   encountered.  The SPEECH-MARKER event MUST also be generated with a
   null marker value and output NTP timestamp when a SPEAK request in
   the PENDING state (i.e., in the queue) changes state to IN-PROGRESS
   and starts speaking.  The NTP timestamp MUST be synchronized with
   the RTP timestamp used to generate the speech stream through
   standard RTCP machinery.

   C->S: MRCP/2.0 ... SPEAK 543261
         Channel-Identifier:32AECB23433802@speechsynth
         Voice-gender:neutral
         Voice-age:25
         Prosody-volume:medium
         Content-Type:application/ssml+xml
         Content-Length:...

         <?xml version="1.0"?>
           <speak version="1.0"
                xmlns="http://www.w3.org/2001/10/synthesis"
                xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                   http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
                xml:lang="en-US">
            <p>
             <s>You have 4 new messages.</s>
             <s>The first is from Stephanie Williams and arrived at
                <break/>
                <say-as interpret-as="vxml:time">0342p</say-as>.</s>
                <mark name="here"/>
             <s>The subject is
                <prosody rate="-20%">ski trip</prosody>
             </s>
             <mark name="ANSWER"/>
            </p>
           </speak>

   S->C: MRCP/2.0 ... 543261 200 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857205015059

   S->C: MRCP/2.0 ... SPEECH-MARKER 543261 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206027059;here

   S->C: MRCP/2.0 ... SPEECH-MARKER 543261 IN-PROGRESS
         Channel-Identifier:32AECB23433802@speechsynth
         Speech-Marker:timestamp=857206039059;ANSWER

   S->C: MRCP/2.0 ... SPEAK-COMPLETE 543261 COMPLETE
         Channel-Identifier:32AECB23433802@speechsynth
         Completion-Cause:000 normal
         Speech-Marker:timestamp=857207689259;ANSWER

                           SPEECH-MARKER Example
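
   A client consuming these events needs to split the Speech-Marker
   value into its timestamp and marker name.  The following is a
   minimal sketch based only on the value shapes shown in the example
   above ("timestamp=<digits>" optionally followed by ";<name>"); the
   function name is hypothetical.

```python
def parse_speech_marker(value):
    """Split a Speech-Marker value into (ntp_timestamp, mark_name).

    mark_name is None when the value carries only a timestamp, as in
    the IN-PROGRESS response to a SPEAK request.
    """
    timestamp_part, _, mark = value.partition(";")
    key, _, digits = timestamp_part.partition("=")
    if key.strip() != "timestamp" or not digits.strip().isdigit():
        raise ValueError(f"unexpected Speech-Marker value: {value!r}")
    return int(digits), (mark.strip() or None)
```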

8.14. DEFINE-LEXICON

   The DEFINE-LEXICON method, from the client to the server, provides a
   lexicon and tells the server to load or unload the lexicon (see
   Section 8.4.16).  The media type of the lexicon is provided in the
   Content-Type header (see Section 8.5.2).  One such media type is
   "application/pls+xml" for the Pronunciation Lexicon Specification
   (PLS) [W3C.REC-pronunciation-lexicon-20081014] [RFC4267].

   If the server resource is in the speaking or paused state, the
   server MUST respond with a failure status-code of 402 "Method not
   valid in this state".

   If the resource is in the idle state and is able to successfully
   load/unload the lexicon, the server MUST return a 200 "Success"
   status-code and the request-state MUST be COMPLETE.

   If the synthesizer could not define the lexicon for some reason, for
   example, because the download failed or the lexicon was in an
   unsupported form, the server MUST respond with a failure status-code
   of 407 "Method or Operation Failed" and a Completion-Cause header
   field describing the failure reason.
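
   The three DEFINE-LEXICON outcomes described above can be sketched as
   a single decision function.  The function name, the state strings,
   and the boolean flag are hypothetical and serve only to illustrate
   the rule.

```python
def define_lexicon_status(resource_state, load_succeeded):
    """Return (status_code, needs_completion_cause) for DEFINE-LEXICON.

    resource_state is one of "idle", "speaking", or "paused".
    """
    if resource_state in ("speaking", "paused"):
        # 402 "Method not valid in this state"
        return 402, False
    if resource_state != "idle":
        raise ValueError(f"unknown resource state: {resource_state!r}")
    if load_succeeded:
        # 200 "Success"; the request-state of the response is COMPLETE.
        return 200, False
    # The lexicon could not be defined (download failed, unsupported
    # form, ...): 407 plus a Completion-Cause header field.
    return 407, True
```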


