RFC 6787

Media Resource Control Protocol Version 2 (MRCPv2)

Pages: 224
Proposed Standard

9. Speech Recognizer Resource

   The speech recognizer resource receives an incoming voice stream and
   provides the client with an interpretation of what was spoken in
   textual form.

   The recognizer resource is controlled by MRCPv2 requests from the
   client.  The recognizer resource can both respond to these requests
   and generate asynchronous events to the client to indicate conditions
   of interest during the processing of the method.

   This section applies to the following resource types.

   1.  speechrecog

   2.  dtmfrecog

   The difference between the above two resources is in their level of
   support for recognition grammars.  The "dtmfrecog" resource type is
   capable of recognizing only DTMF digits and hence accepts only DTMF
   grammars.  It only generates barge-in for DTMF inputs and ignores
   speech.  The "speechrecog" resource type can recognize regular speech
   as well as DTMF digits and hence MUST support grammars describing
   either speech or DTMF.  This resource generates barge-in events for
   speech and/or DTMF.  By analyzing the grammars that are activated by
   the RECOGNIZE method, it determines if a barge-in should occur for
   speech and/or DTMF.  When the recognizer decides it needs to generate
   a barge-in, it also generates a START-OF-INPUT event to the client.

   The recognizer resource MAY support recognition in the normal or
   hotword modes or both (although note that a single "speechrecog"
   resource does not perform normal and hotword mode recognition
   simultaneously).  For implementations where a single recognizer
   resource does not support both modes, or simultaneous normal and
   hotword recognition is desired, the two modes can be invoked through
   separate resources allocated to the same SIP dialog (with different
   MRCP session identifiers) and share the RTP audio feed.

   The capabilities of the recognizer resource are enumerated below:

   Normal Mode Recognition  Normal mode recognition tries to match all
      of the speech or DTMF against the grammar and returns a no-match
      status if the input fails to match or the method times out.

   Hotword Mode Recognition  Hotword mode is where the recognizer looks
      for a match against specific speech grammar or DTMF sequence and
      ignores speech or DTMF that does not match.  The recognition
      completes only if there is a successful match of grammar, if the
      client cancels the request, or if there is a no-input or
      recognition timeout.

   Voice Enrolled Grammars  A recognizer resource MAY optionally support
      Voice Enrolled Grammars.  With this functionality, enrollment is
      performed using a person's voice.  For example, a list of contacts
      can be created and maintained by recording the contacts' names
      in the caller's voice.  This technique is sometimes also called
      speaker-dependent recognition.

   Interpretation  A recognizer resource MAY be employed strictly for
      its natural language interpretation capabilities by supplying it
      with a text string as input instead of speech.  In this mode, the
      resource takes text as input and produces an "interpretation" of
      the input according to the supplied grammar.

   Voice enrollment has the concept of an enrollment session.  A session
   to add a new phrase to a personal grammar involves the initial
   enrollment followed by a repeat of enough utterances before
   committing the new phrase to the personal grammar.  Each time an
   utterance is recorded, it is compared for similarity with the other
   samples and a clash test is performed against other entries in the
   personal grammar to ensure there are no similar and confusable
   entries.

   Enrollment is done using a recognizer resource.  Controlling which
   utterances are to be considered for enrollment of a new phrase is
   done by setting a header field (see Section 9.4.39) in the RECOGNIZE
   request.

   Interpretation is accomplished through the INTERPRET method
   (Section 9.20) and the Interpret-Text header field (Section 9.4.30).

9.1. Recognizer State Machine

   The recognizer resource maintains a state machine to process MRCPv2
   requests from the client.

   Idle                   Recognizing               Recognized
   State                  State                     State
    |                       |                          |
    |---------RECOGNIZE---->|---RECOGNITION-COMPLETE-->|
    |<------STOP------------|<-----RECOGNIZE-----------|
    |                       |                          |
    |              |--------|              |-----------|
    |       START-OF-INPUT  |       GET-RESULT         |
    |              |------->|              |---------->|
    |------------|          |                          |
    |      DEFINE-GRAMMAR   |----------|               |
    |<-----------|          | START-INPUT-TIMERS       |
    |                       |<---------|               |
    |------|                |                          |
    |  INTERPRET            |                          |
    |<-----|                |------|                   |
    |                       |   RECOGNIZE              |
    |-------|               |<-----|                   |
    |      STOP             |
    |<------|               |
    |<-------------------STOP--------------------------|
    |<-------------------DEFINE-GRAMMAR----------------|

                       Recognizer State Machine

   If a recognizer resource supports voice enrolled grammars, starting
   an enrollment session does not change the state of the recognizer
   resource.  Once an enrollment session is started, then utterances are
   enrolled by calling the RECOGNIZE method repeatedly.  The state of
   the speech recognizer resource goes from IDLE to RECOGNIZING state
   each time RECOGNIZE is called.
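
   As an illustration only, the arrows in the state machine diagram
   above can be transcribed into a lookup table.  The sketch below is
   not part of the protocol: the names State, TRANSITIONS, and
   next_state are hypothetical, and raising an error for a pair that has
   no arrow is an assumption of this sketch rather than behavior the
   text specifies.

```python
from enum import Enum


class State(Enum):
    IDLE = "idle"
    RECOGNIZING = "recognizing"
    RECOGNIZED = "recognized"


# (current state, method or event) -> next state, one entry per arrow
# in the diagram. Self-loops map a state to itself.
TRANSITIONS = {
    (State.IDLE, "RECOGNIZE"): State.RECOGNIZING,
    (State.IDLE, "DEFINE-GRAMMAR"): State.IDLE,
    (State.IDLE, "INTERPRET"): State.IDLE,
    (State.IDLE, "STOP"): State.IDLE,
    (State.RECOGNIZING, "START-OF-INPUT"): State.RECOGNIZING,
    (State.RECOGNIZING, "START-INPUT-TIMERS"): State.RECOGNIZING,
    (State.RECOGNIZING, "RECOGNIZE"): State.RECOGNIZING,
    (State.RECOGNIZING, "RECOGNITION-COMPLETE"): State.RECOGNIZED,
    (State.RECOGNIZING, "STOP"): State.IDLE,
    (State.RECOGNIZED, "GET-RESULT"): State.RECOGNIZED,
    (State.RECOGNIZED, "RECOGNIZE"): State.RECOGNIZING,
    (State.RECOGNIZED, "STOP"): State.IDLE,
    (State.RECOGNIZED, "DEFINE-GRAMMAR"): State.IDLE,
}


def next_state(state, message):
    """Return the state the diagram shows after `message` in `state`.

    Raises ValueError when the diagram has no arrow for the pair
    (an assumption of this sketch).
    """
    try:
        return TRANSITIONS[(state, message)]
    except KeyError:
        raise ValueError(
            f"{message} is not shown for state {state.name}"
        ) from None
```

   For example, next_state(State.IDLE, "RECOGNIZE") yields
   State.RECOGNIZING, which matches the enrollment note above: each
   RECOGNIZE call moves the resource from IDLE to RECOGNIZING.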

9.2. Recognizer Methods

   The recognizer supports the following methods.

   recognizer-method    =  recog-only-method
                        /  enrollment-method

   recog-only-method    =  "DEFINE-GRAMMAR"
                        /  "RECOGNIZE"
                        /  "INTERPRET"
                        /  "GET-RESULT"
                        /  "START-INPUT-TIMERS"
                        /  "STOP"

   It is OPTIONAL for a recognizer resource to support voice enrolled
   grammars.  If the recognizer resource does support voice enrolled
   grammars, it MUST support the following methods.

   enrollment-method    =  "START-PHRASE-ENROLLMENT"
                        /  "ENROLLMENT-ROLLBACK"
                        /  "END-PHRASE-ENROLLMENT"
                        /  "MODIFY-PHRASE"
                        /  "DELETE-PHRASE"

9.3. Recognizer Events

   The recognizer can generate the following events.

   recognizer-event     =  "START-OF-INPUT"
                        /  "RECOGNITION-COMPLETE"
                        /  "INTERPRETATION-COMPLETE"

9.4. Recognizer Header Fields

   A recognizer message can contain header fields containing request
   options and information to augment the Method, Response, or Event
   message it is associated with.

   recognizer-header    =  recog-only-header
                        /  enrollment-header

   recog-only-header    =  confidence-threshold
                        /  sensitivity-level
                        /  speed-vs-accuracy
                        /  n-best-list-length
                        /  no-input-timeout
                        /  input-type
                        /  recognition-timeout
                        /  waveform-uri
                        /  input-waveform-uri
                        /  completion-cause
                        /  completion-reason
                        /  recognizer-context-block
                        /  start-input-timers
                        /  speech-complete-timeout
                        /  speech-incomplete-timeout
                        /  dtmf-interdigit-timeout
                        /  dtmf-term-timeout
                        /  dtmf-term-char
                        /  failed-uri
                        /  failed-uri-cause
                        /  save-waveform
                        /  media-type
                        /  new-audio-channel
                        /  speech-language
                        /  ver-buffer-utterance
                        /  recognition-mode
                        /  cancel-if-queue
                        /  hotword-max-duration
                        /  hotword-min-duration
                        /  interpret-text
                        /  dtmf-buffer-time
                        /  clear-dtmf-buffer
                        /  early-no-match

   If a recognizer resource supports voice enrolled grammars, the
   following header fields are also used.

   enrollment-header    =  num-min-consistent-pronunciations
                        /  consistency-threshold
                        /  clash-threshold
                        /  personal-grammar-uri
                        /  enroll-utterance
                        /  phrase-id
                        /  phrase-nl
                        /  weight
                        /  save-best-waveform
                        /  new-phrase-id
                        /  confusable-phrases-uri
                        /  abort-phrase-enrollment

   For enrollment-specific header fields that can appear as part of
   SET-PARAMS or GET-PARAMS methods, the following general rule applies:
   the START-PHRASE-ENROLLMENT method MUST be invoked before these
   header fields may be set through the SET-PARAMS method or retrieved
   through the GET-PARAMS method.

   Note that the Waveform-URI header field of the Recognizer resource
   can also appear in the response to the END-PHRASE-ENROLLMENT method.

9.4.1. Confidence-Threshold

   When a recognizer resource recognizes or matches a spoken phrase with
   some portion of the grammar, it associates a confidence level with
   that match.  The Confidence-Threshold header field tells the
   recognizer resource what confidence level the client considers a
   successful match.  This is a float value between 0.0 and 1.0
   indicating the recognizer's confidence in the recognition.  If the
   recognizer determines that there is no candidate match with a
   confidence that is greater than the confidence threshold, then it
   MUST return no-match as the recognition result.  This header field
   MAY occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS.  The default value
   for this header field is implementation specific, as is the
   interpretation of any specific value for this header field.  Although
   values for servers from different vendors are not comparable, it is
   expected that clients will tune this value over time for a given
   server.

   confidence-threshold     =  "Confidence-Threshold" ":" FLOAT CRLF

9.4.2. Sensitivity-Level

   To filter out background noise and not mistake it for speech, the
   recognizer resource supports a variable level of sound sensitivity.
   The Sensitivity-Level header field is a float value between 0.0 and
   1.0 and allows the client to set the sensitivity level for the
   recognizer.  This header field MAY occur in RECOGNIZE, SET-PARAMS, or
   GET-PARAMS.  A higher value for this header field means higher
   sensitivity.  The default value for this header field is
   implementation specific, as is the interpretation of any specific
   value for this header field.  Although values for servers from
   different vendors are not comparable, it is expected that clients
   will tune this value over time for a given server.

   sensitivity-level        =  "Sensitivity-Level" ":" FLOAT CRLF

9.4.3. Speed-Vs-Accuracy

   Depending on the implementation and capability of the recognizer
   resource it may be tunable towards Performance or Accuracy.  Higher
   accuracy may mean more processing and higher CPU utilization, meaning
   fewer active sessions per server and vice versa.  The value is a
   float between 0.0 and 1.0.  A value of 0.0 means fastest recognition.
   A value of 1.0 means best accuracy.  This header field MAY occur in
   RECOGNIZE, SET-PARAMS, or GET-PARAMS.  The default value for this
   header field is implementation specific.  Although values for servers
   from different vendors are not comparable, it is expected that
   clients will tune this value over time for a given server.

   speed-vs-accuracy        =  "Speed-Vs-Accuracy" ":" FLOAT CRLF

9.4.4. N-Best-List-Length

   When the recognizer matches an incoming stream with the grammar, it
   may come up with more than one alternative match because of
   confidence levels in certain words or conversation paths.  If this
   header field is not specified, by default, the recognizer resource
   returns only the best match above the confidence threshold.  The
   client, by setting this header field, can ask the recognition
   resource to send it more than one alternative.  All alternatives must
   still be above the Confidence-Threshold.  A value greater than one
   does not guarantee that the recognizer will provide the requested
   number of alternatives.  This header field MAY occur in RECOGNIZE,
   SET-PARAMS, or GET-PARAMS.  The minimum value for this header field
   is 1.  The default value for this header field is 1.

   n-best-list-length       =  "N-Best-List-Length" ":" 1*19DIGIT CRLF
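
   The interaction between Confidence-Threshold and N-Best-List-Length
   can be sketched as follows.  This is a minimal illustration, not a
   description of any real recognizer: the function name select_n_best
   and the (text, confidence) pair shape are assumptions made for the
   example.  An empty result corresponds to the no-match case.

```python
def select_n_best(candidates, confidence_threshold, n_best_list_length=1):
    """Pick the alternatives a recognizer could return.

    `candidates` is a list of (text, confidence) pairs; this shape is a
    hypothetical stand-in for whatever the engine produces internally.
    """
    if n_best_list_length < 1:
        raise ValueError("N-Best-List-Length has a minimum value of 1")
    # Only candidates strictly above the threshold qualify.
    accepted = [c for c in candidates if c[1] > confidence_threshold]
    # Best match first, then truncate to the requested length.
    accepted.sort(key=lambda c: c[1], reverse=True)
    return accepted[:n_best_list_length]
```

   Note that asking for more alternatives never lowers the bar: a
   request for five alternatives still returns fewer when fewer
   candidates exceed the threshold.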

9.4.5. Input-Type

   When the recognizer detects barge-in-able input and generates a
   START-OF-INPUT event, that event MUST carry this header field to
   specify whether the input that caused the barge-in was DTMF or
   speech.

   input-type               =  "Input-Type" ":" inputs CRLF
   inputs                   =  "speech" / "dtmf"

9.4.6. No-Input-Timeout

   When recognition is started and there is no speech detected for a
   certain period of time, the recognizer can send a
   RECOGNITION-COMPLETE event to the client with a Completion-Cause of
   "no-input-timeout" and terminate the recognition operation.  The
   client can use the No-Input-Timeout header field to set this timeout.
   The value is in milliseconds and can range from 0 to an
   implementation-specific maximum value.  This header field MAY occur
   in RECOGNIZE, SET-PARAMS, or GET-PARAMS.  The default value is
   implementation specific.

   no-input-timeout         =  "No-Input-Timeout" ":" 1*19DIGIT CRLF

9.4.7. Recognition-Timeout

   When recognition is started and there is no match for a certain
   period of time, the recognizer can send a RECOGNITION-COMPLETE event
   to the client and terminate the recognition operation.  The
   Recognition-Timeout header field allows the client to set this
   timeout value.  The value is in milliseconds.  The value for this
   header field ranges from 0 to an implementation-specific maximum
   value.  The default value is 10 seconds.  This header field MAY occur
   in RECOGNIZE, SET-PARAMS, or GET-PARAMS.

   recognition-timeout      =  "Recognition-Timeout" ":" 1*19DIGIT CRLF

9.4.8. Waveform-URI

   If the Save-Waveform header field is set to "true", the recognizer
   MUST record the incoming audio stream of the recognition into a
   stored form and provide a URI for the client to access it.  This
   header field MUST be present in the RECOGNITION-COMPLETE event if the
   Save-Waveform header field was set to "true".  The value of the
   header field MUST be empty if there was some error condition
   preventing the server from recording.  Otherwise, the URI generated
   by the server MUST be unambiguous across the server and all its
   recognition sessions.  The content associated with the URI MUST be
   available to the client until the MRCPv2 session terminates.

   Similarly, if the Save-Best-Waveform header field is set to "true",
   the recognizer MUST save the audio stream for the best repetition of
   the phrase that was used during the enrollment session.  The
   recognizer MUST then record the recognized audio and make it
   available to the client by returning a URI in the Waveform-URI header
   field in the response to the END-PHRASE-ENROLLMENT method.  The value
   of the header field MUST be empty if there was some error condition
   preventing the server from recording.  Otherwise, the URI generated
   by the server MUST be unambiguous across the server and all its
   recognition sessions.  The content associated with the URI MUST be
   available to the client until the MRCPv2 session terminates.  See the
   discussion on the sensitivity of saved waveforms in Section 12.

   The server MUST also return the size in octets and the duration in
   milliseconds of the recorded audio waveform as parameters associated
   with the header field.

   waveform-uri             =  "Waveform-URI" ":" ["<" uri ">"
                               ";" "size" "=" 1*19DIGIT
                               ";" "duration" "=" 1*19DIGIT] CRLF
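
   A client reading the Waveform-URI header field has to handle both the
   populated form and the empty form that signals a recording error.
   The sketch below shows one way to do that with a regular expression
   derived from the ABNF above.  The function name parse_waveform_uri
   and the example URI are hypothetical, and tolerating optional
   whitespace after the colon is an assumption of this sketch.

```python
import re

# Mirrors the ABNF: an optional "<uri>;size=N;duration=N" value.
_WAVEFORM_URI = re.compile(
    r"^Waveform-URI:\s*"
    r"(?:<(?P<uri>[^>]+)>;size=(?P<size>\d{1,19});"
    r"duration=(?P<duration>\d{1,19}))?\s*$"
)


def parse_waveform_uri(line):
    """Return (uri, size_octets, duration_ms), or None for an empty value.

    An empty value means the server could not record the audio.
    """
    match = _WAVEFORM_URI.match(line.rstrip("\r\n"))
    if match is None:
        raise ValueError(f"not a Waveform-URI header field: {line!r}")
    if match.group("uri") is None:
        return None
    return (
        match.group("uri"),
        int(match.group("size")),
        int(match.group("duration")),
    )
```

   Returning None for the empty form keeps the error case distinct from
   a malformed header field, which raises ValueError instead.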

9.4.9. Media-Type

   This header field MAY be specified in the SET-PARAMS, GET-PARAMS, or
   the RECOGNIZE methods and tells the server resource the media type in
   which to store captured audio or video, such as the one captured and
   returned by the Waveform-URI header field.

   media-type               =  "Media-Type" ":" media-type-value CRLF

9.4.10. Input-Waveform-URI

   This optional header field specifies a URI pointing to audio content
   to be processed by the RECOGNIZE operation.  This enables the client
   to request recognition from a specified buffer or audio file.

   input-waveform-uri       =  "Input-Waveform-URI" ":" uri CRLF

9.4.11. Completion-Cause

   This header field MUST be part of a RECOGNITION-COMPLETE event coming
   from the recognizer resource to the client.  It indicates the reason
   behind the RECOGNIZE method completion.  This header field MUST be
   sent in the DEFINE-GRAMMAR and RECOGNIZE responses, if they return
   with a failure status and a COMPLETE state.  In the ABNF below, the
   cause-code contains a numerical value selected from the Cause-Code
   column of the following table.  The cause-name contains the
   corresponding token selected from the Cause-Name column.

   completion-cause         =  "Completion-Cause" ":" cause-code SP
                               cause-name CRLF
   cause-code               =  3DIGIT
   cause-name               =  *VCHAR

   +------------+-----------------------+------------------------------+
   | Cause-Code | Cause-Name            | Description                  |
   +------------+-----------------------+------------------------------+
   | 000        | success               | RECOGNIZE completed with a   |
   |            |                       | match or DEFINE-GRAMMAR      |
   |            |                       | succeeded in downloading and |
   |            |                       | compiling the grammar.       |
   |            |                       |                              |
   | 001        | no-match              | RECOGNIZE completed, but no  |
   |            |                       | match was found.             |
   |            |                       |                              |
   | 002        | no-input-timeout      | RECOGNIZE completed without  |
   |            |                       | a match due to a             |
   |            |                       | no-input-timeout.            |
   |            |                       |                              |
   | 003        | hotword-maxtime       | RECOGNIZE in hotword mode    |
   |            |                       | completed without a match    |
   |            |                       | due to a                     |
   |            |                       | recognition-timeout.         |
   |            |                       |                              |
   | 004        | grammar-load-failure  | RECOGNIZE failed due to      |
   |            |                       | grammar load failure.        |
   |            |                       |                              |
   | 005        | grammar-compilation-  | RECOGNIZE failed due to      |
   |            | failure               | grammar compilation failure. |
   |            |                       |                              |
   | 006        | recognizer-error      | RECOGNIZE request terminated |
   |            |                       | prematurely due to a         |
   |            |                       | recognizer error.            |
   |            |                       |                              |
   | 007        | speech-too-early      | RECOGNIZE request terminated |
   |            |                       | because speech was too       |
   |            |                       | early. This happens when the |
   |            |                       | audio stream is already      |
   |            |                       | "in-speech" when the         |
   |            |                       | RECOGNIZE request was        |
   |            |                       | received.                    |
   |            |                       |                              |
   | 008        | success-maxtime       | RECOGNIZE request terminated |
   |            |                       | because speech was too long  |
   |            |                       | but whatever was spoken till |
   |            |                       | that point was a full match. |
   |            |                       |                              |
   | 009        | uri-failure           | Failure accessing a URI.     |
   |            |                       |                              |
   | 010        | language-unsupported  | Language not supported.      |
   |            |                       |                              |
   | 011        | cancelled             | A new RECOGNIZE cancelled    |
   |            |                       | this one, or a prior         |
   |            |                       | RECOGNIZE failed while this  |
   |            |                       | one was still in the queue.  |
   |            |                       |                              |
   | 012        | semantics-failure     | Recognition succeeded, but   |
   |            |                       | semantic interpretation of   |
   |            |                       | the recognized input failed. |
   |            |                       | The RECOGNITION-COMPLETE     |
   |            |                       | event MUST contain the       |
   |            |                       | Recognition result with only |
   |            |                       | input text and no            |
   |            |                       | interpretation.              |
   |            |                       |                              |
   | 013        | partial-match         | Speech Incomplete Timeout    |
   |            |                       | expired before there was a   |
   |            |                       | full match. But whatever was |
   |            |                       | spoken till that point was a |
   |            |                       | partial match to one or more |
   |            |                       | grammars.                    |
   |            |                       |                              |
   | 014        | partial-match-maxtime | The Recognition-Timeout      |
   |            |                       | expired before full match    |
   |            |                       | was achieved. But whatever   |
   |            |                       | was spoken till that point   |
   |            |                       | was a partial match to one   |
   |            |                       | or more grammars.            |
   |            |                       |                              |
   | 015        | no-match-maxtime      | The Recognition-Timeout      |
   |            |                       | expired. Whatever was spoken |
   |            |                       | till that point did not      |
   |            |                       | match any of the grammars.   |
   |            |                       | This cause could also be     |
   |            |                       | returned if the recognizer   |
   |            |                       | does not support detecting   |
   |            |                       | partial grammar matches.     |
   |            |                       |                              |
   | 016        | grammar-definition-   | Any DEFINE-GRAMMAR error     |
   |            | failure               | other than                   |
   |            |                       | grammar-load-failure and     |
   |            |                       | grammar-compilation-failure. |
   +------------+-----------------------+------------------------------+
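
   The table above maps directly onto a lookup structure.  The sketch
   below parses a Completion-Cause value (the part after the colon) and
   checks that the code and name agree.  The names COMPLETION_CAUSES and
   parse_completion_cause are hypothetical, and rejecting a mismatched
   pair is a design choice of this sketch, not a requirement stated in
   the text.

```python
# Cause-Code -> Cause-Name, transcribed from the table above.
COMPLETION_CAUSES = {
    0: "success",
    1: "no-match",
    2: "no-input-timeout",
    3: "hotword-maxtime",
    4: "grammar-load-failure",
    5: "grammar-compilation-failure",
    6: "recognizer-error",
    7: "speech-too-early",
    8: "success-maxtime",
    9: "uri-failure",
    10: "language-unsupported",
    11: "cancelled",
    12: "semantics-failure",
    13: "partial-match",
    14: "partial-match-maxtime",
    15: "no-match-maxtime",
    16: "grammar-definition-failure",
}


def parse_completion_cause(value):
    """Parse e.g. "001 no-match" into (1, "no-match")."""
    code_text, _, name = value.strip().partition(" ")
    # cause-code is exactly three ASCII digits (3DIGIT in the ABNF).
    if len(code_text) != 3 or not (code_text.isascii() and code_text.isdigit()):
        raise ValueError(f"cause-code must be 3 digits: {value!r}")
    code = int(code_text)
    if COMPLETION_CAUSES.get(code) != name:
        raise ValueError(f"code and name do not agree: {value!r}")
    return code, name
```

   A client would typically branch on the numeric code and keep the
   name only for logging.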

9.4.12. Completion-Reason

   This header field MAY be specified in a RECOGNITION-COMPLETE event
   coming from the recognizer resource to the client.  This contains the
   reason text behind the RECOGNIZE request completion.  The server uses
   this header field to communicate text describing the reason for the
   failure, such as the specific error encountered in parsing a grammar
   markup.

   The completion reason text is provided for client use in logs and for
   debugging and instrumentation purposes.  Clients MUST NOT interpret
   the completion reason text.

   completion-reason        =  "Completion-Reason" ":"
                               quoted-string CRLF

9.4.13. Recognizer-Context-Block

   This header field MAY be sent as part of the SET-PARAMS or GET-PARAMS
   request.  If the GET-PARAMS method contains this header field with no
   value, then it is a request to the recognizer to return the
   recognizer context block.  The response to such a message MAY contain
   a recognizer context block as a typed media message body.  If the
   server returns a recognizer context block, the response MUST contain
   this header field and its value MUST match the Content-ID of the
   corresponding media block.

   If the SET-PARAMS method contains this header field, it MUST also
   contain a message body containing the recognizer context data and a
   Content-ID matching this header field value.  This Content-ID MUST
   match the Content-ID that came with the context data during the
   GET-PARAMS operation.

   An implementation choosing to use this mechanism to hand off
   recognizer context data between servers MUST distinguish its
   implementation-specific block of data by using an IANA-registered
   content type in the IANA Media Type vendor tree.

   recognizer-context-block  =  "Recognizer-Context-Block" ":"
                                [1*VCHAR] CRLF

9.4.14. Start-Input-Timers

   This header field MAY be sent as part of the RECOGNIZE request.  A
   value of false tells the recognizer to start recognition but not to
   start the no-input timer yet.  The recognizer MUST NOT start the
   timers until the client sends a START-INPUT-TIMERS request to the
   recognizer.  This is useful in the scenario when the recognizer and
   synthesizer engines are not part of the same session.  In such
   configurations, when a kill-on-barge-in prompt is being played (see
   Section 8.4.2), the client wants the RECOGNIZE request to be
   simultaneously active so that it can detect and implement kill-on-
   barge-in.  However, the recognizer SHOULD NOT start the no-input
   timers until the prompt is finished.  The default value is "true".

   start-input-timers  =  "Start-Input-Timers" ":" BOOLEAN CRLF
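
   As a rough illustration of the header fields a client might attach to
   a RECOGNIZE request in the kill-on-barge-in scenario, the sketch
   below serializes boolean and integer values following the ABNF rules
   in this section.  The function name format_header is hypothetical,
   the literal "true"/"false" spellings are taken from the quoted values
   in the text above, and the 5000 ms timeout is an arbitrary example
   value.  Only header field lines are produced; the request start-line
   and other mandatory parts of the message are out of scope here.

```python
def format_header(name, value):
    """Serialize one header field line as `name ":" value CRLF`.

    Supports BOOLEAN and 1*19DIGIT values only (a deliberate limit of
    this sketch).
    """
    # bool must be tested first because bool is a subclass of int.
    if isinstance(value, bool):
        text = "true" if value else "false"
    elif isinstance(value, int):
        text = str(value)
        if value < 0 or len(text) > 19:
            raise ValueError("value must fit 1*19DIGIT")
    else:
        raise TypeError("only bool and int values are handled here")
    return f"{name}:{text}\r\n"


# Start recognition now, but hold the no-input timer until the prompt
# has finished and START-INPUT-TIMERS is sent.
headers = (
    format_header("Start-Input-Timers", False)
    + format_header("No-Input-Timeout", 5000)
)
```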

9.4.15. Speech-Complete-Timeout

   This header field specifies the length of silence required following
   user speech before the speech recognizer finalizes a result (either
   accepting it or generating a no-match result).  The
   Speech-Complete-Timeout value applies when the recognizer currently
   has a complete match against an active grammar, and specifies how
   long the recognizer MUST wait for more input before declaring a
   match.  By contrast, the Speech-Incomplete-Timeout is used when the
   speech is an incomplete match to an active grammar.  The value is in
   milliseconds.

   speech-complete-timeout  =  "Speech-Complete-Timeout" ":"
                               1*19DIGIT CRLF

   A long Speech-Complete-Timeout value delays the result to the client
   and therefore makes the application's response to a user slow.  A
   short Speech-Complete-Timeout may lead to an utterance being broken
   up inappropriately.  Reasonable speech complete timeout values are
   typically in the range of 0.3 seconds to 1.0 seconds.  The value for
   this header field ranges from 0 to an implementation-specific maximum
   value.  The default value for this header field is implementation
   specific.  This header field MAY occur in RECOGNIZE, SET-PARAMS, or
   GET-PARAMS.

9.4.16. Speech-Incomplete-Timeout

   This header field specifies the required length of silence following
   user speech after which a recognizer finalizes a result.  The
   incomplete timeout applies when the speech prior to the silence is an
   incomplete match of all active grammars.  In this case, once the
   timeout is triggered, the partial result is rejected (with a
   Completion-Cause of "partial-match").  The value is in milliseconds.
   The value for this header field ranges from 0 to an
   implementation-specific maximum value.  The default value for this
   header field is implementation specific.

   speech-incomplete-timeout  =  "Speech-Incomplete-Timeout" ":"
                                 1*19DIGIT CRLF

   The Speech-Incomplete-Timeout also applies when the speech prior to
   the silence is a complete match of an active grammar, but where it is
   possible to speak further and still match the grammar.  By contrast,
   the Speech-Complete-Timeout is used when the speech is a complete
   match to an active grammar and no further spoken words can continue
   to represent a match.

   A long Speech-Incomplete-Timeout value delays the result to the
   client and therefore makes the application's response to a user slow.
   A short Speech-Incomplete-Timeout may lead to an utterance being
   broken up inappropriately.

   The Speech-Incomplete-Timeout is usually longer than the Speech-
   Complete-Timeout to allow users to pause mid-utterance (for example,
   to breathe).  This header field MAY occur in RECOGNIZE, SET-PARAMS,
   or GET-PARAMS.
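
   The choice between the two speech timeouts depends on two facts about
   the current match, which the sketch below makes explicit.  The
   function and parameter names are hypothetical; how a recognizer
   determines whether further speech could still match is engine
   specific and is simply passed in as a flag here.

```python
def silence_timeout_ms(is_complete_match, can_match_more,
                       speech_complete_ms, speech_incomplete_ms):
    """Return the silence timeout that governs the current utterance.

    Speech-Complete-Timeout applies only when the grammar is fully
    matched and no further words could extend the match. In every
    other case Speech-Incomplete-Timeout applies.
    """
    if is_complete_match and not can_match_more:
        return speech_complete_ms
    return speech_incomplete_ms
```

   With a grammar that accepts both "call Bob" and "call Bob Smith", the
   utterance "call Bob" is a complete match that can still be extended,
   so the longer incomplete timeout governs and the caller has time to
   add the surname.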

9.4.17. DTMF-Interdigit-Timeout

   This header field specifies the inter-digit timeout value to use when
   recognizing DTMF input.  The value is in milliseconds.  The value for
   this header field ranges from 0 to an implementation-specific maximum
   value.  The default value is 5 seconds.  This header field MAY occur
   in RECOGNIZE, SET-PARAMS, or GET-PARAMS.

   dtmf-interdigit-timeout  =  "DTMF-Interdigit-Timeout" ":"
                               1*19DIGIT CRLF

9.4.18. DTMF-Term-Timeout

   This header field specifies the terminating timeout to use when
   recognizing DTMF input.  The DTMF-Term-Timeout applies only when no
   additional input is allowed by the grammar; otherwise, the
   DTMF-Interdigit-Timeout applies.  The value is in milliseconds.  The
   value for this header field ranges from 0 to an
   implementation-specific maximum value.  The default value is 10
   seconds.  This header field MAY occur in RECOGNIZE, SET-PARAMS, or
   GET-PARAMS.

   dtmf-term-timeout        =  "DTMF-Term-Timeout" ":" 1*19DIGIT CRLF
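
   The rule for choosing between the two DTMF timeouts can be sketched
   in a few lines.  The function and parameter names are hypothetical;
   the default arguments are the 5-second and 10-second defaults stated
   in Sections 9.4.17 and 9.4.18, expressed in milliseconds.

```python
def dtmf_timeout_ms(grammar_allows_more_input,
                    interdigit_ms=5000, term_ms=10000):
    """Return the timeout to arm after the most recent DTMF digit.

    DTMF-Term-Timeout applies only when the grammar allows no further
    input; otherwise DTMF-Interdigit-Timeout applies.
    """
    return interdigit_ms if grammar_allows_more_input else term_ms
```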

9.4.19. DTMF-Term-Char

   This header field specifies the terminating DTMF character for DTMF
   input recognition.  The default value is NULL, which is indicated by
   an empty header field value.  This header field MAY occur in
   RECOGNIZE, SET-PARAMS, or GET-PARAMS.

   dtmf-term-char           =  "DTMF-Term-Char" ":" VCHAR CRLF

9.4.20. Failed-URI

   When a recognizer needs to fetch or access a URI and the access
   fails, the server SHOULD provide the failed URI in this header field
   in the method response, unless there are multiple URI failures, in
   which case one of the failed URIs MUST be provided in this header
   field in the method response.

   failed-uri               =  "Failed-URI" ":" absoluteURI CRLF

9.4.21. Failed-URI-Cause

   When a recognizer method needs a recognizer to fetch or access a URI
   and the access fails, the server MUST provide the URI-specific or
   protocol-specific response code for the URI in the Failed-URI header
   field through this header field in the method response.  The value
   encoding is UTF-8 (RFC 3629 [RFC3629]) to accommodate any access
   protocol, some of which might have a response string instead of a
   numeric response code.

   failed-uri-cause         =  "Failed-URI-Cause" ":" 1*UTFCHAR CRLF

9.4.22. Save-Waveform

   This header field allows the client to request the recognizer
   resource to save the audio input to the recognizer.  The recognizer
   resource MUST then attempt to record the recognized audio, without
   endpointing, and make it available to the client in the form of a URI
   returned in the Waveform-URI header field in the RECOGNITION-COMPLETE
   event.  If there was an error in recording the stream or the audio
   content is otherwise not available, the recognizer MUST return an
   empty Waveform-URI header field.  The default value for this field is
   "false".  This header field MAY occur in RECOGNIZE, SET-PARAMS, or
   GET-PARAMS.  See the discussion on the sensitivity of saved waveforms
   in Section 12.

   save-waveform            =  "Save-Waveform" ":" BOOLEAN CRLF

9.4.23. New-Audio-Channel

This header field MAY be specified in a RECOGNIZE request and allows the client to tell the server that, from this point on, further input audio comes from a different audio source, channel, or speaker. If the recognizer resource had collected any input statistics or adaptation state, the recognizer resource MUST do what is appropriate for the specific recognition technology, which includes but is not limited to discarding any collected input statistics or adaptation state before starting the RECOGNIZE request. Note that if there are
   multiple resources that are sharing a media stream and are collecting
   or using this data, and the client issues this header field to one of
   the resources, the reset operation applies to all resources that use
   the shared media stream.  This helps in a number of use cases,
   including where the client wishes to reuse an open recognition
   session with an existing media session for multiple telephone calls.

   new-audio-channel        =  "New-Audio-Channel" ":" BOOLEAN
                               CRLF

9.4.24. Speech-Language

This header field specifies the language of recognition grammar data within a session or request, if it is not specified within the data. The value of this header field MUST follow RFC 5646 [RFC5646]. This header field MAY occur in DEFINE-GRAMMAR, RECOGNIZE, SET-PARAMS, or GET-PARAMS requests.

   speech-language          =  "Speech-Language" ":" 1*VCHAR CRLF

9.4.25. Ver-Buffer-Utterance

This header field lets the client request the server to buffer the utterance associated with this recognition request into a buffer available to a co-resident verifier resource. The buffer is shared across resources within a session and is allocated when a verifier resource is added to this session. The client MUST NOT send this header field unless a verifier resource is instantiated for the session. The buffer is released when the verifier resource is released from the session.

9.4.26. Recognition-Mode

This header field specifies what mode the RECOGNIZE method will operate in. The value choices are "normal" or "hotword". If the value is "normal", the RECOGNIZE starts matching speech and DTMF to the grammars specified in the RECOGNIZE request. If any portion of the speech does not match the grammar, the RECOGNIZE command completes with a no-match status. Timers may be active to detect speech in the audio (see Section 9.4.14), so the RECOGNIZE method may complete because of a timeout waiting for speech. If the value of this header field is "hotword", the RECOGNIZE method operates in hotword mode, where it only looks for the particular keywords or DTMF
   sequences specified in the grammar and ignores silence or other
   speech in the audio stream.  The default value for this header field
   is "normal".  This header field MAY occur on the RECOGNIZE method.

   recognition-mode         =  "Recognition-Mode" ":"
                               "normal" / "hotword" CRLF

9.4.27. Cancel-If-Queue

This header field specifies what will happen if the client attempts to invoke another RECOGNIZE method when this RECOGNIZE request is already in progress for the resource. The value for this header field is a Boolean. A value of "true" means the server MUST terminate this RECOGNIZE request, with a Completion-Cause of "cancelled", if the client issues another RECOGNIZE request for the same resource. A value of "false" for this header field indicates to the server that this RECOGNIZE request will continue to completion, and if the client issues more RECOGNIZE requests to the same resource, they are queued. When the currently active RECOGNIZE request is stopped or completes with a successful match, the first RECOGNIZE method in the queue becomes active. If the current RECOGNIZE fails, all RECOGNIZE methods in the pending queue are cancelled, and each generates a RECOGNITION-COMPLETE event with a Completion-Cause of "cancelled". This header field MUST be present in every RECOGNIZE request. There is no default value.

   cancel-if-queue          =  "Cancel-If-Queue" ":" BOOLEAN CRLF
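The queuing rules above can be modelled with a small state machine. The Python sketch below is a toy model, not a server implementation: the class name RecognizerQueue, its methods, and the use of plain strings for completion causes are all hypothetical. It also assumes that when an active request is cancelled by a newcomer, the newcomer joins the tail of any existing queue and the head of the queue becomes active; the text above does not spell out that corner case.

```python
from collections import deque


class RecognizerQueue:
    """Toy model of RECOGNIZE queuing on a single recognizer resource."""

    def __init__(self):
        self.active = None       # (request_id, cancel_if_queue) or None
        self.pending = deque()   # queued (request_id, cancel_if_queue)
        self.events = []         # (request_id, completion_cause) emitted

    def recognize(self, request_id, cancel_if_queue):
        request = (request_id, cancel_if_queue)
        if self.active is None:
            self.active = request
        elif self.active[1]:
            # Cancel-If-Queue: true on the active request, so it is
            # terminated with "cancelled" when another RECOGNIZE arrives.
            self.events.append((self.active[0], "cancelled"))
            self.pending.append(request)
            self.active = self.pending.popleft()  # assumption, see lead-in
        else:
            # Cancel-If-Queue: false, so the new request waits its turn.
            self.pending.append(request)

    def complete(self, cause):
        """Finish the active request with the given completion cause."""
        self.events.append((self.active[0], cause))
        if cause == "success":
            # Successful match: the first queued request becomes active.
            self.active = self.pending.popleft() if self.pending else None
        else:
            # Failure: every queued request is cancelled.
            while self.pending:
                pending_id, _ = self.pending.popleft()
                self.events.append((pending_id, "cancelled"))
            self.active = None


queue = RecognizerQueue()
for request_id in (1, 2, 3):
    queue.recognize(request_id, cancel_if_queue=False)
queue.complete("success")    # request 2 becomes active
queue.complete("no-match")   # request 3 is cancelled
```

After these calls, queue.events holds the completion of request 1, the failure of request 2, and the cancellation of request 3, in that order.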

9.4.28. Hotword-Max-Duration

This header field MAY be sent in a hotword mode RECOGNIZE request. It specifies the maximum length of an utterance that will be considered for hotword recognition. This header field, along with Hotword-Min-Duration, can be used to tune performance by preventing the recognizer from evaluating utterances that are too short or too long to be one of the hotwords in the grammar(s). The value is in milliseconds. The default is implementation dependent. If present in a RECOGNIZE request specifying a mode other than "hotword", the header field is ignored.

   hotword-max-duration     =  "Hotword-Max-Duration" ":" 1*19DIGIT CRLF

9.4.29. Hotword-Min-Duration

This header field MAY be sent in a hotword mode RECOGNIZE request. It specifies the minimum length of an utterance that will be considered for hotword recognition. This header field, along with Hotword-Max-Duration, can be used to tune performance by preventing the recognizer from evaluating utterances that are too short or too long to be one of the hotwords in the grammar(s). The value is in milliseconds. The default value is implementation dependent. If present in a RECOGNIZE request specifying a mode other than "hotword", the header field is ignored.

   hotword-min-duration     =  "Hotword-Min-Duration" ":" 1*19DIGIT CRLF
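To make the role of the two duration bounds concrete, the Python sketch below parses a 1*19DIGIT value and checks whether an utterance falls inside the hotword window. The function names are hypothetical, and treating the bounds as inclusive is an assumption; the text above does not say how an utterance exactly at a bound is handled.

```python
import re

# 1*19DIGIT: between one and nineteen decimal digits.
_DIGITS = re.compile(r"[0-9]{1,19}")


def parse_duration_ms(value):
    """Parse a Hotword-Min-Duration / Hotword-Max-Duration value (ms)."""
    text = value.strip()
    if not _DIGITS.fullmatch(text):
        raise ValueError(f"not a 1*19DIGIT value: {value!r}")
    return int(text)


def within_hotword_window(duration_ms, min_ms=None, max_ms=None):
    """Return True if the utterance length is worth evaluating.

    A bound of None means the header field was absent; the real default
    is implementation dependent. Bounds are inclusive (assumption).
    """
    if min_ms is not None and duration_ms < min_ms:
        return False
    if max_ms is not None and duration_ms > max_ms:
        return False
    return True


min_ms = parse_duration_ms("500")
max_ms = parse_duration_ms("2000")
candidates = [300, 1200, 2500]
evaluated = [d for d in candidates if within_hotword_window(d, min_ms, max_ms)]
```

With these bounds only the 1200 ms utterance is passed on for hotword evaluation.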

9.4.30. Interpret-Text

The value of this header field is used to provide a pointer to the text for which a natural language interpretation is desired. The value is either a URI or text. If the value is a URI, it MUST be a Content-ID that refers to an entity of type 'text/plain' in the body of the message. Otherwise, the server MUST treat the value as the text to be interpreted. This header field MUST be used when invoking the INTERPRET method.

   interpret-text           =  "Interpret-Text" ":" 1*VCHAR CRLF

9.4.31. DTMF-Buffer-Time

This header field MAY be specified in a GET-PARAMS or SET-PARAMS method and is used to specify the amount of time, in milliseconds, of the type-ahead buffer for the recognizer. This is the buffer that collects DTMF digits as they are pressed even when there is no RECOGNIZE command active. When a subsequent RECOGNIZE method is received, it MUST look to this buffer to match the RECOGNIZE request. If the digits in the buffer are not sufficient, then it can continue to listen to more digits to match the grammar. The default size of this DTMF buffer is platform specific.

   dtmf-buffer-time         =  "DTMF-Buffer-Time" ":" 1*19DIGIT CRLF

9.4.32. Clear-DTMF-Buffer

This header field MAY be specified in a RECOGNIZE method and is used to tell the recognizer to clear the DTMF type-ahead buffer before starting the RECOGNIZE. The default value of this header field is "false", which does not clear the type-ahead buffer before starting the RECOGNIZE method. If this header field is specified to be "true", then the RECOGNIZE will clear the DTMF buffer before starting recognition. This means digits pressed by the caller before the RECOGNIZE command was issued are discarded.

   clear-dtmf-buffer        =  "Clear-DTMF-Buffer" ":" BOOLEAN CRLF
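The interaction between DTMF-Buffer-Time and Clear-DTMF-Buffer can be sketched as follows. This Python model is illustrative only: the class DtmfTypeAheadBuffer and its methods are hypothetical, timestamps are supplied by the caller, and reading DTMF-Buffer-Time as a sliding window (digits older than the buffer time are dropped) is an assumption about how a platform might realize the buffer.

```python
from collections import deque


class DtmfTypeAheadBuffer:
    """Toy type-ahead buffer collecting digits while no RECOGNIZE is active."""

    def __init__(self, buffer_time_ms):
        self.buffer_time_ms = buffer_time_ms
        self._digits = deque()  # (timestamp_ms, digit), oldest first

    def _expire(self, now_ms):
        # Assumption: digits older than DTMF-Buffer-Time are discarded.
        while self._digits and now_ms - self._digits[0][0] > self.buffer_time_ms:
            self._digits.popleft()

    def press(self, digit, now_ms):
        """Record a digit pressed at time now_ms."""
        self._digits.append((now_ms, digit))
        self._expire(now_ms)

    def start_recognize(self, now_ms, clear_dtmf_buffer=False):
        """Return the buffered digits handed to a new RECOGNIZE request."""
        if clear_dtmf_buffer:
            # Clear-DTMF-Buffer: true discards everything typed ahead.
            self._digits.clear()
        self._expire(now_ms)
        digits = "".join(digit for _, digit in self._digits)
        self._digits.clear()
        return digits


buffer = DtmfTypeAheadBuffer(buffer_time_ms=5000)
buffer.press("1", now_ms=0)
buffer.press("2", now_ms=1000)
buffer.press("3", now_ms=4000)
typed_ahead = buffer.start_recognize(now_ms=5500)

buffer.press("4", now_ms=6000)
discarded = buffer.start_recognize(now_ms=6200, clear_dtmf_buffer=True)
```

In the first RECOGNIZE the digit pressed at time 0 has aged out of the 5000 ms window, so only the later two digits are offered for matching; in the second, the buffer is cleared before recognition starts.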

9.4.33. Early-No-Match

This header field MAY be specified in a RECOGNIZE method and is used to tell the recognizer that it MUST NOT wait for the end of speech before processing the collected speech to match active grammars. A value of "true" indicates the recognizer MUST do early matching. The default value for this header field if not specified is "false". If the recognizer does not support the processing of the collected audio before the end of speech, this header field can be safely ignored.

   early-no-match           =  "Early-No-Match" ":" BOOLEAN CRLF

9.4.34. Num-Min-Consistent-Pronunciations

This header field MAY be specified in a START-PHRASE-ENROLLMENT, SET-PARAMS, or GET-PARAMS method and is used to specify the minimum number of consistent pronunciations that must be obtained to voice enroll a new phrase. The minimum value is 1. The default value is implementation specific and MAY be greater than 1.

   num-min-consistent-pronunciations  =
                               "Num-Min-Consistent-Pronunciations" ":"
                               1*19DIGIT CRLF

9.4.35. Consistency-Threshold

This header field MAY be sent as part of the START-PHRASE-ENROLLMENT, SET-PARAMS, or GET-PARAMS method. Used during voice enrollment, this header field specifies how similar to a previously enrolled pronunciation of the same phrase an utterance needs to be in order to be considered "consistent". The higher the threshold, the closer the match between an utterance and previous pronunciations must be for the pronunciation to be considered consistent. The range for this threshold is a float value between 0.0 and 1.0. The default value for this header field is implementation specific.

   consistency-threshold    =  "Consistency-Threshold" ":" FLOAT CRLF

9.4.36. Clash-Threshold

This header field MAY be sent as part of the START-PHRASE-ENROLLMENT, SET-PARAMS, or GET-PARAMS method. Used during voice enrollment, this header field specifies how similar the pronunciations of two different phrases can be before they are considered to be clashing. For example, pronunciations of phrases such as "John Smith" and "Jon Smits" may be so similar that they are difficult to distinguish correctly. A smaller threshold reduces the number of clashes detected. The range for this threshold is a float value between 0.0
   and 1.0.  The default value for this header field is implementation
   specific.  Clash testing can be turned off completely by setting the
   Clash-Threshold header field value to 0.

   clash-threshold          =  "Clash-Threshold" ":" FLOAT CRLF
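A client or server might validate Consistency-Threshold and Clash-Threshold values before use. The Python sketch below is one way to do that; the function names are hypothetical, and the accepted syntax (plain decimal digits with an optional fraction, no exponent or sign) is an assumption about the FLOAT rule, which is defined elsewhere in the specification.

```python
import re

# Assumed FLOAT shape: digits with an optional fractional part.
_DECIMAL = re.compile(r"[0-9]+(\.[0-9]*)?|\.[0-9]+")


def parse_threshold(value):
    """Parse a threshold header value and check the 0.0 to 1.0 range."""
    text = value.strip()
    if not _DECIMAL.fullmatch(text):
        raise ValueError(f"not a plain decimal number: {value!r}")
    number = float(text)
    if not 0.0 <= number <= 1.0:
        raise ValueError(f"threshold out of range 0.0-1.0: {number}")
    return number


def clash_testing_enabled(clash_threshold):
    """A Clash-Threshold of 0 turns clash testing off completely."""
    return clash_threshold > 0.0


consistency = parse_threshold("0.75")
clash = parse_threshold("0")
```

Here clash_testing_enabled(clash) is False, matching the rule that a Clash-Threshold of 0 disables clash testing.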

9.4.37. Personal-Grammar-URI

This header field specifies the speaker-trained grammar to be used or referenced during enrollment operations. Phrases are added to this grammar during enrollment. For example, a contact list for user "Jeff" could be stored at the Personal-Grammar-URI "http://myserver.example.com/myenrollmentdb/jeff-list". The generated grammar syntax MAY be implementation specific. There is no default value for this header field. This header field MAY be sent as part of the START-PHRASE-ENROLLMENT, SET-PARAMS, or GET-PARAMS method.

   personal-grammar-uri     =  "Personal-Grammar-URI" ":" uri CRLF

9.4.38. Enroll-Utterance

This header field MAY be specified in the RECOGNIZE method. If this header field is set to "true" and an Enrollment is active, the RECOGNIZE command MUST add the collected utterance to the personal grammar that is being enrolled. The way in which this occurs is engine specific and may be an area of future standardization. The default value for this header field is "false".

   enroll-utterance         =  "Enroll-Utterance" ":" BOOLEAN CRLF

9.4.39. Phrase-Id

This header field in a request identifies a phrase in an existing personal grammar for which enrollment is desired. It is also returned to the client in the RECOGNITION-COMPLETE event. This header field MAY occur in START-PHRASE-ENROLLMENT, MODIFY-PHRASE, or DELETE-PHRASE requests. There is no default value for this header field.

   phrase-id                =  "Phrase-ID" ":" 1*VCHAR CRLF

9.4.40. Phrase-NL

This string specifies the interpreted text to be returned when the phrase is recognized. This header field MAY occur in START-PHRASE-ENROLLMENT and MODIFY-PHRASE requests. There is no default value for this header field.

   phrase-nl                =  "Phrase-NL" ":" 1*UTFCHAR CRLF

9.4.41. Weight

The value of this header field represents the occurrence likelihood of a phrase in an enrolled grammar. When using grammar enrollment, the system is essentially constructing a grammar segment consisting of a list of possible match phrases. This can be thought of as similar to the dynamic construction of a <one-of> tag in the W3C grammar specification. Each enrolled phrase becomes an item in the list that can be matched against spoken input, similar to the <item> within a <one-of> list. This header field allows the client to assign a weight to the phrase (i.e., <item> entry) in the <one-of> list that is enrolled. Grammar weights are normalized to a sum of one at grammar compilation time, so a weight value of 1 for each phrase in an enrolled grammar list indicates all items in that list have the same weight. This header field MAY occur in START-PHRASE-ENROLLMENT and MODIFY-PHRASE requests. The default value for this header field is implementation specific.

   weight                   =  "Weight" ":" FLOAT CRLF
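The normalization step described above is simple arithmetic: each weight is divided by the sum of all weights in the enrolled list. The Python sketch below shows this; the function name and the phrase identifiers are hypothetical, and a real engine may apply normalization differently inside its grammar compiler.

```python
def normalize_weights(weights):
    """Scale per-phrase weights so that they sum to one.

    weights maps a phrase identifier to its Weight header value.
    """
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return {phrase_id: weight / total for phrase_id, weight in weights.items()}


# Four phrases all enrolled with Weight: 1 end up equally likely.
equal = normalize_weights({"jeff": 1.0, "anna": 1.0, "li": 1.0, "sam": 1.0})

# A phrase enrolled with Weight: 3 is three times as likely as one with 1.
skewed = normalize_weights({"home": 3.0, "office": 1.0})
```

In the first case every phrase receives 0.25; in the second, "home" receives 0.75 and "office" 0.25.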

9.4.42. Save-Best-Waveform

This header field allows the client to request the recognizer resource to save the audio stream for the best repetition of the phrase that was used during the enrollment session. The recognizer MUST attempt to record the recognized audio and make it available to the client in the form of a URI returned in the Waveform-URI header field in the response to the END-PHRASE-ENROLLMENT method. If there was an error in recording the stream or the audio data is otherwise not available, the recognizer MUST return an empty Waveform-URI header field. This header field MAY occur in the START-PHRASE-ENROLLMENT, SET-PARAMS, and GET-PARAMS methods.

   save-best-waveform       =  "Save-Best-Waveform" ":" BOOLEAN CRLF

9.4.43. New-Phrase-Id

This header field replaces the ID used to identify the phrase in a personal grammar. The recognizer returns the new ID when using an enrollment grammar. This header field MAY occur in MODIFY-PHRASE requests.

   new-phrase-id            =  "New-Phrase-ID" ":" 1*VCHAR CRLF

9.4.44. Confusable-Phrases-URI

This header field specifies a grammar that defines invalid phrases for enrollment. For example, typical applications do not allow an enrolled phrase that is also a command word. This header field MAY occur in RECOGNIZE requests that are part of an enrollment session.

   confusable-phrases-uri   =  "Confusable-Phrases-URI" ":" uri CRLF

9.4.45. Abort-Phrase-Enrollment

This header field MAY be specified in the END-PHRASE-ENROLLMENT method to abort the phrase enrollment, rather than committing the phrase to the personal grammar.

   abort-phrase-enrollment  =  "Abort-Phrase-Enrollment" ":"
                               BOOLEAN CRLF

9.5. Recognizer Message Body

A recognizer message can carry additional data associated with the request, response, or event. The client MAY provide the grammar to be recognized in DEFINE-GRAMMAR or RECOGNIZE requests. When one or more grammars are specified using the DEFINE-GRAMMAR method, the server MUST attempt to fetch, compile, and optimize the grammar before returning a response to the DEFINE-GRAMMAR method. A RECOGNIZE request MUST completely specify the grammars to be active during the recognition operation, except when the RECOGNIZE method is being used to enroll a grammar. During grammar enrollment, such grammars are OPTIONAL. The server resource sends the recognition results in the RECOGNITION-COMPLETE event and the GET-RESULT response. Grammars and recognition results are carried in the message body of the corresponding MRCPv2 messages.

9.5.1. Recognizer Grammar Data

Recognizer grammar data from the client to the server can be provided inline or by reference. Either way, grammar data is carried as typed media entities in the message body of the RECOGNIZE or DEFINE-GRAMMAR
   request.  All MRCPv2 servers MUST accept grammars in the XML form
   (media type 'application/srgs+xml') of the W3C's XML-based Speech
   Grammar Markup Format (SRGS) [W3C.REC-speech-grammar-20040316] and
   MAY accept grammars in other formats.  Examples include but are not
   limited to:

   o  the ABNF form (media type 'application/srgs') of SRGS

   o  Sun's Java Speech Grammar Format (JSGF)
      [refs.javaSpeechGrammarFormat]

   Additionally, MRCPv2 servers MAY support the Semantic Interpretation
   for Speech Recognition (SISR)
   [W3C.REC-semantic-interpretation-20070405] specification.

   When a grammar is specified inline in the request, the client MUST
   provide a Content-ID for that grammar as part of the content header
   fields.  If there is no space on the server to store the inline
   grammar, the request MUST return with a Completion-Cause code of 016
   "grammar-definition-failure".  Otherwise, the server MUST associate
   the inline grammar block with that Content-ID and MUST store it on
   the server for the duration of the session.  However, if the
   Content-ID is redefined later in the session through a subsequent
   DEFINE-GRAMMAR, the inline grammar previously associated with the
   Content-ID MUST be freed.  If the Content-ID is redefined through a
   subsequent DEFINE-GRAMMAR with an empty message body (i.e., no
   grammar definition), then in addition to freeing any grammar
   previously associated with the Content-ID, the server MUST clear all
   bindings and associations to the Content-ID.  Unless and until
   subsequently redefined, this URI MUST be interpreted by the server as
   one that has never been set.
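   The Content-ID bookkeeping described above can be sketched as a small
   store.  The Python below is a hypothetical model, not a server
   implementation: the class name, the byte-count capacity limit, and
   the choice to keep the old grammar when a redefinition does not fit
   are all assumptions made for illustration.

```python
class InlineGrammarStore:
    """Toy per-session store mapping Content-ID to inline grammar bytes."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self._grammars = {}

    def _used(self, excluding):
        # Space taken by every grammar except the one being redefined.
        return sum(len(body) for content_id, body in self._grammars.items()
                   if content_id != excluding)

    def define(self, content_id, body):
        """Handle an inline grammar definition.

        Returns None on success, or the Completion-Cause value to
        report when the grammar cannot be stored.
        """
        if not body:
            # Empty body: free the grammar and clear the binding, so the
            # Content-ID behaves as if it had never been set.
            self._grammars.pop(content_id, None)
            return None
        if self._used(excluding=content_id) + len(body) > self.capacity_bytes:
            return "016 grammar-definition-failure"
        # Redefinition replaces (frees) the previous grammar.
        self._grammars[content_id] = body
        return None

    def lookup(self, content_id):
        """Resolve a session: reference; None means never set."""
        return self._grammars.get(content_id)


store = InlineGrammarStore(capacity_bytes=10)
first = store.define("help@root-level.store", b"12345")
second = store.define("help@root-level.store", b"678")       # redefined
too_big = store.define("menu1@menu-level.store", b"12345678")
cleared = store.define("help@root-level.store", b"")         # unbound
```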

   Grammars that have been associated with a Content-ID can be
   referenced through the 'session' URI scheme (see Section 13.6).  For
   example:
   session:help@root-level.store

   Grammar data MAY be specified using external URI references.  To do
   so, the client uses a body of media type 'text/uri-list' (see RFC
   2483 [RFC2483]) to list the one or more URIs that point to the
   grammar data.  The client can use a body of media type
   'text/grammar-ref-list' (see Section 13.5.1) if it wants to assign
   weights to the list of grammar URIs.  All MRCPv2 servers MUST support
   grammar access using the 'http' and 'https' URI schemes.

   If the grammar data the client wishes to be used on a request
   consists of a mix of URI and inline grammar data, the client uses the
   'multipart/mixed' media type to enclose the 'text/uri-list',
   'application/srgs', or 'application/srgs+xml' content entities.  The
   character set and encoding used in the grammar data are specified
   according to standard media type definitions.

   When more than one grammar URI or inline grammar block is specified
   in a message body of the RECOGNIZE request, the server interprets
   this as a list of grammar alternatives to match against.

   Content-Type:application/srgs+xml
   Content-ID:<request1@form-level.store>
   Content-Length:...

   <?xml version="1.0"?>

   <!-- the default grammar language is US English -->
   <grammar xmlns="http://www.w3.org/2001/06/grammar"
            xml:lang="en-US" version="1.0" root="request">

   <!-- single language attachment to tokens -->
         <rule id="yes">
               <one-of>
                     <item xml:lang="fr-CA">oui</item>
                     <item xml:lang="en-US">yes</item>
               </one-of>
         </rule>

   <!-- single language attachment to a rule expansion -->
         <rule id="request">
               may I speak to
               <one-of xml:lang="fr-CA">
                     <item>Michel Tremblay</item>
                     <item>Andre Roy</item>
               </one-of>
         </rule>

         <!-- multiple language attachment to a token -->
         <rule id="people1">
               <token lexicon="en-US,fr-CA"> Robert </token>
         </rule>
         <!-- the equivalent single-language attachment expansion -->
         <rule id="people2">
               <one-of>
                     <item xml:lang="en-US">Robert</item>
                     <item xml:lang="fr-CA">Robert</item>
               </one-of>
         </rule>

         </grammar>

                           SRGS Grammar Example


   Content-Type:text/uri-list
   Content-Length:...

   session:help@root-level.store
   http://www.example.com/Directory-Name-List.grxml
   http://www.example.com/Department-List.grxml
   http://www.example.com/TAC-Contact-List.grxml
   session:menu1@menu-level.store

                         Grammar Reference Example
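   A client could assemble an entity like the one above as follows.
   This Python sketch is illustrative: the function name is
   hypothetical, and it produces only the content header fields and
   body shown in the example, not a complete MRCPv2 request.  It
   assumes CRLF line endings and counts Content-Length in octets of
   the encoded body.

```python
def build_uri_list_entity(uris):
    """Return the header fields and body of a 'text/uri-list' entity."""
    # One URI per line, each terminated by CRLF.
    body = "".join(uri + "\r\n" for uri in uris).encode("utf-8")
    headers = (
        "Content-Type:text/uri-list\r\n"
        f"Content-Length:{len(body)}\r\n"
        "\r\n"  # blank line separating header fields from the body
    ).encode("ascii")
    return headers + body


entity = build_uri_list_entity([
    "session:help@root-level.store",
    "http://www.example.com/Department-List.grxml",
])
```

   Grammars defined inline earlier in the session and external
   grammars can be mixed freely in the list, since 'session:' and
   'http:' references are both URIs.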


   Content-Type:multipart/mixed; boundary="break"

   --break
   Content-Type:text/uri-list
   Content-Length:...

   http://www.example.com/Directory-Name-List.grxml
   http://www.example.com/Department-List.grxml
   http://www.example.com/TAC-Contact-List.grxml

   --break
   Content-Type:application/srgs+xml
   Content-ID:<request1@form-level.store>
   Content-Length:...

   <?xml version="1.0"?>

   <!-- the default grammar language is US English -->
   <grammar xmlns="http://www.w3.org/2001/06/grammar"
            xml:lang="en-US" version="1.0">

   <!-- single language attachment to tokens -->
         <rule id="yes">
               <one-of>
                     <item xml:lang="fr-CA">oui</item>
                     <item xml:lang="en-US">yes</item>
               </one-of>
         </rule>

   <!-- single language attachment to a rule expansion -->
         <rule id="request">
               may I speak to
               <one-of xml:lang="fr-CA">
                     <item>Michel Tremblay</item>
                     <item>Andre Roy</item>
               </one-of>
         </rule>

         <!-- multiple language attachment to a token -->
         <rule id="people1">
               <token lexicon="en-US,fr-CA"> Robert </token>
         </rule>

         <!-- the equivalent single-language attachment expansion -->
         <rule id="people2">
               <one-of>
                     <item xml:lang="en-US">Robert</item>
                     <item xml:lang="fr-CA">Robert</item>
               </one-of>
         </rule>

         </grammar>
   --break--

                      Mixed Grammar Reference Example

9.5.2. Recognizer Result Data

Recognition results are returned to the client in the message body of the RECOGNITION-COMPLETE event or the GET-RESULT response message as described in Section 6.3. Element and attribute descriptions for the recognition portion of the NLSML format are provided in Section 9.6 with a normative definition of the schema in Section 16.1.

   Content-Type:application/nlsml+xml
   Content-Length:...

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           xmlns:ex="http://www.example.com/example"
           grammar="http://www.example.com/theYesNoGrammar">
       <interpretation>
           <instance>
                   <ex:response>yes</ex:response>
           </instance>
           <input>OK</input>
       </interpretation>
   </result>

                              Result Example
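A client receiving the result above might extract the matched input and the application-defined instance data as follows. This Python sketch uses the standard library's xml.etree.ElementTree; the function name summarize_result and the shape of the returned dictionaries are hypothetical, and only the elements shown in the example are handled.

```python
import xml.etree.ElementTree as ET

NLSML_NS = "urn:ietf:params:xml:ns:mrcpv2"

NLSML = """<?xml version="1.0"?>
<result xmlns="urn:ietf:params:xml:ns:mrcpv2"
        xmlns:ex="http://www.example.com/example"
        grammar="http://www.example.com/theYesNoGrammar">
    <interpretation>
        <instance>
                <ex:response>yes</ex:response>
        </instance>
        <input>OK</input>
    </interpretation>
</result>"""


def summarize_result(xml_text):
    """Return one dict per <interpretation> in an NLSML result."""
    root = ET.fromstring(xml_text)
    ns = {"nl": NLSML_NS}
    summaries = []
    for interpretation in root.findall("nl:interpretation", ns):
        input_el = interpretation.find("nl:input", ns)
        instance = interpretation.find("nl:instance", ns)
        fields = {}
        if instance is not None:
            for child in instance:
                # Children of <instance> are application defined; keep
                # only the local name, dropping the "{namespace}" prefix.
                local_name = child.tag.rsplit("}", 1)[-1]
                fields[local_name] = (child.text or "").strip()
        summaries.append({
            # Fall back to the grammar named on <result>.
            "grammar": interpretation.get("grammar", root.get("grammar")),
            "input": None if input_el is None else (input_el.text or "").strip(),
            "instance": fields,
        })
    return summaries


results = summarize_result(NLSML)
```

For the example document, results contains a single entry whose input is "OK" and whose instance maps "response" to "yes".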

9.5.3. Enrollment Result Data

Enrollment results are returned to the client in the message body of the RECOGNITION-COMPLETE event as described in Section 6.3. Element and attribute descriptions for the enrollment portion of the NLSML format are provided in Section 9.7 with a normative definition of the schema in Section 16.2.

9.5.4. Recognizer Context Block

When a client changes servers while operating on the behalf of the same incoming communication session, this header field allows the client to collect a block of opaque data from one server and provide it to another server. This capability is desirable if the client needs different language support or because the server issued a redirect. Here, the first recognizer resource may have collected acoustic and other data during its execution of recognition methods. After a server switch, communicating this data may allow the recognizer resource on the new server to provide better recognition. This block of data is implementation specific and MUST be carried as media type 'application/octets' in the body of the message. This block of data is communicated in the SET-PARAMS and GET-PARAMS method/response messages. In the GET-PARAMS method, if an empty Recognizer-Context-Block header field is present, then the recognizer SHOULD return its vendor-specific context block, if any, in the message body as an entity of media type 'application/octets' with a specific Content-ID. The Content-ID value MUST also be specified in the Recognizer-Context-Block header field in the GET-PARAMS response. The SET-PARAMS request wishing to provide this vendor-specific data MUST send it in the message body as a typed entity with the same
   Content-ID that it received from the GET-PARAMS.  The Content-ID MUST
   also be sent in the Recognizer-Context-Block header field of the
   SET-PARAMS message.

   Each speech recognition implementation choosing to use this mechanism
   to hand off recognizer context data among servers MUST distinguish
   its implementation-specific block of data from other implementations
   by choosing a Content-ID that is recognizable among the participating
   servers and unlikely to collide with values chosen by another
   implementation.


