9. Speech Recognizer Resource
The speech recognizer resource receives an incoming voice stream and provides the client with an interpretation of what was spoken in textual form.

The recognizer resource is controlled by MRCPv2 requests from the client. The recognizer resource can both respond to these requests and generate asynchronous events to the client to indicate conditions of interest during the processing of the method.

This section applies to the following resource types:

1. speechrecog
2. dtmfrecog

The difference between the above two resources is in their level of support for recognition grammars. The "dtmfrecog" resource type is capable of recognizing only DTMF digits and hence accepts only DTMF grammars. It only generates barge-in for DTMF inputs and ignores speech. The "speechrecog" resource type can recognize regular speech as well as DTMF digits and hence MUST support grammars describing either speech or DTMF. This resource generates barge-in events for speech and/or DTMF. By analyzing the grammars that are activated by the RECOGNIZE method, it determines if a barge-in should occur for speech and/or DTMF. When the recognizer decides it needs to generate a barge-in, it also generates a START-OF-INPUT event to the client.

The recognizer resource MAY support recognition in the normal or hotword modes or both (although note that a single "speechrecog" resource does not perform normal and hotword mode recognition simultaneously). For implementations where a single recognizer resource does not support both modes, or where simultaneous normal and hotword recognition is desired, the two modes can be invoked through separate resources allocated to the same SIP dialog (with different MRCP session identifiers) that share the RTP audio feed.

The capabilities of the recognizer resource are enumerated below:

Normal Mode Recognition
   Normal mode recognition tries to match all of the speech or DTMF against the grammar and returns a no-match status if the input fails to match or the method times out.

Hotword Mode Recognition
   Hotword mode is where the recognizer looks for a match against a specific speech grammar or DTMF sequence and ignores speech or DTMF that does not match. The recognition completes only if there is a successful grammar match, if the client cancels the request, or if there is a no-input or recognition timeout.

Voice Enrolled Grammars
   A recognizer resource MAY optionally support Voice Enrolled Grammars. With this functionality, enrollment is performed using a person's voice. For example, a list of contacts can be created and maintained by recording each contact's name in the caller's voice. This technique is sometimes also called speaker-dependent recognition.

Interpretation
   A recognizer resource MAY be employed strictly for its natural language interpretation capabilities by supplying it with a text string as input instead of speech. In this mode, the resource takes text as input and produces an "interpretation" of the input according to the supplied grammar.

Voice enrollment has the concept of an enrollment session. A session to add a new phrase to a personal grammar involves the initial enrollment followed by enough repeated utterances before the new phrase is committed to the personal grammar. Each time an utterance is recorded, it is compared for similarity with the other samples, and a clash test is performed against other entries in the personal grammar to ensure there are no similar and confusable entries.

Enrollment is done using a recognizer resource. Controlling which utterances are to be considered for enrollment of a new phrase is done by setting a header field (see Section 9.4.39) in the RECOGNIZE request.

Interpretation is accomplished through the INTERPRET method (Section 9.20) and the Interpret-Text header field (Section 9.4.30).
9.1. Recognizer State Machine
The recognizer resource maintains a state machine to process MRCPv2 requests from the client.

   Idle                   Recognizing               Recognized
   State                  State                     State
    |                       |                          |
    |---------RECOGNIZE---->|---RECOGNITION-COMPLETE-->|
    |<------STOP------------|<-----RECOGNIZE-----------|
    |                       |                          |
    |              |--------|              |-----------|
    |       START-OF-INPUT  |       GET-RESULT         |
    |              |------->|              |---------->|
    |------------|          |                          |
    |      DEFINE-GRAMMAR   |----------|               |
    |<-----------|          | START-INPUT-TIMERS       |
    |                       |<---------|               |
    |------|                |                          |
    |  INTERPRET            |                          |
    |<-----|                |------|                   |
    |                       |   RECOGNIZE              |
    |-------|               |<-----|                   |
    |      STOP             |                          |
    |<------|               |                          |
    |<-------------------STOP--------------------------|
    |<-------------------DEFINE-GRAMMAR----------------|

                     Recognizer State Machine

If a recognizer resource supports voice enrolled grammars, starting an enrollment session does not change the state of the recognizer resource. Once an enrollment session is started, then utterances are enrolled by calling the RECOGNIZE method repeatedly. The state of the speech recognizer resource goes from IDLE to RECOGNIZING state each time RECOGNIZE is called.

9.2. Recognizer Methods
The recognizer supports the following methods.

   recognizer-method    =  recog-only-method
                        /  enrollment-method

   recog-only-method    =  "DEFINE-GRAMMAR"
                        /  "RECOGNIZE"
                        /  "INTERPRET"
                        /  "GET-RESULT"
                        /  "START-INPUT-TIMERS"
                        /  "STOP"

It is OPTIONAL for a recognizer resource to support voice enrolled grammars. If the recognizer resource does support voice enrolled grammars, it MUST support the following methods.

   enrollment-method    =  "START-PHRASE-ENROLLMENT"
                        /  "ENROLLMENT-ROLLBACK"
                        /  "END-PHRASE-ENROLLMENT"
                        /  "MODIFY-PHRASE"
                        /  "DELETE-PHRASE"

9.3. Recognizer Events
The recognizer can generate the following events.

   recognizer-event     =  "START-OF-INPUT"
                        /  "RECOGNITION-COMPLETE"
                        /  "INTERPRETATION-COMPLETE"

9.4. Recognizer Header Fields
A recognizer message can contain header fields containing request options and information to augment the Method, Response, or Event message it is associated with.

   recognizer-header    =  recog-only-header
                        /  enrollment-header

   recog-only-header    =  confidence-threshold
                        /  sensitivity-level
                        /  speed-vs-accuracy
                        /  n-best-list-length
                        /  no-input-timeout
                        /  input-type
                        /  recognition-timeout
                        /  waveform-uri
                        /  input-waveform-uri
                        /  completion-cause
                        /  completion-reason
                        /  recognizer-context-block
                        /  start-input-timers
                        /  speech-complete-timeout
                        /  speech-incomplete-timeout
                        /  dtmf-interdigit-timeout
                        /  dtmf-term-timeout
                        /  dtmf-term-char
                        /  failed-uri
                        /  failed-uri-cause
                        /  save-waveform
                        /  media-type
                        /  new-audio-channel
                        /  speech-language
                        /  ver-buffer-utterance
                        /  recognition-mode
                        /  cancel-if-queue
                        /  hotword-max-duration
                        /  hotword-min-duration
                        /  interpret-text
                        /  dtmf-buffer-time
                        /  clear-dtmf-buffer
                        /  early-no-match

If a recognizer resource supports voice enrolled grammars, the following header fields are also used.

   enrollment-header    =  num-min-consistent-pronunciations
                        /  consistency-threshold
                        /  clash-threshold
                        /  personal-grammar-uri
                        /  enroll-utterance
                        /  phrase-id
                        /  phrase-nl
                        /  weight
                        /  save-best-waveform
                        /  new-phrase-id
                        /  confusable-phrases-uri
                        /  abort-phrase-enrollment

For enrollment-specific header fields that can appear as part of SET-PARAMS or GET-PARAMS methods, the following general rule applies: the START-PHRASE-ENROLLMENT method MUST be invoked before these header fields may be set through the SET-PARAMS method or retrieved through the GET-PARAMS method.

Note that the Waveform-URI header field of the Recognizer resource can also appear in the response to the END-PHRASE-ENROLLMENT method.
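As an illustration only, the two header groups above can be captured in a small lookup that a client might use to reject enrollment-specific header fields in SET-PARAMS or GET-PARAMS before an enrollment session has started. This is a minimal sketch: the function names are hypothetical, the names are taken from the ABNF rule names rather than from the wire spelling of each header field, and the comparison is case-insensitive on the assumption that header field names are not case-sensitive.

```python
from typing import Optional

# Rule names copied from the recog-only-header ABNF above.
RECOG_ONLY_HEADERS = frozenset({
    "confidence-threshold", "sensitivity-level", "speed-vs-accuracy",
    "n-best-list-length", "no-input-timeout", "input-type",
    "recognition-timeout", "waveform-uri", "input-waveform-uri",
    "completion-cause", "completion-reason", "recognizer-context-block",
    "start-input-timers", "speech-complete-timeout",
    "speech-incomplete-timeout", "dtmf-interdigit-timeout",
    "dtmf-term-timeout", "dtmf-term-char", "failed-uri", "failed-uri-cause",
    "save-waveform", "media-type", "new-audio-channel", "speech-language",
    "ver-buffer-utterance", "recognition-mode", "cancel-if-queue",
    "hotword-max-duration", "hotword-min-duration", "interpret-text",
    "dtmf-buffer-time", "clear-dtmf-buffer", "early-no-match",
})

# Rule names copied from the enrollment-header ABNF above.
ENROLLMENT_HEADERS = frozenset({
    "num-min-consistent-pronunciations", "consistency-threshold",
    "clash-threshold", "personal-grammar-uri", "enroll-utterance",
    "phrase-id", "phrase-nl", "weight", "save-best-waveform",
    "new-phrase-id", "confusable-phrases-uri", "abort-phrase-enrollment",
})


def classify_header(name: str) -> Optional[str]:
    """Return "recog-only", "enrollment", or None for an unknown name."""
    key = name.strip().lower()  # assumption: names compare case-insensitively
    if key in RECOG_ONLY_HEADERS:
        return "recog-only"
    if key in ENROLLMENT_HEADERS:
        return "enrollment"
    return None


def allowed_in_params(name: str, enrollment_started: bool) -> bool:
    """Apply the general rule: enrollment header fields may be used in
    SET-PARAMS or GET-PARAMS only after START-PHRASE-ENROLLMENT."""
    kind = classify_header(name)
    if kind is None:
        return False
    return kind == "recog-only" or enrollment_started
```

The lookup does not check whether a given header field is actually permitted in SET-PARAMS or GET-PARAMS; that is stated per header field in the subsections that follow.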
9.4.1. Confidence-Threshold
When a recognizer resource recognizes or matches a spoken phrase with some portion of the grammar, it associates a confidence level with that match. The Confidence-Threshold header field tells the recognizer resource what confidence level the client considers a successful match. This is a float value between 0.0 and 1.0 indicating the recognizer's confidence in the recognition. If the recognizer determines that there is no candidate match with a confidence that is greater than the confidence threshold, then it MUST return no-match as the recognition result. This header field MAY occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS. The default value for this header field is implementation specific, as is the interpretation of any specific value for this header field. Although values for servers from different vendors are not comparable, it is expected that clients will tune this value over time for a given server.

   confidence-threshold     =  "Confidence-Threshold" ":" FLOAT CRLF

9.4.2. Sensitivity-Level
To filter out background noise and not mistake it for speech, the recognizer resource supports a variable level of sound sensitivity. The Sensitivity-Level header field is a float value between 0.0 and 1.0 and allows the client to set the sensitivity level for the recognizer. This header field MAY occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS. A higher value for this header field means higher sensitivity. The default value for this header field is implementation specific, as is the interpretation of any specific value for this header field. Although values for servers from different vendors are not comparable, it is expected that clients will tune this value over time for a given server.

   sensitivity-level        =  "Sensitivity-Level" ":" FLOAT CRLF

9.4.3. Speed-Vs-Accuracy
Depending on the implementation and capability of the recognizer resource, it may be tunable towards performance or accuracy. Higher accuracy may mean more processing and higher CPU utilization, meaning fewer active sessions per server, and vice versa. The value is a float between 0.0 and 1.0. A value of 0.0 means fastest recognition. A value of 1.0 means best accuracy. This header field MAY occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS. The default value for this header field is implementation specific. Although values for servers from different vendors are not comparable, it is expected that clients will tune this value over time for a given server.

   speed-vs-accuracy        =  "Speed-Vs-Accuracy" ":" FLOAT CRLF

9.4.4. N-Best-List-Length
When the recognizer matches an incoming stream with the grammar, it may come up with more than one alternative match because of confidence levels in certain words or conversation paths. If this header field is not specified, by default, the recognizer resource returns only the best match above the confidence threshold. The client, by setting this header field, can ask the recognition resource to send it more than one alternative. All alternatives must still be above the Confidence-Threshold. A value greater than one does not guarantee that the recognizer will provide the requested number of alternatives. This header field MAY occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS. The minimum value for this header field is 1. The default value for this header field is 1.

   n-best-list-length       =  "N-Best-List-Length" ":" 1*19DIGIT CRLF

9.4.5. Input-Type
When the recognizer detects barge-in-able input and generates a START-OF-INPUT event, that event MUST carry this header field to specify whether the input that caused the barge-in was DTMF or speech.

   input-type               =  "Input-Type" ":" inputs CRLF
   inputs                   =  "speech" / "dtmf"

9.4.6. No-Input-Timeout
When recognition is started and there is no speech detected for a certain period of time, the recognizer can send a RECOGNITION-COMPLETE event to the client with a Completion-Cause of "no-input-timeout" and terminate the recognition operation. The client can use the No-Input-Timeout header field to set this timeout. The value is in milliseconds and can range from 0 to an implementation-specific maximum value. This header field MAY occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS. The default value is implementation specific.

   no-input-timeout         =  "No-Input-Timeout" ":" 1*19DIGIT CRLF
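Several header fields in this section share the same value syntax (1*19DIGIT, interpreted as milliseconds). The following is a minimal sketch of how a client might validate and serialize such a value. The helper names are hypothetical, and the serialized form follows the ABNF literally, with no whitespace after the colon.

```python
import re

# 1*19DIGIT: between one and nineteen ASCII decimal digits.
_DIGITS_1_19 = re.compile(r"[0-9]{1,19}")


def parse_timeout_ms(value: str) -> int:
    """Parse a millisecond value such as the one in No-Input-Timeout."""
    text = value.strip()
    if _DIGITS_1_19.fullmatch(text) is None:
        raise ValueError(f"not a 1*19DIGIT value: {value!r}")
    return int(text)


def format_timeout_header(name: str, milliseconds: int) -> str:
    """Serialize a header line; rejects values outside 1*19DIGIT."""
    if milliseconds < 0 or len(str(milliseconds)) > 19:
        raise ValueError("value must fit in 1*19DIGIT")
    return f"{name}:{milliseconds}\r\n"
```

The upper bound actually accepted by a server is implementation specific, so a value that passes this check can still be rejected.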
9.4.7. Recognition-Timeout
When recognition is started and there is no match for a certain period of time, the recognizer can send a RECOGNITION-COMPLETE event to the client and terminate the recognition operation. The Recognition-Timeout header field allows the client to set this timeout value. The value is in milliseconds. The value for this header field ranges from 0 to an implementation-specific maximum value. The default value is 10 seconds. This header field MAY occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS.

   recognition-timeout      =  "Recognition-Timeout" ":" 1*19DIGIT CRLF

9.4.8. Waveform-URI
If the Save-Waveform header field is set to "true", the recognizer MUST record the incoming audio stream of the recognition into a stored form and provide a URI for the client to access it. This header field MUST be present in the RECOGNITION-COMPLETE event if the Save-Waveform header field was set to "true". The value of the header field MUST be empty if there was some error condition preventing the server from recording. Otherwise, the URI generated by the server MUST be unambiguous across the server and all its recognition sessions. The content associated with the URI MUST be available to the client until the MRCPv2 session terminates.

Similarly, if the Save-Best-Waveform header field is set to "true", the recognizer MUST save the audio stream for the best repetition of the phrase that was used during the enrollment session. The recognizer MUST then record the recognized audio and make it available to the client by returning a URI in the Waveform-URI header field in the response to the END-PHRASE-ENROLLMENT method. The value of the header field MUST be empty if there was some error condition preventing the server from recording. Otherwise, the URI generated by the server MUST be unambiguous across the server and all its recognition sessions. The content associated with the URI MUST be available to the client until the MRCPv2 session terminates. See the discussion on the sensitivity of saved waveforms in Section 12.

The server MUST also return the size in octets and the duration in milliseconds of the recorded audio waveform as parameters associated with the header field.

   waveform-uri             =  "Waveform-URI" ":" ["<" uri ">"
                               ";" "size" "=" 1*19DIGIT
                               ";" "duration" "=" 1*19DIGIT] CRLF
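To make the value syntax concrete, here is a minimal sketch of a client-side parser for the Waveform-URI value. The type and function names are hypothetical and the example URI in the usage note is invented. The parser treats an empty value as "recording failed", as described above, and does not validate the URI itself.

```python
import re
from typing import NamedTuple, Optional


class Waveform(NamedTuple):
    uri: str
    size_octets: int
    duration_ms: int


# "<" uri ">" ";" "size" "=" 1*19DIGIT ";" "duration" "=" 1*19DIGIT
_WAVEFORM = re.compile(
    r"<([^<>]+)>;size=([0-9]{1,19});duration=([0-9]{1,19})"
)


def parse_waveform_uri(value: str) -> Optional[Waveform]:
    """Return the parsed waveform, or None when the value is empty
    (the server could not record the audio)."""
    text = value.strip()
    if text == "":
        return None
    match = _WAVEFORM.fullmatch(text)
    if match is None:
        raise ValueError(f"malformed Waveform-URI value: {value!r}")
    return Waveform(match.group(1), int(match.group(2)), int(match.group(3)))
```

For example, `parse_waveform_uri("<http://media.example.com/rec/1.wav>;size=11434;duration=2500")` yields a `Waveform` whose `size_octets` is 11434 and whose `duration_ms` is 2500.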
9.4.9. Media-Type
This header field MAY be specified in the SET-PARAMS, GET-PARAMS, or the RECOGNIZE methods and tells the server resource the media type in which to store captured audio or video, such as the one captured and returned by the Waveform-URI header field.

   media-type               =  "Media-Type" ":" media-type-value CRLF

9.4.10. Input-Waveform-URI
This optional header field specifies a URI pointing to audio content to be processed by the RECOGNIZE operation. This enables the client to request recognition from a specified buffer or audio file.

   input-waveform-uri       =  "Input-Waveform-URI" ":" uri CRLF

9.4.11. Completion-Cause
This header field MUST be part of a RECOGNITION-COMPLETE event coming from the recognizer resource to the client. It indicates the reason behind the RECOGNIZE method completion. This header field MUST be sent in the DEFINE-GRAMMAR and RECOGNIZE responses, if they return with a failure status and a COMPLETE state. In the ABNF below, the cause-code contains a numerical value selected from the Cause-Code column of the following table. The cause-name contains the corresponding token selected from the Cause-Name column.

   completion-cause         =  "Completion-Cause" ":" cause-code SP
                               cause-name CRLF
   cause-code               =  3DIGIT
   cause-name               =  *VCHAR
+------------+-----------------------+------------------------------+
| Cause-Code | Cause-Name            | Description                  |
+------------+-----------------------+------------------------------+
| 000        | success               | RECOGNIZE completed with a   |
|            |                       | match or DEFINE-GRAMMAR      |
|            |                       | succeeded in downloading and |
|            |                       | compiling the grammar.       |
|            |                       |                              |
| 001        | no-match              | RECOGNIZE completed, but no  |
|            |                       | match was found.             |
|            |                       |                              |
| 002        | no-input-timeout      | RECOGNIZE completed without  |
|            |                       | a match due to a             |
|            |                       | no-input-timeout.            |
|            |                       |                              |
| 003        | hotword-maxtime       | RECOGNIZE in hotword mode    |
|            |                       | completed without a match    |
|            |                       | due to a                     |
|            |                       | recognition-timeout.         |
|            |                       |                              |
| 004        | grammar-load-failure  | RECOGNIZE failed due to      |
|            |                       | grammar load failure.        |
|            |                       |                              |
| 005        | grammar-compilation-  | RECOGNIZE failed due to      |
|            | failure               | grammar compilation failure. |
|            |                       |                              |
| 006        | recognizer-error      | RECOGNIZE request terminated |
|            |                       | prematurely due to a         |
|            |                       | recognizer error.            |
|            |                       |                              |
| 007        | speech-too-early      | RECOGNIZE request terminated |
|            |                       | because speech was too       |
|            |                       | early. This happens when the |
|            |                       | audio stream is already      |
|            |                       | "in-speech" when the         |
|            |                       | RECOGNIZE request was        |
|            |                       | received.                    |
|            |                       |                              |
| 008        | success-maxtime       | RECOGNIZE request terminated |
|            |                       | because speech was too long  |
|            |                       | but whatever was spoken till |
|            |                       | that point was a full match. |
|            |                       |                              |
| 009        | uri-failure           | Failure accessing a URI.     |
|            |                       |                              |
| 010        | language-unsupported  | Language not supported.      |
|            |                       |                              |
| 011        | cancelled             | A new RECOGNIZE cancelled    |
|            |                       | this one, or a prior         |
|            |                       | RECOGNIZE failed while this  |
|            |                       | one was still in the queue.  |
|            |                       |                              |
| 012        | semantics-failure     | Recognition succeeded, but   |
|            |                       | semantic interpretation of   |
|            |                       | the recognized input failed. |
|            |                       | The RECOGNITION-COMPLETE     |
|            |                       | event MUST contain the       |
|            |                       | Recognition result with only |
|            |                       | input text and no            |
|            |                       | interpretation.              |
|            |                       |                              |
| 013        | partial-match         | Speech Incomplete Timeout    |
|            |                       | expired before there was a   |
|            |                       | full match. But whatever was |
|            |                       | spoken till that point was a |
|            |                       | partial match to one or more |
|            |                       | grammars.                    |
|            |                       |                              |
| 014        | partial-match-maxtime | The Recognition-Timeout      |
|            |                       | expired before full match    |
|            |                       | was achieved. But whatever   |
|            |                       | was spoken till that point   |
|            |                       | was a partial match to one   |
|            |                       | or more grammars.            |
|            |                       |                              |
| 015        | no-match-maxtime      | The Recognition-Timeout      |
|            |                       | expired. Whatever was spoken |
|            |                       | till that point did not      |
|            |                       | match any of the grammars.   |
|            |                       | This cause could also be     |
|            |                       | returned if the recognizer   |
|            |                       | does not support detecting   |
|            |                       | partial grammar matches.     |
|            |                       |                              |
| 016        | grammar-definition-   | Any DEFINE-GRAMMAR error     |
|            | failure               | other than                   |
|            |                       | grammar-load-failure and     |
|            |                       | grammar-compilation-failure. |
+------------+-----------------------+------------------------------+
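The cause table lends itself to a direct lookup. The sketch below parses and formats a Completion-Cause value and checks that the code and name agree with the table. The function names are hypothetical, and rejecting unknown or mismatched pairs is a design assumption of this sketch; a deployed client might prefer to tolerate codes it does not recognize.

```python
import re
from typing import Tuple

# Cause-Code -> Cause-Name, copied from the table above.
COMPLETION_CAUSES = {
    0: "success", 1: "no-match", 2: "no-input-timeout",
    3: "hotword-maxtime", 4: "grammar-load-failure",
    5: "grammar-compilation-failure", 6: "recognizer-error",
    7: "speech-too-early", 8: "success-maxtime", 9: "uri-failure",
    10: "language-unsupported", 11: "cancelled", 12: "semantics-failure",
    13: "partial-match", 14: "partial-match-maxtime",
    15: "no-match-maxtime", 16: "grammar-definition-failure",
}

# cause-code SP cause-name, with cause-code = 3DIGIT and cause-name = *VCHAR
_CAUSE = re.compile(r"([0-9]{3}) ([\x21-\x7e]*)")


def parse_completion_cause(value: str) -> Tuple[int, str]:
    """Parse e.g. "001 no-match" into (1, "no-match")."""
    match = _CAUSE.fullmatch(value.strip())
    if match is None:
        raise ValueError(f"malformed Completion-Cause value: {value!r}")
    code, name = int(match.group(1)), match.group(2)
    if COMPLETION_CAUSES.get(code) != name:
        raise ValueError(f"code and name do not match the table: {value!r}")
    return code, name


def format_completion_cause(code: int) -> str:
    """Serialize a header line with a zero-padded three-digit code."""
    return f"Completion-Cause:{code:03d} {COMPLETION_CAUSES[code]}\r\n"
```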
9.4.12. Completion-Reason
This header field MAY be specified in a RECOGNITION-COMPLETE event coming from the recognizer resource to the client. This contains the reason text behind the RECOGNIZE request completion. The server uses this header field to communicate text describing the reason for the failure, such as the specific error encountered in parsing a grammar markup.

The completion reason text is provided for client use in logs and for debugging and instrumentation purposes. Clients MUST NOT interpret the completion reason text.

   completion-reason        =  "Completion-Reason" ":" quoted-string CRLF

9.4.13. Recognizer-Context-Block
This header field MAY be sent as part of the SET-PARAMS or GET-PARAMS request. If the GET-PARAMS method contains this header field with no value, then it is a request to the recognizer to return the recognizer context block. The response to such a message MAY contain a recognizer context block as a typed media message body. If the server returns a recognizer context block, the response MUST contain this header field and its value MUST match the Content-ID of the corresponding media block.

If the SET-PARAMS method contains this header field, it MUST also contain a message body containing the recognizer context data and a Content-ID matching this header field value. This Content-ID MUST match the Content-ID that came with the context data during the GET-PARAMS operation.

An implementation choosing to use this mechanism to hand off recognizer context data between servers MUST distinguish its implementation-specific block of data by using an IANA-registered content type in the IANA Media Type vendor tree.

   recognizer-context-block =  "Recognizer-Context-Block" ":"
                               [1*VCHAR] CRLF

9.4.14. Start-Input-Timers
This header field MAY be sent as part of the RECOGNIZE request. A value of false tells the recognizer to start recognition but not to start the no-input timer yet. The recognizer MUST NOT start the timers until the client sends a START-INPUT-TIMERS request to the recognizer. This is useful in the scenario when the recognizer and synthesizer engines are not part of the same session. In such configurations, when a kill-on-barge-in prompt is being played (see Section 8.4.2), the client wants the RECOGNIZE request to be simultaneously active so that it can detect and implement kill-on-barge-in. However, the recognizer SHOULD NOT start the no-input timers until the prompt is finished. The default value is "true".

   start-input-timers       =  "Start-Input-Timers" ":" BOOLEAN CRLF

9.4.15. Speech-Complete-Timeout
This header field specifies the length of silence required following user speech before the speech recognizer finalizes a result (either accepting it or generating a no-match result). The Speech-Complete-Timeout value applies when the recognizer currently has a complete match against an active grammar, and specifies how long the recognizer MUST wait for more input before declaring a match. By contrast, the Speech-Incomplete-Timeout is used when the speech is an incomplete match to an active grammar. The value is in milliseconds.

   speech-complete-timeout  =  "Speech-Complete-Timeout" ":"
                               1*19DIGIT CRLF

A long Speech-Complete-Timeout value delays the result to the client and therefore makes the application's response to a user slow. A short Speech-Complete-Timeout may lead to an utterance being broken up inappropriately. Reasonable speech complete timeout values are typically in the range of 0.3 seconds to 1.0 seconds. The value for this header field ranges from 0 to an implementation-specific maximum value. The default value for this header field is implementation specific. This header field MAY occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS.

9.4.16. Speech-Incomplete-Timeout
This header field specifies the required length of silence following user speech after which a recognizer finalizes a result. The incomplete timeout applies when the speech prior to the silence is an incomplete match of all active grammars. In this case, once the timeout is triggered, the partial result is rejected (with a Completion-Cause of "partial-match"). The value is in milliseconds. The value for this header field ranges from 0 to an implementation-specific maximum value. The default value for this header field is implementation specific.

   speech-incomplete-timeout =  "Speech-Incomplete-Timeout" ":"
                                1*19DIGIT CRLF

The Speech-Incomplete-Timeout also applies when the speech prior to the silence is a complete match of an active grammar, but where it is possible to speak further and still match the grammar. By contrast, the Speech-Complete-Timeout is used when the speech is a complete match to an active grammar and no further spoken words can continue to represent a match.

A long Speech-Incomplete-Timeout value delays the result to the client and therefore makes the application's response to a user slow. A short Speech-Incomplete-Timeout may lead to an utterance being broken up inappropriately.

The Speech-Incomplete-Timeout is usually longer than the Speech-Complete-Timeout to allow users to pause mid-utterance (for example, to breathe). This header field MAY occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS.

9.4.17. DTMF-Interdigit-Timeout
This header field specifies the inter-digit timeout value to use when recognizing DTMF input. The value is in milliseconds. The value for this header field ranges from 0 to an implementation-specific maximum value. The default value is 5 seconds. This header field MAY occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS.

   dtmf-interdigit-timeout  =  "DTMF-Interdigit-Timeout" ":"
                               1*19DIGIT CRLF

9.4.18. DTMF-Term-Timeout
This header field specifies the terminating timeout to use when recognizing DTMF input. The DTMF-Term-Timeout applies only when no additional input is allowed by the grammar; otherwise, the DTMF-Interdigit-Timeout applies. The value is in milliseconds. The value for this header field ranges from 0 to an implementation-specific maximum value. The default value is 10 seconds. This header field MAY occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS.

   dtmf-term-timeout        =  "DTMF-Term-Timeout" ":" 1*19DIGIT CRLF

9.4.19. DTMF-Term-Char
This header field specifies the terminating DTMF character for DTMF input recognition. The default value is NULL, which is indicated by an empty header field value. This header field MAY occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS.

   dtmf-term-char           =  "DTMF-Term-Char" ":" VCHAR CRLF
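The interaction between the three DTMF header fields above (DTMF-Interdigit-Timeout, DTMF-Term-Timeout, and DTMF-Term-Char) can be sketched as a small simulation over timestamped key presses. This is illustrative only and rests on several assumptions: the `more_input_allowed` callback is a hypothetical stand-in for the active grammar, the terminating character is excluded from the collected digits, and keys arriving after the grammar is already complete are simply appended rather than treated as a no-match. The defaults mirror the 5-second and 10-second defaults stated above.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def collect_dtmf(
    events: Iterable[Tuple[int, str]],
    more_input_allowed: Callable[[str], bool],
    interdigit_timeout_ms: int = 5000,
    term_timeout_ms: int = 10000,
    term_char: Optional[str] = None,
) -> Tuple[str, str]:
    """Collect keys from (time_ms, key) events in time order.

    Returns (digits, reason), where reason is "timeout", "term-char",
    or "end-of-events".
    """
    digits: List[str] = []
    last_time: Optional[int] = None
    for time_ms, key in events:
        if last_time is not None:
            collected = "".join(digits)
            # The interdigit timeout applies while the grammar allows more
            # input; the terminating timeout applies once it does not.
            if more_input_allowed(collected):
                limit = interdigit_timeout_ms
            else:
                limit = term_timeout_ms
            if time_ms - last_time > limit:
                return collected, "timeout"
        if term_char is not None and key == term_char:
            return "".join(digits), "term-char"
        digits.append(key)
        last_time = time_ms
    return "".join(digits), "end-of-events"
```

For a hypothetical four-digit PIN grammar, `more_input_allowed` could be `lambda d: len(d) < 4`. The no-input timer that runs before the first key is outside the scope of this sketch.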
9.4.20. Failed-URI
When a recognizer needs to fetch or access a URI and the access fails, the server SHOULD provide the failed URI in this header field in the method response, unless there are multiple URI failures, in which case one of the failed URIs MUST be provided in this header field in the method response.

   failed-uri               =  "Failed-URI" ":" absoluteURI CRLF

9.4.21. Failed-URI-Cause
When a recognizer method needs a recognizer to fetch or access a URI and the access fails, the server MUST provide the URI-specific or protocol-specific response code for the URI in the Failed-URI header field through this header field in the method response. The value encoding is UTF-8 (RFC 3629 [RFC3629]) to accommodate any access protocol, some of which might have a response string instead of a numeric response code.

   failed-uri-cause         =  "Failed-URI-Cause" ":" 1*UTFCHAR CRLF

9.4.22. Save-Waveform
This header field allows the client to request the recognizer resource to save the audio input to the recognizer. The recognizer resource MUST then attempt to record the recognized audio, without endpointing, and make it available to the client in the form of a URI returned in the Waveform-URI header field in the RECOGNITION-COMPLETE event. If there was an error in recording the stream or the audio content is otherwise not available, the recognizer MUST return an empty Waveform-URI header field. The default value for this field is "false". This header field MAY occur in RECOGNIZE, SET-PARAMS, or GET-PARAMS. See the discussion on the sensitivity of saved waveforms in Section 12.

   save-waveform            =  "Save-Waveform" ":" BOOLEAN CRLF

9.4.23. New-Audio-Channel
This header field MAY be specified in a RECOGNIZE request and allows the client to tell the server that, from this point on, further input audio comes from a different audio source, channel, or speaker. If the recognizer resource had collected any input statistics or adaptation state, the recognizer resource MUST do what is appropriate for the specific recognition technology, which includes but is not limited to discarding any collected input statistics or adaptation state before starting the RECOGNIZE request. Note that if there are multiple resources that are sharing a media stream and are collecting or using this data, and the client issues this header field to one of the resources, the reset operation applies to all resources that use the shared media stream. This helps in a number of use cases, including where the client wishes to reuse an open recognition session with an existing media session for multiple telephone calls.

   new-audio-channel        =  "New-Audio-Channel" ":" BOOLEAN CRLF

9.4.24. Speech-Language
This header field specifies the language of recognition grammar data within a session or request, if it is not specified within the data. The value of this header field MUST follow RFC 5646 [RFC5646]. This header field MAY occur in DEFINE-GRAMMAR, RECOGNIZE, SET-PARAMS, or GET-PARAMS requests.

   speech-language          =  "Speech-Language" ":" 1*VCHAR CRLF

9.4.25. Ver-Buffer-Utterance
This header field lets the client request the server to buffer the utterance associated with this recognition request into a buffer available to a co-resident verifier resource. The buffer is shared across resources within a session and is allocated when a verifier resource is added to this session. The client MUST NOT send this header field unless a verifier resource is instantiated for the session. The buffer is released when the verifier resource is released from the session.

9.4.26. Recognition-Mode
This header field specifies what mode the RECOGNIZE method will operate in. The value choices are "normal" or "hotword". If the value is "normal", the RECOGNIZE starts matching speech and DTMF to the grammars specified in the RECOGNIZE request. If any portion of the speech does not match the grammar, the RECOGNIZE command completes with a no-match status. Timers may be active to detect speech in the audio (see Section 9.4.14), so the RECOGNIZE method may complete because of a timeout waiting for speech. If the value of this header field is "hotword", the RECOGNIZE method operates in hotword mode, where it only looks for the particular keywords or DTMF sequences specified in the grammar and ignores silence or other speech in the audio stream. The default value for this header field is "normal". This header field MAY occur on the RECOGNIZE method.

   recognition-mode         =  "Recognition-Mode" ":"
                               "normal" / "hotword" CRLF

9.4.27. Cancel-If-Queue
This header field specifies what will happen if the client attempts to invoke another RECOGNIZE method when this RECOGNIZE request is already in progress for the resource. The value for this header field is a Boolean. A value of "true" means the server MUST terminate this RECOGNIZE request, with a Completion-Cause of "cancelled", if the client issues another RECOGNIZE request for the same resource. A value of "false" for this header field indicates to the server that this RECOGNIZE request will continue to completion, and if the client issues more RECOGNIZE requests to the same resource, they are queued. When the currently active RECOGNIZE request is stopped or completes with a successful match, the first RECOGNIZE method in the queue becomes active. If the current RECOGNIZE fails, all RECOGNIZE methods in the pending queue are cancelled, and each generates a RECOGNITION-COMPLETE event with a Completion-Cause of "cancelled". This header field MUST be present in every RECOGNIZE request. There is no default value.

   cancel-if-queue          =  "Cancel-If-Queue" ":" BOOLEAN CRLF

9.4.28. Hotword-Max-Duration
This header field MAY be sent in a hotword mode RECOGNIZE request. It specifies the maximum length of an utterance that will be considered for hotword recognition. This header field, along with Hotword-Min-Duration, can be used to tune performance by preventing the recognizer from evaluating utterances that are too short or too long to be one of the hotwords in the grammar(s). The value is in milliseconds. The default value is implementation dependent. If present in a RECOGNIZE request specifying a mode other than "hotword", the header field is ignored.

   hotword-max-duration     =  "Hotword-Max-Duration" ":" 1*19DIGIT
                               CRLF

9.4.29. Hotword-Min-Duration
This header field MAY be sent in a hotword mode RECOGNIZE request. It specifies the minimum length of an utterance that will be considered for hotword recognition. This header field, along with Hotword-Max-Duration, can be used to tune performance by preventing the recognizer from evaluating utterances that are too short or too long to be one of the hotwords in the grammar(s). The value is in milliseconds. The default value is implementation dependent. If present in a RECOGNIZE request specifying a mode other than "hotword", the header field is ignored.

   hotword-min-duration     =  "Hotword-Min-Duration" ":" 1*19DIGIT
                               CRLF

9.4.30. Interpret-Text
The value of this header field is used to provide a pointer to the text for which a natural language interpretation is desired. The value is either a URI or text. If the value is a URI, it MUST be a Content-ID that refers to an entity of type 'text/plain' in the body of the message. Otherwise, the server MUST treat the value as the text to be interpreted. This header field MUST be used when invoking the INTERPRET method.

   interpret-text           =  "Interpret-Text" ":" 1*VCHAR CRLF

9.4.31. DTMF-Buffer-Time
This header field MAY be specified in a GET-PARAMS or SET-PARAMS method and is used to specify the amount of time, in milliseconds, of the type-ahead buffer for the recognizer. This is the buffer that collects DTMF digits as they are pressed even when there is no RECOGNIZE command active. When a subsequent RECOGNIZE method is received, the recognizer MUST look to this buffer to match the RECOGNIZE request. If the digits in the buffer are not sufficient, the recognizer can continue to listen for more digits to match the grammar. The default size of this DTMF buffer is platform specific.
dtmf-buffer-time = "DTMF-Buffer-Time" ":" 1*19DIGIT CRLF
9.4.32. Clear-DTMF-Buffer
This header field MAY be specified in a RECOGNIZE method and is used to tell the recognizer to clear the DTMF type-ahead buffer before starting the RECOGNIZE. The default value of this header field is "false", which does not clear the type-ahead buffer before starting the RECOGNIZE method. If this header field is specified to be "true", then the RECOGNIZE will clear the DTMF buffer before starting recognition. This means digits pressed by the caller before the RECOGNIZE command was issued are discarded.
clear-dtmf-buffer = "Clear-DTMF-Buffer" ":" BOOLEAN CRLF
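The interaction between DTMF-Buffer-Time and Clear-DTMF-Buffer can be sketched as follows. This is an illustration only, not a normative algorithm: the class name TypeAheadBuffer, its methods, and the reading of DTMF-Buffer-Time as an age limit on buffered digits are assumptions made for the example.

```python
from collections import deque


class TypeAheadBuffer:
    """Hypothetical DTMF type-ahead buffer; not defined by MRCPv2."""

    def __init__(self, buffer_time_ms):
        # Assumption: DTMF-Buffer-Time bounds how old a buffered digit may be.
        self.buffer_time_ms = buffer_time_ms
        self._digits = deque()  # (timestamp_ms, digit) pairs, oldest first

    def press(self, digit, now_ms):
        """Record a digit pressed while no RECOGNIZE is active."""
        self._digits.append((now_ms, digit))

    def start_recognize(self, now_ms, clear_dtmf_buffer=False):
        """Return the buffered digits a new RECOGNIZE starts matching with."""
        if clear_dtmf_buffer:
            # Clear-DTMF-Buffer: true discards everything typed ahead.
            self._digits.clear()
            return ""
        # Drop digits that have aged out of the buffer window.
        while self._digits and now_ms - self._digits[0][0] > self.buffer_time_ms:
            self._digits.popleft()
        digits = "".join(digit for _, digit in self._digits)
        self._digits.clear()  # the digits are consumed by this RECOGNIZE
        return digits


buf = TypeAheadBuffer(buffer_time_ms=5000)
buf.press("1", now_ms=0)
buf.press("2", now_ms=4000)
# At 6000 ms the first digit is older than the window and is dropped.
print(buf.start_recognize(now_ms=6000))
```

If the returned digits do not yet satisfy the grammar, the recognizer would keep listening for further digits, as described for DTMF-Buffer-Time.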
9.4.33. Early-No-Match
This header field MAY be specified in a RECOGNIZE method and is used to tell the recognizer that it MUST NOT wait for the end of speech before processing the collected speech to match active grammars. A value of "true" indicates the recognizer MUST do early matching. The default value for this header field if not specified is "false". If the recognizer does not support the processing of the collected audio before the end of speech, this header field can be safely ignored.
early-no-match = "Early-No-Match" ":" BOOLEAN CRLF
9.4.34. Num-Min-Consistent-Pronunciations
This header field MAY be specified in a START-PHRASE-ENROLLMENT, SET-PARAMS, or GET-PARAMS method and is used to specify the minimum number of consistent pronunciations that must be obtained to voice enroll a new phrase. The minimum value is 1. The default value is implementation specific and MAY be greater than 1.
num-min-consistent-pronunciations = "Num-Min-Consistent-Pronunciations" ":" 1*19DIGIT CRLF
9.4.35. Consistency-Threshold
This header field MAY be sent as part of the START-PHRASE-ENROLLMENT, SET-PARAMS, or GET-PARAMS method. Used during voice enrollment, this header field specifies how similar to a previously enrolled pronunciation of the same phrase an utterance needs to be in order to be considered "consistent". The higher the threshold, the closer the match between an utterance and previous pronunciations must be for the pronunciation to be considered consistent. The range for this threshold is a float value between 0.0 and 1.0. The default value for this header field is implementation specific.
consistency-threshold = "Consistency-Threshold" ":" FLOAT CRLF
9.4.36. Clash-Threshold
This header field MAY be sent as part of the START-PHRASE-ENROLLMENT, SET-PARAMS, or GET-PARAMS method. Used during voice enrollment, this header field specifies how similar the pronunciations of two different phrases can be before they are considered to be clashing. For example, pronunciations of phrases such as "John Smith" and "Jon Smits" may be so similar that they are difficult to distinguish correctly. A smaller threshold reduces the number of clashes detected. The range for this threshold is a float value between 0.0 and 1.0. The default value for this header field is implementation specific. Clash testing can be turned off completely by setting the Clash-Threshold header field value to 0.
clash-threshold = "Clash-Threshold" ":" FLOAT CRLF
9.4.37. Personal-Grammar-URI
This header field specifies the speaker-trained grammar to be used or referenced during enrollment operations. Phrases are added to this grammar during enrollment. For example, a contact list for user "Jeff" could be stored at the Personal-Grammar-URI "http://myserver.example.com/myenrollmentdb/jeff-list". The generated grammar syntax MAY be implementation specific. There is no default value for this header field. This header field MAY be sent as part of the START-PHRASE-ENROLLMENT, SET-PARAMS, or GET-PARAMS method.
personal-grammar-uri = "Personal-Grammar-URI" ":" uri CRLF
9.4.38. Enroll-Utterance
This header field MAY be specified in the RECOGNIZE method. If this header field is set to "true" and an enrollment is active, the RECOGNIZE command MUST add the collected utterance to the personal grammar that is being enrolled. The way in which this occurs is engine specific and may be an area of future standardization. The default value for this header field is "false".
enroll-utterance = "Enroll-Utterance" ":" BOOLEAN CRLF
9.4.39. Phrase-Id
This header field in a request identifies a phrase in an existing personal grammar for which enrollment is desired. It is also returned to the client in the RECOGNITION-COMPLETE event. This header field MAY occur in START-PHRASE-ENROLLMENT, MODIFY-PHRASE, or DELETE-PHRASE requests. There is no default value for this header field.
phrase-id = "Phrase-ID" ":" 1*VCHAR CRLF
9.4.40. Phrase-NL
This string specifies the interpreted text to be returned when the phrase is recognized. This header field MAY occur in START-PHRASE-ENROLLMENT and MODIFY-PHRASE requests. There is no default value for this header field.
phrase-nl = "Phrase-NL" ":" 1*UTFCHAR CRLF
9.4.41. Weight
The value of this header field represents the occurrence likelihood of a phrase in an enrolled grammar. When using grammar enrollment, the system is essentially constructing a grammar segment consisting of a list of possible match phrases. This can be thought of as similar to the dynamic construction of a <one-of> tag in the W3C grammar specification. Each enrolled phrase becomes an item in the list that can be matched against spoken input, similar to an <item> within a <one-of> list. This header field allows the client to assign a weight to the phrase (i.e., the <item> entry) in the <one-of> list that is enrolled. Grammar weights are normalized to a sum of one at grammar compilation time, so a weight value of 1 for each phrase in an enrolled grammar list indicates that all items in that list have the same weight. This header field MAY occur in START-PHRASE-ENROLLMENT and MODIFY-PHRASE requests. The default value for this header field is implementation specific.
weight = "Weight" ":" FLOAT CRLF
9.4.42. Save-Best-Waveform
This header field allows the client to request the recognizer resource to save the audio stream for the best repetition of the phrase that was used during the enrollment session. The recognizer MUST attempt to record the recognized audio and make it available to the client in the form of a URI returned in the Waveform-URI header field in the response to the END-PHRASE-ENROLLMENT method. If there was an error in recording the stream or the audio data is otherwise not available, the recognizer MUST return an empty Waveform-URI header field. This header field MAY occur in the START-PHRASE-ENROLLMENT, SET-PARAMS, and GET-PARAMS methods.
save-best-waveform = "Save-Best-Waveform" ":" BOOLEAN CRLF
9.4.43. New-Phrase-Id
This header field replaces the ID used to identify the phrase in a personal grammar. The recognizer returns the new ID when using an enrollment grammar. This header field MAY occur in MODIFY-PHRASE requests.
new-phrase-id = "New-Phrase-ID" ":" 1*VCHAR CRLF
9.4.44. Confusable-Phrases-URI
This header field specifies a grammar that defines invalid phrases for enrollment. For example, typical applications do not allow an enrolled phrase that is also a command word. This header field MAY occur in RECOGNIZE requests that are part of an enrollment session.
confusable-phrases-uri = "Confusable-Phrases-URI" ":" uri CRLF
9.4.45. Abort-Phrase-Enrollment
This header field MAY be specified in the END-PHRASE-ENROLLMENT method to abort the phrase enrollment, rather than committing the phrase to the personal grammar.
abort-phrase-enrollment = "Abort-Phrase-Enrollment" ":" BOOLEAN CRLF
9.5. Recognizer Message Body
A recognizer message can carry additional data associated with the request, response, or event. The client MAY provide the grammar to be recognized in DEFINE-GRAMMAR or RECOGNIZE requests. When one or more grammars are specified using the DEFINE-GRAMMAR method, the server MUST attempt to fetch, compile, and optimize the grammar before returning a response to the DEFINE-GRAMMAR method. A RECOGNIZE request MUST completely specify the grammars to be active during the recognition operation, except when the RECOGNIZE method is being used to enroll a grammar. During grammar enrollment, such grammars are OPTIONAL. The server resource sends the recognition results in the RECOGNITION-COMPLETE event and the GET-RESULT response. Grammars and recognition results are carried in the message body of the corresponding MRCPv2 messages.
9.5.1. Recognizer Grammar Data
Recognizer grammar data from the client to the server can be provided inline or by reference. Either way, grammar data is carried as typed media entities in the message body of the RECOGNIZE or DEFINE-GRAMMAR request. All MRCPv2 servers MUST accept grammars in the XML form (media type 'application/srgs+xml') of the W3C's XML-based Speech Grammar Markup Format (SRGS) [W3C.REC-speech-grammar-20040316] and MAY accept grammars in other formats. Examples include but are not limited to:
o the ABNF form (media type 'application/srgs') of SRGS
o Sun's Java Speech Grammar Format (JSGF) [refs.javaSpeechGrammarFormat]
Additionally, MRCPv2 servers MAY support the Semantic Interpretation for Speech Recognition (SISR) [W3C.REC-semantic-interpretation-20070405] specification.
When a grammar is specified inline in the request, the client MUST provide a Content-ID for that grammar as part of the content header fields. If there is no space on the server to store the inline grammar, the request MUST return with a Completion-Cause code of 016 "grammar-definition-failure". Otherwise, the server MUST associate the inline grammar block with that Content-ID and MUST store it on the server for the duration of the session. However, if the Content-ID is redefined later in the session through a subsequent DEFINE-GRAMMAR, the inline grammar previously associated with the Content-ID MUST be freed. If the Content-ID is redefined through a subsequent DEFINE-GRAMMAR with an empty message body (i.e., no grammar definition), then in addition to freeing any grammar previously associated with the Content-ID, the server MUST clear all bindings and associations to the Content-ID. Unless and until subsequently redefined, this URI MUST be interpreted by the server as one that has never been set.
Grammars that have been associated with a Content-ID can be referenced through the 'session' URI scheme (see Section 13.6). For example:
session:help@root-level.store
Grammar data MAY be specified using external URI references. To do so, the client uses a body of media type 'text/uri-list' (see RFC 2483 [RFC2483]) to list the one or more URIs that point to the grammar data.
The client can use a body of media type 'text/grammar-ref-list' (see Section 13.5.1) if it wants to assign weights to the list of grammar URIs. All MRCPv2 servers MUST support grammar access using the 'http' and 'https' URI schemes.
If the grammar data the client wishes to be used on a request consists of a mix of URI and inline grammar data, the client uses the 'multipart/mixed' media type to enclose the 'text/uri-list', 'application/srgs', or 'application/srgs+xml' content entities. The character set and encoding used in the grammar data are specified according to standard media type definitions.
When more than one grammar URI or inline grammar block is specified in a message body of the RECOGNIZE request, the server interprets this as a list of grammar alternatives to match against.
Content-Type:application/srgs+xml
Content-ID:<request1@form-level.store>
Content-Length:...
<?xml version="1.0"?>
<!-- the default grammar language is US English -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" version="1.0" root="request">
  <!-- single language attachment to tokens -->
  <rule id="yes">
    <one-of>
      <item xml:lang="fr-CA">oui</item>
      <item xml:lang="en-US">yes</item>
    </one-of>
  </rule>
  <!-- single language attachment to a rule expansion -->
  <rule id="request">
    may I speak to
    <one-of xml:lang="fr-CA">
      <item>Michel Tremblay</item>
      <item>Andre Roy</item>
    </one-of>
  </rule>
  <!-- multiple language attachment to a token -->
  <rule id="people1">
    <token lexicon="en-US,fr-CA"> Robert </token>
  </rule>
  <!-- the equivalent single-language attachment expansion -->
  <rule id="people2">
    <one-of>
      <item xml:lang="en-US">Robert</item>
      <item xml:lang="fr-CA">Robert</item>
    </one-of>
  </rule>
</grammar>
SRGS Grammar Example
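The Content-ID rules for inline grammars described in this section can be sketched as a small per-session store. This is an illustration under stated assumptions, not a server implementation: the names SessionGrammarStore, GrammarStoreFull, and max_bytes are hypothetical, and a simple byte budget stands in for whatever "no space on the server" means in a real deployment.

```python
class GrammarStoreFull(Exception):
    """Raised when there is no space left to store an inline grammar."""


class SessionGrammarStore:
    """Hypothetical per-session store for inline grammars, keyed by Content-ID."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes  # assumed storage budget for the session
        self._grammars = {}

    def define(self, content_id, body):
        """Apply a DEFINE-GRAMMAR; an empty body clears the binding."""
        # Redefinition always frees the grammar previously bound to the ID.
        self._grammars.pop(content_id, None)
        if not body:
            return  # the Content-ID now behaves as if it had never been set
        used = sum(len(g) for g in self._grammars.values())
        if used + len(body) > self.max_bytes:
            # A server would answer with Completion-Cause 016
            # "grammar-definition-failure" at this point.
            raise GrammarStoreFull(content_id)
        self._grammars[content_id] = body

    def resolve(self, uri):
        """Look up a 'session:' URI; returns None if the Content-ID is unset."""
        scheme, _, content_id = uri.partition(":")
        if scheme != "session":
            raise ValueError("not a session URI: " + uri)
        return self._grammars.get(content_id)


store = SessionGrammarStore(max_bytes=4096)
store.define("request1@form-level.store", "<grammar>...</grammar>")
print(store.resolve("session:request1@form-level.store") is not None)
```

The store lives only as long as the session object, which mirrors the requirement that inline grammars are kept for the duration of the session.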
Content-Type:text/uri-list
Content-Length:...
session:help@root-level.store
http://www.example.com/Directory-Name-List.grxml
http://www.example.com/Department-List.grxml
http://www.example.com/TAC-Contact-List.grxml
session:menu1@menu-level.store
Grammar Reference Example
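A receiver of a 'text/uri-list' body such as the one above might separate session-scoped references from grammars that must be fetched. The sketch below is illustrative only; the function names are hypothetical, and it relies on the RFC 2483 convention that lines beginning with "#" are comments.

```python
def parse_uri_list(body):
    """Return the URIs in a text/uri-list body, skipping blanks and comments."""
    uris = []
    for line in body.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            uris.append(line)
    return uris


def split_by_scheme(uris):
    """Separate 'session:' references from externally fetched grammar URIs."""
    session, external = [], []
    for uri in uris:
        scheme = uri.partition(":")[0].lower()
        if scheme == "session":
            session.append(uri)
        else:
            external.append(uri)
    return session, external


body = "\r\n".join([
    "session:help@root-level.store",
    "http://www.example.com/Directory-Name-List.grxml",
    "http://www.example.com/Department-List.grxml",
    "http://www.example.com/TAC-Contact-List.grxml",
    "session:menu1@menu-level.store",
])
session_refs, external_refs = split_by_scheme(parse_uri_list(body))
print(len(session_refs), len(external_refs))
```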
Content-Type:multipart/mixed; boundary="break"
--break
Content-Type:text/uri-list
Content-Length:...
http://www.example.com/Directory-Name-List.grxml
http://www.example.com/Department-List.grxml
http://www.example.com/TAC-Contact-List.grxml
--break
Content-Type:application/srgs+xml
Content-ID:<request1@form-level.store>
Content-Length:...
<?xml version="1.0"?>
<!-- the default grammar language is US English -->
<grammar xmlns="http://www.w3.org/2001/06/grammar"
         xml:lang="en-US" version="1.0">
  <!-- single language attachment to tokens -->
  <rule id="yes">
    <one-of>
      <item xml:lang="fr-CA">oui</item>
      <item xml:lang="en-US">yes</item>
    </one-of>
  </rule>
  <!-- single language attachment to a rule expansion -->
  <rule id="request">
    may I speak to
    <one-of xml:lang="fr-CA">
      <item>Michel Tremblay</item>
      <item>Andre Roy</item>
    </one-of>
  </rule>
  <!-- multiple language attachment to a token -->
  <rule id="people1">
    <token lexicon="en-US,fr-CA"> Robert </token>
  </rule>
  <!-- the equivalent single-language attachment expansion -->
  <rule id="people2">
    <one-of>
      <item xml:lang="en-US">Robert</item>
      <item xml:lang="fr-CA">Robert</item>
    </one-of>
  </rule>
</grammar>
--break--
Mixed Grammar Reference Example
9.5.2. Recognizer Result Data
Recognition results are returned to the client in the message body of the RECOGNITION-COMPLETE event or the GET-RESULT response message as described in Section 6.3. Element and attribute descriptions for the recognition portion of the NLSML format are provided in Section 9.6 with a normative definition of the schema in Section 16.1.
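As a sketch of how a client might read such a body, the following uses Python's standard xml.etree.ElementTree module. The function name read_result is hypothetical, the embedded sample repeats the Result Example of this section, and flattening the <instance> content to one level of child elements is an assumption made for brevity, since instance content is application defined.

```python
import xml.etree.ElementTree as ET

NLSML_NS = {"nl": "urn:ietf:params:xml:ns:mrcpv2"}

SAMPLE = """<?xml version="1.0"?>
<result xmlns="urn:ietf:params:xml:ns:mrcpv2"
        xmlns:ex="http://www.example.com/example"
        grammar="http://www.example.com/theYesNoGrammar">
  <interpretation>
    <instance>
      <ex:response>yes</ex:response>
    </instance>
    <input>OK</input>
  </interpretation>
</result>"""


def read_result(body):
    """Return the grammar URI and a list of interpretations from NLSML."""
    root = ET.fromstring(body)
    interpretations = []
    for interp in root.findall("nl:interpretation", NLSML_NS):
        heard = interp.findtext("nl:input", default="", namespaces=NLSML_NS)
        instance = interp.find("nl:instance", NLSML_NS)
        fields = {}
        if instance is not None:
            # Keys are namespace-qualified tags, e.g. "{ns-uri}response".
            for child in instance:
                fields[child.tag] = (child.text or "").strip()
        interpretations.append({"input": heard.strip(), "instance": fields})
    return root.get("grammar"), interpretations


grammar_uri, results = read_result(SAMPLE)
print(grammar_uri, results[0]["input"])
```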
Content-Type:application/nlsml+xml
Content-Length:...
<?xml version="1.0"?>
<result xmlns="urn:ietf:params:xml:ns:mrcpv2"
        xmlns:ex="http://www.example.com/example"
        grammar="http://www.example.com/theYesNoGrammar">
  <interpretation>
    <instance>
      <ex:response>yes</ex:response>
    </instance>
    <input>OK</input>
  </interpretation>
</result>
Result Example
9.5.3. Enrollment Result Data
Enrollment results are returned to the client in the message body of the RECOGNITION-COMPLETE event as described in Section 6.3. Element and attribute descriptions for the enrollment portion of the NLSML format are provided in Section 9.7 with a normative definition of the schema in Section 16.2.
9.5.4. Recognizer Context Block
When a client changes servers while operating on behalf of the same incoming communication session, the Recognizer-Context-Block header field allows the client to collect a block of opaque data from one server and provide it to another server. This capability is desirable if the client needs different language support or because the server issued a redirect. Here, the first recognizer resource may have collected acoustic and other data during its execution of recognition methods. After a server switch, communicating this data may allow the recognizer resource on the new server to provide better recognition.
This block of data is implementation specific and MUST be carried as media type 'application/octets' in the body of the message. This block of data is communicated in the SET-PARAMS and GET-PARAMS method/response messages. In the GET-PARAMS method, if an empty Recognizer-Context-Block header field is present, then the recognizer SHOULD return its vendor-specific context block, if any, in the message body as an entity of media type 'application/octets' with a specific Content-ID. The Content-ID value MUST also be specified in the Recognizer-Context-Block header field in the GET-PARAMS response. A SET-PARAMS request that provides this vendor-specific data MUST send it in the message body as a typed entity with the same Content-ID that the client received in the GET-PARAMS response. The Content-ID MUST also be sent in the Recognizer-Context-Block header field of the SET-PARAMS message.
Each speech recognition implementation choosing to use this mechanism to hand off recognizer context data among servers MUST distinguish its implementation-specific block of data from other implementations by choosing a Content-ID that is recognizable among the participating servers and unlikely to collide with values chosen by another implementation.
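The handoff described above can be sketched as follows. The dictionary layout used for messages, the function names, and the choice of a random identifier combined with a vendor-controlled domain are all assumptions for illustration; MRCPv2 does not prescribe how a Content-ID is constructed.

```python
import uuid


def new_context_content_id(vendor_domain):
    """Build a Content-ID that is unlikely to collide across implementations."""
    # Assumption: a random token plus a vendor domain keeps IDs recognizable.
    return "{}@{}".format(uuid.uuid4().hex, vendor_domain)


def build_set_params(get_params_response):
    """Reuse the context block from a GET-PARAMS response in a SET-PARAMS."""
    content_id = get_params_response["headers"]["Recognizer-Context-Block"]
    entity = get_params_response["entities"][content_id]
    return {
        "method": "SET-PARAMS",
        # The same Content-ID appears in the header field and on the entity.
        "headers": {"Recognizer-Context-Block": content_id},
        "entities": {content_id: entity},
    }


# Hypothetical GET-PARAMS response from the first server.
cid = new_context_content_id("ctx.vendor.example.com")
response = {
    "headers": {"Recognizer-Context-Block": cid},
    "entities": {
        cid: {"Content-Type": "application/octets", "body": b"\x00\x01opaque"},
    },
}
request = build_set_params(response)
print(request["headers"]["Recognizer-Context-Block"] == cid)
```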