9.6. Recognizer Results
The recognizer portion of NLSML (see Section 6.3.1) represents information automatically extracted from a user's utterances by a semantic interpretation component, where "utterance" is to be taken in the general sense of a meaningful user input in any modality supported by the MRCPv2 implementation.

9.6.1. Markup Functions
MRCPv2 recognizer resources employ the Natural Language Semantics Markup Language (NLSML) to interpret natural language speech input and to format the interpretation for consumption by an MRCPv2 client. The elements of the markup fall into the following general functional categories: interpretation, side information, and multi-modal integration.

9.6.1.1. Interpretation
Elements and attributes represent the semantics of a user's utterance, including the <result>, <interpretation>, and <instance> elements. The <result> element contains the full result of processing one utterance. It MAY contain multiple <interpretation> elements if the interpretation of the utterance results in multiple alternative meanings due to uncertainty in speech recognition or natural language understanding. There are at least two reasons for providing multiple interpretations:

1. The client application might have additional information, for example, information from a database, that would allow it to select a preferred interpretation from among the possible interpretations returned from the semantic interpreter.
2. A client-based dialog manager (e.g., VoiceXML [W3C.REC-voicexml20-20040316]) that was unable to select between several competing interpretations could use this information to go back to the user and find out what was intended. For example, it could issue a SPEAK request to a synthesizer resource to emit "Did you say 'Boston' or 'Austin'?"
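For instance, a hypothetical NLSML result for the ambiguous 'Boston'/'Austin' case above might carry two alternative interpretations, best-first. The grammar URI, namespace, and confidence values below are invented for illustration:

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           xmlns:ex="http://www.example.com/example"
           grammar="http://www.example.com/theCityGrammar">
     <interpretation confidence="0.6">
       <instance>
         <ex:city>Boston</ex:city>
       </instance>
       <input mode="speech">Boston</input>
     </interpretation>
     <interpretation confidence="0.4">
       <instance>
         <ex:city>Austin</ex:city>
       </instance>
       <input mode="speech">Austin</input>
     </interpretation>
   </result>

9.6.1.2. Side Information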
These are elements and attributes representing additional information about the interpretation, over and above the interpretation itself. Side information includes:

1. Whether an interpretation was achieved (the <nomatch> element) and the system's confidence in an interpretation (the "confidence" attribute of <interpretation>).

2. Alternative interpretations (<interpretation>).

3. Input formats and Automatic Speech Recognition (ASR) information: the <input> element, representing the input to the semantic interpreter.

9.6.1.3. Multi-Modal Integration
When more than one modality is available for input, the interpretation of the inputs needs to be coordinated. The "mode" attribute of <input> supports this by indicating whether the utterance was input by speech, DTMF, pointing, etc. The "timestamp-start" and "timestamp-end" attributes of <input> also provide for temporal coordination by indicating when inputs occurred.

9.6.2. Overview of Recognizer Result Elements and Their Relationships
The recognizer elements in NLSML fall into two categories:

1. description of the input that was processed, and

2. description of the meaning that was extracted from the input.

Some elements can contain multiple instances of other elements. For example, a <result> can contain multiple <interpretation> elements, each of which is taken to be an alternative. Similarly, <input> can contain multiple child <input> elements, which are taken to be cumulative. To illustrate the basic usage of these elements, as a simple example,
consider the utterance "OK" (interpreted as "yes"). The example below illustrates how that utterance and its interpretation would be represented in the NLSML markup.

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           xmlns:ex="http://www.example.com/example"
           grammar="http://www.example.com/theYesNoGrammar">
     <interpretation>
       <instance>
         <ex:response>yes</ex:response>
       </instance>
       <input>OK</input>
     </interpretation>
   </result>

This example includes only the minimum required information. There is an overall <result> element, which includes one interpretation and an input element. The interpretation contains the application-specific element "<response>", which is the semantically interpreted result.

9.6.3. Elements and Attributes
9.6.3.1. <result> Root Element
The root element of the markup is <result>. The <result> element includes one or more <interpretation> elements. Multiple interpretations can result from ambiguities in the input or in the semantic interpretation. If the "grammar" attribute does not apply to all of the interpretations in the result, it can be overridden for individual interpretations at the <interpretation> level.

Attributes:

1. grammar: The grammar or recognition rule matched by this result. The format of the grammar attribute matches the rule reference semantics defined in the grammar specification. Specifically, the rule reference is in the external XML form for grammar rule references. The markup interpreter needs to know the grammar rule that is matched by the utterance because multiple rules may be simultaneously active. The value is the grammar URI used by the markup interpreter to specify the grammar. The grammar can be overridden by a grammar attribute in the <interpretation> element if the input was ambiguous as to which grammar it matched. If all <interpretation> elements within the <result> element contain their own grammar attributes, the attribute can be dropped from the <result> element.
   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           grammar="http://www.example.com/grammar">
     <interpretation>
     ....
     </interpretation>
   </result>

9.6.3.2. <interpretation> Element
An <interpretation> element contains a single semantic interpretation.

Attributes:

1. confidence: A float value from 0.0-1.0 indicating the semantic analyzer's confidence in this interpretation. A value of 1.0 indicates maximum confidence. The values are implementation dependent but are intended to align with the value interpretation for the Confidence-Threshold MRCPv2 header field defined in Section 9.4.1. This attribute is OPTIONAL.

2. grammar: The grammar or recognition rule matched by this interpretation (if needed to override the grammar specification at the <result> level). This attribute is only needed under <interpretation> if it is necessary to override a grammar that was defined at the <result> level. Note that the grammar attribute for the <interpretation> element is optional if and only if the grammar attribute is specified in the <result> element.

Interpretations MUST be sorted best-first by some measure of "goodness". The goodness measure is "confidence" if present; otherwise, it is some implementation-specific indication of quality. The grammar is expected to be specified most frequently at the <result> level. However, it can be overridden at the <interpretation> level because it is possible that different interpretations may match different grammar rules. The <interpretation> element includes an optional <input> element containing the input being analyzed, and at least one <instance> element containing the interpretation of the utterance.

   <interpretation confidence="0.75"
                   grammar="http://www.example.com/grammar">
     ...
   </interpretation>
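When different interpretations match different grammar rules, each <interpretation> can carry its own grammar attribute, in which case (per Section 9.6.3.1) the <result>-level attribute can be dropped. A minimal sketch, with invented grammar URIs and instance content elided:

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2">
     <interpretation confidence="0.8"
                     grammar="http://www.example.com/cities">
       ...
     </interpretation>
     <interpretation confidence="0.6"
                     grammar="http://www.example.com/states">
       ...
     </interpretation>
   </result>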
9.6.3.3. <instance> Element
The <instance> element contains the interpretation of the utterance. When the Semantic Interpretation for Speech Recognition format is used, the <instance> element contains the XML serialization of the result using the approach defined in that specification. When there is semantic markup in the grammar that does not create semantic objects, but instead only does a semantic translation of a portion of the input, such as translating "coke" to "coca-cola", the instance contains the whole input but with the translation applied. The NLSML looks like the markup in Figure 2 below. If there are no semantic objects created, nor any semantic translation, the instance value is the same as the input value.

Attributes:

1. confidence: Each element of the instance MAY have a confidence attribute, defined in the NLSML namespace. The confidence attribute contains a float value in the range from 0.0-1.0 reflecting the system's confidence in the analysis of that slot. A value of 1.0 indicates maximum confidence. The values are implementation dependent but are intended to align with the value interpretation for the MRCPv2 header field Confidence-Threshold defined in Section 9.4.1. This attribute is OPTIONAL.

   <instance>
     <nameAddress>
       <street confidence="0.75">123 Maple Street</street>
       <city>Mill Valley</city>
       <state>CA</state>
       <zip>90952</zip>
     </nameAddress>
   </instance>
   <input>
     My address is 123 Maple Street,
     Mill Valley, California, 90952
   </input>

   <instance>
     I would like to buy a coca-cola
   </instance>
   <input>
     I would like to buy a coke
   </input>

                    Figure 2: NLSML Example
9.6.3.4. <input> Element
The <input> element is the text representation of a user's input. It includes an optional "confidence" attribute, which indicates the recognizer's confidence in the recognition result (as opposed to the confidence in the interpretation, which is indicated by the "confidence" attribute of <interpretation>). Optional "timestamp-start" and "timestamp-end" attributes indicate the start and end times of a spoken utterance, in ISO 8601 format [ISO.8601.1988].

Attributes:

1. timestamp-start: The time at which the input began. (optional)

2. timestamp-end: The time at which the input ended. (optional)

3. mode: The modality of the input, for example, speech, DTMF, etc. (optional)

4. confidence: The confidence of the recognizer in the correctness of the input in the range 0.0 to 1.0. (optional)

Note that it may not make sense for temporally overlapping inputs to have the same mode; however, this constraint is not expected to be enforced by implementations. When there is no time zone designator, ISO 8601 time representations default to local time.

There are three possible formats for the <input> element.

1. The <input> element can contain simple text:

   <input>onions</input>

   A future possibility is for <input> to contain not only text but additional markup that represents prosodic information that was contained in the original utterance and extracted by the speech recognizer. This depends on the availability of ASRs that are capable of producing prosodic information. MRCPv2 clients MUST be prepared to receive such markup and MAY make use of it.

2. An <input> tag can also contain additional <input> tags. Having additional input elements allows the representation to support future multi-modal inputs as well as finer-grained speech information, such as timestamps for individual words and word-level confidences.
   <input>
     <input mode="speech" confidence="0.5"
            timestamp-start="2000-04-03T0:00:00"
            timestamp-end="2000-04-03T0:00:00.2">fried</input>
     <input mode="speech" confidence="1.0"
            timestamp-start="2000-04-03T0:00:00.25"
            timestamp-end="2000-04-03T0:00:00.6">onions</input>
   </input>

3. Finally, the <input> element can contain <nomatch> and <noinput> elements, which describe situations in which the speech recognizer received input that it was unable to process or did not receive any input at all, respectively.

9.6.3.5. <nomatch> Element
The <nomatch> element under <input> is used to indicate that the semantic interpreter was unable to successfully match any input with confidence above the threshold. It can optionally contain the text of the best of the (rejected) matches.

   <interpretation>
     <instance/>
     <input confidence="0.1">
       <nomatch/>
     </input>
   </interpretation>

   <interpretation>
     <instance/>
     <input mode="speech" confidence="0.1">
       <nomatch>I want to go to New York</nomatch>
     </input>
   </interpretation>

9.6.3.6. <noinput> Element
<noinput> indicates that there was no input -- a timeout occurred in the speech recognizer due to silence.

   <interpretation>
     <instance/>
     <input>
       <noinput/>
     </input>
   </interpretation>

If there are multiple levels of inputs, the most natural place for <nomatch> and <noinput> elements to appear is under the highest level of <input> for <noinput>, and under the appropriate level of <input> for <nomatch>. So, <noinput> means "no input at all", and <nomatch> means "no match in speech modality" or "no match in DTMF modality". For example, to represent garbled speech combined with DTMF "1 2 3 4", the markup would be:

   <input>
     <input mode="speech"><nomatch/></input>
     <input mode="dtmf">1 2 3 4</input>
   </input>

Note: while <noinput> could be represented as an attribute of <input>, <nomatch> cannot, since it could potentially include PCDATA content with the best match. For parallelism, <noinput> is also an element.

9.7. Enrollment Results
All enrollment elements are contained within a single <enrollment-result> element under <result>. The elements are described below and have the schema defined in Section 16.2. The following elements are defined:

1. num-clashes

2. num-good-repetitions

3. num-repetitions-still-needed

4. consistency-status

5. clash-phrase-ids

6. transcriptions

7. confusable-phrases

9.7.1. <num-clashes> Element
The <num-clashes> element contains the number of clashes that this pronunciation has with other pronunciations in an active enrollment session. The associated Clash-Threshold header field determines the sensitivity of the clash measurement. Note that clash testing can be turned off completely by setting the Clash-Threshold header field value to 0.

9.7.2. <num-good-repetitions> Element
The <num-good-repetitions> element contains the number of consistent pronunciations obtained so far in an active enrollment session.
9.7.3. <num-repetitions-still-needed> Element
The <num-repetitions-still-needed> element contains the number of consistent pronunciations that must still be obtained before the new phrase can be added to the enrollment grammar. The number of consistent pronunciations required is specified by the client in the request header field Num-Min-Consistent-Pronunciations. The returned value must be 0 before the client can successfully commit a phrase to the grammar by ending the enrollment session.

9.7.4. <consistency-status> Element
The <consistency-status> element is used to indicate how consistent the repetitions are when learning a new phrase. It can have the values of consistent, inconsistent, and undecided.

9.7.5. <clash-phrase-ids> Element
The <clash-phrase-ids> element contains the phrase IDs of clashing pronunciation(s), if any. This element is absent if there are no clashes.

9.7.6. <transcriptions> Element
The <transcriptions> element contains the transcriptions returned in the last repetition of the phrase being enrolled.

9.7.7. <confusable-phrases> Element
The <confusable-phrases> element contains a list of phrases from a command grammar that are confusable with the phrase being added to the personal grammar. This element MAY be absent if there are no confusable phrases.

9.8. DEFINE-GRAMMAR
The DEFINE-GRAMMAR method, from the client to the server, provides one or more grammars and requests the server to access, fetch, and compile the grammars as needed. The DEFINE-GRAMMAR method implementation MUST do a fetch of all external URIs that are part of that operation. If caching is implemented, this URI fetching MUST conform to the cache control hints and parameter header fields associated with the method in deciding whether the URIs should be fetched from cache or from the external server. If these hints/parameters are not specified in the method, the values set for the session using SET-PARAMS/GET-PARAMS apply. If they were not set for the session, their default values apply.
If the server resource is in the recognition state, the server MUST respond to the DEFINE-GRAMMAR request with a failure status. If the resource is in the idle state and is able to successfully process the supplied grammars, the server MUST return a success status-code, and the request-state MUST be COMPLETE.

If the recognizer resource could not define the grammar for some reason (for example, if the download failed, the grammar failed to compile, or the grammar was in an unsupported form), the MRCPv2 response for the DEFINE-GRAMMAR method MUST contain a failure status-code of 407 and contain a Completion-Cause header field describing the failure reason.

   C->S: MRCP/2.0 ... DEFINE-GRAMMAR 543257
         Channel-Identifier:32AECB23433801@speechrecog
         Content-Type:application/srgs+xml
         Content-ID:<request1@form-level.store>
         Content-Length:...

         <?xml version="1.0"?>

         <!-- the default grammar language is US English -->
         <grammar xmlns="http://www.w3.org/2001/06/grammar"
                  xml:lang="en-US" version="1.0">

             <!-- single language attachment to tokens -->
             <rule id="yes">
                 <one-of>
                     <item xml:lang="fr-CA">oui</item>
                     <item xml:lang="en-US">yes</item>
                 </one-of>
             </rule>

             <!-- single language attachment to a rule expansion -->
             <rule id="request">
                 may I speak to
                 <one-of xml:lang="fr-CA">
                     <item>Michel Tremblay</item>
                     <item>Andre Roy</item>
                 </one-of>
             </rule>
         </grammar>

   S->C: MRCP/2.0 ... 543257 200 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog
         Completion-Cause:000 success
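By way of contrast, a failed DEFINE-GRAMMAR (say, one whose grammar failed to compile) would carry the 407 status-code and a Completion-Cause describing the failure, as required above. A hypothetical response, with an invented request-id and assuming a cause value such as "grammar-compilation-failure" from the recognizer's Completion-Cause codes:

   S->C: MRCP/2.0 ... 543299 407 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog
         Completion-Cause:005 grammar-compilation-failure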
   C->S: MRCP/2.0 ... DEFINE-GRAMMAR 543258
         Channel-Identifier:32AECB23433801@speechrecog
         Content-Type:application/srgs+xml
         Content-ID:<helpgrammar@root-level.store>
         Content-Length:...

         <?xml version="1.0"?>

         <!-- the default grammar language is US English -->
         <grammar xmlns="http://www.w3.org/2001/06/grammar"
                  xml:lang="en-US" version="1.0">
             <rule id="request">
                 I need help
             </rule>
         </grammar>

   S->C: MRCP/2.0 ... 543258 200 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog
         Completion-Cause:000 success

   C->S: MRCP/2.0 ... DEFINE-GRAMMAR 543259
         Channel-Identifier:32AECB23433801@speechrecog
         Content-Type:application/srgs+xml
         Content-ID:<request2@field-level.store>
         Content-Length:...

         <?xml version="1.0" encoding="UTF-8"?>

         <!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
                   "http://www.w3.org/TR/speech-grammar/grammar.dtd">

         <grammar xmlns="http://www.w3.org/2001/06/grammar"
                  xml:lang="en"
                  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                  xsi:schemaLocation="http://www.w3.org/2001/06/grammar
                    http://www.w3.org/TR/speech-grammar/grammar.xsd"
                  version="1.0" mode="voice" root="basicCmd">

         <meta name="author" content="Stephanie Williams"/>

         <rule id="basicCmd" scope="public">
             <example> please move the window </example>
             <example> open a file </example>
             <ruleref
               uri="http://grammar.example.com/politeness.grxml#startPolite"/>
             <ruleref uri="#command"/>
             <ruleref
               uri="http://grammar.example.com/politeness.grxml#endPolite"/>
         </rule>

         <rule id="command">
             <ruleref uri="#action"/> <ruleref uri="#object"/>
         </rule>

         <rule id="action">
             <one-of>
                 <item weight="10"> open <tag>open</tag> </item>
                 <item weight="2"> close <tag>close</tag> </item>
                 <item weight="1"> delete <tag>delete</tag> </item>
                 <item weight="1"> move <tag>move</tag> </item>
             </one-of>
         </rule>

         <rule id="object">
             <item repeat="0-1">
                 <one-of>
                     <item> the </item>
                     <item> a </item>
                 </one-of>
             </item>
             <one-of>
                 <item> window </item>
                 <item> file </item>
                 <item> menu </item>
             </one-of>
         </rule>
         </grammar>

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog
         Completion-Cause:000 success

   C->S: MRCP/2.0 ... RECOGNIZE 543260
         Channel-Identifier:32AECB23433801@speechrecog
         N-Best-List-Length:2
         Content-Type:text/uri-list
         Content-Length:...
         session:request1@form-level.store
         session:request2@field-level.store
         session:helpgrammar@root-level.store

   S->C: MRCP/2.0 ... 543260 200 IN-PROGRESS
         Channel-Identifier:32AECB23433801@speechrecog

   S->C: MRCP/2.0 ... START-OF-INPUT 543260 IN-PROGRESS
         Channel-Identifier:32AECB23433801@speechrecog

   S->C: MRCP/2.0 ... RECOGNITION-COMPLETE 543260 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog
         Completion-Cause:000 success
         Waveform-URI:<http://web.media.com/session123/audio.wav>;
                      size=124535;duration=2340
         Content-Type:application/nlsml+xml
         Content-Length:...

         <?xml version="1.0"?>
         <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                 xmlns:ex="http://www.example.com/example"
                 grammar="session:request1@form-level.store">
             <interpretation>
                 <instance name="Person">
                     <ex:Person>
                         <ex:Name> Andre Roy </ex:Name>
                     </ex:Person>
                 </instance>
                 <input> may I speak to Andre Roy </input>
             </interpretation>
         </result>

                      Define Grammar Example

9.9. RECOGNIZE
The RECOGNIZE method from the client to the server requests the recognizer to start recognition and provides it with one or more grammar references for grammars to match against the input media. The RECOGNIZE method can carry header fields to control the sensitivity, confidence level, and the level of detail in results provided by the recognizer. These header field values override the current values set by a previous SET-PARAMS method.

The RECOGNIZE method can request the recognizer resource to operate in normal or hotword mode as specified by the Recognition-Mode header field. The default value is "normal". If the resource could not start a recognition, the server MUST respond with a failure status-code of 407 and a Completion-Cause header field in the response describing the cause of failure.

The RECOGNIZE request uses the message body to specify the grammars applicable to the request. The active grammar(s) for the request can be specified in one of three ways. If the client needs to explicitly control grammar weights for the recognition operation, it MUST employ method 3 below. The order of these grammars specifies the precedence of the grammars that is used when more than one grammar in the list matches the speech; in this case, the grammar with the higher precedence is returned as a match. This precedence capability is useful in applications like VoiceXML browsers to order grammars specified at the dialog, document, and root level of a VoiceXML application.

1. The grammar MAY be placed directly in the message body as typed content. If more than one grammar is included in the body, the order of inclusion controls the corresponding precedence for the grammars during recognition, with earlier grammars in the body having a higher precedence than later ones.

2. The body MAY contain a list of grammar URIs specified in content of media type 'text/uri-list' [RFC2483]. The order of the URIs determines the corresponding precedence for the grammars during recognition, with highest precedence first and decreasing for each URI thereafter.

3. The body MAY contain a list of grammar URIs specified in content of media type 'text/grammar-ref-list'. This type defines a list of grammar URIs and allows each grammar URI to be assigned a weight in the list (an illustrative body appears at the end of this discussion). This weight has the same meaning as the weights described in Section 2.4.1 of the Speech Recognition Grammar Specification (SRGS) [W3C.REC-speech-grammar-20040316].

In addition to performing recognition on the input, the recognizer MUST also enroll the collected utterance in a personal grammar if the Enroll-Utterance header field is set to true and an enrollment session is active (via an earlier execution of the START-PHRASE-ENROLLMENT method). If so, and if the RECOGNIZE request contains a Content-ID header field, then the resulting grammar (which includes the personal grammar as a sub-grammar) can be referenced through the 'session' URI scheme (see Section 13.6).

If the resource was able to successfully start the recognition, the server MUST return a success status-code and a request-state of IN-PROGRESS. This means that the recognizer is active and that the client MUST be prepared to receive further events with this request-id.
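As an illustration of method 3 above, a RECOGNIZE body of type 'text/grammar-ref-list' might look as follows. The request-id, grammar URIs, and weight values are invented for the example:

   C->S: MRCP/2.0 ... RECOGNIZE 543298
         Channel-Identifier:32AECB23433801@speechrecog
         Content-Type:text/grammar-ref-list
         Content-Length:...

         <http://www.example.com/grammars/cities.grxml>;weight="2.0"
         <http://www.example.com/grammars/yesno.grxml>;weight="0.5"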
If the resource was able to queue the request, the server MUST return a success code and a request-state of PENDING. This means that the recognizer is currently active with another request and that this request has been queued for processing. If the resource could not start a recognition, the server MUST respond with a failure status-code of 407 and a Completion-Cause header field in the response describing the cause of failure.

For the recognizer resource, RECOGNIZE and INTERPRET are the only requests that return a request-state of IN-PROGRESS, meaning that recognition is in progress. When the recognition completes by matching one of the grammar alternatives or by a timeout without a match or for some other reason, the recognizer resource MUST send the client a RECOGNITION-COMPLETE event (or INTERPRETATION-COMPLETE, if INTERPRET was the request) with the result of the recognition and a request-state of COMPLETE.

Large grammars can take a long time for the server to compile. For grammars that are used repeatedly, the client can improve server performance by issuing a DEFINE-GRAMMAR request with the grammar ahead of time. In such a case, the client can issue the RECOGNIZE request and reference the grammar through the 'session' URI scheme (see Section 13.6). This also applies in general if the client wants to repeat recognition with a previous inline grammar.

The RECOGNIZE method implementation MUST do a fetch of all external URIs that are part of that operation. If caching is implemented, this URI fetching MUST conform to the cache control hints and parameter header fields associated with the method in deciding whether it should be fetched from cache or from the external server. If these hints/parameters are not specified in the method, the values set for the session using SET-PARAMS/GET-PARAMS apply. If they were not set for the session, their default values apply.

Note that since the audio and the messages are carried over separate communication paths, there may be a race condition between the start of the flow of audio and the receipt of the RECOGNIZE method. For example, if an audio flow is started by the client at the same time as the RECOGNIZE method is sent, either the audio or the RECOGNIZE can arrive at the recognizer first. As another example, the client may choose to continuously send audio to the server and signal the server to recognize using the RECOGNIZE method. Mechanisms to resolve this condition are outside the scope of this specification. The recognizer can expect the media to start flowing when it receives the RECOGNIZE request, but it MUST NOT buffer anything it receives beforehand, in order to preserve the semantics that application authors expect with respect to the input timers.
When a RECOGNIZE method has been received, the recognition is initiated on the stream. The No-Input-Timer MUST be started at this time if the Start-Input-Timers header field is specified as "true". If this header field is set to "false", the No-Input-Timer MUST be started when the server receives the START-INPUT-TIMERS method from the client. The Recognition-Timeout MUST be started when the recognition resource detects speech or a DTMF digit in the media stream.

For recognition when not in hotword mode: When the recognizer resource detects speech or a DTMF digit in the media stream, it MUST send the START-OF-INPUT event. When enough speech has been collected for the server to process, the recognizer can try to match the collected speech with the active grammars. If the speech collected at this point fully matches with any of the active grammars, the Speech-Complete-Timer is started. If it matches partially with one or more of the active grammars, with more speech needed before a full match is achieved, then the Speech-Incomplete-Timer is started.

1. When the No-Input-Timer expires, the recognizer MUST complete with a Completion-Cause code of "no-input-timeout".

2. The recognizer MUST support detecting a no-match condition upon detecting end of speech. The recognizer MAY support detecting a no-match condition before waiting for end-of-speech. If this is supported, this capability is enabled by setting the Early-No-Match header field to "true". Upon detecting a no-match condition, the RECOGNIZE MUST return with "no-match".

3. When the Speech-Incomplete-Timer expires, the recognizer SHOULD complete with a Completion-Cause code of "partial-match", unless the recognizer cannot differentiate a partial-match, in which case it MUST return a Completion-Cause code of "no-match". The recognizer MAY return results for the partially matched grammar.

4. When the Speech-Complete-Timer expires, the recognizer MUST complete with a Completion-Cause code of "success".
5. When the Recognition-Timeout expires, one of the following MUST happen:

   5.1. If there was a partial-match, the recognizer SHOULD complete with a Completion-Cause code of "partial-match-maxtime", unless the recognizer cannot differentiate a partial-match, in which case it MUST complete with a Completion-Cause code of "no-match-maxtime". The recognizer MAY return results for the partially matched grammar.

   5.2. If there was a full-match, the recognizer MUST complete with a Completion-Cause code of "success-maxtime".

   5.3. If there was no match, the recognizer MUST complete with a Completion-Cause code of "no-match-maxtime".

For recognition in hotword mode: Note that for recognition in hotword mode the START-OF-INPUT event is not generated when speech or a DTMF digit is detected.

1. When the No-Input-Timer expires, the recognizer MUST complete with a Completion-Cause code of "no-input-timeout".

2. If at any point a match occurs, the RECOGNIZE MUST complete with a Completion-Cause code of "success".

3. When the Recognition-Timeout expires and there is not a match, the RECOGNIZE MUST complete with a Completion-Cause code of "hotword-maxtime".

4. When the Recognition-Timeout expires and there is a match, the RECOGNIZE MUST complete with a Completion-Cause code of "success-maxtime".

5. When the Recognition-Timeout is running but the detected speech/DTMF has not resulted in a match, the Recognition-Timeout MUST be stopped and reset. It MUST then be restarted when speech/DTMF is again detected.

Below is a complete example of using RECOGNIZE. It shows the call to RECOGNIZE, the IN-PROGRESS and START-OF-INPUT status messages, and the final RECOGNITION-COMPLETE message containing the result.
   C->S: MRCP/2.0 ... RECOGNIZE 543257
         Channel-Identifier:32AECB23433801@speechrecog
         Confidence-Threshold:0.9
         Content-Type:application/srgs+xml
         Content-ID:<request1@form-level.store>
         Content-Length:...

         <?xml version="1.0"?>

         <!-- the default grammar language is US English -->
         <grammar xmlns="http://www.w3.org/2001/06/grammar"
                  xml:lang="en-US" version="1.0" root="request">

             <!-- single language attachment to tokens -->
             <rule id="yes">
                 <one-of>
                     <item xml:lang="fr-CA">oui</item>
                     <item xml:lang="en-US">yes</item>
                 </one-of>
             </rule>

             <!-- single language attachment to a rule expansion -->
             <rule id="request">
                 may I speak to
                 <one-of xml:lang="fr-CA">
                     <item>Michel Tremblay</item>
                     <item>Andre Roy</item>
                 </one-of>
             </rule>
         </grammar>

   S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
         Channel-Identifier:32AECB23433801@speechrecog

   S->C: MRCP/2.0 ... START-OF-INPUT 543257 IN-PROGRESS
         Channel-Identifier:32AECB23433801@speechrecog

   S->C: MRCP/2.0 ... RECOGNITION-COMPLETE 543257 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog
         Completion-Cause:000 success
         Waveform-URI:<http://web.media.com/session123/audio.wav>;
                      size=424252;duration=2543
         Content-Type:application/nlsml+xml
         Content-Length:...
<?xml version="1.0"?>
<result xmlns="urn:ietf:params:xml:ns:mrcpv2"
xmlns:ex="http://www.example.com/example"
grammar="session:request1@form-level.store">
<interpretation>
<instance name="Person">
<ex:Person>
<ex:Name> Andre Roy </ex:Name>
</ex:Person>
</instance>
<input> may I speak to Andre Roy </input>
</interpretation>
</result>
Below is an example of calling RECOGNIZE with a different grammar.
No status or completion messages are shown in this example, although
they would of course occur in normal usage.
C->S: MRCP/2.0 ... RECOGNIZE 543257
Channel-Identifier:32AECB23433801@speechrecog
Confidence-Threshold:0.9
Fetch-Timeout:20
Content-Type:application/srgs+xml
Content-Length:...
         <?xml version="1.0"?>

         <grammar xmlns="http://www.w3.org/2001/06/grammar"
                  xml:lang="en-US" version="1.0" mode="voice"
                  root="rule_list">
             <rule id="rule_list" scope="public">
                 <one-of>
                     <item weight="10">
                         <ruleref uri=
                         "http://grammar.example.com/world-cities.grxml#canada"/>
                     </item>
                     <item weight="1.5">
                         <ruleref uri=
                         "http://grammar.example.com/world-cities.grxml#america"/>
                     </item>
                     <item weight="0.5">
                         <ruleref uri=
                         "http://grammar.example.com/world-cities.grxml#india"/>
                     </item>
                 </one-of>
             </rule>
         </grammar>
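For completeness, a recognition against this weighted grammar might finish with an event along the following lines. The matched rule, confidence, and result content are invented for illustration:

   S->C: MRCP/2.0 ... RECOGNITION-COMPLETE 543257 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog
         Completion-Cause:000 success
         Content-Type:application/nlsml+xml
         Content-Length:...

         <?xml version="1.0"?>
         <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                 grammar=
                 "http://grammar.example.com/world-cities.grxml#canada">
             <interpretation confidence="0.91">
                 <instance>Toronto</instance>
                 <input mode="speech">Toronto</input>
             </interpretation>
         </result>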
9.10. STOP
The STOP method from the client to the server tells the resource to stop recognition if a request is active. If a RECOGNIZE request is active and the STOP request successfully terminated it, then the response header section contains an Active-Request-Id-List header field containing the request-id of the RECOGNIZE request that was terminated. In this case, no RECOGNITION-COMPLETE event is sent for the terminated request. If there was no recognition active, then the response MUST NOT contain an Active-Request-Id-List header field. Either way, the response MUST contain a status-code of 200 "Success".

   C->S: MRCP/2.0 ... RECOGNIZE 543257
         Channel-Identifier:32AECB23433801@speechrecog
         Confidence-Threshold:0.9
         Content-Type:application/srgs+xml
         Content-ID:<request1@form-level.store>
         Content-Length:...

         <?xml version="1.0"?>

         <!-- the default grammar language is US English -->
         <grammar xmlns="http://www.w3.org/2001/06/grammar"
                  xml:lang="en-US" version="1.0" root="request">

             <!-- single language attachment to tokens -->
             <rule id="yes">
                 <one-of>
                     <item xml:lang="fr-CA">oui</item>
                     <item xml:lang="en-US">yes</item>
                 </one-of>
             </rule>

             <!-- single language attachment to a rule expansion -->
             <rule id="request">
                 may I speak to
                 <one-of xml:lang="fr-CA">
                     <item>Michel Tremblay</item>
                     <item>Andre Roy</item>
                 </one-of>
             </rule>
         </grammar>

   S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
         Channel-Identifier:32AECB23433801@speechrecog

   C->S: MRCP/2.0 ... STOP 543258
         Channel-Identifier:32AECB23433801@speechrecog
   S->C: MRCP/2.0 ... 543258 200 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog
         Active-Request-Id-List:543257
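Had no recognition been active when the STOP arrived, the response would, per the rules above, omit the Active-Request-Id-List header field. A hypothetical exchange with an invented request-id:

   C->S: MRCP/2.0 ... STOP 543259
         Channel-Identifier:32AECB23433801@speechrecog

   S->C: MRCP/2.0 ... 543259 200 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog

9.11. GET-RESULT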
The GET-RESULT method from the client to the server MAY be issued when the recognizer resource is in the recognized state. This request allows the client to retrieve results for a completed recognition. This is useful if the client decides it wants more alternatives or more information. When the server receives this request, it re-computes and returns the results according to the recognition constraints provided in the GET-RESULT request. The GET-RESULT request can specify constraints such as a different confidence-threshold or n-best-list-length. This capability is OPTIONAL for MRCPv2 servers; a server that does not support it MUST return a status of unsupported feature.

   C->S: MRCP/2.0 ... GET-RESULT 543257
         Channel-Identifier:32AECB23433801@speechrecog
         Confidence-Threshold:0.9

   S->C: MRCP/2.0 ... 543257 200 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog
         Content-Type:application/nlsml+xml
         Content-Length:...

         <?xml version="1.0"?>
         <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                 xmlns:ex="http://www.example.com/example"
                 grammar="session:request1@form-level.store">
             <interpretation>
                 <instance name="Person">
                     <ex:Person>
                         <ex:Name> Andre Roy </ex:Name>
                     </ex:Person>
                 </instance>
                 <input> may I speak to Andre Roy </input>
             </interpretation>
         </result>
9.12. START-OF-INPUT
This is an event from the server to the client indicating that the recognizer resource has detected speech or a DTMF digit in the media stream. This event is useful in implementing kill-on-barge-in scenarios when a synthesizer resource is in a different session from the recognizer resource and hence is not aware of an incoming audio source (see Section 8.4.2). In these cases, it is up to the client to act as an intermediary and respond to this event by issuing a BARGE-IN-OCCURRED method to the synthesizer resource. The recognizer resource also MUST send a Proxy-Sync-Id header field with a unique value for this event. This event MUST be generated by the server, irrespective of whether or not the synthesizer and recognizer are on the same server.
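The following sketch shows the shape of such an exchange. The request-ids, the synthesizer channel identifier, and the Proxy-Sync-Id value are invented for illustration:

   S->C: MRCP/2.0 ... START-OF-INPUT 543260 IN-PROGRESS
         Channel-Identifier:32AECB23433801@speechrecog
         Proxy-Sync-Id:987654321

   C->S: MRCP/2.0 ... BARGE-IN-OCCURRED 32777
         Channel-Identifier:32AECB23433802@speechsynth
         Proxy-Sync-Id:987654321

9.13. START-INPUT-TIMERS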
This request is sent from the client to the recognizer resource when it knows that a kill-on-barge-in prompt has finished playing (see Section 8.4.2). This is useful in the scenario when the recognition and synthesizer engines are not in the same session. When a kill-on-barge-in prompt is being played, the client may want a RECOGNIZE request to be simultaneously active so that it can detect and implement kill-on-barge-in. But at the same time, the client doesn't want the recognizer to start the no-input timers until the prompt is finished. The Start-Input-Timers header field in the RECOGNIZE request allows the client to say whether or not the timers should be started immediately. If not, the recognizer resource MUST NOT start the timers until the client sends a START-INPUT-TIMERS method to the recognizer.
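A minimal sketch of the method, with an invented request-id:

   C->S: MRCP/2.0 ... START-INPUT-TIMERS 543261
         Channel-Identifier:32AECB23433801@speechrecog

   S->C: MRCP/2.0 ... 543261 200 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog

9.14. RECOGNITION-COMPLETE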
This is an event from the recognizer resource to the client indicating that the recognition completed. The recognition result is sent in the body of the MRCPv2 message. The request-state field MUST be COMPLETE, indicating that this is the last event with that request-id and that the request with that request-id is now complete. The server MUST maintain the recognizer context containing the results and the audio waveform input of that recognition until the next RECOGNIZE request is issued for that resource or the session terminates. If the server returns a URI to the audio waveform, it MUST do so in a Waveform-URI header field in the RECOGNITION-COMPLETE event. The client can use this URI to retrieve or play back the audio.
Note that if an enrollment session was active, the RECOGNITION-COMPLETE event can contain either recognition or enrollment results, depending on what was spoken. The following example shows a complete exchange with a recognition result.

   C->S: MRCP/2.0 ... RECOGNIZE 543257
         Channel-Identifier:32AECB23433801@speechrecog
         Confidence-Threshold:0.9
         Content-Type:application/srgs+xml
         Content-ID:<request1@form-level.store>
         Content-Length:...

         <?xml version="1.0"?>

         <!-- the default grammar language is US English -->
         <grammar xmlns="http://www.w3.org/2001/06/grammar"
                  xml:lang="en-US" version="1.0" root="request">

             <!-- single language attachment to tokens -->
             <rule id="yes">
                 <one-of>
                     <item xml:lang="fr-CA">oui</item>
                     <item xml:lang="en-US">yes</item>
                 </one-of>
             </rule>

             <!-- single language attachment to a rule expansion -->
             <rule id="request">
                 may I speak to
                 <one-of xml:lang="fr-CA">
                     <item>Michel Tremblay</item>
                     <item>Andre Roy</item>
                 </one-of>
             </rule>
         </grammar>

   S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
         Channel-Identifier:32AECB23433801@speechrecog

   S->C: MRCP/2.0 ... START-OF-INPUT 543257 IN-PROGRESS
         Channel-Identifier:32AECB23433801@speechrecog
   S->C: MRCP/2.0 ... RECOGNITION-COMPLETE 543257 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog
         Completion-Cause:000 success
         Waveform-URI:<http://web.media.com/session123/audio.wav>;
                      size=342456;duration=25435
         Content-Type:application/nlsml+xml
         Content-Length:...

         <?xml version="1.0"?>
         <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                 xmlns:ex="http://www.example.com/example"
                 grammar="session:request1@form-level.store">
             <interpretation>
                 <instance name="Person">
                     <ex:Person>
                         <ex:Name> Andre Roy </ex:Name>
                     </ex:Person>
                 </instance>
                 <input> may I speak to Andre Roy </input>
             </interpretation>
         </result>

If the result were instead an enrollment result, the final message from the server above could have been:

   S->C: MRCP/2.0 ... RECOGNITION-COMPLETE 543257 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog
         Completion-Cause:000 success
         Content-Type:application/nlsml+xml
         Content-Length:...

         <?xml version="1.0"?>
         <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                 grammar="Personal-Grammar-URI">
             <enrollment-result>
                 <num-clashes> 2 </num-clashes>
                 <num-good-repetitions> 1 </num-good-repetitions>
                 <num-repetitions-still-needed>
                     1
                 </num-repetitions-still-needed>
                 <consistency-status> consistent </consistency-status>
                 <clash-phrase-ids>
                     <item> Jeff </item>
                     <item> Andre </item>
                 </clash-phrase-ids>
                 <transcriptions>
                     <item> m ay b r ow k er </item>
                     <item> m ax r aa k ah </item>
                 </transcriptions>
                 <confusable-phrases>
                     <item>
                         <phrase> call </phrase>
                         <confusion-level> 10 </confusion-level>
                     </item>
                 </confusable-phrases>
             </enrollment-result>
         </result>

9.15. START-PHRASE-ENROLLMENT
The START-PHRASE-ENROLLMENT method from the client to the server starts a new phrase enrollment session, during which the client can call RECOGNIZE multiple times to enroll a new utterance in a grammar. An enrollment session consists of a set of calls to RECOGNIZE in which the caller speaks a phrase several times so the system can "learn" it. The phrase is then added to a personal grammar (speaker-trained grammar), so that the system can recognize it later. Only one phrase enrollment session can be active at a time for a resource. The Personal-Grammar-URI identifies the grammar that is used during enrollment to store the personal list of phrases.

Once RECOGNIZE is called, the result is returned in a RECOGNITION-COMPLETE event and will contain either an enrollment result OR a recognition result for a regular recognition. Calling END-PHRASE-ENROLLMENT ends the ongoing phrase enrollment session, which is typically done after a sequence of successful calls to RECOGNIZE. This method can be called to commit the new phrase to the personal grammar or to abort the phrase enrollment session.

The grammar to contain the new enrolled phrase, specified by Personal-Grammar-URI, is created if it does not exist. Also, the personal grammar MUST ONLY contain phrases added via a phrase enrollment session.

The Phrase-ID passed to this method is used to identify this phrase in the grammar and will be returned as the speech input when doing a RECOGNIZE on the grammar. The Phrase-NL similarly is returned in a RECOGNITION-COMPLETE event in the same manner as other Natural Language (NL) in a grammar. The tag-format of this NL is implementation specific.

If the client has specified Save-Best-Waveform as true, then the response after ending the phrase enrollment session MUST contain the location/URI of a recording of the best repetition of the learned phrase.
   C->S: MRCP/2.0 ... START-PHRASE-ENROLLMENT 543258
         Channel-Identifier:32AECB23433801@speechrecog
         Num-Min-Consistent-Pronunciations:2
         Consistency-Threshold:30
         Clash-Threshold:12
         Personal-Grammar-URI:<personal grammar uri>
         Phrase-Id:<phrase id>
         Phrase-NL:<NL phrase>
         Weight:1
         Save-Best-Waveform:true

   S->C: MRCP/2.0 ... 543258 200 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog

9.16. ENROLLMENT-ROLLBACK
The ENROLLMENT-ROLLBACK method discards the last live utterance from the RECOGNIZE operation. The client can invoke this method when the caller provides undesirable input such as non-speech noises, side-speech, commands, utterances from the RECOGNIZE grammar, etc. Note that this method does not provide a stack of rollback states. Executing ENROLLMENT-ROLLBACK twice in succession without an intervening recognition operation has no effect the second time.

   C->S: MRCP/2.0 ... ENROLLMENT-ROLLBACK 543261
         Channel-Identifier:32AECB23433801@speechrecog

   S->C: MRCP/2.0 ... 543261 200 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog

9.17. END-PHRASE-ENROLLMENT
The client MAY call the END-PHRASE-ENROLLMENT method ONLY during an active phrase enrollment session. It MUST NOT be called during an ongoing RECOGNIZE operation. To commit the new phrase in the grammar, the client MAY call this method once successive calls to RECOGNIZE have succeeded and Num-Repetitions-Still-Needed has been returned as 0 in the RECOGNITION-COMPLETE event. Alternatively, the client MAY abort the phrase enrollment session by calling this method with the Abort-Phrase-Enrollment header field.

If the client has specified Save-Best-Waveform as "true" in the START-PHRASE-ENROLLMENT request, then the response MUST contain a Waveform-URI header field whose value is the location/URI of a recording of the best repetition of the learned phrase.

   C->S: MRCP/2.0 ... END-PHRASE-ENROLLMENT 543262
         Channel-Identifier:32AECB23433801@speechrecog
   S->C: MRCP/2.0 ... 543262 200 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog
         Waveform-URI:<http://mediaserver.com/recordings/file1324.wav>;
                      size=242453;duration=25432
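To abort rather than commit, the client would include the Abort-Phrase-Enrollment header field mentioned above. A hypothetical exchange, with an invented request-id:

   C->S: MRCP/2.0 ... END-PHRASE-ENROLLMENT 543263
         Channel-Identifier:32AECB23433801@speechrecog
         Abort-Phrase-Enrollment:true

   S->C: MRCP/2.0 ... 543263 200 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog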
9.18. MODIFY-PHRASE

The MODIFY-PHRASE method sent from the client to the server is used to change the phrase ID, NL phrase, and/or weight for a given phrase in a personal grammar. If no fields are supplied, then calling this method has no effect.

   C->S: MRCP/2.0 ... MODIFY-PHRASE 543265
         Channel-Identifier:32AECB23433801@speechrecog
         Personal-Grammar-URI:<personal grammar uri>
         Phrase-Id:<phrase id>
         New-Phrase-Id:<new phrase id>
         Phrase-NL:<NL phrase>
         Weight:1

   S->C: MRCP/2.0 ... 543265 200 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog

9.19. DELETE-PHRASE
The DELETE-PHRASE method sent from the client to the server is used to delete a phrase that is in a personal grammar and was added through voice enrollment or text enrollment. If the specified phrase does not exist, this method has no effect.

   C->S: MRCP/2.0 ... DELETE-PHRASE 543266
         Channel-Identifier:32AECB23433801@speechrecog
         Personal-Grammar-URI:<personal grammar uri>
         Phrase-Id:<phrase id>

   S->C: MRCP/2.0 ... 543266 200 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog

9.20. INTERPRET
The INTERPRET method from the client to the server takes as input an Interpret-Text header field containing the text for which the semantic interpretation is desired and returns, via the INTERPRETATION-COMPLETE event, an interpretation result that is very similar to the one returned from a RECOGNIZE method invocation. Only portions of the result relevant to acoustic matching are excluded from the result. The Interpret-Text header field MUST be included in the INTERPRET request.

Recognizer grammar data is treated in the same way as it is when issuing a RECOGNIZE method call.

If a RECOGNIZE, RECORD, or another INTERPRET operation is already in progress for the resource, the server MUST reject the request with a response having a status-code of 402 "Method not valid in this state" and a COMPLETE request-state.

   C->S: MRCP/2.0 ... INTERPRET 543266
         Channel-Identifier:32AECB23433801@speechrecog
         Interpret-Text:may I speak to Andre Roy
         Content-Type:application/srgs+xml
         Content-ID:<request1@form-level.store>
         Content-Length:...

         <?xml version="1.0"?>

         <!-- the default grammar language is US English -->
         <grammar xmlns="http://www.w3.org/2001/06/grammar"
                  xml:lang="en-US" version="1.0" root="request">

             <!-- single language attachment to tokens -->
             <rule id="yes">
                 <one-of>
                     <item xml:lang="fr-CA">oui</item>
                     <item xml:lang="en-US">yes</item>
                 </one-of>
             </rule>

             <!-- single language attachment to a rule expansion -->
             <rule id="request">
                 may I speak to
                 <one-of xml:lang="fr-CA">
                     <item>Michel Tremblay</item>
                     <item>Andre Roy</item>
                 </one-of>
             </rule>
         </grammar>

   S->C: MRCP/2.0 ... 543266 200 IN-PROGRESS
         Channel-Identifier:32AECB23433801@speechrecog
   S->C: MRCP/2.0 ... INTERPRETATION-COMPLETE 543266 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog
         Completion-Cause:000 success
         Content-Type:application/nlsml+xml
         Content-Length:...

         <?xml version="1.0"?>
         <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                 xmlns:ex="http://www.example.com/example"
                 grammar="session:request1@form-level.store">
             <interpretation>
                 <instance name="Person">
                     <ex:Person>
                         <ex:Name> Andre Roy </ex:Name>
                     </ex:Person>
                 </instance>
                 <input> may I speak to Andre Roy </input>
             </interpretation>
         </result>

9.21. INTERPRETATION-COMPLETE
This event from the recognizer resource to the client indicates that the INTERPRET operation is complete. The interpretation result is sent in the body of the MRCPv2 message. The request-state MUST be set to COMPLETE. The Completion-Cause header field MUST be included in this event and MUST be set to an appropriate value from the list of cause codes.

   C->S: MRCP/2.0 ... INTERPRET 543266
         Channel-Identifier:32AECB23433801@speechrecog
         Interpret-Text:may I speak to Andre Roy
         Content-Type:application/srgs+xml
         Content-ID:<request1@form-level.store>
         Content-Length:...

         <?xml version="1.0"?>

         <!-- the default grammar language is US English -->
         <grammar xmlns="http://www.w3.org/2001/06/grammar"
                  xml:lang="en-US" version="1.0" root="request">

             <!-- single language attachment to tokens -->
             <rule id="yes">
                 <one-of>
                     <item xml:lang="fr-CA">oui</item>
                     <item xml:lang="en-US">yes</item>
                 </one-of>
             </rule>
             <!-- single language attachment to a rule expansion -->
             <rule id="request">
                 may I speak to
                 <one-of xml:lang="fr-CA">
                     <item>Michel Tremblay</item>
                     <item>Andre Roy</item>
                 </one-of>
             </rule>
         </grammar>

   S->C: MRCP/2.0 ... 543266 200 IN-PROGRESS
         Channel-Identifier:32AECB23433801@speechrecog

   S->C: MRCP/2.0 ... INTERPRETATION-COMPLETE 543266 COMPLETE
         Channel-Identifier:32AECB23433801@speechrecog
         Completion-Cause:000 success
         Content-Type:application/nlsml+xml
         Content-Length:...

         <?xml version="1.0"?>
         <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                 xmlns:ex="http://www.example.com/example"
                 grammar="session:request1@form-level.store">
             <interpretation>
                 <instance name="Person">
                     <ex:Person>
                         <ex:Name> Andre Roy </ex:Name>
                     </ex:Person>
                 </instance>
                 <input> may I speak to Andre Roy </input>
             </interpretation>
         </result>

9.22. DTMF Detection
Digits received as DTMF tones are delivered to the recognition resource in the MRCPv2 server in the RTP stream according to RFC 4733 [RFC4733]. The Automatic Speech Recognizer (ASR) MUST support RFC 4733 to recognize digits, and it MAY support recognizing DTMF tones [Q.23] in the audio.
10. Recorder Resource
This resource captures received audio and video and stores it as content pointed to by a URI. The main usages of recorders are:

1. to capture speech audio that may be submitted for recognition at a later time, and

2. to record voice or video mails.

Both of these applications require functionality above and beyond that specified by protocols such as RTSP [RFC2326]. This includes audio endpointing (i.e., detecting speech or silence). The support for video is OPTIONAL and is mainly for capturing video mails that may require the speech or audio processing mentioned above.

A recorder MUST provide endpointing capabilities for suppressing silence at the beginning and end of a recording, and it MAY also suppress silence in the middle of a recording. If such suppression is done, the recorder MUST maintain timing metadata to indicate the actual time stamps of the recorded media.

See the discussion on the sensitivity of saved waveforms in Section 12.

10.1. Recorder State Machine
   Idle                    Recording
   State                   State
    |                         |
    |---------RECORD--------->|
    |                         |
    |<------STOP--------------|
    |                         |
    |<--RECORD-COMPLETE-------|
    |                         |
    |                |--------|
    |   START-OF-INPUT        |
    |                |------->|
    |                         |
    |                |--------|
    |   START-INPUT-TIMERS    |
    |                |------->|

                 Recorder State Machine
10.2. Recorder Methods
The recorder resource supports the following methods.

   recorder-method = "RECORD"
                   / "STOP"
                   / "START-INPUT-TIMERS"

10.3. Recorder Events
The recorder resource can generate the following events.

   recorder-event = "START-OF-INPUT"
                  / "RECORD-COMPLETE"

10.4. Recorder Header Fields
Method invocations for the recorder resource can contain resource-specific header fields containing request options and information to augment the Method, Response, or Event message it is associated with.

   recorder-header = sensitivity-level
                   / no-input-timeout
                   / completion-cause
                   / completion-reason
                   / failed-uri
                   / failed-uri-cause
                   / record-uri
                   / media-type
                   / max-time
                   / trim-length
                   / final-silence
                   / capture-on-speech
                   / ver-buffer-utterance
                   / start-input-timers
                   / new-audio-channel

10.4.1. Sensitivity-Level
To filter out background noise and not mistake it for speech, the recorder can support a variable level of sound sensitivity. The Sensitivity-Level header field is a float value between 0.0 and 1.0 and allows the client to set the sensitivity level for the recorder. This header field MAY occur in RECORD, SET-PARAMS, or GET-PARAMS. A higher value for this header field means higher sensitivity. The default value for this header field is implementation specific.

   sensitivity-level = "Sensitivity-Level" ":" FLOAT CRLF
10.4.2. No-Input-Timeout
When recording is started and there is no speech detected for a certain period of time, the recorder can send a RECORD-COMPLETE event to the client and terminate the record operation. The No-Input-Timeout header field can set this timeout value. The value is in milliseconds. This header field MAY occur in RECORD, SET-PARAMS, or GET-PARAMS. The value for this header field ranges from 0 to an implementation-specific maximum value. The default value for this header field is implementation specific.

   no-input-timeout = "No-Input-Timeout" ":" 1*19DIGIT CRLF

10.4.3. Completion-Cause
This header field MUST be part of a RECORD-COMPLETE event from the recorder resource to the client. It indicates the reason behind the RECORD method completion. This header field MUST be sent in the RECORD responses if they return with a failure status and a COMPLETE state. In the ABNF below, the 'cause-code' contains a numerical value selected from the Cause-Code column of the following table. The 'cause-name' contains the corresponding token selected from the Cause-Name column.

   completion-cause = "Completion-Cause" ":" cause-code SP
                      cause-name CRLF
   cause-code       = 3DIGIT
   cause-name       = *VCHAR

   +------------+------------------+----------------------------------+
   | Cause-Code | Cause-Name       | Description                      |
   +------------+------------------+----------------------------------+
   | 000        | success-silence  | RECORD completed with silence at |
   |            |                  | the end.                         |
   | 001        | success-maxtime  | RECORD completed after reaching  |
   |            |                  | the maximum recording time       |
   |            |                  | specified in the RECORD method.  |
   | 002        | no-input-timeout | RECORD failed due to no input.   |
   | 003        | uri-failure      | Failure accessing the record     |
   |            |                  | URI.                             |
   | 004        | error            | RECORD request terminated        |
   |            |                  | prematurely due to a recorder    |
   |            |                  | error.                           |
   +------------+------------------+----------------------------------+
10.4.4. Completion-Reason
This header field MAY be present in a RECORD-COMPLETE event coming from the recorder resource to the client. It contains text describing the reason behind the RECORD request completion, for example, the reason for a failure. The completion reason text is provided for client use in logs and for debugging and instrumentation purposes. Clients MUST NOT interpret the completion reason text.

   completion-reason = "Completion-Reason" ":" quoted-string CRLF

10.4.5. Failed-URI
When a recorder method needs to post the audio to a URI and access to the URI fails, the server MUST provide the failed URI in this header field in the method response.

   failed-uri = "Failed-URI" ":" absoluteURI CRLF

10.4.6. Failed-URI-Cause
When a recorder method needs to post the audio to a URI and access to the URI fails, the server MAY provide the URI-specific or protocol-specific response code through this header field in the method response. The value encoding is UTF-8 (RFC 3629 [RFC3629]) to accommodate any access protocol -- some access protocols might have a response string instead of a numeric response code.

   failed-uri-cause = "Failed-URI-Cause" ":" 1*UTFCHAR CRLF

10.4.7. Record-URI
When a recorder method contains this header field, the server MUST capture the audio and store it. If the header field is present but specified with no value, the server MUST store the content locally and generate a URI that points to it. This URI is then returned in either the STOP response or the RECORD-COMPLETE event. If the header field in the RECORD method specifies a URI, the server MUST attempt to capture and store the audio at that location. If this header field is not specified in the RECORD request, the server MUST capture the audio, MUST encode it, and MUST send it in the STOP response or the RECORD-COMPLETE event as a message body. In this case, the response carrying the audio content MUST include a Content-ID (cid) [RFC2392] value in this header field pointing to the Content-ID in the message body.

The server MUST also return the size in octets and the duration in milliseconds of the recorded audio waveform as parameters associated with the header field.

Implementations MUST support the 'http' [RFC2616], 'https' [RFC2818], 'file' [RFC3986], and 'cid' [RFC2392] schemes in the URI. Note that implementations already exist that support other schemes.

   record-uri = "Record-URI" ":" ["<" uri ">"
                ";" "size" "=" 1*19DIGIT
                ";" "duration" "=" 1*19DIGIT] CRLF
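To tie these recorder header fields together, here is a hypothetical RECORD exchange. The channel identifier, request-id, URI, media type, and all timing values are invented, and the exact header choices would depend on the application:

   C->S: MRCP/2.0 ... RECORD 543270
         Channel-Identifier:32AECB23433801@recorder
         Record-URI:<file://mediaserver/recordings/msg0001.wav>
         Media-Type:audio/wav
         Capture-On-Speech:true
         Final-Silence:300
         Max-Time:60000

   S->C: MRCP/2.0 ... 543270 200 IN-PROGRESS
         Channel-Identifier:32AECB23433801@recorder

   S->C: MRCP/2.0 ... START-OF-INPUT 543270 IN-PROGRESS
         Channel-Identifier:32AECB23433801@recorder

   S->C: MRCP/2.0 ... RECORD-COMPLETE 543270 COMPLETE
         Channel-Identifier:32AECB23433801@recorder
         Completion-Cause:000 success-silence
         Record-URI:<file://mediaserver/recordings/msg0001.wav>;
                    size=242552;duration=25645

10.4.8. Media-Type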
A RECORD method MUST contain this header field, which specifies to the server the media type of the captured audio or video.

   media-type = "Media-Type" ":" media-type-value CRLF

10.4.9. Max-Time
When recording is started, this header field specifies the maximum length of the recording in milliseconds, calculated from the time the actual capture and store begins, which is not necessarily the time the RECORD method is received. It specifies the duration before silence suppression, if any, has been applied by the recorder resource. After this time, the recording stops, and the server MUST return a RECORD-COMPLETE event to the client having a request-state of COMPLETE. This header field MAY occur in RECORD, SET-PARAMS, or GET-PARAMS. The value for this header field ranges from 0 to an implementation-specific maximum value. A value of 0 means infinity, and hence the recording continues until one or more of the other stop conditions are met. The default value for this header field is 0.

   max-time = "Max-Time" ":" 1*19DIGIT CRLF
10.4.10. Trim-Length
This header field MAY be sent on a STOP method and specifies the length of audio to be trimmed from the end of the recording after the stop. The length is interpreted to be in milliseconds. The default value for this header field is 0.

   trim-length = "Trim-Length" ":" 1*19DIGIT CRLF

10.4.11. Final-Silence
When the recorder is started and the actual capture begins, this header field specifies the length of silence in the audio that is to be interpreted as the end of the recording. This header field MAY occur in RECORD, SET-PARAMS, or GET-PARAMS. The value for this header field ranges from 0 to an implementation-specific maximum value and is interpreted to be in milliseconds. A value of 0 means infinity, and hence the recording will continue until one of the other stop conditions is met. The default value for this header field is implementation specific.

   final-silence = "Final-Silence" ":" 1*19DIGIT CRLF

10.4.12. Capture-On-Speech
If "false", the recorder MUST start capturing immediately when started. If "true", the recorder MUST wait for the endpointing functionality to detect speech before it starts capturing. This header field MAY occur in the RECORD, SET-PARAMS, or GET-PARAMS. The value for this header field is a Boolean. The default value for this header field is "false". capture-on-speech = "Capture-On-Speech " ":" BOOLEAN CRLF10.4.13. Ver-Buffer-Utterance
This header field is the same as the one described for the verifier resource (see Section 11.4.14). It tells the server to buffer the utterance associated with this recording request into the verification buffer. Sending this header field is permitted only if a verification buffer exists for the session. This buffer is shared across resources within a session. It gets instantiated when a verifier resource is added to the session and is released when the verifier resource is released from the session.
10.4.14. Start-Input-Timers
This header field MAY be sent as part of the RECORD request. A value of "false" tells the recorder resource to start the operation, but not to start the no-input timer until the client sends a START-INPUT-TIMERS request to the recorder resource. This is useful in the scenario when the recorder and synthesizer resources are not part of the same session. When a kill-on-barge-in prompt is being played, the client may want the RECORD request to be simultaneously active so that it can detect and implement kill-on-barge-in (see Section 8.4.2). But at the same time, the client doesn't want the recorder resource to start the no-input timers until the prompt is finished. The default value is "true".

   start-input-timers = "Start-Input-Timers" ":" BOOLEAN CRLF

10.4.15. New-Audio-Channel
This header field is the same as the one described for the recognizer resource (see Section 9.4.23).