
RFC 6787

Media Resource Control Protocol Version 2 (MRCPv2)

Pages: 224
Proposed Standard

Top   ToC   RFC6787 - Page 99   prevText

9.6. Recognizer Results

The recognizer portion of NLSML (see Section 6.3.1) represents information automatically extracted from a user's utterances by a semantic interpretation component, where "utterance" is to be taken in the general sense of a meaningful user input in any modality supported by the MRCPv2 implementation.

9.6.1. Markup Functions

MRCPv2 recognizer resources employ the Natural Language Semantics Markup Language (NLSML) to interpret natural language speech input and to format the interpretation for consumption by an MRCPv2 client. The elements of the markup fall into the following general functional categories: interpretation, side information, and multi-modal integration.
9.6.1.1. Interpretation
Elements and attributes represent the semantics of a user's utterance, including the <result>, <interpretation>, and <instance> elements. The <result> element contains the full result of processing one utterance. It MAY contain multiple <interpretation> elements if the interpretation of the utterance results in multiple alternative meanings due to uncertainty in speech recognition or natural language understanding. There are at least two reasons for providing multiple interpretations:

   1.  The client application might have additional information, for
       example, information from a database, that would allow it to
       select a preferred interpretation from among the possible
       interpretations returned from the semantic interpreter.

   2.  A client-based dialog manager (e.g., VoiceXML
       [W3C.REC-voicexml20-20040316]) that was unable to select between
       several competing interpretations could use this information to
       go back to the user and find out what was intended.  For example,
       it could issue a SPEAK request to a synthesizer resource to emit
       "Did you say 'Boston' or 'Austin'?"

9.6.1.2. Side Information
These are elements and attributes representing additional information about the interpretation, over and above the interpretation itself. Side information includes:

   1.  Whether an interpretation was achieved (the <nomatch> element)
       and the system's confidence in an interpretation (the
       "confidence" attribute of <interpretation>).

   2.  Alternative interpretations (<interpretation>)

   3.  Input formats and Automatic Speech Recognition (ASR)
       information: the <input> element, representing the input to the
       semantic interpreter.

9.6.1.3. Multi-Modal Integration
When more than one modality is available for input, the interpretation of the inputs needs to be coordinated. The "mode" attribute of <input> supports this by indicating whether the utterance was input by speech, DTMF, pointing, etc. The "timestamp-start" and "timestamp-end" attributes of <input> also provide for temporal coordination by indicating when inputs occurred.

9.6.2. Overview of Recognizer Result Elements and Their Relationships

The recognizer elements in NLSML fall into two categories: 1. description of the input that was processed, and 2. description of the meaning that was extracted from the input. Next to each element are its attributes. In addition, some elements can contain multiple instances of other elements. For example, a <result> can contain multiple <interpretation> elements, each of which is taken to be an alternative. Similarly, <input> can contain multiple child <input> elements, which are taken to be cumulative.

As a simple illustration of the basic usage of these elements, consider the utterance "OK" (interpreted as "yes"). The example illustrates how that utterance and its interpretation would be represented in the NLSML markup.

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           xmlns:ex="http://www.example.com/example"
           grammar="http://www.example.com/theYesNoGrammar">
     <interpretation>
       <instance>
         <ex:response>yes</ex:response>
       </instance>
       <input>OK</input>
     </interpretation>
   </result>

   This example includes only the minimum required information.  There
   is an overall <result> element, which includes one <interpretation>
   element.  The interpretation contains an <input> element and, inside
   <instance>, the application-specific element "<ex:response>", which
   is the semantically interpreted result.
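
   As a rough illustration of how a client might consume such a result,
   the following Python sketch parses the "OK"/"yes" example with the
   standard library.  The function name summarize() and the assumption
   that the application payload is an <ex:response> element are
   illustrative only; real instance content is application specific.

```python
import xml.etree.ElementTree as ET

NLSML_NS = "urn:ietf:params:xml:ns:mrcpv2"   # NLSML namespace from the example
EX_NS = "http://www.example.com/example"     # application namespace from the example

NLSML_DOC = """<?xml version="1.0"?>
<result xmlns="urn:ietf:params:xml:ns:mrcpv2"
        xmlns:ex="http://www.example.com/example"
        grammar="http://www.example.com/theYesNoGrammar">
  <interpretation>
    <instance>
      <ex:response>yes</ex:response>
    </instance>
    <input>OK</input>
  </interpretation>
</result>"""


def summarize(nlsml_text):
    """Return (grammar, [(input_text, response_text), ...]) for one result.

    Hypothetical helper: it assumes each <instance> carries an
    <ex:response> child, as in the example document above.
    """
    root = ET.fromstring(nlsml_text)
    ns = {"n": NLSML_NS, "ex": EX_NS}
    pairs = []
    for interp in root.findall("n:interpretation", ns):
        heard = interp.findtext("n:input", default="", namespaces=ns).strip()
        meant = interp.findtext("n:instance/ex:response", default="",
                                namespaces=ns).strip()
        pairs.append((heard, meant))
    return root.get("grammar"), pairs


grammar, pairs = summarize(NLSML_DOC)
print(grammar, pairs)
```

   Note that the elements live in the NLSML namespace, so lookups need
   a namespace mapping; an unqualified search for "interpretation"
   would find nothing.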

9.6.3. Elements and Attributes

9.6.3.1. <result> Root Element
The root element of the markup is <result>. The <result> element includes one or more <interpretation> elements. Multiple interpretations can result from ambiguities in the input or in the semantic interpretation. If the "grammar" attribute does not apply to all of the interpretations in the result, it can be overridden for individual interpretations at the <interpretation> level.

Attributes:

   1.  grammar: The grammar or recognition rule matched by this result.
       The format of the grammar attribute will match the rule
       reference semantics defined in the grammar specification.
       Specifically, the rule reference is in the external XML form for
       grammar rule references.  The markup interpreter needs to know
       the grammar rule that is matched by the utterance because
       multiple rules may be simultaneously active.  The value is the
       grammar URI used by the markup interpreter to specify the
       grammar.  The grammar can be overridden by a grammar attribute
       in the <interpretation> element if the input was ambiguous as to
       which grammar it matched.  If all interpretation elements within
       the result element contain their own grammar attributes, the
       attribute can be dropped from the result element.

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           grammar="http://www.example.com/grammar">
     <interpretation>
      ....
     </interpretation>
   </result>

9.6.3.2. <interpretation> Element
An <interpretation> element contains a single semantic interpretation.

Attributes:

   1.  confidence: A float value from 0.0 to 1.0 indicating the semantic
       analyzer's confidence in this interpretation.  A value of 1.0
       indicates maximum confidence.  The values are implementation
       dependent but are intended to align with the value
       interpretation for the confidence MRCPv2 header field defined in
       Section 9.4.1.  This attribute is OPTIONAL.

   2.  grammar: The grammar or recognition rule matched by this
       interpretation (if needed to override the grammar specification
       at the <result> level).  This attribute is only needed under
       <interpretation> if it is necessary to override a grammar that
       was defined at the <result> level.  Note that the grammar
       attribute for the interpretation element is optional if and only
       if the grammar attribute is specified in the <result> element.

Interpretations MUST be sorted best-first by some measure of "goodness". The goodness measure is "confidence" if present; otherwise, it is some implementation-specific indication of quality.

The grammar is expected to be specified most frequently at the <result> level. However, it can be overridden at the <interpretation> level because it is possible that different interpretations may match different grammar rules.

The <interpretation> element includes an optional <input> element containing the input being analyzed, and at least one <instance> element containing the interpretation of the utterance.

   <interpretation confidence="0.75"
                   grammar="http://www.example.com/grammar">
     ...
   </interpretation>
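
   The grammar override and best-first ordering rules can be sketched
   as follows.  This is a minimal illustration, not a normative
   algorithm: the sample document, the second grammar URI, and the
   helper names interpretations() and is_best_first() are assumptions
   made for the example.

```python
import xml.etree.ElementTree as ET

NS = {"n": "urn:ietf:params:xml:ns:mrcpv2"}

# Hypothetical two-way ambiguity; the second interpretation overrides
# the grammar given at the <result> level.
SAMPLE = """<result xmlns="urn:ietf:params:xml:ns:mrcpv2"
        grammar="http://www.example.com/grammar">
  <interpretation confidence="0.75">
    <instance>Boston</instance>
    <input>Boston</input>
  </interpretation>
  <interpretation confidence="0.40"
                  grammar="http://www.example.com/other-grammar">
    <instance>Austin</instance>
    <input>Austin</input>
  </interpretation>
</result>"""


def interpretations(nlsml_text):
    """List each interpretation with its effective grammar and confidence."""
    root = ET.fromstring(nlsml_text)
    result_grammar = root.get("grammar")
    items = []
    for interp in root.findall("n:interpretation", NS):
        conf = interp.get("confidence")
        items.append({
            # The <interpretation> grammar wins; otherwise inherit <result>'s.
            "grammar": interp.get("grammar", result_grammar),
            "confidence": float(conf) if conf is not None else None,
        })
    return items


def is_best_first(items):
    """Check non-increasing confidence, ignoring entries without one."""
    confs = [i["confidence"] for i in items if i["confidence"] is not None]
    return all(a >= b for a, b in zip(confs, confs[1:]))


items = interpretations(SAMPLE)
print(items, is_best_first(items))
```

   When "confidence" is absent, the ordering is defined by an
   implementation-specific measure that a client cannot verify, so the
   check above simply skips such entries.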
9.6.3.3. <instance> Element
The <instance> element contains the interpretation of the utterance. When the Semantic Interpretation for Speech Recognition format is used, the <instance> element contains the XML serialization of the result using the approach defined in that specification. When there is semantic markup in the grammar that does not create semantic objects, but instead only does a semantic translation of a portion of the input, such as translating "coke" to "coca-cola", the instance contains the whole input but with the translation applied. The NLSML looks like the markup in Figure 2 below. If there are no semantic objects created, nor any semantic translation, the instance value is the same as the input value.

Attributes:

   1.  confidence: Each element of the instance MAY have a confidence
       attribute, defined in the NLSML namespace.  The confidence
       attribute contains a float value in the range from 0.0 to 1.0
       reflecting the system's confidence in the analysis of that slot.
       A value of 1.0 indicates maximum confidence.  The values are
       implementation dependent, but are intended to align with the
       value interpretation for the MRCPv2 header field
       Confidence-Threshold defined in Section 9.4.1.  This attribute
       is OPTIONAL.

   <instance>
     <nameAddress>
       <street confidence="0.75">123 Maple Street</street>
       <city>Mill Valley</city>
       <state>CA</state>
       <zip>90952</zip>
     </nameAddress>
   </instance>
   <input>
     My address is 123 Maple Street,
     Mill Valley, California, 90952
   </input>

   <instance>
     I would like to buy a coca-cola
   </instance>
   <input>
     I would like to buy a coke
   </input>

                      Figure 2: NLSML Example

9.6.3.4. <input> Element
The <input> element is the text representation of a user's input. It includes an optional "confidence" attribute, which indicates the recognizer's confidence in the recognition result (as opposed to the confidence in the interpretation, which is indicated by the "confidence" attribute of <interpretation>). Optional "timestamp-start" and "timestamp-end" attributes indicate the start and end times of a spoken utterance, in ISO 8601 format [ISO.8601.1988].

Attributes:

   1.  timestamp-start: The time at which the input began.  (optional)

   2.  timestamp-end: The time at which the input ended.  (optional)

   3.  mode: The modality of the input, for example, speech, DTMF, etc.
       (optional)

   4.  confidence: The confidence of the recognizer in the correctness
       of the input in the range 0.0 to 1.0.  (optional)

Note that it may not make sense for temporally overlapping inputs to have the same mode; however, this constraint is not expected to be enforced by implementations.

When there is no time zone designator, ISO 8601 time representations default to local time.

There are three possible formats for the <input> element.

   1.  The <input> element can contain simple text:

       <input>onions</input>

       A future possibility is for <input> to contain not only text but
       additional markup that represents prosodic information that was
       contained in the original utterance and extracted by the speech
       recognizer.  This depends on the availability of ASRs that are
       capable of producing prosodic information.  MRCPv2 clients MUST
       be prepared to receive such markup and MAY make use of it.

   2.  An <input> tag can also contain additional <input> tags.  Having
       additional input elements allows the representation to support
       future multi-modal inputs as well as finer-grained speech
       information, such as timestamps for individual words and
       word-level confidences.

       <input>
            <input mode="speech" confidence="0.5"
                timestamp-start="2000-04-03T0:00:00"
                timestamp-end="2000-04-03T0:00:00.2">fried</input>
            <input mode="speech" confidence="1.0"
                timestamp-start="2000-04-03T0:00:00.25"
                timestamp-end="2000-04-03T0:00:00.6">onions</input>
       </input>

   3.  Finally, the <input> element can contain <nomatch> and <noinput>
       elements, which describe situations in which the speech
       recognizer received input that it was unable to process or did
       not receive any input at all, respectively.
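
   The cumulative nature of nested <input> elements (format 2 above)
   can be sketched as a small recursive walk.  The helper name
   flatten_input() is hypothetical, and timestamps are left as strings
   rather than parsed.

```python
import xml.etree.ElementTree as ET

NS = {"n": "urn:ietf:params:xml:ns:mrcpv2"}

# The nested "fried onions" input, with the NLSML namespace declared
# on the outer element so the fragment parses on its own.
SAMPLE = """<input xmlns="urn:ietf:params:xml:ns:mrcpv2">
  <input mode="speech" confidence="0.5"
         timestamp-start="2000-04-03T0:00:00"
         timestamp-end="2000-04-03T0:00:00.2">fried</input>
  <input mode="speech" confidence="1.0"
         timestamp-start="2000-04-03T0:00:00.25"
         timestamp-end="2000-04-03T0:00:00.6">onions</input>
</input>"""


def flatten_input(elem):
    """Return [(text, mode, confidence)] for every leaf <input>, in order."""
    children = elem.findall("n:input", NS)
    if not children:
        conf = elem.get("confidence")
        text = (elem.text or "").strip()
        return [(text, elem.get("mode"), float(conf) if conf else None)]
    tokens = []
    for child in children:
        tokens.extend(flatten_input(child))
    return tokens


tokens = flatten_input(ET.fromstring(SAMPLE))
utterance = " ".join(text for text, _mode, _conf in tokens)
print(tokens, utterance)
```

   Joining the leaf texts in document order reconstructs the whole
   utterance, while the per-leaf tuples keep the word-level mode and
   confidence.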

9.6.3.5. <nomatch> Element
The <nomatch> element under <input> is used to indicate that the semantic interpreter was unable to successfully match any input with confidence above the threshold. It can optionally contain the text of the best of the (rejected) matches.

   <interpretation>
     <instance/>
     <input confidence="0.1">
       <nomatch/>
     </input>
   </interpretation>

   <interpretation>
     <instance/>
     <input mode="speech" confidence="0.1">
       <nomatch>I want to go to New York</nomatch>
     </input>
   </interpretation>

9.6.3.6. <noinput> Element
<noinput> indicates that there was no input -- a timeout occurred in the speech recognizer due to silence.

   <interpretation>
     <instance/>
     <input>
       <noinput/>
     </input>
   </interpretation>

If there are multiple levels of inputs, the most natural place for <nomatch> and <noinput> elements to appear is under the highest level of <input> for <noinput>, and under the appropriate level of <interpretation> for <nomatch>. So, <noinput> means "no input at all" and <nomatch> means "no match in speech modality" or "no match in DTMF modality". For example, to represent garbled speech combined with DTMF "1 2 3 4", the markup would be:

   <input>
      <input mode="speech"><nomatch/></input>
      <input mode="dtmf">1 2 3 4</input>
   </input>

   Note: while <noinput> could be represented as an attribute of input,
   <nomatch> cannot, since it could potentially include PCDATA content
   with the best match.  For parallelism, <noinput> is also an element.
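
   A client might distinguish these outcomes per modality along the
   following lines.  This is a sketch under the assumption that
   <noinput> appears at the top-level <input> and <nomatch> at a leaf
   <input>, as described above; the function name outcomes() and the
   outcome labels are illustrative.

```python
import xml.etree.ElementTree as ET

NS = {"n": "urn:ietf:params:xml:ns:mrcpv2"}

# Garbled speech combined with DTMF "1 2 3 4", namespace added so the
# fragment parses standalone.
SAMPLE = """<input xmlns="urn:ietf:params:xml:ns:mrcpv2">
  <input mode="speech"><nomatch/></input>
  <input mode="dtmf">1 2 3 4</input>
</input>"""


def outcomes(input_elem):
    """Return [(mode, outcome, text)] where outcome is one of
    'noinput', 'nomatch', or 'match'."""
    if input_elem.find("n:noinput", NS) is not None:
        return [(input_elem.get("mode"), "noinput", "")]
    children = input_elem.findall("n:input", NS)
    if children:
        found = []
        for child in children:
            found.extend(outcomes(child))
        return found
    nomatch = input_elem.find("n:nomatch", NS)
    if nomatch is not None:
        # <nomatch> may carry the text of the best rejected match.
        return [(input_elem.get("mode"), "nomatch", (nomatch.text or "").strip())]
    return [(input_elem.get("mode"), "match", (input_elem.text or "").strip())]


print(outcomes(ET.fromstring(SAMPLE)))
```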

9.7. Enrollment Results

All enrollment elements are contained within a single <enrollment-result> element under <result>. The elements are described below and have the schema defined in Section 16.2. The following elements are defined:

   1.  num-clashes

   2.  num-good-repetitions

   3.  num-repetitions-still-needed

   4.  consistency-status

   5.  clash-phrase-ids

   6.  transcriptions

   7.  confusable-phrases

9.7.1. <num-clashes> Element

The <num-clashes> element contains the number of clashes that this pronunciation has with other pronunciations in an active enrollment session. The associated Clash-Threshold header field determines the sensitivity of the clash measurement. Note that clash testing can be turned off completely by setting the Clash-Threshold header field value to 0.

9.7.2. <num-good-repetitions> Element

The <num-good-repetitions> element contains the number of consistent pronunciations obtained so far in an active enrollment session.

9.7.3. <num-repetitions-still-needed> Element

The <num-repetitions-still-needed> element contains the number of consistent pronunciations that must still be obtained before the new phrase can be added to the enrollment grammar. The number of consistent pronunciations required is specified by the client in the request header field Num-Min-Consistent-Pronunciations. The returned value must be 0 before the client can successfully commit a phrase to the grammar by ending the enrollment session.

9.7.4. <consistency-status> Element

The <consistency-status> element is used to indicate how consistent the repetitions are when learning a new phrase. It can have the values of consistent, inconsistent, and undecided.

9.7.5. <clash-phrase-ids> Element

The <clash-phrase-ids> element contains the phrase IDs of clashing pronunciation(s), if any. This element is absent if there are no clashes.

9.7.6. <transcriptions> Element

The <transcriptions> element contains the transcriptions returned in the last repetition of the phrase being enrolled.

9.7.7. <confusable-phrases> Element

The <confusable-phrases> element contains a list of phrases from a command grammar that are confusable with the phrase being added to the personal grammar. This element MAY be absent if there are no confusable phrases.
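
As an illustration of how a client might read the scalar enrollment elements, the following sketch extracts the counters and derives whether the phrase can be committed (that is, whether num-repetitions-still-needed is 0). The sample document is an assumption: the normative schema is in Section 16.2, which is not reproduced here, and the helper name enrollment_status() is hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"n": "urn:ietf:params:xml:ns:mrcpv2"}

# Assumed layout: simple child elements with text content under
# <enrollment-result>. Consult the schema in Section 16.2 for the
# authoritative structure.
SAMPLE = """<result xmlns="urn:ietf:params:xml:ns:mrcpv2">
  <enrollment-result>
    <num-clashes>0</num-clashes>
    <num-good-repetitions>2</num-good-repetitions>
    <num-repetitions-still-needed>0</num-repetitions-still-needed>
    <consistency-status>consistent</consistency-status>
  </enrollment-result>
</result>"""


def enrollment_status(nlsml_text):
    """Return a dict of enrollment counters, or None if no enrollment result."""
    root = ET.fromstring(nlsml_text)
    er = root.find("n:enrollment-result", NS)
    if er is None:
        return None

    def count(tag):
        text = er.findtext("n:" + tag, namespaces=NS)
        return int(text.strip()) if text and text.strip() else None

    still_needed = count("num-repetitions-still-needed")
    status = er.findtext("n:consistency-status", default="", namespaces=NS).strip()
    return {
        "num_clashes": count("num-clashes"),
        "num_good_repetitions": count("num-good-repetitions"),
        "num_repetitions_still_needed": still_needed,
        "consistency_status": status or None,
        # Only an explicit 0 means the phrase can be committed.
        "ready_to_commit": still_needed == 0,
    }


print(enrollment_status(SAMPLE))
```

A missing num-repetitions-still-needed element is deliberately treated as "not ready" rather than as 0.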

9.8. DEFINE-GRAMMAR

The DEFINE-GRAMMAR method, from the client to the server, provides one or more grammars and requests the server to access, fetch, and compile the grammars as needed. The DEFINE-GRAMMAR method implementation MUST do a fetch of all external URIs that are part of that operation. If caching is implemented, this URI fetching MUST conform to the cache control hints and parameter header fields associated with the method in deciding whether the URIs should be fetched from cache or from the external server. If these hints/parameters are not specified in the method, the values set for the session using SET-PARAMS/GET-PARAMS apply. If they were not set for the session, their default values apply.

   If the server resource is in the recognition state, the server MUST
   respond to the DEFINE-GRAMMAR request with a failure status.

   If the resource is in the idle state and is able to successfully
   process the supplied grammars, the server MUST return a success code
   status and the request-state MUST be COMPLETE.

   If the recognizer resource could not define the grammar for some
   reason (for example, if the download failed, the grammar failed to
   compile, or the grammar was in an unsupported form), the MRCPv2
   response for the DEFINE-GRAMMAR method MUST contain a failure status-
   code of 407 and contain a Completion-Cause header field describing
   the failure reason.

   C->S:MRCP/2.0 ... DEFINE-GRAMMAR 543257
   Channel-Identifier:32AECB23433801@speechrecog
   Content-Type:application/srgs+xml
   Content-ID:<request1@form-level.store>
   Content-Length:...

   <?xml version="1.0"?>

   <!-- the default grammar language is US English -->
   <grammar xmlns="http://www.w3.org/2001/06/grammar"
            xml:lang="en-US" version="1.0">

   <!-- single language attachment to tokens -->
         <rule id="yes">
               <one-of>
                     <item xml:lang="fr-CA">oui</item>
                     <item xml:lang="en-US">yes</item>
               </one-of>
         </rule>

   <!-- single language attachment to a rule expansion -->
         <rule id="request">
               may I speak to
               <one-of xml:lang="fr-CA">
                     <item>Michel Tremblay</item>
                     <item>Andre Roy</item>
               </one-of>
         </rule>

   </grammar>

   S->C:MRCP/2.0 ... 543257 200 COMPLETE
   Channel-Identifier:32AECB23433801@speechrecog
   Completion-Cause:000 success

   C->S:MRCP/2.0 ... DEFINE-GRAMMAR 543258
   Channel-Identifier:32AECB23433801@speechrecog
   Content-Type:application/srgs+xml
   Content-ID:<helpgrammar@root-level.store>
   Content-Length:...

   <?xml version="1.0"?>

   <!-- the default grammar language is US English -->
   <grammar xmlns="http://www.w3.org/2001/06/grammar"
            xml:lang="en-US" version="1.0">

         <rule id="request">
               I need help
         </rule>

   </grammar>

   S->C:MRCP/2.0 ... 543258 200 COMPLETE
   Channel-Identifier:32AECB23433801@speechrecog
   Completion-Cause:000 success

   C->S:MRCP/2.0 ... DEFINE-GRAMMAR 543259
   Channel-Identifier:32AECB23433801@speechrecog
   Content-Type:application/srgs+xml
   Content-ID:<request2@field-level.store>
   Content-Length:...

   <?xml version="1.0" encoding="UTF-8"?>

   <!DOCTYPE grammar PUBLIC "-//W3C//DTD GRAMMAR 1.0//EN"
                     "http://www.w3.org/TR/speech-grammar/grammar.dtd">

   <grammar xmlns="http://www.w3.org/2001/06/grammar" xml:lang="en"
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xsi:schemaLocation="http://www.w3.org/2001/06/grammar
              http://www.w3.org/TR/speech-grammar/grammar.xsd"
              version="1.0" mode="voice" root="basicCmd">

   <meta name="author" content="Stephanie Williams"/>

   <rule id="basicCmd" scope="public">
     <example> please move the window </example>
     <example> open a file </example>

     <ruleref
       uri="http://grammar.example.com/politeness.grxml#startPolite"/>
     <ruleref uri="#command"/>
     <ruleref
       uri="http://grammar.example.com/politeness.grxml#endPolite"/>
   </rule>

   <rule id="command">
     <ruleref uri="#action"/> <ruleref uri="#object"/>
   </rule>

   <rule id="action">
      <one-of>
         <item weight="10"> open   <tag>open</tag>   </item>
         <item weight="2">  close  <tag>close</tag>  </item>
         <item weight="1">  delete <tag>delete</tag> </item>
         <item weight="1">  move   <tag>move</tag>   </item>
      </one-of>
   </rule>

   <rule id="object">
     <item repeat="0-1">
       <one-of>
         <item> the </item>
         <item> a </item>
       </one-of>
     </item>

     <one-of>
         <item> window </item>
         <item> file </item>
         <item> menu </item>
     </one-of>
   </rule>

   </grammar>


   S->C:MRCP/2.0 ... 543259 200 COMPLETE
   Channel-Identifier:32AECB23433801@speechrecog
   Completion-Cause:000 success

   C->S:MRCP/2.0 ... RECOGNIZE 543260
   Channel-Identifier:32AECB23433801@speechrecog
   N-Best-List-Length:2
   Content-Type:text/uri-list
   Content-Length:...

   session:request1@form-level.store
   session:request2@field-level.store
   session:helpgrammar@root-level.store

   S->C:MRCP/2.0 ... 543260 200 IN-PROGRESS
   Channel-Identifier:32AECB23433801@speechrecog

   S->C:MRCP/2.0 ... START-OF-INPUT 543260 IN-PROGRESS
   Channel-Identifier:32AECB23433801@speechrecog

   S->C:MRCP/2.0 ... RECOGNITION-COMPLETE 543260 COMPLETE
   Channel-Identifier:32AECB23433801@speechrecog
   Completion-Cause:000 success
   Waveform-URI:<http://web.media.com/session123/audio.wav>;
                size=124535;duration=2340
   Content-Type:application/x-nlsml
   Content-Length:...

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           xmlns:ex="http://www.example.com/example"
           grammar="session:request1@form-level.store">
     <interpretation>
       <instance name="Person">
         <ex:Person>
           <ex:Name> Andre Roy </ex:Name>
         </ex:Person>
       </instance>
       <input>   may I speak to Andre Roy </input>
     </interpretation>
   </result>

                          Define Grammar Example
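
   In the exchange above, each grammar defined with a Content-ID is
   later referenced in the RECOGNIZE body by a 'session' URI carrying
   the same identifier.  A minimal sketch of that mapping follows; the
   helper names session_uri() and uri_list_body() are hypothetical, and
   the CRLF line endings follow the 'text/uri-list' convention.

```python
def session_uri(content_id):
    """Map a Content-ID value such as '<id@domain>' to 'session:id@domain'."""
    return "session:" + content_id.strip().strip("<>")


def uri_list_body(content_ids):
    """Build a text/uri-list body, one URI per line, highest precedence first."""
    return "".join(session_uri(cid) + "\r\n" for cid in content_ids)


body = uri_list_body([
    "<request1@form-level.store>",
    "<request2@field-level.store>",
    "<helpgrammar@root-level.store>",
])
print(body)
```

   Deriving the URIs from the Content-ID values, instead of typing
   them by hand, avoids references that silently fail to match a
   previously defined grammar.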

9.9. RECOGNIZE

The RECOGNIZE method from the client to the server requests the recognizer to start recognition and provides it with one or more grammar references for grammars to match against the input media. The RECOGNIZE method can carry header fields to control the sensitivity, confidence level, and the level of detail in results provided by the recognizer. These header field values override the current values set by a previous SET-PARAMS method.

The RECOGNIZE method can request the recognizer resource to operate in normal or hotword mode as specified by the Recognition-Mode header field. The default value is "normal". If the resource could not start a recognition, the server MUST respond with a failure status-code of 407 and a Completion-Cause header field in the response describing the cause of failure.

   The RECOGNIZE request uses the message body to specify the grammars
   applicable to the request.  The active grammar(s) for the request can
   be specified in one of three ways.  If the client needs to explicitly
   control grammar weights for the recognition operation, it MUST employ
   method 3 below.  The order of these grammars specifies the precedence
   of the grammars that is used when more than one grammar in the list
   matches the speech; in this case, the grammar with the higher
   precedence is returned as a match.  This precedence capability is
   useful in applications like VoiceXML browsers to order grammars
   specified at the dialog, document, and root level of a VoiceXML
   application.

   1.  The grammar MAY be placed directly in the message body as typed
       content.  If more than one grammar is included in the body, the
       order of inclusion controls the corresponding precedence for the
       grammars during recognition, with earlier grammars in the body
       having a higher precedence than later ones.

   2.  The body MAY contain a list of grammar URIs specified in content
       of media type 'text/uri-list' [RFC2483].  The order of the URIs
       determines the corresponding precedence for the grammars during
       recognition, with highest precedence first and decreasing for
       each URI thereafter.

   3.  The body MAY contain a list of grammar URIs specified in content
       of media type 'text/grammar-ref-list'.  This type defines a list
       of grammar URIs and allows each grammar URI to be assigned a
       weight in the list.  This weight has the same meaning as the
       weights described in Section 2.4.1 of the Speech Recognition
       Grammar Specification (SRGS) [W3C.REC-speech-grammar-20040316].
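
   As a sketch of method 3, the following builds a weighted grammar
   reference list.  The line syntax used here, an angle-bracketed URI
   optionally followed by ;weight="value", is an assumption about the
   'text/grammar-ref-list' format, which is defined elsewhere in this
   specification and should be checked there; the function name
   grammar_ref_list() and the example URIs are hypothetical.

```python
def grammar_ref_list(refs):
    """Build a grammar reference list body from (uri, weight) pairs.

    A weight of None omits the weight parameter for that grammar.
    Assumed syntax: <uri>;weight="value" with CRLF line endings.
    """
    lines = []
    for uri, weight in refs:
        line = "<" + uri + ">"
        if weight is not None:
            line += ';weight="{}"'.format(weight)
        lines.append(line + "\r\n")
    return "".join(lines)


body = grammar_ref_list([
    ("http://example.com/grammars/field1.gram", None),
    ("session:field3@form-level.store", 0.9),
])
print(body)
```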

   In addition to performing recognition on the input, the recognizer
   MUST also enroll the collected utterance in a personal grammar if the
   Enroll-Utterance header field is set to true and an Enrollment is
   active (via an earlier execution of the START-PHRASE-ENROLLMENT
   method).  If so, and if the RECOGNIZE request contains a Content-ID
   header field, then the resulting grammar (which includes the personal
   grammar as a sub-grammar) can be referenced through the 'session' URI
   scheme (see Section 13.6).

   If the resource was able to successfully start the recognition, the
   server MUST return a success status-code and a request-state of
   IN-PROGRESS.  This means that the recognizer is active and that the
   client MUST be prepared to receive further events with this
   request-id.

   If the resource was able to queue the request, the server MUST return
   a success code and request-state of PENDING.  This means that the
   recognizer is currently active with another request and that this
   request has been queued for processing.

   If the resource could not start a recognition, the server MUST
   respond with a failure status-code of 407 and a Completion-Cause
   header field in the response describing the cause of failure.

   For the recognizer resource, RECOGNIZE and INTERPRET are the only
   requests that return a request-state of IN-PROGRESS, meaning that
   recognition is in progress.  When the recognition completes by
   matching one of the grammar alternatives or by a timeout without a
   match or for some other reason, the recognizer resource MUST send the
   client a RECOGNITION-COMPLETE event (or INTERPRETATION-COMPLETE, if
   INTERPRET was the request) with the result of the recognition and a
   request-state of COMPLETE.
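
   A client has to tell responses and events apart and notice when a
   request reaches COMPLETE.  The sketch below classifies the start
   lines as they appear in the examples of this section (version,
   message length, then either a request-id or a method/event name).
   The function name parse_start_line() is hypothetical, and the
   message length is kept as an opaque token because the examples
   elide it with "...".

```python
def parse_start_line(line):
    """Classify an MRCPv2 start line as a request, response, or event."""
    parts = line.split()
    if len(parts) < 4 or parts[0] != "MRCP/2.0":
        raise ValueError("not an MRCPv2 start line: %r" % line)
    third = parts[2]
    if third.isdigit():
        # Response: request-id, status-code, request-state.
        if len(parts) != 5:
            raise ValueError("malformed response line: %r" % line)
        return {"kind": "response", "request_id": int(third),
                "status_code": int(parts[3]), "request_state": parts[4]}
    if len(parts) == 4:
        # Request: method-name, request-id.
        return {"kind": "request", "method": third,
                "request_id": int(parts[3])}
    if len(parts) == 5:
        # Event: event-name, request-id, request-state.
        return {"kind": "event", "event": third,
                "request_id": int(parts[3]), "request_state": parts[4]}
    raise ValueError("malformed start line: %r" % line)


for text in ("MRCP/2.0 ... RECOGNIZE 543257",
             "MRCP/2.0 ... 543257 200 IN-PROGRESS",
             "MRCP/2.0 ... RECOGNITION-COMPLETE 543257 COMPLETE"):
    print(parse_start_line(text))
```

   A client would keep the request open until it sees a message for
   that request-id whose request-state is COMPLETE.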

   Large grammars can take a long time for the server to compile.  For
   grammars that are used repeatedly, the client can improve server
   performance by issuing a DEFINE-GRAMMAR request with the grammar
   ahead of time.  In such a case, the client can issue the RECOGNIZE
   request and reference the grammar through the 'session' URI scheme
   (see Section 13.6).  This also applies in general if the client wants
   to repeat recognition with a previous inline grammar.

   The RECOGNIZE method implementation MUST do a fetch of all external
   URIs that are part of that operation.  If caching is implemented,
   this URI fetching MUST conform to the cache control hints and
   parameter header fields associated with the method in deciding
   whether the URIs should be fetched from cache or from the external
   server.  If these hints/parameters are not specified in the method,
   the values set for the session using SET-PARAMS/GET-PARAMS apply.  If
   they were not set for the session, their default values apply.

   Note that since the audio and the messages are carried over separate
   communication paths there may be a race condition between the start
   of the flow of audio and the receipt of the RECOGNIZE method.  For
   example, if an audio flow is started by the client at the same time
   as the RECOGNIZE method is sent, either the audio or the RECOGNIZE
   can arrive at the recognizer first.  As another example, the client
   may choose to continuously send audio to the server and signal the
   server to recognize using the RECOGNIZE method.  Mechanisms to
   resolve this condition are outside the scope of this specification.
   The recognizer can expect the media to start flowing when it receives
   the RECOGNIZE request, but it MUST NOT buffer anything it receives
   beforehand in order to preserve the semantics that application
   authors expect with respect to the input timers.

   When a RECOGNIZE method has been received, the recognition is
   initiated on the stream.  The No-Input-Timer MUST be started at this
   time if the Start-Input-Timers header field is specified as "true".
   If this header field is set to "false", the No-Input-Timer MUST be
   started when the server receives the START-INPUT-TIMERS method from
   the client.  The Recognition-Timeout MUST be started when the
   recognition resource detects speech or a DTMF digit in the media
   stream.

   For recognition when not in hotword mode:

   When the recognizer resource detects speech or a DTMF digit in the
   media stream, it MUST send the START-OF-INPUT event.  When enough
   speech has been collected for the server to process, the recognizer
   can try to match the collected speech with the active grammars.  If
   the speech collected at this point fully matches with any of the
   active grammars, the Speech-Complete-Timer is started.  If it matches
   partially with one or more of the active grammars, with more speech
   needed before a full match is achieved, then the Speech-Incomplete-
   Timer is started.

   1.  When the No-Input-Timer expires, the recognizer MUST complete
       with a Completion-Cause code of "no-input-timeout".

   2.  The recognizer MUST support detecting a no-match condition upon
       detecting end of speech.  The recognizer MAY support detecting a
       no-match condition before waiting for end-of-speech.  If this is
       supported, this capability is enabled by setting the Early-No-
       Match header field to "true".  Upon detecting a no-match
       condition, the RECOGNIZE MUST return with "no-match".

   3.  When the Speech-Incomplete-Timer expires, the recognizer SHOULD
       complete with a Completion-Cause code of "partial-match", unless
       the recognizer cannot differentiate a partial-match, in which
       case it MUST return a Completion-Cause code of "no-match".  The
       recognizer MAY return results for the partially matched grammar.

   4.  When the Speech-Complete-Timer expires, the recognizer MUST
       complete with a Completion-Cause code of "success".

   5.  When the Recognition-Timeout expires, one of the following MUST
       happen:

       5.1.  If there was a partial-match, the recognizer SHOULD
             complete with a Completion-Cause code of "partial-match-
             maxtime", unless the recognizer cannot differentiate a
             partial-match, in which case it MUST complete with a
             Completion-Cause code of "no-match-maxtime".  The
             recognizer MAY return results for the partially matched
             grammar.

       5.2.  If there was a full-match, the recognizer MUST complete
             with a Completion-Cause code of "success-maxtime".

       5.3.  If there was no match, the recognizer MUST complete with
             a Completion-Cause code of "no-match-maxtime".
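
   The timer rules above for normal (non-hotword) mode can be
   summarized as a small decision function.  This is a sketch, not a
   recognizer implementation: the timer and match-state labels, the
   function name completion_cause(), and the can_detect_partial flag
   are assumptions, and where the text says SHOULD the sketch simply
   takes the recommended branch.

```python
def completion_cause(expired_timer, match_state, can_detect_partial=True):
    """Pick the Completion-Cause name for a timer expiry in normal mode.

    expired_timer: 'no-input', 'speech-incomplete', 'speech-complete',
                   or 'recognition-timeout' (labels chosen for this sketch).
    match_state:   'none', 'partial', or 'full'.
    """
    if expired_timer == "no-input":
        return "no-input-timeout"
    if expired_timer == "speech-complete":
        return "success"
    if expired_timer == "speech-incomplete":
        return "partial-match" if can_detect_partial else "no-match"
    if expired_timer == "recognition-timeout":
        if match_state == "full":
            return "success-maxtime"
        if match_state == "partial" and can_detect_partial:
            return "partial-match-maxtime"
        return "no-match-maxtime"
    raise ValueError("unknown timer: %r" % expired_timer)


print(completion_cause("recognition-timeout", "partial"))
```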

   For recognition in hotword mode:

   Note that for recognition in hotword mode the START-OF-INPUT event is
   not generated when speech or a DTMF digit is detected.

   1.  When the No-Input-Timer expires, the recognizer MUST complete
       with a Completion-Cause code of "no-input-timeout".

   2.  If at any point a match occurs, the RECOGNIZE MUST complete with
       a Completion-Cause code of "success".

   3.  When the Recognition-Timeout expires and there is not a match,
       the RECOGNIZE MUST complete with a Completion-Cause code of
       "hotword-maxtime".

   4.  When the Recognition-Timeout expires and there is a match, the
       RECOGNIZE MUST complete with a Completion-Cause code of "success-
       maxtime".

   5.  When the Recognition-Timeout is running but the detected speech/
       DTMF has not resulted in a match, the Recognition-Timeout MUST be
       stopped and reset.  It MUST then be restarted when speech/DTMF is
       again detected.
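One possible reading of hotword rules 3 to 5 is sketched below. The `HotwordTimeout` class, its method names, and the use of integer millisecond timestamps are all assumptions made for illustration. In particular, the sketch assumes the timer starts when input is first detected and is stopped and reset when a stretch of input ends without a match.

```python
from typing import Optional


class HotwordTimeout:
    """Tracks the Recognition-Timeout in hotword mode (illustrative only)."""

    def __init__(self, timeout_ms: int) -> None:
        self.timeout_ms = timeout_ms
        # None means the timer is stopped and reset.
        self.started_at: Optional[int] = None

    def on_input_detected(self, now_ms: int) -> None:
        # Rule 5: (re)start the timer when speech/DTMF is detected.
        if self.started_at is None:
            self.started_at = now_ms

    def on_input_ended_without_match(self) -> None:
        # Rule 5: input did not produce a match, so stop and reset.
        self.started_at = None

    def expired(self, now_ms: int) -> bool:
        if self.started_at is None:
            return False
        return now_ms - self.started_at >= self.timeout_ms


def cause_on_expiry(matched: bool) -> str:
    # Rules 3 and 4: the cause depends on whether a match exists at expiry.
    return "success-maxtime" if matched else "hotword-maxtime"
```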

   Below is a complete example of using RECOGNIZE.  It shows the call to
   RECOGNIZE, the IN-PROGRESS and START-OF-INPUT status messages, and
   the final RECOGNITION-COMPLETE message containing the result.

   C->S:MRCP/2.0 ... RECOGNIZE 543257
   Channel-Identifier:32AECB23433801@speechrecog
   Confidence-Threshold:0.9
   Content-Type:application/srgs+xml
   Content-ID:<request1@form-level.store>
   Content-Length:...

   <?xml version="1.0"?>

   <!-- the default grammar language is US English -->
   <grammar xmlns="http://www.w3.org/2001/06/grammar"
            xml:lang="en-US" version="1.0" root="request">

   <!-- single language attachment to tokens -->
       <rule id="yes">
           <one-of>
               <item xml:lang="fr-CA">oui</item>
               <item xml:lang="en-US">yes</item>
           </one-of>
       </rule>

   <!-- single language attachment to a rule expansion -->
       <rule id="request">
           may I speak to
           <one-of xml:lang="fr-CA">
               <item>Michel Tremblay</item>
               <item>Andre Roy</item>
           </one-of>
       </rule>

   </grammar>

   S->C: MRCP/2.0 ... 543257 200 IN-PROGRESS
   Channel-Identifier:32AECB23433801@speechrecog

   S->C:MRCP/2.0 ... START-OF-INPUT 543257 IN-PROGRESS
   Channel-Identifier:32AECB23433801@speechrecog

   S->C:MRCP/2.0 ... RECOGNITION-COMPLETE 543257 COMPLETE
   Channel-Identifier:32AECB23433801@speechrecog
   Completion-Cause:000 success
   Waveform-URI:<http://web.media.com/session123/audio.wav>;
                 size=424252;duration=2543
   Content-Type:application/nlsml+xml
   Content-Length:...

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           xmlns:ex="http://www.example.com/example"
           grammar="session:request1@form-level.store">
       <interpretation>
           <instance name="Person">
               <ex:Person>
                   <ex:Name> Andre Roy </ex:Name>
               </ex:Person>
           </instance>
           <input>   may I speak to Andre Roy </input>
       </interpretation>
   </result>
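A client might pull the recognized text and the semantic instance out of such a result roughly as follows. This is a sketch using Python's standard `xml.etree.ElementTree`. The function name `summarize_result`, the returned dictionary keys, and the `SAMPLE` constant are hypothetical; the sample body mirrors the result shown above.

```python
import xml.etree.ElementTree as ET

# Namespace of the NLSML <result> element, as used in the example above.
NLSML_NS = {"n": "urn:ietf:params:xml:ns:mrcpv2"}

SAMPLE = """<?xml version="1.0"?>
<result xmlns="urn:ietf:params:xml:ns:mrcpv2"
        xmlns:ex="http://www.example.com/example"
        grammar="session:request1@form-level.store">
    <interpretation>
        <instance name="Person">
            <ex:Person>
                <ex:Name> Andre Roy </ex:Name>
            </ex:Person>
        </instance>
        <input>   may I speak to Andre Roy </input>
    </interpretation>
</result>"""


def _squash(text: str) -> str:
    """Collapse runs of whitespace and trim both ends."""
    return " ".join(text.split())


def summarize_result(nlsml: str) -> list:
    """Return one dict per <interpretation> with its input and instance text."""
    root = ET.fromstring(nlsml.strip())
    summaries = []
    for interp in root.findall("n:interpretation", NLSML_NS):
        input_el = interp.find("n:input", NLSML_NS)
        instance_el = interp.find("n:instance", NLSML_NS)
        summaries.append({
            "input": _squash(input_el.text or "") if input_el is not None else None,
            # itertext() walks application-specific children such as ex:Name.
            "instance_text": (_squash("".join(instance_el.itertext()))
                              if instance_el is not None else None),
        })
    return summaries
```

Because a result MAY carry several interpretations, the function returns a list rather than a single value.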

   Below is an example of calling RECOGNIZE with a different grammar.
   No status or completion messages are shown in this example, although
   they would of course occur in normal usage.

   C->S:   MRCP/2.0 ... RECOGNIZE 543257
           Channel-Identifier:32AECB23433801@speechrecog
           Confidence-Threshold:0.9
           Fetch-Timeout:20
           Content-Type:application/srgs+xml
           Content-Length:...

           <?xml version="1.0"?>
           <grammar xmlns="http://www.w3.org/2001/06/grammar"
                    xml:lang="en-US" version="1.0" mode="voice"
                    root="rule_list">
            <rule id="rule_list" scope="public">
                <one-of>
                    <item weight="10">
                        <ruleref uri=
               "http://grammar.example.com/world-cities.grxml#canada"/>
                    </item>
                    <item weight="1.5">
                        <ruleref uri=
               "http://grammar.example.com/world-cities.grxml#america"/>
                    </item>
                    <item weight="0.5">
                        <ruleref uri=
               "http://grammar.example.com/world-cities.grxml#india"/>
                    </item>
                </one-of>
            </rule>
           </grammar>

9.10. STOP

The STOP method from the client to the server tells the resource to stop recognition if a request is active. If a RECOGNIZE request is active and the STOP request successfully terminated it, then the response header section contains an Active-Request-Id-List header field containing the request-id of the RECOGNIZE request that was terminated. In this case, no RECOGNITION-COMPLETE event is sent for the terminated request. If there was no recognition active, then the response MUST NOT contain an Active-Request-Id-List header field. Either way, the response MUST contain a status-code of 200 "Success".

   C->S:   MRCP/2.0 ... RECOGNIZE 543257
           Channel-Identifier:32AECB23433801@speechrecog
           Confidence-Threshold:0.9
           Content-Type:application/srgs+xml
           Content-ID:<request1@form-level.store>
           Content-Length:...

           <?xml version="1.0"?>

           <!-- the default grammar language is US English -->
           <grammar xmlns="http://www.w3.org/2001/06/grammar"
                    xml:lang="en-US" version="1.0" root="request">

           <!-- single language attachment to tokens -->
               <rule id="yes">
                   <one-of>
                       <item xml:lang="fr-CA">oui</item>
                       <item xml:lang="en-US">yes</item>
                   </one-of>
               </rule>

           <!-- single language attachment to a rule expansion -->
               <rule id="request">
                   may I speak to
                   <one-of xml:lang="fr-CA">
                       <item>Michel Tremblay</item>
                       <item>Andre Roy</item>
                   </one-of>
               </rule>
           </grammar>

   S->C:   MRCP/2.0 ... 543257 200 IN-PROGRESS
           Channel-Identifier:32AECB23433801@speechrecog

   C->S:   MRCP/2.0 ... STOP 543258
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 ... 543258 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog
           Active-Request-Id-List:543257
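A client can use the presence of Active-Request-Id-List in the STOP response to decide which pending requests will never receive a RECOGNITION-COMPLETE event. The sketch below is illustrative: `parse_headers` and `stopped_request_ids` are hypothetical helpers, header names are compared case-insensitively, and the list is assumed to be comma-separated.

```python
def parse_headers(lines: list) -> dict:
    """Parse 'Name:value' header lines into a dict keyed by lowercase name."""
    headers = {}
    for line in lines:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return headers


def stopped_request_ids(response_headers: dict) -> list:
    """Return the request-ids terminated by STOP, or [] if none was active."""
    value = response_headers.get("active-request-id-list", "")
    return [rid.strip() for rid in value.split(",") if rid.strip()]
```

An empty list means no recognition was active, so the client has nothing to cancel on its side.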

9.11. GET-RESULT

The GET-RESULT method from the client to the server MAY be issued when the recognizer resource is in the recognized state. This request allows the client to retrieve results for a completed recognition. This is useful if the client decides it wants more alternatives or more information. When the server receives this request, it re-computes and returns the results according to the recognition constraints provided in the GET-RESULT request.

The GET-RESULT request can specify constraints such as a different confidence-threshold or n-best-list-length. This capability is OPTIONAL for MRCPv2 servers and the automatic speech recognition engine in the server MUST return a status of unsupported feature if not supported.

   C->S:   MRCP/2.0 ... GET-RESULT 543257
           Channel-Identifier:32AECB23433801@speechrecog
           Confidence-Threshold:0.9

   S->C:   MRCP/2.0 ... 543257 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog
           Content-Type:application/nlsml+xml
           Content-Length:...

           <?xml version="1.0"?>
           <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                   xmlns:ex="http://www.example.com/example"
                   grammar="session:request1@form-level.store">
               <interpretation>
                   <instance name="Person">
                       <ex:Person>
                           <ex:Name> Andre Roy </ex:Name>
                       </ex:Person>
                   </instance>
                   <input>   may I speak to Andre Roy </input>
               </interpretation>
           </result>

9.12. START-OF-INPUT

This is an event from the server to the client indicating that the recognizer resource has detected speech or a DTMF digit in the media stream. This event is useful in implementing kill-on-barge-in scenarios when a synthesizer resource is in a different session from the recognizer resource and hence is not aware of an incoming audio source (see Section 8.4.2). In these cases, it is up to the client to act as an intermediary and respond to this event by issuing a BARGE-IN-OCCURRED event to the synthesizer resource. The recognizer resource also MUST send a Proxy-Sync-Id header field with a unique value for this event. This event MUST be generated by the server, irrespective of whether or not the synthesizer and recognizer are on the same server.
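The intermediary role of the client can be sketched as follows: on receiving START-OF-INPUT, it copies the Proxy-Sync-Id into the BARGE-IN-OCCURRED message it sends on the synthesizer channel. The function name `barge_in_headers`, the dictionary representation of headers, and the channel and tag values used in the usage note are hypothetical.

```python
def barge_in_headers(start_of_input_headers: dict, synth_channel_id: str) -> dict:
    """Build the header fields for a BARGE-IN-OCCURRED message.

    start_of_input_headers: header fields of the START-OF-INPUT event.
    synth_channel_id: Channel-Identifier of the synthesizer resource
                      (a value known to the client, assumed here).
    """
    proxy_sync_id = start_of_input_headers.get("Proxy-Sync-Id")
    if proxy_sync_id is None:
        # The recognizer MUST send this header field with START-OF-INPUT.
        raise ValueError("START-OF-INPUT is missing Proxy-Sync-Id")
    return {
        "Channel-Identifier": synth_channel_id,
        # Forward the same tag so the synthesizer can correlate the event.
        "Proxy-Sync-Id": proxy_sync_id,
    }
```

Calling `barge_in_headers({"Proxy-Sync-Id": "987654"}, "32AECB23433802@speechsynth")` returns the two header fields the client would place in its BARGE-IN-OCCURRED message.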

9.13. START-INPUT-TIMERS

This request is sent from the client to the recognizer resource when it knows that a kill-on-barge-in prompt has finished playing (see Section 8.4.2). This is useful in the scenario when the recognition and synthesizer engines are not in the same session. When a kill-on-barge-in prompt is being played, the client may want a RECOGNIZE request to be simultaneously active so that it can detect and implement kill-on-barge-in. At the same time, the client does not want the recognizer to start the no-input timers until the prompt is finished. The Start-Input-Timers header field in the RECOGNIZE request allows the client to say whether or not the timers should be started immediately. If not, the recognizer resource MUST NOT start the timers until the client sends a START-INPUT-TIMERS method to the recognizer.

9.14. RECOGNITION-COMPLETE

This is an event from the recognizer resource to the client indicating that the recognition completed. The recognition result is sent in the body of the MRCPv2 message. The request-state field MUST be COMPLETE indicating that this is the last event with that request-id and that the request with that request-id is now complete. The server MUST maintain the recognizer context containing the results and the audio waveform input of that recognition until the next RECOGNIZE request is issued for that resource or the session terminates. If the server returns a URI to the audio waveform, it MUST do so in a Waveform-URI header field in the RECOGNITION-COMPLETE event. The client can use this URI to retrieve or play back the audio.

   Note, if an enrollment session was active, the RECOGNITION-COMPLETE
   event can contain either recognition or enrollment results depending
   on what was spoken.  The following example shows a complete exchange
   with a recognition result.

   C->S:   MRCP/2.0 ... RECOGNIZE 543257
           Channel-Identifier:32AECB23433801@speechrecog
           Confidence-Threshold:0.9
           Content-Type:application/srgs+xml
           Content-ID:<request1@form-level.store>
           Content-Length:...

           <?xml version="1.0"?>

           <!-- the default grammar language is US English -->
           <grammar xmlns="http://www.w3.org/2001/06/grammar"
                    xml:lang="en-US" version="1.0" root="request">

           <!-- single language attachment to tokens -->
               <rule id="yes">
                   <one-of>
                       <item xml:lang="fr-CA">oui</item>
                       <item xml:lang="en-US">yes</item>
                   </one-of>
               </rule>

           <!-- single language attachment to a rule expansion -->
               <rule id="request">
                   may I speak to
                   <one-of xml:lang="fr-CA">
                       <item>Michel Tremblay</item>
                       <item>Andre Roy</item>
                   </one-of>
               </rule>
           </grammar>

   S->C:   MRCP/2.0 ... 543257 200 IN-PROGRESS
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 ... START-OF-INPUT 543257 IN-PROGRESS
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 ... RECOGNITION-COMPLETE 543257 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog
           Completion-Cause:000 success
           Waveform-URI:<http://web.media.com/session123/audio.wav>;
                        size=342456;duration=25435
           Content-Type:application/nlsml+xml
           Content-Length:...

           <?xml version="1.0"?>
           <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                   xmlns:ex="http://www.example.com/example"
                   grammar="session:request1@form-level.store">
               <interpretation>
                   <instance name="Person">
                       <ex:Person>
                           <ex:Name> Andre Roy </ex:Name>
                       </ex:Person>
                   </instance>
                   <input>   may I speak to Andre Roy </input>
               </interpretation>
           </result>

   If the result were instead an enrollment result, the final message
   from the server above could have been:

   S->C:   MRCP/2.0 ... RECOGNITION-COMPLETE 543257 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog
           Completion-Cause:000 success
           Content-Type:application/nlsml+xml
           Content-Length:...

           <?xml version="1.0"?>
           <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                   grammar="Personal-Grammar-URI">
               <enrollment-result>
                   <num-clashes> 2 </num-clashes>
                   <num-good-repetitions> 1 </num-good-repetitions>
                   <num-repetitions-still-needed>
                      1
                   </num-repetitions-still-needed>
                   <consistency-status> consistent </consistency-status>
                   <clash-phrase-ids>
                       <item> Jeff </item> <item> Andre </item>
                   </clash-phrase-ids>
                   <transcriptions>
                        <item> m ay b r ow k er </item>
                        <item> m ax r aa k ah </item>
                   </transcriptions>
                   <confusable-phrases>
                        <item>
                             <phrase> call </phrase>
                             <confusion-level> 10 </confusion-level>
                        </item>
                   </confusable-phrases>
               </enrollment-result>
           </result>

9.15. START-PHRASE-ENROLLMENT

The START-PHRASE-ENROLLMENT method from the client to the server starts a new phrase enrollment session during which the client can call RECOGNIZE multiple times to enroll a new utterance in a grammar.

An enrollment session consists of a set of calls to RECOGNIZE in which the caller speaks a phrase several times so the system can "learn" it. The phrase is then added to a personal grammar (speaker-trained grammar), so that the system can recognize it later.

Only one phrase enrollment session can be active at a time for a resource. The Personal-Grammar-URI identifies the grammar that is used during enrollment to store the personal list of phrases. Once RECOGNIZE is called, the result is returned in a RECOGNITION-COMPLETE event and will contain either an enrollment result OR a recognition result for a regular recognition.

Calling END-PHRASE-ENROLLMENT ends the ongoing phrase enrollment session, which is typically done after a sequence of successful calls to RECOGNIZE. This method can be called to commit the new phrase to the personal grammar or to abort the phrase enrollment session.

The grammar to contain the new enrolled phrase, specified by Personal-Grammar-URI, is created if it does not exist. Also, the personal grammar MUST ONLY contain phrases added via a phrase enrollment session.

The Phrase-ID passed to this method is used to identify this phrase in the grammar and will be returned as the speech input when doing a RECOGNIZE on the grammar. The Phrase-NL similarly is returned in a RECOGNITION-COMPLETE event in the same manner as other Natural Language (NL) in a grammar. The tag-format of this NL is implementation specific.

If the client has specified Save-Best-Waveform as true, then the response after ending the phrase enrollment session MUST contain the location/URI of a recording of the best repetition of the learned phrase.

   C->S:   MRCP/2.0 ... START-PHRASE-ENROLLMENT 543258
           Channel-Identifier:32AECB23433801@speechrecog
           Num-Min-Consistent-Pronunciations:2
           Consistency-Threshold:30
           Clash-Threshold:12
           Personal-Grammar-URI:<personal grammar uri>
           Phrase-Id:<phrase id>
           Phrase-NL:<NL phrase>
           Weight:1
           Save-Best-Waveform:true

   S->C:   MRCP/2.0 ... 543258 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog

9.16. ENROLLMENT-ROLLBACK

The ENROLLMENT-ROLLBACK method discards the last live utterance from the RECOGNIZE operation. The client can invoke this method when the caller provides undesirable input such as non-speech noises, side-speech, commands, an utterance from the RECOGNIZE grammar, etc. Note that this method does not provide a stack of rollback states. Executing ENROLLMENT-ROLLBACK twice in succession without an intervening recognition operation has no effect the second time.

   C->S:   MRCP/2.0 ... ENROLLMENT-ROLLBACK 543261
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 ... 543261 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog

9.17. END-PHRASE-ENROLLMENT

The client MAY call the END-PHRASE-ENROLLMENT method ONLY during an active phrase enrollment session. It MUST NOT be called during an ongoing RECOGNIZE operation. To commit the new phrase in the grammar, the client MAY call this method once successive calls to RECOGNIZE have succeeded and Num-Repetitions-Still-Needed has been returned as 0 in the RECOGNITION-COMPLETE event. Alternatively, the client MAY abort the phrase enrollment session by calling this method with the Abort-Phrase-Enrollment header field.

If the client has specified Save-Best-Waveform as "true" in the START-PHRASE-ENROLLMENT request, then the response MUST contain a Waveform-URI header whose value is the location/URI of a recording of the best repetition of the learned phrase.

   C->S:   MRCP/2.0 ... END-PHRASE-ENROLLMENT 543262
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 ... 543262 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog
           Waveform-URI:<http://mediaserver.com/recordings/file1324.wav>;
                        size=242453;duration=25432
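The commit condition described above (the repetitions-still-needed count reaching 0) can be checked against the enrollment result carried in a RECOGNITION-COMPLETE event. The following is a sketch under assumptions: the helper names and the trimmed sample documents are hypothetical, and only the `<num-repetitions-still-needed>` element is inspected.

```python
import xml.etree.ElementTree as ET
from typing import Optional

NLSML_NS = {"n": "urn:ietf:params:xml:ns:mrcpv2"}


def repetitions_still_needed(nlsml: str) -> Optional[int]:
    """Return the remaining repetition count, or None if the body is not
    an enrollment result."""
    root = ET.fromstring(nlsml.strip())
    el = root.find("n:enrollment-result/n:num-repetitions-still-needed",
                   NLSML_NS)
    if el is None or el.text is None:
        return None
    return int(el.text.strip())


def ready_to_commit(nlsml: str) -> bool:
    """True when the client may commit via END-PHRASE-ENROLLMENT."""
    return repetitions_still_needed(nlsml) == 0


# Trimmed sample bodies, for illustration only.
NEEDS_MORE = """<?xml version="1.0"?>
<result xmlns="urn:ietf:params:xml:ns:mrcpv2" grammar="Personal-Grammar-URI">
    <enrollment-result>
        <num-repetitions-still-needed>
           1
        </num-repetitions-still-needed>
    </enrollment-result>
</result>"""

DONE = NEEDS_MORE.replace("           1\n", "           0\n")
```

With these samples, `ready_to_commit(NEEDS_MORE)` is False and `ready_to_commit(DONE)` is True. A regular recognition result yields None from `repetitions_still_needed`, so it never triggers a commit.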

9.18. MODIFY-PHRASE

The MODIFY-PHRASE method sent from the client to the server is used to change the phrase ID, NL phrase, and/or weight for a given phrase in a personal grammar. If no fields are supplied, then calling this method has no effect.

   C->S:   MRCP/2.0 ... MODIFY-PHRASE 543265
           Channel-Identifier:32AECB23433801@speechrecog
           Personal-Grammar-URI:<personal grammar uri>
           Phrase-Id:<phrase id>
           New-Phrase-Id:<new phrase id>
           Phrase-NL:<NL phrase>
           Weight:1

   S->C:   MRCP/2.0 ... 543265 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog

9.19. DELETE-PHRASE

The DELETE-PHRASE method sent from the client to the server is used to delete a phrase that is in a personal grammar and was added through voice enrollment or text enrollment. If the specified phrase does not exist, this method has no effect.

   C->S:   MRCP/2.0 ... DELETE-PHRASE 543266
           Channel-Identifier:32AECB23433801@speechrecog
           Personal-Grammar-URI:<personal grammar uri>
           Phrase-Id:<phrase id>

   S->C:   MRCP/2.0 ... 543266 200 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog

9.20. INTERPRET

The INTERPRET method from the client to the server takes as input an Interpret-Text header field containing the text for which the semantic interpretation is desired, and returns, via the INTERPRETATION-COMPLETE event, an interpretation result that is very similar to the one returned from a RECOGNIZE method invocation. Only
   portions of the result relevant to acoustic matching are excluded
   from the result.  The Interpret-Text header field MUST be included in
   the INTERPRET request.

   Recognizer grammar data is treated in the same way as it is when
   issuing a RECOGNIZE method call.

   If a RECOGNIZE, RECORD, or another INTERPRET operation is already in
   progress for the resource, the server MUST reject the request with a
   response having a status-code of 402 "Method not valid in this
   state", and a COMPLETE request state.

   C->S:   MRCP/2.0 ... INTERPRET 543266
           Channel-Identifier:32AECB23433801@speechrecog
           Interpret-Text:may I speak to Andre Roy
           Content-Type:application/srgs+xml
           Content-ID:<request1@form-level.store>
           Content-Length:...

           <?xml version="1.0"?>
           <!-- the default grammar language is US English -->
           <grammar xmlns="http://www.w3.org/2001/06/grammar"
                    xml:lang="en-US" version="1.0" root="request">
           <!-- single language attachment to tokens -->
               <rule id="yes">
                   <one-of>
                       <item xml:lang="fr-CA">oui</item>
                       <item xml:lang="en-US">yes</item>
                   </one-of>
               </rule>

           <!-- single language attachment to a rule expansion -->
               <rule id="request">
                   may I speak to
                   <one-of xml:lang="fr-CA">
                       <item>Michel Tremblay</item>
                       <item>Andre Roy</item>
                   </one-of>
               </rule>
           </grammar>

   S->C:   MRCP/2.0 ... 543266 200 IN-PROGRESS
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 ... INTERPRETATION-COMPLETE 543266 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog
           Completion-Cause:000 success
           Content-Type:application/nlsml+xml
           Content-Length:...

           <?xml version="1.0"?>
           <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                   xmlns:ex="http://www.example.com/example"
                   grammar="session:request1@form-level.store">
               <interpretation>
                   <instance name="Person">
                       <ex:Person>
                           <ex:Name> Andre Roy </ex:Name>
                       </ex:Person>
                   </instance>
                   <input>   may I speak to Andre Roy </input>
               </interpretation>
           </result>

9.21. INTERPRETATION-COMPLETE

This event from the recognizer resource to the client indicates that the INTERPRET operation is complete. The interpretation result is sent in the body of the MRCP message. The request state MUST be set to COMPLETE.

The Completion-Cause header field MUST be included in this event and MUST be set to an appropriate value from the list of cause codes.

   C->S:   MRCP/2.0 ... INTERPRET 543266
           Channel-Identifier:32AECB23433801@speechrecog
           Interpret-Text:may I speak to Andre Roy
           Content-Type:application/srgs+xml
           Content-ID:<request1@form-level.store>
           Content-Length:...

           <?xml version="1.0"?>
           <!-- the default grammar language is US English -->
           <grammar xmlns="http://www.w3.org/2001/06/grammar"
                    xml:lang="en-US" version="1.0" root="request">
           <!-- single language attachment to tokens -->
               <rule id="yes">
                   <one-of>
                       <item xml:lang="fr-CA">oui</item>
                       <item xml:lang="en-US">yes</item>
                   </one-of>
               </rule>

           <!-- single language attachment to a rule expansion -->
               <rule id="request">
                   may I speak to
                   <one-of xml:lang="fr-CA">
                       <item>Michel Tremblay</item>
                       <item>Andre Roy</item>
                   </one-of>
               </rule>
           </grammar>

   S->C:   MRCP/2.0 ... 543266 200 IN-PROGRESS
           Channel-Identifier:32AECB23433801@speechrecog

   S->C:   MRCP/2.0 ... INTERPRETATION-COMPLETE 543266 COMPLETE
           Channel-Identifier:32AECB23433801@speechrecog
           Completion-Cause:000 success
           Content-Type:application/nlsml+xml
           Content-Length:...

           <?xml version="1.0"?>
           <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
                   xmlns:ex="http://www.example.com/example"
                   grammar="session:request1@form-level.store">
               <interpretation>
                   <instance name="Person">
                       <ex:Person>
                           <ex:Name> Andre Roy </ex:Name>
                       </ex:Person>
                   </instance>
                   <input>   may I speak to Andre Roy </input>
               </interpretation>
           </result>

9.22. DTMF Detection

Digits received as DTMF tones are delivered to the recognition resource in the MRCPv2 server in the RTP stream according to RFC 4733 [RFC4733]. The Automatic Speech Recognizer (ASR) MUST support RFC 4733 to recognize digits, and it MAY support recognizing DTMF tones [Q.23] in the audio.

10. Recorder Resource

This resource captures received audio and video and stores it as content pointed to by a URI. The main usages of recorders are

   1.  to capture speech audio that may be submitted for recognition at
       a later time, and

   2.  to record voice or video mails.

Both of these applications require functionality above and beyond that specified by protocols such as RTSP [RFC2326]. This includes audio endpointing (i.e., detecting speech or silence). Support for video is OPTIONAL and is mainly for capturing video mails that may require the speech or audio processing mentioned above.

A recorder MUST provide endpointing capabilities for suppressing silence at the beginning and end of a recording, and it MAY also suppress silence in the middle of a recording. If such suppression is done, the recorder MUST maintain timing metadata to indicate the actual time stamps of the recorded media.

See the discussion on the sensitivity of saved waveforms in Section 12.

10.1. Recorder State Machine

   Idle                   Recording
   State                  State
    |                       |
    |---------RECORD------->|
    |                       |
    |<------STOP------------|
    |                       |
    |<--RECORD-COMPLETE-----|
    |                       |
    |              |--------|
    |       START-OF-INPUT  |
    |              |------->|
    |                       |
    |              |--------|
    |    START-INPUT-TIMERS |
    |              |------->|
    |                       |

                 Recorder State Machine

10.2. Recorder Methods

The recorder resource supports the following methods.

   recorder-method      =  "RECORD"
                        /  "STOP"
                        /  "START-INPUT-TIMERS"

10.3. Recorder Events

The recorder resource can generate the following events.

   recorder-event       =  "START-OF-INPUT"
                        /  "RECORD-COMPLETE"

10.4. Recorder Header Fields

Method invocations for the recorder resource can contain resource-specific header fields containing request options and information to augment the Method, Response, or Event message it is associated with.

   recorder-header      =  sensitivity-level
                        /  no-input-timeout
                        /  completion-cause
                        /  completion-reason
                        /  failed-uri
                        /  failed-uri-cause
                        /  record-uri
                        /  media-type
                        /  max-time
                        /  trim-length
                        /  final-silence
                        /  capture-on-speech
                        /  ver-buffer-utterance
                        /  start-input-timers
                        /  new-audio-channel

10.4.1. Sensitivity-Level

To filter out background noise and not mistake it for speech, the recorder can support a variable level of sound sensitivity. The Sensitivity-Level header field is a float value between 0.0 and 1.0 and allows the client to set the sensitivity level for the recorder. This header field MAY occur in RECORD, SET-PARAMS, or GET-PARAMS. A higher value for this header field means higher sensitivity. The default value for this header field is implementation specific.

   sensitivity-level    =  "Sensitivity-Level" ":" FLOAT CRLF

10.4.2. No-Input-Timeout

When recording is started and there is no speech detected for a certain period of time, the recorder can send a RECORD-COMPLETE event to the client and terminate the record operation. The No-Input-Timeout header field can set this timeout value. The value is in milliseconds. This header field MAY occur in RECORD, SET-PARAMS, or GET-PARAMS. The value for this header field ranges from 0 to an implementation-specific maximum value. The default value for this header field is implementation specific.

   no-input-timeout     =  "No-Input-Timeout" ":" 1*19DIGIT CRLF

10.4.3. Completion-Cause

This header field MUST be part of a RECORD-COMPLETE event from the recorder resource to the client. This indicates the reason behind the RECORD method completion. This header field MUST be sent in the RECORD responses if they return with a failure status and a COMPLETE state. In the ABNF below, the 'cause-code' contains a numerical value selected from the Cause-Code column of the following table. The 'cause-name' contains the corresponding token selected from the Cause-Name column.

   completion-cause     =  "Completion-Cause" ":" cause-code SP
                           cause-name CRLF
   cause-code           =  3DIGIT
   cause-name           =  *VCHAR

   +------------+-----------------------+------------------------------+
   | Cause-Code | Cause-Name            | Description                  |
   +------------+-----------------------+------------------------------+
   | 000        | success-silence       | RECORD completed with a      |
   |            |                       | silence at the end.          |
   | 001        | success-maxtime       | RECORD completed after       |
   |            |                       | reaching maximum recording   |
   |            |                       | time specified in record     |
   |            |                       | method.                      |
   | 002        | no-input-timeout      | RECORD failed due to no      |
   |            |                       | input.                       |
   | 003        | uri-failure           | Failure accessing the record |
   |            |                       | URI.                         |
   | 004        | error                 | RECORD request terminated    |
   |            |                       | prematurely due to a         |
   |            |                       | recorder error.              |
   +------------+-----------------------+------------------------------+
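A client might decode the recorder's Completion-Cause header field along these lines. The sketch is illustrative: `RECORDER_CAUSES`, `parse_completion_cause`, and `is_success` are hypothetical names, and treating a code/name mismatch as an error is a design choice of the sketch, not a requirement stated in the text.

```python
# Cause codes and names from the recorder Completion-Cause table.
RECORDER_CAUSES = {
    0: "success-silence",
    1: "success-maxtime",
    2: "no-input-timeout",
    3: "uri-failure",
    4: "error",
}


def parse_completion_cause(line: str) -> tuple:
    """Parse 'Completion-Cause:NNN cause-name' into (code, name)."""
    name, _, value = line.partition(":")
    if name.strip().lower() != "completion-cause":
        raise ValueError("not a Completion-Cause header field")
    code_text, _, cause_name = value.strip().partition(" ")
    # cause-code = 3DIGIT
    if len(code_text) != 3 or not (code_text.isascii() and code_text.isdigit()):
        raise ValueError("cause-code must be exactly three digits")
    code = int(code_text)
    cause_name = cause_name.strip()
    if RECORDER_CAUSES.get(code) != cause_name:
        raise ValueError("unknown or mismatched cause: %r" % value.strip())
    return code, cause_name


def is_success(code: int) -> bool:
    """Codes 000 and 001 describe a RECORD that completed normally."""
    return RECORDER_CAUSES.get(code, "").startswith("success")
```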

10.4.4. Completion-Reason

This header field MAY be present in a RECORD-COMPLETE event coming from the recorder resource to the client. It contains the reason text behind the RECORD request completion. This header field communicates text describing the reason for the failure.

The completion reason text is provided for client use in logs and for debugging and instrumentation purposes. Clients MUST NOT interpret the completion reason text.

   completion-reason    =  "Completion-Reason" ":"
                           quoted-string CRLF

10.4.5. Failed-URI

When a recorder method needs to post the audio to a URI and access to the URI fails, the server MUST provide the failed URI in this header field in the method response.

   failed-uri           =  "Failed-URI" ":" absoluteURI CRLF

10.4.6. Failed-URI-Cause

When a recorder method needs to post the audio to a URI and access to the URI fails, the server MAY provide the URI-specific or protocol-specific response code through this header field in the method response. The value encoding is UTF-8 (RFC 3629 [RFC3629]) to accommodate any access protocol -- some access protocols might have a response string instead of a numeric response code.

   failed-uri-cause     =  "Failed-URI-Cause" ":" 1*UTFCHAR CRLF

10.4.7. Record-URI

When a recorder method contains this header field, the server MUST capture the audio and store it. If the header field is present but specified with no value, the server MUST store the content locally and generate a URI that points to it. This URI is then returned in either the STOP response or the RECORD-COMPLETE event. If the header field in the RECORD method specifies a URI, the server MUST attempt to capture and store the audio at that location. If this header field is not specified in the RECORD request, the server MUST capture the audio, MUST encode it, and MUST send it in the STOP response or the RECORD-COMPLETE event as a message body. In this case, the
   response carrying the audio content MUST include a Content ID (cid)
   [RFC2392] value in this header pointing to the Content-ID in the
   message body.

   The server MUST also return the size in octets and the duration in
   milliseconds of the recorded audio waveform as parameters associated
   with the header field.

   Implementations MUST support 'http' [RFC2616], 'https' [RFC2818],
   'file' [RFC3986], and 'cid' [RFC2392] schemes in the URI.  Note that
   implementations already exist that support other schemes.

   record-uri               =  "Record-URI" ":" ["<" uri ">"
                               ";" "size" "=" 1*19DIGIT
                               ";" "duration" "=" 1*19DIGIT] CRLF

10.4.8. Media-Type

A RECORD method MUST contain this header field, which specifies to the server the media type of the captured audio or video.

   media-type           =  "Media-Type" ":" media-type-value
                           CRLF

10.4.9. Max-Time

When recording is started, this specifies the maximum length of the recording in milliseconds, calculated from the time the actual capture and store begins and is not necessarily the time the RECORD method is received. It specifies the duration before silence suppression, if any, has been applied by the recorder resource. After this time, the recording stops and the server MUST return a RECORD-COMPLETE event to the client having a request-state of COMPLETE. This header field MAY occur in RECORD, SET-PARAMS, or GET-PARAMS. The value for this header field ranges from 0 to an implementation-specific maximum value. A value of 0 means infinity, and hence the recording continues until one or more of the other stop conditions are met. The default value for this header field is 0.

   max-time             =  "Max-Time" ":" 1*19DIGIT CRLF

10.4.10. Trim-Length

This header field MAY be sent on a STOP method and specifies the length of audio to be trimmed from the end of the recording after the stop. The length is interpreted to be in milliseconds. The default value for this header field is 0.

   trim-length              =  "Trim-Length" ":" 1*19DIGIT CRLF
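
As a rough illustration of what trimming means for stored audio, the Python sketch below removes Trim-Length milliseconds from the end of a buffer. It assumes uncompressed single-channel linear PCM with a known sample rate and sample width. These assumptions, the default values, and the name trim_tail are hypothetical choices for this example. A real recorder resource would need to account for the negotiated media type.

```python
def trim_tail(pcm: bytes, trim_length_ms: int,
              sample_rate_hz: int = 8000, bytes_per_sample: int = 2) -> bytes:
    """Drop trim_length_ms of audio from the end of a mono PCM buffer."""
    if trim_length_ms < 0:
        raise ValueError("Trim-Length must be a non-negative number of milliseconds")
    # Convert milliseconds to whole samples, then to bytes.
    samples_to_trim = sample_rate_hz * trim_length_ms // 1000
    bytes_to_trim = samples_to_trim * bytes_per_sample
    # Trimming more than was recorded leaves an empty recording.
    keep = max(0, len(pcm) - bytes_to_trim)
    return pcm[:keep]
```

With the default Trim-Length of 0, the buffer is returned unchanged.
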

10.4.11. Final-Silence

When the recorder is started and the actual capture begins, this header field specifies the length of silence in the audio that is to be interpreted as the end of the recording. This header field MAY occur in RECORD, SET-PARAMS, or GET-PARAMS. The value for this header field ranges from 0 to an implementation-specific maximum value and is interpreted to be in milliseconds. A value of 0 means infinity, and hence the recording will continue until one of the other stop conditions is met. The default value for this header field is implementation specific.

   final-silence            =  "Final-Silence" ":" 1*19DIGIT CRLF
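
One way to picture the Final-Silence rule is as a running count of consecutive silent audio. The Python sketch below assumes that some endpointing component has already labeled fixed-length frames as silent or not. That labeling, the frame length, and the name end_of_recording_ms are assumptions made for this example. How silence is detected is outside the scope of this header field.

```python
from typing import Iterable, Optional


def end_of_recording_ms(frames_are_silent: Iterable[bool], frame_ms: int,
                        final_silence_ms: int) -> Optional[int]:
    """Return the elapsed capture time at which Final-Silence ends the
    recording, or None if it never does for the frames given."""
    if final_silence_ms == 0:
        # 0 means infinity: silence alone never ends the recording.
        return None
    elapsed_ms = 0
    silent_run_ms = 0
    for is_silent in frames_are_silent:
        elapsed_ms += frame_ms
        # Any non-silent frame resets the run of trailing silence.
        silent_run_ms = silent_run_ms + frame_ms if is_silent else 0
        if silent_run_ms >= final_silence_ms:
            return elapsed_ms
    return None
```
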

10.4.12. Capture-On-Speech

If "false", the recorder MUST start capturing immediately when started. If "true", the recorder MUST wait for the endpointing functionality to detect speech before it starts capturing. This header field MAY occur in RECORD, SET-PARAMS, or GET-PARAMS. The value for this header field is a Boolean. The default value for this header field is "false".

   capture-on-speech        =  "Capture-On-Speech" ":" BOOLEAN CRLF
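
The difference between the two values can be sketched in Python as a choice of where capture begins. The frame representation (a payload paired with a speech flag from the endpointer) and the name select_captured are hypothetical, chosen only for this example.

```python
from typing import List, Sequence, Tuple

# (audio payload, speech flag reported by the endpointer); hypothetical shape
Frame = Tuple[bytes, bool]


def select_captured(frames: Sequence[Frame],
                    capture_on_speech: bool = False) -> List[bytes]:
    """Return the payloads that would be captured for the given setting."""
    if not capture_on_speech:
        # "false": capture starts immediately, so every frame is kept.
        return [payload for payload, _ in frames]
    # "true": skip everything before the first frame flagged as speech.
    for index, (_, is_speech) in enumerate(frames):
        if is_speech:
            return [payload for payload, _ in frames[index:]]
    return []
```

Once capture has started, later silent frames are still kept; ending the recording on silence is governed by Final-Silence, not by this header field.
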

10.4.13. Ver-Buffer-Utterance

This header field is the same as the one described for the verifier resource (see Section 11.4.14). This tells the server to buffer the utterance associated with this recording request into the verification buffer. Sending this header field is permitted only if the verification buffer is for the session. This buffer is shared across resources within a session. It gets instantiated when a verifier resource is added to this session and is released when the verifier resource is released from the session.

10.4.14. Start-Input-Timers

This header field MAY be sent as part of the RECORD request. A value of "false" tells the recorder resource to start the operation, but not to start the no-input timer until the client sends a START-INPUT-TIMERS request to the recorder resource. This is useful in the scenario when the recorder and synthesizer resources are not part of the same session. When a kill-on-barge-in prompt is being played, the client may want the RECORD request to be simultaneously active so that it can detect and implement kill-on-barge-in (see Section 8.4.2). But at the same time, the client doesn't want the recorder resource to start the no-input timers until the prompt is finished. The default value is "true".

   start-input-timers       =  "Start-Input-Timers" ":" BOOLEAN CRLF
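
The deferred-timer behavior can be sketched in Python as a small state holder. The class name NoInputTimer, its method names, and the explicit millisecond clock passed in by the caller are assumptions made for this example. The timeout value stands in for whatever no-input timeout the recorder resource is configured with.

```python
from typing import Optional


class NoInputTimer:
    """Hypothetical model of when a recorder's no-input timer is running."""

    def __init__(self, no_input_timeout_ms: int) -> None:
        self.no_input_timeout_ms = no_input_timeout_ms
        self.started_at_ms: Optional[int] = None

    def on_record(self, now_ms: int, start_input_timers: bool = True) -> None:
        # Default "true": the timer starts with the RECORD operation.
        if start_input_timers:
            self.started_at_ms = now_ms

    def on_start_input_timers(self, now_ms: int) -> None:
        # START-INPUT-TIMERS arms the timer if RECORD left it stopped.
        if self.started_at_ms is None:
            self.started_at_ms = now_ms

    def expired(self, now_ms: int) -> bool:
        if self.started_at_ms is None:
            return False
        return now_ms - self.started_at_ms >= self.no_input_timeout_ms
```

In the prompt scenario described above, the client would issue RECORD with Start-Input-Timers set to "false" while the prompt plays, then send START-INPUT-TIMERS when the prompt finishes, so that the time spent playing the prompt does not count toward the no-input timeout.
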

10.4.15. New-Audio-Channel

This header field is the same as the one described for the recognizer resource (see Section 9.4.23).
