12. Security Considerations
MRCPv2 is designed to comply with the security-related requirements documented in the SPEECHSC requirements [RFC4313]. Implementers and users of MRCPv2 are strongly encouraged to read the Security Considerations section of [RFC4313], because that document discusses a number of important security issues associated with the use of speech as a biometric authentication technology, and the threats against systems that store recorded speech, contain large corpora of voiceprints, and send and receive sensitive information based on voice input to a recognizer or speech output from a synthesizer. Specific security measures employed by MRCPv2 are summarized in the following subsections. See the corresponding sections of this specification for how the security-related machinery is invoked by individual protocol operations.
12.1. Rendezvous and Session Establishment
MRCPv2 control sessions are established as media sessions described by SDP within the context of a SIP dialog. In order to ensure secure rendezvous between MRCPv2 clients and servers, the following are required:

   1. The SIP implementation in MRCPv2 clients and servers MUST support SIP digest authentication [RFC3261] and SHOULD employ it.
   2. The SIP implementation in MRCPv2 clients and servers MUST support 'sips' URIs and SHOULD employ 'sips' URIs; this includes that clients and servers SHOULD set up TLS [RFC5246] connections.
   3. If media stream cryptographic keying is done through SDP (e.g., using [RFC4568]), the MRCPv2 clients and servers MUST employ the 'sips' URI.
   4. When TLS is used for SIP, the client MUST verify the identity of the server to which it connects, following the rules and guidelines defined in [RFC5922].

12.2. Control Channel Protection
Sensitive data is carried over the MRCPv2 control channel. This includes things like the output of speech recognition operations, speaker verification results, input to text-to-speech conversion, personally identifying grammars, etc. For this reason, MRCPv2 servers must be properly authenticated, and the control channel must permit the use of both confidentiality and integrity for the data. To ensure control channel protection, MRCPv2 clients and servers MUST support TLS and SHOULD utilize it by default unless alternative control channel protection is used. When TLS is used, the client MUST verify the identity of the server to which it connects, following the rules and guidelines defined in [RFC4572]. If there are multiple TLS-protected channels between the client and the server, the server MUST NOT send a response to the client over a channel for which the TLS identities of the server or client differ from the channel over which the server received the corresponding request. Alternative control-channel protection MAY be used if desired (e.g., Security Architecture for the Internet Protocol (IPsec) [RFC4301]).
12.3. Media Session Protection
Sensitive data is also carried on media sessions terminating on MRCPv2 servers (the other end of a media channel may or may not be on the MRCPv2 client). This data includes the user's spoken utterances and the output of text-to-speech operations. MRCPv2 servers MUST support a security mechanism for protection of audio media sessions. MRCPv2 clients that originate or consume audio similarly MUST support a security mechanism for protection of the audio. One such mechanism is the Secure Real-time Transport Protocol (SRTP) [RFC3711].
12.4. Indirect Content Access
MRCPv2 employs content indirection extensively. Content may be fetched and/or stored based on URI addressing on systems other than the MRCPv2 client or server. Not all of the stored content is necessarily sensitive (e.g., XML schemas), but the majority generally needs protection, and some indirect content, such as voice recordings and voiceprints, is extremely sensitive and must always be protected.
MRCPv2 clients and servers MUST implement HTTPS for indirect content access and SHOULD employ secure access for all sensitive indirect content. Other secure URI schemes such as Secure FTP (FTPS) [RFC4217] MAY also be used. See Section 6.2.15 for the header fields used to transfer cookie information between the MRCPv2 client and server if needed for authentication.
Access to URIs provided by servers introduces risks that need to be considered. Although RFC 6454 [RFC6454] focuses on a same-origin policy, to which MRCPv2 does not restrict URIs, Section 3 of that RFC still provides an excellent description of the pitfalls of blindly following server-provided URIs. Servers also need to be aware that clients could provide URIs to sites designed to tie up the server in long or otherwise problematic document fetches. MRCPv2 servers, and the services they access, MUST always be prepared for the possibility of such a denial-of-service attack.
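One way a server might defend against the problematic document fetches described above is to cap both the size and the duration of every fetch it performs on a client's behalf. The following is a minimal sketch of that idea, not a mechanism defined by this specification. The names `read_bounded` and `FetchLimitExceeded`, and the limits passed to them, are hypothetical.

```python
import io
import time


class FetchLimitExceeded(Exception):
    """Raised when a fetch exceeds its size or time budget (hypothetical)."""


def read_bounded(stream, max_bytes, deadline_seconds, chunk_size=4096,
                 clock=time.monotonic):
    """Read a file-like object fully, enforcing a byte cap and a deadline.

    The deadline is checked between reads, so it only bounds the total
    time if each individual read is itself bounded (for example, by a
    socket timeout configured by the caller).
    """
    start = clock()
    chunks = []
    total = 0
    while True:
        if clock() - start > deadline_seconds:
            raise FetchLimitExceeded("fetch exceeded its time budget")
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
        if total > max_bytes:
            raise FetchLimitExceeded("fetch exceeded its size budget")
        chunks.append(chunk)
    return b"".join(chunks)
```

In a deployment, `stream` would be the response body of the HTTPS fetch. The sketch accepts any object with a `read` method so that the limits can be exercised without network access.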
MRCPv2 makes no inherent assumptions about the lifetime and access controls associated with a URI. For example, if neither authentication nor scheme-specific access controls are used, a leak of the URI is equivalent to a leak of the content. Moreover, MRCPv2 makes no specific demands on the lifetime of a URI. If a server offers a URI and the client takes a long time to access that URI, the server may have removed the resource in the interim. MRCPv2 deals with this case by using the URI access scheme's 'resource not found' error, such as 404 for HTTPS. How long a server should keep a dynamic resource available is highly application and context dependent. However, the server SHOULD keep the resource available for a reasonable amount of time to make it likely that the resource is still available when the client needs it. Conversely, to mitigate state exhaustion attacks, MRCPv2 servers are not obligated to keep resources and resource state in perpetuity. The server SHOULD delete dynamically generated resources associated with an MRCPv2 session when the session ends.
One method to avoid resource leakage is for the server to use difficult-to-guess, one-time resource URIs. In this instance, there can be only a single access to the underlying resource using the given URI. A downside to this approach is that if an attacker uses the URI before the client does, the client is denied the resource. Another method would be to adopt a mechanism similar to the URLAUTH IMAP extension [RFC4467], where the server sets cryptographic checks on URI usage, as well as capabilities for expiration, revocation, and so on. Specifying such a mechanism is beyond the scope of this document.
12.5. Protection of Stored Media
MRCPv2 applications often require the use of stored media. Voice recordings are both stored (e.g., for diagnosis and system tuning), and fetched (for replaying utterances into multiple MRCPv2 resources). Voiceprints are fundamental to the speaker identification and verification functions. This data can be extremely sensitive and can present substantial privacy and impersonation risks if stolen. Systems employing MRCPv2 SHOULD be deployed in ways that minimize these risks. The SPEECHSC requirements RFC [RFC4313] contains a more extensive discussion of these risks and ways they may be mitigated.
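One of the mitigations mentioned in Section 12.4, difficult-to-guess one-time resource URIs, applies naturally to stored recordings. The following is a minimal sketch of the idea, assuming an in-memory store. The class `OneTimeResourceStore`, its method names, and the base URL are hypothetical and are not defined by this specification.

```python
import secrets


class OneTimeResourceStore:
    """Hands out unguessable URIs that can each be fetched only once."""

    def __init__(self, base_url):
        self._base_url = base_url.rstrip("/")
        self._resources = {}

    def publish(self, content):
        # 32 random bytes give a token that is impractical to guess.
        token = secrets.token_urlsafe(32)
        self._resources[token] = content
        return f"{self._base_url}/{token}"

    def fetch(self, uri):
        # pop() removes the entry, so a second fetch returns None; the
        # caller would map None to the scheme's "not found" error.
        token = uri.rsplit("/", 1)[-1]
        return self._resources.pop(token, None)

    def end_session(self):
        # Drop everything still unfetched when the MRCPv2 session ends.
        self._resources.clear()
```

As the text in Section 12.4 notes, the cost of this design is that an attacker who learns the URI first consumes the single permitted access, so it complements rather than replaces transport security and authentication.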
12.6. DTMF and Recognition Buffers
DTMF buffers and recognition buffers may grow large enough to exceed the capabilities of a server, and the server MUST be prepared to gracefully handle resource consumption. A server MAY respond with the appropriate recognition incomplete if the server is in danger of running out of resources.
12.7. Client-Set Server Parameters
In MRCPv2, there are some tasks, such as URI resource fetches, that the server does on behalf of the client. To control this behavior, MRCPv2 has a number of server parameters that a client can configure. With one such parameter, Fetch-Timeout (Section 6.2.12), a malicious client could set a very large value and then request the server to fetch a non-existent document. It is RECOMMENDED that servers be cautious about accepting long timeout values or abnormally large values for other client-set parameters.
12.8. DELETE-VOICEPRINT and Authorization
Since this specification does not mandate a specific mechanism for authentication and authorization when requesting DELETE-VOICEPRINT (Section 11.9), there is a risk that an MRCPv2 server may not do such a check for authentication and authorization. In practice, each provider of voice biometric solutions does insist on its own authentication and authorization mechanism, outside of this specification, so this is not likely to be a major problem. If in the future voice biometric providers standardize on such a mechanism, then a future version of MRCP can mandate it.
13. IANA Considerations
13.1. New Registries
This section describes the name spaces (registries) for MRCPv2 that IANA has created and now maintains. Assignment/registration policies are described in RFC 5226 [RFC5226].
13.1.1. MRCPv2 Resource Types
IANA has created a new name space of "MRCPv2 Resource Types". All maintenance within and additions to the contents of this name space MUST be according to the "Standards Action" registration policy. The initial contents of the registry, defined in Section 4.2, are given below:

   Resource type   Resource description   Reference
   -------------   --------------------   ---------
   speechrecog     Speech Recognizer      [RFC6787]
   dtmfrecog       DTMF Recognizer        [RFC6787]
   speechsynth     Speech Synthesizer     [RFC6787]
   basicsynth      Basic Synthesizer      [RFC6787]
   speakverify     Speaker Verifier       [RFC6787]
   recorder        Speech Recorder        [RFC6787]

13.1.2. MRCPv2 Methods and Events
IANA has created a new name space of "MRCPv2 Methods and Events". All maintenance within and additions to the contents of this name space MUST be according to the "Standards Action" registration policy. The initial contents of the registry, defined by the "method-name" and "event-name" BNF in Section 15 and explained in Sections 5.2 and 5.5, are given below.

   Name                      Resource type  Method/Event  Reference
   ----                      -------------  ------------  ---------
   SET-PARAMS                Generic        Method        [RFC6787]
   GET-PARAMS                Generic        Method        [RFC6787]
   SPEAK                     Synthesizer    Method        [RFC6787]
   STOP                      Synthesizer    Method        [RFC6787]
   PAUSE                     Synthesizer    Method        [RFC6787]
   RESUME                    Synthesizer    Method        [RFC6787]
   BARGE-IN-OCCURRED         Synthesizer    Method        [RFC6787]
   CONTROL                   Synthesizer    Method        [RFC6787]
   DEFINE-LEXICON            Synthesizer    Method        [RFC6787]
   DEFINE-GRAMMAR            Recognizer     Method        [RFC6787]
   RECOGNIZE                 Recognizer     Method        [RFC6787]
   INTERPRET                 Recognizer     Method        [RFC6787]
   GET-RESULT                Recognizer     Method        [RFC6787]
   START-INPUT-TIMERS        Recognizer     Method        [RFC6787]
   STOP                      Recognizer     Method        [RFC6787]
   START-PHRASE-ENROLLMENT   Recognizer     Method        [RFC6787]
   ENROLLMENT-ROLLBACK       Recognizer     Method        [RFC6787]
   END-PHRASE-ENROLLMENT     Recognizer     Method        [RFC6787]
   MODIFY-PHRASE             Recognizer     Method        [RFC6787]
   DELETE-PHRASE             Recognizer     Method        [RFC6787]
   RECORD                    Recorder       Method        [RFC6787]
   STOP                      Recorder       Method        [RFC6787]
   START-INPUT-TIMERS        Recorder       Method        [RFC6787]
   START-SESSION             Verifier       Method        [RFC6787]
   END-SESSION               Verifier       Method        [RFC6787]
   QUERY-VOICEPRINT          Verifier       Method        [RFC6787]
   DELETE-VOICEPRINT         Verifier       Method        [RFC6787]
   VERIFY                    Verifier       Method        [RFC6787]
   VERIFY-FROM-BUFFER        Verifier       Method        [RFC6787]
   VERIFY-ROLLBACK           Verifier       Method        [RFC6787]
   STOP                      Verifier       Method        [RFC6787]
   START-INPUT-TIMERS        Verifier       Method        [RFC6787]
   GET-INTERMEDIATE-RESULT   Verifier       Method        [RFC6787]
   SPEECH-MARKER             Synthesizer    Event         [RFC6787]
   SPEAK-COMPLETE            Synthesizer    Event         [RFC6787]
   START-OF-INPUT            Recognizer     Event         [RFC6787]
   RECOGNITION-COMPLETE      Recognizer     Event         [RFC6787]
   INTERPRETATION-COMPLETE   Recognizer     Event         [RFC6787]
   START-OF-INPUT            Recorder       Event         [RFC6787]
   RECORD-COMPLETE           Recorder       Event         [RFC6787]
   VERIFICATION-COMPLETE     Verifier       Event         [RFC6787]
   START-OF-INPUT            Verifier       Event         [RFC6787]

13.1.3. MRCPv2 Header Fields
IANA has created a new name space of "MRCPv2 Header Fields". All maintenance within and additions to the contents of this name space MUST be according to the "Standards Action" registration policy. The initial contents of the registry, defined by the "message-header" BNF in Section 15 and explained in Section 5.1, are given below. Note that the values permitted for the "Vendor-Specific-Parameters" parameter are managed according to a different policy. See Section 13.1.6.

   Name                               Resource type  Reference
   ----                               -------------  ---------
   Channel-Identifier                 Generic        [RFC6787]
   Accept                             Generic        [RFC2616]
   Active-Request-Id-List             Generic        [RFC6787]
   Proxy-Sync-Id                      Generic        [RFC6787]
   Accept-Charset                     Generic        [RFC2616]
   Content-Type                       Generic        [RFC6787]
   Content-ID                         Generic        [RFC2392], [RFC2046], and [RFC5322]
   Content-Base                       Generic        [RFC6787]
   Content-Encoding                   Generic        [RFC6787]
   Content-Location                   Generic        [RFC6787]
   Content-Length                     Generic        [RFC6787]
   Fetch-Timeout                      Generic        [RFC6787]
   Cache-Control                      Generic        [RFC6787]
   Logging-Tag                        Generic        [RFC6787]
   Set-Cookie                         Generic        [RFC6787]
   Vendor-Specific                    Generic        [RFC6787]
   Jump-Size                          Synthesizer    [RFC6787]
   Kill-On-Barge-In                   Synthesizer    [RFC6787]
   Speaker-Profile                    Synthesizer    [RFC6787]
   Completion-Cause                   Synthesizer    [RFC6787]
   Completion-Reason                  Synthesizer    [RFC6787]
   Voice-Parameter                    Synthesizer    [RFC6787]
   Prosody-Parameter                  Synthesizer    [RFC6787]
   Speech-Marker                      Synthesizer    [RFC6787]
   Speech-Language                    Synthesizer    [RFC6787]
   Fetch-Hint                         Synthesizer    [RFC6787]
   Audio-Fetch-Hint                   Synthesizer    [RFC6787]
   Failed-URI                         Synthesizer    [RFC6787]
   Failed-URI-Cause                   Synthesizer    [RFC6787]
   Speak-Restart                      Synthesizer    [RFC6787]
   Speak-Length                       Synthesizer    [RFC6787]
   Load-Lexicon                       Synthesizer    [RFC6787]
   Lexicon-Search-Order               Synthesizer    [RFC6787]
   Confidence-Threshold               Recognizer     [RFC6787]
   Sensitivity-Level                  Recognizer     [RFC6787]
   Speed-Vs-Accuracy                  Recognizer     [RFC6787]
   N-Best-List-Length                 Recognizer     [RFC6787]
   Input-Type                         Recognizer     [RFC6787]
   No-Input-Timeout                   Recognizer     [RFC6787]
   Recognition-Timeout                Recognizer     [RFC6787]
   Waveform-URI                       Recognizer     [RFC6787]
   Input-Waveform-URI                 Recognizer     [RFC6787]
   Completion-Cause                   Recognizer     [RFC6787]
   Completion-Reason                  Recognizer     [RFC6787]
   Recognizer-Context-Block           Recognizer     [RFC6787]
   Start-Input-Timers                 Recognizer     [RFC6787]
   Speech-Complete-Timeout            Recognizer     [RFC6787]
   Speech-Incomplete-Timeout          Recognizer     [RFC6787]
   Dtmf-Interdigit-Timeout            Recognizer     [RFC6787]
   Dtmf-Term-Timeout                  Recognizer     [RFC6787]
   Dtmf-Term-Char                     Recognizer     [RFC6787]
   Failed-URI                         Recognizer     [RFC6787]
   Failed-URI-Cause                   Recognizer     [RFC6787]
   Save-Waveform                      Recognizer     [RFC6787]
   Media-Type                         Recognizer     [RFC6787]
   New-Audio-Channel                  Recognizer     [RFC6787]
   Speech-Language                    Recognizer     [RFC6787]
   Ver-Buffer-Utterance               Recognizer     [RFC6787]
   Recognition-Mode                   Recognizer     [RFC6787]
   Cancel-If-Queue                    Recognizer     [RFC6787]
   Hotword-Max-Duration               Recognizer     [RFC6787]
   Hotword-Min-Duration               Recognizer     [RFC6787]
   Interpret-Text                     Recognizer     [RFC6787]
   Dtmf-Buffer-Time                   Recognizer     [RFC6787]
   Clear-Dtmf-Buffer                  Recognizer     [RFC6787]
   Early-No-Match                     Recognizer     [RFC6787]
   Num-Min-Consistent-Pronunciations  Recognizer     [RFC6787]
   Consistency-Threshold              Recognizer     [RFC6787]
   Clash-Threshold                    Recognizer     [RFC6787]
   Personal-Grammar-URI               Recognizer     [RFC6787]
   Enroll-Utterance                   Recognizer     [RFC6787]
   Phrase-ID                          Recognizer     [RFC6787]
   Phrase-NL                          Recognizer     [RFC6787]
   Weight                             Recognizer     [RFC6787]
   Save-Best-Waveform                 Recognizer     [RFC6787]
   New-Phrase-ID                      Recognizer     [RFC6787]
   Confusable-Phrases-URI             Recognizer     [RFC6787]
   Abort-Phrase-Enrollment            Recognizer     [RFC6787]
   Sensitivity-Level                  Recorder       [RFC6787]
   No-Input-Timeout                   Recorder       [RFC6787]
   Completion-Cause                   Recorder       [RFC6787]
   Completion-Reason                  Recorder       [RFC6787]
   Failed-URI                         Recorder       [RFC6787]
   Failed-URI-Cause                   Recorder       [RFC6787]
   Record-URI                         Recorder       [RFC6787]
   Media-Type                         Recorder       [RFC6787]
   Max-Time                           Recorder       [RFC6787]
   Trim-Length                        Recorder       [RFC6787]
   Final-Silence                      Recorder       [RFC6787]
   Capture-On-Speech                  Recorder       [RFC6787]
   Ver-Buffer-Utterance               Recorder       [RFC6787]
   Start-Input-Timers                 Recorder       [RFC6787]
   New-Audio-Channel                  Recorder       [RFC6787]
   Repository-URI                     Verifier       [RFC6787]
   Voiceprint-Identifier              Verifier       [RFC6787]
   Verification-Mode                  Verifier       [RFC6787]
   Adapt-Model                        Verifier       [RFC6787]
   Abort-Model                        Verifier       [RFC6787]
   Min-Verification-Score             Verifier       [RFC6787]
   Num-Min-Verification-Phrases       Verifier       [RFC6787]
   Num-Max-Verification-Phrases       Verifier       [RFC6787]
   No-Input-Timeout                   Verifier       [RFC6787]
   Save-Waveform                      Verifier       [RFC6787]
   Media-Type                         Verifier       [RFC6787]
   Waveform-URI                       Verifier       [RFC6787]
   Voiceprint-Exists                  Verifier       [RFC6787]
   Ver-Buffer-Utterance               Verifier       [RFC6787]
   Input-Waveform-URI                 Verifier       [RFC6787]
   Completion-Cause                   Verifier       [RFC6787]
   Completion-Reason                  Verifier       [RFC6787]
   Speech-Complete-Timeout            Verifier       [RFC6787]
   New-Audio-Channel                  Verifier       [RFC6787]
   Abort-Verification                 Verifier       [RFC6787]
   Start-Input-Timers                 Verifier       [RFC6787]
   Input-Type                         Verifier       [RFC6787]

13.1.4. MRCPv2 Status Codes
IANA has created a new name space of "MRCPv2 Status Codes" with the initial values that are defined in Section 5.4. All maintenance within and additions to the contents of this name space MUST be according to the "Specification Required with Expert Review" registration policy.
13.1.5. Grammar Reference List Parameters
IANA has created a new name space of "Grammar Reference List Parameters". All maintenance within and additions to the contents of this name space MUST be according to the "Specification Required with Expert Review" registration policy. There is only one initial parameter as shown below.

   Name     Reference
   ----     ---------
   weight   [RFC6787]

13.1.6. MRCPv2 Vendor-Specific Parameters
IANA has created a new name space of "MRCPv2 Vendor-Specific Parameters". All maintenance within and additions to the contents of this name space MUST be according to the "Hierarchical Allocation" registration policy as follows. Each name (corresponding to the "vendor-av-pair-name" ABNF production) MUST satisfy the syntax requirements of Internet Domain Names as described in Section 2.3.1 of RFC 1035 [RFC1035] (and as updated or obsoleted by successive RFCs), with one exception: the order of the domain names is reversed. For example, a vendor-specific parameter "foo" by example.com would have the form "com.example.foo". The first, or top-level domain, is restricted to exactly the set of Top-Level Internet Domains defined by IANA and will be updated by IANA when and only when that set changes. The second-level and all subdomains within the parameter name MUST be allocated according to the "First Come First Served" policy. It is RECOMMENDED that assignment requests adhere to the existing allocations of Internet domain names to organizations, institutions, corporations, etc.
The registry contains a list of vendor-registered parameters, where each defined parameter is associated with a contact person and includes an optional reference to the definition of the parameter, preferably an RFC. The registry is initially empty.
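The reversed-domain naming rule can be illustrated with a short sketch. It assumes the preferred label syntax of Section 2.3.1 of RFC 1035 (a leading letter, then letters, digits, or hyphens, ending in a letter or digit, with at most 63 characters per label). The function name `vendor_parameter_name` is hypothetical, and the sketch does not check the top-level domain against the IANA list.

```python
import re

# Preferred label syntax from RFC 1035, Section 2.3.1.
_LABEL = re.compile(r"^[A-Za-z]([A-Za-z0-9-]*[A-Za-z0-9])?$")


def vendor_parameter_name(domain, parameter):
    """Build a reversed-domain name such as 'com.example.foo'."""
    # Reverse the vendor's domain labels, then append the parameter labels.
    labels = list(reversed(domain.strip(".").split(".")))
    labels += parameter.split(".")
    for label in labels:
        if len(label) > 63 or not _LABEL.match(label):
            raise ValueError(f"invalid label: {label!r}")
    return ".".join(labels)
```

Calling `vendor_parameter_name("example.com", "foo")` yields the name used in the example above, "com.example.foo".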
13.2. NLSML-Related Registrations
13.2.1. 'application/nlsml+xml' Media Type Registration
IANA has registered the following media type according to the process defined in RFC 4288 [RFC4288].

   To: ietf-types@iana.org
   Subject: Registration of media type application/nlsml+xml
   MIME media type name: application
   MIME subtype name: nlsml+xml
   Required parameters: none
   Optional parameters:
      charset: All of the considerations described in RFC 3023 [RFC3023] also apply to the application/nlsml+xml media type.
   Encoding considerations: All of the considerations described in RFC 3023 also apply to the 'application/nlsml+xml' media type.
   Security considerations: As with HTML, NLSML documents contain links to other data stores (grammars, verifier resources, etc.). Unlike HTML, however, the data stores are not treated as media to be rendered. Nevertheless, linked files may themselves have security considerations, which would be those of the individual registered types. Additionally, this media type has all of the security considerations described in RFC 3023.
   Interoperability considerations: Although an NLSML document is itself a complete XML document, for a fuller interpretation of the content a receiver of an NLSML document may wish to access resources linked to by the document. The inability of an NLSML processor to access or process such linked resources could result in different behavior by the ultimate consumer of the data.
   Published specification: RFC 6787
   Applications that use this media type: MRCPv2 clients and servers
   Additional information: none
      Magic number(s): There is no single initial octet sequence that is always present for NLSML files.
   Person & email address to contact for further information: Sarvi Shanmugham, sarvi@cisco.com
   Intended usage: This media type is expected to be used only in conjunction with MRCPv2.

13.3. NLSML XML Schema Registration
IANA has registered and now maintains the following XML Schema. Information provided follows the template in RFC 3688 [RFC3688].

   XML element type: schema
   URI: urn:ietf:params:xml:schema:nlsml
   Registrant Contact: IESG
   XML: See Section 16.1.

13.4. MRCPv2 XML Namespace Registration
IANA has registered and now maintains the following XML namespace. Information provided follows the template in RFC 3688 [RFC3688].

   XML element type: ns
   URI: urn:ietf:params:xml:ns:mrcpv2
   Registrant Contact: IESG
   XML: RFC 6787

13.5. Text Media Type Registrations
IANA has registered the following text media type according to the process defined in RFC 4288 [RFC4288].
13.5.1. text/grammar-ref-list
   To: ietf-types@iana.org
   Subject: Registration of media type text/grammar-ref-list
   MIME media type name: text
   MIME subtype name: grammar-ref-list
   Required parameters: none
   Optional parameters: none
   Encoding considerations: Depending on the transfer protocol, a transfer encoding may be necessary to deal with very long lines.
   Security considerations: This media type contains URIs that may represent references to external resources. As these resources are assumed to be speech recognition grammars, similar considerations as for the media types 'application/srgs' and 'application/srgs+xml' apply.
   Interoperability considerations: '>' must be percent encoded in URIs according to RFC 3986 [RFC3986].
   Published specification: The RECOGNIZE method of MRCPv2 performs a recognition operation that matches input against a set of grammars. When matching against more than one grammar, it is sometimes necessary to use different weights for the individual grammars. These weights are not a property of the grammar resource itself but qualify the reference to that grammar for the particular recognition operation initiated by the RECOGNIZE method. The format of the proposed 'text/grammar-ref-list' media type is as follows:

      body       = *reference
      reference  = "<" uri ">" [parameters] CRLF
      parameters = ";" parameter *(";" parameter)
      parameter  = attribute "=" value

   This specification currently only defines a 'weight' parameter, but new parameters MAY be added through the "Grammar Reference List Parameters" IANA registry established through this specification. Example:

      <http://example.com/grammars/field1.gram>
      <http://example.com/grammars/field2.gram>;weight="0.85"
      <session:field3@form-level.store>;weight="0.9"
      <http://example.com/grammars/universals.gram>;weight="0.75"

   Applications that use this media type: MRCPv2 clients and servers
   Additional information: none
      Magic number(s): none
   Person & email address to contact for further information: Sarvi Shanmugham, sarvi@cisco.com
   Intended usage: This media type is expected to be used only in conjunction with MRCPv2.

13.6. 'session' URI Scheme Registration
IANA has registered the following new URI scheme. The information below follows the template given in RFC 4395 [RFC4395].

   URI scheme name: session
   Status: Permanent
   URI scheme syntax: The syntax of this scheme is identical to that defined for the "cid" scheme in Section 2 of RFC 2392 [RFC2392].
   URI scheme semantics: The URI is intended to identify a data resource previously given to the network computing resource. The purpose of this scheme is to permit access to the specific resource for the lifetime of the session with the entity storing the resource. The media type of the resource can vary. There is no explicit mechanism for communication of the media type. This scheme is currently widely used internally by existing implementations, and the registration is intended to provide information in the rare (and unfortunate) case that the scheme is used elsewhere. The scheme SHOULD NOT be used for open Internet protocols.
   Encoding considerations: There are no encoding considerations for 'session' URIs beyond those described in RFC 3986 [RFC3986].
   Applications/protocols that use this URI scheme name: This scheme name is used by MRCPv2 clients and servers.
   Interoperability considerations: Note that none of the resources are accessible after the MRCPv2 session ends, hence the name of the scheme. For clients who establish one MRCPv2 session only for the entire speech application being implemented, this is sufficient, but clients who create, terminate, and recreate MRCP sessions for performance or scalability reasons will lose access to resources established in the earlier session(s).
   Security considerations: Generic security considerations for URIs described in RFC 3986 [RFC3986] apply to this scheme as well. The URIs defined here provide an identification mechanism only. Given that the communication channel between client and server is secure, that the server correctly accesses the resource associated with the URI, and that the server ensures session-only lifetime and access for each URI, the only additional security issues are those of the types of media referred to by the URI.
   Contact: Sarvi Shanmugham, sarvi@cisco.com
   Author/Change controller: IESG, iesg@ietf.org
   References: This specification, particularly Sections 6.2.7, 8.5.2, 9.5.1, and 9.9.

13.7. SDP Parameter Registrations
IANA has registered the following SDP parameter values. The information for each follows the template given in RFC 4566 [RFC4566], Appendix B.
13.7.1. Sub-Registry "proto"
"TCP/MRCPv2" value of the "proto" parameter

   Contact name, email address, and telephone number: Sarvi Shanmugham, sarvi@cisco.com, +1.408.902.3875
   Name being registered (as it will appear in SDP): TCP/MRCPv2
   Long-form name in English: MRCPv2 over TCP
   Type of name: proto
   Explanation of name: This name represents the MRCPv2 protocol carried over TCP.
   Reference to specification of name: RFC 6787

"TCP/TLS/MRCPv2" value of the "proto" parameter

   Contact name, email address, and telephone number: Sarvi Shanmugham, sarvi@cisco.com, +1.408.902.3875
   Name being registered (as it will appear in SDP): TCP/TLS/MRCPv2
   Long-form name in English: MRCPv2 over TLS over TCP
   Type of name: proto
   Explanation of name: This name represents the MRCPv2 protocol carried over TLS over TCP.
   Reference to specification of name: RFC 6787

13.7.2. Sub-Registry "att-field (media-level)"
"resource" value of the "att-field" parameter

   Contact name, email address, and telephone number: Sarvi Shanmugham, sarvi@cisco.com, +1.408.902.3875
   Attribute name (as it will appear in SDP): resource
   Long-form attribute name in English: MRCPv2 resource type
   Type of attribute: media-level
   Subject to charset attribute? no
   Explanation of attribute: See Section 4.2 of RFC 6787 for description and examples.
   Specification of appropriate attribute values: See Section 13.1.1 of RFC 6787.

"channel" value of the "att-field" parameter

   Contact name, email address, and telephone number: Sarvi Shanmugham, sarvi@cisco.com, +1.408.902.3875
   Attribute name (as it will appear in SDP): channel
   Long-form attribute name in English: MRCPv2 resource channel identifier
   Type of attribute: media-level
   Subject to charset attribute? no
   Explanation of attribute: See Section 4.2 of RFC 6787 for description and examples.
   Specification of appropriate attribute values: See Section 4.2 and the "channel-id" ABNF production rules of RFC 6787.

"cmid" value of the "att-field" parameter

   Contact name, email address, and telephone number: Sarvi Shanmugham, sarvi@cisco.com, +1.408.902.3875
   Attribute name (as it will appear in SDP): cmid
   Long-form attribute name in English: MRCPv2 resource channel media identifier
   Type of attribute: media-level
   Subject to charset attribute? no
   Explanation of attribute: See Section 4.4 of RFC 6787 for description and examples.
   Specification of appropriate attribute values: See Section 4.4 and the "cmid-attribute" ABNF production rules of RFC 6787.

14. Examples
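As a small illustration of the 'text/grammar-ref-list' format registered in Section 13.5.1, the following is a minimal parsing sketch. It assumes that parameter values never contain ';' or '=' inside quotes, which the ABNF does not guarantee, and the function name `parse_grammar_ref_list` is hypothetical.

```python
import re

# One reference per line: "<" uri ">" followed by optional ";"-separated
# parameters, as in the ABNF of Section 13.5.1.
_REFERENCE = re.compile(r"^<([^<>]*)>(.*)$")


def parse_grammar_ref_list(body):
    """Parse a text/grammar-ref-list body into (uri, parameters) pairs."""
    references = []
    for line in body.splitlines():
        if not line:
            continue
        match = _REFERENCE.match(line)
        if match is None:
            raise ValueError(f"malformed reference: {line!r}")
        uri, rest = match.groups()
        parameters = {}
        # Simplification: split on every ";" without honoring quoting.
        for item in filter(None, rest.split(";")):
            attribute, _, value = item.partition("=")
            parameters[attribute.strip()] = value.strip().strip('"')
        references.append((uri, parameters))
    return references
```

Each returned pair holds the grammar URI and a dictionary of its parameters, so a reference carrying `;weight="0.85"` produces `{"weight": "0.85"}` and a bare reference produces an empty dictionary.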
14.1. Message Flow
The following is an example of a typical MRCPv2 session of speech synthesis and recognition between a client and a server. Although the SDP "s=" attribute in these examples has a text description value to assist in understanding the examples, please keep in mind that RFC 3264 [RFC3264] recommends that messages actually put on the wire use a space or a dash.
The figure below illustrates opening a session to the MRCPv2 server. This exchange does not allocate a resource or set up media. It simply establishes a SIP session with the MRCPv2 server.

   C->S:  INVITE sip:mresources@example.com SIP/2.0
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bg1
          Max-Forwards:6
          To:MediaServer <sip:mresources@example.com>
          From:sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:323123 INVITE
          Contact:<sip:sarvi@client.example.com>
          Content-Type:application/sdp
          Content-Length:...

          v=0
          o=sarvi 2614933546 2614933546 IN IP4 192.0.2.12
          s=Set up MRCPv2 control and audio
          i=Initial contact
          c=IN IP4 192.0.2.12

   S->C:  SIP/2.0 200 OK
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bg1;received=192.0.32.10
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:323123 INVITE
          Contact:<sip:mresources@server.example.com>
          Content-Type:application/sdp
          Content-Length:...

          v=0
          o=- 3000000001 3000000001 IN IP4 192.0.2.11
          s=Set up MRCPv2 control and audio
          i=Initial contact
          c=IN IP4 192.0.2.11

   C->S:  ACK sip:mresources@server.example.com SIP/2.0
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bg2
          Max-Forwards:6
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:Sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:323123 ACK
          Content-Length:0

The client requests the server to create a synthesizer resource control channel to do speech synthesis. This also adds a media stream to send the generated speech. Note that, in this example, the client requests a new MRCPv2 TCP stream between the client and the server. In the following requests, the client will ask to use the existing connection.

   C->S:  INVITE sip:mresources@server.example.com SIP/2.0
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bg3
          Max-Forwards:6
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:323124 INVITE
          Contact:<sip:sarvi@client.example.com>
          Content-Type:application/sdp
          Content-Length:...

          v=0
          o=sarvi 2614933546 2614933547 IN IP4 192.0.2.12
          s=Set up MRCPv2 control and audio
          i=Add TCP channel, synthesizer and one-way audio
          c=IN IP4 192.0.2.12
          t=0 0
          m=application 9 TCP/MRCPv2 1
          a=setup:active
          a=connection:new
          a=resource:speechsynth
          a=cmid:1
          m=audio 49170 RTP/AVP 0 96
          a=rtpmap:0 pcmu/8000
          a=rtpmap:96 telephone-event/8000
          a=fmtp:96 0-15
          a=recvonly
          a=mid:1

   S->C:  SIP/2.0 200 OK
          Via:SIP/2.0/TCP client.atlanta.example.com:5060;
           branch=z9hG4bK74bg3;received=192.0.32.10
          To:MediaServer <sip:mresources@example.com>;tag=62784
          From:sarvi <sip:sarvi@example.com>;tag=1928301774
          Call-ID:a84b4c76e66710
          CSeq:323124 INVITE
          Contact:<sip:mresources@server.example.com>
          Content-Type:application/sdp
          Content-Length:...

          v=0
          o=- 3000000001 3000000002 IN IP4 192.0.2.11
          s=Set up MRCPv2 control and audio
          i=Add TCP channel, synthesizer and one-way audio
          c=IN IP4 192.0.2.11
          t=0 0
          m=application 32416 TCP/MRCPv2 1
          a=setup:passive
          a=connection:new
          a=channel:32AECB23433801@speechsynth
          a=cmid:1
          m=audio 48260 RTP/AVP 0
          a=rtpmap:0 pcmu/8000
          a=sendonly
          a=mid:1

C->S: ACK sip:mresources@server.example.com SIP/2.0 Via:SIP/2.0/TCP client.atlanta.example.com:5060; branch=z9hG4bK74bg4 Max-Forwards:6 To:MediaServer <sip:mresources@example.com>;tag=62784 From:Sarvi <sip:sarvi@example.com>;tag=1928301774 Call-ID:a84b4c76e66710 CSeq:323124 ACK Content-Length:0 This exchange allocates an additional resource control channel for a recognizer. Since a recognizer would need to receive an audio stream for recognition, this interaction also updates the audio stream to sendrecv, making it a two-way audio stream. C->S: INVITE sip:mresources@server.example.com SIP/2.0 Via:SIP/2.0/TCP client.atlanta.example.com:5060; branch=z9hG4bK74bg5 Max-Forwards:6 To:MediaServer <sip:mresources@example.com>;tag=62784 From:sarvi <sip:sarvi@example.com>;tag=1928301774 Call-ID:a84b4c76e66710 CSeq:323125 INVITE Contact:<sip:sarvi@client.example.com> Content-Type:application/sdp Content-Length:... v=0 o=sarvi 2614933546 2614933548 IN IP4 192.0.2.12 s=Set up MRCPv2 control and audio i=Add recognizer and duplex the audio c=IN IP4 192.0.2.12 t=0 0 m=application 9 TCP/MRCPv2 1 a=setup:active a=connection:existing a=resource:speechsynth a=cmid:1 m=audio 49170 RTP/AVP 0 96 a=rtpmap:0 pcmu/8000 a=rtpmap:96 telephone-event/8000 a=fmtp:96 0-15 a=recvonly a=mid:1 m=application 9 TCP/MRCPv2 1 a=setup:active
       a=connection:existing
       a=resource:speechrecog
       a=cmid:2
       m=audio 49180 RTP/AVP 0 96
       a=rtpmap:0 pcmu/8000
       a=rtpmap:96 telephone-event/8000
       a=fmtp:96 0-15
       a=sendonly
       a=mid:2

S->C:  SIP/2.0 200 OK
       Via:SIP/2.0/TCP client.atlanta.example.com:5060;
        branch=z9hG4bK74bg5;received=192.0.32.10
       To:MediaServer <sip:mresources@example.com>;tag=62784
       From:sarvi <sip:sarvi@example.com>;tag=1928301774
       Call-ID:a84b4c76e66710
       CSeq:323125 INVITE
       Contact:<sip:mresources@server.example.com>
       Content-Type:application/sdp
       Content-Length:...

       v=0
       o=- 3000000001 3000000003 IN IP4 192.0.2.11
       s=Set up MRCPv2 control and audio
       i=Add recognizer and duplex the audio
       c=IN IP4 192.0.2.11
       t=0 0
       m=application 32416 TCP/MRCPv2 1
       a=channel:32AECB23433801@speechsynth
       a=cmid:1
       m=audio 48260 RTP/AVP 0
       a=rtpmap:0 pcmu/8000
       a=sendonly
       a=mid:1
       m=application 32416 TCP/MRCPv2 1
       a=channel:32AECB23433801@speechrecog
       a=cmid:2
       m=audio 48260 RTP/AVP 0
       a=rtpmap:0 pcmu/8000
       a=rtpmap:96 telephone-event/8000
       a=fmtp:96 0-15
       a=recvonly
       a=mid:2
C->S:  ACK sip:mresources@server.example.com SIP/2.0
       Via:SIP/2.0/TCP client.atlanta.example.com:5060;
        branch=z9hG4bK74bg6
       Max-Forwards:6
       To:MediaServer <sip:mresources@example.com>;tag=62784
       From:Sarvi <sip:sarvi@example.com>;tag=1928301774
       Call-ID:a84b4c76e66710
       CSeq:323125 ACK
       Content-Length:0

An MRCPv2 SPEAK request initiates speech.

C->S:  MRCP/2.0 ... SPEAK 543257
       Channel-Identifier:32AECB23433801@speechsynth
       Kill-On-Barge-In:false
       Voice-gender:neutral
       Voice-age:25
       Prosody-volume:medium
       Content-Type:application/ssml+xml
       Content-Length:...

       <?xml version="1.0"?>
       <speak version="1.0"
              xmlns="http://www.w3.org/2001/10/synthesis"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                  http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
              xml:lang="en-US">
         <p>
           <s>You have 4 new messages.</s>
           <s>The first is from Stephanie Williams
             <mark name="Stephanie"/>
             and arrived at <break/>
             <say-as interpret-as="vxml:time">0345p</say-as>.</s>
           <s>The subject is
             <prosody rate="-20%">ski trip</prosody></s>
         </p>
       </speak>

S->C:  MRCP/2.0 ... 543257 200 IN-PROGRESS
       Channel-Identifier:32AECB23433801@speechsynth
       Speech-Marker:timestamp=857205015059
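
The MRCPv2 messages in this example can be told apart by the token layout of their start lines: a request carries a method name followed by a request-id, a response carries a request-id, a status code, and a request state, and an event carries an event name, a request-id, and a request state. The following Python sketch classifies a start line on that basis. The names `StartLine` and `parse_start_line` are illustrative only and are not defined by MRCPv2; the message-length field is elided as "..." in the examples, so the sketch does not validate it.

```python
from typing import NamedTuple, Optional


class StartLine(NamedTuple):
    kind: str                     # "request", "response", or "event"
    name: Optional[str]           # method or event name; None for a response
    request_id: int
    status_code: Optional[int]    # responses only
    request_state: Optional[str]  # responses and events only


def parse_start_line(line: str) -> StartLine:
    """Classify an MRCPv2 start line by its token layout (illustrative)."""
    tokens = line.split()
    if len(tokens) < 4 or not tokens[0].startswith("MRCP/"):
        raise ValueError(f"not an MRCPv2 start line: {line!r}")
    # tokens[1] is the message length; it is not checked in this sketch.
    rest = tokens[2:]
    if rest[0].isdigit():
        # Response: request-id, status-code, request-state.
        if len(rest) != 3:
            raise ValueError(f"malformed response line: {line!r}")
        return StartLine("response", None, int(rest[0]), int(rest[1]), rest[2])
    if len(rest) == 2:
        # Request: method-name, request-id.
        return StartLine("request", rest[0], int(rest[1]), None, None)
    if len(rest) == 3:
        # Event: event-name, request-id, request-state.
        return StartLine("event", rest[0], int(rest[1]), None, rest[2])
    raise ValueError(f"malformed start line: {line!r}")
```

For instance, `parse_start_line("MRCP/2.0 ... SPEAK 543257")` is classified as a request, while the server's `543257 200 IN-PROGRESS` line is classified as a response.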
The synthesizer hits the special marker in the message to be spoken and faithfully informs the client of the event.

S->C:  MRCP/2.0 ... SPEECH-MARKER 543257 IN-PROGRESS
       Channel-Identifier:32AECB23433801@speechsynth
       Speech-Marker:timestamp=857206027059;Stephanie

The synthesizer finishes with the SPEAK request.

S->C:  MRCP/2.0 ... SPEAK-COMPLETE 543257 COMPLETE
       Channel-Identifier:32AECB23433801@speechsynth
       Speech-Marker:timestamp=857207685213;Stephanie

The recognizer is issued a request to listen for the customer choices.

C->S:  MRCP/2.0 ... RECOGNIZE 543258
       Channel-Identifier:32AECB23433801@speechrecog
       Content-Type:application/srgs+xml
       Content-Length:...

       <?xml version="1.0"?>
       <!-- the default grammar language is US English -->
       <grammar xmlns="http://www.w3.org/2001/06/grammar"
                xml:lang="en-US" version="1.0" root="request">
         <!-- single language attachment to a rule expansion -->
         <rule id="request">
           Can I speak to
           <one-of xml:lang="fr-CA">
             <item>Michel Tremblay</item>
             <item>Andre Roy</item>
           </one-of>
         </rule>
       </grammar>

S->C:  MRCP/2.0 ... 543258 200 IN-PROGRESS
       Channel-Identifier:32AECB23433801@speechrecog

The client issues the next MRCPv2 SPEAK method.

C->S:  MRCP/2.0 ... SPEAK 543259
       Channel-Identifier:32AECB23433801@speechsynth
       Kill-On-Barge-In:true
       Content-Type:application/ssml+xml
       Content-Length:...
       <?xml version="1.0"?>
       <speak version="1.0"
              xmlns="http://www.w3.org/2001/10/synthesis"
              xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xsi:schemaLocation="http://www.w3.org/2001/10/synthesis
                  http://www.w3.org/TR/speech-synthesis/synthesis.xsd"
              xml:lang="en-US">
         <p>
           <s>Welcome to ABC corporation.</s>
           <s>Who would you like to talk to?</s>
         </p>
       </speak>

S->C:  MRCP/2.0 ... 543259 200 IN-PROGRESS
       Channel-Identifier:32AECB23433801@speechsynth
       Speech-Marker:timestamp=857207696314

This next section of this ongoing example demonstrates how kill-on-barge-in support works. Since this last SPEAK request had Kill-On-Barge-In set to "true", when the recognizer (the server) generated the START-OF-INPUT event while a SPEAK was active, the client immediately issued a BARGE-IN-OCCURRED method to the synthesizer resource. The speech synthesizer then terminated playback and notified the client. The completion-cause code provided the indication that this was a kill-on-barge-in interruption rather than a normal completion.

Note that, since the recognition and synthesizer resources are in the same session on the same server, to obtain a faster response the server might have internally relayed the start-of-input condition to the synthesizer directly, before receiving the expected BARGE-IN-OCCURRED event. However, any such communication is outside the scope of MRCPv2.

S->C:  MRCP/2.0 ... START-OF-INPUT 543258 IN-PROGRESS
       Channel-Identifier:32AECB23433801@speechrecog
       Proxy-Sync-Id:987654321

C->S:  MRCP/2.0 ... BARGE-IN-OCCURRED 543259
       Channel-Identifier:32AECB23433801@speechsynth
       Proxy-Sync-Id:987654321

S->C:  MRCP/2.0 ... 543259 200 COMPLETE
       Channel-Identifier:32AECB23433801@speechsynth
       Active-Request-Id-List:543258
       Speech-Marker:timestamp=857206096314
S->C:  MRCP/2.0 ... SPEAK-COMPLETE 543259 COMPLETE
       Channel-Identifier:32AECB23433801@speechsynth
       Completion-Cause:001 barge-in
       Speech-Marker:timestamp=857207685213

The recognizer resource matched the spoken stream to a grammar and generated results. The result of the recognition is returned by the server as part of the RECOGNITION-COMPLETE event.

S->C:  MRCP/2.0 ... RECOGNITION-COMPLETE 543258 COMPLETE
       Channel-Identifier:32AECB23433801@speechrecog
       Completion-Cause:000 success
       Waveform-URI:<http://web.media.com/session123/audio.wav>;
        size=423523;duration=25432
       Content-Type:application/nlsml+xml
       Content-Length:...

       <?xml version="1.0"?>
       <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
               xmlns:ex="http://www.example.com/example"
               grammar="session:request1@form-level.store">
         <interpretation>
           <instance name="Person">
             <ex:Person>
               <ex:Name> Andre Roy </ex:Name>
             </ex:Person>
           </instance>
           <input> may I speak to Andre Roy </input>
         </interpretation>
       </result>

Since the client was now finished with the session, including all resources, it issued a SIP BYE request to close the SIP session. This caused all control channels and resources allocated under the session to be deallocated.

C->S:  BYE sip:mresources@server.example.com SIP/2.0
       Via:SIP/2.0/TCP client.atlanta.example.com:5060;
        branch=z9hG4bK74bg7
       Max-Forwards:6
       From:Sarvi <sip:sarvi@example.com>;tag=1928301774
       To:MediaServer <sip:mresources@example.com>;tag=62784
       Call-ID:a84b4c76e66710
       CSeq:323126 BYE
       Content-Length:0
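
The client-side decision in the kill-on-barge-in part of this example can be sketched as follows: when a START-OF-INPUT event arrives while a SPEAK issued with Kill-On-Barge-In set to "true" is still active, the client builds a BARGE-IN-OCCURRED request for the synthesizer channel and copies the Proxy-Sync-Id from the event. This is a minimal Python sketch under stated assumptions, not a definitive implementation: `ActiveSpeak` and `build_barge_in` are hypothetical names, the event headers are assumed to be available as a mapping with lowercased names, the request-id is supplied by the caller, and the message length is left as the "..." placeholder used throughout the examples.

```python
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass
class ActiveSpeak:
    """State a client might keep for a SPEAK that is still IN-PROGRESS."""
    channel_id: str          # synthesizer Channel-Identifier
    kill_on_barge_in: bool   # Kill-On-Barge-In value sent with the SPEAK


def build_barge_in(event_name: str,
                   event_headers: Mapping[str, str],
                   active_speak: Optional[ActiveSpeak],
                   request_id: int) -> Optional[str]:
    """Return a BARGE-IN-OCCURRED request, or None if none is called for."""
    if event_name != "START-OF-INPUT":
        return None
    if active_speak is None or not active_speak.kill_on_barge_in:
        return None
    lines = [
        # "..." stands in for the message length, as in the examples.
        f"MRCP/2.0 ... BARGE-IN-OCCURRED {request_id}",
        f"Channel-Identifier:{active_speak.channel_id}",
    ]
    sync_id = event_headers.get("proxy-sync-id")
    if sync_id is not None:
        # Copy the Proxy-Sync-Id from the event, as the example exchange does.
        lines.append(f"Proxy-Sync-Id:{sync_id}")
    # Header block ends with an empty line.
    return "\r\n".join(lines) + "\r\n\r\n"
```

With the values from the exchange above (request-id 543259, channel `32AECB23433801@speechsynth`, Proxy-Sync-Id 987654321), the function yields the same three lines as the BARGE-IN-OCCURRED message shown; for the earlier SPEAK sent with Kill-On-Barge-In set to "false", it returns `None`.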
14.2. Recognition Result Examples
14.2.1. Simple ASR Ambiguity
   System: To which city will you be traveling?
   User:   I want to go to Pittsburgh.

   <?xml version="1.0"?>
   <result xmlns="urn:ietf:params:xml:ns:mrcpv2"
           xmlns:ex="http://www.example.com/example"
           grammar="http://www.example.com/flight">
     <interpretation confidence="0.6">
       <instance>
         <ex:airline>
           <ex:to_city>Pittsburgh</ex:to_city>
         </ex:airline>
       </instance>
       <input mode="speech">
         I want to go to Pittsburgh
       </input>
     </interpretation>
     <interpretation confidence="0.4">
       <instance>
         <ex:airline>
           <ex:to_city>Stockholm</ex:to_city>
         </ex:airline>
       </instance>
       <input>I want to go to Stockholm</input>
     </interpretation>
   </result>

14.2.2. Mixed Initiative
   System: What would you like?
   User:   I would like 2 pizzas, one with pepperoni and cheese, one with sausage and a bottle of coke, to go.

This example includes an order object which in turn contains objects named "food_item", "drink_item", and "delivery_method". The representation assumes there are no ambiguities in the speech or natural language processing. Note that this representation also assumes some level of intra-sentential anaphora resolution, i.e., to resolve the two "one"s as "pizza".

   <?xml version="1.0"?>
   <nl:result xmlns:nl="urn:ietf:params:xml:ns:mrcpv2"
              xmlns="http://www.example.com/example"
              grammar="http://www.example.com/foodorder">
     <nl:interpretation confidence="1.0">
       <nl:instance>
         <order>
           <food_item confidence="1.0">
             <pizza>
               <ingredients confidence="1.0">
                 pepperoni
               </ingredients>
               <ingredients confidence="1.0">
                 cheese
               </ingredients>
             </pizza>
             <pizza>
               <ingredients>sausage</ingredients>
             </pizza>
           </food_item>
           <drink_item confidence="1.0">
             <size>2-liter</size>
           </drink_item>
           <delivery_method>to go</delivery_method>
         </order>
       </nl:instance>
       <nl:input mode="speech">I would like 2 pizzas,
         one with pepperoni and cheese, one with sausage
         and a bottle of coke, to go.
       </nl:input>
     </nl:interpretation>
   </nl:result>

14.2.3. DTMF Input
A combination of DTMF input and speech is represented using nested input elements. For example:

   User: My pin is (dtmf 1 2 3 4)

   <input>
     <input mode="speech" confidence="1.0"
            timestamp-start="2000-04-03T0:00:00"
            timestamp-end="2000-04-03T0:00:01.5">My pin is
     </input>
     <input mode="dtmf" confidence="1.0"
            timestamp-start="2000-04-03T0:00:01.5"
            timestamp-end="2000-04-03T0:00:02.0">1 2 3 4
     </input>
   </input>
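
As an illustration of how a client might consume nested input elements, the following Python sketch walks the fragment above and collects the innermost `<input>` elements in document order, then extracts the DTMF digits. It is a sketch under assumptions rather than a prescribed algorithm: the helper name `leaf_inputs` is hypothetical, the fragment is treated as un-namespaced exactly as printed, and the timestamp attributes are left uninterpreted.

```python
import xml.etree.ElementTree as ET

# The nested-input fragment from the example, copied as a string.
FRAGMENT = """
<input>
  <input mode="speech" confidence="1.0"
         timestamp-start="2000-04-03T0:00:00"
         timestamp-end="2000-04-03T0:00:01.5">My pin is </input>
  <input mode="dtmf" confidence="1.0"
         timestamp-start="2000-04-03T0:00:01.5"
         timestamp-end="2000-04-03T0:00:02.0">1 2 3 4 </input>
</input>
"""


def leaf_inputs(element):
    """Yield (mode, text) for each innermost <input>, in document order."""
    children = element.findall("input")
    if not children:
        yield element.get("mode"), (element.text or "").strip()
        return
    for child in children:
        yield from leaf_inputs(child)


root = ET.fromstring(FRAGMENT.strip())
segments = list(leaf_inputs(root))

# Join the DTMF segments into a single digit string.
dtmf_digits = "".join(
    text.replace(" ", "") for mode, text in segments if mode == "dtmf"
)
```

Here `segments` holds one `(mode, text)` pair per modality, so an application can keep the spoken carrier phrase and the keyed digits separate.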
Note that grammars that recognize mixtures of speech and DTMF are not currently possible in SRGS; however, this representation might be needed for other applications of NLSML, and this mixture capability might be introduced in future versions of SRGS.

14.2.4. Interpreting Meta-Dialog and Meta-Task Utterances
Natural language communication makes use of meta-dialog and meta-task utterances. This specification is flexible enough so that meta-utterances can be represented on an application-specific basis without requiring other standard markup.

Here are two examples of how meta-task and meta-dialog utterances might be represented.

   System: What toppings do you want on your pizza?
   User:   What toppings do you have?

   <interpretation grammar="http://www.example.com/toppings">
     <instance>
       <question>
         <questioned_item>toppings</questioned_item>
         <questioned_property>
           availability
         </questioned_property>
       </question>
     </instance>
     <input mode="speech">
       what toppings do you have?
     </input>
   </interpretation>

   User: slow down.

   <interpretation
      grammar="http://www.example.com/generalCommandsGrammar">
     <instance>
       <command>
         <action>reduce speech rate</action>
         <doer>system</doer>
       </command>
     </instance>
     <input mode="speech">slow down</input>
   </interpretation>
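
Because the markup inside `<instance>` is application-specific, a client has to decide for itself how to route such results. One possible approach, sketched below in Python, is to look at the first element under `<instance>` and treat its tag as the utterance type. The function name `classify_instance` is hypothetical, the element names `question` and `command` are merely the ones used in the two examples above, and the fragments are assumed to be un-namespaced as printed.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple


def classify_instance(interpretation_xml: str) -> Tuple[str, Dict[str, str]]:
    """Return (top-level tag under <instance>, its child fields)."""
    root = ET.fromstring(interpretation_xml.strip())
    instance = root.find("instance")
    if instance is None or len(instance) == 0:
        return "unknown", {}
    top = instance[0]
    # Map each child element name to its whitespace-trimmed text.
    fields = {child.tag: (child.text or "").strip() for child in top}
    return top.tag, fields


SLOW_DOWN = """
<interpretation grammar="http://www.example.com/generalCommandsGrammar">
  <instance>
    <command>
      <action>reduce speech rate</action>
      <doer>system</doer>
    </command>
  </instance>
  <input mode="speech">slow down</input>
</interpretation>
"""

kind, fields = classify_instance(SLOW_DOWN)
```

An application could then dispatch on `kind`, for example answering a `question` from its own data and applying a `command` to the dialog settings, while ordinary task utterances follow the normal path.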
14.2.5. Anaphora and Deixis
This specification can be used on an application-specific basis to represent utterances that contain unresolved anaphoric and deictic references. Anaphoric references, which include pronouns and definite noun phrases that refer to something that was mentioned in the preceding linguistic context, and deictic references, which refer to something that is present in the non-linguistic context, present similar problems in that there may not be sufficient unambiguous linguistic context to determine what their exact role in the interpretation should be.

In order to represent unresolved anaphora and deixis using this specification, one strategy would be for the developer to define a more surface-oriented representation that leaves the specific details of the interpretation of the reference open. (This assumes that a later component is responsible for actually resolving the reference.)

Example (ignoring the issue of representing the input from the pointing gesture):

   System: What do you want to drink?
   User:   I want this. (clicks on picture of large root beer.)

   <?xml version="1.0"?>
   <nl:result xmlns:nl="urn:ietf:params:xml:ns:mrcpv2"
              xmlns="http://www.example.com/example"
              grammar="http://www.example.com/beverages.grxml">
     <nl:interpretation>
       <nl:instance>
         <doer>I</doer>
         <action>want</action>
         <object>this</object>
       </nl:instance>
       <nl:input mode="speech">I want this</nl:input>
     </nl:interpretation>
   </nl:result>

14.2.6. Distinguishing Individual Items from Sets with One Member
For programming convenience, it is useful to be able to distinguish between individual items and sets containing one item in the XML representation of semantic results. For example, a pizza order might consist of exactly one pizza, but a pizza might contain zero or more toppings. Since there is no standard way of marking this distinction directly in XML, in the current framework, the developer is free to adopt any conventions that would convey this information in the XML markup. One strategy would be for the developer to wrap the set of items in a grouping element, as in the following example.
   <order>
     <pizza>
       <topping-group>
         <topping>mushrooms</topping>
       </topping-group>
     </pizza>
     <drink>coke</drink>
   </order>

In this example, the programmer can assume that there is supposed to be exactly one pizza and one drink in the order, but the fact that there is only one topping is an accident of this particular pizza order.

Note that the client controls both the grammar and the semantics to be returned upon grammar matches, so the user of MRCPv2 is fully empowered to cause results to be returned in NLSML in such a way that the interpretation is clear to that user.

14.2.7. Extensibility
Extensibility in NLSML is provided via result content flexibility, as described in the discussions of meta-utterances and anaphora. NLSML can easily be used in sophisticated systems to convey application-specific information that more basic systems would not make use of, for example, defining speech acts.