RFC 6189

ZRTP: Media Path Key Agreement for Unicast Secure RTP

Pages: 115
Informational

Part 4 of 5 – Pages 74 to 101

6.  Retransmissions

   ZRTP uses two retransmission timers T1 and T2.  T1 is used for
   retransmission of Hello messages, when the support of ZRTP by the
   other endpoint may not be known.  T2 is used in retransmissions of
   all the other ZRTP messages.

   All message retransmissions MUST be identical to the initial message
   including nonces, public values, etc; otherwise, hashes of the
   message sequences may not agree.

   Practical experience has shown that RTP packet loss at the start of
   an RTP session can be extremely high.  Since the entire ZRTP message
   exchange occurs during this period, the retransmission scheme is
   defined to be aggressive.  Since ZRTP packets with the exception
   of the DHPart1 and DHPart2 messages are small, this should have
   minimal effect on overall bandwidth utilization of the media session.

   ZRTP endpoints MUST NOT exceed the bandwidth of the resulting media
   session as determined by the offer/answer exchange in the signaling
   layer.

   The Ping message (Section 5.15) may follow the same retransmission
   schedule as the Hello message, but this is not required in this
   specification.  Ping message retransmission is subject to
   application-specific ZRTP proxy heuristics.

   Hello ZRTP messages are retransmitted at an interval that starts at
   T1 and doubles after every retransmission, capping at 200 ms.
   T1 has a recommended initial value of 50 ms.  A Hello message is
   retransmitted 20 times before giving up, which means the entire retry
   schedule for Hello messages is exhausted after 3.75 seconds (50 + 100
   + 18*200 ms).  Retransmission of a Hello ends upon receipt of a
   HelloACK or Commit message.
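
   The Hello schedule above can be reproduced with a few lines of
   arithmetic.  The following is only an illustrative sketch; the
   function name and its parameters are hypothetical and are not
   defined by this specification.

```python
def retransmit_intervals(initial_ms, cap_ms, count):
    """Return the wait (in ms) before each of `count` retransmissions.

    The wait starts at `initial_ms`, doubles after every
    retransmission, and never exceeds `cap_ms`.
    """
    intervals = []
    wait = initial_ms
    for _ in range(count):
        intervals.append(wait)
        wait = min(wait * 2, cap_ms)
    return intervals


# Hello messages: T1 = 50 ms, cap = 200 ms, 20 retransmissions.
hello = retransmit_intervals(50, 200, 20)
print(hello[:4])   # first few waits: 50, 100, 200, 200
print(sum(hello))  # total in ms: 50 + 100 + 18*200
```

   The sum of the 20 intervals is 3750 ms, matching the 3.75 seconds
   stated above.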

   The post-Hello ZRTP messages are retransmitted only by the session
   initiator -- that is, only Commit, DHPart2, and Confirm2 are
   retransmitted if the corresponding messages from the responder
   (DHPart1, Confirm1, and Conf2ACK, respectively) are not received.
   Note that the Confirm2 message retransmission can also be stopped by
   receiving the first SRTP media (with a valid SRTP auth tag) from the
   responder.

   The GoClear, Error, and SASrelay messages may be initiated and
   retransmitted by either party, and responded to by the other party,
   regardless of which party is the overall session initiator.  They are
   retransmitted if the corresponding response messages (ClearACK,
   ErrorACK, and RelayACK, respectively) are not received.

   Non-Hello (and non-Ping) ZRTP messages are retransmitted at an
   interval that starts at T2 and doubles after every retransmission,
   capping at 1200 ms.  T2 has a recommended initial value of 150 ms.
   Each non-Hello message is retransmitted 10 times before giving up,
   which means the entire retry schedule is exhausted after 9.45 seconds
   (150 + 300 + 600 + 7*1200 ms).  Only the initiator performs
   retransmissions.  Each message has a response message that stops
   retransmissions, as shown in the table below.  The higher value of
   T2 means that retransmissions will likely occur only in the event of
   packet loss.

      Message      Acknowledgement Message
      -------      -----------------------
      Hello        HelloACK or Commit
      Commit       DHPart1 or Confirm1
      DHPart2      Confirm1
      Confirm2     Conf2ACK or SRTP media
      GoClear      ClearACK
      Error        ErrorACK
      SASrelay     RelayACK
      Ping         PingACK

     Table 9. Retransmitted ZRTP Messages and Responses
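
   As an illustration only, Table 9 and the T2 schedule can be
   expressed as a lookup table and a small helper.  The names ACKS,
   stops_retransmission, and t2_schedule are hypothetical; "SRTP
   media" stands in for receipt of the first validly authenticated
   SRTP packet.

```python
# Table 9 as a mapping from a sent message to the responses that
# stop its retransmission.
ACKS = {
    "Hello": {"HelloACK", "Commit"},
    "Commit": {"DHPart1", "Confirm1"},
    "DHPart2": {"Confirm1"},
    "Confirm2": {"Conf2ACK", "SRTP media"},
    "GoClear": {"ClearACK"},
    "Error": {"ErrorACK"},
    "SASrelay": {"RelayACK"},
    "Ping": {"PingACK"},
}


def stops_retransmission(sent, received):
    """True if `received` ends the retransmission of `sent`."""
    return received in ACKS.get(sent, set())


def t2_schedule(initial_ms=150, cap_ms=1200, count=10):
    """Waits (ms) for non-Hello messages: doubling, capped at cap_ms."""
    waits = []
    wait = initial_ms
    for _ in range(count):
        waits.append(wait)
        wait = min(wait * 2, cap_ms)
    return waits


print(stops_retransmission("Commit", "Confirm1"))
print(sum(t2_schedule()))  # total in ms: 150 + 300 + 600 + 7*1200
```

   With the recommended values, the ten intervals sum to 9450 ms,
   which is the 9.45 seconds given in the text.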

   The retry schedule must handle not only packet loss, but also slow or
   heavily loaded peers that need additional time to perform their DH
   calculations.  The following mitigations are recommended:

   o  Slow or heavily loaded ZRTP endpoints that are at risk of taking
      too long to perform their DH calculation SHOULD use a HelloACK
      message instead of a Commit message to reply to a Hello from the
      other party.

   o  If a ZRTP endpoint has evidence that the other party is a ZRTP
      endpoint, by receiving a Hello message or Ping message, or by
      receiving a Hello Hash in the signaling layer, it SHOULD extend
      its own Hello retry schedule to span at least 12 seconds of
      retries.  If this extended Hello retry schedule is exhausted
      without receiving a HelloACK or Commit message, a late Commit
      message from the peer SHOULD still be accepted.

   These recommended retransmission intervals are designed for a typical
   broadband Internet connection.  In some high-latency communication
   channels, such as those provided by some mobile phone environments or
   geostationary satellites, a different retransmission schedule may be
   used.  The initial value for the T1 or T2 retransmission timer should
   be increased to be no less than the round-trip time provided by the
   communications channel.  It should take into account the time
   required to transmit the entire message and the entire reply, as well
   as a reasonable time estimate to perform the DH calculation.
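
   One possible way to derive an initial timer value for such channels
   is sketched below.  This specification does not give a formula;
   summing the round-trip time, the serialization time of the message
   and its reply, and the DH computation time is an assumption made
   here for illustration, and all names and example figures are
   hypothetical.

```python
def initial_timer_ms(recommended_ms, rtt_ms, message_bytes,
                     reply_bytes, link_bps, dh_ms):
    """Pick an initial T1/T2 value for a slow or high-latency channel.

    Assumption (not from the specification): the floor is the sum of
    the round-trip time, the time to transmit the message and its
    reply at `link_bps`, and the estimated DH calculation time.
    """
    transmit_ms = (message_bytes + reply_bytes) * 8 * 1000 / link_bps
    floor_ms = rtt_ms + transmit_ms + dh_ms
    # Never go below the value recommended for broadband links.
    return max(recommended_ms, floor_ms)


# Hypothetical satellite link: 600 ms RTT, 64 kbit/s, 500-byte
# message and reply, 50 ms for the DH calculation.
print(initial_timer_ms(150, 600, 500, 500, 64000, 50))
```

   On a fast, low-latency link the computed floor falls below the
   recommended value, so the recommended value is kept.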

   ZRTP has its own retransmission schedule because it is carried along
   with RTP, usually over UDP.  In unusual cases, RTP can run over a
   non-UDP transport, such as TCP or DCCP, which provides its own
   built-in retransmission mechanism.  It may be hard for the ZRTP
   endpoint to detect that TCP is being used if media relays are
   involved.  The ZRTP endpoint may be sending only UDP, but there may
   be a media relay along the media path that converts from UDP to TCP
   for part of the journey.  Or, if the ZRTP endpoint is sending TCP,
   the media relay might be converting from TCP to UDP.  There have been
   empirical observations of this in the wild.  In cases where TCP is
   used, ZRTP and TCP might together generate some extra
   retransmissions.  It is tempting to avoid this effect by eliminating
   the ZRTP retransmission schedule when connected to a TCP channel, but
   that would risk failure of the protocol, because it may not be TCP
   all the way to the remote ZRTP endpoint.  It only takes a few packets
   to complete a ZRTP exchange, so trying to optimize out the extra
   retransmissions in that scenario is not worth the risk.

   After receiving a Commit message, but before receiving a Confirm2
   message, if a ZRTP responder receives no ZRTP messages for more than
   10 seconds, the responder MAY send a protocol timeout Error message
   and terminate the ZRTP protocol.

7.  Short Authentication String

   This section discusses the implementation of the Short
   Authentication String (SAS) in ZRTP.  The SAS can be verbally
   compared by the human users reading the string aloud, or it can be
   compared by validating an OPTIONAL digital signature (described in
   Section 7.2) exchanged in the Confirm1 or Confirm2 messages.

   The use of hash commitment in the DH exchange (Section 4.4.1.1)
   constrains the attacker to only one guess to generate the correct SAS
   in his attack, which means the SAS can be quite short.  A 16-bit SAS,
   for example, provides the attacker only one chance out of 65536 of
   not being detected.  How the hash commitment enables the SAS to be so
   short is explained in Section 4.4.1.1.
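
   The one-guess bound can be stated as simple arithmetic.  The helper
   below is a hypothetical illustration, assuming the attacker's
   single guess is uniformly distributed over all possible SAS values.

```python
from fractions import Fraction


def undetected_probability(sas_bits):
    """Chance that a single blind guess matches an n-bit SAS."""
    return Fraction(1, 2 ** sas_bits)


print(undetected_probability(16))  # one chance out of 65536
```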

   There is only one SAS value computed per call.  That is the SAS value
   for the first media stream established, which is calculated in
   Section 4.5.2.  This SAS applies to all media streams for the same
   session.

   The SAS SHOULD be rendered to the user for authentication.  The
   rendering of the SAS value through the user interface at both
   endpoints depends on the SAS Type agreed upon in the Commit message.
   See Section 5.1.6 for a description of how the SAS is rendered to the
   user.

   The SAS is not treated as a secret value, but it must be compared to
   see if it matches at both ends of the communications channel.  The
   two users verbally compare it using their human voices, human ears,
   and human judgement.  If it doesn't match, it indicates the presence
   of a MiTM attack.

   It is worse than useless and absolutely unsafe to rely on a robot
   voice from the remote endpoint to compare the SAS, because a robot
   voice can be trivially forged by a MiTM.  The SAS verbal comparison
   can only be done with a real live human at the remote endpoint.

7.1.  SAS Verified Flag

   The SAS Verified flag (V) is set based on the user indicating that
   SAS comparison has been successfully performed.  The SAS Verified
   flag is exchanged securely in the Confirm1 and Confirm2 messages
   (Figure 10) of the next session.  In other words, each party sends
   the SAS Verified flag from the previous session in the Confirm
   message of the current session.  It is perfectly reasonable to have a
   ZRTP endpoint that never sets the SAS Verified flag, because it would
   require adding complexity to the user interface to allow the user to
   set it.  The SAS Verified flag is not required to be set, but if it
   is available to the client software, it allows for the possibility
   that the client software could render to the user that the SAS verify
   procedure was carried out in a previous session.

   Regardless of whether there is a user interface element to allow the
   user to set the SAS Verified flag, it is worth caching a shared
   secret, because doing so reduces opportunities for an attacker in the
   next call.

   If at any time the users carry out the SAS comparison procedure, and
   it actually fails to match, then this means there is a very
   resourceful MiTM.  If this is the first call, the MiTM was there on
   the first call, which is impressive enough.  If it happens in a later
   call, it means the MiTM must also know the cached shared secret,
   because you could not have carried out any voice traffic at all
   unless the session key was correctly computed and is also known to
   the attacker.  This implies the MiTM must have been present in all
   the previous sessions, since the initial establishment of the first
   shared secret.  This is indeed a resourceful attacker.  It also means
   that if at any time he ceases his participation as a MiTM on one of
   your calls, the protocol will detect that the cached shared secret is
   no longer valid -- because it was really two different shared secrets
   all along, one of them between Alice and the attacker, and the other
   between the attacker and Bob.  The continuity of the cached shared
   secrets makes it possible for us to detect the MiTM when he inserts
   himself into the ongoing relationship, as well as when he leaves.
   Also, if the attacker tries to stay with a long lineage of calls, but
   fails to execute a DH MiTM attack for even one missed call, he is
   permanently excluded.  He can no longer resynchronize with the chain
   of cached shared secrets.

   A user interface element (i.e., a checkbox or button) is needed to
   allow the user to tell the software the SAS verify was successful,
   causing the software to set the SAS Verified flag (V), which
   (together with our cached shared secret) obviates the need to perform
   the SAS procedure in the next call.  An additional user interface
   element can be provided to let the user tell the software he detected
   an actual SAS mismatch, which indicates a MiTM attack.  The software
   can then take appropriate action, clearing the SAS Verified flag and
   erasing the cached shared secret from this session.  It is up to the
   implementer to decide if this added user interface complexity is
   warranted.

   If the SAS matches, it means there is no MiTM, which also implies it
   is now safe to trust a cached shared secret for later calls.  If
   inattentive users don't bother to check the SAS, it means we don't
   know whether there is or is not a MiTM, so even if we do establish a
   new cached shared secret, there is a risk that our potential attacker
   may have a subsequent opportunity to continue inserting himself in
   the call, until we finally get around to checking the SAS.  If the
   SAS matches, it means no attacker was present for any previous
   session since we started propagating cached shared secrets, because
   this session and all the previous sessions were also authenticated
   with a continuous lineage of shared secrets.

7.2.  Signing the SAS

   In most applications, it is desirable to avoid the added complexity
   of a PKI-backed digital signature, which is why ZRTP is designed not
   to require it.  Nonetheless, in some applications, it may be hard to
   arrange for two human users to verbally compare the SAS.  Or, an
   application may already be using an existing PKI and wants to use it
   to augment ZRTP.

   To handle these cases, ZRTP allows for an OPTIONAL signature feature,
   which allows the SAS to be checked without human participation.  The
   SAS MAY be signed and the signature sent inside the Confirm1,
   Confirm2 (Figure 10), or SASrelay (Figure 16) messages.  The
   signature type (Section 5.1.7), length of the signature, and the key
   used to create the signature (or a link to it) are all sent along
   with the signature.  The signature is calculated across the entire
   SAS hash result (sashash), from which the sasvalue was derived.  The
   signatures exchanged in the encrypted Confirm1, Confirm2, or SASrelay
   messages MAY be used to authenticate the ZRTP exchange.  A signature
   may be sent only in the initial media stream in a DH or ECDH ZRTP
   exchange, not in Multistream mode.

   Although the signature is sent, the material that is signed, the
   sashash, is not sent with it in the Confirm message, since both
   parties have already independently calculated the sashash.  That is
   not the case for the SASrelay message, which must relay the sashash.
   To avoid unnecessary signature calculations, a signature SHOULD NOT
   be sent if the other ZRTP endpoint did not set the (S) flag in the
   Hello message (Section 5.2).

   Note that the choice of hash algorithm used in the digital signature
   is independent of the hash used in the sashash.  The sashash is
   determined by the negotiated Hash Type (Section 5.1.2), while the
   hash used by the digital signature is separately defined by the
   digital signature algorithm.  For example, the sashash may be based
   on SHA-256, while the digital signature might use SHA-384, if an
   ECDSA P-384 key is used.

   If the sashash (which is always truncated to 256 bits) is shorter
   than the signature hash, the security is not weakened because the
   hash commitment precludes the attacker from searching for sashash
   collisions.

   ECDSA algorithms may be used with either OpenPGP-formatted keys, or
   X.509v3 certificates.  If the ZRTP key exchange is ECDH, and the SAS
   is signed, then the signature SHOULD be ECDSA, and SHOULD use the
   same size curve as the ECDH exchange if an ECDSA key of that size is
   available.

   If a ZRTP endpoint supports incoming signatures (evidenced by setting
   the (S) flag in the Hello message), it SHOULD be able to parse
   signatures from the other endpoint in OpenPGP format and MUST be able
   to parse them in X.509v3 format.  If the incoming signature is in an
   unsupported format, or the trust model does not lead to a trusted
   introducer or a trusted certificate authority (CA), another
   authentication method may be used if available, such as the SAS
   compare, or a cached shared secret from a previous session.  If none
   of these methods are available, it is up to the ZRTP user agent and
   the user to decide whether to proceed with the call, after the user
   is informed.

   Both ECDSA and DSA [FIPS-186-3] have a feature that allows most of
   the signature calculation to be done in advance of the session,
   reducing latency during call setup.  This is useful for low-power
   mobile handsets.

   ECDSA is preferred because it has compact keys as well as compact
   signatures.  If the signature along with its public key certificate
   is insufficiently compact, the Confirm message may become too long
   for the maximum transmission unit (MTU) size, and UDP fragmentation
   may result.  Some firewalls and NATs may discard fragmented UDP
   packets, which would cause the ZRTP exchange to fail.  It is
   RECOMMENDED that a ZRTP endpoint avoid sending signatures if they
   would cause UDP fragmentation.  For a discussion on MTU size and PMTU
   discovery, see [RFC1191] and [RFC1981].
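
   A sender might apply a check along the following lines before
   attaching a signature.  This is only a sketch: the function name,
   the 1500-octet default path MTU, and the IPv4 and UDP header sizes
   are assumptions for illustration; a real endpoint would use its
   discovered path MTU and actual header overhead.

```python
def would_fragment(zrtp_packet_len, signature_block_len,
                   path_mtu=1500, ip_header_len=20, udp_header_len=8):
    """Estimate whether adding a signature block exceeds the path MTU.

    `zrtp_packet_len` is the length in octets of the complete ZRTP
    packet (Confirm message plus ZRTP framing) without the signature
    block.  The default header sizes assume IPv4 without options.
    """
    total = (ip_header_len + udp_header_len
             + zrtp_packet_len + signature_block_len)
    return total > path_mtu


# Hypothetical sizes: a 100-octet packet with two candidate blocks.
print(would_fragment(100, 600))   # fits within 1500 octets
print(would_fragment(100, 1400))  # 1528 octets, exceeds 1500
```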

   From a packet-size perspective, ECDSA and DSA both produce equally
   compact signatures for a given signature strength.  DSA keys are much
   bigger than ECDSA keys, but in the case of OpenPGP signatures, the
   public key is not sent along with the signature.

   All signatures generated MUST use only NIST-approved hash algorithms,
   and MUST avoid using SHA1.  This applies to both OpenPGP and X.509v3
   signatures.  NIST-approved hash algorithms are found in [FIPS-180-3]
   or its SHA-3 successor.  All ECDSA curves used throughout this spec
   are over prime fields, drawn from Appendix D.1.2 of [FIPS-186-3].

7.2.1.  OpenPGP Signatures

   If the SAS Signature Type (Section 5.1.7) specifies an OpenPGP
   signature ("PGP "), the signature-related fields are arranged as
   follows.

   The first field after the 4-octet Signature Type Block is the OpenPGP
   signature.  The format of this signature and the algorithms that
   create it are specified by [RFC4880].  The signature is comprised of
   a complete OpenPGP version 4 signature in binary form (not Radix-64),
   as specified in RFC 4880, Section 5.2.3, enclosed in the full OpenPGP
   packet syntax.  The length of the OpenPGP signature is parseable from
   the signature, and depends on the type and length of the signing key.

   If OpenPGP signatures are supported, an implementation SHOULD NOT
   generate signatures using any signature algorithm other than DSA or
   ECDSA (ECDSA is a reserved algorithm type in RFC 4880), but MAY
   accept other signature types from the other party.  DSA signatures
   with keys shorter than 2048 bits or longer than 3072 bits MUST NOT be
   generated.

   Implementers should be aware that ECDSA signatures for OpenPGP are
   expected to become available when the work in progress [ECC-OpenPGP]
   becomes an RFC.  Any use of ECDSA signatures in ZRTP SHOULD NOT
   generate signatures using ECDSA key sizes other than P-224, P-256,
   and P-384, as defined in [FIPS-186-3].

   RFC 4880, Section 5.2.3.18, specifies a way to embed, in an OpenPGP
   signature, a URI of the preferred key server.  The URI should be
   fully specified to obtain the public key of the signing key that
   created the signature.  This URI MUST be present.  It is up to the
   recipient of the signature to obtain the public key of the signing
   key and determine its validity status using the OpenPGP trust model
   discussed in [RFC4880].

   The contents of Figure 20 lie inside the encrypted region of the
   Confirm message (Figure 10) or the SASrelay message (Figure 16).

   The total length of all the material in Figure 20, including the key
   server URI, must not exceed 511 32-bit words (2044 octets).  This
   length, in words, is stored in the signature length field in the
   Confirm or SASrelay message containing the signature.  It is
   desirable to avoid UDP fragmentation, so the URI should be kept
   short.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |             Signature Type Block = "PGP " (1 word)            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      |                       OpenPGP signature                       |
      |                       (variable length)                       |
      |                             . . .                             |
      |                                                               |
      +===============================================================+

                    Figure 20: OpenPGP Signature Format
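
   The length limit can be checked as sketched below.  The helper name
   is hypothetical, and the sketch assumes the caller passes the
   complete signature block (Signature Type Block plus signature)
   already aligned to a 32-bit word boundary.

```python
MAX_SIGNATURE_WORDS = 511  # 511 32-bit words = 2044 octets


def signature_length_words(signature_block):
    """Return the value for the signature length field, in words.

    `signature_block` is everything shown in the figure, starting
    with the 4-octet Signature Type Block.  Assumed to be a whole
    number of 32-bit words.
    """
    if len(signature_block) % 4 != 0:
        raise ValueError("signature block is not word-aligned")
    words = len(signature_block) // 4
    if words > MAX_SIGNATURE_WORDS:
        raise ValueError("signature block exceeds 511 words")
    return words


# Hypothetical 96-octet signature body after the type block.
block = b"PGP " + bytes(96)
print(signature_length_words(block))
```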

7.2.2.  ECDSA Signatures with X.509v3 Certs

   If the SAS Signature Type (Section 5.1.7) is "X509", the ECDSA
   signature-related fields are arranged as follows.

   The first field after the 4-octet Signature Type Block is the DER
   encoded X.509v3 certificate (the signed public key) of the ECDSA
   signing key that created the signature.  The format of this
   certificate is specified by the NSA's Suite B Certificate and CRL
   Profile [RFC5759].

   Following the X.509v3 certificate at the next word boundary is the
   ECDSA signature itself.  The size of this field depends on the size
   and type of the public key in the aforementioned certificate.  The
   format of this signature and the algorithms that create it are
   specified by [FIPS-186-3].  The signature is comprised of the ECDSA
   signature output parameters (r, s) in binary form, concatenated, in
   network byte order, with no truncation of leading zeros.  The first
   half of the signature is r and the second half is s.  If ECDSA P-256
   is specified, the signature fills 16 words (64 octets), 32 octets
   each for r and s.  If ECDSA P-384 is specified, the signature fills
   24 words (96 octets), 48 octets each for r and s.
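
   The fixed-width encoding of (r, s) described above can be sketched
   as follows.  The function names are hypothetical; only the layout
   (r followed by s, big-endian, leading zeros preserved) is taken
   from the text.

```python
def encode_ecdsa_signature(r, s, curve_bits):
    """Concatenate r and s as fixed-width big-endian integers.

    Each value occupies curve_bits/8 octets, so leading zeros are
    kept rather than truncated.
    """
    size = (curve_bits + 7) // 8
    return r.to_bytes(size, "big") + s.to_bytes(size, "big")


def decode_ecdsa_signature(signature):
    """Split a signature into (r, s): first half r, second half s."""
    if len(signature) % 2 != 0:
        raise ValueError("signature length must be even")
    half = len(signature) // 2
    r = int.from_bytes(signature[:half], "big")
    s = int.from_bytes(signature[half:], "big")
    return r, s


# Small placeholder values, chosen only to show the zero padding.
sig256 = encode_ecdsa_signature(1, 2, 256)
sig384 = encode_ecdsa_signature(1, 2, 384)
print(len(sig256), len(sig384))  # octets for P-256 and P-384
print(decode_ecdsa_signature(sig256))
```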

   It is up to the recipient of the signature to use information in the
   certificate and path discovery mechanisms to trace the chain back to
   the root CA.  It is recommended that end user certificates issued for
   secure telephony should contain appropriate path discovery links to
   facilitate this.

   Figure 21 shows a certificate and an ECDSA signature.  All this
   material lies inside the encrypted region of the Confirm message
   (Figure 10) or the SASrelay message (Figure 16).

   The total length of all the material in Figure 21, including the
   X.509v3 certificate, must not exceed 511 32-bit words (2044 octets).
   This length, in words, is stored in the signature length field in the
   Confirm or SASrelay message containing the signature.  It is
   desirable to avoid UDP fragmentation, so the certificate material
   should be kept to a much smaller size than this.  End user certs
   issued for this purpose should minimize the size of extraneous
   material such as legal notices.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |             Signature Type Block = "X509" (1 word)            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      |                Signing key's X.509v3 certificate              |
      |                        (variable length)                      |
      |                             . . .                             |
      |                                                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      |                ECDSA P-256 or P-384 signature                 |
      |                    (16 words or 24 words)                     |
      |                             . . .                             |
      |                                                               |
      +===============================================================+

                 Figure 21: X.509v3 ECDSA Signature Format

7.2.3.  Signing the SAS without a PKI

   It's not strictly necessary to use a PKI to back the public key that
   signs the SAS.  For example, it is possible to use a self-signed
   X.509v3 certificate or an OpenPGP key that is not signed by any other
   key.  In this scenario, the same key continuity technique used by SSH
   [RFC4251] may be used.  The public key is cached locally the first
   time it is encountered, and when the same public key is encountered
   again in subsequent sessions, it's deemed not to be a MiTM attack.
   If there is no MiTM attack in the first session, there cannot be a
   MiTM attack in any subsequent session.  This is exactly how SSH does
   it.

   Of course, the security rests on the assumption that the MiTM did not
   attack in the first session.  That assumption seems to work most of
   the time in the SSH world.  The user would have to be warned the
   first time a public key is encountered, just as in SSH.  If possible,
   the SAS should be checked before the user consents to caching the new
   public key.  If the SAS matches in the first session, there is no
   MiTM, and it's safe to cache the public key.  If no SAS comparison is
   possible, it's up to the user, or up to the application, to decide
   whether to take a leap of faith and proceed.  That's how SSH works
   most of the time, because SSH users don't have the chance to verbally
   compare an SAS with anyone.
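
   A minimal sketch of such a key continuity check follows.  The
   cache layout, the use of a SHA-256 fingerprint, and all names are
   assumptions made for illustration; this specification does not
   define how the key is cached or identified.

```python
import hashlib


def check_signing_key(cache, peer_id, public_key, user_consents=False):
    """Apply key continuity to a peer's signing key.

    `cache` maps a peer identifier to the fingerprint of the key seen
    previously.  Returns "match", "mismatch", "cached", or "unknown".
    """
    fingerprint = hashlib.sha256(public_key).hexdigest()
    known = cache.get(peer_id)
    if known is None:
        # First encounter: cache only if the user consents, ideally
        # after a successful SAS comparison.
        if user_consents:
            cache[peer_id] = fingerprint
            return "cached"
        return "unknown"
    return "match" if known == fingerprint else "mismatch"


cache = {}
print(check_signing_key(cache, "bob", b"key-1"))  # warn the user
print(check_signing_key(cache, "bob", b"key-1", user_consents=True))
print(check_signing_key(cache, "bob", b"key-1"))  # same key again
print(check_signing_key(cache, "bob", b"key-2"))  # different key
```

   A "mismatch" result would be treated as a possible MiTM attack,
   and an "unknown" result as the point at which the user is warned.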

   For a phone that is SIP-registered to a PBX, it may be provisioned
   with the public key of the PBX, using a trusted automated
   provisioning process.  Even without a PKI, the phone knows that the
   public key is the correct one, since it was provisioned into the
   phone by a trusted provisioning mechanism.  This makes it easy for
   the phone to access several automated services commonly offered by a
   PBX, such as voice mail or a conference bridge, where there is no
   human at the PBX to do a verbal SAS compare.  The same provisioning
   may be used to preload the pbxsecret into the phone, which is
   discussed in Section 7.3.1.

7.3.  Relaying the SAS through a PBX

   ZRTP is designed to use end-to-end encryption.  The two parties'
   verbal comparison of the short authentication string (SAS) depends on
   this assumption.  But in some PBX environments, such as Asterisk,
   there are usage scenarios that have the PBX acting as a trusted MiTM,
   which means there are two back-to-back ZRTP connections with separate
   session keys and separate SASs.

   For example, imagine that Bob has a ZRTP-enabled VoIP phone that has
   been registered with his company's PBX, so that it is regarded as an
   extension of the PBX.  Alice, whose phone is not associated with the
   PBX, might dial the PBX from the outside, and a ZRTP connection is
   negotiated between her phone and the PBX.  She then selects Bob's
   extension from the company directory in the PBX.  The PBX makes a
   call to Bob's phone (which might be offsite, many miles away from the
   PBX through the Internet) and a separate ZRTP connection is
   negotiated between the PBX and Bob's phone.  The two ZRTP sessions
   have different session keys and different SASs, which would render
   the SAS useless for verbal comparison between Alice and Bob.  They
   might even mistakenly believe that a wiretapper is present because of
   the SAS mismatch, causing undue alarm.

   ZRTP has a mechanism for solving this problem by having the PBX relay
   the Alice/PBX SAS to Bob, sending it through to Bob in a special
   SASrelay message as defined in Section 5.13, which is sent after the
   PBX/Bob ZRTP negotiation is complete, after the Confirm messages.
   Only the PBX, acting as a special trusted MiTM (trusted by the
   recipient of the SASrelay message), will relay the SAS.  The SASrelay
   message protects the relayed SAS from tampering via an included MAC,
   similar to how the Confirm message is protected.  Bob's ZRTP-enabled
   phone accepts the relayed SAS for rendering only because Bob's phone
   had previously been configured to trust the PBX.  This special
   trusted relationship with the PBX can be established through a
   special security enrollment procedure (Section 7.3.1).  After that
   enrollment procedure, the PBX is treated by Bob as a special trusted
   MiTM.  This results in Alice's SAS being rendered to Bob, so that
   Alice and Bob may verbally compare them and thus prevent a MiTM
   attack by any other untrusted MiTM.

   A real "bad-guy" MiTM cannot exploit this protocol feature to mount a
   MiTM attack and relay Alice's SAS to Bob, because Bob has not
   previously carried out a special registration ritual with the bad
   guy.  The relayed SAS would not be rendered by Bob's phone, because
   it did not come from a trusted PBX.  The recognition of the special
   trust relationship is achieved with the prior establishment of a
   special shared secret between Bob and his PBX, which is called
   pbxsecret (defined in Section 7.3.1), also known as the trusted MiTM
   key.

   The trusted MiTM key can be stored in a special cache at the time of
   the initial enrollment (which is carried out only once for Bob's
   phone), and Bob's phone associates this key with the ZID of the PBX,
   while the PBX associates it with the ZID of Bob's phone.  After the
   enrollment has established and stored this trusted MiTM key, it can
   be detected during subsequent ZRTP session negotiations between the
   PBX and Bob's phone, because the PBX and the phone MUST pass the hash
   of the trusted MiTM key in the DH message.  It is then used as part
   of the key agreement to calculate s0.

   The PBX can determine whether it is trusted by the ZRTP user agent of
   a phone.  The presence of a shared trusted MiTM key in the key
   negotiation sequence indicates that the phone has been enrolled with
   this PBX and therefore trusts it to act as a trusted MiTM.  During a
   key agreement with two other ZRTP endpoints, the PBX may have a
   shared trusted MiTM key with both endpoints, only one endpoint, or
   neither endpoint.  If the PBX has a shared trusted MiTM key with
   neither endpoint, the PBX MUST NOT relay the SAS.  If the PBX has a
   shared trusted MiTM key with only one endpoint, the PBX MUST relay
   the SAS from one party to the other by sending an SASrelay message to
   the endpoint with which it shares a trusted MiTM key.  If the PBX has
   a (separate) shared trusted MiTM key with each of the endpoints, the
   PBX MUST relay the SAS to only one endpoint, not both endpoints.

      Note: In the case of a PBX sharing trusted MiTM keys with both
      endpoints, it does not matter which endpoint receives the relayed
      SAS as long as only one endpoint receives it.
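
   The three cases above can be summarized in a small decision
   function.  This is an illustrative sketch with hypothetical names;
   when both endpoints are enrolled, picking the first endpoint is an
   arbitrary choice made here, since either one is acceptable.

```python
def sas_relay_recipient(shares_key_with_a, shares_key_with_b):
    """Decide which endpoint, if any, receives the relayed SAS.

    Each argument states whether the PBX shares a trusted MiTM key
    with that endpoint.  Returns "A", "B", or None.
    """
    if shares_key_with_a and shares_key_with_b:
        # Exactly one endpoint must receive it; "A" is arbitrary.
        return "A"
    if shares_key_with_a:
        return "A"
    if shares_key_with_b:
        return "B"
    # No trusted MiTM key with either endpoint: MUST NOT relay.
    return None


print(sas_relay_recipient(False, True))
print(sas_relay_recipient(False, False))
```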

   The relayed SAS fields contain the SAS rendering type and the
   complete sashash.  The receiver absolutely MUST NOT render the
   relayed SAS if it does not come from a specially trusted ZRTP
   endpoint.  The security of the ZRTP protocol depends on not rendering
   a relayed SAS from an untrusted MiTM, because it may be relayed by a
   MiTM attacker.  See the SASrelay message definition (Figure 16) for
   further details.

   To ensure that both Alice and Bob will use the same SAS rendering
   scheme after the keys are negotiated, the PBX also sends the SASrelay
   message to the unenrolled party (which does not regard this PBX as a
   trusted MiTM), conveying the SAS rendering scheme, but not the
   sashash, which it sets to zero.  The unenrolled party will ignore the
   relayed SAS field, but will use the specified SAS rendering scheme.
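
   The difference between the two SASrelay variants can be sketched as
   follows.  The function and parameter names are hypothetical, the
   32-octet sashash length assumes the 256-bit truncation mentioned in
   Section 7.2, and the real message carries further fields that are
   omitted here.

```python
def sasrelay_fields(sas_rendering, sashash, relay_sas):
    """Build the rendering scheme and sashash fields of an SASrelay.

    If `relay_sas` is False (the recipient is the party that does not
    receive the relayed SAS), the sashash is replaced by zeros of the
    same length while the rendering scheme is still conveyed.
    """
    if relay_sas:
        return sas_rendering, sashash
    return sas_rendering, bytes(len(sashash))


# Placeholder sashash value and a 4-octet rendering type.
sashash = bytes(range(32))
print(sasrelay_fields(b"B32 ", sashash, relay_sas=True)[1] == sashash)
print(sasrelay_fields(b"B32 ", sashash, relay_sas=False)[1] == bytes(32))
```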

   It is possible to route a call through two ZRTP-enabled PBXs using
   this scheme.  Assume Alice is a ZRTP endpoint who trusts her local
   PBX in Atlanta, and Bob is a ZRTP endpoint who trusts his local PBX
   in Biloxi.  The call is routed from Alice to the Atlanta PBX to the
   Biloxi PBX to Bob.  Atlanta would relay the Atlanta-Biloxi SAS to
   Alice because Alice is enrolled with Atlanta, and Biloxi would relay
   the Atlanta-Biloxi SAS to Bob because Bob is enrolled with Biloxi.
   The two PBXs are not assumed to be enrolled with each other in this
   example.  Both Alice and Bob would view and verbally compare the same
   relayed SAS, the Atlanta-Biloxi SAS.  No more than two trusted MiTM
   nodes can be traversed with this relaying scheme.  This behavior is
   extended to two PBXs that are enrolled with each other, via this
   rule: In the case of a PBX sharing trusted MiTM keys with both
   endpoints (i.e., both enrolled with this PBX), one of which is
   another PBX (evidenced by the M-flag) and one of which is a non-PBX,
   the MiTM PBX must always relay the PBX-to-PBX SAS to the non-PBX
   endpoint.
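
   As a non-normative sketch, the PBX-side relay rules described in this
   section (neither, one, or both endpoints enrolled, plus the M-flag
   rule for PBX-to-PBX calls) might be expressed as below.  The function
   and parameter names are hypothetical, and the final tie-break is an
   arbitrary choice, which the text permits as long as only one endpoint
   receives the relayed SAS.

```python
from typing import Optional


def sas_relay_target(shares_key_with_a: bool,
                     shares_key_with_b: bool,
                     a_is_pbx: bool = False,
                     b_is_pbx: bool = False) -> Optional[str]:
    """Return "a", "b", or None: which endpoint gets the relayed sashash."""
    if not shares_key_with_a and not shares_key_with_b:
        return None  # no trusted MiTM key at all: MUST NOT relay the SAS
    if shares_key_with_a != shares_key_with_b:
        # Exactly one endpoint is enrolled: relay to that endpoint.
        return "a" if shares_key_with_a else "b"
    # Both endpoints are enrolled: relay to exactly one of them.
    if a_is_pbx and not b_is_pbx:
        return "b"  # the PBX-to-PBX SAS goes to the non-PBX endpoint
    if b_is_pbx and not a_is_pbx:
        return "a"
    return "a"  # arbitrary pick; only one endpoint may receive it
```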

   A ZRTP endpoint phone that trusts a PBX to act as a trusted MiTM is
   effectively delegating its own policy decisions of algorithm
   negotiation to the PBX.

   When a PBX is between two ZRTP endpoints and is terminating their
   media streams at the PBX, the PBX presents its own ZID to the two
   parties, eclipsing the ZIDs of the two parties from each other.  For
   example, if several different calls are routed through such a PBX to
   several different ZRTP-enabled phones behind the PBX, only a single
   ZID is presented to the calling party in every case -- the ZID of the
   PBX itself.

   The next section describes the initial enrollment procedure that
   establishes a special shared secret, a trusted MiTM key, between a
   PBX and a phone, so that the phone will learn to recognize the PBX as
   a trusted MiTM.

7.3.1.  PBX Enrollment and the PBX Enrollment Flag

   Both the PBX and the endpoint need to know when enrollment is taking
   place.  One way of doing this is to set up an enrollment extension on
   the PBX that a newly configured endpoint would call and establish a
   ZRTP session.  The PBX would then play audio media that offers the
   user an opportunity to configure his phone to trust this PBX as a
   trusted MiTM.  The PBX calculates and stores the trusted MiTM shared
   secret in its cache and associates it with this phone, indexed by the
   phone's ZID.  The trusted MiTM PBX shared secret is derived from
   ZRTPSess via the ZRTP key derivation function (Section 4.5.1) in this
   manner:

      pbxsecret = KDF(ZRTPSess, "Trusted MiTM key", (ZIDi || ZIDr), 256)

   The pbxsecret is calculated for the whole ZRTP session, not for each
   stream within a session, thus the KDF Context field in this case does
   not include any stream-specific nonce material.
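
   The following Python sketch illustrates the pbxsecret derivation.  It
   assumes the KDF layout from Section 4.5.1 is HMAC(KI, i || Label ||
   0x00 || Context || L) with a 32-bit counter i = 1 and a 32-bit length
   L, and it assumes HMAC-SHA-256 as the negotiated MAC.  Both
   assumptions come from outside this section and should be checked
   against Section 4.5.1.  The function names are illustrative only.

```python
import hashlib
import hmac
import struct


def zrtp_kdf(ki: bytes, label: bytes, context: bytes,
             length_bits: int) -> bytes:
    """Assumed KDF layout: HMAC(KI, i || Label || 0x00 || Context || L)."""
    data = (struct.pack(">I", 1) + label + b"\x00" + context
            + struct.pack(">I", length_bits))
    digest = hmac.new(ki, data, hashlib.sha256).digest()
    return digest[: length_bits // 8]  # keep the leftmost L bits


def derive_pbxsecret(zrtp_sess: bytes, zid_i: bytes, zid_r: bytes) -> bytes:
    # The context is (ZIDi || ZIDr) only; no per-stream nonce material.
    return zrtp_kdf(zrtp_sess, b"Trusted MiTM key", zid_i + zid_r, 256)
```

   Because the context carries no stream-specific values, the PBX and
   the phone compute the same pbxsecret from the same ZRTPSess.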

   The PBX signals the enrollment process by setting the PBX Enrollment
   flag (E) in the Confirm message (Figure 10).  This flag is used to
   trigger the ZRTP endpoint's user interface to ask the user whether
   he wants to trust this PBX and to calculate and store the pbxsecret
   in the cache.  If the user decides to respond by activating the
   appropriate user interface element (a menu item, checkbox, or
   button), his ZRTP user agent calculates pbxsecret using the same
   formula, and saves it in a special cache entry associated with this
   PBX.

   During a PBX enrollment, the GoClear features are disabled.  If the
   (E) flag is set by the PBX, the PBX MUST NOT set the Allow Clear (A)
   flag.  Thus, (E) implies not (A).  If a received Confirm message has
   the (E) flag set, the (A) flag MUST be disregarded and treated as
   false.
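
   A minimal sketch of the "(E) implies not (A)" rule, for both the
   sending and the receiving side.  The function names are hypothetical;
   only the flag semantics are taken from the text above.

```python
def check_flags_before_sending(e_flag: bool, a_flag: bool) -> None:
    """Sender side: a PBX that sets (E) MUST NOT also set (A)."""
    if e_flag and a_flag:
        raise ValueError("Allow Clear (A) must not be set with (E)")


def effective_allow_clear(e_flag: bool, a_flag: bool) -> bool:
    """Receiver side: if (E) is set, (A) is disregarded as false."""
    return a_flag and not e_flag
```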

   If the user elects not to enroll, perhaps because he dialed a wrong
   number or does not yet feel comfortable with this PBX, he can simply
   hang up and not save the pbxsecret in his cache.  The PBX will have
   it saved in the PBX cache, but that will do no harm.  The SASrelay
   scheme does not depend on the PBX trusting the phone.  It only
   depends on the phone trusting the PBX.  It is the phone (the user)
   who is at risk if the PBX abuses its MiTM privileges.

   An endpoint MUST NOT store the pbxsecret in the cache without
   explicit user authorization.

   After this enrollment process, the PBX and the ZRTP-enabled phone
   both share a secret that enables the phone to recognize the PBX as a
   trusted MiTM in future calls.  This means that when a future call
   from an outside ZRTP-enabled caller is relayed through the PBX to
   this phone, the phone will render a relayed SAS from the PBX.  If the
   SASrelay message comes from a MiTM that does not know the pbxsecret,
   the phone treats it as a bad-guy MiTM, and refuses to render the
   relayed SAS.  Regardless of which party initiates any future phone
   calls through the PBX, the enrolled phone or the outside phone, the
   PBX will relay the SAS to the enrolled phone.

   This enrollment procedure is designed primarily for phones that are
   already associated with the PBX -- enterprise phones that are
   "behind" the PBX.  It is not intended for the countless outside
   phones that are not registered to this PBX's SIP server.  It should
   be regarded as part of the installation and provisioning process for
   a new phone in the organization.

   There are more streamlined methods to configure ZRTP user agents to
   trust a PBX.  In large scale deployments, the pbxsecret may be
   configured into the phone by an automated provisioning process, which
   may be less burdensome for the users and less error prone.  This
   specification does not require a manual enrollment process.  Any
   process that results in a pbxsecret being computed and shared between
   the PBX and the phone will suffice, as long as the user is made aware
   that this puts the PBX in a position to wiretap the calls.

   It is recommended that a ZRTP client not proceed with the PBX
   enrollment procedure without evidence that a MiTM attack is not
   taking place during the enrollment session.  It would be especially
   damaging if a MiTM tricks the client into enrolling with the wrong
   PBX.  That would enable the malevolent MiTM to wiretap all future
   calls without arousing suspicion, because he would appear to be
   trusted.

8.  Signaling Interactions

   This section discusses how ZRTP, SIP, and SDP work together.

   Note that ZRTP may be implemented without coupling with the SIP
   signaling.  For example, ZRTP can be implemented as a "bump in the
   wire" or as a "bump in the stack" in which RTP sent by the SIP User
   Agent (UA) is converted to ZRTP.  In these cases, the SIP UA will
   have no knowledge of ZRTP.  As a result, the signaling path discovery
   mechanisms introduced in this section should not be treated as
   definitive; they are only a hint.  Despite the absence of an
   indication of ZRTP support in an offer or answer, a ZRTP endpoint
   SHOULD still send Hello messages.

   ZRTP endpoints that have control over the signaling path include a
   ZRTP SDP attribute in their SDP offers and answers.  The ZRTP
   attribute, a=zrtp-hash, is used to indicate support for ZRTP and to
   convey a hash of the Hello message.  The hash is computed according
   to Section 8.1.

   Aside from the advantages described in Section 8.1, there are a
   number of potential uses for this attribute.  It is useful when
   signaling elements would like to know when ZRTP may be utilized by
   endpoints.  It is also useful if endpoints support multiple methods
   of SRTP key management.  The ZRTP attribute can be used to ensure
   that these key management approaches work together instead of against
   each other.  For example, if only one endpoint supports ZRTP, but
   both support another method to key SRTP, then the other method will
   be used instead.  When used in parallel, an SRTP secret carried in an
   a=keymgt [RFC4567] or a=crypto [RFC4568] attribute can be used as a
   shared secret for the srtps computation defined in Section 8.2.  The
   ZRTP attribute is also used to signal to an intermediary ZRTP device
   not to act as a ZRTP endpoint, as discussed in Section 10.

   The a=zrtp-hash attribute can only be included in the SDP at the
   media level since Hello messages sent in different media streams will
   have unique hashes.

   The ABNF for the ZRTP attribute is as follows:

       zrtp-attribute   = "a=zrtp-hash:" zrtp-version zrtp-hash-value

       zrtp-version     = token

       zrtp-hash-value  = 1*(HEXDIG)

   Here's an example of the ZRTP attribute in an initial SDP offer or
   answer used at the media level, using the <allOneLine> convention
   defined in RFC 4475, Section 2.1 [RFC4475]:

     v=0
     o=bob 2890844527 2890844527 IN IP4 client.biloxi.example.com
     s=
     c=IN IP4 client.biloxi.example.com
     t=0 0
     m=audio 3456 RTP/AVP 97 33
     a=rtpmap:97 iLBC/8000
     a=rtpmap:33 no-op/8000
   <allOneLine>
     a=zrtp-hash:1.10 fe30efd02423cb054e50efd0248742ac7a52c8f91bc2
     df881ae642c371ba46df
   </allOneLine>

   A mechanism for carrying this same zrtp-hash information in the
   Jingle signaling protocol is defined in [XEP-0262].
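
   The a=zrtp-hash ABNF and the SDP example above can be exercised with
   a small parser such as the following Python sketch.  It assumes that
   the version token and the hash value are separated by a space, as in
   the example, and it rejects an odd number of hex digits because the
   value is expected to encode whole bytes.  The function name is
   hypothetical, and the \S+ pattern only approximates the "token"
   production.

```python
import re
from typing import Optional, Tuple

_ZRTP_HASH_RE = re.compile(r"^a=zrtp-hash:(\S+) +([0-9A-Fa-f]+)$")


def parse_zrtp_hash(line: str) -> Optional[Tuple[str, bytes]]:
    """Return (zrtp-version, hash bytes), or None if the line is invalid."""
    match = _ZRTP_HASH_RE.match(line.strip())
    if match is None:
        return None
    version, hex_digits = match.groups()
    if len(hex_digits) % 2 != 0:
        return None  # assumed: the hash must be a whole number of bytes
    return version, bytes.fromhex(hex_digits)
```

   The two wrapped lines inside <allOneLine> must be joined into one
   line before parsing.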

   It should be safe to send ZRTP messages even when there is no
   evidence in the signaling that the other party supports it, because
   ZRTP has been designed to be clearly different from RTP, having a
   similar structure to STUN packets sent during an ICE exchange.

8.1.  Binding the Media Stream to the Signaling Layer via the Hello Hash

   Tying the media stream to the signaling channel can help prevent a
   third party from inserting false media packets.  If the signaling
   layer contains information that ties it to the media stream, false
   media streams can be rejected.

   To accomplish this, the entire Hello message (Figure 3) is hashed,
   using the hash algorithm defined in Section 5.1.2.2.  The ZRTP packet
   framing from Figure 2 is not included in the hash.  The resulting
   hash image is made available without truncation to the signaling
   layer, where it is transmitted as a hexadecimal value in the SIP
   channel using the SDP attribute a=zrtp-hash, defined in this
   specification.  Assuming Section 5.1.2.2 defines a 256-bit hash
   length, the a=zrtp-hash field in the SDP attribute carries 64
   hexadecimal digits.  Each media stream (audio or video) will have a
   separate Hello message, and thus will require a separate a=zrtp-hash
   in an SDP attribute.  The recipient of the SIP/SDP message can then
   use this hash image to detect and reject false Hello messages in the
   media channel, as well as identify which media stream is associated
   with this SIP call.  Each Hello message hashes uniquely, because it
   contains the H3 field derived from a random nonce, defined in
   Section 9.
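
   A sketch of how the Hello hash might be computed and checked, assuming
   the hash algorithm of Section 5.1.2.2 is SHA-256 and that the caller
   passes only the Hello message itself, with the packet framing of
   Figure 2 already stripped.  The function names are illustrative.

```python
import hashlib


def hello_hash_hex(hello_message: bytes) -> str:
    """Untruncated hash of the Hello message as 64 hexadecimal digits."""
    return hashlib.sha256(hello_message).hexdigest()


def hello_matches_signaling(hello_message: bytes, sdp_hash_hex: str) -> bool:
    """Compare a received Hello against the a=zrtp-hash value from SDP."""
    # Hex digits are case-insensitive, so normalize before comparing.
    return hello_hash_hex(hello_message) == sdp_hash_hex.lower()
```

   A receiver holding several a=zrtp-hash values, one per media stream,
   could use the same comparison to work out which stream a Hello
   belongs to.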

   The Hello Hash as an SDP attribute is not a REQUIRED feature, because
   some ZRTP endpoints do not have the ability to add SDP attributes to
   the signaling.  For example, if ZRTP is implemented in a hardware
   bump-in-the-wire device, it might only have the ability to modify the
   media packets, not the SIP packets, especially if the SIP packets are
   integrity protected and thus cannot be modified on the wire.  If the
   SDP has no hash image of the ZRTP Hello message, the recipient's ZRTP
   user agent cannot check it, and thus will not be able to reject Hello
   messages based on this hash.

   After the Hello Hash is used to properly identify the ZRTP Hello
   message as belonging to this particular SIP call, the rest of the
   ZRTP message sequence is protected from false packet injection by
   other protection mechanisms, such as the hash chaining mechanism
   defined in Section 9.

   An attacker who controls only the signaling layer, such as an
   uncooperative VoIP service provider, may be able to deny service by
   corrupting the hash of the Hello message in the SDP attribute, which
   would force ZRTP to reject perfectly good Hello messages.  If there
   is reason to believe this is happening, the ZRTP endpoint MAY allow
   Hello messages to be accepted that do not match the hash image in the
   SDP attribute.

   Even in the absence of SIP integrity protection, the inclusion of the
   a=zrtp-hash SDP attribute, when coupled with the hash chaining
   mechanism defined in Section 9, meets the R-ASSOC requirement in the
   Media Security Requirements [RFC5479], which requires:

      ...a mechanism for associating key management messages with both
      the signaling traffic that initiated the session and with
      protected media traffic.  It is useful to associate key management
      messages with call signaling messages, as this allows the SDP
      offerer to avoid performing CPU-consuming operations (e.g.,
      Diffie-Hellman or public key operations) with attackers that have
      not seen the signaling messages.

   The a=zrtp-hash SDP attribute becomes especially useful if the SDP is
   integrity-protected end-to-end by SIP Identity [RFC4474] or better
   still, Dan Wing's SIP Identity using Media Path [SIP-IDENTITY].  This
   leads to an ability to stop MiTM attacks independent of ZRTP's SAS
   mechanism, as explained in Section 8.1.1.

8.1.1.  Integrity-Protected Signaling Enables Integrity-Protected DH
        Exchange

   If and only if the signaling path and the SDP are protected by some
   form of end-to-end integrity protection, such as one of the
   abovementioned mechanisms, so that it can guarantee delivery of the
   a=zrtp-hash attribute without any tampering by a third party, and if
   there is good reason to trust the signaling layer to protect the
   interests of the end user, it is possible to authenticate the key
   exchange and prevent a MiTM attack.  This can be done without
   requiring the users to verbally compare the SAS, by using the hash
   chaining mechanism defined in Section 9 to provide a series of MAC
   keys that protect the entire ZRTP key exchange.  Thus, an end-to-end
   integrity-protected signaling layer automatically enables an
   integrity-protected Diffie-Hellman exchange in ZRTP, which in turn
   means immunity from a MiTM attack.  Here's how it works.

   The integrity-protected SIP SDP contains a hash commitment to the
   entire Hello message.  The Hello message contains H3, which provides
   a hash commitment for the rest of the hash chain H0-H2 (Section 9).
   The Hello message is protected by a 64-bit MAC, keyed by H2.  The
   Commit message is protected by a 64-bit MAC, keyed by H1.  The
   DHPart1 or DHPart2 messages are protected by a 64-bit MAC, keyed by
   H0.  The MAC protecting the Confirm messages is computed by a
   different MAC key derived from the resulting key agreement.  Each
   message's MAC is checked when the MAC key is received in the next
   message.  If a bad MAC is discovered, it MUST be treated as a
   security exception indicating a MiTM attack, perhaps by logging or
   alerting the user, and MUST NOT be treated as a random error.  Random
   errors are already discovered and quietly rejected by bad CRCs
   (Figure 2).

   The Hello message must be assembled before any hash algorithms are
   negotiated, so an implicit predetermined hash algorithm and MAC
   algorithm (both defined in Section 5.1.2.2) must be used.  All of the
   aforementioned MACs keyed by the hashes in the aforementioned hash
   chain MUST be computed with the MAC algorithm defined in
   Section 5.1.2.2, with the MAC truncated to 64 bits.
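
   The deferred MAC check described above can be sketched as follows.
   The sketch assumes the MAC algorithm of Section 5.1.2.2 is
   HMAC-SHA-256 and that the MAC covers the message bytes preceding the
   MAC field.  The exact byte range is defined by the individual message
   formats, not by this sketch, and the function names are hypothetical.

```python
import hashlib
import hmac


def zrtp_mac64(key: bytes, message_without_mac: bytes) -> bytes:
    """HMAC-SHA-256 (assumed) truncated to the leftmost 64 bits."""
    return hmac.new(key, message_without_mac, hashlib.sha256).digest()[:8]


def check_deferred_mac(revealed_key: bytes,
                       earlier_message_without_mac: bytes,
                       earlier_mac: bytes) -> bool:
    """Verify an earlier message once its MAC key arrives.

    For example, the Hello MAC is checked when H2 arrives in the Commit.
    """
    expected = zrtp_mac64(revealed_key, earlier_message_without_mac)
    return hmac.compare_digest(expected, earlier_mac)
```

   A False result corresponds to the bad-MAC case, which the text
   requires to be handled as a security exception rather than as a
   random error.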

   The Media Security Requirements [RFC5479] R-EXISTING requirement can
   be fully met by leveraging a certificate-backed PKI in the signaling
   layer to integrity protect the delivery of the a=zrtp-hash SDP
   attribute.  This would thereby protect ZRTP against a MiTM attack,
   without requiring the user to check the SAS, without adding any
   explicit signatures or signature keys to the ZRTP key exchange and
   without any extra public key operations or extra packets.

   Without an end-to-end integrity-protection mechanism in the signaling
   layer to guarantee delivery of the a=zrtp-hash SDP attribute without
   modification by a third party, these MACs alone will not prevent a
   MiTM attack.  In that case, ZRTP's built-in SAS mechanism will still
   have to be used to authenticate the key exchange.  At the time of
   this writing, very few deployed VoIP clients offer a fully
   implemented SIP stack that provides end-to-end integrity protection
   for the delivery of SDP attributes.  Also, end-to-end signaling
   integrity becomes more problematic if E.164 numbers [RFC3824] are
   used in SIP.  Thus, real-world implementations of ZRTP endpoints will
   continue to depend on SAS authentication for quite some time.  Even
   after there is widespread availability of SIP user agents that offer
   integrity protected delivery of SDP attributes, many users will still
   be faced with the fact that the signaling path may be controlled by
   institutions that do not have the best interests of the end user in
   mind.  In those cases, SAS authentication will remain the gold
   standard for the prudent user.

   Even without SIP integrity protection, the Media Security
   Requirements [RFC5479] R-ACT-ACT requirement can be met by ZRTP's SAS
   mechanism.  Although ZRTP may benefit from an integrity-protected SIP
   layer, it is fortunate that ZRTP's self-contained MiTM defenses do
   not actually require an integrity-protected SIP layer.  ZRTP can
   bypass the delays and problems that SIP integrity faces, such as
   E.164 number usage, and the complexity of building and maintaining a
   PKI.

   In contrast, DTLS-SRTP [RFC5764] appears to depend heavily on end-to-
   end integrity protection in the SIP layer.  Further, DTLS-SRTP must
   bear the additional cost of a signature calculation of its own, in
   addition to the signature calculation the SIP layer uses to achieve
   its integrity protection.  ZRTP needs no signature calculation of its
   own to leverage the signature calculation carried out in the SIP
   layer.

8.2.  Deriving the SRTP Secret (srtps) from the Signaling Layer

   The shared secret calculations defined in Section 4.3 make use of the
   SRTP secret (srtps), if it is provided by the signaling layer.

   It is desirable for only one SRTP key negotiation protocol to be
   used, and that protocol should be ZRTP.  But in the event the
   signaling layer negotiates its own SRTP master key and salt, using
   the SDP Security Descriptions (SDES [RFC4568]) or [RFC4567], it can
   be passed from the signaling to the ZRTP layer and mixed into ZRTP's
   own shared secret calculations, without compromising security by
   creating a dependency on the signaling for media encryption.

   ZRTP computes srtps from the SRTP master key and salt parameters
   provided by the signaling layer in this manner, truncating the result
   to 256 bits:

      srtps = KDF(SRTP master key, "SRTP Secret", (ZIDi || ZIDr ||
                    SRTP master salt), 256)
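
   As an illustration, the srtps formula might be coded as below.  The
   sketch assumes the KDF layout of Section 4.5.1 is HMAC(KI, i || Label
   || 0x00 || Context || L) with a 32-bit counter i = 1 and a 32-bit
   length L, and it assumes HMAC-SHA-256.  Neither assumption is stated
   in this section.  The function names are hypothetical.

```python
import hashlib
import hmac
import struct


def zrtp_kdf(ki: bytes, label: bytes, context: bytes,
             length_bits: int) -> bytes:
    """Assumed KDF layout: HMAC(KI, i || Label || 0x00 || Context || L)."""
    data = (struct.pack(">I", 1) + label + b"\x00" + context
            + struct.pack(">I", length_bits))
    digest = hmac.new(ki, data, hashlib.sha256).digest()
    return digest[: length_bits // 8]  # truncate to 256 bits when L = 256


def derive_srtps(srtp_master_key: bytes, srtp_master_salt: bytes,
                 zid_i: bytes, zid_r: bytes) -> bytes:
    # The context is (ZIDi || ZIDr || SRTP master salt), as in the formula.
    context = zid_i + zid_r + srtp_master_salt
    return zrtp_kdf(srtp_master_key, b"SRTP Secret", context, 256)
```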

   It is expected that the srtps parameter will be rarely computed or
   used in typical ZRTP endpoints, because it is likely and desirable
   that ZRTP will be the sole means of negotiating SRTP keys, needing no
   help from [RFC4568] or [RFC4567].  If srtps is computed, it will be
   stored in the auxiliary shared secret auxsecret, defined in
   Section 4.3 and used in Section 4.3.1.

8.3.  Codec Selection for Secure Media

   Codec selection is negotiated in the signaling layer.  If the
   signaling layer determines that ZRTP is supported by both endpoints,
   this should provide guidance in codec selection to avoid variable
   bitrate (VBR) codecs that leak information.

   When voice is compressed with a VBR codec, the packet lengths vary
   depending on the types of sounds being compressed.  This leaks a lot
   of information about the content even if the packets are encrypted,
   regardless of what encryption protocol is used [Wright1].  It is
   RECOMMENDED that VBR codecs be avoided in encrypted calls.  It is not
   a problem if the codec adapts the bitrate to the available channel
   bandwidth.  The vulnerable codecs are the ones that change their
   bitrate depending on the type of sound being compressed.

   It also appears that voice activity detection (VAD) leaks information
   about the content of the conversation, but to a lesser extent than
   VBR.  This effect can be mitigated by lengthening the VAD hangover
   time by a random amount between 1 and 2 seconds, if this is feasible
   in your application.  Only short bursts of speech would benefit from
   lengthening the VAD hangover time.
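
   A minimal sketch of the mitigation, assuming the application exposes
   its VAD hangover time as a value in seconds.  The function name and
   the choice of random source are assumptions of the example.

```python
import random

# OS-backed randomness, so the added delay is hard to predict.
_rng = random.SystemRandom()


def lengthened_hangover(base_hangover_s: float) -> float:
    """Extend the VAD hangover by a random amount between 1 and 2 s."""
    return base_hangover_s + _rng.uniform(1.0, 2.0)
```

   The random amount would be drawn afresh for each speech burst rather
   than fixed once per call.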

   The security problems of VBR and VAD are addressed in detail by the
   guidelines in [VBR-AUDIO].  It is RECOMMENDED that ZRTP endpoints
   follow these guidelines.

9.  False ZRTP Packet Rejection

   An attacker who is not in the media path may attempt to inject false
   ZRTP protocol packets, possibly to effect a denial-of-service attack
   or to inject his own media stream into the call.  VoIP, by its
   nature, invites various forms of denial-of-service attacks and
   requires protocol features to reject such attacks.  While bogus SRTP
   packets may be easily rejected via the SRTP auth tag field, that can
   only be applied after a key agreement is completed.  During the ZRTP
   key negotiation phase, other false packet rejection mechanisms are
   needed.  One such mechanism is the use of the total_hash in the final
   shared secret calculation, but that can only detect false packets
   after performing the computationally expensive Diffie-Hellman
   calculation.

   A lot of work has been done on the analysis of denial-of-service
   attacks, especially from attackers who are not in the media path.
   Such an attacker might inject false ZRTP packets to force a ZRTP
   endpoint to engage in an endless series of pointless and expensive DH
   calculations.  To detect and reject false packets cheaply and rapidly
   as soon as they are received, ZRTP uses a one-way hash chain, which
   is a series of successive hash images.  Before each session, the
   following values are computed:

      H0 = 256-bit random nonce (different for each party)

      H1 = hash (H0)

      H2 = hash (H1)

      H3 = hash (H2)

   This one-way hash chain MUST use the hash algorithm defined in
   Section 5.1.2.2, truncated to 256 bits.  Each 256-bit hash image is
   the preimage of the next, and the sequence of images is sent in
   reverse order in the ZRTP packet sequence.  The hash image H3 is sent
   in the Hello message, H2 is sent in the Commit message, H1 is sent in
   the DHPart1 or DHPart2 messages, and H0 is sent in the Confirm1 or
   Confirm2 messages.  The initial random H0 nonces that each party
   generates MUST be unpredictable to an attacker and unique within a
   ZRTP session, which thereby forces the derived hash images H1-H3 to
   also be unique and unpredictable.
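
   The hash chain construction can be sketched in a few lines of Python,
   assuming the hash algorithm of Section 5.1.2.2 is SHA-256, whose
   256-bit output needs no further truncation.  The function name is
   illustrative.

```python
import hashlib
import os
from typing import Tuple


def make_hash_chain() -> Tuple[bytes, bytes, bytes, bytes]:
    """Return (H0, H1, H2, H3) for one party in one session."""
    h0 = os.urandom(32)  # 256-bit unpredictable nonce
    h1 = hashlib.sha256(h0).digest()
    h2 = hashlib.sha256(h1).digest()
    h3 = hashlib.sha256(h2).digest()
    return h0, h1, h2, h3
```

   The values are revealed in reverse order: H3 in Hello, H2 in Commit,
   H1 in DHPart1 or DHPart2, and H0 in Confirm1 or Confirm2.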

   The recipient checks if the packet has the correct hash preimage, by
   hashing it and comparing the result with the hash image for the
   preceding packet.  Packets that contain an incorrect hash preimage
   MUST NOT be used by the recipient, but they MAY be processed as
   security exceptions, perhaps by logging or alerting the user.  As
   long as these bogus packets are not used, and correct packets are
   still being received, the protocol SHOULD be allowed to run to
   completion, thereby rendering ineffective this denial-of-service
   attack.

   Note that since H2 is sent in the Commit message, and the initiator
   does not receive a Commit message, the initiator computes the
   responder's missing H2 by hashing the responder's H1.  An analogous
   interpolation is performed by both parties to handle the skipped
   DHPart1 and DHPart2 messages in Preshared (Section 3.1.2) or
   Multistream (Section 3.1.3) modes.
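
   The preimage check, including the interpolation across a skipped
   message, might look like the following sketch.  It assumes SHA-256 as
   the hash of Section 5.1.2.2; the function name and the "steps"
   parameter are hypothetical.

```python
import hashlib
import hmac


def verify_preimage(candidate: bytes, known_image: bytes,
                    steps: int = 1) -> bool:
    """Hash `candidate` `steps` times and compare with a stored image.

    steps=1 is the normal check against the preceding packet.  steps=2
    covers a skipped message, e.g., the initiator checking the
    responder's H1 (from DHPart1) against the H3 seen in Hello.
    """
    value = candidate
    for _ in range(steps):
        value = hashlib.sha256(value).digest()
    return hmac.compare_digest(value, known_image)
```

   A packet for which this check fails is not used, while the exchange
   itself is allowed to continue with correct packets.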

   Because these hash images alone do not protect the rest of the
   contents of the packet they reside in, this scheme assumes the
   attacker cannot modify the packet contents from a legitimate party,
   which is a reasonable assumption for an attacker who is not in the
   media path.  This covers an important range of denial-of-service
   attacks.  For dealing with the remaining set of attacks that involve
   packet modification, other mechanisms are used, such as the
   total_hash in the final shared secret calculation, and the hash
   commitment in the Commit message.

   Hello messages injected by an attacker may be detected and rejected
   by the inclusion of a hash of the Hello message in the signaling, as
   described in Section 8.  This mechanism requires that each Hello
   message be unique, and the inclusion of the H3 hash image meets that
   requirement.

   If and only if an integrity-protected signaling channel is available,
   the MACs that are keyed by this hash chaining scheme can be used to
   authenticate the entire ZRTP key exchange, and thereby prevent a MiTM
   attack, without relying on the users verbally comparing the SAS.  See
   Section 8.1.1 for details.

   Some ZRTP user agents allow the user to manually switch to clear mode
   (via the GoClear message) in the middle of a secure call, and then
   later initiate secure mode again.  Many consumer client products will
   omit this feature, but those that allow it may return to secure mode
   again in the same media stream.  Although the same chain of hash
   images will be reused and thus rendered ineffective the second time,
   no real harm is done because the new SRTP session keys will be
   derived in part from a cached shared secret, which was safely
   protected from the MiTM in the previous DH exchange earlier in the
   same session.

10.  Intermediary ZRTP Devices

   This section discusses the operation of a ZRTP endpoint that is
   actually an intermediary.  For example, consider a device that
   proxies both signaling and media between endpoints.  There are three
   possible ways in which such a device could support ZRTP.

   An intermediary device can act transparently to the ZRTP protocol.
   To do this, a device MUST pass non-RTP protocols multiplexed on the
   same port as RTP (to allow ZRTP and STUN).  This is the RECOMMENDED
   behavior for intermediaries as ZRTP and SRTP are best when done end-
   to-end.

   An intermediary device could implement the ZRTP protocol and act as a
   ZRTP endpoint on behalf of non-ZRTP endpoints behind the intermediary
   device.  The intermediary could determine on a call-by-call basis
   whether the endpoint behind it supports ZRTP based on the presence or
   absence of the ZRTP SDP attribute flag (a=zrtp-hash).  For non-ZRTP
   endpoints, the intermediary device could act as the ZRTP endpoint
   using its own ZID and cache.  This approach SHOULD only be used when
   there is some other security method protecting the confidentiality of
   the media between the intermediary and the inside endpoint, such as
   IPsec or physical security.
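
   One way an intermediary could make the call-by-call determination is
   sketched below.  Because a=zrtp-hash is a media-level attribute, the
   check is made per m= section.  The function name is hypothetical, and
   a real device would use a full SDP parser rather than line matching.

```python
from typing import List


def media_sections_with_zrtp(sdp: str) -> List[bool]:
    """Return one flag per m= section: True if it carries a=zrtp-hash."""
    flags: List[bool] = []
    for raw_line in sdp.splitlines():
        line = raw_line.strip()
        if line.startswith("m="):
            flags.append(False)  # a new media section begins
        elif line.startswith("a=zrtp-hash:") and flags:
            flags[-1] = True     # attribute belongs to the current section
    return flags
```

   As noted in Section 8, the absence of the attribute is only a hint,
   since an inside endpoint may support ZRTP without signaling it.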

   The third mode, which is NOT RECOMMENDED, is for the intermediary
   device to attempt to back-to-back the ZRTP protocol.  The only
   exception to this case is where the intermediary device is a trusted
   element providing services to one of the endpoints -- e.g., a Private
   Branch Exchange or PBX.  In this mode, the intermediary would attempt
   to act as a ZRTP endpoint towards both endpoints of the media
   session.  This approach MUST NOT be used except as described in
   Section 7.3 as it will always result in a detected MiTM attack and
   will generate alarms on both endpoints and likely result in the
   immediate termination of the session.  The PBX MUST use a single ZID
   for all endpoints behind it.

   In cases where centralized media mixing is taking place, the SAS will
   not match when compared by the humans.  This situation can sometimes
   be known in the SIP signaling by the presence of the isfocus feature
   tag [RFC4579].  As a result, when the isfocus feature tag is present,
   the DH exchange can be authenticated by the mechanism defined in
   Section 8.1.1 or by validating signatures (Section 7.2) in the
   Confirm or SASrelay messages.  For example, consider an audio
   conference call with three participants Alice, Bob, and Carol hosted
   on a conference bridge in Dallas.  There will be three ZRTP encrypted
   media streams, one encrypted stream between each participant and
   Dallas.  Each will have a different SAS.  Each participant will be
   able to validate their SAS with the conference bridge by using
   signatures optionally present in the Confirm messages (described in
   Section 7.2).  Or, if the signaling path has end-to-end integrity
   protection, each DH exchange will have automatic MiTM protection by
   using the mechanism in Section 8.1.1.

   SIP feature tags can also be used to detect if a session is
   established with an automaton such as an Interactive Voice Response
   (IVR), voicemail system, or speech recognition system.  The display
   of SAS strings to users should be disabled in these cases.

   It is possible that an intermediary device acting as a ZRTP endpoint
   might still receive ZRTP Hello and other messages from the inside
   endpoint.  This could occur if there is another inline ZRTP device
   that does not include the ZRTP SDP attribute flag.  An intermediary
   acting as a ZRTP endpoint receiving ZRTP Hello and other messages
   from the inside endpoint MUST NOT pass these ZRTP messages.

11.  The ZRTP Disclosure Flag

   There are no back doors defined in the ZRTP protocol specification.
   The designers of ZRTP would like to discourage back doors in ZRTP-
   enabled products.  However, despite the lack of back doors in the
   actual ZRTP protocol, it must be recognized that a ZRTP implementer
   might still deliberately create a rogue ZRTP-enabled product that
   implements a back door outside the scope of the ZRTP protocol.  For
   example, they could create a product that discloses the SRTP session
   key generated using ZRTP out-of-band to a third party.  They may even
   have a legitimate business reason to do this for some customers.

   For example, some environments have a need to monitor or record
   calls, such as stock brokerage houses who want to discourage insider
   trading, or special high-security environments with special needs to
   monitor their own phone calls.  We've all experienced automated
   messages telling us that "This call may be monitored for quality
   assurance".  A ZRTP endpoint in such an environment might
   unilaterally disclose the session key to someone monitoring the call.
   ZRTP-enabled products that perform such out-of-band disclosures of
   the session key can undermine public confidence in the ZRTP protocol,
   unless we do everything we can in the protocol to alert the other
   user that this is happening.

   If one of the parties is using a product that is designed to disclose
   their session key, ZRTP requires them to confess this fact to the
   other party through a protocol message to the other party's ZRTP
   client, which can properly alert that user, perhaps by rendering it
   in a graphical user interface.  The disclosing party does this by
   sending a Disclosure flag (D) in Confirm1 and Confirm2 messages as
   described in Section 5.7.

   Note that the intention here is to have the Disclosure flag identify
   products that are designed to disclose their session keys, not to
   identify which particular calls are compromised on a call-by-call
   basis.  This is an important legal distinction, because most
   government sanctioned wiretap regulations require a VoIP service
   provider to not reveal which particular calls are wiretapped.  But
   there is nothing illegal about revealing that a product is designed
   to be wiretap-friendly.  The ZRTP protocol mandates that such a
   product "out" itself.

   You might be using a ZRTP-enabled product with no back doors, but if
   your own graphical user interface tells you the call is (mostly)
   secure, except that the other party is using a product that is
   designed in such a way that it may have disclosed the session key for
   monitoring purposes, you might ask him what brand of secure telephone
   he is using, and make a mental note not to purchase that brand
   yourself.  If we create a protocol environment that requires such
   back-doored phones to confess their nature, word will spread quickly,
   and the "invisible hand" of the free market will act.  The free
   market has effectively dealt with this in the past.

   Of course, a ZRTP implementer can lie about his product having a back
   door, but the ZRTP standard mandates that ZRTP-compliant products
   MUST adhere to the requirement that a back door be confessed by
   sending the Disclosure flag to the other party.

   There will be inevitable comparisons to Steve Bellovin's 2003 April
   Fools' joke, when he submitted RFC 3514 [RFC3514], which defined the
   "Evil bit" in the IPv4 header, for packets with "evil intent".  But
   we submit that a similar idea can actually have some merit for
   securing VoIP.  Sure, one can always imagine that some implementer
   will not be fazed by the rules and will lie, but they would have lied
   anyway even without the Disclosure flag.  There are good reasons to
   believe that it will improve the overall percentage of
   implementations that at least tell us if they put a back door in
   their products, and may even get some of them to decide not to put in
   a back door at all.  From a civic hygiene perspective, we are better
   off with having the Disclosure flag in the protocol.

   If an endpoint stores or logs SRTP keys or information that can be
   used to reconstruct or recover SRTP keys after they are no longer in
   use (i.e., the session is over), or otherwise discloses or passes
   SRTP keys or information that can be used to reconstruct or recover
   SRTP keys to another application or device, the Disclosure flag D
   MUST be set in the Confirm1 or Confirm2 message.
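
   The rule above can be read as a simple predicate over how a product
   is designed to handle key material.  The following is a minimal
   sketch, not part of the specification; the KeyHandlingPolicy type,
   its field names, and the disclosure_flag function are hypothetical
   names introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KeyHandlingPolicy:
    """Hypothetical description of how a product is designed to treat SRTP keys."""

    # Stores or logs SRTP keys (or data that can recover them) after the
    # keys are no longer in use.
    retains_keys_after_session: bool = False
    # Discloses or passes SRTP keys (or data that can recover them) to
    # another application or device.
    passes_keys_outside_zrtp: bool = False


def disclosure_flag(policy: KeyHandlingPolicy) -> bool:
    """Return True when the Disclosure flag (D) must be set.

    Either condition alone is sufficient, mirroring the "or otherwise"
    wording of the requirement.
    """
    return policy.retains_keys_after_session or policy.passes_keys_outside_zrtp
```

   Because the flag describes how the product is designed rather than
   what happens on an individual call, this sketch evaluates a static
   policy and takes no per-call input.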

11.1.  Guidelines on Proper Implementation of the Disclosure Flag

   Some implementers have asked for guidance on implementing the
   Disclosure flag.  Some people have incorrectly thought that a
   connection secured with ZRTP cannot be used in a call center, with
   voluntary voice recording, or even with a voicemail system.
   Similarly, some potential users of ZRTP have overestimated the
   protection that ZRTP can give them.  These guidelines clarify both
   concerns.

   The ZRTP Disclosure flag only governs the ZRTP/SRTP stream itself.
   It does not govern the underlying RTP media stream, nor the actual
   media itself.  Consequently, a PBX that uses ZRTP may provide
   conference calls, call monitoring, call recording, voicemail, or
   other PBX features and still say that it does not disclose the ZRTP
   key material.  A video system may provide DVR features and still say
   that it does not disclose the ZRTP key material.  The ZRTP Disclosure
   flag, when not set, means only that the ZRTP cryptographic key
   material stays within the bounds of the ZRTP subsystem.

   If an application has a need to disclose the ZRTP cryptographic key
   material, the easiest way to comply with the protocol is to set the
   flag to the proper value.  The next easiest way is to overestimate
   disclosure.  For example, a call center that commonly records calls
   might choose to set the Disclosure flag even though all recording is
   an analog recording of a call (and thus outside the ZRTP scope)
   because it sets an expectation with clients that their calls might be
   recorded.

   Note also that the ZRTP Disclosure flag does not require an
   implementation to preclude hacking or malware.  Malware that leaks
   ZRTP cryptographic key material does not create a liability for the
   implementer from non-compliance with the ZRTP specification.

   A user of ZRTP should note that ZRTP is not a panacea against
   unauthorized recording.  ZRTP does not and cannot protect against an
   untrustworthy partner who holds a microphone up to the speaker.  It
   does not protect against someone else being in the room.  It does not
   protect against analog wiretaps in the phone or in the room.  It does
   not mean your partner has not been hacked with spyware.  It does not
   mean that the software has no flaws.  It means that the ZRTP
   subsystem is not knowingly leaking ZRTP cryptographic key material.

12.  Mapping between ZID and AOR (SIP URI)

   The role of the ZID in the management of the local cache of shared
   secrets is explained in Section 4.9.  A particular ZID is associated
   with a particular ZRTP endpoint, typically a VoIP client.  A single
   SIP URI (also known as an Address-of-Record, or AOR) may be hosted on
   several different soft VoIP clients, desktop phones, and mobile
   handsets, and each of them will have a different ZID.  Further, a
   single VoIP client may have several SIP URIs configured into its
   profiles, but only one ZID.  There is not a one-to-one mapping
   between a ZID and a SIP URI.  A single SIP URI may be associated with
   several ZIDs, and a single ZID may be associated with several SIP
   URIs on the same client.
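
   An application that wants to track this relationship needs a
   many-to-many structure rather than a simple lookup table.  The
   following is a minimal sketch under that assumption; the ZidAorIndex
   class and its method names are hypothetical and are not defined by
   ZRTP.

```python
from collections import defaultdict
from typing import DefaultDict, Set


class ZidAorIndex:
    """Hypothetical many-to-many index between ZIDs and SIP URIs (AORs)."""

    def __init__(self) -> None:
        self._aors_by_zid: DefaultDict[bytes, Set[str]] = defaultdict(set)
        self._zids_by_aor: DefaultDict[str, Set[bytes]] = defaultdict(set)

    def observe(self, zid: bytes, aor: str) -> None:
        """Record that a session with this ZID was seen under this AOR."""
        self._aors_by_zid[zid].add(aor)
        self._zids_by_aor[aor].add(zid)

    def aors_for(self, zid: bytes) -> Set[str]:
        """All AORs seen with one ZID (one client, several profiles)."""
        return set(self._aors_by_zid.get(zid, set()))

    def zids_for(self, aor: str) -> Set[bytes]:
        """All ZIDs seen with one AOR (one SIP URI, several devices)."""
        return set(self._zids_by_aor.get(aor, set()))
```

   Neither direction of the lookup returns a single value, which is
   the point: code that assumes one ZID per SIP URI, or one SIP URI per
   ZID, will mishandle ordinary deployments.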

   Not only that, but ZRTP is independent of which signaling protocol is
   used.  It works equally well with SIP, Jingle, H.323, or any
   proprietary signaling protocol.  Thus, a ZRTP ZID has little to do
   with SIP, per se, which means it has little to do with a SIP URI.

   Even though a ZID is associated with a device, not a human, it is
   often the case that a ZRTP endpoint is controlled mainly by a
   particular human.  For example, it may be a mobile phone.  To get the
   full benefit of the key continuity features, a local cache entry (and
   thus a ZID) should be associated with some sort of name of the remote
   party.  That name could be a human name, or it could be made more
   precise by specifying which ZRTP endpoint he's using.  For example
   "Jon Callas", or "Jon Callas on his iPhone", or "Jon on his iPad", or
   "Alice on her office phone".  These name strings can be stored in the
   local cache, indexed by ZID, and may have been initially provided by
   the local user by hand.  Or the local cache entry may contain a
   pointer to an entry in the local address book.  When a secure session
   is established, if a prior session has established a cache entry, and
   the new session has a matching cache entry indexed by the same ZID,
   and the SAS has been previously verified, the person's name stored in
   that cache entry should be displayed.
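
   The display rule in the preceding paragraph can be sketched as a
   lookup keyed by the remote ZID.  This is an illustration only; the
   CacheEntry type, its fields, and the name_to_display function are
   hypothetical, and a real cache entry also holds the shared secrets
   described in Section 4.9.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class CacheEntry:
    """Hypothetical name-related portion of a local cache entry."""

    display_name: str    # e.g. entered by the local user by hand
    sas_verified: bool   # True once the SAS has been verified


def name_to_display(cache: Dict[bytes, CacheEntry],
                    remote_zid: bytes) -> Optional[str]:
    """Return the stored name for the remote party, or None.

    A name is returned only when a prior session left a cache entry
    for this ZID and the SAS was previously verified.
    """
    entry = cache.get(remote_zid)
    if entry is None or not entry.sas_verified:
        return None
    return entry.display_name
```

   Returning None for an unverified entry keeps the user interface
   from presenting a name that has not yet been confirmed by the SAS
   comparison.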

   If the remote ZID originates from a PBX, the displayed name would be
   the name of that PBX, which might be the name of the company that owns
   that PBX.

   If it is desirable to associate some key material with a particular
   AOR, digital signatures (Section 7.2) may be used, with public key
   certificates that associate the signature key with an AOR.  If more
   than one ZRTP endpoint shares the same AOR, they may all use the same
   signature key and provide the same public key certificate with their
   signatures.
