
RFC 4867

RTP Payload Format and File Storage Format for the Adaptive Multi-Rate (AMR) and Adaptive Multi-Rate Wideband (AMR-WB) Audio Codecs

Pages: 59
Proposed Standard
Obsoletes:  3267

Top   ToC   RFC4867 - Page 15   prevText

4. AMR and AMR-WB RTP Payload Formats

The AMR and AMR-WB payload formats have identical structure, so they are specified together. The only differences are in the types of codec frames contained in the payload. The payload format consists of the RTP header, payload header, and payload data.

4.1. RTP Header Usage

   The format of the RTP header is specified in [8].  This payload
   format uses the fields of the header in a manner consistent with that
   specification.

   The RTP timestamp corresponds to the sampling instant of the first
   sample encoded for the first frame-block in the packet.  The
   timestamp clock frequency is the same as the sampling frequency, so
   the timestamp unit is in samples.

   The duration of one speech frame-block is 20 ms for both AMR and
   AMR-WB.  For AMR, the sampling frequency is 8 kHz, corresponding to
   160 encoded speech samples per frame from each channel.  For AMR-WB,
   the sampling frequency is 16 kHz, corresponding to 320 samples per
   frame from each channel.  Thus, the timestamp is increased by 160 for
   AMR and 320 for AMR-WB for each consecutive frame-block.
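
   The timestamp arithmetic above can be sketched as follows.  This is
   a minimal illustration, not part of the specification; the function
   names are hypothetical, and the 32-bit wrap-around reflects the width
   of the RTP timestamp field defined in [8].

```python
# Sampling rates stated in the text: 8 kHz for AMR, 16 kHz for AMR-WB.
SAMPLE_RATE_HZ = {"AMR": 8000, "AMR-WB": 16000}
FRAME_BLOCK_MS = 20  # duration of one frame-block for both codecs


def timestamp_step(codec: str) -> int:
    """RTP timestamp increment per frame-block, in samples."""
    return SAMPLE_RATE_HZ[codec] * FRAME_BLOCK_MS // 1000


def next_packet_timestamp(ts: int, codec: str, frame_blocks: int) -> int:
    """Timestamp of the packet that follows one carrying `frame_blocks`
    contiguous (non-interleaved) frame-blocks; wraps at 32 bits."""
    return (ts + frame_blocks * timestamp_step(codec)) % (1 << 32)
```

   For example, a sender that packs three AMR frame-blocks per packet
   would advance the timestamp by 480 from one packet to the next.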

   A packet may contain multiple frame-blocks of encoded speech or
   comfort noise parameters.  If interleaving is employed, the frame-
   blocks encapsulated into a payload are picked according to the
   interleaving rules as defined in Section 4.4.1.  Otherwise, each
   packet covers a period of one or more contiguous 20 ms frame-block
   intervals.  In case the data from all the channels for a particular
   frame-block in the period is missing (for example, at a gateway from
   some other transport format), it is possible to indicate that no data
   is present for that frame-block rather than breaking a multi-frame-
   block packet into two, as explained in Section 4.3.2.

   To allow for error resiliency through redundant transmission, the
   periods covered by multiple packets MAY overlap in time.  A receiver
   MUST be prepared to receive any speech frame multiple times, in exact
   duplicates, in different AMR rate modes, or with data present in one
   packet and not present in another.  If multiple versions of the same
   speech frame are received, it is RECOMMENDED that the mode with the
   highest rate be used by the speech decoder.  A given frame MUST NOT
   be encoded as speech in one packet and comfort noise parameters in
   another.
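
   One way a receiver might choose among redundant copies of the same
   frame is sketched below.  It is an illustration only: the function
   name and the (ft, data) tuple layout are hypothetical, and it assumes
   that a higher frame type index among the speech modes means a higher
   bit rate, which is how the frame type tables in [2] and [4] are
   ordered.  Handling of the Q bit is left out.

```python
from typing import List, Optional, Tuple

SPEECH_LOST = 14   # AMR-WB only; no frame data is carried
NO_DATA = 15       # no frame data is carried


def pick_best_copy(copies: List[Tuple[int, bytes]]) -> Optional[Tuple[int, bytes]]:
    """Choose one of several received versions of the same 20 ms frame.

    Each copy is (ft, frame_data).  Copies without data are ignored;
    among the rest, the highest frame type index (assumed to be the
    highest rate) is preferred.  Returns None if no copy carries data.
    """
    with_data = [c for c in copies if c[0] not in (SPEECH_LOST, NO_DATA)]
    if not with_data:
        return None
    return max(with_data, key=lambda c: c[0])
```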

   The payload length is always made an integral number of octets by
   padding with zero bits if necessary.  If additional padding is
   required to bring the payload length to a larger multiple of octets
   or for some other purpose, then the P bit in the RTP header may be
   set and padding appended as specified in [8].

   The RTP header marker bit (M) SHALL be set to 1 if the first frame-
   block carried in the packet contains a speech frame which is the
   first in a talkspurt.  For all other packets the marker bit SHALL be
   set to zero (M=0).

   The assignment of an RTP payload type for this new packet format is
   outside the scope of this document, and will not be specified here.
   It is expected that the RTP profile under which this payload format
   is being used will assign a payload type for this encoding or specify
   that the payload type is to be bound dynamically.

4.2. Payload Structure

   The complete payload consists of a payload header, a payload table of
   contents, and speech data representing one or more speech
   frame-blocks.  The following diagram shows the general payload format
   layout:

   +----------------+-------------------+----------------
   | payload header | table of contents | speech data ...
   +----------------+-------------------+----------------

   Payloads containing more than one speech frame-block are called
   compound payloads.

   The following sections describe the variations taken by the payload
   format depending on whether the AMR session is set up to use the
   bandwidth-efficient mode or octet-aligned mode and any of the
   OPTIONAL functions for robust sorting, interleaving, and frame CRCs.
   Implementations SHOULD support both bandwidth-efficient and
   octet-aligned operation to increase interoperability.

4.3. Bandwidth-Efficient Mode

4.3.1. The Payload Header

   In bandwidth-efficient mode, the payload header simply consists of a
   4-bit codec mode request:

    0 1 2 3
   +-+-+-+-+
   |  CMR  |
   +-+-+-+-+

   CMR (4 bits): Indicates a codec mode request sent to the speech
      encoder at the site of the receiver of this payload.  The value of
      the CMR field is set to the frame type index of the corresponding
      speech mode being requested.  The frame type index may be 0-7 for
      AMR, as defined in Table 1a in [2], or 0-8 for AMR-WB, as defined
      in Table 1a in [4].  CMR value 15 indicates that no mode request
      is present, and other values are for future use.

   The codec mode request received in the CMR field is valid until the
   next codec mode request is received, i.e., a newly received CMR value
   corresponding to a speech mode, or NO_DATA overrides the previously
   received CMR value corresponding to a speech mode or NO_DATA.
   Therefore, if a terminal continuously wishes to receive frames in the
   same mode X, it needs to set CMR=X for all its outbound payloads, and
   if a terminal has no preference in which mode to receive, it SHOULD
   set CMR=15 in all its outbound payloads.

   If receiving a payload with a CMR value that is not a speech mode or
   NO_DATA, the CMR MUST be ignored by the receiver.

   In a multi-channel session, the codec mode request SHOULD be
   interpreted by the receiver of the payload as the desired encoding
   mode for all the channels in the session.

   An IP end-point SHOULD NOT set the codec mode request based on packet
   losses or other congestion indications, for several reasons:

      -  The other end of the IP path may be a gateway to a non-IP
         network (such as a radio link) that needs to set the CMR field
         to optimize performance on that network.

      -  Congestion on the IP network is managed by the IP sender, in
         this case, at the other end of the IP path.  Feedback about
         congestion SHOULD be provided to that IP sender through RTCP or
         other means, and then the sender can choose to avoid congestion
         using the most appropriate mechanism.  That may include
         adjusting the codec mode, but also includes adjusting the level
         of redundancy or number of frames per packet.

   The encoder SHOULD follow a received codec mode request, but MAY
   change to a lower-numbered mode if it so chooses, for example, to
   control congestion.

   The CMR field MUST be set to 15 for packets sent to a multicast
   group.  The encoder in the speech sender SHOULD ignore codec mode
   requests when sending speech to a multicast session but MAY use RTCP
   feedback information as a hint that a codec mode change is needed.

   The codec mode selection MAY be restricted by a session parameter to
   a subset of the available modes.  If so, the requested mode MUST be
   among the signalled subset (see Section 8).  If the received CMR
   value is outside the signalled subset of modes, it MUST be ignored.
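
   The receiver-side CMR rules of this section can be sketched as
   follows.  This is a minimal illustration under stated assumptions:
   the function name, the use of None for "no request in force", and the
   representation of the signalled mode subset as a Python set are all
   hypothetical choices, not part of this payload format.

```python
from typing import Optional, Set

NO_REQUEST = 15                              # CMR=15: no mode request present
MAX_SPEECH_MODE = {"AMR": 7, "AMR-WB": 8}    # frame type index ranges from the text


def effective_cmr(cmr: int, codec: str, current: Optional[int],
                  mode_set: Optional[Set[int]] = None) -> Optional[int]:
    """Return the mode request in force after receiving `cmr`.

    `current` is the previously valid request (None if there is none).
    `mode_set` is the optional subset of modes signalled for the session.
    """
    if not 0 <= cmr <= 15:
        raise ValueError("CMR is a 4-bit field")
    if cmr == NO_REQUEST:
        return None              # the peer has no preference
    if cmr > MAX_SPEECH_MODE[codec]:
        return current           # not a speech mode: ignore the CMR
    if mode_set is not None and cmr not in mode_set:
        return current           # outside the signalled subset: ignore
    return cmr
```

   An encoder would then treat the returned value as the requested mode,
   while remaining free to choose a lower-numbered mode as described
   above.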


4.3.2. The Payload Table of Contents

   The table of contents (ToC) consists of a list of ToC entries, each
   representing a speech frame.

   In bandwidth-efficient mode, a ToC entry takes the following format:

    0 1 2 3 4 5
   +-+-+-+-+-+-+
   |F|  FT   |Q|
   +-+-+-+-+-+-+

   F (1 bit): If set to 1, indicates that this frame is followed by
      another speech frame in this payload; if set to 0, indicates that
      this frame is the last frame in this payload.

   FT (4 bits): Frame type index, indicating either the AMR or AMR-WB
      speech coding mode or comfort noise (SID) mode of the
      corresponding frame carried in this payload.

   The value of FT is defined in Table 1a in [2] for AMR and in Table 1a
   in [4] for AMR-WB.  FT=14 (SPEECH_LOST, only available for AMR-WB)
   and FT=15 (NO_DATA) are used to indicate frames that are either lost
   or not being transmitted in this payload, respectively.

   A NO_DATA (FT=15) frame could mean either that no data for that
   frame has been produced by the speech encoder or that no data for
   that frame is transmitted in the current payload (i.e., valid data
   for that frame could be sent in either an earlier or later packet).

   If receiving a ToC entry with an FT value in the range 9-14 for AMR or
   10-13 for AMR-WB, the whole packet SHOULD be discarded.  This is to
   avoid the loss of data synchronization in the depacketization
   process, which can result in a huge degradation in speech quality.

   Note that packets containing only NO_DATA frames SHOULD NOT be
   transmitted in any payload format configuration, except in the case
   of interleaving.  Also, frame-blocks containing only NO_DATA frames
   at the end of a packet SHOULD NOT be transmitted in any payload
   format configuration, except in the case of interleaving.  The AMR
   SCR/DTX is described in [6] and AMR-WB SCR/DTX in [7].

   The extra comfort noise frame types specified in table 1a in [2]
   (i.e., GSM-EFR CN, IS-641 CN, and PDC-EFR CN) MUST NOT be used in
   this payload format because the standardized AMR codec is only
   required to implement the general AMR SID frame type and not those
   that are native to the incorporated encodings.

   Q (1 bit): Frame quality indicator.  If set to 0, indicates the
      corresponding frame is severely damaged, and the receiver should
      set the RX_TYPE (see [6]) to either SPEECH_BAD or SID_BAD
      depending on the frame type (FT).

   The frame quality indicator is included for interoperability with the
   ATM payload format described in ITU-T I.366.2, the UMTS Iu interface
   [20], as well as other transport formats.  The frame quality
   indicator enables damaged frames to be forwarded to the speech
   decoder for error concealment.  This can improve the speech quality
   more than dropping the damaged frames.  See Section 4.4.2.1 for more
   details.

   For multi-channel sessions, the ToC entries of all frames from a
   frame-block are placed in the ToC in consecutive order as defined in
   Section 4.1 in [12].  When multiple frame-blocks are present in a
   packet in bandwidth-efficient mode, they will be placed in the packet
   in order of their creation time.

   Therefore, with N channels and K speech frame-blocks in a packet,
   there MUST be N*K entries in the ToC, and the first N entries will be
   from the first frame-block, the second N entries will be from the
   second frame-block, and so on.

   The following figure shows an example of a ToC of three entries in a
   single-channel session using bandwidth-efficient mode.

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1|  FT   |Q|1|  FT   |Q|0|  FT   |Q|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Below is an example of how the ToC entries will appear in the ToC of
   a packet carrying three consecutive frame-blocks in a session with
   two channels (L and R).

   +----+----+----+----+----+----+
   | 1L | 1R | 2L | 2R | 3L | 3R |
   +----+----+----+----+----+----+
   |<------->|<------->|<------->|
     Frame-    Frame-    Frame-
     Block 1   Block 2   Block 3

4.3.3. Speech Data

   Speech data of a payload contains zero or more speech frames or
   comfort noise frames, as described in the ToC of the payload.

   Note, for ToC entries with FT=14 or 15, there will be no
   corresponding speech frame present in the speech data.

   Each speech frame represents 20 ms of speech encoded with the mode
   indicated in the FT field of the corresponding ToC entry.  The length
   of the speech frame is implicitly defined by the mode indicated in
   the FT field.  The order and numbering notation of the bits are as
   specified for Interface Format 1 (IF1) in [2] for AMR and [4] for
   AMR-WB.  As specified there, the bits of speech frames have been
   rearranged in order of decreasing sensitivity, while the bits of
   comfort noise frames are in the order produced by the encoder.  The
   resulting bit sequence for a frame of length K bits is denoted d(0),
   d(1), ..., d(K-1).

4.3.4. Algorithm for Forming the Payload

   The complete RTP payload in bandwidth-efficient mode is formed by
   packing, in order, the bits of the payload header, the table of
   contents, and the speech frames (in the order of their corresponding
   ToC entries in the ToC list), followed by 0 to 7 padding bits to
   bring the payload to octet alignment.  Padding bits MUST be set to
   zero and MUST be ignored on reception.  All bits are packed
   contiguously into octets, beginning with the most significant bits of
   the fields and the octets.

   To be precise, the four-bit payload header is packed into the first
   octet of the payload with bit 0 of the payload header in the most
   significant bit of the octet.  The four most significant bits
   (numbered 0-3) of the first ToC entry are packed into the least
   significant bits of the octet, ending with bit 3 in the least
   significant bit.  Packing continues in the second octet with bit 4 of
   the first ToC entry in the most significant bit of the octet.  If
   more than one frame is contained in the payload, then packing
   continues with the second and successive ToC entries.  Bit 0 of the
   first data frame follows immediately after the last ToC bit,
   proceeding through all the bits of the frame in numerical order.
   Bits from any successive frames follow contiguously in numerical
   order for each frame and in consecutive order of the frames.

   If speech data is missing for one or more speech frames within the
   sequence, because of, for example, DTX, a ToC entry with FT set to
   NO_DATA SHALL be included in the ToC for each of the missing frames,
   but no data bits are included in the payload for the missing frame
   (see Section 4.3.5.2 for an example).
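
   The packing procedure can be sketched as follows.  This is a minimal
   illustration rather than a reference implementation: the function
   name and the (ft, q, bits) tuple layout are hypothetical, and each
   frame's bits are assumed to be supplied as a sequence of 0/1 integers
   already in the order d(0), d(1), ..., d(K-1).

```python
from typing import List, Sequence, Tuple


def pack_bandwidth_efficient(cmr: int,
                             frames: Sequence[Tuple[int, int, Sequence[int]]]) -> bytes:
    """Pack (ft, q, bits) frames into a bandwidth-efficient payload.

    `bits` must be empty for frames without data (FT=14 or 15).
    """
    if not frames:
        raise ValueError("at least one ToC entry is required")
    # Payload header: 4-bit CMR, bit 0 (most significant) first.
    out: List[int] = [(cmr >> s) & 1 for s in (3, 2, 1, 0)]
    # Table of contents: F, FT, Q for every frame; F=0 marks the last entry.
    for i, (ft, q, _) in enumerate(frames):
        out.append(1 if i < len(frames) - 1 else 0)
        out.extend((ft >> s) & 1 for s in (3, 2, 1, 0))
        out.append(q & 1)
    # Speech data: frames in ToC order, bits in numerical order.
    for _, _, bits in frames:
        out.extend(bits)
    # 0 to 7 zero padding bits bring the payload to octet alignment.
    out.extend([0] * (-len(out) % 8))
    packed = bytearray()
    for i in range(0, len(out), 8):
        octet = 0
        for bit in out[i:i + 8]:
            octet = (octet << 1) | bit
        packed.append(octet)
    return bytes(packed)
```

   Packing a single AMR 7.4 kbps frame (FT=4, 148 bits) with CMR=15 and
   Q=1 this way yields 20 octets, the first of which holds the CMR and
   the first four bits of the ToC entry.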

4.3.5. Payload Examples

4.3.5.1. Single-Channel Payload Carrying a Single Frame

   The following diagram shows a bandwidth-efficient AMR payload from a
   single-channel session carrying a single speech frame-block.

   In the payload, no specific mode is requested (CMR=15), the speech
   frame is not damaged at the IP origin (Q=1), and the coding mode is
   AMR 7.4 kbps (FT=4).  The encoded speech bits, d(0) to d(147), are
   arranged in descending sensitivity order according to [2].  Finally,
   two zero bits (P) are added to the end as padding to make the payload
   octet aligned.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | CMR=15|0| FT=4  |1|d(0)                                       |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                     d(147)|P|P|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.3.5.2. Single-Channel Payload Carrying Multiple Frames

   The following diagram shows a single-channel, bandwidth-efficient
   compound AMR-WB payload that contains four frames, of which one has
   no speech data.  The first frame is a speech frame at 6.6 kbps mode
   (FT=0) that is composed of speech bits d(0) to d(131).  The second
   frame is an AMR-WB SID frame (FT=9), consisting of bits g(0) to
   g(39).  The third frame is a NO_DATA frame and does not carry any
   speech information; it is represented in the payload only by its ToC
   entry.  The fourth frame in the payload is a speech frame at 8.85
   kbps mode (FT=1); it consists of speech bits h(0) to h(176).

   As shown below, the payload carries a mode request for the encoder on
   the receiver's side to change its future coding mode to AMR-WB 8.85
   kbps (CMR=1).  None of the frames are damaged at IP origin (Q=1).
   The encoded speech and SID bits, d(0) to d(131), g(0) to g(39), and
   h(0) to h(176), are arranged in the payload in descending sensitivity
   order according to [4].  (Note, no speech bits are present for the
   third frame.)

   Finally, seven zero bits are padded to the end to make the payload
   octet aligned.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | CMR=1 |1| FT=0  |1|1| FT=9  |1|1| FT=15 |1|0| FT=1  |1|d(0)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                         d(131)|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |g(0)                                                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |          g(39)|h(0)                                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                           h(176)|P|P|P|P|P|P|P|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

4.3.5.3. Multi-Channel Payload Carrying Multiple Frames

   The following diagram shows a two-channel payload carrying 3
   frame-blocks, i.e., the payload will contain 6 speech frames.

   In the payload, all speech frames use the same mode, 7.4 kbps
   (FT=4), and are not damaged at IP origin.  The CMR is set to 15,
   i.e., no specific mode is requested.  The two channels are defined as
   left (L) and right (R) in that order.  The encoded speech bits are
   designated dXY(0)..dXY(K-1), where X = block number, Y = channel, and
   K is the number of speech bits for that mode.  For example, for
   frame-block 1 of the left channel, the encoded bits are designated
   d1L(0) to d1L(147).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | CMR=15|1|1L FT=4|1|1|1R FT=4|1|1|2L FT=4|1|1|2R FT=4|1|1|3L FT|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |4|1|0|3R FT=4|1|d1L(0)                                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                               |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                               d1L(147)|d1R(0) |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   : ...                                                           :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                       d1R(147)|d2L(0)                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   : ...                                                           :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |d2L(147|d2R(0)                                                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   : ...                                                           :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                       d2R(147)|d3L(0)         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   : ...                                                           :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |               d3L(147)|d3R(0)                                 |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   : ...                                                           :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                                                       d3R(147)|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
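
   The sizes of the three example payloads in Section 4.3.5 can be
   checked with a short calculation.  The sketch below is illustrative:
   the names are hypothetical, and the frame-size table contains only
   the bit counts quoted in these examples, not the complete tables
   from [2] and [4].

```python
from typing import Dict, Sequence, Tuple

# Speech/SID bits per frame, limited to the values used in the examples.
FRAME_BITS: Dict[Tuple[str, int], int] = {
    ("AMR", 4): 148,       # AMR 7.4 kbps: d(0)..d(147)
    ("AMR-WB", 0): 132,    # AMR-WB 6.6 kbps: d(0)..d(131)
    ("AMR-WB", 1): 177,    # AMR-WB 8.85 kbps: h(0)..h(176)
    ("AMR-WB", 9): 40,     # AMR-WB SID: g(0)..g(39)
}
NO_FRAME_DATA = (14, 15)   # SPEECH_LOST and NO_DATA carry no data bits


def be_payload_size(codec: str, fts: Sequence[int]) -> Tuple[int, int]:
    """Return (octets, padding bits) of a bandwidth-efficient payload."""
    bits = 4 + 6 * len(fts)                      # CMR plus one ToC entry per frame
    bits += sum(0 if ft in NO_FRAME_DATA else FRAME_BITS[(codec, ft)]
                for ft in fts)
    padding = -bits % 8                          # 0 to 7 bits to reach alignment
    return (bits + padding) // 8, padding
```

   For the three examples this gives 20 octets with 2 padding bits, 48
   octets with 7 padding bits, and 116 octets with no padding, which
   matches the diagrams.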

4.4. Octet-Aligned Mode

4.4.1. The Payload Header

   In octet-aligned mode, the payload header consists of a 4-bit CMR, 4
   reserved bits, and optionally, an 8-bit interleaving header, as shown
   below:

    0                   1
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5
   +-+-+-+-+-+-+-+-+- - - - - - - -
   |  CMR  |R|R|R|R|  ILL  |  ILP  |
   +-+-+-+-+-+-+-+-+- - - - - - - -

   CMR (4 bits): same as defined in Section 4.3.1.

   R: is a reserved bit that MUST be set to zero.  All R bits MUST be
      ignored by the receiver.

   ILL (4 bits, unsigned integer): This is an OPTIONAL field that is
      present only if interleaving is signalled out-of-band for the
      session.  ILL=L indicates to the receiver that the interleaving
      length is L+1, in number of frame-blocks.

   ILP (4 bits, unsigned integer): This is an OPTIONAL field that is
      present only if interleaving is signalled.  ILP MUST take a value
      between 0 and ILL, inclusive, indicating the interleaving index
      for frame-blocks in this payload in the interleaving group.  If
      the value of ILP is found greater than ILL, the payload SHOULD be
      discarded.

   ILL and ILP fields MUST be present in each packet in a session if
   interleaving is signalled for the session.  Interleaving MUST be
   performed on a frame-block basis (i.e., NOT on a frame basis) in a
   multi-channel session.

   The following example illustrates the arrangement of speech
   frame-blocks in an interleaving group during an interleaving session.
   Here we assume ILL=L for the interleaving group that starts at speech
   frame-block n.  We also assume that the first payload packet of the
   interleaving group is s, and the number of speech frame-blocks
   carried in each payload is N.  Then we will have:

   Payload s (the first packet of this interleaving group):
      ILL=L, ILP=0,
      frame-blocks: n, n+(L+1), n+2*(L+1), ..., n+(N-1)*(L+1)

   Payload s+1 (the second packet of this interleaving group):
      ILL=L, ILP=1,
      frame-blocks: n+1, n+1+(L+1), n+1+2*(L+1), ..., n+1+(N-1)*(L+1)
      ...

   Payload s+L (the last packet of this interleaving group):
      ILL=L, ILP=L,
      frame-blocks: n+L, n+L+(L+1), n+L+2*(L+1), ..., n+L+(N-1)*(L+1)

   The next interleaving group will start at frame-block n+N*(L+1).

   There will be no interleaving effect unless the number of frame-
   blocks per packet (N) is at least 2.  Moreover, the number of frame-
   blocks per payload (N) and the value of ILL MUST NOT be changed
   inside an interleaving group.  In other words, all payloads in an
   interleaving group MUST have the same ILL and MUST contain the same
   number of speech frame-blocks.

   The sender of the payload MUST only apply interleaving if the
   receiver has signalled its use through out-of-band means.  Since
   interleaving will increase buffering requirements at the receiver,
   the receiver uses media type parameter "interleaving=I" to set the
   maximum number of frame-blocks allowed in an interleaving group to I.

   When performing interleaving, the sender MUST use a proper number of
   frame-blocks per payload (N) and ILL so that the resulting size of an
   interleaving group is less than or equal to I, that is, N*(L+1)<=I.
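
   The interleaving pattern and the group-size constraint can be
   sketched as follows.  The function and parameter names are
   hypothetical; `max_group` stands for the value I of the
   "interleaving=I" media type parameter.

```python
from typing import List


def interleaved_frame_blocks(n: int, ill: int, ilp: int,
                             blocks_per_payload: int) -> List[int]:
    """Frame-block numbers carried by the payload with index ILP in the
    interleaving group that starts at frame-block n."""
    if not 0 <= ill <= 15:
        raise ValueError("ILL is a 4-bit field")
    if not 0 <= ilp <= ill:
        raise ValueError("ILP greater than ILL; the payload should be discarded")
    step = ill + 1                     # interleaving length in frame-blocks
    return [n + ilp + k * step for k in range(blocks_per_payload)]


def group_fits(ill: int, blocks_per_payload: int, max_group: int) -> bool:
    """Check the sender constraint N*(L+1) <= I."""
    return blocks_per_payload * (ill + 1) <= max_group
```

   With n=1, ILL=2, and three frame-blocks per payload, the three
   payloads of a group carry frame-blocks 1, 4, 7, then 2, 5, 8, then
   3, 6, 9, and the group fits only if the receiver signalled I of at
   least 9.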

4.4.2. The Payload Table of Contents and Frame CRCs

   The table of contents (ToC) in octet-aligned mode consists of a list
   of ToC entries where each entry corresponds to a speech frame carried
   in the payload and, optionally, a list of speech frame CRCs.  That
   is, the ToC is as follows:

   +---------------------+
   | list of ToC entries |
   +---------------------+
   | list of frame CRCs  | (optional)
    - - - - - - - - - - -

   Note, for ToC entries with FT=14 or 15, there will be no
   corresponding speech frame or frame CRC present in the payload.

   The list of ToC entries is organized in the same way as described for
   bandwidth-efficient mode in 4.3.2, with the following exception:
   when interleaving is used, the frame-blocks in the ToC will almost
   never be placed consecutively in time.  Instead, the presence and
   order of the frame-blocks in a packet will follow the pattern
   described in 4.4.1.

   The following example shows the ToC of three consecutive packets,
   each carrying three frame-blocks, in an interleaved two-channel
   session.  Here, the two channels are left (L) and right (R) with L
   coming before R, and the interleaving length is 3 (i.e., ILL=2).
   This results in the interleaving group size of 9 frame-blocks.


   Packet #1
   ---------

   ILL=2, ILP=0:
   +----+----+----+----+----+----+
   | 1L | 1R | 4L | 4R | 7L | 7R |
   +----+----+----+----+----+----+
   |<------->|<------->|<------->|
     Frame-    Frame-    Frame-
     Block 1   Block 4   Block 7

   Packet #2
   ---------

   ILL=2, ILP=1:
   +----+----+----+----+----+----+
   | 2L | 2R | 5L | 5R | 8L | 8R |
   +----+----+----+----+----+----+
   |<------->|<------->|<------->|
     Frame-    Frame-    Frame-
     Block 2   Block 5   Block 8

   Packet #3
   ---------

   ILL=2, ILP=2:
   +----+----+----+----+----+----+
   | 3L | 3R | 6L | 6R | 9L | 9R |
   +----+----+----+----+----+----+
   |<------->|<------->|<------->|
     Frame-    Frame-    Frame-
     Block 3   Block 6   Block 9

   A ToC entry takes the following format in octet-aligned mode:

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |F|  FT   |Q|P|P|
   +-+-+-+-+-+-+-+-+

   F (1 bit): see definition in Section 4.3.2.

   FT (4 bits, unsigned integer): see definition in Section 4.3.2.

   Q (1 bit): see definition in Section 4.3.2.

   P bits: padding bits, MUST be set to zero, and MUST be ignored on
           reception.
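
   Building and parsing octet-aligned ToC entries can be sketched as
   follows.  This is an illustration with hypothetical function names;
   validation of the FT value (see Section 4.3.2) is omitted for
   brevity.

```python
from typing import List, Tuple


def build_oa_toc_entry(f: int, ft: int, q: int) -> int:
    """Encode one ToC octet: F (1 bit), FT (4 bits), Q (1 bit), two zero P bits."""
    return ((f & 1) << 7) | ((ft & 0x0F) << 3) | ((q & 1) << 2)


def parse_oa_toc(data: bytes, offset: int = 0) -> Tuple[List[Tuple[int, int]], int]:
    """Parse one-octet ToC entries starting at `offset`.

    Returns ([(ft, q), ...], offset of the octet following the ToC list).
    """
    entries = []
    while True:
        octet = data[offset]
        offset += 1
        ft = (octet >> 3) & 0x0F
        q = (octet >> 2) & 0x01
        entries.append((ft, q))        # the two P bits (octet & 0x03) are ignored
        if octet >> 7 == 0:            # F=0: last entry in the list
            return entries, offset
```

   The offset returned by the parser points at the first frame CRC when
   CRCs are signalled for the session, and at the speech data otherwise.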

   The list of CRCs is OPTIONAL.  It only exists if the use of CRC is
   signalled out-of-band for the session.  When present, each CRC in the
   list is 8 bits long and corresponds to a speech frame (NOT a frame-
   block) carried in the payload.  Calculation and use of the CRC is
   specified in the next section.

4.4.2.1. Use of Frame CRC for UED over IP

   The general concept of UED/UEP over IP is discussed in Section 3.6.
   This section provides more details on how to use the frame CRC in the
   octet-aligned payload header together with a partial transport layer
   checksum to achieve UED.

   To achieve UED, one SHOULD use a transport layer checksum (for
   example, the one defined in UDP-Lite [19]) to protect the IP,
   transport protocol (e.g., UDP-Lite), and RTP headers, as well as the
   payload header and the table of contents in the payload.  The frame
   CRC, when used, MUST be calculated only over all class A bits in the
   AMR or AMR-WB frame.  Class B and C bits in the AMR or AMR-WB frame
   MUST NOT be included in the CRC calculation and SHOULD NOT be covered
   by the transport checksum.

   Note, the number of class A bits for various coding modes in AMR
   codec is specified as informative in [2] and is therefore copied into
   Table 1 in Section 3.6 to make it normative for this payload format.
   The number of class A bits for various coding modes in AMR-WB codec
   is specified as normative in Table 2 in [4], and the SID frame (FT=9)
   has 40 class A bits.  These definitions of class A bits MUST be used
   for this payload format.

   If the transport layer checksum or link layer checksum detects any
   errors within the protected (sensitive) part, it is assumed that the
   complete packet will be discarded as defined by UDP-Lite [19].

   The receiver of the payload SHOULD examine the data integrity of the
   received class A bits by re-calculating the CRC over the received
   class A bits and comparing the result to the value found in the
   received payload header.  If the two values mismatch, the receiver
   SHALL consider the class A bits in the received frame damaged and
   MUST clear the Q flag of the frame (i.e., set it to 0).  This will
   subsequently cause the frame to be marked as SPEECH_BAD, if the FT of
   the frame is 0..7 for AMR or 0..8 for AMR-WB, or SID_BAD if the FT of
   the frame is 8 for AMR or 9 for AMR-WB, before it is passed to the
   speech decoder.  See [6] and [7] for more details.

   The following example shows an octet-aligned ToC with a CRC list for
   a payload containing 3 speech frames from a single-channel session
   (assuming none of the FTs is equal to 14 or 15):

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1|  FT#1 |Q|P|P|1|  FT#2 |Q|P|P|0|  FT#3 |Q|P|P|     CRC#1     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |     CRC#2     |     CRC#3     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Each of the CRCs takes 8 bits

     0   1   2   3   4   5   6   7
   +---+---+---+---+---+---+---+---+
   | c0| c1| c2| c3| c4| c5| c6| c7|
   +---+---+---+---+---+---+---+---+
   (MSB)                       (LSB)

   and is calculated by the cyclic generator polynomial,

     C(x) = 1 + x^2 + x^3 + x^4 + x^8

   where ^ is the exponentiation operator.

   In binary form, the polynomial appears as follows: 101110001
   (MSB..LSB).

   The actual calculation of the CRC is made as follows:  First, an
   8-bit CRC register is reset to zero: 00000000.  For each bit over
   which the CRC shall be calculated, an XOR operation is made between
   the rightmost (LSB) bit of the CRC register and the bit.  The CRC
   register is then right-shifted one step (each bit's significance is
   reduced by one), inputting a "0" as the leftmost bit (MSB).  If the
   result of the XOR operation mentioned above is a "1", then "10111000"
   is bit-wise XOR-ed into the CRC register.  This operation is repeated
   for each bit that the CRC should cover.  In this case, the first bit
   would be d(0) of the speech frame that the CRC should cover.
   When the last bit (e.g., d(54) for AMR 5.9 according to Table 1 in
   Section 3.6) has been used in this CRC calculation, the contents in
   CRC register should simply be copied to the corresponding field in
   the list of CRCs.

   Fast calculation of the CRC on a general-purpose CPU is possible
   using a table-driven algorithm.
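
As a rough illustration of the register procedure described above, the following Python sketch processes one covered bit at a time. The function name amr_frame_crc and the representation of the covered bits as a sequence of 0/1 integers are assumptions made for this example; neither is defined by this specification. The returned integer holds the register contents, with the leftmost (MSB) register bit as bit 7.

```python
def amr_frame_crc(bits):
    """Bit-serial CRC following the register steps in the text.

    `bits` is the sequence of covered bits (d(0) first), each 0 or 1.
    This input representation is an assumption of the sketch.
    """
    crc = 0  # 8-bit CRC register reset to 00000000
    for bit in bits:
        # XOR between the rightmost (LSB) register bit and the input bit
        feedback = (crc & 1) ^ (bit & 1)
        # Right-shift one step, inputting a 0 as the leftmost bit
        crc >>= 1
        # If the XOR result was 1, XOR "10111000" into the register
        if feedback:
            crc ^= 0b10111000
    return crc


if __name__ == "__main__":
    print(format(amr_frame_crc([1]), "08b"))
```

With a single covered bit equal to 1, the register is left holding 10111000; a run of zero bits leaves it at 00000000.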

4.4.3. Speech Data

In octet-aligned mode, speech data is carried in a similar way to that in the bandwidth-efficient mode as discussed in Section 4.3.3, with the following exceptions:

- The last octet of each speech frame MUST be padded with zero bits at the end if all bits in the octet are not used. The padding bits MUST be ignored on reception. In other words, each speech frame MUST be octet-aligned.

- When multiple speech frames are present in the speech data (i.e., compound payload), the speech frames are arranged either one whole frame after another as usual, or with the octets of all frames interleaved together at the octet level, depending on the media type parameters negotiated for the payload type. Since the bits within each frame are ordered with the most error-sensitive bits first, interleaving the octets collects those sensitive bits from all frames to be nearer the beginning of the packet. This is called "robust sorting order" which allows the application of UED (such as UDP-Lite [19]) or UEP (such as the ULP [22]) mechanisms to the payload data. The details of assembling the payload are given in the next section.

The use of robust sorting order for a payload type MUST be agreed via out-of-band means. Section 8 specifies a media type parameter for this purpose.

Note, robust sorting order MUST only be performed on the frame level and thus is independent of interleaving, which is at the frame-block level, as described in Section 4.4.1. In other words, robust sorting can be applied to either non-interleaved or interleaved payload types.

4.4.4. Methods for Forming the Payload

Two different packetization methods, namely, normal order and robust sorting order, exist for forming a payload in octet-aligned mode. In both cases, the payload header and table of contents are packed into the payload the same way; the difference is in the packing of the speech frames.

The payload begins with the payload header of one octet, or two octets if frame interleaving is selected. The payload header is followed by the table of contents consisting of a list of one-octet ToC entries.

If frame CRCs are to be included, they follow the table of contents with one 8-bit CRC filling each octet. Note that if a given frame has a ToC entry with FT=14 or 15, there will be no CRC present.

The speech data follows the table of contents, or the CRCs if present.

For packetization in the normal order, all of the octets comprising a speech frame are appended to the payload as a unit. The speech frames are packed in the same order as their corresponding ToC entries are arranged in the ToC list, with the exception that if a given frame has a ToC entry with FT=14 or 15, there will be no data octets present for that frame.

For packetization in robust sorting order, the octets of all speech frames are interleaved together at the octet level. That is, the data portion of the payload begins with the first octet of the first frame, followed by the first octet of the second frame, then the first octet of the third frame, and so on. After the first octet of the last frame has been appended, the cycle repeats with the second octet of each frame. The process continues for as many octets as are present in the longest frame. If the frames are not all the same octet length, a shorter frame is skipped once all octets in it have been appended. The order of the frames in the cycle will be sequential if frame interleaving is not in use, or according to the interleave pattern specified in the payload header if frame interleaving is in use. Note that if a given frame has a ToC entry with FT=14 or 15, there will be no data octets present for that frame, so it is skipped in the robust sorting cycle.

The UED and/or UEP is RECOMMENDED to cover at least the RTP header, payload header, table of contents, and class A bits of a sorted payload. Exactly how many octets need to be covered depends on the network and application. If CRCs are used together with robust sorting, only the RTP header, the payload header, and the ToC SHOULD be covered by UED/UEP. The means for communicating the number of octets to be covered to other layers performing UED/UEP is beyond the scope of this specification.
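
The octet-level cycle used for robust sorting order can be sketched in Python as follows. This is a minimal illustration, not a reference implementation: the function names robust_sort and robust_unsort are hypothetical, and each frame is assumed to be supplied as an already octet-aligned bytes object in ToC order, with a frame of FT=14 or 15 represented by an empty bytes object.

```python
def robust_sort(frames):
    """Interleave the octets of all frames at the octet level.

    `frames` is a list of bytes objects in ToC order (an assumption
    of this sketch); empty frames are skipped in every cycle.
    """
    longest = max((len(f) for f in frames), default=0)
    out = bytearray()
    for i in range(longest):          # one cycle per octet position
        for frame in frames:
            if i < len(frame):        # shorter frames drop out of the cycle
                out.append(frame[i])
    return bytes(out)


def robust_unsort(data, lengths):
    """Undo robust_sort, given each frame's octet length in ToC order."""
    frames = [bytearray() for _ in lengths]
    pos = 0
    for i in range(max(lengths, default=0)):
        for k, n in enumerate(lengths):
            if i < n:
                frames[k].append(data[pos])
                pos += 1
    return [bytes(f) for f in frames]


if __name__ == "__main__":
    sorted_data = robust_sort([b"\x01\x02\x03", b"\x11", b"", b"\x21\x22"])
    print(sorted_data.hex())
```

A receiver would derive the per-frame octet lengths from the FT fields in the ToC before reversing the sort.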

4.4.5. Payload Examples

4.4.5.1. Basic Single-Channel Payload Carrying Multiple Frames
The following diagram shows an octet aligned payload from a single channel payload type that carries two AMR frames of 7.95 kbps coding mode (FT=5). In the payload, a codec mode request is sent (CMR=6), requesting the encoder at the receiver's side to use AMR 10.2 kbps coding mode. No frame CRC, interleaving, or robust sorting is in use.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | CMR=6 |R|R|R|R|1|FT#1=5 |Q|P|P|0|FT#2=5 |Q|P|P|   f1(0..7)    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   f1(8..15)   |  f1(16..23)   |  ....                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                         ...                                   :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         ...   |f1(152..158) |P|   f2(0..7)    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   f2(8..15)   |  f2(16..23)   |  ....                         |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   :                         ...                                   :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                         ...   |f2(152..158) |P|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Note, in the above example, the last octet in both speech frames is padded with one zero bit to make it octet-aligned.

4.4.5.2. Two-Channel Payload with CRC, Interleaving, and Robust Sorting
This example shows an octet aligned payload from a two-channel payload type. Two frame-blocks, each containing two speech frames of 7.95 kbps coding mode (FT=5), are carried in this payload. The two channels are left (L) and right (R) with L coming before R. In the payload, a codec mode request is also sent (CMR=6), requesting the encoder at the receiver's side to use AMR 10.2 kbps coding mode. Moreover, frame CRC, robust sorting, and frame-block interleaving are all enabled for the payload type. The interleaving length is 2 (ILL=1), and this payload is the first one in an interleaving group (ILP=0).

   The first two frames in the payload are the L and R channel speech
   frames of frame-block #1, consisting of bits f1L(0..158) and
   f1R(0..158), respectively.  The next two frames are the L and R
   channel frames of frame-block #3, consisting of bits f3L(0..158) and
   f3R(0..158), respectively, due to interleaving.  For each of the four
   speech frames, a CRC is calculated as CRC1L(0..7), CRC1R(0..7),
   CRC3L(0..7), and CRC3R(0..7), respectively.  Finally, the payload is
   robust sorted.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | CMR=6 |R|R|R|R| ILL=1 | ILP=0 |1|FT#1L=5|Q|P|P|1|FT#1R=5|Q|P|P|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |1|FT#3L=5|Q|P|P|0|FT#3R=5|Q|P|P|      CRC1L    |      CRC1R    |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |      CRC3L    |      CRC3R    |   f1L(0..7)   |   f1R(0..7)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |   f3L(0..7)   |   f3R(0..7)   |  f1L(8..15)   |  f1R(8..15)   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  f3L(8..15)   |  f3R(8..15)   |  f1L(16..23)  |  f1R(16..23)  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   : ...                                                           :
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | f3L(144..151) | f3R(144..151) |f1L(152..158)|P|f1R(152..158)|P|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |f3L(152..158)|P|f3R(152..158)|P|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   Note, in the above example, the last octet in all four speech frames
   is padded with one zero bit to make it octet-aligned.

4.5. Implementation Considerations

An application implementing this payload format MUST understand all the payload parameters in the out-of-band signaling used. For example, if an application uses SDP, all the SDP and media type parameters in this document MUST be understood. This requirement ensures that an implementation always can decide if it is capable or not of communicating.

No operating mode of the payload format is mandatory to implement. The requirements of the application using the payload format should be used to determine what to implement.

To achieve basic interoperability, an implementation SHOULD at least implement both bandwidth-efficient and octet-aligned modes for a single audio
   channel.  The other operating modes: interleaving, robust sorting,
   and frame-wise CRC (in both single and multi-channel) are OPTIONAL to
   implement.

   The mode-change-period, mode-change-capability, and mode-change-
   neighbor parameters are intended for signaling with GSM endpoints.
   When interoperability with GSM is desired, encoders SHOULD only
   perform codec mode changes to neighboring modes and in integer
   multiples of 40 ms (two frame-blocks), but decoders SHOULD accept
   codec mode changes at any time, i.e., for every frame-block.  The
   encoder may arbitrarily select the initial phase (odd or even frame-
   block) where codec mode changes are performed, but then SHOULD stick
   to that phase as far as possible.  However, in rare cases, handovers
   or other events (e.g., call forwarding) may change this phase and may
   also cause mode changes to non-neighboring modes.  The decoder SHALL
   therefore be prepared to accept changes also in the other phase and
   to other modes.  Section 8 specifies the usage of the parameters
   mode-change-period and mode-change-capability to indicate the desired
   behavior in applications.

   See 3GPP TS 26.103 [28] for preferred AMR and AMR-WB configurations
   for operation in GSM and 3GPP UMTS networks.  In gateway scenarios,
   encoders can be requested through the "mode-set" parameter to use a
   limited mode-set that is supported by the link beyond the gateway.
   Further, to avoid congestion on that link, the encoder SHOULD limit
   the initial codec mode for a session to a lower mode, until at least
   one frame-block is received with rate control information.

4.5.1. Decoding Validation

When processing a received payload packet, if the receiver finds that the calculated payload length, based on the information for the payload type and the values found in the payload header fields, does not match the size of the received packet, the receiver SHOULD discard the packet. This is because decoding a packet that has errors in its length field could severely degrade the speech quality.
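
The length check described above might look like the following Python sketch for the simplest case: AMR (not AMR-WB), single channel, octet-aligned mode. The function names and the AMR_FRAME_BITS table are assumptions of this example. The 118-bit and 159-bit sizes appear in the examples of this document; the remaining sizes are taken from the AMR frame structure specification and should be checked against it before use.

```python
# Speech bits per AMR frame type (FT). FT=8 is AMR SID, FT=15 is NO_DATA.
# Illustrative table; verify against the codec specification before use.
AMR_FRAME_BITS = {0: 95, 1: 103, 2: 118, 3: 134, 4: 148,
                  5: 159, 6: 204, 7: 244, 8: 39, 15: 0}
NO_CRC_TYPES = (14, 15)  # frames with these FTs carry no CRC


def expected_length(payload, interleaving=False, crc=False):
    """Return the payload length implied by the header fields, or None
    if the ToC is truncated or contains an FT this sketch does not know."""
    pos = 2 if interleaving else 1      # payload header: one or two octets
    total = pos
    while True:
        if pos >= len(payload):
            return None                 # ToC runs past the end of the packet
        entry = payload[pos]
        pos += 1
        ft = (entry >> 3) & 0x0F        # ToC entry layout: F | FT(4) | Q | P | P
        if ft not in AMR_FRAME_BITS:
            return None
        total += 1                              # the ToC entry itself
        total += (AMR_FRAME_BITS[ft] + 7) // 8  # octet-aligned speech data
        if crc and ft not in NO_CRC_TYPES:
            total += 1                          # one 8-bit CRC per frame
        if not entry & 0x80:            # F=0 marks the last ToC entry
            return total


def payload_length_ok(payload, interleaving=False, crc=False):
    return expected_length(payload, interleaving, crc) == len(payload)
```

Applied to the payload of Section 4.4.5.1 (one header octet, two ToC entries with FT=5, and two 20-octet frames), the expected length is 43 octets.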

5. AMR and AMR-WB Storage Format

The storage format is used for storing AMR or AMR-WB speech frames in a file or as an email attachment. Multiple channel content is supported.

In general, an AMR or AMR-WB file has the following structure:

   +------------------+
   | Header           |
   +------------------+
   | Speech frame 1   |
   +------------------+
   : ...              :
   +------------------+
   | Speech frame n   |
   +------------------+

Note, to preserve interoperability with already deployed implementations, single-channel content uses a file header format different from that of multi-channel content.

There also exists another storage format for AMR and AMR-WB that is suitable for applications with more advanced demands on the storage format, like random access or synchronization with video. This format is the 3GPP-specified ISO-based multimedia file format 3GP [31]. Its media type is specified by RFC 3839 [32].

5.1. Single-Channel Header

A single-channel AMR or AMR-WB file header contains only a magic number. Different magic numbers are defined to distinguish AMR from AMR-WB.

The magic number for single-channel AMR files MUST consist of ASCII character string:

   "#!AMR\n"
   (or 0x2321414d520a in hexadecimal).

The magic number for single-channel AMR-WB files MUST consist of ASCII character string:

   "#!AMR-WB\n"
   (or 0x2321414d522d57420a in hexadecimal).

   Note, the "\n" is an important part of the magic numbers and MUST be
   included in the comparison, since, otherwise, the single-channel
   magic numbers above will become indistinguishable from those of the
   multi-channel files defined in the next section.

5.2. Multi-Channel Header

The multi-channel header consists of a magic number followed by a 32-bit channel description field, giving the multi-channel header the following structure:

   +------------------+
   | magic number     |
   +------------------+
   | chan-desc field  |
   +------------------+

The magic number for multi-channel AMR files MUST consist of the ASCII character string:

   "#!AMR_MC1.0\n"
   (or 0x2321414d525F4D43312E300a in hexadecimal).

The magic number for multi-channel AMR-WB files MUST consist of the ASCII character string:

   "#!AMR-WB_MC1.0\n"
   (or 0x2321414d522d57425F4D43312E300a in hexadecimal).

The version number in the magic numbers refers to the version of the file format.

The 32 bit channel description field is defined as:

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                  Reserved bits                        | CHAN  |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Reserved bits: MUST be set to 0 when written, and a reader MUST ignore them.

CHAN (4 bits, unsigned integer): Indicates the number of audio channels contained in this storage file. The valid values and the order of the channels within a frame-block are specified in Section 4.1 in [12].
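
To illustrate how the four magic numbers and the channel description field fit together, the following Python sketch identifies the header at the start of a file image. The names MAGIC_NUMBERS and parse_storage_header are hypothetical, and the CHAN value is returned as read, without checking it against the valid values in [12].

```python
# (magic number, codec name, multi-channel flag); the trailing "\n" is
# part of each magic number and is included in the comparison.
MAGIC_NUMBERS = [
    (b"#!AMR\n", "AMR", False),
    (b"#!AMR-WB\n", "AMR-WB", False),
    (b"#!AMR_MC1.0\n", "AMR", True),
    (b"#!AMR-WB_MC1.0\n", "AMR-WB", True),
]


def parse_storage_header(data):
    """Return (codec, channels, offset of the first speech frame)."""
    for magic, codec, multi_channel in MAGIC_NUMBERS:
        if not data.startswith(magic):
            continue
        offset = len(magic)
        if not multi_channel:
            return codec, 1, offset
        if len(data) < offset + 4:
            raise ValueError("truncated channel description field")
        # CHAN occupies the last 4 bits of the 32-bit field;
        # the 28 reserved bits before it are ignored by a reader.
        channels = data[offset + 3] & 0x0F
        return codec, channels, offset + 4
    raise ValueError("not an AMR or AMR-WB storage file")


if __name__ == "__main__":
    print(parse_storage_header(b"#!AMR-WB_MC1.0\n\x00\x00\x00\x02"))
```

Because the newline is part of each magic number, the order in which the four candidates are tried does not matter: "#!AMR\n" can never match the start of "#!AMR_MC1.0\n".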

5.3. Speech Frames

After the file header, speech frame-blocks consecutive in time are stored in the file. Each frame-block contains a number of octet-aligned speech frames equal to the number of channels, and stored in increasing order, starting with channel 1.

Each stored speech frame starts with a one-octet frame header with the following format:

    0 1 2 3 4 5 6 7
   +-+-+-+-+-+-+-+-+
   |P|  FT   |Q|P|P|
   +-+-+-+-+-+-+-+-+

The FT field and the Q bit are defined in the same way as in Section 4.3.2. The P bits are padding and MUST be set to 0, and MUST be ignored.

Following this one octet header come the speech bits as defined in 4.4.3. The last octet of each frame is padded with zeroes, if needed, to achieve octet alignment.

The following example shows an AMR frame in 5.9 kbps coding mode (with 118 speech bits) in the storage format.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |P| FT=2  |Q|P|P|                                               |
   +-+-+-+-+-+-+-+-+                                               +
   |                                                               |
   +          Speech bits for frame-block n, channel k             +
   |                                                               |
   +                                                           +-+-+
   |                                                           |P|P|
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

Non-received speech frames or frame-blocks between SID updates during non-speech periods MUST be stored as NO_DATA frames (frame type 15, as defined in [2] and [4]). Frames or frame-blocks lost in transmission MUST be stored as NO_DATA frames or SPEECH_LOST (frame type 14, only available for AMR-WB) in complete frame-blocks to keep synchronization with the original media.

Comfort noise frames of other types than AMR SID (FT=8) (i.e., frame type 9, 10, and 11 for AMR) SHALL NOT be used in the AMR file format.
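
Reading stored frames back amounts to decoding each one-octet frame header and skipping the octet-aligned speech bits that follow. The sketch below does this for a single-channel AMR file. The name iter_amr_frames and the AMR_FRAME_BITS table are assumptions of this example; only the 118-bit and 159-bit sizes appear in this document, and the others should be checked against the AMR frame structure specification.

```python
# Speech bits per AMR frame type (FT). FT=8 is AMR SID, FT=15 is NO_DATA.
# Illustrative table; verify against the codec specification before use.
AMR_FRAME_BITS = {0: 95, 1: 103, 2: 118, 3: 134, 4: 148,
                  5: 159, 6: 204, 7: 244, 8: 39, 15: 0}


def iter_amr_frames(data, offset=len(b"#!AMR\n")):
    """Yield (ft, q, speech_octets) for each stored frame.

    `offset` is where the first frame starts; the default skips the
    single-channel AMR magic number.
    """
    while offset < len(data):
        header = data[offset]
        ft = (header >> 3) & 0x0F       # frame header layout: P | FT(4) | Q | P | P
        q = (header >> 2) & 0x01
        if ft not in AMR_FRAME_BITS:
            raise ValueError("frame type %d not handled by this sketch" % ft)
        size = (AMR_FRAME_BITS[ft] + 7) // 8    # padded to whole octets
        body = data[offset + 1:offset + 1 + size]
        if len(body) < size:
            raise ValueError("truncated speech frame")
        yield ft, q, body
        offset += 1 + size


if __name__ == "__main__":
    image = b"#!AMR\n" + b"\x14" + bytes(15) + b"\x7c"
    for ft, q, body in iter_amr_frames(image):
        print(ft, q, len(body))
```

For the 5.9 kbps example above, the 118 speech bits plus two padding bits occupy 15 octets after the frame header, so each such frame takes 16 octets in the file.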

