RFC 4396

RTP Payload Format for 3rd Generation Partnership Project (3GPP) Timed Text

Pages: 66
Proposed Standard

4.2.  Buffering of Sample Descriptions

   The buffering of sample descriptions is a matter of the client's
   timed text codec implementation.  In order to work properly, this
   payload format requires that:

     o Static sample descriptions MUST be buffered at the client, at
       least, for the duration of the session.

     o If dynamic sample descriptions are used, their buffering and
       update of the SIDX values MUST follow the mechanism described in
       the next section.

4.2.1.  Dynamic SIDX Wraparound Mechanism

   The use of dynamic sample descriptions by senders is OPTIONAL.
   However, if they are used, senders MUST implement this mechanism.
   Receivers MUST always implement it.

   Dynamic SIDX values remain active either during the entire duration
   of the session (if used just once) or in different intervals of it
   (if used more than once).

        Note: In the following, SIDX means dynamic SIDX.

   For choosing the wraparound mechanism, the following rationale was
   used: There are 128 dynamic SIDX values possible, [0..127].  If one
   chooses to allow a maximum of 127 to be used as dynamic SIDXs, then
   any reordered packet with a new sample description would make the
   mechanism fail.  For example, if the last packet received is SIDX=5,
   then all values except SIDX=6 (127 in total) would be "active".  Now,
   if a reordered packet arrives with a new description, SIDX=9, it will
   be mistakenly discarded, because SIDX=9 is, at that moment, marked as
   "active" and active sample descriptions shall not be overwritten.
   Therefore, a "guard interval" is introduced.  This guard interval
   reduces the number of active SIDXs at any point in time to 64.
   Although most timed text applications will probably need fewer than
   64 sample descriptions during a session (in total), a wraparound
   mechanism to handle the need for more is described here.

   Thereby, a sliding window of 64 active SIDX values is used.  Values
   within the window are "active"; all others are marked "inactive".  An
   SIDX value becomes active if at least one sample description
   identified by that SIDX has been received.  Since sample descriptions
   MAY be sent redundantly, it is possible that a client receives a
   given SIDX several times.  However, active sample descriptions SHALL
   NOT be overwritten: The receiver SHALL ignore redundant sample
   descriptions and it MUST use the already cached copy.  The "guard
   interval" of (64) inactive values ensures that the correct
   association SIDX <-> sample description is always used.

        Informative note: As for the "guard interval" value itself, 64
        as 128/2 was considered simple enough while still meeting the
        expected maximum number of sample descriptions.  Besides that,
        there's no other motivation for choosing 64 or a different
        value.

   The following algorithm is used to buffer dynamic sample descriptions
   and to maintain the dynamic SIDX values:

   Let X be the last SIDX received that updated the range of active
   sample descriptions.  Let Y be a value within the allowed range for
   dynamic SIDX: [0,127], and different from X.  Let Z be the SIDX of
   the last received sample description.  Then:

     1. Initialize all dynamic SIDX values as inactive.  For stored
        contents, read the sample description index in the Sample to
        Chunk box ("stsc") for that sample.  For live streaming, the
        first value MAY be zero or any other value in the interval
        above.  Go to step 2.

     2. The first in-band sample description with SIDX=Z is received and
        stored; set X=Z.  Go to step 3.

     3. Any SIDX within the interval [X+1 modulo(128), X+64 modulo(128)]
        is marked as inactive, and any corresponding sample description
        is deleted.  Any SIDX within the interval [X+65 modulo(128), X]
        is set active.  Go to step 4 (wait state).

     4. Wait for next sample description.  Once the client is
        initialized, the interval of active SIDX values MUST change
        whenever a sample description with an SIDX value in the inactive
        set is received.  That is, upon reception of a sample
        description with SIDX=Z, do the following:

        a. If Z is in the (closed) interval [X+1 modulo(128), X+64
           modulo(128)] then set X=Z, store the sample description, and
           go to step 3.

        b. Else, Z must be in the interval [X+65 modulo(128), X], thus:

            i. If SIDX=Z is not stored, then store the sample
               description. Go to beginning of step 4 (wait state).
           ii. Else, go to the beginning of step 4 (wait state).

        Informative note: Any SIDX value in the interval [0,127] may be
        sent.  For example, if [64..127] is the current active set and
        SIDX=0 is sent, a new sample description is defined (0) and an
        old one deleted (64); thus [65..127] and [0] are active.
        Similarly, one could now send SIDX=64, thus inverting the active
        and inactive sets.

   Example:
        If X=4, any SIDX in the interval [5,68] is inactive.  Active
        SIDX values are in the complementary interval [69,127] plus
        [0,4].  For example, if the client receives a SIDX=6, then the
        active interval is now different: [0,6] plus [71,127].  If the
        received SIDX is in the current active interval, no change SHALL
        be applied.
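
   The buffering algorithm above can be sketched in a few lines of
   Python.  This is only an illustration, not part of the payload
   format: the class name DynamicSidxWindow and its methods are
   hypothetical, and sample descriptions are treated as opaque byte
   strings.

```python
class DynamicSidxWindow:
    """Illustrative receiver-side cache for dynamic SIDX values 0..127."""

    MODULUS = 128  # number of dynamic SIDX values
    GUARD = 64     # size of the inactive guard interval

    def __init__(self):
        self.x = None    # X: last SIDX that moved the window
        self.cache = {}  # SIDX -> sample description bytes

    def is_inactive(self, z):
        """True if z lies in [X+1 modulo(128), X+64 modulo(128)]."""
        if self.x is None:  # step 1: all values start inactive
            return True
        return 1 <= (z - self.x) % self.MODULUS <= self.GUARD

    def receive(self, z, description):
        """Handle a sample description with SIDX=z; True if it was stored."""
        if not 0 <= z < self.MODULUS:
            raise ValueError("not a dynamic SIDX value")
        if self.is_inactive(z):
            # Steps 2 and 4a: store, set X=Z, then step 3 (flush guard range).
            self.x = z
            self.cache[z] = description
            for k in range(1, self.GUARD + 1):
                self.cache.pop((z + k) % self.MODULUS, None)
            return True
        # Step 4b: z is active; store only if nothing is cached yet.
        if z not in self.cache:
            self.cache[z] = description
            return True
        return False  # redundant copy: the cached description is kept

    def active_values(self):
        """Sorted list of the 64 active SIDX values, [X+65, X] modulo 128."""
        if self.x is None:
            return []
        count = self.MODULUS - self.GUARD
        return sorted((self.x - k) % self.MODULUS for k in range(count))
```

   With X=4, active_values() yields [0,4] plus [69,127]; receiving
   SIDX=6 afterwards moves the window to [0,6] plus [71,127], as in the
   example above.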

4.3.  Finding Payload Header Values in 3GP Files

   For the purpose of streaming timed text contents, some values in the
   boxes contained in a 3GP file are mapped to fields of this payload
   header.  This section explains where to find those values.

   Additionally, for the duration and sample description indexes,
   extension mechanisms are provided.  All senders MUST implement the
   extension mechanisms described herein.

   If the timed text is streamed out of a 3GP file, the following
   guidelines SHALL be followed.

        Note: All fields in the objects (boxes) of a 3GP file are found
        in network byte order.

   Information obtained from the Sample Table Box (stbl):

        o Sample Descriptions and Sample Description length: The Sample
          Description box (stsd, inside the stbl) contains the sample
          descriptions.  For timed text media, each element of stsd is a
          timed text sample entry (type "tx3g").

          The (unsigned) 32 bits of the "size" field in the stsd box
          represent the length (in bytes) of the sample description, as
          carried in TYPE 5 units.  On the other hand, the LEN field of
          TYPE 5 units is restricted to 16 bits.  Therefore, if the
          value of "size" is greater than (2^16-1-3)[bytes], then the
          sample description SHALL NOT be streamed with this payload
          format.  There is no extension mechanism defined in this case,
          since fragmentation of sample descriptions is not defined
          (sample descriptions are typically up to some 200 bytes in
          size).  Note: The three (3) accounts for the TYPE 5 header
          fields included in the LEN value.

        o SDUR from the Decoding Time to Sample Box (stts).  The
          (unsigned) 32 bits of the "sample delta" field are used for
          calculating SDUR.  However, since the SDUR field is only 3
          bytes long, text samples with duration values larger than
          (2^24-1)/(timestamp clockrate)[seconds] cannot be streamed
          directly.  The solution is simple: Copies of the corresponding
          text sample SHALL be sent.  Thereby, the timestamp and
          duration values SHALL be adjusted so that a continuous display
          is guaranteed as if just one sample had been sent.
          That is, a sample with timestamp TS and duration SDUR can be
          sent as two samples having timestamps TS1 and TS2 and
          durations SDUR1 and SDUR2, such that TS1=TS, TS2=TS1+SDUR1,
          and SDUR=SDUR1+SDUR2.

        o Text sample length from the Sample Size Box (stsz).  The
          (unsigned) 32 bits of the "sample size" or "entry size" (one
          of them, depending on whether the sample size is fixed or
          variable) indicate the length (in bytes) of the 3GP text
          sample.  For obtaining the length of the (actual) streamed
          text sample, the lengths of the text string byte count (2
          bytes) and, in case of UTF-16 strings, the length of the
          BOM (also 2 bytes) SHALL be deducted.  This is illustrated in
          Figure 9.

          Text Sample according to 3GPP TS 26.245

                               TEXT SAMPLE (length=stsz)
                 .--------------------------------------------------.
                /                                                    \
                               TEXT STRING  (length=TBC)
                    .------------------------------------.
                   /                                      \
                TBC BOM                                     MODIFIERS
               +---+---+----------------------------------+-----------+
                                     ||
                                     ||    TBC BOM  -> TLEN  field
                                     ||   +---+---+    U bit
                                     ||
                                     \/

          Text Sample according to this Payload Format

                                 TEXT SAMPLE (length=SLEN w/o TBC,BOM)
                        .--------------------------------------------.
                       /                                              \
                                     TEXT STRING (length=TLEN)
                        .--------------------------------.
                       /                                  \
                                    TEXT STRING             MODIFIERS
                       +----------------------------------+-----------+

              KEY:
              TBC = Text string Byte Count
              BOM = Byte Order Mark

                    Figure 9.  Text sample composition

          Moreover, since the LEN field in TYPE 1 unit header is 16 bits
          long, larger text sample sizes than (2^16-1-8) [bytes] SHALL
          NOT be streamed.  Also, in this case, no extension mechanism
          is defined.  This is because this maximum is considered enough
          for the targeted streaming applications. (Note: The eight (8)
          accounts for the TYPE 1 header fields included in the LEN
          value).

        o SIDX from the Sample to Chunk Box (stsc): The stsc Box is used
          to find samples and their corresponding sample descriptions.
          These are referenced by the "sample description index", a
          32-bit (unsigned) integer.  If possible, these indices may be
          directly mapped to the SIDX field.  However, there are several
          cases where this may not be possible:

                  a) The total number of indices used is greater than
               the number of indices available, i.e., if the static
               sample descriptions are more than 127 or the dynamic ones
               are more than 64.

                  b) The original SIDX value ranges do not fit in the
               allowed ranges for static (129-254) or dynamic (0-127)
               values.

          Therefore, when assigning SIDX values to the sample
          descriptions, the following guidelines are provided:

          o    Static sample descriptions can simply be assigned
               consecutive values within the range 129-254 (closed
               interval).  This range should be well enough for static
               sample descriptions.

          o    As for dynamic sample descriptions:

                  a) Streams that use less than 64 dynamic sample
               descriptions SHOULD use consecutive values for SIDX
               anywhere in the range 0-127 (closed interval).

                  b) For streams with more than 64 sample descriptions,
               the SIDX values MUST be assigned in usage order, and if
               any sample description shall be used after it has been
               set inactive, it will need to be re-sent and assigned a
               new SIDX value (according to the algorithm in Section
               4.2.1).
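
   As an illustration of the SDUR rule given for the stts box above, the
   following Python sketch splits one over-long duration into copies
   whose durations each fit the 3-byte SDUR field.  The function name
   split_duration is hypothetical, and wrapping the timestamp modulo
   2^32 is an assumption based on the 32-bit RTP timestamp field.

```python
MAX_SDUR = 2**24 - 1  # largest value that fits the 3-byte SDUR field


def split_duration(ts, sdur, max_sdur=MAX_SDUR):
    """Return (timestamp, duration) pairs for the copies of one sample.

    The copies are contiguous: each timestamp equals the previous
    timestamp plus the previous duration, and the durations add up to
    the original sdur.
    """
    parts = []
    while sdur > max_sdur:
        parts.append((ts, max_sdur))
        ts = (ts + max_sdur) % 2**32  # assumption: 32-bit RTP timestamp wrap
        sdur -= max_sdur
    parts.append((ts, sdur))
    return parts
```

   A sample that already fits is returned unchanged as a single pair,
   so the function can be applied to every sample read from the file.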

   Information obtained from the Media Data Box:

        o Text strings, TLEN, U bit, and modifiers from the Media Data
          Box (mdat).  Text strings, 16-bit text string byte count, Byte
          Order Mark (BOM, indicating UTF encoding), and modifier boxes
          can be found here.

          For TYPE 1 units, the value of TLEN is extracted from the text
          string byte count that precedes the text string in the text
          sample, as stored in the 3GP file.  If UTF-16 encoding is
          used, two (2) more bytes have to be deducted from this byte
          count beforehand, in order to exclude the BOM.  See Figure 9.
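
   The conversion shown in Figure 9 can be sketched as follows.  This is
   a minimal illustration under stated assumptions: the function name
   to_streamed_sample and the returned dictionary keys are hypothetical,
   the input is one complete text sample as stored in the file, and the
   U bit is assumed to be 1 for UTF-16 and 0 for UTF-8.

```python
import struct

BOM_UTF16_BE = b"\xfe\xff"  # big-endian UTF-16 byte order mark


def to_streamed_sample(sample):
    """Strip TBC and BOM from a stored text sample (see Figure 9)."""
    if len(sample) < 2:
        raise ValueError("sample too short for the byte count")
    (tbc,) = struct.unpack_from(">H", sample, 0)  # 16-bit byte count
    text = sample[2:2 + tbc]
    if len(text) != tbc:
        raise ValueError("truncated text string")
    modifiers = sample[2 + tbc:]
    utf16 = text.startswith(BOM_UTF16_BE)
    if utf16:
        text = text[2:]  # the BOM is not streamed
    return {
        "u_bit": int(utf16),       # assumption: 1 = UTF-16, 0 = UTF-8
        "tlen": len(text),         # TBC minus the BOM, if any
        "text": text,
        "modifiers": modifiers,
        "slen": len(text) + len(modifiers),  # stsz size minus TBC and BOM
    }
```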

4.4.  Fragmentation of Timed Text Samples

   This section explains why text samples may have to be fragmented and
   discusses some of the possible approaches to doing it.  A solution is
   proposed together with rules and recommendations for fragmenting and
   transporting text samples.

   3GPP Timed Text applications are expected to operate at low bitrates.
   This fact, added to the small size of timed text samples (typically
   one or two hundred bytes) makes fragmentation of text samples a rare
   event.  Samples should usually fit into the MTU size of the used
   network path.

   Nevertheless, some text strings (e.g., ending roll in a movie) and
   some modifier boxes (i.e., for hyperlinks, for karaoke, or for
   styles) may become large.  This may also apply for future modifier
   boxes.  In such cases, the first option to consider is whether it is
   possible to adjust the encoding (e.g., the size of sample) in such a
   way that fragmentation is avoided.  If it is, this is preferred to
   fragmentation and SHOULD be done.

   Otherwise, if this is not possible or other constraints prevent it,
   fragmentation MAY be used, and the basic guidelines given in this
   document MUST be followed:

   o It is RECOMMENDED that text samples be fragmented as seldom as
     possible, i.e., the least possible number of fragments is created
     out of a text sample.

   o If there is some bitrate and free space in the payload available,
     sample descriptions (if at hand) SHOULD be aggregated.

   o Text strings MUST be split at character boundaries; see TYPE 2
     header.  Otherwise, it is not possible to display the text contents
     of a fragment if a previous fragment was lost.  As a consequence,
     text string fragmentation requires knowledge of the UTF-8/UTF-16
     encoding formats to determine character boundaries.

   o Unlike text strings, the modifier boxes are NOT REQUIRED to be
     split at meaningful boundaries.  However, it is RECOMMENDED that
     this be done whenever possible.  This decreases the effects of
     packet loss.  This payload format does not ensure that partially
     received modifiers are applied to text strings.  If only part of
     the modifiers is received, it is an application issue how to deal
     with these, i.e., whether or not to use them.

        Informative note: Ensuring that partially received modifiers can
        be applied to text strings in all cases (for all modifier types
        and for all fragment loss constellations) would place additional
        requirements on the payload format.  In particular, this would
        require that: a) senders understand the semantics of the
        modifier boxes and b) specific fragment headers for each of the
        modifier boxes are defined, in addition to the payload formats
        defined below.  Understanding the modifiers semantics means
        knowing, e.g., where each modifier starts and ends, which text
        fragments are affected, which modifiers may or may not be split,
        or what the fields indicate.  This is necessary to be able to
        split the modifiers in such a way that each fragment can be
        applied independently of previous packet losses.  This would
        require a more intelligent fragmentation entity and more complex
        headers.  Given the low probability of fragmentation and the
        desire to keep the requirements low, it does not seem reasonable
        to specify such modifier box specific headers.

   o Modifier and text string fragments SHOULD be protected against
     packet losses, i.e., using FEC [7], retransmission [11], repetition
     (Section 5), or an equivalent technique.  This minimizes the
     effects of packet loss.

   o An additional requirement when fragmenting text samples is that the
     start of the modifiers MUST be indicated using the payload header
     defined for that purpose, i.e., a TYPE 3 unit MUST be used (see
     Section 4.1.4).  This enables a receiver to detect the start of the
     modifiers as long as there are not two or more consecutive packet
     losses.

   o Finally, sample descriptions SHALL NOT be fragmented because they
     contain important information that may affect several text samples.
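
   To illustrate the character-boundary rule for text strings, the
   following Python sketch splits a UTF-8 string into fragments without
   cutting through a multi-byte character.  It is an illustration only:
   the function name split_utf8 is hypothetical, "character" is taken to
   mean one encoded code point, and UTF-16 strings (where surrogate
   pairs must be kept together) are not handled.

```python
def split_utf8(text, max_bytes):
    """Split UTF-8 bytes into chunks of at most max_bytes each.

    A chunk never ends inside a multi-byte character, so every chunk
    can be decoded on its own even if a neighbouring chunk is lost.
    """
    if max_bytes < 4:
        raise ValueError("max_bytes must hold a 4-byte UTF-8 character")
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_bytes, len(text))
        # Continuation bytes have the form 0b10xxxxxx; if the next chunk
        # would start with one, back up to the preceding lead byte.
        while end < len(text) and (text[end] & 0xC0) == 0x80:
            end -= 1
        if end <= start:
            raise ValueError("input is not valid UTF-8")
        chunks.append(text[start:end])
        start = end
    return chunks
```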

4.5.  Reassembling Text Samples at the Receiver

   The payload headers defined in this document allow reassembling
   fragmented text samples.  For this purpose, the standard RTP
   timestamp, the duration field (SDUR), and the fields TOTAL/THIS in
   the payload headers are used.

   Units that belong to the same text sample MUST have the same
   timestamp.  TYPE 5 units do not comply with this rule since they are
   not part of any particular text sample.

   The process for collecting the different fragments (units) of a text
   sample is as follows:

     1. Search for units having the same timestamp value, i.e., units
        that belong to the same text sample or sample descriptions that
        shall become available at that time instant.  If several units
        of the same sample are repeated, only one of them SHALL be used.
        Repeated units are those that have the same timestamp and the
        same values for TOTAL/THIS.

                Note that, as mentioned in Section 4.1.1, the receiver
                SHALL ignore units with unrecognized TYPE value.
                However, the RTP header fields and the rest of the units
                (if any) in the payload are still useful.

     2. Check within this set whether any of the units from the text
        sample is missing.  This is done using the TOTAL and THIS
        fields; the TOTAL field indicates how many fragments were
        created out of the text sample, and the THIS field indicates the
        position of this fragment in the text sample.  As a result of
        this operation, two outcomes are possible:

          a. No fragment is missing.  Then, the THIS field SHALL be used
             to order the fragments and reassemble the text sample
             before forwarding it to the decoding application.  Special
             care SHALL be taken when reassembling the text string as
             indicated in step 4 below.

          b. One or more fragments are missing: Check whether this
             fragment belongs to the text string or to the modifiers.
             TYPE 2 units identify text string fragments, and TYPE 3 and
             4 identify modifier fragments:

              i. If the fragment or fragments missing belong to the text
                 string and the modifiers were received complete, then
                 the received text characters may, at least, be
                 displayed as plain text.  Some modifiers may only be
                 applied as long as it is possible to identify the
                 character numbers, e.g., if only the last text string
                 fragment is lost.  This is the case for modifiers
                 defining specific font styles ('styl'), highlighted
                 characters ('hlit'), karaoke feature ('krok'), and
                 blinking characters ('blnk').  Other modifiers such as
                 'dlay' or 'tbox' can be applied without the knowledge
                 of the character number.  It is an application issue to
                 decide whether or not to apply the modifiers.

             ii. If the fragment missing belongs to the modifiers and
                 the text strings were received complete, then the
                 incomplete modifiers may be used.  The text string
                 SHOULD at least be displayed as plain text.  As
                 mentioned in Section 4.4, modifiers may be split
                 without observing meaningful boundaries.  Hence, it may
                 not always be possible to make use of partially
                 received modifiers.  However, to avoid this, it is
                 RECOMMENDED that the modifiers be split at meaningful
                 boundaries.

            iii. A third possibility is that it is not possible to
                 discern whether modifiers or text strings were received
                 complete.  For example, if the TYPE 3 unit of a sample
                 plus the following or preceding packet is lost, there
                 is no way for the RTP receiver to know if one or both
                 packets lost belong to the modifiers or if there are
                 also some missing text strings.  Repetition, FEC,
                 retransmission, or other protection mechanisms as per
                 Section 4.6 are RECOMMENDED to avoid this situation.

             iv. Finally, if it is certain that neither text strings
                 nor modifiers were received complete, then the text
                 strings and the modifiers may be rendered partially or
                 may be discarded.  This is an application choice.

     3. Sample descriptions can be directly associated with the
        reassembled text samples, via the sample description index
        (SIDX).

     4. Reassembling of text strings: Since the text strings transported
        in RTP packets MUST NOT include any byte order mark (BOM), the
        receiver MUST prepend it to the reassembled UTF-16 string before
        handing it to the timed text decoder (see Figure 9).  The value
        of the BOM is 0xFEFF because only big endian serialization of
        UTF-16 strings is supported by this payload format.
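
   The collection steps above can be sketched in Python as follows.
   This is a simplified illustration, not a complete receiver: the Unit
   tuple and the function name reassemble are hypothetical, THIS is
   assumed to count fragments starting at 1, and the treatment of
   incomplete samples (case 2b) is left to the caller.

```python
from collections import namedtuple

# Hypothetical container for one parsed unit of a payload.
Unit = namedtuple("Unit", "type timestamp total this data")

BOM_UTF16_BE = b"\xfe\xff"


def reassemble(units, timestamp, utf16=False):
    """Return (text, modifiers, missing) for one fragmented text sample.

    missing lists the THIS values that were not received; text and
    modifiers are None unless the sample is complete.
    """
    frags = {}
    for u in units:
        # Step 1: same timestamp, fragment types only (TYPE 2, 3, 4).
        if u.timestamp == timestamp and u.type in (2, 3, 4):
            frags.setdefault(u.this, u)  # repeated unit: keep first copy
    if not frags:
        raise ValueError("no fragments with this timestamp")
    total = next(iter(frags.values())).total
    # Step 2: use TOTAL/THIS to detect missing fragments.
    missing = [i for i in range(1, total + 1) if i not in frags]
    if missing:
        return None, None, missing
    ordered = [frags[i] for i in range(1, total + 1)]
    text = b"".join(u.data for u in ordered if u.type == 2)
    modifiers = b"".join(u.data for u in ordered if u.type != 2)
    if utf16:
        text = BOM_UTF16_BE + text  # step 4: restore the BOM
    return text, modifiers, []
```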

4.6.  On Aggregate Payloads

   Units SHOULD be aggregated to avoid overhead, whenever possible.  The
   aggregate payloads MUST comply with one of the following ordered
   configurations:

   1. Zero or more sample descriptions (TYPE 5) followed by zero or more
      whole text samples (TYPE 1 units).  At least one unit of either
      type MUST be present.

   2. Zero or more sample descriptions followed by zero or one modifier
      fragment, either TYPE 3 or TYPE 4.  At least one unit MUST be
      present.

   3. Zero or more sample descriptions, followed by zero or one text
      string fragment (TYPE 2), followed by zero or one TYPE 3 unit.  If
      a TYPE 2 unit and a TYPE 3 unit are present, then they MUST belong
      to the same text sample.  At least one unit MUST be present.
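
   As an illustration, the three configurations above can be checked
   against the ordered list of unit TYPE values in a payload.  The
   function name is_valid_aggregate is hypothetical, and the check does
   not verify that a TYPE 2 and a TYPE 3 unit belong to the same text
   sample, since that requires comparing their header fields.

```python
import re

# 5* then either 1* (config 1), [34]? (config 2), or 2?3? (config 3).
_ALLOWED = re.compile(r"5*(?:1*|[34]?|2?3?)")


def is_valid_aggregate(unit_types):
    """True if the ordered TYPE values match one allowed configuration."""
    if not unit_types:
        return False  # at least one unit MUST be present
    if any(t not in (1, 2, 3, 4, 5) for t in unit_types):
        return False  # only the five defined TYPE values are considered
    sequence = "".join(str(t) for t in unit_types)
    return _ALLOWED.fullmatch(sequence) is not None
```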

   Some observations:

   o Different aggregates than the ones listed above SHALL NOT be used.

   o Sample descriptions MUST be placed in the aggregate payload before
     the occurrence of any non-TYPE 5 units.

   o Correct reception of TYPE 5 units is important since their contents
     may be referenced by several other units in the stream.

     Receivers are unable to use text samples until their corresponding
     sample descriptions are received.  Accordingly, a sender SHOULD
     send multiple copies of a sample description to ensure reliability
     (see Section 5).  Receivers MAY use payload-specific feedback
     messages [21] to tell a sender that they have received a particular
     sample description.

   o Regarding timestamp calculation: In general, the rules for
     calculating the timestamp of units in an aggregate payload depend
     on the type of unit.  Based on the possible constellations for
     aggregate payloads, as above, we have:

           o Sample descriptions MUST receive the RTP timestamp of the
             packet in which they are included.

             Note that for TYPE 5 units, the timestamp actually does not
             represent the instant when they are played out, but instead
             the instant at which they become available for use.

           o For the first configuration: The first TYPE 1 unit receives
             the RTP timestamp.  The timestamp of any subsequent TYPE 1
             unit MUST be obtained by adding sample duration and
             timestamp, both of the preceding TYPE 1 unit.

           o For the second and third configuration, all units, TYPE 2,
             3, and 4, MUST receive the RTP timestamp.

           Refer to detailed examples on the timestamp calculation
           below.

   o As per configuration 3 above, a payload MAY contain several
     fragments of one (and only one) text sample.  If it does, then
     exactly one TYPE 2 unit followed by exactly one TYPE 3 unit is
     allowed in the same payload.  This is in line with RFC 3640 [12],
     Section 2.4, which explicitly disallows combining fragments of
     different samples in the same RTP payload.  Note that, in this
     special case, no timestamp calculation is needed.  That is, the RTP
     timestamp of both units is equal to the timestamp in the packet's
     RTP header.

   o Finally, note that the use of empty text samples allows for
     aggregating non-consecutive TYPE 1 units in the same payload.  Two
     text samples, with timestamps TS1 and TS3 and durations SDUR1 and
     SDUR3, are not consecutive if it holds TS1+SDUR1 < TS3.  A solution
     for this is to include an empty TYPE 1 unit with duration SDUR2
     between them, such that TS2+SDUR2 = TS1+SDUR1+SDUR2 = TS3.
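
   The arithmetic of the last observation can be written down as a small
   Python helper.  It is a sketch only: the function name fill_gap is
   hypothetical, timestamp wraparound is ignored, and a gap longer than
   the maximum SDUR value would need more than one empty unit.

```python
def fill_gap(ts1, sdur1, ts3):
    """Return (TS2, SDUR2) for the empty TYPE 1 unit between two samples.

    Returns None when the samples are already consecutive, i.e. when
    TS1+SDUR1 equals TS3.
    """
    ts2 = ts1 + sdur1
    if ts2 > ts3:
        raise ValueError("samples overlap")
    if ts2 == ts3:
        return None
    return ts2, ts3 - ts2  # TS2+SDUR2 = TS3
```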

   Some examples of aggregate payloads are illustrated in Figure 10.
   (Note: The figure is not scaled.)

      N/A    TS1   TS2     TS3
    +------+-----+------+-----+
    |TYPE5 |TYPE1|TYPE1 |TYPE1|
    +------+-----+------+-----+
      N/A   sdur1  sdur2  sdur3

                                   N/A    TS4
                                 +-----+-------+
                                 |TYPE5| TYPE 1|                   a)
                                 +-----+-------+
                                   N/A   sdur4

                                        TS4         TS4    TS4
                                 +--------------+ +--------------+
                                 |    TYPE2     | |TYPE2 |TYPE 3 | b)
                                 +--------------+ +--------------+
                                       sdur4       sdur4   sdur4

                                        TS4             TS4
                                 +--------------+ +--------------+
                                 | TYPE2| TYPE 3| |     TYPE4    | c)
                                 +--------------+ +--------------+
                                   sdur4  sdur4        sdur4

    |----------PAYLOAD 1------|  |--PAYLOAD 2---| |--PAYLOAD 3---|
               rtpts1               rtpts2           rtpts3

        KEY:
        TSx    = Text Sample x
        rtptsy = the standard RTP timestamp for PAYLOAD y
        sdurx  = the duration of Text Sample x
        N/A    =  not applicable

                  Figure 10.  Example aggregate payloads

   In Figure 10, four text samples (TS1 through TS4) are sent using
   three RTP packets.  These configurations have been chosen to show how
   the 5 TYPE headers are used.  Additionally, three different
   possibilities for the last text sample, TS4, are depicted: a), b),
   and c).

   In Figure 11, option b) from Figure 10 is chosen to illustrate how
   the timestamp for each unit is found.

      N/A    TS1   TS2    TS3        TS4            TS4    TS4
    +------+-----+------+-----+  +--------------+ +--------------+
    |TYPE5 |TYPE1|TYPE1 |TYPE1|  |    TYPE2     | |TYPE2 |TYPE 3 |
    +------+-----+------+-----+  +--------------+ +--------------+
      N/A   sdur1 sdur2  sdur3         sdur4       sdur4   sdur4

     (#1)    (#2) (#3)   (#4)           (#5)        (#6)    (#7)

    |----------PAYLOAD 1------|  |--PAYLOAD 2---| |--PAYLOAD 3---|
               rtpts1               rtpts2           rtpts3

               Figure 11.  Selected payloads from Figure 10

   Assuming TSx means Text Sample x, rtptsy represents the standard RTP
   timestamp for PAYLOAD y and sdurx, the duration of Text Sample x, the
   timestamp for unit #z, ts(#z), can be found as the sum of rtptsy and
   the cumulative sum of the durations of preceding units in that
   payload (except in the case of PAYLOAD 3 as per rule 3 above).  Thus,
   we have:

          1. for the units in the first aggregate payload, PAYLOAD 1:

                        ts(#1) = rtpts1
                        ts(#2) = rtpts1
                        ts(#3) = rtpts1 + sdur1
                        ts(#4) = rtpts1 + sdur1 + sdur2

           Note that the TYPE 5 unit and the first TYPE 1 unit both have
           the RTP timestamp.

          2. for PAYLOAD 2:

                        ts(#5) = rtpts2

          3. for PAYLOAD 3:

                        ts(#6) = ts(#7) = rtpts2 = rtpts3

           According to configuration 3 above, the TYPE2 and the TYPE 3
           units shall belong to the same sample.  Hence, rtpts3 must be
           equal to rtpts2.  For the same reason, the value of SDUR is
           not used to calculate the timestamp of the next unit.
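
   The timestamp rules worked through above can be sketched in Python.
   This is an illustration under stated assumptions: the function name
   unit_timestamps is hypothetical, each unit is given as a (TYPE, SDUR)
   pair in payload order, and sums are wrapped modulo 2^32 on the
   assumption of a 32-bit RTP timestamp.

```python
def unit_timestamps(rtp_ts, units):
    """Return the timestamp of each unit in one aggregate payload.

    units is a list of (unit_type, sdur) pairs; sdur is ignored for
    every type except TYPE 1.
    """
    timestamps = []
    next_type1_ts = rtp_ts
    for unit_type, sdur in units:
        if unit_type == 1:
            # First TYPE 1 gets the RTP timestamp; later ones add the
            # duration of the preceding TYPE 1 unit.
            timestamps.append(next_type1_ts)
            next_type1_ts = (next_type1_ts + sdur) % 2**32
        else:
            # TYPE 5 and fragment units (TYPE 2, 3, 4) get the RTP
            # timestamp of the packet.
            timestamps.append(rtp_ts)
    return timestamps
```

   For PAYLOAD 1 of Figure 11 with rtpts1=1000, sdur1=10, and sdur2=20,
   the result is 1000, 1000, 1010, and 1030 for units #1 through #4.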

4.7.  Payload Examples

   Some examples of payloads using the defined headers are shown below:

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |V=2|P|X| CC    |M|    PT       |        sequence number        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                           timestamp                           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |           synchronization source (SSRC) identifier            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE1|       LEN  (always >=8)       |    SIDX       |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                     SDUR                      |     TLEN      |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |    TLEN       |                                               |
      +---------------+                                               |
      |                  text string (no.bytes=TLEN)                  |
      |                                                               |
      |                                                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                   modifiers   (no.bytes=LEN - 8 - TLEN)       |
      |                                                               |
      |                                                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE1|       LEN  (always >=8)       |    SIDX       |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                     SDUR                      |     TLEN      |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |    TLEN       |                                               |
      +---------------+                                               |
      |                  text string (no.bytes=TLEN)                  |
      |                                                               |
      |                                                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                   modifiers   (no.bytes=LEN - 8 - TLEN)       |
      |                                               +-+-+-+-+-+-+-+-+
      |                                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

            Figure 12.  A payload carrying two TYPE 1 units

   In Figure 12, an RTP packet carrying two TYPE 1 units is depicted.
   The length fields LEN and TLEN locate the start of the next unit
   (LEN), the start of the modifiers (TLEN), and, as the figure
   indicates, the length of the modifiers (LEN - 8 - TLEN).
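
   The use of LEN and TLEN in Figure 12 can be sketched in code.  The
   following Python fragment is a minimal, non-normative illustration.
   It assumes, based on the constants shown in the figure, that LEN
   counts the LEN field itself plus everything after it in the unit, so
   that a unit occupies 1 + LEN bytes.  The names Type1Unit,
   build_type1_unit, and parse_type1_units are hypothetical and are not
   defined by this document.

```python
import struct
from typing import List, NamedTuple


class Type1Unit(NamedTuple):
    u_flag: int       # U bit from the first byte
    sidx: int         # sample description index
    sdur: int         # 24-bit sample duration
    text: bytes       # TLEN bytes of text string
    modifiers: bytes  # LEN - 8 - TLEN bytes of modifiers


def build_type1_unit(sidx: int, sdur: int, text: bytes,
                     modifiers: bytes = b"") -> bytes:
    """Serialize one TYPE 1 unit with U=0 and R=0 (assumed defaults)."""
    # Assumption: LEN covers LEN(2) + SIDX(1) + SDUR(3) + TLEN(2) = 8
    # bytes of header plus the text string and the modifiers.
    length = 8 + len(text) + len(modifiers)
    return (struct.pack("!BHB", 0x01, length, sidx)
            + sdur.to_bytes(3, "big")
            + struct.pack("!H", len(text))
            + text + modifiers)


def parse_type1_units(payload: bytes) -> List[Type1Unit]:
    """Walk a payload that contains only aggregated TYPE 1 units."""
    units = []
    pos = 0
    while pos < len(payload):
        if len(payload) - pos < 9:
            raise ValueError("truncated TYPE 1 header")
        first, length, sidx = struct.unpack_from("!BHB", payload, pos)
        if first & 0x07 != 1:
            raise ValueError("not a TYPE 1 unit")
        sdur = int.from_bytes(payload[pos + 4:pos + 7], "big")
        (tlen,) = struct.unpack_from("!H", payload, pos + 7)
        end = pos + 1 + length       # LEN gives the start of the next unit
        text_start = pos + 9
        mod_start = text_start + tlen  # TLEN gives the start of modifiers
        if length < 8 or mod_start > end or end > len(payload):
            raise ValueError("inconsistent LEN/TLEN")
        units.append(Type1Unit(first >> 7, sidx, sdur,
                               payload[text_start:mod_start],
                               payload[mod_start:end]))
        pos = end
    return units
```

   Under these assumptions, parsing the concatenation of two built units
   returns two entries, and the modifiers of each entry are exactly the
   bytes left over after the text string.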

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |V=2|P|X| CC    |M|    PT       |        sequence number        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                           timestamp                           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |           synchronization source (SSRC) identifier            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE5|      LEN( always >3)          |   SIDX        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                                                               |
      |                   sample description (no.bytes=LEN - 3)       |
      |                                                               |
      |                                                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE1|       LEN  (always >=8)       |    SIDX       |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                      SDUR                     |     TLEN      |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |      TLEN     |                                               |
      +-+-+-+-+-+-+-+-+                                               |
      |                  text string fragment (no.bytes=TLEN)         |
      |                                                               |
      |                                                               |
      |                                               +-+-+-+-+-+-+-+-+
      |                                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     Figure 13.  An RTP packet carrying a TYPE 5 and a TYPE 1 unit

   In Figure 13, a sample description and a TYPE 1 unit are aggregated.
   The TYPE 1 unit happens to contain only text strings and is small, so
   an additional TYPE 5 unit is included to take advantage of the
   available bits in the packet.
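
   The aggregation decision illustrated by Figure 13 can be sketched as
   follows.  This Python fragment is a non-normative illustration: the
   functions build_type5_unit, build_type1_unit, and aggregate, and the
   max_payload budget, are hypothetical names introduced here.  As in
   the figures, LEN is assumed to count the LEN field and all bytes
   after it in the unit, and the U and R fields are simply set to zero.

```python
import struct


def build_type5_unit(sidx: int, sample_description: bytes) -> bytes:
    """TYPE 5 unit: U/R/TYPE (1), LEN (2), SIDX (1), then the description."""
    if not sample_description:
        raise ValueError("Figure 13 shows LEN > 3, so the body is non-empty")
    length = 3 + len(sample_description)  # description bytes = LEN - 3
    return struct.pack("!BHB", 0x05, length, sidx) + sample_description


def build_type1_unit(sidx: int, sdur: int, text: bytes,
                     modifiers: bytes = b"") -> bytes:
    """TYPE 1 unit: 9 header bytes, then text string and modifiers."""
    length = 8 + len(text) + len(modifiers)
    return (struct.pack("!BHB", 0x01, length, sidx)
            + sdur.to_bytes(3, "big")
            + struct.pack("!H", len(text))
            + text + modifiers)


def aggregate(type1_unit: bytes, type5_unit: bytes, max_payload: int) -> bytes:
    """Place the TYPE 5 unit ahead of the TYPE 1 unit when both fit.

    max_payload is a hypothetical byte budget for the RTP payload.
    """
    if len(type5_unit) + len(type1_unit) <= max_payload:
        return type5_unit + type1_unit
    return type1_unit
```

   With a small TYPE 1 unit and a generous budget, the result has the
   layout of Figure 13; with a tight budget, only the TYPE 1 unit is
   sent and the sample description has to travel in another packet.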

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |V=2|P|X| CC    |M|    PT       |        sequence number        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                           timestamp                           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |           synchronization source (SSRC) identifier            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE2|          LEN( always >9)      |TOTAL=4|THIS=1 |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                    SDUR                       |    SIDX       |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |               SLEN            |                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
      |                  text string fragment (no.bytes=LEN - 9)      |
      |                                                               |
      :                                                               :
      :                                                               :
      |                                               +-+-+-+-+-+-+-+-+
      |                                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

    Figure 14.  Payload with first text string fragment of a sample

   In Figures 14, 15, and 16, a text sample is split into four fragments
   (TOTAL=4) carried in three RTP packets.  In Figure 14, the text
   string is big and takes the whole packet length.  Figure 15 shows the
   only permitted way of carrying two fragments of the same text sample
   in one packet (see configuration 3 in Section 4.6).  The last packet,
   shown in Figure 16, carries the last modifier fragment, a TYPE 4
   unit.

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |V=2|P|X| CC    |M|    PT       |        sequence number        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                           timestamp                           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |           synchronization source (SSRC) identifier            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE2|          LEN( always >9)      |TOTAL=4|THIS=2 |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                    SDUR                       |    SIDX       |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |               SLEN            |                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+                               |
      |                  text string fragment (no.bytes=LEN - 9)      |
      |                                                               |
      |                                                               |
      |                                                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE3|        LEN( always >6)        |TOTAL=4|THIS=3 |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                      SDUR                     |               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
      |                                                               |
      |                    modifiers (no.bytes=LEN - 6)               |
      |                                               +-+-+-+-+-+-+-+-+
      |                                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

      Figure 15.  An RTP packet carrying a TYPE 2 unit and a TYPE 3 unit

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |V=2|P|X| CC    |M|    PT       |        sequence number        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                           timestamp                           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |           synchronization source (SSRC) identifier            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE4|        LEN( always >6)        |TOTAL=4|THIS=4 |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                      SDUR                     |               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               |
      |                                                               |
      |                    modifiers (no.bytes=LEN - 6)               |
      |                                               +-+-+-+-+-+-+-+-+
      |                                               |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

     Figure 16.  An RTP packet carrying last modifiers fragment (TYPE 4)
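
   The fragment headers of Figures 14, 15, and 16 can be read with a
   short routine.  The Python fragment below is a non-normative sketch.
   It assumes that a unit occupies 1 + LEN bytes and that the fixed
   header sizes are those implied by the figures (LEN - 9 body bytes for
   TYPE 2, LEN - 6 for TYPE 3 and TYPE 4).  The names Fragment,
   parse_fragments, and sample_complete are hypothetical.  The SIDX and
   SLEN fields of TYPE 2 units are skipped, and the completeness check
   is a simplification that ignores loss recovery and reordering.

```python
import struct
from typing import List, NamedTuple


class Fragment(NamedTuple):
    unit_type: int  # 2, 3, or 4
    total: int      # TOTAL: upper 4 bits of the fourth header byte
    this: int       # THIS: lower 4 bits of the fourth header byte
    sdur: int       # 24-bit sample duration
    body: bytes     # text string fragment or modifier fragment


# Header bytes counted by LEN before the body starts, per unit type.
FIXED_BYTES = {2: 9, 3: 6, 4: 6}


def parse_fragments(payload: bytes) -> List[Fragment]:
    """Walk a payload made of TYPE 2, TYPE 3, and TYPE 4 units."""
    frags = []
    pos = 0
    while pos < len(payload):
        if len(payload) - pos < 7:
            raise ValueError("truncated fragment header")
        first, length, total_this = struct.unpack_from("!BHB", payload, pos)
        unit_type = first & 0x07
        if unit_type not in FIXED_BYTES:
            raise ValueError("not a fragment unit")
        sdur = int.from_bytes(payload[pos + 4:pos + 7], "big")
        body_start = pos + 1 + FIXED_BYTES[unit_type]
        end = pos + 1 + length
        # The figures show LEN strictly greater than the fixed size,
        # so an empty body is treated as an error here.
        if body_start >= end or end > len(payload):
            raise ValueError("inconsistent LEN")
        frags.append(Fragment(unit_type, total_this >> 4, total_this & 0x0F,
                              sdur, payload[body_start:end]))
        pos = end
    return frags


def sample_complete(frags: List[Fragment]) -> bool:
    """True when THIS values 1..TOTAL are all present with one TOTAL."""
    if not frags:
        return False
    total = frags[0].total
    return (all(f.total == total for f in frags)
            and {f.this for f in frags} == set(range(1, total + 1)))
```

   Applied to the three payloads of Figures 14 to 16, the routine would
   yield four fragments with THIS values 1 through 4, at which point the
   sample can be treated as complete under the assumptions above.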

4.8.  Relation to RFC 3640

   RFC 3640 [12] defines a payload format for the transport of any non-
   multiplexed MPEG-4 elementary stream.  One of the various MPEG-4
   elementary stream types is MPEG-4 timed text streams, specified in
   MPEG-4 part 17 [26], also known as ISO/IEC 14496-17.  MPEG-4 timed
   text streams are capable of carrying 3GPP timed text data, as
   specified in 3GPP TS 26.245 [1].

   MPEG-4 timed text streams are intentionally constructed so as to
   guarantee interoperability between RFC 3640 and this payload format.
   This means that the construction of the RTP packets carrying timed
   text is the same.  That is, the MPEG-4 timed text elementary stream
   as per ISO/IEC 14496-17 is identical to the (aggregate) payloads
   constructed using this payload format.

   Figure 17 illustrates the process of constructing an RTP packet
   containing timed text.  As can be seen in the partition block, the
   (transport) units used in this payload format are identical to the
   Timed Text Units (TTUs) defined in ISO/IEC 14496-17.  Likewise, the
   rules for payload aggregation as per Section 4.6 are identical to
   those defined in ISO/IEC 14496-17 and are compliant with RFC 3640.
   As a result, an RTP packet that uses this payload format is identical
   to an RTP packet using RFC 3640 conveying TTUs according to ISO/IEC
   14496-17.  In particular, MPEG-4 Part 17 specifies that when using
   RFC 3640 for transporting timed text streams, the "streamType"
   parameter value is set to 0x0D, and the value of the
   "objectTypeIndication" in "config" takes the value 0x08.

                +--------------------------------------+
   Text samples | +--------------+   +--------------+  |
   as per 3GPP  | |Text Sample 1 |   |Text Sample N |  |
   TS 26.245    | +--------------+   +--------------+  |
                +--------------------------------------+
                                  \/
   +-------------------------------------------------------------------+
   | Partition Text Samples into units.  TTU[i]= TYPE i units.         |
   |                                                                   |
   |[U R TYPE LEN][{TOTAL,THIS}SIDX{SDUR}{TLEN}{SLEN}][SampleContents] |
   |{..} means present if applicable, [..] means always present        |
   +-------------------------------------------------------------------+
                   \/                                \/
   +-------------------------------------------------------------------+
   |                      Aggregation (if possible)                    |
   +-------------------------------------------------------------------+
                   \/                                \/
   +-------------------------------------------------------------------+
   | RTP Entity adds and fills RTP header and Sends RTP packet, where  |
   |  RTP packets according to this Payload Format =                   |
   |  RTP packets carrying MPEG-4 Timed Text ES over RFC 3640          |
   +-------------------------------------------------------------------+

                     Figure 17.  Relation to RFC 3640

   Note: The use of RFC 3640 for transport of ISO/IEC 14496-17 data does
   not require any new SDP parameters or any new mode definition.

4.9.  Relation to RFC 2793

   RFC 2793 [22] and its revision, RFC 4103 [23], specify a protocol for
   enabling text conversation.  Typical applications of that payload
   format are text communication terminals and text conferencing tools.
   Text session contents are specified in ITU-T Recommendation T.140
   [24].  T.140 text is UTF-8 coded as specified in T.140 [24] with no
   extra framing.  The T140block contains one or more T.140 code
   elements as specified in T.140.  Code elements are control sequences
   such as "New Line", "Interrupt", "String Terminator", or "Start of
   String".  Most T.140 code elements are single ISO 10646 [25]
   characters, but some are multiple character sequences.  Each
   character is UTF-8 encoded [18] into one or more octets.
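
   The statement that each character occupies one or more octets can be
   checked with a few lines of Python.  This is only an illustration:
   the helper name utf8_octet_counts is hypothetical, and the sample
   characters are arbitrary choices, not values taken from T.140.

```python
from typing import List, Tuple


def utf8_octet_counts(text: str) -> List[Tuple[str, int]]:
    """Return (code point, number of UTF-8 octets) for each character."""
    return [(f"U+{ord(ch):04X}", len(ch.encode("utf-8"))) for ch in text]


# LATIN CAPITAL LETTER A, LATIN SMALL LETTER E WITH ACUTE, LINE SEPARATOR
counts = utf8_octet_counts("A\u00e9\u2028")
# counts == [("U+0041", 1), ("U+00E9", 2), ("U+2028", 3)]
```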

   This payload format may also be used for conversational applications
   (even for instant messaging).  However, this is not its main target.
   The differentiating feature of 3GPP Timed Text media format is that
   it allows text decoration.  This is especially useful in multimedia
   presentations, karaoke, commercial banners, news tickers, clickable
   text strings, and captions.  T.140 text contents used in RFC 2793 do
   not allow the use of text decoration.

   Furthermore, the conversational text RTP payload format recommends a
   method to include redundant text from already transmitted packets in
   order to reduce the risk of text loss caused by packet loss.  With
   that method, payloads would include a redundant copy of the last
   payload sent.  This payload format does not describe such a method,
   but the same approach is applicable here.  As explained in Section 5,
   packet redundancy SHOULD be used whenever possible.  The aggregation
   guidelines in Section 4.6 allow redundant payloads.
