RFC 4396

RTP Payload Format for 3rd Generation Partnership Project (3GPP) Timed Text

Pages: 66
Proposed Standard

RFC4396 - Page 1

Network Working Group                                             J. Rey
Request for Comments: 4396                                     Y. Matsui
Category: Standards Track                                      Panasonic
                                                           February 2006


                           RTP Payload Format
       for 3rd Generation Partnership Project (3GPP) Timed Text

Status of This Memo

   This document specifies an Internet standards track protocol for the
   Internet community, and requests discussion and suggestions for
   improvements.  Please refer to the current edition of the "Internet
   Official Protocol Standards" (STD 1) for the standardization state
   and status of this protocol.  Distribution of this memo is unlimited.

Copyright Notice

   Copyright (C) The Internet Society (2006).

Abstract

   This document specifies an RTP payload format for the transmission of
   3GPP (3rd Generation Partnership Project) timed text.  3GPP timed
   text is a time-lined, decorated text media format with defined
   storage in a 3GP file.  Timed Text can be synchronized with
   audio/video contents and used in applications such as captioning,
   titling, and multimedia presentations.  In the following sections,
   the problems of streaming timed text are addressed, and a payload
   format for streaming 3GPP timed text over RTP is specified.

Table of Contents

   1. Introduction ....................................................3
   2. Motivation, Requirements, and Design Rationale ..................3
      2.1. Motivation .................................................3
      2.2. Basic Components of the 3GPP Timed Text Media Format .......4
      2.3. Requirements ...............................................5
      2.4. Limitations ................................................6
      2.5. Design Rationale ...........................................7
   3. Terminology ....................................................10
   4. RTP Payload Format for 3GPP Timed Text .........................12
      4.1. Payload Header Definitions ................................13
           4.1.1. Common Payload Header Fields .......................15
           4.1.2. TYPE 1 Header ......................................17
           4.1.3. TYPE 2 Header ......................................20
           4.1.4. TYPE 3 Header ......................................23
           4.1.5. TYPE 4 Header ......................................24
           4.1.6. TYPE 5 Header ......................................25
      4.2. Buffering of Sample Descriptions ..........................25
           4.2.1. Dynamic SIDX Wraparound Mechanism ..................26
      4.3. Finding Payload Header Values in 3GP Files ................28
      4.4. Fragmentation of Timed Text Samples .......................31
      4.5. Reassembling Text Samples at the Receiver .................33
      4.6. On Aggregate Payloads .....................................35
      4.7. Payload Examples ..........................................39
      4.8. Relation to RFC 3640 ......................................43
      4.9. Relation to RFC 2793 ......................................44
   5. Resilient Transport ............................................45
   6. Congestion Control .............................................46
   7. Scene Description ..............................................47
      7.1. Text Rendering Position and Composition ...................47
      7.2. SMIL Usage ................................................48
      7.3. Finding Layout Values in a 3GP File .......................48
   8. 3GPP Timed Text Media Type .....................................49
   9. SDP Usage ......................................................53
      9.1. Mapping to SDP ............................................53
      9.2. Parameter Usage in the SDP Offer/Answer Model .............53
           9.2.1. Unicast Usage ......................................54
           9.2.2. Multicast Usage ....................................57
      9.3. Offer/Answer Examples .....................................58
      9.4. Parameter Usage outside of Offer/Answer ...................60
   10. IANA Considerations ...........................................60
   11. Security Considerations .......................................60
   12. References ....................................................61
      12.1. Normative References .....................................61
      12.2. Informative References ...................................61
   13. Basics of the 3GP File Structure ..............................64
   14. Acknowledgements ..............................................65

1.  Introduction

   3GPP timed text is a media format for time-lined, decorated text
   specified in the 3GPP Technical Specification TS 26.245, "Transparent
   end-to-end packet switched streaming service (PSS); Timed Text Format
   (Release 6)" [1].  Besides plain text, the 3GPP timed text format
   allows the creation of decorated text such as that for karaoke
   applications, scrolling text for newscasts, or hyperlinked text.
   These contents may or may not be synchronized with other media, such
   as audio or video.

   The purpose of this document is to provide a means to stream 3GPP
   timed text contents using RTP [3].  This includes the streaming of
   timed text being read out of a (3GP) file, as well as the streaming
   of timed text generated in real-time, a.k.a. live streaming.

   Section 2 contains the motivation for this document, an overview of
   the media format, the requirements, and the design rationale.
   Section 3 defines the terminology used.  Section 4 specifies the
   payload headers, the fragmentation and re-assembly rules for text
   samples, the rules for payload aggregation, and the relations of this
   document to RFC 3640 [12] and RFC 2793 [22].  Section 5 specifies
   some simple schemes for resilient transport and gives pointers to
   other possible mechanisms.  Section 6 addresses congestion control.
   Section 7 specifies scene description.  Section 8 defines the media
   type.  Section 9 specifies SDP for unicast and multicast sessions,
   including usage in the Offer/Answer model [13].  Sections 10 and 11
   address IANA and security considerations.  Section 12 lists
   references.  Basics of the 3GP File Structure are in Section 13.

2.  Motivation, Requirements, and Design Rationale

2.1.  Motivation

   The 3GPP timed text format was developed for use in the services
   specified in the 3GPP Transparent End-to-end Packet-switched
   Streaming Services (3GPP PSS) specification [16].

   As of today, PSS allows downloading 3GPP timed text contents stored
   in 3GP files.  However, due to the lack of an RTP payload format, it
   is not possible to stream 3GPP timed text contents over RTP.

   This document specifies such a payload format.

2.2.  Basic Components of the 3GPP Timed Text Media Format

   Before going into the details of the design, it is necessary to know
   how the media format is constructed.  We can identify four distinct
   functional components: layout information, default formatting, text
   strings, and decoration.  In the following, we briefly explain these
   and match them to their designations in a 3GP file:

        o Initial spatial layout information related to the text
          strings: These are the height and width of the text region
          where text is displayed, the position of the text region in
          the display, and the layer or proximity of the text to the
          user.  In 3GP files, this information is contained in the
          Track Header Box (3GP file designations are capitalized for
          clarity).

        o Default settings for formatting and positioning of text: style
          (font, size, color,...), background color, horizontal and
          vertical justification, line width, scrolling, etc.  For 3GP
          files, this corresponds to the Sample Descriptions.

        o The actual text strings: encoded characters using either UTF-8
          [18] or UTF-16 [19] encoding.

        o The decoration: If some characters have different style,
          delay, blink, etc., this needs to be indicated.  The
          decoration is only present in the text samples if it is
          actually needed.  Otherwise, the default settings as above
          apply.  In 3GP files, within each Text Sample, the decoration
          (i.e., Modifier Boxes) is appended to the text strings, if
          needed.  At the time of writing this payload format, the
          following modifiers are specified in the 3GPP timed text media
          format specification [1]:

           - text highlight
           - highlight color
           - blinking text
           - karaoke feature
           - hyperlink
           - text delay
           - text style
           - positioning of the text box
           - text wrap indication

2.3.  Requirements

   Once the basic components are known, it is necessary to define which
   requirements the payload format shall fulfill:

     1. It shall enable both live streaming and streaming from a 3GP
        file.

                Informative note: For the purpose of this document, the
                term "live streaming" refers to those scenarios where
                the timed text stream is sent from a live encoder.  Upon
                reception, the content may or may not be stored in a 3GP
                file.  Typically, in live streaming applications, the
                sender encapsulates the timed text content in RTP
                packets following the guidelines given in this document.
                At the receiving side, a buffer is used to cancel the
                network delay and delay jitter.  If receiver and sender
                support packet loss resilience mechanisms (see Section
                5), it may also be possible to recover from packet
                losses.  Note that how sender and receiver actually
                manage and dimension the buffers is an implementation
                design choice.

     2. Furthermore, it shall be possible for an RTP receiver using this
        payload format, and capable of storing in 3GP format, to obtain
        all necessary information from the RTP packets for storing the
        received text contents according to the 3GP file format.  This
        file may or may not be the same as the original file.

                Informative note: The 3GP file format itself is based on
                the ISO Base Media File Format recommendation [2].
                Section 13 gives some insight into the 3GP file
                structure.  Further, Sections 4.3 and 7.3 specify where
                the information needed for filling in payload headers is
                found in a 3GP file.  For live streaming, appropriate
                values complying with the format and units described in
                [1] shall be used.  Where needed, clarifications on
                appropriate values are given in this document.

     3. It shall enable efficient and resilient transport of timed text
        contents over RTP.  In particular:

          a. Enable the transmission of the sample descriptions by both
             out-of-band and in-band means.  Sample descriptions are
             important information, which potentially apply to several
             text samples.  These default formatting settings are
             typically transmitted out-of-band (reliably) once at the
             initialization phase.  If additional sample descriptions
             are needed in the course of a session, these may also be
             sent out-of-band or in-band.  In-band transmission,
             although unreliable, may be more appropriate for sending
             sample descriptions if these should be sent frequently, as
             opposed to establishing an additional communication channel
             for SDP, for example.  It is also useful in cases where an
             out-of-band channel may not be available and for live
             streaming, where contents are not known a priori.  Thus,
             the payload format shall enable out-of-band and in-band
             transmission of sample descriptions.  Section 4.1.6
             specifies a payload header for transmitting sample
             descriptions in-band.  Section 9 specifies how sample
             descriptions are mapped to SDP.

          b. Enable the fragmentation of a text sample into several RTP
             packets in order to cover a wide range of applications and
             network environments.  In general, fragmentation should be
             a rare event, given the low bit rates and relatively small
             text sample sizes.  However, the 3GPP Timed Text media
             format does allow for larger text samples.  Therefore, the
             payload format shall take this into account and provide a
             means for coping with fragmentation and reassembly.
             Section 4.4 deals with fragmentation.

          c. Enable the aggregation of units into an RTP packet for
             making the transport more efficient.  In a mobile
             communication environment, a typical text sample size is
             around 100-200 bytes.  If the available bit rate and the
             packet size allow it, units should be aggregated into one
             RTP packet.  Section 4.6 deals with aggregation.

          d. Enable the use of resilient transport mechanisms, such as
             repetition, retransmission [11], and FEC [7] (see Section
             5).  For a more general discussion, refer to RFC 2354 [8],
             which discusses available mechanisms for stream repair.

2.4.  Limitations

     The payload headers have been optimized in size for RTP.  Instead
     of using 32-bit (S)LEN, SDUR, and SIDX header fields, which would
     carry many unused bits much of the time, it has been a design
     choice to reduce the size of these fields.  As a consequence, this
     payload format has reduced maximum values with respect to sizes and
     durations of (text) samples and sample descriptions.  These maximum
     values differ from those allowed in 3GP files, where they are
     expressed using 32-bit (unsigned) integers.  In some cases,
     extension mechanisms are provided to deal with larger values.
     However, it is noted that the values used here should be enough for
     the streaming applications targeted.

     The following limitations apply:

     1. The maximum size of text samples carried in RTP packets is
        restricted to be a 16-bit (unsigned) integer (this includes the
        text strings and modifiers).  This means a maximum size for the
        unit would be about 64 Kbytes.  No extension mechanism is
        provided.

     2. The sample description index values are restricted to be an 8-
        bit (unsigned) integer.  An extension mechanism is given in
        Section 4.3.

     3. The text sample duration is restricted to be a 24-bit (unsigned)
        integer.  This yields a maximum duration at a timestamp
        clockrate of 1000 Hz of about 4.6 hours.  Nevertheless, an
        extension mechanism is provided in Section 4.3.

     4. Sample descriptions are also restricted in size: If the size
        cannot be expressed as a 16-bit (unsigned) integer, the sample
        description shall not be conveyed.  As in the case of the sample
        size, no extension mechanism is provided.

     5. A further limitation concerns the UTF-16 encodings supported:
        Only transport of text strings following big endian byte order
        is supported.  See Section 4.1.1 for details.
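
     The maximum values implied by the field widths listed above can be
     checked with a few lines of arithmetic.  The following is only an
     illustrative sketch: the constant and function names are
     hypothetical and do not appear in this specification.

```python
# Maximum values implied by the reduced field widths in Section 2.4.
# All names below are illustrative, not defined by the payload format.
LEN_BITS = 16    # size of text samples and sample descriptions
SIDX_BITS = 8    # sample description index
SDUR_BITS = 24   # text sample duration, in timestamp ticks


def max_unsigned(bits: int) -> int:
    """Largest value an unsigned integer of the given width can hold."""
    return (1 << bits) - 1


def max_duration_hours(clockrate_hz: int) -> float:
    """Longest duration a 24-bit SDUR can express, in hours."""
    return max_unsigned(SDUR_BITS) / clockrate_hz / 3600


print(max_unsigned(LEN_BITS))              # 65535 bytes, about 64 Kbytes
print(max_unsigned(SIDX_BITS))             # 255
print(round(max_duration_hours(1000), 2))  # 4.66 hours at 1000 Hz
```

     A higher clockrate shortens the maximum expressible duration
     proportionally; for instance, the same 24-bit field covers a tenth
     of the time at 10000 Hz.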

2.5.  Design Rationale

   The following design choices were made:

     1. 'Unit' approach: The payload formats specified in this document
        follow a simple scheme: a 3-byte common header (Common Payload
        Header) followed by a specific header for each text sample
        (fragment) type.  Following these headers, the text sample
        contents are placed (Section 4.1.1 and following).  This
        structure is called a 'unit'.

        The following units have been devised to comply with the
        requirements mentioned in Section 2.3:

          a. A TYPE 1 unit that contains one complete text sample,

          b. A TYPE 2 unit that contains a complete text string or a
             fragment thereof,

          c. A TYPE 3 unit that contains the complete modifiers or only
             the first fragment thereof,

          d. A TYPE 4 unit that contains one modifier fragment other
             than the first, and

          e. A TYPE 5 unit that contains one sample description.

        This 'unit' approach was motivated by the following reasons:

              1. Allows a simple classification of the text samples and
                 text sample fragments that can be conveyed by the
                 payload format.

              2. Enables easy interoperability with RFC 3640 [12].
                 During the development of this payload format, interest
                 was shown from MPEG-4 standardization participants in
                 developing a common payload structure for the transport
                 of 3GPP Timed Text.  While interoperability is not
                 strictly necessary for this payload format to work, it
                 has been pursued in this payload format.  Section 4.8
                 explains how this is done.

     2. Character count is not implemented.  This payload format does
        detect lost text sample fragments, but it does not enable an
        RTP receiver to find out the exact number of text characters
        lost.  In fact, the fragment size included in the payload
        headers does not help in finding the number of lost characters
        because the UTF-8/UTF-16 [18][19] encodings used yield a
        variable number of bytes per character.

        For finding the exact number of lost characters, an additional
        field reflecting the character count (and possibly the character
        offset) upon fragmentation would be required.  This would
        additionally require that the entity performing fragmentation
        count the characters included in each text fragment.

        One benefit of having a character count would be that the
        display application would be able to replace missing characters
        through some other character representing character loss.  For
        example:

             If we take the text "Some text is lost now" and assume the
             loss of a packet containing the text in the middle, this
             could be displayed (with a character count):

             "Some ############now"

             As opposed to:

             "Some #now"

             which is what this payload format enables ("#" indicates a
             missing character or packet, respectively).

        However, it is the consensus of the working group that for
        applications such as subtitling applications and multimedia
        presentations that use this payload format, such partial error
        correction is not worth the cost of including two additional
        fields; namely, character count and character offset.  Instead,
        it is recommended that some more overhead be invested to provide
        full error correction by protecting the text sample fragments
        using the measures outlined in Section 5.

     3. Fragment re-assembly: In order to re-assemble the text samples,
        offset information is needed.  Instead of a character or byte
        offset, a single byte, TOTAL/THIS, is used.  These two values
        indicate the total number and current index of fragments of a
        text sample.  This is simpler than having a character offset
        field in each fragment.  Details are in Section 4.1.3.

     4. A length field, LEN, is present in the common header fields.
        While the length in the RTP payload format is not needed by most
        RTP applications (typically lower layers, like UDP, provide this
        information), it does ease interoperability with RFC 3640.  This
        is because the Access Units (AUs) used for carriage of data in
        RFC 3640 must include a length indication.  Details are in
        Section 4.8.

     5. The header fields in the specific payload headers (TYPE headers
        in Sections 4.1.2 to 4.1.6) have been arranged for easy
        processing on 32-bit machines.  For this reason, the fields SIDX
        and SDUR are swapped in TYPE 1 unit, compared to the other
        units.
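
   Design choice 2 above rests on the observation that a byte count
   does not determine a character count.  The following sketch
   illustrates this for a few sample strings.  The function name is
   hypothetical, and "characters" here means Unicode code points as
   counted by Python's len(), which is an assumption of this example
   rather than a definition taken from this document.

```python
def encoded_sizes(text: str) -> tuple[int, int, int]:
    """Return (code points, UTF-8 bytes, UTF-16 big-endian bytes)."""
    # The "utf-16-be" codec serializes in network byte order, no BOM.
    utf8 = text.encode("utf-8")
    utf16 = text.encode("utf-16-be")
    return len(text), len(utf8), len(utf16)


samples = [
    "text",                      # ASCII only
    "t\u00e4xt",                 # one Latin-1 letter with diaeresis
    "\u30c6\u30ad\u30b9\u30c8",  # four Katakana characters
    "\U0001F600",                # one character outside the BMP
]
for sample in samples:
    print(encoded_sizes(sample))
```

   The first two strings have the same number of characters but
   different UTF-8 lengths, and the last string occupies four bytes in
   both encodings for a single character.  A receiver that knows only
   the size of a lost fragment therefore cannot derive how many
   characters it contained.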

3.  Terminology

   The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT",
   "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
   document are to be interpreted as described in RFC 2119 [5].

   Furthermore, the following terms are used and have specific meaning
   within the context of this document:

   text sample or whole text sample

        In the 3GPP Timed Text media format [1], these terms refer to a
        unit of timed text data as contained in the source (3GP) file.
        This includes the text string byte count, possibly a Byte Order
        Mark, the text string and any modifiers that may follow.  Its
        equivalent in audio/video would be a frame.

        In this document, however, a text sample contains only text
        strings followed by zero or more modifiers.  This definition of
        text sample excludes the 16-bit text string byte count and the
        16-bit Byte Order Mark (BOM) present in 3GP file text samples
        (see Section 4.3 and Figure 9).  The 16-bit BOM is not
        transported in RTP, as explained in Section 4.1.1.

   text strings

        The actual text characters encoded either as UTF-8 or UTF-16.
        When using this payload format, the text string does not contain
        any byte order mark (BOM).  See Figure 9 for details.

   fragment or text sample fragment

        A fraction of a text sample.  A fragment may contain either text
        strings or modifier (decoration) contents, but not both at the
        same time.

   sample contents

        General term to identify timed text data transported when using
        this payload format.  Sample contents may be one or several text
        samples, sample descriptions, and sample fragments (note that,
        as per Section 4.6, there is only one case in which more than
        one fragment may be included in a payload).

   decoration or modifiers

        These terms are used interchangeably throughout the document to
        denote the contents of the text sample that modify the default
        text formatting.  Modifiers may, for example, specify different
        font size for a particular sequence of characters or define
        karaoke timing for the sample.

   sample description

        Information that is potentially shared by more than one text
        sample.  In a 3GP file, a sample description is stored in a
        place where it can be shared.  It contains setup and default
        information such as scrolling direction, text box position,
        delay value, default font, background color, etc.

   units or transport units

        The payload headers specified in this document encapsulate text
        samples, fragments thereof, and sample descriptions by placing a
        common header and specific payload header (Sections 4.1.1 to
        4.1.6) before them, thus building what is here called a
        (transport) unit.

   aggregation or aggregate packet

        The payload of an aggregate (RTP) packet consists of several
        (transport) units.

   track or stream

        3GP files contain audio/video and text tracks.  This document
        enables streaming of text tracks using RTP.  Therefore, these
        terms are used interchangeably in this document in the context
        of 3GP files.

   Media Header Box / Track Header Box / ...

        The 3GP file format makes use of these structures defined in the
        ISO Base Media File Format [2].  When referring to these in this
        document, initials are capitalized for clarity.

4.  RTP Payload Format for 3GPP Timed Text

   The format of an RTP packet containing 3GPP timed text is shown
   below:

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |V=2|P|X| CC    |M|    PT       |        sequence number        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                           timestamp                           |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |           synchronization source (SSRC) identifier            |
     /+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
    | |U|   R   | TYPE|             LEN               |               :
    | +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+               :
   U| :           (variable header fields depending on TYPE)          :
   N| :                                                               :
   I< +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   T| |                                                               |
    | :                    SAMPLE CONTENTS                            :
    | |                                               +-+-+-+-+-+-+-+-+
    | |                                               |
     \+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

               Figure 1. 3GPP Timed Text RTP Packet Format

   Marker bit (M): The marker bit SHALL be set to 1 if the RTP packet
   includes one or more whole text samples or the last fragment of a
   text sample; otherwise, it is set to zero (0).
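
   The marker bit rule above can be sketched as follows.  This is a
   minimal illustration, not a normative algorithm: the Unit class and
   its two flags are hypothetical summaries of what a sender already
   knows about each unit it places in a payload.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """Hypothetical summary of one unit carried in an RTP payload."""
    whole_sample: bool = False   # unit carries one complete text sample
    last_fragment: bool = False  # unit carries a sample's final fragment


def marker_bit(units: list[Unit]) -> int:
    """Return 1 if any unit is a whole sample or a last fragment."""
    return int(any(u.whole_sample or u.last_fragment for u in units))


print(marker_bit([Unit(whole_sample=True), Unit(whole_sample=True)]))
print(marker_bit([Unit()]))  # e.g., a non-final fragment on its own
```

   Under these assumptions, the first call yields 1 and the second
   yields 0.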

   Timestamp: The timestamp MUST indicate the sampling instant of the
   earliest (or only) unit contained in the RTP packet.  The initial
   value SHOULD be randomly determined, as specified in RTP [3].

        The timestamp value should provide enough timing resolution for
        expressing the duration of text samples, for synchronizing text
        with other media, and for performing RTP Control Protocol (RTCP)
        measurements such as the interarrival delay jitter or the RTCP
        Packet Receipt Times Report Block (Section 4.3 of RFC 3611
        [20]).  This is compliant with RTP, Section 5.1:

             "The resolution of the clock MUST be sufficient for the
             desired synchronization accuracy and for measuring packet
             arrival jitter (one tick per video frame is typically not
             sufficient)".

        The above observation applies to both timed text tracks included
        in a 3GP file and live streaming sessions.  In the case of a 3GP
        timed text track, the timestamp clockrate is the value of the
        "timescale" parameter in the Media Header Box for that text
        track.  Each track in a 3GP file MAY have its own clockrate as
        specified in the Media Header Box.  Likewise, live streaming
        applications SHALL use an appropriate timestamp clockrate.  A
        default value of 1000 Hz is RECOMMENDED.  Other timestamp
        clockrates MAY be used.  In this case, the typical behavior is
        to match the 3GPP timed text clockrate to that used by an
        associated audio or video stream.

        In an aggregate payload, units MUST be placed in play-out order,
        i.e., earliest first in the payload.  If TYPE 1 units are
        aggregated, the timestamp of the subsequent units MUST be
        obtained by adding the timed text sample duration of previous
        samples to the RTP timestamp value.  There are two exceptions to
        this rule: TYPE 5 units and an aggregate payload containing two
        fragments of the same text sample.  The details of the timestamp
        calculation are given in Section 4.6.
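
        For the common case of aggregated TYPE 1 units, the rule above
        amounts to a running sum of sample durations.  The sketch below
        assumes that durations are already expressed in timestamp
        ticks and that all units are TYPE 1; the function name is
        hypothetical.  The 32-bit wraparound follows from the size of
        the RTP timestamp field.  The two exceptions mentioned above
        are not covered.

```python
RTP_TS_MODULUS = 1 << 32  # the RTP timestamp field is 32 bits wide


def type1_unit_timestamps(rtp_timestamp: int,
                          durations: list[int]) -> list[int]:
    """Timestamps of consecutive TYPE 1 units in one aggregate payload.

    rtp_timestamp is the value carried in the RTP header and applies
    to the first unit.  durations lists each sample's duration in
    timestamp ticks, in play-out order.
    """
    timestamps = []
    current = rtp_timestamp
    for duration in durations:
        timestamps.append(current % RTP_TS_MODULUS)
        current += duration  # next unit starts when this one ends
    return timestamps


# Three samples lasting 500, 250, and 100 ticks (ms at 1000 Hz).
print(type1_unit_timestamps(1000, [500, 250, 100]))
```

        With an RTP timestamp of 1000, the three units in this example
        start at 1000, 1500, and 1750 ticks, respectively.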

        Finally, timestamp clockrates MUST be signaled by out-of-band
        means at session setup, e.g., using the media type "rate"
        parameter in SDP.  See Section 9 for details.

   Payload Type (PT): The payload type is set dynamically and sent by
   out-of-band means.

   The usage of the remaining RTP header fields (namely, V, P, X, CC,
   sequence number, and SSRC) follows the rules of RTP and the profile
   in use.

4.1.  Payload Header Definitions

   The (transport) units specified in this document consist of a set of
   common fields (U, R, TYPE, LEN), followed by specific header fields
   (TYPES 1-5) and text sample contents.  See Figure 1 and Figure 2.

   In Figure 2, two example RTP packets are depicted.  The first
   contains an aggregate RTP payload with two complete text samples, and
   the second contains one text sample fragment.  After each unit header
   is explained, detailed payload examples follow in Section 4.7.

                                        +----------------------+
                                        |                      |
                                        |   RTP Header         |
                                        |                      |
                               ---------+----------------------+
                               |        |                      |
                               |        |COMMON + TYPE 1 Header|
                               |        ........................
                        UNIT 1 -        |                      |
                               |        |    Text Sample       |
                               |        |                      |
                               |-------\........................
                                -------/|                      |
                               |        |COMMON + TYPE 1 Header|
                               |        ........................
                        UNIT 2 -        |                      |
                               |        |    Text Sample       |
                               |        |                      |
                               |        |                      |
                               ---------+----------------------+

                                        +----------------------+
                                        |                      |
                                        |   RTP Header         |
                                        |                      |
                               ---------+----------------------+
                               |        |  COMMON + TYPE 2     |
                               |        |    (or 3 or 4) Hdr   |
                               |        ........................
                        UNIT 3 -        |                      |
                               |        | Text Sample Fragment |
                               |        |                      |
                               |        |                      |
                               ---------+----------------------+

                     Figure 2.  Example RTP packets

4.1.1.  Common Payload Header Fields

   The fields common to all payload headers have the following format:

            0                   1                   2
            0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
           |U|   R   |TYPE |             LEN               |
           +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                Figure 3.  Common payload header fields

   Where:

   o U (1 bit) "UTF Transformation flag": This is used to inform RTP
     receivers whether UTF-8 (U=0) or UTF-16 (U=1) was used to encode
     the text string.  UTF-16 text strings transported by this payload
     format MUST be serialized in big endian order, a.k.a. network byte
     order.

        Informative note: Timed text clients complying with the 3GPP
        Timed Text format [1] are only required to understand the big
        endian serialization.  Thus, in order to ease interoperability,
        the reverse serialization (little endian) is not supported by
        this payload format.

     For the payload formats defined in this document, the U bit is only
     used in TYPE 1 and TYPE 2 headers.  Senders MUST set the U bit to
     zero in TYPE 3, TYPE 4, and TYPE 5 headers.  Consequently,
     receivers MUST ignore the U bit in TYPE 3, TYPE 4, and TYPE 5
     headers.

   o R (4 bits) "Reserved bits": for future extensions.  This field MUST
     be set to zero (0x0) and MUST be ignored by receivers.

   o TYPE (3 bits) "Type Field": This field specifies which specific
     header fields follow.  The following TYPE values are defined:

        - TYPE 1, for a whole text sample.
        - TYPE 2, for a text string fragment (without modifiers).
        - TYPE 3, for a whole modifier box or the first fragment of a
          modifier box.
        - TYPE 4, for a modifier fragment other than first.
        - TYPE 5, for a sample description.  Exactly one header per
          sample description.
        - TYPE 0, 6, and 7 are reserved for future extensions.  Note
          that future extensions are possible, e.g., a unit that
          explicitly signals the number of characters present in a
          fragment (see Section 2.5).  In order to guarantee backwards-
          compatibility, it SHALL be possible that older clients ignore
          (newer) units they do not understand, without invalidating the
          timestamp calculation mechanisms or otherwise preventing them
          from decoding the other units.

   o Finally, the LEN (16 bits) "Length Field": indicates the size (in
     bytes) of this header field and all the fields following, i.e., the
     LEN field followed by the unit payload: text strings and modifiers
     (if any).  This definition only excludes the initial U/R/TYPE byte
     of the common header.  The LEN field follows network byte order.

     The way in which LEN is obtained when streaming out of a 3GP file
     depends on the particular unit type.  This is explained for each
     unit in the sections below.

     For live streaming, both sample length and the LEN value for the
     current fragment MUST be calculated during the sampling process or
     during fragmentation.

     In general, LEN may take the following values:

      - TYPE = 1, LEN >= 8
      - TYPE = 2, LEN > 9
      - TYPE = 3, LEN > 6
      - TYPE = 4, LEN > 6
      - TYPE = 5, LEN > 3

     Receivers MUST discard units that do not comply with these values.
     However, the RTP header fields and the rest of the units in the
     payload (if any) are still useful, as guaranteed by the requirement
     for future extensions above.

     In the following subsections the different payload headers for the
     values of TYPE are specified.
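
   The common header rules above can be sketched in Python as follows.
   This is an illustrative sketch, not part of the payload format: the
   names parse_common_header, iter_units, and MIN_LEN are hypothetical.
   It assumes that a receiver uses LEN to step over units it discards
   (reserved TYPE values or LEN below the minimum), and that it stops
   parsing when LEN points past the end of the payload.

```python
import struct

# Smallest acceptable LEN per TYPE, taken from the list above
# (TYPE 1: LEN >= 8; the other limits are strict, hence "+1").
MIN_LEN = {1: 8, 2: 10, 3: 7, 4: 7, 5: 4}


def parse_common_header(buf, offset=0):
    """Return (u, r, unit_type, length) for the unit starting at offset."""
    first, length = struct.unpack_from("!BH", buf, offset)
    u = first >> 7            # 1 bit: 0 = UTF-8, 1 = UTF-16
    r = (first >> 3) & 0x0F   # 4 reserved bits, ignored by receivers
    unit_type = first & 0x07  # 3 bits
    return u, r, unit_type, length


def iter_units(payload):
    """Yield (u, unit_type, body) for each acceptable unit in a payload.

    body holds the bytes that follow the LEN field (LEN - 2 bytes).
    """
    offset = 0
    while offset + 3 <= len(payload):
        u, _r, unit_type, length = parse_common_header(payload, offset)
        end = offset + 1 + length  # LEN excludes only the first byte
        if length < 2 or end > len(payload):
            break  # assumption: the next unit cannot be located reliably
        if unit_type in MIN_LEN and length >= MIN_LEN[unit_type]:
            yield u, unit_type, payload[offset + 3:end]
        # Reserved types (0, 6, 7) and too-short units are skipped.
        offset = end
```

   Here payload is the RTP payload, i.e., the bytes after the RTP
   header.  Timestamp calculation for the individual units is not
   covered by this sketch.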

4.1.2.  TYPE 1 Header

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE |       LEN  (always >=8)       |    SIDX       |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                      SDUR                     |     TLEN      |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |      TLEN     |
      +-+-+-+-+-+-+-+-+

                    Figure 4.  TYPE 1 Header Format

   This header type is used to transport whole text samples.  This unit
   should be the most common case, i.e., the text sample should usually
   be small enough to be transported in one unit without having to
   separate text strings from modifiers.  In an aggregate (RTP packet)
   payload containing several text samples, every sample is preceded by
   its own TYPE 1 header (see Figure 12).

        Informative note: As indicated in Section 3, "Terminology", a
        text sample is composed of the text strings followed by the
        modifiers (if any).  This is also how text samples are stored in
        3GP files.  The separation of a text sample into text strings
        and modifiers is only needed for large samples (or small
        available IP MTU sizes; see Section 4.4), and it is accomplished
        with TYPE 2 and TYPE 3 headers, as explained in the sections
        below.

   Note also that empty text samples are considered whole text samples,
   although they do not contain sample contents.  Empty text samples may
   be used to clear the display or to put an end to samples of unknown
   duration, for example.  Units without sample contents SHALL have a
   LEN field value of 8 (0x0008).

   The fields above have the following meaning:

   o U, R, and TYPE, as defined in Section 4.1.1.

   o LEN, in this case, represents the length of the (complete) text
     sample plus eight (8) bytes of headers.  For finding the length of
     the text sample in the Sample Size Box of 3GP files, see Section
     4.3.

   o SIDX (8 bits) "Text Sample Entry Index": This is an index used to
     identify the sample descriptions.

     The SIDX field is used to find the sample description corresponding
     to the unit's payload.  There are two types of SIDX values: static
     and dynamic.

     Static SIDX values are used to identify sample descriptions that
     MUST be sent out-of-band and MUST remain active during the whole
     session.  A static SIDX value is unequivocally linked to one
     particular sample description during the whole session.  Carrying
     many sample descriptions out-of-band SHOULD be avoided, since these
     may become large and, ultimately, transport is not the goal of the
     out-of-band channel.  Thus, this feature is RECOMMENDED for
     transporting those sample descriptions that provide a set of
     minimum default format settings.  Static SIDX values MUST fall in
     the (closed) interval [129,254].

     Dynamic SIDX values are used for sample descriptions sent in-band.
     Sample descriptions MAY be sent in-band for several reasons:
     because they are generated in real time, for transport resiliency,
     or both.  A dynamic SIDX value is unequivocally linked to one
     particular sample description during the period in which this is
     active in the session, and it SHALL NOT be modified during that
     period.  This period MAY be smaller than or equal to the session
     duration.  This period is not known a priori.  A maximum of 64
     dynamic simultaneously active SIDX values is allowed at any moment.
     Dynamic SIDX values MUST fall in the closed interval [0,127].  This
     should be enough for both recorded content and live streaming
     applications.  Nevertheless, a wraparound mechanism is provided in
     Section 4.2.1 to handle streaming sessions where more than 64 SIDX
     values might be needed.  Servers MAY make use of dynamic sample
     descriptions.  Clients MUST be able to receive and interpret
     dynamic sample descriptions.

     Finally, SIDX values 128 and 255 are reserved for future use.

   o SDUR (24 bits) "Text Sample Duration": indicates the sample
     duration in RTP timestamp units of the text sample.  For this
     field, a length of 3 bytes is preferred to 2 bytes.  This is
     because, for a typical clockrate of 1000 Hz, 16 bits would allow
     for a maximum duration of just 65 seconds, which might be too short
     for some streams.  On the other hand, 24 bits at 1000 Hz allow for
     a maximum duration of about 4.6 hours, while for 90 kHz, this value
     is about 3 minutes.  These values should be enough for streaming
     applications.  However, if a larger duration is needed, the
     extension mechanism specified in Section 4.3 SHALL be used.

     Apart from defining the time period during which the text is
     displayed, the duration field is also used to find the timestamp of
     subsequent units within the aggregate RTP packet payload (if any).
     This is explained in Section 4.6.

     Text samples generally have a known duration at the time of
     transmission.  However, in some cases such as live streaming, the
     time for which a text piece shall be presented might not be known a
     priori.  Thus, the value zero SDUR=0 (0x000000) is reserved to
     signal unknown duration.  The amount of time that a sample of
     unknown duration is presented is determined by the timestamp of the
     next sample that shall be displayed at the receiver: Text samples
     of unknown duration SHALL be displayed until the next text sample
     becomes active, as indicated by its timestamp.

     The next example illustrates how units of unknown duration MUST be
     presented.  If no subsequent text sample is available, what is
     displayed is an implementation issue.  For example, a server could
     send an empty sample to clear the text box.

        Example: Imagine you are in an airport watching the latest news
        report while you wait for your plane.  Airports are loud, so the
        news report is transcribed in the lower area of the screen.
        This area displays two lines of text: the headlines and the
        words spoken by the news speaker.  As usual, the headlines are
        shown for a longer time than the rest.  This time is, in
        principle, unknown to the stream server, which is streaming
        live.  A headline is just replaced when the next headline is
        received.

     However, upon storing a text sample with SDUR=0 in a 3GP file, the
     SDUR value MUST be changed to the effective duration of the text
     sample, which MUST always be greater than zero (note that the ISO
     file format [2] explicitly forbids a sample duration of zero).  The
     effective duration MUST be calculated as the timestamp difference
     between the current sample (with unknown duration) and the next
     text sample that is displayed.

     Note that samples of unknown duration SHALL NOT use features that
     require knowledge of the duration of the sample up front.  Such
     features are scrolling and karaoke in [1].  This also applies to
     future extensions of the Timed Text format.  Furthermore, only
     sample descriptions (TYPE 5 units) MAY follow units of unknown
     duration in the same aggregate payload.  Otherwise, it would not be
     possible to calculate the timestamp of these other units.

     For text contents stored in 3GP files, see Section 4.3 for details
     on how to extract the duration value.  For live streaming, live
     encoders SHALL assign appropriate values and units according to [1]
     and later releases.

   o TLEN (16 bits), "Text String Length", is a byte count of the text
     string.  The decoder needs the text string length in order to know
     where the modifiers in the payload start.  TLEN is not present in
     text string fragments (TYPE 2) since it can be derived from the LEN
     values of the fragments.

     The TLEN value is obtained from the text samples as contained in
     3GP files.  Refer to Section 4.3.  For live content, the TLEN MUST
     be obtained during the sampling process.

   o Finally, the actual text sample is placed after the TLEN field.  As
     defined in Section 3, a text sample consists of a string of
     characters encoded using either UTF-8 or UTF-16, followed by zero
     or more modifiers.  Note also that no BOM and no byte count are
     included in the strings carried in the payload (as opposed to text
     samples stored in 3GP files [1]).
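
   The TYPE 1 layout of Figure 4 can be sketched in Python as follows.
   This is a minimal, non-normative sketch: the function names
   classify_sidx and parse_type1_unit and the returned dictionary keys
   are hypothetical, and raising ValueError on malformed input is an
   assumption of the sketch rather than behavior required by this
   document.

```python
import struct


def classify_sidx(sidx):
    """Map an 8-bit SIDX value to 'dynamic', 'static', or 'reserved'."""
    if 0 <= sidx <= 127:
        return "dynamic"
    if 129 <= sidx <= 254:
        return "static"
    return "reserved"  # 128 and 255


def parse_type1_unit(unit):
    """Parse one complete TYPE 1 unit (9 header bytes plus text sample)."""
    if len(unit) < 9:
        raise ValueError("too short for a TYPE 1 unit")
    first, length, sidx = struct.unpack_from("!BHB", unit, 0)
    if first & 0x07 != 1 or length < 8 or len(unit) != 1 + length:
        raise ValueError("not a well-formed TYPE 1 unit")
    sdur = int.from_bytes(unit[4:7], "big")   # 24 bits; 0 = unknown
    (tlen,) = struct.unpack_from("!H", unit, 7)
    sample = unit[9:]                         # LEN - 8 bytes
    if tlen > len(sample):
        raise ValueError("TLEN exceeds the sample length")
    # U=1 selects UTF-16 in big endian order, U=0 selects UTF-8.
    encoding = "utf-16-be" if first >> 7 else "utf-8"
    return {
        "sidx": sidx,
        "sidx_kind": classify_sidx(sidx),
        "sdur": sdur,
        "text": sample[:tlen].decode(encoding),
        "modifiers": sample[tlen:],           # zero or more modifier boxes
    }
```

   An empty text sample (LEN = 8) yields an empty text string and no
   modifiers.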

4.1.3.  TYPE 2 Header

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE |          LEN( always >9)      | TOTAL | THIS  |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                    SDUR                       |    SIDX       |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |               SLEN            |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                      Figure 5.  TYPE 2 Header Format

   This header type is used to transport either a whole text string or a
   fragment of it.  TYPE 2 units SHALL NOT contain modifiers.  In
   detail:

   o U, R, and TYPE, as defined in Section 4.1.1.

   o SIDX and SDUR, as defined in Section 4.1.2.

        Note that the U, SIDX, and SDUR fields are meaningful since
        partial text strings can also be displayed.

   o The LEN field (16 bits) indicates the length of the text string
     fragment plus nine (9) bytes of headers.  Its value is calculated
     upon fragmentation.  LEN MUST always be greater than nine (0x0009).
     Otherwise, the unit MUST be discarded.

     According to the guidelines in Section 4.4, text strings MUST be
     split at character boundaries for allowing the display of text
     fragments.  Therefore, a text fragment MUST contain at least one
     character in either UTF-8 or UTF-16.  Actually, this is just a
     formalism since by observing the guidelines, much larger fragments
     should be created.

     Note also that TYPE 2 units do not contain an explicit text string
     length, TLEN (see TYPE 1).  This is because TYPE 2 units do not
     contain any modifiers after the text string.  If needed, the length
     of the received string can be obtained using the LEN values of the
     TYPE 2 units.

   o The SLEN field (16 bits) indicates the size (in bytes) of the
     original (whole) text sample to which this fragment belongs.  This
     length comprises the text string plus any modifier boxes present
     (and includes neither the byte order mark nor the text string
     length as mentioned in Section 3, "Terminology").

     Regarding the text sample length: Timed text samples are not
     generated at regular intervals, nor is there a default sample size.
     If 3GP files are streamed, the length of the text samples is
     calculated beforehand and included in the track itself, while for
     live encoding it is the real time encoder that SHALL choose an
     appropriate size for each text sample.  In this case, the amount of
     text 'captured' in a sample depends on the text source and the
     particular application (see examples below).  Samples may, e.g., be
     tailored to match the packet MTU as closely as possible or to
     provide a given redundancy for the available bit rate.  The
     encoding application MUST also take into account the delay
     constraints of the real-time session and assess whether FEC,
     retransmission, or other similar techniques are reasonable options
     for stream repair.

     The following examples shall illustrate how a real-time encoder may
     choose its settings to adapt to the scenario constraints.

          Example: Imagine a newscast scenario, where the spoken news is
          transcribed and synchronized with the image and voice of the
          reporter.  We assume that the news speaker talks at an average
          speed of 5 words per second with an average word length of 5
          characters plus one space per word, i.e., 30 characters per
          second.  We assume an available IP MTU of 576 bytes and an
          available bitrate of 576*8 bits per second = 4.6 Kbps.  We
          assume each character can be encoded using 2 bytes in UTF-16.
          In this scenario, several constraints may apply; for example:
          available IP MTU, available bandwidth, allowable delay, and
          required redundancy.  If the target were to minimize the
          packet overhead, a text sample covering 8 seconds of text
          would be closest to the IP MTU:

       IP/UDP/RTP/TYPE1 Header + (8-second text sample)
     = 20 + 8 + 12 + 8 + (~6 chars/word * 5 words/s * 8 s * 2 bytes/char)
     = 528 bytes < 576 bytes

          For other scenarios, such as lossy networks, a single packet
          per sample may provide too little redundancy.  In this case,
          the encoder could 'collect' text every second, thus yielding
          text samples (TYPE 1 units) of 68 bytes, TYPE 1 header
          included.  We can, e.g., include three contiguous text samples
          in one RTP payload: the current and last two text samples (see
          below).  This amounts to a total IP packet size of 20 + 8 + 12
          + 3*(8 + 60) = 244 bytes.  Now, with the same available
          bitrate of 4.6 Kbps, these 244-byte packets can be sent
          redundantly up to two times per second:

          RTP payload (1,2,3)(1,2,3) (2,3,4)(2,3,4) (3,4,5)(3,4,5) ...
          Time:       <----1s------> <----1s------> <-----1s-----> ...

          This means that each text sample is sent at least six times,
          which should provide enough redundancy.  Although not as
          bandwidth efficient (488*8 < 528*8  < 576*8 bps) as the
          previous packetization, this option increases the stream
          redundancy while still meeting the delay and bandwidth
          constraints.

          Another example would be a user sending timed text from a
          type-in area in the display.  In this case, the text sample is
          created as soon as the user clicks the 'send' button.
          Depending on the packet length, fragmentation may be needed.

          In a video conferencing application, text is synchronized with
          audio and video.  Thus, the text samples shall be displayed
          long enough to be read by a human, shall fit in the video
          screen, and shall 'capture' the audio contents rendered during
          the time the corresponding video and audio is rendered.
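
     The packet-size arithmetic of the newscast example can be
     reproduced with a few lines of Python.  This is only a worked
     check of the figures given in the example; the constant and
     function names are hypothetical, and the 2 bytes per character
     reflect the UTF-16 assumption stated in the example.

```python
IP, UDP, RTP, TYPE1 = 20, 8, 12, 8   # header sizes in bytes
CHARS_PER_SECOND = 6 * 5             # (5 letters + 1 space) * 5 words/s
BYTES_PER_CHAR = 2                   # UTF-16, as assumed in the example
MTU = 576                            # available IP MTU in bytes
BITRATE = MTU * 8                    # 4608 bit/s, about 4.6 Kbps


def sample_bytes(seconds):
    """Size in bytes of a text sample covering the given time span."""
    return CHARS_PER_SECOND * seconds * BYTES_PER_CHAR


# One 8-second sample per packet.
single = IP + UDP + RTP + TYPE1 + sample_bytes(8)

# Three 1-second samples aggregated in one packet, sent twice per second.
aggregate = IP + UDP + RTP + 3 * (TYPE1 + sample_bytes(1))
redundant_bps = 2 * aggregate * 8

print(single, aggregate, redundant_bps, BITRATE)  # 528 244 3904 4608
```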

     For stored content, see Section 4.3 for details on how to find the
     SLEN value in a 3GP file.  For live content, the SLEN MUST be
     obtained during the sampling process.

     Finally, note that clients MAY use SLEN to reserve buffer space for
     the remaining fragments of a text sample.

   o The fields TOTAL (4 bits) and THIS (4 bits) indicate the total
     number of fragments in which the original text sample (i.e., the
     text string and its modifiers) has been fragmented and the position
     of the current fragment in that sequence, respectively.  Note
     that the sequence number alone cannot replace the functionality of
     the THIS field, since packets (and fragments) may be repeated,
     e.g., as in repeated transmission (see Section 5).  Thus, an
     indication for "fragment offset" is needed.

     The usual "byte offset" field is not used here for two reasons: a)
     it would take one more byte and b) it does not provide any
     information on the character offset.  UTF-8/UTF-16 text strings
     have, in general, a variable character length ranging from 1 to 6
     bytes.  Therefore, the TOTAL/THIS solution is preferred.  It could
     also be argued that the LEN and SLEN fields be used for this
     purpose, but while they would provide information about the
     completeness of the text sample, they do not specify the order of
     the fragments.

     In all cases (TYPEs 2, 3 and 4), if the value of THIS is greater
     than TOTAL or if TOTAL equals zero (0x0), the fragment SHALL be
     discarded.

   o Finally, the sample contents following the SLEN field consist of a
     fragment of the UTF-8/UTF-16 character string; no modifiers follow.
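
   The TYPE 2 layout of Figure 5 and the discard rules above can be
   sketched in Python as follows.  The sketch is illustrative only: the
   names pack_type2_unit and parse_type2_unit are hypothetical, the
   caller is assumed to have split the text string at character
   boundaries already, and the builder assumes that THIS counts
   fragments starting from one.

```python
import struct


def pack_type2_unit(fragment, total, this, sdur, sidx, slen, utf16=False):
    """Build a TYPE 2 unit around an already-split text string fragment."""
    if not fragment:
        raise ValueError("a fragment must hold at least one character")
    if not 1 <= this <= total <= 15:
        raise ValueError("need 1 <= THIS <= TOTAL <= 15")
    first = (0x80 if utf16 else 0x00) | 2   # U bit, R = 0, TYPE = 2
    length = 9 + len(fragment)              # always greater than nine
    return (struct.pack("!BHB", first, length, (total << 4) | this)
            + sdur.to_bytes(3, "big")
            + struct.pack("!BH", sidx, slen)
            + fragment)


def parse_type2_unit(unit):
    """Return the TYPE 2 fields, or None if the unit is to be discarded."""
    if len(unit) < 10:
        return None
    first, length, frag = struct.unpack_from("!BHB", unit, 0)
    total, this = frag >> 4, frag & 0x0F
    if first & 0x07 != 2 or length <= 9 or len(unit) != 1 + length:
        return None
    if total == 0 or this > total:
        return None
    sidx, slen = struct.unpack_from("!BH", unit, 7)
    return {
        "utf16": bool(first >> 7),
        "total": total,
        "this": this,
        "sdur": int.from_bytes(unit[4:7], "big"),
        "sidx": sidx,
        "slen": slen,               # size of the whole original sample
        "fragment": unit[10:],      # LEN - 9 bytes of text, no modifiers
    }
```

   The length of the text string received so far can then be obtained
   by summing len(fragment), i.e., LEN - 9, over the TYPE 2 units of a
   sample.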

4.1.4.  TYPE 3 Header

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE |        LEN( always >6)        |TOTAL  |  THIS |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                      SDUR                     |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                      Figure 6.  TYPE 3 Header Format

   This header type is used to transport either the entire modifier
   contents present in a text sample or just the first fragment of them.
   This depends on whether the modifier boxes fit in the current RTP
   payload.

   If a text sample containing modifiers is fragmented, this header MUST
   be used to transport the first fragment or, if possible, the complete
   modifiers.

   In detail:

   o The U, R, and TYPE fields are defined as in Section 4.1.1.

   o LEN indicates the length of the modifier contents plus six (6)
     bytes of headers.  Its value is obtained upon fragmentation.  The
     LEN field MUST be greater than six (0x0006).  Otherwise, the unit
     MUST be discarded.

   o The TOTAL/THIS field has the same meaning as for TYPE 2.

     For TYPE 3 units containing the last (trailing) modifier fragment,
     the value of TOTAL MUST be equal to that of THIS (TOTAL=THIS).  In
     addition, TOTAL=THIS MUST be greater than one, because the total
     number of fragments of a text sample is logically always larger
     than one.

     Otherwise, if TOTAL is different from THIS in a TYPE 3 unit, this
     means that the unit contains the first fragment of the modifiers.

   o The SDUR field has the same definition as for TYPE 1.  Since the
     fragments are always transported in their own RTP packets, this
     field is only needed to know how long this fragment is valid.
     This may, e.g., be used to determine how long it should be kept in
     the display buffer.

   Note that the SLEN and SIDX fields are not present in TYPE 3 unit
   headers.  This is because a) these fragments do not contain text
   strings and b) these types of fragments are applied over text string
   fragments, which already contain this information.

4.1.5.  TYPE 4 Header

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE |        LEN( always >6)        |TOTAL  |  THIS |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |                      SDUR                     |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                      Figure 7.  TYPE 4 Header Format

   This header type is placed before modifier fragments, other than the
   first one.

   The U, R, and TYPE fields are used as per Section 4.1.1.

   As for TYPE 3, LEN indicates the length of the modifier contents plus
   six (6) bytes of headers and SHALL also be obtained upon
   fragmentation.  The LEN field MUST be greater than six (0x0006).
   Otherwise, the unit MUST be discarded.

   TOTAL/THIS is used as in TYPE 2.

   The SDUR field is defined as in TYPE 1.  The reasoning behind the
   absence of SLEN and SIDX is the same as in TYPE 3 units.
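
   TYPE 3 and TYPE 4 headers share one layout (Figures 6 and 7), so a
   single routine can read both.  The following Python sketch is
   illustrative only; the name parse_modifier_unit and the returned
   dictionary keys are hypothetical.  The U bit is not inspected,
   since receivers ignore it for these unit types.

```python
import struct


def parse_modifier_unit(unit):
    """Parse a TYPE 3 or TYPE 4 unit; return None if it is to be discarded."""
    if len(unit) < 8:
        return None
    first, length, frag = struct.unpack_from("!BHB", unit, 0)
    unit_type = first & 0x07
    total, this = frag >> 4, frag & 0x0F
    if unit_type not in (3, 4) or length <= 6 or len(unit) != 1 + length:
        return None
    if total == 0 or this > total:
        return None
    return {
        "type": unit_type,
        "total": total,
        "this": this,
        "last": total == this,   # trailing fragment of the text sample
        "sdur": int.from_bytes(unit[4:7], "big"),
        "modifiers": unit[7:],   # LEN - 6 bytes of modifier contents
    }
```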

4.1.6.  TYPE 5 Header

       0                   1                   2                   3
       0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
      |U|   R   |TYPE |      LEN( always >3)          |   SIDX        |
      +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                      Figure 8.  TYPE 5 Header Format

   This header type is used to transport (dynamic) sample descriptions.
   Every sample description MUST have its own TYPE 5 header.

   The U, R, and TYPE fields are used as per Section 4.1.1.

   The LEN field indicates the length of the sample description, plus
   three bytes accounting for the SIDX field and the LEN field itself.
   Thus, this field MUST be greater than three (0x0003).  Otherwise,
   the unit MUST be discarded.

   If the sample is streamed from a 3GP file, the length of the sample
   description contents (i.e., what comes after SIDX in the unit itself)
   is obtained from the file (see Section 4.3).

   The SIDX field contains a dynamic SIDX value assigned to the sample
   description carried as sample content of this unit.  As only dynamic
   sample descriptions are carried using TYPE 5, the possible SIDX
   values are in the (closed) interval [0,127].

   Senders MAY make use of TYPE 5 units.  All receivers MUST implement
   support for TYPE 5 units, since it adds minimum complexity and may
   increase the robustness of the streaming session.
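
   A TYPE 5 unit is the simplest of the five.  The following Python
   sketch builds and reads one under the rules above.  It is
   illustrative only: the names pack_type5_unit and parse_type5_unit
   are hypothetical, and the sample description is treated as an
   opaque byte string.

```python
import struct


def pack_type5_unit(sidx, sample_description):
    """Build a TYPE 5 unit carrying one dynamic sample description."""
    if not 0 <= sidx <= 127:
        raise ValueError("TYPE 5 carries dynamic SIDX values only: [0,127]")
    if not sample_description:
        raise ValueError("LEN must be greater than three")
    length = 3 + len(sample_description)   # LEN (2) + SIDX (1) + contents
    if length > 0xFFFF:
        raise ValueError("sample description does not fit in 16-bit LEN")
    # First byte: U = 0, R = 0, TYPE = 5.
    return struct.pack("!BHB", 5, length, sidx) + sample_description


def parse_type5_unit(unit):
    """Return (sidx, sample_description), or None to discard the unit."""
    if len(unit) < 5:
        return None
    first, length, sidx = struct.unpack_from("!BHB", unit, 0)
    if first & 0x07 != 5 or length <= 3 or len(unit) != 1 + length:
        return None
    return sidx, unit[4:]
```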

   The next section specifies how SIDX values are calculated.
