5. Resilient Transport
Apart from the basic fragmentation guidelines described in the section above, the simplest option for packet-loss-resilient transport is packet repetition. This mechanism may consist of a strict window-based repetition mechanism or, simply, a repetition mechanism in a wider sense, where new and old packets are mixed, for example. A server MAY decide to use repetition as a measure for packet loss resilience. Thereby, a server MAY send the same RTP payloads or just some of the units from the payloads. As for the case of complete payloads, single repeated units MUST exactly match the same units sent in the first transmission; i.e., if fragmentation is needed, it SHALL be performed only once for each text sample. Only then, a receiver can use the already received and the repeated units to reconstruct the original text samples. Since the RTP timestamp is used to group together the fragments of a sample, care must taken to preserve the timing of units when constructing new RTP packets. For example, if a text sample was originally sent as a single non-fragmented text sample (one TYPE 1 unit), a repetition of that sample MUST be sent also as a single non-fragmented text sample in one unit. Likewise, if the original text sample was fragmented and spread over several RTP packets (say, a total of 3 units), then the repeated fragments SHALL also have the same byte boundaries and use the same unit headers and bytes per fragment.
With repetition, repeated units resolve to the same timestamp as their originals. Where redundant units are available, only one of them SHALL be used. Regarding the RTP header fields: o If the whole RTP payload is repeated, all payload-specific fields in the RTP header (the M, TS and PT fields) MUST keep their original values except the sequence number, which MUST be incremented to comply with RTP (the fields TOTAL/THIS enable to re-assemble fragments with different sequence numbers). o In packets containing single repeated units, the general rules in Section 3 for assigning values to the RTP header fields apply. Keeping the value of the RTP timestamp to preserve the timing of the units is particularly relevant here. Apart from repetition, other mechanisms such as FEC [7], retransmission [11], or similar techniques could be used to cope with packet losses.6. Congestion Control
Congestion control for RTP SHALL be implemented in accordance with RTP [3] and the applicable RTP profile, e.g., RTP/AVP [17]. When using this payload format, mainly two factors may affect the congestion control: o The use of (unit) aggregation may make the payload format more bandwidth efficient, by avoiding header overhead and thus reducing the used bitrate. o The use of resilient transport mechanisms: Although timed text applications typically operate at low bitrates, the increase due to resilient transport shall be considered for congestion control mechanisms. This applies to all mechanisms but especially to less efficient ones like repetition.
7. Scene Description
7.1. Text Rendering Position and Composition
In order to set up a timed text session, regardless of the stream being stored in a 3GP file or streamed live, some initial layout information is needed by the communicating peers. +-------------------------------------------+ | <-> tx | +-------------+ | +-------------------------------+ |<---|Display Area | | ^ | | | +-------------+ | : | | | | :ty| | | +-------------+ | : | |<---------|Video track | | : | | | +-------------+ | : | | | | : | | | | : | | | | v | | | | - | x-------------------------+ | | +-------------+ |h ^ | | |<-----------|Text Track | |e : +---|-------------------------|-+ | +-------------+ |i : | +---------------------+ | | |g : | | | | | +-------------+ |h : | | |<------------ |Text Box | |t v | +---------------------+ | | +-------------+ | - +-------------------------+ | +-------------------------------------------+ <........................> w i d t h Figure 18. Illustration of text rendering position and composition The parameters used for negotiating the position and size of the text track in the display area are shown in Figure 18. These are the "width" and "height" of the text track, its translation values, "tx" and "ty", and its "layer" or proximity to the user. At the same time, the sender of the stream needs to know the receiver's capabilities. In this case, the maximum allowable values for the text track height and width: "max-h" and "max-w", for the stream the receiver shall display. This layout information MUST be conveyed in a reliable form before the start of the session, e.g., during session announcement or in an Offer/Answer (O/A) exchange. An example of a reliable transport may be the out-of-band channel used for SDP. Sections 8 and 9 provide
details on the mapping of these parameters to SDP descriptions and their usage in O/A. For stored content, the layout values expressing stream properties MUST be obtained from the Track Header Box. See Section 7.3. For live streaming, appropriate values as negotiated during session setup shall be used.7.2. SMIL Usage
The attributes contained in the Track Header Boxes of a 3GP file only specify the spatial relationship of the tracks within the given 3GP file. If multiple 3GP files are sent, they require spatial synchronization. For example, for a text and video stream, the positions of the text and video tracks in Figure 18 shall be determined. For this purpose, SMIL [9] MAY be used. SMIL assigns regions in the display to each of those files and places the tracks within those regions. Generally, in SMIL, the position of one track (or stream) is expressed relative to another track. This is different from the 3GP file, where the upper left corner is the reference for all translation offsets. Hence, only if the position in SMIL is relative to the video track origin, then this translation offset has the same value as (tx, ty) in the 3GP file. Note also that the original track header information is used for each track only within its region, as assigned by SMIL. Therefore, even if SMIL scene description is used, the track header information pieces SHOULD be sent anyway, as they represent the intrinsic media properties. See 3GPP SMIL Language Profile in [27] for details.7.3. Finding Layout Values in a 3GP File
In a 3GP file, within the Track Header Box (tkhd): o tx, ty: These values specify the translation offset of the (text) track relative to the upper left corner of the video track, if present. They are the second but last and third but last values in the unity matrix; values are fixed-point 16.16 values, restricted to be (signed) integers (i.e., the lower 16 bits of each value shall be all zeros). Therefore, only the first 16 bits are used for obtaining the value of the media type parameters.
o width, height: They have the same name in the tkhd box. All (unsigned) 32 bits are meaningful. o layer: All (signed) 16 bits are used.8. 3GPP Timed Text Media Type
The media subtype for the 3GPP Timed Text codec is allocated from the standards tree. The top-level media type under which this payload format is registered is 'video'. This registration is done using the template defined in [29] and following RFC 3555 [28]. The receiver MUST ignore any unrecognized parameter. Media type: video Media subtype: 3gpp-tt Required parameters rate: Refer to Section 3 in RFC 4396. sver: The parameter "sver" contains a list of supported backwards-compatible versions of the timed text format specification (3GPP TS 26.245) that the sender accepts to receive (and that are the same that it would be willing to send). The first value is the value preferred to receive (or preferred to send). The first value MAY be followed by a comma-separated list of versions that SHOULD be used as alternatives. The order is meaningful, being first the most preferred and last the least preferred. Each entry has the format Zi(xi*256+yi), where "Zi" is the number of the Release and "xi" and "yi" are taken from the 3GPP specification version (i.e., vZi.xi.yi). For example, for 3GPP TS 26.245 v6.0.0, Zi(xi*256+yi)=6(0), the version value is "60". (Note that "60" is the concatenation of the values Zi=6 and (xi*256+yi)=0 and not their product.) If no "sver" value is available, for example, when streaming out of a 3GP file, the default value "60", corresponding to the 3GPP Release 6 version of 3GPP TS 26.245, SHALL be used.
Optional parameters: tx: This parameter indicates the horizontal translation offset in pixels of the text track with respect to the origin of the video track. This value is the decimal representation of a 16-bit signed integer. Refer to TS 3GPP 26.245 for an illustration of this parameter. ty: This parameter indicates the vertical translation offset in pixels of the text track with respect to the origin of the video track. This value is the decimal representation of a 16-bit signed integer. Refer to TS 3GPP 26.245 for an illustration of this parameter. layer: This parameter indicates the proximity of the text track to the viewer. More negative values mean closer to the viewer. This parameter has no units. This value is the decimal representation of a 16-bit signed integer. tx3g: This parameter MUST be used for conveying sample descriptions out-of-band. It contains a comma-separated list of base64-encoded entries. The entries of this list MAY follow any particular order and the list SHALL NOT be empty. Each entry is the result of running base64 encoding over the concatenation of the (static) SIDX value as an 8-bit unsigned integer and the (static) sample description for that SIDX, in that order. The format of a sample description entry can be found in 3GPP TS 26.245 Release 6 and later releases. All servers and clients MUST understand this parameter and MUST be capable of using the sample description(s) contained in it. Please refer to RFC 3548 [6] for details on the base64 encoding. width: This parameter indicates the width in pixels of the text track or area of the text being sent. This value is the decimal representation of a 32-bit unsigned integer. Refer to TS 3GPP 26.245 for an illustration of this parameter.
height: This parameter indicates the height in pixels of the text track being sent. This value is the decimal representation of a 32-bit unsigned integer. Refer to TS 3GPP 26.245 for an illustration of this parameter. max-w: This parameter indicates display capabilities. This is the maximum "width" value that the sender of this parameter supports. This value is the decimal representation of a 32-bit unsigned integer. max-h: This parameter indicates display capabilities. This is the maximum "height" value that the sender of this parameter supports. This value is the decimal representation of a 32-bit unsigned integer. Encoding considerations: This media type is framed (see Section 4.8 in [29]) and partially contains binary data. Restrictions on usage: This media type depends on RTP framing, and hence is only defined for transfer via RTP [3]. Transport within other framing protocols is not defined at this time. Security considerations: Please refer to Section 11 of RFC 4396. Interoperability considerations: The 3GPP Timed Text media format and its file storage is specified in Release 6 of 3GPP TS 26.245, "Transparent end-to- end packet switched streaming service (PSS); Timed Text Format (Release 6)". Note also that 3GPP may in future releases specify extensions or updates to the timed text media format in a backwards-compatible way, e.g., new modifier boxes or extensions to the sample descriptions. The payload format defined in RFC 4396 allows for such extensions. For future 3GPP Releases of the Timed Text Format, the parameter "sver" is used to identify the exact specification used.
The defined storage format for 3GPP Timed Text format is the 3GPP File Format (3GP) [30]. 3GP files may be transferred using the media type video/3gpp as registered by RFC 3839 [31]. The 3GPP File Format is a container file that may contain, e.g., audio and video that may be synchronized with the 3GPP Timed Text. Published specification: RFC 4396 Applications which use this media type: Multimedia streaming applications. Additional information: The 3GPP Timed Text media format is specified in 3GPP TS 26.245, "Transparent end-to-end packet switched streaming service (PSS); Timed Text Format (Release 6)". This document and future extensions to the 3GPP Timed Text format are publicly available at http://www.3gpp.org. Magic number(s): None. File extension(s): None. Macintosh File Type Code(s): None. Person & email address to contact for further information: Jose Rey, jose.rey@eu.panasonic.com Yoshinori Matsui, matsui.yoshinori@jp.panasonic.com Audio/Video Transport Working Group. Intended usage: COMMON Authors: Jose Rey Yoshinori Matsui Change controller: IETF Audio/Video Transport Working Group delegated from the IESG.
9. SDP Usage
9.1. Mapping to SDP
The information carried in the media type specification has a specific mapping to fields in SDP [4]. If SDP is used to specify sessions using this payload format, the mapping is done as follows: o The media type ("video") goes in the SDP "m=" as the media name. m=video <port number> RTP/<RTP profile> <dynamic payload type> o The media subtype ("3gpp-tt") and the timestamp clockrate "rate" (the RECOMMENDED 1000 Hz or other value) go in SDP "a=rtpmap" line as the encoding name and rate, respectively: a=rtpmap:<payload type> 3gpp-tt/1000 o The REQUIRED parameter "sver" goes in the SDP "a=fmtp" attribute by copying it directly from the media type string as a semicolon- separated parameter=value pair. o The OPTIONAL parameters "tx", "ty", "layer", "tx3g", "width", "height", "max-w" and "max-h" go in the SDP "a=fmtp" attribute by copying them directly from the media type string as a semicolon separated list of parameter=value(s) pairs: a=fmtp:<dynamic payload type> <parameter name>=<value>[,<value>][; <parameter name>=<value>] o Any parameter unknown to the device that uses the SDP SHALL be ignored. For example, parameters added to the media format in later specifications MAY be copied into the SDP and SHALL be ignored by receivers that do not understand them.9.2. Parameter Usage in the SDP Offer/Answer Model
In this section, the meaning of the SDP parameters defined in this document within the Offer/Answer [13] context is explained. In unicast, sender and receiver typically negotiate the streams, i.e., which codecs and parameter values are used in the session. This is also possible in multicast to a lesser extent. Additionally, the meaning of the parameters MAY vary depending on which direction is used. In the following sections, a "<directionality> offer" means an offer that contains a stream set to <directionality>. <directionality> may take the values sendrecv,
sendonly, and recvonly. Similar considerations apply for answers. For example, an answer to a sendonly offer is a recvonly answer.9.2.1. Unicast Usage
The following types of parameters are used in this payload format: 1. Declarative parameters: Offerer and answerer declare the values they will use for the incoming (sendrecv/recvonly) or outgoing (sendonly) stream. Offerer and answerer MAY use different values. a. "tx", "ty", and "layer": These are parameters describing where the received text track is placed. Depending on the directionality: i. They MUST appear in all sendrecv offers and answers and in all recvonly offers and answers (thus applying to the incoming stream). In the case of sendrecv offers and answers and in recvonly offers, these values SHOULD be used by the sender of the stream unless it has a particular preference, in which case, it MUST make sure that these different values do not corrupt the presentation. For recvonly answers, the answerer MAY accept the proposed values for the incoming stream (in a sendonly offer; see ii. below) or respond with different ones. The offerer MUST use the returned values. ii. They MAY appear in sendonly offers and MUST appear in sendonly answers. In sendonly offers, they specify the values that the offerer proposes for sending (see example in Section 9.3). In sendonly answers, these values SHOULD be copied from the corresponding recvonly offer upon accepting the stream, unless a particular preference by the receiver of the stream exists, as explained in the previous point. 2. Parameters describing the display capabilities, "max-h" and "max-w", which indicate the maximum dimensions of the text track (text display area) for the incoming stream "tx" and "ty" values (see Figure 18). "max-h" and "max-w" MUST be included in all offers and answers where "tx" and "ty" refer to the incoming stream, thus excluding sendonly offers and answers (see example in Section 9.3), where they SHALL NOT be present.
3. Parameters describing the sent stream properties, i.e., the sender of the stream decides upon the values of these: a. "width" and "height" specify the text track dimensions. They SHALL ALWAYS be present in sendrecv and sendonly offers and answers. For recvonly answers, the answerer MUST include the offered parameter values (if any) verbatim in the answer upon accepting the stream. b. "tx3g" contains static sample descriptions. It MAY only be present in sendrecv and sendonly offers and answers. This parameter applies to the stream that offerers or answerers send. 4. Negotiable parameters, which MUST be agreed on. This is the case of "sver". This parameter MUST be present in every offer and answer. The answerer SHALL choose one supported value from the offerer's list, or else it MUST remove the stream or reject the session. 5. Symmetric parameters: "rate", timestamp clockrate, belongs to this class. Symmetric parameters MUST be echoed verbatim in the answer. Otherwise, the stream MUST be removed or the session rejected. The following table summarizes all options:
+..---------------------------+----------+----------+----------+ | ``--..__ Directionality/ | sendrecv | recvonly | sendonly | + Type of ``--..__ O or A +----------+----------+----------+ | Parameter ``--..__ | O/A | O/A | O/A | +--------------+------------``+----------+----------+----------+ | Declarative |tx, ty, layer | M/M | M/M | m/M | | | | | | | +--------------+--------------+----------+----------+----------+ | Display |max-h, max-w | M/M | M/M | -/- | | Capabilities | | | | | +--------------+--------------+----------+----------+----------+ | Stream |height, width | M/M | -/(M) | M/M | | properties |tx3g | m/m | -/- | m/m | | | | | | | +--------------+--------------+----------+----------+----------+ | Negotiable |sver | M/M | M/M | M/M | | | | | | | +--------------+--------------+----------+----------+----------+ | Symmetric |rate | M/M | M/M | M/M | +--------------+--------------+----------+----------+----------+ Table 1. Parameter usage in Unicast Offer / Answer. KEY: o M means MUST be present. o m means MAY be present (such as proposed values). o (M) or (m) means MUST or MAY, if applicable. o a hyphen ("-") means the parameter MUST NOT be present. Other observations regarding parameter usage: o Translation and transparency values: In sendonly offers, "tx", "ty", and "layer" indicate proposed values. This is useful for visually composed sessions where the different streams occupy different parts of the display, e.g., a video stream and the captions. These are just suggested values; the peer rendering the text ultimately decides where to place the text track. o Text track (area) dimensions, "height" and "width": In the case of sendonly offers, an answerer accepting the offer MUST be prepared to render the stream using these values. If any of these conditions are not met, the stream MUST be removed or the session rejected. o Display capabilities, "max-h" and "max-w": An answerer sending a stream SHALL ensure that the "height" and "width" values in the answer are compatible with the offerer's signaled capabilities.
o Version handling via "sver": The idea is that offerer and answerer communicate using the same version. This is achieved by letting the answerer choose from a list of supported versions, "sver". For recvonly streams, the first value in the list is the preferred version to receive. Consequently, for sendonly (and sendrecv) streams, the first value is the one preferred for sending (and receiving). The answerer MUST choose one value and return it in the answer. Upon receiving the answer, the offerer SHALL be prepared to send (sendonly and sendrecv) and receive (recvonly and sendrecv) a stream using that version. If none of the versions in the list is supported, the stream MUST be removed or the session rejected. Note that, if alternative non- compatible versions are offered, then this SHALL be done using different payload types.9.2.2. Multicast Usage
In multicast, the parameter usage is similar to the unicast case, except as follows: o the parameters "tx", "ty", and "layer" in multicast offers only have meaning for sendrecv and recvonly streams. In order for all clients to have the same vision of the session, they MUST be used symmetrically. o for "height", "width", and "tx3g" (for sendrecv and sendonly), multicast offers specify which values of these parameters the participants MUST use for sending. Thus, if the stream is accepted, the answerer MUST also include them verbatim in the answer (also "tx3g", if present). o The capability parameters, "max-h" and "max-w", SHALL NOT be used in multicast. If the offered text track should change in size, a new offer SHALL be used instead. o Regarding version handling: In the case of multicast offers, an answerer MAY accept a multicast offer as long as one of the versions listed in the "sver" is supported. Therefore, if the stream is accepted, the answerer MUST choose its preferred version, but, unlike in unicast, the offerer SHALL NOT change the offered stream to this chosen version because there may be other session participants that do support the newer extensions. Consequently, different session participants may end up using different backwards-compatible media format versions. It is RECOMMENDED that the multicast offer contains a limited number of versions, in order for all participants to have the same view of the session. This is a responsibility of the session creator. If
none of the offered versions is supported, the stream SHALL be removed or the session rejected. Also in this case, if alternative non-compatible versions are offered, then this SHALL be done using different payload types.9.3. Offer/Answer Examples
In these unicast O/A examples, the long lines are wrapped around. Static sample descriptions are shortened for clarity. For sendrecv: O -> A m=video <port> RTP/AVP 98 a=rtpmap:98 3gpp-tt/1000 a=fmtp:98 tx=100; ty=100; layer=0; height=80; width=100; max-h=120; max-w=160; sver=6256,60; tx3g=81... a=sendrecv A -> O m=video <port> RTP/AVP 98.. a=rtpmap:98 3gpp-tt/1000 a=fmtp:98 tx=100; ty=95; layer=0; height=90; width=100; max-h=100; max-w=160; sver=60; tx3g=82... a=sendrecv In this example, the offerer is telling the answerer where it will place the received stream and what is the maximum height and width allowable for the stream that it will receive. Also, it tells the answerer the dimensions of the text track for the stream sent and which sample description it shall use. It offers two versions, 6256 and 60. The answerer responds with an equivalent set of parameters for the stream it receives. In this case, the answerer's "max-h" and "max-w" are compatible with the offerer's "height" and "width". Otherwise, the answerer would have to remove this stream, and the offerer would have to issue a new offer taking the answerer's capabilities into account. This is possible only if multiple payload types are present in the initial offer so that at least one of them matches the answerer's capabilities as expressed by "max-h" and "max-w" in the negative answer. Note also that the answerer's text box dimensions fit within the maximum values signaled in the offer. Finally, the answerer chooses to use version 60 of the timed text format.
For recvonly: Offerer -> Answerer m=video <port> RTP/AVP 98 a=rtpmap:98 3gpp-tt/1000 a=fmtp:98 tx=100; ty=100; layer=0; max-h=120; max-w=160; sver=6256,60 a=recvonly A -> O m=video <port> RTP/AVP 98.. a=rtpmap:98 3gpp-tt/1000 a=fmtp:98 tx=100; ty=100; layer=0; height=90; width=100; sver=60; tx3g=82... a=sendonly In this case, the offer is different from the previous case: It does not include the stream properties "height", "width", and "tx3g". The answerer copies the "tx", "ty", and "layer" values, thus acknowledging these. "max-h" and "max-w" are not present in the answer because the "tx" and "ty" (and "layer") in this special case do not apply to the received stream, but to the sent stream. Also, if offerer and answerer had very different display sizes, it would not be possible to express the answerer's capabilities. In the example above and for an answerer with a 50x50 display, the translation values are already out of range. For sendonly: O -> A m=video <port> RTP/AVP 98 a=rtpmap:98 3gpp-tt/1000 a=fmtp:98 tx=100; ty=100; layer=0; height=80; width=100; sver=6256,60; tx3g=81... a=sendonly A -> O m=video <port> RTP/AVP 98.. a=rtpmap:98 3gpp-tt/1000 a=fmtp:98 tx=100; ty=100; layer=0; height=80; width=100; max-h=100; max-w=160; sver=60 a=recvonly
Note that "max-h" and "max-w" are not present in the offer. Also, with this answer, the answerer would accept the offer as is (thus echoing "tx", "ty", "height", "width", and "layer") and additionally inform the offerer about its capabilities: "max-h" and "max-w". Another possible answer for this case would be: A -> O m=video <port> RTP/AVP 98.. a=rtpmap:98 3gpp-tt/1000 a=fmtp:98 tx=120; ty=105; layer=0; max-h=95; max-w=150; sver=60 a=recvonly In this case, the answerer does not accept the values offered. The offerer MUST use these values or else remove the stream.9.4. Parameter Usage outside of Offer/Answer
SDP may also be employed outside of the Offer/Answer context, for instance for multimedia sessions that are announced through the Session Announcement Protocol (SAP) [14] or streamed through the Real Time Streaming Protocol (RTSP) [15]. In this case, the receiver of a session description is required to support the parameters and given values for the streams, or else it MUST reject the session. It is the responsibility of the sender (or creator) of the session descriptions to define the session parameters so that the probability of unsuccessful session setup is minimized. This is out of the scope of this document.10. IANA Considerations
IANA has registered the media subtype name "3gpp-tt" for the media type "video" as specified in Section 8 of this document.11. Security Considerations
RTP packets using the payload format defined in this specification are subject to the security considerations discussed in the RTP specification [3] and any applicable RTP profile, e.g., AVP [17]. In particular, an attacker may invalidate the current set of active sample descriptions at the client by means of repeating a packet with an old sample description, i.e., replay attack. This would mean that the display of the text would be corrupted, if displayed at all. Another form of attack may consist of sending redundant fragments, whose boundaries do not match the exact boundaries of the originals
(as indicated by LEN) or fragments that carry different sample lengths (SLEN). This may cause a decoder to crash. These types of attack may easily be avoided by using source authentication and integrity protection. Additionally, peers in a timed text session may desire to retain privacy in their communication, i.e., confidentiality. This payload format does not provide any mechanisms for achieving these. Confidentiality, integrity protection, and authentication have to be solved by a mechanism external to this payload format, e.g., SRTP [10].12. References
12.1. Normative References
[1] Transparent end-to-end packet switched streaming service (PSS); Timed Text Format (Release 6), TS 26.245 v 6.0.0, June 2004. [2] ISO/IEC 14496-12:2004 Information technology - Coding of audio- visual objects - Part 12: ISO base media file format. [3] Schulzrinne, H., Casner, S., Frederick, R., and V. Jacobson, "RTP: A Transport Protocol for Real-Time Applications", STD 64, RFC 3550, July 2003. [4] Handley, M. and V. Jacobson, "SDP: Session Description Protocol", RFC 2327, April 1998. [5] Bradner, S., "Key words for use in RFCs to Indicate Requirement Levels", BCP 14, RFC 2119, March 1997. [6] Josefsson, S., "The Base16, Base32, and Base64 Data Encodings", RFC 3548, July 2003.12.2. Informative References
[7] Rosenberg, J. and H. Schulzrinne, "An RTP Payload Format for Generic Forward Error Correction", RFC 2733, December 1999. [8] Perkins, C. and O. Hodson, "Options for Repair of Streaming Media", RFC 2354, June 1998. [9] W3C, "Synchronised Multimedia Integration Language (SMIL 2.0)", August, 2001.
[10] Baugher, M., McGrew, D., Naslund, M., Carrara, E., and K. Norrman, "The Secure Real-time Transport Protocol (SRTP)", RFC 3711, March 2004. [11] Rey, J., Leon, D., Miyazaki, A., Varsa, V., and R. Hakenberg, "RTP Retransmission Payload Format", Work in Progress, September 2005. [12] van der Meer, J., Mackie, D., Swaminathan, V., Singer, D., and P. Gentric, "RTP Payload Format for Transport of MPEG-4 Elementary Streams", RFC 3640, November 2003. [13] Rosenberg, J. and H. Schulzrinne, "An Offer/Answer Model with Session Description Protocol (SDP)", RFC 3264, June 2002. [14] Handley, M., Perkins, C., and E. Whelan, "Session Announcement Protocol", RFC 2974, October 2000. [15] Schulzrinne, H., Rao, A., and R. Lanphier, "Real Time Streaming Protocol (RTSP)", RFC 2326, April 1998. [16] Transparent end-to-end packet switched streaming service (PSS); Protocols and codecs (Release 6), TS 26.234 v 6.1.0, September 2004. [17] Schulzrinne, H. and S. Casner, "RTP Profile for Audio and Video Conferences with Minimal Control", STD 65, RFC 3551, July 2003. [18] Yergeau, F., "UTF-8, a transformation format of ISO 10646", STD 63, RFC 3629, November 2003. [19] Hoffman, P. and F. Yergeau, "UTF-16, an encoding of ISO 10646", RFC 2781, February 2000. [20] Friedman, T., Caceres, R., and A. Clark, "RTP Control Protocol Extended Reports (RTCP XR)", RFC 3611, November 2003. [21] Ott, J., Wenger, S., Sato, N., Burmeister, C., and J. Rey, "Extended RTP Profile for RTCP-based Feedback (RTP/AVPF)", Work in Progress, August 2004. [22] Hellstrom, G., "RTP Payload for Text Conversation", RFC 2793, May 2000. [23] Hellstrom, G. and P. Jones, "RTP Payload for Text Conversation", RFC 4103, June 2005.
[24] ITU-T Recommendation T.140 (1998) - Text conversation protocol for multimedia application, with amendment 1, (2000). [25] ISO/IEC 10646-1: (1993), Universal Multiple Octet Coded Character Set. [26] ISO/IEC FCD 14496-17 Information technology - Coding of audio- visual objects - Part 17: Streaming text format, Work in progress, June 2004. [27] Transparent end-to-end Packet-switched Streaming Service (PSS); 3GPP SMIL language profile, (Release 6), TS 26.246 v 6.0.0, June 2004. [28] Casner, S. and P. Hoschka, "MIME Type Registration of RTP Payload Formats", RFC 3555, July 2003. [29] Freed, N. and J. Klensin, "Media Type Specifications and Registration Procedures", BCP 13, RFC 4288, December 2005. [30] Transparent end-to-end packet switched streaming service (PSS); 3GPP file format (3GP) (Release 6), TS 26.244 V6.3. March 2005. [31] Castagno, R. and D. Singer, "MIME Type Registrations for 3rd Generation Partnership Project (3GPP) Multimedia files", RFC 3839, July 2004.
13. Basics of the 3GP File Structure
This section provides a coarse overview of the 3GP file structure, which follows the ISO Base Media file Format [2]. Each 3GP file consists of "Boxes". In general, a 3GP file contains the File Type Box (ftyp), the Movie Box (moov), and the Media Data Box (mdat). The File Type Box identifies the type and properties of the 3GP file itself. The Movie Box and the Media Data Box, serving as containers, include their own boxes for each media. Boxes start with a header, which indicates both size and type (these fields are called, namely, "size" and "type"). Additionally, each box type may include a number of boxes. In the following, only those boxes are mentioned that are useful for the purposes of this payload format. The Movie Box (moov) contains one or more Track Boxes (trak), which include information about each track. A Track Box contains, among others, the Track Header Box (tkhd), the Media Header Box (mdhd), and the Media Information Box (minf). The Track Header Box specifies the characteristics of a single track, where a track is, in this case, the streamed text during a session. Exactly one Track Header Box is present for a track. It contains information about the track, such as the spatial layout (width and height), the video transformation matrix, and the layer number. Since these pieces of information are essential and static (i.e., constant) for the duration of the session, they must be sent prior to the transmission of any text samples. The Media Header Box contains the "timescale" or number of time units that pass in one second, i.e., cycles per second or Hertz. The Media Information Box includes the Sample Table Box (stbl), which contains all the time and data indexing of the media samples in a track. Using this box, it is possible to locate samples in time and to determine their type, size, container, and offset into that container. Inside the Sample Table Box, we can find the Sample Description Box (stsd, for finding sample descriptions), the Decoding Time to Sample Box (stts, for finding sample duration), the Sample Size Box (stsz), and the Sample to Chunk Box (stsc, for finding the sample description index). Finally, the Media Data Box contains the media data itself. In timed text tracks, this box contains text samples. Its equivalent to audio and video is audio and video frames, respectively. The text sample consists of the text length, the text string, and one or several Modifier Boxes. The text length is the size of the text in bytes.
The text string is plain text to render. The Modifier Box is information to render in addition to the text, such as color, font, etc.14. Acknowledgements
The authors would like to thank Dave Singer, Jan van der Meer, Magnus Westerlund, and Colin Perkins for their comments and suggestions about this document. The authors would also like to thank Markus Gebhard for the free and publicly available JavE ASCII Editor (used for the ASCII drawings in this document) and Henrik Levkowetz for the Idnits web service.Authors' Addresses
Jose Rey Panasonic R&D Center Germany GmbH Monzastr. 4c D-63225 Langen, Germany EMail: jose.rey@eu.panasonic.com Phone: +49-6103-766-134 Fax: +49-6103-766-166 Yoshinori Matsui Matsushita Electric Industrial Co., LTD. 1006 Kadoma Kadoma-shi, Osaka, Japan EMail: matsui.yoshinori@jp.panasonic.com Phone: +81 6 6900 9689 Fax: +81 6 6900 9699
Full Copyright Statement Copyright (C) The Internet Society (2006). This document is subject to the rights, licenses and restrictions contained in BCP 78, and except as set forth therein, the authors retain all their rights. This document and the information contained herein are provided on an "AS IS" basis and THE CONTRIBUTOR, THE ORGANIZATION HE/SHE REPRESENTS OR IS SPONSORED BY (IF ANY), THE INTERNET SOCIETY AND THE INTERNET ENGINEERING TASK FORCE DISCLAIM ALL WARRANTIES, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO ANY WARRANTY THAT THE USE OF THE INFORMATION HEREIN WILL NOT INFRINGE ANY RIGHTS OR ANY IMPLIED WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. Intellectual Property The IETF takes no position regarding the validity or scope of any Intellectual Property Rights or other rights that might be claimed to pertain to the implementation or use of the technology described in this document or the extent to which any license under such rights might or might not be available; nor does it represent that it has made any independent effort to identify any such rights. Information on the procedures with respect to rights in RFC documents can be found in BCP 78 and BCP 79. Copies of IPR disclosures made to the IETF Secretariat and any assurances of licenses to be made available, or the result of an attempt made to obtain a general license or permission for the use of such proprietary rights by implementers or users of this specification can be obtained from the IETF on-line IPR repository at http://www.ietf.org/ipr. The IETF invites any interested party to bring to its attention any copyrights, patents or patent applications, or other proprietary rights that may cover technology that may be required to implement this standard. Please address the information to the IETF at ietf-ipr@ietf.org. Acknowledgement Funding for the RFC Editor function is provided by the IETF Administrative Support Activity (IASA).