Network Working Group                                         J. Lazzaro
Request for Comments: 4695                                  J. Wawrzynek
Category: Standards Track                                    UC Berkeley
                                                           November 2006

                      RTP Payload Format for MIDI

Status of This Memo

This document specifies an Internet standards track protocol for the Internet community, and requests discussion and suggestions for improvements. Please refer to the current edition of the "Internet Official Protocol Standards" (STD 1) for the standardization state and status of this protocol. Distribution of this memo is unlimited.

Copyright Notice

Copyright (C) The IETF Trust (2006).

Abstract
This memo describes a Real-time Transport Protocol (RTP) payload format for the MIDI (Musical Instrument Digital Interface) command language. The format encodes all commands that may legally appear on a MIDI 1.0 DIN cable. The format is suitable for interactive applications (such as network musical performance) and content-delivery applications (such as file streaming). The format may be used over unicast and multicast UDP and TCP, and it defines tools for graceful recovery from packet loss. Stream behavior, including the MIDI rendering method, may be customized during session setup. The format also serves as a mode for the mpeg4-generic format, to support the MPEG 4 Audio Object Types for General MIDI, Downloadable Sounds Level 2, and Structured Audio.

Table of Contents
1. Introduction
   1.1. Terminology
   1.2. Bitfield Conventions
2. Packet Format
   2.1. RTP Header
   2.2. MIDI Payload
3. MIDI Command Section
   3.1. Timestamps
   3.2. Command Coding
4. The Recovery Journal System
5. Recovery Journal Format
6. Session Description Protocol
   6.1. Session Descriptions for Native Streams
   6.2. Session Descriptions for mpeg4-generic Streams
   6.3. Parameters
7. Extensibility
8. Congestion Control
9. Security Considerations
10. Acknowledgements
11. IANA Considerations
   11.1. rtp-midi Media Type Registration
      11.1.1. Repository Request for "audio/rtp-midi"
   11.2. mpeg4-generic Media Type Registration
      11.2.1. Repository Request for Mode rtp-midi for mpeg4-generic
   11.3. asc Media Type Registration
A. The Recovery Journal Channel Chapters
   A.1. Recovery Journal Definitions
   A.2. Chapter P: MIDI Program Change
   A.3. Chapter C: MIDI Control Change
      A.3.1. Log Inclusion Rules
      A.3.2. Controller Log Format
      A.3.3. Log List Coding Rules
      A.3.4. The Parameter System
   A.4. Chapter M: MIDI Parameter System
      A.4.1. Log Inclusion Rules
      A.4.2. Log Coding Rules
         A.4.2.1. The Value Tool
         A.4.2.2. The Count Tool
   A.5. Chapter W: MIDI Pitch Wheel
   A.6. Chapter N: MIDI NoteOff and NoteOn
      A.6.1. Header Structure
      A.6.2. Note Structures
   A.7. Chapter E: MIDI Note Command Extras
      A.7.1. Note Log Format
      A.7.2. Log Inclusion Rules
   A.8. Chapter T: MIDI Channel Aftertouch
   A.9. Chapter A: MIDI Poly Aftertouch
B. The Recovery Journal System Chapters
   B.1. System Chapter D: Simple System Commands
      B.1.1. Undefined System Commands
   B.2. System Chapter V: Active Sense Command
   B.3. System Chapter Q: Sequencer State Commands
      B.3.1. Non-compliant Sequencers
   B.4. System Chapter F: MIDI Time Code Tape Position
      B.4.1. Partial Frames
   B.5. System Chapter X: System Exclusive
      B.5.1. Chapter Format
      B.5.2. Log Inclusion Semantics
      B.5.3. TCOUNT and COUNT Fields
C. Session Configuration Tools
   C.1. Configuration Tools: Stream Subsetting
   C.2. Configuration Tools: The Journalling System
      C.2.1. The j_sec Parameter
      C.2.2. The j_update Parameter
         C.2.2.1. The anchor Sending Policy
         C.2.2.2. The closed-loop Sending Policy
         C.2.2.3. The open-loop Sending Policy
      C.2.3. Recovery Journal Chapter Inclusion Parameters
   C.3. Configuration Tools: Timestamp Semantics
      C.3.1. The comex Algorithm
      C.3.2. The async Algorithm
      C.3.3. The buffer Algorithm
   C.4. Configuration Tools: Packet Timing Tools
      C.4.1. Packet Duration Tools
      C.4.2. The guardtime Parameter
   C.5. Configuration Tools: Stream Description
   C.6. Configuration Tools: MIDI Rendering
      C.6.1. The multimode Parameter
      C.6.2. Renderer Specification
      C.6.3. Renderer Initialization
      C.6.4. MIDI Channel Mapping
         C.6.4.1. The smf_info Parameter
         C.6.4.2. The smf_inline, smf_url, and smf_cid Parameters
         C.6.4.3. The chanmask Parameter
      C.6.5. The audio/asc Media Type
   C.7. Interoperability
      C.7.1. MIDI Content Streaming Applications
      C.7.2. MIDI Network Musical Performance Applications
D. Parameter Syntax Definitions
E. A MIDI Overview for Networking Specialists
   E.1. Commands Types
   E.2. Running Status
   E.3. Command Timing
   E.4. AudioSpecificConfig Templates for MMA Renderers
References
   Normative References
   Informative References
1. Introduction
The Internet Engineering Task Force (IETF) has developed a set of focused tools for multimedia networking ([RFC3550] [RFC4566] [RFC3261] [RFC2326]). These tools can be combined in different ways to support a variety of real-time applications over Internet Protocol (IP) networks.

For example, a telephony application might use the Session Initiation Protocol (SIP, [RFC3261]) to set up a phone call. Call setup would include negotiations to agree on a common audio codec [RFC3264]. Negotiations would use the Session Description Protocol (SDP, [RFC4566]) to describe candidate codecs. After a call is set up, audio data would flow between the parties using the Real-time Transport Protocol (RTP, [RFC3550]) under any applicable profile (for example, the Audio/Video Profile (AVP, [RFC3551])).

The tools used in this telephony example (SIP, SDP, RTP) might be combined in a different way to support a content streaming application, perhaps in conjunction with other tools, such as the Real Time Streaming Protocol (RTSP, [RFC2326]).

The MIDI (Musical Instrument Digital Interface) command language [MIDI] is widely used in musical applications that are analogous to the examples described above. On stage and in the recording studio, MIDI is used for the interactive remote control of musical instruments, an application similar in spirit to telephony. On web pages, Standard MIDI Files (SMFs, [MIDI]) rendered using the General MIDI standard [MIDI] provide a low-bandwidth substitute for audio streaming.

This memo is motivated by a simple premise: if MIDI performances could be sent as RTP streams that are managed by IETF session tools, a hybridization of the MIDI and IETF application domains may occur.

For example, interoperable MIDI networking may foster network music performance applications, in which a group of musicians, located at different physical locations, interact over a network to perform as they would if they were located in the same room [NMP]. As a second example, the streaming community may begin to use MIDI for low-bitrate audio coding, perhaps in conjunction with normative sound synthesis methods [MPEGSA].

To enable MIDI applications to use RTP, this memo defines an RTP payload format and its media type. Sections 2-5 and Appendices A-B define the RTP payload format. Section 6 and Appendices C-D define the media types identifying the payload format, the parameters needed for configuration, and how the parameters are utilized in SDP.
Appendix C also includes interoperability guidelines for the example applications described above: network musical performance using SIP (Appendix C.7.2) and content-streaming using RTSP (Appendix C.7.1).

Another potential application area for RTP MIDI is MIDI networking for professional audio equipment and electronic musical instruments. We do not offer interoperability guidelines for this application in this memo. However, RTP MIDI has been designed with stage and studio applications in mind, and we expect that efforts to define a stage and studio framework will rely on RTP MIDI for MIDI transport services.

Some applications may require MIDI media delivery at a certain service quality level (latency, jitter, packet loss, etc). RTP itself does not provide service guarantees. However, applications may use lower-layer network protocols to configure the quality of the transport services that RTP uses. These protocols may act to reserve network resources for RTP flows [RFC2205] or may simply direct RTP traffic onto a dedicated "media network" in a local installation. Note that RTP and the MIDI payload format do provide tools that applications may use to achieve the best possible real-time performance at a given service level.

This memo normatively defines the syntax and semantics of the MIDI payload format. However, this memo does not define algorithms for sending and receiving packets. An ancillary document [RFC4696] provides informative guidance on algorithms. Supplemental information may be found in related conference publications [NMP] [GRAME].

Throughout this memo, the phrase "native stream" refers to a stream that uses the rtp-midi media type. The phrase "mpeg4-generic stream" refers to a stream that uses the mpeg4-generic media type (in mode rtp-midi) to operate in an MPEG 4 environment [RFC3640]. Section 6 describes this distinction in detail.

1.1. Terminology
In this document, the key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" are to be interpreted as described in BCP 14, RFC 2119 [RFC2119].
1.2. Bitfield Conventions
In this document, the packet bitfields that share a common name often have identical semantics. As most of these bitfields appear in Appendices A-B, we define the common bitfield names in Appendix A.1.

However, a few of these common names also appear in the main text of this document. For convenience, we list these definitions below:

   o  R flag bit. R flag bits are reserved for future use. Senders MUST set R bits to 0. Receivers MUST ignore R bit values.

   o  LENGTH field. All fields named LENGTH (as distinct from LEN) code the number of octets in the structure that contains it, including the header it resides in and all hierarchical levels below it. If a structure contains a LENGTH field, a receiver MUST use the LENGTH field value to advance past the structure during parsing, rather than use knowledge about the internal format of the structure.

2. Packet Format
In this section, we introduce the format of RTP MIDI packets. The description includes some background information on RTP, for the benefit of MIDI implementors new to IETF tools. Implementors should consult [RFC3550] for an authoritative description of RTP. This memo assumes that the reader is familiar with MIDI syntax and semantics. Appendix E provides a MIDI overview, at a level of detail sufficient to understand most of this memo. Implementors should consult [MIDI] for an authoritative description of MIDI. The MIDI payload format maps a MIDI command stream (16 voice channels + systems) onto an RTP stream. An RTP media stream is a sequence of logical packets that share a common format. Each packet consists of two parts: the RTP header and the MIDI payload. Figure 1 shows this format (vertical space delineates the header and payload). We describe RTP packets as "logical" packets to highlight the fact that RTP itself is not a network-layer protocol. Instead, RTP packets are mapped onto network protocols (such as unicast UDP, multicast UDP, or TCP) by an application [ALF]. The interleaved mode of the Real Time Streaming Protocol (RTSP, [RFC2326]) is an example of an RTP mapping to TCP transport, as is [RFC4571].
2.1. RTP Header
[RFC3550] provides a complete description of the RTP header fields. In this section, we clarify the role of a few RTP header fields for MIDI applications. All fields are coded in network byte order (big-endian).

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   | V |P|X|  CC   |M|     PT      |        Sequence number        |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                           Timestamp                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                             SSRC                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                    MIDI command section ...                   |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                      Journal section ...                      |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                      Figure 1 -- Packet format

The behavior of the 1-bit M field depends on the media type of the stream. For native streams, the M bit MUST be set to 1 if the MIDI command section has a non-zero LEN field, and MUST be set to 0 otherwise. For mpeg4-generic streams, the M bit MUST be set to 1 for all packets (to conform to [RFC3640]).

In an RTP MIDI stream, the 16-bit sequence number field is initialized to a randomly chosen value and is incremented by one (modulo 2^16) for each packet sent in the stream. A related quantity, the 32-bit extended packet sequence number, may be computed by tracking rollovers of the 16-bit sequence number. Note that different receivers of the same stream may compute different extended packet sequence numbers, depending on when the receiver joined the session.

The 32-bit timestamp field sets the base timestamp value for the packet. The payload codes MIDI command timing relative to this value. The timestamp units are set by the clock rate parameter. For example, if the clock rate has a value of 44100 Hz, two packets whose base timestamp values differ by 2 seconds have RTP timestamp fields that differ by 88200.
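As an illustration of the sequence number and timestamp arithmetic described above, the following Python sketch tracks rollovers of the 16-bit sequence number to form an extended sequence number, and converts a time interval into RTP timestamp units. The names ExtendedSeqTracker and timestamp_ticks are hypothetical. The half-range test used to classify reordered packets is an assumption of this sketch, not a requirement of this memo.

```python
MOD16 = 1 << 16
MOD32 = 1 << 32


class ExtendedSeqTracker:
    """Tracks rollovers of the 16-bit RTP sequence number (hypothetical helper)."""

    def __init__(self, first_seq: int) -> None:
        self.cycles = 0            # number of rollovers observed so far
        self.highest = first_seq   # newest 16-bit sequence number seen

    def update(self, seq: int) -> int:
        """Return the extended sequence number for a received packet."""
        ahead = (seq - self.highest) % MOD16
        if ahead < MOD16 // 2:
            # Packet is at or ahead of the newest one seen (assumed heuristic).
            if seq < self.highest:
                self.cycles += 1   # the 16-bit counter wrapped past 0xFFFF
            self.highest = seq
            return (self.cycles * MOD16 + seq) % MOD32
        # Late or reordered packet: it may belong to the previous cycle.
        cycles = self.cycles - 1 if seq > self.highest else self.cycles
        return (cycles * MOD16 + seq) % MOD32


def timestamp_ticks(seconds: float, clock_rate: int) -> int:
    """Convert a duration in seconds to RTP timestamp units (modulo 2^32)."""
    return round(seconds * clock_rate) % MOD32
```

Because the tracker starts counting cycles when it is created, two receivers that join the stream at different times may report different extended values for the same packet, as noted above.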
Note that the clock rate parameter is not encoded within each RTP MIDI packet. A receiver of an RTP MIDI stream becomes aware of the clock rate as part of the session setup process. For example, if a session management tool uses the Session Description Protocol (SDP, [RFC4566]) to describe a media session, the clock rate parameter is set using the rtpmap attribute. We show examples of session setup in Section 6. For RTP MIDI streams destined to be rendered into audio, the clock rate SHOULD be an audio sample rate of 32 KHz or higher. This recommendation is due to the sensitivity of human musical perception to small timing errors in musical note sequences, and due to the timbral changes that occur when two near-simultaneous MIDI NoteOns are rendered with a different timing than that desired by the content author due to clock rate quantization. RTP MIDI streams that are not destined for audio rendering (such as MIDI streams that control stage lighting) MAY use a lower clock rate but SHOULD use a clock rate high enough to avoid timing artifacts in the application. For RTP MIDI streams destined to be rendered into audio, the clock rate SHOULD be chosen from rates in common use in professional audio applications or in consumer audio distribution. At the time of this writing, these rates include 32 KHz, 44.1 KHz, 48 KHz, 64 KHz, 88.2 KHz, 96 KHz, 176.4 KHz, and 192 KHz. If the RTP MIDI session is a part of a synchronized media session that includes another (non-MIDI) RTP audio stream with a clock rate of 32 KHz or higher, the RTP MIDI stream SHOULD use a clock rate that matches the clock rate of the other audio stream. However, if the RTP MIDI stream is destined to be rendered into audio, the RTP MIDI stream SHOULD NOT use a clock rate lower than 32 KHz, even if this second stream has a clock rate less than 32 KHz. Timestamps of consecutive packets do not necessarily increment at a fixed rate, because RTP MIDI packets are not necessarily sent at a fixed rate. 
The degree of packet transmission regularity reflects the underlying application dynamics. Interactive applications may vary the packet sending rate to track the gestural rate of a human performer, whereas content-streaming applications may send packets at a fixed rate. Therefore, the timestamps for two sequential RTP packets may be identical, or the second packet may have a timestamp arbitrarily larger than the first packet (modulo 2^32). Section 3 places additional restrictions on the RTP timestamps for two sequential RTP packets, as does the guardtime parameter (Appendix C.4.2). We use the term "media time" to denote the temporal duration of the media coded by an RTP packet. The media time coded by a packet is
computed by subtracting the last command timestamp in the MIDI command section from the RTP timestamp (modulo 2^32). If the MIDI list of the MIDI command section of a packet is empty, the media time coded by the packet is 0 ms. Appendix C.4.1 discusses media time issues in detail. We now define RTP session semantics, in the context of sessions specified using the session description protocol [RFC4566]. A session description media line ("m=") specifies an RTP session. An RTP session has an independent space of 2^32 synchronization sources. Synchronization source identifiers are coded in the SSRC header field of RTP session packets. The payload types that may appear in the PT header field of RTP session packets are listed at the end of the media line. Several RTP MIDI streams may appear in an RTP session. Each stream is distinguished by a unique SSRC value and has a unique sequence number and RTP timestamp space. Multiple streams in the RTP session may be sent by a single party. Multiple parties may send streams in the RTP session. An RTP MIDI stream encodes data for a single MIDI command name space (16 voice channels + Systems). Streams in an RTP session may use different payload types, or they may use the same payload type. However, each party may send, at most, one RTP MIDI stream for each payload type mapped to an RTP MIDI payload format in an RTP session. Recall that dynamic binding of payload type numbers in [RFC4566] lets a party map many payload type numbers to the RTP MIDI payload format; thus a party may send many RTP MIDI streams in a single RTP session. Pairs of streams (unicast or multicast) that communicate between two parties in an RTP session and that share a payload type have the same association as a MIDI cable pair that cross-connects two devices in a MIDI 1.0 DIN network. 
The RTP session architecture described above is efficient in its use of network ports, as one RTP session (using a port pair per party) supports the transport of many MIDI name spaces (16 MIDI channels + systems). We define tools for grouping and labelling MIDI name spaces across streams and sessions in Appendix C.5 of this memo. The RTP header timestamps for each stream in an RTP session have separately and randomly chosen initialization values. Receivers use the timing fields encoded in the RTP control protocol (RTCP, [RFC3550]) sender reports to synchronize the streams sent by a party. The SSRC values for each stream in an RTP session are also separately and randomly chosen, as described in [RFC3550]. Receivers use the CNAME field encoded in RTCP sender reports to verify that streams were sent by the same party, and to detect SSRC collisions, as described in [RFC3550].
In some applications, a receiver renders MIDI commands into audio (or into control actions, such as the rewind of a tape deck or the dimming of stage lights). In other applications, a receiver presents a MIDI stream to software programs via an Application Programmer Interface (API). Appendix C.6 defines session configuration tools to specify what receivers should do with a MIDI command stream.

If a multimedia session uses different RTP MIDI streams to send different classes of media, the streams MUST be sent over different RTP sessions. For example, if a multimedia session uses one MIDI stream for audio and a second MIDI stream to control a lighting system, the audio and lighting streams MUST be sent over different RTP sessions, each with its own media line.

Session description tools defined in Appendix C.5 let a sending party split a single MIDI name space (16 voice channels + systems) over several RTP MIDI streams. Split transport of a MIDI command stream is a delicate task, because correct command stream reconstruction by a receiver depends on exact timing synchronization across the streams.

To support split name spaces, we define the following requirements:

   o  A party MUST NOT send several RTP MIDI streams that share a MIDI name space in the same RTP session. Instead, each stream MUST be sent from a different RTP session.

   o  If several RTP MIDI streams sent by a party share a MIDI name space, all streams MUST use the same SSRC value and MUST use the same randomly chosen RTP timestamp initialization value.

These rules let a receiver identify streams that share a MIDI name space (by matching SSRC values) and also let a receiver accurately reconstruct the source MIDI command stream (by using RTP timestamps to interleave commands from the two streams). Care MUST be taken by senders to ensure that SSRC changes due to collisions are reflected in both streams.
Receivers MUST regularly examine the RTCP CNAME fields associated with the linked streams, to ensure that the assumed link is legitimate and not the result of an SSRC collision by another sender. Except for the special cases described above, a party may send many RTP MIDI streams in the same session. However, it is sometimes advantageous for two RTP MIDI streams to be sent over different RTP sessions. For example, two streams may need different values for RTP session-level attributes (such as the sendonly and recvonly attributes). As a second example, two RTP sessions may be needed to send two unicast streams in a multimedia session that originate on
different computers (with different IP addresses). Two RTP sessions are needed in this case because transport addresses are specified on the RTP-session or multimedia-session level, not on a payload type level.

On a final note, in some uses of MIDI, parties send bidirectional traffic to conduct transactions (such as file exchange). These commands were designed to work over MIDI 1.0 DIN cable networks, which may be configured in a multicast topology that uses pure "party-line" signalling. Thus, if a multimedia session ensures a multicast connection between all parties, bidirectional MIDI commands will work without additional support from the RTP MIDI payload format.

2.2. MIDI Payload
The payload (Figure 1) MUST begin with the MIDI command section. The MIDI command section codes a (possibly empty) list of timestamped MIDI commands, and provides the essential service of the payload format. The payload MAY also contain a journal section. The journal section provides resiliency by coding the recent history of the stream. A flag in the MIDI command section codes the presence of a journal section in the payload. Section 3 defines the MIDI command section. Sections 4-5 and Appendices A-B define the recovery journal, the default format for the journal section. Here, we describe how these payload sections operate in a stream in an RTP session. The journalling method for a stream is set at the start of a session and MUST NOT be changed thereafter. A stream may be set to use the recovery journal, to use an alternative journal format (none are defined in this memo), or not to use a journal. The default journalling method of a stream is inferred from its transport type. Streams that use unreliable transport (such as UDP) default to using the recovery journal. Streams that use reliable transport (such as TCP) default to not using a journal. Appendix C.2.1 defines session configuration tools for overriding these defaults. For all types of transport, a sender MUST transmit an RTP packet stream with consecutive sequence numbers (modulo 2^16). If a stream uses the recovery journal, every payload in the stream MUST include a journal section. If a stream does not use journalling, a journal section MUST NOT appear in a stream payload. If a stream uses an alternative journal format, the specification for the journal format defines an inclusion policy.
If a stream is sent over UDP transport, the Maximum Transmission Unit (MTU) of the underlying network limits the practical size of the payload section (for example, an Ethernet MTU is 1500 octets), for applications where predictable and minimal packet transmission latency is critical. A sender SHOULD NOT create RTP MIDI UDP packets whose size exceeds the MTU of the underlying network. Instead, the sender SHOULD take steps to keep the maximum packet size under the MTU limit.

These steps may take many forms. The default closed-loop recovery journal sending policy (defined in Appendix C.2.2.2) uses RTP control protocol (RTCP, [RFC3550]) feedback to manage the RTP MIDI packet size. In addition, Section 3.2 and Appendix B.5.2 provide specific tools for managing the size of packets that code MIDI System Exclusive (0xF0) commands. Appendix C.5 defines session configuration tools that may be used to split a dense MIDI name space into several UDP streams (each sent in a different RTP session, per Section 2.1) so that the payload fits comfortably into an MTU. Another option is to use TCP. Section 4.3 of [RFC4696] provides non-normative advice for packet size management.

3. MIDI Command Section
Figure 2 shows the format of the MIDI command section.

    0                   1                   2                   3
    0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |B|J|Z|P|LEN... |  MIDI list ...                                |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                  Figure 2 -- MIDI command section

The MIDI command section begins with a variable-length header.

The header field LEN codes the number of octets in the MIDI list that follow the header. If the header flag B is 0, the header is one octet long, and LEN is a 4-bit field, supporting a maximum MIDI list length of 15 octets.

If B is 1, the header is two octets long, and LEN is a 12-bit field, supporting a maximum MIDI list length of 4095 octets. LEN is coded in network byte order (big-endian): the 4 bits of LEN that appear in the first header octet code the most significant 4 bits of the 12-bit LEN value.

A LEN value of 0 is legal, and it codes an empty MIDI list.
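The header layout described above can be sketched in Python as follows. The sketch parses and builds the one-octet and two-octet header forms. The names CommandSectionHeader, parse_header, and build_header are hypothetical, and the choice to emit the one-octet form whenever LEN fits in 4 bits is an assumption of this sketch (the text above does not forbid a sender from setting B to 1 for a short list).

```python
from typing import NamedTuple


class CommandSectionHeader(NamedTuple):
    b: bool            # True if the header is two octets long
    j: bool            # True if a journal section follows the MIDI list
    z: bool            # Z flag (see the MIDI list description)
    p: bool            # P flag (semantics defined in Section 3.2)
    length: int        # LEN: number of octets in the MIDI list
    header_size: int   # 1 or 2 octets


def parse_header(payload: bytes) -> CommandSectionHeader:
    """Decode the flags and LEN field at the start of an RTP MIDI payload."""
    if not payload:
        raise ValueError("empty payload")
    first = payload[0]
    b = bool(first & 0x80)
    length = first & 0x0F          # 4-bit LEN, or top 4 bits of 12-bit LEN
    size = 1
    if b:
        if len(payload) < 2:
            raise ValueError("truncated two-octet header")
        length = (length << 8) | payload[1]
        size = 2
    return CommandSectionHeader(b, bool(first & 0x40), bool(first & 0x20),
                                bool(first & 0x10), length, size)


def build_header(length: int, j: bool = False, z: bool = False,
                 p: bool = False) -> bytes:
    """Encode a header, using the short form when LEN fits in 4 bits."""
    if not 0 <= length <= 4095:
        raise ValueError("LEN must be in the range 0..4095")
    flags = (0x40 if j else 0) | (0x20 if z else 0) | (0x10 if p else 0)
    if length <= 15:
        return bytes([flags | length])
    return bytes([0x80 | flags | (length >> 8), length & 0xFF])
```

For example, build_header(300, z=True) yields the octets 0xA1 0x2C: B and Z are set, and the 12-bit LEN value 0x12C is split across the two octets.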
If the J header bit is set to 1, a journal section MUST appear after the MIDI command section in the payload. If the J header bit is set to 0, the payload MUST NOT contain a journal section.

We define the semantics of the P header bit in Section 3.2.

If the LEN header field is nonzero, the MIDI list has the structure shown in Figure 3.

   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Delta Time 0     (1-4 octets long, or 0 octets if Z = 0)     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  MIDI Command 0   (1 or more octets long)                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Delta Time 1     (1-4 octets long)                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  MIDI Command 1   (1 or more octets long)                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |                              ...                              |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  Delta Time N     (1-4 octets long)                           |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+
   |  MIDI Command N   (0 or more octets long)                     |
   +-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+-+

                   Figure 3 -- MIDI list structure

If the header flag Z is 1, the MIDI list begins with a complete MIDI command (coded in the MIDI Command 0 field, in Figure 3) preceded by a delta time (coded in the Delta Time 0 field). If Z is 0, the Delta Time 0 field is not present in the MIDI list, and the command coded in the MIDI Command 0 field has an implicit delta time of 0.

The MIDI list structure may also optionally encode a list of N additional complete MIDI commands, each coded in a MIDI Command K field. Each additional command MUST be preceded by a Delta Time K field, which codes the command's delta time. We discuss exceptions to the "command fields code complete MIDI commands" rule in Section 3.2.

The final MIDI command field (i.e., the MIDI Command N field, shown in Figure 3) in the MIDI list MAY be empty.
Moreover, a MIDI list MAY consist of a single delta time (encoded in the Delta Time 0 field) without an associated command (which would have been encoded in the MIDI Command 0 field). These rules enable MIDI coding features that are explained in Section 3.1. We delay the explanations because an understanding of RTP MIDI timestamps is necessary to describe the features.
3.1. Timestamps
In this section, we describe how RTP MIDI encodes a timestamp for each MIDI list command. Command timestamps have the same units as RTP packet header timestamps (described in Section 2.1 and [RFC3550]). Recall that RTP timestamps have units of seconds, whose scaling is set during session configuration (see Section 6.1 and [RFC4566]). As shown in Figure 3, the MIDI list encodes time using a compact delta-time format. The RTP MIDI delta time syntax is a modified form of the MIDI File delta time syntax [MIDI]. RTP MIDI delta times use 1-4 octet fields to encode 32-bit unsigned integers. Figure 4 shows the encoded and decoded forms of delta times. Note that delta time values may be legally encoded in multiple formats; for example, there are four legal ways to encode the zero delta time (0x00, 0x8000, 0x808000, 0x80808000). RTP MIDI uses delta times to encode a timestamp for each MIDI command. The timestamp for MIDI Command K is the summation (modulo 2^32) of the RTP timestamp and decoded delta times 0 through K. This cumulative coding technique, borrowed from MIDI File delta time coding, is efficient because it reduces the number of multi-octet delta times. All command timestamps in a packet MUST be less than or equal to the RTP timestamp of the next packet in the stream (modulo 2^32). This restriction ensures that a particular RTP MIDI packet in a stream is uniquely responsible for encoding time starting at the moment after the RTP timestamp encoded in the RTP packet header, and ending at the moment before the final command timestamp encoded in the MIDI list. The "moment before" and "moment after" qualifiers acknowledge the "less than or equal" semantics (as opposed to "strictly less than") in the sentence above this paragraph. 
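The delta time coding and the cumulative timestamp rule can be sketched in Python as follows. The sketch assumes the octet layout implied by the four zero encodings listed above and by MIDI File variable-length quantities: each octet carries 7 value bits, most significant group first, and the high bit is set on every octet except the last. Under that assumption a 4-octet field carries at most 28 value bits, which decode into a 32-bit unsigned integer. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple

MOD32 = 1 << 32
MAX_DELTA = 0x0FFFFFFF  # largest value that fits in four 7-bit groups


def encode_delta_time(value: int) -> bytes:
    """Encode a delta time in the shortest legal form (1-4 octets)."""
    if not 0 <= value <= MAX_DELTA:
        raise ValueError("delta time out of range")
    groups = [value & 0x7F]                  # final octet: high bit clear
    value >>= 7
    while value:
        groups.append((value & 0x7F) | 0x80)  # earlier octets: high bit set
        value >>= 7
    return bytes(reversed(groups))


def decode_delta_time(data: bytes, offset: int = 0) -> Tuple[int, int]:
    """Return (value, octets consumed) for the delta time at data[offset]."""
    value = 0
    for i in range(4):
        if offset + i >= len(data):
            raise ValueError("truncated delta time")
        octet = data[offset + i]
        value = (value << 7) | (octet & 0x7F)
        if not octet & 0x80:
            return value, i + 1
    raise ValueError("delta time longer than 4 octets")


def command_timestamps(rtp_timestamp: int,
                       delta_times: Iterable[int]) -> List[int]:
    """Timestamp of command K = RTP timestamp + deltas 0..K (modulo 2^32)."""
    stamps = []
    current = rtp_timestamp
    for delta in delta_times:
        current = (current + delta) % MOD32
        stamps.append(current)
    return stamps
```

The decoder accepts the longer zero-padded forms (such as 0x8000 for zero), while the encoder always emits the shortest form. A sender that wanted a fixed-width field could pad with leading 0x80 octets instead.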
Note that it is possible to "pad" the end of an RTP MIDI packet with time that is guaranteed to be void of MIDI commands, by setting the "Delta Time N" field of the MIDI list to the end of the void time, and by omitting its corresponding "MIDI Command N" field (a syntactic construction the preamble of Section 3 expressly made legal).

In addition, it is possible to code an RTP MIDI packet to express that a period of time in the stream is void of MIDI commands. The RTP timestamp in the header would code the start of the void time. The MIDI list of this packet would consist of a "Delta Time 0" field that coded the end of the void time. No other fields would be present in the MIDI list (a syntactic construction the preamble of Section 3 also expressly made legal).

By default, a command timestamp indicates the execution time for the command. The difference between two timestamps indicates the time delay between the execution of the commands. This difference may be zero, coding simultaneous execution. In this memo, we refer to this interpretation of timestamps as "comex" (COMmand EXecution) semantics. We formally define comex semantics in Appendix C.3.

The comex interpretation of timestamps works well for transcoding a Standard MIDI File (SMF) into an RTP MIDI stream, as SMFs code a timestamp for each MIDI command stored in the file. To transcode an SMF that uses metric time markers, use the SMF tempo map (encoded in the SMF as meta-events) to convert metric SMF timestamp units into seconds-based RTP timestamp units.

The comex interpretation also works well for MIDI hardware controllers that are coding raw sensor data directly onto an RTP MIDI stream. Note that this controller design is preferable to a design that converts raw sensor data into a MIDI 1.0 cable command stream and then transcodes the stream onto an RTP MIDI stream.

The comex interpretation of timestamps is usually not the best timestamp interpretation for transcoding a MIDI source that uses implicit command timing (such as MIDI 1.0 DIN cables) into an RTP MIDI stream. Appendix C.3 defines alternatives to comex semantics and describes session configuration tools for selecting the timestamp interpretation semantics for a stream.
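As a non-normative illustration of the void-time construction described earlier in this section, the Python sketch below encodes a delta time in its shortest form and builds the octets of a MIDI list that holds only a "Delta Time 0" field. The names encode_delta_time and void_time_midi_list are hypothetical. The sketch assumes that the delta value fits in the 28 value bits that the four-octet form of Figure 4 can carry, and it does not model the MIDI Command section header.

```python
def encode_delta_time(value):
    """Encode value (0 <= value < 2**28) in the shortest delta time form."""
    if not 0 <= value < (1 << 28):
        raise ValueError("delta time does not fit in four octets")
    octets = [value & 0x7F]                   # final octet: high bit clear
    value >>= 7
    while value:
        octets.append(0x80 | (value & 0x7F))  # earlier octets: high bit set
        value >>= 7
    return bytes(reversed(octets))


def void_time_midi_list(rtp_timestamp, void_end_timestamp):
    """MIDI list holding only "Delta Time 0", coding the end of a void period."""
    delta = (void_end_timestamp - rtp_timestamp) & 0xFFFFFFFF
    return encode_delta_time(delta)
```

A sender that prefers a fixed-width field could instead emit one of the longer legal encodings of the same value; the shortest form is chosen here only for bandwidth efficiency.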
   One-Octet Delta Time:

      Encoded form: 0ddddddd
      Decoded form: 00000000 00000000 00000000 0ddddddd

   Two-Octet Delta Time:

      Encoded form: 1ccccccc 0ddddddd
      Decoded form: 00000000 00000000 00cccccc cddddddd

   Three-Octet Delta Time:

      Encoded form: 1bbbbbbb 1ccccccc 0ddddddd
      Decoded form: 00000000 000bbbbb bbcccccc cddddddd

   Four-Octet Delta Time:

      Encoded form: 1aaaaaaa 1bbbbbbb 1ccccccc 0ddddddd
      Decoded form: 0000aaaa aaabbbbb bbcccccc cddddddd

              Figure 4 -- Decoding delta time formats

3.2. Command Coding
Each non-empty MIDI Command field in the MIDI list codes one of the MIDI command types that may legally appear on a MIDI 1.0 DIN cable. Standard MIDI File meta-events do not fit this definition and MUST NOT appear in the MIDI list.

As a rule, each MIDI Command field codes a complete command, in the binary command format defined in [MIDI]. In the remainder of this section, we describe exceptions to this rule.

The first MIDI channel command in the MIDI list MUST include a status octet. Running status coding, as defined in [MIDI], MAY be used for all subsequent MIDI channel commands in the list. As in [MIDI], System Common and System Exclusive messages (0xF0 ... 0xF7) cancel the running status state, but System Real-time messages (0xF8 ... 0xFF) do not affect the running status state. All System commands in the MIDI list MUST include a status octet.

As we note above, the first channel command in the MIDI list MUST include a status octet. However, the corresponding command in the original MIDI source data stream might not have a status octet (in this case, the source would be coding the command using running status). If the status octet of the first channel command in the MIDI list does not appear in the source data stream, the P (phantom) header bit MUST be set to 1. In all other cases, the P bit MUST be set to 0.

Note that the P bit describes the MIDI source data stream, not the MIDI list encoding; regardless of the state of the P bit, the MIDI list MUST include the status octet.

As receivers MUST be able to decode running status, sender implementors should feel free to use running status to improve bandwidth efficiency. However, senders SHOULD NOT introduce timing jitter into an existing MIDI command stream through an inappropriate use or removal of running status coding. This warning primarily applies to senders whose RTP MIDI streams may be transcoded onto a MIDI 1.0 DIN cable [MIDI] by the receiver: both the timestamps and the command coding (running status or not) must comply with the physical restrictions of implicit time coding over a slow serial line.

On a MIDI 1.0 DIN cable [MIDI], a System Real-time command may be embedded inside of another "host" MIDI command. This syntactic construction is not supported in the payload format: a MIDI Command field in the MIDI list codes exactly one MIDI command (partially or completely).

To encode an embedded System Real-time command, senders MUST extract the command from its host and code it in the MIDI list as a separate command. The host command and System Real-time command SHOULD appear in the same MIDI list. The delta time of the System Real-time command SHOULD result in a command timestamp that encodes the System Real-time command placement in its original embedded position.

Two methods are provided for encoding MIDI System Exclusive (SysEx) commands in the MIDI list.

A SysEx command may be encoded in a MIDI Command field verbatim: a 0xF0 octet, followed by an arbitrary number of data octets, followed by a 0xF7 octet.

Alternatively, a SysEx command may be encoded as multiple segments. The command is divided into two or more SysEx command segments; each segment is encoded in its own MIDI Command field in the MIDI list.
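The extraction of embedded System Real-time commands, described earlier in this section, can be sketched in Python as follows. This is a non-normative illustration: the function name split_embedded_realtime and the (offset, octet) return convention are our own assumptions. How a sender maps an octet offset to a delta time depends on its timing source and is not shown.

```python
def split_embedded_realtime(raw):
    """Separate System Real-time octets (0xF8..0xFF) from a host command.

    raw: octets of one host command as seen on the cable, possibly with
    real-time status octets interleaved. Returns (host, realtime), where
    realtime is a list of (offset, octet) pairs; offset is the octet
    position in raw, from which a sender could derive a delta time.
    """
    host = bytearray()
    realtime = []
    for offset, octet in enumerate(raw):
        if octet >= 0xF8:
            realtime.append((offset, octet))   # code as a separate command
        else:
            host.append(octet)                 # stays with the host command
    return bytes(host), realtime
```

For example, a NoteOn command with a Timing Clock (0xF8) embedded before its final data octet yields the intact three-octet NoteOn plus one real-time entry recording the embedded position.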
The payload format supports segmentation in order to encode SysEx commands that encode information in the temporal pattern of data octets. By encoding these commands as a series of segments, each data octet may be associated with a distinct delta time. Segmentation also supports the coding of large SysEx commands across several packets.

To segment a SysEx command, first partition its data octet list into two or more sublists. The last sublist MAY be empty (i.e., contain no octets); all other sublists MUST contain at least one data octet. To complete the segmentation, add the status octets defined in Figure 5 to the head and tail of the first, last, and any "middle" sublists. Figure 6 shows example segmentations of a SysEx command.

A sender MAY cancel a segmented SysEx command transmission that is in progress, by sending the "cancel" sublist shown in Figure 5. A "cancel" sublist MAY follow a "first" or "middle" sublist in the transmission, but MUST NOT follow a "last" sublist. The cancel MUST be empty (thus, 0xF7 0xF4 is the only legal cancel sublist).

The cancellation feature is needed because Appendix C.1 defines configuration tools that let session parties exclude certain SysEx commands in the stream. Senders that transcode a MIDI source onto an RTP MIDI stream under these constraints have the responsibility of excluding undesired commands from the RTP MIDI stream.

The cancellation feature lets a sender start the transmission of a command before the MIDI source has sent the entire command. If a sender determines that the command whose transmission is in progress should not appear on the RTP stream, it cancels the command. Without a method for cancelling a SysEx command transmission, senders would be forced to use a high-latency store-and-forward approach to transcoding SysEx commands onto RTP MIDI packets, in order to validate each SysEx command before transmission.

The recommended receiver reaction to a cancellation depends on the capabilities of the receiver. For example, a sound synthesizer that is directly parsing RTP MIDI packets and rendering them to audio will be aware of the fact that SysEx commands may be cancelled in RTP MIDI. These receivers SHOULD detect a SysEx cancellation in the MIDI list and act as if they had never received the SysEx command.

As a second example, a synthesizer may be receiving MIDI data from an RTP MIDI stream via a MIDI DIN cable (or a software API emulation of a MIDI DIN cable). In this case, an RTP-MIDI-aware system receives the RTP MIDI stream and transcodes it onto the MIDI DIN cable (or its emulation).
Upon the receipt of the cancel sublist, the RTP-MIDI-aware transcoder might have already sent the first part of the SysEx command on the MIDI DIN cable to the receiver. Unfortunately, the MIDI DIN cable protocol cannot directly code "cancel SysEx in progress" semantics. However, MIDI DIN cable receivers begin SysEx processing after the complete command arrives. The receiver checks to see if it recognizes the command (coded in the first few octets) and then checks to see if the command is the correct length. Thus, in practice, a transcoder can cancel a SysEx command by sending an 0xF7 to (prematurely) end the SysEx command -- the receiver will detect the incorrect command length and discard the command.
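For a receiver that parses RTP MIDI packets directly, the recommended reaction to a cancellation can be sketched as follows. This Python sketch is non-normative: the name reassemble_sysex is hypothetical, the input is assumed to be a sequence of already-separated MIDI Command fields from a loss-free stream, and the 0xF5 "dropped 0xF7" terminator is deliberately not handled.

```python
def reassemble_sysex(fields):
    """Rebuild SysEx commands from MIDI Command fields, honoring cancels.

    fields: iterable of bytes objects, each a verbatim SysEx command or one
    segment. Returns the list of completed commands in verbatim form.
    """
    complete = []
    pending = None                           # data octets of a command in progress
    for field in fields:
        head, tail, data = field[0], field[-1], field[1:-1]
        if head == 0xF0 and tail == 0xF7:    # verbatim command
            complete.append(bytes(field))
        elif head == 0xF0 and tail == 0xF0:  # "first" sublist
            pending = bytearray(data)
        elif pending is None:
            continue                         # no command in progress: ignore
        elif head == 0xF7 and tail == 0xF0:  # "middle" sublist
            pending += data
        elif head == 0xF7 and tail == 0xF7:  # "last" sublist
            pending += data
            complete.append(b"\xF0" + bytes(pending) + b"\xF7")
            pending = None
        elif head == 0xF7 and tail == 0xF4:  # "cancel" sublist
            pending = None                   # act as if never received
    return complete
```

A transcoder that feeds a MIDI DIN cable cannot buffer in this way without adding latency; as described above, it would instead forward segments as they arrive and send an 0xF7 when a cancel sublist is received.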
Appendix C.1 defines configuration tools that may be used to prohibit SysEx command cancellation.

The relative ordering of SysEx command segments in a MIDI list must match the relative ordering of the sublists in the original SysEx command. By default, commands other than System Real-time MIDI commands MUST NOT appear between SysEx command segments (Appendix C.1 defines configuration tools to change this default, to let other command types appear between segments). If the command segments of a SysEx command are placed in the MIDI lists of two or more RTP packets, the segment ordering rules apply to the concatenation of all affected MIDI lists.

    -----------------------------------------------------------
   | Sublist Position  | Head Status Octet | Tail Status Octet |
   |-----------------------------------------------------------|
   |    first          |       0xF0        |       0xF0        |
   |-----------------------------------------------------------|
   |    middle         |       0xF7        |       0xF0        |
   |-----------------------------------------------------------|
   |    last           |       0xF7        |       0xF7        |
   |-----------------------------------------------------------|
   |    cancel         |       0xF7        |       0xF4        |
    -----------------------------------------------------------

         Figure 5 -- Command segmentation status octets

[MIDI] permits 0xF7 octets that are not part of a (0xF0, 0xF7) pair to appear on a MIDI 1.0 DIN cable. Unpaired 0xF7 octets have no semantic meaning in MIDI, apart from cancelling running status.

Unpaired 0xF7 octets MUST NOT appear in the MIDI list of the MIDI Command section. We impose this restriction to avoid interference with the command segmentation coding defined in Figure 5.

SysEx commands carried on a MIDI 1.0 DIN cable may use the "dropped 0xF7" construction [MIDI]. In this coding method, the 0xF7 octet is dropped from the end of the SysEx command, and the status octet of the next MIDI command acts both to terminate the SysEx command and start the next command.
To encode this construction in the payload format, follow these steps:

   o  Determine the appropriate delta times for the SysEx command and the command that follows the SysEx command.

   o  Insert the "dropped" 0xF7 octet at the end of the SysEx command, to form the standard SysEx syntax.

   o  Code both commands into the MIDI list using the rules above.

   o  Replace the 0xF7 octet that terminates the verbatim SysEx encoding or the last segment of the segmented SysEx encoding with a 0xF5 octet. This substitution informs the receiver of the original dropped 0xF7 coding.

[MIDI] reserves the undefined System Common commands 0xF4 and 0xF5 and the undefined System Real-time commands 0xF9 and 0xFD for future use. By default, undefined commands MUST NOT appear in a MIDI Command field in the MIDI list, with the exception of the 0xF5 octets used to code the "dropped 0xF7" construction and the 0xF4 octets used by SysEx "cancel" sublists.

During session configuration, a stream may be customized to transport undefined commands (Appendix C.1). For this case, we now define how senders encode undefined commands in the MIDI list.

An undefined System Real-time command MUST be coded using the System Real-time rules.

If the undefined System Common commands are put to use in a future version of [MIDI], the command will begin with an 0xF4 or 0xF5 status octet, followed by an arbitrary number of data octets (i.e., zero or more data bytes). To encode these commands, senders MUST terminate the command with an 0xF7 octet and place the modified command into the MIDI Command field.

Unfortunately, non-compliant uses of the undefined System Common commands may appear in MIDI implementations. To model these commands, we assume that the command begins with an 0xF4 or 0xF5 status octet, followed by zero or more data octets, followed by zero or more trailing 0xF7 status octets. To encode the command, senders MUST first remove all trailing 0xF7 status octets from the command. Then, senders MUST terminate the command with an 0xF7 octet and place the modified command into the MIDI Command field.
Note that we include the trailing octets in our model as a cautionary measure: if such commands appeared in a non-compliant use of an undefined System Common command, an RTP MIDI encoding of the command that did not remove trailing octets could be mistaken for an encoding of a "middle" or "last" sublist of a segmented SysEx command (Figure 5) under certain packet loss conditions.
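The "dropped 0xF7" steps and the undefined System Common rules above can be illustrated with the following non-normative Python sketch. Both function names are hypothetical. The first function covers only the verbatim SysEx encoding (for a segmented encoding, the same substitution would apply to the tail octet of the "last" segment), and delta time selection is left to the caller.

```python
def encode_dropped_f7(sysex_without_f7):
    """Code a SysEx command whose source omitted the final 0xF7.

    The "dropped" 0xF7 is restored to form standard syntax, then replaced
    by 0xF5 so the receiver can tell the source used the dropped coding.
    """
    standard = bytes(sysex_without_f7) + b"\xF7"   # restore standard syntax
    return standard[:-1] + b"\xF5"                 # substitute the terminator


def encode_undefined_system_common(command):
    """Strip trailing 0xF7 octets, then terminate with a single 0xF7."""
    if command[0] not in (0xF4, 0xF5):
        raise ValueError("not an undefined System Common command")
    return bytes(command).rstrip(b"\xF7") + b"\xF7"
```

Because data octets never have the value 0xF7, stripping from the right removes only the trailing status octets that the model above anticipates.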
   Original SysEx command:

   0xF0 0x01 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0xF7

   A two-segment segmentation:

   0xF0 0x01 0x02 0x03 0x04 0xF0

   0xF7 0x05 0x06 0x07 0x08 0xF7

   A different two-segment segmentation:

   0xF0 0x01 0xF0

   0xF7 0x02 0x03 0x04 0x05 0x06 0x07 0x08 0xF7

   A three-segment segmentation:

   0xF0 0x01 0x02 0xF0

   0xF7 0x03 0x04 0xF0

   0xF7 0x05 0x06 0x07 0x08 0xF7

   The segmentation with the largest number of segments:

   0xF0 0x01 0xF0

   0xF7 0x02 0xF0

   0xF7 0x03 0xF0

   0xF7 0x04 0xF0

   0xF7 0x05 0xF0

   0xF7 0x06 0xF0

   0xF7 0x07 0xF0

   0xF7 0x08 0xF0

   0xF7 0xF7

              Figure 6 -- Example segmentations
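The segmentations of Figure 6 can be reproduced with a short, non-normative Python sketch of the partitioning rule and the status octets of Figure 5. The function name segment_sysex and its sizes parameter (the number of data octets placed in each sublist) are our own conventions, not part of the payload format.

```python
def segment_sysex(command, sizes):
    """Split a verbatim SysEx command into segments per Figure 5.

    command: bytes starting with 0xF0 and ending with 0xF7.
    sizes: data octet count of each sublist (two or more entries); only
    the last entry may be zero, and the entries must sum to the number of
    data octets in the command.
    """
    data = command[1:-1]
    if len(sizes) < 2 or sum(sizes) != len(data):
        raise ValueError("sizes must give two or more sublists covering the data")
    if any(n < 1 for n in sizes[:-1]) or sizes[-1] < 0:
        raise ValueError("only the last sublist may be empty")
    segments, pos = [], 0
    for i, n in enumerate(sizes):
        head = 0xF0 if i == 0 else 0xF7               # "first" vs. later sublists
        tail = 0xF7 if i == len(sizes) - 1 else 0xF0  # "last" vs. earlier sublists
        segments.append(bytes([head]) + data[pos:pos + n] + bytes([tail]))
        pos += n
    return segments
```

With the original command of Figure 6, sizes of [4, 4] give the first two-segment example, and eight sizes of 1 followed by a final 0 give the segmentation with the largest number of segments, which ends in the empty "last" segment 0xF7 0xF7.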