This clause focuses on the interoperability point to the media decoder as indicated in Figure 6.1-1. It does not deal with the access engine and file parser, which address how the audio bitstream is delivered.
For all Audio Operation Points, the VR Presentation can be rendered using a single media decoder, which provides decoded PCM signals and rendering metadata to the audio renderer.
This clause defines the potential parameters of Audio Operation Points. This includes the detailed audio decoder requirements and audio rendering metadata. The requirements are defined from the perspective of the audio decoder and renderer.
Parameters for an Audio Operation Point include:
the audio decoder that the bitstream needs to conform to,
the mandated or permitted rendering data that is included in the audio bitstream.
Table 6.1-1 provides an informative overview of the Audio Operation Points. The detailed, normative specification for each Audio Operation Point is subsequently provided in the referenced clause.
The 3GPP MPEG-H Audio Operation Point fulfills the requirements to support 3D audio and is specified in ISO/IEC 23090-2 [12], clause 10.2.2. Channels, Objects and First/Higher-Order Ambisonics (FOA/HOA) are supported, as well as combinations of those. The Operation Point is based on MPEG-H 3D Audio [19].
A bitstream conforming to the 3GPP MPEG-H Audio Operation Point shall conform to the requirements in clause 6.1.4.2.
A receiver conforming to the 3GPP MPEG-H Audio Operation Point shall support decoding and rendering a bitstream conforming to the 3GPP MPEG-H Audio Operation Point. Detailed receiver requirements are provided in clause 6.1.4.3.
The audio stream shall comply with the MPEG-H 3D Audio Low Complexity (LC) Profile, Level 1, 2 or 3, as defined in ISO/IEC 23008-3 [19], clause 4.8. The values of mpegh3daProfileLevelIndication for LC Profile Levels 1, 2 and 3 are "0x0B", "0x0C" and "0x0D", respectively, as specified in ISO/IEC 23008-3 [19], clause 5.3.2.
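The following C fragment is an informative sketch, not part of any specification: it checks a received mpegh3daProfileLevelIndication value against the three permitted LC Profile levels. The function name is illustrative; only the three hexadecimal values come from ISO/IEC 23008-3 [19], clause 5.3.2.

    #include <stdbool.h>
    #include <stdint.h>

    /* Accept only the LC Profile levels permitted by this Operation Point,
     * identified by their mpegh3daProfileLevelIndication values. */
    static bool is_permitted_lc_level(uint8_t level_indication)
    {
        switch (level_indication) {
        case 0x0B: /* LC Profile Level 1 */
        case 0x0C: /* LC Profile Level 2 */
        case 0x0D: /* LC Profile Level 3 */
            return true;
        default:
            return false;
        }
    }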
Audio encapsulation shall be done according to ISO/IEC 23090-2 [12], clause 10.2.2.2.
All Low Complexity Profile and Levels restrictions specified in ISO/IEC 23008-3 [19], clause 4.8.2 shall apply. The constraints on input and output configurations are provided in Table 3 - "Levels and their corresponding restrictions for the Low Complexity Profile" of ISO/IEC 23008-3 [19]. This includes the following for Low Complexity Profile Level 3 (see the informative example after this list):
Maximum number of core coded channels (in compressed data stream): 32,
Maximum number of decoder processed core channels: 16.
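As an informative illustration of these limits, the following C sketch rejects configurations that exceed the Level 3 restrictions; the struct and constant names are illustrative, and only the two numeric limits come from Table 3 of ISO/IEC 23008-3 [19].

    #include <stdbool.h>

    enum {
        LC3_MAX_CORE_CODED_CHANNELS        = 32, /* in the compressed stream */
        LC3_MAX_DECODER_PROCESSED_CHANNELS = 16
    };

    struct stream_config {
        int core_coded_channels;        /* channels in the compressed stream */
        int decoder_processed_channels; /* channels the decoder processes    */
    };

    static bool conforms_to_lc_level3(const struct stream_config *cfg)
    {
        return cfg->core_coded_channels <= LC3_MAX_CORE_CODED_CHANNELS &&
               cfg->decoder_processed_channels <= LC3_MAX_DECODER_PROCESSED_CHANNELS;
    }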
The receiver shall be capable of decoding MPEG-H Audio LC Profile Level 1, Level 2 and Level 3 bitstreams as specified in ISO/IEC 23008-3 [19], clause 4.8, with the following relaxations:
The carriage of generic data defined in ISO/IEC 23008-3 [19], clause 14.7 is optional; thus, MHAS packets of type PACTYP_GENDATA are optional and the decoder may ignore them.
The decoder may read and process MHAS packets of the following types:
PACTYP_SYNCGAP,
PACTYP_BUFFERINFO,
PACTYP_MARKER and
PACTYP_DESCRIPTOR.
Other MHAS packet types may be present in an MHAS elementary stream but may be ignored, as illustrated in the informative example below.
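The dispatch logic implied by these relaxations can be sketched as follows (informative; the numeric enum values and the empty handler bodies are illustrative, and only the packet-type names come from ISO/IEC 23008-3 [19]):

    /* Informative MHAS packet dispatch. The implicit enum values here are
     * NOT the normative MHASPacketType codes of ISO/IEC 23008-3. */
    typedef enum {
        PACTYP_MPEGH3DACFG,
        PACTYP_AUDIOSCENEINFO,
        PACTYP_MPEGH3DAFRAME,
        PACTYP_GENDATA,
        PACTYP_SYNCGAP,
        PACTYP_BUFFERINFO,
        PACTYP_MARKER,
        PACTYP_DESCRIPTOR,
        PACTYP_UNKNOWN
    } mhas_packet_type;

    static void handle_mhas_packet(mhas_packet_type type)
    {
        switch (type) {
        case PACTYP_MPEGH3DACFG:     /* configuration: initialize/reconfigure */
        case PACTYP_AUDIOSCENEINFO:  /* Audio Scene Information               */
        case PACTYP_MPEGH3DAFRAME:   /* coded audio frame: decode             */
            /* mandatory processing */
            break;
        case PACTYP_SYNCGAP:
        case PACTYP_BUFFERINFO:
        case PACTYP_MARKER:
        case PACTYP_DESCRIPTOR:
            /* the decoder may read and process these */
            break;
        case PACTYP_GENDATA:         /* generic data carriage is optional */
        default:
            /* any other packet type may be ignored */
            break;
        }
    }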
The Earcon metadata shall be processed and applied as described in ISO/IEC 23008-3 [19], clause 28.
The audio decoder is able to start decoding a new audio stream at every random access point (RAP). As defined in clause 6.1.4.2, the sync sample (RAP) contains the configuration information (PACTYP_MPEGH3DACFG and PACTYP_AUDIOSCENEINFO) that is used to initialize the audio decoder. After initialization, the audio decoder reads encoded audio frames (PACTYP_MPEGH3DAFRAME) and decodes them.
To optimize startup delay at random access, the information from the MHAS PACTYP_BUFFERINFO packet should be taken into account. The input buffer should be filled at least to the state indicated in the MHAS PACTYP_BUFFERINFO packet before starting to decode audio frames.
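An informative sketch of this start-up sequence follows; the decoder handle and all function names are illustrative, not an API of any MPEG-H implementation.

    #include <stdbool.h>
    #include <stddef.h>

    struct decoder;  /* opaque decoder handle (illustrative) */

    extern void   decoder_configure(struct decoder *d,
                                    const void *cfg,   /* PACTYP_MPEGH3DACFG    */
                                    const void *asi);  /* PACTYP_AUDIOSCENEINFO */
    extern size_t input_buffer_fill(const struct decoder *d);
    extern bool   feed_next_mhas_packet(struct decoder *d); /* false at end of stream */
    extern void   decode_frame(struct decoder *d);          /* PACTYP_MPEGH3DAFRAME */

    static void start_at_rap(struct decoder *d, const void *cfg, const void *asi,
                             size_t bufferinfo_fill_level)
    {
        /* 1. Initialize from the configuration carried in the sync sample. */
        decoder_configure(d, cfg, asi);

        /* 2. Fill the input buffer at least to the state signalled in the
         *    MHAS PACTYP_BUFFERINFO packet before decoding starts. */
        while (input_buffer_fill(d) < bufferinfo_fill_level &&
               feed_next_mhas_packet(d))
            ;

        /* 3. Decode the first audio frame. */
        decode_frame(d);
    }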
It is recommended that, at random access into an audio stream, the receiving device performs a 100 ms fade-in on the first PCM output buffer that it receives from the audio decoder.
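A minimal sketch of such a fade-in, assuming interleaved float PCM (the function name and signal layout are illustrative):

    /* Apply a linear 100 ms fade-in to the first PCM output buffer after
     * random access. If the buffer is shorter than 100 ms, the ramp is
     * simply shortened in this simplified sketch. */
    static void fade_in_first_buffer(float *pcm, int frames, int channels,
                                     int sample_rate)
    {
        int fade_frames = sample_rate / 10;   /* 100 ms worth of PCM frames */
        if (fade_frames > frames)
            fade_frames = frames;

        for (int n = 0; n < fade_frames; n++) {
            float gain = (float)n / (float)fade_frames;  /* ramps 0.0 -> 1.0 */
            for (int c = 0; c < channels; c++)
                pcm[n * channels + c] *= gain;
        }
    }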
If the decoder receives an MHAS stream that contains a configuration change, the decoder shall perform a configuration change according to ISO/IEC 23008-3 [19], clause 5.5.6. The configuration change can, for instance, be detected through a change of the MHASPacketLabel value of the PACTYP_MPEGH3DACFG packet relative to the MHASPacketLabel of previous MHAS packets.
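For illustration, the label comparison can be sketched as follows (informative; the state handling is simplified):

    #include <stdbool.h>
    #include <stdint.h>

    /* Return true if an incoming PACTYP_MPEGH3DACFG packet carries a new
     * MHASPacketLabel, i.e. a configuration change has to be performed
     * according to ISO/IEC 23008-3, clause 5.5.6. */
    static bool config_changed(uint32_t *current_label, uint32_t incoming_label)
    {
        if (incoming_label == *current_label)
            return false;          /* same configuration, keep decoding */

        *current_label = incoming_label;
        return true;               /* trigger the configuration change  */
    }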
If MHAS packets of type PACTYP_AUDIOTRUNCATION are present, they shall be used as described in ISO/IEC 23008-3 [19], clause 14.
The Access Unit that contains the configuration change and the last Access Unit before the configuration change may contain a truncation message (PACTYP_AUDIOTRUNCATION) as defined in ISO/IEC 23008-3 [19], clause 14. The MHAS packet of type PACTYP_AUDIOTRUNCATION enables synchronization between video and audio elementary streams at program boundaries. When used, sample-accurate splicing and reconfiguration of the audio stream are possible.
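An informative sketch of applying such a truncation message to a decoded access unit follows; the field names are illustrative, not the normative MHAS syntax.

    #include <stddef.h>

    struct truncation_msg {
        int trunc_samples;  /* number of PCM frames to discard          */
        int from_begin;     /* nonzero: discard at the start of the AU,
                               zero: discard at the end of the AU       */
    };

    /* Returns a pointer to the first valid PCM frame and updates the frame
     * count, yielding sample-accurate splice points. */
    static float *apply_truncation(float *pcm, int *frames, int channels,
                                   const struct truncation_msg *t)
    {
        if (t->trunc_samples <= 0 || t->trunc_samples > *frames)
            return pcm;                      /* nothing to truncate */

        *frames -= t->trunc_samples;
        if (t->from_begin)
            return pcm + (size_t)t->trunc_samples * channels;
        return pcm;                          /* trailing samples are dropped */
    }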
The receiver shall be capable of simultaneously receiving at least 3 MHAS streams. The MHAS streams can be simultaneously decoded or combined into a single stream prior to the decoder, by utilizing the field mae_bsMetaDataElementIDoffset in the Audio Scene Information as described in ISO/IEC 23008-3 [19], clause 14.6.
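The ID arithmetic behind such a merge can be sketched informatively as follows; the struct is illustrative, and only the field mae_bsMetaDataElementIDoffset is from ISO/IEC 23008-3 [19].

    /* Each auxiliary MHAS stream carries an ID offset so that its metadata
     * element IDs remain unique when the streams are combined into a single
     * Audio Scene ahead of one decoder instance. */
    struct mhas_substream {
        int mae_bsMetaDataElementIDoffset;  /* from the Audio Scene Information */
    };

    /* Map a locally scoped metadata element ID to its ID in the merged stream. */
    static int merged_element_id(const struct mhas_substream *s, int local_id)
    {
        return s->mae_bsMetaDataElementIDoffset + local_id;
    }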
The 3GPP MPEG-H Audio Operation Point builds on the MPEG-H 3D Audio codec, which includes rendering to loudspeakers, binaural rendering and also provides an interface for external rendering. Legacy binaural rendering using fixed loudspeaker setups can be supported by using loudspeaker feeds as output of the decoder.
MPEG-H 3D Audio specifies methods for binauralizing the presentation of immersive content for playback via headphones, as is needed for omnidirectional media presentations. MPEG-H 3D Audio specifies a normative interface for the user's viewing orientation and permits low-complexity, low-latency rendering of the audio scene to any user orientation.
The binaural rendering of MPEG-H 3D Audio shall be applied as described in ISO/IEC 23008-3 [19], clause 13 according to the Low Complexity Profile and Levels restrictions for binaural rendering specified in ISO/IEC 23008-3 [19], clause 4.8.2.2.
For binaural rendering using head tracking, the useTrackingMode flag in the BinauralRendering() syntax element shall be set to 1, as described in ISO/IEC 23008-3 [19], clause 17.4. This flag indicates that a tracker device is connected and that the binaural rendering shall be processed in a special head-tracking mode, using the scene-displacement values (yaw, pitch and roll).
The values for the scene displacement data shall be sent using the interface for scene displacement data specified in ISO/IEC 23008-3 [19], clause 17.9. The syntax of the mpegh3daSceneDisplacementData() interface provided in ISO/IEC 23008-3 [19], clause 17.9.3 shall be used.
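As an informative sketch, a head-tracker callback could forward orientation samples to the decoder as follows; the decoder handle and setter function are illustrative, while the normative carriage is the mpegh3daSceneDisplacementData() syntax.

    struct decoder;  /* opaque decoder handle (illustrative) */

    /* Head orientation as used by the scene-displacement processing. */
    struct scene_displacement {
        float yaw_deg;
        float pitch_deg;
        float roll_deg;
    };

    extern void decoder_set_scene_displacement(struct decoder *d,
                                               const struct scene_displacement *sd);

    /* Tracker callback: forward each sensor sample to the decoder.
     * Requires useTrackingMode == 1 in BinauralRendering(). */
    static void on_tracker_sample(struct decoder *d,
                                  float yaw, float pitch, float roll)
    {
        struct scene_displacement sd = { yaw, pitch, roll };
        decoder_set_scene_displacement(d, &sd);
    }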
The metadata flag fixedPosition in SignalGroupInformation() indicates whether the corresponding audio signals are updated during the processing of scene-displacement angles. If the flag is equal to 1, the positions of the corresponding audio signals are not updated during the processing of scene-displacement angles.
Channel groups for which the flag gca_directHeadphone is set to "1" in the mpegh3da_getChannelMetadata() syntax element are routed directly to the left and right output channels and are excluded from binaural rendering using scene displacement data (non-diegetic content). Non-diegetic content may be stereo or mono. For mono, the signal is mixed to the left and right headphone channels with a gain factor of 0.707.
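The mono case corresponds to the following informative sketch (function name and buffer layout are illustrative):

    /* Route a non-diegetic mono signal (gca_directHeadphone == 1) to both
     * headphone channels with a gain of 0.707 (about -3 dB, i.e. 1/sqrt(2)). */
    static void route_non_diegetic_mono(const float *mono,
                                        float *left, float *right, int frames)
    {
        const float gain = 0.707f;
        for (int n = 0; n < frames; n++) {
            left[n]  += gain * mono[n];
            right[n] += gain * mono[n];
        }
    }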
The interface for binaural room impulse responses (BRIRs) specified in ISO/IEC 23008-3 [19], clause 17.4 shall be used for external BRIRs and HRIRs. The HRIR/BRIR data for the binaural rendering can be fed to the decoder by using the syntax element BinauralRendering(). The number of BRIR/HRIR pairs in each BRIR/HRIR set shall correspond to the number indicated in the relevant level-dependent row in Table 9 - "The binaural restrictions for the LC profile" of ISO/IEC 23008-3 [19] according to the Low Complexity Profile and Levels restrictions in ISO/IEC 23008-3 [19], clause 4.8.2.2.
The measured BRIR positions are passed to mpegh3daLocalSetupInformation(), as specified in ISO/IEC 23008-3 [19], clause 4.8.2.2. Thus, all renderer stages are set to a target layout equal to the transmitted channel configuration. Since one BRIR is available per regular input channel, the Format Converter can be passed through when regular input channel positions are used. Preferably, BRIR measurement positions for the standard target layouts 2.0, 5.1, 10.2 and 7.1.4 should be provided.
MPEG-H 3DA provides output interfaces for the delivery of un-rendered channels, objects and HOA content and associated metadata as specified in clause 6.1.4.3.6.5. External binaural renderers can connect to this interface, e.g. for playback of head-tracked audio via headphones. An example of such an external binaural renderer that connects to the external rendering interface of MPEG-H 3DA is specified in Annex B.
ISO/IEC 23008-3 [19], clause 17.10 specifies the output interfaces for the delivery of un-rendered channels, objects, and HOA content and associated metadata. For connecting to external renderers, a receiver shall implement the interfaces for object output, channel output and HOA output as specified in ISO/IEC 23008-3 [19], clause 17.10, including the additional specification of production metadata defined in ISO/IEC 23008-3 [19], clause 27. Any external renderer should apply the metadata provided in this interface and related audio data in the same manner as if MPEG-H internal rendering is applied:
Correct handling of loudness-related metadata in particular with the aim of preserving intended target loudness
Preserving artistic intent, such as applying transmitted Downmix and HOA Rendering matrices correctly
Rendering spatial attributes of objects appropriately (position, spatial extent, etc.)
In this interface, the PCM data of the channel and object interfaces is provided through the decoder PCM buffer, which first contains the regular rendered PCM signals (e.g. 12 signals for a 7.1+4 setup). Subsequently, additional signals carry the PCM data of the originally transmitted channel representation. These are followed by signals carrying the PCM data of the un-rendered output objects. Then, additional signals carry the HOA audio PCM data, whose number is indicated in the HOA metadata interface via the HOA order (e.g. 16 signals for HOA order 3). The HOA audio PCM data in the HOA output interface is provided in the so-called Equivalent Spatial Domain (ESD) representation. The conversion from the HOA domain into the ESD representation and vice versa is described in ISO/IEC 23008-3 [19], Annex C.5.1.
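The resulting signal ordering can be expressed with simple index arithmetic, sketched informatively below; the struct and field names are illustrative.

    /* Offsets of the signal groups in the decoder PCM buffer:
     * [rendered | transmitted channels | un-rendered objects | HOA ESD]. */
    struct pcm_layout {
        int rendered;     /* e.g. 12 for a 7.1+4 reproduction setup */
        int tx_channels;  /* originally transmitted channel signals */
        int objects;      /* un-rendered output objects             */
        int hoa_order;    /* from the HOA metadata interface        */
    };

    static int hoa_esd_signals(int order)  /* (N+1)^2, e.g. 16 for order 3 */
    {
        return (order + 1) * (order + 1);
    }

    static int first_object_signal(const struct pcm_layout *l)
    {
        return l->rendered + l->tx_channels;
    }

    static int first_hoa_signal(const struct pcm_layout *l)
    {
        return first_object_signal(l) + l->objects;
    }

    static int total_signals(const struct pcm_layout *l)
    {
        return first_hoa_signal(l) + hoa_esd_signals(l->hoa_order);
    }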
The metadata for channels, objects and HOA is available once per frame; its syntax is specified in mpegh3da_getChannelMetadata(), mpegh3da_getObjectAudioAndMetadata() and mpegh3da_getHoaMetadata(), respectively. The metadata and PCM data shall be aligned so that an external renderer can match each metadata element with the corresponding PCM frame.
MPEG-H 3D Audio [19] specifies coding of immersive audio material and the storage of the coded representation in an ISO BMFF track. The MPEG-H 3D Audio decoder has a constant latency, see Table 1 - "MPEG-H 3DA functional blocks and internal processing domain" of ISO/IEC 23008-3 [19]. With this information, content authors can synchronize the audio and video portions of a media presentation, e.g. to ensure lip-sync.
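A minimal informative sketch of the resulting lip-sync computation follows; the latency value passed in is a placeholder for the normative figure from Table 1, not a value defined here.

    /* Video must be delayed by the constant audio decoder latency to stay
     * aligned with the decoded audio. */
    static double video_delay_seconds(int decoder_latency_samples,
                                      int sample_rate)
    {
        return (double)decoder_latency_samples / (double)sample_rate;
    }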
ISO BMFF integration for this profile is provided following the requirements and recommendations in ISO/IEC 23090-2 [12], clause 10.2.2.3.
3GP VR Tracks conforming to this media profile, when used in the context of this specification, shall conform to the ISO BMFF [17] with the following additional requirements:
The audio track shall comply with the bitstream requirements and recommendations for the Operation Point as defined in clause 6.1.4.
A configuration change takes place in an audio stream when the content setup or the Audio Scene Information changes (e.g. changes in the channel layout, the number of objects, etc.); new PACTYP_MPEGH3DACFG and PACTYP_AUDIOSCENEINFO packets are therefore required upon such occurrences. A configuration change usually happens at program boundaries, but it may also occur within a program.
Configuration change constraints specified in ISO/IEC 23090-2 [12], clause 10.2.2.3.2 shall apply.
The multi-stream-enabled MPEG-H Audio System is capable of handling Audio Programme Components delivered in several different elementary streams (e.g., the main MHAS stream containing one complete audio main, and one or more auxiliary MHAS streams, containing different languages and audio descriptions). The MPEG-H Audio Metadata information (MAE) allows the MPEG-H Audio Decoder to correctly decode several MHAS streams.
The sample entry 'mhm2' shall be used in cases of multi-stream delivery, i.e., the MPEG-H Audio Scene is split into two or more streams for delivery as described in ISO/IEC 23008-3 [19], clause 14.6. All constraints for file formats using the sample entry 'mhm2' specified in ISO/IEC 23090-2 [12], clause 10.2.2.3.3 shall apply.
An instantiation of an OMAF 3D Audio Baseline Profile in DASH should be represented as one Adaptation Set. If so, the Adaptation Set should provide the signalling according to ISO/IEC 23090-2 [12] and ISO/IEC 23008-3 [19], clause 21, as shown in Table 6.2-2.
Mapping of relevant MPD elements and attributes to MPEG-H Audio as well as the Preselection Element and Preselection descriptor are specified in ISO/IEC 23090-2 [12], clause B.2.1.2.
MPEG-H 3D Audio enables seamless bitrate switching in a DASH environment between different Representations (i.e. bitstreams encoded at different bitrates) of the same content, i.e. Representations that are part of the same Adaptation Set.
If the decoder receives a DASH Segment of another Representation of the same Adaptation Set, the decoder shall perform an adaptive switch according to ISO/IEC 23008-3 [19], clause 5.5.6.