This section provides general design guidelines for split rendering solutions for immersive audio. These guidelines do not in themselves define any design constraint or requirement. Rather, they build the background for setting physical and functional design constraints as well as performance requirements for split rendering solutions for specific target codecs/systems. Accordingly, the physical and functional design constraints and performance requirements for split rendering solutions for the IVAS codec in clauses 6-8 were devised based on these guidelines. For any other target audio codec/system, corresponding design constraints and performance requirements would have to be defined based on these guidelines.
Ideally, decoding and rendering of head-trackable immersive audio would be implementable and operational on any UE, including XR End Devices such as AR glasses or earbuds. However, lightweight End Devices of this class frequently operate under strict constraints, especially in terms of computational complexity. Reasons are tight limits on power consumption, imposed to reduce battery weight and heat dissipation, and strict implementation cost constraints.
Thus, it may not always be possible to support immersive decoding and rendering that consumes the native immersive audio format of a given codec on a lightweight XR End Device.
It is thus the objective of Immersive Audio Split Rendering to solve this problem with solutions targeting:
- Complexity of operation in the end-rendering lightweight device is reduced substantially compared to the native decoding and head-tracked binaural rendering of the original coded audio format.
- Memory consumption in the end-rendering lightweight device is reduced compared to the native decoding and head-tracked binaural rendering of the original coded audio format.
- Minimum impact on QoS/QoE:
  - All required immersive audio formats of the original coding format, coding modes, and operating ranges (bit rates) ought to be supported.
  - Given that Split Rendering relies on pre-rendering on a capable first UE or network node to an intermediate representation, followed by coding and transmitting that representation for decoding and rendering on the lightweight device, it is unavoidably a transcoding approach. The transcoding impact on QoS/QoE, i.e., on quality and latency, thus ought to be as small as possible in comparison to a relevant reference system, which operates decoding and head-tracked rendering of the original coding format. This system is referred to as the native decoding/rendering reference or, for the specific IVAS reference, the IVAS Local Decoding/Rendering Reference.
  - Head-tracked immersive audio attributes, especially the DOF attributes of the audio formats of the original coding format, ought to be retained. I.e., if the immersive audio of the original coding format is head-trackable in 3-DOF, Split Rendering ought to retain this property.
The 'golden' reference system for any Split Rendering solution is to operate decoding and head-tracked rendering of the original coding format in the lightweight End Device. This system is referred to as the native coding reference or native decoding/rendering reference. This is also the reference considered by the ISAR WID, where the reference to be considered should be the one where audio decoding and rendering happen entirely on the End Device. In clause 4.2.2 the term "Local Audio Rendering" is used for this case. The characteristics of such a solution depend on the specific coding solution, i.e., the "complexity and memory as well as constraints related to relevant interfaces between presentation engine and End Device such as bit rate, latency, down- and upstream traffic characteristics" would be defined by the specific solution. With the IVAS candidate now available, those parameters can also be extracted, using this solution as the reference. This specific reference is referred to as "IVAS Local Decoding/Rendering Reference".
A further relevant reference system is a basic transcoding-based system with decoding and head-tracked binaural rendering of the native codec format carried out by the capable UE or network node. The rendered binaural audio signal is subsequently re-encoded. In clause 4.2.4 the term "Remote Audio Rendering" is used for this case.
The most viable coding mode of the native codec that can code the binaural audio signal is assumed. Most viable would generally mean a coding mode with the least complexity and memory footprint for decoding on the lightweight End Device; this would typically be a stereo coding mode of that codec. Subsequently, the re-encoded binaural audio signal is transmitted to the lightweight End Device, where it is decoded and output without final pose adjustment. As the End Device does not carry out any pose correction matching the actual pose of the End Device, this reference configuration is referred to as the 0-DOF native transcoding reference system. It implements the most basic split rendering approach and is the only reasonable baseline for lightweight End Devices to render binaural audio derived from the native codec format if full native decoding with head-tracked binaural rendering is not possible on that device.
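As an illustration of the data flow in this reference configuration, the following sketch outlines the 0-DOF native transcoding chain. All function names and signal dimensions are placeholder assumptions made for illustration only and do not correspond to an actual IVAS API; the point is the split of the processing and the absence of pose correction on the End Device.

```python
import numpy as np

# Conceptual sketch of the 0-DOF native transcoding reference system.
# The stage functions are trivial placeholders standing in for the real
# codec operations; only the data flow and the missing pose correction
# on the End Device are of interest here.

def decode_native(bitstream):
    # Placeholder: pretend the bitstream decodes to a 4-channel immersive scene.
    return np.zeros((4, 960))

def render_binaural(scene, pose_yaw_rad):
    # Placeholder: head-tracked binaural rendering at the pose known remotely.
    return np.zeros((2, 960))

def encode_stereo(pcm):
    # Placeholder: re-encoding with the codec's least complex stereo mode.
    return b"stereo-frame"

def decode_stereo(frame):
    return np.zeros((2, 960))

def capable_node_prerender(native_bitstream, assumed_pose_yaw):
    """Runs on the capable UE or network node."""
    scene = decode_native(native_bitstream)
    binaural = render_binaural(scene, assumed_pose_yaw)
    return encode_stereo(binaural)

def lightweight_end_device(stereo_frame):
    """Runs on the lightweight End Device: decode and play out,
    with no correction towards the actual device pose (0-DOF)."""
    return decode_stereo(stereo_frame)

# Example: pre-render one frame remotely, then decode it on the End Device.
frame = capable_node_prerender(b"native-frame", assumed_pose_yaw=0.0)
out = lightweight_end_device(frame)
```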
Distributed and Remote Rendering are inherently architectures that involve transcoding: an immersive audio representation is terminated in the Remote MAF and then processed into an intermediate representation carried on the link between the capable device and the lightweight device, with the aim of achieving a QoE similar to Local Rendering while allowing implementation on devices with fewer capabilities. For this, the link characteristics must be considered, such as the maximum allowed bit rate on the channel and the link latency. Since ISAR solutions need to cope with a range of traffic characteristics (with link-specific devices potentially having individual constraints), it is evident that no single set of requirements could target all scenarios.
Scenarios with a site-local link could, e.g., be based on 5G Sidelink, PINs, Wi-Fi, or Bluetooth. The non-3GPP networks Wi-Fi and Bluetooth use only unlicensed spectrum and are therefore prone to packet collisions, leading to traffic characteristics that are less stable than what a 5G RAN (Radio Access Network) in licensed spectrum can offer. The short distance between capable and lightweight devices, e.g., in the case of a powerful smartphone and wireless AR glasses, allows links with high throughput (at least several megabits per second) while keeping power consumption under control. The latency can be very low, in the range of only a few milliseconds, but may also be high to allow robust communication on a congested radio channel (such as unlicensed spectrum), depending also on the system design of the network. The link can be stand-alone or shared with the video path. Since both the capable and the lightweight device can typically be assumed to be in the possession of the end user, volumes of such devices may be high and cost sensitivity might thus apply to both devices. Compared to pass-through of IVAS bitstreams to the lightweight device (IVAS Local Decoding/Rendering Reference), solutions ought to offer a significant benefit in energy efficiency to justify the addition of split rendering functionality for Distributed Rendering and Remote Rendering and to enable the service on a wider range of devices than those capable of Local Rendering.
The traditional assumption for 3GPP networks is a split across a link between the edge and the UE, where the edge server is part of the operator's network and the end user may possess only the lightweight UE. The link characteristics depend on the provisioning by the network operator and may vary based on, e.g., the 5QI, but may also be affected by the behavior of the lightweight UE, such as poor reception or mobility use cases. With a cell being a shared channel for multiple users, radio resources are precious, and the trade-off between bit rate and complexity may thus fall more towards the focus-on-rate side, as computing resources can always be extended at a certain cost while radio resources are limited by spectrum availability. This results in sensitivity to bit rate, where the pain point is the lowest rate that enables a user with a lightweight UE to use a service with a certain QoE. Latency can vary based on the 5QI and the resulting scheduling of the gNodeB, but could be as low as a few milliseconds and does not necessarily need to be significantly higher than for a site-local link. Still, in the case of OTT operation or network impairments, the latency might increase such that QoE is lowered significantly due to excessive motion-to-sound and end-to-end latency, and packet losses may occur that need to be concealed.
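To make the role of link latency concrete, the following sketch adds up a hypothetical motion-to-sound latency budget for a configuration in which the pose-dependent rendering runs on the remote/capable side and the End Device applies no local pose correction. All component values are illustrative assumptions, not requirements or measurements from this document.

```python
# Hypothetical motion-to-sound latency budget when pose-dependent rendering
# is performed remotely and the End Device applies no local pose correction.
# All values are illustrative assumptions, not normative figures.
uplink_pose_ms     = 5.0   # head pose transmitted to the rendering node
prerender_ms       = 15.0  # decode + head-tracked render + re-encode (framing/algorithmic)
downlink_audio_ms  = 5.0   # coded intermediate representation back to the End Device
device_decode_ms   = 5.0   # decoding and output on the lightweight device

motion_to_sound_ms = uplink_pose_ms + prerender_ms + downlink_audio_ms + device_decode_ms
print(f"motion-to-sound latency: {motion_to_sound_ms:.1f} ms")  # 30.0 ms in this example
```

If the link latency grows, e.g., under network impairments, the uplink and downlink terms grow with it, which is why post-rendering pose correction on the End Device becomes attractive.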
Physical attributes are complexity, memory consumption, algorithmic delay, motion-to-sound latency, bit rate, etc. The following generic physical design constraints are provided as informative guidance when defining the physical design constraints for split rendering solutions applicable to a specific target codec/system:
Physical attribute | Constraint | Comment
Complexity of operation in end-rendering lightweight device | Complexity of operation in the end-rendering lightweight device is expected to be reduced substantially compared to the native decoding/rendering reference. | Trade-offs between complexity and memory constraints can be considered.
Complexity of operation at capable device/node | The complexity of operation at the pre-rendering device/node ought to be characterized. | Complexity benefits at the End Device are likely to come at the expense of increased complexity at the pre-rendering device/network node, since decoding/rendering of the native format followed by re-encoding is required. However, it appears difficult to specify a reasonable generic constraint.
Memory footprint of operation in end-rendering lightweight device | The memory footprint in the end-rendering lightweight device is expected to be reduced substantially compared to the native decoding/rendering reference. | Trade-offs between complexity and memory constraints can be considered.
Memory footprint of operation at pre-rendering device/node | The memory footprint of operation at the pre-rendering device/node ought to be characterized. | Memory benefits at the End Device are likely to come at the expense of increased memory consumption at the pre-rendering device/network node, since decoding/rendering of the native format followed by re-encoding is required. However, it appears difficult to specify a reasonable generic constraint.
Algorithmic motion-to-sound latency in head-tracked rendering operation | For a given DOF level, the algorithmic motion-to-sound latency is expected to be not worse than that of 0-DOF split rendering systems. | For 0-DOF split rendering systems, there is no constraint.
Algorithmic audio delay | The total algorithmic end-to-end audio delay including the Split Rendering operation is expected not to substantially exceed the end-to-end delay of the native reference coding/rendering system. | A Split Rendering system involves transcoding of the native coding format to an intermediate representation used to transfer the audio to the lightweight device. Accordingly, the total algorithmic end-to-end audio delay including the Split Rendering operation can at best equal that of the native decoding/rendering reference. It is expected that the Split Rendering approach applies algorithmic audio latency optimizations compared to the native transcoding reference system. Shared memory buffers during the transcoding from the native coding format to the intermediate representation can be assumed. An illustrative delay budget is sketched after this table.
Bit rate of coded intermediate representation | Split Rendering operation ought to offer multiple bit rate options enabling different QoS/QoE levels. It can be considered to define different bit rate requirements depending on DOF level. | The bit rate supported on the interface between the pre-rendering device/network node and the end-rendering lightweight device may depend on the specific system and service configuration of a given service deployment. It is therefore desirable if Split Rendering operation offers flexible trade-offs between bit rate and QoS/QoE. Even if the bit rate of the coded intermediate representation is expected not to substantially exceed the maximum bit rate of the native reference system, the main priority is the best possible QoS/QoE under the range of bit rates supported on the interface.
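The following sketch illustrates the algorithmic audio delay consideration from the table above with purely hypothetical numbers: the split rendering chain can at best match the native decoding/rendering reference and will typically add the delay of encoding and decoding the intermediate representation. None of the figures are normative or taken from an actual codec characterization.

```python
# Hypothetical algorithmic delay budget comparing the native reference with
# a split rendering chain. All values are illustrative assumptions only.
native_decode_render_ms = 32.0   # native decoding + head-tracked rendering reference

prerender_ms            = 32.0   # decoding/rendering of the native format at the capable node
intermediate_encode_ms  = 5.0    # encoding of the intermediate representation
end_device_decode_ms    = 5.0    # decoding + pose correction on the lightweight device

split_total_ms = prerender_ms + intermediate_encode_ms + end_device_decode_ms
extra_ms = split_total_ms - native_decode_render_ms
print(f"split rendering chain: {split_total_ms:.0f} ms "
      f"({extra_ms:+.0f} ms vs. native reference)")
```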
Functional design constraints are related to the Split Rendering objective that the required immersive audio formats, operation modes and ranges of the original (native) coding format be supported. A further functional attribute associated with immersive audio to be retained is head-trackability.
Accordingly, the following generic functional design constraints are provided as informative guidance when defining the functional design constraints for split rendering solutions applicable to a specific target codec/system:
Functional attribute | Constraint | Comment
Immersive audio formats of native coding format | All required immersive audio formats of the native coding format are expected to be supported by the Split Rendering operation. |
Bit rates of required immersive audio coding modes of native coding format | All bit rates are expected to be supported. |
Head-trackability of immersive audio formats | The head-trackability of the immersive audio formats of the native coding format is expected to be retained. While the preservation of the DOF level can be an objective, it can be considered to additionally provide reduced DOF levels. | Explanation: an audio format supported by the native coding system may be 3-DOF head-trackable, i.e., around 3 axes (yaw, pitch, roll). The Split Rendering system ought to retain this possibility. Complexity- or bit-rate-reduced variants may though reduce this to lower DOF levels such as yaw-only correction (1-DOF); see the sketch after this table.
Packet loss concealment (PLC) | A PLC solution is expected to be provided. |
Non-diegetic audio support | Split Rendering operation in non-diegetic mode (non-head-tracked) is expected to be possible. It ought to be possible to overlay post-rendered audio obtained from instances operated with diegetic and non-diegetic audio. |
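To illustrate what a reduced-DOF variant could look like, the following sketch applies a yaw-only (1-DOF) pose correction to a first-order Ambisonics signal before binauralization on the End Device. The choice of a first-order Ambisonics intermediate representation, the channel convention, and the sign of the compensation are assumptions made for illustration only; an actual split rendering solution may use a different intermediate representation and correction method.

```python
import numpy as np

def yaw_correct_foa(w, x, y, z, head_yaw_rad):
    """Counter-rotate a first-order Ambisonics frame by the tracked head yaw.

    Assumes W (omni), X (front), Y (left), Z (up) components and positive yaw
    for a head turn to the left; both are illustrative conventions.
    A yaw rotation only mixes X and Y; W and Z are invariant.
    """
    c, s = np.cos(head_yaw_rad), np.sin(head_yaw_rad)
    x_rot = c * x + s * y
    y_rot = -s * x + c * y
    return w, x_rot, y_rot, z

# Example: one 20 ms frame at 48 kHz, head turned 30 degrees to the left.
frame = np.random.randn(4, 960)
w, x, y, z = yaw_correct_foa(*frame, head_yaw_rad=np.deg2rad(30.0))
```

Pitch and roll are left uncorrected here, which is what distinguishes this 1-DOF variant from full 3-DOF correction.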
ISAR supports immersive audio experiences in various use cases. A number of factors can contribute to the general QoE of the ISAR solution. These include general audio quality, motion-to-sound latency, end-to-end latency, spatial image quality, pose accuracy, etc.
Relevant reference systems for audio quality requirements are the native coding reference system and the 0-DOF native transcoding reference system. The quality of the native coding reference system is the optimum, which cannot be surpassed by a Split Rendering system. The 0-DOF native transcoding reference system, on the other hand, may suffer from quality degradations due to transcoding and due to potential differences between the assumed End Device pose during binaural rendering and the actual End Device pose. These deviations may be caused by the transmission round-trip delay between the End Device and the capable device/network node performing the head-tracked binaural rendering.
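As a rough worked example of this pose mismatch, assume a head yaw rate and a round-trip delay; both values are illustrative assumptions, not measurements.

```python
# Hypothetical pose error in the 0-DOF native transcoding reference system:
# the binaural signal is rendered for the pose known at the rendering node,
# which lags the actual End Device pose by roughly the round-trip delay.
round_trip_delay_s  = 0.100   # 100 ms between End Device and rendering node
head_yaw_rate_deg_s = 180.0   # a brisk but plausible head turn

pose_error_deg = head_yaw_rate_deg_s * round_trip_delay_s
print(f"instantaneous yaw error during the turn: {pose_error_deg:.0f} degrees")  # 18 degrees
```

The error scales linearly with the round-trip delay, which motivates comparing against this reference under scenarios with significant link latency.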
It is expected that an ISAR solution employing post-rendering with pose correction provides higher quality of experience (QoE) than the 0-DOF native transcoding reference system under a given scenario with significant latency between the End Device and the capable device/network node performing the head-tracked binaural rendering. Given the large number of potential split rendering scenarios, the performance requirements may be defined for one or several relevant scenario(s). In addition, the requirements are expected to be met under the defined physical and functional design constraints.
The ISAR solution ought further to provide QoE that is as close as possible to the QoE offered by the native coding reference system under the same scenario.