The main part of the present document is a detailed algorithmic description of functions for Split Rendering of immersive audio. It comprises:
-	Intermediate pre-rendered audio representation,
-	Encoder, bitstream and decoder for the intermediate representation,
-	Post-rendering of the decoded intermediate representation to provide binaural audio output with and without head-tracker input and post-rendering control metadata.
Along with the intermediate pre-rendered audio representation, functional requirements for pre-renderer operations are provided which, if met, enable a Presentation Engine to connect to an ISAR compliant decoder and post-renderer. Interfaces are described that allow an immersive audio decoder/renderer in a Presentation Engine to connect to the ISAR pre-renderer.
The post-renderer procedures of this document are mandatory for implementation in all User Equipment (UE) claiming ISAR compliant post-rendering capabilities.
This clause provides a generic ISAR system overview based on the example of the ISAR compliant IVAS split rendering feature, which defines the baseline ISAR system illustrated in Figure 4.2-1.
The immersive audio rendering process is split between a capable device or network node (hosting the Presentation Engine, which relies on IVAS decoding and rendering) and a less capable device with limited computing and memory resources but with motion sensing for head-tracked binaural output. ISAR split rendering consists of the following core components (see the illustrative sketch after the list):
-	Pre-renderer & encoder with:
	-	Pose correction metadata computation and metadata encoder
	-	Binaural audio encoder (transport codec encoder)
-	ISAR decoder & post-renderer with:
	-	Pose correction metadata decoder
	-	Binaural audio decoder (transport codec decoder)
	-	Pose corrective post-renderer
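The split of these components across the two endpoints can be pictured with a minimal, runnable sketch. All names, the toy panning renderer and the nearest-probe "correction" below are illustrative assumptions for a single yaw axis; they are not the algorithms specified in this document.

```python
import numpy as np
from dataclasses import dataclass

@dataclass
class Pose:
    """Head orientation in radians (yaw only, for brevity)."""
    yaw: float = 0.0

def render_binaural(mono: np.ndarray, pose: Pose) -> np.ndarray:
    """Toy binaural renderer standing in for the Presentation Engine:
    amplitude-pans a mono source between the ears as a function of yaw."""
    g = 0.5 * (1.0 + np.sin(pose.yaw))
    return np.stack([(1.0 - g) * mono, g * mono])  # shape (2, N)

# --- Pre-renderer & encoder (capable device or network node) -----------
def pre_render_and_encode(mono, first_pose, probe_poses):
    main = render_binaural(mono, first_pose)
    probes = [render_binaural(mono, p) for p in probe_poses]
    # Metadata here is just per-channel level ratios of each probe
    # rendition to the main rendition; the real scheme is richer.
    meta = [(p.yaw, probe.std(axis=1) / (main.std(axis=1) + 1e-12))
            for p, probe in zip(probe_poses, probes)]
    return main, meta  # stand-ins for the coded audio and metadata

# --- ISAR decoder & post-renderer (lightweight device) -----------------
def decode_and_post_render(main, meta, second_pose):
    # Pick the probe pose nearest the current pose and apply its
    # per-channel gains as a lightweight pose correction.
    _, gains = min(meta, key=lambda m: abs(m[0] - second_pose.yaw))
    return gains[:, None] * main

if __name__ == "__main__":
    sr = 48_000
    mono = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)
    main, meta = pre_render_and_encode(mono, Pose(0.0),
                                       [Pose(-0.3), Pose(0.3)])
    out = decode_and_post_render(main, meta, Pose(0.25))
    print(out.shape)  # (2, 48000)
```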
The metadata-based pose correction scheme allows a binaural audio signal originally rendered for a first pose to be adjusted to a second pose in a lightweight process. In the split rendering context, the first pose is the potentially outdated pose of the lightweight device available at the pre-renderer, while the second pose is the current and accurate pose of the lightweight device. The metadata is calculated at the capable device or network node based on additional binaural renditions at probing poses different from the first pose. With increasing degrees of freedom (DOF), an increasing number of additional binaural renditions at different probing poses is required. The metadata is transmitted to the lightweight device along with the coded binaural audio signal rendered for the first pose.
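To make the probing idea concrete, the following sketch fits a per-band gain model from the probe renditions and applies it at the post-renderer for the actual pose delta. The linear single-axis gain model, the crude uniform DFT banding and every function name are assumptions made for illustration; the standardized metadata computation is considerably more elaborate.

```python
import numpy as np

def band_energies(stereo: np.ndarray, n_bands: int = 8) -> np.ndarray:
    """Per-channel, per-band magnitudes from a crude uniform DFT split."""
    spec = np.abs(np.fft.rfft(stereo, axis=1))
    bands = np.array_split(spec, n_bands, axis=1)
    return np.stack([b.mean(axis=1) for b in bands], axis=1)  # (2, n_bands)

def fit_metadata(main, probes, probe_yaws, first_yaw):
    """Capable-device side: fit a linear per-band gain model
    g(dyaw) = 1 + slope * dyaw by least squares over the probes."""
    e0 = band_energies(main)
    dyaws = np.array([y - first_yaw for y in probe_yaws])
    ratios = np.stack([band_energies(p) / (e0 + 1e-12) for p in probes])
    slope = ((ratios - 1.0) * dyaws[:, None, None]).sum(0) / (dyaws ** 2).sum()
    return slope  # shape (2, n_bands): the "pose correction metadata"

def apply_correction(main, slope, first_yaw, second_yaw, n_bands=8):
    """Lightweight-device side: scale each band of the decoded signal
    by the gain the model predicts for the actual pose delta."""
    gains = 1.0 + slope * (second_yaw - first_yaw)   # (2, n_bands)
    spec = np.fft.rfft(main, axis=1)
    edges = np.linspace(0, spec.shape[1], n_bands + 1, dtype=int)
    for b in range(n_bands):
        spec[:, edges[b]:edges[b + 1]] *= gains[:, b:b + 1]
    return np.fft.irfft(spec, n=main.shape[1], axis=1)
```

Only apply_correction runs on the lightweight device; the fitting happens at the capable endpoint, and slope stands in for what the metadata bitstream would carry.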
The pose correction metadata computation is done in the CLDFB domain. Thus, unless the immersive audio decoder/renderer already operates in that domain, the pre-rendered immersive audio signal and the additional binaural renditions have to be converted to that domain. The binaural audio signal rendered for the first pose is encoded using one of two codecs, LCLD or LC3plus, of which the former operates in the CLDFB domain. The two codecs have complementary properties, giving implementors the freedom to make individual trade-offs between complexity, memory, latency and rate-distortion performance, and to implement a design that is optimized for a given immersive audio service and hardware configuration. It is also possible to use another transport codec for the binaural audio signal.
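For orientation, a complex-modulated sub-band analysis of the kind this domain conversion requires can be sketched as follows. This is a stand-in only: a sine window spanning two time slots replaces the actual CLDFB prototype filter, and the 60-band configuration for 48 kHz input is an assumption, not a normative parameter taken from this document.

```python
import numpy as np

def cldfb_like_analysis(x: np.ndarray, n_bands: int = 60) -> np.ndarray:
    """Complex-modulated filter-bank stand-in: one complex sample per
    band and time slot, with a slot length of n_bands input samples.
    A sine window spanning two slots replaces the CLDFB prototype
    filter, which this sketch does not reproduce."""
    m = n_bands
    n_slots = x.shape[-1] // m - 1
    win = np.sin(np.pi * (np.arange(2 * m) + 0.5) / (2 * m))
    slots = np.stack([x[..., t * m:t * m + 2 * m] * win
                      for t in range(n_slots)], axis=-2)
    return np.fft.fft(slots, axis=-1)[..., :m]  # (..., n_slots, n_bands)

if __name__ == "__main__":
    x = np.random.randn(2, 48_000)        # stereo, 1 s at 48 kHz
    print(cldfb_like_analysis(x).shape)   # (2, 799, 60)
```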
ISAR split rendering can operate at various DOFs, ranging from 0-DOF (no pose correction) to 3-DOF (pose correction on the three rotational axes yaw, pitch and roll), at bit rates from 256 kbps (0-DOF) to 384-768 kbps (3-DOF).
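Expressed as configuration data, the operating points quoted above might look like the following sketch; the type and field names are hypothetical, while the numbers are those stated in this overview.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IsarOperatingPoint:
    """One ISAR split rendering operating point (illustrative type)."""
    dof: int        # degrees of freedom of the pose correction
    min_kbps: int   # lowest bit rate at this operating point
    max_kbps: int   # highest bit rate at this operating point

# 0-DOF carries no pose correction metadata; 3-DOF corrects yaw,
# pitch and roll. Rates for intermediate DOFs are not quoted here.
OPERATING_POINTS = {
    0: IsarOperatingPoint(dof=0, min_kbps=256, max_kbps=256),
    3: IsarOperatingPoint(dof=3, min_kbps=384, max_kbps=768),
}
```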