Based on the use cases, the following formats, codecs, and packaging formats are of relevance for media streaming of AR:
-	General:
  -	Basic scene graph and scene description (see the illustrative sketch after this list)
  -	2D uncompressed video formats and video compression codecs
  -	Regular audio formats and audio compression codecs
-	In addition, for STAR-based UE:
  -	Richer scene graph data
  -	3D formats such as static and dynamic point clouds or meshes
  -	Several video decoding instances
  -	Decoding tools for such formats
  -	DASH/CMAF-based delivery
-	In addition, for EDGAR-based UE:
  -	2D compression tools for eye buffers as defined in clause 4.5.2
  -	Decoding tools for such formats
  -	At least two video decoding instances
  -	Low-latency downlink real-time streaming of the above media
  -	Uplink streaming of pose information and other relevant information, such as input actions
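For illustration only, the following sketch shows what a minimal scene-description entry point in the style of a glTF 2.0 scene graph could look like, referencing one 2D video stream and one dynamic 3D object. All field names, the extension block and the URIs are hypothetical and do not reproduce any normative profile.

# Illustrative only: a minimal glTF-2.0-style scene-description entry point,
# hand-written for this sketch; not a normative 3GPP/MPEG-I profile.
import json

scene_description = {
    "asset": {"version": "2.0"},
    "scene": 0,
    "scenes": [{"nodes": [0, 1]}],
    "nodes": [
        {"name": "video_panel", "mesh": 0},    # quad carrying a 2D video texture
        {"name": "dynamic_object", "mesh": 1}  # streamed 3D mesh or point cloud
    ],
    "meshes": [
        {"name": "quad"},
        {"name": "dynamic_mesh"}
    ],
    # Hypothetical extension pointing to the media pipelines feeding the scene
    "extensions": {
        "example_media": {
            "media": [
                {"uri": "https://example.com/video.mpd", "mimeType": "application/dash+xml"},
                {"uri": "https://example.com/mesh.mpd", "mimeType": "application/dash+xml"}
            ]
        }
    }
}

print(json.dumps(scene_description, indent=2))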
The above scenarios relate to the following cases in clause 6 of TR 26.928. In particular:
For STAR-based devices and viewport-independent streaming, processing of updated pose information is done only locally in the XR device. Delivery latency requirements are independent of the motion-to-photon latency. Initial considerations on QoE parameters are provided in clause 6.2.2.5 of TR 26.928. XR media delivery is typically built on download or adaptive streaming such as DASH, so that quality can be adjusted to the available bitrate to a large extent. Compared to viewport-independent delivery, viewport-dependent streaming requires updated tracking and sensor information, which impacts the network interactivity. Typically, due to updated pose information, HTTP/TCP-level requests and responses are exchanged every 100-200 ms. For more details, refer to clause 6.2.3 of TR 26.928. Such approaches may reduce the required bitrate compared to viewport-independent streaming by a factor of 2 to 4 at the same rendered quality. It is important to note that viewport-dependent streaming technologies are typically also built on adaptive streaming, allowing quality to be adjusted to the available bitrate; knowledge of tracking information in the XR Delivery receiver simply adds another adaptation parameter. In general, such systems may be flexibly designed, taking into account a trade-off between bitrates, latencies, complexity and quality. Suitable 5QI values for adaptive streaming over HTTP are 6, 8, or 9, as defined in clause 5.7.4 of TS 23.501 and also indicated in clause 4.3.3 of TR 26.928.
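For illustration, the following simplified sketch shows how a viewport-dependent client could use updated pose information as an additional adaptation parameter next to the available throughput. The tile grid, the 80/20 bitrate split and all helper names are assumptions of this example and are not taken from any specification.

# Illustrative viewport-dependent adaptation loop (hypothetical helpers, not a
# normative client): tiles covering the current viewport are fetched at a high
# bitrate, the remaining tiles at a low bitrate, re-evaluated per segment.
from dataclasses import dataclass

@dataclass
class Pose:
    yaw: float    # degrees
    pitch: float  # degrees

def tiles_in_viewport(pose: Pose, tile_grid=(8, 4), hfov=90.0, vfov=90.0):
    """Return indices of equirectangular tiles overlapping the viewport."""
    cols, rows = tile_grid
    tile_w, tile_h = 360.0 / cols, 180.0 / rows
    visible = set()
    for c in range(cols):
        for r in range(rows):
            center_yaw = -180.0 + (c + 0.5) * tile_w
            center_pitch = -90.0 + (r + 0.5) * tile_h
            if abs((center_yaw - pose.yaw + 180) % 360 - 180) <= hfov / 2 and \
               abs(center_pitch - pose.pitch) <= vfov / 2:
                visible.add((c, r))
    return visible

def select_bitrates(pose: Pose, throughput_mbps: float):
    """Split the throughput budget: most of it goes to visible tiles."""
    visible = tiles_in_viewport(pose)
    hi = 0.8 * throughput_mbps / max(len(visible), 1)
    lo = 0.2 * throughput_mbps / max(8 * 4 - len(visible), 1)
    return {(c, r): (hi if (c, r) in visible else lo)
            for c in range(8) for r in range(4)}

# Pose updates arriving every ~100-200 ms steer the next segment requests.
print(select_bitrates(Pose(yaw=30.0, pitch=0.0), throughput_mbps=40.0))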
For EDGAR-based devices, raster-based split rendering based on clause 6.2.5 of TR 26.928 applies. With the use of pose corrections, the key latency for the network is the motion-to-render-to-photon delay as introduced in clause 4.5.2 and clause 4.5.3, i.e. the end-to-end latency between the user motion and the corresponding rendered frame being displayed is 50-60 ms. The formats are defined in clause 4.5.2 as follows (see the cross-check after the list):
-	for a 30 x 20 degrees field of view, 1.5K by 1K per eye is required and 1.8K by 1.2K per eye is desired;
-	for a 40 x 40 degrees field of view, 2K by 2K per eye is required and 2.5K by 2.5K per eye is desired.
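The per-eye figures above can be cross-checked as an angular pixel density. The short calculation below assumes that the "K" values denote round thousands of pixels (1.5K = 1500, etc.), which is an assumption of this sketch.

# Cross-check of the eye-buffer figures as angular pixel density
# (pixels per degree); "K" values are interpreted as round thousands.
cases = [
    ("30x20 deg, required", (1500, 1000), (30, 20)),
    ("30x20 deg, desired",  (1800, 1200), (30, 20)),
    ("40x40 deg, required", (2000, 2000), (40, 40)),
    ("40x40 deg, desired",  (2500, 2500), (40, 40)),
]
for label, (w, h), (hfov, vfov) in cases:
    print(f"{label}: {w / hfov:.0f} x {h / vfov:.0f} pixels/degree")
# All cases correspond to roughly 50 pixels/degree (required)
# and 60 pixels/degree (desired).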
Colours are typically RGB but may be converted to YUV. Frame rates are typically 60 fps to 90 fps. The above formats typically result in at most 4K content at 60 fps. Modern compression tools compress this to 30 to 50 Mbit/s. Stereo audio signals are considered, requiring bitrates that are negligible compared to the video signals. In order to support warping and late-stage reprojection, some depth information may be added. For communication, a real-time capable content delivery protocol is needed, and the network needs to provide reliable delivery mechanisms. 5QI values exist that may address the use case, such as 5QI value 80 with a 10 ms packet delay budget; however, this is part of the non-GBR bearers (see clause). In addition, it is unclear whether a 10 ms budget combined with such high bitrates and low required error rates would be too stringent and resource-consuming.
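As a rough consistency check on the 30 to 50 Mbit/s figure, the following calculation compares the uncompressed data rate of an approximately 4K stereo eye-buffer stream at 60 fps with the compressed target. The composite 4096x2048 frame size and YUV 4:2:0 sampling are assumptions of this example.

# Rough consistency check: uncompressed data rate of a ~4K stereo eye-buffer
# stream at 60 fps versus the compressed 30-50 Mbit/s target.
# The 4096x2048 composite frame and YUV 4:2:0 (12 bit/pixel) are assumptions.
width, height, fps = 4096, 2048, 60
bits_per_pixel = 12                      # YUV 4:2:0, 8-bit components
raw_mbps = width * height * bits_per_pixel * fps / 1e6
for target_mbps in (30, 50):
    print(f"raw ≈ {raw_mbps:.0f} Mbit/s, target {target_mbps} Mbit/s "
          f"-> compression ratio ≈ {raw_mbps / target_mbps:.0f}:1")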
Hence, for simple split rendering in the context of the requirements in this clause, suitable 5QIs 89 and 90 have been defined in Rel-17 in TS 23.501, addressing latency requirements in the range of 10-20 ms and providing bitrate guarantees to stream up to 50 Mbit/s consistently. Significant opportunities exist to support split rendering with advanced radio tools; see for example TR 26.926 for performance evaluations.
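To relate the 10-20 ms delay budget and the 50 Mbit/s guarantee to the frame cadence, the following illustrative calculation derives the per-frame payload at 50 Mbit/s and 60 fps and compares it with the frame interval; the interpretation is an assumption of this sketch.

# Illustration: per-frame payload at the guaranteed 50 Mbit/s and 60 fps,
# compared with the frame interval and the 10-20 ms delay budget.
bitrate_mbps, fps = 50, 60
frame_interval_ms = 1000 / fps               # ~16.7 ms between frames
bits_per_frame = bitrate_mbps * 1e6 / fps    # ~833 kbit per frame
print(f"frame interval ≈ {frame_interval_ms:.1f} ms, "
      f"per-frame payload ≈ {bits_per_frame / 8 / 1e3:.0f} kByte")
# A 10-20 ms delay budget is on the order of one frame interval, leaving the
# remainder of the 50-60 ms motion-to-render-to-photon budget for rendering,
# encoding, decoding and display.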
The uplink predominantly carries pose information. Data rates are several hundred kbit/s, and the latency needs to be small in order not to add to the overall target latency. Suitable 5QIs 87 and 88 have been defined in Rel-17 in TS 23.501 to stream uplink pose information.
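The figure of several hundred kbit/s can be illustrated with a simple sizing exercise. The pose payload layout, the update rate and the per-packet overhead below are assumptions of this sketch and not a specified format.

# Sizing exercise for the pose uplink: payload layout, update rate and
# per-packet overhead are assumptions, not a specified format.
import struct

def pose_payload(timestamp_us, position, quaternion):
    """Pack one pose sample: 8-byte timestamp + 3 floats + 4 floats."""
    return struct.pack("<Q3f4f", timestamp_us, *position, *quaternion)

payload = pose_payload(123456789, (0.0, 1.6, 0.0), (0.0, 0.0, 0.0, 1.0))
update_rate_hz = 500                     # assumed pose update rate
overhead_bytes = 40                      # assumed IP/UDP/RTP-style overhead
bitrate_kbps = (len(payload) + overhead_bytes) * 8 * update_rate_hz / 1e3
print(f"{len(payload)} B payload -> ≈ {bitrate_kbps:.0f} kbit/s uplink")
# ≈ 300 kbit/s, i.e. several hundred kbit/s as stated above.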
The following list of potential standardization areas has been collected:
-	HTTP streaming of immersive scenes with 2D and 3D media formats and objects to STAR-based devices, including:
  -	Immersive media format and profile with integration into 5GMS for STAR-based devices
  -	Scene description format, functionality, and profile as an entry point of immersive media
  -	Relevant subset of media codecs for different media types and formats
  -	CMAF encapsulation of immersive media for 5G media streaming
  -	Viewport-independent and viewport-dependent streaming
-	Split rendering delivery of immersive scenes to EDGAR-based devices, including:
  -	Media payload format to be mapped into RTP streams (see the illustrative sketch after this list)
  -	Capability exchange mechanism and relevant signalling
  -	Protocol stack and content delivery protocol
  -	Cross-layer design, radio and 5G system optimizations for QoS support
  -	Uplink streaming of predicted pose information and input actions
  -	Required QoE metrics
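As an illustration of the item on mapping media payload formats into RTP streams, the following sketch wraps an opaque encoded eye-buffer slice into a bare RFC 3550 RTP header. The dynamic payload type 96 and the single-packet mapping are placeholders; an actual payload format would additionally define fragmentation, marker-bit usage and timestamp semantics.

# Illustrative only: wrap an encoded eye-buffer slice in a bare RFC 3550 RTP
# header. Payload type 96 (dynamic) and the single-packet mapping are
# placeholders, not a defined payload format.
import struct

def rtp_packet(payload: bytes, seq: int, timestamp: int, ssrc: int,
               payload_type: int = 96, marker: bool = True) -> bytes:
    v_p_x_cc = 2 << 6                    # version 2, no padding/extension/CSRC
    m_pt = (int(marker) << 7) | payload_type
    header = struct.pack("!BBHII", v_p_x_cc, m_pt,
                         seq & 0xFFFF, timestamp & 0xFFFFFFFF, ssrc)
    return header + payload

pkt = rtp_packet(b"\x00" * 1200, seq=1, timestamp=90000, ssrc=0x1234ABCD)
print(len(pkt), "bytes including the 12-byte RTP header")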