This clause focuses on rendering- and media-centric architectures. The architectures are simplified and illustrative: they only consider an XR server and an XR device, in order to identify the functions in XR servers and XR devices that communicate and exchange information, possibly over a 5GS communication link. The baseline technologies are introduced in
clause 4. These architectures focus on processes where the following main tasks are carried out:
-
Display
-
Tracking and Pose Generation
-
Viewport Rendering
-
Capture of real-world content
-
Media encoding
-
Media decoding
-
Media content delivery
-
5GS communication
-
Media Formats, metadata and other data delivered on communication links
-
Spatial Location Processing
This clause also discusses benefits and challenges of the different approaches in terms of required bitrates, latencies, reliability, etc. A main aspect addressed in the following is the set of processes that are involved in the motion-to-photon/sound latency and how these processes may impact the XR viewport rendering.
In the viewport independent delivery case, following the architecture in
clause 4.3 of TS 26.118, tracking and sensor information is only processed in the XR device as shown in
Figure 6.2.2-1. This means that the entire XR scene is delivered and decoded.
Use cases that may be addressed partially or completely by this delivery architecture are summarized in
clause 5.4.
The basic procedures follow the procedures of 5G Media Streaming in
clause 5 of TS 26.501. Both on-demand and live streaming may be considered.
No content format for 6DoF streaming is fully defined yet, but the content may, for example, be a scene in which foreground 3D objects, for example represented by point clouds, are mixed with background content, for example an omnidirectional scene. The combination of the content may be provided by a scene description that places the objects in the 6DoF scene. Typically, the data needs to be streamed into buffers that are jointly rendered.
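For illustration only, a possible scene description for such a combination is sketched below in Python; the field names, URIs and values are illustrative assumptions and do not correspond to any finalized 6DoF format.

# Minimal, hypothetical 6DoF scene description sketch: a point-cloud foreground
# object placed inside an omnidirectional background. All names are illustrative.
scene_description = {
    "background": {
        "type": "omnidirectional_video",
        "uri": "background_360.mp4",              # placeholder URI
    },
    "objects": [
        {
            "type": "point_cloud",
            "uri": "foreground_object.bin",       # placeholder URI
            "position": [0.0, 0.0, -2.0],         # placement in the 6DoF scene (metres)
            "orientation": [0.0, 0.0, 0.0, 1.0],  # quaternion
        }
    ],
}

# Each entry is streamed into its own decode buffer; the buffers are then
# jointly rendered into the viewport by the XR device.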
Due to the significant amount of data that needs to be processed in the device, hardware supported decoding and rendering is necessary.
Such an approach typically requires the delivery and decoding of several video and audio streams in parallel.
In the case of viewport-independent streaming, processing of updated pose information is only done locally in the XR device. Delivery latency requirements are independent of the motion-to-photon latency.
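A minimal Python sketch of the resulting device-side loop is shown below; the decoder, tracker, renderer and display objects are hypothetical placeholders. The point is that the pose only enters at render time, so the delivery latency is decoupled from the motion-to-photon loop.

import time

def viewport_independent_render_loop(decoders, tracker, renderer, display, fps=90.0):
    """Sketch of viewport-independent rendering: the full scene is delivered and
    decoded regardless of the pose; the pose is applied only locally at render time.
    All objects (decoders, tracker, renderer, display) are hypothetical."""
    frame_interval = 1.0 / fps
    while display.is_active():
        # Decoded buffers cover the entire XR scene (delivery is pose-agnostic).
        buffers = [d.next_frame() for d in decoders]
        # The latest pose is sampled locally, just before rendering.
        pose = tracker.latest_pose()
        frame = renderer.render_viewport(buffers, pose)
        display.present(frame)
        time.sleep(frame_interval)  # naive pacing for the sketch; a real renderer uses vsync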
In order to provide sufficient content quality, the video material is preferably encoded such that the QoE parameters as defined in
clause 4.2 can be fulfilled. The necessary QoS and bitrates on the 5G System depend on the type of the XR media as well as on the streaming protocol. Based on information from the workshop
"Immersive Media meets 5G" in April 2019 as well as from publicly announced demos, that based on today's equipment and the one available over the next 2-3 years, around 100 Mbps are sufficient bitrates to address high-quality 6DOF VR services. This is expected to allow 2k per eye streaming at 90 fps (see
clause 4.2) using existing video codecs (see
clause 4.5). The QoE requirements may increase further, for example towards higher resolutions and frame rates, but with the advance of new compression tools this increase is expected to be compensated.
XR media delivery is typically built based on download or adaptive streaming such as DASH (see for example
TS 26.118 and
TS 26.247), such that one can adjust quality to the available bitrate to a large extent.
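For illustration, a minimal rate-adaptation sketch in the style of an adaptive streaming client is given below; the representation bitrates, safety factor and function name are illustrative assumptions and not a normative DASH algorithm.

def select_representation(available_bitrates_bps, measured_throughput_bps,
                          safety_factor=0.8):
    """Pick the highest representation that fits within a safety margin of the
    measured throughput; fall back to the lowest one otherwise.
    Generic rate-adaptation sketch, not a normative algorithm."""
    budget = measured_throughput_bps * safety_factor
    candidates = [b for b in sorted(available_bitrates_bps) if b <= budget]
    return candidates[-1] if candidates else min(available_bitrates_bps)

# Example: representations at 20, 50 and 100 Mbps, measured throughput 80 Mbps
print(select_representation([20e6, 50e6, 100e6], 80e6))  # -> 50000000.0 (50 Mbps)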
Suitable 5QI values for adaptive streaming over HTTP are 6, 8, or 9 as defined in
clause 4.3.3.
If other protocols are applied for streaming, then suitable 5QIs are FFS.
In the context of the present document, 3D media formats, efficient compression, adaptive delivery as well as the perceived quality of the XR media are of key relevance.
In the context of this delivery scenario, the following potential standardisation needs are identified:
-
Very high-bitrate and efficient/scalable streaming protocols
-
6DoF Scene Description and XR media integration
-
Video and audio codec extensions to efficiently code and render graphics centric formats (2D, meshes, point clouds)
-
Support of decoding platforms that support the challenges documented in clause 4.5.2.
In the viewport dependent delivery case, following the architecture in
clause 4.3 of TS 26.118, the tracking information is predominantly processed in the XR device, but the current pose information is provided to the XR delivery engine in order to include the pose information in the adaptive media requests. As an extension to this, in the case of XR and 6DoF, the XR pose and additional information may be shared with the XR content delivery in order to access only the information that is relevant for the current viewports. According to
Figure 6.2.3-2, the tracking and sensor data is processed in the XR device for XR rendering, and the media is adaptively delivered/requested based on the XR viewport. A reduced or viewport-optimized scene is delivered, and only this reduced scene is processed. Examples include not delivering an object that is not visible, delivering it only in low quality, or delivering only the part of the object that is in the viewport at the highest quality.
Use cases that may be addressed partially or completely by this delivery architecture are summarized in
clause 5.4.
The basic procedures follow the procedures of 5G Media Streaming in
clause 5 of TS 26.501. Both on-demand and live streaming may be considered.
In addition, the request for data is accompanied with information from the XR engine.
The same formats as discussed in
clause 6.2.2.4 apply.
Compared to the viewport independent delivery in
clause 6.2.2, for viewport dependent streaming, updated tracking and sensor information impacts the network interactivity. Typically, due to updated pose information, HTTP/TCP level information and responses are exchanged every 100-200 ms in viewport-dependent streaming.
From analysis in
TR 26.918 and other experience as for example documented at the workshop
"Immersive Media meets 5G" in April 2019
[42], such approaches can reduce the required bitrate compared to viewport independent streaming by a factor of 2 to 4 at the same rendered quality.
It is important to note that viewport-dependent streaming technologies are typically also built based on adaptive streaming, allowing the quality to be adjusted to the available bitrate. The knowledge of tracking information in the XR Delivery receiver just adds another adaptation parameter. However, generally such systems may be flexibly designed taking into account a combination/tradeoff of bitrates, latencies, complexity and quality.
In the context of this architecture, the following potential standardisation needs are identified:
-
The same aspects as defined in clause 6.2.2.6.
-
In addition, more flexible data structures and access to these data structures, as well as concurrent decoding and streaming of smaller units of data, such as tile-based structures, may be defined.
-
If other protocols than adaptive streaming over HTTP are applied, then suitable 5QIs are FFS.
In an architecture as shown in
Figure 6.2.4-1 below, the viewport is entirely rendered in the XR server. The XR server generates the XR Media on the fly based on incoming Tracking and Sensor information, for example using a game engine. The generated XR media is provided for the viewport in a 2D format (flattened), encoded and delivered over the 5G network. The tracking and sensor information is delivered in the reverse direction. In the XR device, the media decoders decode the media and the viewport is rendered directly without further use of the pose information.
The following call flow highlights the key steps:
-
An XR device connects to the network and to the XR Media Generation application
-
Sends static XR device information and capabilities (supported decoders, viewport)
-
Based on this information, the XR server sets up encoders and formats
-
Loop
-
XR device collects pose (or a predicted pose)
-
XR Pose is sent to XR Server
-
The XR Server uses the pose to generate/compose the viewport
-
XR Viewport is encoded with regular media encoders
-
The compressed video is sent to XR Device
-
The XR device decompresses video and directly renders viewport
Such an architecture enables simple clients, but poses significant challenges on compression and transport to fulfill the latency requirements. Latencies should be kept low for each processing step, including delivery, to make sure that the cumulative delay for all the processing steps (including tracking, pose delivery, viewport rendering, media encoding, media delivery, media decoding and display) is within the immersive motion-to-photon latency upper limit of 20ms.
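As an informative illustration, one iteration of this loop and its latency budget can be sketched in Python as follows; the tracker, uplink, xr_server, downlink, decoder and display objects and their methods are hypothetical placeholders, and only the budget check reflects the 20ms requirement stated above.

import time

MOTION_TO_PHOTON_LIMIT_S = 0.020  # immersive upper limit from this clause

def remote_rendered_frame(tracker, uplink, xr_server, downlink, decoder, display):
    """One iteration of the fully network-rendered loop (hypothetical objects).
    Tracking, pose delivery, server-side rendering, encoding, delivery,
    decoding and display must together stay below the 20 ms limit."""
    t0 = time.monotonic()
    pose = tracker.latest_pose()             # tracking / pose generation
    uplink.send(pose)                        # pose delivery to the XR server
    xr_server.render_and_encode(pose)        # viewport rendering + media encoding
    encoded_frame = downlink.receive()       # media delivery over the 5G downlink
    frame = decoder.decode(encoded_frame)    # media decoding
    display.present(frame)                   # display, no local pose correction
    elapsed = time.monotonic() - t0
    return elapsed <= MOTION_TO_PHOTON_LIMIT_S  # True only if the budget is met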
The following three cases, with different media delivery bitrates, are considered:
-
Around 100 Mbps: In this case, the XR device needs to perform a certain amount of processing and decoding.
-
Around 1 Gbps: In this case, only lightweight and low-latency compression (e.g. intra-only) may be used to provide sufficiently high quality (4k or even 8k at sufficiently high frame rates above 60 fps) and sufficiently low latency (immersive limit of less than 20ms for motion-to-photon) for such applications. It is still expected that some processing (e.g. decoding) by the XR device is needed.
-
Around 10 Gbps or even more: A full "USB-C like" wireless connection, providing functionalities that currently can only be provided by cable, possibly carrying uncompressed formats such as 8K video. The processing requirements for the XR device in this case may be minimal.
Note that the lightweight compression or no compression in cases 2) and 3) can help to reduce processing delays.
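The three cases can be motivated with a rough back-of-the-envelope calculation as sketched below; the resolution, bit depth and compression ratios are illustrative assumptions only.

def raw_video_bitrate_bps(width, height, fps, bits_per_pixel, eyes=2):
    """Uncompressed frame-buffer bitrate for a stereo signal (illustrative)."""
    return width * height * fps * bits_per_pixel * eyes

# Case 3: close to uncompressed 2k-per-eye buffers at 90 fps, 12 bpp (4:2:0, 8 bit)
raw = raw_video_bitrate_bps(2048, 2048, 90, 12)
print(f"raw: {raw / 1e9:.1f} Gbps")                # ~9.1 Gbps -> "around 10 Gbps"

# Case 2: lightweight intra-only compression, assumed ~10:1
print(f"intra-only: {raw / 10 / 1e9:.2f} Gbps")    # ~0.9 Gbps -> "around 1 Gbps"

# Case 1: conventional inter-coded compression, assumed ~100:1
print(f"inter-coded: {raw / 100 / 1e6:.0f} Mbps")  # ~91 Mbps -> "around 100 Mbps"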
In addition, the formats exported from game engines need to be supported by the respective media encoders.
Note that in this case for XR-based services, the motion-to-photon latency determines the maximum latency requirements for the content. This means that 20ms as defined in
clause 4.5.1 is the end-to-end latency target, including the uplink streaming of the pose information.
This is different if the content is rendered on a flat device (for example for cloud/edge gaming applications on smartphones), for which the latency requirements are determined not by the motion-to-photon latency, but by the roundtrip interaction delay. In this case, the requirements from
clause 4.5.2 apply; typically, a 50ms roundtrip latency is required for the most advanced games.
On Formats and codecs:
-
From the analysis, for case 1, similar aspects as defined in clause 6.2.2.6 apply for the formats.
-
For cases 2 and 3, formats are of less relevance for 3GPP as such formats are typically defined by other consortia, if at all.
On network support:
-
Network rendering for cloud gaming on flat screens is expected to be of significant relevance. In this case the end-to-end latency (action-to-photon) is determined by the roundtrip interaction delay, i.e. 50ms (see clause 4.5.2). 5QIs to support such latencies as well as guaranteed bitrates are considered of relevance. Required bitrates follow case 1) from above.
-
Network rendering for XR services would require an end-to-end latency including motion-to-photon (including network rendering, encoding, delivery and decoding) of 20ms to meet the immersive limits, and it is expected that the bitrates would be higher due to low-complexity and low-latency encoding, following cases 2) and 3) from above. Hence,
-
5QIs and QoS would be necessary that provide significantly lower latency than 10ms in both directions and at the same time provide a stable and high bitrate in the range of 0.1 - 1 Gbps according to case 2).
-
It is not expected to be practical for Uu-based communication to achieve such low latencies at very high bitrates (mostly case 3, e.g. 1 Gbps and higher) in the short term, but final studies on this matter are FFS.
-
However, sidelink-based communication addressing network rendering is expected to be feasible in the 5G architecture and is subject to active work in 3GPP.
Raster-based split rendering refers to the case where the XR Server runs an XR engine to generate the XR Scene based on information coming from an XR device. The XR Server rasterizes the XR viewport and does XR pre-rendering.
According to
Figure 6.2.5-1, the viewport is predominantly rendered in the XR server, but the device is able to do latest pose correction, for example by asynchronous time warping (see
clause 4.1) or other XR pose correction to address changes in the pose.
-
XR graphics workload is split into rendering workload on a powerful XR server (in the cloud or the edge) and pose correction (such as ATW) on the XR device
-
Low motion-to-photon latency is preserved via on device Asynchronous Time Warping (ATW) or other pose correction methods.
As ATW is applied, the motion-to-photon latency requirements (of at most 20 ms) are met by XR device-internal processing. What determines the network requirements for split rendering is the pose-to-render-to-photon time and the roundtrip interaction delay. According to
clause 4.5, the latency is typically 50-60ms. This determines the latency requirements for the 5G delivery.
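For illustration, a possible split of this budget is sketched below; the individual component values are assumptions, only their sum relative to the 50-60ms target matters.

# Illustrative split of the pose-to-render-to-photon budget for raster-based
# split rendering; the individual values are assumptions, only the sum matters.
budget_ms = {
    "pose uplink (device -> edge/cloud)": 5,
    "server-side viewport rendering": 15,
    "encoding": 5,
    "downlink delivery": 10,
    "decoding": 5,
    "ATW pose correction + display": 10,
}
total = sum(budget_ms.values())
assert total <= 60, "exceeds the 50-60 ms pose-to-render-to-photon target"
print(f"total pose-to-render-to-photon: {total} ms")   # 50 ms in this example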
The use cases in
clause 5.5 may be addressed by this architecture.
The following call flow highlights the key steps:
-
An XR Device connects to the network and joins the XR application
-
Sends static device information and capabilities (supported decoders, viewport)
-
Based on this information, the XR server sets up encoders and formats
-
Loop
-
XR Device collects XR pose (or a predicted XR pose)
-
XR Pose is sent to XR Server
-
The XR Server uses the pose to pre-render the XR viewport
-
XR Viewport is encoded with 2D media encoders
-
The compressed media is sent to XR device along with XR pose that it was rendered for
-
The XR device decompresses video
-
The XR device uses the XR pose provided with the video frame together with the actual XR pose to correct the local pose, e.g. using ATW (a sketch of this correction step is given below).
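For illustration, this correction step can be sketched as a rotation-only reprojection based on the delta between the rendered pose and the latest local pose; a real ATW implementation reprojects per pixel using the full 6DoF pose (and possibly depth), so the sketch below is a simplification with hypothetical inputs.

import numpy as np

def atw_correction(frame_rgb, rendered_yaw_pitch, latest_yaw_pitch,
                   fov_deg=(90.0, 90.0)):
    """Rotation-only time-warp sketch: shift the rendered frame by the angular
    difference between the pose it was rendered for and the latest local pose."""
    h, w, _ = frame_rgb.shape
    dyaw = latest_yaw_pitch[0] - rendered_yaw_pitch[0]
    dpitch = latest_yaw_pitch[1] - rendered_yaw_pitch[1]
    # Convert the angular delta to a pixel shift under the given field of view.
    dx = int(round(dyaw / fov_deg[0] * w))
    dy = int(round(dpitch / fov_deg[1] * h))
    return np.roll(np.roll(frame_rgb, -dx, axis=1), -dy, axis=0)

# Example: 2k x 2k eye buffer rendered for yaw 10 deg, head has moved to 12 deg
frame = np.zeros((2048, 2048, 3), dtype=np.uint8)
corrected = atw_correction(frame, rendered_yaw_pitch=(10.0, 0.0),
                           latest_yaw_pitch=(12.0, 0.0))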
Rasterized 3D scenes available in frame buffers (see
clause 4.4) are provided by the XR engine and need to be encoded, distributed and decoded. According to
clause 4.2.1, relevant formats for frame buffers are 2k by 2k per eye, potentially even higher. Frame rates are expected to be at least 60 fps, potentially higher, up to 90 fps. The formats of frame buffers are regular texture video signals that are then directly rendered. As the processing is graphics-centric, formats beyond the commonly used 4:2:0 YUV signals may be considered.
With the use of time warp, the latency requirements follow those documented in
clause 4.2.2, i.e. the end-to-end latency between the user motion and the rendering is 50ms.
It is known from experiments that with H.264/AVC the bitrates are in the order of 50 Mbps per eye buffer. It is expected that this can be reduced to lower bitrates with improved compression tools (see
clause 4.5), but higher quality requirements may absorb the gains. It is also known that this is both content- and user-movement-dependent, but experiments indicate that 50 - 100 Mbps is a valid target bitrate for split rendering.
Regular stereo audio signals are considered, requiring bitrates that are negligible compared to the video signals.
5QI values exist that may address the use case, such as 5QI value 80 with a 10ms packet delay budget; however, this is a non-GBR 5QI (see clause). In addition, it is unclear whether 10ms combined with such high bitrates and low required error rates may be too stringent and resource-consuming. Hence, for simple split rendering in the context of the requirements in this clause, suitable 5QIs may have to be defined addressing latency requirements in the range of 10-20ms and bitrate guarantees to be able to stream 50 to 100 Mbps consistently.
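For illustration only, the characteristics such a 5QI would need to capture are summarized in the sketch below; the values reflect the requirements stated in this clause and do not correspond to any standardized 5QI.

from dataclasses import dataclass

@dataclass
class SplitRenderingQoS:
    """Illustrative QoS profile for simple split rendering (not a standardized 5QI)."""
    resource_type: str               # GBR vs non-GBR
    packet_delay_budget_ms: int
    guaranteed_bitrate_mbps: tuple   # (min, max) target range

proposed = SplitRenderingQoS(
    resource_type="GBR",
    packet_delay_budget_ms=20,         # latency requirement in the 10-20 ms range
    guaranteed_bitrate_mbps=(50, 100)  # consistent 50-100 Mbps downlink
)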
The uplink is predominantly the pose information, see
clause 4.1 for details. Data rates are several 100 kbit/s and the latency should be small in order not to add to the overall target latency.
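A rough illustration of why the uplink ends up at a few hundred kbit/s is given below; the payload size, protocol overhead and pose update rate are assumptions.

# Illustrative uplink pose bitrate: timestamp + position (3 floats) + orientation
# (4 floats) plus some assumed protocol overhead, sent at an assumed tracking rate.
pose_payload_bytes = 8 + 3 * 4 + 4 * 4 + 20   # ~56 bytes including assumed overhead
tracking_rate_hz = 500                        # assumed high-frequency pose sampling
uplink_bps = pose_payload_bytes * 8 * tracking_rate_hz
print(f"{uplink_bps / 1e3:.0f} kbit/s")       # ~224 kbit/s, i.e. a few hundred kbit/s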
In the context of this architecture, the following potential standardisation needs are identified:
-
Regular 2D video encoders and decoders that are capable of encoding and decoding 2K per eye at up to 90 fps, and of encoding typical graphics frame buffer signals.
-
Pose information in the uplink at sufficiently high frequency
-
Content Delivery protocols to support the delivery requirements
-
Edge computing discovery and capability discovery
-
A simple XR split rendering application framework for single buffer streaming
-
New 5QIs and QoS support in 5G System for split rendering addressing latency requirements in the range of 10-20ms and bitrate guarantees to be able to stream 50 to 100 Mbps consistently.
In
Figure 6.2.6-1, an architecture is shown for which the XR server pre-renders the 3D scene into a simpler format to be processed by the device (e.g. it may provide additional metadata that is delivered with the pre-rendered version). The device recovers the baked media and does the final rendering based on a local correction for the actual pose.
-
XR graphics workload is split into rendering workload on a powerful XR server and simpler XR processing on the XR device
-
This approach enables the latency requirements to be relaxed while maintaining a fully immersive experience, as time-critical adjustment to the correct pose is done in the device.
-
This approach may provide more flexibility in terms of bitrates, latency requirements, processing, etc. than the single-buffer split rendering in clause 6.2.5.
Such an approach needs careful consideration of the formats of projected media and their compression with media decoders. Also important is the distribution of latencies to the different components of the system. A more detailed breakdown of the architecture is necessary. The interfaces in the device, however, are aligned with the general structure defined above.
In general, similar requirements and considerations as in
clause 6.2.5 apply, but a more flexible framework may be considered by providing not only 2D frame buffers, but different buffers that are split over the network.
The use cases in
clause 5.5 may be addressed by this architecture.
The following call flow highlights the key steps:
-
An XR Device connects to the network and joins the XR application
-
Sends static device information (supported decoders, viewport, supported formats)
-
Based on this information, the network server sets up encoders and formats
-
Loop
-
XR Device collects XR pose (or a predicted XR pose)
-
XR Pose is sent to XR Server
-
The XR Server uses the pose to pre-render the XR viewport by creating one or multiple rendering buffers, possibly with different update frequencies
-
The rendering buffers are encoded with 2D and 3D media encoders
-
The compressed media is sent to XR device along with additional metadata that describes the media
-
The XR device decompresses the multiple buffers and provides them to the XR rendering engine.
-
The XR rendering engine takes the buffers, the rendering pose assigned to the buffers and the latest XR pose to create the final rendered viewport (see the compositing sketch below).
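For illustration, this composition step could be sketched as follows; the buffer attributes and renderer methods are hypothetical placeholders.

def compose_viewport(buffers, latest_pose, renderer):
    """Generalized split-rendering composition sketch (hypothetical objects):
    each buffer carries the pose it was rendered for and its own update rate,
    and is individually corrected to the latest local pose before compositing."""
    corrected = []
    for buf in buffers:
        # buf is assumed to expose: data, rendered_pose, buffer_type ("2d", "mesh", ...)
        corrected.append(
            renderer.correct_to_pose(buf.data, buf.rendered_pose, latest_pose,
                                     kind=buf.buffer_type)
        )
    # Slowly updating buffers (e.g. a distant background) may simply be reused
    # from a previous cycle; fast buffers (e.g. nearby objects) update every frame.
    return renderer.composite(corrected)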
In this context, the buffers may not only be 2D texture or frame buffers as in case of
clause 6.2.5, but may include geometric data, 3D data, meshes and so on. Also multiple objects may be generated. The content formats discussed in
clause 4.6 apply.
With the use of different buffers, the latency requirements follow those documented in
clause 4.5.2, i.e. the end-to-end latency between the user motion and the rendering is 50ms. However, it may well be that the update frequency of certain buffers is lower. This may result in differentiated QoS requirements for the different encoded media, for example in terms of latency, bitrates, etc.
More details are FFS.
In the context of this architecture, the following potential standardisation needs are identified:
-
Similar aspects as defined in clause 6.2.5.6
-
Flexible 2D and 3D formats that can be shared over the network to serve device rendering buffers
-
Formats and decoding capabilities as defined in clause 4.5.2
-
Edge computing discovery and capability discovery for Generalized Split rendering
-
A generalized XR split rendering application framework
-
More flexible 5QIs and QoS support in 5G System for generalized split rendering addressing differentiated latency requirements in the range of 10ms up to potentially several 100ms and with bitrate guarantees.
This clause provides an architecture for extended reality applications that supports XR split rendering. The workload for XR processing is split into workloads on the XR server and the XR device. The below
Figure 6.2.7-1 shows a high-level structure of the XR distributed computing architecture, describing its components and interfaces.
An XR client may have the following capabilities:
-
XR capture
-
Sensor data processing (e.g., AR pose tracking)
-
XR scene generation
-
XR rendering
-
2D or 3D Media decoding
-
Metadata (including scene description) processing
-
5G delivery
An XR edge server may have the following capabilities:
-
Sensor data processing
-
XR scene generation
-
2D or 3D media encoding
-
Metadata (including scene description) generation
-
5G delivery
An XR client connects to the network and joins the XR rendering application. The XR client sends static device information (e.g., sensors, supported decoders, display configuration) to the XR edge server. Based on this information, the XR edge server sets up encoders and formats.
When the XR client has a set of sensors (e.g., trackers and capturing devices), it collects sensor data from these sensors. The collected sensor data is processed either locally or at the XR edge server. The collected sensor data or locally processed information (e.g., a current AR pose) is sent to the XR edge server. The XR edge server uses the information to generate the XR scene. The XR edge server converts the XR scene into a simpler format, such as 2D or 3D media with metadata (including a scene description). The media components are compressed, and the compressed media streams and metadata are delivered to the XR client. The XR client generates the XR scene by compositing locally generated or received media and metadata and renders the XR viewport via the XR display (e.g., HMD, AR glasses).
For example, the XR client captures a 2D video stream from a camera and sends the captured stream to the XR edge server. The XR edge server performs the AR tracking and generates the AR scene, in which a 3D object is overlaid at a certain position in the 2D video based on the AR tracking information. The 3D object and the 2D video for the AR scene are encoded with 2D/3D media encoders, and the scene description or metadata is generated. The compressed media and metadata are sent to the XR client. The XR client decodes the media and metadata and generates an AR scene which overlays the 3D object on the 2D video. A user viewport is determined by the horizontal/vertical field of view of the screen of a head-mounted display or any other display device. The appropriate part of the AR scene for the user viewport is rendered and displayed.
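For illustration, one cycle of this exchange is sketched below; all objects and methods are hypothetical placeholders for the capabilities listed above.

def ar_distributed_loop(camera, xr_edge_server, client_renderer, display):
    """One cycle of the XR distributed computing example (hypothetical objects):
    the client uploads captured 2D video, the edge server performs AR tracking,
    scene generation, encoding and metadata generation, and the client decodes,
    composites and renders the result for its viewport."""
    while display.is_active():
        frame_2d = camera.capture()                   # XR capture on the client
        # Uplink: captured stream to the edge; downlink: compressed media + metadata.
        media, scene_description = xr_edge_server.process(frame_2d)
        decoded = client_renderer.decode(media)       # 2D/3D media decoding
        scene = client_renderer.compose(decoded, scene_description, frame_2d)
        viewport = client_renderer.render(scene, display.field_of_view())
        display.present(viewport)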
In
Figure 6.2.8-1, a general architecture for XR conversational and conference services is depicted. As stated, these services are an extension of the current MTSI work, using the IMS for session signalling. In order to support XR conversational services (in 5G), extensions are needed in the signalling to enable VR/AR-specific attributes, and the media and metadata need to support the right codecs, profiles and metadata. A new interface is the interface to the network (based) media processing. This is an interface similar to that to an SRF, but it is expected to be much more extensive to support various types of media processing. This interface can be based on the work in MPEG on Network Based Media Processing.
Typical steps for call session setup follow normal IMS procedures, in case the clients have a simple peer-to-peer call and also do all processing themselves (simplified procedure as follows):
-
The first client initiates call setup (SIP INVITE);
-
The IMS (i.e. central session control) routes the call setup to the second client, ensuring proper bandwidth reservations in the network;
-
The second client, after call acceptance, responds to the call setup (200 OK);
-
The network controls all bandwidth reservations;
-
Call setup is completed.
But, given mobile clients, their limited processing capabilities, battery capacity and potential problems with heat dissipation, processing might be moved to the network. Typical processing for XR conferencing includes:
-
Foreground/background segmentation;
-
HMD removal, i.e. replacing a user's HMD with a 3D model of the actual face, possibly including eye tracking / reinsertion;
-
3D avatar reconstruction, i.e. using RGB + depth cameras or multiple cameras to create 3D user video avatars;
-
Support for multiple users with a (centralised or distributed) VR conferencing bridge, stitching multiple user captures together;
-
Creating a self-view, i.e. local 3D user avatar from the user's own perspective.
In such a network-processing scenario, the setup is somewhat extended:
-
First a client initiates the call setup;
-
Based on the call setup, the session control triggers network based media processing, reserves resources in the network, incl. media processing resources;
-
Session control forwards call setup to the second client;
-
After call acceptance, both the first and the second client are connected to the network processor.
-
Session control instructs the network processor on the actual processing and the stream forwarding, i.e. which input streams go to which clients.
Specific details here are for further study. Routing of media streams can be performed in various ways, using existing methods for stream transcoding or more centralised conferencing signalling. The interface to the media processor can be based on the existing MRF interface. However, given the developments within MPEG on Network Based Media Processing, a new interface may need to be defined.
For the routing of signalling, many options already exist within the IMS. Re-use and perhaps slight modifications are expected to be sufficient to cover the various use cases defined here. For the SDP within the signalling, more modifications are expected. Besides support for new types of media and thus new types of codecs (point cloud streaming, mesh streaming, depth streaming) and profiles for those codecs, new types of metadata also need to be supported. Calibration data (i.e. calibration of HMD position vs camera position), HMD information (position, orientation, characteristics), camera information (position/orientation, lens/sensor characteristics, settings) and user information (3D model, IOD, HRTF, etc.) can all be used to perform or improve the media processing.
Also, there are different media types, both for the environment and for the user avatars. A virtual environment can consist of a rendered environment, a 360 photo or video, or some hybrid. User avatars can be graphical avatars, video-based avatars, 3D video avatars or rendered avatars. Devices can be a single device (one mobile in an HMD enclosure, potentially with a separate Bluetooth camera) or multiple devices (a separate stand-alone VR HMD, multiple smartphones as cameras).
Additional aspects to be taken into account are:
-
placement of media processor: central vs edge, centralised vs distributed.
-
delay aspects for communication purposes. Ideally, delay is kept to a minimum, i.e. <150 ms one-way delay. Given the required processing, this is a challenge, and it will affect e.g. codec choices and rendering choices.
For XR Conversational services, we can consider 3 bandwidth cases according to the type of capture/user representation transmitted (with almost constant bandwidth on the upload):
-
2D+/RGBD: >2.7 Mbit/s (1 camera), >5.4 Mbit/s (2 cameras)
-
3D Mesh: ~30 Mbit/s
-
3D VPCC / GPCC: 5-50 Mbit/s (CTC MPEG)
Furthermore, we can assume that joining a communication experience session will result in a download peak at the beginning of the session to download the environment and associated media objects within the XR application. Throughout an XR communication experience session, the download bitrate might vary depending on the number of users represented and the upload format of those users.
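For illustration, the steady-state download bitrate of such a session could be estimated as sketched below; the per-format bitrates are taken from the list above (using the mid-point of the V-PCC/G-PCC range) and the example user mix is arbitrary.

# Illustrative per-user download bitrates (Mbit/s), taken from the cases above.
USER_FORMAT_BITRATE_MBPS = {
    "2d_plus_rgbd_1cam": 2.7,
    "2d_plus_rgbd_2cam": 5.4,
    "3d_mesh": 30.0,
    "3d_vpcc_gpcc": 25.0,   # mid-point of the 5-50 Mbit/s range
}

def session_download_mbps(remote_users):
    """Steady-state download estimate: sum of the upload formats of all remote
    users (the initial environment download peak is not included here)."""
    return sum(USER_FORMAT_BITRATE_MBPS[fmt] for fmt in remote_users)

# Example: a session with one mesh-based and two RGBD (2-camera) remote users
print(session_download_mbps(["3d_mesh", "2d_plus_rgbd_2cam", "2d_plus_rgbd_2cam"]))
# -> 40.8 Mbit/s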