Shared AR conversational experience is an end-to-end conversational service in which two or more parties communicate through a network/cloud entity that creates a shared experience: every party in the call sees, in its AR experience, the same relative arrangement of the other participants with respect to each other. For instance, the interaction between two parties sitting next to each other in the virtual space (e.g. when these parties turn to each other while talking) is seen by all participants in the same way. Note that the AR Runtime in each device customizes and updates the arrangement of the people in the virtual room; the absolute positioning of people or objects in a user's scene may vary based on the physical constraints of the user's room. This shared experience distinguishes this use case from the AR conversational experience of clause 6.5.
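For illustration only, the following Python sketch shows the distinction between the shared relative arrangement (identical for all participants) and the device-local absolute placement chosen by each AR Runtime. The data structures and values are hypothetical and do not correspond to any specified format.

```python
# Illustrative sketch: the shared arrangement is expressed once, relative to a common
# virtual-room origin; each AR Runtime places it at a locally chosen anchor transform.
import numpy as np

# Shared by all participants: pose of each remote user relative to the virtual-room
# origin (4x4 homogeneous transforms), produced by the immersive media processing function.
shared_arrangement = {
    "alice": np.array([[1, 0, 0, -0.8], [0, 1, 0, 0.0], [0, 0, 1, 1.5], [0, 0, 0, 1]], float),
    "bob":   np.array([[1, 0, 0,  0.8], [0, 1, 0, 0.0], [0, 0, 1, 1.5], [0, 0, 0, 1]], float),
}

def place_in_local_scene(shared_poses, local_anchor):
    """Apply a device-local anchor transform (chosen to fit the physical room)
    while preserving the relative geometry between participants."""
    return {name: local_anchor @ pose for name, pose in shared_poses.items()}

# Two devices anchor the same arrangement differently; the relative positions of
# alice and bob are identical in both local scenes, only absolute placement differs.
anchor_device_a = np.eye(4); anchor_device_a[0:3, 3] = [0.0, 0.0, -1.0]
anchor_device_b = np.eye(4); anchor_device_b[0:3, 3] = [2.0, 0.0,  0.5]
scene_a = place_in_local_scene(shared_arrangement, anchor_device_a)
scene_b = place_in_local_scene(shared_arrangement, anchor_device_b)
```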
In addition to the building blocks listed in clause 6.5.1, an immersive media processing function is needed to create the shared virtual experience. This requirement is discussed as an abstract functionality; in actual deployments, this functionality may be implemented in different ways or by different entities, in a centralized or distributed fashion, or in other possible arrangements.
This experience may be deployed with a combination of AR and non-AR devices. In this context, an AR device is capable of overlaying received media objects on a see-through display (e.g. AR glasses) or on the display of the device while it captures live content through its camera and renders it on its display (e.g. a tablet or phone). A non-AR device only receives one or multiple 2D video streams, each representing one of the other participants, and is incapable of overlaying received media objects on a see-through display or on the scene captured by its camera. In such a scenario, each AR device creates an AR scene as mentioned above, while an application running on the edge/cloud may create one or multiple 2D videos (e.g. a VR video or multi-view videos) of a virtual room which includes all other participants and stream one or more of them to a non-AR device, based on its user's preference. The user of a non-AR device can also change the viewport into the virtual room by changing the position of the device or by using navigation devices such as a keyboard or mouse, but the device does not provide an AR experience.
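As an illustration only, the following sketch shows how the edge/cloud application might select the downlink delivery per receiver depending on whether the device is an AR or non-AR device. The class and field names are hypothetical and are not defined by this document.

```python
# Hypothetical per-receiver delivery selection; names and fields are illustrative.
from dataclasses import dataclass
from enum import Enum

class DeviceType(Enum):
    AR = "ar"          # can overlay received objects on see-through or camera view
    NON_AR = "non_ar"  # can only display 2D video

@dataclass
class ReceiverSession:
    participant_id: str
    device_type: DeviceType
    preferred_view: str = "default"  # user preference for the 2D rendition

def select_downlink(session: ReceiverSession) -> dict:
    if session.device_type is DeviceType.AR:
        # Deliver the scene description plus individual media objects; the AR device
        # composes the AR scene itself using its AR Runtime.
        return {"scene_description": True, "media": "per-participant objects"}
    # Non-AR: the edge renders the virtual room into one or more 2D videos
    # (e.g. a VR video or multi-view videos) and streams the selected view.
    return {"scene_description": False, "media": f"2D rendition: {session.preferred_view}"}

print(select_downlink(ReceiverSession("carol", DeviceType.NON_AR, "front")))
```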
To describe the functional architecture for the shared AR conversational experience use case (e.g. the one described in Annex A.7), and to identify the content delivery protocols and performance indicators, an end-to-end architecture is considered. The end-to-end architecture for shared AR conferencing (one direction) is shown in Figure 6.6.3-1. To simplify the architecture, only the 5G STAR UE is considered in this figure.
Camera(s) capture the participant(s) in an AR conferencing scenario. The camera(s) for each participant are connected to a UE (e.g. a laptop, mobile phone, or AR glasses) via a data network (wired/wireless). Live camera feeds, sensor data, and audio signals are provided to the UE, which processes, encodes, and transmits the immersive media content to the 5G system for distribution. In multi-party AR conversational services, the immersive media processing function on the cloud/network receives the uplink streams from the various devices and composes a scene description defining the arrangement of the individual participants in a single virtual conference room. The scene description as well as the encoded media streams are delivered to each receiving participant. A receiving participant's 5G STAR UE receives, decodes, and processes the 3D video and audio streams, and renders them using the received scene description and the information received from its AR Runtime, creating an AR scene of the virtual conference room with all other participants.
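For illustration only, the following sketch shows how the immersive media processing function might compose a scene description that places each participant's media stream in the virtual conference room. The node and field names are a simplified, glTF-like illustration (assumed here for readability), not the normative scene description format, and the URIs are placeholders.

```python
# Illustrative composition of a scene description for the virtual conference room.
import json

def compose_conference_scene(participants):
    """participants: list of (participant_id, media_stream_uri, position_xyz)."""
    nodes = []
    for pid, uri, (x, y, z) in participants:
        nodes.append({
            "name": pid,
            "translation": [x, y, z],            # placement in the shared virtual room
            "media": {"uri": uri, "type": "3d"}  # reference to that participant's uplink stream
        })
    return {"scene": {"name": "virtual_conference_room", "nodes": nodes}}

# Hypothetical example with two remote participants.
scene = compose_conference_scene([
    ("alice", "rtp://edge.example/alice_stream", (-0.8, 0.0, 1.5)),
    ("bob",   "rtp://edge.example/bob_stream",   ( 0.8, 0.0, 1.5)),
])
print(json.dumps(scene, indent=2))
```

The same scene description is delivered to every receiver, which is what guarantees the shared relative arrangement described above.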
Also note that, if format conversion is desired, the immersive media processing function on the cloud may optionally use media services such as pre-processing of the captured 3D video, format conversion, and any other processing before compression of the immersive media content, including the 3D representation of the participants (e.g. in the form of meshes or point clouds) in an AR conferencing scenario.
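A minimal sketch of such an optional conversion step is given below, assuming a hypothetical uplink processing chain; the function names are placeholders and do not correspond to any specified media service.

```python
# Hypothetical optional pre-processing / format-conversion step before compression.
def prepare_for_compression(capture: dict, target_format: str = "mesh") -> dict:
    """capture: dict with 'format' ('point_cloud' or 'mesh') and raw 'data'."""
    if capture["format"] != target_format:
        # e.g. convert a captured point cloud to a mesh when receivers expect meshes
        capture = {"format": target_format, "data": convert(capture["data"], target_format)}
    return clean_up(capture)  # any other processing before the encoder

def convert(data, target_format):
    # Placeholder for an actual surface-reconstruction or re-sampling step.
    return data

def clean_up(capture):
    # Placeholder for denoising or other pre-processing of the captured 3D video.
    return capture
```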
Figure 6.6.3-2 illustrates the architecture for the shared AR conversational experience use case when a 5G EDGAR UE (receiver) is used. While the functionalities of the sender and the network/cloud shown in Figure 6.6.3-1 are identical for the STAR and EDGAR cases, an EDGAR device relies on a split-rendering function on the cloud/edge.
The AR session management may be done by the AR/MR application on the device. In this case, it is the responsibility of the device to connect to the edge/cloud and acquire an entry point during session management. The AR Scene Manager on the cloud/edge generates a lightweight scene description and a simple format of AR media that match the AR glasses' display capabilities of the individual participant's 5G EDGAR device. The lightweight scene description and the encoded rendered scene are delivered to the UE. The UE receives the simple format of AR media and the audio streams, decodes and renders them using the received lightweight scene description and the information received from its AR Runtime, creating an AR scene of the virtual conference room with all other participants.
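The split-rendering exchange between the EDGAR device and the edge can be sketched as follows. This is an illustrative outline only; the class names, message shapes, and the "late pose" handling are assumptions for readability, not the specified split-rendering interface.

```python
# Illustrative EDGAR split-rendering exchange: the device uploads its pose, the edge
# AR Scene Manager renders a simplified view of the virtual room and returns a
# lightweight scene description plus encoded media matched to the display capabilities.
from dataclasses import dataclass

@dataclass
class Pose:
    position: tuple      # (x, y, z) of the viewer in the virtual room
    orientation: tuple   # quaternion (x, y, z, w)

class EdgeSceneManager:
    """Runs on the cloud/edge: renders the 3D conference room into a simple format."""
    def render_for_device(self, pose: Pose, display_caps: dict) -> dict:
        # In a real deployment this would render the 3D scene for the given pose and
        # encode it (e.g. as 2D video) to match the EDGAR display capabilities.
        return {
            "lightweight_scene_description": {"nodes": ["conference_room_view"]},
            "encoded_media": f"view@{pose.position} {display_caps['resolution']}",
        }

class EdgarDevice:
    """5G EDGAR UE: uploads its pose, receives pre-rendered media, composites locally."""
    def __init__(self, edge: EdgeSceneManager, display_caps: dict):
        self.edge = edge
        self.display_caps = display_caps

    def update(self, current_pose: Pose) -> str:
        frame = self.edge.render_for_device(current_pose, self.display_caps)
        # Decode and render using the lightweight scene description and the latest
        # AR Runtime pose (late-stage correction would be applied here).
        return f"display {frame['encoded_media']}"

device = EdgarDevice(EdgeSceneManager(), {"resolution": "1080p"})
print(device.update(Pose((0.0, 1.6, 0.0), (0, 0, 0, 1))))
```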