
3GPP TR 26.928, Word version 18.0.0

A.16  Use Case 15: XR Meeting

Use Case Name
XR Meeting
Description
This use case is a mix of a physical and a virtual meeting. It is an XR extension of the virtual meeting place use case described in TR 26.918. The use case is exemplified as follows:
Company X organizes a workshop with discussions in a couple of smaller subgroups in a conference room, as shown for instance in the Figure below. Each subgroup gathers around a dedicated spot or table to discuss a certain topic, and participants are free to move to the subgroup of their interest. Remote participation is enabled.
The main idea for the remote participants is to create a virtual, 3D-rendered space where they can meet and interact through their avatars with other people. This 3D-rendered virtual space is a simplified representation of the real conference room, with tables at the same positions as in the real world. Remote participants are equipped with HMDs supporting binaural playback. A remote participant can move freely in the virtual conference room and interact with the different subgroups of people depending, for example, on the discussion they are having. A remote participant can speak to other participants in their immediate proximity and obtains a spatial audio rendering of what the other participants are saying. They hear the real participants from their relative positions in the virtual world, and they can freely walk from one subgroup to another to seamlessly join the different conversations that may happen concurrently in the meeting space. Consistent with the auditory scene, the remote participant sees on the HMD a rendered "Scene view" of the complete virtual meeting space from their viewpoint, i.e. relative to their position and viewing direction. Optionally, the remote participant may also select a "Top view" of the complete meeting space with all participants (or their avatars), or a "Table view"; the latter is generated from a 360-degree video capture at the relevant table. In all cases, the audio experience remains as in the "Scene view".
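As an informative illustration of the spatial audio rendering described above (the sketch and its function names are assumptions for illustration, not part of the use case), the direction and distance of a talker relative to a listener's tracked pose could be derived as follows; a binaural renderer needs this geometry for every received stream:

  import math

  def relative_source_geometry(listener_pos, listener_yaw_deg, source_pos):
      """Return (azimuth_deg, distance_m) of a talker relative to a
      listener's position and horizontal viewing direction (yaw).
      Coordinates are (x, y) in metres in the shared meeting space;
      elevation is omitted for brevity. Positive azimuth means the
      source is to the listener's left.
      """
      dx = source_pos[0] - listener_pos[0]
      dy = source_pos[1] - listener_pos[1]
      distance = math.hypot(dx, dy)
      # Angle of the source in world coordinates, rotated into the
      # listener's frame so that 0 degrees means "straight ahead".
      world_angle = math.degrees(math.atan2(dy, dx))
      azimuth = (world_angle - listener_yaw_deg + 180.0) % 360.0 - 180.0
      return azimuth, distance

  # Example: the listener faces +y; a talker 2 m away on the -x axis
  # is rendered 90 degrees to the left.
  print(relative_source_geometry((0.0, 0.0), 90.0, (-2.0, 0.0)))

Upon every rotational or translational movement of the listener's head, these values are recomputed so that the auditory scene stays world-locked.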
The physical participants see and hear avatars representing the remote participants through AR Glasses supporting binaural playback. They interact with the avatars in the discussions as if these were physically present participants. For physical participants, the interactions with other physical and virtual participants thus happen in an augmented reality. In addition, at each subgroup meeting spot, a video screen displays a 360-degree panoramic "Table view" taken from the middle of the respective table, including the overlaid avatars of the remote participants taking part in the subgroup discussion. Also displayed is the complete meeting space with all participants (or their avatars) in a "Top view".
A schematic of the configuration at the physical meeting space is shown in the following Figure. There, P1 through P4 represent the physical participants, while V1 through V3 are the remote participants. Also shown are two subgroup meeting spots (tables), each with a 360-degree camera mounted at its center. Further, the two video screens at each table are shown, one for the 360-degree panoramic "Table view" and one for the "Top view".
Figure A.16-1: Schematic of the configuration at the physical meeting space
Categorization
Type:
AR, VR, XR
Degrees of Freedom:
6DoF
Delivery:
Interactive, Conversational
Device:
Phone, HMD with binaural playback support, AR Glasses with binaural playback support
Preconditions
On a general level the assumption is that all physical attendees (inside the meeting facilities) wear a device capable of binaural playback and, preferably, AR glasses. Remote participants are equipped with HMDs supporting binaural playback. The meeting facility is a large conference room with a number of spatially separated spots (tables) for subgroup discussions. Each of these spots is equipped with at least one video screen. At each of the spots a 360-degree camera system is installed.
Specific minimum preconditions
Remote participants:
  • UE with render capability through connected HMD supporting binaural playback.
  • Mono audio capture.
  • 6DOF position tracking.
Physical participants:
  • UE with render capability through a non-occluded binaural playback system and preferably, but not necessarily, AR Glasses.
  • Mono audio capture of each individual participant, e.g. using an attached or detached microphone with suitable directivity, and/or acoustic scene capture at dedicated subgroup spots (tables).
  • 6DOF position tracking.
Meeting facilities:
  • Acoustic scene capture at dedicated subgroup spots (tables) and/or mono audio capture of each individual participant.
  • 360-degree video capture at dedicated subgroup spots (tables).
  • Video screens (connected to driving UE/PC-client) at dedicated subgroup meeting spots visualizing participants including remote participants at a subgroup spot ("Table view") and/or positions of participants in shared meeting space in "Top view".
Conference call server:
  • Maintenance of participant position data in shared virtual meeting space.
  • (Optional) synthesis of graphics visualizing positions of participants in shared meeting space in "Top view".
  • (Optional) generation of overlay/merge of synthesized avatars with 360-degree video to "Table view".
Media preconditions:
Audio:
  • The capability of simultaneous spatial render of multiple received audio streams according to their associated 6DOF attributes.
  • Adequate adjustments of the rendered scene upon rotational and translational movements of the listener's head.
Video/Graphics:
  • 360-degree video capture at subgroup meeting spots.
  • Support of simultaneous graphics render of multiple avatars according to their associated 6DOF attributes, including position, orientation, directivity:
    • Render on AR glasses.
    • Render on HMDs.
  • Overlay/merge synthesized avatars with 360-degree video to "Table view" (an illustrative projection sketch follows after this list):
    • Render as panoramic view on video screen.
    • VR Render on HMD excluding a segment containing the remote participant itself.
  • Synthesis of "Top view" graphics visualizing positions of participants in shared meeting space.
Media synchronization and presentation format control:
  • Required for controlling the flow and proper render of the various used media types.
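To make the avatar overlay concrete: with an equirectangular 360-degree capture, a remote participant's virtual position maps to panorama pixel coordinates by a simple angular projection. The following is a minimal, informative sketch (function and parameter names are illustrative assumptions), with the camera at the table center:

  import math

  def avatar_to_equirect(avatar_pos, cam_pos, cam_height, img_w, img_h):
      """Map a virtual participant head position (x, y, z in metres)
      to (u, v) pixel coordinates in an equirectangular panorama whose
      camera sits at cam_pos on the table, cam_height above the floor.
      """
      dx = avatar_pos[0] - cam_pos[0]
      dy = avatar_pos[1] - cam_pos[1]
      dz = avatar_pos[2] - cam_height
      yaw = math.atan2(dy, dx)                    # -pi..pi around the camera
      pitch = math.atan2(dz, math.hypot(dx, dy))  # -pi/2..pi/2
      u = (yaw + math.pi) / (2.0 * math.pi) * img_w
      v = (math.pi / 2.0 - pitch) / math.pi * img_h
      return int(u), int(v)

  # Avatar of V2 standing 1.5 m from the table center, head at 1.7 m,
  # camera at 1.0 m, in a 4096x2048 panorama:
  print(avatar_to_equirect((1.5, 0.0, 1.7), (0.0, 0.0), 1.0, 4096, 2048))

For the "Table view" on the video screen, the avatar graphics would then be drawn at (u, v) on top of the decoded panorama.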
System preconditions:
  • A metadata framework for the representation and transmission of positional information of an audio sending endpoint, with 6DOF attributes such as position, orientation, and directivity.
  • Maintenance of a shared virtual meeting space that is consistently aligned with the physical meeting space:
Real and virtual participant positions are merged into a combined shared virtual meeting space that is consistent with the positions of the real participants in the physical meeting space.
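As an informative sketch of how this alignment could be realized (the calibration approach and all names are assumptions for illustration), a fixed rigid transform can map tracked physical positions into the shared coordinate frame, calibrated on a known anchor such as a table center:

  import math

  def make_physical_to_shared(anchor_phys, anchor_virt, rotation_deg):
      """Build a 2-D rigid transform (rotation + translation) mapping
      physical-room coordinates into the shared virtual space, so that
      a known anchor (e.g. a table center) at anchor_phys lands exactly
      on anchor_virt.
      """
      theta = math.radians(rotation_deg)
      cos_t, sin_t = math.cos(theta), math.sin(theta)

      def transform(p):
          # Rotate around the physical anchor, then translate onto the
          # virtual anchor.
          rx = cos_t * (p[0] - anchor_phys[0]) - sin_t * (p[1] - anchor_phys[1])
          ry = sin_t * (p[0] - anchor_phys[0]) + cos_t * (p[1] - anchor_phys[1])
          return (anchor_virt[0] + rx, anchor_virt[1] + ry)

      return transform

  # With tables at identical positions in both spaces (as in this use
  # case), the transform reduces to the identity:
  to_shared = make_physical_to_shared((3.0, 2.0), (3.0, 2.0), 0.0)
  print(to_shared((4.0, 2.5)))   # tracked position of P1 -> (4.0, 2.5)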
Requirements and QoS/QoE Considerations
QoS: conversational requirements as for MTSI, using RTP for Audio and Video transport.
  • Audio: relatively low bit rate requirements; conversational latency requirements must be met.
  • 360-degree video: bit rates as specified in TS 26.118; conversational latency requirements must be met. It is assumed that remote participants will at any given time receive only the 360-degree video stream of a single subgroup meeting spot (typically the closest one).
  • Graphics for representing participants in shared meeting space may rely on a vector-graphics media format, see e.g. TS 26.140. The associated bit rates are low. Graphics synthesis may also be done locally in render devices, based on positional information of participants in shared meeting space.
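To illustrate why the positional-data bit rates are low (payload size, update rate and participant count below are illustrative assumptions, not taken from this TR), a back-of-envelope estimate:

  # Position as 3 x float32, orientation as a quaternion (4 x float32),
  # plus a small header, sent 20 times per second per participant.
  payload_bytes = 3 * 4 + 4 * 4 + 12      # = 40 bytes per update
  update_rate_hz = 20
  participants = 10

  bitrate_kbps = payload_bytes * 8 * update_rate_hz * participants / 1000
  print(f"{bitrate_kbps:.1f} kbit/s in total")   # 64.0 kbit/s for all ten

Even for all participants together, this remains small compared with the 360-degree video component.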
QoE: Immersive voice/audio and visual experience, Quality of the mixing of virtual objects into real scenes.
The described scenario provides the remote users with a 6DOF VR meeting experience and the auditory experience of being physically present in the physical meeting space. Quality of Experience for the audio aspect can be further enhanced if the users' UEs share not only their position but also their orientation. This allows the other virtual users to be rendered not only at their positions in the virtual conference space but also with the proper rotational orientation. This is useful if the audio subsystem and the avatars associated with the virtual users support directivity, such as specific audio characteristics related to face and back.
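Such directivity could, purely as an informative illustration, be modelled with a first-order (cardioid-like) gain pattern that attenuates a talker heard from behind relative to one heard face-on (all names and the pattern choice are assumptions, not requirements of this use case):

  import math

  def directivity_gain(source_yaw_deg, listener_pos, source_pos, alpha=0.5):
      """First-order directivity: gain 1.0 when the talker faces the
      listener; with alpha = 0.5 (cardioid) the gain drops to 0.0 when
      the talker is heard from directly behind.
      """
      # Direction from the talker towards the listener, world frame.
      to_listener = math.degrees(math.atan2(
          listener_pos[1] - source_pos[1],
          listener_pos[0] - source_pos[0]))
      off_axis = math.radians(to_listener - source_yaw_deg)
      return (1.0 - alpha) + alpha * math.cos(off_axis)

  # A talker facing +x, heard by a listener directly behind them:
  print(directivity_gain(0.0, (-2.0, 0.0), (0.0, 0.0)))   # -> 0.0 (cardioid null)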
The "Scene view" for the remote participants allows consistent rendering of the audio with the 3D-rendered graphics video of the meeting space. However, that view obviously compromises naturalness and "being-there" experience through the mere visual presentation of the participants through avatars. The optional "Table view" may improve the naturalness as it relies on a real 360-degree video capture. However, QoE of that view is compromised since the 360-degree camera position does not coincide with virtual position of remote user. Viewpoint correction techniques may be used to mitigate this problem.
The physical meeting users experience the remote participants audio-visually at virtual positions, as if these were physically present and could come closer or move around like physical users. The AR glasses display the avatars of the remote participants at positions and orientations matching the auditory perception. Physical participants without AR glasses get a visual impression of where the remote participants are located in relation to their own position through the video screens at the subgroup meeting spots, with the offered "Table view" and/or "Top view".
Feasibility
Under "Preconditions" the minimum preconditions are detailed and broken down by all involved nodes of the service, such as remote participants, physical participants, meeting facilities and conference call server. In summary, the following capabilities and technologies are required:
  • UE with render capability through connected HMD supporting binaural playback.
  • UE with render capability through a non-occluded binaural playback system and preferably, but not necessarily, AR Glasses.
  • Mono audio capture and/or acoustic scene capture.
  • 6DOF position tracking.
  • 360-degree video capture at dedicated subgroup spots.
  • Video screens (connected to driving UE/PC-client) at dedicated subgroup meeting spots visualizing participants including remote participants at a subgroup spot ("Table view") and/or positions of participants in shared meeting space in "Top view".
  • Maintenance of participant position data in shared virtual meeting space.
  • (Optional) synthesis of graphics visualizing positions of participants in shared meeting space in "Top view".
  • (Optional) generation of overlay/merge of synthesized avatars with 360-degree video to "Table view".
While the suggested AR glasses for the physical meeting participants are very desirable for high QoE, the use case is fully feasible without them; in that case, immersion is provided through the audio media component alone. Thus, none of the preconditions constitutes a feasibility barrier, given that the required technologies are widely available and affordable today.
Potential Standardization Status and Needs
  • Requires standardization of at least a 6DOF metadata framework and a 6DOF capable renderer for immersive voice and audio.
  • The presently ongoing IVAS codec work item may provide an immersive voice and audio codec/renderer and a metadata framework that may meet these requirements.
  • Other media (non-audio) may rely on existing video/graphics coding standards available to 3GPP.
  • Also required are suitable session protocols coordinating the distribution and proper rendering of the media flows.