Content for TR 26.928 Word version: 18.0.0


A.14 Use Case 13: 3D shared experience

Use Case Description: 3D shared experience
In this shared 3D use case, two friends (Eileen and Bob) share a virtual experience. The experience is built around a crime investigation: it shows an investigation of two murder suspects and allows the users to discuss and identify who committed the murder. Both Eileen and Bob join from home, each wearing a VR HMD and being captured via an RGB+depth camera. In VR they experience a 3-dimensional room (6DoF, police station). They are represented in 3D, including a self-representation that allows them to point at items in the room and at each other. This representation can be based on the same capture that is made with the RGB+depth camera for communication purposes. Further, in the virtual police station each of them has a window to follow a different interrogation (windowed 6DoF / 3DoF+), allowing them to collect information to solve the murder together (see Figure A.14-1).

Figure A.14-1: Example image of a virtual 3D experience with photo-realistic user representations
Categorization
Type: AR, MR, VR
Degrees of Freedom: 3DoF+ / 6DoF
Delivery: Conversational
Device: Mobile / Laptop
Preconditions
The above use case results in the following hardware requirements:
- Each user needs a VR HMD (mobile, stand-alone, wired/wireless VR HMD).
- Each user needs a depth camera to be captured (connected via Bluetooth, integrated into a mobile phone, or wired).
- Each user needs a microphone and an audio headset for audio upload and spatial audio playback.
- Each user needs to be connected and registered to a network that is able to facilitate the end-to-end audio/video call.
Requirements and QoS/QoE Considerations
The following QoS requirements are considered:
- Bandwidth: A minimum bandwidth of at least 6 Mbit/s is expected (this is for a single 2D+ user stream with RGB + depth video); this requirement can increase with more complex and higher-resolution streams.
- Delay: suitable for real-time communication
- Delay (self-view): suitable for a feeling of embodiment

The main goal of this use case is to create a shared presence and immersion in a 3DoF+/6DoF experience. Thus, the following QoE considerations are relevant:
- Capture & Processing:
  - The resolution of the RGB+depth camera needs to be sufficient.
  - The foreground/background extraction needs to result in an accurate cut-out of the user.
- Transmission:
  - The compression of audio and video data should follow similar constraints as traditional video conferencing.
- Rendering:
  - Users need to be scaled and positioned in the AR/VR environment in a natural way.
  - Audio playback needs to match the spatial orientation of the user.
  - A self-view needs to be properly aligned with the actual body movement to align the proprioceptive and visual experience. The delay for this also needs to be kept to a minimum.
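As a rough illustration of how the bandwidth requirement scales, the sketch below multiplies the 6 Mbit/s minimum for one RGB+depth stream by the number of remote peers. It assumes a full-mesh, peer-to-peer topology in which every participant sends its stream to every other participant; the use case itself only describes two users and does not prescribe a topology. The function name `mesh_bandwidth_mbps` is hypothetical, and audio is ignored.

```python
from typing import Tuple


def mesh_bandwidth_mbps(participants: int, per_stream_mbps: float = 6.0) -> Tuple[float, float]:
    """Return (uplink, downlink) in Mbit/s for ONE participant in a full mesh.

    per_stream_mbps defaults to the 6 Mbit/s minimum quoted for a single
    RGB+depth user stream; the full-mesh topology is an assumption.
    """
    if participants < 2:
        raise ValueError("a call needs at least two participants")
    peers = participants - 1
    # Each participant uploads one copy of its stream per peer
    # and downloads one stream from each peer.
    uplink = peers * per_stream_mbps
    downlink = peers * per_stream_mbps
    return uplink, downlink


for n in (2, 3, 4):
    up, down = mesh_bandwidth_mbps(n)
    print(f"{n} participants: {up:.0f} Mbit/s up, {down:.0f} Mbit/s down per user")
```

With a central media server instead of a mesh, the uplink would stay at a single stream per user, which is one reason the topology assumption matters.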
Feasibility
Demos & Technology overview:
- M. J. Prins, S. N. B. Gunkel, H. M. Stokking, and O. A. Niamut. TogetherVR: A Framework for photorealistic shared media experiences in 360-degree VR. SMPTE Motion Imaging Journal 127.7:39-44, August 2018.
- S. N. B. Gunkel, H. M. Stokking, M. J. Prins, N. van der Stap, F. B. T. Haar, and O. A. Niamut. Virtual Reality Conferencing: Multi-user immersive VR experiences on the web. In Proceedings of the 9th ACM Multimedia Systems Conference (pp. 498-501). ACM, June 2018.
- 2018, IBC Demo: https://vrtogether.eu/2018/09/14/ibc-show-2018/

In summary:
- Users are captured with an RGB+depth device, e.g. Microsoft Kinect or Intel RealSense camera.
- This capture is processed locally for foreground/background segmentation and optionally for creation of a self-view.
- WebRTC is used for setting up streams to the other call participants.
- A-Frame / WebVR is used for rendering the virtual environment.

Existing Service: http://www.mimesysvr.com/

Summary of steps:

Figure A.14-2: Functional blocks of end-to-end communication

Furthermore, to realize this use case, it is mapped onto the following functional blocks:
- Capture & Processing: The data from the RGB+depth camera needs to be acquired and further processed (to separate the user from the background). In particular, the depth information might need further processing before transmission.
- Transmission: There needs to be a two-way end-to-end link between individual participants to transmit audio and video data. The video data should include both the RGB colour and the depth information.
- Rendering: The transferred user representation has to be blended into the VR environment (according to its geometrical properties based on the RGB + depth data), and any audio needs to be played according to its spatial origin within the environment. Further, the self-representation of the user has to be displayed aligned so that the view of the user and their physical position match.

Please note that all 3 functional blocks can be executed on one device, on multiple devices, or in the network.
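The foreground/background step in the Capture & Processing block can be sketched as a simple depth threshold: pixels whose depth lies within an expected range in front of the camera are kept, everything else is made transparent. This is a minimal illustration only, not the method used by the cited demos. The function name `extract_foreground`, the millimetre unit and the `near_mm`/`far_mm` thresholds are assumptions, and frames are plain nested lists rather than camera SDK buffers.

```python
from typing import List, Tuple

RGB = Tuple[int, int, int]
RGBA = Tuple[int, int, int, int]


def extract_foreground(
    rgb: List[List[RGB]],
    depth_mm: List[List[int]],
    near_mm: int = 400,    # hypothetical closest distance of the user
    far_mm: int = 1500,    # hypothetical farthest distance of the user
) -> List[List[RGBA]]:
    """Keep pixels whose depth is within [near_mm, far_mm]; make the rest transparent.

    A depth of 0 (no measurement, i.e. a hole) falls outside the range and is
    treated as background; real pipelines may fill such holes first.
    """
    out: List[List[RGBA]] = []
    for rgb_row, depth_row in zip(rgb, depth_mm):
        row: List[RGBA] = []
        for (r, g, b), d in zip(rgb_row, depth_row):
            is_user = near_mm <= d <= far_mm
            row.append((r, g, b, 255 if is_user else 0))
        out.append(row)
    return out


# Tiny 2x2 example frame: one row per list, depth in millimetres.
frame_rgb = [[(10, 20, 30), (40, 50, 60)], [(70, 80, 90), (100, 110, 120)]]
frame_depth = [[900, 3000], [0, 1200]]
cutout = extract_foreground(frame_rgb, frame_depth)
```

Because the alpha channel carries the cut-out, the same output could feed both the transmitted user representation and a local self-view.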
Potential Standardization Status and Needs
The following aspects may require standardization work:
- System Architecture
- Communication interfaces (signalling)
- Media Orchestration (i.e. metadata)
- Position and scaling of people
- Spatial Audio (e.g. including audio directionality of users)
- Background audio
- Shared content, i.e. multi-device media synchronization
- Allow network-based processing (e.g. cloud rendering, foreground/background removal of user capture, image enhancements like hole filling, replacing the HMD of a user with a photo-realistic representation of their face, etc.)
- Transmission: The end-to-end system (including the network) needs to support the RGB+depth video data.
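To make the position and spatial audio items more concrete, the sketch below derives the direction of a remote participant relative to the listener's head and turns it into left/right gains. It is a simplified stand-in: it uses constant-power stereo panning instead of full spatial (e.g. HRTF-based) audio, so it cannot distinguish front from back. The coordinate convention (x to the right, z forward, yaw in degrees with positive values turning right) and both function names are assumptions.

```python
import math
from typing import Tuple


def relative_azimuth_deg(
    listener_xz: Tuple[float, float],
    listener_yaw_deg: float,
    source_xz: Tuple[float, float],
) -> float:
    """Angle of the source relative to where the listener is facing.

    0 means straight ahead, positive means to the right; result is in [-180, 180).
    """
    dx = source_xz[0] - listener_xz[0]
    dz = source_xz[1] - listener_xz[1]
    bearing = math.degrees(math.atan2(dx, dz))  # 0 = along +z
    return (bearing - listener_yaw_deg + 180.0) % 360.0 - 180.0


def stereo_gains(azimuth_deg: float) -> Tuple[float, float]:
    """Constant-power (left, right) gains for a source at the given azimuth."""
    pan = math.sin(math.radians(azimuth_deg))   # -1 = full left, +1 = full right
    theta = (pan + 1.0) * math.pi / 4.0         # map pan to [0, pi/2]
    return math.cos(theta), math.sin(theta)


# Bob stands at (1, 1); Eileen is at the origin, facing along +z.
azimuth = relative_azimuth_deg((0.0, 0.0), 0.0, (1.0, 1.0))
left, right = stereo_gains(azimuth)
print(f"azimuth {azimuth:.1f} deg, left gain {left:.2f}, right gain {right:.2f}")
```

In a standardized solution, the positions and orientations used here would presumably come from the signalled orchestration metadata rather than from local constants.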