Use Case Name |
---|
AR animated avatar call |
Description |
This use case covers a call between one user wearing AR glasses and another user using a phone in handset mode. The AR glasses user sees an animated avatar of the phone user; movements of the phone user control the animation of this avatar, improving the call experience for the AR glasses user. A potential user experience is described as a user story: Tina is wearing AR glasses while walking around in the city. She receives an incoming call from Alice, who is using her phone and who is displayed as an overlay ("head-up display") on Tina's AR glasses. Alice does not have a camera facing her, therefore a recorded 3D image of her is sent to Tina as the call is initiated. The 3D image Alice sent can be animated, following Alice's actions. As Alice holds her phone in handset mode, her head movements result in corresponding animations of her transmitted 3D image, giving Tina the impression that Alice is attentive. As Tina's AR glasses also include a pair of headphones, Alice's mono audio is rendered binaurally at the position where she is displayed on Tina's AR glasses. Tina also has interactivity settings, allowing her to lock Alice's position on her AR screen. With the lock enabled, Alice's visual and auditory appearance moves with Tina's head as she rotates it. When Tina disables the position lock, Alice's visual and auditory appearance is placed within Tina's real world, and Tina's head rotation is compensated in the on-screen and audio presentation, requiring visual and binaural audio rendering with scene displacement. |
Categorization |
Type: AR
Degrees of Freedom: 2D, 3DoF
Delivery: Conversational
Device: Phone, HMD, Glasses, headphones |
Preconditions |
AR participants: Phone with tethered AR glasses and headphones (with acoustic transparency). Phone participant: Phone with motion sensor and potentially proximity sensor to detect handset mode. |
Requirements and QoS/QoE Considerations |
QoS: QoS requirements like MTSI requirements (conversational, RTP), e.g. 5QI 1. QoE: Immersive voice/audio and visual experience; quality of the mixing of virtual objects (avatars) into real scenes and of the rendering of audio overlaid on the real acoustic environment. |
Feasibility |
AR glasses in various form factors exist, including motion sensing and inside-out tracking. This allows locking of avatars and audio objects to the real world. Smartphones typically come with built-in motion sensing, using a combination of gyroscopes, magnetometers and accelerometers. This allows extraction of the head's rotation when the phone is used in handset mode; this motion data could be sent to the other endpoint to animate/rotate the avatar/3D image (see the sketch after this table). |
Potential Standardization Status and Needs |
- Visual coding and transmission of avatars or cut-out heads, alpha channel coding
- Transmission and potentially coding of motion data to show attentiveness |
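As an illustration of the feasibility argument above (not part of the original use-case description), the following minimal Python sketch shows how coarse head-rotation angles could be derived from the phone's gyroscope while in handset mode and packaged as motion metadata for the far end to animate the avatar/3D image. The class, function names and the JSON payload format are assumptions for illustration only.

```python
# Hypothetical sketch: phone IMU rates -> coarse head rotation -> motion payload.
import json
import math
import time
from dataclasses import dataclass

@dataclass
class ImuSample:
    gx: float  # gyroscope rate around x, rad/s (pitch)
    gy: float  # gyroscope rate around y, rad/s (yaw)
    dt: float  # seconds since the previous sample

def integrate_rotation(samples):
    """Integrate gyroscope rates into coarse yaw/pitch angles (radians).

    In handset mode the phone follows the head, so its rotation is taken as a
    proxy for head rotation; drift correction via magnetometer/accelerometer
    fusion is omitted for brevity.
    """
    yaw = pitch = 0.0
    for s in samples:
        yaw += s.gy * s.dt
        pitch += s.gx * s.dt
    return yaw, pitch

def motion_payload(yaw, pitch):
    """Build a hypothetical JSON blob carrying the rotation to the other endpoint."""
    return json.dumps({
        "t": time.time(),
        "yaw_deg": math.degrees(yaw),
        "pitch_deg": math.degrees(pitch),
    })

# Example: a short burst of samples while the head turns slightly to the right.
samples = [ImuSample(gx=0.0, gy=0.3, dt=0.02) for _ in range(25)]
print(motion_payload(*integrate_rotation(samples)))
```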
Use Case Name |
---|
AR avatar multi-party call |
Description |
This use case is about multi-party communication with spatial audio rendering, where avatars and audio of each participant are transmitted and spatially rendered in the direction of their geolocation. Each participant is equipped with AR glasses with external or built-in headphones. 3D audio can be captured and transmitted instead of mono, which leads to enhancements when sharing the audio experience. A potential user experience is described as a user story: Bob, Jeff, and Frank are in Venice, walking around the old city sightseeing. They are all wearing AR glasses with a mobile connection via their smartphones. The AR glasses support audio spatialization, e.g. via binaural rendering and playback over the built-in headphones, allowing the real world to be augmented with visuals and audio. They start a multi-party call, where each of them gets the other two friends displayed on his AR glasses and can hear their audio. While they walk around in the silent streets, they have a continuous voice call with the avatars displayed on their AR glasses, while other information is also displayed to direct them to the secret places of Venice. Each of them transmits his current location to his friends. Their AR glasses / headphones place the others visually and acoustically (i.e. binaurally rendered) in the direction where the others are (see the direction-computation sketch after this table). Thus, they all at least know the direction of the others. As Jeff wants to buy some ice cream, he switches to push-to-talk so as not to annoy his friends with all the interactions he has with the ice cream shop. As Bob gets closer to Piazza San Marco, the environment gets noisier, with sitting and flying pigeons surrounding him. Bob turns on the "hear what I hear" feature to give his friends an impression of the fascinating environment, sending 3D audio of the scene to Frank and Jeff. Having become interested, they also want to experience the pigeons around them and walk through the city to the square. Each of the friends is still placed on the AR glasses visually and acoustically in the direction where the friend is, which makes it easy for them to find Piazza San Marco and for Frank to just walk across the square to Bob as he approaches him. Jeff, who is still eating his ice cream, is now also coming closer to Piazza San Marco and walks directly to Bob and Frank. As they get close to each other, they are no longer rendered (avatars and audio), based on the positional information, and they simply chat with each other. |
Categorization |
Type: AR
Degrees of Freedom: 3DoF
Delivery: Conversational
Device: AR glasses, headphones |
Preconditions |
Connected AR glasses or phone with tethered AR glasses and headphones (with acoustic transparency). Positioning support (e.g. using GNSS) to derive geolocation, allowing calculation of relative position. 3D audio capturing (e.g. using microphone arrays) and rendering. |
Requirements and QoS/QoE Considerations |
QoS: QoS requirements like MTSI requirements for voice/audio and avatars (conversational, RTP), e.g. 5QI 1 for audio. QoE: Immersive voice/audio and visual experience; quality of the capturing and rendering of the avatars, of the audio of the different participants, and of the 3D audio. |
Feasibility |
AR glasses in various form factors exist. Those usually include motion sensors, e.g. based on accelerometers, gyroscopes, and magnetometers; cameras are also common, allowing inside-out tracking and augmentation of the real world. 3D audio capturing and rendering are available, e.g. using spherical or arbitrary microphone arrays for capturing and binaural rendering technologies for audio spatialization. An audio-only solution using headphones and head-tracking would be easier to implement; however, this would remove the possibility of visually augmenting the real world and displaying avatars. |
Potential Standardization Status and Needs |
- Visual coding and transmission of avatars
- Audio coding and transmission of mono objects and 3D audio for streams from all participants |
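To make the geolocation-based placement described above concrete, the following minimal Python sketch computes the great-circle bearing between two GNSS positions and converts it into a head-relative azimuth for the binaural renderer. The coordinate values and the convention that yaw is measured clockwise from north are assumptions, not derived from any specification.

```python
# Hypothetical sketch: peer geolocation -> world bearing -> head-relative rendering azimuth.
import math

def bearing_deg(lat1, lon1, lat2, lon2):
    """Initial great-circle bearing from point 1 to point 2, in degrees clockwise from north."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dlon = math.radians(lon2 - lon1)
    y = math.sin(dlon) * math.cos(phi2)
    x = math.cos(phi1) * math.sin(phi2) - math.sin(phi1) * math.cos(phi2) * math.cos(dlon)
    return math.degrees(math.atan2(y, x)) % 360.0

def rendering_azimuth(own_pos, peer_pos, head_yaw_deg):
    """Azimuth at which to render the peer, relative to the listener's look direction."""
    world_bearing = bearing_deg(*own_pos, *peer_pos)
    return (world_bearing - head_yaw_deg) % 360.0

# Example (assumed coordinates): Bob near Piazza San Marco, Jeff a few streets
# to the north-east, Bob currently looking due east (yaw 90 degrees).
bob = (45.4339, 12.3387)
jeff = (45.4375, 12.3420)
print(round(rendering_azimuth(bob, jeff, head_yaw_deg=90.0), 1))
```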
Use Case Name |
---|
Front-facing camera video multi-party call |
Description |
This use case is based on front-facing camera calls, i.e. a user is having a video call, seeing the other participants on the display of e.g. a smartphone held at arm's length. The use case has some overlap with UC 6 (AR face-to-face calls) and UC 10 (Real-time 3D Communication), extended by spatial audio rendering for headphones/headsets. The spatial audio rendering is based on head-tracker data extracted from the smartphone's front-facing camera, giving the user the impression, even with movements, that the voices of the other participants originate from a virtual stage in the direction of the phone, together with the video of the others' faces. A potential user experience is described as a user story: Bob, Jeff, and Frank are back in New York City and each of them is walking to work. They just have their smartphones with a front-facing camera and a small headset, allowing the real world to be augmented with audio. They start a multi-party video call to discuss the plans for the evening, where each of them gets the other two friends displayed on the phone and can hear the audio, coming from the direction on the horizontal plane in which the phone is held, with some small spread to allow easy distinction. While they walk around in the streets of New York, they have a continuous voice call with their phones at arm's length, with the, potentially cut-out, faces of their pals displayed on their phones. For Bob, the acoustic front is always in the direction of his phone, thus the remote participants are always in front. When Bob rotates his head, though, the front-facing camera tracks this rotation and the spatial audio is binauralized using the head-tracking information, leaving the position of the other participants steady relative to the phone's position (see the sketch after this table). As Bob turns around a corner with the phone still at arm's length for the video call using the front-facing camera, his friends remain steady relative to the phone's position. |
Categorization |
Type: AR
Degrees of Freedom: 3DoF
Delivery: Conversational
Device: Smartphone with front-facing camera, headset, AR glasses |
Preconditions |
Phone with front-facing camera, motion sensors, and headset (more or less acoustically transparent). Motion sensors to compensate movement of the phone, front-facing camera to capture the video for the call and potentially track the head's rotation. |
Requirements and QoS/QoE Considerations |
QoS: QoS requirements like MTSI requirements (conversational, RTP), e.g. 5QI 1 and 2. QoE: Immersive voice/audio and visual experience; quality of the capturing, coding and rendering of the participant video (potentially cut-out faces); quality of the capturing, coding and rendering of the participant audio, including binaural rendering taking head-tracking data into account. |
Feasibility |
Several multi-party video call applications using the front-facing camera exist, e.g. https:// |
Potential Standardization Status and Needs |
- Visual coding and transmission of video recorded by the front-facing camera; potentially cut-out heads, alpha channel coding
- Audio coding and transmission for streams from all participants |
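The following minimal Python sketch illustrates, under assumptions, the head-tracking compensation described in this use case: the remote talkers are anchored to the phone direction with a small angular spread, and the head yaw estimated from the front-facing camera is subtracted before binaural rendering so that the talkers stay steady relative to the phone. The function name, the spread value and the sign convention for yaw are illustrative only.

```python
# Hypothetical sketch: keep the talker stage anchored to the phone under head rotation.
def talker_azimuths(num_talkers, head_yaw_deg, spread_deg=30.0):
    """Head-relative azimuths for the talkers, with 0 degrees = phone direction."""
    if num_talkers == 1:
        nominal = [0.0]
    else:
        step = spread_deg / (num_talkers - 1)
        nominal = [-spread_deg / 2 + i * step for i in range(num_talkers)]
    # Subtract the listener's head yaw so the virtual stage stays with the phone.
    return [(a - head_yaw_deg) % 360.0 for a in nominal]

# Example: two remote participants, listener glances 20 degrees to the left
# (negative yaw assumed to mean a left turn).
print(talker_azimuths(2, head_yaw_deg=-20.0))
```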
Use Case Description:
AR Streaming with Localization Registry
|
---|
A group of friends has arrived at a museum. The museum provides them with an AR guide for the exhibits. The museum's exhibition space has earlier been scanned and registered via one of the museum's AR devices to a Spatial Computing Service. The service allows storing, recalling and updating of the spatial configuration of the exhibition space by a registered AR device (see the registry sketch after this table). Visitors' AR devices (to be used by museum guests as AR guides) have downloaded the spatial configuration upon entering the museum and are ready to use. The group proceeds to the exhibit together with their AR guides, which receive a VoD stream of the museum guide with the identifier Group A. Registered surfaces next to exhibits are used for displaying the video content (which may be 2D or 3D content) of the guide. In the case of spatial audio content, this may be presented in relation to the registered surfaces. The VoD stream playback is synched amongst the users of Group A. Any user within the group can pause, rewind or fast-forward the content, and this affects the playback for all members of the group. Since all users view the content together, this allows them to experience the exhibit as a group and discuss during pauses without affecting the content streams of other museum visitors with whom they physically share the space. Other groups in the museum use the same spatial configuration, but their VoD content is synched within their own group. The use case can be extended to private spaces, e.g., a group of friends gathered at their friend Alice's house to watch a movie. Alice's living room is already registered under her home profile on the Spatial Computing Service; the saved information includes her preferred selection of the living room wall as the movie screening surface. She shares this configuration via the service with her guests. In this use case, the interaction with a travel guide avatar may also occur in a conversational fashion. |
Categorization |
Type: AR and Social AR
Degrees of Freedom: 6DoF
Delivery: Streaming, Interactive, Conversational
Device: AR glasses with binaural audio playback support |
Preconditions |
The use case requires technical solutions for the following functions:
- Spatial Computing Service
|
QoS/QoE Considerations |
|
Feasibility |
The use case is feasible within a timeframe of 3 years. The required hardware, AR glasses, is available in the market, and the network requirements are no greater than those of existing streaming services.
The feasibility of the use case depends on the accuracy of the localization registration and mapping algorithm. Multiparty AR experiences, such as a shared AR map annotation demo from Mapbox (https:// |
Potential Standardization Status and Needs |
The following aspects may require standardization work:
|
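As a rough illustration of the functions the use case asks from a Spatial Computing Service (storing, recalling and updating a spatial configuration of registered surfaces, plus group-synchronised VoD playback), the following Python sketch outlines one possible data model. All class, field and method names are hypothetical; no existing service API is implied.

```python
# Hypothetical sketch of a Spatial Computing Service data model and group playback state.
import time
from dataclasses import dataclass, field

@dataclass
class Surface:
    surface_id: str
    pose: tuple      # position and orientation in the space's coordinate frame
    size_m: tuple    # width, height in metres

@dataclass
class SpatialConfiguration:
    space_id: str
    surfaces: dict = field(default_factory=dict)

    def register_surface(self, surface: Surface):
        # Store or update a registered surface for later recall by visitor devices.
        self.surfaces[surface.surface_id] = surface

@dataclass
class GroupPlayback:
    group_id: str
    position_s: float = 0.0
    paused: bool = True
    updated_at: float = field(default_factory=time.time)

    def seek(self, position_s: float):
        # Any member's pause/rewind/fast-forward updates this shared state,
        # which all devices in the group then follow.
        self.position_s = position_s
        self.updated_at = time.time()

# Example: the museum registers a wall next to an exhibit, and Group A rewinds the guide.
config = SpatialConfiguration("museum-hall-1")
config.register_surface(Surface("wall-07", pose=(1.0, 2.0, 0.0, 0, 0, 0), size_m=(2.0, 1.2)))
group_a = GroupPlayback("Group A")
group_a.seek(42.0)
print(len(config.surfaces), group_a.position_s)
```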
Use Case Description:
Immersive 6DoF Streaming with Social Interaction
|
---|
In an extension to use case 3 in clause 6.4, in which Alice is consuming the game in live mode, Alice is now integrated into social interaction:
|
Categorization |
Type: VR and Social VR
Degrees of Freedom: 3DoF+, 6DoF
Delivery: Streaming, Split, Conversational, Interactive
Device: HMD with a controller |
Preconditions |
|
Requirements and QoS/QoE Considerations |
|
Feasibility |
See use case 3 in clause A.4.
The addition of social aspects can be addressed by apps.
Some discussion on this matter:
|
Potential Standardization Status and Needs |
The following aspects may require standardization work:
|
Use Case Description:
5G Online Gaming party
|
---|
In an extension to use case 5 in Annex A.6 on Online Immersive Gaming experience, the users join a Gaming Party, either physically or virtually, in order to get the best possible and most controlled user experience. There are two setups for the party:
|
Categorization |
Type: VR, AR
Degrees of Freedom: 6DoF
Delivery: Streaming, Interactive, Split, device-to-device
Device: HMD with a Gaming controller, AR glasses |
Preconditions |
|
Requirements and QoS/QoE Considerations |
The requirements are similar to those discussed in use case 6.25. |
Feasibility |
Feasibility follows the previous discussions. However, a 5G Core architecture that provides such functionalities would be needed. In addition, authentication for such "5G parties" is needed. |
Potential Standardization Status and Needs |
The following aspects may require standardization work:
|
Use Case Description:
Shared Spatial Data
|
---|
Consider as an example people moving through Heathrow airport. The environment is supported by spatial map sharing, spatial anchors, and downloading/streaming of location-based digital content. The airport is a huge dynamic environment with thousands of people congregating. Spatial maps and content will change frequently. Whereas base maps have been produced by professional scanners, they are continuously updated and improved by crowd-sourced data. Semi-dynamic landmarks such as a growing tree, a new park bench, or holiday decorations are incorporated into the base map via crowd-sourced data. Based on this, individuals have their own maps, and portions of those maps may be shared with friends nearby. One could imagine spatial content consuming as much bandwidth as permitted, be it a high-resolution volumetric marketing gimmick such as a Concorde virtually landing at Heathrow, or a simple overlay outside a lounge showing the current wait time for getting access.
As people walk through 1 km+ size spaces like the airport, they will progressively download updates and discard map information that is no longer relevant (see the tile-caching sketch after this table). Similar to data flows in Google Maps, smartphones continually send location and 3D positioning data (GPS, WiFi, scans, etc.) to the cloud in order to improve and augment 3D information. AR maps and content will in all likelihood be similarly layered, dynamic, and progressively downloaded. Spatial AR maps will be a mixture of underlying living spatial maps and digital content items.
The use case addresses several scenarios:
|
Categorization |
Type: AR
Degrees of Freedom: 6DoF
Delivery: Streaming, Interactive, Split, device-to-device, different types
Device: HMD, AR Glasses |
Preconditions |
|
Requirements and QoS/QoE Considerations |
5G's low-latency, high-bandwidth capabilities, as compared to 4G's capabilities, make 5G better suited for sending dense spatial data and associated 3D digital assets over a mobile network to XR clients. This data could be transferred as discrete data downloads or streamed, and may be lossy or lossless. Continuous connectivity is important for sharing local information to improve the maps. The underlying AR maps should be accurate and up to date. The content objects should be realistic. The data representation for the AR maps and the content objects should be scalable. |
Feasibility |
|
Potential Standardization Status and Needs |
The following aspects may require standardization work:
|
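The progressive download and discard behaviour described in this use case can be sketched, under assumptions, as a simple tile cache keyed on the user's position: tiles within a radius of the user are fetched, and tiles that fall out of range are dropped. The tile size, the radius and all names below are illustrative only, not part of any spatial map format.

```python
# Hypothetical sketch: position-driven progressive download and discard of spatial-map tiles.
import math

TILE_M = 50.0        # assumed tile edge length in metres
KEEP_RADIUS_M = 200  # assumed radius of tiles kept on the device

def tile_key(x_m, y_m):
    """Map a local metric position to a tile key."""
    return (math.floor(x_m / TILE_M), math.floor(y_m / TILE_M))

def wanted_tiles(x_m, y_m):
    """Set of tile keys within KEEP_RADIUS_M of the user."""
    r = int(KEEP_RADIUS_M // TILE_M)
    cx, cy = tile_key(x_m, y_m)
    return {(cx + dx, cy + dy)
            for dx in range(-r, r + 1)
            for dy in range(-r, r + 1)
            if (dx * dx + dy * dy) * TILE_M * TILE_M <= KEEP_RADIUS_M ** 2}

def update_cache(cache, x_m, y_m, fetch):
    """Fetch missing tiles and discard tiles that are no longer relevant."""
    wanted = wanted_tiles(x_m, y_m)
    for key in wanted - cache.keys():
        cache[key] = fetch(key)          # progressive download
    for key in cache.keys() - wanted:
        del cache[key]                   # discard out-of-range map data
    return cache

# Example: walking 150 m through the terminal with a dummy fetch function.
cache = {}
for step in range(0, 151, 50):
    update_cache(cache, float(step), 0.0, fetch=lambda k: b"tile-bytes")
print(len(cache), "tiles cached")
```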