The AR Runtime is device-resident software or firmware that implements a set of APIs providing access to the underlying AR/MR hardware. These APIs are referred to as AR Runtime APIs. An AR Runtime typically provides the following functions:
- System capability discovery: allows applications to discover the capabilities of the AR glasses.
- Session management: manages an AR session and its state.
- Input and Haptics: receives information about the user's actions, e.g. through usage of trackpads, and passes that information to the application. On request by the application, it may trigger haptic feedback using the AR glasses and associated hardware.
- Rendering: synchronizes the display and renders the composited frame onto the AR glasses' displays.
- Spatial Computing: processes sensor data to generate information about the world 3D space surrounding the AR user. Spatial computing includes functions such as:
- Tracking, to estimate the movement of the AR device at a high frequency.
- Relocalization, to estimate the pose of the AR device at initialization, when tracking is lost, or periodically to correct the drift of the tracking.
- Mapping, for reconstructing the surrounding space, for example through triangulation of identified points. This reconstruction may be sparse for localization purposes, or dense for visualization.
- A combination of tracking, mapping and relocalization functions, for example through Simultaneous Localization and Mapping (SLAM), to build a map of the environment and establish the position of users and objects within that environment.
- Semantic perception: processes the captured information into semantic concepts, typically using some form of Artificial Intelligence (AI) and/or Machine Learning (ML). Examples include object or user activity segmentation, recognition, and classification.
Spatial computing functions typically involve data exchange and therefore require support from the network architecture.
Clause 4.2.5 provides more details on XR Spatial computing.
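As an informal illustration of the function groups listed above, the following sketch shows the kind of API surface an AR Runtime may expose towards an application. All interface and member names are hypothetical and invented for this example; they do not correspond to OpenXR or WebXR identifiers.

```typescript
// Illustrative sketch only: a hypothetical AR Runtime API surface covering the
// function groups listed above. All names are invented for this example.

interface SystemCapabilities {
  displayRefreshRateHz: number;
  supportsHandTracking: boolean;
  supportsEyeTracking: boolean;
  supportsPlaneDetection: boolean;
}

type SessionState = "idle" | "ready" | "running" | "stopping";

interface Pose {
  position: [number, number, number];             // metres, runtime world space
  orientation: [number, number, number, number];  // quaternion (x, y, z, w)
}

interface InputEvent {
  source: "trackpad" | "controller" | "hand";
  action: string;   // e.g. "select", "squeeze"
  pose?: Pose;      // pose of the input source, if tracked
}

interface ArRuntime {
  // System capability discovery
  getSystemCapabilities(): Promise<SystemCapabilities>;

  // Session management
  createSession(): Promise<void>;
  getSessionState(): SessionState;
  endSession(): Promise<void>;

  // Input and haptics
  pollInputEvents(): InputEvent[];
  triggerHaptics(durationMs: number, amplitude: number): void;

  // Rendering: the application submits a composited frame for display
  submitFrame(frameTexture: unknown, renderPose: Pose): void;

  // Spatial computing: tracking / relocalization results exposed as poses
  getDevicePose(predictedDisplayTimeMs: number): Pose | null;
}
```

In a typical frame loop, the application would query getDevicePose() for the predicted display time, render its views accordingly and submit the result through submitFrame().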
AR runtimes are usually extensible, so that support can be added for the wide range of AR glasses and controllers that are on the market or that might be released in the future. This allows different vendors to add custom functionality such as gaze tracking, hand control, new reference spaces, etc.
Two key representative and standardized AR runtimes are the Khronos-defined OpenXR [4] and the W3C-defined WebXR [5]. More details are provided in clause 4.6.4.
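As a concrete illustration of session management, input handling and rendering through a standardized AR Runtime API, the following TypeScript sketch uses the W3C WebXR Device API. It assumes a browser environment with WebXR type declarations available (e.g. @types/webxr) and a WebGL context created as XR-compatible; support for "immersive-ar" sessions depends on the browser and the device.

```typescript
// Minimal WebXR sketch (browser environment): request an immersive AR session,
// obtain a reference space and run a per-frame pose/render loop.
// Support for "immersive-ar" depends on the browser and the AR device.
async function runWebXr(gl: WebGLRenderingContext): Promise<void> {
  if (!navigator.xr || !(await navigator.xr.isSessionSupported("immersive-ar"))) {
    console.warn("immersive-ar sessions are not supported on this device/browser");
    return;
  }

  // Session management: create the session and bind a WebGL layer to it.
  // The WebGL context is assumed to have been created XR-compatible.
  const session = await navigator.xr.requestSession("immersive-ar");
  await session.updateRenderState({ baseLayer: new XRWebGLLayer(session, gl) });
  const refSpace = await session.requestReferenceSpace("local");

  // Input: selection events from controllers, hands, etc.
  session.addEventListener("select", (ev: XRInputSourceEvent) => {
    console.log("select from input source:", ev.inputSource.targetRayMode);
  });

  // Rendering loop: the runtime supplies the viewer pose for each frame.
  const onFrame = (_time: number, frame: XRFrame) => {
    const pose = frame.getViewerPose(refSpace);
    if (pose) {
      for (const view of pose.views) {
        // Render the scene for this view using view.transform and
        // view.projectionMatrix; actual drawing is omitted in this sketch.
      }
    }
    session.requestAnimationFrame(onFrame);
  };
  session.requestAnimationFrame(onFrame);
}
```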
A Scene Manager is a software component that is able to process a scene description and render the corresponding 3D scene. The Scene Manager parses a scene description document to create a scene graph representation of the scene. For each node of the scene graph, it adds the associated media components for correct rendering of the corresponding object.
To render the scene, the Scene Manager typically uses a Graphics Engine that may be accessed through well-specified APIs such as those defined by Vulkan, OpenGL, Metal, DirectX, etc. Spatial audio is also handled by the Scene Manager based on a description of the audio scene. Other media types may be added as well.
The Scene Manager is able to understand the capabilities of the underlying hardware and system, to appropriately adjust the scene complexity and rendering process. The Scene Manager may, for instance, delegate some of the rendering tasks to an edge or remote server. As an example, the Scene Manager may only be capable of rendering a flattened 3D scene that has a single node with depth and colour information. The light computation, animations, and flattening of the scene may be delegated to an edge server.
Clause 4.6.5 provides a description of glTF 2.0 and the extensions defined by the MPEG-I Scene Description, which may serve as a reference format for the entry point to a Scene Manager.
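The following is a minimal sketch of how a Scene Manager might turn a glTF 2.0-style node hierarchy into a scene graph and traverse it for rendering. It covers only the scene/node/mesh structure; materials, accessors, animations and the MPEG-I Scene Description extensions are omitted, and the drawMesh callback stands in for calls into the Graphics Engine.

```typescript
// Minimal sketch: parse a glTF 2.0-style node hierarchy into a scene graph and
// traverse it for rendering. drawMesh() is a placeholder for the graphics engine.

interface GltfDocument {
  scene?: number;
  scenes: { nodes: number[] }[];
  nodes: { name?: string; children?: number[]; mesh?: number }[];
}

interface SceneGraphNode {
  name: string;
  mesh?: number;                 // index into the glTF meshes array
  children: SceneGraphNode[];
}

function buildSceneGraph(doc: GltfDocument): SceneGraphNode[] {
  const build = (index: number): SceneGraphNode => {
    const n = doc.nodes[index];
    return {
      name: n.name ?? `node_${index}`,
      mesh: n.mesh,
      children: (n.children ?? []).map(build),
    };
  };
  const rootScene = doc.scenes[doc.scene ?? 0];
  return rootScene.nodes.map(build);
}

function renderGraph(nodes: SceneGraphNode[], drawMesh: (mesh: number) => void): void {
  for (const node of nodes) {
    if (node.mesh !== undefined) {
      drawMesh(node.mesh);       // delegate actual drawing to the graphics engine
    }
    renderGraph(node.children, drawMesh);
  }
}
```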
XR Spatial computing summarizes the functions that process sensor data to generate information about the world 3D space surrounding the AR user. It includes functions such as SLAM for spatial mapping (creating a map of the surrounding area) and localization (establishing the position of users and objects within that space), 3D reconstruction, and semantic perception. This requires accurately localizing the AR device worn by the end user in relation to a spatial coordinate system of the real-world space. Vision-based localization systems reconstruct a sparse spatial map of the real-world space in parallel (e.g. SLAM). Beyond localization within a world coordinate system based on a sparse spatial map, dense spatial mapping of objects is additionally essential in order to place 3D objects on real surfaces; it also provides the ability to occlude objects behind surfaces, perform physics-based interactions based on surface properties, provide navigation functions, or provide a visualization of the surface. Finally, to understand and perceive the scene semantically, machine learning and/or artificial intelligence may be used to provide context for the observed scene. The output of spatial computing is spatial mapping information that is organized in a data structure called the XR Spatial Description, used for storing and exchanging this information. Further details on XR Spatial Description formats are provided in clause 4.4.7.
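For illustration only, the sketch below shows one possible, simplified shape for XR Spatial Description data: a sparse map (keyframes and triangulated landmarks) used for localization, plus dense surfaces optionally carrying semantic labels. These type and field names are invented for this example and are not part of the formats discussed in clause 4.4.7.

```typescript
// Illustrative data structure only: a hypothetical, simplified shape for XR
// Spatial Description data. None of these field names are normative.

interface Keyframe {
  id: string;
  pose: { position: [number, number, number]; orientation: [number, number, number, number] };
  keypoints: { u: number; v: number; descriptor: number[] }[]; // 2D features + descriptors
}

interface SparseMap {
  coordinateSystemId: string;    // real-world anchored coordinate system
  keyframes: Keyframe[];
  landmarks: { id: string; position: [number, number, number] }[]; // triangulated 3D points
}

interface DenseSurface {
  id: string;
  vertices: number[];            // flattened x, y, z triples
  indices: number[];             // triangle indices
  semanticLabel?: string;        // e.g. "floor", "table", from semantic perception
}

interface XrSpatialDescription {
  sparseMap: SparseMap;          // used for (re)localization
  denseSurfaces: DenseSurface[]; // used for occlusion, physics, placement
}
```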
XR Spatial Compute processes may be carried out entirely on the AR device. However, it may be beneficial or necessary to use cloud or edge resources to support spatial computing functions. At least two primary scenarios may be differentiated:
- Spatial computing is done on the AR device, but an XR Spatial Description server is used for storage and retrieval of the XR Spatial Description.
- At least parts of the spatial compute functions are offloaded to an XR Spatial Compute server.
Both cases are discussed further in the following.
Typically, the device stores a certain amount of XR Spatial Description locally on the device. However, in particular to create a world-experience, the AR device may not be able to store all information related to the XR Spatial Description on the device; hence, such information may be continuously updated by downloading or streaming updated information from an XR Spatial Description server, as shown in Figure 4.2.5-1. In addition, the device may use personalized storage in the cloud to offload device-generated XR Spatial information components. This may, for example, include so-called key frames, i.e. frames that are useful to provide accurate spatial mapping with relevant key points.
The architecture in Figure 4.2.5-1 relates to the case where XR Spatial computing is done standalone on the device; hence, it is referred to as the STAR architecture in the context of XR Spatial computing.
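A minimal sketch of the STAR case is given below, assuming a hypothetical HTTP interface to the XR Spatial Description server and reusing the XrSpatialDescription and Keyframe types from the sketch above; the server URL, endpoint paths and query parameters are invented for illustration.

```typescript
// Sketch of the STAR case: on-device spatial computing, with an XR Spatial
// Description server used for storage and retrieval. XrSpatialDescription and
// Keyframe are the hypothetical types from the previous sketch; the server URL
// and endpoints below are invented for illustration.
const SPATIAL_SERVER = "https://spatial-description.example.com"; // hypothetical

// Download the XR Spatial Description around the current position so that
// on-device spatial computing can relocalize and extend its local map.
async function fetchSpatialDescription(
  position: [number, number, number],
  radiusMetres: number,
): Promise<XrSpatialDescription> {
  const url =
    `${SPATIAL_SERVER}/spatial-description` +
    `?x=${position[0]}&y=${position[1]}&z=${position[2]}&radius=${radiusMetres}`;
  const response = await fetch(url);
  return (await response.json()) as XrSpatialDescription;
}

// Offload device-generated key frames to personalized storage in the cloud.
async function uploadKeyframes(userId: string, keyframes: Keyframe[]): Promise<void> {
  await fetch(`${SPATIAL_SERVER}/users/${userId}/keyframes`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(keyframes),
  });
}
```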
If the device is limited in processing power, or if complex XR compute functionalities need to be carried out, the XR compute function on the device may be assisted by, or even depend on, compute resources in the network, for example at the edge or in the cloud.
Figure 4.2.5-2 provides a basic architecture for the case where XR Spatial computing is delegated partially or completely to an XR Spatial Compute edge server:
- The device sends sensor data or pre-processed sensor data (e.g. captured frames or visual features extracted from such frames) to the XR Spatial Compute server.
- The XR Spatial Compute server carries out supporting functions to extract relevant information and directly returns XR Spatial Compute-related AR Runtime data (according to the AR Runtime API), e.g. pose information, or pre-computed XR Spatial information that is used by a lightweight XR Spatial Compute function on the device to create the AR Runtime data. Pre-computed XR Spatial information may, for example, be a dense map of segmented objects for visualization, labels or IDs of recognized objects, 2D contours of recognized objects to highlight them, or labels of the recognized user activity.
- The XR Spatial Compute edge server may further fetch the XR Spatial Description from the XR Spatial Description server and perform spatial computing based on the device sensor data.
The architecture in Figure 4.2.5-2 relates to the case where XR Spatial computing depends on an edge function; hence, it is referred to as the EDGAR architecture in the context of XR Spatial computing.
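A minimal sketch of the EDGAR case follows, assuming a hypothetical XR Spatial Compute edge server reachable over HTTP: the device extracts visual features locally, posts them to the server and receives pose information and recognition results in return. The endpoint and message shapes are invented for illustration.

```typescript
// Sketch of the EDGAR case: the device sends pre-processed sensor data (visual
// features) to a hypothetical XR Spatial Compute edge server and receives pose
// and recognition results, which the lightweight on-device XR Spatial Compute
// function turns into AR Runtime data via the AR Runtime API.

interface FeatureFrame {
  timestampUs: number;
  cameraIntrinsics: { fx: number; fy: number; cx: number; cy: number };
  features: { u: number; v: number; descriptor: number[] }[];
}

interface ComputeResult {
  pose?: { position: [number, number, number]; orientation: [number, number, number, number] };
  recognizedObjects?: { label: string; contour2d: [number, number][] }[];
}

const COMPUTE_SERVER = "https://xr-compute.example.com"; // hypothetical edge server

async function requestSpatialCompute(frame: FeatureFrame): Promise<ComputeResult> {
  const response = await fetch(`${COMPUTE_SERVER}/compute`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(frame),
  });
  return (await response.json()) as ComputeResult;
}
```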
The Media Access Function supports the AR UE in accessing and streaming media. For this purpose, a Media Access Function as shown in Figure 4.2.6-1 includes:
- Codecs: used to compress and decompress the rich media. In several cases, multiple codec instances per media type are needed rather than a single one.
- Content Delivery Protocol: container format and protocol to deliver media content between the UE and the network according to the requirements of the application. This includes timing, synchronization, reliability, reporting, and other features.
- 5G connectivity: a modem and 5G System functionalities that allow the UE to connect to a 5G network and access the features and services offered by the 5G System.
- Media Session Handler: a generic function on the device to set up 5G System capabilities. This may include setting up edge functionalities, providing QoS support, supporting reporting, etc.
- Content protection and decryption: this function handles the protection of content from being played on unauthorized devices.
Functions are needed in both uplink and downlink, depending on use cases and scenarios.
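The sketch below groups the components listed above into one hypothetical object model for a Media Access Function. The interface names and method signatures are invented for this example and do not correspond to the APIs defined in TS 26.501 or TS 26.512.

```typescript
// Illustrative composition only: a hypothetical object model for a Media Access
// Function. Interface names are invented and not tied to TS 26.501 / TS 26.512.

interface Codec {
  mediaType: "video" | "audio" | "geometry";
  decode(encoded: ArrayBuffer): Promise<ArrayBuffer>;
  encode(raw: ArrayBuffer): Promise<ArrayBuffer>;
}

interface ContentDeliveryProtocol {
  open(uri: string): Promise<void>;                   // e.g. segment- or RTP-based delivery
  receiveSegment(): Promise<ArrayBuffer>;             // downlink
  sendSegment(segment: ArrayBuffer): Promise<void>;   // uplink
}

interface MediaSessionHandler {
  setupSession(qosRequirements: { maxLatencyMs: number; minBitrateKbps: number }): Promise<void>;
  reportMetrics(metrics: Record<string, number>): void;
}

interface ContentProtection {
  decrypt(protectedData: ArrayBuffer): Promise<ArrayBuffer>;
}

interface MediaAccessFunction {
  codecs: Codec[];                    // possibly multiple instances per media type
  delivery: ContentDeliveryProtocol;
  sessionHandler: MediaSessionHandler;
  contentProtection?: ContentProtection;
}
```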
Examples of Media Access Functions are:
- A 5GMSd client that includes a Media Session Handler and a Media Player as defined in TS 26.501 and TS 26.512.
- A 5GMSu client that includes a Media Session Handler and a Media Streamer as defined in TS 26.501 and TS 26.512.
- A real-time communication client that supports uplink, downlink, or both, for more latency-critical communication services.
- A combination of the above, based on the needs of the XR application. An XR scene may have a mix of static, streaming, and real-time media that requires the usage of multiple transport channels and protocol stacks.
In all cases, the basic functions of the Media Session Handler and of a delivery client (which includes content delivery protocols and codecs) are expected to be maintained. The Media Session Handler is a generic function to support 5G System integration.
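As a sketch of such a combination, the example below sets up one streaming-oriented and one real-time-oriented Media Access Function instantiation for a single XR scene, reusing the hypothetical MediaAccessFunction type from the previous sketch; the URIs and QoS values are illustrative placeholders only.

```typescript
// Sketch: combining two Media Access Function instantiations for one XR scene,
// reusing the hypothetical MediaAccessFunction type from the previous sketch.
async function setupMediaForScene(
  streamingClient: MediaAccessFunction, // e.g. a downlink streaming path with relaxed latency
  realTimeClient: MediaAccessFunction,  // e.g. a latency-critical real-time communication path
): Promise<void> {
  // Streaming media, e.g. pre-encoded scene assets.
  await streamingClient.sessionHandler.setupSession({ maxLatencyMs: 500, minBitrateKbps: 5000 });
  await streamingClient.delivery.open("https://media.example.com/scene/assets"); // hypothetical URI

  // Real-time media, e.g. split-rendered frames or conversational audio.
  await realTimeClient.sessionHandler.setupSession({ maxLatencyMs: 50, minBitrateKbps: 10000 });
  await realTimeClient.delivery.open("https://rtc.example.com/session/abc");      // hypothetical URI
}
```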
As a subject of this report, the need to support different types of instantiations for codecs, delivery protocols, session handling, and so on is identified. Not all components are necessarily required for all scenarios and, furthermore, not all functions may be available on all device types.