This clause provides further details of a client reference architecture for VR streaming applications and describes its components and interfaces.
Figure 4.3-1 and Figure 4.3-2 show the high-level structure of the client reference architecture for VR DASH streaming and VR local playback, respectively. The architecture consists of five functional components:
-	VR Application: The VR application controls the rendering depending on the user viewport or display capabilities. The application may communicate with other functional components, e.g. the access engine or the file decoder. The access engine or file decoder may pass some abstracted control information to the VR application, and the application decides which adaptation sets or preselections to select, or which tracks to choose, taking into account platform or user information as well as the dynamic pose information.
-	Access Engine: The access engine connects through a 3GPP bearer and provides a conforming VR presentation to the receiver. The access engine fetches the Media Presentation Description (MPD), constructs and issues requests, and receives Segments or parts of Segments. In the case of local playback, the 3GPP VR Track is accessed from local storage instead. The access engine may interface with the VR application to dynamically change the delivery session. The access engine provides a conforming 3GPP VR Track to the file decoder (a sketch of this behaviour is given after this list).
-	File Decoder: The file decoder processes the 3GPP VR Track to generate signals that can be processed by the renderer. The file decoder typically includes at least two sub-modules: the file parser and the media decoder. The file parser processes the file or segments, extracts elementary streams, and parses the metadata, if present. The processing may be supported by dynamic information provided by the VR application, for example which tracks to choose based on static and dynamic configurations. The media decoder decodes the media streams of the selected tracks into decoded signals. The file decoder outputs the decoded signals and the metadata used for rendering. The file decoder is the primary focus of the present document; a sketch of the parser/decoder split is given after this list.
-	VR Renderer: The VR renderer uses the decoded signals and the rendering metadata to provide a viewport presentation, taking into account the viewport and possibly other information. Based on the pose and the horizontal/vertical field of view of the screen of a head-mounted display or any other display device, a user viewport is determined so that the appropriate part of the decoded video and audio signals is rendered. The renderer is addressed in the individual media profiles. For video, textures from the decoded signals are projected onto the sphere using the rendering metadata received from the file decoder: during this texture-to-sphere mapping, a sample of the decoded signal is remapped to a position on the sphere (an illustrative sketch for the equirectangular projection is given after this list). Likewise, the decoded audio signals are represented in the reference system domain. The appropriate part of the video and audio signals for the current pose is generated by synchronizing and spatially aligning the rendered video and audio.
-	Sensor: The sensor extracts the current pose according to the user's movement and provides it to the renderer for viewport generation. The current pose may, for example, be determined by head tracking and possibly also eye tracking. The current pose may also be used by the VR application to control the access engine on which adaptation sets or preselections to select (for the streaming case), or to control the file decoder on which tracks to choose for decoding (for the local playback case); a sketch of this pose handling is given after this list.
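The following Python fragment is a minimal, non-normative sketch of the access engine behaviour in the DASH streaming case. The class and method names, the reduction of MPD handling to plain XML parsing, and the use of full-Segment HTTP requests are illustrative assumptions of the sketch and are not defined by the present document.

```python
# Minimal, non-normative sketch of an access engine for the DASH streaming case.
# Class/method names and the simplified MPD handling are illustrative assumptions.
import urllib.request
import xml.etree.ElementTree as ET


class AccessEngine:
    """Fetches the MPD, issues Segment requests and forwards the received
    Segments of a conforming 3GPP VR Track to the file decoder."""

    def __init__(self, mpd_url: str):
        self.mpd_url = mpd_url
        self.mpd = None  # parsed MPD document

    def fetch_mpd(self) -> None:
        # Fetch and parse the Media Presentation Description (MPD).
        with urllib.request.urlopen(self.mpd_url) as response:
            self.mpd = ET.fromstring(response.read())

    def fetch_segment(self, segment_url: str) -> bytes:
        # Construct and issue a Segment request; parts of Segments could be
        # requested analogously via HTTP byte ranges ("Range" header).
        with urllib.request.urlopen(segment_url) as response:
            return response.read()
```

For local playback, fetch_segment() would simply be replaced by reading the 3GPP VR Track from local storage.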
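The structural split of the file decoder into a file parser and a media decoder can be sketched as follows. The actual file format parsing and codec-specific decoding are only indicated by placeholders, and all names are illustrative assumptions.

```python
# Non-normative sketch of the file decoder structure: a file parser that extracts
# elementary streams and metadata, and a media decoder that produces decoded
# signals. Parsing and decoding themselves are out of scope of this sketch.
from dataclasses import dataclass, field
from typing import List


@dataclass
class ParsedTrack:
    track_id: int
    elementary_stream: bytes  # extracted elementary stream of the track
    rendering_metadata: dict = field(default_factory=dict)  # e.g. projection metadata


class FileParser:
    def parse(self, segment: bytes, selected_tracks: List[int]) -> List[ParsedTrack]:
        # Process the file or segments, extract the elementary streams of the
        # selected tracks and parse the metadata, if present.
        raise NotImplementedError("file format parsing is not shown in this sketch")


class MediaDecoder:
    def decode(self, track: ParsedTrack):
        # Decode the elementary stream into decoded signals (e.g. texture frames
        # or audio samples) that can be processed by the renderer.
        raise NotImplementedError("codec-specific decoding is not shown in this sketch")


class FileDecoder:
    def __init__(self):
        self.parser = FileParser()
        self.decoder = MediaDecoder()

    def process(self, segment: bytes, selected_tracks: List[int]):
        # Output the decoded signals together with the rendering metadata so that
        # the renderer can perform the texture-to-sphere mapping.
        for track in self.parser.parse(segment, selected_tracks):
            yield self.decoder.decode(track), track.rendering_metadata
```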
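As one example of the texture-to-sphere mapping performed by the renderer, the sketch below maps a sample of an equirectangular projected picture to a unit vector on the sphere. The azimuth/elevation convention used here (angles in degrees, X axis towards the front, Y to the left, Z up) follows the commonly used OMAF-style conversion and is an assumption of this sketch.

```python
# Non-normative sketch of the texture-to-sphere mapping for the equirectangular
# projection: a sample of the decoded picture is remapped to a position on the
# unit sphere of the VR reference system.
import math


def sample_to_sphere(i: int, j: int, width: int, height: int):
    """Map sample (i, j) of an equirectangular projected picture of size
    width x height to a unit vector (x, y, z) on the sphere."""
    # Normalised sample position -> azimuth (phi) and elevation (theta) in degrees.
    phi = (0.5 - (i + 0.5) / width) * 360.0
    theta = (0.5 - (j + 0.5) / height) * 180.0

    # Angles -> unit vector (X towards the front, Y to the left, Z up).
    phi_rad, theta_rad = math.radians(phi), math.radians(theta)
    x = math.cos(theta_rad) * math.cos(phi_rad)
    y = math.cos(theta_rad) * math.sin(phi_rad)
    z = math.sin(theta_rad)
    return x, y, z


# Example: a sample near the picture centre maps approximately to the front
# direction (1, 0, 0).
print(sample_to_sphere(1919, 959, 3840, 1920))
```

Other projections would use a different sample-location-to-angle conversion, as indicated by the rendering metadata provided by the file decoder.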
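Finally, the sketch below illustrates how the current pose delivered by the sensor could be turned into a viewport and used by the VR application for viewport-dependent selection. The Pose and Viewport structures and the simplified coverage test are illustrative assumptions, not definitions made by the present document.

```python
# Non-normative sketch of pose handling: the sensor provides the current pose,
# the renderer derives a viewport from it, and the VR application may use it to
# steer adaptation set/preselection or track selection.
import math
from dataclasses import dataclass


@dataclass
class Pose:
    azimuth: float    # yaw of the viewing direction, degrees
    elevation: float  # pitch of the viewing direction, degrees


@dataclass
class Viewport:
    centre_azimuth: float
    centre_elevation: float
    hor_fov: float  # horizontal field of view of the display, degrees
    ver_fov: float  # vertical field of view of the display, degrees


def viewport_from_pose(pose: Pose, hor_fov: float = 90.0, ver_fov: float = 90.0) -> Viewport:
    # The viewport is centred on the current viewing direction and sized by the
    # horizontal/vertical field of view of the display device.
    return Viewport(pose.azimuth, pose.elevation, hor_fov, ver_fov)


def covers(viewport: Viewport, azimuth: float, elevation: float) -> bool:
    # Simplified test whether a sphere direction (e.g. the centre of a sphere
    # region associated with an adaptation set) falls within the viewport.
    d_az = (azimuth - viewport.centre_azimuth + 180.0) % 360.0 - 180.0
    d_el = elevation - viewport.centre_elevation
    return abs(d_az) <= viewport.hor_fov / 2 and abs(d_el) <= viewport.ver_fov / 2


# The VR application could, for instance, prefer adaptation sets whose sphere
# region centre is covered by the current viewport.
print(covers(viewport_from_pose(Pose(10.0, 0.0)), azimuth=30.0, elevation=5.0))  # True
```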
The main objective of the present document is to enable the file decoder to generate the decoded signals and the rendering metadata from a conforming 3GPP VR Track by generating a bitstream that conforms to a 3GPP Operation Point. Both a 3GPP VR Track and a bitstream conforming to an Operation Point are well-defined conformance points, for the VR file decoder and the media decoder respectively. Both enable the contained media to be represented in the VR reference system, spatially and temporally.