To provide XR experiences that make the user feel immersed and present, several relevant quality-of-experience factors have been collected (https://xinreality.com/wiki/Presence). Immersion is an objective description of aspects of the system, such as field of view and display resolution. Presence is the feeling of being physically and spatially located in an environment. According to Slater and Usoh [43], "immersion is a necessary rather than a sufficient condition for presence - immersion describes a kind of technology, and presence describes an associated state of consciousness." Presence is divided into two types: Cognitive Presence and Perceptive Presence.
Cognitive Presence is the presence of one's mind. It can be achieved by watching a compelling film or reading an engaging book. Cognitive Presence is important for an immersive experience of any kind.
Perceptive Presence is the presence of one's senses. To accomplish Perceptive Presence, one's senses (sight, sound, touch and smell) have to be tricked: the XR Device has to fool the user's senses, most notably the audio-visual system. XR Devices achieve this through positional tracking based on the user's movement. The goal of the system is to maintain the user's sense of presence and to avoid breaking it.
Perceptive Presence is the objective to be achieved by XR applications and is what is referred to in the following.
In a paper titled "Research on Presence in Virtual Reality: A Survey" [9], the authors quote Matthew Lombard's slightly more scientific definition of presence: "Presence (a shortened version of the term "telepresence") is a psychological state of subjective perception in which even though part or all of an individual's current experience is generated by and/or filtered through human-made technology, part or all of the individual's perception fails to accurately acknowledge the role of the technology in the experience. Except in the most extreme cases, the individual can indicate correctly that s/he is using the technology, but at some level, and to some degree, her/his perceptions overlook that knowledge and objects, events, entities, and environments are perceived as if the technology was not involved in the experience." In other words, feeling like you're really there.
Presence is achieved when the involuntary, reptilian corners of our brains are activated: when the user reaches out to grab the virtual apple, becomes unwilling to step off a plank, or feels nervous when walking on rooftops. According to Teo Choong Ching [10], there are four components relevant for feeling present, namely:
- The Illusion of being in a stable spatial place
- The Illusion of self-embodiment
- The Illusion of physical interaction
- The Illusion of social communication
The most relevant component from the technical perspective, in the context of this Technical Report, is the first one. This part of presence can be broken down into three broad categories, listed in order from most to least important for their impact on creating presence: visual presence, auditory presence, and sensory or haptic presence.
Technical requirements for visual presence have been formulated by Valve's™ R&D team and by Brendan Iribe of Oculus™ (https://www.roadtovr.com/oculus-shares-5-key-ingredients-for-presence-in-virtual-reality/), as well as from experience collected from 3GPP members' product development teams:
- Tracking
  - 6 degrees of freedom tracking: the ability to track the user's head in rotational and translational movements.
  - 360 degrees tracking: tracking of the user's head independent of the direction the user is facing.
  - Sub-centimeter accuracy: tracking accuracy of less than a centimeter.
  - Quarter-degree-accurate rotation tracking.
  - No jitter: no shaking; the image on the display has to stay perfectly still.
  - For room-scale games and experiences, a comfortable tracking volume: a space large enough to move around in and still be tracked, roughly a 2 m cube. For seated games/experiences a smaller tracking volume is sufficient.
  - Tracking needs to be done frequently in order to always operate with the latest XR Viewer Pose. Minimum update rates, as discussed above, are 1000 Hz and beyond; rotational tracking in particular requires a high update frequency.
- Latency
  - Less than 20 ms motion-to-photon latency: less than 20 milliseconds of overall latency from the time the user moves their head to the time the display changes.
  - Minimize the pose-to-render-to-photon time by rendering content as quickly as possible: less than 50 ms for render-to-photon in order to avoid wrongly rendered content. For more details refer to clause 4.2.2.
  - Fuse optical tracking and inertial measurement unit (IMU) data (a sketch of such fusion follows this list).
  - Minimize the loop tracker → CPU → GPU → display → photons (a hypothetical timing budget for this loop also follows the list).
  - Minimize interaction delays and the age of content, depending on the application. For more details see clause 4.2.2.
- Persistence
  - Low persistence: turn pixels on and off every 2 - 3 ms to avoid smearing / motion blur. Pixel persistence is the amount of time per frame that the display is actually lit rather than black; "low persistence" is simply the idea of having the screen lit for only a small fraction of the frame. The reason is that the longer a frame goes on, the less accurate it becomes compared to where the user is currently looking: the brain receives the exact same image for the entire frame even as the head turns, whereas in real life the view would constantly adjust.
  - 90 Hz and beyond display refresh rate to eliminate visible flicker.
- Resolution
  - Spatial resolution: no visible pixel structure, i.e. the user cannot see individual pixels. Low resolution and low pixels per inch (PPI) can cause the user to perceive pixelation and feel like he or she is looking through a screen door.
  - In 2014, it was thought that at least 1k by 1k pixels per eye would be sufficient.
  - However, in theory the fovea needs about 120 pixels per degree of view to match reality, possibly requiring significantly more than 1k by 1k, all the way to 8k (see the worked example after this list).
  - In 2019, it is commonly accepted that 2k by 2k per eye provides acceptable quality. Increasing the horizontal resolution to 4k is considered a next step.
  - According to Palmer Luckey, founder of Oculus® Rift™, pixelation will not go away completely until at least 8K resolution (8192 x 4096) per eye is achieved (https://arstechnica.com/gaming/2013/09/virtual-perfection-why-8k-resolution-per-eye-isnt-enough-for-perfect-vr/).
  - Temporal resolution: according to https://developer.oculus.com/blog/asynchronous-timewarp-examined/, to deliver comfortable, compelling VR that truly generates presence, developers still need to target a sustained frame rate of 90 Hz and beyond, despite the usage of asynchronous timewarp.
- Optics
  - Wide field of view (FOV): the FOV is the extent of the observable world at any given moment; typically 100 - 110 degrees FOV is needed. For details on FOV, see clause 4.2.2 of TR 26.918.
  - Comfortable eyebox: the minimum and maximum eye-lens distance within which a comfortable image can be viewed through the lenses.
  - High-quality calibration and correction: correction for distortion and chromatic aberration that exactly matches the lens characteristics. For details on optics, see clauses 4.2.3 and 4.2.4 of TR 26.918.
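The fusion of optical tracking and IMU data mentioned in the latency list above is commonly realized with a complementary filter: the IMU delivers high-rate but drifting orientation updates, while the optical tracker delivers low-rate but drift-free corrections. The following Python sketch illustrates the idea for a single rotation axis; the function, the sample rates and the blend factor ALPHA are illustrative assumptions, not requirements from this report.

    # Illustrative complementary filter for one rotation axis (e.g. yaw).
    # Assumption: the IMU gyro integrates at ~1000 Hz (fast, but drifts),
    # while the optical tracker corrects at ~60 Hz (slow, but drift-free).
    ALPHA = 0.98  # hypothetical blend factor; close to 1 so the gyro dominates

    def fuse(prev_angle_deg, gyro_rate_dps, dt_s, optical_angle_deg=None):
        """Return the fused orientation estimate for one axis."""
        # High-rate prediction: integrate the gyro rate over the time step.
        predicted = prev_angle_deg + gyro_rate_dps * dt_s
        if optical_angle_deg is None:
            return predicted  # no optical sample in this step
        # Low-rate correction: pull the estimate toward the optical pose.
        return ALPHA * predicted + (1.0 - ALPHA) * optical_angle_deg

    # Example: 1000 Hz IMU loop with an optical fix every 16th step (~60 Hz).
    angle = 0.0
    for step in range(1000):
        optical = 5.0 if step % 16 == 0 else None  # placeholder optical reading
        angle = fuse(angle, gyro_rate_dps=2.0, dt_s=0.001, optical_angle_deg=optical)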
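As a plausibility check of the 20 ms motion-to-photon requirement, the sketch below sums hypothetical per-stage delays along the tracker → CPU → GPU → display → photons loop at a 90 Hz refresh rate. All stage values are assumptions for illustration; real budgets depend on the concrete hardware and software pipeline.

    # Hypothetical motion-to-photon budget (all values in milliseconds).
    REFRESH_HZ = 90.0
    frame_ms = 1000.0 / REFRESH_HZ  # ~11.1 ms per frame at 90 Hz

    budget_ms = {
        "tracker (pose sample + fusion)": 1.0,
        "CPU (simulation + submit)": 3.0,
        "GPU (render)": 5.0,
        "display (average scanout wait)": frame_ms / 2.0,  # ~5.6 ms
        "photons (pixel switching)": 2.0,
    }

    total = sum(budget_ms.values())
    for stage, ms in budget_ms.items():
        print(f"{stage:32s} {ms:5.1f} ms")
    print(f"{'total motion-to-photon':32s} {total:5.1f} ms "
          f"({'within' if total < 20.0 else 'over'} the 20 ms target)")

    # Low persistence: with pixels lit for only 2 - 3 ms of each ~11.1 ms
    # frame, the display duty cycle is roughly 18 % to 27 %.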
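The resolution bullets follow directly from the quoted numbers: about 120 pixels per degree in the fovea over a 100 - 110 degree field of view. The short worked calculation below (a simplification that assumes pixels are spread uniformly across the FOV, which real HMD optics do not do) shows why 2k by 2k per eye remains well below retinal quality and why even 8K per eye is only a lower bound.

    # Pixels per degree (PPD) for the per-eye resolutions discussed above.
    FOV_DEG = 110.0     # wide FOV from the optics bullet
    FOVEAL_PPD = 120.0  # approximate foveal limit quoted in the text

    for name, h_pixels in [("1k (2014 estimate)", 1000),
                           ("2k (2019 consensus)", 2000),
                           ("4k (next step)", 4000),
                           ("8k (Luckey target)", 8192)]:
        ppd = h_pixels / FOV_DEG
        print(f"{name:20s} {ppd:6.1f} PPD "
              f"({ppd / FOVEAL_PPD:6.1%} of the ~120 PPD foveal limit)")

    # Conversely, matching 120 PPD over 110 degrees would require
    # 110 * 120 = 13200 horizontal pixels per eye, i.e. beyond 8k.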
For requirements on auditory presence, refer to TR 26.918 and [11].
For requirements on sensory and haptics presence, refer for example to [11].
The sense of presence is not only important for VR experiences, but equally so for immersive AR experiences. To achieve Presence in Augmented Reality, the seamless integration of virtual content with the physical environment is required. As in VR, the virtual content has to align with the user's expectations. For truly immersive AR, and in particular MR, it is expected that users cannot discern virtual objects from real objects.
Also relevant for VR and AR, but in particular for AR, is awareness not only of the user, as shown in Figure 4.2.1-1, but also of the environment. This includes:
- Safe zone discovery
- Dynamic obstacle warning
- Geometric and semantic environment parsing
- Environmental lighting
- World mapping
For AR, to obtain an enhanced view of the real environment, the user may wear a see-through HMD to see 3D computer-generated objects superimposed on his/her real-world view. This see-through capability can be accomplished using either an optical see-through or a video see-through HMD. Tradeoffs between optical and video see-through HMDs with respect to technological, perceptual, and human-factors issues are discussed, for example, in [18].
Beyond the sense of presence and immersiveness, the age of content and the user interaction delay are of the utmost importance for immersive and non-immersive interactive experiences, i.e. experiences for which the user's interaction with the scene impacts the content of the scene (such as online gaming).
User interaction delay is defined as the time duration between the moment at which a user action is initiated and the time at which this action is taken into account by the content creation engine. In the context of gaming, this is the time between the moment the user interacts with the game and the moment at which the game engine processes the player's input.
Age of content is defined as the time duration between the moment content is created and the time at which it is presented to the user. In the context of gaming, this is the time between the creation of a video frame by the game engine and the time at which the frame is finally presented to the player.
The roundtrip interaction delay is therefore the sum of the Age of Content and the User Interaction Delay. If part of the rendering is done on an XR server and the service produces a frame buffer as the rendering result of the state of the content, then for raster-based split rendering (as defined in clause 6.2.5) in cloud gaming applications, the following processes contribute to such a delay (a numeric sketch of this decomposition follows the list):
- User Interaction Delay
  - capture of the user interaction in the game client,
  - delivery of the user interaction to the game engine, i.e. to the server (a.k.a. network delay),
  - processing of the user interaction by the game engine/server.
- Age of Content
  - creation of one or several video buffers (e.g. one for each eye) by the game engine/server,
  - encoding of the video buffers into a video stream frame,
  - delivery of the video frame to the game client (a.k.a. network delay),
  - decoding of the video frame by the game client,
  - presentation of the video frame to the user (a.k.a. framerate delay).
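The decomposition above can be written down directly: the roundtrip interaction delay is the sum of the user interaction delay stages and the age-of-content stages. The Python sketch below illustrates this with hypothetical per-stage values for a raster-based split-rendering session; the numbers are assumptions for illustration, not measurements.

    # Hypothetical delay decomposition for raster-based split rendering
    # (all values in milliseconds, chosen for illustration only).
    user_interaction_delay = {
        "capture of user interaction": 2.0,
        "uplink network delay": 10.0,
        "game engine processing": 4.0,
    }
    age_of_content = {
        "render video buffers (two eyes)": 8.0,
        "encode video frame": 5.0,
        "downlink network delay": 10.0,
        "decode video frame": 4.0,
        "present frame (framerate delay)": 11.1,
    }

    uid = sum(user_interaction_delay.values())
    aoc = sum(age_of_content.values())
    print(f"user interaction delay: {uid:5.1f} ms")
    print(f"age of content:         {aoc:5.1f} ms")
    print(f"roundtrip interaction:  {uid + aoc:5.1f} ms")  # 54.1 ms here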
For gaming, for example, references [19] and [20] provide interaction delay tolerance thresholds per game type, illustrated in Table 4.2.2-1. Note that this interaction delay refers to the roundtrip interaction delay as defined above.
In [21], the authors set up a 12-player match of Unreal Tournament 2003™ in a controlled environment. Each player is assigned a specific amount of latency and jitter for the duration of the match. After the match, the players answer a questionnaire about their experience in the game. Although the study uses relatively few players, the authors are able to conclude that more than 60 ms of latency noticeably reduces both performance and experience in this game.
In general, it seems that 60 ms [18], or even 45 ms [22], are better estimates of how much latency is acceptable in the most fast-paced games than the traditionally quoted 100 ms value.
In other cases, the latency of the content is determined, for example, by conversational delay thresholds; typically, around 200 ms of latency is acceptable.
Overall, different applications and use cases impose different delay requirements, and this should be considered.
The following four categories are considered with respect to roundtrip interaction delay (a simple threshold mapping is sketched after the list):
- Ultra-Low-Latency applications: roundtrip interaction delay threshold of at most 50 ms.
- Low-Latency applications: roundtrip interaction delay threshold of at most 100 ms.
- Moderate-latency applications: roundtrip interaction delay threshold of at most 200 ms.
- Non-critical-latency applications: roundtrip interaction delay threshold higher than 200 ms.
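For completeness, these four categories can be expressed as a simple threshold mapping over the measured roundtrip interaction delay, as in the following sketch.

    def latency_category(roundtrip_ms: float) -> str:
        """Map a roundtrip interaction delay to the categories defined above."""
        if roundtrip_ms <= 50:
            return "Ultra-Low-Latency"
        if roundtrip_ms <= 100:
            return "Low-Latency"
        if roundtrip_ms <= 200:
            return "Moderate-latency"
        return "Non-critical-latency"

    # Example: the 54.1 ms roundtrip from the split-rendering sketch above
    # falls into the Low-Latency category.
    assert latency_category(54.1) == "Low-Latency"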