Virtual humans (or digital representations of humans, also referred to as 'avatars' in this use case) are simulations of human beings on computers [47]. They have a wide range of applications, such as games, film and TV production, the financial industry (smart advisers), telecommunications (avatars), etc.
In the coming era, virtual human technology is one of the foundations of mobile metaverse services. A virtual human can be a digital representation of a natural person in a mobile metaverse service, driven by that natural person. Alternatively, a virtual human can be the digital representation of a digital assistant driven by an AI model.
Mobile metaverse services offer an important opportunity for socialization and entertainment, where the user's experiences of the virtual world and the real world combine. This use case focuses on the scenario of a natural person's digital embodiment in a metaverse as a location-agnostic service experience. A virtual human is customized according to a user's personal characteristics and shape preferences. Users wear motion capture devices, vibrating backpacks, haptic gloves and VR glasses to drive the virtual human through semi-open exploration of a metaverse space. The devices mentioned above are 5G UEs, which need to collaborate with each other to reproduce the user's actions and provide real-time feedback.
(Source: https://vr.baidu.com/product/xirang, https://en.wikipedia.org/wiki/Virtual_humans)
For a smooth experience, the motion-to-photon latency should be less than 20 ms [48]; that is, the delay between the moment a player makes a movement and the moment the corresponding new video is shown in the VR glasses and the corresponding tactile feedback is produced by the vibrating backpack or haptic gloves should be less than 20 ms. As the asynchrony between different modalities increases, the user experience degrades, because users are able to detect the asynchrony. Therefore, synchronisation among the audio, visual and tactile modalities is also very important. The synchronisation thresholds between the audio, visual and tactile modalities measured by Hirsh and Sherrick are listed below [49]; the obtained results vary depending on the kind of stimuli, biasing effects of the stimulus range, the psychometric methods employed, etc. A schematic check of these thresholds is sketched after the list.
audio-tactile stimuli: 12 ms when the audio comes first and 25 ms when the tactile comes first, for the stimuli to be perceived as synchronous.
visual-tactile stimuli: 30 ms when the video comes first and 20 ms when the tactile comes first, for the stimuli to be perceived as synchronous.
audio-visual stimuli: 20 ms when the audio comes first and 20 ms when the video comes first, for the stimuli to be perceived as synchronous.
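The following sketch is purely illustrative and not part of any specification; the function and table names are assumptions. It encodes the Hirsh and Sherrick thresholds above as an asymmetric tolerance table in Python:

    # (leading modality, trailing modality) -> maximum lead in ms that is
    # still perceived as synchronous (values from the list above).
    SYNC_THRESHOLDS_MS = {
        ("audio", "tactile"): 12,
        ("tactile", "audio"): 25,
        ("video", "tactile"): 30,
        ("tactile", "video"): 20,
        ("audio", "video"): 20,
        ("video", "audio"): 20,
    }

    def is_perceived_synchronous(first: str, second: str, lead_ms: float) -> bool:
        """True if 'first' arriving lead_ms before 'second' stays within
        the perceptual synchrony threshold for that modality pair."""
        return lead_ms <= SYNC_THRESHOLDS_MS[(first, second)]

    # Example: video rendered 25 ms before the matching haptic event.
    assert is_perceived_synchronous("video", "tactile", 25)      # within 30 ms
    assert not is_perceived_synchronous("tactile", "video", 25)  # exceeds 20 ms

Note that the thresholds are asymmetric: which modality leads determines the tolerance.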
Alice's virtual human exists as a digital representation in a mobile metaverse service. Alice's virtual human wants to explore a newly opened area, including both the natural environment and the cultural environment. The equipment Alice wears is all connected to the 5G network. The mobile metaverse service interacts with the 5G network to communicate its QoS requirements. The network applies the QoS policy pre-agreed between the mobile metaverse service provider and the operator, appropriate to each mobile metaverse media data flow.
Alice's virtual human digital representation exists as part of the mobile metaverse service.
Alice's virtual human digital representation can interact with other virtual humans. These could correspond to virtual humans representing other players or to machine-generated virtual humans. Interactions could include a handshake, shopping, visiting an exhibition together, etc.
When someone or something touches Alice's virtual human (e.g. Alice's virtual human's hand or back touches some virtual object or another virtual human in the mobile metaverse service), Alice can see the object and feel the temperature and weight of the object at the same time. For example, when a virtual leaf falls on the hand of Alice's virtual human, Alice should see the leaf fall on her hand and sense the presence of the leaf at the same time. This means that the tactile impression from the haptic gloves should come within 30 ms after the video in the VR glasses, if the video media precedes the tactile media; conversely, the video in the VR glasses should come within 20 ms after the tactile impression presented by the haptic gloves, if the tactile media precedes the video media.
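Reusing the illustrative SYNC_THRESHOLDS_MS table above, a receiving application could in principle hold back the earlier modality so that the residual skew stays within the pair's threshold; the helper below is a hypothetical sketch, not a specified mechanism:

    def playout_delay_ms(first: str, second: str, measured_lead_ms: float) -> float:
        """Extra delay to apply to 'first' so that its lead over 'second'
        does not exceed the perceptual threshold."""
        excess = measured_lead_ms - SYNC_THRESHOLDS_MS[(first, second)]
        return max(0.0, excess)

    # Leaf example: the video is measured 42 ms ahead of the haptic event,
    # so the frame is held back 12 ms to land within the 30 ms threshold.
    print(playout_delay_ms("video", "tactile", 42))  # -> 12.0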
3GPP TS 22.261 specifies KPIs for high data rate and low latency interactive services, including Cloud/Edge/Split Rendering, Gaming or Interactive Data Exchanging, and Consumption of VR content via tethered VR headset, as well as audio-video synchronization thresholds.
Support of audio-video synchronization thresholds has been captured in TS 22.261:
Due to the separate handling of the audio and video component, the 5G system will have to cater for the VR audio-video synchronisation in order to avoid having a negative impact on the user experience (i.e. viewers detecting lack of synchronization). To support VR environments, the 5G system shall support audio-video synchronisation thresholds:
in the range of [125 ms to 5 ms] for audio delayed and
in the range of [45 ms to 5 ms] for audio advanced.
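For illustration only (the helper below is an assumption, not text from TS 22.261), these asymmetric bounds can be expressed as a single check on the signed audio skew, here using the outer bracketed values:

    def av_sync_ok(audio_skew_ms: float) -> bool:
        """Signed skew: positive means audio delayed relative to video,
        negative means audio advanced. Bounds taken from the bracketed
        TS 22.261 thresholds quoted above."""
        return -45.0 <= audio_skew_ms <= 125.0

    assert av_sync_ok(100.0)      # audio 100 ms late: within [125 ms]
    assert not av_sync_ok(-60.0)  # audio 60 ms early: exceeds [45 ms]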
Existing synchronization requirements in the current SA1 specifications only cover data transmission by a single UE. Existing specifications do not contain requirements for coordinating the synchronized transmission of data packets across multiple UEs. A schematic sketch of such multi-flow coordination is given after the potential requirements below.
The 5G system shall provide a mechanism to support coordination and synchronization of multiple data flows transmitted via one UE or different UEs, e.g. subject to synchronization thresholds provided by a 3rd party.
[PR 5.12.6-2] The 5G system shall provide means to achieve low end-to-end round-trip latency (e.g., [20ms]).
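As a purely illustrative sketch, under the assumption that correlated media units across UEs can be matched to one another (e.g. via a common timestamp), the coordination above could amount to computing per-flow playout offsets whenever the spread of arrivals exceeds a third-party-provided threshold; all names below are hypothetical:

    from dataclasses import dataclass

    @dataclass
    class Flow:
        ue_id: str          # UE carrying the media flow (glasses, gloves, ...)
        modality: str       # "video", "audio", "tactile", ...
        arrival_ms: float   # measured arrival time of a correlated media unit

    def playout_offsets(flows: list[Flow], threshold_ms: float) -> dict:
        """Delay earlier flows to the latest arrival when the spread
        exceeds the third-party-provided synchronization threshold."""
        latest = max(f.arrival_ms for f in flows)
        earliest = min(f.arrival_ms for f in flows)
        if latest - earliest <= threshold_ms:
            return {(f.ue_id, f.modality): 0.0 for f in flows}
        return {(f.ue_id, f.modality): latest - f.arrival_ms for f in flows}

    flows = [
        Flow("vr-glasses", "video", arrival_ms=100.0),
        Flow("vr-glasses", "audio", arrival_ms=110.0),
        Flow("haptic-gloves", "tactile", arrival_ms=135.0),
    ]
    print(playout_offsets(flows, threshold_ms=20.0))
    # spread is 35 ms > 20 ms, so video is delayed 35 ms and audio 25 ms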