AI/ML-based speech processing has been widely used in applications on mobile devices (e.g. smartphones, personal assistants, language translators), including automatic speech recognition (ASR), voice translation and speech synthesis. Speech recognition for dictation, search, and voice commands has become a standard feature on smartphones and wearable devices.
Service requirements for ASR have been addressed in [29]. Traditional ASR systems are based on the hidden Markov model (HMM) and the Gaussian mixture model (GMM). However, HMM-GMM systems suffer from a relatively high word error rate (WER) in the presence of environmental noise. Although some enhancements were developed, including "feature enhancement" (which attempts to remove the corrupting noise from the observations prior to recognition) and "model adaptation" (which leaves the observations unchanged and instead updates the model parameters of the recognizer to be more representative of the observed speech), the traditional models can hardly fulfil the requirements of commercial applications. Acoustic models based on deep neural networks (DNN) have remarkable noise robustness [25]-[26], and have been widely used in ASR applications on mobile devices.
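As a purely illustrative instance of the "feature enhancement" idea above, the following minimal sketch applies magnitude spectral subtraction to a noisy waveform before it would be passed to a recognizer. The function name, framing parameters and spectral floor are assumptions, not anything defined in this use case.

```python
import numpy as np

def spectral_subtraction(noisy, noise_only, frame_len=512, hop=256, floor=0.02):
    """Subtract an averaged noise magnitude spectrum from the noisy speech (illustrative only)."""
    noisy = np.asarray(noisy, dtype=float)
    noise_only = np.asarray(noise_only, dtype=float)
    window = np.hanning(frame_len)

    # Estimate the noise magnitude spectrum from a noise-only segment.
    noise_mags = [np.abs(np.fft.rfft(noise_only[i:i + frame_len] * window))
                  for i in range(0, len(noise_only) - frame_len + 1, hop)]
    noise_mag = np.mean(noise_mags, axis=0)

    # Overlap-add re-synthesis of the enhanced signal.
    enhanced = np.zeros(len(noisy))
    for i in range(0, len(noisy) - frame_len + 1, hop):
        spec = np.fft.rfft(noisy[i:i + frame_len] * window)
        mag, phase = np.abs(spec), np.angle(spec)
        # Remove the noise estimate; the spectral floor limits "musical noise" artefacts.
        clean_mag = np.maximum(mag - noise_mag, floor * mag)
        enhanced[i:i + frame_len] += np.fft.irfft(clean_mag * np.exp(1j * phase), frame_len)
    return enhanced
```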
Nowadays, most ASR applications on smartphones operate on cloud servers. The end device uploads the speech to the cloud server and then downloads the decoded result back to the device. However, cloud-based speech recognition potentially introduces higher latency (not only due to the 4G/5G network latency, but also the internet latency), and the reliability of the network connection and privacy issues need to be considered.
An embedded speech recognition system running on a mobile device is more reliable and can have lower latency. Currently, some ASR apps switch from cloud-based model inference to offline model inference when the uplink coverage of the mobile user turns weak, e.g. when entering a basement or an elevator. However, the ASR models for cloud servers are too complex for the computation and storage resources of mobile devices. The size of an ML-based ASR model running on a cloud server has rapidly increased in recent years, from ~1 GByte to ~10 GByte, which cannot be run on a mobile device. Due to this restriction, only simple ASR applications, e.g. wake-word detection, can be implemented on smartphones. Realizing more complex ASR applications, e.g. large vocabulary continuous speech recognition (LVCSR), is still a challenging area for an offline speech recognizer.
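The cloud/offline switch described above could look roughly like the following sketch. The threshold value and the two placeholder recognizers are assumptions for illustration only, not part of any specification.

```python
UPLINK_THRESHOLD_KBPS = 100  # assumed minimum uplink rate for cloud-based ASR

def cloud_asr(audio: bytes) -> str:
    """Placeholder for the cloud recognizer: upload the speech, download the decoded text."""
    raise ConnectionError("no uplink")  # stand-in; a real client would call the server

def on_device_asr(audio: bytes) -> str:
    """Placeholder for the compact offline recognizer running on the UE."""
    return "<offline transcript>"

def transcribe(audio: bytes, uplink_rate_kbps: float) -> str:
    """Prefer cloud-based inference when coverage allows; otherwise fall back to the offline model."""
    if uplink_rate_kbps >= UPLINK_THRESHOLD_KBPS:
        try:
            return cloud_asr(audio)
        except ConnectionError:
            pass  # e.g. the user enters a basement or an elevator
    return on_device_asr(audio)
```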
In 2019, a state-of-the-art offline LVCSR recognizer for Android mobile devices was announced. The streaming end-to-end recognizer is based on the recurrent neural network transducer (RNN-T) model [27]-[28]. It was stated that, by employing various improvements and optimizations, the memory footprint can be dramatically reduced and the computation can be sped up; the model can be compressed to 80 MB. However, when the ASR model is compressed to fit a mobile device, robustness to the various types of background noise has to be sacrificed. When the noise environment changes, the model needs to be re-selected, and if the model is not kept on the device, it needs to be downloaded from the cloud/edge server of the AI/ML model owner via the 5G network.
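A minimal sketch of this model re-selection and download step is given below. The model catalogue, its URLs and the on-device noise classifier are hypothetical names introduced only for illustration; none of them are defined by this use case.

```python
import urllib.request

# Hypothetical catalogue of noise-specific ASR models hosted by the AI/ML server.
MODEL_CATALOGUE = {
    "quiet":  "https://aiml-server.example.com/models/asr_quiet.bin",
    "street": "https://aiml-server.example.com/models/asr_street.bin",
    "cafe":   "https://aiml-server.example.com/models/asr_cafe.bin",
}
local_models = {}  # models already cached on the UE

def classify_noise(audio: bytes) -> str:
    """Placeholder on-device classifier for the current noise environment."""
    return "street"

def select_asr_model(audio: bytes) -> bytes:
    """Return the ASR model matching the detected noise environment, downloading it if needed."""
    env = classify_noise(audio)
    if env not in local_models:
        # Model not kept on the device: fetch it from the AI/ML server via the 5G network.
        with urllib.request.urlopen(MODEL_CATALOGUE[env]) as response:
            local_models[env] = response.read()
    return local_models[env]
```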
The UE runs an application providing the capability of AI/ML model inference for speech recognition.
An AI/ML server manages the AI/ML model pool and is capable of downloading the requested model to the application providing AI/ML-based speech recognition (see the sketch below).
The 5G system has the ability to provide 5G network related information to the AI/ML server.
The content of the input speech is recognized by the AI/ML-based speech recognition application, and the inference accuracy and latency need to be guaranteed.
The speech recognition task can be completed within the available computation and energy resources of the UE.
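A hypothetical sketch combining the pre-conditions above: the AI/ML server keeps a pool of ASR models of different sizes and uses 5G network information provided by the 5G system (here, the UE's expected DL data rate) to pick a model variant that can be delivered within the latency budget. All names and figures below are assumptions introduced for illustration.

```python
MODEL_POOL_MBYTE = {"full": 80, "medium": 40, "compact": 20}  # assumed model sizes in MByte

def pick_model(dl_rate_mbps: float, latency_budget_s: float = 1.0) -> str:
    """Return the largest model in the pool that can be downloaded within the latency budget."""
    for name, size_mbyte in sorted(MODEL_POOL_MBYTE.items(),
                                   key=lambda item: item[1], reverse=True):
        if size_mbyte * 8 / dl_rate_mbps <= latency_budget_s:
            return name
    return "compact"  # fall back to the smallest model in the pool

print(pick_model(300))   # 300 Mbps DL (dense urban)    -> "compact" (20 MByte fits in 1 s)
print(pick_model(1000))  # 1 Gbps DL (indoor hotspot)   -> "full" (80 MByte fits in 1 s)
```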
This use case mainly requires a high data rate together with low latency. The high data rate requirements on the 5G system are listed in Clauses 7.1 and 7.6 of TS 22.261. As in Table 7.1-1 of TS 22.261, a 300 Mbps DL experienced data rate and a 50 Mbps UL experienced data rate are required in the dense urban scenario, and a 1 Gbps DL experienced data rate and a 500 Mbps UL experienced data rate are required in the indoor hotspot scenario. As in Table 7.6.1-1 of TS 22.261, cloud/edge/split rendering-related data transmission requires up to a 0.1 Gbps data rate with [5-10] ms latency countrywide.
An ASR model can only be used once the whole model has been completely downloaded, and the device needs to adopt a noise-robust ASR model that adapts to the changing noise environment. In general, the device's microphone is switched on only when the speech recognition application is triggered. The device needs to identify the noise environment and download the corresponding ASR model with low latency (on the order of 1 s). The sizes of some typical ASR models are listed in Table 6.3.6-1 [27][47] and are used to derive the required data rate.
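As an illustrative derivation of the required DL data rate (the 80 MByte figure is the compressed RNN-T model size quoted earlier, and the 1 s budget is the download latency targeted in this clause):

```python
def required_rate_mbps(model_size_mbyte: float, download_time_s: float) -> float:
    """Data rate (Mbit/s) needed to download a model of the given size within the time budget."""
    return model_size_mbyte * 8 / download_time_s

print(required_rate_mbps(80, 1.0))  # -> 640.0 Mbit/s for an 80 MByte model within 1 s
```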