AI/ML-based mobile applications are increasingly computation-intensive, memory-consuming and power-consuming, while end devices usually have stringent energy, compute and memory limitations that prevent running a complete offline AI/ML inference on-board. Many AI/ML applications, e.g. image recognition, therefore intend to offload the inference processing from mobile devices to internet datacenters (IDCs). For example, photos shot by a smartphone are often processed by a cloud AI/ML server before being shown to the user who shot them. However, cloud-based AI/ML inference tasks need to take into account the computation pressure at the IDCs, the required data rate and latency, and privacy protection requirements.
Images and videos constitute the largest share of data on today's Internet; video accounts for over 70% of daily Internet traffic
[4]. Convolutional Neural Network (CNN) models have been widely used for image/video recognition tasks on mobile devices, e.g. image classification, image segmentation, object localization and detection, face authentication, action recognition, enhanced photography, VR/AR and video games. Meanwhile, CNN model inference requires intensive computation and storage resources. For example,
AlexNet [7],
VGG-16 [8] and
GoogleNet [9] require 724M, 15.5G and 1.43G MACs (multiply-accumulate operations) respectively for a typical image classification task.
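As an illustration of how such MAC counts arise, the multiply-accumulate count of a convolutional layer follows directly from its dimensions. A minimal sketch, where the layer parameters are the commonly published AlexNet conv1 configuration and are quoted here as an illustrative assumption:

```python
def conv_macs(k, c_in, c_out, h_out, w_out):
    """MACs for one convolutional layer: each of the
    c_out*h_out*w_out output elements needs k*k*c_in
    multiply-accumulate operations."""
    return k * k * c_in * c_out * h_out * w_out

# Assumed AlexNet conv1 configuration: 11x11 kernel, 3 input
# channels, 96 output channels, 55x55 output feature map.
macs = conv_macs(11, 3, 96, 55, 55)
print(f"{macs / 1e6:.0f}M MACs")  # → 105M MACs for conv1 alone
```

Summing such per-layer counts over all layers yields the whole-model figures quoted above.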
Many references
[10]-
[14] have shown that AI/ML inference for image processing with device-network synergy can alleviate the pressure on device computation, memory footprint, storage, power and required data rate, reduce end-to-end latency and energy consumption, and improve end-to-end accuracy, efficiency and privacy compared with executing the complete model on either side alone. The scheme of split AI/ML image recognition is depicted in
Figure 5.1.1-1. The CNN is split into two parts according to the current image recognition task and environment. The intention is to offload the computation-intensive, energy-intensive parts to the network server, while leaving the privacy-sensitive and delay-sensitive parts at the end device. The device executes the inference up to a specific CNN layer and sends the intermediate data to the network server, which runs through the remaining CNN layers. While the model is being developed or updated, the split AI/ML operation is based on the legacy model.
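The split described above can be sketched as follows; this is an illustrative toy, not a 3GPP-defined procedure, and the layer function and transport between the endpoints are placeholders:

```python
# Minimal sketch of split CNN inference: the UE runs layers
# [0, split_point), sends the intermediate activation uplink,
# and the network server runs the remaining layers.
def layer(x, i):
    # Stand-in for the i-th CNN layer's computation.
    return [v * 0.5 + i for v in x]

LAYERS = list(range(8))  # an assumed 8-layer model

def device_part(x, split_point):
    """UE side: execute layers up to the split point and return
    the intermediate data to be transmitted."""
    for i in LAYERS[:split_point]:
        x = layer(x, i)
    return x

def server_part(intermediate, split_point):
    """Network side: run through the remaining layers."""
    x = intermediate
    for i in LAYERS[split_point:]:
        x = layer(x, i)
    return x

# Splitting must not change the end-to-end result.
x = [1.0, 2.0, 3.0]
full = server_part(device_part(x, 0), 0)   # everything on the server
split = server_part(device_part(x, 5), 5)  # split after layer 5
assert full == split
```

The equivalence check at the end reflects the key property of split inference: only where the computation runs changes, not the result.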
Owing to the characteristics of some algorithms in the model training phase, a trained model has a certain degree of robustness [xx-xy]. Therefore, even if errors occur in the intermediate data transmission, the model can tolerate them and still guarantee the accuracy of the inference results. Since the inference result needs to be forwarded to the UE, however, the reliability of the inference result transmission needs to be guaranteed.
The split AI/ML image recognition algorithms can be analyzed based on the computation and data characteristics of the layers in the CNN. As shown in
Figure 5.1.1-2 and
Figure 5.1.1-3 (based on figures adopted from
[13]), the intermediate data size transferred from one CNN layer to the next depends on the location of the split point. Hence, the required UL data rate is related to the model split point and the frame rate for the image recognition, as also observed by
[13]-
[14]. For example, assuming images from a video stream with 30 frames per second (FPS) need to be classified, the required UL data rate for different split points ranges from 4.8 to 65 Mbit/s (listed in
Table 5.1.1-1). These results are based on 227×227 input images; images with a higher resolution would require correspondingly higher data rates.
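The relationship between split point, frame rate and required UL data rate can be sketched as below. The activation shapes are the commonly published AlexNet figures and the 2-byte quantization is an assumption; the values in Table 5.1.1-1 additionally depend on the actual quantization and any feature compression applied.

```python
def ul_rate_mbps(elements, bytes_per_element, fps):
    """Required uplink data rate (Mbit/s) for streaming one
    intermediate activation per frame at the given frame rate."""
    return elements * bytes_per_element * 8 * fps / 1e6

# Assumed AlexNet activation shapes at two candidate split points.
pool5 = 6 * 6 * 256    # late split point: small activation
conv1 = 55 * 55 * 96   # early split point: large activation

print(ul_rate_mbps(pool5, 2, 30))  # ≈ 4.4 Mbit/s at 30 FPS
print(ul_rate_mbps(conv1, 2, 30))  # ≈ 139.4 Mbit/s at 30 FPS
```

The sketch reproduces the qualitative trend in the figures: late split points compress the uplink payload by orders of magnitude compared with early ones.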
VGG-16 is another widely-used CNN model for image recognition. Still assuming images from a video stream with 30 FPS need to be classified, the required UL data rate for different split points ranges from 24 to 720 Mbit/s (listed in
Table 5.1.1-2).
The involved AI/ML endpoints (e.g. UE, AI/ML cloud/edge server) run applications providing the capability of AI/ML model inference for image recognition, and support the split AI/ML image recognition operation.
The 5G system has the ability to provide 5G network related information to the AI/ML server.
- The AI/ML based image recognition application is requested by the user to start recognizing the image/video shot by the UE.
- Under the determined split mode and split point, the AI/ML based image recognition application in an involved AI/ML endpoint executes the allocated part of the AI/ML model, and sends the intermediate data to the next endpoint in the AI/ML pipeline.
- After all the involved AI/ML endpoints finish the co-inference, the image recognition results are fed to the user application consuming them.
- The AI/ML based image recognition applications in the endpoints perform the split image recognition until the image recognition task is terminated.
Redo Steps 3) and 4) for split mode/point re-selection/switching if needed, to adapt to changing conditions.
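The split mode/point re-selection above can be sketched as a search over candidate split points that minimizes estimated end-to-end latency under current conditions, in the spirit of the partition-selection approach of [13]. All latency and size numbers below are illustrative placeholders, not measured values:

```python
def best_split(device_ms, server_ms, data_bits, ul_mbps):
    """Choose the split point s (layers [0, s) on the UE) that
    minimizes: device compute + uplink transfer + server compute.
    device_ms[i] / server_ms[i]: per-layer latencies in ms;
    data_bits[s]: uplink payload when splitting at s
    (data_bits[0] is the raw input image)."""
    n = len(device_ms)

    def latency(s):
        tx_ms = data_bits[s] / (ul_mbps * 1e3)  # Mbit/s -> bits/ms
        return sum(device_ms[:s]) + tx_ms + sum(server_ms[s:])

    return min(range(n + 1), key=latency)

# Illustrative 4-layer example (placeholder numbers).
device_ms = [5.0, 8.0, 6.0, 4.0]          # per-layer latency on the UE
server_ms = [0.5, 0.8, 0.6, 0.4]          # per-layer latency on the server
data_bits = [1.2e6, 8e5, 3e5, 1e5, 2e4]   # uplink payload per split point
print(best_split(device_ms, server_ms, data_bits, ul_mbps=50))  # → 2
```

Re-running such a search whenever the measured uplink rate or endpoint load changes realizes the re-selection/switching step.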
The objects in the input images or videos are recognized and the recognition accuracy and latency need to be guaranteed.
The image recognition task can be completed within the available computation and energy resources of the UE, and the computation, communication and energy resources consumed across the AI/ML endpoints are optimized.
This use case mainly requires high data rate together with low latency. The high data rate requirements to 5G system are listed in
Clause 7.1 and
7.6 of
TS 22.261. As in
Table 7.1-1 of
TS 22.261, a 300 Mbit/s DL experienced data rate and a 50 Mbit/s UL experienced data rate are required in the dense urban scenario, and a 1 Gbit/s DL experienced data rate and a 500 Mbit/s UL experienced data rate are required in the indoor hotspot scenario. As in
Table 7.6.1-1 of
TS 22.261, cloud/edge/split rendering-related data transmission requires up to 0.1 Gbit/s data rate with [5-10] ms latency countrywide.
The
"image recognition latency" can be defined as the latency from when the image is captured to when the recognition results are output to the user application; this was not specifically addressed in
[4] and [15]. Following the principle of analyzing the latency and data rate requirements of split image recognition introduced in
clause 5.1.1, the image recognition latency depends on the user application the recognition results serve.
Computer vision and image recognition have been widely used in many important mobile applications such as object recognition, photo enhancement, intelligent video surveillance, mobile AR, remote-controlled automotive, industrial control and robotics. Image recognition is usually one step in the application's processing pipeline, and the recognition latency is part of the end-to-end latency, as depicted in
Figure 5.1.6-1.
For example, if the image recognition results are only used for object recognition, e.g. unknown-object recognition for a smartphone user or criminal database search for intelligent security, it is acceptable for the image recognition to finish within seconds. If the image recognition result is used as an input to another time-sensitive application, e.g. AR display/gaming, remote-controlled automotive, industrial control or robotics, a much more stringent latency requirement applies. Based on the end-to-end latency requirements of the applications, the image recognition latency requirement can be derived, as listed in
Table 5.1.6.1-1.
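How a recognition latency requirement is derived from an application's end-to-end requirement can be illustrated with simple budget arithmetic. The numbers below are illustrative assumptions, not values from the table:

```python
def recognition_latency_budget(e2e_ms, other_pipeline_ms):
    """Image recognition latency budget = application end-to-end
    latency requirement minus the time spent in the other
    pipeline stages (capture, pre-processing, rendering, ...)."""
    budget = e2e_ms - other_pipeline_ms
    assert budget > 0, "pipeline leaves no time for recognition"
    return budget

# Assumed example: an AR display loop with a 100 ms end-to-end
# target spending 60 ms on capture and rendering leaves a 40 ms
# budget for the split image recognition itself.
print(recognition_latency_budget(100, 60))  # → 40
```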