Image recognition is an area where a rich set of pre-trained AI/ML models is available. The optimum model depends on the features of the input image/video, the environment and the precision requirement. The model used for vision processing at the device needs to be adaptively updated for different vision objects, backgrounds, lighting conditions, purposes (e.g. image recovery vs. classification), and even target compression rates. Although a static model can serve as a default in some cases, adapting the model to different working conditions provides improved recognition accuracy and a better user experience.
An example motivating the selection of the optimum model for different image recognition tasks and environments is given in [20]. As shown in Figure 6.1.1-1, four typical CNN models were evaluated and compared for different image recognition tasks: MobileNet_v1_025 [19], ResNet_v1_50 (ResNet [18] with 50 layers), Inception_v2 [23], and ResNet_v2_152 (ResNet with 152 layers). The example shows that the best model depends on the type of input image and the task requirement. A mobile device that needs to recognize diverse types of images and meet various requirements for different applications therefore needs to switch models adaptively.
If the selected model has not been pre-loaded in the device, the device needs to download it from the network before the image recognition task can start. A model can be reused if it is kept in storage after its previous use. However, because storage is limited, the device cannot retain all potentially needed models. The data rate required for downloading a model depends on the size of the model and the required downloading latency.
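The trade-off between reusing stored models and limited storage can be illustrated with a minimal sketch. It assumes a least-recently-used (LRU) eviction policy, which is only one possible choice and is not specified in this use case. The class name ModelCache, the download callable and the byte-based capacity are hypothetical names introduced for illustration.

```python
from collections import OrderedDict


class ModelCache:
    """Keeps recently used models within a fixed storage budget (LRU eviction).

    Assumption: LRU is an illustrative policy, not one mandated by the use case.
    """

    def __init__(self, capacity_bytes, download):
        self.capacity_bytes = capacity_bytes
        self.download = download        # hypothetical callable: model name -> bytes
        self._models = OrderedDict()    # model name -> model bytes, oldest first
        self._used = 0                  # bytes currently occupied

    def get(self, name):
        if name in self._models:
            # Cache hit: no download needed, mark as most recently used.
            self._models.move_to_end(name)
            return self._models[name]

        # Cache miss: the whole model has to be fetched before it can be used.
        blob = self.download(name)
        if len(blob) > self.capacity_bytes:
            return blob                 # too large to retain; use without storing

        # Evict the least recently used models until the new one fits.
        while self._used + len(blob) > self.capacity_bytes:
            _, evicted = self._models.popitem(last=False)
            self._used -= len(evicted)

        self._models[name] = blob
        self._used += len(blob)
        return blob
```

With such a cache, only a miss triggers a download, so the required DL data rate applies to models that were evicted or never stored.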
As the performance requirements for AI/ML operations increase, the size of the models also keeps increasing, although model compression techniques continue to improve. The sizes of typical DNN models for image recognition are listed in Table 6.1.1-1. A DNN parameter can be expressed in 32 bits for higher inference accuracy. The model size and downloading overhead can be reduced if the size of a parameter is reduced to 8 bits, at the potential cost of image recognition accuracy.
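As an illustration of the size/accuracy trade-off, the following sketch converts 32-bit floating-point weights to 8-bit integers. It assumes simple symmetric linear quantization, which is one common scheme and not necessarily the one behind the figures in Table 6.1.1-1. The function names and the example weight values are hypothetical.

```python
from array import array


def quantize_int8(weights):
    """Map float weights to signed 8-bit integers plus one shared scale factor."""
    max_abs = max((abs(w) for w in weights), default=0.0) or 1.0
    scale = max_abs / 127.0
    # Clamp to [-127, 127] to guard against floating-point rounding at the edges.
    q = array("b", (max(-127, min(127, round(w / scale))) for w in weights))
    return q, scale


def dequantize(q, scale):
    """Recover approximate float weights from the 8-bit representation."""
    return [v * scale for v in q]


weights = [0.5, -1.27, 0.003, 1.0]   # made-up example values
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

size_fp32 = len(weights) * array("f").itemsize   # 4 bytes per parameter on common platforms
size_int8 = len(q) * q.itemsize                  # 1 byte per parameter
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"fp32: {size_fp32} bytes, int8: {size_int8} bytes, max error: {max_error:.4f}")
```

The 8-bit representation is a quarter of the size, while each restored weight can deviate from the original by up to half a quantization step. This deviation is the source of the potential accuracy loss.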
The required model downloading latency depends on how quickly the model needs to be ready at the device, which in turn depends on how well the upcoming application can be predicted. In this use case, it is assumed that the device cannot predict and download the needed model in advance. The downloading of the AI/ML model therefore needs to be finished within seconds or even milliseconds. Unlike a streamed video, which can start playing once a small portion is buffered, a DNN model can only be used after the whole model has been completely downloaded.
For example, if the downloading latency is 1 s, the required DL data rate ranges from 134.4 Mbit/s to 1.92 Gbit/s in the case of 32-bit parameters, as shown in Table 6.1.1-1. In the case of 8-bit parameters, the required DL data rate can be limited to between 33.6 Mbit/s and 1.1 Gbit/s.
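The relation behind these figures is that the whole model has to arrive within the latency budget, i.e. data rate = number of parameters x bits per parameter / downloading latency. A minimal sketch follows. The parameter count of 4.2 million is an assumption back-calculated from the lower bounds quoted above (134.4 Mbit / 32 bit), not a value taken from Table 6.1.1-1, and the function name is hypothetical.

```python
def required_dl_rate_mbps(num_params, bits_per_param, latency_s):
    """Return the DL data rate (Mbit/s) needed to deliver all weights in latency_s."""
    if latency_s <= 0:
        raise ValueError("latency_s must be positive")
    return num_params * bits_per_param / latency_s / 1e6


# Assumption: parameter count implied by the 134.4 Mbit/s lower bound at 32 bits.
SMALL_MODEL_PARAMS = 4_200_000

for bits in (32, 8):
    rate = required_dl_rate_mbps(SMALL_MODEL_PARAMS, bits, latency_s=1.0)
    print(f"{bits}-bit parameters, 1 s latency: {rate:.1f} Mbit/s")
# prints 134.4 Mbit/s for 32-bit and 33.6 Mbit/s for 8-bit parameters
```

The sketch ignores the model topology file and any protocol overhead, so the result is a lower bound on the rate needed in practice.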
A model consists of the model topology and the model weight factors. The topology reflects the structure of the neural network (e.g. the neurons of each layer and the connections between neurons of two neighbouring layers). The size of the model parameters is shown in Table 6.1.1-1. The size of the configuration file of a model (i.e. the model topology) usually does not exceed 1 Mbit. When an application of the UE requests a model, the third-party server sends one model profile, which may consist of two parts: the model topology and the model weight factors. Because the model topology and the model weight factors come from the same server, they are likely to have the same IP tuple. A transmission error in the model topology is far more serious than one in the model weight factors: the UE can hardly run the model if the topology contains an error, whereas the weight factors may tolerate a relatively high transmission error rate owing to model robustness. The transmission of the model topology data is therefore more important and requires higher reliability.
The UE runs an application providing the capability of AI/ML model inference for image recognition.
An AI/ML server manages the AI/ML model pool and is capable of downloading the requested model to the application providing AI/ML based image recognition.
The 5G system has the ability to provide 5G network related information to the AI/ML server.
The objects in the input images or videos are recognized by the AI/ML based image recognition application, and the inference accuracy and latency need to be guaranteed.
The image recognition task can be completed within the available computation and energy resources of the UE.
This use case mainly requires a high data rate together with low latency. The high data rate requirements for the 5G system are listed in clauses 7.1 and 7.6 of TS 22.261. As in Table 7.1-1 of TS 22.261, a 300 Mbit/s DL experienced data rate and a 50 Mbit/s UL experienced data rate are required in the dense urban scenario, and a 1 Gbit/s DL experienced data rate and a 500 Mbit/s UL experienced data rate are required in the indoor hotspot scenario. As in Table 7.6.1-1 of TS 22.261, cloud/edge/split rendering-related data transmission requires a data rate of up to 0.1 Gbit/s with [5-10] ms latency countrywide.
Considering the time taken by the device to finish the image recognition task, only a small portion of the recognition latency budget can be used to download the model. For one-shot object recognition and photo enhancement on a smartphone, a pre-installed sub-optimal model can be employed temporarily while the optimal model is being downloaded. The optimal model should be downloaded within about 1 s and then replaces the pre-installed model. If 8-bit parameters are used for describing the DNN, the required DL data rate ranges from 33.6 Mbit/s to 1.1 Gbit/s. For video recognition, the target can be to update the model within one frame duration, so that the updated model can be used for the next frame. However, for an application with an always-on camera, the device can predict the needed model and start the downloading in advance, so downloading the model within 1 s is acceptable. Similarly, for other applications with an always-on camera, i.e. person identification in a security surveillance system, AR display/gaming, remote driving and remote-controlled robotics, the required DL data rate ranges from 33.6 Mbit/s to 1.1 Gbit/s. It should be noted that the size of the model may be further reduced if more advanced model compression techniques can be adopted.
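The fallback behaviour described above (a pre-installed sub-optimal model serves requests until the optimal model has been fully downloaded and then replaces it) can be sketched as follows. This is an illustration under stated assumptions, not a specified procedure: models are represented as plain callables, and the class name AdaptiveRecognizer and the fetch_optimal callable are hypothetical.

```python
import threading


class AdaptiveRecognizer:
    """Serves requests with a default model until a better one is downloaded."""

    def __init__(self, default_model):
        self._model = default_model      # pre-installed, possibly sub-optimal model
        self._lock = threading.Lock()

    def recognize(self, image):
        # Take a snapshot of the current model so a swap cannot occur mid-call.
        with self._lock:
            model = self._model
        return model(image)

    def upgrade_async(self, fetch_optimal):
        """Download the optimal model in the background, then swap it in.

        fetch_optimal is a hypothetical callable that blocks until the whole
        model has been received and returns it.
        """
        def worker():
            new_model = fetch_optimal()
            with self._lock:
                self._model = new_model  # replace only once fully downloaded

        thread = threading.Thread(target=worker, daemon=True)
        thread.start()
        return thread
```

A caller would start upgrade_async when the optimal model is identified and keep calling recognize in the meantime. Results produced before the swap come from the pre-installed model.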