Content for TR 22.874 Word version: 18.2.0


6 AI/ML model/data distribution and sharing over 5G system

6.1 AI/ML model distribution for image recognition

6.1.1 Description

Image recognition is an area where a rich set of pre-trained AI/ML models is available. The optimum model depends on the features of the input image/video, the environment and the precision requirement. The model used for vision processing at the device needs to be adaptively updated for different vision objects, backgrounds, lighting conditions, purposes (e.g. image recovery vs. classification), and even target compression rates. Although a static model can serve as a default in some cases, adapting the model to different working conditions provides improved recognition accuracy and a better user experience.

An example motivating the selection of the optimum model for different image recognition tasks and environments was given in [20]. As shown in Figure 6.1.1-1, four typical CNN models were evaluated and compared for different image recognition tasks, i.e. MobileNet_v1_025 [19], ResNet_v1_50 (ResNet [18] with 50 layers), Inception_v2 [23], and ResNet_v2_152 (ResNet with 152 layers). This example shows that the best model depends on the type of input image and the task requirement. For a mobile device which needs to recognize diverse types of images and meet various requirements for different applications, the model needs to be adaptively switched.

Figure 6.1.1-1: Example of selecting the optimum model for different image recognition tasks/environments (Figure adopted from [20])

If the selected model has not been pre-loaded in the device, the device needs to download it from the network before the image recognition task can start. A model can be reused if it is kept in storage after its previous use. However, due to limited storage resources, the device cannot retain all potentially useful models in storage. The data rate for downloading the needed models depends on the size of the model and the required downloading latency.
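The reuse-versus-download trade-off described above can be illustrated with a small cache. This is only a sketch: the least-recently-used eviction policy, the `ModelCache` class, the 100 MByte capacity and the `download` callback are assumptions made for illustration and are not specified by this use case. The model sizes are the 8-bit sizes from Table 6.1.1-1.

```python
from collections import OrderedDict


class ModelCache:
    """Hypothetical on-device model store with a fixed storage budget (LRU eviction assumed)."""

    def __init__(self, capacity_mbyte):
        self.capacity_mbyte = capacity_mbyte
        self._models = OrderedDict()  # model name -> size in MByte, oldest first

    def used_mbyte(self):
        return sum(self._models.values())

    def get(self, name, size_mbyte, download):
        """Return True if the model was reused from storage, False if it had to be downloaded."""
        if name in self._models:
            self._models.move_to_end(name)  # mark as most recently used
            return True
        download(name)  # hypothetical callback that fetches the model from the network
        # Evict least recently used models until the new model fits.
        while self._models and self.used_mbyte() + size_mbyte > self.capacity_mbyte:
            self._models.popitem(last=False)
        if size_mbyte <= self.capacity_mbyte:
            self._models[name] = size_mbyte
        return False


downloads = []
cache = ModelCache(capacity_mbyte=100)
cache.get("ResNet-50", 25, downloads.append)     # not stored yet: downloaded
cache.get("AlexNet", 60, downloads.append)       # not stored yet: downloaded
cache.get("ResNet-50", 25, downloads.append)     # reused from storage
cache.get("Inception-V3", 23, downloads.append)  # downloaded; AlexNet is evicted to make room
stored = list(cache._models)
```

With these inputs only three downloads occur, and a later request for AlexNet would require a new download because it no longer fits in storage.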

As the performance requirements for AI/ML operations increase, the size of the models also keeps increasing, although model compression techniques continue to improve. The sizes of typical DNN models for image recognition are listed in Table 6.1.1-1. A DNN parameter can be expressed in 32 bits for higher inference accuracy. The model size and downloading overhead can be reduced if the size of a parameter is reduced to 8 bits, at the potential cost of image recognition accuracy.

The required model downloading latency depends on how fast the model needs to be ready at the device. It is affected by the extent to which the upcoming application can be predicted. In this use case, we assume the device cannot predict and download the needed model in advance. In this case, the downloading of the AI/ML model needs to be finished in seconds or even in milliseconds. Unlike a streamed video, which can be played once a small portion is buffered, a DNN model can only be used after the whole model has been completely downloaded.

For example, if the downloading latency is 1 s, the required DL data rate ranges from 134.4 Mbit/s to 4.42 Gbit/s in the case of 32-bit parameters, as shown in Table 6.1.1-1. In the case of 8-bit parameters, the required DL data rate is reduced to between 33.6 Mbit/s and 1.1 Gbit/s.

A model consists of a model topology and model weight factors. The topology reflects the structure of a neural network (e.g. the neurons of each layer and the connections of neurons between two neighboring layers). The size of the model parameters is shown in Table 6.1.1-1. The size of the configuration file of a model (i.e. the model topology) usually does not exceed 1 Mbit. When an application of the UE requests a model, the third-party server sends one model profile which may consist of two parts, i.e. the model topology and the model weight factors. Because the model topology and the model weight factors come from the same server, they probably have the same IP tuple. Transmission errors in the model topology matter much more than errors in the model weight factors: the UE can hardly run the model if the model topology contains an error, whereas the weight factors may tolerate a high transmission error rate due to model robustness. Therefore, the transmission of model topology data is more important and requires higher reliability.
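The arithmetic behind these figures can be reproduced with a short sketch. It assumes, as the table appears to, that 1 MByte equals 8 Mbit and that the whole model must arrive within the downloading latency; the function names are illustrative only.

```python
# Parameter counts (in millions) taken from Table 6.1.1-1.
PARAMS_MILLION = {
    "AlexNet": 60,
    "VGG16": 138,
    "ResNet-152": 60,
    "ResNet-50": 25,
    "GoogleNet": 6.8,
    "Inception-V3": 23,
    "1.0 MobileNet-224": 4.2,
}


def model_size_mbyte(params_million, bits_per_param):
    """Model size in MByte: parameters x bits per parameter / 8 bits per byte."""
    return params_million * bits_per_param / 8


def required_dl_rate_mbit_s(params_million, bits_per_param, latency_s=1.0):
    """DL data rate in Mbit/s needed to deliver the full model within latency_s."""
    return params_million * bits_per_param / latency_s


rates_32bit = {name: required_dl_rate_mbit_s(p, 32) for name, p in PARAMS_MILLION.items()}
rates_8bit = {name: required_dl_rate_mbit_s(p, 8) for name, p in PARAMS_MILLION.items()}

for name in PARAMS_MILLION:
    print(f"{name}: {rates_32bit[name]:.1f} Mbit/s (32-bit), {rates_8bit[name]:.1f} Mbit/s (8-bit)")
```

Changing `latency_s` shows how quickly the required rate grows when the model has to be ready in less than 1 s.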
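A rough sketch of such a two-part model profile is given below. The `ModelPart` type and `build_model_profile` helper are hypothetical; the 1 Mbit topology size is the upper bound stated above, and the reliability targets are the values proposed in [P.R.6.1-004] of this clause.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelPart:
    """One part of a model profile (hypothetical structure for illustration)."""
    name: str
    size_mbit: float
    min_reliability: float  # target transmission reliability as a fraction


def build_model_profile(params_million, bits_per_param, topology_mbit=1.0):
    """Split a model into topology and weight factors with separate reliability targets."""
    return [
        ModelPart("topology", topology_mbit, 0.99999),                     # from [P.R.6.1-004]
        ModelPart("weights", params_million * bits_per_param, 0.999),      # from [P.R.6.1-004]
    ]


profile = build_model_profile(25, 8)  # ResNet-50 with 8-bit parameters
most_critical = max(profile, key=lambda part: part.min_reliability)
weights_share = profile[1].size_mbit / sum(part.size_mbit for part in profile)
```

In this example the part that needs the highest reliability (the topology) is under 1 % of the transferred data, which suggests that treating the two parts differently costs little in terms of volume.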

Table 6.1.1-1: Sizes of typical image-recognition models and required DL data rates for downloading in 1s

DNN model for image recognition	Number of parameters (million)	Model size, 32 bits per parameter (MByte)	Required DL data rate, 32 bits per parameter (Mbit/s)	Model size, 8 bits per parameter (MByte)	Required DL data rate, 8 bits per parameter (Mbit/s)
AlexNet [7]	60	240	1920	60	480
VGG16 [8]	138	552	4416	138	1104
ResNet-152 [18]	60	240	1920	60	480
ResNet-50 [18]	25	100	800	25	200
GoogleNet [9]	6.8	27.2	217.6	6.8	54.4
Inception-V3 [23]	23	92	736	23	184
1.0 MobileNet-224 [19]	4.2	16.8	134.4	4.2	33.6

6.1.2 Pre-conditions

The UE runs an application providing the capability of AI/ML model inference for image recognition.

An AI/ML server manages the AI/ML model pool and is capable of downloading the requested model to the application providing AI/ML based image recognition.

The 5G system has the ability to provide 5G network related information to the AI/ML server.

6.1.3 Service Flows

1) The AI/ML based image recognition application is requested by the user to start recognizing the image/video shot by the UE.
2) The AI/ML model is downloaded from the model server to the AI/ML based image recognition application via 5G network.
3) The AI/ML based image recognition application employs the AI/ML model for inference until the image recognition task is finished.
4) Redo Step 2) to 3) for AI/ML model re-selection and re-downloading if needed to adapt to the changing conditions.
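The service flow above can be sketched as a simple loop. This is an illustrative outline only: `select_model`, `download_model` and `infer` are hypothetical callbacks, and the flow does not specify which entity selects the model or how a change in conditions is detected.

```python
def run_recognition(frames, select_model, download_model, infer):
    """Download a model, run inference, and re-download when conditions call for another model."""
    current_name = None
    current_model = None
    results = []
    download_count = 0
    for frame in frames:
        wanted = select_model(frame)  # model re-selection (assumed to happen per frame here)
        if wanted != current_name:
            current_model = download_model(wanted)  # model download over the 5G network
            current_name = wanted
            download_count += 1
        results.append(infer(current_model, frame))  # inference with the current model
    return results, download_count


# Toy stand-ins: the "frame" is just a lighting label and the "model" is just its name.
frames = ["day", "day", "night", "day"]
results, download_count = run_recognition(
    frames,
    select_model=lambda frame: f"model-{frame}",
    download_model=lambda name: name,
    infer=lambda model, frame: (model, frame),
)
```

In this toy run the model is downloaded three times because nothing is kept in storage between switches; retaining recently used models would avoid the third download.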

6.1.4 Post-conditions

The objects in the input images or videos are recognized by the AI/ML based image recognition application with the required inference accuracy and latency.

The image recognition task can be completed within the available computation and energy resources of the UE.

6.1.5 Existing features partly or fully covering the use case functionality

This use case mainly requires a high data rate together with low latency. The high data rate requirements for the 5G system are listed in clauses 7.1 and 7.6 of TS 22.261. According to Table 7.1-1 of TS 22.261, a 300 Mbps DL experienced data rate and a 50 Mbps UL experienced data rate are required in the dense urban scenario, and a 1 Gbps DL experienced data rate and a 500 Mbps UL experienced data rate are required in the indoor hotspot scenario. According to Table 7.6.1-1 of TS 22.261, cloud/edge/split rendering-related data transmission requires a data rate of up to 0.1 Gbps with [5-10] ms latency countrywide.
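As a rough check of how far these existing data rates go, the sketch below computes which of the 8-bit models in Table 6.1.1-1 could be downloaded within 1 s at the two DL experienced data rates quoted above. It assumes the full experienced data rate is available to the model download and ignores protocol overhead; the helper names are illustrative.

```python
# 8-bit model sizes in MByte from Table 6.1.1-1.
MODEL_SIZE_MBYTE_8BIT = {
    "AlexNet": 60,
    "VGG16": 138,
    "ResNet-152": 60,
    "ResNet-50": 25,
    "GoogleNet": 6.8,
    "Inception-V3": 23,
    "1.0 MobileNet-224": 4.2,
}

# DL experienced data rates in Mbit/s quoted from Table 7.1-1 of TS 22.261.
DL_RATE_MBIT_S = {"dense urban": 300, "indoor hotspot": 1000}


def download_time_s(size_mbyte, rate_mbit_s):
    """Time to transfer the whole model, assuming 1 MByte = 8 Mbit and no overhead."""
    return size_mbyte * 8 / rate_mbit_s


def models_within(latency_s, rate_mbit_s):
    """Names of the models that can be fully downloaded within latency_s."""
    return sorted(
        name
        for name, size in MODEL_SIZE_MBYTE_8BIT.items()
        if download_time_s(size, rate_mbit_s) <= latency_s
    )


urban = models_within(1.0, DL_RATE_MBIT_S["dense urban"])
hotspot = models_within(1.0, DL_RATE_MBIT_S["indoor hotspot"])
```

Under these assumptions, the dense urban rate covers only the four smallest models, while the indoor hotspot rate covers all but VGG16, which needs 1104 Mbit/s.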

6.1.6 Potential New Requirements needed to support the use case

Considering the time taken by the device to finish the image recognition task, only a small portion of the recognition latency budget can be used to download the model. For one-shot object recognition and photo enhancement at a smartphone, a pre-installed sub-optimal model can be temporarily employed while the optimal model is being downloaded. The optimal model should be downloaded in the order of 1 s and then replace the pre-installed model. If 8-bit parameters are used for describing the DNN, the required DL data rate ranges from 33.6 Mbit/s to 1.1 Gbit/s.

For video recognition, the target could be to update the model within one frame duration, so that the updated model can be adopted for the next frame. However, for an application with an always-on camera, the device can predict the needed model and start downloading it in advance, so downloading the model within 1 s is acceptable. Similarly, for other applications with an always-on camera, namely person identification in a security surveillance system, AR display/gaming, remote driving and remote-controlled robotics, the required DL data rate ranges from 33.6 Mbit/s to 1.1 Gbit/s. It should be noted that the size of the model may be further reduced if more advanced model compression techniques can be adopted.
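To illustrate why the 1 s downloading target matters, the sketch below compares it with what a one-frame model update would demand. It assumes 30 FPS, as used for video recognition in Table 6.1.6.1-1, and the smallest model in Table 6.1.1-1 (1.0 MobileNet-224, 4.2 million parameters) with 8-bit parameters; the function name is illustrative.

```python
def rate_for_one_frame_mbit_s(params_million, bits_per_param, fps):
    """DL data rate in Mbit/s needed to deliver the whole model within one frame duration."""
    model_mbit = params_million * bits_per_param
    frame_duration_s = 1.0 / fps
    return model_mbit / frame_duration_s


# Smallest model in Table 6.1.1-1, 8-bit parameters.
one_frame_rate = rate_for_one_frame_mbit_s(4.2, 8, fps=30)
one_second_rate = 4.2 * 8 / 1.0

print(f"one frame at 30 FPS: {one_frame_rate:.0f} Mbit/s")
print(f"1 s download:        {one_second_rate:.1f} Mbit/s")
```

Even for the smallest model, a one-frame update needs roughly 1 Gbit/s, thirty times the rate required for a 1 s download, which is consistent with relying on prediction and advance downloading for always-on camera applications.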

6.1.6.1 Potential KPI Requirements

The potential KPI requirements needed to support the use case include:

[P.R.6.1-001]

The 5G system shall support AI/ML model downloading for image recognition with a maximum latency as given in Table 6.1.6.1-1.

[P.R.6.1-002]

The 5G system shall support AI/ML model downloading for image recognition with a user experienced DL data rate as given in Table 6.1.6.1-1.

[P.R.6.1-003]

The 5G system shall support AI/ML model downloading for image recognition with communication service availability not lower than 99.999 %.

[P.R.6.1-004]

The 5G system shall support AI/ML model downloading with a reliability for the transmission of data of model topology not lower than 99.999% and a reliability for the transmission of data of model weight factors not lower than 99.9%.

Table 6.1.6.1-1: Image recognition model downloading latency analysis for example applications (8-bit parameters for the DNN)

User application	Maximum image recognition latency	Maximum model downloading latency	User experienced DL data rate for model downloading
One-shot object recognition at smartphone	~1s	1s (Note 1)	33.6Mbit/s~1.1Gbit/s
Person identification in security surveillance system	~1s	1s (Note 2)	33.6Mbit/s~1.1Gbit/s
Photo enhancements at smartphone	~1s	1s (Note 1)	33.6Mbit/s~1.1Gbit/s
Video recognition	33ms@30FPS	1s (Note 2)	33.6Mbit/s~1.1Gbit/s
AR display/gaming	<5ms (Note 3)	1s (Note 2)	33.6Mbit/s~1.1Gbit/s
Remote driving	<5ms (Note 4)	1s (Note 2)	33.6Mbit/s~1.1Gbit/s
Remote-controlled robotics	<5ms (Note 5)	1s (Note 2)	33.6Mbit/s~1.1Gbit/s
NOTE 1: A pre-installed sub-optimal model can be temporarily employed while downloading the optimal model. The optimal model should be downloaded in the order of 1s.
NOTE 2: For applications with an always-on camera, the device can predict the needed model and start the downloading in advance. Downloading the model within 1s is acceptable.
NOTE 3: According to [4] [5], the VR motion-to-photon latency is in the range of 7-15ms. It can be assumed that the AR display or gaming requires a similar end-to-end latency. Considering the time taken by the AR rendering (e.g. 3D rendering of the virtual objects, rendering augmentation in overlay), the background video recognition should be finished within 5ms.
NOTE 4: According to [46], the one-way latency required for remote driving is 5ms. The round-trip latency is assumed to be 10ms. Considering the time taken by the vision-based driving inference at the server, the latency budget for traffic object recognition can be estimated to be 5ms.
NOTE 5: According to [5], the end-to-end latency required for video-operated remote-controlled robotics is 10~100ms. Considering the time taken by the robot controlling inference at the server, the robot vision recognition needs to be finished within 5ms.