Recognition task | Maximum latency (see note 4) | User experienced DL/UL data rate: Faster R-CNN [50] (see note 1) | User experienced DL/UL data rate: YOLOv3 [51] (see note 2)
---|---|---|---
Uplink streaming | 100-200 ms | 100-1000 kbit/s | 200-1500 kbit/s
Generic FPN inference | 100-500 ms | FPN: 4-10 fps; Sum(Pi) ~1 MB/frame; 32-100 Mbit/s uncompressed (see note 3); compression factor 10-100 | Multiple scale (similar to FPN): 1.5 MB feature map/frame; 40-150 Mbit/s uncompressed; compression factor 10-100
Object classification | 20-50 ms | Performed on UE | Performed on UE
Bounding box detection | 20-50 ms | Performed on UE | Performed on UE
Object tracking | 50-150 ms | Performed on UE | Performed on UE
Enhanced information retrieval | | Few kBytes per request | Few kBytes per request
Overlay rendering | 10 ms | Performed on UE | Performed on UE

NOTE 1: Faster R-CNN uses an input image size of 3x224x224. The video is downscaled on the UE to that resolution, compressed (e.g. using HEVC), and streamed to the edge for further processing.
NOTE 2: YOLOv3 uses an input image size of 3x416x416. The captured video is downscaled on the UE to the target resolution and compressed prior to streaming to the edge.
NOTE 3: Faster R-CNN uses an FPN with ResNet-101 as backbone, resulting in feature maps {P2=(256x56x56), P3=(256x28x28), P4=(256x14x14), P5=(256x7x7)}.
NOTE 4: The latency estimates assume an overall latency of around 1 s from the user pointing at an object until the overlay information is displayed.
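
The FPN figures in the table can be roughly reproduced from the feature-map shapes in NOTE 3. The sketch below sums the pyramid levels and converts the result to a data rate at the quoted 4-10 fps. It rests on two assumptions that the table does not state: each feature-map element occupies one byte (8-bit values), and P3 is 256x28x28, following the halving pattern of the other levels. The names `FPN_LEVELS`, `frame_bytes` and `rate_mbit_s` are illustrative only.

```python
# Pyramid levels from NOTE 3 as (channels, height, width).
# ASSUMPTION: P3 is 256x28x28, halving P2 as the other levels do.
FPN_LEVELS = {
    "P2": (256, 56, 56),
    "P3": (256, 28, 28),
    "P4": (256, 14, 14),
    "P5": (256, 7, 7),
}

# ASSUMPTION: one byte per element (8-bit values); not stated in the table.
BYTES_PER_ELEMENT = 1


def frame_bytes(levels=FPN_LEVELS, bytes_per_element=BYTES_PER_ELEMENT):
    """Total feature-map payload for one frame, in bytes."""
    elements = sum(c * h * w for c, h, w in levels.values())
    return elements * bytes_per_element


def rate_mbit_s(fps, compression_factor=1.0):
    """Data rate in Mbit/s (1 Mbit = 10**6 bit) at the given frame rate."""
    return frame_bytes() * 8 * fps / compression_factor / 1e6


if __name__ == "__main__":
    print(f"Sum(Pi) = {frame_bytes()} bytes per frame")
    for fps in (4, 10):
        print(f"{fps} fps: {rate_mbit_s(fps):.1f} Mbit/s uncompressed")
    # Best and worst case once the 10-100 compression factor is applied.
    print(f"4 fps, factor 100: {rate_mbit_s(4, 100):.2f} Mbit/s")
    print(f"10 fps, factor 10: {rate_mbit_s(10, 10):.2f} Mbit/s")
```

Under these assumptions the per-frame payload comes to a little over 1 MB and the uncompressed rate to roughly 34-85 Mbit/s, which is in line with the rounded 32-100 Mbit/s range given in the table. A different element width (for example 16-bit or 32-bit values) would scale every figure proportionally.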