The present document comprises a technical report on Video Codec Performance, for packet-switched video-capable multimedia services standardized by 3GPP.
The following documents contain provisions which, through reference in this text, constitute provisions of the present document.
References are either specific (identified by date of publication, edition number, version number, etc.) or non-specific.
For a specific reference, subsequent revisions do not apply.
For a non-specific reference, the latest version applies. In the case of a reference to a 3GPP document (including a GSM document), a non-specific reference implicitly refers to the latest version of that document in the same Release as the present document.
The present document is organized as discussed below.
Clause 5 introduces the service scenarios, including their relationship with 3GPP services. Furthermore, it discusses the performance measurement metrics used in the present document.
Clause 6 (performance figures) defines representative test cases and contains a listing, in the form of tables, of the performance of video codecs for each of the test cases.
Clause 7 (supplementary information on figure generation) contains pointers to accompanying files containing video sequences, anchor bit streams, and error-prone test bit streams. It also describes the mechanisms used to generate the anchor compressed video data and the compressed video data exposed to typical error masks, as well as the creation of the error masks themselves.
Annex A sketches one possible environment that could be used by interested parties as a starting point for defining a process to assess the performance of a particular video codec against the performance figures.
Annex B introduces details on the H.263 encoder and decoder configurations.
Annex C introduces details on the H.264 encoder and decoder configurations.
Annex D introduces details on the usage of 3G file format in the present document.
Annex E introduces details on the usage of RTPdump format in the present document.
Annex F introduces details on the simulator, bearers, and dump files.
Annex G introduces the details on the Quality Metric Evaluation.
Annex H introduces the details on the Video Test Sequences.
Annex I provides information on verification of appropriate use of the tools provided in this document.
Video transmission in a 3GPP packet-switched environment conceptually consists of an Encoder, one or more Channels, and a Decoder. The Encoder, as defined here, comprises the steps of source coding and, when required by the service, packetization into RTP packets, according to the relevant 3GPP Technical Specification for the service and media codec in question. The Channel, as defined here, comprises all steps of conveying the information created by the Encoder to the Decoder. Note that the Channel may be prone to packet erasures in some environments and error free in others. In an erasure-prone environment, it is not guaranteed that all information created by the Encoder can be processed by the Decoder, implying that the Decoder needs to cope to some extent with compressed video data that is not compliant with the video codec standard. The Decoder, finally, de-packetizes and reconstructs the - potentially erasure-prone and perhaps non-compliant - packet stream into a reconstructed video sequence. The only type of error considered at the depacketizer/decoder is RTP packet erasures.
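As an illustration of the Channel definition above, the following sketch drops RTP packets from a packet list at random. This is a hypothetical simplification (uniform random loss with an assumed loss_rate parameter); the actual channel simulator applies bearer-specific error masks and delays as described in annex F.

```python
import random

def simulate_channel(rtp_packets, loss_rate=0.05, seed=42):
    """Sketch of an erasure-prone Channel: drop RTP packets at random.

    Hypothetical simplification -- the real channel simulator (annex F)
    applies bearer-specific error masks and delays rather than uniform
    random losses. Surviving packets keep their original order.
    """
    rng = random.Random(seed)  # fixed seed so the same erasure mask is reproducible
    return [pkt for pkt in rtp_packets if rng.random() >= loss_rate]
```

The Decoder then has to reconstruct the video sequence from whatever subset of packets survives, which is what makes error-tolerant decoding necessary.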
3GPP includes video in different services, e.g. PSS [11], MBMS [12], PSC [13], [14], and MTSI [15]. This report lists the performance figures for only one service scenario, focusing on an RTP-based conversational service such as PSC or MTSI.
Service scenario A (PSC/MTSI-like) relates to conversational services involving compressed video data (an erasure-prone transport, low latency requirements, application-layer transport quality feedback, etc.). In this scenario, UE-based video encoding and decoding are assumed. The foremost examples of this service scenario are PSC and MTSI. Within this service scenario, the performance of both an encoder and a decoder is of importance for the service quality. Service scenario A refers to the ability of a decoder to consume possibly non-compliant (due to transmission errors) compressed video data generated by an encoder, while providing sufficient quality in this scenario.
This clause defines performance metrics as used in clause 6, to numerically and objectively express a Decoder's reaction to compressed video data that is possibly modified due to erasures. Only objective metrics are considered, which can be computed from sequences available in a 3G format as described in annex D, using the method detailed in annex G.
The following section provides a general description of the quality metrics. For the exact computation on sequences available in the 3G format, please refer to annex G.
The following acronyms are utilized throughout the remainder of this subclause:
OrigSeq: The original video sequence that has been used as input for the video encoder.
ReconSeq: The reconstructed video sequence, the output of a standard compliant decoder that operates on the output of the video encoder without channel simulation, i.e. without any errors. Timing alignment between the OrigSeq and ReconSeq is assumed.
ReceivedSeq: The video sequence that has been reconstructed and error-concealed by an error-tolerant video decoder, after a) the video encoder operated on the OrigSeq and produced an error-free packet stream file as output, and b) the channel simulator used the error-free packet stream file and applied errors and delays to it so as to produce an error-prone packet file, which is used as the input of the error-tolerant video decoder. For comparison purposes, a constant delay between OrigSeq and ReceivedSeq is assumed, whereby this constant delay is removed before comparison.
Each of the following metrics generates a single value when run for a complete video sequence.
The average Peak Signal-to-Noise Ratio (APSNR) is calculated between all pictures of the OrigSeq and the ReconSeq or the ReceivedSeq, respectively. First, the Peak Signal-to-Noise Ratio (PSNR) of each picture is calculated with a precision sufficient to prevent rounding errors in subsequent steps. Thereafter, the PSNR values of all pictures are averaged. The result is reported with a precision of two digits.
Only the luminance component of the video signal is used.
In case that results from several ReceivedSeq are to be combined, the average of all PSNR values for all ReceivedSeq is computed as the final result.
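The APSNR computation described above can be sketched as follows. The frame representation (flat lists of 8-bit luma samples) and the assumption that every picture pair differs (non-zero MSE) are simplifications of this sketch; the exact procedure on 3G-format sequences is defined in annex G.

```python
import math

def frame_psnr(orig, recon, peak=255.0):
    """PSNR of a single luma picture, given as equal-length flat lists
    of 8-bit samples. Assumes the pictures differ (MSE > 0)."""
    mse = sum((o - r) ** 2 for o, r in zip(orig, recon)) / len(orig)
    return 10.0 * math.log10(peak ** 2 / mse)

def apsnr(orig_seq, recon_seq):
    """Average the per-picture PSNR values at full precision and round
    only the final result to two digits, as the metric requires."""
    psnrs = [frame_psnr(o, r) for o, r in zip(orig_seq, recon_seq)]
    return round(sum(psnrs) / len(psnrs), 2)
```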
The PSNR of Average Normalized Square Difference (PANSD) is calculated between all pictures of the OrigSeq and the ReceivedSeq. First, the normalized square difference (NSD), also known as the Mean Square Error (MSE), of each picture is calculated with a precision sufficient to prevent rounding errors in subsequent steps. Thereafter, the NSD values of all pictures are averaged and the averaged value is converted into a PSNR value. The result is reported with a precision of two digits.
Only the luminance component of the video signal is used.
In case that results from several ReceivedSeq are to be combined, the average of all NSD values for all ReceivedSeq is computed and the final result is the PSNR over this averaged NSD.
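The key difference from APSNR is the order of operations: the per-picture NSD (MSE) values are averaged first, and only the single averaged value is converted to PSNR. A sketch under the same assumptions as before (flat lists of 8-bit luma samples, non-zero average MSE):

```python
import math

def pansd(orig_seq, received_seq, peak=255.0):
    """PSNR of Average Normalized Square Difference (sketch).

    Per-picture NSD (MSE) values are averaged first; the averaged
    value is then converted to PSNR and rounded to two digits.
    Assumes time-aligned flat lists of 8-bit luma samples."""
    nsds = [sum((o - r) ** 2 for o, r in zip(op, rp)) / len(op)
            for op, rp in zip(orig_seq, received_seq)]
    avg_nsd = sum(nsds) / len(nsds)
    return round(10.0 * math.log10(peak ** 2 / avg_nsd), 2)
```

Because the averaging happens in the MSE domain, a few badly degraded pictures pull PANSD down more strongly than they would pull down APSNR.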
The Percentage of Degraded Video Duration (PDVD) is defined as the percentage of the entire display time for which the PSNR of the erroneous video frames is more than x dB worse than the PSNR of the corresponding reconstructed frames, whereby x is set to 2 dB. This metric computation requires three sequences: the OrigSeq, the ReconSeq, and the ReceivedSeq.
Only the luminance component of the video signal is used.
In case that results from several ReceivedSeq are to be combined, the average of all PDVD values for all ReceivedSeq is computed as the final result.
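The PDVD computation can be sketched as below. Counting frames as a proxy for display time (i.e. assuming a constant frame rate) and the flat luma-list frame representation are assumptions of this sketch; annex G defines the exact procedure.

```python
import math

def frame_psnr(a, b, peak=255.0):
    """PSNR of one luma picture; assumes the pictures differ (MSE > 0)."""
    mse = sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)
    return 10.0 * math.log10(peak ** 2 / mse)

def pdvd(orig_seq, recon_seq, received_seq, threshold_db=2.0):
    """Percentage of frames whose received-frame PSNR is more than
    threshold_db worse than the error-free reconstructed PSNR."""
    degraded = sum(
        1 for o, rc, rv in zip(orig_seq, recon_seq, received_seq)
        if frame_psnr(o, rv) < frame_psnr(o, rc) - threshold_db
    )
    return round(100.0 * degraded / len(orig_seq), 2)
```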