With recent advances in multimedia technologies, demand is increasing for ultra-high-definition video services. In addition, the widespread distribution of smartphones and tablet personal computers, together with innovations in network technology, is facilitating various high-resolution video services, even in mobile environments. To address these market demands, a standard for the efficient compression of high-resolution video became necessary. Thus, the ISO/IEC Moving Picture Experts Group and the ITU-T Video Coding Experts Group established the Joint Collaborative Team on Video Coding and developed the next-generation video compression standard, known as High Efficiency Video Coding (HEVC). In January 2013, the Final Draft International Standard for HEVC was released.
With the establishment of the HEVC standard, fast and optimized encoders and decoders are now essential to ensure that HEVC is quickly distributed in the market. HEVC uses two in-loop filters (a deblocking filter and a sample adaptive offset) to improve perceptual quality and coding efficiency. These in-loop filters are applied to the decoded picture to minimize errors and improve subjective and objective image quality. In addition, the in-loop filtered frames are used as reference pictures for inter prediction, which improves prediction performance. However, for HEVC decoders, these two in-loop filters increase the computational load. In-loop filtering accounts for approximately 12% of the total decoding complexity of the HEVC test model (HM) 12.0 decoder with the HEVC main profile. We computed this complexity portion by measuring the running time with HM 12.0, using the all-intra, random-access, low-delay B, and low-delay P coding configurations with quantization parameters (QPs) of 22, 27, 32, and 37. Table 1 shows the decoding time of the deblocking filter and sample adaptive offset (SAO). On average, the deblocking filter accounts for 10.38% and SAO accounts for 2.15%.
| Sequence | QP | Decoding time (s) | Portion (%) |
|---|---|---|---|

*DBF: deblocking filter.
To reduce the computational complexity of video decoders, many studies have accelerated decoding using data-level parallelism such as single-instruction multiple-data (SIMD) instructions. Unlike the motion compensation and inverse transform modules, however, the in-loop filters are poorly suited to SIMD. The HEVC deblocking filter consists of multiple stages, such as the filtering on/off decision, strong/weak filter selection, and pixel-wise filtering of 4 × 4 pixel lines. Within each line, each pixel is filtered with different filter coefficients; thus, multiple data elements cannot be loaded with a single pattern, and different loading types and coefficients are required depending on the pixel location. In addition, the vertical deblocking filter requires vertical loading, which is inefficient for pixel data stored in rows. We do not contend that SIMD cannot be used for deblocking; rather, it is difficult to apply effectively. As a result, the in-loop filters account for a larger share of the run time in otherwise optimized decoders. In-loop filter parallelization is therefore necessary and effective for fast, optimized decoders on multi-core platforms.
Today, mobile devices and personal computers equipped with multi-core processors are common, and this trend is expected to continue. To maximize processor efficiency, many applications are structurally modified to facilitate parallel processing, and new applications are being developed with parallelization in mind. For real-time decoding, many video decoder parallelization methods have been investigated to maximize the efficiency of multi-core processors. With respect to the parallelization of deblocking filters, Jo and others proposed a method in which a slice is equally partitioned into many regions, and each region is then allocated to a core. However, a parallel algorithm in which a frame is equally divided produces an unequal number of boundaries to which the deblocking filter is applied, which lowers the parallelization efficiency on account of workload imbalance. Ikeda and others proposed a parallelized algorithm that allocates a core to each coding tree unit (CTU) row; it likewise cannot maximize parallelization efficiency because of the workload imbalance across the partitioned regions. In addition, coding unit (CU) partition information has been used to estimate the complexity of the deblocking filter, which improves the acceleration of the earlier algorithms. Nevertheless, CU partition information alone cannot capture the deblocking complexity accurately.
Parallelization efficiency can differ according to how work items are assigned to cores or threads. When the actual processing quantity can be accurately estimated, work can be assigned equally to each core or thread, decreasing the idle time of the cores and improving parallelization efficiency. In contrast, with inaccurate complexity estimation, the core that finishes last determines the total processing time, which decreases parallelization efficiency. In this paper, we propose a method for maximizing parallelization efficiency, wherein the deblocking filter complexity is considered when assigning data to each core. This approach effectively reduces the run time of the HEVC deblocking filter. With a minimal computational load, we estimate the complexity of each CTU based on the CU depth and transform unit (TU) split information. The total computational complexity is then divided equally among the cores or threads. The decoding speed-up of the proposed complexity-estimation-based deblocking filter is 2% and 6% higher than that of the existing uniform-distribution and CTU-row-distribution parallelization methods, respectively.
The remainder of this paper is organized as follows. In Section II, we introduce the HEVC deblocking filter and previous research in video decoder parallelization. In Section III, we describe the proposed HEVC deblocking filter parallelization method. In Section IV, we demonstrate and analyze the performance of the proposed parallelization method in a comparative evaluation. Finally, we present our conclusions in Section V.
The HEVC deblocking filter is similar to the H.264/AVC deblocking filter in terms of its basic concept; however, the HEVC deblocking filter has lower complexity. Moreover, the HEVC deblocking filter was designed to have a reduced level of dependence between adjacent blocks so that it can be performed in parallel. Proper parallelization of the HEVC deblocking filter is important for realizing maximum parallelization performance in practical implementations.
In video compression, quantization causes quality degradation and blocking artifacts owing to the block-based coding structure. Because blocking artifacts significantly affect perceptual quality, HEVC also employs a deblocking filter to reduce them. In each picture, the vertical boundaries of the prediction units (PUs) and TUs are first horizontally filtered; the horizontal boundaries are then vertically filtered. This design is more efficient in terms of data-level parallelism than the macroblock-level horizontal and vertical filtering of H.264/AVC because it eliminates the dependence between adjacent blocks. Whereas the H.264/AVC deblocking filter operates on 4 × 4 block boundaries, the HEVC deblocking filter is applied on 8 × 8 block boundaries to lower the computational load.
Figure 1 shows the HEVC deblocking filtering process. First, the boundary candidates to which the deblocking filter may be applied are determined. These candidates are 8 × 8 TU or PU boundaries. For each candidate boundary, the boundary strength (BS) (0, 1, or 2) is computed depending on the prediction modes, transform coefficients, reference picture indices, and motion vectors. When the BS is 0, no deblocking filter is applied to the boundary. For boundaries with a BS of 1 or 2, the β and tc values are computed. We must then decide whether the boundary is to be filtered, depending on the computed β and the luminance differences between neighboring pixels. When the luminance difference is sufficiently small, it is assumed to arise from quantization rather than from an original edge, and the boundary is filtered. Finally, we determine whether a strong or weak filter is appropriate depending on the β and tc values.
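The decision chain above can be sketched as follows. This is a simplified illustration, not the normative HEVC derivation: the helper names are our own, and the on/off test is shown for a single pixel line rather than the full per-segment decision.

```python
# Simplified sketch of the HEVC deblocking filter decision chain.
# Function names and arguments are illustrative, not normative.

def boundary_strength(p_is_intra, q_is_intra, has_coeffs, mv_diff_large, diff_ref):
    """Return BS = 2, 1, or 0 for one candidate boundary."""
    if p_is_intra or q_is_intra:
        return 2                 # intra prediction on either side -> strongest
    if has_coeffs or mv_diff_large or diff_ref:
        return 1                 # coded coefficients, or MV/reference mismatch
    return 0                     # no filtering for this boundary

def filter_on(p, q, beta):
    """Local-activity test: filter only when the luma variation across the
    boundary is small enough to be a quantization artifact rather than a
    real edge (d < beta). p[0]/q[0] are the pixels nearest the boundary."""
    d_p = abs(p[2] - 2 * p[1] + p[0])   # second derivative on the p side
    d_q = abs(q[2] - 2 * q[1] + q[0])   # second derivative on the q side
    return d_p + d_q < beta
```

A smooth region (small second derivatives) passes the `filter_on` test, whereas a strong step that reflects real picture content does not.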
Parallelization can be categorized as either task-level parallelism (TLP) or data-level parallelism (DLP). TLP divides the work at a functional level and assigns tasks to cores or threads, which requires data to be transferred between dependent tasks; it is generally used when the data transfer quantity between tasks is small. DLP divides the data and assigns each data item to a core or thread, and it can yield maximum parallel performance when there is no dependence among the tasks assigned to the cores or threads. In DLP, parallelization efficiency is determined by how the workload is divided and how the partitioned jobs are assigned to the cores. Because each core performs all the video decoding processes for its designated area of a picture, there is no overhead for data communication between threads. In addition, because DLP has higher scalability than TLP, it can easily accommodate an increase or decrease in the number of cores.
In DLP, when the workload is proportional to the quantity of data, partitioning the data into equal amounts is generally desirable. However, when the work required per unit of data varies, equal data partitioning reduces the parallelization efficiency on account of the workload imbalance between the cores. As shown in Fig. 2, three cores remain idle until the last task is completed because the cores have different workloads. If one or several cores are overburdened, the other cores become idle even after their tasks are completed, and the parallelization performance degrades. In contrast, if the workload is equally distributed among the cores, all the work finishes simultaneously, increasing parallelization efficiency. Equal workloads cannot be allocated, however, when the optimal division of the tasks is unknown before processing. Thus, it is very important to accurately estimate workloads prior to execution.
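The effect can be seen in a toy makespan calculation. The per-core work amounts below are made up for illustration; the point is that the total time is set by the slowest core, so the same total work finishes much sooner when spread evenly.

```python
# Toy illustration of the workload-imbalance effect: with one task list
# per core, the total (parallel) time is set by the most loaded core.
def makespan(workloads):
    """workloads: one list of task costs per core."""
    return max(sum(per_core) for per_core in workloads)

unbalanced = [[9], [2], [2], [2]]   # one overburdened core, three mostly idle
balanced   = [[4], [4], [4], [3]]   # same total work (15), evenly spread

# makespan(unbalanced) = 9, makespan(balanced) = 4
```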
Thread-queue parallelization is a widely used DLP method in the development of video decoders. This method divides a given workload into units and schedules the cores to maximize parallel performance, which also tends to equalize the division of the workload. When decoding a picture, we can divide the work into multiple tasks of one CTU or more. With thread-queue parallelization, a core executes the next task immediately after finishing its current one. However, a large number of small units can increase the overhead of core scheduling, such as context switching.
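A minimal thread-queue sketch is shown below, assuming CTU-granularity tasks. The task granularity, worker count, and function names are illustrative; a real decoder would pass decoded CTU data rather than bare indices.

```python
# Minimal thread-queue DLP sketch: task indices (e.g., CTUs) are queued,
# and each worker thread pulls the next task as soon as it finishes one.
import queue
import threading

def run_thread_queue(num_tasks, num_workers, process_task):
    tasks = queue.Queue()
    for i in range(num_tasks):
        tasks.put(i)

    def worker():
        while True:
            try:
                task = tasks.get_nowait()   # grab the next available task
            except queue.Empty:
                return                      # no work left: worker exits
            process_task(task)

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Fast workers naturally take more tasks, so the load balances itself, at the cost of per-task scheduling overhead when the units are small.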
The number of HEVC CTUs in one 3,840 × 2,160 picture is 2,040. In this case, the scheduling overhead can become a dominant time-consuming part. The maximum size of an HEVC CTU is 64 × 64. With 8 bits per pixel and 4:2:0 chroma subsampling, the data quantity of a CTU is approximately 6 kB, which is 16 times that of an H.264/AVC macroblock. Thus, an appropriate number of job partitions must be chosen before job assignment to reduce the context-switching overhead.
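The figures quoted above follow directly from the picture and block geometry:

```python
# Sanity check of the figures quoted above for a 3,840 x 2,160 picture.
ctu = 64
ctus_per_row = 3840 // ctu                 # 60 CTUs per row
ctu_rows = (2160 + ctu - 1) // ctu         # 34 rows (the last row is partial)
num_ctus = ctus_per_row * ctu_rows         # 2,040 CTUs per picture

luma = ctu * ctu                           # 4,096 bytes at 8 bits/pixel
chroma = luma // 2                         # 4:2:0 -> Cb + Cr add half again
ctu_bytes = luma + chroma                  # 6,144 bytes (~6 kB)

mb_bytes = 16 * 16 * 3 // 2                # H.264/AVC macroblock: 384 bytes
ratio = ctu_bytes // mb_bytes              # 16x the macroblock payload
```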
Figure 3 shows a video DLP method that partitions the picture into several areas and assigns each partitioned area to a core. This method facilitates parallelization without requiring additional scheduling after a partitioned area has been assigned. Furthermore, the data can be loaded sooner using an independent bus with direct memory access because the next block to be processed can be predicted. With this parallelization method, an entire picture can be equally partitioned and a core assigned to each partition. However, an equal amount of data does not generate an equal workload for a video decoder. The deblocking filter is adaptively switched on and off depending on the block location, so different regions require different computational loads. Therefore, equal-sized picture partitioning does not maximize parallelization performance because of the workload imbalance, and the achieved performance depends on the visual characteristics of the pictures.
Considerable research has been devoted to reducing the dependency of the H.264/AVC deblocking filter on multi-core platforms. In the H.264/AVC deblocking filter, horizontal and vertical filtering are employed sequentially at the macroblock level. In HEVC, on the other hand, the horizontal and vertical deblocking filters can each be applied at the picture level; the dependency of HEVC deblocking filtering is therefore quite low. Furthermore, an algorithm was proposed for allocating the same number of CTUs to each core. It separates the horizontal and vertical filtering at the picture level and applies CTU-level DLP to each filtering pass, as shown in Fig. 4.
In another algorithm, each CTU row is processed by a separate core. Both algorithms suffer from unequal workloads, depending on the video characteristics. To alleviate this problem, a load-balancing parallelization method based on CU partition information was proposed. However, that algorithm cannot estimate the deblocking filter complexity accurately from CU partition information alone, because the deblocking filter is performed on TU and PU boundaries; within a CU, the computational complexity of the deblocking filter can therefore differ from what the CU partition information suggests.
HEVC employs a deblocking filter and SAO to improve compression performance. The deblocking filter and SAO are applied to the motion-compensated picture after inverse quantization and the inverse transform, as shown in Fig. 5. Several studies have focused on the parallelization of pixel decoding in HEVC decoders. However, an HEVC decoder decodes each picture based on previously filtered pictures; thus, the running-time performance of the HEVC decoder cannot be maximized by parallelizing only the pixel decoding. In-loop filter parallelization is required to maximize overall decoder performance. Several studies have also focused on SAO parallelization. In this paper, we propose an efficient method of applying DLP to the HEVC deblocking filter. As described above, the existing parallelization methods for the deblocking filter cannot maximize parallelization efficiency because of workload imbalance. To address this problem, we predict the computational complexity of the deblocking filter from CU and TU partition information and then distribute the workload equally across multiple cores.
Figure 6 shows the deblocking filter parallelization method, which is based on the estimation of computational complexity. First, we calculate the complexity of the deblocking filtering task for a given frame. Based on the estimated complexity, we determine the region to be assigned to each core or thread for parallel deblocking filtering.
In the proposed method, we calculate the complexity of the deblocking filtering using not only CU but also TU partitioning information. Because the deblocking filter is applied only to selected PU or TU boundaries, as shown in Fig. 7, the CU partitioning information is closely related to the block boundaries to which the deblocking filter is applied. To evaluate the complexity more accurately, we additionally use the TU partitioning information of each CU, which indicates whether the TU is split. As shown in Fig. 7, we determine the number of 8 × 8 block boundaries based on the CU depth. When the CU depth is 0, the number of block boundaries is 16; for CU depths of 1, 2, and 3, there are 8, 4, and 2 block boundaries, respectively. If a CU is additionally split into multiple TUs, the respective numbers of block boundaries are 32, 16, 8, and 2.
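The boundary counts above reduce to a small lookup. The sketch below encodes the rule as described in the text; the function and table names are ours.

```python
# Complexity-factor lookup implied by the text: the factor halves with each
# CU depth level, and a TU split doubles it except at depth 3, where the
# sub-blocks fall below the 8 x 8 filtering granularity.
BASE_FACTOR = {0: 8, 1: 4, 2: 2, 3: 1}

def complexity_factor(cu_depth, tu_split):
    factor = BASE_FACTOR[cu_depth]
    if tu_split and cu_depth != 3:
        factor *= 2
    return factor
```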
Table 2 lists the complexity factors used to calculate the deblocking filter complexity in a frame, based on the ratios of the numbers of 8 × 8 block boundaries. Depending on the CU depth, we use a factor of 8, 4, 2, or 1. If the CU is split at the TU level, the complexity factor is doubled unless the CU depth is 3: because the deblocking filter is not applied to PU or TU boundaries smaller than 8 × 8, a TU split adds no boundaries in such small blocks. In a frame, the complexity of the deblocking filter can be expressed by:
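Assuming the straightforward reading that the frame-level complexity is the per-depth sums added together, the expression takes the form:

```latex
C_{\text{frame}} = \sum_{d=0}^{3} SCF_{d}
```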
where SCF_d is the sum of the complexity factors of the CUs at depth d. We estimate the total complexity of the deblocking filter for each frame and then divide it by the number of threads or cores to determine the ideal equal workload per thread or core. Next, we compute the accumulated complexity from the first to the last CTU and form groups of CTUs in raster-scan order so that all the threads or cores receive nearly equal workloads.
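The grouping step can be sketched as follows, assuming per-CTU complexity estimates are already available. The function name and the greedy cut-at-the-target strategy are illustrative:

```python
# Sketch of the CTU-grouping step: accumulate per-CTU complexity in raster
# order and cut a new group each time the running sum reaches the next
# multiple of (total complexity / number of cores).
def partition_ctus(ctu_costs, num_cores):
    total = sum(ctu_costs)
    target = total / num_cores
    groups, current = [], []
    acc, boundary = 0.0, target
    for idx, cost in enumerate(ctu_costs):
        current.append(idx)
        acc += cost
        if acc >= boundary and len(groups) < num_cores - 1:
            groups.append(current)     # close this core's CTU group
            current = []
            boundary += target
    groups.append(current)             # remainder goes to the last core
    return groups
```

For example, six CTUs of equal cost split across three cores yield the groups `[[0, 1], [2, 3], [4, 5]]`.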
To evaluate the performance of the proposed deblocking filter parallelization method, we implemented the proposed algorithm in the HM 12.0 reference software, using OpenMP 2.0 for parallelization and standard HEVC test sequences for the evaluation. We encoded each sequence with QPs of 22, 27, 32, and 37 to produce the bitstreams. Table 3 shows the test conditions for this evaluation. To reduce measurement error, we performed decoding five times and calculated the average decoding time. To measure the performance improvement achieved by the parallelization method, we computed the speed-up factor and average time saving (ATS) as follows:
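The two metrics are presumably the standard definitions:

```latex
\text{Speed-up} = \frac{DT_{\text{ref.}}}{DT_{\text{prop.}}},
\qquad
ATS\,(\%) = \frac{DT_{\text{ref.}} - DT_{\text{prop.}}}{DT_{\text{ref.}}} \times 100
```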
where DT_prop. is the decoding time when the proposed method is used, and DT_ref. is the decoding time of the HM 12.0 reference decoder.
| Item | Test conditions |
|---|---|
| Reference software | HM 12.0 |
| Encoding conditions | Low-delay, random-access, all-intra |
| QP | 22, 27, 32, 37 |
| Sequences | BasketballDrive, BQTerrace, Cactus, Kimono, ParkScene |
| Resolution | Full HD (1,920 × 1,080) |
| CPU | Intel Core i7 X990, 3.47 GHz |
| OS | Windows 7 (64-bit) |
In Table 4, the P1 values are the speed-up factors when the same number of CTUs is allocated to each core for deblocking filtering. The P2 column shows the speed-up factors when each CTU row is allocated to a core. The P3 values are the speed-up factors achieved by the proposed algorithm. On average, the speed-up factor of P3 is slightly higher than the others. With two threads, the proposed algorithm is 3% faster on average than P1 and P2. With four threads, the proposed algorithm is 2% and 6% faster on average than P1 and P2, respectively. However, the gain of the proposed algorithm varies significantly across sequences. The proposed algorithm is 17% faster than P1 for the “BasketballDrive” sequence with two threads. For the “ParkScene” sequence, the proposed algorithm is 10% better than P2 with a QP of 27. With four threads, the proposed algorithm is 17% and 19% faster than P2 for two of the sequences, respectively. We consider the proposed algorithm to be particularly effective for sequences with varied local characteristics, for which the estimated and actual complexities differ considerably depending on the CTU locations. In addition, more threads or cores can improve the load-balancing accuracy of both the proposed and existing algorithms. However, the acceleration factor is not always proportional to the number of threads, on account of the context-switching overhead.
| QP | Sequence | DBF decoding time (s): HM 12.0 | DBF decoding time (s): 2 threads | DBF decoding time (s): 4 threads | Speed-up: 2 threads | Speed-up: 4 threads |
|---|---|---|---|---|---|---|
Figure 8 shows the relationship between the actual deblocking filtering time and the complexity estimated by the proposed method. As the estimated complexity increases, the deblocking filtering time also increases. As shown in the figure, the correlation is quite high, and the estimated complexity is sufficiently accurate to be used for load balancing; we obtained a correlation coefficient of 0.84 for the data in Fig. 8.
The complexity estimated from CU split information alone can differ from the actual complexity. To confirm this, we generated bitstreams by coding half of each picture with both CU and TU partitioning and the remaining half with TU partitioning. We found that the proposed algorithm using both CU and TU split information is approximately 1.48 times better than the conventional algorithm using only CU split information. We do not contend that the proposed algorithm is always significantly better than the conventional algorithms; however, it can further accelerate the parallel performance for videos coded with varied TU splits.
In this paper, we proposed an efficient parallelization method for the HEVC decoder deblocking filter. To alleviate the workload imbalance problem and better ensure efficient parallelization of the deblocking filter, we estimate the complexity so that equal workloads can be distributed across multiple cores. The proposed complexity-estimation-based deblocking filter improves the average time saving by 2% and 6% over uniform-distribution and CTU-row-distribution deblocking filter parallelism, respectively. In addition, the speed-up factor of the proposed method reached up to 3.1 compared with single-core decoding of the deblocking filter. Further work is recommended to parallelize all parts of the HEVC decoder with a focus on workload balancing.