Network Working Group N. Egge
Internet-Draft Mozilla Corporation
Intended status: Informational October 18, 2015
Expires: April 20, 2016

Chroma-from-Luma Intraprediction for NETVC


This document proposes a scheme for predicting chroma coefficients from reconstructed luma coefficients in the frequency domain. When this technique is used with Perceptual Vector Quantization (PVQ), the expensive parameter fitting step can be completely omitted.

Status of This Memo

This Internet-Draft is submitted in full conformance with the provisions of BCP 78 and BCP 79.

Internet-Drafts are working documents of the Internet Engineering Task Force (IETF). Note that other groups may also distribute working documents as Internet-Drafts. The list of current Internet-Drafts is at

Internet-Drafts are draft documents valid for a maximum of six months and may be updated, replaced, or obsoleted by other documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as "work in progress."

This Internet-Draft will expire on April 20, 2016.

Copyright Notice

Copyright (c) 2015 IETF Trust and the persons identified as the document authors. All rights reserved.

This document is subject to BCP 78 and the IETF Trust's Legal Provisions Relating to IETF Documents ( in effect on the date of publication of this document. Please review these documents carefully, as they describe your rights and restrictions with respect to this document. Code Components extracted from this document must include Simplified BSD License text as described in Section 4.e of the Trust Legal Provisions and are provided without warranty as described in the Simplified BSD License.

1. Introduction

Still image and video codecs typically consider the problem of intra-prediction in the spatial domain. A predicted image is generated on a block-by-block basis using the previously reconstructed neighboring blocks for reference, and the residual is encoded using standard entropy coding techniques. Modern codecs use the boundary pixels of the neighboring blocks along with a directional mode to predict the pixel values across the target block. These directional predictors are cheap to compute (often directly copying pixel values or applying a simple linear kernel), exploit local coherency (with low error near the neighbors) and predict hard-to-code features (extending sharp directional edges across the block).

In codecs that use lapped transforms [I-D.egge-netvc-tdlt], the reconstructed pixel data of the neighboring blocks is not available for prediction. The neighboring spatial image data cannot be reconstructed until after the target block has been decoded and the appropriate unlapping filter has been applied across the block boundaries.

A promising technique was proposed in [Lee09] to predict the chroma channels using the spatially coincident reconstructed luma channel. We propose a different technique that adapts the spatial chroma-from-luma intra-prediction for use with frequency-domain coefficients. We call this algorithm frequency-domain chroma-from-luma.

More recently, work on the Daala video codec [Daala-website] has included replacing scalar quantization with gain-shape quantization [Val15]. We show that when prediction is used with gain-shape quantization, it is possible to design a frequency-domain chroma-from-luma predictor without the added encoder and decoder overhead.

2. Chroma from Luma Prediction

In spatial-domain chroma-from-luma, the key observation is that the local correlation between luminance and chrominance can be exploited using a linear prediction model. For the target block, the chroma values can be estimated from the reconstructed luma values as:

chroma(u,v) = alpha * luma(u,v) + beta

where the model parameters alpha and beta are computed as a linear least-squares regression using 2*N pairs of spatially coincident luma and chroma pixel values along the block boundary.

        L L L L L L L L
        L L L L L L L L
    L L . . . . . . . .
    L L . . . . . . . .        L L L L         C C C C
    L L . . . . . . . .      L . . . .       C . . . .
    L L . . . . . . . .  =>  L . . . .       C . . . .
    L L . . . . . . . .      L . . . .       C . . . .
    L L . . . . . . . .      L . . . .       C . . . .
    L L . . . . . . . .
    L L . . . . . . . .

Predicting a 4x4 chroma block uses 2*N = 8 pairs of spatially
coincident pixel values.  When using 4:2:0 input, the 8x8 block
of luma coefficients must be first down-sampled to 4x4.
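
The regression above can be sketched as follows. This is a minimal illustration, not the method of any particular codec: the function names fit_linear and predict_chroma are hypothetical, the boundary samples are passed in as plain lists, and the fallback for a flat luma boundary is an assumption made here so that the sketch never divides by zero.

```python
def fit_linear(luma, chroma):
    """Least-squares fit of chroma ~= alpha * luma + beta over paired samples."""
    n = len(luma)
    sum_l = sum(luma)
    sum_c = sum(chroma)
    sum_ll = sum(l * l for l in luma)
    sum_lc = sum(l * c for l, c in zip(luma, chroma))
    denom = n * sum_ll - sum_l * sum_l
    if denom == 0:
        # Assumption: a flat luma boundary gives no slope, so use the chroma mean.
        return 0.0, sum_c / n
    alpha = (n * sum_lc - sum_l * sum_c) / denom
    beta = (sum_c - alpha * sum_l) / n
    return alpha, beta


def predict_chroma(luma_block, alpha, beta):
    """Apply chroma(u,v) = alpha * luma(u,v) + beta to every sample of a block."""
    return [[alpha * l + beta for l in row] for row in luma_block]


# 2*N = 8 boundary pairs for a 4x4 chroma block (illustrative values).
boundary_luma = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0, 70.0, 80.0]
boundary_chroma = [0.5 * l + 16.0 for l in boundary_luma]
alpha, beta = fit_linear(boundary_luma, boundary_chroma)
predicted = predict_chroma([[100.0, 104.0], [96.0, 92.0]], alpha, beta)
```

Both the encoder and the decoder have to run this fit for every block, which is the per-block overhead that the later sections avoid.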

2.1. Extension to Frequency Domain

In codecs that use lapped transforms, the neighboring reconstructed pixel data is not available for use in spatial prediction. However, the transform coefficients in the lapped frequency domain are the product of two linear transforms: the linear pre-filter followed by the linear forward DCT. Thus the same assumption of a linear correlation between luma and chroma coefficients holds. In addition, we can take advantage of the fact that prediction is being done in the frequency domain to use only a small subset of coefficients when computing model parameters.

The chroma values can thus be estimated using frequency-domain chroma-from-luma:

     chroma_DC = alpha_DC * luma_DC + beta_DC
chroma_AC(u,v) = alpha_AC * luma_AC(u,v)

Figure 1: Predicting chroma from luma using linear regression in the frequency domain

where alpha_DC and beta_DC are computed using a linear regression with the DC coefficients of the three neighboring blocks: up, left and up-left. When estimating chroma_AC(u,v) we can omit the constant offset beta_AC as we expect the AC coefficients to be zero mean. Additionally, we need not include all of the AC coefficients from the three neighboring blocks when computing alpha_AC.

  L . . .|L . . .       C . . .|C . . .
  . . . .|. . . .       . . . .|. . . .
  . . . .|. . . .       . . . .|. . . .
  . . . .|. . . .       . . . .|. . . .
  -------+-------       -------+-------
  L . . .|. . . .       C . . .|. . . .
  . . . .|. . . .       . . . .|. . . .
  . . . .|. . . .       . . . .|. . . .
  . . . .|. . . .       . . . .|. . . .

The three pairs of DC coefficients used to compute alpha_DC and beta_DC.

  . L . .|. L . .       . C . .|. C . .
  L L . .|L L . .       C C . .|C C . .
  . . . .|. . . .       . . . .|. . . .
  . . . .|. . . .       . . . .|. . . .
  -------+-------       -------+-------
  . L . .|. . . .       . C . .|. . . .
  L L . .|. . . .       C C . .|. . . .
  . . . .|. . . .       . . . .|. . . .
  . . . .|. . . .       . . . .|. . . .

The nine pairs of AC coefficients used to compute alpha_AC.

It is sufficient to use the three lowest AC coefficients from the neighboring blocks. This means that the number of input pairs is constant regardless of the size of chroma block being predicted. Moreover, the input AC coefficients have semantic meaning: we use the strongest horizontal, vertical and diagonal components. This has the effect of preserving features across the block as edges are correlated between luma and chroma.
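
A minimal sketch of the two fits, assuming the neighboring coefficients have already been gathered into flat lists (three DC pairs, nine AC pairs). The function names are hypothetical, and the zero-energy fallbacks are assumptions added here for robustness rather than something this document specifies.

```python
def fit_dc(luma_dc, chroma_dc):
    """Fit chroma_DC = alpha_DC * luma_DC + beta_DC by least squares."""
    n = len(luma_dc)
    mean_l = sum(luma_dc) / n
    mean_c = sum(chroma_dc) / n
    var_l = sum((l - mean_l) ** 2 for l in luma_dc)
    if var_l == 0:
        # Assumption: identical luma DCs give no slope, so predict the mean.
        return 0.0, mean_c
    cov = sum((l - mean_l) * (c - mean_c) for l, c in zip(luma_dc, chroma_dc))
    alpha_dc = cov / var_l
    return alpha_dc, mean_c - alpha_dc * mean_l


def fit_ac(luma_ac, chroma_ac):
    """Fit chroma_AC = alpha_AC * luma_AC with no offset (zero-mean AC)."""
    energy = sum(l * l for l in luma_ac)
    if energy == 0:
        # Assumption: all-zero luma AC carries no information about chroma.
        return 0.0
    return sum(l * c for l, c in zip(luma_ac, chroma_ac)) / energy


# DC of the up, left and up-left neighbors (illustrative values).
luma_dc = [100.0, 120.0, 90.0]
chroma_dc = [0.25 * l + 8.0 for l in luma_dc]
alpha_dc, beta_dc = fit_dc(luma_dc, chroma_dc)

# Three lowest AC coefficients from each of the three neighbors.
luma_ac = [12.0, -7.0, 3.0, 9.0, 4.0, -2.0, -6.0, 5.0, 1.0]
chroma_ac = [-0.5 * l for l in luma_ac]
alpha_ac = fit_ac(luma_ac, chroma_ac)
```

The input size is three and nine pairs regardless of the block size being predicted, so the cost of the fit does not grow with the block.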

2.2. Time-Frequency Resolution Switching

When image data is 4:4:4 or 4:2:0, the chroma and luma blocks are aligned so that the lowest 3 AC coefficients describe the same frequency range. In codecs that support multiple block sizes (or that support 4:2:2 image data), the luma blocks and the chroma blocks are not always aligned. For example, in the Daala video codec [Daala-website] the smallest block size supported is 4x4. In 4:2:0, when an 8x8 block of luma image data is split into four 4x4 blocks, the corresponding 4x4 chroma image data is still coded as a single 4x4 block.

This is a problem for frequency-domain chroma-from-luma as it requires the reconstructed luma frequency-domain coefficients to cover the same spatial extent. Using Time-Frequency resolution switching (TF) it is possible to trade off resolution in the spatial domain for resolution in the frequency domain. Here the four 4x4 luma blocks are merged into a single 8x8 block with half the spatial resolution and twice the frequency resolution. We apply the 2x2 Walsh-Hadamard transform described in Section 3.4 of [I-D.terriberry-netvc-codingtools] to corresponding transform coefficients in four 4x4 blocks to merge them into a single 8x8 block. The low frequency (LF) coefficients are then used with frequency-domain chroma-from-luma.
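
The merge can be sketched as below. This is only an illustration under stated assumptions: it uses an orthonormal 2x2 Walsh-Hadamard butterfly (scale 1/2), whereas the transform in [I-D.terriberry-netvc-codingtools] may differ in scaling, ordering and integer arithmetic. The function name tf_merge_2x2 is hypothetical, and because the placement of the outputs inside the merged 8x8 block is not given here, the sketch returns four separate sub-bands and treats the sum output as the low-frequency (LF) part.

```python
def tf_merge_2x2(b00, b01, b10, b11):
    """Apply a 2x2 WHT across co-located coefficients of four NxN blocks.

    b00, b01 are the top-left and top-right blocks; b10, b11 the bottom pair.
    Returns (ll, lh, hl, hh), each NxN; ll is treated as the LF part.
    """
    n = len(b00)
    ll = [[0.0] * n for _ in range(n)]
    lh = [[0.0] * n for _ in range(n)]
    hl = [[0.0] * n for _ in range(n)]
    hh = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            a, b, c, d = b00[u][v], b01[u][v], b10[u][v], b11[u][v]
            ll[u][v] = (a + b + c + d) / 2.0  # sum of all four blocks
            lh[u][v] = (a - b + c - d) / 2.0  # left minus right
            hl[u][v] = (a + b - c - d) / 2.0  # top minus bottom
            hh[u][v] = (a - b - c + d) / 2.0  # diagonal difference
    return ll, lh, hl, hh


blk = [[float(4 * i + j + 1) for j in range(4)] for i in range(4)]
ll, lh, hl, hh = tf_merge_2x2(blk, blk, blk, blk)
```

With four identical input blocks all of the energy lands in ll, which matches the intuition that the merged block has no detail at the finer spatial positions.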

3. Gain-Shape Quantization

In codecs that use Perceptual Vector Quantization [Val15], an entire block of transform coefficients may be grouped together and jointly quantized. This is done by considering them as an n-dimensional vector and then separating the vector into two intuitive components: its magnitude (gain) and its direction (shape). For an input vector x:

  g = |x|     gain

       x
  u = ---     shape
      |x|

The gain represents how much energy is contained in the block, and the shape indicates where that energy is distributed among the coefficients. The gain is then quantized using scalar quantization, while the shape is quantized by finding the nearest VQ-codeword in an algebraically defined codebook. By explicitly signaling the amount of energy in a block, and roughly where that energy is located, gain-shape quantization is texture preserving. A complete description of PVQ and its other advantages over scalar quantization can be found in [Val15].
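
As a small illustration of the decomposition (before any quantization), the gain and shape of a vector can be computed as follows. The function name gain_shape is hypothetical, and returning an all-zero shape for a zero vector is an assumption made here to keep the sketch total.

```python
import math


def gain_shape(x):
    """Split x into its gain g = |x| and unit-norm shape u = x / |x|."""
    g = math.sqrt(sum(c * c for c in x))
    if g == 0:
        # Assumption: a zero vector has no direction, so return a zero shape.
        return 0.0, [0.0] * len(x)
    return g, [c / g for c in x]


g, u = gain_shape([3.0, 4.0, 0.0])
```

Multiplying the shape by the gain recovers the input, so the two components can be quantized independently.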

3.1. Prediction with PVQ

As described in Section 3 of [I-D.valin-netvc-pvq], this gain-shape quantization scheme can be extended to include prediction while maintaining the advantages of PVQ, e.g., texture-preservation, implicit activity masking, etc. Consider an n-dimensional vector r of predicted coefficients for x. Then the normal to the reflection plane can be computed as:

       r
  v = --- + s * e_m
      |r|

where s * e_m is the signed unit vector in the direction of the axis we would like to reflect r onto. The input vector x can then be reflected across this plane by computing:

                v^T x
  z =  x - 2 * ------- * v
                v^T v

We can measure how well the predictor r matches our input vector x by computing the cosine of the angle theta between them as

                x^T r     z^T r         z_m
  cos(theta) = ------- = ------- = -s * ---
               |x| |r|   |z| |r|        |z|

We select e_m to be the dimension of the largest component of our prediction vector r and s = sgn(r_m). Thus the largest component lies on the m-axis after reflection. When the predictor is good, we expect that the largest component of z will also be in the e_m direction and theta will be small. If we code theta using scalar quantization, we can remove the largest dimension of z and reduce the coding of x to a gain-shape quantization of the remaining n − 1 coefficients where the gain has been reduced to sin(theta) * g. Given a predictor r, the reconstructed coefficients x' are computed as:

x' = g' * (-s * cos(theta') * e_m + sin(theta') * u')

where g', theta' and u' are the reconstructed gain, prediction quality and shape respectively.

When the predictor is poor, theta will be large and the reflection is unlikely to improve coding efficiency. Thus when theta > pi/2 a flag is coded and PVQ with no predictor is used. Conversely when r is exact, theta' is zero and no additional shape information needs to be coded. In addition, because we expect r to have roughly the same amount of energy as x, we use |r| as a predictor for the gain.
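
The reflection, the angle theta and the reconstruction can be sketched together as below, with quantization left out so that the round trip is exact up to floating-point error. All function names are hypothetical. The final step of pvq_reconstruct applies the reflection a second time to return to the original coordinates; that step is an assumption of this sketch, relying on the fact that a reflection is its own inverse. Both x and r are assumed to be nonzero.

```python
import math


def norm(a):
    return math.sqrt(sum(c * c for c in a))


def reflect(x, v):
    """Reflect x across the plane with normal v: x - 2 * (v.x / v.v) * v."""
    scale = 2.0 * sum(a * b for a, b in zip(v, x)) / sum(a * a for a in v)
    return [a - scale * b for a, b in zip(x, v)]


def pvq_predict(x, r):
    """Return (g, theta, u, m, s, v) for input x and predictor r."""
    m = max(range(len(r)), key=lambda i: abs(r[i]))  # largest component of r
    s = 1.0 if r[m] >= 0 else -1.0                   # s = sgn(r_m)
    r_norm = norm(r)
    v = [c / r_norm for c in r]
    v[m] += s                                        # v = r/|r| + s * e_m
    z = reflect(x, v)
    g = norm(z)                                      # reflection keeps |z| = |x|
    cos_theta = max(-1.0, min(1.0, -s * z[m] / g))
    theta = math.acos(cos_theta)
    rest = [c for i, c in enumerate(z) if i != m]    # remaining n - 1 dimensions
    rest_norm = norm(rest)
    u = [c / rest_norm for c in rest] if rest_norm > 0 else [0.0] * len(rest)
    return g, theta, u, m, s, v


def pvq_reconstruct(g, theta, u, m, s, v):
    """Rebuild z = g * (-s*cos(theta)*e_m + sin(theta)*u), then undo the reflection."""
    z = [g * math.sin(theta) * c for c in u]
    z.insert(m, -s * g * math.cos(theta))
    return reflect(z, v)


x = [4.0, -1.0, 2.0, 0.5]
r = [3.5, -0.5, 2.5, 0.0]
g, theta, u, m, s, v = pvq_predict(x, r)
x_rec = pvq_reconstruct(g, theta, u, m, s, v)
```

In a real coder g, theta and u would each be quantized before reconstruction; here they are passed through unchanged to show that the decomposition loses nothing by itself.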

3.2. Chroma-from-Luma using PVQ Prediction

Let us now return to the frequency-domain chroma-from-luma algorithm from Section 2.1 and consider what happens when it is used with gain-shape quantization. As an example, consider a 4x4 chroma block where the 15 AC coefficients are coded using gain-shape quantization with the chroma_AC predictor from Figure 1. The 15-dimensional predictor r is simply a linearly scaled vector of the coincident reconstructed luma coefficients:

chroma_AC(u,v) = alpha_AC * luma_AC(u,v)  =>  r = alpha_AC * x'_L

Thus the shape of the chroma predictor r is exactly that of the reconstructed luma coefficients x'_L with one exception:

 r     alpha_AC * x'_L                     x'_L
--- = ----------------- = sgn(alpha_AC) * ------
|r|   |alpha_AC * x'_L|                   |x'_L|

Because the chroma coefficients are sometimes inversely correlated with the coincident luma coefficients, the linear term alpha_AC can be negative. In these instances the shape of x'_L points in exactly the wrong direction and must be flipped.

Moreover, consider what happens to the gain of x_C when it is predicted from r. The PVQ prediction technique assumes that |r| = |alpha_AC| * |x'_L| is a good predictor of chroma gain g_C = |x_C|. Because alpha_AC for a block is learned from its previously decoded neighbors, often it is based on highly quantized or even zeroed coefficients. When this happens, |alpha_AC| * |x'_L| is no longer a good predictor of g_C and the cost to code |x_C| − |alpha_AC| * |x'_L| using scalar quantization is actually greater than the cost of just coding g_C alone.

4. Algorithm and Implementation

Thus we present a modified version of PVQ prediction that is used just for chroma-from-luma intra-prediction. For each set of chroma coefficients coded by PVQ, the prediction vector r is exactly the coincident luma coefficients. Note that for 4:2:0 video we still need to apply the Time-Frequency resolution switching (TF) described in Section 2.2 to merge the reconstructed coefficients of 4x4 luma blocks to get the coincident predictor x'_L for the corresponding 4x4 chroma block x_C. We determine if we need to flip the predictor by computing the sign of the cosine of the angle between x'_L and x_C:

                           x'_L^T x_C
f = sgn(cos(theta)) = sgn(------------) = sgn(x'_L^T x_C)
                          |x'_L| |x_C|

A negative sign means the angle between the two is greater than pi/2 and negating x'_L is guaranteed to make the angle less than pi/2.

We then code f using a single bit, and the gain g_C using scalar quantization with no predictor. The shape quantization algorithm for x_C is unchanged except that r = f * x'_L. This algorithm has the advantage over frequency-domain chroma-from-luma of being both lower complexity (neither the encoder nor the decoder needs to compute a linear regression per block) and providing better compression (the chroma gain g_C is never incorrectly predicted).

The steps of the encoder algorithm are:

  1. Let r = x'_L
  2. Compute theta, the angle between x'_L and x_C
  3. If theta = 0 (prediction is exact)
    Code theta

  4. Let f = theta > pi/2
  5. Code f
  6. If f, negate r
  7. Code x_C using PVQ with predictor r
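
The flip decision in the steps above can be sketched as follows. The function name cfl_flip is hypothetical, the flag is represented as +1 or -1 rather than as a coded bit, and the PVQ coding of x_C itself is left out. Treating a zero inner product as "no flip" is an assumption of this sketch.

```python
def cfl_flip(luma_rec, chroma):
    """Return (f, r): f = sgn(x'_L^T x_C) and the predictor r = f * x'_L."""
    dot = sum(l * c for l, c in zip(luma_rec, chroma))
    # Only the sign of the inner product matters; the norms are positive.
    f = -1 if dot < 0 else 1
    r = [f * l for l in luma_rec]
    return f, r


# Chroma inversely correlated with the reconstructed luma (illustrative values).
x_l = [4.0, -2.0, 1.0]
x_c = [-2.0, 1.0, -0.4]
f, r = cfl_flip(x_l, x_c)
```

Because only the sign is needed, the encoder never has to compute the norms or the angle explicitly to make this decision.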

4.1. Chroma-from-Luma with Frequency Bands

Up to this point we have only examined the case when all of the AC coefficients for an NxN block are considered together as a single input vector for PVQ prediction. In practice, it may be better to consider portions of the AC coefficients together so partitions of the block where g' = 0 or theta' = 0 are coded more efficiently. Consider the frequency band structure currently used by Daala in Figure 2. The chroma-from-luma using PVQ prediction technique in Section 4 is trivially modified to work with any arbitrary partitioning of block coefficients into bands.

++------+    8x8
++      | +-------+       16x16
|       | |       | +---------------+               32x32
|       | +-------+ |               | +-------------------------------+
+-------+ |       | |               | |                               |
  +---+---+       | |               | |                               |
  |   |           | +---------------+ |                               |
  |   |           | |               | |                               |
  |   |           | |               | |                               |
  +---+-----------+ |               | |                               |
    +-------+-------+               | |                               |
    |       |                       | +-------------------------------+
    |       |                       | |                               |
    |       |                       | |                               |
    |       |                       | |                               |
    |       |                       | |                               |
    |       |                       | |                               |
    |       |                       | |                               |
    +-------+-----------------------+ |                               |
      +---------------+---------------+                               |
      |               |                                               |
      |               |                                               |
      |               |                                               |
      |               |                                               |
      |               |                                               |
      |               |                                               |
      |               |                                               |
      |               |                                               |
      |               |                                               |
      |               |                                               |
      |               |                                               |
      |               |                                               |
      |               |                                               |
      |               |                                               |
      |               |                                               |

Figure 2: The band structure of 4x4, 8x8, 16x16 and 32x32 blocks in Daala.

Instead of considering whether to flip the direction of x'_L for each band partition individually (a signaling cost of 10 bits per 32x32 block at best), simply look at the lowest 4x4 AC partition and use the flip decision there for the entire block. The assumption is that having those larger low frequency coefficients predicted well is far more important than getting it exactly right at higher frequencies. When the quantization step size is large, the high frequency coefficients will be sent to zero regardless.
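
A minimal sketch of sharing one flip decision across bands. The band layout here is purely illustrative: each band is given as a list of indices into a flattened coefficient vector, which is an assumption of this sketch and not the Daala band layout. The function names are hypothetical.

```python
def block_flip(luma_rec, chroma, first_band):
    """Decide a single flip for the whole block from the lowest band only."""
    dot = sum(luma_rec[i] * chroma[i] for i in first_band)
    return -1 if dot < 0 else 1


def band_predictors(luma_rec, bands, f):
    """Build the per-band predictor r = f * x'_L using the shared flip f."""
    return [[f * luma_rec[i] for i in band] for band in bands]


# Hypothetical partition of 8 coefficients into a low band and a high band.
bands = [[0, 1, 2], [3, 4, 5, 6, 7]]
x_l = [6.0, -3.0, 2.0, 0.5, -0.25, 0.0, 0.1, 0.0]
x_c = [-3.0, 1.0, -1.5, 0.3, 0.0, 0.0, 0.0, 0.0]
f = block_flip(x_l, x_c, bands[0])
predictors = band_predictors(x_l, bands, f)
```

In this example the high band would have preferred the opposite sign, but it carries little energy, so a single flag taken from the low band is used for the whole block.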

5. Informative References

[I-D.egge-netvc-tdlt] Egge, N. and T. Terriberry, "Time Domain Lapped Transforms for Video Coding", Internet-Draft draft-egge-netvc-tdlt-00, July 2015.
[I-D.terriberry-netvc-codingtools] Terriberry, T., "Coding Tools for a Next Generation Video Codec", Internet-Draft draft-terriberry-netvc-codingtools-00, June 2015.
[I-D.valin-netvc-pvq] Valin, J., "Pyramid Vector Quantization for Video Coding", Internet-Draft draft-valin-netvc-pvq-00, June 2015.
[Lee09] Lee, SH. and NI. Cho, "Intra Prediction Method Based on the Linear Relationship between the Channels for YUV 4:2:0 Intra Coding", Proceedings of the 16th IEEE International Conference on Image Processing, November 2009.
[Val15] Valin, JM. and TB. Terriberry, "Perceptual Vector Quantization for Video Coding", Proceedings of SPIE Visual Information Processing and Communication, February 2015.
[Daala-website] "Daala website", Xiph.Org Foundation.

Author's Address

Nathan E. Egge
Mozilla Corporation
331 E. Evelyn Avenue
Mountain View, CA 94041
USA

Phone: +1 650 903-0800
EMail: