Network Working Group                                           N. Egge
Internet-Draft                                      Mozilla Corporation
Intended status: Informational                         October 18, 2015
Expires: April 20, 2016


              Chroma-from-Luma Intraprediction for NETVC
                        draft-egge-netvc-cfl-00

Abstract

   This document proposes a scheme for predicting chroma coefficients
   from reconstructed luma coefficients in the frequency domain.  When
   this technique is used with Perceptual Vector Quantization (PVQ),
   the expensive parameter fitting step can be completely omitted.

Status of This Memo

   This Internet-Draft is submitted in full conformance with the
   provisions of BCP 78 and BCP 79.

   Internet-Drafts are working documents of the Internet Engineering
   Task Force (IETF).  Note that other groups may also distribute
   working documents as Internet-Drafts.  The list of current
   Internet-Drafts is at http://datatracker.ietf.org/drafts/current/.

   Internet-Drafts are draft documents valid for a maximum of six
   months and may be updated, replaced, or obsoleted by other documents
   at any time.  It is inappropriate to use Internet-Drafts as
   reference material or to cite them other than as "work in progress."

   This Internet-Draft will expire on April 20, 2016.

Copyright Notice

   Copyright (c) 2015 IETF Trust and the persons identified as the
   document authors.  All rights reserved.

   This document is subject to BCP 78 and the IETF Trust's Legal
   Provisions Relating to IETF Documents
   (http://trustee.ietf.org/license-info) in effect on the date of
   publication of this document.  Please review these documents
   carefully, as they describe your rights and restrictions with
   respect to this document.  Code Components extracted from this
   document must include Simplified BSD License text as described in
   Section 4.e of the Trust Legal Provisions and are provided without
   warranty as described in the Simplified BSD License.

Table of Contents

   1.  Introduction
   2.  Chroma from Luma Prediction
     2.1.  Extension to Frequency Domain
     2.2.  Time-Frequency Resolution Switching
   3.  Gain-Shape Quantization
     3.1.  Prediction with PVQ
     3.2.  Chroma-from-Luma using PVQ Prediction
   4.  Algorithm and Implementation
     4.1.  Chroma-from-Luma with Frequency Bands
   5.  Informative References
   Author's Address

1.  Introduction

   Still image and video codecs typically consider the problem of
   intra-prediction in the spatial domain.  A predicted image is
   generated on a block-by-block basis using the previously
   reconstructed neighboring blocks for reference, and the residual is
   encoded using standard entropy coding techniques.  Modern codecs use
   the boundary pixels of the neighboring blocks along with a
   directional mode to predict the pixel values across the target
   block.  These directional predictors are cheap to compute (often
   directly copying pixel values or applying a simple linear kernel),
   exploit local coherency (with low error near the neighbors), and
   predict hard-to-code features (extending sharp directional edges
   across the block).

   In codecs that use lapped transforms [I-D.egge-netvc-tdlt], the
   reconstructed pixel data is not available.
   The challenge here is that the neighboring spatial image data is not
   available until after the target block has been decoded and the
   appropriate unlapping filter has been applied across the block
   boundaries.  A promising technique was proposed in [Lee09] to
   predict the chroma channels using the spatially coincident
   reconstructed luma channel.  We propose a different technique that
   adapts spatial chroma-from-luma intra-prediction for use with
   frequency-domain coefficients.  We call this algorithm
   frequency-domain chroma-from-luma.

   More recently, work on the Daala video codec [Daala-website] has
   included replacing scalar quantization with gain-shape quantization
   [Val15].  We show that when prediction is used with gain-shape
   quantization, it is possible to design a frequency-domain
   chroma-from-luma predictor without the added encoder and decoder
   overhead.

2.  Chroma from Luma Prediction

   In spatial-domain chroma-from-luma, the key observation is that the
   local correlation between luminance and chrominance can be exploited
   using a linear prediction model.  For the target block, the chroma
   values can be estimated from the reconstructed luma values as:

      chroma(u,v) = alpha * luma(u,v) + beta

   where the model parameters alpha and beta are computed as a linear
   least-squares regression using 2*N pairs of spatially coincident
   luma and chroma pixel values along the block boundary.

          L L L L L L L L
          L L L L L L L L
      L L . . . . . . . .
      L L . . . . . . . .        L L L L      C C C C
      L L . . . . . . . .      L . . . .    C . . . .
      L L . . . . . . . .  =>  L . . . .    C . . . .
      L L . . . . . . . .      L . . . .    C . . . .
      L L . . . . . . . .      L . . . .    C . . . .
      L L . . . . . . . .
      L L . . . . . . . .

   Predicting a 4x4 chroma block uses 2*N = 8 pairs of spatially
   coincident pixel values.  When using 4:2:0 input, the 8x8 block of
   luma pixels must first be down-sampled to 4x4.

2.1.  Extension to Frequency Domain

   In codecs that use lapped transforms, the neighboring reconstructed
   pixel data is not available for use in spatial prediction.  However,
   the transform coefficients in the lapped frequency domain are the
   product of two linear transforms: the linear pre-filter followed by
   the linear forward DCT.  Thus the same assumption of a linear
   correlation between luma and chroma coefficients holds.  In
   addition, we can take advantage of the fact that prediction is done
   in the frequency domain to use only a small subset of coefficients
   when computing the model parameters.  The chroma values can thus be
   estimated using frequency-domain chroma-from-luma:

      chroma_DC      = alpha_DC * luma_DC + beta_DC

      chroma_AC(u,v) = alpha_AC * luma_AC(u,v)

    Figure 1: Predicting chroma from luma using linear regression in
                         the frequency domain

   where alpha_DC and beta_DC are computed using a linear regression
   with the DC coefficients of the three neighboring blocks: up, left
   and up-left.  When estimating chroma_AC(u,v), we can omit the
   constant offset beta_AC, as we expect the AC coefficients to be zero
   mean.  Additionally, we need not include all of the AC coefficients
   from the three neighboring blocks when computing alpha_AC; a small
   fixed subset suffices, as shown in the figures below.
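   As a non-normative illustration of the fit itself (which
   coefficients feed it is shown in the figures that follow), the C
   sketch below fits the model parameters from coincident
   (luma, chroma) coefficient pairs gathered from the neighbors.  The
   function names and double-precision arithmetic are ours, not part
   of any codec: the DC parameters come from an ordinary least-squares
   line fit, and alpha_AC from a no-intercept fit.

   /* Ordinary least-squares fit of chroma = alpha*luma + beta over n
      coincident DC pairs (here n = 3: up, left and up-left). */
   static void cfl_fit_dc(const double *luma, const double *chroma,
    int n, double *alpha_dc, double *beta_dc) {
     double sx = 0, sy = 0, sxx = 0, sxy = 0, d;
     int i;
     for (i = 0; i < n; i++) {
       sx += luma[i];
       sy += chroma[i];
       sxx += luma[i]*luma[i];
       sxy += luma[i]*chroma[i];
     }
     d = n*sxx - sx*sx;
     *alpha_dc = d != 0 ? (n*sxy - sx*sy)/d : 0;
     *beta_dc = (sy - *alpha_dc*sx)/n;
   }

   /* No-intercept fit of chroma = alpha*luma over n coincident AC
      pairs (here n = 9: the three lowest AC coefficients from each
      of the three neighbors); beta_AC is omitted because the AC
      coefficients are zero mean. */
   static double cfl_fit_ac(const double *luma, const double *chroma,
    int n) {
     double sxx = 0, sxy = 0;
     int i;
     for (i = 0; i < n; i++) {
       sxx += luma[i]*luma[i];
       sxy += luma[i]*chroma[i];
     }
     return sxx != 0 ? sxy/sxx : 0;
   }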
      L . . .|L . . .        C . . .|C . . .
      . . . .|. . . .        . . . .|. . . .
      . . . .|. . . .        . . . .|. . . .
      . . . .|. . . .        . . . .|. . . .
      -------+-------        -------+-------
      L . . .|. . . .        C . . .|. . . .
      . . . .|. . . .        . . . .|. . . .
      . . . .|. . . .        . . . .|. . . .
      . . . .|. . . .        . . . .|. . . .

   The three pairs of DC coefficients used to compute alpha_DC and
   beta_DC.

      . L . .|. L . .        . C . .|. C . .
      L L . .|L L . .        C C . .|C C . .
      . . . .|. . . .        . . . .|. . . .
      . . . .|. . . .        . . . .|. . . .
      -------+-------        -------+-------
      . L . .|. . . .        . C . .|. . . .
      L L . .|. . . .        C C . .|. . . .
      . . . .|. . . .        . . . .|. . . .
      . . . .|. . . .        . . . .|. . . .

   The nine pairs of AC coefficients used to compute alpha_AC.

   It is sufficient to use the three lowest AC coefficients from the
   neighboring blocks.  This means that the number of input pairs is
   constant regardless of the size of the chroma block being predicted.
   Moreover, the input AC coefficients have semantic meaning: we use
   the strongest horizontal, vertical and diagonal components.  This
   has the effect of preserving features across the block, as edges are
   correlated between luma and chroma.

2.2.  Time-Frequency Resolution Switching

   When image data is 4:4:4 or 4:2:0, the chroma and luma blocks are
   aligned so that the lowest 3 AC coefficients describe the same
   frequency range.  In codecs that support multiple block sizes (or
   that support 4:2:2 image data), the luma blocks and the chroma
   blocks are not always aligned.  For example, in the Daala video
   codec [Daala-website] the smallest block size supported is 4x4.  In
   4:2:0, when an 8x8 block of luma image data is split into four 4x4
   blocks, the corresponding 4x4 chroma image data is still coded as a
   single 4x4 block.  This is a problem for frequency-domain
   chroma-from-luma, which requires the reconstructed luma
   frequency-domain coefficients to cover the same spatial extent.

   Using Time-Frequency resolution switching (TF), it is possible to
   trade off resolution in the spatial domain for resolution in the
   frequency domain.  Here the four 4x4 luma blocks are merged into a
   single 8x8 block with half the spatial resolution and twice the
   frequency resolution.  We apply the 2x2 Walsh-Hadamard transform
   described in Section 3.4 of [I-D.terriberry-netvc-codingtools] to
   corresponding transform coefficients in the four 4x4 blocks to merge
   them into a single 8x8 block.  The low-frequency (LF) coefficients
   are then used with frequency-domain chroma-from-luma.

3.  Gain-Shape Quantization

   In codecs that use Perceptual Vector Quantization [Val15], an entire
   block of transform coefficients may be grouped together and jointly
   quantized.  This is done by considering the coefficients as an
   n-dimensional vector and then separating the vector into two
   intuitive components: its magnitude (gain) and its direction
   (shape).  For an input vector x:

      g = |x|      gain

           x
      u = ---      shape
          |x|

   The gain represents how much energy is contained in the block, and
   the shape indicates where that energy is distributed among the
   coefficients.  The gain is then quantized using scalar quantization,
   while the shape is quantized by finding the nearest VQ codeword in
   an algebraically defined codebook.  By explicitly signaling the
   amount of energy in a block, and roughly where that energy is
   located, gain-shape quantization is texture preserving.  A complete
   description of PVQ and its other advantages over scalar quantization
   can be found in [Val15].
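   The decomposition itself is straightforward.  The following
   non-normative C sketch (names are ours) splits a coefficient vector
   into the gain and shape defined above; the codebook search that
   actually quantizes the shape is described in [Val15] and omitted
   here.

   #include <math.h>

   /* Split an n-dimensional coefficient vector x into its gain
      g = |x| and its unit-norm shape u = x/|x|.  Returns the gain.
      The shape would subsequently be matched against an algebraic
      VQ codebook; that search is out of scope for this sketch. */
   static double gain_shape_split(const double *x, double *u, int n) {
     double g = 0;
     int i;
     for (i = 0; i < n; i++) g += x[i]*x[i];
     g = sqrt(g);
     for (i = 0; i < n; i++) u[i] = g > 0 ? x[i]/g : 0;
     return g;
   }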
3.1.  Prediction with PVQ

   As described in Section 3 of [I-D.valin-netvc-pvq], this gain-shape
   quantization scheme can be extended to include prediction while
   maintaining the advantages of PVQ, e.g., texture preservation,
   implicit activity masking, etc.  Consider an n-dimensional vector r
   of predicted coefficients for x.  Then the normal to the reflection
   plane can be computed as:

           r
      v = --- + s * e_m
          |r|

   where s * e_m is the signed unit vector in the direction of the axis
   we would like to reflect r onto.  The input vector x can then be
   reflected across this plane by computing:

                  v^T x
      z = x - 2 * ----- * v
                  v^T v

   We can measure how well the predictor r matches our input vector x
   by computing the cosine of the angle theta between them as:

                     x^T r     z^T r          z_m
      cos(theta) = ------- = ------- = -s * -----
                   |x| |r|   |z| |r|         |z|

   We select e_m to be the dimension of the largest component of our
   prediction vector r and s = sgn(r_m).  Thus the largest component
   lies on the m-axis after reflection.  When the predictor is good, we
   expect that the largest component of z will also be in the e_m
   direction and theta will be small.  If we code theta using scalar
   quantization, we can remove the largest dimension of z and reduce
   the coding of x to a gain-shape quantization of the remaining n - 1
   coefficients, where the gain has been reduced to sin(theta) * g.
   Given a predictor r, the reconstructed coefficients x' are computed
   as:

      x' = g' * (-s * cos(theta') * e_m + sin(theta') * u')

   where g', theta' and u' are the reconstructed gain, prediction
   quality and shape, respectively.

   When the predictor is poor, theta will be large and the reflection
   is unlikely to improve coding efficiency.  Thus when theta > pi/2, a
   flag is coded and PVQ with no predictor is used.  Conversely, when r
   is exact, theta' is zero and no additional shape information needs
   to be coded.  In addition, because we expect r to have roughly the
   same amount of energy as x, we use |r| as a predictor for the gain.

3.2.  Chroma-from-Luma using PVQ Prediction

   Let us now return to the frequency-domain chroma-from-luma algorithm
   from Section 2.1 and consider what happens when it is used with
   gain-shape quantization.  As an example, consider a 4x4 chroma block
   whose 15 AC coefficients are coded using gain-shape quantization
   with the chroma_AC predictor from Figure 1.  The 15-dimensional
   predictor r is simply a linearly scaled vector of the coincident
   reconstructed luma coefficients:

      chroma_AC(u,v) = alpha_AC * luma_AC(u,v)  =>  r = alpha_AC * x'_L

   Thus the shape of the chroma predictor r is exactly that of the
   reconstructed luma coefficients x'_L, with one exception:

       r    alpha_AC * x'_L                      x'_L
      --- = ----------------- = sgn(alpha_AC) * ------
      |r|   |alpha_AC * x'_L|                   |x'_L|

   Because the chroma coefficients are sometimes inversely correlated
   with the coincident luma coefficients, the linear term alpha_AC can
   be negative.  In these instances the shape of x'_L points in exactly
   the wrong direction and must be flipped.

   Moreover, consider what happens to the gain of x_C when it is
   predicted from r.  The PVQ prediction technique assumes that
   |r| = alpha_AC * |x'_L| is a good predictor of the chroma gain
   g_C = |x_C|.  Because alpha_AC for a block is learned from its
   previously decoded neighbors, it is often based on highly quantized
   or even zeroed coefficients.  When this happens,
   alpha_AC * |x'_L| is no longer a good predictor of g_C, and the cost
   to code |x_C| - alpha_AC * |x'_L| using scalar quantization is
   actually greater than the cost of just coding g_C alone.
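   To make the reflection of Section 3.1 concrete, here is a
   non-normative C sketch (the buffer bound MAX_N and all names are
   ours) that builds v from a predictor r, reflects x across the
   plane, and recovers theta.  It assumes r and x are nonzero and
   performs no quantization.

   #include <math.h>

   #define MAX_N 1024  /* assumed bound on the band size */

   /* Reflect x across the plane with normal v = r/|r| + s*e_m, where
      m indexes the largest-magnitude component of r and s = sgn(r_m),
      writing the result to z.  Returns theta, the angle between x
      and r. */
   static double pvq_reflect(const double *r, const double *x,
    double *z, int n, int *out_m, double *out_s) {
     double v[MAX_N];
     double nr = 0, vv = 0, vx = 0, nz = 0, best = 0, s, c;
     int i, m = 0;
     for (i = 0; i < n; i++) {
       nr += r[i]*r[i];
       if (fabs(r[i]) > best) { best = fabs(r[i]); m = i; }
     }
     nr = sqrt(nr);
     s = r[m] < 0 ? -1 : 1;
     for (i = 0; i < n; i++) v[i] = r[i]/nr + (i == m ? s : 0);
     for (i = 0; i < n; i++) { vv += v[i]*v[i]; vx += v[i]*x[i]; }
     /* z = x - 2*(v^T x)/(v^T v)*v: this maps r/|r| onto -s*e_m, so
        the largest predicted component lands on the m-axis. */
     for (i = 0; i < n; i++) z[i] = x[i] - 2*vx/vv*v[i];
     for (i = 0; i < n; i++) nz += z[i]*z[i];
     nz = sqrt(nz);
     /* cos(theta) = -s*z_m/|z|, clamped for numerical safety. */
     c = -s*z[m]/nz;
     if (c > 1) c = 1;
     if (c < -1) c = -1;
     *out_m = m;
     *out_s = s;
     return acos(c);
   }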
4.  Algorithm and Implementation

   Thus we present a modified version of PVQ prediction that is used
   just for chroma-from-luma intra-prediction.  For each set of chroma
   coefficients coded by PVQ, the prediction vector r is exactly the
   coincident luma coefficients.  Note that for 4:2:0 video we still
   need to apply the Time-Frequency resolution switching (TF) described
   in Section 2.2 to merge the reconstructed coefficients of the four
   4x4 luma blocks to get the coincident predictor x'_L for the
   corresponding 4x4 chroma block x_C.  We determine whether we need to
   flip the predictor by computing the sign of the cosine of the angle
   between x'_L and x_C:

                                  x'_L^T x_C
      f = sgn(cos(theta)) = sgn(-------------) = sgn(x'_L^T x_C)
                                 |x'_L| |x_C|

   A negative sign means the angle between the two is greater than
   pi/2, and negating x'_L is guaranteed to make the angle less than
   pi/2.  We then code f using a single bit, and the gain g_C using
   scalar quantization with no predictor.  The shape quantization
   algorithm for x_C is unchanged, except that r = f * x'_L.  This
   algorithm has the advantage over frequency-domain chroma-from-luma
   of being both lower complexity (neither the encoder nor the decoder
   needs to compute a linear regression per block) and providing better
   compression (the chroma gain g_C is never incorrectly predicted).

   The steps of the encoder algorithm are:

   1.  Let r = x'_L.

   2.  Compute theta, the angle between x'_L and x_C.

   3.  If theta = 0 (the prediction is exact):

       (a)  Code theta.

       (b)  Stop.

   4.  Let f = theta > pi/2.

   5.  Code f.

   6.  If f, negate r.

   7.  Code x_C using PVQ with predictor r.

4.1.  Chroma-from-Luma with Frequency Bands

   Up to this point we have only examined the case where all of the AC
   coefficients of an NxN block are considered together as a single
   input vector for PVQ prediction.  In practice, it may be better to
   consider portions of the AC coefficients together, so that
   partitions of the block where g' = 0 or theta' = 0 are coded more
   efficiently.  Consider the frequency band structure currently used
   by Daala in Figure 2.  The chroma-from-luma using PVQ prediction
   technique in Section 4 is trivially modified to work with any
   arbitrary partitioning of the block coefficients into bands.

     Figure 2: The band structure of 4x4, 8x8, 16x16 and 32x32 blocks
                                 in Daala

   Instead of considering whether to flip the direction of x'_L for
   each band partition individually (a signaling cost of 10 bits per
   32x32 block at best), we simply look at the lowest 4x4 AC partition
   and use the flip decision there for the entire block, as in the
   sketch below.  The assumption is that predicting the larger
   low-frequency coefficients well is far more important than getting
   the flip exactly right at the higher frequencies.  When the
   quantization step size is large, the high-frequency coefficients
   will be quantized to zero regardless.
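   As a non-normative illustration of this whole-block flip decision
   (function and parameter names are ours), the flag can be computed
   from just the lowest 4x4 AC band and then reused for every band:

   /* Compute the flip flag f = sgn(x'_L^T x_C) over the 15 AC
      coefficients of the lowest 4x4 band only.  Every band of the
      block then codes x_C with PVQ using r = x'_L, negated when the
      flag is set, so a 32x32 block spends one flip bit instead of
      ten. */
   static int cfl_flip(const double *luma_lo, const double *chroma_lo,
    int n) {
     double dot = 0;
     int i;
     for (i = 0; i < n; i++) dot += luma_lo[i]*chroma_lo[i];
     return dot < 0;
   }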
5.  Informative References

   [I-D.egge-netvc-tdlt]
              Egge, N. and T. Terriberry, "Time Domain Lapped
              Transforms for Video Coding", draft-egge-netvc-tdlt-00
              (work in progress), July 2015.

   [I-D.terriberry-netvc-codingtools]
              Terriberry, T., "Coding Tools for a Next Generation Video
              Codec", draft-terriberry-netvc-codingtools-00 (work in
              progress), June 2015.

   [I-D.valin-netvc-pvq]
              Valin, J., "Pyramid Vector Quantization for Video
              Coding", draft-valin-netvc-pvq-00 (work in progress),
              June 2015.

   [Lee09]    Lee, SH. and NI. Cho, "Intra Prediction Method Based on
              the Linear Relationship between the Channels for YUV
              4:2:0 Intra Coding", Proceedings of the 16th IEEE
              International Conference on Image Processing, November
              2009.

   [Val15]    Valin, JM. and TB. Terriberry, "Perceptual Vector
              Quantization for Video Coding", Proceedings of SPIE
              Visual Information Processing and Communication,
              February 2015.

   [Daala-website]
              Xiph.Org Foundation, "Daala website".

Author's Address

   Nathan E. Egge
   Mozilla Corporation
   331 E. Evelyn Avenue
   Mountain View, CA 94041
   USA

   Phone: +1 650 903-0800
   Email: negge@xiph.org