
About the y_likelihoods in Entropy models #314

Closed
chunbaobao opened this issue Oct 25, 2024 · 4 comments
@chunbaobao

Hi, this is excellent work. I have recently been working in the area of learned compression. I have already checked issue #306, but I am still confused about the actual meaning of y_likelihoods:

  1. The shape of y_likelihoods is equal to the shape of y, and I thought each element of y_likelihoods represents the probability $p(y_i)$, where $p(y_i)$ is the probability that the symbol $y_i$ appears. However, when I run the code below from the demo,
torch.sum(out_net['likelihoods']['y']).item()/torch.prod(torch.tensor(out_net['likelihoods']['y'].shape)).item()
# result = 0.9271195729573568

it produces an output where the elements of y_likelihoods are close to 1 everywhere. Isn't this contradictory? Shouldn't the sum of y_likelihoods be 1? What is the actual meaning of y_likelihoods?

  2. The calculation for bpp is $\sum_i -\log_2(p(y_i))/N$, but the information entropy is actually $\sum_i -p(y_i)\log_2(p(y_i))$. I'm just wondering where the $p(y_i)$ went. Is it because $\sum_i -\log_2(p(y_i))/N$ is an upper bound of $\sum_i -p(y_i)\log_2(p(y_i))$, from the inequality in the attached image?
    [attached image: inequality]
@YodaEmbedding (Contributor) commented Oct 28, 2024

For each $i$, there is a discrete probability distribution $p_i$, i.e.,

$$\sum_{t \in \mathbb{Z}} p_i(t) = 1.$$

There are many such probability distributions -- one for each element in $\hat{y}$.

Each element $\hat{y}_i$ is encoded using its corresponding $p_i$. We can measure the "likelihood" $l_i$ for each element:

$$l_i = p_i(\hat{y}_i)$$

...and since the rate is the negative log-likelihood, the bit cost of the $i$-th element is:

$$R_i = -\log_2 l_i$$

The total rate cost is then:

$$R = \sum_i R_i$$

...which can be averaged over $N$ pixels to obtain the bpp.
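As a minimal sketch (not the exact CompressAI training code, and assuming the `out_net` and input tensor from the demo above), this is how the bpp follows from the likelihoods:

```python
import torch

def bpp_from_likelihoods(likelihoods: torch.Tensor, num_pixels: int) -> torch.Tensor:
    """Per-element bit costs R_i = -log2(l_i), summed and averaged over pixels."""
    total_bits = -torch.log2(likelihoods).sum()  # R = sum_i R_i
    return total_bits / num_pixels               # bits per pixel

# out_net = net(x)                                # forward pass, as in the demo
# num_pixels = x.size(0) * x.size(2) * x.size(3)  # batch * H * W
# bpp = sum(bpp_from_likelihoods(l, num_pixels) for l in out_net["likelihoods"].values())
```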


For a single fixed encoding distribution $p$, the average rate cost for encoding a single symbol that is drawn from the same distribution $p$ is:

$$R = \sum_t - p(t) \, \log p(t)$$

But this is not what we're doing. What we're actually interested in is the cross-entropy. That is the average rate cost for encoding a single symbol drawn from the true distribution $\hat{p}$:

$$R = \sum_t - \hat{p}(t) \, \log p(t)$$

To be consistent with our notation above, we should also sprinkle in some $i$'s:

$$R_i = \sum_t - \hat{p}_i(t) \, \log p_i(t)$$

In our case, we know exactly what $\hat{p}$ is...

$$\hat{p}_i(t) = \delta[t - \hat{y}_i] = \begin{cases} 1 & \text{if } t = \hat{y}_i \\ 0 & \text{otherwise} \end{cases} $$

If we plug this into the earlier equation, the rate cost for encoding the $i$-th element becomes:

$$R_i = -\log p_i(\hat{y}_i)$$
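A quick numeric check of this collapse, using a made-up discrete $p_i$ (hypothetical values, purely for illustration):

```python
import torch

# Hypothetical discrete distribution p_i over symbols t = -2, ..., 2 (made-up values).
symbols = torch.arange(-2, 3)
p_i = torch.tensor([0.05, 0.20, 0.50, 0.20, 0.05])
y_hat_i = 1  # the symbol that was actually produced

# The "true" distribution p_hat_i is a delta located at y_hat_i.
p_hat_i = (symbols == y_hat_i).float()

cross_entropy = -(p_hat_i * torch.log2(p_i)).sum()       # sum_t -p_hat_i(t) log2 p_i(t)
direct = -torch.log2(p_i[symbols == y_hat_i]).squeeze()  # -log2 p_i(y_hat_i)

assert torch.isclose(cross_entropy, direct)  # both ≈ 2.32 bits
```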

@chunbaobao (Author)

Thank you for your reply; it has been incredibly helpful! However, I still have several stupid questions:

  1. In the update() function, we need to add pmf_start to the samples, and the symbols need to subtract offset in the C++ interface, so that we get $p(y_i - \text{offset} = t + \text{pmf\_start})$, where $t \in \{0, 1, 2, \ldots, N\}$. I'm just wondering why we can't simply set sample = torch.arange(max_length) and omit the pmf_start and offset steps to get $p(y_i = t)$, where $t \in \{0, 1, 2, \ldots, N\}$? What's the meaning of pmf_start and offset here? Is it because the minimum of the latent y is not zero?

samples = torch.arange(max_length, device=device)
samples = samples[None, :] + pmf_start[:, None, None]

int32_t value = symbols[i] - offsets[cdf_idx];

  2. Similar to Q1: if it is indeed $p(y_i - \text{offset} = t + \text{pmf\_start})$, wouldn't it be more meaningful to express it as below?
    prob = torch.cat((tail_mass[i] / 2, p[: pmf_length[i]], tail_mass[i] / 2), dim=0)

def _pmf_to_cdf(self, pmf, tail_mass, pmf_length, max_length):
    cdf = torch.zeros(
        (len(pmf_length), max_length + 2), dtype=torch.int32, device=pmf.device
    )
    for i, p in enumerate(pmf):
        prob = torch.cat((p[: pmf_length[i]], tail_mass[i]), dim=0)
        _cdf = pmf_to_quantized_cdf(prob, self.entropy_coder_precision)
        cdf[i, : _cdf.size(0)] = _cdf
    return cdf

@YodaEmbedding (Contributor) commented Nov 18, 2024

  1. Most values of y are expected to occur between quantiles[0] and quantiles[2] with a total probability of 1 - tail_mass. That is, P(quantiles[0] ≤ y ≤ quantiles[2]) = 1 - tail_mass.

When converting these to a discrete distribution over the range [0, pmf_length], we:

  • Round quantiles[0] and quantiles[2] outwards (i.e. floor and ceil).
  • Determine the lengths of the distributions (pmf_length) and the value that the 0th symbol represents (pmf_start).
  • Calculate pmf. We need to evaluate it at all the samples/points it represents. samples = torch.arange(max_length) + pmf_start[:, None]. (The code also includes an empty dimension in the middle for compatibility with _likelihood.)
  • Reevaluate the tail_mass (should be equal to the sum of our new pmf).
  • Finally, create the quantized_cdf that we will feed into the C++ entropy coder.


P.S. In the EntropyBottleneck, we use a single distribution per channel, as visualized in Figure 4a.
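The steps listed above can be condensed into a schematic sketch. This is not the library's exact update() code: `likelihood_fn` is a hypothetical stand-in for the entropy model's internal likelihood evaluation, and the tail mass is simplified to whatever probability falls outside the support.

```python
import torch

def build_discrete_pmfs(quantiles: torch.Tensor, likelihood_fn):
    """Schematic version of the steps above.

    quantiles: (C, 3) tensor holding [quantiles[0], median, quantiles[2]] per channel.
    likelihood_fn: hypothetical callable returning p_c(t) at the given sample points.
    """
    medians = quantiles[:, 1]

    # Round the support outwards so it covers [quantiles[0], quantiles[2]].
    minima = torch.ceil(medians - quantiles[:, 0]).int().clamp(min=0)
    maxima = torch.ceil(quantiles[:, 2] - medians).int().clamp(min=0)

    # Value represented by the 0th symbol (pmf_start) and the per-channel support length.
    offset = -minima                     # subtracted from symbols on the C++ side
    pmf_start = medians - minima
    pmf_length = maxima + minima + 1
    max_length = int(pmf_length.max())

    # Evaluate the pmf at every point the (padded) support represents.
    samples = torch.arange(max_length, device=quantiles.device)
    samples = samples[None, :] + pmf_start[:, None]
    pmf = likelihood_fn(samples)         # shape (C, max_length)

    # Simplification: treat everything outside the support as tail mass.
    tail_mass = (1.0 - pmf.sum(dim=1, keepdim=True)).clamp(min=0)

    return pmf, tail_mass, pmf_length, max_length, offset
```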


  2. Values that are out of range (i.e. on the left of quantiles[0] or on the right of quantiles[2]) are encoded using the "bypass mode". This is done by encoding a max value symbol, then encoding the out-of-range value with the bypass method.
    if (value == max_value) {
      /* Bypass decoding mode */
      int32_t val = Rans64DecGetBits(&rans, &ptr, bypass_precision);
      int32_t n_bypass = val;

      while (val == max_bypass_val) {
        val = Rans64DecGetBits(&rans, &ptr, bypass_precision);
        n_bypass += val;
      }

      int32_t raw_val = 0;
      for (int j = 0; j < n_bypass; ++j) {
        val = Rans64DecGetBits(&rans, &ptr, bypass_precision);
        assert(val <= max_bypass_val);
        raw_val |= val << (j * bypass_precision);
      }

      value = raw_val >> 1;
      if (raw_val & 1) {
        value = -value - 1;
      } else {
        value += max_value;
      }
    }

Since max value symbols (i.e. out-of-range values to the left of quantiles[0] or to the right of quantiles[2]) are expected to occur with probability of tail_mass, we use:

prob = torch.cat((p[: pmf_length[i]], tail_mass[i]), dim=0) 

Notice that the extra out-of-range symbol means the quantized pmf we actually use has size pmf_length + 1, and its quantized cdf has size pmf_length + 2.
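To make the size bookkeeping concrete, here is a tiny example with made-up numbers (illustration only, not taken from a real model):

```python
import torch

# Made-up per-channel pmf over pmf_length = 5 support points.
p = torch.tensor([0.05, 0.15, 0.55, 0.15, 0.05])
tail_mass = torch.tensor([0.05])  # mass outside [quantiles[0], quantiles[2]]

# The extra "max value" symbol absorbs the tail mass, so the pmf fed to the
# quantizer has pmf_length + 1 entries ...
prob = torch.cat((p, tail_mass), dim=0)
assert prob.numel() == 5 + 1

# ... and the quantized CDF, which prepends a leading zero, ends up with
# pmf_length + 2 entries.
cdf_length = prob.numel() + 1
assert cdf_length == 5 + 2
```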

@chunbaobao (Author)

Got it! Thank you so much! Your answers really help a lot. Appreciate your sharing!
