Skip to content
This repository has been archived by the owner on Jul 7, 2020. It is now read-only.

TDigest percentiles are off on the last centroid (and close by percentiles have inverted values) #83

Open
phoenixdownita opened this issue Sep 16, 2014 · 7 comments

Comments

@phoenixdownita
Copy link

I have a serialized TDigest in Base64:
AAAAAkBZAAAAAAAAAAABLkkzm5BGwGAARy0nAEYp0ABGlTYARLegAEU48ABHBuMARTgQAESsIABFSPAARDqAAEaLFgBEpKAARYYwAEQVgABE9mAARTdgAESuYABD3QAARiY4AEWVgABDMwAARTeAAEXByABEYcAARajAAEQwwABF5RgARShwAENWAABF/WAARMAgAEQxAABFGEAAQsoAAEUZMABF4wAARaVgAEV54ABFbuAARVYgAEQqQABGLPwARKngAEQwgABEhOAARIzgAER7gABEoaAARPOgAEXFoABFLkAARHWAAEVDwABE2WAAQ3kAAESfIABFdbAARX6gAESHAABELoAARXxQAETDgABDNQAARa5wAEU6oABFIkAARcg4AESlIABEpDAARcM0AEWy+ABFZlgARPSwAEVRcABE+aAAQmQAAESAgABF22wARZh4AESmUABEbkAAQ+yAAEQ24ABFb3AARdYoAEXyNABD70AARgjMq0WGQqtFxTgARQDAAEUBAABCNAAARRCQAEU/aABEy5AARPlAAETwkABFL1gARa0wAESiYABFR/AARJaVVUUmLVVF5uQARbnSq0Wc7VVFXVAARaTwAEWFtABFFpgARGJAAESQ0ABFMbAAQ08qqkW40qtFgIgARZYYAEWNUABFmOgARD2VVUWYM1VFyWYARWqiq0UGkABE3JqrRWFIAESeAABF+34ARUR0AETgwABElVAARIagAEUfcABFhZFVRKyqq0YD1qtFbPqqRMTdVUPG4ABFNOVVRMUqq0UNiqtE0AAARZe6q0XLUKtFaBAARDLQAEVwSqtFpQarRYb8AEXYSABFNlAARI5gAET+iABFgt4ARYlQAEVdkABEv3AARU5sAEWchgBFPHAARSGgAEW6iABFH/AARQFQAESnqABEZbAARNlwAETB5VVEpVVVRNoFVUUA2ABFVNVVRQ1qq0TMeABFTC6rRcuNVUTVVVVFX6AAROG1VUVlAABF2h1VRYf5VUUarVVFrKVVRMxAAETy9VVE6cqrRNtwAEXxbABFEeVVRTcwAEUOGqpEynVVRTU1VUWxWABEv0AARQ3YAEUN4qtGDBQAROLKq0VnIABFWcAARfqYAEWSaABFhSgAReyiq0XmhVVGEVFVRYohVUUOqABE72AARRvgAEYUigBFVTgARP8QAETH0ABGHsAARSs4AEUgSABFfLAARQ8gAEWXuABFTVAARl1AAEUBQABDuIAAQ6uAAEUDkABD/4AARbl4AEVbEABFEOAAReMIAEXSkABGL8AARW8wAEYHTABFFbAAQ0gAAEUBsABFt4AARY8wAEY0FABGDwwARJWAAEVIoABFSZAARZVoAEYOsABFNVAARgWwAEU5EABForgARSAQAEXb8ABFStAARaa4AEPTgABErMAARU8gAEQVwABE7CAARh8UAET1gABGN+AARn9sAEQlAABFOsAARZ1oAEPlgABGIoQARhb8AEW8GABD8oAARhOMAEVXgABE7AAARtkeAEZZ/ABGl7gARtqgAEW9qABGybIARnG0AEUs4ABGiKoAQ9eAAEeglQBGCMwARnf0AEbJvgBGfmgAR0fpAEgXB4BHsy4ASAMpAEgYdkBMBzAdAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQECAQEBAQECAgEBAQICAgIBAgEBAQICAQECAgICAgIDAQMBAgEBAgEBAgMCAQIDAgIDAwMBAgMBAgIDAgEDAwEDBAIDAwICAwQBAgICAgMDAwMEAQMDAwMDBAQBAwQEAgEEBAIEAwIEAgQBAgMBBAICAwMCAgMDBAMDAQMDAwMCAwMCAwMCAQMDAgMCAgMCAwMBAgIDAQMDAQMCAQECAgECAQICAQECAgEBAQEBAQEBAQECAQICAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQE=

To get it back into a TDigest tree:
byte[] data = DatatypeConverter.parseBase64Binary(encoded);
ByteBuffer bb = ByteBuffer.wrap(data);
TDigest td = TDigest.fromBytes(bb);

where encode is the base64 string above.

when you query for quantiles it happens that:
td.quantile(0.999d) > td.quantile(0.9999d)
which should never happen.

I believe part of th eproblem is on https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java
line 320 (https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java#L320).
I believe in that line (center.count() - (q - t)) should be instead (q - t)
Symmetrically line 317 (https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java#L317)
should move from (q - t) to (center.count() - (q - t)) restablishing the symmetry.

The same holds true for lines 300 and 303 (theu have to have the (q-t) and (center.count() - (q - t)) swapped around.
I tried that td.quantile(0.5d) > td.quantile(0.5001d).

I am not sure it is the right fix but the invariants should be respected:
td.quantile(0.999d)<=td.quantile(0.9999d)
as well as
td.quantile(0.5d)<=td.quantile(0.5001d)

Thanks

@phoenixdownita phoenixdownita changed the title TDigest percentiles are off on the last centroid TDigest percentiles are off on the last centroid (and close by percentiles have inverted values) Sep 16, 2014
@tdunning
Copy link
Contributor

Interesting bug!

could you write up a test that reproduces this?

On Tue, Sep 16, 2014 at 6:18 PM, phoenixdownita [email protected]
wrote:

I have a serialized TDigest in Base64:
AAAAAkBZAAAAAAAAAAABLkkzm5BGwGAARy0nAEYp0ABGlTYARLegAEU48ABHBuMARTgQAESsIABFSPAARDqAAEaLFgBEpKAARYYwAEQVgABE9mAARTdgAESuYABD3QAARiY4AEWVgABDMwAARTeAAEXByABEYcAARajAAEQwwABF5RgARShwAENWAABF/WAARMAgAEQxAABFGEAAQsoAAEUZMABF4wAARaVgAEV54ABFbuAARVYgAEQqQABGLPwARKngAEQwgABEhOAARIzgAER7gABEoaAARPOgAEXFoABFLkAARHWAAEVDwABE2WAAQ3kAAESfIABFdbAARX6gAESHAABELoAARXxQAETDgABDNQAARa5wAEU6oABFIkAARcg4AESlIABEpDAARcM0AEWy+ABFZlgARPSwAEVRcABE+aAAQmQAAESAgABF22wARZh4AESmUABEbkAAQ+yAAEQ24ABFb3AARdYoAEXyNABD70AARgjMq0WGQqtFxTgARQDAAEUBAABCNAAARRCQAEU/aABEy5AARPlAAETwkABFL1gARa0wAESiYABFR/AARJaVVUUmLVVF5uQARbnSq0Wc7VVFXVAARaTwAEWFtABFFpgARGJAAESQ0ABFMbAAQ08qqkW40qtFgIgARZYYAEWNUABFmOgARD2VVUWYM1VFyWYARWqiq0UGkABE3JqrRWFIAESeAABF+34ARUR0AETgwABElVAARIagAEUfcABFhZFVRKyqq0YD1qtFbPqqRMTdVUPG4ABFNOVVRMUqq0UNiqtE0AAARZe6q0XLUKtFaBAARDLQAEVwSqtFpQarRYb8AEXYSABFNlAARI5gAET+iABFgt4ARYlQAEVdkABEv3AARU5sAEWchgBFPHAARSGgAEW6iABFH/AARQFQAESnqABEZbAARNlwAETB5VVEpVVVRNoFVUUA2ABFVNVVRQ1qq0TMeABFTC6rRcuNVUTVVV
VFX6AARO
G1VUVlAABF2h1VRYf5VUUarVVFrKVVRMxAAETy9VVE6cqrRNtwAEXxbABFEeVVRTcwAEUOGqpEynVVRTU1VUWxWABEv0AARQ3YAEUN4qtGDBQAROLKq0VnIABFWcAARfqYAEWSaABFhSgAReyiq0XmhVVGEVFVRYohVUUOqABE72AARRvgAEYUigBFVTgARP8QAETH0ABGHsAARSs4AEUgSABFfLAARQ8gAEWXuABFTVAARl1AAEUBQABDuIAAQ6uAAEUDkABD/4AARbl4AEVbEABFEOAAReMIAEXSkABGL8AARW8wAEYHTABFFbAAQ0gAAEUBsABFt4AARY8wAEY0FABGDwwARJWAAEVIoABFSZAARZVoAEYOsABFNVAARgWwAEU5EABForgARSAQAEXb8ABFStAARaa4AEPTgABErMAARU8gAEQVwABE7CAARh8UAET1gABGN+AARn9sAEQlAABFOsAARZ1oAEPlgABGIoQARhb8AEW8GABD8oAARhOMAEVXgABE7AAARtkeAEZZ/ABGl7gARtqgAEW9qABGybIARnG0AEUs4ABGiKoAQ9eAAEeglQBGCMwARnf0AEbJvgBGfmgAR0fpAEgXB4BHsy4ASAMpAEgYdkBMBzAdAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQECAQEBAQECAgEBAQICAgIBAgEBAQICAQECAgICAgIDAQMBAgEBAgEBAgMCAQIDAgIDAwMBAgMBAgIDAgEDAwEDBAIDAwICAwQBAgICAgMDAwMEAQMDAwMDBAQBAwQEAgEEBAIEAwIEAgQBAgMBBAICAwMCAgMDBAMDAQMDAwMCAwMCAwMCAQMDAgMCAgMCAwMBAgIDAQMDAQMCAQECAgECAQICAQECAgEBAQEBAQEBAQECAQICAQEBAQEBAQEBAQEBAQEBAQEBAQEBA
QEBAQEBA QEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQE=

To get it back into a TDigest tree:
byte[] data = DatatypeConverter.parseBase64Binary(encoded);
ByteBuffer bb = ByteBuffer.wrap(data);
TDigest td = TDigest.fromBytes(bb);

where encode is the base64 string above.

when you query for quantiles it happens that:
td.quantile(0.999d) > td.quantile(0.9999d)
which should never happen.

I believe part of th eproblem is on
https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java
line 320 (
https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java#L320
).
I believe in that line (center.count() - (q - t)) should be instead (q - t)
Symmetrically line 317 (
https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java#L317
)
should move from (q - t) to (center.count() - (q - t)) restablishing the
symmetry.

The same holds true for lines 300 and 303 (theu have to have the (q-t) and
(center.count() - (q - t)) swapped around.
I tried that td.quantile(0.5d) > td.quantile(0.5001d).

I am not sure it is the right fix but the invariants should be respected:
td.quantile(0.999d)<=td.quantile(0.9999d)
as well as
td.quantile(0.5d)<=td.quantile(0.5001d)

Thanks


Reply to this email directly or view it on GitHub
#83.

@phoenixdownita
Copy link
Author

You can literally:
String encoded = "..." <-- the big Base64 String I attached above form the first A to the last =
byte[] data = DatatypeConverter.parseBase64Binary(encoded); //this is a javax class available on openJDK so it should be available on your preferred JVM/JDK combo
ByteBuffer bb = ByteBuffer.wrap(data);
TDigest td = TDigest.fromBytes(bb); //TDigest from stream-lib

then:
double td1= td.quantile(0.999d);
double td2 = td.quantile(0.9999d);

and you'll see td1 > td2.

Those 2 values 0.999d and 0.9999d are pretty useful in this case because the land the algo in the last of the centroids. While debugging I could see all other variables in that function were identical at line 320, just q for 0.9999 was bigger than q for 0.999.

If instead you'd want the actual raw data that created the TDigest in the first place, I have to say I don't have a repro for it that way.
Assuming there are no bugs in the serialization of the TDigest (that is how I got the base64 string in the first place, I used the smallbytes version), deserializing from the base64 string should be as good as I can get it.

I applied to my own "copy" of the TDigest the inversion of (q-t) with (center.count() - (q - t)) in those 4 lines involved and literally used that base64 in a UTest to assert that even with small delta quantiles (say 50% and 50.000001%) it respects the invariant that the value of the bigger quantile should be bigger or equal to the value of the lesser one and it seems to work.

I am not too familiar with the code or theory behind it although I skimmed thru the paper, I just use it as is to compute quantiles in a distributed fashion, but I believe that the piece of code I suggest to "fix" has to do with interpolating the value based on how many elements are contained in the centroid and if the center of said centroid is to the left or the right of the sough after percentile.
If that intuition is right then I believe I can say that (center.count() - (q - t)) gets smaller (closer to the center) and not bigger as q gets bigger than t (but not enough to need to go in the next centroid). So I believe that when on the right side of the centroid (q-t) is appropriate as even slightly bigger values of q yield bigger values of the interpolated percentile (line 320).
Line 317 I did for symmetry, thinking that maybe the coder inadvertently swapped the 2 interpolation functions. I do not have an example to prove 317 is required.
Lines 300 and 303 had to be "swapped" as well as they follow the same logic but for the case in which the requested percentile falls in an intermediate centroid and not just the last ... in this case I had tests that hit both conditions and proved the swap "fixed" the invariant.

I am not saying that the swap of code snippets that I propose is the right fix, just that with it values do not break the invariant, but my "fix" may as well be the wrong fix in terms of actual values that should be returned (not just their relative order).

Does it make sense? I hope with the info here you can repro the situation and either justify the behavior (maybe serialization/deserialization breaks some assumptions) or work on a the real fix, mine was literally done on a hunch ... so much for math proofing it.

@tdunning
Copy link
Contributor

On Wed, Sep 17, 2014 at 5:18 PM, phoenixdownita [email protected]
wrote:

Does it make sense? I hope with the info here you can repro the situation
and either justify the behavior (maybe serialization/deserialization breaks
some assumptions) or work on a the real fix, mine was literally done on a
hunch ... so much for math proofing it.

Yeah... I can go with the serialized image, but since there are several
implementations, it is kind of hairy thinking about that. Having data that
causes the problem might be nicer.

But I think you are right that it is conflicting conventions and
assumptions at work.

@phoenixdownita
Copy link
Author

I have a source data example:

    int maxVal = 31;
    TDigest td = new TDigest(100);
    for(int i = 1; i <= maxVal; i++) {
        td.add(i);
    }

    double tp999 = td.quantile(0.999d);
    double tp9999 = td.quantile(0.9999d);
    assertTrue(tp9999 >= tp999);

The assert will trigger because of the inversion mentioned.

Also they show an imprecision of the algo, namely they both are > 31 (as even for the last centroid the assumption is that half the count is one side the other on the other side [even when the count is 1]).
Not sure if we can amend that for the last centroid in all cases (or only in the case of count=1) to enforce that no percentile can be bigger then the biggest element seen (which I believe is precise only when the count = 1).

Anyway hope this helps you in "fixing" the issue or vetting my solution of replacing (q - t) with (center.count() - (q - t)) and viceversa.

@tdunning
Copy link
Contributor

Thanks. Excellent input on this.

We changed the computation of quantiles previously to address a different
issue. That means that I have to actually think about this for a bit.

It is always possible that this is a matter of definitions. For that
matter, your fix may be the continuation of the previous fix with each
being on one side or the other of the quantile computation (x -> q or q ->
x).

I have just been swamped lately which has made it difficult to think hard
about a problem like this for a few hours.

On Thu, Sep 25, 2014 at 2:28 PM, phoenixdownita [email protected]
wrote:

I have a source data example:

int maxVal = 31;
TDigest td = new TDigest(100);
for(int i = 1; i <= maxVal; i++) {
    td.add(i);
}

double tp999 = td.quantile(0.999d);
double tp9999 = td.quantile(0.9999d);
assertTrue(tp9999 >= tp999);

The assert will trigger because of the inversion mentioned.

Also they show an imprecision of the algo, namely they both are > 31 (as
even for the last centroid the assumption is that half the count is one
side the other on the other side [even when the count is 1]).
Not sure if we can amend that for the last centroid in all cases (or only
in the case of count=1) to enforce that no percentile can be bigger then
the biggest element seen (which I believe is precise only when the count =
1).

Anyway hope this helps you in "fixing" the issue or vetting my solution of
replacing (q - t) with (center.count() - (q - t)) and viceversa.


Reply to this email directly or view it on GitHub
#83 (comment).

@phoenixdownita
Copy link
Author

This is the code I have in place as a fix (I just put it here in case, but I make no assertion for it to be the correct fix in terms of the values that are returned).
it inverts the (q-t) with (center.count() - (q - t))
AND in case of a single centroid rather than failing it returns the value of it as any quantile (a single value is the collapse of all quantiles anyway).

public double quantile(double q) {
    GroupTree values = summary;
    //Preconditions.checkArgument(values.size() > 1);
    //fix to allow single value
    Preconditions.checkArgument(values.size() >= 1);

    Iterator<Group> it = values.iterator();
    Group center = it.next();
    //fix for single value return the value
    if(!it.hasNext()) {
        //only 1 centroid (with 1 value)
        return center.mean();
    }
    Group leading = it.next();
    if (!it.hasNext()) {
        // only two centroids because of size limits
        // both a and b have to have just a single element
        double diff = (leading.mean() - center.mean()) / 2;
        if (q > 0.75) {
            return leading.mean() + diff * (4 * q - 3);
        } else {
            return center.mean() + diff * (4 * q - 1);
        }
    } else {
        q *= count;
        double right = (leading.mean() - center.mean()) / 2;
        // we have nothing else to go on so make left hanging width same as right to start
        double left = right;

        double t = center.count();
        while (it.hasNext()) {
            if (t + center.count() / 2 >= q) {
                // left side of center
                //return center.mean() - left * 2.0d * (q - t) / center.count();
                //fix
                return center.mean() - left * 2.0d * (center.count() - (q - t)) / center.count();
            } else if (t + leading.count() >= q) {
                // right of b but left of the left-most thing beyond
                //return center.mean() + right * 2.0 * (center.count() - (q - t)) / center.count();
                //fix
                return center.mean() + right * 2.0d * (q - t) / center.count();

            }
            t += center.count();

            center = leading;
            leading = it.next();
            left = right;
            right = (leading.mean() - center.mean()) / 2;
        }
        // ran out of data ... assume final width is symmetrical
        center = leading;
        left = right;
        if (t + center.count() / 2 >= q) {
            // left side of center
            //return center.mean() - left * 2.0d * (q - t) / center.count();
            //fix
            return center.mean() - left * 2.0d * (center.count() - (q - t)) / center.count();
        } else if (t + leading.count() >= q) {
            // right of center but left of leading
            //return center.mean() + right * 2.0d * (center.count() - (q - t)) / center.count();
            //fix
            return center.mean() + right * 2.0d * (q - t) / center.count();
        } else {
            // shouldn't be possible
            return 1;
        }
    }
}

@phoenixdownita
Copy link
Author

I decided to actually take the impl of this functions from:
https://github.com/tdunning/t-digest/blob/master/src/main/java/com/tdunning/math/stats/TreeDigest.java
as it seems to work much better (and no issues of inversions etc...)
It takes very minor changes (Group vs Centroid naming) so it wasn't an hassle at all.

I guess this impl was left behind ... no biggie, now I know.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants