-
Notifications
You must be signed in to change notification settings - Fork 557
TDigest percentiles are off on the last centroid (and close by percentiles have inverted values) #83
Comments
Interesting bug! could you write up a test that reproduces this? On Tue, Sep 16, 2014 at 6:18 PM, phoenixdownita [email protected]
|
You can literally: then: and you'll see td1 > td2. Those 2 values 0.999d and 0.9999d are pretty useful in this case because the land the algo in the last of the centroids. While debugging I could see all other variables in that function were identical at line 320, just q for 0.9999 was bigger than q for 0.999. If instead you'd want the actual raw data that created the TDigest in the first place, I have to say I don't have a repro for it that way. I applied to my own "copy" of the TDigest the inversion of (q-t) with (center.count() - (q - t)) in those 4 lines involved and literally used that base64 in a UTest to assert that even with small delta quantiles (say 50% and 50.000001%) it respects the invariant that the value of the bigger quantile should be bigger or equal to the value of the lesser one and it seems to work. I am not too familiar with the code or theory behind it although I skimmed thru the paper, I just use it as is to compute quantiles in a distributed fashion, but I believe that the piece of code I suggest to "fix" has to do with interpolating the value based on how many elements are contained in the centroid and if the center of said centroid is to the left or the right of the sough after percentile. I am not saying that the swap of code snippets that I propose is the right fix, just that with it values do not break the invariant, but my "fix" may as well be the wrong fix in terms of actual values that should be returned (not just their relative order). Does it make sense? I hope with the info here you can repro the situation and either justify the behavior (maybe serialization/deserialization breaks some assumptions) or work on a the real fix, mine was literally done on a hunch ... so much for math proofing it. |
On Wed, Sep 17, 2014 at 5:18 PM, phoenixdownita [email protected]
But I think you are right that it is conflicting conventions and |
I have a source data example:
The assert will trigger because of the inversion mentioned. Also they show an imprecision of the algo, namely they both are > 31 (as even for the last centroid the assumption is that half the count is one side the other on the other side [even when the count is 1]). Anyway hope this helps you in "fixing" the issue or vetting my solution of replacing (q - t) with (center.count() - (q - t)) and viceversa. |
Thanks. Excellent input on this. We changed the computation of quantiles previously to address a different It is always possible that this is a matter of definitions. For that I have just been swamped lately which has made it difficult to think hard On Thu, Sep 25, 2014 at 2:28 PM, phoenixdownita [email protected]
|
This is the code I have in place as a fix (I just put it here in case, but I make no assertion for it to be the correct fix in terms of the values that are returned).
|
I decided to actually take the impl of this functions from: I guess this impl was left behind ... no biggie, now I know. |
I have a serialized TDigest in Base64:
AAAAAkBZAAAAAAAAAAABLkkzm5BGwGAARy0nAEYp0ABGlTYARLegAEU48ABHBuMARTgQAESsIABFSPAARDqAAEaLFgBEpKAARYYwAEQVgABE9mAARTdgAESuYABD3QAARiY4AEWVgABDMwAARTeAAEXByABEYcAARajAAEQwwABF5RgARShwAENWAABF/WAARMAgAEQxAABFGEAAQsoAAEUZMABF4wAARaVgAEV54ABFbuAARVYgAEQqQABGLPwARKngAEQwgABEhOAARIzgAER7gABEoaAARPOgAEXFoABFLkAARHWAAEVDwABE2WAAQ3kAAESfIABFdbAARX6gAESHAABELoAARXxQAETDgABDNQAARa5wAEU6oABFIkAARcg4AESlIABEpDAARcM0AEWy+ABFZlgARPSwAEVRcABE+aAAQmQAAESAgABF22wARZh4AESmUABEbkAAQ+yAAEQ24ABFb3AARdYoAEXyNABD70AARgjMq0WGQqtFxTgARQDAAEUBAABCNAAARRCQAEU/aABEy5AARPlAAETwkABFL1gARa0wAESiYABFR/AARJaVVUUmLVVF5uQARbnSq0Wc7VVFXVAARaTwAEWFtABFFpgARGJAAESQ0ABFMbAAQ08qqkW40qtFgIgARZYYAEWNUABFmOgARD2VVUWYM1VFyWYARWqiq0UGkABE3JqrRWFIAESeAABF+34ARUR0AETgwABElVAARIagAEUfcABFhZFVRKyqq0YD1qtFbPqqRMTdVUPG4ABFNOVVRMUqq0UNiqtE0AAARZe6q0XLUKtFaBAARDLQAEVwSqtFpQarRYb8AEXYSABFNlAARI5gAET+iABFgt4ARYlQAEVdkABEv3AARU5sAEWchgBFPHAARSGgAEW6iABFH/AARQFQAESnqABEZbAARNlwAETB5VVEpVVVRNoFVUUA2ABFVNVVRQ1qq0TMeABFTC6rRcuNVUTVVVVFX6AAROG1VUVlAABF2h1VRYf5VUUarVVFrKVVRMxAAETy9VVE6cqrRNtwAEXxbABFEeVVRTcwAEUOGqpEynVVRTU1VUWxWABEv0AARQ3YAEUN4qtGDBQAROLKq0VnIABFWcAARfqYAEWSaABFhSgAReyiq0XmhVVGEVFVRYohVUUOqABE72AARRvgAEYUigBFVTgARP8QAETH0ABGHsAARSs4AEUgSABFfLAARQ8gAEWXuABFTVAARl1AAEUBQABDuIAAQ6uAAEUDkABD/4AARbl4AEVbEABFEOAAReMIAEXSkABGL8AARW8wAEYHTABFFbAAQ0gAAEUBsABFt4AARY8wAEY0FABGDwwARJWAAEVIoABFSZAARZVoAEYOsABFNVAARgWwAEU5EABForgARSAQAEXb8ABFStAARaa4AEPTgABErMAARU8gAEQVwABE7CAARh8UAET1gABGN+AARn9sAEQlAABFOsAARZ1oAEPlgABGIoQARhb8AEW8GABD8oAARhOMAEVXgABE7AAARtkeAEZZ/ABGl7gARtqgAEW9qABGybIARnG0AEUs4ABGiKoAQ9eAAEeglQBGCMwARnf0AEbJvgBGfmgAR0fpAEgXB4BHsy4ASAMpAEgYdkBMBzAdAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQECAQEBAQECAgEBAQICAgIBAgEBAQICAQECAgICAgIDAQMBAgEBAgEBAgMCAQIDAgIDAwMBAgMBAgIDAgEDAwEDBAIDAwICAwQBAgICAgMDAwMEAQMDAwMDBAQBAwQEAgEEBAIEAwIEAgQBAgMBBAICAwMCAgMDBAMDAQMDAwMCAwMCAwMCAQMDAgMCAgMCAwMBAgIDAQMDAQMCAQECAgECAQICAQECAgEBAQEBAQEBAQECAQICAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQEBAQE=
To get it back into a TDigest tree:
byte[] data = DatatypeConverter.parseBase64Binary(encoded);
ByteBuffer bb = ByteBuffer.wrap(data);
TDigest td = TDigest.fromBytes(bb);
where encode is the base64 string above.
when you query for quantiles it happens that:
td.quantile(0.999d) > td.quantile(0.9999d)
which should never happen.
I believe part of th eproblem is on https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java
line 320 (https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java#L320).
I believe in that line (center.count() - (q - t)) should be instead (q - t)
Symmetrically line 317 (https://github.com/addthis/stream-lib/blob/master/src/main/java/com/clearspring/analytics/stream/quantile/TDigest.java#L317)
should move from (q - t) to (center.count() - (q - t)) restablishing the symmetry.
The same holds true for lines 300 and 303 (theu have to have the (q-t) and (center.count() - (q - t)) swapped around.
I tried that td.quantile(0.5d) > td.quantile(0.5001d).
I am not sure it is the right fix but the invariants should be respected:
td.quantile(0.999d)<=td.quantile(0.9999d)
as well as
td.quantile(0.5d)<=td.quantile(0.5001d)
Thanks
The text was updated successfully, but these errors were encountered: