-
Notifications
You must be signed in to change notification settings - Fork 210
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Complex Properties Overhaul #121
Conversation
Another slight changed proposed by @kkaefer I should mention (but not reflected currently)
This could allow all types to be inlined. |
3.0/vector_tile.proto
Outdated
repeated double double_values = 8 [ packed = true ]; | ||
repeated float float_values = 9 [ packed = true ]; | ||
repeated int64 int64_values = 10 [ packed = true ]; | ||
repeated uint64 uint64_values = 11 [ packed = true ]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use sint64
instead of int64
since this would only be preferred over uint64
when the value is negative.
3.0/vector_tile.proto
Outdated
// | | (if 4th bit is 1 is map) | ||
// | | remaining bits are the number of key_index and | ||
// | | complex_value pairs to follow (same as properties) | ||
repeated uint64 properties = 5 [ packed = true ]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try an experimental implementation of this so we can see if it actually makes the tiles significantly smaller.
3.0/vector_tile.proto
Outdated
// | ||
// Type | Id | Parameter | ||
// --------------------------------- | ||
// inline int | 0 | value of integer ( values between -2^60+1 to 2^60-1 ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use sint
instead of int
since this would only be preferred over uint
when values are negative.
An experimental implementation of this scheme is in mapbox/tippecanoe#611 In a first, superficial test, it reduces the Natural Earth countries (to z5) to 99.55% of their usual size. There may be other data sets that show meaningful improvement though. |
I have experimented a little more with your branch and put in a few different sets of data: North American Road DataFor one set of data, I used some north american road data, here are the results of file sizes:
or
This is about ~5% reduction in size. Sample data format properties: "properties": { "prefix": null, "number": "331", "class": "State", "type": "Other Paved", "divided": null, "country": "United States", "state": "Alabama", "note": null, "scalerank": 11, "uident": 142, "length": 4.559190, "rank": 0, "continent": "North America" } My guess would be that we are seeing a reduction due to inline integers mostly in this case. OSM Based Point DataFile sizes:
or
This is about a 7% reduction. Example properties:
|
Interesting. Thanks for the additional research. Can you add links to the files you are testing with? |
New experiment with sorting the values but retaining the v2 encoding:
At least now I know it's worth spending a little extra time to sort the |
The Natural Earth roads are improved slightly by also sorting the keys:
|
Blake's format, but without inline ints:
So I think inlining is helping more than repeated messages are. |
Inlining floats does help a little, but the difference is in the noise (91.97% vs 92.16%):
This also highlights that we need more than 3 bits for types. In fact this PR already actually requires 4, because it specifies "list / map" as type |
Adding the 4th type bit raises the roads from using 92.16% to 92.49% of the original tileset size.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I really like the flat approach. While we introduce 4 bits per value, this should be more than offset by not wrapping each value as a separate tagged message, and nested properties fit here naturally.
3.0/vector_tile.proto
Outdated
// list / map | 8 | (if 4th bit is 0 is list) | ||
// | | remaining bits are length of the list where | ||
// | | each item in the list is a complex value | ||
// | | (if 4th bit is 1 is map) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a bit confusing — the id 8 is 0b1000
, but if the 4th bit is 1 (so that it becomes 0b1001
), the id equals 9. Then why not just indicate 8 for list and 9 for map instead of mentioning the fourth bit?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was attempting to get away with just using 3 bits so that we could represent higher int values with out having to using the int index system. I am not against 4 bits.
3.0/vector_tile.proto
Outdated
// | ||
// The properties field is much like the tags value in the it is two integers | ||
// pairs that reference key and value pairs however, it is broken out into a | ||
// "key_index" and an "complex_value". |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: had to read many times to understand the sentence. the
-> that
? also, ;
before "however" would help
Also question — how does an encoder decide whether to inline a value or put it in the packed array? Should we always inline int/sint vlues? And if we remove "index to int/sint" types, this makes the types fit into 3 bytes again (including list/map without an additional bit). |
@mourner We can not currently remove the indexed integer types because there is a limit to the size that the inlined values can represent currently. Here is how @ericfischer currently did the implementation for when to use inline vs indexed. I think this makes sense overall, the only time there might be a bit savings by using index over inline would be values that are larger then could be represented by 24 bit integers that are highly repeated (probably more then 3 or 4 times) and would have a low index value. This could be calculated on the fly if required, but I don't know that we need to have such a complex implementation. I think overall the inline seems to save space on average so we may not need such complex code. |
Here are some random thoughts:
|
|
The mantissas of floating point numbers seem to be fairly uniformly distributed across the [.5…1) interval, so there's probably not much potential for giving more common mantissas shorter representations. Low exponents are more common than high ones, though, so we might be able to squeeze a little bit out there. |
Regarding special encoding of floating point numbers: I don't think it is worth it to come up with complex schemes here. I had thought about just using raw bytes stored in a string field or something. But while that might be easy to use in C++, it will be more difficult in JS or so. Regarding the keys/string_values tables: With the encoding proposed here it doesn't cost us anything to split these up, because each string is encoded by itself anyway. But as mentioned it will lead to smaller index numbers which, especially for the keys case is probably worth it. Here is another idea though: Currently all keys/string_values are directly in the Regarding integer value encodings: If we put large integers that can't be inlined as separate varint in the
This is one of those cases where we are hitting the limits of protobuf encoding again. We know the type, so we could do the zigzag encoding ourselfs for sints and not for uints. For the C++ code this doesn't matter, because we do the zigzag encoding ourselves anyway, but for anybody using the protobuf encodings, we either need two tables, or they have to do the zigzag encoding outside the protobuf lib. |
On a different topic:
|
3.0/vector_tile.proto
Outdated
// uses the properties field instead. This would only be used if version | ||
// for a layer is 3 or greater and tags should not be used at that point | ||
// Additional tags (or all the tags) of this feature may be | ||
// encoded as repeated pairs of 32-bit integers, to take |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Didn't we want to get away from the "tags" name and use "properties" instead? Also the properties
field has 64bit uints, not 32 bit ints. And this is not necessarily "pairs" when we deal with lists and maps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, will reword. @flippmoke and I want to consistently call it "attributes" to match what OGC does.
3.0/vector_tile.proto
Outdated
@@ -69,6 +112,12 @@ message Tile { | |||
// See https://github.com/mapbox/vector-tile-spec/issues/47 | |||
optional uint32 extent = 5 [ default = 4096 ]; | |||
|
|||
repeated string string_values = 7; | |||
repeated double double_values = 8 [ packed = true ]; | |||
repeated float float_values = 9 [ packed = true ]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we try to keep the types ordered consistently throughout the .proto
file, ie some places have float
first, then double
, others in different order.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me. I'll make that edit.
3.0/vector_tile.proto
Outdated
repeated float float_values = 9 [ packed = true ]; | ||
repeated sfixed64 sfixed64_values = 10 [ packed = true ]; | ||
repeated fixed64 fixed64_values = 11 [ packed = true ]; | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest these should get "logical" names like signed_integer_values
or so instead of ones based on the encoding sfixed...
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also fine with me.
3.0/vector_tile.proto
Outdated
// | | each item in the list is a complex value | ||
// | | (if 4th bit is 1 is map) | ||
// | | remaining bits are the number of key_index and | ||
// | | complex_value pairs to follow (same as properties) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we simply make these list -> 8
, map > 9
? The extra bit is confusing and doesn't buy us anything, because we already have 9 values (0-8) for the Id anyway.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good to me, since we have enough type fields to spare. I think they were combined only because it looked like the types would fit in 3 bits.
3.0/vector_tile.proto
Outdated
// an index position into a value storage of the layer. | ||
// | ||
// uint64t type = complex_value & 0x0F; // First 4 Bits | ||
// uint64t parameter = complex_value >> 4; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
uint64_t
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point, will fix
3.0/vector_tile.proto
Outdated
// bool/null | 7 | value of 0 = false, 1 = true, 2 = null | ||
// list | 8 | value is the number of sub-attributes to follow: | ||
// | | each item in the list is a complex value | ||
// map | 9 | value is the number of sub-attributes to follow: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question on intent of wording here, is the number of sub attributes to follow based on number of key value pairs or the number of keys and values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I meant it to be the number of pairs, not the total number of words to follow. Thanks. I'll reword.
3.0/vector_tile.proto
Outdated
// int | 3 | index into layer.attribute_pool.signed_integer_values | ||
// uint | 4 | index into layer.attribute_pool.unsigned_integer_values | ||
// inline uint | 5 | value of unsigned integer (values between 0 to 2^60-1) | ||
// inline sint | 6 | value of zigzag-encoded integer (values between -2^59 to 2^59-1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We probably should change this to be 2^56 for uint and 2^55 for signed due to the way varints are encoded.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we need to differentiate between what encodings are possible and what encodings are recommended. The spec may well say that this or that encoding is recommended becaus it is usually better, but still require readers to understand a different encoding.
Closing in favor of #123 |
This is different then the more simple approach to adding lists and maps in #117. This is a complete overhaul of the way properties are encoded. The thought process behind this change is to allow for higher levels of compression of values by:
This also allows for null values.
Solves #75 and #62