method to convert CharTrie into a compact trie for space-optimized representation (feature request) #27
Comments
I'd like to contribute code for this, if required. |
I’m always happy to review pull requests, though adding a ‘compressed’ representation to the library is a rather involved change, so this might benefit from a discussion of the design before code is written. There are non-obvious cases to consider, such as how would the data structure look with Lastly, you should be looking at the https://github.com/mina86/pygtrie repository since that’s where the latest code resides. |
One solution for the example you provided is node decomposition, as shown, ---foobarbaz: 42 trie['foobar'] = 42 In this example, 'foobar' matches 'foobarbaz' up to the 5th position; therefore, the root node breaks at the 5th position creating, after trie['foo'] = 42, Note that at any node no two keys start with the same character, and hence the time complexity of addition is unaffected by compression. |
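The node decomposition described in this comment can be sketched roughly as follows. This is only an illustration under stated assumptions: `Node`, `common_prefix_len` and `insert` are hypothetical names, not part of pygtrie, and children are keyed by whole edge labels as the comment proposes. Note that finding the matching label scans all children of a node, which is the cost debated later in the thread.

```python
class Node:
    """Hypothetical compact-trie node (not pygtrie's _Node)."""

    def __init__(self, value=None):
        self.value = value
        self.children = {}  # edge label (str) -> Node


def common_prefix_len(a, b):
    """Return the number of leading characters shared by a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def insert(node, key, value):
    """Insert key below node, splitting an edge label where key diverges."""
    if not key:
        node.value = value
        return
    # Linear scan: at most one label can share a first character with key.
    for label in list(node.children):
        n = common_prefix_len(label, key)
        if n == 0:
            continue
        child = node.children[label]
        if n < len(label):
            # Decompose the edge: label[:n] -> mid -> label[n:] -> child.
            mid = Node()
            mid.children[label[n:]] = child
            del node.children[label]
            node.children[label[:n]] = mid
            child = mid
        insert(child, key[n:], value)
        return
    # No label shares a prefix with key, so add a fresh edge.
    node.children[key] = Node(value)
```

Inserting 'foobarbaz', then 'foobar', then 'foo' with this sketch leaves a chain of edges 'foo', 'bar', 'baz', each ending in a node that holds a value.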
In the above trie, if I perform, the resulting structure would be, ie, node with key 'bar' decomposed at 0th position ('b'). |
What if I then did |
If you do trie['bar'] = 42, |
When a lookup for ‘foo’ is done, the trie’s behaviour is to look for a child ‘f’ of the root node. If the root node’s children held ‘foo’ and ‘bar’, you’d never find ‘foo’ in there. |
Look-ups would be implemented similarly to additions: at each node we find the key that has the same first character as the query. At the root node, 'foo' starts with an 'f' just like 'foobar', which indicates that 'foobar' must belong to the subtrie at 'foo'. In the 'foo' subtrie we now look up 'bar'; this time 'b' matches 'bar'. In the 'b' subtrie we then look up 'ar', which matches 'ar', and since the query has been found, we stop there. Yes, at every node we need to traverse the children until we find the one with a matching first character, so the time complexity increases. But a lot of storage is saved, which is often helpful for storage-constrained applications. |
The
correspond to? What’s the value of Because if the root node has When I say that design discussion may be in order, I mean discussion of how each kind of node is represented. What are |
Your interpretation of the children field is correct. After compression, trie keys are strings, and hence at each node, rather than looking for a character like 'f', we look for strings beginning with 'f', like 'foo'. At each node, log(N) complexity is required to find a key, where N is the total number of children of that node. Yes, the operations are not time efficient, but this is the cost of the saved memory space. If I'm right, combining each pair of nodes saves 56B. |
I don’t see how that’s correct. The actual complexity is linear in the length of the prefix. Recall that Python uses hash maps for dictionaries, not binary search trees. As a result, when looking for child ‘foobar’ you have to first check ‘f’, then ‘fo’, then ‘foo’ etc. At that point it’s simpler to just have a flat dictionary and do key lookups by doing longer and longer prefix matches. |
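To make the point about hash maps concrete, here is a minimal sketch of what finding a child would look like if children were stored in a plain `dict` keyed by whole strings. The function name `find_child` and the sample data are assumptions for illustration only; the sketch ignores labels that only partially match the query.

```python
def find_child(children, query):
    """Return (label, child) for the child whose label is a prefix of query.

    A dict can only answer exact-key questions, so every prefix length of
    the query has to be probed in turn: 'f', 'fo', 'foo', ...
    """
    for end in range(1, len(query) + 1):
        label = query[:end]
        if label in children:
            return label, children[label]
    return None


# Hypothetical root node holding the string keys 'foo' and 'bar'.
children = {'foo': 'subtrie at foo', 'bar': 'subtrie at bar'}
match = find_child(children, 'foobar')
```

Here three probes are needed before 'foo' is found, which is the same work a flat dictionary of keys would need for prefix matching.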
Actually, at one node there will be only one key starting with 'f' or any other character. In the following trie, if I do trie['foob'] = 9 the node with key 'bar' decomposes at 'b' to avoid having both 'bar' and 'b' as keys. Once we match the first character of the query (in O(log N)), we can simply traverse the matched key to find out up to which position it matches the query, i.e. 'foo' matches 'foob' up to the 2nd position; then in the subtrie we search for the remaining part of the query, i.e. the remaining 'b' of 'foob'. |
So then also if I add |
No, in this case the root node will have one more key, 'bar', because no existing key of the root node matches 'b'. |
So again, how do you look up a key in such a node? |
If sorting is enabled, as done by the function otherwise, if sorting is not enabled we have to perform a linear search. Do note that the query string will change for each node, depending on how far the match extended in the previous node. |
I don’t think either of those is acceptable. The first option makes lookup an O(n log n) operation; the second makes it O(n), where n is the number of children. Meanwhile, the idea behind a trie is to have O(k) lookup, where k is the length of the key. I don’t see getting away from children being keyed by a single character (in case of
The question then is how to encode and handle the compressed path. |
Once sorting is enabled, look-ups should have a worst-case time complexity of O( k log n ), and that will be the case when the trie remains unaffected after compression. For an average case, the time complexity should be O( log n ) + O( k ). You suggest that we use only the first character of a path as a key, right? That should save the search time for a key. |
On Tue, May 28, 2019, 09:03 Manthan Chauhan ***@***.***> wrote:
> Once sorting is enabled, look-ups should have a worst-case time complexity
> of O( k log n ), and that will be the case when the trie remains unaffected
> after compression. For an average case, the time complexity should be O(
> log n ) + O( k ).
When sorting is enabled, children are not stored sorted. A hash table is
still used. With sorting enabled, children are sorted each time items are
iterated over.
As such, there is no way to take advantage of sorting being enabled during
key lookup. The children would need to be sorted each time, which means lookup
takes O(n log n).
> You suggest that we use only the first character of a path as a key, right?
> which should save the search time of a key.
Yes. That's how tries work.
Thinking about your idea of character keys:
at each node, we can use the first character of the query string to locate a key-value pair, where each value is a pair of a string and a node pointer. Once we locate a key-value pair, we traverse the associated string up to the position where it stops matching the query string, which is equivalent to traversing that many nodes of an uncompressed CharTrie. The remaining part of the query string is then passed to the pointed-to _Node. |
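A rough sketch of the lookup just described, assuming each child entry is keyed by a single character and stores the rest of the compressed path alongside the child node. The names `Node` and `lookup` and the `(rest, child)` tuple layout are hypothetical and are not pygtrie's internal representation.

```python
class Node:
    """Hypothetical node: children maps first char -> (rest_of_path, Node)."""

    def __init__(self, value=None):
        self.value = value
        self.children = {}


def lookup(root, key, default=None):
    """Return the value stored at key, or default if the path does not exist.

    Each step is one dict probe on a single character followed by a string
    comparison against the compressed remainder of the edge.
    """
    node, pos = root, 0
    while pos < len(key):
        entry = node.children.get(key[pos])
        if entry is None:
            return default
        rest, child = entry
        end = pos + 1 + len(rest)
        if key[pos + 1:end] != rest:
            # The key diverges from, or ends inside, the compressed path.
            return default
        node, pos = child, end
    return node.value


# Hand-built trie for trie['foo'] = 1 and trie['foobar'] = 2.
root = Node()
foo = Node(1)
root.children['f'] = ('oo', foo)
foo.children['b'] = ('ar', Node(2))
```

With this layout the dict probes stay keyed by one character, so the total work remains proportional to the length of the key. The sketch cannot tell a missing key from a stored value equal to `default`.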
This sounds workable. I only wonder how it would interact with interfaces
such as `walk_towards`.
|
Sorry, but can you please provide some references for this? |
For this to work, we'll need to modify the _path_from_key() functions, because in this proposed model of compact tries, the trie['foo'] = 1 +- f -> (oo) -> 1 in this trie structure, if we consider the key with the above described path definition |
I don't think that'll work. Whether compressed representation is used or not, it should be transparent to the user so |
Then we would have to make a small modification in the If we consider that the value at each child key is a tuple(string, Node),
|
Sorry about the delay.
That doesn't sound viable for two reasons. First of all, the result of

    import pygtrie
    t = pygtrie.CharTrie(foobar=42)
    steps = t.walk_towards('foobar')
    res = ' '.join('<%s>' % step.key for step in steps)
    assert res == '<> <f> <fo> <foo> <foob> <fooba> <foobar>'

Second of all, if the motivation is to reduce the amount of memory used, I think only children with compressed paths should be a

There's also another aspect: ideally, the following test should pass as well:

    from itertools import islice
    import pygtrie
    t = pygtrie.CharTrie(foobar=42)
    res = ''
    for step in t.walk_towards('foobar'):
        res += '<%s>' % step.key
        # Toggle sibling keys on every step so nodes along the path switch
        # between compressed and uncompressed while the walk is in progress.
        if 'fred' in t:
            del t['fred']
            del t['foobaz']
        else:
            t['fred'] = t['foobaz'] = 24
    assert res == '<><f><fo><foo><foob><fooba><foobar>'

In other words, converting nodes between compressed and not-compressed should not break |
Okay, but for compressed tries to pass the That is, if the trie contains, then while |
Correct. This is the desired behaviour because it's the current behaviour (currently the method goes through all nodes regardless of whether they have values assigned to them or how many children they have) and methods should not expose details of the implementation. |
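One way the requirement above could be met on top of a compressed layout is sketched below: the walk expands each compressed edge back into one step per character. Everything here is an assumption for illustration: `Node`, the `(rest, child)` tuple layout and this standalone `walk_towards` are hypothetical, steps are reduced to `(prefix, value)` pairs, and the sketch does not handle the trie being modified during the walk.

```python
class Node:
    """Hypothetical node: children maps first char -> (rest_of_path, Node)."""

    def __init__(self, value=None):
        self.value = value
        self.children = {}


def walk_towards(root, key):
    """Yield (prefix, value) for every prefix of key, one character at a time.

    Prefixes that fall inside a compressed edge have no node of their own,
    so they are reported with value None.
    """
    yield '', root.value
    node, pos = root, 0
    while pos < len(key):
        entry = node.children.get(key[pos])
        if entry is None:
            raise KeyError(key)
        rest, child = entry
        label = key[pos] + rest
        for i, ch in enumerate(label, start=1):
            if pos + i > len(key):
                return  # the key ends inside a compressed edge
            if key[pos + i - 1] != ch:
                raise KeyError(key)
            yield key[:pos + i], (child.value if i == len(label) else None)
        node, pos = child, pos + len(label)


# Compressed form of CharTrie(foobar=42): a single edge 'f' + 'oobar'.
root = Node()
root.children['f'] = ('oobar', Node(42))
res = ' '.join('<%s>' % prefix for prefix, _ in walk_towards(root, 'foobar'))
```

The resulting string has the same shape as the one asserted in the first test quoted earlier in the thread, even though only two nodes exist.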
For that, we can use this trick and in between every two valid I mean without actually traversing the path character by character. |
char_trie['he'] = 3
char_trie['him'] = 4
---h
   |__e: 3
   |__i
      |__m: 4
can be compressed as,
---h
   |__e: 3
   |__i m:4
Since removing each redundant node saves at least 56 bytes, this can provide a large memory saving for a data structure that is used for memory-optimized storage.
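The compression in this example can be sketched as a pass that merges every valueless single-child node into its child. The names `Node`, `build`, `compress` and `count` are hypothetical and not part of pygtrie; the sketch treats `None` as "no value" and counts nodes rather than bytes, since the per-node size is not verified here.

```python
class Node:
    """Hypothetical trie node; None is treated as 'no value'."""

    def __init__(self):
        self.value = None
        self.children = {}  # edge label (str) -> Node


def build(items):
    """Build an uncompressed trie with one node per character."""
    root = Node()
    for key, value in items.items():
        node = root
        for ch in key:
            node = node.children.setdefault(ch, Node())
        node.value = value
    return root


def compress(node):
    """Merge each valueless single-child node into its child, in place."""
    for label in list(node.children):
        child = node.children.pop(label)
        while child.value is None and len(child.children) == 1:
            (sub_label, grandchild), = child.children.items()
            label += sub_label
            child = grandchild
        compress(child)
        node.children[label] = child
    return node


def count(node):
    """Count the nodes in the subtrie rooted at node."""
    return 1 + sum(count(child) for child in node.children.values())


trie = build({'he': 3, 'him': 4})
before = count(trie)
after = count(compress(trie))
```

For the two keys above, only the 'i' node is redundant, so exactly one node is removed.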