Perf/fast unions #81
Just came back from a break week with no laptop and catching up on things. When it comes to the behaviors described in #79 (comment) and #79 (comment), I don't think that those can and/or should be addressed in In the case of sources being lazily generated in an asynchronous manner, ensuring correct error propagation and disposal of sources would be best addressed in a dedicated sources iterator that is bound to be context-specific. The only thing that AsyncIterator/asynciterator.ts Lines 1668 to 1673 in 6e7f093
IMHO, when it comes to error propagation and source disposal this PR is already complete as it is. |
So then I guess the only remaining question for this PR is when an array of sources is passed into the |
I don't think so as that would lead to slightly different behaviors in the case of sources passed as an array vs. sources passed as an iterator. Wrapping the array into an iterator with |
Ok great - in that case I am happy with the status of this PR and consider it ready for review :) |
Actually, I realized you were likely making a point that I failed to grasp. If sources are passed as an array of iterators that is fed to In this case I would either leave things as they are right now or create a different iterator class that is specific to the use case of abstracting an array of sources into an iterator. Then, the
if (Array.isArray(sources)) {
  this._sources = new SourceArrayIterator(sources);
} else { ... }
However, the implementation of
These questions seem very context-specific to me and I'm sure I could come up with plenty more. Perhaps it's worth thinking of |
So to conclude this PR is ready to be reviewed in its current state. Bump @RubenVerborgh |
@jeswr Some questions about the circular list:
I do agree that, conceptually, it is indeed a circular list though. Given that the number of iterators is assumed to be small (10–20ish?) and not often changed (?), I am curious if there is a big impact.
So this is a v4? (I followed the links, but no result for keyword major.) |
That's probably sensible. When I was first writing I didn't have clarity on what APIs would need to be exposed; but I can now try and refactor that retrospectively.
The reason I have introduced it is that we need to always be able to delete an element in The other type of data structure that would allow for this type of deletion would be a doubly linked list, which could become part of the current linked list implementation, at a memory cost to other iterators.
Yes, since the erroring behavior will change in some cases this will need to be a v4. |
I agree; but it seems to me that it would be I guess I'm simultaneously asking: what are the main sources of performance improvement in this PR? Is it indeed this new list structure? |
Comunica has some cases where the number of iterators in a union could become very large (property paths and link traversal). However, currently those cases do not make use of the UnionIterator yet, so perhaps not necessary to optimize for those. |
Okay; I'm happy to have a circular list. (Although, side-note, what we actually need is a list with O(1) item removal. Which can perfectly be a (Actually, scratch that. There might be multiple occurrences of an item—at least for lists in general.) (But do note that you don't need a doubly linked list. All we need is a Map from item to previous list node; which is currently implemented via Symbol. Also the current implementation breaks if an element is added multiple times to a list; which is not allowed for iterators, but it is for lists in general.) (So my points still hold if we simply don't allow duplicate items.) Maybe And there could be a |
I'll have a crack at this in the morning (jet lag is starting to hit me hard). To confirm - would this be preferred over a circular linked list implementation (provided it has similar performance to the current implementation, which it should)? |
Not necessarily; my remark was more that all we need is O(1) deletion. Circular lists as such don't offer that; it's first an O(n) lookup, and only then is deletion O(1). So what I'm arguing is that the Symbol lookup trick is basically what we need (and that we can also implement it with a Map); and that we get O(1) deletion by making it point at the previous node rather than the current node. |
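To illustrate the point above about O(1) deletion, here is a minimal sketch of a singly linked list that keeps a Map from each item to the node before it. The class name `UniqueLinkedList` and its methods are hypothetical and are not part of asynciterator; the sketch assumes each item is added at most once, as discussed in this thread.

```typescript
// Hypothetical sketch: singly linked list with O(1) removal by item.
// Assumes every item appears at most once in the list.
interface ListNode<T> {
  value: T | null;           // null only for the sentinel head node
  next: ListNode<T> | null;
}

class UniqueLinkedList<T> {
  private readonly head: ListNode<T> = { value: null, next: null };
  private tail: ListNode<T> = this.head;
  // Maps each item to the node *before* the node that holds it
  private readonly previous = new Map<T, ListNode<T>>();

  get size(): number {
    return this.previous.size;
  }

  // Appends an item at the end in O(1)
  push(item: T): void {
    if (this.previous.has(item))
      throw new Error('Duplicate items are not allowed');
    const node: ListNode<T> = { value: item, next: null };
    this.previous.set(item, this.tail);
    this.tail.next = node;
    this.tail = node;
  }

  // Removes an item in O(1): no traversal, only a Map lookup
  remove(item: T): boolean {
    const prev = this.previous.get(item);
    if (prev === undefined)
      return false;
    const node = prev.next as ListNode<T>;
    prev.next = node.next; // unlink the node
    if (node.next !== null)
      // The successor's predecessor is now `prev`
      this.previous.set(node.next.value as T, prev);
    else
      this.tail = prev;
    this.previous.delete(item);
    return true;
  }

  // Lists the items in insertion order (O(n), for inspection only)
  toArray(): T[] {
    const items: T[] = [];
    for (let node = this.head.next; node !== null; node = node.next)
      items.push(node.value as T);
    return items;
  }
}
```

Pointing the Map at the previous node is what avoids the need for a doubly linked list: unlinking only needs the predecessor, and the one other entry that changes is that of the removed node's successor.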
Could you detail these a bit? Because we should line up all the changes we want for v4, and see which ones we can bundle. |
See #79 (comment). The reason that test broke is because we do not subscribe to the error event of iterators until after we start reading from them. So if the iterator errors before we first call Though, if we remove pretty well all autostarting behavior in #45 - then there really shouldn't be errors being emitted prior to the first |
Can't we change that behavior? Not just for backwards compatibility reasons, but it's a necessity for correct functioning. The moment an iterator is handed over to a component, that component takes full responsibility of that iterator. That includes handling its errors, which would terminate the process when unhandled (default EventEmitter behavior). So in this concrete case, when a source iterator is handed over to a union iterator, that union iterator should immediately attach error listeners. |
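The difference between attaching error listeners on handover and attaching them on first read can be sketched as follows. This is only an illustration: `FakeSource`, `EagerUnion`, and `LazyUnion` are hypothetical stand-ins and do not reflect the actual asynciterator classes. `FakeSource` mimics the default EventEmitter behavior mentioned above by throwing when an error is emitted without any listener.

```typescript
type ErrorListener = (error: Error) => void;

// Hypothetical stand-in for a source iterator (not the asynciterator API).
// Like Node's EventEmitter, emitting an error with no listener throws.
class FakeSource {
  private readonly listeners: ErrorListener[] = [];

  onError(listener: ErrorListener): void {
    this.listeners.push(listener);
  }

  emitError(error: Error): void {
    if (this.listeners.length === 0)
      throw error;
    for (const listener of this.listeners)
      listener(error);
  }
}

// Hypothetical union that takes responsibility for its sources immediately
class EagerUnion {
  readonly errors: Error[] = [];

  constructor(sources: FakeSource[]) {
    for (const source of sources)
      source.onError(error => this.errors.push(error));
  }
}

// Hypothetical union that subscribes only once reading starts
class LazyUnion {
  readonly errors: Error[] = [];
  private readonly sources: FakeSource[];

  constructor(sources: FakeSource[]) {
    this.sources = sources;
  }

  read(): void {
    for (const source of this.sources)
      source.onError(error => this.errors.push(error));
  }
}
```

With `EagerUnion`, an error emitted right after construction is collected by the union. With `LazyUnion`, the same error emitted before the first `read()` call goes unhandled and is thrown.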
Just to make this even more concrete; if I have
const source = range(1, Infinity).map(x => range(1, 10))
const unionRange = union(source)
Then This is still breaking relative to the last version as the fact that the The one case in which it may make sense to attach error listeners to all source iterators in advance is if we are taking a union over an array of iterators rather than an iterator of iterators; i.e.
const unionRange = union([range(1, 10), range(1, 10)])
By the same token, I'm not sure if it makes sense to change this behavior when taking the union over an iterator vs. the union over an array. |
The only way to attach error handlers to a union of lazily-generated sources would be to let upstream instantiate those sources and buffer all of them, even though we might need only a few such as in the case of destroying the union after a set number of items. That would likely have significant performance and resource utilization issues.
This is why I suggested using a separate class to handle that specific use case in #81 (comment) - to keep the |
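The resource argument above can be illustrated with plain synchronous generators standing in for iterators. This is a simplified sketch, not the asynciterator implementation: it reads sources one after another rather than interleaving them, and `lazySources`, `range`, `take`, and the `created` counter are all hypothetical names.

```typescript
// Counts how many sources upstream has instantiated so far
let created = 0;

// Hypothetical infinite, lazily generated stream of sources
function* lazySources(): Generator<Generator<number>> {
  while (true) {
    created++;
    yield range(1, 10);
  }
}

// Yields the integers from start to end, inclusive
function* range(start: number, end: number): Generator<number> {
  for (let i = start; i <= end; i++)
    yield i;
}

// Pulls a new source only when the current one is exhausted,
// and stops as soon as `limit` items have been read
function take(sources: Generator<Generator<number>>, limit: number): number[] {
  const items: number[] = [];
  let current: Generator<number> | null = null;
  while (items.length < limit) {
    if (current === null) {
      const nextSource = sources.next();
      if (nextSource.done)
        break;
      current = nextSource.value;
    }
    const item = current.next();
    if (item.done)
      current = null;
    else
      items.push(item.value);
  }
  return items;
}

const items = take(lazySources(), 25);
// Only 3 sources were instantiated to produce 25 items
```

Attaching error handlers to every source up front would require draining `lazySources()` first, which never terminates here. Consuming sources on demand touches only the ones that are actually read.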
Closing as I have a better implementation in the upcoming v4 PR. |
Supersedes #79
The concerns about behavior (and reasons for a major version bump) that I pointed out there still remain. But not sure if I have a good solution for those at this point.
Before
For loop with 5x10^9 elems: 5.207s
UnionIterator 2x10^7 iterators: 30.297s
UnionIterator 1000x500 iterators: 1.897s
UnionIterator 1000x500 iterators - max parallelism of 1: 1.837s
After
For loop with 5x10^9 elems: 7.832s
UnionIterator 2x10^7 iterators: 3.844s
UnionIterator 1000x500 iterators: 276.211ms
UnionIterator 1000x500 iterators - max parallelism of 1: 157.52ms