Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Perf/fast unions #81

Closed
wants to merge 6 commits into from
Closed

Perf/fast unions #81

wants to merge 6 commits into from

Conversation

jeswr
Copy link
Collaborator

@jeswr jeswr commented Aug 11, 2022

Supercedes #79

The concerns about behavior (and reasons for a major version bump) that I pointed out there still remain. But not sure if I have a good solution for those at this point.

Before
For loop with 5x10^9 elems: 5.207s
UnionIterator 2x10^7 iterators: 30.297s
UnionIterator 1000x500 iterators: 1.897s
UnionIterator 1000x500 iterators - max parallelism of 1: 1.837s

After
For loop with 5x10^9 elems: 7.832s
UnionIterator 2x10^7 iterators: 3.844s
UnionIterator 1000x500 iterators: 276.211ms
UnionIterator 1000x500 iterators - max parallelism of 1: 157.52ms

@jacoscaz
Copy link
Collaborator

Just came back from a break week with no laptop and catching up on things. When it comes to the behaviors described in #79 (comment) and #79 (comment), I don't think that those can and/or should be addressed in UnionIterator as the latter can not make any assumption on how sources are generated in the first place.

In the case of sources being lazily generated in an asynchronous manner, ensuring correct error propagation and disposal of sources would be best addressed in a dedicated sources iterator that is bound to be context-specific. The only thing that UnionIterator should do is to catch errors both at the single source level and at the iterator-of-sources level (which is already happening at

AsyncIterator/asynciterator.ts

Lines 1668 to 1673 in 6e7f093

source[DESTINATION] = this;
source.on('error', destinationEmitError);
// TODO: On readable events we should start reading from the iterator
// that emitted the event
source.on('readable', destinationSetReadable);
source.on('end', destinationRemoveEmptySources);
).

IMHO, when it comes to error propagation and source disposal this PR is already complete as it is.

@jeswr
Copy link
Collaborator Author

jeswr commented Aug 16, 2022

In the case of sources being lazily generated in an asynchronous manner, ensuring correct error propagation and disposal of sources would be best addressed in a dedicated sources iterator that is bound to be context-specific.

So then I guess the only remaining question for this PR is when an array of sources is passed into the UnionIterator whether we should immediately start listening for the error event by subscribing that listener in the constructor.

@jacoscaz
Copy link
Collaborator

So then I guess the only remaining question for this PR is when an array of sources is passed into the UnionIterator whether we should immediately start listening for the error event by subscribing that listener in the constructor.

I don't think so as that would lead to slightly different behaviors in the case of sources passed as an array vs. sources passed as an iterator. Wrapping the array into an iterator with wrap() and consuming it as if it had always been an iterator leads to more consistent behavior.

@jeswr
Copy link
Collaborator Author

jeswr commented Aug 16, 2022

Ok great - in that case I am happy with the status of this PR and consider it ready for review :)

@jeswr jeswr requested a review from RubenVerborgh August 16, 2022 15:21
@jacoscaz
Copy link
Collaborator

jacoscaz commented Aug 16, 2022

Actually, I realized you were likely making a point that I failed to grasp. If sources are passed as an array of iterators that is fed to wrap(), the resulting iterator will not relay errors from the sources themselves as there's nothing in wrap() that handles that specific case (rightly so as wrap() should not modify its behavior based on item type).

In this case I would either leave things as they are right now or create a different iterator class that is specific to the use case of abstracting an array of sources into an iterator. Then, the UnionIterator constructor could do something like:

if (Array.isArray(sources)) {
  this._sources = new SourceArrayIterator(sources);
} else { ... } 

However, the implementation of SourceArrayIterator would be a non-trivial one:

  1. should there be an option to simply discard errored sources?
  2. should an errored source automatically lead to the destruction of other sources?

These questions seem very context-specific to me and I'm sure I could come up with plenty more. Perhaps it's worth thinking of SourceArrayIterator as an orthogonal feature to UnionIterator and better addressed in a different PR if needed at all.

@jeswr
Copy link
Collaborator Author

jeswr commented Aug 21, 2022

Perhaps it's worth thinking of SourceArrayIterator as an orthogonal feature to UnionIterator and better addressed in a different PR if needed at all.

So to conclude this PR is ready to be reviewed in its current state.

Bump @RubenVerborgh

@RubenVerborgh RubenVerborgh self-assigned this Aug 22, 2022
@RubenVerborgh RubenVerborgh mentioned this pull request Aug 22, 2022
@RubenVerborgh
Copy link
Owner

@jeswr Some questions about the circular list:

  1. Shall we extract CircularList to a separate file like LinkedList?
  2. Is a CircularList strictly needed? I.e., do the performance gains really come from not having an array for the iterators?

I do agree that, conceptually, it is indeed a circular list though.

Given that the number of iterators is assumed to be small (10–20ish?) and not often changed (?), I am curious if there is a big impact.

reasons for a major version bump

So this is a v4? (I followed the links, but no result for keyword major.)

@jeswr
Copy link
Collaborator Author

jeswr commented Aug 22, 2022

Shall we extract CircularList to a separate file like LinkedList?

That's probably sensible. When I was first writing I didn't have clarity on what APIs would need to be exposed; but I can now try and refactor that retrospectively.

Is a CircularList strictly needed? I.e., do the performance gains really come from not having an array for the iterators?

The reason I have introduced it is that we need to always be able to delete an element in O(1) time when the end event for that particular iterator is called; which we cannot do with an array or the current LinkedList as both require the traversal of O(n) elements (where n is index of the element in the array/linkedlist).

The other type of data structure that would allow for this type of deletion would be a Doubly LinkedList, which could become part of the current linkedlist implementation, at a memory cost to other iterators.

So this is a v4? (I followed the links, but no result for keyword major.)

Yes, since the erroring behavior will change in some cases this will need to be a v4.

@RubenVerborgh
Copy link
Owner

RubenVerborgh commented Aug 22, 2022

both require the traversal of O(n) elements

I agree; but it seems to be that it would be O(10) at most; unless we are looking at scenarios where the number of iterators is really large?

I guess I'm simultaneously asking: what are the main sources of performance improvement in this PR? Is it indeed this new list structure?

@rubensworks
Copy link
Collaborator

I agree; but it seems to be that it would be O(10) at most; unless we are looking at scenarios where the number of iterators is really large?

Comunica has some cases where the number of iterators in a union could become very large (property paths and link traversal). However, currently those cases do not make use of the UnionIterator yet, so perhaps not necessary to optimize for those.

@RubenVerborgh
Copy link
Owner

RubenVerborgh commented Aug 22, 2022

Okay; I'm happy to have a circular list.

(Although, side-note, what we actually need is a list with O(1) item removal. Which can perfectly be a LinkedList with lazy deletion: you keep a Set of items to be deleted, and as you iterate through elements, you remove nodes as needed. But we might be getting into premature optimization territory here 😅.)

(Actually, scratch that. There might be multiple occurrences of an item—at least for lists in general.)

(But do note that you don't need a doubly linked list. All we need is a Map from item to previous list node; which is currently implemented via Symbol. Also the current implementation breaks if an element is added multiple times to a list; which is not allowed for iterators, but it is for lists in general.)

(So my points still hold if we simply don't allow duplicate items.)

Maybe OrderedSet extends LinkedList, adding O(1) has and delete operations via a Set or Map?
Then we also keep the LinkedList without the Map memory and perf overhead.

And there could be a .looped iterator that is infinite/circular!

@jeswr
Copy link
Collaborator Author

jeswr commented Aug 22, 2022

Maybe OrderedSet extends LinkedList, adding O(1) has and delete operations via a Set or Map?

I'll have a crack at this in the morning (jet lag is starting to hit me hard). To confirm - this would be preferred over a circular linked list implementation (provided this has similar performance to the current implementation, which it should)

@RubenVerborgh
Copy link
Owner

RubenVerborgh commented Aug 22, 2022

this would be preferred over a circular linked list implementation

Not necessarily; my remark was more that all we need is O(1) deletion.

Circular lists as such doesn't offer that; it's first an O(n) lookup, and only then deletion is O(1).
Which is why you added the Symbol lookup to make it O(1).

So what I'm arguing is that the Symbol lookup trick is basically what we need (and that we can also implement it with a Map); and that we get O(1) deletion by making it point at the previous node rather than current node.

@RubenVerborgh
Copy link
Owner

Yes, since the erroring behavior will change in some cases this will need to be a v4.

Could you details these a bit?

Because we should line up all the changes we want for v4, and see which ones we can bundle.

@jeswr
Copy link
Collaborator Author

jeswr commented Aug 22, 2022

Could you details these a bit?

See #79 (comment). The reason that test broke is because we do not subscribe to the error event of iterators until after we start reading from them. So if the iterator errors before we first call .read() on it, then the error will not be forwarded.

Though, if we remove pretty well all autostarting behavior in #45 - then there really shouldn't be errors being emitted prior to the first .read() call anyway.

@RubenVerborgh
Copy link
Owner

we do not subscribe to the error event of iterators until after we start reading from them

Can't we change that behavior? Not just for backwards compatibility reasons, but it's a necessity for correct functioning.

The moment an iterator is handed over to a component, that component takes full responsibility of that iterator. That includes handling its errors, which would terminate the process when unhandled (default EventEmitter behavior).

So in this concrete case, when a source iterator is handed over to a union iterator, that union iterator should immediately attach error listeners.

@jeswr
Copy link
Collaborator Author

jeswr commented Aug 23, 2022

Can't we change that behavior?

Just to make this even more concrete; if I have

const source = range(1, Infinity).map(x => range(1, 10))

const unionRange = union(source)

Then source will have an error listener attached immediately, but each of the infinitely generated range(1, 10) iterators will not have an error listener attached until immediately before the first .read() call. In this case, and cases with a large number of generated iterators it does not make sense to try and retrieve them all from the source and attach error listeners before we start being able to iterate over unionRange.

This is still breaking relative to the last version as the fact that the UnionIterator could autoStart and buffer could mean that listeners are attached at an earlier point in time.

The one case in which it may make sense to attach error listeners to all source iterators in advance is if we are taking a union over an array of iterators rather than an iterator of iterators; i.e.

const unionRange = union([range(1, 10), range(1, 10)])

But in the same token; I'm not sure if it makes sense to change this behavior when taking the union over an iterator vs. the union over an array.

@jacoscaz
Copy link
Collaborator

jacoscaz commented Aug 23, 2022

The only way to attach error handlers to a union of lazily-generated sources would be to let upstream instantiate those sources and buffer all of them, even though we might need only a few such as in the case of destroying the union after a set number of items. That would likely have significant performance and resource utilization issues.

But in the same token; I'm not sure if it makes sense to change this behavior when taking the union over an iterator vs. the union over an array.

This is why I suggested using a separate class to handle that specific use case in #81 (comment) - to keep the UnionIterator class as consistent as possible.

@RubenVerborgh RubenVerborgh added this to the v4-beta milestone Sep 21, 2022
@jeswr
Copy link
Collaborator Author

jeswr commented Dec 1, 2022

Closing as I have a better implementation in the upcoming v4 PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants