Dijkstra*: refactor for performance and API cleanup #100
Comments
O(m+n) would be fantastic. You could just precompute all shortest paths and be O(1) on every query. I'm finding that shortest-path queries are taking 3-6 seconds, though fortunately most are avoided because of caching. For what it's worth, my band-aid fix of calling reset() every few hours is going to carry me for at least a year, as far as I can tell. But it really does use a ton of space. In addition to the 2M distances, there are ~4M LinkedHashMap node entries for my m+n=110k graph. |
@jrtom I've not yet looked at Perhap's JGraphT's |
That would be nice. It was somewhat nontrivial to figure out how to use the API. My code looks like this:
|
Merging the classes together is almost certainly the right thing to do. The one caveat is that we don't want to cause people who only care about distances to incur the costs for calculating (and storing) paths as well; that was the original motivation for the distinction. There's also an internal discussion going on with the Guava team about what a good API looks like for graph distance/path calculation; once that's resolved we'll almost certainly be using its results. @gstevey just curious, is that Kotlin? |
Oh wow! The Guava team continue to amaze me with their approach to API design, so I look forward to any results that may crop up from their discussion. :) |
That discussion has been on the back burner for a while, mostly because it got blocked on a related issue that got blocked on basically coming up with definitions for graph theoretical terms. Which led to me dragging out all my graph theory/algorithms texts and writing up a doc that established that, sure enough, there is not a lot of consensus in the graph theory world on some very basic semantic questions. :/ But I'll get back to that soon, and hopefully that will lead to various other things getting unblocked. :) |
related: #80 (proposal to allow the size of the cache to be limited) |
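For reference, a minimal sketch of what a size-limited cache could look like, using the JDK's LinkedHashMap in access order. The class name BoundedCache and the maxEntries parameter are hypothetical and are not part of JUNG or of the #80 proposal itself.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical size-bounded LRU cache; illustrative only, not JUNG API. */
final class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    BoundedCache(int maxEntries) {
        // accessOrder = true: iteration order runs from least to most recently used.
        super(16, 0.75f, true);
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        // Called by put(); returning true evicts the least recently used entry.
        return size() > maxEntries;
    }
}
```

Something along these lines would bound memory without a periodic reset(), at the cost of recomputing evicted entries.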
bringing in comment from elsewhere:
Also something to consider: using fastutil for cases like these, although I'd kind of hate to add a dependency. |
I admit I do not fully understand yet why we'd want to use fastutil (to share Hmm, I do know that Bazel has an internal class called |
The Dijkstra* classes keep track of distances using Map<N, Number> objects rather than using primitive doubles, which adds overhead. The question of whether In any event, these are definitely third-order concerns, after being more intelligent about storing distances and sharing common distances (as discussed above). |
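To make the primitive-doubles point concrete, here is a rough sketch that avoids a third-party dependency: nodes are mapped once to dense int indices, and each per-source distance table is then a plain double[]. NodeIndex and its methods are hypothetical names chosen for illustration; nothing like this exists in the current classes as far as this thread shows.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical helper: assigns each node a dense index, shared by all sources. */
final class NodeIndex<N> {
    private final Map<N, Integer> indexOf = new HashMap<>();

    NodeIndex(Collection<N> nodes) {
        for (N node : nodes) {
            // Each distinct node gets the next free index: 0, 1, 2, ...
            indexOf.putIfAbsent(node, indexOf.size());
        }
    }

    /** Returns the index of a known node, or throws for an unknown one. */
    int of(N node) {
        Integer index = indexOf.get(node);
        if (index == null) {
            throw new IllegalArgumentException("unknown node: " + node);
        }
        return index;
    }

    /** A fresh per-source table; infinity marks "no distance known yet". */
    double[] newDistanceArray() {
        double[] distances = new double[indexOf.size()];
        Arrays.fill(distances, Double.POSITIVE_INFINITY);
        return distances;
    }
}
```

With this layout the boxed Integer cost is paid once per node, while each cached source costs one double[] of length n instead of a map of boxed Number values.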
@jrtom Do you know if the Guava team came to any conclusions regarding a good API for graph distance/path calculation? If not, then it seems sensible for me to have a go at another issue in the meantime. |
We haven't yet sorted that out, no (and that's on me, and it's blocked behind a few other common.graph-related things, I'm afraid). That said, even if we change the API, we're still going to need to figure out the performance issues, so if you want to dig into that some more, that would be great. If you do, I'll paste into this bug some more details that @gstevey gave me based on his investigations. |
DijkstraDistance and DijkstraShortestPath each (optionally) cache information about previously calculated shortest distances/paths, as a time/space tradeoff in case those distances/paths are needed again. However, they don't do so in a very intelligent fashion: when we've calculated the shortest path/distance from A to C, and the path passes through B, we already know the shortest path/distance from B to C also, but we don't store it in such a way as to be able to take advantage of this.
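A small sketch of the A/B/C observation: given one shortest path and the weights of its edges, the distance from every intermediate node to the final node falls out of a single backward pass. SuffixDistanceSketch and distancesToTarget are hypothetical names, and the list-based inputs are an assumption made only to keep the example self-contained.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not how the Dijkstra* classes currently store anything. */
final class SuffixDistanceSketch {
    /**
     * nodes is a shortest path from nodes[0] to the last node;
     * weights.get(i) is the weight of the edge nodes[i] -> nodes[i + 1].
     * Returns the shortest distance from each node on the path to the last node.
     */
    static Map<String, Double> distancesToTarget(List<String> nodes, List<Double> weights) {
        if (nodes.isEmpty() || weights.size() != nodes.size() - 1) {
            throw new IllegalArgumentException("need exactly one weight per edge on the path");
        }
        Map<String, Double> toTarget = new LinkedHashMap<>();
        double remaining = 0.0;
        toTarget.put(nodes.get(nodes.size() - 1), 0.0);
        // Walk backwards: every suffix of a shortest path is itself a shortest path.
        for (int i = nodes.size() - 2; i >= 0; i--) {
            remaining += weights.get(i);
            toTarget.put(nodes.get(i), remaining);
        }
        return toTarget;
    }
}
```

For a path A, B, C with edge weights 2.0 and 3.0, this yields a distance of 5.0 from A to C and 3.0 from B to C, so the B-to-C answer is available without another search.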
This is known to be a problem in practice; one user has reported that with a graph of ~10k nodes and 100k edges, after a number of calls for path lengths, there are ~2M distances being stored. :(
The task here is to come up with a better representation of the cached distances/paths that is more compact. Offhand I'd guess that this can be done in O(m + n) space (i.e., basically the size of the input graph), or perhaps O(n^2); what we have now looks more like O(mn).
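One possible reading of the O(n^2) bound, sketched under assumptions: keep one shortest-path tree per cached source (one distance and one predecessor per node, so O(n) each) and rebuild paths on demand instead of storing them. ShortestPathTree, the int node ids, and the array-based adjacency layout are all hypothetical and chosen only for brevity; edge weights are assumed to be non-negative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch: O(n) storage per source, paths rebuilt on demand. */
final class ShortestPathTree {
    final double[] dist; // dist[v] = shortest distance from the source to v
    final int[] pred;    // pred[v] = node before v on that path, or -1

    /** neighbors[u][i] is the target of u's i-th outgoing edge; weights[u][i] is its weight. */
    ShortestPathTree(int[][] neighbors, double[][] weights, int source) {
        int n = neighbors.length;
        dist = new double[n];
        pred = new int[n];
        Arrays.fill(dist, Double.POSITIVE_INFINITY);
        Arrays.fill(pred, -1);
        dist[source] = 0.0;
        // Queue entries are {distance, node}; stale entries are skipped when polled.
        PriorityQueue<double[]> queue =
            new PriorityQueue<>((a, b) -> Double.compare(a[0], b[0]));
        queue.add(new double[] {0.0, source});
        while (!queue.isEmpty()) {
            double[] top = queue.poll();
            int u = (int) top[1];
            if (top[0] > dist[u]) {
                continue;
            }
            for (int i = 0; i < neighbors[u].length; i++) {
                int v = neighbors[u][i];
                double candidate = dist[u] + weights[u][i];
                if (candidate < dist[v]) {
                    dist[v] = candidate;
                    pred[v] = u;
                    queue.add(new double[] {candidate, v});
                }
            }
        }
    }

    /** Rebuilds the path source -> target by walking predecessors; empty if unreachable. */
    List<Integer> pathTo(int target) {
        List<Integer> path = new ArrayList<>();
        if (Double.isInfinite(dist[target])) {
            return path;
        }
        for (int v = target; v != -1; v = pred[v]) {
            path.add(v);
        }
        Collections.reverse(path);
        return path;
    }
}
```

Callers who only want distances read dist and never pay for path construction; callers who want a path pay O(path length) per query rather than per cached entry.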
(While we're in here, we should also clean up the API, e.g., make use of builders, make the interface more uniform, etc.)