Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dijkstra*: refactor for performance and API cleanup #100

Open
jrtom opened this issue Aug 4, 2017 · 14 comments
Open

Dijkstra*: refactor for performance and API cleanup #100

jrtom opened this issue Aug 4, 2017 · 14 comments

Comments

@jrtom
Copy link
Owner

jrtom commented Aug 4, 2017

DijkstraDistance and DijkstraShortestPath each (optionally) cache information about previously calculated shortest distances/paths, as a time/space tradeoff in case those distances/paths are needed again.

However, they don't do so in a very intelligent fashion: when we've calculated the shortest path/distance from A to C, and the path passes through B, we already know the shortest path/distance from B to C also, but we don't store it in such a way as to be able to take advantage of this.

This is known to be a problem in practice; one user has reported that with a graph of ~10k nodes and 100k edges, after a number of calls for path lengths, there are ~2M distances being stored. :(

The task here is to come up with a better representation of the cached distances/paths that is more compact. Offhand I'd guess that this can be done in O(m + n) space (i.e., basically the size of the input graph), or perhaps O(n^2); what we have now looks more like O(mn).

(While we're in here, we should also clean up the API, e.g., make use of builders, make the interface more uniform, etc.)

@gstevey
Copy link

gstevey commented Aug 4, 2017

O(m+n) would be fantastic. You could just precompute all shortest paths and be O(1) on every query. I'm finding that shortest-path queries are taking 3-6 seconds, though fortunately most are avoided because of caching.

For what it's worth, my band-aid fix of calling reset() every few hours is going to carry me for at least a year, as far as I can tell. But it really does use a ton of space. In addition to the 2M distances, there are ~4M LinkedHashMap node entries for my m+n=110k graph.

@jbduncan
Copy link
Contributor

jbduncan commented Aug 4, 2017

@jrtom I've not yet looked at DijkstraDistance or DijkstraShortestPath in detail, but whilst we're cleaning up the API, I would encourage us to see if we could merge the two classes together.

Perhap's JGraphT's DijkstraShortestPath (javadoc) could prove to be a source of inspiration for a cleaner API.

@gstevey
Copy link

gstevey commented Aug 4, 2017

That would be nice. It was somewhat nontrivial to figure out how to use the API. My code looks like this:

        val edges = getAlgorithm(graph).getPath(a, b)
        if (edges.size == 0) return listOf()
        val result: MutableList<Location> = mutableListOf()
        var v_prev: Location?
        var v: Location? = a
        result.add(a)
        for (edge_num in edges) {
            v_prev = v
            v = graph.getOpposite(v_prev, edge_num)
            result.add(v)
        }

@jrtom
Copy link
Owner Author

jrtom commented Aug 4, 2017

Merging the classes together is almost certainly the right thing to do. The one caveat is that we don't want to cause people who only care about distances to incur the costs for calculating (and storing) paths as well; that was the original motivation for the distinction.

There's also an internal discussion going on with the Guava team about what a good API looks like for graph distance/path calculation; once that's resolved we'll almost certainly be using its results.

@gstevey just curious, is that Kotlin?

@jbduncan
Copy link
Contributor

jbduncan commented Aug 4, 2017

There's also an internal discussion going on with the Guava team about what a good API looks like for graph distance/path calculation; once that's resolved we'll almost certainly be using its results.

Oh wow! The Guava team continue to amaze me with their approach to API design, so I look forward to any results that may crop up from their discussion. :)

@jrtom
Copy link
Owner Author

jrtom commented Aug 4, 2017

That discussion has been on the back burner for a while, mostly because it got blocked on a related issue that got blocked on basically coming up with definitions for graph theoretical terms. Which led to me dragging out all my graph theory/algorithms texts and writing up a doc that established that, sure enough, there is not a lot of consensus in the graph theory world on some very basic semantic questions. :/

But I'll get back to that soon, and hopefully that will lead to various other things getting unblocked. :)

@jrtom
Copy link
Owner Author

jrtom commented Aug 9, 2017

related: #80 (proposal to allow the size of the cache to be limited)

@jrtom
Copy link
Owner Author

jrtom commented Aug 9, 2017

bringing in comment from elsewhere:

I liked your idea about sharing the common shortest paths, but I also think you can get significant mileage out of reusing edges (or their distances) -- make a Set and then create or fetch the edge distance object box from the set when storing it in a path. You'd still need all the path holder boxes, but it would at least avoid duplicating identical distances everywhere. I'm seeing an order of magnitude more distance objects than edges in my graph.

Also something to consider: using fastutil for cases like these, although I'd kind of hate to add a dependency.

@jbduncan
Copy link
Contributor

jbduncan commented Sep 7, 2017

I admit I do not fully understand yet why we'd want to use fastutil (to share double-typed distances between... something, presumably), but I also agree with you that I'd kind of hate to bring in an additional dependency just to effectively cache such distances.

Hmm, I do know that Bazel has an internal class called CompactHashSet (which I believe consumes less memory than HashSet), so even if we'd still box the distances as objects, I wonder if we could borrow CompactHashSet as a sort of stopgap. But having said that, it does occur to me that Bazel is licensed under Apache 2.0 rather than BSD, so I don't know how that would affect us legal-stuff-wise.

@jrtom
Copy link
Owner Author

jrtom commented Sep 8, 2017

The Dijkstra* classes keep track of distances using a Map<N, Number> objects rather than using primitive doubles, which adds overhead.

The question of whether CompactHashSet is worth it has been the subject of a great deal of internal debate (which is why you saw it in Bazel but not Guava). I don't think it helps that much.

In any event, these are definitely third-order concerns, after being more intelligent about storing distances and sharing common distances (as discussed above).

@jbduncan
Copy link
Contributor

jbduncan commented Nov 12, 2017

@jrtom Do you know if the Guava team came to any conclusions regarding a good API for graph distance/path calculation? If not, then it seems sensible for me to have a go at another issue in the meantime.

@jrtom
Copy link
Owner Author

jrtom commented Nov 13, 2017

We haven't yet sorted that out, no (and that's on me, and it's blocked behind a few other common.graph-related things, I'm afraid). That said, even if we change the API, we're still going to need to figure out the performance issues, so if you want to dig into that some more, that would be great. If you do, I'll paste into this bug some more details that @gstevey gave me based on his investigations.

@jrtom jrtom closed this as completed Nov 13, 2017
@jrtom jrtom reopened this Nov 13, 2017
@jbduncan
Copy link
Contributor

@jrtom Sounds interesting! Time and energies permitting, I'd be interested in pursuing this issue further, so if you could indeed paste here the details from @gstevey's investigations, then that would be great!

@jbduncan
Copy link
Contributor

Hi @jrtom, just wanted to follow up on this particular issue. Would you have time over the next week or so to share @gstevey's investigations with me?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants