New article on forcing direction of an expansion #179
base: master
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,128 @@ | ||
= How to force direction of an expansion | ||
:slug: how-to-force-direction-of-expansion | ||
:author: Andrew Bowman | ||
:neo4j-versions: 3.5, 4.0, 4.1, 4.2, 4.3, 4.4, 5.x | ||
:tags: cypher | ||
:category: cypher | ||
|
||
Like SQL, Cypher is a declarative query language. | ||
This is most evident with Match patterns, which describe what you want to find in the graph. | ||
You do not dictate to it how to find these patterns, the way you might in an imperative programming language. | ||
|
||
For the most part this works well: the planner uses the counts and statistical metadata in the graph to decide how to find the patterns you want to match, and which nodes to look up (via indexes) as the anchor nodes from which expansion begins. | ||
|
||
However, sometimes the counts and metadata are not sufficient to produce an optimal plan, particularly when it comes to the direction of certain expansions. | ||
That is, for some expansions, traversing in one direction may be cheap, while expanding in the opposite direction may be very expensive. | ||
|
||
This article discusses the problem, and provides means by which you can force the planner to expand in a certain direction if needed. | ||
|
||
== When one direction is clearly more expensive | ||
|
||
Consider a social graph of contact data, which may include this pattern for every contact: | ||
|
||
[source,cypher] | ||
---- | ||
(:User)-[:LIVES_IN]->(:Country) | ||
---- | ||
|
||
If we are querying for a specific user in a specific country, we might have part of the query include: | ||
|
||
[source,cypher] | ||
---- | ||
MATCH (user:User {id:12345})-[:LIVES_IN]->(country:Country {name:'United States'}) | ||
---- | ||
|
||
There are several different ways we can imagine how this match portion might get planned, subject to indexes present and graph metadata. | ||
|
||
If only one of these nodes is chosen as an anchor node, the expansion will start there, expand the relationship to the other node, and then filter the other node's label and properties to find the matches. | ||
|
||
If both nodes are looked up via index, still only one will be the anchor node that we expand from, so the expansion process should remain about the same, but filtering will be more efficient, only having to filter on the node's internal graph id to see if the other node is the same one we matched earlier. | ||
|
||
It should be clear that there are two possible ways to expand the pattern in this case, and that one is going to be far more efficient than the other. | ||
Review comment: Actually 4
|
||
|
||
=== The costly direction | ||
|
||
If we anchored on the country node, we would have to expand to every user in the United States, then filter all of them to find the ones that match the pattern. | ||
This would produce a large number of expanded rows, and a lot of filtering work in which we would likely throw out every row except one. | ||
Review comment: a ton -> a lot or "many" |
||
|
||
Even if we find that user early in the filtering, nothing in the query says to stop looking (no `LIMIT 1` is present), and without a unique constraint nothing prevents another node with the same properties from being in the graph (or multiple relationships going to the same node), so it will keep matching, filtering, and discarding all the non-matching results. | ||
Review comment: really important aspect and even with the unique constraint (if it's not used) it will not take it into account and filter all the remaining rows too. probably highlight with a NOTE or such. |
||
|
||
=== The cheap direction | ||
|
||
If we anchored on the user node, and we assume that each user has only one `:LIVES_IN` relationship, we would only have to expand that single relationship and filter the one connected node to see whether the user really does live in the United States. | ||
Review comment: you didn't say how to force the "cheap" direction :) |
||
|
||
Even if the graph contains historical data of where a user lives, we can expect only a handful of these paths that we need to filter. | ||
This approach would still be far quicker and easier than having to filter through every single user in the country to find the path we want. | ||
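
One way to check which direction was actually chosen is to profile the query. The sketch below reuses the example pattern from above; the `RETURN` clause is added only so that the query is complete, and the exact operator names in the output may vary between Neo4j versions.

[source,cypher]
----
// PROFILE runs the query and reports rows and db hits per operator.
// Use EXPLAIN instead to see the plan without executing the query.
PROFILE
MATCH (user:User {id:12345})-[:LIVES_IN]->(country:Country {name:'United States'})
RETURN user, country
----

In the resulting plan, the leaf operator (typically an index seek or a label scan) shows which node was used as the anchor, and the row count on the expand operator above it shows how many relationships were traversed from there. A large row count on the expand step followed by a filter that discards nearly everything suggests the costly direction was taken.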
|
||
=== Other expensive pitfalls | ||
|
||
There are other ways the planner can decide to plan a query, some of them being cheaper with a small number of nodes and relationships, and some far more costly. | ||
|
||
A `NodeHashJoin`, for example (where we expand from two different anchor nodes to a common node in the middle of the pattern), may be very quick when few anchor nodes are matched and only a few relationships are expanded from them. | ||
Review comment: probably simplify the sentence by coming from the two nodes then saying that depending on the side it would check against an already found node in a set (that's why hash-join) not sure if you want to put JOINs into a separate KB and link to it? |
||
But it can be very expensive if we are traversing a large number of relationships, becoming a hindrance to query execution. | ||
Review comment: that's why you can choose which side to join on with the index hint. |
||
|
||
In the case of more complicated queries, there may be quite a few different nodes that could potentially be used as anchor nodes, with many possibilities on which one or which combination to anchor on, and how to expand to fulfill the desired patterns. | ||
Review comment: point out to build the query step by step and look at the plan for the number of rows produced from an expansion |
||
|
||
In any case, it is possible for the planner to make a bad choice, either because the approach isn't universally efficient across all data in your graph (some nodes may be supernodes, and cause the query over them to choke) or the metadata available to the planner is insufficient to warn it away from these more expensive expansions. | ||
Review comment: probably have a separate highlighted statement that says "don't query across or from supernodes with many relationships only against them, i.e. from the other side" |
||
|
||
In these cases, what we as humans know about the general shape of the graph may be greater than what can be inferred via metadata. Remember that walking the actual graph data is not possible here, since we're talking about query planning, which precedes execution. | ||
Review comment: really a shame that we have no histograms
Review comment: actually ExpandInto does a runtime check for degrees and picks the smaller non-dense or smaller side to expand from :) |
||
|
||
== Ways to force the direction of an expansion | ||
|
||
Unfortunately we do not yet have planner hints that directly require or forbid expansion in a certain direction. | ||
Instead, we must influence this indirectly, through other hints, or through Cypher tricks which leave the planner no other choice. | ||
|
||
=== Using hash joins to force expansion to a supernode | ||
|
||
Remember that supernodes are really only problematic when expanding through them or away from them; depending on your graph data, it may be just fine to expand only to them. | ||
Review comment: highlight this with a NOTE: |
||
|
||
|
||
In the case where you have multiple efficient anchor nodes, and a known or potential supernode in the middle of the pattern, and you know the expansion to the supernode from both sides is cheap, you can use a join hint to force expansion to the supernode. | ||
Review comment: I would still show an example. |
||
|
||
This is discussed in more detail in a separate article: | ||
|
||
https://neo4j.com/developer/kb/how-to-avoid-costly-traversals-with-join-hints/[How to avoid costly traversals with join hints] | ||
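
As a rough sketch of what such a hint looks like, suppose we want to know whether two specific users live in the same country. The second user id and the variable names `u1` and `u2` are made up for illustration, and efficient lookups on `:User(id)` are assumed.

[source,cypher]
----
// The join hint asks the planner to join on the node in the middle,
// so that both expansions run toward `country` rather than away from it.
MATCH (u1:User {id:12345})-[:LIVES_IN]->(country:Country)<-[:LIVES_IN]-(u2:User {id:67890})
USING JOIN ON country
RETURN country.name
----

Whether this helps depends on the data: it is only worthwhile when expanding to the middle node from both sides is cheap, so it is worth comparing the profiled plan with and without the hint.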
|
||
=== Label filtering without anchoring | ||
|
||
The presence of a label on a node in the Match pattern can significantly change whether or not the planner will try to use it as an anchor node. | ||
|
||
That is, the planner prefers index lookups to find anchor nodes, though it may still use label scans to find anchor nodes (especially when there are no opportunities for index lookups in the query). | ||
|
||
Neither index lookups nor label scans are possible when the label isn't present in the pattern (or otherwise stated in a `WHERE` clause, like `WHERE c:Country`), and these are the two most common means of finding anchor nodes in the graph. | ||
Of course, you may need to have that label in the query for the sake of correctness, otherwise the wrong nodes and patterns might be matched. | ||
Review comment: But that can be accommodated with more specific relationship-types too. |
||
|
||
There is a way to perform the label filtering in a special manner which also prevents the planner from using it for index lookup or label scanning. | ||
Here is an example: | ||
|
||
[source,cypher] | ||
---- | ||
MATCH (user:User {id:12345})-[:LIVES_IN]->(country {name:'United States'}) | ||
// Review comment: This is really ouch, not sure you want to show this. |
||
WHERE 'Country' IN labels(country) | ||
---- | ||
|
||
Note that we removed the `:Country` label from the Match pattern, preventing the planner from being able to use it as an anchor node via an index lookup or label scan. | ||
The planner has no choice but to use the `user` node as an anchor, and expand relationships to the `country` node at the other end. | ||
|
||
The `WHERE` clause, a membership check for a certain label among the node's labels, lets us retain the label filtering we need for correctness. | ||
|
||
Note that attempting to use `WHERE country:Country` to accomplish the same thing will not work, as the planner is aware of this syntax and can still use labels used here for lookups to find anchor nodes. | ||
|
||
=== Node identity filtering to avoid anchoring | ||
|
||
If both end nodes are known and can be efficiently pre-matched (such as via index lookup), and we want to prevent one of them from being used as the anchor for expansion, we can use this trick. | ||
|
||
Consider this query: | ||
|
||
[source,cypher] | ||
---- | ||
MATCH (user:User {id:12345}), (country:Country {name:'United States'}) | ||
MATCH (user)-[:LIVES_IN]->(c) | ||
WHERE c = country | ||
---- | ||
|
||
The key is the use of the unlabeled variable `c` in the second and third lines. | ||
|
||
Even though we have matched to both end nodes, and they are both potential anchors, in the second Match it is clear that the `user` node is the only one we can expand from; the planner is only aware that `c` is an unlabeled node and not a candidate for an anchor node. | ||
|
||
The requirement that the `c` node we expand to be the same as the `country` node is something the planner can only check after the expansion is finished, so there is no opportunity for it to use `country` as an anchor for expansion. | ||
Review comment: the planner usually does a hash join here checking the expanded end node against the set of "country" nodes. |
Review comment: I would probably rewrite this to something along the lines of:
If both nodes are looked up from an index, an "expand-into" operation happens.
Expansion will be more efficient if the node with the smaller degree is chosen.
(At runtime the degree of both nodes is taken into account.)
I guess the trickier case is when it picks only one index, and on the wrong side (the dense node).
Then you need to force it with `USING INDEX` or `USING JOIN ON` (the other node).
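
The index-hint route mentioned in the comment above could look something like the following sketch. It assumes an index exists on `:User(id)`, and the `RETURN` clause is added only to make the query complete.

[source,cypher]
----
// Ask the planner to find `user` through the index on :User(id).
// This only requests the index lookup; it does not by itself
// guarantee the direction of the expansion.
MATCH (user:User {id:12345})-[:LIVES_IN]->(country:Country {name:'United States'})
USING INDEX user:User(id)
RETURN user, country
----

Since the hint constrains how `user` is found rather than how the pattern is expanded, it is worth confirming with `PROFILE` or `EXPLAIN` that the expansion really starts from `user`.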