New article on forcing direction of an expansion #179
If only one of these nodes is chosen as an anchor node, the expansion will start there, expand the relationship to the other node, and then filter the other node's label and properties to find the matches.

If both nodes are looked up via index, still only one will be the anchor node that we expand from, so the expansion process should remain about the same, but filtering will be more efficient, only having to filter on the node's internal graph id to see if the other node is the same one we matched earlier.
I would probably rewrite this to something along the lines of:
If both nodes are looked up from an index, an "expand-into" operation is happening.
Expansion will be more efficient if the node with the smaller degree is chosen.
(At runtime the degree of both nodes will be taken into account)
I guess the trickier case is when it only picks one index, and on the wrong side (the dense node).
Then you need to force it with `USING INDEX` or `USING JOIN ON` (the other node).
It should be clear that there are two possible ways to expand the pattern in this case, and that one is going to be far more efficient than the other.
Actually 4:
- left + expand
- right + expand
- both + join left
- both + join right
=== The costly direction

If we anchored on the country node, we would have to expand to every user in the United States, then filter all of them to find the ones that match the pattern.
This would result in a ton of expanded rows, and a lot of filtering work where we would likely be throwing out every single row, except one.
a ton -> a lot or "many"
Even if we find that user early in the filtering, nothing in the query tells the engine to stop looking (no `LIMIT 1` is present). And without a unique constraint, nothing prevents another node with the same properties from being in the graph (or multiple relationships going to the same node), so it will keep on matching, filtering, and throwing out all the non-matching results.
This is a really important aspect: even with the unique constraint (if it's not used), the planner will not take it into account and will still filter all the remaining rows.
Probably highlight this with a NOTE or similar.
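A minimal sketch of the early-exit case, assuming the `:User`/`:Country` model and the example values from the article, and assuming a single match is all that is needed:

```cypher
// LIMIT 1 lets execution stop as soon as the first matching row is produced,
// instead of expanding and filtering the remaining rows.
MATCH (user:User {id: 12345})-[:LIVES_IN]->(country:Country {name: 'United States'})
RETURN user
LIMIT 1
```

This only reduces the work done after the first hit; it does not change which node the planner anchors on.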
There are other ways the planner can decide to plan a query, some of them being cheaper with a small number of nodes and relationships, and some far more costly.

A NodeHashJoin, for example (when we expand from two different anchor nodes to a common node in the middle of the pattern), might be very quick when the number of anchor nodes matched is low, and when we are expanding only a few relationships from the anchor nodes.
Probably simplify the sentence: start from the two nodes, then say that, depending on the side, the expansion checks against a set of already-found nodes (that's why it's a hash join).
Not sure if you want to put joins into a separate KB article and link to it?
But this can be very expensive if we're traversing a ton of relationships, becoming a hindrance to query execution.
That's why you can choose which side to join on with the index hint.
=== The cheap direction

If we anchored on the user node, and we assume that each user has only one `:LIVES_IN` relationship, we would only have to expand that single relationship and filter the one connected node to see if the user really does live in the United States.
you didn't say how to force the "cheap" direction :)
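One way to ask for the cheap direction is an index hint on the user side. A sketch, assuming an index exists on `:User(id)` (the excerpt doesn't show the schema). The hint requests the index lookup for `user`; it does not by itself guarantee the expansion direction, so the resulting plan is worth checking:

```cypher
// Ask the planner to find `user` through the :User(id) index.
// Use EXPLAIN or PROFILE to confirm the expansion starts from `user`.
MATCH (user:User {id: 12345})-[:LIVES_IN]->(country:Country {name: 'United States'})
USING INDEX user:User(id)
RETURN user
```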
In the case of more complicated queries, there may be quite a few different nodes that could potentially be used as anchor nodes, with many possibilities on which one or which combination to anchor on, and how to expand to fulfill the desired patterns.
Point out that the reader should build the query step by step and look at the plan for the number of rows produced by an expansion (from either side) to select the better one (using knowledge of the model too, of course).
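A sketch of that step-by-step check, using `PROFILE` on each side of the pattern and comparing the rows reported for the expand step. Labels, properties, and values are the article's examples; each statement is meant to be run separately:

```cypher
// Step 1: expand from the user side and note the rows after the expand.
PROFILE
MATCH (user:User {id: 12345})-[:LIVES_IN]->(c)
RETURN count(c);

// Step 2: expand from the country side and compare.
PROFILE
MATCH (country:Country {name: 'United States'})<-[:LIVES_IN]-(u)
RETURN count(u);
```

The side whose expansion produces fewer rows is usually the better anchor.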
In any case, it is possible for the planner to make a bad choice, either because the approach isn't universally efficient across all data in your graph (some nodes may be supernodes, and cause the query over them to choke) or the metadata available to the planner is insufficient to warn it away from these more expensive expansions.
Probably have a separate highlighted statement that says "don't query across or from supernodes with many relationships, only against them, i.e. from the other side".
In these cases, what we as humans know about the general shape of the graph may be greater than what can be inferred via metadata. Remember that walking the actual graph data is not possible here, since we're talking about query planning, which precedes execution.
really a shame that we have no histograms
actually ExpandInto does a runtime check for degrees and picks the smaller non-dense or smaller side to expand from :)
=== Using hash joins to force expansion to a supernode

Remember that supernodes are really only problematic when expanding through them or away from them; depending on your graph data, it may be just fine if you are only expanding to them.
Highlight this with a NOTE.
In the case where you have multiple efficient anchor nodes and a known or potential supernode in the middle of the pattern, and you know the expansion TO the supernode from both sides is cheap, you can use a join hint to force expanding to the supernode.
I would still show an example.
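A sketch of what such an example might look like, assuming two users looked up by `id` and a `:Country` node in the middle that may be a supernode (the second `id` value is made up for illustration):

```cypher
// Both :User lookups are cheap anchors. The join hint asks the planner to
// expand from each side TO `country` and hash-join the two results there,
// rather than expanding through or away from the potential supernode.
MATCH (a:User {id: 12345})-[:LIVES_IN]->(country:Country)<-[:LIVES_IN]-(b:User {id: 67890})
USING JOIN ON country
RETURN country.name
```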
That is, the planner prefers index lookups to find anchor nodes, though it may still use label scans to find anchor nodes (especially when there are no opportunities for index lookups in the query).

Neither index lookups nor label scans are possible when the label isn't present in the pattern (or otherwise described in a `WHERE` clause, like `WHERE c:Country`), and these are the two most common means of finding anchor nodes in the graph.
Of course, you may need to have that label in the query for the sake of correctness, otherwise the wrong nodes and patterns might be matched.
But that can be accommodated with more specific relationship types too.
[source,cypher]
----
MATCH (user:User {id:12345})-[:LIVES_IN]->(country {name:'United States'})
This is really ouch; not sure you want to show this.
In this case, not checking the label should be fine, as the relationship type is specific enough.
Even though we have matched to both end nodes, and they are both potential anchors, in the second `MATCH` it is clear that the `user` node is the only one we can expand from; the planner is only aware that `c` is an unlabeled node and not a candidate for an anchor node.

The check that the `c` node we expand to is the same as the `country` anchor node is something the planner can only apply after the expansion is finished, so there is no opportunity for it to use `country` as an anchor for expansion.
the planner usually does a hash join here checking the expanded end node against the set of "country" nodes.
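For reference, a sketch of the two-`MATCH` form this thread seems to be discussing. The excerpt doesn't show the article's actual query, so the shape below is an assumption pieced together from the variable names `user`, `country`, and `c` in the text:

```cypher
// First MATCH: look up both end nodes.
MATCH (user:User {id: 12345}), (country:Country {name: 'United States'})
// Second MATCH: `c` is unlabeled, so the expansion can only start from `user`.
MATCH (user)-[:LIVES_IN]->(c)
// The identity check is applied to the expanded end node.
WHERE c = country
RETURN user
```

Running it with `EXPLAIN` shows whether the planner filters after the expand or joins against the `country` lookup.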
Harsh but true