Create using-subqueries-to-avoid-the-eager.adoc #178
base: master
Conversation
:tags: cypher, performance, load-csv
:category: cypher
Eager operators in a query plan can be disruptive, especially when performing writes involving large amounts of data, or batch loading.
Note that EagerAggregation is something different from the write-horizon Eager.
If you've used `USING PERIODIC COMMIT LOAD CSV` to import data into Neo4j, it's likely at some point that you've been bitten by the Eager:
Using `USING PERIODIC COMMIT` is no longer supported in 5.x, as far as I know; you might want to mention that.
By the way, you also get Eager with `CALL { ... } IN TRANSACTIONS`.
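As a sketch of the version-independent alternative the comment alludes to, the same kind of import can use `CALL { ... } IN TRANSACTIONS`; the file URL, label, and properties here are hypothetical:

```cypher
// Neo4j 4.4+ batching construct; USING PERIODIC COMMIT was removed in 5.x.
// Must be run in an implicit (auto-commit) transaction.
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
CALL {
  WITH row
  MERGE (p:Person {id: row.id})
  SET p.name = row.name
} IN TRANSACTIONS OF 1000 ROWS
```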
Some operations require eagerly pulling in interim results for all rows, which effectively disables the `PERIODIC COMMIT` behavior, possibly causing you to go out of memory when running on a large input dataset.
Explain why it's necessary: to separate the read and write horizons, i.e. do all reads first, then do all writes.
That's why it has to pull all data from the top through the operation, because that effectively creates a "materialized" horizon.
(You don't need to mention that Neo4j doesn't support MVCC.)
The culprit, in an EXPLAIN query plan, is usually the Eager operator, with a dark blue header.
These are not just "monkeywrench operators" meant to disrupt your query; there are valid reasons these operators exist: they maintain the Cypher semantics, which aim to minimize the effect of row order on processing and results.
explanation is a bit too vague
In most cases, the Eager operator cannot be removed from the query plan entirely, but with subqueries its effects can be scoped such that they are no longer disruptive and no longer put pressure on heap memory.
This article provides some ways to minimize eager behavior by scoping its effect to local per-row executions with subqueries.
NOTE: We won't be talking about EagerAggregations here, which result from aggregation functions like count() and collect().
I would pull that to the front
https://support.neo4j.com/hc/en-us/articles/4403024564243-Using-Subqueries-to-Control-the-Scope-of-Aggregations[We have a separate article for those here], but they can be similarly scoped via subqueries to avoid adding pressure to heap memory.
Change the link target text to something that Google SEO can use.
== Understanding Eager, and why it exists
What does the Eager operator mean? From the perspective of those writing the query, it means that your query is going to be processed differently than you may expect: operation-by-operation, with each operation applied to all rows before moving to the next step.
As a side effect this may be memory-intensive, as all input rows and intermediate rows are processed at once, and holding onto massive sets of interim results can cause high GC pauses, maybe even out-of-memory errors, and definitely prevents you from committing in batches as you intended when `USING PERIODIC COMMIT`.
Don't refer to `USING PERIODIC COMMIT`; just say "batch processing" or similar, as that's more version-independent.
Batch commits are not compatible with Eager behavior, as they require lazy row-by-row semantics for correct operation.
While Cypher does not stop you from attempting batch-commit operations when there is an Eager in the plan, they will not commit in batches, and you may encounter the above-mentioned issues around GC pauses and heap problems.
I think there should actually be a warning in Cypher that points that out. Not sure if there is; I think there used to be with the old planner.
Why does this happen?
Cypher semantics demand that, to the greatest extent possible, operations from later in the query should not influence the results of operations from earlier in the query.
For example, a MERGE that appears later in the query should not influence a MATCH that shows up earlier in the query (on nodes of the same label).
Change MERGE to CREATE.
"Otherwise you could end up with infinite loops where the newly created data is matched again, leading to more data being created."
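A minimal sketch of the problem the comment describes; the label and property names are hypothetical:

```cypher
// Without an Eager, each :Person created below could be picked up again
// by the MATCH while later rows are processed, creating more and more data.
// The planner therefore inserts an Eager so the MATCH completes for all
// rows before any CREATE executes.
MATCH (p:Person)
CREATE (:Person {name: p.name + ' (copy)'})
```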
Cypher planning would ordinarily try for row-by-row processing, so the entire remaining query would execute for each input row.
Because of that, a MERGE from later in the query, applied to an earlier input row, would happen before a later input row had executed its MATCH operation.
If that MERGE from later in the query could affect the results of a MATCH earlier in the query, that violates the above-mentioned Cypher semantics, and so an Eager is planned to preserve them.
And then it could produce infinite loops, or at least affect operations that are perceived as already executed.
This causes the change in execution behavior: instead of lazy row-by-row processing, all rows are processed operation-by-operation.
If the input size is too large, either from the very start or building up over the course of execution, then holding all interim results in memory at once for all rows can easily exceed the bounds of the heap and cause out-of-memory errors. This is the problem with Eager.
Probably mention: if your Neo4j instance is configured with transaction memory limits, then the query will be aborted. If that's not the case, the server might run into memory allocation errors.
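For reference, a hedged sketch of the Neo4j 4.x transaction memory limit settings the comment refers to; the names and values are illustrative, so check the documentation for your version:

```
# neo4j.conf (Neo4j 4.x)
# Abort any single transaction that tries to allocate more than this:
dbms.memory.transaction.max_size=512m
# Cap the combined memory of all running transactions:
dbms.memory.transaction.global_max_size=2g
```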
We have a blog entry by Jennifer Reif discussing Eager and its effects in more detail here:
https://community.neo4j.com/t5/general-discussions/cypher-sleuthing-the-eager-operator/m-p/50596 |
That link is broken; better to use the original article: https://medium.com/neo4j/cypher-sleuthing-the-eager-operator-84a64d91a452
* MATCH (regular or OPTIONAL) and CREATE clauses (in any ordering) on the same node labels
* MATCH (regular or OPTIONAL) and MERGE clauses (in any ordering) on the same node labels
* CREATE and MERGE clauses (in any ordering) on the same node labels
* Multiple MERGE clauses on the same labels
Which can happen if you load a monopartite graph like `(:User)-[:FOLLOWS]->(:User)`.
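A sketch of such a monopartite load, where multiple MERGEs on the same `:User` label force an Eager; the file URL and CSV columns are assumptions:

```cypher
// Both node MERGEs touch the :User label, so the planner adds an Eager:
// a user created while processing one row could otherwise be matched by
// the first MERGE when processing a later row.
LOAD CSV WITH HEADERS FROM 'file:///follows.csv' AS row
MERGE (a:User {id: row.follower})
MERGE (b:User {id: row.followee})
MERGE (a)-[:FOLLOWS]->(b)
```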
Eager operators in a query plan can be disruptive, especially when performing writes involving large amounts of data, or batch loading.
Perhaps mention somewhere at the beginning that the Cypher planner is sometimes over-eager to insert Eager operators (pun intended), as it would rather be safe than sorry.
Remember that in Cypher, operators produce rows, and execute per row.
That's why the UNWIND is important for the Eager to show up: the planner infers that there are multiple rows for which the MATCH will be called (not just once), so the execution of the MATCH when processing a later row could be influenced by the MERGE performed when processing an earlier row.
sentence per line
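The query under discussion is cut off in this hunk; based on the description (an UNWIND feeding a MATCH and a MERGE on the same label), it plausibly looks something like this reconstruction:

```cypher
// The planner sees multiple rows coming from UNWIND, so the later MERGE
// on :Node could influence the earlier MATCH on :Node, and an Eager is
// inserted: the MATCH runs for all rows before any MERGE executes.
UNWIND range(1, 5) AS id
MATCH (n:Node {id: id})
MERGE (x:Node {id: id + 1})
RETURN count(*)
```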
The same thing would happen if we derived the id from a LOAD CSV, with the difference that we might be ingesting from a massive file, in which case the Eager behavior would be much more impactful on memory.
or apoc.load procedures or large matches, e.g. in graph refactorings
We can see the Eager operator in the resulting query plan here:
image::https://i.imgur.com/7cCwf9x.jpeg[]
probably better to upload to the CDN?
== Subqueries enforce per-row processing
To review: a subquery means that, per input row, the subquery will execute in full.
The planner has no ability to insert an Eager between separate per-row executions of a single subquery.
But there might be an Eager introduced before the subquery.
See the screenshot I sent you for:

```
MATCH (n)-[r]->()
CALL {
  WITH r
  DELETE r
} IN TRANSACTIONS OF 1 ROWS
```
MATCH (n:Node {id: id})
MERGE (x:Node {id: id + 1})
RETURN true as done
}
Now you can apply batches here too
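Putting the fragment above in context, the full scoped form presumably looks like this reconstruction, based on the surrounding description of ids 1 through 6:

```cypher
// Each per-row execution of the subquery runs in full before the next row
// starts, so row id=2 can MATCH the node the previous row's MERGE created.
// The Eager is confined to a single subquery execution.
UNWIND range(1, 5) AS id
CALL {
  WITH id
  MATCH (n:Node {id: id})
  MERGE (x:Node {id: id + 1})
  RETURN true as done
}
RETURN count(*)
```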
* The subsequent rows from the UNWIND will execute through the subquery in a similar manner.
* As a result of the subquery scoping the Eager and enforcing per-row execution here, a single run of this query will produce 5 new nodes, for a total of 6, with ids of 1 through 6.
While this has changed behavior such that it won't pressure the heap, and will again allow sane batch processing, it is important to note that the query results changed!
statement results
The difference is in where the Eager occurs in the plan.
In this case, the Eager is on the right-hand side of an Apply operator.
The Apply operator means: for each input row from the left side, do all the stuff on the right side of the operator.
do all the operations on the right hand side of apply.
Subqueries generate Apply operators, so this plan just confirms that the Eager is scoped to an individual subquery execution, and won't alter behavior outside of the subquery.

When managing eager behavior, this kind of plan is what you're looking for to confirm that the Eager is scoped behind an Apply,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sentence per line
== Nested subqueries for additional scoping
For a more complex query, a single subquery may not be enough to properly rein in the eager behavior.
this feels like the hacks people do with cache-line padding where they create one subclass with one pad to avoid the CPU/JVM reordering fields
That is, when the Eager is scoped behind a subquery, it means each individual subquery execution behaves eagerly, and that's usually enough to make the impact minimal.
But when an individual subquery execution can generate a ton of rows (such as from additional MATCHes) such that the Eager still retains a negative impact,
a ton -> a lot
More on usage of aggregations and subqueries can be found here:

https://support.neo4j.com/hc/en-us/articles/4403024564243-Using-Subqueries-to-Control-the-Scope-of-Aggregations
perhaps point to the public KB instead?
== Using APOC procs as subqueries
If you aren't running Neo4j 4.1 or higher, you can make use of some procs in APOC to act as subqueries for a similar effect.
procs -> procedures
MERGE (e)-[r:DEDICATED_TO]->(c)
----
In this one, we conditionally add the :Customer label to the :Employee node.
backticks for the labels
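A hedged sketch of the conditional-label pattern being discussed, using `apoc.do.when`; the condition property, labels, and surrounding MATCH are assumptions:

```cypher
// apoc.do.when compiles and runs its ifQuery/elseQuery as separate
// statements per row, so the outer plan cannot gain an Eager from them.
MATCH (e:Employee)
CALL apoc.do.when(
  e.isAlsoCustomer,              // assumed condition property
  'SET e:Customer RETURN e',     // ifQuery, run when the condition is true
  'RETURN e',                    // elseQuery
  {e: e}
) YIELD value
RETURN count(*)
```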
Just like before, isolating the scope with a subquery prevents the planner from adding the Eager; it vanishes from the query plan.
Be aware, however, that usage of APOC procedures that execute a dynamic query like this requires overhead to parse, compile, and execute the query, a cost that you do not have to pay when using native Cypher subqueries.
And there is the potential for Cypher injection when executing subqueries as strings.
see my comments
No description provided.