Create using-subqueries-to-avoid-the-eager.adoc #178

Open · wants to merge 1 commit into base: master

Conversation

InverseFalcon (Collaborator): No description provided.

:tags: cypher, performance, load-csv
:category: cypher

Eager operators in a query plan can be disruptive, especially when performing writes involving large amounts of data, or batch loading.
jexp (Contributor): Note that Eager aggregation is something different than the write-horizon Eager.



If you've used `USING PERIODIC COMMIT LOAD CSV` to import data into Neo4j, it's likely that at some point you've been bitten by the Eager:
jexp (Contributor): USING PERIODIC COMMIT is no longer supported in 5.x afaik, you might want to mention that.
Btw, you also get Eager with CALL ... IN TRANSACTIONS.
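
To make that concrete, here is a minimal sketch of the `CALL { } IN TRANSACTIONS` form that replaces `USING PERIODIC COMMIT` in Neo4j 5.x. The file URL, label, and property names are invented for illustration, and the statement must run as an implicit (auto-commit) transaction.

----
// Illustrative names only. In Neo4j 5.x, batch commits use
// CALL { } IN TRANSACTIONS instead of USING PERIODIC COMMIT.
LOAD CSV WITH HEADERS FROM 'file:///people.csv' AS row
CALL {
  WITH row
  MERGE (p:Person {id: row.id})
} IN TRANSACTIONS OF 1000 ROWS
----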

Some operations require eagerly pulling in interim results for all rows, which effectively disables the `PERIODIC COMMIT` behavior, possibly causing you to go out of memory when running on a large input dataset.
jexp (Contributor): Explain why it's necessary (to separate the read and write horizons, i.e. do all writes first or do all reads first).
That's why it has to pull all data from the top through the operation, because that effectively creates a "materialized" horizon.
(You don't need to mention that Neo4j doesn't support MVCC.)


The culprit, in an EXPLAIN query plan, is usually the Eager operator, with a dark blue header.
These are not just "monkeywrench" operators meant to disrupt your query; there are valid reasons they exist: they preserve Cypher semantics, which aim to minimize the effect of row order on processing and results.
jexp (Contributor): Explanation is a bit too vague.

In most cases, the Eager operator cannot be removed from the query plan entirely, but with subqueries its effects can be scoped such that they are no longer disruptive and no longer put pressure on the heap.
This article provides some ways to minimize eager behavior by scoping its effect to local per-row executions with subqueries.

NOTE: We won't be talking about EagerAggregations here, which result from aggregation functions like count() and collect().
https://support.neo4j.com/hc/en-us/articles/4403024564243-Using-Subqueries-to-Control-the-Scope-of-Aggregations[We have a separate article for those here], but they can be similarly scoped via subqueries to avoid adding pressure to heap memory.

jexp (Contributor): I would pull that to the front.

jexp (Contributor): Change the link target text to something that Google SEO can use.

== Understanding Eager, and why it exists

What does the Eager operator mean? From the perspective of the query author, it means your query will be processed differently than you might expect: operation-by-operation, with each operation executed for all rows before the next one begins.
As a side effect this may be memory-intensive, as all input rows and intermediate rows are processed at once; holding onto massive sets of interim results can cause long GC pauses, maybe even out-of-memory errors, and definitely prevents you from batch committing as you intended when `USING PERIODIC COMMIT`.
jexp (Contributor): Don't refer to USING PERIODIC COMMIT, just say "batch processing" or similar; that's more version-independent.


Batch commits are not compatible with Eager behavior, as they require lazy row-by-row semantics for correct operation.
While Cypher does not stop you from attempting batch commit operations when there is an Eager in the plan, they will not commit in batches, and you may encounter the above-mentioned issues around GC pauses and heap problems.
jexp (Contributor): I think there should actually be a warning in Cypher that points that out. Not sure if there is; I think it used to be with the old planner.

Why does this happen?

Cypher semantics demand that to the greatest extent possible, operations from later in the query should not influence the results of operations from earlier in the query.
For example, a MERGE that appears later in the query should not influence a MATCH that shows up earlier in the query (on nodes of the same label).
jexp (Contributor): Change MERGE to CREATE.
"Otherwise you could end up with infinite loops where the newly created data is matched again and will lead to more data being created."

Cypher planning would ordinarily try for row-by-row processing, so the entire remaining query would execute for each input row.
Because of that, a MERGE from later in the query, applied while processing an earlier input row, would happen before a later input row has executed its MATCH operation.

If that MERGE from later in the query could affect the results of a MATCH earlier in the query, that would violate the above-mentioned Cypher semantics, so an Eager is planned to preserve them.
jexp (Contributor): And then it could produce infinite loops, or at least affect "perceived already executed" operations.

This causes the change in execution behavior, so instead of lazy row-by-row processing, all rows are processed operation-by-operation.

If the input is too large, either from the very start or from building up over the course of execution, all interim results must be held in memory at once for all rows; this can easily exceed the bounds of the heap and cause out-of-memory errors. That is the problem with Eager.
jexp (Contributor): Probably mention: if your Neo4j instance is configured with transaction memory limits, then the query will be aborted. If that's not the case, the server might run into memory allocation errors.


We have a blog entry by Jennifer Reif discussing Eager and its effects in more detail here:

https://community.neo4j.com/t5/general-discussions/cypher-sleuthing-the-eager-operator/m-p/50596
jexp (Contributor): That link is broken, better to use the original article:
https://medium.com/neo4j/cypher-sleuthing-the-eager-operator-84a64d91a452

Common cases where the planner adds an Eager include:

* MATCH (regular or OPTIONAL) and CREATE clauses (in any ordering) on the same node labels
* MATCH (regular or OPTIONAL) and MERGE clauses (in any ordering) on the same node labels
* CREATE and MERGE clauses (in any ordering) on the same node labels
* Multiple MERGE clauses on the same labels
jexp (Contributor): Which can happen if you load a monopartite graph like (:User)-[:FOLLOWS]->(:User).
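
For example, loading such a monopartite graph might look like the following sketch (file and column names invented); because multiple MERGE clauses touch the same `:User` label, the planner adds an Eager:

----
// Invented file/column names. Multiple MERGE clauses on the same
// :User label cause the planner to add an Eager to this plan.
LOAD CSV WITH HEADERS FROM 'file:///follows.csv' AS row
MERGE (a:User {id: row.follower})
MERGE (b:User {id: row.followed})
MERGE (a)-[:FOLLOWS]->(b)
----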

jexp (Contributor): Perhaps mention somewhere at the beginning that the Cypher planner is sometimes over-eager to insert Eager operations (pun intended), as it would rather be safe than sorry.
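
The example this passage refers to appears to have been lost from this excerpt; judging from the subquery version shown later in the thread, it was presumably along these lines:

----
// Presumed shape of the original example: the UNWIND produces several
// rows, so the MERGE on one row could affect the MATCH on later rows,
// and the planner inserts an Eager between the MATCH and the MERGE.
UNWIND range(1, 5) AS id
MATCH (n:Node {id: id})
MERGE (x:Node {id: id + 1})
----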


Remember that in Cypher, operators produce rows, and execute per row.
That's why the UNWIND is important for the Eager to show up: the planner infers that there are multiple rows for which the MATCH will be called (not just once).
The execution of the MATCH when processing a later row could therefore be influenced by the MERGE performed when processing an earlier row.
jexp (Contributor): Sentence per line.


The same thing would happen if we derived the id from a LOAD CSV, with the difference that we might be ingesting from a massive file, in which case the Eager behavior would be much more impactful on memory.
jexp (Contributor): Or apoc.load procedures, or large matches, e.g. in graph refactorings.


We can see the Eager operator in the resulting query plan here:

image::https://i.imgur.com/7cCwf9x.jpeg[]
jexp (Contributor): Probably better to upload to the CDN?

== Subqueries enforce per-row processing

To review: for each input row, the subquery will execute in full.
The planner has no ability to insert an Eager between separate per-row executions of a single subquery.
jexp (Contributor): But there might be an Eager introduced before the subquery.

jexp (Contributor): See the screenshot I sent you for:

```
MATCH (n)-[r]->()
CALL {
  WITH r
  DELETE r
} IN TRANSACTIONS OF 1 ROWS
```

----
// Reconstructed leading lines (UNWIND and CALL) based on the discussion below.
UNWIND range(1, 5) AS id
CALL {
  WITH id
  MATCH (n:Node {id: id})
  MERGE (x:Node {id: id + 1})
  RETURN true as done
}
RETURN done
----
jexp (Contributor): Now you can apply batches here too.
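
Taking up that suggestion, a sketch of the same subquery with batching applied (the batch size is arbitrary, the RETURN is dropped since a unit subquery suffices, and `CALL { } IN TRANSACTIONS` must run in an implicit, auto-commit transaction):

----
// Same per-row subquery as above, now committing in batches.
UNWIND range(1, 5) AS id
CALL {
  WITH id
  MATCH (n:Node {id: id})
  MERGE (x:Node {id: id + 1})
} IN TRANSACTIONS OF 1000 ROWS
----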

* The subsequent rows from the UNWIND will execute through the subquery in the same manner.
* As a result of the subquery scoping the Eager and enforcing per-row execution here, a single run of this query will produce 5 new nodes, for a total of 6, with ids of 1 through 6.

While this has changed behavior such that it won't pressure the heap, and will again allow sane batch processing, it is important to note that the query results changed!
jexp (Contributor): Statement results.

The difference is in where the Eager occurs in the plan.
In this case, the Eager is on the right-hand side of an Apply operator.

The Apply operator means: for each input row from the left side, do all the stuff on the right side of the operator.
jexp (Contributor): "Do all the operations on the right-hand side of Apply."


Subqueries generate Apply operators, so this plan just confirms that the Eager is scoped to an individual subquery execution, and won't alter behavior outside of the subquery.

When managing eager behavior, this kind of plan is what you're looking for, to confirm that the Eager is scoped behind an Apply.
jexp (Contributor): Sentence per line.


== Nested subqueries for additional scoping

For a more complex query, a single subquery may not be enough to properly rein in the eager behavior.
jexp (Contributor): This feels like the hacks people do with cache-line padding, where they create one subclass with one pad to avoid the CPU/JVM reordering fields.


That is, when the Eager is scoped behind a subquery, each individual subquery execution behaves eagerly, and that's usually enough to make the impact minimal.
But when an individual subquery execution can generate a ton of rows (such as from additional MATCHes), the Eager can still have a negative impact; in that case, nesting another subquery can scope the Eager further.
jexp (Contributor): a ton -> a lot
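
A sketch of what such nesting might look like, with invented labels and properties: the inner subquery scopes the Eager to each matched order, rather than to every row produced by the outer MATCH.

----
// Invented schema, for illustration only. The outer subquery runs once
// per CSV row; the inner subquery scopes the Eager to each :Order row
// instead of the full set of rows from the MATCH expansion.
LOAD CSV WITH HEADERS FROM 'file:///customers.csv' AS row
CALL {
  WITH row
  MATCH (c:Customer {id: row.id})-[:PLACED]->(o:Order)
  CALL {
    WITH o
    MERGE (s:Shipment {orderId: o.id})
    RETURN s
  }
  RETURN count(s) AS shipments
}
RETURN sum(shipments) AS totalShipments
----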


More on usage of aggregations and subqueries can be found here:

https://support.neo4j.com/hc/en-us/articles/4403024564243-Using-Subqueries-to-Control-the-Scope-of-Aggregations
jexp (Contributor): Perhaps point to the public KB instead?


== Using APOC procs as subqueries

If you aren't running Neo4j 4.1 or higher, you can make use of some procs in APOC to act as subqueries for a similar effect.
jexp (Contributor): procs -> procedures

MERGE (e)-[r:DEDICATED_TO]->(c)
----

In this one, we conditionally add the :Customer label to the :Employee node.
jexp (Contributor): Backticks for the labels.
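
The exact query isn't visible in this excerpt, but as a hedged sketch of the kind of conditional the article describes, `apoc.do.when` can run the write as a dynamically executed statement (the `isCustomer` flag and `$id` parameter are assumptions):

----
// Assumed shape of the conditional label update described above.
// apoc.do.when runs the first statement string when the condition is
// true, otherwise the second; both receive the params map.
MATCH (e:Employee {id: $id})
CALL apoc.do.when(
  e.isCustomer = true,
  'SET e:Customer RETURN e',
  'RETURN e',
  {e: e}
) YIELD value
RETURN value.e AS employee
----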


Just like before, isolating the scope with a subquery prevents the planner from adding the Eager; it vanishes from the query plan.

Be aware, however, that APOC procedures that execute a dynamic query like this require overhead to parse, compile, and execute the query, a cost you do not pay when using native Cypher subqueries.
jexp (Contributor): And there can be a potential for Cypher injection when executing subqueries as strings.

@jexp (Contributor) left a review: see my comments
