Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce SPI and Core Support for JDBC Join Pushdown Optimization #24115

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

Haritha-Koloth
Copy link
Contributor

@Haritha-Koloth Haritha-Koloth commented Nov 22, 2024

Description

This change introduces foundational support for join pushdown in JDBC connectors. By delegating join operations, directly to the remote datasource, it unlocks significant performance gains by leveraging the processing capabilities of the underlying datasource.

Motivation and Context

Fixes 23152

Impact

This update includes only the SPI and core changes for the feature. Users will see its impact once the JDBC modifications to leverage the pushdown optimizer are also integrated. The primary benefit is enhanced performance for INNER join queries involving large tables with substantial row counts.

Detailed RFC - prestodb/rfcs#32

Test Plan

  • Unit tests added for newly added optimizer

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

Please follow release notes guidelines and fill in the release notes below.

== RELEASE NOTES ==

General Changes
* Add configuration property ``optimizer.inner-join-pushdown-enabled`` to support inner join pushdown. Defaults to ``false``. Session property to override this ``optimizer_inner_join_pushdown_enabled``. :pr:`24115`
* Add configuration property ``optimizer.inequality-join-pushdown-enabled`` to support inequality join pushdown. Defaults to ``false``. Session property to override this ``optimizer_inequality_join_pushdown_enabled``. :pr:`24115`

If release note is NOT required, use:

== NO RELEASE NOTE ==

@Haritha-Koloth Haritha-Koloth force-pushed the join_pushdown_spi branch 2 times, most recently from e248f96 to 50a7e8b Compare November 22, 2024 07:22
@Haritha-Koloth
Copy link
Contributor Author

The jdbc code to leverage the changes in this PR could be found here - Haritha-Koloth#2

@Haritha-Koloth
Copy link
Contributor Author

Test failure seems to be unrelated to the code changes in this PR. Raised a sample PR with ReadMe change and it fails with the same error too - #24118

Will force-push after some time to see if the error resolves.

@tdcmeehan tdcmeehan added the from:IBM PR from IBM label Nov 22, 2024
@prestodb-ci prestodb-ci requested review from a team, pratyakshsharma and anandamideShakyan and removed request for a team November 22, 2024 15:56
@prestodb-ci
Copy link

Saved that user @Haritha-Koloth is from IBM

@Haritha-Koloth Haritha-Koloth marked this pull request as ready for review November 22, 2024 15:57
@Haritha-Koloth Haritha-Koloth force-pushed the join_pushdown_spi branch 2 times, most recently from c7d072a to 471eb62 Compare December 2, 2024 05:27
@ethanyzhang ethanyzhang requested review from a team and removed request for a team December 4, 2024 16:14
@lschampion
Copy link

this feature is very useful, and significant when presto as virtual data engine.

@jaystarshot
Copy link
Member

jaystarshot commented Dec 10, 2024

By delegating join operations, directly to the remote datasource, it unlocks significant performance gains by leveraging the processing capabilities of the underlying datasource.

Just curious, why will delegating joins to the datasource result in performance gains?
Nvm just saw one use case where data was sent to presto and then joined so saves transfer cost or something?

@aaneja
Copy link
Contributor

aaneja commented Dec 12, 2024

@jaystarshot Join pushdown will help when -

  1. We have a reducing join - data that Presto reads from the 'join' source will be much smaller than streaming the two join sources
  2. The underlying relational system can do more joins more efficiently than Presto - say use index lookups instead of having to do a hash-join

See these examples listed in the RFC

Copy link
Contributor

@aaneja aaneja left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did a quick first pass. I think we should switch GroupInnerJoinsByConnector to be an iterative optimizer rule.

ImmutableList.of(
new SchemaTableName("test_schema", "test_view"),
new SchemaTableName("test_schema", "another_table")))
.withConnectorCapabilities(ImmutableSet.of()).build();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's have all the connectors SUPPORTS_JOIN_PUSHDOWN. The test should work with or without the connectors supporting it

@Test
public void testDoesNotPushDownOuterJoin()
{
String catalogName = "test_catalog";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please try refactoring this setup code into smaller methods that can be reused across tests

VariableReferenceExpression right = newBigintVariable("a2");
EquiJoinClause joinClause = new EquiJoinClause(left, right);

PlanNode plan = output(join(FULL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Join pushdown will also not work for LEFT and RIGHT joins, can we add those assertions too ?

constraint,
Optional.empty());

PlanBuilder planBuilder = new PlanBuilder(session, new PlanNodeIdAllocator(), metadata);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused

public class GroupInnerJoinsByConnector
implements PlanOptimizer
{
private static final Logger logger = Logger.get(GroupInnerJoinsByConnector.class);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unused

{
if (isEnabled(session)) {
PlanNode rewrittenPlan = SimplePlanRewriter.rewriteWith(new Rewriter(functionResolution, determinismEvaluator, idAllocator, metadata, session), plan);
return PlanOptimizerResult.optimizerResult(rewrittenPlan, !rewrittenPlan.equals(plan));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We've been preferring the use of the IterativeOptimizer over PlanOptimizer.

Since your rewrite can very explicitly determine if the plan is changed (by using rewrittenPlan.equals(plan)) modifying this class to instead implement an iterative optimizer rule (i.e implements Rule<JoinNode>) should be pretty straightforward.

This should reduce the test burden as well - you will be able to use a RuleTester and its associated helpers directly

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @aaneja , @Haritha-Koloth

Below are the reasons to use PlanOptimizer instead of Iterative optimizer Rule.

  1. Need to visit Multiple nodes (visitJoin, visitFilter, etc.). Using a Rule we may able to perform operations on a single node only, I guess.

  2. Repeated application of JoinNode Rule, due to child join.

Not sure these cases can resolved through a Single Iterative Optimizer Rule. For handling filter node, required a Rule and for join required another Rule.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@aaneja Thanks! We are exploring the potential and viability of the above suggestions.

Copy link

@Ajas-Mangal Ajas-Mangal Dec 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @Haritha-Koloth and @aaneja

I think we already did some good test coverage for this using PlanOptimizer itself.

Do we have any additional benefit for changing to optimizer Rule? If we use optimizer Rule how we can handle non-equijoin scenario, where we have to visit FilterNode and JoinNode

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Ajas-Mangal As per the discussion with @aaneja, the usage of IterativeOptimizer is being preferred over PlanOptimizer implementations and there is a potential advantage of opening up more nodes for pushdown.

As for the non-equijoin scenario, I guess we can have multiple rule matchers to handle the filter and join nodes.

I am trying out the suggested changes, will update here if met with some blockers/concerns.

@ethanyzhang
Copy link
Contributor

Hi @aaneja , any next step on this?

@aaneja
Copy link
Contributor

aaneja commented Jan 7, 2025

Hi @aaneja , any next step on this?

A rewrite of the existing rule as an IterativeOptimizer is being explored, see #24115 (comment).

@steveburnett
Copy link
Contributor

Thanks for the release note entry! Nit, please include the PR number for each item.

== RELEASE NOTES ==

General Changes
* Add configuration property ``optimizer.inner-join-pushdown-enabled`` to support inner join pushdown. Defaults to ``false``. Session property to override this ``optimizer_inner_join_pushdown_enabled``. :pr:`24115`
* Add configuration property ``optimizer.inequality-join-pushdown-enabled`` to support inequality join pushdown. Defaults to ``false``. Session property to override this ``optimizer_inequality_join_pushdown_enabled``. :pr:`24115`

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add documentation for the configuration properties optimizer.inner-join-pushdown-enabled and optimizer.inequality-join-pushdown-enabled, and the session properties optimizer_inner_join_pushdown_enabled and optimizer_inequality_join_pushdown_enabled.

@Haritha-Koloth
Copy link
Contributor Author

Please add documentation for the configuration properties optimizer.inner-join-pushdown-enabled and optimizer.inequality-join-pushdown-enabled, and the session properties optimizer_inner_join_pushdown_enabled and optimizer_inequality_join_pushdown_enabled.

@steveburnett - Is this the correct file to add documentation for the above configurations?

@steveburnett
Copy link
Contributor

Please add documentation for the configuration properties optimizer.inner-join-pushdown-enabled and optimizer.inequality-join-pushdown-enabled, and the session properties optimizer_inner_join_pushdown_enabled and optimizer_inequality_join_pushdown_enabled.

@steveburnett - Is this the correct file to add documentation for the above configurations?

* https://github.com/prestodb/presto/blob/05bc56ceddaf7debed9f5fd7aa20e07093aea969/presto-docs/src/main/sphinx/admin/properties.rst#optimizer-properties

Yes, for the configuration properties. Please also document the session properties by updating https://github.com/prestodb/presto/blob/05bc56ceddaf7debed9f5fd7aa20e07093aea969/presto-docs/src/main/sphinx/admin/properties-session.rst.

@Haritha-Koloth Haritha-Koloth force-pushed the join_pushdown_spi branch 2 times, most recently from 151c76d to 30a3dbb Compare January 13, 2025 16:27
Join Pushdown SPI and Core changes

Co-Authored-By: Ajas M <[email protected]>
Co-Authored-By: Glerin Pinhero <[email protected]>
Co-Authored-By: Thanzeel Hassan <[email protected]>
Co-Authored-By: Anant Aneja <[email protected]>
@Haritha-Koloth
Copy link
Contributor Author

Hi @aaneja

We explored your suggestion to convert GroupInnerJoinsByConnector into an IterativeOptimizer. Below are our observations and findings:

  • Equality Joins: For equality joins, we are able to construct a single table scan node based on the current logic. However, this approach fails because the IterativeOptimizer flow requires the input and output variables to remain consistent. When building the single table scan node, we need to include output variables from all sources to enable predicate pushdown. Below is the specific error encountered after the optimizer is run once.
    java.lang.IllegalArgumentException: com.facebook.presto.sql.planner.optimizations.GroupInnerJoinsByConnector: transformed expression doesn't produce same outputs: [intcolumn2, varcharcolumn1_2] vs [intcolumn1, intcolumn2, intcolumn1_0, varcharcolumn1_2] for node: com.facebook.presto.spi.plan.FilterNode@d85792e3

  • Inequality Joins: To handle inequality joins, we may need to create a separate rule that can accommodate FilterNodes. Alternatively, we may need to enhance the logic in Pattern.java to return a generic node type(PlanNode) after evaluating multiple pattern matches and implement Rule from GroupInnerJoinsByConnector.

  • Exit Condition: Additionally, we may need to introduce an exit condition to skip over processed join nodes. This would address scenarios where nodes are being iteratively processed and prevent redundant processing.

The modified code is available here - iterative_opt_pushdown

Please let us know your thoughts on these.

CC: @Ajas-Mangal

@aaneja
Copy link
Contributor

aaneja commented Jan 14, 2025

  1. Yes, the IterativeOptimizer has an invariant that a transformation cannot produce more or less output variables than the input plan nodes. A simple way to restrict the ouputs is to add a ProjectNode that constrains the output to the required variables, e.g aaneja@dc2eff4. If there is scope for pushing down this ProjectNode, it will be handled by later rules.
  2. The pattern matcher allows for creating multiple nested matchers and has the ability to extract any specific node during the matching process. See this matcher for example. If you can describe the kind of matcher you need, I can help you build it
  3. Can you describe an example of redundant processing ? There can be cases where we reapply the same rule more than once on the same already optimized node. This is by design. The 'exit' condition we want for an IterativeOptimizer rule is that there are no plan modifications after rule re-application

Reg 1 - I did a couple of join pushdown tests with the outputs restricted (see aaneja/iterative_opt_pushdown). I did not find any errors here. Please feel to try it out with your setup and let me know if you spot any failures

Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the doc! Formatting nits, all the links and everything else looks good.

presto-docs/src/main/sphinx/admin/properties-session.rst Outdated Show resolved Hide resolved
presto-docs/src/main/sphinx/admin/properties.rst Outdated Show resolved Hide resolved
Copy link
Contributor

@steveburnett steveburnett left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! (docs)

Pull updated branch, new local doc build, looks good. Thanks!

@Haritha-Koloth
Copy link
Contributor Author

  • The pattern matcher allows for creating multiple nested matchers and has the ability to extract any specific node during the matching process. See this matcher for example. If you can describe the kind of matcher you need, I can help you build it

Hi @aaneja , regarding the point 2):

We need to build a matcher that can accept a join OR a filter. But I could not find any examples where 2 different types of nodes can be handled via a single rule. The kind of node we are trying to accept should be mentioned on the implementation also, right? How can we accommodate 2 different types there? Did you mean that we can have 2 different rules and patterns for these 2 use cases? Or is there a way we could use a generic PlanNode for this?

@aaneja
Copy link
Contributor

aaneja commented Jan 16, 2025

@Haritha-Koloth What you have is a case where you want to match on JoinNode or Filter<-JoinNode and in the latter case, use the filter as an extra filter while building the 'combined' MultiJoinNode.
You can use a rule-set to define the relevant patterns and call a base-class for the actual rewriting
I rewrote your code here - aaneja@ed62d8b

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
from:IBM PR from IBM
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Implement Jdbc join pushdown capabilities in presto
9 participants