Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Push Down Sort/Limit/Project through LogicalEval #2864

Open
1 of 3 tasks
qianheng-aws opened this issue Jul 29, 2024 · 2 comments
Open
1 of 3 tasks

[RFC] Push Down Sort/Limit/Project through LogicalEval #2864

qianheng-aws opened this issue Jul 29, 2024 · 2 comments
Labels
enhancement New feature or request PPL Piped processing language SQL

Comments

@qianheng-aws
Copy link
Contributor

qianheng-aws commented Jul 29, 2024

Is your feature request related to a problem?
Currently in a logical plan of OpenSearch, the LogicalEval is a blocker for many push-down optimization rules like TableScanPushDown.PUSH_DOWN_PROJECT, TableScanPushDown.PUSH_DOWN_SORT, TableScanPushDown.PUSH_DOWN_LIMIT and etc. Not only does it reduce the performance of execution, it also lead to wrong results if we cannot do pushdown optimization.

E.g. Below 2 PPL are semantic equal, but the latter one with eval retrieves all fields from OpenSearch(which is unnecessary), while not enough documents due to the limitation of settings of plugins.query.size_limit.

# no eval
POST _plugins/_ppl/_explain
{
  "query": """
    source=opensearch_dashboards_sample_data_flights |  head 300 | fields FlightTimeMin
  """
}
# explain result
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[FlightTimeMin]"
    },
    "children": [
      {
        "name": "OpenSearchIndexScan",
        "description": {
          "request": """OpenSearchQueryRequest(indexName=opensearch_dashboards_sample_data_flights, sourceBuilder={"from":0,"size":200,"timeout":"1m","_source":{"includes":["FlightTimeMin"],"excludes":[]}}, searchDone=false)"""
        },
        "children": []
      }
    ]
  }
}

--------------------------------------------------------

# with eval
POST _plugins/_ppl/_explain
{
  "query": """
    source=opensearch_dashboards_sample_data_flights | eval FlightMin = FlightTimeMin |  head 300 | fields FlightTimeMin
  """
}
# explain result
{
  "root": {
    "name": "ProjectOperator",
    "description": {
      "fields": "[FlightTimeMin]"
    },
    "children": [
      {
        "name": "LimitOperator",
        "description": {
          "limit": 10,
          "offset": 0
        },
        "children": [
          {
            "name": "EvalOperator",
            "description": {
              "expressions": {
                "FlightMin": "FlightTimeMin"
              }
            },
            "children": [
              {
                "name": "OpenSearchIndexScan",
                "description": {
                  "request": """OpenSearchQueryRequest(indexName=opensearch_dashboards_sample_data_flights, sourceBuilder={"from":0,"size":300,"timeout":"1m"}, searchDone=false)"""
                },
                "children": []
              }
            ]
          }
        ]
      }
    ]
  }
}

What solution would you like?
Since eval operator only adds or updates fields without any benefit on skipping or project data, and it could block other optimization, we should push down sort/limit/project operator through eval to evaluate eval's output as late as possible.

1. Limit

For limit, we could push down it under eval directly since it doesn't have any expressions.

2. Sort

For sort, for each sort field, we should analyze whether it's produced by eval.

2.1 All sort fields are not produced by eval, which means they can all be resolved by the Environment before eval, so we can push down it under eval directly

2.2 One or more fields are produced by eval, there are also some different cases we can dive deep whether replacement can work as semantic equally like sort(a) - eval(a=b+1), or other cases it cannot like sort(a) - eval(a=b^2). And there may be cases sort field are calculated by more than 1 original fields. I think we should talk about this case in another issue for advance enhancement.

  • 2.2.1 the field is produced by equation to ReferenceExpression which is just simple rename like eval(sortField=originField), we can also push down sort by do some field replacement for these fields. Of course, there is some more complex cases like eval(newField=originField, sortField=newField), we should do replacement iteratively from right to left, or we just narrow down this case to not let it push down.

  • 2.2.2 the field is produced by function, there are also some different cases we can dive deep whether replacement can work as semantic equally like sort(a) - eval(a=b+1), or other cases it cannot like sort(a) - eval(a=b^2). And there may be cases sort field are calculated by more than 1 original fields. I think we should talk about this case in another issue for advance enhancement.

3. Project

When we say pushing down a project under eval, it actually inserts another project before eval instead of pushing down the origin project itself.

We should consruct the new projectList by append all eval reference expressions to the original projectList, and kick off all reference expressions produced by eval. For example, if we have plan project(a, b, d) - eval(b=c+2, a=b+1), we should optimize it to project(a, b, c) - eval(a=b+1, b=c+2)) - project(c, d). Thus, we can do project in advance or push down project(c, d) into TableScan if possible, and ensure they are semantic equal.

Sub-Tasks

What alternatives have you considered?

Do you have any additional context?

@qianheng-aws qianheng-aws added enhancement New feature or request untriaged labels Jul 29, 2024
@qianheng-aws qianheng-aws changed the title [FEATURE] Pull Up LogicalEval to evaluate as late as possible [FEATURE] Push Down Sort/Limit/Project through LogicalEval Jul 29, 2024
@LantaoJin
Copy link
Member

Not all scenarios for operators sort/limit/project in a query are able to push down through eval. Could you explain more details of your design in this description? @qianheng-aws

@LantaoJin LantaoJin changed the title [FEATURE] Push Down Sort/Limit/Project through LogicalEval [RFC] Push Down Sort/Limit/Project through LogicalEval Aug 2, 2024
@YANG-DB YANG-DB added SQL PPL Piped processing language labels Nov 29, 2024
@YANG-DB
Copy link
Member

YANG-DB commented Nov 29, 2024

@qianheng-aws can we continue this PR for final reviews and eventually merge ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request PPL Piped processing language SQL
Projects
Status: Design
Development

No branches or pull requests

3 participants