
Add affected, inserted, updated, deleted rows to DatabricksAdapterResponse #883

Open · ghjklw wants to merge 1 commit into main

Conversation

@ghjklw commented Dec 18, 2024

Resolves #351

Description

Parse the result of the Databricks query to extract the following variables and add them to the DatabricksAdapterResponse:

  • num_affected_rows
  • num_inserted_rows
  • num_deleted_rows
  • num_updated_rows

I have only tested this quickly, but I have been able to retrieve results like this in run_results.json:

      "adapter_response": {
        "_message": "OK",
        "rows_affected": 358,
        "query_id": "2128ebeb-d6e8-4033-8849-dfa2a5279d65",
        "rows_updated": 358,
        "rows_deleted": 0,
        "rows_inserted": 0
      },

I have not yet written any tests for it, as I would first like to get your feedback on the approach. It also certainly needs more testing to make sure that it returns these metrics consistently and has no side effects on other methods like fetchone, fetchmany and fetchall (the last two in particular are likely to have been impacted).
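
For reference, a minimal sketch of the parsing idea (the DatabricksQueryImpact name comes from this PR's diff; the helper function and the dict conversion are illustrative assumptions, not the final implementation):

    from dataclasses import dataclass
    from typing import Any, Optional


    @dataclass
    class DatabricksQueryImpact:
        # Field names follow the metrics columns returned by Delta DML statements.
        num_affected_rows: Optional[int] = None
        num_updated_rows: Optional[int] = None
        num_deleted_rows: Optional[int] = None
        num_inserted_rows: Optional[int] = None


    def parse_query_impact(fields: dict[str, Any]) -> DatabricksQueryImpact:
        # `fields` is the single metrics row returned by a MERGE/INSERT/UPDATE/
        # DELETE, already converted to a dict (e.g. via Row.asDict()). Missing
        # columns (e.g. for non-Delta tables) simply stay None.
        return DatabricksQueryImpact(
            num_affected_rows=fields.get("num_affected_rows"),
            num_updated_rows=fields.get("num_updated_rows"),
            num_deleted_rows=fields.get("num_deleted_rows"),
            num_inserted_rows=fields.get("num_inserted_rows"),
        )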

Checklist

  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt-databricks next" section.

@ghjklw (Author) commented Dec 18, 2024

By the way, a much cleaner (and potentially more efficient) approach than caching the fetchone result would be to make sure this is executed only for certain statement types (MERGE, INSERT, UPDATE and DELETE), or even better, based on the materialization type. That way, we would know that there is no reason to call a fetch method on the cursor, and we could get rid of the caching issue.

Likewise, I don't believe this would work if the table is not Delta. It would not fail either, but it would be nice to avoid trying.

I found no way in DatabricksConnectionManager.execute to get any information about the query type, materialization or table type, and parsing the SQL query would be even more hacky. Maybe there could be a way to feed this information to it? For example, it could take an additional optional boolean parameter (defaulting to False) to define whether the query impact metadata should be returned, just like I did for get_response.
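
To illustrate, a rough method-body sketch of what that opt-in flag could look like (the flag name and the helper calls are hypothetical; the surrounding signature loosely follows dbt's base connection manager execute):

    from typing import Optional

    # Inside DatabricksConnectionManager (method-body sketch only):
    def execute(
        self,
        sql: str,
        auto_begin: bool = False,
        fetch: bool = False,
        limit: Optional[int] = None,
        collect_query_impact: bool = False,  # proposed opt-in, defaults to False
    ):
        _, cursor = self.add_query(sql, auto_begin)
        # Read the metrics row only when explicitly requested, so plain SELECTs
        # never go through the fetchone-caching path at all.
        response = self.get_response(cursor, collect_query_impact=collect_query_impact)
        table = self.get_result_from_cursor(cursor, limit) if fetch else None
        return response, table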

        return self._cursor.fetchall()

    def fetchone(self) -> Optional[tuple]:
    def query_impact(self) -> DatabricksQueryImpact:

@nicor88 commented Dec 18, 2024

A little nit on the name: maybe you could call it get_query_statistics. I'm not a native English speaker, but in other query engines the term "statistics" is used for this type of information.
If you like this name more, you should consider replacing impact with statistics everywhere for consistency.

@benc-db (Collaborator) commented Dec 18, 2024

Another great issue to tackle, thanks.

@benc-db (Collaborator) left a comment

I like the feature. Can we add some unit tests?
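
For instance, a couple of pytest-style cases against the parse_query_impact helper sketched in the PR description above (the helper itself is an assumption; the real tests would target the actual cursor wrapper):

    # Reusing the parse_query_impact / DatabricksQueryImpact sketch from above.
    def test_parse_query_impact_maps_delta_dml_metrics():
        impact = parse_query_impact(
            {
                "num_affected_rows": 358,
                "num_updated_rows": 358,
                "num_deleted_rows": 0,
                "num_inserted_rows": 0,
            }
        )
        assert impact.num_affected_rows == 358
        assert impact.num_updated_rows == 358
        assert impact.num_deleted_rows == 0
        assert impact.num_inserted_rows == 0


    def test_parse_query_impact_tolerates_missing_columns():
        # e.g. a non-Delta table whose result carries no metrics columns
        impact = parse_query_impact({})
        assert impact.num_affected_rows is None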

@@ -179,12 +180,21 @@ def dbr_version(self) -> tuple[int, int]:
        return self._dbr_version


@dataclass
class DatabricksQueryImpact:

@benc-db (Collaborator)

As mentioned by the other commenter, 'statistics' is used elsewhere for a similar concept, so it could be used here.

        except Error:
            return DatabricksQueryImpact()

        if not self._cache_fetchone:

@benc-db (Collaborator)

Is this condition reachable? When you fetchone(), is it ever empty?

    def fetchone(self) -> Optional[Row]:
        if self._cache_fetchone:
            # If `fetchone` result was cached by `query_metadata`, return it and invalidate it
            row = self._cache_fetchone

@benc-db (Collaborator)

Why do we invalidate the cache, but in the callsite reassign the cache to the same value?

@benc-db (Collaborator)

feels like you would just want:

def fetchone(self) -> Optional[Row]:
    if not self._cache_fetchone:
        self._cache_fetchone = self._cursor.fetchone()
    return self._cache_fetchone

@benc-db (Collaborator) commented Dec 20, 2024

@ghjklw I'll be on vacation for the next two weeks. Just wanted to give you a heads-up that I still want your two PRs, even though you won't see any activity from me until the new year.

@ghjklw (Author) commented Dec 20, 2024

> @ghjklw I'll be on vacation for the next two weeks. Just wanted to give you a heads-up that I still want your two PRs, even though you won't see any activity from me until the new year.

Thanks for the heads up and no worries! That will give me some time to polish these 😊

If you have a couple of minutes, just to circle back to my message above: do you know if there's any way within DatabricksConnectionManager.execute to get information about either the type of operation/materialization or the type of query, so as to avoid trying to collect the statistics when unnecessary? Or is that a dead end?

@benc-db (Collaborator) commented Dec 20, 2024

> If you have a couple of minutes, just to circle back to my message above: do you know if there's any way within DatabricksConnectionManager.execute to get information about either the type of operation/materialization or the type of query, so as to avoid trying to collect the statistics when unnecessary? Or is that a dead end?

I don't think there is any metadata that conveys this client-side, short of inspecting the SQL. Do you have any idea of the overhead of performing this for every operation? If it's small enough, we can ignore it; if it's large, we probably want to provide a config hook to opt into the behavior, since I don't think most users care about the data, though for some scenarios it is very nice to have. It might be possible to get the information from the cursor, but the type information in the databricks python sql connector (https://github.com/databricks/databricks-sql-python) is a little sparse.
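
For what it's worth, a minimal illustration of the SQL-inspection route (purely a heuristic sketch; the function name and regexes are assumptions, not anything in the connector):

    import re

    # Leading line/block comments before the first keyword (e.g. dbt headers).
    _LEADING_COMMENTS = re.compile(r"^\s*(?:(?:--[^\n]*\n|/\*.*?\*/)\s*)*", re.DOTALL)
    _DML_PREFIX = re.compile(r"^(merge|insert|update|delete)\b", re.IGNORECASE)


    def looks_like_dml(sql: str) -> bool:
        # Strip leading comments, then check the first keyword of the statement.
        stripped = _LEADING_COMMENTS.sub("", sql, count=1)
        return bool(_DML_PREFIX.match(stripped))


    looks_like_dml("/* dbt */ MERGE INTO tgt USING src ON ...")  # True
    looks_like_dml("select 1")                                   # False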

Development

Successfully merging this pull request may close these issues:

  • rows_affected not returned by adapter (#351)