
materialize-databricks: new connector #1021

Merged: 35 commits merged into main from mahdi/databricks on Nov 21, 2023

Conversation

@mdibaiee (Member) commented Oct 18, 2023

Description:

  • Databricks uses the "recovery log is authoritative & idempotent apply" pattern of a materialization
  • This has been tested using integration tests and manual testing on the 25m-document collection (imported_25m)
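As a loose illustration of the idempotent-apply half of that pattern, the sketch below uses a hypothetical in-memory destination (the `destination` type, `apply` method, and transaction IDs are all illustrative, not the connector's actual API): the highest applied transaction ID is persisted alongside the data, so re-applying an already-committed transaction after a crash or replay is a no-op.

```go
package main

import "fmt"

// destination is a hypothetical sink illustrating idempotent apply:
// the fence (appliedTxID) is stored together with the data, so a
// replayed transaction can be detected and skipped.
type destination struct {
	rows        map[string]string
	appliedTxID int64 // highest transaction ID already applied
}

// apply writes updates only if txID has not been applied yet.
// It returns true if the transaction was applied, false on replay.
func (d *destination) apply(txID int64, updates map[string]string) bool {
	if txID <= d.appliedTxID {
		return false // replay of a committed transaction: no-op
	}
	for k, v := range updates {
		d.rows[k] = v
	}
	d.appliedTxID = txID
	return true
}

func main() {
	d := &destination{rows: map[string]string{}}
	fmt.Println(d.apply(1, map[string]string{"a": "1"})) // true: first apply
	fmt.Println(d.apply(1, map[string]string{"a": "1"})) // false: idempotent replay
	fmt.Println(d.rows["a"], d.appliedTxID)              // 1 1
}
```

Because the apply is idempotent, the authoritative checkpoint can live in the recovery log rather than in the destination's own transaction machinery.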

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)


@mdibaiee force-pushed the mahdi/databricks branch 4 times, most recently from 10b19c0 to 5e17779 on October 25, 2023 14:08
@mdibaiee force-pushed the mahdi/databricks branch 10 times, most recently from c98fb69 to 3a25853 on November 8, 2023 11:54
@mdibaiee marked this pull request as ready for review November 9, 2023 14:47
@mdibaiee (Member, Author) commented Nov 9, 2023

What remains to be tested: since I have changed the logic of the CounterWriter, we need to verify that compression still works. The non-compression case works, as I have tested it against Databricks.

UPDATE: tested this by running BigQuery's integration tests 👍🏽

@mdibaiee force-pushed the mahdi/databricks branch 2 times, most recently from da101e8 to 5362209 on November 10, 2023 14:42
@williamhbaker (Member) left a comment

Some initial comments/considerations - looking good so far, this one is a bit of a beast.

Review threads:

  • .github/workflows/ci.yaml (resolved)
  • tests/materialize/materialize-databricks/fetch-data.go (outdated, resolved)
  • materialize-databricks/config.go (outdated, resolved; two threads)
  • materialize-databricks/driver.go (outdated, resolved)
  • materialize-databricks/staged_file.go (one resolved thread; six outdated, resolved threads)

type tableConfig struct {
	Table  string `json:"table" jsonschema:"title=Table,description=Name of the table" jsonschema_extras:"x-collection-name=true"`
	Schema string `json:"schema" jsonschema:"title=Schema,description=Schema where the table resides,default=default"`
}
@williamhbaker (Member) commented Nov 14, 2023

For consistency with other materializations and usability, I suggest:

  • Not setting a default for schema in tableConfig
  • Making schema optional in tableConfig
  • In config, making schema_name a required field (note: I am assuming that you can't connect without some kind of schema specified), and perhaps giving the equivalent fields in tableConfig and config the same name
  • In (*config).Validate, erroring if schema_name hasn't been provided
  • In newTableConfig, initializing the schema by default to the schema from config, which will be overwritten if a schema property is present in the raw JSON of the resource config when it is unmarshalled elsewhere

This would support the case where most bindings use the same schema as set in the endpoint-level config, while the few that need a different schema can be edited individually. It would directly parallel the BigQuery and Snowflake materializations, which work this way. As it is now, if you wanted to materialize all bindings to a schema other than "default", you would need to set it for every single binding.

@williamhbaker (Member) left a comment
LGTM % remaining unresolved comments

@mdibaiee (Member, Author) commented

Merging this depends on gazette/core#352, so holding off until that PR is merged and landed.

@mdibaiee mdibaiee merged commit 56c8554 into main Nov 21, 2023
39 of 42 checks passed
@mdibaiee mdibaiee deleted the mahdi/databricks branch November 21, 2023 16:20
2 participants