
materialize-databricks: new connector #1021

Merged: 35 commits merged into main from mahdi/databricks on Nov 21, 2023

Conversation

@mdibaiee (Member) commented Oct 18, 2023

Description:

  • Databricks uses the "recovery log is authoritative & idempotent apply" pattern of a materialization
  • This has been tested using integration tests and manual testing on the 25m-document collection (imported_25m)
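As a loose illustration of the idempotent-apply half of that pattern, the sketch below uses a hypothetical in-memory destination (the `destination` type, `apply` method, and transaction IDs are all illustrative, not the connector's actual API): the highest applied transaction ID is persisted alongside the data, so re-applying an already-committed transaction after a crash or replay is a no-op.

```go
package main

import "fmt"

// destination is a hypothetical sink illustrating idempotent apply:
// the fence (appliedTxID) is stored together with the data, so a
// replayed transaction can be detected and skipped.
type destination struct {
	rows        map[string]string
	appliedTxID int64 // highest transaction ID already applied
}

// apply writes updates only if txID has not been applied yet.
// It returns true if the transaction was applied, false on replay.
func (d *destination) apply(txID int64, updates map[string]string) bool {
	if txID <= d.appliedTxID {
		return false // replay of a committed transaction: no-op
	}
	for k, v := range updates {
		d.rows[k] = v
	}
	d.appliedTxID = txID
	return true
}

func main() {
	d := &destination{rows: map[string]string{}}
	fmt.Println(d.apply(1, map[string]string{"a": "1"})) // true: first apply
	fmt.Println(d.apply(1, map[string]string{"a": "1"})) // false: idempotent replay
	fmt.Println(d.rows["a"], d.appliedTxID)              // 1 1
}
```

Because the apply is idempotent, the authoritative checkpoint can live in the recovery log rather than in the destination's own transaction machinery.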

Workflow steps:

(How does one use this feature, and how has it changed)

Documentation links affected:

(list any documentation links that you created, or existing ones that you've identified as needing updates, along with a brief description)

Notes for reviewers:

(anything that might help someone review this PR)


@mdibaiee force-pushed the mahdi/databricks branch 4 times, most recently from 10b19c0 to 5e17779 on October 25, 2023 14:08
@mdibaiee force-pushed the mahdi/databricks branch 10 times, most recently from c98fb69 to 3a25853 on November 8, 2023 11:54
@mdibaiee marked this pull request as ready for review November 9, 2023 14:47
@mdibaiee (Member, Author) commented Nov 9, 2023

What remains to be tested: since I have changed the logic of the CounterWriter, we need to verify that compression still works. The non-compression case works, as I have tested it against Databricks.

UPDATE: tested this by running BigQuery's integration tests 👍🏽

@mdibaiee force-pushed the mahdi/databricks branch 2 times, most recently from da101e8 to 5362209 on November 10, 2023 14:42
@williamhbaker (Member) left a comment

Some initial comments/considerations - looking good so far, this one is a bit of a beast.

Review threads:

  • .github/workflows/ci.yaml (resolved)
  • tests/materialize/materialize-databricks/fetch-data.go (outdated, resolved)
  • materialize-databricks/config.go (outdated, resolved; two threads)
  • materialize-databricks/driver.go (outdated, resolved)
  • materialize-databricks/staged_file.go (one resolved thread; six outdated, resolved threads)

type tableConfig struct {
	Table  string `json:"table" jsonschema:"title=Table,description=Name of the table" jsonschema_extras:"x-collection-name=true"`
	Schema string `json:"schema" jsonschema:"title=Schema,description=Schema where the table resides,default=default"`
}
@williamhbaker (Member) commented Nov 14, 2023

For consistency with other materializations and usability, I suggest:

  • Not setting a default for schema in tableConfig
  • Making schema optional in tableConfig
  • In config, making schema_name a required field (note: I am assuming that you can't connect without some kind of schema specified), and perhaps giving the equivalent fields in tableConfig and config the same name
  • In (*config).Validate, erroring if schema_name hasn't been provided
  • In newTableConfig, initializing the schema by default to the schema from config, which will be overwritten if a schema property is present in the raw JSON of the resource config when it is unmarshalled elsewhere

This would support the case where most bindings use the same schema as set in the endpoint-level config, while the few that need a different schema can be edited individually. It would directly parallel the BigQuery and Snowflake materializations, which work this way. As it is now, if you wanted to materialize all bindings to a schema other than "default", you would need to set it for every single binding.

@williamhbaker (Member) left a comment
LGTM % remaining unresolved comments

@mdibaiee (Member, Author) commented

Merging this depends on gazette/core#352, so holding off until that PR is merged and landed.

@mdibaiee mdibaiee merged commit 56c8554 into main Nov 21, 2023
39 of 42 checks passed
@mdibaiee mdibaiee deleted the mahdi/databricks branch November 21, 2023 16:20
2 participants