[PECO-1803] Splitting the PySql connector into the core and the non core part #417

jprakash-db · 2024-07-24T06:34:43Z

Description

databricks-sql-python library is split so that package size can be reduced for the end user based on their requirements
Particularly pyarrow is the heavy component that is planned to be kept optional
existing library split into

databricks-sql-connector ( This is kept, so that the existing users import flow does not change )
databricks-sql-connector-core ( This is the lightweight library that separates the core part )

Tasks Completed

Refractored the code into its respective folders based on the proposed design doc
pyproject.toml file has been changed to reflect the proper dependencies for the split
Made sure that all the existing e2e and units tests are working pre and post spit, ensuring parity
Added benchmarking queries to test the performance of pre and post split and a dashboard has been created for visualization
Dependency tests are also added to check how the library behaves when certain libraries are not available and the user requests their functions

How to Test

Testing pipeline remains the same as it is before the split.
pytest can be used to directly run both the integration as well as unit tests, by pytest [directory_name or file_name]

Addition of dist folder in this repo

Github actions have been setup in the databricks_sqlalchemy repo to run tests using the databricks_sql_connector_core. For running those tests currently we need the .whl file in the dist folder and for temporary testing it has been added to the PR.

Once the library gets published to a public repository such as PyPi then databricks_sqlalchemy will automatically download it from that repo and run the tests using Github actions

Performance Comparison - Benchmarking

The pre-split and post-split preformance comparison has been made using the large and small queries to make sure their is no regression of performance
Dashboard has been created so that everytime the benchmarking is run the result are stored in the benchfood, and comparisons can be made easily

Removed token removed token

Fixed __version__ Fix

…he sqla_testing

databricks_sql_connector_core/README.md

databricks_sql_connector_core/src/databricks_sql_connector_core/sql/auth/auth.py

databricks_sql_connector_core/src/databricks_sql_connector_core/sqlalchemy/__init__.py

jackyhu-db · 2024-09-19T07:00:32Z

databricks_sql_connector/pyproject.toml

@@ -0,0 +1,23 @@
+[tool.poetry]
+name = "databricks-sql-connector"
+version = "3.3.0"


As we change the package structure in this release. I would suggest to bump the major version to 4.0.0 or at least it should be greater than current release 3.4.0, so it has to start with 3.5.0. @kravets-levko wdyt?

kravets-levko · 2024-09-19T08:32:28Z

README.md

@@ -60,7 +60,7 @@ In the above example:
 - `server-hostname` is the Databricks instance host name.
 - `http-path` is the HTTP Path either to a Databricks SQL endpoint (e.g. /sql/1.0/endpoints/1234567890abcdef),
 or to a Databricks Runtime interactive cluster (e.g. /sql/protocolv1/o/1234567890123456/1234-123456-slid123)
- `personal-access-token` is the Databricks Personal Access Token for the account that will execute commands and queries
+  `personal-access-token` is the Databricks Personal Access Token for the account that will execute commands and queries


I think this is some accidental change - please revert it

kravets-levko · 2024-09-19T08:55:06Z

databricks_sql_connector_core/tests/e2e/common/large_queries_mixin.py

@@ -40,6 +44,7 @@ def fetch_rows(self, cursor, row_count, fetchmany_size):
            + "assuming 10K fetch size."
        )

+    @skipUnless(pysql_supports_arrow(), "Without pyarrow lz4 compression is not supported")


Let's stick with @pytest.mark.skipif everywhere (please check other test files as well)

kravets-levko · 2024-09-19T08:57:53Z

poetry.lock

Now, when we have two projects - each of them needs its own lockfile. Keeping the old one at old path is useless, because Poetry won't use it.

Basically, there should be a new lockfile created for databricks_sql_connector. For databricks_sql_connector the old one should be reused, with unnecessary dependencies removed

* Modified the gitignore file to not have .idea file * [PECO-1803] Splitting the PySql connector into the core and the non core part (#417) * Implemented ColumnQueue to test the fetchall without pyarrow Removed token removed token * order of fields in row corrected * Changed the folder structure and tested the basic setup to work * Refractored the code to make connector to work * Basic Setup of connector, core and sqlalchemy is working * Basic integration of core, connect and sqlalchemy is working * Setup working dynamic change from ColumnQueue to ArrowQueue * Refractored the test code and moved to respective folders * Added the unit test for column_queue Fixed __version__ Fix * venv_main added to git ignore * Added code for merging columnar table * Merging code for columnar * Fixed the retry_close sesssion test issue with logging * Fixed the databricks_sqlalchemy tests and introduced pytest.ini for the sqla_testing * Added pyarrow_test mark on pytest * Fixed databricks.sqlalchemy to databricks_sqlalchemy imports * Added poetry.lock * Added dist folder * Changed the pyproject.toml * Minor Fix * Added the pyarrow skip tag on unit tests and tested their working * Fixed the Decimal and timestamp conversion issue in non arrow pipeline * Removed not required files and reformatted * Fixed test_retry error * Changed the folder structure to src / databricks * Removed the columnar non arrow flow to another PR * Moved the README to the root * removed columnQueue instance * Revmoved databricks_sqlalchemy dependency in core * Changed the pysql_supports_arrow predicate, introduced changes in the pyproject.toml * Ran the black formatter with the original version * Extra .py removed from all the __init__.py files names * Undo formatting check * Check * Check * Check * Check * Check * Check * Check * Check * Check * Check * Check * Check * Check * Check * BIG UPDATE * Refeactor code * Refractor * Fixed versioning * Minor refractoring * Minor refractoring * Changed the folder structure such that sqlalchemy has not reference here * Fixed README.md and CONTRIBUTING.md * Added manual publish * On push trigger added * Manually setting the publish step * Changed versioning in pyproject.toml * Bumped up the version to 4.0.0.b3 and also changed the structure to have pyarrow as optional * Removed the sqlalchemy tests from integration.yml file * [PECO-1803] Print warning message if pyarrow is not installed (#468) Print warning message if pyarrow is not installed Signed-off-by: Jacky Hu <[email protected]> * [PECO-1803] Remove sqlalchemy and update README.md (#469) Remove sqlalchemy and update README.md Signed-off-by: Jacky Hu <[email protected]> * Removed all sqlalchemy related stuff * generated the lock file * Fixed failing tests * removed poetry.lock * Updated the lock file * Fixed poetry numpy 2.2.2 issue * Workflow fixes --------- Signed-off-by: Jacky Hu <[email protected]> Co-authored-by: Jacky Hu <[email protected]>

Implemented ColumnQueue to test the fetchall without pyarrow

335fc0c

Removed token removed token

jprakash-db requested a review from gopalldb July 24, 2024 06:34

jprakash-db self-assigned this Jul 24, 2024

shivam2680 and others added 7 commits July 24, 2024 14:41

order of fields in row corrected

ad2b014

Changed the folder structure and tested the basic setup to work

882e080

Refractored the code to make connector to work

98c4cd6

Basic Setup of connector, core and sqlalchemy is working

25a006d

Basic integration of core, connect and sqlalchemy is working

1cfaae2

Setup working dynamic change from ColumnQueue to ArrowQueue

c576110

Refractored the test code and moved to respective folders

0ddca9d

jprakash-db requested a review from madhav-db August 8, 2024 09:29

jprakash-db changed the base branch from main to PECO-1803/connector-split August 14, 2024 09:34

jprakash-db added 2 commits August 20, 2024 23:16

Added the unit test for column_queue

786dc1e

Fixed __version__ Fix

venv_main added to git ignore

c91d43d

jprakash-db force-pushed the jprakash-db/PECO-1803 branch from 4351410 to c91d43d Compare August 20, 2024 17:48

jprakash-db added 14 commits August 21, 2024 13:06

Added code for merging columnar table

b67b739

Merging code for columnar

32e3dcd

Fixed the retry_close sesssion test issue with logging

d062887

Fixed the databricks_sqlalchemy tests and introduced pytest.ini for t…

9908976

…he sqla_testing

Added pyarrow_test mark on pytest

af0cd4f

Fixed databricks.sqlalchemy to databricks_sqlalchemy imports

a07c232

Added poetry.lock

f0c8e7a

Added dist folder

46ae0f7

Changed the pyproject.toml

8c5e7dd

Minor Fix

24730dd

Added the pyarrow skip tag on unit tests and tested their working

61de281

Fixed the Decimal and timestamp conversion issue in non arrow pipeline

bad7c0b

Removed not required files and reformatted

520d2c8

Fixed test_retry error

051bdce

madhav-db approved these changes Sep 3, 2024

View reviewed changes

jprakash-db added 18 commits September 18, 2024 11:05

Undo formatting check

2115db1

Check

36c0f95

Check

15d5047

Check

8415230

Check

15df683

Check

b4c3029

Check

bc758d8

Check

31da868

Check

f79fc69

Check

d121fed

Check

6e393b0

Check

2383caf

Check

dd4c487

Check

6be085c

Check

e7cf5c3

BIG UPDATE

b6a5668

Refeactor code

4496a04

Refractor

726b8ed

jackyhu-db reviewed Sep 19, 2024

View reviewed changes

Fixed versioning

66bfa6d

jackyhu-db approved these changes Sep 19, 2024

View reviewed changes

kravets-levko suggested changes Sep 19, 2024

View reviewed changes

jprakash-db added 2 commits September 19, 2024 22:37

Minor refractoring

2c1cfbd

Minor refractoring

ae20a65

kravets-levko approved these changes Sep 23, 2024

View reviewed changes

jprakash-db changed the base branch from PECO-1803/connector-split to main September 23, 2024 09:01

jprakash-db changed the base branch from main to PECO-1803/connector-split September 23, 2024 16:01

Merged databricks:PECO-1803/connector-split into jprakash-db/PECO-1803

f056c80

jprakash-db merged commit 4099939 into databricks:PECO-1803/connector-split Sep 24, 2024
1 check failed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PECO-1803] Splitting the PySql connector into the core and the non core part #417

[PECO-1803] Splitting the PySql connector into the core and the non core part #417

jprakash-db commented Jul 24, 2024 •

edited

Loading

jackyhu-db Sep 19, 2024

kravets-levko Sep 19, 2024

kravets-levko Sep 19, 2024

kravets-levko Sep 19, 2024

[PECO-1803] Splitting the PySql connector into the core and the non core part #417

[PECO-1803] Splitting the PySql connector into the core and the non core part #417

Conversation

jprakash-db commented Jul 24, 2024 • edited Loading

Related Links

Description

Tasks Completed

How to Test

Addition of dist folder in this repo

Performance Comparison - Benchmarking

jackyhu-db Sep 19, 2024

Choose a reason for hiding this comment

kravets-levko Sep 19, 2024

Choose a reason for hiding this comment

kravets-levko Sep 19, 2024

Choose a reason for hiding this comment

kravets-levko Sep 19, 2024

Choose a reason for hiding this comment

jprakash-db commented Jul 24, 2024 •

edited

Loading