gardnmi/auto_zorder

Auto ZORDER

Take the guesswork out of ZORDER

About The Project

The project removes the guesswork from choosing columns for the ZORDER BY clause of an OPTIMIZE statement. It analyzes the logged execution plans for each cluster you provide and returns the top n columns used in filter/WHERE clauses.
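
As a rough illustration of the idea, and not the package's actual implementation, the sketch below counts how often each column appears in `Filter` lines of Spark plan text and keeps the top n. The sample plan strings, the regular expression, and the function name `top_filter_columns` are all assumptions made for this example.

```python
import re
from collections import Counter

# Hypothetical plan snippets; real cluster logs contain much more text.
SAMPLE_PLANS = [
    "Filter (isnotnull(event_date#12) AND (event_date#12 = 2023-01-01))",
    "Filter (isnotnull(user_id#7) AND (user_id#7 = 42))",
    "Filter ((event_date#12 >= 2023-01-01) AND (country#3 = US))",
    "Filter ((event_date#12 = 2023-02-01) AND (user_id#7 = 7))",
    "Project [user_id#7, country#3]",
]

# Spark prints attribute references as name#exprId, e.g. event_date#12.
COLUMN_REF = re.compile(r"\b([A-Za-z_]\w*)#\d+")


def top_filter_columns(plans, n=5):
    """Return the n columns referenced by the most Filter lines."""
    counts = Counter()
    for line in plans:
        if not line.lstrip().startswith("Filter"):
            continue  # only filter/where predicates are of interest
        # Count each column once per Filter line.
        counts.update(set(COLUMN_REF.findall(line)))
    return [col for col, _ in counts.most_common(n)]


print(top_filter_columns(SAMPLE_PLANS, n=2))
# ['event_date', 'user_id']
```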

Prerequisites

Cluster log delivery must be enabled on each cluster you want to analyze.

Installation

Install with pip in your Databricks notebook:

%pip install auto_zorder

Example Usage

Note: If cluster log delivery has only been active for a short time, you may not see any results.

Basic Usage

from auto_zorder import auto_zorder

optimize_cmd = auto_zorder(
                    cluster_ids=['cluster_id_1', 'cluster_id_2'],
                    optimize_table='my_db.my_table'
                    )

print(optimize_cmd)
# Output: OPTIMIZE my_db.my_table ZORDER BY (col1, col2, col3, col4, col5)

# Run the generated OPTIMIZE command
spark.sql(optimize_cmd)

Limit the Number of ZORDER columns

from auto_zorder import auto_zorder

optimize_cmd = auto_zorder(
                    cluster_ids=['cluster_id_1', 'cluster_id_2'],
                    optimize_table='my_db.my_table',
                    number_of_cols=2
                    )

print(optimize_cmd)
# Output: OPTIMIZE my_db.my_table ZORDER BY (col1, col2)

Save auto zorder analysis

from auto_zorder import auto_zorder

optimize_cmd = auto_zorder(
                    cluster_ids=['cluster_id_1'],
                    optimize_table='my_db.my_table',
                    save_analysis='my_db.my_analysis'
                    )

Run auto zorder using a saved analysis instead of cluster logs

from auto_zorder import auto_zorder

optimize_cmd = auto_zorder(
                    use_analysis='my_db.my_analysis',
                    optimize_table='my_db.my_table'
                    )

Include additional columns at specific positions in ZORDER

from auto_zorder import auto_zorder

optimize_cmd = auto_zorder(
                    cluster_ids=['cluster_id_1', 'cluster_id_2'],
                    optimize_table='my_db.my_table',
                    use_add_cols=[('add_col1', 0), ('add_col2', 4)]
                    )

print(optimize_cmd)
# Output: OPTIMIZE my_db.my_table ZORDER BY (add_col1, auto_col1, auto_col2, auto_col3, add_col2, auto_col4, auto_col5)

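Each tuple in `use_add_cols` pairs a column name with the position it should take in the final ZORDER list. The sketch below reproduces the output above with plain list insertion. It only illustrates the position semantics and is not the package's own code; `merge_columns` is a hypothetical helper name.

```python
def merge_columns(auto_cols, add_cols):
    """Insert (name, position) pairs into a ranked column list."""
    merged = list(auto_cols)  # copy so the caller's list is untouched
    # Insert in ascending position order so each added column ends up
    # at its requested index in the final list.
    for name, position in sorted(add_cols, key=lambda pair: pair[1]):
        merged.insert(position, name)
    return merged


auto_cols = ["auto_col1", "auto_col2", "auto_col3", "auto_col4", "auto_col5"]
add_cols = [("add_col1", 0), ("add_col2", 4)]

merged = merge_columns(auto_cols, add_cols)
print(f"OPTIMIZE my_db.my_table ZORDER BY ({', '.join(merged)})")
```
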
Exclude columns in ZORDER

from auto_zorder import auto_zorder

optimize_cmd = auto_zorder(
                    cluster_ids=['cluster_id_1', 'cluster_id_2'],
                    optimize_table='my_db.my_table',
                    exclude_cols=['col1']
                    )

print(optimize_cmd)
# Output: OPTIMIZE my_db.my_table ZORDER BY (col2, col3, col4, col5, col6)

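The output above still lists five columns, which suggests that excluded columns are dropped before the top columns are selected. A minimal sketch of that ordering follows. The helper `build_optimize_cmd` is hypothetical, and the default of five columns is an assumption inferred from the basic usage output rather than a documented default.

```python
def build_optimize_cmd(table, ranked_cols, exclude_cols=(), number_of_cols=5):
    """Build an OPTIMIZE ... ZORDER BY statement from ranked columns."""
    excluded = set(exclude_cols)
    # Drop excluded columns first, then keep the highest-ranked remainder.
    kept = [col for col in ranked_cols if col not in excluded]
    selected = kept[:number_of_cols]
    if not selected:
        raise ValueError("no columns left to ZORDER by")
    return f"OPTIMIZE {table} ZORDER BY ({', '.join(selected)})"


ranked = ["col1", "col2", "col3", "col4", "col5", "col6"]
cmd = build_optimize_cmd("my_db.my_table", ranked, exclude_cols=["col1"])
print(cmd)
# OPTIMIZE my_db.my_table ZORDER BY (col2, col3, col4, col5, col6)
```
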
License

Distributed under the MIT License. See LICENSE.txt for more information.
