A utility that reads the legacy DDL, compares it against the previously extracted DDL, and produces an output listing the objects whose DDL no longer matches.
Migration projects most often take many months, and objects are migrated in groups over successive phases, so the freshness of an object's definition depends on when that object was migrated.
In many cases, object definitions are updated in the legacy source system to meet an immediate business need, and the updated definition is not replicated in the same way in BigQuery. This often results in validation failures during the later stages of the project (UAT and pre-production testing).
- Hive DDL Validation : The utility reads from the Hive Metastore and extracts the last DDL time of tables and views along with the user who last updated them. If a table or view is new, the last-updated user is recorded as N/A.
- Teradata DDL Validation : Teradata's TablesV view contains the metadata for tables and databases. The utility fetches the records in this view for tables, views and procedures and appends the findings to the audit metadata table.
- Snowflake Validation : Each Snowflake database contains an information schema with separate metadata views for tables, views and procedures. Based on the user input, the utility queries these metadata views and fetches the latest records. Illustrative queries for each supported engine are sketched after this list.
- Easy Infrastructure Setup : We provide Terraform scripts, deployable from a local environment or from Cloud Shell, that set up the audit instance, database and tables in Cloud SQL and the required Secret Manager resources, based on the user's input in the tfvars file.
- Detailed Audit Table : The audit tables are separate for each supported database type, but the underlying structure is the same. Each audit table contains the database name, object name, object type (table/view/procedure), last alter time, last alter user, whether the row is the latest record for that object, and when the utility run inserted the row (the insert time). This helps users build a detailed understanding of how objects change over time. A sketch of this structure is included after this list.
- Best Security Practice using Secret Manager : The utility uses Secret Manager to fetch the credentials for the on-prem metadata environment: the DB username, DB password and the URL of the DB instance to connect to. The exact fields differ per use case; for Snowflake, a DB account is taken as input instead of a URL. All the user has to do is provide the resource secret URL during infrastructure setup and when running the program.
- Centralized Configuration : The utility reads the configuration for a run from a JSON file in GCS. This includes only the database type (hive/snowflake etc.), db_name, object type, resource secret URL, audit instance name, audit DB user and audit DB password.
- Generic Utility : The utility is designed to be generic, so new database types can be incorporated with little code change.
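
The exact statements the utility runs are not shown in this README; the sketch below only illustrates, under that assumption, the kind of metadata each engine exposes for last-altered information (Teradata's TablesV view, Snowflake's information schema, and the Hive Metastore's backing database). The view and column names are standard, but the filters and parameter styles are illustrative placeholders.

```python
# Illustrative only: the system views and columns are standard, but the exact
# queries issued by the utility are assumptions.
METADATA_QUERIES = {
    # Teradata: DBC.TablesV has one row per object, including its kind,
    # last ALTER timestamp and the user who issued the last ALTER.
    "teradata": """
        SELECT DatabaseName, TableName, TableKind,
               LastAlterTimeStamp, LastAlterName
        FROM DBC.TablesV
        WHERE DatabaseName = ?
    """,
    # Snowflake: every database exposes an INFORMATION_SCHEMA with separate
    # views for tables, views and procedures; LAST_ALTERED holds the most
    # recent modification time.
    "snowflake": """
        SELECT TABLE_CATALOG, TABLE_SCHEMA, TABLE_NAME, TABLE_TYPE, LAST_ALTERED
        FROM INFORMATION_SCHEMA.TABLES
        WHERE TABLE_CATALOG = %(db_name)s
    """,
    # Hive: the metastore database records the last DDL time in the
    # 'transient_lastDdlTime' table parameter, alongside the owning user.
    "hive": """
        SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE, t.OWNER,
               p.PARAM_VALUE AS last_ddl_time
        FROM TBLS t
        JOIN DBS d ON t.DB_ID = d.DB_ID
        LEFT JOIN TABLE_PARAMS p
          ON t.TBL_ID = p.TBL_ID AND p.PARAM_KEY = 'transient_lastDdlTime'
        WHERE d.NAME = %s
    """,
}
```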
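
Likewise, here is a minimal sketch of the audit table structure described above, assuming a MySQL database on the Cloud SQL audit instance; the table, column names and types are inferred from the description and may not match the schema the Terraform scripts actually create.

```python
# A sketch of one per-engine audit table, inferred from the column list above.
# Names and types are assumptions, not the deployed schema.
AUDIT_TABLE_DDL_SKETCH = """
CREATE TABLE IF NOT EXISTS teradata_ddl_audit (
    database_name    VARCHAR(255),
    object_name      VARCHAR(255),
    object_type      VARCHAR(32),    -- table / view / procedure
    last_alter_time  DATETIME,
    last_alter_user  VARCHAR(255),   -- 'N/A' when the object is new
    is_latest_record BOOLEAN,        -- marks the current row for each object
    insert_time      DATETIME        -- when the utility run inserted the row
)
"""
```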
The following packages are needed to run the script: google-cloud-secret-manager, pandas, mysql-connector-python, snowflake-connector-python, cloud-sql-python-connector, google-cloud-storage, oracledb.
1. Log in to gcloud using `gcloud auth application-default login`. Make sure the environment variable `GOOGLE_APPLICATION_CREDENTIALS` points to the credential file.

2. Install the dependencies listed in requirements.txt using pip3:

   ```
   pip3 install -r requirements.txt
   ```
3. Download the key from gcloud or the console and save it in a trusted folder.
4. Add the required inputs for infrastructure setup in the `tfvars` file in the `iac` folder, then run the following commands:

   ```
   terraform init
   terraform plan
   terraform apply
   ```
5. Create a resource secret in GCP with credentials to access the Hive/Snowflake/Teradata/Oracle cluster. The fields and format required for each are listed below; a sketch of how such a secret can be read follows the setup steps.

   5.1 : Hive

   ```
   { "db_instance" : "", "db_username": "", "db_name":"", "db_password":"" }
   ```

   5.2 : Snowflake

   ```
   { "db_user" : "", "db_password": "", "db_account":"" }
   ```

   5.3 : Teradata

   ```
   { "db_instance" : "", "db_user": "", "db_password":"" }
   ```

   5.4 : Oracle

   ```
   { "db_account" : "", "db_user": "", "db_password":"" }
   ```
6. Create a central config file and upload it to a GCS bucket. `db_type` can be `teradata`, `snowflake` or `hive`. `obj_type` can be `view`, `table` or `procedure`. Take the values for `audit_instance_name`, `audit_db_user` and `audit_db_password` from the infrastructure setup. A sketch of how this config can be loaded from GCS follows the setup steps.

   ```
   {
       "db_type" : "",
       "db_name" : "",
       "obj_type" : "",
       "resource_secret" : "",
       "audit_instance_name" : "",
       "audit_db_user" : "",
       "audit_db_password" : ""
   }
   ```
7. Run the python script with the path of the configuration file:

   ```
   python3 main.py -gcs_config_path="gs://bucket/path/to/config/file"
   ```
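
As referenced in step 5, below is a minimal sketch of how the resource secret can be fetched and parsed, assuming the google-cloud-secret-manager client listed in requirements.txt; the project and secret names are placeholders, and the utility's own helper functions may differ.

```python
# Sketch: fetch the JSON credential payload from Secret Manager.
import json

from google.cloud import secretmanager


def read_resource_secret(secret_resource_name: str) -> dict:
    """Fetch the latest version of the secret and parse its JSON payload."""
    client = secretmanager.SecretManagerServiceClient()
    response = client.access_secret_version(request={"name": secret_resource_name})
    return json.loads(response.payload.data.decode("utf-8"))


if __name__ == "__main__":
    # Placeholder resource name; a Snowflake secret would yield
    # db_user, db_password and db_account.
    creds = read_resource_secret(
        "projects/my-project/secrets/snowflake-metadata-creds/versions/latest"
    )
    print(sorted(creds.keys()))
```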
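
And, as referenced in step 6, a minimal sketch of how the central config passed via `-gcs_config_path` can be loaded, assuming the google-cloud-storage client listed in requirements.txt; the bucket and object names are placeholders, and the utility's own loader may differ.

```python
# Sketch: download and parse the central JSON config from GCS.
import json

from google.cloud import storage


def load_config(gcs_config_path: str) -> dict:
    """Download the JSON config from a gs://bucket/path URI and parse it."""
    bucket_name, blob_path = gcs_config_path.removeprefix("gs://").split("/", 1)
    blob = storage.Client().bucket(bucket_name).blob(blob_path)
    return json.loads(blob.download_as_text())


if __name__ == "__main__":
    config = load_config("gs://my-bucket/configs/ddl_validation_config.json")
    print(config["db_type"], config["obj_type"])
```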