From 9196a4fa154dd28206ca8c1b043cb191dd24b6c0 Mon Sep 17 00:00:00 2001 From: Bruno Antonellini Date: Thu, 27 Jun 2024 08:59:53 -0300 Subject: [PATCH] Dcv 2583 improve dbt coves documentation (#480) * DCV-2583 improve dbt-coves documentation * New root README, move Contributing to root * Final Root README * Standardize headers and sections of Generate docs * Re-sort initial table of contents * Typo on airflow dags link * Remove setup old content * Remove Acknowledgements section * Generate sources video * Update README.md * Update README.md * Update README.md * Generate templates docs * dbt-coves generate metadata * Add disable_tracking at root config level * Fix prettier CI * Add templates and metadata to generate folder * Changes to alphabetical sort, position of videos, missing links in root README * Change overview for gif with Loom opening * Updates to docs * Add overview/showcase video --------- Co-authored-by: Noel Gomez --- CONTRIBUTING.md | 9 +- README.md | 767 ++---------------- dbt_coves/config/config.py | 5 + dbt_coves/utils/tracking.py | 4 +- docs/README.md | 8 + docs/commands/README.md | 26 + docs/commands/dbt/README.md | 29 + .../extract and load/airbyte/README.md | 54 ++ .../extract and load/fivetran/README.md | 69 ++ docs/commands/generate/README.md | 24 + docs/commands/generate/airflow dags/README.md | 112 +++ docs/commands/generate/docs/README.md | 19 + docs/commands/generate/metadata/README.md | 39 + docs/commands/generate/properties/README.md | 70 ++ docs/commands/generate/sources/README.md | 118 +++ docs/commands/generate/templates/README.md | 7 + docs/commands/setup/README.md | 3 + docs/settings.md | 155 ++++ 18 files changed, 791 insertions(+), 727 deletions(-) create mode 100644 docs/README.md create mode 100644 docs/commands/README.md create mode 100644 docs/commands/dbt/README.md create mode 100644 docs/commands/extract and load/airbyte/README.md create mode 100644 docs/commands/extract and load/fivetran/README.md create mode 100644 
docs/commands/generate/README.md create mode 100644 docs/commands/generate/airflow dags/README.md create mode 100644 docs/commands/generate/docs/README.md create mode 100644 docs/commands/generate/metadata/README.md create mode 100644 docs/commands/generate/properties/README.md create mode 100644 docs/commands/generate/sources/README.md create mode 100644 docs/commands/generate/templates/README.md create mode 100644 docs/commands/setup/README.md create mode 100644 docs/settings.md diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md index 378dbb4a..0e484c5b 100644 --- a/CONTRIBUTING.md +++ b/CONTRIBUTING.md @@ -4,9 +4,8 @@ Thanks for looking into making `dbt-coves` better! We have some loosely defined ## How can you contribute? -- It **usually starts with creating an issue** (reporting a bug, or discussing a feature). In fact, even if you don't know how to code it or don't have the time, you're already helping out by pointing out potential issues or functionality that you would like to see implemented. -- **Creating an issue is not necessary!** - - See an already published issue that you think you can tackle? Drop a line on it and get cracking or ask questions on how you can help, it's generally a good way to make sure you'll hit home right away. It's not fun to do a lot of work and then find out the proposed change doesn't fit with the maintainer's goals. +- Contributing **usually starts with creating an issue** (reporting a bug, or discussing a feature). In fact, even if you don't know how to code it or don't have the time, you're already helping out by pointing out potential issues or functionality that you would like to see implemented. +- See an already published issue that you think you can tackle? Comment on it and ask questions about your idea and how you can help; it's generally a good way to make sure you'll hit the ground running quickly. It's not fun to do a lot of work and then find out the proposed change doesn't fit with the maintainer's goals. 
## Advised Process/Conventions @@ -75,7 +74,7 @@ Here's a quick guide: For official guidelines check the [GitHub documentation](https://docs.github.com/en/free-pro-team@latest/github/getting-started-with-github/fork-a-repo) 1. Create a fork from this repo in your account by clicking the "Fork" button -2. Clone the fork on your machine via ssh or http depending on how you like to authenticate and all that. For example: +2. Clone the fork on your machine via ssh or http depending on how you like to authenticate. For example: ```bash git clone git@github.com:/dbt-coves.git @@ -122,7 +121,7 @@ If you don't want to bother, that's also OK because we also have [pre-commit.ci] #### Type Hinting -- It is recommended to use Type Hinting and have [`mypy`](http://mypy-lang.org/) enabled as your linter. Most IDE's have an extension or a way to help with this. Typing isn't necessary but **really, really** preferred. The mainteners might therefore make suggestions on how to implement typing or will enfore it for you directly in your branch. +- It is recommended to use Type Hinting and have [`mypy`](http://mypy-lang.org/) enabled as your linter. Most IDE's have an extension or a way to help with this. Typing isn't necessary but **really, really** preferred. The maintainers might therefore make suggestions on how to implement typing or will enforce it for you directly in your branch. - Mypy is also part of our `pre-commit` and it should alert you if you have any issues with type hints. ### Development diff --git a/README.md b/README.md index 135c0ba5..304b1578 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # dbt-coves -## Brough to you by your friends at Datacoves +## Brought to you by your friends at Datacoves @@ -11,40 +11,26 @@ The Datacoves platform helps enterprises overcome their data delivery challenges Hosted VS Code, dbt-core, SqlFluff, and Airflow, find out more at [Datacoves.com](https://datacoves.com/product). -## What is dbt-coves? 
+## Overview -dbt-coves is a CLI tool that automates certain tasks for [dbt](https://www.getdbt.com), making life simpler for the dbt user. +[![image](https://cdn.loom.com/sessions/thumbnails/7d5341f5d5b149ed8895fe1187e338c5-with-play.gif)](https://www.loom.com/share/7d5341f5d5b149ed8895fe1187e338c5) -dbt-coves generates dbt sources, staging models and property(yml) files by analyzing information from the data warehouse and creating the necessary files (sql and yml). +## Table of contents -Finally, dbt-coves includes functionality to bootstrap a dbt project and to extract and load configurations from Airbyte. +- [Introduction](#introduction) +- [Installation](#installation) +- [Usage](#usage) +- [Contributing](#contributing) -## Supported dbt versions +## Introduction -| Version | Status | -| ------- | ---------------- | -| \< 1.0 | ❌ Not supported | -| >= 1.0 | ✅ Tested | - -From `dbt-coves` 1.4.0 onwards, our major and minor versions match those of [dbt-core](https://github.com/dbt-labs/dbt-core). -This means we release a new major/minor version once it's dbt-core equivalent is tested. -Patch suffix (1.4.X) is exclusive to our continuous development and does not reflect a version match with dbt - -## Supported adapters - -| Feature | Snowflake | Redshift | BigQuery | -| --------------------------------- | --------- | --------- | --------- | -| dbt project setup | ✅ Tested | ✅ Tested | ✅ Tested | -| source model (sql) generation | ✅ Tested | ✅ Tested | ✅ Tested | -| model properties (yml) generation | ✅ Tested | ✅ Tested | ✅ Tested | +dbt-coves is a CLI tool that automates and simplifies development and release tasks for [dbt](https://www.getdbt.com). -NOTE: Other database adapters may work, although we have not tested them. Feel free to try them and let us know so we can update the table above. 
+In addition to other functions listed below, dbt-coves generates dbt sources, staging models and property(yml) files by analyzing information from the data warehouse and creating the necessary files (sql and yml). It can even generate Airflow DAGs based on YML input. -### Here\'s the tool in action +Finally, dbt-coves includes functionality to bootstrap a dbt project and to extract and load configurations from data-replication providers. -[![image](https://cdn.loom.com/sessions/thumbnails/74062cf71cbe4898805ca508ea2d9455-1624905546029-with-play.gif)](https://www.loom.com/share/74062cf71cbe4898805ca508ea2d9455) - -# Installation +## Installation ```console pip install dbt-coves @@ -54,712 +40,53 @@ We recommend using [python virtualenvs](https://docs.python.org/3/tutorial/venv.html) and create one separate environment per project. -# Command Reference - -For a complete list of options, please run: - -```console -dbt-coves -h -dbt-coves -h -``` - -## Environment setup - -You can configure different components: - -Set up `git` repository of dbt-coves project - -```console -dbt-coves setup git -``` - -Set up `dbt` within the project (delegates to dbt init) - -```console -dbt-coves setup dbt -``` - -Set up SSH Keys for dbt project. Supports the argument `--open_ssl_public_key` which generates an extra Public Key in Open SSL format, useful for configuring certain providers (i.e. Snowflake authentication) - -```console -dbt-coves setup ssh -``` - -Set up pre-commit for your dbt project. In this, you can configure different tools that we consider essential for proper dbt usage: `sqlfluff`, `yaml-lint`, and `dbt-checkpoint` - -```console -dbt-coves setup precommit -``` - -## Models generation - -```console -dbt-coves generate -``` - -Where _\_ could be _sources_, _properties_, _metadata_, _docs_ or _airflow-dags_. - -```console -dbt-coves generate sources -``` - -This command will generate the dbt source configuration as well as the initial dbt staging model(s). 
It will look in the database defined in your `profiles.yml` file or you can pass the `--database` argument or set up default configuration options (see below) - -```console -dbt-coves generate sources --database raw -``` - -Supports Jinja templates to adjust how the resources are generated. See below for examples. - -Every `dbt-coves generate ` supports `--no-prompt` flag, which will silently generate all sources/models/properties/metadata without asking anything to the user. - -### Source Generation Arguments - -dbt-coves can be used to create the initial staging models. It will do the following: - -1. Create / Update the source yml file -2. Create the initial staging model(sql) file and offer to flatten VARIANT(JSON) fields -3. Create the staging model's property(yml) file. - -`dbt-coves generate sources` supports the following args: - -See full list in help - -```console -dbt-coves generate sources -h -``` - -```console ---database -# Database to inspect -``` - -```console ---schemas -# Schema(s) to inspect. Accepts wildcards (must be enclosed in quotes if used) -``` - -```console ---select-relations -# List of relations where raw data resides. The parameter must be enclosed in quotes. Accepts wildcards. -``` - -```console ---exclude-relations -# Filter relation(s) to exclude from source file(s) generation. The parameter must be enclosed in quotes. Accepts wildcards. 
-``` - -```console ---sources-destination -# Where sources yml files will be generated, default: 'models/staging/{{schema}}/sources.yml' -``` - -```console ---models-destination -# Where models sql files will be generated, default: 'models/staging/{{schema}}/{{relation}}.sql' -``` - -```console ---model-props-destination -# Where models yml files will be generated, default: 'models/staging/{{schema}}/{{relation}}.yml' -``` - -```console ---update-strategy -# Action to perform when a property file already exists: 'update', 'recreate', 'fail', 'ask' (per file) -``` - -```console ---templates-folder -# Folder with jinja templates that override default sources generation templates, i.e. 'templates' -``` - -```console ---metadata -# Path to csv file containing metadata, i.e. 'metadata.csv' -``` - -```console ---flatten-json-fields -# Action to perform when JSON fields exist: 'yes', 'no', 'ask' (per file) -``` - -```console ---overwrite-staging-models -# Flag: overwrite existent staging (SQL) files -``` - -```console ---skip-model-props -# Flag: don't create model's property (yml) files -``` - -```console ---no-prompt -# Silently generate source dbt models -``` - -### Properties Generation Arguments - -You can use dbt-coves to generate and update the properties(yml) file for a given dbt model(sql) file. 
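As a usage sketch combining the properties arguments documented here, one possible invocation might look like the following (the selection path and strategy value are illustrative, not required defaults):

```console
dbt-coves generate properties -s models/staging/bays --update-strategy update
```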
- -`dbt-coves generate properties` supports the following args: - -```console ---destination -# Where models yml files will be generated, default: '{{model_folder_path}}/{{model_file_name}}.yml' -``` - -```console ---update-strategy -# Action to perform when a property file already exists: 'update', 'recreate', 'fail', 'ask' (per file) -``` - -```console --s --select -# Filter model(s) to generate property file(s) -``` - -```console ---exclude -# Filter model(s) to exclude from property file(s) generation -``` - -```console ---selector -# Specify dbt selector for more complex model filtering -``` - -```console ---templates-folder -# Folder with jinja templates that override default properties generation templates, i.e. 'templates' -``` - -```console ---metadata -# Path to csv file containing metadata, i.e. 'metadata.csv' -``` - -```console ---no-prompt -# Silently generate dbt models property files -``` - -Note: `--select (or -s)`, `--exclude` and `--selector` work exactly as `dbt ls` selectors do. For usage details, visit [dbt list docs](https://docs.getdbt.com/reference/commands/list) - -### Metadata Generation Arguments - -You can use dbt-coves to generate the metadata file(s) containing the basic structure of the csv that can be used in the above `dbt-coves generate sources/properties` commands. -Usage of these metadata files can be found in [metadata](https://github.com/datacoves/dbt-coves#metadata) below. - -`dbt-coves generate metadata` supports the following args: - -```console ---database -# Database to inspect -``` - -```console ---schemas -# Schema(s) to inspect. Accepts wildcards (must be enclosed in quotes if used) -``` - -```console ---select-relations -# List of relations where raw data resides. The parameter must be enclosed in quotes. Accepts wildcards. -``` - -```console ---exclude-relations -# Filter relation(s) to exclude from source file(s) generation. The parameter must be enclosed in quotes. Accepts wildcards. 
-``` - -```console ---destination -# Where csv file(s) will be generated, default: 'metadata.csv' -# Supports using the Jinja tags `{{relation}}` and `{{schema}}` -# if creating one csv per relation/table in schema, i.e: "metadata/{{relation}}.csv" -``` - -```console ---no-prompt -# Silently generate metadata -``` - -### Metadata - -dbt-coves supports the argument `--metadata` which allows users to specify a csv file containing field types and descriptions to be used when creating the staging models and property files. +#### Supported dbt versions -```console -dbt-coves generate sources --metadata metadata.csv -``` - -Metadata format: -You can download a [sample csv file](sample_metadata.csv) as reference - -| database | schema | relation | column | key | type | description | -| -------- | ------ | --------------------------------- | --------------- | ---- | ------- | ----------------------------------------------- | -| raw | raw | \_airbyte_raw_country_populations | \_airbyte_data | Year | integer | Year of country population measurement | -| raw | raw | \_airbyte_raw_country_populations | \_airbyte_data | | variant | Airbyte data columns (VARIANT) in Snowflake | -| raw | raw | \_airbyte_raw_country_populations | \_airbyte_ab_id | | varchar | Airbyte unique identifier used during data load | - -### Docs generation arguments - -You can use dbt-coves to improve the standard dbt docs generation process. It generates your dbt docs, updates external links so they always open in a new tab. It also has the option to merge production `catalog.json` into the local environment when running in deferred mode, so you can run [dbt-checkpoint](https://github.com/dbt-checkpoint/dbt-checkpoint) hooks even when the model has not been run locally. - -`dbt-coves generate docs` supports the following args: - -```console ---merge-deferred -# Merge a deferred catalog.json into your generated one. -# Flag: no value required. 
-``` - -``` ---state -# Directory where your production catalog.json is located -# Mandatory when using --merge-deferred -``` - -### Generate airflow-dags - -```console -dbt-coves generate airflow-dags -``` - -Translate YML files into their Airflow Python code equivalent. With this, DAGs can be easily written with some `key:value` pairs. - -The basic structure of these YMLs must consist of: - -- Global configurations (description, schedule_interval, tags, catchup, etc.) -- `default_args` -- `nodes`: where tasks and task groups are defined - - each Node is a nested object, with it's `name` as key and it's configuration as values. - - this configuration must cover: - - `type`: 'task' or 'task_group' - - `operator`: Airflow operator that will run the tasks (full _module.class_ naming) - - `dependencies`: whether the task is dependent on another one(s) - - any `key:value` pair of [Operator arguments](https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/index.html) - -#### Airflow DAG Generators - -When a YML Dag `node` is of type `task_group`, **Generators** can be used instead of `Operators`. - -They are custom classes that receive YML `key:value` pairs and return one or more tasks for the respective task group. Any pair specified other than `type: task_group` will be passed to the specified `generator`, and it has the responsibility of returning N amount of `task_name = Operator(params)`. - -We provide some prebuilt Generators: - -- `AirbyteGenerator` creates `AirbyteTriggerSyncOperator` tasks (one per Airbyte connection) - - It must receive Airbyte's `host` and `port`, `airbyte_conn_id` (Airbyte's connection name on Airflow) and a `connection_ids` list of Airbyte Connections to Sync -- `FivetranGenerator`: creates `FivetranOperator` tasks (one per Fivetran connection) - - It must receive Fivetran's `api_key`, `api_secret` and a `connection_ids` list of Fivetran Connectors to Sync. 
-- `AirbyteDbtGenerator` and `FivetranDbtGenerator`: instead of passing them Airbyte or Fivetran connections, they use dbt to discover those IDs. Apart from their parent Generators mandatory fields, they can receive: - - `dbt_project_path`: dbt/project/folder - - `virtualenv_path`: path to a virtualenv in case dbt has to be ran with another Python executable - - `run_dbt_compile`: true/false - - `run_dbt_deps`: true/false - -#### Basic YML DAG example: - -```yaml -description: "dbt-coves DAG" -schedule_interval: "@hourly" -tags: - - version_01 -default_args: - start_date: 2023-01-01 -catchup: false -nodes: - airbyte_dbt: - type: task_group - tooltip: "Sync dbt-related Airbyte connections" - generator: AirbyteDbtGenerator - host: http://localhost - port: 8000 - dbt_project_path: /path/to/dbt_project - virtualenv_path: /virtualenvs/dbt_160 - run_dbt_compile: true - run_dbt_deps: false - airbyte_conn_id: airbyte_connection - task_1: - operator: airflow.operators.bash.BashOperator - bash_command: "echo 'This runs after airbyte tasks'" - dependencies: ["airbyte_dbt"] -``` - -##### Create your custom Generator - -You can create your own DAG Generator. Any `key:value` specified in the YML DAG will be passed to it's constructor. - -This Generator needs: - -- a `imports` attribute: a list of _module.class_ Operator of the tasks it outputs -- a `generate_tasks` method that returns the set of `"task_name = Operator()"` strings to write as the task group tasks. 
- -```python -class PostgresGenerator(): - def __init__(self) -> None: - """ Any key:value pair in the YML Dag will get here """ - self.imports = ["airflow.providers.postgres.operators.postgres.PostgresOperator"] - - def generate_tasks(self): - """ Use your custom logic and return N `name = PostgresOperator()` strings """ - raise NotImplementedError -``` - -### airflow-dags generation arguments - -`dbt-coves generate airflow-dags` supports the following args: - -```console ---yml-path --yaml-path -# Path to the folder containing YML files to translate into Python DAGs - ---dag-path -# Path to the folder where Python DAGs will be generated. - ---validate-operators -# Ensure Airflow operators are installed by trying to import them before writing to Python. -# Flag: no value required - ---generators-folder -# Path to your Python module with custom Generators - ---generators-params -# Object with default values for the desired Generator(s) -# For example: {"AirbyteGenerator": {"host": "http://localhost", "port": "8000"}} - ---secrets-path -# Secret files location for DAG configuration, i.e. 'yml_path/secrets/' -# Secret content must match the YML dag spec of `nodes -> node_name -> config` -``` - -## Extract configuration from Airbyte - -```console -dbt-coves extract airbyte -``` - -Extracts the configuration from your Airbyte sources, connections and destinations (excluding credentials) and stores it in the specified folder. The main goal of this feature is to keep track of the configuration changes in your git repo, and rollback to a specific version when needed. - -Full usage example: - -```console -dbt-coves extract airbyte --host http://airbyte-server --port 8001 --path /config/workspace/load/airbyte -``` - -## Load configuration to Airbyte - -```console -dbt-coves load airbyte -``` - -Loads the Airbyte configuration generated with `dbt-coves extract airbyte` on an Airbyte server. Secrets folder needs to be specified separately. 
You can use [git-secret](https://git-secret.io/) to encrypt secrets and make them part of your git repo. - -### Loading secrets - -Secret credentials can be approached in two different ways: locally or remotely (through a provider/manager). - -In order to load encrypted fields locally: - -```console -dbt-coves load airbyte --secrets-path /path/to/secret/directory - -# This directory must have 'sources', 'destinations' and 'connections' folders nested inside, and inside them the respective JSON files with unencrypted fields. -# Naming convention: JSON unencrypted secret files must be named exactly as the extracted ones. -``` - -To load encrypted fields through a manager (in this case we are connecting to Datacoves' Service Credentials): - -```console ---secrets-manager datacoves -``` - -```console ---secrets-url https://api.datacoves.localhost/service-credentials/airbyte -``` - -```console ---secrets-token -``` - -Full usage example: - -```console -dbt-coves load airbyte --host http://airbyte-server --port 8001 --path /config/workspace/load/airbyte --secrets-path /config/workspace/secrets -``` - -## Extract configuration from Fivetran - -```console -dbt-coves extract fivetran -``` - -Extracts the configuration from your Fivetran destinations and connectors (excluding credentials) and stores it in the specified folder. The main goal of this feature is to keep track of the configuration changes in your git repo, and rollback to a specific version when needed. - -Full usage example: - -```console -dbt-coves extract fivetran --credentials /config/workspace/secrets/fivetran/credentials.yml --path /config/workspace/load/fivetran -``` - -## Load configuration to Fivetran - -```console -dbt-coves load fivetran -``` - -Loads the Fivetran configuration generated with `dbt-coves extract fivetran` on a Fivetran instance. Secrets folder needs to be specified separately. You can use [git-secret](https://git-secret.io/) to encrypt secrets and make them part of your git repo. 
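Mirroring the Airbyte full-usage example above, a hedged sketch of a complete `load fivetran` invocation (all paths are illustrative):

```console
dbt-coves load fivetran --path /config/workspace/load/fivetran --secrets-path /config/workspace/secrets/fivetran
```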
+| Version | Status | +| ------- | ---------------- | +| \< 1.0 | ❌ Not supported | +| >= 1.0 | ✅ Tested | -### Credentials +From `dbt-coves` 1.4.0 onwards, our major and minor versions match those of [dbt-core](https://github.com/dbt-labs/dbt-core). +This means we release a new major/minor version once its dbt-core equivalent is tested. +Patch suffix (1.4.X) is exclusive to our continuous development and does not reflect a version match with dbt. -In order for extract/load fivetran to work properly, you need to provide an api key-secret pair (you can generate them [here](https://fivetran.com/account/settings/account)). +#### Supported dbt adapters -These credentials can be used with `--api-key [key] --api-secret [secret]`, or specyfing a YML file with `--credentials /path/to/credentials.yml`. The required structure of this file is the following: +| Feature | Snowflake | Redshift | BigQuery | +| --------------------------------- | --------- | --------- | --------- | +| source model (sql) generation | ✅ Tested | ✅ Tested | ✅ Tested | +| model properties (yml) generation | ✅ Tested | ✅ Tested | ✅ Tested | -```yaml -account_name: # Any name, used by dbt-coves to ask which to use when more than one is found - api_key: [key] - api_secret: [secret] -account_name_2: - api_key: [key] - api_secret: [secret] -``` +**NOTE:** Other database adapters may work, although we have not tested them. Feel free to try them and let us know so we can update the table above. -This YML file approach allows you to both work with multiple Fivetran accounts, and treat this credentials file as a secret. +## Usage -> :warning: **Warning**: --api-key/secret and --credentials flags are mutually exclusive, don't use them together. +dbt-coves supports the following functions: -### Loading secrets +- [dbt](docs/commands/dbt/): run dbt commands in CI and Airflow environments. 
+- [extract and load](docs/commands/extract%20and%20load/): save and restore your configuration from: + - [Airbyte](docs/commands/extract%20and%20load/airbyte) + - [Fivetran](docs/commands/extract%20and%20load/fivetran) +- [generate](docs/commands/generate/): + - [airflow dags](docs/commands/generate/airflow%20dags/): generate Airflow DAGs from YML files. + - [dbt docs](docs/commands/generate/docs/): generate dbt docs by merging production catalog.json, useful in combination with [dbt-checkpoint](https://github.com/dbt-checkpoint/dbt-checkpoint) and when using Slim CI + - [dbt sources](docs/commands/generate/sources/): generate the dbt source configuration as well as the initial dbt staging model(s) and their corresponding property(yml) files. + - [dbt properties](docs/commands/generate/properties/): generate and/or update the properties(yml) file for a given dbt model(sql) file. + - [metadata](docs/commands/generate/metadata/): generate a metadata extract (CSV file) that can be used to collect column types and descriptions and then provided as input to the `generate sources` or `generate properties` commands + - [templates](docs/commands/generate/templates/): generate the templates that dbt-coves uses in its other commands +- [setup](docs/commands/setup/): used to configure different components of a dbt project. -Secret credentials can be approached via `--secrets-path` flag +For a complete list of options, run: ```console -dbt-coves load fivetran --secrets-path /path/to/secret/directory -``` - -#### Field naming convention - -Although secret files can have any name, unencrypted JSON files must follow a simple structure: - -- Keys should match their corresponding Fivetran destination ID: two words automatically generated by Fivetran, which can be found in previously extracted data. 
-- Inside those keys, a nested dictionary of which fields should be overwritten - -For example: - -```json -{ - "extract_muscle": { - // Internal ID that Fivetran gave to a Snowflake warehouse Destination - "password": "[PASSWORD]" // Field:Value pair - }, - "centre_straighten": { - "password": "[PASSWORD]" - } -} -``` - -## Run dbt commands - -```shell -dbt-coves dbt -- -``` - -Run dbt commands on special environments such as Airflow, or CI workers, with the possibility of changing dbt project location and activating a specific virtual environment in which running commands. - -### Arguments - -`dbt-coves dbt` supports the following arguments - -```shell ---project-dir -# Path of the dbt project where command will be executed, i.e.: /opt/user/dbt_project -``` - -```shell ---virtualenv -# Virtual environment path. i.e.: /opt/user/virtualenvs/airflow -``` - -### Sample usage - -```shell -dbt-coves dbt --project-dir /opt/user/dbt_project --virtualenv /opt/user/virtualenvs/airflow -- run -s model --vars \"{key: value}\" -# Make sure to escape special characters such as quotation marks -# Double dash (--) between and are mandatory -``` - -# Settings - -dbt-coves will read settings from `.dbt_coves/config.yml`. 
A standard settings files could look like this: - -```yaml -generate: - sources: - database: "RAW" # Database where to look for source tables - schemas: # List of schema names where to look for source tables - - RAW - select_relations: # list of relations where raw data resides - - TABLE_1 - - TABLE_2 - exclude_relations: # Filter relation(s) to exclude from source file(s) generation - - TABLE_1 - - TABLE_2 - sources_destination: "models/staging/{{schema}}/{{schema}}.yml" # Where sources yml files will be generated - models_destination: "models/staging/{{schema}}/{{relation}}.sql" # Where models sql files will be generated - model_props_destination: "models/staging/{{schema}}/{{relation}}.yml" # Where models yml files will be generated - update_strategy: ask # Action to perform when a property file already exists. Options: update, recreate, fail, ask (per file) - templates_folder: ".dbt_coves/templates" # Folder where source generation jinja templates are located. Override default templates creating source_props.yml, source_model_props.yml, and source_model.sql under this folder - metadata: "metadata.csv" # Path to csv file containing metadata - flatten_json_fields: ask - - properties: - destination: "{{model_folder_path}}/{{model_file_name}}.yml" # Where models yml files will be generated - # You can specify a different path by declaring it explicitly, i.e.: "models/staging/{{model_file_name}}.yml" - update-strategy: ask # Action to perform when a property file already exists. Options: update, recreate, fail, ask (per file) - select: "models/staging/bays" # Filter model(s) to generate property file(s) - exclude: "models/staging/bays/test_bay" # Filter model(s) to generate property file(s) - selector: "selectors/bay_selector.yml" # Specify dbt selector for more complex model filtering - templates_folder: ".dbt_coves/templates" # Folder where source generation jinja templates are located. 
Override default template creating model_props.yml under this folder - metadata: "metadata.csv" # Path to csv file containing metadata - - metadata: - database: RAW # Database where to look for source tables - schemas: # List of schema names where to look for source tables - - RAW - select_relations: # list of relations where raw data resides - - TABLE_1 - - TABLE_2 - exclude_relations: # Filter relation(s) to exclude from source file(s) generation - - TABLE_1 - - TABLE_2 - destination: # Where metadata file will be generated, default: 'metadata.csv' - - docs: - merge_deferred: true - state: logs/ - dbt_args: "--no-compile --select foo --exclude bar" - - airflow_dags: - yml_path: - dags_path: - generators_params: - AirbyteDbtGenerator: - host: "{{ env_var('AIRBYTE_HOST_NAME') }}" - port: "{{ env_var('AIRBYTE_PORT') }}" - airbyte_conn_id: airbyte_connection - - dbt_project_path: "{{ env_var('DBT_HOME') }}" - run_dbt_compile: true - run_dbt_deps: false - -extract: - airbyte: - path: /config/workspace/load/airbyte # Where json files will be generated - host: http://airbyte-server # Airbyte's API hostname - port: 8001 # Airbyte's API port - fivetran: - path: /config/workspace/load/fivetran # Where Fivetran export will be generated - api_key: [KEY] # Fivetran API Key - api_secret: [SECRET] # Fivetran API Secret - credentials: /opt/fivetran_credentials.yml # Fivetran set of key:secret pairs - # 'api_key' + 'api_secret' are mutually exclusive with 'credentials', use one or the other - -load: - airbyte: - path: /config/workspace/load - host: http://airbyte-server - port: 8001 - secrets_manager: datacoves # (optional) Secret credentials provider (secrets_path OR secrets_manager should be used, can't load secrets locally and remotely at the same time) - secrets_path: /config/workspace/secrets # (optional) Secret files location if secrets_manager was not specified - secrets_url: https://api.datacoves.localhost/service-credentials/airbyte # Secrets url if secrets_manager is 
datacoves - secrets_token: # Secrets auth token if secrets_manager is datacoves - fivetran: - path: /config/workspace/load/fivetran # Where previous Fivetran export resides, subject of import - api_key: [KEY] # Fivetran API Key - api_secret: [SECRET] # Fivetran API Secret - secrets_path: /config/workspace/secrets/fivetran # Fivetran secret fields - credentials: /opt/fivetran_credentials.yml # Fivetran set of key:secret pairs - # 'api_key' + 'api_secret' are mutually exclusive with 'credentials', use one or the other -``` - -## env_var - -From `dbt-coves 1.6.28` onwards, you can consume environment variables in you config file using `"{{env_var('VAR_NAME', 'DEFAULT VALUE')}}"`. For example: - -```yaml -generate: - sources: - database: "{{env_var('MAIN_DATABASE', 'dev_database')}}" - schemas: - - "{{env_var('DEV_SCHEMA', 'John')}}" - - "{{env_var('STAGING_SCHEMA', 'Staging')}}" -``` - -## Telemetry - -dbt-coves has telemetry built in to help the maintainers from Datacoves understand which commands are being used and which are not to prioritize future development of dbt-coves. We do not track credentials nor details of your dbt execution such as model names. The one detail we do use related to dbt is the anonymous user_id to help us identify distinct users. - -By default this is turned on – you can opt out of event tracking at any time by adding the following to your dbt-coves `config.yaml` file: - -```yaml -disable-tracking: true +dbt-coves -h +dbt-coves -h ``` -## Override generation templates - -Customizing generated models and model properties requires placing -template files under the `.dbt-coves/templates` folder. - -There are different variables available in the templates: - -- `adapter_name` refers to the Adapter's class name being used by the target, e.g. `SnowflakeAdapter` when using [Snowflake](https://github.com/dbt-labs/dbt-snowflake/blob/21b52127e7d221db8b92114aae066fb8a7151bba/dbt/adapters/snowflake/impl.py#L33). 
-- `columns` contains the list of relation columns that don't contain nested (JSON) data, it's type is `List[Item]`. -- `nested` contains a dict of nested columns, grouped by column name, it's type is `Dict[column_name, Dict[nested_key, Item]]`. - -`Item` is a `dict` with the keys `id`, `name`, `type`, and `description`, where `id` contains an slugified id generated from `name`. - -### dbt-coves generate sources +## Contributing -#### Source property file (.yml) template - -This file is used to create the sources yml file - -[source_props.yml](dbt_coves/templates/source_props.yml) - -#### Staging model file (.sql) template - -This file is used to create the staging model (sql) files. - -[staging_model.sql](dbt_coves/templates/staging_model.sql) - -#### Staging model property file (.yml) template - -This file is used to create the model properties (yml) file - -[staging_model_props.yml](dbt_coves/templates/staging_model_props.yml) - -### dbt-coves generate properties - -This file is used to create the properties (yml) files for models - -[model_props.yml](dbt_coves/templates/model_props.yml) - -# Thanks - -The project main structure was inspired by [dbt-sugar](https://github.com/bitpicky/dbt-sugar). Special thanks to [Bastien Boutonnet](https://github.com/bastienboutonnet) for the great work done. - -# Authors - -- Sebastian Sassi [\@sebasuy](https://twitter.com/sebasuy) -- [Datacoves](https://datacoves.com/) -- Noel Gomez [\@noel_g](https://twitter.com/noel_g) -- [Datacoves](https://datacoves.com/) -- Bruno Antonellini -- [Datacoves](https://datacoves.com/) - -# About - -Learn more about [Datacoves](https://datacoves.com). - -⚠️ **dbt-coves is still in development, make sure to test it for your dbt project version and DW before using in production and please submit any issues you find. 
We also welcome any contributions from the community** +If you're interested in contributing to the development of dbt-coves, please refer to the [Contributing Guidelines](contributing.md). This document outlines the process for submitting bug reports, feature requests, and code contributions. # Metrics @@ -769,8 +96,6 @@ fury.io](https://badge.fury.io/py/dbt-coves.svg)](https://pypi.python.org/pypi/d [![Code Style](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/ambv/black) [![Imports: -isort](https://img.shields.io/badge/%20imports-isort-%231674b1?style=flat&labelColor=ef8336)](https://pycqa.github.io/isort/) -[![Imports: python](https://img.shields.io/badge/python-3.8%20%7C%203.9-blue)](https://img.shields.io/badge/python-3.8%20%7C%203.9-blue) [![Build](https://github.com/datacoves/dbt-coves/actions/workflows/main_ci.yml/badge.svg)](https://github.com/datacoves/dbt-coves/actions/workflows/main_ci.yml/badge.svg) diff --git a/dbt_coves/config/config.py b/dbt_coves/config/config.py index 8b9f72a0..67683072 100644 --- a/dbt_coves/config/config.py +++ b/dbt_coves/config/config.py @@ -169,6 +169,7 @@ class ConfigModel(BaseModel): setup: Optional[SetupModel] = SetupModel() dbt: Optional[RunDbtModel] = RunDbtModel() data_sync: Optional[DataSyncModel] = DataSyncModel() + disable_tracking: Optional[bool] = False class DbtCovesConfig: @@ -288,6 +289,10 @@ def integrated(self): target[key] = source[key] return config_copy + @property + def disable_tracking(self): + return self._config.disable_tracking + def load_and_validate_config_yaml(self) -> None: if self._config_path: yaml_dict = open_yaml(self._config_path) or {} diff --git a/dbt_coves/utils/tracking.py b/dbt_coves/utils/tracking.py index 5ff82e3a..bde27a60 100644 --- a/dbt_coves/utils/tracking.py +++ b/dbt_coves/utils/tracking.py @@ -20,7 +20,9 @@ def _get_mixpanel_env_token(): def trackable(task, **kwargs): def wrapper(task_instance, **kwargs): exit_code = task(task_instance) - if 
task_instance.args.uuid and not task_instance.args.disable_tracking: + if task_instance.args.uuid and not ( + task_instance.args.disable_tracking or task_instance.coves_config.disable_tracking + ): + try: + task_execution_props = _gen_task_usage_props(task_instance, exit_code) + mixpanel = Mixpanel(token=_get_mixpanel_env_token()) diff --git a/docs/README.md b/docs/README.md new file mode 100644 index 00000000..2cfc225b --- /dev/null +++ b/docs/README.md @@ -0,0 +1,8 @@ +# dbt-coves Documentation + +Welcome to the dbt-coves docs. + +## Table of Contents + +- [Commands](commands/) +- [Settings](settings.md) diff --git a/docs/commands/README.md b/docs/commands/README.md new file mode 100644 index 00000000..180c937b --- /dev/null +++ b/docs/commands/README.md @@ -0,0 +1,26 @@ +# Commands + +## Command Structure + +`dbt-coves` commands follow a hierarchical structure. Each top-level command may have one or more subcommands, and some subcommands may have further nested subcommands. + +For example, the `dbt-coves generate` command is a top-level command, while `dbt-coves generate sources` is a subcommand of `generate`. + +## Command Documentation + +The documentation for each command is organized into separate folders within the `commands` directory. Each folder represents a top-level command, and any subfolders within it represent subcommands. + +For instance, the documentation for the `dbt-coves generate` command and its subcommands can be found in the `generate` folder: + +- `generate/README.md`: Documentation for the `dbt-coves generate` command. +- `generate/sources/README.md`: Documentation for the `dbt-coves generate sources` subcommand. + +This structure allows you to easily navigate and find the documentation for the specific command or subcommand you need. + +## Usage Examples + +Throughout the command documentation, you'll find usage examples that demonstrate how to use each command and its various options. 
These examples are designed to help you understand the command's functionality and provide a starting point for incorporating it into your data engineering workflow. + +## Contributing + +If you find any issues or have suggestions for improving the command documentation, please refer to the [Contributing Guidelines](../contributing.md) for information on how to submit your feedback or contributions. diff --git a/docs/commands/dbt/README.md b/docs/commands/dbt/README.md new file mode 100644 index 00000000..aa049431 --- /dev/null +++ b/docs/commands/dbt/README.md @@ -0,0 +1,29 @@ +# Run dbt commands + +```shell +dbt-coves dbt -- +``` + +Run dbt commands in environments such as Airflow or CI workers. You can specify the dbt project location and activate a virtual environment in which to run the dbt commands. + +### Arguments + +`dbt-coves dbt` supports the following arguments: + +```shell +--project-dir +# Path of the dbt project where the command will be executed, i.e.: /opt/user/dbt_project +``` + +```shell +--virtualenv +# Virtual environment path, i.e.: /opt/user/virtualenvs/airflow +``` + +### Sample usage + +```shell +dbt-coves dbt --project-dir /opt/user/dbt_project --virtualenv /opt/user/virtualenvs/airflow -- run -s model --vars \"{key: value}\" +# Make sure to escape special characters such as quotation marks +# The double dash (--) between dbt-coves arguments and the dbt command is mandatory +``` diff --git a/docs/commands/extract and load/airbyte/README.md b/docs/commands/extract and load/airbyte/README.md new file mode 100644 index 00000000..c4afc06c --- /dev/null +++ b/docs/commands/extract and load/airbyte/README.md @@ -0,0 +1,54 @@ +## Extract configuration from Airbyte + +```console +dbt-coves extract airbyte +``` + +Extracts the configuration from your Airbyte sources, connections and destinations (excluding credentials) and stores it in the specified folder. 
The main goal of this feature is to keep track of the configuration changes in your git repo and roll back to a specific version when needed. + +Full usage example: + +```console +dbt-coves extract airbyte --host http://airbyte-server --port 8001 --path /config/workspace/load/airbyte +``` + +## Load configuration to Airbyte + +```console +dbt-coves load airbyte +``` + +Loads the Airbyte configuration generated with `dbt-coves extract airbyte` on an Airbyte server. The secrets folder needs to be specified separately. You can use [git-secret](https://git-secret.io/) to encrypt secrets and make them part of your git repo. + +### Loading secrets + +Secret credentials can be provided in two different ways: locally or remotely (through a provider/manager). + +In order to load encrypted fields locally: + +```console +dbt-coves load airbyte --secrets-path /path/to/secret/directory + +# This directory must have 'sources', 'destinations' and 'connections' folders nested inside, and inside them the respective JSON files with unencrypted fields. +# Naming convention: JSON unencrypted secret files must be named exactly as the extracted ones. 
+``` + +To load encrypted fields through a manager (in this case we are connecting to Datacoves' Service Credentials): + +```console +--secrets-manager datacoves +``` + +```console +--secrets-url https://api.datacoves.localhost/service-credentials/airbyte +``` + +```console +--secrets-token +``` + +Full usage example: + +```console +dbt-coves load airbyte --host http://airbyte-server --port 8001 --path /config/workspace/load/airbyte --secrets-path /config/workspace/secrets +``` diff --git a/docs/commands/extract and load/fivetran/README.md b/docs/commands/extract and load/fivetran/README.md new file mode 100644 index 00000000..df590b1d --- /dev/null +++ b/docs/commands/extract and load/fivetran/README.md @@ -0,0 +1,69 @@ +## Extract configuration from Fivetran + +```console +dbt-coves extract fivetran +``` + +Extracts the configuration from your Fivetran destinations and connectors (excluding credentials) and stores it in the specified folder. The main goal of this feature is to keep track of the configuration changes in your git repo and roll back to a specific version when needed. + +Full usage example: + +```console +dbt-coves extract fivetran --credentials /config/workspace/secrets/fivetran/credentials.yml --path /config/workspace/load/fivetran +``` + +## Load configuration to Fivetran + +```console +dbt-coves load fivetran +``` + +Loads the Fivetran configuration generated with `dbt-coves extract fivetran` on a Fivetran instance. The secrets folder needs to be specified separately. You can use [git-secret](https://git-secret.io/) to encrypt secrets and make them part of your git repo. + +### Credentials + +In order for extract/load fivetran to work properly, you need to provide an API key-secret pair (you can generate them [here](https://fivetran.com/account/settings/account)). + +These credentials can be used with `--api-key [key] --api-secret [secret]`, or by specifying a YML file with `--credentials /path/to/credentials.yml`. 
The required structure of this file is the following: + +```yaml +account_name: # Any name, used by dbt-coves to ask which to use when more than one is found + api_key: [key] + api_secret: [secret] +account_name_2: + api_key: [key] + api_secret: [secret] +``` + +This YML file approach allows you both to work with multiple Fivetran accounts and to treat this credentials file as a secret. + +> :warning: **Warning**: --api-key/secret and --credentials flags are mutually exclusive, don't use them together. + +### Loading secrets + +Secret credentials can be provided via the `--secrets-path` flag: + +```console +dbt-coves load fivetran --secrets-path /path/to/secret/directory +``` + +#### Field naming convention + +Although secret files can have any name, unencrypted JSON files must follow a simple structure: + +- Keys should match their corresponding Fivetran destination ID: two words automatically generated by Fivetran, which can be found in previously extracted data. +- Inside those keys, provide a nested dictionary of the fields that should be overwritten + +For example: + +```json +{ + "extract_muscle": { + // Internal ID that Fivetran gave to a Snowflake warehouse Destination + "password": "[PASSWORD]" // Field:Value pair + }, + "centre_straighten": { + "password": "[PASSWORD]" + } +} +``` diff --git a/docs/commands/generate/README.md b/docs/commands/generate/README.md new file mode 100644 index 00000000..368bba4f --- /dev/null +++ b/docs/commands/generate/README.md @@ -0,0 +1,24 @@ +# Models generation + +## Overview + +The `dbt-coves generate` command allows you to generate different types of resources based on your project's needs. These resources can include data sources, properties, metadata, dbt docs, and even Airflow DAGs for scheduling and orchestrating your data pipelines. + +By leveraging this command, you can quickly bootstrap new projects, create boilerplate code, and maintain a consistent structure across your data engineering projects. 
This not only improves productivity, but also promotes code reusability and maintainability. + +## Usage + +The general syntax for the `dbt-coves generate` command is as follows: + +```console +dbt-coves generate +``` + +Where `resource` could be: + +- [_airflow-dags_](airflow%20dags/): generate Airflow DAGs for orchestration +- [_docs_](docs/): generate dbt docs +- [_metadata_](metadata/): generate metadata for your database table(s) +- [_properties_](properties/): generate sources' YML schemas +- [_sources_](sources/): generate dbt sources +- [_templates_](templates/): generate dbt-coves config folder and templates diff --git a/docs/commands/generate/airflow dags/README.md b/docs/commands/generate/airflow dags/README.md new file mode 100644 index 00000000..bd2dd425 --- /dev/null +++ b/docs/commands/generate/airflow dags/README.md @@ -0,0 +1,112 @@ +## dbt-coves generate airflow-dags + +```console +dbt-coves generate airflow-dags +``` + +Translate YML files into their Airflow Python code equivalent. With this, DAGs can be easily written with some `key:value` pairs. + +The basic structure of these YMLs must consist of: + +- Global configurations (description, schedule_interval, tags, catchup, etc.) +- `default_args` +- `nodes`: where tasks and task groups are defined + - each Node is a nested object, with its `name` as key and its configuration as values. + - this configuration must cover: + - `type`: 'task' or 'task_group' + - `operator`: Airflow operator that will run the tasks (full _module.class_ naming) + - `dependencies`: the task(s) this task depends on, if any + - any `key:value` pair of [Operator arguments](https://airflow.apache.org/docs/apache-airflow/stable/_api/airflow/operators/index.html) + +### Airflow DAG Generators + +When a YML Dag `node` is of type `task_group`, **Generators** can be used instead of `Operators`. 
+ +Generators are custom classes that receive YML `key:value` pairs and return one or more tasks for the respective task group. Any pair specified other than `type: task_group` will be passed to the specified `generator`, and it has the responsibility of returning any number of `task_name = Operator(params)` tasks. + +We provide some prebuilt Generators: + +- `AirbyteGenerator` creates `AirbyteTriggerSyncOperator` tasks (one per Airbyte connection) + - It must receive Airbyte's `host` and `port`, `airbyte_conn_id` (Airbyte's connection name on Airflow) and a `connection_ids` list of Airbyte Connections to Sync +- `FivetranGenerator`: creates `FivetranOperator` tasks (one per Fivetran connection) + - It must receive Fivetran's `api_key`, `api_secret` and a `connection_ids` list of Fivetran Connectors to Sync. +- `AirbyteDbtGenerator` and `FivetranDbtGenerator`: instead of passing them Airbyte or Fivetran connections, they use dbt to discover those IDs. Apart from their parent Generators' mandatory fields, they can receive: + - `dbt_project_path`: dbt/project/folder + - `virtualenv_path`: path to a virtualenv, in case dbt runs within a specific virtual env + - `run_dbt_compile`: true/false, whether to always run the dbt compile command + - `run_dbt_deps`: true/false, whether to always run the dbt deps command + +### Basic YML DAG example: + +```yaml +description: "dbt-coves DAG" +schedule_interval: "@hourly" +tags: + - version_01 +default_args: + start_date: 2023-01-01 +catchup: false +nodes: + airbyte_dbt: + type: task_group + tooltip: "Sync dbt-related Airbyte connections" + generator: AirbyteDbtGenerator + host: http://localhost + port: 8000 + dbt_project_path: /path/to/dbt_project + virtualenv_path: /virtualenvs/dbt_160 + run_dbt_compile: false + run_dbt_deps: false + airbyte_conn_id: airbyte_connection + task_1: + operator: airflow.operators.bash.DatacovesBashOperator + bash_command: "echo 'This runs after airbyte tasks'" + dependencies: ["airbyte_dbt"] +``` + +### Create your custom Generator + +You 
can create your own DAG Generator. Any `key:value` specified in the YML DAG will be passed to its constructor. + +This Generator needs: + +- an `imports` attribute: a list of the _module.class_ Operators of the tasks it outputs +- a `generate_tasks` method that returns the set of `"task_name = Operator()"` strings to write as the task group tasks. + +```python +class PostgresGenerator: + def __init__(self) -> None: + """ Any key:value pair in the YML Dag will get here """ + self.imports = ["airflow.providers.postgres.operators.postgres.PostgresOperator"] + + def generate_tasks(self): + """ Use your custom logic and return N `name = PostgresOperator()` strings """ + raise NotImplementedError +``` + +### Arguments + +`dbt-coves generate airflow-dags` supports the following args: + +```console +--yml-path --yaml-path +# Path to the folder containing YML files to translate into Python DAGs + +--dag-path +# Path to the folder where Python DAGs will be generated. + +--validate-operators +# Ensure Airflow operators are installed by trying to import them before writing to Python. +# Flag: no value required + +--generators-folder +# Path to your Python module with custom Generators + +--generators-params +# Object with default values for the desired Generator(s) +# For example: {"AirbyteGenerator": {"host": "http://localhost", "port": "8000"}} + +--secrets-path +# Secret files location for DAG configuration, i.e. 'yml_path/secrets/' +# Secret content must match the YML dag spec of `nodes -> node_name -> config` +``` diff --git a/docs/commands/generate/docs/README.md b/docs/commands/generate/docs/README.md new file mode 100644 index 00000000..5cfb7997 --- /dev/null +++ b/docs/commands/generate/docs/README.md @@ -0,0 +1,19 @@ +## dbt-coves generate docs + +You can use dbt-coves to improve the standard dbt docs generation process. It generates your dbt docs and updates external links so they always open in a new tab. 
It also has the option to merge the production `catalog.json` into the local environment when running in deferred mode, so you can run [dbt-checkpoint](https://github.com/dbt-checkpoint/dbt-checkpoint) hooks even when the model has not been run locally, such as when using Slim CI. + +### Arguments + +`dbt-coves generate docs` supports the following args: + +```console +--merge-deferred +# Merge a deferred catalog.json into your generated one. +# Flag: no value required. +``` + +```console +--state +# Directory where your production catalog.json is located +# Mandatory when using --merge-deferred +``` diff --git a/docs/commands/generate/metadata/README.md b/docs/commands/generate/metadata/README.md new file mode 100644 index 00000000..56cd109b --- /dev/null +++ b/docs/commands/generate/metadata/README.md @@ -0,0 +1,39 @@ +## dbt-coves generate metadata + +This command will generate a `dbt-coves metadata` CSV file for your database table(s). This can then be used to collect descriptions from stakeholders and later used as an input to other dbt-coves commands such as `dbt-coves generate sources`. + +The `Metadata` file consists of comma-separated values in which the user can specify column keys and descriptions. It is particularly useful for working with stakeholders to get descriptions for dbt YML files when [generate sources](../sources/README.md#metadata) or [generate properties](../properties/README.md#metadata) is used. + +### Arguments + +`dbt-coves generate metadata` supports the following args: + +```console +--database DATABASE +# Database where source relations live, if different than the dbt target +``` + +```console +--schemas SCHEMAS +# Comma separated list of schemas where raw data resides, i.e. 'RAW_SALESFORCE,RAW_HUBSPOT' +``` + +```console +--select-relations SELECT_RELATIONS +# Comma separated list of relations where raw data resides, i.e. 
'hubspot_products,salesforce_users' +``` + +```console +--exclude-relations EXCLUDE_RELATIONS +# Filter relation(s) to exclude from source file(s) generation +``` + +```console +--destination DESTINATION +# Generated metadata file destination path +``` + +```console +--no-prompt +# Silently generate metadata +``` diff --git a/docs/commands/generate/properties/README.md b/docs/commands/generate/properties/README.md new file mode 100644 index 00000000..57ebc219 --- /dev/null +++ b/docs/commands/generate/properties/README.md @@ -0,0 +1,70 @@ +## dbt-coves generate properties + +### Overview + +[![image](https://cdn.loom.com/sessions/thumbnails/1dc2e830896e48cbbd7495451d25b942-with-play.gif)](https://www.loom.com/share/1dc2e830896e48cbbd7495451d25b942) + +You can use dbt-coves to generate and update the properties(yml) file for a given dbt model(sql) file. + +### Arguments + +`dbt-coves generate properties` supports the following args: + +```console +--destination +# Where models yml files will be generated, default: '{{model_folder_path}}/{{model_file_name}}.yml' +``` + +```console +--update-strategy +# Action to perform when a property file already exists: 'update', 'recreate', 'fail', 'ask' (per file) +``` + +```console +-s --select +# Filter model(s) for which to generate property file(s) +``` + +```console +--exclude +# Filter model(s) to exclude from property file(s) generation +``` + +```console +--selector +# Specify dbt selector for more complex model filtering +``` + +```console +--templates-folder +# Folder with jinja templates that override default properties generation templates, i.e. 'templates' +``` + +```console +--metadata +# Path to csv file containing metadata, i.e. 'metadata.csv' +``` + +```console +--no-prompt +# Silently generate dbt models property files +``` + +Note: `--select (or -s)`, `--exclude` and `--selector` work exactly as `dbt ls` selectors do. 
For usage details, visit [dbt list docs](https://docs.getdbt.com/reference/commands/list). + +### Metadata + +dbt-coves supports the argument `--metadata` which allows users to specify a csv file containing field types and descriptions to be used when creating the staging models and property files. + +```console +dbt-coves generate properties --metadata metadata.csv +``` + +Metadata format: +You can download a [sample csv file](sample_metadata.csv) as a reference + +| database | schema | relation | column | key | type | description | +| -------- | ------ | --------------------------------- | --------------- | ---- | ------- | ----------------------------------------------- | +| raw | raw | \_airbyte_raw_country_populations | \_airbyte_data | Year | integer | Year of country population measurement | +| raw | raw | \_airbyte_raw_country_populations | \_airbyte_data | | variant | Airbyte data columns (VARIANT) in Snowflake | +| raw | raw | \_airbyte_raw_country_populations | \_airbyte_ab_id | | varchar | Airbyte unique identifier used during data load | diff --git a/docs/commands/generate/sources/README.md b/docs/commands/generate/sources/README.md new file mode 100644 index 00000000..f84648ef --- /dev/null +++ b/docs/commands/generate/sources/README.md @@ -0,0 +1,118 @@ +## dbt-coves generate sources + +### Overview + +[![image](https://cdn.loom.com/sessions/thumbnails/28857aab6f13462c9cf8561d2ac982fc-with-play.gif)](https://www.loom.com/share/28857aab6f13462c9cf8561d2ac982fc?sid=3e54cb5e-2346-4216-9aa5-6934ac58d932) + +This command will generate the dbt source configuration as well as the initial dbt staging model(s). It will look in the database defined in your `profiles.yml` file, or you can pass the `--database` argument or set up default configuration options (see below). + +```console +dbt-coves generate sources --database raw +``` + +Supports Jinja templates to adjust how the resources are generated. See below for examples. 
+ +dbt-coves can be used to create the initial staging models. It will do the following: + +1. Create / Update the source yml file +2. Create the initial staging model(sql) file and offer to flatten VARIANT(JSON) fields +3. Create the staging model's property(yml) file. + +**NOTE:** While there is no current option to skip source or staging model generation, if you don't want the source.yml or staging models, you can update the path in the dbt-coves config file to point to a static location such as `/tmp/not_needed.sql` and `/tmp/not_needed.yml` + +### Arguments + +`dbt-coves generate sources` supports the following args: + +See full list in help + +```console +dbt-coves generate sources -h +``` + +```console +--database +# Database to inspect +``` + +```console +--schemas +# Schema(s) to inspect. Accepts wildcards (must be enclosed in quotes if used) +``` + +```console +--select-relations +# List of relations where raw data resides. The parameter must be enclosed in quotes. Accepts wildcards. +``` + +```console +--exclude-relations +# Filter relation(s) to exclude from source file(s) generation. The parameter must be enclosed in quotes. Accepts wildcards. +``` + +```console +--sources-destination +# Where sources yml files will be generated, default: 'models/staging/{{schema}}/sources.yml' +``` + +```console +--models-destination +# Where models sql files will be generated, default: 'models/staging/{{schema}}/{{relation}}.sql' +``` + +```console +--model-props-destination +# Where models yml files will be generated, default: 'models/staging/{{schema}}/{{relation}}.yml' +``` + +```console +--update-strategy +# Action to perform when a file already exists: 'update', 'recreate', 'fail', 'ask' (per file) +``` + +```console +--templates-folder +# Folder with jinja templates that override default sources generation templates, i.e. 'templates' +``` + +```console +--metadata +# Path to csv file containing metadata, i.e. 
'metadata.csv' +``` + +```console +--flatten-json-fields +# Action to perform when JSON fields exist: 'yes', 'no', 'ask' (per file) +``` + +```console +--overwrite-staging-models +# Flag: overwrite existing staging (SQL) files +``` + +```console +--skip-model-props +# Flag: don't create model's property (yml) files +``` + +```console +--no-prompt +# Silently generate source dbt models +``` + +### Metadata + +dbt-coves supports the argument `--metadata` which allows users to specify a csv file containing field types and descriptions to be used when creating the staging models and property files. + +```console +dbt-coves generate sources --metadata metadata.csv +``` + +Metadata format: +You can download a [sample csv file](sample_metadata.csv) as a reference + +| database | schema | relation | column | key | type | description | +| -------- | ------ | --------------------------------- | --------------- | ---- | ------- | ----------------------------------------------- | +| raw | raw | \_airbyte_raw_country_populations | \_airbyte_data | Year | integer | Year of country population measurement | +| raw | raw | \_airbyte_raw_country_populations | \_airbyte_data | | variant | Airbyte data columns (VARIANT) in Snowflake | +| raw | raw | \_airbyte_raw_country_populations | \_airbyte_ab_id | | varchar | Airbyte unique identifier used during data load | diff --git a/docs/commands/generate/templates/README.md b/docs/commands/generate/templates/README.md new file mode 100644 index 00000000..237e0d02 --- /dev/null +++ b/docs/commands/generate/templates/README.md @@ -0,0 +1,7 @@ +## dbt-coves generate templates + +Create dbt-coves templates inside your `.dbt_coves` config folder. These files contain the templates used by generate commands such as `generate sources` and `generate properties`. Use these files to override the default behavior such as to add a `metadata:` key when generating property files. 
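As a rough illustration (not the actual shipped template), an overridden property template placed under `.dbt_coves/templates` could add such a `metadata:` key per column. The sketch below assumes the documented `columns` template variable, whose items expose `id`, `name`, `type` and `description`; the `model_file_name` variable and exact layout are assumptions for illustration:

```yaml
# Hypothetical override of .dbt_coves/templates/model_props.yml (illustration only)
version: 2

models:
  - name: {{ model_file_name }} # assumed variable name
    columns:
    {%- for col in columns %}
      - name: {{ col.name }}
        description: "{{ col.description }}"
        metadata: "" # custom key this override adds, e.g. for stakeholder input
    {%- endfor %}
```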
+ +### In Action + +https://www.loom.com/share/3eb0d4b7a67341f6bd4f2e0c161a8e54?sid=c1db5cca-4977-4fdd-9e3f-63adb723e844 diff --git a/docs/commands/setup/README.md b/docs/commands/setup/README.md new file mode 100644 index 00000000..8ca01aaf --- /dev/null +++ b/docs/commands/setup/README.md @@ -0,0 +1,3 @@ +## dbt-coves setup + +[Work in progress] diff --git a/docs/settings.md b/docs/settings.md new file mode 100644 index 00000000..08a1d010 --- /dev/null +++ b/docs/settings.md @@ -0,0 +1,155 @@ +# Settings + +dbt-coves will read settings from `.dbt_coves/config.yml`. A standard settings file could look like this: + +```yaml +generate: + sources: + database: "RAW" # Database where to look for source tables + schemas: # List of schema names where to look for source tables + - RAW + select_relations: # list of relations where raw data resides + - TABLE_1 + - TABLE_2 + exclude_relations: # Filter relation(s) to exclude from source file(s) generation + - TABLE_1 + - TABLE_2 + sources_destination: "models/staging/{{schema}}/{{schema}}.yml" # Where sources yml files will be generated + models_destination: "models/staging/{{schema}}/{{relation}}.sql" # Where models sql files will be generated + model_props_destination: "models/staging/{{schema}}/{{relation}}.yml" # Where models yml files will be generated + update_strategy: ask # Action to perform when a property file already exists. Options: update, recreate, fail, ask (per file) + templates_folder: ".dbt_coves/templates" # Folder where source generation jinja templates are located. 
Override default templates by creating source_props.yml, source_model_props.yml, and source_model.sql under this folder + metadata: "metadata.csv" # Path to csv file containing metadata + flatten_json_fields: ask + + properties: + destination: "{{model_folder_path}}/{{model_file_name}}.yml" # Where models yml files will be generated + # You can specify a different path by declaring it explicitly, i.e.: "models/staging/{{model_file_name}}.yml" + update-strategy: ask # Action to perform when a property file already exists. Options: update, recreate, fail, ask (per file) + select: "models/staging/bays" # Filter model(s) to generate property file(s) + exclude: "models/staging/bays/test_bay" # Filter model(s) to exclude from property file(s) generation + selector: "selectors/bay_selector.yml" # Specify dbt selector for more complex model filtering + templates_folder: ".dbt_coves/templates" # Folder where source generation jinja templates are located. Override default template by creating model_props.yml under this folder + metadata: "metadata.csv" # Path to csv file containing metadata + + metadata: + database: RAW # Database where to look for source tables + schemas: # List of schema names where to look for source tables + - RAW + select_relations: # list of relations where raw data resides + - TABLE_1 + - TABLE_2 + exclude_relations: # Filter relation(s) to exclude from source file(s) generation + - TABLE_1 + - TABLE_2 + destination: # Where metadata file will be generated, default: 'metadata.csv' + + docs: + merge_deferred: true + state: logs/ + dbt_args: "--no-compile --select foo --exclude bar" + + airflow_dags: + yml_path: + dags_path: + generators_params: + AirbyteDbtGenerator: + host: "{{ env_var('AIRBYTE_HOST_NAME') }}" + port: "{{ env_var('AIRBYTE_PORT') }}" + airbyte_conn_id: airbyte_connection + + dbt_project_path: "{{ env_var('DBT_HOME') }}" + run_dbt_compile: true + run_dbt_deps: false + +extract: + airbyte: + path: /config/workspace/load/airbyte # Where json files will be 
generated + host: http://airbyte-server # Airbyte's API hostname + port: 8001 # Airbyte's API port + fivetran: + path: /config/workspace/load/fivetran # Where Fivetran export will be generated + api_key: [KEY] # Fivetran API Key + api_secret: [SECRET] # Fivetran API Secret + credentials: /opt/fivetran_credentials.yml # Fivetran set of key:secret pairs + # 'api_key' + 'api_secret' are mutually exclusive with 'credentials', use one or the other + +load: + airbyte: + path: /config/workspace/load + host: http://airbyte-server + port: 8001 + secrets_manager: datacoves # (optional) Secret credentials provider (secrets_path OR secrets_manager should be used, can't load secrets locally and remotely at the same time) + secrets_path: /config/workspace/secrets # (optional) Secret files location if secrets_manager was not specified + secrets_url: https://api.datacoves.localhost/service-credentials/airbyte # Secrets url if secrets_manager is datacoves + secrets_token: # Secrets auth token if secrets_manager is datacoves + fivetran: + path: /config/workspace/load/fivetran # Where previous Fivetran export resides, subject of import + api_key: [KEY] # Fivetran API Key + api_secret: [SECRET] # Fivetran API Secret + secrets_path: /config/workspace/secrets/fivetran # Fivetran secret fields + credentials: /opt/fivetran_credentials.yml # Fivetran set of key:secret pairs + # 'api_key' + 'api_secret' are mutually exclusive with 'credentials', use one or the other +``` + +## env_var + +From `dbt-coves 1.6.28` onwards, you can consume environment variables in you config file using `"{{env_var('VAR_NAME', 'DEFAULT VALUE')}}"`. 
+For example:
+
+```yaml
+generate:
+  sources:
+    database: "{{env_var('MAIN_DATABASE', 'dev_database')}}"
+    schemas:
+      - "{{env_var('DEV_SCHEMA', 'John')}}"
+      - "{{env_var('STAGING_SCHEMA', 'Staging')}}"
+```
+
+## Telemetry
+
+dbt-coves has telemetry built in to help the maintainers at Datacoves understand which commands are being used and which are not, so that future development of dbt-coves can be prioritized. We do not track credentials or details of your dbt execution such as model names. The one dbt-related detail we do use is an anonymous user_id, which helps us identify distinct users.
+
+By default this is turned on; you can opt out of event tracking at any time by adding the following to your dbt-coves `config.yml` file:
+
+```yaml
+disable-tracking: true
+```
+
+## Override generation templates
+
+Customizing generated models and model properties requires placing
+template files under the `.dbt_coves/templates` folder.
+
+There are different variables available in the templates:
+
+- `adapter_name` refers to the adapter class name used by the target, e.g. `SnowflakeAdapter` when using [Snowflake](https://github.com/dbt-labs/dbt-snowflake/blob/21b52127e7d221db8b92114aae066fb8a7151bba/dbt/adapters/snowflake/impl.py#L33).
+- `columns` contains the list of relation columns that don't contain nested (JSON) data; its type is `List[Item]`.
+- `nested` contains a dict of nested columns, grouped by column name; its type is `Dict[column_name, Dict[nested_key, Item]]`.
+
+`Item` is a `dict` with the keys `id`, `name`, `type`, and `description`, where `id` contains a slugified id generated from `name`.
+
+### dbt-coves generate sources
+
+#### Source property file (.yml) template
+
+This file is used to create the sources yml file.
+
+[source_props.yml](dbt_coves/templates/source_props.yml)
+
+#### Staging model file (.sql) template
+
+This file is used to create the staging model (sql) files.
+
+[staging_model.sql](dbt_coves/templates/staging_model.sql)
+
+#### Staging model property file (.yml) template
+
+This file is used to create the model properties (yml) file.
+
+[staging_model_props.yml](dbt_coves/templates/staging_model_props.yml)
+
+### dbt-coves generate properties
+
+This file is used to create the properties (yml) files for models.
+
+[model_props.yml](dbt_coves/templates/model_props.yml)
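+
+As an illustration, an override template could combine the variables described above. The sketch below is not the shipped default: it uses only the documented `columns` variable and each `Item`'s `name` and `description` keys, plus a hypothetical `model_name` placeholder for anything the real template context provides beyond those; check the default templates linked above for the authoritative structure.
+
+```yaml
+# .dbt_coves/templates/model_props.yml (illustrative sketch only)
+version: 2
+
+models:
+  - name: {{ model_name }} # hypothetical variable; see the default template for the real context
+    columns:
+    {%- for col in columns %}
+      - name: {{ col.name }}
+        description: "{{ col.description }}"
+    {%- endfor %}
+```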