Rework samples code to mimic best practices (#29)
Rework dbt projects
nclaeys authored Mar 12, 2024
1 parent 8b67cf8 commit cac50a8
Showing 53 changed files with 422 additions and 17 deletions.
1 change: 1 addition & 0 deletions README.md
```diff
@@ -17,6 +17,7 @@ This repository contains a number of sample projects for Conveyor
 ### Basic
 
 - pi_spark: use [Apache Spark](https://github.com/apache/spark) to calculate pi.
+- first_project_dbt: use [dbt](https://github.com/dbt-labs/dbt-core) and [DuckDB](https://github.com/duckdb/duckdb) for the first time. This project is walked through in the Conveyor [getting started guide](https://docs.conveyordata.com/get-started/dbt).
 - coffee_shop_dbt: use [dbt](https://github.com/dbt-labs/dbt-core) and [DuckDB](https://github.com/duckdb/duckdb)
   for cleaning and transforming the coffee shop input data and writing the results to S3.
```
6 changes: 3 additions & 3 deletions basic/coffee_shop_dbt/Dockerfile
```diff
@@ -3,13 +3,13 @@ FROM public.ecr.aws/dataminded/dbt:v1.7.3
 WORKDIR /app
 COPY . .
 
-WORKDIR /app/dbt/coffee_shop_dbt
+WORKDIR /app
 
 # install dependencies
 RUN dbt deps
 
-ENV DBT_PROFILES_DIR="/app/dbt"
-ENV DBT_PROJECT_DIR="/app/dbt/coffee_shop_dbt"
+ENV DBT_PROFILES_DIR="/app"
+ENV DBT_PROJECT_DIR="/app"
 ENV DBT_USE_COLORS="false"
 
 # generate dbt manifest
```
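This change flattens the project layout: `dbt_project.yml` and `profiles.yml` now live at the repository root instead of under `dbt/coffee_shop_dbt/`. A quick local smoke test of the resulting image could look like the sketch below; it assumes the base image's entrypoint is `dbt`, which the `arguments=["build", "--target", "dev"]` passed to the container operator later in this diff suggests:

```bash
# Build the image from the project root, where dbt_project.yml and profiles.yml
# now live (DBT_PROJECT_DIR and DBT_PROFILES_DIR both point to /app).
docker build -t coffee-shop-dbt .

# Run a dbt build inside the container; "build --target dev" is handed straight
# to the dbt entrypoint, mirroring the arguments used in the Airflow DAG.
docker run --rm coffee-shop-dbt build --target dev
```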
10 changes: 4 additions & 6 deletions basic/coffee_shop_dbt/Makefile
```diff
@@ -1,11 +1,9 @@
 current_dir:=$(shell pwd)
 project_name:=coffee_shop_dbt
-rel_project_dir:=dbt/$(project_name)
-rel_profiles_dir:=dbt
-abs_project_dir:=$(current_dir)/$(rel_project_dir)
-abs_profiles_dir:=$(current_dir)/$(rel_profiles_dir)
+abs_project_dir:=$(current_dir)
+abs_profiles_dir:=$(current_dir)
 env_file:=$(current_dir)/.env
-dbt_version:=v1.6.0
+dbt_version:=v1.7.0
 os_docker_flag:=
 ifeq ($(shell uname -s),Linux)
 os_docker_flag += --add-host host.docker.internal:host-gateway
@@ -26,7 +24,7 @@ deps:
 	dbt deps --profiles-dir $(abs_profiles_dir) --project-dir $(abs_project_dir) $(call args,$@)
 
 manifest: env
-	eval "$(docker_dbt_command)" ls --profiles-dir $(rel_profiles_dir) --project-dir $(rel_project_dir) $(call args,$@)
+	eval "$(docker_dbt_command)" ls --profiles-dir $(abs_profiles_dir) --project-dir $(abs_project_dir) $(call args,$@)
 	cp $(abs_project_dir)/target/manifest.json $(current_dir)/dags/manifest.json
 
 debug:
```
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
File renamed without changes.
5 changes: 0 additions & 5 deletions basic/coffee_shop_dbt/dbt/coffee_shop_dbt/.gitignore

This file was deleted.

File renamed without changes.
File renamed without changes.
```diff
@@ -5,7 +5,8 @@ sources:
       schema: "{{target.schema}}_raw"
       description: E-commerce data
       meta:
-        external_location: "read_csv_auto('s3://conveyor-samples-b9a6edf0/coffee-data/raw/{name}.csv', header=1)"
+        external_location: >-
+          {{ 'read_csv_auto("s3://conveyor-samples-b9a6edf0/coffee-data/raw/{name}.csv", header=1)' if target.name != 'local' else 'read_csv_auto("coffee-data/raw/{name}.csv", header=1)' }}
       tables:
         - name: raw_customers
           description: One record per person who has purchased one or more items
```
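For context on what this buys: dbt-duckdb's `external_location` meta key tells DuckDB to read each source table from an expression rather than a physical table, substituting the table's name for `{name}`. The Jinja conditional above therefore resolves to one of two strings depending on the active target, both taken verbatim from the diff:

```yaml
# Resolved when target.name != 'local' (e.g. running on Conveyor): read from S3.
external_location: read_csv_auto("s3://conveyor-samples-b9a6edf0/coffee-data/raw/{name}.csv", header=1)

# Resolved when target.name == 'local': read the CSVs from a relative local path.
external_location: read_csv_auto("coffee-data/raw/{name}.csv", header=1)
```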
File renamed without changes.
```diff
@@ -25,6 +25,5 @@ default:
       extensions:
         - httpfs
         - parquet
-      use_credential_provider: aws
       external_root: "workspace/parquet"
   target: local
```
File renamed without changes.
File renamed without changes.
File renamed without changes.
33 changes: 33 additions & 0 deletions basic/first_project_dbt/.dockerignore
@@ -0,0 +1,33 @@
__pycache__
*.pyc
*.pyo
*.pyd
.Python
pip-log.txt
pip-delete-this-directory.txt
.tox
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*,cover
*.log
.git
.datafy
dags
Dockerfile
tests
target/
logs/
resources/
dags/

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
128 changes: 128 additions & 0 deletions basic/first_project_dbt/.gitignore
@@ -0,0 +1,128 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/

# Translations
*.mo
*.pot

# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal

# Flask stuff:
instance/
.webassets-cache

# Scrapy stuff:
.scrapy

# Sphinx documentation
docs/_build/

# PyBuilder
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# pyenv
#.python-version

# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock

# celery beat schedule file
celerybeat-schedule

# SageMath parsed files
*.sage.py

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Spyder project settings
.spyderproject
.spyproject

# Rope project settings
.ropeproject

# mkdocs documentation
/site

# mypy
.mypy_cache/
.dmypy.json
dmypy.json

# Pyre type checker
.pyre/

# dbt
.user.yml
1 change: 1 addition & 0 deletions basic/first_project_dbt/.python-version
@@ -0,0 +1 @@
3.11.4
14 changes: 14 additions & 0 deletions basic/first_project_dbt/Dockerfile
@@ -0,0 +1,14 @@
FROM public.ecr.aws/dataminded/dbt:v1.7.3

WORKDIR /app
COPY . .
WORKDIR /app

# install dependencies
RUN dbt deps

ENV DBT_PROFILES_DIR="/app"
ENV DBT_PROJECT_DIR="/app"

# Running `dbt ls` at build time populates the dbt cache; dbt then reuses the cache on every startup, which significantly reduces the startup latency of dbt jobs with many models
RUN dbt ls
41 changes: 41 additions & 0 deletions basic/first_project_dbt/Makefile
@@ -0,0 +1,41 @@
current_dir:=$(shell pwd)
project_name:=first_dbt_project
rel_project_dir:=$(project_name)
abs_project_dir:=$(current_dir)
abs_profiles_dir:=$(current_dir)
env_file:=$(current_dir)/.env
dbt_image_version:=v1.6.0
os_docker_flag:=
ifeq ($(shell uname -s),Linux)
os_docker_flag += --add-host host.docker.internal:host-gateway
endif
docker_dbt_shell_command:=docker run --rm $(os_docker_flag) --env-file $(env_file) --entrypoint /bin/bash --privileged -it -e NO_DOCKER=1 --network=host -v $(current_dir):/workspace -w /workspace public.ecr.aws/dataminded/dbt:$(dbt_image_version)
docker_dbt_command:=docker run --rm $(os_docker_flag) --env-file $(env_file) -it -v $(current_dir):/workspace -w /workspace public.ecr.aws/dataminded/dbt:$(dbt_image_version)

supported_args=target models select
args = $(foreach a,$(supported_args),$(if $(value $a),--$a "$($a)"))

env:
	touch $(current_dir)/.env

shell: env
	eval "$(docker_dbt_shell_command)"

deps:
	dbt deps --profiles-dir $(abs_profiles_dir) --project-dir $(abs_project_dir) $(call args,$@)

manifest: env
	eval "$(docker_dbt_command)" ls --profiles-dir ./ --project-dir ./ $(call args,$@)
	cp $(abs_project_dir)/target/manifest.json $(current_dir)/dags/manifest.json

debug:
	dbt debug --profiles-dir $(abs_profiles_dir) --project-dir $(abs_project_dir) $(call args,$@)

test:
	dbt test --profiles-dir $(abs_profiles_dir) --project-dir $(abs_project_dir) $(call args,$@)

run:
	dbt run --profiles-dir $(abs_profiles_dir) --project-dir $(abs_project_dir) $(call args,$@)

docs:
	dbt docs serve --profiles-dir $(abs_profiles_dir) --project-dir $(abs_project_dir) $(call args,$@)
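A note on the `supported_args`/`args` pair at the top of this Makefile: `$(call args,$@)` turns any of the make variables `target`, `models`, or `select` into the matching dbt CLI flag, so options can be passed straight through make. A usage sketch (the model name is hypothetical):

```bash
# `make run target=dev select=my_first_model` expands the run recipe to roughly:
dbt run --profiles-dir "$(pwd)" --project-dir "$(pwd)" \
  --target "dev" --select "my_first_model"
```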
42 changes: 42 additions & 0 deletions basic/first_project_dbt/README.md
@@ -0,0 +1,42 @@
# First_dbt_project

## Prerequisites

- [dbt](https://docs.getdbt.com/dbt-cli/installation/)
- [pyenv](https://github.com/pyenv/pyenv) (recommended)

## Project Structure

```bash
root/
|-- dags/
| |-- project.py
|-- models/
|-- dbt_project.yml
|-- profiles.yml
|-- README.md
|-- Dockerfile
```

## Concepts

### dbt project structure
Consult the following documentation regarding [best practices for project structure](https://discourse.getdbt.com/t/how-we-structure-our-dbt-projects/355).

### environment variables
It is common practice to pass configuration via [environment variables](https://docs.getdbt.com/reference/dbt-jinja-functions/env_var).
Locally, you can use a `.env` file to store credentials.

## Commands
If you have dbt installed locally, you can use the dbt commands from the root of the project.

If you do not have dbt installed locally, you can start a dbt docker container with your local files mounted:
- `make env` to create a local `.env` file
- `make shell` to start a new shell
- `exit` to terminate the container shell

In order to use the `ConveyorDbtTaskFactory` in Airflow, you need a `manifest.json` file in your dags folder.
You can generate the manifest as follows:
- `make manifest` runs `dbt ls` in the dbt container to generate the manifest, then copies `manifest.json` into your dags folder

Consult the [dbt documentation](https://docs.getdbt.com/docs/introduction) for additional commands.
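As a sketch of the environment-variable pattern this README refers to (the variable name and default value here are hypothetical, not taken from this commit):

```yaml
# profiles.yml — read configuration from the environment instead of hard-coding it.
# dbt's env_var() accepts an optional second argument as the fallback when the
# variable is unset, which keeps local runs working without a .env file.
default:
  outputs:
    dev:
      type: duckdb
      path: "{{ env_var('DUCKDB_PATH', 'first_dbt_project.duckdb') }}"
  target: dev
```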
Empty file.
Empty file.
26 changes: 26 additions & 0 deletions basic/first_project_dbt/dags/first_dbt_project.py
@@ -0,0 +1,26 @@
from airflow import DAG
from conveyor.factories import ConveyorDbtTaskFactory
from conveyor.operators import ConveyorContainerOperatorV2
from datetime import datetime, timedelta


default_args = {
    "owner": "Conveyor",
    "depends_on_past": False,
    "start_date": datetime(year=2024, month=3, day=10),
    "email": [],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 0,
    "retry_delay": timedelta(minutes=5),
}


dag = DAG(
    "first_dbt_project", default_args=default_args, schedule_interval="@daily", max_active_runs=1
)
ConveyorContainerOperatorV2(
    dag=dag,
    task_id="task1",
    arguments=["build", "--target", "dev"],
)
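Note that the DAG imports `ConveyorDbtTaskFactory` but, as committed, only schedules a single `dbt build` container. Wiring in the factory the README mentions would look roughly like the sketch below; the `add_tasks_to_dag` call is an assumption about the factory's interface, so check the Conveyor documentation for the exact signature:

```python
# Hypothetical sketch — generate one Airflow task per dbt model from the
# dags/manifest.json produced by `make manifest`, instead of one big build task.
factory = ConveyorDbtTaskFactory()              # assumed constructor; reads dags/manifest.json
start, end = factory.add_tasks_to_dag(dag=dag)  # assumed method name and return value
```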