This little application is to be used in conjunction with The Tuva Project dbt package when using PostgreSQL for your data warehouse. The Tuva Seed Reader allows you to override the load_seed macro with a PostgreSQL version that leverages COPY FROM PROGRAM. You guessed it: the program that you are copying the data from is the Tuva Seed Reader. The same URI and PATTERN arguments passed to the post-hook calls found in The Tuva Project dbt_project.yml file are used by the Tuva Seed Reader.
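To make the mechanism concrete, here is a rough sketch of the kind of statement the overridden macro ends up executing. The table name, bucket path, and file pattern are placeholders for illustration only, not the actual values from The Tuva Project.

```sql
-- Hypothetical rendered statement (placeholder table, path, and pattern):
-- tuva-seed-reader fetches the matching seed files and streams CSV rows to
-- standard output, which COPY FROM PROGRAM loads into the target table.
COPY my_schema.my_seed_table
FROM PROGRAM 'tuva-seed-reader your-bucket/path/to/seed .*\.csv.*'
WITH (FORMAT CSV, HEADER MATCH);
```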
You need to complete two tasks in order to get the Tuva Seed Reader working with your Tuva data project.
- Install the Tuva Seed Reader on your PostgreSQL server.
- Override the load_seed macro in your dbt project.
Windows does not have a global installation for pipx, so you need to run the PostgreSQL service under a local or domain account. When you connect, you will need to connect as this service user so you can complete the following steps.
Connect to your PostgreSQL Server and complete these steps:
- Install Python 3.9 or greater (I tested mainly with 3.12).
- Connect to the server as the service user.
- Install pipx.
  ```
  py -m pip install --user pipx
  ```
- Ensure pipx path.
  ```
  pipx ensurepath
  ```
- Install the Tuva Seed Reader.
  ```
  pipx install https://github.com/tuva-health/tuva-seed-reader/archive/refs/heads/main.zip
  ```
- Test that you can run the Tuva Seed Reader (an optional database-side check is sketched after this list).
  ```
  tuva-seed-reader
  ```
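Once the steps above succeed, you can optionally confirm from psql that the database side can reach the executable. This is only a sketch: COPY FROM PROGRAM requires superuser rights (or membership in the pg_execute_server_program role) and runs the command as the OS account that owns the PostgreSQL service, so a successful result means that account has tuva-seed-reader on its PATH. The temp table name here is a placeholder.

```sql
-- Optional sanity check; use "where" instead of "which" if your server runs on Windows.
CREATE TEMP TABLE tuva_seed_reader_check (output text);
COPY tuva_seed_reader_check FROM PROGRAM 'which tuva-seed-reader';
SELECT * FROM tuva_seed_reader_check;
```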
In your dbt project folder, locate your dbt_project.yml file. Open it and add the dispatch block below. Change the placeholder 'your_project_name_here' to your project's macro namespace. If you do not have one, use the name of your project. That is what I did and it seems to work nicely.
```yaml
dispatch:
  - macro_namespace: the_tuva_project
    search_order: ['your_project_name_here', 'the_tuva_project']
```
In your dbt project folder, create a macros folder if you do not already have one. In that folder, create a new file named load_seed.sql. Copy and paste in the code below and save the file.
```sql
{% macro postgres__load_seed(uri, pattern, compression, headers, null_marker) %}
  {% set sql %}
    COPY {{ this }}
    FROM PROGRAM 'tuva-seed-reader {{ uri }} {{ pattern }}'
    WITH (FORMAT CSV{% if headers == True %}, HEADER MATCH{% endif %})
  {% endset %}
  {% if execute %}
    {# debugging { log(sql, True)} #}
    {% set results = run_query(sql) %}
    {{ log("Loaded data from external s3 resource\n loaded to: " ~ this ~ "\n from: s3://" ~ uri, True) }}
    {# debugging { log(results, True) } #}
  {% endif %}
{% endmacro %}
```
I would suggest running `dbt seed` or `dbt build`. Hopefully you will see all the terminology, reference, and value set data get seeded properly. If you have a failure and want to see more in the logs, you can add `--log-level debug` to the dbt command you used.
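If you want a quick spot check after the run, a row count against one of the seeded tables is usually enough. The schema and table below are placeholders; substitute the schema your dbt profile targets and any seed listed in the dbt output.

```sql
-- Hypothetical spot check: replace the schema and table with one of the
-- seeds from your dbt run output.
SELECT count(*) FROM your_schema.your_seed_table;
```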