docs+tests directly in a model #5093

dubravcik · 2020-01-31T11:07:27Z

dubravcik
Jan 31, 2020

Describe the feature

When I started learning dbt, I assumed that documentation was part of a model. That is true once it is published on the docs server, but in the source code the documentation lives separately in schema.yml, which is not intuitive to me. I would expect an option to write the docs close to the code. I can imagine this being part of the config:

Who will this benefit?

When someone changes a model, they can see the docs and edit them immediately, and they also see the names of the columns the model produces.

Thanks :)

drewbanin · 2020-01-31T15:49:42Z

drewbanin
Jan 31, 2020
Maintainer

Thanks for opening this issue @dubravcik! I do like the idea of colocating the docs for a model along with its definition, but I have reservations about how well this scheme will work in practice.

What if you wanted to add column-level tests in here too - would you want to encode all of that information in the model config? And what if there are dozens and dozens of columns?

I think that cramming all of that information into a single config block might be pretty noisy in practice! Curious to hear what you think.


dubravcik · 2020-01-31T20:14:20Z

dubravcik
Jan 31, 2020
Author

Ideally, I'd like to move the YAML inside the model, but that is not possible afaik. Inside a Jinja expression we have to write Python-style syntax, i.e. a dictionary. Or is there another option?

If it has to be a dictionary, can we have multiple config expressions? Then it is just a matter of preference between Python and YAML syntax. If we put the docs and tests close to the output columns, I think it is not so noisy.

{{
  config(
    materialized = "table"
  )
}}

WITH cte AS (
  ....
)
SELECT a.[SK_Account_Master]
      ,[Account Name] = [NAME_AccountName]
      ,[Account Level] = [NAME_AccountLevel]
      ,[Account Type] = [NAME_AccountType]
      ,[Is Testing Account] = [FLAG_IsTestingAccount]
      ,[Partner Revenue] = [AMT_PartnerRevenue_cur]
      ,[Partner Revenue LFY] = [AMT_PartnerRevenue_LFY_cur]
      ,[Partner Revenue TFY] = [AMT_PartnerRevenue_TFY_cur]
      ,[Partner Revenue  Y] = [AMT_PartnerRevenue_12M_cur]
      ,a.[SK_User_Master_Inserted]
      ,a.[SK_User_Master_Modified]
      ,a.[SK_Account_Master_Parent]
      ,a.[SK_Account_Master_SuperParent]
      ,a.[SK_Account_Master_Partner]
      ,a.[SK_Partners_Partner_Master]
      ,a.[SK_Geography_Master]
      ,a.[SK_Currency_Master]
      ,a.[SK_User_Master_TSM]
      ,a.[SK_User_Master_AccountManager]
      ,a.[SK_User_Master_AccountExecutive]
      ,a.[SK_User_Master_CSM]
      ,a.[SK_Campaign_Master_SourceCampaign]
      ,[DTIME Inserted] = a.[DTIME_Inserted]
      ,[DATE Inserted] = a.[SK_Date_Inserted]
      ,[DTIME Modified] = a.[DTIME_Modified]
      ,[DATE Modified] = a.[SK_Date_Modified]
 FROM {{ source('account') }} a
  LEFT JOIN cte ...

{{
  config(
    description = "account is company",
    columns = [
      {
        "name": "SK_Account_Master",
        "description": "primary key of Account",
        "tests": ["unique", "not_null"]
      },
      {
        "name": "Account Name",
        "description": "business name"
      },
      {
        "name": "Account Level",
        "description": "basic advanced pro"
      },
      {
        "name": "Account Type",
        "description": "partner or client"
      },
      {
        "name": "SK_User_Master_Inserted",
        "description": "User who created the Account",
        "tests": [{ "relationships": { "to": "ref('user')", "field": "id" } }]
      }
    ],
    tests = [
      { "accepted_values": {"column_name": "[account type]", "values": ['partner', 'client']}},
      { "unique": {"column_name": "concat([account name], [account type])"}}
    ]
  )
}}

I'm not saying this is exactly how I want it or that I would use it straight away; it's just a suggestion for discussion :)


jrandrews · 2020-01-31T21:50:13Z

jrandrews
Jan 31, 2020

Honestly, I like having it separate. Git-based version control systems are more "friendly" to many smaller files than to fewer larger ones. Having more, smaller files makes it easier to eyeball where changes have taken place and also makes merge conflicts less likely.


dubravcik · 2020-07-01T16:57:13Z

dubravcik
Jul 1, 2020
Author

I have another idea. I don't know whether it is a problem for others, but for me the problem is that the documentation:

- is in another file
- requires creating the file, and the structure inside it, manually (sure, there are some helpers, e.g. "Feature: Command to auto-generate schema.yml files" #1082)
- has to be kept in sync when the model changes (probably the biggest pain)
- doesn't really document the code itself

The idea is to keep the documentation inside the model, using comments. Other programming languages use comments for documentation, so it could work here as well.

For example, a comment of the form --docs <column> <description> could be parsed and used for documentation. The position of such a comment would not matter, because it carries the column name, so it would also work in long models with many CTEs. It could be another way to document a model, and it could be overridden by schema.yml.
What do you think @drewbanin @jrandrews?

SELECT
       --docs full_name full name of customer
       full_name
       --docs country country where customer is located based on IP location
       ,country
       --docs browser_language language set in customer browser
       ,browser_language
       --docs date_created_at date customer created
       --this is a comment unrelated to docs
       ,date_created_at
       --docs date_last_login date customer logged in        
       ,date_last_login
  FROM analytics.customer

I haven't thought about column level tests yet.
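
To make the proposal above a bit more concrete, here is a minimal Python sketch of how such comments might be collected. It is only an illustration: the `--docs <column> <description>` format comes from the proposal, while `extract_column_docs`, `DOCS_COMMENT`, and `EXAMPLE_SQL` are hypothetical names, and nothing like this exists in dbt itself.

```python
import re
from typing import Dict

# Assumed comment format (from the proposal above): --docs <column> <description>
# The first token after "--docs" is taken as the column name, the rest as its description.
DOCS_COMMENT = re.compile(
    r"^\s*--docs\s+(?P<column>\S+)\s+(?P<description>.*\S)\s*$"
)


def extract_column_docs(sql: str) -> Dict[str, str]:
    """Return {column name: description} for every --docs comment in the SQL text."""
    docs = {}
    for line in sql.splitlines():
        match = DOCS_COMMENT.match(line)
        if match:
            docs[match["column"]] = match["description"]
    return docs


# Shortened version of the example model above.
EXAMPLE_SQL = """
SELECT
       --docs full_name full name of customer
       full_name
       --docs country country where customer is located based on IP location
       ,country
       --this is a comment unrelated to docs
       ,date_created_at
  FROM analytics.customer
"""

column_docs = extract_column_docs(EXAMPLE_SQL)
```

Because each comment names its column, the sketch ignores where the comment sits in the file, which matches the "position would not matter" point. Ordinary comments and columns without a `--docs` line are simply skipped.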


jtcohen6 · 2020-12-04T18:47:35Z

jtcohen6
Dec 4, 2020
Maintainer

I'm thinking about this in the same way I'm thinking about #2401, which is on our 1.0 to-do list, and could be essentially summarized as: "Reconcile node configs and resource properties, where possible and it makes sense."

Today, it's not totally clear what can be defined in one vs. the other—or both (tags). This is a big point of confusion for new dbt users. Of course, we don't want to create more confusion in the process: We'll need to come up with ways to reconcile config-properties that are set hierarchically in .yml files with the ones set in config() blocks. I think in-file config() should always win, but I could also see wanting to raise a warning if there's conflicting information set at the exact same level (e.g. the same column described twice).

Personally, I find myself agreeing with @jrandrews's point above. I think that, by and large, we'll want to maintain the functional distinction that exists today, recast as a convention:

- Configure the behavior of a model's execution with config() or dbt_project.yml
- Describe a model's properties in models/*.yml
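
The precedence rule floated in this comment (in-file config() wins, with a warning when the same column is described in both places) could be sketched roughly as follows. This is an assumption-laden illustration rather than dbt behavior: `merge_column_descriptions` is a hypothetical helper, and both inputs are assumed to be plain `{column: description}` mappings that were already parsed elsewhere.

```python
import warnings
from typing import Dict


def merge_column_descriptions(
    yml_columns: Dict[str, str], config_columns: Dict[str, str]
) -> Dict[str, str]:
    """Merge descriptions so that values from config() override values from .yml files."""
    merged = dict(yml_columns)
    for name, description in config_columns.items():
        # Warn only when both sources describe the column and they disagree.
        if name in merged and merged[name] != description:
            warnings.warn(
                f"column '{name}' is described in both places; using the config() value"
            )
        merged[name] = description
    return merged
```

Warning instead of failing keeps existing projects working while still surfacing the duplicate description.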


tomsej · 2021-02-01T15:31:08Z

tomsej
Feb 1, 2021

We had a similar problem and wanted to open an issue. The worst thing for us was that we often forgot to edit the schema file when we added or removed columns. But I think keeping descriptions in SQL can lead to a lot of commits that don't change business logic, let alone merge conflicts. We created a tool that lets us check the differences between the schema and SQL comments, plus much more. Maybe it can help you - https://github.com/offbi/pre-commit-dbt
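
The kind of drift check described here can be pictured with a small Python sketch. It is not how pre-commit-dbt is implemented; `diff_columns` is a hypothetical helper, and it assumes the column names have already been extracted from the model's SQL and from the schema file.

```python
from typing import Dict, Iterable, List


def diff_columns(
    sql_columns: Iterable[str], schema_columns: Iterable[str]
) -> Dict[str, List[str]]:
    """Report columns that appear on only one side (model SQL vs. schema file)."""
    sql_set, schema_set = set(sql_columns), set(schema_columns)
    return {
        # Produced by the model but never documented.
        "missing_in_schema": sorted(sql_set - schema_set),
        # Documented but no longer produced by the model.
        "missing_in_sql": sorted(schema_set - sql_set),
    }
```

A check like this could run before each commit and fail when either list is non-empty, which would catch the "forgot to edit the schema file" case.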


vergenzt · 2022-07-15T20:47:48Z

vergenzt
Jul 15, 2022

I've built out some scripts to do this in a project I maintain. I write specially formatted Javadoc-inspired SQL comments immediately after the declaration of each column that I want to include in the docs, and then I use a jq script to scan the compiled dbt SQL output with a regex looking for a comment of that format. I've been wanting to share for a while, but figured I should start out with a low scope by just describing the approach I took before I put the work into extricating the script from the rest of my codebase. 🙂

The SQL comments look like this:

# file: mymodel.sql

/** @modeldoc
This is my super cool model! It does cool things!
**/

select
  ... as my_cool_column /** @coldoc
    This is my description of the cool column. This whole paragraph gets extracted into
    a yaml file as the description for the column.
    
    You can even define tests too, by including well-formed dbt test JSON after one or
    more @-test annotation(s) at the end of the comment!

    @test "not_null"
    @test {"accepted_values": {"values": ["Cool", "Cooler", "Coolest"]}}
  **/
  ...

and they end up in the "schema.yml" file (written as JSON for ease of automation) looking like this:

# file: models_generated.yml
{
  ...
  "models": [
    ...
    {
      "name": "mymodel",
      "description": "This is my super cool model! It does cool things!",
      "columns": [
        {
          "name": "my_cool_column",
          "description": "This is my description of the cool column. This whole paragraph gets extracted into a yaml file as the description for the column.\n\nYou can even define tests too, by including well-formed dbt test JSON after one or more @-test annotation(s) at the end of the comment!",
          "tests": [
            "not_null",
            {
              "accepted_values": {
                "values": [
                  "Cool",
                  "Cooler",
                  "Coolest"
                ]
              }
            }
          ]
        },
        ...

It's all just from regex extraction, but I've found it to be pretty flexible and easy to use. I've used it for ~6 months now to build and maintain an analytic codebase of ~20 dbt models with ~200 columns in total that are all documented and fully tested this way.
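
For readers who want to picture the regex extraction, here is a rough Python sketch under stated assumptions. The original scripts use jq and have not been shared, so this is not that implementation: the patterns, the `extract_docs` and `_clean` helpers, and the `upper(name)` expression and `some_table` name in the sample are all hypothetical. Only the comment format (`@modeldoc`, `@coldoc`, `@test`) is taken from the example above.

```python
import json
import re

# Assumed comment shapes, copied from the example above:
#   /** @modeldoc ... **/                      describes the model
#   <expr> as <column> /** @coldoc ... **/     describes the column just declared
MODELDOC = re.compile(r"/\*\*\s*@modeldoc\s+(?P<body>.*?)\*\*/", re.DOTALL)
COLDOC = re.compile(
    r"\bas\s+(?P<column>\w+)\s*/\*\*\s*@coldoc\s+(?P<body>.*?)\*\*/",
    re.DOTALL | re.IGNORECASE,
)
# One "@test <json>" annotation per line inside a @coldoc body.
TEST_LINE = re.compile(r"^[ \t]*@test[ \t]+(?P<json>.+?)[ \t]*$", re.MULTILINE)


def _clean(text):
    """Join wrapped lines within a paragraph; keep blank lines as paragraph breaks."""
    paragraphs = re.split(r"\n\s*\n", text.strip())
    return "\n\n".join(" ".join(p.split()) for p in paragraphs if p.strip())


def extract_docs(model_name, sql):
    """Build a schema-style dict for one model from its (compiled) SQL text."""
    model = {"name": model_name, "columns": []}
    model_match = MODELDOC.search(sql)
    if model_match:
        model["description"] = _clean(model_match["body"])
    for col in COLDOC.finditer(sql):
        body = col["body"]
        tests = [json.loads(t["json"]) for t in TEST_LINE.finditer(body)]
        entry = {"name": col["column"], "description": _clean(TEST_LINE.sub("", body))}
        if tests:
            entry["tests"] = tests
        model["columns"].append(entry)
    return model


SAMPLE_SQL = """
/** @modeldoc
This is my super cool model! It does cool things!
**/

select
  upper(name) as my_cool_column /** @coldoc
    This is my description of the cool column.

    @test "not_null"
    @test {"accepted_values": {"values": ["Cool", "Cooler", "Coolest"]}}
  **/
from some_table
"""

model_docs = extract_docs("mymodel", SAMPLE_SQL)
```

Serializing `model_docs` with `json.dumps` would give one entry for the "models" list of a generated file like the one shown above. Since each `@test` payload is parsed as JSON, a malformed annotation fails loudly instead of being silently dropped.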

1 reply

vergenzt Jul 29, 2022

Also, I just found the thread at #375, which has a great discussion of this issue! 🙂

jakebiesinger · 2023-02-03T17:05:30Z

jakebiesinger
Feb 3, 2023

I opened the related #6853 ... seems like there's not much traction on this topic, but it would be sooo nice to have :|

