Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

First sync for a large data set takes very long and can cause crashes #156

Open
njuguna-n opened this issue Sep 23, 2024 · 13 comments
Open
Labels
Type: Improvement Make something better

Comments

@njuguna-n
Copy link
Contributor

njuguna-n commented Sep 23, 2024

What feature do you want to improve?
When syncing data for the first time, couch2pg does a good job of copying over a lot of data fast but that means that dbt struggles to keep up and can end up trying to load millions of rows incrementally which has led to issues such as this one.

Describe the improvement you'd like
When the sync is running for the first time there should be a flag set so that either dbt or couch2pg can create the main base tables i.e. document_metadata, contact, and data_record. These tables hold the majority of the data take the most time during dbt runs so pre-populating them would reduce the amount of time required for a sync to complete.

Describe alternatives you've considered
We have manually updated the document_metadata table and deleted some documents but these were temporary measures that should not be done for all production deployments.

@njuguna-n
Copy link
Contributor Author

There are two possible ways to go about this:

  1. Have the tables being created and populated by couch2pg during that first sync with dbt not running and once the sync has less than a specified number of docs yet to sync then the flag can be removed and dbt can run on the existing and incoming data. This would be the fastest but the schema for the tables would be in two places.

  2. Have the dbt container check the flag and create the tables in a non-incremental way. To keep it simple and efficient the sync would have to run and be nearly complete or with a small number of docs yet to be synced so that the tables are only created once and then once the flag is off dbt can update the tables incrementally.

My preference is for option 1 since the main tables are being populated incrementally. I will start by exploring if there is a way to get the schema from the CHT Pipeline base table package so that we have only one source of truth for the schemas.

@njuguna-n
Copy link
Contributor Author

I did some tests on using batched dbt runs with the following results:

With a single Couch2pg instance replicating about 700k docs at a rate of 1000 docs per batch. I set the dbt batch at 10000 records and dbt took on average 3121 seconds to catch up and process all records. Without batching the average time was 2263 seconds

With 7 Couch2pg instances each replicating ~700k docs in batches of 1000 regular dbt runs were much faster than batched dbt runs. This varied depending on the value I set for the batch size but regular dbt runs were still faster since there was no limit to the number of rows to be processed with each run.

I also tried with 20 Couch2pg instances and added a delay between dbt runs of about 5 minutes to simulate models taking long to complete but even in that scenario regular dbt runs we often faster.

My conclusion from this is that in 99% of cases just using incremental tables with dbt runs as they are will be sufficient and performant but in rare cases like MoH Kenya where we have a lot of instances with millions of docs and constrained database resources then batched dbt runs would definitely help but we would be trading off speed for resource conservation. In this case we can run dbt in batches by setting a flag as an environment variable. I have updated this PR to refelct this approach.

@andrablaj
Copy link
Member

andrablaj commented Oct 10, 2024

@njuguna-n dbt 1.9 has a new feature for microbatching incremental models . Could this new feature solve the problem this ticket is stating?

Additionally, you can see more info in this Coalesce presentation, starting from minute 25.

@njuguna-n
Copy link
Contributor Author

@andrablaj Yes, this feature can solve the problem because the incremental model is loaded in batches. The only drawback I see with using this incremental strategy is that the batch size is currently only configurable to be a day and depending on the Couch2pg sync rate and the available Postgres resource the number of documents to be processed might still cause an issue.

I will test this config today and assess whether the CHT Sync and CHT Pipeline PRs for this issue are still required.

@njuguna-n
Copy link
Contributor Author

Testing microbatching with a beta release of dbt 1.9 and this branch of CHT Pipeline I get the error below.

2024-10-14 18:36:57 15:36:57  Running with dbt=1.9.0-b1
2024-10-14 18:36:57 15:36:57  Registered adapter: postgres=1.9.0-b1
2024-10-14 18:36:57 15:36:57  Found 9 models, 27 data tests, 1 operation, 1 source, 693 macros, 8 unit tests
2024-10-14 18:36:57 15:36:57  The selection criterion 'state:modified' does not match any enabled nodes
2024-10-14 18:36:57 15:36:57  Nothing to do. Try checking your model configs and model specification args
2024-10-14 18:36:59 15:36:59  Running with dbt=1.9.0-b1
2024-10-14 18:37:00 15:37:00  Registered adapter: postgres=1.9.0-b1
2024-10-14 18:37:00 15:37:00  Found 9 models, 27 data tests, 1 operation, 1 source, 693 macros, 8 unit tests
2024-10-14 18:37:00 15:37:00  
2024-10-14 18:37:00 15:37:00  Concurrency: 1 threads (target='default')
2024-10-14 18:37:00 15:37:00  
2024-10-14 18:37:00 15:37:00  1 of 9 START sql incremental model v1.dbt_results .............................. [RUN]
2024-10-14 18:37:00 15:37:00  1 of 9 OK created sql incremental model v1.dbt_results ......................... [INSERT 0 0 in 0.11s]
2024-10-14 18:37:00 15:37:00  2 of 9 START sql microbatch model v1.document_metadata ......................... [RUN]
2024-10-14 18:37:00 15:37:00  1 of 1 START batch 2024-10-14 of v1.document_metadata .......................... [RUN]
2024-10-14 18:37:00 15:37:00  1 of 1 ERROR creating batch 2024-10-14 of v1.document_metadata ................. [ERROR in 0.00s]
2024-10-14 18:37:00 15:37:00  Encountered an error while running operation: Compilation Error
2024-10-14 18:37:00   'dbt.artifacts.resources.v1.components.ColumnInfo object' has no attribute 'get'
2024-10-14 18:37:00   
2024-10-14 18:37:00   > in macro get_empty_schema_sql (macros/adapters/columns.sql)
2024-10-14 18:37:00   > called by macro assert_columns_equivalent (macros/relations/column/columns_spec_ddl.sql)
2024-10-14 18:37:00   > called by macro default__get_assert_columns_equivalent (macros/relations/column/columns_spec_ddl.sql)
2024-10-14 18:37:00   > called by macro get_assert_columns_equivalent (macros/relations/column/columns_spec_ddl.sql)
2024-10-14 18:37:00   > called by macro postgres__create_table_as (macros/adapters.sql)
2024-10-14 18:37:00   > called by macro create_table_as (macros/relations/table/create.sql)
2024-10-14 18:37:00   > called by macro default__get_create_table_as_sql (macros/relations/table/create.sql)
2024-10-14 18:37:00   > called by macro get_create_table_as_sql (macros/relations/table/create.sql)
2024-10-14 18:37:00   > called by macro materialization_incremental_default (macros/materializations/models/incremental/incremental.sql)
2024-10-14 18:37:00   > called by <Unknown>

@andrablaj
Copy link
Member

andrablaj commented Oct 14, 2024

@njuguna-n per this discussion, the error above might happen because the dbt-core version is ahead of the postgres adapter version (Registered adapter: postgres=1.9.0-b1 might not exist yet). Have you tried forcing a lower version of the postgres adapter?

@andrablaj
Copy link
Member

Actually, there is a dbt-postgres pre-release 1.9.0-b1. So nevermind.

@njuguna-n
Copy link
Contributor Author

I am using 1.9.0b1 for both dbt-core and dbt-postgres.

@njuguna-n
Copy link
Contributor Author

I have not been able to make the document_metadata model to build successfully using the new microbatching incremental strategy using the pre-release versions of dbt-core and dbt-postgres mentioned above. This is still worth pursuing and testing again once the generally available 1.9 release is ready.

Batching is currently an enhancement that will not be required in most CHT deployments. Additionally, the current solution involves code changes in both the CHT Sync and CHT Pipeline repos with a hard-coded table name being used making future base model updates slightly more fragile. WIth incremental microbatching we would limit the required changes to only the models defined in the CHT Pipeline repo. There is no deployment of CHT or CHT Sync that is actively waiting for this improvement right now so I suggest we hold off on merging the two linked PRs and closing them once we confirm incremental microbatching works as expected.

@njuguna-n njuguna-n moved this from In Progress to Todo in Product Team Activities Oct 16, 2024
@andrablaj
Copy link
Member

andrablaj commented Oct 18, 2024

I tried the dbt-core 1.9.0-b1 version with the test-microbatch-model branch and local couchDB (to which I generated dummy data with the test-data-generator), and I didn't get errors:

2024-10-18 17:50:32 17:50:32  Running with dbt=1.9.0-b1
2024-10-18 17:50:32 17:50:32  Registered adapter: postgres=1.9.0-b1
2024-10-18 17:50:32 17:50:32  Found 9 models, 27 data tests, 1 operation, 1 source, 693 macros, 8 unit tests
2024-10-18 17:50:32 17:50:32  The selection criterion 'state:modified' does not match any enabled nodes
2024-10-18 17:50:32 17:50:32  Nothing to do. Try checking your model configs and model specification args
2024-10-18 17:50:33 17:50:33  Running with dbt=1.9.0-b1
2024-10-18 17:50:33 17:50:33  Registered adapter: postgres=1.9.0-b1
2024-10-18 17:50:33 17:50:33  Found 9 models, 27 data tests, 1 operation, 1 source, 693 macros, 8 unit tests
2024-10-18 17:50:33 17:50:33  
2024-10-18 17:50:33 17:50:33  Concurrency: 1 threads (target='default')
2024-10-18 17:50:33 17:50:33  
2024-10-18 17:50:33 17:50:33  1 of 9 START sql incremental model v1.dbt_results .............................. [RUN]
2024-10-18 17:50:33 17:50:33  1 of 9 OK created sql incremental model v1.dbt_results ......................... [INSERT 0 0 in 0.12s]
2024-10-18 17:50:33 17:50:33  2 of 9 START sql microbatch model v1.document_metadata ......................... [RUN]
2024-10-18 17:50:33 17:50:33  2 of 9 OK created sql microbatch model v1.document_metadata .................... [MERGE 49 in 0.10s]
2024-10-18 17:50:33 17:50:33  3 of 9 START sql incremental model v1.contact .................................. [RUN]
2024-10-18 17:50:33 17:50:33  3 of 9 OK created sql incremental model v1.contact ............................. [INSERT 0 4 in 0.07s]
2024-10-18 17:50:33 17:50:33  4 of 9 START sql incremental model v1.data_record .............................. [RUN]
2024-10-18 17:50:34 17:50:34  4 of 9 OK created sql incremental model v1.data_record ......................... [INSERT 0 0 in 0.07s]
2024-10-18 17:50:34 17:50:34  5 of 9 START sql incremental model v1.user ..................................... [RUN]
2024-10-18 17:50:34 17:50:34  5 of 9 OK created sql incremental model v1.user ................................ [INSERT 0 1 in 0.06s]
2024-10-18 17:50:34 17:50:34  6 of 9 START sql incremental model v1.contact_type ............................. [RUN]
2024-10-18 17:50:34 17:50:34  6 of 9 OK created sql incremental model v1.contact_type ........................ [INSERT 0 4 in 0.05s]
2024-10-18 17:50:34 17:50:34  7 of 9 START sql incremental model v1.person ................................... [RUN]
2024-10-18 17:50:34 17:50:34  7 of 9 OK created sql incremental model v1.person .............................. [INSERT 0 2 in 0.09s]
2024-10-18 17:50:34 17:50:34  8 of 9 START sql incremental model v1.place .................................... [RUN]
2024-10-18 17:50:34 17:50:34  8 of 9 OK created sql incremental model v1.place ............................... [INSERT 0 2 in 0.06s]
2024-10-18 17:50:34 17:50:34  9 of 9 START sql incremental model v1.patient .................................. [RUN]
2024-10-18 17:50:34 17:50:34  9 of 9 OK created sql incremental model v1.patient ............................. [INSERT 0 2 in 0.06s]
2024-10-18 17:50:34 17:50:34  
2024-10-18 17:50:34 17:50:34  1 of 1 START hook: cht_pipeline_base.on-run-end.0 .............................. [RUN]
2024-10-18 17:50:34 17:50:34  1 of 1 OK hook: cht_pipeline_base.on-run-end.0 ................................. [OK in 0.03s]
2024-10-18 17:50:34 17:50:34  
2024-10-18 17:50:34 17:50:34  Finished running 9 incremental models, 1 project hook in 0 hours 0 minutes and 0.78 seconds (0.78s).
2024-10-18 17:50:34 17:50:34  
2024-10-18 17:50:34 17:50:34  Completed successfully
2024-10-18 17:50:34 17:50:34  
2024-10-18 17:50:34 17:50:34  Done. PASS=10 WARN=0 ERROR=0 SKIP=0 TOTAL=10

The only change I made in this repo was updating Dockerfile to use the latest version of dbt-core via:

RUN pip install --upgrade cffi \
    && pip install cryptography~=3.4 \
    && pip install dbt-core==1.9.0b1 dbt-postgres==1.9.0b1

Is there any extra configuration that you had locally, @njuguna-n? What was your setup?

@andrablaj andrablaj moved this from Todo to Next Week's Commitments in Product Team Activities Oct 18, 2024
@njuguna-n
Copy link
Contributor Author

I have just tried it again with this branch and it worked just like in your test. I am not sure what was different about my setup but I will try to recreate it once I am back on Monday (28th October). One thing I have noted is that the microbatch models are handled differently; in the failed run it runs a batch for a specific date range START batch 2024-10-14 of v1.document_metadata but in the successful one it does not log the date START sql microbatch model v1.document_metadata,.

@andrablaj
Copy link
Member

It would be helpful to understand your setup:

  • Were you running CouchDB locally or remotely? More broadly, what was your setup?
  • How were you generating the large data sets? Additionally, how do you define large data sets? (I see 700k in your comment)

@njuguna-n
Copy link
Contributor Author

@andrablaj I have tried to reproduce the error I got with no luck. I have changed the saved_timestamp field in the medic table to have different timestamps on different days and months and also tried various Python versions for the dbt container. I have also tried from a fresh database and with the tables already created and it all works. We still need to wait for the official 1.9 release but happy to go with this approach as opposed to the custom batching logic.

@njuguna-n njuguna-n moved this from Next Week's Commitments to Todo in Product Team Activities Dec 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Type: Improvement Make something better
Projects
No open projects
Status: Todo
Development

No branches or pull requests

2 participants