Deduplication Job #6
Conversation
Dedupe integration test run looks as follows:
A few questions:
- Where are you going to run it? A VM, GKE, or something else?
- What are the reasons you chose to do it with Spring rather than Cloud Composer? I guess because Cloud Composer is a bit pricey?
Resolved review comments (now outdated) on hedera-dedupe-bigquery/src/main/java/com/hedera/dedupe/DedupeRunner.java
In a VM. DevOps provisioned an instance to run the Importer; I'm planning to use the same one for this.
I explored a few other options, but it was much easier to have a single Java process doing everything, so I went with that for v1. We can always upgrade in the future. Is that okay with you?
That's ok. As you said, we can upgrade later if necessary. Make sure there are health checks set up for the machine, along with disk space and memory monitoring, and logging.
Deduplication job will run as a Spring application. It'll be triggered every 5 min (configurable) and will do the following:
- Get state of last run
- Check if there is new data to be deduplicated
- Check if there are duplicates in new data
- Deduplicate
- Save new state

Changes:
- Adds metrics to track the deduplication job
- Adds new column 'dedupe' to the transactions table
- Adds integration test
- Adds create-tables.sh script to create all tables needed by hedera-etl

Signed-off-by: Apekshit Sharma <[email protected]>
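To make the "check if there are duplicates in new data" step concrete, here is a minimal sketch of how such a check could be issued through the google-cloud-bigquery Java client. Only the `transactions` table and `consensusTimestamp` column names come from this PR; the project/dataset placeholders, window parameters, and the query itself are illustrative assumptions, not the code in this change.

```java
import com.google.cloud.bigquery.BigQuery;
import com.google.cloud.bigquery.BigQueryOptions;
import com.google.cloud.bigquery.QueryJobConfiguration;
import com.google.cloud.bigquery.QueryParameterValue;
import com.google.cloud.bigquery.TableResult;

// Hypothetical sketch: count duplicated consensusTimestamps in the window of
// new data. Project/dataset names and window bounds are placeholders.
public class DuplicateCheckExample {

    public static void main(String[] args) throws InterruptedException {
        BigQuery bigquery = BigQueryOptions.getDefaultInstance().getService();

        String sql =
            "SELECT COUNT(*) AS duplicates FROM ("
          + "  SELECT consensusTimestamp"
          + "  FROM `my-project.my_dataset.transactions`"
          + "  WHERE consensusTimestamp BETWEEN @startTs AND @endTs"
          + "  GROUP BY consensusTimestamp"
          + "  HAVING COUNT(*) > 1)";

        QueryJobConfiguration query = QueryJobConfiguration.newBuilder(sql)
            .setUseLegacySql(false)
            .addNamedParameter("startTs", QueryParameterValue.int64(0L))
            .addNamedParameter("endTs", QueryParameterValue.int64(Long.MAX_VALUE))
            .build();

        TableResult result = bigquery.query(query);
        long duplicates =
            result.iterateAll().iterator().next().get("duplicates").getLongValue();
        System.out.println("Duplicate timestamps in window: " + duplicates);
    }
}
```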
Absolutely. Tracking it here: #9.
Hey, if you don't mind, can we please merge this in and do further changes in smaller PRs?
Signed-off-by: Apekshit Sharma <[email protected]>
Sure thing.
- Scripts to make dedupe deployment easier
- Fix bug in schema which resulted in the failed insert (#16)
- Added metric to dataflow job to track ingestionDelay from consensusTimestamp (#6)
- Add transaction_types table (#17)
- Consistent naming of schema files
- Keep all schema files together. Move to a better location in future.

Signed-off-by: Apekshit Sharma <[email protected]>
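For the ingestionDelay metric mentioned above, here is a hedged sketch of how a Dataflow (Apache Beam) transform could record the lag between a row's consensusTimestamp and ingestion time. The class name, field access, and time units are assumptions for illustration, not the code added in #6.

```java
import com.google.api.services.bigquery.model.TableRow;
import org.apache.beam.sdk.metrics.Distribution;
import org.apache.beam.sdk.metrics.Metrics;
import org.apache.beam.sdk.transforms.DoFn;

/** Illustrative DoFn: records how far behind ingestion is relative to consensusTimestamp. */
public class IngestionDelayFn extends DoFn<TableRow, TableRow> {

    // A distribution metric exposes min/max/mean delay in the Dataflow UI.
    private final Distribution ingestionDelay =
        Metrics.distribution(IngestionDelayFn.class, "ingestionDelay");

    @ProcessElement
    public void processElement(ProcessContext context) {
        TableRow row = context.element();
        // Assumes consensusTimestamp is stored as integer epoch seconds; adjust
        // for the real schema (e.g. nanoseconds) as needed.
        long consensusSeconds =
            Long.parseLong(String.valueOf(row.get("consensusTimestamp")));
        long delaySeconds = (System.currentTimeMillis() / 1000L) - consensusSeconds;
        ingestionDelay.update(delaySeconds);
        context.output(row);
    }
}
```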
Detailed description:
Deduplication job will run as a Spring application.
It'll be triggered every 5 min (configurable) and will do the following (a minimal illustrative sketch follows below):
- Get state of last run
- Check if there is new data to be deduplicated
- Check if there are duplicates in new data
- Deduplicate
- Save new state
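For illustration only, here is a minimal sketch of what such a scheduled runner could look like using Spring's `@Scheduled` support. The class, method, and property names are hypothetical and this is not the actual DedupeRunner implementation from this PR.

```java
import org.springframework.scheduling.annotation.Scheduled;
import org.springframework.stereotype.Component;

// Requires @EnableScheduling on the Spring Boot application class.
@Component
public class DedupeRunnerSketch {

    // Runs every 5 minutes by default; the property name is hypothetical.
    @Scheduled(fixedRateString = "${hedera.dedupe.fixedRate:300000}")
    public void run() {
        Object state = loadLastRunState();  // 1. Get state of last run
        if (!hasNewData(state)) {           // 2. Check if there is new data
            return;
        }
        if (hasDuplicates(state)) {         // 3. Check for duplicates in new data
            deduplicate(state);             // 4. Deduplicate
        }
        saveState(state);                   // 5. Save new state
    }

    // Placeholder steps; in practice these would issue BigQuery queries.
    private Object loadLastRunState() { return new Object(); }
    private boolean hasNewData(Object state) { return true; }
    private boolean hasDuplicates(Object state) { return false; }
    private void deduplicate(Object state) { }
    private void saveState(Object state) { }
}
```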
Changes:
- Renames hedera-etl-dataflow to hedera-etl-bigquery
Signed-off-by: Apekshit Sharma <[email protected]>
Special notes for your reviewer:
Documentation changes will be in a follow-up PR.
Checklist