Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PLAT-108920] adding another dedup iterator to merge samples instead of skipping them #41

Merged
merged 1 commit into from
May 30, 2024

Conversation

jnyi
Copy link
Collaborator

@jnyi jnyi commented May 30, 2024

Raw data has 3 copies, one has missing data points
Screenshot 2024-05-29 at 7 25 36 PM

Before this change, dips in data points:

Screenshot 2024-05-29 at 7 25 43 PM

After this change:

Screenshot 2024-05-29 at 8 35 57 PM
  • I added CHANGELOG entry for this change.
  • Change is not relevant to the end user.

Changes

Verification

More tests, especially during rolling update and the histogram used to have large spikes:

Screenshot 2024-05-30 at 9 06 49 AM

After the change it is gone:

Screenshot 2024-05-30 at 10 35 05 AM

@jnyi jnyi requested review from hczhu-db and yuchen-db May 30, 2024 01:48
@jnyi jnyi force-pushed the PLAT-108920-1 branch from 1bfcdac to 83f5aeb Compare May 30, 2024 03:37
Copy link
Collaborator

@hczhu-db hczhu-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

awesome work!

const UseMergedSeries = "use_merged_series"

// mergedSeries is a storage.Series that implements a simple merge sort algorithm.
// when replicas has conflict values at the same timestamp, the first replica will be selected.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In our use case, there are no conflicting values wit the same timestamp because such data samples are rejected in the write path.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, i meant it isn't exactly a merge sort, it will only keep 1 value at the same timestamp.

pkg/dedup/merge_iter.go Show resolved Hide resolved
Comment on lines +104 to +107
type Options struct {
GroupReplicaPartialResponseStrategy bool
EnableDedupMerge bool
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is much better than a simple var! We can add more options in the future.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep

@jnyi jnyi force-pushed the PLAT-108920-1 branch from 83f5aeb to f346376 Compare May 30, 2024 03:45
@jnyi jnyi force-pushed the PLAT-108920-1 branch from f346376 to 41f59c3 Compare May 30, 2024 04:09
@jnyi jnyi requested review from hczhu-db May 30, 2024 04:09
@jnyi jnyi merged commit 72a3cb8 into databricks:db_main May 30, 2024
12 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants