This repository has been archived by the owner on Nov 24, 2023. It is now read-only.

[WIP] Async Checkpoint with Causality Optimization #2262

Open · wants to merge 9 commits into base: master

Conversation

db-will
Contributor

@db-will db-will commented Oct 25, 2021

What problem does this PR solve?

Initial PR implementing Async Checkpoint with Causality Optimization; this PR is created for discussion purposes.

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

  • Has exported function/method change
  • Has exported variable/fields change
  • Has interface methods change
  • Has persistent data change

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation
  • Need to update the dm/dm-ansible
  • Need to be included in the release note

@ti-chi-bot
Member

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

syncer/syncer.go Outdated
s.jobWg.Wait()
s.addCount(true, adminQueueName, job.tp, 1, job.targetTable)
return s.flushCheckPoints()
s.flushCheckPointsAsync(job.wg, job.seq)
Contributor Author

We trigger this flush whenever we call flushJobs(), and from my understanding of the code, that is the situation where we need to flush the checkpoint synchronously. There are 3 places in syncer.go that call this function; we could explore those cases.

Collaborator

Yes, some of the flushJobs calls should wait for the DDLs to actually finish executing.
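The split between fire-and-forget flushes and flushes that DDL must wait on can be sketched roughly like this (illustrative names and types, not the PR's exact code; the real PR passes job.wg and job.seq into flushCheckPointsAsync):

```go
package main

import (
	"fmt"
	"sync"
)

// flushMsg carries a snapshot sequence plus an optional WaitGroup the
// caller waits on when the flush must be synchronous (e.g. after a DDL).
type flushMsg struct {
	seq int
	wg  *sync.WaitGroup // nil for fire-and-forget flushes
}

// runFlushWorker drains input one message at a time, recording each
// flushed sequence and waking any caller that asked to wait.
func runFlushWorker(input <-chan flushMsg, out *[]int) {
	for m := range input {
		*out = append(*out, m.seq) // stand-in for writing the checkpoint table
		if m.wg != nil {
			m.wg.Done()
		}
	}
}

// demo runs one async flush (a DML case) followed by one synchronous
// flush (a DDL case) and returns the order snapshots were flushed in.
func demo() []int {
	input := make(chan flushMsg, 16)
	var flushed []int
	done := make(chan struct{})
	go func() { runFlushWorker(input, &flushed); close(done) }()

	input <- flushMsg{seq: 1} // async: the caller returns immediately

	var wg sync.WaitGroup
	wg.Add(1)
	input <- flushMsg{seq: 2, wg: &wg}
	wg.Wait() // synchronous: DDL proceeds only after seq 2 is flushed

	close(input)
	<-done
	return flushed
}

func main() {
	fmt.Println(demo()) // [1 2]
}
```

Because the worker processes messages in order, a caller that waits on the WaitGroup also knows every earlier snapshot has been flushed.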

@@ -916,6 +924,7 @@ func (s *Syncer) addJob(job *job) error {
s.tctx.L().Info("All jobs is completed before syncer close, the coming job will be reject", zap.Any("job", job))
return nil
}
job.seq = s.getSeq()
Contributor Author

Do we need to call getSeq for every job? And do we also need to use a lock within getSeq()?

Collaborator

Only rows/flush events depend on this seq, but there is no harm in allocating a sequence number for every event. BTW, because this function is only called in the main thread, there is no need to synchronize.
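Since allocation happens only on the syncer's single main thread, the sequence counter can be a plain field with no mutex or atomic, along these lines (hypothetical sketch, not the PR's actual getSeq):

```go
package main

import "fmt"

// seqAllocator hands out monotonically increasing sequence numbers.
// Safe without locking only because addJob (the sole caller) runs on
// one goroutine.
type seqAllocator struct {
	seq int64
}

func (a *seqAllocator) getSeq() int64 {
	a.seq++
	return a.seq
}

func main() {
	var a seqAllocator
	fmt.Println(a.getSeq(), a.getSeq()) // 1 2
}
```

If getSeq were ever called from another goroutine, it would need to switch to sync/atomic or a mutex.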

syncer/syncer.go (resolved)
syncer/syncer.go (resolved)

snapshot := &removeCheckpointSnapshot{
id: id,
globalPointSaveTime: cp.globalPointSaveTime,
Contributor Author

cp.globalPointSaveTime is a reference here; it is mutable, and by the time we flush/remove the checkpoint snapshot, its value might have changed. Alternatively, we could find another way to prevent that from happening.

The same question also applies to other fields.

Collaborator

Seems not; Go's assignment operation copies plain structs by value, not by reference.

Contributor Author

My bad, cp.globalPointSaveTime is definitely not a good example here; I thought it was a pointer. I still have concerns about a few other cases:

  • snapshot.globalPoint = &cp.globalPoint.location
  • tableCpSnapshots[tbl] = point.location
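The difference between the two cases can be shown with a small value-vs-pointer example (illustrative types, not the real checkpoint structs): a struct assignment copies, so the snapshot is safe, while taking an address aliases the live checkpoint and sees later mutations.

```go
package main

import "fmt"

type location struct{ pos int }

type checkpoint struct{ globalPoint location }

// snapshotDemo returns the position observed through a value copy and
// through a pointer after the live checkpoint advances.
func snapshotDemo() (int, int) {
	cp := checkpoint{globalPoint: location{pos: 100}}

	byValue := cp.globalPoint    // independent copy: safe to flush later
	byPointer := &cp.globalPoint // alias into cp: sees later mutations

	cp.globalPoint.pos = 200 // replication advances before the flush runs

	return byValue.pos, byPointer.pos
}

func main() {
	fmt.Println(snapshotDemo()) // 100 200
}
```

So snapshot fields assigned via & (as in snapshot.globalPoint = &cp.globalPoint.location) do carry the aliasing risk raised above, while plain struct assignments do not.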

syncer/checkpoint.go (resolved)
if needFlush {
s.jobWg.Add(1)
j := newFlushJob()
Contributor Author

needFlush is set to true when we find that globalPointSaveTime hasn't been updated for a certain interval. When that happens, we will create a flush job every time a new job is added. A control mechanism is needed here. I implemented one in my branch; we could take it as a reference.
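One possible shape for such a control mechanism (a hypothetical sketch, not the branch's actual implementation) is a guard that starts at most one flush at a time, so a stale checkpoint does not enqueue a flush job for every incoming job:

```go
package main

import "fmt"

// flushGuard throttles flush-job creation: even while the checkpoint
// is stale, only one flush job is in flight at any moment.
type flushGuard struct {
	flushing bool // true while a flush job is queued or running
}

// tryStartFlush reports whether the caller should enqueue a flush job.
func (g *flushGuard) tryStartFlush(stale bool) bool {
	if !stale || g.flushing {
		return false
	}
	g.flushing = true
	return true
}

// flushDone is called by the flush worker once the flush completes.
func (g *flushGuard) flushDone() { g.flushing = false }

func main() {
	var g flushGuard
	fmt.Println(g.tryStartFlush(true)) // true: first stale check starts a flush
	fmt.Println(g.tryStartFlush(true)) // false: one is already in flight
	g.flushDone()
	fmt.Println(g.tryStartFlush(true)) // true: a new flush may start
}
```

In the syncer this guard would live on the main thread next to needFlush, with flushDone signaled from the flush worker.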


// Run read flush tasks from input and execute one by one
func (w *checkpointFlushWorker) Run(ctx *tcontext.Context) {
for msg := range w.input {
Contributor Author

Based on my understanding, we flush checkpoint snapshots one by one here, and I wonder how we guarantee that flushing keeps up with the running DML jobs.

Currently we have a snapshots array field in the checkpoint struct; the initial size of that array, and how we handle the case where it overflows, will need careful design.

Collaborator

Yes, we presume that flushing a checkpoint is not that slow. And since the length of the input chan is finite (16), even if the flush operation is too slow, once the input chan is full it will block the caller from adding more messages, and thus block the whole sync. So the async mechanism should not be worse even if flushing is unexpectedly slow.
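That backpressure behavior falls out of Go's buffered channels directly; the sketch below makes the "buffer full blocks the caller" point observable with a non-blocking send (capacity 2 stands in for the PR's 16; names are illustrative):

```go
package main

import "fmt"

// tryEnqueue attempts a non-blocking send. In the real flush worker the
// send is blocking, so a full buffer stalls the whole sync instead of
// returning false.
func tryEnqueue(ch chan int, seq int) bool {
	select {
	case ch <- seq:
		return true
	default:
		return false // buffer full: a blocking sender would wait here
	}
}

func main() {
	input := make(chan int, 2)
	fmt.Println(tryEnqueue(input, 1)) // true
	fmt.Println(tryEnqueue(input, 2)) // true
	fmt.Println(tryEnqueue(input, 3)) // false: worker must drain first
}
```

So the bounded channel is what caps how far replication can run ahead of checkpoint flushing.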

return fmt.Sprintf("%v(flushed %v)", b.location.location, b.flushedLocation.location)
}

//

comment on exported type SnapshotID should be of the form "SnapshotID ..." (with optional leading article)


@lance6716
Collaborator

/cc @lance6716 @glorv

transfer to @GMHDBJD if @glorv is not available

@ti-chi-bot ti-chi-bot requested review from glorv and lance6716 October 29, 2021 03:29