This repository has been archived by the owner on Nov 24, 2023. It is now read-only.

[WIP] Async Checkpoint with Causality Optimization #2262

Open · wants to merge 9 commits into base: master

Conversation

db-will
Contributor

@db-will db-will commented Oct 25, 2021

What problem does this PR solve?

Initial PR implementing Async Checkpoint with Causality Optimization; this PR is created for discussion purposes.

What is changed and how it works?

Check List

Tests

  • Unit test
  • Integration test
  • Manual test (add detailed scripts or steps below)
  • No code

Code changes

  • Has exported function/method change
  • Has exported variable/fields change
  • Has interface methods change
  • Has persistent data change

Side effects

  • Possible performance regression
  • Increased code complexity
  • Breaking backward compatibility

Related changes

  • Need to cherry-pick to the release branch
  • Need to update the documentation
  • Need to update the dm/dm-ansible
  • Need to be included in the release note

@ti-chi-bot
Member

[REVIEW NOTIFICATION]

This pull request has not been approved.

To complete the pull request process, please ask the reviewers in the list to review by filling /cc @reviewer in the comment.
After your PR has acquired the required number of LGTMs, you can assign this pull request to the committer in the list by filling /assign @committer in the comment to help you merge this pull request.

The full list of commands accepted by this bot can be found here.

Reviewer can indicate their review by submitting an approval review.
Reviewer can cancel approval by submitting a request changes review.

syncer/syncer.go Outdated
s.jobWg.Wait()
s.addCount(true, adminQueueName, job.tp, 1, job.targetTable)
return s.flushCheckPoints()
s.flushCheckPointsAsync(job.wg, job.seq)
Contributor Author

We trigger this flush whenever we call flushJobs(), and from my understanding of the code, that is the situation where we need to flush the checkpoint synchronously. There are 3 places in syncer.go that call this function; we could explore those cases.

Collaborator

Yes, some of the flushJobs calls should wait for the DDLs to actually finish executing.
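The split between fire-and-forget flushes and flushes that DDL must wait on can be sketched roughly like this (illustrative names and types, not the PR's exact code; the real PR passes job.wg and job.seq into flushCheckPointsAsync):

```go
package main

import (
	"fmt"
	"sync"
)

// flushMsg carries a snapshot sequence plus an optional WaitGroup the
// caller waits on when the flush must be synchronous (e.g. after a DDL).
type flushMsg struct {
	seq int
	wg  *sync.WaitGroup // nil for fire-and-forget flushes
}

// runFlushWorker drains input one message at a time, recording each
// flushed sequence and waking any caller that asked to wait.
func runFlushWorker(input <-chan flushMsg, out *[]int) {
	for m := range input {
		*out = append(*out, m.seq) // stand-in for writing the checkpoint table
		if m.wg != nil {
			m.wg.Done()
		}
	}
}

// demo runs one async flush (a DML case) followed by one synchronous
// flush (a DDL case) and returns the order snapshots were flushed in.
func demo() []int {
	input := make(chan flushMsg, 16)
	var flushed []int
	done := make(chan struct{})
	go func() { runFlushWorker(input, &flushed); close(done) }()

	input <- flushMsg{seq: 1} // async: the caller returns immediately

	var wg sync.WaitGroup
	wg.Add(1)
	input <- flushMsg{seq: 2, wg: &wg}
	wg.Wait() // synchronous: DDL proceeds only after seq 2 is flushed

	close(input)
	<-done
	return flushed
}

func main() {
	fmt.Println(demo()) // [1 2]
}
```

Because the worker processes messages in order, a caller that waits on the WaitGroup also knows every earlier snapshot has been flushed.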

@@ -916,6 +924,7 @@ func (s *Syncer) addJob(job *job) error {
s.tctx.L().Info("All jobs is completed before syncer close, the coming job will be reject", zap.Any("job", job))
return nil
}
job.seq = s.getSeq()
Contributor Author

Do we need to call getSeq for every job? And do we also need to use a lock within getSeq()?

Collaborator

Only rows/flush events depend on this seq, but there is no harm in allocating a sequence number for every event. BTW, because this function is only called in the main thread, there is no need to synchronize.
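Since allocation happens only on the syncer's single main thread, the sequence counter can be a plain field with no mutex or atomic, along these lines (hypothetical sketch, not the PR's actual getSeq):

```go
package main

import "fmt"

// seqAllocator hands out monotonically increasing sequence numbers.
// Safe without locking only because addJob (the sole caller) runs on
// one goroutine.
type seqAllocator struct {
	seq int64
}

func (a *seqAllocator) getSeq() int64 {
	a.seq++
	return a.seq
}

func main() {
	var a seqAllocator
	fmt.Println(a.getSeq(), a.getSeq()) // 1 2
}
```

If getSeq were ever called from another goroutine, it would need to switch to sync/atomic or a mutex.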

syncer/syncer.go (resolved)
syncer/syncer.go (resolved)

snapshot := &removeCheckpointSnapshot{
id: id,
globalPointSaveTime: cp.globalPointSaveTime,
Contributor Author

cp.globalPointSaveTime is a reference here; it is mutable, and by the time we flush/remove the checkpoint snapshot, its value might have changed. Alternatively, we could find another way to prevent that from happening.

The same question also applies to other fields.

Collaborator

Seems not; Go's assignment operation copies plain structs by value, not by reference.

Contributor Author

My bad, cp.globalPointSaveTime is definitely not a good example here; I thought it was a pointer. I still have concerns about a few other cases:

  • snapshot.globalPoint = &cp.globalPoint.location
  • tableCpSnapshots[tbl] = point.location
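The difference between the two cases can be shown with a small value-vs-pointer example (illustrative types, not the real checkpoint structs): a struct assignment copies, so the snapshot is safe, while taking an address aliases the live checkpoint and sees later mutations.

```go
package main

import "fmt"

type location struct{ pos int }

type checkpoint struct{ globalPoint location }

// snapshotDemo returns the position observed through a value copy and
// through a pointer after the live checkpoint advances.
func snapshotDemo() (int, int) {
	cp := checkpoint{globalPoint: location{pos: 100}}

	byValue := cp.globalPoint    // independent copy: safe to flush later
	byPointer := &cp.globalPoint // alias into cp: sees later mutations

	cp.globalPoint.pos = 200 // replication advances before the flush runs

	return byValue.pos, byPointer.pos
}

func main() {
	fmt.Println(snapshotDemo()) // 100 200
}
```

So snapshot fields assigned via & (as in snapshot.globalPoint = &cp.globalPoint.location) do carry the aliasing risk raised above, while plain struct assignments do not.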

syncer/checkpoint.go (resolved)
if needFlush {
s.jobWg.Add(1)
j := newFlushJob()
Contributor Author

needFlush is set to true when we find that globalPointSaveTime hasn't been updated for a certain interval. When that happens, we will create a flush job every time a new job is added. A control mechanism is needed here. I implemented one in my branch; we could take it as a reference.
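One possible shape for such a control mechanism (a hypothetical sketch, not the branch's actual implementation) is a guard that starts at most one flush at a time, so a stale checkpoint does not enqueue a flush job for every incoming job:

```go
package main

import "fmt"

// flushGuard throttles flush-job creation: even while the checkpoint
// is stale, only one flush job is in flight at any moment.
type flushGuard struct {
	flushing bool // true while a flush job is queued or running
}

// tryStartFlush reports whether the caller should enqueue a flush job.
func (g *flushGuard) tryStartFlush(stale bool) bool {
	if !stale || g.flushing {
		return false
	}
	g.flushing = true
	return true
}

// flushDone is called by the flush worker once the flush completes.
func (g *flushGuard) flushDone() { g.flushing = false }

func main() {
	var g flushGuard
	fmt.Println(g.tryStartFlush(true)) // true: first stale check starts a flush
	fmt.Println(g.tryStartFlush(true)) // false: one is already in flight
	g.flushDone()
	fmt.Println(g.tryStartFlush(true)) // true: a new flush may start
}
```

In the syncer this guard would live on the main thread next to needFlush, with flushDone signaled from the flush worker.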


// Run read flush tasks from input and execute one by one
func (w *checkpointFlushWorker) Run(ctx *tcontext.Context) {
for msg := range w.input {
Contributor Author

Based on my understanding, we flush checkpoint snapshots one by one here, and I wonder how we guarantee that flushing keeps up with the running DML jobs.

Currently we have a snapshots array field in the checkpoint struct; the initial size of that array, and how we handle the case where it overflows, will need careful design.

Collaborator

Yes, we presume that flushing a checkpoint is not that slow. And since the length of the input chan is finite (16), even if the flush operation is too slow, once the input chan is full it will block the caller from adding more messages, and thus block the whole sync. So the async mechanism should not be worse even if flushing is unexpectedly slow.
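That backpressure behavior falls out of Go's buffered channels directly; the sketch below makes the "buffer full blocks the caller" point observable with a non-blocking send (capacity 2 stands in for the PR's 16; names are illustrative):

```go
package main

import "fmt"

// tryEnqueue attempts a non-blocking send. In the real flush worker the
// send is blocking, so a full buffer stalls the whole sync instead of
// returning false.
func tryEnqueue(ch chan int, seq int) bool {
	select {
	case ch <- seq:
		return true
	default:
		return false // buffer full: a blocking sender would wait here
	}
}

func main() {
	input := make(chan int, 2)
	fmt.Println(tryEnqueue(input, 1)) // true
	fmt.Println(tryEnqueue(input, 2)) // true
	fmt.Println(tryEnqueue(input, 3)) // false: worker must drain first
}
```

So the bounded channel is what caps how far replication can run ahead of checkpoint flushing.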

return fmt.Sprintf("%v(flushed %v)", b.location.location, b.flushedLocation.location)
}

//

comment on exported type SnapshotID should be of the form "SnapshotID ..." (with optional leading article)


@lance6716
Collaborator

/cc @lance6716 @glorv

transfer to @GMHDBJD if @glorv is not available

@ti-chi-bot ti-chi-bot requested review from glorv and lance6716 October 29, 2021 03:29