chore(blockbuilder): cleanup #15730

owen-d · 2025-01-13T23:34:43Z

Rewrites the queue for safer, more reliable state transitions (no more negative pending-job counts)
Fixes an off-by-one polling error where builders would poll indefinitely for a partition's final offset, which did not exist
Adds backoff & error propagation logic for polling so jobs can fail after 3 unsuccessful attempts.
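
A minimal sketch of the "fail after 3 unsuccessful attempts" behavior described above, not the code from this PR. The names `pollWithRetry`, `maxPollAttempts`, and `errPollFailed` are hypothetical, and the doubling wait is an assumed backoff policy; the PR's actual backoff configuration is not shown in this thread.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// maxPollAttempts mirrors the "3 unsuccessful attempts" from the description.
// The constant name is an assumption made for this sketch.
const maxPollAttempts = 3

// errPollFailed is a hypothetical sentinel standing in for a Kafka polling error.
var errPollFailed = errors.New("kafka poll failed")

// pollWithRetry calls poll until it succeeds or has failed maxPollAttempts
// times in a row. The wait doubles between attempts (assumed policy).
// The last polling error is wrapped so callers can see the source.
func pollWithRetry(poll func() error, initialWait time.Duration) error {
	wait := initialWait
	var lastErr error
	for attempt := 1; attempt <= maxPollAttempts; attempt++ {
		if lastErr = poll(); lastErr == nil {
			return nil
		}
		// Only sleep if another attempt is coming.
		if attempt < maxPollAttempts {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return fmt.Errorf("polling failed after %d attempts: %w", maxPollAttempts, lastErr)
}
```

Because the last error is wrapped with `%w`, a caller can still match it with `errors.Is` when deciding to mark the job as failed.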


ashwanthgoli · 2025-01-15T06:07:52Z

pkg/blockbuilder/builder/builder.go

@@ -592,7 +616,7 @@ func (i *BlockBuilder) loadRecords(ctx context.Context, c *kgo.Client, partition
 		}
 	}

-	return lastConsumedOffset, boff.Err()
+	return lastSeenOffset, boff.Err()


nit: this could return offset.Max if it ends up in this branch, but I think this method should return the last consumed offset instead of the last seen offset.

Should we instead set lastSeenOffset here only after processing the record?

Yeah, that's a fair point. The reason I removed this was that, outside of logging, it's not used anywhere. It's returned up the chain and logged, but it doesn't influence behavior. I figured reducing unnecessary complexity made sense in this case. The new behavior is that we log the last seen offset rather than the last consumed offset. However, I should probably just log the final consumed offset (rather than seen), but not thread it through our code unnecessarily.
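
To make the seen-versus-consumed distinction concrete, here is a small sketch under stated assumptions: `record` and `consume` are hypothetical stand-ins rather than the builder's real types, and `-1` is an assumed "nothing yet" sentinel.

```go
package main

// record is a hypothetical stand-in for a Kafka record; only the offset matters here.
type record struct {
	offset int64
}

// consume processes records whose offsets fall below endOffset. It reports
// the last offset it looked at (lastSeen) and the last offset it actually
// processed (lastConsumed). Both start at -1, an assumed sentinel.
func consume(records []record, endOffset int64, process func(record) error) (lastSeen, lastConsumed int64, err error) {
	lastSeen, lastConsumed = -1, -1
	for _, r := range records {
		lastSeen = r.offset
		if r.offset >= endOffset {
			// Seen, but outside the job's range, so never consumed.
			break
		}
		if err = process(r); err != nil {
			return lastSeen, lastConsumed, err
		}
		// Only advance after the record was processed successfully.
		lastConsumed = r.offset
	}
	return lastSeen, lastConsumed, nil
}
```

The two values diverge whenever a record is fetched but not processed, which is why logging the consumed offset is the more useful of the two.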

ashwanthgoli · 2025-01-15T06:17:05Z

pkg/blockbuilder/scheduler/strategy.go

 		startOffset := max(partitionOffset.Commit.At+1, partitionOffset.Start.Offset)
+		// Likewise, endOffset is initially the next available offset: this is why we treat jobs as end-exclusive:


Shouldn't we include the endOffset, since it is a valid offset for a record in the partition? We'd be skipping the last record otherwise.

WDYM? The end offset in this case is exclusive. In partitions which we're caught up on, it is the next available offset, so no record exists there yet. Trying to poll for it when the partition is no longer being written to causes us to hang forever waiting for data which will not appear. This is the reason we were seeing infinite polling on defunct partitions.
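
A sketch of the end-exclusive `[min, max)` convention discussed here, assuming the end offset is the next offset to be written. `jobRange`, `planRange`, and the parameter names are hypothetical; only the start-offset expression mirrors the diff above.

```go
package main

// jobRange is a half-open offset interval [Min, Max). Hypothetical type.
type jobRange struct {
	Min, Max int64
}

// planRange derives a job's offsets from a partition's state.
// committed is the last committed offset, logStart the earliest retained
// offset, and nextOffset the offset the next produced record would get.
// It returns false when there is nothing left to read.
func planRange(committed, logStart, nextOffset int64) (jobRange, bool) {
	// Same idea as max(Commit.At+1, Start.Offset) in the diff.
	start := committed + 1
	if logStart > start {
		start = logStart
	}
	if start >= nextOffset {
		// Caught up: no record exists at nextOffset yet, so no job.
		return jobRange{}, false
	}
	return jobRange{Min: start, Max: nextOffset}, true
}

// lastOffset is the final offset a builder should wait for. Because Max is
// exclusive, this always refers to a record that has already been written.
func (r jobRange) lastOffset() int64 {
	return r.Max - 1
}
```

With an inclusive end, a builder on a defunct partition would wait for the record at `Max`, which never arrives. With an exclusive end it stops after `Max - 1`, and no record is skipped.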

ashwanthgoli · 2025-01-15T06:25:03Z

pkg/blockbuilder/scheduler/queue.go

@@ -88,25 +88,21 @@ type JobQueue struct {
 	metrics *jobQueueMetrics
 }

-// NewJobQueue creates a new job queue instance
+// NewJobQueue creates a new JobQueue2 instance


nit: needs correction

ashwanthgoli · 2025-01-15T15:55:09Z

pkg/blockbuilder/scheduler/queue.go

+		q.metrics.inProgress.Inc()
+		job.StartTime = job.UpdateTime
+	case types.JobStatusComplete, types.JobStatusFailed, types.JobStatusExpired:
+		q.completed.Push(job)


nit: we might have to remove the evicted job from status map

good catch!
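
The concern raised here can be sketched as follows: when a bounded buffer of completed jobs evicts its oldest entry, the matching status-map entry should go with it, or the map grows without bound. All types and names below are hypothetical; the real queue's buffer implementation is not shown in this thread.

```go
package main

// completedBuffer is a fixed-capacity FIFO of finished job IDs.
// Hypothetical stand-in for the queue's completed-jobs buffer.
type completedBuffer struct {
	ids      []string
	capacity int // assumed to be at least 1
}

// push appends id and, if the buffer was already full, returns the ID
// that was evicted to make room.
func (b *completedBuffer) push(id string) (evicted string, ok bool) {
	if len(b.ids) == b.capacity {
		evicted, ok = b.ids[0], true
		b.ids = b.ids[1:]
	}
	b.ids = append(b.ids, id)
	return evicted, ok
}

// queue tracks finished jobs plus a status lookup keyed by job ID.
type queue struct {
	completed completedBuffer
	status    map[string]string
}

func newQueue(capacity int) *queue {
	return &queue{
		completed: completedBuffer{capacity: capacity},
		status:    make(map[string]string),
	}
}

// finish records a terminal status and drops the status entry of any job
// that falls out of the completed buffer.
func (q *queue) finish(id, status string) {
	q.status[id] = status
	if evicted, ok := q.completed.push(id); ok {
		delete(q.status, evicted)
	}
}
```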

ashwanthgoli · 2025-01-15T16:10:20Z

pkg/blockbuilder/scheduler/queue.go

-	q.statusMap[jobMeta.ID()] = types.JobStatusInProgress
-	q.metrics.inProgress.Inc()
+		if finished {
+			level.Debug(q.logger).Log("msg", "ignoring transition for completed job; will recreate", "id", jobID, "from", currentStatus, "to", to)


is this comment correct?

Yes, but perhaps I can be clearer: we're not going to delete the old completed entry for this job. Instead we're going to recreate a copy.
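
One possible reading of "will recreate", sketched under assumptions: a transition for an already finished job leaves the old completed entry untouched and starts a fresh active entry. `jobQueue`, `transition`, and the status names are hypothetical and simplified; this is not the PR's implementation.

```go
package main

type jobStatus string

const (
	statusPending    jobStatus = "pending"
	statusInProgress jobStatus = "in_progress"
	statusComplete   jobStatus = "complete"
)

// jobQueue keeps active jobs separate from a history of finished job IDs.
type jobQueue struct {
	active    map[string]jobStatus // jobs that can still change state
	completed []string             // finished job IDs, kept as history
}

func newJobQueue() *jobQueue {
	return &jobQueue{active: make(map[string]jobStatus)}
}

// isCompleted reports whether id appears in the completed history.
func (q *jobQueue) isCompleted(id string) bool {
	for _, c := range q.completed {
		if c == id {
			return true
		}
	}
	return false
}

// transition moves a job to a new status. If the job is not active but has a
// completed entry, that entry is left in place and a new copy of the job is
// created; recreated reports when this happened.
func (q *jobQueue) transition(id string, to jobStatus) (recreated bool) {
	if _, ok := q.active[id]; !ok && q.isCompleted(id) {
		recreated = true
	}
	if to == statusComplete {
		delete(q.active, id)
		q.completed = append(q.completed, id)
		return recreated
	}
	q.active[id] = to
	return recreated
}
```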

ashwanthgoli · 2025-01-15T16:12:00Z

pkg/blockbuilder/scheduler/queue_test.go

-		beforeSync := time.Now()
-		q.SyncJob(jobID, job)
-		afterSync := time.Now()
+func TestJobQueue2_TransitionState(t *testing.T) {


rename tests that contain Queue2


chore(blockbuilder): more verbose error source propagation

f280a8e

fails job after 3 successive kafka polling errors

owen-d requested a review from a team as a code owner January 13, 2025 23:34

pull-request-size bot added the size/M label Jan 13, 2025

[wip] better queue tracking

3506480

pull-request-size bot added size/L and removed size/M labels Jan 14, 2025

owen-d added 5 commits January 14, 2025 11:18

working on a parallel queue2 impl with safer state transitions

5e6cc9c

moving queue into an internal construct for the scheduler rather than…

6753b5f

… an argument

queue2 expiration checks

bfcfb8f

fixes signature

e512562

queue2 priority updates; refactors scheduler impl using it

bdef959

pull-request-size bot added size/XL and removed size/L labels Jan 14, 2025

owen-d added 3 commits January 14, 2025 12:51

further queue2 refactoring

b42db21

update tests

e427630

replaces queue with new impl

964988f

pull-request-size bot added size/XXL and removed size/XL labels Jan 14, 2025

owen-d added 4 commits January 14, 2025 14:37

removes dead code & unnecessary difference between lastSeen vs lastCo…

8601fc6

…nsumed offsets in builder code

adjusts job building for inclusivity to prevent endless polling

d73f0b5

reports success state transition error

d80dfb8

prefers [min,max) job exclusivity

74f9942

owen-d changed the title ~~chore(blockbuilder): more verbose error source propagation~~ chore(blockbuilder): cleanup Jan 14, 2025

owen-d added 2 commits January 14, 2025 15:43

adds some human readable bits to the scheduler status page

7190d85

status page reports partitions the same way we calculate jobs

4a8fde0

ashwanthgoli reviewed Jan 15, 2025


owen-d added 4 commits January 15, 2025 11:33

use lastConsumedOffset in logging + remove from unnecessary code paths

376f40b

pr feedback

42f4561

refactors lag calculations consistently by introducing partition.Lag …

bfd26b0

…struct

extricates lag publishing to its own loop for more frequent updates

5e9ef1c
