[compat] [server] [client] [test] Global RT DIV: Chunking Support #1385
Conversation
…ssages, and renamed `GlobalDivState` avro object to `GlobalRtDivState`. 🌯
2. Split `ChunkAssembler` for RT DIV into its own object. 🍐
3. `GlobalRtDiv` serializer is per-message to be safe, because it doesn't seem to be thread-safe. 🍋‍🟩
4. Fixed spotbugs. 🌶️
5. Fixed `divChunkAssembler` for the SIT unit test. 🫨
Force-pushed from `7ee8cd9` to `d3ae029`.
Resolved review threads (outdated):
- internal/venice-common/src/main/java/com/linkedin/venice/writer/VeniceWriter.java
- internal/venice-common/src/test/java/com/linkedin/venice/writer/VeniceWriterUnitTest.java
…will have `GLOBAL_RT_DIV` as the `MessageType`. 🏯
…ject` in the `sendMessageFunction`. 🪀
```diff
@@ -1479,7 +1482,7 @@ protected void updateOffsetMetadataInOffsetRecord(PartitionConsumptionState part
       upstreamTopic = versionTopic;
     }
     if (upstreamTopic.isRealTime()) {
-      offsetRecord.resetUpstreamOffsetMap(partitionConsumptionState.getLatestProcessedUpstreamRTOffsetMap());
+      offsetRecord.mergeUpstreamOffsets(partitionConsumptionState.getLatestProcessedUpstreamRTOffsetMap());
```
Why did you make this change? Personally, I think this makes it less explicit which offset map we are tracking.
Because I feel that `reset()` is not the right word for it. It's not clearing the state or replacing the map entirely; it's updating the existing map with the values of the new map while retaining the values missing from the new map. I can rename it to `updateUpstreamOffsets()`.
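To illustrate the intended semantics, a minimal sketch (the map shape and method name are simplified stand-ins, not the actual `OffsetRecord` code):

```java
import java.util.HashMap;
import java.util.Map;

public class UpstreamOffsetMergeSketch {
  // Merge semantics: entries in the latest map overwrite matching keys,
  // while keys absent from the latest map retain their existing values.
  static void mergeUpstreamOffsets(Map<String, Long> current, Map<String, Long> latest) {
    current.putAll(latest);
  }

  public static void main(String[] args) {
    Map<String, Long> current = new HashMap<>();
    current.put("kafka-url-A", 100L);
    current.put("kafka-url-B", 200L);

    Map<String, Long> latest = new HashMap<>();
    latest.put("kafka-url-A", 150L); // no entry for B in the latest map

    mergeUpstreamOffsets(current, latest);
    // A is updated to 150, B is retained at 200 — nothing gets "reset".
    System.out.println(current);
  }
}
```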
```diff
@@ -21,7 +21,8 @@
  */
 public enum MessageType implements VeniceEnumValue {
   PUT(0, Constants.PUT_KEY_HEADER_BYTE), DELETE(1, Constants.PUT_KEY_HEADER_BYTE),
-  CONTROL_MESSAGE(2, Constants.CONTROL_MESSAGE_KEY_HEADER_BYTE), UPDATE(3, Constants.UPDATE_KEY_HEADER_BYTE);
+  CONTROL_MESSAGE(2, Constants.CONTROL_MESSAGE_KEY_HEADER_BYTE), UPDATE(3, Constants.UPDATE_KEY_HEADER_BYTE),
+  GLOBAL_RT_DIV(4, Constants.GLOBAL_RT_DIV_KEY_HEADER_BYTE);
```
I wonder if it is OK to not add it into the KME. I am thinking, for example, of Da Vinci and CDC: if their code is not upgraded, how can they handle the Global RT DIV? Or should we completely mute them from processing RT DIV messages?
Will this be controlled by store config or just server config?
In my opinion, DaVinci and CDC don't need to process the RT DIV messages. Only the next leader has to read and process them, right? They can simply ignore them.
It's controlled by server config.
I mean, with this code change, we will eventually enable the feature to produce RT DIV snapshots to the VT. These will be processed by followers, which also applies to DVC. I know DVC won't be a leader, but will the existing code handle an unknown new message type gracefully?
You're right. The reader of the msg type (to be able to ignore it) has to be deployed before enabling this feature. I don't know of another way to work around it; feel free to suggest alternatives.
I didn't look very deep, but I think that the existing code will throw an exception when encountering an unknown message type.
Handling an unknown message type will be implemented in a future PR. Since we control the feature flag, we can choose not to enable the feature flag until all server instances support the new message type so there won't be any messages with this message type floating around.
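To make the failure mode concrete, a minimal sketch of a strict valueOf-style lookup (names and values are illustrative, not Venice's actual `MessageType` implementation): an un-upgraded consumer that only knows the first four constants would throw on the new value.

```java
// Illustrative only: a strict lookup like this is why un-upgraded consumers
// cannot tolerate a new message type value such as GLOBAL_RT_DIV(4).
enum OldMessageType {
  PUT(0), DELETE(1), CONTROL_MESSAGE(2), UPDATE(3);

  private final int value;

  OldMessageType(int value) {
    this.value = value;
  }

  static OldMessageType valueOf(int value) {
    for (OldMessageType type : values()) {
      if (type.value == value) {
        return type;
      }
    }
    // The failure mode under discussion: unknown values are fatal today.
    throw new IllegalArgumentException("Invalid message type: " + value);
  }
}
```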
We need to deploy the changes to all VT consumers before enabling this feature, which includes:
- Server.
- DaVinci.
- ETL.
- CDC.
And I think it is better to treat the new KME type as a `ControlMessage`; if we can't do that, I would like to learn why.
I would like to call out another item we didn't discuss in the review meeting:
- Cleanup of the Global RT DIV messages from the VT.
This new type of message can be large, and with chunking support we are leaking the data chunks, since each chunk key is unique. I was thinking of a strategy that lets the follower send out a Kafka delete (or a Kafka message with an empty value) for the previous key after consuming a new Global RT DIV message; if we don't do the cleanup, the size of the Kafka topic might grow a lot depending on the sending frequency.
Maybe we don't need to implement such cleanup in the MVP, but I think we eventually need some way to clean these messages up from the version topics.
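A rough sketch of the tombstone idea using the plain Kafka producer API (the method and key tracking are hypothetical; a real implementation would presumably go through Venice's writer path):

```java
import org.apache.kafka.clients.producer.Producer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class GlobalRtDivCleanupSketch {
  // Hypothetical cleanup step: after a new Global RT DIV snapshot is consumed,
  // emit a tombstone (null value) for the previous snapshot's key so the stale
  // data can be reclaimed instead of accumulating in the topic.
  // For chunked snapshots, this would need to cover every chunk key listed in
  // the previous manifest as well, since each chunk has its own unique key.
  static void tombstonePreviousSnapshot(
      Producer<byte[], byte[]> producer,
      String versionTopic,
      int partition,
      byte[] previousSnapshotKey) {
    if (previousSnapshotKey != null) {
      producer.send(new ProducerRecord<>(versionTopic, partition, previousSnapshotKey, null));
    }
  }
}
```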
```diff
@@ -1904,6 +1906,10 @@ protected boolean shouldProcessRecord(PubSubMessage<KafkaKey, KafkaMessageEnvelo
         }
       }
     }
+    // Global RT DIV messages should be completely ignored when leader is consuming from remote version topic
+    if (record.getKey().isGlobalRtDiv()) {
```
Is it possible to have Global RT DIV messages in the remote version topic? There are a few cases for remote version topic consumption:
- NR for the Venice push job.
- Streaming Reprocessing job.
- Data recovery for a batch store.
None of these use cases should, in theory, encounter Global RT DIV messages. Today, there is no support for hybrid store data recovery; it is done via re-push. So can you explain the scenarios you have in mind?
```diff
@@ -467,6 +472,8 @@ public StoreIngestionTask(
         new IngestionNotificationDispatcher(notifiers, kafkaVersionTopic, isCurrentVersion);
     this.missingSOPCheckExecutor.execute(() -> waitForStateVersion(kafkaVersionTopic));
     this.chunkAssembler = new ChunkAssembler(storeName);
+    this.divChunkAssembler =
+        builder.getDivChunkAssembler() != null ? builder.getDivChunkAssembler() : new ChunkAssembler(storeName);
```
It seems `ChunkAssembler` is store-specific, so can we use the instance from the builder, which typically only contains the sharable objects?
```java
        put.getPutValue(),
        record.getOffset(),
        put.getSchemaId(),
        new NoopCompressor(),
```
Shall we always use a certain compression algorithm for Global RT DIV messages? We know the DIV is sometimes large, and I wonder whether we can always enable `gzip` compression to reduce the Kafka usage.
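For concreteness, gzip-compressing a serialized DIV payload needs nothing beyond the JDK; a minimal sketch, leaving open how it would be wired into `VeniceWriter` or Venice's existing compression abstractions:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class DivGzipSketch {
  // Compress a serialized Global RT DIV payload before producing it to Kafka.
  static byte[] gzip(byte[] serializedDiv) throws IOException {
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (GZIPOutputStream gzipStream = new GZIPOutputStream(buffer)) {
      gzipStream.write(serializedDiv);
    }
    return buffer.toByteArray();
  }
}
```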
```java
        put.getSchemaId(),
        new NoopCompressor(),
        (valueBytes) -> GLOBAL_RT_DIV_STATE.getSerializer()
            .deserialize(ByteUtils.extractByteArray(valueBytes), GLOBAL_RT_DIV_STATE_SCHEMA_ID));
```
Shall we update `InternalAvroSpecificSerializer` to support a `ByteBuffer` as the input?
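Presumably the goal would be to avoid the copy in `ByteUtils.extractByteArray` where possible. A hedged sketch of a `ByteBuffer`-aware helper (hypothetical; not the actual `InternalAvroSpecificSerializer` API):

```java
import java.nio.ByteBuffer;

public class ByteBufferInputSketch {
  // Hypothetical helper: use the buffer's backing array directly when it maps
  // 1:1 onto the payload, and fall back to a copy for direct or sliced buffers.
  static byte[] toBytes(ByteBuffer buffer) {
    if (buffer.hasArray() && buffer.arrayOffset() == 0
        && buffer.remaining() == buffer.array().length) {
      return buffer.array(); // zero-copy fast path
    }
    byte[] copy = new byte[buffer.remaining()];
    buffer.duplicate().get(copy); // duplicate() leaves the caller's position intact
    return copy;
  }
}
```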
```java
    KafkaMessageEnvelope value = record.getValue();
    Put put = (Put) value.getPayloadUnion();

    Object assembledObject = divChunkAssembler.bufferAndAssembleRecord(
```
I saw this function using the in-memory storage engine as temporary storage, which will increase memory usage. That differs from the regular data chunk handling, which minimizes memory usage, and I think minimizing memory usage is also what we agreed on during the design review.
cc @lluwm
```diff
@@ -3786,7 +3827,7 @@ private int processKafkaDataMessage(
   private void waitReadyToProcessRecord(PubSubMessage<KafkaKey, KafkaMessageEnvelope, Long> record)
       throws InterruptedException {
     KafkaMessageEnvelope kafkaValue = record.getValue();
-    if (record.getKey().isControlMessage() || kafkaValue == null) {
+    if (record.getKey().isControlMessage() || record.getKey().isGlobalRtDiv() || kafkaValue == null) {
```
Can `Global RT DIV` be one type of `ControlMessage`?
```diff
@@ -336,5 +338,13 @@ public Builder setAAWCWorkLoadProcessingThreadPool(ExecutorService executorServi
     public ExecutorService getAAWCWorkLoadProcessingThreadPool() {
       return this.aaWCWorkLoadProcessingThreadPool;
     }

+    public Builder setDivChunkAssembler(ChunkAssembler divChunkAssembler) {
```
`ChunkAssembler` has a store name field, so in theory it is not sharable among store ingestion tasks for different stores. Can you explain your thinking here?
"type": "map", | ||
"doc": "A map that maps producer GUID -> producer state for realtime data.", | ||
"values": { | ||
"name": "ProducerPartitionState", |
Is it possible to reference the existing definition in `PartitionState.avsc`?
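For reference, Avro allows a schema to refer to an already-defined named record by its full name instead of redefining it, as long as the parser has seen that definition first. A hedged sketch of what the field could look like (the field name and namespace are assumptions based on `PartitionState.avsc`):

```json
{
  "name": "producerStates",
  "doc": "A map that maps producer GUID -> producer state for realtime data.",
  "type": {
    "type": "map",
    "values": "com.linkedin.venice.kafka.protocol.state.ProducerPartitionState"
  }
}
```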
Summary
Continuation from #1257.
This PR mainly focuses on adding chunking support for DIV messages when they are produced to Kafka topics, as the size of a DIV message can surpass the ~1MB Kafka message limit. The existing chunking mechanism is reused, including the `CHUNK` and `CHUNKED_VALUE_MANIFEST` values in the message's `schemaId`.

Every DIV message has `GLOBAL_RT_DIV` for the header byte in its `KafkaKey`. The corresponding `KafkaMessageEnvelope` has a `Put` payload utilizing the `putValue` field containing the `GlobalRtDiv` data, with the following `schemaId`:
- Regular (non-chunked) message: `schemaId` is set to the current protocol version of `GLOBAL_RT_DIV`.
- Chunk: `schemaId` is set to `CHUNK`.
- Manifest: `schemaId` is set to `CHUNKED_VALUE_MANIFEST`. The `schemaId` of the `ChunkedValueManifest` will be the current protocol version of `GLOBAL_RT_DIV`.
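A hedged sketch of the consumer-side dispatch this implies (the sentinel constants below are illustrative placeholders, not Venice's actual values):

```java
public class DivSchemaIdDispatchSketch {
  // Placeholder sentinels; Venice defines the real values in its protocol definitions.
  static final int CHUNK_SCHEMA_ID = -10;
  static final int CHUNKED_VALUE_MANIFEST_SCHEMA_ID = -20;
  static final int GLOBAL_RT_DIV_CURRENT_PROTOCOL_VERSION = 1;

  enum DivPayloadKind {
    FULL_VALUE, CHUNK, MANIFEST, UNKNOWN
  }

  // Classify a Put payload carrying Global RT DIV data by its schemaId.
  static DivPayloadKind classify(int schemaId) {
    if (schemaId == CHUNK_SCHEMA_ID) {
      return DivPayloadKind.CHUNK; // one slice of a large DIV snapshot
    }
    if (schemaId == CHUNKED_VALUE_MANIFEST_SCHEMA_ID) {
      return DivPayloadKind.MANIFEST; // lists the chunk keys to reassemble
    }
    if (schemaId == GLOBAL_RT_DIV_CURRENT_PROTOCOL_VERSION) {
      return DivPayloadKind.FULL_VALUE; // fits in a single Kafka message
    }
    return DivPayloadKind.UNKNOWN;
  }
}
```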
`ChunkAssembler` is adapted, on the receiver side, to buffer, assemble, and deserialize DIV messages (chunked and non-chunked).

Changes
- Added a new `MessageType` called `GlobalRtDiv`, which reuses the `Put` message type format and objects. When the Venice server encounters a message whose `KafkaKey` contains the `GlobalRtDiv` header byte, it will know to process this message differently from a regular `Put`.
- The `GlobalRtDiv` message type is distinguished only by the header byte in the `KafkaKey`; otherwise, it's identical to a regular `Put`.
- `KafkaMessageEnvelope.avsc` will not be updated, to avoid the unnecessary risk of incompatible Avro formats when upgrading the cluster.
- The tradeoff of leaving the `GlobalRtDiv` message type out of the KME is that the `GlobalRtDiv` objects will be processed as user records and stored in the storage engine, which seems to be much less scary than a cluster upgrade issue.
- Changed `bufferAndAssembleRecord()` in `ChunkAssembler` to use a deserialization `Function` instead of a `RecordDeserializer`, because `RecordDeserializer` is on a separate inheritance path from `InternalAvroSpecificSerializer`.
- `GlobalRtDiv` messages should not be processed if they originate from the remote VT or the RT, because those are invalid scenarios; these two conditions are checked.

Minor Changes
- Renamed `resetUpstreamOffsetMap()` to `mergeUpstreamOffsets()` in `OffsetRecord`.
- Fixed `toString()` in `KafkaKey`, which incorrectly assumed all messages would be `ControlMessage`, `Put`, or `Delete`. This misses `Update` messages and the new `GlobalRtDiv` message that is being added.
- Added helper functions (`buildPutPayload()` and `buildManifestPayload()`) in `VeniceWriter` for creating the `Put` payloads, including when chunking is involved.

Testing
- `testGlobalRtDivChunking()` in `VeniceWriterUnitTest`
- `testShouldProcessRecordForGlobalRtDivMessage()` in `StoreIngestionTaskTest`
- `testProcessGlobalRtDivMessage()` in `StoreIngestionTaskTest`
- `testChunkedDiv()` in `TestGlobalRtDiv`
Does this PR introduce any user-facing changes?