Fix data columns not persisting for PeerDAS due to a getBlobs race condition #6756

Open
wants to merge 3 commits into base: unstable

Conversation

@jimmygchen (Member) commented Jan 6, 2025

Issue Addressed

This PR fixes a race condition with getBlobs in PeerDAS that causes data columns not to be persisted on block import.

This bug was discovered on a local devnet. The sequence of events is as follows (a minimal sketch of the race follows the list):

  1. A new block is received on gossip. Fetching engine blobs is triggered before block verification completes (an optimisation from Fetch blobs from EL prior to block verification #6600).
  2. Blobs are received from the EL, the node starts computing data columns, and a receiver is passed to process_availability.
  3. If this happens before the block is verified, the receiver gets dropped.
  4. After the block is verified, we import the block; however, at this point we've lost the receiver, and the data columns are not persisted. We're hitting this code branch:
    // No data columns present and compute data columns task was not spawned.
    // Could either be no blobs in the block or before PeerDAS activation.
    (None, None) => None,
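
For illustration, here is a minimal, self-contained sketch of the race, assuming the column computation hands back a tokio oneshot receiver (the names below are illustrative, not Lighthouse internals):

use tokio::sync::oneshot;

#[tokio::main]
async fn main() {
    // Step 2: blobs arrive from the EL and data column computation starts,
    // sending the result over a one-shot channel.
    let (column_tx, column_rx) = oneshot::channel::<Vec<u8>>();
    tokio::spawn(async move {
        let computed_columns = vec![0u8; 128]; // stand-in for real data columns
        // `send` fails silently if the receiver has already been dropped.
        let _ = column_tx.send(computed_columns);
    });

    // Step 3: the block is not yet verified, so nothing holds on to the
    // receiver and it is dropped here.
    drop(column_rx);

    // Step 4: block import runs later with no receiver left, which is the
    // `(None, None) => None` branch quoted above - no columns are persisted.
}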

The getBlobSidecars endpoint shows that both the supernode and the fullnodes are missing columns:

# fullnode
$ curl http://127.0.0.1:64385/eth/v1/beacon/blob_sidecars/148
{"code":500,"message":"INTERNAL_SERVER_ERROR: Insufficient data columns to reconstruct blobs: required 64, but only 0 were found.","stacktraces":[]}

# fullnode
$ curl http://127.0.0.1:64505/eth/v1/beacon/blob_sidecars/148
{"code":500,"message":"INTERNAL_SERVER_ERROR: Insufficient data columns to reconstruct blobs: required 64, but only 8 were found.","stacktraces":[]}

# supernode
$ curl http://127.0.0.1:64341/eth/v1/beacon/blob_sidecars/148
{"code":500,"message":"INTERNAL_SERVER_ERROR: Insufficient data columns to reconstruct blobs: required 64, but only 0 were found.","stacktraces":[]}

Proposed Changes

After some discussion with @michaelsproul, we think the optimisation #6600 is worth keeping, as it enables data columns to be computed and propagated to the network sooner. The proposed solution is to store the data column Receiver in the DataAvailabilityChecker (a new field in BlockImportData), so it is always available on block import. A rough sketch of the idea follows.
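
This is only a hedged sketch of the shape, assuming a tokio oneshot receiver; apart from data_column_recv, the field and method names are illustrative rather than the actual Lighthouse definitions:

use tokio::sync::oneshot;

/// Stand-in for the data columns computed from engine blobs.
type DataColumns = Vec<Vec<u8>>;

struct PendingComponents {
    /// Receiver for the data column computation started when getBlobs
    /// returned. Stored here so it is still around at block import, even
    /// if block verification finishes after the computation was kicked off.
    data_column_recv: Option<oneshot::Receiver<DataColumns>>,
    // ... other cached components elided ...
}

impl PendingComponents {
    /// At block import, take the single-use receiver and await the
    /// computed columns so they can be persisted with the block.
    async fn columns_for_import(&mut self) -> Option<DataColumns> {
        let recv = self.data_column_recv.take()?;
        recv.await.ok()
    }
}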

TODO

  • Add some tests to cover the race condition if possible.

@jimmygchen added the das (Data Availability Sampling) label Jan 6, 2025
@jimmygchen added the ready-for-review (The code is ready for review) label Jan 6, 2025
@@ -602,7 +631,7 @@ impl<T: BeaconChainTypes> DataAvailabilityCheckerInner<T> {

// Check if we have all components and entire set is consistent.
if pending_components.is_available(self.sampling_column_count, log) {
- write_lock.put(block_root, pending_components.clone());
+ write_lock.put(block_root, pending_components.clone_without_column_recv());
Member commented:

These clones where we delete the receiver are OK because we should be using the receiver the first time it is returned inside an available block, right?

@jimmygchen (Member Author) replied:

Yes exactly - I added some docs to the function, but I still don't think it's immediately obvious:

/// Clones the `PendingComponent` without cloning `data_column_recv`, as `Receiver` is not cloneable.
/// This should only be used when the receiver is no longer needed.
pub fn clone_without_column_recv(&self) -> Self {

Added some more comments above this line in 4173135:

// We keep the pending components in the availability cache during block import (#5845).
// `data_column_recv` is returned as part of the available block and is no longer needed here.
write_lock.put(block_root, pending_components.clone_without_column_recv());
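
For context, a plausible body for the method quoted above might look like the following; the struct and its fields here are illustrative stand-ins, not the actual PendingComponents definition:

use tokio::sync::oneshot;

struct PendingComponents<B: Clone, C: Clone> {
    executed_block: Option<B>,
    verified_data_columns: Vec<C>,
    data_column_recv: Option<oneshot::Receiver<Vec<C>>>,
}

impl<B: Clone, C: Clone> PendingComponents<B, C> {
    /// Clones everything except `data_column_recv`, which is not cloneable.
    /// The copy kept in the availability cache therefore carries `None`;
    /// the receiver itself is handed out as part of the available block.
    pub fn clone_without_column_recv(&self) -> Self {
        Self {
            executed_block: self.executed_block.clone(),
            verified_data_columns: self.verified_data_columns.clone(),
            data_column_recv: None,
        }
    }
}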

// If `data_column_recv` is `Some`, it means we have all the blobs from engine, and have
// started computing data columns. We store the receiver in `PendingComponents` for
// later use when importing the block.
pending_components.data_column_recv = data_column_recv;
Collaborator commented:

If data_column_recv is already Some, it means double reconstruction is happening. Should we worry about that case or do something about it?

@jimmygchen (Member Author) replied:

Yes, I chatted with Michael about this - currently it's possible for any of these three to happen at the same time:

  • gossip block triggering EL getBlobs and column computation
  • rpc block triggering EL getBlobs and column computation
  • gossip/rpc column triggering column reconstruction (only one computation at a time)

I initially wanted to address these all together, but it got a bit messy, so I decided to leave that issue to a separate PR and keep this one small, as I wanted to get this PR merged first and start interop testing.

@jimmygchen (Member Author) followed up:

Created issue #6764

@dapplion (Collaborator) left a comment:

Looks good to me, just a minor comment. Thanks for the docs throughout
