Writing to fallback storage fails without returning an error #173

Open
adam-xu-mantle opened this issue Oct 8, 2024 · 15 comments
Labels
enhancement New feature or request

Comments

@adam-xu-mantle

Problem
When eigenda-proxy puts a blob and handleRedundantWrites fails, the error is only logged and is not returned to the caller. This can leave some blobs without an available fallback, and the failure goes unnoticed.

Proposed Solution
Could the error be returned directly to the caller when handleRedundantWrites fails? That way, the caller can retry and ensure that every blob has fallback storage.

// Put ... inserts a value into a storage backend based on the commitment mode
func (r *Router) Put(ctx context.Context, cm commitments.CommitmentMode, key, value []byte) ([]byte, error) {
	// ...

	if r.cacheEnabled() || r.fallbackEnabled() {
		err = r.handleRedundantWrites(ctx, commit, value)
		if err != nil {
			log.Error("Failed to write to redundant backends", "err", err)
		}
	}

	return commit, nil
}
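
For concreteness, a minimal sketch of the change being requested, reusing the Router snippet above and only propagating the error instead of swallowing it:

// Put ... inserts a value into a storage backend based on the commitment mode
func (r *Router) Put(ctx context.Context, cm commitments.CommitmentMode, key, value []byte) ([]byte, error) {
	// ...

	if r.cacheEnabled() || r.fallbackEnabled() {
		err = r.handleRedundantWrites(ctx, commit, value)
		if err != nil {
			log.Error("Failed to write to redundant backends", "err", err)
			// propagate the failure so the caller can retry instead of
			// silently ending up without a fallback copy
			return nil, err
		}
	}

	return commit, nil
}
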
@samlaf
Collaborator

samlaf commented Oct 8, 2024

Interesting question. You're basically questioning the semantics of a "transaction" on the proxy, which potentially involves many writes:

  1. main store: eigenda or memstore
  2. secondary stores: caches and/or fallbacks

The semantics of a put to the proxy are currently:

  • return success if main store write succeeds, regardless of secondary stores

You'd prefer it to be:

  • return success iff the main store AND secondary stores succeed

I think I would like this semantic:

  1. return 200 if all writes succeed
  2. return 201 if main store write succeeds, but some secondary storage write fails (include detailed info in body)
  3. return 500 if main store write fails

Now regarding how to deal with secondary storage write failures, there are 2 alternatives:

  1. client-side driven: we add a new put endpoint that writes to secondary storage only, and have the client retry on this endpoint when it sees a 201
  2. server-side driven: when a secondary storage write errors, some background process retries it until it succeeds (eventual consistency)

Do you have any preference here? What is your intended use case for this feature?
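
To make the proposed mapping concrete, here's a rough sketch of how a put handler could surface those statuses. This is illustrative only, not the proxy's actual handler: putMainStore and putSecondaryStores are hypothetical helpers standing in for the real write paths, and the only imports assumed are net/http and io.

// Illustrative only: maps write outcomes to the proposed status codes.
func handlePut(w http.ResponseWriter, r *http.Request, router *Router) {
	body, err := io.ReadAll(r.Body)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadRequest)
		return
	}

	commit, err := router.putMainStore(r.Context(), body) // hypothetical: write to eigenda/memstore only
	if err != nil {
		// 3. return 500 if the main store write fails
		http.Error(w, err.Error(), http.StatusInternalServerError)
		return
	}

	if err := router.putSecondaryStores(r.Context(), commit, body); err != nil { // hypothetical: caches + fallbacks
		// 2. return 201 if the main store write succeeds but a secondary write fails;
		// detailed info about the failed backend(s) could go in the body
		w.WriteHeader(http.StatusCreated)
		_, _ = w.Write(commit)
		return
	}

	// 1. return 200 if all writes succeed
	w.WriteHeader(http.StatusOK)
	_, _ = w.Write(commit)
}
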

@samlaf samlaf added the eigenda-proxy label Oct 8, 2024 — with Linear
@samlaf samlaf added enhancement New feature or request and removed eigenda-proxy labels Oct 8, 2024
@adam-xu-mantle
Author

Client-side driven is the preferred option. It is easier to implement, and the current OP Stack treats non-200 status codes as errors and will retry the blob PUT.
For the partial-success case, a 206 status code might be better.

@samlaf
Collaborator

samlaf commented Oct 9, 2024

@adam-xu-mantle client retries on 201/206 would not be ideal right now because we don’t have idempotent requests, so you would be resending the same blob to eigenDA a second time, and paying for it, just to fill the fallback cache.

we are planning a small re-architecture to prevent double writing of the same blob, but not sure on the timeline atm.

@adam-xu-mantle
Author

@samlaf Could this be made configurable? That way, users who are more concerned about data availability can retry on a 201/206 response.

@samlaf
Collaborator

samlaf commented Oct 10, 2024

@adam-xu-mantle you mean making the choice of returning 201/206 instead of 200/500 configurable? I personally think I’d prefer sticking to one protocol. I’m fine with returning 201/206 and letting the client decide what to do.

@epociask thoughts?

@epociask
Collaborator

I like this idea but worry that we may be breaking the OP Alt-DA client<-->server spec by returning 201 vs 200: https://github.com/ethereum-optimism/optimism/blob/v1.7.6/op-plasma/daclient.go#L99-L101

we could add configuration, but more flags means a larger namespace and higher cognitive complexity.

@adam-xu-mantle @samlaf what if we instead just added observability via metrics around secondary storage interactions, instead of relying on client-side logging?
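
For reference, this is roughly the kind of metric that would amount to, sketched with the Prometheus Go client; the metric name and label are made up for illustration:

import "github.com/prometheus/client_golang/prometheus"

// Counts failed writes to secondary (cache/fallback) backends, labeled by
// backend, so they can be alerted on from Grafana/Datadog.
var secondaryWriteFailures = prometheus.NewCounterVec(
	prometheus.CounterOpts{
		Name: "eigenda_proxy_secondary_write_failures_total", // illustrative name
		Help: "Failed writes to secondary storage backends",
	},
	[]string{"backend"},
)

func init() {
	prometheus.MustRegister(secondaryWriteFailures)
}

// then, inside handleRedundantWrites, on a per-backend failure:
//   secondaryWriteFailures.WithLabelValues(backendName).Inc()
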

@adam-xu-mantle
Author

@epociask If a secondary store failure is only observed via metrics, it's still difficult to repair the data in the secondary store.

@samlaf
Collaborator

samlaf commented Oct 12, 2024

@adam-xu-mantle thinking through this some more, to see whether there isn’t another solution that could satisfy you. Your main worry is the eigenDA disperser going down and then trying to retrieve a blob that wasn’t written to the fallback db, right?

Would #90 solve your problem?

@epociask
Collaborator

@adam-xu-mantle why are metrics insufficient? IIUC, with the currently proposed solution you'd learn about a secondary storage failure by grep'ing op-batcher/op-node logs, which would already require some peripheral wrapping to make the error capture loud.

@samlaf
Collaborator

samlaf commented Oct 13, 2024

@epociask metrics are not meant for this purpose. They are just meant to be scraped periodically, shipped to some db, and then visualized with grafana/datadog/etc. It would be an anti-pattern, I feel, to have logic depend on the output of a metrics endpoint. Plus it's super inefficient, because a /metrics endpoint typically dumps all metrics as one big string, so you'd need to parse it, etc. Just doesn't feel like the right tool here.

@adam-xu-mantle
Author

@samlaf #90 does not resolve this issue. EigenDA operator nodes going down could also result in failures to retrieve blobs (if an unexpected bug occurs), which highlights the secondary store's value. However, if the EigenDA backend becomes temporarily unavailable and the blobs in the secondary store are not accessible, the actual value of the secondary store is diminished.

Collaborator

samlaf commented Oct 15, 2024

@adam by "The EigenDA operator node going down" you seem to be saying that you are only talking to a single node, which is not the case. This might be a typo, but just in case I'll explain: eigenda works differently from celestia. Blobs are erasure-coded with RS encoding, so you only need to read a fraction of the encoded data to reconstruct the original. You would need the entire operator set to go down (or at least ~30% of the network, depending on security and encoding parameters) to be unable to retrieve from DA nodes.

But agree that for max security it's best to also make sure the fallback store is written to.

@epociask
Collaborator

The preconditions for this failure scenario are rather unlikely:

  • EigenDA disperser is unavailable AND
  • majority operator set is unavailable AND
  • secondary storage target failed to write to backend due to non-recorded error

If this scenario did manifest and a blob truly became unavailable, then there could certainly be adverse network effects, e.g.:

  • bridge safety: an honest node fails to reference pre-image contents due to data unavailability when assigned a one-step proof during a dispute claim dissection.
  • liveness: nodes syncing from confirmed or final chain views could halt due to inaccessibility of batch contents. If the final-confirmed delta keeps expanding past some max time, then the sequencer's backlog tx contents could be evicted, causing a reorg for any rollup node syncing from an unsafe sequencer head.

If we want to further quantify it, it could be fair to say:

  • likelihood=very low (1)
  • impact=very high (5)

so somewhere around medium-low on a 5 x 5 risk matrix. @samlaf why do you feel metrics are insufficient? Is it not possible to observe when secondary insertions fail and manually replay them, analogous to web2 Kafka where DLQs are managed and drained? This risk feels low enough that counter-measures can optionally be supported via a semi-peripheral flow. It's also worth noting that we are working on optimizing dispersal latency and have begun to decouple primary write execution from secondary write execution, meaning that capturing a secondary backend write failure via the server response likely won't be possible in the future.
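
For what it's worth, a rough sketch of the DLQ-style drain loop hinted at here, assuming a generic secondary Store interface with a Put(ctx, key, value) error method (the types and names are invented for illustration; imports are just context and time):

// failedWrite records a secondary write that still needs to be replayed.
type failedWrite struct {
	backend Store // hypothetical secondary-store interface with Put(ctx, key, value) error
	key     []byte
	value   []byte
}

// drainDLQ retries failed secondary writes in the background until they succeed,
// giving eventual consistency without blocking the client-facing put path.
func drainDLQ(ctx context.Context, dlq <-chan failedWrite, retryEvery time.Duration) {
	ticker := time.NewTicker(retryEvery)
	defer ticker.Stop()

	var pending []failedWrite
	for {
		select {
		case <-ctx.Done():
			return
		case fw := <-dlq:
			pending = append(pending, fw)
		case <-ticker.C:
			remaining := pending[:0]
			for _, fw := range pending {
				if err := fw.backend.Put(ctx, fw.key, fw.value); err != nil {
					remaining = append(remaining, fw) // keep it for the next pass
				}
			}
			pending = remaining
		}
	}
}
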

Collaborator

samlaf commented Oct 16, 2024

@Ethen you can use metrics for this, but I just don't think it's the right tool.

Re "have begun to decouple primary write execution from secondary write execution", what does this mean? That writing to secondary storage will be done asynchronously? Can we hold off on this, I feel like the request is valid. I think ultimately we should get rid of fallback, and each cache enabled should have a parameter to decide whether to write to it synchronously or asynchronously. Writing synchronously and failing means the request failed, and the client can then retry sending, as adam-xu is asking here.

@adam-xu-mantle
Author

I think this is a very good suggestion. With synchronous writes, ensuring that every step executes correctly meets our expectations.

each enabled cache should have a parameter to decide whether to write to it synchronously or asynchronously. Writing synchronously and failing means the request failed
