Writing to fallback storage fails without returning an error #173
Comments
Interesting question. You're basically questioning the semantics of a "transaction" on the proxy, which involves potentially many writes:
The semantics of a put to the proxy are currently:
You'd prefer it to be:
I think I would like this semantic:
Now regarding how to deal with secondary storage write failures, there are 2 alternatives:
Do you have any preference here? What is your intended use case for this feature?
@adam-xu-mantle client retries on 201/206 would not be ideal right now because we don't have idempotent requests, so you would be resending the same blob to eigenDA a second time, and paying for it, just to fill the fallback cache. We are planning a small rearchitecture to prevent double writing of the same blob, but we're not sure on the timeline at the moment.
@samlaf Could this be made configurable? That way, users who are more concerned about data availability can retry after a 201/206 response.
@adam-xu-mantle you mean making the choice between returning 201/206 and returning 200/500 configurable? I'd personally prefer sticking to one protocol. I'm fine with returning 201/206 and letting the client decide what to do with it. @epociask thoughts?
I like this idea but worry that we may break the OP Alt-DA client<-->server spec by returning 201 instead of 200: https://github.com/ethereum-optimism/optimism/blob/v1.7.6/op-plasma/daclient.go#L99-L101. We could add configuration, but more flags mean a larger namespace and higher cognitive complexity. @adam-xu-mantle @samlaf what if we instead just added observability via metrics around secondary storage interactions, rather than client-side logging?
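For context on the metrics suggestion, a failed secondary write could be counted with a Prometheus counter and alerted on. The sketch below is only illustrative, with hypothetical names (the `eigenda_proxy` namespace, `secondary_write_failures_total`, the `backend` label); it is not the proxy's actual metrics code.

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical counter tracking failed writes to secondary (fallback/cache) backends.
// Names and labels are illustrative, not the actual eigenda-proxy metrics.
var secondaryWriteFailures = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Namespace: "eigenda_proxy",
		Name:      "secondary_write_failures_total",
		Help:      "Number of failed writes to secondary storage backends.",
	},
	[]string{"backend"},
)

// RecordSecondaryWriteFailure would be called wherever a redundant write fails,
// so failures become visible on /metrics and can be alerted on.
func RecordSecondaryWriteFailure(backend string) {
	secondaryWriteFailures.WithLabelValues(backend).Inc()
}
```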
@epociask If a secondary store failure is only observed via metrics, it's still difficult to repair the secondary store's data.
@adam-xu-mantle thinking through this some more, to see whether there isn't another solution that could satisfy you. Your main worry is the eigenDA disperser going down and then trying to retrieve a blob which wasn't written to the fallback db, right? Would #90 solve your problem?
@adam-xu-mantle why are metrics insufficient? IIUC, in the currently proposed solution you'd learn about a secondary storage failure by grep'ing op-batcher/op-node logs, which would already require some peripheral wrapping to make the error capture loud.
@epociask metrics are not meant for this purpose. They are meant to be parsed periodically, shipped to some db, and then visualized with grafana/datadog/etc. It would be an anti-pattern, I feel, to have logic depend on the output of a metrics endpoint. Plus it's super inefficient because a /metrics endpoint typically dumps all metrics as one text blob, so you'd need to parse it, etc. It just doesn't feel like the right tool here.
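To illustrate the parsing point: a Prometheus-style /metrics endpoint returns a single plain-text dump of every metric, so a client that wanted to react to secondary failures would have to scrape and parse text along these lines (metric names here are hypothetical):

```text
# HELP eigenda_proxy_secondary_write_failures_total Failed writes to secondary storage backends.
# TYPE eigenda_proxy_secondary_write_failures_total counter
eigenda_proxy_secondary_write_failures_total{backend="s3"} 3
```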
@samlaf #90 does not resolve this issue. The EigenDA operator node going down could also result in failures to obtain blobs (if an unexpected bug occurs). This highlights the secondary store's value. However, if the EigenDA backend becomes temporarily unavailable and the blobs in the secondary store are not accessible, the actual value of the secondary store is diminished.
@adam by "The EigenDA operator node going down" you seem to be saying that you are only talking to a single node, which is not the case. This might be a typo, but just in case I'll explain: eigenda works differently from celestia; blobs are encoded with RS encoding, so you only need to read a fraction of the data to be able to reconstruct the original. You would need the entire operator set to go down (or at least ~30% of the network, depending on security and encoding parameters) to not be able to retrieve from DA nodes. But I agree that for max security it's best to also make sure the fallback store is written to.
The preconditions for this failure scenario are rather unlikely:
If these conditions are expressed and a blob truly becomes unavailable, then there could certainly be adverse network effects; e.g.:
If we want to quantify it further, it could be fair to say:
so somewhere around medium-low on the 5x5 risk matrix. @samlaf why do you feel like metrics are insufficient? Is it not possible to observe when secondary insertions fail and manually replay them, analogous to web2 Kafka, where DLQs are managed and drained? This risk feels low enough that counter-measures can optionally be supported via a semi-peripheral flow. It's also worth noting that we are working on optimizing dispersal latency and have begun to decouple primary write execution from secondary write execution, meaning that capturing a secondary backend write failure via the server response likely won't be possible in the future.
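As a rough sketch of the DLQ analogy: failed secondary writes could be recorded durably and drained later by an out-of-band job. Everything below (FailedWrite, SecondaryStore, Drain) is hypothetical and only illustrates the replay pattern under those assumptions; it is not existing eigenda-proxy code.

```go
package replay

import (
	"context"
	"log"
)

// FailedWrite captures enough information to retry a secondary-storage write later.
type FailedWrite struct {
	Backend string
	Key     []byte
	Value   []byte
}

// SecondaryStore is the minimal interface a fallback/cache backend would need to expose.
type SecondaryStore interface {
	Put(ctx context.Context, key, value []byte) error
}

// Drain replays queued failures against their backends and returns whatever still fails,
// analogous to draining a dead-letter queue.
func Drain(ctx context.Context, queue []FailedWrite, backends map[string]SecondaryStore) []FailedWrite {
	var remaining []FailedWrite
	for _, fw := range queue {
		store, ok := backends[fw.Backend]
		if !ok {
			remaining = append(remaining, fw)
			continue
		}
		if err := store.Put(ctx, fw.Key, fw.Value); err != nil {
			log.Printf("replay of %s write failed: %v", fw.Backend, err)
			remaining = append(remaining, fw)
		}
	}
	return remaining
}
```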
@Ethen you can use metrics for this, but I just don't think it's the right tool. |
I think this is a very good suggestion. With synchronous writes, ensuring that every step executes correctly meets our expectations.
Problem
When eigenda-proxy puts a blob, if handleRedundantWrites fails, it only logs the error and does not return it to the caller. This can leave some blobs without an available fallback copy, and the failure goes unnoticed.
Proposed Solution
Is it possible to return the error directly to the caller when handleRedundantWrites fails? That way, the caller can ensure that all blobs have fallback storage by retrying.
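A minimal sketch of what is being asked for, assuming simplified types and signatures (the real Router and handleRedundantWrites in eigenda-proxy look different): propagate the secondary-write error instead of only logging it, so the HTTP layer can surface a distinct status that the client may choose to retry on.

```go
package store

import (
	"context"
	"fmt"
	"log/slog"
)

// Sketch only: simplified types and signatures, not the actual eigenda-proxy code.
type backend interface {
	Put(ctx context.Context, key, value []byte) error
}

type Router struct {
	primary     backend
	secondaries []backend
	log         *slog.Logger
}

// handleRedundantWrites mirrors the idea of writing to every secondary backend,
// returning an error if any write fails.
func (r *Router) handleRedundantWrites(ctx context.Context, key, value []byte) error {
	for _, s := range r.secondaries {
		if err := s.Put(ctx, key, value); err != nil {
			return err
		}
	}
	return nil
}

func (r *Router) Put(ctx context.Context, key, value []byte) error {
	// The primary write to EigenDA must succeed first.
	if err := r.primary.Put(ctx, key, value); err != nil {
		return fmt.Errorf("primary write: %w", err)
	}

	// Today a secondary failure is only logged, so the caller never learns
	// that the fallback copy is missing. The proposal is to also return it.
	if err := r.handleRedundantWrites(ctx, key, value); err != nil {
		r.log.Error("secondary write failed", "err", err)
		return fmt.Errorf("secondary write: %w", err)
	}
	return nil
}
```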