[BUG] 2.9.0 Restoring snapshot with remote_snapshot fails with exception from functioning S3 repository #9125
Comments
Hi @ict-one-nl, I'm blocked from downloading that video with this error:
Taking a wild guess since I can't see the error, but the parameter value you need to pass to the restore API is
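For context, the OpenSearch restore API accepts a `storage_type` field in the request body, and a searchable-snapshot restore sets it to `remote_snapshot` (the value named in this issue's title). The sketch below only assembles such a request without sending it; the repository, snapshot, and index names are hypothetical placeholders, not values from this report.

```python
import json
from urllib.parse import quote


def build_restore_request(repository, snapshot, indices):
    """Return (method, path, body) for a searchable-snapshot restore."""
    path = (
        f"/_snapshot/{quote(repository, safe='')}"
        f"/{quote(snapshot, safe='')}/_restore"
    )
    body = {
        "indices": indices,
        # Keep the data in the repository and read it on demand,
        # instead of copying the whole index to local disk.
        "storage_type": "remote_snapshot",
    }
    return "POST", path, json.dumps(body)


# Hypothetical names, for illustration only.
method, path, body = build_restore_request("my-s3-repo", "snapshot-1", "logs-*")
print(method, path)
print(body)
```

The printed method, path, and body can be pasted into any HTTP client pointed at the cluster.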
Hi, I'm using Cloudflare to block traffic from outside the EU, but that's not viable here, so I removed the WAF rule. You should be able to download the video now. Your comment is valid, but I made a typo in the report (not in the API calls) :)
Side effect: the index was marked red, which turned the cluster state red, which stopped data ingest entirely. I removed the searchable snapshot just now: the cluster is green and everything is okay again.
Okay, got one step further, at node boot:
Set this to 5000mb and the error disappears. I find this a bit cumbersome:
Yet, still:
@ict-one-nl Thank you. Definitely a bug here. The long overflow is coming from this line. Given that file sizes approaching
Alternatively, or in addition, is there any chance you could test this with OpenSearch 2.8? There were some changes in this code related to the upgrade to AWS SDK v2 that could be related, but those were only made in version 2.9.
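As background on the "long overflow" mentioned above: in Java, plain `long` addition wraps around silently, while `Math.addExact` throws an `ArithmeticException` with the message "long overflow". The sketch below emulates both behaviours in Python. It is a generic illustration only; which calculation overflowed in OpenSearch is not shown in this thread, and the helper names are made up for the example.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def add_exact(a, b):
    """Emulate Java's Math.addExact(long, long): raise on overflow."""
    result = a + b
    if result < INT64_MIN or result > INT64_MAX:
        raise OverflowError("long overflow")
    return result


def add_wrapping(a, b):
    """Emulate Java's plain `a + b` on longs: wrap around silently."""
    return (a + b - INT64_MIN) % 2**64 + INT64_MIN


# Adding one to the largest long wraps to the smallest long.
print(add_wrapping(INT64_MAX, 1) == INT64_MIN)

try:
    add_exact(INT64_MAX, 1)
except OverflowError as exc:
    print("checked add failed:", exc)
```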
Thanks, sure, I'll try to provide some more details. Give me a day or so. What do you think about the cache size setting? Should that be a separate issue?
Okay, I do have a debug trace of trying to restore a remote snapshot on 2.9.0. Strangely, it did seem to work once on the cluster I upgraded from 2.8.0 to 2.9.0, restoring a remote snapshot taken on 2.8.0 in the 2.9.0 cluster. I have to see if I can reproduce it, but the amount of logging and the performance impact make that difficult. Unfortunately the logs are quite big: we're talking several hundred megabytes in a matter of minutes when I set logging for the whole node to debug. It also makes the cluster almost unresponsive.
I'm hesitant to share the logs publicly, as I don't want to expose infrastructure details on a public forum. I would be open to sharing them privately with the dev team. Or I might be able to share more openly if the volume is smaller and I'm able to remove details from the logs.
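One way to keep the log volume down, sketched here as an assumption rather than something tried in this thread, is to raise the level of a single logger through the cluster settings API instead of switching the whole node to debug. The logger name `org.opensearch.repositories.s3` is a guess at a relevant package, not one confirmed by the maintainers. The helper only builds the request; it does not send it.

```python
import json


def build_logger_settings(logger_name, level="DEBUG"):
    """Return (method, path, body) that sets one logger's level.

    Passing level=None produces a JSON null, which resets the
    logger to its default level.
    """
    body = {"transient": {f"logger.{logger_name}": level}}
    return "PUT", "/_cluster/settings", json.dumps(body)


# Assumed logger name, for illustration only.
enable = build_logger_settings("org.opensearch.repositories.s3")
reset = build_logger_settings("org.opensearch.repositories.s3", level=None)
print(*enable)
print(*reset)
```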
@ict-one-nl Thanks for working through this! Are you still able to reproduce the error? If so, I'd like to see the results of enabling assertions. My suspicion is that it is reaching this line, and there actually aren't any debug logging statements there. However, if you enable assertions, the JVM will crash when that assertion is hit and should print the details to stdout or stderr. Regarding the cache size and replica behavior, you are absolutely right that we need better documentation. To quickly answer your questions:
I'm consistently getting the error when using 2.9.0 for snapshot + restore. It's working consistently on 2.8.0. Strangely enough, a snapshot taken on 2.8.0 restored correctly as remote_snapshot on 2.9.0. While I was typing this comment and trying to enable assertions, we ran into a production problem on our Kubernetes cluster, so I'll come back to that later. I'm hoping that we can solve this before the 2.10 release, otherwise I'll be stuck with this error for at least two more months :) The explanation you've given in this comment would be very useful in the docs, even if it's just a copy/paste from this ticket :) This info is missing from the docs at this time.
Update: you are right about the default setting, by the way. I triggered the error when I added the ingest role to the search nodes for a test.
Okay, I need more time to get consistent scenarios. I do know 2.8.0 works like a charm, but I can't pinpoint the problem in 2.9.0. I'm quite sure something is amiss, though. At this time we're not sure if it's:
Unless I'm mistaken, there is no functional or technical need to adopt the new SDK; it's more a matter of preventing technical debt. This is a valid concern, but not really a time-sensitive one. Would it be an idea to:
This would buy us some time and prevent a potentially confusing point release for users (Do I need to upgrade? Why is there a point release? Is there a security issue?). If there is an interoperability problem between the S3 appliance and the new SDK, it's likely more people will report it to the SDK devs so fixes can be made. In the meantime I'm more than willing to test further and help with debugging. I could look into exposing our S3 to the OpenSearch devs (but I need to talk to our storage team about that: security policies and such). That would make debugging the issue much easier, I suppose.
The SDK was upgraded as a prerequisite for performance improvements in repository-s3, using features only available in the newer version. Reverting the change will be a lot of work, so let's only do it if we're 100% confident we can't fix a problem while moving forward? I suggest we debug the problem first. Btw, there's another bug report that's potentially a side effect of the upgrade, #9265. @raghuvanshraj any ideas?
Ah, I didn't know that, thank you. What can I do to help? Are there new requirements for the S3 endpoint? Can I do some low-level testing with a command-line client based on the new SDK to rule out OpenSearch itself and pinpoint the problem?
I think @andrross correctly called out where we have a problem (#9125 (comment)), but the only helpful thing to do would be to continue to narrow this down to the smallest repro possible. Short of that, if we can't have an independent repro, and since you can reproduce fairly consistently, are you able to build from source (see the developer guide)? I would check out the 2.9 branch and add more debug logging around that error to find out what caused it.
Will try. I tried running the Docker image with the -ea option in the OPENSEARCH_JAVA_OPTS env var, but I didn't hit it. I'm not sure the -ea flag was picked up properly, though. Running OpenSearch on my Mac is somewhat troublesome, so I need to revive my Windows laptop. Will try to look into it asap.
Update: assertions did work, but I did not hit them. I created a totally fresh cluster and tested again; the result is a different error than in the first post. I made a full capture of the process, including logging from all nodes. I contacted Andrew on Slack to discuss whether I can share the capture with the dev team under TLP:AMBER.
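The approach described above (passing `-ea` through the `OPENSEARCH_JAVA_OPTS` environment variable of the Docker container) can be sketched as a small command builder. The image tag `opensearchproject/opensearch:2.9.0` is an assumption based on the "default OpenSearch 2.9.0 docker image" mentioned in the report, and a real invocation would need further options (discovery settings, ports, plugin installation) that are omitted here.

```python
import shlex


def docker_run_with_assertions(image, extra_java_opts=""):
    """Build a `docker run` argv that enables JVM assertions via -ea."""
    java_opts = " ".join(part for part in ("-ea", extra_java_opts) if part)
    return [
        "docker", "run", "--rm",
        # The container passes this variable's contents on to the JVM.
        "-e", f"OPENSEARCH_JAVA_OPTS={java_opts}",
        image,
    ]


# Assumed image tag, for illustration only.
argv = docker_run_with_assertions("opensearchproject/opensearch:2.9.0")
print(shlex.join(argv))
```

Building the argument list rather than a single string avoids quoting mistakes when extra JVM options contain spaces.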
With Andrew talking me through it and giving me some pointers, and with the help of our storage team, I think I finally nailed down the problem... A hard one to diagnose, as the exceptions OpenSearch threw didn't set me on the right path. The culprit seems to be this setting on the S3 endpoint:
Correct use:
aws s3api create-bucket --bucket BUCKETNAME --object-lock-enabled-for-bucket
We had an inconsistency between different buckets, and this was the one difference we found. I enabled it and now it's working on the cluster that was consistently malfunctioning. Many thanks to our storage team and Andrew for their patience, ghehe. I suppose the docs could be improved to say that this is required for searchable snapshots to function. Exception handling could also be improved so it doesn't throw these odd exceptions. What do you think?
Edit: test scenario. Prerequisite is that the data snapshot repo exists:
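Returning to the bucket setting identified above: the quoted `create-bucket` command, plus a follow-up check of an existing bucket with `get-object-lock-configuration`, can be assembled as below. This is a sketch that only builds the argument lists. The `--endpoint-url` option is added on the assumption that an S3 appliance needs a custom endpoint, and the example URL is a placeholder.

```python
import shlex


def _with_endpoint(argv, endpoint_url):
    """Append --endpoint-url when a non-AWS endpoint is targeted."""
    if endpoint_url:
        return argv + ["--endpoint-url", endpoint_url]
    return argv


def create_bucket_argv(bucket, endpoint_url=None):
    """argv for creating a bucket with object lock enabled."""
    argv = ["aws", "s3api", "create-bucket", "--bucket", bucket,
            "--object-lock-enabled-for-bucket"]
    return _with_endpoint(argv, endpoint_url)


def get_object_lock_argv(bucket, endpoint_url=None):
    """argv for inspecting an existing bucket's object lock configuration."""
    argv = ["aws", "s3api", "get-object-lock-configuration", "--bucket", bucket]
    return _with_endpoint(argv, endpoint_url)


# Placeholder endpoint, for illustration only.
print(shlex.join(create_bucket_argv("BUCKETNAME")))
print(shlex.join(get_object_lock_argv("BUCKETNAME", "https://s3.example.internal")))
```

Running the second command against each bucket is one way to spot the kind of inconsistency between buckets described in this comment.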
This has been fixed. The 2.9.1 patch release did not happen, but this fix will be included in the upcoming 2.10 release.
@andrross, @bbarani do we know the blockers or overhead in doing a patch release? Ideally all bug fixes should make it to a patch release. What happens to users who have upgraded to 2.9 (we don't support downgrades natively)? Does that mean users are supposed to stay stuck with this bug until we get to 2.10?
@Bukhtawar @bbarani There are no blockers to doing a patch release. Ultimately it was decided to move forward with the 2.10 release in lieu of a patch release. You can track the conversation here.
Good news is I've been able to confirm the fix :)
Describe the bug
I'm trying to use the searchable snapshots feature, but restoring the snapshot as remote_snapshot fails. A picture says more than 1000 words, so I made a video detailing the steps and effects:
https://download.ict-one.nl/searchable_snapshots.mp4
Config:
Exception:
To Reproduce
Steps to reproduce the behavior:
See video
Expected behavior
Snapshot restored and available for searching
Plugins
s3 repository plugin
Host/Environment (please complete the following information):
Default OpenSearch 2.9.0 Docker image with the S3 plugin enabled.
S3 storage is an S3 appliance, but since the normal restore works I don't suspect that's the issue?