Add out of space failpoint to robustness #18811

Open
serathius opened this issue Oct 30, 2024 · 2 comments

@serathius
Member

What would you like to be added?

Add a failpoint that simulates etcd running out of disk space.

Why is this needed?

Detect issues like #18810

@RostakaGmfun

Hey @serathius, would that be a good first issue? If so, which approach would be preferred:

  1. Add gofail annotations to I/O-related code in storage and implement a failpoint that injects errors there.
  2. Limit the size of the test cluster nodes' data directories and execute a large enough put operation.
  3. Same as 2, but write an arbitrary file to the data directory without touching the etcd API.

It looks like the first approach allows more fine-grained control and can be arbitrarily timed, but risks missing certain I/O calls. It also won't cover error handling inside bbolt.

@serathius
Member Author

@RostakaGmfun thanks for your interest. The fact that you already described multiple solutions with different trade-offs shows that it's not very newcomer friendly, but it also shows that you should be able to tackle it.

I think we could start with the first option you proposed and iterate if needed. What do you think?

More thoughts about each option:

Ad 1: Exactly as you described; it should still be good for a first iteration. The risk is that our implementation will not match the real-world scenario, but we don't need to be perfect as long as we show we can reproduce #18810.

Ad 2: I would skip it in favor of option 3. I would recommend avoiding a dependency on large writes; we already have issues with flakes due to the limited performance of robustness tests. The first issue we found in etcd v3.5 and wanted to reproduce in robustness tests required >1000 QPS. Now we are sometimes not able to hit 100.

Ad 3: As far as I know, setting up a volume mount with limited disk space requires root permissions, e.g. sudo mount -t tmpfs -o size=64M tmpfs datadir (at least that's how I reproduced the issue, by Googling answers). I would prefer to first consider solutions that avoid that. Robustness tests are already complicated enough to onboard to; making their setup more complicated is the last thing we should go for. Maybe you could figure out a better way to do that?

One more option is to use a FUSE filesystem like https://github.com/dsrhaslab/lazyfs to inject out-of-disk-space errors. We already have an integration with lazyfs; however, it is resource intensive, which limits our throughput. If we could improve performance, that would be my preferred long-term solution: use a ready-made tool to inject arbitrary disk errors.
