Add out of space failpoint to robustness #18811

serathius · 2024-10-30T20:38:24Z

What would you like to be added?

Add a failpoint that simulates etcd running out of disk space.

Why is this needed?

Detect issues like #18810

RostakaGmfun · 2024-11-11T05:00:29Z

Hey @serathius, would that be a good first issue? If so, which approach would be preferred:

1. Add gofail annotations to I/O-related code in storage and implement a failpoint that injects errors there.
2. Configure test cluster nodes data directory to a smaller size and execute a big enough put operation.
3. Same as 2, but write an arbitrary file to the data directory without touching etcd API.

It looks like the first approach allows more fine-grained control and can be arbitrarily timed, but it risks missing certain I/O calls. It also won't cover error handling by bbolt.

serathius · 2024-11-11T13:53:05Z

@RostakaGmfun thanks for your interest. The fact that you already described multiple solutions with different trade-offs shows that this issue is not very newcomer-friendly, but it also shows that you should be able to tackle it.

I think we could start with the first option you proposed and iterate if needed. What do you think?

More thoughts about each option:

Ad 1: Exactly as you described; it should still be good for a first iteration. The risk is that our implementation will not match the real-world scenario, but we don't need to be perfect as long as we show we can reproduce #18810.

Ad 2: I would skip it in favor of option 3. I recommend avoiding a dependency on large writes; we already have issues with flakes due to the limited performance of robustness tests. The first issue we found in etcd v3.5 and wanted to reproduce in robustness tests required >1000 QPS. Now we are sometimes not able to hit 100.

Ad 3: As far as I know, setting up a volume mount with limited disk space requires root permission, like sudo mount -t tmpfs -o size=64M tmpfs datadir (at least based on how I reproduced the issue by Googling answers). I would prefer to first consider solutions that avoid that. Robustness tests are complicated enough to onboard; making their setup more complicated is the last thing we should go for. Maybe you could figure out a better way to do that?

One more option is to use a FUSE filesystem like https://github.com/dsrhaslab/lazyfs to inject out-of-disk-space errors. We already have an integration with lazyfs; however, it is resource intensive, which limits our throughput. If we could improve its performance, that would be my preferred long-term solution: use a ready tool to inject arbitrary disk errors.

serathius added type/feature area/robustness-testing area/testing labels Oct 30, 2024
