single-node test deployment uses 100% disk space and bricks the whole deployment
Describe the bug
We have a dedicated single-node Vespa deployment for testing features and optimizations; it helps us predict how changes will scale to larger deployments.
This test deployment has more usable memory than usable disk. Can this be an issue for the resource limiter?
To Reproduce
Steps to reproduce the behavior:
1. Seed ~80M documents into Vespa.
2. Update these documents using Vespa's partial update feature to attach embeddings; there are ~3 photo embeddings per document (a sketch of such an update is shown after the schema below).
3. After some time, disk usage starts spiking and reaches 100%.
4. The machine reports that too many inodes are used.
5. The node goes down.
Here is the relevant embedding schema. We attach photo CLIP embeddings as a mixed tensor, where each label of the mapped photo_id dimension is a unique photo ID associated with that document.
field photo_embeddings type tensor<bfloat16>(photo_id{}, embedding[512]) {
    indexing: attribute | index
    attribute {
        fast-rank
        distance-metric: angular
    }
    index {
        hnsw {
            max-links-per-node: 16
            neighbors-to-explore-at-insert: 96
        }
    }
}
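For context, each embedding is attached with a tensor "add" partial update roughly like the one below. This is a minimal sketch rather than our exact feed payload: the document ID, namespace, and photo ID are illustrative, only the first few of the 512 values are shown, and the block short form for the add update assumes a recent Vespa version (see the document JSON format reference for the exact syntax).

    {
        "update": "id:items:items::12345",
        "fields": {
            "photo_embeddings": {
                "add": {
                    "blocks": [
                        {
                            "address": { "photo_id": "987654321" },
                            "values": [0.0128, -0.0342, 0.0917, 0.0051]
                        }
                    ]
                }
            }
        }
    }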
Expected behavior
- Vespa's resource limits kick in.
- Feed requests are rejected with 429 status codes.
- The Vespa content node stays reachable and responds to search requests.
Screenshots
The content_proton_resource_usage_disk_usage_total_max metric was used.
2024-08-26 08:00: we start seeding.
2024-08-26 19:00: seeding is over; embeddings start being attached.
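For reference (no screenshot attached here), this metric can also be sampled directly from the node's metrics proxy. A minimal sketch, assuming the default metrics-proxy port 19092 and running the command on the content node itself:

    # Prometheus-format metrics; disk usage appears under content_proton_resource_usage_disk_usage...
    curl -s http://localhost:19092/prometheus/v1/values | grep disk_usage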
Environment (please complete the following information):
OS: Docker
Infrastructure: self-hosted
Memory: 512G
Disk: 893G RAID1 (446G usable)
Vespa version
8.363.17
Additional context
These are the interesting logs, in the sequence they occurred:
Aug 27, 2024 @ 17:16:48.000 what(): Fatal: Writing 2097152 bytes to '/opt/vespa/var/db/vespa/search/cluster.vinted/n2/documents/items/0.ready/attribute/photo_embeddings/snapshot-230031437/photo_embeddings.dat' failed (wrote -1): No space left on device
Aug 27, 2024 @ 17:16:48.000 PC: @ 0x7faea85ef52f (unknown) raise
Aug 27, 2024 @ 17:16:48.000 terminate called after throwing an instance of 'std::runtime_error'
Aug 27, 2024 @ 17:16:48.000 *** SIGABRT received at time=1724768208 on cpu 64 ***
Aug 27, 2024 @ 17:17:04.000 Write operations are now blocked: 'diskLimitReached: { action: "add more content nodes", reason: "disk used (0.999999) > disk limit (0.9)", stats: { capacity: 475877605376, used: 475877257216, diskUsed: 0.999999, diskLimit: 0.9}}'
Aug 27, 2024 @ 17:17:21.000 Unable to get response from service 'searchnode:2193:RUNNING:vinted/search/cluster.vinted/2': Connect to http://localhost:19107 [localhost/127.0.0.1, localhost/0:0:0:0:0:0:0:1] failed: Connection refused
The spikes you are observing are almost certainly caused by flushing of in-memory data structures to disk, which requires temporary disk usage that is proportional to the memory used by that data structure (in this case presumably a large tensor attribute).
As a general rule, it is recommended to have a disk size of at least 3x that of the memory size to avoid resource constraints during flushing and compactions.
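With the hardware described above (512G memory, 446G usable disk), that rule would call for roughly 3 × 512G ≈ 1.5T of disk, so a single flush of a large tensor attribute can plausibly exhaust the 446G of usable disk long before memory becomes the constraint.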
The automatic feed blocking mechanisms are not currently clever enough to anticipate the impact that future flushes will have based on the already fed data. We should ideally look at the ratio of host memory to disk and automatically derive a reasonable default block threshold based on this—it is clear that the default limits are not appropriate for high memory + low disk setups.
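Until such automatic derivation exists, the feed-block thresholds can be lowered manually per content cluster in services.xml. A sketch, using the cluster id from your logs; the exact values are illustrative, not a tuned recommendation:

    <content id="vinted" version="1.0">
        <tuning>
            <resource-limits>
                <!-- block external feed well before a large attribute flush can fill the disk -->
                <disk>0.6</disk>
                <memory>0.8</memory>
            </resource-limits>
        </tuning>
        <!-- redundancy, documents, nodes, ... -->
    </content>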
I have to admit that our test hardware is quite unusual, but we have to work with what we have. It's good that we discovered this under these circumstances.
> As a general rule, it is recommended to have a disk size of at least 3x that of the memory size to avoid resource constraints during flushing and compactions.
We'll keep that in mind.
We have now reduced our test dataset size and are glad to know what caused the problem. Should the issue be left open? It does seem like a bug, but for a rare edge case that is unlikely to occur, so it is not of huge importance.